#clojure
2024-05-12
P Alexander Schofield04:05:20

Is it just me, or does it seem like almost all the filter and map functions in Clojure are hell-bent on returning a collection to which conj prepends rather than appends, regardless of whether their input collection had conj-append behavior or not? The two most critical functions here have corresponding mapv and filterv, which return a vector, so append behavior can be preserved, but that has the effect of tossing laziness out the window. Does Clojure have a lazy collection with conj-append behavior? Even better, is there one which all the various filter and map functions respect in terms of preserving conj-append behavior?

seancorfield05:05:09

This is the sequence abstraction at work. You can use transducers to perform transformations and produce whatever collection type you want

seancorfield05:05:55

map, filter, etc are explicitly defined to call seq on the collection so they operate on an abstraction, not a concrete collection type. That is by design.

phronmophobic05:05:02

The idea of laziness and appending to the end of a sequence are in tension with each other. One option is to use concat, which lets you combine two lazy sequences, but thar be dragons.

didibus06:05:00

@U06DE7HR6US Only sequences are lazy, and those functions return sequences, which mimic lists in behavior. The functions you are talking about are called the sequence functions, because they take a sequence and return a sequence, but they will also coerce the input to a sequence if given a seqable (something that can be made into a sequence). So the answer is no: there isn't anything in Clojure that is lazy and to which conj appends. But like Sean said, there are also transducers and transducing functions. Those let you choose what collection type they return. It's still "filter" and "map", but it's the transducer arity, which returns a transducer, and then you use a transducing function to apply it. Transducers are not lazy though; they can be lazily applied using sequence, but that returns a sequence again. You can compose transforms and they do loop fusion, and with eduction you can also delay when they are realized and compose more transforms on top of the result before realizing them. So the only use-cases where they're not as appropriate are ones where not everything will be needed.
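
For example, a minimal sketch of the transducer-arity + eduction approach (into/eduction are core; the exact pipeline here is just illustrative):

;; transducer arities of map/filter, poured into a vector with into,
;; so the result keeps vector (conj-append) semantics:
(into [] (comp (map inc) (filter odd?)) (range 10))
;;=> [1 3 5 7 9]

;; eduction delays realization; you can stack more transforms on top
;; and nothing runs until the result is actually reduced/consumed:
(def xs (eduction (map inc) (range 10)))
(into [] (filter odd?) xs)
;;=> [1 3 5 7 9]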

phronmophobic06:05:42

How do you append to the end of a potentially infinite lazy sequence?

phronmophobic06:05:59

There are only two options: you can either realize a lazy sequence to append to the end, which makes it no longer lazy, or you can enqueue work to do later (i.e. concat). Using concat may work in some cases, but it's an easy recipe to start enqueuing work faster than it can be consumed.
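
For illustration (the nested-concat blow-up at the end is the classic gotcha, shown with made-up sizes):

;; lazily "appending": the extra element is never forced unless you get there
(take 3 (concat (map inc (range)) [:extra]))
;;=> (1 2 3)

;; but piling up deferred concats builds a tower of thunks; forcing the
;; result can blow the stack once the tower is deep enough
(first (reduce concat (repeat 100000 [1])))
;; likely throws StackOverflowError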

👍 1
didibus07:05:20

Maybe I'm tired, but are you sure? The realized elements will take up memory; that's true even now if you retain the head. But realizing the next element can still be lazy, so why does it matter where it then gets cached? conj cannot be used lazily, even when prepending.

didibus07:05:38

So I mean, couldn't the elements, as they are lazily realized, just get put into a vector for example? And if you conj, it would realize it all and then append.

didibus07:05:18

And ya, in that way cons would enqueue the "next" element which would be appended when realized

P Alexander Schofield19:05:19

Thanks @U04V70XH6 and @U7RJTCH6J for your replies. To summarize my understanding: no, there is no lazy conj-append Clojure data structure, and it's like that because that's the way it is, not because of any inherent algorithmic reason.

seancorfield20:05:26

conj has a specific meaning: for a list or lazy sequence it's prepend, and for a vector it's append. Lazy append is concat. I'd classify that as "algorithm" -- or at least "deliberate semantics".

seancorfield20:05:08

(concat my-lazy-seq [my-new-item]) will lazily append my-new-item to the end of my-lazy-seq as it is realized (consumed).

seancorfield20:05:01

conj is a collection function -- its behavior depends on the type of the collection and the collection argument comes first. cons is a sequence function -- its behavior is always the same (prepend to a sequence) and the sequence argument comes last.
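
In REPL terms:

(conj '(1 2 3) 0)  ;=> (0 1 2 3)  ; list: prepend
(conj [1 2 3] 0)   ;=> [1 2 3 0]  ; vector: append
(cons 0 '(1 2 3))  ;=> (0 1 2 3)  ; always prepend; sequence argument last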

seancorfield20:05:42

map, filter, etc. are sequence functions -- the sequence argument comes last (and they call seq on that argument).

gtbono05:05:20

Has anyone here used the Kit framework? What are your opinions? Any alternatives? I'm starting a new project, and the idea of building up a framework from scratch starting from a plain Leiningen template and adding all the pieces is exhausting to me. I want a batteries-included API template. Do you guys use something like this? Please share! Thanks 🙂

Casey09:05:41

Kit is a great batteries-included way to get going on a web project. It's made by the folks who did Luminus, which is similar but lein-based. I've used both a lot. Biff is another alternative here, though it's much more framework-y than Kit, which is more of a template than a framework.

didibus05:05:44

People that have libs: if you make a change that's not really related to the code (for example, I added clj-kondo linting support), would you bump the minor version, or can you just publish at the same version and have it override the artifact in Clojars? And is the latter bad practice even if possible?

seancorfield05:05:43

Every change is a new version. But I only release when there's a change that affects end users.

seancorfield05:05:35

You can't overwrite versions in Clojars. It's not mutable, you know! 😁

didibus06:05:13

Ah well, if I can't even do it haha.

mogverse06:05:50

We are truly in the abode of immutability

clojure-spin 3
DrLjótsson11:05:17

I want to "convert" a thread-last macro that includes only functions with a transducing arity into a transducer. No into or reduce as the last function in the thread. Would I then use sequence to get a behavior that most closely resembles the original?

p-himik13:05:42

Yes.

🙏 1
phronmophobic19:05:40

One important distinction between sequence and the typical lazy function like map and filter is that sequence will partially consume its input when created: https://ask.clojure.org/index.php/13155/sequence-partially-consumes-input.

👍 1
Mario Trost13:05:02

Having a huge XML file stored on disk, what's the most memory-efficient way to change the attribute of 1 (one) element that I can uniquely identify by its tag name? (background in 🧵, not strictly needed)

👀 1
Mario Trost13:05:46

I'm very new to XML parsing, traversing, and writing, and got myself into a pickle with a REST API I have to call with large XML payloads that sometimes include base64-encoded images. The first implementation was rather naive, so we quickly ran into Java heap space errors after some people began to use the system in earnest. Local testing with the time+™️ macro showed me this: ; time: 7.81 s alloc: 11’527’032’096b Iterations: 1. After some refactorings I have it down to ; time: 5.02 s alloc: 357’261’824b Iterations: 1. I store the XML on disk and the post body is an io/input-stream. Now sometimes the endpoint returns an error that I can parse to know which element I can change and try again. Before the refactorings I just str/replaced the heck out of it. One of my tries to change this:

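;; (assumed requires: [clojure.java.io :as io], [clojure.walk :as walk], and an XML
;;  lib aliased as xml, e.g. clojure.data.xml; storage/tmpfile is app-specific)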
(defn- replace-issue-name2 [{:keys [file-path new-issue-name]}]
  (let [new-path (storage/tmpfile)]
    (with-open [is (io/input-stream file-path)
                w (io/writer new-path)]
      (let [xml (xml/parse is :skip-whitespace true)]
        (xml/emit
         (walk/postwalk #(if (and (map? %) (= :issue (:tag %)))
                          (assoc-in % [:attrs :name] new-issue-name)
                          %)
                        xml)
         w)
        new-path))))
It's okayish: ; time: 352.29 ms alloc: 80’724’337b Iterations: 6, and allocation-wise better than slurping and str/replacing (`; time: 266.53 ms alloc: 385’816’448b Iterations: 8`). From what I understand postwalk is eager and perhaps there is no way to avoid that. But what other ways are there to do this? Perhaps something with zippers?

igrishaev13:05:38

Maybe you should take a look at Java SAX XML parsers. SAX parsers do not consume the whole document in memory. They scan the doc using a narrow window (1-2 kilobytes) and call certain methods of your class, for example tagNameStarted and tagNameEnded, etc. In your parser, you redirect the content to another stream with some corrections.

igrishaev13:05:51

Second, is it possible to simplify the documents you're working with? Say, to put Base64-encoded images into separate files and replace them with references like

<image src="static/foobar.png.base64"></image>

p-himik13:05:53

After having experimented with a few available parsers, I found that https://github.com/mudge/riveted is the most performant. Altering existing XML documents in-place is possible but might be a tad tricky. If you end up confirming that its performance/trickiness trade-off is suitable for you, let me know if you have any issues with getting stuff to work - I have some more or less ad hoc functions that might be helpful.

Alex Miller (Clojure team)13:05:17

I would probably focus on StAX (streaming) APIs, or, depending on the document shape, this might be a good use case for an XSLT transform. But also, have you considered not treating it as XML at all?
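
For the StAX route, a rough sketch (hypothetical names; assumes the target element is <issue> with a name attribute, as in the snippet above, and streams event-by-event so memory stays flat):

(ns example.stax-rewrite
  (:require [clojure.java.io :as io])
  (:import (javax.xml.stream XMLInputFactory XMLOutputFactory XMLEventFactory)
           (javax.xml.stream.events XMLEvent StartElement Attribute)))

(defn rewrite-issue-name
  "Copies in-path to out-path, replacing the name attribute of <issue>."
  [in-path out-path new-issue-name]
  (let [in-fac  (XMLInputFactory/newInstance)
        out-fac (XMLOutputFactory/newInstance)
        ev-fac  (XMLEventFactory/newInstance)]
    (with-open [in  (io/input-stream in-path)
                out (io/output-stream out-path)]
      (let [rdr (.createXMLEventReader in-fac in)
            wtr (.createXMLEventWriter out-fac out)]
        (try
          (while (.hasNext rdr)
            (let [^XMLEvent ev (.nextEvent rdr)]
              (if (and (.isStartElement ev)
                       (= "issue" (.. ev asStartElement getName getLocalPart)))
                ;; rebuild just this start element with the attribute swapped
                (let [^StartElement se (.asStartElement ev)
                      attrs (mapv (fn [^Attribute a]
                                    (if (= "name" (.. a getName getLocalPart))
                                      (.createAttribute ev-fac (.getName a) new-issue-name)
                                      a))
                                  (iterator-seq (.getAttributes se)))]
                  (.add wtr (.createStartElement ev-fac (.getName se)
                                                 (.iterator attrs)
                                                 (.getNamespaces se))))
                ;; every other event is passed through untouched
                (.add wtr ev))))
          (finally (.close wtr) (.close rdr)))))))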

Mario Trost17:05:32

Thanks for all the quick replies!
> Maybe you should take a look at Java SAX XML parsers
I already kind of suspected I might need to go directly to Java/StAX, but didn't know about SAX; reading up on it.
> Second, is it possible to simplify the documents you're working with? Say, to put Base64-encoded images into separate files and replace them with references like
In the end I have to send everything in one XML along the wire, so I think not? I kind of do that when building the initial XML: the first implementation had everything in the XML when creating it. Now I put placeholders for the images. Because this XML has line breaks I can line-seq over it, streaming/writing it to disk line by line, replacing the found placeholders with the images. But in the final XML payload the image payload has to be present at the moment (we're hoping for some API changes soon).

Mario Trost17:05:03

> After having experimented with a few available parsers, I found that https://github.com/mudge/riveted is the most performant.
First time I heard about riveted, thanks for bringing it to my attention. I will definitely have a look at it when I'm back working on this later next week. Will come back to you if I stick with it 🙏

Mario Trost17:05:38

> But also, have you considered not treating it as XML at all?
What do you have in mind exactly? Funnily enough, that function I shared is the one time I do treat it as XML and not just a string or byte stream, so I'm open to other approaches.

Alex Miller (Clojure team)19:05:44

Can you identify the location in the document via string search without parsing it at all?

Mario Trost20:05:24

Yes I can, and I was just now thinking I could shell out to sed. I did try one solution with an io/reader that I called line-seq on and then just str/replace'd every line and wrote it to disk. But that used way more memory than the postwalk I posted. Is shelling out for something like this appropriate? I'm a bit hesitant to leave the "safety" of the JVM 😅

p-himik20:05:18

It's appropriate if you handle exit code and can guarantee that sed exists on the target system, is of the required version (probably not important), and if you make sure that it's the system sed and not something some users might've installed in $PATH. But also, that riveted I mentioned uses VTD-XML and that ends up replacing bytes from idx0 to idx1 with bytes b. Pretty much what sed does.
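
If you do go the shell route, a minimal sketch (hypothetical helper; assumes GNU sed and that old/new are already safe to drop into a sed expression):

(require '[clojure.java.shell :as shell])

;; in-place edit via GNU sed, replacing the first occurrence of old with new;
;; caller must ensure old/new are escaped for sed's regex/replacement syntax
(defn sed-replace-first! [file-path old new]
  (let [{:keys [exit err]} (shell/sh "sed" "-i"
                                     (str "0,/" old "/s//" new "/")
                                     file-path)]
    (if (zero? exit)
      file-path
      (throw (ex-info "sed failed" {:exit exit :err err})))))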

🙏 1
DrLjótsson21:05:07

Hi. I've struggled with starting to use transducers, mainly because I think that the syntax is clunky compared to the ->> macro. I thought that https://github.com/johnmn3/injest was pretty cool but maybe it is a bit too magical with its auto-detection of transducible forms. I know that there are arguments against creating new "syntax macros" but I wanted to give it a try. So I created >> (https://gist.github.com/brjann/0d4026ce22ba0e60e18a68ff17e0fed5) that transforms a ->>-style call into a call using a transducer. It assumes that all threaded forms are transducible, and returns a call to (sequence xf coll). However, it has two special cases: if the last threaded form is either (into into-coll) or (reduce rf init), it will return (into into-coll xf coll) or (transduce xf (completing rf) init coll) respectively. Thoughts welcome! Examples in 🧵

👍 1
DrLjótsson21:05:07

(>> [1 2 3 4]
    (map inc)
    (filter odd?))
;; expands to
(sequence (comp 
             (map inc) 
             (filter odd?)) 
           [1 2 3 4])
;=> (3 5)

DrLjótsson21:05:14

(>> [1 2 3 4]
    (map inc)
    (filter odd?)
    (into []))
;; expands to
(into []
      (comp
       (map inc)
       (filter odd?))
      [1 2 3 4])
;=> [3 5]

DrLjótsson21:05:35

(>> [1 2 3 4]
    (map inc)
    (filter odd?)
    (reduce + 0))
;; expands to
(transduce (comp (map inc)
                 (filter odd?))
           (completing +)
           0
           [1 2 3 4])
;; 8
This reduce case was a bit tricky. As far as I understand, transduce does not handle the reducing function rf in the same way as reduce does when the init value is missing. transduce will call rf with zero arguments to get the init value, whereas reduce will call it with the first two items in coll. To avoid ambiguity, I made a final reduce in >> require an init value.

seancorfield21:05:07

I think that's a reasonable decision. Rich has said that if he could do reduce over again, it would require init (and IReduce wouldn't exist -- we'd only have IReduceInit). The semantics of the init-less reduce are pretty weird, and depend on whether the collection is empty, has one element, or has multiple elements 😞

seancorfield21:05:42

user=> (defn f [& args] (println "I was called with" (count args) "args") 0)
#'user/f
user=> (reduce f [])
I was called with 0 args
0
user=> (reduce f [1])
1
user=> (reduce f [1 2])
I was called with 2 args
0
user=> (reduce f 0 [])
0
user=> (reduce f 0 [1])
I was called with 2 args
0
user=> (reduce f 0 [1 2])
I was called with 2 args
I was called with 2 args
0
user=>

DrLjótsson21:05:04

Thanks @U04V70XH6, the behavior of an ending reduce function would certainly be unpredictable without an init value!

didibus22:05:31

I like it.

didibus22:05:46

Might be nice if without a starting collection it returned a transducer similar to comp. And a way to make an eduction instead of a sequence could be cool as well. Not sure what the syntax should be for that.

didibus22:05:35

Maybe if it requires a context as the last thing? So you had to put sequence or eduction or into or reduce at the end?

seancorfield22:05:36

@U0K064KQV How would it tell what the first item was, as a macro? Seems like you'd need a syntactic hint for that or (more easily) a different macro?

john00:05:59

I thought about checking for into in the final form too in injest. As an optimization. I like the idea of trying to detect if it can return an eduction somehow. That'd be awesome

john00:05:21

Wrt the magic of how injest reads transducer forms, if we all agreed to tag transducers in some common way, perhaps via metadata, then it might be less magical

john00:05:53

But in reality, the list of core transducers probably won't change that frequently

john00:05:28

I think I may be returning a vector in injest too iirc and some may prefer a seq. Lots of different options to explore

john00:05:04

I also think it'd be cool if you could return threads as eductions that can then be used as transducer in other transducer threads that can be returned as an eduction or a sequence

jasonjckn00:05:18

why not just use reducibles? take lazy seq code and slap an 'r/' in front of it and you have transducer code

jasonjckn00:05:56

(->> [1 2 3 4]
    (r/map inc)
    (r/filter odd?)
    (into []))

john00:05:22

@U04V70XH6 the macro can just look at the value of the first item, right?

john00:05:06

Just look at it... If the first item is a transducer, then the user obviously wants to return this thread as a transducer, right?

john00:05:39

Oh, it could be a let bound name of a thing that could be a transducer or a collection

john00:05:59

So you'd need a runtime check

✔️ 2
john00:05:46

Might not be worth it dunno

jasonjckn00:05:07

only the last expression needs macro power, so you could do the rest at runtime, but not a fan of this abstraction in general

seancorfield00:05:28

No, a macro cannot look at a runtime value @U050PJ2EU

john00:05:40

There's a purely functional version of the thrush too

john00:05:07

I mean, a thrush of functions

john00:05:40

Fogus I think talked about it on a blog post I believe

john00:05:34

The function thrush formalism might play better with transducers, without even having to use macros

jasonjckn00:05:27

just use reducibles :man-shrugging:

john00:05:26

Is your above example doing actual loop fusion that we get with transducers?

john00:05:21

That's the value prop of these fancy thread macros

jasonjckn00:05:57

have you seen the underlying protocol CollReduce?

jasonjckn00:05:27

I'm on my cell so can't write a wall of text, maybe someone else can

jasonjckn00:05:39

it's the same performance

john00:05:49

It's been a while since I saw the reducibles talks, that came before the transducer stuff, so I don't recall how that form you wrote might produce loop fusion

jasonjckn01:05:27

"Another objective of the library is to support reducer-based code with the same shape as our current seq-based code. Getting there is easy:" https://clojure.org/news/2012/05/15/anatomy-of-reducer

jasonjckn01:05:54

'rmap' = 2 arity version of r/map

john01:05:26

Does this fuse odd? and inc into the same loop?

(r/map inc (r/filter odd? [1 2 3 4]))

john01:05:10

Hmm, I'm not getting that from the docs. Can you point me to some explanation of that?

jasonjckn01:05:13

the link I sent is the best I found, gonna focus on something else, try watching Rich's talks again maybe

john01:05:33

Word. Anyway I'd love to see some competition with injest. Lots of different optimizations are possible to explore

jasonjckn01:05:48

you can always benchmark it

john01:05:34

One of the selling points of this new "Mojo" language is some auto-scaling aspect of their parallelization stuff, which would be pretty easy on top of r/fold

john01:05:55

@U0J3J79FE injest's parallel => actually uses r/fold as the transducing context under the hood

john01:05:11

Auto scaling on top of => would be interesting

didibus01:05:28

Ya, reducers don't reduce until the reducible context. Transducers are a generalization of reducers

didibus01:05:02

But you're not converting things to transducers, but to reducers when you use reducers. I don't think they can be intermixed, not sure though

john01:05:25

Okay. I'm remembering now. I think I completely brain dumped on all the reducer's stuff after the transducer stuff came out lol

didibus01:05:19

I think the difference is that the reducer always captures the input coll, maybe? Transducers separate that out, so you can compose transducers on their own, and then apply their combinations to different input colls and also use a different context to apply them.

didibus01:05:28

I do wish that transducers had continued down the "parallel" track. For example, Java streams really went strong in auto-parallelization, and even their reduce function is designed in a way that it's always parallelizable.

didibus01:05:29

Basically Java only has r/fold

john01:05:33

A lot of transducers are parallelizable

didibus01:05:50

Can r/fold apply transducers?

didibus01:05:00

in parallel?

john01:05:25

Some of them

john01:05:42

They're labeled in the docs whether they're parallelizable or not

john01:05:40

r/fold, core.async/pipeline

john01:05:40

r/fold is better for seq-to-seq parallelization. Pipeline is better for many in to many out with many streams not correlated

john01:05:40

But if all your streams have to rejoin at the end of something, the work stealing of fork/join under r/fold does magic in keeping work on one core if distributing it is too much of a penalty

jasonjckn01:05:52

most of the 2-arity versions, like r/map, return 'foldables' (word?) as well as 'reducibles'

jasonjckn01:05:23

in terms of protocols, foldables are also reducibles, since they both implement CollReduce

john01:05:43

Aye, it's coming back to me

jasonjckn01:05:05

but foldables also implement CollFold on top of CollReduce

john01:05:02

Does this:

(->> [1 2 3 4]
  (r/map inc)
  (r/filter odd?)
  (into []))
Do exactly the same thing as injest's:
(x>> [1 2 3 4]
  (map inc) 
  (filter odd?)
  (into []))
?

👍 1
john01:05:28

If so then yeah that's a pretty fair trade off. You trade some characters here for some characters there

didibus01:05:46

I still agree though, the syntax isn't that great if I compare it to Java streams, for example, where you pipe your transformations and then choose your "transducing context". I would much prefer

didibus01:05:26

To be able to do (as-parallel) or something at the end.

john01:05:41

That's transparent in injest. But checking for users using transducing contexts at the end would be an interesting optimization, like @UGDTSFM4M did in the gist

didibus01:05:13

Ya, in injest you change from ->> to x>> right?

didibus01:05:19

For "parallel"

john02:05:10

Yeah, it always terminates on sequence for non-parallel x>> and r/fold for parallel =>>, so you don't have to terminate with into or anything in particular

john02:05:18

And =>> defaults to the x>> semantics for parts of the thread that are not parallelizable

john02:05:31

So maybe that's different than the reducers namespace

john02:05:32

That would be an interesting benchmark shootout - injest's transducers vs reducers tools in thread transformations

didibus02:05:32

But do you need to switch to reducers for parallel? Or can transducers be fed to r/fold?

john02:05:27

r/fold takes transducers in the way I did it in the lib

didibus02:05:56

Cool, so r/fold can be used with transducers to reduce in parallel. That's neat.
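
One minimal way to see that (a sketch; only safe for stateless transducers like map/filter, since r/fold never calls the completing arity and runs chunks in parallel):

(require '[clojure.core.reducers :as r])

(def xf (comp (map inc) (filter odd?)))

;; applying the transducer to + yields the reducing fn that r/fold runs
;; over each chunk in parallel; the chunk results are then combined with +
(r/fold + (xf +) (vec (range 1000000)))
;;=> the sum of the odd numbers among (map inc (range 1000000))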

john02:05:35

Yeah, definitely not publicized enough. It's fire

john02:05:27

But in a thread-last form you'd have to terminate with r/fold. And I'm not sure if you'd be allowed to use non-parallelizable reducers in your thread

john02:05:29

So I'm not sure if a parallel shootout would be as apples and apples

john02:05:38

Anyway, I think it would be a great idea @UGDTSFM4M for folks to come up with alternatives to injest and/or just steal the best parts you like and add what you want. That's what I did with the core macros. It's a sweet DX space to work on and the fruit is just hanging too low here to not explore it imo

💯 2
didibus03:05:10

I do like the DX which is basically:
1. Parallel or not?
2. Transformations you want
3. Choose the execution context and resulting output shape

onionpancakes03:05:26

I want to bring awareness to https://clojuredocs.org/clojure.core/eduction. I use transducers heavily, and rather than reaching for these transducer DX libraries, I find using eduction sufficiently ergonomic. I wrote examples of its use on ClojureDocs.

didibus03:05:20

Ya, it's funny that eduction takes one or more xforms in the middle. It's annoying though that sequence and into don't.
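
i.e.:

(eduction (map inc) (filter odd?) (range 10))         ; xforms given inline
(sequence (comp (map inc) (filter odd?)) (range 10))  ; needs comp
;; both yield the elements (1 3 5 7 9) when consumed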

john04:05:38

Yeah that's pretty ergonomic

john04:05:49

I'd usually want the caching of sequence though

didibus04:05:06

You can into the eduction to cache I think.

didibus04:05:34

But then the DX is no longer good haha

john04:05:50

Also, all those methods encourage segregating your transducer code from your regular code, whereas injest encourages you to intermingle them so it all looks like a regular thread. Slightly different use case.

john04:05:45

If you have some lib or app that actually necessitates the defining and composing and recomposing of context-free transformations, then keeping things separated from context for as long as possible makes sense

john04:05:18

Sometimes though, at the point of service, where things are finally composed together to do the transformation in all the context, we'll want to compose transducers into their final form for usage, mixing them into the context of "regular code."

john04:05:52

And not for reusability

john04:05:28

That's where the segregation causes friction

john04:05:24

Not a ton, just slightly frustrating enough for you to say, "wow, this could actually be a thread-last form right here, sure would be nice..."

john04:05:36

So I guess injest is a little orthogonal to the reusable-component phase of building transducers

john04:05:17

And is more about the phase where you're intermingling them with context

john05:05:59

The way I imagine you using injest: You have some part of your app with transducers composed into general-purpose transformations. Another part servicing an endpoint has a few thread-last forms, some of which use those transducers from above. Bob adds a mapping operation after a filtering operation, which already came after another mapping operation, so Bob says, "these are all transducers :thinking_face: let's grab that low-hanging fruit by replacing ->> with x>> for now." And then a week later, Alice adds two more mapping and filtering operations after the ones Bob added. Alice notices this part of the thread is becoming embarrassingly saturated with transducers, so she abstracts that out into its own transducer and adds it to the common transducers namespace for everyone else to reuse. So in that scheme, you'd use injest to slowly grow your code towards more and more reusable transducer stacks while still scooping up the low-hanging fruit in those thread-last pipelines that are still in flux and intermingled between transducers and regular code.

DrLjótsson19:05:24

Thanks for all comments! >> is an effort to make most of my uses of ->> more efficient through transducers. I don't think I have ever needed the de-coupling of input/output and transformation process that transducers offer (though I suspect that this de-coupling underlies the increased efficiency), nor have I found a need to re-use transformation processes. But it has bothered me that I'm not getting the performance boost from transducers.

didibus19:05:13

Basically, it's for when you're transforming something and then later transforming it some more, as opposed to consuming it for, say, doing I/O or returning results. The eduction will keep fusing the transforms, and only when you actually need the data will it eagerly perform it all. sequence I think will be a bit similar, but with a different profile. into won't do that; it realizes things right away.

didibus19:05:59

So it's good if you want to factor out your transformation code in smaller functions with good names.

john21:05:40

Yeah, we just don't do that most of the time. You just don't need these reusability constructs 90% of the time, because simple functions are so reusable

john21:05:11

Most applications just don't need reusable transformations

john21:05:33

Most of them are one-off throw aways

didibus23:05:51

Now I'm wondering... Could you have a function instead of a macro?

(xfm
  coll
  (map inc)
  (filter even?)
  :to :sequence
  :parallel true)
Something like:
(defn xfm
  [coll xforms* & {:keys [to parallel reduce] :or {to :sequence parallel false}}]
   ...)

;; :to choices would be :sequence, :eduction, :vector, :list, :set, :map, etc.
;; :reduce can take a vector of [rf init] or a map {:rf rf :init init}, and if :parallel it also needs a combiner fn
It could even support no coll by taking :to :transducer as the option, in which case it knows that the coll arg is actually another xform.
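
A rough, non-parallel sketch of that idea (hypothetical; :parallel and :reduce handling left out):

(defn xfm
  [coll & args]
  (let [[xforms kvs] (split-with (complement keyword?) args)
        {:keys [to] :or {to :sequence}} (apply hash-map kvs)
        xf (apply comp xforms)]
    (case to
      :sequence (sequence xf coll)
      :eduction (eduction xf coll)
      :vector   (into [] xf coll)
      :set      (into #{} xf coll))))

(xfm (range 10) (map inc) (filter odd?) :to :vector)
;;=> [1 3 5 7 9]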

didibus00:05:51

I kind of like that for a pure transducer pipeline, to be honest. For general utility I still like injest because it'll mix and match non-transducing functions automatically, so you can have a big transformation pipeline and just not worry about what has a transducer or not.

john00:05:10

Interesting. Yeah, a function thrush version might actually work pretty well for that