Is it just me, or does it seem like almost all the filter and map functions in Clojure are hell-bent on returning a collection to which conj prepends -- rather than appends -- regardless of whether their input collection had conj append behavior or not. The two most critical functions here have corresponding mapv and filterv, which return a vector -- so append behavior can be preserved -- but that has the effect of tossing laziness out the window. Does Clojure have a lazy collection with conj append behavior? Even better, is there one which all the various filter and map functions respect in terms of preserving conj append behavior?
This is the sequence abstraction at work. You can use transducers to perform transformations and produce whatever collection type you want. `map`, `filter`, etc. are explicitly defined to call `seq` on the collection so they operate on an abstraction, not a concrete collection type. That is by design.
The idea of laziness and appending to the end of a sequence are in tension with each other. One option is to use `concat`, which lets you combine two lazy sequences, but thar be dragons.
@U7RJTCH6J Why is that so?
@U06DE7HR6US Only sequences are lazy, and those functions return sequences, and they mimic lists in behavior. Those functions you are talking about are called the sequence functions, because they take a sequence and return a sequence, but they will coerce the input to a sequence as well if given a seqable (something that can be made into a sequence). So the answer is no, there isn't anything in Clojure that is lazy and would have conj append to it. But like Sean said, there are also transducers and transducing functions. Those let you choose what collection type they return. It's still "filter" and "map", but it's the transducer arity, which returns a transducer version of them, and then you can use a transducing function to apply them. Transducers are not lazy though; they can be lazily applied using sequence, but that returns a sequence again. But you can compose transforms and they do loop fusion, and with eduction you can also delay when they are realized and compose more transforms on top of the result before realizing them. So only for use-cases where not everything will be needed would they not be as appropriate.
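A quick sketch of what that looks like in practice (illustrative):

(def xf (comp (map inc) (filter odd?)))  ; transducer arities: no coll yet

(into [] xf (range 10))   ;=> [1 3 5 7 9] -- eager, and conj appends (vector)
(sequence xf (range 10))  ;=> (1 3 5 7 9) -- lazily applied, but a seq again
(eduction xf (range 10))  ; a recipe: re-runs the fused transforms each time it's reduced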
There are only two options: you can either realize a lazy sequence to append to the end, which makes it no longer lazy, or you can enqueue work to do later (i.e. `concat`). Using `concat` may work in some cases, but it's an easy recipe to start enqueuing work faster than it can be consumed.
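A classic way to meet those dragons (illustrative; the exact threshold depends on JVM stack size):

;; each append-via-concat just enqueues another lazy layer...
(def slow-queue (reduce concat (repeat 10000 [1])))
;; ...and realizing the first element has to unwind all of them
(first slow-queue) ; likely throws StackOverflowError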
Maybe I'm tired, but are you sure? The realized elements will take up memory; that's true even now if you retain the head. But realizing the next element can be lazy, so why does it matter where it then gets cached? conj cannot be used lazily, even when prepending.
So I mean, couldn't elements, as they are lazily realized, just get put into a vector for example? If you conj, it would realize it all and then append.
And ya, in that way cons would enqueue the "next" element which would be appended when realized
Thanks @U04V70XH6 and @U7RJTCH6J for your replies. To summarize my understanding -- no, there is no lazy conj-append Clojure data structure, and it's like that because that's the way it is, not because of any inherent algorithmic reason.
`conj` has a specific meaning: for a list or lazy sequence it's prepend, and for a vector it's append. Lazy append is `concat`. I'd classify that as "algorithm" -- or at least "deliberate semantics".
`(concat my-lazy-seq [my-new-item])` will lazily append `my-new-item` to the end of `my-lazy-seq` as it is realized (consumed).
`conj` is a collection function -- its behavior depends on the type of the collection, and the collection argument comes first.
`cons` is a sequence function -- its behavior is always the same (prepend to a sequence), and the sequence argument comes last.
`map`, `filter`, etc. are sequence functions -- the sequence argument comes last (and they call `seq` on that argument).
See https://clojure.org/reference/sequences and https://clojure.org/reference/data_structures#Collections for more on that.
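At the REPL, those semantics look like this:

(conj '(1 2 3) 0) ;=> (0 1 2 3) -- prepend for lists/seqs
(conj [1 2 3] 4)  ;=> [1 2 3 4] -- append for vectors
(cons 0 [1 2 3])  ;=> (0 1 2 3) -- always returns a seq, always prepends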
Has anyone here used the Kit framework? What are your opinions? Any alternatives? I'm starting a new project, and the idea of building up a framework from scratch, starting from a plain Leiningen template and adding all the pieces, is exhausting to me. I want a batteries-included API template. Do you guys use something like this? Please share! Thanks 🙂
Kit is a great batteries included way to get going on a web project. It's made by the folks who did Luminus which is similar but lein based. I've used both a lot. Biff is another alternative here, though it's much more framework-y than Kit which is more of a template than a framework.
People that have libs: if you make a change that is kind of not related to the code -- for example, I added clj-kondo linting support -- would you bump the minor version, or can you just publish at the same version and it will override the artifact in Clojars? And is the latter bad practice even if possible?
Every change is a new version. But I only release when there's a change that affects end users.
You can't overwrite versions in Clojars. It's not mutable you know!😁
I want to "convert" a thread-last macro that includes only functions with transducing arity into a transducer. No `into` or `reduce` as the last function in the thread. Would I then use `sequence` to get a behavior that most closely resembles the original?
One important distinction between `sequence` and typical lazy functions like `map` and `filter` is that `sequence` will partially consume its input when created: https://ask.clojure.org/index.php/13155/sequence-partially-consumes-input.
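You can see the partial consumption with a side-effecting transform (illustrative; how much gets consumed at creation may vary by version):

(def seen (atom []))
(def s (sequence (map (fn [x] (swap! seen conj x) x)) (range 10)))
@seen ;=> [0] -- the first item was already pulled through, before anyone touched s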
Having a huge XML file stored on disk, what’s the most memory-efficient way to change the attribute of 1 (one) element that I can uniquely identify by its tag name? (background in 🧵, not strictly needed)
I’m very new to XML parsing, traversing and writing, and got myself into a pickle with a REST API I have to call with large XML payloads that sometimes include Base64-encoded images.
First implementation was rather naive, so we quickly ran into Java Heap Space errors after some people began to use the system in earnest. Local testing with the time+™️ macro showed me this: `; time: 7.81 s alloc: 11’527’032’096b Iterations: 1`
After some refactorings I have it down to `; time: 5.02 s alloc: 357’261’824b Iterations: 1`. I store the XML on disk and the post body is an `io/input-stream`.
Now sometimes the endpoint returns an error that I can parse to know which element I can change and try again. Before the refactorings I just `str/replace`'d the heck out of it. One of my tries to change this:
;; assumes (:require [clojure.java.io :as io] [clojure.data.xml :as xml]
;;                   [clojure.walk :as walk]) plus an in-house storage ns
(defn- replace-issue-name2 [{:keys [file-path new-issue-name]}]
  (let [new-path (storage/tmpfile)]
    (with-open [is (io/input-stream file-path)
                w  (io/writer new-path)]
      (let [xml (xml/parse is :skip-whitespace true)]
        (xml/emit
         (walk/postwalk #(if (and (map? %) (= :issue (:tag %)))
                           (assoc-in % [:attrs :name] new-issue-name)
                           %)
                        xml)
         w)
        new-path))))
It’s okayish: `; time: 352.29 ms alloc: 80’724’337b Iterations: 6` and better than slurping and str/replacing (`; time: 266.53 ms alloc: 385’816’448b Iterations: 8`).
From what I understand `postwalk` is eager and perhaps there is no way to avoid that. But what other ways are there to do this? Perhaps something with zippers?
Maybe you should take a look at Java SAX XML parsers. SAX parsers do not consume the whole document in memory. They scan the doc using a narrow window (1-2 kilobytes) and call certain methods of your class, for example tagNameStarted and tagNameEnded, etc. In your parser, you redirect the content to another stream with some correction.
Second, is it possible to simplify the documents you're working with? Say, to put Base64-encoded images into separate files and replace them with references like
<image src="static/foobar.png.base64"/>
After having experimented with a few available parsers, I found that https://github.com/mudge/riveted is the most performant. Altering existing XML documents in-place is possible but might be a tad tricky. If you end up confirming that its performance/trickiness trade-off is suitable for you, let me know if you have any issues with getting stuff to work - I have some more or less ad hoc functions that might be helpful.
I would probably focus on either StAX (streaming) APIs or, depending on the document shape, this might be a good use case for an XSLT transform. But also, have you considered not treating it as XML at all?
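If you go the StAX route, a rough, untested sketch of the attribute rewrite (it assumes the <issue name="..."> shape from your postwalk version above, and uses only the JDK's javax.xml.stream API) -- it copies events through and rebuilds just the one start tag:

(import '(javax.xml.stream XMLInputFactory XMLOutputFactory XMLEventFactory)
        '(javax.xml.stream.events XMLEvent Attribute))

(defn- rewrite-issue-name [input-stream writer new-name]
  (let [rdr (.createXMLEventReader (XMLInputFactory/newInstance) input-stream)
        wtr (.createXMLEventWriter (XMLOutputFactory/newInstance) writer)
        ef  (XMLEventFactory/newInstance)]
    (while (.hasNext rdr)
      (let [^XMLEvent ev (.nextEvent rdr)]
        (if (and (.isStartElement ev)
                 (= "issue" (-> ev .asStartElement .getName .getLocalPart)))
          ;; rebuild this one start tag with the name attribute swapped out
          (let [se    (.asStartElement ev)
                attrs (concat (remove (fn [^Attribute a]
                                        (= "name" (.getLocalPart (.getName a))))
                                      (iterator-seq (.getAttributes se)))
                              [(.createAttribute ef "name" new-name)])]
            (.add wtr (.createStartElement ef (.getName se)
                                           (.iterator attrs)
                                           (.getNamespaces se))))
          ;; everything else streams straight through, never all in memory
          (.add wtr ev))))
    (.flush wtr)))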
Thanks for all the quick replies!
> Maybe you should take a look at java SAX XML parsers
I already kind of suspected I maybe need to go directly to Java/StAX, but didn’t know about SAX, reading up on it.
> Second, is it possible to simplify the documents you’re working with? Say, to put Base64-encoded images into separate files and replace them with references like
In the end I have to send everything in one XML along the wire, so I think not?
I kind of do that when building the initial XML: the first implementation had everything in the XML when creating it. Now I put placeholders for the images. Because this XML has line breaks I can `line-seq` over it, streaming/writing it to disk line by line, replacing the found placeholders with the images.
But in the final XML payload the image payload has to be present at the moment (we’re hoping for some api changes soon).
> After having experimented with a few available parsers, I found that https://github.com/mudge/riveted is the most performant.
First time I heard about `riveted`, thanks for bringing it to my attention, I will definitely have a look at it when I’m back working on this later next week. Will come back to you if I stick with it 🙏
> But also, have you considered not treating it as xml at all?
What do you have in mind exactly? Funnily enough, that function I shared is the one time I do treat it as XML and not just a string or byte stream, so I’m open to other approaches
Can you identify the location in the document via string search without parsing it at all?
Yes I can and I was just now thinking I could shell out to sed
I did try one solution with an `io/reader` that I called `line-seq` on, and then I just `str/replace`'d every line and wrote it to disk. But that used way more memory than the `postwalk` I posted.
Is shelling out for something like this appropriate? I’m a bit hesitant to leave the “safety” of the JVM 😅
It's appropriate if you handle the exit code and can guarantee that `sed` exists on the target system, is of the required version (probably not important), and if you make sure that it's the system `sed` and not something some users might've installed in `$PATH`.
But also, that `riveted` I mentioned uses VTD-XML, and that ends up replacing bytes from `idx0` to `idx1` with bytes `b`. Pretty much what `sed` does.
Hi. I've struggled with starting to use transducers, mainly because I think that the syntax is clunky compared to the `->>` macro. I thought that https://github.com/johnmn3/injest was pretty cool but maybe it is a bit too magical with its auto-detection of transducible forms. I know that there are arguments against creating new "syntax macros" but I wanted to give it a try. So I created `>>` (https://gist.github.com/brjann/0d4026ce22ba0e60e18a68ff17e0fed5) that transforms a `->>`-style call into a call using a transducer. It assumes that all threaded forms are transducible, and returns a call to `(sequence xf coll)`. However, it has two special cases: if the last threaded form is either `(into into-coll)` or `(reduce rf init)`, it will return `(into into-coll xf coll)` or `(transduce xf (completing rf) init coll)` respectively. Thoughts welcome!
Examples in 🧵
(>> [1 2 3 4]
    (map inc)
    (filter odd?))
;; expands to
(sequence (comp (map inc)
                (filter odd?))
          [1 2 3 4])
;=> (3 5)
(>> [1 2 3 4]
    (map inc)
    (filter odd?)
    (into []))
;; expands to
(into []
      (comp (map inc)
            (filter odd?))
      [1 2 3 4])
;=> [3 5]
(>> [1 2 3 4]
    (map inc)
    (filter odd?)
    (reduce + 0))
;; expands to
(transduce (comp (map inc)
                 (filter odd?))
           (completing +)
           0
           [1 2 3 4])
;=> 8
This `reduce` case was a bit tricky. As far as I understand, `transduce` does not handle the reducing function `rf` in the same way as `reduce` does when the init value is missing. `transduce` will call `rf` with zero arguments to get the init value, whereas `reduce` will call it with the first two items in coll. To avoid ambiguity, I made a final `reduce` in `>>` require an init value.
I think that's a reasonable decision. Rich has said that if he could do `reduce` over, it would require `init` (and `IReduce` wouldn't exist -- we'd only have `IReduceInit`).
The semantics of the `init`-less `reduce` are pretty weird, and dependent on whether the collection is empty, has one element, or has multiple elements 😞
user=> (defn f [& args] (println "I was called with" (count args) "args") 0)
#'user/f
user=> (reduce f [])
I was called with 0 args
0
user=> (reduce f [1])
1
user=> (reduce f [1 2])
I was called with 2 args
0
user=> (reduce f 0 [])
0
user=> (reduce f 0 [1])
I was called with 2 args
0
user=> (reduce f 0 [1 2])
I was called with 2 args
I was called with 2 args
0
user=>
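For comparison, `transduce` with the same `f` and no init calls `(f)` for the seed, and also calls the 1-arity completion at the end:

user=> (transduce (map identity) f [1 2])
I was called with 0 args
I was called with 2 args
I was called with 2 args
I was called with 1 args
0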
Thanks @U04V70XH6, the behavior of an ending `reduce` function would certainly be unpredictable without an `init` value!
Might be nice if without a starting collection it returned a transducer similar to comp. And a way to make an eduction instead of a sequence could be cool as well. Not sure what the syntax should be for that.
Maybe if it requires a context as the last thing? So you had to put sequence or eduction or into or reduce at the end?
@U0K064KQV How would it tell what the first item was, as a macro? Seems like you'd need a syntactic hint for that or (more easily) a different macro?
I thought about checking for `into` in the final form too in injest, as an optimization. I like the idea of trying to detect if it can return an eduction somehow. That'd be awesome
Wrt the magic of how injest reads transducer forms, if we all agreed to tag transducers in some common way, perhaps via metadata, then it might be less magical
I think I may be returning a vector in injest too iirc and some may prefer a seq. Lots of different options to explore
I also think it'd be cool if you could return threads as eductions that can then be used as transducers in other transducer threads, which can themselves be returned as an eduction or a sequence
why not just use reducibles? Take lazy seq code and slap an `r/` in front of it and you have transducer code
@U04V70XH6 the macro can just look at the value of the first item, right?
Just look at it... If the first item is a transducer, then the user obviously wants to return this thread as a transducer, right?
only the last expression needs macro power, so you could do the rest at runtime, but I'm not a fan of this abstraction in general
No, a macro cannot look at a runtime value @U050PJ2EU
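A toy example of why (hypothetical macro, just to show what macros receive):

(defmacro sees [x] (pr-str x))
(sees (map inc)) ;=> "(map inc)" -- an unevaluated list form, not a transducer; there's nothing to "look at"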
The function thrush formalism might play better with transducers, without even having to use macros
It's been a while since I saw the reducibles talks, that came before the transducer stuff, so I don't recall how that form you wrote might produce loop fusion
"Another objective of the library is to support reducer-based code with the same shape as our current seq-based code. Getting there is easy:" https://clojure.org/news/2012/05/15/anatomy-of-reducer
the link I sent is the best I found; gonna focus on something else, try watching Rich's talks again maybe
Word. Anyway I'd love to see some competition with injest. Lots of different optimizations are possible to explore
One of the selling points of this new "mojo" language is some auto scaling aspect of their parallelization stuff, which would be pretty easy on top of r/fold
@U0J3J79FE injest's parallel `=>>` actually uses `r/fold` as the transducing context under the hood
Ya, reducers don't reduce until the reducible context. Transducers are a generalization of reducers
> The following functions (analogous to the sequence versions) create reducers from a reducible or foldable collection: https://clojure.github.io/clojure/clojure.core-api.html#clojure.core.reducers/map https://clojure.github.io/clojure/clojure.core-api.html#clojure.core.reducers/mapcat https://clojure.github.io/clojure/clojure.core-api.html#clojure.core.reducers/filter https://clojure.github.io/clojure/clojure.core-api.html#clojure.core.reducers/remove https://clojure.github.io/clojure/clojure.core-api.html#clojure.core.reducers/flatten https://clojure.github.io/clojure/clojure.core-api.html#clojure.core.reducers/take-while https://clojure.github.io/clojure/clojure.core-api.html#clojure.core.reducers/take and https://clojure.github.io/clojure/clojure.core-api.html#clojure.core.reducers/drop. None of these functions actually transforms the source collection. To produce an accumulated result, you must use r/reduce or r/fold. To produce an output collection, use https://clojure.github.io/clojure/clojure.core-api.html#clojure.core/into to choose the collection type or the provided https://clojure.github.io/clojure/clojure.core-api.html#clojure.core.reducers/foldcat to produce a collection that is reducible, foldable, seqable, and counted.
But you're not converting things to transducers, but to reducers when you use reducers. I don't think they can be intermixed, not sure though
Okay. I'm remembering now. I think I completely brain dumped on all the reducer's stuff after the transducer stuff came out lol
I think the difference is the reducer always captures the input coll, maybe? And transducers separate that out, so you can compose transducers on their own, and then apply their combinations to different input colls, and also use different contexts to apply them.
I do wish that transducers had continued down the "parallel" track. For example, Java streams really went strong in auto-parallelization, and even the reduce function is designed in ways so it's always parallelizable.
r/fold is better for seq-to-seq parallelization. Pipeline is better for many in to many out with many streams not correlated
But if all your streams have to rejoin at the end of some thing, the work stealing of forkjoin under r/fold does magic in keeping work on one core if distributing it is too much of a penalty
most of the 2-arity versions, like r/map, return 'foldables' (word?) as well as 'reducibles'
in terms of protocols, foldables are also reducibles, since they both implement CollReduce
Does this:
(->> [1 2 3 4]
     (r/map inc)
     (r/filter odd?)
     (into []))
Do exactly the same thing as injest's:
(x>> [1 2 3 4]
     (map inc)
     (filter odd?)
     (into []))
?
If so then yeah, that's a pretty fair trade-off. You trade some characters here for some characters there
I still agree though, the syntax isn't that great if I compare it to Java streams, for example: you pipe your transformations, then you choose your "transducing context". I would much prefer that.
That's transparent in injest. But checking for users using transducing contexts at the end would be an interesting optimization, like @UGDTSFM4M did in the gist
Yeah, it always terminates on `sequence` for non-parallel `x>>` and `r/fold` for parallel `=>>`, so you don't have to terminate with `into` or anything in particular
And `=>>` defaults to the `x>>` semantics for parts of the thread that are not parallelizable
That would be an interesting benchmark shootout - injest's transducers vs reducers tools in thread transformations
But do you need to switch to reducers for parallel? Or can transducers be fed to `r/fold`?
But in a thread-last form you'd have to terminate with r/fold. And I'm not sure if you'd be allowed to use non-parallelizable reducers in your thread
Anyway, I think it would be a great idea @UGDTSFM4M for folks to come up with alternatives to injest and/or just steal the best parts you like and add what you want. That's what I did with the core macros. It's a sweet DX space to work on and the fruit is just hanging too low here to not explore it imo
I do like the DX which is basically: 1. Parallel or not? 2. Transformations you want 3. Choose the execution context and resulting output shape
I want to bring awareness to https://clojuredocs.org/clojure.core/eduction. I use transducers heavily, and rather than reaching for these transducer DX libraries, I find eduction sufficiently ergonomic. I wrote examples of its use on clojuredocs.
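For instance (illustrative):

(def odds+1 (eduction (map inc) (filter odd?) (range 10)))
;; nothing is computed yet; each reduce over it runs the fused pipeline once
(into [] odds+1)    ;=> [1 3 5 7 9]
(reduce + 0 odds+1) ;=> 25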
Ya, it's funny that eduction takes one or more xforms in the middle. It's annoying though that sequence and into don't.
Also, all those methods encourage segregating your transducer code from your regular code, whereas injest encourages you to intermingle them so it all looks like a regular thread. Slightly different use case
If you have some lib or app that actually necessitates the defining and composing and recomposing of context free transformations, then keeping things separated from context for as long as possible makes sense
Sometimes though, at the point of service, where things are finally composed together to do the transformation in its full context, we'll want to compose transducers into their final form for usage, mixing them into the context of "regular code."
Not a ton, just slightly frustrating enough for you to say, "wow, this could actually be a thread-last form right here, sure would be nice..."
So I guess injest is a little orthogonal to the reusable-component phase of building transducers
The way I imagine you using injest:
You have some part of your app with transducers composed into general purpose transformations.
Another part servicing an endpoint has a few thread-last forms, some of which use those transducers from above.
Bob adds a mapping operation after a filtering operation, which already came after another mapping operation, so Bob says, "these are all transducers :thinking_face: let's grab that low hanging fruit by replacing `->>` with `x>>` for now."
And then a week later, Alice adds two more mapping and filtering operations after the ones Bob added. Alice notices this part of the thread is becoming embarrassingly saturated with transducers, so she abstracts that out into its own transducer and adds it to the common transducers namespace for everyone else to reuse.
So in that scheme, you'd use injest to slowly grow your code towards more and more reusable transducer stacks while still scooping up the low hanging fruit in those thread-last pipelines that are still in flux and intermingled between transducers and regular code.
Thanks for all the comments! `>>` is an effort to make most of my uses of `->>` more efficient through transducers. I don't think I have ever needed the de-coupling of input/output and transformation process that transducers offer (though I suspect that this de-coupling underlies the increased efficiency), nor have I found a need for re-use of transforming processes. But it has bothered me that I'm not getting the performance boost from transducers.
Basically, if you're transforming something and then later you transform it some more -- as opposed to consuming it, for say doing I/O or returning results -- the eduction will keep fusing the transforms, and only when you actually need the data will it eagerly perform it all. `sequence` I think will be a bit similar, but with a different profile. `into` won't do that; it realizes things right away.
So it's good if you want to factor out your transformation code into smaller functions with good names.
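Hypothetical example of that factoring (made-up names):

(defn only-active [users] (eduction (filter :active?) users))
(defn display-names [users] (eduction (map :name) users))

;; still unrealized; the single pass happens at the into
(into [] (display-names (only-active [{:name "ada" :active? true}
                                      {:name "bob" :active? false}])))
;=> ["ada"]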
Yeah, we just don't do that most of the time. You just don't need these reusability constructs 90% of the time, because simple functions are so reusable
Now I'm wondering... Could you have a function instead of a macro?
(xfm coll
     (map inc)
     (filter even?)
     :to :sequence
     :parallel true)
Something like:
(defn xfm
  [coll xforms* & {:keys [to parallel reduce] :or {to :sequence parallel false}}]
  ...)
;; :to choices would be :sequence, :eduction, :vector, :list, :set, :map, etc.
;; :reduce can take a vector of [rf init] or a map {:rf rf :init init},
;; and if :parallel it also needs a combinef
It could even support no coll by taking `:to :transducer` as the option, in which case it knows that the `coll` arg is actually another xform
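A rough sketch of how the non-parallel part could work as a plain function (hypothetical helper; xforms are just runtime values, so the only trick is separating them from the options):

(defn xfm [coll & args]
  (let [[xforms opts] (split-with (complement keyword?) args)
        {:keys [to] :or {to :sequence}} (apply hash-map opts)
        xf (apply comp xforms)]
    (case to
      :sequence (sequence xf coll)
      :eduction (eduction xf coll)
      :vector   (into [] xf coll)
      :set      (into #{} xf coll))))

(xfm (range 5) (map inc) (filter odd?) :to :vector) ;=> [1 3 5]

The :parallel and :reduce options are left out of this sketch; they'd need r/fold and a combinef as noted above.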