This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
So, what are people's go-to idioms for processing a large sequence into an output sequence, while accumulating state as you go? (e.g. in my case I'm walking a git log, with several hundred thousand commits, storing parent-child relationships as I go, then sending the result to a json file)
I could do all this via core.async but that feels a bit like overkill (and would make the code far more confusing to the casual reader)
@dominicm trouble is it has to finish reducing before it produces any results - see my posts yesterday - unless I'm having a brain-fart and it's work with some tweaks.
Basically, why do: log -> accumulator -> json When you could do: log -> accumulator ā> json
Maybe use the reducers library to parallelize it? S'almost for free if you're doing a fold
Unless memory is your bottleneck rather than speed of operation
Memory is the main problem I think - I haven't profiled it. @dominicm has a point, I could walk the log twice, once with reduce
to generate the parent/child mappings and once with map
to merge and filter into the json output. It would be slow though to do it twice
'group-by' on the parent keys which would give you '{key [things], key [more things]}' Walking twice is also a good idea. I'm just thinking off the top of my head.
/\/\ your memory looks like this if you do it twice, and like / \ if you do it once.
Yeah, doing it twice definitely solves the memory problem. I'd need to profile it to see if it's too slow though. Yes, I greedily want speed and memory efficiency
@korny if you're processing the log lazily, then 2 threads processing it in parallel shouldn't be a problem?
It's probably the answer - it just feels like clojure is smart enough that there should be a way to do it in one pass. Probably by using some custom function that captures the intermediate state somehow, and then map
You're doing 2 logically separate tasks, doing it in "one pass" doesn't really buy you anything.
Unless you're attempting to realize the whole log into memory, I don't expect memory to be an issue for you. You're processing 1 item at a time.
They aren't totally separate - I haven't explained totally! I'll try to explain a bit more. Git stores, for every commit, a list of parents of the commit. But not children. Jgit will walk the whole commit tree in topographic order - children before parents (though it's a badly documented black box, so how it does this is unclear) My code wants some info from the children of each commit, for building into the output log. So my reduce function builds a hash of parent->children as it goes, and then uses it to add data to the log.
I think the most performant option would be to use the current reduce logic, but send the log to an async channel. Anyway, got to go back to holidaying
Sort of a "`map` with context" then? Where context comes from earlier processed items?
https://github.com/clojure/clojure/blob/841fa60b41bc74367fb16ec65d025ea5bde7a617/src/clj/clojure/core.clj#L4985-L4994 I suspect you could write that similarly to distinct
(defn map-ctx
[f coll]
(let [step (fn step [xs ctx]
(lazy-seq
((fn [[fi :as xs] seen]
(when-let [s (seq xs)]
(let [[a b] (f ctx fi)]
(cons b (step (rest s) a)))))
xs ctx)))]
(step coll nil)))
(map-ctx (fn [ctx {:keys [parents value]}]
[(reduce #(update %1 %2 conj value) ctx parents)
{:v value :children (get ctx value)}])
[{:parents [2 3] :value 1}
{:parents [5] :value 2}
{:parents [4] :value 3}
{:value 4}
{:value 5}])
Produces ({:v 1, :children nil} {:v 2, :children (1)} {:v 3, :children (1)} {:v 4, :children (3)} {:v 5, :children (2)})
If you really needed it, you could perhaps squeeze some extra performance out of it by:
- Making your ctx be a transient (alternative f, on you)
- Looking at chunked-first
& so on in how map
is implemented
https://github.com/brandonbloom/transduce looks like this is already a library š
https://clojure.org/reference/transducers#_transducers_with_reduction_state you could totally use transients and write a transducer too