Fork me on GitHub
#clojure-uk
<
2018-06-10
>
korny09:06:01

So, what are people's go-to idioms for processing a large sequence into an output sequence, while accumulating state as you go? (e.g. in my case I'm walking a git log, with several hundred thousand commits, storing parent-child relationships as I go, then sending the result to a json file)

korny09:06:02

I could do all this via core.async but that feels a bit like overkill (and would make the code far more confusing to the casual reader)

korny09:06:49

@dominicm trouble is it has to finish reducing before it produces any results - see my posts yesterday - unless I'm having a brain-fart and it's work with some tweaks.

flefik09:06:46

nvm me i misunderstood what you said

dominicm09:06:48

So, why not "process" it twice?

dominicm09:06:53

Basically, why do: log -> accumulator -> json When you could do: log -> accumulator ā””> json

dominicm09:06:10

Or I suppose you do: log šŸ ¢ accumulator ā””ā”€ā”€ā”¬ā”€ā”€ā”˜ šŸ ƒ JSON

alexlynham10:06:42

Maybe use the reducers library to parallelize it? S'almost for free if you're doing a fold

alexlynham10:06:58

Unless memory is your bottleneck rather than speed of operation

jasonbell11:06:07

Sounds like the perfect use for a transducer but I don't know the context enough.

korny11:06:24

Memory is the main problem I think - I haven't profiled it. @dominicm has a point, I could walk the log twice, once with reduce to generate the parent/child mappings and once with map to merge and filter into the json output. It would be slow though to do it twice

jasonbell11:06:29

Everything is an event.

korny11:06:11

(I'm driving on holiday so apologies for intermittent comms!)

jasonbell11:06:32

No problem, I'm still half asleep from speaking at a conference yesterday šŸ™‚

jasonbell11:06:46

'group-by' on the parent keys which would give you '{key [things], key [more things]}' Walking twice is also a good idea. I'm just thinking off the top of my head.

dominicm12:06:26

@korny if memory is the problem, well, doing it twice isn't a problem?

dominicm12:06:24

/\/\ your memory looks like this if you do it twice, and like / \ if you do it once.

korny12:06:53

Yeah, doing it twice definitely solves the memory problem. I'd need to profile it to see if it's too slow though. Yes, I greedily want speed and memory efficiency

dominicm12:06:01

@korny if you're processing the log lazily, then 2 threads processing it in parallel shouldn't be a problem?

korny12:06:10

It's probably the answer - it just feels like clojure is smart enough that there should be a way to do it in one pass. Probably by using some custom function that captures the intermediate state somehow, and then map

dominicm12:06:20

You're doing 2 logically separate tasks, doing it in "one pass" doesn't really buy you anything.

dominicm12:06:13

Unless you're attempting to realize the whole log into memory, I don't expect memory to be an issue for you. You're processing 1 item at a time.

dominicm12:06:38

If you go in parallel you're processing up to 2 items at a time.

korny12:06:12

They aren't totally separate - I haven't explained totally! I'll try to explain a bit more. Git stores, for every commit, a list of parents of the commit. But not children. Jgit will walk the whole commit tree in topographic order - children before parents (though it's a badly documented black box, so how it does this is unclear) My code wants some info from the children of each commit, for building into the output log. So my reduce function builds a hash of parent->children as it goes, and then uses it to add data to the log.

korny12:06:07

I think the most performant option would be to use the current reduce logic, but send the log to an async channel. Anyway, got to go back to holidaying

dominicm12:06:58

Sort of a "`map` with context" then? Where context comes from earlier processed items?

dominicm13:06:40

@korny:

(defn map-ctx
  [f coll]
  (let [step (fn step [xs ctx]
               (lazy-seq
                 ((fn [[fi :as xs] seen]
                    (when-let [s (seq xs)]
                      (let [[a b] (f ctx fi)]
                        (cons b (step (rest s) a)))))
                  xs ctx)))]
    (step coll nil)))

(map-ctx (fn [ctx {:keys [parents value]}]
           [(reduce #(update %1 %2 conj value) ctx parents)
            {:v value :children (get ctx value)}])
         [{:parents [2 3] :value 1}
          {:parents [5] :value 2}
          {:parents [4] :value 3}
          {:value 4}
          {:value 5}])

dominicm13:06:02

Produces ({:v 1, :children nil} {:v 2, :children (1)} {:v 3, :children (1)} {:v 4, :children (3)} {:v 5, :children (2)})

dominicm13:06:44

If you really needed it, you could perhaps squeeze some extra performance out of it by: - Making your ctx be a transient (alternative f, on you) - Looking at chunked-first & so on in how map is implemented

dominicm13:06:56

https://github.com/brandonbloom/transduce looks like this is already a library šŸ™‚

korny18:06:27

Cool - I should learn more about transducers...

dominicm18:06:11

(transduce has nothing to do with transducers fyi)

korny19:06:36

yeah, I worked that out. Naming is fun šŸ™‚