#data-science
2017-11-22
joachim 12:11:25

Question: what would be the best (= fast, parallelised and with low memory pressure) way to map a computationally expensive function producing 4-byte ints over a very large (lazy) sequence and put the resulting ints in a Java Array of ints? The large input sequence is too big to fit in memory, but the resulting int array obviously (but barely) fits in memory.

val_waeselynck 13:11:32

@U0AFUP6RH how could the memory pressure be low when the resulting array barely fits in memory?

joachim 13:11:17

well, let me rephrase: the extra pressure should be low because the resulting array barely fits in memory

val_waeselynck 16:11:21

@U0AFUP6RH well I guess you could process the lazy sequence using pmap (yielding another lazy sequence), and then run! it to append the resulting ints to a dynamically-sized int array (e.g. an implementation similar to ArrayList whose elements are primitive ints)
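A minimal sketch of that pmap + run! pipeline, assuming `expensive-f` as a hypothetical stand-in for the costly function (here the intermediate results land in a boxed java.util.ArrayList before being copied into a primitive array, which is one possible reading of the suggestion, not the thread's actual code):

```clojure
;; Sketch of the pmap + run! idea. `expensive-f` is a hypothetical
;; stand-in for the computationally expensive function.
(defn expensive-f [x] (* x x))

(def ^java.util.ArrayList acc (java.util.ArrayList.))

;; run! eagerly consumes the lazy pmap output purely for side effects,
;; instead of building yet another lazy seq of results.
(run! #(.add acc %) (pmap expensive-f (range 1000)))

;; copy the (boxed) accumulator into a primitive int array
(def squares
  (let [n (.size acc)
        arr (int-array n)]
    (dotimes [i n]
      (aset arr i (int (.get acc i))))
    arr))
```

The boxed accumulator is the extra memory cost this approach carries; the next message quantifies it.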

val_waeselynck 17:11:07

that would require your memory to be able to hold at least 150% of the final array size. If that's too much, maybe stream the results to disk while keeping a count of them, then slurp them into a fixed-size array?

niek 07:11:37

Hello @U06GS6P1N, I'm working with @U0AFUP6RH on this. Thanks for the suggestions! I implemented the pmap suggestion (in fact I had to build a chunked variant to avoid creating a separate future for each individual item) and then used into-array to get my array of primitive ints. Unfortunately I still saw millions of lazy seqs and conses being created during the process, and they were only cleaned up after the whole process.

niek 08:11:52

Now I simply stream the results to file and in a next step read the file back into an array. This avoids the memory overhead completely. Maybe into-array is not able to size the array dynamically without completely recreating it?

niek 08:11:15

Just had a look at the into-array implementation and it does a count on the lazy seq to create an array of the right size. The count causes the seq to be realized, and since we also hold a reference to the head of the seq, the intermediary structures cannot be GC'd. Makes sense?
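For reference, the sizing behaviour is easy to see at small scale (a sketch; the real input would of course be far too big for this):

```clojure
;; into-array sizes the target array by walking the whole seq first;
;; while a var (or local) still references the head, none of the
;; realized cells can be GC'd until the array is fully built.
(def xs (map inc (range 1000)))              ; the var holds the head
(def sized-arr (into-array Integer/TYPE xs)) ; realizes all of xs to size it
```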

val_waeselynck 08:11:59

@U5RGDD8C9 Ah right, I didn't suggest the chunked version because the transformation was described as computationally expensive, so I thought the overhead of pmap and lazy seqs would be insignificant. It's weird that the cleaning up only happens at the end. Would you by chance be using core.async's onto-chan or something like that to write the stream to disk? If so, be aware that there's a bug in onto-chan that will hold on to the head of the lazy seq until all items have been put on the channel

val_waeselynck 08:11:41

@U5RGDD8C9 oh right your explanation is much simpler

niek 08:11:57

I'm doing a with-open - doseq - .write to write the items to disk, so I guess it's not using onto-chan, but thanks for bringing that up 🙂
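A sketch of that stream-to-disk shape, using reduce instead of doseq so the written count comes back out (DataOutputStream/DataInputStream and the function names are illustrative choices, not the thread's actual code):

```clojure
;; Sketch: write each int with DataOutputStream, counting as we go,
;; then read them back into a fixed-size primitive array.
(import '[java.io DataOutputStream DataInputStream
          BufferedOutputStream BufferedInputStream
          FileOutputStream FileInputStream File])

(defn spill-ints!
  "Writes xs (a possibly huge lazy seq of ints) to file f,
   returning how many were written. reduce walks the seq
   without holding on to its head."
  [^File f xs]
  (with-open [out (DataOutputStream.
                   (BufferedOutputStream. (FileOutputStream. f)))]
    (reduce (fn [n x] (.writeInt out (int x)) (inc n)) 0 xs)))

(defn slurp-ints
  "Reads n ints from file f into a freshly allocated int array."
  [^File f n]
  (with-open [in (DataInputStream.
                  (BufferedInputStream. (FileInputStream. f)))]
    (let [arr (int-array n)]
      (dotimes [i n]
        (aset arr i (.readInt in)))
      arr)))

(def tmp-file (File/createTempFile "ints" ".bin"))
(def n-written (spill-ints! tmp-file (map #(* % %) (range 100))))
(def read-back (slurp-ints tmp-file n-written))
;; (aget read-back 9) → 81
```

Only the fixed-size result array ever lives in memory; the lazy input is consumed and discarded as it is written.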

niek 08:11:28

Thanks, that's a dynamically resizing int array, right?

val_waeselynck 09:11:54

@U5RGDD8C9 I guess! Anyway, you could implement the dynamic resizing yourself, it's not that hard. Don't forget to use System.arraycopy() (or a wrapper of it) for efficiency
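Something like this, a sketch of the doubling strategy (the representation as a `[backing-array element-count]` pair and the function names are made up for illustration):

```clojure
;; Sketch of a hand-rolled dynamically resizing int array: double the
;; backing array with System/arraycopy when full, trim to exact size
;; at the end.
(defn grow-ints ^ints [^ints arr]
  (let [bigger (int-array (* 2 (alength arr)))]
    (System/arraycopy arr 0 bigger 0 (alength arr))
    bigger))

(defn push-int [[^ints arr n] v]
  (let [^ints arr (if (= n (alength arr)) (grow-ints arr) arr)]
    (aset arr (int n) (int v))
    [arr (inc n)]))

(defn trim-ints ^ints [[^ints arr n]]
  (let [out (int-array n)]
    (System/arraycopy arr 0 out 0 n)
    out))

(def grown (trim-ints (reduce push-int [(int-array 16) 0] (range 100))))
;; (count grown) → 100
```

Note that the final trim briefly needs the old and new arrays side by side, which is the doubling cost objected to in the next message.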

niek 09:11:20

I'm afraid the arraycopy implies that I would need enough memory to store the whole structure twice. All I need is to know the count of the lazy seq before it's realized, create an empty array of that size, and fill it up on the fly
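When the size is known up front, the fill step can be a plain loop/recur, which does not retain the head (a sketch, with a made-up function name):

```clojure
;; Sketch: given a known size n, allocate once and fill on the fly.
;; loop/recur rebinds the seq local at each step, so the consumed part
;; of the (possibly infinite) input can be GC'd as we go.
(defn fill-ints [^long n xs]
  (let [arr (int-array n)]
    (loop [i 0, s (seq xs)]
      (if (and (< i n) s)
        (do (aset arr (int i) (int (first s)))
            (recur (inc i) (next s)))
        arr))))

(def filled (fill-ints 1000 (map #(* % %) (range))))
;; (aget filled 3) → 9
```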

val_waeselynck 09:11:07

@U5RGDD8C9 oh yeah right, in this case it's even easier

niek 09:11:04

Do you know how to do a count without keeping the head? See also https://groups.google.com/forum/#!topic/clojure/QbyD2-PMzHQ where "(reduce (fn [x _] (inc x)) 0 the-seq)" is suggested (very old post though :-))

niek 09:11:25

haven't tried it out yet, but I don't see how that's not holding the head

val_waeselynck 10:11:56

@U5RGDD8C9 google 'Clojure locals clearing'. clojure.core/count also works that way, so it should not retain the head of a lazy sequence (verify it at the REPL with `(count (repeatedly 100000 #(int-array 100000)))`)
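The reduce version from that mailing-list post behaves the same way (a scaled-down sketch; the sizes are shrunk from the REPL check above so it runs quickly):

```clojure
;; The reduce-based count: like clojure.core/count, locals clearing
;; means the already-walked portion of the seq can be GC'd, so the
;; head is not retained even though the seq is fully realized.
(defn count-seq [xs]
  (reduce (fn [n _] (inc n)) 0 xs))

;; 1000 chunks of ~40KB each; completing without OOM on a small heap
;; is the signal that nothing holds the head.
(def cnt (count-seq (repeatedly 1000 #(int-array 10000))))
;; cnt → 1000
```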

niek 10:11:01

Interesting! But thinking about it, even if count clears the locals, it will most probably have to realize the computations, which will then happen again while filling up the array

niek 10:11:00

(btw the into-array implementation doesn't even use (count my-lazy-seq), it uses length(seq))

joachim 08:11:56

hi guys. We did implement a solution based on core.async, but I forgot about the bug. @U5RGDD8C9, here's a fixed version of onto-chan:

joachim 08:11:59

(defn onto-chan
  "Patched version of core.async/onto-chan, see "
  ;; assumes the alias (require '[clojure.core.async :as as])
  ([ch coll] (onto-chan ch coll true))
  ([ch coll close?]
   ;; the ^:once fn* thunk lets the compiler clear the closed-over coll
   ;; after the first call, so the go-loop never retains the head
   (let [coll-fn (^:once fn* [] coll)]
     (as/go-loop [vs (seq (coll-fn))]
       (if (and vs (as/>! ch (first vs)))
         (recur (next vs))
         (when close?
           (as/close! ch)))))))

joachim 08:11:16

furthermore, you don't need to count the sequence, as you can calculate the size in advance (see the n-pairs fn in my solution code)

joachim 08:11:42

anyway, thanks a lot for the suggestions @U06GS6P1N!