#planck
2018-01-07
nooga14:01:29

I’m trying to use planck for processing a huuge file, I thought it might be cool to make it stream-based and do ./blah.cljs < file > output instead of reading the file, processing and then writing

nooga14:01:05

I’m trying to decipher docs but I don’t understand how to get stdin stream and stdout stream

mfikes14:01:17

@nooga If your processing is textual and line based, planck.core/read-line might be useful

mfikes14:01:37

An interesting thing that may easily occur when processing an absolutely huge file this way is head-holding.

mfikes14:01:08

My immediate thought on that issue is to try to build something that reduces on (iterate (fn [_] (planck.core/read-line)) nil)

nooga14:01:12

I’ve got ~300MB of stanzas like: AA=12345678 BBA=12345678 CCC=12345678 and I basically need to make it so that they end up as 12345678 12345678 12345678 in separate lines

mfikes14:01:20

Ahh, that's cool, perhaps a partition transducer could help get the pairs of lines.
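To make the pairing idea concrete, here is a minimal sketch (the sample lines are made up, not taken from nooga’s file):

```clojure
;; partition-all 2 as a transducer groups a stream of lines into
;; pairs without realizing the whole input at once.
(into []
      (partition-all 2)
      ["AA=12345678" "BBA=12345678" "CCC=12345678" "AA=12345678"])
;; => [["AA=12345678" "BBA=12345678"] ["CCC=12345678" "AA=12345678"]]
```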

mfikes14:01:11

Also, to really go the transducer route, you'd need the reducible iterate that is in ClojureScript head, which isn't yet in the shipping Planck. (It is easily built, though, via script/pilot in the Planck source tree.)

mfikes14:01:01

@nooga The reason I mention head-holding is that if blah.cljs looked like

(require '[planck.core :refer [line-seq *in*]])

(run! println (partition 2 (line-seq *in*)))
Then it would print the pairs of lines, and arguably be a clean streaming solution. But it will still hold all lines in memory, if that's a concern.

nooga14:01:34

it may be since these files are huuge 😉

mfikes14:01:55

And in that case, the new iterate may be self-hosted's friend 🙂

nooga14:01:27

I’m writing an OpenRISC emulator in Java to have Linux running inside the JVM, and my main method of debugging is comparing CPU state logs from my emu and OpenRISC QEMU

mfikes14:01:32

300 MB should easily fit in RAM. The transducer approach is fun to mess around with though.

nooga14:01:52

yeah, got 16GB of ram here but somehow this feels dirty 😄

nooga14:01:42

I tried sed but it drove me crazy

mfikes14:01:10

I agree. The only reason ClojureScript doesn't clear locals is because there hasn't been much demand for it. Maybe if self-hosted ClojureScript becomes popular, that could cause some demand. In the meantime, I've been exploring the "reducible" route, if that makes sense. In other words, you could transduce on the sequence produced by iterate without consuming RAM. The only dirty thing about that approach for this problem is that you'd need to write to stdout as a side effect of the reduction 😞

mfikes15:01:14

@nooga I'm checking to see if this doesn't consume RAM:

(require '[planck.core :refer [read-line]])

(transduce (comp (drop 1)
                 (take-while some?)
                 (partition-all 2))
  (fn ([r] r) ([_ x] (println x)))
  nil
  (iterate (fn [_] (read-line)) nil))

nooga15:01:54

cool, I settled for a simple loop

nooga15:01:02

and it did the job… slowly
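(For the record, a loop of that shape might look roughly like this; a sketch using planck.core/read-line, not nooga’s actual script:)

```clojure
(require '[planck.core :refer [read-line]])

;; Consume stdin two lines at a time; read-line returns nil at EOF,
;; so only the current pair of lines is ever held in memory.
(loop []
  (let [a (read-line)
        b (read-line)]
    (when (and a b)
      (println a b)
      (recur))))
```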

mfikes15:01:51

Cool. FWIW, Planck also has -s, -f, and -O simple as ways to try to make things run faster.
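(Assuming those are the short forms of Planck's --static-fns, --fn-invoke-direct, and --optimizations flags, the earlier one-liner might become:)

```shell
# -s: static fn dispatch, -f: direct fn invocation,
# -O simple: simple Closure-compiler optimizations
planck -s -f -O simple blah.cljs < file > output
```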

nooga15:01:11

nice! didn’t know that

nooga15:01:39

ah, I converted the files and tried to use them but now I see that they’re rubbish :F

nooga15:01:02

debugging linux kernel on a CPU that you wrote is no fun

nooga15:01:05

esp after writing mostly clojure and functional langs for last 3 years

mfikes15:01:52

Well, FWIW, the transducer approach using iterate (with ClojureScript master) doesn't consume RAM

nooga15:01:27

that’s awesome!

nooga15:01:42

thanks for checking it out 🙂

mfikes21:01:15

On Planck master, line-seq is directly reducible. This allows reducing over gigantic files without consuming RAM, avoiding ClojureScript head-holding. This example is over a 1 GB file.

cljs.user=> (require '[planck.core :refer [line-seq]]
       #_=>  '[planck.io :as io]
       #_=>  '[clojure.string :as string])
nil
cljs.user=> (reduce
       #_=>  (fn [c line]
       #_=>   (cond-> c
       #_=>    (string/starts-with? line "a") inc))
       #_=>  0
       #_=>  (line-seq (io/reader "big.txt")))
134217728