#clojure
2021-06-27
ribelo00:06:48

Is there any possibility to use protocol in macro during macroexpansion time?

No implementation of method: :-patterns of protocol: #'ribelo.munich/IMulti
   found for class: clojure.lang.Symbol

dpsutton00:06:59

Sure. Just remember when you have clojure forms versus runtime data. Emit code that does the right thing at runtime. You most likely are just dealing with sequences of symbols which is what you are seeing in your error message
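A minimal sketch of the distinction described above, assuming a hypothetical protocol `IPatterns` with a `-patterns` method (the real `ribelo.munich/IMulti` is not shown in the log, so these names are stand-ins). Calling the protocol inside the macro body dispatches on the unevaluated symbol; emitting the call defers dispatch to runtime.

```clojure
;; Hypothetical protocol standing in for the one in the error message.
(defprotocol IPatterns
  (-patterns [this]))

(extend-protocol IPatterns
  clojure.lang.IPersistentVector
  (-patterns [v] (mapv str v)))

;; Calls the protocol at expansion time: `x` is the form the caller wrote,
;; so passing a symbol dispatches on clojure.lang.Symbol and fails.
(defmacro patterns-at-expansion [x]
  (-patterns x))

;; Emits a call instead: dispatch happens at runtime on the evaluated value.
(defmacro patterns-at-runtime [x]
  `(-patterns ~x))

(def data [:a :b])

;; Works, because `data` is evaluated to a vector before dispatch.
(def runtime-result (patterns-at-runtime data))

;; Fails during macroexpansion, because the macro receives a symbol.
(def expansion-result
  (try
    (macroexpand `(patterns-at-expansion data))
    (catch Throwable _ :failed-at-expansion)))
```

If the protocol really must run at expansion time, the macro arguments have to be literal data of a type the protocol is extended to, not symbols that refer to runtime values.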

Nazral06:06:45

I have a (very large) number of gzipped files that contain one EDN form per line, so I made these two functions to process them:

(defn read-gzipped
  [fname]
  (with-open [in (java.util.zip.GZIPInputStream.
                  (io/input-stream fname))]
    (slurp in)))

(defn read-edn-per-line
  [in f]
  (->> in
       str/split-lines
       (map (comp f read-string))))
I would expect these functions to parallelize well, so that I could pmap (or upmap when using claypoole) over the list of files. However, there is no difference in time whether I use pmap or map, and I'm not sure why. Am I missing something?

dominicm06:06:32

@archibald.pontier_clo could it be that your producer is slower than your reader? i.e. doing the gunzip and slurp is slower than read-string?

Nazral06:06:48

I am not sure, but how would that stop calling read-gzipped + read-edn-per-line from having the same speed in a map and in a pmap? Because even if read-gzipped is slow, I should be able to read multiple files at once no?

dominicm06:06:25

@archibald.pontier_clo Perhaps I misunderstood where your pmap was. I thought it was in place of the map at the end of read-edn-per-line, not that you were pmap'ing your list of files. The other option is that it's so fast the overhead of the parallelism makes it the same speed. pmap does have some footguns due to laziness, and I'm not sure if those might apply here; it depends on how you consume the sequence afterwards.

seancorfield06:06:34

@archibald.pontier_clo Are you sure you are measuring the complete result? pmap is semi-lazy so unless you are forcing the whole result you may not be getting accurate times?
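A small sketch of the measuring pitfall mentioned above. It uses plain `map` to keep the example short; `pmap` also returns a lazy sequence, so the same caveat applies. The helper names `slow-inc` and `elapsed-ms` are made up for illustration.

```clojure
;; Hypothetical slow function standing in for per-file work.
(defn slow-inc [x]
  (Thread/sleep 50)
  (inc x))

;; Runs the thunk and returns the elapsed wall-clock time in milliseconds.
(defn elapsed-ms [thunk]
  (let [start (System/nanoTime)]
    (thunk)
    (/ (- (System/nanoTime) start) 1e6)))

;; Only builds the lazy sequence; no element is computed.
(def unforced-ms (elapsed-ms #(map slow-inc [1 2 3 4])))

;; doall realizes every element, so the sleeps are actually measured.
(def forced-ms (elapsed-ms #(doall (map slow-inc [1 2 3 4]))))
```

Timing the unforced version only measures how long it takes to create the sequence object, which is why forcing with `doall` (as in the code later in this thread) matters for a fair comparison.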

Nazral06:06:06

@seancorfield I do a pmap followed by a mapcat and doall (last call), that should be fine no? @dominicm reading one file takes 20s+ so I don't think the overhead plays a role there

dominicm06:06:23

@archibald.pontier_clo to confirm, your full code is (pmap #(read-edn-per-line (read-gzipped %) %) ["file-1" "file-2"])?

Nazral06:06:59

(->> selected-days
         (pmap
          (fn [f]
            (-> (str f "/" ticker ".txt.gz")
                utils/read-gzipped
                ;;(utils/read-edn-per-line parse-line)
                )))
         (mapcat identity)
         doall)

Nazral06:06:16

I removed read-edn-per-line for the moment

dominicm06:06:56

How many selected-days are we talking here?

Nazral06:06:16

10 for the moment (I'm testing on a small subset of files for the moment)

hiredman06:06:47

Slurp+read-string is generally horrendous, use read

👍 6
dominicm06:06:49

@hiredman I think it's newline-separated files, so it would be a map over .readLine (which isn't there on InputStreams).

Nazral06:06:12

Yes, one edn per line

hiredman06:06:25

Read will handle that fine

dominicm06:06:38

You're right, just needs repeated calls to read.

hiredman06:06:39

That is more or less what a clojure source file is

dominicm06:06:40

My bad 🙂

hiredman06:06:11

Pmap entangles a lot of things so it is tricky to understand. Pmap limits its parallelism to the number of cores the java runtime reports

Nazral06:06:33

I need to convert the gzip stream to a stream that read understands though

hiredman06:06:04

Yes, java.io.PushbackReader

hiredman06:06:50

You may need to wrap in a reader first via clojure.java.io/reader

Nazral06:06:07

class java.io.BufferedReader cannot be cast to class
   java.io.PushbackReader (java.io.BufferedReader and
   java.io.PushbackReader are in module java.base of loader
   'bootstrap')

Nazral06:06:23

I found a previous slack thread on that topic, doesn't seem straightforward but I'll figure it out
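For reference, the wrapping hiredman describes could look roughly like the sketch below. The cast error above typically appears when a `BufferedReader` is passed straight to `read`; constructing a `PushbackReader` around it avoids that. This sketch swaps in `clojure.edn/read` since the files are EDN, and `read-gzipped-edn` is a hypothetical name, not something from the thread.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.edn :as edn])
(import '(java.io PushbackReader)
        '(java.util.zip GZIPInputStream))

(defn read-gzipped-edn
  "Reads every EDN form from the gzipped file `fname` and returns a
  vector of (f form). Results are built eagerly so that nothing lazy
  escapes the with-open scope."
  [fname f]
  (with-open [rdr (-> (io/input-stream fname)
                      GZIPInputStream.   ; decompress
                      io/reader          ; bytes -> chars (BufferedReader)
                      PushbackReader.)]  ; what the EDN reader requires
    (let [eof (Object.)]                 ; unique end-of-file sentinel
      (loop [acc []]
        (let [form (edn/read {:eof eof} rdr)]
          (if (identical? form eof)
            acc
            (recur (conj acc (f form)))))))))
```

This reads forms one at a time from the stream instead of slurping the whole file into a string and splitting it, which is the main point of the suggestion to use `read`.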

dominicm06:06:20

That's where the core limiter is, I knew there must be one around there somewhere. pmap is an interesting beast 😛

dominicm06:06:34

There's also a lot of environment involved here: if you've only got a couple of cores (I only have 4, for example), then you're not going to get loads of parallelism here. Although I am surprised you're seeing absolutely no speedup. I'd expect it to be less than 200s.

Nazral06:06:04

some, but nothing

dominicm06:06:24

@archibald.pontier_clo For comparison, how long does this take? (time (doall (pmap #(Thread/sleep (+ 5000 %)) (range 20)))) That should give some idea of parallelism available to you.

hiredman06:06:19

Again, pmap is tricky. I forget if the specialized range type implements chunking, but chunking does weird things to pmap's attempts to limit parallelism
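Both points can be checked at a REPL. The sketch below assumes the `clojure.core` implementation of `pmap`, which keeps about `(+ 2 available-processors)` futures in flight, and uses `chunked-seq?` to see whether an input sequence is chunked. The var names are made up for illustration.

```clojure
;; Number of processors the JVM reports.
(def available-cores (.availableProcessors (Runtime/getRuntime)))

;; The look-ahead window pmap in clojure.core uses.
(def pmap-window (+ 2 available-cores))

;; Chunked inputs are realized a chunk at a time by the inner map,
;; so more futures than pmap-window may be started at once.
(def range-chunked? (chunked-seq? (seq (range 20))))

;; A plain list is not chunked, so it is a safer input for this kind of test.
(def list-chunked? (chunked-seq? (seq (list 1 2 3))))
```

If `range-chunked?` comes back true on your Clojure version, the `Thread/sleep` test above may show more parallelism than `pmap` would give on an unchunked input.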

dominicm06:06:23

That's true again. But it at least indicates there're enough cores around to be making use of this parallelism.

hiredman06:06:56

If you use an ExecutorService and an ExecutorCompletionService instead of pmap, you have a lot more visibility and control.
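A minimal sketch of that suggestion using plain Java interop. The function name `process-all` and its arguments are hypothetical; the pool size is explicit, which is the extra control being referred to.

```clojure
(import '(java.util.concurrent Executors ExecutorCompletionService))

(defn process-all
  "Runs (f item) for every item on a fixed pool of n-threads and returns
  the results in completion order (not input order)."
  [n-threads f items]
  (let [items (vec items)
        pool  (Executors/newFixedThreadPool n-threads)
        ecs   (ExecutorCompletionService. pool)]
    (try
      ;; Submit all tasks up front; the pool bounds the parallelism.
      (doseq [item items]
        (.submit ecs (fn [] (f item))))
      ;; .take blocks until the next task finishes, .get returns its value.
      (vec (repeatedly (count items) #(.get (.take ecs))))
      (finally
        (.shutdown pool)))))

;; Example: (process-all 4 slurp-one-file file-names), where slurp-one-file
;; is whatever per-file function you already have.
```

Because results arrive in completion order, pair each result with its input inside `f` if the original ordering matters.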

Nazral06:06:38

Ok thank you I'll look into it

Nazral07:06:07

Isn't this what is under the hood in the claypoole library?

hiredman06:06:35

Your process may just be io bound, such that your io requests are queuing sequentially somewhere else (os kernel, disk driver, etc), so any parallelism in dispatching the requests doesn't result in faster processing

dominicm06:06:02

I was trying to figure out which profiler or debugging tool would give insight into this, and I wasn't sure.

Nazral06:06:09

that might be it

Nazral07:06:14

Out of spite I ran that code on my prod server (significantly more powerful / better ssd than my laptop), and there pmap gives a very nice speed boost

Nazral09:06:23

And thanks for the help! :hugging_face:

honza10:06:34

Is lein still the best build system?

borkdude10:06:40

@honza "best" is subjective, but it's the most complete solution. there is also now deps.edn which is more "decomplected": it does less and you can build tooling around this (which people have done and more to come)

❤️ 9
vemv11:06:14

Lein has a lot of power (from its existing ecosystem to its unquoting and middleware systems), but deps.edn has shown the right way for a number of things (single JVM per task, first-class git dependencies, composable aliases). All those are technically possible in Lein but not the default... In an ideal world Lein would pick up some insights or even implementation details from deps.edn. In practice it would be quite a lot of work, like so many things in OSS
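To make the git dependency and alias points concrete, here is a hedged sketch of a deps.edn. The library coordinates, URL, and sha are placeholders, not a real project.

```edn
;; deps.edn for a hypothetical project
{:paths ["src"]
 :deps {org.clojure/clojure {:mvn/version "1.10.3"}
        ;; git dependency: placeholder coordinates, fill in a real commit sha
        io.github.example/some-lib {:git/url "https://github.com/example/some-lib"
                                    :git/sha "<full-commit-sha>"}}
 :aliases
 {;; aliases only add to the base configuration, so they compose
  :dev  {:extra-paths ["dev"]}
  :test {:extra-paths ["test"]}}}
```

Aliases are combined on the command line, for example `clj -A:dev:test` starts a REPL with both extra paths on the classpath.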