This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2021-06-27
Is it possible to use a protocol in a macro at macroexpansion time?
No implementation of method: :-patterns of protocol: #'ribelo.munich/IMulti
found for class: clojure.lang.Symbol
Sure. Just remember when you are dealing with Clojure forms versus runtime data, and emit code that does the right thing at runtime. At macroexpansion time you are most likely just dealing with sequences of symbols, which is what you are seeing in your error message.
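A minimal sketch of that distinction, assuming a hypothetical protocol IDescribe (it is not the IMulti protocol from the question). The first macro calls the protocol while expanding, so it dispatches on the unevaluated form, which is a symbol. The second emits a call, so dispatch happens at runtime on the evaluated value.

```clojure
;; Hypothetical protocol, used only for illustration.
(defprotocol IDescribe
  (describe [x]))

(extend-protocol IDescribe
  clojure.lang.Symbol
  (describe [x] (str "symbol " x))
  java.lang.Long
  (describe [x] (str "long " x)))

;; Calls the protocol during macroexpansion: x is the unevaluated form.
(defmacro describe-form [x]
  (describe x))

;; Emits a call to the protocol: x is evaluated first, at runtime.
(defmacro describe-value [x]
  `(describe ~x))

(def n 42)

(describe-form n)  ; dispatches on the symbol n
(describe-value n) ; dispatches on the value bound to n
```

If the protocol had no implementation for clojure.lang.Symbol, describe-form would fail with an error like the one quoted above.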
I have a (very large) number of gzipped files that contain one edn form per line, so I made these two functions to process them:
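(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Decompress a gzipped file and return its entire contents as one string.
(defn read-gzipped
  [fname]
  (with-open [in (java.util.zip.GZIPInputStream.
                  (io/input-stream fname))]
    (slurp in)))

;; Split the text into lines, read each line as a form, and lazily apply f to it.
(defn read-edn-per-line
  [in f]
  (->> in
       str/split-lines
       (map (comp f read-string))))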
I would expect these functions to parallelize well and to be able to do pmap (or upmap when using claypoole) over the list of files, however there is no difference in time whether I use pmap or map, not sure why, am I missing something?
@archibald.pontier_clo could it be that your producer is slower than your reader? i.e. doing the gunzip and slurp is slower than read-string?
I am not sure, but how would that make calling read-gzipped + read-edn-per-line take the same time in a map and in a pmap? Even if read-gzipped is slow, I should be able to read multiple files at once, no?
@archibald.pontier_clo Perhaps I misunderstood where your pmap was. I thought it was in place of the map at the end of read-edn-per-line, not that you were pmap'ing your list of files. The other option is that it's so fast the overhead of the parallelism makes it the same speed. pmap does have some footguns due to laziness, and I'm not sure if those might apply here; it depends on how you consume the sequence afterwards.
@archibald.pontier_clo Are you sure you are measuring the complete result? pmap is semi-lazy, so unless you are forcing the whole result you may not be getting accurate times.
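A small sketch of the measurement pitfall, assuming a hypothetical slow-inc function as a stand-in for the per-file work. pmap hands back a lazy sequence without waiting for the results, so only the forced version measures the full job.

```clojure
(defn slow-inc
  "Hypothetical stand-in for an expensive per-file job."
  [x]
  (Thread/sleep 200)
  (inc x))

(defn elapsed-ms
  "Calls the zero-argument function thunk and returns the wall-clock time in ms."
  [thunk]
  (let [start (System/nanoTime)]
    (thunk)
    (/ (- (System/nanoTime) start) 1e6)))

;; Returns without waiting for the results, so this time says
;; little about how long the actual work takes.
(def unforced-ms (elapsed-ms #(pmap slow-inc (range 8))))

;; doall walks the whole sequence, so every result is awaited.
(def forced-ms (elapsed-ms #(doall (pmap slow-inc (range 8)))))
```

Comparing unforced-ms with forced-ms at the REPL shows how much of the work a timing without doall would miss.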
@seancorfield I do a pmap followed by a mapcat and doall (last call), that should be fine, no?
@dominicm reading one file takes 20s+ so I don't think the overhead plays a role there
@archibald.pontier_clo to confirm, your full code is (pmap #(read-edn-per-line (read-gzipped %) %) ["file-1" "file-2"])?
(->> selected-days
     (pmap
      (fn [f]
        (-> (str f "/" ticker ".txt.gz")
            utils/read-gzipped
            ;;(utils/read-edn-per-line parse-line)
            )))
     (mapcat identity)
     doall)
@hiredman I think it's newline-separated files, so it would be a map over .readLine (which isn't there on InputStreams).
pmap entangles a lot of things, so it is tricky to understand. pmap limits its parallelism to the number of cores the Java runtime reports, plus two.
You may need to wrap it in a reader first via clojure.java.io/reader
class java.io.BufferedReader cannot be cast to class
java.io.PushbackReader (java.io.BufferedReader and
java.io.PushbackReader are in module java.base of loader
'bootstrap')
I found a previous slack thread on that topic, doesn't seem straightforward but I'll figure it out
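One possible way around the cast error, sketched under the assumption that every line holds exactly one edn value. clojure.edn/read needs a java.io.PushbackReader, whereas io/reader returns a BufferedReader; reading line by line with line-seq and parsing each line with clojure.edn/read-string avoids the PushbackReader altogether. The function name process-gzipped-edn-lines is made up for this example.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])
(import '(java.util.zip GZIPInputStream))

(defn process-gzipped-edn-lines
  "Applies f to the edn value on each line of the gzipped file fname and
  returns a vector of the results. mapv is eager, so all lines are read
  before with-open closes the reader."
  [fname f]
  (with-open [rdr (io/reader (GZIPInputStream. (io/input-stream fname)))]
    (mapv (comp f edn/read-string) (line-seq rdr))))
```

Unlike slurp followed by split-lines, this never holds the whole decompressed file as a single string, which may matter for very large files.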
That's where the core limiter is, I knew there must be one around there somewhere. pmap is an interesting beast 😛
There's also a lot of environment involved here: if you've only got a couple of cores (I only have 4, for example), then you're not going to get loads of parallelism. Although I am surprised you're seeing absolutely no speedup. I'd expect it to be less than 200s.
@archibald.pontier_clo For comparison, how long does this take? (time (doall (pmap #(Thread/sleep (+ 5000 %)) (range 20))))
That should give some idea of parallelism available to you.
Again, pmap is tricky. I forget if the specialized range type implements chunking, but chunking does weird things to pmap's attempts to limit parallelism.
That's true again. But it at least indicates whether there are enough cores around to make use of this parallelism.
If you use an ExecutorService and an ExecutorCompletionService instead of pmap, you have a lot more visibility and control.
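A rough sketch of that approach, assuming a fixed-size thread pool whose size you choose yourself. The function name process-all and its parameters are invented for this example.

```clojure
(import '(java.util.concurrent Executors ExecutorCompletionService Callable))

(defn process-all
  "Runs (f item) for every item on a fixed pool of n-threads threads.
  Returns a vector of results in completion order, not input order."
  [n-threads f items]
  (let [pool (Executors/newFixedThreadPool n-threads)
        ecs  (ExecutorCompletionService. pool)]
    (try
      ;; Submit every job up front; the pool size caps the parallelism.
      (doseq [item items]
        (.submit ecs ^Callable (fn [] (f item))))
      ;; take blocks until the next job finishes; get rethrows job failures.
      (vec (repeatedly (count items) #(.get (.take ecs))))
      (finally
        (.shutdown pool)))))
```

For the files in this thread, a call might look like (process-all 4 utils/read-gzipped file-names), where file-names is a hypothetical collection of paths. The explicit pool size makes the degree of parallelism visible, rather than leaving it to pmap's internal limit.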
Your process may just be IO bound: your IO requests may be queuing sequentially somewhere else (OS kernel, disk driver, etc.), so any parallelism in dispatching the requests doesn't result in faster processing.
I was trying to figure out which profiler or debugging tool would give insight into this, and I wasn't sure.
Out of spite I ran that code on my prod server (significantly more powerful / better ssd than my laptop), and there pmap gives a very nice speed boost
@honza "best" is subjective, but it's the most complete solution. There is also now deps.edn, which is more "decomplected": it does less and you can build tooling around it (which people have done, with more to come)
Lein has a lot of power (from its existing ecosystem to its unquoting and middleware systems), but deps.edn has shown the right way for a number of things (single JVM per task, first-class git dependencies, composable aliases). All of those are technically possible in Lein but not the default... In an ideal world Lein would pick up some insights or even implementation details from deps.edn. In practice it would be quite a lot of work, like so many things in OSS.