
I stumbled on this the other day: (there is a fresh HN discussion on it here: ) and so tried out what I considered my simple little solution:

(defn word-freqs [text]
  (let [data (slurp text)]
    (->> (clojure.string/split data #"\n|\W+")
         (map clojure.string/lower-case)
         frequencies
         (sort-by val >))))
(defn print-word-freqs [pairs]                                                                   
  (doseq [pair pairs]                                                                            
    (println pair)))
To recreate the test input I grabbed the text from: and ran for i in {1..10}; do cat kjvbible.txt >> bible-10.txt; done on it, as he mentioned, to get the correct 43 MB input. My computer seems to be a little slower than his, as the solution took 4.354 seconds on mine vs. his 3.872 seconds. But my Clojure solution takes over twice as long, at about 9.858 seconds. I used criterium (for the first time, so just the bench function), so I think that takes JVM startup time out of the picture.

How would you folks approach this in an idiomatic way? I assume the big difference is using slurp vs. some kind of buffered input, but I tried using and I just couldn't get that to work. Any insights into using would be greatly appreciated. Of course I might also be doing something completely different from his solution, which could be causing a slowdown. I would also love to see a solution using transducers if you've got it.
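For concreteness, the buffered approach being asked about might look roughly like this (a sketch, not benchmarked; word-freqs-buffered is an illustrative name):

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Read the file line by line through a buffered reader instead of
;; slurping the whole 43 MB file into one string. frequencies is eager,
;; so the seq is fully consumed before with-open closes the reader.
(defn word-freqs-buffered [path]
  (with-open [rdr (io/reader path)]
    (->> (line-seq rdr)
         (mapcat #(str/split % #"\W+"))
         (remove str/blank?)
         (map str/lower-case)
         frequencies
         (sort-by val >))))
```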


If you're interested in speeding things up, I highly recommend clj-async-profiler to find the slow parts:


I'll play around with this now. Part of the exercise for me was to check out the benchmarking/profiling tools. I would still like to see how others approach this, as I'm sure there is plenty of low-hanging fruit for performance gains. I'm looking for a guide on as opposed to slurp-ing a 43 MB file (unless that is efficient?)


I have a transducer version that uses line-seq, but performance wasn't a goal, so I don't claim it's efficient. It might be a starting point for comparison though.

(require '[clojure.java.io :as io])

(def line-counts
  (with-open [is (io/input-stream fname)]
    (let [is (if (clojure.string/ends-with? fname ".gz")
               ;; assuming GZIP decompression was intended for .gz files
               (java.util.zip.GZIPInputStream. is)
               is)
          rdr (io/reader is)
          ret (into []
                    (comp (map count))
                    (line-seq rdr))]
      ret)))


If I was interested in performance, then I would definitely run it with a profiler and look at the flame graph to see where most of the time is being spent.


Ok, cool, I'll try and wrap my head around this. I'm getting permission-denied errors when trying to run some of the required config commands for that profiler you sent me. Trying to work around that now.


I think my issue was trying to use the line-seq rdr part while continuously tallying the frequencies as the lines were processed. I couldn't get that worked out.
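One way to tally continuously while streaming is a plain reduce over line-seq, updating the count map word by word (a sketch; streaming-freqs is an illustrative name):

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Tally word counts incrementally while streaming lines, so the whole
;; file is never held in memory at once. The outer reduce walks lines,
;; the inner reduce walks the words within each line.
(defn streaming-freqs [path]
  (with-open [rdr (io/reader path)]
    (reduce (fn [counts line]
              (reduce (fn [m w]
                        (if (str/blank? w)
                          m
                          (update m (str/lower-case w) (fnil inc 0))))
                      counts
                      (str/split line #"\W+")))
            {}
            (line-seq rdr))))
```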


I guess since you're splitting on words, I might start with java.util.Scanner rather than line-seq.


Nice. That seems to be what his Go solution used as well. I consider lack of knowledge of the Java ecosystem one of my biggest Clojure holes, and it seems IO is full of them.


yea, I'm trying to give pointers without just writing a solution, but there is some amount of just Java arcana


It's also been long enough since I learned Java that I'm not totally sure Scanner would be the recommended option if efficiency is a priority.


Yeah, the first thing that popped up in a search mentioned it wasn't too efficient, haha. I'm thinking a "simple" idiomatic Clojure solution should at least be on par with the middle-ground simple solutions presented in the article, which take about 2.5-4 seconds.


here's at least a basic skeleton for using transducers with Scanner:

(import 'java.util.Scanner)
(require '[clojure.java.io :as io])

(defn scan [xform f init scanner]
  (let [f (xform f)]
    (loop [result init]
      (if (.hasNext ^Scanner scanner)
        (recur (f result (.next ^Scanner scanner)))
        result))))

(with-open [is (io/input-stream fname)]
  (scan (comp (map #(.length ^String %)))
        +
        0
        (Scanner. is)))


It currently just sums up the lengths of each word


to do word counts, you would need to come up with a different reducing function f, an initial value init, and potentially a different xform
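For example, a word-count instantiation might use a map-updating reducing function, an empty map as init, and lower-casing in the xform (a sketch; the scan helper from above is reproduced so the snippet stands alone, and scanner-word-freqs is an illustrative name):

```clojure
(import 'java.util.Scanner)
(require '[clojure.java.io :as io])

;; The scan helper: drive a transducer-wrapped reducing function over
;; the tokens produced by a java.util.Scanner.
(defn scan [xform f init ^Scanner scanner]
  (let [f (xform f)]
    (loop [result init]
      (if (.hasNext scanner)
        (recur (f result (.next scanner)))
        result))))

;; Word counts: lower-case each token in the xform, tally into a map
;; with (fnil inc 0) so missing keys start at zero.
(defn scanner-word-freqs [path]
  (with-open [is (io/input-stream path)]
    (scan (map #(.toLowerCase ^String %))
          (fn [counts word] (update counts word (fnil inc 0)))
          {}
          (Scanner. is))))
```

Note that Scanner's default delimiter is whitespace, so punctuation stays attached to words, unlike the \W+ split in the original solution.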


Yeah, I'll play around with it. They have to all be lowercased too but not sure that will add much overhead


I appreciate the help!


yea, that's a good thing to put in the xform


I won't be able to use that cool profiler, btw. I'm on a Pixelbook running a Linux VM, and I guess the required kernel commands for the profiler aren't allowed for the host OS on my system.


but I think it's really difficult to write efficient code without a profiler. It's like trying to dig a hole without a shovel.


makes sense to me. I got that profiler working so I'll play around with it.

Cora (she/her) 15:07:22

this might be helpful for optimizing the io

Ben Sless 15:07:37

IIRC the most relevant conclusion for this use case is to not use a pushback reader


@U9J50BY4C see anything interesting in the profiling?

Ben Sless 15:07:06

If you want to work really hard, you can do this with arrays and cut down dramatically on allocation


What does "push back" reader mean?

Ben Sless 16:07:15

that's a Java type that has an unread method
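A tiny illustration of the unread method, using java.io.PushbackReader directly:

```clojure
(import 'java.io.PushbackReader
        'java.io.StringReader)

;; PushbackReader lets you peek at a character and push it back, so
;; the next read sees the same character again.
(let [rdr (PushbackReader. (StringReader. "abc"))]
  (let [c (.read rdr)]
    (.unread rdr c))  ; put \a back into the stream
  (char (.read rdr))) ; reads \a again
;; => \a
```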


@U7RJTCH6J still working on grokking VisualVM. I would probably still need help with what it's actually showing me. This is why I'm ready to stop the solo self-learning and try to get in as a junior developer, or at least get involved in an open source project and be mentored by senior devs. I'm not progressing like I want by myself, but I digress