Fork me on GitHub
#code-reviews
<
2022-07-25
>
Chase21:07:09

I stumbled on this the other day: https://benhoyt.com/writings/count-words/ (there is a fresh HN discussion on it here: https://news.ycombinator.com/item?id=32214419) and so tried out what I considered my little simple solution:

(defn word-freqs [text]                                                                          
  (let [data (slurp text)]                                                                      
    (->> (clojure.string/split data #"\n|\W+")                                                   
         (map #(clojure.string/lower-case %))                                                    
         (frequencies)                                                                           
         (sort-by val >))))                                                                      
                                                                                                 
(defn print-word-freqs [pairs]                                                                   
  (doseq [pair pairs]                                                                            
    (println pair)))
To recreate the test input I grabbed the text from: https://github.com/benhoyt/countwords/blob/master/kjvbible.txt and ran for i in {1..10};do cat kjvbible.txt >> bible-10.txt; done on it as he mentioned to get the correct 43mb input. My computer seems to be a little slower than his as the simple.py solution took 4.354 seconds on mine vs his 3.872 seconds. But my Clojure solution takes over twice as long at about 9.858 seconds. I used criterium (for the first time so just using the bench function) so I think that takes out any JVM startup time. How would you folks approach this in an idiomatic way? I assume the big difference is using slurp vs. some kind of buffered input but I tried using and I just couldn't get that to work. Any insights into using would be greatly appreciated. Of course I might also be doing something completely different than his solution too which could be causing a slow down. I would also love to see a solution using transducers if you got it.

phronmophobic21:07:29

If you're interested in speeding things up, I highly recommend clj-async-profiler to find the slow parts: • https://github.com/clojure-goes-fast/clj-async-profilerhttp://clojure-goes-fast.com/blog/profiling-tool-async-profiler/

Chase22:07:13

I'll play around with this now. Part of the exercise for me was to check out the benchmarking/profiling tools. I would still like to see how others approach this as I'm sure there is so much low hanging fruit for performance gains. I'm looking for a guide on as opposed to slurp ing a 43 mb file (unless that is efficient?)

phronmophobic22:07:14

I have a transducer version that uses line-seq , but performance wasn't a goal, so I don't claim it's efficient. It might be a starting point for comparison though.

(def line-counts
    (with-open [is (io/input-stream fname)]
      (let [is (if (clojure.string/ends-with? fname ".gz")
                 (java.util.zip.GZIPInputStream. is)
                 is)
            rdr (io/reader is)
            ret (into
                 []
                 (comp (map count))
                 (line-seq rdr))]
        ret)))

phronmophobic22:07:12

If I was interested in performance, then I would definitely run it with a profiler and look at the flame graph to see where most of the time is being spent.

Chase22:07:44

Ok, cool, I'll try and wrap my head around this. I'm getting permission denied errors when trying to run some required config commands on that profiler you sent me. Trying to work around that now.

Chase22:07:27

I think my issue was trying to use this line-seq rdr part but continuously tallying the frequencies as it was processed. I couldn't get that worked out.

phronmophobic22:07:00

I guess since you're splitting on words, I might start with java.util.Scanner rather than line-seq

Chase22:07:21

Nice. That seems to be what his Go solution used as well. I consider lack of knowledge of the java ecosystem one of my biggest clojure holes and it seems IO is full of it.

phronmophobic22:07:17

yea, I'm trying to give pointers without just writing a solution, but there is some amount of just java arcana

phronmophobic23:07:25

It's also been long enough since I've learned java where I'm not totally sure if Scanner would be the recommended option if efficiency is a priority.

Chase23:07:41

Yeah the first thing that popped up in a search mentioned it wasn't too efficient. haha. I'm thinking a "simple" idiomatic Clojure solution should at least be on par with these middle ground simple solutions presented in the article that take about 2.5-4 seconds.

👍 1
phronmophobic23:07:12

here's at least a basic skeleton for using transducers with Scanner:

(import 'java.util.Scanner)

(defn scan [xform f init scanner]
  (let [f (xform f)]
    (loop [result init]
      (if (.hasNext ^Scanner scanner)
        (recur (f result (.next ^Scanner scanner)))
        result))))

(with-open [is (io/input-stream fname)]
  (scan (comp (map #(.length ^String %)))
        +
        0
        (Scanner. is)))

phronmophobic23:07:33

It currently just sums up the lengths of each word

phronmophobic23:07:41

to do word counts, you would need to come up with a different reducing function, f, an initial value, init, and potentially xform

Chase23:07:24

Yeah, I'll play around with it. They have to all be lowercased too but not sure that will add much overhead

Chase23:07:36

I appreciate the help!

phronmophobic23:07:37

yea, that's good thing to put in the xform

Chase23:07:06

I won't be able to use that cool profiler btw. I'm on a pixelbook running a Linux VM but I guess the required kernal commands for the profiler aren't allowed for the host OS on my system.

phronmophobic23:07:46

but I think it's really difficult to write efficient code without a profiler. It's like trying to dig a hole without a shovel.

Chase00:07:36

makes sense to me. I got that profiler working so I'll play around with it.

👍 1
Cora (she/her)15:07:22

this might be helpful for optimizing the io

Ben Sless15:07:37

Iirc the most relevant conclusion to this use case is not using a push back reader

phronmophobic15:07:46

@U9J50BY4C see anything interesting in the profiling?

Ben Sless15:07:06

If you want to work really hard you can so this with arrays and cut down dramatically on allocation

Chase16:07:41

What does "push back" reader mean?

Ben Sless16:07:15

that's a java type that has an unread method

Chase16:07:38

@U7RJTCH6J still working on grokking visualvm. I would probably still need assistance on what it's actually showing me. This is why I'm ready to stop the solo self learning and try to get in as a junior developer or even just get involved in an open source project to be mentored by senior devs. I'm not progressing like I want by myself but I digress