code-reviews

Chase 2022-07-25T21:46:09.248699Z

I stumbled on this the other day: https://benhoyt.com/writings/count-words/ (there is a fresh HN discussion on it here: https://news.ycombinator.com/item?id=32214419) and so tried out what I considered my little simple solution:

(defn word-freqs [text]                                                                          
  (let [data (slurp text)]                                                                      
    (->> (clojure.string/split data #"\n|\W+")                                                   
         (map #(clojure.string/lower-case %))                                                    
         (frequencies)                                                                           
         (sort-by val >))))                                                                      
                                                                                                 
(defn print-word-freqs [pairs]                                                                   
  (doseq [pair pairs]                                                                            
    (println pair)))
To recreate the test input I grabbed the text from: https://github.com/benhoyt/countwords/blob/master/kjvbible.txt and ran for i in {1..10};do cat kjvbible.txt >> bible-10.txt; done on it as he mentioned to get the correct 43mb input. My computer seems to be a little slower than his as the simple.py solution took 4.354 seconds on mine vs his 3.872 seconds. But my Clojure solution takes over twice as long at about 9.858 seconds. I used criterium (for the first time so just using the bench function) so I think that takes out any JVM startup time. How would you folks approach this in an idiomatic way? I assume the big difference is using slurp vs. some kind of buffered input but I tried using and I just couldn't get that to work. Any insights into using would be greatly appreciated. Of course I might also be doing something completely different than his solution too which could be causing a slow down. I would also love to see a solution using transducers if you got it.

Cora (she/her) 2022-07-26T15:00:46.145129Z

https://clojurians.slack.com/archives/C06MAR553/p1648830782525509

🆒 1
Cora (she/her) 2022-07-26T15:01:22.901109Z

this might be helpful for optimizing the io

phronmophobic 2022-07-25T21:52:29.883369Z

If you're interested in speeding things up, I highly recommend clj-async-profiler to find the slow parts: • https://github.com/clojure-goes-fast/clj-async-profilerhttp://clojure-goes-fast.com/blog/profiling-tool-async-profiler/

Chase 2022-07-25T22:25:13.600999Z

I'll play around with this now. Part of the exercise for me was to check out the benchmarking/profiling tools. I would still like to see how others approach this as I'm sure there is so much low hanging fruit for performance gains. I'm looking for a guide on as opposed to slurp ing a 43 mb file (unless that is efficient?)

phronmophobic 2022-07-25T22:28:14.572599Z

I have a transducer version that uses line-seq , but performance wasn't a goal, so I don't claim it's efficient. It might be a starting point for comparison though.

(def line-counts
    (with-open [is (io/input-stream fname)]
      (let [is (if (clojure.string/ends-with? fname ".gz")
                 (java.util.zip.GZIPInputStream. is)
                 is)
            rdr (io/reader is)
            ret (into
                 []
                 (comp (map count))
                 (line-seq rdr))]
        ret)))

phronmophobic 2022-07-25T22:29:12.607629Z

If I was interested in performance, then I would definitely run it with a profiler and look at the flame graph to see where most of the time is being spent.

Chase 2022-07-25T22:31:44.156029Z

Ok, cool, I'll try and wrap my head around this. I'm getting permission denied errors when trying to run some required config commands on that profiler you sent me. Trying to work around that now.

Chase 2022-07-25T22:42:27.727529Z

I think my issue was trying to use this line-seq rdr part but continuously tallying the frequencies as it was processed. I couldn't get that worked out.

phronmophobic 2022-07-25T22:49:00.120389Z

I guess since you're splitting on words, I might start with java.util.Scanner rather than line-seq

Chase 2022-07-25T22:52:21.345029Z

Nice. That seems to be what his Go solution used as well. I consider lack of knowledge of the java ecosystem one of my biggest clojure holes and it seems IO is full of it.

phronmophobic 2022-07-25T22:53:17.824159Z

yea, I'm trying to give pointers without just writing a solution, but there is some amount of just java arcana

phronmophobic 2022-07-25T23:02:25.461679Z

It's also been long enough since I've learned java where I'm not totally sure if Scanner would be the recommended option if efficiency is a priority.

Chase 2022-07-25T23:08:41.680119Z

Yeah the first thing that popped up in a search mentioned it wasn't too efficient. haha. I'm thinking a "simple" idiomatic Clojure solution should at least be on par with these middle ground simple solutions presented in the article that take about 2.5-4 seconds.

👍 1
phronmophobic 2022-07-25T23:12:12.528759Z

here's at least a basic skeleton for using transducers with Scanner:

(import 'java.util.Scanner)

(defn scan [xform f init scanner]
  (let [f (xform f)]
    (loop [result init]
      (if (.hasNext ^Scanner scanner)
        (recur (f result (.next ^Scanner scanner)))
        result))))

(with-open [is (io/input-stream fname)]
  (scan (comp (map #(.length ^String %)))
        +
        0
        (Scanner. is)))

phronmophobic 2022-07-25T23:12:33.237209Z

It currently just sums up the lengths of each word

phronmophobic 2022-07-25T23:14:41.914569Z

to do word counts, you would need to come up with a different reducing function, f, an initial value, init, and potentially xform

Chase 2022-07-25T23:15:24.698639Z

Yeah, I'll play around with it. They have to all be lowercased too but not sure that will add much overhead

Chase 2022-07-25T23:15:36.933039Z

I appreciate the help!

phronmophobic 2022-07-25T23:15:37.516839Z

yea, that's good thing to put in the xform

Chase 2022-07-25T23:38:06.329319Z

I won't be able to use that cool profiler btw. I'm on a pixelbook running a Linux VM but I guess the required kernal commands for the profiler aren't allowed for the host OS on my system.

phronmophobic 2022-07-25T23:50:03.319909Z

you can try visuamvm, http://clojure-goes-fast.com/blog/profiling-tool-jvisualvm/

phronmophobic 2022-07-25T23:50:46.658739Z

but I think it's really difficult to write efficient code without a profiler. It's like trying to dig a hole without a shovel.

Chase 2022-07-26T00:35:36.701309Z

makes sense to me. I got that profiler working so I'll play around with it.

👍 1
Ben Sless 2022-07-29T15:56:37.243509Z

Iirc the most relevant conclusion to this use case is not using a push back reader

phronmophobic 2022-07-29T15:57:46.115669Z

@chase-lambert see anything interesting in the profiling?

Ben Sless 2022-07-29T15:58:06.940099Z

If you want to work really hard you can so this with arrays and cut down dramatically on allocation

Chase 2022-07-29T16:17:41.775519Z

What does "push back" reader mean?

Ben Sless 2022-07-29T16:18:15.722489Z

that's a java type that has an unread method

Chase 2022-07-29T16:19:38.823759Z

@smith.adriane still working on grokking visualvm. I would probably still need assistance on what it's actually showing me. This is why I'm ready to stop the solo self learning and try to get in as a junior developer or even just get involved in an open source project to be mentored by senior devs. I'm not progressing like I want by myself but I digress