This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2023-12-24
Suppose I have a collection of words, say '("and" "back" "and"), and I want to count the number of strings that appear only once in the collection. What's the best way to do it? I thought about removing the strings that appear more than once and then counting the remaining words, but I'm not sure how to go about this. For the list above, the function should return 1. Any ideas? This comes from a project where I need to count the number of words that appear only once in a text.
This is very smart, I am going to try to do this.
I just have one question, how would I filter for the number? Because it appears next to the word in question, so what is the data structure exactly? For instance, I'd get {"and" 2, "back" 1}, is this a map?
Yep, that is a map. See here for the filtering: https://clojuredocs.org/clojure.core/filter#example-57afb352e4b02d8da95c26fe
(and then do count on it)
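A minimal sketch of the frequencies-then-filter idea discussed above. The function names count-singletons and singletons are made up for illustration; the point is that a map is seqable as [key value] entries, so filter can destructure each entry and test the count.

```clojure
(defn count-singletons
  "Counts the distinct items that occur exactly once in coll."
  [coll]
  (->> (frequencies coll)             ; for the example list: {"and" 2, "back" 1}
       (filter (fn [[_ n]] (= n 1)))  ; keep entries whose count is 1
       count))

(defn singletons
  "Returns the items that occur exactly once in coll."
  [coll]
  (->> (frequencies coll)
       (filter (fn [[_ n]] (= n 1)))
       (map key)))                    ; keep only the word from each entry

(count-singletons '("and" "back" "and"))
(singletons '("and" "back" "and"))
```

The second function is the same pipeline with the final count swapped for (map key), in case the words themselves are needed.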
Thank you very much, I will take a look!
Another approach could be to loop through the collection, keeping a set of seen elements and filtering out those that appear more than once:
(defn count-uniques [coll]
  (->> (reduce (fn [[res seen] x]
                 (if (seen x)
                   [(disj res x) seen]
                   [(conj res x) (conj seen x)]))
               [#{} #{}] coll)
       first
       count))
That was the approach I initially had in mind but wasn't sure how to make happen. Is one or the other faster?
I'd presume Fredrik's version is faster because it does all of the bulk work in a single pass and doesn't do the unnecessary (though perhaps minor) counting work that frequencies does.
I'll keep that in mind, though what I like in your solution is that I can even keep the words if I'd like to, which for my project might be useful.
If you have a big enough input, time (https://clojuredocs.org/clojure.core/time) might give you a very rough benchmark.
They are pretty comparable in terms of speed. While my version only does a single pass, each iteration is slower than the one in frequencies, since frequencies uses a transient (https://clojure.org/reference/transients).
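To illustrate the point about transients, here is a sketch of a frequencies-like function that builds its result in a transient map and makes it persistent at the end. The name freqs-sketch is hypothetical; it is modeled on how clojure.core/frequencies works, not copied from any particular Clojure version.

```clojure
(defn freqs-sketch
  "Counts occurrences of each item in coll using a transient map."
  [coll]
  (persistent!
   (reduce (fn [counts x]
             ;; assoc! updates the transient in place, avoiding the
             ;; allocation a persistent assoc would do on every step
             (assoc! counts x (inc (get counts x 0))))
           (transient {})
           coll)))

(freqs-sketch ["and" "back" "and"])
```

The per-element work here is one lookup and one in-place update, which is why each iteration can be cheaper than conj/disj on two persistent sets.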
Ah okay I see, I will keep mrnhrd's approach then I think. Thank you for showing me a solution the way I originally envisioned it though.
Here's a simple benchmark you can try out:
(defn count-uniques [coll]
  (->> (reduce (fn [[res seen] x]
                 (if (seen x)
                   [(disj res x) seen]
                   [(conj res x) (conj seen x)]))
               [#{} #{}] coll)
       first
       count))
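(defn count-uniques2 [coll]
  ;; Note: (if x ...) ends the loop at the first nil or false element,
  ;; which is fine for a list of strings.
  (loop [res (transient #{})
         seen (transient #{})
         [x & xs] coll]
    (if x
      (if (seen x)
        (recur (disj! res x) seen xs)
        (recur (conj! res x) (conj! seen x) xs))
      (count res))))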
(defn count-uniques3 [coll]
  (->> (frequencies coll)
       (filter (fn [[_ n]] (= n 1)))
       count))
(def word-list
  (repeatedly 1000
              (fn gen-word []
                (apply str (->> (range (int \a) (inc (int \z)))
                                (map char)
                                shuffle
                                (take 2))))))
(dotimes [_ 5]
  (time (dotimes [_ 1000]
          (count-uniques word-list))))
(dotimes [_ 5]
  (time (dotimes [_ 1000]
          (count-uniques2 word-list))))
(dotimes [_ 5]
  (time (dotimes [_ 1000]
          (count-uniques3 word-list))))
Note that time gives a measurement that is easily skewed (for example by JIT warm-up or garbage collection), even when run multiple times, although it can be very useful as a quick guide.
If it is important to optimise code, e.g. to minimise bottlenecks, then https://github.com/hugoduncan/criterium/ provides more accurate timing.
Oh, thank you very much. I knew about time, but not criterium. I will take a look, thank you.
I usually use the quick-bench function of Criterium and include it as a library dependency in a development-time alias, e.g. :repl/reloaded, so I can use it when needed. I still use time when I am simply curious.
Some Criterium examples: https://practical.li/clojure/performance/testing-functions/
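A minimal sketch of the dev-alias setup described above, assuming the Clojure CLI and a deps.edn file. The alias name :dev/bench is hypothetical (the thread mentions :repl/reloaded, whose contents are not shown here), and the version number is an assumption; check Clojars for the current Criterium release.

```clojure
;; deps.edn (fragment): Criterium is only on the classpath
;; when the alias is selected, so it stays out of production builds.
{:aliases
 {:dev/bench
  {:extra-deps {criterium/criterium {:mvn/version "0.4.6"}}}}}
```

With a REPL started via clj -A:dev/bench, something like (require '[criterium.core :refer [quick-bench]]) followed by (quick-bench (count-uniques3 word-list)) should run the benchmark with warm-up and report statistics rather than a single wall-clock figure.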
(map first (get (->> '("and" "back" "and") (frequencies) (group-by second)) 1))
yields ("back")
Sometimes I see people talking online about the AOT story of different languages like .NET and Clojure. Someone usually jumps in saying if you need AOT compiled apps in Clojure, you can use Babashka.
Am I missing something here? As far as I understand, Babashka cannot use Clojure libraries that use Java libraries, which disqualifies a lot of Clojure apps. Even using core libraries like cache is dependent on Java libraries.
Is it more accurate to say Babashka is for small utility scripts and GraalVM should be used directly if you need Java support?