#beginners
2023-12-24
Nathan Nolk11:12:35

Suppose I have a collection of words, like this: '("and" "back" "and"), and I want to count the number of strings that appear only once in the collection. What's the best way to do it? I thought about removing strings that appear more than once and then counting the remaining words, but I'm not sure how to go about this. In this context, the function should return 1 for the list above. Any ideas? This originally comes from a project where I need to count the number of words that appear only once in a text.

mrnhrd11:12:30

frequencies and then filter for the value 1 would be an option

Nathan Nolk11:12:10

This is very smart, I am going to try to do this.

Nathan Nolk11:12:30

I just have one question, how would I filter for the number? Because it appears next to the word in question, so what is the data structure exactly? For instance, I'd get {"and" 2, "back" 1}, is this a map?

mrnhrd12:12:46

Yep, that is a map. See here for the filtering: https://clojuredocs.org/clojure.core/filter#example-57afb352e4b02d8da95c26fe (and then do count on it)
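
A minimal sketch of the suggestion above, using the three-word list from the question. The helper name count-singletons is made up for illustration and does not appear in the thread:

```clojure
;; frequencies returns a map from each distinct element to its count.
(def words '("and" "back" "and"))

(frequencies words)
;; => {"and" 2, "back" 1}

;; Hypothetical helper: count the entries whose count is exactly 1.
;; Filtering a map yields a sequence of [key value] pairs, so the
;; predicate destructures each pair.
(defn count-singletons [coll]
  (->> (frequencies coll)
       (filter (fn [[_word n]] (= n 1)))
       count))

(count-singletons words)
;; => 1
```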

Nathan Nolk12:12:16

Thank you very much, I will take a look!

Nathan Nolk12:12:32

This works wonderfully, thank you!

👍 1
Fredrik12:12:49

Another approach could be to loop through the collection, keeping a set of seen elements and filtering out those that appear more than once:

(defn count-uniques [coll]
  (->> (reduce (fn [[res seen] x]
                 (if (seen x)
                   [(disj res x) seen]
                   [(conj res x) (conj seen x)]))
               [#{} #{}] coll)
       first
       count))

Nathan Nolk12:12:28

That was the approach I initially had in mind but wasn't sure how to make happen. Is one or the other faster?

mrnhrd12:12:49

I'd presume Fredrik's version is faster because it does all of the bulk work in a single pass and doesn't do the unnecessary (though perhaps minor) counting work that frequencies does.

Nathan Nolk12:12:37

I'll keep that in mind, though what I like in your solution is that I can even keep the words if I'd like to, which for my project might be useful.
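
As a rough sketch of that point, the frequencies-based approach can return the words themselves instead of only their number. The helper name singletons is hypothetical:

```clojure
;; Hypothetical helper: return the words that occur exactly once,
;; rather than just how many there are.
(defn singletons [coll]
  (for [[word n] (frequencies coll)
        :when (= n 1)]
    word))

(singletons '("and" "back" "and"))
;; => ("back")

(count (singletons '("and" "back" "and")))
;; => 1
```

For larger inputs the order of the returned words is not guaranteed to match their order in the text, since it follows the iteration order of the map.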

mrnhrd12:12:18

If you have a big enough input, time (https://clojuredocs.org/clojure.core/time) might give you a very rough benchmark.

Fredrik12:12:11

They are pretty comparable in terms of speed. While my version only does a single pass, each iteration is slower than the one in frequencies, since frequencies uses a transient (https://clojure.org/reference/transients).
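
To illustrate the point about transients, here is a sketch of a frequency counter built on a transient map. The name my-frequencies is made up, and this is an illustration of the technique rather than a quote of the core source:

```clojure
;; Illustrative re-implementation: the map is mutated in place via assoc!
;; while it is being built, then frozen with persistent! at the end.
(defn my-frequencies [coll]
  (persistent!
   (reduce (fn [counts x]
             ;; get works on transient maps; default to 0 for unseen keys
             (assoc! counts x (inc (get counts x 0))))
           (transient {})
           coll)))

(my-frequencies '("and" "back" "and"))
;; => {"and" 2, "back" 1}
```

The transient avoids allocating a new persistent map on every step, which is why each iteration can be cheaper than a conj onto a persistent set.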

Nathan Nolk12:12:48

Ah okay I see, I will keep mrnhrd's approach then I think. Thank you for showing me a solution the way I originally envisioned it though.

Fredrik12:12:31

Here's a simple benchmark you can try out:

(defn count-uniques [coll]
  (->> (reduce (fn [[res seen] x]
                 (if (seen x)
                   [(disj res x) seen]
                   [(conj res x) (conj seen x)]))
               [#{} #{}] coll)
       first
       count))

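(defn count-uniques2 [coll]
  ;; Same algorithm as count-uniques, but using transient sets.
  ;; Note: (if x ...) ends the loop at the first nil or false element,
  ;; which is fine for a collection of strings.
  (loop [res (transient #{})
         seen (transient #{})
         [x & xs] coll]
    (if x
      (if (seen x)
        (recur (disj! res x) seen xs)
        (recur (conj! res x) (conj! seen x) xs))
      (count res))))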

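(defn count-uniques3 [coll]
  (->> (frequencies coll)
       ;; keep only the [word n] entries that occur exactly once;
       ;; the local is named n to avoid shadowing clojure.core/count
       (filter (fn [[_ n]] (= n 1)))
       count))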

(def word-list
  (repeatedly 1000
              (fn gen-word []
                (apply str (->> (range (int \a) (inc (int \z)))
                                (map char)
                                shuffle
                                (take 2))))))

(dotimes [_ 5]
  (time (dotimes [_ 1000]
          (count-uniques word-list))))

(dotimes [_ 5]
  (time (dotimes [_ 1000]
          (count-uniques2 word-list))))

(dotimes [_ 5]
  (time (dotimes [_ 1000]
          (count-uniques3 word-list))))

practicalli-johnny12:12:45

Note that time gives a measurement that can easily be skewed, even when run multiple times, although it can be very useful as a quick guide. If it is important to optimise code, e.g. to minimise bottlenecks, then https://github.com/hugoduncan/criterium/ provides more accurate timing.

Nathan Nolk12:12:10

Oh, thank you very much. I knew about time, but not criterium. I will take a look, thank you.

practicalli-johnny13:12:11

I usually use the quick-bench function of Criterium and include it as a library dependency in a development-time alias, e.g. :repl/reloaded, so I can use it when needed. I still use time when I am simply curious. Some Criterium examples: https://practical.li/clojure/performance/testing-functions/
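
A hedged sketch of what a quick-bench call might look like for the frequencies-based approach. It assumes Criterium is already on the classpath (for example through a development alias in deps.edn); the sample data and the name sample-words are made up for illustration:

```clojure
;; Assumption: the criterium library is available as a dependency,
;; e.g. under an alias in deps.edn. It is not part of clojure.core.
(require '[criterium.core :refer [quick-bench]])

;; Hypothetical sample input: 1000 words drawn from a small vocabulary.
;; vec realises the sequence up front so generation is not timed.
(def sample-words
  (vec (repeatedly 1000 #(rand-nth ["and" "back" "again" "then"]))))

;; quick-bench evaluates the expression repeatedly and prints
;; summary statistics instead of a single wall-clock reading.
(quick-bench
 (->> (frequencies sample-words)
      (filter (fn [[_ n]] (= n 1)))
      count))
```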

👍 1
phill00:12:31

(map first (get (->> '("and" "back" "and") (frequencies) (group-by second)) 1)) yields ("back")

Dallas Surewood14:12:17

Sometimes I see people talking online about the AOT story of different languages like .NET and Clojure. Someone usually jumps in saying that if you need AOT-compiled apps in Clojure, you can use Babashka. Am I missing something here? As far as I understand, Babashka cannot use Clojure libraries that use Java libraries, which disqualifies a lot of Clojure apps. Even using core libraries like cache depends on Java libraries. Is it more accurate to say Babashka is for small utility scripts and GraalVM should be used directly if you need Java support?