I've got a question about clojure.core.async/go blocks: I understand I should not do IO in the body of such a block, and there exists clojure.core.async/io-thread for such things. Do the following demonstrate correct understanding of how to use these together?
(require '[clojure.core.async :as csp])
(csp/go
(let [status
(csp/<! (csp/io-thread (do-io-work! ...)))]
(do-with status...))) Yes.
reduce-kv vs seq . Are there any benefits in performance (like from reduced allocations) to be chased from:
(reduce-kv (fn [accum k _v] (cons k accum)) (lazy-seq) m)
vs
(keys m)
I know the answer for perf questions in most cases is āit dependsā and āuse a profilerā, etc. Iām doing that in general.
As far as I understand, reduce-kv should be doing fewer allocations of MapEntry, etc: Is my understanding correct?TLRD; all approaches Iāve done so far seem to be about the same in wall clock time for map traversal (when the benchmark test is setup correctly, mutability and all). Have not done memory allocation exploration. Thanks @borkdude and @alexmiller for the help.
(reduce-kv (fn [accum k _v] (cons k accum)) (lazy-seq) m)You are mostly paying for all those conses here, not for the reduce-kv. You need to choose a benchmark with cheaper accumulation. Like add to a transient vector or consume in situ.
Well, absotelutely far from same wall clock time:
(let [m (zipmap (range 10000) (range 10000))]
(criterium.core/quick-bench
(reduce
(fn [^java.util.ArrayList accum x]
(.add accum x)
accum)
(java.util.ArrayList.)
(keys m))))
;; => Execution time mean : 220.893201 µs
(let [m (zipmap (range 10000) (range 10000))]
(criterium.core/quick-bench
(reduce-kv
(fn [^java.util.ArrayList accum x _]
(.add accum x)
accum)
(java.util.ArrayList.)
m)))
;; => Execution time mean : 80.104886 µsOr perhaps you were talking about reduce-kv vs the Java interop interfaces.
Yeah, I didnāt do reduce-kv because Iām mainly interested in getting the keys from the map, for my use case.
(defonce bench-output (atom {}))
(defn make-map [gen-zipmap-seq]
(zipmap (gen-zipmap-seq) (gen-zipmap-seq)))
(defn benchmark-map-traversal [m]
(reset! bench-output {})
(System/gc)
(println "Benchmarking .forEachRemaining :::")
(criterium.core/quick-bench
(do
(swap! bench-output dissoc :forEachRemaining)
(let [itr (.keyIterator ^clojure.lang.IMapIterable m)
arr (java.util.ArrayList.)]
(.forEachRemaining
itr
(fn [x] (.add ^java.util.ArrayList arr x)))
(swap! bench-output assoc :forEachRemaining arr))))
(System/gc)
(println "Benchmarking (doall (keys m)) :::")
(criterium.core/quick-bench
(do
(swap! bench-output dissoc :doall-keys-m)
(let [output (doall (keys m))]
(swap! bench-output assoc :doall-keys-m output))))
(System/gc)
(println "Benchmarking KeySeq reduce :::")
(criterium.core/quick-bench
(do
(swap! bench-output dissoc :KeySeq-reduce)
(let [output (reduce
(fn [^java.util.ArrayList accum x]
(.add accum x)
accum)
(java.util.ArrayList.)
(keys m))]
(swap! bench-output assoc :KeySeq-reduce output))))
;;are all the results equal? (should be)
(apply = (vals @bench-output)))
(comment
;map with numbers
(benchmark-map-traversal (make-map (fn [] (repeatedly 1000000 #(rand-int Integer/MAX_VALUE)))))
;map with random-uuid's
(benchmark-map-traversal (make-map (fn [] (repeatedly 10000000 random-uuid)))))The results that I referred to as āabout the same in wall clock timeā (same as in the gist above):
;; Results
;; Apple MBP M2 Max 64GB
;; JVM 24
;; Clojure 1.12.2
;(benchmark-map-traversal (make-map (fn [] (repeatedly 1000000 #(rand-int Integer/MAX_VALUE)))))
Benchmarking .forEachRemaining :::
Evaluation count : 12 in 6 samples of 2 calls.
Execution time mean : 74.724221 ms
Execution time std-deviation : 9.987542 ms
Execution time lower quantile : 70.011395 ms ( 2.5%)
Execution time upper quantile : 92.008934 ms (97.5%)
Overhead used : 1.273186 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 31.7136 % Variance is moderately inflated by outliers
Benchmarking (doall (keys m)) :::
Evaluation count : 12 in 6 samples of 2 calls.
Execution time mean : 69.649107 ms
Execution time std-deviation : 7.719125 ms
Execution time lower quantile : 63.257249 ms ( 2.5%)
Execution time upper quantile : 82.625515 ms (97.5%)
Overhead used : 1.273186 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 30.9776 % Variance is moderately inflated by outliers
Benchmarking KeySeq reduce :::
Evaluation count : 12 in 6 samples of 2 calls.
Execution time mean : 68.371128 ms
Execution time std-deviation : 630.732771 µs
Execution time lower quantile : 67.699291 ms ( 2.5%)
Execution time upper quantile : 69.089976 ms (97.5%)
Overhead used : 1.273186 ns
=> trueEven the most typical/idiomatic use case (keys m) appears to be fast, which is great.
Initially, I chased after a red herring because I did the ArrayList forEachRemaining benchmark incorrectly. Mutability is always harder than I think š
It would be curious to repeat those tests on a non-Apple CPU.
(keys m) is shorter and is more idiomatic. It doesn't expose how actually it collects the keys. The first way with reduce is pretty noisy.
Yes, I agree that in 98% of cases, for most ābusiness logicā, idiomatic is best. However, in this case, I would like to achieve maximum or very good performance (both in terms of time and reduced allocations).
(Iām working on a set-like data structure, which right now is using a regular Clojure map as a backing/internal data structure)
Is this in a critical execution path? If not, I wouldn't bother. If this is going into a library that other people use, then you may bother.
latter
Traversing a map is generally a critical path, especially when performance matters, not only for reading the map but also for =, as itās used for hash computation. Also, a library.
Map equality, eh?
This rings some bellss.
I will try to revive my patch in September if Alex could spare a few minutes
I spent a bunch of time on that, trying to separate it out into a couple pieces, got to some weird places with it
iirc there were some surprising cases that were slower and I ran out of available time to keep working on it back then, was hoping to revisit in 1.13
I'd like that. From what I recall you mentioned there were some weird edge cases about maps with nils but you haven't saved the experiment or results so reproduction was difficult. I'll happily make myself available for this. I'd like to test this on multiple JVM versions and CPU architectures as well.
I would like to separate it into the getOrDefault support and then equality which can leverage it
I think I spun off a new ticket for getOrDefault
I'd like that sorted out before looking at =. iirc the sticking point when I tested was that the default interface impl (effectively current) is faster for empty maps (and/or maybe map with nil key, don't remember) than the valAt path, and I got off down a rabbit hole trying to improve valAt which I never emerged from
Improving valAt would be grand, but seems like a whale of a task Meanwhile, I'll proceed assuming getOrDefault will get merged and see about looking for those edge cases with the benchmarks
Literally advent of code - https://github.com/gaverhae/misc/commit/b6d6290aa3199c474085e49579d003dd4d9e5750 gave me a 30% performance increase, for example.
(Note: obviously spoiler for advent of code.)
@gaverhae I looked at the code just for a few minutes, but does your original code use transients there?
(as far as I can tell, no, but I didn't clone the whole project to run it)
It's calling (conj #{} x) which will not automagically make a transient (but (into #{} seq-of-xs) will)
When I try to use data.int-map I typically get results that are less than impressive (and unfortunately, sometimes worse than standard sets/maps):
(def v (into [] (map int) (repeatedly 1000 #(rand-int Integer/MAX_VALUE))))
(criterium.core/quick-bench (into (i/int-set) v))
Evaluation count : 9042 in 6 samples of 1507 calls.
Execution time mean : 66.931593 µs
Execution time std-deviation : 693.136333 ns
Execution time lower quantile : 65.904502 µs ( 2.5%)
Execution time upper quantile : 67.642958 µs (97.5%)
Overhead used : 1.686446 ns
=> nil
(criterium.core/quick-bench (into #{} v))
Evaluation count : 9372 in 6 samples of 1562 calls.
Execution time mean : 65.061196 µs
Execution time std-deviation : 804.951230 ns
Execution time lower quantile : 64.190908 µs ( 2.5%)
Execution time upper quantile : 65.997327 µs (97.5%)
Overhead used : 1.686446 ns
=> nil
^^^ identical
With bigger collections, though, things seem to go downhill... sometimes slower by 2x (on my benchmarks, via criterium)
(def v (into [] (map int) (repeatedly 100000 #(rand-int Integer/MAX_VALUE))))
(criterium.core/quick-bench (into (i/int-set) v))
Evaluation count : 42 in 6 samples of 7 calls.
Execution time mean : 15.490623 ms
Execution time std-deviation : 639.664446 µs
Execution time lower quantile : 14.780885 ms ( 2.5%)
Execution time upper quantile : 16.192827 ms (97.5%)
Overhead used : 1.686446 ns
=> nil
(criterium.core/quick-bench (into #{} v))
Evaluation count : 78 in 6 samples of 13 calls.
Execution time mean : 8.265934 ms
Execution time std-deviation : 201.586207 µs
Execution time lower quantile : 8.030925 ms ( 2.5%)
Execution time upper quantile : 8.490226 ms (97.5%)
Overhead used : 1.686446 ns
=> nilI don't have much experience with the data.int-map library, so I could be "using it wrong" but I can't see anything obvious atm
Oh, seem to have found a speed up, with dense-int-set :
(def v2 (into [] (map int) (range 100000)))
(criterium.core/quick-bench (into (i/dense-int-set) v2))
Evaluation count : 174 in 6 samples of 29 calls.
Execution time mean : 3.528156 ms
Execution time std-deviation : 25.757742 µs
Execution time lower quantile : 3.500620 ms ( 2.5%)
Execution time upper quantile : 3.565914 ms (97.5%)
Overhead used : 1.686446 ns
=> nil
(criterium.core/quick-bench (into #{} v2))
Evaluation count : 78 in 6 samples of 13 calls.
Execution time mean : 8.247145 ms
Execution time std-deviation : 122.377096 µs
Execution time lower quantile : 8.113931 ms ( 2.5%)
Execution time upper quantile : 8.406452 ms (97.5%)
Overhead used : 1.686446 ns
=> nilObviously, those are all very basic benchmarks. Many more facets of performance: reading, merging collections, etc. So, reader beware.
why (lazy-seq) instead of just nil or '() ?
Oh, nice. Yes. nil works.
On a micro-benchmark, the reduce-kv is somewhat faster than keys , FWIW, with (count m) of 10,000. So at least thatās in the ārightā direction. Nothing too groundbreaking, though.
My hope for a bigger benefit is from reduced allocations, thatās TBD.
there is a fast path embedded for key and val iterators on concrete map paths that avoid entry allocation (marker for this is IMapIterable). so you could also try reducing on the keys seq
I have a feeling that this is too good to be true. I suspect the benefit decreases with data that is not just numbers⦠and yet itās a big difference (orders of magnitude).
(this is all on JVM 24, Clojure 1.12.2)
doall is seqing and going to be much slower. Try reducing the key seq (which should use the iteratoe)
This KeySeq ?
(type (keys {:a 1}))
=> clojure.lang.APersistentMap$KeySeq
(let [m (zipmap (range 10000) (range 10000))]
(criterium.core/quick-bench
(reduce
(fn [^java.util.ArrayList accum x]
(.add accum x)
accum)
(java.util.ArrayList.)
(keys m))))
Evaluation count : 2412 in 6 samples of 402 calls.
Execution time mean : 256.663788 µs
Execution time std-deviation : 1.515594 µs
Execution time lower quantile : 253.956371 µs ( 2.5%)
Execution time upper quantile : 257.781619 µs (97.5%)
Overhead used : 1.756574 ns
=> nil
Same ballpark as (doall (keys m)) unless Iām calling it wrong.The only guess I can make to explain this so far is that
.forEachRemaining is truly well-optimized by the JVM internally. Nanoseconds! I can barely believe it.Perhaps some SIMD-style optimizations are kicking in? Not sure.
FWIW, this is on Apple M2 Max (a laptop) with Energy Mode set to High. Curious how it compares on other platforms, specifically Linux.
Is my benchmark code getting optimized away by the JIT? That would be the reality checkā¦
I think no. I started saving to output to an atom to make sure⦠the results of the map traversal are there.
So the benchmark code does actually run (updated the code snippet above).
Hereās a gist for easier viewing https://gist.github.com/raspasov/b4faff9a9b8aa377baaa552e08871475
Ok, mutability bites again⦠well of course⦠forEachRemaining⦠only runs once for an Iterator ā¦. OOP
Iām so used to the Clojure way of doing things that this seemed almost unthinkable LOL.
Sorry for the noise. The performance of all of those is almost identical.
I actually did try http://data.int map at some point but I should take a look again; not always ints but sometimes Do you recall what exact uses cases it helped?
Are your keys integers, or are those just simplified examples? On advent-of-code-type problems, I've found https://github.com/clojure/data.int-map sometimes makes a huge difference, if I can reformulate my problem to use integer keys.