This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2023-10-30
Channels
- # announcements (2)
- # babashka (37)
- # beginners (64)
- # biff (27)
- # cherry (7)
- # cider (19)
- # clj-kondo (10)
- # clojure-austin (4)
- # clojure-doc (18)
- # clojure-europe (72)
- # clojure-nl (1)
- # clojure-norway (13)
- # clojure-uk (5)
- # clojurescript (18)
- # data-science (28)
- # events (5)
- # graalvm (32)
- # hyperfiddle (6)
- # introduce-yourself (1)
- # jobs (4)
- # joyride (16)
- # juxt (6)
- # malli (7)
- # missionary (3)
- # off-topic (18)
- # pathom (15)
- # portal (14)
- # re-frame (14)
- # reitit (5)
- # releases (1)
- # rum (2)
- # sci (1)
- # shadow-cljs (102)
- # spacemacs (3)
- # sql (6)
- # web-security (2)
- # xtdb (10)
Hey everyone, is binding in Clojure like a global variable in JavaScript, for example? How does it connect with dynamic vars?
binding is used to set a thread-local value on a dynamic var. It’s not for creating “global variables”; the new binding is only visible within the context of the binding form.
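A minimal sketch of the concept (the names here are just for illustration):

```clojure
;; A dynamic var has a root value; binding gives it a thread-local
;; value that is only visible inside the binding form.
(def ^:dynamic *greeting* "hello")

(defn greet [] *greeting*)

(greet)                     ;; => "hello"

(binding [*greeting* "hei"]
  (greet))                  ;; => "hei", the thread-local override

(greet)                     ;; => "hello" again, root value untouched
```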
Do you mean binding the function (macro?) or binding in general as a concept?
I mean the original question 🙂
I think it can be both things, the macro and the concept
Hi, 👋
I'm looking for a way to suppress stderr when testing a function that's supposed to raise an exception. I'm using lein test + the clojure.test runner.
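One approach that might work, sketched with a hypothetical parse-config function under test: rebind *err* to a throwaway writer around the call. Note this only swallows output going through Clojure's *err*, not anything written directly to System/err.

```clojure
(require '[clojure.test :refer [deftest is]])

;; Hypothetical function under test: prints to stderr, then throws.
(defn parse-config [s]
  (binding [*out* *err*]
    (println "bad config:" s))
  (throw (ex-info "invalid config" {:input s})))

(deftest parse-config-throws
  ;; Redirect *err* into a StringWriter so the test output stays clean.
  (binding [*err* (java.io.StringWriter.)]
    (is (thrown? clojure.lang.ExceptionInfo (parse-config "oops")))))
```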
When performance-testing code, does the doall function work inside threading macros?
(crit/quick-bench
 (dorun
  (->> random-bytes
       (mapcat byte-to-bits)
       doall
       multi-partition
       doall)))
"Evaluation count : 180 in 6 samples of 30 calls.
Execution time mean : 3.423844 ms
Execution time std-deviation : 35.747465 µs
Execution time lower quantile : 3.393329 ms ( 2.5%)
Execution time upper quantile : 3.466936 ms (97.5%)
Overhead used : 1.938030 ns
"
ClojureGoesFast seems to imply that this is required:
https://clojure-goes-fast.com/blog/benchmarking-tool-criterium/
I am confused though, because I wrote some similar code in Python and it is taking 100x longer.
That being said I did run two performance benchmarks that definitely realize the full collection:
(->> partitioned-random-bytes
     (pmap (partial map (partial r/fold +))) ;; Parallel sum up each inner list
     (pmap (partial r/fold +)) ;; Parallel sum up each outer list
     (r/fold +)
     time)
"Elapsed time: 25.388449 msecs"
=> 4778384
(->> random-bytes
     (mapcat byte-to-bits)
     multi-partition
     (pmap (partial map (partial r/fold +))) ;; Parallel sum up each inner list
     (pmap (partial r/fold +)) ;; Parallel sum up each outer list
     (r/fold +)
     time)
"Elapsed time: 135.394118 msecs"
=> 4778384
It would appear that the benchmark SHOULD show ~110ms
> does the doall function work inside threading macros?
Yes.
Trying to time a single quick operation can be very noisy for a number of reasons. crit/quick-bench and other tools have various methods for getting more accurate measurements.
One quick and dirty method is to run the operation a few thousand times and find the average of the amount of time it took to execute. A better option is just to use libraries like criterium.
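The quick-and-dirty version, as a sketch (criterium does this properly, with warm-up and GC handling on top):

```clojure
;; Run the thunk n times and report the mean wall-clock time per call.
;; No warm-up, no statistics; only good for rough ballpark numbers.
(defn rough-avg-nanos [f n]
  (let [start (System/nanoTime)]
    (dotimes [_ n] (f))
    (/ (double (- (System/nanoTime) start)) n)))

;; e.g. (rough-avg-nanos #(reduce + (range 1000)) 10000)
```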
criterium's readme has a short summary of some of the pitfalls it tries to avoid when benchmarking: > Criterium measures the computation time of an expression. It is designed to address some of the pitfalls of benchmarking, and benchmarking on the JVM in particular. > This includes: > • statistical processing of multiple evaluations > • inclusion of a warm-up period, designed to allow the JIT compiler to optimise its code > • purging of gc before testing, to isolate timings from GC state prior to testing > • a final forced GC after testing to estimate impact of cleanup on the timing results
In general pmap is not great, and the performance of stacked pmaps is going to get real weird
fold depends on the tree like structure of its inputs to do parallel operations, and a lazy seq like what is produced by pmap is not a tree
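A small illustration of that point: the answer is the same either way, but only the vector version gives fold a tree to split work over.

```clojure
(require '[clojure.core.reducers :as r])

(def lazy-nums (map inc (range 1000000))) ;; lazy seq: fold degrades to reduce
(def vec-nums  (vec lazy-nums))           ;; persistent vector: fold can fork/join

(r/fold + lazy-nums) ;; correct answer, but sequential under the hood
(r/fold + vec-nums)  ;; same answer, computed over the vector's tree in parallel
```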
I am using Criterium in most of my code. This code runs in 3ms:
(with-out-str
  (crit/quick-bench
   (dorun
    (->> random-bytes
         (mapcat byte-to-bits)
         doall
         multi-partition
         doall))))
"Evaluation count : 186 in 6 samples of 31 calls.
Execution time mean : 3.521669 ms
Execution time std-deviation : 226.311093 µs
Execution time lower quantile : 3.342995 ms ( 2.5%)
Execution time upper quantile : 3.884340 ms (97.5%)
Overhead used : 1.938030 ns
Found 1 outliers in 6 samples (16.6667 %)
    low-severe    1 (16.6667 %)
Variance from outliers : 14.9150 % Variance is moderately inflated by outliers
Without the doalls it takes 676ns
But here, I split the runs in half
(def partitioned-random-bytes
  (->> random-bytes
       (mapcat byte-to-bits)
       multi-partition))
(crit/quick-bench
 (->> partitioned-random-bytes
      (pmap (partial map (partial r/fold +))) ;; Parallel sum up each inner list
      (pmap (partial r/fold +)) ;; Parallel sum up each outer list
      (r/fold +)))
Evaluation count : 12 in 6 samples of 2 calls.
Execution time mean : 63.192811 ms
Execution time std-deviation : 3.428099 ms
Execution time lower quantile : 59.604656 ms ( 2.5%)
Execution time upper quantile : 67.841242 ms (97.5%)
Overhead used : 1.938030 ns
=> nil
(crit/quick-bench
 (->> random-bytes
      (mapcat byte-to-bits)
      multi-partition
      (pmap (partial map (partial r/fold +))) ;; Parallel sum up each inner list
      (pmap (partial r/fold +)) ;; Parallel sum up each outer list
      (r/fold +)))
Evaluation count : 6 in 6 samples of 1 calls.
Execution time mean : 152.651639 ms
Execution time std-deviation : 39.990973 ms
Execution time lower quantile : 122.679632 ms ( 2.5%)
Execution time upper quantile : 205.920108 ms (97.5%)
Overhead used : 1.938030 ns
Clearly the (def) is having some effect on the performance, but 152ms - 63ms != ~3ms
@U0NCTKEV8 I have dozens of different iterations of these tests using various threading macros and transducers. I am still trying to figure out the optimal way to do a "multi-partition", and then perform parallel operations on it. I have tried tesser.core, net.cgrand.xforms, uncomplicate.fluokitten, clojure.core.reducers, transducer versions, etc...
If it is not deterministic like this, then I want to dive into why. I have also been trying out clj-async-profiler.core :as prof, but I am not an expert with it yet; the flamegraphs summarize a bit too much. I might need to compare traces for each run to really understand.
there are many, many sources of non-determinism in modern computers: caches, JITs, branch predictors, multithreading, instruction dependencies, etc, etc, etc
it is unlikely that you will ever get the same mean runtime for any benchmark, let alone a run where you can cleanly slice out some work and subtract its mean run time
there are even recentish papers that suggest the entire modern approach to benchmarking JITed runtimes is flawed, because they end up never hitting a stable state of performance
If you search for Chris Nuernberger on youtube, he's done a number of talks about performance with clojure. Here's a link where he talks about some of the different ways he measures performance of a datastructure, https://youtu.be/ralZ4j_ruVg?t=749. As @U0NCTKEV8 mentioned, the intuition for performance is a bit subtle.
I guess @U0NCTKEV8 already covered that
pmap is not "parallel map" despite its name, it is "lazy map + sort of weird parallel but still lazy realization"
and multiple layers of "sort of weird parallel but still lazy realization" is going to be weird
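For example, pmap returns a lazy seq immediately; the work only happens as the seq is consumed, with a small window of futures running ahead of the consumer:

```clojure
;; pmap returns right away; nothing computes until the seq is consumed.
(def results (pmap (fn [x] (Thread/sleep 1) (inc x)) (range 100)))

;; Consumption drives the semi-parallel, still-lazy realization.
(reduce + results) ;; => 5050
```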
I mean, I don't know what to tell y'all, but I tested all of them, and this is the fastest according to criterium.
I don't know what your other comparisons look like, but almost certainly you are using a small enough data set that the overhead of coordination is overwhelming any gains from parallelism, and the ceiling for gains from parallelism (assuming purely cpu work) will be the number of cores available
The data structure is like this:
(((0 0) (1 0) (0 0) (0 0) ...)
 ((0 0 1) (0 1 0) ...)
 ((0 0 1 0) (0 1 0 0) ...)
 ...)
I use a function that I call "multi-partition" to create n-sliding windowed views into a single collection.
(defn multi-partition [coll]
  (map (fn [index]
         (partition index 1 coll))
       (range 2 16)))
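As a quick sanity check of what that produces (repeating the definition so the snippet is self-contained):

```clojure
(defn multi-partition [coll]
  (map (fn [index]
         (partition index 1 coll))
       (range 2 16)))

;; 14 views: sliding windows of sizes 2 through 15 over the same coll
(take 2 (first (multi-partition [0 1 0 0 1])))  ;; => ((0 1) (1 0))
(first (second (multi-partition [0 1 0 0 1])))  ;; => (0 1 0)
```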
so if you have a low ceiling (2-4 cores) and high overhead (haven't seen your examples, so dunno), you will get a slowdown instead of a speedup
it also would not surprise me if you are repeating the fold issue over and over different ways
fold needs its input to be something like a persistent vector or map which are data structures that are internally represented as trees, and those trees provide for a natural parallel divide and conquer
so what it looks like you are seeing, a mostly sequential reduce being the fastest (hard to say, but replacing those calls to pmap with map might make it go even faster), indicates the overhead of coordination is not being paid for
multi-partition is fundamentally parallel defeating, because for a data item N you have to look at data item N-1
I want to try to take advantage of shared structure of immutable sequences where possible. I am trying to understand what is going on under the hood. But it's clear that pmap is helping. It's almost as fast as tesser, which is designed for this, I think.
(with-out-str
  (crit/quick-bench
   (->> partitioned-random-bytes
        (pmap (partial map (partial r/fold +))) ;; Parallel sum up each inner list
        (pmap (partial r/fold +)) ;; Parallel sum up each outer list
        (r/fold +))))
"
Evaluation count : 24 in 6 samples of 4 calls.
Execution time mean : 35.703802 ms
Execution time std-deviation : 4.905961 ms
Execution time lower quantile : 31.469920 ms ( 2.5%)
Execution time upper quantile : 41.349818 ms (97.5%)
Overhead used : 1.938030 ns"
Tesser
(with-out-str
  (crit/quick-bench
   (=>> partitioned-random-bytes
        #(t/tesser % (t/fold + (t/mapcat concat))))))
"
Evaluation count : 24 in 6 samples of 4 calls.
Execution time mean : 30.458561 ms
Execution time std-deviation : 2.668739 ms
Execution time lower quantile : 25.779279 ms ( 2.5%)
Execution time upper quantile : 32.786885 ms (97.5%)
Overhead used : 1.938030 ns"
Compared to regular reduce
(with-out-str
  (crit/quick-bench
   (->> partitioned-random-bytes
        (pmap (partial map (partial reduce +)))
        (pmap (partial reduce +))
        (reduce +))))
"
Evaluation count : 12 in 6 samples of 2 calls.
Execution time mean : 79.035755 ms
Execution time std-deviation : 1.810841 ms
Execution time lower quantile : 77.008691 ms ( 2.5%)
Execution time upper quantile : 81.572205 ms (97.5%)
Overhead used : 1.938030 ns"
Fastest is when I use injest macro
(with-out-str
  (crit/quick-bench
   (=>> partitioned-random-bytes
        (pmap (partial map (partial r/fold +))) ;; Parallel sum up each inner list
        (pmap (partial r/fold +)) ;; Parallel sum up each outer list
        (r/fold +))))
"
Evaluation count : 36 in 6 samples of 6 calls.
Execution time mean : 19.573444 ms
Execution time std-deviation : 2.955045 ms
Execution time lower quantile : 17.462556 ms ( 2.5%)
Execution time upper quantile : 24.577769 ms (97.5%)
Overhead used : 1.938030 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 47.1436 % Variance is moderately inflated by outliers"
(with-out-str
  (crit/quick-bench
   (=>> partitioned-random-bytes
        (map (partial map (partial reduce +)))
        (map (partial reduce +))
        (reduce +))))
"
Evaluation count : 6 in 6 samples of 1 calls.
Execution time mean : 192.815422 ms
Execution time std-deviation : 1.641831 ms
Execution time lower quantile : 191.888084 ms ( 2.5%)
Execution time upper quantile : 195.601536 ms (97.5%)
Overhead used : 1.938030 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 13.8889 % Variance is moderately inflated by outliers"
(with-out-str
  (crit/quick-bench
   (->> partitioned-random-bytes
        (map (partial map (partial reduce +)))
        (map (partial reduce +))
        (reduce +))))
"
Evaluation count : 6 in 6 samples of 1 calls.
Execution time mean : 186.821265 ms
Execution time std-deviation : 300.414158 µs
Execution time lower quantile : 186.463504 ms ( 2.5%)
Execution time upper quantile : 187.204097 ms (97.5%)
Overhead used : 1.938030 ns"
I am not using pmap in the inner loop; when I use pmap across the board, performance takes a dump.
what you need to do is use the combinators in the reducers namespace together with fold, instead of mixing and matching the sequence combinators in clojure.core with fold from clojure.core.reducers
it seems like what you are seeing is there is a large enough amount of data that there is a win running the pmap stuff in parallel, but because pmap outputs a lazy-seq, you get no parallelism when doing the outer most reduce
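One way to test that theory, sketched with a stand-in for the real data: pour the pmap output into a vector before the final step, so the outer fold has something foldable.

```clojure
(require '[clojure.core.reducers :as r])

;; hypothetical stand-in for partitioned-random-bytes
(def data (repeat 8 (vec (range 1000))))

(->> data
     (pmap (partial reduce +)) ;; lazy seq of per-chunk sums
     vec                       ;; realize into a vector...
     (r/fold +))               ;; ...so the final fold can actually parallelize
;; => 3996000
```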
I think the injest macro does that for me:
#?(:cljs (defmacro =>> "Just like x>>, for now" [& args] `(x>> ~@args))
   :clj (defmacro =>>
          "Just like x>> but first composes stateless transducers into a function that
          `r/fold`s in parallel the values flowing through the thread. Remaining
          stateful transducers are composed just like x>>."
          [x & thread]
          `(x>> ~x ~@(->> thread (i/pre-transducify-thread &env 1 `i/fold-xfn i/par-transducable?)))))
you'll get better performance with a single reduce instead of nesting reduces as well (if you are using the same reducing function +, then use the cat reducer)
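A sketch of that suggestion with toy data (the nested input here is a stand-in): stay in the reducers vocabulary and do one fold over the flattened structure, built from vectors so the fold can divide and conquer.

```clojure
(require '[clojure.core.reducers :as r])

;; hypothetical nested input: vectors of vectors, so fold can split the work
(def nested (mapv (fn [n] (vec (range n))) [3 4 5]))

;; one fold over the flattened data instead of nested reduces
(r/fold + (r/mapcat identity nested))
;; => 19 ; (0+1+2) + (0+1+2+3) + (0+1+2+3+4)
```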
color me extremely skeptical; without context it is hard to tell what that docstring is describing
it could just be replacing stateful transducers with fold safe versions? hard to tell, and reducers (where fold comes from) and transducers are not at all the same thing