#beginners
2023-10-30
Renan Oliveira10:10:07

Hey everyone, is binding in Clojure like a global variable in JavaScript, for example? How does it connect with dynamic vars?

bg11:10:14

binding is used to set a thread-local value to a dynamic var. It’s not for creating “global variables”; the new binding is only visible within the context of the binding form.
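A minimal sketch of what bg describes (the var name is illustrative):

(def ^:dynamic *greeting* "hello")   ; dynamic var with a root value

(defn greet [] *greeting*)

(greet)                      ;=> "hello"
(binding [*greeting* "hi"]
  (greet))                   ;=> "hi"  (thread-local, only inside the binding form)
(greet)                      ;=> "hello" (root value is untouched)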

Jason Bullers13:10:59

Do you mean binding the function (macro?) or binding in general as a concept?

bg13:10:43

The binding macro

Jason Bullers13:10:24

I mean the original question 🙂

bg13:10:04

Oh! Since Renan mentioned dynamic var, I assumed it was the macro.

Renan Oliveira13:10:58

I think it can be both things, the macro and the concept

Renan Oliveira13:10:16

Ah, I think I got it @U0505RKEL. I will implement it to improve my grasp of the concept

Fernando Cordeiro12:10:01

Hi, 👋 I'm looking for a way to suppress stderr when testing a function that's supposed to raise an exception. I'm using lein test + the clojure.test runner.
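(One possible approach, not an answer from this thread: if the noise is printed through Clojure's *err*, it can be rebound around the assertion; output written directly to System/err would not be affected. The function name below is hypothetical.)

(require '[clojure.test :refer [deftest is]])

(deftest throws-quietly
  (binding [*err* (java.io.StringWriter.)]        ; swallow anything printed to *err*
    (is (thrown? clojure.lang.ExceptionInfo
                 (function-under-test)))))        ; hypothetical function under test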

Dave19:10:08

When performance-testing code, does the doall function work inside threading macros?

(crit/quick-bench
    (dorun
      (->> random-bytes
           (mapcat byte-to-bits)
           doall
           multi-partition
           doall)))

"Evaluation count : 180 in 6 samples of 30 calls.
              Execution time mean : 3.423844 ms
     Execution time std-deviation : 35.747465 µs
    Execution time lower quantile : 3.393329 ms ( 2.5%)
    Execution time upper quantile : 3.466936 ms (97.5%)
                    Overhead used : 1.938030 ns
"
ClojureGoesFast seems to imply that this is required: https://clojure-goes-fast.com/blog/benchmarking-tool-criterium/
I am confused, though, because I wrote some similar code in Python and it takes 100x longer. That said, I did run two performance benchmarks that definitely realize the full collection:
(->> partitioned-random-bytes
     (pmap (partial map (partial r/fold +)))  ;; Parallel sum up each inner list
     (pmap (partial r/fold +))               ;; Parallel sum up each outer list
     (r/fold +)
     time)
"Elapsed time: 25.388449 msecs"
=> 4778384

(->> random-bytes
     (mapcat byte-to-bits)
     multi-partition
     (pmap (partial map (partial r/fold +)))  ;; Parallel sum up each inner list
     (pmap (partial r/fold +))               ;; Parallel sum up each outer list
     (r/fold +)
     time)
"Elapsed time: 135.394118 msecs"
=> 4778384
It would appear that the benchmark SHOULD show ~110ms
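(A minimal sketch of why the dorun/doall calls matter when benchmarking lazy code; the expression is illustrative, not the code from this thread:)

(crit/quick-bench (map inc (range 1000000)))          ; measures only building the unrealized lazy seq
(crit/quick-bench (doall (map inc (range 1000000))))  ; forces realization, so the real work is measured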

phronmophobic19:10:12

> does the doall function work inside threading macros?
Yes. Trying to time a single quick operation can be very noisy for a number of reasons. crit/quick-bench and other tools have various methods for getting more accurate measurements.

phronmophobic19:10:54

One quick and dirty method is to run the operation a few thousand times and find the average of the amount of time it took to execute. A better option is just to use libraries like criterium.
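A rough sketch of that quick-and-dirty approach (the expression and iteration count are arbitrary):

(let [n 10000
      start (System/nanoTime)]
  (dotimes [_ n]
    (reduce + (range 1000)))                     ; the expression under test
  (/ (- (System/nanoTime) start) (double n)))    ; mean nanoseconds per call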

phronmophobic19:10:44

criterium's readme has a short summary of some of the pitfalls it tries to avoid when benchmarking:
> Criterium measures the computation time of an expression. It is designed to address some of the pitfalls of benchmarking, and benchmarking on the JVM in particular.
> This includes:
> • statistical processing of multiple evaluations
> • inclusion of a warm-up period, designed to allow the JIT compiler to optimise its code
> • purging of gc before testing, to isolate timings from GC state prior to testing
> • a final forced GC after testing to estimate impact of cleanup on the timing results

hiredman19:10:58

There are a lot of issues with this code to start with

hiredman19:10:34

In general pmap is not great, and the performance of stacked pmaps is going to get real weird

hiredman19:10:37

fold depends on the tree like structure of its inputs to do parallel operations, and a lazy seq like what is produced by pmap is not a tree
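A small sketch of the difference (collection sizes are arbitrary):

(require '[clojure.core.reducers :as r])

(r/fold + (vec (range 1000000)))   ; vector: tree-backed, so fold can split the work across threads
(r/fold + (range 1000000))         ; lazy seq: silently degrades to a sequential reduce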

Dave19:10:39

I am using Criterium in most of my code. This code runs in 3ms:

(with-out-str
  (crit/quick-bench
    (dorun
      (->> random-bytes
           (mapcat byte-to-bits)
           doall
           multi-partition
           doall))))

"Evaluation count : 186 in 6 samples of 31 calls.
              Execution time mean : 3.521669 ms
     Execution time std-deviation : 226.311093 µs
    Execution time lower quantile : 3.342995 ms ( 2.5%)
    Execution time upper quantile : 3.884340 ms (97.5%)
                    Overhead used : 1.938030 ns
 
 Found 1 outliers in 6 samples (16.6667 %)
	low-severe	 1 (16.6667 %)
  Variance from outliers : 14.9150 % Variance is moderately inflated by outliers"
Without the doalls it takes 676 ns. But here, I split the runs in half:
(def partitioned-random-bytes (->> random-bytes
                                   (mapcat byte-to-bits)
                                   multi-partition))

(crit/quick-bench
  (->> partitioned-random-bytes
       (pmap (partial map (partial r/fold +)))              ;; Parallel sum up each inner list
       (pmap (partial r/fold +))                            ;; Parallel sum up each outer list
       (r/fold +)))

Evaluation count : 12 in 6 samples of 2 calls.
             Execution time mean : 63.192811 ms
    Execution time std-deviation : 3.428099 ms
   Execution time lower quantile : 59.604656 ms ( 2.5%)
   Execution time upper quantile : 67.841242 ms (97.5%)
                   Overhead used : 1.938030 ns
=> nil
(crit/quick-bench
  (->> random-bytes
       (mapcat byte-to-bits)
       multi-partition
       (pmap (partial map (partial r/fold +)))              ;; Parallel sum up each inner list
       (pmap (partial r/fold +))                            ;; Parallel sum up each outer list
       (r/fold +)))

Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 152.651639 ms
    Execution time std-deviation : 39.990973 ms
   Execution time lower quantile : 122.679632 ms ( 2.5%)
   Execution time upper quantile : 205.920108 ms (97.5%)
                   Overhead used : 1.938030 ns
Clearly the (def) is having some effect on the performance, but 152ms - 63ms != ~3ms

hiredman19:10:10

Fold on lists/seqs is not parallel

hiredman19:10:44

Performance is not deterministic like that

hiredman19:10:34

Criterium is giving you a mean, you cannot do math like that on averages

Dave19:10:35

@U0NCTKEV8 I have tried dozens of different iterations of these tests using various threading macros and transducers. I am still trying to figure out the optimal way to do a "multi-partition" and then perform parallel operations on it. I have tried tesser.core, net.cgrand.xforms, uncomplicate.fluokitten, clojure.core.reducers, transducer versions, etc.

hiredman19:10:10

You can do all the comparisons you want, but GIGO (garbage in, garbage out)

Dave19:10:34

If it is not deterministic like this, then I want to dive into why. I have also been trying out

clj-async-profiler.core :as prof
But I am not an expert with it yet; the flamegraphs summarize a bit too much. I might need to compare traces for each run to really understand.

hiredman20:10:19

there are many many sources of non-determinism in modern computers, caches, jits, branch predictors, multithreading, instruction dependencies, etc, etc, etc

hiredman20:10:30

it is unlikely that you will ever get the same mean runtime for any benchmark, let alone a run where you can cleanly slice out some work and subtract its mean run time

hiredman20:10:32

there are even recentish papers that suggest the entire modern approach to benchmarking jitted runtimes is flawed because they end up never hitting a stable state of performance

phronmophobic20:10:53

If you search for Chris Nuernberger on youtube, he's done a number of talks about performance with clojure. Here's a link where he talks about some of the different ways he measures performance of a datastructure, https://youtu.be/ralZ4j_ruVg?t=749. As @U0NCTKEV8 mentioned, the intuition for performance is a bit subtle.

ghadi20:10:35

pmap + r/fold + seqs is nonsensical

hiredman20:10:49

pmap + pmap is pretty nonsensical

ghadi20:10:54

I guess @U0NCTKEV8 already covered that

hiredman20:10:08

pmap is not "parallel map" despite its name, it is "lazy map + sort of weird parallel but still lazy realization"

hiredman20:10:42

and multiple layers of "sort of weird parallel but still lazy realization" is going to be weird
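A sketch of that semi-lazy behaviour (the sleep is only there to make it visible):

(def xs (pmap #(do (Thread/sleep 100) (inc %)) (range 32)))  ; returns immediately, nothing computed yet
(time (doall xs))   ; work happens here; pmap only stays ~(+ 2 ncpus) elements ahead of consumption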

ghadi20:10:30

and as mentioned, r/fold on non-vectors is going to fall back to a plain reduce

Dave20:10:04

I mean, I don't know what to tell y'all, but I tested all of them, and this is the fastest according to criterium.

ghadi20:10:58

fastest thing you've tested *so far

Dave20:10:06

Yes, this is true.

ghadi20:10:08

but you haven't tested using the APIs correctly

Dave20:10:18

Very likely.

Dave20:10:52

That's why I am asking in the first place. Is to learn more about what is going on.

ghadi20:10:12

call r/fold on a vector, it already embodies parallelism

hiredman20:10:47

I don't know what your other comparisons look like, but almost certainly you are using a small enough data set that the overhead of coordination is overwhelming any gains from parallelism, and the ceiling for gains from parallelism (assuming purely cpu work) will be the number of cores available

Dave20:10:42

The data structure is like this:

(((0 0)
  (1 0)
  (0 0)
  (0 0)
  ...)
 ((0 0 1) (0 1 0) ...)
 ((0 0 1 0) (0 1 0 0) ...))
I use a function that I call "multi-partition" to create n-sliding windowed views into a single collection.
(defn multi-partition [coll]
  (map (fn [index]
         (partition index 1 coll))
       (range 2 16)))

hiredman20:10:48

so if you have a low ceiling (2-4 cores) and high overhead (haven't seen your examples, so dunno), you will get a slowdown instead of a speedup

hiredman20:10:18

it also would not surprise me if you are repeating the fold issue over and over in different ways

hiredman20:10:16

fold needs its input to be something like a persistent vector or map which are data structures that are internally represented as trees, and those trees provide for a natural parallel divide and conquer

hiredman20:10:42

when you input a sequence, which is not a tree, fold is just a sequential reduce

hiredman20:10:50

so what it looks like you are seeing is a mostly sequential reduce being the fastest (hard to say, but replacing those calls to pmap with map might make it go even faster), which indicates the overhead of coordination is not being paid for

hiredman20:10:25

multi-partition is fundamentally parallelism-defeating, because for data item N you have to look at data item N-1

Dave20:10:58

I want to try to take advantage of shared structure of immutable sequences where possible. I am trying to understand what is going on under the hood. But it's clear that pmap is helping. It's almost as fast as tesser, which is designed for this, I think.

(with-out-str (crit/quick-bench
                (->> partitioned-random-bytes
                     (pmap (partial map (partial r/fold +))) ;; Parallel sum up each inner list
                     (pmap (partial r/fold +))              ;; Parallel sum up each outer list
                     (r/fold +))))
"
Evaluation count : 24 in 6 samples of 4 calls.
             Execution time mean : 35.703802 ms
    Execution time std-deviation : 4.905961 ms
   Execution time lower quantile : 31.469920 ms ( 2.5%)
   Execution time upper quantile : 41.349818 ms (97.5%)
                   Overhead used : 1.938030 ns"
Tesser
(with-out-str (crit/quick-bench
                (=>> partitioned-random-bytes
                     #(t/tesser % (t/fold + (t/mapcat concat))))))
"
Evaluation count : 24 in 6 samples of 4 calls.
             Execution time mean : 30.458561 ms
    Execution time std-deviation : 2.668739 ms
   Execution time lower quantile : 25.779279 ms ( 2.5%)
   Execution time upper quantile : 32.786885 ms (97.5%)
                   Overhead used : 1.938030 ns"
Compared to regular reduce
(with-out-str (crit/quick-bench
                (->> partitioned-random-bytes
                     (pmap (partial map (partial reduce +)))
                     (pmap (partial reduce +))
                     (reduce +))))
"
Evaluation count : 12 in 6 samples of 2 calls.
             Execution time mean : 79.035755 ms
    Execution time std-deviation : 1.810841 ms
   Execution time lower quantile : 77.008691 ms ( 2.5%)
   Execution time upper quantile : 81.572205 ms (97.5%)
                   Overhead used : 1.938030 ns"
The fastest is when I use the injest macro:
(with-out-str (crit/quick-bench
                (=>> partitioned-random-bytes
                     (pmap (partial map (partial r/fold +))) ;; Parallel sum up each inner list
                     (pmap (partial r/fold +))              ;; Parallel sum up each outer list
                     (r/fold +))))
"
Evaluation count : 36 in 6 samples of 6 calls.
             Execution time mean : 19.573444 ms
    Execution time std-deviation : 2.955045 ms
   Execution time lower quantile : 17.462556 ms ( 2.5%)
   Execution time upper quantile : 24.577769 ms (97.5%)
                   Overhead used : 1.938030 ns

Found 1 outliers in 6 samples (16.6667 %)
	low-severe	 1 (16.6667 %)
 Variance from outliers : 47.1436 % Variance is moderately inflated by outliers"

Dave20:10:43

When I use map or regular sequences it goes at least twice as slow.

hiredman20:10:45

now change pmap to map and fold to reduce

Dave20:10:22

(with-out-str (crit/quick-bench
                (=>> partitioned-random-bytes
                     (map (partial map (partial reduce +)))
                     (map (partial reduce +))
                     (reduce +))))
"
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 192.815422 ms
    Execution time std-deviation : 1.641831 ms
   Execution time lower quantile : 191.888084 ms ( 2.5%)
   Execution time upper quantile : 195.601536 ms (97.5%)
                   Overhead used : 1.938030 ns

Found 1 outliers in 6 samples (16.6667 %)
	low-severe	 1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers"
(with-out-str (crit/quick-bench
                (->> partitioned-random-bytes
                     (map (partial map (partial reduce +)))
                     (map (partial reduce +))
                     (reduce +))))
"
Evaluation count : 6 in 6 samples of 1 calls.
             Execution time mean : 186.821265 ms
    Execution time std-deviation : 300.414158 µs
   Execution time lower quantile : 186.463504 ms ( 2.5%)
   Execution time upper quantile : 187.204097 ms (97.5%)
                   Overhead used : 1.938030 ns"

Dave20:10:55

I am not using pmap in the inner loop; when I use pmap across the board, performance takes a dump.

Dave20:10:06

But for the outer loops, it seems to help a lot.

hiredman20:10:51

what you need to do is use the combinators in the reducers namespace together with fold

hiredman20:10:19

instead of mixing and matching the sequence combinators in clojure.core with fold from clojure.core.reducers
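A sketch of what that looks like (the data is illustrative; the input must be a vector for the fold to actually run in parallel):

(require '[clojure.core.reducers :as r])

(->> (vec (range 1000000))
     (r/filter even?)
     (r/map inc)
     (r/fold +))   ; reducer combinators preserve foldability, so the fold stays parallel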

hiredman20:10:49

it seems like what you are seeing is that there is a large enough amount of data that there is a win running the pmap stuff in parallel, but because pmap outputs a lazy seq, you get no parallelism when doing the outermost reduce

Dave20:10:35

I think the injest macro does that for me:

#?(:cljs (defmacro =>> "Just like x>>, for now" [& args] `(x>> ~@args))
   :clj  (defmacro =>>
           "Just like x>> but first composes stateless transducers into a function that 
            `r/fold`s in parallel the values flowing through the thread. Remaining
            stateful transducers are composed just like x>>."
           [x & thread]
           `(x>> ~x ~@(->> thread (i/pre-transducify-thread &env 1 `i/fold-xfn i/par-transducable?)))))

hiredman20:10:47

you'll get better performance with a single reduce instead of nesting reduces as well (if you are using the same reducing function +, then use the cat reducer)
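One way to read that suggestion, sketched over illustrative nested data (not Dave's actual collections, and using r/mapcat to do the flattening):

(require '[clojure.core.reducers :as r])

(def nested [[[1 2] [3 4]] [[5 6] [7 8]]])   ; stand-in for the nested partitions

(->> nested
     (r/mapcat identity)   ; flatten one level
     (r/mapcat identity)   ; flatten the next level
     (r/fold +))           ; a single fold does all the summing
;;=> 36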

hiredman20:10:02

color me extremely skeptical; without context it is hard to tell what that docstring is describing

hiredman20:10:57

it could just be replacing stateful transducers with fold safe versions? hard to tell, and reducers (where fold comes from) and transducers are not at all the same thing

hiredman20:10:13

and it is possible to mix them, but easy to get wrong

Dave20:10:26

I am skeptical too, but =>> with (r/fold +) is faster than tesser somehow.. I am going to keep experimenting until I get it all. I still barely understand transducers. Which combinators did you mean? monoid?

hiredman20:10:42

map, filter, partition, etc.