#data-science
2020-04-28
jumar05:04:55

I'm playing with NumPy (Python for Data Analysis book chapter 4) and found following "benchmark":

my_arr = np.arange(1000000)
%time for _ in range(10): my_arr2 = my_arr * 2
# => 
CPU times: user 14.5 ms, sys: 8.33 ms, total: 22.8 ms
Wall time: 22.8 ms
I tried to compare this with Clojure, but it's at least 10x slower using a vector:
(def my-vec (vec (range 1000000)))
(dotimes [i 10]
  (time (mapv #(* % 2)
              my-vec)))
;;=>
"Elapsed time: 33.79915 msecs"
"Elapsed time: 32.220132 msecs"
"Elapsed time: 32.664845 msecs"
"Elapsed time: 34.316561 msecs"
"Elapsed time: 30.00047 msecs"
"Elapsed time: 31.13807 msecs"
"Elapsed time: 33.537312 msecs"
"Elapsed time: 29.882778 msecs"
"Elapsed time: 27.428827 msecs"
"Elapsed time: 23.888296 msecs"
I also tried (fastmath.vector/mult my-vec 2) with very similar results. With arrays and significantly uglier code it's a bit faster, but not by much:
(def my-array (int-array (range 1000000)))
(dotimes [i 10]
  (time (amap ^ints my-array
              idx
              ret
              (* (int 2) (aget ^ints my-array idx)))))
"Elapsed time: 52.724497 msecs"
"Elapsed time: 20.789277 msecs"
"Elapsed time: 15.796688 msecs"
"Elapsed time: 16.697641 msecs"
"Elapsed time: 16.42215 msecs"
"Elapsed time: 21.23476 msecs"
"Elapsed time: 15.189691 msecs"
"Elapsed time: 14.85 msecs"
"Elapsed time: 15.253646 msecs"
"Elapsed time: 17.112342 msecs"
Is there any way to speed it up or is this an expected performance difference?

jumar05:04:39

I tried to make it more similar to the NumPy benchmark by wrapping dotimes with time, but the results are very similar:

(def my-vec (vec (range 1000000)))
(time
 (dotimes [i 10]
   (mapv #(* % 2)
         my-vec)))
"Elapsed time: 304.762938 msecs"

;; Using arrays => still not great
(def my-array (int-array (range 1000000)))
(time
 (dotimes [i 10]
   (amap ^ints my-array
         idx
         ret
         (* (int 2) (aget ^ints my-array idx)))))
"Elapsed time: 133.522233 msecs"

;; Try fastmath
(time
 (dotimes [i 10]
   (fmv/mult my-vec 2)))
;; => unfortunately, not faster :(
"Elapsed time: 285.1403 msecs"

genmeblog08:04:19

Actually, the fastmath library adds nothing to optimize a native Clojure vector; multiplication there is done with mapv. It has optimized 2-, 3- and 4-dimensional vector types; for other types it just adds the same API for consistency. Also, the main assumption is that only the double type is used.
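A rough sketch of that distinction (assuming fastmath.vector is required, as in jumar's snippet above):

(require '[fastmath.vector :as fmv])

;; fixed-size vectors (vec2/vec3/vec4) get specialized double-based types,
;; so arithmetic on them is unboxed
(fmv/mult (fmv/vec3 1.0 2.0 3.0) 2.0)

;; an arbitrary-length Clojure vector goes through a generic, mapv-style path,
;; which is why it shows the same timings as plain mapv above
(fmv/mult (vec (range 1000000)) 2.0)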

genmeblog08:04:13

Also be aware of boxed math (in the first case).
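A minimal sketch of how to surface that (the same flag is used in the benchmark below):

;; with this set, the compiler warns whenever a math op falls back to boxed Numbers
(set! *unchecked-math* :warn-on-boxed)

;; % is an Object here, so the multiplication is boxed (and triggers a warning)
(mapv #(* % 2) (vec (range 10)))

;; the arithmetic inside the fn is now primitive long math, although values
;; still get boxed when crossing the mapv boundary
(mapv (fn [^long x] (* x 2)) (vec (range 10)))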

genmeblog08:04:36

Anyway, I'm playing with the benchmarks now; I'll post the results here.

genmeblog09:04:45

(require '[criterium.core :as crit]
         '[fastmath.vector :as v]
         '[tech.v2.datatype.functional :as dfn]
         '[tech.v2.datatype :as dtype])

(set! *unchecked-math* :warn-on-boxed)
(set! *warn-on-reflection* true)

(def my-vec (vec (range 1000000)))

;; boxed call
(crit/quick-bench (mapv #(* % 2) my-vec))
;; Evaluation count : 18 in 6 samples of 3 calls.
;; Execution time mean : 53.953946 ms
;; Execution time std-deviation : 31.708286 ms
;; Execution time lower quantile : 36.049257 ms ( 2.5%)
;; Execution time upper quantile : 106.459891 ms (97.5%)
;; Overhead used : 9.546390 ns

;; Found 1 outliers in 6 samples (16.6667 %)
;; low-severe	 1 (16.6667 %)
;; Variance from outliers : 82.8097 % Variance is severely inflated by outliers

(crit/quick-bench (mapv (fn [^long x] (* x 2)) my-vec))
;; Evaluation count : 24 in 6 samples of 4 calls.
;; Execution time mean : 47.258691 ms
;; Execution time std-deviation : 15.277055 ms
;; Execution time lower quantile : 36.745716 ms ( 2.5%)
;; Execution time upper quantile : 73.059491 ms (97.5%)
;; Overhead used : 9.546390 ns

;; Found 1 outliers in 6 samples (16.6667 %)
;; low-severe	 1 (16.6667 %)
;; Variance from outliers : 81.6028 % Variance is severely inflated by outliers

(def my-array (int-array my-vec))
(crit/quick-bench (let [^ints my-array my-array]
                    (amap my-array
                          idx
                          ret
                          (* 2 (aget my-array idx)))))
;; Evaluation count : 156 in 6 samples of 26 calls.
;; Execution time mean : 4.485725 ms
;; Execution time std-deviation : 326.096583 µs
;; Execution time lower quantile : 4.184825 ms ( 2.5%)
;; Execution time upper quantile : 4.969988 ms (97.5%)
;; Overhead used : 9.546390 ns

(def my-double-array (double-array my-vec))
(crit/quick-bench (v/mult my-double-array 2.0))
;; Evaluation count : 96 in 6 samples of 16 calls.
;; Execution time mean : 6.374301 ms
;; Execution time std-deviation : 90.013453 µs
;; Execution time lower quantile : 6.254278 ms ( 2.5%)
;; Execution time upper quantile : 6.474364 ms (97.5%)
;; Overhead used : 9.546390 ns

(def datatype-ints (dtype/->reader my-vec :int32))
(crit/quick-bench (dtype/->int-array (dfn/* datatype-ints 2)))
;; Evaluation count : 12 in 6 samples of 2 calls.
;; Execution time mean : 61.180107 ms
;; Execution time std-deviation : 18.228253 ms
;; Execution time lower quantile : 51.501341 ms ( 2.5%)
;; Execution time upper quantile : 92.064166 ms (97.5%)
;; Overhead used : 9.546390 ns

genmeblog09:04:14

Looks like pure int-array amap is the fastest, then fastmath mult on a double-array. I was pretty sure that tech.ml.datatype could help, but it's the same as the Clojure vec operations.

genmeblog09:04:53

Python for comparison on my machine:

genmeblog09:04:57

>>> timeit.timeit(my_fn,number=1000)
3.450289999949746

jumar10:04:47

Thanks for sharing the results! In your case the array version is surprisingly faster (4.5 ms vs 15+ ms in my case); I'm wondering whether that's due to using criterium or something else.

jumar10:04:01

Using criterium I didn't get significantly better results: 12 ms at best, using your version with let [^ints my-array my-array].
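For reference, a rough sketch of the difference between the two timing approaches: time measures a single run with no JIT warm-up, while criterium warms the code up and reports statistics over many runs.

(require '[criterium.core :as crit])

(def my-array (int-array (range 1000000)))

;; one-shot timing: includes JIT compilation and whatever GC happens to run
(time (let [^ints a my-array]
        (amap a idx ret (* 2 (aget a idx)))))

;; warmed-up, statistical timing of the same expression
(crit/quick-bench (let [^ints a my-array]
                    (amap a idx ret (* 2 (aget a idx)))))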

genmeblog12:04:50

We have different hardware, I suppose. It may also depend on the JVM settings (like memory) and the JVM itself. Anyway, a primitive array gives similar performance to Python 3 NumPy: 4.5 ms (Clojure) vs 3.5 ms (Python) in my case. I wonder why the tech.ml stack is not more efficient; I need to ask Chris about it.

jumar08:04:13

I wasn't that surprised about the differences between our computers, but rather about the difference between NumPy and Clojure on my computer: ~2-3 ms for NumPy vs 15+ ms for Clojure arrays. Anyway, thanks a lot for looking at this and doing all that work! +1

👍 4
chrisn20:04:10

This is what I have when I experimented:

user> (crit/quick-bench (mapv #(* 2 %) my-vec))
Evaluation count : 12 in 6 samples of 2 calls.
             Execution time mean : 60.642783 ms
    Execution time std-deviation : 1.883564 ms
   Execution time lower quantile : 59.229796 ms ( 2.5%)
   Execution time upper quantile : 62.815753 ms (97.5%)
                   Overhead used : 9.234052 ns
nil
user> (crit/quick-bench (dtype/make-container :java-array :int32 (dfn/* my-vec 2)))
Evaluation count : 24 in 6 samples of 4 calls.
             Execution time mean : 28.949417 ms
    Execution time std-deviation : 177.687414 µs
   Execution time lower quantile : 28.780664 ms ( 2.5%)
   Execution time upper quantile : 29.187880 ms (97.5%)
                   Overhead used : 9.234052 ns
nil
user> (crit/quick-bench (dfn/* my-vec 2))
Evaluation count : 84426 in 6 samples of 14071 calls.
             Execution time mean : 7.118790 µs
    Execution time std-deviation : 140.621151 ns
   Execution time lower quantile : 6.979899 µs ( 2.5%)
   Execution time upper quantile : 7.328529 µs (97.5%)
                   Overhead used : 9.234052 ns
nil

chrisn20:04:09

So, if you want a new container, that takes about 30 ms: half the time of mapv but nowhere near Python. If you just want something that represents the answer, that takes about 7 µs. The result will look like a persistent vector, but it hasn't been concretely realized.
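A rough sketch of that distinction, reusing the dtype/dfn aliases from the earlier require:

;; dfn/* builds a lazy reader: no per-element work happens yet (hence the µs timing)
(def lazy-result (dfn/* my-vec 2))

;; elements are computed on demand, e.g. when seq'ing over the reader
(take 5 lazy-result)
;; => (0 2 4 6 8) for the range-based my-vec

;; realizing everything into a concrete java array is where the milliseconds go
(dtype/make-container :java-array :int64 lazy-result)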

chrisn20:04:50

Ranges are long datatypes, however, so there is an implicit cast in the second one. If we remove that cast we get about 25% faster:

user> (crit/quick-bench (dtype/make-container :java-array :int64 (dfn/* my-vec 2)))
Evaluation count : 36 in 6 samples of 6 calls.
             Execution time mean : 20.003307 ms
    Execution time std-deviation : 702.153184 µs
   Execution time lower quantile : 19.362756 ms ( 2.5%)
   Execution time upper quantile : 21.060910 ms (97.5%)
                   Overhead used : 9.234052 ns
nil

chrisn20:04:44

If I redefine it as a more specific function, I can get close to Python without completely cheating (which is what the last one does, because it is lazy):

user> (defn general-reader []
        (let [src-data (typecast/datatype->reader :int64 my-vec)
              dst-data (int-array (dtype/ecount src-data))]
          (parallel-for/parallel-for 
           idx
           (dtype/ecount src-data)
           (aset dst-data idx (* 2 (.read src-data idx))))
          dst-data))
             
          
#'user/general-reader
user> (take 10 (general-reader))
(0 2 4 6 8 10 12 14 16 18)
user> (crit/quick-bench general-reader)
Evaluation count : 47894784 in 6 samples of 7982464 calls.
             Execution time mean : 3.283733 ns
    Execution time std-deviation : 0.070009 ns
   Execution time lower quantile : 3.211260 ns ( 2.5%)
   Execution time upper quantile : 3.387039 ns (97.5%)
                   Overhead used : 9.234052 ns
nil
user> (type (general-reader))
[I

chrisn20:04:16

I think there were some SIMD opts or something in that last one; I really don't understand why it is so much faster. The point is that if the operation is important enough, then we have the tools to make it just as fast; they just aren't automatic.

chrisn20:04:54

If I add any functional abstraction, however, my time shoots up quite a bit:

user> (defn specific-reader [map-fn]
        (let [src-data (typecast/datatype->reader :int64 my-vec)
              dst-data (int-array (dtype/ecount src-data))]
          (parallel-for/parallel-for 
           idx
           (dtype/ecount src-data)
           (aset dst-data idx (unchecked-int (map-fn (.read src-data idx)))))
          dst-data))
             
          
#'user/specific-reader
user> (crit/quick-bench (specific-reader #(* 2 (long %))))
Evaluation count : 42 in 6 samples of 7 calls.
             Execution time mean : 14.582336 ms
    Execution time std-deviation : 84.414539 µs
   Execution time lower quantile : 14.478312 ms ( 2.5%)
   Execution time upper quantile : 14.673229 ms (97.5%)
                   Overhead used : 9.234052 ns
nil

chrisn20:04:02

The names are bad, but here is a summary on my laptop:
mapv - 60 ms
datatype-lib-with-copy-and-datatype-change - 29 ms
datatype-lib-with-copy - 20 ms
lazy-creation-of-result-alone - 7 µs
copy-loop-with-code-embedded - 3 ns
copy-loop-with-passed-in-function - 14 ms
The real answer is that I wouldn't worry about it: for any chain of reductions/elementwise operations, if you need that code block to be fast it is probably not going to be much work. The defaults just don't get you there.

💯 4
chrisn20:04:04

Working with this a bit more, any abstraction I use results in times in the millisecond range. Only the copy loop with the code embedded has real speed; this speaks to being able to generically generate these things for blocks of operations when speed really is that necessary.

chrisn01:05:30

If you just use a range instead of making a vec out of the range then all the times start working out a lot better with the datatype library.

chrisn01:05:22

Long story short, creating the vec out of the range isn't helping. On my computer, either the datatype library or NumPy makes the operation take about 2 ms if the source data is a long array or the range. If it is a persistent vector, then things take longer (60 ms, like above).

chrisn01:05:21

>>> timeit.timeit('my_arr2 = my_arr * 2', 'import numpy as np; my_arr = np.arange(1000000)', number=10)
0.016360879999865574

chrisn02:05:12

tech.main> (def src-rdr (typecast/datatype->reader :int64 (range 100000)))
#'tech.main/src-rdr
tech.main> (crit/quick-bench (dtype/make-container :java-array :int64 (dfn/* src-rdr 2)))
Evaluation count : 246 in 6 samples of 41 calls.
             Execution time mean : 2.622852 ms
    Execution time std-deviation : 286.040484 µs
   Execution time lower quantile : 2.419191 ms ( 2.5%)
   Execution time upper quantile : 3.044613 ms (97.5%)
                   Overhead used : 9.174849 ns
nil
tech.main> (def src-rdr (typecast/datatype->reader :int64 (vec (range 100000))))
#'tech.main/src-rdr
tech.main> (crit/quick-bench (dtype/make-container :java-array :int64 (dfn/* src-rdr 2)))
Evaluation count : 156 in 6 samples of 26 calls.
             Execution time mean : 3.900233 ms
    Execution time std-deviation : 75.754664 µs
   Execution time lower quantile : 3.835530 ms ( 2.5%)
   Execution time upper quantile : 3.995999 ms (97.5%)
                   Overhead used : 9.174849 ns
nil
tech.main> (crit/quick-bench (dtype/make-container :java-array :int64 (dfn/* (range 1000000) 2)))
Evaluation count : 30 in 6 samples of 5 calls.
             Execution time mean : 22.554134 ms
    Execution time std-deviation : 554.997519 µs
   Execution time lower quantile : 22.148475 ms ( 2.5%)
   Execution time upper quantile : 23.349018 ms (97.5%)
                   Overhead used : 9.174849 ns
nil

chrisn02:05:06

I don't know how much of that makes sense, but it appears that you can get roughly equivalent speed on the JVM, using Clojure, just not with the normal things. This was fascinating and really got my timing OCD side going.

David Pham07:04:17

Did you try Neanderthal and the ax function?

David Pham07:04:05

NumPy uses MKL underneath, and I think it leverages the memory layout of your array to perform the operations.

blueberry11:04:41

@neo2551 @jumar Actually, scal! would be the fastest and most appropriate if you just wanted to multiply all entries in a vector by a number, without creating a new instance.
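A rough sketch of that approach with Neanderthal (assuming its native backend is on the classpath; scal! mutates in place, ax is the pure variant David mentioned):

(require '[uncomplicate.neanderthal.core :refer [scal! ax]]
         '[uncomplicate.neanderthal.native :refer [dv]])

;; a native double vector backed by an off-heap buffer
(def v (dv (range 1000000)))

;; in place: multiplies every entry of v by 2.0, no new vector allocated
(scal! 2.0 v)

;; pure variant: returns a new scaled vector, leaving v untouched
(ax 2.0 v)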

Daniel Slutsky13:04:16

Hi. Following a decision of the scicloj organizing team, @teodorlu and I are working on a draft for an update about April 2020. The goal is mainly to reflect on the directions the community is going in, and on the community's goals. Hopefully, it can also serve to make the activity visible to broader audiences. The scope is anything about data in Clojure: data science, data engineering, scientific computing, etc. To prepare it, we will be reading back through the activity here and in other places. If there is anything you would like to add to that update, please tell us. That could be a new release of a library, a discussion worth mentioning, a meeting that took place, etc. (Some of these will be there anyway, but it would be nice to get people's perspective on what is actually noteworthy.)

teodorlu15:04:56

Adding to what @daslu said above: the Clojure data science community is currently moving fast. It's hard to keep track of all the changes, and that's exactly where we want to be! But keeping up with all the moving pieces is challenging. That's something we hope to address with this update, and possibly later updates. For that to happen, we'll rely on a bit of proactive communication from all the people who help move us forward. In return, the update(s) might provide some publicity.

jsa-aerial15:04:39

@daslu @teodorlu What do you mean by an 'update about April 2020'? What is that?

teodorlu15:04:26

A planned blog post for https://scicloj.github.io/ highlighting recent community activity.