This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2023-03-03
Channels
- # announcements (15)
- # babashka (143)
- # babashka-sci-dev (2)
- # beginners (35)
- # biff (11)
- # calva (5)
- # cider (8)
- # clerk (4)
- # clj-kondo (58)
- # cljdoc (6)
- # clojure (88)
- # clojure-denmark (1)
- # clojure-europe (77)
- # clojure-nl (1)
- # clojure-norway (16)
- # clojure-uk (1)
- # clojurescript (19)
- # clr (32)
- # code-reviews (158)
- # datahike (5)
- # datomic (10)
- # deps-new (3)
- # fulcro (12)
- # graalvm (20)
- # honeysql (23)
- # hyperfiddle (32)
- # kaocha (17)
- # membrane (6)
- # observability (1)
- # other-languages (2)
- # pathom (5)
- # practicalli (12)
- # reagent (4)
- # reitit (7)
- # releases (1)
- # sci (25)
- # shadow-cljs (52)
I stumbled on a Hacker News article about performance (https://www.computerenhance.com/p/clean-code-horrible-performance) and was curious to write up a quick Clojure solution and see how it did. Someone in the Rust forum did the same (https://www.reddit.com/r/rust/comments/11fkfib/i_ported_casey_muratoris_c_example_of_clean_code/) and according to their github repo (https://github.com/interpol-kun/clean_code_rust), they were able to calculate the total area of a 400,000 shape array in about 350 microseconds. My solution does it in about 66 milliseconds, which is 66,000 microseconds, right? That's almost 200x slower I think. Are we measuring the same thing? What would you do differently here to test the same thing they are?
I definitely find the Clojure code so much nicer than all the other stuff I'm seeing in these discussions, that's for sure lol
And I thought my random-shape generator was fun and way more interesting, and certainly easier to write than what these other attempts would have required for the same thing, but they just hard-coded the same shapes with the same sides over and over again. On that note, what would you change in that function? The repeated (rand-int 10) seems like a code smell but maybe not.
Hey @U7RJTCH6J you might not remember but I finally took your advice and was able to get clj-async-profiler up and running. This probably wasn't the best use case for it idk, but it did give me this flamegraph and I am wondering how you would approach interpreting it:
One thing that confuses me is at some point it seems like it is calling repeatedly and make-random-shape, but I explicitly used def to create that data structure beforehand, so when calling (prof/profile (total-area lots-of-shapes)) I didn't get any of that creation time polluting my total-area calculation.
And it's also showing nrepl.middleware.* things happening. Is that interfering with the profiling of the specific function I want?
repeatedly returns a lazy sequence, so you're seeing its code called as each element is realized, I expect.
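If it helps to see the deferred-work idea concretely, here's the same behaviour with Java streams (just an analogy I'm making up here, not your code): lazily queued work only runs when something forces it, so it gets billed to whatever call forced it.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazySketch {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Builds a lazy pipeline of 5 elements; nothing is generated yet
        Stream<Integer> s = Stream.generate(calls::incrementAndGet).limit(5);
        System.out.println(calls.get()); // 0 -- no elements realized yet
        int sum = s.mapToInt(Integer::intValue).sum(); // realization happens here
        System.out.println(calls.get()); // 5 -- generated inside the summing call
        System.out.println(sum); // 15
    }
}
```

Which is why the generator frames can show up inside the profiled call rather than where the sequence was defined.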
I’m away from keyboard, but you can also check this discussion https://clojurians.slack.com/archives/C03L9H1FBM4/p1677651571661219
You've also got a lot of boxed arithmetic going on -- I expect the Rust code is all primitives.
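A rough Java sketch of what boxed vs primitive arithmetic means on the JVM (illustrative only, not from either repo):

```java
public class BoxedVsPrimitive {
    public static void main(String[] args) {
        final long n = 1_000_000;

        // Boxed: every += unboxes the Long, adds, and allocates a new Long --
        // roughly what happens when values come out of hash maps as Objects
        Long boxedSum = 0L;
        for (long i = 0; i < n; i++) boxedSum += i;

        // Primitive: plain register arithmetic, no allocation
        long primSum = 0;
        for (long i = 0; i < n; i++) primSum += i;

        System.out.println(boxedSum.longValue() == primSum); // true
        System.out.println(primSum); // 499999500000
    }
}
```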
I tried to throw some type hints in there but didn't see any improvements. How would I do that? Would something like unchecked-math help?
And so you think creating the shape array is being done at the same time I am calling total-area?! That would definitely screw up the findings, right? How do I not have that happen?
doall
Addressing the unboxed math will require more messing with the code than just type hints, I think, since you're putting longs in hash maps and getting them back out via destructuring, so that pretty much forces them to be boxed (to Long and then treated as Object)
Also, dispatch via defmulti is going to be a lot slower than dispatch via a protocol, so you'll want records (containing primitives) instead of hash maps, and protocols for your various size calculations.
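Roughly, records plus a protocol compile down to something like this on the JVM (my own sketch with made-up names, not the actual generated classes): a small class with primitive fields implementing an interface method, so dispatch is a virtual call the JIT can inline.

```java
import java.util.List;

interface Area {
    double area();
}

final class Rect implements Area {
    final double width, height; // primitive fields, no boxing
    Rect(double w, double h) { width = w; height = h; }
    public double area() { return width * height; }
}

final class Circle implements Area {
    final double radius;
    Circle(double r) { radius = r; }
    public double area() { return Math.PI * radius * radius; }
}

public class ProtocolSketch {
    public static void main(String[] args) {
        List<Area> shapes = List.of(new Rect(2, 3), new Circle(1));
        double total = 0;
        for (Area s : shapes) total += s.area(); // interface dispatch per shape
        System.out.println(total == 6.0 + Math.PI); // true
    }
}
```

Compare that to a hash map per shape: every field access chases pointers and unboxes.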
(off the top of my head -- I'm a bit too busy right now to fire up a REPL to try anything out)
Oh no worries. I figured defmulti wouldn't be the fastest (I think that was sort of the point of the original blog post?) but that's how I would have written it naturally, so I was curious to see how that would perform.
That #C03L9H1FBM4 channel discussion was exactly what I was hoping for here too. New channel for me
Fixing that silly lack-of-doall error saved me about 10 milliseconds, which is about 15%. The flamegraph confirms it. That profiling tool is cool
> new channel for me
Somewhat terrifyingly, we have nearly 900 channels here now...
Adding type hints (and uglifying my code) and the unboxed math thing "just" saved me another 8 ms so in total it all went down from 66 to 48 ms.
And protocols?
From phronmophobic's attempts in that other thread, protocols seem to be about 30% faster if I'm reading his results correctly.
I had another question about my flamegraph. What is going on with that whole first 2/3rds of the graph? On all the pictures I see of others using this tool it only has that green and blue part. Is something happening in my system that is taking a long time before my actual logic code is run? Is that screwing up my results?
No idea. I've never actually played with flamegraph stuff before (but now I might 🙂 )
I lowered the number of shapes in my test to 1000 to match what phronmophobic was using and my code still takes 130 microseconds to his 56. Not sure if that's because our code might be slightly different still or if the actual computer you run it on (mine is old and slow) affects benchmarks like this.
@U7RJTCH6J is computing over 1,000 shapes? And the target is 400,000? So 400 * the fastest time he has? 20 mu seconds -- compared to 600 mu seconds for Rust?
no, the Rust example I found was using 400,000 shapes. phronmophobic was just using 1000 in his benchmark. Rust seems to destroy Clojure here.
Rust is a heavily optimized systems programming language so that's not surprising -- for a math/compute-heavy example.
Definitely. My Rust code always beats my Clojure code. It's not that bad to write either, I enjoy it for what it is. That example repo I linked at the beginning was quick, simple, and readable. It's still a far second place on the joy scale for me though compared to Clojure especially as I don't really do much low level stuff (yet)
A lot of these "performance" benchmarks involve ridiculous math computation stuff which isn't at all realistic -- in terms of what "Real World(tm)" programs actually do (lots of database access, some general data structure transforms, not much math 🙂 )
I figured. I am just putting out my first applications this week to join this professional world so I am curious to see how it looks in the "real world"
I have to say, I like Rust. If I had to go back to the sort of environments where I used to write C or C++ (which I did for a decade and a half), I'd want to use Rust instead now.
When I learned Go, I hoped it would be more like what Rust turned out to be 🙂
(I do not like Go)
Same. I've tried a couple times to give it a fair shot and I dislike pretty much all of it tbh but I don't begrudge those who like it
A really good friend of mine is a Developer Advocate at Google, for the cloud platform, so he is heavily into Go and loves it... But it just isn't for me...
I do hate how, in the example given at the start of this discussion, to optimize for performance they basically take "short cuts" and use width and height to represent all the lengths of every shape, including using width and height for the radius calculations in the circle area functions. Is that the kind of stuff you see in the real world? hahaha
It feels like the kind of stuff you get asked in l33t coding interviews 😉
Yeah, I've seen your comments on current interviewing practices. I'm hoping I can avoid some of that in my upcoming interviews. I'm focusing on Clojure so we shall see if our niche is more enlightened (?).
Thanks for the convo. It has been a fun night of nerd sniping, learning benchmarking and profiling tools a little better, and getting through 50 songs on this Spotify Blues Classics playlist.
I was playing with performance tuning a prime sieve some while ago and the Rust contributions just crushed my fastest Clojure (and Java) implementations. I had great use of the flame graphs, and also from decompiling the Clojure code, to see where thing like type hinting took me. Here's a recorded live stream where we try to share a bit of the process: https://www.youtube.com/watch?v=T_wuPrHIupU
sorry to ping you again @U7RJTCH6J but could you take a look at this question I had about the flamegraphs?
> I had another question about my flamegraph. What is going on with that whole first 2/3rds of the graph? On all the pictures I see of others using this tool it only has that green and blue part. Is something happening in my system that is taking a long time before my actual logic code is run? Is that screwing up my results?
Are people just cropping that beginning part out of their screenshots or is this something going on with my system?
That means it's spending a lot of time in garbage collection. Usually, that means you either don't have enough memory, or you're hanging on to references too long so that memory can't be freed.
The flame graph isn't fully sorted by time, so the fact that it takes up the first half of the picture doesn't mean that the first half of the measurements are in GC.
The Rust version does a few things very differently that you might want to consider:
• It uses float32, vs the double-width float64 that Clojure uses
• It uses structs, vs records which are backed by hash maps (more pointer chasing)
So maybe just using deftype might bring you quite noticeable speedups? I never used it, but I assume you get the equivalent of structs that way.
Yep, that did create some speedups in the #C03L9H1FBM4 thread linked above. Reading some of the online conversations and Casey's (the OP author's) opinion that the polymorphism doesn't just slow code down but is also less readable and maintainable, I think I disagree. The multimethods are definitely slower, but they seem like they would be much better for maintenance in this contrived example, right? If you want to add a new shape you just create a new defmethod. total-area would still work. This high-level architecture stuff is a little beyond me though.
Well we're talking about actual math computation here. Code like this might be used in say a video game (engine), some other graphics related program, or to do the heavy lifting in some data pipeline etc. But I'm personally a big fan of relational structures, multimethods, trees of maps etc. Because in the type of web dev I do it's all about dates, texts, documents, rules, request/response cycles, data storage and queries. All asynchronous and irregular computation... Not the type of continuous heavy throughput/latency numeric stuff that a game programmer does.
Muratori is concerned about "How can I push as much data through the CPU as possible during a single frame" and stuff like that. Doesn't mean we shouldn't take some general advice here. But each domain should apply it differently IMO.
Yep, that all makes sense. One day I'm going to try my hand at making a video game and get some hands on experience on getting everything processed in that quick ole 16ms. I'll probably use Rust for it though.
Or maybe even this Muratori fellow's buddy, Jon Blow's Jai: https://clojurians.slack.com/archives/C16LEKSLT/p1677885977213679
Here's my attempt:
https://gist.github.com/dgb23/db063716b1e5a25c87483037f9b864b7
I also ran the rust version locally, got roughly the same results as published.
My own version is within 10x of the fastest Rust implementation, it seems: 3 ms for 400'000 entries, about 7 microseconds per 1000 entries.
I tried to do something that is both extensible (Clojure-y) but also cares about memory layout. I looked at different formulas for area calculation and thought about making it easy to add arbitrary area definitions.
It can certainly be written more nicely. And it would be possible to extend this with a remove-shape function if I maintain some unique ids that map to indices on the storage.
And I'm not actually sure if there's something fundamentally wrong with it. 😄
If anyone has tips to speed it up or write more nicely I'm stoked to hear about it!
I thought having a heterogeneous collection of shapes was part of the requirements. I think this version is interesting, but if you pre-sort the shapes into different buckets, it seems like that's measuring a different thing.
It's mentioned in the other thread, but I'm curious how an implementation built with dtype-next would compare. It provides a high level interface and should be able to get very good performance.
@U9J50BY4C most of your flame graph is the C2 compiler optimizing your code. Run your benchmark 3 times in a row in the same process to get "hot" results. Remember that most perf on the JVM comes from JIT compilation.
Thanks Ben, that makes sense. Things like Criterium take that into account, right? It looks like that is possible with this profiler too. The readme shows a simple (dotimes... way of doing it.
@U7RJTCH6J I didn't know / or forgot that requirement. But for me it's obvious to put them into an appropriate bucket. Not only might it be better for layout, but also you're only going to ask for a function once per bucket. Instead of every time. Related tech/paradigms would be relational data and data oriented design. ECS does that as far as I'm aware. If the goal is to have generic interaction but cache/cpu friendly computation, then you want something that looks this way I think. You can somewhat get the best of both worlds. You could even go further and ask for a storage mechanism per type of component. I give you a uniform interface and you give me the functions, storage, and maybe even an iterator. That unifies the need for abstraction and generality with the need for efficiency.
I think this (abrasive) twitter thread goes into a bit more detail on how he perceived the comparison should go https://twitter.com/cmuratori/status/1631045864558051328
Yep, basically the same idea. To me, all these methods are interesting. I would love to see a table of all the different techniques and how they compare and what the trade-offs are. @U01EFUL1A8M, I don't think there were any explicit requirements or rules. In my head there's a difference between:
• unsorted, heterogeneous sequence of shapes -> total area
• shapes organized by type -> total area
I think the pre-organized technique is a good one to analyze, but I'm not sure it's a totally fair comparison since you're moving some of the calculation out of the benchmark into work done ahead of time. As an example, I get a 2x speedup in my benchmark just by using a sequence sorted by type compared to an unsorted sequence of shapes.
shapes> (bench
(total-area-switch switch-shapes))
;; Execution time mean : 41.113578 µs
vs.
shapes> (let [sorted-switch-shapes (vec
(sort-by :type switch-shapes))]
(bench
(total-area-switch sorted-switch-shapes)))
;; Execution time mean : 19.228138 µs
Sorting by types gives you limited monomorphism, makes relevant things remain in cache, etc
Right, but if you include the time to sort it might not be faster. If you pre-sort then it can make other parts of your program slower. For the most part you can swap protocols for multimethods for interfaces, but sorting is this extra step (which might be a great idea, but depends on the situation).
Usually in those systems you don't sort, but already keep different areas and insert to them in order
Any logic that is "for all" doesn't care about maintaining order, so opportunistically inserting into dedicated arrays doesn't matter
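e.g. something like this (hypothetical sketch, names are mine): one dedicated bucket per shape kind populated at insert time, so the total-area pass runs one tight, monomorphic loop per kind. A serious version would use flat primitive arrays rather than `List<double[]>`.

```java
import java.util.ArrayList;
import java.util.List;

public class ShapeBuckets {
    final List<double[]> rects = new ArrayList<>();   // {width, height}
    final List<double[]> circles = new ArrayList<>(); // {radius}

    void addRect(double w, double h) { rects.add(new double[]{w, h}); }
    void addCircle(double r) { circles.add(new double[]{r}); }

    double totalArea() {
        double total = 0;
        // One shape kind per loop: no per-element dispatch at all
        for (double[] r : rects) total += r[0] * r[1];
        for (double[] c : circles) total += Math.PI * c[0] * c[0];
        return total;
    }

    public static void main(String[] args) {
        ShapeBuckets b = new ShapeBuckets();
        b.addRect(2, 3);
        b.addCircle(1);
        System.out.println(b.totalArea() == 6.0 + Math.PI); // true
    }
}
```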
having different buckets for each type is sorting the elements by type.
> Any logic that is "for all" doesn't care about maintaining order, so opportunistically inserting into dedicated arrays doesn't matter
It depends. Not all operations are "for all". Not all reductions are commutative. There are other orders and sorts that might be important. I'm not saying there aren't plenty of good use cases for bucketing objects by type, but there are trade-offs.
Here's another thing that I don't quite understand and might be a possible speedup:
I don't quite understand what it actually means to put stuff defined with deftype into a vector. Is this a vector of pointers? I assume so? Can the compiler or JIT infer the exact size and keep the backing array compact?
So I wonder what the data actually (or likely?) looks like, and whether there are ways to control this?
The vector is of pointers, I think if you put it in an array you can get contiguous memory
switching to deftype from defrecord shaves a few more ms, that's as far as I can get
I didn't do all the variations on every alternative, partly due to laziness and because I sort of knew what the fastest result would be
That's 88us / 100'000 shapes right? So in comparison to the rust version which is around 350us per 400'000 that's roughly the same throughput!
I'm showing 465 microseconds for the fastest rust bench and 1.37 milliseconds for yours. I'm not sure if they are for the same, corresponding strategies. I'm on like a 6 or 7 year old i7. I ran the rust one twice and ran yours a few times. Let me know if you want me to check anything else.
The separate data is a bit of a cheat, so I'm not considering it, but I'm happy with the solution being in the same order of magnitude.
Any other ideas how I can speed mine up?
I would say separating the data into an array for each type is no less cheating than using a generic container that just happens to be able to cover these 4 shapes (factor, width, height). Other area formulas that are not in the example here might have different arities (for example many points etc.) so it's not really extensible. The "clean code" idea is to have an extensible mechanism without resorting to changing the dispatch function itself (turn a switch into dynamic dispatch). So it is really about generality and extensibility and not about how exactly to achieve it.
Another issue is that you're reading the constant, and possibly other parameters that you made constant, from memory (or cache) every time you calculate instead of having the constant (like Pi) on the stack. Plus you are doing additional calculations that you don't need to do.
I would assume that if you combined your code with the idea of having separate storage for each type that you can manipulate generically (add-shape or add-shape-type can be multimethods even!), then you'd get the best of both worlds and a noticeable speed-up.
What I meant is that for example a trapezoid area formula needs a constant and three parameters. You'd have to extend the struct to cover the maximum possible parameters for a formula and all the formula that have <Max parameters then need to set them to 0 or 1 and do additional operations that are not necessary for that specific calculation.
that's fair, but to be fair to the original point, I think the idea is to only make those changes as necessary, instead of designing in advance for growth
Something like
(defn total-area-prim-record-method-reduce
[shapes]
(reduce (fn ^double [^double acc shape] (+ acc (get-area shape))) 0 shapes))
Because you can type hint the return type, so the generated code becomes like (decompiled):
public double area() {
return 3.141592653589793 * Math.pow(this.radius, 2L);
}
And then with:
(defn total-area [shapes]
(loop [n (double 0)
[^IArea2 s & r] shapes]
(if s
(recur (+ n ^double (.area s)) r)
n)))
it’s getting quite fast.
Do you get a speedup if you use an object array to store the shapes and use aget to iterate instead of the seq functions?
Generally speaking, if you do use loop/recur + seq functions, I use the following pattern, which avoids an unnecessary call to seq per loop compared to using destructuring.
(loop [xs (seq xs)
ret []]
(if xs
(let [x (first xs)]
(recur (next xs)
(conj ret (f x))))
ret))
Array gets speedup for sure.
(defn total-area [^"[Larea2.IArea2;" shapes]
(let [cnt (count shapes)]
(loop [n (double 0)
i (int 0)]
(if (< i cnt)
(recur (+ n (.area ^IArea2 (aget shapes i)))
(inc i))
n))))
results in
public final class area2$total_area extends AFunction
{
public static Object invokeStatic(final Object shapes) {
final int cnt = RT.count(shapes);
double n = 0L;
double n2;
for (long i = (int)0L; i < cnt; ++i, n = n2) {
n2 = n + ((IArea2)RT.aget((Object[])shapes, (int)i)).area();
}
return n;
}
@Override
public Object invoke(final Object shapes) {
return invokeStatic(shapes);
}
Which seems rather close to what you'd write in Java.
It's nice to have the option to do that where it matters, and use more convenient options when it doesn't.
Of course for this silly example, you could go as far as to encode your data in tuples of three: [type | l1 | l2] so then you could:
(defn ->circle [r]
(double-array [0 r 0]))
(defn ->square [h]
(double-array [1 h 0]))
(defn ->rect [h w]
(double-array [2 h w]))
(defn ->triangle [h w]
(double-array [3 h w]))
And then have an area fn which dispatches on the first in the array, but I guess you'd be proving the point of the guy in the video.
In theory, Clojure would allow you to encode the semantic information of what process transformation you want and independently choose a performant implementation that satisfies the requirements (i.e. tools for building a sufficiently smart compiler).
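For what it's worth, that dispatch-on-slot-0 idea would look roughly like this in Java terms (my own sketch; the tag numbering just mirrors the ->circle/->square/->rect/->triangle constructors above):

```java
public class TagDispatch {
    // Each shape is a double[]{tag, l1, l2}; area() switches on the tag
    static double area(double[] shape) {
        switch ((int) shape[0]) {
            case 0: return Math.PI * shape[1] * shape[1]; // circle: l1 = radius
            case 1: return shape[1] * shape[1];           // square: l1 = side
            case 2: return shape[1] * shape[2];           // rect: l1 * l2
            case 3: return 0.5 * shape[1] * shape[2];     // triangle: l1 * l2 / 2
            default: throw new IllegalArgumentException("unknown tag: " + shape[0]);
        }
    }

    public static void main(String[] args) {
        double[][] shapes = {{0, 1, 0}, {2, 2, 3}}; // circle r=1, rect 2x3
        double total = 0;
        for (double[] s : shapes) total += area(s);
        System.out.println(total == Math.PI + 6.0); // true
    }
}
```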
I think project valhalla would also provide some tools to improve performance.
Anyways, I got this (on my machine) from around 50ms on the multimethod impl down to 8ms with the definterface plus arrays impl.
Yea, it seems possible to write some macrology that looks basically the same as defprotocol, but uses float arrays for everything and is much faster.
I added a version that uses a single double array to hold all the data and it wasn't any faster than using a defrecord :thinking_face: https://github.com/phronmophobic/shape-bench/blob/main/src/shapes.clj#L354
As for what is idiomatic or not: I always understood that Clojure is intentionally designed to be a pragmatic language that ultimately cares about the running program. Here's how you can do it by default, and here are all the tools to make other approaches possible and sane. To me it's not a patronising language but an enabling one. As for the double array: I tried that too and it was slower than vectors with deftype. Which confused the hell out of me. Not sure why that is.
I did find that if I inlined the area calculation, it was about 2x faster.
But in terms of the double array: How many KBs large is it? From your code it seems to be 192kb. Is that about right? It might just be that the array is too large to fit in cache
vectors have the disadvantage of using pointers, but they are chunked nicely into 32 elements I think.
> vectors have the disadvantage of using pointers, but they are chunked nicely into 32 elements I think.
Right, but those are 32 elements of pointers.
> It might just be that the array is too large to fit in cache
Obviously, my intuition is wrong here, but I thought that shouldn't matter. As long as you are scanning through memory at a predictable stride.
Ok I'm out of my depth here, but I assume that you only get stuff into cache which fits as a whole
I don't think so. I don't think the cache knows about types, arrays, records, etc, Only about memory addresses.
If otherwise the CPU would have to generate additional instructions to load the chunks?
I don't think that's how it works. Simplistically, I think it's more like an LRU cache with some special optimizations for scanning through memory at a consistent stride (since that's a common and useful heuristic). At least, that's how I remember it. Would love to get a refresher or correction.
Here's someone explaining it on SO. It seems they know their stuff? https://stackoverflow.com/questions/27975715/does-size-of-array-in-c-affect-time-to-access-or-change-a-piece-of-data-in-the-a
> For another example, if you use a sequential access pattern (e.g. "for each element in array { ... }") then it's likely the CPU might do hardware pre-fetching and you won't pay the full cost of cache misses.
The idea behind putting all the data in one double array is to try to get on the happy path of the "sequential access pattern"
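The contrast being discussed looks something like this (a sketch, no timings claimed): same double[], same elements, only the visit order differs. The sequential walk has a fixed 8-byte stride the prefetcher can predict; the shuffled walk doesn't.

```java
import java.util.Random;

public class AccessPatterns {
    public static void main(String[] args) {
        int n = 400_000;
        double[] areas = new double[n];
        for (int i = 0; i < n; i++) areas[i] = (i % 100) * 0.5;

        // Sequential access: predictable stride
        double seqSum = 0;
        for (int i = 0; i < n; i++) seqSum += areas[i];

        // Shuffled access: same work, unpredictable addresses
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Random rnd = new Random(42);
        for (int i = n - 1; i > 0; i--) { // Fisher-Yates shuffle
            int j = rnd.nextInt(i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        double shufSum = 0;
        for (int i = 0; i < n; i++) shufSum += areas[order[i]];

        // Same total either way (exact here: every value is a multiple of 0.5)
        System.out.println(seqSum == shufSum); // true
    }
}
```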
Then the question is what gets recognized as "sequential access pattern" and what doesn't. And what "not the full cost of cache misses" means.
Yea, this kind of stuff can get quite tricky. It's also possible that when using records, the records themselves usually end up located mostly sequentially in memory and you end up getting the "sequential access pattern" regardless.
interesting:
(import 'org.openjdk.jol.vm.VM)
(defn address-of [o]
(.addressOf (VM/current) o))
(def addresses
(into []
(map address-of)
switch-shapes))
(frequencies (map - (rest addresses) addresses))
;; {-88 966,
;; -256 30,
;; -69600 1,
;; -139064 1,
;; 301488 1}
So 96.6% of the shapes are exactly 88 bytes apart (and one shape is very lonely)
assuming I didn't do anything stupid
If you randomize the order of the shapes in the vector, you get a 2x slowdown:
(let [shuffled-switch-shapes
(vec
(shuffle switch-shapes))]
(bench
(total-area-switch shuffled-switch-shapes)))
I figured my understanding is too shaky and I need to read up on it a bit 😄 https://en.wikipedia.org/wiki/Cache_placement_policies
Well, shuffling is closer to random access than sequential access.
Actually, now my results are weird. Let me double check some things.
Ok, I must have done something stupid, because now it's showing that shuffling doesn't make a difference.