Fork me on GitHub

When performance benchmarking, libraries like criterium often use Java APIs to get the current time in nanoseconds. I was doing some benchmarking recently using criterium, and got some results that surprised me at first, so I decided to run the whole batch again, and then a third time (it was only 15 minutes each batch, but far from instant gratification). The 2nd and 3rd batches were fairly close to each other, but made the first look like the measurements were probably off in pretty significant ways. My laptop might have gone to sleep somewhere during that run, but I am not sure if (a) it actually did, or (b) if it did, whether that would have caused so many of the results to be off (I could understand one outlier pretty easily, but this was multiple 30-second runs that had off results).


I know that the most straightforward advice is: Make sure the machine did not go to sleep during benchmarking, and I've taken steps to avoid that now, but it makes me wonder if it would be easy to add something to criterium, or benchmarking libs in general, that can detect this and preemptively warn you that the results are probably wrong.


simplest thing I can imagine would be to start up a parallel thread that does nothing but periodically ask for the elapsed real world time via some API that would have a noticeable "gap" if the machine went to sleep.

Wolfram Schroers16:11:58

Hi @U0CMVHBL2, it depends on what you consider “simple”. If you are able to run the task at the command line (and assuming you have macOS or Linux) you can prepend the task with time and get different time stats for the runtime of the entire process. This can tell you how much time the process had been running and how long it had actually been “active”.


The advice I've heard, even with criterium, is to do one run to warm up the JVM and ignore the results, and then do at least two runs for actual timing purposes.


So I always do at least 3 criterium runs and just ignore the first one.

Alex Miller (Clojure team)16:11:16

really you should probably do that across multiple jvm starts as well before you really believe it

Alex Miller (Clojure team)16:11:06

generally I don't put much stock in any number reported in nanoseconds - better to make bigger chunks to compare

Alex Miller (Clojure team)16:11:23

not sure if you were using bench or quick-bench but it's not unusual to see different results between the two


Definitely true on the bigger runs. criterium already tries to do that by rerunning your expression as many times as it takes to fill up 60 seconds for the full one, which I usually change to 30 since I'm doing a dozen or so expressions and they are all pretty quick.


The big warning flag I got on one run was negative numbers, which I may dig into and figure out when/why that can occur. I definitely don't believe that one.


The more worrying case was the numbers that were low, but didn't look impossibly low, but were not matched by later runs of the same expressions.

Alex Miller (Clojure team)17:11:10

the answer is 37 - what is your question?


So yeah, a general rule is to measure more than once, or in more than one way if possible, and I will do that here, too.


FYI, Alex, I am doing little benchmarks on some potential optimizations to some operations on Clojure's built-in vectors. There are multiple operations where they use for (i=0; i < count(); i++) do_something(nth(i)); which is fine for small vectors, but for about 33 and up the nth(i) is redoing a lot of tree walking work on each element. Trying out a small patch that uses Java iterators on the vector elements instead.


Mostly things in e.g. doEquiv, doEquals, hashCode, hasheq, indexOf, compareTo, maybe one or two others. Anything that linearly scans forward through most or all of the elements.


Early measurements indicate possible 20% to 40% reduction in time for such operations on vectors with at least 33 elements, which I know is not the common case in most Clojure applications, so nothing to get extremely excited about.


Unfortunately the current change I'm testing makes = (the only one I've measured so far) slightly slower for small vectors (<= 32 elements), so next I will see about a change that uses existing code for small ones, and iterators for large ones.


Anyway, mainly tinkering at the fringes because of things I've noticed from reading the code while working on core.rrb-vector. Nothing I expect to get into the master release any time soon, if ever.

Alex Miller (Clojure team)17:11:56

cool, seems worth trying


Sounds like clock drift/ntp might also be effecting you if you're seeing negative numbers.


System/nanoTime should be monotonic if that is what is being used, so ntp could cause 0 elapsed time to be reported, but not negative


criterium does some calculations that I am not intimately familiar with before showing the times, obviously, if it can in some cases show negative numbers. My current suspicion is that it has ways of estimating and subtracting out the "measurement loop overhead" that criterium itself introduces, and that somehow turned out to be way too large of a number that was subtracted from the actual (elapsed time / number of iterations), which was positive.


Negative time is a clear sign the results cannot be trusted. I know that one benchmark run to another can vary easily by plus or minus 10%, even in Criterium runs that execute the same code as many times as possible in 30 seconds, even for compute-bound code with no I/O. It is the larger variations that shake my trust.


have you looked at at all, it seems like the tool dejour used for benchmarking java, so it might have more man hours behind it than criterium. I haven't looked at it at all, so I am not sure what it would take to use it with/from clojure

Alex Miller (Clojure team)18:11:19

there is a clojure harness for it



Alex Miller (Clojure team)18:11:41

jmh is somewhat cumbersome to use from clojure iirc b/c it relies on subclassing and/or annotations

Alex Miller (Clojure team)18:11:23

but JMH itself is truly high-quality stuff created and used by people working on java itself, so highly recommended


In the past, when criterium has shaken my trust, I have often dug into it to find out what it does in more detail. It may be time for another round of that for me 🙂


Could it be that your processor was throttling the third time?


This is sometimes an issue when I've compiled something heavy and do benchmarks while my fans are still spinning ;)


I doubt it, given that I saw that only 1 of 8 cores was busy during the times I was watching. I am nearly certain it did go to sleep for 30 to 60 minutes somewhere in the middle, though, based upon some modification times on different output files.