Fork me on GitHub

Are there any core.async-oriented Clojure profiling tools for ongoing profiling?


Seems like tufte may be the best I've found, but it doesn't have first class support for core.async.

Ben Sless19:09:53

What capabilities are you missing? I'm usually satisfied with clj-async-profiler and visualVM @jeaye


@ben.sless I haven't used the former, but its documentation seems to be focused on benchmarking, not profiling.


i.e. more criterium than tufte.

Ben Sless19:09:36

The profiler's output is a flame graph of %CPU time on X axis and stack on Y axis. I find it very useful, usually


For VisualVM, I think it could work, but things are made more complicated by me wanting to profile a staging environment within AWS.

Ben Sless19:09:19

Ah, been there 🙂

Ben Sless19:09:33

you need to inject a bunch of flags to your run command


Actually, might be the same issue with clj-async-profile. For tufte, everything was done in-process and then can be exposed through an HTTP endpoint.


That was some of the appeal, I think.

Ben Sless19:09:13

It's a bother, but I have done both. Async profiler can also be exposed as an endpoint

Ben Sless19:09:42

If you have a flag or environment variable which lets you know you're in a staging environment,

Ben Sless19:09:11

I recently wrapped one service with a run script:

Ben Sless19:09:24

#!/usr/bin/env bash

opts="-server -Xmx5g -Xms5g"

if [ "ENVTYPE" = "prod" ]; then
    echo "running in prod"
    opts="$opts -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints"

cmd="java $opts -jar $jar $*"
echo running "$cmd"
exec $cmd

Ben Sless19:09:38

You can inject whatever arguments you'd like, that way

Ben Sless19:09:57

we're using it to run async profiler, we expose it as an endpoint


Thank you. 🙂 We have a whole init setup with docker/ECR, so I can plug stuff into there.


Out of curiosity and thoroughness, you've used VisualVM and clj-async-profiler. Have you used tufte?

Ben Sless19:09:44

Not yet, I have found them adequate for all my use cases until now


Seems like ECS doesn't set perf_event_paranoid , so that may rule async-profiler out.

Ben Sless19:09:27

From what I see tufte is used to profile a specific piece of code, not the entire JVM process?

Ben Sless19:09:43

For that you can consider using jmh

Ben Sless19:09:59

It's more oriented towards benchmarking, but the results are more detailed, like tufte's


Yeah, benchmarking isn't what I'm after.


I did see jmh-clojure in my research.

Ben Sless19:09:37

Trying to understand which features exactly you're looking for, what you want to check, etc.


I think either whole-JVM process for exporatory flame graphs or tufte's approach, where I add profiling to specific parts of the code to see how long things take. The former would be better for finding surprises; the latter would be better for timing things I already know I want timed.


I wouldn't be lying if I said both sounded good. 🙂

Ben Sless19:09:05

Ah!, why not just use metrics?


I am using prometheus + grafana for metrics.

Ben Sless19:09:46

So, wrapping the piece of code under test in a timer wouldn't give you the same amount of information as tufte?


For tufte, they'd be quite similar, but rendered differently and tufte would be more convenient, rather than timing myself and pushing it into prometheus.


I suppose the biggest gain, now that we talk it all out, would be from whole-JVM process flame graph exploration.

Ben Sless19:09:25

I see. It's a bit of a hack but you don't have to push it to prometheus

Ben Sless19:09:45

You can get all the data from the timer object as an endpoint 🙃


Ah. We already have prometheus and grafana up, so any time series data might as well get rendered nicely through grafana.


Ok, refocusing. If I want a whole-JVM process flame graph for exploring the unknown, what's recommended?

Ben Sless19:09:19


Ben Sless19:09:38

use the (prof/profile-for duration options) function

Ben Sless19:09:24

trigger it with a http request sending it the options.

Ben Sless19:09:23

Pretty simple solution. If your flame graph comes out weird, means you need to add JVM flags. If you get an exception, you need perf_event_paranoid

Ben Sless19:09:57

You can then use it to find the function call you're interested in, zoom in on it, and analyze the CPU usage. If you're using go blocks pervasively, the graph will be slightly harder to analyze, because everything will happen on the same thread pool


I'll look into it. I'm currently concerned about perf_event_paranoid within ECS, but we'll see where I can get.


Yeah, everything we do is within go blocks.

Ben Sless19:09:33

I can tell you from experience I had no issues with it on AWS


ECS as well?


i.e. docker

Ben Sless19:09:59

Docker on EC2


Cool. That's good to know. Thanks a lot, Ben.

Ben Sless19:09:48

Good luck, let me know if you found something interesting

Ben Sless19:09:00

To satisfy my curiosity, what are you concerned about which led you to need to profile? Think you might be spending too much time with locks or something else?


@ben.sless We just moved our whole back-end to core.async to try to resolve some scalability issues. We're now fully async, with no blocking anywhere, but our scalability is 100x lower than where we need it and where we need it is still a very reasonable number. So I need to figure out what's taking so much time; we may be fully async, but we have the latency of molasses with even 1k clients.


Yeah, ring, compojure, etc.

Ben Sless07:09:46

And you suspect you might be blocking on your middleware?


@ben.sless Nope, I don't think there's any blocking going on. We checked pretty thoroughly.


I do think we're being quite slow with some things, though; just trying to figure out what those things are.

Ben Sless21:09:42

Then async profiler will serve you well. Swapping from compojure to reitit+muuntaja will speed you up. ring-json uses slurp and regular expressions, performance killer. I PRed it a few days ago to remove slurp but still, switch to #reitit if you want speed. (also jsonista > cheshire)


@ben.sless We have reitit on the list to bring in! Based on some testing we did a few months ago, it gave us a 15% speed up in running all of our tests, which is a surprising amount of time we we're spending doing routing.


We're using jsonista already.

Ben Sless21:09:10

yeah, and when the server is under stress it compounds exponentially


Thanks for all the tips. 🙂


I have clj-async-profiler working from an HTTP endpoint now. Just digging into what's going on.

Ben Sless21:09:59

Best of luck! hope you find some performance lying on the floor for quick wins

Ben Sless21:09:11

Then write a blog post 🙃


hehe, maybe. It's been years since my last blog post. This type of stuff is typically quite fun to write about, though.


What I find very weird is that a quarter samples seem to include the JVM function compiler. This isn't anything I've seen in other JVM flamegraph articles, aside from one, which verified it's the JIT.


Do you typically see that?

Ben Sless04:09:26

did you run with these flags? -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints


Not the first one, but the other two.


I'm using -Djdk.attach.allowAttachSelf -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

Ben Sless04:09:41

I think PreserveFramePointer is required for analyzing JITed code. you should also wait a few minutes before starting the profiler while the JIT does its job

Ben Sless04:09:57

(a few ~3 should be enough)


Yep, I'll add that flag.

Alex Miller (Clojure team)20:09:49

It’s probably worth running a load test and then just taking periodic thread dumps. Generally if you take 10 thread dumps and 8 show the same thing, that’s the bottleneck


emphatic agree here, I've tried much more complicated schemes and not found more useful info. this gets great results fast without all the nonsense of profiler tuning

Alex Miller (Clojure team)20:09:06

This is dumb, but shockingly effective

💯 3

are there plans to add some sort of metrics/profiling help to core-async?

Alex Miller (Clojure team)21:09:37

No plans, not sure what would be genetically useful


Something we watch in production is the latency of the go block thread pool, we periodically throw a task on there that just reports the elapsed time between when it was queued and when it was executed.


It would be interesting to get the queue size out, I forget if I tired to get that


In theory the queue should pretty much always be empty, because tasks run fast, run a little logic, then wait on a channel


So a constant cycle through the queue


emphatic agree here, I've tried much more complicated schemes and not found more useful info. this gets great results fast without all the nonsense of profiler tuning