2020-09-27
Channels
- # announcements (1)
- # babashka (7)
- # beginners (60)
- # calva (24)
- # cider (2)
- # cljdoc (9)
- # clojure (21)
- # clojure-australia (1)
- # clojure-europe (2)
- # clojurescript (26)
- # core-async (89)
- # datalog (11)
- # datomic (40)
- # fulcro (24)
- # graalvm (5)
- # luminus (1)
- # meander (1)
- # numerical-computing (1)
- # off-topic (3)
- # shadow-cljs (6)
- # test-check (10)
- # tools-deps (7)
- # xtdb (18)
Seems like tufte may be the best I've found, but it doesn't have first-class support for core.async.
What capabilities are you missing? I'm usually satisfied with clj-async-profiler and VisualVM @jeaye
@ben.sless I haven't used the former, but its documentation seems to be focused on benchmarking, not profiling.
The profiler's output is a flame graph with %CPU time on the X axis and the stack on the Y axis. I find it very useful, usually.
For VisualVM, I think it could work, but things are made more complicated by me wanting to profile a staging environment within AWS.
Actually, it might be the same issue with clj-async-profiler. With tufte, everything is done in-process and can then be exposed through an HTTP endpoint.
It's a bother, but I have done both. Async profiler can also be exposed as an endpoint
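For reference, a rough sketch of what that can look like, assuming clj-async-profiler's clj-async-profiler.core namespace (start/stop, plus serve-files for browsing results) wired into a plain Ring handler; the route names and handler shape here are only illustrative:
(require '[clj-async-profiler.core :as prof])

(defn profiler-handler
  "Illustrative Ring handler: start/stop the profiler over HTTP."
  [{:keys [uri]}]
  (case uri
    "/profiler/start" (do (prof/start {})
                          {:status 200 :body "profiling started"})
    "/profiler/stop"  {:status 200
                       :body (str "flamegraph written to " (prof/stop {}))}
    {:status 404 :body "not found"}))

;; The generated flamegraph files can also be browsed directly:
;; (prof/serve-files 8080)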
If you have a flag or environment variable that lets you know you're in a staging environment, you can add the profiling flags conditionally:
#!/usr/bin/env bash
# Base JVM options
opts="-server -Xmx5g -Xms5g"
if [ "$ENVTYPE" = "prod" ]; then
  echo "running in prod"
else
  # Outside prod, add the flags that make async-profiler output accurate
  opts="$opts -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints"
fi
jar=path/to/your.jar
cmd="java $opts -jar $jar $*"
echo running "$cmd"
exec $cmd
Thank you. 🙂 We have a whole init setup with docker/ECR, so I can plug stuff in there.
Out of curiosity and thoroughness, you've used VisualVM and clj-async-profiler. Have you used tufte?
From what I see tufte is used to profile a specific piece of code, not the entire JVM process?
It's more oriented towards benchmarking, but the results are more detailed, like tufte's
Trying to understand which features exactly you're looking for, what you want to check, etc.
I think either whole-JVM process for exploratory flame graphs or tufte's approach, where I add profiling to specific parts of the code to see how long things take. The former would be better for finding surprises; the latter would be better for timing things I already know I want timed.
So, wrapping the piece of code under test in a timer wouldn't give you the same amount of information as tufte?
For tufte, they'd be quite similar, just rendered differently, and tufte would be more convenient than timing things myself and pushing them into Prometheus.
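To make the comparison concrete, here's a minimal sketch of the tufte side, assuming taoensso.tufte's p/profile macros and its built-in println handler; parse-body, query-db and render are placeholders, not functions from the code being discussed:
(require '[taoensso.tufte :as tufte])

;; Simplest built-in sink for the collected stats
(tufte/add-basic-println-handler! {})

(defn handle-request [req]
  (tufte/profile {}                                 ; collect stats for the p forms below
    (let [parsed (tufte/p :parse (parse-body req))  ; parse-body/query-db/render are
          data   (tufte/p :query (query-db parsed))];  hypothetical placeholders
      (tufte/p :render (render data)))))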
I suppose the biggest gain, now that we talk it all out, would be from whole-JVM process flame graph exploration.
Ah. We already have Prometheus and Grafana up, so any time series data might as well get rendered nicely through Grafana.
Ok, refocusing. If I want a whole-JVM process flame graph for exploring the unknown, what's recommended?
Pretty simple solution. If your flame graph comes out weird, it means you need to add the JVM flags. If you get an exception, you need to adjust the kernel's perf_event_paranoid setting.
You can then use it to find the function call you're interested in, zoom in on it, and analyze the CPU usage. If you're using go blocks pervasively, the graph will be slightly harder to analyze, because everything will happen on the same thread pool
I'll look into it. I'm currently concerned about perf_event_paranoid within ECS, but we'll see where I can get.
To satisfy my curiosity, what are you concerned about which led you to need to profile? Think you might be spending too much time with locks or something else?
@ben.sless We just moved our whole back-end to core.async to try to resolve some scalability issues. We're now fully async, with no blocking anywhere, but our scalability is 100x lower than where we need it and where we need it is still a very reasonable number. So I need to figure out what's taking so much time; we may be fully async, but we have the latency of molasses with even 1k clients.
@ben.sless Nope, I don't think there's any blocking going on. We checked pretty thoroughly.
I do think we're being quite slow with some things, though; just trying to figure out what those things are.
Then async profiler will serve you well.
Swapping from compojure to reitit+muuntaja will speed you up. ring-json uses slurp and regular expressions, which is a performance killer. I PRed it a few days ago to remove slurp, but still, switch to #reitit if you want speed. (also jsonista > cheshire)
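For what it's worth, a minimal sketch of that kind of setup, assuming reitit.ring, its muuntaja format middleware, and muuntaja's default (jsonista-backed) instance; the /api/ping route is just a stand-in:
(require '[reitit.ring :as ring]
         '[reitit.ring.middleware.muuntaja :as muu]
         '[muuntaja.core :as m])

(def app
  (ring/ring-handler
    (ring/router
      ["/api"
       ["/ping" {:get (fn [_] {:status 200 :body {:pong true}})}]]
      {:data {:muuntaja   m/instance               ; jsonista under the hood
              :middleware [muu/format-middleware]}})
    (ring/create-default-handler)))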
@ben.sless We have reitit on the list to bring in! Based on some testing we did a few months ago, it gave us a 15% speed-up in running all of our tests, which is a surprising amount of time we were spending doing routing.
I have clj-async-profiler working from an HTTP endpoint now. Just digging into what's going on.
hehe, maybe. It's been years since my last blog post. This type of stuff is typically quite fun to write about, though.
What I find very weird is that a quarter of the samples seem to include the JVM's compiler. This isn't anything I've seen in other JVM flame graph articles, aside from one, which verified it's the JIT.
did you run with these flags? -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints
I'm using -Djdk.attach.allowAttachSelf -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints
I think PreserveFramePointer is required for analyzing JITed code. You should also wait a few minutes before starting the profiler while the JIT does its job.
It’s probably worth running a load test and then just taking periodic thread dumps. Generally if you take 10 thread dumps and 8 show the same thing, that’s the bottleneck
emphatic agree here, I've tried much more complicated schemes and not found more useful info. this gets great results fast without all the nonsense of profiler tuning
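A tiny sketch of that idea using only the plain JVM thread API (Thread/getAllStackTraces); nothing here is specific to any profiler:
(defn thread-dump
  "Print the current stack of every live thread."
  []
  (doseq [[t stack] (Thread/getAllStackTraces)]
    (println "---" (.getName ^Thread t) (.getState ^Thread t))
    (doseq [frame stack]
      (println "   " (str frame)))))

(defn periodic-dumps
  "Take n dumps, interval-ms apart, while the load test runs."
  [n interval-ms]
  (dotimes [_ n]
    (thread-dump)
    (Thread/sleep interval-ms)))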
No plans, not sure what would be generically useful
Something we watch in production is the latency of the go block thread pool, we periodically throw a task on there that just reports the elapsed time between when it was queued and when it was executed.
In theory the queue should pretty much always be empty, because tasks run fast, run a little logic, then wait on a channel
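A sketch of that check, using only core.async's public go API; report! stands in for whatever metrics sink is actually used (e.g. a Prometheus gauge):
(require '[clojure.core.async :as a])

(defn report-go-pool-latency!
  "Queue a no-op go block and report how long it sat in the dispatch
   queue before running, in milliseconds."
  [report!]
  (let [enqueued (System/nanoTime)]
    (a/go
      (report! (/ (- (System/nanoTime) enqueued) 1e6)))))

;; e.g. from a scheduled task every few seconds:
;; (report-go-pool-latency! #(println "go pool latency (ms):" %))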