2024-03-28
Hey team, I am using core.async heavily in my system, and noticed a big slowdown as more users get on the service.
Looking at logs, I see weird things, like:
• a. Database queries complete in < 10 ms
• b. But go-loops are taking 5s, sometimes 100s, to complete
The database IO is handled in an `a/thread`, so I know that can't be blocking. I also know it's returning queries quickly. The machine is beefy, and I know the CPU isn't at full capacity, so I doubt it's some expensive computation I'm running.
What I think is happening is that somewhere in my code I am running io-blocking operations in go blocks, and this is halting the whole system.
As a quick experiment, I tried increasing the pool-size to 200, and that helped a lot: P95 latency went down from 5s to 700ms.
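(For reference, that pool-size is the clojure.core.async.pool-size JVM property, read at startup; the jar name below is just a placeholder:)
java -Dclojure.core.async.pool-size=200 -jar our-service.jar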
I am using Honeycomb and have these traces, but they don't feel helpful, as lots of the time seems to be "unexplained".
Question: is there any command I could run in a production REPL, or tool I could use, which could help me see a true "flame graph" for one of these queries?
Are you running db queries in go blocks?
there is a `clojure.core.async.go-checking` system property that you can set to detect use of core.async blocking ops inside go blocks (for dev, not production), see doc at top of https://clojure.github.io/core.async/
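(to enable it in dev, something like this in a deps.edn alias works:)
;; dev-only JVM opts, a minimal sketch
{:aliases {:dev {:jvm-opts ["-Dclojure.core.async.go-checking=true"]}}}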
> Are you running db queries in go blocks?
AFAIK no. I am doing something like this:
(require '[clojure.core.async :as a :refer [go <!]])

(go
  (<! (go-db-query)))

(defn go-db-query []
  (a/thread (jdbc/...)))
(So the db queries should be offloaded)
Will do @U064X3EF3! AFAIK we don't use any core.async blocking ops (if I understand correctly you mean <!! etc.), but will enable it just in case
if increasing the core.async threadpool size helped a lot, it means something is gumming up that threadpool, which could be I/O or anything else that ties up a thread for a long time, even a long-running computation that doesn't have channel operations in the middle
I would shrink the core.async threadpool (make the problem worse), then get stack dumps of the core.async threadpool threads and see what they are spending their time doing
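(for a quick look from a REPL, something like this works, assuming the default "async-dispatch-N" naming core.async uses for its go-block pool threads:)
(require '[clojure.string :as str])
;; print the current stack of every go-block dispatch thread
(doseq [[thread frames] (Thread/getAllStackTraces)
        :when (str/starts-with? (.getName ^Thread thread) "async-dispatch")]
  (println "===" (.getName ^Thread thread) (.getState ^Thread thread))
  (doseq [frame frames]
    (println "   " (str frame))))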
As far as tools go, there are several; https://github.com/async-profiler/async-profiler has a -t option to do per-thread profiling.
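(roughly: ./profiler.sh -d 30 -t -f profile.html <pid>, see the async-profiler README for the exact flags and output formats)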
it seems stupid, but it is unreasonably effective to just kill -3 a java process a few times (triggers a stack dump) and then look at what the threads are doing

if there is a bottleneck, by definition you will see the bottleneck at the top of a lot of stacks
this may be harder if you have 100s or 1000s of threads :)
This also seems like the type of problem that shouldn't be too difficult to reproduce in a dev setup (in an ideal world; ymmv, certain restrictions may apply).
you could also be spinning up an unbounded number of go's, which would swamp the thread pool, if you are launching go's without waiting for them to complete, etc
but there tend to be other signs of that, like hitting the 1024 pending ops limit on channels
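(a contrived sketch of what hitting that limit looks like, roughly:)
(require '[clojure.core.async :as a])
;; more than 1024 pending puts on a single channel throws an AssertionError
(let [c (a/chan)]
  (dotimes [_ 1025]
    (a/put! c :x)))
;; => AssertionError: No more than 1024 pending puts are allowed on a single channel ...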
Thank you for jumping on this, team. Will aim to look deeper tomorrow, try the suggestions here, and fix the root cause
If you want a slightly more ergonomic solution, record the execution using JFR. The automatic analysis might yell at you about something (that's a good thing), or you'll be able to see useful details like lock analysis and thread dumps. Combined with hiredman's advice it'll probably be the most effective approach
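(on a running process that's something like: jcmd <pid> JFR.start duration=120s filename=recording.jfr, then open the file in JDK Mission Control; filename and duration are just examples)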
One option I would try is to simply not use `(go ...)` and see if the perf numbers improve? (aka switch all `(go ...)` to `(thread ...)`)
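(a rough sketch of that switch, reusing go-db-query from the snippet above; a/thread runs on an unbounded pool, so blocking takes and any accidental blocking I/O are fine there:)
(require '[clojure.core.async :as a])
;; before: (go (<! (go-db-query)))
;; after: do the work on a real thread and use the blocking take
(a/thread
  (a/<!! (go-db-query)))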
Any io-blocking code in a downstream function call that originates inside a `(go ...)` can be an issue – easy to miss, since it's not immediately obvious the blocking code is still running on the `(go ...)` threadpool.
If it's still a problem, I would try attaching YourKit and see if something obvious stands out… often the answer can be quite surprising. Somewhat recently I discovered that a bottleneck can be simply trying to generate too many UUIDs too fast, since it's a global resource and each one takes a long time and blocks… Basically the default UUID generation is also akin to an I/O operation 🙂 (and a global (!) one at that: it will temporarily halt all threads trying to call it, so in a way worse than any other I/O that would presumably block only a single threadpool thread). TL;DR: the cause of the slowdown might not be in core.async at all, just some food for thought.
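(the mechanism, as I understand it, is that java.util.UUID/randomUUID goes through one shared SecureRandom, so every thread generating UUIDs serializes on its lock; a contrived way to watch that contention:)
;; hammer UUID generation from several threads; they all queue on the same SecureRandom
(time
  (->> (repeatedly 8 #(future (dotimes [_ 200000] (java.util.UUID/randomUUID))))
       doall
       (run! deref)))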