#core-async
2024-03-28
stopa20:03:43

Hey team, I am using core.async heavily in my system, and noticed a big slowdown as more users get on the service. Looking at logs, I see weird things, like:
• a. Database queries complete in < 10 ms
• b. But go-loops are taking 5s, sometimes 100s, to complete
The database IO is handled in an a/thread, so I know that can't be blocking. I also know it's returning queries quickly. The machine is beefy, and I know the CPU isn't at full capacity, so I doubt it's some expensive computation I'm running. What I think is happening is that somewhere in my code I am running io-blocking operations in go blocks, and this is halting the whole system. As a cursory look, I tried increasing the pool-size to 200, and that helped a lot: P95 latency went down from 5s to 700ms. I am using Honeycomb and have these traces, but they don't feel helpful, as lots of the time seems to be "unexplained". Question: is there any command I could run in a production REPL, or tool I could use, which could help me see a true "flame graph" for one of these queries?
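
(For reference, a minimal sketch of the knob being tuned here, assuming it was set via JVM opts; core.async sizes its go-block dispatch pool from the clojure.core.async.pool-size system property, which defaults to 8. The alias name is hypothetical.)
;; deps.edn
{:aliases
 {:big-pool {:jvm-opts ["-Dclojure.core.async.pool-size=200"]}}}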

phronmophobic20:03:32

Are you running db queries in go blocks?

Alex Miller (Clojure team)20:03:28

there is a clojure.core.async.go-checking system property that you can set to detect use of core.async blocking ops inside go blocks (for dev, not production), see doc at top of https://clojure.github.io/core.async/
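
(A minimal sketch of enabling that, assuming JVM opts in deps.edn; the alias name is hypothetical. With the property set to true, core.async blocking ops such as <!! and >!! throw when invoked inside a go block instead of silently tying up a dispatch thread.)
;; deps.edn
{:aliases
 {:dev {:jvm-opts ["-Dclojure.core.async.go-checking=true"]}}}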

stopa20:03:30

> Are you running db queries in go blocks?
AFAIK no. I am doing something like this:

(go
  (<! (go-db-query)))

(defn go-db-query []
  (a/thread (jdbc/...)))   ; blocking JDBC work runs on its own thread
(So the db queries should be offloaded)

👍 1
stopa20:03:24

Will do @U064X3EF3! AFAIK we don't use any core.async blocking ops (if I understand correctly you mean <!!, etc.), but will enable just in case

hiredman20:03:34

if increasing the core.async threadpool size helped a lot, it means something is gumming up that threadpool, which could be I/O or anything else that ties up a thread for a long time, even a long-running computation that doesn't have channel operations in the middle
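
(A sketch of the anti-pattern being described, assuming [clojure.core.async :as a] and next.jdbc as jdbc; ds and out are hypothetical.)
;; Anti-pattern: the blocking call occupies one of the few dispatch threads,
;; so a handful of these in flight stalls every other go block in the process.
(a/go
  (a/>! out (jdbc/execute! ds ["select 1"])))

;; Offloaded: the blocking call gets its own thread; the go block only parks.
(a/go
  (a/>! out (a/<! (a/thread (jdbc/execute! ds ["select 1"])))))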

hiredman20:03:43

I would shrink the core.async threadpool (make the problem worse) then get stack dumps of the core.async thread pool threads, and see what they are spending their time doing

phronmophobic20:03:14

As far as tools go, there are several; https://github.com/async-profiler/async-profiler has a -t option to do per-thread profiling.
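
(If you'd rather drive it from a REPL, a sketch using clj-async-profiler, a Clojure wrapper around async-profiler; the no-argument arities are an assumption, and the raw CLI works just as well.)
(require '[clj-async-profiler.core :as prof])

(prof/start)   ; begin sampling the whole JVM
;; ... reproduce a few slow requests ...
(prof/stop)    ; writes a flamegraph file and returns it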

hiredman20:03:40

the core.async threadpool threads all have names that start with async-dispatch-
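
(Which makes them easy to pick out from a production REPL; a sketch that prints just those stacks, using plain JVM interop.)
(doseq [[^Thread t trace] (Thread/getAllStackTraces)
        :when (.startsWith (.getName t) "async-dispatch-")]
  (println "===" (.getName t) (.getState t))
  (doseq [el trace]
    (println "  " (str el))))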

Alex Miller (Clojure team)20:03:16

it seems stupid, but it is unreasonably effective to just kill -3 a java process a few times (triggers a stack dump) and then look at what the threads are doing

til 3
Alex Miller (Clojure team)20:03:41

if there is a bottleneck, by definition you will see the bottleneck at the top of a lot of stacks

❤️ 1
Alex Miller (Clojure team)20:03:12

this may be harder if you have 100s or 1000s of threads :)

phronmophobic20:03:20

This also seems like the type of problem that shouldn't be too difficult to reproduce in a dev setup (in an ideal world, YMMV, certain restrictions may apply).

hiredman20:03:57

you could also be spinning up an unbounded number of go's, which would swamp the thread pool, if you are launching go's without waiting for them to complete, etc

hiredman20:03:39

but there tend to be other signs of that, like hitting the 1024 pending-ops limit on channels
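
(A sketch of that failure mode, assuming [clojure.core.async :as a]; the channel here is hypothetical and has no consumer.)
(def ch (a/chan))        ; unbuffered, nobody taking

(dotimes [i 2000]
  (a/go (a/>! ch i)))    ; fire-and-forget go blocks pile up pending puts

;; After 1024 puts are pending on ch, the next one throws an AssertionError
;; about exceeding the pending-puts limit, killing that go block.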

stopa22:03:04

Thank you for jumping on this, team. Will aim to look deeper tomorrow, try the suggestions here, and fix the root cause

Ben Sless07:03:04

If you want a slightly more ergonomic solution, record the execution using JFR. Its automatic analysis might yell at you about something (that's a good thing), or you'll be able to see useful details like lock analysis and thread dumps. Combined with hiredman's advice, it'll probably be the most effective solution
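
(One low-ceremony way to get a recording, assuming you can restart with extra JVM opts; jcmd can also start and stop recordings on a live process. The alias name is hypothetical; open the resulting .jfr in JDK Mission Control for the automated analysis, lock statistics, and periodic thread dumps mentioned above.)
;; deps.edn
{:aliases
 {:jfr {:jvm-opts ["-XX:StartFlightRecording=duration=300s,filename=recording.jfr"]}}}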

raspasov18:03:39

One option I would try is to simply not use (go ...) and see if the perf parameters improve (aka switch all (go ...) to (thread ...)).
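
(The swap is mostly mechanical, since a/thread also returns a channel; inside it you take with the blocking <!! instead of the parking <!. For example, the earlier snippet would become something like:)
(a/thread
  (a/<!! (go-db-query)))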

raspasov18:03:15

Any io-blocking code in a downstream function call that originates inside a (go ...) can be an issue – easy to miss since it’s not immediately obvious the blocking code is still running on the (go ...) threadpool.

dergutemoritz09:04:38

One particularly innocuous-looking I/O operation is logging

☝️ 1
💯 1
raspasov16:04:49

If still a problem, I would try attaching YourKit and seeing if something obvious stands out… often the answer can be quite surprising. Somewhat recently I discovered that a bottleneck can be simply trying to generate too many UUIDs too fast: java.util.UUID/randomUUID funnels every caller through one shared SecureRandom, so calls are effectively serialized and block. Basically the default UUID generation is akin to an I/O operation 🙂 (and a global (!) one at that: it will temporarily halt all threads trying to call it, so in a way it's worse than other I/O, which would presumably block only a single threadpool thread). TL;DR: the cause of the slowdown might not be in core.async at all – just some food for thought.

stopa14:04:02

Wow! TIL on UUID's lock (makes sense). Thank you @U050KSS8M

🙏 1
Ben Sless15:04:41

Oh yeah, that's a nasty surprise. Another one: single-threaded UUID generation is faster than parallel. Also fun: use UUID v6

👍 1
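
(A quick REPL check of that claim: if the parallel version is no faster than the single-threaded one, contention on the shared SecureRandom inside java.util.UUID/randomUUID is the likely culprit.)
;; single-threaded: one million UUIDs on one thread
(time (dotimes [_ 1000000] (java.util.UUID/randomUUID)))

;; parallel: the same million split across 8 futures
(time (->> (range 8)
           (mapv (fn [_] (future (dotimes [_ 125000] (java.util.UUID/randomUUID)))))
           (run! deref)))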