core-async

2024-03-28T20:21:43.720409Z

Hey team, I am using core.async heavily in my system, and noticed a big slow down as more users get on the service. Looking at logs, I see weird things, like: • a. Database queries complete in < 10 ms • b. But, go-loops are taking 5s, sometimes 100s to complete The database IO is handled in an a/thread, so I know that can't be blocking. I also know it's returning queries quickly. The machine is beefy, and I know the CPU isn't at full capacity, so I doubt it's some computation I'm running that's expensive What I think is happening, is that somewhere in my code I am running io-blocking operations in go blocks, and this is halting the whole system. As a cursory look, I tried increasing the pool-size to 200, and that helped a lot. P95 latency went down from 5s to 700ms. I am using Honeycomb and have these traces, but they don't feel helpful, as lots of the time seems to be "unexplained". Question: is there any command I could run in a production REPL, or tool I could use, which could help me see a true "flame graph" for one of these queries?

Ben Sless 2024-03-29T07:35:04.134779Z

If you want a slightly more ergonomic solution, record the execution using JFR And automatic analysis might yell at you about something (that's a good thing), or you'll be able to see useful details like lock analysis and thread dumps. Combined with hiredman's advice it'll probably be the most effective solution

raspasov 2024-03-29T18:53:39.410889Z

One option I would try is to simply not use (go ...) and see if the perf. parameters improve? (aka switch all (go ...) to (thread ...) )

❤️ 1
raspasov 2024-03-29T18:57:15.151179Z

Any io-blocking code in a downstream function call that originates inside a`(go ...)` can be an issue – easy to miss since it’s not immediately obvious the blocking code is still running on the (go ...) threadpool.

❤️ 1
phronmophobic 2024-03-28T20:28:32.348419Z

Are you running db queries in go blocks?

Alex Miller (Clojure team) 2024-03-28T20:31:28.627689Z

there is a clojure.core.async.go-checking system property that you can set to detect use of core.async blocking ops inside go blocks (for dev, not production), see doc at top of https://clojure.github.io/core.async/

2024-03-28T20:32:30.134669Z

> Are you running db queries in go blocks? AFAIK no. I am doing something like this:

(go 
  (<! (go-db-query
(defn go-db-query [] 
  (a/thread (jdbc/...)))
(So the db queries should be offloaded)

👍 1
2024-03-28T20:33:24.725519Z

Will do @alexmiller! Afaik we don't use any core async blocking ops (if I understand correctly you mean <!! etc), but will enable just in case

✅ 1
2024-03-28T20:33:34.217439Z

if increasing the core.async threadpool size helped a lot, it means something is gumming up that threadpool, which could be i/o or anything else that blocks up a thread for a long time, even long running computation that doesn't have channel operations in the middle

2024-03-28T20:36:43.646589Z

I would shrink the core.async threadpool (make the problem worse) then get stack dumps of the core.async thread pool threads, and see what they are spending their time doing

➕ 2
phronmophobic 2024-03-28T20:37:14.585889Z

As far as tools, there are several tools https://github.com/async-profiler/async-profiler has a -t option to do per thread profiling.

2024-03-28T20:38:40.927659Z

the core async threadpool threads all have names that start with async-dispatch-

Alex Miller (Clojure team) 2024-03-28T20:39:16.599819Z

it seems stupid, but it is unreasonably effective to just kill -3 a java process a few times (triggers a stack dump) and then look at what the the threads are doing

➕ 1
3
Alex Miller (Clojure team) 2024-03-28T20:39:41.181399Z

if there is a bottleneck, by definition you will see the botteleneck at the top of a lot of stacks

❤️ 1
Alex Miller (Clojure team) 2024-03-28T20:40:12.960409Z

this may be harder if you have 100s or 1000s of threads :)

phronmophobic 2024-03-28T20:40:20.114329Z

This problem also seems like the type of problem that shouldn't be too difficult to reproduce in a dev setup (in an ideal world, ymmv, certain restriction may apply).

2024-03-28T20:43:57.748159Z

you could also be spinning up an unbounded number of go's, which would swamp the thread pool, if you are launching go's without waiting for them to complete, etc

2024-03-28T20:44:39.662219Z

but there tend to be other signs of that, like hitting the 1024 pending ops limits on channels

2024-03-28T22:02:04.810159Z

Thank you for jumping on this team. Will aim to look deeper tomorrow, try the suggestions here and fix the root cause

2024-04-08T14:56:02.301799Z

Wow! TIL on UUID's lock (makes sense). Thank you @raspasov

🙏 1
Ben Sless 2024-04-08T15:25:41.131849Z

Oh yeah that's a nasty surprise, another one - single threaded uuid generation is faster than parallel Also fun - use uuid v6

👍 1
dergutemoritz 2024-04-06T09:31:38.308009Z

One particularly innocuous-looking I/O operation is logging

💯 1
☝️ 1
raspasov 2024-04-06T16:19:49.416079Z

If still a problem, I would try attaching YourKit and see if something obvious stands out… often the answer can be quite surprising. Somewhat recently I discovered that a bottleneck can be simply trying to generate too many UUIDs too fast since it’s a global resource and each one takes a long time and blocks… Basically the default UUID generation is also akin to an I/O operation 🙂 (and a global (!) one at that… it will temporarily halt all threads trying to call it – so in a way worse than any other I/O that would presumably block only a single threadpool thread). Tldr; the cause of the slowdown might not be in core.async at all – just some food for thought.