core-async

seancorfield 2025-10-20T18:43:18.361189Z

Any experience reports of using the new alpha of core.async with vthreads JVM option set to target? We're trialing that at work and, making no code changes -- just updating the version and providing that JVM option for all invocations (incl. AOT during JAR building -- I think, if I got it right) -- we see quite a change in thread usage:

seancorfield 2025-10-21T12:48:19.152959Z

The process went into a GC tailspin overnight and pegged the server at 100% CPU, so I'm rolling back changes. I realized I'm probably not running the process with that JVM option right now, so that fits with Alex's comment about the AOT-target version most likely being identical to the runtime version. So what I started out testing -- and what I'm back testing now -- is just the updated core.async without targeting vthreads, just running in default mode. I'm going to run like this for a few hours to check it is stable, and then I'll ensure that vthreads=target property is enabled and restart the process, and I expect we'll see climbing heap again. Will report back in a few hours.

Alex Miller (Clojure team) 2025-10-21T14:00:44.242599Z

Assuming you are on Java with vthreads, you should not need the flag; it will use vthreads (plus you’re already compiling to vthread-assuming code)

Alex Miller (Clojure team) 2025-10-21T14:01:19.480009Z

That is, the default (non-compiled) is vthreads

Alex Miller (Clojure team) 2025-10-21T14:02:25.995659Z

You did see significant change in thread usage, further evidence for that

seancorfield 2025-10-21T14:14:28.785219Z

I guess this sentence is what I find confusing in the async virtual threads post about this: "If AOT compiling, go blocks will always use IOC (no change)."

seancorfield 2025-10-21T14:16:04.228619Z

We AOT our uberjars, so this sounds like we should get the old core.async behavior -- without that vthreads=target property?

seancorfield 2025-10-21T14:22:58.217489Z

Here are my three possible scenarios -- can you explain what, if any, differences I should see between them:
1. update core.async to alpha2, make no changes to anything, AOT without vthreads=target, run without vthreads=target
2. update core.async to alpha2, make no changes to anything, AOT without vthreads=target, run with vthreads=target
3. update core.async to alpha2, make no changes to anything, AOT with vthreads=target, run without vthreads=target

seancorfield 2025-10-21T14:24:43.389419Z

I believe I was running scenario 1. successfully at the start of this thread(!) -- no async-io threads, lots of fork-join threads, higher throughput, higher CPU, stable heap. Then I switched to scenario 3. and saw the same overall pattern except heap steadily climbed until we hit a GC spiral and 100% CPU.

Alex Miller (Clojure team) 2025-10-21T14:26:55.072249Z

presuming you are AOTing everything and running on Java 21+, the runtime setting is making no difference. so 1+2 will have go state machines (running on vthreads) and 3 will have vthread go blocks

seancorfield 2025-10-21T14:27:48.463419Z

Running on JDK 24, yes. And always AOT'ing everything (because it makes a massive difference to startup speed when we restart services).

Alex Miller (Clojure team) 2025-10-21T14:30:09.224439Z

so you will still see a thread difference even with 1+2 b/c io-threads and go threads are running vthreads instead of platform threads, but doing the same thing they used to

seancorfield 2025-10-21T14:30:13.295089Z

So 1. & 2. are equivalent, regardless of the JVM property at runtime? And 3. is the different scenario. Okay, so it seems the non-state-machine scenario is the one with the bad heap behavior...

Alex Miller (Clojure team) 2025-10-21T14:30:31.367439Z

yeah, you're compiling so the flag at compilation time is the important one

seancorfield 2025-10-21T14:30:34.386099Z

We're not using io-thread (yet) -- no code changes, just update core.async.

seancorfield 2025-10-21T14:36:40.183669Z

How would you like me to proceed with debugging the apparent heap problem with scenario 3.?

Alex Miller (Clojure team) 2025-10-21T14:46:56.064299Z

depends what tools you have to investigate the leak. we will try to replicate independently.

seancorfield 2025-10-21T15:00:51.001439Z

No specific tools. We could probably get you a heap dump if we re-run scenario 3. (remind me how best to do that for the format you want)

seancorfield 2025-10-21T15:01:47.825509Z

The app where we're seeing this is the heaviest core.async use in our whole suite, and it seems fairly easy to trigger it...

Alex Miller (Clojure team) 2025-10-21T15:12:31.105909Z

a heap dump is likely to be huge and not very useful to us w/o code. better to monitor heap histograms via jcmd over time

fogus (Clojure Team) 2025-10-21T15:13:56.081359Z

Thanks for your help Sean. Just catching up now. Quick question: the problematic scenario #3 is run with the vthreads flag unset?

Alex Miller (Clojure team) 2025-10-21T15:14:05.373029Z

something like jcmd <pid> GC.class_histogram | head -50 periodically should give you a clue
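
Alex's "periodically" can be scripted from a Clojure REPL or a small tool. A minimal sketch, assuming `jcmd` from the same JDK is on the PATH; `class-histogram`, `parse-histogram-line` and `sample-every!` are hypothetical names, and the row format parsed is the one shown in the histogram output later in this thread.

```clojure
(require '[clojure.java.shell :refer [sh]]
         '[clojure.string :as str])

(defn class-histogram
  "Runs `jcmd <pid> GC.class_histogram` and returns the first n lines of
  its output, like piping it through `head -n`."
  [pid n]
  (let [{:keys [exit out err]} (sh "jcmd" (str pid) "GC.class_histogram")]
    (when-not (zero? exit)
      (throw (ex-info "jcmd failed" {:exit exit :err err})))
    (take n (str/split-lines out))))

(defn parse-histogram-line
  "Parses a histogram row such as
  `1: 34449 52606000 jdk.internal.vm.StackChunk (java.base@24.0.1)` into a
  map; returns nil for header, separator and total lines."
  [line]
  (when-let [[_ rank instances nbytes cls]
             (re-find #"^\s*(\d+):\s+(\d+)\s+(\d+)\s+(\S+)" line)]
    {:rank      (parse-long rank)
     :instances (parse-long instances)
     :bytes     (parse-long nbytes)
     :class     cls}))

(defn sample-every!
  "Prints the parsed top-n rows for pid every interval-ms, times times.
  Blocks the calling thread."
  [pid n interval-ms times]
  (dotimes [_ times]
    (doseq [row (keep parse-histogram-line (class-histogram pid n))]
      (println row))
    (Thread/sleep (long interval-ms))))
```

For example, `(sample-every! 12345 50 60000 30)` would print the top 50 rows once a minute for half an hour, where 12345 stands in for the app's real pid.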

seancorfield 2025-10-21T15:15:04.712839Z

@fogus Correct. As it turns out, none of my testing has had that vthreads flag set at runtime, hence no scenario 2.

seancorfield 2025-10-21T15:16:12.579179Z

@alexmiller Thanks. I will try that command when I get back around to testing the AOT-compiled-with-vthreads=target scenario again.

Alex Miller (Clojure team) 2025-10-21T15:23:16.774599Z

Or use Java Flight Recorder for allocation/leak hints:
• Increase stack depth: `jcmd <pid> JFR.configure stackdepth=256`
• Start: `jcmd <pid> JFR.start name=leak settings=profile duration=5m filename=/tmp/leak.jfr include=OldObjectSample`
• Wait 5m to accumulate to the .jfr file
• Open the .jfr file in Java Mission Control, go to Memory -> Old Object Sample, sort by Allocation Site or Object Type
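
A similar recording can also be started from inside the JVM (for example from a REPL connected to the app) with the `jdk.jfr` API. A sketch under assumptions: `record-old-objects!` is a hypothetical name, it uses the built-in "profile" configuration with the `jdk.OldObjectSample` event enabled, and it does not change the JFR stack depth, which the `JFR.configure` step above handles.

```clojure
(import '(jdk.jfr Configuration Recording))
(require '[clojure.java.io :as io])

(defn record-old-objects!
  "Records JFR data in the current JVM for duration-ms using the built-in
  \"profile\" settings plus the jdk.OldObjectSample event, then writes the
  recording to path. Roughly an in-process JFR.start followed by a dump."
  [path duration-ms]
  (with-open [rec (Recording. (Configuration/getConfiguration "profile"))]
    (.enable rec "jdk.OldObjectSample")
    (.start rec)
    (Thread/sleep (long duration-ms))
    (.stop rec)
    ;; dump writes the recorded data; the recording is closed by with-open
    (.dump rec (.toPath (io/file path)))
    path))
```

For example, `(record-old-objects! "/tmp/leak.jfr" (* 5 60 1000))` records for five minutes, matching the duration in the steps above.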

seancorfield 2025-10-21T17:06:13.413259Z

I don't have JMC locally, but I can capture a .jfr file if you want to examine it? In the meantime, the GC histogram quickly shows this as the #1 memory usage, steadily increasing:
1: 34449 52606000 jdk.internal.vm.StackChunk (java.base@24.0.1)

seancorfield 2025-10-21T17:10:56.974789Z

1: 47076 69978016 jdk.internal.vm.StackChunk (java.base@24.0.1) (and still growing)
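
To put numbers on "still growing", the two `StackChunk` rows can simply be subtracted. A trivial sketch; `histogram-growth` is a hypothetical helper, and the two input maps are copied from the histogram rows Sean posted at 17:06 and 17:10.

```clojure
(defn histogram-growth
  "Difference between two class-histogram rows for the same class, each a
  map with :instances and :bytes."
  [before after]
  {:instances (- (:instances after) (:instances before))
   :bytes     (- (:bytes after) (:bytes before))})

;; values copied from the two jdk.internal.vm.StackChunk rows above
(def stack-chunk-before {:instances 34449 :bytes 52606000})
(def stack-chunk-after  {:instances 47076 :bytes 69978016})

(histogram-growth stack-chunk-before stack-chunk-after)
```

That is 12,627 more StackChunk instances and 17,372,016 more bytes (about 16.6 MiB) between the two messages. StackChunk objects hold the stack frames of unmounted virtual threads, which fits Alex's point that vthreads keep their stacks on the heap.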

seancorfield 2025-10-21T17:26:42.471939Z

This is output from the GC histogram. The first block is from the alpha2 core.async with no property set and no code changes. The subsequent blocks are from the same app AOT'd with vthreads=target

seancorfield 2025-10-21T17:27:15.648589Z

I have leak.jfr.gz (about 6.7M) if you're interested @alexmiller (or @ghadi)?

seancorfield 2025-10-21T17:29:09.988319Z

Old gen (G1) CPU times on that server since starting that AOT'd-with-vthreads=target version -- the other servers are not doing any old gen GC:

seancorfield 2025-10-21T17:29:46.554519Z

And young gen CPU times are steadily increasing:

seancorfield 2025-10-21T17:32:37.832159Z

OK, I'm stopping this test and rolling back to alpha2 without vthreads=target (scenario 1).

seancorfield 2025-10-21T17:33:00.086149Z

LMK if you need more information, or need me to run more tests.

Alex Miller (Clojure team) 2025-10-21T18:23:11.045579Z

seems like what I would expect with vthread go blocks not being "done" / gc released

seancorfield 2025-10-21T20:29:17.764589Z

But with the state machine, and running on vthreads, that doesn't happen?

Alex Miller (Clojure team) 2025-10-21T20:30:45.630669Z

🤷 can't explain it, just observing atm

seancorfield 2025-10-20T18:45:27.877209Z

I guess the first Q would be: how can I confirm I set the option correctly for AOT?

seancorfield 2025-10-20T18:46:26.000349Z

The thread state change is definitely nice to see:

seancorfield 2025-10-20T19:05:33.116379Z

Hmm, I think I've confirmed that the JVM option didn't affect compile-clj...

Alex Miller (Clojure team) 2025-10-20T19:08:49.244319Z

With the default it should compile to go blocks, and you’ll see state machine classes (I think they have “state” in their name). You will want to set the new prop to target, and then you shouldn’t see those
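
One way to answer "how can I confirm I set the option correctly for AOT?" is to look for those classes in the compiled output. A sketch, assuming (Alex only says "I think") that IOC-compiled go blocks produce class files whose names contain `state_machine`, which is our guess at the munged name of the generated state machine fn; `state-machine-classes` is a hypothetical helper.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn state-machine-classes
  "Returns the sorted relative paths of .class files under class-dir whose
  names contain \"state_machine\". Assumption: IOC go blocks compile to such
  classes, so a build AOT-compiled with
  -Dclojure.core.async.vthreads=target should return an empty seq."
  [class-dir]
  (let [root      (io/file class-dir)
        root-path (.toPath root)]
    (->> (file-seq root)
         (filter #(.isFile ^java.io.File %))
         (map #(str (.relativize root-path (.toPath ^java.io.File %))))
         (filter #(and (str/ends-with? % ".class")
                       (str/includes? % "state_machine")))
         sort)))
```

Running `(state-machine-classes "target/classes")` (path assumed) after each build would show whether the property reached the compiler.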

Alex Miller (Clojure team) 2025-10-20T19:09:32.340239Z

You’ll need to set that prop when calling compile-clj

seancorfield 2025-10-20T19:09:43.336839Z

Right, for running everything, we have the property set to target. My question is just about AOT -- compile-clj -- how do I ensure that property is set?

seancorfield 2025-10-20T19:10:34.348429Z

My :build alias has:

:build
  {:extra-deps {ws/build {:local/root "projects/build"}}
   :jvm-opts ["--enable-preview"
              "-client"
              "-Dclojure.core.async.go-checking=true"
              "-Dclojure.core.async.vthreads=target"
...
but I don't know if that is correctly affecting the compile-clj call in build.clj -- is there a way to verify that?

seancorfield 2025-10-20T19:11:12.916809Z

I tried (println (System/getProperty "clojure.core.async.vthreads")) at the top of the main ns so it would print when it was loaded (for AOT compilation) and it prints nil 😐

Alex Miller (Clojure team) 2025-10-20T19:12:19.495279Z

in the call to compile-clj, use :java-opts ["-Dclojure.core.async.vthreads=target"]

Alex Miller (Clojure team) 2025-10-20T19:12:40.794479Z

compile is forked, so the props on the build process itself are irrelevant of course
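
A build.clj sketch of Alex's fix. It assumes tools.build is available on the :build alias and uses illustrative paths ("src", "target/classes"); `compile-opts` and `compile-app` are hypothetical names. tools.build is resolved with `requiring-resolve` only so the options map can be inspected without tools.build on the classpath.

```clojure
(def class-dir "target/classes")

(defn compile-opts
  "Options map for clojure.tools.build.api/compile-clj."
  [basis]
  {:basis     basis
   :src-dirs  ["src"]
   :class-dir class-dir
   ;; compile-clj forks a JVM: these opts reach the compiler process,
   ;; unlike the alias :jvm-opts, which only affect the build process
   :java-opts ["-Dclojure.core.async.vthreads=target"]})

(defn compile-app
  "Hypothetical build task: AOT-compiles the project with the vthreads target."
  [_]
  (let [create-basis (requiring-resolve 'clojure.tools.build.api/create-basis)
        compile-clj  (requiring-resolve 'clojure.tools.build.api/compile-clj)]
    (compile-clj (compile-opts (create-basis {:project "deps.edn"})))))
```
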

seancorfield 2025-10-20T19:12:44.037749Z

Ah, that's what I'm missing...

seancorfield 2025-10-20T19:14:25.256179Z

Thank you! I couldn't remember how/where to set that.

Alex Miller (Clojure team) 2025-10-20T19:18:32.853209Z

very curious about observable change in latency/throughput

seancorfield 2025-10-20T19:20:12.108069Z

Yeah, I'll have to rebuild with this change and do another round of testing -- what about clojure.core.async.go-checking=true? Is that runtime only or would adding it to compile-clj affect anything?

Alex Miller (Clojure team) 2025-10-20T19:25:10.420609Z

well, it's irrelevant if you're using vthreads

Alex Miller (Clojure team) 2025-10-20T19:25:25.247349Z

I'd certainly not turn it on in production regardless

seancorfield 2025-10-20T19:26:57.157119Z

Heh, no, that's dev/test only. I was just curious.

seancorfield 2025-10-20T19:27:22.734549Z

We do have clojure.spec.check-asserts=true in production tho' 🙂

Alex Miller (Clojure team) 2025-10-20T19:54:47.826449Z

you do you :)

seancorfield 2025-10-20T21:59:01.094259Z

Now I have that fixed, it's nice to see the size reduction in the JAR files:

About to replace this JAR file: build-2025-10-20_17.32.42
-rw-r--r--. 1 tomcat tomcat 63664369 Oct 20 17:52 /var/www/worldsingles/build/uberjars/wsmessaging-1.0.0.jar

With this updated JAR file: build-2025-10-20_21.34.44
-rw-r--r--. 1 tomcat tomcat 59678051 Oct 20 21:49 /tmp/wsmessaging-1.0.0.jar

Alex Miller (Clojure team) 2025-10-20T22:18:05.288899Z

You could also exclude tools.analyzer.jvm under core.async too if you wanted
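
Alex's suggestion as a deps.edn sketch. "VERSION" is a placeholder for whichever core.async alpha you are testing, and the exclusion assumes nothing in the app still needs IOC (state machine) go compilation, which is what uses tools.analyzer.jvm.

```clojure
;; deps.edn fragment; "VERSION" is a placeholder, not a real version string
{:deps {org.clojure/core.async {:mvn/version "VERSION"
                                :exclusions [org.clojure/tools.analyzer.jvm]}}}
```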

seancorfield 2025-10-20T22:20:26.278559Z

Good to know, thanks.

mkvlr 2025-10-20T22:43:08.962289Z

@seancorfield curious what the source of those graphs is. Looking at improving our insight into our prod jvms.

seancorfield 2025-10-20T23:22:23.919449Z

New Relic.

seancorfield 2025-10-20T23:24:57.973849Z

I love New Relic. We have the Linux agent installed on all our servers, and the Java agent in all our applications, as well as a bunch of custom metrics we compute and post to New Relic. Happy to talk your ears off about it 🙂

mkvlr 2025-10-20T23:27:54.895449Z

good to hear, I’ve only used it ages ago with Ruby on Rails and liked it a lot for that. I see you wrote https://corfield.org/blog/2013/05/01/instrumenting-clojure-for-new-relic-monitoring/; are you still annotating functions like this today?

seancorfield 2025-10-20T23:38:32.355479Z

We use Paul Rutledge's library for it these days... lemme get the deets...

seancorfield 2025-10-20T23:40:18.650169Z

https://github.com/RutledgePaulV/newrelic-clj -- so you can use defn-traced

seancorfield 2025-10-21T01:19:13.107869Z

@alexmiller @ghadi FYI: with the compile-clj :java-opts change so that AOT picks up target, we're seeing slowly increasing heap memory usage that we didn't see previously (with target just set for runtime). I'll be tracking this overnight to see what happens. If it still looks "bad" in the morning, I'll roll back to the earlier version of the day, and run that for a full day and see if heap is stable. Can you think of anything that might affect heap/GC between the AOT'd with target version and the regular runtime target version?

Alex Miller (Clojure team) 2025-10-21T01:39:48.043239Z

No, should be the same bytecode either way

Alex Miller (Clojure team) 2025-10-21T01:40:58.088709Z

Vthreads store their stack on the heap so there is more heap usage in general, but I would expect gc to reclaim as go blocks complete

seancorfield 2025-10-21T01:42:49.185989Z

For the four hours with today's first version, heap was very stable. The three hours with the AOT change have seen steady heap climbing.

Alex Miller (Clojure team) 2025-10-21T01:43:40.117009Z

In either case go blocks are just compiled as normal clojure code and run on vthreads so not much magic. I guess it’s possible something is retaining refs and preventing cleanup, but can’t imagine what that would be that’s different between those

seancorfield 2025-10-21T01:44:31.112029Z

Okay. I'll see where we are in the morning. I'm only running this on one server in the cluster so far.

2025-10-22T16:10:32.272029Z

the state machine doesn’t use threads regardless of JVM/vthreads, it’s hand-rolled inversion of control (callbacks)
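
To make "callbacks" concrete: core.async's public `take!` registers a callback on a channel instead of blocking a thread, which is roughly the mechanism IOC go blocks are compiled down to. A minimal sketch using only the public API (not the code the go macro actually generates); `await-with-callback` is a hypothetical name. While no value is available, the pending callback is referenced only from the channel's queue of takers, so if the channel becomes unreachable, both can be collected.

```clojure
(require '[clojure.core.async :as a])

(defn await-with-callback
  "Registers a callback on ch with take! and returns a promise that receives
  the taken value (nil if ch is closed). No thread waits in the meantime:
  the pending callback is simply stored in ch's queue of takers."
  [ch]
  (let [p (promise)]
    (a/take! ch (fn [v] (deliver p v)))
    p))
```
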

2025-10-22T16:13:48.498859Z

vthreads will use memory for parked stacks

seancorfield 2025-10-22T16:27:03.580849Z

@richhickey I think https://clojure.org/news/2025/10/01/async_virtual_threads is pretty confusing on when vthreads will be used. When we upgraded core.async, we saw a big shift in threads from async-io (went away) to fjpool-worker (big increase). But we AOT everything and were not using the vthreads=target option for compile-clj -- so, based on your comment, we get IOC behavior and no vthreads that way? (just the change in thread types/names seen in that initial graph). If we AOT with vthreads=target, we'll get only vthreads from go blocks -- and it seems, from discussions with @ghadi, that some vthreads have strong references and won't get GC'd, depending on how they're created (executors create vthreads that can be GC'd, the Thread API creates vthreads that can't be GC'd until they're completed -- based on a loom-dev mailing list discussion). If we weren't AOT'ing, we'd get vthreads instead of IOC just by virtue of running on JDK 24 (21+). And we'd likely still be in this situation (large heap, GC unable to recover space) I think?

seancorfield 2025-10-22T16:30:00.002599Z

I think we can recover GC'ability of vthreads created by the Thread API if we add -Djdk.trackAllThreads=false but that feels like a workaround. I haven't looked at the core.async source (yet) but I'm curious how it creates vthreads -- executors or the thread API...

seancorfield 2025-10-22T16:32:13.671569Z

(Ghadi helped me identify some problematic code in one of our apps that creates a huge number of go blocks and associated channels -- which doesn't matter for IOC but creates vthreads in the new model and those channels were not always closed and the go blocks were left hanging... and un-GC'able)

fogus (Clojure Team) 2025-10-22T17:29:24.776479Z

My first cut at trying to repro was unsuccessful until I added a bit that didn't close its channels. Then I saw the same behavior.

Steven Lombardi 2025-10-22T17:53:53.146119Z

@seancorfield Looking to summarize for my own understanding. Throughout your experiments, the constants in play were:
• JDK 24
• Using alpha2 of core.async
• No source code changes
• Always AOT
The only changing variable was:
• AOT without -Dclojure.core.async.vthreads=target and everything was fine
• AOT with -Dclojure.core.async.vthreads=target and you saw the GC issues

Steven Lombardi 2025-10-22T17:55:44.665299Z

Regarding the problematic code, was the channel not being closed the channel returned by the go block? Or a manually created channel within the go block?

seancorfield 2025-10-22T18:01:26.951019Z

We were creating a huge number of channels and go blocks in this app and they were only closed on the unhappy path which was rare. The assumption was they'd be GC'd. Which they are in the "old" IOC world, but in the VT world, a vthread that isn't terminated doesn't get GC'd by default, depending on exactly how it is created (according to the loom-dev mailing list discussion from a year ago). We're fixing our code to avoid creating so many channels/`go` blocks -- it wasn't really "buggy" code but it was a bit sloppy (but I suspect we create channels/`go` blocks all over the place that we just assumed would get GC'd when they went out of scope).
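
The fix comes down to making sure each waiting go block can finish. A minimal sketch (`park-then-close` is a hypothetical name) showing that closing the channel a go block is parked on makes the pending `<!` return nil, so the block runs to completion.

```clojure
(require '[clojure.core.async :as a])

(defn park-then-close
  "Starts a go block parked on a channel that nothing is ever put on, then
  closes that channel. Closing makes the pending <! return nil, so the go
  block runs to completion and its result channel yields :finished."
  []
  (let [ch     (a/chan)
        result (a/go
                 (a/<! ch)   ; parks until ch gets a value or is closed
                 :finished)] ; reached only once ch is closed
    (a/close! ch)
    result))
```

Calling `(a/<!! (park-then-close))` returns `:finished`; without the `close!`, it would block forever.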

seancorfield 2025-10-22T18:01:59.109869Z

(and, yes, your bullets accurately reflect how we have things set up)

fogus (Clojure Team) 2025-10-22T18:05:02.088699Z

I'll let Sean answer his specific case, but an easy way to replicate is to make a bunch of go blocks in a loop that read from a channel that never receives a value. The problem is not exactly related to channel closing but more so to whatever causes a go block to not complete.
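
fogus's repro can be sketched like this; `start-stuck-go-blocks!` is a hypothetical name. Calling it repeatedly while sampling `jcmd <pid> GC.class_histogram` is one way to compare an AOT build with and without `-Dclojure.core.async.vthreads=target`; per this thread, only the vthreads build should show `jdk.internal.vm.StackChunk` climbing.

```clojure
(require '[clojure.core.async :as a])

(defn start-stuck-go-blocks!
  "Starts n go blocks, each parked forever on its own channel that never
  receives a value and is never closed. Returns their result channels.
  Whether the parked blocks can be garbage collected depends on how go was
  compiled (IOC state machine vs virtual threads)."
  [n]
  (vec (repeatedly n (fn []
                       (let [never (a/chan)]
                         (a/go (a/<! never)))))))
```
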

seancorfield 2025-10-22T18:06:39.846989Z

I would expect that a lot of core.async code in the wild creates go blocks that never complete -- because it has never caused any problems and go blocks (and channels) "feel" very lightweight...

fogus (Clojure Team) 2025-10-22T18:07:05.526649Z

I'm sure you're right

seancorfield 2025-10-22T18:07:20.948569Z

It's not going to bite you in the new world unless you create a LOT of those, however...

seancorfield 2025-10-22T18:08:31.108879Z

Reading the loom-dev mailing list thread about vthreads and memory leaks was very interesting -- esp. the difference between executor-managed vthreads and the Thread API vthreads.

fogus (Clojure Team) 2025-10-22T18:08:54.285949Z

Links?

seancorfield 2025-10-22T18:10:08.084649Z

That's what Ghadi sent me. There are also some StackOverflow discussions about the memory leaks -- can't remember what search term I used when I stumbled across those. I was researching the -Djdk.trackAllThreads=false property.

fogus (Clojure Team) 2025-10-22T18:10:33.383969Z

Thanks!

Steven Lombardi 2025-10-22T18:44:50.570999Z

> make a bunch of go blocks in a loop that read from a channel that never receives a value
Never receive a value, or the channel they read from never closes? I think core.async as a library exercises good go hygiene by ensuring go terminates when the read source closes (see core.async/pipe for an example). But I'm not sure that practice or discipline is followed in the wild.
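
The hygiene Steven describes, in the same shape as the loop inside `core.async/pipe`: keep taking until the source returns nil (closed), then let the go block end. A sketch; `consume!` is a hypothetical name.

```clojure
(require '[clojure.core.async :as a])

(defn consume!
  "Applies f to each value taken from in, in a go-loop. The loop ends when
  in is closed (<! returns nil), so the go block completes instead of
  parking forever. f should be quick and non-blocking."
  [in f]
  (a/go-loop []
    (when-some [v (a/<! in)]
      (f v)
      (recur))))
```

The returned channel closes once `in` is closed and drained, so callers can also wait on it.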

seancorfield 2025-10-22T18:45:12.404799Z

Okay, just pushed a version of that app to one server, with -Djdk.trackAllThreads=false and compile-clj passed vthreads=target and our runaway go block creation addressed. Or at least that particular known issue fixed. We'll see how much of a leak we have now...

ghadi 2025-10-22T18:49:23.945109Z

did you swap out the executor?

ghadi 2025-10-22T18:49:32.364499Z

or just the sysprop

Steven Lombardi 2025-10-22T18:51:32.177949Z

We use component in our projects and I try to aggressively manage our channel lifecycle as part of the system so we don't leave garbage lying around. But we still largely ignore the return chan from go blocks. Would that cause issues in the virtual thread world?

ghadi 2025-10-22T18:57:04.972869Z

@seancorfield I wasn't able to make the leak go away without swapping vthread executor (as well as prop...)

2025-10-22T19:40:22.945399Z

in the IOC world, only channels (used by a go block) refer to the go block data and thus when those channels are GCed (not necessarily closed), so too can the go block data be GCed

2025-10-22T19:40:50.789079Z

@ghadi “swapping vthread executor” for what?

2025-10-22T19:44:28.210059Z

jdk.trackAllThreads seems a culprit

2025-10-22T19:44:35.871589Z

defaults to true?

seancorfield 2025-10-22T19:46:26.726369Z

Given the advice is to set jdk.trackAllThreads to false, I assume it now defaults to true -- and from what I've seen discussed online that changed around JDK 21?

2025-10-22T19:47:04.476849Z

this negative interaction is so basic and predictable, sheesh

2025-10-22T19:49:13.252879Z

how could anyone write a large vthreaded IO app with this default?

seancorfield 2025-10-22T19:53:56.708969Z

I've been trying to use vthreads since JDK 19 and have had to roll back nearly every attempt 🙂 I managed to switch some of our future stuff over to vthreads about a year ago. Not sure what JDK version we'd upgraded to by that point, but that had failed previously. I thought it was interesting at Conj last year, when I asked the room when Alex et al were talking about Clojure 1.12 "how many of you are using vthreads in production?" (very few) "who has had problems with them?" (those few hands all stayed up). Seems like more tweaks are needed around vthreads in Java/JDK-land to make them reliable in production 😞

2025-10-22T19:54:34.452419Z

makes me appreciate our IOC impl all the more 🙂

seancorfield 2025-10-22T19:55:13.107999Z

Yup, pretty amazing what you can already do with core.async -- and via a library as well!!

2025-10-22T19:55:46.759689Z

thanks for the report Sean - we’re running some tests now but think we have enough info for the moment

ghadi 2025-10-22T19:59:56.719159Z

need trackAllThreads=false, and swap ExecutorService (which tracks all spawned threads so that .close() works) with an Executor that spawns without tracking

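(import '(java.util.concurrent Executor))

;; runs each task on a new virtual thread and keeps no set of the threads it starts
(reify Executor
  (execute [_ r]
    (Thread/startVirtualThread r)))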

ghadi 2025-10-22T20:00:37.522719Z

trackAllThreads=false was necessary but insufficient
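
A sketch contrasting ghadi's untracked Executor with `Executors/newVirtualThreadPerTaskExecutor`, assuming JDK 21+; `vthread-executor`, `run-on` and `run-on-per-task-executor` are hypothetical names. Per ghadi, the ExecutorService remembers the threads it starts so `.close` can wait for them, while the reified Executor keeps no such set; with the default `jdk.trackAllThreads`, the JDK may still track the threads, which is why he also needed that property. This sketch does not show how to hand an executor to core.async.

```clojure
(import '(java.util.concurrent Executor Executors))

;; one virtual thread per task; the executor itself holds no thread references
(def vthread-executor
  (reify Executor
    (execute [_ r]
      (Thread/startVirtualThread r))))

(defn run-on
  "Runs the no-arg fn f on executor ex and blocks for its result."
  [^Executor ex f]
  (let [p (promise)]
    (.execute ex (fn [] (deliver p (f))))
    @p))

(defn run-on-per-task-executor
  "Same, but via an ExecutorService that tracks its threads so that
  .close (called by with-open) can wait for them to finish."
  [f]
  (with-open [ex (Executors/newVirtualThreadPerTaskExecutor)]
    (run-on ex f)))
```
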

2025-10-22T20:01:44.348539Z

@fogus ^

seancorfield 2025-10-22T20:10:39.913599Z

FWIW, we're using Executors/newVirtualThreadPerTaskExecutor in a few places, but they are all where the VT is guaranteed to complete, and we're using Thread/startVirtualThread in many more places. I thought, based on some of the discussions on loom-dev, that the Thread API was the unsafe one and executors were the safe one -- but @ghadi your code suggests the opposite? Did I misunderstand how vthread tracking works there?

seancorfield 2025-10-22T20:50:38.664249Z

The green line is the server that is running the vthreads version -- I'll let it run a bit longer but I think heap is just going to keep climbing 🙂

ghadi 2025-10-22T21:45:18.968949Z

yes it’s doomed, but (kindly) capture a heap dump before you nuke

seancorfield 2025-10-22T22:12:53.927599Z

Haha... Okay... That command is either in this thread or our DM. I'm on my phone right now.

ghadi 2025-10-22T22:18:47.921019Z

jcmd ${PID} GC.heap_dump ${output.file}

ghadi 2025-10-22T22:19:08.622959Z

then pray you have enough space on target fs 🙂
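
If jcmd isn't convenient (for example from a REPL connected to the app), a heap dump can also be triggered in-process through the HotSpot diagnostic MXBean. A sketch; `dump-heap!` is a hypothetical name, and with `live?` true it mirrors `jcmd ... GC.heap_dump`'s default of dumping only reachable objects.

```clojure
(import '(java.lang.management ManagementFactory)
        '(com.sun.management HotSpotDiagnosticMXBean))

(defn dump-heap!
  "Writes a heap dump of the current JVM to path. With live? true, only
  reachable objects are dumped. The file must not already exist, and
  HotSpot expects an .hprof extension."
  [^String path live?]
  (let [bean (ManagementFactory/getPlatformMXBean HotSpotDiagnosticMXBean)]
    (.dumpHeap ^HotSpotDiagnosticMXBean bean path (boolean live?))
    path))
```

The same caveat about disk space applies: the file is roughly the size of the live heap.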

seancorfield 2025-10-23T01:35:17.155749Z

@ghadi LMK when you're around and I'll wormhole you that heap dump.