I have a proxy endpoint on my Aleph server, where I send a request downstream in my Ring handler with Aleph’s HTTP client, and return the response map mostly unchanged.
My assumption was that the downstream request’s :body is an InputStream that I can also return, and the server will eventually close it, releasing resources.
My question is: Is there any concern with this approach?
I noticed that my server sometimes gets “stuck” and all requests to it time out. I have no idea if this is related to the proxy endpoint, but this does not seem like an unreasonable guess. Maybe some resources pile up until the server deadlocks somehow? Any ideas?
Netty uses an event-driven execution model like Node, but unlike Node, which inherits JS's single-threaded conceptual model, Netty has no problem using a thread pool to divide the work up among cores. The event-driven model is still needed so that Netty can scale up. Early web servers like old Apache used one real thread per request, which doesn't scale (the old C10k problem). Theoretically, vthreads present a new way around the problem, but I haven't kept up with the tech. Yes, the Aleph executors exist to prevent blocking Netty, although they're also necessary for other things. The deferred/stream execution model is just woven throughout. The problem may not be on the Netty interaction side of things. It could be that one of Clojure/Manifold/Aleph just doesn't play nicely with vthreads yet. (E.g., one area I've always wondered about is the propagation of Clojure's thread binding frames; copying those for each vthread could easily negate some of the advantages. Presumably ScopedValue is the correct replacement, but Clojure is pretty slow to adapt to Java changes.)
There's also the issue that synchronized methods / blocks pin vthreads - maybe that's what you're running into
@dergutemoritz that sounds like a “heiße spur”, thanks!
Indeed 🕵️‍♂️ Only thought of this now
But it should be easy to tell from a stack dump
Grepping Netty's source for synchronized also gives 193 matches 😬
Yeah, there's a zillion potential little hiccups like that.
Is the vthread executor pool fixed in size?
if so, that could easily result in deadlocks
No, new vthread per task
It seems that vthreads are meant to be created in large numbers, and there should be little reason to reuse them
I was referring to the underlying pool of real threads
which actually end up executing the tasks
Ah, ok. I think this is the ForkJoinPool/commonPool, which does grow
> The JVM maintains a pool of platform threads, created and maintained by a dedicated ForkJoinPool. Initially, the number of platform threads equals the number of CPU cores, and it cannot increase more than 256.
that was from a secondary source, though
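For a primary source: as far as I know, the javadoc of java.lang.Thread documents two system properties for the vthread scheduler, jdk.virtualThreadScheduler.parallelism and jdk.virtualThreadScheduler.maxPoolSize (the latter defaulting to 256). A minimal sketch to check from a REPL what a given JVM was started with; the function name is made up for illustration:

```clojure
(defn vthread-scheduler-settings
  "Returns the core count plus any explicit scheduler overrides.
  A nil value means the property was not set, so the JDK default applies."
  []
  {:available-processors (.availableProcessors (Runtime/getRuntime))
   :parallelism          (System/getProperty "jdk.virtualThreadScheduler.parallelism")
   :max-pool-size        (System/getProperty "jdk.virtualThreadScheduler.maxPoolSize")})

(comment
  ;; Evaluate on the affected container to see what the scheduler can use.
  (vthread-scheduler-settings))
```

This only reports configuration; it says nothing about when the scheduler actually decides to add carrier threads.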
The thing is: In my setup, I’m running two aleph servers. One for “public” HTTP handlers, and one for internal endpoints, such as a health endpoint. The health endpoint on the internal server always stays responsive, that’s why my container (running on fargate) was not restarted in these situations. Both servers use the same executor, so the system was able to run new vthreads. But of course it could still somehow be related to pinned vthreads and synchronized blocks…
IC
Keep us posted!
Heya, sorry for the late reply. Is this still an issue? If so, could you provide a code snippet of how you set this up exactly, specifically how you block the server thread?
Hey! Unfortunately this is one of those happens-occasionally, hard-to-reproduce problems. It has not happened since, and I have not found a pattern so far for when it happens. It could be something completely different. The “download proxy” is just a shot in the dark: it relies on an assumption that I was not sure about. My code essentially looks like this (simplified):
(defn download-handler [request]
  (let [{:keys [org repo asset-id]} (get-in request [:parameters :path])
        download-url (format "" org repo asset-id)]
    (-> (client/get download-url)
        deref
        (update :headers
                #(-> %
                     (update-keys str/lower-case)
                     (select-keys ["accept-ranges"
                                   "content-disposition"
                                   "content-length"
                                   "content-range"
                                   "content-type"
                                   "etag"
                                   "last-modified"]))))))
So my Ring handler performs an HTTP client request, blocks for the result and returns the response, just updating the headers a bit. In particular, the :body key is passed through unchanged.
Did you try looking at where threads are blocked when it happens e.g. via jstack?
As for your code, I suggest you try to change it to return the response deferred and attaching the transformations via d/chain instead of deref'ing
That will free up the handler thread immediately for further requests
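A rough sketch of what I mean, not tested against your setup. filter-headers and proxy-get are names I made up; d/chain is Manifold's manifold.deferred/chain, and the part that needs Aleph and Manifold on the classpath sits in a comment block:

```clojure
(require '[clojure.string :as str])

(def pass-through-headers
  ["accept-ranges" "content-disposition" "content-length"
   "content-range" "content-type" "etag" "last-modified"])

(defn filter-headers
  "Lower-cases header names and keeps only the pass-through ones."
  [headers]
  (-> headers
      (update-keys str/lower-case)
      (select-keys pass-through-headers)))

(comment
  ;; Needs Aleph and Manifold on the classpath.
  (require '[aleph.http :as client]
           '[manifold.deferred :as d])

  (defn proxy-get
    "Returns a deferred response instead of blocking on the client call."
    [url]
    (d/chain (client/get url)
             #(update % :headers filter-headers))))
```

The handler would then return the result of proxy-get directly, and the server picks the response up once the deferred is realized.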
I’d like to avoid that, as we switched from using manifold/async to Java 21 virtual threads and blocking. Both server and client are configured to use a newVirtualThreadPerTask executor
It could be that for whatever reason your request never gets a response. Note that the client has no default request timeout, so it will just wait indefinitely. See the request-timeout and perhaps read-timeout client options.
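Something along these lines, as a sketch. The numbers are arbitrary, with-timeouts is a made-up helper, and besides :request-timeout and :read-timeout I'm assuming the client's :pool-timeout and :connection-timeout options here (all in milliseconds, as far as I remember):

```clojure
(def default-timeouts
  ;; All values are milliseconds; the numbers are arbitrary examples.
  {:pool-timeout       5000    ; waiting for a free pooled connection
   :connection-timeout 5000    ; establishing the connection
   :request-timeout    30000   ; waiting for the response to arrive
   :read-timeout       60000}) ; waiting for the response to complete

(defn with-timeouts
  "Adds the default timeouts to a request options map; explicit values win."
  [options]
  (merge default-timeouts options))

(comment
  ;; Needs Aleph on the classpath; download-url as in the handler above.
  (require '[aleph.http :as client])
  @(client/get download-url (with-timeouts {})))
```

With these set, a stuck downstream request should fail with an exception instead of waiting forever.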
Ah well, but note that Aleph still will use manifold internally for all kinds of things
So I doubt you're gaining all too much by avoiding it here
Sure, but user code gets so much simpler when avoiding async. Readable stacktraces, using sync middleware, using tools that expect blocking like resilience libs, etc
Absolutely agree
Request timeout is a good idea. Was not aware that the client does not have a default. Seems odd, are you sure about that?
Just Aleph (and Netty for that matter) are not built for that execution paradigm unfortunately
I thought Aleph kind-of is, as it uses its own thread pool for handling requests?
So maybe you'd be better off with a different client and server then?
Well it goes to great lengths to not block (thus manifold)
For a newer project, I was indeed switching to http-kit
Hm I think http-kit is similar in that respect, no? 🤔
Also uses nio under the hood which is all non-blocking
But I'm not 100% sure
Could be
Probably yes
As for there not being a default request timeout, yes, I am 100% sure 🙂
I have been working on that part of the client fairly recently
But to be honest I’m not sure if that is a problem. Like Aleph, http-kit defaults to using a thread pool so that users don’t need to worry about blocking internal async code.
Yeah but these thread pools are usually meant to not be blocked by I/O which might end up working against the grain of the whole thing
newVirtualThreadPerTask is 🙂
Indeed, maybe it should work. Which pool did you replace this way?
All HTTP-Client requests get :response-executor executor, and I start the server with (http/start-server {:executor executor})
So I expect that every request the server handles runs on a vthread, and every deferred returned by the client as well. The latter does not really matter, as the server-request handling vthread will block for the client response
OK that sounds reasonable
So assuming that requests get stuck indefinitely, you could still be exhausting the client connection pool
So if for example all your proxied requests go to the same destination host, 8 stuck requests would suffice to make everything grind to a halt
Hm, that could indeed be. Sounds like at the very minimum, I should look into request/read timeouts. It looks like these timeouts basically throw; do you happen to know if this exception causes the underlying resources to be cleaned up correctly?
Maybe it is also time to finally learn about the monitoring options of these connection pools 😄
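If it helps as a starting point, a sketch of a dedicated pool with a stats callback. This assumes the :connections-per-host and :stats-callback options of aleph.http/connection-pool; I'd double-check both against the docstring of the version you're on. The callback should receive a map of hosts onto usage stats, which I only print the keys of here:

```clojure
(require '[aleph.http :as http])

(def proxy-pool
  (http/connection-pool
    {;; 8 matches the number of stuck requests mentioned above.
     :connections-per-host 8
     ;; Invoked periodically with a map of hosts onto usage statistics.
     :stats-callback (fn [stats]
                       (println "connection pool hosts:" (keys stats)))}))

(comment
  ;; Pass the pool per request; :pool-timeout bounds the wait for a free
  ;; connection, so an exhausted pool surfaces as an error, not a hang.
  @(http/get download-url {:pool proxy-pool :pool-timeout 5000}))
```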
Yeah, it will clean up resources properly!
Cool, thanks!
Will give that a try!
You're welcome! Feel free to ping here for further questions
And good luck 🤞
FYI: https://github.com/netty/netty/issues/12848#issuecomment-2477059053
Thanks. I’m not sure what’s the best way forward right now. We cannot keep the server stalled for long to debug it, as many clients depend on it. So I need to restart it to fix the problem. I might just try to move away from Aleph to a different HTTP library, just to make sure it is not caused by something else.
Well, at least on the client side, it should be 99%-compatible with clj-http (if you're not using Manifold streams), and on the server side, it's Ring-compatible (again, if you're not using Manifold streams). If you want to debug, my first act would be switching from the custom virtual thread pool to the standard thread pool. I don't think anyone's tested vthreads with Aleph/Manifold, and neither Aleph nor Netty was built with them in mind.
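To make that switch cheap to flip, a sketch with a made-up server-options helper. Leaving out :executor should give you Aleph's default executor; the option names are the ones from your snippet, and this needs Java 21 to compile:

```clojure
(import '(java.util.concurrent Executors))

(defn server-options
  "Builds the start-server options; only sets :executor when asked to."
  [port virtual-threads?]
  (cond-> {:port port}
    virtual-threads?
    (assoc :executor (Executors/newVirtualThreadPerTaskExecutor)
           :shutdown-executor? false)))

(comment
  ;; Needs Aleph on the classpath; handler is your Ring handler.
  (require '[aleph.http :as http])
  (http/start-server handler (server-options 8080 false)))
```

Driving virtual-threads? from config would let you compare both modes in production without a code change.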
Yeah, maybe I prematurely switched to the shiny new toys!
Even though the event-driven paradigm is more complicated to use, it's probably just as efficient, in terms of utilizing CPU cores, as vthreads. I don't know which Aleph version you're on, but later ones have better logging to help debug issues, even if you're not using it for HTTP2.
My $.02, anyway
I’m sure it is very efficient.
Originally, my code was very async, using deferred and avoiding blocking. My thinking was that this is the most efficient way to implement a “gateway” service.
However, the code got more and more complex, e.g. I needed to add custom adapter code to get mulog to trace nicely across async calls, and so on. With vthreads being generally available, we decided to switch to a simpler blocking model, also as a proof of concept for another project that uses Promesa extensively. We figured this might be slightly less efficient, but so much easier to reason about.
vthreads are easier to reason about, for sure. But there's a nonzero chance their interaction with Aleph/Netty is the cause of your problem 😄 Good hunting!
Thanks! Good hint! I might just fall back to the default executors. Vthreads might be a premature optimisation. 🙂
I haven't really looked into them in a year, so it may be nothing, but with any luck, switching off them is pretty easy
True. To be honest, I doubt that this is the issue here, though. But I’m on thin ice 😄 Isn’t Netty all about async and avoiding threading (similar to NodeJS)? And the executor an Aleph addition to prevent users from accidentally blocking Netty? And then, shouldn’t it not matter which growable thread pool is used, e.g. Aleph’s dirigiste one vs vthreads? Somehow I feel it should not matter, but of course it very well could, with so many pieces in play. There’s also the common ForkJoinPool that actually “carries” vthreads…
Unfortunately, this keeps happening to us. Seemingly randomly, our server stops responding to HTTP requests. Since we’re behind a load balancer, we get 504 gateway timeouts. The proxy/client thing was just a guess, because this seems the most “exotic” thing our server is doing. How about the Aleph server — is it possible that it runs out of resources? Maybe previous requests not cleaning up? Any hints where I can start debugging? This is how we start the server (nothing fancy):
(let [port (:port config)
      options (-> {:port port
                   :executor executor
                   :shutdown-executor? false
                   :shutdown-timeout (duration/get-seconds (:shutdown-timeout config))})
      server (http/start-server handler options)]
  ;...
port is 8080, executor a (Executors/newVirtualThreadPerTaskExecutor), the :shutdown-timeout is set to 5 secs…
Hm there's a lot of potential causes. I suggest taking a look at what the threads are doing when this happens (e.g. via jstack) and which sockets the process is holding on to (e.g. via ss if you're on Linux)
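If you can get a REPL into the stuck process but have no JDK tools on the box, a sketch that writes a thread dump from inside the JVM. It assumes Java 21, where HotSpotDiagnosticMXBean has a dumpThreads method whose output should also cover virtual threads; dump-threads! is a made-up name:

```clojure
(import '(java.lang.management ManagementFactory)
        '(com.sun.management HotSpotDiagnosticMXBean
                             HotSpotDiagnosticMXBean$ThreadDumpFormat))

(defn dump-threads!
  "Writes a plain-text thread dump to path and returns path.
  The path must be absolute and the file must not exist yet."
  [^String path]
  (let [^HotSpotDiagnosticMXBean mx
        (ManagementFactory/getPlatformMXBean HotSpotDiagnosticMXBean)]
    (.dumpThreads mx path HotSpotDiagnosticMXBean$ThreadDumpFormat/TEXT_PLAIN)
    path))

(comment
  (dump-threads! "/tmp/threads.txt")
  (println (slurp "/tmp/threads.txt")))
```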
Or maybe you can still hire @kingmob to debug this 🙂 (but not sure if he's still available for consultancy jobs like that)
OK, I have an update.
I experienced the blocking situation again, and managed to run a REPL on the container in this state. (We’re running on AWS fargate with rather few resources, only 2 cores and 500M RAM).
I noticed that virtual threads did not start anymore, and there were two ForkJoinPool workers in WAITING state at jdk.internal.vm.Continuation.run, which I suppose means that they are running vthreads.
I further found this Netty issue, which I read as “virtual threads are not supported and there are no plans to ever support them right now”: https://github.com/netty/netty/issues/12816
Looking deeper, it seems that @dergutemoritz was spot on: I think the thread pinning was the issue, and this does not only lead to the effects described at https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html#GUID-04C03FFC-066D-4857-85B9-E5A27A875AF9, but can also cause thread starvation and deadlocks.
I’m not sure what the criteria are, but in our case it seems that the JVM was not spinning up new carrier threads, even though the two existing ones were blocked. I think I saw somewhere that the pool should start with a thread per core (2 in our case), but can grow to a max of 256.
With the tools installed on the machine, and frankly with the ones I’m familiar with, I could not inspect the virtual threads themselves, but only saw platform threads. My guess is that we had a scenario like this:
• Two incoming HTTP requests, each run in a virtual thread, each executes an HTTP request downstream and blocks for the result.
• The outgoing HTTP requests complete, the responses run in virtual threads as well.
• Now somehow I assume that each pair of virtual threads synchronise: The server vthread wants to pipe the response of the client vthread. Either or both of them could use a synchronized-wait in the underlying implementation.
• That means we have 4 virtual threads, but only 2 platform threads. If the two client threads pin these platform threads, the server threads never get a chance to continue. The system deadlocks.
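The pinning part of that scenario can be illustrated in isolation. This is only a sketch of the effect, not a reproduction of the deadlock, and all names are made up. Clojure's locking takes a monitor like a synchronized block does, and on JDK 21 a vthread that blocks while holding a monitor stays pinned to its carrier:

```clojure
(import '(java.util.concurrent Executors ExecutorService))

(defn run-tasks-ms
  "Runs n copies of task, each on a fresh virtual thread, and returns the
  elapsed wall-clock time in milliseconds."
  [n task]
  (let [start (System/nanoTime)]
    (with-open [^ExecutorService ex (Executors/newVirtualThreadPerTaskExecutor)]
      ;; invokeAll blocks until every task has finished.
      (.invokeAll ex ^java.util.Collection (vec (repeat n task))))
    (quot (- (System/nanoTime) start) 1000000)))

(defn sleep-plain
  "Blocks without holding a monitor, so the virtual thread can unmount."
  []
  (Thread/sleep 100))

(defn sleep-pinned
  "Blocks while holding a monitor; on JDK 21 this pins the carrier thread."
  []
  (let [lock (Object.)]
    (locking lock
      (Thread/sleep 100))))

(comment
  ;; Compare the two on a box with few cores.
  (run-tasks-ms 64 sleep-plain)
  (run-tasks-ms 64 sleep-pinned))
```

On JDK 21 I'd expect the pinned variant to take noticeably longer with only 2 carriers, since only as many tasks as there are carriers can sleep at the same time. Starting the JVM with -Djdk.tracePinnedThreads=full should also print a stack trace whenever a vthread blocks while pinned.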
Cool, thanks for the detailed summary! For completeness' sake:
> Thread state for a waiting thread. A thread is in the waiting state due to calling one of the following methods:
> • Object.wait with no timeout
> • Thread.join with no timeout
> • LockSupport.park
>
> A thread in the waiting state is waiting for another thread to perform a particular action. For example, a thread that has called Object.wait() on an object is waiting for another thread to call Object.notify() or Object.notifyAll() on that object. A thread that has called Thread.join() is waiting for a specified thread to terminate.
Source: https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Thread.State.html#WAITING
So yeah, a deadlock seems quite likely.
Yep, but I was not sure what states to expect, e.g. wouldn’t a fork-join-worker go to WAITING state after finishing a task until it is notified to pick up the next task?
True, that could also be
If the server doesn't support HTTP/2, then the client will fall back to HTTP/1
(unless you explicitly disabled HTTP/1 which then would result in an error)
If you're curious for how exactly the client finds out which HTTP version the server supports, that's done via the ALPN TLS extension: https://datatracker.ietf.org/doc/html/rfc7301
Sorry for not being clear, with “the server”, I meant “my Aleph server”. I have a proxy, so the server gets a request and makes another one that it wants to pass through to the client. Anyway, I have meanwhile configured all timeouts that I could find, let’s see if the error happens again! (it did, but this was before I added these timeouts, and removed http/2 support from the client)
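For reference, restricting the client to HTTP/1 looks roughly like this. I'm assuming Aleph 0.7 or later, where the connection options accept an :http-versions vector as far as I remember; treat the option name as something to verify against the docs for your version:

```clojure
(require '[aleph.http :as http])

(def http1-only-pool
  (http/connection-pool
    {:connection-options
     ;; Assumption: versions offered via ALPN, in preference order.
     {:http-versions [:http1]}}))

(comment
  ;; Use the pool per request; download-url as in the handler above.
  @(http/get download-url {:pool http1-only-pool}))
```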
I’m running with a custom newVirtualThreadPerTask executor for both server and client, and block the server thread until the client request is done.
I just wrapped up my MIT contract a week ago, so I'm actually available, fwiw.
I just realised that I use a custom connection pool with HTTP/2 enabled. I’m not sure why I enabled it, and to be 100% honest I don’t know enough about HTTP/2 to reason about it. The server is not configured for HTTP/2. Could this be an issue?