aleph

2024-07-29T12:51:41.702339Z

I have a proxy endpoint on my Aleph server, where I send a request downstream in my Ring handler with Aleph’s HTTP client, and return the response map mostly unchanged. My assumption was that the downstream request’s :body is an InputStream that I can also return, and the server will eventually close it, releasing resources. My question is: Is there any concern with this approach? I noticed that my server sometimes gets “stuck” and all requests to it time out. I have no idea if this is related to the proxy endpoint, but this does not seem like an unreasonable guess. Maybe some resources pile up until the server deadlocks somehow? Any ideas?

Matthew Davidson 2024-09-17T07:48:18.184939Z

Netty uses an event-driven execution model like Node, but unlike Node, which inherits JS's single-threaded conceptual model, Netty has no problems using a thread pool to divide the work up amongst cores. The event-driven model is still needed to ensure that Netty can scale up. Early web servers like old Apache used one real thread per request, which doesn't scale (the old C10k problem). Theoretically, vthreads present a new way around the problem, but I haven't kept up with the tech. Yes, the Aleph executors exist to prevent blocking Netty, although they're also necessary for other things. The deferred/stream execution model is just woven throughout. The problem may not be on the Netty interaction side of things. It could be that one of Clojure/Manifold/Aleph just doesn't play nicely with vthreads yet. (E.g., one area I've always wondered about is the propagation of Clojure's thread binding frames; copying those for each vthread could easily negate some of the advantages. Presumably ScopedValue is the correct replacement, but Clojure is pretty slow to adapt to Java changes.)

dergutemoritz 2024-09-17T07:50:15.758889Z

There's also the issue that synchronized methods / blocks pin vthreads - maybe that's what you're running into

🤯 1
2024-09-17T07:52:07.023389Z

@dergutemoritz that sounds like a hot lead (a “heiße Spur”), thanks!

dergutemoritz 2024-09-17T07:52:22.804579Z

Indeed 🕵️‍♂️ Only thought of this now

dergutemoritz 2024-09-17T07:52:31.303899Z

But it should be easy to tell from a stack dump

dergutemoritz 2024-09-17T07:52:46.512699Z

Grepping Netty's source for synchronized also gives 193 matches 😬

Matthew Davidson 2024-09-17T07:53:33.831599Z

Yeah, there's a zillion potential little hiccups like that.

dergutemoritz 2024-09-17T07:53:49.523929Z

Is the vthread executor pool fixed in size?

dergutemoritz 2024-09-17T07:53:56.913949Z

if so, that could easily result in deadlocks

2024-09-17T08:13:47.418189Z

No, new vthread per task

2024-09-17T08:14:13.979799Z

It seems that vthreads are meant to be created in large numbers, and there should be little reason to reuse them

dergutemoritz 2024-09-17T08:17:52.403699Z

I was referring to the underlying pool of real threads

dergutemoritz 2024-09-17T08:18:02.364219Z

which actually end up executing the tasks

2024-09-17T08:18:26.385119Z

Ah, ok. I think this is the ForkJoinPool/commonPool, which does grow

dergutemoritz 2024-09-17T08:19:13.359729Z

> The JVM maintains a pool of platform threads, created and maintained by a dedicated ForkJoinPool. Initially, the number of platform threads equals the number of CPU cores, and it cannot increase more than 256.

dergutemoritz 2024-09-17T08:19:47.550499Z

that was from a secondary source, though
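A small sketch for checking the carrier pool limits quoted above from a REPL. It assumes a JDK 21 runtime and reads the `jdk.virtualThreadScheduler.parallelism` and `jdk.virtualThreadScheduler.maxPoolSize` system properties; the function name `carrier-pool-settings` is made up for illustration.

```clojure
(defn carrier-pool-settings
  "Reports the virtual thread scheduler settings this JVM was started with.
  A nil value means the property was not set, so the JDK default applies."
  []
  {:available-processors (.availableProcessors (Runtime/getRuntime))
   :parallelism          (System/getProperty "jdk.virtualThreadScheduler.parallelism")
   :max-pool-size        (System/getProperty "jdk.virtualThreadScheduler.maxPoolSize")})

(carrier-pool-settings)
```

On a small container the processor count reported here is the number to compare against how many virtual threads can be pinned at once.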

2024-09-17T08:21:08.596119Z

The thing is: In my setup, I’m running two aleph servers. One for “public” HTTP handlers, and one for internal endpoints, such as a health endpoint. The health endpoint on the internal server always stays responsive, that’s why my container (running on fargate) was not restarted in these situations. Both servers use the same executor, so the system was able to run new vthreads. But of course it could still somehow be related to pinned vthreads and synchronized blocks…

dergutemoritz 2024-09-17T08:21:44.904669Z

IC

dergutemoritz 2024-09-17T08:27:46.935879Z

Keep us posted!

👍 2
dergutemoritz 2024-08-09T12:38:40.356579Z

Heya, sorry for the late reply. Is this still an issue? If so, could you provide a code snippet of how you set this up exactly, specifically how you block the server thread?

2024-08-09T12:54:55.434589Z

Hey! Unfortunately this is one of these happens-occasionally, hard-to-reproduce problems. It has not happened since, and so far I have not found a pattern in when it happens. It could be something completely different. The “download proxy” is just a shot in the dark: it was the one thing I wasn't sure could be a problem. My code essentially looks like this (simplified):

(defn download-handler [request]
  (let [{:keys [org repo asset-id]} (get-in request [:parameters :path])
        download-url (format "" org repo asset-id)]
    (-> (client/get download-url)
        deref
        (update :headers
            #(-> %
                 (update-keys str/lower-case)
                 (select-keys ["accept-ranges"
                               "content-disposition"
                               "content-length"
                               "content-range"
                               "content-type"
                               "etag"
                               "last-modified"]))))))

So my Ring handler performs an HTTP client request, blocks for the result and returns the response, just updating the headers a bit. In particular, the :body key is passed through unchanged.

dergutemoritz 2024-08-09T12:59:28.262519Z

Did you try looking at where threads are blocked when it happens e.g. via jstack?

dergutemoritz 2024-08-09T13:00:22.389859Z

As for your code, I suggest you try changing it to return the response deferred and attach the transformations via d/chain instead of deref'ing

dergutemoritz 2024-08-09T13:00:47.224329Z

That will free up the handler thread immediately for further requests
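A minimal sketch of that suggestion, assuming the handler from the earlier snippet with `aleph.http` aliased as `client` and `manifold.deferred` as `d`. The `filter-headers` helper and the URL are hypothetical; the original snippet elides the real URL.

```clojure
(ns example.proxy
  (:require [aleph.http :as client]
            [clojure.string :as str]
            [manifold.deferred :as d]))

(def passthrough-headers
  #{"accept-ranges" "content-disposition" "content-length"
    "content-range" "content-type" "etag" "last-modified"})

(defn filter-headers
  "Lower-cases header names and keeps only the ones listed above."
  [headers]
  (into {}
        (keep (fn [[k v]]
                (let [k (str/lower-case (name k))]
                  (when (passthrough-headers k) [k v]))))
        headers))

(defn download-handler
  "Returns a deferred response instead of blocking on the client call."
  [request]
  (let [{:keys [org repo asset-id]} (get-in request [:parameters :path])
        ;; Hypothetical URL, for illustration only.
        url (format "https://downloads.example.invalid/%s/%s/%s" org repo asset-id)]
    (d/chain (client/get url)
             #(update % :headers filter-headers))))
```

The `:body` still passes through untouched; only the point at which the handler thread is released changes.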

2024-08-09T13:02:29.003129Z

I’d like to avoid that, as we switched from using manifold/async to Java 21 virtual threads and blocking. Both server and client are configured to use a newVirtualThreadPerTask executor

dergutemoritz 2024-08-09T13:03:05.506479Z

It could be that for whatever reason your request never gets a response. Note that the client has no default request timeout, so it will just wait indefinitely. See the request-timeout and perhaps read-timeout client options.
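A hedged sketch of setting those options on a blocking request. The option names are the ones Aleph's client documents (`:pool-timeout`, `:connection-timeout`, `:request-timeout`, `:read-timeout`, all in milliseconds); the values, the `fetch-with-timeouts` name and the 504 fallback are illustrative assumptions, not part of the original setup.

```clojure
(ns example.timeouts
  (:require [aleph.http :as client]))

(def timeout-opts
  ;; Illustrative values in milliseconds.
  {:pool-timeout       5000    ; obtaining a connection from the pool
   :connection-timeout 5000    ; establishing the connection
   :request-timeout    30000   ; arrival of a response
   :read-timeout       10000}) ; completion of the response

(defn fetch-with-timeouts
  "Blocking GET that fails after the configured timeouts instead of
  waiting indefinitely."
  [url]
  (try
    @(client/get url timeout-opts)
    (catch Exception e
      ;; Timeouts surface as exceptions on deref. The concrete exception
      ;; class is deliberately not assumed here.
      {:status  504
       :headers {"content-type" "text/plain"}
       :body    (str "upstream request failed: " (.getMessage e))})))
```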

dergutemoritz 2024-08-09T13:03:33.197399Z

Ah well, but note that Aleph still will use manifold internally for all kinds of things

dergutemoritz 2024-08-09T13:03:45.973419Z

So I doubt you're gaining all too much by avoiding it here

2024-08-09T13:04:37.808829Z

Sure, but user code gets so much simpler when avoiding async. Readable stacktraces, using sync middleware, using tools that expect blocking like resilience libs, etc

dergutemoritz 2024-08-09T13:04:57.523759Z

Absolutely agree

2024-08-09T13:05:21.342969Z

Request timeout is a good idea. Was not aware that the client does not have a default. Seems odd, are you sure about that?

dergutemoritz 2024-08-09T13:05:26.386369Z

Just Aleph (and Netty for that matter) are not built for that execution paradigm unfortunately

2024-08-09T13:06:00.144629Z

I thought Aleph kind-of is, as it uses its own thread pool for handling requests?

dergutemoritz 2024-08-09T13:06:05.484369Z

So maybe you'd be better off with a different client and server then?

dergutemoritz 2024-08-09T13:06:26.031939Z

Well it goes to great lengths to not block (thus manifold)

2024-08-09T13:06:35.910499Z

For a newer project, I was indeed switching to http-kit

dergutemoritz 2024-08-09T13:06:55.728709Z

Hm I think http-kit is similar in that respect, no? 🤔

dergutemoritz 2024-08-09T13:07:10.455719Z

Also uses nio under the hood which is all non-blocking

dergutemoritz 2024-08-09T13:07:39.089389Z

But I'm not 100% sure

2024-08-09T13:07:41.983479Z

Could be

2024-08-09T13:07:45.903909Z

Probably yes

dergutemoritz 2024-08-09T13:07:53.388779Z

As for there not being a default request timeout, yes, I am 100% sure 🙂

dergutemoritz 2024-08-09T13:08:13.201719Z

I have been working on that part of the client fairly recently

2024-08-09T13:08:32.124299Z

But to be honest I’m not sure if that is a problem. Like Aleph, http-kit defaults to using a thread pool so that users don’t need to worry about blocking internal async code.

dergutemoritz 2024-08-09T13:09:32.176209Z

Yeah but these thread pools are usually meant to not be blocked by I/O which might end up working against the grain of the whole thing

2024-08-09T13:09:49.683119Z

newVirtualThreadPerTask is 🙂

dergutemoritz 2024-08-09T13:10:21.998749Z

Indeed, maybe it should work. Which pool did you replace this way?

2024-08-09T13:11:38.621579Z

All HTTP client requests get :response-executor executor, and I start the server with (http/start-server handler {:executor executor})

2024-08-09T13:12:36.741159Z

So I expect that every request the server handles runs on a vthread, and every deferred returned by the client as well. The latter does not really matter, as the server-request handling vthread will block for the client response

dergutemoritz 2024-08-09T13:12:49.877399Z

OK that sounds reasonable

dergutemoritz 2024-08-09T13:13:34.974069Z

So assuming that requests get stuck indefinitely, you could still be exhausting the client connection pool

dergutemoritz 2024-08-09T13:15:46.038059Z

So if for example all your proxied requests go to the same destination host, 8 stuck requests would suffice to make everything grind to a halt
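To make the pool limit visible, a custom pool can be given a larger per-host limit and a stats callback. This is a sketch assuming Aleph's `connection-pool` options `:connections-per-host` and `:stats-callback`; the callback name, the printed format and the value 32 are made up for illustration.

```clojure
(ns example.pool
  (:require [aleph.http :as client]))

(defn log-pool-stats
  "Hypothetical callback: receives a map of host -> usage statistics
  and prints one line per host."
  [stats]
  (doseq [[host s] stats]
    (println "pool stats" host s)))

(def proxy-pool
  (client/connection-pool
    {:connections-per-host 32              ; illustrative; the default is 8
     :stats-callback       log-pool-stats}))

;; Usage: pass the pool explicitly with each request, e.g.
;; (client/get url {:pool proxy-pool :request-timeout 30000})
```

Raising the limit only postpones exhaustion if requests really hang forever, so it is best combined with request timeouts.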

2024-08-09T13:17:10.424309Z

Hm, that could indeed be. Sounds like at the very minimum, I should look into request/read timeouts. It looks like these timeouts basically throw; do you happen to know if this exception causes the underlying resources to be cleaned up correctly?

2024-08-09T13:17:51.306619Z

Maybe it is also time to finally learn about the monitoring options of these connection pools 😄

👍 1
dergutemoritz 2024-08-09T13:18:01.232939Z

Yeah, it will clean up resources properly!

2024-08-09T13:21:04.579839Z

Cool, thanks!

2024-08-09T13:21:09.350949Z

Will give that a try!

dergutemoritz 2024-08-09T13:21:49.511379Z

You're welcome! Feel free to ping here for further questions

dergutemoritz 2024-08-09T13:22:04.011569Z

And good luck 🤞

dergutemoritz 2024-12-19T10:48:49.159969Z

FYI: https://github.com/netty/netty/issues/12848#issuecomment-2477059053

2024-09-16T08:29:17.073449Z

Thanks. I’m not sure what’s the best way forward right now. We cannot keep the server stalled for long to debug it, as many clients depend on it. So I need to restart it to fix the problem. I might just try to move away from Aleph to a different HTTP library, just to make sure it is not caused by something else.

Matthew Davidson 2024-09-16T08:34:28.944279Z

Well, at least on the client side, it should be 99%-compatible with clj-http (if you're not using Manifold streams), and on the server side, it's Ring-compatible (if, again, you're not using Manifold streams). If you want to debug, my first act would be switching from the custom virtual thread pool to the standard thread pool. I don't think anyone's tested virtual threads with Aleph/Manifold, and neither they nor Netty were built with them in mind.

👍 1
2024-09-16T08:44:08.073549Z

Yeah, maybe I prematurely switched to the shiny new toys!

Matthew Davidson 2024-09-16T08:50:03.361169Z

Even though the event-driven paradigm is more complicated to use, it's probably just as efficient, in terms of utilizing CPU cores, as vthreads. I don't know which Aleph version you're on, but later ones have better logging to help debug issues, even if you're not using it for HTTP2.

Matthew Davidson 2024-09-16T08:50:33.050969Z

My $.02, anyway

2024-09-16T08:59:29.160779Z

I’m sure it is very efficient. Originally, my code was very async, using deferred and avoiding blocking. My thinking was that this is the most efficient way to implement a “gateway” service. However, the code got more and more complex, e.g. I needed to add custom adapter code to get mulog to trace nicely across async calls, and so on. With vthreads being generally available, we decided to switch to a simpler blocking model, also as a proof of concept for another project that uses Promesa extensively. We figured this might be slightly less efficient, but so much easier to reason about.

Matthew Davidson 2024-09-16T09:22:49.203409Z

vthreads are easier to reason about, for sure. But there's a nonzero chance their interaction with Aleph/Netty is the cause of your problem 😄 Good hunting!

2024-09-16T09:26:27.038599Z

Thanks! Good hint! I might just fall back to the default executors. Vthreads might be a premature optimisation. 🙂

Matthew Davidson 2024-09-16T09:27:52.852259Z

I haven't really looked into them in a year, so it may be nothing, but with any luck, switching off them is pretty easy

2024-09-16T09:39:10.677369Z

True. To be honest, I doubt that this is the issue here, though. But I’m on thin ice 😄 Isn’t Netty all about async and avoiding threading (similar to NodeJS)? And the executor an Aleph addition to prevent users from accidentally blocking Netty? And then, shouldn’t it not matter which growable thread pool to use, e.g. Aleph’s dirigiste one vs vthreads? Somehow I feel it should not matter, but of course it very well could, with so many pieces in play. There’s also the common ForkJoinPool that actually “carries” vthreads…

2024-09-02T09:04:21.961189Z

Unfortunately, this keeps happening to us. Seemingly randomly, our server stops responding to HTTP requests. Since we’re behind a load balancer, we get 504 gateway timeouts. The proxy/client thing was just a guess, because this seems the most “exotic” thing our server is doing. How about the Aleph server — is it possible that it runs out of resources? Maybe previous requests not cleaning up? Any hints where I can start debugging? This is how we start the server (nothing fancy):

(let [port     (:port config)
      options  (-> {:port               port
                    :executor           executor
                    :shutdown-executor? false
                    :shutdown-timeout   (duration/get-seconds (:shutdown-timeout config))})
      server   (http/start-server handler options)]
  ;...

port is 8080, executor is a (Executors/newVirtualThreadPerTaskExecutor), and the :shutdown-timeout is set to 5 secs…

dergutemoritz 2024-09-13T13:34:42.424049Z

Hm there's a lot of potential causes. I suggest taking a look at what the threads are doing when this happens (e.g. via jstack) and which sockets the process is holding on to (e.g. via ss if you're on Linux)
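If only a REPL is available on the affected machine, a rough picture of the platform threads can be had without jstack. The sketch below uses only `Thread/getAllStackTraces`; the function names are made up, and virtual threads are not included in that map, so only carriers and other platform threads show up.

```clojure
(require '[clojure.string :as str])

(defn platform-thread-states
  "Returns name, state and top stack frame for every platform thread."
  []
  (->> (Thread/getAllStackTraces)
       (map (fn [[^Thread t frames]]
              {:name  (.getName t)
               :state (str (.getState t))
               :top   (some-> (first frames) str)}))
       (sort-by :name)))

(defn fork-join-threads
  "Narrows the list to ForkJoinPool workers, which include the carriers
  of virtual threads."
  []
  (filter #(str/includes? (:name %) "ForkJoinPool")
          (platform-thread-states)))
```

For the virtual threads themselves, JDK 21's `jcmd <pid> Thread.dump_to_file -format=json <file>` produces a dump that does include them.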

dergutemoritz 2024-09-13T13:35:44.392779Z

Or maybe you can still hire @kingmob to debug this 🙂 (but not sure if he's still available for consultancy jobs like that)

2024-09-23T09:36:23.258159Z

OK, I have an update. I experienced the blocking situation again, and managed to run a REPL on the container in this state. (We’re running on AWS Fargate with rather few resources, only 2 cores and 500M RAM.) I noticed that virtual threads did not start anymore, and there were two ForkJoinPool workers in WAITING state at jdk.internal.vm.Continuation.run, which I suppose means that they are running vthreads.

I further found this Netty issue, which I read as “virtual threads are not supported and there are no plans to ever support them right now”: https://github.com/netty/netty/issues/12816

Looking deeper it seems that @dergutemoritz was spot on: I think the thread pinning was the issue, and this does not only lead to “https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html#GUID-04C03FFC-066D-4857-85B9-E5A27A875AF9”, but can also cause thread starvation and deadlocks. I’m not sure what the criteria are, but in our case it seems that the JVM was not spinning up new carrier threads, even though the two existing ones were blocked. I think I saw somewhere that the pool should start with a thread per core (2 in our case), but can grow to a max of 256. With the tools installed on the machine, and frankly with the ones I’m familiar with, I could not inspect the virtual threads themselves, but only saw platform threads.

My guess is that we had a scenario like this:
• Two incoming HTTP requests, each run in a virtual thread, each executes an HTTP request downstream and blocks for the result.
• The outgoing HTTP requests complete, the responses run in virtual threads as well.
• Now somehow I assume that each pair of virtual threads synchronise: The server vthread wants to pipe the response of the client vthread. Either or both of them could use a synchronized-wait in the underlying implementation.
• That means we have 4 virtual threads, but only 2 platform threads. If the two client threads pin these platform threads, the server threads never get a chance to continue.
The system deadlocks.
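The scenario above can be sketched in isolation. This is not Aleph or Netty code and not a confirmed reproduction of the incident; it only illustrates the assumed mechanism on JDK 21, where a virtual thread that blocks inside a monitor (Clojure's `locking`, Java's `synchronized`) stays pinned to its carrier. The function name and parameters are hypothetical, and a timeout is used so that the demo always terminates.

```clojure
(import '(java.util.concurrent Executors))

(defn pinned-wait-demo
  "Starts n virtual threads that each block inside a monitor, then one more
  virtual thread that would release them. Returns what each blocked thread
  observed: :released or :timed-out."
  [n timeout-ms]
  (let [executor (Executors/newVirtualThreadPerTaskExecutor)
        release  (promise)
        blocked  (mapv (fn [_]
                         (.submit executor
                                  ^Callable
                                  (fn []
                                    ;; Blocking while holding a monitor pins
                                    ;; the carrier thread on JDK 21.
                                    (locking (Object.)
                                      (deref release timeout-ms :timed-out)))))
                       (range n))]
    ;; This task needs a free carrier thread before it can run at all.
    (.submit executor ^Callable (fn [] (deliver release :released)))
    (try
      (mapv #(.get %) blocked)
      (finally (.shutdown executor)))))

;; With n equal to the number of carrier threads (2 on the machine described
;; above), the releasing thread may not get scheduled until the waits time out.
(pinned-wait-demo 2 1000)
```

The outcome depends on core count, JDK version and scheduling order, so no particular result is claimed here. On JDK 21, starting the JVM with `-Djdk.tracePinnedThreads=full` prints a stack trace whenever a virtual thread blocks while pinned, which is one way to check whether this mechanism is actually in play.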

dergutemoritz 2024-09-23T14:08:19.332329Z

Cool, thanks for the detailed summary! For completeness' sake:
> Thread state for a waiting thread. A thread is in the waiting state due to calling one of the following methods:
> • Object.wait with no timeout
> • Thread.join with no timeout
> • LockSupport.park
>
> A thread in the waiting state is waiting for another thread to perform a particular action. For example, a thread that has called Object.wait() on an object is waiting for another thread to call Object.notify() or Object.notifyAll() on that object. A thread that has called Thread.join() is waiting for a specified thread to terminate.
Source: https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Thread.State.html#WAITING
So yeah, a deadlock seems quite likely.

2024-09-23T14:31:20.063439Z

Yep, but I was not sure what states to expect, e.g. wouldn’t a fork-join-worker go to WAITING state after finishing a task until it is notified to pick up the next task?

dergutemoritz 2024-09-23T14:33:09.943869Z

True, that could also be

dergutemoritz 2024-08-30T09:11:18.265739Z

If the server doesn't support HTTP/2, then the client will fall back to HTTP/1

dergutemoritz 2024-08-30T09:11:29.333879Z

(unless you explicitly disabled HTTP/1 which then would result in an error)

dergutemoritz 2024-08-30T09:13:10.952709Z

If you're curious for how exactly the client finds out which HTTP version the server supports, that's done via the ALPN TLS extension: https://datatracker.ietf.org/doc/html/rfc7301

2024-08-30T09:52:39.990589Z

Sorry for not being clear, with “the server”, I meant “my Aleph server”. I have a proxy, so the server gets a request and makes another one that it wants to pass through to the client. Anyway, I have meanwhile configured all timeouts that I could find, let’s see if the error happens again! (it did, but this was before I added these timeouts, and removed http/2 support from the client)

🤞 1
2024-07-29T12:52:59.794119Z

I’m running with a custom newVirtualThreadPerTask executor for both server and client, and block the server thread until the client request is done.

Matthew Davidson 2024-09-15T08:25:40.489159Z

I just wrapped up my MIT contract a week ago, so I'm actually available, fwiw.

2024-08-29T13:44:47.871759Z

I just realised that I use a custom connection pool with HTTP/2 enabled. I’m not sure why I enabled it, and to be 100% honest I don’t know enough about HTTP/2 to reason about it. The server is not configured for HTTP/2. Could this be an issue?