2023-02-20
In a long-running server, I occasionally experience HTTP client connection timeout errors like this:
clojure.lang.ExceptionInfo: connection was closed after N seconds
I’m using the default connection pool, which sets keep-alive? true and idle-timeout 0, so I assume connections are kept alive indefinitely by default and will eventually time out. There seems to be no automatic retry.
This took me by surprise, and I have a few questions:
• Is this by design?
• Are the defaults meant for short-lived processes that execute a few requests and then terminate?
• Am I supposed to use a custom connection pool with idle-timeout for long-lived processes?
Hello Ferdinand!
Can I ask you which version of Aleph (0.6.1, 0.6.0?) you are using?
You are correct, this is not "supposed" to happen. When idle-timeout is zero or not defined, the responses stream is not supposed to be closed under normal circumstances (closing it is what triggers the Exception).
It can be closed when:
• the Channel is detected as inactive (it should not be)
• you have an issue with what you are sending (decoder failure handling since Aleph 0.6.1)
• probably other reasons that are not at the top of my head
If you are using Aleph version 0.6.1 already, can I ask you to downgrade to 0.6.0 and retry?
Do you have any code to share?
> • Is this by design?
No, it's not
> • Are the defaults meant for short-lived processes that execute a few requests and then terminate?
No, by default, idle-timeout is unspecified and thus no IdleStateHandler is added inside the pipeline.
> • Am I supposed to use a custom connection pool with idle-timeout for long-lived processes?
No
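(Aside: a minimal sketch of what the custom-pool route could look like, assuming :idle-timeout is accepted under :connection-options as discussed in this thread; the 30-second value is just an example, not a recommendation from either participant.)

(require '[aleph.http :as http])

;; Hypothetical configuration: keep-alive stays on, but idle connections
;; are torn down after 30s instead of being kept around indefinitely.
(def pool
  (http/connection-pool
    {:connection-options {:keep-alive? true
                          :idle-timeout 30000}}))

;; Requests must opt into the custom pool explicitly.
@(http/get "http://example.com/" {:pool pool})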
I do use 0.6.1. Not sure which code to share, I basically call http/request without any “fancy” options.
When the server closes the connection, will the client get notified and remove it from the pool?
I can downgrade to 0.6.0 for sure. Not sure how to reproduce this; I assume it happens when my server itself is idle for a while.
My server basically acts as an HTTP reverse proxy
> When the server closes the connection, will the client get notified and remove it from the pool?
I'm unsure if the issue is on the server or client side currently... But there is definitely something. Let me dig a little bit.

Just to test my understanding of Aleph’s code:
• connection-pool uses create-connection that in turn uses client/http-connection. This will eventually die when the server closes the connection, right?
• Subsequent attempts to retrieve a connection will fail
• request will remove a connection from the pool on error
• but as far as I can see the connection is not removed by other means?
This is purely client-side
E.g. my downstream request as an HTTP client fails on http/request
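(Aside: since the thread notes there is no automatic retry, a hedged client-side workaround is to retry once when this specific error appears. A sketch, assuming the request is safe to replay and that the error message always starts with "connection was closed", as in the reports above.)

(require '[aleph.http :as http]
         '[manifold.deferred :as d]
         '[clojure.string :as str])

(defn request-with-retry
  "Hypothetical helper: if the pooled connection turns out to have been
   closed by the server, retry the request once on a fresh connection."
  [req]
  (d/catch
    (http/request req)
    clojure.lang.ExceptionInfo
    (fn [e]
      (if (str/starts-with? (str (ex-message e)) "connection was closed")
        (http/request req)   ;; one retry; assumes the request body is replayable
        (d/error-deferred e)))))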
> • connection-pool uses create-connection that in turn uses client/http-connection. This will eventually die when the server closes the connection, right?
If keep-alive is used, the connection will be released to the pool and not destroyed. https://github.com/clj-commons/aleph/blob/master/src/aleph/http.clj#L387-L391
create-connection's docstring reads:
If the server disconnects, all responses will be errors, and a new connection must be created.
However, maybe we are missing automated handling here?
(Random guess): I'm wondering if it's not related to the Dirigiste changes we have done recently to ensure the Pool is removed when not active after some time...
Is it something you can reproduce easily?
(I got my answer... yes)
> ERROR in (test-server-name) (client.clj:612)
> expected: (.startsWith (request) "localhost")
> actual: clojure.lang.ExceptionInfo: connection was closed after 0.013 seconds
> {:request {:user-info nil, :pool #object[io.aleph.dirigiste.Pool 0x431298ef "[email protected]"], :request-url "http://localhost:8080/", :headers nil, :server-port 8080, :uri "/", :server-name "localhost", :query-string nil, :body nil, :scheme :http, :request-method :get}}
Please downgrade to version 0.6.0 for now
(It's most likely not Aleph related but due to Dirigiste)
Alright! Not sure though if I understood your hypothesis, probably because I’m not very familiar with Aleph’s or dirigiste’s code. Is there some mechanism in place to automatically remove closed connections from the pool?
(I need to validate my hypothesis first and I will come back with a clear explanation)
Thanks Arnaud!
Heads-up: it's also happening on my side on Aleph 0.6.0... So it's most likely not specific to 0.6.1 (and thus not the Dirigiste changes).
How did you reproduce it locally?
It's a bit random... but I executed test-server-name 1000 times in ring_test.clj:
(deftest test-server-name
  (dotimes [_ 1000]
    (with-server [:server-name]
      (is (.startsWith ^String (request) "localhost")))))
Ah, OK 🙂
(I just wanted to validate if it's something new or not)
Meanwhile I found out that the connection is supposed to be removed from the pool on close:
(create-connection
  host
  conn-options'
  middleware
  #(flow/dispose @p host [@c])) ;; <-- This is the on-close callback
To make sure we are on the same page: your hypothesis is that the HTTP connection is closed (removed from the pool) and later on reused while it should not be?
Yes, that’s right. I don’t know the internals, so I might be totally wrong.
Oh, you mentioned something about “inactive” channels, and I see that there is invokeChannelInactive in the stack trace, if it helps?
clojure.lang.ExceptionInfo: connection was closed after 1019.558 seconds
at aleph.http.client$http_connection$fn__8848$fn__8869$fn__8873.invoke(client.clj:612)
at manifold.deferred$fn__4795$chain_SINGLEQUOTE____4816.invoke(deferred.clj:748)
at manifold.deferred$fn__4795$subscribe__4796$fn__4801.invoke(deferred.clj:714)
at manifold.deferred.Listener.onSuccess(deferred.clj:218)
at manifold.deferred.Deferred$fn__4647.invoke(deferred.clj:397)
at manifold.deferred.Deferred.success(deferred.clj:397)
at manifold.deferred$success_BANG_.invokeStatic(deferred.clj:242)
at manifold.deferred$success_BANG_.invoke(deferred.clj:239)
at manifold.stream.default.Stream$fn__5903.invoke(default.clj:124)
at manifold.stream.default.Stream.close(default.clj:123)
at aleph.http.client$client_handler$reify__8775.channelInactive(client.clj:210)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
at io.netty.handler.stream.ChunkedWriteHandler.channelInactive(ChunkedWriteHandler.java:137)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelInactive(CombinedChannelDuplexHandler.java:418)
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)
at io.netty.handler.codec.http.HttpClientCodec$Decoder.channelInactive(HttpClientCodec.java:329)
at io.netty.channel.CombinedChannelDuplexHandler.channelInactive(CombinedChannelDuplexHandler.java:221)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)
at io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1075)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:301)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at manifold.executor$thread_factory$reify__4155$f__4156.invoke(executor.clj:70)
at clojure.lang.AFn.run(AFn.java:22)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
> clojure.lang.ExceptionInfo: connection was closed after 1019.558 seconds
We talked about the client, but are you also targeting an Aleph server?
Nope, the server is a third party
Can it be the third party that closes the connection instead?
Sure, that was my assumption. That Aleph tries to keep a connection open and the server closes it eventually
Or do you mean mid-response?
(Because what I'm investigating is not the same symptom as the one you sent me (the stack trace).)
> Sure, that was my assumption. That Aleph tries to keep a connection open and the server closes it eventually
Yes, that's what I think too
That’s why I asked about the idle timeout. As a proxy, it might make sense to reuse a connection if I get multiple requests myself at the same time, but after a short idle time it does not make sense to assume the connection is still open for the next request
Maybe some kind of race condition? Connection is closed -> it is removed from the pool. Next request: acquires the connection-fn from the pool
But here we are talking about a single request that took 17 minutes to be completed, no? https://github.com/clj-commons/aleph/blob/master/src/aleph/http/client.clj#L596-L613
No, the request failed after 0.2s. The time t0 here is the time since the connection was created, right?
Yes, correct, sorry!
No worries
Anyway, it's a good lead.
A connection that has been closed went back to the pool while it should not have.
AFAIK, only flow/release inside Aleph will move the connection back to the pool.
There is something I don't get for now..
Well I think the connection was OK and sent back to the pool, but it can be closed by the server while it is in the pool, right? Theoretically it should then get disposed…
Aaarrf.. that would be terribly annoying..
> Theoretically it should then get disposed…
But it means we'll only know when we try to fetch a connection back from the pool?
Not sure how this Dirigiste thing works.
It's really just "object pooling", nothing more. There is no janitor concept inside it.
You have the whole API of Pool there: https://github.com/clj-commons/dirigiste/blob/master/test/java/io/aleph/dirigiste/PoolTest.java#L60-L67
• acquire
• dispose
• release
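(Aside: a toy pool, deliberately not Dirigiste's API or implementation, just to illustrate the "no janitor" point: nothing ever inspects the idle objects, so a connection closed by the server while parked in the pool is only noticed once someone acquires and uses it.)

;; Toy object pool — purely illustrative, NOT Dirigiste.
(def pool (atom {}))  ;; key -> vector of idle objects

(defn release! [k obj] (swap! pool update k (fnil conj []) obj))
(defn dispose! [k obj] (swap! pool update k #(vec (remove #{obj} %))))

(defn acquire! [k]
  (let [[old _] (swap-vals! pool update k #(vec (rest %)))]
    (first (get old k))))

;; No background task ever scans `pool`, so a dead connection sits here
;; undetected until `acquire!` hands it back to a caller.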
Thinking about the stacktrace. It looks like the channel became inactive and this closed the stream, while we tried to get the connection, and this is all synchronous:
at aleph.http.client$http_connection$fn__8848$fn__8869$fn__8873.invoke(client.clj:612)
at manifold.deferred$fn__4795$chain_SINGLEQUOTE____4816.invoke(deferred.clj:748)
at manifold.deferred$fn__4795$subscribe__4796$fn__4801.invoke(deferred.clj:714)
at manifold.deferred.Listener.onSuccess(deferred.clj:218)
at manifold.deferred.Deferred$fn__4647.invoke(deferred.clj:397)
at manifold.deferred.Deferred.success(deferred.clj:397)
at manifold.deferred$success_BANG_.invokeStatic(deferred.clj:242)
at manifold.deferred$success_BANG_.invoke(deferred.clj:239)
at manifold.stream.default.Stream$fn__5903.invoke(default.clj:124)
at manifold.stream.default.Stream.close(default.clj:123)
at aleph.http.client$client_handler$reify__8775.channelInactive(client.clj:210)
If the connection got closed before request is called, I'd expect the stream to already be closed, and the deferred from (s/take! responses ::closed) would be a new one, so that the stack trace should not contain aleph.http.client$client_handler$reify__8775.channelInactive
Not sure how netty works, maybe we only observe a connection to be closed when trying to send a request?
Agreed. I was investigating this behavior (I had no channelInactive in my stack trace)
> would be a new one, so that the stack trace should not contain aleph.http.client$client_handler$reify__8775.channelInactive
> Not sure how netty works, maybe we only observe a connection to be closed when trying to send a request?
Unfortunately, I don't know, I'm trying to put some logging before sending the request to check for a ".isClosed".
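(Aside: Netty's Channel exposes isOpen/isActive rather than an isClosed method, so a debug check of that kind might look like the hedged sketch below; ch stands for whatever io.netty.channel.Channel backs the pooled connection.)

(defn log-channel-state
  "Hypothetical debug hook: print the channel state right before a pooled
   connection is reused."
  [^io.netty.channel.Channel ch]
  (println "channel" (str ch) "open?" (.isOpen ch) "active?" (.isActive ch)))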
I won't draw any conclusion.. but both in your case and mine, this is what triggers the "ChannelInactive": https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/AbstractChannel.java#L795-L827
I noticed something else: I only got this error in one handler, which provides a download of a 50 MB file, streaming the upstream input stream downstream to the client. I wonder if the connection is returned before this download completes, or if the connection remains in a weird state if the response is not consumed completely?
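(Aside: a hedged sketch of a pass-through handler for that scenario, built from the description above rather than taken from the thread. upstream-url is a hypothetical helper that rewrites the request URL, and the upstream :body is assumed to be the InputStream the default Aleph client returns, which the server can stream back as-is.)

(require '[aleph.http :as http]
         '[manifold.deferred :as d])

(defn proxy-handler
  "Forward the incoming request upstream and stream the upstream body
   back to the client without buffering it in memory."
  [req]
  (d/chain
    (http/request {:request-method (:request-method req)
                   :url            (upstream-url req)   ;; hypothetical URL rewrite
                   :headers        (dissoc (:headers req) "host")
                   :body           (:body req)})
    (fn [rsp]
      {:status  (:status rsp)
       :headers (:headers rsp)
       :body    (:body rsp)})))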
Maybe.. To be completely honest, I'm having a hard time understanding how this responses stream is supposed to work. [1]
How do we even guarantee that the response we got from the stream is related to the req we wanted to send? What guarantees the ordering?
[1]: https://github.com/clj-commons/aleph/blob/master/src/aleph/http/client.clj#L603
> How do we even guarantee that the response we got from the stream is related to the req we wanted to send? What guarantees the ordering?
Because of the request-response nature of HTTP, remember. The requestor waits for a response before requesting again. The keep-alive header is about the TCP connection, not HTTP itself.
Admittedly, it gets more convoluted once we add HTTP pipelining, which enables a requestor to send multiple requests before getting a response. This is where content-length and chunked transfer encoding come in.
On the server side, if you use HTTP/1.1 pipelining and a later handler finishes before an earlier one, it will be chained to wait until the previous one's deferred is finished, so there are no issues there.
What evidence do we currently have that this is related to dirigiste? Because I suspect it may be a race elsewhere.
No evidence. Dirigiste was just an easy target. I don't think it is anymore. Maybe a destroy that got called while scaling down a Pool. To me, Dirigiste is no longer a good lead to explore.
Also, I had a build failure related to this issue when merging an unrelated PR. https://app.circleci.com/pipelines/github/clj-commons/aleph/470/workflows/17b1008d-4bbf-4b3c-b345-2545f3f5112b/jobs/455/parallel-runs/0/steps/0-107
In http-connection, the responses stream is passed to the client handler, which sets up a callback for channelInactive events to close the stream.
And on incoming requests, the request fn grabs the channel's monitor before putting requests on the requests stream and taking the response. Other places aren't locking on the channel, but even if they were, I'm not sure netty itself is using that monitor, so a channel might be closed even if the monitor is held.
E.g.:
rsp (locking ch
      (s/put! requests req)
      (s/take! responses ::closed))]
(-> (netty/safe-execute ch
      (http/send-message ch true ssl? req' body))
    (d/catch' (fn [e]
                (s/put! responses (d/error-deferred e))
                (netty/close ch))))))
;; this will usually happen because of a malformed request
(catch Throwable e
  (s/put! responses (d/error-deferred e))
  (netty/close ch))))
Both of the catch examples above could be a race, if the put! actually runs AFTER the netty/close -> channelInactive handler closes the response stream.
The solution is to either lock on the channel, or not close it until after the error-deferred has been put on the responses stream. However, I don't know how it will interact with netty, I have to check...
For the record, here is the PR: https://github.com/clj-commons/aleph/pull/592/files
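(Aside: a hedged sketch of the second option just described, reordering the snippet above so the close is chained after the put! deferred completes; this is only an illustration of the idea, not the change in the PR.)

(d/catch' (fn [e]
            ;; put the error on the responses stream first, then close,
            ;; so the channelInactive handler cannot close the stream
            ;; before the error is visible to the waiting take!
            (-> (s/put! responses (d/error-deferred e))
                (d/chain' (fn [_] (netty/close ch))))))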
If it proves to be the root cause, we can definitely remove those netty/close calls. Aleph lived with it for a long time.
I didn’t get much sleep the night before, so I think I was concerned about the wrong thing. The “race” I was worried about shouldn’t be a concern, or at least, it shouldn’t be related to this issue. The locking call may need a closer look, though.
(I haven't tried yet, but for me, that's not the culprit)
My current hypothesis is that we got closed asynchronously (for whatever reason, which might not even be an Exception) [1]. And later on, when the connection is reused, the response stream is closed already.
[1]: https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/AbstractChannel.java#L795-L827