#aleph
2023-02-20
Ferdinand Beyer12:02:49

In a long-running server, I occasionally experience HTTP-Client connection timeout errors like such:

clojure.lang.ExceptionInfo: connection was closed after N seconds
I’m using the default connection pool, which sets keep-alive? true and idle-timeout 0, so I assume by default connections are kept indefinitely and will eventually time out. There seems to be no automatic retry. This took me by surprise, and I have a few questions:
• Is this by design?
• Are the defaults meant for short-lived processes that execute a few requests and then terminate?
• Am I supposed to use a custom connection pool with idle-timeout for long-lived processes?
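A minimal sketch of the kind of custom pool being asked about here, assuming Aleph's connection-pool options (:connection-options with :keep-alive? and :idle-timeout, presumably in milliseconds); the URL is just a placeholder:

(require '[aleph.http :as http])

;; Sketch only: a pool whose idle connections are closed client-side after 30s,
;; instead of being kept open until the server eventually drops them.
(def pool
  (http/connection-pool
    {:connection-options {:keep-alive? true
                          :idle-timeout 30000}}))

;; Pass the pool explicitly with each request.
@(http/request {:request-method :get
                :url "http://example.com/"
                :pool pool})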

Arnaud Geiser12:02:07

Hello Ferdinand! Can I ask which version of Aleph (0.6.1, 0.6.0?) you are using? You are correct, this is not "supposed" to happen. When idle-timeout is zero or not defined, the responses stream is not supposed to be closed under normal circumstances (its closing is what triggers the Exception). It can be closed when:
• the Channel is detected as inactive (it should not be)
• there is an issue with what you are sending (decoder failure handling since Aleph 0.6.1)
• probably other reasons I don't have off the top of my head
If you are already using Aleph 0.6.1, can I ask you to downgrade to 0.6.0 and retry? Do you have any code to share?

Arnaud Geiser12:02:39

> • Is this by design?
No, it's not
> • Are the defaults meant for short-lived processes that execute a few requests and then terminate?
No, by default, idle-timeout is unspecified and thus no IdleStateHandler is added inside the pipeline.
> • Am I supposed to use a custom connection pool with idle-timeout for long-lived processes?
No

Ferdinand Beyer12:02:30

I do use 0.6.1. Not sure which code to share, I basically call http/request without any “fancy” options.

Ferdinand Beyer12:02:30

When the server closes the connection, will the client get notified and remove it from the pool?

Ferdinand Beyer12:02:30

I can downgrade to 0.6.0 for sure. Not sure how to reproduce this, I assume it happens when my server is idle for a while itself

Ferdinand Beyer12:02:39

My server basically acts as an HTTP reverse proxy

Arnaud Geiser12:02:25

> When the server closes the connection, will the client get notified and remove it from the pool?
I'm unsure if the issue is on the server or client side currently... But there is definitely something. Let me dig a little bit.

:thanks3: 1
Ferdinand Beyer12:02:04

Just to test my understanding of Aleph's code:
• connection-pool uses create-connection, which in turn uses client/http-connection. This will eventually die when the server closes the connection, right?
• Subsequent attempts to retrieve a connection will fail
• request will remove a connection from the pool on error
• but as far as I can see the connection is not removed by other means?

Ferdinand Beyer12:02:10

This is purely client-side

Ferdinand Beyer12:02:52

E.g. my downstream request as an HTTP client fails on http/request

Arnaud Geiser12:02:36

> connection-pool uses create-connection, which in turn uses client/http-connection. This will eventually die when the server closes the connection, right?
If keep-alive is used, the connection will be released back to the pool and not destroyed: https://github.com/clj-commons/aleph/blob/master/src/aleph/http.clj#L387-L391
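An illustrative sketch (not the actual Aleph source) of the release-vs-dispose distinction the link above points at, reusing the flow/release and flow/dispose names that come up later in this thread; pool, host, and conn are placeholders:

(require '[aleph.flow :as flow])

;; Sketch only: a keep-alive connection is handed back to the pool for reuse,
;; anything else is disposed of. The real decision logic lives in aleph.http.
(defn return-or-dispose [pool host conn keep-alive?]
  (if keep-alive?
    (flow/release pool host conn)
    (flow/dispose pool host conn)))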

Ferdinand Beyer12:02:37

create-connection’s docstring reads:

If the server disconnects, all responses will be errors, and a new connection must be created.
However, maybe we are missing some automated handling here?

Arnaud Geiser12:02:20

(Random guess): I'm wondering whether it's related to the Dirigiste changes we made recently to ensure the Pool is removed when it has been inactive for some time...

Arnaud Geiser12:02:21

Is it something you can reproduce easily?

Arnaud Geiser12:02:47

(I got my answer... yes)
> ERROR in (test-server-name) (client.clj:612)
> expected: (.startsWith (request) "localhost")
> actual: clojure.lang.ExceptionInfo: connection was closed after 0.013 seconds
> {:request {:user-info nil, :pool #object[io.aleph.dirigiste.Pool 0x431298ef "[email protected]"], :request-url "http://localhost:8080/", :headers nil, :server-port 8080, :uri "/", :server-name "localhost", :query-string nil, :body nil, :scheme :http, :request-method :get}}

👍 1
Arnaud Geiser12:02:29

Please downgrade to version 0.6.0 for now

Arnaud Geiser12:02:16

(It's most likely not Aleph related but due to Dirigiste)

Ferdinand Beyer12:02:24

Alright! Not sure though if I understood your hypothesis, probably because I’m not very familiar with Aleph’s or dirigiste’s code. Is there some mechanism in place to automatically remove closed connections from the pool?

Arnaud Geiser12:02:51

(I need to validate my hypothesis first and I will come back with a clear explanation)

Arnaud Geiser13:02:10

Heads-up: It's also happening on my side on Aleph 0.6.0... So most likely unrelated to 0.6.1 (and thus Dirigiste changes).

Ferdinand Beyer13:02:57

How did you reproduce it locally?

Arnaud Geiser13:02:04

It's a bit random... but I executed test-server-name 1000 times in ring_test.clj:

Arnaud Geiser13:02:10

(deftest test-server-name
  (dotimes [_ 1000]
    (with-server [:server-name]
      (is (.startsWith ^String (request) "localhost")))))

Arnaud Geiser13:02:23

(I just wanted to validate if it's something new or not)

Ferdinand Beyer13:02:24

Meanwhile I found out that the connection is supposed to be removed from the pool on close:

(create-connection
 host
 conn-options'
 middleware
 #(flow/dispose @p host [@c])) ;; <-- This is the on-close callback

🙌 1
Arnaud Geiser14:02:28

To make sure we are on the same page: your hypothesis is that the HTTP connection is closed (removed from the pool) and later on reused when it should not be?

Ferdinand Beyer14:02:46

Yes, that’s right. I don’t know the internals, so I might be totally wrong.

Ferdinand Beyer14:02:12

Oh, you mentioned something about “inactive” channels, and I see that there is invokeChannelInactive in the stack trace, if it helps?

clojure.lang.ExceptionInfo: connection was closed after 1019.558 seconds
	at aleph.http.client$http_connection$fn__8848$fn__8869$fn__8873.invoke(client.clj:612)
	at manifold.deferred$fn__4795$chain_SINGLEQUOTE____4816.invoke(deferred.clj:748)
	at manifold.deferred$fn__4795$subscribe__4796$fn__4801.invoke(deferred.clj:714)
	at manifold.deferred.Listener.onSuccess(deferred.clj:218)
	at manifold.deferred.Deferred$fn__4647.invoke(deferred.clj:397)
	at manifold.deferred.Deferred.success(deferred.clj:397)
	at manifold.deferred$success_BANG_.invokeStatic(deferred.clj:242)
	at manifold.deferred$success_BANG_.invoke(deferred.clj:239)
	at manifold.stream.default.Stream$fn__5903.invoke(default.clj:124)
	at manifold.stream.default.Stream.close(default.clj:123)
	at aleph.http.client$client_handler$reify__8775.channelInactive(client.clj:210)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at io.netty.handler.stream.ChunkedWriteHandler.channelInactive(ChunkedWriteHandler.java:137)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelInactive(CombinedChannelDuplexHandler.java:418)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)
	at io.netty.handler.codec.http.HttpClientCodec$Decoder.channelInactive(HttpClientCodec.java:329)
	at io.netty.channel.CombinedChannelDuplexHandler.channelInactive(CombinedChannelDuplexHandler.java:221)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)
	at io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1075)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:301)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at manifold.executor$thread_factory$reify__4155$f__4156.invoke(executor.clj:70)
	at clojure.lang.AFn.run(AFn.java:22)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)

Arnaud Geiser14:02:02

> clojure.lang.ExceptionInfo: connection was closed after 1019.558 seconds
We talked about the client, but are you also targeting an Aleph server?

Ferdinand Beyer14:02:25

Nope, the server is a third party

Arnaud Geiser14:02:46

Can it be the third party that closes the connection instead?

Ferdinand Beyer14:02:41

Sure, that was my assumption. That Aleph tries to keep a connection open and the server closes it eventually

Ferdinand Beyer14:02:56

Or do you mean mid-response?

Arnaud Geiser14:02:08

(Because what I'm investigating is not the same symptom as the one you sent me (the stacktrace).)

Arnaud Geiser14:02:23

> Sure, that was my assumption. That Aleph tries to keep a connection open and the server closes it eventually
Yes, that's what I think too

Ferdinand Beyer14:02:05

That’s why I asked about the idle timeout. As a proxy, it might make sense to reuse a connection if I get multiple requests myself at the same time, but after a short idle time it does not make sense to assume the connection is still open for the next request

Ferdinand Beyer14:02:17

Maybe some kind of race condition? Connection is closed -> it is removed from the pool. Next request: acquires the connection-fn from the pool

Arnaud Geiser14:02:35

But here we are talking about a single request that took 17 minutes to be completed, no? https://github.com/clj-commons/aleph/blob/master/src/aleph/http/client.clj#L596-L613

Ferdinand Beyer14:02:19

No, the request failed after 0.2s. The time t0 here is the time since the connection was created, right?

Arnaud Geiser14:02:16

Yes, correct, sorry!

Arnaud Geiser14:02:39

Anyway, it's a good lead: a connection that has been closed went back to the pool when it should not have. AFAIK, only flow/release inside Aleph will move a connection back to the pool. There is something I don't get for now...

Ferdinand Beyer14:02:13

Well I think the connection was OK and sent back to the pool, but it can be closed by the server while it is in the pool, right? Theoretically it should then get disposed…

Arnaud Geiser14:02:30

Aaarrf.. that would be terribly annoying..

Arnaud Geiser14:02:21

> Theoretically it should then get disposed…
But it means we'll only know when we try to fetch a connection back from the pool?

Ferdinand Beyer14:02:08

Not sure how this Dirigiste thing works.

Arnaud Geiser14:02:45

It's really just "object pooling", nothing more. There is no janitor concept inside it.

Ferdinand Beyer15:02:17

Thinking about the stacktrace. It looks like the channel became inactive and this closed the stream, while we tried to get the connection, and this is all synchronous:

at aleph.http.client$http_connection$fn__8848$fn__8869$fn__8873.invoke(client.clj:612)
	at manifold.deferred$fn__4795$chain_SINGLEQUOTE____4816.invoke(deferred.clj:748)
	at manifold.deferred$fn__4795$subscribe__4796$fn__4801.invoke(deferred.clj:714)
	at manifold.deferred.Listener.onSuccess(deferred.clj:218)
	at manifold.deferred.Deferred$fn__4647.invoke(deferred.clj:397)
	at manifold.deferred.Deferred.success(deferred.clj:397)
	at manifold.deferred$success_BANG_.invokeStatic(deferred.clj:242)
	at manifold.deferred$success_BANG_.invoke(deferred.clj:239)
	at manifold.stream.default.Stream$fn__5903.invoke(default.clj:124)
	at manifold.stream.default.Stream.close(default.clj:123)
	at aleph.http.client$client_handler$reify__8775.channelInactive(client.clj:210)
If the connection got closed before request is called, I’d expect the stream to already be closed, and the deferred from (s/take! responses ::closed) would be a new one, so that the stack trace should not contain aleph.http.client$client_handler$reify__8775.channelInactive
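A minimal illustration of that expectation (a placeholder stream, nothing Aleph-specific): taking from an already-closed stream with a default completes immediately with that default, on the caller's side, so no channelInactive frame would appear.

(require '[manifold.stream :as s])

;; If the stream was already closed before the take!, the default value
;; comes back on an already-realized deferred.
(def responses (s/stream))
(s/close! responses)
@(s/take! responses ::closed)  ;; => ::closed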

Ferdinand Beyer15:02:13

Not sure how netty works, maybe we only observe a connection to be closed when trying to send a request?

Arnaud Geiser15:02:20

Agreed. I was investigating this behavior (I had no channelInactive on my stacktrace):
> would be a new one, so that the stack trace should not contain aleph.http.client$client_handler$reify__8775.channelInactive

Arnaud Geiser15:02:31

> Not sure how netty works, maybe we only observe a connection to be closed when trying to send a request?
Unfortunately, I don't know. I'm trying to put some logging before sending the request to check for a ".isClosed".
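For what it's worth, a minimal sketch of that kind of pre-flight logging; Netty's Channel exposes isOpen/isActive rather than an isClosed method, so the check would presumably look something like:

(import '(io.netty.channel Channel))

;; Sketch only: log the channel state right before sending a request on it.
(defn log-channel-state [^Channel ch]
  (println "channel open?" (.isOpen ch) "active?" (.isActive ch)))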

Arnaud Geiser15:02:10

(I will come back later this evening (CET), I won't forget this issue)

👍 1
Arnaud Geiser16:02:38

I won't draw any conclusion yet... but in both your case and mine, this is what triggers the "channelInactive": https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/AbstractChannel.java#L795-L827

Ferdinand Beyer16:02:53

I noticed something else: I only got this error in one handler, which provides a download of a 50 MB file, streaming the upstream input stream downstream to the client. I wonder if the connection is returned before this download completes, or if the connection remains in a weird state if the response is not consumed completely?

Arnaud Geiser17:02:37

Maybe... To be completely honest, I'm having a hard time understanding how this responses stream is supposed to work [1]. How do we even guarantee that the response we got from the stream is related to the req we wanted to send? What guarantees the ordering?
[1]: https://github.com/clj-commons/aleph/blob/master/src/aleph/http/client.clj#L603

kingmob18:02:41

> How do we even guarantee that the response we got from the stream is related to the req we wanted to send? What guarantees the ordering?
Because of the request-response nature of HTTP, remember. The requestor waits for a response before requesting again. The keep-alive header is about the TCP connection, not the HTTP. Admittedly, it gets more convoluted once we add HTTP pipelining, which enables a requestor to send multiple requests before getting a response. This is where content-length and chunked transfer encoding come in. On the server side, if you use HTTP/1.1 pipelining, and a later handler finishes before an earlier one, it will be chained to wait until the previous one's deferred is finished, so there are no issues there.

👍 1
kingmob18:02:15

What evidence do we currently have that this is related to dirigiste? Because I suspect it may be a race elsewhere.

Arnaud Geiser18:02:58

No evidence. Dirigiste was just an easy target; I don't think it is anymore. Maybe a destroy that got called when scaling down a Pool. To me, Dirigiste is no longer a good lead to explore.

kingmob18:02:46

In http-connection, the responses stream is passed to the client handler, which sets up a callback for channelInactive events to close the stream. And on incoming requests, the request fn grabs the channel's monitor before putting requests on the requests stream and taking the response. Other places aren't locking on the channel, but even if they were, I'm not sure netty itself is using that monitor, so a channel might be closed even if the monitor is held. E.g.:

rsp (locking ch
      (s/put! requests req)
      (s/take! responses ::closed))]

(-> (netty/safe-execute ch
      (http/send-message ch true ssl? req' body))
    (d/catch' (fn [e]
                (s/put! responses (d/error-deferred e))
                (netty/close ch))))))

;; this will usually happen because of a malformed request
(catch Throwable e
  (s/put! responses (d/error-deferred e))
  (netty/close ch))))
Both of the catch examples above could be a race, if the put! actually runs AFTER the netty/close -> channelInactive handler closes the responses stream. The solution is either to lock on the channel, or to not close it until after the error-deferred has been put on the responses stream. However, I don't know how that will interact with Netty; I have to check...
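A minimal sketch of that second option, assuming the same manifold and aleph.netty helpers quoted above (deliver-error-then-close is a hypothetical name, not an Aleph function):

(require '[manifold.deferred :as d]
         '[manifold.stream :as s]
         '[aleph.netty :as netty])

;; Sketch only: put the error onto the responses stream first, and close the
;; channel only once that put! has completed, so a channelInactive-driven
;; stream close cannot race ahead of the error delivery.
(defn deliver-error-then-close [responses ch e]
  (-> (s/put! responses (d/error-deferred e))
      (d/chain' (fn [_] (netty/close ch)))))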

kingmob18:02:27

It's really late here, I have to call it a night...

👍 1
Arnaud Geiser18:02:58

If it proves to be the root cause, we can definitely remove those netty/close calls. Aleph lived with it for a long time.

kingmob01:02:30

I didn’t get much sleep the night before, so I think I was concerned about the wrong thing. The “race” I was worried about shouldn’t be a concern, or at least, not be related to this issue. The locking call maybe needs a closer look, though

kingmob04:02:26

Hmm, if we go to the commit before PR #592, does that solve the problem?

Arnaud Geiser10:02:24

(I haven't tried yet, but for me, that's not the culprit.) My current hypothesis is that the channel got closed asynchronously (for whatever reason, which might not even be an Exception) [1], and later on, when the connection is reused, the responses stream is already closed.
[1]: https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/AbstractChannel.java#L795-L827

kingmob07:02:20

(Minor addendum about HTTP/1.1 pipelining: we support it, but apparently almost no browsers do anymore; too problematic)