#aleph
2023-02-20
Ferdinand Beyer12:02:49

In a long-running server, I occasionally experience HTTP-Client connection timeout errors like such:

clojure.lang.ExceptionInfo: connection was closed after N seconds
I’m using the default connection pool, which sets keep-alive? true and idle-timeout 0, so I assume by default connections are kept indefinitely and will eventually time out. There seems to be no automatic retry. This took me by surprise, and I have a few questions:
• Is this by design?
• Are the defaults meant for short-lived processes that execute a few requests and then terminate?
• Am I supposed to use a custom connection pool with idle-timeout for long-lived processes?
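For reference, a minimal sketch of the kind of custom pool being asked about, assuming Aleph's documented connection-pool / connection-options keys (the 30-second value is arbitrary, not a recommendation):

(require '[aleph.http :as http])

;; keep-alive? with a finite idle-timeout (in ms), so connections that sit
;; idle in the pool are closed instead of lingering indefinitely
(def pool
  (http/connection-pool
    {:connection-options {:keep-alive?  true
                          :idle-timeout 30000}}))

;; the pool is then passed explicitly on each request
@(http/get "http://example.com/" {:pool pool})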

Arnaud Geiser12:02:07

Hello Ferdinand! Can I ask which version of Aleph (0.6.1, 0.6.0?) you are using? You are correct, this is not "supposed" to happen. When idle-timeout is zero or not defined, the responses stream should not be closed under normal circumstances (closing it is what triggers the Exception). It can be closed when:
• the Channel is detected as inactive (it should not be)
• there is an issue with what you are sending (decoder failure handling since Aleph 0.6.1)
• probably other reasons I don't have off the top of my head
If you are already on Aleph 0.6.1, can I ask you to downgrade to 0.6.0 and retry? Do you have any code to share?

Arnaud Geiser12:02:39

> • Is this by design?
No, it's not.
> • Are the defaults meant for short-lived processes that execute a few requests and then terminate?
No. By default, idle-timeout is unspecified and thus no IdleStateHandler is added inside the pipeline.
> • Am I supposed to use a custom connection pool with idle-timeout for long-lived processes?
No.

Ferdinand Beyer12:02:30

I do use 0.6.1. Not sure which code to share, I basically call http/request without any “fancy” options.

Ferdinand Beyer12:02:30

When the server closes the connection, will the client get notified and remove it from the pool?

Ferdinand Beyer12:02:30

I can downgrade to 0.6.0 for sure. Not sure how to reproduce this, I assume it happens when my server is idle for a while itself

Ferdinand Beyer12:02:39

My server basically acts as an HTTP reverse proxy

Arnaud Geiser12:02:25

> When the server closes the connection, will the client get notified and remove it from the pool?
I'm unsure if the issue is on the server or the client side currently... But there is definitely something. Let me dig a little bit.

:thanks3: 2
Ferdinand Beyer12:02:04

Just to test my understanding of Aleph’s code:
• connection-pool uses create-connection that in turn uses client/http-connection. This will eventually die when the server closes the connection, right?
• Subsequent attempts to retrieve a connection will fail
• request will remove a connection from the pool on error
• but as far as I can see the connection is not removed by other means?

Ferdinand Beyer12:02:10

This is purely client-side

Ferdinand Beyer12:02:52

E.g. my downstream request as an HTTP client fails on http/request

Arnaud Geiser12:02:36

> • connection-pool uses create-connection that in turn uses client/http-connection. This will eventually die, when the server closes the connection, right?
If keep-alive is used, the connection will be released to the pool and not destroyed: https://github.com/clj-commons/aleph/blob/master/src/aleph/http.clj#L387-L391

Ferdinand Beyer12:02:37

create-connection’s docstring reads:

> If the server disconnects, all responses will be errors, and a new connection must be created.
However, maybe we are missing automated handling here?

Arnaud Geiser12:02:20

(Random guess): I'm wondering whether it's related to the Dirigiste changes we made recently to ensure the Pool is removed when it has not been active for some time...

Arnaud Geiser12:02:21

Is it something you can reproduce easily?

Arnaud Geiser12:02:47

(I got my answer... yes)
> ERROR in (test-server-name) (client.clj:612)
> expected: (.startsWith (request) "localhost")
> actual: clojure.lang.ExceptionInfo: connection was closed after 0.013 seconds
>   {:request {:user-info nil, :pool #object[io.aleph.dirigiste.Pool 0x431298ef "io.aleph.dirigiste.Pool@431298ef"], :request-url "http://localhost:8080/", :headers nil, :server-port 8080, :uri "/", :server-name "localhost", :query-string nil, :body nil, :scheme :http, :request-method :get}}

👍 2
Arnaud Geiser12:02:29

Please downgrade to version 0.6.0 for now

Arnaud Geiser12:02:16

(It's most likely not Aleph related but due to Dirigiste)

Ferdinand Beyer12:02:24

Alright! Not sure though if I understood your hypothesis, probably because I’m not very familiar with Aleph’s or dirigiste’s code. Is there some mechanism in place to automatically remove closed connections from the pool?

Arnaud Geiser12:02:51

(I need to validate my hypothesis first and I will come back with a clear explanation)

Arnaud Geiser13:02:10

Heads-up: It's also happening on my side on Aleph 0.6.0... So most likely unrelated to 0.6.1 (and thus Dirigiste changes).

Ferdinand Beyer13:02:57

How did you reproduce it locally?

Arnaud Geiser13:02:04

It's a bit random... but I executed test-server-name 1000 times in ring_test.clj:

Arnaud Geiser13:02:10

(deftest test-server-name
  (dotimes [_ 1000]
    (with-server [:server-name]
      (is (.startsWith ^String (request) "localhost")))))

Arnaud Geiser13:02:23

(I just wanted to validate if it's something new or not)

Ferdinand Beyer13:02:24

Meanwhile I found out that the connection is supposed to be removed from the pool on close:

(create-connection
 host
 conn-options'
 middleware
 #(flow/dispose @p host [@c])) ;; <-- This is the on-close callback

🙌 2
Arnaud Geiser14:02:28

To make sure we are on the same page: your hypothesis is that the HTTP connection is closed (removed from the pool) and later reused when it should not be?

Ferdinand Beyer14:02:46

Yes, that’s right. I don’t know the internals, so I might be totally wrong.

Ferdinand Beyer14:02:12

Oh, you mentioned something about “inactive” channels, and I see that there is invokeChannelInactive in the stack trace, if it helps?

clojure.lang.ExceptionInfo: connection was closed after 1019.558 seconds
	at aleph.http.client$http_connection$fn__8848$fn__8869$fn__8873.invoke(client.clj:612)
	at manifold.deferred$fn__4795$chain_SINGLEQUOTE____4816.invoke(deferred.clj:748)
	at manifold.deferred$fn__4795$subscribe__4796$fn__4801.invoke(deferred.clj:714)
	at manifold.deferred.Listener.onSuccess(deferred.clj:218)
	at manifold.deferred.Deferred$fn__4647.invoke(deferred.clj:397)
	at manifold.deferred.Deferred.success(deferred.clj:397)
	at manifold.deferred$success_BANG_.invokeStatic(deferred.clj:242)
	at manifold.deferred$success_BANG_.invoke(deferred.clj:239)
	at manifold.stream.default.Stream$fn__5903.invoke(default.clj:124)
	at manifold.stream.default.Stream.close(default.clj:123)
	at aleph.http.client$client_handler$reify__8775.channelInactive(client.clj:210)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at io.netty.handler.stream.ChunkedWriteHandler.channelInactive(ChunkedWriteHandler.java:137)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelInactive(CombinedChannelDuplexHandler.java:418)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)
	at io.netty.handler.codec.http.HttpClientCodec$Decoder.channelInactive(HttpClientCodec.java:329)
	at io.netty.channel.CombinedChannelDuplexHandler.channelInactive(CombinedChannelDuplexHandler.java:221)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376)
	at io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1075)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:301)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at manifold.executor$thread_factory$reify__4155$f__4156.invoke(executor.clj:70)
	at clojure.lang.AFn.run(AFn.java:22)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)

Arnaud Geiser14:02:02

> clojure.lang.ExceptionInfo: connection was closed after 1019.558 seconds
We talked about the client, but are you also targeting an Aleph server?

Ferdinand Beyer14:02:25

Nope, the server is a third party

Arnaud Geiser14:02:46

Could it be the third party that closes the connection instead?

Ferdinand Beyer14:02:41

Sure, that was my assumption. That Aleph tries to keep a connection open and the server closes it eventually

Ferdinand Beyer14:02:56

Or do you mean mid-response?

Arnaud Geiser14:02:08

(Because what I'm investigating is not the same symptom as the one you sent me (the stack trace).)

Arnaud Geiser14:02:23

> Sure, that was my assumption. That Aleph tries to keep a connection open and the server closes it eventually
Yes, that's what I think too

Ferdinand Beyer14:02:05

That’s why I asked about the idle timeout. As a proxy, it might make sense to reuse a connection if I get multiple requests myself at the same time, but after a short idle time it does not make sense to assume the connection is still open for the next request

Ferdinand Beyer14:02:17

Maybe some kind of race condition? The connection is closed -> it is removed from the pool, but the next request has already acquired the connection-fn from the pool

Arnaud Geiser14:02:35

But here we are talking about a single request that took 17 minutes to be completed, no? https://github.com/clj-commons/aleph/blob/master/src/aleph/http/client.clj#L596-L613

Ferdinand Beyer14:02:19

No, the request failed after 0.2s. The time t0 here is the time since the connection was created, right?

Arnaud Geiser14:02:16

Yes, correct, sorry!

Arnaud Geiser14:02:39

Anyway, it's a good lead. A connection that has been closed went back to the pool when it should not have. AFAIK, only flow/release inside Aleph will move a connection back to the pool. There is something I don't get for now...

Ferdinand Beyer14:02:13

Well, I think the connection was OK and was sent back to the pool, but it can then be closed by the server while it is in the pool, right? Theoretically it should then get disposed…

Arnaud Geiser14:02:30

Aaarrf.. that would be terribly annoying..

Arnaud Geiser14:02:21

> Theoretically it should then get disposed…
But it means we'll only know when we try to fetch a connection back from the pool?

Ferdinand Beyer14:02:08

Not sure how this Dirigiste thing works.

Arnaud Geiser14:02:45

It's really just "object pooling", nothing more. There is no janitor concept inside it.

Ferdinand Beyer15:02:17

Thinking about the stacktrace. It looks like the channel became inactive and this closed the stream, while we tried to get the connection, and this is all synchronous:

at aleph.http.client$http_connection$fn__8848$fn__8869$fn__8873.invoke(client.clj:612)
	at manifold.deferred$fn__4795$chain_SINGLEQUOTE____4816.invoke(deferred.clj:748)
	at manifold.deferred$fn__4795$subscribe__4796$fn__4801.invoke(deferred.clj:714)
	at manifold.deferred.Listener.onSuccess(deferred.clj:218)
	at manifold.deferred.Deferred$fn__4647.invoke(deferred.clj:397)
	at manifold.deferred.Deferred.success(deferred.clj:397)
	at manifold.deferred$success_BANG_.invokeStatic(deferred.clj:242)
	at manifold.deferred$success_BANG_.invoke(deferred.clj:239)
	at manifold.stream.default.Stream$fn__5903.invoke(default.clj:124)
	at manifold.stream.default.Stream.close(default.clj:123)
	at aleph.http.client$client_handler$reify__8775.channelInactive(client.clj:210)
If the connection got closed before request is called, I’d expect the stream to already be closed, and the deferred from (s/take! responses ::closed) would be a new one, so that the stack trace should not contain aleph.http.client$client_handler$reify__8775.channelInactive

Ferdinand Beyer15:02:13

Not sure how Netty works; maybe we only observe that a connection is closed when trying to send a request?

Arnaud Geiser15:02:20

> would be a new one, so that the stack trace should not contain aleph.http.client$client_handler$reify__8775.channelInactive
Agreed. I was investigating this behavior (I had no channelInactive in my stack trace).

Arnaud Geiser15:02:31

> Not sure how netty works, maybe we only observe a connection to be closed when trying to send a request?
Unfortunately, I don't know. I'm trying to put some logging before sending the request to check for a ".isClosed".
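For what it's worth, Netty's Channel has no .isClosed method; a small sketch of the kind of logging check being described, assuming ch is the io.netty.channel.Channel backing the pooled connection:

(import '(io.netty.channel Channel))

;; Netty exposes isOpen/isActive/isWritable rather than isClosed
(defn log-channel-state [^Channel ch]
  (println "channel open?" (.isOpen ch)
           "active?"       (.isActive ch)
           "writable?"     (.isWritable ch)))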

Arnaud Geiser15:02:10

(I will come later this evening (CET), I won't forget this issue)

👍 2
Arnaud Geiser16:02:38

I won't draw any conclusions yet... but both in your case and mine, this is what triggers the "ChannelInactive": https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/AbstractChannel.java#L795-L827

Ferdinand Beyer16:02:53

I noticed something else: I only got this error in one handler, which provides a download of a 50 MB file, streaming the upstream input stream downstream to the client. I wonder if the connection is returned before this download completes, or if the connection remains in a weird state if the response is not consumed completely?
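A rough sketch of that proxy-download shape, just to make the scenario concrete (hypothetical handler and URL; `pool` is a client connection pool as in the earlier sketch). The upstream response body is handed straight to the downstream response, so the pooled connection is only finished with once the whole body has been consumed:

(require '[aleph.http :as http]
         '[manifold.deferred :as d])

;; hypothetical proxy handler: fetch a large file upstream and stream its
;; body through to the downstream client
(defn download-handler [_req]
  (d/chain'
    (http/get "http://upstream.example/big-file" {:pool pool})
    (fn [upstream-rsp]
      {:status  200
       :headers {"content-type" "application/octet-stream"}
       :body    (:body upstream-rsp)})))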

Arnaud Geiser17:02:37

Maybe.. To be completely honest, I'm having a hard time understanding how this responses stream is supposed to work. [1] How do we even guarantee that the response we got from the stream is related to the req we wanted to send? What guarantees the ordering? [1] : https://github.com/clj-commons/aleph/blob/master/src/aleph/http/client.clj#L603

Matthew Davidson (kingmob)18:02:41

> How do we even guarantee that the response we got from the stream is related to the req we wanted to send? What guarantees the ordering?
Because of the request-response nature of HTTP, remember. The requestor waits for a response before requesting again. The keep-alive header is about the TCP connection, not HTTP itself. Admittedly, it gets more convoluted once we add HTTP pipelining, which enables a requestor to send multiple requests before getting a response. This is where content-length and chunked transfer encoding come in. On the server side, if you use HTTP/1.1 pipelining and a later handler finishes before an earlier one, it will be chained to wait until the previous one's deferred is finished, so there are no issues there.

👍 2
Matthew Davidson (kingmob)18:02:15

What evidence do we currently have that this is related to dirigiste? Because I suspect it may be a race elsewhere.

Arnaud Geiser18:02:58

No evidence. Dirigiste was just an easy target; I don't think it is anymore. Maybe a destroy that got called while scaling down a Pool. To me, Dirigiste is no longer a good lead to explore.

Matthew Davidson (kingmob)18:02:46

In http-connection, the responses stream is passed to the client handler, which sets up a callback for channelInactive events to close the stream. And on incoming requests, the request fn grabs the channel's monitor before putting requests on the requests stream and taking the response. Other places aren't locking on the channel, but even if they were, I'm not sure netty itself is using that monitor, so a channel might be closed even if the monitor is held. E.g.:

rsp (locking ch
      (s/put! requests req)
      (s/take! responses ::closed))]

(-> (netty/safe-execute ch
      (http/send-message ch true ssl? req' body))
    (d/catch' (fn [e]
                (s/put! responses (d/error-deferred e))
                (netty/close ch))))))

;; this will usually happen because of a malformed request
(catch Throwable e
  (s/put! responses (d/error-deferred e))
  (netty/close ch))))
Both of the catch examples above could be a race if the put! actually runs AFTER the netty/close -> channelInactive handler closes the responses stream. The solution is to either lock on the channel, or not close it until after the error-deferred has been put on the responses stream. However, I don't know how that will interact with Netty; I have to check...
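One way to read the second option, as a sketch only (not a confirmed fix), reusing the names from the excerpt above: chain the close on the put! deferred so the error is on responses before the channel can go inactive.

(-> (netty/safe-execute ch
      (http/send-message ch true ssl? req' body))
    (d/catch' (fn [e]
                ;; put the error on `responses` first, then close the channel
                ;; only once the put! deferred has resolved
                (-> (s/put! responses (d/error-deferred e))
                    (d/chain' (fn [_] (netty/close ch)))))))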

Matthew Davidson (kingmob)18:02:27

It's really late here, I have to call it a night...

👍 2
Arnaud Geiser18:02:58

If it proves to be the root cause, we can definitely remove those netty/close calls. Aleph lived with that for a long time.

Matthew Davidson (kingmob)01:02:30

I didn’t get much sleep the night before, so I think I was concerned about the wrong thing. The “race” I was worried about shouldn’t be a concern, or at least not related to this issue. The locking call may need a closer look, though.

Matthew Davidson (kingmob)04:02:26

Hmm, if we go to the commit before PR #592, does that solve the problem?

Arnaud Geiser10:02:24

(I haven't tried yet, but for me, that's not the culprit.) My current hypothesis is that the connection got closed asynchronously (for whatever reason, which might not even be an Exception) [1]. And later on, when the connection is reused, the responses stream is already closed. [1]: https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/AbstractChannel.java#L795-L827

Matthew Davidson (kingmob)07:02:20

(Minor addendum about HTTP/1.1 pipelining: we support it, but apparently almost no browsers do anymore; too problematic)

Ferdinand Beyer11:01:28

Totally forgot about this. Back then, I added an ugly workaround to our code that checks whether an exception message starts with "connection was closed" and, if so, simply retries the request. This seemed to work in most cases, but today I encountered the same issue again. I think my workaround did not catch it because I retry only once. Did either of you by any chance follow up on this?
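For context, a rough sketch of that kind of workaround (hypothetical helper, not a recommended pattern; matching on the exception message is brittle by design):

(require '[aleph.http :as http]
         '[manifold.deferred :as d])

;; retry http/request up to `attempts` times when the failure looks like a
;; stale pooled connection ("connection was closed after N seconds")
(defn request-with-retry
  ([req] (request-with-retry req 2))
  ([req attempts]
   (d/catch'
     (http/request req)
     (fn [e]
       (if (and (> attempts 1)
                (some-> (ex-message e) (.startsWith "connection was closed")))
         (request-with-retry req (dec attempts))
         (d/error-deferred e))))))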

Arnaud Geiser14:01:50

I haven't followed up on this, since it's clearly a behaviour Aleph's client has. There is nothing that guarantees the connection pulled from the pool is still active. What you have implemented (while ugly) makes sense in the current state. I would be keen on having this behaviour supported out of the box inside Aleph's client. What is your current position on this @U10EC98F5?

:gratitude-thank-you: 1
Matthew Davidson (kingmob)16:01:35

Hmm, it definitely seems like a bug, albeit a rare one. Invalid conns are supposed to be disposed of, not released back to the pool. And if the other end/Netty kills it, it should be disposed of properly. I'd prefer to figure out the right combination of locks and/or event callbacks to ensure that objs acquired from the pool are valid, and not check exception error msgs. Unfortunately, @U922FGW59, I have to focus on finding paying work for now, so it might be a while before I take a closer look.

Ferdinand Beyer16:01:40

That’s OK, same here! Thanks!

Arnaud Geiser19:01:44

> Invalid conns are supposed to be disposed of, not released back to the pool. And if the other end/Netty kills it, it should be disposed of properly.
Yes, and that scenario definitely works without any bug I'm aware of. What about a connection that is closed while being in the pool? Do we have a mechanism that is going to remove that invalid connection from the pool? Any pointer in the code so I can have a look?

Matthew Davidson (kingmob)06:01:32

If you look in aleph.http/request, lines 385-396, that's probably a good place to add a check. By then, it's successfully obtained a connection fn, but hasn't called it yet. We would have to obtain the underlying Netty chan to check. Although... one thing that's confusing is that in aleph.http.client/http-req-fn, lines 909-910, it checks if the chan is active and open, so maybe not. I still suspect it's a race related to the long-lived nature of the conns here. We do the check when we first make the request, but we don't really have any way to notify others holding the conn that it's been closed.
> What about a connection that is closed while being in the pool?
That should be handled by the call to create-connection, which passes in #(flow/dispose @p host [@c]) for the on-closed parameter.
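A hedged sketch of the kind of validity check being discussed around the acquired connection; acquire-conn, conn->channel and dispose-conn are placeholders for illustration, not Aleph functions:

(require '[manifold.deferred :as d])
(import '(io.netty.channel Channel))

;; keep acquiring until the pooled connection's Netty channel is still live,
;; disposing of any stale connections encountered along the way
(defn acquire-live-connection
  [acquire-conn conn->channel dispose-conn]
  (d/loop []
    (d/chain' (acquire-conn)
      (fn [conn]
        (let [^Channel ch (conn->channel conn)]
          (if (and ch (.isOpen ch) (.isActive ch))
            conn
            (do (dispose-conn conn)
                (d/recur))))))))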