#aws
2020-03-25
kenny18:03:17

@dchelimsky The deadlock issue reported in https://github.com/cognitect-labs/aws-api/issues/130 was fixed, but I think I'm still seeing something related. We have a SQS message processing service that runs 10 processing threads. All 10 of those threads are currently deadlocked on calls to aws-api. Is there a potential issue with the fix applied for #130 when running lots of requests in parallel? I have a thread dump as well; 4 cognitect-aws-send-* threads are paused on the refresh! call at https://github.com/cognitect-labs/aws-api/blob/0dc6c388d45a7ebe09a3f147481c61aeaa6e1846/src/cognitect/aws/credentials.clj#L96.

dchelimsky18:03:08

@U083D6HK9 there's always potential! I can't look at this until Friday, but please feel free to post more information here as you learn more and/or submit an issue.

kenny19:03:55

Pasted the cognitect-aws-send-* threads here

kenny19:03:36

All the aws-send threads have been waiting almost exactly the same amount of time. I'll bet a repro would be running 3 assume role calls in parallel.

kenny19:03:20

I could forcibly serialize calls to fetch-creds for now.

kenny19:03:01

The repro would probably need to be 3 assume role calls with different CredentialsProvider objects, since the code at https://github.com/cognitect-labs/aws-api/blob/aff9028757189f9875e8b3ebed18a8f07b624327/src/cognitect/aws/util.clj#L320 serializes calls for the same creds provider.

kenny19:03:41

And yes, that could happen in our code. We're making calls to our customers' AWS accounts. Parallel calls to multiple AWS accounts using different creds providers could easily happen.

kenny19:03:45

Got a repro 🙂

kenny20:03:00

I think the problem is the same and the fix only made it less likely. This is what is happening in each thread:

1. invoke DescribeInstances
2. fetch-creds (AssumeRole)
3. invoke STS AssumeRole
4. fetch-creds (implicit)

The fetch-creds calls go through the async-fetch-pool, which has size 4. If you launch N threads to run the above flow, where N >= the async-fetch-pool size, you will get a deadlock: every pool thread is occupied by an outer call, while the inner fetch-creds call it is waiting on sits in the async-fetch-pool's queue.
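To make the mechanism concrete, here is a minimal sketch of that deadlock pattern using a plain fixed-size executor (illustrative only; the pool and task names are stand-ins, not aws-api internals):

(import '(java.util.concurrent Executors Future))

;; Stand-in for the async-fetch-pool: 4 threads, unbounded queue.
(def pool (Executors/newFixedThreadPool 4))

(defn outer-task []
  ;; Each outer task blocks on a nested task submitted to the same
  ;; pool, holding its pool thread while it waits.
  (let [inner (.submit pool ^Callable (fn [] :creds))]
    (.get ^Future inner)))

;; With 4 outer tasks the pool is saturated; the 4 inner tasks sit in
;; the queue behind them, so every .get blocks forever -- a deadlock.
(dotimes [_ 4]
  (.submit pool ^Callable outer-task))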

kenny23:03:20

I think the best solution for us near-term is to fork aws-api and set the async-fetch-pool size to 10. I'm sure there is a better, more general solution but we need something to work for the time being.
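For anyone following along, the forked change is small; assuming the pool is an ordinary fixed-size executor (the name below mirrors the discussion; the actual def lives in cognitect.aws.util), it amounts to something like:

(import '(java.util.concurrent Executors))

;; Grow the fixed pool from 4 to 10 so nested fetch-creds calls have
;; headroom. A sketch of the one-line fork, not the upstream source.
(def async-fetch-pool (Executors/newFixedThreadPool 10))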

dchelimsky00:03:40

@U083D6HK9 seems like a reasonable near term approach.

kenny00:03:53

When you have free time, I'm curious about your thoughts on the real solution.

kenny00:03:39

Implicit creds fetch to a separate pool?

kenny00:03:09

Separate pool for creds fetch is a half solution too. I suppose you could have arbitrarily nested cred fetching that needs to run.

kenny01:03:11

Returning a channel would probably fix this entirely, but I’m assuming there’s a reason it’s run in a thread.

dchelimsky01:03:40

It does return a channel - we're just putting on that channel in a thread from that pool. Making that async/thread is my first instinct. So it's obviously wrong 😉 If you have an opinion about that, I'm all ears. Also want to hear from @U050ECB92.
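A rough sketch of the two shapes under discussion (fetch-async-pooled and fetch-async-threaded are names assumed for illustration, not aws-api source; cognitect.aws.credentials/fetch is the real protocol fn):

(require '[clojure.core.async :as a]
         '[cognitect.aws.credentials :as credentials])
(import '(java.util.concurrent ExecutorService))

;; Current shape (assumed): deliver the result on a fixed-size pool.
(defn fetch-async-pooled [^ExecutorService pool provider]
  (let [ch (a/promise-chan)]
    (.submit pool ^Runnable
             (fn []
               (when-some [creds (credentials/fetch provider)]
                 (a/put! ch creds))
               (a/close! ch)))
    ch))

;; Proposed shape: a/thread runs the body on core.async's cached pool
;; and returns a channel that receives the result, so nested fetches
;; cannot starve a fixed-size queue.
(defn fetch-async-threaded [provider]
  (a/thread (credentials/fetch provider)))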

kenny15:03:04

I deployed the workaround with async-fetch-pool set to 10 yesterday evening. The service has been healthy since. For those looking for a workaround, I can say with high certainty that this will fix it. Wrapping in async/thread, as @dchelimsky suggests, sounds like another potential solution. I do believe that could result in an unbounded number of threads being created, though in this case the number should always be small, I think.

dchelimsky16:03:48

It's bound by core.async's threadpool size, which defaults to 8, but you can configure it via the clojure.core.async.pool-size system property: https://github.com/clojure/core.async/blob/1c5efa72ebf65cd23315311d2bc3d890b780696e/src/main/clojure/clojure/core/async/impl/exec/threadpool.clj
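For reference, that property only takes effect if it is set before core.async is loaded; a sketch:

;; Typically passed as a JVM option: -Dclojure.core.async.pool-size=16
;; Setting it programmatically only works before core.async is required.
(System/setProperty "clojure.core.async.pool-size" "16")
(require '[clojure.core.async :as a])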

kenny16:03:21

I don't think async/thread is bound by that. That's for the go blocks.

kenny16:03:00

The maximumPoolSize for core.async's call to newCachedThreadPool is Integer.MAX_VALUE.
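That is standard java.util.concurrent behavior, easy to confirm at a REPL:

(import '(java.util.concurrent Executors ThreadPoolExecutor))

;; newCachedThreadPool: 0 core threads, Integer/MAX_VALUE max threads,
;; idle threads reclaimed after 60 seconds.
(let [^ThreadPoolExecutor pool (Executors/newCachedThreadPool)]
  (.getMaximumPoolSize pool))
;;=> 2147483647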

dchelimsky17:03:50

So it is bounded 😉. But yeah, we'll need to constrain that a bit more.

kenny18:03:16

I think CredentialsProvider fetch needs to return a channel. That would move the problem to the user.

kenny18:03:05

Also, I think that is effectively the same as unbounded. I'm fairly certain bad things will happen well before you approach the max int value.

dchelimsky21:03:45

Oh yeah, I was making a joke about it being bounded.

kenny21:03:22

Ohh, got it 🙂

dchelimsky21:03:28

re: returning a channel from fetch, that would be a breaking change so we won't do that, but we could add fetch-async to the CredentialsProvider and return a channel from that (and use that in our implementations).
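A hedged sketch of what that additive change could look like (protocol and function names assumed for illustration; not the eventual aws-api API):

(require '[clojure.core.async :as a])

(defprotocol CredentialsProvider
  (fetch [provider]
    "Return the credentials found by this provider, or nil."))

;; Hypothetical additive protocol: existing providers keep working,
;; new ones can opt in to returning a channel directly.
(defprotocol AsyncCredentialsProvider
  (fetch-async [provider]
    "Return a channel that will receive credentials, or close empty."))

;; Assumed bridge: prefer the async arity when implemented, otherwise
;; wrap the blocking fetch in a/thread.
(defn fetch-async* [provider]
  (if (satisfies? AsyncCredentialsProvider provider)
    (fetch-async provider)
    (a/thread (fetch provider))))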

kenny21:03:43

It would indeed. Deprecating the current method seems like a good idea since it can result in a deadlock. Deadlocks are one of the worst problems to hit.

dchelimsky21:03:54

I'll have some time to look at this and talk some things over with other contributors tomorrow so we should have at least a direction. I'll follow up on the issue at that point.

dchelimsky17:03:24

@U083D6HK9: @U050ECB92 and I reviewed this together and think the simplest solution is https://github.com/cognitect-labs/aws-api/commit/d1bfbef63e3d17d6f66fb88d38daf500ced27e9a

dchelimsky17:03:39

Yes, that does use core.async's thread pool, which is effectively unbounded, but threads are reclaimed after 60 seconds, and we think the risk of this causing any real problems is very low. If you disagree, please explain.

dchelimsky17:03:48

re: making fetch async, it already is, effectively. Even though a user implements fetch, it is only ever called via fetch-async. Make sense?

kenny17:03:11

I think this will probably work for pretty much all use cases, and I don't have any arguments against using an unbounded pool at this time. Although, I would much prefer these threads to be named something more relevant, to make for clearer thread dumps. Right, it is async, but it is managed implicitly -- the user cannot manage the threads used to fetch credentials if desired. Returning a channel gives the api consumer control over how the credential fetching manages threads. Perhaps no one would ever want that level of control. Not sure.

kenny17:03:40

Tangentially, could we remove or change the sentence "See log for more details." in the anomaly message at https://github.com/cognitect-labs/aws-api/blob/d1bfbef63e3d17d6f66fb88d38daf500ced27e9a/src/cognitect/aws/util.clj#L310?

{:cognitect.anomalies/category :cognitect.anomalies/fault
 :cognitect.anomalies/message (format "Unable to fetch %s. See log for more details." item)}
Although the fetch did fail, it may or may not have logged any information about the error. I suggest something like: "Failed fetching %s. Fetch result was nil."
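In map form, that suggestion (same shape as above, only the message changed) would be:

{:cognitect.anomalies/category :cognitect.anomalies/fault
 :cognitect.anomalies/message (format "Failed fetching %s. Fetch result was nil." item)}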

kenny18:03:32

For that matter, why does fetch not support returning an anomaly, in addition to nil, as a failure condition?

kenny18:03:18

It would seem the anomaly response from the AssumeRole call would provide far better information than the current anomaly returned.

dchelimsky18:03:10

My guess is that was an effort to keep the implementation of https://github.com/cognitect-labs/aws-api/blob/d1bfbef63e3d17d6f66fb88d38daf500ced27e9a/src/cognitect/aws/credentials.clj#L132 simple, but I agree that anomalies would be better. The problem right now is fetch has "Return the credentials found by this provider, or nil." in the docstring, and changing implementations to return an anomaly instead risks breaking custom implementations that call fetch directly.

dchelimsky18:03:49

That said, an anomaly would definitely give us more control and more information!

kenny18:03:02

Right. I'm thinking we may even want this in the near future, to inform a user of our product that we are unable to authenticate using an IAM role they provided, and to give them the AWS-returned error message. Actually, I'm pretty sure returning an anomaly as-is would already result in an anomaly getting returned, haha

kenny18:03:43

... result in the anomaly you returned getting returned to the caller. It just happens as an implementation detail though.
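A sketch of the pass-through being described (a hypothetical custom provider; this leans on the implementation detail discussed above, not a documented contract):

(require '[cognitect.aws.credentials :as credentials])

;; Hypothetical provider whose fetch returns an anomaly map instead of
;; credentials. Per the discussion, that map flows back to the invoke
;; caller today, but only as an implementation detail.
(def failing-provider
  (reify credentials/CredentialsProvider
    (fetch [_]
      {:cognitect.anomalies/category :cognitect.anomalies/incorrect
       :cognitect.anomalies/message "AssumeRole failed: access denied"})))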

dchelimsky18:03:08

Within the aws-api implementation, yes. And we could always update fetch-async to decorate an anomaly with more info.

dchelimsky18:03:04

My concern is for the end user who wrote anything custom that dispatches on nil coming back directly from fetch.

dchelimsky18:03:21

The likelihood of that is probably very low, but I can't be sure.

kenny18:03:01

Oh, I see. Oof.

kenny18:03:06

CredentialsProvider2? 🙂

dchelimsky18:03:43

Though that needs to be reviewed through a more holistic lens.

kenny18:03:54

As in other things that could be included in this addition?