#datomic
2019-09-09
steveb8n07:09:34

Q: I have a Solo/Ion webapp which consistently dies after a small load. The only fix I’ve found is to terminate the EC2 instance and let a new one start. It’s not a memory leak; CloudWatch shows plenty of heap. The error in the logs just before locking up is “java.lang.OutOfMemoryError: unable to create new native thread”. Googling suggests needing access to the thread dump to dig deeper. I cannot reproduce this locally using low or high levels of load. Has anyone seen this? What techniques are available to reproduce, diagnose, and fix this?
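
A cheap first diagnostic for this class of error is to log the JVM’s live and peak thread counts over time, since “unable to create new native thread” usually means the OS, not the heap, has run out of room for threads. A minimal sketch, with placeholder names and an arbitrary interval (nothing here is taken from the app in question):

```clojure
;; Sketch: periodically log live/peak thread counts so a runaway pool shows up
;; in the logs before the OS refuses to create more native threads.
(ns thread-watch
  (:import (java.lang.management ManagementFactory)))

(defn thread-stats []
  (let [mx (ManagementFactory/getThreadMXBean)]
    {:live          (.getThreadCount mx)
     :peak          (.getPeakThreadCount mx)
     :daemon        (.getDaemonThreadCount mx)
     :total-started (.getTotalStartedThreadCount mx)}))

(defonce watcher
  ;; a single daemon thread; the interval and println target are placeholders
  (doto (Thread. (fn []
                   (loop []
                     (println "thread-stats" (thread-stats))
                     (Thread/sleep 60000)
                     (recur))))
    (.setDaemon true)
    (.setName "thread-watch")
    (.start)))
```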

steveb8n07:09:12

Also relevant is that it dies while idle (not being loaded by requests), so this would suggest some thread activity from housekeeping etc., although I have no facts to support this

marshall14:09:39

what version of datomic cloud?

steveb8n14:09:15

Compute stack is 8772

steveb8n14:09:09

Storage is 470 (not sure if this is important)

marshall14:09:20

i would first try upgrading to the latest (8794)

steveb8n14:09:38

ok. I’ll try that right away. thanks

steveb8n14:09:26

I presume you mean compute only upgrade?

steveb8n14:09:59

ok. now on 8794. I’ll load it with requests and will let it sit to see if it happens again. normally takes an hour or so of idle time. I’ll report back here either way

marshall14:09:37

:thumbsup: Also check your system logs when/if you see the behavior

marshall14:09:58

if possible, can you paste the full stack trace of the error you saw previously?

steveb8n14:09:20

{"Msg": "Uncaught Exception: unable to create new native thread",
 "Ex": {"Via": [{"Type": "java.lang.OutOfMemoryError",
                 "Message": "unable to create new native thread",
                 "At": ["java.lang.Thread", "start0", "Thread.java", -2]}],
        "Trace": [["java.lang.Thread", "start0", "Thread.java", -2],
                  ["java.lang.Thread", "start", "Thread.java", 717],
                  ["java.util.concurrent.ThreadPoolExecutor", "addWorker", "ThreadPoolExecutor.java", 957],
                  ["java.util.concurrent.ThreadPoolExecutor", "processWorkerExit", "ThreadPoolExecutor.java", 1025],
                  ["java.util.concurrent.ThreadPoolExecutor", "runWorker", "ThreadPoolExecutor.java", 1167],
                  ["java.util.concurrent.ThreadPoolExecutor$Worker", "run", "ThreadPoolExecutor.java", 624],
                  ["java.lang.Thread", "run", "Thread.java", 748]],
        "Cause": "unable to create new native thread"},
 "Type": "Alert",
 "Tid": 18,
 "Timestamp": 1567951027119}

steveb8n14:09:30

not sure what you mean by “system logs”

marshall14:09:40

cloudwatch logs

marshall14:09:48

for your datomic system

marshall14:09:07

where are you seeing that ^ error?

steveb8n14:09:12

ah ok. that’s where this stack trace is from

steveb8n14:09:44

all other events at that time look normal

steveb8n14:09:22

very unscientifically, it seems to tolerate clj-gatling load a bit better on this new version. I’ll have to wait now to see if it dies. Good timing as gotta cook dinner (NL time) but will check in later

marshall15:09:48

Does your ion webapp do any async work?

marshall15:09:58

anything that might be spawning threads?

steveb8n15:09:14

yes, it uses http-kit client in async mode to call an ECS service. async calls via pedestal interceptor/handlers. that said, this problem occurred before I was using async mode
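
For reference, the async call-out pattern being described usually looks something like the sketch below: http-kit’s callback arity hands the response off to a core.async channel instead of blocking a request thread. The namespace, URL, and timeout here are placeholders, not taken from this app:

```clojure
;; Sketch of an async HTTP call-out that returns a channel receiving the
;; response map (or an error map) exactly once.
(ns async-callout
  (:require [org.httpkit.client :as http]
            [clojure.core.async :as async]))

(defn call-ecs-service [url]
  (let [out (async/chan 1)]
    (http/get url
              {:timeout 5000} ; ms; avoid requests that never complete
              (fn [{:keys [error] :as resp}]
                (async/put! out (if error {:error error} resp))
                (async/close! out)))
    out))
```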

steveb8n16:09:18

also using jarohen/chime to periodically report metrics, i.e. cron-like. again, prior to using chime, this instability was present
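
One way a periodic scheduler like chime can contribute to thread growth is if schedules get re-created without closing the previous one. A rough sketch of holding a single schedule for the lifetime of the process, assuming the newer chime.core API (which may not match the version in use here); the metric function is a placeholder:

```clojure
;; Sketch: one schedule for the whole process, closed explicitly on shutdown,
;; rather than re-creating schedules and leaking their threads.
(ns metrics-cron
  (:require [chime.core :as chime])
  (:import (java.time Instant Duration)))

(defn report-metrics! [time]
  ;; placeholder for the real metric reporting
  (println "reporting metrics at" time))

(defonce metrics-schedule
  (chime/chime-at (chime/periodic-seq (Instant/now) (Duration/ofMinutes 1))
                  report-metrics!))

(defn stop-metrics! []
  (.close metrics-schedule))
```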

steveb8n16:09:52

full stack is lacinia-pedestal / pedestal / resolvers making http calls returning a core.async channel (to allow pedestal to park/async)

steveb8n16:09:38

most api endpoints are sync/blocking, but I suspect the http callouts, so I focus on those to reproduce the error.

steveb8n16:09:51

so far, no hang so will keep waiting on it

steveb8n16:09:45

prior to async http-kit, I was using blocking http-kit calls. I think that uses async machinery underneath, so there was probably async machinery in play when this originally manifested

marshall16:09:57

that would definitely be my suspicion for where to look; that error generally indicates that the process is creating unbounded numbers of threads and the OS is out of resources to allocate
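
If app code does turn out to be spawning its own threads, one common mitigation is to push that work through a small fixed pool so the thread count stays bounded no matter how many tasks arrive. A minimal, generic sketch (pool size and helper names are arbitrary):

```clojure
;; Sketch: a fixed-size pool bounds app-spawned work at 8 threads, instead of
;; creating a new thread (or an unbounded cached-pool worker) per task.
(ns bounded-work
  (:import (java.util.concurrent Executors ExecutorService TimeUnit)))

(defonce ^ExecutorService work-pool
  (Executors/newFixedThreadPool 8))

(defn submit-work! [f]
  (.submit work-pool ^Runnable f))

(defn shutdown-work-pool! []
  (.shutdown work-pool)
  (.awaitTermination work-pool 30 TimeUnit/SECONDS))
```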

marshall16:09:23

despite it being called a “memory” error, it is evidently more commonly a thread resource issue

steveb8n16:09:56

I suspect the same. I don’t have much experience in finding “captured” threads but I’ll start by using a profiler on my localhost and see if I can find anything
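
Short of a full profiler, one quick way to spot “captured” threads is to group the live threads by name and see which bucket keeps growing; a leaking pool usually shows up as a long run of similarly named workers. A small REPL sketch, where collapsing digits is just a guess at typical “pool-1-thread-42”-style naming:

```clojure
;; Sketch: count live threads grouped by name, with digits collapsed so all
;; members of one pool land in the same bucket.
(ns thread-census
  (:require [clojure.string :as str]))

(defn threads-by-name []
  (->> (.keySet (Thread/getAllStackTraces))
       (map #(.getName ^Thread %))
       (map #(str/replace % #"\d+" "N"))
       frequencies
       (sort-by val >)))

;; e.g. (take 10 (threads-by-name)) at the REPL
```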

Adrian Smith16:09:16

I'm trying to log into Datomic forum with email link login but I've not received any emails all afternoon, is this just me?