java

pithyless 2021-07-06T10:43:53.016600Z

I'm trying to debug and reproduce a situation I'm seeing during testing, where a single node spikes to 100% CPU. I see something like this in its thread dump (it does not show up on the other nodes). Is this a red herring, or could this be the root cause? And if so, any idea how it reached this state?🧵

pithyless 2021-07-07T13:51:46.019Z

Thanks for the responses! I "fixed" the BLOCKED issue by replacing SimpleDateFormat with DateTimeFormatter. If this turns out to be a good change in testing, I'll submit a PR to aleph for consideration. But this doesn't fix the CPU spinlock, whose cause I still have not identified. Nothing more to do but keep digging and eliminating variables (it takes many hours to force the issue, but it does seem to be reproducible). 🤞
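For anyone following along, here is a minimal sketch of what such a swap might look like. It is not the actual aleph change: the class name `HttpDates`, the pattern string, and the choice of `Locale.US` are assumptions made for illustration. The point is that `DateTimeFormatter` is immutable and thread-safe, so a single shared instance can be used from every event-loop thread.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not aleph's code.
final class HttpDates {
    // Immutable and thread-safe, so one shared instance is enough.
    // Pattern assumed here: the fixed-width HTTP date form, always in GMT.
    static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    static String format(Instant instant) {
        return HTTP_DATE.format(instant);
    }

    public static void main(String[] args) {
        System.out.println(format(Instant.now()));
    }
}
```

With this, `HttpDates.format(Instant.EPOCH)` gives `Thu, 01 Jan 1970 00:00:00 GMT`.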

2021-07-07T15:39:35.019200Z

My guess would be that your code is spinning up threads in an uncontrolled manner (without limiting the number of threads or waiting for them to complete). That causes memory pressure, which makes the GC run all the time, and that is the source of your high load.
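One way to check this guess from inside the JVM is to sample the standard management beans. The sketch below is only an illustration: the class name `GcLoad` and the 5-second sampling window are arbitrary choices, and it assumes you can run code in (or attach to) the affected process.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical diagnostic helper.
final class GcLoad {
    // Sum of the accumulated collection time (ms) reported by all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Current number of live threads, daemon and non-daemon.
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalGcMillis();
        Thread.sleep(5_000); // arbitrary sampling window
        long after = totalGcMillis();
        System.out.printf("GC ms in last 5s: %d, live threads: %d%n",
                          after - before, liveThreads());
    }
}
```

If the GC time per window stays close to the window length while the thread count keeps climbing, that would support the theory; if both stay low, the CPU is going somewhere else.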

pithyless 2021-07-06T10:45:19.017100Z

Notice the SingleThreadEventExecutor and the thread is BLOCKED on a lock.

pithyless 2021-07-06T10:45:45.017300Z

There are 2 more threads that are also BLOCKED waiting for the lock release.

pithyless 2021-07-06T10:47:47.017500Z

Any idea what could cause a lockup on java.util.ListResourceBundle? Is this filesystem corruption (this is running in Docker and k8s)? A memory issue? (I was initially debugging a high thread count, so could this be a symptom of not having enough memory to allocate new threads?)

pithyless 2021-07-06T10:56:25.017700Z

Correcting screenshot:

jumar 2021-07-06T20:36:25.018400Z

What about the other threads? I wouldn't expect threads that are BLOCKED to consume a lot of CPU. If you didn't have enough memory to allocate new threads I would expect to see OutOfMemoryError.
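To see which threads are actually burning CPU, per-thread CPU time can be read through `ThreadMXBean`. This is a rough sketch with a made-up class name (`HotThreads`); it assumes the JVM supports and has enabled thread CPU time measurement, and returns an empty list otherwise.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical diagnostic helper.
final class HotThreads {
    // Returns "name [STATE]" -> accumulated CPU nanoseconds, highest first.
    static List<Map.Entry<String, Long>> byCpuTime() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return out;
        }
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id);   // -1 if the thread has died
            ThreadInfo info = mx.getThreadInfo(id); // null if the thread has died
            if (cpu < 0 || info == null) {
                continue;
            }
            String label = info.getThreadName() + " [" + info.getThreadState() + "]";
            out.add(new AbstractMap.SimpleEntry<>(label, cpu));
        }
        out.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> hot = byCpuTime();
        for (int i = 0; i < Math.min(10, hot.size()); i++) {
            System.out.printf("%12d ns  %s%n", hot.get(i).getValue(), hot.get(i).getKey());
        }
    }
}
```

The values are cumulative since each thread started, so taking two samples a few seconds apart and comparing them shows what is busy right now.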

2021-07-06T22:26:10.018600Z

Yes, that thread is likely blocked there for only a very short time, but it happens a lot: every HTTP request needs to parse the date string, and that needs some locale information. That code is loading the locale information, and there is a lock somewhere to make sure two threads don't race loading it.

2021-07-06T22:28:30.018800Z

You can see the method here: https://code.yawk.at/java/14/java.base/java/util/ListResourceBundle.java#191 It is synchronized, which is the lock being held, but it returns immediately if lookup is not null.
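The pattern being described looks roughly like the sketch below. This is a simplified illustration, not the JDK source: the class `LazyLookup`, the `loads` counter, and the sample key are made up. A thread that arrives while another one holds the monitor shows up as BLOCKED in a thread dump, but only until the first load finishes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a lazily loaded resource table.
final class LazyLookup {
    private volatile Map<String, Object> lookup; // null until the first load
    int loads = 0;                               // counts real loads, for illustration

    Object get(String key) {
        if (lookup == null) {
            loadLookup(); // only taken while the table is not loaded yet
        }
        return lookup.get(key);
    }

    // Synchronized so two threads cannot build the table at the same time.
    private synchronized void loadLookup() {
        if (lookup != null) {
            return; // another thread finished loading while we waited for the lock
        }
        Map<String, Object> m = new HashMap<>();
        m.put("month.1", "Jan"); // made-up sample entry
        loads++;
        lookup = m;
    }

    public static void main(String[] args) {
        LazyLookup l = new LazyLookup();
        System.out.println(l.get("month.1") + " after " + l.loads + " load(s)");
    }
}
```

Once the table is loaded, the lock is cheap and short-lived, which is why a BLOCKED thread here is unlikely to explain a sustained 100% CPU spike by itself.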