#java
2021-07-06
pithyless10:07:53

I'm trying to debug and reproduce a situation I'm seeing during testing, where a single node has a 100% CPU spike. I see something like this in the thread dump (which is not visible on other nodes). Is this a red herring or could this be the root cause? And if so, any idea how it reached this state?🧵

pithyless10:07:19

Notice the SingleThreadEventExecutor: the thread is BLOCKED on a lock.

pithyless10:07:45

There are 2 more threads that are also BLOCKED, waiting for the lock to be released.

pithyless10:07:47

Any idea what could cause a lockup on java.util.ListResourceBundle? Is this filesystem corruption (this is running in Docker and k8s)? A memory issue? (I was initially debugging a high thread count, so could this be a symptom of not having enough memory to allocate new threads?)

pithyless10:07:25

Correcting screenshot:

jumar20:07:25

What about the other threads? I wouldn't expect threads that are BLOCKED to consume a lot of CPU. If you didn't have enough memory to allocate new threads, I would expect to see an OutOfMemoryError.
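As a side note, one way to confirm which threads actually accumulate CPU time (rather than sitting BLOCKED) is to sample per-thread CPU time with the standard ThreadMXBean API. A minimal sketch; the output format and the decision to print every thread are just illustrative:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class CpuByThread {
    public static void main(String[] args) {
        ThreadMXBean tmx = ManagementFactory.getThreadMXBean();
        if (!tmx.isThreadCpuTimeSupported()) {
            System.out.println("Per-thread CPU time not supported on this JVM");
            return;
        }
        for (long id : tmx.getAllThreadIds()) {
            ThreadInfo info = tmx.getThreadInfo(id);
            long cpuNanos = tmx.getThreadCpuTime(id);
            if (info == null || cpuNanos < 0) continue; // thread may have exited
            // BLOCKED threads should show little CPU; a spike comes from RUNNABLE ones
            System.out.printf("%-40s %-12s %8d ms%n",
                    info.getThreadName(), info.getThreadState(), cpuNanos / 1_000_000);
        }
    }
}
```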

hiredman22:07:10

yes, that thread is likely blocked for a very short time there, but it happens a lot: every HTTP request needs to parse the date string, which requires some locale information, and that code loads the locale information under a lock so that two threads don't race loading it

hiredman22:07:30

you can see the method here: https://code.yawk.at/java/14/java.base/java/util/ListResourceBundle.java#191 It is synchronized, which is the lock being held, but it returns immediately if lookup is not null
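For illustration, the shape of that lazy-load pattern looks roughly like this. This is a simplified sketch of the ListResourceBundle-style lookup, not the actual JDK source; field and class names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch: the load method is synchronized, so concurrent callers
// briefly show up as BLOCKED on the bundle's monitor until the first caller
// finishes populating the map. After that, loadLookup() returns immediately.
public abstract class LazyBundleSketch {
    private volatile Map<String, Object> lookup; // built once, read many times

    protected abstract Object[][] getContents();

    public Object handleGetObject(String key) {
        if (lookup == null) {
            loadLookup();
        }
        return lookup.get(key);
    }

    private synchronized void loadLookup() {
        if (lookup != null) {
            return; // another thread already built it
        }
        Map<String, Object> temp = new HashMap<>();
        for (Object[] pair : getContents()) {
            temp.put((String) pair[0], pair[1]);
        }
        lookup = temp;
    }
}
```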

pithyless13:07:46

Thanks for the responses! I "fixed" the BLOCKED issue by replacing SimpleDateFormat with DateTimeFormatter - if this turns out to be a good change in testing, I'll submit a PR to aleph for consideration. But this doesn't fix the CPU spike, which I still have not identified. Nothing more to do but continue digging and removing variables (it takes many hours to force the issue, but it does seem to be reproducible). 🤞
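For reference, the kind of change involved: SimpleDateFormat is mutable and not thread-safe, so shared instances tend to be guarded by a lock or wrapped in a ThreadLocal, whereas DateTimeFormatter is immutable and can be shared freely. A minimal sketch, assuming RFC 1123 HTTP date headers; this is not the actual aleph patch:

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class HttpDates {
    // DateTimeFormatter is immutable and thread-safe: one shared constant is enough,
    // with no synchronization (the lock seen in the dump) and no per-thread copies.
    private static final DateTimeFormatter RFC_1123 = DateTimeFormatter.RFC_1123_DATE_TIME;

    public static ZonedDateTime parse(String headerValue) {
        return ZonedDateTime.parse(headerValue, RFC_1123);
    }

    public static String format(ZonedDateTime time) {
        return RFC_1123.format(time);
    }

    public static void main(String[] args) {
        ZonedDateTime t = parse("Tue, 06 Jul 2021 10:07:53 GMT");
        System.out.println(format(t));
    }
}
```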

hiredman15:07:35

My guess would be your code is spinning up threads in an uncontrolled manner (without limiting the number of threads or waiting for them to complete), which causes memory pressure, which causes the GC to run all the time, which is the source of your high load
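If that guess is right, the usual fix is to hand work to a bounded pool instead of spawning a thread per task. A minimal sketch; the pool size, queue depth, and rejection policy here are arbitrary choices for illustration:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedWork {
    public static void main(String[] args) throws InterruptedException {
        // Fixed pool with a bounded queue: when both fill up, the submitting
        // thread runs the task itself (CallerRunsPolicy), which throttles
        // producers instead of letting thread count and memory grow unbounded.
        ExecutorService pool = new ThreadPoolExecutor(
                8, 8,                       // fixed number of worker threads
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(1_000),
                new ThreadPoolExecutor.CallerRunsPolicy());

        for (int i = 0; i < 10_000; i++) {
            final int n = i;
            pool.submit(() -> doWork(n));
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    private static void doWork(int n) {
        // placeholder for the real task
    }
}
```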