#cljdoc
2023-09-13
Ben Wadsworth17:09:08

http://cljdoc.org has been down for a bit today (along with other websites). Not sure if anything needs to happen other than wait but thought I'd post here 🙂

seancorfield17:09:18

Ah, I was just coming here to mention that http://cljdoc.org seems to be down.

lread18:09:31

hmm… I’ll take a peek

lread18:09:00

Rebooted and it is back up and running, thanks for letting me know!

lread23:09:35

What? No hugs?

💜 2
lisphug 4
3
seancorfield00:09:44

(sorry, after posting that I had a meeting, then lunch, then a doctor's appt, then I worked on a bunch of HTML/JS all afternoon!)

lread01:09:13

Ha! Sorry, I was just being needy!

martinklepsch09:09:32

Thanks for jumping in with the restart Lee! 🙌

🙌 1
martinklepsch09:09:19

I had a quick look at "server vitals" for the last 24 hours and it's interesting that there was a significant spike in load before the server went offline...

martinklepsch10:09:46

So memory seems to be the culprit. Seeing lots of OOM errors in the logs, which is a bit surprising given the chart only shows 40-50% utilization. But I think it probably has to do with what Nomad allocates for the container.

lread11:09:14

Thanks for the charts @martinklepsch! On Sentry I do see a couple of OOM exceptions. A while ago, I added something that should dump the heap on OOM. I'll see if that actually worked.

lread18:09:06

I tweaked the heap dump. We should have a fresh sample next time this occurs.
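
For readers following along: on HotSpot JVMs, heap dumps on OOM are normally controlled by the `-XX:+HeapDumpOnOutOfMemoryError` and `-XX:HeapDumpPath=...` options. The thread does not say how cljdoc configures them, so the following is only a minimal sketch of one way to confirm from inside a running JVM what those options are set to. The class name `HeapDumpCheck` is hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper (not part of cljdoc): reads HotSpot VM options at runtime.
class HeapDumpCheck {
    // Returns the current value of a HotSpot VM option as a string.
    // Assumes a HotSpot-based JVM; other JVMs may not expose this bean.
    static String optionValue(String name) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        // "true" when the JVM was started with -XX:+HeapDumpOnOutOfMemoryError
        System.out.println("HeapDumpOnOutOfMemoryError = "
            + optionValue("HeapDumpOnOutOfMemoryError"));
        // Empty when no -XX:HeapDumpPath was given
        System.out.println("HeapDumpPath = " + optionValue("HeapDumpPath"));
    }
}
```

Printing these values once at startup would make it easy to see in the logs whether the next OOM is expected to leave a dump behind, and where.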

lread18:09:03

I guess it is questionable for cljdoc to try to continue after an OOM. But I guess it does right now.

martinklepsch11:09:42

yeah also feels like maybe it should restart? I've also occasionally gotten notices that it's down but it became available a few minutes later... maybe it does restart in some scenarios?
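
One common way to get "restart after OOM" behaviour is to make the JVM exit so that whatever supervises the process can start a fresh one. HotSpot has a `-XX:+ExitOnOutOfMemoryError` option for this. As an in-process alternative, here is a minimal sketch of an uncaught-exception handler that halts the JVM when an `OutOfMemoryError` reaches the top of a thread. This is not what cljdoc does today as far as the thread indicates; `OomExit` and its methods are hypothetical names.

```java
// Hypothetical helper (not part of cljdoc): stop the JVM on an uncaught OOM.
class OomExit {
    // True if t, or one of its causes, is an OutOfMemoryError.
    // The depth bound guards against unusual cyclic cause chains.
    static boolean isOom(Throwable t) {
        int depth = 0;
        for (Throwable c = t; c != null && depth < 32; c = c.getCause(), depth++) {
            if (c instanceof OutOfMemoryError) {
                return true;
            }
        }
        return false;
    }

    // Installs a process-wide handler for exceptions no thread caught.
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (isOom(error)) {
                // halt() skips shutdown hooks, which may themselves need memory.
                Runtime.getRuntime().halt(1);
            }
            // Anything else is only reported, roughly as the default handler would.
            System.err.println("Uncaught exception in thread " + thread.getName());
            error.printStackTrace();
        });
    }

    public static void main(String[] args) {
        install();
        System.out.println("OOM exit handler installed");
    }
}
```

Note the limitation: the handler only sees errors that propagate out of a thread uncaught. An OOM that is caught and logged somewhere deeper in the application would not trigger it, which is one reason the JVM option may be the simpler choice.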

lread12:09:40

hmmm… I am not aware of any auto restarts, but also don’t fully understand ops

lread13:09:10

I also rarely see OOM exceptions on http://Sentry.io. But maybe it is not capturing all of them?

lread20:10:44

We got another OOM that blew away the server yesterday. I rebooted cljdoc. (The down alert forwarding worked, @martinklepsch!) My change to save the heap dump seems to have worked, so I might take a peek at it to see if it gives any obvious clues. @martinklepsch, I don't have access to stats on cljdoc JVM heap usage. Do you see any evidence of a memory leak in heap usage graphs over time?

martinklepsch12:10:58

I have this chart from DigitalOcean but it sounds like you're maybe looking for something at the JVM level? (this is whole-system)

lread15:10:43

Thanks! Yeah, I don't think we capture heap usage at the JVM level yet, do we?
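
For context, the JVM exposes its own heap numbers through the standard `java.lang.management` API, which is separate from the whole-system memory shown on a host-level chart. Below is a minimal sketch of formatting a heap snapshot for a log line. How cljdoc would actually collect or ship such metrics is not decided in this thread, and `HeapUsageLog` is a hypothetical name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.Locale;

// Hypothetical helper (not part of cljdoc): formats JVM heap usage for logging.
class HeapUsageLog {
    // Snapshot of the current JVM heap (not whole-system memory).
    static MemoryUsage current() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    // Formats a snapshot as a single log-friendly line, sizes in MiB.
    static String describe(MemoryUsage u) {
        long max = u.getMax(); // -1 when the maximum is undefined
        String maxText = max > 0 ? (max >> 20) + "MiB" : "undefined";
        String pct = max > 0
            ? String.format(Locale.ROOT, "%.1f%%", 100.0 * u.getUsed() / max)
            : "n/a";
        return String.format(Locale.ROOT,
            "heap used=%dMiB committed=%dMiB max=%s pct=%s",
            u.getUsed() >> 20, u.getCommitted() >> 20, maxText, pct);
    }

    public static void main(String[] args) {
        System.out.println(describe(current()));
    }
}
```

Logging a line like this on a fixed interval would give a heap-over-time series: a leak would tend to show up as the post-GC "used" floor creeping upward between restarts.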

lread15:10:35

I'll start logging every time I reboot the cljdoc server on #C0332239PMH so we can get an idea of how often this happens.