http://cljdoc.org has been down for a bit today (along with other websites). Not sure if anything needs to happen other than wait but thought I'd post here 🙂
Ah, I was just coming here to mention that http://cljdoc.org seems to be down.
Hmm… I’ll take a peek
Rebooted and it is back up and running, thanks for letting me know!
What? No hugs?
We got another OOM that blew away the server yesterday. I rebooted cljdoc. (the down alert forwarding worked @martinklepsch!) My change to save the heap dump seems to have worked, so I might take a peek at it to see if it gives any obvious clues. @martinklepsch I don't have access to stats on cljdoc JVM heap usage, do you see any evidence of a memory leak in heap usage graphs over time?
I'll start logging every time I reboot the cljdoc server on #cljdoc-notif so we can get an idea of how often this happens.
(sorry, after posting that I had a meeting, then lunch, then a doctor's appt, then I worked on a bunch of HTML/JS all afternoon!)
Ha! Sorry, I was just being needy!
Thanks for jumping in with the restart Lee! 🙌
I had a quick look at "server vitals" for the last 24 hours and it's interesting that there was a significant spike in load before the server went offline...
So memory seems to be the culprit. Seeing lots of OOM errors in the logs, which is a bit surprising given the chart only shows 40-50% utilization. But I think it probably has to do with what Nomad allocates for the container.
Thanks for the charts @martinklepsch! On Sentry I do see a couple of OOM exceptions. A while ago, I added something that should dump the heap on OOM. I'll see if that actually worked.
I tweaked the heap dump. We should have a fresh sample next time this occurs.
I guess it is questionable for cljdoc to try to continue after an OOM. But I guess it does right now.
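For reference, a minimal sketch of one way a JVM process could stop itself after an OOM instead of limping along. This is an assumption about what might fit, not what cljdoc does today: `ExitOnOom`, `isOom`, and `install` are hypothetical names, and the handler only sees errors that escape a thread uncaught (anything swallowed by a `catch Throwable` never reaches it). HotSpot also has the `-XX:+ExitOnOutOfMemoryError` flag, which may be the simpler route if the goal is just to let a supervisor restart the process.

```java
// Hypothetical sketch, not cljdoc's code: halt the process when a thread
// dies from an OutOfMemoryError so a supervisor can restart it.
class ExitOnOom {
    // True when the throwable or any of its causes is an OutOfMemoryError.
    static boolean isOom(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof OutOfMemoryError) {
                return true;
            }
        }
        return false;
    }

    // Installs a JVM-wide handler for exceptions that escape a thread uncaught.
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (isOom(error)) {
                // halt() skips shutdown hooks, which may themselves need memory.
                Runtime.getRuntime().halt(1);
            }
            error.printStackTrace();
        });
    }

    public static void main(String[] args) {
        install();
        // Demonstrates detection only; nothing is thrown here.
        System.out.println(isOom(new RuntimeException(new OutOfMemoryError("demo"))));
    }
}
```

Whether halting is the right call depends on something (Nomad, presumably) actually restarting the process afterwards, which this thread has not confirmed.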
I have this chart from DigitalOcean but it sounds like you're maybe looking for something at the JVM level? (this is whole-system)
Thanks! Yeah, I don't think we capture heap usage at the JVM level yet, do we?
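In case it helps, a rough sketch of what JVM-level heap sampling could look like using the standard `java.lang.management` beans. `HeapSampler` and `sample` are made-up names for illustration, and nothing here is wired into cljdoc; the idea would be to log this line periodically so heap growth shows up over time. The `max` value is also the JVM's own heap ceiling, which can be well below what the whole-system chart shows.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of cljdoc: reads JVM heap usage via JMX.
class HeapSampler {
    // Returns a one-line summary of current heap usage in MiB.
    static String sample() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        long mib = 1024L * 1024L;
        long used = heap.getUsed() / mib;
        long committed = heap.getCommitted() / mib;
        // getMax() returns -1 when no maximum is defined.
        long max = heap.getMax();
        String maxText = max < 0 ? "undefined" : (max / mib) + "MiB";
        return String.format("heap used=%dMiB committed=%dMiB max=%s",
                used, committed, maxText);
    }

    public static void main(String[] args) {
        System.out.println(sample());
    }
}
```

Calling `sample()` from a scheduled task every minute or so and sending the line to the existing logs would be one low-effort way to get a heap-over-time picture.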
Yeah, it also feels like maybe it should restart? I've also occasionally gotten notices that it's down, but it became available a few minutes later... maybe it does restart in some scenarios?
hmmm… I am not aware of any auto restarts, but also don’t fully understand ops
I also rarely see OOM exceptions on http://Sentry.io. But maybe it is not capturing all of them?