cljdoc

Ben Wadsworth 2023-09-13T17:42:08.074809Z

http://cljdoc.org has been down for a bit today (along with other websites). Not sure if anything needs to happen other than wait but thought I'd post here 🙂

seancorfield 2023-09-13T17:56:18.006859Z

Ah, I was just coming here to mention that http://cljdoc.org seems to be down.

lread 2023-09-13T18:12:31.809899Z

hmm…. I’ll take a peek

lread 2023-09-13T18:16:00.266999Z

Rebooted and it is back up and running, thanks for letting me know!

lread 2023-09-13T23:40:35.118459Z

What? No hugs?

lread 2023-10-07T20:49:44.889649Z

We got another OOM that blew away the server yesterday. I rebooted cljdoc. (the down alert forwarding worked @martinklepsch!) My change to save the heap dump seems to have worked, so I might take a peek at it to see if it gives any obvious clues. @martinklepsch I don't have access to stats on cljdoc JVM heap usage, do you see any evidence of a memory leak in heap usage graphs over time?

lread 2023-10-11T15:18:35.046749Z

I'll start logging every time I reboot the cljdoc server on #cljdoc-notif so we can get an idea of how often this happens.

seancorfield 2023-09-14T00:15:44.298689Z

(sorry, after posting that I had a meeting, then lunch, then a doctor's appt, then I worked on a bunch of HTML/JS all afternoon!)

lread 2023-09-14T01:48:13.260439Z

Ha! Sorry, I was just being needy!

martinklepsch 2023-09-14T09:19:32.640349Z

Thanks for jumping in with the restart Lee! 🙌

martinklepsch 2023-09-14T09:21:19.521169Z

I had a quick look at "server vitals" for the last 24 hours and it's interesting that there was a significant spike in load before the server went offline...

martinklepsch 2023-09-14T10:46:46.987669Z

So memory seems to be the culprit. Seeing lots of OOM errors in the logs, which is a bit surprising given the chart only shows 40-50% utilization. But I think it probably has to do with what Nomad allocates for the container.

lread 2023-09-14T11:34:14.996899Z

Thanks for the charts @martinklepsch! On Sentry I do see a couple of OOM exceptions. A while ago, I added something that should dump the heap on OOM. I'll see if that actually worked.
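
For reference, one way to confirm that a running JVM is actually set up to dump the heap on OOM is to read the flag back at runtime. This is only a sketch: it assumes the dump relies on the standard HotSpot option `-XX:+HeapDumpOnOutOfMemoryError` (the thread does not say how cljdoc does it) and that the JVM is HotSpot-based. The class name `HeapDumpFlagCheck` is hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumpFlagCheck {
    /** Returns the current value of a HotSpot VM option as a string. */
    static String option(String name) {
        // HotSpot-specific bean; assumption: the JVM is HotSpot/OpenJDK
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        // "true" only if the JVM was started with -XX:+HeapDumpOnOutOfMemoryError
        System.out.println("HeapDumpOnOutOfMemoryError=" + option("HeapDumpOnOutOfMemoryError"));
        // Empty when no explicit -XX:HeapDumpPath was given
        System.out.println("HeapDumpPath=" + option("HeapDumpPath"));
    }
}
```

If the first line prints `false`, no dump would be written on OOM regardless of what the path is set to.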

lread 2023-09-14T18:55:06.995869Z

I tweaked the heap dump. We should have a fresh sample next time this occurs.

lread 2023-09-14T18:56:03.273749Z

I guess it is questionable for cljdoc to try to continue after an OOM, but that is what it does right now.

martinklepsch 2023-10-09T12:44:58.435159Z

I have this chart from DigitalOcean but it sounds like you're maybe looking for something at the JVM level? (this is whole-system)

lread 2023-10-09T15:00:43.322479Z

Thanks! Yeah, I don't think we capture heap usage at the JVM level yet, do we?
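
As a rough illustration of what JVM-level capture could look like, the standard `java.lang.management` API exposes heap usage from inside the process. This is a minimal sketch, not what cljdoc does; the class name `HeapUsageSample` and the log line format are made up for the example, and where the line gets logged or charted is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapUsageSample {
    private static final long MB = 1024L * 1024L;

    /** Formats the current JVM heap usage as a single log line. */
    static String heapLine() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        // getMax() returns -1 when the maximum heap size is undefined
        long max = heap.getMax();
        String maxText = max < 0 ? "undefined" : (max / MB) + "MB";
        return String.format("heap used=%dMB committed=%dMB max=%s",
            heap.getUsed() / MB, heap.getCommitted() / MB, maxText);
    }

    public static void main(String[] args) {
        // Hypothetical usage: call this periodically and ship the line to the logs
        System.out.println(heapLine());
    }
}
```

Sampling this on a timer and comparing it against the whole-system chart would show whether the heap itself grows over time or whether the memory pressure comes from outside the heap.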

martinklepsch 2023-09-18T11:20:42.571139Z

yeah also feels like maybe it should restart? I've also occasionally gotten notices that it's down but it became available a few minutes later... maybe it does restart in some scenarios?

lread 2023-09-18T12:48:40.068529Z

hmmm… I am not aware of any auto restarts, but also don’t fully understand ops

lread 2023-09-18T13:39:10.184369Z

I also rarely see OOM exceptions on http://Sentry.io. But maybe it is not capturing all of them?