#cljdoc
2021-10-19
deep-symmetry 02:10:18

I just came here to ask the same thing. Hopefully it can be restored!

martinklepsch 10:10:42

Just saw this now (didn’t check my email where alerts go) — restarting 🙂

martinklepsch 10:10:09

The machine was at 100% CPU and failed to handle incoming requests.

martinklepsch 10:10:37

Does anyone have advice on understanding why this happened, or on how we can diagnose it the next time it happens? Anecdotally this wasn’t an issue for a long time, but over the last few months I’ve probably restarted the service twice.

lread 10:10:20

The obvious default spot to check is the logs. Capturing a thread dump, if possible, before restarting can be helpful.
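For a JVM service like cljdoc, a dump can be captured with the standard JDK tools before the restart. A rough sketch (the `pgrep` pattern is a guess at how the process is named on the host):

```shell
# Grab a thread dump from the running JVM before restarting it.
# The 'cljdoc' pgrep pattern is an assumption about the process name.
PID=$(pgrep -f cljdoc | head -n 1)

if [ -n "$PID" ]; then
  # jstack ships with the JDK; `jcmd $PID Thread.print` is equivalent.
  jstack -l "$PID" > "/tmp/cljdoc-threads-$(date +%s).txt" || true

  # If jstack can't attach, SIGQUIT makes the JVM print the dump to its stdout/log:
  # kill -3 "$PID"
fi
```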

lread 10:10:21

Can't remember what kind of metrics cljdoc is collecting on itself. If cpu usage is one of them, you could correlate when cpu went high against logs.
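One way to pin high CPU down to a specific thread on a Linux JVM host (procps and JDK tools assumed; again the process pattern is a guess):

```shell
# Show per-thread CPU for the service process; note the TID of the hot thread.
PID=$(pgrep -f cljdoc | head -n 1)
{ [ -n "$PID" ] && top -H -b -n 1 -p "$PID" | head -n 20; } || true

# Thread dumps print thread ids in hex ("nid=0x..."), so convert top's
# decimal TID before searching the dump:
TID=12345                      # example TID taken from top's output
printf 'nid=0x%x\n' "$TID"     # prints nid=0x3039
# grep -A 15 "nid=0x3039" /tmp/cljdoc-threads.txt
```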

martinklepsch 14:10:13

The logs basically stop around 8:34pm (UTC) yesterday 😅

martinklepsch 14:10:22

@lee I’d be happy to share prod access with you + some guide on how to restart the service

lread 14:10:39

Sure, sounds good!

lread 14:10:13

Curious: Cora did trigger a redeploy and that did not work. I wonder why. Maybe the container was pooched?

martinklepsch 14:10:54

I think the redeploy will only actually deploy if the sha changes or something like that

martinklepsch 14:10:27

It might have worked to push an empty commit or the like
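A sketch of that trick, demonstrated here in a throwaway repo: an empty commit moves HEAD to a fresh sha, which gives CI something new to build and publish as a new image (the exact pipeline trigger is an assumption):

```shell
# Demonstrated in a scratch repo; in the real repo you'd run only the
# commit and then push to kick off CI.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email dev@example.com
git config user.name dev

git commit -q --allow-empty -m "chore: force redeploy"
git rev-parse HEAD    # a fresh sha, despite no file changes
# git push origin master
```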

martinklepsch 14:10:04

Basically we publish to docker and when the deploy asks for the same docker image Nomad (the scheduler) will be like “eh; nothing to do here”

martinklepsch 14:10:45

@deleted-user I’m assuming this is a joke but just in case it’s not, do you happen to have experience with https://www.nomadproject.io/?

martinklepsch 14:10:05

ah that’s cool. I basically just use it to achieve a green/blue type deployment thing where there is no downtime between deploys (vs. systemd or similar)
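The zero-downtime behavior comes from Nomad’s `update` stanza on a task group; a minimal sketch of what such a job might declare (illustrative values, not cljdoc’s actual config):

```hcl
job "cljdoc" {
  group "web" {
    update {
      max_parallel     = 1        # replace one allocation at a time
      canary           = 1        # start the new version alongside the old one
      min_healthy_time = "10s"    # canary must stay healthy this long
      health_check     = "checks" # gate promotion on service checks
      auto_promote     = true     # cut over once the canary is healthy
    }
  }
}
```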

martinklepsch 14:10:09

@deleted-user if you’re interested I’d also be happy to share prod access with you

martinklepsch 14:10:42

That’s awesome but I’m also just asking to de-risk the ops part a bit, no expectation that you or someone else necessarily dig into the cpu issue 🙂