2025-06-22 beginners | Clojure Slack Archive

beginners

jaihindhreddy 2025-06-22T08:07:07.313069Z

What would be a good way to embrace the let-it-crash philosophy in a stateful system implemented using component? Here's an example: in a Clojure application running in k8s, if a bad code push causes one of the API endpoints to run new code for which it doesn't have permissions (joining to a new table without access, for example), then it's clear that there was some kind of misconfiguration. Currently, the system just continues to chug along, responding with either 500s or 412s. Instead, if we could detect such exceptions and crash the process immediately, the rolling deployment of k8s would ensure that only a small amount of traffic to that endpoint is affected, because the pods with the new code would go into CrashLoopBackOff, and the rollout wouldn't continue. Is this a bad idea in the first place, or is it already standard practice in typical Clojure systems? Of course, I can put more checks in at system-init-time and refuse to start it in the first place, but not all such problems can be caught at init-time.

jaihindhreddy 2025-06-22T08:12:58.001019Z

I can definitely see why this can be a very bad thing in some situations. For example, in the scenario I described above, when there is no rolling deployment but someone removes the application's access to a table, what I'm trying to do will crash all pods, as opposed to affecting only the API endpoints that require that table. I wonder if there's a better way to think about this without embedding too much knowledge in the system, like "Am I new code that's currently rolling out in a larger context?", or making the component aware of the system it is running in. One of my thoughts was to add a way for a component to tell the containing system "I have failed catastrophically", and let the system deal with it. And there I could do something like "If any component fails catastrophically in the first x seconds of starting the system, shut it down".
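
A minimal sketch of the "first x seconds" idea above, not tied to component's actual API. The names `failure-reporter`, `:window-ms`, `:now-ms` and `:shutdown!` are all hypothetical; the clock and the shutdown action are passed in so that the policy can be tried out without killing a real process.

```clojure
(defn failure-reporter
  "Returns a function that components can call to report a catastrophic
  failure. Reports arriving within `window-ms` of creation trigger
  `shutdown!`; later reports are ignored by this policy.
  All option names here are assumptions made for illustration."
  [{:keys [window-ms now-ms shutdown!]}]
  (let [started-at (now-ms)]
    (fn report! [failure]
      (let [elapsed (- (now-ms) started-at)]
        (if (<= elapsed window-ms)
          (do (shutdown! failure) :shutdown)
          :ignored)))))

;; In a real system the options might look like this (an assumption):
;; {:window-ms 60000
;;  :now-ms    #(System/currentTimeMillis)
;;  :shutdown! (fn [_] (System/exit 1))}
```

The system would create one reporter when it starts and hand it to each component as a dependency, so components only need to know how to say "I have failed", not what happens next.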

p-himik 2025-06-22T11:34:21.161389Z

Something turns exceptions into HTTP 500. Probably some middleware. You can add a new middleware before that one that filters out the exceptions that should bring the system down and then acts on them.
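
A rough sketch of such a middleware, assuming synchronous Ring-style handlers (a function from request map to response map). `wrap-crash-on-fatal`, `fatal?` and the `:app/fatal` key are hypothetical names, and deciding which exceptions count as fatal is left entirely to the application.

```clojure
(defn fatal?
  "Hypothetical predicate: an exception counts as fatal when its ex-data
  carries {:app/fatal true}. Any other classification rule would do."
  [ex]
  (true? (:app/fatal (ex-data ex))))

(defn wrap-crash-on-fatal
  "Wraps `handler` so that exceptions matching `fatal?` are passed to
  `on-fatal` before being rethrown. Every exception is rethrown, so the
  middleware that produces the 500 response keeps working as before."
  [handler {:keys [fatal? on-fatal]}]
  (fn [request]
    (try
      (handler request)
      (catch Throwable t
        (when (fatal? t)
          (on-fatal t))
        (throw t)))))

;; Wiring (an assumption): apply it closer to the handler than the
;; exception->500 middleware, so that it sees the exception first.
;; In production `on-fatal` could be (fn [_] (System/exit 1)).
```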

Bob B 2025-06-22T15:09:07.160879Z

If you're using something like argo rollouts, deploys also have "health checks" that can be used as a determinant for whether to continue a rollout. How many situations these checks cover is up to their author, but presumably that's more deterministic than timing things (I could imagine a scenario where the "new endpoint" doesn't get organically called until x + 1 seconds after starting).
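
On the application side, a check like that could be a plain Ring-style endpoint that a rollout tool or a k8s readiness probe polls. This is only a sketch: `run-checks`, `health-handler` and the check names are made up, and how the deployment tooling is pointed at the endpoint is configuration that isn't shown here.

```clojure
(defn run-checks
  "Runs each zero-argument check function and returns a map of
  check name -> boolean. A check that throws counts as failed."
  [checks]
  (into {}
        (map (fn [[check-name check]]
               [check-name (try
                             (boolean (check))
                             (catch Exception _ false))]))
        checks))

(defn health-handler
  "Returns a Ring-style handler that answers 200 when every check
  passes and 503 otherwise."
  [checks]
  (fn [_request]
    (let [results (run-checks checks)
          ok?     (every? true? (vals results))]
      {:status  (if ok? 200 503)
       :headers {"Content-Type" "text/plain"}
       :body    (pr-str results)})))

;; Example checks map (hypothetical): each entry would exercise one
;; permission the new code needs, e.g. a cheap SELECT on the new table.
;; {:orders-table-readable (fn [] (can-read-orders? db))}
```

Because the checks run on demand, the new code's permissions get exercised during the rollout even if no real traffic has reached the new endpoint yet.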

2025-06-22T15:24:19.037449Z

> either 500s or 412s

(By the way, "The 412 (Precondition Failed) status code indicates that one or more conditions given in the request header fields evaluated to false..." https://www.rfc-editor.org/rfc/rfc9110#status.412. In other words: 412 is not a fault, but a feature. In any case, since a client can provoke any 4xx code at will, those would not be great criteria to crash the server.)

jaihindhreddy 2025-06-22T15:25:30.588369Z

Indeed, it's always possible that the bad code reveals itself long after any rollout is complete. The health check during rollout does sound like a nice idea. Thanks!

2025-06-22T15:42:15.685689Z

The let-it-crash philosophy sucks 🤣 It's not realistic for any real production scenario where you want resilience. But if you change it to "let-it-restart" or "let-it-rollback" it probably makes more sense.

2025-06-23T16:03:51.014679Z

depends on which system you are talking about. "let it crash" usually refers to the individual elements of a distributed system. the restart and/or rollback logic are the job of the orchestration code, and should not exist at the level of individual services

2025-06-23T16:04:39.701039Z

at the most abstract level, letting programs crash is much better behavior compared to attempting to prevent crashing

jaihindhreddy 2025-06-23T16:06:04.129549Z

Indeed, I'm talking about this kind of "let it crash", popularised by the Erlang world AFAICT. k8s in this case, being the poor man's BEAM VM.

2025-06-23T16:34:30.327059Z

I say that because I've often seen people misunderstand what is meant. The goal is still very much for the software to always be available and correctly usable by the user. But I've seen people confuse it with deliberately having the software fail in order to alert you that it doesn't work. The latter was an idea from the old days, when software was delivered to users in a very different way. Nowadays it's the "graceful degradation and recovery" that matters more. And what some argue is that it's easier to restart a bigger chunk of the application, if not the whole app, than it is to recover at too fine a grain or to try to make do in a degraded state.

2025-06-23T16:35:50.936869Z

For example, in your case, because you framed it in a let-it-crash way, your first thought was for it to stop the deployment, and, as you correctly identified, the issue with that is that it can take down the whole fleet. So that's not going to be a good approach unless you have rollback or restart.