#hyperfiddle
2024-01-25
avocade 10:01:32

Maybe @xifi or @ggaillard can have a go at this one: we're getting "white screen of death" (aka wsod) quite often during normal usage, where suddenly the app shows a blank white page when we go back to the tab after some time (sometimes half an hour, sometimes hours, or overnight). The console tends to show WebSocket is already in CLOSING or CLOSED state, or Websocket error: Network process crashed. A curious thing is that the blank-page syndrome seems to only affect Chrome (not Safari or Firefox – all latest versions and all on macOS). But I got a similar issue in Safari yesterday where it still showed the app UI, but it was totally unresponsive (websocket failure according to the logs). It's not just showing a blank page to the customers so at least it looks less bad 🙂 But they still would have to reload the page to get the app usable again. Appreciative of any ideas if you've seen/dealt with this before 🙏

avocade 10:01:54

Anecdotal stuff from another app than ours is that the following page also gives me wsod if I leave it alone in the bg for some time, but this time in Safari: https://dustingetz.electricfiddle.net/electric-fiddle.essay!%45ssay/electric-y-combinator It's instead a Reactor failure: – Yn. I guess it's old code, but at least it's a data point for giving blank screens unexpectedly.

Geoffrey Gaillard 19:01:14

Thank you for the report. We managed to reproduce a similar behavior. We will give it a look.

avocade 19:01:29

Thanks @ggaillard, I'm continuing to track these issues and see where I can dive in further to gather more data.

Dustin Getz 16:01:45

the safari issue may be different than the chrome issue

Dustin Getz 16:01:10

the chrome issue i think we understand, if i am not mistaken we have dealt with this chrome issue in the past and merely regressed something

Geoffrey Gaillard 16:01:37

Pushed a fix to master. The Electric client will now try to reconnect when the tab is brought to the front. Technical explanation:
• The app at https://dustingetz.electricfiddle.net/electric-fiddle.essay!%45ssay/electric-y-combinator runs on an old Electric version.
• There seems to be a bug where the server crashes after some idle time.
• The server kills the socket and the client doesn’t try to reconnect.
I changed the Electric client reconnection strategy. Electric will now always try to reconnect, but only if the tab or window is visible to the user. (https://github.com/hyperfiddle/electric/commit/54b423cd0c88cbee6a2c9053f5912ee2c5ef7832) @U09K620SG I don’t think this is worth a changelog entry.

Dustin Getz 16:01:56

so iiuc there is a new chrome only bug where the server crashes when idle AND chrome is on the other side?

Geoffrey Gaillard 16:01:27

No, sorry for the confusion. 1. There was a bug in the past. I couldn’t reproduce the issue on latest master. 2. There was logic in electric_client.cljs saying: "Only reconnect on your own if we know why the connection failed." I changed 2. to be "always reconnect (with exponential backoff) if the tab/window is visible to the user. If it’s not, wait for it to become visible"
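
A minimal sketch of the policy described in point 2, not the actual electric_client.cljs code. The names `backoff-delay` and `next-action` and the delay constants are assumptions made up for illustration; the real client would read tab visibility from the browser and schedule the retry itself.

```clojure
;; Hypothetical constants, not taken from Electric.
(def base-delay-ms 500)
(def max-delay-ms 30000)

(defn backoff-delay
  "Delay before retry number `attempt` (0-based): doubles each time, capped."
  [attempt]
  (min max-delay-ms
       (* base-delay-ms (bit-shift-left 1 (min attempt 16)))))

(defn next-action
  "Decide what to do after the socket closes, whatever the reason.
  Visible tab: retry after a backoff delay.
  Hidden tab: do nothing until the tab becomes visible again."
  [{:keys [visible? attempt]}]
  (if visible?
    [:retry (backoff-delay attempt)]
    [:wait-for-visible]))
```

For example, `(next-action {:visible? true :attempt 3})` asks for a retry after 4000 ms, while a hidden tab always yields `[:wait-for-visible]` regardless of the attempt count.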

Dustin Getz 17:01:36

it sounds like we should log a ticket to understand why the connection is failing in an unclean way?

avocade 18:01:42

@ggaillard we have tried the new branch since yesterday, and unfortunately we still see a wsod frequently. The error seems slightly different now though: The connection to was interrupted while the page was loading I'm seeing some exception in the logs on the backend, both locally and deployed, so it might be that which is the reason for the app failing. But it seems the frontend should still be able to auto-refresh with the new change, perhaps not. Not easy to debug this 😕

Geoffrey Gaillard 19:01:43

We found the origin of this error. When one’s computer goes to sleep and is then woken up, the browser will send a websocket ping frame to the server to check if the socket is still open. If it doesn’t receive a pong after some hard-coded time, it will consider the connection to be broken and show the above message. Proof: we captured and inspected TCP frames with Wireshark. Turns out we forgot to answer pings in our WS server implementation. We mistakenly assumed pongs were sent automatically (httpkit does that, not jetty). We’ll work on a fix.
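
A minimal sketch of the missing behavior, assuming a hypothetical map representation of frames (`:opcode`, `:payload`); this is not the Electric or Jetty API. Per RFC 6455, an endpoint that receives a Ping replies with a Pong carrying the same application data.

```clojure
;; Hypothetical frame shape: {:opcode <keyword>, :payload <data>}.
(defn control-frame-reply
  "Returns the frame to send back for an incoming frame, or nil if none.
  A ping is answered with a pong that echoes the ping's payload."
  [{:keys [opcode payload]}]
  (case opcode
    :ping {:opcode :pong :payload payload}
    nil))
```

With this, `(control-frame-reply {:opcode :ping :payload "abc"})` yields a pong with payload "abc", and any other frame yields nil.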

👀 2
avocade 20:01:09

Very interesting! I was just going to post another data point that suggests you fixed wsod for this app I mentioned before: https://dustingetz.electricfiddle.net/electric-fiddle.essay!%45ssay/electric-y-combinator I'm getting the same error as we do, The connection to was interrupted while the page was loading., but I'm not getting a WSOD and this tab has been open for a good while now 👏 So this app seems to reload correctly. Which leads me to believe there's something in our app that malfunctions, eg a reference that times out and disappears or something else reference/memory-related (due to it taking some time to occur). Just shortly re this: > When one’s computer goes to sleep and is then woken up … Our WSODs happen even though the computer is on all the time. My laptop never sleeps while connected to power in my office, only when I travel, so it's been on for days now (sure the screensaver has been going on and the display sleeps but not the computer), and the wsod still appears sometimes after less than an hour. (Also we're using httpkit so maybe the pongs were actually sent for us then, not sure.) But it will be very interesting to see if your upcoming fix allows our app to refresh correctly even if it crashes on the backend. Again thanks for all your help, this would be great to get to the bottom of 🙌

avocade 13:01:27

@ggaillard another anecdote: when I run locally and terminate the backend, the app immediately goes to WSOD with WebSocket connection to '' failed:. A thought: would maybe be preferable if the app stays visible at all times, even if the backend connection is severed, so it never goes to just a white screen. At least it makes the app not look totally broken in an unexpected way. Of course when the user tries to interact it will be unresponsive, but at least the UI doesn't disappear on them 🙂

💡 1
avocade 10:02:43

@ggaillard I updated to your latest commit on master yesterday afternoon and deployed, have tested overnight and we still get wsod. this error message now:

Geoffrey Gaillard 10:02:26

This could be caused by many low-level network issues. Can you reproduce this reliably? Please add timestamps to your screenshots.

avocade 15:02:17

Yeah it can be a lot of things causing the network to go down. But it seems the refresh functionality should resume it regardless. If the backend is down due to a deploy, then it's up again within minutes. And I'm getting this locally as well, when the backend is constantly running but the tab (or often just the app) is in the background for some time. I wish I could get timestamps for those error logs, but the console doesn't give them in many cases. When we use timbre internally we get timestamps, which is very helpful. We'll do some more digging and see if we can come up with other things to try and triage the problem 👍

Geoffrey Gaillard 16:02:33

I agree it should reconnect automatically anyway. To display timestamps:

avocade 18:02:34

Weird, thought I had that turned on everywhere 🥸 Thx for the pointer.

Vincent 03:02:58

Philosophical thought: Is there some way we can figure out what layer of the protocol stack* we are [with]in ? electric invokes debugging of all layers simultaneously x) Also is protocol stack the term? System Stack Total Stack Multilayer Stack

avocade 10:02:48

@ggaillard interesting result overnight, with latest electric master, deployed to two different servers (computer was on all night but "display sleeped" after 30 minutes, and the apps were in the background). All tabs got WSOD:d. Different results in different browsers:
• Chrome: no error message at all, just a warning after ~1h20m: electric_client.cljs:54 [Violation] 'close' handler took 1094ms (see screenshot)
• Safari: error message after ~1h: WebSocket connection to '' failed: The operation couldn’t be completed. Connection reset by peer
• Firefox: error after ~1h: The connection to was interrupted while the page was loading.

Geoffrey Gaillard 10:02:06

This is helpful. Thank you. I’ll look into it.

Geoffrey Gaillard 09:02:49

Here is my current understanding of the issue: sometimes while the tab is in the background or the screen is sleeping, either:
• a deployment will cause the electric client to disconnect, reconnect, detect a client/server version mismatch and trigger a page reload
• or the authentication token will expire, and trigger a redirect and/or a page reload.
Because the tab is considered to be in the background, it gets limited resources. Your JS bundle is large and the browser might fail to download the new file (process throttled or some exceeded quota - not sure). Therefore the page doesn’t load completely, and you get a white page. Screenshots of what happened in background, respectively: 1. refresh due to a deployment 2. couldn’t load main.js after auth redirect

avocade 09:02:09

@ggaillard interesting point, so it could be a more prominent issue for us since we end up with a bigger JS bundle than most/all (?) other apps that use electric… But I've also seen the wsod locally within a short amount of time (<30 minutes) where the app was running the whole time. And also on prod (http://app.multiply.co) where we only deploy a few times per day. Those wsod:s happened way faster than auth0 token expiration (which is a few hours at least), and no deploys happened during that time.

avocade 09:02:10

Another thing is that I have started getting wsod:s for this app again, which I just assumed you had updated to the latest version (with the refresh-fix). but now I'm thinking that I was wrong about that 🙂 I thought it was done since I didn't get a wsod for a few days or so, but then on Friday I saw it again: https://dustingetz.electricfiddle.net/electric-fiddle.essay!%45ssay/electric-y-combinator

Geoffrey Gaillard 10:02:29

Hi Oskar • From what we can reproduce, every time the ws connection drops for any reason, we observe it reconnects successfully. ◦ We tested with chrome, firefox, locally and on your prod and staging env. • We’ve noticed some WSOD but they don’t seem to be related to electric (e.g. screenshot) ◦ We don’t know why a JS file is failing to load 22 min later, but this is not related to electric.

avocade 15:02:17

That’s interesting, I haven’t seen a Reactor failure with our app for many weeks now. Thanks for the screenshot 🙏 PS. We’re currently removing a big chunk of the app (old now unused parts), so will be interesting to see if the JS becomes substantially smaller in size, and if so helps with the WSODs.

Geoffrey Gaillard 16:02:34

I don’t trust my claim about the js file size anymore. Your staging env has a large js file (for debug reasons) but the prod js file is of an acceptable size. So it seems the js file size is not an issue here.

avocade 17:02:03

Thanks for the clarification.

henrik 07:02:40

I don’t know if this was ever in question, but I recently inserted a useless div at the top, as the very first thing happening in the app (after binding node), in order to see if at least that div was present when the app WSODs. It’s not, so that should definitively exclude any chance of it being related to authentication, as our authentication happens entirely within Electric, and after mounting this div.

henrik 14:02:47

I switched to Jetty, to see if this was Http-Kit related. Jetty still exhibits the same issue, but gives this server side:

2024-02-09T12:54:02.329Z Clavain.local ERROR [hyperfiddle.electric-ring-adapter:?] - Websocket error
                                          java.lang.VirtualThread.run              VirtualThread.java: 311
            java.util.concurrent.ThreadPerTaskExecutor$TaskRunner.run      ThreadPerTaskExecutor.java: 314
                 org.eclipse.jetty.io.SelectableChannelEndPoint$1.run  SelectableChannelEndPoint.java:  53
                           org.eclipse.jetty.io.FillInterest.fillable               FillInterest.java:  99
       org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded         AbstractConnection.java: 322
      org.eclipse.jetty.websocket.core.WebSocketConnection.onFillable        WebSocketConnection.java: 342
    org.eclipse.jetty.websocket.core.WebSocketConnection.fillAndParse        WebSocketConnection.java: 484
          org.eclipse.jetty.websocket.core.WebSocketCoreSession.onEof       WebSocketCoreSession.java: 229
org.eclipse.jetty.websocket.core.internal.WebSocketSessionState.onEof      WebSocketSessionState.java: 168
java.nio.channels.ClosedChannelException:

henrik 14:02:15

This is Jetty 12 using info.sunng/ring-jetty9-adapter

Geoffrey Gaillard 14:02:24

Recent electric doesn’t need ring-jetty9-adapter anymore. Are you running an old version of electric?

henrik 14:02:35

No, I just wanted virtual threads support, which is dicey in Jetty 11.

👍 1
henrik 14:02:13

ring-jetty9-adapter has a misleading name, the actual version of Jetty depends on the version of the lib.

henrik 14:02:28

The version of Electric is the latest master: f8776792bb24a3b8b81baa509e5758fea7d20d28

henrik 14:02:07

It’s configured like this:

(jetty/run-jetty routes
  {:host                     ip
   :port                     port
   :join?                    false
   :virtual-threads?         true
   :ws-max-idle-time         60000
   :ws-max-text-message-size (* 100 1024 1024)
   :configurator             (fn [server]
                               (add-gzip-handler! server))})
Which should be equivalent to the servlet stuff from starter/fiddle. Jetty 12 doesn’t have a servlet API anymore. Edit: Actually, modulo binary size, which isn’t exposed.

avocade 14:02:06

Same wsod-issue with jetty then… this becomes stranger and stranger.

henrik 07:02:59

With 0cfb98429f0e9b4cb7822173d55d1498ce2dd972, WSOD remains, but no errors show up in the server-side logs anymore. Client side, these things show up around the time when it should have WSOD’d:

main.js:4410 [Violation] 'message' handler took 304ms
main.js:7270 [Violation] 'setTimeout' handler took 242ms
main.js:7270 [Violation] 'setTimeout' handler took 70ms
main.js:7595 [Violation] 'close' handler took 549ms
WSOD could be tied to app size. Anecdotally, we’re getting more complaints of it (app size has grown ~100kb). But unfortunately, I don’t have a more precise measurement than that.

avocade 11:02:07

@ggaillard I think I've conclusively refuted the hypothesis that it's about the app bundle size. In our boot sequence, if we stop at an earlier point before loading the full app (we have a Welcome screen if the user is not a member of any workspaces yet, like on first signup), then the auto-reload works fine and it doesn't get stuck indefinitely at the white screen like we get when the main app UI is fully loaded. If the app bundle size (or related) would be a cause of this, then shouldn't it always fail regardless of how much of the app (and data) is loaded? This is confirmed on my M1 Max laptop in all three browsers (chrome firefox safari), and also on an iPhone 15 Pro. On the iPhone it's very easy to get the app to "unload" (since it seems to eat a lot of RAM) – just opening the Camera app usually does it. When I then go back and view the tab where it stayed on the Welcome-screen it reloads exactly as it should and everything is fine; but when I view the tab where the main app is loaded, it's white and unresponsive (even the "Refresh" button has disappeared from the address bar I noticed, but that's just anecdotal and I don't have insight into the internals of Apple's Safari team to know if this is significant 😅). I've been and am currently attempting to sort of "bisect" to see where in the chain it starts just giving up and dying on a blank screen forever. Could be related to RAM possibly, could be the amount of data loaded, amount of listeners/missionary/reactive stuff in-flight, no clear idea at this point. But if it's just RAM, then after the browser kills the tab in the bg (as normal) it should not prevent the auto-reload when I view the tab again. So it feels like the app/electric is getting itself into a corner it can't back out of and then just dies.

henrik 11:02:06

Based on this, it might be something related not to the bundle size, but the payload size transmitted? Alternatively, the “depth” of the stack of reactive functions? That is, given that a simpler screen in the same app (that omits most of the app structure) auto reloads fine, while the full app does not.

avocade 14:02:48

Another question @ggaillard: since we don’t see any output in the browser console logs when WSOD happens, is there possibly some “insane mode” for logging that we could enable in electric? That would give us everything? 😅

Geoffrey Gaillard 16:02:46

The electric client already logs everything meaningful. It will log about everything but received and sent messages, meaning:
• connection attempts
• disconnects
• reconnects
• exceptions.
You can also look at your devtools "Network" tab to see websocket messages and control frames (screenshot on firefox). If there are no logs, then the electric client is not running. I tried to reproduce the issue on my phone as you described:
• navigate to
• log in
• open camera app
• wait a bit
• go back to the browser (in-tab app has been unloaded because of RAM pressure)
• app reloads properly on its own
I still cannot reproduce the issue you describe, neither on my phone nor on my laptop. I tried multiple browsers, in dev mode and on your staging and prod environments. While there were Electric-specific issues, those have been fixed (thank you!). As of today, I have no reason to believe your current issue is related to Electric. I’m afraid I won’t be able to make further progress on this issue without further screenshots and clear repro scenarios.

henrik 16:02:11

I seem to recall that when I shut down the app locally, Electric (the client) would attempt to reconnect. I can’t see that it does this currently. Is this a change in logging, or are reconnection attempts actually not happening?

avocade 16:02:57

@ggaillard thanks that's interesting. just a clarification: when you say it worked with auto-reload for you, did you view this page (which is what you see if you just sign up without being invited first)? that's the only view that I've consistently been getting to work as well.

avocade 16:02:07

And yes, I agree that it looks more and more like something that we do in our app code causes the WSOD to occur. But it's so strange that it can fully kill electric, such that its lowest-level (most resilient) functions also fully die, ie the auto-reload stuff. As often with these things, we'll probably be stupefied about how trivial it was, once we figure it out 😅

henrik 16:02:21

I personally have no idea if this is a property of Electric, our code, or a combination. I’m not finding it particularly trivial to figure it out.

henrik 17:02:20

Update on reconnection. Three windows: terminal, a logged-in account, and an account that doesn’t have access (but still sees a view rendered by Electric). When terminating the (server) app, the fully loaded (middle) app does not attempt to reconnect. However, the partially loaded (right hand side) app does. The middle gets a [Violation] 'close' handler took 410ms electric_client.cljs:54 warning, while the right hand side does not. Is it possible that the app takes so long to tear down that it disrupts the reconnector?

avocade 17:02:53

Mm we've seen those [Violation] warnings for a long time now. Could it perhaps "time out" trying to reconnect…

henrik 15:02:35

It looks like we’ve eliminated WSOD for now. I went in and started to comment out large parts of the app to see where it would start working. Commenting out stuff didn’t make any difference. But I had built a boot helper used early on in the app to be able to create reactive functions that kind of worked as loosely coupled middleware. Replacing that with tightly coupled function calls made the problem disappear. I’ve put the code that I removed in an Electric starter here: https://github.com/fluent-development/electric-boot/blob/boot/src/electric_starter_app/main.cljc Note, however, that this works fine in Electric starter. So we’re rid of the problem, but don’t understand why it worked. We’ll be monitoring the situation. The other weird thing is that I built it while investigating WSOD. 🤷‍♂️ Thank you very much for the help @ggaillard, all is good for now.

👀 2
avocade 18:02:15

Thanks for all your help @ggaillard, this was a weird one and you've been most helpful in this process 🙏 Hope it stays fixed now 🥂