This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2022-06-24
Channels
- # announcements (39)
- # aws (4)
- # babashka (18)
- # beginners (32)
- # biff (10)
- # calva (22)
- # clj-kondo (11)
- # clj-on-windows (2)
- # cljs-dev (20)
- # clojure (31)
- # clojure-europe (130)
- # clojure-norway (21)
- # clojure-spec (24)
- # clojure-uk (17)
- # clojured (2)
- # clojurescript (6)
- # conjure (1)
- # core-async (5)
- # cursive (11)
- # datalevin (1)
- # datomic (9)
- # emacs (24)
- # figwheel-main (42)
- # fulcro (6)
- # gratitude (6)
- # honeysql (8)
- # juxt (16)
- # kaocha (6)
- # lsp (235)
- # malli (2)
- # nbb (7)
- # off-topic (23)
- # other-languages (24)
- # pedestal (7)
- # reitit (6)
- # sci (1)
- # shadow-cljs (21)
- # tools-build (22)
- # tools-deps (57)
- # vim (24)
Starting a thread to discuss the freezes that have been reported recently…
references https://clojurians.slack.com/archives/CBE668G4R/p1656064304196529 https://github.com/clojure-lsp/clojure-lsp/issues/1058 https://github.com/clojure-lsp/clojure-lsp/issues/1080 https://github.com/clojure-lsp/clojure-lsp/pull/1079 /cc @UKFSJSM38 @U0178V2SLAY @U0123R34Q9Y
Like I said in the Slack conversation linked above, I’ve been working on this for a few days, without much progress. And again, there’s a chance that what I’m working on is unrelated, so I don’t want to sidetrack this conversation if it isn’t. Not sure how to prevent that
@U07M2C8TT I know we are not sure about the root cause, but i thought https://github.com/clojure-lsp/clojure-lsp/pull/1079 could help and we could check with @U0178V2SLAY or @U0123R34Q9Y if fixed for them
Right, I started to work on clojure-lsp/1058, or rather, made a guess that the freezes @U0178V2SLAY was seeing were related to several blocking puts. The outcome of that is https://github.com/clojure-lsp/clojure-lsp/pull/1079
However, the integration tests for that branch fail almost all the time. (Not always, unfortunately.) I’m wondering if I’ve made the problem worse
@U0178V2SLAY and @U0123R34Q9Y, if you’d like you can check out that branch and see if it fixes any of your issues (or makes them worse)
So, here’s a more detailed analysis of how the tests fail
If using linux, you could even download the native binary from https://github.com/clojure-lsp/clojure-lsp/actions/runs/2553860615
First, the failure is much more likely to happen in the GraalVM integration tests. Those tests or 5x faster than the JVM integration tests, and so perhaps this is timing-related test flakiness
Let’s look at one of the tests in more detail
The most recent test run is pretty emblematic. https://github.com/clojure-lsp/clojure-lsp/runs/7036727390?check_suite_focus=true
The first test spins up a clojure-lsp server process, and starts sending messages to it as a mock “client”. The client receives all the server’s notifications about the progress the server is making doing the initial analysis, and eventually hears that the analysis is https://github.com/clojure-lsp/clojure-lsp/runs/7036727390?check_suite_focus=true#step:8:368
But then the client has a problem. The client sent the server an “initialize” request, but it fails to get a response from the server. The client waits 30 seconds to get a response, but then https://github.com/clojure-lsp/clojure-lsp/runs/7036727390?check_suite_focus=true#step:8:370 the test
We output a bunch of logs from the server when this happens
First we see clojure-lsp’s logs. It thinks that it responded to the client’s “intialize” request https://github.com/clojure-lsp/clojure-lsp/runs/7036727390?check_suite_focus=true#step:8:397 sending all the notifications about the analysis scan
And after that we see lsp4j’s logs (lsp4j is the lower-level Java library we use to implement the JSON RPC level of the LSP protocol)
Those logs also say that the response to the “initialize” request https://github.com/clojure-lsp/clojure-lsp/runs/7036727390?check_suite_focus=true#step:8:776
If you look at the timings, lsp4j agrees that the response was sent moments after the analysis progress notifications
scrolling back up to the client’s logs, we can observe some additional https://github.com/clojure-lsp/clojure-lsp/pull/1079/files#diff-576f32366c837b68400162f60075ce1677c7e890347f5f53076546d39a0ab9abR77. You can see https://github.com/clojure-lsp/clojure-lsp/runs/7036727390?check_suite_focus=true#step:8:369 that after receiving the analysis progress notifications, the client started waiting on (read-line)
to get the next line of output from the server
That read-line is supposed to see (and then discard) a header from the server’s output stream, before it moves on to let the client listener read a chunk of JSON from the stream
Digging a bit deeper into the code…
lsp4j’s trace line is generated https://github.com/eclipse/lsp4j/blob/ab3c4891699fd3b3bd057a13df31982fd5bb96b3/org.eclipse.lsp4j.jsonrpc/src/main/java/org/eclipse/lsp4j/jsonrpc/TracingMessageConsumer.java#L101 and printed to the trace file https://github.com/eclipse/lsp4j/blob/ab3c4891699fd3b3bd057a13df31982fd5bb96b3/org.eclipse.lsp4j.jsonrpc/src/main/java/org/eclipse/lsp4j/jsonrpc/TracingMessageConsumer.java#L109-L112
then it calls the thing that actually puts the message on the output stream https://github.com/eclipse/lsp4j/blob/ab3c4891699fd3b3bd057a13df31982fd5bb96b3/org.eclipse.lsp4j.jsonrpc/src/main/java/org/eclipse/lsp4j/jsonrpc/TracingMessageConsumer.java#L114
the server output stream writer does its thing https://github.com/eclipse/lsp4j/blob/ab3c4891699fd3b3bd057a13df31982fd5bb96b3/org.eclipse.lsp4j.jsonrpc/src/main/java/org/eclipse/lsp4j/jsonrpc/json/StreamMessageConsumer.java#L66-L70
But the client fails to read anything from that output stream https://github.com/clojure-lsp/clojure-lsp/pull/1079/files#diff-576f32366c837b68400162f60075ce1677c7e890347f5f53076546d39a0ab9abR78
So, somewhere between the trace being generated printed and the client receiving a line, something is going wrong
That leaves a few possibilities. It could be in lsp4j. Perhaps the trace is generated and saved to the trace file, but the message is never actually put on the output stream. This seems unlikely.
So, it's possible that the bb integration-test client is not parsing properly when (read-line)?
(I’ll get there)
Or, somewhere within Java, the output could be buffered and not written to the output stream
Or, somewhere in the inter-process communication the output could be buffered. (I tried to fix that possibility https://github.com/clojure-lsp/clojure-lsp/pull/1079/commits/f01f773b69289210830ecf20d88266505f07be6d, but my attempt didn’t work.)
Or something in the client side babashka/process, or in babashka’s (read-line)
could be failing. If that’s the case, then this is probably a different issue than what @U0178V2SLAY and @U0123R34Q9Y are reporting, because it’s specific to the integration tests, not to the system in general
Maybe @UKFSJSM38… but, it’s pretty similar. Suddenly the client stops receiving messages, even though the server thinks it’s sending them.
Going back to that https://github.com/clojure-lsp/clojure-lsp/pull/1079/commits/f01f773b69289210830ecf20d88266505f07be6d I referenced… we have two processes that are communicating with each other through an OS pipe. That’s true in real life, where the editors start a clojure-lsp
process, and then communicate with it through it’s input and output streams. And it’s true in integration tests, which do the same with a mock client
yeah, but from @U0178V2SLAY issue, it seems like server doesn't stop that fast, it's like if the go-block thread is full after some other requests, not immediately, so that's why I think are different issues
Yeah, let’s think through @U0178V2SLAY’s issue a little
I know that @U0178V2SLAY’s report was that things stop working only in very specific situations and that @U0123R34Q9Y’s report is that things stop working after several minutes. But I wonder if the integration tests hit this because they run so much faster.
Looking at @U0178V2SLAY’s report more, it said that https://github.com/clojure-lsp/clojure-lsp/issues/1058#issuecomment-1160557291. First, the (producer/publish-workspace-edit lsp-producer edit)
deref is blocked. And second, the core.async put onto the edits-channel is blocked. Is it true that the first would cause the second? If so, then the problem is really that the client doesn’t receive a response from the server, and so the deref never resolves. This is despite the server being confident that it did send a response.
That’s why the integration tests and bug reports seem similar to me
Thinking about this more… there would really have to be 2 edits. The first edit would have to go through clojure-lsp but not get read by the client. That would cause https://github.com/clojure-lsp/clojure-lsp/blob/master/cli/src/clojure_lsp/server.clj#L89-L90 to block. That would block https://github.com/clojure-lsp/clojure-lsp/blob/13e78e3af8082780ebab21a21166ad9598c27562/cli/src/clojure_lsp/server.clj#L220-L222 from making another loop. Since nothing is reading from the edits-chan anymore, when the second edit came in, it would https://github.com/clojure-lsp/clojure-lsp/blob/master/lib/src/clojure_lsp/feature/file_management.clj#L49. The PR I was working converted that last >!!
to put!
, but I don’t think that’s going to fix the fundamental problem. The did-open
will finish instead of blocking, but the edit still won’t be put on the edits-chan, because that chan is still full. (Instead of waiting for the channel to accept the message immediately, the put!
puts that waiting into the background and returns immediately. But it doesn’t actually change that something has to wait for the channel to free up.)
Actually… why do we even have the edits-chan? The only place it’s used is in this one case where we’re creating a namespace in a new file. That’s such a rare occasion. Do we really need all this async machinery to put the edit in the background? Why not just publish the edit directly from did-open
?
I don’t think that removing the edits-chan would fix the problem that the client stops receiving server responses. But it would at least let us be sure that the problem isn’t with that part of the core.async code. WDYT @UKFSJSM38 ?
put! is also quite bad uncontrolled, you will just create another issue if you cause the pending put queue (1024) of channels you put into to fill up if you're not consuming fast enough and have a full buffer.
@U050SC7SV that makes sense… and I agree, I think problems at the write end of the channel are really just symptoms of a problem at the read end of the channel
Yet another question… how does that deref work normally? When lsp4j fulfills that CompletableFuture, or whatever it is, it’s implying it knows the client has received a message. But how could it know that? I thought that this was a regular client/server architecture, where responses are unacknowledged. That is, the client doesn’t have to tell the server that it got a response. I’m gonna go look at that part of the lsp4j code.
Yeah, IMO we shoukd fix the read of the chan, I think this started to happen after we introduced the deref of waiting for the edit to be applied? I don't remember why we have the edits chan, I think that always existed 😂 , I can't see why a sync function wouldn't work, maybe just handle timeouts if we do something that rely on waiting for client (like that deref) The https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#workspace_applyEdit, is a request, so what lsp4j is doing under the hood AFAIK is waiting for client to respond that request, then completing the future on java code
> I think this started to happen after we introduced the deref
Yes, that’s when it started. But removing the deref or adding a timeout to it wouldn’t fix the problem, as far as I can see. Instead, it would change it so we don’t see the problem. But under the hood, we’d still have this problem—the client isn’t receiving the request. Changing or removing the deref
would only change that we don’t notice that the client doesn’t receive the request.
> The https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#workspace_applyEdit, is a request
Ah, that’s right… thanks, that explains it. The client does tell the server whether it applied the edit by returning {applied: true}
So, we can discard any communication issue there since the client responded? or there was a issue handling the client response, not completing the future?
Well, clojure-lsp starts blocking when it sends its text edit response to the client. I’m not sure what happens from there. Does the client receive the response? If it does, does it process it, and reply? If it replies, even with an exception, why doesn’t the future complete? The problem could be at any one of those steps. But I think the problem is at that first step. The client doesn’t receive the server’s response. Here’s why… Looking at @U0178V2SLAY’s logs, https://github.com/clojure-lsp/clojure-lsp/issues/1058#issuecomment-1159512725, we can see that at some point the client stops getting any messages from the server (search for
Received response 'nil - (9)'
for the last time the server responded.)
That is, there aren’t any Received response
, Received request
, or Received notification
lines.
Similarly in @U0123R34Q9Y’s logs, we don’t see any Received
lines.
And from @U0123R34Q9Y description, the server keeps getting client requests, and processing them… as he navigates, the clojure-lsp logs that it’s publishing diagnostics and processing handlers, all things that it would do if it were receiving requests. But nothing changes in the client, lint squiggles don’t change etc. All of that, along with the absence of Received
lines suggests that the client isn’t even seeing clojure-lsp’s messagesok… hmm
What would that tell us?
Gotcha… when you were trying to reproduce https://github.com/clojure-lsp/clojure-lsp/issues/1080 do you remember if you were using native or JVM?
OK, cool… comparing JVM vs native on Linux seems worth trying. The fact that the client’s logs stop seeing any kind of message from the server—any request, notification, or response—suggests to me that maybe the server’s output stream is broken. lsp4j tries to detect when that happens, and to adjust its behavior, but the code is really convoluted. What if it just starts silently dropping all the server’s messages? The reason I asked about Linux or Mac is that the lsp4j code that https://github.com/eclipse/lsp4j/blob/ab3c4891699fd3b3bd057a13df31982fd5bb96b3/org.eclipse.lsp4j.jsonrpc/src/main/java/org/eclipse/lsp4j/jsonrpc/JsonRpcException.java#L27 whether a stream has closed is all based on string comparisons. What if the underlying OS stream is closed, but on Linux that generates a different string message than what lsp4j is expecting? How would lsp4j behave if the stream had closed but it failed to detect it?
Good findings, that's a good question, and it makes me want to drop lsp4j even more 😅
Huh… I have to be afk for awhile, but this is blowing my mind, so I have to write it down… https://docs.oracle.com/javase/7/docs/api/java/lang/System.html#out is a https://docs.oracle.com/javase/7/docs/api/java/io/PrintStream.html. One of the “features” of a PrintStream is that it never throws IOException. But the lsp4j code seems to https://github.com/eclipse/lsp4j/blob/ab3c4891699fd3b3bd057a13df31982fd5bb96b3/org.eclipse.lsp4j.jsonrpc/src/main/java/org/eclipse/lsp4j/jsonrpc/JsonRpcException.java#L30 stream closures to be communicated via IOExceptions
I’m seeing this issue as well with Calva. The server logs don’t show any errors. The only mention I see of “error” is this:
2022-06-24T23:10:50.661Z WARN [clojure-lsp.kondo:260] - Non-fatal error from clj-kondo: No configs copied.
But that looks like it’s not an issue.
It does seem though from client<->server logs that the server just stops responding to requests.@U9A1RLFNV does that happens with previous clojure-lsp version ?
For example, I can try to go to a definition, and this is all that is printed in the logs:
[Trace - 4:53:12 PM] Sending request 'textDocument/definition - (596)'.
Params: {
"textDocument": {
"uri": "file:///Users/brandon/development/honestdeed/backend/development/src/dev/brandon/core.clj"
},
"position": {
"line": 55,
"character": 20
}
}
And VS Code just continues to show an animation above the editor like something is in progress.@U9A1RLFNV those are two different sets of logs… can you tell if the the server thinks its sending responses vs. whether the client actually receives those responses?
Eventually this is printed in the logs:
[Trace - 4:55:04 PM] Sending notification '$/cancelRequest'.
Params: {
"id": 594
}
I’m guessing the client decides to cancel the request.In the server logs I see this line printed when I try to go to a definition.
2022-06-24T23:56:03.674Z INFO [clojure-lsp.handlers:193] - :definition 22ms
client-out -> -------- -> server-in
client-in <X -------- <- server-out
the server prints :definition 22ms
after it’s generated a response, so that it can include how long that took
> can you tell if the the server thinks its sending responses vs. whether the client actually receives those responses? It does seem that the client sends a request, the server gets it, and the client never receives a response. (edited to “does seem”)
“It does not seem” or “It does seem”?
OK, thanks for confirming that… that’s what I’ve been thinking is going on, but since I can’t reproduce it I haven’t really been able to watch both sides of the logs
We should check if that happens for previous version. Also, I think we bumped lsp4j recently, just not sure if it was on this or previous version, I'm on mobile right now
I’m trying 2022.05.31-17.35.50 now and will report back after some time (it takes several minutes of me working for it to occur).
@UKFSJSM38… I’ve been fighting against adding a timeout to the deref
. But if you’re worried that lots of people are having this problem, we could add it as a stop-gap. We’d be running the risk that the server would be applying edits that it shouldn’t.
In the long run, I’d much rather understand and fix the real problem.
I think we can use some time to analyze if it's worth the easy fix or spend more days to work on a good fix. LMK if you intend to keep researching that
I’m going to keep researching, but have other stuff to take care of this evening, so probably won’t get to it until tomorrow. If you do want to changed the deref, I think a timeout is best. I don’t want to remove it entirely… When things are working, it’s important to have. Before adding it we had that bug in Calva where the cursor was moved before the drag edits were applied
No hurries, we didn't receive that much of complains yet, so I'm not sure this happens all the time too, if things get worst, we can add the timeout
lsp4j was bumped to 0.14.0 in lsp4clj 0.4.0. Looks like that was included in clojure-lsp in the 2022.06.22 release.
Yeah, one more thing to check so, we don't need anything from lsp4j release so we could rollback the bump and start working on dropping lsp4j in the future, but it'd be nice confirm the issue didn't happen on previous lsp4j versions
I suspect that it was happening on earlier lsp4j releases, but I agree, it’d be nice to know. I can’t repro either, unless you count CI, which we’re not sure is the same thing. Makes it really hard to test
I’ve been working for some time now and the issue hasn’t happened with 2022.05.31-17.35.50.
Alright, so besides lsp4clj bump, we re-introduced did-change-watched-files which just call didChange, but I'm less confident that could be the issue
@U9A1RLFNV I can create a branch with current release but lsp4clj downgraded, could you test it?
Alright, done, the branch is https://github.com/clojure-lsp/clojure-lsp/tree/test-downgrade-lsp4clj
I'll try to test drive that branch during the weekend. Big thanks to all of you for debugging this.
Somehow I have a feeling that the issue I reported has something to do with the client. It feels like creating a file causes the client to send some message to clojure-lsp
(I need to verify this hunch from the message logs) and clojure-lsp
tries to respond to it, but because the corresponding buffer is not open in emacs, the response is not handled. (which then leads to clojure-lsp
to freeze due to the deref etc) This only happens with treemacs (and .cljs
/`.cljc`), because treemacs doesn't automatically open the newly created file. Using find-file
(`C-x C-f`) creates and opens the file/buffer immediately and the problem goes away.
@U0178V2SLAY (sorry if I’ve said this before… this topic has come up a few times recently) the LSP spec https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#workspace_didChangeWatchedFiles that clients should watch the file system for changes and inform servers. clojure-lsp participates in that scheme, and so when treemacs creates the file, I bet clojure-lsp is notified and that it sends https://github.com/clojure-lsp/clojure-lsp/blob/13e78e3af8082780ebab21a21166ad9598c27562/lib/src/clojure_lsp/feature/file_management.clj#L220 back to the client. (I have no idea why the cljs and cljc files are special. It should do that for clj files too.) You’re right, it’s possible that the client doesn’t handle the server’s messages correctly because it doesn’t actually have that file open. If that’s the case, it’s true that the deref would block. But what confuses me is that the deref is for just one of the many ways that the server can send messages. Specifically the deref is for when it’s asking the client to edit a file. But the server can also send messages for lots of other things. The main other ones are responses to the client’s requests and notifications that lint has changed. From what we see in the server’s logs, it seems to think it’s sending those messages, even after the deref starts blocking. (I think… you can confirm that for us.) That means that there are other threads that continue to process and respond to the client’s requests and continue to send notifications. I can understand why the client wouldn’t seen any of the server’s requests to edit files, because the server is blocking on the deref. But I don’t understand why it wouldn’t see any of those other messages. That’s what’s making me focus on the communication layer, specifically on the server’s output stream. If there’s a problem there, then no messages of any kind would go through. It’s quite possible that I’m not looking at this the right way. Do you see another way of interpreting all of this? It’s also possible that we’re both right… perhaps when the client receives messages about a file it doesn’t have open, it breaks in a way that breaks the server’s output stream.
oh, good to know! I'm certain you know this stuff better than me, this whole LSP business is just a big black box to me which may lead me to make the wrong assumptions and deductions 🙂 I'll try to find the time to play around with this because I have a reliable way to reproduce the problem. I'm thinking about e.g. checking if I could reproduce this with calva (first I need to learn how to use it though), that might tells us something more. Or perhaps try eglot.
I don't want to spam this thread, so I added some observations and logs to my https://github.com/clojure-lsp/clojure-lsp/issues/1058#issuecomment-1166231407. TLDR: creating an empty file from the terminal does not exhibit the problem, the message log contains just a single didChangeWatchedFiles
notification. But when I do the same from treemacs, the log contains a bunch of didOpen
+ didClose
notifications + applyEdit from clojure-lsp
to add a ns declaration.
> I’ve been working for some time now and the issue hasn’t happened with 2022.05.31-17.35.50. Same with me. Been using clojure-lsp for 20 minutes without issues 🎊 I can try out the test branch tomorrow
@U07M2C8TT I tried locally testing your remove-blocking-puts
branch. However, when I create a jar with make clean cli-jar
, the resulting .jar seems to run for a few seconds before closing with no error. If I do the same (e.g. java -jar clojure-lsp-standalone.jar
) for the master branch, the jar keeps running until I cancel it.
@U0178V2SLAY try just make
it should create a binary which is a embedded jar
Tô be honest, I'm starting to think the issue is not on client since we have both calva and emacs users eith a very similar(same?) issue. That's why making sure things got fixed on these branch's in calva or emacs would help know we are working in the right path
@U07M2C8TT LMK if you need any help when you work on this
@UKFSJSM38 I tried plain make
, which generates a clojure-lsp
binary. However, using that remove-blocking-puts
branch produces a binary, which exists after a few seconds without printing any errors (exit code was 1).
Hum, so probably that branch has something wrong, could you try the test-downgrade-lsp4clj one? https://github.com/clojure-lsp/clojure-lsp/tree/test-downgrade-lsp4clj
sure, just a moment
the test-downgrade-lsp4clj
branch builds fine, starts fine and keeps running ok. However, that problem of mine still happens 😿
Alright, thanks for testing, It'd be nice to have calva users testing that branch and confirming if that fixes for then, if so, they are different issues. Meanwhile @U0178V2SLAY did you test the previous release? I think I remember you saying this issue happens for a while already right?
I'm not certain if I've tested the previous release (will try it in a moment, EDIT: also happens with the latest release), but I've gone through a few different nightly builds. I think I've seen this problem of mine for several months now at least, perhaps even as far back as last winter. I only just recently started to really focus on it and try to figure out what exactly causes it and what doesn't. The good news is that it requires a very specific set of circumstances to appear, otherwise other people would have reported (and replicated) it already.
Yeah, makes sense, I suspect you have a different issue than calva users reported recently so
I tried to repro this in calva earlier today, but couldn't. When you add a new file in vs code, it immediately opens it and the bug doesn't appear. If I do the same in emacs, it works too. The only way to get the problem to appear, is to use treemacs to create a new file. It seems that something (emacs? treemacs?) causes the file to open and immediately close (or at least that's what the LSP IO log looks like).
yeah, I agree
In any case, I think you should concentrate on the calva issue, I can workaround this and I assume this problem isn’t as widespread. Next thing I’ll investigate is if I could repro this with some other server, like python/java/something
EDIT: I tried replacing lsp-mode
with eglot
-> same problem still happens.
Thanks, we have ideas of improving the async stuff in clojure-lsp to have a way to control async better and easily.
@U0178V2SLAY thanks for trying the remove-blocking-puts
branch, and sorry it’s broken. It’s in a weird state where it only works for running integration tests. If you really want to try it in a working state, you can use commit d455da7
, but to be honest, I don’t think it’s worth it. I’m pretty convinced that the problem was never the blocking puts. They were just a symptom of the blocking deref, which is itself a symptom of the real problem.
@UKFSJSM38 thanks for the offer of help. I’m currently contemplating writing some Java to debug what’s going on with System.out. My Java skills are abysmal, so we’ll see how it goes. I don’t have any other ideas at the moment. If I get stuck, I’ll let you know. We might want to open this question to a larger venue at some point
Sure, agreed with put not being the issue and the deref issue might just be a indicator of a issue lsp4j/clojure-lsp communication mechanism that was always there. IMO we should focus on understand the root cause and spend time on fixing the root issue. If more users complain or fixing the root issue is a epic (like dropping lsp4j) we can consider just removing the deref for now since without then we should not have a critical issue but a specific one. WDYT?
I agree… if the root issue is an epic and/or if we’re getting lots of complaints, then I agree, we should try to find a quick fix. I don’t want to remove the deref. That would cause a regression of https://github.com/clojure-lsp/clojure-lsp/pull/955. We could try adding a timeout though. But, be aware… if I’m right and the underlying problem is that the output stream is broken, then there’s no real recovery. Neither removing the deref nor giving it a timeout will help. The server has lost the ability to communicate with the client, forever. The server might keep sending messages (removing the deref would let it keep sending edits too) but that wouldn’t have any effect because the client would never receive them. The only fix will be a restart.
Agreed, do you think I should add the timeout so we can confirm it "fixes" the issue and we are in the right path? While you research how to properly fix the issue
@UKFSJSM38 The problem persists with the test-downgrade-lsp4clj
branch.
I suppose it’s worth giving it a shot. If it works, it’ll disprove my hypothesis that something’s wrong at the communication layer
@U9A1RLFNV thanks, so I think it proves the issue was "always" there and started to affecting final user after we included the deref. The lsp4j bump didn't change anything
Well, I’m stuck. My hypothesis was that the OS stream underlying the server’s output stream breaks, and that if I could get that to raise errors instead of silently swallowing writes, I’d see those errors in the logs. To do this I wrapped System/out. The wrapper proxies writes to System/out, but also raises IOExceptions if .checkError, which is a method on System/out, ever returns true. That didn’t get me any more logs. Then I realized that lsp4j logs through java.util.logging, and that we hadn’t wired up timbre<->slf4j<->jul. I did that, though it took a while to get working in Graal. The tests still fail, without any additional logs. Then I added a timeout to the deref. I didn’t expect this to help… the evidence suggests that when deref blocks, it’s a symptom, not a cause. The first several test runs were green, but then I got a failure. So, we’re back to where we started. The client stops receiving the server’s messages and we have no indication why that happens. My new hypotheses are: 1. We still have something wrong with System/out. Perhaps we shouldn’t be using System/out, but instead be using Sockets. There’s some example code in lsp4j that does this. I might be able to get this to work on the server side, but don’t know how to change the client side so that it talks to a socket instead of to the in/out streams of the server process. 2. The wrapper isn’t working, and/or it’s working but it doesn’t cause any logs to be generated, and/or logs are generated but silently dropped (do we need a log config XML file?) 3. There’s nothing wrong with the I/O streams. Instead, somewhere in lsp4j something is blocking writes to the server’s output stream. Or perhaps lsp4j is accidentally dropping the writes. 4. There’s nothing wrong in lsp4j. Something is blocking the reads on the client side. Seems unlikely since several clients, including the integration tests, exhibit the same behavior. 5. There’s something in clojure-lsp that is blocking writes. I have no idea what this might be. Unfortunately, I don’t have many ideas for how to test these. My best guess is that the problem is #3; lsp4j is accidentally blocking or dropping writes. If this is true, I don’t know how to debug it. I don’t really know how to inspect the lsp4j internals. Maybe attach a debugger? Does anyone in this thread have experience with that? I’ve discovered I can reproduce the integration test failures locally, but only if I use a native-image build (GraalVM). Maybe I can get thread dumps, although I’m not sure how hard that is with a native image. Maybe we could ask the lsp4j team for assistance, but I don’t even know what question to ask them... I don’t think I’m going to get very far with, “Is it possible that very rarely your code stops writing output, but not in a way that you’ve ever seen? Do I have a reliable way to reproduce it? No, not really.” My second best guess is #5; clojure-lsp is blocking writes. I don’t have any ideas for what code might be blocking, no less blocking every write. What do you all think? Any other thoughts on what’s happening? Which do you think is most likely? @UKFSJSM38, you brought up the possibility of replacing lsp4j. On the one hand, that seems like a major undertaking just to debug something that we’re not even sure is related to lsp4j. On the other hand, lsp4j is, to me at least, quite complicated. If we spec out what that effort would look like, we could get a feel for how much of an epic it is. I’ll plan to open an Issue.
So the timeout didn't work? so our easy fix would need to just remove the deref is that right? My best guess is #5 as well, if so, replacing lsp4j with a internal solution on lsp4clj would not result in anything right? I can take a closer look tomorrow probably and try a easy fix (or if you manage to do before me), the issue is not being able to repro that.. I'll probably need to setup and test with calva, to be honest, i'm not confident with our integration tests, their failures coulbe this or just another crazy test issue.. > @UKFSJSM38, you brought up the possibility of replacing lsp4j. On the one hand, that seems like a major undertaking just to debug something that we’re not even sure is related to lsp4j. On the other hand, lsp4j is, to me at least, quite complicated. If we spec out what that effort would look like, we could get a feel for how much of an epic it is. I’ll plan to open an Issue. Yes, I already coded on lsp4j and it's pretty hard to understand what happens there, IMO we should replace it in long-term improving lsp4clj.
> So the timeout didn’t work? No, it didn’t work. > so our easy fix would need to just remove the deref is that right? If you remove the deref, you will break drag, create-function, and create-test for all users (anything that uses :show-document-after-edit), although Calva users will see it most. You also run the risk of breaking other things, since edits could be applied out of order. In addition, from what I know of what’s going on, you’re unlikely to fix any of the problems we’re seeing. (Adding the timeout should have made @U0178V2SLAY’s editor stop freezing, but it still leaves him in a bad state—clojure-lsp won’t be working.) That is, I know you think that the problem started with the introduction of the deref, or that it got worse when the deref was introduced, but I don’t agree with either of those. We might have done other things to make this failure more likely. It’s even possible that it’s happening more often because of performance improvements. If, for example, it happens only when certain requests get interleaved, then faster responses would make it more likely. But I don’t think the deref is the cause of the failure. I don’t want to stop you from seeing what happens when you remove it. As you said, it’s hard to repro, so it’ll be hard to tell if it’s working. But if you try it and get really good evidence that the failure goes away when you remove the deref, that’s great. But I think the real problem is elsewhere, and that at the moment we don’t have any “quick” fix.
> My best guess is #5 as well, if so, replacing lsp4j with a internal solution on lsp4clj would not result in anything right? Correct, if the problem is #5, then replacing lsp4j wouldn’t help.
so we probably don't have quick fixes and we are not sure replacing lsp4j will fix it as well
> i’m not confident with our integration tests, their failures could be this or just another crazy test issue That’s true and good to keep in mind. I’ve spent so long on this I don’t have the patience to do the “mess around in Calva for 10 minutes” approach, though I know I should. 😅
> so we probably don’t have quick fixes and we are not sure replacing lsp4j will fix it as well Exactly
but we do know that previous release was working, so maybe the tradeoff of losing what we fixed from previous release and wining the non freeze state anymore may worth it? I'll still test with calva later to see if there is a hidden quick fix
> we do know previous release that was working I don’t think we know that. We have reports that it happens less often, but from a very small sample. > losing what we fixed from previous release… may be worth it Maybe, but I’m not sure… we have three people (@U0178V2SLAY, @U0123R34Q9Y and @U9A1RLFNV) who are reporting that this is happening. But at least a few of them have said that it’s been happening for months, maybe as long as 6 months.
I thought only @U0178V2SLAY was having the issue (emacs) for some time, the other (Calva users) started to have on recent release (also there is a coworker that reported the same issue on Calva, he upvoted the calva issue)
Yes at least for my issue, it only started happening last week. And now it’s been fine since I’ve been working on 2022.05.31-17.35.50
Hey guys, I started experiencing this issue last Thursday on vscode. Only fix that worked for me was using a previous version of clojure-lsp
btw I'm using 2022.05.31-17.35.50
version as well
So I think this helps narrowing down the issue right @U07M2C8TT?
OK, you’ve convinced me. 🙂 Something changed in the last release that broke things for Calva users
Yeah, and I have something in mind: Calva/vscode send did-change-watched-files a little bit different than emacs (I don't know about vim) and since we changed some things there, it may be why we only have calva users with that issue
@UKFSJSM38 it helps narrow a little. You can try removing the deref, but it’s been there since the 2022.05.03 release, so that’s more evidence it isn’t the source of the problems. Otherwise, I guess we have to bisect through the changes to see where it started. Anyone seen any pattern to when it starts happening?
> did-change-watched-files That sounds like a good target
IIRC, vscode send one single batch of did-change-watched-files while lsp-mode sends lots of them (or the vice versa)
I'm starting to think this is the same problem we had last time we tried to add did-change-watched-files one year ago 😂
oh no!
try disable :compute-external-file-changes
setting
https://github.com/clojure-lsp/clojure-lsp/commit/e98249d14ee914f199ca698ee14b4319816658b2#diff-51d8fac8f2db4fe4a584b81b05fd2aacff71c55df99264f587c459aef335b305R9
@U032DBM7NDP or @U0123R34Q9Y could you try setting this setting to false and check if the problem keeps happening using the latest release?
{:compute-external-file-changes false}
With the latest release, right @UKFSJSM38?
you can add it in your ~/.config/clojure-lsp/config.edn
or <project>/.lsp/config.edn
Ok got it, I will update here soon if I can reproduce the problem (or not)
Same here
The issue from last year was similar but not the same @U07M2C8TT https://github.com/clojure-lsp/clojure-lsp/commit/ba54fea9e441037fb153815cb5b4752ef85e1b14, it was a little bit worst, changes overwriting other changes in the file. Even so, test with that new setting disabled it's worth as it was a critical change indeed on how the server works
It does seem like a slightly different problem. But I wonder if that old problem still happens… did-change
still puts new text in the db, even if the version is older.
Yeah, I think the issue was with that inc
that the new version doesn't have, but let's wait to confirm if that old problem will come back or not 😂
Ah, the inc
changes makes sense. But, yeah, let’s focus on one problem at a time. :)
I’m getting the same issue again as before, after adding {:compute-external-file-changes false}
to ~/.config/clojure-lsp/config.edn
I have not experienced the issue so far, but maybe its too soon (did not had a chance to test so much yesterday)
For reference, these are the versions I'm using atm:
clojure-lsp
: 2022.06.22-14.09.50
clj-kondo
: 2022.06.22
Ok nvm, just got the issue again with latest release and {:compute-external-file-changes false}
alright, thanks both for testing, I'll try to find what could introduced the issue from previous to current release soon
Quick update… @UKFSJSM38 is working on a branch that rolls back some of the major changes between 2022.05.31 and 2022.06.22. When that’s ready, we’ll ask you all to test it. One other question, @U032DBM7NDP, @U0178V2SLAY and @U9A1RLFNV… are any of you using Apple silicon?
@U0123R34Q9Y same question for you too
This should revert some changes we believe may have affected and caused the issue https://github.com/clojure-lsp/clojure-lsp/pull/1082
Intel-based here
To test, you need to:
• clone clojure-lsp project
• checkout the branch
• make
to generate the JVM binary
• change calva to use that binary , there is a setting for that on calva vscode settings
I'm trying to get some detailed log with @UKFSJSM38 about this issue and one thing that I noticed when reproducing the problem is that clojure-lsp
stops as soon as it finds some invalid code... (and from the logs we can see something like WARN [clojure-lsp.parser:115] - It was not possible to parse file. Probably not valid clojure code.
).
E.g. of reproducing the problem:
1. Adding a still-to-be-defined function ;
2. clojure-lsp
tries to parse it and spits what I just pasted above on the logs;
3. At this moment, this seems to be a non-recovery point. Even deleting previous reference to the non-defined function, restoring file with git and trying to save it again... nothing makes clojure-lsp
recover
Does any of it make sense?
Great, finally I can repro this ☝️:skin-tone-3: only on Calva but we need to check if it's the same issue c/c @U07M2C8TT
just inserting a random wrong paren )
repro the issue
I’ll see if I can repro too
I can reproduce the “not possible to parse” warning in Calva but after that I still get completions, can navigate to definitions, etc.
(I know why we get the parse error in Calva but not Emacs. After every edit Calva requests code-actions so that it can show the lightbulb. And code-actions parse the file, hence the parse error. Emacs, by default anyway, doesn’t immediately request code-actions and so it doesn’t immediately end up with a parse error.)
the parse error is expected on both editors right? just not the complete stop of next requests even after removing the wrong )
after a few more tries, i’m able to reproduce in Calva too
Yes, you can get Emacs to show the parse error too, if you add an extra )
, then open the code actions menu. And, right, that’s expected, but unexpected that this would make the client stop receiving messages
Let me try for awhile to see if I can get Emacs to exhibit the same problem
When comparing thread stack dumps between Emacs and Calva, I see a lot more clojure-agent-send-off-pool-%d
threads parked on the Calva side. I believe that futures (clojure’s version, not CompletableFuture) use the agent pools. But other things might as well. @UKFSJSM38 perhaps you and @U032DBM7NDP can focus bisection there
> parked on Calva side Oh you mean like Calva is not answering server right? making clojure-lsp hang some threads
No, I haven’t been able to repro on Emacs
> Calva is not answering server
Not quite. I have two clojure-lsp processes, one started by Emacs (which is behaving fine, even after introducing syntax errors) and one started by Calva (which is broken after introducing syntax errors). I run jstack
on both of them and compare the output with diff
. The one from Calva has more agent send-off pool threads in it, like this:
> "clojure-agent-send-off-pool-5" #34 daemon prio=5 os_prio=31 cpu=24.45ms elapsed=1075.37s allocated=2224K defined_classes=0 tid=0x00007fdc2f245800 nid=0xaa03 waiting on condition [0x0000700012978000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
> - parking to wait for <0x000000078155d5c8> (a java.util.concurrent.SynchronousQueue$TransferStack)
> at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:234)
> at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill([email protected]/SynchronousQueue.java:462)
> at java.util.concurrent.SynchronousQueue$TransferStack.transfer([email protected]/SynchronousQueue.java:361)
> at java.util.concurrent.SynchronousQueue.poll([email protected]/SynchronousQueue.java:937)
> at java.util.concurrent.ThreadPoolExecutor.getTask([email protected]/ThreadPoolExecutor.java:1053)
> at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1114)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
> at java.lang.Thread.run([email protected]/Thread.java:829)
>
> Locked ownable synchronizers:> - None
It also has a “process reaper” thread, which I think gets created when you run an OS processes, like via sh
:
> "process reaper" #39 daemon prio=10 os_prio=31 cpu=39.31ms elapsed=1070.91s allocated=13144B defined_classes=0 tid=0x00007fdc30325000 nid=0x15303 waiting on condition [0x00007000108bc000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
> - parking to wait for <0x000000078bac0478> (a java.util.concurrent.SynchronousQueue$TransferStack)
> at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:234)
> at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill([email protected]/SynchronousQueue.java:462)
> at java.util.concurrent.SynchronousQueue$TransferStack.transfer([email protected]/SynchronousQueue.java:361)
> at java.util.concurrent.SynchronousQueue.poll([email protected]/SynchronousQueue.java:937)
> at java.util.concurrent.ThreadPoolExecutor.getTask([email protected]/ThreadPoolExecutor.java:1053)
> at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1114)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
> at java.lang.Thread.run([email protected]/Thread.java:829)
>
> Locked ownable synchronizers:
> - None
>
Err…. wait… maybe it’s the process that Emacs started that has the extra agents threads and process reaper. Better figure that out first 😅
Yeah… ignore this. I was just able to get two essentially identical stack dumps, one from a “good” clojure-lsp and one from a “frozen” clojure-lsp, both started by VS Code.
so, what we have is: • we know Calva's LSP stop working when we add a invalid code and remove it • checking server side it seems all is working perfectly • it doesn't happen on other editors (at least Emacs) So, it seems the issue could be on Calva's LSP client? (vscode-language-client)
But, the problem doesn't happen on pervious clojure-lsp version which doesn't match with that idea
@U07M2C8TT I can't repro the issue anymore, did you do anything different on calva?
don’t think so
I have to be offline for the rest of the day, but what I plan to do tomorrow is a) confirm that the 2022.05.31
build works and b) git bisect
between that build and 2022.06.22
, running make, restarting clojure-lsp in Calva, and trying to trigger the problem. Should be much less tedious now that we have a quick way to repro. I just hope that repro keeps working:crossed_fingers:
Just to add another data point, I’ve been running off that rollback PR mentioned above (#1082) and I haven’t had the issue.
Thata good news @U9A1RLFNV but I can't relate with calva only issues yet, and can't repro the issue all the time (only one)
I mean, I can't see how those reverted commits could be related with a issue that only happens in Calva, I think we need to bisect those to check which commit caused the regression on calva
Experiencing freezing as well in a scale that I had to revert to cursive. Is there a guidance how I could downgrade with calva?
I'm not a calva user, but perhaps https://calva.io/clojure-lsp/#changing-the-version-of-clojure-lsp-used-by-calva might help?
Thanks, it was simple as opening settings in vscode. I will test 2022.05.31-17.35.50
I can't repro on calva, I did repro one time only but can't anymore :( So, can anyone that is having the issue help finding which one of https://github.com/clojure-lsp/clojure-lsp/pull/1082/commitsfixes the issue? You can clone and revert the commits and test with calva @U0123R34Q9Y @U9A1RLFNV @UM1PCCLNN @U032DBM7NDP
I can do it tomorrow. As sidenote, my environment often crashed when I had a debugger on.
I’ve identified a bad commit with git bisect
and created this https://github.com/clojure-lsp/clojure-lsp/pull/1083 which reverts it. Will a few of you check out that PR’s branch, and confirm that it fixes the problem?
@UKFSJSM38 I’ll start researching why that commit caused the problem. If you want a quick fix, and if others agree that this fixes the problem for them, that commit can be merged. If you want to wait a little while I research, perhaps we can come up with a fix that doesn’t lose the new 1045 feature which improved hover element selection
I'm quite surprised that commit could cause the freezing, but if calva users confirm, we can merge it.
No problem merge the revert, I found a issue on that PR today, so a new PR with no freezing and a bug fix would be better :)
@U9A1RLFNV @U0123R34Q9Y @U032DBM7NDP can one confirm the issue doesn't happens with this PR ☝️ ? if we confirm that, we can merge the fix and do a quick fix release
Agreed, it surprised me that that was the commit git bisect
identified. We’ll see what more careful inspection turns up. I’m just really glad to have a single thing to focus on (and glad it wasn’t a dep version bump commit!)
I can try that PR, but I’ve also been running with 190a948df3608dd119d8d5c4c90d4d93edb0f55b
and haven’t had the issue yet.
I did have the issue with 988e7d46c3f67112231b8c0c091e0a743040d4d7
, however. (Referring to https://github.com/clojure-lsp/clojure-lsp/pull/1082/commits)
hum, interesting, any of those commits from #1082 are related with @U07M2C8TT’s revert
I think we need to test a little bit more to make sure we are reverting the correct commit
there’s a chance that there’s weird commit interaction, but i’d really prefer to focus on my PR. if it fixes things for you @U9A1RLFNV and others, then we can postpone the commit interaction research. if it doesn’t then we’ll have to go down that (much harder) road.
Alright, finally with @U07M2C8TT’s help, we found how to always repro the issue:
1. open a clojure file
2. insert a invalid code like just any random )
3. hover at any element like when you want to see a function defintion docstring
4. removing the invalid code won't make lsp work again
Also, I confirmed that @U07M2C8TT’s PR fixes that issue, so I'm confident enought to merge it and release, WDYT?
Hi guys, sorry did not have a chance to see the message earlier, do you still need testing on the fix-PR?
It'd be great if you could test that PR @U032DBM7NDP :) (this https://github.com/clojure-lsp/clojure-lsp/pull/1083)
ok Im using clojure-lsp
built from the PR and see no issues so far, will update shortly
No issues, I think this PR fixes it
Cool! WDYT @U07M2C8TT let's merge and do a release?
:thumbsup: I don’t have any leads yet on what’s going on with that commit, so I think it’s best to release the revert and figure out the problem later
Great! We finally understand now what is the root cause, and we will work on avoid crashes like that in the future. The fix was released under 2022.06.29-19.32.13 version, thank you very much everyone that helped with the debugging, @U07M2C8TT for keep debugging and finding the root cause 👏 @U9A1RLFNV @U0123R34Q9Y @U032DBM7NDP @UM1PCCLNN
For those of us who were really loving the new hover and were sad to see it reverted, I’ll have a PR that restores it soon. As @UKFSJSM38 said, we’ve figured out why that commit broke clojure-lsp for Calva users, how to fix it, and why it wasn’t a problem in Emacs. See the upcoming PR for details. @UKFSJSM38 has some great ideas about how to avoid similar problems in the future. Thanks again to everyone who helped track this down and confirm which fixes were working!
Awesome! @UKFSJSM38 Did you mean 2022.06.29-19.32.13?
@U0178V2SLAY , sorry to say, we still don’t understand the problem you’ve been seeing. But, it has similar symptoms so there’s a (slim) chance that this will help. If not, perhaps @UKFSJSM38’s long-term fixes will. Keep pestering us if the next few releases don’t sort things out for you.
For anyone who’s curious https://github.com/clojure-lsp/clojure-lsp/pull/1087 that restores the new hover feature and discusses how that code accidentally made clojure-lsp freeze for Calva users