#clojure
2024-07-18
Olav Fosse18:07:37

I just spent 30 minutes hunting down an NPE that was missing a stack trace. Are there any tricks for hunting down trace-less exceptions? How do I find out where they originate without painstakingly searching by hand?

ghadi18:07:56

Pass the -XX:-OmitStackTraceInFastThrow option to the JVM

💯 3
☝️ 1
Alex Miller (Clojure team)18:07:43

The Clojure CLI adds this automatically so stack traces are never omitted

jpmonettas18:07:13

you can use a debugger

jpmonettas18:07:00

although I haven't seen an exception reported without any stack trace when using clj :thinking_face:, can you share more?

Alex Miller (Clojure team)18:07:49

by default, the JVM does -XX:+OmitStackTraceInFastThrow and will sometimes omit the stack for thrown exceptions
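A minimal sketch of this fast-throw behavior, assuming a HotSpot JVM with default flags; `npe-trace-length` is a hypothetical helper, not code from this thread:

```clojure
;; Sketch of the default fast-throw behavior (assumes a HotSpot JVM).
(defn npe-trace-length
  "Trigger an implicit NPE and return how many frames its stack trace has."
  []
  (try
    (.length ^String (identity nil)) ; invokevirtual on nil -> implicit NPE
    (catch NullPointerException e
      (alength (.getStackTrace e)))))

;; A freshly thrown NPE carries a trace; but once the throw site gets hot,
;; a default JVM may start rethrowing a preallocated, trace-less NPE:
(distinct (map (fn [_] (npe-trace-length)) (range 100000)))
;; with -XX:-OmitStackTraceInFastThrow the trace length never drops to 0
```

Whether and when the zero-length trace appears depends on JIT heuristics, so the hot loop above is only illustrative.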

jpmonettas18:07:02

yeah, but as you said that doesn't happen with clj, since it always disables that flag

Olav Fosse18:07:14

my process is started with clj, the CIDER debugger wasn't working either :/

jpmonettas18:07:15

have you tried http://www.flow-storm.org/ ? it should let you jump right into the exception location and then step back from there to see where the NPE is coming from

Olav Fosse18:07:18

yes, it's awesome

Olav Fosse19:07:16

i'm devving in ~prod, so flowstorm might be too heavy

1
Olav Fosse19:07:33

unless there's a way to keep resource consumption in check

jpmonettas19:07:39

oh yeah, FlowStorm is currently a dev thing only

👍 1
jpmonettas19:07:15

that is weird; maybe you can use jps and then jcmd PID VM.flags to check whether the JVM is running with the -XX:-OmitStackTraceInFastThrow flag

Olav Fosse19:07:47

No such flag :thinking_face: -XX:CICompilerCount=3 -XX:ConcGCThreads=1 -XX:G1ConcRefinementThreads=4 -XX:G1EagerReclaimRemSetThreshold=12 -XX:G1HeapRegionSize=1048576 -XX:G1RemSetArrayOfCardsEntries=12 -XX:G1RemSetHowlMaxNumBuckets=8 -XX:G1RemSetHowlNumBuckets=4 -XX:GCDrainStackTargetSize=64 -XX:InitialHeapSize=127926272 -XX:MarkStackSize=4194304 -XX:MaxHeapSize=2032140288 -XX:MaxNewSize=1218445312 -XX:MinHeapDeltaBytes=1048576 -XX:MinHeapSize=8388608 -XX:NonNMethodCodeHeapSize=5832908 -XX:NonProfiledCodeHeapSize=122912666 -XX:ProfiledCodeHeapSize=122912666 -XX:ReservedCodeCacheSize=251658240 -XX:+SegmentedCodeCache -XX:SoftMaxHeapSize=2032140288 -XX:+UseCompressedOops -XX:+UseG1GC
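As an alternative to jps + jcmd, the flag can also be checked from inside the running process via the HotSpot diagnostic MXBean; a sketch, assuming a HotSpot JVM (the `com.sun.management` API):

```clojure
;; Sketch: read a -XX flag from the current JVM (HotSpot only).
(import '[java.lang.management ManagementFactory]
        '[com.sun.management HotSpotDiagnosticMXBean])

(def omit-fast-throw
  (-> (ManagementFactory/getPlatformMXBean HotSpotDiagnosticMXBean)
      (.getVMOption "OmitStackTraceInFastThrow")
      (.getValue)))
;; "true" on a default JVM, "false" when started with -XX:-OmitStackTraceInFastThrow
```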

Olav Fosse19:07:33

My start command is just clj user.clj and I’m on 1.12.0-beta1. I’ll add it manually, I guess

Alex Miller (Clojure team)19:07:55

what version of the CLI? clojure --version

Olav Fosse19:07:33

the one from apt….

Alex Miller (Clojure team)19:07:00

well, that's most likely not the CLI, or it's very old

Olav Fosse19:07:10

Clojure CLI version 1.11.1.1165

Olav Fosse19:07:30

oh, nvm i’m not using the apt CLI. i’m using the one you linked to.

Olav Fosse19:07:45

I updated from Clojure CLI version 1.11.1.1165 to Clojure CLI version 1.11.3.1463, I’ll restart my app now

Alex Miller (Clojure team)19:07:04

yeah, that was before the Omit flag was set (change in 1.11.1.1208)

Olav Fosse19:07:27

Gotcha, then all is good. Future me says “thank you” 🙏

p-himik19:07:41

Completely stuck on something. Might not be related to Clojure at all, but at this point I have no clue. Very occasionally, eftest hangs for me with the default multithreading and progress report. When it happens, it just keeps overwriting its progress bar over and over again. I waited for 5 minutes at most, whereas all the tests in that suite take up to 20 seconds. Imagine that the lines below keep being printed one by one over the same single line, without line breaks (`eftest` uses \r):

2/18    11% [=====                                             ]  ETA: 00:00
 9/18    50% [=========================                         ]  ETA: 00:00
15/18    83% [=========================================         ]  ETA: 00:01
So when it hangs, this goes on seemingly forever. jstack errors out with "target process [...] doesn't respond within 10500ms or HotSpot VM not loaded", but the stack trace still gets printed out in the same terminal where the tests are running. Nothing out of the ordinary goes on there, it all looks like the tests are in progress. (cont. in 🧵)
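For context, the "same single line" effect comes from carriage-return-style printing. A minimal, dependency-free sketch of what such a reporter does; `print-progress` here is illustrative, not eftest's actual implementation:

```clojure
;; Illustrative \r-based progress line, in the spirit of eftest's reporter:
;; each call rewrites the current terminal line instead of starting a new one.
(defn print-progress [done total]
  (let [width  50
        filled (long (* width (/ done total)))]
    (print (str "\r" done "/" total " ["
                (apply str (repeat filled "="))
                (apply str (repeat (- width filled) " "))
                "]"))
    (flush)))
```

Because no newline is ever emitted, any stale or repeated update silently overwrites whatever was on the line before it.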

p-himik19:07:47

What's strange is that even when it does not hang, it still does the same thing rather frequently, close to the end of the tests, for a few seconds and seemingly without actually affecting the total run time. I can't figure out why it could possibly do that, since it uses a single atom for reporting, any changes to that atom go through locking, and those changes only increase the progress. Could it be an issue with the printing to the terminal itself somehow being "stuck"? To corroborate that, if I hold Enter while the progress is being printed, it never does that - the bar always grows. If I override eftest.report.progress/report for :end-test-var with an additional println outside of t/with-test-out, the behavior is also not reproducible - the progress bar always grows.

p-himik19:07:26

Oh, it just hung again. Tried holding Enter - this time it doesn't affect anything, it still hangs and I see this being printed over and over, just as expected:

4/18    22% [===========                                       ]  ETA: 00:00
 6/18    33% [================                                  ]  ETA: 00:01
 7/18    38% [===================                               ]  ETA: 00:02
11/18    61% [==============================                    ]  ETA: 00:02
12/18    66% [=================================                 ]  ETA: 00:02
12/18    66% [=================================                 ]  ETA: 00:04
12/18    66% [=================================                 ]  ETA: 00:06
 5/18    27% [=============                                     ]  ETA: 00:00
 6/18    33% [================                                  ]  ETA: 00:01
10/18    55% [===========================                       ]  ETA: 00:02
11/18    61% [==============================                    ]  ETA: 00:02
12/18    66% [=================================                 ]  ETA: 00:02
12/18    66% [=================================                 ]  ETA: 00:04
12/18    66% [=================================                 ]  ETA: 00:06
 4/18    22% [===========                                       ]  ETA: 00:00
 6/18    33% [================                                  ]  ETA: 00:01
 9/18    50% [=========================                         ]  ETA: 00:02
11/18    61% [==============================                    ]  ETA: 00:02
12/18    66% [=================================                 ]  ETA: 00:02
12/18    66% [=================================                 ]  ETA: 00:04
12/18    66% [=================================                 ]  ETA: 00:06

ghaskins19:07:36

This may be obvious and you already tried, but maybe try with

{:capture-output? false}

ghaskins19:07:23

(I agree the problem sounds strange, though….ive never seen that myself)

p-himik19:07:23

It's already set to false.

👍 1
hiredman20:07:06

what kind of test fixtures do you have?

p-himik20:07:59

Only one :once that sets up an EmbeddedPostgres DB.

oyakushev20:07:12

Which OS and JDK?

oyakushev20:07:59

If it reproduces for you reliably enough, maybe try with a different terminal?

p-himik20:07:30

Ubuntu 24.04. OpenJDK 21.0.3. For terminals - tried mate-terminal, the terminal built into IDEA, and xterm. For shells - tried zsh and sh. I can reproduce it "reliably" - sometimes it happens every other time, sometimes it happens once in a hundred runs.

oyakushev20:07:28

I wonder why jstack doesn't work properly, I'd investigate that first (even though it might be a red herring). Does it happen with other Java processes too? Does it happen with this process when it's not stuck?

p-himik20:07:56

Usually it works just fine. I tried it only once when the tests got stuck, and that's where that error happened.

hiredman20:07:32

how are you running eftest?

oyakushev20:07:19

I mean, you would need to take a few stacks to know the tests are running fine, right? Or connect a plain old profiler for a while, check what's going on both on CPU and waiting.

hiredman20:07:19

depending on how complicated your fixture is, you might want to make sure it isn't looping in weird ways

p-himik20:07:52

> how are you running eftest?
Almost bare-bones, just wrapped its runner/find-tests and runner/run-tests in a -main function that does some CLI argument parsing.
> depending how complicated your fixture is you might want to make sure it isn't looping in weird ways
The fixture just creates a transaction, runs a few predetermined SQL queries, runs the tests, and rolls back the transaction.

p-himik20:07:17

> you would need to take a few stacks to know the tests are running fine, right?
Mm, perhaps. I just saw that one thread was in the middle of a test in the RUNNING state.

p-himik20:07:11

Currently trying to catch it with (while true (runner/run-tests ...)). Weird - the ETA stays at 00:00... But even the hanging aside - why could it possibly overwrite its own progress bar, given the implementation? And even when the progress bar stays visually the same, sometimes the ETA grows and then gets reset back to 00:00.

hiredman20:07:32

I would try to capture the set of events being sent to clojure.test's report function by passing my own function as the :report option, have it write the events to another file or something, and then call eftest's progress report
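A dependency-free sketch of that suggestion: `tee-report` is a hypothetical wrapper, and the eftest names in the usage comment follow the aliases used elsewhere in this thread:

```clojure
;; Sketch: wrap any report function so each event is appended to a log file
;; before being handed to the real reporter.
(require '[clojure.java.io :as io])

(defn tee-report
  "Returns a report fn that logs each event's :type to log-file, then
  delegates to inner-report."
  [inner-report log-file]
  (fn [event]
    (locking log-file
      (spit log-file (str (pr-str (:type event)) "\n") :append true))
    (inner-report event)))

;; usage with eftest (hypothetical, names as in this thread):
;; (runner/run-tests tests
;;                   {:report (tee-report report.progress/report "events.log")})
```

The log then records the order in which events actually reached the reporter, independently of what the terminal shows.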

hiredman20:07:52

eftest does reset! the atom when it gets a begin-test event

p-himik20:07:46

Yeah, but only on :begin-test-run, which is fired once, when run-tests is called.

hiredman20:07:31

why I asked about the fixture is you can get some very weird behavior when exceptions bubble out of fixtures (but it has been a while so I don't recall exactly what can happen)

p-himik20:07:55

Hmm. So it might be a combination of:
• A parallel run
• An exception in a test
• That exception never being reported normally (otherwise, I would've seen and fixed it already)
Also, the fact that I couldn't reproduce it in sequential runs, or with Kaocha, or with a different reporter doesn't really mean that much, given that the issue could still be there and I was just unlucky in catching it with other setups.

p-himik20:07:14

@U06PNK4HG Reproduced it again and ran jstack a few times - all times it worked just fine.

👍 1
oyakushev20:07:15

Do you see all test runner threads progressing on work?

p-himik21:07:42

Ran it manually three times, inspected the output side by side. In all cases:
• Five testing threads are BLOCKED in eftest.runner/synchronize, all in different tests
• Another testing thread is RUNNABLE in the same function, apparently printing
The test being run in that last thread has this structure:

(deftest ...
  (doseq ... ;; over some static data
    (testing ...
      (let [data (generate-data-with-spec)]
        (doseq [item data]
          (is ...)
          (doseq [[k v] item]
            (is ...)))))))
A caveat that shouldn't be important - that `is` is actually a custom macro in my case that calls clojure.test/is and prints some detailed output with clojure.pprint/pprint in case of a failure.

p-himik21:07:19

A video demonstration of the progress bar. :)

p-himik21:07:26

Attached the IDEA's debugger to that process, it automatically stopped the process for some reason, without any ability to continue it (still can't figure out why it happens sometimes). But I got lucky in that it stopped it in the middle of a completely different test, but also during printing. This time, on an unwrapped clojure.test/is.

oyakushev21:07:47

Looks like multiple threads are fighting over a progressbar?

p-himik21:07:37

Well, yeah - it's behind a call to locking, so that's to be expected. What's unexpected is the progress being reset to the start and the tests sometimes not ending at all.

oyakushev21:07:46

I mean, from the video. Multiple threads print "their" progressbar each. Or, rather, the stats state is updated non-monotonically. Different threads overwrite it with stale data. That doesn't explain the overall suite being stuck though.

p-himik21:07:04

> Multiple threads print "their" progressbar each.
That shouldn't be happening - there's just one atom with the progress state. That's the problem.

p-himik21:07:10

And the threads don't write the state to it directly, they only report the status of their tests. An example from the impl:

(defmethod report :end-test-var [m]
  (test/with-test-out
    (print-progress (swap! report/*context* update-in [:bar] prog/tick))))

p-himik21:07:32

prog/tick only adds 1 to a value in :bar.
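One race that is consistent with a monotonic atom but a backwards-jumping display: the swap! itself is atomic, but the pair "swap! then print" is not, so a thread can emit a stale count after another thread has already emitted a newer one. A hypothetical, dependency-free sketch (not eftest's code) of the interleaving oyakushev describes above:

```clojure
;; Demo: a single atom only ever grows, yet the values threads observe and
;; later "print" can arrive out of order, because reporting happens after
;; the swap! rather than inside it.
(def counter (atom 0))

(defn run-demo []
  (let [observed (atom [])
        worker   (fn []
                   (dotimes [_ 50]
                     (let [n (swap! counter inc)]          ; atomic tick
                       (Thread/sleep 0 (rand-int 1000))    ; widen the race window
                       (swap! observed conj n))))          ; "print" the stale count
        threads  (mapv (fn [_] (doto (Thread. ^Runnable worker) (.start)))
                       (range 4))]
    (run! #(.join ^Thread %) threads)
    {:final    @counter                          ; 200 after a single run
     :ordered? (= @observed (sort @observed))})) ; often false under contention
```

This doesn't explain the hang, and p-himik's observation that an extra println makes the bar always grow cuts against it, but it shows how a single growing atom can still produce non-monotonic output.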

hiredman21:07:55

If there was a test that somehow triggered another eftest run to start, that would explain the running forever and the progress bar jumping back and forth

hiredman21:07:49

Is the thread count from jstack constant or always rising?

p-himik21:07:23

Right. But the tests don't do that. :) The runner is not used by anything but that -main function. I was experimenting with Kaocha earlier today and couldn't reproduce the behavior there at all.

p-himik21:07:42

The count of the threads seems to stay the same.

p-himik21:07:15

Of course, if I output the events into a file, I don't get the hanging behavior. I do see numbers briefly going back and forth, but maybe that's just some buffering somewhere?.. Because the events in the file all have a non-decreasing progress number.

jpmonettas22:07:42

maybe adding something like this:

(Thread/setDefaultUncaughtExceptionHandler
   (reify
     Thread$UncaughtExceptionHandler
     (uncaughtException [_ _ ex]
       (.printStackTrace ex)))) ;; or log into a file 
if there isn't one?

p-himik00:07:38

Reproduced with that in place as well.

jpmonettas00:07:28

I was hoping that some kind of exception was bubbling up and not showing in the console because of it being cleared or something, and that maybe logging into a file could show something

jpmonettas00:07:26

my guess was that if it's intermittent, it's maybe some kind of race condition that puts the system in a state where a thread throws an exception that bubbles up, killing the thread, and then something else restarts it forever. But that's just a random guess, I don't know anything about eftest

p-himik00:07:32

And still nothing explains why the progress would jump from e.g. 15/18 to 2/18. :) Even in cases where the hang isn't reproducible, that still happens. But it doesn't happen if I hold Enter or make the reporting insert a new line before printing out the new progress bar.

jpmonettas00:07:10

maybe you can run that under FlowStorm with ClojureStorm with eftest and your stuff instrumented, with recording paused, when it starts going into the loop thing you record for a little bit and then stop (so it doesn't blow your heap) and then without killing the process you can analyze the recordings?

p-himik00:07:27

Even printing to a file seems to be enough for it to become unreproducible. Even if the printing is done on an agent. I doubt throwing more debugging tools at it would help in any way. The only way I seem to be able to peer into the internals is by attaching a debugger after it has already happened. But then the whole process freezes till I detach the debugger, and I also can't figure out why that happens.

jpmonettas00:07:28

> Even printing to a file seems to be enough for it to become unreproducible. dang Heisenbugs

p-himik00:07:21

Another BS:
1. Made some logging changes
2. Suddenly, testing time dropped from 18-20 s to 12 s, reliably for 10 runs straight
3. Removed those changes - the timing went back to 18-20 s
4. Added the very same changes again - the time didn't change, same 18-20 s
Why?.. How did it become 12 s for so many runs?

jpmonettas00:07:59

some weird stuff related to HotSpot optimizations? no idea why it would not be reproducible again

p-himik00:07:23

Every run restarts the whole JVM.

jpmonettas00:07:15

but if you have a race condition (influenced by the logging) that puts the app into a state where it goes down different execution paths, I guess you could see different timings?

p-himik00:07:10

But why would the same exact source code produce different results? The only difference was the time when that execution was happening and the contents of /dev/urandom.

jpmonettas00:07:26

if you run out of ideas I would try the FlowStorm path, it should be easy to try, and since ClojureStorm adds extra instructions everywhere, instead of some localized prints, maybe you'll get lucky and it will still reproduce? If it does, you can record it

👍 1
p-himik01:07:43

Another weird crap - I just reproduced it with :thread-count 1. But I also did it in mate-terminal which seems to be much faster than the IDEA's one. Oh, now it actually finished after taking 160 s instead of the usual single-threaded 30 s... Could it be that the code is actually totally 100% fine but it's the terminals that screw things up?

p-himik01:07:32

I mean, I think I have in total something like hundreds of thousands of assertions. Each calling that report function, each ending up printing something with \r, all in a span of seconds. Or maybe it's not the terminals themselves but something like GPU driver. Man, sometimes I hate how complex everything is all around and within us.

p-himik01:07:12

Yeah, I'm now content with blaming the GPU for this crap, or some related system. Especially given that once a month or so Chrome just hangs on me and I can resuscitate it by killing its GPU process.

jpmonettas01:07:54

I chased down a weird intermittent crash of the JVM I was having some time ago, and it ended up being broken RAM cells in my laptop, breaking the GC with increased likelihood when a ton of heap was used :face_with_rolling_eyes:

p-himik01:07:51

This change seems to be making everything happy. And the testing time has gone down by 1 s (still no clue how I got that time of 12 s at some point).

(import '[java.io StringWriter])

(let [orig-report (get-method report.progress/report :pass)]
  (defmethod report.progress/report :pass [m]
    ;; print :pass reports into a throwaway writer instead of the terminal
    (binding [t/*test-out* (StringWriter.)]
      (orig-report m))))

p-himik01:07:54

> weird intermittent crash of the JVM A similar thing happened when I tried using JNI with a bare-bones Swift library. :) Still no clue where the crashes would come from. Maybe also hardware issues.

p-himik02:07:35

A salient and discouraging article related to the matter, with a 2024 update: https://danluu.com/cpu-bugs/