#off-topic
2022-09-03
p-himik10:09:34

I have an interesting puzzle for people who like dealing with low-level stuff. I mean things like gdb, patchelf, NOP, etc. Or at least delving into the internals of the JVM, since it might be the one to blame.

I have a dynamic library built for x64 Linux that, when loaded with System/load, tends to mess things up so that a (json/read-str "...") that follows right after crashes the whole JVM with SIGSEGV.

Things I'm almost certain of so far:
• The crash happens in the JVM itself and not during loading
• The crash is inconsistent and depends upon random stuff, like requiring an extra package, or maybe enabling JVM incubator features, or maybe changing the length of that "..." JSON string
• The problematic library is unique - replacing it with any other doesn't result in crashes
• It doesn't have any problematic dependencies - only built-in stuff like libpthread
• Loading that library doesn't override any signal handlers (confirmed in multiple ways)
• Making that library's .init and .init_array sections no-ops made no change
• Stripping that library made no change
• Removing the .text section results in no crashes
• Seems like it crashes only on JDK 18 ("seems" because, given its inconsistent behavior, I can't exactly prove that there won't be a crash. But on JDK 18 it crashes in around 70% of the cases and on any other JDK I have it hasn't crashed in ~20 attempts per JDK)

So it seems like some library code is getting executed when the library is loaded, but it's done via some unconventional means, perhaps? No idea. Alternatively, given the apparent JDK version dependency, it might be a bug in the JVM. No clue how to prove it or even approach it either.

I can send the library to anyone who wants to try and reproduce the issue, or provide instructions on how to build the library.

👀 4
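
Since the crash is intermittent, comparing JDKs comes down to counting how many launches out of N are killed by a signal. A minimal sketch of such a counter follows, assuming a POSIX system, where Python's subprocess module reports death by signal N as return code -N. The function name count_signal_deaths and the command line in the usage note are hypothetical, not the setup used in this thread.

```python
import signal
import subprocess
from collections import Counter


def count_signal_deaths(cmd, runs):
    """Run `cmd` `runs` times and tally how each run ended.

    Returns a Counter keyed by signal name (e.g. "SIGSEGV") for runs that
    were killed by a signal, plus the key "exited" for runs that returned
    on their own (with any exit status).
    """
    tally = Counter()
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if proc.returncode < 0:
            # On POSIX, a negative return code -N means "killed by signal N".
            tally[signal.Signals(-proc.returncode).name] += 1
        else:
            tally["exited"] += 1
    return tally
```

Hypothetical usage: count_signal_deaths(["java", "-cp", "repro.jar", "repro.Main"], 50) would launch the reproduction 50 times and report how many of those launches ended in each signal.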
jumar12:09:17

What does the JVM crash report tell you? Is there any clue what could cause the bug?

p-himik12:09:44

Just segmentation fault (core dumped), nothing else. Running it via valgrind produces an hs_err_pid...log file that doesn't seem to have any useful details. Just that the crash happened during some nth in some macroexpand.

jumar13:09:06

If you have a minimal reproducer I can try it on Monday. But I suggest asking on StackOverflow with the jvm tag and attaching the error file. There are at least a few JVM experts frequently answering questions, so they may give you some pointers on where to look and what to try.

p-himik13:09:14

I'll see if I can create a Containerfile with a reproduction.

p-himik13:09:44

Oh, how fun. Despite the JVM and the base OS being exactly the same, I cannot reproduce the issue in a container at all.

p-himik13:09:32

Huh, but I managed to crash JVM 17 now, even though it happened once in like 50 launches.
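
A crash rate of roughly once in 50 launches also puts the earlier "~20 attempts per JDK" in perspective. The sketch below works through the arithmetic, under the assumption that launches are independent and each one crashes with the same fixed probability; the 1-in-50 figure is a very rough estimate taken from a single observed crash. The function names are illustrative.

```python
import math


def prob_no_crash(crash_rate, attempts):
    """Probability of seeing zero crashes in `attempts` independent launches."""
    return (1.0 - crash_rate) ** attempts


def attempts_needed(crash_rate, confidence=0.95):
    """Smallest number of launches that shows at least one crash
    with probability >= `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - crash_rate))


if __name__ == "__main__":
    # With a 2% crash rate, 20 clean launches are the expected outcome
    # about two times out of three.
    print(round(prob_no_crash(0.02, 20), 2))   # 0.67
    # With a 70% crash rate, 20 clean launches are practically impossible.
    print(prob_no_crash(0.70, 20) < 1e-10)     # True
    # Launches needed to be 95% sure of catching a 2% crash at least once.
    print(attempts_needed(0.02))               # 149
```

Under these assumptions, 20 clean launches cannot distinguish "never crashes" from "crashes about 2% of the time", which is consistent with JVM 17 eventually crashing.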

p-himik14:09:45

Something potentially interesting in the core dump from JVM 17:

#0  0x00007fb9cfc8a8d4 in JVM_handle_linux_signal () from /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
#1  <signal handler called>
#2  0x00007fb9cfc8a8d4 in JVM_handle_linux_signal () from /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
#3  <signal handler called>
#4  0x00007fb9cfc8a8d4 in JVM_handle_linux_signal () from /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
#5  <signal handler called>
[...]
#1052 0x00007fb9cfc8a8d4 in JVM_handle_linux_signal () from /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
#1053 <signal handler called>
#1054 0x00007fb9cfc8a8d4 in JVM_handle_linux_signal () from /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
#1055 <signal handler called>
#1056 0x00007fb9cfc8a8d4 in JVM_handle_linux_signal () from /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
#1057 <signal handler called>
#1058 0x00007fb9b907beaf in ?? ()
#1059 0x00007fb9ced46430 in ?? ()
#1060 0x00007fb9c8016af0 in ?? ()
#1061 0x0000000800ccb390 in ?? ()
#1062 0x00007fb9ced46340 in ?? ()
#1063 0x00007fb9ced46330 in ?? ()
#1064 0x00007fb9cf9f4f3a in ?? () from /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
That [...] is just repeating calls to JVM_handle_linux_signal. So it's apparent that JVM_handle_linux_signal is triggered from within itself - that is, a signal is raised during its own execution. Sounds like the memory somehow becomes corrupted?..

p-himik18:09:00

Hmm, and so it doesn't crash in Docker but does crash on my native OS and in VirtualBox...

jumar03:09:10

Did you ask on SO to get more advice? Maybe https://www.youtube.com/watch?v=jd6dJa7tSNU could be useful for you. In particular, it discusses various fields of the siginfo structure and also https://github.com/AdoptOpenJDK/openjdk-jdk11u/blob/5f01925b80ed851b133ee26fbcb07026ac04149e/src/hotspot/cpu/x86/assembler_x86.hpp#L99-L106. The presenter is also active on SO (as apangin) and often provides excellent advice.

p-himik07:09:12

> Did you ask on SO to get more advice?
So far, the situation has so few details and the reproduction is so flaky that I'm 95% certain that the question will simply be closed. But I will ask it once I stop coming up with new ideas. Another thing I'm almost certain of is that it's not signals - I've already debugged that into oblivion. And it's definitely not due to calling a native method. Because I don't call them. :) I just load a .so file without doing anything else related to native code.

Wanja Hentze08:09:07

can you run nm on the library and show the result?

Wanja Hentze08:09:16

or even better, objdump -x

Wanja Hentze08:09:40

the output can be quite large so should probably go in a pastebin or sth

Wanja Hentze08:09:53

maybe the library has some strong symbols that, after loading it, take precedence over stuff that's normally a weak symbol

p-himik09:09:39

What do you mean by "strong symbols"? How would anything from a dynamic library take precedence over stuff outside of it given that you have to do an explicit symbol lookup upon the handle of that library to get a particular symbol?

Wanja Hentze09:09:47

a strong symbol is any symbol that is not a weak symbol (https://en.wikipedia.org/wiki/Weak_symbol).

Wanja Hentze09:09:37

> How would anything from a dynamic library take precedence over stuff outside of it given that you have to do an explicit symbol lookup upon the handle of that library to get a particular symbol?
By loading it with RTLD_GLOBAL. I have no idea whether System/load does that, though.

Wanja Hentze09:09:25

and then looking it up with dlsym(RTLD_DEFAULT)

p-himik10:09:56

Thanks for that link, I've never heard about it before. But I should've specified what I meant by "stripping" in the OP. I used strip --strip-all, so both nm and objdump now say that there are no symbols at all.

Wanja Hentze10:09:36

Ah, but I think I was mistaken too

p-himik10:09:31

But also, given what you describe and what the Wiki article says, weak vs. strong makes sense only when you still link against the library at link time, even if the library is dynamic. But in my case, the library is never linked against. It's just loaded with System/load, so there are no symbol look-ups happening at all.

Wanja Hentze10:09:58

the case I was describing should not actually happen with weak symbols. But it would happen with undefined symbols

Wanja Hentze10:09:58

e.g.
1. foo is undefined in the global symbol namespace
2. we load a library that defines it using RTLD_GLOBAL
3. somebody looks it up and does something depending on whether it's there

Wanja Hentze10:09:08

in this case, 2. would change the behavior of a program
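
The "look it up and act on whether it's there" step of that scenario can be sketched from Python with ctypes. This assumes a POSIX system where ctypes.CDLL(None) calls dlopen(NULL): the resulting handle searches the main program, the libraries loaded at startup, and any library later loaded with RTLD_GLOBAL. It is an illustration of the mechanism only, not a claim about what System/load or the JVM actually does, and the symbol name hypothetical_fast_alloc is made up.

```python
import ctypes


def has_global_symbol(name):
    """Return True if `name` resolves in the process-wide symbol namespace."""
    # CDLL(None) wraps dlopen(NULL). Lookups through this handle also see
    # symbols from libraries that were dlopen'ed with RTLD_GLOBAL.
    process = ctypes.CDLL(None)
    try:
        getattr(process, name)  # performs a dlsym() lookup under the hood
    except AttributeError:
        return False
    return True


def pick_allocator_name():
    """Hypothetical feature probe: behavior depends on a symbol's presence."""
    # If some library loaded with RTLD_GLOBAL ever defined this made-up
    # symbol, the result would silently flip from "default" to "custom".
    if has_global_symbol("hypothetical_fast_alloc"):
        return "custom"
    return "default"


if __name__ == "__main__":
    print(has_global_symbol("malloc"))
    print(pick_allocator_name())
```

A library loaded with RTLD_LOCAL would not affect such a probe, because its symbols are only reachable through its own handle.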

p-himik10:09:04

Even if the library is loaded after the program's start - in run time, with dlopen?

Wanja Hentze10:09:18

oh hmm, strip --strip-all should void all that though

Wanja Hentze10:09:34

can you do objdump -x anyway? there may not be symbols but there's still other metadata (section headers)

p-himik10:09:35

Disregard the odd extension, .so.bin - I was just playing around with Ghidra and it refuses to overwrite original files.

Wanja Hentze10:09:48

hmm, nothing quite jumps out at me

p-himik10:09:33

Thanks for checking anyway. :)