2021-02-01
Channels
- # announcements (11)
- # babashka (71)
- # beginners (34)
- # calva (25)
- # chlorine-clover (38)
- # cider (13)
- # clj-kondo (1)
- # cljsrn (2)
- # clojure (40)
- # clojure-australia (4)
- # clojure-europe (16)
- # clojure-france (3)
- # clojure-nl (4)
- # clojure-uk (16)
- # clojurescript (27)
- # conjure (2)
- # core-async (41)
- # core-logic (3)
- # cursive (1)
- # data-science (1)
- # datomic (16)
- # depstar (19)
- # emacs (7)
- # fulcro (33)
- # graalvm (4)
- # honeysql (20)
- # hugsql (4)
- # jobs (1)
- # juxt (4)
- # off-topic (48)
- # pathom (41)
- # reagent (9)
- # reitit (19)
- # remote-jobs (1)
- # shadow-cljs (20)
- # startup-in-a-month (2)
- # tools-deps (29)
- # vim (3)
- # xtdb (30)
has anyone else run into this segfault? has happened a few times now with rocksdb (6.12.7). bumping it to see if a newer version is any better
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f707d067bd0, pid=1455309, tid=2007832
#
# JRE version: OpenJDK Runtime Environment (15.0.1+9) (build 15.0.1+9)
# Java VM: OpenJDK 64-Bit Server VM (15.0.1+9, mixed mode, tiered, z gc, linux-amd64)
# Problematic frame:
# C  [librocksdbjni-linux64.so+0x2afbd0]  rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&)+0x5f0
nope, I can actually repro it on 6.15.2 reliably it seems. It really doesn't like project-many calls being made
Hey @U797MAJ8M, sounds like something's not quite right 😞 Would you be able to submit a repro?
I can try -- given the nature of these things it might be hard to do without packaging up my whole system. Are you interested in a coredump? not sure how JNI works wrt debug symbols etc. In general I'm pretty sure I am triggering it with pathom making a bunch of crux/entity and now crux/project-many calls, which I guess hammers the index/doc/tx store (rocks is being used as all 3 locally) quite hard
one open-db for a HTTP request, that gets passed to the aforementioned pathom resolvers which then make a bunch of crux calls to build the response
are those calls in serial for any one opened DB, or are you calling the same DB instance from multiple threads?
good question, I would have assumed that a single request is handled entirely by one thread (server is aleph) but I am not actually sure what promesa's default execution model looks like or what pathom3 is doing behind the scenes with it
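One way to sanity-check that empirically, without digging into promesa or pathom3 internals (a hedged sketch; the wrapper name is made up for illustration): log the current thread name around each Crux call and compare it against the thread that opened the db.
(require '[crux.api :as crux])

;; Hypothetical wrapper: prints which thread each read runs on, so you can
;; see whether all calls against one opened db stay on a single thread.
(defn entity-logged [db eid]
  (println "crux/entity on thread:" (.getName (Thread/currentThread)))
  (crux/entity db eid))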
> Are you interested in a coredump? not sure how JNI works wrt debug symbols etc.
There should be a file dumped by the JVM when it exits that way - hs_err_pid_<>.log?
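As an aside (a hedged sketch; the alias name and path are made up), the JVM's -XX:ErrorFile flag controls where those crash reports get written, which can make them easier to collect, e.g. via a deps.edn alias:
{:aliases
 {:crash-logs
  ;; %p is expanded by the JVM to the process id
  {:jvm-opts ["-XX:ErrorFile=logs/hs_err_pid%p.log"]}}}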
yup, I have 6 of them. [email protected]?
@U050V1N74: actually, ignore 7141 -- that one seems to be from a long time ago, got swallowed up by the glob
At first glance, it looks like each of the segfaults is at one of the query's first calls into RocksDB, so I'd hypothesise that it's unrelated to project-many, and would happen on any query at that point
We've seen this happen when trying to make accesses on closed DBs - could this be a possibility? Often when we've done a with-open that returns (and hence closes its resources) immediately, while its work continues on a different thread.
Unfortunately Rocks isn't very friendly when this happens - AFAIK there's not a lot Crux can do once this seg-fault is thrown
yup, I think this is definitely the case.. my big refactor moved things around to close the db before the request was fully handled!
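For what it's worth, a minimal sketch of the failure mode being described (the handler names and request shape are hypothetical, not taken from the project above): the with-open block returns before the work that reads from the db has run, so Rocks is already closed by the time the read happens.
(require '[crux.api :as crux])

;; Broken: the deferred work escapes the with-open scope, so the db is
;; closed before crux/entity runs - which can surface as a native crash.
(defn handle-request-broken [node req]
  (with-open [db (crux/open-db node)]
    (future (crux/entity db (:eid req)))))

;; Safer: force every read to complete while the db is still open.
(defn handle-request [node req]
  (with-open [db (crux/open-db node)]
    (doall (map #(crux/entity db %) (:eids req)))))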
if I can add one point of feedback: the docstring mentions that open-db must be closed, but doesn't mention the consequences of using a closed one. Doing so is of course a bad idea, but spelling out what happens would have helped me make this connection faster
Just wanted to ask about this issue: https://github.com/juxt/crux/issues/1399 I’m hitting the same thing. Even after a complete index/log/doc wipe Crux crashes after a few writes with:
"Lucene store latest tx mismatch"
On the latest version:
juxt/crux-lucene {:mvn/version "21.01-1.14.0-alpha"}
And my crux config:
(let [kv-store (fn [dir] {:crux/module 'crux.rocksdb/->kv-store
                          :db-dir (io/file (str "data/crux-db/" dir))})
      node (crux/start-node
            (merge
             {:crux.lucene/lucene-store {:db-dir "data/crux-db/lucene"}
              :crux.http-server/server {:port 4000}
              :crux/index-store {:kv-store (kv-store "index")}
              :rocksdb-golden (kv-store "tx-log-and-doc")
              :crux/document-store {:kv-store :rocksdb-golden}
              :crux/tx-log {:kv-store :rocksdb-golden}}))]
(log/info "DB Started")
;; Synchronize the node
(crux/sync node)
;; Register our transactions
(register-transactions node)
;; Return the node
node)
Have any of you seen this before?
@aleksander990 It does appear there is probably a bug in the current Lucene module for old stores. However, when I filed that bug I was able to get around it in dev by cleaning up the tx/doc/index stores + lucene store, then restarting my service. When you do this:
> Even after a complete index/log/doc wipe Crux crashes after a few writes
...are you also wiping the lucene-store?
Yeah I did wipe all of them. But it's been working for the last 30 minutes or so. I'll keep an eye on it over the next few days though
Hmm. Curious. If you do find you get a rogue tx mismatch after wiping all the stores, that's a separate (though perhaps related) bug from #1399, so it would be good to capture it as well.
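For anyone wiping the stores by hand, a rough sketch against the directory layout in the config above (the delete helper is my own, not part of the Crux API; stop the node before deleting): the index, tx-log/doc, and Lucene directories should go together, otherwise they can end up out of sync with each other.
(require '[clojure.java.io :as io])

(defn delete-recursively! [dir]
  (let [f (io/file dir)]
    (when (.exists f)
      ;; file-seq is parent-first, so reverse it to delete children first
      (doseq [child (reverse (file-seq f))]
        (io/delete-file child true)))))

;; Wipe all three stores from the config above in one go.
(doseq [dir ["data/crux-db/index"
             "data/crux-db/tx-log-and-doc"
             "data/crux-db/lucene"]]
  (delete-recursively! dir))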
Hi @aleksander990 this is fixed in https://github.com/juxt/crux/pull/1406, we will get a release out soon that addresses this.