xtdb 2021-11-10 | Slack Archive

tatut 06:11:00

I have a problem starting XTDB in ECS from a checkpoint... I see logs saying it is restoring from the checkpoint, but then it hangs before returning from `xt/start-node`. I don't know if it is just slow. I have the grace period set to 300 seconds, and it doesn't start in that time

tatut 07:11:21

{ "context": "default", "level": "INFO", "logger": "xtdb.tx", "message": "Started tx-ingester", "thread": "main", "timestamp": "2021-11-10T06:52:16.430Z" }
[2021-11-10T08:52:17+02:00]
{ "context": "default", "level": "DEBUG", "logger": "xtdb.hash", "message": "Using libgcrypt for ID hashing.", "thread": "main", "timestamp": "2021-11-10T06:52:17.578Z" }
{ "context": "default", "level": "DEBUG", "logger": "xtdb.lucene", "message": "Committing Lucene IndexWriter...", "thread": "xtdb-lucene-fsync-1", "timestamp": "2021-11-10T06:54:18.635Z" }
{ "context": "default", "level": "DEBUG", "logger": "xtdb.lucene", "message": "Committed Lucene IndexWriter.", "thread": "xtdb-lucene-fsync-1", "timestamp": "2021-11-10T06:54:18.635Z" }

tatut 07:11:27

those are the last logs; 4 minutes later ECS just kills it and starts another one. Perhaps I'll just increase the startup time yet again

tatut 07:11:06

I tried doubling the grace period to 10 minutes; it still doesn't start in that time

tatut 08:11:59

I can verify that this happens locally as well. I configured my local env to use the same S3 checkpoint bucket and tx-log/doc store, and starting up the node hangs after it has downloaded the checkpoint... there's no CPU usage to speak of, so at least it isn't doing anything intensive

tatut 08:11:42

the LMDB index directory seems to be stable and the correct size, but the size of the Lucene folder fluctuates weirdly and never seems to reach the size of the snapshot

refset 08:11:53

A profiler would probably help here (YourKit is pretty great if you haven't tried it), but just to rule this out, could you try increasing the Lucene refresh-frequency as per https://docs.xtdb.com/extensions/full-text-search/#_parameters
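
For reference, a minimal sketch of where that parameter sits in an EDN node config, assuming the XTDB 1.x Lucene module's `:xtdb.lucene/lucene-store` component and Duration values given as ISO-8601 strings. The `:db-dir` path and the `"PT1M"` value are illustrative assumptions, not values from this thread:

```clojure
;; Sketch of an XTDB 1.x node config fragment (EDN); values are illustrative.
{:xtdb.lucene/lucene-store
 {;; directory holding the Lucene index (hypothetical path)
  :db-dir "data/lucene"
  ;; a longer interval means fewer Lucene refreshes while txes are replayed
  ;; (assumed value, tune for your workload)
  :refresh-frequency "PT1M"}}
```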

tatut 08:11:02

looks like it grows by a few megabytes and then drops back to a much smaller size

tatut 08:11:44

it doesn't look like Lucene uses the checkpoint: the folder should be over 40M, but it is only a few MB and growing slowly

tatut 08:11:37

I can confirm that the Lucene module must be the culprit here... I just tried commenting out the Lucene module completely, and startup continues immediately after the download is complete

tatut 08:11:56

changing the refresh-frequency didn't fix the Lucene checkpoint issue

tatut 09:11:32

`cp/try-restore` is never called from the Lucene module

tatut 09:11:51

I'll see if adding that works

refset 09:11:16

> the `cp/try-restore` is never called from lucene module

oh, oops! I agree this looks to be missing, when compared to https://github.com/xtdb/xtdb/blob/e2f51ed99fc2716faa8ad254c0b18166c937b134/core/src/xtdb/mem_kv.clj#L134-L135
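
To make the issue concrete, here is a sketch of the kind of config involved: a Lucene store with its own S3-backed checkpointer. It assumes the XTDB 1.x `xtdb.checkpoint/->checkpointer` and `xtdb.s3.checkpoint/->cp-store` modules; the bucket name, path and frequency are hypothetical. Per this thread, a checkpoint like this gets downloaded on startup, but the Lucene index is only seeded from it if the module calls `cp/try-restore`, which is what the fix adds:

```clojure
;; Sketch (XTDB 1.x, EDN): Lucene store with an S3-backed checkpointer.
;; Bucket name, path and frequency are hypothetical placeholders.
{:xtdb.lucene/lucene-store
 {:db-dir "data/lucene"
  :checkpointer {:xtdb/module xtdb.checkpoint/->checkpointer
                 :store {:xtdb/module xtdb.s3.checkpoint/->cp-store
                         :bucket "my-checkpoint-bucket"}
                 ;; roughly how often a new checkpoint is taken (illustrative)
                 :approx-frequency "PT6H"}}}
```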

tatut 09:11:05

added that locally, and startup is fast

tatut 09:11:11

should I do a PR?

refset 09:11:15

increasing the refresh-frequency will almost certainly help with the replay speed, though

refset 09:11:51

sure, please, that will put it firmly on the radar 🙂

tatut 09:11:10

https://github.com/xtdb/xtdb/pull/1659

refset 09:11:15

thanks! I won't merge the PR myself now, but I will make sure it gets some brain cycles soon

refset 09:11:17

Would you be okay signing the CLA PDF and emailing it to us? Instructions here: https://github.com/xtdb/xtdb/blob/master/CONTRIBUTING.adoc#how-to-contribute

tatut 09:11:56

probably fine, I'll look it over

tatut 10:11:08

signed and emailed

tatut 10:11:19

there don't seem to be deps.edn files for the modules... that would make it much easier to use forked versions with fixes as git deps, without waiting for an official release

refset 12:11:59

noted, feel free to open an issue (not a PR 😅) for that too. I'm not really sure what it would entail, but we currently make rather extensive use of lein features. In the meantime, we can also publish a snapshot release very easily once the fix is merged, if that helps at all

tatut 12:11:23

yeah, I added that in the PR, but I'll revert it... it wasn't as convenient, tbh

tatut 12:11:40

as afaict you can't use git deps that live in a subfolder of a repo

Jacob O'Bryant 01:11:07

You can, by adding a `:deps/root` key
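
A sketch of what that looks like in a consumer's `deps.edn`, assuming a fork that adds a `deps.edn` inside the module's folder (the upstream modules don't have one, as tatut noted above). The lib name, fork URL, commit SHA and `modules/lucene` path are hypothetical placeholders:

```clojure
;; deps.edn sketch: consume one module from a subfolder of a git repo.
;; Lib name, URL, SHA and path are hypothetical placeholders.
{:deps {com.example/xtdb-lucene-fork
        {:git/url "https://github.com/example/xtdb"
         :git/sha "<full-40-char-commit-sha>"
         ;; tells tools.deps which subfolder holds the module's own deps.edn
         :deps/root "modules/lucene"}}}
```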

tatut 05:11:24

good to know, I didn't see that in the deps docs

Brandon Olivier 20:11:23

Has anyone seen issues with xtdb, rocksdb, and last year’s M1 Mac? I keep getting java errors, and can’t find much info on google about it.

Brandon Olivier 20:11:53

When I try to follow the in-memory tutorial, I get this error:

Caused by java.lang.UnsatisfiedLinkError
   'long org.rocksdb.LRUCache.newLRUCache(long, int, boolean, double)'

R.A. Porter 20:11:32

https://github.com/xtdb/xtdb/issues/1518

Brandon Olivier 20:11:00

Maybe I’m too dumb to perceive it, but is there a workaround mentioned here?

R.A. Porter 20:11:56

It's not obvious. I just clicked through to the linked RocksDB issue, and it looks like the suggestion there (which is echoed in the issue above, though I didn't realize it at first) is to run an x86_64 JDK under Rosetta. Someone with more knowledge will hopefully chime in, as I have no first-hand experience.

dvingo 21:11:51

I've used sdkman successfully to make switching JDKs easy https://itnext.io/how-to-install-x86-and-arm-jdks-on-the-mac-m1-apple-silicon-using-sdkman-872a5adc050d

Steven Deobald 22:11:01

@UJL94RYSW I commented on the issue thread for future users, but switching to an x86 JDK is the current workaround, as other folks have suggested. The Java bindings for RocksDB apparently need very little work to run on ARM... they just haven't been released yet.
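
When juggling JDKs, a quick REPL check (a sketch, not from the thread) confirms which architecture the running JVM was built for, using the standard `os.arch` and `java.version` system properties. On an M1 Mac, an ARM JDK typically reports `aarch64`, while an Intel JDK running under Rosetta typically reports `x86_64`:

```clojure
;; Print the architecture and version of the JVM this REPL is running on.
;; "aarch64" suggests a native ARM JDK (no RocksDB JNI binaries, per this thread);
;; "x86_64" suggests an Intel JDK, e.g. one running under Rosetta.
(println "JVM os.arch:" (System/getProperty "os.arch"))
(println "JVM version:" (System/getProperty "java.version"))
```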

Brandon Olivier 22:11:24

Thanks y’all 🙂
