2023-03-29
Channels
- # admin-announcements (1)
- # announcements (20)
- # babashka (43)
- # beginners (134)
- # calva (2)
- # clerk (7)
- # cljdoc (9)
- # clojars (8)
- # clojure (91)
- # clojure-europe (21)
- # clojure-nl (1)
- # clojure-norway (13)
- # clojure-uk (1)
- # clojurescript (5)
- # datahike (3)
- # docker (2)
- # emacs (6)
- # fulcro (7)
- # graphql (9)
- # honeysql (24)
- # improve-getting-started (5)
- # introduce-yourself (1)
- # lambdaisland (1)
- # luminus (3)
- # malli (3)
- # nbb (19)
- # off-topic (22)
- # pathom (1)
- # portal (3)
- # practicalli (1)
- # rdf (26)
- # reagent (29)
- # reitit (9)
- # shadow-cljs (15)
- # spacemacs (3)
- # sql (4)
- # tools-build (30)
- # xtdb (41)
# xtdb
We're on version 1.23.1 and got this in our test environment during the night:
level: WARN, logger: "xtdb.checkpoint", message: "Checkpointing failed", exception:
java.nio.file.FileAlreadyExistsException: /path/_57b4.si
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:94)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:218)
at java.base/java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:484)
at java.base/java.nio.file.Files.newOutputStream(Files.java:228)
at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:410)
at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:406)
at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:254)
at org.apache.lucene.store.Directory.copyFrom(Directory.java:183)
at xtdb.lucene$checkpoint_src$reify_4815$fn_4816.invoke(lucene.clj:291)
at xtdb.lucene$checkpoint_src$reify__4815.save_checkpoint(lucene.clj:289)
at xtdb.checkpoint$checkpoint.invokeStatic(checkpoint.clj:44) ...
What could cause it? Improper cleanup of a previous checkpoint, or something else? There was no actual activity on the node.
Hi @U11SJ6Q0K, I don't recognise this in particular, but a brief search for that error suggests there may be some filesystem issue:
1. out of disk: https://github.com/oracle/opengrok/issues/3131#issuecomment-621004281
2. max open files: https://community.dremio.com/t/filealreadyexistsexception-in-lucenesearchindex-commit/4463
Do you have any environment monitoring that could confirm the status of those two things?
Was the other error also for Lucene? (i.e. the exact same stack trace, aside from the path)
Shouldn't be out of space; there's more than double the used space free… so running out of file handles is one contender
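For reference, both suspects can be checked from inside the running JVM. A minimal sketch in Clojure (the /path directory is a placeholder, and the file-descriptor counters assume a Unix host):

(import '(java.lang.management ManagementFactory)
        '(com.sun.management UnixOperatingSystemMXBean)
        '(java.nio.file Files Paths))

;; free vs. total space on the filesystem holding the checkpoint directory
(let [store (Files/getFileStore (Paths/get "/path" (into-array String [])))]
  {:usable-gb (quot (.getUsableSpace store) (* 1024 1024 1024))
   :total-gb  (quot (.getTotalSpace store) (* 1024 1024 1024))})

;; open vs. max file descriptors for this JVM process (Unix-only MXBean)
(let [^UnixOperatingSystemMXBean os (ManagementFactory/getOperatingSystemMXBean)]
  {:open-fds (.getOpenFileDescriptorCount os)
   :max-fds  (.getMaxFileDescriptorCount os)})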
ok, now I’m seeing it again in another environment… we started using the S3 transfer manager CRT, don’t know if that is relevant
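For context, the topology under discussion is a Lucene store whose checkpoints are written to S3. A rough sketch of that part of the node config (bucket, prefix and frequency are placeholders; the ->cp-store constructor name for the transfer-manager namespace is an assumption modelled on the plain xtdb.s3.checkpoint module):

;; Lucene module with an S3-backed checkpointer (illustrative keys and values only)
{:xtdb.lucene/lucene-store
 {:db-dir "/path/lucene"
  :checkpointer {:xtdb/module 'xtdb.checkpoint/->checkpointer
                 :approx-frequency "PT6H"               ; placeholder
                 ;; previously the plain 'xtdb.s3.checkpoint/->cp-store;
                 ;; the switch discussed here is to the transfer-manager variant
                 :store {:xtdb/module 'xtdb.s3.checkpoint-transfer-manager/->cp-store ; assumed name
                         :bucket "my-checkpoint-bucket" ; placeholder
                         :prefix "lucene/"}}}}          ; placeholder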
There are some Lucene checkpoint .edn files + folders present in the S3 bucket, but it seems they aren't clean, as the node starts indexing very old transactions…
Added some more logging, and yes: when starting, the node restores from the RocksDB checkpoint, but I'm not seeing a Lucene restoration.
Then I added a startup monitor that polls submitted/completed tx status, and it indeed goes through all transactions on startup (I'm guessing on the Lucene side only, as RocksDB was restored).
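A monitor like that can be written against the public API. A minimal sketch (the node handle and poll interval are illustrative):

(require '[xtdb.api :as xt])

;; periodically log how far ingestion has caught up to the latest submitted tx
(defn log-ingest-progress [node]
  (future
    (loop []
      (let [submitted (xt/latest-submitted-tx node)
            completed (xt/latest-completed-tx node)]
        (println "submitted tx-id:" (:xtdb.api/tx-id submitted)
                 "completed tx-id:" (:xtdb.api/tx-id completed)))
      (Thread/sleep 10000)
      (recur))))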
In our dev env I have a checkpoint file in the bucket “checkpoint-6011-2023-04-04T06:11:04.301-00:00.edn” containing:
{:xtdb.checkpoint/cp-format "lucene-8", :tx {:xtdb.api/tx-id 6011}, :xtdb.s3.checkpoint-transfer-manager/s3-dir "checkpoint-6011-2023-04-04T06:11:04.301-00:00/", :xtdb.checkpoint/checkpoint-at #inst "2023-04-04T06:11:04.301-00:00"}
The referenced folder exists and has files; is there anything more I can do to check its validity?
I made my own patched version of the checkpoint-transfer-manager ns with the prefix fix; now restoring and startup are fast… I can circle back and look into the initial problem of the Lucene checkpoint failure.
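To sanity-check a checkpoint entry like the one above, the .edn metadata can be read back and the referenced S3 prefix listed, to confirm the Lucene index files (segments_N plus the .si/.cfs files it references, as seen in the stack traces) are all present. A minimal sketch (assumes the AWS SDK v2 S3 client is on the classpath; the bucket name and the local copy of the .edn are placeholders):

(require '[clojure.edn :as edn])
(import '(software.amazon.awssdk.services.s3 S3Client)
        '(software.amazon.awssdk.services.s3.model ListObjectsV2Request))

;; the checkpoint metadata, copied locally from the bucket root
(def cp (edn/read-string (slurp "checkpoint-6011-2023-04-04T06:11:04.301-00:00.edn")))

;; list everything under the referenced s3-dir
(def s3 (S3Client/create))
(->> (.listObjectsV2 s3 (-> (ListObjectsV2Request/builder)
                            (.bucket "my-checkpoint-bucket") ; placeholder
                            (.prefix (:xtdb.s3.checkpoint-transfer-manager/s3-dir cp))
                            (.build)))
     (.contents)
     (map #(.key %)))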
More weirdness (same 1.23.1 version): now it finds the checkpoint with the patch… but the Lucene restore fails:
{ "level": "INFO", "logger": "xtdb.checkpoint", "message": "restoring from {:xtdb.checkpoint/cp-format \"lucene-8\", :tx {:xtdb.api/tx-id 20612}, :xtdb.s3.checkpoint-transfer-manager/s3-dir \"checkpoint-20612-2023-04-10T22:29:52.972-00:00/\", :xtdb.checkpoint/checkpoint-at #inst \"2023-04-10T22:29:52.972-00:00\"} to /path/lucene", "thread": "main", "timestamp": "2023-04-11T13:17:00.056Z" }
Caused by: org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/path/lucene/segments_2")))
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:293)
at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:165)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1089)
at xtdb.lucene$__GT_index_writer.invokeStatic(lucene.clj:62)
at xtdb.lucene$__GT_index_writer.invoke(lucene.clj:59)
at xtdb.lucene$__GT_lucene_store.invokeStatic(lucene.clj:362)
at xtdb.lucene$__GT_lucene_store.invoke(lucene.clj:338)
at xtdb.system$start_system$fn__35087.invoke(system.clj:211)
... 43 more
Caused by: java.nio.file.NoSuchFileException: /path/lucene/_129.si
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181)
at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298)
at java.base/java.nio.channels.FileChannel.open(FileChannel.java:357)
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:238)
at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:157)
at org.apache.lucene.codecs.lucene86.Lucene86SegmentInfoFormat.read(Lucene86SegmentInfoFormat.java:91)
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357)
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291)
... 50 more
Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (2eb369c7). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/path/lucene/segments_2")))
at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:466)
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:434)
... 51 more
I had to delete the Lucene checkpoint to get the node to start… I wonder if there is something I can do myself to fix it; did it try to read it too early, or something?
I haven't seen this before. Which filesystem are you using? Is it a regular local SSD mount?
The checkpointing has worked before, but these Lucene checkpoint problems (both the failed checkpoint, and now this failed restoration) have been happening since we switched to the S3 transfer manager stuff.
I did check that the checkpoint it wanted to use did not fail; I can see the corresponding checkpoint-upload logs from when it was created.
hmm, perhaps we've missed some key detail in the transfer manager implementation :thinking_face:
> I had to delete lucene checkpoint to get node to start
to confirm, you mean the checkpoint that was "successfully" created was consistently not restoring? and producing the same error?
It sounds like Lucene's write-once file policy (https://stackoverflow.com/a/6231324) isn't being respected, but I don't see how the switch to using the transfer manager could be causing that to happen all of a sudden.
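For what it's worth, that write-once behaviour is visible in the same FSDirectory code path as the original failure: createOutput opens index files for exclusive creation, so writing a file name that already exists throws exactly the FileAlreadyExistsException from the first stack trace. A small sketch (the directory and file name are placeholders):

(import '(org.apache.lucene.store FSDirectory IOContext)
        '(java.nio.file Paths))

(with-open [dir (FSDirectory/open (Paths/get "/tmp/lucene-demo" (into-array String [])))]
  ;; first write succeeds...
  (.close (.createOutput dir "_57b4.si" IOContext/DEFAULT))
  ;; ...a second createOutput for the same name throws FileAlreadyExistsException
  (.createOutput dir "_57b4.si" IOContext/DEFAULT))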
I’ll do some more shutdown/restart tests tomorrow and see if this happens consistently, and try to get a shell to the machine when it happens
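With a shell on the machine, Lucene's bundled CheckIndex can report whether a restored directory is self-consistent, e.g. whether segments_2 references an _129.si that never made it down. A minimal sketch against the path from the logs above (run while the node is stopped, since CheckIndex takes the index write lock):

(import '(org.apache.lucene.index CheckIndex)
        '(org.apache.lucene.store FSDirectory)
        '(java.nio.file Paths))

(with-open [dir   (FSDirectory/open (Paths/get "/path/lucene" (into-array String [])))
            check (CheckIndex. dir)]
  (.setInfoStream check System/out)          ; print a per-segment report
  (let [status (.checkIndex check)]
    (println "index clean?" (.-clean status))))

The same check is available from the command line via the org.apache.lucene.index.CheckIndex main class pointed at the index directory.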
It is a problem that it completely prevents the node from starting; good thing ECS keeps the old node running until a new one is up.
please do keep me/us in the loop, and I'll give it some thought again later. Feel free to open an issue also if that's easier for you to track
Will do, I'll open an issue after some trials… I want to make sure my patched build isn't causing the trouble first.
I don’t know if this is related, but 2 nodes started, both said they restored from checkpoints (rocksdb+lucene), but the node still took ~12 minutes to “start” (after saying “Started tx-ingester” it took 700+ seconds before the next startup log). I was able to verify that a) the checkpoint was downloaded (files were present in the node), b) the node was doing something heavy (150% CPU use on the node and load on the PostgreSQL database that stores our tx-log/docs)