#xtdb
2023-03-29
tatut09:03:00

we’re on version 1.23.1 and got this in our test environment during the night

level: WARN, logger: "xtdb.checkpoint", message: "Checkpointing failed", exception:
java.nio.file.FileAlreadyExistsException: /path/_57b4.si
  at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:94)
  at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
  at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
  at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:218)
  at java.base/java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:484)
  at java.base/java.nio.file.Files.newOutputStream(Files.java:228)
  at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:410)
  at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:406)
  at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:254)
  at org.apache.lucene.store.Directory.copyFrom(Directory.java:183)
  at xtdb.lucene$checkpoint_src$reify_4815$fn_4816.invoke(lucene.clj:291)
  at xtdb.lucene$checkpoint_src$reify__4815.save_checkpoint(lucene.clj:289)
  at xtdb.checkpoint$checkpoint.invokeStatic(checkpoint.clj:44) ...
what could cause it? improper cleanup of previous checkpoint or something else? there was no actual activity on the node

tatut09:03:13

I’m seeing another one from earlier with a different path, but exactly the same problem

refset11:03:04

Hi @U11SJ6Q0K I don't recognise this in particular, but a brief search for that error suggests there may be some filesystem issue:
1. out of disk: https://github.com/oracle/opengrok/issues/3131#issuecomment-621004281
2. max open files: https://community.dremio.com/t/filealreadyexistsexception-in-lucenesearchindex-commit/4463
Do you have any environment monitoring that could confirm the status of those two things?

refset11:03:35

was the other error also for Lucene? (i.e. the exact same stack trace, aside from the path)

tatut11:03:42

yes both were lucene

👍 2
tatut11:03:59

sadly I don’t have the environment anymore, ECS killed the unhealthy node

tatut11:03:48

but those are good leads, I’ll try to investigate

blob_thumbs_up 2
🤞 2
tatut11:03:50

shouldn’t be out of space, there’s more than double the used space free… so running out of file handles is one contender
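A minimal sketch of how the file-handle theory could be checked from a REPL on the node. It assumes a HotSpot JVM on Linux, where the operating-system MXBean implements UnixOperatingSystemMXBean; this is plain JVM introspection, not an XTDB API:

(import '(java.lang.management ManagementFactory)
        '(com.sun.management UnixOperatingSystemMXBean))

(defn fd-usage
  "Open vs. max file descriptors for this JVM process (Unix-like JVMs only)."
  []
  (let [os-bean (ManagementFactory/getOperatingSystemMXBean)]
    (when (instance? UnixOperatingSystemMXBean os-bean)
      {:open (.getOpenFileDescriptorCount ^UnixOperatingSystemMXBean os-bean)
       :max  (.getMaxFileDescriptorCount ^UnixOperatingSystemMXBean os-bean)})))

;; (fd-usage) => e.g. {:open 412, :max 65536}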

refset23:03:20

Hey Tatu, did you have any joy investigating further?

tatut03:04:10

not yet, hasn’t happened again

👍 2
tatut05:04:13

ok, now I’m seeing it again in another environment… we started using the CRT-based S3 transfer manager, don’t know if that is relevant

tatut05:04:15

and it seems to always be the Lucene checkpoint that fails

tatut05:04:47

there are some Lucene checkpoint .edn files + folders present in the S3 bucket, but it seems they aren’t clean, as the node starts indexing very old transactions…

tatut08:04:14

added some more logging, and yes: on startup the node restores from the RocksDB checkpoint, but I’m not seeing the Lucene restoration

tatut08:04:47

then I added a startup monitor that polls the submitted/completed tx status, and it indeed goes through all transactions on startup (I’m guessing on the Lucene side only, as RocksDB was restored)
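A rough sketch of what such a startup monitor could look like, using the public xtdb.api functions; `node` is assumed to be the already-started node, and the 5-second poll interval is arbitrary:

(require '[xtdb.api :as xt])

(defn log-catch-up! [node]
  (loop []
    (let [submitted (:xtdb.api/tx-id (xt/latest-submitted-tx node))
          completed (:xtdb.api/tx-id (xt/latest-completed-tx node))]
      (println "indexed tx" completed "of" submitted)
      ;; keep polling until indexing has caught up with the tx-log
      (when (and submitted (or (nil? completed) (< completed submitted)))
        (Thread/sleep 5000)
        (recur)))))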

tatut08:04:51

is there anything else to check? I have a Lucene checkpoint but it isn’t being used

tatut08:04:14

In our dev env I have a checkpoint file in the bucket “checkpoint-6011-2023-04-04T06:11:04.301-00:00.edn” containing:

{:xtdb.checkpoint/cp-format "lucene-8", :tx {:xtdb.api/tx-id 6011}, :xtdb.s3.checkpoint-transfer-manager/s3-dir "checkpoint-6011-2023-04-04T06:11:04.301-00:00/", :xtdb.checkpoint/checkpoint-at #inst "2023-04-04T06:11:04.301-00:00"}
the referenced folder exists and has files; is there anything more I can do to check its validity?
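One generic option (Lucene’s own validator, not an XTDB tool): download the checkpoint folder locally and run CheckIndex over it. A minimal sketch, with "/tmp/lucene-cp" as a placeholder for wherever the checkpoint was extracted:

(import '(org.apache.lucene.index CheckIndex)
        '(org.apache.lucene.store FSDirectory)
        '(java.nio.file Paths))

(with-open [dir (FSDirectory/open (Paths/get "/tmp/lucene-cp" (make-array String 0)))
            checker (CheckIndex. dir)]
  ;; print per-segment diagnostics and report whether the index is intact
  (.setInfoStream checker System/out)
  (println "clean?" (.-clean (.checkIndex checker))))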

tatut10:04:19

opened issue #1927; it seems that the prefix isn’t used, if I analyzed it correctly

tatut10:04:29

so none of the checkpoints were being used

tatut12:04:04

I made my own patched version of the checkpoint-transfer-manager ns with the prefix fix, and now restoring works and startup is fast… I can circle back and look into the initial problem of the Lucene checkpoint failure

refset15:04:23

ah, good spot, sorry about that one :man-facepalming:

tatut13:04:08

more weirdness (same 1.23.1 version): now it finds the checkpoint with the patch… but the Lucene restore fails

tatut13:04:28

{ "level": "INFO", "logger": "xtdb.checkpoint", "message": "restoring from {:xtdb.checkpoint/cp-format \"lucene-8\", :tx {:xtdb.api/tx-id 20612}, :xtdb.s3.checkpoint-transfer-manager/s3-dir \"checkpoint-20612-2023-04-10T22:29:52.972-00:00/\", :xtdb.checkpoint/checkpoint-at #inst \"2023-04-10T22:29:52.972-00:00\"} to /path/lucene", "thread": "main", "timestamp": "2023-04-11T13:17:00.056Z" }

 Caused by: org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/path/lucene/segments_2")))
 	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:293)
 	at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:165)
 	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1089)
 	at xtdb.lucene$__GT_index_writer.invokeStatic(lucene.clj:62)
 	at xtdb.lucene$__GT_index_writer.invoke(lucene.clj:59)
 	at xtdb.lucene$__GT_lucene_store.invokeStatic(lucene.clj:362)
 	at xtdb.lucene$__GT_lucene_store.invoke(lucene.clj:338)
 	at xtdb.system$start_system$fn__35087.invoke(system.clj:211)
 	... 43 more
 Caused by: java.nio.file.NoSuchFileException: /path/lucene/_129.si
 	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
 	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
 	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
 	at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181)
 	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298)
 	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:357)
 	at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:238)
 	at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:157)
 	at org.apache.lucene.codecs.lucene86.Lucene86SegmentInfoFormat.read(Lucene86SegmentInfoFormat.java:91)
 	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357)
 	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291)
 	... 50 more
 	Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (2eb369c7). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/path/lucene/segments_2")))
 		at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:466)
 		at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:434)
 		... 51 more
I had to delete the Lucene checkpoint to get the node to start… I wonder if there is something I can do myself to fix it; did it try to read it too early or something?

tatut13:04:41

fwiw, the rocksdb restore did work

refset13:04:22

I haven't seen this before. Which filesystem are you using? Is it a regular local SSD mount?

tatut13:04:52

regular local volume

👍 2
tatut13:04:18

the local ECS disk

👍 2
tatut13:04:05

the checkpointing has worked before, but these Lucene checkpoint problems (both the checkpoint failure and now this restoration failure) have been happening since we switched to the S3 transfer manager stuff

tatut13:04:32

I did check that the checkpoint it wanted to use did not fail; I can see the corresponding “checkpoint uploaded” logs from when it was created

refset13:04:18

hmm, perhaps we've missed some key detail in the transfer manager implementation :thinking_face:

refset13:04:56

> I had to delete lucene checkpoint to get node to start
to confirm, you mean the checkpoint that was "successfully" created was consistently not restoring? and producing the same error?

tatut13:04:23

well consistently meaning both nodes that tried it failed the same way

👍 2
refset14:04:46

It sounds like Lucene's write-once behaviour (https://stackoverflow.com/a/6231324) isn't being respected, but I don't see how the switch to using the transfer manager could be causing that to happen all of a sudden

tatut14:04:52

I’ll do some more shutdown/restart tests tomorrow and see if this happens consistently, and try to get a shell to the machine when it happens

🙏 2
tatut14:04:53

it is a problem that it completely prevents the node from starting; good thing ECS keeps the old node until a new one is up

refset14:04:53

please do keep me/us in the loop, and I'll give it some thought again later. Feel free to open an issue also if that's easier for you to track

tatut14:04:44

will do, I’ll open an issue after some trials… I want to make sure my patched build isn’t causing the trouble first

👍 2
tatut04:04:26

I don’t know if this is related, but 2 nodes started, both said they restored from checkpoints (RocksDB + Lucene), yet each node still took ~12 minutes to “start” (after logging “Started tx-ingester” it took 700+ seconds before the next startup log). I was able to verify that a) the checkpoint was downloaded (the files were present on the node), and b) the node was doing something heavy (150% CPU use on the node and load on the PostgreSQL database that stores our tx-log/docs)

tatut04:04:25

but it wasn’t a full reindex, as startup would take over 45 minutes in that case

tatut04:04:55

ah, looking at it more closely, it was that the Lucene checkpoint was older… it seems that only the 1st Lucene checkpoint was created successfully; all subsequent checkpoints on the node failed with the same error as at the start of this thread (FileAlreadyExistsException)

tatut05:04:20

I think I’ll try setting :keep-dir-between-checkpoints? false to ensure it is cleaned up
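For reference, a rough sketch of where that option would sit in the Lucene module's checkpointer config (the map passed to xt/start-node). The bucket and prefix values are placeholders, and the transfer-manager store module symbol is an assumption inferred from the :xtdb.s3.checkpoint-transfer-manager/s3-dir key in the .edn above:

{:xtdb.lucene/lucene-store
 {:db-dir "/path/lucene"
  :checkpointer {:xtdb/module 'xtdb.checkpoint/->checkpointer
                 :approx-frequency "PT6H"
                 ;; don't reuse the local checkpoint staging dir between runs
                 :keep-dir-between-checkpoints? false
                 ;; module symbol assumed from the checkpoint metadata namespace above
                 :store {:xtdb/module 'xtdb.s3.checkpoint-transfer-manager/->cp-store
                         :bucket "my-checkpoint-bucket"
                         :prefix "lucene/"}}}}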