#xtdb
2021-04-16
kevinmershon 15:04:57

@jarohen regarding the crux-lucene PR I opened, does the crux codebase have a preferred method of doing a debounce function? Basically, after the last tx write in a block of writes, a single commit needs to happen ~5-10 seconds later.
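(For context, a minimal debounce sketch in Clojure, not from the crux codebase, built on a ScheduledExecutorService: each call cancels the pending task and reschedules, so the wrapped fn only fires once writes have gone quiet for `delay-ms`.)

```clojure
(import '(java.util.concurrent Executors ScheduledExecutorService
                               ScheduledFuture TimeUnit))

;; Hypothetical helper: returns a fn that delays invoking `f` until
;; `delay-ms` ms have passed since the returned fn was last called.
(defn debounce [f delay-ms]
  (let [scheduler (Executors/newSingleThreadScheduledExecutor)
        pending   (atom nil)]
    (fn [& args]
      (locking pending
        (when-let [task @pending]
          (.cancel ^ScheduledFuture task false))
        (reset! pending
                (.schedule ^ScheduledExecutorService scheduler
                           ^Runnable #(apply f args)
                           (long delay-ms)
                           TimeUnit/MILLISECONDS))))))

;; e.g. commit to Lucene at most once per quiet period:
;; (def debounced-commit (debounce #(.commit index-writer) 5000))
```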

jarohen 15:04:37

Hey @U0D5Y2403, cheers for the PR 🙏 I'm afraid I don't think we'll be able to make that exact fix though, due to a current constraint that the main query indices and Lucene stay in sync, even if the power cord's pulled out - if not, Crux unfortunately fails on next startup. We're looking at ways to relax this constraint in future releases - best that I defer to @U050DD55V and @U899JBRPF who are a little more in context 🙂
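(Illustrative sketch only, not crux internals: the constraint described above amounts to a fail-fast check at startup that the Lucene index and the main query indices have reached the same tx.)

```clojure
;; Hypothetical names: if the Lucene index and the query indices disagree
;; about the latest indexed tx, refuse to start rather than serve
;; inconsistent results.
(defn check-lucene-in-sync! [lucene-tx-id index-tx-id]
  (when (not= lucene-tx-id index-tx-id)
    (throw (IllegalStateException.
            (str "Lucene index out of sync with query indices: "
                 lucene-tx-id " vs " index-tx-id)))))
```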

jonpither 16:04:15

@U0D5Y2403 putting more tx-ops in each tx may speed things up. Are you able to try that and see?
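(A sketch of the suggestion using the Crux 1.x transaction API, where `node` and `docs` are placeholders: submitting many documents in one transaction means Lucene sees one tx rather than thousands.)

```clojure
(require '[crux.api :as crux])

;; One transaction carrying many puts, instead of one tx per record:
(crux/submit-tx node
                (vec (for [doc docs]        ; each doc needs a :crux.db/id
                       [:crux.tx/put doc])))
```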

kevinmershon 16:04:54

@U050DD55V I'm not sure what you are asking

kevinmershon 16:04:19

Right now I have about 9k records that were individually written to the crux database. I dumped my lucene index and restarted crux, and it is averaging ~1 index op per second (150 minutes to sync up). At minimum, on the crux side it should "batch" transactions with lucene and flush only once caught up. In realtime operation you can reasonably flush to disk on every tx and ask callers to use bulkier transactions where possible, but for startup this is pretty terrible performance.
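(A hedged sketch of that batching idea, with hypothetical names throughout: add Lucene documents for every replayed tx, but defer the expensive fsync'ing commit until the indexer has reached the head of the tx-log.)

```clojure
(import '(org.apache.lucene.index IndexWriter))

;; `tx->lucene-docs` is a placeholder for whatever turns a Crux tx into
;; Lucene documents; `latest-submitted-tx-id` marks the head of the tx-log.
(defn index-tx! [^IndexWriter writer tx latest-submitted-tx-id]
  (doseq [doc (tx->lucene-docs tx)]
    (.addDocument writer doc))
  (when (>= (:crux.tx/tx-id tx) latest-submitted-tx-id)
    (.commit writer)))   ; flush once per catch-up, not once per tx
```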

jonpither 16:04:39

that does sound extremely slow. We have benchmarked Lucene ingestion, but I'll investigate on Monday - thanks for reporting @U0D5Y2403

refset 16:04:17

Hi, I just wrote up a comment on the PR, so let's keep track of the longer-form technical discussion on there if possible, but I'm happy to respond to other questions here 🙂 as Jon said, we'll definitely review this internally on Monday. Are you blocked on prototyping further by this?

kevinmershon 16:04:48

👋 appreciate it 🙂

🙏 3
kevinmershon 16:04:57

Not blocked. I'm using my own copy of lucene.clj for the time being, but we hit a snag building a reporting pipeline for running dual systems synchronized over kafka while we move off our old platform. Payloads are sync'd one record at a time over kafka, so I can't batch there because I don't get any signal that a given record is the last one, etc.
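(One possible workaround, sketched with hypothetical names: without a last-record marker, micro-batch by poll window, submitting whatever each Kafka poll returns as a single Crux tx, so batch size adapts to throughput.)

```clojure
(import '(org.apache.kafka.clients.consumer KafkaConsumer)
        '(java.time Duration))

;; `record->doc` is a placeholder for the payload->document mapping;
;; assumes crux.api is required as `crux`, as in the earlier sketch.
(defn run-ingest-loop [^KafkaConsumer consumer node]
  (loop []
    (let [records (.poll consumer (Duration/ofMillis 500))]
      (when (pos? (.count records))
        (crux/submit-tx node
                        (vec (for [r records]
                               [:crux.tx/put (record->doc r)]))))
      (recur))))
```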

kevinmershon 16:04:00

It's actually fortunate that I dug into this now, because we also need keywords and nested fields indexed (our old data model suuuuuucks), so I can modify the indexer to do that, whereas the default one should obviously stay string-only.
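(A sketch of what such a custom indexer might do, hypothetical and simplified: flatten nested maps into dotted field names and render keywords as strings, so both become indexable where the default indexer stays string-only.)

```clojure
;; e.g. (flatten-doc {:a {:b :c}}) ;=> {"a.b" ":c"}
(defn flatten-doc
  ([doc] (flatten-doc nil doc))
  ([prefix doc]
   (reduce-kv (fn [acc k v]
                (let [field (str prefix (when prefix ".") (name k))]
                  (cond
                    (map? v)     (merge acc (flatten-doc field v))
                    (keyword? v) (assoc acc field (str v))
                    :else        (assoc acc field v))))
              {}
              doc)))
```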

refset 16:04:01

Great, that's a relief to know. Batching Crux transactions just for crux-lucene's sake is only really a solution for when prototyping / testing 🙂