What is your general/best advice for adding full text search to an XTDB2 db?
For integration, we have the start of a CDC feature via a Kafka topic: essentially a totally decoupled transaction log you can read from (and index from) without polling. See https://github.com/xtdb/xtdb/pull/4986 for the PR. More to follow on that in the coming weeks as we build out this story. This is probably the best route for the time being, although the real answer is: let's have a chat, maybe we can figure out a plan for something more native 🙂
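To make the "read the log and index from it" idea concrete, here is a minimal sketch in Python. Everything about the event shape is an assumption for illustration: the `op`, `id` and `doc` keys are hypothetical and are not the format of the CDC topic, and the tiny in-memory inverted index stands in for whatever real search engine you would feed. In practice the events would come from a Kafka consumer rather than a list.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

TOKEN = re.compile(r"\w+")


class TextIndex:
    """Toy inverted index: token -> set of document ids."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)
        self.doc_tokens: Dict[str, Set[str]] = {}

    def put(self, doc_id: str, text: str) -> None:
        # Replace any previous version of the document before re-indexing it.
        self.delete(doc_id)
        tokens = {t.lower() for t in TOKEN.findall(text)}
        self.doc_tokens[doc_id] = tokens
        for token in tokens:
            self.postings[token].add(doc_id)

    def delete(self, doc_id: str) -> None:
        for token in self.doc_tokens.pop(doc_id, set()):
            self.postings[token].discard(doc_id)

    def search(self, query: str) -> Set[str]:
        # AND semantics: a document must contain every query term.
        terms = [t.lower() for t in TOKEN.findall(query)]
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


def apply_events(index: TextIndex, events: Iterable[dict]) -> None:
    """Apply change events in log order.

    The event shape ({"op", "id", "doc"}) is hypothetical, not the real
    CDC message format.
    """
    for event in events:
        if event["op"] == "put":
            index.put(event["id"], event["doc"].get("text", ""))
        elif event["op"] == "delete":
            index.delete(event["id"])
```

Because the indexer only ever applies events in log order, it can be restarted and replayed from the topic without touching the database itself, which is the main appeal of the decoupled approach.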
With XTDB v1, I remember there was a kind of sweet spot of about 1000 insertions per transaction. Are there any metrics of that kind for batch updating data in XTDB v2? Any known limits or sweet spots for tuning large batch transactions, both when pre-loading data (as in a migration) and when under heavy load?
It rather depends on the width and depth of the rows also, but in general 1000 is still our go-to batch size. We have been doing pretty regular analysis on the ingestion pipeline over the past few months, mainly with loading TPC-H data, so the guidance still stands.
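As a rough illustration of that guidance, a loader can chunk its input into batches of about 1000 rows and submit one transaction per batch. This is only a sketch: `submit` is a hypothetical callback standing in for whatever client call you use to send a transaction, and the `BATCH_SIZE` default is just the starting point mentioned above, to be tuned for your row width and depth.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List

# Starting point from the guidance above; tune for your row width and depth.
BATCH_SIZE = 1000


def batched(rows: Iterable[dict], size: int = BATCH_SIZE) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` rows without loading everything into memory."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load(
    rows: Iterable[dict],
    submit: Callable[[List[dict]], None],
    size: int = BATCH_SIZE,
) -> int:
    """Send each batch as one transaction via `submit` (hypothetical) and return the batch count."""
    count = 0
    for batch in batched(rows, size):
        submit(batch)
        count += 1
    return count
```

For example, loading 2500 rows with the default size would call `submit` three times, with batches of 1000, 1000 and 500 rows.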
For ~peak batch insertion throughput you might want to look at Arrow-based COPY, which just landed: https://github.com/xtdb/xtdb/commit/b11f24383d37142b57c75d7cbbbedaf0bd8d2e01#diff-817cf8170a6c123b8973ffc87115d13ba84fa3aa773893b413f403f9451c0997R102-R113 Because it avoids nearly all serde overheads, we observed about 200k rows/sec with TPC-H, over the Postgres wire protocol (!). The default submit-tx process round-trips via transit, so you end up paying for a lot of allocations at both ends. Getting your data into Arrow may not be free, but you can perhaps parallelize that. How much data are you looking at for a migration?