#xtdb
2023-06-20
Nick15:06:28

Hi All -- I'm just looking into XTDB2 and was curious about the Arrow support. My use case has large files (100 million rows) that I currently have in Arrow format from working with the tech.ml dataset stack. Is this something I can/should put into XTDB2 with xt/submit-tx, or should those files be managed outside XTDB, keeping the config and smaller transactions in the DB along with references to the Arrow files themselves on S3?

jarohen15:06:30

hey @U04A1LVBBPU 👋 either of those strategies should (eventually) work with XTDB2 - it depends on the sorts/frequencies of queries you'll want to make. With the Arrow file out of XT, we obviously won't have any indices etc maintained, so won't be able to do much apart from scan the file. That said, XTDB2 likely isn't in a shape where it'll handle either particularly gracefully at that scale just yet - we're still in early access 🙂 welcome to give it a go, of course, but YMMV 🙂

Nick15:06:55

thanks for the quick reply. I'll definitely benchmark each and let you know what I find. In its current state, would an xt/submit-tx on an Arrow file work? Or should I just be exploring the managed-outside option at the moment?

jarohen15:06:42

openly, we'd not considered users wanting to pass an Arrow file directly to submit-tx, but that's a neat idea 🙂

jarohen15:06:45

(thinking out loud) the Arrow file'd have to be in a format we could ingest - i.e. documents with IDs, which table the docs belonged to, any bitemporal validity ranges, etc - but maybe that's not too onerous
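
For illustration only -- a minimal sketch of the shape such data would need to map onto, assuming the early-access `:put` tx-op format (a table keyword plus a document map carrying `:xt/id`); how valid-time ranges are attached is covered in the tx reference docs rather than shown here:

```clojure
;; Hypothetical rows from an Arrow file, expressed as :put ops --
;; each op names a table and carries a document with an :xt/id.
[[:put :products {:xt/id "product-1", :name "Widget", :price 9.99}]
 [:put :products {:xt/id "product-2", :name "Gadget", :price 19.99}]]
```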

jarohen15:06:39

in a multi-node setup, we'd also have to ensure that all nodes saw the same version of the file - we still require that transactions can be processed deterministically between nodes

Nick15:06:21

that shouldn't be too onerous. It's pretty trivial to convert a tech.ml dataset into a Clojure vector of maps via the built-in functions. Is there any guidance for target (preferred) transaction sizes? For example, I could write functions to decompose the large data into smaller sets based on attributes in the problem domain (i.e. by each product, or each customer), and that would make the individual transactions themselves smaller. Also, are there docs on the target format as well? I'd be up for writing a library to help facilitate the tech.ml stack -> xtdb -> and back, as I can see a lot of value in getting that working well.
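
As a rough sketch of that conversion -- assuming the early-access `:put` op format, with the namespace aliases, id-column handling, and names like `node`/`my-dataset` being illustrative rather than a confirmed API:

```clojure
(require '[tech.v3.dataset :as ds]
         '[xtdb.api :as xt])

(defn dataset->put-ops
  "Turn each row of a tech.ml dataset into a :put op for `table`,
   promoting the `id-col` column to :xt/id."
  [table id-col dataset]
  (for [row (ds/rows dataset)]
    [:put table (-> (into {} row)
                    (assoc :xt/id (get row id-col))
                    (dissoc id-col))]))

;; e.g. (xt/submit-tx node (vec (dataset->put-ops :trades :trade-id my-dataset)))
;; where `node` is a started XT2 node and :trades/:trade-id are placeholders.
```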

jarohen15:06:09

> Is there any guidance for target (preferred) transaction sizes? For example, I could write functions to decompose the large data into smaller sets based on attributes in the problem domain (i.e. by each product, or each customer), and that would make the individual transactions themselves smaller.

we tend to get diminishing returns after about 1k rows/tx, but it's highly dependent on the dataset. splitting by domain shouldn't be necessary, but I tend to because it makes things easier to check/reason about 🙂

> Also, are there docs on the target format as well?

there's some light reference documentation https://www.xtdb.com/reference/main/datalog/txs and docstrings on submit-tx itself; tutorials etc. to come in due course 🙂

> I'd be up for writing a library to help facilitate the tech.ml stack -> xtdb -> and back, as I can see a lot of value in getting that working well.

completely agree - that'd be awesome, thanks!
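
To make that concrete -- one way to stay near the ~1k rows/tx mark, purely illustrative and reusing the hypothetical `dataset->put-ops` sketched above:

```clojure
;; Submit one transaction per ~1000-row chunk.
(doseq [batch (partition-all 1000 (dataset->put-ops :trades :trade-id my-dataset))]
  (xt/submit-tx node (vec batch)))
```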

Nick15:06:32

awesome. One last question. Let's say I have a 100K-row file, and I split it based on the problem domain. In this scenario, I have 100 products at 1000 rows each, so I have 100 transactions at the target 1000 rows/tx size. I submit these 100 tx to XTDB. I then run some functions and update the file with new values for 20% of the products, so 80 would be identical to the original. I then submit all 100 tx of the new file to XTDB. Would XTDB do structural sharing and not repeat the data for the 80 that didn't change? Or would these also be duplicated, like I have in the baseline case with 2 Arrow files? I ask because my use case will have hundreds of scenarios with slight tweaks on this large dataset, and I'd love a DB to help me with versioning, time travel, etc., and want to spare myself having to manage copies of all the data myself and have the DB do that work for me.

jarohen15:06:31

XTDB2 doesn't do structural sharing in the same way I'm afraid. what it'll likely do is replace those rows in its main indices, and rows that become historical will be demoted to files that (if you only query as-of-now) won't often be pulled down to the nodes from the remote storage

jarohen15:06:30

the idea being that these more historical files can be lifecycle managed in the remote storage - either moved to cheaper storage (think S3 IA/Glacier), or (with some support in XTDB itself) subject to a retention policy

jarohen15:06:18

it'll likely still be worth processing these files into a diff in this case, though - the nodes will still have to churn through the dataset a second time to index them
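
A minimal sketch of that pre-diffing step, done outside XT with plain Clojure (names are placeholders; assumes both versions are sequences of row maps keyed by :xt/id):

```clojure
(defn changed-put-ops
  "Return :put ops only for rows whose content differs from the previous version."
  [table old-rows new-rows]
  (let [old-by-id (into {} (map (juxt :xt/id identity)) old-rows)]
    (for [row new-rows
          :when (not= row (old-by-id (:xt/id row)))]
      [:put table row])))
```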

jarohen16:06:20

although, if your use case can cope with waiting for those to pass through the pipeline, it might not be necessary 🙂

Nick16:06:21

thanks for all the help. I'll go exploring and I'm sure I'll have more questions as I get into it

☺️ 2
🙏 2
jarohen16:06:01

no worries!

jarohen16:06:42

also, if you'd be interested in being part of the XT2 early access programme, please do ping me your contact details - we're amassing a group of people keen to get more involved in its development between now and a stable release 🙂