#datomic
2023-10-18
akis00:10:54

I'm evaluating whether Datomic would be a good fit to ingest 10M CSV rows per day (10 GB), spread across different CSV files, where each row has ~20 columns. If I understand correctly, that would be 200M datoms a day - is this considered excessive? If this is doable, is there any difference in doing this kind of thing with Datomic Pro vs Cloud/Ions? What would you suggest? Thanks! 🙏
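
A rough sketch of what a naive batched load might look like with the Pro peer API and clojure.data.csv; the :row/* attribute names, string-only values, and batch size are placeholders rather than recommendations, and `conn` is assumed to be an existing connection with a matching schema installed.

```
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io]
         '[datomic.api :as d])

(defn row->tx-map
  "One CSV row (a vector of strings) -> an entity map.
   Assumes one pre-installed string attribute per column."
  [header row]
  (zipmap (map #(keyword "row" %) header) row))

(defn load-csv!
  "Transact one CSV file in batches of rows-per-tx rows.
   With ~20 columns, 100 rows is roughly 2000 datoms per transaction; tune it."
  [conn path rows-per-tx]
  (with-open [rdr (io/reader path)]
    (let [[header & rows] (csv/read-csv rdr)]
      (doseq [batch (partition-all rows-per-tx rows)]
        @(d/transact conn (mapv #(row->tx-map header %) batch))))))
```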

bhurlow02:10:21

what other stores are you comparing?

akis02:10:19

Typically I would use something like Postgres for this use case

bhurlow03:10:44

we’re doing way more than 200M datoms per day, so it’s possible to transact a fairly large throughput. Hard to say more without knowing more about your query needs, but features of Datomic worth considering are: historical audit, flexibility (graph and relational APIs), cache locality, real transactions. If your project is ETL-based, you may not benefit from some of these

👍 1
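
To make the historical-audit point concrete, a minimal sketch against the Pro peer API’s history database; the :order/status attribute and order-eid are made-up placeholders and conn is assumed to be bound.

```
(require '[datomic.api :as d])

;; every assertion and retraction of one attribute over time, with its tx
(d/q '[:find ?v ?tx ?added
       :in $ ?e
       :where [?e :order/status ?v ?tx ?added]]
     (d/history (d/db conn))
     order-eid)
```
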
cch113:10:27

I’m surprised and pleased to hear this. In the past I have heard warnings of poor query performance with more than 10 billion datoms in the database. You would have hit that limit in less than two months, so I’m guessing modern hardware and improvements in the product have made that less of an issue.

bhurlow18:10:52

IME the partitioning strategy is really important. We experienced a lot of IO pain until we got that in order. One thing that’s nice about Datomic is that its operational tradeoffs are fairly clear

bhurlow18:10:24

Still, it’s not necessarily the best fit for non-transactional ETL workloads, just depends

favila18:10:44

^^ This. At billions of datoms, you really, really need to know what your query locality looks like and make sure you use partitions accordingly

favila18:10:59

it’s not something you can easily fix later

💯 1
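
For reference, in Datomic Pro a partition is chosen at entity-creation time through the tempid, which is why it is hard to change later. A minimal sketch, with the :orders partition and order attributes made up and a matching schema assumed:

```
(require '[datomic.api :as d])

;; one-time: install a custom partition
@(d/transact conn
   [{:db/id (d/tempid :db.part/db)
     :db/ident :orders
     :db.install/_partition :db.part/db}])

;; entities created with tempids in that partition end up near each other
;; in the indexes, which is what gives the query locality discussed above
@(d/transact conn
   [{:db/id (d/tempid :orders)
     :order/id "A-1001"
     :order/amount 42.0M}])
```
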
akis18:10:23

What do you mean by non-transactional workloads?

favila18:10:06

I think he means OLTP vs OLAP workloads

👍 1
favila18:10:22

datomic is better for OLTP

favila18:10:43

passable at OLAP because of column-oriented indexes (AEVT)

favila18:10:48

but still not great
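
As a small illustration of the AEVT point, all values of a single attribute can be read off the index as something close to a column scan; the attribute name is a placeholder and conn is assumed to be bound.

```
(require '[datomic.api :as d])

;; sum one "column" straight off the AEVT index
(->> (d/datoms (d/db conn) :aevt :order/amount)
     (map :v)
     (reduce +))
```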

bhurlow18:10:31

i.e. if your use-case doesn’t actually have the need to, say, ensure global uniqueness of a user’s email, or lock a bank account during a transfer, you’d be putting data through a single thread that could otherwise be spread across a cluster like Elasticsearch
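
For the email example, that guarantee is a schema-level property enforced by the single transactor thread; a sketch, with the attribute name as a placeholder:

```
{:db/ident       :user/email
 :db/valueType   :db.type/string
 :db/cardinality :db.cardinality/one
 :db/unique      :db.unique/identity}  ; or :db.unique/value to disallow upsert
```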

akis18:10:21

hmm, yes it makes sense, thank you. My use case is not entirely OLAP in the sense that I don’t need to do a lot of (if at all) analytics, but also don’t think I need some of the features you mentioned above either :thinking_face: It’s more about getting graph querying capabilities and time tracking

bhurlow18:10:48

another item I’d look out for is if the CSVs have any large strings in the cells. Due to indexing design Datomic doesn’t do so well with large string content in the datoms themselves

akis19:10:55

> CSVs have any large strings in the cells. Due to indexing design Datomic doesn’t do so well with large string content in the datoms themselves
interesting, that is actually the case. Do you happen to know if xtdb handles this better?

akis19:10:41

when you say it doesn’t do well, can you give an example perhaps? Wouldn’t Lucene integration help with full-text search?

favila19:10:08

it’s not about fulltext, it’s just row storage in the segment

favila19:10:34

datomic doesn’t have any kind of large-value offloading in its on-disk encoding like e.g. postgres TOAST

👍 1
bhurlow19:10:16

Datomic uses “covering” indexes, so the actual datoms themselves, rather than references to them, are stored in the B-tree (let’s say B-tree-like). This means that if a datom has, say, a 50 MB string in it, it will be serialized into the index node and will need to be downloaded, deserialized, etc.

👍 1
bhurlow19:10:12

This is, I believe, regardless of whether the v in the E A V T is indexed

akis19:10:42

I see! I don’t think it would be that large, likely in the kB range. Perhaps <8kb is ok, given that Postgres doesn’t trigger TOAST below that size anyway IIRC

akis19:10:17

I’m guessing I’ll need to test to be sure :thinking_face:

favila19:10:15

datomic cloud’s limit is 4k, in our 10bil datom db we keep everything under 2-4k with attribute predicates (i.e. no cheating)

:thanks3: 1
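
A sketch of the attribute-predicate approach mentioned above, which rejects oversized values at transaction time; the names and the 4096-character cutoff are placeholders, and the predicate namespace must be on the transactor/peer classpath.

```
(ns myapp.preds)

(defn short-enough?
  "Reject string values longer than 4096 characters before they reach the indexes."
  [s]
  (<= (count s) 4096))

;; schema for the attribute, pointing at the predicate by fully qualified symbol
{:db/ident       :note/body
 :db/valueType   :db.type/string
 :db/cardinality :db.cardinality/one
 :db.attr/preds  'myapp.preds/short-enough?}
```
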
favila19:10:05

I thought the TOAST threshold was about 2kb?

💯 1
akis19:10:07

10bil, nice. How long did it take to get to that size? 🙂

favila19:10:11

for the size of the tuple

favila19:10:26

8+ years, although that’s with a decant in the middle

👍 1
favila19:10:04

(it was 17bil as of the beginning of this year, went back down to 9 after a cleanup)

akis19:10:23

> I thought the TOAST threshold was about 2kb?
you’re right, thanks for the correction! (`TOAST_TUPLE_THRESHOLD` is normally 2kb according to the docs)

jasonjckn07:10:45

@U09R86PA4 what kind of sustained throughput do you do? txps and dps, and on what data store?

jasonjckn07:10:45

also Q for everyone, what batch size (datoms per tx) and parallelism are optimal for bulk loading?
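
There is no single right answer; the usual knobs are datoms per transaction and the number of transactions in flight. A minimal sketch of a bounded pipeline over d/transact-async, where the batch contents and the in-flight count are things to measure against your own data, not recommendations:

```
(require '[datomic.api :as d])

(defn bulk-load!
  "Transact pre-built batches of tx-data, keeping at most in-flight
   transactions outstanding at once."
  [conn tx-batches in-flight]
  (loop [pending clojure.lang.PersistentQueue/EMPTY
         batches tx-batches]
    (cond
      ;; window full, or input exhausted but futures still pending:
      ;; wait on the oldest transaction before continuing
      (or (>= (count pending) in-flight)
          (and (empty? batches) (seq pending)))
      (do @(peek pending)
          (recur (pop pending) batches))

      ;; room in the window and batches left: submit the next one
      (seq batches)
      (recur (conj pending (d/transact-async conn (first batches)))
             (rest batches))

      :else :done)))
```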