2023-10-18
I'm evaluating whether Datomic would be a good fit to ingest 10M CSV rows per day (10 GB), spread across different CSV files, where each row has ~20 columns. If I understand correctly, that would be 200M datoms a day - is this considered excessive?
If this is doable, is there any difference in doing this kind of thing with Datomic Pro vs Cloud/Ions? What would you suggest?
Thanks! 🙏
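(For reference, a minimal sketch of what batched CSV ingestion could look like with the Datomic Peer (Pro) API. The connection URI, the `:csv/*` attribute names, and the batch size are hypothetical, the corresponding schema is assumed to be installed already, and `org.clojure/data.csv` is assumed to be on the classpath.)

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io]
         '[datomic.api :as d])

;; Hypothetical dev-storage connection.
(def conn (d/connect "datomic:dev://localhost:4334/ingest"))

(defn row->tx-map
  "Turn one ~20-column CSV row into a transaction map (roughly 20 datoms)."
  [headers row]
  (zipmap (map #(keyword "csv" %) headers) row))

(defn ingest-file!
  "Transact a CSV file in batches instead of one huge transaction."
  [file batch-size]
  (with-open [r (io/reader file)]
    (let [[headers & rows] (csv/read-csv r)]
      (doseq [batch (partition-all batch-size rows)]
        @(d/transact conn (mapv #(row->tx-map headers %) batch))))))

;; e.g. (ingest-file! "rows-2023-10-18.csv" 1000)
```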
we’re doing way more than 200M datoms per day, so it’s possible to sustain a fairly large transaction throughput. Hard to say more without knowing more about your query needs, but features of Datomic worth considering are: historical audit, flexibility (graph and relational APIs), cache locality, real transactions. If your project is ETL-based, you may not benefit from some of these
I’m surprised and pleased to hear this. In the past I have heard warnings of poor query performance with more than 10 billion datoms in the database. You would have hit that limit in less than two months, so I’m guessing modern hardware and improvements in the product have made that less of an issue.
IME the partitioning strategy is really important. We experienced a lot of IO pain until we got that in order. One thing that’s nice about Datomic is that its operational tradeoffs are fairly clear
Still, it’s not necessarily the best fit for non-transactional ETL workloads, just depends
^^ This. At billions of datoms, you really, really need to know what your query locality looks like and make sure you use partitions accordingly
i.e. if your use case doesn’t actually need to, say, ensure global uniqueness of a user’s email or lock a bank account during a transfer, you’d be putting all your data through a single transactor thread when it could otherwise be spread across a cluster like Elasticsearch
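(To illustrate the partitioning point above, a sketch assuming the Datomic Pro Peer API; the partition name, attribute names, and connection URI are hypothetical. Entities minted in the same partition get nearby entity ids, which keeps datoms that are queried together close together in the index.)

```clojure
(require '[datomic.api :as d])

(def conn (d/connect "datomic:dev://localhost:4334/ingest")) ; hypothetical URI

;; One-time schema transaction installing a custom partition.
@(d/transact conn [{:db/id (d/tempid :db.part/db)
                    :db/ident :csv.part/2023-10
                    :db.install/_partition :db.part/db}])

;; New entities go into that partition by minting their tempids against it.
@(d/transact conn [{:db/id (d/tempid :csv.part/2023-10)
                    :csv/account-id "acct-42"
                    :csv/amount     1337M}])
```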
hmm, yes it makes sense, thank you. My use case isn’t really OLAP, in the sense that I don’t need to do much (if any) analytics, but I also don’t think I need some of the features you mentioned above either 🤔 It’s more about getting graph querying capabilities and time tracking
another item I’d look out for is whether the CSVs have any large strings in the cells. Due to its indexing design, Datomic doesn’t do so well with large string content in the datoms themselves
> Datomic doesn’t do so well with large string content in the datoms themselves

interesting, that is actually the case. Do you happen to know if xtdb handles this better?
when you say it doesn’t do well, can you give an example perhaps? Wouldn’t the Lucene integration help with full-text search?
datomic doesn’t have any kind of large-value offloading in its on-disk encoding like e.g. postgres TOAST
Datomic uses “covering” indexes, so the actual datoms themselves, rather than references to them, are stored in the B-tree (B-tree-like, let’s say). This means that if a datom has, say, a 50 MB string in it, the whole string will be serialized into the index node and will need to be downloaded, deserialized, etc.
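(A common workaround for this, a pattern rather than a built-in Datomic feature, is to keep the large payload in an external blob store and transact only a short content hash; in this sketch `put-blob!` and the `:csv/body-ref` attribute are hypothetical.)

```clojure
(require '[datomic.api :as d])
(import '(java.security MessageDigest))

(defn sha-256
  "Hex-encoded SHA-256 of a string, used as a stable blob key."
  [^String s]
  (->> (.digest (MessageDigest/getInstance "SHA-256") (.getBytes s "UTF-8"))
       (map #(format "%02x" %))
       (apply str)))

(defn transact-row-with-blob!
  "Store the large text outside Datomic and keep only its key in the datom."
  [conn row-attrs ^String large-text put-blob!]
  (let [k (sha-256 large-text)]
    (put-blob! k large-text)                        ; e.g. S3, a file store, ...
    @(d/transact conn [(assoc row-attrs :csv/body-ref k)])))
```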
I see! I don’t think it would be that large, likely in the kB range. Perhaps <8 kB is ok, given that Postgres doesn’t trigger TOAST below that size anyway, IIRC
datomic cloud’s limit is 4k; in our 10-billion-datom db we keep everything under 2-4k with attribute predicates (i.e. no cheating)
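(For context, `:db.attr/preds` attaches a predicate function to an attribute so the transactor rejects values that fail it. Below is a sketch of using one to enforce a string-size budget; the namespace, function, and attribute names are hypothetical, and the predicate must be on the classpath of the transacting process.)

```clojure
(ns my.app.preds)

(defn short-enough?
  "Attribute predicate: reject string values over ~4 KB so oversized
  strings never land in an index segment."
  [^String s]
  (<= (count (.getBytes s "UTF-8")) 4096))

;; Schema installing the predicate on a hypothetical string attribute.
(def schema
  [{:db/ident       :csv/description
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db.attr/preds  'my.app.preds/short-enough?}])
```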

I thought the TOAST threshold was about 2kb?

you’re right, thanks for the correction! (`TOAST_TUPLE_THRESHOLD` is documented as normally being 2kb.)
@U09R86PA4 what kind of sustained throughput do you do? tx/sec and datoms/sec, and on what data store?