#datomic
2024-02-03
Hasan Ahmed03:02:03

Hello, I am new to Datomic and to partitioning in general. What are some good partitioning strategies for Datomic, or should I even bother (what does Datomic do when I do not specify a partition)? I'll be using Datomic Pro with peers. I see that partitioning is not as advocated for in the Cloud version, based on some references/comments I have read online.

favila15:02:15

Datomic indexes are sorted lists of datoms divided into blocks/chunks called segments and arranged in a tree. Here is a good high-level overview: https://tonsky.me/blog/unofficial-guide-to-datomic-internals/

favila15:02:57

So sorting influences which datoms are more likely to appear together on the same segment or with the same parent segments in the tree.

favila15:02:46

Partitioning is influencing the sort order by controlling the most-significant-bits of entity ids.
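A minimal sketch of the point about most-significant bits, in plain Python rather than the Datomic API. It assumes the commonly described entity id layout (partition number in the high bits, a counter in the low 42 bits); `make_eid`, `partition_of`, and the partition numbers 7 and 9 are hypothetical names and values chosen for illustration.

```python
# Model of a Datomic-style entity id: partition number in the high bits,
# a counter in the low 42 bits. Illustration only, not the Datomic API.
COUNTER_BITS = 42  # assumed width of the counter portion


def make_eid(partition: int, counter: int) -> int:
    """Pack a partition number and a counter into one integer id."""
    if not 0 <= counter < (1 << COUNTER_BITS):
        raise ValueError("counter out of range")
    return (partition << COUNTER_BITS) | counter


def partition_of(eid: int) -> int:
    """Recover the partition number from the high bits of an id."""
    return eid >> COUNTER_BITS


# Entities created in interleaved order across two hypothetical partitions.
created = [make_eid(p, n) for n, p in enumerate([7, 9, 7, 9, 7, 9])]

# Sorting by id groups each partition's entities together, regardless of
# the order in which they were created.
print([partition_of(e) for e in sorted(created)])  # [7, 7, 7, 9, 9, 9]
```

Because the partition occupies the high bits, any index sorted by entity id keeps one partition's entities contiguous.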

favila15:02:29

Having things-read-together be on nearby segments is important for increasing your cache hit rate, which makes things go faster. The slowest possible thing is reading a segment from storage and decoding it; next-slowest is reading it from valcache or memcache and decoding it; fastest is reading it from object cache.
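The effect of locality on cache hit rate can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of Datomic's actual caches: it treats the object cache as an LRU of whole segments, and `SEGMENT_SIZE`, `CACHE_SEGMENTS`, and the id ranges are hypothetical numbers.

```python
from collections import OrderedDict
import random

SEGMENT_SIZE = 1000  # hypothetical number of entity ids covered by one segment
CACHE_SEGMENTS = 50  # hypothetical object-cache capacity, in segments


def hit_rate(entity_ids, capacity=CACHE_SEGMENTS):
    """Fraction of reads served from a simple LRU cache of segments."""
    cache = OrderedDict()
    hits = 0
    for eid in entity_ids:
        segment = eid // SEGMENT_SIZE
        if segment in cache:
            hits += 1
            cache.move_to_end(segment)  # mark as most recently used
        else:
            cache[segment] = True  # stands in for a slow read from storage
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(entity_ids)


rng = random.Random(0)
# Reads confined to one contiguous id range (e.g. one tenant's partition).
clustered = [rng.randrange(0, 20_000) for _ in range(10_000)]
# Reads spread uniformly over the whole id space.
scattered = [rng.randrange(0, 1_000_000) for _ in range(10_000)]

print(f"clustered: {hit_rate(clustered):.3f}")
print(f"scattered: {hit_rate(scattered):.3f}")
```

The clustered reads touch only 20 segments, which all fit in the cache, while the scattered reads touch far more segments than the cache can hold.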

favila15:02:58

So partitioning can help when all of the following are true:
1) The working set of data you query while the application is in normal use is larger than your object cache.
2) Your queries have some kind of locality of things-read-together. For example, if your application data has some kind of tenancy (e.g. users/customers rarely “see” other customers’ data), all of a tenant’s data should be in the same partition.
3) You run multiple peers.
4) You can partition your processes/peers in a similar way, e.g. with HTTP load balancing, so that requests for things within the same partition boundaries go to the same peers.
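The last condition (sending requests for the same partition to the same peers) could be sketched as a routing function like the one below. This is an illustration only: the peer names and `peer_for_tenant` are hypothetical, and a real deployment would more likely configure this in the load balancer than in application code.

```python
import zlib

PEERS = ["peer-a", "peer-b", "peer-c"]  # hypothetical peer process names


def peer_for_tenant(tenant_id: str, peers=PEERS) -> str:
    """Pick a peer deterministically so one tenant always hits the same peer.

    crc32 is used because it is stable across processes, unlike Python's
    built-in hash() for strings.
    """
    return peers[zlib.crc32(tenant_id.encode("utf-8")) % len(peers)]


# Repeated requests for one tenant land on one peer, so that peer's
# object cache stays warm for that tenant's segments.
print(peer_for_tenant("tenant-42") == peer_for_tenant("tenant-42"))  # True
```

Simple modulo routing reshuffles most tenants when the number of peers changes. Consistent hashing is one alternative if peers are added or removed often.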

favila15:02:18

If you don’t assign partitions (either implicit or named partitions), entities are placed in one of the three default partitions: schema goes in the db partition (0), transaction entities go in the tx partition (3) and all other entities go in the user partition (4) https://docs.datomic.com/pro/schema/schema.html

favila15:02:46

If your data is small, you don’t need partitions; however partitions are very hard to introduce later because you cannot renumber entities without decanting, which is a fairly involved process. (Decanting is running the transaction log of a database in order, transforming it, and transacting it into a new database.)
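To make the shape of decanting concrete, here is a toy sketch in plain Python. It assumes a log modeled as lists of `(op, e, a, v)` tuples and the 42-bit counter layout. `decant`, `partition_for`, and the attribute names are hypothetical, and a real decant against Datomic involves much more (schema, transaction metadata, and letting the transactor allocate ids).

```python
COUNTER_BITS = 42  # assumed width of the counter in the low bits of an entity id


def decant(log, partition_for, ref_attrs):
    """Replay `log` in order, renumbering entities into chosen partitions.

    log:           list of transactions, each a list of (op, e, a, v) tuples
    partition_for: dict mapping each old entity id to its new partition number
    ref_attrs:     attributes whose values are entity ids and need remapping
    """
    old_to_new = {}
    new_log = []

    def renumber(old_eid):
        # Allocate a new id the first time an old id is seen, then reuse it.
        if old_eid not in old_to_new:
            counter = len(old_to_new)
            old_to_new[old_eid] = (partition_for[old_eid] << COUNTER_BITS) | counter
        return old_to_new[old_eid]

    for tx in log:
        new_tx = []
        for op, e, a, v in tx:
            new_e = renumber(e)
            new_v = renumber(v) if a in ref_attrs else v
            new_tx.append((op, new_e, a, new_v))
        new_log.append(new_tx)
    return new_log, old_to_new


old_log = [
    [("add", 100, ":order/number", "A-1"),
     ("add", 100, ":order/lines", 101),
     ("add", 101, ":line/sku", "widget")],
    [("add", 200, ":order/number", "A-2")],
]
partition_for = {100: 10, 101: 10, 200: 11}
new_log, id_map = decant(old_log, partition_for, {":order/lines"})
```

Every entity id and every reference value has to be rewritten consistently, which is part of why this is much easier to avoid than to do.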

favila15:02:02

Cloud has partitioning, but doesn’t let you control it. (It doesn’t have the tempid record, only tempid strings, so there’s literally no way to express the desired partition). It does some kind of automatic partitioning based on time, but I don’t know if that’s officially stated anywhere. If your query locality correlates with time, this is good for you; if it correlates with something else, too bad 🙂

favila15:02:08

More information on partitioning in on-prem: https://docs.datomic.com/pro/schema/schema.html#implicit-partitions. Specific ways to exploit it (sharding is the one I focused on here, but you can also use it for entity scans if you assign partitions rigorously and 100% correctly): https://docs.datomic.com/pro/query/indexes.html#partitions

Hasan Ahmed18:02:54

Oh wow, that's of great help to me. Thank you

Hasan Ahmed18:02:33

One more question: I can envision my queries being something like "query all orders with their lines". Now, do I put all orders in partition 'x' and all order lines in partition 'y', or put an order with its lines in partition 'x', the next order with all its info in partition 'y', the next in partition 'z', etc.? For the latter, I'd probably use implicit partitions.

favila18:02:46

If lines are not shared between orders, you will want to put each order and its lines in the same partition (sharding by ownership/scope) rather than partitioning by type of entity.

favila18:02:55

isComponent will do this automatically

favila18:02:54

you can of course also partition by both type and scope

favila18:02:13

if you have query loads which only look at orders, for example, and you want those to be as fast as possible
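Sharding by ownership/scope, as described above, might look roughly like the sketch below: derive a partition index from the owning scope and give the order and all of its lines that same index. This is plain Python, not Datomic code. `partition_index`, `place_order`, and the map keys are hypothetical, and the count of 2^19 implicit partitions is an assumption to check against the Datomic Pro documentation.

```python
import zlib

IMPLICIT_PARTITION_COUNT = 1 << 19  # assumption: number of implicit partitions


def partition_index(scope_key: str) -> int:
    """Map a scope (e.g. an order number or customer id) to a partition index."""
    return zlib.crc32(scope_key.encode("utf-8")) % IMPLICIT_PARTITION_COUNT


def place_order(order_number: str, line_skus):
    """Give an order and all of its lines the same partition index."""
    index = partition_index(order_number)
    order = {"partition-index": index, "order/number": order_number}
    lines = [{"partition-index": index, "line/sku": sku} for sku in line_skus]
    return order, lines


order, lines = place_order("A-1", ["widget", "gadget"])
# All lines share the order's partition index, so they sort next to it.
print(all(line["partition-index"] == order["partition-index"] for line in lines))  # True
```

Using a customer id instead of the order number as the scope key would keep all of one customer's orders together, which may fit better if queries usually span a customer's orders.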

Hasan Ahmed18:02:29

Alright, I'll go fake some data and test that out. I'll let you know the results if you're interested

Hasan Ahmed18:02:34

Many thanks again

favila18:02:56

A caveat that the benefits of partitioning are very hard to observe until it’s too late 🙂

favila18:02:27

I’ve heard many stories of databases set up without partitioning, then when they hit multiple billions of datoms they have performance problems

Hasan Ahmed18:02:37

In that case, I'll fake too much data

Hasan Ahmed18:02:10

That's why I wanted to know about partitioning early on

Hasan Ahmed18:02:48

And tbh I searched on here before posting, and saw your comments about partitioning

Hasan Ahmed18:02:04

So if I am saved, it's because of you lol

favila18:02:44

How large (# of datoms) do you anticipate your database getting? Approaching 10 billion?

Hasan Ahmed18:02:03

Hmmm, it might

Hasan Ahmed18:02:30

I am building a new app for my company to replace legacy

favila18:02:37

OK, just double-checking. At that scale partitioning is very important

Hasan Ahmed18:02:47

It has like 600k orders

Hasan Ahmed18:02:09

Each order might be 10-30 datoms
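A quick back-of-the-envelope check of the figures quoted above, using only the numbers from the chat (600k orders, 10-30 datoms per order, and the 10 billion datom mark). It does not account for order lines if those are counted separately, for history, or for future growth, none of which are stated here.

```python
orders = 600_000
datoms_per_order_low, datoms_per_order_high = 10, 30
threshold = 10_000_000_000  # the "approaching 10 billion" mark from the chat

low = orders * datoms_per_order_low
high = orders * datoms_per_order_high

print(low, high)                   # 6000000 18000000
print(f"{high / threshold:.2%}")   # 0.18%
```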