Hello, I am new to Datomic and to partitioning in general. What are some good partitioning strategies for Datomic, or should I even bother (what does Datomic do when I do not specify a partition)? I'll be using Datomic Pro with peers; I see that partitioning in the Cloud version is not as advocated for, based on some references/comments I have read online.
Datomic indexes are sorted lists of datoms divided into blocks/chunks called segments and arranged in a tree. Here is a good high-level overview: https://tonsky.me/blog/unofficial-guide-to-datomic-internals/
So sorting influences which datoms are more likely to appear together on the same segment or with the same parent segments in the tree.
Partitioning influences the sort order by controlling the most significant bits of entity ids.
Having things-read-together be on nearby segments is important for increasing your cache hit rate, which makes things go faster. The slowest possible thing is reading a segment from storage and decoding it; next-slowest is reading it from valcache or memcache and decoding it; fastest is reading it from object cache.
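To make that concrete, here is a tiny sketch against the peer API (datomic.api aliased as d); `db` is assumed to be any database value from an existing connection. `d/entid-at` fabricates an entity id inside a given partition, and `d/part` recovers the partition from the id alone, because the partition sits in the id's high bits:
```
(require '[datomic.api :as d])

;; assumes `db` is a database value, e.g. (d/db conn)
(def eid (d/entid-at db :db.part/user 1000))

;; the partition comes straight out of the entity id's high bits --
;; no lookup against the database is needed
(d/part eid)
;; => 4 (the entity id of :db.part/user)
```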
So partitioning can help when all of the following are true:
1) The working set of data you query while the application is in normal use is larger than your object cache.
2) Your queries have some kind of locality of things-read-together. For example, if your application data has some kind of tenancy (e.g. users/customers rarely “see” other customers’ data), all of a tenant’s data should be in the same partition.
3) You run multiple peers.
4) You can partition your processes/peers in a similar way, e.g. with HTTP load balancing, so that requests for things in the same partition boundaries go to the same peers.
If you don’t assign partitions (either implicit or named partitions), entities are placed in one of the three default partitions: schema goes in the db partition (0), transaction entities go in the tx partition (3), and all other entities go in the user partition (4). https://docs.datomic.com/pro/schema/schema.html
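A quick way to see those defaults from the peer API (same assumed `db` value and d alias as above):
```
(d/entid db :db.part/db)   ;=> 0
(d/entid db :db.part/tx)   ;=> 3
(d/entid db :db.part/user) ;=> 4

;; :db/ident is schema, so it was placed in the db partition
(d/part (d/entid db :db/ident)) ;=> 0
```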
If your data is small, you don’t need partitions; however, partitions are very hard to introduce later because you cannot renumber entities without decanting, which is a fairly involved process. (Decanting is running the transaction log of a database in order, transforming it, and transacting it into a new database.)
Cloud has partitioning, but doesn’t let you control it. (It doesn’t have the tempid record, only tempid strings, so there’s literally no way to express the desired partition). It does some kind of automatic partitioning based on time, but I don’t know if that’s officially stated anywhere. If your query locality correlates with time, this is good for you; if it correlates with something else, too bad 🙂
More information on partitioning on-prem: https://docs.datomic.com/pro/schema/schema.html#implicit-partitions. Specific ways to exploit it (sharding is the one I focused on here, but you can also use it for entity scans if you assign partitions rigorously and 100% correctly): https://docs.datomic.com/pro/query/indexes.html#partitions
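For the entity-scan idea, here is a rough sketch of what that can look like with the peer API, assuming datomic.api is required as d, `db` is a database value, and `:part/tenant-1` is a made-up named partition that one tenant's entities were all assigned to. `d/entid-at` gives the lowest fabricated id in that partition, `d/seek-datoms` starts an :eavt walk there, and `take-while` stops as soon as the datoms leave the partition:
```
(require '[datomic.api :as d])

(let [part-id (d/entid db :part/tenant-1)]                  ; :part/tenant-1 is a hypothetical named partition
  (->> (d/seek-datoms db :eavt (d/entid-at db part-id 0))   ; jump to the start of that partition in :eavt
       (take-while #(= part-id (d/part (:e %))))            ; stop once entity ids leave the partition
       (map :e)
       distinct))                                           ; every entity id in the partition
```
As the caveat above says, this kind of scan only sees what you expect if entities were assigned to their partitions rigorously and 100% correctly.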
Oh wow, that's of great help to me. Thank you
One more question. I can envision my queries being something like: query all orders with their lines. Now, do I put all orders in part 'x' and all order lines in part 'y'? Or put an order with its lines in part 'x', the next order with all its info in part 'y', the next in part 'z', etc.? For the last technique, I'd probably use implicit partitions.
If lines are not shared between orders, you will want to put orders and their lines in the same partition as each other (sharding by ownership/scope) rather than partitioning by type of entity.
Partitioning by type would only make sense if you have query loads which only look at orders, for example, and you want those to be as fast as possible.
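Here is a minimal sketch of the same-partition approach with the peer API, assuming an existing connection `conn`, datomic.api required as d, and made-up attributes (:order/number, :order-line/order, :order-line/sku) whose schema is already installed: first install a named partition, then create the order and its lines with tempids in that partition so their entity ids end up numbered near each other in the indexes.
```
(require '[datomic.api :as d])

;; install a named partition (the ident :orders.part/shard-0 is made up for this example)
@(d/transact conn
   [{:db/id                 (d/tempid :db.part/db)
     :db/ident              :orders.part/shard-0
     :db.install/_partition :db.part/db}])

;; transact an order together with its lines, all in the same partition
(let [order (d/tempid :orders.part/shard-0)]
  @(d/transact conn
     [{:db/id order
       :order/number "A-1001"}
      {:db/id (d/tempid :orders.part/shard-0)
       :order-line/order order
       :order-line/sku "SKU-1"}
      {:db/id (d/tempid :orders.part/shard-0)
       :order-line/order order
       :order-line/sku "SKU-2"}]))
```
Whether every order gets its own implicit partition or many orders share a named shard partition is the sharding decision; the order-and-its-lines-together principle is the same either way.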
Alright, I'll go fake some data and test that out. I'll let you know the results if you're interested.
Many thanks again
A caveat: the benefits of partitioning are very hard to observe until it’s too late 🙂
I’ve heard many stories of databases set up without partitioning, then when they hit multiple billions of datoms they have performance problems
In that case, I'll fake too much data
That's why I wanted to know about partitioning early on
And to be honest, I searched on here before posting and saw your comments about partitioning.
So if I am saved, it's because of you lol
How large (# of datoms) do you anticipate your database getting? Approaching 10 billion?
Hmmm, it might
I am building a new app for my company to replace legacy
It has like 600k orders
Each order might have 10-30 datoms
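For rough scale: 600k orders at 10-30 datoms each is only about 6-18 million datoms from the orders themselves, far below the multi-billion range mentioned above, though lines, history, and other entities will add to the total.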