#datomic
2023-06-23
indy15:06:11

Are there downsides to using a common :entity/id attribute instead of, say, a :post/id or :comment/id? The ids I’m generating for them are uuids (so practically unique). Using a common id attribute simplifies things a lot for me code-wise, but I wonder if there are any downsides to this.

indy15:06:06

I understand this is partially about how one models the domain. But will this affect the indexes in a way that it slows down queries?

favila16:06:15

The value index loses locality by “type”, which means both that lookup-refs don’t guarantee that you’ll get a thing of the type you want (if that’s something that matters) and that the value index will be larger, so potentially more bisection to do to find a value

favila16:06:54

OTOH polymorphic access doesn’t need to “probe” a list of identifying attributes if it just has an ID and doesn’t care what kind of entity is at the other end

favila16:06:40

also datomic analytics metaschema needs an attribute to “mark” an entity as belonging to a certain table in the mapping. :post/id gives you that automatically

favila16:06:05

:entity/id does not, it would just give you a single really really wide table

favila16:06:29

another possible downside to multiple attributes: if you really want to keep the invariant that the uuid values are unique among all types, there’s no automatic enforcement of that invariant

favila16:06:13

e.g. :db.unique/value is just going to ensure all :post/id values are unique, not that no :comment/id has the same value

indy16:06:32

> that the value index will be larger, so potentially more bisection to do to find a value

Yeah, I was worried about this.

favila16:06:57

consider prefixing the uuids

favila16:06:02

as in write some type marker into the high byte or something
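The "type marker in the high byte" idea can be sketched outside Datomic. This is a hypothetical Python illustration (the conversation's real environment is Clojure/Datomic); the marker values and names are made up:

```python
import uuid

# Hypothetical one-byte type markers (not from the conversation).
TYPE_POST = 0x01
TYPE_COMMENT = 0x02

def typed_uuid(marker: int) -> uuid.UUID:
    """Random UUID whose most significant byte is overwritten with a
    type marker, so ids of one type share a common prefix."""
    low_120_bits = uuid.uuid4().int & ((1 << 120) - 1)
    return uuid.UUID(int=(marker << 120) | low_120_bits)

print(typed_uuid(TYPE_POST))     # 01xxxxxx-...
print(typed_uuid(TYPE_COMMENT))  # 02xxxxxx-...
```

Note this spends 8 of uuid4's 122 random bits on the marker, which raises collision risk only negligibly at practical data sizes.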

favila16:06:43

unless your data size is very large I wouldn’t worry about perf, worry about ergonomics

👍 4
favila16:06:20

partitioning is also really important at larger sizes, but cloud doesn’t give you control of that

👍 2
indy16:06:32

> The value index loses locality by “type”, which means both that lookup-refs don’t guarantee that you’ll get a thing of the type you want

I don’t fully understand this. By type of the thing I want, you mean: if I had used :post/id and :comment/id and they were of different types, then the index would’ve been segmented that way, and that would’ve led to faster plucking of datoms?

favila16:06:12

datoms in the value index are ordered by AVET, i.e. attribute, value, entity, tx

favila16:06:30

so by having different attributes, you’ve already bisected the space of values

favila16:06:39

all values of the same attribute are together
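The AVET ordering described above can be modeled with a toy sketch (an illustration, not Datomic's actual storage code): sorting (attribute, value, entity, tx) tuples shows why separate id attributes pre-partition the value space.

```python
# Toy datoms as (attribute, value, entity, tx) tuples; values are
# illustrative strings rather than real uuids.
datoms = [
    (":comment/id", "c-2", 201, 1002),
    (":post/id",    "p-1", 100, 1000),
    (":comment/id", "c-1", 200, 1001),
    (":post/id",    "p-2", 101, 1003),
]

# AVET order: sort by attribute first, then value, entity, tx.
avet = sorted(datoms)
print([a for a, v, e, tx in avet])
# Every :comment/id datom is contiguous, so a lookup by that
# attribute only touches one region of the index.
```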

indy16:06:48

Ah yeah, that I understood

favila16:06:03

so that’s locality by “type” because the attribute is the “type” marker

indy16:06:22

> consider prefixing the uuids

Thanks, I’ll take note of this.

indy16:06:41

> so that’s locality by “type” because the attribute is the “type” marker

Ooh got it :)

favila16:06:14

thus the suggestion to put something in the high bits of the uuid

👍 2
favila16:06:00

then it’s :entity/id #uuid"01…", #uuid"02…" etc

favila16:06:15

so the sort order once again corresponds to the type
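The payoff can be seen by sorting a few hand-written example values under a single shared attribute (the 0x01 = post / 0x02 = comment prefix convention here is made up):

```python
import uuid

# Hand-written example :entity/id values: the first byte is a
# hypothetical type marker (0x01 = post, 0x02 = comment).
entity_ids = [
    uuid.UUID("02b1c2d3-0000-4000-8000-000000000001"),  # comment
    uuid.UUID("01a1b2c3-0000-4000-8000-000000000002"),  # post
    uuid.UUID("02d4e5f6-0000-4000-8000-000000000003"),  # comment
    uuid.UUID("01f6e5d4-0000-4000-8000-000000000004"),  # post
]

# AVET sorts all :entity/id datoms by value, so prefixed values
# group by type again:
print([str(u)[:2] for u in sorted(entity_ids)])  # ['01', '01', '02', '02']
```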

favila16:06:49

if your reads tend to have type affinity, then this increases the selectivity of the working set that needs to be loaded in memory

favila16:06:04

if your reads have no type affinity, then none of this matters; even different attrs won’t matter

favila16:06:18

you still are equally likely to need any segment

indy16:06:12

So, with my app: I store a lot of text content that needs versioning, so no text goes in Datomic, just id references to the text in a document db that also stores the diffs of previous versions. I’m using Datomic as a purely relational store, and I’m hoping all the datoms can fit in the object cache because the datom values are going to be very small. I think this might significantly improve query performance, since scanning the indexes won’t need to load segments into memory all the time, and possibly offset the degraded perf from the common attributes I use, like :entity/id, :entity/created-by, :entity/type and so on.

indy16:06:07

> if your reads tend to have type affinity

Do you mind giving an example of this?

favila16:06:15

if a given process is equally likely to read a value from anywhere in an index, then locality doesn’t matter that much because you can’t reduce the working set of segments at all

favila16:06:40

but if a process e.g. tends to read mostly comments by id, then AVET of :comment/id is going to be read a lot and other parts of AVET less so, so they are more likely not to need loading. I.e. your object cache hitrate increases

favila16:06:16

data locality is about reducing the working set of a given process. but to exploit it, your reads need a matching affinity

favila16:06:30

the way to think about this is what index the read uses in your application (AVET, AEVT, EAVT, VAET), and what affinity it has for particular leading parts of those indexes

favila16:06:03

e.g. if you read EAVT, (e.g. via d/pull), then you want to keep those segments small, which is what your large-value offloading is doing

favila16:06:26

and the AVET examples so far are for looking up an item by id. But there’s also VAET for reverse lookups [?unbound-e :ref-attr ?bound-v] and AEVT [?bound-e :attr ?unbound-v]

favila16:06:51

and whatever index-pull does--I’m not sure but I suspect mostly AEVT

indy16:06:10

That’s a good bit of stuff for me to think about and read up on. Thanks for that, and for auditing my thoughts on this.

onetom16:06:15

i haven't seen this idea of hacking the UUID high bits yet. aren't the most significant bits used for describing the type of the UUID? at least the https://en.wikipedia.org/wiki/Universally_unique_identifier article is saying something like that.

onetom16:06:41

but i would also be worried about how much such a trick would raise the chance of collisions or what other effects does it have. feels like a very hacky solution with non-obvious consequences.

onetom16:06:01

@UMPJRJU9E im also wondering why do u need an :entity/id explicitly. can't u just use the :db/id? explicitly generated IDs have the benefit of being stable, even if u export and re-import your data somehow, which is important, if there are systems external to your datomic system, which keep references to your entity IDs. in other words, if u need permalinks to your entities, then u might want to generate your own IDs, but otherwise you can just rely on the :db/id, imho.

favila16:06:30

There are a few bits reserved for uuid type and variant but they are not the highest bits. This “uuid bit hacking” is exactly what d/squuid does

👍 2
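A rough Python sketch of what a d/squuid-style id does (an approximation only: Datomic's actual implementation differs in details, e.g. it keeps the uuid version bits intact, which this sketch ignores):

```python
import time
import uuid

def squuid_like() -> uuid.UUID:
    """Semi-sequential UUID sketch: the current Unix time in seconds
    occupies the top 32 bits, the remaining 96 bits are random, so ids
    created around the same time sort near each other."""
    secs = int(time.time()) & 0xFFFFFFFF
    rand = uuid.uuid4().int & ((1 << 96) - 1)
    return uuid.UUID(int=(secs << 96) | rand)

a = squuid_like()
b = squuid_like()
# a and b share their leading 8 hex digits unless the clock ticked
# between the two calls.
print(a, b)
```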
onetom16:06:43

eg. if u have a web interface which allows u to interact with your datomic entities, then the web requests can just use the :db/id to determine which entity are you interacting with, since such interactions happen in a short time frame, so the :db/ids won't change, because u won't reload your db during such a session typically.

onetom16:06:13

@U09R86PA4 i guess im just not understanding this table in the wikipedia article, which talks about MSbs. but i should read the RFC probably, which is a bit more authoritative than wikipedia 😄

favila17:06:24

Those are the bits of the family/variant field, which is variable-length, up to 3 bits. But that field is embedded within the 128 bits, somewhere in the middle

🙏 2
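The field positions are easy to check with Python's uuid module: the version nibble and variant bits sit in the middle of the canonical string, leaving the high byte free of reserved bits.

```python
import uuid

u = uuid.uuid4()
s = str(u)  # canonical 8-4-4-4-12 hex layout

# Version: top 4 bits of octet 6 — the first digit of the third group.
print(s[14], u.version)                   # '4' 4
# Variant: top bits of octet 8 — the first digit of the fourth group.
print(s[19], u.variant == uuid.RFC_4122)  # one of 8/9/a/b, True
# The most significant byte (s[0:2]) carries no reserved bits.
```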
indy15:06:49

Unrelated Q: Might be obvious but are the values of datoms also cached in memory for ions and on-prem? If I’m storing large strings as values, do these also reside in memory?

favila15:06:13

yes, segments are fully decoded and loaded into object cache before being read, so if a read does indeed need to see the datom value, it’s going to be loaded

favila15:06:26

along with all the datoms “nearby” in the same segment

indy16:06:46

Thanks for clarifying!

souenzzo17:06:33

> ... large strings as values ...

datomic ions/cloud has a 4096-character string limit in the docs. When you use strings longer than this, the behavior is undefined (it may transact successfully, it may be trimmed to 4096 later). On-prem has no hard limit in the docs, but DynamoDB, for example, has a hard limit of 400KB per item. I think that if you try to store a string of that size in datomic on-prem@dynamodb, it may have undefined behavior too.

favila17:06:26

it’s undefined in cloud? yikes

favila17:06:52

fwiw we mandate every string attribute have an attribute predicate in our schema which enforces length
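In Datomic those attribute predicates are Clojure functions referenced from the schema; as a hypothetical sketch of the check itself (names here are made up, not the real schema):

```python
# Datomic Cloud documents a 4096-character string limit; this mirrors
# the kind of guard a length-enforcing attribute predicate applies.
MAX_STRING_LENGTH = 4096

def within_string_limit(s: str) -> bool:
    return len(s) <= MAX_STRING_LENGTH

print(within_string_limit("short description"))  # True
print(within_string_limit("x" * 5000))           # False
```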

favila17:06:10

large string values were a bane for a year or two

souenzzo17:06:00

I used cloud in the early days (~2018); at that time it behaved as I described. I remember the QA team saying: long text descriptions are saved and displayed, but cut off after a few days. I haven’t used it for 4 years now.

Joe Lane14:06:42

Behavior is not undefined. We limit strings to 4K. We always documented that but weren’t always enforcing it. We always store your data. The max DynamoDB item size is something we have circumvented in both cloud and pro since day one.