#datomic
2022-07-25
Dave21:07:34

I'm interested to know whether anyone experienced at building enterprise applications with Datomic as the database has seen advantages or disadvantages from using :db.type/ref value types in their schema. For instance, consider the two schema snippets below, which represent two ways to model a person's first name in Datomic.

{:db/ident :person/first-name
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one}
-or-
{:db/ident :first-name/name
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one
 :db/unique :db.unique/value}

{:db/ident :person/first-name
 :db/valueType :db.type/ref
 :db/cardinality :db.cardinality/one}
In the second model, just to be explicit, :person/first-name is a ref to the entity that holds :first-name/name. I know storage is cheap, so saving space is not in and of itself a reason to prefer the second one. But that aside, say you have hundreds of thousands or even millions of "fred"s in your database: what would be the advantages or disadvantages of each model (if any) under these circumstances?

Dave23:07:56

Thanks @U0LAJQLQ1. Taking this to a thread to keep everything in one place in case you care to respond further. Appreciate your time.

> you probably want to use a component, unique is very weird in this case
As in :db/isComponent? :db.unique/value in this case just means you can't have two :first-name/names in Datomic with value "fred". So :person/first-name is just a CaS operation, e.g., my first name used to be Fred, now it's Ralph. Again, just an example, as I know folks don't often change their names, but think entities where mutations, at least up until an explicit time whose attribute grouping or :<namespace>/<name>s we control, are common, i.e., today I'm Fred, tomorrow I may be Ralph, and the next day back to Fred or a different name altogether.

> when you use refs, accessing and updating data is more of a pain in the ass
Can you be more specific? In our app, updating data is exclusively a user exercise, i.e., they are telling the system whether they are "Fred" or "Ralph".

> if you are considering the second use case, you prob are going to do that for almost every field in your database, so you can imagine that you are going to need to do a lot of extra work to manage the refs wherever you use them. you are also diluting the meaning of refs for the rest of your data. a ref should be meaningful. adding another ID to a string for all attributes seems insane
Our app will have some very complex entity types, i.e., if we were to build them via flat structures, they could have thousands of fields or attributes that our users have to specify each time they create that entity. Since our app is all about making our users more productive, we've broken entity types down and made them much more granular but still meaningful with regard to the domain workflow. So now, instead of a user having to fill in thousands of fields of mostly scalar values to specify an entity, they are simply refing one to ten or more entities they or someone else has already defined (think Rich Hickey talking about composability in programming, only it's composability for our specific domain).

> every time i had done something like what you did in the second example, i have regretted it and changed it to be a flat structure.
Why was that? I ask because with our app, we have no external data sources, i.e., there is no concept of mapping data (or names) to another source, system or database.

pppaul23:07:54

first, i need to clarify something. is your second model primarily meant to save space in the database? because i first thought that you were treating the name as an ID in one of the examples (a common use of unique is to do ref lookups), e.g. (d/pull db '[*] [:person/name "fred"]). i was not considering the objective of saving space.

pppaul23:07:06

updating a ref is a pain depending on how you access it. typically if you are swapping around refs you want to make sure you don't have zombie refs. in the case you describe, if nobody has a name that refs a :first-name/name, then you may want to delete that. it's an easy way to make zombies when you update a ref and forget to have the ID attached, so you make a new object instead of updating one.

pppaul23:07:53

in this case your ID is the value, so you avoid one pain point, but now you may want to run a GC on your system to clean up unused names

pppaul00:07:19

a flat structure in datomic looks as large as it has existing fields. also it looks as large as the pull request at other times. doing pull request on refs requires a lot more work, especially if they are back references. usually renaming, default values, and maybe translations are included in the pull. if you are using db/idents then the pulls involve a bit more work. if you are taking the first-name/name approach then you probably will have a ton of db/idents as keyword refs as well.

pppaul00:07:53

I don't really understand the problem you are solving. if you are making a system where ent fields are mostly indirect, and you want the fields to be immutable, it sounds a bit interesting, but i wonder what the queries will look like. the pull requests are going to look bad, you'll probably want helper functions to make them for you (because you can't use components). i use db/idents everywhere in my system as keywords, and they are annoying. i'm guessing that on your system the string IDs are an important feature for user discovery?

Dave02:07:32

Thanks for being so generous with your time @U0LAJQLQ1.

> first, i need to clarify something. is your second model to primarily save space in the database?
No. Although my guess (that's all it is, based on how many bytes a primitive, e.g., double, in Java takes vs a reference) is it would. My reason for exploring the second model is based purely on the fact that I see the world in a bottom-up fashion and I'm wondering whether a data model based on the way I see the world could make people doing physical-world things (like manufacturing, which is my subject matter expertise) more productive. Apologies for the lack of context, but I find adding too much info in your original post can lead to people just glossing over it and not engaging or replying. In our data model, as it relates to my 'person' example, I've added the following for clarification. Note: :db.type/string could be :db.type/uuid or :db.type/long in the case of :<>/id, we just chose to make it a string.

{:db/ident :person/first-name
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one}

{:db/ident :person/id
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one
 :db/unique :db.unique/value}
-or-
{:db/ident :first-name/id
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one
 :db/unique :db.unique/value}

{:db/ident :first-name/name
 :db/valueType :db.type/string
 :db/cardinality :db.cardinality/one
 :db/unique :db.unique/value}

{:db/ident :person/first-name
 :db/valueType :db.type/ref
 :db/cardinality :db.cardinality/one}
> updating a ref is a pain depending on how you access it. typically if you are swapping around refs you want to make sure you don't have zombie refs. in the case you describe if nobody has a name that refs a :first-name/name, then you may want to delete that. it's an easy way to make zombies when you update a ref and forget to have the ID attached, so you make a new object instead of updating one.
In our domain case, I'm not sure zombie refs would be a big deal, especially if we choose to set :db/noHistory true on :person/first-name.

> in this case your ID is the value, so you avoid one pain point, but now you may want to run a GC on your system to clean up unused names
Given the additional context above, the value of :first-name/id is still (just) a visual identifier (what our user sees on the screen), but in terms of the :db/id, the single, database-wide entity that is "fred" is represented as a long in the datom, and that long is what :person/first-name refs, like:
E              A                    V
[001]          :first-name/id       ER85W23QQ81
[001]          :first-name/name     fred
[002]          :person/id           FG53Q211KOL
[002]          :person/first-name   [001]
> if you are making a system where ent fields are mostly indirect, and you want the fields to be immutable, it sounds a bit interesting,
Bingo. That's what I mean by 'single, database-wide entity' above. Defining "fred" in terms of a string instead of a person attribute and specifying it as :db.unique/value is such an 'immutable field'. This is not necessarily natural for us humans to do, but as a bottom-up thinker, it's my default pov. Most hear "fred" and immediately think, "person's name". But if we want to make "fred" (or any scalar value for that matter) as reusable (use it to compose other entities) as possible, why not make it its own, immutable entity? Btw, Googling "Immutability" 7 years ago is what led us to Rich Hickey's talks, which led us to choose Clojure/Datomic in the first place.

> i'm guessing that on your system the string IDs are an important feature for user discovery?
Not sure exactly what you mean by 'user discovery', but the use of the :<namespace>/id attribute was intentional for querying purposes.
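A minimal sketch of the indirection in the datom table from this message, written as plain Clojure data rather than against the Datomic API. The entity ids 1 and 2 stand in for [001] and [002]; the helper names `value-of` and `first-name-of` are hypothetical and exist only for illustration.

```clojure
;; Plain-Clojure model of the four datoms in the table; not the Datomic API.
(def datoms
  [[1 :first-name/id     "ER85W23QQ81"]
   [1 :first-name/name   "fred"]
   [2 :person/id         "FG53Q211KOL"]
   [2 :person/first-name 1]])            ; ref: the value is entity 1's id

(defn value-of
  "Returns the value of attribute `a` on entity `e`, or nil if absent."
  [datoms e a]
  (some (fn [[e2 a2 v]] (when (and (= e e2) (= a a2)) v)) datoms))

(defn first-name-of
  "Follows the :person/first-name ref, then reads the name string it points to."
  [datoms person-eid]
  (when-let [name-eid (value-of datoms person-eid :person/first-name)]
    (value-of datoms name-eid :first-name/name)))

(first-name-of datoms 2) ; => "fred"
```

Reading a person's first name takes two steps here (person to name entity, name entity to string) where the flat model would take one.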

pppaul02:07:50

the main issue i can see from this design is that you are going to have indexes on each of these value IDs, so you may want to look into the consequences of that. also all of your queries will have a layer of indirection in them, but queries in datomic tend to be pretty small, so i'm not sure it's too important. i think you also lose lookup refs, or your lookup refs will all have backtracking pulls.

pppaul03:07:52

so, you'll have to make a bit of a DSL if you want to make your code look more like regular datomic code (all reverse lookups return lists, usually of a single item). but if you are consistent in your data model then it shouldn't be hard to make some helper functions/macros that you use everywhere. that ends up sorta happening in normal use anyway.

pppaul03:07:57

you are making AVETs on many ents, so you need to explore how this affects datomic, as the docs say this is expensive.

pppaul03:07:44

you may want to look into XTDB as well, i don't know much about it, but it seems fairly active, so people should be able to tell you if you are going to break their DB or not. it has a lot of similarities with datomic, but doesn't seem like you can painlessly swap one for another (different APIs)

pppaul03:07:05

you may have to keep your own index outside of datomic if your design is going to break datomic (like a KV store, rocksDB or something)

seepel06:07:49

I'm curious what are the advantages of making fields indirect? It sounds like a pain and I'm struggling to come up with a use case that it would serve.

Dave17:07:46

@U03NXD9TGBD, RH discusses this in his talk https://github.com/matthiasn/talk-transcripts/blob/9f33e07ac392106bccc6206d5d69efe3380c306a/Hickey_Rich/PersistentDataStructure.md if you haven't watched it before. He cites indirection in the transcript 4 times. Even though I'm not a programmer, I was able to relate to this as soon as I listened to this talk. Actually found myself smiling and nodding in agreement with every RH talk I've listened to thus far. As a user of manufacturing and supply chain SORs (systems of record), I've endured decades of "identity, state, and values" hell because of the way these systems were designed (underlying data models). There's a huge and unnecessary cognitive load placed on supply chain information workers using these SORs and that's zapping productivity. The importance of "immutability" in SORs cannot be overstated imo. All that said, I can see why any programmer might see indirection as a pain or at least an inconvenience. Simple Made Easy is another great one to watch if you haven't already.

pppaul17:07:13

layers of indirection have a cost, sometimes i find myself removing abstractions from my code because debugging them is hard, or understanding them is hard, or i didn't need them in the first place. sometimes abstractions are needed to solve problems, or let users hook into your system to implement their specialised solutions. i think in your case with datomic, the main downside is the indexes. also it may be the case that you want this property for some of your data, and not all, maybe not even much.

pppaul17:07:56

One of the bigger problems that i have run into when building software, is that most people making the system (non-devs, but also devs) don't think about change in the system. business demands that things change, but when those things are binding contracts, well that sounds like a bad idea. people have a very big problem when it comes to identifying when something should stop changing. datomic and your idea of how to use it, do not cover that problem. building things that make sense is a very hard job, and it's not really a programming problem.

Dave18:07:21

> One of the bigger problems...
Couldn't agree more @U0LAJQLQ1. Business stakeholders and devs alike haven't given the necessary hammock time to solving many of these hard problems. There's a myriad of reasons on both sides that could fill a book as to why and how this happens. I'd like to think our small team is different and will indeed solve many of them given the 7 years of hammock time we've put into the effort IP we've developed but we won't know until we commercialize. There are other facets to our application intended to deal with change. Datomic schema is but one of them which I happen to be focused on right now.

> layers of indirection have a cost...
Can you give an example of your having to use an abstraction to solve a problem and an example of what you mean by specialized solution? I know indexes have costs so best to avoid whenever possible. To that end, any elaboration as to why indirection inhibits the use of lookup refs is appreciated. I looked through the documentation and it's not readily apparent to me.

pppaul19:07:28

multimethods and protocols allow for a type of abstraction that allows users to hook in and create their own solutions to sub or whole problems, while working with the larger system (like plug-ins). embedded languages also allow this, typically referred to as scripting. currently I'm dealing with http://Sentry.io (you may want to use that as well), and building error reports has some well-defined structure. I use multimethods to have different types of errors build parts of their own error report. at the same time I am testing the Sentry sdks, and for that I don't want any abstractions, I just want raw data to test with. one of the major costs of abstraction is that debugging becomes expensive and other people have trouble maintaining your code. there has to be a big payoff for a big abstraction. SQL is an example of a big abstraction with a big payoff, but good luck asking a random dev to fix anything in postgres core

Dave19:07:27

Thanks @U0LAJQLQ1 for being so generous with your time. I tend to think of abstractions with respect to https://en.wikipedia.org/wiki/Type%E2%80%93token_distinction which doesn't help much when your dev team is trying to explain and have you understand their definition of an abstraction. I struggle mightily to understand it from a dev's pov. What's your best, general definition of an abstraction given the way you're using it above? Also, would still like your take on how indirection inhibits the use of lookup refs assuming that's an accurate interpretation of what you're saying above. Perhaps there's something in the Datomic documentation you've read that I've missed that could provide additional context.

pppaul20:07:06

the article you link to is talking about a certain type of abstraction. example, programmers don't write code to operate on 8-bit chunks of memory, we write code to operate on something like an integer, or decimal, or text, or list. https://en.wikipedia.org/wiki/Integer_(computer_science) look at how many ways there are to represent an int in programming. it's insane and most programmers just abstract away all of that until they run into a problem (code being slow, code taking up too much memory). we work with something where we don't know how many bits it is, and it could change in size depending on certain things happening in the program. we do that with lists as well, lists are rarely fixed sizes, programmers have no idea what their lists look like in memory, but they know how to add something to a list. in higher level languages behaviour becomes an important abstraction. that's what i was talking about with regards to multimethods, protocols, and scripting (interfaces) https://en.wikipedia.org/wiki/Abstraction_(computer_science) . also mentioned in that article, and the main reason why i found lisp/clojure, is language abstraction. sometimes the best way to solve a problem is to create a domain specific language for it, like SQL or Datalog (datomic's language).

pppaul20:07:38

for the ref lookups, you just aren't going to be looking up the direct entity that you want. [:first-name/name "fred"] isn't going to point to something you care about, you'll have to do something like (d/pull db '[{:person/_first-name [:person/id]}] [:first-name/name "fred"]) and then you'll get a list of all the people with "fred" as a name, then you have to figure out which one you want. it becomes very useless except for the exact scenario of getting everyone with the name "fred", which you can do with a 1 line query anyway. so you lose all ability to use lookup refs, which are a pretty big deal. you'll have to do queries for every data fetch. in my system queries are a special case, and 95% of my db reads are with pulls via lookup refs
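The difference between entering through a person's own unique id and entering through the shared name entity can be sketched in plain Clojure, without the Datomic API. The second person, its id "HYPOTHETICAL01", and the helper names `resolve-lookup-ref` and `referrers` are assumptions made for illustration only.

```clojure
;; Plain-Clojure model: two people share the single "fred" name entity.
(def datoms
  [[1 :first-name/name   "fred"]
   [2 :person/id         "FG53Q211KOL"]
   [2 :person/first-name 1]
   [3 :person/id         "HYPOTHETICAL01"]   ; hypothetical second person
   [3 :person/first-name 1]])

(defn resolve-lookup-ref
  "Resolves [unique-attr value] to the one entity id holding it, like a lookup ref."
  [datoms [a v]]
  (some (fn [[e a2 v2]] (when (and (= a a2) (= v v2)) e)) datoms))

(defn referrers
  "Entity ids whose ref attribute `a` points at `target`. Always a collection."
  [datoms a target]
  (vec (for [[e a2 v] datoms
             :when (and (= a a2) (= v target))]
         e)))

;; Entry via the person's own unique id: exactly one entity comes back.
(resolve-lookup-ref datoms [:person/id "FG53Q211KOL"])             ; => 2

;; Entry via the shared name: resolve it, then walk the ref backwards.
(referrers datoms :person/first-name
           (resolve-lookup-ref datoms [:first-name/name "fred"]))  ; => [2 3]
```

In Datomic itself the backwards step is written in a pull pattern as the reverse attribute :person/_first-name, and it likewise returns a collection that the caller still has to narrow down.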

Dave22:07:20

> it's insane and most programmers just abstract away all of that...
Got it. And it is insane! It's the result of 40+ years of accumulating "incidental complexity". Can definitely relate to it. In our introduction video, https://edgewoodsoftwarecorp.s3.us-east-2.amazonaws.com/CoalesceIntroduction.mp4 (best to avoid Firefox for best audio, if you have 10 mins. to watch it), we highlight systems interoperability as one of the underlying root causes of the entrenched, difficult supply chain problems we're trying to solve. Mapping disparate data models has reached its max. scale imo. Those in the enterprise software space still employing this approach are in an endless cycle of just swapping customers at this point. Which is why we're building a large, horizontal (end-to-end product value chain) application that doesn't rely on any external data sources.

As for lookup refs, I thought that might be what you were getting at. Reverse lookups will be needed, e.g., a user needs to filter a set of higher order entities that include "fred" (common constituent entity). But in the user's day-to-day interaction with the system performing their work, i.e., creating and managing specifications or digital twins of physical referents (should make much more sense to you if you watch the video), their entry point is :person/<> and not :first-name/<>.

pppaul21:07:11

not everyone has a first name

pppaul21:07:52

you probably want to use a component, unique is very weird in this case

pppaul21:07:44

you can always migrate your data

pppaul21:07:17

when you use refs, accessing and updating data is more of a pain in the ass

pppaul21:07:58

if you are considering the second use case, you prob are going to do that for almost every field in your database, so you can imagine that you are going to need to do a lot of extra work to manage the refs wherever you use them. you are also diluting the meaning of refs for the rest of your data. a ref should be meaningful. adding another ID to a string for all attributes seems insane

pppaul21:07:32

every time i had done something like what you did in the second example, i have regretted it and changed it to be a flat structure. when i do have refs for aesthetics, they are usually components, and represent a grouping of related data.
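A hedged sketch of what "a component that represents a grouping of related data" might look like as schema. The attribute names :person/name, :name/first and :name/last are hypothetical and do not come from the thread; only :db/isComponent and the other :db/* keys are standard Datomic schema attributes.

```clojure
;; Hypothetical schema: a person owns one name entity that groups related fields.
[{:db/ident       :person/name
  :db/valueType   :db.type/ref
  :db/cardinality :db.cardinality/one
  :db/isComponent true}                 ; the name entity belongs to this person
 {:db/ident       :name/first
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one}
 {:db/ident       :name/last
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one}]
```

Unlike the shared :first-name/name entity, a component is owned by a single parent: it is pulled along with the parent by a wildcard pull and retracted with it, so it does not become a database-wide "fred" that many people point to.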