#xtdb
2020-10-03
nivekuil02:10:06

ok, so here's (3), which I think is usually the best of both because we can walk only one edge to get info about friends, as opposed to two in (2), but can still ask questions based on the relation itself:

{:crux.db/id   1
 :user/friends #{2 3}}

{:crux.db/id        {:friendship [1 2]}
 :friendship/to     2
 :friendship/close? true}

{:crux.db/id        {:friendship [1 3]}
 :friendship/to     3
 :friendship/close? false}

;; Info about friends -- only walks one edge, doesn't join friendships!
(q '{:find  [?friend ?friend-name]
     :where [[?e :crux.db/id 1]
             [?e :user/friends ?friend]
             [?friend :user/name ?friend-name]]})

;; Info about friends based on friendship -- walks one edge, would have to
;; walk two if we wanted the friend name as well
(q '{:find  [?friend]
     :where [[?e :crux.db/id 1]
             [?e :user/friends ?friend]
             [?friendship :friendship/to ?friend]
             [?friendship :friendship/close? true]]})

nivekuil02:10:07

haven't tested the code (I'm not actually modeling friendships myself) but I think that's a good option that's enabled only if we are ok with directed edges?

nivekuil02:10:34

where the value 2 is a hyperedge linking the entities 1 2 and {:friendship [1 2]}.. I'm beginning to think that Crux is really best thought of in terms of a hypergraph rather than a graph

Steven Deobald07:10:36

I slept some very random hours last night so this might be a tired brain talking, but here goes: does it feel strange that the :friendship/to is implied by the ordering of the ids in the {:friendship [x y]} vector? {:crux.db/id {:friendship [1 2]} :friendship/to 1} is a valid document but presumably breaks the semantics you're hoping for, making the structure potentially a little fragile?

nivekuil07:10:03

I don't think so, at least not with queries. I think queries should never be concerned with the value of :crux.db/id at all, at least because the way facts are distributed across documents has operational consequences that should be transparent to the application

nivekuil07:10:57

unfortunately queries still do need to care about the question of "is this attribute in the same document as that attribute?" but I don't think you ever need to worry about "what is the value of :crux.db/id in a document?"

Steven Deobald07:10:38

i was concerned more with puts than queries: "whoops, an ordering bug overwrote {:friendship [2 1]} when it was supposed to be [1 2]"? maybe that's less likely than i'm making it seem in my head.

nivekuil08:10:13

yeah then you would have duplicate entities with the same attributes, which I think is another dangerous problem in general, but I think in practice you'd have the document generator as a function rather than typing it out by hand anyway. It could be a little more likely to mess up typing it out by hand with the vector stuff but typos are always a possibility

Steven Deobald06:10:39

hehe... i wasn't imagining typing attributes out by hand. i just don't trust any of the code i write. 😉

dominicm07:10:45

I'm writing a TxLog, but my ids unfortunately are not longs, they're a compound value managed by the server I'm connecting to. What constraints does Crux apply to the tx-id? Unfortunately, I can't think of a way to losslessly convert between a long and my id value, so I'm hoping Crux's actual constraints are more like "ascending" and the type hints could simply be removed.

dominicm10:10:21

I think the only care is actually in the kv stores which want to know how many bits to store. Unfortunately I have a pair of longs instead of a single long, so I can't fit that into a single long. I'll probably do some truncation for now and assume that things won't go too badly... Still interested in input from the core team though.

jarohen11:10:18

Yep, you're on the right lines there - we certainly care that it's ascending. The KV stores do assume it's 8 bytes - in theory this isn't a hard requirement but will likely involve a decent amount of changes throughout the indexer and query engine.

dominicm11:10:14

@U050V1N74 Oh, I thought it might not be ascending. Hmph. I've done some bit fiddling to get the 2 longs into a single long (under the assumption that 2070 is a long way away and someone will have fixed this by then...). But (I think) that will break the long from being ascending, even though there's linearity when I convert it back.

dominicm11:10:06

Yeah, it definitely breaks ascending under the current impl. I can probably do some bit shifts to solve that, but I'd be curious to know if I need to :)

dominicm11:10:03

I cannot do bit shifting to solve that, that isn't how longs work 😀 So yeah, if they don't need to be ascending inside of Crux, that'd be helpful.

jarohen11:10:06

You might not see issues with that, necessarily - the index in question is ordered first by transaction-time, then by transaction-id (for historical reasons, although I wouldn't rely on that - we're considering solely using transaction-id there) - so if you don't have any transaction-time millisecond ties, the ordering of the transaction-ids won't be noticeable in that area. Another place you will notice it is in your implementation of open-tx-log, which takes an after-tx-id - this doesn't require it to be ascending, strictly speaking, only that it's totally ordered. I can't think of others, off-hand :thinking_face:

jarohen11:10:19

entity-history may well sort internally just by transaction-id (on the assumption that sorting by tx-time and tx-id results in the same ordering)

jarohen11:10:57

certainly an interesting use-case, good to challenge our assumptions, and a possible driver of further unbundling 🙂

dominicm11:10:41

open-tx-log is fine, as I can take the long and split it back into its constituent parts, and those are ascending. (I'm taking the current time millis (essentially) and a tie-breaker, which are both normally longs, and assuming the date won't get too high, and neither will the number of tie-breakers. Then I'm writing the short into the 16 bits left spare by the assumption that the date won't get too high.)
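A hypothetical sketch of that packing scheme (the helper names are mine, not from the thread): put the millisecond timestamp in the high bits and a 16-bit tie-breaker in the low bits. As long as the millis value stays below 2^47 the packed value stays a positive long, and plain long comparison then agrees with ordering by (millis, tie-breaker):

```clojure
;; Hypothetical helpers -- pack a millisecond timestamp and a 16-bit
;; tie-breaker into one ascending long. Assumes millis < 2^47 and
;; tie-breaker < 2^16.
(defn pack-tx-id ^long [^long millis ^long tie-breaker]
  ;; millis in the high 48 bits, tie-breaker in the low 16
  (bit-or (bit-shift-left millis 16)
          (bit-and tie-breaker 0xFFFF)))

(defn unpack-tx-id [^long tx-id]
  ;; recover the [millis tie-breaker] pair
  [(unsigned-bit-shift-right tx-id 16)
   (bit-and tx-id 0xFFFF)])
```

Comparing two packed longs is then the same as comparing their (millis, tie-breaker) pairs, which is what the after-tx-id contract needs.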

dominicm11:10:39

I think it should always be incrementing when there's a tie. So it's possibly something I can get away with. Although I'm still not seeing any results from my query :D

dominicm12:10:19

blugh. I'd done ::tx/id instead of ::tx/tx-id

dominicm12:10:39

Still no luck though :) Must be more bugs

dominicm13:10:38

I got it! Woohoo party-corgi

🎉 3
dominicm13:10:53

I also did tx-event instead of tx-events. Basically: pay attention to details.

nivekuil09:10:22

ah, EQL pull syntax is an immediate casualty if you want :crux.db/ids to be invisible since EQL joins look at those. Probably isn't too hard to write a macro with similar semantics

๐Ÿ‘ 3
victorb10:10:29

Can I query based on crux.db/content-hash ? I got the ID + the content-hash and would like to retrieve the document it represents but my simple tries don't do the trick. I'd like to avoid fetching the entity history of the entity then manually go through each one of them to check if the content-hash matches. I'm trying to do something like this (this particular example returns nothing):

(crux/q (crux/db crux-node)
          {:find '[?e]
           :where '[[?e :crux.db/content-hash $content-hash]]
           :args [{'$content-hash content-hash}]})

refset19:10:28

At one point there actually was a document api that could return documents based on their content-hash, but we took the decision that it was probably a bad idea to expose applications to Crux-internal hashing. There are currently no public APIs to retrieve the doc in a query or otherwise. Assuming you are already pulling the content-hash from the entity-history API, is it not sufficient to store the coordinate of [eid valid-time tx-time] and use that as a key?

victorb19:10:41

@U899JBRPF the content-hash actually comes from a request in this case, (like "/:some-resource-id/:content-hash"). I might be better off adding my own content-hash then. I worked around it for now by fetching the entity history, group-by :content-hash then get it from there. Thanks for the answer!
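For reference, that workaround might look something like this sketch, assuming a running node bound to `crux-node` and that each entry returned by the entity-history API (with `:with-docs? true`) carries `:crux.db/content-hash` and `:crux.db/doc`:

```clojure
(require '[crux.api :as crux])

;; Sketch of the workaround: walk the entity's history and pick the
;; version whose content hash matches the one from the request.
;; `eid` and `content-hash` would come from the route params.
(defn doc-at-content-hash [crux-node eid content-hash]
  (->> (crux/entity-history (crux/db crux-node) eid :desc
                            {:with-docs? true})
       (filter #(= content-hash (str (:crux.db/content-hash %))))
       first
       :crux.db/doc))
```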

๐Ÿ™ 3
๐Ÿ™‚ 3
nivekuil20:10:35

Another reason for "decomplecting" documents and logical entities is data eviction for compliance purposes: can you forget about a user's past email/phone #s without forgetting their past passwords if they're all in the same document?

โœ”๏ธ 3
nivekuil20:10:52

in fact I think it's a good idea to put each fact about PII into its own document and explicitly mark it as PII.. will probably save headaches down the road as the laws change
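As a sketch of that modelling idea (the attribute names here are hypothetical), each PII fact gets its own document plus a marker attribute, so it can be evicted independently of the rest of the user's history:

```clojure
;; Hypothetical PII-per-document layout -- one fact per document,
;; explicitly tagged as PII, keyed by a map id that links back to the user.
[[:crux.tx/put {:crux.db/id {:user/email 1}
                :pii?       true
                :pii/user   1
                :user/email "alice@example.com"}]
 [:crux.tx/put {:crux.db/id {:user/password-hash 1}
                :user/password-hash "bcrypt$..."}]]

;; A later compliance request can then evict just the email history,
;; leaving the password history untouched:
;; [[:crux.tx/evict {:user/email 1}]]
```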

➕ 3
mauricio.szabo01:10:23

Does it have any performance issue, adding each fact about an entity in its own document? Also, what does PII stand for?

nivekuil01:10:27

definitely, since you'll have to do more joins in the query if you want the document assembled together. The upside is you can reduce duplication if you're only updating that fact, since you can only submit whole documents, and maybe save a decent amount of space if your documents are big. PII is personally identifiable information, which, as anyone who's worked in a high-compliance environment will likely tell you, is like toxic waste to handle

Steven Deobald06:10:13

or, as anyone who's worked in a low-compliance environment will tell you, actually feels like swimming in toxic waste.