xtdb 2022-03-07 | Slack Archive

oxalorg (Mitesh)08:03:00

Hey all! I want to model role based access using 3 different types of documents in XTDB. This is how my structure looks at the moment which works fine:

;; organisation
{:type :organisation
 :xtdb/id "org1-id"
 :organisation/name "org1"}

;; user
{:type :user
 :xtdb/id "member1-id"
 :user/email ""}

;; role
{:type :role
 :role/organisation "org1-id"
 :role/user "member1-id"
 :role/level :owner}

The role structure looks very similar to the "intermediate" many-to-many table we use in relational db's. I'm wondering if there is a better approach to model this? Perhaps a set of roles in user like :user/roles #{role1, role2} or something else? :thinking_face:

malcolmsparks08:03:20

Yes, I agree. Give roles to a user rather than users to a role, and allow users to have many roles.

oxalorg (Mitesh)09:03:49

Thanks!!! So if I understand this correctly, the suggestion here is to model it like this?

;; roles 1 and 2
{:type :role
 :xtdb/id "role1-id"
 :role/organisation "org1-id"
 :role/level :owner}

{:type :role
 :xtdb/id "role2-id"
 :role/organisation "org2-id"
 :role/level :guest}

;; user
{:type :user
 :xtdb/id "member1-id"
 :user/email ""
 :user/roles #{"role1-id" "role2-id"}}

I am thinking this might make some queries more difficult like "List all users with their levels for org1" Changing role for a user also now is a little bit confusing to me. Do I just update the role documents (update role1 :role/level :admin), or do I only update the references (remove the old role and add a new role reference)? i.e. does each role be unique to a user instead of being reused across multiple users?

{:type :role
 :xtdb/id "role1-user1-id"
 :role/organisation "org1"
 :role/level :member
 :role/override-perms #{"share"}}

I guess I'm still thinking too strictly in how I've modeled data using relational tables, so trying to get some sense of intuition here! 🙌

oxalorg (Mitesh)14:03:49

I went with reifying the role concept into something more than a role (a member which contains a role). Realised I wanted to store other things as well to it and make it truly unique to a (user, org) pair. In this case it felt most intuitive to stick to my original implementation as it made most queries easier and I didn't find any gains by doing it the other way! Thanks for the help! 🙂

☺️ 1

refset15:03:45

Thank you for your write-up @U013MQC5YKD - it's really great for everyone to share these kinds of experiences, and there are always tonnes of designs and trade-offs to consider with Datalog-based authz

🙌 1

➕ 1

Steven Deobald19:03:11

@U013MQC5YKD If you have further data modeling questions, it would be great to extend the conversation to http://discuss.xtdb.com — these come up pretty often (and there are a lot of small variations on similar ideas) and the forum might be a nice place to track ideas as a bit of light documentation that folks could read in the future (Clojurians does have a log, but it's not very google-friendly or easy to update with new content).

👍 1

Ivan Fedorov16:03:14

What about conditionally adding clauses? Sometimes I don’t want to filter by :entity/type, can I exclude this triplet on a logical condition?

(defn entity:list-help-offers
  [node params]
  (->> (xt/q
         (xt/db node)
         '[:find (pull e [*])
           :in e-type e-aid
           :where
           [e :entity/type e-type]
           [e :entity/kind :help-direction/offer]
           [e :entity/area-id e-aid]]
         (:entity/type params)
         (:entity/area-id params))
       (mapv first)))

upd: yes managed to

(defn entity:list-help-offers-2
  [node params]
  (let [e-type (:entity/type params)
        where-block
        (filterv identity
          [(if e-type
             '[e :entity/type e-type])
           '[e :entity/kind :help-direction/offer]
           '[e :entity/area-id e-aid]])]
    (->> (xt/q
           (xt/db node)
           {:find '[(pull e [*])]
            :in   '[e-type e-aid]
            :where where-block}
           (:entity/type params)
           (:entity/area-id params))
         (mapv first))))

👍 1

refset16:03:16

this is generally the best pattern. You can often try to use Datalog or and rules to achieve a similar effect, but the performance implications aren't always obvious there

👍 1

sheluchin16:03:09

I have three models with a to-many relation: foo -> bar -> baz, and I'd like to count the number of baz related to a specific foo. Some foos may be missing the foo -> bar relation, the bar -> baz relation, or both, in which case I'd like to count that as zero. How could I write such a query?

refset16:03:52

I'm probably missing something here, but I'd guess like this:

{:find [(count ?baz)]
 :in [?foo]
 :where [[?foo :to-bar ?bar]
         [?bar :to-baz ?baz]]}

refset17:03:58

can you give a concrete example of a few documents and the count you'd expect?

sheluchin17:03:08

I slightly misstated the problem. I'd like to get totals for each foo, not a specific one that gets parameterized. Results should be something like this:

{:xt/id 1
 :type :foo}
{:xt/id 2
 :type :foo}

{:xt/id 3
 :type :bar
 :foo 1}
{:xt/id 4
 :type :bar
 :foo 1}

{:xt/id 5
 :type :baz
 :bar 3}

:xt/id 1 :baz-count 1
:xt/id 2 :baz-count 0

👍 1

sheluchin18:03:44

My actually query is just a bit more complicated, aggregating items by year:

{:find '[?period ?foo-name (count ?baz) #_ ?baz-count]              
 :where ['[?baz ::baz/bar ?bar]                                     
         '[?baz ::baz/datetime ?date]                               
         [(list (period->date-fn-var :year) '?date) '?period]
         '[?bar ::bar/foo ?foo]                                     
         '[?foo ::foo/name ?foo-name]]                              
         ; '[(count ?baz) ?baz-count]]})                       
 :order-by '[[?period :asc]]})

This almost gets me all the way. The results are something like:

["2012" "foo-1" 12]
           ["2013" "foo-1" 98]
           ["2013" "foo-2" 40]
           ["2014" "foo-2" 3]

But I haven't yet figured out how to zeros for years without relations. Another thing I don't quite understand is why uncommenting the last :where clause and using its result in the :find instead of doing the aggregation in the :find directly doesn't produce the same result. It seems like it gets stuck and isn't able to complete when I use it that way.

refset15:03:24

Hey @UPWHQK562 I think this works:

(let [n (xt/start-node {})]
    (->> [{:xt/id 1
           :type :foo}
          {:xt/id 2
           :type :foo}
          {:xt/id 3
           :type :bar
           :foo 1}
          {:xt/id 4
           :type :bar
           :foo 1}
          {:xt/id 5
           :type :baz
           :bar 3}
          {:xt/id 6
           :type :baz
           :bar 3}]
         (map (fn [doc] [::xt/put doc]))
         (xt/submit-tx n)
         (xt/await-tx n))
    (xt/q (xt/db n) '{:find [e (sum c)]
                      :where [[e :type :foo]
                              (or-join [e c e3]
                                       (and [e2 :type :bar]
                                            [e2 :foo e]
                                            [e3 :type :baz]
                                            [e3 :bar e2]
                                            [(identity 1) c])
                                       (and [(identity nil) e3]
                                            [(identity 0) c]))]}))
;;=>  #{[2 0] [1 2]}

🎉 1

refset15:03:32

> Another thing I don't quite understand is why uncommenting the last :where clause [...] It looks like you are hoping to be able to do an aggregate in the middle of the query, which won't work as you might expect (instead I think ?baz-count will always return 1). If you really want to do such a thing though, you can use subqueries for that

sheluchin15:03:59

@U899JBRPF thanks very much! I'll try that shortly. I worked on it a bit longer yesterday and also tried to use the same trick with (identity 1) and zero, but couldn't quite make it do the thing. I used or instead of or-join and had to put in a bunch of [(any? x)] noops to make it do anything at all.

🙏 1

refset15:03:10

> and had to put in a bunch of [(any? x)] noops ah, I should mention I'm using the 1.21 beta release which is a lot smarter here, and ~eliminates the need for this

sheluchin15:03:03

Ah, okay, glad I brought it up 🙂 Looking forward to that release then. I'll try applying your solution later and will give ya a :thumbsup: reaction if it works. Not sure, but maybe something like this should either in docs or the query-test ns? Then again, I searched in here and didn't find anyone asking a similar question. Maybe it's not so common of a requirement. Thanks again @U899JBRPF.

👌 1

refset15:03:44

True, this is definitely a concise problem and example! Fwiw, I don't recall ever solving exactly this before :thinking_face:

sheluchin12:03:05

Hi @U899JBRPF. The query produces the desired result, but the only problem is that on my data set/local machine, adding in the zero values manually AFTER running the query is about 3x faster than doing it all in the query. Is that something you would expect?

refset13:03:44

cool, that's good to hear - is that perf with the beta release?

refset15:03:43

Doing non-IO work inside the query engine is often slower than just using raw Clojure due to the interpretive execution and hashing overheads involved in how the algorithm works

sheluchin15:03:31

Yeah, that's on the beta.

👍 1

sheluchin15:03:14

Which part of that is considered non-IO? It's just aggregations by the looks of it. Granted, I am doing some date truncation in my actual query, but that's not the part that I'm replacing with raw CLJ.

refset15:03:16

On reflection, I think it's possible to rewrite the Datalog to execute much more efficiently using get-attr :thinking_face: I'll have a play with it in a bit

refset15:03:52

> Which part of that is considered non-IO? for instance, this leg creates a tonne of unnecessary subqueries and isn't doing anything with the data in the database:

(and [(identity nil) e3]
     [(identity 0) c]))

(but it's necessary in the current structure for the join to work as a kind of left-join)

sheluchin15:03:37

I was thinking about get-attr with a default value but couldn't summon the right incantations just yet 😛

🙂 1

refset11:03:45

On yet further reflection, get-attr isn't much use since it operates in the wrong direction, but you can do this:

(let [n (xt/start-node {})]
  (defn get-bar-baz-count [db e]
    (or (ffirst (xt/q db
                      '{:find [(count e3)]
                        :in [e]
                        :where [[e2 :type :bar]
                                [e2 :foo e]
                                [e3 :type :baz]
                                [e3 :bar e2]]}
                      e))
        0))
    (->> [{:xt/id 1
           :type :foo}
          {:xt/id 2
           :type :foo}
          {:xt/id 3
           :type :bar
           :foo 1}
          {:xt/id 4
           :type :bar
           :foo 1}
          {:xt/id 5
           :type :baz
           :bar 3}
          {:xt/id 6
           :type :baz
           :bar 3}]
         (map (fn [doc] [::xt/put doc]))
         (xt/submit-tx n)
         (xt/await-tx n))
    (xt/q (xt/db n) '{:find [e (sum c)]
                      :where [[e :type :foo]
                              [(dev/get-bar-baz-count $ e) c]]}))

refset11:03:16

that may still be possible to optimize further by converting [e2 :type :bar] to [(get-attr e2 :type)[e2type]][(= e2type :bar)]and similarly for :baz (depending on what the join order stats look at like at scale, with the various relative cardinalities & selectivities)

sheluchin12:03:36

Thanks @U899JBRPF. I'll give that a try soon and report back on how it impacts performance with my data set. Appreciate your help!

🙏 1

sheluchin13:03:36

Hi @U899JBRPF. I got around to trying your latest recommendation (without the added optimizations). My version is a bit different, because each baz has a datetime field and I want to group things by year count. The query I came up with works to get the correct result:

(defn get-foo-year-count                                              
  [db foo given-year]                                                 
  (doto                                                               
    (or                                                               
      (ffirst                                                         
        (xt/q db                                                      
              '{:find [(count found-years)]                           
                :in [foo given-year]                                  
                :where [[bar :foo foo]                                
                        [baz :bar bar]                                
                        [baz :datetime x]                             
                        [(subs x 0 4) found-years]                    
                        [(== given-year found-years)]]}               
              foo given-year))                                        
      0)                                                              
    tap>))                                                            
(xt/q (xt/db n)                                                       
      '{:find [foo y c]                                               
        :where [[foo :type :foo]                                      
                ;; get all the unique years                           
                [(q {:find [(distinct year)]                          
                     :where [[baz :datetime year+]                    
                             [(subs year+ 0 4) year]]})               
                 [[all-years]]]                                       
                [(q {:find [foo year c]                               
                     :in [foo [year ...]]                             
                     :where [[(dev/get-foo-year-count $ foo year) c]]}
                    foo all-years)                                    
                 [[f y c]]]]                                          
        :order-by [[foo :asc]                                         
                   [y :asc]]})

Result with my data:

[[#uuid "ffffffff-ffff-ffff-ffff-000000000001" "2009" 84]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000001" "2010" 271]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000001" "2011" 971]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000001" "2012" 1277]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000001" "2013" 986]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000001" "2014" 1988]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000001" "2015" 1699]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000001" "2016" 2272]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000001" "2017" 2452]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000001" "2018" 2209]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000001" "2019" 2776]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000001" "2020" 3460]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000001" "2021" 3316]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000001" "2022" 362]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000002" "2009" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000002" "2010" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000002" "2011" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000002" "2012" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000002" "2013" 41]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000002" "2014" 3930]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000002" "2015" 7887]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000002" "2016" 2680]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000002" "2017" 3355]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000002" "2018" 4291]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000002" "2019" 3221]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000002" "2020" 3444]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000002" "2021" 3144]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000002" "2022" 402]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000003" "2009" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000003" "2010" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000003" "2011" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000003" "2012" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000003" "2013" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000003" "2014" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000003" "2015" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000003" "2016" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000003" "2017" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000003" "2018" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000003" "2019" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000003" "2020" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000003" "2021" 1530]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000003" "2022" 437]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000004" "2009" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000004" "2010" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000004" "2011" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000004" "2012" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000004" "2013" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000004" "2014" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000004" "2015" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000004" "2016" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000004" "2017" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000004" "2018" 59]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000004" "2019" 466]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000004" "2020" 284]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000004" "2021" 563]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000004" "2022" 119]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000005" "2009" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000005" "2010" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000005" "2011" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000005" "2012" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000005" "2013" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000005" "2014" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000005" "2015" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000005" "2016" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000005" "2017" 0]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000005" "2018" 3800]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000005" "2019" 5282]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000005" "2020" 5593]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000005" "2021" 13803]
 [#uuid "ffffffff-ffff-ffff-ffff-000000000005" "2022" 7008]]

But it's quite slow. The above takes ~35s on my machine. Do you happen to see any obvious optimizations that would make a big difference? I could, of course, go a completely different route like pre-computing these queries on a schedule (they don't have to be very recent), or adding in year fields that are previously extracted, or some other option... but I am wondering if I can change the query to boost performance significantly. In production, the data will be much larger, so that's a consideration.

refset11:03:24

I do have some ideas to ensure that nothing strictly unnecessary is happening...but first we need to take a peak at the join order - can you share the result of running (:vars-in-join-order (xtdb.query/query-plan-for (xt/db n) <THE-BIG-QUERY>)) ? In general though, it looks like caching or otherwise pre-computing the counting should speed things up dramatically. XT provides hooks to build secondary indexes (e.g. see the implementation of xtdb-lucene) and also transaction functions for helping with this kind of thing.

refset11:03:37

Scheduling the query sounds like the easiest thing to make fast progress here though, unless all your queries are highly analytical/OLAP like this also? If that's the case then it might be best to have a short call to discuss the options in depth (feel free to email me / send an invite <mailto:[email protected]|[email protected]>)

sheluchin18:03:38

@U899JBRPF sorry for the delayed response! Here's the join order: [all-years foo f total y _133063]. I think part of the reason for the ~35s query time is what you explained here https://github.com/xtdb/xtdb/issues/1674. I do have quite a few queries like this and more planned, but I suppose for now I'll go with the quickest and dirtiest solution while I look into using better approaches for the long term. Do you have some sense over whether transaction functions or secondary indexes would be a better place to start?

refset20:03:07

Not a problem, async responses work great for me 🙂 That join order looks sane enough to me - I mainly wanted to sanity check that all-years and foo were the first two in the vector

refset20:03:40

> Do you have some sense over whether transaction functions or secondary indexes would be a better place to start? I'd say transaction functions will be faster to get working, although they may turn out to be a bottleneck for writes, but maybe not - it depends on whether you need to do fresh counting work each time, or whether you can inc/`dec` based only on the small amount of data already at hand in the transaction context

sheluchin16:03:40

@U899JBRPF I can get away with incremental updates. Okay, thank you so much for all the feedback here. I have enough to work with and have gained some insights for sure. Till next time!

🙏 1

2022-03-07

Channels