#xtdb
2022-12-10
m.q.warnock17:12:57

is there a (simple, efficient) way to get a histogram of an attribute value, i.e. how many of each :kind are in the db? I'm fairly new to datalog queries, and the aggregators don't seem to do what I want

FiVo17:12:08

There is attribute-stats but I don't think this is exactly what you want.

(require '[xtdb.api :as xt])

(def node (xt/start-node {}))

(xt/submit-tx node [[::xt/put {:xt/id :foo :value :bar}]])
(xt/sync node) ; wait for the transaction to be indexed
(xt/attribute-stats node)
;; => {:value 1, :xt/id 1}

FiVo17:12:22

There is (to my knowledge) no public API for getting stats about the values. Maybe you could post the issues you are having in a little example.

m.q.warnock17:12:26

the example would be beyond-trivial; I just want a quick way to see the counts of sub-types of entities that share the same attributes, mostly for debugging an ETL of ~1M entities. I can do it without the more direct support I had in mind, but it's easy to write an sql query for such a thing, and I was surprised I couldn't do a similar thing with datalog/xtdb

FiVo17:12:48

Something like

(xt/submit-tx node [[::xt/put {:xt/id :foo :attr :bar}]
                    [::xt/put {:xt/id :toto :attr :bar}]])
(xt/sync node) ; wait for the transaction to be indexed before querying
(xt/q (xt/db node)
      '{:find [(count ?e)]
        :where [[?e :attr]]})
?

m.q.warnock18:12:45

that will count the :bar's, but I want a result like {:bar 256000 :baz 12345} - I could do a query for each :foo value, but then it's a full scan per query; better to fetch all of them and scan in-memory outside of the query (which is fine for now, but what if I get to a few billion entities?)
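
The in-memory fallback mentioned here could look roughly like the sketch below. It assumes the rows were already fetched with a query along the lines of '{:find [?e attr] :where [[?e :attr attr]]}; the sample rows and the name histogram are made up for illustration.

```clojure
;; Hypothetical sample standing in for the [entity-id value] tuples
;; that a query over :attr might return.
(def rows #{[:foo :bar] [:toto :bar] [:yes :baz]})

;; Keep only the attribute value of each row and count the occurrences.
(def histogram (frequencies (map second rows)))
;; histogram => {:bar 2, :baz 1}
```

This pulls every row into memory first, which is the cost being worried about for very large data sets.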

m.q.warnock18:12:20

er- didn't read your example closely- it's the value of :attr I care about; ?e would be unique

FiVo18:12:07

What does {:bar 256000 :baz 12345} represent? The counts of entities having a :bar / :baz value, respectively, for attribute :attr?

FiVo18:12:56

Just for your info, XT's query engine runs in your app, so the only difference is that the engine handles the logic for you. Memory-wise, you will therefore always run into this issue when querying a lot of data.

m.q.warnock18:12:08

I know the query engine runs in my jvm, but doesn't it scan on disk? I did assume it would execute your example without pulling the relevant columns into memory

m.q.warnock18:12:44

I guess I should read the code; making assumptions about a db isn't a great idea 😉

FiVo18:12:14

Back to your question.

(xt/submit-tx node [[::xt/put {:xt/id :foo :attr :bar}]
                    [::xt/put {:xt/id :toto :attr :bar}]
                    [::xt/put {:xt/id :yes :attr :baz}]])
(xt/sync node) ; wait for the transaction to be indexed before querying


(xt/q (xt/db node)
      '{:find [attr (count ?e1)]
        :in [[attr ...]]
        :where [[?e1 :attr attr]]}
      [:bar :baz])
;; => #{[:bar 2] [:baz 1]}

FiVo18:12:52

This is how I would do it then

m.q.warnock18:12:58

oh? that's cool, though I don't really want to have to provide :bar and :baz - it's an open set

FiVo18:12:18

Yes, I think this is currently not possible without specifying the values you are interested in.

m.q.warnock18:12:48

still, it's a method I hadn't seen the possibility of; appreciate it!

FiVo18:12:22

Sorry, you can do it like this

(xt/q (xt/db node)
      '{:find [attr (count ?e)]
        :where [[?e :attr attr]]})
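
A query of this shape returns a set of [value count] tuples. As a minimal sketch, that set can be turned into the map form asked for earlier using plain Clojure; the result literal below is copied from the earlier example in this thread, and the names counts-by-value and sorted-counts are only illustrative.

```clojure
;; Result shape taken from the earlier example in this thread.
(def result #{[:bar 2] [:baz 1]})

;; Each [value count] tuple becomes one map entry.
(def counts-by-value (into {} result))

;; Sorting by count, largest first, gives a histogram-style listing.
(def sorted-counts (sort-by second > result))
```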

m.q.warnock18:12:10

weird- I thought that was exactly what I tried

m.q.warnock18:12:24

doh! I didn't wrap my single where clause in an additional vector

m.q.warnock18:12:57

thought the unexpected structure had something to do with my use of aggregators
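
The wrapping mistake described above can be illustrated with the two query maps side by side. This is only a sketch: clauses-wrapped? is a hypothetical helper, not part of XTDB, and it is a rough check for plain triple clauses (real queries may also contain list-shaped clauses such as rules).

```clojure
;; Hypothetical helper: true when every element of :where is itself a vector.
(defn clauses-wrapped? [query]
  (every? vector? (:where query)))

;; :where holds the triple's parts directly, so no clause is a vector.
(def unwrapped '{:find [attr (count ?e)] :where [?e :attr attr]})

;; :where holds a vector of clauses, each clause being its own vector.
(def wrapped '{:find [attr (count ?e)] :where [[?e :attr attr]]})
```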

Tommy21:12:49

Hi guys, wrote a quick article on xtdb, would like some feedback before I post it to reddit/#news-and-articles: https://github.clerk.garden/tommy-mor/datalog-blog

refset11:12:57

Hey @UBT7FH96Z this is great 🙂 I'm guessing the plugin itself isn't available to look at yet, is that right? I was just hunting around and saw this PR of yours https://github.com/babashka/pod-registry/pull/55 - sounds like fun 😄

refset11:12:38

Out of interest, were there other factors to your choosing XT over the other Datalog options?

refset11:12:02

Have you seen any downsides to the flattening approach?

Tommy18:12:13

No, I haven't finished the plugin; there are a lot of moving pieces, but it's coming along

Tommy18:12:35

I have been using xt for a bit in other personal projects, so I was familiar with it. Although xt being schemaless does help this use case in particular

Tommy18:12:30

and yes, the babashka pod with golang was super fun, the pod system has a ton of potential I think...

Tommy18:12:43

I haven't seen any downsides to the flattening approach so far, but I haven't used it "in anger". For json-api post/put requests I would have to un-flatten them, but for my use case I am only putting match score results and brand-new participants, so I can just make them as data rather than unflatten an existing record

refset19:12:22

cool, thank you for the explanations 🙏