#datomic
2023-04-14
camdez15:04:32

I have an abstract querying question I could use some help with...trying to fill some residual holes in my mental model... Let’s say we have a schema like this:

:customer/name string
:invoice/customer ref
:invoice/balance bigdec
:invoice/void? boolean

I would like to query for all non-void invoices for customers who have at least one non-void invoice with balance > 100, plus their associated customers. Here’s a basic implementation for discussion:

(def test-data
  [[1 :customer/name "Cust 1"]
   [2 :customer/name "Cust 2"]
   [3 :invoice/balance 40M]
   [3 :invoice/customer 1]
   [4 :invoice/balance 150M]
   [4 :invoice/customer 1]
   [5 :invoice/balance 50M]
   [5 :invoice/customer 1]
   [5 :invoice/void? true]
   [6 :invoice/balance 40M]
   [6 :invoice/customer 2]
   [7 :invoice/balance 150M]
   [7 :invoice/customer 2]
   [7 :invoice/void? true]])

(d/q '[:find ?cust ?inv
       :in $
       :where
       [?large-inv :invoice/balance ?bal]
       [(> ?bal 100)]
       (not [?large-inv :invoice/void?]) ; "A"
       [?large-inv :invoice/customer ?cust]
       [?inv :invoice/customer ?cust]
       (not [?inv :invoice/void?])] ; "B"
     test-data)
;; => #{[1 3] [1 4]}

This works, but what I’d really like to do is eliminate the duplication between A & B so that I can write the query in a way that composes: given some (possibly filtered) set of invoices, return the invoices (from the set) / customers where the customer has some large invoice in the set.

I seem to commonly find myself in spots where I want to say something like “give me a fresh logic variable, ?large-inv, s.t. ?large-inv ∈ ?inv”, but I don’t know that there’s a Datomic way to do this sort of thing... I’ve tried several approaches, like using a subquery to reject customers I don’t want, but it doesn’t work because predicate expressions seem to always work on individual values (not collections), so I can’t pass in the filtered invoice set for an existence check.

Do I just need to accept that it can’t be written how I would like? Feels like I’m fighting the underlying model.

I expected a negation-based solution to be possible (if inelegant), but still haven’t found one. Obviously even better if negation is not necessary. Something like this seems like it should be possible (based on my limited mental model of how query results are built):

[?inv :invoice/customer ?cust]
[?large-inv ∈ ?inv] ; fictional clause type
[?large-inv :invoice/balance ?bal]
[(> ?bal 100)]
[?large-inv :invoice/customer ?cust]

…but, AFAIK, this sort of thing is not available. Is there a fundamental reason for that? (Or perhaps there’s an equivalent alternative I’m missing?)

favila17:04:57

I think you may just want to use rules to organize your query?

favila17:04:54

There are some things difficult to express in datalog eg https://stackoverflow.com/questions/43784258/find-entities-whose-ref-to-many-attribute-contains-all-elements-of-input but this sounds like you just don’t like binding a new name? A rule would hide that in its scope

camdez17:04:55

Yeah, that’s what I was going for, but if ?inv goes into a rule, then it’s going to get unified, right? Which means a rule can’t do the GROUP BY/HAVING type logic to scope the ?cust values, or it will reject some invoices I want. (Let me know if this is unclear.) So I think the best I can do is to roll up my initial invoice constraints (in this case (not [?large-inv :invoice/void?])) into a rule so it can be cleanly used twice (what I called “A” and “B” in the initial text). But this rule needs to be used from inside the ?cust selection logic, so it seems impossible to have a general (cust-with-large-invs ?cust ?inv) type rule that would compose.

camdez17:04:30

Ha, thanks. I actually had this exact page open but it’s a bit challenging.

favila17:04:51

You would unify with the rule arg, but bind to another name for the cust-inv search

camdez17:04:04

Specifically, I don’t understand what this is doing:

[(identity ?groups) [?group ...]]
Possibly the key?

camdez17:04:18

Right, but if I’m binding to a new name, how can I say that ?rule-inv ⊆ ?inv ? This is where I get stuck.

favila17:04:58

I don’t think that’s what’s going on here. You want invoices whose customers have invoices where any one of them is > 100?

favila17:04:13

Or do they all need to be over 100?

favila17:04:45

“all” is the hard part; “any” is trivial

camdez17:04:52

No, just one. While preserving some pre-existing set of constraints over the invoices. In this case, only dealing with non-void invoices.

camdez17:04:02

Any is what I’m struggling with, so please enlighten me 🙂

favila17:04:24

Your first query gives you the right behavior?

camdez17:04:50

I’m wondering if there is any way to write it something like this and still get back all of the invoices for the matched customers:

(d/q '[:find ?cust ?inv
       :in $ %
       :where
       [?inv :invoice/customer ?cust]
       (not [?inv :invoice/void?])
       (high-bal-cust ?cust ?inv)]
     test-data
     rules)

camdez17:04:53

(sorry, typos fixed now)

favila17:04:24

Extract clauses for A and for B to separate rules which share ?inv

favila17:04:50

I’m sorry I can’t spell this out further because I’m on a phone

camdez17:04:56

haha, no trouble. I appreciate the help either way. Just to make sure we’re on the same page, the high-bal-cust rule would need to embed the invoices-we-want rule, right? Meaning they don’t really compose.

favila17:04:06

(high-bal-cust ?cust) (non-void-inv-for-cust ?inv ?cust)

favila17:04:46

Order doesn’t matter for correctness just performance

favila17:04:09

Your ?large-inv is inside the high-bal-cust impl

camdez17:04:06

You’re saying the rules look like this, yes?

[[(high-bal-cust ?cust)
  (non-void-inv-for-cust ?inv ?cust)
  [?inv :invoice/balance ?bal]
  [(> ?bal 100)]]]

favila17:04:37

I gave the query not the rule

favila17:04:37

To be a high bal cust you need a non void inv with bal > 100

favila17:04:09

Then you just want all the non-void inv for those customers right?

favila17:04:30

So I’m not seeing the nesting?

favila17:04:19

Ah ok you are trying to DRY harder

favila17:04:41

Yeah that could serve as a high-bal-cust impl

camdez17:04:50

Sorry, trying to type up a working example of what I think you’re proposing.

favila17:04:03

You don’t want to repeat the not-void check

camdez17:04:22

Right. I think you’re saying something like this:

(d/q '[:find ?cust ?inv
       :in $ %
       :where
       (non-void-inv-for-cust ?inv ?cust)
       (high-bal-cust ?cust ?inv)]
     test-data
     '[[(non-void-inv-for-cust ?inv ?cust)
        [?inv :invoice/customer ?cust]
        (not [?inv :invoice/void?])]
       [(high-bal-cust ?cust)
        (non-void-inv-for-cust ?inv ?cust)
        [?inv :invoice/balance ?bal]
        [(> ?bal 100)]]])

(I’m not immediately seeing why but this is giving me an OutOfBounds exception, probably just some typo I haven’t spotted yet). I think this kind of thing should work fine, but the high-bal-cust is married to the constraints on the invoices (`non-void-inv-for-cust`). I was hoping to be able to bring any constrained set of invoices and then apply something like high-bal-cust to it.

camdez17:04:37

Oh, I see the typo now…

camdez17:04:09

Fixed. And yields expected results:

(d/q '[:find ?cust ?inv
       :in $ %
       :where
       (non-void-inv-for-cust ?inv ?cust)
       (high-bal-cust ?cust)]
     test-data
     '[[(non-void-inv-for-cust ?inv ?cust)
        [?inv :invoice/customer ?cust]
        (not [?inv :invoice/void?])]
       [(high-bal-cust ?cust)
        (non-void-inv-for-cust ?inv ?cust)
        [?inv :invoice/balance ?bal]
        [(> ?bal 100)]]])
;; => #{[1 3] [1 4]}

favila17:04:57

You can make high-bal-cust accept an inv, and derive cust from it. But again it’s not the same inv as the Invs you are checking for high balances

camdez17:04:41

Right… this is what I wish I could write:

(d/q '[:find ?cust ?inv
       :in $ %
       :where
       (non-void-inv-for-cust ?inv ?cust)
       (high-bal-cust ?cust ?inv)]
     test-data
     '[[(non-void-inv-for-cust ?inv ?cust)
        [?inv :invoice/customer ?cust]
        (not [?inv :invoice/void?])]
       [(high-bal-cust ?cust ?inv)
        [?large-inv ∈ ?inv]
        [?large-inv :invoice/balance ?bal]
        [(> ?bal 100)]]])

camdez17:04:04

Basically ?large-inv is a new name for ?inv that can be separately unified.

favila17:04:10

I don’t understand why you think ?large-inv and ?inv should be related

camdez17:04:14

Because ?inv has been scoped to a subset of invoices (the non-void ones, in this case). Now I want only invoices from that narrowed scope, if their associated customer has an invoice from that narrowed scope with a balance > 100.

camdez17:04:56

It may sound convoluted, but I think it’s pretty common. Show me all of the invoices owed by customers who owe us a lot. This sort of thing. (Really owe some large invoice, since I’m not working off a total in this case, but either variant would be interesting to me.)

camdez17:04:06

(voided invoices don’t count)

favila17:04:31

The candidate ?inv you are filtering is not the same as the ones you are inspecting for cust high-bal test

favila17:04:08

You can do this, but you need to supply all the invs as a single binding, and you still need to destructure twice

camdez17:04:27

I’m a bit lost on both counts. What do you mean by “destructuring”? (I’m well-acquainted with the term, just not sure what you mean in this context.)

camdez17:04:46

Sounds like you’re talking about binding the variables to the set of candidates in a rule, yes?

favila17:04:55

Unpack ?invs to individual ?inv

favila17:04:08

Not necessarily in a rule

👍 2
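
For reference, a minimal sketch of the “single binding, destructure twice” idea described above. It assumes the Datomic peer library is on the classpath and reuses the test data from the start of the thread. The set `non-void-invs` is a hypothetical, precomputed input, and the `[(identity ?invs) [?x ...]]` form is the idiom from the linked Stack Overflow answer.

```clojure
(require '[datomic.api :as d])

(def test-data
  [[1 :customer/name "Cust 1"]    [2 :customer/name "Cust 2"]
   [3 :invoice/balance 40M]       [3 :invoice/customer 1]
   [4 :invoice/balance 150M]      [4 :invoice/customer 1]
   [5 :invoice/balance 50M]       [5 :invoice/customer 1]
   [5 :invoice/void? true]
   [6 :invoice/balance 40M]       [6 :invoice/customer 2]
   [7 :invoice/balance 150M]      [7 :invoice/customer 2]
   [7 :invoice/void? true]])

;; Hypothetical input: the already-constrained invoice ids, passed as ONE value.
(def non-void-invs #{3 4 6})

(def result
  (d/q '[:find ?cust ?inv
         :in $ ?invs                          ; scalar binding: the whole set
         :where
         [(identity ?invs) [?inv ...]]        ; unpack once: candidate invoices
         [?inv :invoice/customer ?cust]
         [(identity ?invs) [?large-inv ...]]  ; unpack again under a new name
         [?large-inv :invoice/customer ?cust]
         [?large-inv :invoice/balance ?bal]
         [(> ?bal 100)]]
       test-data non-void-invs))
```

Because ?invs is bound as a single value, it takes no part in unification; each unpacking introduces an independent variable over the same set, so the balance check on ?large-inv does not constrain ?inv. With this input the query should select the same customer/invoice pairs as the original query.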
favila17:04:14

What you are describing is something I would normally only attempt if I discovered a perf problem

camdez17:04:37

I’m actually trying to go the opposite direction… Not exploring this for performance reasons. In fact, ignoring that fully for the moment and thinking about how I can write a series of queries in terms of rules that compose so that I can ensure they behave in similar ways.

favila17:04:54

Didn’t we just do that?

camdez17:04:24

Haha, I don’t think so. 🙂 Let me type up something else to look at…

favila17:04:44

(non-void-inv ?inv) (filter-inv-high-val-cust ?inv ?cust)

favila17:04:19

Where the vars are the same among rule invocations, that has the effect of a filter

favila17:04:39

Unification ensures you on out get the tuples left where all rules matched

favila17:04:00

“On out”=>only

camdez17:04:00

But filter-inv-high-val-cust is going to reject the invoices with balance < 100

camdez17:04:37

So we only get filter AND high balance AND inv cust has some high balance inv

favila17:04:41

No, because it won’t unify on ?inv for the high Val balances, only on the cust

favila17:04:50

?inv happens to be a superset of ?large-inv in this case in this order but not necessarily

favila17:04:15

Datalog unification means order doesn’t matter

favila17:04:32

(Again except in datomic for performance)

camdez17:04:24

I’m not suggesting order matters. I’m just not sure how you’re going to pass the constrained set of invoices to filter-inv-high-val-cust , then query those invoices for high balance ones without causing unification on ?inv. That’s exactly the crux of my issue.

camdez17:04:18

You will make my day if you tell me what I can put in filter-inv-high-val-cust to make this happen 🙂

favila17:04:22

You did it in your first query via another variable name

favila17:04:30

?large-inv

camdez18:04:27

Right, but that only works if the logic for constraining the set of invoices is also embedded there. Which means that function is not actually reusable / composable. It’s a mirror of what was in the original query.

camdez18:04:03

It’s not reusable if, say, we wanted non-void invoices for some particular product, for example.

favila18:04:28

I see, you want goodness to be parameterizable

favila18:04:35

But that is introducing a third concept separate from ?inv and ?large-inv

favila18:04:06

?legal-invs or ?checked-invs something like that

favila18:04:34

Another option is to filter the data source itself

camdez18:04:19

Yeah, that is an interesting option.

camdez18:04:10

I just want to mimic this:

(->> [{:invoice/balance 40M, :invoice/customer 1, :db/id 3}
      {:invoice/balance 150M, :invoice/customer 1, :db/id 4}
      {:invoice/balance 50M,
       :invoice/customer 1,
       :invoice/void? true,
       :db/id 5}
      {:invoice/balance 40M, :invoice/customer 2, :db/id 6}
      {:invoice/balance 150M,
       :invoice/customer 2,
       :invoice/void? true,
       :db/id 7}]
     (remove :invoice/void?) 
     (group-by :invoice/customer)
     vals
     (filter #(some (fn [{:invoice/keys [balance]}] (> balance 100)) %)))
;; => ([{:invoice/balance 40M,  :invoice/customer 1, :db/id 3}
;;      {:invoice/balance 150M, :invoice/customer 1, :db/id 4}])

camdez18:04:00

Filter some invoices, then group and filter the group.
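
A small plain-Clojure sketch of the composable shape being asked for, with no Datomic involved. The function name `invoices-of-large-balance-customers` and its `threshold` parameter are hypothetical; the point is only that the second step accepts whatever invoice collection the earlier filters produced.

```clojure
(def invoices
  [{:db/id 3 :invoice/customer 1 :invoice/balance 40M}
   {:db/id 4 :invoice/customer 1 :invoice/balance 150M}
   {:db/id 5 :invoice/customer 1 :invoice/balance 50M :invoice/void? true}
   {:db/id 6 :invoice/customer 2 :invoice/balance 40M}
   {:db/id 7 :invoice/customer 2 :invoice/balance 150M :invoice/void? true}])

(defn invoices-of-large-balance-customers
  "Keeps the invoices in `invs` whose customer has at least one invoice
  in `invs` with a balance above `threshold`."
  [threshold invs]
  (let [;; customers with at least one large invoice in the given collection
        large-custs (into #{}
                          (comp (filter #(> (:invoice/balance %) threshold))
                                (map :invoice/customer))
                          invs)]
    ;; keep every invoice (large or not) belonging to those customers
    (filter (comp large-custs :invoice/customer) invs)))

(->> invoices
     (remove :invoice/void?)                      ; any upstream constraint
     (invoices-of-large-balance-customers 100)    ; reusable second step
     (map :db/id))
;; => (3 4)
```

The upstream `remove` could be swapped for any other filter without touching the second function, which is the property the Datalog rule version is struggling to express.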

favila18:04:31

I’m not sure this would be fast enough but you could also do [(non-void-inv ?inv) [(identity ?inv) ?large-inv] (large-bal-cust ?inv ?cust ?large-inv)]

camdez18:04:37

It seems that I have to tie the code of first and second filters together because I can’t just continue processing the intermediate value

camdez18:04:39

Ok, so that’s what I was wondering about earlier. So [(identity ?inv) ?large-inv] basically lets me bind a new name… I’m sure we lose all set-related optimizations, but it’s possible…

favila18:04:13

Identity is a Clojure function that returns its argument

favila18:04:30

(identity x) => x

camdez18:04:38

I definitely know that 🙂

favila18:04:45

So this is just an idiom for rebinding without unifying

favila18:04:57

Not datalog syntax

camdez18:04:04

Right, I follow.

camdez18:04:22

But then we’ve dropped down to dealing with individual values… generate and filter

camdez18:04:52

Which works. But my broader question is, is there a way to get a fresh binding for a collection and not have it unify. I think the answer is ‘no’.

favila18:04:09

Sub query or :in

camdez18:04:33

So…as far as I can tell, you can only pass scalars to a subquery, correct?

camdez18:04:50

I don’t think there are any docs on this, so this is just what I’ve seen in my prodding.

camdez18:04:48

Ok, so let’s say we wanted to write (large-bal-cust ?cust ?inv) with a subquery that checked to see if any of the invoices in ?inv for that ?cust had a high balance; is there a way to do that?

camdez18:04:20

I’m unsure how to pass in the relation (or collection of ?inv), or however that works exactly…

camdez18:04:46

All I have discovered is how to pass in one ?cust and one ?inv. Specifically talking about subqueries here.

favila18:04:47

Simplest is :in ?set-of-inv-ids

favila18:04:58

Then just don’t destructure

favila18:04:12

Sub query is same concept

favila18:04:36

Datalog unification is hiding the whole set behind ?inv for you (because it is not knowable until all constraints are satisfied), so there’s no way to “collection-use” all ?inv values in the middle of your query

favila18:04:00

“Collection-ize “

camdez18:04:04

So…maybe you can tell me what I’m doing wrong here…

(d/q '[:find ?cust ?inv
       :in $ %
       :where
       (non-void-inv-for-cust ?inv ?cust)
       (high-bal-cust ?cust ?inv)]
     test-data
     '[[(non-void-inv-for-cust ?inv ?cust)
        [?inv :invoice/customer ?cust]
        (not [?inv :invoice/void?])]
       [(high-bal-cust ?cust ?inv)
        [(datomic.api/q
          '[:find ?i .
            :in $ ?c ?i
            :where
            [?i :invoice/customer ?c]
            [?i :invoice/balance ?bal]
            [(> ?bal 10000)]]
          $ ?cust ?inv)]]])
Is this how I should do it? Use the subquery result as a predicate filter? Or should I be returning a binding or something?

favila18:04:21

But you see how “what Invs should filters inspect” is itself a filter parameter. Conceptually it will always be a different thing from “does this particular inv unify”

camdez18:04:52

Oh, you’re saying it is not possible to do this. Ok.

favila18:04:25

I’m saying it’s just not how datalog works. You’re confusing the columns for the rows in datalog

camdez18:04:32

Right? Can’t “collection-ize” in a subquery.

favila18:04:51

You can return a collection from the sub query

favila18:04:11

And bind to a new name that is the whole collection

favila18:04:12

Then you can unify with other vars later via destructuring the collection or even via (contains? ?coll ?x) predicates

camdez18:04:51

Where ?coll is what I returned. Because if it’s a logic variable then we don’t know what will be in it.

favila18:04:56

But the point is datalog sees it as a single value not a collection. It’s not participating in unification

camdez18:04:02

Yep, I gotcha.

camdez18:04:38

So I think I understand the mechanism, and I appreciate you explaining it. But I’m not sure if it gets us any further. Say we wrote a subquery to find high-balance invoices for a customer; if we intersect that with the ?inv values (via (contains? ?high-val-invs ?inv)), then we’re back to unifying ?inv in an undesirable way. We can’t give ?inv (as a collection) a new name without knowing the rules that were used to build that collection. Or at least this is my thesis…

camdez18:04:45

Basically I want something like the identity idiom but for a collection in the query engine so I can say “duplicate ?inv to a new name and don’t unify back to it”. I don’t think such a mechanism exists, but I’m not aware of a reason why it couldn’t.

favila18:04:11

Because the true set of ?inv is what ultimately pops out of the :find

favila18:04:47

It’s not the “non void invoices” of your first clauses

favila18:04:41

Datomic happens to evaluate clauses in order (mostly) but datalog semantics don’t

favila18:04:17

So you need a new name for that set

favila18:04:28

That was the example earlier using identity

camdez18:04:55

It’s interesting, I’m curious what I’m saying that has you thinking I care about order. I’m not trying to speak to order. But I must sound like I am.

favila18:04:36

What the set members are of ?inv depends on what clauses were evaluated

favila18:04:44

Thus order matters

favila18:04:02

The set gets smaller as you evaluate more clauses

camdez18:04:44

From a performance perspective, sure.

camdez18:04:15

And, perhaps, if we do more non-Datomic Clojure stuff in query

favila18:04:26

So you can’t say at a filter rule boundary “just whatever is in my ?inv now use for ?largeinv checks”

favila18:04:44

Because that is an impl detail

camdez18:04:55

I’m not actually trying to say “now”.

favila18:04:06

You need a new name to make it stable, thus another argument to the rule

camdez18:04:18

I mean capture the constraints on this variable (at all times) then further add to them in a new name.

camdez18:04:54

Maybe that’s not logically consistent but it sounds reasonable to me at first blush.

favila18:04:54

Ok, syntax you want is something like (not-void ?inv) (bal-over 300 ?inv) (high-bal-cust ?inv ?cust)

favila18:04:13

Bal-over I added as a new “filter” to illustrate the problem

camdez18:04:37

No, I don’t think you’re suggesting what I’m looking for.

favila18:04:01

How is high-bal-cust going to know that ?inv means only not-void and doesn’t fail to check ?inv with bal < 300

camdez18:04:09

I want to say: (let [?x (fresh-var)] [?x ⊆ ?inv] [?x :invoice/balance ?bal] [(> ?bal 100)]) Syntax notwithstanding. I want to introduce a new name, with additional constraints, but not apply those constraints to ?inv.

favila18:04:51

Ok that’s [(identity ?inv) ?new-inv]

camdez18:04:06

But has to be done at the level of individual values.

favila18:04:27

The key is how do you communicate to a filter what ancillary values it may inspect vs the thing it is filtering

favila18:04:02

I see no automatic way of communicating this, because it is core to the semantic of the filter

favila18:04:22

You need to name the two things, not infer it from the filtered thing

camdez18:04:48

So…let’s say we have some set, S.

X = {x ∈ S: x > 0}
Y = {x ∈ X: x > 100}

Y doesn’t impose any constraints on X, right? S is analogous to the DB. X is analogous to ?inv. Y is what I’m looking for. I’m probably missing something because I don’t have a complete mental model of how the query resolver works, but it’s not obvious to me why we can’t have Y without having to define it directly in terms of S.

camdez18:04:40

I’m not sure what ancillary values you’re referring to. I assume you must mean the relationship between ?cust and ?inv?

favila18:04:25

The ancillary values are the invoices of the customer whose balances must be inspected

favila18:04:40

Distinct from the ?inv being filtered

favila18:04:45

In your example, X is not inv except directly after the non-void-inv binding and before any other filters are applied

favila18:04:38

?inv is actually the query result itself, the set of non void inv whose customers have high balance non void invoices

favila18:04:53

The set of invoices where the ?inv binding is first established is not the set of ?inv

favila18:04:01

Datalog is solving a constraint problem, and doesn’t say anything about set members that don’t satisfy the constraint

camdez18:04:07

But it can only shrink in size from there, correct? We add additional constraints on the set.

camdez18:04:19

Yep, I think I follow all of that

camdez18:04:54

And you’re suggesting I want to apply constraints to the set members that don’t satisfy the constraint. Is that right?

favila18:04:25

I’m saying that “can only shrink from here” is an impl detail

favila18:04:34

Borne out of the clause ordering

favila18:04:57

And the particular way datomic evaluates datalog

favila19:04:15

And what if you have multiple filters ?

camdez19:04:34

Yes, but I don’t think I’m trying to rely on that implementation detail. I’m just describing how I understand it to work.

favila19:04:42

Which one is the “this is the whole thing” rule and which is the “filtered down” one?

camdez19:04:07

I’m suggesting that filtered values would have to have new names

favila19:04:21

You seem to want a way for ?inv to “remember” the largest set it ever had

favila19:04:29

the non magical way to do that is to give that set another name and pass it as an additional param to rules that need it

camdez19:04:55

That’s what I’m trying to do…without having to rebuild the value from scratch

camdez19:04:45

I’m not trying to change the value of ?inv in (only) some places.

camdez19:04:18

I’m saying I want to bind ?large-inv to be the members of ?inv with large balances.

favila19:04:02

Ok, but the members of ?inv is not all non-void inv

favila19:04:23

Depending on where you do it

camdez19:04:08

Ahhhh. So you’re saying if we don’t unify it “back” (as it were), we’re imposing order-dependent results.

camdez19:04:34

Because we’re not ultimately getting the intersection of all of the constraints anywhere (necessarily)

favila19:04:42

I’m saying the fact that the rule is invoked with non-void inv that aren’t in the final :find is an impl detail

favila19:04:59

Non-void invoices and ?inv only happen to be the same in this particular datalog rule time at particular spots in its evaluation

favila19:04:18

“Rule time “=>runtime

camdez19:04:40

I follow you. The ?large-inv rules could run on all invoices, or could run on non-void invoices, yielding potentially different results. But if it added a unifying constraint to the overall query then it wouldn’t matter which order it ran in.

favila19:04:15

Mathematically ?inv never “was” any set other than the set that satisfies the entire query

favila19:04:31

That it “changes” is an impl detail

camdez19:04:42

Yes, I follow.

camdez19:04:48

I appreciate you walking me through that. It’s a good point. I recall these kinds of questions coming up in miniKanren and in Prolog… and that people often reach for impure solutions.

favila19:04:30

So (non-void-impl ?all-inv)(non-void-impl ?inv)(has-largebal-cust ?inv ?cust ?all-inv) would do it

👍 2
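
A hedged sketch of how that three-rule query might be spelled out. The rule names come from the message above, but the rule bodies are an assumption, not something given in the thread. It assumes the Datomic peer library is on the classpath and reuses the thread's test data.

```clojure
(require '[datomic.api :as d])

(def test-data
  [[1 :customer/name "Cust 1"]    [2 :customer/name "Cust 2"]
   [3 :invoice/balance 40M]       [3 :invoice/customer 1]
   [4 :invoice/balance 150M]      [4 :invoice/customer 1]
   [5 :invoice/balance 50M]       [5 :invoice/customer 1]
   [5 :invoice/void? true]
   [6 :invoice/balance 40M]       [6 :invoice/customer 2]
   [7 :invoice/balance 150M]      [7 :invoice/customer 2]
   [7 :invoice/void? true]])

(def rules
  '[;; The reusable constraint on invoices (assumed body).
    [(non-void-impl ?inv)
     [?inv :invoice/customer _]
     (not [?inv :invoice/void?])]
    ;; ?inv is the invoice being filtered; ?all-inv names the invoices the
    ;; rule may inspect for a large balance. The two unify only via ?cust.
    [(has-largebal-cust ?inv ?cust ?all-inv)
     [?inv :invoice/customer ?cust]
     [?all-inv :invoice/customer ?cust]
     [?all-inv :invoice/balance ?bal]
     [(> ?bal 100)]]])

(def result
  (d/q '[:find ?cust ?inv
         :in $ %
         :where
         (non-void-impl ?all-inv)   ; constraint applied to the inspected set
         (non-void-impl ?inv)       ; same constraint applied to the candidates
         (has-largebal-cust ?inv ?cust ?all-inv)]
       test-data rules))
```

Here has-largebal-cust knows nothing about voiding, so a different constraint rule could be substituted for non-void-impl in both positions. The cost is that the constraint is evaluated for two variables, which is the performance concern raised in the following messages.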
favila19:04:54

But may not be best perf

favila19:04:18

Managing ?all-inv as a single coll value may be faster

favila19:04:28

Where this is really annoying actually is negation

camdez19:04:05

I imagine so

camdez19:04:21

I have to step away for a bit. I’ll check back when I can. Thanks again for your help!

favila19:04:52

(or-join [?x] (cond ?x) (and (not (cond ?x)) (other-cond ?x)))

favila19:04:41

(cond ?x) seems to evaluate twice needlessly, and there really is no workaround, and it seems like a sufficiently smart datalog could avoid that

favila19:04:29

Or at least memoize it

onetom18:04:24

great discussion! i was also struggling with a similar problem and i was not sure how to introduce a 2nd set of the same kind of entities. it's reassuring to hear that the [(identity ?e) ?other-e] is a common pattern to break unification.

camdez18:04:05

I’m impressed you made it all the way here, haha. 🙂

kenny17:04:42

Finally getting around to looking at io-stats for Cloud and noticed the docs at https://docs.datomic.com/client-api/datomic.client.api.html don’t mention the :io-context kw arg anywhere. Probably need to trigger a fresh release of the codox?

kenny17:04:34

… or maybe has to do with “Everything about query-stats is alpha and subject to change in future releases.“? Still seems worthwhile to expose it in a more explicit way, even if in alpha.

jaret19:04:19

Yikes! Could have sworn it was there. I'll double check and see if it got reverted.
