
the thing I'm stuck on right now is the specifics of how to chain patterns


take something like:

[:find ?id
 :where
 [?e :foo/id ?id]
 [?e ?a "bar"]]
I think one way to handle it is:
1. resolve [?e :foo/id ?id], filtering the db to find all matching entries: [[1 :foo/id "123"] [4 :foo/id "456"]]
2. expand [?e ?a "bar"] to use all ?e values we currently have:
[1 ?a "bar"]
[4 ?a "bar"]
3. for each new pattern, filter the db to find all matching entries: [[1 :foo/name "bar"]]
4. since we're joining across the entity ID, these are the final results
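The filter-and-substitute approach above can be sketched roughly like this (a minimal sketch, not datascript's implementation; it assumes the db is just a flat sequence of [e a v] triples, and `lvar?`, `match-triple`, `resolve-pattern`, and `q*` are all hypothetical names):

```clojure
(defn lvar? [x]
  ;; A logic variable is a symbol whose name starts with ?
  (and (symbol? x) (.startsWith (name x) "?")))

(defn match-triple [pattern triple]
  ;; Returns a map of bindings when the triple matches the pattern, else nil.
  (reduce (fn [bindings [p v]]
            (cond
              (= p '_)  bindings
              (lvar? p) (if-some [bound (get bindings p)]
                          (if (= bound v) bindings (reduced nil))
                          (assoc bindings p v))
              (= p v)   bindings
              :else     (reduced nil)))
          {}
          (map vector pattern triple)))

(defn resolve-pattern [db bindings pattern]
  ;; Substitute already-known bindings into the pattern, then filter the db.
  (let [pattern' (mapv #(get bindings % %) pattern)]
    (keep #(some->> (match-triple pattern' %) (merge bindings)) db)))

(defn q* [db patterns]
  ;; Chain the patterns: each step expands the current set of binding rows.
  (reduce (fn [rows pattern]
            (mapcat #(resolve-pattern db % pattern) rows))
          [{}]
          patterns))
```

Running the two patterns from the example over a toy db yields the single joined row, with ?a bound to :foo/name.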


Filtering is A way to do it

Ben Sless 13:08:18

I've been thinking of a way to model/phrase it without tying it to a specific implementation


But the more efficient way is usually to:
1. Resolve [?e :foo/id ?id] to find the matching entries
2. Iterate over the bindings from 1 (bindings of ?e and ?id) and use these to rewrite the second constraint. This means updating ?e. Do a lookup for each rewrite.
3. Concatenate the results of 2
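A rough sketch of steps 2 and 3, assuming a hypothetical `eav` index (a map of entity id -> attribute -> value) so each rewritten pattern becomes a lookup rather than a full scan — the index shape and `lookup-av` are illustrative, not from any of the implementations discussed:

```clojure
;; Hypothetical EAV index: entity id -> attribute map.
(def eav {1 {:foo/id "123", :foo/name "bar"}
          4 {:foo/id "456"}})

(defn lookup-av [eav e v]
  ;; With ?e already bound, [?e ?a "bar"] rewrites to a lookup in one
  ;; entity's attribute map instead of a scan over the whole db.
  (for [[a v'] (get eav e) :when (= v' v)]
    [e a v']))

;; Iterate over the bindings from step 1 and concatenate the results:
(mapcat (fn [{:syms [?e]}] (lookup-av eav ?e "bar"))
        '[{?e 1, ?id "123"} {?e 4, ?id "456"}])
;; => ([1 :foo/name "bar"])
```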

Ben Sless 13:08:27

btw, how do you handle binding? do you just maintain a replacements map?


Because you’re iterating over the LHS but not the RHS, you want to keep the pattern with the smallest resolution to the left


I don’t. I have a vector that describes the variables [?e ?a] and then each binding is a vector of values that aligns with those column positions


SPARQL’s definition actually describes each line as a map. So, while I represent the above results as:
^{:cols [?e ?a]} [[1 :foo/name]]
SPARQL will say the same thing as:
[{?e 1, ?a :foo/name}]
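The two representations are interchangeable; one way the conversion could look (hypothetical helpers, not from either implementation):

```clojure
(defn rows->maps [rows]
  ;; Expand positional rows into one SPARQL-style binding map per row,
  ;; using the column vector carried in the rows' metadata.
  (let [cols (:cols (meta rows))]
    (mapv #(zipmap cols %) rows)))

(defn maps->rows [cols maps]
  ;; Collapse binding maps back into positional rows, restoring the
  ;; column metadata.
  (with-meta (mapv #(mapv % cols) maps) {:cols cols}))

(rows->maps '^{:cols [?e ?a]} [[1 :foo/name]])
;; => [{?e 1, ?a :foo/name}]
```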


perhaps my terminology was confusing, but I think that's very similar to what I wrote above, assuming "lookup" is basically what I meant by "filtering", right?


I'm doing these queries on nested EDN, rather than indexes, so a "lookup" of a tuple ends up doing a scan of a bunch of data but I'm not too worried about that rn


Yes, a resolution can be done via filtering


The API will look identical, but it’s just slower


Jena started life doing exactly this (which is why it sucked at the time :rolling_on_the_floor_laughing: )


> Because you’re iterating over the LHS but not the RHS, you want to keep the pattern with the smallest resolution to the left

I think this is the part I've been getting hung up on. this is essentially handled by keeping an updated value of ?e that you pass to the next pattern, right?


how do you handle something like:

[:find ?movie ?person
 :where
 [_ :movie/name ?movie]
 [_ :person/name ?person]]
where you have a "disjoint union"?


That’s an outer product


So, [_ :movie/name ?movie] resolves to:
[["Gone with the Wind"] ["Romeo and Juliet"]]
and [_ :person/name ?person] resolves to:
[["Robert Downey Jr"] ["Cate Blanchett"]]
So the result is:
columns: [?movie ?person]
[["Gone with the Wind" "Robert Downey Jr"]
 ["Gone with the Wind" "Cate Blanchett"]
 ["Romeo and Juliet" "Robert Downey Jr"]
 ["Romeo and Juliet" "Cate Blanchett"]]
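The outer product is just a nested loop pairing every left row with every right row; a minimal sketch (`outer-product` is an illustrative name, not an API from any of the engines mentioned):

```clojure
(defn outer-product [left right]
  ;; Pair every row on the left with every row on the right,
  ;; concatenating their column values.
  (for [l left, r right]
    (into l r)))

(outer-product [["Gone with the Wind"] ["Romeo and Juliet"]]
               [["Robert Downey Jr"] ["Cate Blanchett"]])
;; => (["Gone with the Wind" "Robert Downey Jr"]
;;     ["Gone with the Wind" "Cate Blanchett"]
;;     ["Romeo and Juliet" "Robert Downey Jr"]
;;     ["Romeo and Juliet" "Cate Blanchett"])
```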


This works, because if you were to bring in a constraint that bound ?movie and ?person then it would filter those down to the matches


This is why query planning matters. It:
a) reduces the number of bindings to iterate over
b) reduces memory consumption for intermediate results


I’ve had complex queries over large datasets go from 90 seconds down to <100ms with an appropriate plan


nice. who knows, maybe if datascript had a planner I wouldn't be trying to replace it rn? 😂

Ben Sless 14:08:36

does datahike have a planner? :thinking_face:


(I also need fast EQL queries which I would have to implement myself on datascript anyhow)


sorry… EQL?


it's very similar to datomic pull


ah, I see


Planning isn’t too hard if you have an index (meaning that you can get fast counts of resolutions), and if your queries are simple (meaning, joins and filters)
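A toy version of that kind of planner: order the patterns so the smallest resolution comes first. Here the hypothetical `count-matches` scans the db, which defeats the point; with an index the same count would be cheap:

```clojure
(defn count-matches [db pattern]
  ;; How many triples resolve against this pattern? Wildcards and
  ;; ?-variables match anything; constants must be equal.
  (count (filter (fn [triple]
                   (every? (fn [[p v]]
                             (or (= p '_)
                                 (and (symbol? p) (.startsWith (name p) "?"))
                                 (= p v)))
                           (map vector pattern triple)))
                 db)))

(defn plan [db patterns]
  ;; Smallest resolution first, so later patterns iterate over fewer rows.
  (sort-by #(count-matches db %) patterns))
```

With the [e a v] db from the earlier example, [?e ?a "bar"] (1 match) gets ordered ahead of [?e :foo/id ?id] (2 matches).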


It gets harder when you support bindings, or/and/not operators, and so on


For reference, Crux's lazy tuple-at-a-time (multi-way join) execution engine means the performance of the plan comes down to selecting a good variable ordering, which is a complicated thing to do ~well.


Although, to be fair, when doing this in Clojure (b) isn’t as big an issue because your results are usually lazy