datalog

lilactown 2021-08-07T13:25:11.014900Z

the thing I'm stuck on right now is the specifics of how to chain patterns

lilactown 2021-08-07T13:33:59.020800Z

take something like:

[:find ?id
 :where
 [?e :foo/id ?id]
 [?e ?a "bar"]]
I think one way to handle it is: 1. resolve [?e :foo/id ?id], filtering the db to find all matching entries: [[1 :foo/id "123"] [4 :foo/id "456]] 2. expand [?e ?a "bar] to use all ?e values we currently have:
[1 ?a "bar"]
[4 ?a "bar"]
3. for each new pattern, filter the db to find all matching entries: [[1 :foo/name "bar"]] 4. since we're joining across the entity ID, this is the final results

quoll 2021-08-07T13:34:53.021300Z

Filtering is A way to do it

Ben Sless 2021-08-07T13:36:18.023700Z

I've been thinking of a way to model/phrase it without tying it to a specific implementation

quoll 2021-08-07T13:38:35.027100Z

But the more efficient way is usually to: 1. Resolve [?e :foo/id ?id] to find the matching entries 2. Iterate over the bindings from 1 (bindings of ?e and ?id) and use these to rewrite the second constraint. This means updating ?e. Do a lookup for each rewrite. 3. Concatenate the results of 2

Ben Sless 2021-08-07T13:39:27.028300Z

btw, how do you handle binding? do you just maintain a replacements map?

quoll 2021-08-07T13:39:58.029400Z

Because you’re iterating over the LHS but not the RHS, you want to keep the pattern with the smallest resolution to the left

quoll 2021-08-07T13:41:52.032900Z

I don’t. I have a vector that describes the variables [?e ?a] and then each binding is a vector of values that aligns with those column positions

quoll 2021-08-07T14:06:52.048100Z

SPARQL’s definition actually describes each line as a map. So, while I represent the above results as: ^{:cols [?e ?a]} [[1 :foo/name]] SPARQL will say the same thing as: [{?e 1, ?a :foo/name}]

lilactown 2021-08-07T13:42:15.033Z

perhaps my terminology was confusing, but I think that's very similar to what I wrote above, assuming "lookup" is basically what I meant by "filtering", right?

lilactown 2021-08-07T13:44:06.034600Z

I'm doing these queries on nested EDN, rather than indexes, so a "lookup" of a tuple ends up doing a scan of a bunch of data but I'm not too worried about that rn

quoll 2021-08-07T13:46:03.035400Z

Yes, a resolution can be done via filtering

quoll 2021-08-07T13:46:17.036100Z

The API will look identical, but it’s just slower

quoll 2021-08-07T13:46:44.037400Z

Jena started life doing exactly this (which is why it sucked at the time 🤣 )

lilactown 2021-08-07T13:46:54.037500Z

> Because you’re iterating over the LHS but not the RHS, you want to keep the pattern with the smallest resolution to the left I think this is the part I've been getting hung up on. this is essentially handled by keeping an updated value of ?e that you pass to the next pattern, right?

lilactown 2021-08-07T13:52:16.040500Z

how do you handle something like:

[:find ?movie ?person
 :where
 [_ :move/name ?movie]
 [_ :person/name ?person]]
where you have a "disjoint union"?

quoll 2021-08-07T13:53:14.040800Z

That’s an outer product

😅 1
quoll 2021-08-07T13:57:53.044400Z

So, [_ :move/name ?movie] resolves to: [["Gone with the Wind"] ["Romeo and Juliet"]] and [_ :person/name ?person] resolves to: [["Robert Downey Jr"] ["Cate Blanchett"]] So the result is: columns: [?movie ?person] [["Gone with the Wind" "Robert Downey Jr"] ["Gone with the Wind" "Cate Blanchett"] ["Romeo and Juliet" "Robert Downey Jr"] ["Romeo and Juliet" "Cate Blanchett"]]

quoll 2021-08-07T13:58:58.045300Z

This works, because if you were to bring in a constraint that bound ?movie and ?person then it would filter those down to the matches

quoll 2021-08-07T13:59:56.046300Z

This is why query planning matters. It: a) reduces the number of bindings to iterate over b) reduces memory consumption for intermediate results

quoll 2021-08-07T14:02:46.047200Z

I’ve had complex queries over large datasets go from 90 seconds down to <100ms with an appropriate plan

lilactown 2021-08-07T14:04:03.047500Z

nice. who knows, maybe if datascript had a planner I wouldn't be trying to replace it rn? 😂

Ben Sless 2021-08-07T14:04:36.047700Z

does datahike have a planner? 🤔

lilactown 2021-08-07T14:04:47.047900Z

(I also need fast EQL queries which I would have to implement myself on datascript anyhow)

quoll 2021-08-07T14:08:17.048600Z

sorry… EQL?

lilactown 2021-08-07T14:09:01.049Z

it's very similar to datomic pull

quoll 2021-08-07T14:10:18.049200Z

ah, I see

quoll 2021-08-07T14:12:07.049400Z

Planning isn’t too hard if you have an index (meaning that you can get fast counts of resolutions), and if your queries are simple (meaning, joins and filters)

quoll 2021-08-07T14:12:42.049600Z

It gets harder when you support bindings, or/and/not operators, and so on

refset 2021-08-07T14:40:22.049800Z

For reference, Crux's lazy tuple-at-a-time (multi-way join) execution engine means the performance of the plan comes down to selecting a good variable ordering, which is a complicated thing to do ~well: https://github.com/juxt/crux/blob/3e9f48b5ea9fa6815a5ef216ec65059db9219359/crux-core/src/crux/query.clj#L586-L703

👍 1
ℹ️ 1
quoll 2021-08-07T14:00:33.047Z

Although, to be fair, when doing this in Clojure (b) isn’t as big an issue because your results are usually lazy

Ben Sless 2021-08-07T06:21:27.013800Z

https://gist.github.com/bsless/632b4040a2b2ad7469369f52cd610c06 Something like this as a starting point for a parser