Hello. I have a pretty trivial query on a very small dataset that I'm trying to optimize since it's painfully slow:
(d/q '[:find ?nid
:in $ %
:where
(highlighted ?nid)]
@!conn
'[[(highlighted ?nid)
[?n :node/id ?nid]
[?n :node/type "service"]
(or
[?e :edge/start ?nid]
[?e :edge/end ?nid])
[?e :edge/project-id ?pid]
[?p :project/id ?pid]
[?p :project/highlight true]]
[(highlighted ?nid)
[?e :edge/end ?nid]
[?e :edge/start ?snid]
(highlighted ?snid)]]))
Basically it finds all nodes that are linked by two highlighted projects (specifically their "service" node). It does so by declaring highlighted nodes (first rule) and checking all paths that can connect them (second rule).
It sounds to me this should be trivial on such a small dataset (100s of nodes) but to my surprise it takes a few seconds to complete.
I tried to understand how refs work (I'm used to RDBMS) and thought about declaring :edge/start as pointing to :node/id but I didn't manage to figure out how to do it (or if it would speed up anything).
Would be nice to have some guidance about this issue and some intuition about what should I be looking at when writing queries.I've given up on this for now. I can do it "manually" but I want to re-visit this later and I'm not under any deadlines. So I'll probably come back to this later on. I'll try to update here when I do. Using refs made many queries shorter (and now I understand how to use them). If there is some guide about recursive query optimizations (even if it's a few simple rules for now) it could be nice to link it somewhere google-able or in "queries"/"tips-and-tricks". Although I'm not sure how many people run into this.
I think all advice I had is there in my replies above. Another important thing: it’s ok to use multiple queries and mix them with collection processing. No need to try to shove everything into one huge query
Let me know if you need more help, I’ll try to do my best
I also thought about statically providing the pre-highlighted service nodes making the first rule redundant but I don't know how to do it as well
also tried this:
(d/q '[:find ?nid
:in $ %
:where
(root ?nid1)
(root ?nid2)
(connected ?nid1 ?nid)
(connected ?nid ?nid2)]
@!conn
'[[(root ?nid)
[?n :node/id ?nid]
[?n :node/type "service"]
[?e :edge/start ?nid]
[?e :edge/project-id ?pid]
[?p :project/id ?pid]
[?p :project/highlight true]]
[(connected ?nid ?eid)
[?e :edge/start ?nid]
[?e :edge/end ?nnid]
(or [?e :edge/end ?eid]
(connected ?nnid ?eid))
]])Hi Reut! 👋 I think the first clause in the first rule head causes a scan. You want to put the most restrictive clause first, e.g. node type which is ground I don't recall if datascript schema accepts index constraints but that might help for your other connected edges
Can you post a dataset? I might look into it if I can play with your data
In general, queries are not the fastest part of DS, and rules are twice that
So it’s ok to “help them out” by pre-calculating stuff, simplifying queries and maybe re-ordering clauses
Using refs would probably also speed things up, as it’ll save you at least one lookup per each row
Are there always two highlighted projects?
Can paths be of arbitrary length?
Hi Ben and Niki
I'll post a random dataset later on when I get back to it.
The relations are as follows:
1. groups point to projects
2. projects point to edges
3. edges point to nodes
In my ui there is a graph (of nodes and edge) where users can highlight a project. Doing this will show all paths to the other highlighted projects. This is why I need all edges on the path between them (I'm not even sure if my query is technically correct since I'm not able to actually run it on many projects).
About your other questions:
1. There can be more than two highlighted projects, but two is a good start since it's most of the value of this feature is figuring out the path from one project to the other (but I can imagine 3 highlighted projects and it's easy to understand what should happen).
2. Paths can be arbitrary length (I thought about implementing some form of DFS to limit the span of open set but I feel like my dataset is so small that it should 'just' work).
3. I tried to optimize it but after reading the doc 5 times I still don't understand what a ref is. When I add things to the schema nothing works. It just says it is a "reference to another entity" but it doesn't say what entity, or how it's referenced. In my example things are linked by my own values (e.g. and edge has a project-id under :edge/project-id which corresponds to a :project/id) and not the db internal values (db/id). I also don't understand what "index:true" is, where it can be used, and how it works. Does it help queries? or only adds an explicitly usable index? is there a query plan viewer? Is there a way to debug queries actions?
4. I didn't read about internals but my assumption is that caching the first rule (or providing it statically) could help a lot. Maybe it's done somehow. But I didn't find a guide on optimizing queries and I'm not sure how to provide it to a rule.
The dataset is probably a few hundred nodes total so I can technically just pull everything out and do it myself. I'll probably do it eventually until I figure out how to make this query work a lot faster.
I'm changing my data to use refs (really helpful that I can use {:edge/start [:node/id id]} instead of {:edge/start :id} and the db does the link)
refs are just entity ids. If you add :db.type/ref to schema, two things will happen: 1. d/entity and d/pull will be able to walk those references 2. it becomes indexed, so that reverse lookup becomes faster as well
Difference is not that pronounced in your query, but it saves one lookup because you know eid on the other side immediately, without having to go through [?n :node/id ?nid]
Also marking ids unique might help, if you are not doing it yet
{:project/id {:db/unique :db.unique/identity}
:node/id {:db/unique :db.unique/identity}}How about something like
(->>
(d/q '[:find ?p ?n
:where
[?p :project/highlight true]
[?p :project/id ?pid]
[?e :edge/project-id ?pid]
(or
[?e :edge/start ?nid]
[?e :edge/end ?nid])
[?n :node/id ?nid]]
db)
(reduce (fn [acc [p n]]
(update acc p (fnil conj #{}) n)) {})
(vals)
(apply clojure.set/intersection))Another rule I find useful when writing queries is “start with what we know, not what we want to find”. In your case you know that some projects are highlighted, so I’d start with these
The way your rules are structures is basically “find me a node” without ever specifying which nodes to look at. So it’ll start with the entire database, which is not as optimal
Think of each step as “reducing a set of all possible answers”. If you start small-ish, then it’s less work to reduce it
I am still moving things around to use refs but I'll try these things when I'm done. Thanks for your answers! Refs seem useful since they make some clauses redundant and it's not a lot of changes.
I've already marked unique ids on what can be marked because I assumed it would index it. About the query you suggest: It returns project to edges while I want the entire paths between specific nodes ("service" nodes of highlighted projects) throughout the graph. Maybe I'm missing something. Projects can be connected via intermediate projects which are not highlighted.
This is the query with refs:
(->>
(d/q '[:find ?p ?n
:where
[?p :project/highlight true]
[?p :project/id ?pid]
[?e :edge/project ?p]
(or
[?e :edge/start ?n]
[?e :edge/end ?n])]
@!conn)
(reduce (fn [acc [p n]]
(update acc p (fnil conj #{}) n)) {})
(vals)
(apply clojure.set/intersection)
)essentially it's service->resource*->service where resources can be anything (including nodes from other projects that are not highlighted) but the first and last nodes are of services that are of projects that are highlighted.
Since the first rule of my naive query is static I was wandering if there is a way to cache/supply it to the recursive rule. It could be already cached in some way but if I could supply the "pre-highlighted" (first rule) nodes it will be quicker since it's just a lookup per traversal instead of query per traversal.
instead of caching I pushed the highlighting down to the nodes but it didn't change much in terms of speed:
(d/q '[:find [(pull ?n [*]) ...]
:in $ %
:where
(highlighted ?n)]
@!conn
'[[(highlighted ?n)
[?n :node/highlighted true]]
[(highlighted ?n)
[?e :edge/start ?n]
[?e :edge/end ?nn]
(highlighted ?nn)]])