datascript

Reut Sharabani 2023-07-16T10:20:01.694309Z

Hello. I have a pretty trivial query on a very small dataset that I'm trying to optimize since it's painfully slow:

(d/q '[:find ?nid
         :in $ %
         :where
         (highlighted ?nid)]
        @!conn
        '[[(highlighted ?nid)
           [?n :node/id ?nid]
           [?n :node/type "service"]
           (or
             [?e :edge/start ?nid]
             [?e :edge/end ?nid])
           [?e :edge/project-id ?pid]
           [?p :project/id ?pid]
           [?p :project/highlight true]]
          [(highlighted ?nid)
           [?e :edge/end ?nid]
           [?e :edge/start ?snid]
           (highlighted ?snid)]]))
Basically it finds all nodes that are linked by two highlighted projects (specifically their "service" node). It does so by declaring highlighted nodes (first rule) and checking all paths that can connect them (second rule). It sounds to me this should be trivial on such a small dataset (100s of nodes) but to my surprise it takes a few seconds to complete. I tried to understand how refs work (I'm used to RDBMS) and thought about declaring :edge/start as pointing to :node/id but I didn't manage to figure out how to do it (or if it would speed up anything). Would be nice to have some guidance about this issue and some intuition about what should I be looking at when writing queries.

Reut Sharabani 2023-07-17T08:52:42.901569Z

I've given up on this for now. I can do it "manually" but I want to re-visit this later and I'm not under any deadlines. So I'll probably come back to this later on. I'll try to update here when I do. Using refs made many queries shorter (and now I understand how to use them). If there is some guide about recursive query optimizations (even if it's a few simple rules for now) it could be nice to link it somewhere google-able or in "queries"/"tips-and-tricks". Although I'm not sure how many people run into this.

Niki 2023-07-17T17:22:41.426809Z

I think all advice I had is there in my replies above. Another important thing: it’s ok to use multiple queries and mix them with collection processing. No need to try to shove everything into one huge query

👍 1
Niki 2023-07-17T17:24:40.186889Z

Let me know if you need more help, I’ll try to do my best

💜 1
Reut Sharabani 2023-07-16T10:31:38.405929Z

I also thought about statically providing the pre-highlighted service nodes making the first rule redundant but I don't know how to do it as well

Reut Sharabani 2023-07-16T14:30:56.424629Z

also tried this:

(d/q '[:find ?nid
         :in $ %
         :where
         (root ?nid1)
         (root ?nid2)
         (connected ?nid1 ?nid)
         (connected ?nid ?nid2)]
       @!conn
       '[[(root ?nid)
          [?n :node/id ?nid]
          [?n :node/type "service"]
          [?e :edge/start ?nid]
          [?e :edge/project-id ?pid]
          [?p :project/id ?pid]
          [?p :project/highlight true]]
         [(connected ?nid ?eid)
          [?e :edge/start ?nid]
          [?e :edge/end ?nnid]
          (or [?e :edge/end ?eid]
              (connected ?nnid ?eid))
          ]])

Ben Sless 2023-07-16T15:42:30.501109Z

Hi Reut! 👋 I think the first clause in the first rule head causes a scan. You want to put the most restrictive clause first, e.g. node type which is ground I don't recall if datascript schema accepts index constraints but that might help for your other connected edges

Niki 2023-07-16T15:51:26.006279Z

Can you post a dataset? I might look into it if I can play with your data

Niki 2023-07-16T15:51:43.073169Z

In general, queries are not the fastest part of DS, and rules are twice that

Niki 2023-07-16T15:52:20.889149Z

So it’s ok to “help them out” by pre-calculating stuff, simplifying queries and maybe re-ordering clauses

Niki 2023-07-16T15:57:22.259949Z

Using refs would probably also speed things up, as it’ll save you at least one lookup per each row

Niki 2023-07-16T15:57:48.019199Z

Are there always two highlighted projects?

Niki 2023-07-16T15:58:53.401609Z

Can paths be of arbitrary length?

Reut Sharabani 2023-07-16T16:46:50.224609Z

Hi Ben and Niki I'll post a random dataset later on when I get back to it. The relations are as follows: 1. groups point to projects 2. projects point to edges 3. edges point to nodes In my ui there is a graph (of nodes and edge) where users can highlight a project. Doing this will show all paths to the other highlighted projects. This is why I need all edges on the path between them (I'm not even sure if my query is technically correct since I'm not able to actually run it on many projects). About your other questions: 1. There can be more than two highlighted projects, but two is a good start since it's most of the value of this feature is figuring out the path from one project to the other (but I can imagine 3 highlighted projects and it's easy to understand what should happen). 2. Paths can be arbitrary length (I thought about implementing some form of DFS to limit the span of open set but I feel like my dataset is so small that it should 'just' work). 3. I tried to optimize it but after reading the doc 5 times I still don't understand what a ref is. When I add things to the schema nothing works. It just says it is a "reference to another entity" but it doesn't say what entity, or how it's referenced. In my example things are linked by my own values (e.g. and edge has a project-id under :edge/project-id which corresponds to a :project/id) and not the db internal values (db/id). I also don't understand what "index:true" is, where it can be used, and how it works. Does it help queries? or only adds an explicitly usable index? is there a query plan viewer? Is there a way to debug queries actions? 4. I didn't read about internals but my assumption is that caching the first rule (or providing it statically) could help a lot. Maybe it's done somehow. But I didn't find a guide on optimizing queries and I'm not sure how to provide it to a rule. The dataset is probably a few hundred nodes total so I can technically just pull everything out and do it myself. I'll probably do it eventually until I figure out how to make this query work a lot faster.

Reut Sharabani 2023-07-16T17:56:43.723319Z

I'm changing my data to use refs (really helpful that I can use {:edge/start [:node/id id]} instead of {:edge/start :id} and the db does the link)

Niki 2023-07-16T19:13:50.259409Z

refs are just entity ids. If you add :db.type/ref to schema, two things will happen: 1. d/entity and d/pull will be able to walk those references 2. it becomes indexed, so that reverse lookup becomes faster as well

Niki 2023-07-16T19:15:43.298519Z

Difference is not that pronounced in your query, but it saves one lookup because you know eid on the other side immediately, without having to go through [?n :node/id ?nid]

Niki 2023-07-16T19:47:30.418489Z

Also marking ids unique might help, if you are not doing it yet

{:project/id {:db/unique :db.unique/identity}
                 :node/id {:db/unique :db.unique/identity}}

Niki 2023-07-16T19:47:47.978559Z

How about something like

(->>
  (d/q '[:find ?p ?n
         :where
         [?p :project/highlight true]
         [?p :project/id ?pid]
         [?e :edge/project-id ?pid]
         (or
           [?e :edge/start ?nid]
           [?e :edge/end ?nid])
         [?n :node/id ?nid]]
    db)
  (reduce (fn [acc [p n]]
            (update acc p  (fnil conj #{}) n)) {})
  (vals)
  (apply clojure.set/intersection))

Niki 2023-07-16T19:48:51.571549Z

Another rule I find useful when writing queries is “start with what we know, not what we want to find”. In your case you know that some projects are highlighted, so I’d start with these

👍 1
Niki 2023-07-16T19:49:35.644279Z

The way your rules are structures is basically “find me a node” without ever specifying which nodes to look at. So it’ll start with the entire database, which is not as optimal

Niki 2023-07-16T19:50:12.365449Z

Think of each step as “reducing a set of all possible answers”. If you start small-ish, then it’s less work to reduce it

Reut Sharabani 2023-07-16T20:21:59.840509Z

I am still moving things around to use refs but I'll try these things when I'm done. Thanks for your answers! Refs seem useful since they make some clauses redundant and it's not a lot of changes.

Reut Sharabani 2023-07-16T20:32:41.888439Z

I've already marked unique ids on what can be marked because I assumed it would index it. About the query you suggest: It returns project to edges while I want the entire paths between specific nodes ("service" nodes of highlighted projects) throughout the graph. Maybe I'm missing something. Projects can be connected via intermediate projects which are not highlighted.

Reut Sharabani 2023-07-16T20:33:08.314199Z

This is the query with refs:

(->>
    (d/q '[:find ?p ?n
           :where
           [?p :project/highlight true]
           [?p :project/id ?pid]
           [?e :edge/project ?p]
           (or
             [?e :edge/start ?n]
             [?e :edge/end ?n])]
         @!conn)
    (reduce (fn [acc [p n]]
              (update acc p  (fnil conj #{}) n)) {})
    (vals)
    (apply clojure.set/intersection)
    )

Reut Sharabani 2023-07-16T20:35:19.962819Z

essentially it's service->resource*->service where resources can be anything (including nodes from other projects that are not highlighted) but the first and last nodes are of services that are of projects that are highlighted.

Reut Sharabani 2023-07-16T20:38:16.702619Z

Since the first rule of my naive query is static I was wandering if there is a way to cache/supply it to the recursive rule. It could be already cached in some way but if I could supply the "pre-highlighted" (first rule) nodes it will be quicker since it's just a lookup per traversal instead of query per traversal.

Reut Sharabani 2023-07-16T21:16:47.546799Z

instead of caching I pushed the highlighting down to the nodes but it didn't change much in terms of speed:

(d/q '[:find [(pull ?n [*]) ...]
         :in $ %
         :where
         (highlighted ?n)]
       @!conn
       '[[(highlighted ?n)
          [?n :node/highlighted true]]
         [(highlighted ?n)
          [?e :edge/start ?n]
          [?e :edge/end ?nn]
          (highlighted ?nn)]])