#xtdb
2022-03-29
zclj08:03:21

Hi, I need some help with a graph-based query. I have a graph where I want to find relations between nodes. I want the query to find any path in the graph, bidirectionally, between nodes with a given attribute. For example, given the nodes and relations '5 -> 4 -> 3 -> 1 <- 2', where 5 and 2 have the given attribute, there is a path in the graph between them via 1. I managed to produce the correct result (see the example below), but the query is too slow even on a small sample of my real data (~150 nodes). Is there a better way of doing this kind of query?

;; model 5 -> 4 -> 3 -> 1 <- 2
  (require '[xtdb.api :as xt])

  (let [n (xt/start-node {})]
    (->> [{:xt/id 1}
          {:xt/id           2
           :node/type       :foo
           :node/components [1]}
          {:xt/id           3
           :node/components [1]}
          {:xt/id           4
           :node/components [3]}
          {:xt/id           5
           :node/type       :foo
           :node/components [4]}
          {:xt/id     6
           :node/type :foo}]
         (map (fn [doc] [::xt/put doc]))
         (xt/submit-tx n)
         (xt/await-tx n))
    ;; Query should tell me there is a path between 5 and 2,
    ;;  since 2 is :node/type :foo and they are related via 1
    (xt/q
     (xt/db n)
     '{:find  [?e1 ?e2]
       :in    [?e1]
       :where [(relates ?e1 ?e2)
               [?e2 :node/type :foo]
               [(!= ?e1 ?e2)]]
       :rules [;; direct edge, following the link forwards
               [(relates [?e1] ?e2)
                [?e1 :node/components ?e2]]
               ;; direct edge, following the link backwards
               [(relates [?e1] ?e2)
                [?e2 :node/components ?e1]]
               ;; recursive step, forwards
               [(relates [?e1] ?e2)
                [?e1 :node/components ?c]
                (relates ?c ?e2)]
               ;; recursive step, backwards
               [(relates [?e1] ?e2)
                [?c :node/components ?e1]
                (relates ?c ?e2)]]}
     5))
  ;; => #{[5 2]}

refset09:03:59

Hey @U1G8B7ZD3 the reason this Datalog-only approach is slow is that rules execute as fully-materialized subqueries, which means many large intermediate result sets are generated (in general, query execution is lazy and the engine tries to minimise intermediate results). The alternative approach is to implement the graph algorithm in userspace via lots of small queries, e.g. bidirectional BFS https://gist.github.com/refset/09271eb23068938162b935ecdec534ec (based on https://hashrocket.com/blog/posts/using-datomic-as-a-graph-database)
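(Editor's note: a minimal sketch of the lots-of-small-queries idea, for illustration. This is not the gist's code; `neighbours`, `foo-ids` and `reachable?` are hypothetical helpers, assuming the :node/components schema from the snippet above.)

(defn neighbours
  "One cheap query per node, covering both edge directions."
  [db e]
  (map first
       (xt/q db
             '{:find  [?n]
               :in    [?e]
               :where [(or [?e :node/components ?n]
                           [?n :node/components ?e])]}
             e)))

(defn foo-ids
  "All candidate targets, fetched once up front."
  [db]
  (into #{} (map first)
        (xt/q db '{:find [?e] :where [[?e :node/type :foo]]})))

(defn reachable?
  "Breadth-first search driven by small per-node queries."
  [db from targets]
  (loop [frontier #{from} visited #{}]
    (cond
      (some targets frontier) true
      (empty? frontier)       false
      :else (let [visited' (into visited frontier)]
              (recur (->> frontier
                          (mapcat #(neighbours db %))
                          (remove visited')
                          set)
                     visited')))))

;; e.g. (reachable? (xt/db n) 5 (disj (foo-ids (xt/db n)) 5)) ;; => true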

zclj10:03:53

Hi @U899JBRPF, thanks for the link! I will dig into it and see where I end up. In general, would you say that Datalog is not a good option for this kind of graph-traversal query? Are the tradeoffs any different with pull queries, since we can make those recursive and, by doing so, traverse the graph?

refset11:03:19

"graph databases" is a fuzzy and complex topic but there are probably arguments on two fronts, that "Datalog lacks the necessary primitives" and that "XT's Datalog query compiler is not sufficiently advanced". However, because of the embeddability and point-in-time consistency, you can always package up efficient algorithm implementations into custom predicate clauses and still benefit from having a single Datalog query as the top-level glue

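(Editor's note: a rough sketch of what that packaging might look like, under assumptions rather than as a confirmed XT idiom: `path?` is a hypothetical wrapper around the BFS helpers above, and the db value is threaded in via :in so the predicate can run its own sub-queries.)

(defn path?
  "Ordinary Clojure fn, usable as a custom predicate clause."
  [db a b]
  (reachable? db a #{b}))

;; the predicate must be resolvable by its fully-qualified symbol
(xt/q (xt/db n)
      '{:find  [?e1 ?e2]
        :in    [db ?e1]
        :where [[?e2 :node/type :foo]
                [(!= ?e1 ?e2)]
                [(user/path? db ?e1 ?e2)]]}
      (xt/db n) 5)
;; => #{[5 2]} (assuming the dataset above)
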
zclj13:03:14

I see. It sure takes some getting used to having the "db in my app" in contrast to "over there"; I do not have to solve everything in efficient queries, I can do it in my program, as per your example in the link. It seems I could get some leverage from a library such as https://github.com/Engelberg/ubergraph in combination with some XT queries for graph traversals. Thanks for the info and thanks for making XT!

🙏 1
🙂 1
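(Editor's note: a minimal sketch of that combination, assuming ubergraph is on the classpath; the edge query reuses the :node/components schema from the first snippet.)

(require '[ubergraph.core :as uber]
         '[ubergraph.alg :as alg])

;; pull all edges out of XT once, then hand them to ubergraph
(let [edges (xt/q (xt/db n)
                  '{:find  [?e ?c]
                    :where [[?e :node/components ?c]]})
      g     (apply uber/graph edges)] ; undirected graph => bidirectional paths
  (alg/shortest-path g 5 2))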
Tomas Brejla18:03:25

@U899JBRPF that gist you posted also seems to be integrated into https://github.com/xtdb/xtdb/blob/master/docs/example/imdb/src/imdb/main.clj, right? That example project seems handy, as it knows how to download and ingest the imdb dataset. It just seems that this specific docs/example/imdb subproject may be a bit outdated (still a few crux references and some calls to no-longer-existing functions) and not working. I had to add [io.dropwizard.metrics/metrics-core "3.1.0"] to make the project happy; otherwise I was getting a weird, seemingly unrelated ClassNotFound error when loading a class that was obviously present on the classpath. I'm now trying to ingest the imdb data; it indeed takes a while..
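(Editor's note: for reference, the workaround would look roughly like this in the example's project.clj; only the metrics-core coordinates are from the message above, the other entries and the placement are illustrative.)

:dependencies [[org.clojure/clojure "1.10.3"]
               [com.xtdb/xtdb-core "1.21.1-beta2"]
               ;; added to work around the ClassNotFound error
               [io.dropwizard.metrics/metrics-core "3.1.0"]]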

refset19:03:54

ah, yes, that example code is what I adapted to arrive at the gist. Sorry it's stale and that we didn't get around to merging any changes upstream yet 😕

refset19:03:09

it's useful to know it's of interest though 🙂

Tomas Brejla20:03:24

I was just curious to try that imdb dataset and this example subproject seemed like the way to go. Btw, relatively soon after ingesting all the data, the REPL crashed on me (I first suspected my heap was not big enough or something; nope, I had run out of disk space 🙂). And later, when I tried to start the REPL again, I got kafka.common.InconsistentClusterIdException: The Cluster ID AkIA7ZUiSJKddUhAxZONYw doesn't match stored clusterId Some(YeKv9Du2RoqOS8nBFnifwA) in meta.properties. The broker is trying to join the wrong cluster. Configured zookeeper.connect may be wrong. Some googling led me to https://github.com/xtdb/xtdb/pull/1609/commits/27d3b3034874bdf43baf24c6c226bfe8c1b6a6ef, and after I deleted that meta.properties file I was able to start the embedded kafka "cluster" successfully again using the start-from-repl function.

refset20:03:01

Cool! How big is the data set?! I thought it was only a couple of GBs 🤔

Tomas Brejla20:03:00

yeah, it's not that terrible, but the xtdb/docs/example/imdb folder is currently 12G (and increasing; I guess there's still some indexing going on, as more and more new files keep appearing in data/db-dir, along with extensive CPU activity)

👍 1
Tomas Brejla20:03:58

just the tsv files themselves are 4.8G btw

😅 1
Tomas Brejla21:03:22

Wow.. it seems that it will take many more hours to finish the tx ingestion. I just tried submitting a new tx and it got assigned a tx-id of 202837. But after ~2-3 hours of my xtdb node running, (xt/db crux) still gives me :xtdb.api/tx-id 76336. The ingestion rate seems to be around 600 txes/minute, which feels quite slow. Are numbers like this normal? Is there perhaps any way of improving this rate?
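(Editor's note: for anyone measuring the same thing, a small sketch of checking the indexing lag directly, assuming `crux` is the node as above; latest-submitted-tx and latest-completed-tx are standard xtdb.api functions.)

(let [submitted (::xt/tx-id (xt/latest-submitted-tx crux))
      completed (::xt/tx-id (xt/latest-completed-tx crux))]
  ;; how far indexing is behind the tx-log
  (println "indexing lag:" (- submitted completed) "txes"))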

refset21:03:20

what does your node config look like? are you using LMDB? or Lucene? are the put ops batched into reasonably-sized chunks (e.g. 1000 puts per tx)?

Tomas Brejla22:03:55

(def crux-options
  ;; Kafka tx-log + document store, RocksDB for the local stores
  {:xtdb.kafka/kafka-config {:bootstrap-servers "localhost:9092"}
   :xtdb/tx-log {:xtdb/module 'xtdb.kafka/->tx-log
                 :kafka-config :xtdb.kafka/kafka-config}
   :xtdb/document-store {:xtdb/module 'xtdb.kafka/->document-store
                         :kafka-config :xtdb.kafka/kafka-config
                         :local-document-store {:kv-store :rocksdb}}
   :xtdb/index-store {:kv-store :rocksdb}
   :rocksdb {:xtdb/module 'xtdb.rocksdb/->kv-store
             :db-dir index-dir}}) ; index-dir is defined elsewhere in the project

Tomas Brejla22:03:37

and it seems to use tx batches of 100 puts each
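(Editor's note: a hedged sketch of the larger-batch suggestion from above; `node` and `docs` stand in for the example's node and its seq of imdb documents.)

;; submit in batches of 1000 puts per tx instead of 100
(doseq [batch (partition-all 1000 docs)]
  (xt/submit-tx node (mapv (fn [doc] [::xt/put doc]) batch)))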

refset22:03:42

okay, hmm! I guess I'd check out the memory usage next, see if there's (a lot of) swapping happening. Next stop after that would be to crack out a profiler

refset22:03:05

what machine is this running on? how much RAM, JVM -Xmx etc.

refset22:03:35

if you're not using 1.21.1-beta2 already then I recommend giving that a spin too

Tomas Brejla22:03:41

A few-years-old Dell XPS 13 laptop: Intel i7-8550U (8 threads) @ 4.000GHz, 16G RAM. Low -Xmx plus swapping might be a logical explanation. The REPL is started from Calva; I'm not even sure what default values it's using. I'll try the profiler tomorrow. Thanks.

👍 1