This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2022-03-29
Channels
- # announcements (7)
- # asami (13)
- # babashka (22)
- # beginners (52)
- # calva (95)
- # clj-kondo (14)
- # cljs-dev (7)
- # clojars (5)
- # clojure (94)
- # clojure-austin (5)
- # clojure-dev (15)
- # clojure-europe (25)
- # clojure-nl (18)
- # clojure-uk (15)
- # clojuredesign-podcast (28)
- # clojurescript (63)
- # copenhagen-clojurians (1)
- # cursive (3)
- # datalevin (7)
- # datascript (13)
- # datomic (13)
- # duct (14)
- # emacs (24)
- # events (1)
- # fulcro (13)
- # graphql (7)
- # kaocha (4)
- # lambdaisland (6)
- # lsp (22)
- # music (5)
- # off-topic (24)
- # rdf (1)
- # re-frame (3)
- # reitit (9)
- # shadow-cljs (23)
- # sql (15)
- # testing (4)
- # tools-build (6)
- # vim (7)
- # vscode (7)
- # xtdb (21)
Hi, I need some help with a graph-based query. I have a graph where I want to find relations between nodes. I want the query to find any path in the graph, bi-directionally, between nodes with a given attribute. For example, given the nodes and relations '5 -> 4 -> 3 -> 1 <- 2', where 5 and 2 have the given attribute, there is a path in the graph between them via 1. I managed to produce the correct result (see the example), but the query is too slow even on a small sample of my real data (~150 nodes). Is there a better way of doing this kind of query?
;; model 5 -> 4 -> 3 -> 1 <- 2
(let [n (xt/start-node {})]
  (->> [{:xt/id 1}
        {:xt/id 2
         :node/type :foo
         :node/components [1]}
        {:xt/id 3
         :node/components [1]}
        {:xt/id 4
         :node/components [3]}
        {:xt/id 5
         :node/type :foo
         :node/components [4]}
        {:xt/id 6
         :node/type :foo}]
       (map (fn [doc] [::xt/put doc]))
       (xt/submit-tx n)
       (xt/await-tx n)))
;; Query should tell me there is a path between 5 and 2,
;; since 2 is :node/type :foo and they are related via 1
(xt/q
 (xt/db n)
 '{:find [?e1 ?e2]
   :in [?e1]
   :where [(relates ?e1 ?e2)
           [?e2 :node/type :foo]
           [(!= ?e1 ?e2)]]
   :rules [[(relates [?e1] ?e2)
            [?e1 :node/components ?e2]]
           [(relates [?e1] ?e2)
            [?e2 :node/components ?e1]]
           [(relates [?e1] ?e2)
            [?e1 :node/components ?c]
            (relates ?c ?e2)]
           [(relates [?e1] ?e2)
            [?c :node/components ?e1]
            (relates ?c ?e2)]]}
 5))
;; => #{[5 2]}
Hey @U1G8B7ZD3 the reason this Datalog-only approach is slow is that rules execute as fully-materialized subqueries, which means many ~large intermediate result sets are generated (normally query execution is lazy and the engine tries to minimise intermediate results). The alternative approach is to implement the graph algorithm in userspace via lots of small queries, e.g. bidirectional BFS https://gist.github.com/refset/09271eb23068938162b935ecdec534ec (based on https://hashrocket.com/blog/posts/using-datomic-as-a-graph-database)
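A minimal sketch of that "lots of small queries" idea (the names here are illustrative, not taken from the gist): do the traversal in userspace, asking the db only for the immediate neighbours of one node at a time.

```clojure
(defn reachable
  "Breadth-first set of all nodes reachable from `start`.
  `neighbours` is a function from node id to a set of adjacent ids."
  [neighbours start]
  (loop [frontier #{start} seen #{start}]
    (if (empty? frontier)
      seen
      (let [nxt (set (remove seen (mapcat neighbours frontier)))]
        (recur nxt (into seen nxt))))))

(comment
  ;; Against XT, `neighbours` becomes one small query per node,
  ;; following :node/components in either direction:
  (defn xt-neighbours [db id]
    (into #{}
          (map first)
          (xt/q db
                '{:find [?n]
                  :in [?id]
                  :where [(or [?id :node/components ?n]
                              [?n :node/components ?id])]}
                id)))

  ;; For the example graph above this should reach all of 1..5:
  (reachable (partial xt-neighbours (xt/db n)) 5))
```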
Hi @U899JBRPF, thanks for the link! I will dig into it and see where I end up. In general, would you say that Datalog is not a good option for this kind of graph-traversal query? Are the tradeoffs any different with pull queries, since we can make those recursive and traverse the graph that way?
"graph databases" is a fuzzy and complex topic, but there are probably arguments on two fronts: that "Datalog lacks the necessary primitives" and that "XT's Datalog query compiler is not sufficiently advanced". However, because of the embeddability and point-in-time consistency, you can always package up efficient algorithm implementations into custom predicate clauses and still benefit from having a single Datalog query as the top-level glue
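A sketch of that custom-predicate idea: `my.app/connected?` is a hypothetical function of ours (not an existing XT function), called from Datalog by its fully-qualified name.

```clojure
(defn connected?
  "True when a path exists between `a` and `b` in the adjacency map `adj`
  (node id -> set of neighbouring ids), found by breadth-first search."
  [adj a b]
  (loop [frontier #{a} seen #{a}]
    (cond
      (contains? seen b) true
      (empty? frontier)  false
      :else (let [nxt (set (remove seen (mapcat adj frontier)))]
              (recur nxt (into seen nxt))))))

(comment
  ;; The adjacency map is built once (e.g. from a single edge query),
  ;; then passed into the query via :in and used as a predicate:
  (xt/q (xt/db n)
        '{:find [?e2]
          :in [?e1 ?adj]
          :where [[?e2 :node/type :foo]
                  [(!= ?e1 ?e2)]
                  [(my.app/connected? ?adj ?e1 ?e2)]]}
        5 adjacency))
```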
I see. It sure takes some getting used to, having the db "in my app" in contrast to "over there"; I do not have to solve everything in efficient queries, I can do it in my program, as per your example in the link. It seems I could get some leverage from a library such as https://github.com/Engelberg/ubergraph in combination with some XT queries for graph traversal. Thanks for the info and thanks for making XT!
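That combination could look something like this (an untested sketch, assuming ubergraph is on the classpath): pull the edge list out of XT once, then do the graph work in memory.

```clojure
(require '[ubergraph.core :as uber]
         '[ubergraph.alg :as alg])

(comment
  ;; One query for all :node/components edges, then build an
  ;; undirected ubergraph from the [?a ?b] pairs:
  (let [edges (xt/q (xt/db n)
                    '{:find [?a ?b]
                      :where [[?a :node/components ?b]]})
        g     (apply uber/graph (seq edges))]
    ;; e.g. a shortest path between two candidate nodes:
    (alg/shortest-path g 5 2)))
```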
@U899JBRPF that gist you posted also seems to be integrated into https://github.com/xtdb/xtdb/blob/master/docs/example/imdb/src/imdb/main.clj, right?
That example project seems handy as it knows how to download and ingest the imdb dataset.
It just seems that that specific docs/examples/imdb subproject may be a bit outdated (still a few crux references and some calls to no-longer-existing functions) and not working. I had to add [io.dropwizard.metrics/metrics-core "3.1.0"] to make the project happy; otherwise I was getting a weird unrelated ClassNotFound error when loading a class obviously present on the classpath.
I'm now trying to ingest the imdb data, it indeed takes a while..
ah, yes that example code is what I adapted to arrive at the gist, sorry it's stale and that we didn't get around to merging any changes upstream yet 😕
I was just curious to try that imdb dataset and this example subproject seemed like a way to go.
Btw relatively soon after ingesting all the data, the REPL crashed on me (I thought my heap was not big enough or something, but nope, I had run out of disk space 🙂). And later, when I tried to start the REPL again, I got kafka.common.InconsistentClusterIdException: The Cluster ID AkIA7ZUiSJKddUhAxZONYw doesn't match stored clusterId Some(YeKv9Du2RoqOS8nBFnifwA) in meta.properties. The broker is trying to join the wrong cluster. Configured zookeeper.connect may be wrong.
Some googling led me to https://github.com/xtdb/xtdb/pull/1609/commits/27d3b3034874bdf43baf24c6c226bfe8c1b6a6ef, and after I deleted that meta.properties file I was able to start the embedded kafka "cluster" again successfully using the start-from-repl function.
yeah, it's not that terrible, but the xtdb/docs/example/imdb folder is currently 12G (and increasing; I guess there's still some indexing going on, as there are more and more new files appearing in data/db-dir, as well as extensive CPU activity)
Wow.. it seems that it will take many more hours to finish the tx ingestion. I just tried submitting a new tx and it got assigned a tx-id of 202837. But after ~2-3 hours of my xtdb node running, (xt/db crux) still gives me :xtdb.api/tx-id 76336.
The rate of ingestion seems to be around 600/minute, which feels quite slow. Or are numbers like this normal? Is there perhaps any way of improving this rate?
what does your node config look like? are you using LMDB? or Lucene? are the put ops batched into reasonably-sized chunks (e.g. 1000 puts per tx)?
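Batched submission could be sketched like this (assuming `docs` is a seq of documents and `node` a started XT node; both names are illustrative):

```clojure
(comment
  ;; ~1000 puts per transaction instead of 100:
  (let [txes (for [batch (partition-all 1000 docs)]
               (xt/submit-tx node (mapv (fn [doc] [::xt/put doc]) batch)))]
    ;; submit-tx is async; block until the last tx has been indexed
    (xt/await-tx node (last txes))))
```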
I'm using that https://github.com/xtdb/xtdb/blob/master/docs/example/imdb/src/imdb/main.clj setup
(def crux-options
  {:xtdb.kafka/kafka-config {:bootstrap-servers "localhost:9092"}
   :xtdb/tx-log {:xtdb/module 'xtdb.kafka/->tx-log
                 :kafka-config :xtdb.kafka/kafka-config}
   :xtdb/document-store {:xtdb/module 'xtdb.kafka/->document-store
                         :kafka-config :xtdb.kafka/kafka-config
                         :local-document-store {:kv-store :rocksdb}}
   :xtdb/index-store {:kv-store :rocksdb}
   :rocksdb {:xtdb/module 'xtdb.rocksdb/->kv-store
             :db-dir index-dir}})
and it seems to use tx batches of 100 puts each
okay, hmm! I guess I'd check out the memory usage next, see if there's (a lot of) swapping happening. Next stop after that would be to crack out a profiler
A few years old Dell XPS 13 laptop.
Intel i7-8550U (8) @ 4.000GHz
16G RAM
A low -Xmx and swapping might be a logical explanation. The REPL is started from Calva; I'm not even sure what default values it's using. I'll try the profiler tomorrow. Thanks.