Fork me on GitHub
#xtdb
<
2021-06-15
>
plexus06:06:11

hey everyone, first time revisiting Crux in a while. I'm trying to run fully on SQLite for now, but not sure I got the config right. I'm getting a NullPointerException when trying to transact...

(defonce node
  (crux/start-node {:crux.jdbc/connection-pool
                    {:dialect {:crux/module crux.jdbc.sqlite/->dialect}
                     :pool-opts {}
                     :db-spec {:jdbcUrl "jdbc:sqlite:crux.db"}}

                    :crux/document-store
                    {:crux/module crux.jdbc/->document-store
                     :connection-pool :crux.jdbc/connection-pool}

                    :crux/tx-log
                    {:crux/module crux.jdbc/->tx-log
                     :connection-pool :crux.jdbc/connection-pool}}))

(crux/submit-tx node [[:crux.tx/put {:crux.db/id :ivan}]])
=> java.lang.NullPointerException
cache.clj:13  crux.cache/evict
cache.clj:12  crux.cache/evict
document_store.clj:57  crux.document-store.CachedDocumentStore/iter/fn

lispyclouds07:06:09

Hey @U07FP7QJ0 maybe you need to quote the ->tx-log, ->document-store and ->dialect like this:

(defonce node
    (crux/start-node {:crux.jdbc/connection-pool
                      {:dialect   {:crux/module 'crux.jdbc.sqlite/->dialect}
                       :pool-opts {}
                       :db-spec   {:jdbcUrl "jdbc:sqlite:crux.db"}}
                      :crux/document-store
                      {:crux/module     'crux.jdbc/->document-store
                       :connection-pool :crux.jdbc/connection-pool}
                      :crux/tx-log
                      {:crux/module     'crux.jdbc/->tx-log
                       :connection-pool :crux.jdbc/connection-pool}}))

plexus08:06:31

thanks! that works.

ordnungswidrig14:06:22

Maybe crux should catch that? :thinking_face:

Steven Deobald14:06:59

@U013MQC5YKD Curious: Is this related to the conversation you and I were having during our call? If it is, it might be worth a quick conversation again (maybe with @U899JBRPF) to chat about topologies.

🙌 2
🙏 2
refset16:06:20

> Maybe crux should catch that? :thinking_face: Yeah, I think so! I've added a note on the project board. Hopefully it's just a small tweak to system.clj

oxalorg (Mitesh)19:06:18

@U01AVNG2XNF Arne's working on some really cool data & schema derivations stuff and was testing it out with Crux! But I really like how crux can be used purely with Sqlite3 bananadance, I think I should give it a test on windows and see how it goes 🙂 I'll definitely reach out if we need a bit of help with Crux 🙌

Steven Deobald19:06:58

Oh man... I'd actually forgotten about the Windows requirement. 😬 Please do shout if you hit snags.

oxalorg (Mitesh)19:06:41

Hehe! Yes of course thank you 😁

Tomas Brejla15:06:30

Hi, I have some performance issue when using "find by attribute in" kind of query when I use (or) in the query. I'll post my minimum repro in the thread 🧵. Hopefully I'm doing something really silly and someone will spot my stupidity 🙏 🙂

Tomas Brejla15:06:08

any ideas welcome

Steven Deobald15:06:50

Mentioned above, but have you taken a glance at the query debug logs to see if there's anything obviously weird happening? https://opencrux.com/community/faq.html#observequeries ... at a glance, that really doesn't look like it should be taking 4 seconds. 😕 If someone is able to help you debug, it would also be useful to know what version of Crux you're on.

Tomas Brejla15:06:52

[juxt/crux-core "21.06-1.17.1-beta"]
[juxt/crux-rocksdb "21.06-1.17.1-beta"]

Martín Varela15:06:22

I just tested locally (on .17, in-memory node), and I can reproduce the slowdown. From under 3ms to over 1.6s (after running the query several times)

🙇 4
Martín Varela15:06:23

(with-out-str (time (crux/q
                      (crux/db crux-node)
                      '{:find  [e something-val]
                        :in [[id ...]]
                        :where [[e :bus-stop/id id]
                                [e :something something-val]]}
                      ids)));; => "\"Elapsed time: 1.300942 msecs\"\n"


(with-out-str (time (crux/q
                     (crux/db crux-node)
                     '{:find  [e something-val]
                       :in [[id ...]]
                       :where [(or [e :train-stop/id id]
                                   [e :bus-stop/id id])
                               [e :something something-val]]}
                     ids)));; => "\"Elapsed time: 1326.292413 msecs\"\n"

Martín Varela15:06:44

that's a 1000x slowdown there... seems weird indeed

Martín Varela15:06:47

@U01LFP3LA6P, might be worth running with logging set to DEBUG, and post the query debug info...

Tomas Brejla15:06:57

Sure, I'll try that.

Tomas Brejla15:06:49

wow, what a huge amount of log lines

cat log_output.txt | grep "crux.query" | wc -l
159854

jarohen15:06:55

my money would be on the :something coming first in the join order, because the query planner can't look down inside or clauses what this'll mean is that Crux will scan the :something attribute first, and then check that it matched either :train-stop/id or :bus-stop/id, whereas (ideally) you'd want it to go directly to the id

Tomas Brejla15:06:58

might work as a workaround, but shouldn't crux be able to cope with the query like the one I'm using?

jarohen15:06:55

I can't immediately think of a reasonable way for us to do that, at least :thinking_face:

jarohen15:06:34

if you're going to be making a lot of queries that look for both train-stops and bus-stops, I'd probably look into adding another key onto both sets of docs, which you could then include in the query at the top level

jarohen15:06:21

or, even, query against the :crux.db/id

jarohen15:06:52

unless you want to query for ID 3 and return either train stop 3 or bus-stop 3?

Martín Varela15:06:54

Alternatively, you could work around this by running both queries separately, and using set/union on the results

👍 2
Martín Varela15:06:04

but admittedly, it's hacky

Tomas Brejla15:06:11

I'm not sure I follow, so I'll rather ask in a different way 🙂

(time (crux/q
         (crux/db crux-node)
         '{:find  [id]
           :in [[id ...]]
           :where [(or [e :train-stop/id id]
                       [e :bus-stop/id id])]}
         ids))
Is such query actually wrong? If so, how can I change it to be more efficient?

Martín Varela15:06:33

Yeah, the queries as such seem nonsensical, but the underlying performance issue remains...

jarohen15:06:44

not wrong, just ill-performant, as you've found 🙂

Tomas Brejla15:06:47

I mean.. sure, I might query id directly, restructure the data so that they have some additional field such as :type :bus etc.

jarohen15:06:02

what do the :crux.db/ids of those documents look like?

Tomas Brejla15:06:13

but what if I bump into similar situation on "non-id", regular attribute

jarohen15:06:11

ah, yep, spotted, sorry 🙂

Tomas Brejla15:06:40

Maybe I shouldn't have chosen :train-stop/id and :bus-stop/id, might be a bit distracting

jarohen15:06:20

mm, possibly - have you got a use-case in mind?

Tomas Brejla16:06:15

if they were called :key1 and :key2, and my task was to filter out some vector of values, which appear in my database under either :key1 or :key2 , that would be an identical issue, wouldn't it?

Tomas Brejla16:06:44

a workaround with set/union would probably work, as @U95NTJT4H mentioned

Tomas Brejla16:06:34

but such workaround indeed feels a bit hacky to me and I was trying to avoid it

Martín Varela16:06:25

@U050V1N74 the magnitude of the slow down, for a dataset with only 2k entries suggests that there's something else going on... even a linear scan on a collection of that size should not take over 1s...

👀 2
jarohen16:06:10

mm - we do have a number of fast paths in that area to optimise common cases, maybe we're not hitting any of them in this case :thinking_face:

jarohen16:06:31

best to raise an issue for this with the repro above, if that'd be alright, and we can take more of a look 🙂

👍 2
Martín Varela16:06:46

I just tested with an older release, to see if it was a regression, but the performance degradation remains...

👍 2
Tomas Brejla16:06:46

I have to leave now unfortunately. If I have some spare time in the evening, I'll create the issue.

👍 2
jarohen16:06:57

no worries - I can write that up 🙂 thanks for letting us know! 🙏

refset17:06:56

I think it should be acceptable in this specific instance to drop the [e :something something-val] clause and use (pull e [:something]) in the :find instead i.e. this is fast

(time (crux/q
         (crux/db crux-node)
         '{:find  [id (pull e [:something])]
           :in [[id ...]]
           :where [(or [e :train-stop/id id]
                       [e :bus-stop/id id])]}
         ids))
You could also use the the built-in get-attr predicate, which will necessarily come after the or in the join order
(time (crux/q
         (crux/db crux-node)
         '{:find  [id something-val]
           :in [[id ...]]
           :where [(or [e :train-stop/id id]
                       [e :bus-stop/id id])
                   [(get-attr e :something) [something-val]]]}
         ids))

Tomas Brejla22:06:53

Interesting. It almost feels counter-intuitive to me that using pull syntax for a single attribute may actually result in (way, way) better performance than getting the value of that attribute using :where . Somehow I'd expect similar performance or even pull variant being a tiny bit slower. Can this be said in general that it's a good practice to prefer pull or get-attr (for fetching document's attributes' values) instead of "joining" them using :where "matching clauses"?

jarohen05:06:04

we do tend to make that distinction for pull, yep - using :where to find the documents, and pull to get their data out once you've found them

jarohen05:06:36

https://github.com/juxt/crux/issues/1533 - thanks again for raising and helping us track it down 🙏