#xtdb
2021-02-04
nivekuil11:02:26

what shared resources does open-db take up in particular? It seems to be a rocksdb snapshot, any idea how much that costs to keep around? I was thinking of just letting GC take care of old handles, because managing lifetimes in a dynamic lang, taking into account caching etc. is just too much for me

jarohen12:02:44

Hey @U797MAJ8M 👋 tl;dr is if you're not keen on closing it, it'd be worth using plain old db instead, as that doesn't open any resources of its own.
open-db opens up the RocksDB snapshot, as you've said, and also an entity resolution cache, which caches the current versions of entities as of the DB timestamp. I'm not aware of anything in this cache that wouldn't get GC'd though.
The RocksDB snapshot is a different matter - I'd have to look at the Rocks Java source, but IIUC this is backed by a natively allocated object, so it won't get GC'd. The memory impact of this is pretty small - it just stores the latest version number when the snapshot was taken - but (again IIUC) this is then used to determine what files Rocks can compact away - i.e. if there are old snapshots open, I'm not sure it can compact files into higher levels of the LSM tree, so it will affect query performance over time.
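
For readers who do want the snapshot released promptly, here is a minimal sketch of scoping an open-db handle with with-open. It assumes crux.api is on the classpath and that the handle returned by open-db is closeable, as described above; read-pair is a hypothetical helper name, not part of Crux.

```clojure
(require '[crux.api :as crux])

(defn read-pair
  "Hypothetical helper: reads two entities against one snapshot.
  with-open closes the db handle (and so releases the snapshot)
  when the body exits, even if it throws."
  [node id-a id-b]
  (with-open [db (crux/open-db node)]
    ;; crux/entity returns plain maps, so nothing lazy escapes the scope
    [(crux/entity db id-a)
     (crux/entity db id-b)]))
```

The point of the with-open scope is that the handle never outlives the call, so no old snapshot is left around to hold back compaction. Anything lazy built from the db would need to be realised inside the scope.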

nivekuil12:02:09

hi :) I was thinking, inchoately, about doing something like keeping track of opened db's, and if the tx-basis is the same as one already existing in memory, reuse that one instead of making a trip to crux. Does that have any merit to it? not sure how much it's saving

nivekuil12:02:53

I didn't know about the entity resolution cache. I guess after that it's some sort of on-disk cache, and then the doc store past that?

jarohen13:02:16

It's just an in-memory cache, it only caches the mapping between entity-id and content-hash at that DB basis - the 'temporal resolution' of each entity. Doesn't sound like much, but we end up using this mapping quite a lot throughout the query engine

jarohen13:02:15

And I don't know, I'm afraid - it'd depend quite a lot on your use case. The Rocks snapshot and the cache are pretty cheap to create, certainly - the questions would be around how often you re-use the same tx-basis, and how frequently you access the same entities in those queries. As always, best to measure 🙂

nivekuil13:02:12

thanks for the info! real simple criterium for anyone curious:

(put {:crux.db/id :test :foo 1 :bar "1"})
(put {:crux.db/id :test2 :foo 2 :bar "2"})

(do (println "open-db")
    (criterium.core/quick-bench (let [db (crux/open-db node)]
                                  (crux/entity db :test)
                                  (crux/entity db :test2))))

(do (println "db")
    (criterium.core/quick-bench (do (crux/entity (db) :test)
                                    (crux/entity (db) :test2))))

open-db
Evaluation count : 10422 in 6 samples of 1737 calls.
             Execution time mean : 57.281102 µs
    Execution time std-deviation : 399.468510 ns
   Execution time lower quantile : 57.014872 µs ( 2.5%)
   Execution time upper quantile : 57.946446 µs (97.5%)
                   Overhead used : 1.604431 ns
Found 1 outliers in 6 samples (16.6667 %)
    low-severe  1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers

db
Evaluation count : 7656 in 6 samples of 1276 calls.
             Execution time mean : 75.881244 µs
    Execution time std-deviation : 1.356425 µs
   Execution time lower quantile : 74.723975 µs ( 2.5%)
   Execution time upper quantile : 78.143538 µs (97.5%)
                   Overhead used : 1.604431 ns
Found 1 outliers in 6 samples (16.6667 %)
    low-severe  1 (16.6667 %)
 Variance from outliers : 13.8889 % Variance is moderately inflated by outliers

nivekuil13:02:33

validity of microbenchmarks notwithstanding, it seems like avoiding a full call to (crux/db node) pays off quite easily

nivekuil13:02:52

and with the cache:

(def db-cache (cc/lru-cache-factory {} :threshold 1000))

(defn db
  ([]
   (crux/db node))
  ([valid-time-or-basis]
   (cc/lookup-or-miss db-cache
                      valid-time-or-basis
                      (fn [_] (crux/db node valid-time-or-basis)))))

(do (println "db-cached")
    (criterium.core/quick-bench (let [date (tick/inst)]
                                  (do (crux/entity (db date) :test)
                                      (crux/entity (db date) :test2)))))

db-cached
Evaluation count : 12114 in 6 samples of 2019 calls.
             Execution time mean : 50.338854 µs
    Execution time std-deviation : 415.897755 ns
   Execution time lower quantile : 50.013153 µs ( 2.5%)
   Execution time upper quantile : 50.958765 µs (97.5%)
                   Overhead used : 1.604431 ns
not sure if it's better to cache a db or an open-db, but at least I'm not thinking about lifetimes anymore

jarohen13:02:57

neat - thanks for sharing!

nivekuil12:02:44

thinking about it more, if you wanted to do some deferred computation that makes db calls in general, you would want to explicitly use and cache the tx basis alongside it. Actually a very interesting feature of crux that you can do that at all.
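
A rough sketch of that idea: capture the basis when the work is scheduled, then rebuild the db from it when the work actually runs. It assumes crux.api provides db-basis and that crux/db accepts the returned basis map (the valid-time-or-basis arity used in the cache above); deferred-entity is a hypothetical name.

```clojure
(require '[crux.api :as crux])

(defn deferred-entity
  "Hypothetical helper: returns the basis captured now, plus a thunk that
  reads eid as of that basis whenever it is eventually invoked."
  [node eid]
  (let [basis (crux/db-basis (crux/db node))] ; assumed: valid time + tx of this db
    {:basis basis
     :run   (fn []
              ;; rebuild a db at the captured basis, not at 'now'
              (crux/entity (crux/db node basis) eid))}))
```

Calling (:run d) later should then see the entity as it was when d was created, regardless of transactions submitted in between. The basis is plain data, so it can be stored or sent alongside the deferred work.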

jarohen13:02:41

We certainly like this one 🙂

Marconi14:02:36

Is there a way to search all documents that had a certain attribute/value, including the ones that were deleted, throughout all history (not in a specific point in time)?

refset09:02:03

Hi @U01DZ6JEPH6 we don't have an index or API that specifically supports such a query today, but we hinted in a recent blog post that more advanced temporal queries are on our roadmap: https://opencrux.com/blog/dev-diary-jan-21.html#_future Can you describe the domain of data you're working with, and what you would need such queries for? It would be very helpful for our designs to learn more about any such use-cases. For now, it would probably be better to handle such information with regular Date values.

Marconi20:02:52

That's great news! My use case would be accidental deletions. Suppose you have a relationship of 1 to many, so it makes sense to store the 1 ref in the many. Like:
{:crux.db/id 123 :user/name "User"}
{:crux.db/id 456 :doc/title "My Doc" :doc/owner 123}
And the user has lots of docs. If he accidentally deletes a doc, I would like there to be a functionality like "Recover deleted docs". I know I can keep the many refs somewhere or implement it in some other way, but I think that since I have an immutable DB, I shouldn't keep a :deleted attribute on "deleted" docs. Thank you for helping out!

refset19:02:25

Thanks for the explanation, that's a useful data point 🙂 For the time being you may well have an easier job by not attempting to model your user/application-level deletes as valid time deletes. A :deleted attribute for this is a good compromise, I think.
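
As an illustration of the attribute-based approach, a "Recover deleted docs" lookup could be an ordinary Datalog query over the current db. This is only a sketch: :doc/deleted? is a hypothetical attribute name (the chat only says ":deleted"), deleted-docs is a hypothetical helper, and the document shape follows the example given earlier in the thread.

```clojure
(require '[crux.api :as crux])

(defn deleted-docs
  "Hypothetical helper: returns a set of [doc-id title] tuples for docs
  owned by owner-id that carry the soft-delete flag."
  [db owner-id]
  (crux/q db {:find  '[doc title]
              :where [['doc :doc/owner owner-id]   ; docs pointing at this owner
                      '[doc :doc/title title]
                      '[doc :doc/deleted? true]]})) ; flagged, not actually deleted
```

Recovering a doc would then be a regular put of the same document without the flag, and the history of both versions stays queryable by entity id.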