#xtdb
2022-05-11
Hukka10:05:01

I tried to search for an answer but didn't really get anything explicit. Isn't it so that while indexes are always up to date and local, XTDB doesn't have any caching for the actual documents so every request will hit the doc store?

Hukka10:05:11

Hm no, that is not true. At least the JDBC doc store accepts a cache-size param

Hukka11:05:31

But is it documents, kilos, megs or something else…

refset11:05:56

Yeah it's a doc-store level config option

Hukka11:05:11

> docs - clarify whether cache-size relates to number of entries

🙂 1
refset11:05:14

> But is it documents, kilos, megs or something else…
This is still waiting to hit the live docs https://github.com/xtdb/xtdb/commit/411a349d94d1b035083b9b8c953c4d6946dd9fc8

Hukka11:05:25

Yeah, found that exact thing

Hukka11:05:17

Documenting the default and the cache strategy would be nice too!

👍 1
Hukka11:05:35

Hm, or is the query engine's entity-cache-size talking about the same cache as the doc store's cache-size?
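They are at least configured in different places - a sketch of a node config showing both options (the sizes here are made-up values and the JDBC connection-pool details are omitted):

{:xtdb/document-store {:xtdb/module 'xtdb.jdbc/->document-store
                       ;; doc-store level document cache (see the commit
                       ;; linked above re: whether this counts entries)
                       :cache-size 131072}
 ;; the query engine's own entity cache, a separate option
 :xtdb/query-engine {:entity-cache-size 32768}}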

Hukka11:05:32

Ok, it's a simple LRU

Hukka11:05:16

Gotta love that with open source the answers are always there. Of course with some disclaimers about code readability

😅 1
refset11:05:43

Actually the default cache strategy (in the -beta release, anyway) is "second chance", loosely based on this paper https://github.com/xtdb/xtdb/blob/e2f51ed99fc2716faa8ad254c0b18166c937b134/core/src/xtdb/cache/second_chance.clj#L11

Hukka11:05:47

Ah, true, I didn't quite catch what that if does

refset11:05:48

We switched back to LRU for a while as the default until I found the time to chase down this hashCode bug: https://github.com/xtdb/xtdb/pull/1706

Hukka11:05:58

I got my colleague to read that paper, so I don't need to ;)

😏 3
refset11:05:56

It's actually only a very small part of the paper, and not spelled out explicitly as a general concept. I don't really know how Håkan managed to glean such a clear insight 😄

Hukka11:05:22

Ok, so it's a stochastic cache where things are randomly moved from the cache to an eviction queue. Stuff that gets reused is moved back to the main cache, and stuff that runs out of the queue gets really evicted

✔️ 1
refset12:05:07

That is a great description...would you mind if I borrow it to add to the namespace?

Hukka12:05:29

Not at all

🙏 2
Hukka12:05:52

Though now that I read it more, it was a bit inaccurate

Hukka12:05:34

Things in the eviction queue are still in the main cache. Rather, every element in the cache maintains both the actual value and a mark saying whether it's cooling or not, and accessing the element removes that mark

👍 1
Hukka12:05:59

And when the eviction queue is processed, the entries that were used are just skipped
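A toy sketch of that mechanism in plain Clojure (an illustration of the general second-chance idea described here, not XTDB's actual implementation):

(defn access
  "Reading an entry clears its cooling mark, giving it a second chance."
  [cache k]
  (assoc-in cache [:entries k :cooling?] false))

(defn evict-one
  "Walks the cooling queue: entries accessed since being queued are skipped
  (they stay in the cache), and the first still-cooling entry is evicted."
  [{:keys [entries cooling] :as cache}]
  (loop [[k & ks] cooling]
    (cond
      (nil? k) (assoc cache :cooling [])
      ;; mark was cleared by an access: second chance, skip it
      (not (get-in entries [k :cooling?])) (recur ks)
      ;; still cooling: really evict
      :else (-> cache
                (update :entries dissoc k)
                (assoc :cooling (vec ks))))))

;; e.g. (evict-one {:entries {:a {:v 1 :cooling? true}} :cooling [:a]})
;; => {:entries {} :cooling []}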

Hukka12:05:03

I'm still a bit mystified by what the computeIfAbsent does, but I need to read what ICache says about it. Also wondering why resize-cache first calls move-to-cooling and then move-to-cold, since move-to-cold also calls move-to-cooling at every iteration to keep the cooling queue full

Hukka12:05:18

Heh, ICache was pretty thin. I'll read the paper too, then

🙂 1
refset12:05:19

I'm not entirely sure about all this, but move-to-cold only calls move-to-cool iff (> (.size (.getHot cache)) hot-target-size)

Hukka12:05:57

Yeah, which is the precursor for it to do anything at all

Hukka12:05:40

I think the critical point is that I don't see what the computeIfAbsent is for, since that is what triggers the resize-cache

Hukka12:05:31

Hm, I suppose it could be because if only move-to-cold called move-to-cooling, nothing would ever populate the cooling queue. But after it has been populated, it wouldn't be needed

refset12:05:08

I think it's safe to say that I have long since evicted my working knowledge of this namespace

Hukka17:05:48

The more I read, the more questions I get 😕

Hukka18:05:31

Like what is the stored-key-fn for (it seems to copy buffers around, but why), why first get a key in compute-if-absent and then again use computeIfAbsent on the underlying map, or why try to read from cold when it's defined as a no-op

Petrus Theron11:05:43

When running clj -m xtdb.main, how do I tell it to use system env vars via xtdb.edn?

refset11:05:06

I don't think you can do that via xtdb.edn 🤔 which vars are you trying to set?

refset11:05:43

can you not do FOO=bar clj -m xtdb.main?

Petrus Theron11:05:07

We are in a K8s env and policy is we can't write database creds to the file system, so the Postgres database host/password are injected via env vars. We can put a layer on top, sure, but it would be nice to use xtdb.main directly - one less compile step for us.

jarohen11:05:47

might be a bit hacky, but there are a few libraries that have env-var reader macros, if memory serves, so you might be able to put something like {:user #env "POSTGRES_USER"} in your EDN file

jarohen11:05:07

although I don't recall if that's read with a specific readers map... 🤔

jarohen11:05:09

(I mean, you could write your own reader macro too, I guess)
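A minimal sketch of the write-your-own route (the #env tag and the env-reader / read-config-with-env names are made up for illustration - this isn't a built-in XTDB feature):

(require '[clojure.edn :as edn])

(defn env-reader
  "Resolves a value tagged with #env (a symbol or string naming an
  environment variable) to that variable's value."
  [var-name]
  (System/getenv (str var-name)))

(defn read-config-with-env
  "Reads an EDN config string, resolving #env tags via env-reader."
  [config-str]
  (edn/read-string {:readers {'env env-reader}} config-str))

;; usage, assuming POSTGRES_USER=xtdb is set in the environment:
;; (read-config-with-env "{:user #env POSTGRES_USER}")
;; => {:user "xtdb"}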

Petrus Theron16:05:42

@U050V1N74 yes, but not without wrapping / shimming xtdb.main, in which case you might as well call x/start-node yourself and pass (System/getenv "DATABASE_HOST") from the caller.

Petrus Theron16:05:07

however I like the idea of retaining config in xtdb.edn and being able to use an #env reader.

Anthony Bui11:05:17

Hi again, I'm trying to figure out if I should use multiple nodes. If, for example, I have lots of buildings and the people currently inside these buildings, would a node for buildings and a node for people be the right way to think about it? Or is it highly use-case dependent? (this case is mostly just querying the people inside buildings)

refset12:05:42

Hey @U03D15XF04D to clarify, I assume you mean 'node' as in "nodes in a graph". Note that we generally use the term 'entity' for this (or 'vertex' is okay too), since 'node' is already reserved for talking about the system/deployment architecture for XT (which itself is a graph...) 🙂 An entity for each building and an entity for each person sounds about right. It probably makes most sense to have a :my.app.person/located-in-building attribute on each person entity, pointing to the relevant building(s), but as you've correctly guessed already there are a lot of concrete details of the exact use-case required to know what's best. Sometimes it can even make sense to reify edges into additional entities, so that you can attach other attributes (/properties) like :my.app.person-building/entry-time or whatever, but then the query engine has to do more work.
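A sketch of those two shapes as documents (the ids and exact attribute values are made up):

;; building and person entities, with a reference attribute on the person
{:xt/id :building/hq}

{:xt/id :person/alice
 :my.app.person/located-in-building :building/hq}

;; the reified-edge alternative: the stay itself becomes an entity,
;; so it can carry extra attributes such as the entry time
{:xt/id :stay/alice-hq
 :my.app.person-building/person :person/alice
 :my.app.person-building/building :building/hq
 :my.app.person-building/entry-time #inst "2022-05-11T09:00:00Z"}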

Anthony Bui12:05:06

Many thanks for your reply!! Yes, I am definitely a bit lost, but I was actually talking about whether or not running multiple XTDB nodes would be ideal for my use-case (the doc page for setting up Kafka mentions availability and fault-tolerance), but I guess the answer is no? 😅 I think that my use-case isn't too complicated, however the number of entities will be a million/some millions. I wish to query what people are present in a building, or several buildings, through time. I do have a "located-in-building" attribute and I'm thinking of using (and querying) only on valid-time for entry/exit time - ultimately, I'm wondering if this is the best way in terms of query performance? Right now I'm testing it locally with LMDB for indexes and RocksDB for tx/doc-store, but it's taking much longer than expected
edit: it's taking longer than expected when running queries through the Clojure CLI, have not tried http yet

refset14:05:28

Aha! My mistake 🙈 Unless you have particularly high throughput needs (e.g. >500 txes/s) you should be able to scale up on a single tx-log setup okay. It will certainly be easier to prototype with all data living in a single node. As ever, you should evaluate things carefully with a realistic sample of data.
> it's taking longer than expected when running queries through the Clojure CLI, have not tried http yet
Hmm, the REPL should ~always be faster than doing things over http. Can you share an example of a query? Feel free to DM me if it's sensitive.

Anthony Bui14:05:55

Gotcha, thanks again for your input! One query is the following, just trying to get all entities at a specific date:

(time (xt/q (xt/db node #inst "2016-10-31")
        '{:find [id]
          :where [[e :xt/id id]]}))

This particular one took over 70 seconds lol, granted I'm working on a 2016 Mac Pro :P

refset14:05:30

is that returning millions of entities?

refset14:05:49

what is the config passed to start-node?

refset14:05:44

You should be able to get away with simply:

(time (xt/q (xt/db node #inst "2016-10-31")
        '{:find [id]
          :where [[id :xt/id]]}))
(which might be slightly faster, I'm not 100% sure)

Anthony Bui10:05:20

Hi again, sorry for not being completely clear: all of our data will be in the millions, but right now I'm testing on a much smaller subset of it. Additionally, as "people exit the building", I logically delete them at their exit-timestamp (so only people inside buildings will be returned by a regular query), meaning that any query shouldn't return millions of results. I reset everything and loaded the data again, and the query for the date 2016-10-31 now completes in around 20 seconds (so the 70 seconds must have been something wrong on my end), which is much better than before, but there are "only" under 7000 entities returned - how long will it take when we load all of our data? Is this to be expected? Not sure if your query helped with improving times, but I'll stick to it from now on to be sure! start-node only receives the following configuration file:

{
  "xtdb/index-store": {
    "kv-store": {
      "xtdb/module": "xtdb.lmdb/->kv-store",
      "db-dir": "data/index-store"
    }
  },
  "xtdb/document-store": {
    "kv-store": {
      "xtdb/module": "xtdb.rocksdb/->kv-store",
      "db-dir": "data/doc-store"
    }
  },
  "xtdb/tx-log": {
    "kv-store": {
      "xtdb/module": "xtdb.rocksdb/->kv-store",
      "db-dir": "data/tx-log"
    }
  },
  "xtdb/query-engine": {
    "query-timeout": 1000000
  }
}

(def node (xt/start-node (io/file "resources/config.json")))

I also have "XTDB_DISABLE_LIBGCRYPT=true" and "XTDB_ENABLE_BYTEUTILS_SHA1=true" set in order to avoid the error about loading libcrypto unsafely. We briefly discussed whether to store the people present in a building as nested attributes on the building, but a quick search showed us that this would be worse since indexing only applies to top-level attributes, right? Thanks again for your help! ☺️

refset10:05:30

Hey again (apologies for the delay!), so on reflection I think what is happening here is that XTDB's native history (valid-time) is a less-than-ideal way to model this, because the temporal index works more like a 'filter' when a query is scanning through the raw EAV content indexes. I.e. if you have a lot of raw EAV data that is not visible as-of the query basis, then the filtering will take time proportional to that raw data set, and therefore this particular approach can't really scale for your use-case if you need low-latency queries...

However, assuming you don't need to have millions of people in a single building(?) I think materializing this people-in-building set membership information into the building's entity should be more viable.

Anthony Bui10:05:27

Many thanks for the explanation! I'll try your suggestion and get back with the results 😄

🤞 1
Anthony Bui11:05:35

Sorry for more stupid questions, but with your membership suggestion, are you still thinking of having "separate" Person entities? We thought of storing an attribute list of "Person" maps inside Buildings, but are running into problems with how to query through time with these - should we store the Person valid-time as an attribute and query on that? Would the query engine be able to handle the ever-growing list? Most Person maps have a short valid-time - would the query engine "swiftly" ignore entries out of range? Another thought was to only let "valid" Persons be present in the list, but that wouldn't work for our old data as we then only have the possibility to query on tx-time (?)
edit: on second thought, if we were to proceed with having separate Person entities, was your plan that we keep using valid-time on Person entities and that querying for people inside a specific building (inside the Building attribute list) would be sufficient?

refset12:05:35

> was your plan that we keep using valid-time on Person entitites and that querying for people inside a specific building (inside the Building attribute list) would be sufficient? If I understand correctly, 'yes' 🙂

refset12:05:31

Depending on your exact queries and needs for the data model, deleting / ending ('capping'?) the valid-time of the persons may make some queries harder though 🤔

refset12:05:26

For my suggestion, I guess you can consider the building entity as storing an aggregate/projection/materialization of the 'current' set membership. Which is essentially a duplicate / derivation of the data also encoded into all the Persons' valid-time
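A sketch of that materialized doc and the kind of query it enables (the :building/people attribute name is made up; set values are indexed per-element in XTDB, so the membership lookup is cheap):

{:xt/id :building/hq
 :building/people #{:person/alice :person/bob}}

;; counting current occupants becomes an attribute lookup rather than
;; a scan filtered through valid-time:
(xt/q (xt/db node #inst "2016-10-31")
      '{:find [(count p)]
        :where [[:building/hq :building/people p]]})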

Anthony Bui12:05:47

Perfect! Time-travel would work great then if the Buildings are seen to store the "current" people, too. I guess the loading of our data will take a considerable hike in time, but that's something we'll have to live with 😝, again thanks for your help!!

🙂 1
🙏 1
Anthony Bui07:05:25

How would I go about actually keeping the set of persons in a building "current"? I have tx-functions to remove/add persons from a building's set just after the doc init for said person, but these transactions aren't stamped with valid-time. Should I leave the responsibility to queries?
edit: we thought of extracting the membership set as its own document in order to track time, but we're not sure if that's much better performance-wise

refset09:05:26

> these transactions aren't stamped with valid-time. Should I leave the responsibility to queries?
Can you explain more what you mean by these two points exactly?

Anthony Bui09:05:51

edit: all I really need to do is send the valid start time into the tx-function, right..? 🤦‍♂️
Yes of course, I haven't explained the whole use case thoroughly, so here goes: We have buildings and persons, as explained earlier, and what we ultimately want to query is "how many people were inside this building at this time?". Our data is received in the form of timestamped "personEnteredBuilding" and "personExitedBuilding" events. These are of course continuous, but we would also like to load past data "properly" so that we can query back in time. The valid-time for a person entity would represent the time they were inside a specific building, whereas the valid-time for a building represents the time it is "active". We delete the entities at the timestamp of "personExitedBuilding", thinking that it would be easier to query only on documents present at a specific valid time. Querying raw EAV however, as you explained, took too much time, so we moved on with your suggestion of storing the people as a membership set in an attribute on buildings. This would allow a cheap "count" call as well as the possibility to look up separate persons if needed. Our perceived problem is that when processing past data, we want to keep the membership set of a building current with valid time. When adding or removing a person eid in the set, we use a simple transaction function running a conj/disj on the set. How would we "apply" the timestamps of personEnteredBuilding and personExitedBuilding within the building's attribute when running our transaction functions? If this were to work as we want, wouldn't our transaction functions need to work with the valid times?

refset14:05:53

🙂 you can certainly pass in an explicit valid time, and if one isn't specified you should be able to access the "current" tx-time (== valid-time) from the transaction function context

❤️ 1
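A minimal sketch of passing an explicit valid time through a transaction function (:add-person-to-building and :building/people are made-up names; the shape follows the standard XTDB tx-fn pattern of returning further tx ops):

(require '[xtdb.api :as xt])

;; install the transaction function as a document; it runs inside the
;; transaction, so it sees a consistent snapshot of the db
(xt/submit-tx node
  [[::xt/put
    {:xt/id :add-person-to-building
     :xt/fn '(fn [ctx person-id building-id valid-time]
               (let [db (xtdb.api/db ctx)
                     building (xtdb.api/entity db building-id)]
                 ;; stamp the updated building doc with the event timestamp
                 [[::xt/put
                   (update building :building/people (fnil conj #{}) person-id)
                   valid-time]]))}]])

;; invoke it, passing the personEnteredBuilding event's timestamp explicitly
(xt/submit-tx node
  [[::xt/fn :add-person-to-building
    :person/alice :building/hq #inst "2016-10-31T08:00:00Z"]])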
Petrus Theron16:05:15

Got bitten by XTDB config for not quoting symbols in :xtdb/module when copypasta’ing xtdb.edn to a .clj file. Would be nice if Spec was invoked on config types.

Steven Deobald21:05:09

Sounds legit. I've created an issue to track: https://github.com/xtdb/xtdb/issues/1754