
I tried to search for an answer but didn't find anything explicit. Am I right that, while the indexes are always up to date and local, XTDB doesn't do any caching of the actual documents, so every request hits the doc store?


Hm no, that is not true. At least the JDBC doc store accepts a cache-size param


But is it documents, kilos, megs or something else…


Yeah it's a doc-store level config option


> docs - clarify, wherever cache-size appears, that it relates to number of entries

🙂 1

> But is it documents, kilos, megs or something else…
This is still waiting to hit the live docs


Yeah, found that exact thing


Documenting the default and the cache strategy would be nice too!

👍 1

Hm, or is the query engine's entity-cache-size talking about the same cache as the doc store's cache-size?


Ok, it's a simple LRU
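For readers following along, the LRU policy being referred to can be sketched in a few lines. This is illustrative Python only; XTDB's actual implementation is in Clojure and differs in detail, and `cache_size` counting entries (not bytes) is exactly the point being clarified above.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU sketch: a hit moves the key to the most-recently-used
    end of the ordering; on overflow the least-recently-used entry is
    evicted. Not XTDB's actual implementation."""

    def __init__(self, cache_size):
        self.cache_size = cache_size  # number of entries, not bytes
        self.entries = OrderedDict()

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        return None

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.cache_size:
            self.entries.popitem(last=False)  # evict least recently used
```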


Gotta love that with open source the answers are always there. Of course with some disclaimers about code readability

😅 1

Actually the default cache strategy (in the -beta release, anyway) is "second chance", loosely based on this paper


Ah, true, I didn't quite catch what that if does


We switched back to LRU for a while as the default until I found the time to chase down this hashCode bug:


I got my colleague to read that paper, so I don't need to ;)

😏 3

It's actually only a very small part of the paper, and not spelled out explicitly as a general concept. I don't really know how Håkan managed to glean such a clear insight 😄


Ok, so it's a stochastic cache where entries are randomly moved from the cache to an eviction queue. Stuff that gets reused is moved back to the main cache, and stuff that falls off the end of the queue gets actually evicted

✔️ 1

That is a great description...would you mind if I borrow it to add to the namespace?


Not at all

🙏 2

Though now that I read it more, it was a bit inaccurate


Things in the eviction queue are still in the main cache. Rather, every element in the cache holds both the actual value and a flag saying whether it's cooling or not, and accessing the element clears that flag

👍 1

And when the eviction queue is processed, the entries that were used are just skipped
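The corrected description above can be sketched roughly as follows. This is an illustrative Python toy of the general "second chance" idea as described in this thread, not XTDB's actual Clojure code: real implementations pick cooling candidates (semi-)randomly, whereas this sketch simply flags every entry.

```python
from collections import deque

class SecondChanceCache:
    """Sketch of the 'second chance' policy as described above.
    Entries on the cooling queue remain in the main cache; each entry
    carries a cooling flag, a hit clears the flag, and the eviction
    sweep skips (gives a second chance to) entries whose flag was
    cleared since they were queued."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.values = {}    # key -> value; cooling entries stay in here
        self.cooling = {}   # key -> True while flagged for eviction
        self.queue = deque()

    def get(self, key):
        if key in self.values:
            self.cooling[key] = False  # a hit clears the cooling flag
            return self.values[key]
        return None

    def put(self, key, value):
        self.values[key] = value
        self.cooling[key] = False
        self._move_to_cooling()
        self._evict()

    def _move_to_cooling(self):
        # Flag entries and add them to the cooling queue. A real
        # implementation would sample candidates; we flag everything
        # not already queued, to keep the sketch short.
        for key in self.values:
            if key not in self.queue:
                self.cooling[key] = True
                self.queue.append(key)

    def _evict(self):
        while len(self.values) > self.cache_size:
            key = self.queue.popleft()
            if self.cooling.get(key):
                # still cold: really evict it from the main cache
                del self.values[key]
                del self.cooling[key]
            else:
                # it was touched since being queued: skip it, but
                # re-flag and requeue so a later sweep can evict it
                self.cooling[key] = True
                self.queue.append(key)
```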


I'm still a bit mystified by what the computeIfAbsent does, but I need to read what ICache says about it. I'm also wondering why resize-cache calls move-to-cooling first and then move-to-cold, since move-to-cold already calls move-to-cooling on every iteration to keep the cooling queue full


Heh, ICache was pretty thin. I'll read the paper too, then

🙂 1

I'm not entirely sure about all this, but move-to-cold calls move-to-cooling only when (> (.size (.getHot cache)) hot-target-size)


Yeah, which is the precondition for it to do anything at all


I think the critical point is that I don't see what the computeIfAbsent is for, since that is what triggers the resize-cache


Hm, I suppose it could be because if only move-to-cold called move-to-cooling, nothing would ever populate the cooling queue in the first place. But once it has been populated, it wouldn't be needed


I think it's safe to say that I have long since evicted my working knowledge of this namespace


The more I read, the more questions I get 😕


Like, what is the stored-key-fn for (it seems to copy buffers around, but why), why first get a key in compute-if-absent and then use computeIfAbsent on the underlying map again, or why try to read from cold when it's defined as a no-op

Petrus Theron 11:05:43

When running clj -m xtdb.main, how do I tell it to use system env vars via xtdb.edn?


I don't think you can do that via xtdb.edn :thinking_face: which vars are you trying to set?


can you not do FOO=bar clj -m xtdb.main?

Petrus Theron 11:05:07

We are in a K8s environment and policy is that we can't write database creds to the file system, so the Postgres host/password are injected via env vars. We can put a layer on top, sure, but it would be nice to use xtdb.main directly - one less compile step for us.


might be a bit hacky, but there are a few libraries that have env-var reader macros, if memory serves, so you might be able to put something like {:user #env "POSTGRES_USER"} in your EDN file
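For illustration, assuming such an #env data reader were wired in (hypothetical - this is not stock xtdb.edn behaviour, and the exact JDBC config shape should be checked against the docs), the file might look something like:

```clojure
;; hypothetical xtdb.edn fragment - assumes a custom #env data reader
{:xtdb.jdbc/connection-pool
 {:dialect {:xtdb/module xtdb.jdbc.psql/->dialect}
  :db-spec {:host     #env "DATABASE_HOST"
            :user     #env "POSTGRES_USER"
            :password #env "POSTGRES_PASSWORD"}}}
```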


although I don't recall if that's read with a specific readers map... :thinking_face:


(I mean, you could write your own reader macro too, I guess)

Petrus Theron 16:05:42

@U050V1N74 yes, but not without wrapping / shimming xtdb.main, in which case you might as well call x/start-node yourself and pass (System/getenv "DATABASE_HOST") from the caller.

Petrus Theron 16:05:07

however I like the idea of retaining config in xtdb.edn and being able to use an #env reader.


Anthony Bui 11:05:17

Hi again, I'm trying to figure out whether I should use multiple nodes. If, for example, I have lots of buildings and the people currently inside these buildings, would a node for buildings and a node for people be the right way to think about it? Or is it highly use-case dependent? (This case is mostly just querying the people inside buildings.)


Hey @U03D15XF04D to clarify, I assume you mean 'node' as in "nodes in a graph". Note that we generally use the term 'entity' for this (or 'vertex' is okay too), since 'node' is already reserved for talking about the system/deployment architecture for XT (which itself is a graph...) 🙂 An entity for each building and an entity for each person sounds about right. It probably makes most sense to have an attribute on each person entity, pointing to the relevant building(s), but as you've correctly guessed already, there are a lot of concrete details of the exact use-case required to know what's best. Sometimes it can even make sense to reify edges into additional entities, so that you can attach other attributes (/properties) like or whatever, but then the query engine has to do more work.
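To make that concrete, such a pair of entities might look like the following (hypothetical documents - the attribute names here are invented for illustration, not prescribed by XTDB):

```clojure
;; hypothetical documents - attribute names are illustrative only
{:xt/id :building/hq
 :building/name "HQ"}

{:xt/id :person/alice
 :person/name "Alice"
 :located-in-building :building/hq} ; reference attribute pointing at the building's :xt/id
```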

Anthony Bui 12:05:06

Many thanks for your reply!! Yes, I am definitely a bit lost, but I was actually asking whether running multiple XTDB nodes would be ideal for my use-case (the doc page for setting up Kafka mentions availability and fault-tolerance), but I guess the answer is no? 😅 I think my use-case isn't too complicated, however the number of data entities will be a million / some millions. I wish to query which people are present in a building, or several buildings, through time. I do have a "located-in-building" attribute and I'm thinking of using (and querying) only valid-time for entry/exit times - ultimately, I'm wondering if this is the best way in terms of query performance? Right now I'm testing it locally with LMDB for the indexes and RocksDB for the tx/doc-store, but it's taking much longer than expected. edit: it's taking longer than expected when running queries through the Clojure CLI; I haven't tried http yet


Aha! My mistake 🙈 Unless you have particularly high throughput needs (e.g. >500 txes/s) you should be able to scale up on a single tx-log setup okay. It will certainly be easier to prototype with all data living in a single node. As ever, you should evaluate things carefully with a realistic sample of data.
> it's taking longer than expected when running queries through Clojure CLI, have not tried http yet
Hmm, the REPL should ~always be faster than doing things over http. Can you share an example of a query? Feel free to DM me if it's sensitive.

Anthony Bui 14:05:55

Gotcha, thanks again for your input! One query is the following, just trying to get all entities at a specific date:

(time (xt/q (xt/db node #inst "2016-10-31")
        '{:find [id]
          :where [[e :xt/id id]]}))

This particular one took over 70 seconds lol, granted I'm working on a 2016 Mac Pro :P


is that returning millions of entities?


what is the config passed to start-node?


You should be able to get away with simply:

(time (xt/q (xt/db node #inst "2016-10-31")
        '{:find [id]
          :where [[id :xt/id]]}))
(which might be slightly faster, I'm not 100% sure)

Anthony Bui 10:05:20

Hi again, sorry for not being completely clear: all of our data will be in the millions, but right now I'm testing on a much smaller subset of it. Additionally, as "people exit the building", I logically delete them at their exit timestamp (so only people currently inside buildings are returned by a regular query), meaning no query should return results in the millions. I reset everything and loaded the data again, and the query for the date 2016-10-31 now completes in around 20 seconds (so the 70 seconds must have been something wrong on my end), which is much better than before, but there are "only" under 7000 entities returned - how long will it take once we load all of our data? Is this to be expected? Not sure if your query helped with improving times, but I'll stick to it from now on to be sure! The node only receives the following configuration file:

{
  "xtdb/index-store": {
    "kv-store": {
      "xtdb/module": "xtdb.lmdb/->kv-store",
      "db-dir": "data/index-store"
    }
  },
  "xtdb/document-store": {
    "kv-store": {
      "xtdb/module": "xtdb.rocksdb/->kv-store",
      "db-dir": "data/doc-store"
    }
  },
  "xtdb/tx-log": {
    "kv-store": {
      "xtdb/module": "xtdb.rocksdb/->kv-store",
      "db-dir": "data/tx-log"
    }
  },
  "xtdb/query-engine": {
    "query-timeout": 1000000
  }
}

(def node (xt/start-node (io/file "resources/config.json")))

I also have XTDB_DISABLE_LIBGCRYPT=true and XTDB_ENABLE_BYTEUTILS_SHA1=true set in order to avoid the error about loading libcrypto unsafely. We briefly discussed storing the people present in a building as nested attributes on the buildings, but a quick search showed us that this would be worse since indexing only applies to top-level attributes, right? Thanks again for your help! ☺️


Hey again (apologies for the delay!), so on reflection I think what is happening here is that XTDB's native history (valid-time) is a less-than-ideal way to model this, because the temporal index works more like a 'filter' when a query is scanning through the raw EAV content indexes. I.e. if you have a lot of raw EAV data that is not visible as-of the query basis, then the filtering will take time proportional to that raw data set, and therefore this particular approach can't really scale for your use-case if you need low-latency queries... However, assuming you're not needing to have millions of people in a single building(?) I think materializing this people-in-building set membership information into the building's entity should be more viable.
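A sketch of what that materialization might look like on the building document (hypothetical attribute names, for illustration only):

```clojure
;; hypothetical: the building doc carries the current set membership,
;; so "who is inside as-of time T" becomes a single attribute lookup
;; rather than a scan filtered by the temporal index
{:xt/id :building/hq
 :building/name "HQ"
 :building/occupants #{:person/alice :person/bob}}
```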

Anthony Bui 10:05:27

Many thanks for the explanation! I'll try your suggestion and get back with the results 😄

🤞 1
Anthony Bui 11:05:35

Sorry for more stupid questions, but with your membership suggestion, are you still thinking of having "separate" Person entities? We thought of storing a list attribute of "Person" maps inside Buildings, but are running into problems with how to query through time with these: should we store a Person's valid-time as an attribute and query on that? Would the query engine be able to handle the ever-growing list? Most Person maps have a short valid-time; would the query engine "swiftly" ignore entries out of range? Another thought was to only let "valid" Persons be present in the list, but that wouldn't work for our old data, as we would then only have the possibility to query on tx-time (?) edit: on second thought, if we were to proceed with having separate Person entities, was your plan that we keep using valid-time on Person entities and that querying for people inside a specific building (inside the Building attribute list) would be sufficient?


> was your plan that we keep using valid-time on Person entitites and that querying for people inside a specific building (inside the Building attribute list) would be sufficient? If I understand correctly, 'yes' 🙂


Depending on your exact queries and needs for the data model, deleting / ending ('capping'?) the valid-time of the persons may make some queries harder though :thinking_face:


For my suggestion, I guess you can consider the building entity as storing an aggregate/projection/materialization of the 'current' set membership. Which is essentially a duplicate / derivation of the data also encoded into all the Persons' valid-time

Anthony Bui 12:05:47

Perfect! Time-travel would work great then if the Buildings are seen to store the "current" people, too. I guess the loading of our data will take a considerable hike in time, but that's something we'll have to live with 😝, again thanks for your help!!

🙂 1
🙏 1
Anthony Bui 07:05:25

How would I go about actually keeping the set of persons in a building "current"? I have tx-functions to add/remove persons from a building's set just after the doc init for said person, but these transactions aren't stamped with a valid-time. Should I leave the responsibility to queries? edit: we thought of extracting the membership set into its own document in order to track time, but we're not sure that's much better performance-wise


> these transactions aren't stamped on valid-time. Should I leave the responsibility to queries?
Can you explain more what you mean by these two points exactly?

Anthony Bui 09:05:51

edit: all I really need to do is pass the valid start time into the tx-function, right..? :man-facepalming:
Yes of course, I haven't explained the whole use case thoroughly, so here goes: We have buildings and persons, as explained earlier, and what we ultimately want to query is "how many people were inside this building at this time?". Our data is received in the form of timestamped "personEnteredBuilding" and "personExitedBuilding" events. These are of course continuous, but we would also like to load past data "properly" so that we can query back in time. The valid-time for a person entity would represent the time they were inside a specific building, whereas the valid-time for a building represents the time it is "active". We delete the person entities at the timestamp of "personExitedBuilding", thinking it would be easier to query only on documents present at a specific valid time. Querying raw EAV, however, as you explained, took too much time, so we moved on with your suggestion of storing the people as a membership set kept in an attribute on buildings. This allows a cheap "count" call as well as the possibility to look up separate persons if needed. Our perceived problem is that when processing past data, we want to keep the membership set of a building current with valid time. When adding or removing a person eid in the set, we use a simple transaction function running conj/disj on the set. How would we sort of "apply" the timestamps of a personEnteredBuilding and personExitedBuilding within the building's attribute when running our transaction functions? For this to work as we want, wouldn't our transaction functions need to work with the valid times?


🙂 you can certainly pass in an explicit valid time, and if one isn't specified you should be able to access the "current" tx-time (== valid-time) from the transaction function context

❤️ 1
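A sketch of what passing the event timestamp through might look like, assuming XTDB 1.x-style transaction functions (ids and attribute names here are hypothetical; the exact API should be checked against the transaction docs):

```clojure
;; install the transaction function - itself just a document with an :xt/fn
[::xt/put
 {:xt/id :enter-building
  :xt/fn '(fn [ctx building-id person-id valid-from]
            (let [db       (xtdb.api/db ctx)
                  building (xtdb.api/entity db building-id)]
              ;; re-put the building with an updated membership set,
              ;; using the event's timestamp as the explicit valid-time
              [[:xtdb.api/put
                (update building :building/occupants (fnil conj #{}) person-id)
                valid-from]]))}]

;; invoke it, passing the personEnteredBuilding event's timestamp
[::xt/fn :enter-building :building/hq :person/alice #inst "2016-10-31T09:00:00Z"]
```

A matching exit function would presumably disj the person and put with the "personExitedBuilding" timestamp instead.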
Petrus Theron 16:05:15

Got bitten by XTDB config for not quoting symbols in :xtdb/module when copypasta’ing xtdb.edn to a .clj file. Would be nice if Spec was invoked on config types.

Steven Deobald 21:05:09

Sounds legit. I've created an issue to track: