#xtdb
2021-05-18
Hukka12:05:28

I tried to look for sizing or deployment guides for Crux, but didn't find any good sources. Have I understood correctly that every Crux node needs to have a full copy of the indices, but not the documents? How large are they, roughly? (I guess they don't depend on document sizes, but mostly on the historic number of document versions.) And is the usual way to deploy Crux nodes as long-term, beefy instances that dynamically scaling application servers then query over the network, rather than having the nodes local or in-process with the app servers?

refset17:05:20

Hi Hukka, these are great questions!

> Have I understood correctly that every Crux node needs to have a full copy of the indices, but not the documents?

In practice your nodes must be able to comfortably handle the full copy including the documents, since a lot of caching is going on and much of the data is stored durably, local to the node. We're actually doing some work right now to make sure that nodes can work without using the document store more than strictly necessary (since some document stores can be expensive!), see: https://github.com/juxt/crux/issues/1511

A comprehensive deployment guide is definitely a known gap in the documentation 😅 There are no hard rules though, and I'd be happy to give you guidance for your specific scenario. For large databases (>100GB) I think deploying Crux separately is usually the sensible way to go: operations will be easier and you will be able to scale & upgrade the tiers separately. You don't necessarily have to use our HTTP system, however, and you may find that writing your own data-service HTTP API to also run on those beefy instances is a useful exercise.
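For reference, the topology described here (remote golden stores, local indices) maps onto node configuration roughly as follows. This is a minimal sketch assuming the crux-kafka, crux-s3 and crux-rocksdb modules circa 21.x; the broker address, bucket name and paths are invented, so check the configuration docs for your version:

```clojure
;; Kafka holds the tx-log, S3 holds the (remote) documents, and each
;; node keeps its own RocksDB query indices on local disk.
(require '[clojure.java.io :as io]
         '[crux.api :as crux])

(def node
  (crux/start-node
   {:crux/tx-log {:crux/module 'crux.kafka/->tx-log
                  :kafka-config {:bootstrap-servers "kafka:9092"}}
    :crux/document-store {:crux/module 'crux.s3/->document-store
                          :bucket "my-crux-docs"}
    ;; the query indices stay local to the node
    :crux/index-store {:kv-store {:crux/module 'crux.rocksdb/->kv-store
                                  :db-dir (io/file "/var/lib/crux/indexes")}}}))
```

The later snippets in this thread reuse the `crux` alias and `node` from this sketch.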

Hukka17:05:00

Oh, that's surprising. I assumed that if the document store can be S3 or such, then it's not local to the nodes

Hukka17:05:35

I don't have very specific requirements yet, I'm afraid. We are just starting work at a new startup where there's likely to be a pretty significant bitemporal component ("OK, but what would last year's numbers A have looked like, if you'd had the most up-to-date dataset B available?")

👍 4
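That question maps directly onto Crux's bitemporal API: `crux.api/db` takes an optional valid time (and transaction time), so "last year's numbers with today's dataset" is a query at a past valid time against the latest transactions. A hypothetical sketch, reusing `node` from above, with the `:numbers/value` attribute invented for illustration:

```clojure
;; Valid-time set one year back; transaction-time left at the latest
;; (the default), i.e. "what we know today about how things were then".
(let [db (crux/db node #inst "2020-05-18")]
  (crux/q db
          '{:find [e v]
            :where [[e :numbers/value v]]}))
```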
Hukka17:05:33

So preferably the DB should be ready for that, but I'm trying to see what the tradeoffs for each option are

👌 4
nivekuil01:05:52

I was actually thinking of bypassing the document cache entirely... without the entity path, isn't the doc store only being hit when indexing txs (and for entity-history with :with-docs?)? You wouldn't have a local copy of the docs at all if you were restoring from a checkpoint, since those txs never get indexed, right?
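For anyone following along: restoring from a checkpoint means a fresh node downloads a recent snapshot of the indices instead of replaying (and re-fetching documents for) the whole tx-log. A rough sketch of a checkpointed RocksDB index-store, using module names from the crux.checkpoint namespace as of 21.x; the paths and frequency are made up:

```clojure
;; Each node restores its local indices from the most recent snapshot,
;; then indexes only the transactions that arrived after it was taken.
{:crux/index-store
 {:kv-store
  {:crux/module 'crux.rocksdb/->kv-store
   :db-dir (io/file "/var/lib/crux/indexes")
   :checkpointer
   {:crux/module 'crux.checkpoint/->checkpointer
    :store {:crux/module 'crux.checkpoint/->filesystem-checkpoint-store
            :path "/mnt/shared/crux-checkpoints"}
    :approx-frequency (java.time.Duration/ofHours 6)}}}}
```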

refset07:05:58

> isn't the doc store only being hit when indexing txs (and entity-history :with-docs)?

Also when doing pull, which was the main motivation behind that issue I linked above.

> if the document store can be S3 or such, then it's not local to the nodes

Whilst there are no firm limits on the upper size of documents that Crux can handle, you probably don't want to be putting huge blob values into Crux in any case. Is that partly what you were hoping to use Crux to store?
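To make the document-store-hitting operations concrete, here is a hedged sketch of the two calls mentioned, reusing `node` from earlier; the entity and attributes are invented for illustration:

```clojure
;; Both of these reach back to the document store (modulo caching):
;; pull fetches whole documents to satisfy its projection...
(crux/pull (crux/db node)
           [:crux.db/id :customer/name :customer/orders]
           :customer-1)

;; ...and entity-history with :with-docs? true returns each version's
;; full document alongside its bitemporal coordinates.
(crux/entity-history (crux/db node)
                     :customer-1
                     :desc
                     {:with-docs? true})
```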

refset07:05:59

Technically you can even have an index-store that is not local to the node (e.g. something like https://github.com/crux-labs/crux-redis, Dynamo, or Rocks+NFS), but there are non-trivial tradeoffs in query performance involved
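Of those options, "Rocks+NFS" is simply the standard RocksDB index-store with its `:db-dir` pointed at a network mount; a sketch (the mount path is invented):

```clojure
;; Same shape as a local index-store, but the directory lives on a
;; network filesystem, so index reads pay network latency on a cache miss.
{:crux/index-store
 {:kv-store {:crux/module 'crux.rocksdb/->kv-store
             :db-dir (io/file "/mnt/nfs/crux/indexes")}}}
```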

Hukka07:05:09

Not huge blobs, no. Just many, though I don't know what counts as many. I guess we should be ingesting some hundreds of millions of documents per year, or the whole thing is a failure without enough customers
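For context on what ingest looks like at that scale: writes in Crux are batched :crux.tx/put operations submitted to the tx-log. A minimal sketch with an invented document shape:

```clojure
;; submit-tx is async: it returns once the tx is durably on the tx-log,
;; and each node indexes it independently afterwards.
(crux/submit-tx node
                [[:crux.tx/put {:crux.db/id :measurement-1
                                :measurement/value 42.0
                                :measurement/source :sensor-a}]])
```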

Hukka07:05:14

But enough that I wouldn't want to do dynamic scaling that would require copying everything to every node. Of course it's not a must to do dynamic scaling, if the tradeoffs lean elsewhere

refset08:05:27

As per https://opencrux.com/blog/dev-diary-jan-21.html#_future, we have some plans in the pipeline that will make dynamic scaling much smoother. Is scaling elasticity (i.e. also scaling down) particularly important?

refset08:05:24

I'd be happy to tell you more on a call if you're keen; otherwise we should have a blog post out next month with more details

Hukka08:05:13

Thanks for the offer, but I'm hesitant to take that up (yet) while so much is still unclear. A blog post sounds good; we're just getting started, so we're not yet at the SaaS phase, just poking at data sources locally. Later on we'll hopefully know more about concrete needs, like how bursty the load will be, whether bitemporality matters, and so on

👌 4
Hukka08:05:25

I had the #5 diary in a tab already, but hadn't prioritized reading it based on the title 😉 So much to read, still

🙂 4