#xtdb
2021-11-26
Christian Pekeler 12:11:02

What’s the recommended storage setup that’s fast, scalable, reliable, simple, and cost-effective? I see what’s possible on https://docs.xtdb.com/administration/configuring/#_storage but would like some guidance on what to pick to avoid being in Jacob’s situation too soon. We’re currently using RocksDB for everything in development and are gearing up for our first production deployment.

refset 13:11:47

This is a great question. Firstly, which cloud are you deploying to? I'd recommend using whatever cloud-native object storage is on offer for the doc-store, but be mindful of bandwidth costs (I wouldn't rule out needing to stick a shared cache in depending on access patterns and costs). For the tx-log, again using something cloud-native will probably work out $cheapest, but there are a fair few technical trade-offs to think about (tail-latency, write contention & latency, HA/durability etc.) so I would err on the side of first trying whatever managed JDBC store is on offer and doing some measurements yourself. If you're comfortable with Kafka though (e.g. you already have a cluster running) then I think it's a fairly optimal choice for the tx-log. Confluent's managed service has yet to disappoint but it's arguably not so cheap. As another data point, the team at Avisi successfully implemented a tx-log and doc-store on top of Google Cloud Datastore: https://github.com/avisi-apps/crux-datastore
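For reference, a minimal sketch of the Kafka-backed tx-log suggested here, in XTDB's EDN node-configuration format (module and key names per the configuration docs linked above; the broker address and topic name are placeholders, and the doc-store/index-store would be configured alongside):

```clojure
;; Kafka as tx-log only; pair with a separate doc-store and index-store.
{:xtdb/tx-log {:xtdb/module 'xtdb.kafka/->tx-log
               :kafka-config {:bootstrap-servers "broker1:9092"}
               :tx-topic-opts {:topic-name "xtdb-tx-log"}}}
```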

tatut 13:11:48

I'm using pg for tx log and docs with the assumption that "storage is cheap"
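A sketch of that Postgres-for-everything setup using the xtdb-jdbc module, with a shared connection pool serving both the tx-log and the doc-store (db-spec values are placeholders; check the configuration docs for the exact keys in your version):

```clojure
{:xtdb.jdbc/connection-pool {:dialect {:xtdb/module 'xtdb.jdbc.psql/->dialect}
                             :db-spec {:host "localhost"
                                       :dbname "xtdb"
                                       :user "xtdb"
                                       :password "changeme"}}
 ;; tx-log and doc-store both stored in Postgres via the pool above
 :xtdb/tx-log {:xtdb/module 'xtdb.jdbc/->tx-log
               :connection-pool :xtdb.jdbc/connection-pool}
 :xtdb/document-store {:xtdb/module 'xtdb.jdbc/->document-store
                       :connection-pool :xtdb.jdbc/connection-pool}}
```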

👍 1
allandaviesza 13:11:26

Do you know if Avisi is using the Datastore implementation in production?

refset 14:11:30

> Do you know if Avisi is using the Datastore implementation in production?

I have no reason to suspect that's not still the case, but I don't have my ear close enough to the ground on what they're up to currently 🙂 Did you see their recent blog post on XT? There's a video walkthrough too, but you may need to auto-translate the subtitles like I did https://twitter.com/Avisi_IT/status/1451502118293151746

Christian Pekeler 14:11:33

I’m currently on DigitalOcean (but still at a stage where I could easily switch). They have an S3-compatible object storage called Spaces which we should be able to use as a document store. They also have managed DBs with MySQL and PostgreSQL. Which one do you think would be better for the transaction log? And what would you propose for the index store?

Christian Pekeler 14:11:31

I know nothing about Kafka. Seems weird to use an event streaming platform for storage. What would it get me in the context of XTDB?

Christian Pekeler 14:11:19

I also find it unintuitive to use an S3-like storage for my documents since S3 isn’t designed for speed.

refset 14:11:52

> Which one do you think would be better for the transaction log? And what would you propose for the index store?

Using either managed MySQL or PostgreSQL for both tx-log and doc-store would likely be worth attempting and measuring first of all. For the index-store I recommend RocksDB by default, but if you really want to optimise for query speed then LMDB will almost always be faster (but it lacks compression, is more obscure, etc.).

> Kafka [...] What would it get me in the context of XTDB?

Not a lot by itself, beyond its standard properties of low-latency, durable/HA commits 🙂 if you're already using Kafka at your org though then it could mean someone else has already figured out all the ops & backups etc.

> Seems weird to use an event streaming platform for storage.

Infinite retention has been a fairly explicit part of the Kafka roadmap for a few years now and I think it will continue to gain momentum. For instance, Confluent have added tiered storage to help make it more viable.

> S3 isn’t designed for speed.

That's true in certain dimensions, but it can certainly be fast enough depending on your application's requirements. By contrast, Kafka is also viable as a very fast "document log", but it needs pairing with Rocks (which materializes a "local-document-store"). Looking at the other things AWS has on offer, Dynamo could also work well as a doc-store (though we've not created a module for that yet).
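The RocksDB-vs-LMDB choice above is a one-line swap in the node config: the index-store just wraps a kv-store module. A sketch, assuming the xtdb-rocksdb and xtdb-lmdb modules and a placeholder data directory:

```clojure
;; RocksDB index-store (the default recommendation):
{:xtdb/index-store {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
                               :db-dir (clojure.java.io/file "data/index")}}}

;; LMDB alternative: same shape, different kv-store module.
{:xtdb/index-store {:kv-store {:xtdb/module 'xtdb.lmdb/->kv-store
                               :db-dir (clojure.java.io/file "data/index")}}}
```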

Christian Pekeler 15:11:34

I was hoping there was a straightforward recommendation like “unless you have one of these special needs, we recommend you use xyz, which is proven to be easy to set up and maintain, plenty fast, and currently used by several XTDB users”. I wouldn’t even know how to begin measuring and comparing all these available choices.

Christian Pekeler 15:11:48

Maybe asked differently: Is there a rule of thumb for how long it makes sense to use RocksDB as the sole storage? For example, up to x MB of data, or n app servers, or some other dimension?

refset 15:11:15

Using RocksDB for the tx-log and/or doc-store means you can't have strong durability / high-availability guarantees, because it ties those storage components to a single node (i.e. n app servers where n = 1, assuming you're embedding XT inside your app and not using HTTP)
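For contrast, the single-node setup being described is the one where every component sits on one machine's local RocksDB. A sketch of starting such a node, assuming the xtdb-rocksdb module and placeholder data directories:

```clojure
(require '[clojure.java.io :as io]
         '[xtdb.api :as xt])

;; Helper: a RocksDB-backed kv-store rooted at the given directory.
(defn rocks-kv [dir]
  {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
              :db-dir (io/file dir)}})

;; tx-log, doc-store and index-store all on local disk: n = 1 by construction.
(def node
  (xt/start-node {:xtdb/tx-log (rocks-kv "data/tx-log")
                  :xtdb/document-store (rocks-kv "data/doc-store")
                  :xtdb/index-store (rocks-kv "data/index")}))
```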

refset 15:11:24

> “unless you have one of these special needs, we recommend you use xyz which is proven to be easy to set up and maintain, plenty fast, and currently used by several XTDB users”

From a JUXT perspective we've spent the most time working with XT+Kafka, so that's what I can recommend with most confidence, but most JDBC-Postgres (for tx-log + doc-store) users I've heard from seem happy enough... with the exception of Jacob and his wanting to migrate the doc-store 🙂

Christian Pekeler 16:11:05

> we’ve spent the most time working with XT+Kafka

Kafka+RocksDB?

refset 16:11:13

yep, sorry, I should have added that

Jacob O'Bryant 16:11:17

I still count myself as pretty happy with postgres :). fwiw I think postgres for TX log + doc store makes a pretty good default recommendation

🙏 1
Christian Pekeler 17:11:33

> we’ve spent the most time working with XT+Kafka

Was this mostly by choice or because the client already happened to use Kafka?

refset 17:11:13

> the client already happened to use Kafka

99% this

Christian Pekeler 17:11:49

@U7YNGKDHA and @U11SJ6Q0K if you had equal familiarity with all storage choices and started a new greenfield project, do you think you would still pick PG?

Jacob O'Bryant 18:11:09

Short answer: yes, most likely. The only reason I think it might be good to switch to Spaces for the doc store (I'm also on DigitalOcean) is because my app accumulates a lot of large documents, and I've already had to upgrade the postgres cluster once simply because the disk was almost full. I suspect this would be less of an issue for most apps. For context, it's https://thesample.ai/. I subscribe to ~1k newsletters and store the contents of all the emails they send. Actually I was originally planning to just store the email contents in Spaces with a foreign key in XT, but then started wondering if I might as well put the whole doc store on Spaces.

👀 1
Jacob O'Bryant 18:11:06

If I set up a second node with Spaces for the doc store, I'll do some testing and see how the performance is; if there isn't a noticeable difference, I'd probably start using that as my default setup. So I suppose the long answer is "undecided"
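A rough sketch of what that second node's doc-store might look like via the xtdb-s3 module. Note the bucket and prefix below are placeholders, and pointing the AWS SDK at an S3-compatible endpoint like Spaces generally requires supplying a custom client/configurator as well, so treat this as an approximation and check the module docs:

```clojure
;; S3-compatible object storage as the doc-store; tx-log/index-store unchanged.
{:xtdb/document-store {:xtdb/module 'xtdb.s3/->document-store
                       :bucket "my-xtdb-docs"
                       :prefix "doc-store/"}}
```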

👀 1
xlfe 04:11:02

+1 for the Avisi datastore option - if you're on GCP, the datastore option is pretty cost effective unless you have lots and lots of data!

tatut 05:11:28

Pg is the storage option I know best and it's easy to operate and back up, so probably yes

Christian Pekeler 09:11:32

Thank you all for your input! I’ll take PG for a spin. (GCP is not an option for me because support is not in Google’s DNA - too risky for me.)

🙏 1
👌 1