xtdb 2023-09-18 | Slack Archive

Martynas Maciulevičius 09:09:28

Can XTDB be backed by DynamoDB or Cassandra? (I saw that somebody has implemented a Redis storage engine, but I think it achieves a slightly different thing -- it's fast storage, but it's mirrored and not heavily sharded.) I'm currently watching a presentation about Datomic where RH says that "SQL DBs implement these <storage> constraints and so does DynamoDB, so we can run Datomic backed by DynamoDB". What happens with XTDB if my database is larger than my node's memory and its storage? Because IMO this is what DynamoDB is all about. Also, do you handle this differently between XTDB1 and XTDB2?

refset 22:09:12

> Can XTDB be backed by DynamoDB or Cassandra?
not officially, but in principle 'yes' ...however trying to use them for the index-store could make queries unacceptably slow because out-of-process latencies stack up quickly (YMMV)

> What happens with XTDB if I have larger database than my node's memory and its storage?
RocksDB handles the first case (memory) well, and the second case (local storage) is largely why we decided to take a very different architectural route with the 2.x branch (though things like https://github.com/rockset/rocksdb-cloud are very interesting to consider in the context of pushing the 1.x design to its ultimate limits)

> Also do you handle this differently between XTDB1 and XTDB2?
yes very much - have you seen the talk Jon gave at the Conj this year?
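
To put rough numbers on "out-of-process latencies stack up quickly", here is a minimal back-of-the-envelope sketch. The per-lookup latencies and the lookup count are illustrative assumptions, not XTDB measurements, and `query_latency` is a hypothetical helper.

```python
# Illustrative assumptions only; none of these numbers come from XTDB benchmarks.
LOCAL_LOOKUP_S = 10e-6   # assumed in-process (RocksDB-style) index lookup: 10 microseconds
REMOTE_LOOKUP_S = 1e-3   # assumed network round trip to a remote KV store: 1 millisecond


def query_latency(dependent_lookups: int, per_lookup_s: float) -> float:
    """Total time when every lookup must finish before the next one can start."""
    return dependent_lookups * per_lookup_s


# Hypothetical query that walks 5,000 index entries one after another.
lookups = 5_000
local = query_latency(lookups, LOCAL_LOOKUP_S)
remote = query_latency(lookups, REMOTE_LOOKUP_S)

print(f"local index store:  {local * 1000:.0f} ms")
print(f"remote index store: {remote * 1000:.0f} ms")
```

Under these assumed numbers the same access pattern is about 100 times slower once each index lookup crosses a process boundary. Batching or parallel lookups would reduce the gap, but dependent lookups cannot be overlapped.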

Martynas Maciulevičius 07:09:43

> have you seen the talk Jon gave at the Conj this year?
I've not watched it. I'll do it.

Edit: Ok. I watched it, and the columnar bit seems to be the interesting part. But I remember that you posted in the forum that the in-node way of hosting the database will not be the primary way, and also that value semantics will be SQL-like instead of JVM-like (e.g. dates). And this value semantics thing would mean that it's a different database. But it's good that the Datalog bit will be there and tables will be schemaless. I guess I'll go look in the repository itself to see what's happening there.

Martynas Maciulevičius 08:09:25

Do you think that it would be possible to use XTDB2 without temporal functionality right now? How would one know which parts are already reliable-ish? As I understand it, once I've imported the data into the database, the Apache Arrow bit should be stable, as you didn't write that storage part on your own. Also, if I'd want to reexport the data into XTDB1, I'd use a bitemporal query that would span throughout all time :thinking_face:

refset 08:09:28

> Do you think that it would be possible to use XTDB2 without temporal functionality right now?
It's quite pervasive, so you couldn't e.g. easily disable it to get better performance

> How would one know which parts are already reliable-ish?
talk to us 🙂 looking at tests is always a good idea

> As I understand once I'd import the data into the database the Apache Arrow bit should be stable as you didn't write that storage part on your own.
this is not quite the case currently, because while Arrow is solid we are still working on the exact layout of the Arrow files (essentially a primary covering index) to improve performance, so we can't support people trying to move between the changing formats currently - you would need to export and re-import each time

refset 08:09:23

> Also if I'd want to reexport the data into XTDB1 I'd use a bitemporal query that would span throughout all time
that sounds about right 👍

Ben Sless 13:09:27

You could try Scilla :thinking_face:

Martynas Maciulevičius 15:09:48

You mean this? https://www.scylladb.com/

Martynas Maciulevičius 15:09:17

Her hair is really beautiful but she kind-of lost me when she said "you require less infrastructure -- which mean you get direct savings": https://youtube.com/watch?v=JPkrdWMVpPk

Ben Sless 15:09:57

Yeah, my mistake, Scylla. It's supposed to be a high performance stand in for Cassandra

Martynas Maciulevičius 22:09:25

> high performance stand in for Cassandra
I don't yet know how to use Cassandra/DynamoDB/Scylla effectively. In my mind I still want joins, so I somehow have to mitigate this. I don't think productively in this consistency model yet :thinking_face: I know that I'd need to copy data and do my joins in advance. But I don't think I want to do it, as this is overoptimization without real benefit :thinking_face:
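
For reference, "copy data and do my joins in advance" can be sketched as below. This is a minimal illustration using in-memory dicts as a stand-in for a Cassandra/DynamoDB/Scylla table; `users`, `orders_by_user`, `record_order` and `orders_for` are hypothetical names, not any driver's API.

```python
from collections import defaultdict

# Source-of-truth data that a relational query would join against.
users = {"u1": {"name": "Ada"}, "u2": {"name": "Grace"}}

# Read-optimised "table": keyed by user id, each row already carries the
# user name, so reads never need a join.
orders_by_user = defaultdict(list)


def record_order(order_id: str, user_id: str, total: float) -> None:
    """Do the 'join' once, at write time, by copying the user name into the row."""
    user = users[user_id]
    orders_by_user[user_id].append(
        {"order_id": order_id, "user_name": user["name"], "total": total}
    )


def orders_for(user_id: str) -> list:
    """Single-key read: no join, no scan."""
    return orders_by_user.get(user_id, [])


record_order("o1", "u1", 9.99)
record_order("o2", "u1", 25.00)
record_order("o3", "u2", 5.00)
print(orders_for("u1"))
```

The cost is the one raised above: if a user's name changes, every copied row has to be rewritten, which is the kind of bookkeeping a database with joins does for you.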

Martynas Maciulevičius 21:09:16

> It's supposed to be a high performance stand in for Cassandra

There's this video on how one would build a DB: https://youtube.com/watch?v=fU9hR3kiOK0
1. Your write part is just the TX
2. You use Apache Samza to broadcast into sharded DB tables, and then you can use the distributed KV store as a read-only store.
3. XTDB bundles all of this together, and I'm not sure if XTDB1 is suitable for this as-is (I write this even though I worked on event processing+XTDB in my previous job). Maybe then you'd use multiple XTDB1s with different database schemas that would just be waiting for inputs from the stream processor, and they would only be a delivery/temp layer.

And then I think this video is part of an answer on how to properly do Scylla/Cassandra/etc: https://youtube.com/watch?v=NJs5cv0JF_4
1. Transducer fns as the main building block
2. Samza will call the transducer function for your stream, and then you could commit the result into Cassandra/Scylla

The way I understand Cassandra/Scylla is that they're just the delivery layer, because you somehow have to handle the consistency across multiple nodes at the same time. And this is why something like Samza is useful (although it will still have the problem of implementing a unique constraint in your "table"). I think that XTDB2 doesn't try to address this type of sharded scalability that's based on shards+reduced consistency, but I'm not sure yet. Maybe it's still on this track. I'll have to understand more of it. Basically, I think that XTDB2 will still attempt to load the working field set into the memory of a single node and run the user's query on it, while in Cassandra/ScyllaDB you'd be precomputing the queries in advance and there is no working set at all -- there's only the key. Everything would be on disk, maybe even a cheap spinning disk.

So I think that XTDB2 will be trying to bargain with performance, but it simply won't beat the Cassandra/ScyllaDB approach of several milliseconds per query result and expensive event replay (I saw this in some Cassandra slides some time ago, and if ScyllaDB's claims are even a little true... yeah). XTDB1 only allows one type of full node, and XTDB2 tries to bundle the DB and logic together like in traditional DBs (I remember that refset mentioned in the forum that they'd like to steer users towards an external node where possible, and maybe even abandon/discourage the tx-log and running in the in-node mode). So IMO XTDB2 will not beat Datomic in read scalability here (yes, Cassandra/Scylla are really rough to handle for consistency, but this is how they beat Datomic for reads and volume), as it'll be based on a similar read model to Datomic's (but choice is always a good idea, and this one gives us bitemporality).
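
The "TX log, then a stream processor calling a pure function, then a read-only KV store" pipeline described above can be sketched in a few lines. This is a plain-Python stand-in, assuming a simple `(kind, key, value)` event shape; it does not use Samza or any Cassandra/Scylla driver, and `step` and `replay` are hypothetical names.

```python
from functools import reduce


def step(view: dict, event: tuple) -> dict:
    """Pure function: fold one (kind, key, value) event into the key -> value view."""
    kind, key, value = event
    new_view = dict(view)  # copy so the previous view is left untouched
    if kind == "put":
        new_view[key] = value
    elif kind == "delete":
        new_view.pop(key, None)
    return new_view


def replay(log: list) -> dict:
    """Rebuild the read-only view from scratch by replaying the whole log."""
    return reduce(step, log, {})


# The write side is just an append-only list of events (the "TX" part).
tx_log = [
    ("put", "user:1", {"name": "Ada"}),
    ("put", "user:2", {"name": "Grace"}),
    ("delete", "user:1", None),
]

view = replay(tx_log)
print(view)
```

In a real deployment the stream processor would apply `step` incrementally and commit each result to the KV store; a full `replay` is the expensive path mentioned above.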

nivekuil 22:09:12

I use scylla for my doc store and as the main database for data I don't care about joins/provenance in. xtdb1/2 are not suitable for write-heavy use cases but should scale very well for reads. and the consistency is fine, especially now that scylla uses raft

nivekuil 22:09:33

the dynamo-type databases are very simple, don't overthink it. it's just a key->key->value store
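
A minimal sketch of the key->key->value shape, assuming the first key picks a partition and the second key is kept sorted within it. `WideColumnStore` and its methods are hypothetical names for illustration, not the API of DynamoDB, Cassandra or Scylla.

```python
import bisect


class WideColumnStore:
    """Toy key -> key -> value store: partition key, then sorted sort key."""

    def __init__(self):
        # partition key -> (sorted list of sort keys, dict of sort key -> value)
        self._partitions = {}

    def put(self, pk, sk, value):
        keys, rows = self._partitions.setdefault(pk, ([], {}))
        if sk not in rows:
            bisect.insort(keys, sk)  # keep sort keys ordered for range reads
        rows[sk] = value

    def get(self, pk, sk):
        _, rows = self._partitions.get(pk, ([], {}))
        return rows.get(sk)

    def range(self, pk, start, end):
        """Rows in one partition with start <= sort key < end, in key order."""
        keys, rows = self._partitions.get(pk, ([], {}))
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_left(keys, end)
        return [(k, rows[k]) for k in keys[lo:hi]]


store = WideColumnStore()
store.put("sensor-1", "2023-09-18T10:00", 20.5)
store.put("sensor-1", "2023-09-18T12:00", 22.0)
store.put("sensor-1", "2023-09-18T11:00", 21.0)
store.put("sensor-2", "2023-09-18T10:00", 18.0)
print(store.range("sensor-1", "2023-09-18T10:30", "2023-09-18T12:30"))
```

Every read names a partition key up front, which is what lets such a store shard by partition; there is no operation that joins or scans across partitions.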
