Hi, yesterday I read this one: https://world.hey.com/dhh/multi-tenancy-is-what-s-hard-about-scaling-web-services-dd1e0e81 > Computers have gotten so ridiculously fast that there is scarcely any organization in the world that can overwhelm a web-based information system running on a single server. All the complexity and sophistication required to run web services today stem from multi-tenancy. From having a single system serve millions of users at the same time. But what if we stopped doing that? > > Thatβs essentially the premise of a resurgent interest in sqlite outside of its historic embedded use cases. We have a multi-tenancy SaaS system that uses a single Datomic database. Many queries are only fast in memory, since the data of a customer is spread across many segments. With a cold object cache way too many segments needs to be fetched from storage. However, I think that Datahike should be a really good candidate if you like to have one database per customer like described in the article?
Hi @whilo thanks a lot again for all the input. Do you have a blog post about your "super-dbs" approach? Sounds great that you can atomically transact to multiple underlying databases.
No, not yet. I would have to implement it, but it is not very difficult. Transacting is done purely in memory, so you first do that on all databases and then write all their DB records into one key-value blob in the underlying store.
I think so. Another thing could be that each user owns an instance of Datahike to keep power of his or her own data.
Cool π Is there anything that needs memory per database connection? If I remember correctly Datomic forces each peer to reserve at least the memory-index-max.
I don't know exactly. Maybe @whilo can answer...
in one db / tennant would make things easy if the database management system has features for this. In POstgreSQL each DB will require a set of connections that can't be shared across tennants, each connection takes a certain amount of memory and CPU. so scaling that has limits . there are of course more things to consider. how does datahike fare in that regard? what is the use when having 1-10 open connections per DB with 100 - 1000 DB's (connection pooling). Can you avoid having that ?
Maybe you can get around of having any database connections at all, by using the sqlite library as storage backend for datahike?
for small db's sqlite should work fine. what I heard is that sqlite is not well for concurrency. so if you have a single connection to db. it will be ok. but if you have more users trying to get data, might not work that well
The hypothese of David is that there should be no single organization that can create more load and data to reach the limits of a sqlite. Of course this only works if each customer gets his own sqlite db / file. This article goes into more detail regarding the sqlite limits: https://blog.wesleyac.com/posts/consider-sqlite
Hey @maxweber! Connections in Datahike are shared between all processes on one machine, i.e. you need to allocate some memory for each database you connect to (which is the configuration, max entity id, schema and the root of each index). The configuration and schema could be broken out if that was a problem.
So you need a constant amount of memory for each database you are connected to, no matter how many processes are connected to it.
You should be able to also have one database per customer in Datomic, but maybe I am missing something.
I don't think there is a general pattern of how to index data, it really depends on the queries and how data is local in the indexes with respect to them. I would probably create multiple databases for different "views" on the same data if I had a complicated query setup, e.g. where some queries are customer specific and others aggregate over large groups of customers.
This is of course something you wish your database to do automatically for you, but I think this is very hard. It requires understanding how queries will be run exactly ahead of time and you would require a lot of "JIT-compiler style" optimizations in the query engine to adjust indexing.
In Datahike you could create "super-dbs" that transact into multiple underlying dbs and still have atomic transitions over all of them.
I have thought about predictive cache policies (similar to how modern CPUs have a learning mechanism to do branch prediction), but that would be a lot of black magic that might fail in unintuitive ways. Datomic opted for transparency to the developer, which is probably a good default.
Hi @whilo thanks a lot for your answers. The Datomic team was happy that we only have one database for all customers. Each Datomic database connection requires to reserve the memory-index-max in the peer's memory (https://docs.datomic.com/pro/operation/capacity.html). The default memory-index-max is 512m. So while it is possible to run multiple database in a Datomic peer, it doesn't really scale. That's why I'm happy to hear that datahike only needs a constant amount of memory.
That is very interesting, I did not know that.
I tried to decouple readers as much as possible (in Clojure's philosophy), so there is actually no need to even interact with the transactor if you don't write to the database, but just connect as a reader.
There are trade offs, I think Datomic was designed for very specific ones, ideally Datahike should be hackable to fit it to different use cases. This is my main complaint about Datomic.
If you like to you could provide a synthetic test case and I can help to get it working for you.
That would be awesome :-) Thanks a lot
In general, we are struggling to find a solution for the partitioning problem. All queries are fast enough. But only when the segments are in memory. So the bottleneck is the fetching and parsing of the segments. Each segment only contains a little bit of data for the customer. Especially, the eavt index suffers from this problem when you try to pull the entities that you found in your datalog query. At the moment our only solution is to use one datomic peer-server that has a lot of memory. Therefore we are forced to use the Datomic client API. One way would be to derive a read-only datahike database per customer from our large Datomic database. Datomic's implicit partitions (https://blog.datomic.com/2023/04/implicit-partitions.html) would be another solution, but it requires to change the entity IDs. Anyway, I just wanted to highlight some problems we ran into with our multi-tenancy database π I never thought about avoiding the multi-tenancy problem all together like David described it in his blog post.