2024-01-16 datahike | Clojure Slack Archive

datahike

maxweber 2024-01-16T13:12:28.446199Z

Hi, yesterday I read this one: https://world.hey.com/dhh/multi-tenancy-is-what-s-hard-about-scaling-web-services-dd1e0e81 > Computers have gotten so ridiculously fast that there is scarcely any organization in the world that can overwhelm a web-based information system running on a single server. All the complexity and sophistication required to run web services today stem from multi-tenancy. From having a single system serve millions of users at the same time. But what if we stopped doing that? > > That’s essentially the premise of a resurgent interest in sqlite outside of its historic embedded use cases. We have a multi-tenancy SaaS system that uses a single Datomic database. Many queries are only fast in memory, since the data of a customer is spread across many segments. With a cold object cache way too many segments needs to be fetched from storage. However, I think that Datahike should be a really good candidate if you like to have one database per customer like described in the article?

maxweber 2024-01-17T10:17:11.842199Z

Hi @whilo thanks a lot again for all the input. Do you have a blog post about your "super-dbs" approach? Sounds great that you can atomically transact to multiple underlying databases.

whilo 2024-01-17T22:31:11.808829Z

No, not yet. I would have to implement it, but it is not very difficult. Transacting is done purely in memory, so you first do that on all databases and then write all their DB records into one key-value blob in the underlying store.

timo 2024-01-16T13:16:15.382959Z

I think so. Another thing could be that each user owns an instance of Datahike to keep power of his or her own data.

maxweber 2024-01-16T13:21:21.596329Z

Cool 🙂 Is there anything that needs memory per database connection? If I remember correctly Datomic forces each peer to reserve at least the memory-index-max.

timo 2024-01-16T13:40:04.826799Z

I don't know exactly. Maybe @whilo can answer...

Eugen 2024-01-16T16:22:45.597609Z

in one db / tennant would make things easy if the database management system has features for this. In POstgreSQL each DB will require a set of connections that can't be shared across tennants, each connection takes a certain amount of memory and CPU. so scaling that has limits . there are of course more things to consider. how does datahike fare in that regard? what is the use when having 1-10 open connections per DB with 100 - 1000 DB's (connection pooling). Can you avoid having that ?

maxweber 2024-01-16T17:54:48.217679Z

Maybe you can get around of having any database connections at all, by using the sqlite library as storage backend for datahike?

Eugen 2024-01-16T17:57:43.577939Z

for small db's sqlite should work fine. what I heard is that sqlite is not well for concurrency. so if you have a single connection to db. it will be ok. but if you have more users trying to get data, might not work that well

maxweber 2024-01-16T18:06:45.242099Z

The hypothese of David is that there should be no single organization that can create more load and data to reach the limits of a sqlite. Of course this only works if each customer gets his own sqlite db / file. This article goes into more detail regarding the sqlite limits: https://blog.wesleyac.com/posts/consider-sqlite

👍 1

whilo 2024-01-16T20:52:36.467279Z

Hey @maxweber! Connections in Datahike are shared between all processes on one machine, i.e. you need to allocate some memory for each database you connect to (which is the configuration, max entity id, schema and the root of each index). The configuration and schema could be broken out if that was a problem.

whilo 2024-01-16T20:53:20.996399Z

So you need a constant amount of memory for each database you are connected to, no matter how many processes are connected to it.

whilo 2024-01-16T20:54:44.212709Z

You should be able to also have one database per customer in Datomic, but maybe I am missing something.

whilo 2024-01-16T20:58:27.568529Z

I don't think there is a general pattern of how to index data, it really depends on the queries and how data is local in the indexes with respect to them. I would probably create multiple databases for different "views" on the same data if I had a complicated query setup, e.g. where some queries are customer specific and others aggregate over large groups of customers.

whilo 2024-01-16T20:59:39.743589Z

This is of course something you wish your database to do automatically for you, but I think this is very hard. It requires understanding how queries will be run exactly ahead of time and you would require a lot of "JIT-compiler style" optimizations in the query engine to adjust indexing.

whilo 2024-01-16T21:00:40.595049Z

In Datahike you could create "super-dbs" that transact into multiple underlying dbs and still have atomic transitions over all of them.

whilo 2024-01-16T21:02:31.476879Z

I have thought about predictive cache policies (similar to how modern CPUs have a learning mechanism to do branch prediction), but that would be a lot of black magic that might fail in unintuitive ways. Datomic opted for transparency to the developer, which is probably a good default.

maxweber 2024-01-16T21:18:39.999859Z

Hi @whilo thanks a lot for your answers. The Datomic team was happy that we only have one database for all customers. Each Datomic database connection requires to reserve the memory-index-max in the peer's memory (https://docs.datomic.com/pro/operation/capacity.html). The default memory-index-max is 512m. So while it is possible to run multiple database in a Datomic peer, it doesn't really scale. That's why I'm happy to hear that datahike only needs a constant amount of memory.

whilo 2024-01-16T21:19:20.378179Z

That is very interesting, I did not know that.

whilo 2024-01-16T21:23:50.676499Z

I tried to decouple readers as much as possible (in Clojure's philosophy), so there is actually no need to even interact with the transactor if you don't write to the database, but just connect as a reader.

👍 1

whilo 2024-01-16T21:24:31.005099Z

There are trade offs, I think Datomic was designed for very specific ones, ideally Datahike should be hackable to fit it to different use cases. This is my main complaint about Datomic.

whilo 2024-01-16T21:25:08.416659Z

If you like to you could provide a synthetic test case and I can help to get it working for you.

maxweber 2024-01-16T21:27:48.985429Z

That would be awesome :-) Thanks a lot

maxweber 2024-01-16T21:50:38.343849Z

In general, we are struggling to find a solution for the partitioning problem. All queries are fast enough. But only when the segments are in memory. So the bottleneck is the fetching and parsing of the segments. Each segment only contains a little bit of data for the customer. Especially, the eavt index suffers from this problem when you try to pull the entities that you found in your datalog query. At the moment our only solution is to use one datomic peer-server that has a lot of memory. Therefore we are forced to use the Datomic client API. One way would be to derive a read-only datahike database per customer from our large Datomic database. Datomic's implicit partitions (https://blog.datomic.com/2023/04/implicit-partitions.html) would be another solution, but it requires to change the entity IDs. Anyway, I just wanted to highlight some problems we ran into with our multi-tenancy database 😅 I never thought about avoiding the multi-tenancy problem all together like David described it in his blog post.

Clojurians Log v2

datahike