datahike

2024-11-17T02:37:30.674209Z

Bro, I am sorry, but DataScript works well. Just with this backend

2024-11-19T09:44:45.218309Z

So, I do not think I wrote an optimal backend, but it does not get stuck for 10 seconds, that's a fact. Unfortunately, I do not know konserve in enough detail to find the cause in your code.

whilo 2024-11-19T09:56:12.448379Z

konserve-dynamodb now has 10-20ms write latency, and similar for reads (on an EC2 instance). But this is for single write operations. For some reason, when I write many in parallel it still adds up to around 100ms, while my fix did halve the average transact time for S3.

whilo 2024-11-19T09:59:58.148569Z

Your multiple put operation is the optimal thing to do. But it does require a very strong memory model of the underlying storage, usually some form of multi-version concurrency control, e.g. SQL or Dynamo. This limits the scaling of the store, and you effectively rely on a stronger mutable memory manager to piggyback on top of. Datahike can also run reliably on much weaker memory semantics, such as network filesystems or S3. I will try to fix the Dynamo use case though, because it has very nice latency and I understand the value proposition.

whilo 2024-11-19T10:01:51.278689Z

I have no idea why you were stuck for 10 seconds. I cannot reproduce this for datahike-dynamodb. Database creation takes some time now (up to 10 secs) because dynamo takes time to get the table online.

whilo 2024-11-19T10:02:59.688679Z

Maybe S3 would actually be fast enough now for you btw. Not sure. It has some variance on how fast it returns, but it averages at around 200ms for a transact call.

whilo 2024-11-19T10:08:09.263339Z

I am happy that DataScript works for you, I wanted to have the storage functionality shared with it because I think this software stack is actually fairly clean and it is important for Clojure devs. Datahike is in parts more complicated and maybe some of that could be avoided (e.g. async support), but there are reasons for most of the design decisions and strategy.

πŸ‘ 1
2024-11-19T10:09:21.202719Z

But DataScript does not have versioning functionality

whilo 2024-11-19T10:12:15.835329Z

Nope. It has the right memory model, but Nikita was mostly focused on the lean, lightweight in-memory use case.

whilo 2024-11-19T10:12:44.674589Z

Which is nice. Datomic can be heavy to set up.

πŸ‘ 1
whilo 2024-11-17T08:29:34.935819Z

That is good to know. I had prototyped the storage backend for the persistent-sorted-set and Nikita adjusted it. We use the same IStorage interface as DataScript, so your code somewhat translates. DataScript has nowhere near the same number of features or the same query performance, though. You use different Dynamo libraries here as far as I can see. Why did you do that? I need to check the differences. Can you also native-compile your code? What are the latencies?

whilo 2024-11-17T08:35:45.920949Z

What is misc/do-nanos*?

2024-11-17T09:20:13.731139Z

Bro, it just counts the nanoseconds taken to execute the code inside. What is wrong with it?)

whilo 2024-11-17T09:28:45.801139Z

Nothing, I found it.

whilo 2024-11-17T09:50:54.833239Z

Just to make that clear and save you potentially some trouble, DataScript's durability layer is not made for distributed access. It is only safe inside a single JVM process against a strongly consistent backend that can do atomic updates over multiple keys (e.g. not file system or S3). If you restore a DataScript db in parallel lambda invocations you can read incoherent snapshots.

whilo 2024-11-17T09:52:48.659879Z

Datahike can read from anywhere without coordination. The only thing you need to ensure in our memory model is that all transactions go through a single writer instead of being transacted in parallel (by setting the writer up and pointing to it in the config).

whilo 2024-11-17T09:55:01.549049Z

I think Datahike and DataScript are very similar in many ways and I am very grateful for DataScript. I just think Nikita has his own take on what DataScript should be that does not really align with what I need Datahike for. I would be more than happy to reduce the maintenance burden and just have the functionality in a joint community project. But the distributed memory functionality is what I really care about.

whilo 2024-11-17T09:59:05.614069Z

I suspect you have fairly optimal latency with your DataScript code. There are two differences: a) the nippy serializer and b) writing all changes to the indices in one write request. The latter can be hacked into Datahike relatively easily (basically by just porting your code over and changing it a bit), but abstracting this through konserve will require a bit more work (not a lot though).

whilo 2024-11-17T23:41:16.112389Z

These releases skip a conservative backup creation that konserve has for underlying stores that do not provide atomic updates, since S3, DynamoDB and JDBC do provide atomic updates on a per-key basis. With this change konserve-dynamodb now has the expected latencies of ~10ms per write operation. I left the update to konserve-jdbc for @alekcz360 to review first.

whilo 2024-11-17T23:45:23.077159Z

datahike-dynamodb is still not as fast as it should be though. I think this is because I schedule many write operations in parallel instead of a single batch. I will give this feature a shot at some point in the near future; if you need this right now, lmk.

whilo 2024-11-17T23:49:30.631519Z

AWS S3 now has d/transact latencies of ~200ms for me, effectively halving them from before.

whilo 2024-11-17T23:59:28.229059Z

@pat you probably want to set the same config for cloud-storage as well: https://github.com/replikativ/konserve-jdbc/pull/24/files

πŸ‘ 1