2024-10-29 datahike | Clojure Slack Archive

datahike

2024-10-29T15:06:12.885209Z

I am trying to use the :datahike.index/hitchhiker-tree index, but it is crashing with a NullPointerException. It looks like :branching-factor is nil. With :datahike.index/persistent-set everything is okay.

whilo 2024-10-29T16:07:10.133229Z

Why do you want to use the hitchhiker-tree? It is kind of deprecated by now, but it should still work. Please open an issue with a reproducible example if it is not too much of a hassle.

2024-10-29T16:09:01.687709Z

Really? I thought it was the new one. Okay)

whilo 2024-10-29T16:13:39.039899Z

It was originally the durable backend, but it had an overhead of 5-10x for pure in-memory iterators, so we decided to make the persistent sorted set durable and integrate the concepts from the hitchhiker-tree. I might add a log to datahike next, which will make it similar in design to Datomic and recover the benefits the hitchhiker-tree had in write performance. The ideas of the hitchhiker-tree are cool and maybe we will fully recover it, but having to apply write operations (the overlay it has) while reading is a bit problematic for read performance. Also, you reapply the ops on every read, something that would need to be optimized away and would be novel, but maybe not beneficial. So I would rather have a global tx-log per database like Datomic and not follow the fractal tree design.
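
For reference, a minimal sketch of selecting the persistent-set index explicitly when creating a database. It assumes the `:index` configuration key and the `datahike.api` functions behave as described in the datahike configuration docs; the file path is a hypothetical placeholder, and newer releases may expect additional store keys.

```clojure
(require '[datahike.api :as d])

;; Hypothetical configuration: the path is a placeholder chosen for illustration.
(def cfg
  {:store {:backend :file :path "/tmp/datahike-example"}
   ;; The index reported as working in this thread.
   :index :datahike.index/persistent-set
   :schema-flexibility :read})

;; Create the database only if it does not exist yet, then connect.
(when-not (d/database-exists? cfg)
  (d/create-database cfg))

(def conn (d/connect cfg))
```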

2024-10-29T16:15:43.430579Z

Ok. Another question. What jdbc databases are supported? Is it possible to run on DynamoDB?

whilo 2024-10-29T16:16:24.847779Z

I haven't tried. @alekcz360 do you know? That would be cool.

whilo 2024-10-29T16:18:19.727019Z

I provided S3 as a cheap option for clouds that have a compatible blob storage interface. But it has higher latency.

2024-10-29T16:19:03.839289Z

How much in milliseconds?)

whilo 2024-10-29T16:25:58.382609Z

Depends on where your clients are. Can be a few hundred milliseconds per write if you are not in an AWS data center.

👌 1

whilo 2024-10-29T16:26:38.446679Z

Throughput is not limited though and readers cache, so if your DB is not written often, then it might be a good option.

2024-10-29T16:27:06.013159Z

No, I need frequent and fast writes.

whilo 2024-10-29T16:27:07.958819Z

You can also run inside AWS lambda, if you ensure only one lambda is transacting.

whilo 2024-10-29T16:27:46.131459Z

Ok, make sure you use transact! if you have a lot of small transactions and don't care about the exact order in which things are transacted.

whilo 2024-10-29T16:27:52.358519Z

It is async.
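
A minimal sketch of the difference, assuming `datahike.api` exposes both `transact` (blocking) and `transact!` (asynchronous, returning a dereferenceable value) as its API docs describe. The in-memory store, its `:id`, and the `:event/n` attribute are illustrative assumptions, not part of the discussion above.

```clojure
(require '[datahike.api :as d])

;; Hypothetical in-memory database used only for illustration.
(def cfg {:store {:backend :mem :id "transact-example"}
          :schema-flexibility :read})

(when-not (d/database-exists? cfg)
  (d/create-database cfg))
(def conn (d/connect cfg))

;; Blocking: returns the transaction report once the write is done.
(d/transact conn {:tx-data [{:event/n 0}]})

;; Asynchronous: submit many small transactions without waiting on each one.
(def pending
  (mapv (fn [n] (d/transact! conn {:tx-data [{:event/n n}]}))
        (range 1 100)))

;; Deref only when the results are actually needed.
(run! deref pending)
```

Because the calls do not wait on each other, this style fits workloads where the exact ordering between small transactions does not matter.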

2024-10-29T16:28:01.752079Z

Run what inside lambda?

whilo 2024-10-29T16:28:38.039329Z

The whole datahike functionality. You don't need a server, you just need to ensure that only one process (lambda) is writing at a time.

2024-10-29T16:29:11.108019Z

Yes, I understood. No, it might not be just one lambda; they can run in parallel.

whilo 2024-10-29T16:29:21.604939Z

https://github.com/replikativ/datahike/blob/main/doc/distributed.md#aws-lambda

whilo 2024-10-29T16:29:55.628799Z

You can restrict the transacting lambda to a singleton on AWS. You then need to send all your write requests to this lambda. @viesti built the demo.

whilo 2024-10-29T16:30:31.493569Z

Readers (query) can scale out freely.

whilo 2024-10-29T16:31:38.281539Z

https://github.com/viesti/clj-lambda-datahike/blob/main/terraform/main.tf#L7

whilo 2024-10-29T16:34:46.074569Z

If you really need to scale out, it is worth thinking about memory locality and where you run your queries. The distributed.md file documents the memory model and might help. You might want to run your queries on a set of machines/lambdas with hot caches. The highest throughput can be achieved (obviously) on a single large VM if you can afford it. If you care more about bringing together different datasets and about flexibility, the distributed index space should provide unique advantages that no other database provides, to my knowledge.

whilo 2024-10-29T16:35:25.804349Z

I care for the latter to reduce the overhead in coding glue code and sending data back and forth for joining separate databases.

whilo 2024-10-29T16:35:45.735869Z

It is one of the main reasons for datahike to exist.

👍 1

whilo 2024-10-29T16:36:55.639479Z

Datahike should be fast, but speed alone is not our design objective. It is about making data handling easy and flexible and letting the database do most of the work for you. With datahike you can join db snapshots from any instance anywhere without coordination. You just need read access to the store (e.g. an S3 bucket).
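
A minimal sketch of joining snapshots from two separate databases in one query, using standard Datalog multi-source syntax (`:in $users $orders`). The two in-memory stores, their `:id` values, and all attribute names are hypothetical; in the setup described above, the stores would instead be ones you have read access to, such as S3 buckets.

```clojure
(require '[datahike.api :as d])

;; Helper (illustrative): create the database if needed and connect to it.
(defn open! [cfg]
  (when-not (d/database-exists? cfg)
    (d/create-database cfg))
  (d/connect cfg))

;; Two independent databases; in-memory stores are an assumption for brevity.
(def users-conn
  (open! {:store {:backend :mem :id "users-example"}
          :schema-flexibility :read}))
(def orders-conn
  (open! {:store {:backend :mem :id "orders-example"}
          :schema-flexibility :read}))

(d/transact users-conn  {:tx-data [{:user/id 1 :user/name "Ada"}]})
(d/transact orders-conn {:tx-data [{:order/user-id 1 :order/total 42}]})

;; Dereferencing a connection yields an immutable snapshot; the query
;; joins the two snapshots on the shared user id.
(d/q '[:find ?name ?total
       :in $users $orders
       :where
       [$users ?u :user/id ?id]
       [$users ?u :user/name ?name]
       [$orders ?o :order/user-id ?id]
       [$orders ?o :order/total ?total]]
     @users-conn @orders-conn)
```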

whilo 2024-10-29T16:37:36.492959Z

This also gives us very strong scale out properties for databases that are not completely rewritten all the time.

2024-10-29T16:37:39.252539Z

Okay, too much information) I will check it out for sure. For now I need to get my setup running, even if it is with file storage via EFS.

whilo 2024-10-29T16:37:51.574809Z

Hehe, sure.

whilo 2024-10-29T16:38:15.867059Z

Keep it simple. I am just providing these as pointers you can come back to and pick up later.

👌 1

2024-10-29T16:38:45.076359Z

Thank you very much for reaching out.

👍 1
