I am trying to use the :datahike.index/hitchhiker-tree index, but it crashes with a NullPointerException. It looks like :branching-factor is nil. With :datahike.index/persistent-set everything is okay.
Why do you want to use the hitchhiker-tree? It is kind of deprecated by now, but it should still work. Please open an issue with a reproducible example if it is not too much of a hassle.
Really? I thought it was the new one. Okay)
It was originally the durable backend, but it had an overhead of 5-10x for pure in-memory iterators, so we decided to make the persistent sorted set durable and integrate the concepts from the hitchhiker-tree. I might add a log to datahike next, which will make it similar in design to Datomic and recover the benefits the hitchhiker-tree had in write performance. The ideas of the hitchhiker-tree are cool and maybe we will fully recover it, but having to apply write operations (the overlay it has) while reading is a bit problematic for read performance. You also reapply the ops on every read, which would need to be optimized away; that would be novel, but maybe not beneficial. So I would rather have a global tx-log per database like Datomic and not follow the fractal tree design.
Ok. Another question: which JDBC databases are supported? Is it possible to run on DynamoDB?
I haven't tried. @alekcz360 do you know? That would be cool.
I provided S3 as a cheap option for clouds that have a compatible blob storage interface. But it has higher latency.
How much in milliseconds?)
Depends on where your clients are. Can be a few hundred milliseconds per write if you are not in an AWS data center.
Throughput is not limited though and readers cache, so if your DB is not written often, then it might be a good option.
No, I need frequent and fast writes.
You can also run inside AWS lambda, if you ensure only one lambda is transacting.
Ok, make sure you use transact! if you have a lot of small transactions and don't care in which order things are exactly transacted.
It is async.
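For illustration, a minimal sketch of what that could look like, assuming a datahike version where `transact!` returns a derefable result. The in-memory store, the `:id` value and the `:name` attribute are placeholders I made up, not something from your setup.

```clojure
(require '[datahike.api :as d])

;; Assumed config: in-memory store with schema-on-read, so no schema is needed.
(def cfg {:store {:backend :mem :id "transact-sketch"}
          :schema-flexibility :read})

(d/create-database cfg)
(def conn (d/connect cfg))

;; Fire off many small transactions without waiting for each one.
;; The map form {:tx-data [...]} is used for the argument.
(def pending
  (doall (for [i (range 100)]
           (d/transact! conn {:tx-data [{:name (str "item-" i)}]}))))

;; Block only once, at the end, until every transaction has been applied.
(run! deref pending)

;; Count the entities that were written.
(def item-count
  (d/q '[:find (count ?e) . :where [?e :name _]] @conn))
```

The point is only that you do not wait on each write individually; whether this helps depends on your store latency.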
Run what inside lambda?
The whole datahike functionality. You don't need a server, you just need to ensure that only one process (lambda) is writing at a time.
Yes, I understood. But in my case it won't be a single lambda; they can run in parallel.
https://github.com/replikativ/datahike/blob/main/doc/distributed.md#aws-lambda
You can constrain the transacting lambda to a singleton on AWS. You then need to send all your write requests to this lambda. @viesti built the demo.
Readers (query) can scale out freely.
https://github.com/viesti/clj-lambda-datahike/blob/main/terraform/main.tf#L7
If you really need to scale out, it is worth thinking about memory locality and where you run your queries. The distributed.md file documents the memory model and might help. You might want to run your queries on a set of machines/lambdas with hot caches. The highest throughput can (obviously) be achieved on a single large VM, if you can afford it. If you care more about bringing together different datasets and about flexibility, the distributed index space should provide unique advantages that, to my knowledge, no other database provides.
I care for the latter to reduce the overhead in coding glue code and sending data back and forth for joining separate databases.
It is one of the main reasons for datahike to exist.
Datahike should be fast, but speed alone is not our design objective. It is about making data handling easy and flexible and letting the database do most of the work for you. With datahike you can join db snapshots from any instance anywhere without coordination. You just need read access to the store (e.g. an S3 bucket).
This also gives us very strong scale out properties for databases that are not completely rewritten all the time.
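As a rough sketch of such a join, assuming two independent databases that share an email value: the store ids, attribute names and data below are invented for illustration, and in-memory stores stand in for remote ones such as S3.

```clojure
(require '[datahike.api :as d])

;; Hypothetical helper: create and connect to a throwaway in-memory database.
(defn fresh-conn [id]
  (let [cfg {:store {:backend :mem :id id}
             :schema-flexibility :read}]
    (d/create-database cfg)
    (d/connect cfg)))

(def users-conn  (fresh-conn "users-sketch"))
(def orders-conn (fresh-conn "orders-sketch"))

(d/transact users-conn  [{:user/email "a@example.com" :user/name "Alice"}])
(d/transact orders-conn [{:order/email "a@example.com" :order/total 42}])

;; Dereferencing a connection yields an immutable snapshot. Both snapshots
;; are passed as separate sources and joined on ?email in a single query.
(def joined
  (d/q '[:find ?name ?total
         :in $users $orders
         :where
         [$users ?u :user/email ?email]
         [$users ?u :user/name ?name]
         [$orders ?o :order/email ?email]
         [$orders ?o :order/total ?total]]
       @users-conn @orders-conn))
```

The reader only needs the two snapshots; nothing has to be copied from one database into the other first.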
Okay.. too much information) I will check it out for sure. For now I just need to get my setup running, even if only with file storage via EFS.
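For reference, a file-backed config pointing at an EFS mount might look roughly like this. The mount path is a hypothetical placeholder, and the `:index` line just pins the index that was reported as working at the start of this thread.

```clojure
;; Sketch of a file store config; "/mnt/efs/datahike-db" is an assumed
;; EFS mount point, replace it with wherever EFS is mounted for you.
{:store {:backend :file
         :path "/mnt/efs/datahike-db"}
 :index :datahike.index/persistent-set}
```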
Hehe, sure.
Keep it simple. I just provide this as pointers you can get back to and pick up.
Thank you very much for reaching out.