Hello everyone,
I'm using datahike on a project that is in production for some months now.
I have been using the :file backend until now.
Here are some numbers:
• There are 4605 datoms in the DB
• The folder containing the files weights 122 MB (huge!)
• The exported database (with export-db) weights 169.6 kB
• When switching to the sqlite backend, the reimported database weights 580 kB.
Question 1: How can there be such a big difference between the files and the exported information (almost a 737 ratio). Or with the files and the SQLite DB? Did I miss something?
Question 2: When switching to SQLite backend and when reloading the project, the database seems to be reset everytime. It is not in the case with the file backend.
With file I init the DB as so:
(let [...
_ (when-not (d/database-exists? db-config)
(d/create-database db-config))
conn (d/connect db-config)]
...)
Would this piece of code conflict with the SQLite DB?
Thanks a lotOk I see. Thanks a lot for the detailed answer!
And also for the reset of the DB (SQLite discussion) on system reload, I found the bug. My mistake. Thanks again.
@dam Perfect!
Hey @dam. Nice to hear from you! I am working on the storage requirements as we speak. This has not much to do with the backend, and more with the fact that we retain all the snapshots that are written while you have accumulated your database. You can achieve the same effect without exporting and importing by running gc from time to time. I am moving the gc function currently into the main api here https://github.com/replikativ/datahike/pull/716
In the current release it still lives here https://github.com/replikativ/datahike/blob/main/src/datahike/experimental/gc.cljc#L45
(the doc string there is still wrong, the remove date refers to now, so it will remove all old snapshots before the invocation started)
Note that if you have distributed readers that might still access the old snapshots (e.g. for queries that run a long time) then you should give them a grace period.
I do not fully understand you SQLite problem. How do you exactly reload?
Thanks for the answer @whilo, When you talk about snapshots, you're talking about snapshots of the indexes? Meaning that on every transaction a new snapshot of all indexes is recorded on disc? Concerning the SQLite problem, I have a reloaded workflow based on this article's approach : https://medium.com/@maciekszajna/reloaded-workflow-out-of-the-box-be6b5f38ea98 In the end it's just a reloaded workflow... When I start the system I run the following code for the DB:
(let [...
db-config {...}
_ (when-not (d/database-exists? db-config)
(d/create-database db-config))
conn (d/connect db-config)]
...)
And then conn goes into the application state.
When I stop the system I only run
(d/release conn)
Whit the :file backend, the data is, as expected, persistent.
With the :jdbc with "sqlite" backend and with the same code, the DB seems to reboot after running start, stop and start again.
I don't know if it is a behavior of the jdbc adapter or a known behavior in general, but it seemed strange.And also if the snapshots are stored in SQLite, the problem remains since SQLite doesn't shrink the DB file when data is removed... It has to be "vacuumed" manually by default. So not a solution at all. I'll check th GC feature.
Each snapshot is not a full copy, but just the delta of nodes between trees, the same as in https://hypirion.com/musings/understanding-persistent-vector-pt-1
With the hitchhiker-tree index we could be much more space efficient, because it maintained a transaction-log in each node and did not write to many nodes in each transaction. We rebuild on the persistent sorted set because it was 5-10x faster in pure read performance and currently the indices in each snapshot are optimal for reading. But I am aware that the amount of storage used in this approach can be a negative surprise and plan to readd a Datomic style db-wide transaction log next again. That should shrink the space needs at the tradeoff of having to retransact a few datoms from the log when you read a snapshot.
Yes, the storage use should be the same for SQLite. Maybe there is more or less overhead, but it should be similar.
@alekcz360 do you have an idea what could be going on in the SQL backend?
I had a similar issue with the SQL backend. I had about 200k datoms with history. The SQL DB size was about 5 GB. But the datoms on their own in an edn file were 45MB. I'm not sure if this is good or bad as I have no frame of reference. It just seems a bit counterintuitive.
Yes, counterintuitive is bad. It makes sense if you see how it works and you can call gc in regular intervals to keep the storage use bounded (it does not slow down the transactions), but it would be better to create less overhead in the first place.
The priorities were a) read performance, b) write performance and c) storage efficiency. a) and b) are covered now and also our query engine is fairly competitive, but c) needs to be addressed next.