datahike

lambdam 2024-10-07T22:11:05.652119Z

Hello everyone, I'm using datahike on a project that is in production for some months now. I have been using the :file backend until now. Here are some numbers: • There are 4605 datoms in the DB • The folder containing the files weights 122 MB (huge!) • The exported database (with export-db) weights 169.6 kB • When switching to the sqlite backend, the reimported database weights 580 kB. Question 1: How can there be such a big difference between the files and the exported information (almost a 737 ratio). Or with the files and the SQLite DB? Did I miss something? Question 2: When switching to SQLite backend and when reloading the project, the database seems to be reset everytime. It is not in the case with the file backend. With file I init the DB as so:

(let [...
      _ (when-not (d/database-exists? db-config)
          (d/create-database db-config))
      conn (d/connect db-config)]
  ...)
Would this piece of code conflict with the SQLite DB? Thanks a lot

lambdam 2024-10-09T09:07:00.200889Z

Ok I see. Thanks a lot for the detailed answer!

lambdam 2024-10-09T09:32:38.237569Z

And also for the reset of the DB (SQLite discussion) on system reload, I found the bug. My mistake. Thanks again.

whilo 2024-10-09T10:04:46.985889Z

@dam Perfect!

whilo 2024-10-08T03:53:08.904689Z

Hey @dam. Nice to hear from you! I am working on the storage requirements as we speak. This has not much to do with the backend, and more with the fact that we retain all the snapshots that are written while you have accumulated your database. You can achieve the same effect without exporting and importing by running gc from time to time. I am moving the gc function currently into the main api here https://github.com/replikativ/datahike/pull/716

whilo 2024-10-08T03:54:31.020349Z

In the current release it still lives here https://github.com/replikativ/datahike/blob/main/src/datahike/experimental/gc.cljc#L45

whilo 2024-10-08T03:56:01.952369Z

(the doc string there is still wrong, the remove date refers to now, so it will remove all old snapshots before the invocation started)

whilo 2024-10-08T03:58:35.354799Z

Note that if you have distributed readers that might still access the old snapshots (e.g. for queries that run a long time) then you should give them a grace period.

whilo 2024-10-08T04:42:49.373649Z

I do not fully understand you SQLite problem. How do you exactly reload?

lambdam 2024-10-08T12:48:25.039939Z

Thanks for the answer @whilo, When you talk about snapshots, you're talking about snapshots of the indexes? Meaning that on every transaction a new snapshot of all indexes is recorded on disc? Concerning the SQLite problem, I have a reloaded workflow based on this article's approach : https://medium.com/@maciekszajna/reloaded-workflow-out-of-the-box-be6b5f38ea98 In the end it's just a reloaded workflow... When I start the system I run the following code for the DB:

(let [...
      db-config {...}
      _ (when-not (d/database-exists? db-config)
          (d/create-database db-config))
      conn (d/connect db-config)]
  ...)
And then conn goes into the application state. When I stop the system I only run
(d/release conn)
Whit the :file backend, the data is, as expected, persistent. With the :jdbc with "sqlite" backend and with the same code, the DB seems to reboot after running start, stop and start again. I don't know if it is a behavior of the jdbc adapter or a known behavior in general, but it seemed strange.

lambdam 2024-10-08T12:51:57.826959Z

And also if the snapshots are stored in SQLite, the problem remains since SQLite doesn't shrink the DB file when data is removed... It has to be "vacuumed" manually by default. So not a solution at all. I'll check th GC feature.

whilo 2024-10-08T16:04:02.407399Z

Each snapshot is not a full copy, but just the delta of nodes between trees, the same as in https://hypirion.com/musings/understanding-persistent-vector-pt-1

whilo 2024-10-08T16:08:17.638769Z

With the hitchhiker-tree index we could be much more space efficient, because it maintained a transaction-log in each node and did not write to many nodes in each transaction. We rebuild on the persistent sorted set because it was 5-10x faster in pure read performance and currently the indices in each snapshot are optimal for reading. But I am aware that the amount of storage used in this approach can be a negative surprise and plan to readd a Datomic style db-wide transaction log next again. That should shrink the space needs at the tradeoff of having to retransact a few datoms from the log when you read a snapshot.

whilo 2024-10-08T16:09:46.174799Z

Yes, the storage use should be the same for SQLite. Maybe there is more or less overhead, but it should be similar.

whilo 2024-10-08T16:10:09.561659Z

@alekcz360 do you have an idea what could be going on in the SQL backend?

alekcz 2024-10-08T16:15:23.956609Z

I had a similar issue with the SQL backend. I had about 200k datoms with history. The SQL DB size was about 5 GB. But the datoms on their own in an edn file were 45MB. I'm not sure if this is good or bad as I have no frame of reference. It just seems a bit counterintuitive.

whilo 2024-10-08T16:19:03.236199Z

Yes, counterintuitive is bad. It makes sense if you see how it works and you can call gc in regular intervals to keep the storage use bounded (it does not slow down the transactions), but it would be better to create less overhead in the first place.

whilo 2024-10-08T16:24:58.460439Z

The priorities were a) read performance, b) write performance and c) storage efficiency. a) and b) are covered now and also our query engine is fairly competitive, but c) needs to be addressed next.