#datahike
2022-02-22
awb9916:02:07

I have noticed something weird in terms of disk space used by datahike. I am importing data from various sources into a datahike db. All the data saved as edn files takes 12 MB. I then save this data to a konserve store, and the total disk space used there is 2.4 MB. This is fine, since konserve uses compression. Now I import only parts of the 12 MB of data into datahike (the fields I care about), but I notice that the datahike db uses 278 MB of disk space. I have turned off history, so it is really confusing that datahike's disk usage is so high. Any ideas why this could be the case? I use the datahike file backend.

metasoarous20:02:14

I'm a little surprised it's that high, but keep in mind: Datahike has to maintain multiple indexes, and storing things in EAV format has some cost associated with it as well.

metasoarous20:02:46

The hitchhiker tree itself has metadata which requires space as well.

metasoarous20:02:16

Still, like I said, that's higher than I would have thought, so others will have to chime in as to why it's quite that high and how that scales.

awb9922:02:58

my fear is that either there is "growing fragmentation" or that "history is active even though it should be disabled"

awb9923:02:19

I will experiment more to see what the disk usage is on a raw import where I am sure all data is only inserted once.

awb9923:02:48

I have run my imports into a brand new datahike and konserve db. Now my datahike db is 62 MB and my konserve db is 2.7 MB. Previously the SAME DATA imported into datahike had a size of 300 MB. So something is growing in datahike. In konserve nothing is growing.

awb9923:02:14

|-------------------+-------|
|        Type       | Count |
|-------------------+-------|
| :distributor      | 54    |
| :product          | 212   |
| :woo-order        | 77    |
| :invoice          | 3704  |
| :lineitem         | 12436 |
| :lineitem-invoice | 11937 |
| :lineitem-po      | 499   |
| :tracking         | 39    |
| :po               | 164   |
|-------------------+-------|

awb9923:02:43

so these are the counts of the entities (by type)

awb9923:02:25

62.9 MiB [###################] /datahike-db
12.1 MiB [###                ] /import
2.7 MiB [                   ] /konserve-db

awb9923:02:35

this is the size of the folders reported by ncdu

awb9923:02:49

I will now run the SAME import again to see what that changes.

awb9923:02:16

;; assumed requires: datahike.api :as d, clojure.java.io :as io,
;; and a logging macro `info` (e.g. from taoensso.timbre)
(def cfg {:store {:backend :file
                  :path "data/datahike-db"}
          :keep-history? false})

(defn create! []
  (info "creating datahike db..")
  (d/delete-database cfg)
  (d/create-database cfg)
  (def conn (d/connect cfg))
  (info "creating schema..")
  (d/transact conn schema))

(defn connect! []
  (let [db-filename (get-in cfg [:store :path])]
    (if (.exists (io/file db-filename))
      (do (info "connecting to datahike db..")
          (def conn (d/connect cfg)))
      (create!))))

awb9900:02:52

|-------------------+-------|
|        Type       | Count |
|-------------------+-------|
| :distributor      | 54    |
| :product          | 212   |
| :woo-order        | 77    |
| :invoice          | 3704  |
| :lineitem         | 12436 |
| :lineitem-invoice | 11937 |
| :lineitem-po      | 499   |
| :tracking         | 39    |
| :po               | 164   |
|-------------------+-------|

awb9900:02:59

this is the data after two import runs

awb9900:02:38

So my entity saving routine is not generating duplicates.

awb9900:02:03

169.9 MiB [###################] /datahike-db
 12.1 MiB [#                  ] /import
  2.7 MiB [                   ] /konserve-db

awb9900:02:28

The konserve-db stayed at exactly the same size,

awb9900:02:49

and the datahike-db grew from 63 MB to 169 MB.

awb9901:02:18

I am updating entities with their most up-to-date versions.

awb9901:02:37

So I guess either datahike is somehow still keeping the history, or it keeps the old ids around.

Björn Ebbinghaus16:02:50

When you add the same facts twice, you still get a transaction entry in your database. Maybe that's the reason?
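A minimal sketch of what this means, assuming a connected conn and a schema where :product/sku is declared :db.unique/identity (both hypothetical names): re-asserting an identical fact adds no new domain datom, but every d/transact call still creates a transaction entity with its own :db/txInstant datom.

(require '[datahike.api :as d])

;; first assertion adds the domain datoms plus a transaction entity
(d/transact conn [{:product/sku "A-1" :product/name "Widget"}])
;; same fact again: upserts onto the same entity, adds no new domain
;; datoms, but still records another transaction entity
(d/transact conn [{:product/sku "A-1" :product/name "Widget"}])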

awb9919:02:03

so transaction records are being kept?

awb9919:02:27

I would have thought that the transactions are automatically being applied to the modified state in this case.

awb9919:02:19

is it possible to look up the number of transactions that are in datahike?

awb9919:02:40

Since there is no UI for datahike, I want to create some sort of statistics queries that tell me what I really have.

Björn Ebbinghaus19:02:27

Transactions are just entities. So you can query and pull them like entities. Here is a query for the number of transactions in your system.

[:find (count ?tx) .
 :where
 [?tx :db/txInstant]]

; or, counting via the T position of every datom:
[:find (count ?tx) .
 :where
 [_ _ _ ?tx]]
; E A V  T
; T is just a :db/id (ref) to a transaction entity

; The number of all entities:
[:find (count ?e) .
 :where
 [?e]]
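These queries can be run directly against a dereferenced connection; a usage sketch, assuming the conn defined earlier in the thread:

(d/q '[:find (count ?tx) .
       :where [?tx :db/txInstant]]
     @conn)
;; => number of transaction entities, e.g. 42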

Björn Ebbinghaus19:02:40

This is really neat. You can query for the last time an attribute changed:

[:find ?time .
 :in $ ?entity ?attribute
 :where 
 [?entity ?attribute _ ?tx]
 [?tx :db/txInstant ?time]]

Björn Ebbinghaus19:02:36

You can annotate your transactions. Have a look at the Datomic Best Practices page: https://docs.datomic.com/on-prem/best-practices.html#add-facts-about-transaction-entity Useful for auditing: I have a :tx/by attribute, where I record the ref to the user who initiated that transaction.
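A hedged sketch of such an annotation, assuming Datahike supports the DataScript-style :db/current-tx tempid and that :tx/by and the invoice attributes exist in the schema (all names here are illustrative):

(d/transact conn [{:invoice/id 1234
                   :invoice/total 99.0}
                  ;; assert a fact about the transaction entity itself
                  {:db/id :db/current-tx
                   :tx/by user-id}])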

awb9919:02:33

I still don't understand why datahike keeps transactions when history is disabled.

awb9919:02:55

I have implemented the :db/txInstant count, and there are no transactions in the datahike db.

awb9919:02:09

So the transactions do not explain the growth in the db.

awb9919:02:14

I will file a ticket.

awb9923:02:28

I was searching the konserve docs for a way to add 1000 key-value pairs to the store in one transaction. Will this happen if I just run multiple save requests via (k/assoc-in store k item) and then wait for all of the channels to be completed? Or is there a different syntax that I have to use?

Björn Ebbinghaus16:02:11

What exactly do you want to do? Asking because there are no "transactions" in konserve. Do you want to make a datahike transaction with 1000 facts and it is too slow? Or do you want to batch konserve assocs? According to the README of konserve, the file system store is CPU bound, so you could try to add your 1000 pairs in parallel.
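One way to do that in parallel, sketched under the assumption that konserve's async API returns a core.async channel per operation (store and items are stand-ins for the values discussed in this thread):

(require '[konserve.core :as k]
         '[clojure.core.async :as a])

(defn save-all! [store items]
  ;; fire all writes first, then block until every channel delivers
  (let [chans (mapv (fn [[id item]] (k/assoc-in store [id] item)) items)]
    (doseq [ch chans]
      (a/<!! ch))))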

awb9919:02:11

I want to add 4000 invoices, which will be stored under 4000 individual keys. My API query to get this data takes milliseconds; storing it to an edn file takes 300 milliseconds; storing it to konserve takes 30 seconds. I want to speed that up.

awb9919:02:52

I am talking about konserve.

awb9919:02:42

There must be something like a transaction in konserve as well, because when it writes the changed index to disk and I do two index updates in parallel, it somehow has to coordinate the file writes.

kkuehne15:02:57

I think that is something that @U1C36HC6N could answer, since I'm not into these kinds of specifics in konserve.

awb9917:02:29

That would be very cool @U1C36HC6N

awb9917:02:12

I am trying to find out how I can do efficient batched updates in both konserve and datahike.

awb9917:02:30

In konserve I believe it must be hidden somewhere in the core.async syntax.

awb9917:02:41

In datahike it must be done by adding a big transaction list.

awb9917:02:22

For datahike I sort of struggle: say I aggregate 1000 transactions into one batched transaction.

awb9917:02:34

Then the question is what happens if one of the transactions in the batched transaction fails.

awb9917:02:56

If I run each transaction sequentially, then only one transaction will fail, while the others will go through.

awb9917:02:27

If I add the transactions together into one batched transaction, then a failure in a single transaction would make them all fail, since datahike sees only one transaction list.
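A sketch of that trade-off, assuming conn and a seq of entity maps called entities: attempt one big transaction for throughput, and on failure fall back to per-entity transactions so a single bad entity doesn't sink the whole batch.

(defn transact-batch! [conn entities]
  (try
    (d/transact conn (vec entities))
    (catch Exception _
      ;; retry one by one, collecting any per-entity failures
      (doall
       (for [e entities]
         (try
           (d/transact conn [e])
           (catch Exception ex
             {:failed e :error (ex-message ex)})))))))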

kkuehne18:02:09

Right, I already started on batch transactions to increase performance, but you pointed out exactly the problems I'm facing at the moment. 🙂