What are the known trade-offs with the immutable git-like semantics? I might have done something wrong, but I noticed that about 1 GB of data blew up to around 80 GB when it was transacted in thousands of transactions instead of batched. I know we get great read distribution, but is the trade-off that we pay more in storage for write-heavy applications?
How many commits did you make? If you transact in very small batches, you will get inflated storage usage. There are different ways to mitigate this: one is to use the online gc during imports or bulk loads, another is to run gc in general if you don't need the git history anymore.
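To give a rough feel for why very small batches inflate storage, here is a toy model in Python. It is only a sketch under stated assumptions, not Datahike's actual index implementation: it assumes a fixed-shape copy-on-write tree in which every commit writes a fresh copy of each node on the path from a touched leaf to the root, and nothing is ever freed. The names `nodes_written` and `amplification`, the fanout of 64, and the shuffled key order are all made up for illustration.

```python
import random


def nodes_written(keys, batch_size, fanout=64):
    """Count tree nodes written when `keys` are committed in batches.

    Toy model (an assumption, not Datahike's real index): the node covering
    a key at level k is `key // fanout**k`. Every commit writes a new copy of
    each distinct node it touches at every level, and old copies are kept.
    """
    n = len(keys)
    # Number of levels needed so that a single root covers all keys.
    height = 1
    while fanout ** height < n:
        height += 1
    total = 0
    for start in range(0, n, batch_size):
        batch = keys[start:start + batch_size]
        for level in range(1, height + 1):
            # Distinct nodes at this level touched by this commit.
            total += len({k // fanout ** level for k in batch})
    return total


def amplification(n_keys=100_000, batch_size=10, fanout=64, seed=0):
    """Ratio of nodes written with small batches vs. one single batch."""
    keys = list(range(n_keys))
    # Shuffle so that each small commit touches scattered leaves.
    random.Random(seed).shuffle(keys)
    one_batch = nodes_written(keys, n_keys, fanout)
    many_batches = nodes_written(keys, batch_size, fanout)
    return many_batches / one_batch
```

Calling `amplification()` compares 10,000 commits of 10 keys each against a single commit of all 100,000 keys. In this model the small-batch case writes far more nodes, because every commit re-writes the root and inner nodes even when it only adds a handful of keys. Real numbers depend on the actual index layout, node sizes, and whether history is enabled.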
If you can give me a way to reproduce then I can also take a closer look at your example.
https://github.com/replikativ/datahike/blob/main/doc/gc.md?plain=1#L72
To free up storage as a one-off, it should be enough to run gc. Note that gc does remove your branch history, though (not the history-db; that is internal, if you have activated it).
My assumption is I have done something very wrong. Or it's some local hacks I have made along the way.
Did you have the history support on in the db config?
Yes, I have history support on. It was 11k+ transactions. I'll run the gc, as I don't need the branch history. I just didn't know it would grow so much.
The online gc will make it behave more like a mutable database, overwriting and removing history immediately. You can also optionally kick off the offline gc at regular intervals; I think @alekcz360 is doing this now. I haven't implemented that by default yet because I first wanted to provide the online gc to make bulk imports as efficient as possible, and it is easy enough to do explicitly in a loop, but there should probably be a convenience function for this.
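The "explicitly in a loop" idea could look roughly like the following. This is a language-agnostic sketch written in Python, not Datahike code: `transact_batch` and `run_gc` are hypothetical callables standing in for whatever transact and gc calls your setup uses, and `batch_size` and `gc_every` are illustrative defaults, not recommendations.

```python
from typing import Callable, Sequence


def bulk_import(
    items: Sequence,
    transact_batch: Callable[[Sequence], None],  # hypothetical: commits one batch
    run_gc: Callable[[], None],                  # hypothetical: runs a gc pass
    batch_size: int = 1000,
    gc_every: int = 10,
) -> int:
    """Transact `items` in batches, calling `run_gc` every `gc_every` commits.

    Returns the number of commits made.
    """
    commits = 0
    for start in range(0, len(items), batch_size):
        transact_batch(items[start:start + batch_size])
        commits += 1
        if commits % gc_every == 0:
            run_gc()
    # Final pass so garbage from the last few commits is reclaimed too.
    if commits % gc_every != 0:
        run_gc()
    return commits
```

Larger batches reduce the number of commits in the first place, and the periodic gc call bounds how much unreferenced data can pile up between passes. How often to collect is a trade-off between import speed and peak storage.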
Brought 98 GB down to 2 GB.
good 🙂