Good day all,
I'm running parallel CPU-intensive tasks (~10-50 threads) and using Datalevin as a KV store. However, it's a huge bottleneck: CPU usage is only about 25%.
Each key is a vector pair: [string integer]
Each value is a 1-item list containing a nested map with 10k keys.
It takes ~100ms to write one of those docs from the main thread. However, when I run lots of tasks in a claypoole threadpool, each write takes about 2-3 seconds. It makes me wonder whether reads are blocked by writes too.
How do you suggest I improve performance?
1. async writes: is this possible in datalevin?
2. batch transactions: I'm thinking of having my threadpool store data in an atom, then a single-thread worker keeps taking all the data and performing transactions in batches. I don't know if this would improve performance.
3. Partitioning: Would partitioning between different tables work (does Datalevin serialize writes to different tables on the same DB?), or must I use different DBs?
Thank you
There's only one writer in Datalevin. Concurrent writes will be slower than single-threaded writes due to contention. The more writers you have, the slower it will become.
(get-conn dir schema {:kv-opts {:flags (conj datalevin.constants/default-env-flags :nosync)}}) will not sync on every transaction. :mapasync :writemap gives asynchronous writes (the oldest default; DB corruption is possible on a crash). :nometasync gives faster synchronous writes (it cuts sync time in half, but may lose the last transaction on a crash; this was the more recent old default). We changed to the safest default, because data loss can happen with these faster write settings when the system crashes.
Then make sure to close the DB (which calls sync) when you are done, or call sync manually on the underlying KV store.
If you are bulk loading data, 2 is what you should be doing. Batch writes are a lot faster, because the slowness is due to sync, which is done at the end of each transaction.
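A minimal sketch of option 2, assuming a java.util.concurrent queue is used instead of an atom: worker threads put [k v] pairs on the queue, and one writer thread drains them in batches. The names take-batch, start-batch-writer and write-batch! are made up for this illustration, and the in-memory `written` atom stands in for the DB. With Datalevin, write-batch! would presumably build one transact-kv call per batch, e.g. [:put "docs" k v] entries for a hypothetical DBI named "docs".

```clojure
(import '(java.util ArrayList)
        '(java.util.concurrent LinkedBlockingQueue TimeUnit))

(defn take-batch
  "Waits up to 50 ms for one item, then grabs whatever else is queued,
  up to max-batch items in total. Returns a vector, or nil if nothing arrived."
  [^LinkedBlockingQueue queue max-batch]
  (when-let [item (.poll queue 50 TimeUnit/MILLISECONDS)]
    (let [buf (ArrayList.)]
      (.add buf item)
      (.drainTo queue buf (int (dec max-batch)))
      (vec buf))))

(defn start-batch-writer
  "Starts the single writer thread. write-batch! is called with each batch.
  Returns a zero-arg function that lets the queue drain, then joins the thread."
  [^LinkedBlockingQueue queue write-batch! max-batch]
  (let [running (atom true)
        worker  (Thread.
                 (fn []
                   (loop []
                     (let [batch (take-batch queue max-batch)]
                       (when batch (write-batch! batch))
                       ;; keep going while running, or while items remain
                       (when (or @running batch) (recur))))))]
    (.start worker)
    (fn []
      (reset! running false)
      (.join worker))))

;; Demo with an in-memory stand-in for the DB (hypothetical).
(def written (atom []))
(def batch-sizes (atom []))
(def queue (LinkedBlockingQueue.))
(def stop-writer!
  (start-batch-writer queue
                      (fn [batch]
                        ;; with Datalevin this would be one transaction per batch
                        (swap! batch-sizes conj (count batch))
                        (swap! written into batch))
                      100))

;; Four producer threads, 250 docs each.
(let [producers (mapv (fn [t]
                        (doto (Thread.
                               (fn []
                                 (dotimes [i 250]
                                   (.put ^LinkedBlockingQueue queue
                                         [[(str "task-" t) i] {:n i}]))))
                          (.start)))
                      (range 4))]
  (run! (fn [^Thread p] (.join p)) producers))
(stop-writer!)
```

The producers never touch the store, so there is only ever one writer, and the sync cost is paid once per batch instead of once per doc.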
If your use case is amenable to one DB per partition, you can do that. A common use of Datalevin is one DB per user. Basically, a Datalevin DB is just a single file, there's a single "table" in a Datalevin DB, and there's a single writer for a DL DB. Plan accordingly.
In Datalevin, reads do not block writes, and writes do not block reads. However, you should have only one writer. Multiple writers are slower than a single writer due to contention.
If you are using DL as a KV store, try this (d/open-kv dir {:flags (conj datalevin.constants/default-env-flags :nosync)}) and when you are done with all the writes, call (d/sync db) to manually sync to disk, where db is your KV store.
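For the one-DB-per-partition idea, a rough sketch of how keys could be routed, assuming a fixed number of partitions chosen up front. partition-for and group-by-partition are hypothetical helper names, not Datalevin functions; each partition index would map to its own DB with its own single writer.

```clojure
(defn partition-for
  "Maps a key to a partition index in [0, n) using Clojure's hash."
  [k n]
  (mod (hash k) n))

(defn group-by-partition
  "Groups [k v] pairs by the partition their key belongs to."
  [n kvs]
  (group-by (fn [[k _]] (partition-for k n)) kvs))

;; Example: route 100 docs with [string integer] keys across 4 partitions.
(def docs (for [i (range 100)] [["user" i] {:n i}]))
(def routed (group-by-partition 4 docs))
```

The same key always lands in the same partition, so a reader can compute which DB to open from the key alone.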
The speedup should be several orders of magnitude.
Sync is expensive regardless of which DB you use. Most DBs "cheat" by introducing complicated mechanisms that have their own tradeoffs. Datalevin opts to do the simplest thing and lets you do your own sync if you want the best speed. But our default favors safety over speed, because a corrupt DB is not recoverable.
If you are running heavy computation (e.g. machine learning training with images), using Datalevin as a buffer of some sort (because it is faster than the file system), and are not interested in persisting the things in the buffer, then you can open it as a temporary DB, which will automatically use :nosync, like so (d/open-kv dir {:temp? true}), and this DB file will be deleted when the JVM exits.
In summary, your option 1, 2, 3 all work with Datalevin. You can also combine them.
Thank you very much @huahaiy. I also found out that part of the problem was that the seq I was saving was lazy. This means that while holding the write lock, the Datalevin thread was not only persisting the data but also doing the main computation.
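A small illustration of the lazy seq problem, in plain Clojure with no Datalevin involved. fake-write! is a made-up stand-in for a write: it holds a lock and serializes the value with pr-str, which is an assumption about what a real write roughly does to the value. The call counter shows where the computation actually runs.

```clojure
(def calls (atom 0))

(defn expensive-entry
  "Stand-in for the CPU-heavy work; counts how often it runs."
  [i]
  (swap! calls inc)
  [i (* i i)])

(defn fake-write!
  "Hypothetical write: serializes v while holding a lock on the store."
  [store k v]
  (locking store
    (swap! store assoc k (pr-str v))))

(def store (atom {}))

;; Lazy: nothing has been computed when the doc is handed to the writer.
(def lazy-doc (map expensive-entry (range 5)))
(def calls-before-lazy-write @calls)
(fake-write! store ["lazy" 1] lazy-doc)
(def calls-after-lazy-write @calls)   ; the work happened inside the lock

;; Eager: doall forces the computation in the worker, before the write.
(def eager-doc (doall (map expensive-entry (range 5))))
(def calls-before-eager-write @calls)
(fake-write! store ["eager" 1] eager-doc)
(def calls-after-eager-write @calls)  ; the write itself computed nothing
```

Forcing the value with doall (or building it with into or mapv) in the worker thread keeps the time spent under the write lock down to the actual persistence.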
Oh, I see.