2023-03-18
# datalevin
greetings. I'm trying to do some ETL testing with a non-trivial data source - i.e. trying to stuff a lot of data into Datalevin, more or less to see whether it handles the data elegantly or not. In this case I'm dealing with about 157K entities that shake out to about 20 datoms apiece, and I keep retrying the operation after changing the T part. I keep bumping into "LMDB: Environment mapsize reached" and various Java out-of-memory errors. This is on a laptop with 64GB of RAM. Am I abusing this, or, more likely, failing to set some parameters to allow this sort of workload to pass? What is the best practice for dealing with this sort of thing in Datalevin? As a n00b, I'm sure I've missed something. I'm basically transacting maps; for deleting them all, I'm simply running [:db.fn/retractEntity eid] in a large batch. I'll keep trying things, but I'm wondering if there is some guidance on how to do this better.
To bulk load data, the fastest way that also avoids OOM is to create the datoms yourself and load them with conn-from-datoms.
This will avoid the transaction logic, which is very expensive and not well optimized yet.
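Something like this, roughly (the entity maps, the schema, and the exact arity of conn-from-datoms shown here are assumptions - check the docs for your Datalevin version):
```
(require '[datalevin.core :as d])

;; `entity-maps` stands in for the real ETL output; the attributes
;; and schema are made-up examples.
(def entity-maps
  [{:id 1 :name "alpha"}
   {:id 2 :name "beta"}])

(def schema
  {:id   {:db/valueType :db.type/long}
   :name {:db/valueType :db.type/string}})

;; Build the datoms directly, assigning our own entity ids,
;; instead of going through transact!.
(def datoms
  (for [[eid m] (map vector (iterate inc 1) entity-maps)
        [a v]   m]
    (d/datom eid a v)))

;; Assumed signature: (conn-from-datoms datoms dir schema) --
;; verify against your version before relying on it.
(def conn (d/conn-from-datoms datoms "/tmp/etl-test" schema))
```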
If you insist on transacting them, there are two things to note. If you want to transact all that data in a single transaction, your memory needs to be much, much larger than the data, as many copies of the data are used during the transaction. You can always tune your JVM to use more memory, which you are obviously not doing yet. The easiest fix is to set a bigger -Xmx.
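For example, if the load runs through the Clojure CLI, a bigger heap can go in a deps.edn alias (the :etl alias name and the 32g figure are just illustrations, not recommendations):
```
;; deps.edn
{:aliases
 {:etl {:jvm-opts ["-Xmx32g"]}}}
```
Then start the process with clj -M:etl so the larger heap applies.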
Another easy fix is to transact the data in smaller batches, e.g. using the partition function.
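A minimal sketch of that, where tx-data stands in for the full seq of entity maps (partition-all keeps the last, possibly short, batch; 1000 is an arbitrary batch size to tune):
```
(require '[datalevin.core :as d])

(def conn (d/get-conn "/tmp/etl-test"))  ; or reuse your existing connection

;; `tx-data` stands in for the ~157K entity maps to load.
(def tx-data
  [{:id 1 :name "alpha"}
   {:id 2 :name "beta"}])

;; Transact in bounded batches so only one batch's worth of
;; copies is in memory at a time.
(doseq [batch (partition-all 1000 tx-data)]
  (d/transact! conn (vec batch)))
```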
I've been playing with various partition sizes on the load; currently I'm running with about 128. I just now partitioned the retractions as well, which let them succeed.
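For reference, the batched retraction looks something like this (the :id attribute and the 1000 batch size are just examples; conn is the open connection from the sketches above):
```
(require '[datalevin.core :as d])

;; Collect the entity ids to delete; :id is a placeholder attribute.
(def eids
  (map first (d/q '[:find ?e :where [?e :id _]] (d/db conn))))

;; Retract in bounded batches rather than one huge transaction.
(doseq [batch (partition-all 1000 eids)]
  (d/transact! conn (mapv (fn [eid] [:db.fn/retractEntity eid]) batch)))
```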
My issue with using conn-from-datoms is that this is a puzzle involving about 100 or so different batches of data that need to be merged into a single "pile" at the end of it all.
the transaction logic is not optimized at the moment; we will get to that after the query engine rewrite is finished.
Is there any sense in sort of "restarting" the connection periodically to overcome the swells?
don’t think “restarting” helps. Another thing is that when you are transacting bulk data, you should turn off caching; we can add that option.
yes, all of this stuff is historical data from the '80s. once it gets to disk, it can mostly be forgotten