This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2024-01-23
Channels
- # announcements (1)
- # babashka (13)
- # cherry (12)
- # cider (6)
- # clj-kondo (3)
- # cljs-dev (28)
- # clojure (77)
- # clojure-europe (25)
- # clojure-nl (1)
- # clojure-norway (35)
- # clojure-uk (5)
- # clojurescript (31)
- # conjure (7)
- # cursive (12)
- # data-science (9)
- # datalevin (5)
- # datomic (8)
- # hyperfiddle (21)
- # jobs (7)
- # kaocha (19)
- # malli (26)
- # matrix (3)
- # releases (1)
- # shadow-cljs (42)
- # squint (95)
- # testing (2)
- # vim (14)
- # wasm (1)
Anyone have any advice on how I can modify the following code so it doesn't hit a Java heap out-of-memory error? I'm coming from Node.js, where I'd just turn this into a stream; I'm not sure what the equivalent would be in Clojure.
(defn batch-insert
  "Inserts rows in batches of 1000 using a transducer."
  [datasource table-name columns rows]
  (let [insert-fn (fn [rows]
                    (jdbc-sql/insert-multi! datasource table-name columns rows))]
    (transduce
     (comp
      (partition-all 1000) ;; Create batches of 1000
      (map insert-fn))     ;; Apply the insert function to each batch
     (fn
       ([] [])
       ([result] result)
       ([result input] (conj result input)))
     []
     rows)))

(batch-insert
 db/conn
 :outreachos.city-financials
 (vec (ds/column-names residential))
 (mapv #(vec (vals %)) (ds/rows residential)))
I'm taking a tmd dataset with 56 columns and 2.2M rows and turning it into a vector of values that can be batch-inserted with next.jdbc into a Postgres database. I'm running into an issue, though, where the Java process climbs to about 9.1 GB of memory around 1M rows in and crashes. I do have more RAM available on my system, but I'd rather improve the code than allocate more RAM with a custom JVM option flag or whatever.
I was able to solve the problem by getting rid of the whole transducer part and simplifying it to the code below. It still takes up around 3.5 GB while processing, though.
(let [columns (vec (ds/column-names residential))
      rows (mapv #(vec (vals %)) (ds/rows residential))
      table-name :outreachos.city-financials]
  (jdbc-sql/insert-multi! db/conn table-name columns rows {:batch true
                                                           :batch-size 10000
                                                           :large true}))
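Worth noting about the original transducer version: its reducing function `conj`ed the return value of every `insert-multi!` call into one vector, so every batch's result (including any generated keys) stayed reachable for the whole run. A minimal sketch of a variant that streams batches through without retaining results, using `run!` over a non-caching `eduction` — here `fake-insert!` is a hypothetical stand-in for the real `jdbc-sql/insert-multi!` call, so the example runs without a database:

```clojure
(defn batch-insert!
  "Streams rows through insert-fn in batches of batch-size, discarding
   each batch's result so nothing accumulates in memory."
  [insert-fn batch-size rows]
  ;; eduction is non-caching: each batch is realized, handed to
  ;; insert-fn, and becomes garbage before the next batch is built.
  (run! insert-fn (eduction (partition-all batch-size) rows)))

;; Hypothetical stand-in for jdbc-sql/insert-multi!; it just counts
;; the batches it receives.
(def batch-count (atom 0))
(defn fake-insert! [batch] (swap! batch-count inc))

(batch-insert! fake-insert! 1000 (range 2500))
@batch-count ;; => 3 (two full batches of 1000 plus one of 500)
```

In the real code, `insert-fn` would close over the datasource, table name, and columns, exactly as in the `let` binding above.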
Ok I was able to fix it by adding a flag not to return keys:
(let [columns (vec (ds/column-names residential))
      rows (mapv #(vec (vals %)) (ds/rows residential))
      table-name :outreachos.city-financials]
  (jdbc-sql/insert-multi! db/conn table-name columns rows {:batch true
                                                           :return-keys false
                                                           :batch-size 10000
                                                           :large true}))
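One further observation (mine, not from the thread): `mapv` is eager, so the full 2.2M-row vector is built up front and held for the whole insert, which would account for a chunk of the remaining memory. `map` stays lazy, as a small sketch can show — `tally` is a hypothetical helper that counts how many elements actually get realized:

```clojure
;; Hypothetical helper: counts how many elements get realized.
(def realized (atom 0))
(defn tally [x] (swap! realized inc) x)

;; mapv is eager: all 1000 elements are realized immediately.
(def eager-rows (mapv tally (range 1000)))
(def eager-count @realized) ;; => 1000

;; map is lazy: nothing is realized until something consumes it.
(reset! realized 0)
(def lazy-rows (map tally (range 1000)))
@realized ;; => 0
```

Whether this actually lowers peak memory depends on passing the lazy sequence straight into `insert-multi!` rather than binding it in a `let` (a retained head keeps every realized row alive), and on how the driver consumes the batches.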
You'll likely get more responses at https://clojurians.zulipchat.com/#narrow/stream/151924-data-science/topic/tech.2Eml.2Edataset , but https://github.com/techascent/tech.ml.dataset.sql may be something to look into as well
@U06C63VL4 oh nice, I didn't know about tmd sql. I'll definitely take a look at that.
Anyway the library @U06C63VL4 recommended fixed the issues and made things a lot easier, so thanks 🙂