#data-science
2024-01-23
vonadz 15:01:33

Anyone have any advice on how I can modify the following code so it doesn't hit a java heap out-of-memory error? I'm coming from nodejs, where I'd just turn this into a stream; not sure what the equivalent would be in clojure.

(defn batch-insert
  "Inserts rows in batches of 1000 using a transducer."
  [datasource table-name columns rows]
  (let [insert-fn (fn [rows]
                    (jdbc-sql/insert-multi! datasource table-name columns rows))]
    (transduce
     (comp
      (partition-all 1000) ;; Create batches of 1000
      (map insert-fn))      ;; Apply the insert function to each batch
     (fn
       ([] [])
       ([result] result)
       ([result input] (conj result input)))
     []
     rows)))

(batch-insert
 db/conn
 :outreachos.city-financials
 (vec (ds/column-names residential))
 (mapv #(vec (vals %)) (ds/rows residential)))
I'm taking a tmd dataset with 56 columns and 2.2m rows and turning it into a vector of row vectors that can be batch inserted with next-jdbc into a postgres database. I'm running into an issue, though, where the java process climbs to about 9.1 GB of memory around 1m rows in and crashes. I do have more RAM available on my system, but I'd rather improve the code than allocate more RAM via a custom JVM option flag or whatever.
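A side note on the snippet above: its reducing function conjoins every insert result into a vector, so nothing can be garbage collected. The closest analogue to a Node stream is to make the pipeline purely side-effecting and discard the results. A minimal sketch with the same aliases as above:

(defn batch-insert!
  "Inserts rows in batches of 1000; each insert's result is discarded,
   so no batch is retained after it is written."
  [datasource table-name columns rows]
  ;; run! reduces for side effects only; note this helps only when `rows`
  ;; is a lazy seq (e.g. built with map), not a fully realized vector.
  (run! (fn [batch]
          (jdbc-sql/insert-multi! datasource table-name columns batch))
        (partition-all 1000 rows)))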

vonadz 15:01:54

I was able to solve the problem by just getting rid of the whole transducer part and simplifying it to the version below. It still takes around 3.5 GB while processing, though.

(let [columns (vec (ds/column-names residential))
      rows (mapv #(vec (vals %)) (ds/rows residential))
      table-name :outreachos.city-financials]
  (jdbc-sql/insert-multi! db/conn table-name columns rows
                          {:batch true
                           :batch-size 10000
                           :large true}))

vonadz 16:01:33

Hmm no, looks like it still gets stuck at around 1m records.

vonadz 16:01:52

Ok I was able to fix it by adding a flag not to return keys:

(let [columns (vec (ds/column-names residential))
      rows (mapv #(vec (vals %)) (ds/rows residential))
      table-name :outreachos.city-financials]
  (jdbc-sql/insert-multi! db/conn table-name columns rows
                          {:batch true
                           :return-keys false
                           :batch-size 10000
                           :large true}))
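As I understand it, :batch true makes insert-multi! reuse a single prepared statement via next.jdbc's execute-batch!, and :return-keys false matters because otherwise the driver buffers a generated-keys result for every inserted row. A rough hand-rolled sketch of the same mechanism (the two-column table here is a made-up placeholder, not the real schema):

(require '[next.jdbc :as jdbc])

(with-open [con (jdbc/get-connection db/conn)
            ps  (jdbc/prepare con ["INSERT INTO city_financials (a, b) VALUES (?, ?)"])]
  ;; one params vector per row; sent to the server in chunks of :batch-size
  (jdbc/execute-batch! ps
                       [[1 10] [2 20] [3 30]]
                       {:batch-size 10000 :large true}))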

vonadz 17:01:45

@U06C63VL4 oh nice, I didn't know about tmd sql. I'll definitely take a look at that.

👍 1
namenu 14:01:03

mapv is eager, so it will make the VM hold all 2.2m rows in memory.

👍 1
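A toy illustration of the difference:

;; mapv is eager: the entire result vector exists before it returns.
(mapv inc (range 3)) ;=> [1 2 3]

;; map is lazy: elements are realized only as a consumer walks the seq,
;; and can be collected if nothing holds on to the head.
(map inc (range 3))  ;=> (1 2 3), realized on demand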
vonadz 14:01:24

yeah, not sure I can pass a lazy sequence as a parameter to insert-multi! though.
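One way around that, for reference: keep the row transformation lazy with map and hand insert-multi! one realized chunk at a time, so the question of lazy arguments never comes up. A sketch reusing the names above (memory only stays flat if nothing else retains the head of the seq):

(let [columns (vec (ds/column-names residential))]
  (doseq [batch (partition-all 10000 (map #(vec (vals %)) (ds/rows residential)))]
    ;; each batch is a realized chunk of up to 10k row vectors;
    ;; earlier batches become garbage once inserted
    (jdbc-sql/insert-multi! db/conn :outreachos.city-financials columns batch
                            {:return-keys false})))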

vonadz 14:01:55

Anyway, the library @U06C63VL4 recommended fixed the issues and made things a lot easier, so thanks 🙂

👍 1
🎉 1