#sql
2022-11-01
kwladyka 13:11:32

(next.jdbc.sql/insert-multi! ... {:batch true :batch-size 2000}) How do you deal with the situation where an index key is duplicated but you want to insert all the other rows? Is there some magic PostgreSQL setting, or do you switch to inserting each row one by one? Or filter the batch before insert-multi! by querying the DB first for the existing ids and removing them from the batch? Quite an abstract question, but maybe there is something I don't know about.
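One option not mentioned in the thread is to let PostgreSQL skip the duplicates itself with INSERT ... ON CONFLICT DO NOTHING, so the whole batch succeeds without any exception handling. A minimal sketch with next.jdbc, assuming a hypothetical mytable with columns id and val:

(require '[next.jdbc :as jdbc])

(defn insert-ignoring-dupes!
  [ds rows] ; rows: a seq of [id val] parameter vectors
  (with-open [conn (jdbc/get-connection ds)
              ps   (jdbc/prepare conn
                                 ["insert into mytable (id, val) values (?, ?)
                                   on conflict (id) do nothing"])]
    ;; execute-batch! sends the parameter groups in chunks of :batch-size
    (jdbc/execute-batch! ps rows {:batch-size 2000})))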

kwladyka 14:11:13

(defn insert-multi! [table cols rows]
  (try
    ;; fast path: one batched insert for all rows
    (jdbc-sql/insert-multi! (pool) table cols rows {:batch true
                                                    :batch-size 2000})
    (catch BatchUpdateException e
      ;; duplicate key in the batch: fall back to row-by-row inserts,
      ;; logging and skipping the duplicates
      (if (postgresql-utils/duplicated-key? (ex-cause e))
        (doseq [row rows]
          (try
            (jdbc-sql/insert! (pool) table (zipmap cols row))
            (catch PSQLException e
              (if (postgresql-utils/duplicated-key? e)
                (l/warn e)
                (throw e)))))
        (throw e)))))
looks pretty ugly, but probably can’t be much better

Jan Winkler 14:11:20

This looks pretty janky. I'd suggest the following steps:
• begin transaction
• search for existing duplicates in the db
• remove the duplicates from the input (or from the db, depending on what you are trying to accomplish)
• insert the filtered input
• commit the txn

kwladyka 14:11:20

right now I have v2:

(defn- insert! [table m]
  (try
    (jdbc-sql/insert! (pool) table m)
    (catch PSQLException e
      ;; duplicate keys are logged and skipped; anything else is rethrown
      (if (postgresql-utils/duplicated-key? e)
        (l/warn e)
        (throw e)))))

(defn insert-multi! [table cols rows]
  (try
    (jdbc-sql/insert-multi! (pool) table cols rows {:batch true :batch-size 2000})
    (catch BatchUpdateException e
      ;; duplicate key in the batch: retry row by row via insert! above
      (if (postgresql-utils/duplicated-key? (ex-cause e))
        (doseq [row rows]
          (insert! table (zipmap cols row)))
        (throw e)))))
I have to think about this. What you are saying is correct too, but it will take more lines and probably end up as more complex code than it is now. Also, I would have to use a transaction and block other threads. It's not an easy choice.

kwladyka 14:11:17

consider that the table will have millions of rows

kwladyka 14:11:19

actually the only fragment that would change then is

(doseq [row rows]
  (insert! table (zipmap cols row)))
Because I have to check whether the cause of the issue is a duplicate key or not. So this code stays as it is.

kwladyka 14:11:29

I am thinking out loud

kwladyka 14:11:21

So the

(doseq [row rows]
  (insert! table (zipmap cols row)))
at first glance looks simpler than: https://clojurians.slack.com/archives/C1Q164V29/p1667313560042789?thread_ts=1667309672.219459&cid=C1Q164V29

kwladyka 14:11:39

Do I use {:batch true :batch-size 2000} correctly here? I mean, does that map actually set the batch size? I have a feeling it is going slower than when I partitioned the rows myself.
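The opts map as the fifth argument matches next.jdbc.sql/insert-multi!'s documented arity, and with :batch true it is passed on to execute-batch!. For PostgreSQL, batch speed usually hinges on the pgjdbc reWriteBatchedInserts connection property; without it the driver still executes each INSERT in the batch separately instead of rewriting them into multi-row INSERTs. A sketch with illustrative db-spec values and table/column names:

(require '[next.jdbc :as jdbc]
         '[next.jdbc.sql :as jdbc-sql])

;; extra db-spec keys are passed through as pgjdbc connection properties
(def db-spec {:dbtype "postgresql"
              :dbname "mydb"
              :user "me"
              :password "secret"
              :reWriteBatchedInserts true})

(def ds (jdbc/get-datasource db-spec))

;; the options map goes in the fifth position, after the rows
(jdbc-sql/insert-multi! ds :mytable
                        [:id :val]
                        [[1 "a"] [2 "b"]]
                        {:batch true :batch-size 2000})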

Jan Winkler 14:11:19

(defn insert-but-drop-dupes
  [txable rows]
  (jdbc/with-transaction [txn txable]
    (let [;; ids that already exist in the table
          dupes (into #{}
                      (map :mytable/id)
                      (jdbc/execute!
                        txn
                        ["select id from mytable where id = any(?)"
                         (into-array Long (map :mytable/id rows))]))
          ;; keep only rows whose id is not in the db yet
          unique-rows (remove (comp dupes :mytable/id) rows)]
      (sql/insert-multi! txn :mytable unique-rows))))
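A possible call, assuming ds is a next.jdbc datasource and the rows are maps shaped the way the proof of concept expects (:mytable/val is purely illustrative):

(insert-but-drop-dupes ds
                       [{:mytable/id 1 :mytable/val "a"}
                        {:mytable/id 2 :mytable/val "b"}])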

👍 1
kwladyka 15:11:42

thank you, I think I can mix these 2 into 1 better. This situation happens only from time to time, so I don't know if I want to always check for duplicates up front rather than relying on the duplicate-key exception.

Jan Winkler 15:11:06

this is just a proof of concept and there are problems like:
• the rows aren't deduplicated, so if there's a duplicate id in the input, the fn will barf
• I assume there could be problems if the rows coll was exceedingly large
• the txn handling probably should be done somewhere else (for instance, if this was part of a web app, I'd probably want to run the whole handler in a single txn to provide atomicity)
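For the first point, the input could be deduplicated by id before the insert; a small order-preserving sketch (helper name is illustrative):

(defn dedupe-by-id
  "Keep the first row seen for each :mytable/id, preserving input order."
  [rows]
  (->> rows
       (reduce (fn [[seen out] {:mytable/keys [id] :as row}]
                 (if (contains? seen id)
                   [seen out]
                   [(conj seen id) (conj out row)]))
               [#{} []])
       second))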

kwladyka 15:11:35

the goal here is to import data from files which are sometimes corrupted