#onyx
2018-10-05
morri 16:10:40

I'm preparing batches of segments to insert into a database which requires large batch inserts to be efficient. Is a reduce task with a window and a trigger the way to do this, or am I missing something obvious? batch-fn doesn't look appropriate, since it requires the number of segments produced to be the same as the number of input segments.

lmergen 18:10:49

you can configure a batch-size on the output connector

lmergen 18:10:02

that will make onyx (try to) work in batches of that size

lmergen 18:10:34

if you combine that with, say, the postgresql copy implementation (part of onyx-sql), you can actually have fairly efficient inserts
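
To make lmergen's suggestion concrete: an output task map along the lines of onyx-sql's write-rows task might look like the sketch below. The connection values are placeholders, and :onyx/batch-size is the knob that controls how many segments onyx hands to the plugin at a time.

{:onyx/name :write-rows
 :onyx/plugin :onyx.plugin.sql/write-rows
 :onyx/type :output
 :onyx/medium :sql
 :sql/classname "org.postgresql.Driver"   ; placeholder connection settings
 :sql/subprotocol "postgresql"
 :sql/subname "//localhost:5432/example"
 :sql/user "user"
 :sql/password "password"
 :sql/table :events
 :onyx/batch-size 1000                    ; segments handed to write-batch per call
 :onyx/doc "Writes batches of segments to the events table"}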

morri 18:10:40

Thanks @lmergen -- I was hoping for something like this. Looking at onyx-sql though it seems that it always inserts each segment as a separate transaction. Maybe I'll need to implement an output plugin myself. The database -- clickhouse -- accepts writes by http request, so jdbc isn't helpful anyway. https://github.com/onyx-platform/onyx-sql/blob/0.13.x/src/onyx/plugin/sql.clj#L168
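
For context on that HTTP interface: clickhouse takes the INSERT statement as a query parameter and the row data in the POST body (port 8123 by default), so a whole batch can go out in one request. A rough sketch with clj-http -- the events table, :col1 key, and TabSeparated encoding are made up for illustration:

(require '[clj-http.client :as http]
         '[clojure.string :as str])

;; sketch: one INSERT over clickhouse's http interface per batch of rows;
;; table `events`, key :col1, and the TabSeparated encoding are illustrative
(defn insert-batch! [rows]
  (http/post "http://localhost:8123/"
             {:query-params {"query" "INSERT INTO events FORMAT TabSeparated"}
              :body (str/join "\n" (map :col1 rows))}))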

morri 18:10:06

@lmergen Do you happen to know why each segment is inserted as a separate database call rather than batching all the segments into a single database call?

lmergen 18:10:41

i think it was chosen initially out of simplicity

lmergen 18:10:48

"make it work, then improve"

morri 18:10:54

Got it -- so to write all the batches together, I would do something like:

(write-batch [this {:keys [onyx.core/write-batch]} replica messenger]
             ;; one transaction per onyx batch: concatenate every segment's
             ;; :rows and hand them all to a single insert
             (jdbc/with-db-transaction [conn pool]
               (insert-fn conn (mapcat :rows write-batch)))
             true)

lmergen 19:10:50

yep, pretty much

👍 4
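
For reference, insert-fn in that sketch stands in for a multi-row insert. With clojure.java.jdbc, the column-vector arity of insert-multi! issues a single batched INSERT (the map arity inserts row by row); the :events table and :col1 column are hypothetical:

(require '[clojure.java.jdbc :as jdbc])

;; sketch: one batched INSERT for the whole onyx batch
(defn insert-fn [conn rows]
  (jdbc/insert-multi! conn :events [:col1] (mapv (juxt :col1) rows)))
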
lmergen 18:10:16

i implemented copy support for postgresql about a year or so ago

lmergen 18:10:36

you can take a look at it and see whether you can do something similar for clickhouse

lmergen 18:10:53

it would be the best way to do it, anyway

morri 18:10:59

In the case of the postgresql copy -- would the right approach be to have an upstream task that combined multiple rows into a single segment? Would this upstream step be using windows and triggers?

lmergen 18:10:43

well, then you would need to make the sql plugin "understand" these batches -- since you would probably be putting vectors inside single segments in this case

lmergen 18:10:48

i wouldn't go there

lmergen 18:10:23

you will always want inserts to be as efficient as possible, so it makes the most sense to reuse the actual batches that onyx uses

lmergen 18:10:59

then you can configure this using :onyx/batch-size in your task map

morri 18:10:49

I think the sql plugin would accept this segment right? -- {:rows [{:col1 "row1-val"} {:col1 "row2-val"}]} But I see what you are saying about using the batching mechanism.

lmergen 18:10:29

oh yeah that's correct

lmergen 18:10:29

i think a window could work in this case, but.. you will probably want to have logic like "every 3 seconds or every 3000 rows, whichever is reached first"

lmergen 18:10:40

i have found it difficult to define triggers like that

lmergen 18:10:48

maybe @lucasbradstreet knows how to do that
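
One way to approximate that "whichever is reached first" logic is to hang two triggers off the same window -- one counting segments, one on a timer -- both flushing through the same sync function. A sketch using onyx's built-in triggers; the :batch-window window and ::write-rows! sync fn are hypothetical, and because the two triggers fire independently this only approximates strict either/or semantics:

[{:trigger/window-id :batch-window
  :trigger/id :by-count
  :trigger/on :onyx.triggers/segment
  :trigger/threshold [3000 :elements]              ; fire every 3000 segments...
  :trigger/refinement :onyx.refinements/discarding
  :trigger/sync ::write-rows!}
 {:trigger/window-id :batch-window
  :trigger/id :by-time
  :trigger/on :onyx.triggers/timer
  :trigger/period [3 :seconds]                     ; ...or every 3 seconds
  :trigger/refinement :onyx.refinements/discarding
  :trigger/sync ::write-rows!}]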

morri 18:10:36

I'm aiming more towards batches of 1 million rows and will have enough traffic that I don't need to worry about it taking too much time to accumulate enough rows

lmergen 18:10:36

well then, luxury problems 🙂

morri 18:10:39

Thanks for discussing -- The options I'm seeing are:
1. Prepare the batches of rows using a window aggregation and then submit the results using the onyx-http output plugin
2. Roll the batch preparation and database submission into my own output plugin and rely on :onyx/batch-size

lmergen 18:10:49

if you need help with it, i would be happy to assist

lmergen 18:10:24

i've been meaning to take a look at clickhouse anyway, been hearing a lot about it the past year 🙂

morri 19:10:16

As a side question do you know of any examples using onyx.plugin.protocols/prepare-batch?

lmergen 19:10:25

i think the idea is that prepare-batch should be pure

lmergen 19:10:36

or rather, can be called multiple times without nasty side effects

lmergen 19:10:02

but i hardly ever use prepare-batch myself

morri 19:10:29

For clickhouse the rows need to be loaded into a large byte array using some classes provided by clickhouse, so I thought prepare-batch might be the right place to do this.

lucasbradstreet 19:10:41

If you use prepare-batch it's mostly to set up a buffer that you drain with write-batch, since write-batch may be called multiple times. Usually you can get away without it, but it's useful sometimes.

lmergen 19:10:26

^ sounds like prepare batch is the right place then for what you want

🧡 4
lucasbradstreet 19:10:35

Right, that could be a good place to put it

🧡 4
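
Pulling the thread together: a sketch of the buffer pattern Lucas describes, where prepare-batch does the expensive encoding once and write-batch drains the buffer so it can safely be retried. Only the Output half of the plugin protocol is shown, and encode-rows and flush-to-clickhouse! are hypothetical helpers:

(ns my.clickhouse.plugin
  (:require [onyx.plugin.protocols :as p]))

(declare encode-rows flush-to-clickhouse!) ; hypothetical helpers

;; construct with (->ClickHouseOutput (atom nil))
(defrecord ClickHouseOutput [buffer]
  p/Output
  (prepare-batch [this event replica messenger]
    ;; encode the whole onyx batch into one payload, once
    (let [batch (:onyx.core/write-batch event)]
      (reset! buffer (encode-rows (mapcat :rows batch))))
    true)
  (write-batch [this event replica messenger]
    ;; drain the buffer; write-batch may be called more than once
    (when-let [payload @buffer]
      (flush-to-clickhouse! payload)
      (reset! buffer nil))
    true))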