#onyx
2016-07-18
codonnell00:07:10

@michaeldrogalis: I get a 404 when I try to access the api docs for 0.9.7 or 0.9.8.

michaeldrogalis00:07:38

@codonnell: Eek, alright. Checking it out now.

michaeldrogalis00:07:43

Thanks for the report.

michaeldrogalis01:07:35

@codonnell: Should be fixed in a few minutes when GitHub's CDN picks it up.

codonnell01:07:54

@michaeldrogalis: Great, thanks for the quick fix. 🙂

michaeldrogalis01:07:43

Jekyll wasn't handling the final / gracefully - http://www.onyxplatform.org/docs/api/0.9.8/

ben-long15:07:07

Hi, Does anyone have any experience outputting to Redshift? I'm looking at processing incoming data from Kafka, carrying out some filtering and transformation and then outputting the data in batches to Redshift, but I haven't come across any specific plugins. My other thought was to write the data to S3, but I noticed that Onyx currently doesn't support it.

gardnervickers15:07:22

We do not currently have a Redshift plugin, although one is planned. We're currently revamping our stream processing engine and simplifying our plugin interface, so until that work is finished it's unlikely we'll be actively working on new plugins.

lucasbradstreet15:07:48

Output plugins won't change much, so if you want to give it a crack I'd be happy to advise.

lucasbradstreet15:07:09

We also have onyx-amazon-s3, which does support writing to S3.

ben-long15:07:00

@lucasbradstreet: Thanks, that sounds great. I'll take a look at building a Redshift output. I've managed to connect to it via a JDBC Java library before, so I might be able to rehash some old code.

gardnervickers15:07:55

Oh I didn't know it supported JDBC. You should be able to use the onyx-sql plugin then.

ben-long15:07:38

cool, that would make my life easier and one less thing to learn 🙂

ben-long15:07:12

Looks like onyx-sql will do the trick. I'll just need to specify 'com.amazon.redshift.jdbc41.Driver' as the classname and make sure the driver jar is stored locally, as AWS only makes the driver available for download.
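For anyone finding this later, a Redshift output task map for onyx-sql might look something like the sketch below. The `:sql/*` keys and the plugin function name are from memory and should be checked against the onyx-sql README; the host, database, table, and credentials are placeholders.

```clojure
;; Hypothetical onyx-sql catalog entry for a Redshift output task.
;; Verify key names and the plugin var against the onyx-sql docs.
{:onyx/name :write-to-redshift
 :onyx/plugin :onyx.plugin.sql/write-rows
 :onyx/type :output
 :onyx/medium :sql
 :sql/classname "com.amazon.redshift.jdbc41.Driver"   ; Redshift JDBC driver, stored locally
 :sql/subprotocol "redshift"
 :sql/subname "//my-cluster.example.us-east-1.redshift.amazonaws.com:5439/mydb"
 :sql/user "admin"
 :sql/password "secret"
 :sql/table :events
 :onyx/batch-size 50
 :onyx/doc "Writes segments to a Redshift table over JDBC"}
```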

klmn15:07:19

It strongly depends on the batch size. If you have a normal bulk size, then the only option is to transfer the data to S3 and use the COPY SQL command.

ben-long15:07:11

@klmn: Thanks, that's something I need to investigate. The data we're looking at processing will be hitting Kafka at up to 1000 1 to 2 megabit files a second. At the moment all our jobs spit out data to S3 and then we have an S3 event set up to trigger the load, but this only happens once a day and all the data is written at the same time. Now we're looking at processing up to 12TB a day in near real time; if I can remove S3 from the equation it would make life easier, as I won't have to come up with a trigger for the load, or find a way to clean up all the old data files littering S3.

gardnervickers15:07:55

There's a big difference between 1000 1-2mb files/second and 12tb a day.

gardnervickers15:07:13

Will there be a filter step on Kafka?

gardnervickers15:07:25

before the actual processing happens?

klmn15:07:44

@ben-long: Unfortunately, S3 is the only way to get this data into Redshift. Plus, storing preprocessed logs on S3 is always a good idea. Sometimes Redshift has issues with uploading data, and then you have to retry. The best way IMHO: write small files to S3, then create a Lambda function in AWS which, once a file appears on S3, automatically loads it into Redshift, retrying on failure.

ben-long16:07:54

@gardnervickers: I hadn't realised I could filter on Kafka (we have an existing queue but until now, it's been a separate thing that I haven't needed to worry about), but if we can, that would be a good time to add a very high level filter - we're expecting a high volume of duplicate data

gardnervickers16:07:31

Ah, de-duplication is a bit more nuanced. I don’t know if the Kafka ETL tools support that yet.

ben-long16:07:54

@klmn: I was concerned about data loss, and everyone in my team is used to working with outputting to S3, so that would seem like the easiest way to proceed

ben-long16:07:14

@gardnervickers: Nuts, I was hoping to palm that task off to the team that look after the Kafka queue 😉

gardnervickers16:07:59

Due to your data volume it’s hard to recommend a generic solution.

ben-long16:07:31

That's ok, we're planning on starting with a much smaller volume and then seeing if we can scale up. This is a new project for us, and we've never needed to process realtime data before, so it's going to be a fun ride!

klmn16:07:13

@ben-long: We were using Redshift to process around 30TB per day. We've since switched away from it, but the Kafka -> S3 -> Redshift pipeline worked great.