#clojure-uk
2016-07-29
agile_geek08:07:57

ETL with Clojure, recommended approaches and tools? Go!

korny08:07:56

Hmm, the last ETL work I did in clojure we rolled our own. But that was about 4 years ago, so there may be better tools available now

agile_geek08:07:14

@korny: I was looking at Onyx

agile_geek08:07:32

but Spark and Flambo might work too

agile_geek08:07:19

I'm not sure I need this level of scalability as I haven't even won the contract with the client yet, but I'm just looking for things to investigate around creation of a 'data lake'...whatever that means to this particular org!

agile_geek08:07:38

From what little I know atm it's likely to be using a mixture of calls to SOAP API end points and raw JDBC connections to extract.

agile_geek08:07:20

I'm thinking simply dumping data to S3 for storage at rest but might be HDFS

glenjamin08:07:41

my last job are using Kinesis + Lambda + Redshift to good effect

glenjamin08:07:57

with S3 for longer term storage of raw events

agile_geek08:07:07

@glenjamin: cool. Will investigate that stack. I've briefly looked at Lambda but tbh it feels like overkill from what little I know.

glenjamin08:07:32

it might be the opposite, you only pay for compute time

glenjamin08:07:43

and you can connect directly to kinesis

glenjamin08:07:50

so it’ll scale up to match queue size

agile_geek08:07:01

This feels more like a batch processing exercise tbh. However, I'm really open to anything as I'm investigating solutions to understand their advantages and constraints but the initial bid will be for a discovery consultancy piece to identify problems and solution spaces. I hope (if I win work) to follow that up with a bid for solving one or more of the problems.

benedek09:07:47

we do something very similar but with kafka instead of kinesis (no lambda naturally)

benedek09:07:01

for historic reasons (eg no kinesis when the system was set up)

benedek09:07:14

parts of this are open sourced

benedek09:07:04

for saving raw kafka data on s3 for example

benedek09:07:33

to load data into redshift

benedek09:07:57

that said we plan to go the kinesis&lambda route eventually too

agile_geek09:07:36

@benedek: @glenjamin looks like I've got plenty of homework to do! Thanks guys.

benedek09:07:52

haha enjoy! 🙂

agile_geek09:07:15

Of course it could all be moot if I don't win the work!

dominicm09:07:25

Is kinesis better than kafka? I never feel comfortable using AWS products as I worry about hosting lock-in.

benedek09:07:22

define better? 😉

dominicm09:07:26

I guess, in what regards is Kinesis an improved product over Kafka? Features? Performance? I don't entirely know what I'm looking for.

benedek10:07:36

biggest advantage i guess is that it is running on somebody else’s computer 😉

lsnape10:07:59

@dominicm: One distinction is that Kinesis only keeps 24 hours of events. I don’t think Kafka has any limit other than disk storage

benedek10:07:58

@lsnape: i think that can be extended to 7 days now

benedek10:07:30

> Data records are accessible for a default of 24 hours from the time they are added to a stream. This time frame is called the retention period and is configurable in hourly increments from 24 to 168 hours (1 to 7 days). For more information about a stream’s retention period, see Changing the Data Retention Period.

lsnape10:07:57

Also at my last job we used Kinesis as part of our email sending pipeline. Twice during a deploy it decided to reset the marker to the previous day, resulting in 20k emails being sent to customers 😕

lsnape10:07:56

Oh that’s good to know benedek

lsnape10:07:26

It was probably a bug in the amazonica consumer code we were using.. but still, that shouldn’t happen!

benedek10:07:06

oh well, you can easily end up with something like this in the kafka world too. we had something similar when the ping timeout (or similarly named config property) was set too low for the zookeeper cluster we used for our kafka installation

benedek10:07:00

this is more like a characteristic of this architecture i think… (not meaning you are bound to have such ‘bugs’ but you have to prepare for these kinds of situations…)

lsnape10:07:16

the voice of experience 🙂 definitely will factor in the events pipeline going nuts next time I work on batch/stream processing

benedek10:07:46

yeah after such ‘hiccups’ we built in some replayability and with things like emails (customer facing stuff): we basically send a warning 15 mins before the real email where you can easily block the real emails going out

benedek10:07:50

so far that worked well
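
A minimal sketch of the "warn first, send later, allow blocking" approach benedek describes, assuming an in-memory queue and timestamps passed in as plain seconds. The class `HeldOutbox` and its method names are hypothetical, not taken from any system mentioned in the chat; a real pipeline would persist the pending batches and send the warning through its own channel.

```python
class HeldOutbox:
    """Hypothetical outbox that holds customer-facing emails for a grace period."""

    def __init__(self, hold_seconds=15 * 60):
        # 15 minutes matches the warning window mentioned in the chat.
        self.hold_seconds = hold_seconds
        self._pending = {}  # batch_id -> (release_at, emails)

    def schedule(self, batch_id, emails, now):
        """Queue a batch; this is where the warning would be sent to operators."""
        release_at = now + self.hold_seconds
        self._pending[batch_id] = (release_at, list(emails))
        return release_at

    def block(self, batch_id):
        """Drop a pending batch. Returns True if there was something to block."""
        return self._pending.pop(batch_id, None) is not None

    def release_due(self, now):
        """Return (and forget) every email whose hold period has elapsed."""
        due = [b for b, (release_at, _) in self._pending.items() if release_at <= now]
        released = []
        for batch_id in due:
            released.extend(self._pending.pop(batch_id)[1])
        return released
```

Passing `now` explicitly keeps the sketch deterministic; a scheduler would call `release_due` periodically with the current time and hand the result to the real sender.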

dominicm10:07:08

Idempotence is highly important when building event-driven systems, I understand

lsnape10:07:16

Slightly different kettle of fish, but I’d always turn to SQS before reaching for a pub/sub like Kafka + Kinesis. Much more of a known quantity, and far more reliable.

glenjamin11:07:26

if you’re familiar with kafka, it’s probably better

glenjamin11:07:00

the biggest gain is not having to mess around with the operational setup/running of kafka

glenjamin11:07:19

the automatically scaled lambda subscriptions are really nice though

korny11:07:04

re: idempotency - definitely - we’re finding all sorts of cases where JMS brokers decide to re-play messages when the network goes screwy. Idempotency can be tricky to handle though.

dominicm11:07:14

So, datomic has some nice features with allowing you to attach an event id to a transaction

dominicm11:07:21

That's an option.

dominicm11:07:46

Could a k/v store of event ids, kept in something like Cassandra or Dynamo, be useful for that too?

dominicm11:07:59

It's not true idempotence, of course, but close enough for most cases.

dominicm11:07:02

In practice this might not be true, but this is my understanding from reading.
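
A minimal sketch of the event-id lookup idea dominicm raises, assuming a plain Python `set` stands in for the k/v store (Cassandra, Dynamo, or similar). `DedupingHandler` and its parameters are hypothetical names for illustration only.

```python
class DedupingHandler:
    """Hypothetical wrapper that skips events whose id has been seen before."""

    def __init__(self, handler, seen=None):
        self._handler = handler
        # Stand-in for an external k/v store keyed by event id.
        self._seen = set() if seen is None else seen

    def handle(self, event_id, payload):
        """Run the handler once per event id. Returns True if it ran."""
        if event_id in self._seen:
            return False  # replayed message: do nothing
        self._handler(payload)
        # Recorded only after the handler succeeds, so a crash between the two
        # steps can still cause one duplicate on replay.
        self._seen.add(event_id)
        return True
```

Usage, with a list standing in for the side effect:

```python
sent = []
h = DedupingHandler(sent.append)
h.handle("evt-1", "welcome email")
h.handle("evt-1", "welcome email")  # replay is ignored
```

The gap between running the handler and recording the id is why this is "not true idempotence": it narrows the window for duplicates rather than closing it.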

agile_geek12:07:43

Kafka can work well for this too.

agile_geek12:07:30

However, I don't think the thing I'm going to be looking at is a 'real time' streaming problem

agile_geek13:07:21

@korny: yeah, was looking at it a couple of days ago but haven't used it yet.

korny13:07:08

I’m not sure I’d use it for infrastructure automation, there are better tools out there for getting Amazon to behave - my team are using Terraform and seem to think it’s pretty good. But for fiddling with infrastructure quickly, it’s neat. I’m looking at getting Jepsen to fiddle with security groups, and it looks like it’ll be nice and simple.

thomas14:07:53

Clojure O’Clock clj

agile_geek15:07:26

Train O'Clock

dominicm15:07:00

yawn o'clock

dominicm15:07:07

My girlfriend just came downstairs to tell me she's just had a nap, and is wide awake

xlevus15:07:17

that's the worst

xlevus15:07:29

inevitably you're now going to be kept awake

dominicm15:07:29

Yes. Yes I am

Pablo Fernandez16:07:38

Anybody here working for the Daily Mail?

glenjamin16:07:50

They’re usually very keen to note it’s the Mail Online

dominicm16:07:29

@pupeno: What would somebody there be volunteering themselves up for by answering?

xlevus16:07:42

I applied at Mail Online, their interview process is weird, didn't do the technical test as I got another job. That's as far as I went

nha16:07:57

@pupeno: Also applied and accepted their offer, but I haven't started yet. @xlevus what did you find weird if you don't mind sharing it ?

xlevus16:07:26

their HR pre-screening questionnaire

xlevus16:07:54

at one point it was just "Potato?" "I don't understand, can you elaborate" "Well, do you... potato?"

xlevus16:07:57

rinse/repeat

xlevus16:07:04

"ok, put me down as a 9, I potato"

dominicm16:07:56

Did they really say, potato

nha16:07:07

Wow ok I don't remember that at all.

nha17:07:34

I found the rating email I had - definitely no potato here (and it was 1 to 5). But maybe different teams have different questions

Pablo Fernandez17:07:28

I was approached by them, so, I’d like to know what’s it like to work there.

dominicm18:07:09

@pupeno: that's what I was asking 😃

Pablo Fernandez18:07:37

Ah… ok 🙂