#aws
2015-09-22
ragge 07:09:11

@alandipert: would love to hear more at some point if you have a few mins

ragge 07:09:14

@alandipert: we currently lean heavily on kafka... love the idea, not too fond of operating the implementation

alandipert 12:09:52

@ragge: re: kinesis, afaict it's all the affordances of kafka without the ops overhead

alandipert 12:09:04

i've never run kafka, but i can say our experience with kinesis has been good so far

alandipert 12:09:01

we have run into two minor problems with clients: lambda functions can only receive a maximum of 10k kinesis events per invocation, so we're not able to use lambdas for meaningful aggregation
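
A minimal sketch, in Python, of a Lambda handler consuming one of these Kinesis batches, using the documented Lambda/Kinesis event shape; the `process` step is a hypothetical placeholder:

```python
import base64
import json

def handler(event, context):
    # Lambda hands over at most a fixed number of Kinesis records per
    # invocation (10k in this discussion), so there's no way to
    # aggregate across batches from inside the function.
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        process(json.loads(payload))

def process(impression):
    # hypothetical per-event handler; real aggregation would need
    # state that outlives a single invocation
    print(impression)
```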

alandipert 12:09:12

also Spark 1.4 doesn't reliably consume from kinesis, but a fix is coming in 1.5 next week

alandipert 12:09:22

(we're running spark in EMR)

ragge 12:09:34

we use kafka for basically all data transport: logs, metrics, application data

ragge 12:09:45

do you archive stuff to s3?

ragge 12:09:04

or do you have other ways to re-process a stream?

alandipert 12:09:00

our events come from ad servers; on those servers we both push to kinesis and store to s3

alandipert 12:09:09

also we store as much as possible on those machines' disks
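
A minimal sketch, assuming boto3 and hypothetical stream/field/file names, of the dual write described above: every impression goes to Kinesis and to a local log that a separate job can ship to S3:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def record_impression(impression):
    line = json.dumps(impression)
    # push to the stream for the real-time consumers
    kinesis.put_record(
        StreamName="impressions",          # hypothetical stream name
        Data=line.encode("utf-8"),
        PartitionKey=impression["ad_id"],  # hypothetical partition key
    )
    # keep a copy on local disk; an archiver ships these files to s3
    with open("/var/log/impressions.log", "a") as f:
        f.write(line + "\n")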

alandipert 12:09:30

the thing we do is collect ad impressions, aggregate, and report on them

alandipert 12:09:56

for the "new" kinesis stuff we go from ad server -> kinesis -> spark aggregation -> s3 -> lambda "loader" function -> redshift
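
A sketch of what that lambda "loader" step could look like: a function fired by S3 ObjectCreated notifications that issues a Redshift COPY for each new aggregate file. The table, endpoint, credentials, and role here are all hypothetical, and it assumes psycopg2 is bundled with the function:

```python
import psycopg2

def handler(event, context):
    # hypothetical connection details
    conn = psycopg2.connect(
        host="example-cluster.redshift.amazonaws.com",
        dbname="analytics", user="loader", password="...")
    with conn, conn.cursor() as cur:
        for rec in event["Records"]:
            path = "s3://{}/{}".format(
                rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])
            # COPY pulls the newline-delimited JSON file into the table
            cur.execute(
                "COPY impressions_agg FROM %s CREDENTIALS %s JSON 'auto'",
                (path, "aws_iam_role=arn:aws:iam::123456789012:role/loader"))
    conn.close()
```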

ragge 12:09:28

ok, interesting

ragge 12:09:59

have you found the scaling ok to work with? increasing/decreasing shards

alandipert 12:09:16

we haven't done much scaling yet; the rate on the impression stream is pretty constant

alandipert 12:09:37

we still need to test resharding though

alandipert 12:09:02

because we'll eventually need to scale up. the KCL claims to handle this without intervention, but we haven't seen it in action yet
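
The KCL itself is out of scope here, but as a picture of what resharding involves, this is a sketch (boto3, hypothetical stream name) of splitting one shard at the midpoint of its hash-key range, which is how a stream is scaled up:

```python
import boto3

kinesis = boto3.client("kinesis")

def split_evenly(stream_name, shard):
    # split at the midpoint of the shard's hash-key range, yielding
    # two shards that each cover half of the keyspace
    lo = int(shard["HashKeyRange"]["StartingHashKey"])
    hi = int(shard["HashKeyRange"]["EndingHashKey"])
    kinesis.split_shard(
        StreamName=stream_name,
        ShardToSplit=shard["ShardId"],
        NewStartingHashKey=str((lo + hi) // 2))

desc = kinesis.describe_stream(StreamName="impressions")  # hypothetical
split_evenly("impressions", desc["StreamDescription"]["Shards"][0])
```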

ragge 12:09:16

thanks for the info

erik_price 12:09:22

alandipert: what serialization do you use to write your events to S3? (assuming you write multiple events to a single S3 object)

alandipert 12:09:51

@ragge: sure, would be curious to hear how it goes for you and to trade more notes

alandipert 12:09:09

@erik_price: newline-delimited JSON... mostly for historical reasons

erik_price 12:09:41

one line per event, S3 objects aggregating by some time interval?

alandipert 12:09:49

the "raw" impression log is one per event which goes on Kinesis, and then Spark aggregates into 30 minute intervals

alandipert 12:09:38

are you working on a reporting pipeline of some kind also?

erik_price 12:09:46

Just want to learn more about how other real-world deployments are doing this. We’re not capturing analytics events, but we run ingestion jobs periodically and the data we pull in gets written to S3. In our case the data is big enough that we have a one-item-to-one-S3-object relationship, but I am interested in how people are doing it with a larger number of smaller events.

ragge 13:09:31

@alandipert: quick follow-up: do you use the kpl when producing events?

alandipert 13:09:14

@ragge: yeah, the node one
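
Not the node KPL itself, but a sketch in Python of the batching such producer libraries do under the hood: buffer records and flush them with PutRecords, which accepts up to 500 records per call. The stream name is hypothetical:

```python
import json
import boto3

kinesis = boto3.client("kinesis")
buffer = []

def put(event, partition_key):
    buffer.append({"Data": json.dumps(event).encode("utf-8"),
                   "PartitionKey": partition_key})
    if len(buffer) >= 500:  # PutRecords limit per request
        flush()

def flush():
    if buffer:
        kinesis.put_records(StreamName="impressions", Records=buffer)
        del buffer[:]
```

A real producer would also check FailedRecordCount in the PutRecords response and retry the records that were rejected.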