This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2016-09-09
Channels
- # admin-announcements (1)
- # beginners (78)
- # boot (36)
- # cider (13)
- # cljs-dev (15)
- # cljsjs (3)
- # cljsrn (10)
- # clojure (99)
- # clojure-austin (1)
- # clojure-australia (1)
- # clojure-italy (14)
- # clojure-korea (17)
- # clojure-norway (1)
- # clojure-russia (42)
- # clojure-sg (1)
- # clojure-spec (50)
- # clojure-uk (80)
- # clojurebridge (24)
- # clojurescript (83)
- # community-development (10)
- # conf-proposals (36)
- # core-async (12)
- # cursive (20)
- # datomic (38)
- # hoplon (63)
- # lambdaisland (2)
- # leiningen (6)
- # nyc (2)
- # om (54)
- # om-next (52)
- # onyx (129)
- # planck (15)
- # re-frame (38)
- # reagent (2)
- # rethinkdb (8)
- # specter (1)
- # untangled (2)
@camechis with regard to Mesos deploys: in one iteration of a project I was deploying a Docker container that literally just submitted a job and then exited. It worked well, but I’d prefer to kill the existing job before deploying the new one. I need to write some code to clear out existing jobs by job name or UID from ZooKeeper first. If I find any more bits of information that I think might be useful I’ll add them here.
lib-onyx’s replica query server is useful for this sort of thing, because it can tell you what jobs are running. One thing that would be good to support there is queries by job metadata, although it’s pretty trivial to implement yourself
nice to know, thanks @lucasbradstreet
Also, we strongly recommend you supply your own job-id as part of the job metadata. This improves things in a few ways. First, you can store it somewhere when you deploy - it doesn’t need to be in Onyx, and it doesn’t need to be done by the container that submits. Second, you don’t get into messy situations when you accidentally submit the job twice (say Mesos started the first job-submit container and you’re not sure whether it submitted or not). Since job submission is idempotent if you use the same job-id, you can just run the submission again
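A minimal sketch of that pattern, assuming the usual Onyx job-map shape; the helper name is made up, and `:metadata {:job-id ...}` is the hook the advice above refers to:

```clojure
;; Sketch: attach a caller-supplied job-id via job metadata so that
;; resubmitting the same job is idempotent. Helper name is hypothetical.
(defn with-stable-job-id
  [job job-id]
  (assoc job :metadata {:job-id job-id}))

;; At deploy time, with a started peer-config, you would then do
;; something like:
;;   (onyx.api/submit-job peer-config (with-stable-job-id job my-job-id))
;; Running the same submission twice with the same job-id is a no-op.
```

The job-id itself can live in your deploy tooling (e.g. derived from a release identifier), which is what makes the "did the submit container actually submit?" question moot.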
Any recommendation for which ZooKeeper version to use?
Latest stable would be my recommendation. I haven’t used 3.4.9 myself, but I think they’re pretty conservative with the minor releases - I’d expect mostly bug fixes
@zamaterian I’ve used 3.4.6 for some time now, it gets on with the job perfectly.
I have a SQL table with 3.8 million records, and if my max-pending is 10, batch-size is 10 and rows-per-segment is 10, I consistently experience "Not enough virtual peers have warmed up to start the task yet, backing off and trying again…" and "java.io.IOException: Broken pipe" from zookeeper.ClientCnxn, until the job stops with "Log subscriber closed due to disconnection from ZooKeeper". My theory is that the number of log entries is too large; the problem goes away by increasing rows-per-segment to 100.
@zamaterian: that is pretty strange to be honest. I assume it's after the job has been operating for a while?
My first guess is that the job is going so slowly that peers are dropping on and off due to network issues. There's no good reason why an increased rows-per-segment would cause anything to happen in the log
No, it's during startup of the peer 🙂 btw we are running a pretty old ZooKeeper, 3.4.5
Oh I know the answer.
Right, so the way onyx-sql works is that it partitions up the key space and checkpoints it to ZooKeeper. Since you have a low rows-per-segment number it has to checkpoint more, and it's probably hitting ZooKeeper's 1 MB znode limit
We could definitely do a better job checkpointing less data
You would want bigger rows-per-segment and max-pending anyway. I suggested those low figures previously as a debugging sanity check
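Back-of-the-envelope arithmetic for the theory above; the per-partition byte size is an assumption for illustration, not a measured number:

```clojure
;; With 3.8M rows and rows-per-segment 10, onyx-sql tracks 380k
;; partitions. Assuming ~24 serialized bytes per partition entry,
;; the checkpoint blows well past ZooKeeper's default ~1 MB znode limit.
(def total-rows          3800000)
(def rows-per-segment    10)
(def bytes-per-partition 24) ; assumed, for illustration only

(def partition-count  (quot total-rows rows-per-segment))      ; => 380000
(def checkpoint-bytes (* partition-count bytes-per-partition)) ; => 9120000

(def znode-limit (* 1024 1024)) ; ZooKeeper's default max znode size

(> checkpoint-bytes znode-limit) ; => true, checkpoint won't fit
```

Under the same assumed entry size, rows-per-segment 100 gives 38,000 partitions and roughly 0.9 MB, which is consistent with the problem disappearing at that setting.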
It matches my theory; it's not really a problem for us, since our throughput is set for debugging 🙂
There'll be way too much overhead as evidenced by this issue.
Suggested fix for now would be for us to check the serialised data size and kill the job with a good error if it's bigger than 1MB
Thanks for the report :)
Anytime, it's a pleasure working with you guys.
A few bug fixes + improved docs are out in 0.9.10-beta2: http://www.onyxplatform.org/docs/user-guide/0.9.10-beta2/
Hi, I got this error with [org.onyxplatform/onyx-http "0.9.10.0-beta3"]:
#error {
:cause kw->fn does not exist
:via
[{:type clojure.lang.Compiler$CompilerException
:message java.lang.IllegalAccessError: kw->fn does not exist, compiling:(onyx/plugin/http_output.clj:1:1)
:at [clojure.lang.Compiler load Compiler.java 7391]}
{:type java.lang.IllegalAccessError
:message kw->fn does not exist
:at [clojure.core$refer invokeStatic core.clj 4119]}]
:trace
https://github.com/onyx-platform/onyx-http/commit/b1722d413d82c9ff545212eac7fb32427968b390
@michaeldrogalis Love the new Docs and especially in this new beta
@camechis Thanks!
Can you do lein clean
and try that again? We moved that function in this release.
Are you using Onyx 0.9.10-beta2 or 3 with the plugin? https://github.com/onyx-platform/onyx/blob/0.9.10-beta3/src/onyx/static/util.cljc#L18
Maybe your core version is behind your plugin version
I made a recursive Spec for flow condition predicates. It's pretty cool when you use conform, you get the exact AST back.
you guys must be loving spec, @michaeldrogalis
it's almost like the ability to take advantage of this sort of thing was designed right in 🙂
@robert-stuttaford We generated a full Spec for Onyx in a scary short period of time 🙂
Would like to get upgraded from Schema to Spec in core itself once 1.9 ships
i wonder when that'll actually be
I hope within 6 months.
my company will be old enough to start school in 6 months!
actually pretty insane what's happened in Clojure since i started
So much growth, it's really great.
If i have a text file as an input, what’s the simplest way to run that type of job on a production cluster?
kafka would be my recommendation
If you already have it
Yeah, I don’t on this (purpose built) cluster, but I’m running mesos, so I guess I can spin it up.
If you’re making a choice between spinning up a new SQL database or a new Kafka cluster, then the choice has more to do with familiarity, except that Kafka is a better fit for Onyx
Overall I would lean toward Kafka, but you should realise there might be a learning curve
Yeah. I’ve used it quite a bit in prod, but this is a one-off job, so was hoping to skip that part.
If it’s only small data, you could possibly use onyx-seq as your input
Maybe onyx-seq is a better way to go then.
Beat me to it. 😛
Just realise you might be delaying your issue
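For the text-file-as-input case being discussed, the seq itself is plain Clojure; the exact onyx-seq catalog/lifecycle wiring lives in that plugin's README, so only a segment-building helper (hypothetical name) is sketched here:

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: read a text file eagerly into one segment
;; (a map) per line, suitable for handing to onyx-seq as its input.
(defn file->segments [path]
  (with-open [rdr (io/reader path)]
    (->> (line-seq rdr)
         (mapv (fn [line] {:line line})))))
```

`mapv` is used deliberately: it realizes the whole seq before `with-open` closes the reader, which is also why this approach only suits "small data" as noted above.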
I’ve gone through a bunch of the learn-onyx repo now, but I’m still a bit fuzzy on what’s going on from a cluster standpoint.
That’s kinda up to you. You should read the code. It’s very simple. You could distribute it over :onyx.core/slot-id but you would have to think about how to do it
If you’re thinking that much and have access to a Kafka cluster, you should probably just stick it in Kafka though.
I think he's asking whether the work gets distributed across multiple machines - which is yes. If you're asking whether there is exactly one reader, or whether that gets distributed, then that's no - single reader.
Yep, single reader, though you could do multi reader with it, which is what I was saying with the slot-id comment
But you’d have to do it yourself
Kafka is a better choice 99.9% of the time
So I want to do: list of files | download file / split into records | process records | output processed record to elasticsearch
it’s totally kosher to output many more segments as output from a function than you received as input, right?
"A Function is a construct that takes a segment as a parameter and outputs a segment or a seq of segments."
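As a concrete illustration of the quoted rule (names hypothetical), a function that fans one segment out into many - e.g. the "download file / split into records" step described above:

```clojure
(require '[clojure.string :as str])

;; One segment in, a vector of segments out.
(defn split-into-records [{:keys [file-name content]}]
  (mapv (fn [line] {:file-name file-name :record line})
        (str/split-lines content)))

(split-into-records {:file-name "a.txt" :content "one\ntwo"})
;; => [{:file-name "a.txt", :record "one"}
;;     {:file-name "a.txt", :record "two"}]
```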
Yeah, return as many segments as you like. The only thing that I would say is that our current messaging model will have problems with high branching factors
So say, for example, you have exactly one segment on an input task, that leads to a million messages that take a second each, and a pending-timeout on that task of 180 seconds
Say your throughput, flowing through your entire job, is less than 1M messages per 180 seconds
Then you’re going to get a retry. Now because there was a single message that came out of the input task everything is going to be tried again
and this will repeat over and over
So the higher branching factor, the more likely you are to get into situations that don’t make forward progress
Now say you had 1000 messages, each which produced 1000 messages each. Then you might have some chance to make forward progress vs 1 message that produced 1M messages
both would probably be bad, but I’m making up an extreme example to illustrate the point
so even though the input task has finished processing a segment and sent it on, the timeout on that task applies to tasks further down the line in the workflow pipeline?
Yeah, it’s a tree of acks, so everything down the line depends on the original input being acked
Even if you do that, you should realise you’re doing that and tune the other knob, which would be to keep :onyx/max-pending low
max-pending 1 might mean 1000 messages being out there, vs max-pending 1000 meaning 1M, for example
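The two knobs discussed here live on the input task's catalog entry. This fragment is illustrative - the plugin name and values are made up - but `:onyx/max-pending` and `:onyx/pending-timeout` are the real task-map keys being referred to:

```clojure
;; Illustrative catalog entry; plugin name and values are examples,
;; not recommendations.
{:onyx/name :read-input
 :onyx/plugin :my.plugin/input        ; hypothetical input plugin
 :onyx/type :input
 :onyx/medium :my-medium
 :onyx/batch-size 10
 :onyx/max-pending 1                  ; few unacked input segments in flight
 :onyx/pending-timeout 180000}        ; ms before an unacked segment is retried
```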
but tbh, you should just stick it on Kafka
You’ll end up thanking yourself later
You’ll end up with more jobs, or you might end up building a job where you want to stick the output on Kafka again
and you’ll think yassss… I’ve decoupled it
This is awesome
No worries. Good luck
Thanks! It was tough with the four of us, heh. Glad you enjoyed it 🙂
Four people on the podcast I mean. Hard not to talk over each other!
I'm perf-tuning my onyx job and trying to find the bottleneck of input, functions, and output tasks.
If I want to isolate just the input task to see if it's the bottleneck, what's the best way to go about that?
My current thinking is to put a function in between the input and output which always returns an empty vector
Would that accomplish what I'm trying to do?
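The empty-vector idea sketched (function name hypothetical): because the function emits no segments, nothing reaches the output task, so measured throughput is dominated by the input task:

```clojure
;; Consume every segment and emit nothing downstream.
(defn discard-all [_segment]
  [])

(discard-all {:id 1}) ;; => []
```

Note the input plugin's own ack/checkpoint overhead is still in the measurement - which is the point of isolating it this way.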
@aengelberg: quickest way is to use onyx-metrics and start charting your metrics. Best guess will be the tasks with the highest batch latency, because that's approximately how long it's taking to process a batch.
Once you have those figures, I'd drill down further. If your bottleneck isn't an aggregation task, I'd probably try out Java mission control, which is an awesome profiler that can help you out with CPU bound tasks
But you'll want to have a good idea about your job profile first
jmc / flight recorder is awesome though. You should try it out even if you figure out your onyx bottleneck, trust me
At the moment we have a job that is reading from Kafka, deserializing Transit, and performing trivial functions, but only getting ~2k records per second, which feels low.
I've heard of those profiling tools, we'll probably check those out
Yeah that is low
@aengelberg Is the metrics suite I set up still functional? And are the tasks still mostly fused together? That's extremely slow, yes