#onyx
2015-10-22
michaeldrogalis01:10:49

@spangler: ZK has a socket limit, yeah

michaeldrogalis01:10:05

The min # of peers needed to run a job is the total number of tasks in that job, unless you set :onyx/min-peers on a task
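
For context, a hedged sketch of what that looks like in practice (task, function, and namespace names here are hypothetical, and the API shape is the one used by the 0.7/0.8-era starter projects):

;; A catalog entry that pins a task to a single peer via :onyx/min-peers,
;; plus starting enough virtual peers to cover every task in the job.
(require '[onyx.api])

(def catalog
  [{:onyx/name :followers
    :onyx/fn :my.app/gather-followers    ; hypothetical function
    :onyx/type :function
    :onyx/batch-size 20
    :onyx/min-peers 1}])

;; With, say, 5 distinct tasks in the workflow, start at least 5 peers.
(def peer-group (onyx.api/start-peer-group peer-config))
(def v-peers (onyx.api/start-peers 5 peer-group))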

spangler18:10:44

@michaeldrogalis Okay first things first. I upped our onyx version to 0.7.11 since when we started using it the latest was 0.6.0

spangler18:10:52

Now when I try to start onyx it hangs

spangler18:10:58

Any ideas there?

spangler18:10:04

Would be nice to use the latest version, especially since the AOT fix

lucasbradstreet18:10:54

So 0.6.0 to 0.7.11?

spangler18:10:55

The last thing I see is

org.apache.zookeeper.ClientCnxn - Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
org.apache.zookeeper.ClientCnxn - Socket connection established to 127.0.0.1/127.0.0.1:2181, initiating session
org.apache.zookeeper.ClientCnxn - Session establishment complete on server 127.0.0.1/127.0.0.1:2181, sessionid = 0x150675a060e0a09, negotiated timeout = 40000
o.a.c.f.state.ConnectionStateManager - State change: CONNECTED

spangler18:10:24

Then it just sits there not accepting requests or input

spangler18:10:34

I know zookeeper is working, because kafka uses it and it is working fine

lucasbradstreet18:10:01

These are the big changes as of 0.7.0

lucasbradstreet18:10:42

Of particular importance is the change from ident -> plugin

lucasbradstreet18:10:07

Seems odd that you're not seeing an exception though

spangler18:10:24

Hmm... that may be it

spangler18:10:05

Okay, so no more :core.async/write-to-chan either then

spangler18:10:19

which makes sense because I haven't even passed in the catalog at startup

spangler18:10:55

But that will probably fix something else ; )
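
For reference, a hedged sketch of the post-0.7 plugin-style catalog entry that replaces the old :core.async/write-to-chan ident form (keys as I understand them from the 0.7-era core.async plugin docs, worth double-checking):

;; Post-0.7 core.async output task; the plugin is named via :onyx/plugin
;; rather than :onyx/ident.
{:onyx/name :out
 :onyx/plugin :onyx.plugin.core-async/output
 :onyx/type :output
 :onyx/medium :core.async
 :onyx/batch-size 20
 :onyx/max-peers 1
 :onyx/doc "Writes segments to a core.async channel"}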

lucasbradstreet18:10:01

What messaging are you using? Should be schema checked still

lucasbradstreet18:10:53

We dropped both core.async and netty. Hopefully that's it. Try aeron

lucasbradstreet18:10:11

It short circuits locally so there should be no perf impact

spangler18:10:33

:onyx.messaging/impl :core.async ----> :onyx.messaging/impl :aeron ?

lucasbradstreet18:10:34

Have you inspected onyx.log? I'd really expect you to see an error somewhere

spangler18:10:47

No error in onyx.log

spangler18:10:54

tailing it all the time : )

lucasbradstreet18:10:05

Alright. I'll have to look at that.

spangler18:10:02

Hmm... still hung

spangler18:10:46

Peer config for reference

(def peer-config
  {:zookeeper/address "127.0.0.1:2181"
   :zookeeper.server/port 2181
   :onyx/id id
   :onyx.peer/job-scheduler :onyx.job-scheduler/balanced
   :onyx.messaging/ack-daemon-timeout 60000
   :onyx.messaging/impl :aeron
   :onyx.messaging/bind-addr "localhost"})

lucasbradstreet18:10:49

Hmm. Check the changes further. Sorry, I need to go to sleep so I won't be able to help you any more

lucasbradstreet18:10:08

The hanging does sound like a ZK connect issue though

lucasbradstreet18:10:27

Are you sure that's the right address and port?

spangler18:10:45

It was working in 0.6.0, also works if I switch back

spangler18:10:09

Thanks for your help though

lucasbradstreet18:10:27

Fair enough. Weird. You may need to also specify the ports that Aeron should use. Good luck.

lucasbradstreet18:10:36

Happy to help another time

spangler18:10:45

Is aeron another process I need to start?

lucasbradstreet18:10:46

Yeah, actually, you should be starting an env with an embedded media driver in it. Should be in the docs. If you're running locally only I don't think it'll be used at all, but it's worth doing

lucasbradstreet18:10:32

That's done via api/start-env and making sure your env-config includes the right embedded media driver settings which are in the docs
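
For reference, a hedged sketch of that env setup (the embedded-driver key is my best guess from the docs of this era and may belong in the peer-config in some versions; the Aeron port settings mentioned above are also in the docs):

;; Dev environment with Aeron's embedded media driver enabled.
(def env-config
  {:zookeeper/address "127.0.0.1:2181"
   :zookeeper/server? false                       ; ZK already running externally
   :onyx/id id                                    ; same UUID as in peer-config
   :onyx.messaging.aeron/embedded-driver? true})  ; assumed key; verify in the docs

(def env (onyx.api/start-env env-config))

;; ...and on shutdown:
;; (onyx.api/shutdown-env env)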

spangler18:10:42

Okay, thanks again!

spangler18:10:54

Oooookay reading through the change log was quite helpful

spangler18:10:05

Looks like now java 8 is required!

spangler18:10:12

Was still on java 7

spangler18:10:19

On to the next issue.....

michaeldrogalis18:10:59

On the bright side, 0.8.0 will be fully backwards compat with 0.7. Didn't need to break anything this round.

spangler18:10:21

Well, good to be running the latest version finally

spangler18:10:30

That seemed to do the trick

michaeldrogalis19:10:41

Nice, good to hear.

michaeldrogalis19:10:37

@spangler: What company are you at? Trying to get a feel of who's using it in industry.

spangler19:10:00

I am at Little Bird

spangler19:10:04

in Portland OR

spangler19:10:20

I work with Justin Smith, who says he has been in contact with you before

spangler19:10:46

We have had onyx around but been doing a bunch of other stuff to get our product ready for launch

spangler19:10:04

Now I am actually trying to get it to work, hence the barrage of questions : )

spangler19:10:44

@michaeldrogalis Which is a great segue into my next question

spangler19:10:21

It looks like take-segments! is not actually blocking until all of the tasks emit their :done sentinel

spangler19:10:28

which maybe is just a misunderstanding of mine

spangler19:10:03

This is my workflow:

(def gather-profiles-workflow
  [[:in :followers]
   [:in :timeline]
   [:in :blog]
   [:in :fullcontact]
   [:followers :out]
   [:timeline :out]
   [:blog :out]
   [:fullcontact :out]])

spangler19:10:38

Then I am periodically putting a segment onto the input channel

spangler19:10:05

This flows through the four tasks, and emits their output to the output channel

spangler19:10:43

Finally once I am done, I put a :done sentinel onto the input channel

spangler19:10:50

Then somewhere else I issue a take-segments! on the output channel

spangler19:10:04

And it just returns immediately, not waiting until all of the tasks have finished their processing

spangler19:10:36

So I am getting a bunch of partial results
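
For reference, a hedged sketch of the flow described above, following the example project's core.async pattern (channel names and buffer sizes are hypothetical):

(require '[clojure.core.async :as async :refer [>!! chan]]
         '[onyx.plugin.core-async :refer [take-segments!]])

(def input-chan (chan 1000))
(def output-chan (chan 1000))

;; Periodically put segments on the input channel...
(>!! input-chan {:user "some-user" :source :followers})

;; ...then, once finished, put the :done sentinel on and close the channel.
(>!! input-chan :done)
(async/close! input-chan)

;; Elsewhere, block until the sentinel reaches the output and collect results.
(def results (take-segments! output-chan))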

michaeldrogalis19:10:10

@spangler: Ah, cool shop. Nice.

spangler19:10:29

It is pretty cool

spangler19:10:41

Lots of fun algorithms to implement : )

michaeldrogalis19:10:51

:done can only be used once. If you put it on your channel more than once I have no idea what will happen

michaeldrogalis19:10:01

Nothing good, that's for sure

spangler19:10:13

Right, I only put :done in the input channel

michaeldrogalis19:10:21

Oh, nevermind, misread

spangler19:10:29

but that means that each of the tasks gets it

spangler19:10:37

which all flow to the output

spangler19:10:39

could that be the problem?

spangler19:10:03

When I do the take-segments! I am getting the :done in there at the end

michaeldrogalis19:10:04

Onyx will wait for all of the segments in flight to finish processing

spangler19:10:58

Hmm.... so I am getting a result from take-segments! before all the segments are processed by all of the tasks

spangler19:10:25

I have some printlns in there when the task starts, and I get my (partial) results and see tasks still firing

michaeldrogalis19:10:39

Are you forcibly shutting down the environment at any point?

spangler19:10:50

No, only on full system shutdown

michaeldrogalis19:10:56

Something sounds not right with how the channels are wired up. I'd need to see code to diagnose further. Can't dig in now though. I can tell you for certain that's not how Onyx works. You won't see the sentinel value downstream until all in-flight messages finish processing

spangler19:10:01

As another data point, I am still getting repeated messages of "Not enough virtual peers have warmed up to start the task yet, backing off and trying again..." in my onyx.log, even though take-segments! has returned

spangler19:10:53

Any leads until you are able to take a look at the code? I have been following your example project pretty closely...

michaeldrogalis19:10:54

That would indicate that your job never really started

spangler19:10:14

Hmm.... yet it did something!

spangler19:10:19

Ah, here's another data point

spangler19:10:37

It tries to start the same job again, even though I only hit submit-job once

michaeldrogalis19:10:38

Are you using lifecycles to hook into :lifecycle/task-start?

spangler19:10:55

No, what would I put there?

michaeldrogalis19:10:09

Nothing, just wondering if you were trying to use that for something.

spangler19:10:12

I am just using printlns in the functions that the task invokes with the segment

michaeldrogalis19:10:35

If you can send me a reproducer I can check it out tonight

spangler19:10:52

Okay, will do

spangler20:10:08

Does onyx retry a job after a certain time possibly?

michaeldrogalis20:10:47

It doesn't retry entire jobs, no. Tasks may be rebooted to other peers, possibly to the same peer. But not a whole job.

spangler20:10:06

So tasks are retried

spangler20:10:09

That might be what I am seeing

spangler20:10:48

under what conditions is a task retried?

spangler20:10:58

Or rebooted rather

michaeldrogalis20:10:56

Uncaught exception, but :onyx/restart-task-pred is specified in the catalog entry

lucasbradstreet20:10:53

^must be specified
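
For illustration, a hedged sketch of wiring that up (names are hypothetical, and the exact restart-predicate keyword should be verified against the docs for your version; it is quoted above as :onyx/restart-task-pred):

;; Restart the task on any uncaught exception.
(defn restart-on-any-error? [e]
  true)

{:onyx/name :followers
 :onyx/fn :my.app/gather-followers
 :onyx/type :function
 :onyx/batch-size 20
 :onyx/restart-task-pred :my.app/restart-on-any-error?}   ; key name as quoted above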

lucasbradstreet20:10:48

Are you using the same onyx/id between restarts with a persistent ZK? (Sounded like you were using another ZK server)

spangler20:10:20

@lucasbradstreet No, different UUID every time

lucasbradstreet20:10:23

Sometimes you can queue up multiple jobs without realizing it, across peer startups and submit-job runs

lucasbradstreet20:10:29

Ok. That's out then

lucasbradstreet20:10:22

I'd try it out on a simple workflow and see if you can reproduce it

lucasbradstreet20:10:32

Back to sleep. Gn

lucasbradstreet20:10:23

I guess another possibility is you could be accidentally returning nil from a task which could be interpreted as an empty list of segments, causing nothing to flow on. That'd be an amazing guess though
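
To make that concrete, a hedged sketch (hypothetical names):

;; An onyx/fn whose last expression is println returns nil, which (per the guess
;; above) may be treated as an empty list of segments, so nothing flows on.
(defn gather-followers-broken [segment]
  (println "processing" segment))           ; println returns nil

;; Returning the segment (or a seq of segments) keeps data moving downstream.
(defn gather-followers [segment]
  (println "processing" segment)
  (assoc segment :followers []))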

lucasbradstreet20:10:04

@michaeldrogalis: how do you feel about throwing an exception when an onyx/fn returns nil?

spangler20:10:10

What does this mean (from onyx.log)? "core.async input plugin stopping. Retry count: 1"

michaeldrogalis20:10:34

@lucasbradstreet: Not sure, need to benchmark and see how much it degrades perf

lucasbradstreet20:10:54

It means everything was sent and confirmed (acked) except for one segment. Sometimes that retried segment can be the :done, though

lucasbradstreet20:10:41

Make sure you've actually got a new channel with new data on it each time you test if you're testing from a repl. Otherwise look into functions that return nothing

lucasbradstreet20:10:59

@michaeldrogalis: it'll be pretty much 0 perf impact

lucasbradstreet20:10:32

I guess you have to add in all the exception throwing code too

michaeldrogalis20:10:48

We'll get it out tomorrow. Get back to sleep 😛

michaeldrogalis20:10:05

check it out, rather*

lucasbradstreet20:10:19

We already do quite a bit of stuff like that though so it's probably minimal. And yep, we'll test it tomorrow. ZZZZZ