#onyx
2017-11-20
twashing19:11:37

Hey, I just filed an issue on onyx-kafka for troubleshooting a simple job not writing to a topic. https://github.com/onyx-platform/onyx-kafka/issues/47

twashing19:11:59

Any ideas here would be great. I’m probably just missing something very simple.

eriktjacobsen19:11:53

Getting an error that looks like it's stemming from onyx itself: integer overflow

eriktjacobsen19:11:18

Anyone seen similar?

jasonbell20:11:56

So :conform-health-check-msg threw the exception. Do you have any lifecycles set up? (http://www.onyxplatform.org/docs/user-guide/0.12.x/#_example) If you don’t, the task will die and not restart, which will render the job broken and everything will need restarting.

jasonbell20:11:15

If you handle the exception you can return :restart so the task starts up again.
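
(A minimal sketch of the kind of handle-exception lifecycle calls map being described here, assuming a hypothetical namespace my.app.lifecycles; per the Onyx docs, :lifecycle/handle-exception may return :restart, :kill, or :defer.)

(ns my.app.lifecycles)

;; Calls map referenced from the job's :lifecycles entries.
(def handle-exception-calls
  {:lifecycle/handle-exception
   (fn [event lifecycle lifecycle-phase e]
     ;; Returning :restart tells Onyx to restart the task rather than
     ;; kill the whole job; :kill and :defer are the other options.
     :restart)})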

jasonbell20:11:43

Though it’s still prudent to look at what’s happening in the task itself and catch the exception in the first place, obviously.

lucasbradstreet21:11:19

I’ll pop in with some additional info later

eriktjacobsen21:11:01

@jasonbell So I see the task name, but since there was nothing in the stack trace and the code is this:

(defn conform-health-check-message
  [segment]
  (let [result (:result segment)
        ts (if-let [ts (:timestamp segment)]
             ts ; expect this to be in msecs
             (.getMillis (time/now)))
        output (-> segment
                   (select-keys [:hash :config :lambda :commit])
                   (assoc :timestamp ts :result (keyword result)))]
    (debug "Conformed: " output)
    output))

We don't have any lifecycles set up for this task, and it didn't look like we were modifying any integers, which is why I thought it might be in the internals of onyx.

lucasbradstreet21:11:32

It definitely is onyx internals. More when I’m done with a call.

eriktjacobsen21:11:18

Sure, no rush. The job successfully restarted from the resume-point, it seems. Thanks

jasonbell21:11:58

@eriktjacobsen As long as you’re sorted. Previously I’ve wrapped each task with lifecycle events just in case and then handled all the exceptions so the job doesn’t have a chance to fail.

lucasbradstreet21:11:55

Alright, so the problem there is that it took a really long time to write out a batch to the task downstream of your task, and we overflowed a long in terms of how many nanoseconds it took.

lucasbradstreet21:11:29

The second problem, as @jasonbell accurately described, is that you don’t have a handle-exception lifecycle on your tasks as a failsafe for whether to continue running the job.

lucasbradstreet21:11:19

So, I would think the actions for us are to fix the overflow. Your actions are to figure out why it might have taken so long for that task to write the batch, as well as add the exception lifecycle.
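
(Sketch of how such a calls map would be attached to the job, here to every task via :lifecycle/task :all; the var name is the hypothetical one from the sketch above.)

;; Added to the job map alongside :workflow, :catalog, etc.
{:lifecycles [{:lifecycle/task :all
               :lifecycle/calls :my.app.lifecycles/handle-exception-calls}]}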

lucasbradstreet21:11:05

My bad for assuming we would never overflow that long 😄

eriktjacobsen21:11:19

ah. Looking through logs, it seems there were some ZK timeouts happening around that time.

lucasbradstreet22:11:59

Yeah, I’m guessing you got blocked downstream, and so upstream was trying to offer the segments to it and got stuck.

lucasbradstreet22:11:30

@jasonbell hah, I have a helper just like that. Actually in this case it’s already a long, but nanoseconds are kinda big to start with 😮, so we overflowed the long anyway.

lucasbradstreet22:11:55

I’m actually not sure how that overflowed, as it would have had to be a lot of hours (many many thousands)
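
(For scale, not from the conversation: a signed 64-bit nanosecond counter only overflows after roughly 2.5 million hours, i.e. about 292 years, which is why a weekend-long accumulation is surprising.)

;; Overflow horizon of a long holding nanoseconds.
(/ Long/MAX_VALUE 1e9 3600)        ;; => ~2.56e6 hours
(/ Long/MAX_VALUE 1e9 3600 24 365) ;; => ~292 years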

lucasbradstreet22:11:13

I’ll have to figure it out anyway.

lucasbradstreet22:11:58

Ahhh, it’s not resetting the accumulated time when you’re processing batches of 0 size, so if you have a long-running job that isn’t receiving any segments, it’ll continue to accumulate. How long was that job running for, approximately?
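
(A hypothetical sketch of the fix described, not Onyx’s actual internals: reset the accumulated time whenever a batch of size 0 comes through instead of carrying it forward.)

;; Hypothetical accumulator illustrating the described fix.
(defn accumulate-batch-time
  [accumulated-ns batch elapsed-ns]
  (if (zero? (count batch))
    0                               ;; empty batch: reset the accumulator
    (+ accumulated-ns elapsed-ns))) ;; otherwise keep accumulating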

eriktjacobsen22:11:37

The weekend, since Friday. Looks like timeouts were happening here and there, but they started ramping up majorly about an hour before the exception, which is ultimately what stopped the job.

lucasbradstreet22:11:42

OK, the overflow still doesn’t completely make sense to me then.

lucasbradstreet22:11:43

Anyway, I’ll put in some code to prevent the overflow, and with the lifecycle addition the job would have recovered.
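
(And a hedged sketch of one way to keep the accumulation from wrapping negative regardless, via Math/addExact; hypothetical, not the actual patch.)

;; Add nanosecond durations, resetting instead of wrapping on overflow.
(defn safe-add-ns
  [accumulated-ns elapsed-ns]
  (try
    (Math/addExact (long accumulated-ns) (long elapsed-ns))
    (catch ArithmeticException _
      0)))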