#onyx
2017-11-22
kenny04:11:52

When testing out Onyx + Kafka my peers continually restart after throwing this exception:

java.lang.NullPointerException: 
    clojure.lang.ExceptionInfo: Caught exception inside task lifecycle :lifecycle/initializing. Rebooting the task. -> Exception type: java.lang.NullPointerException. Exception message: null
       job-id: #uuid "30fbff49-4319-a757-6a0b-b7f28f795426"
     metadata: {:job-id #uuid "30fbff49-4319-a757-6a0b-b7f28f795426", :job-hash "db786ebce6742af642d9d60805225644110d2b6ded6bfd599e2f3823e62c4"}
      peer-id: #uuid "c01c9e46-3a41-efb7-88f7-7f2dd7b8cf7c"
    task-name: :in
That is the whole stacktrace. Seems like it is cut off but that's it. Any ideas on how to proceed?

lucasbradstreet04:11:22

That’s odd. I think there should be more to the stack trace. I assume this is from onyx.log?

kenny04:11:14

Yes. I do have these additional config settings:

{:level     :info
 :appenders {:println {:ns-whitelist ["compute.*" "dev.*"]}
             :spit    (assoc (appenders/spit-appender {:fname "server.log"})
                        :ns-blacklist ["org.apache.zookeeper.*"
                                       "onyx.messaging.*"])}}
I wouldn't imagine that would affect the printing of a stacktrace though.

lucasbradstreet04:11:10

Ok. If your error is reproducible, turning that off and trying again is all I can really think to try

kenny04:11:45

Very reproducible. Will try that now.

kenny04:11:53

The full stacktrace is now visible. It wasn't hidden by that log config after all; it was the JVM optimization that omits the stack trace root for frequently thrown exceptions.

lucasbradstreet04:11:35

Ah. That one. I do usually overrule that in my dev JVM opts
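For reference, the optimization in question is HotSpot's fast-throw behaviour, which replaces hot exceptions with preallocated, traceless instances. It is controlled by the `-XX:-OmitStackTraceInFastThrow` flag. A minimal sketch of overriding it in dev JVM opts, assuming a Leiningen project (the project name is hypothetical; boot users would pass the same flag through their JVM options):

```clojure
;; project.clj — disable HotSpot's fast-throw optimization so that
;; frequently thrown exceptions keep their full stack traces in dev.
(defproject my-onyx-job "0.1.0-SNAPSHOT"
  :jvm-opts ["-XX:-OmitStackTraceInFastThrow"])
```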

lucasbradstreet05:11:53

Anything we need to be worried about?

kenny05:11:06

Yep just added that to my JVM opts. Nope.

lmergen08:11:49

now that we're talking about this, i recall there was a JVM option that lets you emit a stderr log any time an exception is thrown. that bypasses any logging framework. or am i confused?

lucasbradstreet08:11:34

Not sure about that one. I know there is a way to hook into unhandled exceptions, which is handy when you have a lot of threads doing things.
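The hook mentioned here can be sketched in Clojure with a JVM-wide default uncaught-exception handler; this is illustrative rather than anything Onyx-specific:

```clojure
;; Install a process-wide handler for exceptions that escape any
;; thread. Output goes to stderr, bypassing logging frameworks.
(Thread/setDefaultUncaughtExceptionHandler
 (reify Thread$UncaughtExceptionHandler
   (uncaughtException [_ thread ex]
     (binding [*out* *err*]
       (println "Uncaught exception on thread:" (.getName thread)))
     ;; printStackTrace writes to stderr by default
     (.printStackTrace ex))))
```

This catches only exceptions that would otherwise kill a thread; exceptions caught and logged by Onyx's own lifecycle handling never reach it.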

lmergen09:11:53

hmmm might be mistaking jvm for another language

djjolicoeur12:11:18

question regarding job submission. we have four peers running and they all submit the same set of 3 jobs. for 2 of the jobs, submission returns success on all 4 peers. for the 3rd, success is returned on only one peer and the other 3 peers fail with :incorrect-job-hash. is this just the result of a race condition in job submission, or are we somehow generating different job hashes on our peers? I believe the job only needs to be submitted once to the cluster, but I just want to make sure I understand what is happening here. we are also running on 0.9, currently.

michaeldrogalis15:11:01

@kenny That JVM opt kills me every time

michaeldrogalis15:11:59

@djjolicoeur I think you're confused about how job submission works. You only want to submit each job once - and probably not from the peers on start up. I'd start them from a different process

michaeldrogalis15:11:43

If you're running into a hash error, you're trying to submit the same job ID with different job content

djjolicoeur15:11:18

thanks @michaeldrogalis, we are actually looking to make that change WRT job submission in the near future. I essentially inherited this system and, to date, the current job submission process has not caused any issues other than that submission quirk I mentioned. that being said, the content of the jobs should be idempotent, so I need to track down what the diff is.

michaeldrogalis16:11:00

@djjolicoeur Understood. Yeah, there's some underlying difference causing them to hash to different values. This should be the place to figure out what's up: https://github.com/onyx-platform/onyx/blob/0.9.x/src/onyx/api.clj#L209

djjolicoeur16:11:08

thanks @michaeldrogalis, I will take a look

kenny18:11:36

How do you guys develop an Onyx job that uses a Kafka queue at the REPL? Do you start an embedded version of Kafka? Or maybe replace the Kafka queues with a core.async channel? What's the best approach?

lucasbradstreet18:11:12

We used to use an embedded version and/or swap out core async, but lately we’ve moved closer to just using kafka directly via docker/docker-compose https://github.com/onyx-platform/onyx-kafka/blob/0.12.x/docker-compose.yml

lucasbradstreet18:11:48

My preference is to minimise differences between dev, tests, and prod, rather than gain a slightly nicer dev experience by swapping in core.async.

lucasbradstreet18:11:19

I’ve often initially developed it against core.async and then moved it to kafka via docker-compose later though.

lucasbradstreet18:11:35

We use circleci which allows us to stand up ZK + Kafka for our tests via that compose yaml.
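A minimal compose file in the spirit of the one linked above might look like the following; the image names, versions, ports, and environment variables here are assumptions, so consult the linked onyx-kafka docker-compose.yml for the real configuration:

```yaml
# Illustrative single-broker dev setup: ZooKeeper + Kafka.
version: '2'
services:
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_HOST_NAME: 127.0.0.1
    depends_on:
      - zookeeper
```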

kenny18:11:27

I like that approach. How do you handle the creation of topics that are needed for your job? Do you use the Kafka admin API?
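One way to do this, sketched with Kafka's AdminClient API (available since Kafka 0.11); the broker address, topic name, and partition/replication settings are assumptions:

```clojure
;; Sketch: create a topic programmatically before submitting the job.
(import '(org.apache.kafka.clients.admin AdminClient NewTopic))

(with-open [admin (AdminClient/create
                   {"bootstrap.servers" "localhost:9092"})]
  ;; NewTopic takes name, partition count, and replication factor.
  (-> (.createTopics admin [(NewTopic. "my-topic" 1 (short 1))])
      .all
      (.get)))  ;; block until the broker acknowledges creation
```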

kenny19:11:36

Makes sense. I'll try that approach out. Thank you 🙂

kennethkalmer21:11:24

I’m curious what the smallest practical EC2 instance types are for a Zookeeper ensemble to power onyx… more specifically, onyx mostly for monthly batch jobs and a few adhoc jobs throughout the month

kennethkalmer21:11:11

For this project it really isn’t about scaling as much as it is about breaking up the batch processing code into neatly defined workflows that are easier to reason about

kennethkalmer21:11:26

Hmm, that cluster might need kafka too… still lots to explore and figure out, just trying to think ahead to what happens if my poc is a success 🙂

lucasbradstreet21:11:35

I’m not really sure how small you can go because we have always biased towards that piece being as solid as possible. We don’t do all that much with ZK when you’re using the s3 checkpointer though, so it just needs to be able to respond reliably.

kennethkalmer21:11:54

I'll play when it comes down to it and see!

kenny21:11:31

Any idea what this Schema error is talking about?

clojure.lang.ExceptionInfo: Value does not match schema: {(not (= (name (:kafka)) (namespace :kafka/offset-reset))) invalid-key}
     error: {(not (= (name (:kafka)) (namespace :kafka/offset-reset))) invalid-key}
I don't think I can run spec on the task def because I can't find any Kafka plugin specs.

kenny21:11:45

I'm not sure why :kafka is in a list in that exception. Seems strange.

michaeldrogalis21:11:07

@kenny Can I see your catalog?

michaeldrogalis21:11:17

I feel like I say this on a weekly basis now. We gotta ditch Schema. 😕

lucasbradstreet22:11:08

From a first look, kafka/offset-reset looks ok. I’m on my phone though so it’s hard to look deeper

lucasbradstreet22:11:17

What version of onyx kafka is this?

lucasbradstreet22:11:11

I assume this exception was thrown after you started the job?

kenny22:11:24

[org.onyxplatform/onyx-kafka "0.11.1.0" :exclusions [org.slf4j/slf4j-log4j12]]

kenny22:11:59

Yes, thrown during initialization lifecycle.

lucasbradstreet22:11:31

Yknow I think that it’s being double namespaced

lucasbradstreet22:11:45

Keyword validation is non-existent in Clojure

lucasbradstreet22:11:09

That was a pretty WTF one to figure out. It all looked perfect

kenny22:11:35

Not sure what double namespaced means 🙂

lucasbradstreet22:11:06

Oh nope. Not it

lucasbradstreet22:11:38

Thought you were using the namespaced map form, where you supply the namespace before the map, and then I thought it had a second namespace inside the map, but no

kenny22:11:15

I'm not actually typing those namespaced maps - it's output from a DSL we have. Personally, I don't like using the namespaced map syntax in my actual code.

lucasbradstreet22:11:18

There’s definitely something odd going on, but I can’t diagnose it further from my phone. If you figure it out let me know. The schemas are in here: https://github.com/onyx-platform/onyx-kafka/blob/0.12.x/src/onyx/tasks/kafka.clj

kenny22:11:50

Ok. Will keep staring at it to see if something catches my eye.

lucasbradstreet22:11:58

Yeah. I hate that format too

michaeldrogalis22:11:03

Man, I have no idea. That's bizarre.

kenny22:11:43

Would the Onyx log shed any light here?

michaeldrogalis22:11:05

Probably not. This is a Schema check before Onyx ever boots up

kenny22:11:07

This exception occurs at runtime during the lifecycle. Is that what you mean when you say before Onyx boots up?

lucasbradstreet22:11:14

This one is the onyx kafka schema check on task start @michaeldrogalis

lucasbradstreet22:11:40

Can you give us the full exception stack trace just in case though?

michaeldrogalis22:11:05

@kenny I bet if you upgrade to "0.11.1.1" this'll be fixed

kenny22:11:46

Good call. Will try that now.

kenny22:11:55

Upgrading to 0.11.1.1 causes this exception:

clojure.lang.ExceptionInfo: No reader for tag cond
clojure.lang.Compiler$CompilerException: clojure.lang.ExceptionInfo: No reader for tag cond {:tag cond, :value {:default false, :test false, :ci false}}, compiling:(simple_job.clj:17:15)
I'm pretty sure I've run into this before with Onyx. Something with an Aero version mismatch.
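For context, newer Aero releases removed the #cond reader tag in favour of #profile, so a config like the one in that exception would need rewriting along these lines (the keys are taken from the exception message; the surrounding map is illustrative):

```clojure
;; Old Aero syntax (#cond, removed in later releases):
{:run-ci? #cond {:default false, :test false, :ci false}}

;; Equivalent newer Aero syntax (#profile):
{:run-ci? #profile {:default false, :test false, :ci false}}
```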

lucasbradstreet22:11:12

Yeah I think this is aero related

lucasbradstreet22:11:39

Though I don’t know why it’s bringing in aero. Probably a dev dependency thing when it should really be a test dependency

kenny22:11:49

The strange thing is that the changes don't seem to indicate something changed with Aero: https://github.com/onyx-platform/onyx-kafka/compare/0.11.1.0...0.11.1.1

lucasbradstreet22:11:27

Hail Mary lein clean?

kenny22:11:34

Using boot 🙂

lucasbradstreet23:11:20

Love it. Just no time to switch over

kenny23:11:37

Figured out the problem but not the solution. You guys may have some insight. We have an internal config library that stores our app's configuration in a config.edn file on the classpath. This library is brought in as a dependency in my Onyx project, and we are able to read the config perfectly fine this way with [org.onyxplatform/onyx-kafka 0.11.1.0]. For some reason not shown in the GitHub compare, [org.onyxplatform/onyx-kafka 0.11.1.1] places a new file on the classpath with the same name, config.edn. This config.edn overrides the one from our internal config library. The exception pasted above is due to the overriding config.edn using a tag literal #cond that has been removed in newer versions of Aero. The real problem here is onyx-kafka placing a new config.edn on the classpath, with those changes not appearing in the GitHub compare.
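A quick REPL sketch of why the shadowing happens: clojure.java.io/resource returns only the first matching resource on the classpath, so whichever jar appears earlier wins. Enumerating all matches shows who is shadowing whom:

```clojure
(require '[clojure.java.io :as io])

;; Returns only the FIRST config.edn found on the classpath —
;; after the upgrade, that was onyx-kafka's bundled copy.
(io/resource "config.edn")

;; List EVERY config.edn on the classpath to see which jars
;; are competing for the same resource name.
(enumeration-seq
 (.getResources (.getContextClassLoader (Thread/currentThread))
                "config.edn"))
```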

kenny23:11:42

It looks like a config.edn is always included in org.onyxplatform/onyx-kafka but for some reason upgrading from 0.11.1.0 to 0.11.1.1 causes my config to get overridden (yay Maven). Is there a reason that a config.edn is included with the onyx-kafka jar?

lucasbradstreet23:11:30

Yuck. No, that should absolutely be under test-resources and should not be included outside of tests

lucasbradstreet23:11:38

Sorry about that.

kenny23:11:42

Yeah, that was nasty. Any way to get a 0.11.x release out with the config.edn removed?