
When testing out Onyx + Kafka, my peers continually restart after throwing this exception:

    clojure.lang.ExceptionInfo: Caught exception inside task lifecycle :lifecycle/initializing. Rebooting the task. -> Exception type: java.lang.NullPointerException. Exception message: null
       job-id: #uuid "30fbff49-4319-a757-6a0b-b7f28f795426"
     metadata: {:job-id #uuid "30fbff49-4319-a757-6a0b-b7f28f795426", :job-hash "db786ebce6742af642d9d60805225644110d2b6ded6bfd599e2f3823e62c4"}
      peer-id: #uuid "c01c9e46-3a41-efb7-88f7-7f2dd7b8cf7c"
    task-name: :in
That is the whole stacktrace. It seems like it's cut off, but that's all there is. Any ideas on how to proceed?


That’s odd. I think there should be more to the stack trace. I assume this is from onyx.log?


Yes. I do have these additional config settings:

{:level     :info
 :appenders {:println {:ns-whitelist ["compute.*" "dev.*"]}
             :spit    (assoc (appenders/spit-appender {:fname "server.log"})
                        :ns-blacklist ["org.apache.zookeeper.*"])}}
I wouldn't imagine that would affect the printing of a stacktrace though.


Ok. If your error is reproducible, turning that off and trying again is all I can really think to try


Very reproducible. Will try that now.


The full stacktrace is now visible. It turns out it wasn't hidden by that log config, but by the JVM optimization that omits stack traces from frequently thrown exceptions.


Ah. That one. I do usually overrule that in my dev JVM opts
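The flag being referred to here is presumably HotSpot's fast-throw optimization. Disabling it in dev JVM opts looks something like this (a Leiningen sketch; the project name and profile layout are placeholders):

```clojure
;; project.clj -- dev JVM opts sketch.
;; -XX:-OmitStackTraceInFastThrow disables HotSpot's "fast throw"
;; optimization, which drops stack traces from frequently thrown
;; built-in exceptions such as NullPointerException.
(defproject my-app "0.1.0-SNAPSHOT"   ; hypothetical project name
  :profiles {:dev {:jvm-opts ["-XX:-OmitStackTraceInFastThrow"]}})
```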


Anything we need to be worried about?


Yep just added that to my JVM opts. Nope.


Now that we're talking about this, I recall there was some JVM option that lets you emit a log to stderr any time an exception is thrown, bypassing any logging framework. Or am I confused?


Not sure about that one. I know there is a way to hook into unhandled exceptions, which is handy when you have a lot of threads doing things.
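A minimal sketch of that hook, for reference: a JVM-wide default handler that fires for any exception that escapes a thread.

```clojure
;; Install a process-wide handler for exceptions that escape any thread.
(Thread/setDefaultUncaughtExceptionHandler
 (reify Thread$UncaughtExceptionHandler
   (uncaughtException [_ thread ex]
     ;; Goes straight to stderr, independent of any logging framework.
     (binding [*out* *err*]
       (println "Uncaught exception on thread" (.getName thread)))
     (.printStackTrace ex))))
```

Per-thread handlers via `.setUncaughtExceptionHandler` take precedence over this default when set.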


Hmmm, I might be mistaking the JVM for another language.


Question regarding job submission: we have four peers running and they all submit the same set of 3 jobs. For 2 of the jobs, submission returns success on all 4 peers. For the 3rd, success is only returned on one peer and the other 3 peers fail with :incorrect-job-hash. Is this just the result of a race condition in job submission, or are we somehow generating different job hashes on our peers? I believe the job only needs to be submitted once to the cluster, but I just want to make sure I understand what is happening here. We are also currently running on 0.9.


@kenny That JVM opt kills me every time


@djjolicoeur I think you're confused about how job submission works. You only want to submit each job once - and probably not from the peers on start up. I'd start them from a different process


If you're running into a hash error, you're trying to submit the same job ID with different job content
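To make the "submit once" point concrete, a sketch of submitting a job from a single dedicated submitter process rather than from every peer on startup (`peer-config` and `job` are assumed to be defined elsewhere):

```clojure
(require '[onyx.api])

;; Run this from one submitter process, not from each peer at startup.
;; submit-job returns a map that includes the assigned :job-id.
(let [{:keys [job-id]} (onyx.api/submit-job peer-config job)]
  (println "Submitted job" job-id))
```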


Thanks @michaeldrogalis, we are actually looking to make that change WRT job submission in the near future. I essentially inherited this system, and to date the current job submission process hasn't caused any issues other than the submission quirk I mentioned. That said, the content of the jobs should be identical across peers, so I need to track down what the diff is.


@djjolicoeur Understood. Yeah, there's some underlying difference causing them to hash to different values. This should be the place to figure out what's up:


thanks @michaeldrogalis, I will take a look


How do you guys develop an Onyx job that uses a Kafka queue at the REPL? Do you start an embedded version of Kafka? Or maybe replace the Kafka queues with a core.async channel? What's the best approach?


We used to use an embedded version and/or swap out core async, but lately we’ve moved closer to just using kafka directly via docker/docker-compose


My preference is to minimise differences between dev, tests, and prod, rather than getting a slightly nicer dev experience by swapping in core.async.


I’ve often initially developed it against core.async and then moved it to kafka via docker-compose later though.
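The core.async swap mentioned here usually amounts to replacing the Kafka input entry in the catalog with the core.async input plugin, along the lines of this sketch (task name and batch size are placeholders):

```clojure
;; Dev-only catalog entry: read segments from a core.async channel
;; instead of Kafka.
{:onyx/name :in
 :onyx/plugin :onyx.plugin.core-async/input
 :onyx/type :input
 :onyx/medium :core.async
 :onyx/batch-size 10
 :onyx/max-peers 1
 :onyx/doc "Reads segments from a core.async channel"}
```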


We use circleci which allows us to stand up ZK + Kafka for our tests via that compose yaml.


I like that approach. How do you handle the creation of topics that are needed for your job? Do you use the Kafka admin API?
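One way to answer the topic-creation question is Kafka's admin API via Java interop, sketched below (topic name, partition count, replication factor, and broker address are all placeholders):

```clojure
(import '(org.apache.kafka.clients.admin AdminClient NewTopic))

;; Create the topics a job needs before submitting it.
(with-open [admin (AdminClient/create {"bootstrap.servers" "localhost:9092"})]
  (-> (.createTopics admin [(NewTopic. "dev-input" 1 (short 1))])
      (.all)
      (.get)))  ; blocks until the broker has created the topic
```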


Makes sense. I'll try that approach out. Thank you 🙂


I’m curious what the smallest practical EC2 instance types are for a ZooKeeper ensemble to power Onyx… more specifically, Onyx mostly for monthly batch jobs and a few ad hoc jobs throughout the month


For this project it really isn’t about scaling as much as it is about breaking up the batch processing code into neatly defined workflows that are easier to reason about


Hmm, that cluster might need kafka too… still lots to explore and figure out, just trying to think ahead to what happens if my poc is a success 🙂


I’m not really sure how small you can go because we have always biased towards that piece being as solid as possible. We don’t do all that much with ZK when you’re using the s3 checkpointer though, so it just needs to be able to respond reliably.


I'll play when it comes down to it and see!


Any idea what this Schema error is talking about?

clojure.lang.ExceptionInfo: Value does not match schema: {(not (= (name (:kafka)) (namespace :kafka/offset-reset))) invalid-key}
     error: {(not (= (name (:kafka)) (namespace :kafka/offset-reset))) invalid-key}
I don't think I can run spec on the task def because I can't find any Kafka plugin specs.
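For comparison, a typical onyx-kafka input catalog entry looks something like this sketch (based on the 0.9-era plugin options; topic, ZooKeeper address, and deserializer are placeholders, and newer plugin versions use different connection keys):

```clojure
{:onyx/name :in
 :onyx/plugin :onyx.plugin.kafka/read-messages
 :onyx/type :input
 :onyx/medium :kafka
 :kafka/topic "my-topic"
 :kafka/zookeeper "localhost:2181"
 :kafka/offset-reset :earliest
 :kafka/deserializer-fn :my.ns/deserialize
 :onyx/batch-size 10}
```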


I'm not sure why :kafka is in a list in that exception. Seems strange.


@kenny Can I see your catalog?


I feel like I say this on a weekly basis now. We gotta ditch Schema. 😕


From a first look, kafka/offset-reset looks ok. I’m on my phone though so it’s hard to look deeper


What version of onyx kafka is this?


I assume this exception was thrown after you started the job?


[org.onyxplatform/onyx-kafka "" :exclusions [org.slf4j/slf4j-log4j12]]


Yes, thrown during initialization lifecycle.


Yknow I think that it’s being double namespaced


Keyword validation is non-existent in Clojure


That was a pretty WTF one to figure out. It all looked perfect


Not sure what double namespaced means 🙂


Oh nope. Not it


Thought you were using the namespaced map reader form, where you supply the namespace before the map, and then I thought it had a second namespace inside the map. But no


I'm not actually typing those namespaced maps - it's output from a DSL we have. Personally, I don't like using the namespaced map syntax in my actual code.


There’s definitely something odd going on, but I can’t diagnose it further from my phone. If you figure it out let me know. The schemas are in here:


Ok. Will keep staring at it to see if something catches my eye.


Yeah. I hate that format too


Man, I have no idea. That's bizarre.


Would the Onyx log shed any light here?


Probably not. This is a Schema check before Onyx ever boots up


This exception occurs at runtime during the lifecycle. Is that what you mean when you say before Onyx boots up?


This one is the onyx kafka schema check on task start @michaeldrogalis


Can you give us the full exception stack trace just in case though?


@kenny I bet if you upgrade to "" this'll be fixed


Good call. Will try that now.


Upgrading causes this exception:

clojure.lang.ExceptionInfo: No reader for tag cond
clojure.lang.Compiler$CompilerException: clojure.lang.ExceptionInfo: No reader for tag cond {:tag cond, :value {:default false, :test false, :ci false}}, compiling:(simple_job.clj:17:15)
I'm pretty sure I've run into this before with Onyx. Something with an Aero version mismatch.


Yeah I think this is aero related


Though I don’t know why it’s bringing in aero. Probably a dev dependency thing when it should really be a test dependency


The strange thing is that the changes don't seem to indicate something changed with Aero:


Hail Mary lein clean?


Using boot 🙂


Love it. Just no time to switch over


Figured out the problem but not the solution. You guys may have some insight. We have an internal config library that stores our app's configuration in a config.edn file on the classpath. This library is brought in as a dependency in my Onyx project, and we are able to read the config perfectly fine this way with [org.onyxplatform/onyx-kafka]. For some reason that is not shown in the GitHub compare, the newer [org.onyxplatform/onyx-kafka] places a new file on the classpath with the same name, config.edn, and that config.edn overrides the one from our internal config library. The exception pasted above is due to the overriding config.edn using a tag literal #cond that has been removed in newer versions of Aero. The real problem here is onyx-kafka placing a new config.edn on the classpath, with that change not appearing in the GitHub compare.
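A quick diagnostic sketch for this kind of classpath shadowing: list every config.edn visible to the classloader, in resolution order, to see which jar wins.

```clojure
;; Print every config.edn on the classpath; the first URL listed is
;; typically the one io/resource would hand back.
(doseq [url (enumeration-seq
             (.getResources (.getContextClassLoader (Thread/currentThread))
                            "config.edn"))]
  (println (str url)))
```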


It looks like a config.edn is always included in org.onyxplatform/onyx-kafka but for some reason upgrading from to causes my config to get overridden (yay Maven). Is there a reason that a config.edn is included with the onyx-kafka jar?


Yuck. No, that should absolutely be under test-resources and should not be included outside of tests


Sorry about that.


Yeah, that was nasty. Any way to get a 0.11.x release out with the config.edn removed?