
Hi, I've got a problem with tags in my project. I have two separate projects, each holding implementations for different tasks. Tags are used to keep a task from being assigned to a peer that does not offer the appropriate functionality. The strange thing is that after adding the tag information to the catalog entries and to the peer configuration in both projects, the submitted job fails to kick off, stating that there are too few virtual peers available. When I remove the tags, the job kicks off but, as expected, fails because a task gets assigned to a peer without the necessary functionality. Note that in the latter case there seem to be enough virtual peers, but not in the former (nothing but the tag information changed). Is this expected behavior? Is there something else that needs to be adjusted when using tags?


@atwrdik It sounds like the scheduler can't find a valid set of peers to assign all the work to in the 1st case.
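
For reference, a minimal sketch of how the two halves of a tag setup line up, assuming the `:onyx/required-tags` catalog key and the `:onyx.peer/tags` peer-config key. The task name, function, and tag below are hypothetical, and the other peer-config keys a real deployment needs are left out.

```clojure
;; Catalog entry: this task may only run on peers that carry the :orders tag.
;; :process-orders and :my.app/process-orders are hypothetical names.
(def catalog-entry
  {:onyx/name :process-orders
   :onyx/fn :my.app/process-orders
   :onyx/type :function
   :onyx/required-tags [:orders]
   :onyx/batch-size 20})

;; Peer config for the project that implements the task above.
;; Only the tag key is shown; merge it into the project's full peer config.
(def peer-config-tags
  {:onyx.peer/tags [:orders]})
```

As I understand it, the scheduler only considers a peer for a tagged task if the peer's tags cover everything the task requires, so each project has to start enough virtual peers with the matching tags to cover its own tasks.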


I am trying to understand complete_latency a bit better. I read that it is the length of time for one segment to get through the job. Does this include the time the segment spends in the pending messages queue? If so, is there a metric for the time a segment spends in the pending messages queue for an input task?


And if that is the case, then the key thing is that we definitely don’t want the complete_latency to ever go above the pending-timeout


this is for 0.9


@milk 0.10+ will get rid of that metric completely, so I wouldn’t spend so much time on it. There’s no metric for the current time pending. 0.10+ does away with the completion semantics completely, so you never have to worry about timing out messages.


I see that, but we have a system on 0.9.x, so I'm just trying to make sure I understand what those metrics mean


we are planning an upgrade, but we need to manage this for the time being


I'm getting "Unfreezable type: class datomic.index.Index", but the stack trace says nothing about where in my job it occurs


yeah I don’t want to dig too deep


@milk I think the best you have are the retry metrics.


@souenzzo Could you stick the stack trace somewhere? Generally when we throw we include task metadata.


Can you check the onyx.log? Guessing we’re not adding extra metadata there.


I'm just trying to manage this one a bit. I see the retry metrics, so I think I have that covered. I was just looking for a way to see it creeping up before it gets there


It’s possible that it’s being dropped by clojure.test if you’re using clojure.test


essentially, I think the main thing is that I don’t want complete_latency to go above pending-timeout


I just saw that complete_latency was a bit bigger than the batch latencies of the tasks in my job, so I thought the extra time might come from sitting on the queue, but I was just verifying my understanding of it


Yeah, I mean, it won’t really because it’ll be retried at that point anyway.


right, I think that makes sense, but we shouldn’t let that get near our timeout or we are going to be in a bit of a retry loop


Yeah, generally what will be happening is that you’re queuing up too many messages at once


If the sum of batch latencies >>>> pending-timeout


you can reduce :onyx/max-pending to reduce the queue size which will help a lot


that’s your main backpressure knob.
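
A minimal sketch of where that knob lives on a 0.9.x input task, assuming the `:onyx/max-pending` and `:onyx/pending-timeout` catalog keys. The task name and the values are hypothetical, and the plugin-specific keys a real Kafka input task needs are omitted.

```clojure
;; Input-task catalog entry (0.9.x). :read-messages and the numbers are hypothetical.
(def input-task-entry
  {:onyx/name :read-messages
   :onyx/type :input
   :onyx/medium :kafka
   :onyx/batch-size 20
   ;; Upper bound on segments that have been read but not yet fully acked.
   ;; Lowering it shortens the queue each segment has to wait behind.
   :onyx/max-pending 1000
   ;; Milliseconds a segment may stay pending before it is retried.
   :onyx/pending-timeout 60000})
```

The idea is to keep max-pending low enough that a segment at the back of the queue still completes well inside pending-timeout. Check the 0.9 cheat sheet for the actual defaults before picking values.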


yep, makes sense. One last, somewhat related question: when it retries, it is just re-queueing those messages on that internal queue, right? It's not putting them back on the Kafka topic?


Yes, but if they had actually been written out by the output plugin downstream and still triggered a retry, you will end up with duplicates


It’s impossible for it to be exactly once writing to the output plugins in 0.9


cool much appreciated @lucasbradstreet!