
Questions regarding peer timeouts on 0.12. Under the "The same messages are being replayed multiple times" heading in the User Guide, it points to the :onyx/pending-timeout flag, which appears to be settable on a per-task basis, but it links to the cheatsheet, which says the flag is deprecated. That points to ABS, which brings in the fields :onyx.peer/publisher-liveness-timeout-ms and :onyx.peer/subscriber-liveness-timeout-ms.


Questions:
1. Does that mean there is no longer a way to set a per-task timeout that differs from the timeout for every other peer on the box?
2. What counts as a publisher? Is that only input tasks? Are fn tasks always subscribers?
3. We have a task that performs a side effect: it calls a notification service that takes 15 seconds to time out. Whenever we have more than 4 recipients and the service is down, the peer fails and the message is replayed 7 times over 10 minutes. It is odd that it is consistently 7 repeats. Is there some variable that controls this, or an explanation of why it's always 7 rather than repeating infinitely or varying between, say, 6 and 8?


Finally: for long-running side-effect tasks like this that we don't want replayed, what's the best practice? Just throw it in a future and ignore the response? Raise the timeout in the peer-config to a much higher value, affecting all other tasks on the peer? Is there a way to send a heartbeat manually mid-task, in between other calls? In this specific case we can redesign the message flow, but I'm wondering about the general case where a specific task might take inordinately longer than the preferred global peer timeout.


That is right, you cannot set a timeout per task. However, these timeouts are really just measures of peer liveness. They are not intended to handle failure cases around retries; they’re intended to detect the failure of peers themselves.
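
For reference, both liveness timeouts named in the question live in the peer-config map, so they apply to every peer started with that config rather than to a single task. A minimal fragment, assuming the rest of the peer-config is defined elsewhere; the values are illustrative placeholders, not recommended settings or documented defaults:

```clojure
;; Fragment of a peer-config map. The numbers are placeholders for
;; illustration only; pick values that suit your slowest expected task.
{:onyx.peer/publisher-liveness-timeout-ms 60000
 :onyx.peer/subscriber-liveness-timeout-ms 60000}
```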


A publisher is any peer that is upstream of a given peer.


A subscriber is any peer that is downstream of a given peer.


So, for example, if I am an input peer sending segments to a function peer, and that function peer stops sending me heartbeats, I may choose to time that peer out, causing the job to be restarted from a checkpointed offset.
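
The liveness check described above boils down to comparing the time since the last heartbeat against the configured timeout. A rough sketch of that idea only; `timed-out?` is a hypothetical helper, not Onyx's actual implementation:

```clojure
;; Hypothetical helper illustrating a heartbeat-based liveness check.
;; This is not Onyx code; all three arguments are in milliseconds.
(defn timed-out?
  "True when no heartbeat has been seen from a peer within timeout-ms."
  [last-heartbeat-ms now-ms timeout-ms]
  (> (- now-ms last-heartbeat-ms) timeout-ms))

(timed-out? 1000 5000 20000)   ;; => false, last heartbeat was 4s ago
(timed-out? 1000 30000 20000)  ;; => true, last heartbeat was 29s ago
```

Under this model, a function that blocks for longer than the timeout without any heartbeat being sent looks the same to its neighbours as a dead peer.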


I’m not sure why you’re ending up with precisely 7 replays over 10 minutes. That’s a little surprising.


I’ll have to think a little bit about your suggestion to allow heartbeating from long-running :onyx/fn calls. I can see why you might want to keep heartbeating during a fn call that takes 10 minutes, while still timing out the peer much earlier than that if it’s not live.


Could you move those requests to an output plugin? I specifically built the output plugin to be async, so that you can keep some futures in flight and return a code that stops it from advancing, while still heartbeating.


Output plugins are very simple, so assuming you don’t have any tasks downstream of this one, I think it’ll be a much better fit.
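
The "futures on the go" idea above can be sketched in plain Clojure without touching any Onyx API. The names `start-request!` and `drained?` are hypothetical, and how they would be wired into the output plugin protocol is deliberately not shown. The point is only that the completion check never blocks, so the caller can report "not done yet" and stay free to do other work such as heartbeating:

```clojure
;; Hypothetical helpers, not part of Onyx. They track side-effecting
;; requests as futures and offer a non-blocking completion check.

(defn start-request!
  "Runs the zero-argument function f on another thread and records the
  resulting future in the in-flight atom (a vector of futures)."
  [in-flight f]
  (let [fut (future (f))]
    (swap! in-flight conj fut)
    fut))

(defn drained?
  "Drops finished futures from in-flight and returns true once none remain.
  Never blocks. Note that a future which threw is also considered finished
  here, so real code would need to inspect failures before dropping them."
  [in-flight]
  (empty? (swap! in-flight #(into [] (remove realized?) %))))

;; Usage: start a slow call, then poll instead of blocking on it.
(comment
  (def in-flight (atom []))
  (start-request! in-flight #(Thread/sleep 15000)) ; stand-in for the slow service call
  (drained? in-flight))                            ; false until the call completes
```

Compared with firing a future and ignoring the response, this keeps track of outstanding requests, so the task can hold back until they finish without blocking the thread for the full duration.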