Fork me on GitHub

@lucasbradstreet is there way from outside the peer to detect if one of the peer hangs ? (we previously talk about the hang, with the datomic write-bulk-datoms-async as the output). My initially thought is to periodical to check if the log has changes using onyx.http-query, based on the assumption that the log dos not change when the peer hangs, but this approse only works when using a single peer - when running in a multi peer cluster its need to detect if a specific peers has updated the log. Do you have any hints on how to accomplish this ?


@zamaterian I don’t have anything good for you ready to go, but maybe we can work through creating something that we can expose via metrics and possibly onyx.http-query


@zamaterian by the way, did you ever find out the cause of the write-bulk-datoms-async hang? I did create a patch to timeout on deref, but never shipped it.


@lucasbradstreet never figuring it out 🙂 exposing the some of the metrics via onyx.http-query sound like the way to go.


I still think it's likely that you hit memory pressure and datomic failed in weird ways that stopped it from dereffing. I might add that fix anyway.


Just spent the night with a corrupt datomic database 😉


So, for the metric, maybe we can track whenever a task tries to read a batch. That would give a good indication of whether it is stuck anywhere.


Or maybe an indicator of amount of outstanding checkpointed segments in zookeeper (since the sql checkpoints all segments on startup)


The problem there is if you have three output peers and one is stuck, the checkpoint might still update. You would probably see retries though


So for a single peer its when a task read a batch, for the entire cluster it could be outstanding checkpointed segments.


Ah, I see what you mean. Why not just look at how many messages are getting acked then?


That’s kind of the same thing, since it’ll cause segments to go from outstanding to not outstanding


How do we do the same in a ’streaming context’ since we dont know when we will receive new segements (eg from kafka), probably need to look at if any segments is in-flight ?


You can possible use a combination of throughput on the input task, and pending count on the input task


That way you know things are moving and you also know how many messages are in flight


@lucasbradstreet thx for your input 🙂

Drew Verlee14:10:47

when testing an aggregation locally i assume its typically to use an atom as shown in learning onyx?


That’s usually the easiest


if this is useful to others, here's a simple function to view a workflow via

(defn workflow-to-adj-map
  "Convert an Onyx workflow
   to a adjacency matrix as a map.
    e.g. [[:a :b] [:a :c] [:b :d] [:c :d]]
      => {:a [:b :c] :b [:d]  :c [:d]})
  (reduce (fn [m [k & vs]]
            (assoc m k (conj (m k []) (first vs) ))
                        ) {} workflow))
;; #ds/graph (workflow-to-adj-map workflow)


someone's gonna want it eventually 😉