#onyx
2016-10-07
zamaterian06:10:53

@lucasbradstreet is there a way, from outside the peer, to detect if one of the peers hangs? (We previously talked about the hang, with the datomic write-bulk-datoms-async as the output.) My initial thought is to periodically check whether the log has changed using onyx.http-query, based on the assumption that the log does not change when the peer hangs, but this approach only works when using a single peer — when running in a multi-peer cluster it needs to detect whether a specific peer has updated the log. Do you have any hints on how to accomplish this?

lucasbradstreet06:10:06

@zamaterian I don’t have anything good for you ready to go, but maybe we can work through creating something that we can expose via metrics and possibly onyx.http-query

lucasbradstreet06:10:36

@zamaterian by the way, did you ever find out the cause of the write-bulk-datoms-async hang? I did create a patch to timeout on deref, but never shipped it.

zamaterian06:10:02

@lucasbradstreet never figured it out 🙂 exposing some of the metrics via onyx.http-query sounds like the way to go.

lucasbradstreet06:10:50

I still think it's likely that you hit memory pressure and datomic failed in weird ways that stopped it from dereffing. I might add that fix anyway.

zamaterian06:10:34

Just spent the night with a corrupt datomic database 😉

lucasbradstreet06:10:39

So, for the metric, maybe we can track whenever a task tries to read a batch. That would give a good indication of whether it is stuck anywhere.

zamaterian06:10:24

Or maybe an indicator of the number of outstanding checkpointed segments in ZooKeeper (since the SQL plugin checkpoints all segments on startup)

lucasbradstreet06:10:16

The problem there is if you have three output peers and one is stuck, the checkpoint might still update. You would probably see retries though

zamaterian06:10:18

So for a single peer it's when a task reads a batch; for the entire cluster it could be outstanding checkpointed segments.

lucasbradstreet06:10:44

Ah, I see what you mean. Why not just look at how many messages are getting acked then?

lucasbradstreet06:10:59

That’s kind of the same thing, since it’ll cause segments to go from outstanding to not outstanding

zamaterian06:10:14

How do we do the same in a “streaming context”, since we don’t know when we will receive new segments (e.g. from Kafka)? We probably need to look at whether any segments are in flight?

lucasbradstreet06:10:57

You could possibly use a combination of throughput on the input task and pending count on the input task

lucasbradstreet06:10:11

That way you know things are moving and you also know how many messages are in flight
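The heuristic sketched above (things are moving, and you know how many are in flight) can be expressed as a plain predicate over two metric snapshots taken some interval apart. This is just an illustration of the idea — the keys `:acked-count` and `:pending-count` are our own names, not actual Onyx metric names:

```clojure
;; A minimal sketch of the stall heuristic discussed above: the input
;; task is suspect when segments are pending but no segments were acked
;; between two polls. Metric keys here are hypothetical, not Onyx's.
(defn input-task-stalled?
  [{old-acked :acked-count}
   {new-acked :acked-count pending :pending-count}]
  (and (pos? pending)
       (= old-acked new-acked)))

;; e.g. 5 segments pending and no progress between polls => stalled
;; (input-task-stalled? {:acked-count 100}
;;                      {:acked-count 100 :pending-count 5})
```

A monitoring loop would poll the metrics endpoint on that interval and alert when the predicate holds for several consecutive polls, to avoid flagging a momentarily idle task.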

zamaterian06:10:26

@lucasbradstreet thx for your input 🙂

drewverlee14:10:47

when testing an aggregation locally, I assume it’s typical to use an atom as shown in Learning Onyx?

lucasbradstreet14:10:14

That’s usually the easiest
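For reference, the atom pattern mentioned here looks roughly like this — a sketch in the spirit of Learning Onyx, not its exact code; the names are ours, and the capture function would be wired up wherever the job syncs its window state:

```clojure
;; Sketch of testing an aggregation locally with an atom (hypothetical
;; names): window state is swapped into the atom as the job runs, then
;; asserted on after the job completes.
(def captured-state (atom []))

(defn capture!
  "Side-effecting sink: append one window extent's state to the atom."
  [extent-state]
  (swap! captured-state conj extent-state))

;; After the test job finishes, (deref captured-state) holds every
;; captured window state, ready for assertions in the test.
```

This keeps the test synchronous and deterministic: the atom is ordinary mutable state in the test process, so no external store is needed to observe the aggregation's output.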

aaelony22:10:02

if this is useful to others, here's a simple function to view a workflow via https://github.com/jsofra/data-scope...

(defn workflow-to-adj-map
  "Convert an Onyx workflow to an adjacency map.
   e.g. [[:a :b] [:a :c] [:b :d] [:c :d]]
     => {:a [:b :c] :b [:d] :c [:d]}"
  [workflow]
  (reduce (fn [m [k v]]
            (assoc m k (conj (m k []) v)))
          {}
          workflow))
;; #ds/graph (workflow-to-adj-map workflow)

aaelony23:10:22

someone's gonna want it eventually 😉