#onyx
2016-09-13
aengelberg02:09:03

My onyx dashboard in production repeatedly crashes when it is allocated 512MB of memory. Is that not enough?

michaeldrogalis02:09:21

@aengelberg What error are you seeing?

aengelberg02:09:39

not seeing a particular error, but in Aurora it gets automatically shut down (penalized for flapping) when I go to the onyx dashboard in the browser and a few log entries load.

aengelberg02:09:02

Starting Sente 
Starting HTTP Server 
Http-kit server is running at  
Connected: 192.168.32.16 f6f57e46-2b09-432d-af82-561e30565131 16-Sep-13 02:52:50 ip-192-168-64-240 
INFO [onyx-dashboard.tenancy] - Starting Track Tenancy manager for tenancy amperity {:zookeeper/address "...", :onyx.peer/job-scheduler :not-required/for-peer-sub, :onyx.messaging/impl :aeron, :onyx.messaging/bind-addr "localhost", :onyx/tenancy-id "amperity"} f6f57e46-2b09-432d-af82-561e30565131 16-Sep-13 02:52:50 ip-192-168-64-240 
INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper. 
Exception not found for job #uuid "9e92d3f2-b962-4ac1-8bd5-803ed8c97f81" 

aengelberg02:09:12

That's the process stdout before it dies

aengelberg02:09:04

Unless Exception not found for ... is deadly

michaeldrogalis03:09:15

I haven't seen that message before. That's an odd one, but admittedly I rarely work on the dashboard. How many entries are in the ZooKeeper log for the tenancy ID? It's curious that Aurora is shutting it down.

aengelberg03:09:07

I had just onyx.api/gced the log, because there were enough messages in the log that onyx-dashboard would crash whenever I loaded a few hundred messages.

michaeldrogalis03:09:50

Something else is definitely not right then. A few hundred log entries is nothing

aengelberg03:09:22

The log now starts at 23314. The dashboard dies when it loads up to #23331.

michaeldrogalis03:09:00

What happens when you give the Aurora proc more RAM? There should be some way to see what exactly killed the process.

aengelberg03:09:51

I gave the onyx dashboard more RAM. Now it gets up to log #23518, then dies.

aengelberg03:09:24

@michaeldrogalis I kept an eye on the aurora homepage and kept refreshing while the dashboard was loading stuff. At one point the "used memory" went up to 1900MB. Then back down to 0MB when it got killed

aengelberg03:09:41

To be exact, I gave it 2GB of RAM

michaeldrogalis03:09:05

@aengelberg Not sure off the top of my head, I'd have to look at it in context.

aengelberg03:09:46

Looks like improving memory usage is an outstanding issue. https://github.com/onyx-platform/onyx-dashboard/issues/51

michaeldrogalis03:09:16

I wouldn't expect to hit it that quickly with the numbers you cited earlier. Could be though.

michaeldrogalis03:09:24

Gonna head off to sleep. Talk later.

stathissideris08:09:09

@michaeldrogalis I dug out a thread from a couple of years ago where you asserted that Spark is much faster than Onyx. Is that statement still true today?

lucasbradstreet08:09:01

@stathissideris: Spark's throughput will definitely be better when used for batch jobs. Spark's latency will be worse for streaming, though throughput may be better in this case. We haven't done any big benchmarks lately because we are transitioning to a new model which will improve things. We'll bench again after we're done.

aspra08:09:04

The value for the batch latency percentile metrics is in what unit?

lucasbradstreet08:09:44

@aspra it's so you can tell how long your onyx/fn is taking to run. If you add up all of the batch latencies, end to end, it's a good indication of what is causing high complete latency. One issue is that it's the total for a batch of segments. I've been meaning to add a metric which divides batch latency by the size of a batch.

lucasbradstreet08:09:44

That metric would be more useful for profiling your onyx/fn, whereas batch latency is more useful for seeing where in your pipeline latency is being added
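
A rough sketch of the per-segment figure being described (the function name is illustrative, not part of onyx-metrics):

(defn per-segment-latency
  "Approximate time spent per segment: the batch latency metric divided
  by the number of segments in that batch."
  [batch-latency batch-size]
  (/ batch-latency (max 1 batch-size)))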

aspra09:09:03

@lucasbradstreet Right makes sense. Still useful. And is it in millis?

aspra09:09:40

And is the throughput in number of segments?

aspra09:09:55

ok thanks 🙂

lucasbradstreet09:09:58

Those are per second

aspra09:09:15

1s or 10s etc. right?

lucasbradstreet09:09:21

Although we may have per-10-second measurements too, if I remember correctly

aspra09:09:34

yeah 60s as well

mariusz_jachimowicz10:09:29

@aengelberg I am very curious about the dashboard memory usage and those problems. I would try to look into it when I finish https://github.com/onyx-platform/onyx-dashboard/pull/63

lucasbradstreet10:09:18

@aengelberg: are we talking JVM memory? That issue was more regarding the cljs side of things IIRC, but it sounds like the JVM side needs some work too

gardnervickers11:09:56

@aengelberg: are you explicitly setting the heap size for your dashboard JVM?

aspra13:09:45

is there any way for the onyx dashboard to not replay all jobs but only the running ones?

lucasbradstreet13:09:25

Because it’s a log that has to be played from start to finish to get the replica, you can’t selectively play it. However, you can use onyx.api/gc to compact the log

lucasbradstreet13:09:51

gc plays the log back and then writes out the full replica as a new log entry. You’ll lose the history, but you’ll still know the current state of the system
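
A minimal sketch of what that looks like; the peer-config values here are illustrative and should match the peer configuration you use to submit jobs:

(require '[onyx.api])

(def peer-config
  {:zookeeper/address "127.0.0.1:2181"
   :onyx/tenancy-id   "my-tenancy"})

;; Plays the log back, writes the compacted replica as a new entry,
;; and discards the history before that point.
(onyx.api/gc peer-config)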

lucasbradstreet13:09:36

I think onyx-dashboard probably needs some optimisations so that you can play up to, say, the last hour in the JVM, then send all the recent history over the wire

lucasbradstreet13:09:48

@mariusz_jachimowicz that might be a good way to reduce memory consumption

aspra13:09:19

yeah, that would be nice. I noticed that if there are quite a few jobs it takes a significant amount of time to get the latest state. Plus it can be a bit confusing whether it is done or not.

aspra13:09:28

A very handy tool nevertheless 🙂

lucasbradstreet13:09:12

Yeah, it just has to grow up a little 😄

drankard13:09:03

Hi there. I have a job running based on the datomic_mysql_transfer example: :partition-keys (onyx/sql) -> :read-rows (onyx/sql) -> :prepare-datoms (fn) -> :write-to-datomic (onyx/datomic). The integer key used in the input task is unique and ordered. It's working OK and I'm now trying to tweak and adjust the throughput. The thing is that I have many millions of rows in the SQL database tables, so it takes forever to run. So I was wondering: what is the recommended way of "watermarking" the progress in case of:
- a crash, e.g. the transactor not being available, or any other reason
- the job being killed
- adjusting the batch size / rows per segment

lucasbradstreet13:09:58

@drankard rather than watermarking it, I would probably set up onyx-metrics and monitor throughput, retries, etc. as it runs. Then after a while of monitoring it, I would just onyx.api/kill-job it, make some tweaks, and do the whole round again
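
A sketch only; peer-config and job-id are placeholders for your own peer configuration map and the ID returned by onyx.api/submit-job:

;; Kill the running job, tweak the catalog/task settings, then resubmit.
(onyx.api/kill-job peer-config job-id)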

drankard14:09:19

The thing is that we don’t want to start transacting all the input rows again; we want to resume from the point where the last job ended/failed/was resubmitted. What should we do to get this guarantee/behavior?

drankard14:09:27

And later on, when we start receiving changes (additive) to our SQL import table, to be able to start from the last row transacted.

aspra14:09:49

Trying to show the metrics in the dashboard. I see that

"The Onyx dashboard already knows what to do with this output"

aspra14:09:06

so the dashboard will connect to ws://127.0.0.1:<PORT>/metrics?

zamaterian14:09:20

@aspra afaik the dashboard no longer supports showing the metrics

aengelberg14:09:34

@lucasbradstreet: yes. @gardnervickers: I don't think I've tried that, should that help?

aspra14:09:22

@zamaterian ah really, didn't know that, thx. @lucasbradstreet could you please confirm?

aengelberg14:09:47

I guess it's getting unfairly punished by Aurora if the JVM doesn't even know the limit.

gardnervickers14:09:59

@aengelberg Yeah, there’s a funky property with the JVM in Docker containers. When you start up the JVM in server mode without setting the heap size, it defaults to 1/4 of system memory. The problem is that it will not see the Docker container memory limits as “system memory”. So if you’re running on an 8GB box and give a container 1GB of memory, it won’t default to 256MB, it’ll default to 2GB.
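
A sketch of pinning the heap explicitly so the container limit, rather than host memory, governs the JVM; the flag values and invocation are illustrative only:

;; In a Leiningen project.clj:
;;   :jvm-opts ["-Xms400m" "-Xmx400m"]
;; or directly when launching the dashboard jar:
;;   java -Xms400m -Xmx400m -jar onyx-dashboard.jar ...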

aspra14:09:16

@zamaterian probably not then 🙂

zamaterian14:09:57

@aspra nope, I’m sending the metrics to AWS CloudWatch

lucasbradstreet14:09:08

Sorry about that. I need to fix the README. We took it out because it is a very poor substitute for real metrics systems and it's a pain to maintain

lucasbradstreet14:09:34

Timbre metrics is good enough for local use, and we really don't want people using it in prod

aspra14:09:49

@lucasbradstreet ok I see. In my case I am using it on a test environment for load testing so could be handy. I can try to do what @zamaterian is suggesting though

lucasbradstreet14:09:31

Admittedly it’d be useful for dev. We have gotten a little pushback about dropping support

lucasbradstreet14:09:23

Medium term plan is to be able to get a metrics stack up with one command, so that users have an alternative that doesn’t require a lot of work

zamaterian14:09:35

I've been thinking about throwing the metrics logs at graphviz for dev.

lucasbradstreet14:09:56

I’ve considered doing the same

Travis14:09:21

I just took the onyx-benchmark project and ran the Ansible scripts with slight modifications to set up our stack

zamaterian14:09:40

I think it would be a good way for people to figure out what knobs to turn.

lucasbradstreet14:09:42

Actually, some of the benches in onyx-benchmarks do parse the metrics after

lucasbradstreet15:09:47

@aspra @aengelberg @camechis I’ve improved the performance tuning doc. It now includes more things you should check as you go to production. I’ll continue to improve it further, including better discussion of the metric types and what to look for there https://github.com/onyx-platform/onyx/blob/master/doc/user-guide/performance-tuning.adoc

michaeldrogalis16:09:04

@stathissideris We have pretty high hopes for the next generation streaming engine after we performance tune it. We have some novel improvements on the latest research (~1 year old)

aengelberg16:09:26

@gardnervickers: thanks for the tip, actually setting a JVM memory limit seems to have worked. Now I'm getting some client-side issues where Firefox is taking 2GB of memory, putting me over my laptop's memory limit.

Travis16:09:30

Anxious to see it

michaeldrogalis16:09:16

Lucas has been leading the effort on that front for almost 6 months now. It's a master piece.

aengelberg16:09:19

once you push and release the changes it will be a master piece.

gardnervickers16:09:35

Hahaha that got me

zamaterian17:09:03

Hi guys, I just killed my Datomic transactor - after my throughput fell to zero - using write-bulk-datoms-async. This didn’t affect the status of my Onyx job (no errors or anything in the Timbre log); according to the dashboard it’s still running. It appears to be stuck somewhere. Have you experienced this before? Normally the job stops when the transactor is not available.

michaeldrogalis17:09:31

@zamaterian Does the Onyx cluster still appear to be attempting to perform writes? Do you see errors in the log?

michaeldrogalis17:09:59

Also, have you recently set up any lifecycles to continue the job upon exceptions?

zamaterian17:09:15

No activity at all, no lifecycles configured to continue upon exception, running 0.9.9.0. No errors in the Timbre log. Would you like a thread dump? https://gist.github.com/zamaterian/fe8495e07caafc20f9ab8f5a8384d010

michaeldrogalis17:09:03

I'm a little busy to dig in at the moment.

zamaterian17:09:41

That's fine 🙂 I’m just gonna restart it then.

michaeldrogalis17:09:40

Thanks. 🙂 Is the dashboard your only method of verifying that the job is still running? We have bugs in the dash once in a while.

zamaterian17:09:20

yes, it was my only method; I did refresh it though. I should be able to see the status in ZooKeeper next time

michaeldrogalis17:09:58

Running the Replica Server in onyx-lib is a pretty nice light-weight way to see what's going on. But yeah off the top of my head, not sure what might be going on there. An exception from an unavailable transactor should definitely kill the job unless otherwise handled.

zamaterian17:09:35

Will look into replica server 🙂

michaeldrogalis17:09:46

It's basically the guts that underlies the dashboard, stripped down to the bare basics. Follows the log and gives you a JSON view of what the log looks like.

michaeldrogalis17:09:06

Just another data point to possibly help you along, anyway. 🙂

stathissideris17:09:30

@michaeldrogalis thanks and good luck with the improvements 🙂 I’m not using onyx nor spark right now — just evaluating

aaelony18:09:00

I'm using the following serializer function with the onyx s3-output plugin:

(def s3-serializer-fn
  (fn [vs]
    (.getBytes (pr-str vs) "UTF-8")))
which writes the list of strings to the file printed on one line. How can I get this to output a newline at the end of each item, without the opening and closing parentheses of the list (each item already has embedded tab delimiters)?

aaelony18:09:32

trying things like (.getBytes (apply str (mapv println vs) "UTF-8")) ...

michaeldrogalis18:09:03

@aaelony (str (pr-str xs) "\n")?

michaeldrogalis18:09:19

That would work if the reader is going to treat each line as a distinct readable set of chars.

aaelony18:09:32

Wow, that simple? Will try that, but I think the outer parens will still be there

michaeldrogalis18:09:49

I don't follow what you mean by the parens comment.

michaeldrogalis18:09:02

You're saying "vs" is a collection, and you want one per line?

aaelony18:09:54

vs is a vector of strings like ["A\t1\t2\t3" "B\t1\t2\t3" ...]

aaelony18:09:30

so ideally the above would be 2 lines of tab-delimited data in the output file

michaeldrogalis18:09:54

(apply str (interpose "\n" (map pr-str ["A\t1\t2\t3" "B\t1\t2\t3"])))
=> "\"A\\t1\\t2\\t3\"\n\"B\\t1\\t2\\t3\""

aaelony18:09:57

yeah, in the repl it's like that. But in the file the tabs and newlines don't resolve..

aaelony18:09:02

I'll try it once more

lucasbradstreet18:09:14

Here’s a new discussion of onyx-metrics that everyone might be interested in. It discusses how to think about each metric type https://github.com/onyx-platform/onyx-metrics/blob/master/README.md#guide-to-types-of-metrics--diagnosing-issues

aaelony22:09:02

@michaeldrogalis, so with a serializer function of

(defn s3-serializer2-fn
  [v]
  (-> (apply str (interpose "\n" (map pr-str v)))
      (.getBytes "UTF-8")))
and an output step that produces
(apply str (interpose "\t" fields-v))
I end up with output of
"A\t1\t2\t3"
"B\t1\t2\t3"
"C\t1\t2\t3"
Do I need another "apply str" in the serializer function?

michaeldrogalis22:09:09

@aaelony (interpose "," ["a" "b" "c"]) => ("a" "," "b" "," "c")

michaeldrogalis22:09:40

Er, wait. I'm confused.

michaeldrogalis22:09:51

That wasn't your expected output?

aaelony22:09:52

I'm looking to get to files with

A    1    2    3    4
B    1    2    3    4
C    1    2    3    4 
where the delimiter is a tab

aaelony22:09:25

instead I'm ending up with a \t in each line as a string row

michaeldrogalis22:09:07

Maybe back off on the pr-str that you're using if you're getting a literal \t char back

aaelony22:09:01

okay... fairly shortly I think I'll have tried all possible combinations 😉

michaeldrogalis22:09:40

I would extract what you have from Onyx and deal with it strictly from Clojure. Someone in the main #clojure channel might have a better answer for you.

aaelony22:09:52

it works fine in the clojure repl

aaelony22:09:06

it's when it goes to file that it looks different

aaelony22:09:24

I'll take it offline though...

michaeldrogalis22:09:39

Even when you spit it to a file?

aaelony23:09:17

I'll try that next. Was printing to the repl only

aaelony23:09:26

i.e. println makes it look fine

aaelony23:09:23

spit works fine too

michaeldrogalis23:09:29

How about spitting to an S3 file in the same way that you are now?

aaelony23:09:40

that results in "A\t1\t2\t3" for each row

michaeldrogalis23:09:18

Do the files you're testing with have different extensions or codecs?

michaeldrogalis23:09:42

Basically just trying to help you pare this one down to the essentials of the problem; I'm 99.9% sure Onyx isn't affecting the behavior.

aaelony23:09:59

okay, I'll keep at it then

michaeldrogalis23:09:23

The other thing to try would be looking at the underlying S3 writer library that the plugin uses and trying it directly to reproduce the formatting problem.

aaelony23:09:30

sounds good. will do

lucasbradstreet23:09:37

@aaelony: you're spitting the output of that serialiser fn?

lucasbradstreet23:09:02

My only thought is pr-str is a good way to get escaped output, assuming you're still using it

aaelony23:09:04

Well, I have an output step that turns a vector of fields into a tab-delimited string via

(apply str (interpose "\t" fields-v))

aaelony23:09:23

then the serialiser function does

(defn s3-serializer2-fn
  [v]
  (-> (apply str (interpose "\n" (map pr-str v)))
      (.getBytes "UTF-8")))

aaelony23:09:30

which is almost what I want

aaelony23:09:56

the newlines resolve fine, it's only the tabs that are unresolved

aaelony23:09:32

but putting the tab interpose in the serializer splits too much

lucasbradstreet23:09:12

Yeah no idea then

lucasbradstreet23:09:01

I'd watch out for pr-str though

lucasbradstreet23:09:16

Could see it double escaping your tab. Haven't tested it

aaelony23:09:25

I might drop down to java or something

lucasbradstreet23:09:16

Try map pr-str on fields-v before the tab interpose

aaelony23:09:41

I tried that earlier and got an error...

aaelony23:09:49

but it's something like that

lucasbradstreet23:09:53

Then turn the map pr-str in the serialiser into map str
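
A sketch of the two changes being suggested, with fields-v as in the earlier snippet (illustrative only):

;; Output step: escape each field individually, then tab-join.
(apply str (interpose "\t" (map pr-str fields-v)))

;; Serializer: join rows with real newlines and stop re-escaping.
(defn s3-serializer2-fn
  [v]
  (-> (apply str (interpose "\n" (map str v)))
      (.getBytes "UTF-8")))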

aaelony23:09:29

you solved it!

aaelony23:09:39

thank-you.

lucasbradstreet23:09:56

Heh was wondering if it'd work. You're lucky I couldn't sleep :p

aaelony23:09:08

indeed 🙂

aaelony23:09:18

the answer is:

(defn s3-serializer2-fn
  [v]
  (-> ;; (apply str (interpose "\n" (map pr-str v)))
      (apply str (interpose "\n" (map str v)))
      (.getBytes "UTF-8")))

aaelony23:09:35

sleep soundly

lucasbradstreet23:09:39

pr-str heard u like escapes in ur escapes

aaelony23:09:06

I don't normally use pr-str, I should avoid it

aaelony23:09:40

it is listed in the catalog entry at https://github.com/onyx-platform/onyx-amazon-s3, perhaps that's where I got the idea to use it

lucasbradstreet23:09:13

It's an ok way to print re-readable edn but you can end up accidentally going too far as you have found out

aaelony23:09:36

I'll remember from now on, lesson learned 🙂

aengelberg23:09:57

Here, I have read-input, with a parallelism of one (in green), then two other tasks that follow the input linearly, each of which has a parallelism of two. But each task should be seeing the same segments.

aengelberg23:09:50

My grafana queries all look like this:

SELECT sum("value") FROM "[:read-input] 10s_throughput" WHERE $timeFilter GROUP BY time(1s) fill(null)

aengelberg23:09:11

Am I doing something wrong if I want to aggregate all logs across all (virtual) peers?

aengelberg23:09:32

to be clear, my goal here is to see all the metrics at around the same level.

aengelberg23:09:57

unless that's less useful.

Travis23:09:16

I think that's what we are doing as well, although I can’t confirm if it's correct, lol