This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2016-08-25
@jasonbell Love your blog btw! I’m a data science geek too, so your posts are A+ in my book
@erichmond Thank you, that’s most kind.
I have a simple catalog that imports data from an MS-SQL database into Datomic. After the job has been running for a while, the throughput drops to 0.0 according to the metrics log (metrics on all tasks, :all), although metrics keep getting written to the log file. At the same time I'm monitoring the Datomic tx-report-queue in a separate REPL, and the transactions stop coming. I also enabled debug logging for the MS JDBC driver, and it likewise stops writing new entries. In onyx-dashboard the job is still shown as running. It generally happens after approx. 200k-700k Datomic transactions. When I kill the peer and start a new one, it continues the job, and after some time the throughput drops to zero again. The batch latency looks fine, so it's not a JVM GC issue. No exceptions found in any of the logs (e.g. Onyx, JUL, etc.). It's an MS-SQL database with 42 million records, running on a single peer with a 12 GB heap, a separate JVM for Aeron, and an external ZooKeeper. Workflow: [[partition-keys read-rows] [read-rows prepare-datums] [prepare-datums transact-datomic]]. transact-datomic uses write-bulk-tx-async-calls. Onyx version: 0.9.7; onyx-sql version: "0.9.9.1-20160816.124319-6". Do you have any idea how to troubleshoot this further?
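For context, the three-step pipeline described above could be written out in Onyx's workflow data format roughly as follows. This is a sketch only: the task names come from the message, everything else is assumed.

```clojure
;; Sketch of the workflow from the message above: partition the SQL
;; table into key ranges, read the rows for each range, transform them
;; into Datomic transaction data, and transact them. Task names are
;; taken from the message; this is illustrative, not the poster's
;; actual configuration.
(def workflow
  [[:partition-keys :read-rows]
   [:read-rows :prepare-datums]
   [:prepare-datums :transact-datomic]])
```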
Interesting. Any retries?
I guess if the throughput drops to zero then it's probably not retrying
my last run contains 101 retries on partition keys
Hmm. Each of those will amplify out to 500 rows or whatever you have configured. But you should see things pick back up again for a while when the retries happen.
No exceptions in the log?
retry_segment_rate_1s contains values between 0.99 and 2.00
No exceptions at all
What's the max-pending for the input task?
partitions-keys max-pending 1000, sql-rows-per segment 1000 and batch-size 100
Some blocking buffer could be getting stuck somewhere (which would be a bug). 1000x1000 = 1M rows in flight. Can you try reducing max-pending and rows-per-segment to help me debug it a little?
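To make the back-of-the-envelope number above concrete (the parameter names here are Onyx's :onyx/max-pending and onyx-sql's :sql/rows-per-segment; the arithmetic is a worst-case bound, not a measurement):

```clojure
;; Worst-case rows in flight: up to max-pending input segments can be
;; pending at once, and each partition-keys segment fans out into as
;; many as rows-per-segment rows downstream.
(def max-pending 1000)       ; :onyx/max-pending on the input task
(def rows-per-segment 1000)  ; :sql/rows-per-segment
(def rows-in-flight (* max-pending rows-per-segment))
;; rows-in-flight => 1000000
```

Dropping both values to 200, as suggested below in the thread, bounds this at 200 × 200 = 40,000 rows instead.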
Yes, what size would you prefer?
Let's just try 200 for each
Around 160k Datomic transactions it stopped, but after a short time (10-20 s) it picked up again, which I hadn't observed before.
Interesting. Is it still going?
After 400-500k it stopped, but I noticed that the partition-keys pending-messages count is approx. 950. I did observe pauses in the Datomic report queue.
But still no exceptions in any of the logs. Could it be that something happens with the Datomic connection? (Btw, during the run I'm seeing INFO logs from the Datomic peer; nothing there indicates a problem.)
It's certainly possible, especially if onyx does keep retrying
Stick an onyx/fn on the output task that prints the segment and returns the segment. Then you'll have logging to see if segments are making it to the output
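A minimal sketch of that debugging trick, assuming a hypothetical namespace my.debug; the function name and the catalog wiring shown are illustrative, not from the thread:

```clojure
;; Pass-through function: print each segment so it appears in the peer
;; log, then return it unchanged so the job's behaviour is unaffected.
(defn log-segment [segment]
  (println "segment reached output task:" segment)
  segment)

;; In the catalog, point the output task's :onyx/fn at it (the other
;; keys on the transact-datomic entry stay as they were):
;;   {:onyx/name :transact-datomic
;;    :onyx/fn   :my.debug/log-segment
;;    :onyx/type :output
;;    ...}
```

If segments stop appearing in the log while the job still shows as running, the stall is upstream of the Datomic write; if they keep printing, the problem is on the Datomic side.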
will do 🙂
Everyone give @vijaykiran a big round of applause for contributing a new, improved User Guide - http://www.onyxplatform.org/docs/user-guide/0.9.10-beta1/
Includes anchor linking to each section.
0.9.10 will be out next week after we merge in a few more fixes + some docs. This release includes the Peer Query Server. Each peer can optionally run an HTTP server to respond to a health check and provide a status report.