#onyx
2018-05-15
dbernal 14:05:43

has anyone run into any issues with C3P0 deadlocks on the onyx-sql plugin before?

lucasbradstreet 17:05:23

I haven’t seen them, but onyx-sql is mostly user-maintained, so I’ve never run it in prod.

lucasbradstreet 17:05:07

I have seen initial connection timeouts lately on CI, but that was because our CI was using the latest tag for its Docker images, and we ended up with an incompatible MySQL version.

lmergen 18:05:18

i know the code quite well and it’s definitely not normal to have deadlocks, especially since the whole thing is single-threaded. how are you using the plugin, with mysql?

lmergen 18:05:50

also, when you say deadlocks, do you mean a timeout when acquiring a connection from the pool? those are two different things.

lmergen 18:05:38

@lucasbradstreet congrats on the release :)

lmergen 18:05:59

very happy to see curator updated

lucasbradstreet 18:05:41

Thanks. Just checking over everything before announcing 🙂

lucasbradstreet 19:05:26

Hi everyone! Onyx 0.13.0 is out with a very minor breaking change that probably won’t affect anyone. It bumps curator so that ZooKeeper SSL is now supported. https://github.com/onyx-platform/onyx/blob/0.13.x/changes.md#0130

eriktjacobsen 19:05:44

Hey, we recently updated our code. Everything passes unit tests and works fine for several days, but at random our job goes from processing messages to dumping something like this out, and then... nothing. The job looks like it is running and the peers look fine, but no application-level or Onyx-level logging ever appears again from the peers. (This is in an environment on a single box with 22 virtual peers.) Just throwing it out there in case anyone has seen something similar. The only major addition is switching to an Onyx output plugin that does Amazonica Lambda invocation.

lucasbradstreet 19:05:33

Looks like the uri is missing for your request? uri=-}

lucasbradstreet 19:05:52

Not sure if something changed in how your URLs get passed down to your lambda plugin, but I’d check the segments arriving at that plugin.

eriktjacobsen 19:05:52

Yes, I'm trying to debug the output plugin itself and adding more judicious try-catch blocks. I'm more concerned that this seems to shut the entire system down with zero logging from Onyx: no missed heartbeats, virtual peers closing down, Aeron messages, etc. It's like once this triggers, everything just stops. My understanding is there is a threadpool for the virtual peers, so I'm just curious why this seems to freeze the entire thing.

lucasbradstreet 19:05:36

Ah right. Um, this is a plugin you wrote right?

lucasbradstreet 19:05:05

If you’re not checking whether your async requests failed in the plugin, from Onyx’s perspective everything may be working fine.
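
To illustrate the point about unchecked async requests, here is a minimal sketch in Python (used only to keep it short and self-contained; a real Onyx plugin would be written in Clojure). Everything in it is hypothetical: `PendingWrites`, `fake_invoke`, and the `synced` method are illustrative names, not Onyx or Amazonica APIs. The idea is simply to keep every future and surface a stored failure instead of silently reporting "not finished yet".

```python
from concurrent.futures import ThreadPoolExecutor


def fake_invoke(segment):
    # Hypothetical stand-in for a remote call; fails when "uri" is missing.
    if "uri" not in segment:
        raise ValueError("missing uri")
    return "ok"


class PendingWrites:
    """Tracks in-flight async requests for a hypothetical output plugin."""

    def __init__(self, executor):
        self._executor = executor
        self._in_flight = []

    def submit(self, fn, *args):
        # Keep the future so its outcome can be inspected later.
        self._in_flight.append(self._executor.submit(fn, *args))

    def synced(self):
        """Return True once every request has finished successfully.

        Re-raises the first stored failure so the error reaches the caller
        rather than leaving it waiting with no log output.
        """
        still_running = []
        for fut in self._in_flight:
            if not fut.done():
                still_running.append(fut)
            elif fut.exception() is not None:
                raise fut.exception()
        self._in_flight = still_running
        return not still_running


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        writes = PendingWrites(pool)
        writes.submit(fake_invoke, {"uri": "https://example.invalid"})
        while not writes.synced():
            pass
        print("all requests completed")
```

Whether to raise, retry, or log and drop on failure is a design choice for the plugin author; the sketch only shows that the failure has to be looked at somewhere.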

lucasbradstreet 19:05:40

Just a stab in the dark

eriktjacobsen 19:05:03

From the point that error dump happens, no further messages seem to be processed. Literally the log file just stops, though the peers keep reporting as up and things seem to be running from the ZK / Onyx perspective; just no messages are consumed. I get that the output plugin might be FUBAR and not actually saving anything, and I'm fine with that. I'm more concerned that everything else fails silently. Will circle back around once the error is figured out.

lucasbradstreet 19:05:14

I assume it’s not possible that it just processed everything?

lucasbradstreet 19:05:21

Anyway, let me know how you go.

eriktjacobsen 19:05:52

Correct, input is a kafka stream that receives messages every minute, and we have another onyx cluster running with the former version of code which has no hiccups.

lucasbradstreet 19:05:20

In that case I could see a situation where the plugin is returning false from synced?, prepare-batch, or write-batch: https://github.com/onyx-platform/onyx-plugin/blob/0.12.x/resources/leiningen/new/onyx_plugin/medium_output.clj#L42

lucasbradstreet 19:05:46

In that case it will continue to heartbeat, because Onyx is waiting for your work to finish and the plugin is signalling it to wait. If so, it’s still probably a problem with how the async requests are handled.

lucasbradstreet 19:05:31

There are metrics/health checks that you’d be able to use to detect when it’s processing a certain epoch for a long time.
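
As a rough illustration of that kind of check, the sketch below flags a task that has been on the same epoch for too long. It is written in Python purely for brevity and does not use Onyx's actual metrics or health-check endpoints; `EpochWatchdog`, `observe`, and the 30-second threshold are all assumptions made up for the example, and the epoch value would have to come from whatever monitoring source is available.

```python
import time


class EpochWatchdog:
    """Flags when the same epoch has been in progress for too long."""

    def __init__(self, max_seconds, clock=time.monotonic):
        self._max_seconds = max_seconds
        self._clock = clock  # injectable so the check can be tested
        self._epoch = None
        self._since = None

    def observe(self, epoch):
        """Record the epoch currently being processed.

        Returns True when this epoch has been observed for longer than
        max_seconds, which suggests the task is stuck waiting.
        """
        now = self._clock()
        if epoch != self._epoch:
            # Progress was made: restart the timer for the new epoch.
            self._epoch = epoch
            self._since = now
            return False
        return (now - self._since) > self._max_seconds


if __name__ == "__main__":
    watchdog = EpochWatchdog(max_seconds=30)
    if watchdog.observe(epoch=7):
        print("epoch 7 appears stalled")
```

A check like this would have caught the situation described above, where heartbeats continue and nothing is logged, because it looks at progress rather than liveness.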