#onyx
2017-04-17
devth15:04:31

getting ready to run onyx in staging / prod and starting to think about how we want to operate our onyx app.
- should it submit its job automatically on startup?
- should it idempotently ensure ElasticSearch has the proper index and mapping, and install them if not?

i can imagine a few ways of operating an onyx app:
- api calls / CLI
- embedded nREPL
- web dashboard
- automation (no ops)

mpenet15:04:15

I am curious, did you write an elasticsearch plugin? It's on my near-future to-do list.

devth15:04:27

@mpenet my coworker rewrote https://github.com/onyx-platform/onyx-elasticsearch using your spandex client for 5.x. (he hasn't published it yet)

mpenet15:04:34

That's great, I was thinking of doing the same. Any chance it'd get open-sourced?

devth15:04:51

i think that's the plan. i'll check

devth15:04:31

@clyce can we open source the onyx-es rewrite for 5.x?

clyce16:04:50

@devth I wrote it with open-sourcing in mind; maybe after a few more iterations on code quality?

devth16:04:39

wouldn't hurt to PR. then the community can help improve quality quicker 🙂

devth15:04:42

anyway: would be good to hear about others' experience operating onyx in prod

michaeldrogalis15:04:48

@devth I’ve done all 4 of those before. 🙂 The usual answer - it depends.

devth16:04:47

i think i like the idea of "no ops". we're running in Kubernetes. no-ops allows you to spin up ad-hoc envs and not have to worry about manually provisioning stuff.

gardnervickers16:04:19

@devth same here, are you using Helm?

gardnervickers16:04:13

Nice, we've been loving being able to spin up ad hoc environments and have everything work with automatic TLS certificate procurement and ingress routing.

gardnervickers16:04:43

Haven't heard it called "no-ops" before, I like that!

devth16:04:12

can't remember where i heard the term, but not mine

devth16:04:26

how are you getting automated TLS certs?

devth16:04:45

we use kube-lego for external, vault for internal. haven't seen many other options for internal.

gardnervickers16:04:31

We use Traefik for ingress instead of multiple external load balancers. They have an ACME option for LetsEncrypt.

devth16:04:02

nice. i've looked at traefik but haven't used. looks great

gardnervickers16:04:48

Yea, it's served us well so far. There's a similar setup possible with Nginx, but it's not an out-of-the-box kind of experience like Traefik.

devth16:04:07

kubernetes on GKE or something else?

gardnervickers16:04:23

We're running on AWS in multiple regions, one cluster using kops and another that's self-hosted with bootkube

devth16:04:48

cool. i haven't used it outside GKE

lmergen16:04:07

having run these types of apps in production myself for quite some time, i found out the hard way that job management should ideally be explicit

lmergen16:04:17

make a simple interface for managing your jobs first

lmergen16:04:24

then you can choose to build tools to automate that on top of it

devth16:04:48

so it's useful / necessary to be able to: 1. start 2. stop 3. get status of a job?

lmergen16:04:03

yes, and i would distinguish between graceful shutdown and kill
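A minimal sketch of such an explicit job-management interface against the public `onyx.api` namespace (`peer-config` is assumed to be your peer configuration map; treating `await-job-completion` as the "graceful" path is an interpretation, not something prescribed above):

```clojure
(ns myapp.jobs
  (:require [onyx.api]))

(defn start-job!
  "Submit a job; the returned map includes the :job-id."
  [peer-config job]
  (onyx.api/submit-job peer-config job))

(defn kill-job!
  "Hard stop: the job is killed and cannot be resumed."
  [peer-config job-id]
  (onyx.api/kill-job peer-config job-id))

(defn drain-job!
  "Graceful path: block until the job completes on its own,
   e.g. after its input is exhausted, rather than killing it."
  [peer-config job-id]
  (onyx.api/await-job-completion peer-config job-id))
```

With these three in place, a CLI, nREPL helper, or web dashboard can all be thin layers over the same functions.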

lmergen16:04:50

also, think about compatibility with versions of your data — you can decide not to do anything with it yet, but it’s worth it to have at least some idea how you want to manage backwards incompatible upgrades of your streaming jobs

lmergen16:04:17

but this is a bit of a different discussion

devth16:04:29

makes sense. thanks!

devth22:04:31

decided to see what would happen if i recursively killed the /onyx znode in zk while onyx was running. it quickly restarted all of its peers, recreated /onyx/my-tenancy/{bunch of sub nodes}, then the pod died with [s6-finish] sending all processes the KILL signal and exiting and a new pod was created. :thinking_face:

michaeldrogalis22:04:07

@devth Erm, yeahhh don’t do that. 🙂

michaeldrogalis22:04:27

You can wipe out all the peers, but dropping /onyx/ while it’s actively running is pretty bad.

devth22:04:50

well my intent was to kill all state of all my old failed jobs 🙂

devth22:04:59

better to shut down onyx, kill, restart? same effect, i think...

michaeldrogalis22:04:17

Yeah. You can also invoke the garbage collector.
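For reference, the garbage collector mentioned here is exposed through the same API (a sketch; `peer-config` is assumed):

```clojure
;; Compacts the replicated log and cleans up state for
;; completed/killed jobs in ZooKeeper, without deleting
;; the live state of a running tenancy.
(onyx.api/gc peer-config)
```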

devth22:04:50

oh yeah. btw, my google is really bad at finding onyx stuff

michaeldrogalis22:04:26

Technically we should be able to bounce back from the /onyx zk node being dropped. We don’t have a Jepsen test for that one though, since it’s akin to wiping out database transaction logs while a database server is still running.

michaeldrogalis22:04:42

Should be most of what you need.

devth22:04:55

right. probably safe to assume zk would never lose your /onyx

michaeldrogalis22:04:08

It’s all fine for a ZK node to go down, or for ZooKeeper to temporarily become unavailable, but deleting those contents is a much more catastrophic fault, yes.

michaeldrogalis22:04:33

I’d recommend rotating your tenancy, :onyx/tenancy-id, and after it transitions, you can rm the old tenancy under /onyx
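Rotating the tenancy amounts to changing one value in the peer config before restarting the peers; a sketch (the tenancy-id values are hypothetical, and all other config keys are elided):

```clojure
;; Old deployment's peer config:
{:onyx/tenancy-id   "prod-2017-04-01"
 :zookeeper/address "zk:2181"}

;; New deployment uses a fresh tenancy. Once it has taken over,
;; /onyx/prod-2017-04-01 in ZooKeeper can be removed safely.
{:onyx/tenancy-id   "prod-2017-04-17"
 :zookeeper/address "zk:2181"}
```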

devth22:04:45

i wonder if your search results would improve if you added "Onyx Platform" to the <title>. right now: <title>User Guide</title>

devth22:04:54

ah, that makes sense

michaeldrogalis22:04:54

Probably would help. 🙂 Send a PR our way?

michaeldrogalis22:04:02

I can get to it later tonight if not.

devth22:04:12

oh right. forgot that site is oss

devth23:04:31

do we really need s6-overlay when we have K8S to ensure the process is always up and running?

gardnervickers23:04:05

@devth Not if you're running the media driver in a separate container.

devth23:04:45

oh right. i'm using embedded right now but will switch to separate. haven't considered a separate container in the same pod but maybe that makes sense.

devth23:04:14

not sure if s6 is hiding anything. it just shut down / restarted but i have no idea why

gardnervickers23:04:40

Yea just make sure they share a /dev/shm memory volume.

devth23:04:47

e.g.

[cont-finish.d] executing container finish scripts...
[cont-finish.d] done.
[s6-finish] syncing disks.
[s6-finish] sending all processes the TERM signal.

devth23:04:16

oh right. i'm using an emptyDir volume for /dev/shm

gardnervickers23:04:31

S6 is just to get around the PID1 reaping problem.

gardnervickers23:04:03

@devth using a Memory volume would offer better perf.

gardnervickers23:04:28

I believe emptyDir writes to disk still

devth23:04:43

yeah. will see about switching

devth23:04:21

don't think you can share /dev/shm across multiple containers yet https://github.com/kubernetes/kubernetes/issues/4823

gardnervickers23:04:55

You can just mount a mem volume across containers at /dev/shm and it's functionally the same

devth23:04:56

> However, you can set the emptyDir.medium field to "Memory" to tell Kubernetes to mount a tmpfs

there we go
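Pulling the thread together, a memory-backed `emptyDir` mounted at `/dev/shm` in both the peer and media-driver containers might look like this (pod, container, and image names are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: onyx-peer
spec:
  volumes:
    - name: dev-shm
      emptyDir:
        medium: Memory      # tmpfs, not node disk
  containers:
    - name: peers
      image: myorg/onyx-app:latest          # hypothetical image
      volumeMounts:
        - name: dev-shm
          mountPath: /dev/shm
    - name: media-driver
      image: myorg/aeron-media-driver:latest  # hypothetical image
      volumeMounts:
        - name: dev-shm
          mountPath: /dev/shm
```

Both containers see the same tmpfs, which is functionally equivalent to sharing `/dev/shm` for Aeron's shared-memory IPC.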