2016-02-01
thank you, @michaeldrogalis - that’s very helpful
@michaeldrogalis: we have a busy time ahead, so we're probably going to go with balanced for now, but i'll see what i can do to get you a reproducible case when i can for sure.
@greywolve: the problem was explicitly with colocated?
And it never starts the job?
Ok cool. I'll see if I can quickly reproduce it. Definitely an alpha feature for now I guess
Cool, that helps
I’m not able to easily reproduce it, so we’ll probably need more details later
interesting set of exceptions we got recently
system continues to run
Those look a lot like a deploy that was bad and was fixed, or a jar that was incomplete.
The only time I've seen class-not-found issues like those has been when I lein installed over a running jar
Maybe the daemon that starts the uberjar tried to start it up before it was fully uploaded and then tried again and didn't crash
ok, Lucas, i think we’re very close to a multi-node cluster. there are many variables, but i’ve got it all the way to the point where it submits the job and then doesn’t process anything
on both servers, the onyx log shows many Starting ZK connections in a row, followed by one Stopping ZK connection message
on both servers, the zk log has many of these: Got user-level KeeperException when processing sessionid:0x1c529c38d6770028 type:create cxid:0x13 zxid:0x900000770 txntype:-1 reqpath:n/a Error Path:/onyx/highstorm/origin/origin Error:KeeperErrorCode = NodeExists for /onyx/highstorm/origin/origin
i’ve already confirmed that the ZK cluster is up - i can see one is Leader and one is Follower, and watching A’s log while cycling B has an effect, ditto B while watching A
i’m a little unsure as to where to look next for clues
I think that error is ok. It doesn't look like you're versioning your onyx/id though.
I don't recommend using the same onyx/id between deploys
i’m very sure there are sufficient peers - each node starts 40, which is the total for all the tasks in the system
ah, that’s good information, thank you
should be ok if we cleanly stop and start all instances, right?
The reason for not sharing the onyx/id is that it needs to play back the log which may not be compatible with your onyx version, plus you can end up with different versions of your jar communicating with each other
Correct
One approach is to use a git sha or uberjar md5
You could prepend highstorm to it
yep, we’ll amend config in circleci script so that we have a history to refer back to
You could also prepend the circle build number so everything is sorted nicely
ok. ZK running. aeron running. jobs start. no exceptions anywhere. onyx and zk logs clear. plenty of peers started. it just doesn’t do anything.
what else can i check?
true, although the git sha is better, i think, because it relates to the code and not the build
yeah, you can do both
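A minimal sketch of what combining the two might look like, assuming CircleCI’s standard CIRCLE_BUILD_NUM / CIRCLE_SHA1 env vars and this app’s “highstorm” prefix (neither is anything Onyx prescribes):
```clojure
;; Hedged sketch: build a versioned onyx/id from the CI build number and git
;; sha so each deploy plays back its own log in ZooKeeper. CIRCLE_BUILD_NUM
;; and CIRCLE_SHA1 are CircleCI's standard env vars; the "highstorm" prefix
;; and the idea of assoc'ing this into the config are this app's convention.
(defn versioned-onyx-id []
  (let [build (or (System/getenv "CIRCLE_BUILD_NUM") "dev")
        sha   (or (System/getenv "CIRCLE_SHA1") "local")]
    (str "highstorm-" build "-" (subs sha 0 (min 7 (count sha))))))

;; e.g. (assoc env-config :onyx/id (versioned-onyx-id))
;;      (assoc peer-config :onyx/id (versioned-onyx-id))
```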
Are any metrics being output?
no metrics
I assume it says some log messages like this: "16-Jan-13 23:16:22 mbr INFO [onyx.peer.task-lifecycle] - [cf9ef6b1-bf1c-46aa-ad7a-9b25f8cb07be] Enough peers are active, starting the task"
None at all huh
nope, i don’t see those messages
OK, it sounds like the job hasn’t been started
Did the submit job get submitted to the right onyx/id?
yes, the onyx/id is currently hard-coded in the config file to "highstorm"
used throughout
it starts lots and then one ZK conn closes right at the end
is that significant?
`16-Feb-01 04:48:49 http://hs1.cognician.com INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.`
`16-Feb-01 04:48:53 http://hs1.cognician.com INFO [onyx.log.zookeeper] - Stopping ZooKeeper client connection`
Sounds like lots of peers starting up and a final submit job
Weren’t you doing the submit job as part of the startup process before?
it stops the ZK for me locally, and then the "Enough peers..." messages flood the log
yes; we still are
If so, are you doing a single submit job on one node now?
Otherwise you might be starting up three jobs
that’s why
I don’t know why one doesn’t get started though
each node starts 40 peers, the job requires 40 peers, and we start it twice
(because both start the job on startup)
so i would think that they’d still work
Yeah, I think it should still work
It’s my best guess though. Hrm.
ok. i’m going to down one onyx instance and get the other working first
one onyx + a 2 node zk cluster should work just fine, right
so, while i do that, how do you control which server submits the jobs
you have whatever is doing the deploy do the submit
we’d have to provide alternate config to one
in a separate process
sounds like a custom AWS CodeDeploy script
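One way to make “whatever is doing the deploy does the submit” concrete - a hedged sketch where only a node started with a flag submits; SUBMIT_JOBS and build-job are hypothetical names, onyx.api/submit-job is the real call:
```clojure
;; Sketch: guard job submission behind an env var so that when several peer
;; nodes start from the same uberjar, only the designated one submits.
(require '[onyx.api])

(defn maybe-submit-job!
  "Submits the job only if SUBMIT_JOBS=true in this process's environment.
   `job` is the job map built elsewhere (hypothetical build-job fn)."
  [peer-config job]
  (when (= "true" (System/getenv "SUBMIT_JOBS"))
    (onyx.api/submit-job peer-config job)))
```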
Are you running it with a new onyx/id just in case?
got a couple uk.co.real_logic.aeron.exceptions.DriverTimeoutException: Driver has been inactive for over 10000ms
on stopping the service, and then it hung for over a minute on kill-job
Otherwise all your jobs will still be scheduled
eventually kill -9’d it
i’ll hup ZK too
ZooKeeper is probably safe
i restarted it to discard scheduled jobs
ok. so maybe running ZK, Onyx, Aeron on a 2 core box aint such a good idea.
Since you discarded your ZK, make sure you set the start-tx in the log reader
This may just be testing anyway.
yeah i’ve coded it to go back 2 days in the datomic tx log
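A rough sketch of what “go back 2 days in the datomic tx log” could look like, using Datomic’s real Log API (datomic.api/log and tx-range); how the resulting t gets fed into the Onyx input task’s start point is app-specific here and not shown:
```clojure
;; Find the t of the first transaction at or after (now - 2 days); tx-range
;; accepts a java.util.Date as the start of the range.
(require '[datomic.api :as d])
(import '(java.util Date))

(defn start-t-two-days-back [conn]
  (let [two-days-ago (Date. (- (System/currentTimeMillis)
                               (* 2 24 60 60 1000)))]
    (-> (d/tx-range (d/log conn) two-days-ago nil)
        first
        :t)))
```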
@lucasbradstreet: another bit of info
although the job isn’t running (no task logging output), the system is super busy
these CPU numbers are representative of the last 20 mins at least
what might it be doing, if not processing :input?
@robert-stuttaford: is it possible your metrics are broken?
i.e. pointing at the wrong server?
Or something like that
that is possible
gosh. now i feel a fool. it is wrong. i’ll fix that and let you know how i go
Heh, it was starting to be the only explanation
used to be on the same node, then moved it
I'm surprised it didn't throw / log an exception on connect after a while
i can show you a nice big screenshot of all the tails if you like
Might as well. There might be a fix needed in onyx-metrics
You're still using the riemann sender, right?
slight variant for DataDog’s impl
This is a sender your team built? Figured you never ended up building it since we were going to put it in onyx-metrics at some point
yes, our own sender
K, my guess is a future without exception handling in it then
In the sender
well, this is why we have a test setup - to figure out all the grizzlies
Always a little bit satisfying when it's not our fault :p
it very often isn't
Futures are a good way to find out you don't handle your exceptions very well :). Been there many times
haha funny i wrote a note to myself to always consider that, but somehow still forgot to catch other exceptions there and log them, doh
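A minimal sketch of the failure mode and the fix, assuming the sender runs in a future and that timbre (already used by the template) is available for logging; send-fn and metrics-ch are hypothetical names:
```clojure
;; A bare (future (send-fn ...)) swallows any exception until the future is
;; deref'd, so the sender dies silently. Wrapping the body in try/catch and
;; logging the throwable makes the failure visible.
(require '[taoensso.timbre :as timbre])

(defn start-sender! [send-fn metrics-ch]
  (future
    (try
      (send-fn metrics-ch)
      (catch Throwable t
        (timbre/error t "metrics sender died")))))
```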
@lucasbradstreet: do you guys want a dogstatsd specific sender? otherwise we'll probably just open source it separately
Ah, if it's just datadog flavored statsd you guys feel free to open source it separately
I guess it'd be handy to have it on call if needed
@lucasbradstreet, the onyx/start-job fn returns nil
nevermind
i’m a dork
@robert-stuttaford: Over the hurdle?
not yet
quite stuck, actually
going to take a step back and make a fresh branch from known-good and apply changes and test bit by bit
Sounds good
New template is here! https://twitter.com/MichaelDrogalis/status/694202368095727616
There's a couple more steps but it makes for good copy ;)
Heh, indeed. Read the instructions in README.md after docker-compose is up.
I wonder how easy it would be to make some of that behavior not live in templates. I mean, I like templates and I get why that’s a great starting point for people, but now it’s also not clear if I want to backport changes to the thing I already created with the template or backport changes from the new template to the thing I already have 😄
there have been 46 commits in the last week, even if they’re small, that’s hard to keep track of
@lvh: There's not much in the way of feature-level behavior in the template. It's mostly an example of idioms that we find most helpful, and a suggestion about how to structure the project. I'd say give the new template a read and decide if you like the way it's set up.
As much as possible, we moved sharable behavior into https://github.com/onyx-platform/lib-onyx.
michaeldrogalis: Sure. I’m looking at changes like:
https://github.com/onyx-platform/onyx-template/commit/40c9f84d86d3ec711f4cd3a30c8eda28603f57b9
(which feels like a lein plugin waiting to happen)
or:
https://github.com/onyx-platform/onyx-template/commit/5a17f083c30d595965b892b8a95f4502e2dc52ac
(where it’s not clear why I want that and what other changes I need to backport for that to work)
or changes to:
https://github.com/onyx-platform/onyx-template/blob/0.8.x/src/leiningen/new/onyx_app/script/run_peers.sh
Where I still had the old version that runs two `exec java`s and it’s not clear to me which one I might want and why
I’m trying to backport things as they make sense since that’s probably more sustainable than just restarting every time the template updates
Tbh I'm trying to remain relatively hands off - time to see where other people run with the operational aspects. But he and @lucasbradstreet can give some insight
I agree with the general sentiment though, it's the drawback of templating. I would backport as well once I had a handle on it.
@lvh: The changes to run_peers are to allow docker to cleanup properly on shutdown https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/
@lvh: I totally agree with you. It’s going to take a bit more time to find the set of conventions that are universal across Onyx apps, and developing tooling to handle that will come.
I believe that the idioms put forward with this new template are general enough as to allow us flexibility while writing tools to manipulate your onyx jobs.
The eventual goal would be to get to something ruby on rails like.
onyx new job --input sql --output kafka
or similar
Sure! I wanna make clear that I’m not criticizing these efforts, and I’m super glad and thankful that you’re working on these universal conventions
Thanks for testing it out! I welcome harsh criticism, it’s important to get this right for a wide range of use cases.
Is https://github.com/onyx-platform/onyx-template/commit/5a17f083c30d595965b892b8a95f4502e2dc52ac literally all I need to port for Bookkeeper to work? If so, why do I want bookkeeper?
You can use Onyx w/out BookKeeper, Bookkeeper is used for persistence with state aggregation (windowing)
Essentially if you want to use the stateful windowing stuff, you need bookkeeper
I’d be pretty happy if we can make the dev env for my project just always be docker compose, and have that expose a REPL or something for local development; that feels like having fewer different things around
Not sure how familiar with docker-compose you are (i’m brand new, so I think this is cool) but you can do docker-compose scale peer=3 to create multiple containers, start a job, and kill off containers to test Onyx’s failover behavior
another reason I care about compose is that I work for rackspace, and we have a thing called carina
which means that if you can get your thing to run on docker compose, we can probably get you a bare metal machine that it works on, too
Is it similar to Kubernetes/Marathon?
It implements the docker api?
Ahh, that's so cool!
Gotta run for real now, back later
https://getcarina.com/
Sweet i’ll read up on it. My plan was to add options for +kubernetes etc. to generate the templates for you. +carina would be great to have too
the good news is if it just uses docker compose, you almost certainly need to write 0 code for that to happen
Yea thats interesting. Does it handle the networking for you like docker-compose does?
it’s docker-swarm under the hood, except it’s smart about multi-phys-host and multi-segment
Ok that makes sense
Wow thats really interesting
and when I say “rack” that sometimes means “physical rack, you know, with computers” and sometimes that means “rackspace” since, well, we’re operating it
they also figured out how to make it work hypervisorlessly, which is a nice performance boost
I’ve not used any of the docker/cluster management tools in a year or so but how easy is it to request volumes?
I remember that being a bit wonky when I was using Kubernetes, sometimes it would not work and fail silently on GCE
it continues to be a huge weak spot for docker, and it’s clear that the Thing You Should Use(TM) is “external services for storage”, obviously that doesn’t work all of the time
for better or worse, carina has very clearly chosen to interop directly with what docker provides, and not, unless absolutely necessary, write a proprietary alternative
unfortunately that means the tools are what docker gives you, and I’d be lying if I said those were perfect
gotcha
carina was working on (and I think this is done now?) giving you cloud block storage that you can attach as a volume, which essentially solves that problem
Yea that’s what Kubernetes does
but last I checked it was only possible on the GCE platform
Ohhh nice
No, but a client of mine that’s currently on RackSpace private cloud is looking at running a bunch of their services as docker containers. I saw that you guys have OpenStack as an option instead of ESXi for the private cloud stuff.
Carina+OpenStack might be a good option for them
I was going to say “you don’t want to run your own openstack, but we’ll totally run that for you” but that sentence sounded like I was being an awful corporate shill
but yes, we will totally run an openstack for you and that’s probably a good deal, turns out running clouds is p hard
and openstack is not optimized for the “I just want to mess around with this right now” audience
it’s good!
herrwolfe, sirsean and reaperhulk (not here yet, I don’t think) are folks on my team, FWIW
Hello! Thanks for the introduction. Makes it much easier when I know which people are sharing the same problems.
I’m currently trying to get the docker-compose thing running to get data through Kafka (and in this demo into MySQL). Can’t really tell where I’m hung up … submitting jobs seems to work but nothing appears in the database. Is there an obvious way to turn on more logging?
@sirsean: I assume you're looking at onyx.log?
@gardnervickers made this helpful walkthrough, also. http://recordit.co/OxM66e0kG8
Hey folks!
@sirsean: Hey, did you setup the db table?
gardnervickers: yep, I thought that was my problem (since I didn’t do it the first time) but now that it’s there it remains empty.
michaeldrogalis: the onyx.log file doesn’t seem to be getting updated, which makes sense since I’m running inside Docker (I believe it’s only there because I once ran the dev mode outside Docker).
onyx.log is replicated to stdout when using docker-compose
@sirsean: What do you see in your docker-compose logs?
Really? odd.. is the container for the onyx peer running if you do docker ps
You should be seeing something like
peer_1 | Attempting to connect to to Zookeeper: zk:2181
peer_1 | Started peers. Blocking forever.
in the docker-compose logs
Launched the Media Driver. Blocking forever...
16-Feb-01 21:28:44 f32b7bc67204 INFO [onyx.static.logging-configuration] - Starting Logging Configuration
16-Feb-01 21:28:44 f32b7bc67204 INFO [onyx.messaging.aeron] - Starting Aeron Peer Group
Attempting to connect to to Zookeeper: zk:2181
Started peers. Blocking forever.
Yea that’s fine
that means the cluster is up and waiting for a job
If you go and create the DB table
Then you can submit a job to ZK with
ZOOKEEPER=$(echo $DOCKER_HOST|cut -d ':' -f 2|sed "s/\/\///g") lein run -m app-name.jobs.sample-submit-job
Then you should get segments flowing into your DB
There will be some chatter from ZK if that’s what you mean
That’s how I had been submitting the job. I see ZK things happen and then no logs from the peer.
But do you see results accumulating in your DB?
Perhaps relevant, the kafkacat logs:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
It could totally be kafkacat, it’s just piping the results of curl to stdin on kafkacat and sometimes it breaks.
I would docker-compose rm to delete the containers and re-make them. Need to make a proper service that restarts on failure for kafkacat
Got this error in the middle of building kafkacat:
== running CMake in build directory
./configure: 41: ./configure: cmake: not found
The "cmake" program is required to configure yajl.
It's available from most ports/packaging systems and
Build of libyajl FAILED!
Failed to build libyajl: JSON support will probably be disabled
Building kafkacat
./bootstrap.sh: line 65: pkg-config: command not found
Using -lpthread -lz -lrt -lpthread -lz -lrt for rdkafka
./bootstrap.sh: line 65: pkg-config: command not found
grep: tmp-bootstrap/usr/local/lib/pkgconfig/yajl.pc: No such file or directory
Using for yajl
Yea I think so, i’ll look into it.
Oh sorry that’s not an issue
Can you make a gist with the output from docker-compose logs
after you have deleted your old containers (`docker-compose rm`)
You mean do a docker-compose rm, then docker-compose up, then docker-compose logs, and show the output?
Yea if you dont mind
just to eliminate possibilities here 😕
What OS are you on @sirsean?
I'm on OS X 10.9.5. Just a datapoint
95% of the problems I have with this setup is the kafkacat container.
looks good
create the table and submit the job, then update the logs please? Thanks for the help
$ ZOOKEEPER=$(echo $DOCKER_HOST|cut -d ':' -f 2|sed "s/\/\///g") lein run -m desdemona.jobs.sample-submit-job
16-Feb-01 16:28:43 lips.local INFO [onyx.log.zookeeper] - Starting ZooKeeper client connection. If Onyx hangs here it may indicate a difficulty connecting to ZooKeeper.
16-Feb-01 16:28:43 lips.local INFO [onyx.log.zookeeper] - Stopping ZooKeeper client connection
Submitted job: #uuid "d953c935-6d27-4d52-8bc5-5e9e2bd4f018"
Not ignoring, trying to recreate
I’m wondering if I should try to hook up a different source for Kafka that doesn’t use kafkacat. (It’s not like my actual app is going to use this, haha.)
@sirsean: I did find a bug with our logger. For some reason the default timbre logger was not logging our user namespaces
(defn standard-out-logger
  "Logger to output on std-out, for use with docker-compose"
  [data]
  (let [{:keys [output-fn]} data]
    (println (output-fn data))))

(defn -main [n & args]
  (let [n-peers (Integer/parseInt n)
        ;; assumes clojure.java.io is required with the alias io
        config (read-config (io/resource "config.edn") {:profile :default})
        peer-config (-> (:peer-config config)
                        (assoc :onyx.log/config {:appenders {:standard-out
                                                             {:enabled? true
                                                              :async? false
                                                              :output-fn t/default-output-fn
                                                              :fn standard-out-logger}}}))
        peer-group (onyx.api/start-peer-group peer-config)
        env (onyx.api/start-env (:env-config config))
        peers (onyx.api/start-peers n-peers peer-group)]
    (println "Attempting to connect to to Zookeeper: " (:zookeeper/address peer-config))
    (.addShutdownHook (Runtime/getRuntime)
                      (Thread.
                        (fn []
                          (doseq [v-peer peers]
                            (onyx.api/shutdown-peer v-peer))
                          (onyx.api/shutdown-peer-group peer-group)
                          (shutdown-agents))))
    (println "Started peers. Blocking forever.")
    ;; Block forever.
    (<!! (chan))))
Can you change your launch-prod-peers to look like that
Then, you will see if any segments are actually flowing through onyx after you docker-compose rm
, ./script/build.sh
, docker-compose up
You rebuilt the containers, correct?
(Had to go make the DB table again, but I don’t think that would’ve been a problem since nothing gets to that point.)
@gardnervickers: Could this be that obscure bug that popped up where the containers cant talk to the internet?
When the DNS settings are not transferred to the container by the docker environment
Not really sure, Onyx is starting up fine it seems.
Er, I meant maybe that’s why kafkacat isn’t pumping in messages
I’m rebuilding the kafkacat image with dig in the script.sh to see if it’s able to even connect.
Good call
@michaeldrogalis: sorry yea I got that, thinking out loud.
My last resort is to usually delete the docker images associated with the docker-thing im doing 😕
It's okay, hard to escape Docker being finicky on everyone's machine
kafkacat_1 | ;; ANSWER SECTION:
kafkacat_1 | . 299 IN A 104.16.49.168
kafkacat_1 | . 299 IN A 104.16.50.168
kafkacat_1 | . 299 IN A 104.16.52.168
kafkacat_1 | . 299 IN A 104.16.53.168
kafkacat_1 | . 299 IN A 104.16.51.168
kafkacat_1 |
kafkacat_1 | ;; Query time: 34 msec
kafkacat_1 | ;; SERVER: 8.8.8.8#53(8.8.8.8)
Can you delete your peer, kafkacat, kafka and zk images?
Sorry to go nuclear but I have no idea what else it could be
I’m more interested in figuring out why even after you made the code changes and rebuilt, you were not seeing any logs from the peer
I’m going to revert that commit that was preventing proper logging, though. Maybe @michaeldrogalis can push the update?
I have to run for a few, @sirsean I pushed the changes to https://github.com/onyx-platform/onyx-template but I cannot do a deploy. If you want you can clone, lein install, and lein new onyx-app my-app-name +docker.
Or wait for @michaeldrogalis to deploy
That should get you output looking like this
zookeeper_1 | 2016-02-01 23:14:38,737 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /172.17.0.6:47432 which had sessionid 0x1529f108fe10083
peer_1 | 16-Feb-01 23:14:41 f45c733dc20b INFO [onyxapp.lifecycles.logging] - :write-lines logging segment: {:rows [{"groupId" 19094838, "groupCity" "Chatsworth", "category" "health/wellbeing"}]}
peer_1 | 16-Feb-01 23:14:41 f45c733dc20b INFO [onyxapp.lifecycles.logging] - :write-lines logging segment: {:rows [{"groupId" 15173842, "groupCity" "Renton", "category" "socializing"}]}
zookeeper_1 | 2016-02-01 23:14:42,242 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /172.17.0.6:47433
zookeeper_1 | 2016-02-01 23:14:42,244 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /172.17.0.6:47433
zookeeper_1 | 2016-02-01 23:14:42,245 [myid:] - INFO [SyncThread:0:ZooKeeperServer@617] - Established session 0x1529f108fe10084 with negotiated timeout 5000 for client /172.17.0.6:47433
zookeeper_1 | 2016-02-01 23:14:42,248 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@494] - Processed session termination for sessionid: 0x1529f108fe10084
zookeeper_1 | 2016-02-01 23:14:42,250 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /172.17.0.6:47433 which had sessionid 0x1529f108fe10084
peer_1 | 16-Feb-01 23:14:44 f45c733dc20b INFO [onyxapp.lifecycles.logging] - :write-lines logging segment: {:rows [{"groupId" 5073632, "groupCity" "Yelm", "category" "paranormal"}]}
peer_1 | 16-Feb-01 23:14:44 f45c733dc20b INFO [onyxapp.lifecycles.logging] - :write-lines logging segment: {:rows [{"groupId" 3054822, "groupCity" "Jackson Heights", "category" "movements/politics"}]}
zookeeper_1 | 2016-02-01 23:14:44,755 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /172.17.0.6:47434
zookeeper_1 | 2016-02-01 23:14:44,757 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /172.17.0.6:47434
zookeeper_1 | 2016-02-01 23:14:44,758 [myid:] - INFO [SyncThread:0:ZooKeeperServer@617] - Established session 0x1529f108fe10085 with negotiated timeout 5000 for client /172.17.0.6:47434
zookeeper_1 | 2016-02-01 23:14:44,763 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@494] - Processed session termination for sessionid: 0x1529f108fe10085
zookeeper_1 | 2016-02-01 23:14:44,765 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /172.17.0.6:47434 which had sessionid 0x1529f108fe10085
I just deleted all my docker images, generated a fresh template and followed the steps to get that working. I’ll be back in a few hours if you’re still having problems just write them here and I’ll get to it. Thanks for helping me debug this