Greetings! I’m getting an “Error communicating with HOST” message after my transactors died and rebooted on AWS. The IP address in the error message no longer exists because the transactors came back on new IP addresses. It appears that the transactors did not write their new IP addresses to storage (DynamoDB) and the app is getting the wrong IP address. How do I check what IP address is stored in Dynamo?


FWIW, I’m also unable to connect in the REPL.


Another interesting note: this only happens on one of our production apps. The other production apps connect as expected.


Quick update: In the REPL, I see the following error:


Might be an SSL issue?


@statonjr: if the transactor can’t write its ip address, it should be failing. Are the transactors staying up?


Transactors are staying up. Both fell down on Thursday evening, but came right back up and have been up ever since.


@statonjr You can run datomic.peer/transactor-endpoint (side note: diagnostics tool, not stable api, so don’t use outside of this intended purpose) to sanity check what the current transactor endpoint is. If peer can’t connect with that error, endpoint is what you expect, and transactors are fine, check to see if anything about security groups, etc. changed? That’s what the issue looks like on the surface at present.


The app is pointing at the wrong endpoint for some reason. Also using the previous transactor version. Probably a deploy error on our side.


@bkamphaus: Is :version the transactor version or the peer version?


@statonjr: for the map returned by transactor-endpoint, it’s the transactor version.


Makes sense. Thanks.


@bkamphaus: Our staging environment works with the previous version. Only the host is different.


We’re checking security groups, but we haven’t changed anything there recently.


The :host value is incorrect and has been since the transactors went down. When they came back up, it appears that they either failed to write their IP addresses, or they did write them and DynamoDB failed somewhere.


@bkamphaus: Fixed. We rebooted the transactors and then the app. Our app has this code: (def conn (delay (d/connect url))) that I think does some caching somewhere. I’m going to investigate, but we’re back.


@statonjr: glad to hear you’re back. I wouldn’t expect transactors failing to write their IP address to be an issue - they should be writing their hostname and alt-host (assuming you’re using our cf tools or your own bootstrap logic, they should get that when machine goes up), and the IP is written as part of heartbeat. I.e. if they can’t write and read it, they would experience heartbeat failures and go down.


What is the meaning of added? in [entity attribute value transaction added?] ?


Makes sense. BTW, after we rebooted the transactor but before bouncing the app, we ran datomic.peer/transactor-endpoint and could see the new IP addresses, which matched our EC2 instances. When I created a connection with (d/connect url), I could connect and run queries. When I tried to use the delay above, it failed and showed the old :host IP.
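
For anyone following along, here is a minimal sketch of the plain Clojure `delay` semantics that could produce this kind of symptom. It does not use Datomic at all: `current-host` and `fake-connect` are hypothetical stand-ins invented for illustration, and a real Datomic connection is not a simple snapshot like this (peers are expected to follow a failover on their own). It only shows that a `delay` runs its body once and hands back the same value afterwards.

```clojure
;; Hypothetical stand-in for the endpoint the transactor writes to storage.
(def current-host (atom "10.0.0.1"))

;; Hypothetical stand-in for d/connect: captures the host visible right now.
(defn fake-connect []
  {:host @current-host})

;; Same shape as the app code: the body runs once, on the first deref.
(def conn (delay (fake-connect)))

(def before @conn)                 ; forces the delay, captures 10.0.0.1
(reset! current-host "10.0.0.2")   ; the transactor comes back on a new IP
(def after @conn)                  ; the cached value again, body is not re-run
(def fresh (fake-connect))         ; a new call sees the new host
```

Under these assumptions `before` and `after` are the same object holding the old host, while `fresh` holds the new one. Whether this is what actually happened in the app is still an open question.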


@sdegutis: added distinguishes assertions from retractions. true for assert, false for retract.
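
A small sketch of that distinction, using plain Clojure vectors in the [entity attribute value transaction added?] shape rather than real Datomic datoms. The entity id, attribute name, values, and transaction ids are all made up for illustration.

```clojure
;; Plain vectors standing in for datoms: [e a v tx added?].
;; All ids and values here are hypothetical.
(def datoms
  [[17 :person/email "old@example.com" 1000 true]    ; assertion
   [17 :person/email "old@example.com" 1001 false]   ; retraction
   [17 :person/email "new@example.com" 1001 true]])  ; assertion

;; Keep the tuples whose fifth element (added?) is true.
(defn assertions [ds]
  (filter (fn [[_ _ _ _ added?]] added?) ds))

;; Keep the tuples whose fifth element (added?) is false.
(defn retractions [ds]
  (remove (fn [[_ _ _ _ added?]] added?) ds))

(def asserted-values  (mapv #(nth % 2) (assertions datoms)))
(def retracted-values (mapv #(nth % 2) (retractions datoms)))
```

Here the old email is asserted in one transaction, then retracted in the next one, which also asserts the new email.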


Only after I rebooted the app was I able to connect to the new transactor.


Oooh. Cool, thanks.


@statonjr: I’m not sure what effect the caching implied by delay should have, but peers should automatically reconnect to a new transactor on failover.


At present Datomic will cache the connection, so it’s probably not necessary to put the connect call in the body of a `delay` (i.e. if you connect twice in the same app/peer lib to the same database, you’ll get the previous connection).


I’m not sure, either, and we have Immutant in there, too. I’m going to look closer this weekend.


At least I have a stack trace to look at!


Is it expected behavior that if I have a transactor A on AWS with ddb storage table X, starting up a second transactor B, pointing at ddb table X, will crash transactor A?