#datomic
2015-07-16
gerstree09:07:44

When following the guide (http://docs.datomic.com/aws.html) to deploy datomic transactor on AWS I end up with the problem described here: http://comments.gmane.org/gmane.comp.db.datomic.user/4568

gerstree09:07:48

I tried to find the startup.sh script that gets downloaded from a Datomic S3 bucket to find out what the problem could be, but cannot find it

gerstree09:07:37

I also tried to boot the Datomic AMI by hand to figure out how to fix this, but I cannot log into the instance by ssh, although I have a key pair set up and used it while booting the AMI

gerstree09:07:28

The security group also allows inbound ssh traffic

gerstree10:07:42

Can someone help me figure out how to fix this, or am I better off setting up a custom AMI and a manual/custom automated transactor installation?

robert-stuttaford11:07:46

@gerstree, @bkamphaus should be able to assist when he appears

gerstree11:07:00

@robert-stuttaford: great, thanks. I will continue with the rest of my setup and wait a bit for help on the transactor part.

marshall12:07:55

@gerstree: What size EC2 instance are you attempting to start the txor on?

gerstree12:07:53

@marshall It was a c3.large instance

Ben Kamphaus13:07:42

@gerstree: what’s the Xmx and memory settings you’re using?

Ben Kamphaus13:07:32

re: process, you’re using the ensure-transactor, ensure-cf, create-cf-template, create-cf-stack commands? For the keypair, you added it to the generated JSON (using AWS docs or some other resource for guidance)?

gerstree13:07:18

@bkamphaus, I did exactly what was documented in http://docs.datomic.com/aws.html: ensure-transactor, ensure-cf, create-cf-template and create-cf-stack. I did not add the keypair to the generated JSON when using the create-cf-stack
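
For reference, the documented workflow boils down to roughly this command sequence (file and stack names here are placeholders, not necessarily the ones used):

    bin/ensure-transactor my-transactor.properties my-transactor.properties
    bin/ensure-cf my-cf.properties my-cf.properties
    bin/create-cf-template my-transactor.properties my-cf.properties my-cf.json
    bin/create-cf-stack us-east-1 MyTransactorStack my-cf.json

The key pair Ben asks about would be edited into the generated my-cf.json before the create-cf-stack step.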

gerstree13:07:11

I did however start an instance from the console using a keypair (that works for several other instances) and could not connect to that instance

gerstree13:07:34

Should the image allow ssh access?

Ben Kamphaus13:07:56

some common culprits for this sort of thing: heap size too large for instance, memory settings larger than 75% of heap size, unsupported instance type (looking into it), license not valid for version of Datomic requested

gerstree13:07:00

I am looking for a way to find the error, what would be the best approach? I was trying to start an instance and run the startup.sh manually.

gerstree13:07:05

Is that startup.sh script available somewhere?

gerstree13:07:14

The Xmx was 2625m by the way

gerstree13:07:51

Looks reasonable for the 3.75G of that c3.large instance type, no?

Ben Kamphaus13:07:01

memory-index-max + object-cache-max = ?
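
For context, these settings live in the transactor properties file; the values below are illustrative only, not the ones from gerstree's template, and their sum has to leave headroom under the -Xmx2625m heap:

    memory-index-threshold=32m
    memory-index-max=512m
    object-cache-max=1g

Here 512m + 1g would leave roughly 1g of the 2625m heap for everything else.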

Ben Kamphaus13:07:27

I don’t know if there’s a good generic troubleshooting route I can offer for the dead start on CF, past addressing common config issues. The general approach I use is to validate the settings against e.g. a local dev setup, then build the transactor properties + cf properties around the working settings, then push.

gerstree13:07:39

I understand. I was looking for a way to get hold of the real error; it's annoying that the instance terminates and can't be restarted. And the instance I start by hand is not accessible via ssh for some reason.

Ben Kamphaus13:07:48

@gerstree: I verified that ssh is turned off at the AMI level as a security measure

Ben Kamphaus13:07:03

assuming you’re running with a super user AWS key (i.e. the same one you used with ensure*), can you start a local transactor against the ddb table set up by ensure-transactor using the same settings (also forcing the same JVM args re: Xmx, etc.)?
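
A minimal sketch of that check, assuming credentials are passed via environment variables and reusing the properties file generated by ensure-transactor (file name is a placeholder):

    export AWS_ACCESS_KEY_ID=<super-user-access-key-id>
    export AWS_SECRET_KEY=<super-user-secret-key>
    bin/transactor -Xmx2625m my-transactor.properties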

Ben Kamphaus13:07:59

If you want to PM me a version of the transactor properties and cf properties files redacted of any credentials or sensitive info on Slack, I can look it over. We do have customers that configure their own instance or CF for some of the reasons you mentioned - e.g. wanting to ssh in to access logs, etc.

Ben Kamphaus13:07:11

In general, it’s typically the issues I mentioned - wrong instance type (you should be fine here), too much heap (you’re ok here), not enough heap for transactor properties memory settings (unsure yet), invalid license (unsure yet). If you use ensure with an AWS key w/ appropriate permissions, your security groups, role policies, etc. should be fine, but if not, these can contribute as well.

Ben Kamphaus13:07:51

Once you’re going, cloudwatch metrics and log rotation are almost always sufficient for figuring out problems that arise.

gerstree14:07:40

I am indeed running with a super user AWS key (we tried via IAM first, but that was hopping from policy to policy... no fun).

gerstree14:07:34

I have not tried connecting to ddb from a local transactor, let me try that first.

kbaribeau15:07:35

can anyone offer advice on a transactor that keeps timing out? It'll time out even if I give it a transaction with just a single datom.

kbaribeau15:07:23

I assume this means it's busy with something, but I'm unable to tell what that is or how to stop it

Ben Kamphaus15:07:16

@kbaribeau: do you have access to logs or metrics? In either, do you see AlarmBackPressure? (one scenario is that it could be in the middle of a large indexing job with several transactions backed up)

kbaribeau15:07:21

that would show up in both logs and cloudwatch metrics?

Ben Kamphaus15:07:03

Yeah. w/logs you can also just grep for [Ee]rror [Ee]xception and Alarm as a first sanity check for health.
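
Something along these lines, assuming the default log directory under the transactor install:

    grep -E '(Error|error|Exception|exception|Alarm)' log/*.log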

Ben Kamphaus15:07:04

Or w/ metrics, look for any Alarm(s); other storage metrics such as StoragePutGetBackoffMsec could also be an indicator if e.g. storage provisioning is an issue.

kbaribeau15:07:22

I've got a NullPointerException in the log from yesterday

kbaribeau15:07:27

java.lang.NullPointerException: null
	at datomic.db$get_ids$fn__4004.invoke(db.clj:2352) ~[datomic-transactor-pro-0.9.5186.jar:na]
	at clojure.core.protocols$fn__6074.invoke(protocols.clj:79) ~[clojure-1.6.0.jar:na]
	at clojure.core.protocols$fn__6031$G__6026__6044.invoke(protocols.clj:13) ~[clojure-1.6.0.jar:na]
	at clojure.core$reduce.invoke(core.clj:6289) ~[clojure-1.6.0.jar:na]
	at datomic.db$get_ids.invoke(db.clj:2367) ~[datomic-transactor-pro-0.9.5186.jar:na]
	at datomic.db.ProcessExpander.getData(db.clj:2425) ~[datomic-transactor-pro-0.9.5186.jar:na]
	at datomic.update$processor$fn__10124$fn__10125$fn__10126$fn__10130$fn__10133$fn__10134.invoke(update.clj:246) ~[datomic-transactor-pro-0.9.5186.jar:na]
	at clojure.lang.Atom.swap(Atom.java:37) ~[clojure-1.6.0.jar:na]
	at clojure.core$swap_BANG_.invoke(core.clj:2232) ~[clojure-1.6.0.jar:na]
	at datomic.update$processor$fn__10124$fn__10125$fn__10126$fn__10130$fn__10133.invoke(update.clj:240) ~[datomic-transactor-pro-0.9.5186.jar:na]
	at datomic.update$processor$fn__10124$fn__10125$fn__10126$fn__10130.invoke(update.clj:238) ~[datomic-transactor-pro-0.9.5186.jar:na]
	at datomic.update$processor$fn__10124$fn__10125$fn__10126.invoke(update.clj:235) [datomic-transactor-pro-0.9.5186.jar:na]
	at datomic.update$processor$fn__10124$fn__10125.invoke(update.clj:216) [datomic-transactor-pro-0.9.5186.jar:na]
	at datomic.update$processor$fn__10124.invoke(update.clj:216) [datomic-transactor-pro-0.9.5186.jar:na]
	at datomic.update$processor.doInvoke(update.clj:216) [datomic-transactor-pro-0.9.5186.jar:na]
	at clojure.lang.RestFn.applyTo(RestFn.java:139) [clojure-1.6.0.jar:na]
	at clojure.core$apply.invoke(core.clj:626) [clojure-1.6.0.jar:na]
	at datomic.update$background$proc__10046.invoke(update.clj:58) [datomic-transactor-pro-0.9.5186.jar:na]
	at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:na]
	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]

kbaribeau15:07:55

I'm not sure our metrics are configured correctly. looking into it

kbaribeau15:07:07

I can't even see a metric named AlarmBackPressure, but other metrics look reasonable. I think.

Ben Kamphaus15:07:41

did transactions stop going through with the NPE? (if so, you would no longer see e.g. TransactionBytes metrics after that point)
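
One way to check that from the command line is the AWS CLI; the metric namespace and the time window below are assumptions:

    aws cloudwatch get-metric-statistics --namespace Datomic \
        --metric-name TransactionBytes --statistics Sum --period 300 \
        --start-time 2015-07-15T00:00:00 --end-time 2015-07-16T00:00:00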

kbaribeau15:07:01

The metrics are making it look that way, although there are log entries after the NPE (but not many)

kbaribeau15:07:21

heartbeats are still getting through

kbaribeau15:07:51

Could restarting the transactor instance help?

Ben Kamphaus15:07:59

Do you have any really large transactions that precede that? (seen from high TransactionBytes values).

Ben Kamphaus15:07:18

Def. worth restarting the transactor to see if that resolves the issue.

Ben Kamphaus15:07:39

Just to note, though: that NPE looks like it’s a transaction function error, so I wouldn’t expect it to have an impact on the health of the system.

kbaribeau15:07:11

Oh, interesting.

kbaribeau15:07:58

The largest TransactionBytes value I see in the last day is 257 bytes. Seems pretty small.

kbaribeau15:07:06

I think I'll restart, thanks for the help so far. Knowing which metrics to look at is definitely useful :simple_smile:

Ben Kamphaus15:07:44

if you have a long indexing job (say something like excise where a lot of segments have to be rewritten), you’ll see IndexWriteMsec values during the indexing job, so having a lot of those when transactions are timing out could be an indication that you’re stuck in indexing.

kbaribeau15:07:47

Cool. I had suspected indexing at one point but didn't realize there was a metric for it.

Ben Kamphaus16:07:40

I don’t believe that metric is in every version, but I don’t remember off the top of my head in which one it was introduced

Ben Kamphaus16:07:38

CreateEntireIndexMsec will also show up at the end of a successful indexing job. AlarmIndexingFailed will show up if indexing fails (and these failures are usually related to memory issues on the transactor if they do show up).
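
To see which of these metrics a given transactor version is actually publishing, listing everything under the namespace (again assuming it is named Datomic) is a quick check:

    aws cloudwatch list-metrics --namespace Datomic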

gerstree17:07:40

@bkamphaus, no need to look at our template anymore. I have the transactor up and running on AWS.

gerstree17:07:13

You put me on the right track by making me start the transactor locally first, talking to ddb on Amazon

gerstree17:07:19

Thanks so much

Ben Kamphaus17:07:14

@gerstree: glad you were able to get the issue resolved!