
When following the guide to deploy the Datomic transactor on AWS, I end up with the problem described here:


I tried to find the script that gets downloaded from a Datomic S3 bucket to find out what the problem could be, but cannot find it


I also tried to boot the Datomic AMI by hand to figure out how to fix this, but I cannot log into the instance via ssh, although I have a key pair set up and used it while booting the AMI


The security group also allows inbound ssh traffic


Can someone help me figure out how to fix this, or am I better off setting up a custom AMI and a manual/custom automated transactor installation?


@gerstree, @bkamphaus should be able to assist when he appears


@robert-stuttaford: great, thanks. I will continue with the rest of my setup and wait a bit for help on the transactor part.


@gerstree: What size EC2 instance are you attempting to start the txor on?


@marshall It was a c3.large instance


@gerstree: what are the Xmx and memory settings you’re using?


re: process, you’re using the ensure-transactor, ensure-cf, create-cf-template, create-cf-stack commands? For the keypair, you added it to the generated JSON (using AWS docs or some other resource for guidance)?


@bkamphaus, I did exactly what was documented with ensure-transactor, ensure-cf, create-cf-template and create-cf-stack. I did not add the keypair to the generated JSON when running create-cf-stack
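For reference, the provisioning flow being discussed follows the Datomic AWS guide; a sketch of the four commands, assuming the in-place file-update signatures from that guide (file names and the stack name here are placeholders, and the commands require valid AWS credentials and a Datomic Pro distribution):

```shell
# Validate/augment the transactor properties (writes back required AWS settings)
bin/datomic ensure-transactor my-transactor.properties my-transactor.properties

# Validate/augment the CloudFormation properties
bin/datomic ensure-cf my-cf.properties my-cf.properties

# Generate a CloudFormation template from the two properties files
bin/datomic create-cf-template my-transactor.properties my-cf.properties cf-template.json

# Launch the stack in a region (placeholder region and stack name)
bin/datomic create-cf-stack us-east-1 MyTransactorStack cf-template.json
```

An EC2 key pair is not added by these steps; per the discussion above, it would have to be added to the generated JSON by hand.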


I did however start an instance from the console using a keypair (that works for several other instances) and could not connect to that instance


Should the image allow ssh access?


some common culprits for this sort of thing: heap size too large for instance, memory settings larger than 75% of heap size, unsupported instance type (looking into it), license not valid for version of Datomic requested


I am looking for a way to find the error; what would be the best approach? I was trying to start an instance and run the script manually.


Is that script available somewhere?


The Xmx was 2625m by the way


Looks reasonable for the 3.75G of that c3.large instance type, no?


memory-index-max + object-cache-max = ?
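The question above is the key sizing check: the transactor's memory-index-max and object-cache-max have to fit inside the JVM heap with headroom. A minimal sketch of that arithmetic, assuming hypothetical property values (the 2625m Xmx is the one from this thread; the cache sizes are placeholders):

```python
def parse_mem(s):
    """Parse a JVM-style memory string like '2625m' or '1g' into MiB."""
    units = {"m": 1, "g": 1024}
    return int(s[:-1]) * units[s[-1].lower()]

def caches_fit_in_heap(xmx, memory_index_max, object_cache_max):
    """True if memory-index-max + object-cache-max fits inside Xmx."""
    budget = parse_mem(memory_index_max) + parse_mem(object_cache_max)
    return budget < parse_mem(xmx)

# The Xmx from the thread (2625m on a c3.large with 3.75 GiB RAM):
print(caches_fit_in_heap("2625m", "256m", "128m"))  # → True, fits
print(caches_fit_in_heap("2625m", "2g", "1g"))      # → False, caches exceed heap
```

If the second case describes the deployed properties file, the transactor can die at startup with no obvious error, which matches the symptom in this thread.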


I don’t know if there's a good generic troubleshooting route I can offer for the dead start on CF, beyond addressing common config issues. The general approach I use is to validate the settings against e.g. local dev, then build the transactor properties + cf properties around the working settings, then push.


I understand. I was looking for a way to get hold of the real error; it's annoying that the instance terminates and can't be restarted. And the instance I start by hand is not accessible via ssh for some reason.


@gerstree: I verified that ssh is turned off at the AMI level as a security measure


assuming you’re running with a super user AWS key (i.e. the same one you used with ensure*), can you start a local transactor against the ddb table set up by ensure-transactor using the same settings (also forcing the same JVM args re: Xmx, etc.)?
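One way to do that local reproduction, assuming the documented bin/transactor launcher (which accepts JVM flags before the properties file) and a placeholder properties file pointing at the DynamoDB table that ensure-transactor created:

```shell
# Run a local transactor against the ddb table from ensure-transactor,
# forcing the same heap as the CF deployment (2625m, from this thread).
# my-transactor.properties is a placeholder name; it must carry the same
# protocol/table settings as the CF stack, and AWS credentials must be
# available in the environment.
bin/transactor -Xmx2625m -Xms2625m my-transactor.properties
```

Run locally, any startup failure (bad license, memory misconfiguration, storage permissions) prints to the console instead of disappearing with a terminated EC2 instance.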


If you want to PM me a version of the transactor properties and cf properties files redacted of any credentials or sensitive info on Slack, I can look it over. We do have customers that configure their own instance or CF for some of the reasons you mentioned - e.g. wanting to ssh in to access logs, etc.


In general, it’s typically the issues I mentioned - wrong instance type (you should be fine here), too much heap (you’re ok here), not enough heap for transactor properties memory settings (unsure yet), invalid license (unsure yet) — if you use ensure with AWS key w/appropriate permissions your security groups, role policies, etc. should be fine, but if not these can contribute as well.


Once you’re going, cloudwatch metrics and log rotation are almost always sufficient for figuring out problems that arise.


I am indeed running with a super user AWS key (we tried via IAM first, but that was hopping from policy to policy... no fun).


I have not tried connecting to ddb from a local transactor, let me try that first.


can anyone offer advice on a transactor that keeps timing out? It'll timeout even if I give it a transaction with just a single datom.


I assume this means it's busy with something, but I'm unable to tell what that is or how to stop it


@kbaribeau: do you have access to logs or metrics? In either do you see AlarmBackPressure ? (one scenario is that it could be in the middle of a large indexing job with several transactions backed up)


that would show up in both logs and cloudwatch metrics?


Yeah. w/logs you can also just grep for [Ee]rror, [Ee]xception, and Alarm as a first sanity check for health.
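That sanity-check grep can be sketched like this; the sample log lines are made up for illustration, not real transactor output:

```shell
# Create a tiny fake transactor log to demonstrate the health grep.
cat > sample.log <<'EOF'
2016-01-01 12:00:00 INFO  heartbeat ok
2016-01-01 12:00:05 WARN  o.datomic.transactor - AlarmBackPressure
2016-01-01 12:00:06 ERROR java.lang.NullPointerException: null
EOF

# First-pass health check: surface errors, exceptions, and alarms.
grep -E '([Ee]rror|[Ee]xception|Alarm)' sample.log
```

Only the WARN and ERROR lines match; routine heartbeat lines are filtered out.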


Or w/metrics look for any Alarm(s) — other storage metrics as well such as StoragePutGetBackoffMsec etc. could be an indicator if e.g. storage provisioning is an issue.


I've got a NullPointerException in the log from yesterday


java.lang.NullPointerException: null
	at datomic.db$get_ids$fn__4004.invoke(db.clj:2352) ~[datomic-transactor-pro-0.9.5186.jar:na]
	at clojure.core.protocols$fn__6074.invoke(protocols.clj:79) ~[clojure-1.6.0.jar:na]
	at clojure.core.protocols$fn__6031$G__6026__6044.invoke(protocols.clj:13) ~[clojure-1.6.0.jar:na]
	at clojure.core$reduce.invoke(core.clj:6289) ~[clojure-1.6.0.jar:na]
	at datomic.db$get_ids.invoke(db.clj:2367) ~[datomic-transactor-pro-0.9.5186.jar:na]
	at datomic.db.ProcessExpander.getData(db.clj:2425) ~[datomic-transactor-pro-0.9.5186.jar:na]
	at datomic.update$processor$fn__10124$fn__10125$fn__10126$fn__10130$fn__10133$fn__10134.invoke(update.clj:246) ~[datomic-transactor-pro-0.9.5186.jar:na]
	at clojure.lang.Atom.swap( ~[clojure-1.6.0.jar:na]
	at clojure.core$swap_BANG_.invoke(core.clj:2232) ~[clojure-1.6.0.jar:na]
	at datomic.update$processor$fn__10124$fn__10125$fn__10126$fn__10130$fn__10133.invoke(update.clj:240) ~[datomic-transactor-pro-0.9.5186.jar:na]
	at datomic.update$processor$fn__10124$fn__10125$fn__10126$fn__10130.invoke(update.clj:238) ~[datomic-transactor-pro-0.9.5186.jar:na]
	at datomic.update$processor$fn__10124$fn__10125$fn__10126.invoke(update.clj:235) [datomic-transactor-pro-0.9.5186.jar:na]
	at datomic.update$processor$fn__10124$fn__10125.invoke(update.clj:216) [datomic-transactor-pro-0.9.5186.jar:na]
	at datomic.update$processor$fn__10124.invoke(update.clj:216) [datomic-transactor-pro-0.9.5186.jar:na]
	at datomic.update$processor.doInvoke(update.clj:216) [datomic-transactor-pro-0.9.5186.jar:na]
	at clojure.lang.RestFn.applyTo( [clojure-1.6.0.jar:na]
	at clojure.core$apply.invoke(core.clj:626) [clojure-1.6.0.jar:na]
	at datomic.update$background$proc__10046.invoke(update.clj:58) [datomic-transactor-pro-0.9.5186.jar:na]
	at clojure.lang.AFn.run( [clojure-1.6.0.jar:na]
	at [na:1.7.0_55]


I'm not sure our metrics are configured correctly. looking into it


I can't even see a metric named AlarmBackPressure, but other metrics look reasonable. I think.


did transactions stop going through with the NPE? (you would no longer see, e.g., TransactionBytes metrics after that point if so)


The metrics are making it look that way, although there are log entries after the NPE (but not many)


heartbeats are still getting through


Could restarting the transactor instance help?


Do you have any really large transactions that precede that? (seen from high TransactionBytes values).


Def. worth restarting the transactor to see if that resolves the issue.


Just to note, though, that NPE looks like it's a transaction function error; I wouldn’t expect it to have an impact on the health of the system.


Oh, interesting.


The largest TransactionBytes value I see in the last day is 257 bytes. Seems pretty small.


I think I'll restart, thanks for the help so far. Knowing which metrics to look at is definitely useful :simple_smile:


if you have a long indexing job (say something like excise where a lot of segments have to be rewritten), you’ll see IndexWriteMsec values during the indexing job, so having a lot of those when transactions are timing out could be an indication that you’re stuck in indexing.


Cool. I had suspected indexing at one point but didn't realize there was a metric for it.


I don’t believe that metric is in every version, but I don’t remember off the top of my head which one it was introduced in


CreateEntireIndexMsec will also show up at the end of a successful indexing job. AlarmIndexingFailed will show up if indexing fails (and these failures are usually related to memory issues on the transactor if they do show up).


@bkamphaus, no need to look at our template anymore. I have the transactor up and running on AWS.


You put me on the right track by making me start the transactor locally first, talking to ddb on Amazon


Thanks so much


@gerstree: glad you were able to get the issue resolved!