#datomic
2018-11-05
lwhorton16:11:07

this weekend i started a mini project to explore datomic cloud (solo topo). i accidentally used the wrong key-pair when setting up cloudformation. after the stack was created I deleted the parent stack (which also removed storage, compute stacks). i then followed https://docs.datomic.com/cloud/operation/deleting.html to remove all durable storage stuff, too. i used aws’ tag search feature to find anything residual created by cloudformation to make sure it was removed, too (the only thing still around were the old terminated ec2 instances). when i try to follow the marketplace template again (this time picking the correct key pair) i keep getting a failure creating the new Storage stack:

The following resource(s) failed to create: [StorageF7F305E7]. . Rollback requested by user.

lwhorton16:11:48

inside of that storage failure, the error resources are:

The following resource(s) failed to create: [LogTableReadScalingPolicy, MountTarget1, LogTableWriteScalingPolicy, MountTarget2, MountTarget0, AdminPolicy].

lwhorton16:11:14

i’m not clear on why these steps failed, other than maybe they weren’t 100% cleaned up properly following the delete steps?

marshall17:11:33

@lwhorton More likely you don’t have the IAM permissions required

marshall17:11:38

are you signed in as an AWS admin?

lwhorton17:11:03

yes, i have the adminaccess group which is allow

lwhorton17:11:11

should i even be able to re-create a stack with the exact same stack name (and app name, for that matter)?

marshall17:11:39

in general yes, although it could be that some resources did not get deleted, which would cause the kind of issue you’re seeing

marshall17:11:44

well, could cause

lwhorton17:11:56

i feel like it’s just lingering state somewhere, but it’s hard to find where

lwhorton17:11:41

i’ve had issues with CF in the past where a stack gets stuck in some interminable state, and i’ve had to contact aws support directly to clear things up lol

lwhorton17:11:15

i’ll keep digging around and will let you know if i find the root of the problem

kenny17:11:27

I received this exception during the ValidateService while deploying my Ions:

{
    "Type": "clojure.lang.Compiler$CompilerException",
    "Message": "java.lang.StackOverflowError, compiling:(potemkin/walk.clj:10:13)",
    "At": [
        "clojure.lang.Compiler",
        "analyze",
        "Compiler.java",
        6792
    ]
}
This is the first time I have received this exception. It appears to be coming from clj-http. Upon redeploying the Ions, deploy succeeded. It smells like some sort of race condition. Any idea how I can avoid this problem in the future?

steveb8n19:11:24

@kenny if you are using solo, it's likely a memory limit problem we all experience. The location in the exception is irrelevant. There is a workaround if you upgrade the datomic stack and edit the CF template to increase JVM memory Params. I haven't done this yet, I just rerun deploys and ignore it

kenny19:11:35

I am using production.

marshall19:11:26

@kenny i’m working on a doc improvement that should help suss that out and reproduce locally. I’ll ping you when i’ve got it put together

marshall15:11:32

@kenny Take a look here: https://docs.datomic.com/cloud/ions/ions-reference.html#jvm-settings Can you try running locally with the same JVM settings used in your stack and let us know if you don’t see the same behavior locally?

kenny20:11:30

What exactly should I try running locally?

marshall20:11:23

loading and invoking your ion code

kenny20:11:45

What -main should I specify?

marshall20:11:00

just require your ion ns

marshall20:11:13

the same way you would when you are testing your ion code locally

marshall20:11:17

before deploying

marshall20:11:34

the idea is to load the ion code in a JVM with the same memory settings used by Datomic Cloud
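One way to do that with the Clojure CLI is an alias carrying JVM options. This is only a sketch: the alias name `:ion-jvm` is made up for illustration, and `-Xss1m` assumes the "1M" mentioned later in this thread refers to the thread stack size. The real values should be copied from the JVM settings table linked above.

```clojure
;; deps.edn fragment (sketch). The alias name and the option value are
;; assumptions; take the actual flags from the Datomic JVM settings table.
{:aliases
 {:ion-jvm
  {:jvm-opts ["-Xss1m"]}}}
```

Starting a REPL with `clj -A:ion-jvm` and then requiring the ion namespace would load the code under those settings.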

kenny20:11:46

Oh. I test my Ions exactly as you described and have never hit the exception that was hit in production.

marshall20:11:17

were your JVM settings the same as those specified in that table?

kenny20:11:05

Probably with far less memory. I haven't configured any JVM properties on my system and I'm guessing the defaults are really low.

kenny20:11:53

Our Ion tests run via CI where the machine only has 4gb. The tests have been run thousands of times without hitting that exception.

kenny20:11:23

And we are running production topology with i3.large. Unless you think having extra memory is the issue, I don't think that exception was due to RAM.

marshall20:11:17

"Message": "java.lang.StackOverflowError, compiling:(potemkin/walk.clj:10:13)"

marshall20:11:30

indicates you ran out of java stack space when compiling the code

kenny20:11:08

Gotcha. Our CI runs on 4gb and has not hit that exception. Datomic Ions run on i3.large and according to that chart that means they have 10.52gb available.

marshall20:11:33

is CI a Datomic system?

kenny20:11:58

No, it is a CircleCI container. It runs the Ions by executing clojure, as you suggested, and has not hit the exception.

marshall20:11:34

Is it using the same JVM settings (for instance -Xss) as Datomic?

kenny20:11:01

It is using the default clojure uses which I'd guess is the Java defaults. I was assuming, perhaps incorrectly, that the production topology configuration would configure that higher than the defaults. I do now see the GC flags which are not being used right now. I'll try a few runs with those flags.

kenny20:11:42

No failures out of 10 runs. I don't know what the failure rate in production is either because I have only hit this one time.

marshall20:11:43

I suspect something that library was doing was using a lot of stack space (deeply recursive structure of some sort maybe). Your production instance happened to be doing something else at the time you tried to deploy and the combined load exceeded the -Xss
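To get a feel for how the stack limit bounds recursion depth, here is a small probe. It is a sketch, not anything Datomic does: `run-with-stack` and `max-depth` are hypothetical helpers, and the stack size passed to the `Thread` constructor is only a hint that the JVM may round or ignore.

```clojure
(defn run-with-stack
  "Runs the zero-arg fn f on a new thread that requests stack-bytes of
  stack, waits for it to finish, and returns f's result (or the Throwable
  it threw)."
  [stack-bytes f]
  (let [result (promise)
        task   (fn [] (deliver result (try (f) (catch Throwable t t))))
        thread (Thread. nil task "stack-probe" (long stack-bytes))]
    (.start thread)
    (.join thread)
    @result))

(defn max-depth
  "Recurses without tail calls until the stack overflows and returns the
  depth that was reached."
  []
  (let [depth (volatile! 0)
        step  (fn step [] (vswap! depth inc) (inc (step)))]
    (try
      (step)
      (catch StackOverflowError _ @depth))))

;; A small stack should allow far fewer nested calls than a large one.
(println "256 KB stack:" (run-with-stack (* 256 1024) max-depth))
(println "16 MB stack: " (run-with-stack (* 16 1024 1024) max-depth))
```

The same helper could wrap a load, for example `(run-with-stack (* 1024 1024) #(require 'my.app.ions :reload-all))`, where `my.app.ions` is a placeholder namespace.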

kenny21:11:13

Hmm ok. What is the solution if I hit it again?

stijn13:11:20

@marshall why does the documentation mention these different instance types?

marshall13:11:49

@stijn They are the various instance types used by Datomic Cloud

stijn13:11:50

we can only select i3.large or i3.xlarge

marshall13:11:10

Query groups can use the others

stijn13:11:28

and the 1M is already there in version 441-8505?

marshall13:11:12

the settings in that table are accurate for the latest release (441-8505)

stijn13:11:35

ok. so, if that still fails with the StackOverflowError, what are the resolutions for that?

marshall13:11:08

I would first want to see whether that failure occurs locally with the same stack settings

marshall13:11:57

but, in general, the resolution is to alter the use, compiling, and loading of libraries that are causing it

stijn13:11:02

(because we tried updating the jvm settings on the template, but it made the nodes not come up after termination - probably made a mistake)

stijn13:11:29

ok, I think that makes sense. we had the problem earlier with using the aleph http client and loading the netty namespaces was failing. replacing it with clj-http worked in this case

stijn13:11:44

but, some stuff is hard to remove from your code base 😄

marshall13:11:58

agreed. you don’t always have to remove it, either.

marshall13:11:11

sometimes it is about how/where/when you require

stijn13:11:28

ok, are there any guidelines available?

marshall13:11:09

that would be non-datomic-specific

marshall13:11:27

i.e. things like deeply nested lazy seqs / large recursive call stacks, etc
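A minimal illustration of the nested lazy seq case, unrelated to Datomic or to any particular library: each `concat` call wraps the previous seq in another lazy layer, and realizing the result has to descend through every layer. The function names are made up for this sketch, and the overflow at one million layers assumes ordinary JVM stack sizes.

```clojure
(defn nested-concat
  "Builds a seq of 0..n-1 by wrapping one more lazy concat per element."
  [n]
  (reduce (fn [acc x] (concat acc [x])) nil (range n)))

(defn eager-build
  "Builds the same elements eagerly into a vector, with no lazy nesting."
  [n]
  (reduce conj [] (range n)))

(defn realizes?
  "Returns true if the seq can be fully realized, false if doing so
  overflows the stack."
  [s]
  (try
    (dorun s)
    true
    (catch StackOverflowError _ false)))

(println (realizes? (nested-concat 10)))       ; shallow nesting is fine
(println (realizes? (nested-concat 1000000)))  ; deep nesting overflows
(println (count (eager-build 1000000)))        ; eager version is unaffected
```

Building the collection eagerly, or forcing realization as you go, keeps the nesting from accumulating.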

stijn13:11:00

ok, thanks

kenny16:11:01

@stijn interestingly for us, the lib that caused this problem in the first place was clj-http 🙃