datomic 2020-07-01 | Slack Archive

genekim18:07:35

Hello! I’m wondering if I can get some help with my Datomic Cloud instance, which seems to have gone south. In fact, I’m on a call with @plexus trying to puzzle this out. 1. I’m getting “channel 2: open failed: connect failed: Connection refused” errors on the proxy when a Datomic Client tries to access the Datomic Cloud instance. 2. In AWS CloudWatch, I see the following alarm, which fired very close to when we started seeing Datomic connection errors. Does anyone have recommendations? @plexus, any other data worth sharing? (Sorry, gotta pop off for 30m. Thank you, all!)

genekim18:07:21

Error I get from a REPL connection:

Execution error (ExceptionInfo) at datomic.client.impl.cloud/get-s3-auth-path (cloud.clj:178).
Unable to connect to localhost:8182
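
A quick way to narrow down an error like this is to check whether anything is accepting TCP connections on the local proxy port at all. The following is a minimal sketch, assuming the client is pointed at localhost:8182 as in the error above; the helper name `port_is_open` is made up for illustration. It only shows whether something (for example the SSH tunnel) is listening locally, not whether the far end of the tunnel is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 8182 is the local port named in the error message above.
    print("proxy port open:", port_is_open("localhost", 8182))
```

If this reports the port as open while the client still fails, the problem is more likely behind the tunnel (bastion or compute instance) than in the local proxy process.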

plexus18:07:08

reading some more AWS docs it seems we have exceeded the allocated write throughput, which is supposed to only cause throttling, but instead the datomic instance has gone under or become unreachable...

ghadi18:07:18

check your cloudwatch datomic dashboard

ghadi18:07:28

should have a clear smoking gun

ghadi18:07:11

if you have any Alerts (not just "Events") in that dashboard, look at those too by navigating to cloudwatch logs

genekim18:07:44

Thank you @U050ECB92! Is this the dashboard? (Sorry, on a call… 🙂)

ghadi18:07:32

yes, weird that it's mostly empty

ghadi18:07:44

what about the bottom half of that dash?

genekim18:07:39

Was empty — full screenshot here:

genekim18:07:27

(Empty dashboard was the reason I was asking Datomic team at Conj 2019 about getting help upgrading last year, which I never got around to.)

marshall18:07:47

the alarm you posted is irrelevant - that is used by DDB for autoscaling capacity

marshall18:07:56

you should restart your solo compute instance

marshall18:07:05

you can just bounce it from the EC2 console

marshall18:07:08

@U6VPZS1EK

marshall18:07:38

https://docs.datomic.com/cloud/troubleshooting.html#troubleshooting-solo

marshall18:07:42

marshall18:07:50

^ solo dashboard should look like that

marshall18:07:10

your instance and/or JVM got wedged and b/c solo is not an HA system there is nothing to fail-over to

marshall18:07:27

quickest fix is to terminate the instance and let ASG create a new one

genekim18:07:44

Roger that! Will try in 30m as soon as I get off this call! Thx!

marshall18:07:48

:thumbsup:

genekim19:07:13

Posting this datomic log event, before I destroy the solo instance:

2020-06-25T22:54:28.953-07:00
{
    "Msg": "RestartingDaemonException",
    "Ex": {
        "Via": [
            {
                "Type": "clojure.lang.ExceptionInfo",
                "Message": "Unable to load index root ref bd9b3c36-2912-437d-8fc7-6953ab60a1b2",
                "Data": {
                    "Ret": {},
                    "DbId": "bd9b3c36-2912-437d-8fc7-6953ab60a1b2"
                },
                "At": [
                    "datomic.index$require_ref_map",
                    "invokeStatic",
                    "index.clj",
                    843
                ]
            }
        ],
        "Trace": [
            [
                "datomic.index$require_ref_map",
                "invokeStatic",
                "index.clj",
                843
            ],
            [
                "datomic.index$require_ref_map",
                "invoke",
                "index.clj",
                836
            ],
            [
                "datomic.index$require_root_id",
                "invokeStatic",
                "index.clj",
                849
            ],
            [
                "datomic.index$require_root_id",
                "invoke",
                "index.clj",
                846
            ],
            [
                "datomic.adopter$start_adopter_thread$fn__21647",
                "invoke",
                "adopter.clj",
                67
            ],
            [
                "datomic.async$restarting_daemon$fn__10442$fn__10443",
                "invoke",
                "async.clj",
                162
            ],
            [
                "datomic.async$restarting_daemon$fn__10442",
                "invoke",
                "async.clj",
                161
            ],
            [
                "clojure.core$binding_conveyor_fn$fn__5739",
                "invoke",
                "core.clj",
                2030
            ],
            [
                "datomic.async$daemon$fn__10439",
                "invoke",
                "async.clj",
                146
            ],
            [
                "clojure.lang.AFn",
                "run",
                "AFn.java",
                22
            ],
            [
                "java.lang.Thread",
                "run",
                "Thread.java",
                748
            ]
        ],
        "Cause": "Unable to load index root ref bd9b3c36-2912-437d-8fc7-6953ab60a1b2",
        "Data": {
            "Ret": {},
            "DbId": "bd9b3c36-2912-437d-8fc7-6953ab60a1b2"
        }
    },
    "Type": "Alert",
    "Tid": 6306,
    "Timestamp": 1593150867958
}

marshall19:07:06

Thanks, although that shouldn’t cause a significant issue
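
Alerts like the one pasted above are easier to triage once reduced to a few fields. The following is a rough sketch, assuming events are JSON objects shaped like that one; the function name `summarize` and the choice of fields are illustrative, not part of any Datomic tooling. The embedded event is a trimmed copy of the alert above.

```python
import json
from datetime import datetime, timedelta, timezone

# Trimmed copy of the alert pasted above.
EVENT = """
{
    "Msg": "RestartingDaemonException",
    "Ex": {
        "Trace": [
            ["datomic.index$require_ref_map", "invokeStatic", "index.clj", 843],
            ["datomic.index$require_ref_map", "invoke", "index.clj", 836]
        ],
        "Cause": "Unable to load index root ref bd9b3c36-2912-437d-8fc7-6953ab60a1b2"
    },
    "Type": "Alert",
    "Tid": 6306,
    "Timestamp": 1593150867958
}
"""


def summarize(event_json: str) -> dict:
    """Pull the fields most useful for triage out of one log event."""
    event = json.loads(event_json)
    ex = event.get("Ex", {})
    trace = ex.get("Trace", [])
    # Each trace frame is [class, method, file, line].
    top = trace[0] if trace else None
    # "Timestamp" is milliseconds since the Unix epoch; split it with
    # integer arithmetic to avoid float rounding.
    ms = event["Timestamp"]
    when = datetime.fromtimestamp(ms // 1000, tz=timezone.utc) + timedelta(
        milliseconds=ms % 1000
    )
    return {
        "type": event.get("Type"),
        "msg": event.get("Msg"),
        "cause": ex.get("Cause"),
        "top_frame": f"{top[0]}/{top[1]} ({top[2]}:{top[3]})" if top else None,
        "time_utc": when.isoformat(timespec="milliseconds"),
    }


if __name__ == "__main__":
    for key, value in summarize(EVENT).items():
        print(f"{key}: {value}")
```

The converted timestamp lands within a second of the 2020-06-25T22:54:28.953-07:00 line that precedes the event, which is a cheap sanity check that the right event was copied.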

genekim19:07:39

Okay, terminated the datomic instance, which didn’t work… terminated the bastion-host instance, which didn’t work… terminated the datomic proxy script, and restarted… forced some sort of reauthentication, which did work! Thank you, all!

genekim19:07:15

🙏🙏🙏 🎉🎉🎉

marshall19:07:44

you’d definitely need to restart the proxy script after restarting the bastion instance

marshall19:07:10

IIRC it regenerates creds/keys after coming back from termination

genekim19:07:39

Thank you for the help, all! Described resolution of story at end of thread ^^^.

zhuxun220:07:30

Is it possible to implement a correct task queue in Datomic? Most importantly, it should ensure that multiple task retrievers won't get the same task from the top of the queue. (In PostgreSQL, for example, I needed to use SELECT ... FOR UPDATE.)

Joe Lane20:07:54

It's certainly possible to make a queue out of datomic, but why not just use an actual queue? I also don't necessarily think it's a good idea to use datomic as a queue, depending on the throughput, failure semantics, and data retention you need.

zhuxun220:07:07

@U0CJ19XAM Good point. I am looking into https://github.com/Factual/durable-queue as well.

Joe Lane20:07:33

Why not sqs?

zhuxun220:07:41

Actually, I just realized a queue might not satisfy what I need. There isn't a static queue. Tasks have priorities and they might be changed dynamically. Every task retriever grabs the top-priority job from the database at the moment it accesses the database. Is there an established solution or pattern for something like that?

Joe Lane20:07:45

Depends on the domain, if this is something for humans (like a Jira / Trello clone) then this is easy. If this is for machines, it depends on your throughput, scale, and failure modes.

Joe Lane20:07:52

That being said, you may be interested in https://github.com/clojure/data.priority-map

Joe Lane20:07:27

and / or https://github.com/clojure/data.avl/

zhuxun221:07:37

The job retrievers are machines. I don't think an in-memory solution would work well for my particular case. Plus, the tasks and their attributes (from which to compute the priority) are already stored in a Datomic database, so that's why I was wondering if there's some sort of locking mechanism between querying and updating...

zhuxun221:07:47

The performance of the priority sorting isn't that much of a problem, at the moment an index on the priority attribute should work well enough

zhuxun221:07:40

In other words, is there a way to say "change the first item satisfying my query to have attribute [:task/taken true]" -- all within an atomic transaction

Joe Lane21:07:27

Yes, via a transaction function, but I don't think it's going to work out well in the end. What happens once a task is taken but then the task retriever dies? What are your retry policies? How do you distinguish between a slow job and a failed job?
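
The idea behind the transaction-function approach, and the failure questions that come with it, can be modeled in a few lines. The following is a plain-Python sketch, not Datomic code: a lock stands in for writes being applied one at a time, and a lease timestamp is one possible answer to "what if the retriever dies". All names (`Task`, `TaskStore`, `claim_next`, `complete`) and the convention that a higher number means higher priority are assumptions made for illustration.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Task:
    task_id: str
    priority: int                    # assumption: higher number = more urgent
    taken_by: Optional[str] = None   # worker currently holding the task
    lease_expires: float = 0.0       # claim is stale once now >= this value


class TaskStore:
    """In-memory stand-in for a database that applies writes one at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tasks: Dict[str, Task] = {}

    def add(self, task: Task) -> None:
        with self._lock:
            self._tasks[task.task_id] = task

    def claim_next(self, worker: str, now: float,
                   lease_seconds: float) -> Optional[Task]:
        """Pick and mark the top-priority available task in one atomic step."""
        with self._lock:
            available = [
                t for t in self._tasks.values()
                if t.taken_by is None or t.lease_expires <= now
            ]
            if not available:
                return None
            best = max(available, key=lambda t: t.priority)
            best.taken_by = worker
            best.lease_expires = now + lease_seconds
            return best

    def complete(self, task_id: str, worker: str) -> bool:
        """Remove a finished task, but only if this worker still holds it."""
        with self._lock:
            task = self._tasks.get(task_id)
            if task is None or task.taken_by != worker:
                return False
            del self._tasks[task_id]
            return True
```

Because the query and the update happen under the same lock, two retrievers can never claim the same task. The lease does not distinguish a slow job from a dead one; it only bounds how long a task can stay stuck, so a slow worker may find its `complete` call rejected after another worker has taken over.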

Joe Lane21:07:04

Do you have different levels of prioritization like low, medium and high, or is everything prioritized globally? If you can do the former, I think SQS with a queue per level is likely a better approach.

Joe Lane21:07:26

Because it handles all these things for you
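
The queue-per-level idea amounts to checking the queues in a fixed order and taking the first message found. The following is a minimal in-memory sketch, with `collections.deque` objects standing in for the real queues; the level names and the function `next_message` are illustrative assumptions. With SQS each deque would be a separate queue, and the visibility timeout would take care of redelivering a message whose consumer died.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

# Levels are checked in this order, most urgent first.
LEVELS = ("high", "medium", "low")


def next_message(queues: Dict[str, Deque[str]]) -> Optional[Tuple[str, str]]:
    """Return (level, message) from the most urgent non-empty queue."""
    for level in LEVELS:
        q = queues[level]
        if q:
            # FIFO within a level: take the oldest message.
            return level, q.popleft()
    return None


if __name__ == "__main__":
    queues = {
        "high": deque(["reindex"]),
        "medium": deque(),
        "low": deque(["cleanup-1", "cleanup-2"]),
    }
    while (item := next_message(queues)) is not None:
        print(item)
```

Strict ordering like this can starve the low queue while higher levels stay busy, and it only fits when priorities fall into a few fixed levels rather than changing continuously per task.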

zhuxun221:07:55

Thanks. That makes sense. What if I'm not using a standard cloud service? Can Kafka serve a similar purpose?

Joe Lane21:07:38

I'd look at rabbitMQ, kafka is a durable log.

Joe Lane21:07:37

(It could do this as well, but may be more difficult to operate. Again, I know nothing of your problem domain, scale, other constraints, etc. so it's hard to make a good recommendation)

zhuxun221:07:34

Thanks! I will take a look at rabbitMQ

Lone Ranger22:07:37

I don't suppose there is any way to force a peer to use an alternative address than the one provided by the transactor (retrieved from storage), is there?

favila01:07:39

The transactor properties file can have host and alt-host. Are two names not enough?

favila01:07:14

I’m not sure about ports. I dimly recall that you can specify port in the connection string, but that might only be for dev storage

Lone Ranger22:07:26

or alternative port, at least?

Lone Ranger22:07:52

i.e., transactor running on transactorUrl:4334 but is being reverse proxied at with appropriate firewall rules VPN etc etc

Lone Ranger23:07:51

okay looks like we're able to change the port on the LB. still curious if this is possible, tho
