#datomic
2022-06-03
manutter5120:06:33

Anybody ever seen issues with the transactor restarting itself every few minutes due to a failed heartbeat? We’ve got plenty of RAM, plenty of disk space (including space for the logs), low network latency to the MSSQL backend, and low load, and nothing in the logs indicates why the heartbeat keeps failing.

favila21:06:01

I would look for JDBC connection issues

manutter5121:06:09

That would cause a heartbeat failure?

manutter5121:06:31

You’re talking JDBC connection from datomic back to the MSSQL back end?

favila21:06:21

The “heartbeat” is the transactor writing its address into storage periodically. It’s part of the HA failover system

favila21:06:55

I think it’s the id “pod-coord” in the sql backend.

favila21:06:12

I don’t remember exactly. It starts with “pod-”

favila21:06:28

so the heartbeat failure would mean the transactor couldn’t write to MSSQL

favila21:06:15

if this doesn’t happen immediately on startup, that suggests some JDBC connection-level thing is wrong: maybe a connection count is exceeded, maybe MSSQL killed a long-running connection, maybe there are TCP issues lower down the stack.

manutter5121:06:20

Ok, I’m seeing references to heartbeat in the logs on New Relic

favila21:06:13

This heartbeat entry is also how peers find the transactor
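For anyone who wants to poke at that entry directly, a rough sketch follows. The assumptions are not from the discussion above: it uses next.jdbc, the default `datomic_kvs` table created by Datomic's SQL setup scripts, and placeholder connection details; the exact heartbeat key is uncertain beyond "it starts with pod-".

```clojure
;; Rough sketch for inspecting the heartbeat row in SQL storage.
;; Assumptions: default datomic_kvs table, next.jdbc, placeholder creds.
(require '[next.jdbc :as jdbc])

(def ds
  (jdbc/get-datasource
    {:dbtype "mssql" :dbname "datomic"
     :host "sql-host" :user "datomic" :password "secret"}))

;; The heartbeat id reportedly starts with "pod-"; its rev column should
;; keep changing while the transactor is writing to storage successfully.
(jdbc/execute! ds ["SELECT id, rev FROM datomic_kvs WHERE id LIKE 'pod-%'"])
```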

manutter5121:06:08

The problem happens at irregular intervals, but we’re definitely starting successfully and running for a good few minutes.

favila21:06:38

does it happen only when there’s an idle transaction period?

favila21:06:52

also, are you running on google cloud?

manutter5121:06:08

No, it seems to only happen when we try to execute transactions.

manutter5121:06:22

And we’re not on Google Cloud

favila21:06:12

do any transactions succeed?

manutter5121:06:47

Yes, we’re getting some throughput, we’re just having frequent interruptions.

favila21:06:43

hm. yeah, I would look very closely at JDBC driver settings and MSSQL settings

manutter5121:06:57

I’ll pass that on, thanks much!

favila21:06:26

and try to correlate the interruptions with either idle time or high load (such as an indexing job).

favila21:06:56

idle time maybe triggers timeouts or connection closes; high load maybe causes instability or forced closes

manutter5121:06:19

:thinking_face:

manutter5121:06:49

Ok, that’s some stuff we can look into, thanks again

Ivar Refsdal07:06:15

Could it be that the MSSQL backend server has a short max time that a connection can "live"? From what I know about tomcat-jdbc-pool, which Datomic uses, the default is to never close a connection when it is returned to the pool. It's controlled by a property documented at https://tomcat.apache.org/tomcat-7.0-doc/jdbc-pool.html#How_to_use. Unfortunately I don't think you can control this from within Datomic.

manutter5111:06:46

We ended up restarting the SQL Servers in the HA group (which was a non-trivial operation), and that seems to have resolved the issue, so it does look like a problem with the underlying storage rather than with datomic itself.

👍 1
Ivar Refsdal12:06:06

Are you running your own SQL servers "on premises"? For our MS Azure PostgreSQL setup we needed to increase the provisioned IOPS (or something along those lines). The database was struggling a lot before that. I don't recall exactly how that manifested itself, but I think it was dead/dropped connections.

manutter5112:06:56

Thanks, I’m not involved on the SQL Server side of things at all, but I’ll pass that on.

Ivar Refsdal13:06:54

If you are running the database in the cloud or on a PaaS, I'd also recommend setting socketTimeout on the peers: https://ask.datomic.com/index.php/631/blocking-event-cluster-without-timeout-failover-semantics?show=700#a700

👍 1
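For a SQL-backed peer, the JDBC properties ride along in the connection URI, so a socketTimeout setting might look roughly like the sketch below. The host, database name, credentials, and the 60-second value are placeholders; socketTimeout here is the Microsoft JDBC driver's connection property, in milliseconds.

```clojure
;; Sketch of a peer connection URI with socketTimeout set, assuming the
;; Microsoft SQL Server JDBC driver; all connection details are placeholders.
(require '[datomic.api :as d])

(d/connect
  (str "datomic:sql://my-db?"
       "jdbc:sqlserver://sql-host:1433;databaseName=datomic;"
       "user=datomic;password=secret;socketTimeout=60000"))
```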
jdkealy22:06:21

Is there a way to drop a database without using a peer? What would be the fastest way to drop a db?

favila22:06:34

What problem are you trying to solve? Do you have multiple databases in the same storage+transactor, or do you want to deprovision the whole stack as fast as possible, or something else?

jdkealy22:06:34

I have a script that restores a DB, but you can't restore a DB when one with the same name already exists, so I want a quick one-liner that doesn't require devs (who don't know Clojure) to drop and restore manually.

favila22:06:30

what kind of storage?

jdkealy22:06:52

just local dev

favila22:06:04

would it make sense to unconditionally rm -r the data directory?

favila22:06:29

or even to not use datomic-level backup/restore but to distribute the h2 files

favila22:06:41

then it’s just file copy

jdkealy22:06:26

oh really? Even if it's from dynamo?

favila22:06:21

no, I’m assuming you’re using the same storage

favila22:06:24

storage level operations are probably always going to be fastest. I don’t know if you care about what else may be in the local dev. If it’s just distributing readonly replicas, you could backup from dynamo, restore into h2, then distribute the h2 files to all the devs and just blow away whatever h2 files they have already.
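A file-level sketch of that last step, under assumptions not stated above: the dev transactor is stopped, its dev storage lives under ./data, and the distributed snapshot lives under ./h2-snapshot.

```clojure
;; Sketch of "blow away the local h2 files and copy in the distributed
;; snapshot". Paths are assumptions; run only with the dev transactor stopped.
(require '[clojure.java.io :as io])

(defn delete-tree! [f]
  (let [file (io/file f)]
    (doseq [child (.listFiles file)]   ;; nil for plain files, so recursion is safe
      (delete-tree! child))
    (.delete file)))

(defn copy-tree! [src dest]
  (let [src-path (.toPath (io/file src))]
    (doseq [f (file-seq (io/file src))
            :when (.isFile f)]
      (let [target (io/file dest (str (.relativize src-path (.toPath f))))]
        (io/make-parents target)
        (io/copy f target)))))

(delete-tree! "data")
(copy-tree! "h2-snapshot" "data")
```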

favila22:06:50

if you do care about what else may be in those local dev databases, then I think you need d/delete-database or d/rename-database, a more interactive datomic-level restore, etc

favila22:06:39

Remember d/delete-database doesn’t reclaim storage
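If the datomic-level route is the one you want, the delete side is a one-liner. A sketch, assuming a local dev transactor and a database named "mydb"; the URI and backup path are placeholders:

```clojure
;; Sketch: free up the name, then restore from backup. URI, db name, and
;; backup path are placeholders; delete-database does not reclaim storage.
(require '[datomic.api :as d])

(d/delete-database "datomic:dev://localhost:4334/mydb")

;; then, from the shell:
;;   bin/datomic restore-db file:/path/to/backup datomic:dev://localhost:4334/mydb
```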

jdkealy22:06:26

Was list-databases removed from the datomic peer api?

favila22:06:35

are you thinking of get-database-names?

favila22:06:04

that’s the client api

favila22:06:47

I don’t know why they decided to make these different

favila22:06:55

peer api predates the client api
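Side by side, the two look roughly like this. This is only a sketch: the URIs, the peer-server client config, and the credentials are placeholders.

```clojure
;; Peer API: get-database-names takes a URI with * in the db-name position.
(require '[datomic.api :as peer])
(peer/get-database-names "datomic:dev://localhost:4334/*")

;; Client API: list-databases takes a client and an arg map.
(require '[datomic.client.api :as client])
(def c (client/client {:server-type :peer-server
                       :endpoint    "localhost:8998"
                       :access-key  "key"
                       :secret      "secret"
                       :validate-hostnames false}))
(client/list-databases c {})
```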