2022-06-03
Channels
- # announcements (6)
- # babashka (14)
- # beginners (17)
- # biff (3)
- # calva (19)
- # circleci (3)
- # clj-on-windows (1)
- # cljdoc (21)
- # cljs-dev (6)
- # clojure (119)
- # clojure-australia (2)
- # clojure-europe (28)
- # clojure-france (3)
- # clojure-norway (12)
- # clojure-survey (2)
- # clojure-uk (7)
- # clojurescript (25)
- # core-typed (1)
- # cursive (11)
- # datomic (53)
- # emacs (14)
- # events (1)
- # gratitude (1)
- # holy-lambda (21)
- # integrant (2)
- # jobs (1)
- # jobs-discuss (3)
- # juxt (3)
- # kaocha (1)
- # lsp (17)
- # nbb (14)
- # off-topic (25)
- # pathom (11)
- # re-frame (24)
- # releases (1)
- # remote-jobs (2)
- # rewrite-clj (10)
- # shadow-cljs (11)
- # sql (3)
- # tools-build (6)
- # tools-deps (83)
- # vim (26)
- # xtdb (10)
Anybody ever seen issues with the transactor restarting itself every few minutes due to a failed heartbeat? We’ve got plenty of RAM, plenty of disk space (including plenty of space for the logs), low network latency to the MSSQL backend, and low load, and nothing in the logs to indicate why the heartbeat keeps failing.
That would cause a heartbeat failure?
You’re talking about the JDBC connection from Datomic back to the MSSQL backend?
The “heartbeat” is the transactor writing its address into storage periodically. It’s part of the HA failover system
If this doesn’t happen immediately on startup, that suggests something is wrong at the JDBC connection level: maybe a connection limit is exceeded, maybe MSSQL killed a long-running connection, or maybe there are TCP issues lower down the stack.
Ok, I’m seeing references to heartbeat in the logs on New Relic
The problem happens at irregular intervals, but we’re definitely starting successfully and running for a good few minutes.
No, it seems to only happen when we try to execute transactions.
And we’re not on Google Cloud
Yes, we’re getting some throughput, we’re just having frequent interruptions.
I’ll pass that on, thanks much!
And try to correlate the interruptions with either idle time or high load (such as an indexing job).
Idle time may trigger timeouts or connection closes; high load may cause instability or forced closes.
:thinking_face:
Ok, that’s some stuff we can look into, thanks again
Could it be that the MSSQL backend server has a short maximum time that a connection can "live"? From what I know about the Tomcat JDBC pool, which Datomic uses, the default is to never close a connection when it is returned to the pool. It's controlled by a pool property (see https://tomcat.apache.org/tomcat-7.0-doc/jdbc-pool.html#How_to_use). Unfortunately, I don't think you can control this from/in Datomic.
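For context, a rough sketch of the kind of pool setting being referred to, assuming it is tomcat-jdbc's maxAge attribute (an assumption; the linked page documents the pool's attributes). It configures a standalone pool purely for illustration; Datomic manages its own internal pool, so this is not a knob Datomic exposes:

```clojure
;; Sketch of the setting discussed above, assuming tomcat-jdbc's maxAge
;; (default 0 = connections are never aged out when returned to the pool).
;; Standalone pool for illustration only, not a Datomic configuration hook.
(import '(org.apache.tomcat.jdbc.pool PoolProperties DataSource))

(def pool-props
  (doto (PoolProperties.)
    (.setUrl "jdbc:sqlserver://db-host:1433;databaseName=datomic") ; placeholder
    (.setDriverClassName "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    (.setMaxAge (* 30 60 1000))))   ; close connections older than 30 min on return

(def datasource
  (doto (DataSource.)
    (.setPoolProperties pool-props)))
```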
We ended up restarting the SQL Servers in the HA group (which was a non-trivial operation), and that seems to have resolved the issue, so it does look like a problem with the underlying storage rather than with Datomic itself.
Are you running your own SQL servers "on premises"? For our MS Azure PostgreSQL setup we needed to increase the number of IOPS (or something along those lines). The database was struggling a lot before that. I don't recall exactly how that manifested, but I think it was dead/dropped connections.
Thanks, I’m not involved on the SQL Server side of things at all, but I’ll pass that on.
If you are running the database in the cloud or on a PaaS, I'd also recommend setting socketTimeout on the peers:
https://ask.datomic.com/index.php/631/blocking-event-cluster-without-timeout-failover-semantics?show=700#a700
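As an illustration, a minimal sketch of what that can look like in a peer's connection URI, assuming the PostgreSQL JDBC driver (where socketTimeout is in seconds); the MSSQL driver spells and scales its equivalent differently, so check your driver's docs. Host, database name, and credentials below are placeholders:

```clojure
(require '[datomic.api :as d])

;; Placeholder URI: socketTimeout=30 asks the PostgreSQL JDBC driver to fail
;; a socket read that blocks for more than 30 seconds instead of hanging forever.
(def db-uri
  (str "datomic:sql://my-db"
       "?jdbc:postgresql://db-host:5432/datomic"
       "?user=datomic&password=datomic"
       "&socketTimeout=30"))

(def conn (d/connect db-uri))
```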
Is there a way to drop a database without using a peer? What would be the fastest way to drop a db?
What problem are you trying to solve? Do you have multiple databases in the same storage+transactor, or do you want to deprovision the whole stack as fast as possible, or something else?
I have a script that restores a DB. You can't restore a DB over an existing DB of the same name, so I want a quick one-liner that doesn't require devs (who don't know Clojure) to drop and restore.
Storage-level operations are probably always going to be fastest. I don't know if you care about what else may be in the local dev databases. If it's just about distributing read-only replicas, you could back up from DynamoDB, restore into H2, then distribute the H2 files to all the devs and just blow away whatever H2 files they already have.
If you do care about what else may be in those local dev databases, then I think you need d/delete-database or d/rename-database, a more interactive Datomic-level restore, etc.
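For example, a minimal sketch of the d/delete-database route, wrapped so devs can run it from an existing script rather than a REPL (the storage URI is a placeholder; this still loads the peer library, it just doesn't require interactive use):

```clojure
(require '[datomic.api :as d])

;; Placeholder storage URI: point it at the database the restore script targets.
(def db-uri
  "datomic:sql://my-db?jdbc:postgresql://db-host:5432/datomic?user=datomic&password=datomic")

;; Remove the existing database so the restore can recreate it under the same name.
(d/delete-database db-uri)
```

The existing restore step (for example `bin/datomic restore-db`) can then recreate the database under that name.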