This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2022-09-23
Channels
- # announcements (8)
- # babashka (12)
- # babashka-sci-dev (6)
- # beginners (62)
- # biff (5)
- # calva (4)
- # cider (2)
- # clj-commons (8)
- # clj-kondo (17)
- # clj-yaml (40)
- # clojars (3)
- # clojure (117)
- # clojure-europe (122)
- # clojure-nl (5)
- # clojure-norway (20)
- # clojurescript (10)
- # consulting (1)
- # datomic (65)
- # events (15)
- # figwheel (1)
- # fulcro (4)
- # lsp (15)
- # mount (15)
- # music (1)
- # off-topic (53)
- # polylith (12)
- # releases (3)
- # shadow-cljs (13)
- # sql (1)
- # test-check (8)
- # xtdb (31)
Latest datomic release has ‘High’ CVEs for the h2 database dependency, any solutions to this? It’s a pretty significant issue for deploying it at my company
apparently i can just upgrade the h2 database dependency to the latest version, so long as i’m using the postgresql driver; didn’t think it’d be that easy reading through chat logs
Note that the difficulty is that h2 itself is not compatible with its own db files across the major releases. I suspect this is why datomic has not bumped it: suddenly no one would be able to open their existing dev dbs
There’s also some API compatibility/class errors if you try to use h2database v2.x with a dev/mem instance of datomic too, from what i saw; i’m still using v1.x on dev builds
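for reference, the upgrade described above might look like this in deps.edn (a sketch; the Datomic and h2 versions shown are illustrative, and this is only safe when your storage is PostgreSQL or similar rather than h2-backed dev storage, since h2 2.x can’t read 1.x database files):

```clojure
;; deps.edn sketch: exclude Datomic's transitive h2 and pin a newer one.
;; Versions are illustrative -- verify against your own setup before using.
{:deps {com.datomic/datomic-pro {:mvn/version "1.0.6397"
                                 :exclusions  [com.h2database/h2]}
        com.h2database/h2       {:mvn/version "2.1.214"}}}
```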
Hey there!
I have a development Transactor (1.0.6397) running in a docker container with the datomic dev protocol. I’ve set the transactor and peer passwords, and set it to allow remote connections.
When the docker container starts, it creates the data folder with the h2 database.
I try to connect from my repl (same datomic version) with connection string ..?password=[the-password], but I just get this error, no matter what I try:
1. Caused by org.h2.jdbc.JdbcSQLException
Wrong user name or password [28000-171]
SessionRemote.java: 568 org.h2.engine.SessionRemote/done
...
JdbcConnection.java: 109 org.h2.jdbc.JdbcConnection/<init>
JdbcConnection.java: 93 org.h2.jdbc.JdbcConnection/<init>
Driver.java: 72 org.h2.Driver/connect
PooledConnection.java: 266
...
sql.clj: 16 datomic.sql/connect
Any help is appreciated!!
It sounds like you're doing things right. To double check, have you set two passwords in the transactor properties?
storage-admin-password=admin
storage-datomic-password=use-this-password
storage-access=remote
And you're using the datomic one in your connection string?
https://docs.datomic.com/on-prem/configuration/configuring-embedded-storage.html
yeah, I’m doing it exactly the same. The passwords you wrote above, are they default passwords, or just random? This is from my config file:
storage-admin-password=pwd
storage-datomic-password=pwd
storage-access=remote
Just random passwords. The config looks right, but are you exposing both the transactor and dev storage ports from your container? e.g. the defaults are the transactor on 4334 and H2 on 4335 (usually transactor port + 1).
Alright. Yes, I’m starting the container like this:
docker run -p 4334-4336:4334-4336 transactor-dev:latest
Looking in the logs, it also seems to start fine, this is currently the last entry:
2022-09-23 20:52:41.642 INFO default datomic.lifecycle - {:tid 25, :username "asdfasdf", :port 4334, :rev 59, :host "0.0.0.0", :pid 17, :event :transactor/heartbeat, :version "1.0.6397", :timestamp 1663966361621, :encrypt-channel true}
OHH!! It’s working now! Initially I set different passwords. Then I changed both to pwd, and then changed other things (that presumably were wrong).
Then I tried to change the passwords back to being different - and now it works! 😄
So my conclusion is - the 2 admin/datomic passwords have to be different. This could be documented.
Thanks for your time @U02EP7NKPAL!
Is there ever a reason to prefer q over qseq? Seems like qseq is strictly better (more powerful/expressive) in every way (zero loss in expressive power).
datomic.api, datomic.client.api, and datomic.client.api.async all seem to have slightly different versions of q and qseq (they are all documented to accept some subset of query-list, query-map, and query-string).
One thing I am aware of - from a strict expressiveness view - is that the query-map form does not support returning a collection or scalar value in the find spec (i.e. :find [?a ...] or :find ?a .).
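to make that concrete, here’s a sketch (the :user/email attribute is made up, and db is assumed to be a connected database value). The list form supports the scalar find spec directly, while the arg-map form needs post-processing:

```clojure
;; List form: the trailing `.` makes q return a single scalar
;; instead of a set of tuples.
(d/q '[:find ?e .
       :where [?e :user/email "a@example.com"]]
     db)

;; Arg-map form (datomic.client.api style): :find is a relation,
;; so unwrap the single result yourself.
(ffirst
 (d/q {:query '[:find ?e
                :where [?e :user/email "a@example.com"]]
       :args  [db]}))
```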
I do wonder if the internal implementations are different as to have different performance characteristics in greedy queries (e.g. when returning just a scalar or computing aggregates).
im also curious about the answer to the q vs qseq question.
and i also miss the scalar find spec a lot...
makes me wonder that im missing something...
the datalog query would be so nice and declarative, but the lack of these scalar find specs is like a fly in the soup. they are just so useful, so often, especially during interactive repl work.
to remedy the situation, i was considering writing some https://github.com/thunknyc/richelieu advice around q & qseq, which would rewrite the datalog query and do the necessary post-processing on the result.
im already advising d/transact, d/with & d/pull to convert back and forth between java.util.Date & java.time.Instant, using [tick.core :as t]:
(defn maybe-instant->inst [maybe-convertable-to-inst]
  (if (or (t/instant? maybe-convertable-to-inst)
          (t/zoned-date-time? maybe-convertable-to-inst)
          (t/offset-date-time? maybe-convertable-to-inst))
    (t/inst maybe-convertable-to-inst)
    maybe-convertable-to-inst))

(defadvice ^:private transact-instants
  "Replace java.time.Instants with Clojure instants (which are java.util.Date)
   before transacting."
  [transact conn arg-map]
  (-> arg-map
      (update :tx-data (partial walk/postwalk maybe-instant->inst))
      (->> (transact conn))))

(defonce _transact-instants (advise-var #'d/transact #'transact-instants))
(defonce _with-instants (advise-var #'d/with #'transact-instants))

(defadvice ^:private transact-throw-txd
  "Like d/transact, but attaches the tx-arg to its exceptions."
  [transact conn arg-map]
  (try (transact conn arg-map)
       (catch Exception ex
         (-> "Transaction failed"
             (ex-info arg-map ex)
             throw))))

(comment
  (advise-var #'d/transact #'transact-throw-txd))

(defn- ^:deprecated maybe-inst->instant [i] (if (inst? i) (t/instant i) i))

(defadvice ^:private pull-instants
  ([pull db arg-map]
   (->> (pull db arg-map)
        (walk/postwalk maybe-inst->instant)))
  ([pull db selector eid]
   (->> (pull db selector eid)
        (walk/postwalk maybe-inst->instant))))

(defonce _pull-instants (advise-var #'d/pull #'pull-instants))
Is there now someone sufficiently knowledgeable here to provide an answer to this question?
> Is there ever a reason to prefer q over qseq?
>
> seems like qseq is strictly better (more powerful/expressive) in every way (zero loss in expressive power)
I find https://docs.datomic.com/cloud/query/query-executing.html#qseq a bit lacking, and searching past discussions in this forum for qseq seems to reflect that.
To observe differences, I put in place an A-B test in our codebase that blindly runs and compares all our q calls to qseq.
For now I observed two functional differences:
1. As stated, the fact that qseq returns a seq implies that empty results are not [] like q, but nil instead. So I wrapped qseq in e.g. (or (qseq ...) []) to not affect the codebase that depends on results always being vectorized (in some areas), and continued observing.
2. I observed one case where a query returns results in a different order.
And some qualitative differences:
• Time: results get back significantly faster, as shown in screenshots.
• I didn't yet observe if our app's usage of these results is slowed down by lazy realization or not, though. I suppose I should.
• Didn't see an impact in CloudWatch metrics of the Datomic Cloud servers, but I didn't activate the detailed metrics to see memory. I suppose I should. :)
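a tiny helper capturing the normalization from point 1, for anyone who wants q-compatible empty results (the name is made up):

```clojure
(defn nil->empty
  "Normalize a possibly-nil qseq result to [], matching q's behavior
   on empty results."
  [result]
  (or result []))

;; usage sketch: (nil->empty (d/qseq arg-map))
```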
Do you mean that by simply accessing the returned results, it might actually sometimes trigger further queries??
Wow, curious... nav & datafy are probably in play here?
Haha ok. So you seem to mean that it is only when the pull is executed that a further automatic query will happen.
It’s not the IO in the datalog query (the where clauses and the final result set). That is always finished when q or qseq returns
But if the find has pulls in it, q will evaluate them eagerly but qseq will delay to the time the entry is accessed
The data to satisfy the pulls is not guaranteed to be loaded, so you may incur additional IO
And if the time the pull entry is accessed is much later, then we might be holding on to a much larger set of the DB's persistent representation in memory, but if we access it quite soon and are done with it, then that's not an issue.
Phew, thanks a lot for clarifying my wrong intuitions about this! 😅
(->> (d/q '[:find ?x ...] db)
     (map (fn [result-tuple]
            (update result-tuple 0 #(d/pull db pull-expr %)))))

(->> (d/q '[:find ?x ...] db)
     (mapv (fn [result-tuple]
             (update result-tuple 0 #(d/pull db pull-expr %)))))
so you want qseq if you are memory constrained; q if you want to avoid blocking when processing the result
but most of the time if you just want a subset I’d say you want q, then sort, then get your subset, then pull
unfortunately qseq realizes the pull when the result item is realized, not when the individual slot in the item is accessed, so it’s not ok for getting the entire result set, doing something with the other fields, then looking at the pulled fields
in the peer model, decoupling all this for more control is usually fine; but in the client model each decoupling incurs another network hop, or has to keep a bunch of stuff retained in the peer-server.
Ok, so laziness' comeback (if we ever thought it was starting to be abandoned by e.g. more use of transducers), with some tradeoffs based on the specific implementation details.
While you were writing these last posts, I was writing this below, and I now realize how naïve the below is:
Then the following end-user advice would seem practical, IIUW:
• When loading info where the user might want to explore just some subparts, always use qseq.
• When loading info where the user will definitely see all of its subparts, prefer q.
so you either are partially consuming results and don’t care about order, or you are consuming all results but incrementally without head-holding.
again if your query has no pull in its find, there is no difference between q and qseq, they are exactly the same
pulls often take a long time and their results take a lot of memory relative to the datalog query evaluation and results. qseq lets you defer that work to when the result entry is read
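in code, deferring pull work with qseq might look like this sketch (the :order/* attributes and process-order! are hypothetical); each pull is realized only as doseq walks the seq, instead of all at once:

```clojure
;; qseq returns entries lazily; the pull for each entity happens only
;; when the corresponding entry is consumed in the doseq below.
(doseq [[order] (d/qseq {:query '[:find (pull ?e [:order/id :order/total])
                                  :where [?e :order/id]]
                         :args [db]})]
  (process-order! order))
```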
oh, and on the client, they also take a lot of network and marshalling-unmarshalling all at once, because of those maps
so client api has an additional advantage of smoothing that out into more smaller payloads
Ok, thanks a lot, this will be quite useful to refer to!
As for observing the impact on our app's e.g. API handler performance in time and space, I think the best way to do it would be to bring in cpu and heap metrics to compare when we toggle to q or to qseq (instead of running both blindly), grouped not by API handler but by its constituent parts, because one handler might call many functions making many requests, each having its own distinct impact on the system. This should provide solid feedback to tune our intuitions and refine our choices. And you might say we'll have much more important optimization opportunities before that, like tuning our actual finds and pull patterns.
I should think to report back about this when priorities allow this (I'll have to enable OpenTelemetry host metrics before I can make these kinds of correlations with confidence).
I wouldn’t overthink the difference here. This is a throughput vs latency, IO vs memory trade off
If you default to qseq, you are probably fine unless you process the result somewhere using some threadpool meant for non-blocking cpu workloads
IME that’s not what most web apps do and they would rather reduce the working set size, put less stress on the gc, etc
@U0514DPR7 thanks for resurrecting this thread and @U09R86PA4 for the clarification.
One last question to check my understanding: is there a tradeoff between doing a pull inside a qseq (vs first doing a q for eids followed by a pull-many)?
It sounds like in both cases the eids query is greedy and the pull would be lazy (unless I'm misremembering that pull-many is lazy). In the case of the client, I guess this would be one more network hop, but on-prem this should be equivalent. Correct?
Pull-many is not lazy. When I say “query then use pull-many” I’m imagining some kind of chunking
Pull-many seems to prefer aevt indexes as a data source when it can, and I’m not sure pull-inside-q does.
So it’s exactly as you describe except for the clarification about pull-many laziness and the uncertainty about which indexes will end up being used
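the "q for eids, then pull-many in chunks" pattern mentioned above might be sketched like this (the :order/* attributes and process-order! are hypothetical; the chunk size of 100 is arbitrary):

```clojure
;; Eagerly fetch eids (cheap), then pull in bounded batches so only
;; one chunk of pulled maps is in memory at a time.
(let [eids (map first (d/q '[:find ?e :where [?e :order/id]] db))]
  (doseq [batch (partition-all 100 eids)]
    (run! process-order!
          (d/pull-many db [:order/id :order/total] batch))))
```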
This sounds like yet another knob one can twiddle when you’re not quite getting the performance you expect on your workload.
Copyright related question - if I implement a spec of the datomic query and pull API abstract syntax as appears in the official documentation, do I need to do anything regarding licensing, attribution, copyrights assignment, mentions, etc?