Fork me on GitHub
#datomic
<
2020-07-02
>
kschltz12:07:10

Hi, there. I have a ~3Billion datom database in datomic cloud and the need to add an AVET index seems more than reasonable

favila12:07:01

datomic cloud already value-indexes everything

kschltz12:07:14

Just realized it

kschltz12:07:38

I was wondering how datomic will handle the creation of this new index

kschltz12:07:08

Will it index the existing datoms? If so, would it harmful from a operation standpoint?

marciol12:07:23

Hi, We are using Datomic Cloud, executing queries against a Database with approximately 3 Billion Datoms, but a trivial query is taking a long time to return, or it isn’t returning at all, rising timeout exception. With the query bellow we are trying to return all transactions from a merchant in a range of time, super trivial:

(d/q {:query  '[:find (pull ?entity [*]
                        :in $
                        :where
                        [?entity :merchant-id "beb9c7db-a7eb-4e56-8c4e-4db195566562"]
                        [?entity :transaction-time ?transaction-time]
                        [?transaction-time :utc-time ?transaction-time-utc]
                        [(>= ?transaction-time-utc #inst"2020-06-01T03:00:00.000-00:00")]
                        [(<= ?transaction-time-utc #inst"2020-06-30T03:00:00.000-00:00")])]
              :args   [(d/db (:conn client))]
              :timeout 50000})
We are running in query groups i3.xlarge (with 30.5 GB RAM), and wondering ourselfs if we need to increase these machines. Can someone with more experience thrown light on this?

Ian Fernandez13:07:12

d/query with pull inside tends to be redundant sometimes, I think that d/pull will get better performance =)

marshall13:07:56

using pull in query should have the same performance characteristics as a query followed by a pull, except that in the case of using client/cloud using pull in query will save a round trip/wire cost

Ian Fernandez13:07:16

it's datomic cloud w/o ions

Ian Fernandez13:07:01

I think d/pull will help in this case

marshall13:07:33

you should test the time it takes to query just for the entity IDs and how long it takes to pull the attributes of interest

metal 6
marshall13:07:44

to determine if the pull or the query is taking the majority of the time

metal 6
Ian Fernandez13:07:39

it can be a problem to use this w/o Ions with too many entities on cloud?

Guilherme Pupolin13:07:39

Hi @U05120CBV, these are the queries and execution times First: PULL + 30 days interval = 110578.733438 msecs (14,054 results) Second: Entity + 30 days interval = 22990.008083 msecs (14,054 results)

(time (def pull-entities (d/q {:query  '[:find (pull ?entity [*])
                                           :in $ ?merchant-id ?transaction-time-start ?transaction-time-end
                                           :where
                                           [?entity :merchant-id ?merchant-id]
                                           [?entity :transaction-time ?transaction-time]
                                           [?transaction-time :utc-time ?transaction-time-utc]
                                           [(>= ?transaction-time-utc ?transaction-time-start)]
                                           [(<= ?transaction-time-utc ?transaction-time-end)]]
                                 :args   [(d/db (:conn client))
                                          "beb9c7db-a7eb-4e56-8c4e-4db195566562"
                                          #inst"2020-06-01T03:00:00.000-00:00"
                                          #inst"2020-06-30T03:00:00.000-00:00"]
                                 :timeout 50000})))

  "Elapsed time: 110578.733438 msecs"
  => #'pgo.commons.datomic-test/pull-entities
  (count pull-entities)
  => 14054

  (time (def pull-entities (d/q {:query  '[:find ?entity
                                           :in $ ?merchant-id ?transaction-time-start ?transaction-time-end
                                           :where
                                           [?entity :merchant-id ?merchant-id]
                                           [?entity :transaction-time ?transaction-time]
                                           [?transaction-time :utc-time ?transaction-time-utc]
                                           [(>= ?transaction-time-utc ?transaction-time-start)]
                                           [(<= ?transaction-time-utc ?transaction-time-end)]]
                                 :args   [(d/db (:conn client))
                                          "beb9c7db-a7eb-4e56-8c4e-4db195566562"
                                          #inst"2020-06-01T03:00:00.000-00:00"
                                          #inst"2020-06-30T03:00:00.000-00:00"]
                                 :timeout 50000})))

  "Elapsed time: 22990.008083 msecs"
  => #'pgo.commons.datomic-test/pull-entities
  (count pull-entities)
  => 14054 

Guilherme Pupolin13:07:39

And one more case, without filter time: Third: Entity + without interval = 3768.019134 msecs (17,670 results)

(time (def pull-entities (d/q {:query  '[:find ?entity
                                         :in $ ?merchant-id
                                         :where
                                         [?entity :merchant-id ?merchant-id]]
                               :args   [(d/db (:conn client))
                                        "beb9c7db-a7eb-4e56-8c4e-4db19556656"]
                               :timeout 50000})))
"Elapsed time: 3768.019134 msecs"
=> #'pgo.commons.datomic-test/pull-entities
(count pull-entities)
=> 17670

Joe Lane14:07:11

This is the time to get the data back to your development computers, right?

Joe Lane14:07:27

Where are you located, which AWS_REGION are your non-ion machines located, and which AWS_REGION is your datomic cloud cluster deployed in?

Guilherme Pupolin14:07:30

For these examples, I connected using datomic-cli from São Paulo in the cluster at us-east-1. In the productive environment, it has a VPC Endpoint connecting our applications to Datomic in us-east-1.

Joe Lane14:07:46

So I understand, in prod, your datomic cluster is in us-east-1, and your applications connect to it from which AWS_REGION? Where are the machines themselves?

Guilherme Pupolin14:07:53

In prod, both in us-east-1

Joe Lane14:07:46

Have you gone through the "Decomposing the query" Example marshall posted?

Guilherme Pupolin15:07:10

Yes I do. But I couldn’t improve any more than that (I got a better result just passing the merchant-id, I do not know a way to search better on this date ref).

[?entity :merchant-id ?merchant-id]
[?entity :transaction-time ?transaction-time]
[?transaction-time :utc-time ?transaction-time-utc]
[(>= ?transaction-time-utc ?transaction-time-start)]
[(<= ?transaction-time-utc ?transaction-time-end)

marciol16:07:07

@U05120CBV we noticed that what is hurting the query performance are all clauses related to time.

Joe Lane16:07:03

@marciol Can you typehint the query clauses with ^java.util.Date

Guilherme Pupolin16:07:07

@U0CJ19XAM in this way?

(d/q {:query  '[:find ?entity
                                           :in $ ?merchant-id ^java.util.Date ?transaction-time-start ^java.util.Date ?transaction-time-end
                                           :where
                                           [?entity :merchant-id ?merchant-id]
                                           [?entity :transaction-time ?transaction-time]
                                           [?transaction-time :utc-time ?transaction-time-utc]
                                           [(>= ?transaction-time-utc ?transaction-time-start)]
                                           [(<= ?transaction-time-utc ?transaction-time-end)]]
                                 :args   [(d/db (:conn client))
                                          "beb9c7db-a7eb-4e56-8c4e-4db195566562"
                                          #inst"2020-06-01T03:00:00.000-00:00"
                                          #inst"2020-06-30T03:00:00.000-00:00"]
                                 :timeout 50000})))

Joe Lane16:07:17

Although, it may not make a difference because you are using the custom comparators <= and >=.

Joe Lane16:07:06

Why did y'all decrease the timeout from 60 seconds to 50 seconds?

marshall16:07:51

@U016FDZFA2X what are the schema definitions for all the attributes in the query

kschltz17:07:21

This is the one

kschltz17:07:44

From what we know so far, the issue lies in the time nesting

kschltz13:07:41

It takes around 50s to retrieve 14k results

kschltz13:07:14

• We sliced the db using d/since without much improvement

marciol13:07:03

It’s a lot of time to return the result of such trivial query, must be something we can do to decrease this time.

favila13:07:39

could it also be the pull * and not the query itself?

favila13:07:08

your find looks odd (missing close paren). is that the whole thing?

marshall13:07:32

@marciol Have you separately tested the time it takes to query for the entity IDs and to pull the attributes from them

marciol13:07:28

@U05120CBV we are going to do all this to get a more fine grained overview of what is happening.

marshall13:07:50

also review the decomposing a query example

zhuxun216:07:39

What is the idiomatic way to answer the question "when was this attribute last changed"?

ghadi16:07:24

bind the ?tx (the fourth component) in a clause, then join to the ?tx's :db/txInstant

metal 6
ghadi16:07:55

[?e :myAttr _ ?tx]
[?tx :db/txInstant ?time]

zhuxun216:07:28

Is there a equivalent of "sort by" in datomic?

zhuxun217:07:52

Then how do I query for something like top 10 entity with respect to an attribute? Do I have to query for all of them and then do a client-side sorting?

favila17:07:36

Take a look at d/seek-datoms and d/index-range on the peer api, and index-seek and index-pull on the peer

favila17:07:20

you could also try abusing nested queries a bit. You can call normal clojure code inside a query, so you could have an inner query that gets all results, and an outer query that sorts and limits them

zhuxun217:07:49

I guess by nested queries you mean something like this, right? https://docs.datomic.com/cloud/query/query-data-reference.html#q

zhuxun217:07:21

There must be a better way ...

zhuxun221:07:01

Can I have an attribute storing an unspecified EDN?

zhuxun221:07:45

Or an unspecified nested data structure consisting of maps, vectors, and any of the supported basic types as leaves (i.e., a JSON-like structure)

souenzzo23:07:39

@zhuxun2 you can use pr-str and store as string

👍 3
marshall23:07:06

@zhuxun2 Datomic is not intended for storing LOBs. You should avoid putting large objects directly in Datomic. Either split them into individual facts (datoms) or store the LOBs somewhere else (ddb, s3, etc) and store a reference to them in Datomic

👍 3
jeroenvandijk09:07:13

@U05120CBV Do you have a rule of thumb for when a string becomes too big and should be considered a LOB?

ilshad18:07:44

@U0FT7SRLP 4Kb is the limit for strings

👍 6