#datomic
2017-02-03
pesterhazy09:02:20

@sova, I've found, by the way, that I usually don't need to get the entity id when the code is written in idiomatic way

pesterhazy09:02:41

although of course that depends on what you're transacting 🙂

reitzensteinm13:02:32

has anyone come across a :db.error/invalid-entity-id on valid entities before?

reitzensteinm13:02:01

(d/transact (conn) [{:db/id 17592186076008, :account/owner-phone-number "047041****"}])
=> #object[datomic.promise$settable_future$reify__6480 0x49b7d935 {:status :failed, :val #error {
 :cause ":db.error/invalid-entity-id Invalid entity id: 17592186076008"

reitzensteinm13:02:45

(d/touch (d/entity (cal.utils.database-connection/db) 17592186076008))
=> {:account/owner-phone-number "<snip>", :account/base-commission 0.0, ...}

reitzensteinm13:02:26

I'm using the dev db, deleted the data and restored a backup of prod, where we seemed to be having a bit of trouble

favila15:02:41

@reitzensteinm You cannot transact arbitrary entity ids explicitly

favila15:02:19

Are you sure the entity existed before?

favila15:02:45

the only way to "mint" a new entity id is to transact with a temp id and let the db assign an actual id

reitzensteinm15:02:00

@favila yes I'm sure it existed (see the second output)

favila15:02:00

what is the basis-t of the db? @reitzensteinm

reitzensteinm15:02:33

it says 540270

reitzensteinm15:02:45

is that a transaction count?

favila15:02:27

it's the "t" part of the last transaction

favila15:02:42

there is one counter inside the db which is incremented for every minted entity id. its value goes in the "t" part of an entity id (the lower bits)

favila15:02:10

the higher bits of an entity id contain the t of a partition id

favila15:02:43

so 17592186076008 has a t of 31592, and a partition of 4

favila15:02:49

(the user partition)
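The split favila describes can be checked with plain bit arithmetic, assuming (as he says) the entity's t lives in the low bits (42 of them) and the partition id in the high bits:

```clojure
;; Decompose 17592186076008 per favila's description: high bits are
;; the partition id, low 42 bits are the entity's t.
(def eid 17592186076008)

(bit-shift-right eid 42)                   ;; => 4      (:db.part/user)
(bit-and eid (dec (bit-shift-left 1 42)))  ;; => 31592  (the entity's t)
```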

reitzensteinm15:02:57

makes sense. it's a very old entity

reitzensteinm15:02:02

been in the db for nearly a year

favila15:02:25

I was just sanity-checking that the t value had advanced beyond this entity's t value

reitzensteinm15:02:46

i'm suspecting corruption of the db. I was investigating an issue where, on prod, a transaction with a tempid overwrote an existing entity

reitzensteinm15:02:02

the error occurs on a backup of the prod db, restored locally

reitzensteinm15:02:27

(not the same entity)

reitzensteinm15:02:41

but the restored db is apparently very broken

favila15:02:45

other sanity checks I can think of: are you sure that db is actually from the correct connection? Are you connecting to both transactors from the same process with an older datomic version? (older datomic versions would reuse the connection)

favila15:02:47

do you see any 0-size files in the values/ dir of your backup? (this happened to us--a corrupted backup)

reitzensteinm15:02:59

there's just the one transactor (the same process does not connect to prod)

reitzensteinm15:02:04

will check the backup dir

reitzensteinm15:02:14

i've deleted and rebacked up several times

favila15:02:19

if all of this is ruled out, I would file a support ticket

favila15:02:27

this does sound like corruption

favila15:02:39

what is your prod storage backend?

reitzensteinm15:02:50

it's on postgresql, hosted on RDS

favila15:02:20

you can also query the db directly to look for corruption

favila15:02:28

(if it manifests in prod too)

reitzensteinm15:02:51

the same error (writing to existing entity) does not, but I'm thinking that the other shoe is going to drop pretty soon

reitzensteinm15:02:04

just the one error on prod indicating corruption (tempid reusing an existing entity)

reitzensteinm15:02:36

do you mean using the datomic integrity diagnostics fn?

favila15:02:28

I'm not aware of that? I meant the blob column in the sql db will be size 0 or start with 0x00

favila15:02:55

(for the value rows, their id value looks uuid-ish)

reitzensteinm15:02:38

thanks for your help! I'll try to look for corruption and file a ticket

favila15:02:01

select * from datomic.datomic_kvs where id like 'pod%' will get you the (mutable) roots, may have some interesting things in it

reitzensteinm15:02:51

no 0 sized files in the backups dir, lowest are 60

jdkealy16:02:26

is there some sort of trick for manual indexing to help me get around a terribly slow query ?

jdkealy16:02:16

photos belong to a collection, collections belong to organizations. I have 2M photos and I can't get the highest "foreign key" from my photos because I'm running out of RAM.

marshall16:02:30

as in largest numerically?

marshall16:02:41

for a specific attribute?

jdkealy16:02:55

yes, but it's not necessarily unique

jdkealy16:02:01

because each organization will have their own FK

jdkealy16:02:18

org 1 has 10 photos with ids 1-10, org 2 has 20 photos ids 1-20

marshall16:02:35

as long as the attribute is index/true you could simply walk (d/datoms :avet :foreign/key) until you get to the largest

marshall16:02:13

^ that is lazy so it should be pretty memory efficient
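A minimal sketch of the walk marshall describes, assuming an attribute named :foreign/key with :db/index true (the attribute name comes from this conversation, not a real schema):

```clojure
(require '[datomic.api :as d])

;; d/datoms on :avet returns a lazy seq sorted ascending by value,
;; so the last datom for the attribute carries the largest value.
(defn highest-fk [db]
  (:v (last (d/datoms db :avet :foreign/key))))
```

Because the seq is lazy, only one index segment's worth of datoms is held in memory at a time, even though every datom for the attribute is visited.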

jdkealy16:02:22

there's no way to isolate those datoms to just an org's datoms is there?

marshall16:02:23

if you want to realize the whole set you could (seq ) it

marshall16:02:43

how are orgs defined?

jdkealy16:02:14

it's basically a user. firstname / lastname... they create a collection :collection/organization and then you upload photos to your collection :collection/photo

marshall16:02:34

that’s all in a single entity?

jdkealy16:02:44

3 entities

jdkealy16:02:14

:user/name :db/type :string, :collection/organization :db/type :ref, :photo/collection :db/type :ref

marshall16:02:15

have you tried optimizing the ordering of clauses in your query?

jdkealy16:02:37

there's only about 4 orgs now

jdkealy16:02:46

so i go org -> collection -> photos

jdkealy16:02:54

but in my big org, it still has to filter through 2M photos

jdkealy16:02:38

so... i guess one thing that would help would be being able to quickly iterate through an org's photos

jdkealy16:02:54

would it make sense ( i'd hate to do it ) but to add an attribute :photo/organization

jdkealy16:02:58

which bypasses the collection ?

marshall16:02:57

you could do a combination of query and datoms

marshall16:02:05

query to get the entity IDs of the collections

marshall16:02:21

then use datoms on :eavt or :aevt

marshall16:02:32

for each of the collections

jdkealy16:02:50

as one query or 20k queries ( i have 20k collections )

jdkealy16:02:19

each collection averages like 100 photos

marshall16:02:33

and you’re looking to find the ‘largest’ collection?

jdkealy16:02:02

i'm trying to find the single photo that has the highest foreign-key

jdkealy16:02:26

i have a counter function in datomic

jdkealy16:02:41

so it doesn't need to happen every time i create a new photo

marshall16:02:44

is the highest foreign key the most recently transacted?

jdkealy16:02:07

but i'm importing old data. so after import i'm trying to set the counter to the highest fk

jdkealy16:02:31

sure... the highest foreign key for that organization recently transacted

jdkealy16:02:42

i mean... it's recent if the thing isn't broken

marshall16:02:46

you could use the log

marshall16:02:24

walk backwards through transactions until you find it

jdkealy16:02:46

right... i guess i'm trying to find ways though to offset some of my performance anxiety about these kinds of queries in the future

jdkealy16:02:10

is the log really a sustainable solution

marshall16:02:42

it’s a tree just like the indexes

marshall16:02:52

it just happens to be indexed by t first

jdkealy16:02:06

oh interesting

jdkealy16:02:34

i don't think though that i could trust the last attr to be the highest

marshall16:02:46

ah. well, that might not work then

jdkealy16:02:20

do you think I should think about duplicating the attr :collection/organization to :photo/organization ?

marshall16:02:40

i don’t love it, but it might work best

jdkealy16:02:49

that way i could query the datoms

marshall16:02:16

yeah, having the intermediate entity means you have a join required

jdkealy16:02:25

right... i guess it's like an index, it would be cool if you could have these kinds of indexes happen in the background

jdkealy16:02:47

like... if an entity belongs to another ent which belongs to another X levels deep

marshall16:02:15

basically compound keys

jdkealy16:02:23

since the datoms api seems to be the only way to query big sets quickly

jdkealy16:02:57

is it in the roadmap to expand on d/datoms to allow you to add multiple parameters ?

favila16:02:17

@jdkealy what do you mean?

favila16:02:23

@jdkealy you could use d/datoms within the query too

favila17:02:57

you can call arbitrary functions in a query. Remember the query is run on the client

favila17:02:34

you're trying to avoid realizing a large intermediate set you only want an aggregation for

favila17:02:08

you can do that by performing the aggregation in the where of the query via a function call instead of using datalog to aggregate (which will realize the intermediate set)

favila17:02:14

e.g. (defn last-datom [db idx & matches] (last (apply d/datoms db idx matches)))

favila17:02:08

':where [(my.ns/last-datom $ :eavt ?org ?whatever) [?e ?a ?v ?t]]

favila17:02:08

Although I'm not sure how relevant this is from your discussion because I'm not sure how the entities are connected. I would have to see the original query

favila17:02:59

datoms is still an index scan though

favila17:02:24

why are you interested in the highest entity id anyway? that seems like a strange thing to care about?

jdkealy17:02:12

the highest foreign key... i'm migrating data from databases that used an RDBMS and i want to keep the foreign key counter going seamlessly when they migrate to new system

favila17:02:38

ah, its not an entity id, it's a long

favila17:02:57

what if you put that clause first?

favila17:02:25

reverse the order of your :where clauses

favila17:02:33

so you start with all fk, then filter by org

favila17:02:49

rather than starting with org and finding all its fk

jdkealy17:02:57

would that make it faster ?

jdkealy17:02:12

i have 4 orgs, 20k collections, 2M photos

jdkealy17:02:15

or "content"

jdkealy17:02:34

i thought the idea was to go from lowest to highest

favila17:02:09

normally, but there's also benefit to traversing in a sorted order

favila17:02:40

perhaps query engine is smart enough to dispose of some intermediate sets

favila17:02:12

probably I am wrong though and the size of the intermediate set is what dominates

marshall17:02:52

its definitely worth testing

marshall17:02:03

i also suspect what @favila said - the join is dominating

marshall17:02:19

if you have some prior knowledge of the values you’re looking for you might be able to leverage http://docs.datomic.com/clojure/#datomic.api/index-range

marshall17:02:30

i.e. start at your last known largest value

marshall17:02:39

that would reduce the size of the required index scan

favila17:02:28

yes, avoiding a full datoms seq for the index segment

favila17:02:33

(think array bisection)
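marshall's index-range suggestion might look like the sketch below, assuming a :content/fk attribute and a previously recorded largest value (both names are assumptions from this conversation):

```clojure
(require '[datomic.api :as d])

;; Scan only the tail of the :avet index for :content/fk, starting at
;; the last known largest value; a nil end means "through the end of
;; the index", so only values >= last-known-max are touched.
(defn highest-fk-since [db last-known-max]
  (:v (last (d/index-range db :content/fk last-known-max nil))))
```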

jdkealy17:02:37

cool thanks guys

jdkealy17:02:41

i'll give it a try now

favila17:02:59

honestly if orgs is small I would just write a reduction function

marshall17:02:29

use an async transducer if you’re using clients

jdkealy17:02:53

are there some examples somewhere of something like this ?

favila17:02:01

I'll write one up for this

jdkealy17:02:05

wow thanks 🙂

marshall17:02:09

I believe that ^ is an example using a transducer across an async query

marshall17:02:08

it could be easily made into one

favila17:02:42

(As a side note, you can avoid the need for ffirst with :find (max ?x) .)

jdkealy17:02:38

wouldn't you want to make sure the new fk is higher than the previous seen ?

jdkealy17:02:48

if we could assume that we could just grab the last one, no ?

favila17:02:22

is type of :content/fk a :long?

favila17:02:27

datoms are stored in sorted order, so the :avet index will sort by a, then v, then e, then t

favila17:02:40

v here is :content/fk, so it is always increasing

jdkealy17:02:45

oh ok... so this creates a map of all the orgs and their fk's

favila17:02:05

it seqs over all content/fk, for each one gets its org, then writes that to a map
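favila's snippet itself isn't preserved in this log, but based on his description (seq over all :content/fk datoms, look up each one's org, write it into a map) it might look roughly like this; the attribute names are taken from the conversation:

```clojure
(require '[datomic.api :as d])

;; :avet is sorted by value, so for each org the last :content/fk
;; datom we see is that org's maximum; assoc-ing as we reduce keeps
;; only the most recent (largest) value per org.
(defn max-fk-per-org [db]
  (reduce (fn [acc datom]
            (let [org (get-in (d/entity db (:e datom))
                              [:photo/collection :collection/organization :db/id])]
              (assoc acc org (:v datom))))
          {}
          (d/datoms db :avet :content/fk)))
```

This is the "smallest amount of memory" shape favila mentions below: the lazy datom seq is consumed one element at a time, and only the small org→fk map is retained.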

jdkealy17:02:05

cool great

jdkealy17:02:17

at what point would this not be sustainable ? what if i have 100M photos ?

favila17:02:33

lots of orgs

jdkealy17:02:37

i can deal with slow...

jdkealy17:02:43

just not timing out

favila17:02:47

this should never time out

jdkealy17:02:51

cool great

favila17:02:06

this is the smallest amount of mem you can possibly use

jdkealy17:02:29

honestly i'm using elasticsearch for most of my queries... datomic i'm just using to give me the facts

favila17:02:41

so it may not be fast if there are a lot of datoms to seq over, but they will never all be in memory at the same time

jdkealy17:02:53

cool awesome

marshall17:02:33

neat side effect - you’ll populate your local cache with all the segments about your orgs and photos by running it

marshall17:02:49

depending on the size of your local (and/or memcached) instances

jdkealy17:02:01

interesting

jdkealy17:02:18

do you populate the cache by calling by-id ?

jdkealy17:02:30

i mean... (d/entity _db id)

marshall17:02:53

the datoms call technically

marshall17:02:58

although entity probably would too

marshall17:02:05

basically anything that fetches data will cache

marshall17:02:08

so query, datoms, etc

marshall17:02:32

my point was only that if you ran that reducer on a “cold” peer by the time it ran you’d have cached potentially most of your DB

marshall17:02:42

obviously depending on the size of the db and size of your cache

favila17:02:58

so a repeated run on the same peer would likely execute much more quickly

jdkealy17:02:13

cool, got it

jdkealy17:02:23

wow holy shit that was pretty fast

jdkealy17:02:47

i mean like 10 seconds fast ... but definitely beats timing out!

favila17:02:03

hah I was not expecting so large a difference

marshall17:02:24

those big set to small set joins can really be a bear

jdkealy17:02:37

without an index, it ran for like an hour and i gave up; with indexes, GC overhead; and this way, like 10 seconds

marshall17:02:54

most of the time query is great and preferred, but there are a few cases where you definitely can’t beat direct datoms

favila17:02:59

yeah that access pattern is a weakness of queries. Queries can't do the aggregation until the entire set you are aggregating over is realized

favila17:02:29

so if that set is very big, you may OOM

jdkealy17:02:38

right that makes sense

jeff.terrell17:02:55

Question: what is a Datomic database value, really? Is it just an entity id or something? I'm asking because I used an exception service today to diagnose a production exception (in a Rails app), and I thought, man, it'd be so nice to be able to query the database at that point in time. I got to wondering whether Datomic database values were lightweight enough to include in the exception information, so that I could copy the value from an exception report, then go and paste it into a REPL session connected to the production database for forensic investigation.

favila17:02:41

@jeff.terrell You can recreate a db value from its basis-t, as-of-t and since-t (both usually nil). We log the t in situations like you describe

favila17:02:17

(d/basis-t db) => t to get it, (d/as-of db t) to recreate it

favila17:02:35

If you have a timestamp on your log you can also grab a db value by time
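Putting favila's two calls together, the pattern is: log the small basis-t long with the exception, then rebuild the exact db value later in a REPL (db and conn here are assumed bindings, not from the log):

```clojure
(require '[datomic.api :as d])

;; At error time: capture the basis-t of the db value the code saw.
;; It's just a long, so it's cheap to attach to an exception report.
(def t-at-error (d/basis-t db))

;; Later, in a REPL connected to the same (production) database:
;; recreate the db value as of that t for forensic queries.
(def db-at-error (d/as-of (d/db conn) t-at-error))
```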

jeff.terrell18:02:28

@favila - Thanks! That makes sense. Filing this little fact away for future Datomic advocacy. :-)

stuartsierra21:02:49

@jeff.terrell More technically, a database value is a pointer into one set of "root" nodes in the immutable persistent trees that make up Datomic's indexes in storage. New versions of these roots are created each time the transactor does an indexing job; old roots are discarded when you run gc-storage. The d/as-of call gives you a "filtered" view of the database as it existed at time t.

jeff.terrell22:02:09

👍 neat. Thanks for the explanation, @stuartsierra.