#datomic
2017-05-26
Lone Ranger01:05:45

having trouble grokking the "set" "unset" in the documentation for altering a schema to be non-unique

Lone Ranger01:05:36

(d/transact conn '[{:db/id     :user/email
                    :db/unique unset}])
not working like I would hope it would πŸ˜…

marshall01:05:01

Take a look at that section and the next few subsections
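
A hedged sketch of what that doc section describes, assuming :user/email was originally :db.unique/identity: "unset" in the docs table means retracting the :db/unique value, not transacting the word unset. The :db.alter/attribute datom follows the documented alteration pattern; whether it is still required may depend on the Datomic version.

@(d/transact conn
   [[:db/retract :user/email :db/unique :db.unique/identity]
    ;; signal the attribute alteration, per the docs' schema-alteration example
    [:db/add :db.part/db :db.alter/attribute :user/email]])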

Lone Ranger01:05:49

darn users ... can't they read? πŸ€“

Lone Ranger01:05:42

does anyone have any idea why these might be conflicting?

{:d1 [17592186045678 :user/id 52 13194139534544 true],
 :d2 [17592186045678 :user/id 71 13194139534544 true]}

Lone Ranger01:05:11

or, what questions would you ask to determine if they are conflicting?

Lone Ranger01:05:13

this is the schema:

#:db{:ident       :user/id,
     :cardinality :db.cardinality/one,
     :valueType   :db.type/long}

marshall01:05:17

You can't assert multiple values against the same EA pair in a single transaction if the attribute is cardinality one
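
A minimal illustration of that rule, using a hypothetical string tempid "u": both datoms target the same entity-attribute pair in one transaction, so with :db.cardinality/one the transaction is rejected with a datoms-conflict error.

@(d/transact conn
   [[:db/add "u" :user/id 52]
    [:db/add "u" :user/id 71]])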

Lone Ranger01:05:46

to further the mystery... when I attempt to transact a bunch at once with (d/transact conn list-of-maps) I get lots of conflicts, but when I do

(doseq [m list-of-maps] (d/transact conn [m]))
it works fine

marshall01:05:17

Because the entirety of a transaction is atomic (i.e. it all happens at exactly the same time), how would you know which is the value to assert?

Lone Ranger01:05:48

ohhh... so, all those maps were being assigned the same entity?

marshall01:05:00

And your 2nd example is doing one transaction per map instead of one big transaction

Lone Ranger01:05:34

I really should've gone to the g*dd*mn Day of Datomic training 😅 do you know if they're doing another one at the Conj this fall?

marshall01:05:50

Don't know yet

Lone Ranger01:05:25

so is the second way of doing it preferred for a bunch of separate entities?

marshall01:05:38

Not necessarily

marshall01:05:12

But I suspect you have multiple txns against the same entity in your maps

marshall01:05:03

Looking at your original 2 conflicting datoms - you're saying the same entity has both 52 and 71 as id

marshall01:05:32

Notice both have the same value in the E position

Lone Ranger01:05:37

i.e. for a more complete example

(d/transact conn [#:user{:disable false,
        :email "i********@****.net",
        :authenticated false,
        :pwdhash
        "pbkdf2:sha1:1000$mIrT****************",
        :lastname "B***",
        :hawk "ib******",
        :username "ibi*****",
        :firstname "I*****",
        :id 53,
        :group_name "client",
        :count 0,
        :last 1485361260000}
 #:user{:disable false,
        :email "robe*****@***.net",
        :authenticated false,
        :pwdhash
        "pbkdf2:sha1:1000$****************,
        :lastname "F",
        :hawk "r*****",
        :username "r"****,
        :firstname "R****",
        :group_name "admin",
        :count 0,
        :id 77
        :last 1491327360000}])

Lone Ranger01:05:08

there were a bunch more of those in the transaction but I just pulled two

Lone Ranger01:05:23

those are somehow all being regarded as the same entity?

marshall01:05:06

Where is the :user/id from your original conflicting datoms example?

Lone Ranger01:05:16

oh, I filtered those out in desperation πŸ˜…

marshall01:05:28

Ah. Yes, it seems that you have multiple maps referring to the same entity. Do you have a unique identity or value attribute?

Lone Ranger01:05:44

no I didn't make any of them unique πŸ˜›

Lone Ranger01:05:43

if I had made at least one of the attributes unique would a bulk-add have worked?

marshall01:05:19

You could add an explicit db/id to each to be sure, but the behavior you describe is unexpected

Lone Ranger01:05:56

I thought it implicitly created a db/id

marshall01:05:58

Unless there's a unique attribute or you have the same temp ids in more than one map

marshall01:05:44

It does. But you can use an arbitrary string temp id for instance to refer to other entities in the same txn
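
A hedged sketch of that: give each map its own (arbitrary, hypothetical) string tempid in :db/id, which both guarantees the maps are treated as distinct entities and lets other datoms in the same transaction refer to them.

(d/transact conn
  [{:db/id "user-53" :user/id 53 :user/group_name "client"}
   {:db/id "user-77" :user/id 77 :user/group_name "admin"}])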

marshall01:05:05

What version of Datomic?

Lone Ranger01:05:12

ah I see ... let me check

Lone Ranger01:05:47

datomic-pro-0.9.5561

marshall01:05:01

If you'll email me a repro (schema and txn that fail) I'll have a look tomorrow morning.

onetom10:05:06

@val_waeselynck I just came across your datofu project (via a Slack log archive). I'm wondering what happened to the idea of defining the schema using helper :db/fns, as you suggested in https://stackoverflow.com/a/31480922. At the time you said you hadn't tried it because you were happy with generating the schema via code. Have you tried it since, or do you know anyone who has?

val_waeselynck11:05:04

onetom: haven't tried it, I've been moving more in the opposite direction. Regarding modeling, I see the Datomic schema as a derived thing rather than a source of truth; my approach for http://bandsquare.com is to store model metadata in a DataScript database from which installation transactions are derived.

val_waeselynck11:05:33

I still believe that for most projects, datofu's approach will be the most reasonable one, at least for getting started. Datomic's transactions being data doesn't mean they have to be written as data literals

val_waeselynck11:05:55

I should add this to the SO question

val_waeselynck11:05:28

I now believe even less in the database-functions approach than I did at the time - they'd just be an unportable DSL disguised as data

onetom12:05:21

interesting... however, what does porting mean? You would expect that some other system might want to read the same EDN data that describes the schema as transaction-function calls? It would see a vector of lists, each containing a symbol, a few keywords, and a string. If you just used the data literals, you would get a vector of namespaced-keyword-keyed maps. Then what would the next step be for that other system with this data? It would still need to interpret it somehow. The db/fn approach just means it would deal with positional parameters as opposed to named ones... and if that other system already understood Datomic schema attribute names, then it should just receive the output of a Datomic query that returns the schema as maps using pull... 😕

onetom12:05:00

not that I don't like the functional approach; it's just that the person I'm working with at the moment insists on using .edn files for the schema and similar seed data. And it works for now, so instead of resisting, I'd like to trick him towards a more concise solution 🙂

val_waeselynck12:05:20

In this case, trick him by using custom EDN tagged literals πŸ˜›
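
A hypothetical sketch of that trick (the tag name and helper are invented here): keep the schema in an .edn file as compact tagged literals and expand each tag into a full attribute map when reading, so the file stays concise but what you transact is plain schema data.

(require '[clojure.edn :as edn])

(defn attr-reader
  "Expands [ident type & flags] into a Datomic attribute map."
  [[ident type & flags]]
  (cond-> {:db/ident       ident
           :db/valueType   (keyword "db.type" (name type))
           :db/cardinality :db.cardinality/one}
    (some #{:identity} flags) (assoc :db/unique :db.unique/identity)
    (some #{:many} flags)     (assoc :db/cardinality :db.cardinality/many)))

(edn/read-string {:readers {'schema/attr attr-reader}}
                 "[#schema/attr [:user/email :string :identity]
                   #schema/attr [:user/hawk  :string]]")
;; => a vector of ordinary Datomic attribute maps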

Lone Ranger16:05:26

anyone have any tips for muscling through large sql imports?

Lone Ranger16:05:33

my transactor keeps timing out

Lone Ranger16:05:09

do smaller imports you say? that's a solid idea. I'm glad we had this talk πŸ˜‚

favila16:05:42

smaller queue depth?

favila16:05:09

are you doing transact-async without derefing?

Lone Ranger16:05:24

yeah I think part of the problem is I'm using jdbc and pulling the whole table into memory which isn't great either

Lone Ranger16:05:43

I need to figure out a way to do a lazy-seq on the rows
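
A hedged sketch of one way to do that with clojure.java.jdbc (the names here are placeholders): pass a :result-set-fn so the rows are handed to your code as a lazy seq while the connection is still open, instead of being realized into one big vector first.

(require '[clojure.java.jdbc :as jdbc])

(defn with-table-rows
  "Runs handle-rows! on the (lazy) seq of rows of table while the
   connection is open; handle-rows! must consume the rows before returning."
  [db-spec table handle-rows!]
  (jdbc/query db-spec
              [(str "SELECT * FROM " table)]
              {:result-set-fn handle-rows!}))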

Lone Ranger16:05:15

and I am doing transact-async and I'm assuming it's without derefing b/c I didn't know derefing was a technique you could employ 😳

favila16:05:36

if you don't deref at some point, you are just overwhelming the transactor

Lone Ranger16:05:40

(defn import-table! [conn db table-name tx-fn]
  (import-schema! conn db table-name)
  ;; everything returned by import-table goes into a single transact-async call
  (d/transact-async conn (import-table conn db table-name tx-fn)))

favila16:05:51

oh, you have one giant transaction

favila16:05:54

also not good

Lone Ranger16:05:20

what's a better practice?

Lone Ranger16:05:55

loop over the rows and transact them one at a time or in chunks?

favila16:05:40

I'm surprised I'm not finding something that puts all bulk import advice on one page

Lone Ranger16:05:05

to be fair this stuff is fairly cutting edge as far as tech goes

Lone Ranger16:05:36

it's really nice that we have such a great community (aka (== @favila 'community))

Lone Ranger16:05:42

you know, it's also interesting ... part of the brilliance of all this is that Prolog has been around for a long time, and so have databases, and it's so awesome that someone finally put them together

favila16:05:41

in order of importance: 1) transact in chunks of ~1000 datoms; 2) use pipelining; 3) do it with a separate amped-up transactor with no other load, or on a local machine (or whatever), and get it into production with a backup/restore; 4) dial up memoryIndexThreshold and memoryIndexMax (to avoid indexing as long as possible)
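
A rough sketch of point 1, assuming entity-maps is the seq that was going into one giant transaction: with roughly ten attributes per map, chunks of ~100 maps keep each transaction around the 1000-datom mark.

(def chunks (partition-all 100 entity-maps))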

Lone Ranger16:05:13

baller... that oughta be pinned

Lone Ranger16:05:40

thanks again πŸ˜„

favila16:05:20

pipelining and smaller transactions are the most critical

favila16:05:52

the rest you can often ignore. Since your entire import fits in memory anyway, it's unlikely the other stuff matters much

Lone Ranger16:05:14

well πŸ˜… some tables

favila16:05:17

I aim for 1000 datoms per transaction

favila16:05:32

and deref after every tx (no pipelining)

favila16:05:38

if that's too slow, I add pipelining

favila16:05:50

if that's still too slow, I do it offline

Lone Ranger16:05:31

okay so what would that look like?

(doseq [chunk chunks]
    @(d/transact conn chunk))
?

favila16:05:45

use transact-async

favila16:05:51

but yes, essentially

Lone Ranger16:05:55

I'm not sure I understand the significance of derefing the transaction

favila16:05:13

d/transact and d/transact-async return futures

favila16:05:32

d/transact waits-with-timeout for the future to resolve, then returns the future

favila16:05:39

d/transact-async does not wait

Lone Ranger16:05:52

but if you deref it you get the benefits of async and sync

favila16:05:21

d/transact is really just for REPL use

favila16:05:45

it automatically adds a timeout, and does not return until the future is either done or throws because it timed out
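
A hedged illustration of the difference: d/transact-async returns its future immediately, and you choose when (and with what timeout, if any) to deref it. The timeout value here is made up.

(let [fut (d/transact-async conn tx-data)]
  ;; wait up to 10 minutes for this tx, otherwise get ::timed-out back
  (deref fut 600000 ::timed-out))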

Lone Ranger16:05:56

ahhhh interesting

favila16:05:06

but really long waits are not abnormal on a bulk import job so you don't want the timeout

favila16:05:13

you want transact-async

favila16:05:42

however, that doesn't wait at all, so if you call it over and over without deref you are just overwhelming the transactor with potentially thousands of tx requests

favila16:05:21

(and not checking for errors either--transactions may legitimately fail but you won't see the error and won't stop issuing txes)

favila16:05:01

however immediately derefing is slow: it means tx is sent, and no new tx is sent until response is received

favila16:05:34

that's where pipelining comes in: you send maybe 10-20 d/transact-async at a time and deref later or in another thread

Lone Ranger16:05:47

yeah that's fine, slow is not an issue

favila16:05:48

keeping a bunch of txs in the air

Lone Ranger16:05:55

just steady and reliable is important

favila16:05:56

but still derefing somewhere

Lone Ranger16:05:36

🤔

Lone Ranger16:05:03

so you're saying send 10-20 chunks sized 1000 and then deref somewhere?

favila16:05:48

pipelining relies on not waiting for a deref to finish before sending another d/transact-async

favila16:05:06

but with a limited number of in-flight requests

favila16:05:23

so as not to overwhelm the transactor

favila16:05:33

this does not preserve transaction order

favila16:05:12

this one does (but I rarely use it, not sure how bug-free it is) https://gist.github.com/favila/3bc6fae005228a3290d5509c088e2f11

favila16:05:35

and you can do it without threads too, using reduction or a loop

Lone Ranger16:05:10

inorder isn't so important

favila16:05:30

gather the in-flight futures into a vector. When it reaches the desired size, start derefing them, removing the derefed ones from the vector, then keep going
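
A hedged sketch of that loop (single-threaded, order not preserved; chunks, max-in-flight, and drain-n are placeholders): keep up to max-in-flight futures pending, and when the backlog fills, deref only the oldest few so the pipeline never empties.

(defn pipeline-import!
  [conn chunks max-in-flight drain-n]
  (loop [pending [] chunks chunks]
    (cond
      ;; everything submitted: drain what's left, surfacing any tx errors
      (empty? chunks)
      (doseq [fut pending] @fut)

      ;; backlog full: deref only the oldest drain-n, keep the rest in flight
      (>= (count pending) max-in-flight)
      (do (doseq [fut (take drain-n pending)] @fut)
          (recur (vec (drop drain-n pending)) chunks))

      ;; room left: send the next chunk without waiting
      :else
      (recur (conj pending (d/transact-async conn (first chunks)))
             (rest chunks)))))

;; e.g. (pipeline-import! conn chunks 20 10)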

Lone Ranger16:05:43

oooo okay fantastic

Lone Ranger16:05:04

I think it would probably be doable to use channels for that

favila16:05:10

don't deref them all at once, that will flush the pipeline

Lone Ranger16:05:50

don't deref all the in-flights at once?

favila16:05:16

well yes, that would mean you wait until all in-flights are finished

favila16:05:24

so flushing the pipeline

Lone Ranger16:05:58

why is flushing the pipeline bad?

Lone Ranger17:05:15

or is it just inefficient?

favila17:05:17

that means while you are derefing all the inflights, you have 0 in flight

Lone Ranger17:05:53

hmm is that true even if you are derefing them on a separate thread/channel?

favila17:05:59

so you go from e.g. 20 in flight, then when your depth is reached you start derefing them all, so then you have 0 in flight, then you issue 20 all over again

favila17:05:31

no, I'm explaining a caveat of a single-thread impl

favila17:05:57

when your inflights fill, be careful to deref only some, not all of your backlog, or else you will empty your pipeline

Lone Ranger17:05:20

so... fill up 20, deref 10 or so, bring on 10 more, deref 10 or so, etc?

Lone Ranger17:05:03

If I had time for such things this would make for an interesting study

favila17:05:10

yeah, so your effective inflight variance is going to depend on jitter

favila17:05:24

i.e. difference in time-to-complete of each tx

favila17:05:50

I don't know how much depth matters, just as long as the txor never has to wait for another tx from you

Lone Ranger17:05:03

gotcha. I did a NoSQL -> MySQL transfer pipeline once that did a similar thing, attempting to optimize its write speed by varying chunk size, but honestly I'm not sure it was worth the effort; the gains were fairly marginal

favila17:05:05

it should always have another tx waiting in its queue after it finishes

Lone Ranger17:05:46

gotcha. Excellent, gives me a great place to start

favila17:05:01

and if your depth is a little too deep, the txor will apply backpressure anyway, so it's safe

favila17:05:35

as long as you eventually listen to the backpressure by derefing somewhere