#datomic
2016-08-16
robert-stuttaford05:08:47

it would be nice if datomic provided a reader literal for squuids #squuid "etc"

danielstockton10:08:51

aren't they just uuids when you're reading them?

danielstockton10:08:33

if you already have them, what part needs to know whether they were generated sequentially or not?

robert-stuttaford11:08:07

gosh. you're right. i'm a dork. i guess what i meant is it would be nice to generate squuids via a tag in edn

robert-stuttaford11:08:10

kinda like temp ids

danielstockton11:08:29

no, i thought that's what you meant, just got confused by the "etc" i think

robert-stuttaford11:08:12

yeah, that was incorrect

danielstockton11:08:02

I was curious, looks like these are the data readers defined by datomic: {db/id datomic.db/id-literal, db/fn datomic.function/construct, base64 datomic.codec/base-64-literal}

danielstockton11:08:20

#squuid might be handy too, I don't see why not

robert-stuttaford11:08:46

if we could do the same as we do with temp ids, e.g. #squuid[1], so that you could create relationships with squuids the same way you can with db ids

robert-stuttaford11:08:54

that would be awesome
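For what it's worth, a minimal sketch of how such a tag could be wired up by hand today when reading your own EDN files (nothing like this ships with Datomic; read-squuid and :user/id are made-up names):

(require '[clojure.edn :as edn]
         '[datomic.api :as d])

;; the reader fn receives whatever form follows the tag; a plain squuid ignores it
(defn read-squuid [_]
  (d/squuid))

(edn/read-string {:readers {'squuid read-squuid}}
                 "{:user/id #squuid []}")
;; => {:user/id #uuid "57b3..."}  (a fresh, non-deterministic value on every read)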

eggsyntax13:08:16

Anyone done a query/pull-exp cheat sheet? Because I would use that thing every day...

robert-stuttaford13:08:16

http://docs.datomic.com/query.html and http://docs.datomic.com/pull.html are pretty comprehensive. i found that good old practice embedded the concepts quickly

eggsyntax13:08:10

Yeah, those are my go-tos. Still, it'd be nice to have a one-pager to quickly refer back to, especially if I haven't been doing it for a while & I'm forgetting particular details of syntax.

val_waeselynck14:08:18

@robert-stuttaford: the trouble with a squuid tag is that the generated uuids would not be deterministic... at this point I'd say the content of the EDN file has stopped being 'just data'.

robert-stuttaford14:08:21

sounds like a good opportunity to contribute 🙂

robert-stuttaford14:08:37

@val_waeselynck: true, but this is already the case with #db/id

robert-stuttaford14:08:04

IF you supported determinism with e.g. #squuid -1 so that multiple uses of the token resolve to the same value

robert-stuttaford14:08:09

using a basic cache
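One shape the "basic cache" idea could take, so that #squuid -1 resolves to the same value every time it appears in a single file (sketch only; squuid-cache and read-squuid* are invented names, and the cache would need resetting between files):

(def squuid-cache (atom {}))

(defn read-squuid* [token]
  (if (integer? token)
    ;; same integer token -> same squuid, generated once
    (get (swap! squuid-cache update token #(or % (d/squuid))) token)
    ;; anything else -> a fresh squuid
    (d/squuid)))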

val_waeselynck14:08:35

I know, I feel there's a difference with tempids though, not sure how to express it

val_waeselynck14:08:54

at least with tempids it's always the same datoms that end up in storage, not so with random uuids

val_waeselynck14:08:03

so it's kinda more deterministic

robert-stuttaford14:08:27

which is probably why we don't have a reader tag 🙂

val_waeselynck14:08:33

@robert-stuttaford: I guess so. Even the #db/id tag felt weird to me in the beginning TBH

val_waeselynck14:08:30

@robert-stuttaford: btw, I recently stumbled on your podcast about Datomic and Onyx, I really liked it

robert-stuttaford14:08:43

thanks! which one? on defn.audio?

val_waeselynck14:08:55

yeah that's the one

robert-stuttaford14:08:30

that was a fun chat. Vijay and Ray are a blast

val_waeselynck14:08:36

I'm looking for solutions to make my analytics faster and more scalable, so definitely looking into tools like Onyx

robert-stuttaford14:08:14

@iwillig: enjoying your episode 🙂

robert-stuttaford14:08:23

you mentioned how you're having to think differently about historical data

robert-stuttaford14:08:40

have you started to realise the difficulty of technical debt in your data? 🙂

bhagany14:08:04

oh man. I am already stressing out about this, and I haven't had any problems yet.

robert-stuttaford14:08:23

i'm busy working on an epic to rebuild our database, transaction by transaction

robert-stuttaford14:08:00

initial analysis of the first 2mil txes yields ±120k txes i want to keep. the rest is either schema, data we no longer want, or bad programming

robert-stuttaford14:08:22

the bad programming and old data are about equal!

bhagany14:08:09

I'm worried about the ever-increasing complexity of historical queries that have to deal with schema changes

robert-stuttaford14:08:36

you mean having to query across all the versions of the schema?

bhagany14:08:40

yes, correct

bhagany14:08:52

which exacerbates my tendency to bikeshed such things

robert-stuttaford14:08:13

yeah. we've handled that in a couple ways. small data sets, we just re-transact and lose the time information. larger ones, we've continued to query across

bhagany14:08:59

It may not even become a problem in practice, I'm not yet sure how far back we'll need to go. But here I am worrying about it 🙂

robert-stuttaford14:08:09

i'm looking forward to unifying all that in the rebuild

robert-stuttaford14:08:36

the primary driver for doing this is to be prepared to shard in future, by building good tooling now

robert-stuttaford14:08:10

10 billion datoms is the theoretical upper limit for a db. we're at around 100mil, which means we have 99 copies to go. that's what's worrying me 🙂

bhagany14:08:34

I have a looooooooong way to go before I'm there 🙂

robert-stuttaford14:08:38

also, i get to re-partition the data according to the read patterns we've since discovered we have

bhagany14:08:58

that kind of thing keeps me from worrying about partitioning too much - I just don't know how it'll be. For some reason, that reasoning works on me for partitions, but not for future schema changes.

val_waeselynck14:08:14

@bhagany: curious about your specific problem. Is it that you are querying on asOf dbs and need to compensate for "future schema change" in your queries that go too far in the past?

bhagany14:08:34

yes, that's right

bhagany14:08:53

I don't actually have that problem yet. But I am trying to anticipate future schema needs now (and at the same time trying to not try, because that kind of thing can get you in trouble too)

val_waeselynck14:08:31

@bhagany: my take on this was to actually stop using asOf in application code

val_waeselynck14:08:48

history is not programmable

bhagany14:08:49

I can see your point there. I may come around to endorsing it, depending on how this goes.

val_waeselynck14:08:49

@bhagany: That's a very interesting problem actually. I think what you could do in a technology like Apache Samza is to derive a new Log of facts from an old Log of facts, adding the migration, and to use the new Log as the data source in the application code.

val_waeselynck14:08:00

That'd be an indirection between facts-recording and querying which Datomic does not have (yet)

bhagany14:08:48

interesting idea. I'll have to give that some thought.
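For what it's worth, here is a rough sketch of the "derive a new log from the old one" idea using Datomic's own log API instead of Samza. migrate-datom, old-conn and new-conn are placeholders, and a real decant also has to remap old entity/attribute ids to ones valid in the new database and skip bootstrap/schema transactions, all of which is elided here:

(require '[datomic.api :as d])

(defn replay-log!
  "Walks every transaction in old-conn's log, runs its datoms through
   migrate-datom (which may drop or rewrite them), and re-transacts the
   result into new-conn."
  [old-conn new-conn migrate-datom]
  (doseq [{:keys [data]} (d/tx-range (d/log old-conn) nil nil)]
    (when-let [tx-data (seq (keep migrate-datom data))]
      @(d/transact new-conn tx-data))))

;; the simplest possible migration fn: keep every datom as-is
;; (datoms support keyword access to :e :a :v :added)
(defn passthrough-datom [d]
  [(if (:added d) :db/add :db/retract) (:e d) (:a d) (:v d)])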

Ben Kamphaus14:08:00

@robert-stuttaford: have you found in testing how much re-partitioning could possibly speed up query patterns that are currently problematic for you? Just curious.

robert-stuttaford15:08:39

@bkamphaus: not yet. i haven't managed to actually rebuild the db yet. it's a big task -- 58mil txes, ~4 years worth. the first 2mil txes yielded over 100 transaction shapes to reason through alone

robert-stuttaford15:08:52

i'll be certain to share any findings, though. this gon' be fun!

Ben Kamphaus15:08:56

yeah, one of the big value props for Datomic for me is being able to support arbitrary query without over-engineering any particular aspect of the schema/model for particular query patterns. Obviously you never quite hit that point 100% with any database, but I'm curious if in the wild people end up needing to solve pain points with partitioning, or whether something like a reasonable set of partitions across a few logically grouped domains is usually sufficient.

danielstockton15:08:20

@val_waeselynck im not sure that's true.. #db/id[:db.part/user] isn't deterministic, it's using a counter behind the scenes which is increased on each transaction

danielstockton15:08:36

#db/id[:db.part/user -1] would be deterministic

danielstockton15:08:36

or it depends on the basis-t of the db-after the transaction, im not sure whether it uses a counter or not

val_waeselynck15:08:25

@danielstockton: yes but I would argue that it's the same datoms [eav] that end up in storage, so it's more deterministic in a way

val_waeselynck15:08:53

e.g. you can rely on transacting your edn file being idempotent

danielstockton15:08:54

but the tempid determines the e in a datom, which can be different?

danielstockton15:08:11

it depends when you transact and against what db

danielstockton15:08:58

but if you're importing one edn file on a fresh database, then i guess it is...

Ben Kamphaus15:08:06

the idempotent aspect of the schema comes from upsert for a :db/ident att/val pair (which is a unique identity) and special rules for tempid resolution in that case.

Ben Kamphaus15:08:34

which depends on the fact that an entity id is not an attr/val pair but its own thing, which is not true of e.g. a uuid attribute.

Ben Kamphaus15:08:18

it's less about what #db/id[:db.part/user -1] resolves to across all invocations and more about the fact that multiple uses will resolve to the same tempid within a particular transaction, meaning that the implied link/relation/join for those tempids is fulfilled by the resulting entity id generation.
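Concretely, within one transaction the repeated tempid below resolves to a single new entity, so the ref lands on it (the :person/* attributes are made up):

[{:db/id #db/id[:db.part/user -1]
  :person/name "Ada"}
 {:db/id #db/id[:db.part/user -2]
  :person/friend #db/id[:db.part/user -1]}]  ; same tempid, so it points at the "Ada" entity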

jdkealy17:08:37

is it possible to have a function to get-or-create an entity? I wanted to write a function that does a lookup, if it finds the criteria, it returns the ent-id of the match otherwise it creates the entity and returns the ent-id of the newly created entity. I can do the lookup in a regular datalog query but I believe that would not be thread-safe, e.g. if I'm importing big datasets and the query is run on multiple machines, it will rapidly create multiple duplicate entities. My function looks like this: https://gist.github.com/jdkealy/42bf630ceba6385914a43d5645d31d55

jdkealy17:08:17

my function returns tx-info like so {:db-before datomic.db.Db@2f39cc32, :db-after datomic.db.Db@4be3c61a, :tx-data #object[java.util.ArrayList 0x787bfc5c [datomic.db.Datum@1953ce9d]], :tempids {}}.... but i didn't actually transact anything... do i access the returned query via tx-data ?

jdkealy17:08:48

also... im calling the function like so... @(d/transact @db/conn [[:person/namer oid name]])... so i guess i am transacting... i'm a bit confused obviously on this subject

Ben Kamphaus17:08:27

@jdkealy: you’re crossing a couple of concerns that are decoupled in Datomic. I would split the logic somewhat.

Ben Kamphaus17:08:29

Do the query to see if what you're looking for exists yet; if not, either go through a transaction function to create it, or assign the entity a unique identity so you can rely on Datomic's upsert behavior

Ben Kamphaus17:08:44

eventually to figure out the outcome of the transaction to get the entity that was created you’ll want: http://docs.datomic.com/clojure/#datomic.api/resolve-tempid

jdkealy17:08:15

right.. but datomic's uniqueness constraint is only on a single attribute as far as i know

Ben Kamphaus17:08:38

If something has a unique identity in Datomic, it will handle that race for you, i.e. it will resolve the transaction to the existing entity ( http://docs.datomic.com/identity.html#unique-identities )

Ben Kamphaus17:08:45

composite uniqueness isn’t a thing in Datomic at this point in time, yeah.

jdkealy17:08:10

i thought that this kind of thing was the point of datomic functions

Ben Kamphaus17:08:33

yes, it is, though there’s an advantage to taking opportunities to rely on predefined behaviors rather than explicitly program your own with transaction functions.

Ben Kamphaus17:08:47

but composite uniqueness would preclude being able to rely on the default behavior for this case.

jdkealy17:08:00

indeed 🙂 so is there any way to do what i'm trying to do ?

jdkealy17:08:43

like... return the entity id or else create it in a single-threaded way? i'm worried about creating dozens of dupes as i'm going to be running this code on like 4 servers

robert-stuttaford17:08:43

@stuartsierra: hi 🙂 in the latest Cognicast, Craig mentioned your predilection for "decanting databases". it sounds like you've done this a couple times. i'm embarking on a rather large decanting of my own soon, and i wonder if you have any tips, or perhaps even generalised code that may be useful?

jdkealy17:08:50

i.e. can the datomic function return the result of a query or does it only return data related to a transaction

Ben Kamphaus17:08:08

if this is basically a big import and you can provide a unique identifier from the domain or by pre-generating uuids for everything prior to import, the default unique identity upsert behavior gets you there for free.
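A sketch of that route, assuming each imported item carries a pre-generated uuid (the :item/uid attribute, conn and item-uuid are placeholders):

(def schema
  [{:db/id                 #db/id[:db.part/db]
    :db/ident              :item/uid
    :db/valueType          :db.type/uuid
    :db/cardinality        :db.cardinality/one
    :db/unique             :db.unique/identity
    :db.install/_attribute :db.part/db}])

;; any number of import workers can transact this blindly: the first assertion
;; creates the entity, later ones upsert onto it instead of creating duplicates
@(d/transact conn [{:db/id    (d/tempid :db.part/user)
                    :item/uid item-uuid}])

;; afterwards the entity is addressable by lookup ref
(d/entity (d/db conn) [:item/uid item-uuid])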

Ben Kamphaus17:08:19

a transaction function (note this isn’t the only kind of database function but the typical one) returns transaction data that are then transacted on the transactor (provided it doesn’t throw an exception), but the results are standard transaction result maps. I.e. you can’t change the behavior of what happens on the other side.

Ben Kamphaus17:08:16

but you could define things like, for example: attempt to create this thing if it doesn't exist, otherwise throw an exception, and rely on that exception on the peer to know that if you get/sync a database value after your attempted transaction you can get the entity via query.

jdkealy17:08:26

ok... so perhaps instead of returning the entity id and then transacting with the id i should focus on doing the full transaction in the function ?

jdkealy17:08:05

or... another way would be ... if i do call the thread-safe transact function, i can do a lookup directly after and it's guaranteed to be unique right ?

Ben Kamphaus17:08:50

yes though that implies a blocking deref on the transaction and inspecting the :db-after, which is fine but may slow down import logic considerably if you’re doing this e.g. on every typical transaction.

jdkealy17:08:53

it would be like.... 20k times a day maybe ? tops

jdkealy17:08:27

i'm not as worried about slowness as i am about my app crashing 😕

Ben Kamphaus17:08:36

My first pass (knowing nothing else of the domain) would probably be the transaction function that tries to transact the thing and if it already exists, aborts the transaction via exception, and then either uses A. tempid resolution for a successful transaction result and a query to find it if the transaction aborts, or possibly B. just query to find it on a database value after the transaction attempt (successful or not) since it should be there either way.

jdkealy17:08:18

awesome... i think B sounds pretty straightforward... many thanks!

robert-stuttaford17:08:13

basically: find, or try: create-via-tx-fn, catch: find

robert-stuttaford17:08:12

(or (d/q ...)
    (try (d/transact ... [[:your-make-fn-which-first-also-does-the-d/q-thing ...]])
         (catch ... (d/q ...))))

robert-stuttaford17:08:42

you'd move the query bit to a function of its own to keep things DRY of course

robert-stuttaford17:08:51

distributed systems are hard 🙂
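A sketch of that shape with the query in its own fn. The :person/* attributes and the :create-person transaction function are stand-ins; the tx fn is assumed to be installed and to re-check existence on the transactor, throwing if the entity already exists:

(defn find-person [db first-name last-name]
  (ffirst (d/q '[:find ?e
                 :in $ ?fn ?ln
                 :where [?e :person/first-name ?fn]
                        [?e :person/last-name ?ln]]
               db first-name last-name)))

(defn find-or-create-person! [conn first-name last-name]
  (or (find-person (d/db conn) first-name last-name)
      (do (try @(d/transact conn [[:create-person first-name last-name]])
               (catch Exception _ nil))  ; lost the race: someone else created it
          ;; either way the entity exists now; use (d/sync conn) instead of
          ;; (d/db conn) if the peer might not have caught up yet
          (find-person (d/db conn) first-name last-name))))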

pheuter17:08:58

if i have multiple (pull) expressions inside of the :find clause in a query, is it possible for Datomic to not return nil if one of the pull queries doesn’t return anything?

robert-stuttaford17:08:24

no. nil is not a thing that datalog does at all

pheuter17:08:41

sorry, not nil, in this case just []

pheuter17:08:58

strangely enough the value the find returns is nil

robert-stuttaford17:08:31

sounds like a good candidate for breaking your code apart

robert-stuttaford17:08:03

i may not fully understand how you're getting an empty vec though

pheuter17:08:10

[:find (pull ?e […]) (pull ?e […]) :where [?e …]]

pheuter17:08:59

if one of those pulls doesn’t return any data, even if the other one does, the query will return [[nil]]

robert-stuttaford17:08:41

i'd put the pulls outside of d/q in a separate fn call

robert-stuttaford17:08:48

and deal just with ids in d/q

robert-stuttaford17:08:09

i don't know the answer to your actual question, though

robert-stuttaford17:08:23

what happens if you explicitly include :db/id in your pull expressions?

pheuter17:08:52

yeah, the underlying problem is a complex query for various data and metadata associated with certain entities, some of which can be potentially missing, and i still want to get all the data back, instead of constraining the result set

pheuter17:08:58

i feel like i might have to settle for making n separate queries

robert-stuttaford17:08:08

nothing wrong with separate queries

robert-stuttaford17:08:17

it's all in local memory anyway 🙂

pheuter17:08:35

yeah, maybe not the first request but perhaps it’s not such a big deal

robert-stuttaford17:08:24

it's a non issue; datalog is working with sorted sets of datoms in local memory, always

robert-stuttaford17:08:27

very often it's better to decouple things!

pheuter17:08:03

thanks! makes sense...
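For reference, the "ids in the query, pulls outside" arrangement might look like this (the :article/* attributes are placeholders):

(let [db  (d/db conn)
      ids (d/q '[:find [?e ...]
                 :where [?e :article/slug]]
               db)]
  (map (fn [e]
         {:article (d/pull db [:article/title :article/slug] e)
          ;; a pull that comes back empty only affects its own key here,
          ;; instead of nil-ing out the whole query row
          :meta    (d/pull db [:article/published-at] e)})
       ids))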

jdkealy17:08:06

thanks for your help @bkamphaus... i went with solution B... it appears to work on https://gist.github.com/jdkealy/4d8da9c5bbb37df19978c45256ea1856

kenny18:08:12

What is the S3 backup-uri format? I tried http://bucket.s3-aws-region.amazonaws.com and s3p://bucket-name.

kenny18:08:46

Ah, found it. Never mind 😛

Ben Kamphaus18:08:46

for reference, which is probably what you just found 🙂

kenny18:08:06

Yes. Is backup-name a folder or an actual backup?

kenny18:08:21

Hmm.. Is it possible to use backup/restore to copy one DB to another? I tried, however, I got this exception:

java.lang.IllegalArgumentException: :restore/collision The database already exists under the name '...'

Ben Kamphaus18:08:27

Can't copy one db to two different names in the underlying storage. You can overwrite a db by restoring to the same name, or restore that db to a new name on a different storage.

kenny18:08:09

Ah I see, thanks
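For reference, the backup/restore invocations being discussed look roughly like this (bucket and db names are placeholders); restoring the same backup under a different database name in the same storage is what raises the :restore/collision error above:

bin/datomic backup-db  datomic:dev://localhost:4334/my-db  s3://my-bucket/backups/my-db
bin/datomic restore-db s3://my-bucket/backups/my-db  datomic:dev://localhost:4334/my-db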

pheuter18:08:44

[:find (min ?e) (max ?e)
 :in $
 :where [?e :some/attr "some-value"]]
does it make sense to interpret the two values returned above as the earliest entity associated with that value vs. the latest entity associated with that value, assuming that there are multiple entities that share that same attribute-value pair?

Ben Kamphaus19:08:38

@pheuter: leading part of entity id is from partition, so multiple partitions can break that strategy.

Ben Kamphaus19:08:10

I would bind the 4th position (tx) and use that if it’s what you mean specifically. Also note that unless the parameter $ is a history db, you won’t find the earliest association if it has since been retracted.

pheuter19:08:40

Good points, thanks for the heads up!

pheuter19:08:22

so my question then is: how does it resolve aggregating over multiple entities, and then, for each entity, over multiple tx-entities?

pheuter19:08:32

if i do a max on the ?tx, will that be across all entities?

Ben Kamphaus19:08:57

If you do a (max ?tx) on [?e :some/attr "some-value" ?tx] it will be the most recent tx in which the datom matching the leading portion [?e :some/attr "some-value" ...] was asserted (and is still true as of the most recent database value). If you pass a history db, it will be the most recent tx to touch it (even a retraction) unless you also bind the 5th position to true, i.e. [?e :some/attr "some-value" ?tx true].

pheuter19:08:01

in my particular case i’m looking to use ?e in a subsequent :where clause to get a related entity, how can i know that i’m getting the entity associated with the latest tx?

pheuter19:08:32

basically, there are two entities, a and b. b has an attribute that’s a ref to a, and it’s possible to have multiple b entities that ref to the same a

pheuter19:08:44

given a, i’d like to get the latest transacted b that links to a

pheuter19:08:51

that’s the general problem

Ben Kamphaus19:08:11

I’m not sure I follow what you’re asking as it looks like your concern is covered. The where clause limits the results, so you only get the relation from entity to transaction constrained by the presence of that attribute and value, for the most recent transaction.

Ben Kamphaus19:08:18

:where [?b :some/ref ?a ?tx] aggregated on the max value of the ?tx returns that datom. I guess you could get a set of matches in the event that there are multiple b entities which assert :some/ref a-id in one transaction.

Ben Kamphaus19:08:20

Oh, grouping behavior.

pheuter19:08:28

what i need is something like: :where [?e :some/attr "some-value" (max ?tx-id)]

pheuter19:08:51

where ?e would represent the entity associated with the latest tx

pheuter19:08:40

right, it seems like a workaround now is to manually build a map of tx-ids to entity-ids, find the max, then get the entity-id associated with it

Ben Kamphaus19:08:09

if the grouping behavior runs afoul of what you need, as is the case here (just tested it), I would just return the ?e ?tx tuple and apply max-key second on the result.

pheuter19:08:37

that seems like what i need

Ben Kamphaus19:08:57

the aggregation in query always realizes the intermediate set in memory on the peer anyways, so it doesn’t save you any performance cost to avoid the seq manipulation, really.

Ben Kamphaus19:08:35

sorry for the initial detour, forgot that ?e (max ?tx) only shows you max ?tx grouped by e, not what you wanted in this case. It's also possible to use a subquery, if you're stuck with the REST API or don't have clojure manipulations and don't want to realize the whole thing in a query, but if you're in clojure I'd stick with a single query and a sequence manipulation.
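The tuples-plus-max-key version could look like this, with :b/ref-to-a standing in for the real ref attribute:

(let [tuples (d/q '[:find ?b ?tx
                    :in $ ?a
                    :where [?b :b/ref-to-a ?a ?tx]]
                  (d/db conn) a-id)]
  (when (seq tuples)
    ;; pick the ?b from the tuple with the highest ?tx
    (first (apply max-key second tuples))))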

vinnyataide19:08:14

hello! how do I recover from a ConnectException? I wanted to create a db if there's none, so I wrote the following:

(try
  (def conn (d/connect uri))
  (catch ConnectException e (d/create-database uri)))

vinnyataide19:08:36

(:import (java.net ConnectException)))

vinnyataide19:08:13

Show: Clojure Java REPL Tooling Duplicates All  (3 frames hidden)

3. Unhandled java.util.concurrent.ExecutionException
2. Caused by org.h2.jdbc.JdbcSQLException
1. Caused by java.net.ConnectException
   Connection refused

pheuter19:08:00

@bkamphaus: thanks for the patience and help, makes a lot of sense now 🙂

vinnyataide19:08:26

oh I guess I need to start the transactor
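Once the transactor is up, the try/catch usually isn't needed: d/create-database is safe to call when the database already exists (it just returns false), so a small helper like this (name made up) covers both cases:

(defn ensure-conn [uri]
  (d/create-database uri)  ; no-op (returns false) when the db already exists
  (d/connect uri))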

kenny20:08:30

I am having trouble connecting a third peer. We currently have a license for 10 peers. Two of the peers are being used by a staging and production server. I want to query the Datomic instance running in the cloud from the REPL. However, when I try and connect to my transactor running in the cloud from the REPL, I get clojure.lang.ExceptionInfo: Error communicating with HOST ... or ALT_HOST ... on PORT 4334. Both the staging and production servers are able to connect to the Datomic instance. Do I need to set a username and password locally somewhere or change a local license key?

kenny20:08:42

I also cannot connect to the database from the shell on a server running in the cloud

bhagany20:08:50

in cases like these, network configuration is always my first stop. have you checked that the transactor is reachable, ports are open, etc?

kenny20:08:14

It is reachable. Both my staging and production servers can connect to it

bhagany20:08:30

but is it reachable from the machine you're on?

bhagany20:08:48

I mean, obviously you can't connect with the peer library. But can you, say, telnet to it?

kenny20:08:08

telnet ip 4334
Trying ip...
Connected to ip.
Escape character is '^]'.

bhagany20:08:34

alright, to be honest, that just about exhausts my advice. it's always the network for me 🙂

kenny20:08:14

It would be nice if a different exception were thrown depending on whether it was a peer problem or a network problem

kenny20:08:27

Shutting down my staging server allows me to connect to the db from the REPL.

kenny20:08:21

Is it possible there is an issue with the license?

bhagany20:08:01

that exception would really surprise me, if that's the case

bhagany20:08:11

not sure I can explain what you're seeing any other way, though

kenny20:08:18

Hmm.. Will someone from the Datomic team see these messages or should I email them directly?

bhagany20:08:15

they're usually on here, but if you're paying, I'm pretty sure that comes with direct support

jaret21:08:20

@kenny: Sent you a private message so we can get a support case going 🙂