#datomic
2018-08-15
Desmond00:08:30

@lockdown- any ETL tools that you recommend?

Desmond00:08:05

and any strategies for how to design a sql schema that simplifies the ETL

Desmond00:08:59

i haven't used google's natural language service but i think for our first iteration we probably won't need a very sophisticated schema. we'll probably just be going through one column.

johnj00:08:37

Don't know of any tools for this, very new to Datomic. Can Datomic backup/export to edn? If it's simple enough, why not just do it in Clojure? Convert the part of the edn you need to CSV and load/import the data into the correct columns in the SQL server.

johnj00:08:07

I would just use a query to dump the data you need to edn
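
A minimal sketch of that query-and-dump approach, assuming the peer API and clojure.data.csv; the connection URI, attributes, and file name are hypothetical:

```
;; query Datomic, then write the result set out as CSV for a separate import step
(require '[datomic.api :as d]
         '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(let [conn (d/connect "datomic:dev://localhost:4334/my-db")
      rows (d/q '[:find ?email ?name
                  :where
                  [?e :user/email ?email]
                  [?e :user/name ?name]]
                (d/db conn))]
  (with-open [w (io/writer "export.csv")]
    (csv/write-csv w (cons ["email" "name"] rows))))
```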

Chris Bidler00:08:55

It would be even simpler, probably, to just query your datomic db for the data you want in a batch, and emit rows directly into the SQL datastore via yesql or korma or whatever the current hotness is in clj SQL libraries. If your data is “finished”, that would be the end of it (one big batch ETL job), but if you want to maintain a living datastore you could also add a transaction fn to your db that writes each new (relevant) transacted fact out to your gcs database using the same method
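
Or, skipping the intermediate file entirely, a sketch of the direct batch approach described above, assuming the peer API and clojure.java.jdbc; the db-spec, table, and attribute names are hypothetical:

```
;; query Datomic in one batch and insert the rows straight into SQL
(require '[datomic.api :as d]
         '[clojure.java.jdbc :as jdbc])

(let [conn    (d/connect "datomic:dev://localhost:4334/my-db")
      rows    (d/q '[:find ?email ?text
                     :where
                     [?e :feedback/email ?email]
                     [?e :feedback/text ?text]]
                   (d/db conn))
      db-spec {:dbtype "postgresql" :dbname "warehouse"
               :user "etl" :password "secret"}]
  (jdbc/insert-multi! db-spec :feedback
                      [:email :text]
                      (map vec rows)))
```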

Desmond19:08:50

sounds good. i'll check out those sql libraries. thank you!

Chris Bidler00:08:27

No need to abandon your datomic db just to also have your data elsewhere (JSON data lake for BI, SQL, etc.) :)

👍 4
Alex Miller (Clojure team)12:08:18

Best to ping @U05120CBV and @U1QJACBUM for that, I’m not on Datomic team

henrik12:08:01

Sorry, that’s because I watched the REPL-driven development talk yesterday 🙂 You were declared the Grand Master of Documentation.

😁 4
jaret13:08:51

Thanks @U06B8J0AJ we’ll get on these

👍 4
marshall14:08:03

Fixed. Thanks for letting us know.

👍 4
steveb8n10:08:03

Question: what do you use for schema migrations? I’ve been using Conformity but I wonder if it’s worth the effort. Since schema txns are idempotent, I could just stop using it and re-transact on every restart. Does anyone have other benefits they can see from using Conformity or other migration strategies?

val_waeselynck13:08:32

> Since schema txns are idempotent, I could just stop using it and re-transact on every restart
That's my default approach as well, but keep in mind that you may also need to migrate data in addition to schema (e.g. populating a new attribute with a default value), and more generally to run migrations that are not idempotent. Those are the reason I still use something like Conformity (see also Datofu: https://github.com/vvvvalvalval/datofu#managing-data-schema-evolutions)
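
A minimal sketch of the "re-transact on every restart" approach, assuming the peer API; the attribute is illustrative:

```
;; transacting the same schema attributes again is a no-op,
;; so this is safe to run on every application start
(require '[datomic.api :as d])

(def schema
  [{:db/ident       :user/email
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db/unique      :db.unique/identity}])

(defn ensure-schema! [conn]
  @(d/transact conn schema))
```

Non-idempotent steps (back-filling a new attribute, renames, bulk rewrites) are the part that still benefits from Conformity-style bookkeeping.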

eoliphant11:08:00

I've used Conformity myself on quite a few projects. Still sort of having the same debate lol. I know part of it is just "psychological": having used stuff like Flyway and Liquibase to manage schema for what seems like decades, I had sort of an existential fear of not using something to manage the process. For most of my cases, though, it hasn't really been strictly necessary, as we've always tried to follow the guidance on non-breaking schema growth. One thing that I haven't had to do so far, but that I think would make me more comfortable to manage with something like Conformity, is a true "migration" where you need to make some bulk data update, say doing some transformations to take advantage of or pre-populate some new schema attributes. Having said that, it's a bit of a moot point at the moment for us as we're moving as much as possible to Cloud, and Conformity doesn't support the client API as of the last time I checked.

👍 4
tlima11:08:19

Is it possible to transact some data using a date in the past as the timestamp, so that I could use as-of to navigate that history?

tlima11:08:47

Could I just manipulate the :db/txInstant attribute?

eoliphant11:08:10

Yes, @t.augusto but there are limitations. I believe the date you set can't be older than anything currently in the db. There's a note about imports on the "best practices" page. Also, make sure that's what you really want to do. @val_waeselynck has an excellent blog post on the optimal use of tx time vs your "domain" time. https://vvvvalvalval.github.io/posts/2017-07-08-Datomic-this-is-not-the-history-youre-looking-for.html
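
A sketch of setting :db/txInstant explicitly, assuming the peer API; the domain attribute and instant are hypothetical, and the supplied instant must be later than every transaction already in the db and not in the future:

```
(require '[datomic.api :as d])

;; "datomic.tx" is the tempid of the transaction being created,
;; so asserting :db/txInstant on it overrides the default wall clock
(defn transact-at [conn tx-data instant]
  @(d/transact conn
     (conj (vec tx-data)
           {:db/id        "datomic.tx"
            :db/txInstant instant})))

;; (transact-at conn
;;              [{:order/id "A-1" :order/status :shipped}]
;;              #inst "2018-01-15T00:00:00.000-00:00")
```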

eoliphant11:08:27

@captaingrover check out http://www.onyxplatform.org/ We use it to stream transactions out of datomic and into stuff like elasticsearch, etc

tlima14:08:57

Is there a way to make the Transactor run some storage setup code, before starting (Cassandra keyspace/table creation, for instance), or I must ensure everything is in place before starting it?

stuarthalloway16:08:59

The latter. The great thing about On-Prem is you get to set up storage yourself, exactly the way you like it. The terrible thing about On-Prem is you have to set up storage yourself, exactly the way you like it. 🙂

👍 4
henrik17:08:39

What's the least unidiomatic way of creating an ordered set of references in Datomic?

val_waeselynck20:08:48

Some ideas here: https://github.com/vvvvalvalval/datofu#implementing-ordered-to-many-relationships-with-an-array-data-structure. You can also consider linked lists. Don't worry too much about being idiomatic, but you should consider the read and write patterns when choosing.
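
One possible sketch of the index-attribute (array-like) option, assuming the peer API; all attribute names are hypothetical:

```
(require '[datomic.api :as d])

;; each list cell is a component entity holding the reference plus a position
(def ordered-ref-schema
  [{:db/ident       :playlist/items
    :db/valueType   :db.type/ref
    :db/cardinality :db.cardinality/many
    :db/isComponent true}
   {:db/ident       :playlist.item/track
    :db/valueType   :db.type/ref
    :db/cardinality :db.cardinality/one}
   {:db/ident       :playlist.item/position
    :db/valueType   :db.type/long
    :db/cardinality :db.cardinality/one}])

;; read back in order by sorting on the position attribute
(defn items-in-order [db playlist-eid]
  (->> (d/pull db
               [{:playlist/items [:playlist.item/position
                                  {:playlist.item/track [:track/name]}]}]
               playlist-eid)
       :playlist/items
       (sort-by :playlist.item/position)))
```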

notid17:08:55

I’m recommending datomic for a use case that will have >10 billion datoms within 1 year. Is there an easy way to excise datoms, so long as they are not the most recent datoms for their entities? i.e., I want to delete data that’s older than 6 months, but not if the entity hasn’t been updated in the last 6 months.

johnj18:08:25

If you don't mind, why are you recommending datomic? What features do you need of it?

val_waeselynck20:08:31

Excision seems like an extreme solution for this problem. Have you considered moving some attributes to an auxiliary store?

eoliphant20:08:37

Yeah, one of datomic’s key value props is the history etc. If you’re anticipating excision at the outset, it might not be the best fit, or you may want datomic + something else.

notid21:08:34

Going back in time at an entity level is a key feature my client needs. It would preferably keep all history, but given that there is a bit of a ceiling on datoms, I’d like to understand ways of mitigating the problem of having 10 billion datoms within a year.

notid21:08:21

I have considered moving them to a different store, but at the very least, I’m talking about 1 billion entities within one year, with each entity having ~10 attributes

eoliphant23:08:33

Fair enough. I think the 10 billion thing is a matter of what they've tested out to. Perhaps some testing at your expected scale is in order

notid17:08:03

This is based on my understanding that 10 billion datoms is something of a soft limit in datomic

johnj18:08:02

It is a soft limit, but I understand performance will degrade greatly

johnj19:08:02

Forcing you to create more DBs, or move to a completely different DB

Joe Lane19:08:32

@U1ZP5SMA6 on-prem or cloud?

notid21:08:17

@U0CJ19XAM We would be open to either. We are on AWS now, but hope to move towards GCP in the next 2 years or so. As such, I imagined using on-prem.

eoliphant23:08:58

Also, dbs are "cheap" afaik. Can you potentially use them as a sort of partitioning? I think the limit is per db, not per storage.

henrik01:08:38

Unfortunately, it doesn't seem that you can join dbs in queries yet for Cloud.

notid14:08:20

Thanks for the help. There is a good chance that there are such partitioning schemes that can work for us.

eoliphant15:08:56

Ah hell @U06B8J0AJ really? I missed that. Going to have some scenarios where I'll need that soon

henrik15:08:19

That's what I understood. @U05120CBV?

marshall16:08:43

‘performance will degrade greatly’ is not accurate

marshall16:08:36

performance of what? writes? reads? the specific behaviors of “very large” databases will depend greatly on the data model and access patterns

marshall16:08:12

the “dbs are cheap, make more” advice is good, but only true for Cloud. If you’re using on-prem you should have one transactor per operational DB

marshall16:08:26

also, correct, there is currently no cross db join using the client

marshall17:08:09

depending on ‘how’ you want to use multiple DBs you may be fine to query the individual dbs separately, then join in your application

notid19:08:01

@U05120CBV thanks for some of that clarification. I’ll definitely see if it’s possible to shard into separate databases. I expect regular updates to ~50 million entities, each with about 10 attributes. My math shows us being at 10 billion datoms within a year, with a fairly static number of entities. The ability to go back in time is very desirable for my use case, but only for recent months. If the database can’t support that scale, I’d be willing to excise old data, so long as that data isn’t the most recent for the entity.
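
A rough consistency check of those numbers (the update rate is inferred, not stated in the thread): 50 million entities × 10 attributes is about 500 million datoms for one assertion of every value, so reaching 10 billion datoms in a year implies on the order of 20 additional datoms per entity-attribute over the year, i.e. roughly 10 update cycles once you count the paired retraction and assertion that each cardinality-one update adds to history.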

notid19:08:34

Do you have any recommendations on how to go about capacity planning given that use case?

marshall19:08:41

Don’t use excise for size control

marshall19:08:53

that’s not what it was designed for, and it will not work well for that use case

marshall19:08:21

broadly, I’d recommend sharding by time with a rolling window

marshall19:08:10

create a new db every X months and only write to it

marshall19:08:25

keep the ‘older’ db around long enough to support your queries against the older data

notid19:08:05

Perfect. Good call @U05120CBV. Appreciate the feedback

marshall19:08:29

if you “need” to handle entity continuity across the DBs you can write a tool that “moves” some subset of active entities from the old db into the new one

marshall19:08:59

if you can get away with just moving all writes to the new db and leaving the old db(s) there for read, you can probably get away with a single transactor

marshall19:08:15

the transactor-per-db rule is largely based on write-side behavior

marshall19:08:35

so if you can move all your writes to the new db, you’re usually pretty OK to have one or two older read-only dbs in the same system
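
A rough sketch of the rolling-window idea for On-Prem, assuming the peer API; the base URI and naming scheme are made up:

```
(require '[datomic.api :as d])

(defn window-uri
  "One database per time window, e.g. .../events-2018-08."
  [base-uri year month]
  (format "%s/events-%d-%02d" base-uri year month))

(defn current-write-conn
  "All writes go to the newest window; older windows stay read-only."
  [base-uri year month]
  (let [uri (window-uri base-uri year month)]
    (d/create-database uri)   ; returns false (no-op) if it already exists
    (d/connect uri)))

;; queries against older windows connect to those dbs separately and the
;; results are combined in the application, as described above
```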

tlima18:08:19

When using a Cassandra cluster as the storage layer, how should the Console app be configured? Tried to add -Ddatomic.cassandraClusterCallback=com.example.MyClass.myStaticMethod to the DATOMIC_JAVA_OPTS envvar, to no avail… 😕

tlima19:08:52

@U072WS7PE or @U05120CBV, could any of you guys help here?

marshall19:08:21

you should be able to launch the console against Cassandra the same way you do against any storage

marshall19:08:28

use the cassandra storage URI

tlima19:08:28

You mean this URI: datomic:cass://<cassandra-server-host>:<cassandra-port>/<cassandra-table>?user=<usr>&password=<pwd>&ssl=<state>? If so, this is what I’m using. The thing is, <cassandra-server-host> is a “cluster gateway”, if we can say so, which has its own API to retrieve the actual cluster nodes. This is why I’d like to use the same cluster callback I’m already using with the transactor. Is it possible?

marshall20:08:25

I don’t believe you can specify a cluster callback using the console

marshall20:08:30

i’ll have to look into it

tlima20:08:26

Ok, @U05120CBV. May I ping you again, later this week?

tlima18:08:11

Hi, @U05120CBV. Any updates here?

tlima21:08:06

One more thing: does the cluster callback get used by commands like backup-db and restore-db?

tlima11:08:05

@U072WS7PE or @U05120CBV, could any of you guys help me?

Desmond19:08:44

how can i query datomic from nodejs? according to the peer language support docs https://docs.datomic.com/on-prem/languages.html i should call the REST api. according to the REST api docs https://docs.datomic.com/on-prem/rest.html, however, the REST server is no longer supported. what's the move here?

stuarthalloway20:08:31

@captaingrover your best options right now are:

stuarthalloway20:08:53

in Cloud: expose a lambda or REST service via API Gateway

stuarthalloway20:08:27

On-Prem: expose your own REST service from a peer

Desmond03:08:04

@U072WS7PE hi Stu, yeah i'm using On-Prem. When you say your own REST service you mean wrap some http around the client library as opposed to using the bin/rest standalone service?

stuarthalloway11:08:28

@captaingrover around the peer library would be better, no process hop
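
A minimal sketch of an HTTP endpoint wrapped around the peer library, assuming ring/ring-jetty-adapter; the route, URI, and lack of auth/validation are all simplifications:

```
(require '[datomic.api :as d]
         '[ring.adapter.jetty :as jetty]
         '[clojure.edn :as edn])

(def conn (d/connect "datomic:dev://localhost:4334/my-db"))

;; accepts {:query <datalog> :args [...]} as EDN and runs it against the
;; current db value; a real service would restrict what can be queried
(defn handler [{:keys [request-method uri body]}]
  (if (= [:post "/q"] [request-method uri])
    (let [{:keys [query args]} (edn/read-string (slurp body))]
      {:status  200
       :headers {"Content-Type" "application/edn"}
       :body    (pr-str (apply d/q query (d/db conn) args))})
    {:status 404 :body "not found"}))

;; (jetty/run-jetty handler {:port 8080 :join? false})
```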

cap10morgan21:08:11

Does the automated AWS setup still not support custom VPCs (i.e. any VPCs other than the default)?

ghadi21:08:59

like classic VPCs @cap10morgan?

cap10morgan21:08:17

no, not classic, just VPCs other than the default

eoliphant23:08:08

Nope. It's still self-contained. @cap10morgan we've just gone to using the vpc endpoint approach into our existing vpcs, working through some hacks to deal with the fact that endpoint ips are a bit of AWS magic. They're fine for access from the client vpc. But if you're, say, using a vpn for dev access to the vpc, those ips aren't directly accessible, even though they're in the accessible range.

cap10morgan23:08:29

I’m... asking about something much simpler than all that. I just want the ability to deploy the transactors into a non-default VPC. I’m not familiar with the VPC endpoint approach, though.

eoliphant23:08:23

Yeah, you can't at this point. The CF scripts include the vpc creation. We even looked at hacking them to do what you describe, but it wasn't worth it from an effort or support-impact perspective. With the vpc endpoint, you'll install datomic and let it create its vpc and associated other bits. Subsequent to that, you create an endpoint ip address for the datomic system in your existing vpc where your client apps are located. There's a description, and another supporting CF script, in the docs under something like operations -> client applications