#datomic
2019-09-10
favila13:09:33

I notice the client api version of d/datoms seems to ignore the fourth (tx) part of :components. Is this a known issue? Bug or by design?
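
A minimal sketch of the kind of call in question, assuming a client db value and hypothetical entity/attribute/value/tx placeholders:

(require '[datomic.client.api :as d])

;; On the peer, the fourth (tx) component narrows the result further;
;; the client appears to consider only the first three components.
(let [e  12345678            ;; hypothetical entity id
      tx 13194139534312]     ;; hypothetical tx id
  (d/datoms db {:index      :eavt
                :components [e :myattr "some value" tx]}))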

marshall17:09:35

Yes, known. By design, I believe: if you have all of E, A, V, and T, you already have the datom, so you don’t need index access

favila18:09:10

It differs from the peer api d/datoms

favila18:09:33

I would at least expect it to be mentioned that only three components in the vector are inspected

favila18:09:11

You make a good point though. The only additional possible information datoms could give you is whether it was an assertion or retraction

marshall18:09:37

and if you have lots of datoms there you need to get to, you’re doing it wrong

favila13:09:10

Also is there a better way to get all datoms matching pattern on client api other than repeatedly adjusting offset? This feels like O(n^2). If client had seek-datoms or a “starting at” datoms argument, one could use the last seen datoms as the start to the next chunk
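
A sketch of the offset-paging pattern being described (hypothetical helper name, arg-map, and page size):

(require '[datomic.client.api :as d])

;; Hypothetical helper: page through a client d/datoms call by bumping :offset.
;; If each call re-reads everything before the offset (as the thread later
;; suggests), total work grows quadratically with the size of the result.
(defn datoms-by-offset [db arg-map page-size]
  (->> (iterate #(+ % page-size) 0)
       (map #(d/datoms db (assoc arg-map :offset % :limit page-size)))
       (take-while seq)
       (apply concat)))

;; usage, e.g.: (datoms-by-offset db {:index :avet :components [:myattr]} 1000)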

favila13:09:37

I’m struggling with workloads that are too large for a single query, where I would normally use the peer’s d/datoms lazily to produce an intermediate chunk or aggregate

Joe Lane14:09:23

Do you have an example of "datoms matching pattern"?

favila14:09:59

The pattern in the :components argument

favila14:09:00

E.g. the pattern {:index :avet :components [:myattr]} means every datom whose attr is :myattr

favila14:09:00

Intermediate sets may be too large for a single query

marshall15:09:28

@U09R86PA4 How big is the total DB and how big are the results you’re looking for? Also, how frequently are you running this query (or ones like it)?

favila15:09:32

They are run infrequently (offline or batch jobs)

favila15:09:03

Use cases vary, but they follow the pattern of either aggregating as you go (where the aggregation is much smaller than the input), or preparing subsets of the input for the same query rerun many times

favila15:09:42

Example I ran in to today was counting the unique values on a non-indexed attr

favila15:09:45

On a peer this is map :v distinct over d/datoms :aevt :myattr
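
On the peer that looks roughly like this (a sketch with datomic.api and a stand-in attribute :myattr):

(require '[datomic.api :as d])

;; Lazily walk the AEVT index for :myattr and count distinct values.
;; Memory is bounded by the number of distinct values, not by the
;; number of input datoms.
(->> (d/datoms db :aevt :myattr)
     (map :v)
     distinct
     count)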

favila15:09:50

On a client, the throughput decayed as the offset increased

favila15:09:58

I gave up eventually

favila15:09:20

When I ran it on a peer, the result took a few minutes but used bounded memory, and the result set size was 60 out of 120 million input datoms (cold instance, no valcache or memcached)

marshall17:09:19

are you running the peer server with the same memory settings as you did for the peer?

favila17:09:30

Peer server is actually a little bit bigger and has valcache

favila17:09:56

These are queries I couldn’t run on even a really large peer. I don’t fault peer server for not being able to handle it naively. I just can’t use my usual d/datoms workaround for controlling the size of intermediate results by being lazy

marshall17:09:06

Hrm. I don’t quite understand what the 4th component has to do with it. Do you have large #s of datoms with the same AVE that only differ in T?

favila17:09:18

From a correctness perspective, I will get results where not all components match

favila17:09:43

This is unrelated, which is why it’s in a separate message/thread

marshall17:09:13

what chunk size are you using for your datoms call?

favila17:09:36

I discovered it while doing thought experiments with a client api that had a seek-datoms; I could use the last datom seen to construct the start of the next chunk instead of seeking over the whole result again by the offset (which seems to be how it is behaving; is that actually how the client’s datoms is implemented?)

marshall17:09:15

ok. i think i understand

marshall17:09:20

you should use the async API

marshall17:09:33

it provides chunked results

marshall17:09:35

on a channel

marshall17:09:11

that should allow you to lazily iterate your results

marshall17:09:21

without repeatedly calling datoms
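
A sketch of that approach with the async client api (attribute name and chunk size are illustrative; anomaly/error handling omitted):

(require '[datomic.client.api.async :as d]
         '[clojure.core.async :as a])

;; d/datoms returns a core.async channel that delivers chunks of datoms
;; and closes when the range is exhausted.
(let [ch (d/datoms db {:index :aevt :components [:myattr] :chunk 10000})]
  (loop [seen (transient #{})]
    (if-some [chunk (a/<!! ch)]
      (recur (reduce conj! seen (map :v chunk)))
      (count (persistent! seen)))))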

favila17:09:25

does the server’s impl of client d/datoms have the same time complexity as peer (->> (apply d/datoms index components) (drop offset) (take limit)) or is it more efficient than that?

marshall17:09:55

i believe it is more efficient if you’re using chunked async client

favila17:09:56

It feels O(n^2), but maybe I am running into unrelated externalities

marshall17:09:07

i’m not sure with the sync impl

favila17:09:32

Why would they be different?

marshall17:09:44

since there’s no “next” chunk in sync impl

marshall17:09:51

you’re just getting limit

favila17:09:35

Ah so there may be some cursor-like state in there

favila17:09:02

Ok I’ll try async

marshall17:09:23

there’s a subtlety here

marshall17:09:29

you can do it with the sync api

marshall17:09:33

i just talked to Stu

marshall17:09:06

you should use it just like you are, but set :limit -1

marshall17:09:24

the iterable that is returned is lazy

marshall17:09:34

and you don’t need repeated calls to datoms with offsets

marshall17:09:47

so you can map over it, or call seq on it, or whatever you want to do
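
So the earlier counting example becomes, on the sync client (a sketch, with :myattr as a stand-in attribute):

(require '[datomic.client.api :as d])

;; :limit -1 asks for the whole range; the returned iterable is realized
;; lazily, so it can be consumed much like the peer's d/datoms without
;; any offset paging.
(->> (d/datoms db {:index :aevt :components [:myattr] :limit -1})
     (map :v)
     distinct
     count)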

favila17:09:30

Ah ok. Why no “chunk” knob since the same considerations apply?

marshall17:09:30

multiple calls to d/datoms are definitely re-performing the work on every call

marshall17:09:46

further down: “Synchronous API functions are designed for convenience. They return a single collection or iterable and do not expose chunks directly. The chunk size argument is nevertheless available and relevant for performance tuning.”

marshall17:09:57

it’s actually there, probably an oversight in the api docs if it’s not listed there

favila18:09:20

Sync api namespace docs do not mention :chunk

favila18:09:52

Ok, I was expecting less magic; I should have tried the dumb thing of :limit -1?

favila18:09:07

Well gonna try it now

marshall18:09:12

🙂 sorry for the confusion

marshall18:09:38

yes, limit -1 and use the returned iterable however you like

marshall18:09:53

and you can configure the chunk size for perf tuning if you desire
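
For example, continuing the sync sketch above (the chunk value is arbitrary):

(d/datoms db {:index :aevt :components [:myattr] :limit -1 :chunk 10000})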

favila18:09:29

Yeah, I thought async’s chunking was just doing offset adjustment for you, like a normal REST API would

marshall18:09:18

ah. no, it’s maintaining an iterator between the client and server

favila18:09:30

Ok it works! Thanks! Large :chunk makes a huge difference in sync api and is not ignored, so I consider that a doc bug

favila18:09:24

I guess a pipeline/prefetch option is out of the question? 😊

marshall18:09:06

i think that would be a good candidate for a feature request on the feature request portal

marshall18:09:10

and i will look into fixing the api docs

favila21:09:49

I can’t find this doc link you shared on the on-prem section of the datomic docs website

marshall21:09:56

Relevant info there

marshall21:09:10

But you’re right, there’s no exactly analogous page

Lone Ranger14:09:54

is there a Datomic certification path anywhere? 😮

colinkahn15:09:58

Is there documentation around using <, >, <=, >= with letters? Like [(< ?title "C")]. I’m seeing that < is exclusive whereas > is inclusive, and wondering if that’s expected.
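
For reference, the shape of query being asked about (a sketch on the peer api, with a hypothetical :movie/title attribute binding ?title):

(require '[datomic.api :as d])

(d/q '[:find ?title
       :where
       [_ :movie/title ?title]
       [(< ?title "C")]]
     db)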