Fork me on GitHub
#xtdb
<
2021-03-05
>
nivekuil00:03:45

I'd like to keep a track of the count of specific AV pairs, specifically for relationship cardinality. I know the crux indices already track attribute cardinality; does it make sense to also track relationship cardinality there (looks like neo4j has something like that https://neo4j.com/developer/kb/fast-counts-using-the-count-store/#_relationship_counts) or should that be done externally?

refset12:03:07

To potentially improve the join order calculations in future we have been looking at storing statistical histograms / bloom filters / hyperloglog etc. - these would allow the query planner to understand the "selectivity" (right now it only accounts for attribute cardinality, like you say). However, are you looking to track exact relationship counts?

refset12:03:27

You could already add a transaction function to track such relationship counts in userspace - have you already discounted that for performance reasons?

refset12:03:14

Some reflections, looking at the Neo4j docs: 1. their counts are exposed inside of the queries, which is pretty handy 2. their main "relationship count" is similar to Crux's crux.api/attribute-stats except Crux will be counting across all of time so probably can't be relied on in the same way (if you are making any temporal corrections or using :crux.tx/delete) 3. "Relationship counts to/from a node with a single label" - even if Crux had some basic attribute-level schema model, I think in order to replicate this functionality you would also have to bake in a concept of "label"/"type"

nivekuil17:03:15

yeah, I try to avoid transaction functions in general and using one to bump a counter definitely feels egregious, as there are much more efficient ways to model that (even keeping temporality, ala time series).

nivekuil17:03:41

my reference for this functionality is actually facebook's TAO. it has an assoc￱_￱count function which is how you get like counts etc.

nivekuil17:03:12

hyperloglog looks interesting. log(log(n)) space complexity? wow

nivekuil17:03:10

unrelated question, how does crux datalog do not-joins right now? is it naive or something clever with a bloom filter?

refset21:03:47

ha, I suppose like counts can completely get away with not being at all consistent 🙂

refset21:03:16

not-joins work the same way as all rules - they are eager sub-queries

refset21:03:15

(note that I'm not an expert on query.clj)

nivekuil21:03:26

yeah, the query/index stuff is definitely the fancy part of crux I won't even try to grok :) also I will note that I think (filterv (complement ... is faster than (vec (remove..

nivekuil21:03:12

though, looks totally inconsequential for what this is doing

refset23:03:52

You may well be right 🙂 I suppose that bit only happens during query planning, so yeah it's not on the real query hot path. If you'd ever be interested in a overview of the internals it could be fun to give a tour to a small group some time...just let me know!

👍 3
nivekuil00:03:23

I guess this would rely on being able to teach crux about domain attributes in general, which would also enable stuff like disabling indexing of a certain attribute -- another valuable feature to have (I think wrapping an attribute in a map isn't quite as good, since the attr of the map itself is still indexed?)

refset12:03:32

If Crux simply maintained counts on all AV combinations, not just relationship / "reference type", then I don't think a schema would be needed still. > I think wrapping an attribute in a map isn't quite as good, since the attr of the map itself is still indexed? That's true, the hashed value of the map is still indexed. Crux being schema-aware would have some pros, for sure, and maybe there's some ~obvious way to extend Crux to make it optionally possible in future 🙂

nivekuil17:03:57

well, you could just say something like an attribute with an underscore doesn't get indexed which would be a simple, if slightly unergonomic approach (you'd have to deal with the underscore in code)

nivekuil17:03:10

counters on all AV combinations definitely sounds too expensive, and would another dimension of concern (do I have any lurking high cardinality stuff that'll go quadratic in prod?) I think it's not quite a schema to be able to feed some domain info to crux, or at least schema-or-not-schema shouldn't really be a dichotomy, but degrees of constraint. Giving some info to the indexer wouldn't impose any sort of constraint on how the end user uses crux, right? all the schema-on-read sales pitches would still be applicable

refset18:03:30

(I just had a skim back through the channel, sorry for the lateness!) > Giving some info to the indexer wouldn't impose any sort of constraint on how the end user uses crux I don't think it would impose constraint, but it might encourage patterns that aren't robust, I'm not sure. I could imagine a predicate such as`[(av-count ?a ?v) ?c]` that magically works faster for things the indexer was given schema hints for...it might be quite useful but are there any downsides?