xtdb 2021-03-19 | Slack Archive

nivekuil08:03:06

so, to disable indexing on some attributes, the way I'm thinking about hacking it in is to store metadata alongside the doc, like {:noindex [:some-attr]}, and then in crux.tx/index-tx-events add some logic to filter those keys out of the docs returned from fetch-docs, so that to the indexer it looks like the doc was missing those keys all along. any obvious concerns with this approach?
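
The filtering step described above could be sketched roughly as follows. This is a minimal illustration, not Crux code: `strip-noindex` is a hypothetical helper, and it assumes the metadata lives under a `:noindex` key holding a collection of attribute keywords, and that the `:noindex` entry itself should also be hidden from the indexer.

```clojure
;; Hypothetical helper, not part of Crux. Assumes the metadata key is
;; :noindex and that its value is a collection of attribute keywords.
(defn strip-noindex
  "Returns doc without the :noindex entry and without the attributes it lists."
  [doc]
  (apply dissoc doc :noindex (:noindex doc)))

(strip-noindex {:crux.db/id :foo
                :name "Foo"
                :blob "large value"
                :noindex [:blob]})
;; => {:crux.db/id :foo, :name "Foo"}
```

A doc without a `:noindex` entry passes through unchanged, so a helper like this could be applied to every doc coming out of fetch-docs.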

refset09:03:29

The main concerns from me are:
1. Unsupported internal APIs are subject to change, use at your own risk 🙂
2. fetch-docs also happens during pull (which was called eql/project before ~last week) so you possibly want to hack that too, to keep the query engine behaviour feeling coherent: https://github.com/juxt/crux/blob/master/crux-core/src/crux/pull.clj#L20
3. Adding `{:noindex [:some-attr]}` to individual documents repeatedly doesn't sound like it will save all that much space across the system in total (once you factor in the increases to doc-store disk and cache usage) - using a central schema document to store this might be a better move

nivekuil09:03:52

on the topic of space usage, I was thinking that we already store a per-document schema, namely the attribute names. so it does seem like a lot of needless duplication but it doesn't seem too far from the status quo. I was also thinking that I could pull out the "schema" parts of the doc (the attr names, any other metadata) from the hash and values, and just encode the vals of the map in its own column (specific to a doc store with that concept of course) and maybe that would be much more amenable to compression

nivekuil09:03:08

thanks for the tip about pull :)

👌 4

nivekuil09:03:27

that is to say, the central schema would be beneficial for other use cases too and it would also take me really far astray from crux, and I am sympathetic to the concerns around global, categorical schemas -- I think it'd be a premature optimization that I'd leave to the pros

refset09:03:57

a central schema for this would certainly require a lot more design & engineering work to get "right" (migration, temporal semantics etc.)

jarohen09:03:21

When we last looked into reducing the space usage of attributes in the query indices, we did try using a dictionary for the attributes - it turned out to have a negligible effect on the size of the indices after Rocks compaction, and increased query times 🙂

nivekuil09:03:56

I think refset is referring to space usage in the doc store? the metadata I'm thinking about adding would be transparent to the indexes.. I'm thinking 3 columns in scylla, each doc being represented with the hash as primary key, metadata (including key names) as clustering key, and the value.. I think it would end up with docs of the same "schema" sorted within each scylla sstable, but not totally sure how compaction would leave it

✔️ 4

Tuomas12:03:02

A question about lucene-text-search. I can do wildcard matching with

(crux/q
   (crux/db node)
   '{:find [(eql/project ?e [*])]
     :where [[(lucene-text-search "person\\/first-name:mer*")
              [[?e]]]]
     :limit 5})

And variable binding with

(crux/q
   (crux/db node)
   '{:find [(eql/project ?e [*])]
     :in [name]
     :where [[(lucene-text-search "person\\/first-name:%s" name)
              [[?e]]]]
     :limit 5} "merja")

But how can I do both at the same time? My best attempts:

(crux/q
   (crux/db node)
   '{:find [(eql/project ?e [*])]
     :in [name]
     :where [[(lucene-text-search "person\\/first-name:%s*" name)
              [[?e]]]]
     :limit 5} "merja")
; Execution error (ParseException) at org.apache.lucene.queryparser.classic.QueryParserBase/getWildcardQuery (QueryParserBase.java:700).
; '*' or '?' not allowed as first character in WildcardQuery

(crux/q
   (crux/db node)
   '{:find [(eql/project ?e [*])]
     :in [name]
     :where [[(lucene-text-search "person\\/first-name:%s%s" name "*")
              [[?e]]]]
     :limit 5} "mer")
; Execution error (MissingFormatArgumentException) at java.util.Formatter/format (Formatter.java:2672).
; Format specifier '%s'

I'm struggling with using multiple fields + bindings

(crux/q
   (crux/db node)
   '{:find [(eql/project ?e [*])]
     :in [first last]
     :where [[(lucene-text-search "person\\/first-name:%s OR person\\/last-name:%s" first last)
              [[?e]]]]
     :limit 5} "first" "last")

In most cases I could also format the search string myself. Is there any reason not to?
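
Formatting the search string yourself could look something like the sketch below. `lucene-query` is a hypothetical helper name, not part of crux-lucene; it only builds the query map as data, with the already-formatted Lucene string embedded as a literal, so no `%s` bindings are involved. Note that it does not escape Lucene special characters in the input.

```clojure
;; Hypothetical helper: builds the same query shape as the examples above,
;; with the Lucene string formatted by the caller instead of via bindings.
(defn lucene-query
  [lucene-string]
  {:find '[(eql/project ?e [*])]
   :where [[(list 'lucene-text-search lucene-string) '[[?e]]]]
   :limit 5})

;; Wildcard plus a runtime value:
(lucene-query (format "person\\/first-name:%s*" "mer"))

;; Multiple fields:
(lucene-query (format "person\\/first-name:%s OR person\\/last-name:%s"
                      "first" "last"))

;; The resulting map would then be passed to the query call, e.g.
;; (crux/q (crux/db node) (lucene-query ...))
```

The first call produces the same map as the literal wildcard query at the top of this message.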

refset13:03:55

Ah, so a prefix wildcard search requires an extra step when creating the QueryParser, i.e. https://github.com/juxt/crux/pull/1431/files#diff-8f1e93f353667e4c2b3076093a53a0ad841fc14e883445e5e41f02b3bed38facR20 - right now you would really have to create your own custom ns that leans on crux-lucene to do this (see that PR for an example) but as it happens we're currently reviewing how best to split the module into smaller pieces that can be configured & composed more easily. We should have this sorted out and released in the next 2-3 weeks if you can wait 🙂
> I'm struggling with using multiple fields + bindings
I'm not sure about this one though, I'll do some experimentation later this afternoon
> In most cases I could also format the search string myself. Is there any reason not to?
No concrete performance reasons I can think of 🙂

Tuomas05:03:49

Great, thanks. I actually get what I need right now by formatting the lucene search string myself, so there are no blockers for me.

☺️ 3

refset09:03:38

Hey again @Tuomas - I just wanted to say thanks for reporting your bindings issue! I've now fixed it ahead of a new release today 🙂 https://github.com/juxt/crux/commit/8d07a45deaee2b64f029ba2eadc6c3a23cb597ed

❤️ 3

mmer14:03:03

Hi, a simple question - I am using crux to track a document over time based on lines of text, each having an entry with its own crux entity id. Is it best to commit the whole set of entities in one go, if in the future only some lines of the document change and so only a few entities have a history? The reason I ask is that I am never sure how to indicate the db that represents just the latest set of data.

refset14:03:06

Hi 🙂
> Is it best to commit the whole set of entities in one go,
In Crux terms, do you mean a sequence of :crux.tx/put operations in a single transaction?

refset14:03:49

I don't see any issue with only some lines having history, although I'm not sure what you mean by "indicate". What's a realistic maximum for the number of lines per document? You could potentially model each modification to a line as a new entity, rather than updating existing line entities (e.g. via a linked list of references, or similar)

mmer16:03:00

I guess my question is better put as: when you query a crux db, does it only give you the latest entries if you do not specify anything to indicate you want history?

refset17:03:23

oh I see, yep that is the default behaviour. Queries won't ever return historical versions (documents) of an entity, unless you specify an explicit valid-time or transaction-time in the crux.api/db call
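
As a rough illustration of that default, here is a toy model in plain Clojure. It is not how Crux stores or resolves history: `line-history` and `as-of` are made-up names, and integers stand in for real timestamps. The idea is only that a db taken "now" resolves each entity to its latest version, while an explicit earlier valid-time resolves to whatever was current then.

```clojure
;; Toy model only: each version is a map with a :valid-time (an integer here)
;; and a :doc. This is not how Crux represents history internally.
(def line-history
  [{:valid-time 1 :doc {:crux.db/id :line-1 :text "first draft"}}
   {:valid-time 5 :doc {:crux.db/id :line-1 :text "second draft"}}])

(defn as-of
  "Returns the doc of the latest version whose :valid-time is <= t, or nil."
  [history t]
  (->> history
       (filter #(<= (:valid-time %) t))
       (sort-by :valid-time)
       last
       :doc))

(as-of line-history 10) ;; => {:crux.db/id :line-1, :text "second draft"}
(as-of line-history 3)  ;; => {:crux.db/id :line-1, :text "first draft"}
(as-of line-history 0)  ;; => nil
```

In Crux itself the corresponding choice is made when obtaining the db value, i.e. `(crux/db node)` for the latest view versus passing an explicit valid-time to the crux.api/db call.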

mmer17:03:14

thanks
