
Hey all. Just got an interesting error while putting some data into an XTDB instance (well, technically Crux 1.17), where I get the following message:

Exception in thread "crux-polling-tx-consumer" java.lang.IllegalStateException: missing docs:
Followed by a ton of doc IDs. Has anybody seen it before? Is there anything I should do besides restart the node?


Hey @UEC8W94AE ! Which tx-log & doc-store combo is this?


Reports of S3 doc-store flakiness led us to make some changes in 1.18.1, which should help if you're able to upgrade


Hey, sorry for the delay. It's Kafka for both, with a local Rocks document store layered on top of the Kafka doc store. Restarting the node fixed it

🙂 2

However, it did block processing of the tx log until I restarted it. I'm stuck on 1.17 until I get time to update crux-geo to 1.18


But I did have multiple threads going so the fix looks promising 🙂

👍 2

Ah, cool, well hopefully your upgrade to crux-geo isn't too rough when you get to it 🤞 Blocking the tx processing is an intentional consistency safety feature, if a bit crude.


hi--about a year ago I posted a deftx macro that tried to make transaction functions more convenient to work with. It was half-baked then, but I've gotten back to it and it's now about 3/5 baked, or at least useful enough for me to start using throughout my codebase. Sharing it here in case it's useful or interesting for others:
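(For anyone who can't follow the link: the general shape of such a macro, as a hypothetical minimal sketch and not the actual shared implementation — the name deftx and everything in the body are illustrative — might look like this, defining a transaction function as a named document that can be installed with a put and invoked with :crux.tx/fn:)

```clojure
(ns my.app.tx
  (:require [crux.api :as crux]))

;; Hypothetical minimal sketch -- NOT the macro from the original post.
;; `deftx` defines a var holding the tx-fn document, so it can later be
;; installed with a :crux.tx/put and invoked with :crux.tx/fn.
(defmacro deftx
  "Define a transaction function as a named document.
  `args` must start with the ctx argument that Crux passes in."
  [name args & body]
  `(def ~name
     {:crux.db/id ~(keyword (str *ns*) (str name))
      :crux.db/fn '(fn ~args ~@body)}))

(deftx set-age
  [ctx eid new-age]
  (let [db (crux.api/db ctx)
        entity (crux.api/entity db eid)]
    [[:crux.tx/put (assoc entity :age new-age)]]))

;; Install once, then invoke by id:
(comment
  (crux/submit-tx node [[:crux.tx/put set-age]])
  (crux/submit-tx node [[:crux.tx/fn (:crux.db/id set-age) :person/fred 42]]))
```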

👍 6

Hey, I remember! But it's a very hazy memory 🙂 I am definitely going to give this a go tomorrow. Thanks for sharing 🙏

🙂 2

I should ask as I'm posting this--for anyone else using tx functions heavily, what, if anything, have you been doing to make things more ergonomic?


And re: the macro, I'll say the jankiest part of it is probably how I resolve symbols. Writing fully-qualified symbols in tx fns is kind of a drag, so I wanted to solve that so you can just write code how you normally do. The machinery of a syntax quote is just the thing for it, but I couldn't figure out how to use a syntax quote in that macro, so I had to write a bad impl of it that probably has many subtle bugs, a few of which I can already think of


Is there a faster way to get distinct values for a given attribute directly from an index somehow? Specifically, a scan over AVE, if such an index exists?

(q *node*
   '{:find  [?vals]
     :where [[_ :my-attr ?v]
             [(distinct ?v) ?vals]]})
The above times out with ~495K records coming from attribute-stats


Hmm, try using the distinct aggregate, i.e. :find [(distinct ?v)] instead. The clojure.core/distinct predicate function you used will, I think, try to process the whole ?v relation in one go, whereas the aggregate is lazy/streaming
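(Concretely, the change to the earlier query looks like this — a sketch reusing the same hypothetical :my-attr attribute:)

```clojure
(require '[crux.api :as crux])

;; The `distinct` aggregate in :find, rather than the clojure.core/distinct
;; predicate in :where -- the aggregate is applied as the result stream is
;; consumed instead of materializing the full ?v relation first.
(crux/q (crux/db node)
        '{:find  [(distinct ?v)]
          :where [[_ :my-attr ?v]]})
;; with no grouping vars in :find, this should yield a single row
;; containing the set of distinct :my-attr values
```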


It still may not be fast enough though, so you could either increase the timeout or, as a last resort, manually maintain a materialised index (or secondary index) to keep track
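(A hedged sketch of the "manually maintain a materialised index" option, using crux.api/listen on indexed transactions. The attribute name and the atom-based store are purely illustrative — a real secondary index would likely live in its own KV store, and this one-way version doesn't handle deletes or evictions:)

```clojure
(require '[crux.api :as crux])

;; Keep a running set of distinct :my-attr values, updated as txs index.
;; Illustrative only: an atom stands in for a real secondary index, and
;; deletes/evictions are not handled here.
(defonce distinct-my-attr (atom #{}))

(defn start-distinct-tracker! [node]
  (crux/listen node
               {:crux/event-type :crux/indexed-tx
                :with-tx-ops? true}
               (fn [{:keys [crux/tx-ops]}]
                 (doseq [[op doc] tx-ops
                         :when (and (= :crux.tx/put op)
                                    (map? doc)
                                    (contains? doc :my-attr))]
                   (swap! distinct-my-attr conj (:my-attr doc))))))
```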


I’m trying to build the materialized index, but it’s taking minutes to populate


I’ve tried both ways


Is there any way to get into the raw indices?


(and, is there a raw index that would be of any use here)


> Is there any way to get into the raw indices? Officially, no 🙂 However, there is an AV index, and you can access it directly like this
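(The referenced gist isn't reproduced here, but the rough shape of such a direct scan, against crux's internal KV iterator protocols, is below. These are unsupported internals: the codec/memory function names are assumptions based on the 1.17 sources and may not match exactly:)

```clojure
(require '[crux.kv :as kv]
         '[crux.codec :as c]
         '[crux.memory :as mem])

;; WARNING: unsupported internals -- function names are assumptions from
;; the crux 1.17 sources and may differ between versions.
(defn distinct-value-keys
  "Prefix-scan the AV index for one attribute, returning the raw index
  keys (which embed the sorted value representations) -- no documents
  are fetched or deserialized."
  [kv-store attr]
  (with-open [snapshot (kv/new-snapshot kv-store)
              i        (kv/new-iterator snapshot)]
    ;; assumption: encoding just the attribute yields the scan prefix
    (let [prefix (c/encode-av-key-to nil (c/->id-buffer attr))]
      (loop [k (kv/seek i prefix), acc []]
        (if (and k (mem/buffers=? prefix k (.capacity prefix)))
          ;; copy out the key, since the iterator may reuse its buffer
          (recur (kv/next i) (conj acc (mem/copy-to-unpooled-buffer k)))
          acc)))))
```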


You can see the index definition, how it gets encoded, that it gets encoded for every field-value in each document, how it gets prefix-scanned, and that the query engine does make use of the AV scan (exclusively) in the linked sources ...but I don't yet understand the ins-and-outs of how the n-ary-join-layered-virtual-index executes enough to explain why the initial solution I tried in that gist doesn't work 😅


Amazing, I’ll give this all a read. Thanks for the tips 🙂

🙏 2

Looking through the discussion - it seems like the distinct aggregate could also just be smarter in terms of using this index… Or am I missing something?


(and seems somewhat similar to the idea proposed in GH#1515 too)


Yep! Although I think the triple clause could use it whenever the e is _ (and v is a logic var) and it would have the ~same effect


Hey again, any joy with the speed up?


Hey! Yes, using the AV index directly worked perfectly for my usecase.

🙌 2

That being said, I still think that being forced to materialize the attribute values (as opposed to the EIDs, or even the sorted serialized representation) in an index is really difficult to overcome


Specifically, if I have a bounding box constraint, it might load up 500K records that then need to be deserialized from GeoJSON, encoded into the attribute binary, and then joined… I really think that being able to lazily provide the serialized (and sorted) representation would be a huge win for third-party indices…


Is there any chance I could convince you all to consider throwing those capabilities on your roadmap?


Well, the direction we've taken with Lucene is almost the opposite... we've de-emphasized usage of the predicate-function integration (which is still potentially useful in basic scenarios) in favour of running lots of small queries and performing the temporal resolution in userspace - you can potentially then hide all this in a user-defined predicate to achieve the same end result in Datalog. Do you think that could work for your scenario also? I wonder if there's some way to avoid the excessive serialization work, either by using ByteBuffers or by extending the codec protocol (or similar)
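(A sketch of that "hide it behind a user-defined predicate" pattern. Everything here is illustrative — the namespace, the atom-backed spatial-index standing in for a Lucene/geo lookup, and the :name attribute are all invented for the example. The point is that the Datalog join against the candidate eids performs the temporal resolution:)

```clojure
(ns my.app.geo
  (:require [crux.api :as crux]))

;; Stand-in for an external spatial index (e.g. Lucene/geo); a real
;; implementation would query that index here instead.
(defonce spatial-index (atom {}))  ; eid -> [x y]

(defn bbox-candidates
  "Return candidate eids inside the box as a relation ([[eid] ...]).
  Joining the result against the db in Datalog then resolves which
  candidates actually exist at the query's valid/tx time."
  [[min-x min-y max-x max-y]]
  (vec (for [[eid [x y]] @spatial-index
             :when (and (<= min-x x max-x)
                        (<= min-y y max-y))]
         [eid])))

;; Datalog consumes it via a relation binding:
(comment
  (crux/q (crux/db node)
          '{:find  [?e ?name]
            :in    [?box]
            :where [[(my.app.geo/bbox-candidates ?box) [[?e]]]
                    [?e :name ?name]]}
          [0 0 10 10]))
```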


Thanks for the tips as always @U899JBRPF - I will take a look and see how it can best be applied. Maybe I can work it into the effort of getting up to speed with 1.18 too


My pleasure, as always! I'd be happy to talk it through on a call / pair on it also