xtdb 2022-04-14 | Slack Archive

tatut05:04:04

out of curiosity, what would be the best way to store a latitude/longitude point (two numbers basically) to get a "things within radius of input point" type of query fast? any examples of GIS data with XTDB

tatut05:04:52

I'm guessing just having the two numbers as separate attributes would be good enough unless you have lots of entities

Hukka07:04:04

Yeah, store separately, get all points within a bounding box, and then calculate actual R for only points in the bbox

Hukka07:04:31

No way to do an efficient index with arbitrary query points

Hukka07:04:45

Anything more complicated and I think I would just offload those parts to postgis

Hukka07:04:58

For a lot of points it would start making sense to split the query into smaller bounding boxes, so that for everything definitely inside the radius would get included without calculating the R, and you would just check the bboxes on the boundary more carefully

Tomas Brejla07:04:16

My usecase was to find train/bus stops around a 1-15km radius. I ended up storing 3 fields: lat, lon and geohash (which I pre-calculated using factual/geo ) When doing the search, I first compute (geohashes-around lat lon max-meters) , then use them to query all the bus/train stops which have one of those geohashes. Then finally i filter only (in plain clojure, outside xtdb query) those found results and leave only whose (_spatial_/distance is withing the radius I'm searching in. All this could definitely be improved, but it worked well enough for me on a dataset of approx 2789 train stops & 34973 bus stops. With mentioned approach, my api endpoint executes in around 60ms on a digitalocean potato server.

3

Hukka07:04:47

How was the perf of using the precomputed bboxes vs querying lat and lon directly?

Tomas Brejla07:04:56

No idea, I was stupid enough not to try the much simpler variant with bounding boxes you mention. I'm not really sure why I haven't come up with that 😄

Hukka07:04:13

Heh

Tomas Brejla08:04:16

I just quickly tried querying via those lat/lon fields and the query is suprisingly slow. Even if i define the tiny bounding box (a single point matching just a single train stop, ie (`min-lat = max-lat` and, min-lon = max-lon), the query executes in no less than 800ms and more. In comparison, lookup of the same train stop via geohash value takes 1.2ms. I may have some issue in it the query.. may look into it later in the evening. Basically I just added comparisons like these. Perhaps there's more efficient way.

[e :lat lat]
[e :lon lon]
[(<= min-lat lat)]
[(>= max-lat lat)]
[(<= min-lon lon)]
[(>= max-lon lon)]

Hukka08:04:32

Wow, 800ms is a lot

1

Hukka08:04:55

Is there any difference if you pass in the constraints directly, not via :in?

Hukka08:04:31

Oh, and what data type are your latlons?

Hukka08:04:37

If it doesn't have an implementation in https://github.com/xtdb/xtdb/blob/master/core/src/xtdb/codec.clj#L211, then I think it won't get indexed

refset09:04:29

> If it doesn't have an implementation in https://github.com/xtdb/xtdb/blob/master/core/src/xtdb/codec.clj#L211, then I think it won't get indexed ~small byte-array values should actually get round-tripped (via Nippy) in a way that coherently sorts within the indexes for the range constraints to work correctly https://github.com/xtdb/xtdb/blob/2460cc8844e7b020c9c34f3d7a684fa4ea713d7e/core/src/xtdb/codec.clj#L362-L375 e.g. see https://gist.github.com/refset/f737151c2956093bbe01d7b81a743328

tatut09:04:22

this opened up interesting discussion, thanks... 🙂

Tomas Brejla11:04:51

@U8ZQ1J1RR you nailed it. Passing those min-lat (etc) values directly instead of via :in brought the times to 1-2ms as well.

Tomas Brejla11:04:42

the lat / lon are just regular numbers (ie Double)

Tomas Brejla11:04:54

The slowdown just because of passing those values via :in in quite unexpectable to me. Does it have some good reason?

Hukka11:04:43

Beats me 😄

Hukka11:04:58

But following the channel, I had a bit of a hunch

tatut11:04:40

would be interesting to see what the difference in the query plan is

refset11:04:39

@U01LFP3LA6P were you passing the values :in as scalars? i.e. :in [min-lat max-lat ...] or are they wrapped in tuples/relations? Are you using -beta2 or -beta3? Or something older? We added a change not too long ago that should calculate better join orders when scalars are passed in: https://github.com/xtdb/xtdb/commit/6245d76541847af969d86d6b63f0892e40642eab (also see the linked issue for discussion on the background / fundamental engineering problem)

Tomas Brejla11:04:56

scalars, 1.20.0

Tomas Brejla11:04:50

seems that the performance of this form with beta-3 is good :thumbsup:

🙌 1

refset11:04:34

ah, excellent, thanks for confirming that ☺️ we will get 1.21 released properly soon, but in the meantime I would recommend using the beta during any new development anyway

👍 1

Steven Deobald16:04:58

@U01LFP3LA6P @U8ZQ1J1RR As always, this is a very interesting discussion that I'm willing to bet someone (a future "someone") would appreciate reading on http://discuss.xtdb.com if either of you are in the mood to summarize some time. 🙂

Petrus Theron16:04:37

@U11SJ6Q0K https://en.wikipedia.org/wiki/Hilbert_curve

👍 1

Petrus Theron16:04:28

@U01LFP3LA6P what do you mean by passing in lat/long values “directly” instead of via :in?

refset17:04:57

> what do you mean by passing in lat/long values “directly” instead of via :in? @petrus i.e. as literals in the compiled query, rather than using logic vars, like [?e :att "val1"] vs [?e :att ?val] + :in [?val] (there is some small overhead in this approach, as it essentially bypasses the query compilation caching)

nivekuil23:04:41

you would want a separate geospatial index, have you seen https://github.com/teknql/crux-geo

Petrus Theron15:04:00

Hey guys, I am preparing a technical talk on XTDB at my company (a conservative fintech) next week. I’m pitching as “version control for data.” Any previous experience on “selling XTDB” into a traditional bank-like organization and running it in production? The first pushback I got from cloud infra was about where to store the RocksDB indices if K8s pods have ephemeral storage.

Martynas Maciulevičius15:04:27

Try saying a word "postgres", "mysql" or "kafka". These could work.

Steven Deobald16:04:11

Heh.

Steven Deobald16:04:34

#Also sent to the channel

@petrus We'd be quite happy to help you out with this, I think. Care to drop us a line at <mailto:[email protected]|[email protected]>? Maybe we can have a call with you to talk through what you're trying to build and where/how xtdb is appropriate, providing some context and materials we have available. It's Good Friday in the UK, Sweden, and Canada tomorrow... so I'm afraid we're probably all OoO for that. But assuming your presentation is after Tuesday, we should be around to set something up. === Others: In general, if you want help constructing an argument to the powers-that-be at your company/org that they should allow you to give xtdb a spin, we're generally happy to (try to) help out with that. 🙂 More xtdb nodes in production is always a nice thing.

👍 2

Petrus Theron16:04:52

#Also sent to the channel

@steven427 📨 mailed. Talk has to be ready for dry run by Wednesday, so not much time. Actual talk on Thursday.

1

👍 1

Hukka07:04:56

The question of how to run XTDB indexes in kubernetes are of interest to us too, though at this point (how much data we have, how much complexity persistent storage would bring) we still do it in-mem.

Petrus Theron08:04:54

#Also sent to the channel

How do you deal with rebuilding in-memory indices when nodes are restarted? My understanding is that indices are built eagerly, which means if an index node is restarted, this would be a serious cold cache problem and put load on the tx log.

Hukka09:04:07

We just do it cold for now, since we don't scale dynamically yet. When our instances become shorter lived, we would use https://docs.xtdb.com/administration/1.20.0/checkpointing/

tatut09:04:19

we are using checkpointing to S3

refset11:04:59

> how to run XTDB indexes [..] we still do it in-mem @U8ZQ1J1RR I would recommend switching to RocksDB or LMDB sooner rather than later - definitely ahead of going into production - because the performance will be more realistic (often faster...which is counter-intuitive 🙂) and you will be more sure to avoid OOM errors

Hukka11:04:45

Sorry, I meant we run rocksdb with the inmem non persistent filesystem in kubernetes

👍 1

😅 1

Steven Deobald16:04:34

replied to a thread:Hey guys, I am preparing a technical talk on XTDB at my company (a conservative fintech) next week. I’m pitching as “version control for data.” Any previous experience on “selling XTDB” into a traditional bank-like organization and running it in production? The first pushback I got from cloud infra was about where to store the RocksDB indices if K8s pods have ephemeral storage.

@petrus We'd be quite happy to help you out with this, I think. Care to drop us a line at <mailto:[email protected]|[email protected]>? Maybe we can have a call with you to talk through what you're trying to build and where/how xtdb is appropriate, providing some context and materials we have available. It's Good Friday in the UK, Sweden, and Canada tomorrow... so I'm afraid we're probably all OoO for that. But assuming your presentation is after Tuesday, we should be around to set something up. === Others: In general, if you want help constructing an argument to the powers-that-be at your company/org that they should allow you to give xtdb a spin, we're generally happy to (try to) help out with that. 🙂 More xtdb nodes in production is always a nice thing.

👍 2

Petrus Theron16:04:52

replied to a thread:Hey guys, I am preparing a technical talk on XTDB at my company (a conservative fintech) next week. I’m pitching as “version control for data.” Any previous experience on “selling XTDB” into a traditional bank-like organization and running it in production? The first pushback I got from cloud infra was about where to store the RocksDB indices if K8s pods have ephemeral storage.

@steven427 📨 mailed. Talk has to be ready for dry run by Wednesday, so not much time. Actual talk on Thursday.

1

👍 1

Petrus Theron08:04:54

replied to a thread:Hey guys, I am preparing a technical talk on XTDB at my company (a conservative fintech) next week. I’m pitching as “version control for data.” Any previous experience on “selling XTDB” into a traditional bank-like organization and running it in production? The first pushback I got from cloud infra was about where to store the RocksDB indices if K8s pods have ephemeral storage.

How do you deal with rebuilding in-memory indices when nodes are restarted? My understanding is that indices are built eagerly, which means if an index node is restarted, this would be a serious cold cache problem and put load on the tx log.

2022-04-14

Channels