Hi everyone, newbie here, so pardon me if I have some misconceptions. I've been sitting on trying out xtdb for a project of mine, but just can't seem to get a handle on whether it's suitable. Among other things, I'd like to use xtdb for "market recording". Each recording consists of 10-40 entities, and each entity is a sorted map that changes over time (the market also has its own changing info, but this can be treated as another entity). Each market recording ends up having about 10-20k samples per entity. Only about 1k recordings are retained at any point in time (to save disk space). My current approach uses a map for the market, and vectors of maps as the entity timeline. I persist with nippy, using the market id as the file name. Each file ends up being 2-13MB. However, I can't randomly sample data from different markets quickly because it takes a few seconds to load a file, and all the data is loaded at once even if I just want 1 snapshot, which has restricted my workflow to handling a few markets at a time. Storing these in xtdb would make things a lot easier, but since there's no structural sharing, I'm unsure if it's suitable. Is there a guideline for a similar problem? I'm currently thinking of having a separate document for each entity to reduce the copying done, but I'm still unsure if this is the right approach.
Hey (fellow) @olajeremy123! It sounds like a good match for XTDB in theory, regardless of concerns about structural sharing - you may be interested in my answer https://discuss.xtdb.com/t/v2-best-way-to-handle-frequent-updates-that-might-not-contain-any-changes/404/13?u=refset. How much data in total is being handled in the system assuming you didn't drop >1k samples, e.g. over a year?
how large is each entity? is it well-modelled? or do you have to capture lots of nested details that change often?
In a year, there'll be about 6000 market recordings. However, I'm working with limited disk space, so I'd prefer to prune history at some point. I'm looking to remodel my entity as I make the switch to xtdb (it's my first main clj project, so lots of bad decisions). Currently, an entity snapshot is a map of sorted maps, i.e. a map of 3-5 keys, where each key maps to a sorted <double, double> map of a maximum of 350 entries (in practice, only about 10-20 entries are used at any given moment). The values in these sorted maps are what changes.
Reading the linked thread, it looks like the consensus is to just do it: split docs into fast and slow parts, and hope xtdb compression does the rest.
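For anyone reading along, here is a rough Python sketch of what "split docs into fast and slow parts" could look like for an entity snapshot. It is only an illustration on plain dicts and calls no XTDB API. The `_id` scheme, the field names (`name`, `status`, `back`, `lay`) and both helper functions are made up for the example, and the actual write to the database is left out.

```python
from __future__ import annotations

from typing import Any


def split_snapshot(entity_id: str, snapshot: dict[str, Any],
                   slow_keys: set[str]) -> tuple[dict, dict]:
    """Split one entity snapshot into a slow-changing and a fast-changing doc.

    The "<entity_id>/slow" and "<entity_id>/fast" ids are a hypothetical
    naming scheme chosen for this sketch.
    """
    slow: dict[str, Any] = {"_id": f"{entity_id}/slow"}
    fast: dict[str, Any] = {"_id": f"{entity_id}/fast"}
    for key, value in snapshot.items():
        target = slow if key in slow_keys else fast
        target[key] = value
    return slow, fast


def changed_docs(last_written: dict[str, dict], docs) -> list[dict]:
    """Keep only the docs that differ from the last version written.

    `last_written` is an in-memory cache (doc id -> doc) that is updated
    in place, so unchanged docs are skipped on the next call.
    """
    out = []
    for doc in docs:
        if last_written.get(doc["_id"]) != doc:
            last_written[doc["_id"]] = doc
            out.append(doc)
    return out
```

With something like this, a sample where only the sorted-map values moved would produce a single small "fast" doc to write, and the "slow" doc would only be rewritten when its own fields change.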
Thanks for the added context. How come the disk space is limited? Is this running on some fixed compute infra? Or on end user machines?
Yes, it's a hobby project, so I'm running everything on my local machine with a 512GB internal SSD; I basically index market data, do some analysis, feature engineering, etc.
Ah okay, I guess that changes the equation a bit - it may be cheaper (in time/effort) to invest in an external ssd 🙂
Be aware that so far v2 hasn't had advanced retention / downsampling ideas prioritised, and ERASE is quite a blunt instrument. Therefore, you can't use v2 currently and expect to reclaim space or keep storage limited. You would need to lean on the idea of 'time-sharding'/decanting into fresh database instances regularly (e.g. once per month) and only copy over the data you want to retain.
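As a rough illustration of the decanting idea, here is a Python sketch of the selection step: pick the recordings worth keeping, then copy just those into the fresh instance. Nothing here is an XTDB API. `read_history` and `write_history` are hypothetical callables you would have to supply yourself (one reading a market's data from the old instance, one writing it to the new one), and the default of 1000 simply mirrors the "about 1k recordings" figure from earlier in the thread.

```python
from __future__ import annotations

from datetime import datetime
from typing import Any, Callable


def select_for_decant(recordings: dict[str, datetime],
                      keep: int = 1000) -> list[str]:
    """Return the ids of the `keep` most recently updated recordings.

    `recordings` maps a market id to the time of its latest sample.
    The result is sorted by id so the copy order is deterministic.
    """
    newest_first = sorted(recordings, key=recordings.get, reverse=True)
    return sorted(newest_first[:keep])


def decant(recordings: dict[str, datetime],
           read_history: Callable[[str], Any],
           write_history: Callable[[str, Any], None],
           keep: int = 1000) -> list[str]:
    """Copy only the retained recordings from the old store to the new one.

    Both callables are placeholders for whatever export/import mechanism
    is actually used; this function only decides what gets copied.
    """
    retained = select_for_decant(recordings, keep)
    for market_id in retained:
        write_history(market_id, read_history(market_id))
    return retained
```

Once the copy has been verified, the old instance's storage can be deleted wholesale, which is where the disk space actually comes back.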
Yh, I don't mind doing that once or twice, and I'd likely move onto the cloud in 2-3 months. Thanks @taylor.jeremydavid
Do columns get cleaned up? If I insert a bunch of records that later get erased, the columns still exist and return nil on SELECT *. I'm asking just out of curiosity; this mostly shows up at the REPL when trying things out, and is unlikely to be an issue.
no, unfortunately we can't feasibly clear columns up without implementing some form of drop-column DDL; without this I think we'd need to check whether to remove a column by checking every other doc every time a document gets erased
the drop-column DDL is also problematic for a database that prides itself on not being able to destructively edit the past 🙂 maybe we'd need to implement that with erase semantics too
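To illustrate why cleaning up columns on erase is costly without a drop-column DDL, here is a toy Python model of the check. It treats a table as a list of dicts and says nothing about how XTDB stores data internally; `columns_to_drop` is a made-up helper for this sketch.

```python
from __future__ import annotations


def columns_to_drop(erased_doc: dict, remaining_docs: list[dict]) -> set[str]:
    """Return the columns of `erased_doc` that no remaining doc still uses.

    In the worst case this scans every remaining document, which is the
    per-erase cost that makes automatic column cleanup unattractive.
    """
    candidates = set(erased_doc)
    for doc in remaining_docs:
        # Any column still present in another doc has to stay.
        candidates.difference_update(doc)
        if not candidates:
            break  # every column is still in use, no need to scan further
    return candidates
```

The early exit helps when columns are widely shared, but a column used by only the erased document forces a full scan before it can be declared unused.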
Makes sense, thanks.