Vincent Cantin18:03:15

I went through the documentation of Crux and I still have questions.

Vincent Cantin18:03:52

Are vectors of vectors supported?

Vincent Cantin18:03:40

Are maps of maps supported? (supposing that we can't flatten them at the root level of the document, for some reason)


Hi @vincent.cantin - you can put almost any valid edn map in as a document, so long as it has a valid :crux.db/id - but you won't be able to use Datalog queries to their fullest potential unless you shred your data into smaller documents. If you can give an example of the kind of document structures you are working with I can try to help you understand how it could be modelled
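To make "shred your data into smaller documents" concrete, here is a minimal sketch in plain Python. It is not the Crux API. It assumes a nested map is split so that every inner map becomes its own document and the parent keeps a reference to it. The `shred` function, the string key `"crux.db/id"` (standing in for the keyword `:crux.db/id`) and the `parent-id/attribute` id scheme are all hypothetical choices made for illustration, and vectors of vectors are not handled.

```python
from typing import Any, Dict, List

Doc = Dict[str, Any]


def shred(doc: Doc, id_key: str = "crux.db/id") -> List[Doc]:
    """Split nested maps out of `doc` into separate documents linked by id.

    Returns the flattened root document first, followed by its descendants.
    """
    root: Doc = {}
    children: List[Doc] = []
    for key, value in doc.items():
        if isinstance(value, dict):
            # Hypothetical id scheme: reuse the child's own id if it has one,
            # otherwise derive one from the parent id and the attribute name.
            child_id = value.get(id_key, f"{doc[id_key]}/{key}")
            children.extend(shred({**value, id_key: child_id}, id_key))
            root[key] = child_id  # the parent now holds a reference only
        else:
            root[key] = value
    return [root] + children


order = {
    "crux.db/id": "order-1",
    "total": 10,
    "customer": {"name": "Ada", "address": {"city": "Paris"}},
}
docs = shred(order)
for d in docs:
    print(d)
```

Each resulting document has only scalar attributes and id references at its top level, which is the shape the reply above suggests works best with Datalog queries.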

Vincent Cantin18:03:01

I don't have the data at hand, but I have been told that it is heavily aggregated into documents.

Vincent Cantin18:03:45

The use case is to import a huge legacy DB into (maybe) Crux, in order to run data analytics on it.

Vincent Cantin18:03:18

The reason why I thought about Crux is for the bitemporal aspect.


Ah okay, interesting. Crux is certainly fast enough to support many kinds of analytical queries (likely with a mixture of Datalog + Clojure), and bitemporality can be very helpful for integrating views over multiple data sources or even multiple imports of the same legacy DB. Numerical analytics over timeseries data is not going to be efficient with Crux today though. Ingesting the data as-is and then gradually transforming it (retroactively!) to support your queries would be a reasonable thing to do. Feel free to reach out if you want to discuss anything in more detail - I would be very happy to video call etc. 🙂

👍 4

@U899JBRPF I'm wondering if your point here about numerical analytics over timeseries data still holds in XTDB 2. I was just reading the early access page and recalled reading this comment of yours some time ago.


Oh man, blast from the past 😄


So there's still a 'tax' being paid in v2 to validate the visibility of each version of a given row, and pure time series systems don't have to pay this at all (because they just store raw data) - however we are considering offering the ability to create atemporal tables in the future which wouldn't pay that tax. The ability to work with raw Arrow (perhaps Parquet in future too) files also affords possibilities.
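As a rough illustration of the visibility 'tax' mentioned above, the sketch below models a generic bitemporal check in Python. It is a conceptual model only and does not describe XTDB's actual storage or algorithm. The `Version` record, its half-open integer intervals and the `visible` function are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Version:
    value: float
    valid_from: int
    valid_to: Optional[int]    # None means open-ended
    system_from: int
    system_to: Optional[int]   # None means not yet superseded


def _contains(start: int, end: Optional[int], t: int) -> bool:
    """True if t falls in the half-open interval [start, end)."""
    return start <= t and (end is None or t < end)


def visible(versions: List[Version], valid_time: int, system_time: int) -> List[Version]:
    """Keep only versions visible at the given (valid time, system time) point.

    This per-version check is the extra work; a raw time series scan
    would simply read every stored value.
    """
    return [
        v for v in versions
        if _contains(v.valid_from, v.valid_to, valid_time)
        and _contains(v.system_from, v.system_to, system_time)
    ]


# One reading, recorded at system time 1 and corrected at system time 5.
history = [
    Version(20.0, valid_from=0, valid_to=None, system_from=1, system_to=5),
    Version(21.5, valid_from=0, valid_to=None, system_from=5, system_to=None),
]
before_fix = [v.value for v in visible(history, valid_time=3, system_time=2)]
after_fix = [v.value for v in visible(history, valid_time=3, system_time=6)]
print(before_fix, after_fix)
```

A table with no history at all would skip the `visible` filter entirely, which is the idea behind the atemporal tables mentioned above.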


Maybe there are some time series db tricks we could add also, to cope with out-of-order buffering without going full bitemp, I'm not sure!


What was your use case again? Are you just hoping to avoid ETL across multiple systems?


The single-writer limitation of XT is also severe, given that most time series systems give up strong transactionality to offer massive ingest throughput.


> What was your use case again?

In my case I was using XT as my primary data store for timeseries data and doing a bunch of numerical analytics over it. Some of the records would change occasionally and the ability to compare versions was useful in those cases, but that was the minority case. Eventually I started doing most of the timeseries stuff in plain Postgres and then pushing computed summaries to my XTDB store, and the ability to compare versions became very useful there. Nowadays I'm working on different projects and not using XT so much, but I miss it, and recent conversations about needing the ability to override certain DB fields while tracking those changes have me thinking of XT again. It's once again timeseries data that goes through continuous analysis.


Cool, good to know. I guess I'm quite curious about what those Postgres (or other) queries look like


We haven't added PARTITION BY / OVER support to the engine yet (for SQL at least), so I don't know if v2 Datalog has the ability to express those problems either.
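For readers unfamiliar with PARTITION BY / OVER, here is a small sketch of the kind of window query being discussed: a per-sensor moving average over timeseries rows. The actual Postgres queries from this thread are not shown anywhere, so the `readings` table and its columns are hypothetical. The example runs against an in-memory SQLite database (3.25 or newer) only to stay self-contained. Postgres accepts the same window syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, ts INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("a", 1, 10.0), ("a", 2, 20.0), ("a", 3, 30.0), ("b", 1, 5.0), ("b", 2, 7.0)],
)

# Moving average over the current and previous reading, computed
# independently for each sensor (the PARTITION BY clause).
rows = conn.execute(
    """
    SELECT sensor, ts,
           AVG(value) OVER (
               PARTITION BY sensor
               ORDER BY ts
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS moving_avg
    FROM readings
    ORDER BY sensor, ts
    """
).fetchall()

for row in rows:
    print(row)
```

The window is evaluated per row without collapsing the result set, which is what distinguishes it from a plain GROUP BY aggregate.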