#xtdb
2023-01-07
cddr10:01:48

I’ve been looking at Flink (specifically their “stateful function” API) as a platform to run a stateful event processor. They provide what seems to be a simple KV storage object for local state, which I believe is based on RocksDB. However, I think the model I want to put in state will be quite a complex data structure that might be well suited to modelling as triples. So I’m wondering if I can somehow connect xtdb to the storage interface provided by Flink.

refset15:01:29

Hey @cddr XT could be a good candidate as an alternative StateBackend (with adequate wrapping), but I believe there is a fundamental tension to resolve between a desire for bounded windowing in Flink's streaming processing model vs unbounded infinite retention of history in XT's storage model. For certain use-cases I can imagine it working usefully, but only where the amount of data being retained is expected to stay modest. At large enough scale I would expect the practicalities to break down due to the unbounded size of the StateBackend, so having clarity on the reasonable upper bound for your Flink infrastructure is probably a pre-requisite. Do you already have an idea there of the retention requirements and the capacity of the infrastructure? In principle the mechanics of the two systems are somewhat similar, and the RocksDBStateBackend incremental checkpointing is not dissimilar to how we handle "checkpoints" in XT (note we don't actually use Rocks' incremental facility yet...but it's definitely possible). If you don't actually need XT's history capability however then you may want to look elsewhere, e.g. https://github.com/juji-io/datalevin doesn't retain history so you can easily keep state bounded manually (it uses LMDB though, not Rocks). We have experimented with implementing configurable retention models in the past (see https://github.com/xtdb/xtdb/pull/727) but there's nothing in the codebase currently to help with that sort of thing that you can easily hook into. As Malcolm mentioned, a couple of our colleagues are actually using Flink on a project at the moment and might be able to offer more concrete advice (I haven't used Flink), although they aren't in this Slack yet 🙂. I also know they have been playing with some sort of light integration with XT via Kafka Connect. If you're curious: https://www.linkedin.com/feed/update/urn:li:activity:6963175473992409088/ / https://www.linkedin.com/in/asel-kitulagoda/

malcolmsparks10:01:03

Hi @cddr, a couple of my colleagues at JUXT have been working on something with xtdb and Flink. I'll reach out to them on Monday to connect you.

👍 2
sparkofreason19:01:02

Just starting on a POC for replacing our existing RDBMS with xtdb. When importing the data, I made each row a document and added an attribute denoting the name of the original table. Running this query to get the list of tables takes an extremely long time (in fact, I haven't gotten it to finish yet and am still bumping up the :timeout value), which puzzles me, since there are fewer than 100 values for :mysql/table. Am I missing something here in terms of data modeling or configuration?

'{:find [(distinct ?table)]
  :where [[_ :mysql/table ?table]]
  :timeout 30000000}

sparkofreason20:01:50

Looking at the disk and CPU usage while this runs, it seems to be scanning every document to execute this query (they all have the :mysql/table attribute). I would have thought this would be more like a quick index lookup, so it feels like I must have done something wrong here.

phill23:01:44

Is the value of :mysql/table the xt/id of an entity representing the table, or is the value of :mysql/table just a string or keyword?

sparkofreason23:01:42

Just a keyword.

FiVo14:01:49

Just a remark concerning data modelling. If you'd like to keep the explicit table reference, that is fine. A more common approach for table-to-doc translation would be to use the table name as the namespace for the attribute keywords.
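As a rough illustration of that translation, here is a minimal sketch in plain Clojure. The function name `row->doc`, the `id-col` parameter, and the string form of `:xt/id` are all assumptions made for this example, not anything prescribed by XTDB or by the import described above.

```clojure
(defn row->doc
  "Turns one relational row (a map of column keyword -> value) into a
  document whose attribute keywords are namespaced by the table name.
  Hypothetical helper: the id scheme is an assumption for illustration."
  [table id-col row]
  (into {:xt/id (str table "/" (get row id-col))} ; e.g. "person/1", so ids from different tables don't collide
        (map (fn [[col v]] [(keyword table (name col)) v]))
        row))

(row->doc "person" :id {:id 1 :name "joe"})
;; => {:xt/id "person/1", :person/id 1, :person/name "joe"}
```

With this shape, the table a document came from is implied by its attribute namespaces, so no separate table attribute is needed.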

FiVo14:01:34

And if you would then like to know what attributes your db contains you could do something like

(require '[xtdb.api :as xt])

(xt/submit-tx node [[::xt/put {:xt/id "p1" :person/name "joe" :person/email ""}]])
(xt/sync node) ; wait until the transaction has been indexed
(keys (xt/attribute-stats node))
;; => (:person/name :person/email :xt/id)

sparkofreason17:01:10

Thanks. So that actually anticipated my next question: Will queries limited to a "table" be more efficient via namespaced keywords or having a :table attribute? Based on my experience here, I'm guessing the namespaced keywords win, because that avoids having attributes that have to index the entire DB.

sparkofreason17:01:23

By which I mean doing this:

{:find [?e] :where [[?e :mytable/name "foo"]]}
vs
{:find [?e]
 :where [[?e :table "mytable"]
         [?e :name "foo"]]}

FiVo17:01:44

Yes, definitely the former. In the second case the second clause potentially sieves through all entities having a name attribute.

🤘 2
refset17:01:38

> I would have thought this would be more like a quick index lookup
Hi @U066LQXPZ, sorry to chime in late! On this point I just realised the explanation is related to what I responded with yesterday on the other thread https://clojurians.slack.com/archives/CG3AM2F7V/p1673356386496499?thread_ts=1672440723.766149&cid=CG3AM2F7V. Essentially, currently, all aggregation work (i.e. anything in the :find) happens after the main query execution of the :where clauses. This means the result tuples are already streaming in and the indexes aren't being interrogated at this point (with the exception of pull). In the case of distinct specifically however, although we do keep a HyperLogLog approximation of the number of distinct values, our indexes don't currently track the exact number which means it can't be used in queries. I think this scenario (if it were still relevant!) could perhaps instead be modelled as concrete application data, e.g. a document containing the set of known tables.
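One way the "document containing the set of known tables" idea might look is sketched below. The helper name `known-tables-tx`, the id `:mysql/known-tables`, and the attribute `:mysql/tables` are hypothetical names chosen for this example. The XTDB calls sit in a `comment` form and assume a started XTDB 1.x node bound to `node`, with `xtdb.api` aliased as `xt`.

```clojure
(defn known-tables-tx
  "Builds a transaction containing one put operation for a single document
  that lists every :mysql/table value seen in docs.
  Hypothetical helper: the id and attribute names are assumptions."
  [docs]
  [[:xtdb.api/put {:xt/id :mysql/known-tables
                   :mysql/tables (into #{} (keep :mysql/table) docs)}]])

(comment
  ;; Assumes `node` is a started XTDB 1.x node, `imported-docs` is the
  ;; collection of imported documents, and (require '[xtdb.api :as xt]).
  (xt/submit-tx node (known-tables-tx imported-docs))
  (xt/sync node)
  ;; Reading the table list is then a single entity lookup, not a scan.
  (:mysql/tables (xt/entity (xt/db node) :mysql/known-tables)))
```

The trade-off is that the application has to keep this document up to date whenever a document for a new table is written.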

markaddleman17:01:55

> Yes, definitely the former. In the second case the second clause potentially sieves through all entities having a name attribute.
This reminds me to ask: Does XTDB have support for compound attributes/indexes?

refset17:01:37

> Does XTDB have support for compound attributes/indexes?
Not currently, since XT intentionally avoids storing any notion of schema. If you were to attempt doing something in this space though then it might look similar to the Lucene module which is mostly just a secondary index ("projection") over the same tx-log.

markaddleman18:01:21

That might be useful. In my use case, nearly all queries are bounded in time against an explicit timestamp attribute in the document. Having a compound index of timestamp + doc attribute would significantly speed up queries, I think

📝 2