#xtdb
2023-05-06
andrewzhurov07:05:47

Heyo, folks! Thanks for all the effort put into XTDB, exciting to see databases being reimagined. <3 I heard that there will be a switch to Apache Arrow in v2. I'm curious about how graph data will be represented in Arrow's columnar format: will it be a format used by existing tools (https://github.com/rapidsai/cudf/blob/98122d3ea0728e715cf12df426946b21a61b511e/python/cudf/cudf/core/dataframe.py#L5203, https://unum-cloud.github.io/ustore/) or home-brewed? Is there a convention? I'm curious because I'm planning to write a graph store and am considering Arrow for representing graph data.
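
To illustrate what I mean by a columnar graph layout, here's a toy CSR-style edge list as plain Clojure data. The names and shapes are purely illustrative, not anything Arrow or cuDF prescribe; Arrow would hold each of these fields as a typed column.

;; Toy CSR-style layout: each field is a flat column.
(def graph
  {:offsets [0 2 3 3]        ; node i's edges live at dst indices [offsets[i], offsets[i+1])
   :dst     [1 2 2]          ; flat destination column
   :weight  [1.0 0.5 2.0]})  ; optional per-edge property column

(defn out-neighbours [{:keys [offsets dst]} node]
  (subvec dst (offsets node) (offsets (inc node))))

(out-neighbours graph 0) ;; => [1 2]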

👀 2
refset22:05:59

Hey @U0ZQT0K2N thanks for the feedback 🙌 and also for the interesting links! 👀 Currently 2.x doesn't implement any specific 'graph' indexes, alternative storage layouts, or multi-way join algorithms. We would like to work on these things eventually but for the time being we are focused on the columnar foundations. Therefore, 1.x will likely work better for genuinely graph-shaped problems for quite some time (because of the existing sorted triple indexes + multi-way joins). Out of interest, what is the problem/business domain you are working with?
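
As a rough illustration of the "sorted triple indexes" point (a toy sketch, not XTDB's actual implementation): the same triples are kept in more than one sort order, so each query clause has a cheap index range to seek rather than a full scan.

;; Toy sketch, not XTDB internals: entity-first (EAV) and attribute-first (AVE)
;; sort orders over the same triples.
(def triples
  [[:alice :follows :bob]
   [:bob   :follows :carol]
   [:alice :likes   :clojure]])

(def eav (into (sorted-set) triples))
(def ave (into (sorted-set) (map (fn [[e a v]] [a v e])) triples))

;; "who follows :carol?" - range-seek the AVE index with attribute + value bound
(->> (subseq ave >= [:follows :carol nil])
     (take-while (fn [[a v _e]] (and (= a :follows) (= v :carol))))
     (map peek))
;; => (:bob)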

andrewzhurov01:05:59

Roger, thanks for letting me know, Jeremy. <3 That's a cool architectural feature: by having a log of events you can derive any number of data models out of it, each optimized for a specific query language (SQL, Datalog) or even engine (that cuGraph representation may be in Apache Arrow's columnar format, tailored for GPU querying, whereas Datalog running on CPU may leverage triple-store indexes). :the_horns: And I guess you could even prefer one data model over another based on the query itself (analytical queries would be efficient to run on GPU, whereas transactional ones - on CPU). What a wonderland, curious to see how far XTDB will push it. 💪

The problem domain intersects with that of XTDB. A good deal of skepticism is advised due to heavy blue-sky dreaming ahead. 🙂 I'm planning to write a distributed compute framework for the Web, akin to re-frame, where subscriptions are SPARQL queries, events are SPARQL updates, and app-db is derived out of a personal event log (akin to how it's done in XTDB, and how re-frame's app-db is derived out of a sequence of events). The event log is the source of truth and is managed as an OrbitDB log (a CRDT), allowing peers to collaborate on the same event log, with state being eventually consistent. I would love to build on top of XTDB, as this mimics its architecture, but it's meant to run in the browser.

So this tool is meant to be a distributed compute framework, a sort of Internet Computer, initially specific to computation of SPARQL - akin to the Semantic Web (with https://solidproject.org/) but of computation, not data; data is derived out of computation (akin to how XTDB took a step back, having the event log as the source of truth), which gives provenance, immutability and as-of out of the box - well, all the goodies that XTDB has to offer, plus distributed compute on top. If it proves to be a sound architecture, perhaps it may become an alternative compute model for the Web that embraces p2p and immutability. My hope is that it'll allow us to have a more robust foundation on top of which we, as humanity, engineer knowledge. One hell of blue-sky dreaming, that is. 😄

It is as though XTDB had a CRDT for the event log (so users have local logs they work on and sync with each other) and SPARQL transactions for events (and, well, SPARQL to query over the db that gets derived out of a log, or across multiple dbs). Dreams, dreams! Not sure how far this idea will fly, but I'll try to implement a prototype in the next month and we'll see from there. 🙂 Sorry for the wall of text, these topics get me all excited
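
To make the "derive any number of data models out of one log" idea concrete, a minimal sketch with made-up event shapes: one immutable log, two independently derived read models.

;; Minimal sketch, invented event shapes.
(def event-log
  [{:op :assert :s "alice" :p :follows :o "bob"}
   {:op :assert :s "bob"   :p :follows :o "carol"}])

;; Read model 1: a set of triples (handy for Datalog/SPARQL-style matching on CPU)
(defn ->triples [log]
  (into #{} (map (juxt :s :p :o)) log))

;; Read model 2: columnar src/dst vectors (the kind of layout a GPU engine likes)
(defn ->edge-columns [log]
  {:src (mapv :s log)
   :dst (mapv :o log)})

(->triples event-log)      ;; => #{["alice" :follows "bob"] ["bob" :follows "carol"]}
(->edge-columns event-log) ;; => {:src ["alice" "bob"], :dst ["bob" "carol"]}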

❤️ 1
rickmoynihan10:05:45

FWIW this is of interest to me too. Background is that we work with government data, and publish official statistics. There is a big desire to have an RDF data model. However, statistical data / analytical queries pose performance problems for essentially all triplestores. Hence we're prototyping an approach where we have a surface RDF data model, and internally model it with RDF assumptions, but remove the requirement for every aspect of the model to be reified as a triple in a triplestore… i.e. a subset of triples may be materialised in a triplestore, but the analytical data may be held in tables in an index/db optimised for that, though identifiers into observations in that table will be URIs etc. Similarly we want to bake immutability into the data model etc, as this is for official statistics. Bitemporality is also very relevant too, though we don't necessarily need it everywhere. Anyway, I've been following xtdb since before its inception, and had evaluated 1.x as a possible foundation about 6 months ago, but was skeptical of it being able to address the non-functional performance requirements because of its triplestore foundations (fundamentally I think triplestores will never be good at analytical loads - I've played with a lot of them, and they all suffer similar performance challenges). So currently we're exploring building a hybrid solution; but after seeing @U050DD55V's xtdb 2.0 announcement at the Conj I was excited by the possibilities of columnar storage in XTDB 2.0, so it might be an option to reconsider in the future.
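
If it helps to picture the hybrid shape: a tiny sketch with invented names, where a handful of reified metadata triples sit next to bulk observations in a table-like structure, joined on URIs.

;; Invented example: metadata stays reified as triples, the big observation
;; table lives in a tabular/columnar index, and URIs are the join keys.
(def metadata-triples
  #{["http://ex.org/dataset/gdp" :dcterms/title "GDP by region"]
    ["http://ex.org/obs/1"       :qb/dataSet    "http://ex.org/dataset/gdp"]})

(def observations ; in practice this could live in Arrow/Parquet/SQL
  [{:obs "http://ex.org/obs/1" :region "http://ex.org/region/north" :year 2021 :value 166.0}
   {:obs "http://ex.org/obs/2" :region "http://ex.org/region/south" :year 2021 :value 120.0}])

;; analytical side: aggregate over the table without touching the triples
(reduce + (map :value observations)) ;; => 286.0

;; semantic side: follow a URI from the table back into the triplestore
(filter (fn [[s _p _o]] (= s "http://ex.org/obs/1")) metadata-triples)
;; => (["http://ex.org/obs/1" :qb/dataSet "http://ex.org/dataset/gdp"])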

🙂 2
❤️ 2
👍 1
💡 1
refset13:05:51

> My hope is that it'll allow us to have a more robust foundation on top of which we, as humanity, engineer knowledge. One hell of blue-sky dreaming, that is. 😄
I can relate (https://www.hytradboi.com/2022/baking-in-time-at-the-bottom-of-the-database) 😅 Thank you both for sharing - really interesting to hear!

❤️ 2
andrewzhurov06:05:12

> However, statistical data / analytical queries pose performance problems for essentially all triplestores.
Did you have a chance to give https://www.rdfhdt.org/ a spin for analytical queries? It's the most performant triple-indexed RDF store I've heard of, although its layout is still not optimized for analytical queries. Arrow's layout seems ideal for them. Combined with querying on GPU it seems to be a complete raze (https://github.com/rapidsai/cugraph, https://www.graphistry.com/).

> Similarly we want to bake immutability into the data model etc, as this is for official statistics.
There is https://inqlab.net/2022-04-14-geopub-datalog-rdf-and-indexeddb.html, which aims to have immutability at the level of data. They've built tooling around immutable RDF (a block store to serialize it, mutable capabilities (akin to OrbitDB + IPNS), content-addressable RDF, and a Datalog query engine over RDF). Sadly they had funding only for a year and now it's stale. The lads did some solid R&D there; it may not be practical, given there are alternative tools from which that architecture could be assembled, but I find it to be a good source of ideas.

I've been curious about content-addressable RDF myself for a while, but it does pose some challenges, and so far it seems to me that an alternative strategy of keeping an immutable log of events may give us the immutable properties we're after, plus some more cool traits on top. At the risk of repeating myself, some are:
• a Clojure atom-like interface for peers to collaborate on an RDF store - so our data does not become half zebra, half horse due to txes performed in parallel; they're serialized
• provenance - given each tx is signed and we keep track of each tx's output, any triple can be drilled down to the tx it came from and who its author is
• as-of (a particular tx of a particular immutable tx log) (however, since logs are CRDTs they may get accreted with new txes that end up before your as-of tx, making an as-of query on that new log return a different result!)
• distributed compute - delegating SPARQL execution to machines that are well suited for it, e.g., akin to how Graphistry can delegate execution of analytical queries to a backend instance running cuGraph, and use the client just to render the found results
• the possibility to derive as many data representations as we want (perhaps tailored per use case, e.g., for efficient analytical queries)

As a side note: interestingly, we could have valid time assoc'ed to txes, so non-valid txes are skipped when deriving a view of the database at some valid time, huh :thinking_face: One surprising outcome is that txes that follow a non-valid tx may now produce a different result (sketched at the end of this message). E.g., say in tx1 user1 got authorized with admin privileges valid until 2024, and in tx2 he gave user2 admin rights (first checking that he's authorized to do so, otherwise the tx is a noop). Then, querying as-of 2024, tx1 gets invalidated and skipped, and tx2 subsequently gets invalidated as well, since there are no admin rights to act on anymore, so now neither user1 nor user2 is an admin. There is another strategy for a similar effect - marking txes as invalid (via another tx).

@U06HHF230, I found https://github.com/Swirrl/grafter/blob/d7aa2bcb93fb3158e59bbdc832bc203ba5f6acde/doc/ideas.org#matchanext-ideas, the EDN data format for expressing SPARQL queries looks nice! Is there value in having an API that returns a SPARQL query string out of it?
I guess it would be nice for handing it over to different query engines. Given matcha is in .cljc you could even use it in the browser with, say, https://comunica.dev/. So it would be possible to use this data format to query RDF with ClojureScript, from the browser. :the_horns:

> I can relate (https://www.hytradboi.com/2022/baking-in-time-at-the-bottom-of-the-database) 😅
I think Freeplane is a great idea - a push for collaborative knowledge engineering built on a robust data foundation. Many tools try to go for it (e.g., Notion), except they fail to recognize the importance of a solid foundation, making yet another walled garden of a mutable nature - a weak fit for humanity's knowledge engineering platform. :man-shrugging: What seems like a better idea is to allow people to have their personal persistent graphs and allow interlinking. Much like the Semantic Web with https://solidproject.org/, if only it were persistent. XTDB resembles that architecture, aside from the interlinking bit (not sure how one could reference an entity from another DB at some t :thinking_face:). I'm amazed how alike our journeys are. 😄 My initial motivation is to build a tool akin to Freeplane, and for that I've been searching for a robust data foundation, and settled on an architecture that closely resembles XTDB's, hah! (perhaps I've been influenced by your vision while reading about XTDB back then) 🙂
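
Coming back to the valid-time side note above, a rough sketch of that cascade (entirely hypothetical tx shapes, not XTDB's semantics): each tx is re-checked against the db derived so far, so skipping tx1 also turns tx2 into a no-op.

;; Toy replay, hypothetical tx shapes: a tx applies only if it is valid at the
;; requested as-of time AND its precondition holds against the db so far.
(def txes
  [{:id :tx1 :valid-until 2024
    :pre    (constantly true)
    :effect #(assoc % :user1 :admin)}
   {:id :tx2 :valid-until 9999
    :pre    #(= :admin (:user1 %))          ; user1 must still be an admin
    :effect #(assoc % :user2 :admin)}])

(defn db-as-of [txes as-of]
  (reduce (fn [db {:keys [valid-until pre effect]}]
            (if (and (< as-of valid-until) (pre db))
              (effect db)
              db))                           ; invalid or failed precondition -> skipped
          {}
          txes))

(db-as-of txes 2023) ;; => {:user1 :admin, :user2 :admin}
(db-as-of txes 2024) ;; => {} - tx1 is skipped, so tx2's precondition fails too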

🙂 2
rickmoynihan08:05:59

Well that's a lot to digest and reply to 🙂

> Is there value in having an API that returns a SPARQL query string out of it?
Yes, there's lots of value in that; but if you want that there is https://github.com/yetanalytics/flint which is great (small sketch at the end of this message).

> Did you have a chance to give HDT a spin for analytical queries?
Regarding HDT: yes, I've looked at it, and played with their visualisation tools a little… but haven't measured its performance. I think it's pretty neat, but I don't think it's actually trying to be a faster index for querying; it seems to me like it's mainly trying to be a more efficient, partially indexed serialisation format. Their benchmarks don't mention the stores they compared against, and when I last looked the indexing they did didn't seem to be particularly novel or state of the art, so I assumed that, though it has some other properties I like, it wouldn't be a solution to analytical performance. The wider set of problems HDT tries to solve are certainly more novel; but it's more a small interesting step into the blue sky rather than being grounded in problems I actually have. I could be wrong of course.

> Content-addressable RDF
Yes, like many others I've been having these thoughts for years too, there are obvious synergies… but it's pretty hard to mesh them coherently without trading off certain assumptions, whilst avoiding opaque identifiers everywhere.

> CRDTs
Yeah, I've thought about this a little too… I think a key choice is at what layer the CRDT exists, and the granularity it operates at. If you want it at the semantic/domain layer you need to characterise updates to domain entities with update semantics that align with the domain modelling and constraints. Different CRDTs have different properties, so you might map a G-Set to one class of entity, and a 2P-Set to another. Regardless, it seemed more of a research project, so I didn't pursue it much further.
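
On the flint suggestion, working from memory of its README, so treat the exact call and output shape as an assumption to verify: you hand it an EDN query map and it emits the SPARQL string.

;; From memory of flint's docs - double-check the README before relying on it.
(require '[com.yetanalytics.flint :as flint])

(flint/format-query
  '{:prefixes {:foaf "<http://xmlns.com/foaf/0.1/>"}
    :select   [?name]
    :where    [[?person :foaf/name ?name]]})
;; => a SPARQL string along the lines of:
;;    "PREFIX foaf: <http://xmlns.com/foaf/0.1/> SELECT ?name WHERE { ?person foaf:name ?name . }"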

❤️ 2
andrewzhurov09:05:00

Thank you for sharing your experience, @U06HHF230!

> I think a key choice is at what layer the CRDT exists, and the granularity it operates at.
This is rightly noted. The approach that intrigues me is where we have one CRDT at the highest level (the tx log). That would make the tx log eventually consistent across the participating peers, allowing for offline-first. Another trait is that we can have immutability at the log level and yet have a familiar mutation interface on top (SPARQL, in this case) - it gives the immutability we've been after with content-addressable RDF, without the downside of opaque identifiers and struggles with updates, where we need to create a new content-addressable structure on every sneeze. I mean, doing it by hand seems tedious, but if we are to have a content-addressable representation derived out of an RDF graph automatically - that sounds alright. There may be value we can get out of having it. E.g., persist the content-addressable representation in IPFS - then we have kind-of persistent graphs, and it may be efficient for syncing between peers. Attaching a sketch of this highly pragmatic architecture. 😄
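
A rough sketch of what I mean (invented entry shape): if the log is a grow-only set of entries with a deterministic sort, merge is just set union, and every peer replays the same order.

;; Rough sketch, invented entry shape: each peer's log is a set of entries;
;; merge = set union; a deterministic total order makes replay converge.
(require '[clojure.set :as set])

(defn merge-logs [& logs]
  (->> (apply set/union logs)
       (sort-by (juxt :valid-time :lamport :author))))

(def peer-a #{{:valid-time 1 :lamport 1 :author "alice" :tx "sparql update A"}})
(def peer-b #{{:valid-time 1 :lamport 2 :author "bob"   :tx "sparql update B"}})

(= (merge-logs peer-a peer-b)
   (merge-logs peer-b peer-a)) ;; => true - both peers converge on the same log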

andrewzhurov09:05:44

A blue-sky side-thought: content-based matching also sounds interesting. I heard XTDB uses it (it treats an entity's id as content, I think). Leaving the entity id out of it would: 1) increase intersection between graphs - if two entities have the same content, no matter their names, they're the same; 2) make names a user-level concept - give as many names to an entity as you like (names become personal dictionaries, akin to how it's done in Unison Lang). It does change SPARQL semantics though, as SPARQL matches by names..
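
A tiny illustration of what I mean (purely illustrative; clojure.core/hash stands in for a real content hash such as a SHA over a canonical serialization):

;; Identify an entity by a hash of its content (name left out); names become a
;; separate, per-user dictionary, Unison-style.
(defn content-id [entity]
  (hash (dissoc entity :name)))   ; stand-in for hashing a canonical serialization

(def e1 {:name "posting-1" :text "hello" :tags #{:clojure}})
(def e2 {:name "my-post"   :text "hello" :tags #{:clojure}})

(= (content-id e1) (content-id e2)) ;; => true - same content, different names

(def my-names {"posting-1" (content-id e1)}) ; a personal name -> content-id dictionary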

andrewzhurov10:05:51

A bit more descriptive sketch (added more sync arrows, so peers end up with the same state of the tx log). The tx log is as though re-frame kept an event log out of which it derives app-db.

rickmoynihan10:05:05

> This is rightly noted. The approach that intrigues me is where we have one CRDT at the highest level (the tx log).
It sounds like you mean at the lowest level?! My issue with putting it into one log at the bottom is how you can capture domain level update constraints here. Merging at the graph level beneath the semantics is trivial, but it doesn't ensure consistency and schema adherence at the domain/entity level. I appreciate also that RDF punts on this somewhat via the OWA; but applications need to care about the constraints, as everything being a many-to-many join is too impractical. So the approach I was thinking of was assuming there is a class of entities which live in this special world… e.g. having some root crdt:property subProperties and classes crdt:Entity. I didn't work through all the details. So I guess in blue-sky terms I'd like RDF applications to be able to leverage CRDTs for updates but to also maintain application consistency and schema adherence through those CRDT updates. These applications would then be constrained subsets of the RDF world.
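
To pin down the G-Set vs 2P-Set distinction being mapped onto entity classes: the two standard merge functions, with an invented class-to-CRDT mapping at the end as an illustration only.

;; Standard CRDT merges. G-Set: add-only, merge is union. 2P-Set: adds plus
;; tombstones; an element removed on any replica stays removed after merge.
(require '[clojure.set :as set])

(defn g-set-merge [a b]
  (set/union a b))

(defn two-p-merge [{a+ :added a- :removed} {b+ :added b- :removed}]
  {:added (set/union a+ b+) :removed (set/union a- b-)})

(defn two-p-members [{:keys [added removed]}]
  (set/difference added removed))

;; Invented illustration of mapping CRDT kinds onto entity classes:
(def class->crdt
  {:crdt/ImmutableFact :g-set    ; statements that only ever accrete
   :crdt/Membership    :2p-set}) ; membership that can be revoked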

rickmoynihan10:05:15

FYI there is a #C09GHBXRC channel.

refset10:05:39

> My issue with putting it into one log at the bottom is how you can capture domain level update constraints here.
> Merging at the graph level beneath the semantics is trivial, but it doesn't ensure consistency and schema adherence at the domain/entity level.
I agree this is the key problem - essentially: how to align the logical and physical models. I'm not sure anything useful can be done generically without strong upfront schema knowledge. I think there's something in this "program synthesis" idea: https://www.shadaj.me/papers/crdt-synthesis.pdf / https://github.com/hydro-project/katara https://www.antidotedb.eu/ is another notable hub of research in this space.

👀 1
rickmoynihan11:05:32

Thanks for the links 🙇 I'll give them a skim when I find time…

> I agree this is the key problem - essentially: how to align the logical and physical models.
Totally agree! That's exactly what I'm getting at. It's also one of the reasons why I felt that, though I want bitemporality, xtdb doesn't quite bake it into the right place for me. I've pondered this issue a bit. I feel like it should be possible, though, by essentially aligning schemas with the update semantics provided by the gamut of CRDTs, by classifying kinds of schema based on the nature of updates/merges they permit, and by having composable/layered schemas. I think the problems are in figuring out the order of layering, such that schema composition can occur monotonically and be aligned with the various notions of time. Anyway, this is all very woolly; there are doubtless some deep foundational problems with what I'm suggesting; it's just a gut feeling. It'd be fun to find some time to think about it properly.

rickmoynihan11:05:39

Also retroactivity of schemas is an interesting area… i.e. it would be nice to tighten constraints on schemas to earlier business times, if all the data is compatible from the time those entity/identities were in place. I think there’s a super interesting area of data structures research which seems relevant here “retroactive data structures”. Erik Demaine (a legend) presents some great papers and MIT courses on the area.

💯 1
andrewzhurov12:05:09

> It sounds like you mean at the lowest level?!
Yes! At the foundational level, shall we say. 😄

> Merging at the graph level beneath the semantics is trivial, but it doesn't ensure consistency and schema adherence at the domain/entity level.
That's a good catch. Perhaps we can transact the schema itself (e.g., OWL or JSON Schema) and accept only those txes that comply with it? That is, run a schema compliance check on the db returned by a tx, and if the returned db does not comply with the schema, the tx is considered invalid and gets skipped.

> Also retroactivity of schemas is an interesting area…
That's interesting! I guess we could transact schema in the past using valid-time. And if we'd like to migrate some tx to the new data model, I guess we could transact a correction tx (in XTDB terms, meaning our new tx will have the same valid-time as the old tx, overriding it). There may be many txes that use the old data model which we'd like to migrate to the new one.. correcting all of them seems burdensome. To avoid that, perhaps instead of a tx log we could have an event log (in re-frame's terms, where events capture the user's intent)

[:add-post "post description"]
[:add-comment <post-id> "comment"]
and supply event handlers separately (be they Clojure functions, akin to re-frame's event handlers, or SPARQL updates). That would give us:
1. the ability to define which events are acceptable in our event log (addressing the domain-consistency concern)
2. the ability to derive different app-db representations as we see fit (be it an EDN map if the event handlers are Clojure fns, or an RDF store if the event handlers are SPARQL updates)
So we have
(derive-app-db event-log event-handlers)
Where event-log is a CRDT, and events are ordered by valid-time, then tx-number (a Lamport clock) (or tx-time), with user-id as the tie-breaker.

> I think there’s a super interesting area of data structures research which seems relevant here “retroactive data structures”.
They're fascinating! I've seen the XTDB docs refer to https://oparu.uni-ulm.de/xmlui/bitstream/handle/123456789/4150/RetroactiveComputing_Mueller2016.pdf?sequence=5&isAllowed=y. It seems valid-time is one approach to getting it. Thanks for all the links!
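
A rough sketch of what derive-app-db could look like under those assumptions (invented event shapes, Clojure event handlers standing in for SPARQL updates, and the schema-compliance check from earlier folded in as valid-db?):

;; Rough sketch only: sort events by (valid-time, lamport, author), fold them
;; through their handlers, and skip any event whose resulting db fails the check.
(def event-handlers
  {:add-post    (fn [db [_ text]]         (update db :posts (fnil conj []) {:text text}))
   :add-comment (fn [db [_ post-id body]] (update-in db [:comments post-id] (fnil conj []) body))})

(defn derive-app-db [event-log handlers valid-db?]
  (->> event-log
       (sort-by (juxt :valid-time :lamport :author))
       (reduce (fn [db {:keys [event]}]
                 (let [handle (handlers (first event))
                       db'    (handle db event)]
                   (if (valid-db? db') db' db)))   ; non-compliant event -> no-op
               {})))

(derive-app-db
  [{:valid-time 1 :lamport 1 :author "andrew" :event [:add-post "post description"]}
   {:valid-time 2 :lamport 2 :author "rick"   :event [:add-comment 0 "a comment"]}]
  event-handlers
  (constantly true))
;; => {:posts [{:text "post description"}], :comments {0 ["a comment"]}}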

👍 1