rdf

Bart Kleijngeld 2022-12-05T20:07:00.424829Z

In our company we have chosen an RDF stack to do our data modeling, particularly RDFS/OWL for conceptual models, and SHACL for logical models. Many of our data architects have absolutely no familiarity with it and now need to learn it. It turns out it's hard to get them excited however... "What's wrong with UML?", "Why not simply go for a centralized approach and store everything in a relational DB?" (we are moving away from that, favoring a data mesh approach). I have plenty of reasons why I think a graph-based choice like RDF is preferable over UML/relational for this purpose (I briefly shared my view here a few days ago), but, frankly, I'm actually not very experienced myself in this field, so I was hoping to learn from people here 🙂. I will use your input, among other sources, for a presentation I'm preparing on this. Why is RDF so suited to do data modeling (or isn't it?)? What are problems with UML/relational that RDF does not suffer from? What are caveats of RDF to take into consideration? Etc. Thanks!

2022-12-06T13:18:18.938139Z

@jumar RDF’s utility or OWL’s?

jumar 2022-12-06T15:53:59.096759Z

I guess RDF but perhaps both?

quoll 2022-12-06T16:07:21.719049Z

I’m working with medical data right now. Using public files, which are mostly in tabular format. There are lots of ID codes that refer to foreign datasets. For instance, vaccine manufacturers are provided in files from the CDC (Centers for Disease Control and Prevention). They also reference CVX codes, which are the codes for vaccines. There are also (separate) files that link CVX codes to National Drug Codes (NDC). There are also other systems that connect these codes into SNOMED (Systematized Nomenclature of Medicine). I could put all the data into tables, and create foreign keys between them. In fact, my company has done this in the past. The ELT process is difficult, because these systems tend to model their data differently, and will sometimes take another system’s code and append extra letters to it. You often find yourself having to link many tables, and it can be very difficult to learn the schema to traverse from one part of the dataset to another. It’s a painful mess, but it works.

quoll 2022-12-06T16:09:20.128359Z

Putting the whole thing into RDF simplifies the process significantly. ELT still has to happen, but it’s simplified. If something has different codes in different systems, I can keep both, and just link them. I don’t need to figure out foreign keys. I can traverse across the graph quickly and easily. And change management has become a thing of the past.

👍 3
➕ 1
quoll 2022-12-06T16:10:03.621159Z

There’s nothing particularly complex about this. We’re just gaining utility by using a graph shape for the data instead of tabular
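
A minimal sketch of the "keep both codes and just link them" idea, using plain Python tuples as triples rather than a real triple store. Everything here is an assumption for illustration: the `ex:sameConceptAs` and `ex:manufacturedBy` predicates are hypothetical, and the code values are made-up placeholders, not real CVX, NDC, or SNOMED entries.

```python
from collections import defaultdict

# Triples are (subject, predicate, object). All identifiers and code values
# below are made-up placeholders, not real CVX/NDC/SNOMED entries.
TRIPLES = [
    ("cvx:900", "ex:sameConceptAs", "ndc:00000-0000"),
    ("ndc:00000-0000", "ex:sameConceptAs", "snomed:000000"),
    ("cvx:900", "ex:manufacturedBy", "mfr:ACME"),
]

# Hypothetical linking predicate; a real model would pick one from a vocabulary.
LINK = "ex:sameConceptAs"


def linked_codes(start, triples):
    """Return every identifier reachable from `start` via LINK, in either direction."""
    neighbours = defaultdict(set)
    for s, p, o in triples:
        if p == LINK:
            neighbours[s].add(o)
            neighbours[o].add(s)
    seen, todo = {start}, [start]
    while todo:
        node = todo.pop()
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            todo.append(nxt)
    return seen


def facts_about(start, triples):
    """Collect non-link facts stated about any identifier linked to `start`."""
    ids = linked_codes(start, triples)
    return {(p, o) for s, p, o in triples if s in ids and p != LINK}
```

Starting from the placeholder SNOMED identifier, `facts_about("snomed:000000", TRIPLES)` reaches the manufacturer fact that was only stated against the CVX code, with no join schema to learn. A new code system is added by appending link triples, not by altering tables.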

Bart Kleijngeld 2022-12-06T16:17:37.491379Z

That's a nice example that might appeal to my colleagues.

curtosis 2022-12-06T23:43:58.353829Z

When you say you’re using SHACL for logical models, can you give a (suitably anonymized) example?

Bart Kleijngeld 2022-12-07T06:51:09.139589Z

I'm reading your question in two ways, so I'll just answer it in both 🙂. If you're looking for an example of how we use SHACL to obtain a logical model, it would look something like this:

:CarShape a sh:NodeShape ;
    sh:targetClass vehicle:Car ;      # applies to every instance of vehicle:Car
    sh:property :idShape ;
    sh:property [
        sh:path vehicle:tireCount ;
        sh:datatype xsd:int ;
        sh:minCount 1 ;               # tireCount is required
    ] .

:idShape a sh:PropertyShape ;
    # ...
Targeting the Car class from the vocabulary (conceptual model), you provide logical constraints this way, forming a logical model. If, on the other hand, you're looking for what we use such logical models for, let me try to answer that as well. The idea is to describe our data formally in a (large) conceptual model done in RDFS/OWL, so you can focus just on meaning and relationships under the Open World Assumption. This is great for modeling. From there, use cases in IT arise. Information is selected from the conceptual model, and logical constraints (like above) are added using SHACL. The resulting logical model can then be used to generate all sorts of target schemas (this is basically the project I work on), e.g. JSON Schema, OpenAPI specs, Pydantic models, SQL DDL, you name it (work in progress!). I hope that clarifies it for you.
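
To make the "generate target schemas" step concrete, here is a rough sketch of turning the CarShape above into a JSON Schema. It is not the project's actual generator. The `CAR_SHAPE` dict is a hypothetical hand-written stand-in for a parsed SHACL graph, the `XSD_TO_JSON` mapping is an assumed and partial one, and only `sh:datatype` and `sh:minCount` are handled (an absent `sh:maxCount` is ignored, so multi-valued properties are not modeled).

```python
import json

# Hypothetical in-memory form of the CarShape above; a real pipeline would
# parse the SHACL graph with an RDF library instead of hand-writing this dict.
CAR_SHAPE = {
    "targetClass": "vehicle:Car",
    "properties": [
        {"path": "vehicle:tireCount", "datatype": "xsd:int", "minCount": 1},
    ],
}

# Assumed, deliberately partial mapping from XSD datatypes to JSON Schema types.
XSD_TO_JSON = {"xsd:int": "integer", "xsd:string": "string", "xsd:boolean": "boolean"}


def local_name(curie):
    """'vehicle:tireCount' -> 'tireCount'."""
    return curie.split(":", 1)[1]


def shape_to_json_schema(shape):
    """Translate the supported subset of constraints into a JSON Schema dict."""
    schema = {
        "title": local_name(shape["targetClass"]),
        "type": "object",
        "properties": {},
        "required": [],
    }
    for prop in shape["properties"]:
        name = local_name(prop["path"])
        schema["properties"][name] = {"type": XSD_TO_JSON[prop["datatype"]]}
        # sh:minCount >= 1 means the property must be present.
        if prop.get("minCount", 0) >= 1:
            schema["required"].append(name)
    return schema


print(json.dumps(shape_to_json_schema(CAR_SHAPE), indent=2))
```

The same walk over the shape could emit SQL DDL or a Pydantic model instead; only the per-constraint translation changes.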

curtosis 2022-12-07T14:16:50.919039Z

I was thinking mostly the first, but the second was also super helpful. Thanks!!

🙂 1
curtosis 2022-12-07T14:20:53.408079Z

That’s actually quite relevant to the work that I’m doing, though in some cases I could see it making sense to “generate” (modulo a lot of human knowledge) in the other direction: given a bunch of possibly-overlapping (primarily-)SQL logical models, generate SHACL to describe their relationship to a conceptual model/ontology, possibly expanding/refining the conceptual model as needed.

Bart Kleijngeld 2022-12-07T14:28:53.984889Z

Never considered it that way around. Interesting. Could you elaborate on your use case/work/project perhaps? Some context might make me appreciate what you're doing more.

curtosis 2022-12-07T14:39:23.847219Z

Without getting too specific 😉 sure… A common problem in a lot of large government agencies, especially the “boring” ones, is that they cover several major programs that are kind of related, but have some significant differences in approaches to data, only partially due to simple organizational boundaries. As a hypothetical example, there may be several programs that provide certain benefits or support to households, but because of the way the programs are designed (from a policy perspective) they define “household” quite differently. So there’s a nontrivial challenge in being able to identify which elements of those models are equivalent (and thus commensurable) and which are not. So you want to enable users (primarily but not exclusively analysts) to find the right data and use it correctly, but you can’t realistically do much from the top down to standardize things. We’ve had some success building conceptual models in RDF/OWL, but the connection back to the logical models has always been fairly gauzy.

Bart Kleijngeld 2022-12-07T14:43:30.249489Z

Haha, good to be careful. Interesting. There definitely seems to be some overlap in our use cases. Do I understand correctly that you wish to have all the data represented in RDF ultimately? So that data integration and federated querying (is that what you call it? Still learning) can be done?

curtosis 2022-12-07T15:00:51.095399Z

I don’t think there’s any appetite to put all the data in RDF: for starters, a lot of it really is transactional (and there is a LOT of it*) and it’s not clear** how much of it would benefit from a graph perspective. But having the metadata all integrated in one catalog would be extremely valuable. That said, there are subdomains where the relationship graph would actually be super useful.
* ~1Bn complex actions (dozens of txs per action) per year.
** At least to the business-value folks. Demonstrating it at scale is part of the challenge.

curtosis 2022-12-07T15:03:46.394589Z

We did a demonstration several years back on one of the naturally-graphy domains (~6M primary subjects) and that graph alone was somewhere around 1.5Bn triples.

quoll 2022-12-07T15:06:39.654559Z

Yes… triples grow quickly 🙂

curtosis 2022-12-07T15:25:55.737839Z

we had a Cray graph analytics machine at the time 🙂

quoll 2022-12-07T16:57:24.397409Z

Well, 1.5B triples should fit in main memory 🙂

quoll 2022-12-07T16:57:59.780539Z

A Cray graph machine should barely notice 🙂

curtosis 2022-12-07T17:06:26.444929Z

welllll…. IIRC it was complicated 🙂. Also their triple/sparql implementation was distinctly weird, for performance reasons. (Basically everything was materialized as shared-memory pointers, so it was super fast once you loaded. Also very odd processors: slow clock but rotated through 128 thread slots with zero context switch overhead. https://en.wikipedia.org/wiki/Cray_XMT#Threadstorm4) Intriguing for low-level implementors, interesting performance properties for users.

🥰 1
curtosis 2022-12-07T17:06:59.391189Z

We also had a team that was using it in non-RDF mode for some genomics work. It was fun to have access to.

quoll 2022-12-07T17:16:13.251819Z

I interviewed with them about working with one of these back in 2010, but opted for another opportunity instead. I was definitely curious about it

curtosis 2022-12-07T17:17:26.142659Z

I wish you had, their early RDF implementation was terribad. 😄

curtosis 2022-12-07T17:18:33.551249Z

but I grew up with/on AllegroGraph, so that’s where I am most comfortable.

💖 1
quoll 2022-12-07T17:42:06.690109Z

Well that makes sense, since it’s all in CL!

quoll 2022-12-05T20:34:45.487969Z

UML and relational are orthogonal, IMO. Data that are regular and well defined are appropriate for relational storage. This is especially the case when the most common access mode refers to entire records at once. Data that evolve in structure, are less record-oriented, and have a lot of information through linkages (e.g. tree structures) are much more appropriate for graph storage.

quoll 2022-12-05T20:48:32.644339Z

UML is a modeling language with a long history of modeling software. I don’t know if it prefers an OWA or CWA, though I think it is agnostic to that (because there are systems that convert between UML and OWL). It is typically used for documentation purposes, though there have been some attempts at automating processes with it, hence the development of OCL (Object Constraint Language). OWL is specifically designed for the OWA, which makes it inappropriate for software development, but well suited for data. This is particularly true for data on the web, where not all current data may be accessible at any time, and where data continue to grow. (side note: I hate that “data” is a plural word). OWL has significantly more descriptive capability around relationships than UML has, and consequently allows for better modeling. However, this is a double-edged sword, as relatively few people have exposure to OWL and most are unaware of these capabilities, meaning that they are not used as often as they could be. Importantly, OWL was designed from the outset to be reasoned over, and there are many automated systems for doing exactly this. This, combined with the greater expressivity of OWL, is what has allowed automated reasoning in multiple domains, including medical (SNOMED), pharmaceutical, and financial domains. There are many organizations who rely on these reasoning systems.
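
For readers new to the OWA/CWA distinction, a toy sketch of how the two assumptions answer the same question differently. The facts and the `ex:` identifiers are made up, and this is only an illustration of the idea: a real OWL reasoner can also conclude that something is false when explicit axioms rule it out, which this sketch leaves out.

```python
# Made-up facts; the "ex:" prefix marks hypothetical identifiers.
FACTS = {("ex:alice", "ex:worksFor", "ex:acme")}


def holds_closed_world(fact, facts):
    """CWA: anything not stated is taken to be false."""
    return fact in facts


def holds_open_world(fact, facts):
    """OWA: a missing statement is unknown (None), never false by absence alone."""
    return True if fact in facts else None


unknown = ("ex:bob", "ex:worksFor", "ex:acme")
print(holds_closed_world(unknown, FACTS))  # the closed-world answer
print(holds_open_world(unknown, FACTS))    # the open-world answer
```

A relational database query behaves like the first function; data gathered from the web, where more statements may turn up later, is better served by the second.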

quoll 2022-12-05T20:53:12.043609Z

As some examples, NASA uses RDF/OWL for inventory systems in building spacecraft, SNOMED uses it to automate relationships between medical concepts, Deutsche Bank uses it to automate the detection of money laundering and fraud, and every major pharmaceutical company uses it to identify candidates for drug trials.

quoll 2022-12-05T20:54:02.680429Z

I provided examples to demonstrate that it’s not all hype. It has significant utility.

Bart Kleijngeld 2022-12-05T21:05:09.270429Z

Examples help me out here, thanks! And yes, it does feel awkward that data is plural 😆. Sadly, I think the reasoning capabilities aren't something we're looking to utilize any time soon. We are a large company and wish to model our (business) language, and all the data flowing through and being produced and stored in our systems. Note: even the data itself won't be in RDF, only the data models that need to be conformed to. For now, at least. We like to take a decentralized approach here, a bit like the AAA slogan: anyone can say anything. Embracing OWA, this web-like approach really sounds like a match with RDFS/OWL to me. It's just that I don't know well enough where UML is more limited in ways that matter to us. For instance: I don't know if one can express "subPropertyOf" in UML reasonably, let alone properties as first-class citizens to begin with. That's homework for me I guess, although be my guest if you have anything to say about that too. Thanks as always.

quoll 2022-12-05T21:42:30.659889Z

I didn’t think there was, but I looked it up and… it exists, but it’s ugly

quoll 2022-12-05T21:42:56.712829Z

I don’t know that it’s part of the UML spec though

quoll 2022-12-05T21:46:21.613659Z

Oh, no, apparently it’s legal. You just don’t see it used much

quoll 2022-12-05T21:46:30.002059Z

“Generalization between associations”

👀 1
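
For comparison with the UML construct, a small sketch of what "subPropertyOf" means on the RDF side: if `p rdfs:subPropertyOf q` and `s p o` are both stated, then `s q o` follows. The sketch applies that one rule to a set of tuples until nothing new appears. The `ex:` properties and individuals are hypothetical, and a real reasoner implements many more rules than this.

```python
SUBPROP = "rdfs:subPropertyOf"

# Hypothetical vocabulary and data.
TRIPLES = {
    ("ex:hasMother", SUBPROP, "ex:hasParent"),
    ("ex:hasParent", SUBPROP, "ex:hasRelative"),
    ("ex:bob", "ex:hasMother", "ex:carol"),
}


def infer_subproperties(triples):
    """Apply 'if p subPropertyOf q and (s p o) then (s q o)' until nothing new appears."""
    inferred = set(triples)
    while True:
        # (sub, super) pairs currently known.
        supers = {(s, o) for s, p, o in inferred if p == SUBPROP}
        new = {(s, q, o) for s, p, o in inferred for sub, q in supers if p == sub}
        if new <= inferred:
            return inferred
        inferred |= new


for triple in sorted(infer_subproperties(TRIPLES) - TRIPLES):
    print(triple)
```

Because properties are ordinary resources in RDF, the hierarchy is itself just data: the two `rdfs:subPropertyOf` statements sit in the same set as the fact about `ex:bob`.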
jumar 2022-12-06T07:19:14.889579Z

I’m a complete noob, just watching the discussions here. After reading this https://clojurians.slack.com/archives/C09GHBXRC/p1670273592043609?thread_ts=1670270820.424829&channel=C09GHBXRC&message_ts=1670273592.043609 I’m wondering what are some more boring examples of its utility apart from NASA, fraud detection, and pharmaceutical companies. All of that sounds a bit special…

2022-12-08T16:09:51.557149Z

> A common problem in a lot of large government agencies, especially the “boring” ones, is that they cover several major programs that are kind of related, but have some significant differences in approaches to data, only partially due to simple organizational boundaries.

This is the same thing we find in the world of official government statistics. There’s very little standardisation on statistical concepts or codes, which means data that could be harmonised, isn’t. It seems you’re working at the “micro data” level; but these problems are inherited, and new ones are created, at the “macro” level of official statistics. This is the area I work in.