2024-11-29 rdf | Clojure Slack Archive

rdf 2024-11-29

dazld 2024-11-29T12:37:22.021819Z

For someone who is quite experienced in Datomic, but not at all with RDF: I’m curious about exploring the technology, but at a bit of a loss about where to start. I’ve seen and played briefly with Protege for defining an ontology, but it feels like that might not be quite the right place to start. Ideally I’d like to build up from data, and not from a GUI to start with. Is there a clojure-y starting point that the hive mind here could recommend? Things that I’m specifically interested in are:
• Differences between derived data and ground facts (ty @luke for pointers on terminology there!)
• How querying works across these two worlds
• How to observe changes to both derived and ground data
• Combining open ontologies and custom ones
appreciated!

2024-12-02T09:42:48.361349Z

> I like Rich’s ideas in Datomic, but I think he made a mistake when he didn’t adopt RDF. Datalog is a triple store in all but standards compliance, and adding in schemas to keywords is basically adopting URI schemes that you would in RDF to help differentiate what you’re talking about as you combine data from multiple systems
I have some sympathy for both sides of this argument. I think one of the practical things that Datomic has, that RDF doesn’t really have, is cardinality restrictions on properties at the schema level. In RDF everything is naturally a many-to-many relationship, and that makes something trivial in normal application design trickier to handle in RDF. OWL cardinality restrictions aren’t really the same. Obviously the lack of schema-based cardinality restriction is in many ways what gives RDF its power (open world assumption etc.), but it does make it more awkward to use. SHACL at least offers a standardised way to express this now, but it exists above the base layer, not at the bottom.
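A minimal sketch of the cardinality point above, in plain Python with no RDF or Datomic library. Everything named here is hypothetical: `CardinalityOneStore` only loosely mimics a schema-level cardinality-one attribute, `TripleSet` is a bare set of triples, and `max_count_violations` is in the spirit of a SHACL `sh:maxCount` check rather than an implementation of SHACL.

```python
from collections import defaultdict


class CardinalityOneStore:
    """Hypothetical store where the schema marks some attributes single-valued."""

    def __init__(self, single_valued):
        self.single_valued = set(single_valued)
        self.data = defaultdict(set)  # (entity, attribute) -> set of values

    def add(self, entity, attribute, value):
        key = (entity, attribute)
        if attribute in self.single_valued:
            # Schema-level cardinality: the new value replaces the old one.
            self.data[key] = {value}
        else:
            self.data[key].add(value)


class TripleSet:
    """Hypothetical RDF-like store: just a set of (subject, predicate, object)."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        # No schema at this layer, so every property is effectively many-valued.
        self.triples.add((s, p, o))

    def values(self, s, p):
        return {o for (s2, p2, o) in self.triples if (s2, p2) == (s, p)}


def max_count_violations(store, prop, max_count):
    """Validation layered above the data: subjects with too many values."""
    counts = defaultdict(int)
    for s, p, _ in store.triples:
        if p == prop:
            counts[s] += 1
    return sorted(s for s, n in counts.items() if n > max_count)


one = CardinalityOneStore(single_valued={":email"})
one.add(":alice", ":email", "a@old.example")
one.add(":alice", ":email", "a@new.example")

rdf = TripleSet()
rdf.add(":alice", ":email", "a@old.example")
rdf.add(":alice", ":email", "a@new.example")

print(one.data[(":alice", ":email")])          # only the latest value remains
print(rdf.values(":alice", ":email"))          # both values are kept
print(max_count_violations(rdf, ":email", 1))  # the constraint is reported, not enforced
```

The triple set happily accepts the second value, and the constraint only shows up when a separate validation pass is run, which is roughly the "above the base layer" distinction.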

luke 2024-12-02T16:52:03.923129Z

The way I think about it is that all RDF typing and cardinality rules are fundamentally descriptive: they describe what valid data looks like. There are a lot of significant performance optimizations that are available if you have strict typing or cardinality, but if you bake those things into the very structure of your storage layout then you can't assert what is otherwise technically valid RDF (even if it doesn't match your validity rules). I think there are ways to square this circle and get most of the performance benefits without sacrificing RDF semantics, but they all add implementation complexity.

👍 1

luke 2024-12-02T17:00:59.407819Z

The other option is to only support true RDF semantics on the read side, and just admit that you may not be able to write arbitrary RDF data to storage.

dazld 2024-11-29T14:13:44.990079Z

https://jena.apache.org/tutorials/rdf_api.html am starting here.

2024-11-29T14:18:13.452449Z

For RDF in general, I’d recommend the RDF Primer: https://www.w3.org/TR/rdf11-primer/

👍 2

wikipunk 2024-11-29T15:03:35.681679Z

I like "The Semantic Web for the Working Ontologist", and use Turtle (.ttl files) to start with the data. You can load your Turtle file into Protege and when you edit it you can reload it in Protege. Programmatically the main interface to your data will be SPARQL. AWS has some good examples: https://github.com/aws/graph-notebook https://github.com/aws-samples/rag-with-knowledge-graph-using-sparql
Also recommend using ChatGPT to ask questions about ontologies, you can paste the Turtle text from your file and get assistance. You can then see what works by trying it. Protege will tell you when something is wrong. Also, if you use reasoning in Protege I recommend ELK, it is fast. Reasoning basically takes the "ground facts" (the assertions) and generates inferred facts using the ontologies loaded.
Shameless plug: I work on the MITRE D3FEND ontology, you can load our ontology for cybersecurity into Protege, our main Turtle file is located in src/ontology/d3fend-protege.ttl https://github.com/d3fend/d3fend-ontology
Feel free to PM me questions and good luck 🙂

☝️ 1

🙌 1
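To make the "ground facts in, inferred facts out" description concrete, here is a toy forward-chaining sketch in plain Python. It covers only two RDFS-style rules (subclass transitivity and type propagation) and is nowhere near what ELK or any OWL reasoner does; the class and instance names are invented for illustration and are not taken from D3FEND.

```python
def infer(ground):
    """Apply two RDFS-style rules until no new triples appear.

    Returns only the inferred triples, keeping them separate from ground facts.
    """
    facts = set(ground)
    while True:
        new = set()
        for (a, p, b) in facts:
            if p == "rdfs:subClassOf":
                # Rule 1: subClassOf is transitive.
                for (c, q, d) in facts:
                    if q == "rdfs:subClassOf" and c == b:
                        new.add((a, "rdfs:subClassOf", d))
            elif p == "rdf:type":
                # Rule 2: an instance of a class is an instance of its superclasses.
                for (c, q, d) in facts:
                    if q == "rdfs:subClassOf" and c == b:
                        new.add((a, "rdf:type", d))
        if new <= facts:
            return facts - set(ground)
        facts |= new


# Hypothetical ground facts (assertions).
ground = {
    (":Firewall", "rdfs:subClassOf", ":NetworkDevice"),
    (":NetworkDevice", "rdfs:subClassOf", ":Asset"),
    (":fw1", "rdf:type", ":Firewall"),
}
inferred = infer(ground)


def instances_of(cls, triples):
    """Query helper: which subjects are typed as cls in the given triples?"""
    return {s for (s, p, o) in triples if p == "rdf:type" and o == cls}


print(instances_of(":Asset", ground))             # nothing is asserted directly
print(instances_of(":Asset", ground | inferred))  # :fw1 appears once inference is included
```

Querying "across both worlds" then amounts to choosing whether the query runs over `ground` alone or over `ground | inferred`.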

2024-11-29T17:47:45.505799Z

Symbolic reasoning performed by OWL reasoners takes what is provable (provable in the first-order logic sense, so yes, you can get explanation trees) and makes inferred edges or classes. I like Rich's ideas in Datomic, but I think he made a mistake when he didn't adopt RDF. Datalog is a triple store in all but standards compliance, and adding in schemas to keywords is basically adopting URI schemes that you would in RDF to help differentiate what you're talking about as you combine data from multiple systems.

2024-11-29T17:48:16.379329Z

here's our quick talk (45 mins) on using reasoning: https://www.youtube.com/watch?v=aQ_srH69pFs&ab_channel=Stardog

2024-11-29T17:48:50.710309Z

that's from one of the senior ontologists on my team, so you'll get the textbook definitions that you'll find in the Working Ontologist book, in the W3C specs, etc

👍 1

2024-11-29T17:51:35.970519Z

@veckon I'll be following up with Peter next week on better support for D3FEND in Stardog examples and such

🦾 1

2024-12-03T09:09:35.518949Z

Yup, I agree with those points. I’m really just saying those solutions typically result in complexity being pushed into your implementation or towards the user; and I understand why that’s not always desirable, particularly when for many users the open-world assumption brings only unrealised or theoretical benefits. I say this as someone who has built many RDF systems over the past decade. I love RDF, and haven’t used Datomic. I just think there’s room in the ecosystem for designs like Datomic that make a slightly different foundational design decision, and the trade-offs that come with that.

2024-12-03T09:11:23.443079Z

On this point of alternative almost-RDF databases, has anyone seen this: https://typedb.com/ ? I’m curious what people here think of the design space it may enable.

2024-12-03T13:12:37.128239Z

I tried to make a Groovy DSL that was very similar to that; the result I distilled into GroovySPARQL years ago. I thought it was neat, but inferior to just solid SPARQL with Clojure, or SPARQL in Java/Spring; that model breaks down at the edges as the friction becomes too much.

Max 2024-12-03T13:17:37.160819Z

I investigated using it for my digital humanities knowledge graph project. We need first-class statements (statements can themselves participate in statements) and generally find the idea of predicates that can have multiple arguments appealing. We also need reasoning far beyond what OWL can provide. We liked that it was a full package: data storage, query engine, and reasoner all off the shelf. We don’t have a strong need for RDF compatibility and were planning on wrapping the data layer anyway, so TypeDB’s quirks didn’t bother us. It has an active support forum and Discord group, and Vaticle devs regularly respond to questions.
It definitely has some quirks: their fundamental model is different from what I was used to, and some features aren’t fully fleshed out. So, kind of like working with Datomic, you should expect to do in software some things you might do in the database with a traditional DB, like any value validation beyond single-value regexes.
We ultimately ended up not using it due to the lack of one very specific feature: we want to be able to find the number of different ways a conclusion can be reached via reasoning, for ranking purposes, and they indicated they were not likely to add that feature for performance reasons, which is understandable.
tldr: it’s different, but very powerful and more off the shelf than Jena, and more like a database in that you can’t just open it up and rearrange the innards.

2024-12-03T13:20:43.639139Z

what areas of OWL did you find lacking @max.r.rothman? Typically the two scenarios that people want are inference on negation, i.e. if this thing lacks a property, then infer an edge/class, or inference based on aggregation, i.e. if there are 5 of these, then it's on sale, or whatever. In fact, these two were requested enough that Stardog's latest reasoner is built with stratified datalog to offer these, w/ interop w/ most of the rest of OWL (sorry OWL-DL, nobody likes you)
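The two requested inference styles, negation ("lacks a property") and aggregation ("there are 5 of these"), can be sketched in plain Python as below. This is not Stardog's rule syntax or engine; the predicates, subjects, and the two-stratum split are assumptions made purely to show why such rules are evaluated only after the positive rules they depend on have finished.

```python
from collections import Counter


def stratum_0(facts):
    """Positive rule: anything managedBy X also hasOwner X."""
    derived = {(s, "hasOwner", o) for (s, p, o) in facts if p == "managedBy"}
    return facts | derived


def stratum_1(facts, subjects):
    """Negation and aggregation, evaluated only once stratum 0 is complete."""
    owned = {s for (s, p, _) in facts if p == "hasOwner"}
    unit_counts = Counter(s for (s, p, _) in facts if p == "hasUnit")
    derived = set()
    for s in subjects:
        if s not in owned:
            # Negation: the subject lacks any hasOwner edge.
            derived.add((s, "rdf:type", "Unowned"))
        if unit_counts[s] >= 5:
            # Aggregation: at least five hasUnit edges.
            derived.add((s, "rdf:type", "OnSale"))
    return facts | derived


# Hypothetical ground data.
ground = {
    ("laptop", "managedBy", "it-team"),
    ("printer", "locatedIn", "office"),
    ("pens", "hasOwner", "office"),
} | {("pens", "hasUnit", f"pen-{i}") for i in range(5)}

subjects = {"laptop", "printer", "pens"}
result = stratum_1(stratum_0(ground), subjects)
print(sorted(t for t in result if t[1] == "rdf:type"))
```

If the negation rule ran before `stratum_0`, `laptop` would wrongly be typed `Unowned`, since its owner is only derived; ordering the rules into strata avoids that.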

2024-12-03T13:23:36.005189Z

(and for completeness, the other thing folks ask for is inference to create new individuals, which isn't supported in OWL and we've avoided any urge to build a proprietary extension to offer anything along those lines as the risk of non-termination and unsafe use would be significant.. best to just write a program at that point)

Max 2024-12-03T13:25:25.544839Z

We have a lot of implication relationships, e.g. if these 3 facts are present in a particular relationship (and maybe not this other one) then imply this other fact. I’m no OWL expert, but my impression was that that wasn’t really OWL’s wheelhouse.

2024-12-03T13:28:02.471359Z

the gotcha in that scenario may have been aggregation, but otherwise OWL is fully composable, so you can build up an axiom set that composes: individual property chain axioms on each leg combine into a union on a larger one, and so on, until you arrive at the abstraction you want. That is admittedly going to be advanced OWL usage

Max 2024-12-03T13:29:36.917779Z

We’re a knowledge graph sort of in the CYC and slightly in the expert system lineage (but with a more focused domain), we have a ton of heterogeneous information and want to further enrich it with knowledge context from culture bearers, less like the SNOMED, etc kind of thing where inheritance is the primary mode of reasoning

Max 2024-12-03T13:32:23.964349Z

Right, I don’t doubt that some of the things we want to do are achievable with OWL, but some would not be, or would be complicated to implement, and at that point it seems like it makes more sense to just break out a full general antecedent-consequent style reasoner.

Max 2024-12-03T13:33:07.562099Z

If you’re interested in learning more about my project, we’re at http://klezmerarchive.org

2024-12-05T12:17:09.979379Z

We've offered full OWL support for many years now, and I'll say that there have been very few users who used OWL-DL, sameAs, or even more complex axioms. Most of the time, folks are successful with a lightweight approach that builds easier-to-understand class/property hierarchies, inverse/transitive/symmetric definitions, and usage of Stardog rules. That's the rationale for the Stride reasoner (https://docs.stardog.com/inference-engine/#stride-reasoner-alpha), which will be optimized for these datalog-style inferences, adds in negation/aggregation, and drops the rest of OWL. For things like owl:sameAs, we've opted for a Spark-based ML job that creates probability graphs on matching, enabling data engineering to deal with unifying/cleaning up data. @rickmoynihan would love your feedback on the new Stride rule support (examples: https://docs.stardog.com/inference-engine/user-defined-rules#rule-examples, including negation https://docs.stardog.com/inference-engine/user-defined-rules#negation and aggregation https://docs.stardog.com/inference-engine/user-defined-rules#aggregation).

👀 1

luke 2024-12-05T12:46:31.533739Z

Huh. That's interesting. I would have thought that sameAs would be one of the most useful and widely used rules. Though I guess I spend a lot of my time thinking about federation use cases, which probably don't match the real world at the moment.

2024-12-05T12:55:02.417279Z

@albaker that looks cool, and I think a practical improvement. IIRC I had a feature request against Stardog for something along these lines, where it wasn’t previously possible to run Stardog rules without also running a fuller reasoning profile, and that had costs you didn’t really want to pay… be it from messy data, or less control over query performance.

2024-12-05T13:12:00.123099Z

sameAs is cool, but there are various subtleties/issues with it depending on your use case. IIRC one issue we noted is that owl:sameAs is blind to the identifier; so :rick-m owl:sameAs :richard-moynihan doesn’t express any preference or canonicalisation… it doesn’t favour any symbol for representing that entity, and you can legitimately get either one back for a query. Which essentially meant for us that OWL reasoning / canonicalisation / semantics would be passed downstream, which is a high bar for many data users. For us it would have meant needing to reason and canonicalise earlier in the process, and we didn’t have a place for that at the time, as we were looking towards sameAs to solve a different issue.
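A sketch of the canonicalisation gap described above, using a small union-find in plain Python. The `canonicalise` function and its `prefer` policy are hypothetical, and `:r-moynihan` is a third identifier invented for the example. The point is only that owl:sameAs groups identifiers into an equivalence class, while picking a representative is a separate policy the application has to supply.

```python
def canonicalise(same_as_pairs, prefer=min):
    """Map every identifier to one representative of its sameAs class.

    `prefer` picks the representative from a set of members; owl:sameAs
    itself states no preference, so this is an application-level choice.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # merge the two classes

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)

    return {node: prefer(members) for members in groups.values() for node in members}


pairs = [
    (":rick-m", ":richard-moynihan"),
    (":richard-moynihan", ":r-moynihan"),  # hypothetical third identifier
]

by_min = canonicalise(pairs)              # policy: lexicographically smallest
by_max = canonicalise(pairs, prefer=max)  # a different, equally valid policy
print(by_min[":rick-m"], by_max[":rick-m"])
```

Both policies are consistent with the same sameAs assertions, which is why consumers downstream cannot assume which identifier they will see unless someone canonicalises first.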

2024-12-05T13:13:59.559229Z

we're effectively dropping owl:sameAs: nobody uses it, the performance is not great as the database size grows, and it's really the final step in data cleansing... you had to go through the trouble of figuring out which nodes were equivalent, and then you add the edge. We're favoring the entity resolution service now, a Spark job that produces a probability graph (given a SPARQL query as input) with the probability of nodes matching in the result set. Data engineering can then choose what to do: merge graphs, create Stardog rules for probability ranges, etc

2024-12-05T13:15:01.744699Z

and yeah, those semantic issues also leave it unclear how you want the query engine to behave, so all in all, owl:sameAs has been not that great of a feature

2024-12-05T13:18:07.460199Z

yeah, sameAs really quickly explodes through all the different transitive, reflexive, symmetric properties etc. I did skim some of the literature on processing it but it’s hard to see how it can be done efficiently.
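A rough illustration of that blow-up, as a naive closure in plain Python (a sketch, not how any triple store materialises sameAs): a chain of 9 asserted links over 10 made-up identifiers expands to the full equivalence relation once reflexivity, symmetry and transitivity are applied.

```python
def same_as_closure(pairs):
    """Naively materialise the reflexive, symmetric, transitive closure."""
    closure = set(pairs)
    nodes = {x for pair in pairs for x in pair}
    closure |= {(n, n) for n in nodes}         # reflexive
    closure |= {(b, a) for (a, b) in closure}  # symmetric
    changed = True
    while changed:                             # transitive, to a fixpoint
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        changed = not new <= closure
        closure |= new
    return closure


# Hypothetical identifiers :id0 .. :id9 linked in a chain.
chain = [(f":id{i}", f":id{i + 1}") for i in range(9)]
closure = same_as_closure(chain)
print(len(chain), "asserted ->", len(closure), "materialised")
```

For a single equivalence class of n identifiers the materialised relation has n * n pairs, so the cost grows quadratically with class size even before any other property of those nodes is copied across.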

luke 2024-12-02T00:15:24.230529Z

+1 for "The Semantic Web for the Working Ontologist" to really get the key concepts (and all the other resources mentioned are good)

👍 1

Bart Kleijngeld 2024-12-04T08:08:27.053329Z

@max.r.rothman I don't know how long ago it was that you tried out TypeDB, but they are working on a complete reimplementation in Rust. Perhaps the performance gains will reopen the discussion for adding the feature you found lacking.

2024-12-04T09:00:11.949659Z

Klezmer’s great! ❤️ Sounds like an interesting application of alien tech 🤩

🙏 1

Max 2024-12-04T13:15:06.066989Z

@bartkl I hope so, but I’m not optimistic. The problem isn’t guaranteed to terminate in the general case, though you can ensure it does if you’re careful with your rules. For example, consider a rule whose antecedent is that a path exists between two nodes. If there’s a graph loop anywhere along any potential path, then there are an infinite number of potential paths. Even if you prune loops, the problem changes from N (a path exists) to N^2 (what are those paths). https://forum.typedb.com/t/include-all-explanations-for-an-inferred-relation/297/2 if you’re interested

👍 1
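The difference between "a path exists" and "what are those paths" can be sketched in plain Python. The graph is made up and contains loops on purpose; `path_exists` visits each node at most once, while `simple_paths` prunes loops by refusing to revisit a node within one path. Note that in the worst case the number of loop-free paths can grow exponentially with the size of the graph, so enumeration can cost far more than a reachability check.

```python
from collections import deque


def path_exists(graph, start, goal):
    """Breadth-first reachability: every node is visited at most once."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def simple_paths(graph, start, goal, path=None):
    """Yield every loop-free path; a node may not repeat within one path."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for nxt in graph.get(start, ()):
        if nxt not in path:  # prune loops
            yield from simple_paths(graph, nxt, goal, path)


# Hypothetical graph with loops a <-> b and b <-> c.
graph = {"a": ["b", "c"], "b": ["c", "a"], "c": ["d", "b"], "d": []}

print(path_exists(graph, "a", "d"))
paths = list(simple_paths(graph, "a", "d"))
print(paths)
```

Without the `nxt not in path` check, the loops would let the generator produce ever longer paths and never finish, which is the non-termination concern raised above.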

2024-12-04T13:35:04.852039Z

:think_beret: Interesting thread… Another point you mentioned in that thread is storing potentially contradictory facts. That’s not typically handled well in FOPL, because of the principle of explosion. I’m assuming the contradictions weren’t expressed as logical inconsistencies in that logic system? To handle that sort of thing you’ll need a non-monotonic logic, such that adding new facts can actually remove inferences. Many, many years ago I worked on a defeasible reasoning (https://en.wikipedia.org/wiki/Defeasible_reasoning) system that would reason over these sorts of things.
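A toy illustration of non-monotonicity in plain Python, using the textbook birds-and-penguins default rule. It is a sketch of the idea only, not a defeasible reasoning system, and the facts are invented.

```python
def conclusions(facts):
    """Default rule: birds fly, unless a more specific fact defeats it."""
    derived = set()
    for (s, p, o) in facts:
        is_bird = (p, o) == ("is_a", "bird")
        is_penguin = (s, "is_a", "penguin") in facts
        if is_bird and not is_penguin:
            derived.add((s, "can", "fly"))
    return derived


before = conclusions({("tweety", "is_a", "bird")})
after = conclusions({("tweety", "is_a", "bird"), ("tweety", "is_a", "penguin")})

print(before)  # the default conclusion is drawn
print(after)   # adding a fact has withdrawn it
```

In a monotonic logic `after` would always be a superset of `before`; here the extra fact removes an inference instead of adding one.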

Max 2024-12-04T13:43:41.295899Z

Funny story, that’s why we need the “all the paths” thing. The contradictory facts are in ground data, not reasoner rules (though of course by using rules on that conflicting ground data you can produce inconsistent conclusions). Our approach is to just reason all the conclusions that come out of the ground data we have rather than attempt to produce a single correct conclusion, and to use the number and length of the paths used to produce that conclusion to rank the conclusions (shorter path = closer to ground data = higher rank, more paths = more evidence = higher rank). You seem to have a lot of knowledge in this area, and I’m a (relative) newbie! I’d love to chat more and get your thoughts on these things, but perhaps we should start a new thread
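One way such a ranking could look, as a sketch in plain Python. The scoring formula (sum of 1 / path length over all derivation paths) is an assumption chosen only because it rewards both shorter paths and more paths; it is not the project's actual method, and the conclusions and rule names are placeholders.

```python
def score(derivations):
    """Score a conclusion from its derivation paths (each a non-empty list of steps).

    Shorter paths contribute more, and every additional path adds evidence.
    """
    return sum(1.0 / len(path) for path in derivations)


# Hypothetical, mutually contradictory conclusions with their derivation paths.
conclusions = {
    "conclusion-A": [["r1"], ["r2", "r3"]],  # one short path plus one longer path
    "conclusion-B": [["r4", "r5", "r6"]],    # a single long path
}

ranked = sorted(conclusions, key=lambda c: score(conclusions[c]), reverse=True)
print(ranked)
```

Both conclusions are kept; the ranking only orders them, which matches the idea of reasoning out everything and ranking rather than forcing a single answer.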

2024-12-04T13:51:28.749609Z

👍 sure — I’m not really an expert though. I’ve just maintained a passing interest in it since I worked around the edges of a defeasible reasoning system 20 years ago… and since then I’ve done some work in RDF and tinkered with OWL… though I find OWL quite hard to use in practice.
