#rdf
2022-05-14
abdullahibra06:05:43

What is your preferred open source datastore? Apache Jena?

rickmoynihan08:05:33

Fuseki/Jena is, I think, almost certainly the most popular open source store, and is very actively developed. RDF4j has a native store option too, and it is also actively developed. Both are really good long-running and active projects. Personally I find RDF4j’s code much cleaner and easier to read and use than Jena’s, though Jena tends to have more features etc. There are some limits on the capabilities of the open source triplestores, typically in terms of the number of triples stored (and also performance). Not sure what the limits are these days… but IIRC about 7 years ago we tried storing about 700-800 million triples in RDF4j’s native store, and it would start returning wrong query results… IIRC the developers said they’d tested it up to about 300 or 400 million. The commercial stores tend to do better with large quantities of data… e.g. Stardog.

Bart Kleijngeld09:05:35

To read input RDF files and query the graph this yields, we currently use RDF4j (in Kotlin) and its SPARQL support. Slowly but surely my colleagues are being sold on Clojure, so I'm trying to get some code done to demonstrate the capabilities. Using RDF4j this extensively in Clojure is going to be a massive amount of Java interop, which is far from ideal (that's what we want to move away from in the first place). I was thinking of using only its Rio module for reading RDF files. I can then map the triples in the model to hash-maps easily, moving into Clojure territory. What I would then end up needing is some in-memory RDF graph DB that I can use to query the information for our purposes. I've only just realised this doesn't have to support SPARQL (this is new territory for me), and could very well be some other query language. I've been looking around and:
• I feel @quoll’s Asami might be a good fit.
• I also saw @rickmoynihan’s Grafter, which seems a nice wrapper on top of RDF4j, although I don't know how much it supports, and I noticed the RDF4j version it supports is rather old.
Also, I came across Arachne's Aristotle, and a no-longer-maintained SPARQL query DSL named matsu. I appreciate the help! Oh, and this is going to be an open-source project, so only free (as in "doesn't cost money") projects are an option.
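The "Rio only, then plain Clojure data" plan sketched above might look something like this. This is a hedged sketch: it assumes the RDF4j Rio artifacts are on the classpath, and the map keys are illustrative, not any library's convention.

```clojure
;; Sketch only: assumes org.eclipse.rdf4j rio artifacts on the classpath.
(ns example.rio
  (:import (org.eclipse.rdf4j.rio Rio RDFFormat)
           (java.io FileInputStream)))

(defn file->triples
  "Parse a Turtle file with Rio and return its statements as plain maps,
  leaving RDF4j behind at the boundary."
  [path]
  (with-open [in (FileInputStream. path)]
    (let [model (Rio/parse in "" RDFFormat/TURTLE)]
      (->> model
           (map (fn [st]
                  {:s (str (.getSubject st))
                   :p (str (.getPredicate st))
                   :o (str (.getObject st))}))
           ;; realise before with-open closes the stream
           doall))))
```

From here the maps (or `[s p o]` vectors) can be loaded into whichever in-memory store wins out.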

rickmoynihan08:05:36

👋 Hey! Sorry, I was away on holiday… Grafter is essentially a wrapper over a chunk of RDF4j, mainly, as you say, to simplify the interop. You are right that the RDF4j version it supports is getting a little long in the tooth. There were some breaking changes in RDF4j which caused our tests to fail, which is why just bumping the version doesn’t simply work… though I suspect it will be relatively easy to fix up (it just requires a little time).

In terms of what grafter supports: it has essentially full support for parsing and serialising RDF in any serialisation (via Rio). The abstraction is that you basically send/receive the data via lazy sequences. One of the main things grafter adds is a bidirectional mapping of RDF datatypes into Clojure datatypes, rather than the native RDF4j datatypes… i.e. an xsd:long is just a java.lang.Long, not some RDF4j numeric class that doesn’t support arithmetic. Some grafter protocols then provide ways to get the datatype URI for all these native types etc. There is support for querying various SPARQL repositories (remote via the SPARQL protocol, and in-memory/native-store RDF4j ones) and having the datatypes handled for you properly.

We don’t have a lot of support for RDF4j’s Model APIs… and instead tend to fire the triples we get into an in-memory store of our own (matcha: https://github.com/Swirrl/matcha). Matcha basically targets a similar use case to Asami: essentially being a schemaless in-memory triplestore for querying RDF via basic graph patterns. Unlike Asami it makes no attempt to look and feel like Datomic. It was written at approximately the same time as Asami, and if I’d known about Asami at the time I’d probably have just used it instead.

❤️ 1
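A small illustration of the matcha basic-graph-pattern style described above. The `grafter.matcha.alpha` details here are from memory of the matcha README, so treat the exact arities and namespace as assumptions to verify against the repo.

```clojure
;; Hedged sketch: grafter.matcha.alpha API details from memory of the
;; matcha README; check the repo for the current arities.
(require '[grafter.matcha.alpha :as m])

;; A schemaless pile of triples, e.g. as produced by an RDF parse:
(def triples
  [[:rick :knows :bart]
   [:bart :rdfs/label "Bart"]])

;; select compiles a basic graph pattern query; applying the result to
;; a seq of triples indexes them and runs the query:
(def labels-of-known
  (m/select [?who ?label]
            [[:rick :knows ?who]
             [?who :rdfs/label ?label]]))

(labels-of-known triples)
```

The appeal is exactly what the message describes: no schema, no server, just BGPs over whatever triples you fired in.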
Bart Kleijngeld08:05:47

Hope you had a good holiday! Thanks for the extensive explanation, this helps a lot. I'm currently trying out Asami and am very happy with it so far. However, there are some concerns from colleagues regarding the query language not being SPARQL (and being Clojure), limiting contributors to the project in the long run. I hope to convince them that a Clojure DSL is actually preferable 😉, but just in case I'm curious to know: does Grafter support the SPARQL querying and repositories (both in-memory and others) of RDF4j? It sounds like a very nice approach to wrapping RDF4j; happy to see the project being actively maintained!

rickmoynihan09:05:21

OK, there’s actually quite a lot to dig into here, mostly around what your application and its architecture are. For the work we typically do, we need a real database, so we use a proper SPARQL-based triplestore; there’s generally far too much data to hold in memory. So the pattern we tend to follow is to query the database with SPARQL CONSTRUCT statements, load the triples into an in-memory triplestore, and then query that again locally with many queries to construct the view. There are a number of reasons to do this, but the main one is that remodelling the data as Clojure maps/trees from backend database queries involves remapping terms; and because of normalisation/DRYness issues etc. you end up building a mini database in your data structures, e.g.

{:article-db {:article/1 {:title "blah" ,,,}}
 :articles [:article/1 :article/2]
,,,}
Once you realise this is essentially an intermediate database used to assemble a view, you realise it’s much more direct to just use an in-memory database. Most of these in-memory databases/models aren’t actually SPARQL anyway, so if that’s your use case you’re not really losing much… e.g. in Jena you’d use a Model for this and query it with BGPs (same in RDF4j), or with arbitrary code. Asami and Matcha etc. can be put to this purpose, while you still have SPARQL as your main database. The reason to do this sort of design is that it avoids the network overhead of lots of small queries. You can also make trade-offs: e.g. it is often quicker to slightly over-select data from the database with a simpler query, and then filter it down later client side.
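The CONSTRUCT-then-query-locally pattern might be sketched like this. Everything here is illustrative: `run-construct` and `query-local` are hypothetical stand-ins for whatever your SPARQL client and in-memory store (matcha, Asami, a Jena Model, …) provide, and the property path is just one way to over-select a neighbourhood.

```clojure
;; Illustrative pseudocode for the pattern, not a fixed API:
;; run-construct and query-local are hypothetical helpers.
(defn article-view [repo article-uri]
  (let [;; 1. One network round trip: over-select the article's whole
        ;;    neighbourhood with a single CONSTRUCT.
        triples (run-construct repo
                  "CONSTRUCT { ?s ?p ?o }
                   WHERE { ?article ?anyp ?s . ?s ?p ?o }"
                  {:article article-uri})]
    ;; 2. Many cheap local BGP queries against the in-memory copy to
    ;;    assemble the view, instead of many remote queries.
    {:title   (query-local triples [[article-uri :dcterms/title '?t]])
     :authors (query-local triples [[article-uri :dcterms/creator '?a]])}))
```

The trade-off is the one described above: slightly more data over the wire once, in exchange for no per-field network latency while building the view.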

rickmoynihan10:05:55

You could also look at using flint to generate SPARQL queries against an in-memory RDF4j SPARQL repo. IMHO flint is brilliant, but not yet perfectly suited to scenarios where you want to compose multiple queries. I filed an issue about this here: https://github.com/yetanalytics/flint/issues/24, which also shows how to integrate this sort of thing with grafter (you could easily substitute (repo/sparql-repo ,,,) for an in-memory (repo/sail-repo)).
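For a flavour of flint's data-first approach: the query is plain EDN, which is what makes it nicer to generate than strings. The map shape below is from memory of flint's README, so double-check the current docs.

```clojure
;; Hedged sketch of flint's data->SPARQL translation; map shape from
;; memory of the flint README.
(require '[com.yetanalytics.flint :as flint])

(flint/format-query
 '{:prefixes {:foaf "<http://xmlns.com/foaf/0.1/>"}
   :select   [?name]
   :where    [[?person :foaf/name ?name]]})
;; returns a SPARQL string along the lines of:
;;   PREFIX foaf: <http://xmlns.com/foaf/0.1/>
;;   SELECT ?name WHERE { ?person foaf:name ?name }
```

Because the query is just a map, dynamic queries become ordinary `assoc`/`update` calls rather than string concatenation.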

rickmoynihan10:05:53

To explicitly answer your question though: yes, grafter does support querying remote SPARQL repositories and in-memory ones. It also has some support for SPARQL variable/binding substitution, i.e. you can write a static SPARQL query in a file and then provide one or more bindings for the variables. In my experience this pattern means you can avoid a lot of cases where you might want to generate queries dynamically, which typically makes for maintainable code with a clearer performance profile. There are cases where you do need to generate queries though; in particular SELECT queries where you want to be dynamic in the columns (though you can avoid this by using CONSTRUCTs). If you need to do query generation, something like flint is far preferable to string munging.
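The static-query-file-plus-bindings pattern might look roughly like this. The namespace and function names are my best recollection of grafter-2's API, and the file path and endpoint URL are made up, so treat the whole thing as a hedged sketch rather than a verified example.

```clojure
;; Hedged sketch: names are my best recollection of grafter-2's API.
(require '[grafter-2.rdf4j.repository :as repo]
         '[grafter-2.rdf4j.sparql :as sparql])

;; resources/queries/article-titles.sparql (static, kept under version
;; control, with a free ?article variable):
;;   SELECT ?title WHERE { ?article <http://purl.org/dc/terms/title> ?title }

(with-open [conn (repo/->connection
                  (repo/sparql-repo "http://localhost:5820/query"))]
  ;; Bind ?article at call time rather than generating the query string:
  (doall
   (sparql/query "queries/article-titles.sparql"
                 {:article (java.net.URI. "http://example.org/article/1")}
                 conn)))
```

The payoff described above: the SPARQL stays a static, reviewable artifact, and only the bindings vary at runtime.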

quoll11:05:18

Funnily enough, I have started a SPARQL parser for Asami, but I have a way to go on it. I’m parsing the SPARQL (via Instaparse), but the transforms into Asami’s query language are a lot of work. It's probably a reasonably portable library, if you're interested @rickmoynihan? 🙂

quoll11:05:01

The fact that Asami’s features focus on SPARQL semantics helps there 😊

rickmoynihan11:05:45

So it converts SPARQL strings -> asami edn queries?

rickmoynihan11:05:13

FWIW I already have an instaparse bnf that converts SPARQL into an intermediate EDN based AST… though it’s not currently part of an open sourced lib

rickmoynihan11:05:26

It’s basically lossless; only stripping whitespace and comments

quoll11:05:32

The Instaparse bit was quick. It's the transforms that are taking me time 🙂
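A toy illustration of why the Instaparse half goes quickly: a grammar for a single-variable, single-triple SPARQL fragment is only a few lines (a real SPARQL grammar is vastly larger, and this one is invented for illustration).

```clojure
;; Toy grammar, invented for illustration; a real SPARQL bnf is huge.
(require '[instaparse.core :as insta])

(def mini-sparql
  (insta/parser
   "QUERY  = <'SELECT '> VAR <' WHERE { '> TRIPLE <' }'>
    TRIPLE = TERM <' '> TERM <' '> TERM
    TERM   = VAR | IRI
    VAR    = #'\\?[a-zA-Z]+'
    IRI    = #'<[^>]+>'"))

(mini-sparql "SELECT ?s WHERE { ?s <http://example.org/p> ?o }")
;; yields a nested hiccup-style tree: [:QUERY [:VAR "?s"] [:TRIPLE ...]]
```

As both messages above note, producing that tree is the easy part; transforming it faithfully into another query language's semantics is where the time goes.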

Bart Kleijngeld12:05:04

Great info @rickmoynihan, thanks! That gives me a far better picture of what's possible. I'm definitely going to check out flint

rickmoynihan12:05:04

> The Instaparse bit was quick. It’s the transforms that are taking me time 🙂 Indeed. 🙂

quoll11:05:26

I’m trying to write RDF/SPARQL code for Asami in pure Clojure, but I’ve had extremely limited time lately (mostly due to travel for work and conferences). So for now Asami can only work with RDF if you use external tools for things like Turtle parsing.

👍 1
abdullahibra12:05:40

Can Asami only use Datomic as a backend for the non-in-memory option?

abdullahibra12:05:31

honestly, Asami looks very interesting.

quoll12:05:38

Asami has been designed to be like Datomic, with many backend options. But for now, a local file option is the only one that's implemented.

quoll12:05:20

You wouldn't use Datomic as a backend. But you could use any one of Datomic’s backend options.

quoll12:05:38

The first 2 I’d like to do are Redis and Postgres

Bart Kleijngeld12:05:41

Parsing RDF Turtle (and other serialization) files into triples in Clojure is a breeze using RDF4j. So from there I think I can just load those triples into Asami and go from there 🙂. I checked out your Strange Loop talk: Asami looks great! From what I can tell the query language is also already really powerful. Transitive properties and AND/OR assertions will really help me out. I don't think there's an equivalent to SPARQL's property paths though, right? Where I would say something like ?x (p*/q)+ ?y?
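For the transitive-property part of this, Asami attributes can be walked transitively by suffixing them, which covers the `p+`/`p*` cases even though composed paths like `(p*/q)+` have no direct equivalent. The syntax below is from memory of the Asami wiki, so verify it against the docs.

```clojure
;; Hedged sketch: Asami transitive-attribute syntax from memory of the
;; Asami wiki; verify before relying on it.
(require '[asami.core :as d])

(def conn (d/connect "asami:mem://demo"))

@(d/transact conn {:tx-triples [[:a :linked-to :b]
                                [:b :linked-to :c]
                                [:c :linked-to :d]]})

;; :linked-to+ walks the attribute transitively, like SPARQL's  p+ :
(d/q '[:find ?x ?y
       :where [?x :linked-to+ ?y]]
     (d/db conn))
;; binds every pair reachable in one or more :linked-to hops
```

Composed paths would need to be expressed as multiple clauses or intermediate variables instead.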

quoll12:05:55

If you like Jena (and Andy put in a lot of work to make storage scale better) then you could consider writing a wrapper library around it?

rickmoynihan08:05:44

FWIW I’ve been wanting to add a Jena backend to grafter for a long time. The protocols and namespaces are essentially already abstracted to support this addition. https://github.com/Swirrl/grafter/blob/master/src/grafter_2/rdf/protocols.cljc

quoll12:05:54

I wrote https://github.com/quoll/stardog-clj in 2014. I literally only spent one night on it, and it moved into Stardog's own repo (https://github.com/stardog-union/stardog-clj/blob/9476063ff09d8f012260bdaa034fe112d80ff458/src/stardog/core.clj#L2) the next day. Last week I learned that it's now in active deployment at NASA 🤣

wow 1
quoll12:05:56

I wasn't trying to amaze, but rather to point out that it doesn't take a lot, and that's still enough to go a long way!

abdullahibra12:05:56

I found it amazing: with so little effort you achieved a lot.

Bart Kleijngeld13:05:49

Very nice. That's inspiring indeed 😄, cool story!

Eric Scott12:05:55

@simongray has kindly compiled a set of clojure-based graph resources: https://github.com/simongray/clojure-graph-resources

gratitude 1
👍 3
🙏 1
❤️ 1
Bart Kleijngeld12:05:29

Couldn't have hoped for more. Thanks!