#rdf
2015-08-27
rickmoynihan00:08:50

@joelkuiper: this is actually going to change - currently Grafter parses a lang string into a reified object - you can str it to get the string, but if you want the tag out you have to (.getLanguage (->sesame-rdf-type my-lang-string-obj)) ... This bit is a bit broken - and we've had a ticket to fix it for a while... it's a pretty simple fix though... The plan is to implement a Literal record type -- so basically a map like @jamesaoverton says -- but with some polymorphic benefits that ensure it can coerce to the Sesame (and maybe one day Jena) types properly... it'll have the string itself, the datatype URI, and if it's a language string a language keyword, e.g. :en / :fr (we use keywords for language tags already and it works well). Right now you can build lang strings with (s "hola" :es)
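
A minimal sketch of what that Literal record could look like (field names and the rdf:langString datatype here are my assumptions, not Grafter's actual API):

```clojure
;; Sketch only -- field and constructor names are assumptions.
(defrecord Literal [string datatype-uri language])

(defn s
  "Build a string literal, optionally language-tagged, e.g. (s \"hola\" :es)."
  ([string]
   (->Literal string "http://www.w3.org/2001/XMLSchema#string" nil))
  ([string lang]
   (->Literal string "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString" lang)))

(:language (s "hola" :es)) ;=> :es
```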

rickmoynihan00:08:43

@joelkuiper: @jamesaoverton: just reading your discussion -- remember SPARQL 1.1 doesn't really have FULL support for quads... i.e. you can't CONSTRUCT a quad... the CONSTRUCT template only takes triple patterns (GRAPH only appears in the WHERE clause)... I personally think this is a real shame, as there are quad serialisation formats (e.g. TriG/TriX etc.). This might be why you can't just get quads from a model

rickmoynihan00:08:38

@joelkuiper: just looking at your type coercion code -- Grafter also has both a Triple record and a Quad record... and I consider this a mistake... one we're going to undo. I think you really only want (defrecord Quad [s p o g]) - then create a constructor function for triple which returns a Quad with a nil :g. Otherwise you'll get into a load of bother where #Quad {:s 1 :p 1 :o 1 :g nil} is not= to #Triple {:s 1 :p 1 :o 1}
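
A minimal sketch of the single-record approach, with a throwaway Triple record included only to show the equality pitfall:

```clojure
(defrecord Quad [s p o g])
(defrecord Triple [s p o])   ; only here to demonstrate the problem

(defn triple
  "Construct a 'triple' as a Quad with a nil graph."
  [s p o]
  (->Quad s p o nil))

(= (->Triple 1 1 1) (->Quad 1 1 1 nil)) ;=> false -- different record types
(= (triple 1 1 1)   (->Quad 1 1 1 nil)) ;=> true  -- one type, no mismatch
```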

rickmoynihan00:08:02

simply because they're different types

rickmoynihan00:08:24

obviously it's easy enough to resolve but it's a small pain... - yes there are still problems with this model where RDF semantics don't map directly onto Clojure value semantics... However we recently had a discussion on the Sesame developers mailing list where we convinced the core committer to change Sesame's policy to use value semantics when testing equality - rather than RDF-style equality, where a quad of :s1 :p1 :o1 :g1 .equals a triple of :s1 :p1 :o1. This should be coming in a future release... Not sure what Jena's policy is here

rickmoynihan00:08:49

@jamesaoverton: early on in the first version of Grafter - because I initially wanted a terse syntax for expressing triple patterns - I also chose to represent URIs as strings - as URIs are the primary data type - so in Grafter string literals have to be built with the s function. Again, this is something I'm going to change -- raw Java strings should probably not automatically coerce into RDF - or if they do, they should do so to RDF strings in the default language... any Java URI type you might reasonably use should probably be made to work.

jamesaoverton00:08:34

Yeah, well… It’s something I’ve thought a lot about, and in the end I really like working with plain, literal EDN data everywhere I can. For what I do, IRIs are opaque, and I don’t need to get their protocol or query params. So I end up using strings for IRIs, keywords for CURIEs/QNames, and maps for Literals.
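
Concretely, that style might look something like this (illustrative values, not necessarily edn-ld's exact shapes):

```clojure
;; Strings for IRIs, keywords for CURIEs/QNames, maps for literals.
{:subject   "http://example.com/people/alice"   ; IRI as a plain string
 :predicate :foaf/name                          ; CURIE as a keyword
 :object    {:value "Alice" :lang "en"}}        ; literal as a map
```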

jamesaoverton00:08:23

I work with EDN as long as possible, and only convert to other formats at the very end.

rickmoynihan00:08:25

yes - records can add noise - but I think you can actually override print-method to print them shorter e.g. you might be able to do #URI ""
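
A sketch of that shorter printed form, assuming a simple hypothetical URI record (not an existing library type):

```clojure
(defrecord URI [uri])

;; print-method is the multimethod Clojure's printer dispatches on by class.
(defmethod print-method URI [^URI u ^java.io.Writer w]
  (.write w (str "#URI \"" (:uri u) "\"")))

(pr-str (->URI "http://example.com/")) ;=> "#URI \"http://example.com/\""
```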

jamesaoverton00:08:35

Then you need to provide a reader function to cast the string to that type. I’m glad EDN has typed literals, but I haven’t found that they’re worth the hassle.
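
...and the matching reader side, if you do want to round-trip it (assumes the URI record from the sketch above):

```clojure
(require '[clojure.edn :as edn])

;; Map the URI tag to a constructor when reading the EDN back in.
(edn/read-string {:readers {'URI ->URI}} "#URI \"http://example.com/\"")
;=> (->URI "http://example.com/")
```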

rickmoynihan00:08:14

yes I know - it definitely adds some friction

jamesaoverton00:08:37

I think that Transit has a native URI type, which would be more convenient.

rickmoynihan00:08:21

ooo interesting

rickmoynihan00:08:32

what exactly do you use edn-ld for, @jamesaoverton?

jamesaoverton00:08:21

The library itself is a recent refactoring of some patterns I’ve developed over the last three years. So I’ve only used that particular code in a few projects so far, but I’ve used its predecessors in a larger number of projects.

jamesaoverton00:08:57

And although I’m allowed to share those other projects, I’ve never had the time to clean them up and put them on GitHub...

jamesaoverton00:08:55

But this is an example of some of the stuff that I do: https://github.com/jamesaoverton/MRO

jamesaoverton01:08:02

The Clojure code takes a table from an SQL database that contains a very dense representation of MHC class restrictions, AKA some biology stuff.

jamesaoverton01:08:12

The goal is to convert that table into an OWL ontology. The ontology has several branches, with specific relationships.

jamesaoverton01:08:52

There’s an Excel spreadsheet that specifies templates for different branches at different levels.

jamesaoverton01:08:25

Then I read the source table and the template table, and zip them together into a sequence of maps defining OWL classes.

jamesaoverton01:08:45

Finally, I convert that EDN data into an RDF/XML file.

rickmoynihan01:08:35

what makes it EDN, rather than just CLJ? 🙂

jamesaoverton01:08:57

There are really two parts. The first is ripping the source table into a number of branch-specific tables. Then I use ROBOT, a Java tool I wrote, to convert those tables to OWL.

jamesaoverton01:08:18

It’s not the best example, but it’s on GitHub.

jamesaoverton01:08:56

To answer your question: I’m pretty convinced by this “Data are better than Functions, are better than Macros” thing that Clojure people talk about.

jamesaoverton01:08:58

The MRO project doesn’t use the EDN-LD library because it’s for OWL and not just RDF. I haven’t figured out a general way to describe OWL in EDN, but I’ve been talking to Phil Lord about it.

rickmoynihan01:08:26

yeah Phil and I have spoken in the past too

rickmoynihan01:08:28

what in the MRO example is data?

rickmoynihan01:08:46

that's not functions/macros, just general Clojure

jamesaoverton01:08:45

The source table from SQL, and the Excel spreadsheet under src/mro. Those are converted to all the branch-specific CSV files at the top level.

rickmoynihan01:08:21

sorry, I was meaning: where is the data, in the EDN-LD "Data > Functions > Macros" sense? - presumably by that you meant that EDN-LD represents transformations as Clojure data? Not symbols/functions/macros

jamesaoverton01:08:23

The previous version of the MRO code had a separate function for each level of each branch.

jamesaoverton01:08:50

EDN-LD is mostly just conventions for representing RDF in EDN, and then some functions for working with those representations.

rickmoynihan01:08:04

and now you have a map - essentially in place of a cond?

jamesaoverton01:08:56

In the MRO example, there’s a sequence of maps representing templates, and a sequence of maps from the source table (SQL). Then the smarts are in the apply-template function, which applies each template to each row of the source table.
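
A toy sketch of that shape (names and data invented; MRO's real apply-template is richer than this):

```clojure
(def templates
  [{:branch "MHC protein complex" :label-format "%s protein complex"}])

(def source-rows
  [{:branch "MHC protein complex" :name "HLA-A*02:01"}])

(defn apply-template
  "Apply one template to one source row, yielding a class-defining map."
  [template row]
  {:branch (:branch row)
   :label  (format (:label-format template) (:name row))})

(for [row      source-rows
      template templates
      :when    (= (:branch template) (:branch row))]
  (apply-template template row))
;=> ({:branch "MHC protein complex", :label "HLA-A*02:01 protein complex"})
```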

jamesaoverton01:08:18

So there’s a smaller number of higher-level functions, in the end, and I find it easier to reason about.

rickmoynihan01:08:01

for what it's worth - your MRO code seems broadly similar to Grafter pipelines... in that you have a sequence of rows which you effectively process in row form... and then templatize. Is that fair?

rickmoynihan01:08:13

oh sorry you just said that

jamesaoverton01:08:40

I agree with that.

rickmoynihan01:08:53

Grafter basically works the same

jamesaoverton01:08:28

In the MRO case, the Clojure code is table-to-table, then ROBOT (my Java tool) is used for the table-to-OWL part.

jamesaoverton01:08:00

At the end of the day, pretty much all the code I write is a pipeline. :^)

rickmoynihan01:08:46

same for a lot of the stuff we do

jamesaoverton01:08:58

Some day I’ll publish a cleaner example :^)

rickmoynihan01:08:02

that and tools around them

jamesaoverton01:08:41

You made a good point about Quad equality above. I’ll think more about that.

jamesaoverton01:08:55

It was good talking, but I’ve got to go now.

joelkuiper08:08:00

so as far as I’m aware there’s no real way to use SPARQL 1.1 to get quads, but there might be in the future, so I’ll just leave it nil I guess.

joelkuiper08:08:48

As far as type/data coercion goes… well I don’t really want to invent another class/type model for RDF. So I’ve chosen to represent results/triples as simple maps and records, with strings for URIs and JSON-LD-ish maps as best I can for the rest. If that’s not your cup of tea you can always just use the Jena objects 😉 and forget about the lazy-seq stuff 😛

joelkuiper08:08:15

If Commons RDF solves this problem I might consider implementing that, but for now it’s just too much of a mess to match the RDF semantics to Clojure, and the simplest thing I could think of was {:type "typeURI" :lang "@lang" :value "Jena coerced POJO"}

joelkuiper08:08:01

or a string for the URI. I may consider wrapping that in a java.net.URI though, bit unsure still

rickmoynihan08:08:46

@joelkuiper: I'd be tempted to go with a record for Quads and Literals... it makes writing and extending coercions easier (admittedly you can use a multimethod for this too -- but you'll probably just end up dispatching on type anyway, and you can always use a multimethod on a record too if you want)... Also multimethod dispatch is quite a bit slower than record dispatch... and you'll probably end up dispatching on millions of quads
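
A sketch of the record-plus-protocol style being suggested (names illustrative): a protocol call compiles down to an interface call on the record, whereas a multimethod recomputes its dispatch value on every invocation.

```clojure
(defprotocol ICoerceRDF
  (->backend [this] "Coerce a Clojure RDF value to a backend (Sesame/Jena) type."))

(defrecord Literal [value datatype lang])

(extend-protocol ICoerceRDF
  Literal
  (->backend [l] (str (:value l)))  ; placeholder -- real code builds a backend Literal
  String
  (->backend [s] s))

(->backend (->Literal "hola" nil :es)) ;=> "hola"
```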

rickmoynihan08:08:06

when users come to process results

joelkuiper08:08:07

well, Triples 😛 since there’s no real way of getting Quads 😉

joelkuiper08:08:04

So a Literal of [type, value, lang] -> [String, Object, Keyword] or something?

rickmoynihan09:08:50

type => String, value => String, lang => Keyword

joelkuiper09:08:47

why value as a string?

rickmoynihan09:08:02

there might not be a way to query for a Quad -- but I think on the processing side it makes sense to have a quad -- because you can set :g to non-nil yourself and serialise N-Quads etc. more easily

joelkuiper09:08:25

Jena has excellent support for coercing a lot of the XSD types into Java objects
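
For example, a hedged interop sketch (Jena 3.x package names shown; Jena 2.x used com.hp.hpl.jena instead):

```clojure
(import '[org.apache.jena.rdf.model ModelFactory])

;; Jena parses the lexical form against the XSD datatype and hands
;; back a Java object rather than a raw string.
(let [m   (ModelFactory/createDefaultModel)
      lit (.createTypedLiteral m "42" "http://www.w3.org/2001/XMLSchema#int")]
  (.getValue lit)) ;=> 42 (a java.lang.Integer, not a String)
```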

rickmoynihan09:08:55

ahh ok sorry - by Object you mean Integer/Float/Double/Date etc...

joelkuiper09:08:57

that’s a fair point

rickmoynihan09:08:58

then yes I agree

rickmoynihan09:08:14

definitely coerce the types out where you can

rickmoynihan09:08:23

but where you can't you'll need to fall back to string

joelkuiper09:08:29

right, that’s what I do now

rickmoynihan09:08:44

that's what we're doing with Grafter

joelkuiper09:08:23

yeah I saw that 🙂

rickmoynihan09:08:26

did you read the stuff I wrote here earlier about Triple/Quad equality etc?

joelkuiper09:08:58

yup, interesting stuff; I’ll probably change it to Quad for those reasons. Makes sense

rickmoynihan09:08:19

It's definitely a trade-off -- but I think it's the better one

joelkuiper09:08:58

could also just use a map I guess

rickmoynihan09:08:41

yes but it'll have the same issues -- i.e. (= {:s :s1 :p :p1 :o :o1 :g nil} {:s :s1 :p :p1 :o :o1}) => false

joelkuiper09:08:33

yeah, that’s true. it’s a silly problem 😛

rickmoynihan09:08:01

it's not a big deal - it's just annoying -- and can cause hard-to-find bugs

joelkuiper09:08:43

it’s one of those things that would be easy enough to solve with a custom equals method though

rickmoynihan09:08:27

yes but I think it's more pragmatic to retain value semantics

joelkuiper09:08:51

I’ve gone back and forth on that topic in Java projects; either can create hard-to-find bugs, especially if done inconsistently across developers 😛

rickmoynihan09:08:49

yeah it definitely depends on what you're doing

rickmoynihan09:08:10

but I think programming with values is generally better

rickmoynihan10:08:51

@joelkuiper: any reason to use "@en" strings rather than :en keywords for language tags? (I know obviously that SPARQL and various serialisations represent them that way...)

rickmoynihan10:08:25

keywords share memory when you have lots of them
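
A quick illustration of the interning point (plain Clojure, nothing library-specific):

```clojure
;; Keywords are interned: a million :en tags share one object.
(identical? :en (keyword "en"))     ;=> true
(identical? "@en" (String. "@en"))  ;=> false -- distinct String objects
```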

joelkuiper11:08:45

no strong opinion, it’s closer to JSON-LD

joelkuiper11:08:46

which is nice

joelkuiper11:08:22

switched it to keywords 😉, probably the last I’ll work on it for the week at least!

quoll15:08:14

I want to think on it some more, but I agree that we should have: (not= {:s :s1 :p :p1 :o :o1 :g nil} {:s :s1 :p :p1 :o :o1})

quoll15:08:26

rather than a custom = function, I’d like to see another function that explicitly calls out that it’s handling some kind of equivalence instead

quoll15:08:11

such as: (equiv {:s :s1 :p :p1 :o :o1 :g nil} {:s :s1 :p :p1 :o :o1})

rickmoynihan15:08:46

I personally think it's better to have one type - even if it has a nil field a lot of the time - instead of two types for essentially the same thing

rickmoynihan15:08:01

I think it's a good idea to have a custom equivalence function that implements RDF semantics

rickmoynihan15:08:53

so the not= case won't arise in normal usage

quoll16:08:24

on the second point, yes. Clojure needs to have = semantics that are separate to what is needed for RDF

quoll16:08:23

for instance, I want to be able to say things like: (matches {:s s1 :p p1 :o o1} {:s s1 :p p1 :o o1 :g g1})

quoll16:08:03

because the triple in the first arg does match the triple-in-a-graph found in the second arg
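
One possible reading of that matches equivalence, as a sketch rather than a settled API: two statements match when they agree on every non-nil key they have in common.

```clojure
(require '[clojure.set :as set])

(defn matches
  "True when a and b agree on every non-nil key present in both."
  [a b]
  (let [strip (fn [m] (into {} (remove (comp nil? val)) m))
        a     (strip a)
        b     (strip b)
        ks    (set/intersection (set (keys a)) (set (keys b)))]
    (= (select-keys a ks) (select-keys b ks))))

(matches {:s :s1 :p :p1 :o :o1} {:s :s1 :p :p1 :o :o1 :g :g1}) ;=> true
```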

rickmoynihan16:08:41

@quoll: I think the best thing is to have a Quad record -- with a triple constructor - that essentially returns you a nil in the :g

rickmoynihan16:08:15

so (matches (triple :s1 :p1 :o1) (quad :s1 :p1 :o1 :g1)) => true

quoll16:08:29

it’ll depend on usage. I’ve never needed quads, except when storing multiple graphs in a single file

quoll16:08:45

I’m a “triples” person myself 🙂

rickmoynihan16:08:39

we use both 50/50 - one representation simplifies things for everyone... if you don't care about the nil :g, you can just ignore it...

quoll16:08:55

when I say “storing”, I also mean “loading”, since you get quads back when you read, and they need to go to various graphs

rickmoynihan16:08:00

the Quad record will seamlessly coerce into a Sesame/Jena triple/quad respectively

rickmoynihan16:08:22

yes -- we use quads a lot -- because most of our work is writing pipelines that generate RDF... and we usually want to derive the graph from the data we're loading in

quoll16:08:52

and you’re working with multiple graphs at once?

rickmoynihan16:08:21

the fact you can't in other tools is one reason we created Grafter

rickmoynihan16:08:43

we have tens of thousands of graphs

quoll16:08:56

ah. You’re one of those 🙂

rickmoynihan16:08:44

we manage lots of data for many customers

rickmoynihan16:08:23

so a lot of the time it's out of our hands

rickmoynihan16:08:09

graphs are also very useful for managing data

quoll16:08:19

most RDF stores are optimized around triples, and then group statements into graphs. Those that treat graphs as an equal part of the quad take a small performance hit, and it often seems unjustified given that SPARQL treats graphs so differently

quoll16:08:34

yes, I completely agree that graphs are great that way

rickmoynihan16:08:54

@quoll: having used Fuseki, Sesame, Stardog, BigData and GraphDB/OWLIM I can say that statement's not true in my experience

rickmoynihan16:08:39

on many stores you have to use graphs to get acceptable performance

quoll16:08:59

I may not have been clear in what I was trying to say

rickmoynihan16:08:05

I agree that it's unfortunate SPARQL only half implements graphs though

quoll16:08:27

when RDF stores are storing data on disk, many of them will use a scheme that is based around subject/predicate/object. Graphs then get implemented as a separate structure (e.g. separate index files, or an index that refers to statements as a group, but not allowing arbitrary selection of subject/predicate/object/graph as single step index lookups).

quoll16:08:49

Some stores do allow arbitrary lookup for quads

quoll16:08:58

but then SPARQL hamstrings it

quoll16:08:26

I mean, you can still work with it, but SPARQL presumes that you’ll be selecting only a couple of graphs, and working with triples from them. The syntax gets messier if you treat graphs as just another element of the quad

quoll16:08:57

ironically, the stores that index symmetrically on the quad can handle the operations just fine. It’s SPARQL syntax that gets in the way

quoll16:08:25

but because of this bias, many stores don’t index symmetrically around the quad

quoll16:08:54

that’s usually OK, because many applications don’t ask for lots of graphs like that

quoll16:08:10

but some do… hence my statement that you’re “one of those” 🙂

rickmoynihan16:08:50

@quoll: yes you're right -- sorry, was misunderstanding what you were saying... Yes that's definitely true... Graph performance can be spotty on some stores... I know - because we have some automatically generated queries which have well over 1000 graph clauses

rickmoynihan16:08:21

but we actually sell a linked data management platform -- so it's unavoidable -- we frequently push the limits and assumptions of every triple store

quoll16:08:20

I can’t recall now which stores index symmetrically around quads. I know ours does, but it’s in dire need of some love, and doesn’t even handle SPARQL 1.1 (i.e. indexing is great, but query/update functionality is not)

quoll16:08:01

I think that the default indexing in Jena is symmetric

quoll16:08:19

I should ask Mike about Stardog though

quoll16:08:56

I’ve never contributed to the internals of Stardog (for obvious reasons). And the Clojure adapter was just a client

rickmoynihan16:08:09

I'm guessing Stardog does

quoll16:08:24

I thought it did

quoll16:08:32

I can ask… hang on

rickmoynihan16:08:59

what store do you work on?

quoll16:08:51

or rather… I did

quoll16:08:00

I’ve been busy 😕

rickmoynihan16:08:50

ahh yes I've been to this site before! 🙂

quoll16:08:28

Well… busy life, plus the fact that I’d been on it for over a decade. I’ve been trying new things lately

rickmoynihan16:08:12

ahh you're the guy that implemented an RDF store on Datomic... I had that same thought the moment Rich released it... How did it go?

quoll16:08:43

it’s been good, though I put it aside for other stuff. I’m trying to pick it back up again actually

quoll16:08:05

Datomic is implemented in a very similar way to Mulgara’s indexes (persistent trees), so it seemed natural to me

quoll16:08:53

OK, Al doesn’t know. He said I should ask Mike directly 🙂

quoll16:08:41

Mike is fun to talk to about this stuff, but I only have him on email, not IM 🙂

rickmoynihan16:08:33

Yes Mike and I have exchanged emails... they have a Gitter channel now

rickmoynihan16:08:13

what Datomic schema does Kiara use?

rickmoynihan16:08:41

does it implement a schema for triples/literals - or does it somehow use vocabularies for a Datomic schema?

quoll16:08:33

literals are done in 2 ways

quoll16:08:07

if they’re simple text or using one of a few xsd datatypes then they’re stored as native values (strings, longs, doubles, floats, dates, URIs)

quoll16:08:09

anything else, and they become a structure with properties for value (a string) and datatype (a URI, since there aren’t any IRIs in xsd datatypes)
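
A sketch of those two storage shapes (attribute names invented; Kiara's actual schema may differ):

```clojure
(def native-datatypes
  #{"http://www.w3.org/2001/XMLSchema#string"
    "http://www.w3.org/2001/XMLSchema#long"
    "http://www.w3.org/2001/XMLSchema#double"
    "http://www.w3.org/2001/XMLSchema#dateTime"})

(defn literal->datomic
  "Simple literals stay native values; anything else becomes a
  value/datatype structure."
  [{:keys [value datatype]}]
  (if (or (nil? datatype) (native-datatypes datatype))
    value                                   ; stored as a native Datomic value
    {:literal/value    (str value)
     :literal/datatype (java.net.URI. datatype)}))
```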

quoll16:08:58

RDF properties get scanned for the values that they refer to, and the most general type required is found

quoll16:08:58

this is because if you have a property of my:value and it refers to an xsd:long, then it’s a very rare schema that requires that property to also refer to a string, or something else

rickmoynihan16:08:52

yes I'd say that's a fair assumption

quoll16:08:10

but if that DOES happen, then the type for the property in the Datomic schema is set to refer to a structure, and that structure then refers to the final value, using different property names for each type

quoll16:08:23

that’s a corner case, but it makes querying more complex 😕

quoll16:08:12

I think I need to change how subjects work though

rickmoynihan16:08:46

what's the performance on Datomic like?

rickmoynihan16:08:08

is there any hope of it being competitive?

quoll16:08:25

for now, if they’re IRIs then I convert to QNames (ruthlessly, if necessary) 🙂 then convert the QNames to keywords and use those as the entity IDs. This works, but it uses RAM.
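
A toy sketch of that IRI-to-keyword conversion (the prefix map is illustrative):

```clojure
(def prefixes {"http://xmlns.com/foaf/0.1/" "foaf"})

(defn iri->keyword
  "Convert an IRI to a QName-style keyword using a prefix map."
  [^String iri]
  (some (fn [[ns-iri prefix]]
          (when (.startsWith iri ns-iri)
            (keyword prefix (subs iri (count ns-iri)))))
        prefixes))

(iri->keyword "http://xmlns.com/foaf/0.1/name") ;=> :foaf/name
```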

quoll16:08:52

I have not pushed it to big datasets yet

quoll16:08:40

Most of the big sets are in RDF/XML (which I despise), and I really want to avoid Jena (I love those guys, but Jena is bloated), so I’ve started on an RDF/XML parser in Clojure

quoll16:08:01

I have a decent Turtle parser though, and that seems OK

quoll16:08:07

but I haven’t loaded anything really big through it

rickmoynihan16:08:12

does it work with large files?

quoll16:08:49

that’s another thing. Datomic recommends that you don’t try to do really big loads. They recommend chunking it up. That’s easy in Turtle, but not so much with RDF/XML

rickmoynihan16:08:19

the Jena folks do a good job - if you want a standards-compliant, free store... but yeah the codebase is a mess... Sesame's code is so much better to work with

quoll16:08:20

besides that, I hate the idea of multiple transaction points at arbitrary locations in a load. But it’s pragmatic, so I guess I need to

quoll16:08:33

yes, I’ve contributed to Jena

rickmoynihan16:08:05

@quoll: yeah chunking sucks

quoll16:08:45

Mulgara is actually faster if you don't

quoll16:08:16

annoyingly people would chunk their data, and then get annoyed at Mulgara for performing badly

quoll16:08:53

but every chunk becomes a new transaction, which means that it requires a new root to the persistent tree

quoll16:08:14

if you load 1M triples, then you just have a simple tree

rickmoynihan16:08:41

so I'm guessing you need to reindex if that happens

quoll16:08:06

if you load 100K triples 10 times, then you end up with most of the nodes in the first tree being duplicated while inserting the second 100K, and so on for each chunk

quoll16:08:23

actually, Mulgara does not do background indexing (which is something I started work on, but never finished)

quoll16:08:38

so when it’s finished loading, it’s fully available

quoll16:08:46

but that makes loading slower

quoll16:08:49

Stardog, for instance, loads immediately into a linear file, and then moves those triples (or quads) into the indexes in the background. Querying looks in both the indexes (fast) and the linear file (slow).

quoll16:08:04

So loads are lightning fast, but querying sucks for a while

quoll16:08:20

the longer you wait, the faster the querying gets

quoll16:08:42

anyway, Mulgara isn’t as complex, but it does not need reindexing

quoll17:08:53

Just got a response on twitter: yes, Stardog is symmetrically indexed (I thought it was)

rickmoynihan17:08:43

that's interesting

joelkuiper17:08:48

Hey! Thought you might also be interested in this channel, we’ve also been discussing some YeSPARQL-related things 🙂

wagjo17:08:22

Definitely! Thanks for inviting me.