#rdf
2020-10-13
rickmoynihan08:10:00

@steven427 Ok I asked some friends and colleagues about some of the bigger projects in the heritage/museums/library/arts/humanities space. Here are some more:
• The British Museum's catalog (this is the one I remembered but couldn't quite find): https://www.britishmuseum.org/collection/object/W_1892-0516-351-a (looks like they've hidden or removed their public SPARQL endpoint, but the structure of the collections is clearly SKOS; I have it on good authority they're also using CIDOC, which I mentioned in the thread)
• The Library of Congress: https://id.loc.gov/
• Europeana, a huge cross-Europe collaboration to connect the European heritage sector through various projects based around linked data etc.: https://pro.europeana.eu/ See https://www.europeana.eu/en for one of their many big projects (try searching for e.g. "van gogh"), or Historiana: https://historiana.eu/ Plus many others…
• The British Library: https://bnb.data.bl.uk/
• All UK legislation, which is represented/managed as a repository of linked data, giving URI identifiers for everything, on the official site here: https://www.legislation.gov.uk/developer/uris

rickmoynihan09:10:32

Ok, asked another friend, who works for a client of ours and who used to work at the BBC on their linked data platforms more than 4 years ago… This is what he said about heritage orgs that he knows of who ha(d|ve) linked data projects, in the UK at least:
> BBC themselves, The National Archives, British Museum, National Library Wales, National Library Scotland, Rijksmuseum, Getty (thesauri for artists, geography and others), Wellcome, Archaeology Data Service, People's Collection Wales, Science Museum, University of Manchester Image Collection, Tate Gallery, BFI Archive Collections, Nature…

rickmoynihan09:10:52

BBC was a big one obviously… their news publishing and editorial processes use linked data so journalists can cross-reference articles and topics when writing them, and also IIRC the Olympics, and I think football coverage, was/is done on linked data… though I could be wrong about the footy.

Steven Deobald10:10:57

@rickmoynihan This is a great list! Thanks so much for taking the time to compile this. Before it's lost to the sands of Slack-is-the-worst-service-possible-for-something-like-Clojurians (:face_with_rolling_eyes:) do you know if this channel is logged somewhere?

quoll18:10:51

@simongray Sorry I was not online yesterday. I've only just seen your comments now. In general, I like @samir's responses. "foo" and "foo"@en are different literals. In fact, for RDF 1.0, there were 3 distinct types of string:
• "foo" was a Simple Literal
• "foo"^^xsd:string was a Typed Literal
• "foo"@en was a simple literal with a Language Tag
All 3 were distinct, and I can't tell you the grief this caused. It was a relief when RDF 1.1 was introduced and gave all simple literals (that didn't have a language tag) a datatype of xsd:string. Those with a tag are now rdf:langString.
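A quick way to see the distinction is SPARQL's sameTerm function. A minimal sketch (the BIND-only query needs no data; the results assume an RDF 1.1 store):
```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Under RDF 1.1 a simple literal and its xsd:string form are the same term;
# a language-tagged literal never is (its datatype is rdf:langString).
SELECT ?simpleVsTyped ?simpleVsTagged
WHERE {
  BIND ( sameTerm("foo", "foo"^^xsd:string) AS ?simpleVsTyped )  # true in RDF 1.1
  BIND ( sameTerm("foo", "foo"@en)          AS ?simpleVsTagged ) # false
}
```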

👍 3
quoll18:10:26

In terms of SPARQL stores, there are requirements on correctness, but performance may be terrible. In general, Jena worked hard for correctness, but typically did so with naïve code. Over time, this was reimplemented for better performance.

quoll18:10:34

Generally, most stores are indexed around triples, and not strings. The store I worked on (originally called Tucana, then Kowari, and finally Mulgara) had the option to use a Lucene index on strings, and extended the query language to allow for Lucene lookups. But SPARQL was intentionally technology agnostic, so how you might implement string indexing is not considered.

quoll18:10:50

For instance, a Patricia index may be used for all strings, and then any query that includes a regex could convert that operation into an index lookup. However, I'm not aware of anyone who did that (we started on it in Tucana, but lost funding). Consequently, I think that most regex queries are handled exclusively as filters… and that will never scale.
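For a concrete sense of what such a rewrite would target, here is the kind of query in question (a sketch; rdfs:label is just a stand-in for whatever predicate holds the strings). A store with a sorted or trie-based string index could, in principle, turn the anchored pattern into a prefix range scan; most engines instead evaluate the FILTER row by row:
```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Prefix-anchored regex: rewritable as a range scan over a string index,
# but typically evaluated as a full scan plus per-row filter.
SELECT ?s ?label
WHERE {
  ?s rdfs:label ?label .
  FILTER ( regex(str(?label), "^chat") )
}
```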

quoll18:10:14

As for languages… the idea of tagging is to provide semantics for a group of letters. The simple literal "chat" is just a sequence of 4 Unicode characters. However, "chat"@en has a semantic that means a conversation, and "chat"@fr has a semantic that means a male cat. These semantics were considered important to capture.
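To make that concrete, a small sketch of selecting one reading of the same lexical form (rdfs:label is an assumption here):
```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# The same four characters, disambiguated by language tag: match
# "chat"@fr (the cat) without also matching "chat"@en (the conversation).
SELECT ?concept
WHERE {
  ?concept rdfs:label ?label .
  FILTER ( str(?label) = "chat" && langMatches(lang(?label), "fr") )
}
```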

💯 6
rickmoynihan21:10:41

this is a great example

simongray13:10:58

I still find it weird and not very ergonomic that in a system where knowledge is otherwise defined using named relations, for some reason this particular information has to be hardcoded into strings 😛 but thank you for the in-depth history lesson.

simongray13:10:40

@U051N6TTC btw Paula, if I may ask, what is the end goal of Asami? The readme says it is inspired by RDF, but it doesn't really mention RDF otherwise. If I wanted to use it as a triplestore for an existing dataset, I guess I would have to develop code for importing RDF files and other necessary functionality?

quoll13:10:28

That’s right, you would. Though I have an old project that would get you some of the way there

quoll13:10:55

Ummm… the end goal. I only have vague notions right now. I can tell you why I started and where it’s going 🙂

simongray13:10:06

Please do 🙂

quoll13:10:14

It was written for Naga. Naga was designed to be an agnostic rule engine for graph databases. Implement a protocol for a graph database, and Naga could execute rules for it

quoll13:10:08

I thought I would start with Datomic, then implement something for SPARQL, OrientDB… etc

quoll13:10:47

But I made the mistake of showing my manager, and he got excited, and asked me to develop it for work instead of evenings and weekends. I agreed, so long as it stayed open source, which he was good with

quoll13:10:27

But then he said that he wanted it to all be open source, and he wasn't keen on Datomic for that reason. So could I write a simple database to handle it? Sure. I had only stopped working on Mulgara because I don't like Java, so restarting with Clojure sounded like a good idea (second-system effect be damned!) 🙂

quoll13:10:24

Initially, Asami only did 3 things:
• indexed data
• inner joins
• query optimizing

simongray13:10:46

hah, ok, so it’s mainly because your manager dislikes closed source software? That is a fantastic 1st world problem to have.

quoll13:10:15

But I did it in about a week, so it wasn’t a big deal

quoll13:10:36

The majority of that was the query planner

quoll13:10:21

You could argue that it wasn't needed (Datomic doesn't have one), but: a) I'd done it before; b) rules could potentially create queries that were in suboptimal form, and I've been bitten by this in the past.

quoll13:10:44

Some time later, he called me and asked me to port it to ClojureScript. So it moved into the browser

quoll13:10:07

Since then, I’ve been getting more requests for more features. Right now it handles a LOT

quoll13:10:31

That’s when I started a new pet project (evenings and weekends)

simongray13:10:04

It seems like a lot of work is happening in this space at the moment with Asami, Datalevin, Datahike, Datascript. Kind of exciting.

quoll13:10:17

This is for backend storage. It is loosely based on Mulgara, but with a lot of innovations, and new emphasis

quoll13:10:48

Honestly, if I'd known about Datascript (which had already started by then), I would have just used that

quoll13:10:24

Anyway… I mentioned the backend storage, and several managers all got excited about it. So THAT is now my job

quoll13:10:44

And for the first time, they’ve given me someone else to help

quoll14:10:29

He’s doing the ClojureScript implementation (over IndexedDB)

quoll14:10:16

I’m doing the same thing on memory-mapped files. But it’s behind a set of protocols which makes it all look the same to the index code

quoll14:10:12

I also hope to include other options, like S3 buckets. These will work, because everything is immutable (durable, persistent, full history, etc)

simongray14:10:34

Do you see a future where a common protocol like Ring can be developed for all of these Datomic-like databases? So much work is happening in parallel.

quoll14:10:56

That was actually exactly the perspective that Naga has!

quoll14:10:14

The protocol that Naga asks Databases to implement is oriented specifically to Naga’s needs, but it works pretty well

simongray14:10:31

I see. So perhaps it’s just a question of willingness to integrate.

quoll14:10:11

Well, the way I’ve done it in Naga has been as a set of package directories which implement the protocol for each database. Unfortunately, I’ve been busy, so I only have directories for Asami and Datomic

quoll14:10:18

But they both work 🙂

quoll14:10:27

I imagine that it wouldn’t be hard to do Datascript

quoll14:10:09

The main thing that Datascript/Datomic miss is a query API that allows you to do an INSERT/SELECT (which SPARQL has)
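For reference, this is the SPARQL 1.1 Update construct being referred to; a sketch in which the foaf properties are only illustrative:
```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# INSERT ... WHERE derives new triples from a query result and writes them
# in a single operation (the analogue of SQL's INSERT ... SELECT).
INSERT { ?person foaf:nick ?name }
WHERE {
  ?person foaf:name ?name .
  FILTER NOT EXISTS { ?person foaf:nick ?existing }
}
```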

simongray14:10:52

I need to get some real work done before heading "home" for today, i.e. moving from the desk to the sofa. Thanks for an interesting conversation. I'm keeping an eye on Asami (and now Naga). Really interesting projects.

quoll14:10:42

Thank you

quoll14:10:13

They look quiet right now because I’m working on the storage branch

rickmoynihan14:10:06

@U051N6TTC: Sounds like you've both had a very interesting career, and currently have a dream job. Most managers would never entertain the need to implement a new database, though it sounds like you've done it many times. :thumbsup: @UB3R8UYA1 spoke here a while back about doing something that sounded similar: providing a common abstraction across RDF and other graph stores / libraries. I definitely see the appeal, but I don't really understand the real-world use case. Why is it necessary for your business? Swapping out an RDF database for a different RDF one can be enough work as it is (due to radically different performance profiles), let alone moving across ecosystems. Or am I misunderstanding the purpose of the abstraction; is it to make more backends look like graphs? That's a use case I totally get 👌. Regardless, I'd love to hear more about your work

quoll14:10:41

only twice: Mulgara and now Asami

😂 3
quoll14:10:15

At work, there is no impetus to be able to swap things out 🙂

quoll14:10:47

but any libraries that use a graph database have motivation to do it

quoll14:10:09

particularly if the library is supposed to have broader appeal than for just the team developing it

quoll14:10:51

For instance… there is no need for Asami to have a SPARQL front end, but it’s a ticket, because I’d like to make it more accessible to people

rickmoynihan15:10:16

yeah ok that’s fair

quoll15:10:23

Besides, if I don’t implement a SPARQL front end, it will be embarrassing!!!

quoll15:10:50

For anyone reading… I was on the SPARQL committee

rickmoynihan15:10:56

I don’t know how you could live with yourself… 😆

rickmoynihan15:10:23

ahh well in that case… I don’t know how you could live with yourself 😁

rickmoynihan15:10:52

If you don’t mind me asking, if you could re-live being on that committee, knowing what you do now, what would you do differently?

quoll16:10:21

Well, it was a learning experience for me. A number of interests were on the committee to push the standard in a direction that most suited their existing systems. So rather than introducing technical changes, or working against specific things, I would have focused more on communication with each member of the committee. Not that I think I did a terrible job, but I could have done better

quoll16:10:41

From a technical perspective, I would have liked to see a tighter definition around aggregates, with an algorithmic description.

quoll16:10:48

But that's just because I find there's a bit of flexibility in some of the edge cases there. Also, having a default way to handle things, even if it's not the ideal optimized approach, would have been nice to have.

quoll16:10:18

That said, that’s essentially what Jena sets out to do. They try to be the reference implementation, and they most certainly don’t take the optimized approach

quoll16:10:19

The early versions of Jena saved triples as a flat list, and resolved patterns as filters against them 😖

quoll16:10:58

Andy had some long conversations with me about Mulgara’s storage while he was planning out Fuseki

quoll16:10:31

Also @rickmoynihan:
> Sounds like you've both had a very interesting career, and currently have a dream job
Yes! I have certainly been spoiled! I honestly don't know how I have managed to keep coming back to these things, but I'm happy that I have. Of course, I've done other things in between, but even those can be informative (for instance, I've had opportunities to work with both Datomic and OrientDB)

quoll16:10:58

Oh! I just thought of something I could have mentioned in the SPARQL committee that continues to frustrate me… transactions!

quoll16:10:30

It's possible to send several operations through at once, e.g. an insert; an insert/select; a delete. But there are limits on what you can manage there. There are occasions where transactions are important.
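Concretely, SPARQL 1.1 Update lets you chain operations in a single request, and that's about it; a sketch with made-up ex: data:
```sparql
PREFIX ex: <http://example.org/>

# Three operations in one request, run in order, separated by ';'.
# This is as close as the standard gets to a transaction: there is no way
# to group updates across requests, inspect intermediate state, or roll back.
INSERT DATA { ex:doc1 ex:status "draft" } ;
INSERT { ?doc ex:statusHistory ?s } WHERE { ?doc ex:status ?s } ;
DELETE WHERE { ?doc ex:status "draft" }
```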

quoll16:10:22

Datomic is frustrating that way too, because Naga needs it. (I manage it by using a with database, and once I’m done, I replay the accumulated transactions with transact)

rickmoynihan11:10:26

@U051N6TTC: fascinating, I agree it would have been nice to have a standard for transactions.

quoll18:10:51

Especially when the original intent of RDF was to provide semantic linkages (hence the name, “Semantic Web”)

quoll18:10:53

Also, on some specific questions:
> implemented languages as equality-distorting aspects of string literals
Languages change the value. You can consider that as "equality-distorting", but it can be avoided. For instance…
> If I am to query for my own name in an RDF resource how should I refer to it? "Simon"@en, "Simon"@da, and 6000 other entries?
Your query could include:
WHERE { ?me foaf:name ?name . FILTER(str(?name) = "Simon") }

quoll18:10:57

A good implementation (and I’m not saying that your SPARQL store will be) could turn that FILTER operation into an index lookup

quoll18:10:43

Jena never used to do that, but they may have updated lately. This might be an excuse for me to check in and see how Andy is doing 🙂

samir19:10:53

To push the argument further, the concept of equality is quite complex, as you can see in https://clojure.org/guides/equality . RDF gives no special treatment to equality: AFAIK two terms are equal when they have the identical long notation. SPARQL, being a query language, makes some decisions regarding equality in some functions. To me it feels like a good compromise, as the goal of the semantic web is to enable the articulation of arbitrary knowledge and data domains.

rickmoynihan21:10:51

Yeah, it's the lexical form that strictly speaking should be used, in combination with the datatype URI, language tag, etc. Though some stores will do some implicit coercions; e.g. Stardog will by default canonicalise various numeric types, e.g. xsd:byte into xsd:integer, unless you switch that off. https://www.w3.org/TR/rdf-concepts/#section-Literal-Equality
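To illustrate the term-vs-value distinction from that section of the spec, a minimal data-free sketch:
```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# "01"^^xsd:integer and "1"^^xsd:integer have the same value but different
# lexical forms: equal as values, distinct as RDF terms.
SELECT ?sameValue ?sameTermResult
WHERE {
  BIND ( ( "01"^^xsd:integer = "1"^^xsd:integer )      AS ?sameValue )      # true
  BIND ( sameTerm("01"^^xsd:integer, "1"^^xsd:integer) AS ?sameTermResult ) # false
}
```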