rdf 2020-10-12 | Slack Archive

simongray08:10:07

Can someone with more knowledge of RDF explain to me the rationale of langStrings like “word”@en? They seem like the completely wrong abstraction. Strings of letters are not encoded in a language, but rather it’s the other way around: languages use strings of letters to represent words and these string literals can of course be claimed by multiple languages and their interpretation may be different, but they are still the same string of letters. With the way langStrings are implemented in SPARQL queries (an enforced filter), basic string queries for words in multiple languages either suffer a combinatorial explosion of language-encoded strings or an unintentionally smaller result set (if not using an exhaustive set of languages). It seems to me like languages are better implemented simply as an entity that can be linked to any number of strings or simply just a separate property. This can then be used to selectively filter in different languages. I cannot see any benefit to hardcoding languages into string literals. I wonder why the current implementation is even a thing.

rickmoynihan09:10:35

@simongray I think it’s partly pragmatic, but it runs deeper than lang strings. For better or worse no RDF literals (primitive values) can be used in the ?s (subject) position. Whatever design was taken in this area would make similar trade-offs elsewhere, so the implemented solution is I think a fair compromise. Philosophically in RDF you have a kind of platonic concept which is assumed to exist outside of human languages, and any human language can in principle know it; but it’s the same universal concept regardless of human language. Obviously there are competing schools in philosophy of language that say meaning is relative to other terms, and consequently relative to the language they’re expressed in (e.g. the Sapir–Whorf hypothesis), but that’s not how things are modelled in RDF. In RDF labels etc are just annotation properties; i.e. they don’t carry any formal semantics… they’re just an aid to humans, the real meaning is intended to be in the identifiers and their relationships. Regarding SPARQL’s deficiencies (real or perceived), SPARQL was developed independently of RDF. RDF came first. Of course you could choose to implement language specific predicates en:name, fr:nom if you had a good reason to, but it’s not how modelling is conventionally done.

simongray09:10:25

Well, looking past the fact that basic multi-language string queries cannot be guaranteed to be correct when using unknown datasets since the full range of possible language-encodings must be specified in the query, it just seems to defy logic to me as well. Are names language-encoded? Do I have 6000+ different names then to account for every possible official language?

simongray09:10:18

It also seems completely idiosyncratic to RDF. What other languages have implemented languages as equality-distorting aspects of string literals?

rickmoynihan09:10:44

Not sure how to interpret this: > Are names language-encoded?

simongray09:10:35

If I am to query for my own name in an RDF resource how should I refer to it? “Simon”@en, “Simon”@da, and 6000 other entries?

simongray09:10:06

My name has nothing to do with my mother tongue

rickmoynihan09:10:55

In that case just use "simon" … a plain xsd:string; the language tag is optional.

simongray09:10:35

That has not been my experience.

samir09:10:56

@simongray It is useful to see the lang string as belonging to the presentation layer. Whenever you deal with data that will be queried, with most SPARQL implementations, basic strings are the main choice. (opinionated)

simongray09:10:12

If I search for a basic string, won’t this simply not include any language-encoded strings?

simongray09:10:31

e.g. https://stackoverflow.com/questions/40246175/sparql-matching-literals-with-any-language-tags-without-run-into-timeout

simongray09:10:09

so in order to actually return a full result set I will have to construct a query for the string in every possible human language in addition to the basic string
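
The concern above can be made concrete with a minimal sketch. It assumes literals are modelled as hypothetical `(lexical form, language tag)` pairs, which is an illustration and not how Jena or any other store represents them. An exact match only finds the literal with the same tag, while comparing lexical forms alone (roughly what `FILTER(STR(?name) = "Simon")` expresses in SPARQL) finds all of them, at the cost of inspecting every candidate literal here.

```python
from typing import List, Optional, Tuple

# Hypothetical model: a literal is (lexical form, language tag or None).
Literal = Tuple[str, Optional[str]]
Triple = Tuple[str, str, Literal]

def match_exact(triples: List[Triple], text: str, lang: Optional[str] = None) -> List[Triple]:
    """Match literals whose lexical form AND language tag are both equal."""
    return [t for t in triples if t[2] == (text, lang)]

def match_any_language(triples: List[Triple], text: str) -> List[Triple]:
    """Match on the lexical form only, ignoring any language tag."""
    return [t for t in triples if t[2][0] == text]

data = [
    (":p1", ":name", ("Simon", None)),
    (":p2", ":name", ("Simon", "en")),
    (":p3", ":name", ("Simon", "da")),
    (":p4", ":name", ("Simone", "it")),
]

print(len(match_exact(data, "Simon")))         # 1
print(len(match_any_language(data, "Simon")))  # 3
```

Whether a real store can answer the second kind of query quickly depends on its indexes; this sketch simply scans the whole list.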

simongray09:10:22

I ran into this issue already using Apache Jena (aristotle)

rickmoynihan09:10:14

String searching isn’t really what RDF is optimised for. I’d say the use case for lang strings is mainly to provide labels for display. And yes this can be awkward. Multi-lingual stuff is always awkward in any system. I’m not defending RDF here btw, I’m just explaining how to think of it.

simongray09:10:25

When the dream is interconnectivity of resources, the fact that strings are represented in this way with no reliable enforcement mechanism seems to completely destroy any hope of integrating multilingual datasets

samir09:10:40

This is what I meant with presentation layer. You fetch the lang strings with the proper language before display of the entity in some viewer. Or some of them if you have precedence rules between languages

rickmoynihan09:10:17

Typically most real world systems won’t work in 6000+ languages. You’ll probably have just a handful… so filtering to the user’s locale is tractable. Normally I just slightly overselect, e.g. :s rdfs:label ?label and then pick the most appropriate based on a precedence etc.
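
The "overselect, then pick by precedence" approach might look roughly like the sketch below. It assumes the labels for one entity have already been fetched as hypothetical `(text, language tag)` pairs; `pick_label` and the sample data are illustrative names, not part of any library.

```python
from typing import List, Optional, Tuple

# Hypothetical shape of fetched labels: (text, language tag or None for a plain string).
Label = Tuple[str, Optional[str]]

def pick_label(labels: List[Label], preferred: List[Optional[str]]) -> Optional[str]:
    """Return the text of the first label whose tag appears earliest in `preferred`."""
    for lang in preferred:
        for text, tag in labels:
            if tag == lang:
                return text
    # Nothing matched the precedence list: fall back to any label rather than none.
    return labels[0][0] if labels else None

labels = [("chien", "fr"), ("dog", "en"), ("hund", "da")]

print(pick_label(labels, ["da", "en", None]))  # hund
print(pick_label(labels, ["de", "en"]))        # dog
```

The comparison is deliberately naive: language tags are case-insensitive and can carry regions such as `en-GB`, which a real implementation would need to handle.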

rickmoynihan09:10:31

Jinx!

simongray09:10:30

Ok, so is it the case that different RDF implementations treat strings with no language-encoding in different ways? Cause when I have tried searching for a basic string it will not return any results unless I specify the language (the enforced filter). This was in Apache Jena

samir09:10:07

The main point of RDF is actually to take resource identity very seriously. The labels are seen as helper data. I agree that this is suboptimal. On the other hand, systems integrating over multiple languages can analyse all labels to infer the likelihood that two entities are identical and then add an appropriate statement articulating this fact

simongray09:10:56

But langStrings are not just used for labels

simongray09:10:18

The object of a triple can either be a resource or a literal, right?

samir09:10:21

That is right, I used the term labels to simplify the discussion

rickmoynihan09:10:45

I’m pretty sure most systems treat (no)-language encoding the same. :foo rdfs:label "foo" and :foo rdfs:label "foo"@en are different triples. I’m pretty sure this is part of the standard. Obviously you can choose to handle this stuff at an ETL layer if you need to… e.g. by normalising labels into xsd strings or whatever to suit your application.
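
A small sketch of both points, under the assumption that a literal is modelled as a hypothetical `Literal(lexical, lang)` value (not any particular library's class): the plain and tagged forms compare unequal, so a graph holds them as two triples, and an ETL-style normalisation that drops the tags collapses them into one.

```python
from typing import NamedTuple, Optional, Set, Tuple

class Literal(NamedTuple):
    """Hypothetical literal model: lexical form plus an optional language tag."""
    lexical: str
    lang: Optional[str] = None

Triple = Tuple[str, str, Literal]

def strip_language(triples: Set[Triple]) -> Set[Triple]:
    """ETL-style normalisation: turn every literal into a plain, untagged string."""
    return {(s, p, Literal(o.lexical)) for s, p, o in triples}

plain = Literal("foo")
tagged = Literal("foo", "en")

graph = {(":foo", "rdfs:label", plain), (":foo", "rdfs:label", tagged)}
normalised = strip_language(graph)

print(plain == tagged)   # False
print(len(graph))        # 2
print(len(normalised))   # 1
```

Normalising like this discards the language information, so it only suits applications that never need to display labels per locale.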

samir09:10:01

Actually text searching is not really part of SPARQL, often you have a parallel text indexing service (and the clauses for text search can be integrated in the SPARQL request). From the point of view of RDF “foo” and “foo”@en are just different literals.

rickmoynihan09:10:41

yeah, though some triple stores have non-standard extensions to handle it better

simongray09:10:46

ok… thanks for responding both of you. I still think this aspect of RDF is completely idiosyncratic and to me simply introduces complexity that will need to be handled elsewhere.

simongray09:10:08

I can’t imagine dealing with a programming language where every string is potentially language-encoded… shudder

rickmoynihan09:10:54

You’re right it is idiosyncratic, and it does pass the complexity buck, and I have experienced this frustration myself. So I’m not disagreeing with you. Though I think any solution that wasn’t tailor made for your application would be idiosyncratic here too. You really just need to learn to work with it rather than against it. If you do that your life will be easier.

simongray09:10:00

Yeah. I guess I am somewhat getting around it by representing the data as a labeled property graph instead, both in Neo4j and using ubergraph in Clojure. I just can’t believe that this passed through multiple committee (re)designs and the - to me - obvious and much more flexible way to represent languages already used in HTML/XML was not simply reused here.

rickmoynihan10:10:56

How do XML/HTML solve this? It sounds to me like your issue is more that you’re doing an exact string match, rather than using FTS.

simongray10:10:39

lang="en" and xml:lang="en"
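
To illustrate the contrast being drawn, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The `<entry>` document and the helper names are made up for the example. The language sits in a separate `xml:lang` attribute, so the element text stays an ordinary string that can be matched with or without a language filter.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample document for illustration.
doc = ET.fromstring(
    "<entry>"
    "<name>Simon</name>"
    '<gloss xml:lang="en">dog</gloss>'
    '<gloss xml:lang="da">hund</gloss>'
    "</entry>"
)

def texts_in(root: ET.Element, lang: str) -> list:
    """Text of elements that carry exactly this xml:lang attribute."""
    return [el.text for el in root.iter() if el.get(XML_LANG) == lang]

def all_texts(root: ET.Element) -> list:
    """Text of every element, regardless of language."""
    return [el.text for el in root.iter() if el.text]

print(texts_in(doc, "en"))  # ['dog']
print(all_texts(doc))       # ['Simon', 'dog', 'hund']
```

In XML proper, `xml:lang` is inherited by child elements; this sketch only checks the attribute on the element itself.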

simongray10:10:46

what’s FTS?

rickmoynihan10:10:54

full text search

simongray10:10:23

well, I am a SPARQL n00b. Is full-text search built-in and if that is the case, how is it accessed?

rickmoynihan10:10:01

It’s not part of the standard, but lots of backends support it

rickmoynihan10:10:03

https://jena.apache.org/documentation/query/text-query.html

simongray10:10:36

I see. Thanks for pointing that out. I tried getting around the language-encodings by using regex but that was just unbearably slow. I guess FTS is close to the performance of matching strings directly?

rickmoynihan10:10:54

Yes it should be. It’ll use Lucene indexing etc underneath, and you can probably even tweak the indexing there too should you need to. Enabling FTS will usually make indexing slower of course.
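
As a rough intuition for why an index beats a regex scan, here is a toy inverted index. It is only a sketch: Lucene does far more (analysis, scoring, on-disk structures), and `build_index` and `search` are hypothetical names. The work is paid once at indexing time, after which a term lookup is a dictionary access instead of a pass over every literal.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical literal model: (lexical form, language tag or None).
Literal = Tuple[str, Optional[str]]

def build_index(literals: List[Literal]) -> Dict[str, Set[int]]:
    """Map each lowercased token to the positions of the literals containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for i, (text, _lang) in enumerate(literals):
        for token in text.lower().split():
            index[token].add(i)
    return index

def search(index: Dict[str, Set[int]], literals: List[Literal], term: str) -> List[Literal]:
    """Look the term up in the index; no scan over the literals is needed."""
    return [literals[i] for i in sorted(index.get(term.lower(), ()))]

literals = [("Simon", "en"), ("Simon", "da"), ("Simon", None), ("Simone", "it")]
index = build_index(literals)

print(search(index, literals, "simon"))
# [('Simon', 'en'), ('Simon', 'da'), ('Simon', None)]
```

The lookup ignores language tags entirely, which is why this kind of index sidesteps the tagged-versus-plain mismatch; the extra work in `build_index` corresponds to the slower indexing mentioned above.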

simongray10:10:44

Right. Thank you very much for educating me. I just noticed that I had already starred one of your libraries on github as part of my initial research of dealing with RDF in Clojure. 🙂

simongray09:10:06

In general, RDF seems a bit over-engineered

rickmoynihan10:10:22

I can see why you might think that, but I think RDF itself is actually very well engineered. RDF is actually pretty minimal. The complication comes from the fact that there are lots of interwoven standards; so the ecosystem is complicated; and so are some of the other standards, e.g. OWL. You should only use what you really need though.

Steven Deobald11:10:40

Out of curiosity, what are the domains/projects you folks are working on with RDF and Clojure? I'm currently working on an implementation of a digital library for http://pariyatti.org, which requires quite a bit of relationship management between entities: ancient Pali literature with many variations and translations over the past ~2000 years, authors, topics, etc. I started with Neo4j but I'm currently spiking a move to Crux, for a variety of reasons. Because librarians at http://pariyatti.org will forever consist of volunteers with limited time, I've leaned away from semantic web tech in favour of writing something (potentially?) simpler by hand... but if the project ever begins to concern itself with the contents of the documents within the library, it might be foolish to continue avoiding things like RDF. My go-to example at the current granularity is Ledi Sayadaw, a monk who authored a long list of books in contemporary Pali about ~100 years ago. He's now a topic for other, modern literature in other languages. Those sorts of relationships would be a nightmare in Postgres but they've been manageable in Neo4j and Crux so far. "Contents" might be something as fine-grained as the knowledge that kukkara in Pali means dog in English (and a dozen other translations)... obviously I have no intention of encoding that knowledge at that granularity in a database layer I've hand-rolled. 😉 Have other folks in here walked a similar road?

simongray12:10:24

I think having to figure everything out yourself is both freeing and requires more extensive research. Sounds like you don't need to integrate with any other sources or distribute your data? In that case, I don't think RDF is a requirement.

simongray12:10:47

I've inherited the official Danish wordnet which was created as part of a big research project more than a decade ago. The primary data lives in a SQL db and only exists as RDF in a limited exported version using the original draft version of RDF/XML. I need to support linking with the Princeton WordNet while supporting a bunch of future functionality, so my mission has been normalising the usage of RDF and graphs for data modelling, including at the db level.

rickmoynihan13:10:17

Actually the cultural/arts/museum space has historically been a large adopter of RDF and linked data. Lots of big museums, art collections and libraries etc use RDF for their metadata catalogs. There is definitely a tonne of vocabularies and work using RDF in this space… in particular probably: https://iiif.io/ which is adopted by dozens of national museums/galleries etc worldwide, but also CIDOC: http://www.cidoc-crm.org/ and probably a bunch more. Not sure what the latest stuff is, but I could probably find out. SKOS was designed for representing thesauri etc. https://www.w3.org/2004/02/skos/ @U01AVNG2XNF I’d say there’s a strong argument to use RDF here, given its wide adoption. Also there’s a good chance RDF will be around long after trendier stuff like crux.

Steven Deobald14:10:53

@U06HHF230 Interesting! If you had a line on more recent developments, I'd be very curious to know what they are. My entire career was spent in finance / e-commerce type things so I'm really a fish out of water in what seems to be an almost entirely government / academic dominated space.

Steven Deobald14:10:15

> Also there’s a good chance RDF will be around long after trendier stuff like crux. @U06HHF230 I suppose I hadn't considered these two things at odds with each other. Is there a particular backing store(s) people tend to rely on in the world of RDF?

Steven Deobald14:10:38

@simongray You're right, for the foreseeable future this system won't integrate with any other or require any sort of data distribution. Pariyatti will be internally curated and won't resemble anything like Wikimedia's work. That said, it's a fine line between a curated library and a system for researching ancient linguistics. The latter no doubt has a lot to learn from the work already done on the semantic web, whether the system is open or not.

Steven Deobald14:10:23

@U06HHF230 Do you know of any specific organizations or projects using cidoc? I'm surprised it didn't come up when I was researching off-the-shelf tools.

rickmoynihan15:10:07

http://www.cidoc-crm.org/stackeholders http://www.cidoc-crm.org/sig-members-list I guess the above lists would be a good place to look. Also the Smithsonian… https://americanart.si.edu/about/lod There’s been lots of other linked data projects in this area; but I can’t recall many off the top of my head… I can ask some colleagues.

rickmoynihan15:10:24

> Is there a particular backing store(s) people tend to rely on in the world of RDF? There are many… probably half a dozen serious commercial options, plus the two big opensource ones Jena and RDF4j; and then maybe twenty or more opensource ones targeting various niches or in various stages of development.

Steven Deobald04:10:03

Not sure why I wasn't expecting to find the Stakeholders list under Community. :woman-facepalming: Thanks! I'd love to know about more LOD projects... understanding the surface area will help make more informed decisions as the project plods forward.

Steven Deobald04:10:04

rdf4j .... offers an easy-to-use API that can be connected to all leading RDF database solutions. -- I'm still finding this whole space a bit confusing. It seems like Jena actually does all the database work itself (maybe?) but RDF4j relies exclusively on a third-party triplestore? https://en.wikipedia.org/wiki/Comparison_of_triplestores sheds a little light on the situation but I'm still a bit unclear who/what is writing data to disk in these different setups. 😉

rickmoynihan07:10:10

RDF4j has several triple store backends just like Jena. In particular a native store (which is persisted to disk) and a memory store, plus a few more… It also comes with a workbench (database server) that you can run, like Jena (Jena’s is called Fuseki). RDF4j has a much cleaner API in my mind, but Jena has more features in some areas. In particular WRT inferencing. (Disclosure I’m actually supposed to be a core contributor to RDF4j; but it’s only because I submitted a bunch of extensive bug reports a few years back; with a few small patches.) I actually use Jena in a few places too.

rickmoynihan13:10:31

publishing government data (mostly statistical data)

Steven Deobald14:10:12

Very cool. My partner has been working with http://CivicDataLab.in for the past half year or so, in a similar space. I'm not sure they've ever even contemplated RDF for their statistical data, though.
