#rdf
2023-01-23
simongray10:01:32

Hey SPARQL people, I need to find all resources of a specific type that share the same set of predicate-objects, i.e. they are identical. I have no real idea how to construct a query to return that…

simongray10:01:32

The goal is to remove duplicates

simongray10:01:09

Currently I’m just matching on some attributes and trying to reduce the result set in Clojure, but it would be nice to have a single, generalised query for finding duplicates.
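For reference, the "reduce the result set in code" step can be sketched roughly as below. This is only an illustration in Python rather than Clojure, assuming the triples have already been fetched as plain `(subject, predicate, object)` tuples. The function name `find_duplicates`, the `type_pred` parameter, and the sample data are hypothetical.

```python
from collections import defaultdict

def find_duplicates(triples, rdf_type, type_pred="rdf:type"):
    """Group subjects of rdf_type by their full set of (predicate, object) pairs."""
    props = defaultdict(set)
    for s, p, o in triples:
        props[s].add((p, o))
    groups = defaultdict(list)
    for s, po in props.items():
        # Only consider resources of the requested type.
        if (type_pred, rdf_type) in po:
            groups[frozenset(po)].append(s)
    # A group with more than one subject is a set of identical resources.
    return [sorted(g) for g in groups.values() if len(g) > 1]

# Hypothetical sample data: ex:a and ex:b are identical, ex:c is not.
triples = [
    ("ex:a", "rdf:type", "my:Type"), ("ex:a", "ex:name", "Ann"),
    ("ex:b", "rdf:type", "my:Type"), ("ex:b", "ex:name", "Ann"),
    ("ex:c", "rdf:type", "my:Type"), ("ex:c", "ex:name", "Bob"),
]
print(find_duplicates(triples, "my:Type"))  # [['ex:a', 'ex:b']]
```

Using the frozen set of pairs as a dictionary key means each resource is looked at once, with no pairwise comparison.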

Kelvin14:01:57

Would SELECT DISTINCT be helpful here?

Kelvin14:01:39

i.e. if you’re trying to remove dupes (as opposed to specifically querying for dupes) of a certain type, would the following query be helpful?

SELECT DISTINCT ?s
WHERE {
  ?s a my:Type
}

Kelvin14:01:24

If, on the other hand, you’re specifically trying to find duplicates, I think you need to use the COUNT aggregate with a filter that only returns results where the count is > 1

Kelvin15:01:08

Ah I think HAVING would be your friend here (not sure if this query actually works but could be worth a shot):

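# ?s is not a group key, so project the grouped ?p ?o and a count instead
SELECT ?p ?o (COUNT(?s) AS ?count)
WHERE {
  ?s a my:Type .
  ?s ?p ?o
}
GROUP BY ?p ?o
HAVING (COUNT(?s) > 1)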
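A rough sketch of what grouping by `?p ?o` computes, assuming the triples are available as Python tuples and all belong to resources of the type in question. The function name `shared_pairs` and the sample data are made up for illustration. It reports each predicate-object pair held by more than one subject, which is a weaker condition than two resources being identical overall.

```python
from collections import defaultdict

def shared_pairs(triples):
    """Map each (predicate, object) pair to its subjects, keeping pairs with more than one."""
    subjects = defaultdict(set)
    for s, p, o in triples:
        subjects[(p, o)].add(s)
    return {po: sorted(ss) for po, ss in subjects.items() if len(ss) > 1}

# Hypothetical sample data: all three share the type, only two share the name.
triples = [
    ("ex:a", "rdf:type", "my:Type"), ("ex:a", "ex:name", "Ann"),
    ("ex:b", "rdf:type", "my:Type"), ("ex:b", "ex:name", "Ann"),
    ("ex:c", "rdf:type", "my:Type"), ("ex:c", "ex:name", "Bob"),
]
result = shared_pairs(triples)
```

Here the type pair itself is shared by every resource, so it always shows up in the result. The grouped counts would still need a further step to decide which resources match on all of their pairs.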

quoll21:01:28

I don’t have data to play with here, but I’m thinking you could do something like this (in thread)…

quoll21:01:37

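select ?s
where {
  ?s a mytype .
  ?s ?p ?o .
  ?s2 a mytype .
  ?s2 ?p ?o .
  FILTER (?s != ?s2)
  MINUS {
    ?s ?px ?ox .
    ?s2 a mytype .   # bind ?s2 inside the MINUS group; the outer ?s2 is not visible to the filter here
    FILTER NOT EXISTS { ?s2 ?px ?ox }
  }
}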

quoll21:01:36

This matches ?s and ?s2 where they are both mytype and they share any properties at all. Then it removes any ?s that has a property/value which the matching ?s2 does not have

quoll22:01:11

If ?s has more properties than ?s2 then it will be taken away via the MINUS, but when the bindings are the other way around ?s will have fewer properties than ?s2 and so it won’t be removed via the MINUS. Which I think works here
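To make the direction of that check concrete, here is a small sketch using Python sets in place of the triple store. The names `one_way_match` and `identical` and the sample property sets are hypothetical. Reading the MINUS as described above, a pair survives when ?s has no property/value that ?s2 lacks, which is a subset test rather than an equality test.

```python
def one_way_match(props_s, props_s2):
    """True when every (predicate, object) pair of ?s is also present on ?s2."""
    return props_s <= props_s2

def identical(props_s, props_s2):
    """True only when both resources carry exactly the same pairs."""
    return props_s == props_s2

# Hypothetical property sets: `large` has one extra pair.
small = {("rdf:type", "my:Type"), ("ex:name", "Ann")}
large = small | {("ex:age", "42")}

print(one_way_match(small, large))  # True: the smaller resource is still reported
print(one_way_match(large, small))  # False: this direction is removed
print(identical(small, large))      # False: they are not exact duplicates
```

Whether reporting the smaller resource is acceptable depends on the use case. If only exact duplicates should be returned, the check would presumably need to hold in both directions.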

quoll22:01:52

The problem is that the above works via a nested loop in the query engine. It doesn’t scale
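A rough illustration of the nested-loop cost, not a model of any particular query engine. The function name `pairwise_duplicates` and the sample data are hypothetical. Every resource is compared against every other resource, so the number of comparisons grows as n * (n - 1).

```python
from itertools import permutations

def pairwise_duplicates(props):
    """Compare every ordered pair of resources, counting the comparisons made."""
    comparisons = 0
    matches = []
    for s, s2 in permutations(props, 2):
        comparisons += 1
        if props[s] == props[s2]:
            matches.append((s, s2))
    return matches, comparisons

# Hypothetical sample data keyed by subject.
props = {
    "ex:a": {("rdf:type", "my:Type"), ("ex:name", "Ann")},
    "ex:b": {("rdf:type", "my:Type"), ("ex:name", "Ann")},
    "ex:c": {("rdf:type", "my:Type"), ("ex:name", "Bob")},
}
matches, comparisons = pairwise_duplicates(props)
print(matches)      # [('ex:a', 'ex:b'), ('ex:b', 'ex:a')]
print(comparisons)  # 6
```

With 3 resources that is 6 comparisons. With 10,000 it would be close to 100 million, before counting the cost of comparing the property sets themselves.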