rdf

simongray 2023-01-23T10:08:32.571579Z

Hey SPARQL people, I need to find all resources of a specific type that share the same set of predicate-objects, i.e. they are identical. I have no real idea how to construct a query to return that…

simongray 2023-01-23T10:10:32.737259Z

The goal is to remove duplicates

simongray 2023-01-23T10:40:09.087359Z

Currently just matching on some attributes and trying to reduce the result set in Clojure, but it would be nice to have a single, generalised query for finding duplicates.

Kelvin 2023-01-23T14:52:57.074959Z

Would SELECT DISTINCT be helpful here?

Kelvin 2023-01-23T14:54:39.084389Z

i.e. if you’re trying to remove dupes (as opposed to specifically querying for dupes) of a certain type, would the following query be helpful?

SELECT DISTINCT ?s
WHERE {
  ?s a my:Type
}

Kelvin 2023-01-23T14:56:24.801069Z

If, on the other hand, you’re specifically trying to find duplicates, I think you need the COUNT aggregate plus a filter that only returns results when the count > 1

Kelvin 2023-01-23T15:03:08.418249Z

Ah I think HAVING would be your friend here (not sure if this query actually works but could be worth a shot):

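# ?s is not a grouping variable, so it has to be projected through an aggregate
SELECT ?p ?o (GROUP_CONCAT(STR(?s); separator=" ") AS ?subjects)
WHERE {
  ?s a my:Type .
  ?s ?p ?o .
}
GROUP BY ?p ?o
HAVING (COUNT(?s) > 1)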

simongray 2023-01-24T08:15:35.731769Z

Thanks, @kelvin063!

quoll 2023-01-23T21:56:28.747299Z

I don’t have data to play with here, but I’m thinking you could do something like this (in thread)…

quoll 2023-01-23T21:56:37.203019Z

select ?s
where {
  ?s a my:Type .
  ?s ?p ?o .
  ?s2 a my:Type .
  ?s2 ?p ?o .
  FILTER (?s != ?s2)
  MINUS {
    ?s ?px ?ox .
    ?s2 a my:Type .  # bind ?s2 inside the MINUS so it lines up with the outer ?s2
    FILTER NOT EXISTS { ?s2 ?px ?ox }
  }
}

quoll 2023-01-23T21:58:36.277489Z

This matches ?s and ?s2 where both are of the given type and they share at least one property/value. It then removes any ?s that has a property/value which the matching ?s2 does not have

quoll 2023-01-23T22:02:11.976049Z

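If ?s has a property/value that ?s2 lacks, that binding is taken away via the MINUS. When the bindings are the other way around, ?s only has properties that ?s2 also has, so it won’t be removed. That means an ?s whose properties are a strict subset of ?s2’s still comes through, so for exact duplicates I think the same check is needed in the other direction as well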
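
A quick way to check which direction the one-way test cuts is to model it outside SPARQL. The sketch below is plain Python, not a query: each resource is a set of (predicate, object) pairs, and the data plus the names `props` and `survives_minus` are made up for illustration.

```python
# Hypothetical data: each resource maps to its set of (predicate, object) pairs.
props = {
    "ex:a": {("ex:name", "Foo"), ("ex:year", "2020")},
    "ex:b": {("ex:name", "Foo"), ("ex:year", "2020")},  # identical to ex:a
    "ex:c": {("ex:name", "Foo")},                        # strict subset of ex:a
}

def survives_minus(s, s2):
    """Mirror the intended MINUS: keep (s, s2) unless s has a pair that s2 lacks."""
    shares_something = bool(props[s] & props[s2])
    return s != s2 and shares_something and props[s] <= props[s2]

# One-directional check, as in the query with a single MINUS.
one_way = sorted((s, s2) for s in props for s2 in props if survives_minus(s, s2))

# Same check applied in both directions: only identical sets remain.
both_ways = sorted((s, s2) for (s, s2) in one_way if survives_minus(s2, s))

print(one_way)
print(both_ways)
```

On this data the one-way check also keeps ex:c paired with ex:a and ex:b, because ex:c's pairs are a subset of theirs. Requiring the check in both directions leaves only ex:a and ex:b.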

quoll 2023-01-23T22:02:52.567179Z

The problem is that the above works via a nested loop in the query engine, so every candidate ?s gets compared against every ?s2. It doesn’t scale.
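
One way around the pairwise comparison, assuming the triples for the type fit in memory on the client, is to hash each resource's full set of predicate-object pairs and bucket on that. Here is a minimal sketch in Python rather than Clojure; `find_duplicates` and the sample triples are hypothetical, standing in for the rows of a plain `?s ?p ?o` query restricted to the type.

```python
from collections import defaultdict

def find_duplicates(triples):
    """Group subjects whose (predicate, object) sets are identical."""
    # First pass: collect each subject's predicate-object pairs.
    props = defaultdict(set)
    for s, p, o in triples:
        props[s].add((p, o))

    # Second pass: bucket subjects by a hashable fingerprint of those pairs.
    buckets = defaultdict(list)
    for s, pairs in props.items():
        buckets[frozenset(pairs)].append(s)

    # Only buckets holding more than one subject contain duplicates.
    return sorted(sorted(group) for group in buckets.values() if len(group) > 1)

# Made-up rows for illustration only.
triples = [
    ("ex:a", "ex:name", "Foo"), ("ex:a", "ex:year", "2020"),
    ("ex:b", "ex:name", "Foo"), ("ex:b", "ex:year", "2020"),
    ("ex:c", "ex:name", "Foo"),
    ("ex:d", "ex:name", "Bar"),
]
print(find_duplicates(triples))
```

This does one pass over the triples and one over the subjects, instead of comparing every pair. Objects that are blank nodes would need extra handling, since two distinct blank nodes never compare equal here even if they describe the same thing.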

simongray 2023-01-24T08:16:00.619209Z

Yeah…

simongray 2023-01-24T08:16:15.059159Z

Thanks @quoll!