rdf

Kelvin 2022-08-01T13:42:49.378619Z

Question about SPARQL

Kelvin 2022-08-01T13:43:36.947169Z

So I’m revisiting the https://github.com/yetanalytics/flint library and one of the changes I’m proposing is to prevent aggregate expressions from being nested inside other aggregates.

Kelvin 2022-08-01T13:44:37.275269Z

That is because that is not allowed in Apache Jena (which I’m using as my reference implementation); you’d get this exception:

; Execution error (QueryParseException) at org.apache.jena.sparql.lang.QueryParserBase/throwParseException (QueryParserBase.java:578).
; Line [x], column [y]: Nested aggregate in expression not legal

2022-08-01T16:33:11.580309Z

Can you give an example of such a nested aggregate?

Kelvin 2022-08-01T17:24:38.683239Z

SELECT (MAX(AVG(?x)) AS ?maxAvg)
WHERE {
    ?x ?y ?z .
}
GROUP BY ?z

Kelvin 2022-08-01T17:24:55.803359Z

Does it make semantic sense? Not really (which is probably why Jena has that restriction)

Kelvin 2022-08-01T17:26:05.264119Z

For the record this is totally legal:

SELECT (AVG(?x) AS ?avg) (MAX(?avg) AS ?maxAvg)
WHERE {
  ?x ?y ?z .
}
GROUP BY ?z

2022-08-01T18:17:19.010989Z

I agree that doesn't make logical sense to take the aggregate of an aggregate.

Kelvin 2022-08-01T18:20:19.881339Z

Which means even if it’s technically allowed in the syntax, I’ll likely add that restriction because a) it doesn’t make semantic sense and b) we’d don’t really want syntax that’s not supported by certain impls like Jena, right?

2022-08-01T20:44:15.959559Z

Yeah

2022-08-02T08:36:13.841759Z

In my mind adding a restriction like this puts you on the slope of implementing a poor mans type checker; or at least you’d be mixing concerns of static analysis with parsing. As you’ve noted it looks like this is visible in Jena’s implementation. They’re denying the first form as syntactically invalid (when it’s not), but can’t detect it when the functions aren’t composed directly, and an intermediate variable is introduced. By way of analogy; this is a perfectly valid clojure form: (inc "foo") but it’s a semantically incorrect program; and will result in an evaluation error when run. The error isn’t detected in the parser however. I think static analysis features are super useful; but if implemented that they should occur in a separate layer / concern; for example a linter like clj-kondo or type checker.

2022-08-02T08:38:09.349829Z

Out of interest what does Jena return when you run this? https://clojurians.slack.com/archives/C09GHBXRC/p1659374765264119?thread_ts=1659361477.275269&cid=C09GHBXRC

Kelvin 2022-08-02T17:46:36.306699Z

I did some playing around in the https://sparql-playground.sib.swiss/. The following:

SELECT (COUNT(?s) AS ?countS) (MAX(?countS) AS ?maxCountS) WHERE {
  ?s ?p ?o
}
LIMIT 10
returns 14. If you remove the MAX expression binding then ?countS is also 14.

Kelvin 2022-08-02T17:47:02.210099Z

So MAX at least returns the max of a singleton collection, which actually makes sense

Kelvin 2022-08-02T17:56:45.826109Z

WAIT OOPS I misread the results

Kelvin 2022-08-02T17:57:05.402699Z

?countS is 14 but ?maxCountS returns nothing

Kelvin 2022-08-02T18:15:36.873619Z

And of course I did the nested aggregates:

Kelvin 2022-08-02T18:15:39.094739Z

SELECT (MAX(COUNT(?s)) AS ?maxCountS) WHERE {
  ?s ?p ?o
}
LIMIT 10

Kelvin 2022-08-02T18:16:14.648729Z

Given that the query actually went through without error it seems that it is indeed a Jena-specific detail

2022-08-02T22:18:09.796419Z

Yes, this is the behaviour I’d expect to be honest. The query should run and remove the bindings for the solution when it errors during evaluation. The spec describes this here: https://www.w3.org/TR/sparql11-query/#aggregateExample2 Also note the definition of MAX in the spec,and the description for Set functions here where they’re described as receiving a a value of type MultiSet: https://www.w3.org/TR/sparql11-query/#setFunctions The above is interesting though because it implies to me that the query below should also fail, even though the grammar permits it, as 10 is a literal integer not a MultiSet:

SELECT (MAX(10) AS ?maxCountS) WHERE {
}
I’ve tried this and your variants on the sparql playground (sesame), stardog and jena and all return "10". Which makes sense but it feels like it should be an error according to the spec; though I’ve not looked thoroughly. I suspect all these implementations coerce the literals in this case into multisets through some implicit coercion or something; though it seems peculiar with regards to your discovery.

Kelvin 2022-08-01T13:45:23.303679Z

Thing is, I can’t find such a restriction in the https://www.w3.org/TR/sparql11-query/, so I’d like to know if this is a general restriction or a Jena implementation detail.

2022-08-02T08:27:07.453919Z

@kelvin063: I’m pretty sure this is a Jena implementation detail, and for what it’s worth I’d advise against copying it. It looks to me that this isn’t a syntactic concern it’s a semantics concern; and flint should stick to just syntax. I’m not arguing that Jena shouldn’t error in these cases; but IMHO they shouldn’t error in the parser; they should error in the evaluator. > we’d don’t really want syntax that’s not supported by certain impls like Jena, right? I think this is a slippery slope and if you defer to implementations as your oracle you’ll be cutting the expressivity of the language to whatever common subset they all support; which will be less than the union of their collective grammar/features. So if anything I’d argue that you should err on the side of the opposite direction. For example many triple stores implement extensions to the SPARQL grammar, and you probably don’t want to rule out ever supporting these extensions. In my mind this would benefit from a clear policy of potentially growing the grammar (or at least fully supporting it) rather than artificially restricting it.

Kelvin 2022-08-02T16:26:54.889109Z

> you’ll be cutting the expressivity of the language to whatever common subset they all support; which will be less than the union of their collective grammar/features. I actually thought that only implementing the common subset would be desirable; that way, you can just plug Flint (or whatever lib) into whatever implementation and you won’t get error messages from said implementation.

Kelvin 2022-08-02T16:27:16.368639Z

But it seems like I may be mistaken on that premise?

2022-08-02T22:50:33.562099Z

In the general case, and as a general principle to follow beyond this specific example, I think applications (and flint) can rely on the implementations erroring; and I don’t see why flint needs to protect me from the implementation erroring. Yes the error can happen sooner if you can detect it syntactically; but you’re on a slippery slope, and I think it’s more principled to stay out of static analysis or type checking etc. Yes those could be useful things; but I think they should be built on top of something like flint; or it’s intermediate representation, and not baked into the parsing itself. In this specific case; I think (though I’m not 100% sure) we’ve discovered that Jena is returning the wrong result in this case. It should be removing the binding from that solution during evaluation; not erroring during parsing. Is it a big deal? Not really; because why would you write query. My point on common subset is really that the spec should be the common subset you implement; and that you should focus on correctness and the grammar/parsing. Jena is I think strictly speaking doing less than the spec requires here, so flint doing that would mean flint is targeting the lowest common denominator of sparql; which will be less than sparql itself specifies. Then once you have that subset you could possibly choose to increase the size of grammar beyond the spec by supporting various syntax extensions some implementations may provide. I’m not suggesting this by the way; just that you could.

Kelvin 2022-08-01T14:05:22.054129Z

Also another question: according to the https://www.w3.org/TR/sparql11-query/#sparqlGrammar a GROUP BY clause is defined as:

GroupClause    ::= 'GROUP' 'BY' GroupCondition+
GroupCondition ::= BuiltInCall | FunctionCall | '(' Expression ( 'AS' Var )? ')' | Var

Kelvin 2022-08-01T14:05:52.550539Z

Where BuiltInCall and FunctionCall exclude pure arithmetic expressions like (?x + ?x)

Kelvin 2022-08-01T14:06:32.728109Z

Yet this seems to be legal in Jena?

SELECT (SUM(?z) AS ?sum)
WHERE {
  ?x ?y ?z .
}
GROUP BY (?x + ?x) ; not a function of the form FNAME(args...)

quoll 2022-08-02T15:36:11.281459Z

This is legal because (?x + ?x) is an expression (the 3rd GroupCondition)

Kelvin 2022-08-02T15:39:41.274839Z

Yep @eric.d.scott pointed that out to me yesterday!

quoll 2022-08-02T15:56:12.061779Z

I read further down and saw that

quoll 2022-08-02T15:57:07.725569Z

The horrible part about that BNF is writing code that implements it 😖

quoll 2022-08-02T15:57:36.903269Z

You can have code that performs all of the operations, but actually connecting it to the parsed grammar is a pain

2022-08-01T15:13:47.449809Z

As I read this, ?y would have to be a URI and not a numeric literal. Can you SUM over URIs?

quoll 2022-08-02T15:36:32.701339Z

Not according to the spec, no

Kelvin 2022-08-01T15:26:49.882419Z

Oh these were just junk bindings that I cooked up to demonstrate the example

Kelvin 2022-08-01T15:27:12.880599Z

But I’ll change it to ?z to make it more correct

2022-08-01T15:34:34.691359Z

So the problem is that GROUP BY (?literal1 + ?literal2) fails but BIND(?literal1 + ?literal2 AS ?literal3) / GROUP BY (?literal3) does work?

Kelvin 2022-08-01T15:36:15.715339Z

The problem is that GROUP BY (?literal1 + ?literal2) SHOULD fail according to the grammar but it doesn’t

❌ 1
2022-08-01T15:41:29.784619Z

So the critical part is here, then, I think: '(' Expression ( 'AS' Var )? ')'

2022-08-01T15:42:29.411049Z

I think the ? after the ('AS Var)' makes that optional, so I think you're within your rights to just put an Expression

👍 1
Kelvin 2022-08-01T15:47:17.642959Z

Ooooh

2022-08-01T15:47:37.261969Z

Though I can't think of how you'd reference that downstream.

Kelvin 2022-08-01T15:47:58.353299Z

Well that’s a confusing way to write BrackettedExpression | '(' Expression 'AS' Var ')'

Kelvin 2022-08-01T15:49:27.395479Z

> Though I can’t think of how you’d reference that downstream. It’s all part of covering my bases when it comes to syntax, even if semantically it doesn’t make sense

2022-08-08T11:39:57.530429Z

I think the SPARQL grammar mandates an unsigned / non negative integer in those places too though; so I think it’s reasonable for flint to do the same there too.

2022-08-02T08:57:53.315049Z

☝️ I think you’ve hit the nail on the head here and that this is the right intuition! i.e. focus on grammar/syntax in flint and covering your bases. Don’t shrink the grammar because some statements within the grammar are nonsensical. It’s always the case in essentially language/grammar of sufficient complexity that there are syntactically valid statements which are nonsensical. The best known ways to short cut this are type-checkers; and it’s proven that even the most advanced type checkers can’t detect all possible errors.

Kelvin 2022-08-02T15:38:06.716139Z

Indeed, I don’t want Flint to become a type checker - notice how expressions are completely untyped. I think one issue is that the line between pure syntax checking and type checking can get quite blurred, especially when you have a contract system like Spec.

2022-08-02T23:02:17.103479Z

> I think one issue is that the line between pure syntax checking and type checking can get quite blurred Yes it can. I think the SPARQL BNF may illustrate this point too. > especially when you have a contract system like Spec Yes, spec does blur these lines a little too. I think part of the problem there is that spec complects parsing with property-testing/generation. Can you point to any specific examples in flint where the boundary is blurred?

Kelvin 2022-08-03T14:03:48.136569Z

> Can you point to any specific examples in flint where the boundary is blurred? The simplest example I could think of would be with LIMIT clauses which requires not only integers, but positive, nonzero integers (see https://github.com/yetanalytics/flint/blob/main/src/main/com/yetanalytics/flint/spec/modifier.cljc#L32). Since that is a case where not only the axiom has to be an integer syntactically, but it also has to conform to the aforementioned properties.

2022-08-01T15:56:05.789369Z

Yeah I agree that's ugly.

Kelvin 2022-08-01T16:20:27.520719Z

The whole SPARQL BNF grammar is rather ugly

2022-08-02T09:19:17.680109Z

Yes. I think it may have been cleaner if they’d defined a generic function syntax; and then defined elsewhere the minimal set of functions and their names that should exist in the core language. Instead they made the built in functions all be keywords in the language grammar itself. I guess it’s all trade offs; the complexity needs to go somewhere; they just chose to move a lot of it into the grammar. Presumably to try and simplify the wider specification effort.

quoll 2022-08-02T15:38:17.937629Z

I don’t know who did the grammar to start with, but I got the impression that it came from Jena. Andy always maintained it, and I presumed that he’d written the original, but I don’t know

Kelvin 2022-08-02T15:39:21.924819Z

By “he” do you mean Andy Seaborne (who’s listed as one of the query language editors and is a current maintainer of Jena)?

quoll 2022-08-02T15:44:34.393879Z

Yes, sorry. Andy Seabourne

Kelvin 2022-08-01T16:21:00.359439Z

But yeah thanks for the help on this; I wonder if you have any insight on my earlier question though: https://clojurians.slack.com/archives/C09GHBXRC/p1659361416947169

2022-08-01T16:33:57.898989Z

I'm still waiting to see my first pretty BNF spec. :-)