rdf 2022-08-01 | Slack Archive

Question about SPARQL

So I’m revisiting the https://github.com/yetanalytics/flint library and one of the changes I’m proposing is to prevent aggregate expressions from being nested inside other aggregates.

Kelvin13:08:37

That is because that is not allowed in Apache Jena (which I’m using as my reference implementation); you’d get this exception:

; Execution error (QueryParseException) at org.apache.jena.sparql.lang.QueryParserBase/throwParseException (QueryParserBase.java:578).
; Line [x], column [y]: Nested aggregate in expression not legal

Eric Scott16:08:11

Can you give an example of such a nested aggregate?

Kelvin17:08:38

SELECT (MAX(AVG(?x)) AS ?maxAvg)
WHERE {
    ?x ?y ?z .
}
GROUP BY ?z

Kelvin17:08:55

Does it make semantic sense? Not really (which is probably why Jena has that restriction)

Kelvin17:08:05

For the record this is totally legal:

SELECT (AVG(?x) AS ?avg) (MAX(?avg) AS ?maxAvg)
WHERE {
  ?x ?y ?z .
}
GROUP BY ?z

Eric Scott18:08:19

I agree that doesn't make logical sense to take the aggregate of an aggregate.

Kelvin18:08:19

Which means even if it’s technically allowed in the syntax, I’ll likely add that restriction because a) it doesn’t make semantic sense and b) we’d don’t really want syntax that’s not supported by certain impls like Jena, right?

Eric Scott20:08:15

Yeah

rickmoynihan08:08:13

In my mind adding a restriction like this puts you on the slope of implementing a poor mans type checker; or at least you’d be mixing concerns of static analysis with parsing. As you’ve noted it looks like this is visible in Jena’s implementation. They’re denying the first form as syntactically invalid (when it’s not), but can’t detect it when the functions aren’t composed directly, and an intermediate variable is introduced. By way of analogy; this is a perfectly valid clojure form: (inc "foo") but it’s a semantically incorrect program; and will result in an evaluation error when run. The error isn’t detected in the parser however. I think static analysis features are super useful; but if implemented that they should occur in a separate layer / concern; for example a linter like clj-kondo or type checker.

rickmoynihan08:08:09

Out of interest what does Jena return when you run this? https://clojurians.slack.com/archives/C09GHBXRC/p1659374765264119?thread_ts=1659361477.275269&cid=C09GHBXRC

Kelvin17:08:36

I did some playing around in the https://sparql-playground.sib.swiss/. The following:

SELECT (COUNT(?s) AS ?countS) (MAX(?countS) AS ?maxCountS) WHERE {
  ?s ?p ?o
}
LIMIT 10

returns 14. If you remove the MAX expression binding then ?countS is also 14.

Kelvin17:08:02

So MAX at least returns the max of a singleton collection, which actually makes sense

Kelvin17:08:45

WAIT OOPS I misread the results

Kelvin17:08:05

?countS is 14 but ?maxCountS returns nothing

Kelvin18:08:36

And of course I did the nested aggregates:

Kelvin18:08:39

SELECT (MAX(COUNT(?s)) AS ?maxCountS) WHERE {
  ?s ?p ?o
}
LIMIT 10

Kelvin18:08:14

Given that the query actually went through without error it seems that it is indeed a Jena-specific detail

rickmoynihan22:08:09

Yes, this is the behaviour I’d expect to be honest. The query should run and remove the bindings for the solution when it errors during evaluation. The spec describes this here: https://www.w3.org/TR/sparql11-query/#aggregateExample2 Also note the definition of MAX in the spec,and the description for Set functions here where they’re described as receiving a a value of type MultiSet: https://www.w3.org/TR/sparql11-query/#setFunctions The above is interesting though because it implies to me that the query below should also fail, even though the grammar permits it, as 10 is a literal integer not a MultiSet:

SELECT (MAX(10) AS ?maxCountS) WHERE {
}

I’ve tried this and your variants on the sparql playground (sesame), stardog and jena and all return "10". Which makes sense but it feels like it should be an error according to the spec; though I’ve not looked thoroughly. I suspect all these implementations coerce the literals in this case into multisets through some implicit coercion or something; though it seems peculiar with regards to your discovery.

Kelvin13:08:23

Thing is, I can’t find such a restriction in the https://www.w3.org/TR/sparql11-query/, so I’d like to know if this is a general restriction or a Jena implementation detail.

rickmoynihan08:08:07

@U02FU7RMG8M: I’m pretty sure this is a Jena implementation detail, and for what it’s worth I’d advise against copying it. It looks to me that this isn’t a syntactic concern it’s a semantics concern; and flint should stick to just syntax. I’m not arguing that Jena shouldn’t error in these cases; but IMHO they shouldn’t error in the parser; they should error in the evaluator. > we’d don’t really want syntax that’s not supported by certain impls like Jena, right? I think this is a slippery slope and if you defer to implementations as your oracle you’ll be cutting the expressivity of the language to whatever common subset they all support; which will be less than the union of their collective grammar/features. So if anything I’d argue that you should err on the side of the opposite direction. For example many triple stores implement extensions to the SPARQL grammar, and you probably don’t want to rule out ever supporting these extensions. In my mind this would benefit from a clear policy of potentially growing the grammar (or at least fully supporting it) rather than artificially restricting it.

Kelvin16:08:54

> you’ll be cutting the expressivity of the language to whatever common subset they all support; which will be less than the union of their collective grammar/features. I actually thought that only implementing the common subset would be desirable; that way, you can just plug Flint (or whatever lib) into whatever implementation and you won’t get error messages from said implementation.

Kelvin16:08:16

But it seems like I may be mistaken on that premise?

rickmoynihan22:08:33

In the general case, and as a general principle to follow beyond this specific example, I think applications (and flint) can rely on the implementations erroring; and I don’t see why flint needs to protect me from the implementation erroring. Yes the error can happen sooner if you can detect it syntactically; but you’re on a slippery slope, and I think it’s more principled to stay out of static analysis or type checking etc. Yes those could be useful things; but I think they should be built on top of something like flint; or it’s intermediate representation, and not baked into the parsing itself. In this specific case; I think (though I’m not 100% sure) we’ve discovered that Jena is returning the wrong result in this case. It should be removing the binding from that solution during evaluation; not erroring during parsing. Is it a big deal? Not really; because why would you write query. My point on common subset is really that the spec should be the common subset you implement; and that you should focus on correctness and the grammar/parsing. Jena is I think strictly speaking doing less than the spec requires here, so flint doing that would mean flint is targeting the lowest common denominator of sparql; which will be less than sparql itself specifies. Then once you have that subset you could possibly choose to increase the size of grammar beyond the spec by supporting various syntax extensions some implementations may provide. I’m not suggesting this by the way; just that you could.

Kelvin14:08:22

Also another question: according to the https://www.w3.org/TR/sparql11-query/#sparqlGrammar a GROUP BY clause is defined as:

GroupClause    ::= 'GROUP' 'BY' GroupCondition+
GroupCondition ::= BuiltInCall | FunctionCall | '(' Expression ( 'AS' Var )? ')' | Var

Kelvin14:08:52

Where BuiltInCall and FunctionCall exclude pure arithmetic expressions like (?x + ?x)

Kelvin14:08:32

Yet this seems to be legal in Jena?

SELECT (SUM(?z) AS ?sum)
WHERE {
  ?x ?y ?z .
}
GROUP BY (?x + ?x) ; not a function of the form FNAME(args...)

quoll15:08:11

This is legal because (?x + ?x) is an expression (the 3rd GroupCondition)

Kelvin15:08:41

Yep @UB3R8UYA1 pointed that out to me yesterday!

quoll15:08:12

I read further down and saw that

quoll15:08:07

The horrible part about that BNF is writing code that implements it 😖

quoll15:08:36

You can have code that performs all of the operations, but actually connecting it to the parsed grammar is a pain

Eric Scott15:08:47

As I read this, ?y would have to be a URI and not a numeric literal. Can you SUM over URIs?

quoll15:08:32

Not according to the spec, no

Kelvin15:08:49

Oh these were just junk bindings that I cooked up to demonstrate the example

Kelvin15:08:12

But I’ll change it to ?z to make it more correct

Eric Scott15:08:34

So the problem is that GROUP BY (?literal1 + ?literal2) fails but BIND(?literal1 + ?literal2 AS ?literal3) / GROUP BY (?literal3) does work?

Kelvin15:08:15

The problem is that GROUP BY (?literal1 + ?literal2) SHOULD fail according to the grammar but it doesn’t

❌ 1

Eric Scott15:08:29

So the critical part is here, then, I think: '(' Expression ( 'AS' Var )? ')'

Eric Scott15:08:29

I think the ? after the ('AS Var)' makes that optional, so I think you're within your rights to just put an Expression

👍 1

Kelvin15:08:17

Ooooh

Eric Scott15:08:37

Though I can't think of how you'd reference that downstream.

Kelvin15:08:58

Well that’s a confusing way to write BrackettedExpression | '(' Expression 'AS' Var ')'

Kelvin15:08:27

> Though I can’t think of how you’d reference that downstream. It’s all part of covering my bases when it comes to syntax, even if semantically it doesn’t make sense

rickmoynihan08:08:53

☝️ I think you’ve hit the nail on the head here and that this is the right intuition! i.e. focus on grammar/syntax in flint and covering your bases. Don’t shrink the grammar because some statements within the grammar are nonsensical. It’s always the case in essentially language/grammar of sufficient complexity that there are syntactically valid statements which are nonsensical. The best known ways to short cut this are type-checkers; and it’s proven that even the most advanced type checkers can’t detect all possible errors.

Kelvin15:08:06

Indeed, I don’t want Flint to become a type checker - notice how expressions are completely untyped. I think one issue is that the line between pure syntax checking and type checking can get quite blurred, especially when you have a contract system like Spec.

rickmoynihan23:08:17

> I think one issue is that the line between pure syntax checking and type checking can get quite blurred Yes it can. I think the SPARQL BNF may illustrate this point too. > especially when you have a contract system like Spec Yes, spec does blur these lines a little too. I think part of the problem there is that spec complects parsing with property-testing/generation. Can you point to any specific examples in flint where the boundary is blurred?

Kelvin14:08:48

> Can you point to any specific examples in flint where the boundary is blurred? The simplest example I could think of would be with LIMIT clauses which requires not only integers, but positive, nonzero integers (see https://github.com/yetanalytics/flint/blob/main/src/main/com/yetanalytics/flint/spec/modifier.cljc#L32). Since that is a case where not only the axiom has to be an integer syntactically, but it also has to conform to the aforementioned properties.

rickmoynihan11:08:57

I think the SPARQL grammar mandates an unsigned / non negative integer in those places too though; so I think it’s reasonable for flint to do the same there too.

Eric Scott15:08:05

Yeah I agree that's ugly.

Kelvin16:08:27

The whole SPARQL BNF grammar is rather ugly

rickmoynihan09:08:17

Yes. I think it may have been cleaner if they’d defined a generic function syntax; and then defined elsewhere the minimal set of functions and their names that should exist in the core language. Instead they made the built in functions all be keywords in the language grammar itself. I guess it’s all trade offs; the complexity needs to go somewhere; they just chose to move a lot of it into the grammar. Presumably to try and simplify the wider specification effort.

quoll15:08:17

I don’t know who did the grammar to start with, but I got the impression that it came from Jena. Andy always maintained it, and I presumed that he’d written the original, but I don’t know

Kelvin15:08:21

By “he” do you mean Andy Seaborne (who’s listed as one of the query language editors and is a current maintainer of Jena)?

quoll15:08:34

Yes, sorry. Andy Seabourne

Kelvin16:08:00

But yeah thanks for the help on this; I wonder if you have any insight on my earlier question though: https://clojurians.slack.com/archives/C09GHBXRC/p1659361416947169

Eric Scott16:08:57

I'm still waiting to see my first pretty BNF spec. :-)

2022-08-01

Channels