This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2022-08-01
Channels
- # aleph (3)
- # announcements (10)
- # babashka (6)
- # bangalore-clj (4)
- # beginners (91)
- # biff (7)
- # cider (25)
- # cljs-dev (1)
- # clojure (109)
- # clojure-europe (9)
- # clojure-norway (5)
- # clojure-uk (1)
- # clojurescript (22)
- # cursive (22)
- # data-science (1)
- # datalevin (5)
- # datomic (7)
- # emacs (7)
- # etaoin (1)
- # events (3)
- # graphql (12)
- # hyperfiddle (1)
- # inf-clojure (1)
- # lsp (69)
- # luminus (1)
- # meander (21)
- # nbb (4)
- # off-topic (27)
- # other-languages (12)
- # rdf (58)
- # releases (3)
- # remote-jobs (2)
- # rum (12)
- # shadow-cljs (4)
- # sql (3)
- # xtdb (1)
So I’m revisiting the https://github.com/yetanalytics/flint library and one of the changes I’m proposing is to prevent aggregate expressions from being nested inside other aggregates.
That is because that is not allowed in Apache Jena (which I’m using as my reference implementation); you’d get this exception:
; Execution error (QueryParseException) at org.apache.jena.sparql.lang.QueryParserBase/throwParseException (QueryParserBase.java:578).
; Line [x], column [y]: Nested aggregate in expression not legal
Can you give an example of such a nested aggregate?
Does it make semantic sense? Not really (which is probably why Jena has that restriction)
For the record this is totally legal:
SELECT (AVG(?x) AS ?avg) (MAX(?avg) AS ?maxAvg)
WHERE {
?x ?y ?z .
}
GROUP BY ?z
I agree that doesn't make logical sense to take the aggregate of an aggregate.
Which means even if it’s technically allowed in the syntax, I’ll likely add that restriction because a) it doesn’t make semantic sense and b) we’d don’t really want syntax that’s not supported by certain impls like Jena, right?
In my mind adding a restriction like this puts you on the slope of implementing a poor mans type checker; or at least you’d be mixing concerns of static analysis with parsing.
As you’ve noted it looks like this is visible in Jena’s implementation. They’re denying the first form as syntactically invalid (when it’s not), but can’t detect it when the functions aren’t composed directly, and an intermediate variable is introduced.
By way of analogy; this is a perfectly valid clojure form: (inc "foo")
but it’s a semantically incorrect program; and will result in an evaluation error when run.
The error isn’t detected in the parser however.
I think static analysis features are super useful; but if implemented that they should occur in a separate layer / concern; for example a linter like clj-kondo
or type checker.
Out of interest what does Jena return when you run this? https://clojurians.slack.com/archives/C09GHBXRC/p1659374765264119?thread_ts=1659361477.275269&cid=C09GHBXRC
I did some playing around in the https://sparql-playground.sib.swiss/. The following:
SELECT (COUNT(?s) AS ?countS) (MAX(?countS) AS ?maxCountS) WHERE {
?s ?p ?o
}
LIMIT 10
returns 14
. If you remove the MAX
expression binding then ?countS
is also 14
.Given that the query actually went through without error it seems that it is indeed a Jena-specific detail
Yes, this is the behaviour I’d expect to be honest. The query should run and remove the bindings for the solution when it errors during evaluation. The spec describes this here:
https://www.w3.org/TR/sparql11-query/#aggregateExample2
Also note the definition of MAX in the spec,and the description for Set functions here where they’re described as receiving a a value of type MultiSet:
https://www.w3.org/TR/sparql11-query/#setFunctions
The above is interesting though because it implies to me that the query below should also fail, even though the grammar permits it, as 10
is a literal integer not a MultiSet:
SELECT (MAX(10) AS ?maxCountS) WHERE {
}
I’ve tried this and your variants on the sparql playground (sesame), stardog and jena and all return "10"
. Which makes sense but it feels like it should be an error according to the spec; though I’ve not looked thoroughly.
I suspect all these implementations coerce the literals in this case into multisets through some implicit coercion or something; though it seems peculiar with regards to your discovery.Thing is, I can’t find such a restriction in the https://www.w3.org/TR/sparql11-query/, so I’d like to know if this is a general restriction or a Jena implementation detail.
@U02FU7RMG8M: I’m pretty sure this is a Jena implementation detail, and for what it’s worth I’d advise against copying it. It looks to me that this isn’t a syntactic concern it’s a semantics concern; and flint should stick to just syntax. I’m not arguing that Jena shouldn’t error in these cases; but IMHO they shouldn’t error in the parser; they should error in the evaluator. > we’d don’t really want syntax that’s not supported by certain impls like Jena, right? I think this is a slippery slope and if you defer to implementations as your oracle you’ll be cutting the expressivity of the language to whatever common subset they all support; which will be less than the union of their collective grammar/features. So if anything I’d argue that you should err on the side of the opposite direction. For example many triple stores implement extensions to the SPARQL grammar, and you probably don’t want to rule out ever supporting these extensions. In my mind this would benefit from a clear policy of potentially growing the grammar (or at least fully supporting it) rather than artificially restricting it.
> you’ll be cutting the expressivity of the language to whatever common subset they all support; which will be less than the union of their collective grammar/features. I actually thought that only implementing the common subset would be desirable; that way, you can just plug Flint (or whatever lib) into whatever implementation and you won’t get error messages from said implementation.
In the general case, and as a general principle to follow beyond this specific example, I think applications (and flint) can rely on the implementations erroring; and I don’t see why flint needs to protect me from the implementation erroring. Yes the error can happen sooner if you can detect it syntactically; but you’re on a slippery slope, and I think it’s more principled to stay out of static analysis or type checking etc. Yes those could be useful things; but I think they should be built on top of something like flint; or it’s intermediate representation, and not baked into the parsing itself. In this specific case; I think (though I’m not 100% sure) we’ve discovered that Jena is returning the wrong result in this case. It should be removing the binding from that solution during evaluation; not erroring during parsing. Is it a big deal? Not really; because why would you write query. My point on common subset is really that the spec should be the common subset you implement; and that you should focus on correctness and the grammar/parsing. Jena is I think strictly speaking doing less than the spec requires here, so flint doing that would mean flint is targeting the lowest common denominator of sparql; which will be less than sparql itself specifies. Then once you have that subset you could possibly choose to increase the size of grammar beyond the spec by supporting various syntax extensions some implementations may provide. I’m not suggesting this by the way; just that you could.
Also another question: according to the https://www.w3.org/TR/sparql11-query/#sparqlGrammar a GROUP BY
clause is defined as:
GroupClause ::= 'GROUP' 'BY' GroupCondition+
GroupCondition ::= BuiltInCall | FunctionCall | '(' Expression ( 'AS' Var )? ')' | Var
Yet this seems to be legal in Jena?
SELECT (SUM(?z) AS ?sum)
WHERE {
?x ?y ?z .
}
GROUP BY (?x + ?x) ; not a function of the form FNAME(args...)
Yep @UB3R8UYA1 pointed that out to me yesterday!
You can have code that performs all of the operations, but actually connecting it to the parsed grammar is a pain
As I read this, ?y would have to be a URI and not a numeric literal. Can you SUM over URIs?
So the problem is that GROUP BY (?literal1 + ?literal2) fails but BIND(?literal1 + ?literal2 AS ?literal3) / GROUP BY (?literal3) does work?
The problem is that GROUP BY (?literal1 + ?literal2)
SHOULD fail according to the grammar but it doesn’t
So the critical part is here, then, I think: '(' Expression ( 'AS' Var )? ')'
I think the ?
after the ('AS Var)' makes that optional, so I think you're within your rights to just put an Expression
Though I can't think of how you'd reference that downstream.
Well that’s a confusing way to write BrackettedExpression | '(' Expression 'AS' Var ')'
> Though I can’t think of how you’d reference that downstream. It’s all part of covering my bases when it comes to syntax, even if semantically it doesn’t make sense
☝️ I think you’ve hit the nail on the head here and that this is the right intuition! i.e. focus on grammar/syntax in flint and covering your bases. Don’t shrink the grammar because some statements within the grammar are nonsensical. It’s always the case in essentially language/grammar of sufficient complexity that there are syntactically valid statements which are nonsensical. The best known ways to short cut this are type-checkers; and it’s proven that even the most advanced type checkers can’t detect all possible errors.
Indeed, I don’t want Flint to become a type checker - notice how expressions are completely untyped. I think one issue is that the line between pure syntax checking and type checking can get quite blurred, especially when you have a contract system like Spec.
> I think one issue is that the line between pure syntax checking and type checking can get quite blurred Yes it can. I think the SPARQL BNF may illustrate this point too. > especially when you have a contract system like Spec Yes, spec does blur these lines a little too. I think part of the problem there is that spec complects parsing with property-testing/generation. Can you point to any specific examples in flint where the boundary is blurred?
> Can you point to any specific examples in flint where the boundary is blurred?
The simplest example I could think of would be with LIMIT
clauses which requires not only integers, but positive, nonzero integers (see https://github.com/yetanalytics/flint/blob/main/src/main/com/yetanalytics/flint/spec/modifier.cljc#L32). Since that is a case where not only the axiom has to be an integer syntactically, but it also has to conform to the aforementioned properties.
I think the SPARQL grammar mandates an unsigned / non negative integer in those places too though; so I think it’s reasonable for flint to do the same there too.
Yeah I agree that's ugly.
Yes. I think it may have been cleaner if they’d defined a generic function syntax; and then defined elsewhere the minimal set of functions and their names that should exist in the core language. Instead they made the built in functions all be keywords in the language grammar itself. I guess it’s all trade offs; the complexity needs to go somewhere; they just chose to move a lot of it into the grammar. Presumably to try and simplify the wider specification effort.
I don’t know who did the grammar to start with, but I got the impression that it came from Jena. Andy always maintained it, and I presumed that he’d written the original, but I don’t know
By “he” do you mean Andy Seaborne (who’s listed as one of the query language editors and is a current maintainer of Jena)?
But yeah thanks for the help on this; I wonder if you have any insight on my earlier question though: https://clojurians.slack.com/archives/C09GHBXRC/p1659361416947169
I'm still waiting to see my first pretty BNF spec. :-)