#rdf
2022-04-04
Kelvin14:04:44

SPARQL question - for escaping newlines and returns in strings, is the correct escaped char \\\n or \\\\n?

Kelvin14:04:05

I thought it was the latter from reading the SPARQL grammar, but apparently Apache Jena thinks it's the former and I misunderstood the grammar

Kelvin14:04:16

For reference here are the relevant branches and terminals in the SPARQL CFG:

[156]  	STRING_LITERAL1	  ::=  	"'" ( ([^#x27#x5C#xA#xD]) | ECHAR )* "'"
[157]  	STRING_LITERAL2	  ::=  	'"' ( ([^#x22#x5C#xA#xD]) | ECHAR )* '"'
[158]  	STRING_LITERAL_LONG1	  ::=  	"'''" ( ( "'" | "''" )? ( [^'\] | ECHAR ) )* "'''"
[159]  	STRING_LITERAL_LONG2	  ::=  	'"""' ( ( '"' | '""' )? ( [^"\] | ECHAR ) )* '"""'
[160]  	ECHAR	  ::=  	'\' [tbnrf\"']

quoll19:04:21

It's correct so long as the SPARQL parser sees a \ character followed by an n character. In Clojure, that would be encoded in a string as: \\n. If your Clojure/Java string looks like \\\n, then this will be a \ character followed by a newline character (ASCII 0xA). It should print like this:

=> (println "\\\n")
\

nil
=>
You'll see that this sequence is not part of ECHAR. However, depending on the parser, a \ followed by a non-special character can result in that character just being allowed through (this makes it easier to deal with \\, \", and \' particularly, for instance, when allowing \" to pass through as a " when you're in a ' delimited string, since no \ is needed in that case). So it doesn't surprise me if it works.

Kelvin19:04:27

Ok what is truly bizarre is how Jena treats different numbers of backslashes:

(import '[org.apache.jena.query QueryFactory])

(QueryFactory/create "SELECT ?x WHERE { ?x ?y \"foo\nbar\"}")     ;; => Bad!
(QueryFactory/create "SELECT ?x WHERE { ?x ?y \"foo\\nbar\"}")    ;; => Good!
(QueryFactory/create "SELECT ?x WHERE { ?x ?y \"foo\\\nbar\"}")   ;; => Bad!
(QueryFactory/create "SELECT ?x WHERE { ?x ?y \"foo\\\\nbar\"}")  ;; => Good!

Kelvin20:04:37

The first one makes sense since the \n is unescaped. What I don't get is why Jena treats \\n differently from \\\n even though they both have 2 chars in Clojure!

quoll20:04:50

The second one (`\\n`) is correct. The next one (`\\\n`) is bad, because SPARQL sees ASCII chars: 0x5c, 0xa. The 0x5c character \ is not allowed to be followed by anything except one of: tbnrf\"'. The final one is where SPARQL sees ASCII chars: 0x5c, 0x5c, 0x6e. That's 3 characters: \\n. This is valid SPARQL (an escaped backslash followed by a plain n), but probably not what you want.
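A minimal sketch of the four cases above, assuming a hand-written regex transcribed from rules [157] and [160]. The names `short-string-body` and `valid-short-body?` are hypothetical helpers for illustration, not part of Jena or Flint, and the check covers only the text between the double quotes of a short string.

```clojure
;; Hypothetical regex for the body of STRING_LITERAL2:
;;   ( [^#x22#x5C#xA#xD] | ECHAR )*   with   ECHAR ::= '\' [tbnrf\"']
(def short-string-body
  #"(?:[^\"\\\n\r]|\\[tbnrf\\\"'])*")

(defn valid-short-body?
  "True if s (the characters SPARQL actually sees) is a legal short-string body."
  [s]
  (some? (re-matches short-string-body s)))

;; Each argument is a Clojure literal, so the reader has already
;; interpreted one level of escapes before the regex runs.
(valid-short-body? "foo\nbar")     ;; raw newline           => false
(valid-short-body? "foo\\nbar")    ;; \ then n              => true
(valid-short-body? "foo\\\nbar")   ;; \ then raw newline    => false
(valid-short-body? "foo\\\\nbar")  ;; \ \ then n            => true
```

The results line up with the Bad/Good pattern reported for `QueryFactory/create`, although Jena's real parser is of course not a single regex.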

Kelvin20:04:53

I realize that another complication is that chars in the SPARQL grammar rules are unescaped

Kelvin20:04:36

Took me a while to realize that in ECHAR := '\' [tbnrf\"'] the second \ and the " were two separate chars, not an escaped "

quoll20:04:34

Ah yes! That's a gotcha. It's a bit weird, since it's not how the equivalent regex must be written: #"\\[tbnrf\\\"']". The EBNF syntax does not seem to have escape characters itself, but I'm not sure of that. So maybe if they'd written it as ECHAR := '\' [tbnrf"'\] then perhaps it could have been clearer? But then I'm wondering, "Is the closing bracket being escaped?"

quoll20:04:51

Anyway, I was actually looking at exactly this just a few nights ago 🙂

quoll20:04:08

I'm trying to write a SPARQL->Asami wrapper

quoll20:04:33

I have a long way to go 😞

quoll20:04:05

(I'm also trying to write a fast TTL parser in another project, and that's having to do similar things)

Kelvin21:04:13

Will the wrapper be named Korra? 😉

quoll21:04:06

No. That naming scheme came about from someone else. Since I don't have to worry about that anymore, I'm calling it "Twylyte".

Kelvin20:04:13

> This is valid SPARQL, but probably not what you want.
Because SPARQL and Java/Clojure treat escaped chars slightly differently?

quoll20:04:58

No, it's because Clojure has already interpreted the escapes in your string before it even gets to SPARQL

quoll20:04:27

In the second one, you have the string in Clojure code:
"SELECT ?x WHERE { ?x ?y \"foo\\nbar\"}"
Those escapes are interpreted, and you end up with:
SELECT ?x WHERE { ?x ?y "foo\nbar"}
This is what the SPARQL parser will get.

Kelvin20:04:59

I see - that confirms my suspicions that Flint's valid string regex is wrong

Kelvin20:04:06

Right now Flint thinks \n and \\n are wrong whereas \\\n and \\\\n are correct

Kelvin20:04:46

So thank you for the correction - escape characters (especially escaping escape characters) are one of the parts of Clojure that greatly confuse me

quoll20:04:59

Assuming those are in Clojure code, the first one is valid in a long string, and the second is valid in a short string or a long string
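To illustrate the long-string case, here is a rough sketch assuming a regex transcribed by hand from rule [159]. `long-string-body` and `valid-long-body?` are hypothetical names, and the check covers only the text between the triple quotes.

```clojure
;; Hypothetical regex for the body of STRING_LITERAL_LONG2:
;;   ( ( '"' | '""' )? ( [^"\] | ECHAR ) )*
(def long-string-body
  #"(?:(?:\"\"?)?(?:[^\"\\]|\\[tbnrf\\\"']))*")

(defn valid-long-body?
  "True if s (the characters SPARQL actually sees) is a legal long-string body."
  [s]
  (some? (re-matches long-string-body s)))

(valid-long-body? "foo\nbar")    ;; raw newline is allowed here => true
(valid-long-body? "foo\\nbar")   ;; \ then n (ECHAR)            => true
(valid-long-body? "foo\\\nbar")  ;; \ then raw newline          => false
```

Unlike the short-string rule, the long-string character class does not exclude 0xA or 0xD, which is why the raw newline passes.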

quoll20:04:44

If you're working at a repl, just println the string. That shows you the unescaped version (fixed from prn)

🤯 1
Kelvin20:04:55

Flint coerces all strings into short strings for simplicity's sake
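One way to put an arbitrary Clojure string into a short string is to escape the characters that rule [157] excludes. The sketch below is an assumption for illustration only: `escape-short` is a hypothetical helper, not Flint's actual implementation, and it handles only backslash, double quote, newline, return, and tab.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: rewrite characters that may not appear raw in a
;; double-quoted SPARQL short string as their ECHAR forms.
(defn escape-short [s]
  (str/escape s {\\       "\\\\"   ; backslash    -> \\
                 \"       "\\\""   ; double quote -> \"
                 \newline "\\n"    ; 0xA          -> \n
                 \return  "\\r"    ; 0xD          -> \r
                 \tab     "\\t"})) ; 0x9          -> \t

(def query
  (str "SELECT ?x WHERE { ?x ?y \"" (escape-short "foo\nbar") "\" }"))

(println query)
;; SELECT ?x WHERE { ?x ?y "foo\nbar" }
```

The printed query contains a backslash followed by an n, which is the two-character sequence the SPARQL parser expects.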

quoll20:04:29

Ugh. Sorry. I meant println (not prn)

quoll20:04:58

=> (println "SELECT ?x WHERE { ?x ?y \"foo\\nbar\"}")
SELECT ?x WHERE { ?x ?y "foo\nbar"}

👍 1
Kelvin20:04:16

Ah I realized that char-array helps a lot with this too:

(seq (char-array "\n"))      ;; => (\newline)
(seq (char-array "\\n"))     ;; => (\\ \n)
(seq (char-array "\\\n"))    ;; => (\\ \newline)
(seq (char-array "\\\\n"))   ;; => (\\ \\ \n)

👍 3
Kelvin20:04:28

And of course when I went and made a PR I realized I also have to deal with ClojureScript. Fun.

Kelvin21:04:07

Well I finally got a PR (https://github.com/yetanalytics/flint/pull/22) up. Expect a new version in the near future, if not tomorrow.