rdf

Kelvin 2022-04-04T14:05:44.104429Z

SPARQL question - for escaping newlines and returns in strings, is the correct escaped char \\\n or \\\\n?

Kelvin 2022-04-04T14:06:05.173419Z

I thought it was the latter from reading the SPARQL grammar, but apparently Apache Jena thinks it's the former and I misunderstood the grammar

Kelvin 2022-04-04T14:08:16.905759Z

For reference here are the relevant branches and terminals in the SPARQL CFG:

[156]  	STRING_LITERAL1	  ::=  	"'" ( ([^#x27#x5C#xA#xD]) | ECHAR )* "'"
[157]  	STRING_LITERAL2	  ::=  	'"' ( ([^#x22#x5C#xA#xD]) | ECHAR )* '"'
[158]  	STRING_LITERAL_LONG1	  ::=  	"'''" ( ( "'" | "''" )? ( [^'\] | ECHAR ) )* "'''"
[159]  	STRING_LITERAL_LONG2	  ::=  	'"""' ( ( '"' | '""' )? ( [^"\] | ECHAR ) )* '"""'
[160]  	ECHAR	  ::=  	'\' [tbnrf\"']

quoll 2022-04-04T19:53:21.227309Z

It's correct so long as the SPARQL parser sees a \ character followed by an n character. In Clojure, that would be encoded in a string as: \\n. If your Clojure/Java string looks like \\\n then this will be a \ character followed by a newline character (ASCII 0xA). It should print like this:

=> (println "\\\n")
\

nil
=>
You'll see that this sequence is not part of ECHAR. However, depending on the parser, a \ followed by a non-special character can result in that character just being allowed through (this makes it easier to deal with \\, \", and \' particularly, for instance, when allowing \" to pass through as a " when you're in a ' delimited string, since no \ is needed in that case). So it doesn't surprise me if it works.
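
The difference between a strict and a lenient parser can be sketched as below. This is only an illustration of the idea, not how Jena or any particular parser is implemented; `echar->char` and `unescape` are hypothetical names.

```clojure
(require '[clojure.string :as str])

;; Maps the second character of each ECHAR escape to the character it denotes.
(def echar->char
  {\t \tab, \b \backspace, \n \newline, \r \return, \f \formfeed,
   \\ \\ , \" \" , \' \' })

(defn unescape
  "Replaces each backslash-plus-character pair in s.
  A pair listed in ECHAR becomes the character it denotes.
  For any other pair: with :lenient? true the second character is
  passed through unchanged, otherwise an exception is thrown."
  [s & {:keys [lenient?]}]
  (str/replace s #"(?s)\\(.)"
               (fn [[_ c]]
                 (let [ch (first c)]
                   (cond
                     (contains? echar->char ch) (str (echar->char ch))
                     lenient?                   c
                     :else (throw (ex-info "Invalid escape" {:char ch})))))))

;; Backslash + n is in ECHAR, so both modes produce a real newline.
(unescape "foo\\nbar")

;; Backslash + newline is not in ECHAR: strict mode throws,
;; lenient mode lets the newline through.
(unescape "foo\\\nbar" :lenient? true)
```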

Kelvin 2022-04-04T19:59:27.455289Z

Ok what is truly bizarre is how Jena treats different numbers of backslashes:

(import '[org.apache.jena.query QueryFactory])

(QueryFactory/create "SELECT ?x WHERE { ?x ?y \"foo\nbar\"}")     ; => Bad!
(QueryFactory/create "SELECT ?x WHERE { ?x ?y \"foo\\nbar\"}")    ; => Good!
(QueryFactory/create "SELECT ?x WHERE { ?x ?y \"foo\\\nbar\"}")   ; => Bad!
(QueryFactory/create "SELECT ?x WHERE { ?x ?y \"foo\\\\nbar\"}")  ; => Good!

Kelvin 2022-04-04T20:00:37.952809Z

The first one makes sense since the \n is unescaped. What I don't get is why Jena treats \\n differently from \\\n even though they both have 2 chars in Clojure!

quoll 2022-04-04T20:05:50.307979Z

The second one (`\\n`) is correct. The next one (`\\\n`) is bad, because sparql sees ascii chars: 0x5c, 0xa. The 0x5c character \ is not allowed to be followed by anything except one of: tbnrf\"' The final one is where sparql sees ascii chars: 0x5c, 0x5c, 0x6e. That's 3 characters: \\n. This is valid SPARQL, but probably not what you want.
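
One way to check which characters SPARQL actually receives is to print the code of each char in the Clojure string. A small sketch for JVM Clojure (it relies on `format`); `hex-codes` is a hypothetical helper name.

```clojure
(defn hex-codes
  "Returns the code of each char in s as a hex string."
  [s]
  (map #(format "0x%x" (int %)) s))

(hex-codes "\n")      ; => ("0xa")                 a bare newline
(hex-codes "\\n")     ; => ("0x5c" "0x6e")         backslash, n
(hex-codes "\\\n")    ; => ("0x5c" "0xa")          backslash, newline
(hex-codes "\\\\n")   ; => ("0x5c" "0x5c" "0x6e")  backslash, backslash, n
```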

Kelvin 2022-04-04T20:33:53.262889Z

I realize that another complication is that chars in the SPARQL grammar rules are unescaped

Kelvin 2022-04-04T20:34:36.186139Z

Took me a while to realize that in ECHAR := '\' [tbnrf\"'] the second \ and the " were two separate chars, not an escaped "

quoll 2022-04-04T20:43:34.486039Z

Ah yes! That's a gotcha. It's a bit weird, since that's not how the equivalent regex has to be written: #"\\[tbnrf\\\"']". The EBNF syntax does not seem to have escape characters itself, but I'm not sure of that. So maybe if they'd written it as ECHAR := '\' [tbnrf"'\] then perhaps it could have been clearer? But then I'm wondering, "Is the closing bracket being escaped?"
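
Grammar rules [157] and [160] can be transcribed into a Clojure regex roughly as follows. This is a sketch, not Flint's actual regex: `short-string-body-re` and `valid-short-string-body?` are hypothetical names, and the check covers only the text between the double quotes.

```clojure
;; Body of a double-quoted short string (STRING_LITERAL2):
;; any char except ", \, LF, CR, or else an ECHAR (backslash + one of tbnrf\"').
(def short-string-body-re
  #"(?:[^\"\\\n\r]|\\[tbnrf\\\"'])*")

(defn valid-short-string-body?
  "True if s, as the SPARQL parser would see it, is a valid body
  for a double-quoted short string."
  [s]
  (boolean (re-matches short-string-body-re s)))

;; The four strings from the Jena experiment, written as Clojure literals:
(valid-short-string-body? "foo\nbar")      ; raw newline
(valid-short-string-body? "foo\\nbar")     ; backslash, n
(valid-short-string-body? "foo\\\nbar")    ; backslash, raw newline
(valid-short-string-body? "foo\\\\nbar")   ; backslash, backslash, n
```

Under this sketch the first and third calls return false and the second and fourth return true, matching the Bad!/Good! results reported for Jena.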

quoll 2022-04-04T20:44:51.916239Z

Anyway, I was actually looking at exactly this just a few nights ago 🙂

quoll 2022-04-04T20:45:08.476369Z

I'm trying to write a SPARQL->Asami wrapper

quoll 2022-04-04T20:45:33.801629Z

I have a long way to go 😞

quoll 2022-04-04T20:49:05.205559Z

(I'm also trying to write a fast TTL parser in another project, and that's having to do similar things)

Kelvin 2022-04-04T21:15:13.391599Z

Will the wrapper be named Korra? 😉

quoll 2022-04-04T21:39:06.759109Z

No. That naming scheme came about from someone else. Since I don't have to worry about that anymore, I'm calling it "Twylyte".

Kelvin 2022-04-04T20:07:13.205559Z

> This is valid SPARQL, but probably not what you want.
Because SPARQL and Java/Clojure treat escaped chars slightly differently?

quoll 2022-04-04T20:07:58.551689Z

No, it's because Clojure has already escaped your characters before the string even gets to SPARQL

quoll 2022-04-04T20:09:27.992539Z

In the second one, you have the string in Clojure code:
"SELECT ?x WHERE { ?x ?y \"foo\\nbar\"}"
Those escapes are interpreted, and you end up with:
SELECT ?x WHERE { ?x ?y "foo\nbar"}
This is what the SPARQL parser will get.

Kelvin 2022-04-04T20:09:59.542359Z

I see - that confirms my suspicions that Flint's valid string regex is wrong

Kelvin 2022-04-04T20:10:06.836959Z

Right now Flint thinks \n and \\n are wrong whereas \\\n and \\\\n are correct

Kelvin 2022-04-04T20:10:46.089569Z

So thank you for the correction - escape characters (especially escaping escape characters) are one of the parts of Clojure that greatly confuse me

quoll 2022-04-04T20:10:59.877699Z

Assuming those are in Clojure code, the first one is valid in a long string, and the second is valid in a short string or a long string
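
The long-string case can be sketched the same way from grammar rule [159]. Again a sketch with hypothetical names (`long-string-body-re`, `valid-long-string-body?`), covering only the text between the triple double quotes.

```clojure
;; Body of a triple-double-quoted long string (STRING_LITERAL_LONG2):
;; repeated units of up to two quotes followed by either any char
;; except " and \ (raw newlines are allowed here) or an ECHAR.
(def long-string-body-re
  #"(?:\"{0,2}(?:[^\"\\]|\\[tbnrf\\\"']))*")

(defn valid-long-string-body?
  "True if s, as the SPARQL parser would see it, is a valid body
  for a triple-double-quoted long string."
  [s]
  (boolean (re-matches long-string-body-re s)))

(valid-long-string-body? "foo\nbar")     ; raw newline
(valid-long-string-body? "foo\\nbar")    ; backslash, n
(valid-long-string-body? "foo\\\nbar")   ; backslash, raw newline
```

Under this sketch the first two calls return true and the third returns false, since a backslash must still be followed by one of the ECHAR characters.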

quoll 2022-04-04T20:11:44.669109Z

If you're working at a repl, just println the string. That shows you the unescaped version (fixed from prn)

🤯 1
Kelvin 2022-04-04T20:11:55.952709Z

Flint coerces all strings into short strings for simplicity's sake

quoll 2022-04-04T20:13:29.628929Z

Ugh. Sorry. I meant println (not prn)

quoll 2022-04-04T20:13:58.329399Z

=> (println "SELECT ?x WHERE { ?x ?y \"foo\\nbar\"}")
SELECT ?x WHERE { ?x ?y "foo\nbar"}

👍 1
Kelvin 2022-04-04T20:19:16.812649Z

Ah I realized that char-array helps a lot with this too:

(seq (char-array "\n")) => (\newline)
(seq (char-array "\\n")) => (\\ \n)
(seq (char-array "\\\n")) => (\\ \newline)
(seq (char-array "\\\\n")) => (\\ \\ \n)

👍 3
Kelvin 2022-04-04T20:47:28.668269Z

And of course when I went and made a PR I realized I also have to deal with ClojureScript. Fun.

Kelvin 2022-04-04T21:08:07.722989Z

Well I finally got a PR up: https://github.com/yetanalytics/flint/pull/22 Expect a new version in the near future, if not tomorrow.