instaparse 2022-05-24 | Slack Archive

I'm trying to parse a rule like this: (parser "EOL ::= [#xD#xA]+"), but it blows up with a parse error:

EOL ::= [#xD#xA]+
         ^
Expected one of:
!
&
ε
eps
EPSILON
epsilon
Epsilon
<
(
{
[
#"#\"[^\"\\]*(?:\\.[^\"\\]*)*\"(?x) #Double-quoted regexp"
#"#'[^'\\]*(?:\\.[^'\\]*)*'(?x) #Single-quoted regexp"
#"\"[^\"\\]*(?:\\.[^\"\\]*)*\"(?x) #Double-quoted string"
#"'[^'\\]*(?:\\.[^'\\]*)*'(?x) #Single-quoted string"
(*
#"[^, \r\t\n<>(){}\[\]+*?:=|'"#&!;./]+(?x) #Non-terminal"

winsome 21:05:45

I'm going off of this EBNF syntax: https://www.w3.org/TR/REC-xml/#sec-notation

winsome 21:05:46

"#xN - where N is a hexadecimal integer, the expression matches the character whose number (code point) in ISO/IEC 10646 is N. The number of leading zeros in the #xN form is insignificant."

winsome 21:05:11

Do I need to translate that syntax into some other representation? Is there one in particular that I should choose?

hiredman 22:05:16

instaparse uses clojure's syntax for regexes, so it expects # to be the start of a regex, maybe \ to escape it (would have to be \\ in a string literal)

winsome 22:05:21

oh, it didn't occur to me that it would look for those inside a string.

winsome 22:05:42

Escaping with \ or \\ produces the same problem, though.

winsome 22:05:41

These are the code points for CR and LF, I believe. Maybe I need to translate those into the Clojure versions.

hiredman 22:05:47

ah, yes, well even if # didn't throw the above error, the syntax they use for matching octets is not a thing

hiredman 22:05:04

(codepoints, not octets)

aengelberg 22:05:10

yeah, the problem is that #xN is a pseudo-syntax that the XML specification may have invented for its own grammar, to help clarify the nuances of the character code points. But Instaparse doesn’t know how to interpret that as an actual parser.

hiredman 22:05:39

the way to embed a character by code point in a clojure string is \uN

👍 1

winsome 22:05:09

Is N a hex number?

hiredman 22:05:09

user=> "\u0029"
")"
user=>

hiredman 22:05:31

(yes)

winsome 22:05:12

"\u000D\u000A"
"\r\n"

aengelberg 22:05:29

I think this should work in instaparse:

EOL ::= "\u000D" | "\u000A"

aengelberg 22:05:37

actually, this might not work if you’re slurping the grammar from a file and passing that into instaparse. the \u000A thing is a Clojure reader feature, not an instaparse feature
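
The reader point can be checked at a REPL without instaparse. This is a minimal sketch; the names from-source and from-file are made up for illustration, and from-file only simulates what slurping a grammar file containing the escape would hand you.

```clojure
;; Hypothetical names, for illustration only.
;; In Clojure source, the reader turns \u000A into a single line-feed
;; character before any library sees the string.
(def from-source "\u000A")

;; Text slurped from a file does not go through the reader, so the same
;; six characters stay as-is: backslash, u, 0, 0, 0, A.
(def from-file "\\u000A")

(count from-source)   ; one character
(count from-file)     ; six characters
(= from-source "\n")  ; the escape and \n name the same character
```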

aengelberg 22:05:44

Java regexes also support referring to chars as code points, which means you can use the Instaparse regex feature as well:

EOL ::= #"[\\x0D\\x0A]"
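
The hex escapes themselves can be tried with plain Clojure regexes before involving a grammar. A rough sketch, assuming nothing beyond clojure.core; eol-re and eol-re-from-string are hypothetical names, and the + is added here to mirror the original EOL rule.

```clojure
;; \x0D and \x0A are java.util.regex hex escapes for carriage return and
;; line feed. A regex literal passes the backslashes straight through.
(def eol-re #"[\x0D\x0A]+")

(re-matches eol-re "\r\n")  ; matches the whole CRLF sequence
(re-matches eol-re "abc")   ; nil, there is no CR or LF to match

;; Built from an ordinary string, the backslashes have to be doubled,
;; which is the same doubling needed inside a grammar string.
(def eol-re-from-string (re-pattern "[\\x0D\\x0A]+"))

(re-matches eol-re-from-string "\n")
```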

winsome 22:05:44

(grammar/parser "EOL ::= #\"[\\x0D\\x0A]\"") seems to work.

👍 1

winsome 22:05:27

And changing the double quote to a single quote makes it a little less messy: (grammar/parser "EOL ::= #'[\\x0D\\x0A]'")
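
Put together, the single-quoted form might be used like this. This is only a sketch: it assumes instaparse is on the classpath, uses the alias insta where the messages above use grammar, introduces the hypothetical name eol-parser, and adds the + from the original rule.

```clojure
(require '[instaparse.core :as insta])

;; eol-parser is a hypothetical name. The doubled backslashes are for the
;; Clojure string; instaparse receives the regex [\x0D\x0A]+.
(def eol-parser
  (insta/parser "EOL ::= #'[\\x0D\\x0A]+'"))

;; One or more CR/LF characters should parse as a single EOL node.
(eol-parser "\r\n")

;; Anything else should come back as a failure object, not an exception.
(insta/failure? (eol-parser "abc"))
```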

winsome 22:05:30

Thanks!

aengelberg 22:05:03

no problem
