Fork me on GitHub
#instaparse
<
2017-06-06
>
wilkerlucio12:06:28

hello people, I'm stuck with a instaparse rule, maybe someone here can help me out 🙂

wilkerlucio12:06:36

I'm writing a parser for Javascript regexes

wilkerlucio12:06:40

this is the current grammar:

wilkerlucio12:06:49

Regex = <'/'> Alternation <'/'> MatchFlag*

Alternation = Concatenation (<'|'> Concatenation)*

Concatenation = SuffixedExpr*

SuffixedExpr = SingleExpr Suffix?
SingleExpr = BaseExpr | ParenthesizedExpr
ParenthesizedExpr = <'('> GroupFlags? Alternation <')'>
Suffix = (Optional | Positive | NonNegative | CurlyRepetition) Quantifier?
Optional = <'?'>
Positive = <'+'>
NonNegative = <'*'>
CurlyRepetition = <'{'> #"\d+" (<','> #"\d+" ?) ? <'}'>
Quantifier = '?' | '+'
BaseExpr = CharExpr | LiteralChar | Anchor | BackReference

Anchor = '^' | '$' | '\\' #"[bB]"
LiteralChar = PlainChar | EscapedChar

BackReference = <'\\'> #"[1-9][0-9]*"

PlainChar = #"[^.|\\+*$^\[(){?]"
CharExpr = Dot | LiteralChar | BCC
Dot = '.'

BCC = <'['> BCCUnionLeft? <']'>

BCCUnionLeft = BCCNegation? BCCElemBase*

BCCNegation = '^'

BCCElemBase = BCCCharNonRange | SpecialCharClass | BCCRange | BCC
BCCRangeRightable = BCCCharEndRange | SpecialCharClass
BCCRange = BCCChar <'-'> BCCCharEndRange
BCCRangeWithBracket = <']-'> BCCCharEndRange
BCCCharNonRange = BCCChar !('-' BCCRangeRightable)
BCCChar = BCCPlainChar | EscapedChar
BCCCharEndRange = BCCPlainChar | EscapedChar
BCCPlainChar = #"[^\]\[\\]" | '\\b'

EscapedChar = SpecialCharClass | NormalSlashedCharacters | ControlChar | HexChar | BasicEscapedChar

HexChar = ShortHexChar | MediumHexChar | LongHexChar | VeryLongHexChar
ShortHexChar = <'\\x'> #'[0-9a-fA-F]{2}'
MediumHexChar = <'\\u'> #'[0-9a-fA-F]{4}'
LongHexChar = <'\\x{'> #'[0-9a-fA-F]{4}' <'}'>
VeryLongHexChar = <'\\x{'> #'[0-9a-fA-F]{6}' <'}'>
BasicEscapedChar = <'\\'> #"[\s\S]"
SpecialCharClass = <'\\'> #"[dDwWsSv0]"

NormalSlashedCharacters = #"\\[tnrf]"

ControlChar = <'\\c'> #"[A-Z]"

(** FLAGS **)
GroupFlags = NonCapturingMatchFlags
           | PositiveLookAheadFlag
           | NegativeLookAheadFlag

NonCapturingMatchFlags = <'?'> !')' <':'>
PositiveLookAheadFlag = <'?='>
NegativeLookAheadFlag = <'?!'>

MatchFlag = #"[gimuy]"

wilkerlucio12:06:39

I would like it to parse {, as a plain char, currently on the PlainChar definition this char is excluded to allow for the CurlyRepetition

wilkerlucio12:06:47

I'm probably missing something

wilkerlucio12:06:01

but if I allow the { at PlainChar

wilkerlucio12:06:15

then when I try to do: a{2}

wilkerlucio12:06:46

I was expect it to go into SuffixedExpr and match a PlainChar followed by a Suffix that would be a CurlyRepetition

wilkerlucio12:06:11

but instead seems like it's not matching the suffix, and instead matches a series of plain chars

wilkerlucio12:06:31

sorry the noise, I got a much simplified version now, by this grammar:

wilkerlucio12:06:34

Concatenation = SuffixedExpr*

SuffixedExpr = LiteralChar CurlyRepetition?

CurlyRepetition = <'{'> #"\d+" <'}'>

LiteralChar = #"."

wilkerlucio12:06:22

I expected it to match "a{2}" as a LiteralChar followed by a CurlyRepetition, but instead it matches as [:Concatenation [:SuffixedExpr [:LiteralChar "a"]] [:SuffixedExpr [:LiteralChar "{"]] [:SuffixedExpr [:LiteralChar "2"]] [:SuffixedExpr [:LiteralChar "}"]]]

wilkerlucio12:06:02

how can make it try force match the CurlyRepetition before stepping a level up and matching more literal chars?

aengelberg16:06:42

@wilkerlucio maybe ordered choice / in instaparse might help here?

aengelberg16:06:33

so you want a{2 to parse as 3 PlainChars, but a{2} to parse as a PlainChar followed by a CurlyRepetition?

wilkerlucio17:06:21

@aengelberg yes, with some help I was able to figure it out, here is a way to handle it:

wilkerlucio17:06:23

Concatenation = SuffixedExpr*

SuffixedExpr = LiteralChar CurlyRepetition? / AnyLiteralChar

CurlyRepetition = <'{'> #"\d+" <'}'>

LiteralChar = #"[^{]"
AnyLiteralChar = #"."

wilkerlucio17:06:49

I had to restrict the first one a little and have a second more permissive rule, now it parses the way I was expecting 🙂

aengelberg17:06:55

I suggest you take away the ? after CurlyRepetition, to make the grammar unambiguous (improving performance).

wilkerlucio17:06:50

@aengelberg but it is needed there, because the curlyrepetition is optional

wilkerlucio17:06:16

ah, I think I got you said, it would match anyway

wilkerlucio17:06:24

in my real case its a bit more complicated

wilkerlucio17:06:39

and the latest AnyLiteral actually just matches the {, otherwise other complications arise

wilkerlucio17:06:57

parsing regex is pretty annoying to be honest -.-

aengelberg17:06:21

also, something to watch out for: . does not include the newline character

aengelberg17:06:33

but [^{] does

wilkerlucio17:06:54

yeah, when I need everything I like to use something like [\s\S]

wilkerlucio17:06:57

so it matches everything

aengelberg17:06:08

that’s exactly what I was going to suggest

wilkerlucio17:06:16

in case you wonder, this is what my current grammar looks like (for parsing JS RegExp):

wilkerlucio17:06:20

Regex = <'/'> Alternation <'/'> MatchFlag*

Alternation = Concatenation (<'|'> Concatenation)*

Concatenation = SuffixedExpr*

SuffixedExpr = SingleExpr Suffix? / CurlyRepetition / LiteralSpecialChar
SingleExpr = BaseExpr | ParenthesizedExpr
ParenthesizedExpr = <'('> GroupFlags? Alternation <')'>
Suffix = (Optional | Positive | NonNegative | CurlyRepetition) Quantifier?
Optional = <'?'>
Positive = <'+'>
NonNegative = <'*'>
CurlyRepetition = <'{'> #"\d+" (<','> #"\d+" ?) ? <'}'>
Quantifier = '?' | '+'
BaseExpr = CharExpr | LiteralChar | Anchor | BackReference

Anchor = '^' | '$' | '\\' #"[bB]"
LiteralChar = PlainChar | EscapedChar
LiteralSpecialChar = '{'

BackReference = <'\\'> #"[1-9][0-9]*"

PlainChar = #"[^.|\\+*$^\[(){?]"
CharExpr = Dot / LiteralChar / BCCEmpty / BCC
Dot = '.'

BCC = <'['> BCCUnionLeft? <']'>
BCCEmpty = '[]'

BCCUnionLeft = BCCNegation? BCCElemBase*

BCCNegation = '^'

BCCElemBase = BCCCharNonRange | SpecialCharClass | BCCRange | BCC
BCCRangeRightable = BCCCharEndRange | SpecialCharClass
BCCRange = BCCChar <'-'> BCCCharEndRange
BCCRangeWithBracket = <']-'> BCCCharEndRange
BCCCharNonRange = BCCChar !('-' BCCRangeRightable)
BCCChar = BCCPlainChar | EscapedChar
BCCCharEndRange = BCCPlainChar | EscapedChar
BCCPlainChar = #"[^\]\[\\]" | '\\b'

EscapedChar = SpecialCharClass / NormalSlashedCharacters / ControlChar / HexChar / BasicEscapedChar

HexChar = ShortHexChar | MediumHexChar | LongHexChar | VeryLongHexChar
ShortHexChar = <'\\x'> #'[0-9a-fA-F]{2}'
MediumHexChar = <'\\u'> #'[0-9a-fA-F]{4}'
LongHexChar = <'\\x{'> #'[0-9a-fA-F]{4}' <'}'>
VeryLongHexChar = <'\\x{'> #'[0-9a-fA-F]{6}' <'}'>
BasicEscapedChar = <'\\'> #"[\s\S]"
SpecialCharClass = <'\\'> #"[dDwWsSv0]"

NormalSlashedCharacters = #"\\[tnrf]"

ControlChar = <'\\c'> #"[A-Z]"

(** FLAGS **)
GroupFlags = NonCapturingMatchFlags
           | PositiveLookAheadFlag
           | NegativeLookAheadFlag

NonCapturingMatchFlags = <'?'> !')' <':'>
PositiveLookAheadFlag = <'?='>
NegativeLookAheadFlag = <'?!'>

MatchFlag = #"[gimuy]"

wilkerlucio17:06:24

it doesn't need to be perfect, the usage of this is to port the test.chuck string-from-regex to CLJS

aengelberg17:06:40

gfredericks has already done work in test.chuck to parse Java regexes in instaparse, which may be useful: https://github.com/gfredericks/test.chuck/blob/master/resources/com/gfredericks/test/chuck/regex.bnf

wilkerlucio17:06:55

yeah, this is actually based of that

wilkerlucio17:06:01

I'm trying to port it to CLJS

aengelberg17:06:06

oh lol, you’re right, I just needed to look closer

aengelberg17:06:15

you’re porting test.chuck to cljs?

aengelberg17:06:20

or just the regex generator?

wilkerlucio17:06:33

just the regex generator, the rest is already all cljc actually

aengelberg17:06:42

ok this explains a lot 🙂

aengelberg17:06:26

so then how did you get the grammar into a weird state that behaved improperly with curly repetitions, just by removing certain things not part of the EcmaScript regex spec?

aengelberg17:06:34

or was it already like that?

wilkerlucio17:06:04

the Java Regex have many different features compared to JS one

aengelberg17:06:22

hmm I was under the impression that Java had a super-set of features to JS

aengelberg17:06:26

clearly I was mistaken

wilkerlucio17:06:34

no, JS is more tolerant in some cases

wilkerlucio17:06:54

for example, those are invalid on JVM, but ok on JS: /a{/ /a{}/ /[]/

wilkerlucio17:06:20

when JS sees an incomplete curly braces, it threats as literals

wilkerlucio17:06:27

where JVM throws an exception

wilkerlucio17:06:16

but in general the JS is simpler, since it doens't support character class unions (that feature adds a lot of complexity on the JVM Regex grammar, see: http://www.regular-expressions.info/charclassintersect.html)

wilkerlucio17:06:36

so, this is the kind of feature that has a strong dependency on the platform

wilkerlucio17:06:11

Gary did a great job making generative testing to check if the custom parser conforms with the platform regex parser, check this test: https://github.com/gfredericks/test.chuck/blob/master/test/com/gfredericks/test/chuck/regexes_test.clj#L117-L119

wilkerlucio17:06:38

it generates random regexs and try to parse it with custom and native regex parsers, and they have to conform (all fail or all pass)

wilkerlucio17:06:03

it's just very hard to get the grammar to work exactly like the native one, with all quirks dealt with