#instaparse
2020-12-08
Zaymon 04:12:42

Hello all. I’m starting to learn parsing and EBNF, and I am struggling to remove the ambiguity from my parser. I have constructed a simple example to demonstrate the problem. The following parser tags text marked as emphasis, like **emphasis**:

(def remove-ambiguity
  (insta/parser
   "S = (em / char)+ | epsilon
    em = <'*' '*'> char* <'*' '*'>
    <char> = #'.'"))
However, for an input such as **em** **em** there are many possible parse results:
([:S [:em "e" "m" "*" "*" " " "*" "*" "e" "m"]]
 [:S "*" "*" "e" "m" "*" "*" " " "*" "*" "e" "m" "*" "*"]
 [:S [:em "e" "m" "*" "*" " "] "e" "m" "*" "*"]
 [:S "*" "*" "e" "m" "*" "*" " " [:em "e" "m"]]
 [:S "*" "*" "e" "m" [:em " " "*" "*" "e" "m"]]
 [:S "*" "*" "e" "m" [:em " "] "e" "m" "*" "*"]
 [:S [:em "e" "m"] " " "*" "*" "e" "m" "*" "*"]
 [:S [:em "e" "m"] " " [:em "e" "m"]])  ;; <-- This is the one I want

This makes sense, since there are several ways to pair up the asterisks to match the rule. However, I only ever want to allow results like this: `[:S [:em "e" "m"] " " [:em "e" "m"]]`
It’s almost like I want it to greedily take the first match possible and then ignore all others. But I have no idea how to express this. Any help would be greatly appreciated 😄.

hiredman 17:12:03

your grammar says '**' is both the start of an em sequence and two chars, and that is the ambiguity

Zaymon 23:12:35

Is there a way I can force the correct behavior? I always want it to be the first pair found.

Zaymon 23:12:05

How do I specify that a char is any character or sequence of characters except **?

Zaymon 00:12:12

Looks like I can use negative lookahead in the definition of char
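
For reference, here is a minimal sketch of what that negative lookahead might look like. It assumes the grammar from the first message is otherwise unchanged and that instaparse is on the classpath. The name `emphasis-parser` is made up for illustration and does not come from the thread. The only change is that `char` refuses to match at any position where `**` begins, so a pair of asterisks can only ever be read as an `em` delimiter.

```clojure
(require '[instaparse.core :as insta])

;; Hypothetical name; same grammar as above except for the `char` rule.
(def emphasis-parser
  (insta/parser
   "S = (em / char)+ | epsilon
    em = <'*' '*'> char* <'*' '*'>
    (* !'**' is a negative lookahead: fail if the input here starts with '**' *)
    <char> = !'**' #'.'"))

;; insta/parses returns every possible parse, so it can be used
;; to check that only one interpretation remains.
(insta/parses emphasis-parser "**em** **em**")
;; => ([:S [:em "e" "m"] " " [:em "e" "m"]])
```

A single `*` that is not followed by another `*` still parses as a plain char under this sketch, while an unclosed `**` makes the parse fail rather than fall back to literal asterisks. Whether that is the desired behavior depends on the markup being modeled.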