Is it possible to define significant whitespace directly in the grammar, or is that something that needs to be done as a post-processing step? Concrete example: parsing nested ifs in a Python-like language.
Thanks! To clarify, I'm not blocked: I used a post-processing step to "recover" the blocks, with my grammar just emitting the initial spaces. I was merely wondering whether the instaparse extensions compared to "traditional" BNF/context-free grammars allowed for this.
Looks like you're at least in the same space as I am, which is "probably not, and if so, it's still going to be easier to do it separately".
@gaverhae I see your point. Also, I am familiar with Python. Interesting question! Especially since Python doesn't prescribe the nature of the indentation, as long as it's consistent for every line of the block. https://docs.python.org/3/reference/lexical_analysis.html#indentation
I'd be willing to give that up and say it has to be N spaces per level, if that makes things easier, but I fail to see how it could help.
The Python language uses a pre-parsing lexer step, which keeps track of the indentation stack and emits INDENT / DEDENT tokens on indentation changes (see https://docs.python.org/3/reference/lexical_analysis.html#indentation). Because of that trick, the grammar can afterwards be expressed easily (see https://docs.python.org/3/reference/grammar.html), e.g. for blocks:
block:
    | NEWLINE INDENT statements DEDENT
    | simple_stmts
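To make the indentation-stack idea concrete, here is a minimal Python sketch of such a pre-lexing pass. It is not CPython's tokenizer: the function name `tokenize_indentation` and the `LINE` token kind are made up for illustration, and it assumes space-only indentation with no tabs, comments, or line continuations.

```python
def tokenize_indentation(source):
    """Yield (kind, value) tokens, with INDENT/DEDENT marking block changes.

    Hypothetical helper for illustration; assumes indentation uses spaces only.
    """
    stack = [0]  # indentation widths of the currently open blocks
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines do not affect indentation
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            # deeper than the current block: open a new one
            stack.append(width)
            yield ("INDENT", width)
        while width < stack[-1]:
            # shallower: close blocks until we are back at a known level
            stack.pop()
            yield ("DEDENT", stack[-1])
        if width != stack[-1]:
            # dedented to a width that never opened a block
            raise ValueError(f"inconsistent dedent: {line!r}")
        yield ("LINE", line.strip())
    # close any blocks still open at end of input
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", stack[-1])
```

Once the input has been rewritten this way, the nesting is carried by explicit tokens, so a context-free grammar no longer has to compare whitespace between lines.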
I believe that's a limitation of all context-free grammars when it comes to indentation-sensitive languages (which require tracking indentation context). There is https://github.com/Engelberg/instaparse/issues/10 explaining more. What you can do as a workaround is transform the input in a preprocessing step -- similar to what the CPython parser does.
I'll have to look at that in more detail, but, clarifying my use-case: I'm trying to match essentially Python syntax, so, in case you're not familiar, significant whitespace as a means of defining nesting level. For example:
def function_1():
    if True:
        def nested_function(arg):
            if n == 2:
                return 2
            else:
                return 3
        x = 3
        return lambda n: n + x + nested_function(n)
    else:
        return "hello"
Obviously nonsense code, but hopefully it's a bit clearer. There's a def within an if within a def. Is there a way to define a BNF (+ instaparse extension) that can "count" the spaces to turn that into roughly something like:
[:def "function_1" [:args]
  [:body [:if [:expr [:bool "True"]]
              [:if-true [:def ...
What I'm currently doing is recovering the nesting from the number of spaces in a post-processing step; I was wondering if there is a ("simple") way to get instaparse to do it directly instead?
It's unclear to me how the four spaces define any level of context/nesting in your example? It looks like it's always going to be 4 spaces at the start of a line having the exact same meaning, regardless of how many blank spaces were at the beginning of the previous/next line.
@afoltzm I'm not trying to match different "kinds" of whitespace, I'm trying to compare the number of initial spaces on one line with the number of initial spaces in the surrounding lines, if that makes sense?
https://github.com/engelberg/instaparse?tab=readme-ov-file#lookahead I think studying the final lookahead example might help you answer this question
I'm really not seeing it. Where in that example is it doing anything to "compare" or "track" whitespace from one line to the next?
Let's try it with a small example. I'd like a grammar that turns:
a
  a
  a
    a
      a
  a
a
into
[:S
 [:a
  [:a]
  [:a
   [:a
    [:a]]]
  [:a]]
 [:a]]
I'm not seeing how the examples in the lookahead section help with that, but I'm very open to the notion that I'm blind. My understanding is that this is not possible in context-free grammars, but I'm unsure about exactly how much power instaparse gets from the PEG extensions, and whether it's enough to bridge this very specific gap.
I'm looking for a "theoretically yes/no" style of answer, not specifically for a "here's how to do it".
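For comparison, the post-processing route mentioned earlier in the thread (recovering the nesting from the leading spaces outside the grammar) might look roughly like this Python sketch. The function name `nest` and the nested-list output shape are assumptions made for illustration, not anything instaparse provides. It is lenient: a line is attached to the nearest preceding line that is strictly less indented, so inconsistent dedents are not reported.

```python
def nest(source):
    """Turn indented lines into a nested-list tree such as ["S", ["a", ...], ...].

    Hypothetical helper for illustration; assumes indentation uses spaces only.
    """
    root = ["S"]
    # stack of (indent_width, node); the root sits below any real indentation
    stack = [(-1, root)]
    for line in source.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        width = len(line) - len(line.lstrip(" "))
        node = [line.strip()]
        # pop back to the nearest line that is strictly less indented
        while stack[-1][0] >= width:
            stack.pop()
        stack[-1][1].append(node)  # that line is the parent
        stack.append((width, node))
    return root


example = "a\n  a\n  a\n    a\n      a\n  a\na\n"
tree = nest(example)
```

The explicit stack is the state a plain context-free grammar has no way to carry from one line to the next, which is why doing this as a separate step tends to be the simpler option.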
you can use regex to match different types of whitespace and treat them contextually differently based on surrounding non-whitespace text, so I think the answer is "theoretically yes"
Not sure if I understand your use case, but here is an example from my pet project, that relies on significant whitespace ("exactly 4 spaces marks a no-time line") in its grammar -- though it discards the matched whitespace itself. https://github.com/JohannesFKnauf/parti-time/blob/master/src/main/ebnf/timeline_grammar.ebnf
> no-time = #" {4}"
uses regex terminal matching
> <entry-header> = hhmm-time <" "> project <EOL>
> ...
> activity = <no-time " "> rest-of-line <EOL>
use explicit string terminals
https://github.com/Engelberg/instaparse?tab=readme-ov-file#notation explains more
@gaverhae Does that help you?