2025-07-23 instaparse | Clojure Slack Archive

instaparse 2025-07-23

gaverhae 2025-07-23T10:50:30.540519Z

Is it possible to define significant whitespace directly in the grammar, or is that something that needs to be done as a post-processing step? Concrete example: parsing nested ifs in a Python-like language.

gaverhae 2025-07-26T14:20:16.698169Z

Thanks! To clarify, I'm not blocked: I used a post-processing step to "recover" the blocks, with my grammar just emitting the initial spaces. I was merely wondering whether the instaparse extensions compared to "traditional" BNF/context-free grammars allowed for this.

gaverhae 2025-07-26T14:20:44.902209Z

Looks like you're at least in the same space as I am, which is "probably not, and if so, it's still going to be easier to do it separately".

👍 1

Johannes F. Knauf 2025-07-25T09:49:06.757189Z

@gaverhae I see your point. Also, I am familiar with Python. Interesting question! Especially since Python doesn't tell you the nature of the indentation, as long as it's the same for every line of the block. https://docs.python.org/3/reference/lexical_analysis.html#indentation

gaverhae 2025-07-25T15:05:51.252439Z

I'd be willing to give that up and say it has to be N spaces per level, if that makes things easier, but I fail to see how it could help.

Johannes F. Knauf 2025-07-25T19:30:09.831589Z

The Python language uses a pre-parsing lexer step, which keeps track of the indentation stack and emits INDENT / DEDENT tokens on indentation changes (see https://docs.python.org/3/reference/lexical_analysis.html#indentation). Because of that trick, the grammar can afterwards be expressed easily (see https://docs.python.org/3/reference/grammar.html), e.g. for blocks

block:
    | NEWLINE INDENT statements DEDENT 
    | simple_stmts

I believe that's a limitation of all context-free grammars when it comes to indentation-sensitive languages (which require tracking indentation context). There is https://github.com/Engelberg/instaparse/issues/10 explaining more. What you can do as a workaround is transform the input in a preprocessing step -- similar to what the CPython parser does.
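A minimal sketch of the preprocessing idea described above, not CPython's actual tokenizer: it keeps a stack of indentation widths and emits INDENT / DEDENT marker lines whenever the width changes. The function name `tokenize_indentation`, the choice of plain strings as markers, and the spaces-only rule are assumptions made for illustration.

```python
def tokenize_indentation(source: str) -> list[str]:
    """Replace leading spaces with explicit INDENT / DEDENT marker lines.

    Hypothetical helper: spaces only, blank lines skipped, no tab handling.
    """
    stack = [0]   # indentation widths of the currently open blocks
    out = []
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines do not affect block structure
        stripped = line.lstrip(" ")
        width = len(line) - len(stripped)
        if width > stack[-1]:
            # deeper than the enclosing block: open a new one
            stack.append(width)
            out.append("INDENT")
        while width < stack[-1]:
            # shallower: close blocks until a matching width is found
            stack.pop()
            out.append("DEDENT")
        if width != stack[-1]:
            raise ValueError(f"inconsistent dedent: {line!r}")
        out.append(stripped)
    # close any blocks still open at end of input
    out.extend("DEDENT" for _ in stack[1:])
    return out


if __name__ == "__main__":
    for token in tokenize_indentation("if a:\n  b\n  if c:\n    d\ne"):
        print(token)
```

The output no longer depends on leading whitespace, so a context-free grammar can treat INDENT and DEDENT like ordinary bracket tokens, which is what the `block` rule above relies on.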

gaverhae 2025-07-24T16:12:22.231499Z

I'll have to look at that in more detail, but to clarify my use case: I'm trying to match essentially Python syntax, so, in case you're not familiar, significant whitespace as a means of defining nesting level. For example:

def function_1():
  if True:
    def nested_function(arg):
      if n == 2:
        return 2
      else:
        return 3
    x = 3
    return lambda n: n + x + nested_function(n)
  else:
    return "hello"

Obviously nonsense code, but hopefully it's a bit clearer. There's a def within an if within a def. Is there a way to define a BNF (+ instaparse extension) that can "count" the spaces to turn that into roughly something like:

[:def "function_1" [:args]
 [:body [:if [:expr [:bool "True"]]
        [:if-true [:def ...

gaverhae 2025-07-24T16:13:23.288769Z

What I'm currently doing is recovering the nesting from the number of spaces in a post-processing step; I was wondering if there is a ("simple") way to get instaparse to do it directly instead?

gaverhae 2025-07-24T16:14:44.049729Z

It's unclear to me how the four spaces define any level of context/nesting in your example? It looks like it's always going to be 4 spaces at the start of a line having the exact same meaning, regardless of how many blank spaces were at the beginning of the previous/next line.

gaverhae 2025-07-24T16:15:50.489419Z

@afoltzm I'm not trying to match different "kinds" of whitespace, I'm trying to compare the number of initial spaces on one line with the number of initial spaces in the surrounding lines, if that makes sense?

respatialized 2025-07-24T21:02:23.603159Z

https://github.com/engelberg/instaparse?tab=readme-ov-file#lookahead I think studying the final lookahead example might help you answer this question

gaverhae 2025-07-24T21:25:38.105999Z

I'm really not seeing it. Where in that example is it doing anything to "compare" or "track" whitespace from one line to the next?

gaverhae 2025-07-24T21:32:19.086779Z

Let's try it with a small example. I'd like a grammar that turns:

a
 a
 a
  a
   a
 a
a

into

[:S
 [:a
  [:a]
  [:a
   [:a
    [:a]]]
  [:a]]
 [:a]]

I'm not seeing how the examples in the lookahead section help with that, but I'm very open to the notion that I'm blind. My understanding is that this is not possible in context-free grammars, but I'm unsure about exactly how much power instaparse gets from the PEG extensions, and whether it's enough to bridge this very specific gap.
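For comparison, the post-processing route mentioned elsewhere in this thread can be sketched in a few lines. This is an illustrative guess at the approach, not gaverhae's actual code: it assumes the grammar (or a plain line split) has already produced one entry per line with its leading spaces intact, and `nest_by_indent` is a hypothetical name. Python lists stand in for the hiccup-style vectors.

```python
def nest_by_indent(source: str) -> list:
    """Rebuild a tree from leading-space counts, one node per non-blank line.

    Hypothetical helper: spaces only, and inconsistent dedents are not checked.
    """
    root = ["S"]
    # (indent width, node) for each currently open ancestor; the root sits
    # at width -1 so that every real line nests beneath it
    stack = [(-1, root)]
    for line in source.splitlines():
        if not line.strip():
            continue
        width = len(line) - len(line.lstrip(" "))
        node = [line.strip()]
        # close every open node that is not strictly shallower than this line
        while stack[-1][0] >= width:
            stack.pop()
        stack[-1][1].append(node)
        stack.append((width, node))
    return root


if __name__ == "__main__":
    print(nest_by_indent("a\n a\n a\n  a\n   a\n a\na"))
```

On the seven-line sample above, this yields the same shape as the desired `[:S [:a [:a] [:a [:a [:a]]] [:a]] [:a]]` tree.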

gaverhae 2025-07-23T11:19:32.518099Z

I'm looking for a "theoretically yes/no" style of answer, not specifically for a "here's how to do it".

respatialized 2025-07-23T12:39:19.225369Z

you can use regex to match different types of whitespace and treat them contextually differently based on surrounding non-whitespace text, so I think the answer is "theoretically yes"

Johannes F. Knauf 2025-07-23T14:30:35.458939Z

Not sure if I understand your use case, but here is an example from my pet project that relies on significant whitespace ("exactly 4 spaces marks a no-time line") in its grammar -- though it discards the matched whitespace itself. https://github.com/JohannesFKnauf/parti-time/blob/master/src/main/ebnf/timeline_grammar.ebnf

Johannes F. Knauf 2025-07-23T14:32:26.950409Z

> no-time = #" {4}"
uses regex terminal matching
> <entry-header> = hhmm-time <" "> project <EOL>
> ...
> activity = <no-time " "> rest-of-line <EOL>
use explicit string terminals

Johannes F. Knauf 2025-07-23T14:32:39.868669Z

https://github.com/Engelberg/instaparse?tab=readme-ov-file#notation explains more

Johannes F. Knauf 2025-07-23T14:33:05.494189Z

@gaverhae Does that help you?
