instaparse 2016-08-28 | Slack Archive

Gah, what's wrong with this parser? doc-metadata works fine, but running headlines on the remaining content just returns flat content. https://github.com/seylerius/organum

seylerius21:08:36

@aengelberg: Got any clues?

seylerius21:08:57

Simple reproduction: (headlines (last (doc-metadata (slurp ""))))

seylerius21:08:08

It's something in the h token, because that's the last thing I changed before it started failing.

ska21:08:45

At a first glance, the #'.+' looks suspicious to me. Is greediness biting you here? (Did not try it out, though)

aengelberg21:08:00

@seylerius the regex you put for :content is probably not what you want. Due to the (?s) flag, seems to match everything including newlines, as long as the first character is not a *.
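
To illustrate the `(?s)` point, here is a minimal sketch at the plain-regex level. The input and both patterns are hypothetical (they are not the ones in the organum repo); they only assume a content rule shaped like "one character that is not `*`, then anything".

```clojure
;; Hypothetical input and patterns for illustration only;
;; these are not the regexes from the organum repo.
(def sample "Intro text\n* Headline\nBody")

;; With (?s) (DOTALL), `.` also matches newlines.
(def dotall-content #"(?s)[^*].*")

;; Without the flag, `.` stops at the end of the line.
(def line-content #"[^*].*")

(re-find dotall-content sample) ; matches the entire sample, headline included
(re-find line-content sample)   ; matches only "Intro text"
```

With the flag set, the only thing that can stop such a rule is the end of the input, which fits the "flat content" symptom.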

aengelberg21:08:06

I'm not sure what your desired behavior is though.

aengelberg21:08:16

BTW, both the first ^ and the ? in your regex appear redundant, if I understand it correctly.

seylerius21:08:29

The content regexp is fine. It's after I changed a few things to tidy up :h and added tag parsing that it started failing.

seylerius21:08:54

Basically, a headline starts with some number of stars. Everything else isn't a headline.

aengelberg21:08:56

I cloned your project and am looking at that parser. Is there a different version / branch I missed?

seylerius21:08:38

Nope, I pushed the latest version just before I spoke up today.

aengelberg21:08:13

Sorry I may have been unclear. When I said :content I meant the content inside the headlines parser.

aengelberg21:08:26

Not the doc-metadata parser

aengelberg21:08:05

As an experiment I removed all the hide-tags from the headlines parser, since I got that behavior you were talking about (flat content). That exposed the headlines' :content rule as being greedy.

aengelberg21:08:06

organum.core> (headlines content)
[:S [:token [:content "This is an attempt...

seylerius21:08:20

Yep. I've got an ordered choice making it prefer to define a section (headline then content) if possible, and just content if not. The defining difference between content and headline is whether it starts with stars.

seylerius21:08:11

Although, Hmmm. You've got a point about the mode there.

aengelberg21:08:58

I think this is what happened:
- The section rule failed at the start of the string
- It then fell back to the content rule due to ordered choice
- The content rule mistakenly parses the whole string (for the reason I mentioned above)
- Parse is done
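
The sequence above can be sketched without instaparse, using plain Clojure and Java regexes. Everything here is an assumption made for illustration: `match-at-front`, `ordered-choice`, and the two rules are hypothetical stand-ins, not instaparse's API and not the organum grammar.

```clojure
;; Hypothetical helpers and rules; the real grammar lives in the organum repo.
(defn match-at-front
  "Returns the text `re` matches at the very start of `s`, or nil."
  [re s]
  (let [^java.util.regex.Matcher m (re-matcher re s)]
    (when (.lookingAt m)
      (.group m))))

(defn ordered-choice
  "Tries each [rule-name regex] pair in order and returns the first match."
  [rules s]
  (some (fn [[rule-name re]]
          (when-let [text (match-at-front re s)]
            [rule-name text]))
        rules))

(def rules
  [[:section #"\*+ [^\n]*"]    ; a headline: stars, a space, rest of the line
   [:content #"(?s)[^*].*"]])  ; overly accepting content rule

(def sample "Intro text\n* Headline\nBody")

;; :section fails on the leading "I", so :content is tried next
;; and consumes the whole input, headline included.
(ordered-choice rules sample)
```

Because the fallback rule already covers the entire input, nothing is left for a later section to match.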

seylerius21:08:22

Yeah. You're right. Making the content rule less accepting (not (?s)) fixes that part, and now I'm seeing failures to parse the first headline. Joy.

seylerius21:08:27

How does instaparse play with non-capturing groups?

aengelberg21:08:09

Not familiar with that term; are you referring to the groups returned by a Java regex match?

seylerius21:08:04

Non-capturing groups are for saying, "this should be here, but don't return it in a group"

seylerius21:08:39

Okay, new push. Can't manage to get tags out separate.

aengelberg21:08:20

oh, you mean things like regex lookahead and lookbehind?

seylerius21:08:41

They work if I make them mandatory, but get eaten by the headline body if they're optional. Would lookahead allow saying "if there's whitespace followed by a colon, stop here"?

aengelberg21:08:37

This is the instaparse source code that applies regexes, may shed some light on whether certain constructs would work. https://github.com/Engelberg/instaparse/blob/master/src/instaparse/gll.clj#L670

aengelberg21:08:21

I would expect regex non matching lookaheads to work, but non-matching lookbehinds to NOT work. Instaparse runs a regex match on the substring of the current index onward, so previous characters are invisible. EDIT: I misunderstood the term "non-matching"
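
A small sketch of the substring point, using only `clojure.core` regex functions. It assumes the behavior described above (matching against the text from the current index onward) and imitates it with `subs`; it does not call instaparse itself.

```clojure
(def text "ab")

;; On the full string, the lookbehind can see the preceding "a".
(re-find #"(?<=a)b" text)           ; => "b"

;; On the substring from index 1 onward, the "a" is no longer visible.
(re-find #"(?<=a)b" (subs text 1))  ; => nil

;; A lookahead only inspects what follows, so it is unaffected.
(re-find #"b(?=c)" (subs "abc" 1))  ; => "b"
```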

aengelberg21:08:11

I see you're using (?:) now. I don't think "non capturing" is what you want

seylerius21:08:22

I think you're right.

aengelberg21:08:47

organum.core> (re-find #"a" "a")
"a"
organum.core> (re-find #"(?:a)" "a")
"a"

seylerius21:08:56

What's weird is non-greedy options fail entirely.

aengelberg21:08:18

(?:) basically means, if there are any other groups () inside that block, DON'T return them as an additional output.
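
For reference, a short REPL-style sketch of how capturing and non-capturing groups show up in `re-find` results under `java.util.regex`. One detail worth noting: `(?:...)` only suppresses its own group, so a capturing group nested inside it is still reported.

```clojure
;; Two capturing groups: whole match plus both groups.
(re-find #"(a)(b)" "ab")     ; => ["ab" "a" "b"]

;; The non-capturing group is left out of the result.
(re-find #"(?:a)(b)" "ab")   ; => ["ab" "b"]

;; A capturing group nested inside (?:...) is still returned.
(re-find #"(?:a(b))" "ab")   ; => ["ab" "b"]

;; With no capturing groups at all, re-find returns a plain string.
(re-find #"(?:a)b" "ab")     ; => "ab"
```

In every case the overall matched text is the same, which is the only part that matters if match groups are discarded.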

seylerius21:08:04

Ah, it looks like negative lookahead is the trick.

aengelberg21:08:20

(?!=)?

aengelberg21:08:43

the ?: flag shouldn't affect Instaparse's usage of regexes at all. Instaparse throws away match groups

seylerius21:08:55

(?!\\s+:)

aengelberg21:08:21

seems legit

seylerius21:08:45

Nope. Pushing. Still eats the tags.
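
One possible reason a negative lookahead can still "eat the tags", sketched at the plain-regex level. The headline text and both patterns are hypothetical and may not match what was pushed to the repo. Inside a Clojure regex literal the lookahead is written with a single backslash, `(?!\s+:)`; the doubled form is for regexes embedded in a grammar string.

```clojure
;; Hypothetical headline text and patterns, not taken from the organum grammar.
(def headline "Title :foo:bar:")

;; A single trailing lookahead: greedy .+ runs to the end of the line,
;; where nothing follows, so the negative lookahead trivially succeeds.
(re-find #".+(?!\s+:)" headline)       ; => "Title :foo:bar:"

;; Checking the lookahead before every character stops the match
;; just before the first "whitespace then colon".
(re-find #"(?:(?!\s+:).)+" headline)   ; => "Title"
```

The second form applies the lookahead at each position instead of once at the end, so it can actually act as a stop condition.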

aengelberg21:08:00

hmm

seylerius21:08:16

Pushed

aengelberg21:08:51

need to run now, can probably help more in an hour or so. I'd say the next step is manually parsing the regexes on the strings.

aengelberg21:08:18

and try gradually taking characters away from the regex to see what the problem is

seylerius21:08:23

Okay, thanks for the help. Talk with ya when you've got time.

aengelberg21:08:36

feel free to dump any further findings here

seylerius21:08:57

Will do. Slack has persistence, which is pretty handy

seylerius23:08:43

Okay, trying reluctance means I only get the first character of the headline, and the rest becomes part of the content.
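
The "only the first character" symptom can be reproduced with plain regexes. This is a sketch with a hypothetical headline and hypothetical patterns, assuming a reluctant title followed by an optional tag group; the actual grammar may differ.

```clojure
(def headline "Title :foo:")

;; Nothing forces the reluctant .+? to grow: the optional tag group
;; can simply be skipped, so the match stops after one character.
(re-find #"(.+?)(\s+:\S+:)?" headline)   ; => ["T" "T" nil]

;; Anchoring the end of the line forces .+? to expand until the tags fit.
(re-find #"(.+?)(\s+:\S+:)?$" headline)  ; => ["Title :foo:" "Title" " :foo:"]
```

A reluctant quantifier only extends when whatever follows it cannot otherwise match, so it needs a mandatory element after it to be useful.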

seylerius23:08:59

Trying lookahead seems to just fail.

seylerius23:08:06

Okay, tags are mostly fixed, but it's only grabbing the first one.
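
A sketch of the "only the first tag" symptom and one way around it, again with plain regexes. The tag run and the patterns are hypothetical; splitting the run in a post-processing step is just one option and is not necessarily how the repo handles it.

```clojure
(def tag-run ":foo:bar:")

;; A single :tag: pattern matched once finds only the first tag.
(re-find #":([^:\s]+):" tag-run)   ; => [":foo:" "foo"]

;; re-seq returns every non-overlapping match, so each tag comes out.
(re-seq #"[^:\s]+" tag-run)        ; => ("foo" "bar")
```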

seylerius23:08:11

Pushed.

seylerius23:08:29

Would appreciate a look when you have time, @aengelberg

seylerius23:08:30

Ach. It's also not getting second headlines. They're turning into content lines due to newline weirdness.

seylerius23:08:49

Pushed again. Fixed newline weirdness

seylerius23:08:17

Hah, fixed it. Required post-tag newline/whitespace.

seylerius23:08:39

Gah. Org is a beautiful format, but it's a bitch to parse.

aengelberg23:08:37

The parser breaks if I put into the file

* The First : Section :foo:bar:

aengelberg23:08:43

Not sure if that's valid org-mode.
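
A sketch of why that headline is awkward for a "stop at whitespace then colon" rule, plus one possible stricter stop condition. It assumes Org only treats a trailing run of `:tag:` items at the end of the headline as tags (not verified here); `title-re` and the allowed tag characters are hypothetical choices.

```clojure
(def tricky "The First : Section :foo:bar:")

;; Stopping at any "whitespace then colon" cuts the title short.
(re-find #"(?:(?!\s+:).)+" tricky)     ; => "The First"

;; Hypothetical stricter rule: only stop when the rest of the line
;; is a run of :tag: items followed by optional whitespace.
(def title-re #"(?:(?!\s+:(?:[\w@#%]+:)+\s*$).)+")

(re-find title-re tricky)              ; => "The First : Section"
(re-find title-re "Title :foo:bar:")   ; => "Title"
```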
