2016-08-28 instaparse | Clojure Slack Archive

Gah, what's wrong with this parser? doc-metadata works fine, but running headlines on the remaining content just returns flat content. https://github.com/seylerius/organum

seylerius 2016-08-28T21:02:36.000004Z

@aengelberg: Got any clues?

seylerius 2016-08-28T21:03:57.000005Z

Simple reproduction: (headlines (last (doc-metadata (slurp ""))))

seylerius 2016-08-28T21:05:08.000006Z

It's something in the h token, because that's the last thing I changed before it started failing.

ska 2016-08-28T21:10:45.000007Z

At first glance, the #'.+' looks suspicious to me. Is greediness biting you here? (I did not try it out, though.)

aengelberg 2016-08-28T21:25:00.000008Z

@seylerius the regex you put for :content is probably not what you want. Due to the (?s) flag, it seems to match everything including newlines, as long as the first character is not a *.
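For readers following along, here is a minimal sketch of the effect described above, using plain Clojure regexes. The sample text and both patterns are made up for illustration and are not the organum grammar itself.

```clojure
;; Hypothetical input and patterns, not taken from the organum project.
(def sample "Intro paragraph.\n* First headline\nBody text.\n")

;; With (?s), "." also matches "\n", so a single match runs to the end of the input.
(def greedy-content #"(?s)[^*].*")

;; Without (?s), "." stops at the line break, so the match ends with the first line.
(def line-content #"[^*].*")

(re-find greedy-content sample)
;; => "Intro paragraph.\n* First headline\nBody text.\n"

(re-find line-content sample)
;; => "Intro paragraph."
```

The only constraint the first pattern puts on the input is that its first character is not a star, which is why the headline further down is swallowed as well.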

aengelberg 2016-08-28T21:25:06.000009Z

I'm not sure what your desired behavior is though.

aengelberg 2016-08-28T21:26:16.000010Z

BTW, both the first ^ and the ? in your regex appear redundant, if I understand it correctly.

seylerius 2016-08-28T21:26:29.000011Z

The content regexp is fine. It's after I changed a few things to tidy up :h and added tag parsing that it started failing.

seylerius 2016-08-28T21:26:54.000012Z

Basically, a headline starts with some number of stars. Everything else isn't a headline.

aengelberg 2016-08-28T21:26:56.000013Z

I cloned your project and am looking at that parser. Is there a different version / branch I missed?

seylerius 2016-08-28T21:27:38.000014Z

Nope, I pushed the latest version just before I spoke up today.

aengelberg 2016-08-28T21:28:13.000015Z

Sorry I may have been unclear. When I said :content I meant the content inside the headlines parser.

aengelberg 2016-08-28T21:28:26.000016Z

Not the doc-metadata parser

aengelberg 2016-08-28T21:29:05.000017Z

As an experiment I removed all the hide-tags from the headlines parser, since I got that behavior you were talking about (flat content). That exposed the headlines' :content rule as being greedy.

aengelberg 2016-08-28T21:30:06.000018Z

organum.core> (headlines content)
[:S [:token [:content "This is an attempt...

seylerius 2016-08-28T21:30:20.000019Z

Yep. I've got an ordered choice making it prefer to define a section (headline then content) if possible, and just content if not. The defining difference between content and headline is whether it starts with stars.

seylerius 2016-08-28T21:31:11.000020Z

Although, Hmmm. You've got a point about the mode there.

aengelberg 2016-08-28T21:31:58.000021Z

I think this is what happened:
- The section rule failed at the start of the string
- It then fell back to the content rule due to ordered choice
- The content rule mistakenly parses the whole string (for the reason I mentioned above)
- Parse is done

seylerius 2016-08-28T21:34:22.000023Z

Yeah. You're right. Making the content rule less accepting (not (?s)) fixes that part, and now I'm seeing failures to parse the first headline. Joy.

seylerius 2016-08-28T21:36:27.000024Z

How does instaparse play with non-capturing groups?

aengelberg 2016-08-28T21:38:09.000025Z

Not familiar with that term; are you referring to the groups returned by a Java regex match?

seylerius 2016-08-28T21:40:04.000026Z

Non-capturing groups are for saying, "this should be here, but don't return it in a group"

seylerius 2016-08-28T21:40:39.000027Z

Okay, new push. Can't manage to get tags out separate.

aengelberg 2016-08-28T21:41:20.000028Z

oh, you mean things like regex lookahead and lookbehind?

seylerius 2016-08-28T21:42:41.000029Z

They work if I make them mandatory, but get eaten by the headline body if they're optional. Would lookahead allow saying "if there's whitespace followed by a colon, stop here"?
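A rough illustration of why optional tags can get eaten, shown with a single plain regex rather than the actual grammar. The headline string and both patterns are assumptions made for this example.

```clojure
;; Hypothetical one-line headline; the patterns are illustrative only.
(def headline "The First Section :foo:bar:")

;; Tags mandatory: the greedy title has to backtrack so the tag group can match.
(re-find #"(.+)\s+(:\S+:)" headline)
;; => ["The First Section :foo:bar:" "The First Section" ":foo:bar:"]

;; Tags optional: the greedy title keeps everything and the tag group matches nothing.
(re-find #"(.+)(?:\s+(:\S+:))?" headline)
;; => ["The First Section :foo:bar:" "The First Section :foo:bar:" nil]
```

In the second case the regex engine has no reason to give characters back, because an empty tag group already satisfies the pattern.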

aengelberg 2016-08-28T21:45:37.000030Z

This is the instaparse source code that applies regexes, may shed some light on whether certain constructs would work. https://github.com/Engelberg/instaparse/blob/master/src/instaparse/gll.clj#L670

aengelberg 2016-08-28T21:47:21.000032Z

I would expect regex non matching lookaheads to work, but non-matching lookbehinds to NOT work. Instaparse runs a regex match on the substring of the current index onward, so previous characters are invisible. EDIT: I misunderstood the term "non-matching"
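The point about previous characters being invisible can be checked at the REPL. This sketch uses `subs` as a stand-in for however instaparse slices the input, which is an assumption; only the regex behavior on a substring is being shown.

```clojure
(def text "ab")

;; Whole string: the lookbehind can see the "a" in front of "b".
(re-find #"(?<=a)b" text)
;; => "b"

;; Substring starting at index 1: the "a" is no longer part of the input.
(re-find #"(?<=a)b" (subs text 1))
;; => nil

;; A lookahead only inspects what comes next, so the substring does not hurt it.
(re-find #"b(?!c)" (subs text 1))
;; => "b"
```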

aengelberg 2016-08-28T21:49:11.000033Z

I see you're using (?:) now. I don't think "non capturing" is what you want

seylerius 2016-08-28T21:49:22.000034Z

I think you're right.

aengelberg 2016-08-28T21:49:47.000035Z

organum.core> (re-find #"a" "a")
"a"
organum.core> (re-find #"(?:a)" "a")
"a"

seylerius 2016-08-28T21:49:56.000036Z

What's weird is non-greedy options fail entirely.

aengelberg 2016-08-28T21:50:18.000037Z

(?:) basically means the parentheses are used for grouping only, and that group is NOT returned as an additional output. (Ordinary () groups nested inside it are still returned.)
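A quick REPL check of how `re-find` reports groups with and without `(?:)`. The inputs are toy strings chosen for this sketch.

```clojure
;; Two capturing groups: re-find returns the full match plus both groups.
(re-find #"(a)(b)" "ab")
;; => ["ab" "a" "b"]

;; One non-capturing group: only the remaining capturing group is reported.
(re-find #"(?:a)(b)" "ab")
;; => ["ab" "b"]

;; No capturing groups at all: re-find returns just the matched string.
(re-find #"(?:a)(?:b)" "ab")
;; => "ab"

;; A capturing group nested inside a non-capturing one is still reported.
(re-find #"(?:(a)b)" "ab")
;; => ["ab" "a"]
```

In every case the overall matched text is the same, so a consumer that only looks at the full match sees no difference.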

seylerius 2016-08-28T21:51:04.000038Z

Ah, it looks like negative lookahead is the trick.

aengelberg 2016-08-28T21:51:20.000039Z

(?!=)?

aengelberg 2016-08-28T21:51:43.000040Z

the ?: flag shouldn't affect Instaparse's usage of regexes at all. Instaparse throws away match groups

seylerius 2016-08-28T21:51:55.000041Z

(?!\\s+:)

aengelberg 2016-08-28T21:52:21.000042Z

seems legit

seylerius 2016-08-28T21:52:45.000043Z

Nope. Pushing. Still eats the tags.

aengelberg 2016-08-28T21:53:00.000045Z

hmm

seylerius 2016-08-28T21:53:16.000046Z

Pushed

aengelberg 2016-08-28T21:53:51.000047Z

need to run now, can probably help more in an hour or so. I'd say the next step is manually parsing the regexes on the strings.

aengelberg 2016-08-28T21:54:18.000048Z

and try gradually taking characters away from the regex to see what the problem is

seylerius 2016-08-28T21:54:23.000049Z

Okay, thanks for the help. Talk with ya when you've got time.

aengelberg 2016-08-28T21:54:36.000050Z

feel free to dump any further findings here

seylerius 2016-08-28T21:54:57.000051Z

Will do. Slack has persistence, which is pretty handy

seylerius 2016-08-28T23:34:43.000052Z

Okay, trying reluctance means I only get the first character of the headline, and the rest becomes part of the content.
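The one-character result is what a reluctant quantifier does when nothing follows it in the pattern. A small sketch with plain regexes, where the headline strings and the lookahead are assumptions for illustration:

```clojure
;; Nothing after the reluctant quantifier: one character is already enough.
(re-find #".+?" "Headline text :tag:")
;; => "H"

;; A lookahead gives the reluctant quantifier a place to stop.
(re-find #".+?(?=\s+:)" "Headline text :tag:")
;; => "Headline text"

;; Allowing end-of-input as an alternative keeps untagged headlines working.
(re-find #".+?(?=\s+:|$)" "Headline text")
;; => "Headline text"
```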

seylerius 2016-08-28T23:34:59.000053Z

Trying lookahead seems to just fail.

seylerius 2016-08-28T23:42:06.000054Z

Okay, tags are mostly fixed, but it's only grabbing the first one.

seylerius 2016-08-28T23:42:11.000055Z

Pushed.

seylerius 2016-08-28T23:42:29.000056Z

Would appreciate a look when you have time, @aengelberg

seylerius 2016-08-28T23:43:30.000057Z

Ach. It's also not getting second headlines. They're turning into content lines due to newline weirdness.

seylerius 2016-08-28T23:46:49.000058Z

Pushed again. Fixed newline weirdness

seylerius 2016-08-28T23:50:17.000059Z

Hah, fixed it. Required post-tag newline/whitespace.

seylerius 2016-08-28T23:50:39.000060Z

Gah. Org is a beautiful format, but it's a bitch to parse.

aengelberg 2016-08-28T23:56:37.000061Z

The parser breaks if I put this into the file:

* The First : Section :foo:bar:

aengelberg 2016-08-28T23:56:43.000062Z

Not sure if that's valid org-mode.
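One possible way to split such a line with a single regex, as a sketch only. `headline-re` is a hypothetical name, and the tag character set (`\w`, `@`, `#`, `%`) is an assumption about what Org allows in tags, not something taken from the organum source.

```clojure
;; Hypothetical pattern: stars, a reluctant title, then an optional trailing tag block.
(def headline-re
  #"(\*+)\s+(.*?)(?:\s+(:(?:[\w@#%]+:)+))?\s*")

;; A lone " : " in the title is not mistaken for a tag block, because a tag
;; block must consist of non-empty names each closed by a colon.
(re-matches headline-re "* The First : Section :foo:bar:")
;; => ["* The First : Section :foo:bar:" "*" "The First : Section" ":foo:bar:"]

;; Headlines without tags still match; the tag group is nil.
(re-matches headline-re "** No tags here")
;; => ["** No tags here" "**" "No tags here" nil]
```

Using `re-matches` forces the whole line to be consumed, which is what makes the reluctant title grow until the optional tag block and trailing whitespace fit.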
