#instaparse
2016-08-28
aengelberg00:08:09

@seylerius this is a good place for that.

aengelberg00:08:35

You could put further "insta/parse"s in the functions inside the "insta/transform" map

seylerius00:08:53

This is awesome.

aengelberg00:08:05

(insta/transform {:x (fn [s] (insta/parse otherparser s))} (insta/parse firstparser s))

aengelberg00:08:12

Hard to bang out a good example on mobile

seylerius00:08:46

That looks fascinating.

aengelberg00:08:12

It would get weird if the nested parser had an error though.

seylerius00:08:29

So how deep does it go looking for :x?

seylerius00:08:09

And how do you make it check for loose strings?

aengelberg00:08:57

It does a full traversal of the hiccup / enlive, as long as all structures around the :x are valid hiccup / enlive
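
A minimal sketch of the nested-parse idea discussed above, assuming instaparse is on the classpath. Both grammars, `parse-chunk`, and the `:unparsed` tag are hypothetical names made up for illustration. The inner parse is checked with `insta/failure?` so that a failing nested parser leaves a marked node instead of a failure object buried in the tree.

```clojure
(require '[instaparse.core :as insta])

;; Hypothetical outer grammar: comma-separated chunks, each tagged :x.
(def first-parser
  (insta/parser
    "S = x (<','> x)*
     x = #'[^,]+'"))

;; Hypothetical inner grammar: one chunk of the form key=value.
(def other-parser
  (insta/parser
    "pair = key <'='> val
     key = #'[a-z]+'
     val = #'[0-9]+'"))

(defn parse-chunk
  "Re-parses the text of an :x node; keeps the raw text if that fails."
  [s]
  (let [result (insta/parse other-parser s)]
    (if (insta/failure? result)
      [:unparsed s]
      result)))

;; Every :x node in the outer tree is replaced by the result of the inner parse.
(def tree
  (insta/transform {:x parse-chunk}
                   (insta/parse first-parser "a=1,b=2,oops")))
;; tree => [:S [:pair [:key "a"] [:val "1"]]
;;             [:pair [:key "b"] [:val "2"]]
;;             [:unparsed "oops"]]
```

Whether to keep the raw text, throw, or return the failure itself is a design choice; the fallback here is just one option.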

seylerius00:08:14

@aengelberg: How do you get solo strings?

seylerius21:08:19

Gah, what's wrong with this parser? doc-metadata works fine, but running headlines on the remaining content just returns flat content. https://github.com/seylerius/organum

seylerius21:08:57

Simple reproduction: (headlines (last (doc-metadata (slurp ""))))

seylerius21:08:08

It's something in the h token, because that's the last thing I changed before it started failing.

ska21:08:45

At a first glance, the #'.+' looks suspicious to me. Is greediness biting you here? (Did not try it out, though)

aengelberg21:08:00

@seylerius the regex you put for :content is probably not what you want. Due to the (?s) flag, seems to match everything including newlines, as long as the first character is not a *.
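
To make the effect of the `(?s)` flag concrete, here is a small sketch in plain Clojure. The two patterns are hypothetical stand-ins, not the actual regex from the organum repository, and `sample` is made-up input.

```clojure
;; Made-up input: a content line, then a headline, then more content.
(def sample "Intro line\n* Headline\nBody text\n")

;; With (?s), `.` also matches newlines, so the match runs to the end of the input.
(def dotall-content #"(?s)[^*].*")

;; Without (?s), `.` stops at the first newline.
(def line-content #"[^*].*")

(re-find dotall-content sample) ;=> "Intro line\n* Headline\nBody text\n"
(re-find line-content sample)   ;=> "Intro line"
```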

aengelberg21:08:06

I'm not sure what your desired behavior is though.

aengelberg21:08:16

BTW, both the first ^ and the ? in your regex appear redundant, if I understand it correctly.

seylerius21:08:29

The content regexp is fine. It's after I changed a few things to tidy up :h and added tag parsing that it started failing.

seylerius21:08:54

Basically, a headline starts with some number of stars. Everything else isn't a headline.

aengelberg21:08:56

I cloned your project and am looking at that parser. Is there a different version / branch I missed?

seylerius21:08:38

Nope, I pushed the latest version just before I spoke up today.

aengelberg21:08:13

Sorry I may have been unclear. When I said :content I meant the content inside the headlines parser.

aengelberg21:08:26

Not the doc-metadata parser

aengelberg21:08:05

As an experiment I removed all the hide-tags from the headlines parser, since I got that behavior you were talking about (flat content). That exposed the headlines' :content rule as being greedy.

aengelberg21:08:06

organum.core> (headlines content)
[:S [:token [:content "This is an attempt...

seylerius21:08:20

Yep. I've got an ordered choice making it prefer to define a section (headline then content) if possible, and just content if not. The defining difference between content and headline is whether it starts with stars.

seylerius21:08:11

Although, Hmmm. You've got a point about the mode there.

aengelberg21:08:58

I think this is what happened:
- The section rule failed at the start of the string
- It then fell back to the content rule due to ordered choice
- The content rule mistakenly parses the whole string (for the reason I mentioned above)
- Parse is done
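
The sequence described above can be reproduced with a toy grammar. This is only a sketch, assuming instaparse is available; the grammars are simplified, hypothetical stand-ins for the real headlines parser (a bare `h` rule takes the place of the section rule).

```clojure
(require '[instaparse.core :as insta])

;; Content regex uses (?s), so once `h` fails it can swallow the rest of the input.
(def greedy-headlines
  (insta/parser
    "S = (h / content)*
     h = #'\\*+ [^\\n]*\\n'
     content = #'(?s)[^*].*'"))

;; Content restricted to a single line that does not start with a star.
(def line-headlines
  (insta/parser
    "S = (h / content)*
     h = #'\\*+ [^\\n]*\\n'
     content = #'[^*\\n][^\\n]*\\n'"))

(def sample "intro\n* H\nbody\n")

(greedy-headlines sample)
;; => [:S [:content "intro\n* H\nbody\n"]]

(line-headlines sample)
;; => [:S [:content "intro\n"] [:h "* H\n"] [:content "body\n"]]
```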

seylerius21:08:22

Yeah. You're right. Making the content rule less accepting (not (?s)) fixes that part, and now I'm seeing failures to parse the first headline. Joy.

seylerius21:08:27

How does instaparse play with non-capturing groups?

aengelberg21:08:09

Not familiar with that term; are you referring to the groups returned by a Java regex match?

seylerius21:08:04

Non-capturing groups are for saying, "this should be here, but don't return it in a group"

seylerius21:08:39

Okay, new push. Can't manage to get tags out separate.

aengelberg21:08:20

oh, you mean things like regex lookahead and lookbehind?

seylerius21:08:41

They work if I make them mandatory, but get eaten by the headline body if they're optional. Would lookahead allow saying "if there's whitespace followed by a colon, stop here"?

aengelberg21:08:37

This is the instaparse source code that applies regexes, may shed some light on whether certain constructs would work. https://github.com/Engelberg/instaparse/blob/master/src/instaparse/gll.clj#L670

aengelberg21:08:21

I would expect regex non matching lookaheads to work, but non-matching lookbehinds to NOT work. Instaparse runs a regex match on the substring of the current index onward, so previous characters are invisible. EDIT: I misunderstood the term "non-matching"
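
A plain-Clojure sketch of why lookbehind would be affected, under the assumption stated above that the regex is matched against the substring starting at the current index. `match-at-front` is a hypothetical helper written for this example, not instaparse's own function.

```clojure
(defn match-at-front
  "Returns the match only if it starts at the beginning of s, else nil."
  [re s]
  (let [^java.util.regex.Matcher m (re-matcher re s)]
    (when (.lookingAt m)
      (.group m))))

(def text "ab")

;; On the full string, the lookbehind can see the preceding "a".
(re-find #"(?<=a)b" text)                  ;=> "b"

;; On the substring from index 1, the "a" is gone, so the lookbehind fails.
(match-at-front #"(?<=a)b" (subs text 1))  ;=> nil

;; Lookahead only inspects later characters, which are still present.
(match-at-front #"b(?=c)" "bc")            ;=> "b"
```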

aengelberg21:08:11

I see you're using (?:) now. I don't think "non capturing" is what you want

seylerius21:08:22

I think you're right.

aengelberg21:08:47

organum.core> (re-find #"a" "a")
"a"
organum.core> (re-find #"(?:a)" "a")
"a"

seylerius21:08:56

What's weird is non-greedy options fail entirely.

aengelberg21:08:18

(?:) basically means, if there are any other groups () inside that block, DON'T return them as an additional output.

seylerius21:08:04

Ah, it looks like negative lookahead is the trick.

aengelberg21:08:43

the ?: flag shouldn't affect Instaparse's usage of regexes at all. Instaparse throws away match groups
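
For reference, a short sketch of what `(?:...)` changes in plain Clojure. It only affects which groups `re-find` reports; the overall match is the same in every case. Note that a capturing group nested inside a non-capturing one is still reported.

```clojure
;; Each result is bound to a name so the comparison at the end is easy to read.
(def capturing     (re-find #"(a)(b)" "ab"))     ;=> ["ab" "a" "b"]
(def one-hidden    (re-find #"(?:a)(b)" "ab"))   ;=> ["ab" "b"]
(def all-hidden    (re-find #"(?:a)(?:b)" "ab")) ;=> "ab"
(def nested-inside (re-find #"(?:(a))b" "ab"))   ;=> ["ab" "a"]

;; The whole match is identical regardless of grouping, which is all a
;; consumer that ignores groups would ever see.
(def whole-matches
  [(first capturing) (first one-hidden) all-hidden (first nested-inside)])
;=> ["ab" "ab" "ab" "ab"]
```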

seylerius21:08:45

Nope. Pushing. Still eats the tags.

aengelberg21:08:51

need to run now, can probably help more in an hour or so. I'd say the next step is manually parsing the regexes on the strings.

aengelberg21:08:18

and try gradually taking characters away from the regex to see what the problem is

seylerius21:08:23

Okay, thanks for the help. Talk with ya when you've got time.

aengelberg21:08:36

feel free to dump any further findings here

seylerius21:08:57

Will do. Slack has persistence, which is pretty handy

seylerius23:08:43

Okay, trying reluctance means I only get the first character of the headline, and the rest becomes part of the content.

seylerius23:08:59

Trying lookahead seems to just fail.
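
A plain-regex sketch of the behaviors described in this part of the thread. The patterns and the sample line are hypothetical, not taken from the organum source. A bare reluctant quantifier has nothing after it to force it to grow, so it stops after one character; giving it a lookahead to run up against is one way to stop before trailing tags.

```clojure
(def line "The First Section :foo:bar:")

;; Greedy: eats the tags along with the title.
(def greedy-title #".+")

;; Reluctant with nothing following: satisfied after a single character.
(def reluctant-title #".+?")

;; Reluctant plus lookahead: stop before whitespace followed by :tags: at the
;; end, or at the end of the text if there are no tags.
;; Assumption: the regex sees exactly one line. Inside a grammar the regex sees
;; the rest of the input, so `$` would need (?m) or explicit newline handling.
(def lookahead-title #".+?(?=\s+:\S+:\s*$|\s*$)")

(re-find greedy-title line)                ;=> "The First Section :foo:bar:"
(re-find reluctant-title line)             ;=> "T"
(re-find lookahead-title line)             ;=> "The First Section"
(re-find lookahead-title "Plain headline") ;=> "Plain headline"
```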

seylerius23:08:06

Okay, tags are mostly fixed, but it's only grabbing the first one.

seylerius23:08:29

Would appreciate a look when you have time, @aengelberg

seylerius23:08:30

Ach. It's also not getting second headlines. They're turning into content lines due to newline weirdness.

seylerius23:08:49

Pushed again. Fixed newline weirdness

seylerius23:08:17

Hah, fixed it. Required post-tag newline/whitespace.

seylerius23:08:39

Gah. Org is a beautiful format, but it's a bitch to parse.

aengelberg23:08:37

The parser breaks if I put this into the file:

* The First : Section :foo:bar:

aengelberg23:08:43

Not sure if that's valid org-mode.