2016-08-28
@seylerius this is a good place for that.
You could put further "insta/parse"s in the functions inside the "insta/transform" map
(insta/transform {:x (fn [s] (insta/parse otherparser s))} (insta/parse firstparser s))
Hard to bang out a good example on mobile
It would get weird if the nested parser had an error though.
It does a full traversal of the hiccup / enlive, as long as all structures around the :x are valid hiccup / enlive.
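A fuller sketch of the nested-parse idea, in case it helps. The two grammars and the names outer-parser and inner-parser are made up for illustration; only insta/parser, insta/parse, insta/transform and insta/failure? come from instaparse itself.

```clojure
(require '[instaparse.core :as insta])

;; Hypothetical outer grammar: comma-separated chunks, each left as a raw string.
(def outer-parser
  (insta/parser
    "S = x (<','> x)*
     x = #'[0-9:]+'"))

;; Hypothetical inner grammar: re-parses one chunk as a pair of numbers.
(def inner-parser
  (insta/parser
    "pair = num <':'> num
     num = #'[0-9]+'"))

;; Each [:x "..."] node is replaced by the result of parsing its string again.
(def nested-tree
  (insta/transform
    {:x (fn [s] (insta/parse inner-parser s))}
    (insta/parse outer-parser "1:2,3:4")))
;; nested-tree
;; => [:S [:pair [:num "1"] [:num "2"]] [:pair [:num "3"] [:num "4"]]]

;; The "weird" case: a chunk the inner parser rejects comes back as a failure
;; object, which would end up embedded in the transformed tree.
(insta/failure? (insta/parse inner-parser "1:"))
;; => true
```

Checking insta/failure? inside the transform function is one way to notice a bad inner parse before it gets buried in the tree.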
@aengelberg: How do you get solo strings?
Gah, what's wrong with this parser? doc-metadata works fine, but running headlines on the remaining content just returns flat content. https://github.com/seylerius/organum
@aengelberg: Got any clues?
It's something in the h token, because that's the last thing I changed before it started failing.
At first glance, the #'.+' looks suspicious to me. Is greediness biting you here? (I did not try it out, though.)
@seylerius the regex you put for :content is probably not what you want. Due to the (?s) flag, it seems to match everything including newlines, as long as the first character is not a *.
I'm not sure what your desired behavior is though.
BTW, both the first ^ and the ? in your regex appear redundant, if I understand it correctly.
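A quick REPL sketch of the (?s) effect. The sample text and both regexes are stand-ins, not the ones from the organum repo.

```clojure
;; Hypothetical sample: one intro line, then a headline and its body.
(def sample "Intro line\n* Headline\nBody text")

;; With (?s), . also matches \n, so .* runs to the end of the input.
(def dotall-match (re-find #"(?s)[^*].*" sample))
;; dotall-match => "Intro line\n* Headline\nBody text"

;; Without (?s), . stops at the first line terminator.
(def line-match (re-find #"[^*].*" sample))
;; line-match => "Intro line"
```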
The content regexp is fine. It's after I changed a few things to tidy up :h and added tag parsing that it started failing.
Basically, a headline starts with some number of stars. Everything else isn't a headline.
I cloned your project and am looking at that parser. Is there a different version / branch I missed?
Sorry, I may have been unclear. When I said :content I meant the content inside the headlines parser.
Not the doc-metadata parser
As an experiment I removed all the hide-tags from the headlines parser, since I got that behavior you were talking about (flat content). That exposed the headlines' :content rule as being greedy.
organum.core> (headlines content)
[:S [:token [:content "This is an attempt...
Yep. I've got an ordered choice making it prefer to define a section (headline then content) if possible, and just content if not. The defining difference between content and headline is whether it starts with stars.
I think this is what happened:
- The section rule failed at the start of the string
- It then fell back to the content rule due to ordered choice
- The content rule mistakenly parses the whole string (for the reason I mentioned above)
- Parse is done
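Those steps can be mimicked without instaparse. This is only a simplified stand-in: match-prefix and first-token are hypothetical helpers, the regexes are guesses at the shape of the real rules, and real instaparse does much more than try two regexes in order.

```clojure
(defn match-prefix
  "Returns the text that re matches at the very start of s, or nil."
  [re s]
  (let [m (re-matcher re s)]
    (when (.lookingAt m) (.group m))))

(defn first-token
  "Ordered choice: try the section regex first, then fall back to content-re."
  [content-re s]
  (or (some->> (match-prefix #"\*+ .*" s) (vector :section))
      (some->> (match-prefix content-re s) (vector :content))))

(def doc "Preamble\n* Headline\nBody text")

;; Section fails at index 0, the (?s) content regex takes everything.
(first-token #"(?s)[^*].*" doc)
;; => [:content "Preamble\n* Headline\nBody text"]

;; A line-bound content regex leaves the headline for a later token.
(first-token #"[^*].*" doc)
;; => [:content "Preamble"]

;; When the input does start with stars, the first alternative wins.
(first-token #"[^*].*" "* Headline\nBody text")
;; => [:section "* Headline"]
```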
Yeah. You're right. Making the content rule less accepting (not (?s)) fixes that part, and now I'm seeing failures to parse the first headline. Joy.
Not familiar with that term; are you referring to the groups returned by a Java regex match?
Non-capturing groups are for saying, "this should be here, but don't return it in a group"
oh, you mean things like regex lookahead and lookbehind?
They work if I make them mandatory, but get eaten by the headline body if they're optional. Would lookahead allow saying "if there's whitespace followed by a colon, stop here"?
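For what it's worth, plain Java regexes do support that kind of lookahead. A minimal sketch with a made-up title-re, assuming tags always start with whitespace followed by a colon:

```clojure
;; Reluctant body, stopping just before "whitespace then colon" or at end of input.
(def title-re #".+?(?=\s+:|$)")

(re-find title-re "The First Section :foo:bar:")
;; => "The First Section"

;; With no tags present, the $ branch of the lookahead lets the whole line through.
(re-find title-re "No tags here")
;; => "No tags here"
```

The lookahead consumes nothing, so the tag text would still be there for a following tags rule to match.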
This is the instaparse source code that applies regexes, may shed some light on whether certain constructs would work. https://github.com/Engelberg/instaparse/blob/master/src/instaparse/gll.clj#L670
I would expect regex non matching lookaheads to work, but non-matching lookbehinds to NOT work. Instaparse runs a regex match on the substring of the current index onward, so previous characters are invisible. EDIT: I misunderstood the term "non-matching"
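A small sketch of why a lookbehind would come up empty under that model. Here subs merely imitates "match on the substring from the current index onward"; it is not instaparse's actual code path.

```clojure
(def input "ab")

;; On the full string, the lookbehind can see the preceding "a".
(re-find #"(?<=a)b" input)
;; => "b"

;; On the substring starting at index 1, the "a" no longer exists.
(re-find #"(?<=a)b" (subs input 1))
;; => nil

;; A lookahead only needs the characters to the right, so it still works.
(re-find #"b(?=c)" (subs "abc" 1))
;; => "b"
```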
I see you're using (?:) now. I don't think "non capturing" is what you want.
organum.core> (re-find #"a" "a")
"a"
organum.core> (re-find #"(?:a)" "a")
"a"
(?:) basically means: treat the enclosed pattern as a group (), but DON'T return it as an additional output.
(?!=) ?
The ?: flag shouldn't affect Instaparse's usage of regexes at all. Instaparse throws away match groups.
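To make the difference visible at the REPL you need at least one capturing group. A small sketch using only clojure.core:

```clojure
;; Two capturing groups: re-find returns the whole match plus each group.
(re-find #"(a)(b)" "ab")
;; => ["ab" "a" "b"]

;; Making the first group non-capturing drops it from the result vector.
(re-find #"(?:a)(b)" "ab")
;; => ["ab" "b"]

;; The overall match (first element) is the same either way.
(= (first (re-find #"(a)(b)" "ab"))
   (first (re-find #"(?:a)(b)" "ab")))
;; => true
```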
seems legit
need to run now, can probably help more in an hour or so. I'd say the next step is manually parsing the regexes on the strings.
and try gradually taking characters away from the regex to see what the problem is
feel free to dump any further findings here
Okay, trying reluctance means I only get the first character of the headline, and the rest becomes part of the content.
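That matches how reluctant quantifiers behave in plain regexes: with nothing after them they stop as early as they can. A sketch with made-up input:

```clojure
;; Reluctant with nothing following: one character is enough to succeed.
(re-find #".+?" "Headline text")
;; => "H"

;; Greedy: takes the rest of the line.
(re-find #".+" "Headline text")
;; => "Headline text"

;; Reluctant, but forced onward by something that must match after it.
(re-find #".+?$" "Headline text")
;; => "Headline text"
```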
Would appreciate a look when you have time, @aengelberg
Ach. It's also not getting second headlines. They're turning into content lines due to newline weirdness.
The parser breaks if I put into the file
* The First : Section :foo:bar:
Not sure if that's valid org-mode.
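One possible reason for that break, sketched with made-up regexes rather than the ones in the repo: a "stop at whitespace plus colon" lookahead fires on the lone colon in the title. Requiring a complete trailing tag run avoids that, assuming tags look like :foo:bar: at the end of the line and contain only word characters (a simplification).

```clojure
(def headline "The First : Section :foo:bar:")

;; Loose: stops at the first "whitespace then colon", i.e. inside the title.
(def loose-title-re #".+?(?=\s+:|$)")
(re-find loose-title-re headline)
;; => "The First"

;; Stricter: only stops before a full :tag:tag: run that ends the line.
(def strict-title-re #".+?(?=\s+(?::\w+)+:\s*$|$)")
(re-find strict-title-re headline)
;; => "The First : Section"
```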