instaparse 2016-08-30 | Slack Archive

andrei20:08:59

I am trying to write a simple grammar that parses comments like /* some text */. Is there a way in instaparse to say "any character"? e.g.

"comment = '/*' .* '*/'"

aengelberg20:08:24

@andrei Instaparse doesn't have a special character for that, but you can use regular expressions to cover any character

aengelberg20:08:04

e.g. comment = '/*' #'[\\s\\S]'* '*/'

aengelberg20:08:07

(`#"[\s\S]"` is my personal favorite way to match any character in a regex)

seylerius20:08:04

@andrei: Yeah, you'll want something like this:

"comment = <'/*'> #'.*' <'*/'>"

My version hides the comment tokens, though @aengelberg's regexp might be more appropriate.

andrei20:08:43

@aengelberg @seylerius thank you for the suggestions. I think I got a bit misled by the source code, https://github.com/Engelberg/instaparse/blob/master/src/instaparse/abnf.clj#L19-L40 I thought there were some defaults in instaparse

andrei20:08:23

but now reading through the doc strings, these are only to parse the grammar itself https://github.com/Engelberg/instaparse/blob/master/src/instaparse/abnf.clj#L2

aengelberg20:08:36

a couple things I see in @seylerius's solution: 1) . in a regex doesn't include newlines 2) .* will greedily match past the */ and won't be able to parse the end of a comment
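Both points can be reproduced in any backtracking regex engine. A minimal sketch, using Python's `re` module as a stand-in (an assumption: instaparse runs on the host platform's regex engine, and for these two constructs Java's behaves the same way). The sample strings are made up for illustration, and the second case assumes the regex terminal is matched on its own, as the discussion describes.

```python
import re

# 1) By default `.` does not match a newline, so `.*` stops at the first line break.
multi_line = "first line\nsecond line"
dot_match = re.match(r".*", multi_line).group(0)

# `[\s\S]` (whitespace or non-whitespace) matches any character, newlines included.
any_match = re.match(r"[\s\S]*", multi_line).group(0)

# 2) A greedy `.*` matched by itself consumes the closing `*/` as well,
#    leaving nothing for a following '*/' token to match.
body = "foo bar*/"
greedy = re.match(r".*", body).group(0)
```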

aengelberg20:08:07

@andrei Sorry for the misleading code. Those constants are available but only to the ABNF format.

aengelberg20:08:37

EBNF is the default

andrei20:08:09

are there constants for ebnf? looking at the code I think not

seylerius20:08:14

@andrei A point to keep in mind with @aengelberg's solution is that you'll need to condense the individual characters of the output.

andrei20:08:27

@seylerius @aengelberg is there a way for specifying in instaparse to group matches together, s.t. one doesn’t need to condense the matches?

aengelberg20:08:27

yeah, thanks for clarifying that @seylerius

seylerius20:08:56

You'll get output like [:comment "f" "o" "o" " " "b" "a" "r"] from input like /*foo bar*/

andrei20:08:04

exactly

andrei20:08:23

there are ways to use transform and apply str on it

seylerius20:08:29

Yep.

aengelberg20:08:40

@andrei The official specification for ABNF is more strict and specific than EBNF, and it dictates that those constants are available. EBNF is more of an ambiguous mashup of a variety of standards we were able to find on the internet

andrei20:08:44

it just feels that there should be a grammar direct way

aengelberg20:08:06

So there are no constants in EBNF, since none of the EBNF resources we found seemed to indicate such

seylerius20:08:15

And remember to wrap your comment tokens in <> like I did, so you don't save the markup itself.

aengelberg20:08:31

Sadly there is no grammar direct way to concat the strings

seylerius20:08:49

Transform works pretty well, though.

andrei20:08:09

hmm, or a more elaborate regexp

andrei20:08:55

I am using smth like this for strings

<string> = dqoute #'([^"\\]|\\.)*' dqoute
   <dqoute> = <'\"'>

seylerius20:08:13

(insta/transform {:comment (partial apply str)} (comment-parser input-data))

andrei20:08:05

and probably the performance impact is small if one applies transforms

seylerius20:08:47

Lolyep. Far as I can tell, instaparse does a good job with efficient transforms.

aengelberg20:08:00

it depends on the size of the file. Probably actually creating all those individual strings is going to be the bottleneck rather than concatenating them later

andrei20:08:06

I must admit I was led astray by the question of which is more efficient, regexps or transforms - although I think it's a very premature optimisation

aengelberg20:08:27

A regex is a sensible solution if you can get it right 🙂

aengelberg20:08:00

My first thought is to do a negative lookahead for */ as part of the regex

seylerius20:08:58

Trouble is, from what I've found, that the */ will get eaten in the .*

seylerius20:08:19

And the negative lookahead will pass because the end token was already eaten

andrei20:08:16

so more reg exp magic for me to look into. to give a bit more context I am playing around with parsing localizable strings.

/* This is a comment */

"hello" = "Hello!";

/* This is another comment */
"click_button" = "Click";

/* Title bar, prints the number of selected products (The translation should be short due to the limit of 100 characters for the title of the mobile app) */
"bar_print_$_selected_products" = "You Selected %@ Products";

andrei20:08:32

just an experiment, nothing production related.

andrei20:08:20

@aengelberg @seylerius thank you for your help, so far I have enjoyed using instaparse. It's cool that I can use some things that I learned in college to do some useful things

andrei20:08:54

although I must say that I need to re-learn things about parsers and defining grammars

aengelberg20:08:11

@seylerius I meant a regex negative lookahead, i.e. #".*(?!\*/)" or something

aengelberg20:08:40

@andrei glad you're having fun! feel free to ask here if you have any more questions

seylerius20:08:59

@aengelberg: That's what I thought. It winds up eating the end-token in the .* and passes the negative lookahead anyway. I was fighting that with the headline parser in organum over the weekend.

seylerius20:08:17

When I was trying to get it to parse tags.

aengelberg20:08:59

oh, I guess the regex would pass, saying "here's a sequence of characters (including */), and look, there is not a */ *after* these characters!"

seylerius20:08:06

Bingo
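A small sketch of that failure mode, taking `*/` as the end token. It uses Python's `re` as a stand-in for the regex engine (an assumption; the lookahead semantics shown here are the same in Java's engine), and the input string is made up.

```python
import re

text = "foo bar*/"

# Greedy `.*` first runs to the end of the input. The negative lookahead is
# only checked at that final position, where no `*/` follows, so it succeeds
# without any backtracking and the end token is swallowed.
trailing_lookahead = re.match(r".*(?!\*/)", text).group(0)
```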

aengelberg20:08:33

so maybe #"((?!\*/).)*"

aengelberg20:08:44

that would generate a bunch of match groups though due to the ()
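One way to sidestep the capture-group concern is a non-capturing group, `(?: ... )`. The sketch below is an illustration in Python's `re`, not something taken from instaparse itself. It checks the lookahead before every character and swaps `.` for `[\s\S]` so that newlines are allowed; the variable names and the sample text are hypothetical.

```python
import re

# (?: ... ) groups without capturing. The negative lookahead is re-checked
# before each character, so the match stops right before the first `*/`.
comment_body = re.compile(r"(?:(?!\*/)[\s\S])*")

text = ' This is\na comment */ "hello" = "Hello!";'
matched = comment_body.match(text).group(0)
```

In an instaparse grammar the same pattern would presumably sit inside a `#'...'` terminal between the `'/*'` and `'*/'` tokens, with backslashes escaped as the enclosing Clojure string requires.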

seylerius20:08:14

Gah, lemme see what I did for that in the tags in organum.