instaparse

licht1stein 2023-01-30T12:28:43.330379Z

Hi, I'm totally new to parsing and working on a linguistic side project. I'm trying to parse something that looks like this: "<pc>1,1<k1>a<k2>a<h>1<e>1\n<hom>1.</hom> <s>a</s>", where <pc>1,1 and <k1>a are the first type of tag, and <s>a</s> is another type of tag. I would like to get something like {:pc "1,1" :k1 "a" :k2 "a" :h "1"} for starters, because the second part should be simple XML. I've got this, which works as long as the string doesn't contain anything else, but breaks on the entire sample:

"
S = {tag}
tag = <tag-open> + key + <tag-close> + value
key = #'[a-zA-Z0-9]+'
value = #'[a-zA-Z0-9]+'
tag-open = #'<'
tag-close = '>'
"
I feel like I'm missing an understanding of some basic piece. I also don't know how to separate the XML tags from this first kind of tag. Please help.

thom 2023-01-30T16:35:05.860809Z

This is presumably some form of SGML. You can probably find a Java library that’ll parse it already, and most HTML libraries (many of them based on JSoup) will fix up unclosed tags for you. If you want to parse it yourself, you just need to add opening and closing tags to your grammar and make the closing ones optional.
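
A minimal sketch of that last suggestion, not a definitive grammar. It assumes every tag is followed by non-empty text, that tags are not nested, and that only whitespace separates a closing tag from the next opening tag. The names `tag-parser` and `tags->map` are hypothetical, and the closing tag's name is not checked against the opening one.

```clojure
(require '[clojure.string :as str]
         '[instaparse.core :as insta])

;; Hypothetical grammar: each element is "<name>text" with an optional "</name>".
;; Concatenation in instaparse is written with whitespace; a trailing ? marks
;; the closing tag as optional. Angle brackets hide pieces from the parse tree.
(def tag-parser
  (insta/parser
    "S       = (element | <ws>)*
     element = <'<'> name <'>'> text <close>?
     close   = '</' name '>'
     <name>  = #'[a-zA-Z0-9]+'
     <text>  = #'[^<]+'
     ws      = #'\\s+'"))

(defn tags->map
  "Parses input and returns a map of tag name (as keyword) to trimmed text,
   or nil if the input does not match the grammar. A repeated tag name
   overwrites the earlier value (a simplification of this sketch)."
  [input]
  (let [tree (tag-parser input)]
    (when-not (insta/failure? tree)
      (into {}
            (for [[_ k v] (rest tree)]   ; each node is [:element name text]
              [(keyword k) (str/trim v)])))))
```

Calling `(tags->map "<pc>1,1<k1>a<k2>a<h>1<e>1\n<hom>1.</hom> <s>a</s>")` should then give one entry per tag, and `select-keys` with `[:pc :k1 :k2 :h]` narrows that down to the header part. Because `text` accepts anything up to the next `<`, values such as `1,1` are allowed, which an alphanumeric-only `value` pattern would reject.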