#instaparse
2021-04-18
sova-soars-the-sora20:04:23

Hi. I was wondering, is there a way to do fuzzy matching or fuzzy parsing? What I mean is that I want to draw rectangles around Japanese text, parsing it effectively, but I would like to be able to parse even if some terms are unknown or undefined. Is there a way to do that in instaparse, a way to parse with fuzziness so that not every single term in the parse input has to be defined by the rules?

aengelberg20:04:06

Sadly no. Instaparse was designed to turn strings into data using a well-defined language, so partial matching and fuzzy matching aren't well-supported.

sova-soars-the-sora20:04:14

No problem. I'm wondering how I can do this ^.^ Maybe I can pre-process everything and do a sort of mini dictionary prep step.

sova-soars-the-sora20:04:56

So if I do a dictionary scan of the input text, I think it is smartest to start with longest strings first

sova-soars-the-sora20:04:37

match all the 7-letter words, 6-letter words, 5-letter words, and so on.

sova-soars-the-sora20:04:43

maybe just cut it into slices? "Shewenttothemuseum" -> "Shewent" (no results) ... "emuseum" (no results) ... but then "Shewen" (also no results) ... "museum" (result found). mark it. keep it moving. Kinda like a Sieve of Eratosthenes but on text
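
A minimal sketch of the slicing idea above, in Python for illustration only (the channel itself is about Clojure). The function name `sieve_scan`, the `max_len` parameter, and the toy dictionary are assumptions made for this example, not anything from instaparse. It tries the longest windows first, marks matched characters as taken, and leaves unknown text unmarked.

```python
def sieve_scan(text, dictionary, max_len=7):
    """Return sorted (start, end, word) spans, trying longer windows first.

    Hypothetical helper: `dictionary` is any set-like container of known words.
    """
    taken = [False] * len(text)  # characters already claimed by a longer match
    spans = []
    for n in range(min(max_len, len(text)), 0, -1):  # 7-letter, 6-letter, ...
        for start in range(len(text) - n + 1):
            end = start + n
            piece = text[start:end]
            # Accept the slice only if it is a known word and does not
            # overlap a span that was marked in an earlier (longer) pass.
            if piece in dictionary and not any(taken[start:end]):
                spans.append((start, end, piece))
                for i in range(start, end):
                    taken[i] = True
    return sorted(spans)


words = {"She", "went", "to", "the", "museum", "he", "use"}
print(sieve_scan("Shewenttothemuseum", words))
```

Because "museum" is claimed in the 6-letter pass, the shorter candidates "use" and "he" inside already-marked spans are skipped, and any characters that never match simply stay unmarked.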

sova-soars-the-sora20:04:53

Making m stringlets of size n from a string sounds linear in the size of the data, so we could probably handle pretty large datasets, but maybe not a whole novel conveniently this way. Hmm, I suppose it is easy if we split on sentence ends (periods 。) and then do the sieve approach on each sentence
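
A rough Python sketch of the per-sentence idea, under stated assumptions: `split_sentences` and `window_count` are hypothetical helper names, and only "。" and "." are treated as sentence ends. The second function counts how many slices the sieve would examine for one sentence, which grows linearly with sentence length for a fixed maximum window size.

```python
import re


def split_sentences(text):
    """Split on sentence ends, keeping the terminator with each sentence."""
    # Assumption: only "。" and "." end a sentence; real text may need more.
    return [s.strip() for s in re.findall(r"[^。.]+[。.]?", text) if s.strip()]


def window_count(length, max_len=7):
    """Number of slices of size 1..max_len in a string of the given length."""
    return sum(length - n + 1 for n in range(1, min(max_len, length) + 1))


for sentence in split_sentences("彼女は博物館に行った。楽しかった。"):
    print(sentence, window_count(len(sentence)))
```

For the 18-character example "Shewenttothemuseum" with windows up to 7, `window_count` gives 105 dictionary lookups, so splitting a long text into sentences mainly keeps each unit small rather than changing the total work.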

sova-soars-the-sora20:04:29

This might actually work pretty darn well!

sova-soars-the-sora20:04:16

Preprocess the input with a sieve + dictionary lookup, figure out the nouns and verbs, throw them into the rules, then try to run the parse on it. I'll still need some core rules for grammar but the idea is to have a lot of them hard-coded

aengelberg20:04:10

A regex could be a good fit to quickly scan for valid dictionary words. Some regex libraries let you compile a large union of words (#"word1|word2|word3|...") into a finite state machine that can do a linear-time scan of text.
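
A small Python sketch of the word-union regex, for illustration only. `build_word_regex` and `find_words` are hypothetical names, and the word list is made up. One caveat: Python's built-in `re` module is a backtracking engine, so this shows the shape of the pattern but not the finite-state compilation mentioned above; an automaton-based engine such as RE2 would be needed for a linear-time guarantee.

```python
import re


def build_word_regex(words):
    """Compile a union of literal words, longest alternatives first."""
    # Longest-first ordering matters in a backtracking engine, which takes
    # the first alternative that matches at a given position.
    ordered = sorted(set(words), key=lambda w: (-len(w), w))
    return re.compile("|".join(re.escape(w) for w in ordered))


def find_words(pattern, text):
    """Return (start, end, word) for each non-overlapping match, left to right."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


pattern = build_word_regex(["museum", "use", "went", "the", "to", "She"])
print(find_words(pattern, "Shewenttothemuseum"))
```

Unlike the longest-first sieve, this scan is left to right: it takes the longest listed word at the earliest position, and characters that match no word are skipped over rather than causing a failure.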

sova-soars-the-sora21:04:13

ohhh cool. that's a really neat idea. i think i might need to use web lookups but if i keep tabs on those results they could go into such a regex.