instaparse

jacob.maine 2023-05-10T00:02:40.118529Z

We’re parsing large documents. We’ve found we get better performance (memory and CPU) if we split the documents into chunks and parse each chunk separately. This keeps the grammar smaller and aligns with https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md#performance-tips. However, we also need to keep track of line and column numbers in the original document, which doesn’t work well with the chunked approach. Say we have two chunks, lines 1-10 and lines 11-20. When we use insta/add-line-and-column-info-to-metadata on the second chunk, the line metadata starts at line 1, not line 11. At the moment we have a collection of helpers that walk the metadata after it’s generated and offset it, but I was wondering if anyone has a better approach. I’ve submitted a PR, https://github.com/Engelberg/instaparse/pull/226, which pushes the complexity into instaparse itself. If anyone has other tricks, I’d love to hear them.
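
To make the post-hoc offsetting concrete, here is a minimal sketch of that kind of helper. It is not the actual code behind the message or the PR. It assumes hiccup-style output (vector nodes) and that the line info sits under the `:instaparse.gll/start-line` and `:instaparse.gll/end-line` metadata keys. `shift-lines` and `chunk-2-tree` are hypothetical names, and the tree is built by hand so the snippet runs without instaparse on the classpath.

```clojure
;; Sketch only: shift line metadata on a hiccup-style parse tree.
;; Assumption: nodes are vectors carrying :instaparse.gll/start-line and
;; :instaparse.gll/end-line in their metadata.
(defn shift-lines
  "Returns `tree` with every node's line metadata increased by `line-offset`."
  [tree line-offset]
  (if (vector? tree)
    (let [m (meta tree)]
      (with-meta
        ;; Recurse into children; keywords and strings pass through unchanged.
        (mapv #(shift-lines % line-offset) tree)
        ;; Only touch the keys that are actually present.
        (cond-> m
          (:instaparse.gll/start-line m) (update :instaparse.gll/start-line + line-offset)
          (:instaparse.gll/end-line m)   (update :instaparse.gll/end-line + line-offset))))
    tree))

;; Hand-built stand-in for the annotated parse of the second chunk
;; (lines 11-20 of the original document), which reports line 1.
(def chunk-2-tree
  (with-meta
    [:doc (with-meta [:word "hello"]
            {:instaparse.gll/start-line 1 :instaparse.gll/end-line 1
             :instaparse.gll/start-column 1 :instaparse.gll/end-column 6})]
    {:instaparse.gll/start-line 1 :instaparse.gll/end-line 1
     :instaparse.gll/start-column 1 :instaparse.gll/end-column 6}))

;; The second chunk starts 10 lines into the original document.
(def shifted (shift-lines chunk-2-tree 10))
```

Columns are left alone here on the assumption that chunks are split on line boundaries. A chunk that starts mid-line would also need a column offset for nodes on its first line.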

thom 2023-05-14T19:53:22.346549Z

clj-antlr is much faster than Instaparse if performance remains an issue. Not quite as nice an API though.

jacob.maine 2023-05-15T15:30:23.972749Z

Thanks, I’ll check that out!