#clojure-europe
2022-04-30
genRaiy09:04:34

Good morning

🪞 2
Asko Nōmm19:04:09

Am jelly. Here up north in Estonia there is no green yet.

😬 1
nottmey19:04:51

Good morning 👋 😄 I’m thinking about automating my inbound e-mails for receipts and stuff… Anyone got an idea how to parse “random” e-mails and PDFs with Clojure, or with what? 🤔

nottmey19:04:41

(and if you know a better channel to ask for this, please let me know ^^)

pavlosmelissinos06:05:02

Not sure about emails, but for PDFs https://github.com/dotemacs/pdfboxing might work for you. You'd have to write the parser yourself and adapt it whenever a PDF with a different structure comes along. I have raised a PR there: https://github.com/dotemacs/pdfboxing/pull/62. Or you can do interop if you prefer. The API is quite straightforward.

pavlosmelissinos06:05:46

It won't work for arbitrary PDFs (i.e. if you don't write a parser) though, which might be what you mean by "random". And I don't think the library will do OCR.
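
To make "write a parser" concrete, here is a minimal sketch of a per-layout parser in plain Clojure. It assumes the text has already been pulled out of the PDF (for example with pdfboxing's text extraction) and only handles one made-up receipt layout. The patterns, the `parse-receipt` name, and the sample text are all hypothetical; a real sender's layout would need its own patterns.

```clojure
(require '[clojure.string :as str])

;; Hypothetical patterns for ONE receipt layout. They are assumptions
;; for illustration, not something taken from a real document.
(def total-pattern #"(?i)total\s*:?\s*(\d+[.,]\d{2})")
(def date-pattern  #"(\d{4}-\d{2}-\d{2})")

(defn parse-receipt
  "Takes already-extracted plain text and returns a map with :total
  (a BigDecimal) and :date (a string). A value is nil when its
  pattern does not match, which signals an unknown layout."
  [text]
  {:total (some-> (re-find total-pattern text)
                  second
                  (str/replace "," ".")
                  bigdec)
   :date  (some-> (re-find date-pattern text)
                  second)})

;; Usage with a made-up receipt body:
(parse-receipt "Invoice date: 2022-04-30\nTotal: 42,50 EUR")
```

When a third party changes its layout, the nil values are the cue to add or adjust a pattern, which is the "adapt it" part mentioned above.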

nottmey09:05:09

Yea by random PDFs I meant PDFs which I can’t ensure will stay in the same structure, or will have flukes etc (third party). “Extract text from pdf” or “Extract text from specific regions” is the main part of the PDF use-case, yes. I hope OCR will not be necessary, but if there are good libs I would be interested to know them too 😄 Thanks for the hint! 🙂

pavlosmelissinos10:05:44

For OCR there's stuff like Tesseract and OCRmyPDF, but I don't have experience with either. I'm not sure what you're asking is possible. If the structure is arbitrary, you won't be able to automate the parsing without some kind of approximation, like machine learning, but I doubt that's worth the effort, especially for a personal project. If I were you I'd start with the low-hanging fruit: the cases where there's a clear structure, and then gradually expand to the trickier ones.

👍 1
👌 1
nottmey11:05:08

I was hoping to approximate an extraction of the document title (e.g. a letter with a heading) by using the styling as a hint. (e.g. get the bold large text in the middle of the page)
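
A rough sketch of that styling heuristic in plain Clojure, assuming each text fragment on the first page is already available as a map with its font size, boldness, and horizontal position. Getting those values out of a PDF is not shown and would need PDFBox interop. The map keys, the weights, and the `guess-title` name are all made up for illustration.

```clojure
(defn title-score
  "Scores a fragment: larger font is better, bold adds a bonus, and
  distance from the horizontal centre of the page is penalised.
  The weights 4.0 and 10.0 are arbitrary assumptions to be tuned."
  [{:keys [font-size bold? x width]} page-width]
  (let [centre (+ x (/ width 2.0))
        offset (/ (Math/abs (- centre (/ page-width 2.0))) page-width)]
    (+ font-size
       (if bold? 4.0 0.0)
       (* -10.0 offset))))

(defn guess-title
  "Returns the text of the highest-scoring fragment, or nil when
  there are no fragments."
  [fragments page-width]
  (when (seq fragments)
    (:text (apply max-key #(title-score % page-width) fragments))))

;; Usage with hypothetical fragments on a 595pt-wide (A4) page:
(guess-title
  [{:text "ACME GmbH"        :font-size 10 :bold? false :x 40  :width 80}
   {:text "Invoice 2022-117" :font-size 18 :bold? true  :x 200 :width 195}
   {:text "Dear customer,"   :font-size 11 :bold? false :x 40  :width 90}]
  595)
;; => "Invoice 2022-117"
```

This is only an approximation: a letterhead in large type would win over the real heading, so the weights would need tuning against actual documents.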

pavlosmelissinos11:05:45

Right, if you can think of some heuristics and you're fine with approximations then it could work, sure. Check out https://github.com/paperless-ngx/paperless-ngx too. It's meant to organize your scanned physical documents and it's not Clojure-related but a large part of it is about parsing (and it does some amount of OCR and machine learning).

nottmey18:05:40

Ok, nice, that looks good too