data-science

lepistane 2024-08-16T20:34:54.196539Z

Not sure if this is the right channel, but is there an OCR library that can be used with Clojure?

Rupert (Sevva/All Street) 2024-08-19T11:03:05.300119Z

What kind of documents are you OCRing?
• Simple text
• Simple text + formatting (e.g. bold/bullets)
• Complex (tables/graphs/charts/infographics/images etc.)
Do you just want to keep the text, or everything else too? The choice of library depends on your priorities between cost, speed, features, accuracy etc. There is no one best option. OCR of complex documents is not a solved problem - expect inaccuracies and limitations. This is a very active area, with things changing a lot (e.g. advice from a year ago may be out of date). I wouldn't pick a Clojure-based OCR library just for convenience unless it came out very recently, since otherwise it is likely to be behind.
A few options:
• Access the Java binding of an OCR library through Java interop
• Access the Python binding of an OCR library through Python interop (libpython-clj)
• Access a command-line OCR tool via clojure.java.shell/sh
• Access a REST OCR service via the clj-http client
Note that a new emerging option is to use LLMs with vision capabilities - you can use all of the above options for interoping with LLMs (there's also llama.clj, although you'd likely need to do some work).
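
A minimal sketch of the command-line option, assuming the `tesseract` binary is installed and on the PATH. The function names, the `eng` language, and page segmentation mode 6 (a single uniform block of text) are illustrative assumptions, not recommendations from this thread:

```clojure
(ns ocr.sketch
  (:require [clojure.java.shell :refer [sh]]
            [clojure.string :as str]))

(defn tesseract-command
  "Builds the argument vector for the tesseract CLI (hypothetical helper).
  Using `stdout` as the output base makes tesseract print the recognised
  text instead of writing it to a file."
  [image-path]
  ;; Assumed settings: English language data, --psm 6 (one uniform text block).
  ["tesseract" image-path "stdout" "-l" "eng" "--psm" "6"])

(defn ocr-image
  "Runs tesseract on image-path and returns the trimmed text.
  Throws ex-info if tesseract exits with a non-zero status."
  [image-path]
  (let [{:keys [exit out err]} (apply sh (tesseract-command image-path))]
    (if (zero? exit)
      (str/trim out)
      (throw (ex-info "tesseract failed" {:exit exit :err err})))))
```

Calling `(ocr-image "name-tag.png")` (a hypothetical file) would return whatever text Tesseract finds. Keeping the command in its own function makes it easy to try other languages or segmentation modes.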

lepistane 2024-08-19T12:50:27.654279Z

It's standard text, but the background may be blurry. I want to scan the name tags for a conference (adding a QR code to keep it simple isn't an option). So the text will be standardized, but there may be blur, shadows, or night-time lighting because everyone is going to scan differently under different light. I mostly just want to keep the text. Thank you very much for sharing the options! Very nice! I am curious: is there a comparison of accuracy between regular OCR and (m)LLMs (that support image recognition as well)?

Rupert (Sevva/All Street) 2024-08-19T12:54:06.521829Z

For just text extraction you might be able to just use Tesseract (with some Clojure/Java wrapper around it). Note that Tesseract (and other similar OCR tools) assume a good base case - the text should be carefully cropped and angled photos should be de-skewed. So then you would need libraries for that too! I think an LLM should be good for your use case - it can handle busy/noisy images (uncropped images). OCR with an LLM is pretty accurate (not everyone uses it yet because it can hallucinate and it can be expensive when dealing with millions of documents). If you have an Nvidia GPU then you could try a vision language model like phi-vision locally - otherwise use one of the cloud LLM companies.
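
A rough sketch of the cropping step using only the JDK's image classes through Java interop. The function names and the crop rectangle are hypothetical, and this does not attempt de-skewing, which would need a separate image-processing library:

```clojure
(ns ocr.crop
  (:import [java.awt.image BufferedImage]
           [java.io File]
           [javax.imageio ImageIO]))

(defn crop
  "Returns the region of img bounded by the pixel rectangle x, y, w, h.
  The result shares pixel data with the original image."
  [^BufferedImage img x y w h]
  (.getSubimage img (int x) (int y) (int w) (int h)))

(defn crop-file!
  "Reads in-path, crops it to the given rectangle, and writes a PNG to out-path.
  Returns true if ImageIO found a writer for the PNG format."
  [in-path out-path x y w h]
  (let [img (ImageIO/read (File. ^String in-path))]
    (ImageIO/write ^BufferedImage (crop img x y w h) "png" (File. ^String out-path))))
```

The cropped file can then be handed to whichever OCR option you pick. Finding the right rectangle for each photo is the hard part and is not covered here.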

lepistane 2024-08-19T17:39:51.755169Z

Thank you very much for this! 🙇