data-science

Anton Shastun 2025-06-07T17:24:38.096909Z

Hi, I'm trying to handle invoice data extraction, could you please suggest the easiest way?

Harold 2025-06-16T16:18:33.550029Z

Yes, it is fairly straightforward to have the outputs be 'fill-in-the-blank' in JSON or other formats (SQL). LLM MCP may be worth looking into as well. 30s latency is typical in our experience as well. How many such invoices are you processing? Are new ones coming in all the time?
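
A minimal sketch of the 'fill-in-the-blank' idea, assuming the OCR text is sent to whichever LLM you use and the reply comes back as a JSON string. The field names, `build_prompt`, and `parse_reply` are hypothetical and only illustrate the shape; the actual model call is left out on purpose.

```python
import json

# Hypothetical field names chosen for illustration; adapt to your invoices.
INVOICE_TEMPLATE = {
    "invoice_no": None,
    "issue_date": None,
    "currency": None,
    "total": None,
    "line_items": [],
}


def build_prompt(ocr_text: str) -> str:
    """Ask the model to fill in the blanks and reply with JSON only."""
    return (
        "Fill in this JSON template from the invoice text. "
        "Use null for anything you cannot read. Reply with JSON only.\n"
        f"Template: {json.dumps(INVOICE_TEMPLATE)}\n"
        f"Invoice text:\n{ocr_text}"
    )


def parse_reply(reply: str) -> dict:
    """Parse the model's reply and keep only the template's keys."""
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = set(INVOICE_TEMPLATE) - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Drop anything the model added beyond the template.
    return {key: data[key] for key in INVOICE_TEMPLATE}
```

Rejecting replies with missing fields, rather than silently defaulting them, makes it easier to route unreadable invoices to a retry or a manual check.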

raspasov 2025-06-09T07:31:30.315049Z

I have zero experience with that.

raspasov 2025-06-09T07:34:25.135949Z

If you’re just going to call one of the REST APIs of OpenAI, Grok, etc., I’m not sure what any library can provide on top of that. They already have APIs to return structured data (JSON, etc.) in a specified format, I think.

raspasov 2025-06-09T07:38:19.497469Z

I guess if you find it improves accuracy, it can be useful… I feel like everything is trial-and-error these days for such tasks. Depends on your use-case, specific data, etc.

Anton Shastun 2025-06-17T13:02:31.793699Z

@hhausman currently with gpt-4.1-nano it takes 6s to process an invoice image and 3s for a PDF, which works well for our case.

Anton Shastun 2025-06-08T18:07:46.893809Z

@hhausman yes, I think so, it's an OCR error. I'm using Tess4J for OCR.

Anton Shastun 2025-06-08T18:11:25.056359Z

Yes, an LLM works great for extracting data from unstructured content, but for instance it takes Gemini 30s to handle that task, which is too long.

raspasov 2025-06-09T02:53:44.176959Z

Maybe you can send multiple parallel requests? Not sure if there are rate limits, what’s the cost, etc.
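
A rough sketch of the parallel-requests idea using a thread pool from the Python standard library. `extract_invoice` is a hypothetical placeholder for the real LLM call, and `max_workers=4` is an arbitrary assumption that would need to respect whatever rate limits the provider applies.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_invoice(path: str) -> dict:
    """Hypothetical placeholder: replace the body with your real LLM call."""
    return {"path": path, "status": "todo"}


def extract_all(paths, max_workers=4):
    """Run extract_invoice over all paths concurrently."""
    # Threads fit here because each call mostly waits on the network.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input paths.
        return list(pool.map(extract_invoice, paths))
```

This raises throughput across a batch of invoices; it does not shorten the wait for any single invoice.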

Anton Shastun 2025-06-09T06:41:04.364749Z

@raspasov yes a good idea, will try

Anton Shastun 2025-06-09T06:46:27.413599Z

How about Python's spaCy?

Gent Krasniqi 2025-06-07T17:30:56.141559Z

Any example of what the data looks like?

Anton Shastun 2025-06-07T17:32:58.599979Z

"= &)\nINVOICE\nWOOD DECOR\nEllingten Wood Decer, 36 Terrick Rd, Ellington PE18 2NT, United Kingdom\nBILL TO\nYour client Invoice No.: 042022\n11 Besth Dr Issue date: 30i0di2022\nEllington Duse date: dfasi2022\nMES1 SEU\nUniied Kingdom Rafsrsnes: 042022\n' BESCRIPTION CUANTITY UNIT PRICE (£} AIOUNT (£}\nSampls servics 1 400.00 400.00\nSample wood decoration service\nSample servics 1 1 200.00 200.00\nSample wood decoration service 1\nTOTAL (GBP): £600.00\n[seued by, signalure:\n6&1‘}1@% Wood Decor\nEllington Weod Decer, 38 Tarrick Rd, Ellington PE18 2NT, United Kingdom  Email: smail@yourbusinessnams.co.uk\n"

Harold 2025-06-07T18:57:08.899689Z

"£600.00\n[seued by, signalure:\n6&1‘}1@%" - really? OCR errors, or? tbh, an LLM could very well be the way to go here...

raspasov 2025-06-08T01:38:01.202529Z

Probably an LLM but if the data is a mess and you need high accuracy (vs best effort), a human verification for each invoice is likely a good idea… I guess the only way to tell is to go through a few hundred items at least and check the accuracy.
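
One way to run that check, sketched under the assumption that each extracted invoice and its human-verified counterpart are plain dicts with the same keys. `field_accuracy` is a hypothetical helper, not part of any library.

```python
def field_accuracy(predicted, verified):
    """Share of exactly matching values per field across paired records."""
    if not verified or len(predicted) != len(verified):
        raise ValueError("need equally sized, non-empty lists")
    fields = list(verified[0].keys())
    correct = {field: 0 for field in fields}
    for pred, truth in zip(predicted, verified):
        for field in fields:
            # A missing field in the prediction counts as wrong.
            if field in pred and pred[field] == truth[field]:
                correct[field] += 1
    return {field: correct[field] / len(verified) for field in fields}
```

Per-field numbers show which fields (dates, totals, invoice numbers) can be trusted as-is and which still need a human look.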