Hi, I'm trying to handle invoice data extraction. Could you please suggest the easiest way?
Yes, it is fairly straightforward to have the outputs be 'fill-in-the-blank' in JSON or other formats (SQL). LLM MCP may be worth looking into as well. 30s latency is typical in our experience too. How many such invoices are you processing? Are new ones coming in all the time?
I have zero experience with that.
If you’re just going to call one of the REST APIs of OpenAI, Grok, etc., I'm not sure what any library can provide on top of that. They already have APIs to return structured data (JSON, etc.) in a specified format, I think.
I guess if you find it improves accuracy, it can be useful… I feel like everything is trial-and-error these days for such tasks. Depends on your use-case, specific data, etc.
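To make the 'fill-in-the-blank' / structured-output idea a bit more concrete, here is a minimal sketch in plain Python (standard library only). The field names in `INVOICE_FIELDS`, the prompt wording, and `sample_reply` are all made up for illustration; the actual API call is left out on purpose, since it depends on which provider you pick.

```python
import json

# Hypothetical field list for an invoice; adjust to your own data.
INVOICE_FIELDS = {
    "invoice_number": str,
    "issue_date": str,
    "total": float,
    "currency": str,
}


def build_prompt(ocr_text: str) -> str:
    """Ask the model to fill in a fixed JSON template and nothing else."""
    template = {name: None for name in INVOICE_FIELDS}
    return (
        "Extract the following fields from the invoice text. "
        "Reply with JSON only, using null for anything you cannot find.\n"
        f"Template: {json.dumps(template)}\n"
        f"Invoice text:\n{ocr_text}"
    )


def parse_reply(reply: str) -> dict:
    """Parse the model's reply and check each field against the expected type."""
    data = json.loads(reply)
    result = {}
    for name, expected_type in INVOICE_FIELDS.items():
        value = data.get(name)
        if value is not None:
            # JSON has no separate float type, so accept whole numbers for float fields.
            if expected_type is float and isinstance(value, int) and not isinstance(value, bool):
                value = float(value)
            if not isinstance(value, expected_type):
                raise ValueError(
                    f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
                )
        result[name] = value
    return result


# Made-up reply standing in for whatever the model returns.
sample_reply = '{"invoice_number": "042022", "issue_date": null, "total": 600, "currency": "GBP"}'
parsed = parse_reply(sample_reply)
```

Leaving unreadable fields as null (instead of letting the model guess) makes it easier to route those invoices to a human later.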
@hhausman currently with gpt-4.1-nano it takes 6s to process an invoice image and 3s for a PDF
that works well for our case
@hhausman yes, I think so, OCR errors. I'm using Tess4J for OCR
Yes, an LLM works great for extracting data from unstructured content, but for instance it takes 30s for Gemini to handle that task, which is too long
Maybe you can send multiple parallel requests? Not sure if there are rate limits, what’s the cost, etc.
@raspasov yes a good idea, will try
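A rough sketch of the parallel-requests idea, assuming each invoice is one blocking API call. `extract_invoice` here is a hypothetical placeholder, not a real client function, and the `max_workers` value is arbitrary; you would tune it against whatever rate limits your provider has.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_invoice(path: str) -> dict:
    """Hypothetical placeholder for one blocking API call per invoice."""
    return {"path": path, "status": "ok"}


def extract_all(paths, max_workers=4):
    # Threads fit here because the time is spent waiting on the network.
    # max_workers caps how many requests are in flight at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input paths.
        return list(pool.map(extract_invoice, paths))


results = extract_all(["a.pdf", "b.png", "c.pdf"])
```

This does not make a single invoice any faster; it only improves throughput when many invoices are waiting.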
How about Python spaCy?
Any example of what the data looks like?
"= &)\nINVOICE\nWOOD DECOR\nEllingten Wood Decer, 36 Terrick Rd, Ellington PE18 2NT, United Kingdom\nBILL TO\nYour client Invoice No.: 042022\n11 Besth Dr Issue date: 30i0di2022\nEllington Duse date: dfasi2022\nMES1 SEU\nUniied Kingdom Rafsrsnes: 042022\n' BESCRIPTION CUANTITY UNIT PRICE (£} AIOUNT (£}\nSampls servics 1 400.00 400.00\nSample wood decoration service\nSample servics 1 1 200.00 200.00\nSample wood decoration service 1\nTOTAL (GBP): £600.00\n[seued by, signalure:\n6&1‘}1@% Wood Decor\nEllington Weod Decer, 38 Tarrick Rd, Ellington PE18 2NT, United Kingdom Email: smail@yourbusinessnams.co.uk\n""£600.00\n[seued by, signalure:\n6&1‘}1@%" - really?
OCR errors, or?
tbh, an LLM could very well be the way to go here...
Probably an LLM, but if the data is a mess and you need high accuracy (vs. best effort), human verification of each invoice is likely a good idea… I guess the only way to tell is to go through at least a few hundred items and check the accuracy.
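For the "go through a few hundred items and check" part, something like this could tally per-field accuracy once you have human-verified values to compare against. It is only a sketch: the field names and the two sample records are invented, and exact equality is a deliberately strict way to compare.

```python
def field_accuracy(extracted, verified):
    """Per field, the share of invoices where the extracted value equals the verified one."""
    counts = {}
    for ext, ref in zip(extracted, verified):
        for field, expected in ref.items():
            hits, total = counts.get(field, (0, 0))
            counts[field] = (hits + (ext.get(field) == expected), total + 1)
    return {field: hits / total for field, (hits, total) in counts.items()}


# Invented sample data: the second invoice number has an OCR-style O/0 mix-up.
extracted = [
    {"invoice_number": "042022", "total": 600.0},
    {"invoice_number": "O43022", "total": 120.0},
]
verified = [
    {"invoice_number": "042022", "total": 600.0},
    {"invoice_number": "043022", "total": 120.0},
]
accuracy = field_accuracy(extracted, verified)
```

Looking at accuracy per field rather than per invoice shows which fields are safe to trust and which ones still need a human check.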