Hi
I work as a Machine Learning Engineer at a startup called Pibit.ai (YC21), where we provide information extraction services for a variety of document types, including bills, policies, loss runs, and more.
Our current solution is a hybrid of computer vision and heuristics that turns model detections into structured output, but we are also conducting R&D on how well GPT models can produce structured output from an image that contains tables and sections with key-value pairs.
So far, we have seen that, given the raw OCR text for each page, GPT-3.5 extracts with very good accuracy when the tables are simple, with a one-to-one relation between header and cell. However, there are some cases where a single column contains multiple rows, along with multiple corresponding cells, so we are experimenting with feeding in the document layout for better information extraction (mainly in JSON format).
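To make the layout idea concrete, here is a minimal sketch of one way it could be done, not our actual pipeline. It assumes the OCR engine returns words with pixel bounding boxes as `[x0, y0, x1, y1]`, and that the layout is passed to the model as JSON. The function names (`group_words_into_lines`, `layout_to_json`), the `y_tolerance` parameter, and the sample data are all hypothetical.

```python
import json


def group_words_into_lines(words, y_tolerance=5):
    """Cluster OCR words into text lines by vertical centre (assumed heuristic)."""
    def y_center(word):
        return (word["bbox"][1] + word["bbox"][3]) / 2

    lines = []
    for word in sorted(words, key=y_center):
        # Start a new line when the word sits too far below the current line.
        if lines and abs(y_center(word) - lines[-1]["y_center"]) <= y_tolerance:
            lines[-1]["words"].append(word)
        else:
            lines.append({"y_center": y_center(word), "words": [word]})
    for line in lines:
        # Restore reading order within each line.
        line["words"].sort(key=lambda w: w["bbox"][0])
    return lines


def layout_to_json(words, y_tolerance=5):
    """Serialize lines with horizontal positions so column alignment is preserved."""
    lines = group_words_into_lines(words, y_tolerance)
    payload = [
        {
            "line": index,
            "cells": [
                {"text": w["text"], "x0": w["bbox"][0], "x1": w["bbox"][2]}
                for w in line["words"]
            ],
        }
        for index, line in enumerate(lines)
    ]
    return json.dumps(payload)


# Hypothetical OCR output: one policy row whose "Amount" column holds two values.
ocr_words = [
    {"text": "Amount", "bbox": [200, 10, 260, 22]},
    {"text": "Policy", "bbox": [10, 11, 60, 23]},
    {"text": "A-100", "bbox": [10, 40, 55, 52]},
    {"text": "$500", "bbox": [200, 41, 240, 53]},
    {"text": "$125", "bbox": [200, 58, 240, 70]},
]

layout_json = layout_to_json(ocr_words)
print(layout_json)
```

The resulting string could be placed in the prompt instead of the raw OCR text. Keeping `x0` and `x1` for every cell is meant to let the model see that the second amount lines up with the "Amount" header even though its line has no policy number.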
If you have any suggestions, please let us know!