#Extract data from Excel spreadsheets
1 messages · Page 1 of 1 (latest)
The knowledge sources do not explicitly mention a single "state-of-the-art" (SOTA) method for extracting data from Excel spreadsheets with split tables, merged cells, and other complex structures. However, they do provide guidance on how OpenAI models, especially GPT-4o and the o-series reasoning models, can be leveraged for such tasks.
Key points from the sources:
-
GPT-4o is highlighted for its multimodal capabilities and contextual understanding, making it adaptable for extracting and transforming data from complex document layouts, including spreadsheets. It can handle multilingual content and various data structures, providing flexible transformation rules and enhanced insight generation. This makes GPT-4o a strong candidate for extracting data from challenging spreadsheet formats, such as those with split tables or merged cells, especially when traditional OCR or parsing tools struggle with layout complexity. The workflow often involves extracting data, transforming it to fit a schema, and then loading it into a database for analysis. The Batch API can be used for cost efficiency in large-scale workflows Data Extraction and Transformation in ELT Workflows using GPT-4o as an OCR Alternative.
-
Model selection guidance suggests starting with a fast, cost-effective model like GPT-4.1 for OCR and upgrading to a more powerful reasoning model like o3 if the forms are complex and difficult to interpret at a glance. The o3 model is noted for its ability to reason about form structure, which is relevant for handling non-standard spreadsheet layouts Practical Guide for Model Selection for Real‑World Use Cases.
- ChatGPT’s data analysis features use Python (pandas) for spreadsheet analysis, but the system defaults to code-based processing for structured data. If a spreadsheet contains complex layouts (split tables, merged cells), the model may not always interpret the structure as intended. In such cases, converting the spreadsheet to a text-based format (like PDF) before uploading can sometimes yield better results, as the model will then use text extraction and reasoning rather than code-based parsing Optimizing File Uploads in ChatGPT Enterprise.
Summary:
For extracting data from Excel spreadsheets with split tables, merged cells, and other complex layouts, the SOTA approach within the OpenAI ecosystem is to use GPT-4o or o3 models for their advanced reasoning and multimodal capabilities. If the spreadsheet is especially complex, consider converting it to a PDF for better text extraction and reasoning. For large-scale or automated workflows, the Batch API can be used. There is no mention of a single, definitive SOTA algorithm, but leveraging these models represents the current best practice according to OpenAI’s documentation.