#section outline parsing
If you really just want to fix that specific part, revise it to be far less elaborate:
title: the title of the section (long titles should be truncated)
but my point is that a generative AI model isn't a great tool for this. A Python script will handle this document outline parsing more reliably (and far more cheaply).
so in my example, I simply prompted gemini to generate that python script https://aistudio.google.com/app/prompts?state={"ids":["1VEHkDePP33eTacJMI1hA8IFHqzFEUSTs"],"action":"open","userId":"118169145925211331610","resourceKeys":{}}&usp=sharing
that kind of code generation is a task that gemini is good at
then you can take that python script from its response and run that on your computer to parse documents
as a bonus, there isn't a "token limit" or anything if you use a python script to do the parsing
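For illustration only, a script of that kind could look roughly like the sketch below. This is not the script Gemini generated in the linked prompt; the heading patterns, the `parse_outline` name, and the 40-character title cutoff are all assumptions chosen for the example, and a real document would need its own pattern list.

```python
import re

# Hypothetical heading patterns (assumptions for this sketch).
# Each entry is (outline level, regex that recognizes a heading at that level).
HEADING_PATTERNS = [
    (1, re.compile(r"^(ARTICLE|SECTION)\s+\w+", re.IGNORECASE)),
    (2, re.compile(r"^\([a-z]\)\s")),
    (3, re.compile(r"^\(\d+\)\s")),
]


def parse_outline(text, max_title_len=40):
    """Return a flat list of {title, level, start_line} dicts, in document order."""
    outline = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        for level, pattern in HEADING_PATTERNS:
            if pattern.match(stripped):
                title = stripped
                # Long titles are truncated, as the prompt instruction asks.
                if len(title) > max_title_len:
                    title = title[:max_title_len].rstrip() + "..."
                outline.append(
                    {"title": title, "level": level, "start_line": lineno}
                )
                break  # first matching pattern wins
    return outline
```

Because this runs locally on the whole file, the length of the document doesn't matter the way it does for a model's output limit.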
Thank you for the explanation and excellent example. I've actually been working on this process for a while: https://community.openai.com/t/using-gpt-4-api-to-semantically-chunk-documents/715689/95?u=somebodysysop
The model call is just one step:
- export the pdf (or whatever) document to txt.
- run code to prepend a line marker (linenoxxxx:) to every line.
- send this numbered file to model along with instructions to create hierarchy json file
- process this file with code to add end_line numbers and output that json file.
- new: also add token_count to the json file
- run code on json output to create the chunks.
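A rough sketch of the two code-only steps around the model call (numbering the lines, then filling in `end_line`) might look like this. It is not the actual pipeline code: the `line0001: ` prefix format, the function names, and the rule "a section ends on the line before the next section at the same or a shallower level" are assumptions made for the example.

```python
def number_lines(text):
    """Prefix every line with a zero-padded marker such as 'line0001: '."""
    return "\n".join(
        f"line{i:04d}: {line}"
        for i, line in enumerate(text.splitlines(), start=1)
    )


def line_no(marker):
    """Convert a marker like 'line0757' to the integer 757."""
    return int(marker[4:])


def flatten(nodes):
    """Yield nodes in document order (pre-order walk of the nested outline)."""
    for node in nodes:
        yield node
        yield from flatten(node.get("children", []))


def add_end_lines(outline, total_lines):
    """Add an 'end_line' marker to every node of the outline, in place."""
    flat = list(flatten(outline))
    for i, node in enumerate(flat):
        start = line_no(node["start_line"])
        end = total_lines  # default: the section runs to the end of the file
        for later in flat[i + 1:]:
            if later["level"] <= node["level"]:
                # Never end before the start, in case two sections share a line.
                end = max(line_no(later["start_line"]) - 1, start)
                break
        node["end_line"] = f"line{end:04d}"
    return outline
```

The `max(...)` guard is there because a child or sibling can begin on the same line as the section before it, which would otherwise produce an `end_line` smaller than the `start_line`.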
The reason I don't simply use regex is that I'm dealing with a multitude of documents, all with varied formatting. In addition, the code will be used not just for very well structured legal documents, but for religious texts, government code, and other documents.
I use the model to look at these documents and give me a unified outline format that I can then use code to process and create my final goal: an embedding json file consisting of texts that have been semantically chunked. This particular task can only be achieved by an LLM. GPT-4o and Claude Opus/Sonnet both do better jobs with the titles (in addition to the outline itself), but are each limited to 4K output.
So, if I'm going to make this work, I'm stuck with Gemini 1.5, Flash or Pro.
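To make the final step concrete, here is a minimal sketch of turning the finished hierarchy JSON into chunks for embedding. It assumes the node shape shown elsewhere in this thread (`start_line`, `end_line`, `children`), and it uses a whitespace word count as a stand-in for a real tokenizer; `build_chunks` and `count_tokens` are hypothetical names, not the actual pipeline code.

```python
def line_no(marker):
    """Convert a marker like 'line0757' to the integer 757."""
    return int(marker[4:])


def build_chunks(outline, lines, count_tokens=lambda s: len(s.split())):
    """Return one chunk per leaf section of the outline.

    outline: nested list of nodes with start_line, end_line and children.
    lines: the original (unnumbered) document as a list of lines.
    count_tokens: stand-in token counter; swap in a real tokenizer as needed.
    """
    chunks = []
    for node in outline:
        children = node.get("children", [])
        if children:
            # Simplification: text a parent holds before its first child is
            # not emitted as its own chunk in this sketch.
            chunks.extend(build_chunks(children, lines, count_tokens))
        else:
            start = line_no(node["start_line"])
            end = line_no(node["end_line"])
            text = "\n".join(lines[start - 1:end])
            chunks.append({
                "title": node["title"],
                "start_line": node["start_line"],
                "end_line": node["end_line"],
                "text": text,
                "token_count": count_tokens(text),
            })
    return chunks
```

Chunking on leaf sections keeps each chunk aligned with a unit the outline already identified, rather than cutting at a fixed character count.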
Oh so this is a bit of a bigger project, gotcha.
This might be a dumb question, but why not just use a service like GCP DocumentAI or AWS Textract? They do exactly this and are way more robust than anything else out there
(confession: even as a GCP simp, I actually like Textract more than DocumentAI)
I don't know GCP DocumentAI, but am very familiar with AWS Textract. Had a meeting last week with AWS staff on this very topic. First off, their new "layout-aware" extraction features only work with their python libraries -- and even then, they still only get me a slightly better version of what I am getting now extracting with Solr tika and pdfToText. Secondly, as I am looking to implement this in an embedding pipeline, it is WAY too slow.
Finally, I've already built the system to do exactly what I want. The only problem, so far, is the title issue.
Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. AnalyzeDocument Layout is a new feature that allows customers to automatically extract layout elements such as paragraphs, titles, subtitles, headers, footers, and more from documents. Layout extends Amazon Textr...
yeah Textract is definitely not a mid-pipeline step (nor is documentAI) unless you're doing batch async stuff
heads up that line0112, it isn't properly handling the case of a child section beginning on the same line as its parent
Yes, thanks for pointing that out. This is where the model begins to lose the hierarchical structure. I have to figure out how to prompt it to handle these cases.
I must thank you again for looking into this. The model was incorrectly setting the levels after the line you pointed out. So, the last segment of the document (g) ended up being identified as level 1 when it should have been level 2.
{ "title": "(g) If any other Union or Guild...", "level": 1, "start_line": "line0757", "has_children": "N", "children": [], "end_line": "line0759", "token_count": 39 }
I was just going to live with it as the title problem was more concerning. However, as a result of your post, I realized how to modify the prompt to be aware of this, and voila!:
{ "title": "(g) If any other Union or Guild...", "level": 2, "start_line": "line0757", "has_children": "N", "children": [], "end_line": "line0759", "token_count": 39 }
I'm sure there are other inconsistencies, but for my use case, this was the biggest. Much appreciated!
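One cheap way to catch this kind of level drift in code, rather than by eye, is to check that sections with the same marker style all sit at the same level. The sketch below is only a heuristic under stated assumptions: the two marker styles and the `inconsistent_levels` name are made up for the example, and a document that legitimately reuses a marker style at different depths would be flagged too.

```python
import re
from collections import defaultdict

# Hypothetical marker styles (assumptions for this sketch).
MARKER_STYLES = {
    "lower_paren": re.compile(r"^\([a-z]\)"),  # (a), (b), ... (g)
    "digit_paren": re.compile(r"^\(\d+\)"),    # (1), (2), ...
}


def flatten(nodes):
    """Yield nodes in document order (pre-order walk of the nested outline)."""
    for node in nodes:
        yield node
        yield from flatten(node.get("children", []))


def inconsistent_levels(outline):
    """Return {style: [levels]} for marker styles seen at more than one level."""
    levels_by_style = defaultdict(set)
    for node in flatten(outline):
        title = node["title"].strip()
        for style, pattern in MARKER_STYLES.items():
            if pattern.match(title):
                levels_by_style[style].add(node["level"])
                break
    return {
        style: sorted(levels)
        for style, levels in levels_by_style.items()
        if len(levels) > 1
    }
```

An empty result does not prove the outline is right; it only means this particular check found nothing to flag.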
I didn't notice anything else specific, (same thing happens at line 0645: (d) (1) tho)
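Lines like that `(d) (1)` case can be located ahead of time with a small script, so the prompt (or a post-check) can treat them specially. This is a minimal sketch that assumes the parent marker is a parenthesized lowercase letter and the child marker a parenthesized number; other numbering schemes would need other patterns, and `find_same_line_children` is a hypothetical name.

```python
import re

# Assumed shape: a lowercase-letter marker immediately followed by a
# digit marker at the start of the line, e.g. "(d) (1) ...".
SAME_LINE = re.compile(r"^\s*(\([a-z]\))\s+(\(\d+\))")


def find_same_line_children(text):
    """Return (line_number, parent_marker, child_marker) for each such line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = SAME_LINE.match(line)
        if match:
            hits.append((lineno, match.group(1), match.group(2)))
    return hits
```

Run on the unnumbered text, the returned line numbers line up with the numeric part of the line markers, provided both use 1-based numbering.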