I'm trying to extract the contents of an insurance policy PDF. The PDF contains both text and tables.
I used:

    pdf_ocr_response = self.client.ocr.process(
        model="mistral-ocr-latest",
        document={
            "type": "document_url",
            "document_url": signed_url.url
        },
        include_image_base64=True
    )
The response is:

    {
      "pages": [
        {
          "index": 0,
          "markdown": "\n\n## EXCLUSIONS\n\nThere is no coverage provided by this Policy while the following individual(s) operate a motor vehicle:\nNone\nAdvantage\nOther Premium Bearing: User initiated endorsement",
          "images": [
            {
              "id": "img-0.jpeg",
              "top_left_x": 31,
              "top_left_y": 72,
              "bottom_right_x": 577,
              "bottom_right_y": 659,
              "image_base64": null
            }
          ],
          "dimensions": {
            "dpi": 200,
            "height": 792,
            "width": 612
          }
        }
      ],
      "model": "mistral-ocr-2503-completion",
      "usage_info": {
        "pages_processed": 1,
        "doc_size_bytes": 266317
      }
    }
When I used chat instead, it produced the correct answer.
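
To narrow this down, here is a minimal sketch that checks how much of each page the OCR response reports as an image region rather than as text. It works on the response as a plain dict (the JSON above) and does not call the Mistral SDK. The helper names `image_coverage` and `pages_mostly_image` and the 0.5 threshold are my own, hypothetical choices. It also assumes the image coordinates and the page `dimensions` are in the same units, which the numbers above suggest but which I have not confirmed.

```python
# Hypothetical helpers, not part of the Mistral SDK. They operate on the
# OCR response converted to a plain dict, as in the JSON shown above.

def image_coverage(page):
    """Return the fraction of the page area covered by reported image regions.

    Assumes image coordinates use the same units as page["dimensions"].
    Overlapping regions are counted more than once, so the value can exceed 1.
    """
    dims = page["dimensions"]
    page_area = dims["width"] * dims["height"]
    if page_area <= 0:
        return 0.0
    covered = 0
    for img in page.get("images", []):
        width = max(0, img["bottom_right_x"] - img["top_left_x"])
        height = max(0, img["bottom_right_y"] - img["top_left_y"])
        covered += width * height
    return covered / page_area


def pages_mostly_image(response, threshold=0.5):
    """Return the indexes of pages whose image coverage reaches the threshold."""
    return [
        page["index"]
        for page in response["pages"]
        if image_coverage(page) >= threshold
    ]


# The single page from the response above, reduced to the fields used here.
response = {
    "pages": [
        {
            "index": 0,
            "images": [
                {
                    "id": "img-0.jpeg",
                    "top_left_x": 31,
                    "top_left_y": 72,
                    "bottom_right_x": 577,
                    "bottom_right_y": 659,
                    "image_base64": None,
                }
            ],
            "dimensions": {"dpi": 200, "height": 792, "width": 612},
        }
    ]
}

print(pages_mostly_image(response))  # [0]
```

For this response the single image region covers roughly two thirds of the page, so most of the page appears to have been returned as one image (with `image_base64` set to null) instead of as markdown text.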