Extracting plain text isn’t that much of a problem, relatively speaking. It’s in... | Hacker News


layer8 on Feb 5, 2025 | on: Ingesting PDFs and why Gemini 2.0 changes everythi...

Extracting plain text isn’t that much of a problem, relatively speaking. It’s interpreting more complex elements like nested lists, tables, side bars, footnotes/endnotes, cross-references, images and diagrams where things get challenging.

visarga on Feb 6, 2025

OCR is not 100% accurate either. Reading order is also fragile: it might recognize each word correctly but still mess up the line structure.
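To illustrate the reading-order point, here is a minimal sketch of a naive line-grouping heuristic. It assumes the OCR step returns word boxes as `(text, x_left, y_top, height)` tuples; that tuple layout, the `group_into_lines` name, and the `tolerance` parameter are all hypothetical and not taken from any particular OCR library.

```python
from typing import List, Tuple

# Hypothetical word box: (text, x_left, y_top, height), in page pixels.
Word = Tuple[str, float, float, float]


def group_into_lines(words: List[Word], tolerance: float = 0.5) -> List[str]:
    """Group word boxes into lines by vertical position, then read each line left to right."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):  # scan top to bottom
        _, _, y, h = word
        if lines:
            # Compare against the first word of the most recent line.
            _, _, ref_y, ref_h = lines[-1][0]
            if abs(y - ref_y) <= tolerance * max(h, ref_h):
                lines[-1].append(word)
                continue
        lines.append([word])
    # Within each line, order words by their left edge.
    return [" ".join(w[0] for w in sorted(line, key=lambda w: w[1])) for line in lines]


# Single-column text: slight vertical jitter is absorbed by the tolerance.
single = [("world", 60, 11, 10), ("Hello", 10, 10, 10),
          ("Second", 10, 30, 10), ("line", 70, 29, 10)]
single_lines = group_into_lines(single)

# Two-column page: words at the same height in different columns get merged
# into one line, even though every word was recognized correctly.
two_col = [("Left", 10, 10, 10), ("Right", 300, 10, 10),
           ("column", 10, 30, 10), ("column.", 300, 30, 10)]
two_col_lines = group_into_lines(two_col)
```

On the two-column input this heuristic interleaves the columns, which is the kind of failure described above: the words are all right, but the line structure is wrong. Handling that properly needs layout analysis (column detection, block segmentation), not just coordinate sorting.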
