xkcd #743: Infrastructures

Jakylla · edited 5 months ago

@[email protected] · 1 year ago

The OCR struggles with some PDFs for whatever reason: font, formatting, etc.

There are third-party PDF OCR websites/programs that work better. If I’m having issues, I run the file through one of those first.

@[email protected] · 1 year ago

Any suggestions? Even the good ones had error rates that might not matter for a couple of pages, but when scaled to a 500-page book, even a 1% error rate results in an annoying number of typos.
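To put rough numbers on that, here is a back-of-the-envelope sketch. It treats the 1% as a per-character error rate, and the 1,800 characters per page is an assumed figure (not from this thread) that varies a lot by book.

```python
def expected_ocr_errors(pages: int, chars_per_page: int, error_rate: float) -> float:
    """Expected number of misread characters, assuming each character
    is misread independently with probability `error_rate`."""
    return pages * chars_per_page * error_rate


# Assumption: ~1,800 characters per page (hypothetical, adjust for your book).
PAGES = 500
CHARS_PER_PAGE = 1800

total = expected_ocr_errors(PAGES, CHARS_PER_PAGE, error_rate=0.01)
per_page = total / PAGES
print(f"{total:.0f} expected errors, about {per_page:.0f} per page")
```

Under those assumptions that is thousands of misread characters across the book, so even a much lower error rate still leaves plenty to proofread.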

@[email protected] · 1 year ago

I use gImageReader + Tesseract, but that probably doesn’t meet your criteria. Unfortunately, OCR is very rarely perfect unless the input is perfectly clear and uses an “OCR-friendly” font and formatting. There are “AI-powered” OCR tools out there, but I can’t speak to how well they work, and I don’t know of any free ones.
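If you would rather script Tesseract than go through a GUI, a minimal sketch might look like the following. It assumes the `tesseract` command-line tool is installed and on your PATH, and that the pages have already been exported as images (Tesseract reads images, not PDFs). The helper names and the file name in the usage comment are hypothetical.

```python
import shutil
import subprocess
from typing import List


def build_tesseract_cmd(image_path: str, lang: str = "eng", psm: int = 3) -> List[str]:
    """Build a Tesseract CLI command that writes recognized text to stdout.

    `-l` selects the language model; `--psm 3` is fully automatic
    page segmentation, which is Tesseract's default mode.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on a single image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout


# Usage (hypothetical file name):
#   text = ocr_image("page_001.png")
```

This won't improve accuracy on its own, but it makes it easier to batch a whole book and compare settings such as the page segmentation mode.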