Is OCR on PGS subtitle files always this bad?

@ch00f · edit-2 4 hours ago

Is OCR on PGS subtitle files always this bad?

@j4k3 · 3 hours ago

I’ve never had great results with Tesseract when the image is compressed, so a mixed background sounds like a nightmare. There is probably some JavaScript stream in there, but good luck accessing it. Blu-ray is hot garbage for a standard.

@ch00f · 3 hours ago

That’s the thing. There isn’t a background. The PGS layer is separate, which is why it’s so surprising that the error rate is so high.

@j4k3 · 3 hours ago

OCR 5 from F-Droid was really good for me 2+ years ago, but when I tried it more recently it was garbage. It really stood out to me back then because around 5 years ago I tried translating a Chinese datasheet for one of the Atmel clone microcontrollers, and OCR was not fun then.

Maybe have a look at Hugging Face Spaces and see if anyone has a better methodology set up as an example. Or look at the history of the models and see if one of the older ones is still available.

@ch00f · 3 hours ago

I think I spoke too soon when I said the text didn’t have a background or was otherwise clean… SubtitleEdit always shows it on a white background, but it looks like the text itself actually has a white border, which I’m sure is confusing the OCR. See my other comment for examples.

I’m going to start by seeing if I can clean up the text, and if not, I’ll look into Hugging Face and whatnot. Thanks for the tips.
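
For what it’s worth, here is a rough sketch of what that cleanup step might look like in Python. It assumes each subtitle bitmap has already been exported as rows of RGBA pixels, that the border is near-white, and that the text fill is noticeably darker. None of that is confirmed for these files, and the function name and both thresholds are made up for illustration. The idea is to drop transparent and near-white pixels so only the fill survives as black ink on a white page.

```python
def binarize_subtitle(pixels, alpha_min=128, white_min=200):
    """Turn RGBA subtitle pixels into a black-on-white grid for OCR.

    pixels: list of rows, each row a list of (r, g, b, a) tuples (0-255).
    Returns a list of rows of 0 (ink) or 255 (paper).

    Assumption: the outline is near-white and the fill is darker.
    alpha_min and white_min are illustrative guesses, not tuned values.
    """
    out = []
    for row in pixels:
        out_row = []
        for r, g, b, a in row:
            # Perceptual brightness (ITU-R BT.601 luma weights).
            luma = 0.299 * r + 0.587 * g + 0.114 * b
            # Ink = mostly opaque and not bright enough to be the border.
            is_ink = a >= alpha_min and luma < white_min
            out_row.append(0 if is_ink else 255)
        out.append(out_row)
    return out


# Tiny example row: a transparent pixel, a white border pixel, a dark fill pixel.
sample = [[(0, 0, 0, 0), (255, 255, 255, 255), (40, 40, 40, 255)]]
cleaned = binarize_subtitle(sample)
```

If the fill turns out to be the light part and the border the dark part, the brightness comparison would need to be flipped. The resulting grid can be saved as a grayscale image and handed to whatever OCR engine is in use.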