techsupport • English • 10 hours ago

Is OCR on PGS subtitle files always this bad?

I’m trying to streamline the process of ripping my Blu-ray collection. The biggest bottleneck has always been dealing with subtitles and converting from image-based PGS to text-based SRT. I usually use SubtitleEdit, which does okay, with occasional mistakes. My understanding is that it combines Tesseract with a decent library to correct errors.

I’m trying to find something that works on the command line and found pgs-to-srt. It also uses Tesseract, but it appears that without the correction library, the results are… not good:

Here’s the first two minutes of Love, Actually:

00:01:13,991 --> 00:01:16,368
DAVID: Whenever | get gloomy
with the state of the world,

2
00:01:16,451 --> 00:01:19,830
| think about
the arrivals gate
alt [Heathrow airport.

3
00:01:20,38 --> 00:01:21,415
General opinion
Started {to make oul

This is just OCR of plain text on a transparent background. How is it this bad? This is using the Tesseract “best” training data.
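For what it's worth, the stray `|` in the output above looks like a misread capital "I", which is the kind of error a correction pass can catch after OCR. Below is a minimal sketch of such a pass. The helper name `fix_lone_pipes` is hypothetical and is not part of pgs-to-srt, SubtitleEdit, or Tesseract; it assumes that a `|` standing alone between spaces is always meant to be "I", which will not hold for every subtitle file.

```python
import re

# Hypothetical post-OCR fix, not taken from pgs-to-srt or SubtitleEdit.
# Assumption: a "|" with whitespace (or a line edge) on both sides is a
# misread capital "I". A "|" that touches other characters is left alone.
_LONE_PIPE = re.compile(r"(?<!\S)\|(?!\S)")


def fix_lone_pipes(text: str) -> str:
    """Replace every standalone '|' in the text with 'I'."""
    return _LONE_PIPE.sub("I", text)


if __name__ == "__main__":
    sample = "DAVID: Whenever | get gloomy\nwith the state of the world,"
    print(fix_lone_pipes(sample))
```

This only addresses one error class. Garbage such as `alt [Heathrow` or `{to make oul` suggests the input image itself is confusing the recognizer, which a text-level fix cannot repair.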

@ch00f (OP) • 9 hours ago
Found out that pgs-to-srt can export images, so you can see what it’s looking at.

Starting to make sense why it’s so bad. Wonder if I can add a preprocessor to do something like this:
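One possible shape for such a preprocessor, as a rough sketch only: it assumes the exported images are light text with a dark outline on a transparent background, and it rewrites them as black text on a solid white canvas with a small margin, which is the kind of input Tesseract generally handles better. The function name, the threshold of 128, and the 10-pixel border are illustrative assumptions, not values taken from pgs-to-srt. To keep the sketch dependency-free it works on rows of RGBA tuples; loading and saving the actual PNG files would need an imaging library.

```python
from typing import List, Tuple

RGBA = Tuple[int, int, int, int]


def flatten_and_binarize(
    pixels: List[List[RGBA]],
    threshold: int = 128,  # assumed cut-off for both alpha and brightness
    border: int = 10,      # assumed white margin, in pixels
) -> List[List[int]]:
    """Turn an RGBA subtitle bitmap into black-on-white grayscale.

    Returns rows of 0 (black) / 255 (white) values with a white margin
    added on all four sides.
    """
    body: List[List[int]] = []
    for row in pixels:
        new_row = []
        for r, g, b, a in row:
            # Approximate brightness of the pixel colour (BT.601 weights).
            luma = 0.299 * r + 0.587 * g + 0.114 * b
            # Assumption: opaque, bright pixels are the glyph fill.
            # Transparent pixels and the dark outline become background.
            is_text = a >= threshold and luma >= threshold
            new_row.append(0 if is_text else 255)
        body.append(new_row)

    width = len(body[0]) if body else 0
    blank = [255] * (width + 2 * border)

    result = [list(blank) for _ in range(border)]
    for row in body:
        result.append([255] * border + row + [255] * border)
    result.extend(list(blank) for _ in range(border))
    return result


if __name__ == "__main__":
    # One row: white opaque (text), black opaque (outline), transparent.
    tiny = [[(255, 255, 255, 255), (0, 0, 0, 255), (0, 0, 0, 0)]]
    for line in flatten_and_binarize(tiny, border=1):
        print(line)
```

If the subtitle colours differ (for example yellow text, or a semi-transparent box behind it), the brightness test would need adjusting, so it is worth checking a few exported images before settling on a threshold.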
