I’m working on streamlining the process of ripping my Blu-ray collection. The biggest bottleneck has always been subtitles: converting from image-based PGS to text-based SRT. I usually use SubtitleEdit, which does okay with occasional mistakes. My understanding is that it combines Tesseract with a decent library to correct errors.

I’m looking for something that works from the command line and found pgs-to-srt. It also uses Tesseract, but apparently without the correction library, and the results are…not good:

Here’s the output for the first two minutes of Love Actually:

00:01:13,991 --> 00:01:16,368
DAVID: Whenever | get gloomy
with the state of the world,

2
00:01:16,451 --> 00:01:19,830
| think about
the arrivals gate
alt [Heathrow airport.

3
00:01:20,38 --> 00:01:21,415
General opinion
Started {to make oul

This is just OCR of plain text on a transparent background. How is it this bad? This is using the Tesseract “best” training data.
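
At least one of these errors looks systematic: a lone "|" where the subtitle clearly says "I". Here is a minimal sketch of a post-processing pass for that single case, assuming a standalone pipe is always a misread capital I. The name `fix_lone_pipes` is made up for illustration and is not part of pgs-to-srt or SubtitleEdit. It would not help with things like "alt [Heathrow" or "{to make oul", which need a dictionary or better input images.

```python
import re

# Assumption: a "|" standing on its own (or directly before an apostrophe,
# as in "|'m") is a misread capital "I". Pipes inside other text are left alone.
_LONE_PIPE = re.compile(r"(?<!\S)\|(?=\s|$|['’])")


def fix_lone_pipes(text: str) -> str:
    """Replace standalone '|' characters with 'I' in subtitle text."""
    return _LONE_PIPE.sub("I", text)


if __name__ == "__main__":
    sample = "DAVID: Whenever | get gloomy\nwith the state of the world,"
    print(fix_lone_pipes(sample))
```

Running this over the text lines of the SRT file would be enough, since the index and timestamp lines contain no pipes.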

  • @ch00fOP
    19 hours ago

    Found out that pgs-to-srt can export images, so you can see what it’s looking at.

    Starting to make sense why it’s so bad. Wonder if I can add a preprocessor to do something like this: