Is OCR on PGS subtitle files always this bad? (lemmy.world)

submitted 5 months ago* (last edited 4 months ago) by ch00f@lemmy.world to c/techsupport@lemmy.world

I'm trying to streamline the process of ripping my Blu-ray collection. The biggest bottleneck has always been dealing with subtitles and converting them from image-based PGS to text-based SRT. I usually use SubtitleEdit, which does okay with occasional mistakes. My understanding is that it combines Tesseract with a decent library to correct errors.

I'm trying to find something that works on the command line and found pgs-to-srt. It also uses Tesseract, but it appears that without the correction library, the results are... not good:

Here's the first two minutes of Love, Actually:

00:01:13,991 --> 00:01:16,368
DAVID: Whenever | get gloomy
with the state of the world,

2
00:01:16,451 --> 00:01:19,830
| think about
the arrivals gate
alt [Heathrow airport.

3
00:01:20,38 --> 00:01:21,415
General opinion
Started {to make oul

This is just OCR of plain text on a transparent background. How is it this bad? This is using the Tesseract "best" training data.
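
For illustration, here is a minimal sketch of the kind of post-OCR cleanup a correction library might apply. It is not SubtitleEdit's actual logic; the function name and the single rule are assumptions based only on the errors visible in the sample above, where a standalone "|" appears where a capital "I" was expected.

```python
import re


def fix_common_ocr_errors(text: str) -> str:
    """Apply one hypothetical cleanup rule to OCR'd subtitle text."""
    # A "|" that stands alone as a word (or is followed by an apostrophe,
    # as in "|'m") is most likely a misread capital "I".
    # Pipes inside other tokens, such as "a|b", are left untouched.
    return re.sub(r"(?<!\S)\|(?=\s|'|$)", "I", text)


if __name__ == "__main__":
    print(fix_common_ocr_errors("DAVID: Whenever | get gloomy"))
    print(fix_common_ocr_errors("| think about\nthe arrivals gate"))
```

A single regex like this only covers the most obvious substitution. Errors such as "alt [Heathrow" would probably need a dictionary or word-list check to repair safely.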

Edit: I’ve been playing around with pgs-to-srt, which also uses Tesseract, and discovered that black outlines on the subtitles really mess with it. I made some improvements.

https://github.com/wydengyre/pgs-to-srt/pull/348
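
To show the general idea of dealing with black outlines (this is not the code from the pull request), here is a rough sketch of one possible preprocessing step. It assumes the subtitle text has a light fill with a dark outline on a transparent background, and it turns each frame into plain black text on a white background, which Tesseract generally handles better. The function name, the pixel representation, and both thresholds are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int, int]  # (red, green, blue, alpha), each 0-255


def binarize_subtitle(
    pixels: List[List[Pixel]],
    min_alpha: int = 128,  # assumed cutoff: below this a pixel counts as transparent
    min_luma: int = 128,   # assumed cutoff: below this a pixel counts as dark outline
) -> List[List[int]]:
    """Return a grayscale grid: 0 (black) for text fill, 255 (white) otherwise."""
    result = []
    for row in pixels:
        out_row = []
        for r, g, b, a in row:
            # Integer approximation of perceived brightness (ITU-R BT.601 weights).
            luma = (299 * r + 587 * g + 114 * b) // 1000
            # Keep only opaque, bright pixels; outline and background both become white.
            is_fill = a >= min_alpha and luma >= min_luma
            out_row.append(0 if is_fill else 255)
        result.append(out_row)
    return result
```

The resulting grid could be written out as a grayscale image with whatever imaging library is at hand before passing it to Tesseract. Colored or dim subtitle fills would need different thresholds.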

[-] j4k3@lemmy.world 0 points 5 months ago

I've never had great results with Tesseract if the image has compression, so the mixed background sounds like a nightmare. There is probably some JavaScript stream in there, but good luck accessing it. BR is hot garbage for a standard.

[-] ch00f@lemmy.world 0 points 5 months ago

That's the thing. There isn't a background. The PGS layer is separate, which is why it's so surprising that the error rate is so high.

[-] j4k3@lemmy.world 1 points 5 months ago* (last edited 5 months ago)

OCR 5 from F-Droid was really good for me like 2+ years ago, but when I tried it more recently it was garbage. It really stood out to me around 2 years ago because around 5 years ago I tried translating a Chinese datasheet for one of the Atmel uC clones and OCR was not fun then.

Maybe have a look at Hugging Face Spaces and see if anyone has a better methodology set up as an example. Or look at the history of the models and see if one of the older ones is still available.

this post was submitted on 23 Feb 2025
