Automatic Subtitle Dubbing on YouTube Using Computer Vision | By Vadim Besten | October, 2022

Step-by-Step Python Guide to Help You Access More Content

Image from Shutterstock

About half a year ago, I came across an extension with an almost identical title to this article. I was curious about the idea and wanted to build something similar, but using computer vision, and this is what I came up with:

The project is based on three services: the first is responsible for detecting subtitles and converting the image with text into machine-readable text, the second translates the text (currently only from English to Russian), and the third handles visualization and dubbing of the text. Communication between the services is implemented using the ZeroMQ library.
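As a rough illustration of the glue between the services, two of them could exchange recognized text like this (a minimal pyzmq sketch; the PUSH/PULL socket types and the inproc endpoint are my assumptions, not the project's actual code):

```python
import zmq

# one shared context; an inproc endpoint only works within a single process
ctx = zmq.Context()

receiver = ctx.socket(zmq.PULL)   # e.g. the translation service
receiver.bind("inproc://subtitles")

sender = ctx.socket(zmq.PUSH)     # e.g. the OCR service
sender.connect("inproc://subtitles")

sender.send_string("hello my name is julia abelson")
message = receiver.recv_string()  # the recognized text arrives here
```

In the real project the services run as separate processes, so a TCP endpoint would replace inproc, but the send/receive pattern stays the same.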

Data Collection

The solution to almost any ML task begins with data collection, and this one was no exception: I needed to build a dataset of YouTube video screenshots with subtitles and annotate bounding boxes around the subtitles themselves. Here are the basic requirements for the dataset:

1) Subtitles should be in different languages.

2) The video frame must come in varying sizes, i.e., the video can be full screen or occupy only part of the screen.

3) Subtitles should be in different areas of the screen.

4) Requirements for the subtitles themselves:

  • can be different sizes
  • font type: normal and proportional sans serif
  • font color: white
  • background color: black
  • background transparency: 75%
  • window transparency: 0

The original dataset can be found on Kaggle.
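For training a detector like the one used below, each screenshot is typically paired with a text label file in YOLO format: one line per subtitle box, with the class id followed by the box center, width, and height, all normalized to the image size. The numbers below are made up for illustration only:

```
0 0.500 0.912 0.760 0.058
```

Here class 0 is the subtitle, centered near the bottom of the frame and spanning about three quarters of its width.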

subtitle detection

After collecting the dataset, I needed to train a model to find subtitles so they could be detected in real time. I chose YOLOv5, and after 50 epochs of fine-tuning the pre-trained model, I got some pretty impressive metrics:

And the detection in real time works like this:
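To give a feel for this stage, here is a minimal sketch of turning detections into subtitle crops. The function only needs raw (x1, y1, x2, y2, confidence) boxes; the commented-out model loading (torch.hub with hypothetical 'subtitles.pt' weights) shows where those boxes would come from in the real pipeline:

```python
import numpy as np

def crop_detections(frame, boxes, min_conf=0.5):
    """Cut out every detected subtitle region above the confidence threshold."""
    crops = []
    for x1, y1, x2, y2, conf in boxes:
        if conf >= min_conf:
            crops.append(frame[int(y1):int(y2), int(x1):int(x2)])
    return crops

# In the real pipeline the boxes come from the trained detector, e.g.:
#   model = torch.hub.load('ultralytics/yolov5', 'custom', path='subtitles.pt')
#   boxes = model(frame).xyxy[0][:, :5].tolist()
```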

Preparing Images for Optical Character Recognition

For most OCR libraries, which we'll cover in the next section, it's better to pass a binarized grayscale image, which is why I do this first:

import cv2

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 1))
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# binarize: pixels brighter than 200 become white, everything else black
_, thresh_img = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)
# morphological opening removes isolated specks of noise
transformation_image = cv2.morphologyEx(thresh_img, cv2.MORPH_OPEN,
                                        kernel, iterations=1)

The result looks something like this:

hello my name is julia abelson and this

When the image is ready, you might think it can be sent straight to an OCR library, but that is not the case. The thing is that YOLO produces about ten frames per second, so we get many images containing the same text, and why translate and dub the same text multiple times?

OpenCV has a wonderful function called matchTemplate(), which can be seen as a very simple form of object detection. Using template matching, we can locate objects in an input image using a “template” that contains the object we want to detect:

To find the template in the original image, we move the template from left to right and top to bottom on the original image:

This function returns a score between 0 and 1 indicating how similar the two images are to each other. Using this number, we can assume that if it is greater than 0.75, both images contain the same text. You can read more about this function here.

The second problem is that the text in the next frame extends the text in the previous one by one or two words, i.e.:

hello my name is julia abelson and this
hello my name is julia abelson and this is

As you can see, the text in the first picture differs from the text in the second by only one “is”. Extracting one word at a time and passing it to the next step is not an option, so I propose to somehow detect where a line ends and pass only the picture in which the line is complete to the next step. How we determine this is described below.

The bitwise_and() function will help us solve this problem. This bitwise operation merges two images so that only the parts present in both remain in the output image. In an ideal situation, applying this function to the two images above would give us something like this:

hello my name is julia abelson and this

But what we actually get is this:

unreadable text

The thing is that detection doesn’t always work identically: an image with the same text, taken from a different frame, often differs by 2-3 pixels in width and/or height. In this case, you can find the bounding boxes of each letter and crop the image to those coordinates, thereby removing the black background around the text.

As a result, we get an image close to the ideal case. The rest is easy: we pass this image together with the first one (the one without the trailing “is”) to matchTemplate() and apply the same logic described a few paragraphs above. That’s it: the image is ready for text recognition 🙂


Optical Character Recognition

After a successful detection stage, I hoped text recognition would go just as smoothly, but it turned out to be a mess. For text recognition I considered the following libraries: Tesseract, EasyOCR, and PaddleOCR. To choose among them, I decided to test how well each library performed on my data according to three criteria: CER, WER, and running time. You can read about the first two metrics here. I took 50 images from the dataset, put their ground-truth text in a separate JSON file, ran the data through each solution, and got the following results:
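For reference, both metrics reduce to edit distance. A stdlib-only sketch, standing in for a metrics library, with a hand-rolled Levenshtein distance:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (characters or words)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

For example, cer("hello", "hxllo") is 0.2 (one wrong character out of five), and wer("my name is julia", "my name was julia") is 0.25 (one wrong word out of four).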

Honestly, I thought the results would be better because I produced a really nice grayscale image: minimal background noise, straight text, and no rotations or distortions.

Now we have to decide which library to use. PaddleOCR seems like the obvious choice, but the problem is that I only have 2 GB of video memory, of which ~1.5 GB is consumed by the detection model, and on the CPU this solution runs too slowly for real time. The same goes for EasyOCR, except that on the CPU I was unable to launch it at all; when I get a better GPU, I’ll keep it in mind 🙂 That leaves only Tesseract, and that’s what I’ll use.

Word Processing and Translation

It often happens that Tesseract misrecognizes one or two letters in a word, and as a result the word may be mistranslated. So I decided to use pyenchant, which checks the spelling of each word; if a word is misspelled, the library suggests a similar word, and the corrected text is passed to the next step: translation.
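The article uses pyenchant; to show the shape of this step without that dependency, here is a stand-in sketch using difflib from the standard library and a tiny made-up vocabulary:

```python
import difflib

# toy vocabulary; pyenchant would consult a full language dictionary instead
VOCAB = {"hello", "my", "name", "is", "julia", "abelson", "and", "this"}

def correct_word(word, vocab=VOCAB):
    """Return the word itself if known, otherwise the closest vocabulary match."""
    if word in vocab:
        return word
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=0.7)
    return matches[0] if matches else word

def correct_text(text):
    return " ".join(correct_word(w) for w in text.split())
```

So a Tesseract slip like "hrllo my namr is julia" comes out as "hello my name is julia" before it ever reaches the translator.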

The translation part is simple – the Google Translate API. In some cases the translation isn’t quite right, but it’s fast, and I haven’t seen any limits on the number of requests.

Dubbing and Text Visualization

For dubbing, I used the unsophisticated pyttsx3. Russian doesn’t sound great, but you get used to it over time. Text visualization was done with PySimpleGUI.

In its current implementation, the project is unlikely to be something people use day to day, as problems remain. One of them is speed: the output lags about five seconds behind real time. But I have ideas for improving the project 🙂

The source code is available on GitHub.

Have a nice day!
