Virtual backgrounds are one of the hot topics among employees who work remotely at the moment. With many of us isolated because of the Covid-19 pandemic, a lot of people have to take video calls in order to carry on their work. Some video conferencing tools allow setting a virtual background so that users can create a friendlier atmosphere for these calls.
Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.
As a programmer, I was naturally intrigued the first time I used such a virtual background. How does it work, I wondered. Can I build one myself? And if so, how? Spoiler: it did not go well! Still, I think it was a good educational exercise, and I didn't find much information on this topic while researching it. Therefore, as I do with everything I learn, I decided to document it here; maybe someone else will benefit from it.
So in this tutorial we are going to try a basic approach to building a virtual background with Computer Vision techniques, using Python and OpenCV.
The goal of this project is to take a video, try to figure out which part is the background and which part is the foreground, remove the background and replace it with a picture - the virtual background. Because we are going to use trivial methods in this project, we need to assume that the foreground will, in general, have colors different from the background. But first, let's see what our tools are.
Computer Vision is an interdisciplinary field that deals with how computers can process and (maybe) understand images and videos. We say it is interdisciplinary because it borrows a lot of concepts from different disciplines (computer science, algebra, geometry and so on) and combines them to solve many different and complex tasks, like object tracking, object detection, object recognition and object segmentation in images and videos.
OpenCV is a library built for solving computer vision tasks. It is open-source and available for several programming languages, including Python and C++. It has a tremendous number of features for computer vision, some of them based on maths and statistical approaches, and others based on Machine Learning.
If you've made it this far in this article, you probably know what Python is :grinning:
The approach I tried for this was the following. I'll show code snippets for every step and at the end of the article you'll have the full code.
1. Import the required libraries
import numpy as np
import cv2
2. Load the video from the local environment and initialize data
cap = cv2.VideoCapture('video6.mp4')
ret = True
frameCounter = 0
previousFrame = None
nextFrame = None
iterations = 0
3. Load the substitute background image from the local environment
backgroundImage = cv2.imread("image1.jpg")
4. Split the video frame by frame
while ret:
    ret, frame = cap.read()
    if not ret:
        break  # no more frames to read
5. Take every pair of two frames
# still inside the while loop
if frameCounter % 2 == 1:
    nextFrame = frame
if frameCounter % 2 == 0:
    frameCounter = 0
    previousFrame = frame
frameCounter = frameCounter + 1
iterations = iterations + 1
6. Find the absolute difference between the two frames and convert it to grayscale -> obtaining a mask.
if iterations > 2:
    diff = cv2.absdiff(previousFrame, nextFrame)
    mask = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
Every image consists of pixels. You can imagine it as a 2D matrix with rows and columns, where every cell in the matrix is a pixel in the image (of course, color images have more than 2 dimensions, but for simplicity we can ignore this).
We obtain the difference by going pixel by pixel through the first image (so cell by cell through the first matrix) and subtracting the corresponding pixel of the other image (so the corresponding cell of the other matrix).
Now here's the trick: if a pixel has not been modified between the 2 frames, then of course the result will be 0. How can a pixel be different between 2 frames? If the video is completely static (nothing moves in the image), then the difference will be 0 between each and every frame for all the pixels, because nothing has changed. But if something moves in the image, then we can identify where it moved by detecting the pixel differences. And we can assume that, in a video conference, the things that move are in the foreground (that's you) and the static part is the background.
And what's so important about this 0? The image will show black for every pixel that is 0, and we are going to use that to our advantage.
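To make the subtraction idea concrete, here is a minimal sketch on two tiny synthetic grayscale "frames". It uses only NumPy, and the arrays and the names `previous_frame` and `next_frame` are made up for illustration; on real frames, `cv2.absdiff` performs the same per-pixel operation.

```python
import numpy as np

# Two hypothetical 3x3 grayscale frames (values 0-255), identical except
# for the centre pixel, which "moved" between the frames.
previous_frame = np.full((3, 3), 100, dtype=np.uint8)
next_frame = previous_frame.copy()
next_frame[1, 1] = 180

# Cast to a wider signed type before subtracting so the uint8 values
# cannot wrap around, then take the absolute value.
diff = np.abs(previous_frame.astype(np.int16) - next_frame.astype(np.int16)).astype(np.uint8)

print(diff)
```

Only the pixel that changed ends up non-zero (80 here). Every static pixel is 0, which is exactly the black area we want to swap out.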
7. Find the cells in the mask that are over a threshold value - I've chosen 3 as a threshold, but you can play with different values. A larger value will remove more from the background, but may also remove more from the foreground.
th = 3
isMask = mask > th
nonMask = mask <= th
8. Create an empty image (0 for every cell) with the size of any of the two frames.
result = np.zeros_like(nextFrame, np.uint8)
9. Resize the background image so that it has the same size as the frames.
# cv2.resize expects the target size as (width, height)
resized = cv2.resize(backgroundImage, (result.shape[1], result.shape[0]), interpolation = cv2.INTER_AREA)
10. For every cell from the mask that is bigger than the threshold, copy from the original frame.
result[isMask] = nextFrame[isMask]
11. For every cell from the mask that is at or below the threshold, copy from the substitute background image.
result[nonMask] = resized[nonMask]
12. Save the result frame to the local environment.
cv2.imwrite("output" + str(iterations) + ".jpg", result)
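The masking and copying steps can also be sketched as one small function and tried on synthetic data. This is only an illustration under some assumptions: `replace_background` is a hypothetical helper, the background is assumed to already have the same size as the frames (so the resize step is skipped), and the grayscale conversion is approximated with the usual luma weights in NumPy instead of calling `cv2.cvtColor`.

```python
import numpy as np

def replace_background(previous_frame, next_frame, background, threshold=3):
    """Composite one output frame from two consecutive BGR frames (hypothetical helper)."""
    # Per-pixel absolute difference, computed in int16 to avoid uint8 wraparound.
    diff = np.abs(previous_frame.astype(np.int16) - next_frame.astype(np.int16))
    # Approximate grayscale conversion of the BGR difference (luma weights).
    gray = diff[..., 0] * 0.114 + diff[..., 1] * 0.587 + diff[..., 2] * 0.299
    moving = gray > threshold            # pixels that changed: treated as foreground
    result = background.copy()           # start from the substitute background
    result[moving] = next_frame[moving]  # paste the moving pixels on top
    return result

# Tiny synthetic example: a 2x2 "video" where only the top-left pixel changes.
previous_frame = np.zeros((2, 2, 3), dtype=np.uint8)
next_frame = previous_frame.copy()
next_frame[0, 0] = (50, 60, 70)
background = np.full((2, 2, 3), 255, dtype=np.uint8)

output = replace_background(previous_frame, next_frame, background)
print(output[0, 0], output[1, 1])
```

Starting from a copy of the background and pasting the foreground on top gives the same result as filling an empty image from both masks, since every pixel belongs to exactly one of the two.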
So what are the results? Honestly, I was a bit disappointed. Then I did more research and the reason became more obvious. You need a more advanced approach for this, and it's no surprise that big companies invest lots of resources in this type of problem.
Here's a screenshot of the video I tried. It's basically a video of my hand moving in front of a wall.
And here's a screenshot of the output image. For the background I used a photo of me in Rasnov, Romania.
As I said, I am not very satisfied with the result. But I am satisfied with what I learned from this project. It was a fun learning experience and a nice way to spend my time working with concepts I am not yet comfortable with.
If you think a problem is very complicated and requires a level of intelligence unusual for the software you've seen so far, then the answer might be Machine Learning. :grinning:
There are already Deep Learning models out there that can perform this sort of task. But such a model requires large datasets to train on and lots of processing power, neither of which I had at the time of writing this article. The task to be solved by such a deep learning model is called image segmentation.
Another approach would be a computer vision method for finding the distance between the camera and the objects in the image. Then you would establish a threshold for separating the foreground from the background. After that, you can use the same masking technique I used to remove the background and introduce a new one.
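As a rough illustration of that idea, here is a sketch that assumes a per-pixel depth map is already available. How such a map would be obtained is left open, and the depth values, the `max_foreground_depth` cut-off and the stand-in images are all invented for the example.

```python
import numpy as np

# Hypothetical depth map in metres, one value per pixel (smaller = closer).
depth = np.array([[0.8, 0.9, 3.0],
                  [0.7, 2.5, 3.2]])
frame = np.full((2, 3, 3), 10, dtype=np.uint8)        # stand-in camera frame
background = np.full((2, 3, 3), 200, dtype=np.uint8)  # stand-in virtual background

max_foreground_depth = 1.5  # assumed cut-off, would need tuning per setup
is_foreground = depth < max_foreground_depth

# Same masking trick as before: keep near pixels, replace far ones.
result = background.copy()
result[is_foreground] = frame[is_foreground]

print(is_foreground)
```

Unlike the frame difference, this mask does not depend on movement, so a person sitting still would stay in the foreground.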
Thank you so much for reading this.