Hello all! Today, I’m going to share and explain some code I have been using to clean up the audio in videos using Python. The technique is very simple, especially with the tools available, but I’ll take some time to explain what is really happening in the code, and how this type of processing, along with more sophisticated methods, can be applied to artificial intelligence agents like Alexa. Hopefully it will give you an appreciation of what goes into audio editing, even though this is just a very simple example. Follow this link to access the code directly.
First, install and import all of the necessary packages, which are listed below. These include some standard packages such as scipy, matplotlib, and numpy, but also two packages I had never used before: moviepy and LibROSA. I encourage you to read up on them, as they provide good functionality for manipulating video files (moviepy) and audio files (LibROSA).
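The exact import list isn’t reproduced here, so treat the following as a reasonable reconstruction; soundfile is my own addition for writing audio back out later, since recent versions of LibROSA no longer include a file writer:

```python
# Install first if needed:
#   pip install numpy scipy matplotlib moviepy librosa soundfile

import numpy as np
import matplotlib.pyplot as plt
import librosa
import soundfile as sf
from moviepy.editor import VideoFileClip, AudioFileClip
```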
The first piece of code we will write uses the VideoFileClip function from moviepy.editor to read in an mp4 file and separate it into video and audio clips. Under the hood, this function mostly delegates to FFmpeg (a separate library) to perform the separation, but doing it through MoviePy is much easier from my perspective, with no major downside. After separating the audio from the video, we write it to a new audio (.wav) file, which we will load into LibROSA later.
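Something like the following should do it (video.mp4 and audio.wav are placeholder filenames; substitute your own):

```python
# Read in the video; MoviePy hands the decoding off to FFmpeg.
clip = VideoFileClip("video.mp4")

# The clip's audio track is itself a clip; write it out as a .wav
# file so LibROSA can load it in the next step.
clip.audio.write_audiofile("audio.wav")
```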
The next step is to load the audio file and convert it into a format that can be manipulated mathematically. To do this, use LibROSA’s load command to read in the file we just wrote with moviepy. Plot the data to make sure you are getting real and sensible values before continuing; it should look something like the blue graph below. A minimal version of this step (reusing the audio.wav placeholder from above) might be:
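```python
# sr=None keeps the file's native sampling rate (44100 Hz in my case)
# instead of resampling to LibROSA's 22050 Hz default.
y, sr = librosa.load("audio.wav", sr=None)

# Sanity check: the waveform should look like real audio, not silence
# or constant clipping.
plt.plot(y)
plt.xlabel("Sample")
plt.ylabel("Amplitude")
plt.show()
```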
Here is where the data processing begins. Create a new function like the one below:
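Here is a minimal sketch of that function, written against the explanation that follows and reusing its variable names (ss, angle, b, mns, sa, sa0); the function name, its arguments, and the use of soundfile for writing the result are my own choices rather than anything fixed by the original code:

```python
def reduce_noise(y, sr, noise_length=8192, output_file="audio_clean.wav"):
    # Short-Time Fourier Transform: one complex value per frequency
    # per window of the signal
    s = librosa.stft(y)
    ss = np.abs(s)          # magnitude of each frequency component
    angle = np.angle(s)     # phase of each frequency component
    b = np.exp(1j * angle)  # exponential used later to restore the phase

    # Characterize the static noise from the first few thousand samples,
    # which we assume contain nothing but that noise
    ns = librosa.stft(y[:noise_length])
    mns = np.mean(np.abs(ns), axis=1)  # average magnitude per frequency

    # Subtract the noise profile from every window of the full signal
    sa = ss - mns.reshape((mns.shape[0], 1))

    # Shift each window back to its proper phase
    sa0 = sa * b

    # Inverse Fourier transform back to the time domain, then write out
    y_clean = librosa.istft(sa0)
    sf.write(output_file, y_clean, sr)
    return y_clean
```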
The first step is to compute the Fourier transform of the signal. This is achieved with LibROSA’s stft (Short-Time Fourier Transform) command, which splits our audio data into short overlapping windows, converts each window to the frequency domain, and returns the values to us. These values are complex (meaning they contain both real and imaginary parts), so they must be converted into a magnitude and an angle, using numpy’s abs and angle functions.
The frequency (Fourier) domain may be a complicated topic for those who haven’t studied it, but I will try to summarize it briefly here. Essentially, this domain is based on the idea that if you can record and plot the magnitude of any signal over time or space (for example, temperature over time, or in this case recorded sound over time), then the same signal can also be represented in an alternate way: as a mapping of the magnitudes of different frequencies. A slow or gradual change corresponds to a low frequency, while a quick or abrupt change corresponds to a high frequency. When you convert a signal to the frequency domain, you evaluate how much of the signal corresponds to each single frequency within a wide range (for example, from 0 to 1024 hertz, where hertz (Hz) indicates the number of changes per second). To represent a complicated signal, you need to combine several, maybe hundreds or thousands, of these individual frequency magnitudes. The components combine through simple addition, complementing each other in some instances and canceling in others depending on each frequency, magnitude, and phase (angle). The phase is used to shift each frequency to the left or right, ensuring that the components line up in exactly the right way in time when the signal is reconstructed. In the frequency domain, this shift is performed by multiplication with an exponential function, e^(jΘ), where Θ is the angle computed above.
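To make this concrete, here is a small self-contained example (not part of the original code): we mix two pure tones and let numpy’s FFT recover exactly which frequencies, and how much of each, went into the mix.

```python
import numpy as np

# One second of signal sampled at 1024 Hz, built from a slow 5 Hz
# tone and a faster 50 Hz tone at half the amplitude.
sr = 1024
t = np.arange(sr) / sr
signal = 1.0 * np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)

# Convert to the frequency domain. For a real signal, rfft returns one
# complex value (magnitude and phase) per frequency from 0 to sr/2.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
magnitudes = np.abs(spectrum) / (len(signal) / 2)  # normalize to amplitude

# The only large magnitudes sit at 5 Hz and 50 Hz, matching the
# amplitudes (1.0 and 0.5) that we mixed in above.
for f in (5, 50):
    print(f"{f} Hz -> magnitude {magnitudes[freqs == f][0]:.2f}")
```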
The Fourier domain has been used for image processing and in many other mathematical and physical problems, but it is particularly suited to audio processing, as the main distinction between different sounds is the frequencies at which they occur. High pitches consist mostly of high frequencies, while low pitches consist mostly of low frequencies. Frequencies that consistently line up after a given number of cycles sound harmonious, while frequencies that fight against each other sound dissonant. Sounds without any discernible pitch are that way because they combine hundreds or thousands of pitches, so no single dominant frequency can be heard. Different combinations are also easily distinguished by humans: for example, if you close your eyes and consider the two sounds “p” and “t”, you will likely be able to tell them apart easily. But how could you mathematically define the difference between them? Something like this is only achievable in the frequency domain, and it is the basis for all audio processing, including speech, music, and many other applications.
So from our code, we have now determined the magnitude (ss) and phase (angle) for each time window of our recorded audio signal. The question now is what we would like to remove. In this case, we are considering a relatively constant static sound that occurs throughout the video. For my videos, if I don’t have a microphone close to the source, there is often static noise that distracts from the most important audio in the video. If we assume that this static noise is constant, then we can also assume that it is present in the first fraction of a second of the video, while the more useful sounds are not. Therefore, if we can characterize this static noise in the frequency domain by analyzing just the first second or less of the video, we can remove those characteristics from the rest of the video.
To do this, we again use the LibROSA stft function, but rather than taking the Fourier transform of the entire video, we only consider the first several thousand data points. In this code, I have used 8192 data points, but this number could be changed depending on when your desired sound begins in your video. For me, the audio sampling rate was 44100 Hz, so 8192 data points corresponds to about the first 1/5th of a second of the video. This was enough for my video, but if it doesn’t work for yours, you could include more or fewer data points. After computing the STFT of this segment, the average magnitude of each frequency is computed. This frequency profile should be approximately the frequency profile of the static noise. Therefore, we simply subtract the magnitudes of the frequencies in this first 0.2 seconds from the magnitudes of the frequencies of each window computed previously (sa = ss - mns.reshape((mns.shape[0], 1))). The modified windows are then shifted back to their proper phase using sa0 = sa*b, where b is the exponential function defined above. Lastly, the inverse Fourier transform is computed, and the new audio is written to a new file.
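With the sketch of the function above in place, applying it to the audio we loaded earlier is a single call (the output filename is again a placeholder):

```python
# Run the noise reduction and write the cleaned audio to disk.
y_clean = reduce_noise(y, sr, output_file="audio_clean.wav")
```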
The final part of the code is optional, as I discovered that rewriting the video in this way significantly reduced the video quality. As a result, I simply combined the original video with the new audio using Windows’ built-in video editing tool, and that may be the best option for many people reading this. However, to use the tools we have learned today and come out with a complete video, you can replicate the following:
You can load the new audio file using moviepy’s AudioFileClip, redefine the audio in the main clip (which you can load using the first function we wrote today) through simple assignment, and then rewrite the combined video file.
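In code, that looks roughly like this (the filenames carry over from the earlier steps and are still placeholders):

```python
# Reload the original video and attach the cleaned audio to it
# through simple assignment.
video = VideoFileClip("video.mp4")
video.audio = AudioFileClip("audio_clean.wav")

# Rewrite the combined file. Note that this re-encodes the video
# stream, which is the source of the quality loss mentioned above.
video.write_videofile("video_clean.mp4")
```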
Now, if you compare the original audio file to the cleaned one, you should notice significantly less noise in the latter. You can similarly compare the original video file to the one with cleaned audio, and should find the same thing.
This concludes my tutorial on static noise removal from video files. I hope that you have learned something that you can apply to your own work, and I hope that you enjoyed my code and explanations.
To hear the cleaned audio, you can follow this link to the YouTube video, and compare it to this song, whose audio has not been cleaned.