Predicting output size of compressed video (MoviePy) - python

I am about to run the following experiment, but I'm wondering whether the chosen approach is correct, hence the cry for help. The goal is to develop a general rough rule of thumb for the target video clip size. We have an input video and, using the MoviePy library, want to produce an output video that should not exceed a certain arbitrary size. The output is created by accelerating the input video:
VideoFileClip(fname).without_audio().fx(vfx.resize, 0.3).fx(vfx.speedx, 4)
So, if we set a target that the output should not exceed 200 KB, the question is: what speed rate should be applied for a video of size 3 MB, 10 MB, or 20 MB?
To do that, the idea was to take a sample of random videos, try out different speed rates, and find a line of best fit on the outputs. What is your take on that approach? I know that it's a very coarse measure, yielding different results depending on the input type, but still... Thanks.
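For what it's worth, the measurement loop itself is easy to script. Here is a minimal sketch of the experiment, assuming MoviePy 1.x (moviepy.editor), a hypothetical list of sample files, and the crude model that output size scales with input size divided by speed:

import os
import numpy as np
from moviepy.editor import VideoFileClip, vfx

SAMPLES = ["a.mp4", "b.mp4", "c.mp4"]  # hypothetical sample videos
SPEEDS = [2, 4, 8, 16]
points = []  # (input_size, speed, output_size) observations
for fname in SAMPLES:
    in_size = os.path.getsize(fname)
    for speed in SPEEDS:
        clip = (VideoFileClip(fname)
                .without_audio()
                .fx(vfx.resize, 0.3)
                .fx(vfx.speedx, speed))
        out = "out_%d_%s" % (speed, os.path.basename(fname))
        clip.write_videofile(out, logger=None)
        clip.close()
        points.append((in_size, speed, os.path.getsize(out)))
# Crude rule of thumb: fit output_size as a linear function of input_size / speed
x = np.array([p[0] / p[1] for p in points])
y = np.array([p[2] for p in points])
slope, intercept = np.polyfit(x, y, 1)
print("output_size ~ %.3f * (input_size / speed) + %.0f" % (slope, intercept))

Inverting the fitted line then gives a first guess at the speed rate needed for a given input size and target output size.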

Related

Resampling without changing pitch and ratio

I'm doing speech recognition and denoising. In order to feed the data to my model I need to resample it and make it 2 channels, although I don't know the optimal resampling rate for each sound. When I use a fixed number for the resampling rate (resr), like 20000 or 16000, it sometimes works and sometimes makes the pitch wrong or makes the audio slow. How does resampling work in this case? Do I need an optimizer?
Also, what can I do if I have a phone call and one person's voice is so quiet that it gets recognized as noise?
This is my code:
import torch
import torchaudio

num_channels = sig.shape[0]
# Resample first channel
resig = torchaudio.transforms.Resample(sr, resr)(sig[:1, :])
print(resig.shape)
if num_channels > 1:
    # Resample the second channel and merge both channels
    retwo = torchaudio.transforms.Resample(sr, resr)(sig[1:, :])
    resig = torch.cat([resig, retwo])
I don't know the optimized resampling rate for each sound
Sample rate is not a parameter you tune for each audio file; rather, you should use the same sample rate that was used to train the speech recognition model.
sometimes it works and sometimes it makes the pitch wrong or makes it slow.
Resampling, when properly done, does not alter pitch or speed. My guess is that you are saving the resulting data with the wrong sample rate. The sample rate is not something you can pick arbitrarily; you have to pick one that conforms to the system you are working with.
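For example, with torchaudio the sample rate is passed explicitly when saving, and it must be the rate the data was actually resampled to (a sketch; the file name is hypothetical):

import torchaudio

# resig was resampled to resr, so it must be saved with resr, not the original sr
torchaudio.save("output.wav", resig, resr)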
Having said that, the proper way to do resampling, regardless of the number of channels, is to simply pass the waveform to the torchaudio.functional.resample function along with the original and target sample rates. The function processes multiple channels at the same time, so there is no need to run it separately for each channel.
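A minimal sketch of that simpler path, assuming sig is a (channels, time) tensor:

import torchaudio.functional as F

# Resamples all channels at once; no per-channel loop or torch.cat needed
resig = F.resample(sig, orig_freq=sr, new_freq=resr)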
Then, if you know the sample rate of the input audio beforehand, and all the audio you process has the same sample rate, using torchaudio.transforms.Resample will make the process faster because it caches the convolution kernel used for resampling.
resampler = torchaudio.transforms.Resample(original_sample_rate, target_sample_rate)
for sig in signals:
    resig = resampler(sig)
    # process the resulting resampled signal

Object recognition with CNN: what is the best way to train my model, photos or videos?

I aim to design an app that recognizes a certain type of object (let's say, a book) and can tell whether the input actually is a book or not (binary classification).
For a better user experience, I would like the input to be a video rather than a picture: that way, the user won't have to deal with issues such as sharpness, centering of the object... He'll just have to make a "scan" of the object, without much consideration for the quality of a single image.
And here comes my problem: as I intend to create my training dataset from scratch (the actual object I want to detect being absent from existing datasets such as ImageNet),
I was wondering whether videos are unsuitable for this type of binary classification and whether I should rather ask the user to take a good picture of the object.
On one hand, videos have the advantage of yielding a larger dataset than one created only from photos (though I can expand my picture dataset thanks to data augmentation), as it is easier to take a 10 s video of an object than to take 10×24 (more or less…) pictures of it.
But on the other hand, I fear the result will be less precise, as many frames in a video are redundant and the average quality might not be as good as that of a single, proper image.
Moreover, I do not intend to use the time property of the video (in a scan, the temporality is useless) but rather to work one frame at a time (as described in this article).
What is the proper way of constituting my dataset? As I really would like to keep this “scan” for the user’s comfort, and if images are more precise than videos in such a classification, is it possible to automatically extract a single image from a “scan” and work directly on it?
Good question! The answer is: you should train your model on how you plan to use it. So if you ask the user to take photos, train it on photos. If you ask the user to film the object, train on frames extracted from video.
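If you go the video route, extracting training frames is straightforward. Here is a minimal sketch with OpenCV; the stride of 5 is an arbitrary choice to thin out near-duplicate frames:

import cv2

def extract_frames(video_path, out_prefix, stride=5):
    # Save every stride-th frame of the video as a JPEG, return the count
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            cv2.imwrite("%s_%05d.jpg" % (out_prefix, saved), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved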
The images might seem blurry to you, but they won't be for a computer. It will just learn to detect "blurry books", but that's OK, that's what you want.
Of course this is not always the case. The image might become so blurry that the information about whether or not there is a book in the frame is no longer there. Where is the line? A general rule of thumb: if you can see it's a book, the computer will also see it. As I think blurry images of books will still be recognizable as books, I think you could totally do it.
Creating "photos (single image, sharp)" from "scan (more blurry, frames from video)" can be done, it's called super-resolution. But those models are pretty beefy, not something you would want to run on a mobile device.
On a completely unrelated note: try googling Transfer Learning! It will benefit you for sure :D.

Feature extraction for keyword spotting on long form audio using a CNN

I've built a simple CNN word detector that is accurately able to predict a given word when using a 1-second .wav as input. As seems to be the standard, I'm using the MFCC of the audio files as input for the CNN.
However, my goal is to be able to apply this to longer audio files with multiple words being spoken, and to have the model predict if and when a given word is spoken. I've been searching online for the best approach, but I seem to be hitting a wall, and I truly apologize if the answer could've been easily found through Google.
My first thought is to cut the audio file into several 1-second windows that overlap each other, and then convert each window into an MFCC and use these as input for the model prediction.
My second thought would be to instead use onset detection to try to isolate each word, pad the word if it was shorter than 1 second, and then feed these as input for the model prediction.
Am I way off here? Any references or recommendations would be hugely appreciated. Thank you.
Cutting the audio up into analysis windows is the way to go, and it is common to use some overlap. The MFCC features can be calculated first, and the splitting then done using the integer number of frames that gets you closest to the window length you want (1 s).
See How to use a context window to segment a whole log Mel-spectrogram (ensuring the same number of segments for all the audios)? for example code.
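For illustration, here is a sketch of that frame-based splitting with librosa (the file name and parameters are hypothetical; librosa's default hop length of 512 samples gives roughly sr / 512 MFCC frames per second):

import numpy as np
import librosa

sig, sr = librosa.load("long_recording.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=sig, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
hop_length = 512  # librosa's default STFT hop
win = int(round(1.0 * sr / hop_length))  # number of frames closest to a 1 s window
step = win // 2  # 50% overlap between consecutive windows
windows = [mfcc[:, i:i + win] for i in range(0, mfcc.shape[1] - win + 1, step)]
# each element of windows is one (13, win) input for the CNN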

Read jpg compression quality

Update:
After further reading, I found out that the compression quality is not only not stored in the image header, but also that the quality number is basically meaningless, because different programs interpret it differently. I'm still leaving this here in case anybody has a very clever solution to the problem.
I did write a log of the time I last compressed each image and compare it to the file's last-modified time, though I would still like a way to get the quality of other images that I can't judge by modification time.
Original question:
I have a small Python script that walks through certain folders and compresses certain .jpg files to quality 60 using the Pillow library.
It works; however, on a second iteration it would compress all the images that were already compressed.
So is there a way to get the compression or quality a .jpg currently has, so I can skip the file if it was already compressed?
import os
from os.path import join, getsize
from PIL import Image

start_directory = ".\\test"
for root, dirs, files in os.walk(start_directory):
    for f in files:
        try:
            with Image.open(join(root, f)) as im:
                if im.quality > 60:  # <= something like this
                    im.save(join(root, f), optimize=True, quality=60)
                im.close()  # <= not sure if necessary
        except IOError:
            pass
There is a way this might be solved. You could use bits per pixel to check your JPEG compression rate.
As I understand it, the JPEG "quality" metric is a number from 0 to 100, where 100 represents the least compression and 0 the maximum. As you say, the actual compression details will vary from encoder to encoder and (probably) from pixel block to pixel block (macroblocks?). Applying 60 to an image will reduce the image quality, but it will also reduce the file size, presumably to 60% of its original size.
All your images will probably have different dimensions. What you want to look at is bits per pixel.
The question is: why do you compress at quality factor 60 in the first place? Is it correct for all your images? Are you trying to achieve a specific file size? Or are you happy just to make them all smaller?
If you instead aim for a specific number of bits per pixel, then your check just becomes a calculation: divide the file size by the number of pixels in the image and compare the result against your desired bits per pixel. If it’s bigger, apply compression.
Of course, you’ll then have to be slightly more clever than selecting quality factor 60. Presumably 60 either means 60% of original file size or it’s some internal setting. If it’s the former, you can calculate a new value simply enough. If it’s the latter, you may need to use trial and improvement to get the desired file size.
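A sketch of that bits-per-pixel check with Pillow; the 2.0 bpp threshold is an arbitrary placeholder you would tune for your own images:

import os
from PIL import Image

TARGET_BPP = 2.0  # hypothetical threshold: bits per pixel you are happy with

def needs_compression(path):
    with Image.open(path) as im:
        pixels = im.width * im.height
    return os.path.getsize(path) * 8 / pixels > TARGET_BPP

if needs_compression("photo.jpg"):
    with Image.open("photo.jpg") as im:
        im.save("photo.jpg", optimize=True, quality=60)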
You are asking the impossible here. JPEG "quality" depends upon two factors. First, the sampling of the components. Second, the selection of quantization tables.
Some JPEG encoders have "quality" settings, but these could do anything. Some use ranges of 0..100, others use 0..4. I've seen 0..8. I've seen "high", "medium", "low".
You could conceivably look at the sampling rates in the frame, and you could compare the quantization tables against some baseline tables of your own to make a "quality" evaluation.
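Pillow does expose the quantization tables of an opened JPEG, so a rough version of that comparison is possible. A sketch (how you map the tables to a "quality" score is up to you):

from PIL import Image

with Image.open("photo.jpg") as im:
    # Dict mapping table id -> 64 quantization coefficients (JPEG files only).
    # Larger coefficients mean coarser quantization, i.e. lower quality.
    for table_id, coeffs in im.quantization.items():
        print(table_id, sum(coeffs) / len(coeffs))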

Accurately mixing two notes over each other

I have a large library of pre-recorded music notes (some ~1200), which are all of consistent amplitude.
I'm researching methods of layering two notes over each other so that it sounds like a chord where both notes are played at the same time.
Samples with different attack times:
As you can see, these samples have different peak amplitude points, which need to line up in order to sound like a human-played chord.
Manually aligned attack points:
The 2nd image shows the attack points manually aligned by ear, but this is an unfeasible method for such a large data set, where I wish to create many permutations of chord samples.
I'm considering a method whereby I identify the time of peak amplitude of two audio samples, and then align those two peak amplitude times when mixing the notes to create the chord. But I am unsure of how to go about such an implementation.
I'm thinking of using a Python mixing solution, such as the one found here, Mixing two audio files together with python, with some tweaking to mix audio samples over each other.
I'm looking for ideas on how I can identify the times of peak amplitude in my audio samples, or if you have any thoughts on other ways this idea could be implemented I'd be very interested.
In case anyone is actually interested in this question, I have found a solution to my problem. It's a little convoluted, but it has yielded excellent results.
To find the time of peak amplitude of a sample, I found this thread: Finding the 'volume' of a .wav at a given time, where the top answer provided links to a Scala library called AudioFile, which provided a method to find the peak amplitude by going through a sample in frame buffer windows. However, this library required all files to be in .aiff format, so a second library of samples was created, consisting of all the old .wav samples converted to .aiff.
After reducing the frame buffer window, I was able to determine in which frame the highest amplitude was found. Dividing this frame index by the sample rate of the audio samples (known to be 48000), I was able to accurately find the time of peak amplitude. This information was used to create a file storing both the name of each sample file and its time of peak amplitude.
Once this was accomplished, a Python script was written using the Pydub library (http://pydub.com/) which would pair up two samples and find the difference (t) in their times of peak amplitude. The sample with the lower time of peak amplitude would have silence of length (t) prepended to it from a .wav containing only silence.
These two samples were then overlaid onto each other to produce the accurately mixed chord!
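For anyone who wants to stay in one language, here is a sketch of the same pipeline done entirely in Python with numpy and Pydub, skipping the Scala/.aiff detour; the file names are hypothetical:

import numpy as np
from pydub import AudioSegment

def peak_time_ms(path):
    # Return the time of peak amplitude of a sample, in milliseconds
    seg = AudioSegment.from_wav(path)
    samples = np.array(seg.get_array_of_samples(), dtype=np.float32)
    peak_index = int(np.argmax(np.abs(samples)))
    # samples are interleaved for stereo audio, hence the division by channels
    return 1000.0 * (peak_index // seg.channels) / seg.frame_rate

seg_a = AudioSegment.from_wav("note_a.wav")
seg_b = AudioSegment.from_wav("note_b.wav")
t = peak_time_ms("note_a.wav") - peak_time_ms("note_b.wav")
# Prepend silence to whichever note peaks earlier, so the peaks line up
silence = AudioSegment.silent(duration=int(abs(t)), frame_rate=seg_a.frame_rate)
if t < 0:
    seg_a = silence + seg_a
else:
    seg_b = silence + seg_b
chord = seg_a.overlay(seg_b)
chord.export("chord.wav", format="wav")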
