I have audio from a video that I've loaded with PyTorch. Given a starting index and ending index corresponding to the video segment of interest, along with the video FPS and audio sampling rate, how would I go about extracting the slice of audio that matches the segment of interest of the video?
My intuition is to convert frames to time via:
start_time = frame_start / fps
end_time = frame_end / fps
then convert time to sample position with:
start_sample = int(math.floor(start_time * sr))
end_sample = int(math.floor(end_time * sr))
Is this correct, or is there something I'm missing? I'm worried that information will be lost, since I'm converting the sample positions to ints with floor.
Let's say you have
import numpy as np
fs = 44100 # audio sampling frequency
vfr = 24 # video frame rate
frame_start = 10 # index of first frame
frame_end = 10 # index of last frame
audio = np.arange(44100) # one second of audio as an ndarray
you can calculate the points in time at which you want to slice the audio:
time_start = frame_start / vfr
time_end = frame_end / vfr # or (frame_end + 1) / vfr for inclusive cut
and then to which samples those points in time correspond:
sample_start_idx = int(time_start * fs)
sample_end_idx = int(time_end * fs)
It's up to you whether you want to be super-precise and account for the fact that the audio corresponding to a given frame really starts half a frame before it and ends half a frame after it.
In such a case use:
time_start = np.clip((frame_start - 0.5) / vfr, 0, np.inf)
time_end = (frame_end + 0.5) / vfr
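Putting the two steps together, here is a minimal end-to-end sketch (a longer segment than the single-frame example above; the half-frame padding variant is used, and the audio is just placeholder data):
import numpy as np

fs = 44100        # audio sampling frequency
vfr = 24          # video frame rate
frame_start = 10  # index of first frame
frame_end = 120   # index of last frame
audio = np.arange(10 * fs)  # ten seconds of placeholder audio

# half-frame padding on both sides, clipped so we never index before 0
time_start = np.clip((frame_start - 0.5) / vfr, 0, np.inf)
time_end = (frame_end + 0.5) / vfr

audio_slice = audio[int(time_start * fs):int(time_end * fs)]
The same slice expression works unchanged on a 1-D PyTorch tensor.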
Your solution is just fine. Flooring moves each boundary by at most one sample period (at a sample rate of 16000, 1/16000 = 6.25e-05 seconds); in the example below the resulting video/audio desync is about 4.17e-05 seconds, orders of magnitude below what human ears are able to discern.
import math
fps = 60
frame_start = 121
frame_end = 181
sr = 16000
start_time = frame_start / fps
end_time = frame_end / fps
start_sample = int(math.floor(start_time * sr))
end_sample = int(math.floor(end_time * sr))
print(end_time-end_sample/sr) # 4.166666666671759e-05
How can I find the points in a .wav file at which noise occurs? I mean, not remove the noise, just locate when it occurs.
I checked this notebook that classifies dog and cat sounds:
https://www.kaggle.com/nadir89/classification-logistic-regression-svm-on-mfccs/notebook?select=utils.py
but it didn't work properly for me.
Can you give me some advice or another way to find the noise points of a .wav file?
Is it feasible to detect noise (not remove it) using logistic regression or some other machine-learning approach?
Is there any way to find the noise points?
Try this:
import noisereduce
temp = noisereduce.reduce_noise(noise_clip=noise_clip, audio_clip=temp, verbose=True)
where noise_clip is a small part of the signal (a sample of the noise, maybe a 1 s frame) and audio_clip is the actual audio.
import librosa
import numpy as np

signal, fs = librosa.load(path)  # fs is the sampling rate
signln = len(signal)
avg_energy = np.sum(signal ** 2) / float(signln)  # average energy of the whole signal
f_d = 0.02  # frame duration in seconds
f_length = fs * f_d  # frame length in samples: sampling rate (fs) * frame duration (f_d)
flag = True
j = 0
retsig = []  # frames that contain actual data
noise = signal[0:441]  # fallback: just consider the first little part as noise
while j < signln:
    subsig = signal[int(j): int(j) + int(f_length)]
    average_energy = np.sum(subsig ** 2) / float(len(subsig))  # average energy of the current frame
    if average_energy <= avg_energy:
        # energy of the current frame is below the signal average,
        # so treat this frame as silence or noise
        if flag:  # keep the first noise/silence frame appearing in the signal
            # to collect every noise frame instead, append to a list
            # (noise_list.append(subsig)) and drop the flag condition
            noise = subsig
            flag = False
    else:
        # the current frame's energy is above the signal average, so it contains data
        retsig.append(subsig)  # add that frame to the result
    j += f_length
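For reference, a minimal sketch of how the detected noise frame could feed back into noisereduce (this assumes the 1.x keyword API used above, and that the loop has already produced signal and noise):
# sketch only: `signal` and `noise` come from the loop above
import noisereduce
denoised = noisereduce.reduce_noise(audio_clip=signal, noise_clip=noise, verbose=True)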
Any help would be much appreciated. I am trying to extract the BPM from a .wav file loaded into the Python script, using the pyAudioAnalysis library. For some reason it is not outputting the correct BPM. I tried changing the window size in the beat_extraction() function, but it only allows values under 1 second, and when I change the window size the BPM changes as well. When kept at a 1 second window it outputs 30 every time.
The following is my code:
from pyAudioAnalysis import audioBasicIO
from pyAudioAnalysis import ShortTermFeatures
from pyAudioAnalysis import MidTermFeatures
import matplotlib.pyplot as plt
import numpy as np
import warnings
file_name = "unity_alan.wav"
# Extract the Sampling Rate (Fs) and the Raw Signal Data (signal)
[Fs, signal] = audioBasicIO.read_audio_file(file_name)
# Uncomment if signal has two channels
# signal = signal[:,0]
# Function to Normalize the Signal
def normalize_signal(signal):
    signal = np.double(signal)
    return (signal - signal.mean()) / ((np.abs(signal)).max() + 0.0000000001)
# Short Term Features
# For each short-term window a set of features is extracted. This would result in a sequence of feature
# vectors stored in a np matrix (in this case Features_midTerm)
signal = normalize_signal(signal)
# Fs - frequency sampling Rate
print("The Sampling Freq: ",Fs)
# Total Signal Len
signal_len = len(signal)
print("Total Signal Length: ",signal_len)
# Total Time Len of Song in Seconds
len_of_song = signal_len / Fs
print("Total Song Time (s): ",len_of_song)
# Window Size in mseconds
windowSize_in_ms = 50
windowSize_in_s = windowSize_in_ms/1000
# Window Size in Samples
windowSize_in_samples = Fs * (windowSize_in_ms / 1000) #divide by 1000 to turn to seconds
print("Window Size (samples): ",windowSize_in_samples)
# Window Step in mseconds
wStep_in_ms = 25
# Window Step in Samples
wStep_in_samples = Fs * (wStep_in_ms / 1000) #divide by 1000 to turn to seconds
print("Window Step in Samples: ", wStep_in_samples)
# Oversampling Percentage
oversampling_Percentage = (wStep_in_ms / windowSize_in_ms) * 100
print("Oversampling Percentage (overlap of windows): ",oversampling_Percentage)
# Total Number of Windows in Signal
total_number_windows = signal_len / windowSize_in_samples
print("Total Number of Windows in Signal: ",total_number_windows)
# Total number of feature samples produced (windows/size * total windows)
feature_samples_points_total_cal = int(total_number_windows * (windowSize_in_ms/wStep_in_ms))
print("Calculated Total of Points Produced per Short Term Feature: ",feature_samples_points_total_cal)
# Extract features and their names. Each index has its own vector of features. The total number should be the same
# as the calculated total of points produced per feature.
Features_shortTerm, feature_names = ShortTermFeatures.feature_extraction(signal, Fs, windowSize_in_samples, wStep_in_samples)
# Exact Number of points in the Features
feature_samples_points_total_exact = len(Features_shortTerm[0])
print("Exact Total of Points Produced per Short Term Feature: ",feature_samples_points_total_exact)
# Mid-term window (1 s, expressed in samples)
mid_window_seconds = int(1 * Fs)
# Mid-term step (1 s, expressed in samples)
mid_step_seconds = int(1 * Fs)
# MID FEATURE Extraction
Features_midTerm, short_Features_ignore, mid_feature_names = MidTermFeatures.mid_feature_extraction(signal,Fs,mid_window_seconds,mid_step_seconds,windowSize_in_samples,wStep_in_samples)
# Exact Mid-Term Feature Total Number of Points
midTerm_features_total_points = len(Features_midTerm)
print("Exact Mid-Term Total Number of Feature Points: ",midTerm_features_total_points)
# Beats per min
# The tempo of music determines the speed at which it is played (measured in BPM)
bpm,confidence_ratio = MidTermFeatures.beat_extraction(Features_shortTerm,1)
print("Beats per Minute (bpm): ",bpm)
print("Confidence ratio for BPM: ", confidence_ratio)
# Figure out why the BPM does not match the actual reading
# of 115 BPM. It is showing 30 BPM which is for sure wrong.
The output of my script is as follows:
The Sampling Freq: 48000
Total Signal Length: 10992884
Total Song Time (s): 229.01841666666667
Window Size (samples): 2400.0
Window Step in Samples: 1200.0
Oversampling Percentage (overlap of windows): 50.0
Total Number of Windows in Signal: 4580.368333333333
Calculated Total of Points Produced per Short Term Feature: 9160
Exact Total of Points Produced per Short Term Feature: 9159
Exact Mid-Term Total Number of Feature Points: 136
Beats per Minute (bpm): 30.0
Confidence ratio for BPM: 1.0
The library's definition of the function is as follows:
def beat_extraction(short_features, window_size, plot=False):
    """
    This function extracts an estimate of the beat rate for a musical signal.
    ARGUMENTS:
     - short_features: a np array (n_feats x numOfShortTermWindows)
     - window_size:    window size in seconds
    RETURNS:
     - bpm:            estimates of beats per minute
     - ratio:          a confidence measure
    """
    # Features that are related to the beat tracking task:
    selected_features = [0, 1, 3, 4, 5, 6, 7, 8, 9, 10,
                         11, 12, 13, 14, 15, 16, 17, 18]

    max_beat_time = int(round(2.0 / window_size))
    hist_all = np.zeros((max_beat_time,))
    # for each feature
    for ii, i in enumerate(selected_features):
        # dif threshold (3 x Mean of Difs)
        dif_threshold = 2.0 * (np.abs(short_features[i, 0:-1] -
                                      short_features[i, 1::])).mean()
        if dif_threshold <= 0:
            dif_threshold = 0.0000000000000001
        # detect local maxima
        [pos1, _] = utilities.peakdet(short_features[i, :], dif_threshold)
        position_diffs = []
        # compute histograms of local maxima changes
        for j in range(len(pos1) - 1):
            position_diffs.append(pos1[j + 1] - pos1[j])
        histogram_times, histogram_edges = \
            np.histogram(position_diffs, np.arange(0.5, max_beat_time + 1.5))
        hist_centers = (histogram_edges[0:-1] + histogram_edges[1::]) / 2.0
        histogram_times = \
            histogram_times.astype(float) / short_features.shape[1]
        hist_all += histogram_times
        if plot:
            plt.subplot(9, 2, ii + 1)
            plt.plot(short_features[i, :], 'k')
            for k in pos1:
                plt.plot(k, short_features[i, k], 'k*')
            f1 = plt.gca()
            f1.axes.get_xaxis().set_ticks([])
            f1.axes.get_yaxis().set_ticks([])

    if plot:
        plt.show(block=False)
        plt.figure()

    # Get beat as the argmax of the aggregated histogram:
    max_indices = np.argmax(hist_all)
    bpms = 60 / (hist_centers * window_size)
    bpm = bpms[max_indices]
    # ... and the beat ratio:
    ratio = hist_all[max_indices] / hist_all.sum()

    if plot:
        # filter out >500 beats from plotting:
        hist_all = hist_all[bpms < 500]
        bpms = bpms[bpms < 500]
        plt.plot(bpms, hist_all, 'k')
        plt.xlabel('Beats per minute')
        plt.ylabel('Freq Count')
        plt.show(block=True)

    return bpm, ratio
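One detail worth noticing in this source: window_size drives the resolution of the inter-peak histogram. As a sanity check, here is the bin geometry for the window_size of 1 passed by the script, computed with simple arithmetic directly from the lines shown above:
import numpy as np

window_size = 1  # seconds, the value passed by the script above
max_beat_time = int(round(2.0 / window_size))  # -> 2
histogram_edges = np.arange(0.5, max_beat_time + 1.5)  # [0.5, 1.5, 2.5]
hist_centers = (histogram_edges[0:-1] + histogram_edges[1::]) / 2.0  # [1.0, 2.0]
bpms = 60 / (hist_centers * window_size)
print(bpms)  # [60. 30.]
So with a 1 s window there are only two histogram bins, and the function can only ever return 60 or 30 BPM.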
Sorry if this is a really obvious question. I am using matplotlib to generate some spectrograms for use as training data in a machine learning model. The spectrograms are of short clips of music and I want to simulate speeding up or slowing down the song by a random amount to create variations in the data. I have shown my code below for generating each spectrogram. I have temporarily modified it to produce 2 images starting at the same point in the song, one with variation and one without, in order to compare them and see if it is working as intended.
from pydub import AudioSegment
import matplotlib.pyplot as plt
import numpy as np
BPM_VARIATION_AMOUNT = 0.2
FRAME_RATE = 22050
CHUNK_SIZE = 2
BUFFER = FRAME_RATE * 5
def generate_random_specgram(track):
    # Read audio data from file
    audio = AudioSegment.from_file(track.location)
    audio = audio.set_channels(1).set_frame_rate(FRAME_RATE)
    samples = audio.get_array_of_samples()
    start = np.random.randint(BUFFER, len(samples) - BUFFER)
    chunk = samples[start:start + int(CHUNK_SIZE * FRAME_RATE)]

    # Plot specgram and save to file
    filename = ('specgrams/%s-%s-%s.png' % (track.trackid, start, track.bpm))
    plt.figure(figsize=(2.56, 0.64), frameon=False).add_axes([0, 0, 1, 1])
    plt.axis('off')
    plt.specgram(chunk, Fs=FRAME_RATE)
    plt.savefig(filename)
    plt.close()

    # Perform random variations to the BPM
    frame_rate = FRAME_RATE
    bpm = track.bpm
    variation = 1 - BPM_VARIATION_AMOUNT + (
        np.random.random() * BPM_VARIATION_AMOUNT * 2)
    bpm *= variation
    bpm = round(bpm, 2)
    # I thought this next line should have been /= but that stretched the wrong way?
    frame_rate *= (bpm / track.bpm)

    # Read audio data from file
    chunk = samples[start:start + int(CHUNK_SIZE * frame_rate)]

    # Plot specgram and save to file
    filename = ('specgrams/%s-%s-%s.png' % (track.trackid, start, bpm))
    plt.figure(figsize=(2.56, 0.64), frameon=False).add_axes([0, 0, 1, 1])
    plt.axis('off')
    plt.specgram(chunk, Fs=frame_rate)
    plt.savefig(filename)
    plt.close()
I thought that changing the Fs parameter given to the specgram function would stretch the data along the x-axis, but instead it seems to resize the whole graph and introduce white space at the top of the image in strange and unpredictable ways. I'm sure I'm missing something, but I can't see what it is.
The framerate is a fixed number that depends only on your data; if you change it you will effectively "stretch" the x-axis, but in the wrong way. For example, if you have 1000 data points that correspond to 1 second, your framerate (or better, sampling frequency) is 1000. If your signal is a simple 200 Hz sine whose frequency slightly increases over time, the specgram will be:
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 1, 1000)
signal = np.sin((200*2*np.pi + 200*t) * t)
frame_rate = 1000
plt.specgram(signal, Fs=frame_rate)
If you change the framerate, both the x- and y-axis scales will be wrong. If you set the framerate to 500 you will have:
t = np.linspace(0, 1, 1000)
signal = np.sin((200*2*np.pi + 200*t) * t)
frame_rate = 500
plt.specgram(signal, Fs=frame_rate)
The plot is very similar, but this time it is wrong: the x-axis shows almost 2 seconds when it should show only 1, and moreover the starting frequency reads 100 Hz instead of 200 Hz.
To conclude, the sampling frequency you set needs to be the correct one. If you want to stretch the plot you can use something like plt.xlim(0.2, 0.4). If you want to avoid the white band on top of the plot you can manually set the ylim to be half the frame rate:
plt.ylim(0, frame_rate/2)
This works because of basic properties of the Fourier transform and the Nyquist-Shannon sampling theorem: a signal sampled at rate Fs contains no frequency content above Fs/2.
The solution to my problem was to set the xlim and ylim of the plot. Here is the code from my testing file, in which I finally got rid of all the odd whitespace:
from pydub import AudioSegment
import numpy as np
import matplotlib.pyplot as plt
BUFFER = 5
FRAME_RATE = 22050
SAMPLE_LENGTH = 2
def plot(audio_file, bpm, variation=1):
    audio = AudioSegment.from_file(audio_file)
    audio = audio.set_channels(1).set_frame_rate(FRAME_RATE)
    samples = audio.get_array_of_samples()
    chunk_length = int(FRAME_RATE * SAMPLE_LENGTH * variation)
    start = np.random.randint(
        BUFFER * FRAME_RATE,
        len(samples) - (BUFFER * FRAME_RATE) - chunk_length)
    chunk = samples[start:start + chunk_length]
    plt.figure(figsize=(5.12, 2.56)).add_axes([0, 0, 1, 1])
    plt.specgram(chunk, Fs=FRAME_RATE * variation)
    plt.xlim(0, SAMPLE_LENGTH)
    plt.ylim(0, FRAME_RATE / 2 * variation)
    plt.savefig('specgram-%f.png' % (bpm * variation))
    plt.close()
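For completeness, a hypothetical invocation (the file name and BPM value here are made up for illustration):
# hypothetical usage: 'song.wav' and 128 BPM are placeholder values
plot('song.wav', bpm=128)                  # unmodified clip
plot('song.wav', bpm=128, variation=1.1)   # same clip, sped up by ~10%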
numpy_frames_original holds the frames of a video. For the problem I am trying to tackle, I thought it would be a good idea to find the average of all of these frames and subtract it from each frame, giving numpy_frames. To do this I wrote the code below:
arr = np.zeros((height, width), float)
for i in range(0, number_frames):
    imarr = np.array(numpy_frames_original[i, :, :].astype(float))
    arr = arr + imarr / number_frames
img_avg = np.array(np.round(arr), dtype=np.uint8)
numpy_frames = np.array(np.absolute(np.round(np.array(numpy_frames_original.astype(float)) - np.array(img_avg.astype(float)))), dtype=np.uint8)
Now I have decided it would be better not to take the average over all of the frames, but instead, for each frame, to subtract the average of the 100 frames closest to it.
I'm not sure how to write this code.
For example, frame 0 would average frames 0-99 and subtract that average. Frame 3 would also average frames 0-99 and subtract; frame 62 would average frames 12-112 and subtract.
Thanks
I think this does what you need.
import numpy

# Create some fake data
frame_count = 10
frame_width = 2
frame_height = 3
frames = numpy.random.randn(frame_count, frame_width, frame_height).astype(numpy.float32)
print('Original frames\n', frames)

# Compute the modified frames over a specified range
mean_range_size = 2
assert frame_count >= mean_range_size * 2
start_mean = frames[:2 * mean_range_size + 1].mean(axis=0)
start_frames = frames[:mean_range_size] - start_mean
middle_frames = numpy.empty((frames.shape[0] - 2 * mean_range_size,
                             frames.shape[1], frames.shape[2]),
                            dtype=frames.dtype)
for index in range(middle_frames.shape[0]):
    middle_frames[index] = frames[mean_range_size + index] - \
        frames[index:index + 2 * mean_range_size + 1].mean(axis=0)
end_mean = frames[-2 * mean_range_size - 1:].mean(axis=0)
end_frames = frames[-mean_range_size:] - end_mean
modified_frames = numpy.concatenate([start_frames, middle_frames, end_frames])
print('Modified frames\n', modified_frames)
Note the assert: the code will need to be modified if your shortest sequence is shorter than the total range size (e.g. 100 frames).
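If SciPy is available, a more vectorized sketch of the same rolling-mean subtraction is possible with scipy.ndimage.uniform_filter1d. Note the edge handling differs slightly: mode='nearest' pads by repeating the border frames rather than shifting the window inward as above.
import numpy as np
from scipy.ndimage import uniform_filter1d

# fake data: 500 frames of 2x3 pixels
frames = np.random.randn(500, 2, 3).astype(np.float32)

# rolling mean of 101 frames centred on each frame, along the frame axis
rolling_mean = uniform_filter1d(frames, size=101, axis=0, mode='nearest')
modified_frames = frames - rolling_mean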
I would like to calculate event/stimulus-triggered averages in a computationally efficient way.
Assuming I have got a signal, e.g.
signal = [random.random() for i in range(0, 1000)]
with n_signal datapoints
n_signal = len(signal)
I know that this signal is sampled with a rate of
Fs = 25000 # Hz
In this case I know that the total time of the signal
T_sec = n_signal / float(Fs)
At specific times, certain events occur, e.g.
t_events = [0.01, 0.017, 0.018, 0.022, 0.034, 0.0345, 0.03456]
Now I would like to find the signal from a certain time before these events, e.g.
t_bef = 0.001
until a certain time after these events, e.g.
t_aft = 0.002
And once I have got all of these chunks of the signal, I would like to average these.
In the past I would have created the time vector of the signal
t_signal = numpy.linspace(0, T_sec, n_signal)
and looked for the indices of all t_events in t_signal, e.g. using numpy.searchsorted.
Since I know the sampling rate of the signal, this can be done much more quickly:
indices = [int(i * Fs) for i in t_events]
This saves me the memory for t_signal and I do not have to go through the whole signal to find my indices.
Next, I would determine how many data samples t_bef and t_aft correspond to:
nsamples_t_bef = int(t_bef * Fs)
nsamples_t_aft = int(t_aft * Fs)
and I would save the signal chunks in a list:
signal_chunks = list()
for i in range(0, len(t_events)):
    signal_chunks.append(signal[indices[i] - nsamples_t_bef : indices[i] + nsamples_t_aft])
And finally I average these:
event_triggered_average = numpy.mean(signal_chunks, axis = 0)
If I am interested in the time vector, I calculate it with
t_event_triggered_average = numpy.linspace(-t_signal[nsamples_t_bef], t_signal[nsamples_t_aft], nsamples_t_bef + nsamples_t_aft)
Now my questions: Is there a computationally more efficient way to do this? If I have a signal with many data points and many events, this computation can take a while. Is a list the best data structure for saving these chunks? Do you know how to get the chunks of data more quickly? Maybe using a buffer?
Thanks in advance for your comments and advice.
Minimal working example
import numpy
import random
random.seed(0)
signal = [random.random() for i in range(0, 1000)]
# sampling rate
Fs = 25000 # Hz
# total time of the signal
n_signal = len(signal)
T_sec = n_signal / float(Fs)
# time of events of interest
t_events = [0.01, 0.017, 0.018, 0.022, 0.034, 0.0345, 0.03456]
# and their corresponding indices
indices = [int(i * Fs) for i in t_events]
# define the time window of interest around each event
t_bef = 0.001
t_aft = 0.002
# and the corresponding index offset
nsamples_t_bef = int(t_bef * Fs)
nsamples_t_aft = int(t_aft * Fs)
# vector of signal times
t_signal = numpy.linspace(0, T_sec, n_signal)
signal_chunks = list()
for i in range(0, len(t_events)):
    signal_chunks.append(signal[indices[i] - nsamples_t_bef : indices[i] + nsamples_t_aft])
# average signal value across chunks
event_triggered_average = numpy.mean(signal_chunks, axis = 0)
# not sure what's going on here
t_event_triggered_average = numpy.linspace(-t_signal[nsamples_t_bef],
                                           t_signal[nsamples_t_aft],
                                           nsamples_t_bef + nsamples_t_aft)
Since your signal is defined on a regular grid, you can do some arithmetic to find the indices of all the samples you require. Then you can construct the array of chunks with a single indexing operation.
import numpy as np
# Making some test data
n_signal = 1000
signal = np.random.rand(n_signal)
Fs = 25000 # Hz
t_events = np.array([0.01, 0.017, 0.018, 0.022, 0.034, 0.0345, 0.03456])
# Preferences
t_bef = 0.001
t_aft = 0.002
# The number of samples in a chunk
nsamples = int((t_bef+t_aft) * Fs)
# Create a vector from 0 up to nsamples
sample_idx = np.arange(nsamples)
# Calculate the index of the first sample for each chunk
# Require integers, because it will be used for indexing
start_idx = ((t_events - t_bef) * Fs).astype(int)
# Use broadcasting to create an array with indices
# Each row contains consecutive indices for each chunk
idx = start_idx[:, None] + sample_idx[None, :]
# Get all the chunks using fancy indexing
signal_chunks = signal[idx]
# Calculate the average like you did earlier
event_triggered_average = signal_chunks.mean(axis=0)
Note that the line with .astype(int) does not round to the nearest integer; it rounds towards zero.
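If rounding to the nearest sample matters for your application, a small tweak (numpy.rint is standard NumPy; everything else mirrors the snippet above):
# round to the nearest sample instead of truncating towards zero
start_idx = np.rint((t_events - t_bef) * Fs).astype(int)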