PyTorch audio: map a time point to a location in the spectrogram - python

I have an audio file that lasts 294 seconds (the sampling rate is 50000 Hz). I use torchaudio to compute its spectrogram as follows:
T.MelSpectrogram(sample_rate=50000, n_fft=1024, hop_length=512)
Say there is an important event in the original .wav audio at exactly second 57. How can I determine exactly which pixel that event will start at on the spectrogram?
Or, put simply, how can I map a moment in the audio to a location in the spectrogram?
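A minimal sketch of the mapping, assuming torchaudio's default `center=True` framing, in which frame `i` is centred on sample `i * hop_length`. The helper names `time_to_frame`, `frame_to_time` and `num_frames` are hypothetical and not part of torchaudio.

```python
def time_to_frame(t_seconds, sample_rate, hop_length):
    """Index of the frame whose centre lies at or just before t_seconds.

    Assumes center=True framing: frame i is centred on sample i * hop_length.
    """
    sample_index = int(round(t_seconds * sample_rate))
    return sample_index // hop_length


def frame_to_time(frame, sample_rate, hop_length):
    """Time in seconds of the centre of a frame (inverse of time_to_frame)."""
    return frame * hop_length / sample_rate


def num_frames(num_samples, hop_length):
    """Number of frames produced under the same center=True assumption."""
    return 1 + num_samples // hop_length


sample_rate = 50000
hop_length = 512

frame = time_to_frame(57.0, sample_rate, hop_length)
total = num_frames(294 * sample_rate, hop_length)
print(frame, total)  # 5566 28711
```

The frame index is the position along the last (time) axis of the spectrogram tensor. If the spectrogram is later resized into an image, the pixel column scales proportionally, for example `frame / total * image_width`.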

Related

Get Audio Amplitude Value from a Video Streaming Source

I need to calculate the amplitude of the audio from a video streaming source in .asf format, in Python. So far I have converted it to a .wav file and used the wave Python package, but I need to do this in real time. In short, I need to perform the following steps:
Continuously read the input video stream
Pre-process the audio signal
Calculate the amplitude in a given interval
Currently I use Python's wave library to read the stored .wav clip, then extract the amplitude from the wave.readframes() output like this:
wf = wave.open()
data = wf.readframes()
amplitude = data[2]
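A minimal sketch of the "amplitude in a given interval" step, assuming the audio has already been decoded to 16-bit mono PCM in a WAV container (decoding the .asf stream itself is not covered here). The function name `interval_amplitudes` and the in-memory test tone are hypothetical, used only for illustration.

```python
import array
import io
import math
import wave


def interval_amplitudes(wav_source, interval_seconds):
    """Return a list of (peak, rms) tuples, one per interval of a 16-bit mono WAV."""
    results = []
    with wave.open(wav_source, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        frames_per_interval = max(1, int(wf.getframerate() * interval_seconds))
        while True:
            raw = wf.readframes(frames_per_interval)
            if not raw:
                break
            samples = array.array("h", raw)  # signed 16-bit samples
            peak = max(abs(s) for s in samples)
            rms = math.sqrt(sum(s * s for s in samples) / len(samples))
            results.append((peak, rms))
    return results


# Demo: one second of a 440 Hz tone at 8 kHz, held in memory instead of on disk.
rate = 8000
tone = array.array(
    "h", (round(10000 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate))
)
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(rate)
    out.writeframes(tone.tobytes())
buffer.seek(0)

amplitudes = interval_amplitudes(buffer, 0.25)
print(amplitudes)
```

Note that a single element such as `data[2]` is one byte of one sample, not an amplitude; the raw bytes have to be unpacked into samples before a peak or RMS value can be taken.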

What algorithms are used in audio limiters?

I'd like to recreate one in NumPy or another Python library.
I mean a function that does not simply clip all the samples above the threshold level or normalize the whole audio, but one that takes an audio waveform in the range (-1; 1), an attack time, a decay time and a threshold level in dB, reduces the volume of samples above the threshold without distortion, and outputs a new sound.
All the solutions I've found so far either add distortion, like ffmpeg, or don't use 64-bit floating-point calculations, like SoX.
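One possible shape for such a function is a feed-forward limiter built on an envelope follower. The sketch below is an illustration under assumptions, not the algorithm used by ffmpeg, SoX or any other tool: the name `limit`, the one-pole smoothing and the gain rule are all choices made here, and the "decay time" from the question is called `release_s`. Python floats are 64-bit, so the arithmetic is double precision. Without look-ahead, brief overshoots above the threshold remain possible, and the gain ripples slightly within each cycle.

```python
import math


def limit(samples, sample_rate, threshold_db, attack_s, release_s):
    """Reduce gain while a smoothed level estimate exceeds the threshold."""
    threshold = 10.0 ** (threshold_db / 20.0)  # dB -> linear amplitude
    attack_coeff = math.exp(-1.0 / (attack_s * sample_rate))
    release_coeff = math.exp(-1.0 / (release_s * sample_rate))
    envelope = 0.0
    out = []
    for x in samples:
        level = abs(x)
        # Follow rising levels quickly (attack) and falling levels slowly (release).
        coeff = attack_coeff if level > envelope else release_coeff
        envelope = coeff * envelope + (1.0 - coeff) * level
        gain = 1.0 if envelope <= threshold else threshold / envelope
        out.append(x * gain)
    return out


# Demo: a full-scale 440 Hz tone limited to -6 dB, and a quiet tone left untouched.
rate = 44100
loud = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
quiet = [0.1 * s for s in loud]

limited = limit(loud, rate, threshold_db=-6.0, attack_s=0.001, release_s=0.1)
untouched = limit(quiet, rate, threshold_db=-6.0, attack_s=0.001, release_s=0.1)

steady_peak = max(abs(v) for v in limited[rate // 2:])
print(steady_peak, untouched == quiet)
```

The per-sample loop is written in plain Python for clarity; for long recordings the same recurrence would need vectorising or compiling to be practical.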

Extracting features from audio signal

I have just started to work on data in the form of audio. I am using librosa as a tool. My project requires me to extract features like:
Total duration of the audio
Minimum Intensity of the audio signal
Maximum Intensity of the audio signal
Mean Intensity of the audio signal
Jitter
Rate of speaking
Number of Pauses
Maximum Duration of Pauses
Average Duration of Pauses
Total Duration of Pauses
Although I know these terms, I have no idea how to extract them from an audio file. Are they built into librosa.feature in some form, or do we need to calculate them manually? Can someone guide me on how to proceed?
I know this job can be done with software like Praat, but I need to do it in Python.
Praat can be used for spectral analysis (spectrograms), pitch
analysis, formant analysis, intensity analysis, jitter, shimmer, and
voice breaks.
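Several of the listed features (duration, intensity statistics and pause statistics) can be derived from frame-wise RMS energy. The sketch below uses plain Python on a synthetic signal; the helper names, the silence threshold and the minimum pause length are assumptions chosen for illustration. In librosa, `librosa.get_duration`, `librosa.feature.rms` and `librosa.effects.split` cover similar ground. Jitter and rate of speaking need pitch and syllable analysis and are not addressed here.

```python
import math


def frame_rms(samples, frame_length):
    """RMS intensity of consecutive, non-overlapping frames."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_length]) / frame_length)
        for i in range(0, len(samples) - frame_length + 1, frame_length)
    ]


def pause_stats(rms_values, frame_seconds, silence_threshold, min_pause_seconds):
    """Count and measure runs of frames whose RMS is below silence_threshold."""
    pauses = []
    run = 0
    for value in list(rms_values) + [float("inf")]:  # sentinel flushes the last run
        if value < silence_threshold:
            run += 1
            continue
        if run and run * frame_seconds >= min_pause_seconds:
            pauses.append(run * frame_seconds)
        run = 0
    total = sum(pauses)
    return {
        "count": len(pauses),
        "total": total,
        "max": max(pauses, default=0.0),
        "mean": total / len(pauses) if pauses else 0.0,
    }


# Synthetic "speech": tone, 0.5 s pause, tone, 0.3 s pause, tone (1 kHz sample rate).
sample_rate = 1000


def tone(n_samples):
    return [0.5 * math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(n_samples)]


audio = tone(1000) + [0.0] * 500 + tone(1000) + [0.0] * 300 + tone(1000)

duration = len(audio) / sample_rate
rms = frame_rms(audio, frame_length=50)  # 50 ms frames
intensity = {"min": min(rms), "max": max(rms), "mean": sum(rms) / len(rms)}
stats = pause_stats(rms, frame_seconds=0.05, silence_threshold=0.01, min_pause_seconds=0.2)
print(duration, intensity, stats)
```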

signal.spectrogram returns too many Hz

I was just getting started with code to pre-process some audio data in order to later feed it to a neural network. Before explaining my actual problem in more depth, I should mention that I took the reference for how to do the project from this site. I also used some code taken from this post and read the signal.spectrogram doc and this post for more information.
With all of the sources mentioned above, I have managed to get the wav audio file as a numpy array and to plot both its amplitude and its spectrogram. These represent a recording of me saying the word "command" in Spanish.
The strange thing is that I searched the internet and found that the human voice spectrum lies between 80 Hz and 8 kHz, so just to be sure I compared this output with the spectrogram that Audacity returned. As you can see, the Audacity one seems more consistent with the information I found, as its frequency range is the one expected for human voices.
So that takes me to my final question: am I doing something wrong when reading the audio or generating the spectrogram, or am I perhaps having plot issues?
By the way, I'm new to both Python and signal processing, so thanks in advance for your patience.
Here is the code I'm actually using:
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal
from scipy.io import wavfile

def espectrograma(wav):
    sample_rate, samples = wavfile.read(wav)
    frequencies, times, spectrogram = signal.spectrogram(samples, sample_rate, nperseg=320, noverlap=16, scaling='density')
    #dBS = 10 * np.log10(spectrogram) # convert to dB
    plt.subplot(2,1,1)
    plt.plot(samples[0:3100])
    plt.subplot(2,1,2)
    plt.pcolormesh(times, frequencies, spectrogram)
    plt.imshow(spectrogram,aspect='auto',origin='lower',cmap='rainbow')
    plt.ylim(0,30)
    plt.ylabel('Frecuencia [kHz]')
    plt.xlabel('Fragmento[20ms]')
    plt.colorbar()
    plt.show()
The computation of the spectrogram seems fine to me. If you plot the spectrogram on a log scale, you should see something more similar to the Audacity plots you referenced. So uncomment your line
#dBS = 10 * np.log10(spectrogram) # convert to dB
and then use the variable dBS for the plotting instead of spectrogram in
plt.pcolormesh(times, frequencies, spectrogram)
plt.imshow(spectrogram,aspect='auto',origin='lower',cmap='rainbow')
The spectrogram uses a Fourier transform to convert your time-series data into the frequency domain.
The maximum frequency that can be measured is (sampling frequency) / 2, so in this case it would seem that your sampling frequency is 60 kHz?
Anyway, regarding your question: it may be correct that the human voice spectrum lies within this range, but the Fourier transform is never perfect. I would simply adjust your y-axis to look specifically at these frequencies.
It seems to me that you are calculating your spectrogram correctly, at least as long as you are reading the sample_rate and samples correctly.
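The relationship between the sample rate, the maximum measurable (Nyquist) frequency and the spacing of the frequency bins can be sketched as follows. The 16 kHz sample rate is only an assumption for illustration (it would make `nperseg=320` span 20 ms), and the helper names are hypothetical. It may also be worth keeping in mind that `plt.imshow` without an `extent` argument labels the y-axis with row indices rather than with the values in `frequencies`, so a row index has to be converted to Hz before it is read as a frequency.

```python
def nyquist_hz(sample_rate):
    """Highest frequency that can be represented at this sample rate."""
    return sample_rate / 2.0


def num_bins(nperseg):
    """Number of one-sided frequency bins when the FFT length equals nperseg."""
    return nperseg // 2 + 1


def bin_to_hz(bin_index, sample_rate, nperseg):
    """Centre frequency of a bin; bins are spaced sample_rate / nperseg apart."""
    return bin_index * sample_rate / nperseg


sample_rate = 16000  # assumed value, not taken from the question
nperseg = 320

print(nyquist_hz(sample_rate))              # 8000.0
print(num_bins(nperseg))                    # 161
print(bin_to_hz(30, sample_rate, nperseg))  # 1500.0
```

Under these assumed values, row 30 of the spectrogram array would correspond to 1500 Hz, not 30 kHz.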

Python: ultrasonic to audio range

I'm using Python 2.7.3 and I have a question relating to ultrasonic frequencies:
Sampling at 40 MHz, I measure an ultrasonic signal that is a convolution of a 1 MHz resonant frequency and an envelope, where the envelope depends on the medium through which the ultrasonic signal travels. I would like to listen to this received signal. My question is:
How may I map the received signal into the range of human hearing? Or, put another way:
How may I down-sample and convert this signal to an audio frequency (keeping the envelope shape and maybe even elongating the time so it's longer)?
Here is a simulated signal, but it typically looks like this in any case:
import numpy as np
import matplotlib.pylab as plt
# resonant frequency is 1MHz
f = 1e6
Omega = 2*np.pi*f
# sample at 40 MHz (ts = 25 ns), for about 1000 samples:
t = np.arange(0,25e-6,25e-9)
y = np.sin(Omega*t) * (t**2) * np.exp(-t/3e-6)
y /= max(y)
plt.plot(y)
plt.grid()
plt.xlabel('sample')
plt.ylabel('value')
plt.show()
There are two common answers to your question:
Just play it at a fraction of the sampling frequency. If you play your signal back with, e.g., a 44.1 kHz sampling frequency, you will get an audible tone of approximately 1000 Hz and a signal length of roughly 20 ms. (I picked 44.1 kHz as it is certainly one of the rates any hardware can play back.) This is probably easiest to accomplish by saving your signal to a WAV file (see the wave module), which you can then play back with anything that plays WAV files.
The standard method would be to mix the resonant frequency down to audible frequencies. This is the fundamental operation in radios. Mathematically it involves multiplying by a carrier frequency that is close to the resonant frequency, and then low-pass filtering the result. The operation can also be viewed as shifting the frequency spectrum closer to 0. However, as your signal envelope is very short (about 25 µs for 1000 samples at 40 MHz), this would only result in a short click and thus not be useful here.
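The mixing step in the second option can be written out as a short derivation. The symbols are assumptions introduced here for illustration: $e(t)$ is the envelope, $f_0$ the resonant frequency (1 MHz in the question) and $f_c$ the carrier frequency chosen for mixing.

```latex
\begin{align*}
y(t) &= e(t)\,\sin(2\pi f_0 t) \\
y(t)\cos(2\pi f_c t) &= \tfrac{1}{2}\,e(t)\left[\sin\bigl(2\pi (f_0 - f_c)\,t\bigr) + \sin\bigl(2\pi (f_0 + f_c)\,t\bigr)\right]
\end{align*}
```

A low-pass filter removes the term at $f_0 + f_c$, leaving the envelope modulating a tone at the difference frequency $f_0 - f_c$. The duration of the signal is unchanged by this operation, which is why a very short envelope still produces only a click.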
Other solutions can be figured out, if there are further requirements. The envelope frequency and the resonant frequency seem to be relatively close to each other, which limits the options. If you need to do this for a real time signal, then the challenge will be elongating the envelope, because then the envelope has to be detected. Otherwise it is not possible to stretch the time.
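For the first option above (playing the capture back at a lower sampling rate), the arithmetic and the WAV export can be sketched as follows. The signal is the simulated one from the question, rebuilt without NumPy so the example is self-contained; the variable names and the in-memory buffer are assumptions made for illustration.

```python
import io
import math
import struct
import wave

capture_rate = 40e6    # Hz, sampling rate of the measurement
playback_rate = 44100  # Hz, rate written into the WAV header
resonant_hz = 1e6
n_samples = 1000

# Playing the same samples back more slowly scales every frequency down.
slowdown = capture_rate / playback_rate
audible_hz = resonant_hz / slowdown              # about 1102.5 Hz
duration_ms = 1000 * n_samples / playback_rate   # about 22.7 ms

# Simulated signal from the question: 1 MHz tone under a fast envelope.
ts = 1.0 / capture_rate
y = [
    math.sin(2 * math.pi * resonant_hz * n * ts) * (n * ts) ** 2 * math.exp(-(n * ts) / 3e-6)
    for n in range(n_samples)
]
peak = max(abs(v) for v in y)
pcm = struct.pack("<%dh" % n_samples, *(int(round(32767 * v / peak)) for v in y))

# Write 16-bit mono PCM; pass a file name instead of the buffer to save to disk.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(playback_rate)
    wf.writeframes(pcm)

print(audible_hz, duration_ms, len(buf.getvalue()))
```

The samples themselves are not changed; only the rate declared in the file header differs from the capture rate, which is what stretches the time axis.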
I wanted to make this a comment, but I have some examples.
There would be many ways to represent this. You could use sound as an encoding medium.
If your original waveform has only a few properties, like a frequency (constant) and an envelope (variable, can be approximated), you can, for example, encode the frequency in binary form as a short sequence of sounds and silences (1 = generate sound, 0 = generate silence). You could then represent the amplitude with a continuous sound of variable frequency (e.g. a 100 Hz sound would represent zero amplitude, and a 10000 Hz sound would represent maximum amplitude). To rebuild the original envelope, you could use interpolation.
I hope you see my point.
