I am using freq_from_crossings from here (I haven't changed the code). My input is an audio file with an acoustic guitar E2 note and nothing else (as my microphone is pretty bad, the sound is not very clear).
This is the waveform:
And this is the spectrogram I am getting:
From the spectrogram it is pretty clear that the loudest harmonic corresponds to the E2 note. However, freq_from_crossings returns 415.461966359 which is not at all the pitch played. What components could have gone wrong?
Thanks
A waveform that is not a single pure sinewave can have more zero crossings than once per pitch period. Within one period, it can include lots of "wiggles" that cross zero. The harmonic content of your guitar note spectrogram shows that the total waveform is far from being a single pure sinewave. It's also changing over time.
Therefore, estimating pitch frequency from zero crossings won't work for these types of guitar sounds.
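To illustrate the point above, here is a minimal sketch on a synthetic signal (not your recording). The helper `freq_from_rising_crossings` and the harmonic amplitudes are assumptions made up for the example; it is not the `freq_from_crossings` code you are using. A pure sine gives the right answer, while the same fundamental with a strong fifth harmonic is counted roughly five times too high. For what it is worth, five times E2 (5 × 82.41 Hz ≈ 412 Hz) is close to the 415 Hz you got, which would be consistent with this effect.

```python
import numpy as np

def freq_from_rising_crossings(x, fs):
    """Estimate frequency from the average spacing of rising zero crossings."""
    negative = x < 0
    # Indices where the signal goes from negative to non-negative.
    rising = np.flatnonzero(negative[:-1] & ~negative[1:])
    if len(rising) < 2:
        return 0.0
    return fs * (len(rising) - 1) / float(rising[-1] - rising[0])

fs = 44100                      # assumed sample rate
f0 = 82.41                      # E2 fundamental in Hz
t = np.arange(fs) / fs          # one second of samples

pure = np.sin(2 * np.pi * f0 * t)
# Weak fundamental plus a strong 5th harmonic (amplitudes are made up).
rich = 0.3 * np.sin(2 * np.pi * f0 * t) + 1.0 * np.sin(2 * np.pi * 5 * f0 * t)

est_pure = freq_from_rising_crossings(pure, fs)
est_rich = freq_from_rising_crossings(rich, fs)
print(est_pure, est_rich)
```

The second estimate lands near 5 × f0 even though the perceived pitch of such a signal is still f0.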
In my experience, zero-crossings and auto-correlation are terrible ways to attempt pitch detection -- even on a monophonic signal. Consider using a method that employs an FFT (a fast DFT) to acquire the initial frequency activity.
https://en.wikipedia.org/wiki/Transcription_(music)#Pitch_detection
https://github.com/CreativeDetectors/PitchScope_Player
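As a rough sketch of the FFT-based starting point mentioned above (not the method used by the linked project), the following picks the strongest bin of a windowed spectrum. The function name `freq_from_fft_peak` and the synthetic test note are assumptions for illustration.

```python
import numpy as np

def freq_from_fft_peak(x, fs):
    """Return the frequency of the strongest spectral bin (DC excluded)."""
    windowed = x * np.hanning(len(x))      # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    spectrum[0] = 0.0                      # ignore any DC offset
    peak_bin = int(np.argmax(spectrum))
    return peak_bin * fs / float(len(x))

fs = 44100                                 # assumed sample rate
f0 = 82.41                                 # E2 fundamental in Hz
t = np.arange(2 * fs) / fs                 # two seconds -> 0.5 Hz bin spacing
note = (np.sin(2 * np.pi * f0 * t)
        + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
        + 0.25 * np.sin(2 * np.pi * 3 * f0 * t))

estimate = freq_from_fft_peak(note, fs)
print(estimate)
```

Peak picking only reports the loudest partial, so it gives the pitch only when the fundamental is the strongest component, as the spectrogram in the question suggests it is here.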
I'd like to recreate it in numpy or another Python library.
I mean a function that does not simply clip all the samples above the threshold level or normalize the whole audio, but one that takes an audio waveform in the range (-1, 1), an attack time, a decay time and a threshold level in dB, reduces the volume of the samples above the threshold without distortion, and outputs a new sound.
All the solutions I've found so far either add distortion, like ffmpeg, or don't use 64-bit floating point calculations, like SoX.
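A minimal sketch of what such a function could look like in NumPy, under several assumptions: the name `compress`, the `ratio` parameter and the one-pole envelope follower are choices made for this example and are not taken from any existing tool. All arithmetic is done in 64-bit floats. Smoothing the gain with attack and decay times reduces audible distortion compared with hard clipping, but it does not eliminate it entirely.

```python
import numpy as np

def compress(x, sample_rate, threshold_db=-20.0, ratio=4.0,
             attack_s=0.005, decay_s=0.050):
    """Reduce gain while the smoothed signal level is above threshold_db."""
    x = np.asarray(x, dtype=np.float64)
    # One-pole smoothing coefficients for rising and falling levels.
    a_attack = np.exp(-1.0 / (attack_s * sample_rate))
    a_decay = np.exp(-1.0 / (decay_s * sample_rate))

    level = np.abs(x)
    envelope = np.empty_like(level)
    previous = 0.0
    for n, value in enumerate(level):
        coeff = a_attack if value > previous else a_decay
        previous = coeff * previous + (1.0 - coeff) * value
        envelope[n] = previous

    envelope_db = 20.0 * np.log10(np.maximum(envelope, 1e-12))
    over_db = np.maximum(envelope_db - threshold_db, 0.0)
    gain_db = -over_db * (1.0 - 1.0 / ratio)   # assumed ratio-style gain law
    return x * 10.0 ** (gain_db / 20.0)

sample_rate = 8000                              # assumed for the example
t = np.arange(sample_rate) / sample_rate
loud = 0.9 * np.sin(2 * np.pi * 440 * t)
quiet = 0.01 * np.sin(2 * np.pi * 440 * t)

loud_out = compress(loud, sample_rate)
quiet_out = compress(quiet, sample_rate)
```

With these settings the quiet signal stays below the threshold and passes through unchanged, while the loud one is attenuated once the envelope has risen.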
Backstory
I started messing with electronics, and realized I need an oscilloscope. I went to buy the oscilloscope (for like $40) online and watched tutorials on how to use them. I stumbled upon a video using the "X-Y" function of the oscilloscope to draw images; I thought that was cool. I tried searching how to do this from scratch and learned you need to convert the image into the frequency domain, somehow convert that to an audio signal, and send the signal to the two channels on the oscilloscope from the left and right channels of the audio output. So now I am trying to do the image processing part.
What I Got So Far
Choosing an Image
First thing I did was to create an nxn image using some drawing software. I've read online that the total number of pixels of the image should be a power of two. I don't know why, but I created 256x256 pixel images to minimize calculation time. Here is the image I used for this example.
I kept the image simple, so I can vividly see the symmetry when it is transformed. Therefore, if there is no symmetry, then there must be something wrong.
The MATLAB Code
The first thing I did was read the image, convert to gray scale, change data type, and grab the size of the image (for size variability for later use).
%Read image
img = imread('tets.jpg');
%Convert image to gray scale
grayImage = rgb2gray(img);
%Incompatibility of data type. uint8 type vs double
grayImage = double(grayImage);
%Grab size of image
[nx, ny, nz] = size(grayImage);
The Algorithm
This is where things get a bit hazy. I am somewhat familiar with the Fourier Transform due to some Mechanical Engineering classes, but the topic was broadly introduced and never really fundamentally part of the course. It was more like, "Hey, check out this thing; but use the Laplace Transformation instead."
So somehow you have to incorporate spatial, amplitude, frequency, and time when doing the calculation. I understand that the spatial coordinates are just the location of each pixel on the image in a matrix or bitmap. I also understand that the amplitude is just the gray scale value from 0-255 of a certain pixel. However, I don't know how to incorporate frequency and time based on the pixel itself. I think I read somewhere that the frequency increases as the y location of the pixel increases, and the time variable increases with the x location. Here's the link (read first part of Part II).
So I tried following the formula as well as other formulas online and this is what I got for the MATLAB code.
if nx ~= ny
    error('Image size must be NxN.'); %for some reason
else
    %prepare transformation matrix
    DFT = zeros(nx,ny);
    %compute transformation for each pixel
    for ii = 1:1:nx
        for jj = 1:1:ny
            amplitude = grayImage(ii,jj);
            DFT(ii,jj) = amplitude * exp(-1i * 2 * pi * ((ii*ii/nx) + (jj*jj/ny)));
        end
    end
    %plot of complex numbers
    plot(DFT, '*');
    %calculate magnitude and phase
    magnitudeAverage = abs(DFT)/nx;
    phase = angle(DFT);
    %plot magnitudes and phase
    figure;
    plot(magnitudeAverage);
    figure;
    plot(phase);
end
This code simply tries to follow this discrete Fourier transform example video that I found on YouTube. After the calculation I plotted the complex numbers in the complex plane. This appears to be in polar coordinates; I don't know why. As stated in the video about the Nyquist limit, I also plotted the average magnitude, as well as the phase angles of the complex numbers. I'll just show you the plots!
The Plots
Complex Numbers
This is the complex plot; I believe it's in polar form instead of cartesian, but I don't know. It appears symmetric too.
Average Amplitude Vs. Sample
The vertical axis is amplitude, and the horizontal axis is the sample number. This looks like the deconstruction of the signal, but then again I don't really know what I am looking at.
Phase Angle Vs. Sample
The vertical axis is the phase angle, and the horizontal axis is the sample number. This looks the most promising because it looks like a plot in the frequency domain, but this isn't supposed to be a plot in the frequency domain; rather, it's a plot in the sample domain? Again, I don't know what I am looking at.
I Need Help Understanding
I need to somehow understand these plots, so I know I am getting the right plot. I believe there may be something very wrong in the algorithm because it doesn't necessarily implement the frequency and time component. So maybe you can tell me how that is done? Or at least guide me?
TLDR;
I am trying to convert images into sound files to display on an oscilloscope. I am stuck on the image processing part. I believe there is something wrong with the MATLAB code (check above) because it doesn't necessarily include the frequency and time component of each pixel. I need help with the code and understanding how to interpret the result, so I know the transformations are correct-ish.
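For reference, here is a minimal sketch in Python/NumPy (rather than MATLAB) of the standard 2D DFT definition. Each output value is a sum over all pixels, indexed by a pair of frequency indices, rather than one pixel multiplied by one exponential. The function name `dft2_naive` and the random test image are assumptions for illustration; `np.fft.fft2` is the library routine it is checked against.

```python
import numpy as np

def dft2_naive(image):
    """2D DFT from the definition:
    F[k, l] = sum_m sum_n image[m, n] * exp(-2j*pi*(k*m/M + l*n/N))."""
    image = np.asarray(image, dtype=np.float64)
    rows, cols = image.shape
    m = np.arange(rows)
    n = np.arange(cols)
    # DFT matrices: entry [k, m] is exp(-2j*pi*k*m/rows), and likewise for columns.
    w_rows = np.exp(-2j * np.pi * np.outer(m, m) / rows)
    w_cols = np.exp(-2j * np.pi * np.outer(n, n) / cols)
    # Transform along the rows and columns in turn.
    return w_rows @ image @ w_cols

rng = np.random.default_rng(0)
test_image = rng.random((8, 8))          # stand-in for a small gray scale image
spectrum = dft2_naive(test_image)
reference = np.fft.fft2(test_image)
print(np.allclose(spectrum, reference))
```

In this formulation the two output indices play the role of horizontal and vertical spatial frequency; there is no separate time variable for an image.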
I was just getting started with some code to pre-process audio data in order to later feed a neural network with it. Before explaining my actual problem in more depth, I should mention that I took the reference for how to do the project from this site. I also used some code taken from this post and read more in the signal.spectrogram doc and this post.
So far, with all of the sources mentioned above, I managed to get the wav audio file as a numpy array and plot both its amplitude and spectrogram. These represent a recording of me saying the word "command" in Spanish.
The strange thing is that I searched on the internet and found that the human voice spectrum lies between 80 Hz and 8 kHz, so just to make sure I compared this output with the spectrogram that Audacity returned. As you can see, that one seems more consistent with the info I found, as the frequency range is the one expected for humans.
So that takes me to my final question: Am I doing something wrong in the process of reading the audio or generating the spectrogram, or do I maybe have plotting issues?
By the way, I'm new to both Python and signal processing, so thanks in advance for your patience.
Here is the code I'm actually using:
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile

def espectrograma(wav):
    sample_rate, samples = wavfile.read(wav)
    frequencies, times, spectrogram = signal.spectrogram(samples, sample_rate, nperseg=320, noverlap=16, scaling='density')
    #dBS = 10 * np.log10(spectrogram) # convert to dB
    plt.subplot(2,1,1)
    plt.plot(samples[0:3100])
    plt.subplot(2,1,2)
    plt.pcolormesh(times, frequencies, spectrogram)
    plt.imshow(spectrogram,aspect='auto',origin='lower',cmap='rainbow')
    plt.ylim(0,30)
    plt.ylabel('Frecuencia [kHz]')
    plt.xlabel('Fragmento[20ms]')
    plt.colorbar()
    plt.show()
The computation of the spectrogram seems fine to me. If you plot the spectrogram on a log scale you should observe something more similar to the Audacity plots you referenced. So uncomment your line
#dBS = 10 * np.log10(spectrogram) # convert to dB
and then use the variable dBS for the plotting instead of spectrogram in
plt.pcolormesh(times, frequencies, spectrogram)
plt.imshow(spectrogram,aspect='auto',origin='lower',cmap='rainbow')
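Put together, the dB conversion could be sketched as below. The helper name `spectrogram_db`, the `floor` value (added to avoid taking the log of zero) and the synthetic 440 Hz test tone at an assumed 16 kHz sample rate are all assumptions for illustration; only `signal.spectrogram` and its arguments come from the question.

```python
import numpy as np
from scipy import signal

def spectrogram_db(samples, sample_rate, nperseg=320, noverlap=16, floor=1e-12):
    """Compute a spectrogram and convert the power values to dB."""
    frequencies, times, spectrogram = signal.spectrogram(
        samples, sample_rate, nperseg=nperseg, noverlap=noverlap,
        scaling='density')
    # Clamp tiny values so log10 never sees zero.
    dBS = 10.0 * np.log10(np.maximum(spectrogram, floor))
    return frequencies, times, dBS

sample_rate = 16000                       # assumed, not read from a file
t = np.arange(sample_rate) / sample_rate
samples = np.sin(2 * np.pi * 440 * t)     # stand-in for the recording

frequencies, times, dBS = spectrogram_db(samples, sample_rate)
strongest = frequencies[np.argmax(dBS.mean(axis=1))]
```

The returned `dBS` array can then be passed to `plt.pcolormesh(times, frequencies, dBS)` in place of `spectrogram`.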
The spectrogram uses a Fourier transform to convert your time-series data into the frequency domain.
The maximum frequency that can be measured is (sampling frequency) / 2, so in this case it may seem like your sampling frequency is 60 kHz?
Anyway, regarding your question: it may be correct that the human voice spectrum lies within this range, but the Fourier transform is never perfect. I would simply adjust your y-axis to look specifically at these frequencies.
It seems to me that you are calculating your spectrogram correctly, at least as long as you are reading the sample_rate and samples correctly.
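As a small worked example of that relation, the sketch below lists the frequency of each spectrogram row for `nperseg=320`. The 16 kHz sample rate is purely an assumption (it would make 320 samples equal to the 20 ms in your x label); substitute the value returned by `wavfile.read`. Note also that `plt.imshow` draws against row and column indices unless an `extent` is given, so a y limit of 30 on that plot means row 30, not 30 kHz.

```python
import numpy as np

sample_rate = 16000          # assumed; use the rate returned by wavfile.read
nperseg = 320                # segment length used in the question

# Frequency (Hz) of each row of a one-sided spectrogram.
bin_frequencies = np.fft.rfftfreq(nperseg, d=1.0 / sample_rate)

nyquist = sample_rate / 2.0
bin_spacing = sample_rate / float(nperseg)
print(len(bin_frequencies), bin_spacing, nyquist)
```

Under this assumption there are 161 rows spaced 50 Hz apart, and the top row sits at the 8 kHz Nyquist limit.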
I'm using Python 2.7.3 and I have a question relating to ultrasonic frequencies:
Sampling at 40MHz, I measure an ultrasonic signal that's a convolution of a 1MHz resonant frequency and an envelope - The envelope of which depends on the media through which ultrasonic signal travels. I would like to listen to this received signal, my question is:
How may I map the received signal into the range of human hearing? Or put another way,
How may I down-sample and convert this signal to an audio frequency (keep the envelope shape and maybe even elongate the time so it’s longer).
Simulated signal here, but its typically like this in any case:
import numpy as np
import matplotlib.pylab as plt
# resonant frequency is 1MHz
f = 1e6
Omega = 2*np.pi*f
# sample at 40MHz or ts=25ns, for about 1000 samples:
t = np.arange(0,25e-6,25e-9)
y = np.sin(Omega*t) * (t**2) * np.exp(-t/3e-6)
y /= max(y)
plt.plot(y)
plt.grid()
plt.xlabel('sample')
plt.ylabel('value')
plt.show()
There are two common answers to your question:
Just play it at a fraction of the sampling frequency. If you play your signal back at, e.g., a 44.1 kHz sampling frequency, you will have an audible tone of about 1.1 kHz (1 MHz × 44.1 kHz / 40 MHz) and a signal length of roughly 23 ms (1000 samples at 44.1 kHz). (I picked 44.1 kHz as it is certainly one of the frequencies any hardware can play back.) This is probably easiest to accomplish by saving your signal into a WAV file (see the wave module), which you can then play back with anything that plays WAV files.
The standard method would be to mix the resonant frequency down to audible frequencies. This is the fundamental operation in radios. Mathematically it involves multiplying by a carrier frequency which is close to the resonant frequency, and then low-pass filtering the result. The operation can also be viewed as shifting the frequency spectrum closer to 0. However, as your signal envelope is very fast (the whole simulated burst lasts about 25 µs), this would only result in a short click and thus not be useful here.
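The first option can be sketched with the standard `wave` module as follows. The helper name `signal_to_wav`, the 16-bit mono format and the in-memory buffer used in place of a file name are assumptions for the example; the simulated signal is the one from the question.

```python
import io
import wave
import numpy as np

def signal_to_wav(y, target, playback_rate=44100):
    """Write y as 16-bit mono PCM to be played at playback_rate; return seconds."""
    y = np.asarray(y, dtype=np.float64)
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak                          # normalise to the range [-1, 1]
    pcm = np.round(y * 32767).astype('<i2')   # little-endian 16-bit samples
    with wave.open(target, 'wb') as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(playback_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm) / float(playback_rate)

f = 1e6                                       # resonant frequency, 1 MHz
t = np.arange(0, 25e-6, 25e-9)                # sampled at 40 MHz
y = np.sin(2 * np.pi * f * t) * (t ** 2) * np.exp(-t / 3e-6)

buffer = io.BytesIO()                         # a file name works here as well
duration = signal_to_wav(y, buffer)
audible_hz = f * 44100 / 40e6                 # tone heard on playback
print(duration, audible_hz)
```

Choosing a lower `playback_rate` stretches the sound further and lowers the tone by the same factor.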
Other solutions can be figured out, if there are further requirements. The envelope frequency and the resonant frequency seem to be relatively close to each other, which limits the options. If you need to do this for a real time signal, then the challenge will be elongating the envelope, because then the envelope has to be detected. Otherwise it is not possible to stretch the time.
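As a rough sketch of the mixing step and of envelope detection, the code below multiplies the simulated signal by a complex carrier at exactly the resonant frequency (an assumption; a real receiver would only be close to it) and low-pass filters with a simple moving average. The function name `mix_down` and the filter choice are made up for this example.

```python
import numpy as np

fs = 40e6                                   # sampling frequency, 40 MHz
f = 1e6                                     # resonant frequency, 1 MHz
t = np.arange(1000) / fs
envelope = (t ** 2) * np.exp(-t / 3e-6)
envelope /= envelope.max()
y = np.sin(2 * np.pi * f * t) * envelope    # simulated received signal

def mix_down(y, t, f_lo, fs):
    """Shift the spectrum down by f_lo, low-pass, and return the magnitude."""
    mixed = y * np.exp(-2j * np.pi * f_lo * t)
    # Moving average over one carrier period suppresses the 2*f_lo component.
    taps = int(round(fs / f_lo))
    kernel = np.ones(taps) / taps
    baseband = np.convolve(mixed, kernel, mode='same')
    return 2.0 * np.abs(baseband)

recovered = mix_down(y, t, f, fs)
error = np.max(np.abs(recovered - envelope))
print(error)
```

The recovered envelope is still only about 25 µs long, so to hear it, it would have to be stretched (for example by interpolation) and used to modulate an audible tone.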
I wanted to make this a comment, but I have some examples.
There would be many ways to represent this. You could use sound as an encoding medium.
If your original waveform has few properties, like frequency (constant), and envelope (variable/can be approximated), you can for example encode the frequency in a binary form with a short sequence of sounds and silence (1=generate sound/0=generate silence), you could then represent the amplitude with a constant sound with variable frequency (ex. a 100Hz sound would represent a 0 amplitude, and a 10000Hz sound would represent max amplitude). To rebuild the original envelope, you could use interpolation.
I hope you see my point.
I know that this problem has been solved before, but I've had great difficulty finding any literature describing the algorithms used to process this sort of data. I'm essentially doing some edge finding on a set of 2D data. I want to be able to find a couple of points on an eye diagram (generally used to qualify high speed communications systems), and as I have had no experience with image processing I am struggling to write efficient methods.
As you can probably see, these diagrams are so called because they resemble the human eye. They can vary a great deal in the thickness, slope, and noise, depending on the signal and the system under test. The measurements that are normally taken are jitter (the horizontal thickness of the crossing region) and eye height (measured at either some specified percentage of the width or the maximum possible point). I know this can best be done with image processing instead of a more linear approach, as my attempts so far take several seconds just to find the left side of the first crossing. Any ideas of how I should go about this in Python? I'm already using NumPy to do some of the processing.
Here's some example data; it is formatted as a 1D array with associated x-axis data. For this particular example, it should be split up every 666 points (2 * int((1.0 / 2.5e9) / 1.2e-12)), since the rate of the signal was 2.5 Gb/s, and the time between points was 1.2 ps.
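The splitting described above can be sketched in NumPy like this. The function name `fold_into_traces` and the placeholder data are assumptions for illustration; the constants are the ones quoted in the question.

```python
import numpy as np

bit_rate = 2.5e9                     # signal rate from the question
dt = 1.2e-12                         # time between points, 1.2 ps
points_per_trace = 2 * int((1.0 / bit_rate) / dt)   # two unit intervals

def fold_into_traces(values, points_per_trace):
    """Cut a 1D record into rows of points_per_trace samples each."""
    values = np.asarray(values)
    n_traces = len(values) // points_per_trace
    # Drop the incomplete tail so the reshape is exact.
    return values[:n_traces * points_per_trace].reshape(n_traces, points_per_trace)

values = np.arange(2000, dtype=np.float64)   # placeholder for the real data
traces = fold_into_traces(values, points_per_trace)
print(points_per_trace, traces.shape)
```

Because one unit interval is about 333.3 points, truncating to 333 means that successive rows drift by a fraction of a point, which may matter over many traces.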
Thanks!
Have you tried OpenCV (Open Computer Vision)? It's widely used and has a Python binding.
Not to be a PITA, but are you sure you wouldn't be better off with a numerical approach? All the tools I've seen for eye-diagram analysis go the numerical route; I haven't seen a single one that analyzes the image itself.
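As a hedged sketch of what the numerical route can look like, the code below finds threshold crossing times by linear interpolation and measures their peak-to-peak spread within the unit interval. The function names, the choice of threshold and the sine test signal are assumptions for illustration, not the method of any particular tool.

```python
import numpy as np

def crossing_times(t, v, level):
    """Times at which v crosses level, linearly interpolated between samples."""
    above = v > level
    idx = np.flatnonzero(above[:-1] != above[1:])
    fraction = (level - v[idx]) / (v[idx + 1] - v[idx])
    return t[idx] + fraction * (t[idx + 1] - t[idx])

def peak_to_peak_jitter(times, unit_interval):
    """Spread of the crossing times after folding them into one unit interval."""
    phase = (times - times[0] + unit_interval / 2.0) % unit_interval
    phase -= unit_interval / 2.0
    return phase.max() - phase.min()

# Jitter-free test signal: a sine with period 1.0, so crossings every 0.5.
t = np.arange(300) * 0.01 + 0.003
v = np.sin(2 * np.pi * t)

times = crossing_times(t, v, 0.0)
jitter = peak_to_peak_jitter(times, 0.5)
print(times, jitter)
```

On real data the threshold would typically be placed at the crossing level of the eye, and the same crossing times can be used to align the traces.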
You say your algorithm is painfully slow on that dataset -- my next question would be why. Are you looking at an oversampled dataset? (I'm guessing you are.) And if so, have you tried decimating the signal first? That would at the very least give you fewer samples for your algorithm to wade through.
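A crude way to try decimation with NumPy alone is to average blocks of consecutive samples, as sketched below. The function name `decimate_by_averaging` is an assumption for this example; `scipy.signal.decimate` is a library alternative that applies a proper anti-aliasing filter.

```python
import numpy as np

def decimate_by_averaging(values, factor):
    """Reduce the sample count by averaging blocks of `factor` samples."""
    values = np.asarray(values, dtype=np.float64)
    n_blocks = len(values) // factor
    # Drop the incomplete tail, then average each block.
    return values[:n_blocks * factor].reshape(n_blocks, factor).mean(axis=1)

reduced = decimate_by_averaging(np.arange(10), 2)
print(reduced)
```

The time step of the result is `factor` times the original one, so any time axis has to be thinned accordingly.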
just going down your route for a moment, if you read those images into memory, as they are, wouldn't it be pretty easy to do two flood fills (starting at the centre and at the middle of the left edge) that include all "white" data? if the fill routine recorded the maximum and minimum height at each column, and the maximum horizontal extent, then you have all you need.
in other words, i think you're over-thinking this. edge detection is used in complex "natural" scenes when the edges are unclear. here your edges are so completely obvious that you don't need to enhance them.