I have a signal from a magnetic detector that I'm interested in analyzing. I've decomposed the signal using wavedec():
coeffs = pywt.wavedec(dane_K180_40['CH1[uV]'], 'coif5', level=5)
and received the decomposition coefficients as follows:
cA5, cD5, cD4, cD3, cD2, cD1 = coeffs
These are ndarray objects of various lengths.
cD1 has shape (1519,), cD2 has shape (774,), and so on. The differing array lengths are my main obstacle.
My question:
I need to make a DWT scaleogram, and despite my best efforts I haven't managed to produce one.
What is the best approach? Using matplotlib's imshow() as follows:
plt.imshow(np.abs([cD5, cD4, cD3, cD2, cD1]), cmap='bone', interpolation='none', aspect='auto')
gives me an error
TypeError: Image data of dtype object cannot be converted to float
I've tried googling it, since I'm not an expert in Python, and I've tried converting the ndarrays to float.
What is best for plotting a scaleogram: matshow or pcolormesh?
Basically, each cDi array has half as many samples as the previous one (this is not the case for every mother wavelet!), so I create a 2D numpy array where the first row holds the 'full' number of samples, and for each subsequent level I repeat the samples 2^level times so that the end result is a rectangular block. You can pick whether you want the Y-axis plotted on a linear or a logarithmic scale.
from math import ceil, log2
import numpy as np
import pywt
import matplotlib.pyplot as plt
# Example signal parameters (assumed here; adjust to your own data)
N = 1024     # total number of samples
t_n = 1.0    # signal duration in seconds
# Create signal
xc = np.linspace(0, t_n, num=N)
xd = np.linspace(0, t_n, num=32)
sig = np.sin(2*np.pi * 64 * xc[:32]) * (1 - xd)
composite_signal3 = np.concatenate([np.zeros(32), sig[:32], np.zeros(N-32-32)])
# Use the Daubechies wavelet
w = pywt.Wavelet('db1')
# Perform Wavelet transform up to log2(N) levels
lvls = ceil(log2(N))
coeffs = pywt.wavedec(composite_signal3, w, level=lvls)
# Each level of the WT will split the frequency band in two and apply a
# WT on the highest band. The lower band then gets split into two again,
# and a WT is applied on the higher band of that split. This repeats
# 'lvls' times.
#
# Since the amount of samples in each step decreases, we need to make
# sure that we repeat the samples 2^i times where i is the level so
# that at each level, we have the same amount of transformed samples
# as in the first level. This is only necessary because of plotting.
cc = np.abs(coeffs[-1])[np.newaxis, :]
for i in range(lvls - 1):
    # Repeat the coarser coefficients 2^(i+1) times so that every row has
    # the same number of samples as the finest detail level.
    row = np.abs(np.repeat(coeffs[lvls - 1 - i], pow(2, i + 1)))[np.newaxis, :]
    cc = np.concatenate([cc, row])
plt.figure()
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.title('Discrete Wavelet Transform')
# X-axis has a linear scale (time)
x = np.linspace(start=0, stop=1, num=N//2)
# Y-axis has a logarithmic scale (frequency)
y = np.logspace(start=lvls-1, stop=0, num=lvls, base=2)
X, Y = np.meshgrid(x, y)
plt.pcolormesh(X, Y, cc)
use_log_scale = False
if use_log_scale:
    plt.yscale('log')
else:
    yticks = [pow(2, i) for i in range(lvls)]
    plt.yticks(yticks)
plt.tight_layout()
plt.show()
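For the coif5 coefficients from the question, the detail lengths do not halve exactly, so plain repetition will not line up. A rough sketch of one way to adapt the idea (my own addition, not part of the answer above) is to stretch each level to a common width by nearest-neighbour indexing; the random signal below is only a stand-in for the question's dane_K180_40['CH1[uV]'] data:
import numpy as np
import matplotlib.pyplot as plt
import pywt
signal = np.random.randn(3000)                 # stand-in for the real channel data
coeffs = pywt.wavedec(signal, 'coif5', level=5)
details = coeffs[1:]                           # cD5 ... cD1, coarsest first
# Stretch every detail level to the width of the finest level so the result
# is a rectangular array that imshow or pcolormesh can handle.
width = len(details[-1])
rows = [np.abs(c)[np.round(np.linspace(0, len(c) - 1, width)).astype(int)] for c in details]
scaleogram = np.vstack(rows)
plt.imshow(scaleogram, cmap='bone', interpolation='none', aspect='auto')
plt.show()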
I'm running a model of particles, and I want to have initial conditions for the particle locations mimicking a Gaussian distribution.
If I have N particles on a 1D grid from -10 to 10, I want them distributed on the grid according to a Gaussian with a known mean and standard deviation. It's basically creating a histogram where each bin has width 1 (the resolution of the location axis is 1), and the frequency of each bin should be the number of particles in it; the frequencies should all add up to N.
My strategy was to evaluate a Gaussian function on the x-axis grid and then take the value at each point as the number of particles:
import numpy as np
def gaussian(x, mu, sig):
    return 1./(np.sqrt(2.*np.pi)*sig)*np.exp(-np.power((x - mu)/sig, 2.)/2)
mean = 0
sigma = 1
x_values = np.arange(-10, 10, 1)
y = gaussian(x_values, mean, sigma)
However, I have normalization issues (the sum doesn't add up to N), and the number of particles at each point should be an integer (I thought about converting the y array to integers, but because of the normalization issue I get a flat line).
Usually the problem is fitting a Gaussian to a histogram, but in my case I need to do the reverse, and I couldn't find a solution for it yet. I will appreciate any help!
Thank you!!!
You can use numpy.random.normal to sample this distribution. You can get N points inside the range (-10, 10) that follow a Gaussian distribution with the following code.
import numpy as np
import matplotlib.pyplot as plt
N = 10000
mean = 5
sigma = 3
bin_edges = np.arange(-10, 11, 1)
x_values = (bin_edges[1:] + bin_edges[:-1]) / 2
points = np.random.normal(mean, sigma, N * 10)
mask = np.logical_and(points < 10, points > -10)
points = points[mask] # drop points outside range
points = points[:N] # only use the first N points
y, _ = np.histogram(points, bins=bin_edges)
plt.scatter(x_values, y)
plt.show()
The idea is to generate a lot of random numbers (10 N in the code) and ignore the points outside your desired range.
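As a quick sanity check against the original requirement (my addition; it assumes the y and N variables from the snippet above):
# The histogram counts are integers and, since exactly N sampled positions
# were binned, they sum to exactly N.
print(y.dtype.kind == 'i', y.sum() == N)   # True True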
So I'm following a tutorial where we create a signal and filter the noise using fftpack.
Problem 1: I'm trying to plot the filtered and unfiltered signal noise on a graph so that I can see them side by side.
I'm getting this error:
Warning (from warnings module):
  File "C:\Python39\lib\site-packages\numpy\core\_asarray.py", line 83
    return array(a, dtype, copy=False, order=order)
ComplexWarning: Casting complex values to real discards the imaginary part
I think this is causing the error:
y = sig
x = time_vec
Problem 2: I'm not sure how to plot the two graphs in the same window?
import numpy as np
from scipy import fftpack
time_step = 0.05
# Return an evenly spaced time vector (step 0.05) between [0, 10]
time_vec = np.arange(0, 10, time_step)
print(time_vec)
period = 5
# create a signal and add some noise:
# input angle (2pi * time vector) in radians and return a value ranging from -1 to +1 -- essentially mimicking a signal that goes in cycles, plus some added noise
# numpy.random.randn() - return samples from the standard normal distribution of mean 0 and variance 1
sig = (np.sin(2*np.pi*time_vec)/period) + 0.25 * np.random.randn(time_vec.size)
# Return discrete Fourier transform of real or complex sequence
sig_fft = fftpack.fft(sig) # tranform the sin function
# Get Amplitude ?
Amplitude = np.abs(sig_fft) # np.abs() - calculate absolute value from a complex number a + ib
Power = Amplitude**2 # power spectrum: the square of the amplitude
# Get the angle (phase) spectrum of these transform values, i.e. sig_fft
Angle = np.angle(sig_fft) # Return the angle of the complex argument
# For each Amplitude and Power (of each element in the array?) there will be a corresponding difference in xxx
# This will return the sampling frequency or corresponding frequency of each of the (magnitude) i.e. Power
sample_freq = fftpack.fftfreq(sig.size, d=time_step)
print(Amplitude)
print(sample_freq)
# Because we would like to remove the noise, we are concerned with the peak frequency that contains the peak amplitude
Amp_Freq = np.array([Amplitude, sample_freq])
# Now we try to find the peak amplitude - so we extract its position
Amp_position = Amp_Freq[0,:].argmax()
peak_freq = Amp_Freq[1, Amp_position] # find the frequency at the position of the max amplitude
# print the position of the max amplitude
print("--", Amp_position)
# print the frequency of that max amplitude
print(peak_freq)
high_freq_fft = sig_fft.copy()
# set all the values whose corresponding frequencies are larger than the peak frequency to 0 - i.e. cancel them in the array
high_freq_fft[np.abs(sample_freq) > peak_freq] = 0
print("yes:", high_freq_fft)
# Return discrete inverse Fourier transform of real or complex sequence
filtered_sig = fftpack.ifft(high_freq_fft)
print("filtered noise: ", filtered_sig)
# Using the Fast Fourier Transform and the inverse Fast Fourier Transform we can remove the noise in the frequency domain (which would otherwise be impossible to do in the time domain) - done.
# Plotting the signal with noise (?) and filtered
import matplotlib.pyplot as plt
y = filtered_sig
x = time_vec
plt.plot(x, y)
plt.xlabel('Time')
plt.ylabel('Filtered Amplitude')
plt.show()
y = sig
x = time_vec
plt.plot(x, y)
plt.xlabel('Time')
plt.ylabel('Unfiltered Amplitude')
plt.show()
Problem 1 arises within matplotlib when you plot filtered_sig, because it includes small imaginary parts. You can chop them off with np.real_if_close.
Problem 2: just don't call plt.show() between the first and the second plot.
Here is the complete working plotting part, in one chart with a legend:
import matplotlib.pyplot as plt
x = time_vec
y = np.real_if_close(filtered_sig)
plt.plot(x, y, label='Filtered')
plt.xlabel('Time')
plt.ylabel('Amplitude')
y = sig
plt.plot(x, y, label='Unfiltered')
plt.legend()
plt.show()
Suppose one wanted to find the period of a given sinusoidal wave signal. From what I have read online, it appears that the two main approaches employ either Fourier analysis or autocorrelation. I am trying to automate the process using Python, and my use case is to apply this concept to similar signals that come from the time-series of positions (or speeds or accelerations) of simulated bodies orbiting a star.
For the sake of a simple example, consider x = sin(t) for 0 ≤ t ≤ 10 pi.
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
## sample data
t = np.linspace(0, 10 * np.pi, 100)
x = np.sin(t)
fig, ax = plt.subplots()
ax.plot(t, x, color='b', marker='o')
ax.grid(color='k', alpha=0.3, linestyle=':')
plt.show()
plt.close(fig)
Given a sine-wave of the form x = a sin(b(t+c)) + d, the period of the sine-wave is obtained as 2 * pi / b. Since b=1 (or by visual inspection), the period of our sine wave is 2 * pi. I can check the results obtained from other methods against this baseline.
Attempt 1: Autocorrelation
As I understand it (please correct me if I'm wrong), correlation can be used to see if one signal is a time-lagged copy of another signal (similar to how cosine and sine differ by a phase difference). So autocorrelation is testing a signal against itself to measure the times at which the time-lag repeats said signal. Using the example posted here:
result = np.correlate(x, x, mode='full')
Since x and t each consist of 100 elements and result consists of 199 elements, I am not sure why I should arbitrarily select the last 100 elements.
print("\n autocorrelation (shape={}):\n{}\n".format(result.shape, result))
autocorrelation (shape=(199,)):
[ 0.00000000e+00 -3.82130761e-16 -9.73648712e-02 -3.70014208e-01
-8.59889695e-01 -1.56185995e+00 -2.41986054e+00 -3.33109112e+00
-4.15799070e+00 -4.74662427e+00 -4.94918053e+00 -4.64762251e+00
-3.77524157e+00 -2.33298717e+00 -3.97976240e-01 1.87752669e+00
4.27722402e+00 6.54129270e+00 8.39434617e+00 9.57785701e+00
9.88331103e+00 9.18204933e+00 7.44791758e+00 4.76948221e+00
1.34963425e+00 -2.50822289e+00 -6.42666652e+00 -9.99116299e+00
-1.27937834e+01 -1.44791297e+01 -1.47873668e+01 -1.35893098e+01
-1.09091510e+01 -6.93157447e+00 -1.99159756e+00 3.45267493e+00
8.86228186e+00 1.36707567e+01 1.73433176e+01 1.94357232e+01
1.96463736e+01 1.78556800e+01 1.41478477e+01 8.81191526e+00
2.32100171e+00 -4.70897483e+00 -1.15775811e+01 -1.75696560e+01
-2.20296487e+01 -2.44327920e+01 -2.44454330e+01 -2.19677060e+01
-1.71533510e+01 -1.04037163e+01 -2.33560966e+00 6.27458308e+00
1.45655029e+01 2.16769872e+01 2.68391837e+01 2.94553896e+01
2.91697473e+01 2.59122266e+01 1.99154591e+01 1.17007613e+01
2.03381596e+00 -8.14633251e+00 -1.78184255e+01 -2.59814393e+01
-3.17580589e+01 -3.44884934e+01 -3.38046447e+01 -2.96763956e+01
-2.24244433e+01 -1.26974172e+01 -1.41464998e+00 1.03204331e+01
2.13281784e+01 3.04712823e+01 3.67721634e+01 3.95170295e+01
3.83356037e+01 3.32477037e+01 2.46710643e+01 1.33886439e+01
4.77778141e-01 -1.27924775e+01 -2.50860560e+01 -3.51343866e+01
-4.18671622e+01 -4.45258983e+01 -4.27482779e+01 -3.66140001e+01
-2.66465884e+01 -1.37700036e+01 7.76494745e-01 1.55574483e+01
2.90828312e+01 3.99582426e+01 4.70285203e+01 4.95000000e+01
4.70285203e+01 3.99582426e+01 2.90828312e+01 1.55574483e+01
7.76494745e-01 -1.37700036e+01 -2.66465884e+01 -3.66140001e+01
-4.27482779e+01 -4.45258983e+01 -4.18671622e+01 -3.51343866e+01
-2.50860560e+01 -1.27924775e+01 4.77778141e-01 1.33886439e+01
2.46710643e+01 3.32477037e+01 3.83356037e+01 3.95170295e+01
3.67721634e+01 3.04712823e+01 2.13281784e+01 1.03204331e+01
-1.41464998e+00 -1.26974172e+01 -2.24244433e+01 -2.96763956e+01
-3.38046447e+01 -3.44884934e+01 -3.17580589e+01 -2.59814393e+01
-1.78184255e+01 -8.14633251e+00 2.03381596e+00 1.17007613e+01
1.99154591e+01 2.59122266e+01 2.91697473e+01 2.94553896e+01
2.68391837e+01 2.16769872e+01 1.45655029e+01 6.27458308e+00
-2.33560966e+00 -1.04037163e+01 -1.71533510e+01 -2.19677060e+01
-2.44454330e+01 -2.44327920e+01 -2.20296487e+01 -1.75696560e+01
-1.15775811e+01 -4.70897483e+00 2.32100171e+00 8.81191526e+00
1.41478477e+01 1.78556800e+01 1.96463736e+01 1.94357232e+01
1.73433176e+01 1.36707567e+01 8.86228186e+00 3.45267493e+00
-1.99159756e+00 -6.93157447e+00 -1.09091510e+01 -1.35893098e+01
-1.47873668e+01 -1.44791297e+01 -1.27937834e+01 -9.99116299e+00
-6.42666652e+00 -2.50822289e+00 1.34963425e+00 4.76948221e+00
7.44791758e+00 9.18204933e+00 9.88331103e+00 9.57785701e+00
8.39434617e+00 6.54129270e+00 4.27722402e+00 1.87752669e+00
-3.97976240e-01 -2.33298717e+00 -3.77524157e+00 -4.64762251e+00
-4.94918053e+00 -4.74662427e+00 -4.15799070e+00 -3.33109112e+00
-2.41986054e+00 -1.56185995e+00 -8.59889695e-01 -3.70014208e-01
-9.73648712e-02 -3.82130761e-16 0.00000000e+00]
Attempt 2: Fourier
Since I am not sure where to go from the last attempt, I sought a new attempt. To my understanding, Fourier analysis basically shifts a signal from/to the time-domain (x(t) vs t) to/from the frequency domain (x(t) vs f=1/t); the signal in frequency-space should appear as a sinusoidal wave that dampens over time. The period is obtained from the most observed frequency since this is the location of the peak of the distribution of frequencies.
Since my values are all real-valued, applying the Fourier transform should mean my output values are all complex-valued. I wouldn't think this is a problem, except for the fact that scipy has separate methods for real values. I do not fully understand the differences between all of the different scipy methods, which makes the algorithm proposed in this posted solution hard for me to follow (i.e., how/why is the threshold value picked?).
omega = np.fft.fft(x)
freq = np.fft.fftfreq(x.size, 1)
threshold = 0
idx = np.where(abs(omega)>threshold)[0][-1]
max_f = abs(freq[idx])
print(max_f)
This outputs 0.01, meaning the period is 1/0.01 = 100. This doesn't make sense either.
Attempt 3: Power Spectral Density
According to the scipy docs, I should be able to estimate the power spectral density (psd) of the signal using a periodogram (which, according to Wikipedia, is the Fourier transform of the autocorrelation function). By selecting the dominant frequency fmax at which the signal peaks, the period of the signal can be obtained as 1 / fmax.
freq, pdensity = signal.periodogram(x)
fig, ax = plt.subplots()
ax.plot(freq, pdensity, color='r')
ax.grid(color='k', alpha=0.3, linestyle=':')
plt.show()
plt.close(fig)
The periodogram shown below peaks at 49.076... at a frequency of fmax = 0.05. So, period = 1/fmax = 20. This doesn't make sense to me. I have a feeling it has something to do with the sampling rate, but don't know enough to confirm or progress further.
I realize I am missing some fundamental gaps in understanding how these things work. There are a lot of resources online, but it's hard to find this needle in the haystack. Can someone help me learn more about this?
Let's first look at your signal (I've added endpoint=False to make the division even):
t = np.linspace(0, 10*np.pi, 100, endpoint=False)
x = np.sin(t)
Let's divide out the radians (essentially by taking t /= 2*np.pi) and create the same signal by relating to frequencies:
fs = 20 # Sampling rate of 100/5 = 20 (e.g. Hz)
f = 1 # Signal frequency of 1 (e.g. Hz)
t = np.linspace(0, 5, 5*fs, endpoint=False)
x = np.sin(2*np.pi*f*t)
This makes it more salient that f/fs == 1/20 == 0.05 (i.e. the periodicity of the signal is exactly 20 samples). Frequencies in a digital signal always relate to its sampling rate, as you have already guessed. Note that the actual signal is exactly the same no matter what the values of f and fs are, as long as their ratio is the same:
fs = 1 # Natural units
f = 0.05
t = np.linspace(0, 100, 100*fs, endpoint=False)
x = np.sin(2*np.pi*f*t)
In the following I'll use these natural units (fs = 1). The only difference will be in t and hence the generated frequency axes.
Autocorrelation
Your understanding of what the autocorrelation function does is correct. It detects the correlation of a signal with a time-lagged version of itself. It does this by sliding the signal over itself as seen in the right column here (from Wikipedia):
Note that as both inputs to the correlation function are the same, the resulting signal is necessarily symmetric. That is why the output of np.correlate is usually sliced from the middle:
acf = np.correlate(x, x, 'full')[-len(x):]
Now index 0 corresponds to 0 delay between the two copies of the signal.
Next you'll want to find the index or delay that presents the largest correlation. Due to the shrinking overlap this will by default also be index 0, so the following won't work:
acf.argmax() # Always returns 0
Instead I recommend finding the largest peak, where a peak is defined to be any index with a larger value than both of its direct neighbours:
inflection = np.diff(np.sign(np.diff(acf))) # Find the second-order differences
peaks = (inflection < 0).nonzero()[0] + 1 # Find where they are negative
delay = peaks[acf[peaks].argmax()] # Of those, find the index with the maximum value
Now delay == 20, which tells you that the signal has a frequency of 1/20 of its sampling rate:
signal_freq = fs/delay # Gives 0.05
Fourier transform
You used the following to calculate the FFT:
omega = np.fft.fft(x)
freq = np.fft.fftfreq(x.size, 1)
These functions are designed for complex-valued signals. They will work for real-valued signals, but you'll get a symmetric output, as the negative frequency components will be identical to the positive frequency components. NumPy provides separate functions for real-valued signals:
ft = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), t[1]-t[0]) # Get frequency axis from the time axis
mags = abs(ft) # We don't care about the phase information here
Let's have a look:
plt.plot(freqs, mags)
plt.show()
Note two things: the peak is at frequency 0.05, and the maximum frequency on the axis is 0.5 (the Nyquist frequency, which is exactly half the sampling rate). If we had picked fs = 20, this would be 10.
Now let's find the maximum. The thresholding method you have tried can work, but the target frequency bin is selected blindly and so this method would suffer in the presence of other signals. We could just select the maximum value:
signal_freq = freqs[mags.argmax()] # Gives 0.05
However, this would fail if, e.g., we have a large DC offset (and hence a large component in index 0). In that case we could just select the highest peak again, to make it more robust:
inflection = np.diff(np.sign(np.diff(mags)))
peaks = (inflection < 0).nonzero()[0] + 1
peak = peaks[mags[peaks].argmax()]
signal_freq = freqs[peak] # Gives 0.05
If we had picked fs = 20, this would have given signal_freq == 1.0 due to the different time axis from which the frequency axis was generated.
Periodogram
The method here is essentially the same. The autocorrelation function of x has the same time axis and period as x, so we can use the FFT as above to find the signal frequency:
pdg = np.fft.rfft(acf)
freqs = np.fft.rfftfreq(len(x), t[1]-t[0])
plt.plot(freqs, abs(pdg))
plt.show()
This curve obviously has slightly different characteristics from the direct FFT on x, but the main takeaways are the same: the frequency axis ranges from 0 to 0.5*fs, and we find a peak at the same signal frequency as before: freqs[abs(pdg).argmax()] == 0.05.
Edit:
To measure the actual periodicity of np.sin, we can just use the "angle axis" that we passed to np.sin instead of the time axis when generating the frequency axis:
freqs = np.fft.rfftfreq(len(x), 2*np.pi*f*(t[1]-t[0]))
rad_period = 1/freqs[mags.argmax()] # 6.283185307179586
Though that seems pointless, right? We pass in 2*np.pi and we get 2*np.pi. However, we can do the same with any regular time axis, without presupposing pi at any point:
fs = 10
t = np.arange(1000)/fs
x = np.sin(t)
rad_period = 1/np.fft.rfftfreq(len(x), 1/fs)[abs(np.fft.rfft(x)).argmax()] # 6.25
Naturally, the true value now lies in between two bins. That's where interpolation comes in and the associated need to choose a suitable window function.
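As an illustration of that last point (my own sketch, not part of the original answer): one common refinement is to window the signal and then parabolically interpolate the log-magnitude spectrum around the peak bin, which estimates the peak frequency between bins. This reuses the fs = 10 setup from the last snippet:
import numpy as np
fs = 10
t = np.arange(1000)/fs
x = np.sin(t)
# Window the signal to reduce spectral leakage before the FFT.
win = np.hanning(len(x))
mags = np.abs(np.fft.rfft(x * win))
# Fit a parabola through the log-magnitudes of the peak bin and its two
# neighbours and take the vertex as the refined peak location.
k = mags.argmax()
a, b, c = np.log(mags[k-1:k+2])
offset = 0.5 * (a - c) / (a - 2*b + c)   # fractional bin offset in [-0.5, 0.5]
peak_freq = (k + offset) * fs / len(x)
print(1 / peak_freq)   # should land much closer to 2*pi than the raw 6.25 estimate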
I have a set of values that I'd like to plot the Gaussian kernel density estimate of. However, there are two problems:
I only have the heights of the bars, not the underlying values themselves
I am plotting onto a categorical axis
Here's the plot I've generated so far:
The order of the y axis is actually relevant since it is representative of the phylogeny of each bacterial species.
I'd like to add a gaussian kde overlay for each color, but so far I haven't been able to leverage seaborn or scipy to do this.
Here's the code for the above grouped bar plot using python and matplotlib:
import numpy as np
import matplotlib.pyplot as plt
N = len(color1_plotting_values)
fig, ax = plt.subplots(figsize=(20,30))
ind = np.arange(N) # the x locations for the groups
width = .5 # the width of the bars
p1 = ax.barh(Species_Ordering.Species.values, color1_plotting_values, width, label='Color1', log=True)
p2 = ax.barh(Species_Ordering.Species.values, color2_plotting_values, width, label='Color2', log=True)
for b in p2:
    b.xy = (b.xy[0], b.xy[1]+width)
Thanks!
How to plot a "KDE" starting from a histogram
The protocol for kernel density estimation requires the underlying data. You could come up with a new method that uses the empirical pdf (ie the histogram) instead, but then it wouldn't be a KDE distribution.
Not all hope is lost, though. You can get a good approximation of a KDE distribution by first taking samples from the histogram, and then using KDE on those samples. Here's a complete working example:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sts
n = 100000
# generate some random multimodal histogram data
samples = np.concatenate([np.random.normal(np.random.randint(-8, 8), size=n)*np.random.uniform(.4, 2) for i in range(4)])
h,e = np.histogram(samples, bins=100, density=True)
x = np.linspace(e.min(), e.max())
# plot the histogram
plt.figure(figsize=(8,6))
plt.bar(e[:-1], h, width=np.diff(e), ec='k', align='edge', label='histogram')
# plot the real KDE
kde = sts.gaussian_kde(samples)
plt.plot(x, kde.pdf(x), c='C1', lw=8, label='KDE')
# resample the histogram and find the KDE.
resamples = np.random.choice((e[:-1] + e[1:])/2, size=n*5, p=h/h.sum())
rkde = sts.gaussian_kde(resamples)
# plot the KDE
plt.plot(x, rkde.pdf(x), '--', c='C3', lw=4, label='resampled KDE')
plt.title('n = %d' % n)
plt.legend()
plt.show()
Output:
The red dashed line and the orange line nearly completely overlap in the plot, showing that the real KDE and the KDE calculated by resampling the histogram are in excellent agreement.
If your histograms are really noisy (like what you get if you set n = 10 in the above code), you should be a bit cautious when using the resampled KDE for anything other than plotting purposes:
Overall the agreement between the real and resampled KDEs is still good, but the deviations are noticeable.
Munge your categorical data into an appropriate form
Since you haven't posted your actual data, I can't give you detailed advice. I think your best bet will be to just number your categories in order and then use that number as the "x" value of each bar in the histogram; a rough sketch of this follows below.
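A minimal sketch of that numbering idea (with entirely made-up species names and counts, since the real data wasn't posted), feeding the positions into the resampling trick from the first part of this answer:
import numpy as np
# Hypothetical bar data: species in their phylogenetic order, plus bar heights.
species = ['Species A', 'Species B', 'Species C', 'Species D']
counts = np.array([12, 40, 95, 23])
# Number the categories in order and use those numbers as the "x" values.
positions = np.arange(len(species))
# Resample positions in proportion to the bar heights, as in the example above.
resamples = np.random.choice(positions, size=10000, p=counts/counts.sum())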
I have stated my reservations to applying a KDE to OP's categorical data in my comments above. Basically, as the phylogenetic distance between species does not obey the triangle inequality, there cannot be a valid kernel that could be used for kernel density estimation. However, there are other density estimation methods that do not require the construction of a kernel. One such method is k-nearest neighbour inverse distance weighting, which only requires non-negative distances which need not satisfy the triangle inequality (nor even need to be symmetric, I think). The following outlines this approach:
import numpy as np
#--------------------------------------------------------------------------------
# simulate data
total_classes = 10
sample_values = np.random.rand(total_classes)
distance_matrix = 100 * np.random.rand(total_classes, total_classes)
# Distances from each sample to itself are zero; hence remove the diagonal.
distance_matrix -= np.diag(np.diag(distance_matrix))
# --------------------------------------------------------------------------------
# For each sample, compute an average based on the values of the k-nearest neighbors.
# Weigh each sample value by the inverse of the corresponding distance.
# Apply a regularizer to the distance matrix.
# This limits the influence of values with very small distances.
# In particular, this affects how the value of the sample itself (which has distance 0)
# is weighted w.r.t. other values.
regularizer = 1.
distance_matrix += regularizer
# Set number of neighbours to "interpolate" over.
k = 3
# Compute an average based on the sample value itself and the k neighbouring values,
# weighted by the inverse distance.
# The following assumes that distance_matrix[ii, jj] is the distance from ii to jj.
new_sample_values = np.zeros(total_classes)  # this needs to exist before the loop
for ii in range(total_classes):
    # determine neighbours
    indices = np.argsort(distance_matrix[ii, :])[:k+1] # +1 to include the sample itself
    # compute weights
    distances = distance_matrix[ii, indices]
    weights = 1. / distances
    weights /= np.sum(weights) # weights need to sum to 1
    # compute weighted average
    values = sample_values[indices]
    new_sample_values[ii] = np.sum(values * weights)
print(new_sample_values)
THE EASY WAY
For now, I am skipping any philosophical argument about the validity of using kernel density estimation in such settings; I will come back to that later.
An easy way to do this is using scikit-learn KernelDensity:
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity
from sklearn import preprocessing
ds=pd.read_csv('data-by-State.csv')
Y=ds.loc[:,'State'].values # State is AL, AK, AZ, etc...
# With categorical data we need some label encoding here...
le = preprocessing.LabelEncoder()
le.fit(Y) # le.classes_ would be ['AL', 'AK', 'AZ',...
y=le.transform(Y) # y would be [0, 2, 3, ..., 6, 7, 9]
y=y[:, np.newaxis] # preparing for kde
kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(y)
# You can control the bandwidth so the KDE function performs better
# To find the optimum bandwidth for your data you can try Crossvalidation
x=np.linspace(0,5,100)[:, np.newaxis] # let's get some x values to plot on
log_dens=kde.score_samples(x)
dens=np.exp(log_dens) # these are the density function values
array([0.06625658, 0.06661817, 0.06676005, 0.06669403, 0.06643584,
0.06600488, 0.0654239 , 0.06471854, 0.06391682, 0.06304861,
0.06214499, 0.06123764, 0.06035818, 0.05953754, 0.05880534,
0.05818931, 0.05771472, 0.05740393, 0.057276 , 0.05734634,
0.05762648, 0.05812393, 0.05884214, 0.05978051, 0.06093455,
..............
0.11885574, 0.11883695, 0.11881434, 0.11878766, 0.11875657,
0.11872066, 0.11867943, 0.11863229, 0.11857859, 0.1185176 ,
0.11844852, 0.11837051, 0.11828267, 0.11818407, 0.11807377])
And these values are all you need to plot your Kernel Density over your histogram. Capito?
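For completeness, a minimal sketch of that final plotting step (my addition, not part of the original answer), assuming the le, y, x and dens variables from the snippet above:
import matplotlib.pyplot as plt
# Normalised histogram of the encoded categories, with the KDE curve on top.
plt.hist(y[:, 0], bins=len(le.classes_), density=True, alpha=0.5, label='histogram')
plt.plot(x[:, 0], dens, label='KDE')
plt.legend()
plt.show()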
Now, on the theoretical side, if X is a categorical(*), unordered variable with c possible values, then for 0 ≤ h < 1
k(x1, x2) = 1 - h if x1 = x2, and h/(c - 1) otherwise
is a valid kernel. For an ordered X,
k(x1, x2) = 1 - h if x1 = x2, and ((1 - h)/2) * h^|x1 - x2| otherwise,
where |x1 - x2| should be understood as how many levels apart x1 and x2 are. As h tends to zero, both of these become indicators and return a relative frequency counting. h is oftentimes referred to as bandwidth.
(*) No distance needs to be defined on the variable space. Doesn't need to be a metric space.
Devroye, Luc and Gábor Lugosi (2001). Combinatorial Methods in Density Estimation. Berlin: Springer-Verlag.
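As a concrete illustration of categorical kernels of this kind (my own sketch; the specific forms below are the standard Aitchison-Aitken kernel for the unordered case and a geometric-decay kernel for the ordered case):
import numpy as np
def unordered_kernel(x1, x2, h, c):
    # Unordered categorical kernel: weight 1-h on a match, h/(c-1) otherwise.
    return np.where(x1 == x2, 1 - h, h / (c - 1))
def ordered_kernel(x1, x2, h):
    # Ordered categorical kernel: the weight decays geometrically with the
    # number of levels separating x1 and x2.
    d = np.abs(x1 - x2)
    return np.where(d == 0, 1 - h, 0.5 * (1 - h) * h**d)
# As h -> 0 both reduce to an indicator of x1 == x2 (relative frequency counting).
print(unordered_kernel(np.arange(5), 2, h=0.3, c=5))
print(ordered_kernel(np.arange(5), 2, h=0.3))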
I have a 3D dataset (X, Y, Z). I would like to perform KDE, plot the data and its estimate, then get the zero crossings and plot them together with the KDE. My attempt is below. I have the following questions:
Will the lines X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j] and positions = np.vstack([X.ravel(), Y.ravel(), Z.ravel()]) (as in the kde documentation) have any effect on visualising the real estimate for the original data? I don't really understand why I have to use my min and max to perform KDE and then use ravel().
Why do I have to transpose the data in f = np.reshape(kernel(positions).T, X.shape)?
Is the code correct?
I failed to plot the original data with the KDE estimate, and the KDE estimate/original data with the zero crossings:
Should the zero crossings be a vector? In the code below they are a tuple.
import numpy as np
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt
df = pd.read_csv(file, delimiter=',')
# Convert series from the data-frame into arrays
X = np.array(df['x'])
Y = np.array(df['y'])
Z = np.array(df['z'])
data = np.vstack([X, Y, Z])
# perform KDE
kernel = scipy.stats.kde.gaussian_kde(data)
density = kernel(data)
fig, ax = plt.subplots(subplot_kw=dict(projection='3d'))
x, y, z = data
scatter = ax.scatter(x, y, z, c=density)
xmin = data[0].min()
xmax = data[0].max()
ymin = data[1].min()
ymax = data[1].max()
zmin = data[2].min()
zmax = data[2].max()
X,Y, Z = np.mgrid[xmin:xmax:100j,ymin:ymax:100j,zmin:zmax:100j]
positions = np.vstack([X.ravel(),Y.ravel(),Z.ravel()])
f = np.reshape(kernel(positions).T, X.shape)
derivative = np.gradient(f)
dz, dy, dx = derivative
xdiff = np.sign(dx) # along X-axis
ydiff = np.sign(dy) # along Y-axis
zdiff = np.sign(dz) # along Z-axis
xcross = np.where(xdiff[:-1] != xdiff[1:])
ycross = np.where([ydiff[:-1] != ydiff[1:]])
zcross = np.where([zdiff[:-1] != zdiff[1:]])
Zerocross = xcross + ycross + zcross
Will the lines X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j] and positions = np.vstack([X.ravel(), Y.ravel(), Z.ravel()]) (as in the kde documentation) have any effect on visualising the real estimate for the original data? I don't really understand why I have to use my min and max to perform KDE and then use ravel().
Those two lines set up a grid of x, y, z locations where the KDE will be evaluated. In the code above they are only being used to estimate the derivative of the kernel density function. Since they aren't currently being used for anything related to plotting, they won't affect the visualisation.
xmin, xmax etc. are used to ensure that the grid covers the full range of x, y, z values in your data. The syntax xmin:xmax:100j does the equivalent of np.linspace(xmin, xmax, 100), i.e. np.mgrid returns 100 evenly spaced points between xmin and xmax.
The X, Y and Z arrays returned by np.mgrid will each have shapes (100, 100, 100), whereas the positions argument to kernel(positions) needs to be (n_dimensions, n_points). The line np.vstack([X.ravel(),Y.ravel(),Z.ravel()]) just reshapes the output of np.mgrid into this form. .ravel() flattens each (100, 100, 100) array into a (1000000,) vector, and np.vstack concatenates them over the first dimension to make a (3, 1000000) array of points.
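A quick standalone check of those shapes (my addition):
import numpy as np
X, Y, Z = np.mgrid[0:1:100j, 0:1:100j, 0:1:100j]
print(X.shape)          # (100, 100, 100)
positions = np.vstack([X.ravel(), Y.ravel(), Z.ravel()])
print(positions.shape)  # (3, 1000000)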
Why do I have to transpose the data in f = np.reshape(kernel(positions).T, X.shape)?
You don't :-). The output of kernel(positions) is a 1D vector, so transposing it will have no effect.
I failed to plot the original data with the KDE estimate, and the KDE estimate/original data with the zero crossings:
What did you try? The code above seems to estimate zero-crossings of the gradient of the kernel density function, but doesn't include any code to plot them. What sort of a plot do you want to make?
Should the zero crossings be a vector? In the code below they are a tuple.
When you call np.where(x) where x is a multidimensional array, you get back a tuple containing the indices where x is non-zero. Since xdiff[:-1] != xdiff[1:] is a 3D array, you will get back a tuple containing three 1D arrays of indices, one per dimension.
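A small demonstration of that behaviour (my addition):
import numpy as np
a = np.zeros((2, 3, 4))
a[1, 2, 3] = 1.0
idx = np.where(a != 0)
# One index array per dimension of the input.
print(len(idx))   # 3
print(idx)        # (array([1]), array([2]), array([3]))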
You probably don't want the extra set of square brackets in np.where([ydiff[:-1] != ydiff[1:]]), since in that case [ydiff[:-1] != ydiff[1:]] will be treated as a (1, 100, 100, 100) array rather than (100, 100, 100), and you'll therefore get a tuple containing 4 arrays of indices rather than 3 (the first one will be all zeros, since the size in the first dimension is 1).