According to the original paper by Huang
https://arxiv.org/pdf/1401.4211.pdf
The marginal Hibert spectrum is given by:
where A = A(w,t) (i.e., a function time and frequency) and p(w,A)
the joint probability density function of P(ω, A) of the frequency [ωi] and amplitude [Ai].
I am trying to estimate 1) The joint probability density using the plt.hist2d 2) the integral shown below using a sum.
The code I am using is the following:
IA_flat1 = np.ravel(IA) ### Turn matrix to 1 D array
IF_flat1 = np.ravel(IF) ### Here IA corresponds to A
IF_flat = IF_flat1[(IF_flat1>min_f) & (IF_flat1<fs)] ### Keep only desired frequencies
IA_flat = IA_flat1[(IF_flat1>min_f) & (IF_flat1<fs)] ### Keep IA that correspond to desired frequencies
### return the Joint probability density
Pjoint,f_edges, A_edges,_ = plt.hist2d(IF_flat,IA_flat,bins=[bins_F,bins_A], density=True)
plt.close()
n1 = np.digitize(IA_flat, A_edges).astype(int) ### Return the indices of the bins to which
n2 = np.digitize(IF_flat, f_edges).astype(int) ### each value in input array belongs.
### define integration function
from numba import jit, prange ### Numba is added for speed
#jit(nopython=True, parallel= True)
def get_int(A_edges, Pjoint ,IA_flat, n1, n2):
dA = np.diff(A_edges)[0] ### Find dx for integration
sum_h = np.zeros(np.shape(Pjoint)[0]) ### Intitalize array
for j in prange(np.shape(Pjoint)[0]):
h = np.zeros(np.shape(Pjoint)[1]) ### Intitalize array
for k in prange(np.shape(Pjoint)[1]):
needed = IA_flat[(n1==k) & (n2==j)] ### Keep only the elements of arrat that
### are related to PJoint[j,k]
h[k] = Pjoint[j,k]*np.nanmean(needed**2)*dA ### Pjoint*A^2*dA
sum_h[j] = np.nansum(h) ### Sum_{i=0}^{N}(Pjoint*A^2*dA)
return sum_h
### Now run previously defined function
sum_h = get_int(A_edges, Pjoint ,IA_flat, n1, n2)
1) I am not sure that everything is correct though. Any suggestions or comments on what I might be doing wrong?
2) Is there a way to do the same using a scipy integration scheme?
You can extract the probability from the 2D histogram and use it for the integration:
# Added some numbers to have something to run
import numpy as np
import matplotlib.pyplot as plt
IA = np.random.rand(100,100)
IF = np.random.rand(100,100)
bins_F = np.linspace(0,1,20)
bins_A = np.linspace(0,1,100)
min_f = 0
fs = 1.0
IA_flat1 = np.ravel(IA) ### Turn matrix to 1 D array
IF_flat1 = np.ravel(IF) ### Here IA corresponds to A
IF_flat = IF_flat1[(IF_flat1>min_f) & (IF_flat1<fs)] ### Keep only desired frequencies
IA_flat = IA_flat1[(IF_flat1>min_f) & (IF_flat1<fs)] ### Keep IA that correspond to desired frequencies
### return the Joint probability density
Pjoint,f_edges, A_edges,_ = plt.hist2d(IF_flat,IA_flat,bins=[bins_F,bins_A], density=True)
f_values = (f_edges[1:]+f_edges[:-1])/2
A_values = (A_edges[1:]+A_edges[:-1])/2
dA = A_values[1]-A_values[0] # for the integral
#Pjoint.shape (19,99)
h = np.zeros(f_values.shape)
for i in range(len(f_values)):
f = f_values[i]
# column of the histogram with frequency f, probability
p = Pjoint[i]
# summatory equivalent to the integral
integral_result = np.sum(p*A_values**2*dA )
h[i] = integral_result
plt.figure()
plt.plot(f_values,h)
I have created a code that returns the output that I am after - 2 graphs with multiple lines on each graph. However, the code is slow and quite big (in terms of how many lines of code it takes). I am interested in any improvements I can make that will help me to get such graphs faster, and make my code more presentable.
Additionally, I would like to add more to my graphs (axis names and titles is what I am after). Normally, I would use plt.xlabel,plt.ylabel and plt.title to do so, however I couldn't quite understand how to use them here. The aim here is to add a line to each graph after each loop ( I have adapted this piece of code to do so).
I should note that I need to use Python for this task (so I cannot change to anything else) and I do need Sympy library to find values that are plotted in my graphs.
My code so far is as follows:
import matplotlib.pyplot as plt
import sympy as sym
import numpy as np
sym.init_printing()
x, y = sym.symbols('x, y') # defining our unknown probabilities
al = np.arange(20,1000,5).reshape((196,1)) # values of alpha/beta
prob_of_strA = []
prob_of_strB = []
colours=['r','g','b','k','y']
pen_values = [[0,-5,-10,-25,-50],[0,-25,-50,-125,-250]]
fig1, ax1 = plt.subplots()
fig2, ax2 = plt.subplots()
for j in range(0,len(pen_values[1])):
for i in range(0,len(al)): # choosing the value of beta
A = sym.Matrix([[10, 50], [int(al[i]), pen_values[0][j]]]) # defining matrix A
B = sym.Matrix([[pen_values[1][j], 50], [int(al[i]), 10]]) # defining matrix B
sigma_r = sym.Matrix([[x, 1-x]]) # defining the vector of probabilities
sigma_c = sym.Matrix([y, 1-y]) # defining the vector of probabilities
ts1 = A * sigma_c ; ts2 = sigma_r * B # defining our utilities
y_sol = sym.solvers.solve(ts1[0] - ts1[1],y,dict = True) # solving for y
x_sol = sym.solvers.solve(ts2[0] - ts2[1],x,dict = True) # solving for x
prob_of_strA.append(y_sol[0][y]) # adding the value of y to the vector
prob_of_strB.append(x_sol[0][x]) # adding the value of x to the vector
ax1.plot(al,prob_of_strA,colours[j],label = ["penalty = " + str(pen_values[0][j])]) # plotting value of y for a given penalty value
ax2.plot(al,prob_of_strB,colours[j],label = ["penalty = " + str(pen_values[1][j])]) # plotting value of x for a given penalty value
ax1.legend() # showing the legend
ax2.legend() # showing the legend
prob_of_strA = [] # emptying the vector for the next round
prob_of_strB = [] # emptying the vector for the next round
You can save a couple of lines by initializing your empty vectors inside the loop. You don't have to bother re-defining them at the end.
for j in range(0,len(pen_values[1])):
prob_of_strA = []
prob_of_strB = []
for i in range(0,len(al)): # choosing the value of beta
A = sym.Matrix([[10, 50], [int(al[i]), pen_values[0][j]]]) # defining matrix A
B = sym.Matrix([[pen_values[1][j], 50], [int(al[i]), 10]]) # defining matrix B
sigma_r = sym.Matrix([[x, 1-x]]) # defining the vector of probabilities
sigma_c = sym.Matrix([y, 1-y]) # defining the vector of probabilities
ts1 = A * sigma_c ; ts2 = sigma_r * B # defining our utilities
y_sol = sym.solvers.solve(ts1[0] - ts1[1],y,dict = True) # solving for y
x_sol = sym.solvers.solve(ts2[0] - ts2[1],x,dict = True) # solving for x
prob_of_strA.append(y_sol[0][y]) # adding the value of y to the vector
prob_of_strB.append(x_sol[0][x]) # adding the value of x to the vector
ax1.plot(al,prob_of_strA,colours[j],label = ["penalty = " + str(pen_values[0][j])]) # plotting value of y for a given penalty value
ax2.plot(al,prob_of_strB,colours[j],label = ["penalty = " + str(pen_values[1][j])]) # plotting value of x for a given penalty value
ax1.legend() # showing the legend
ax2.legend() # showing the legend
Explanation:
I have two numpy arrays: dataX and dataY, and I am trying to filter each array to reduce the noise. The image shown below shows the actual input data (blue dots) and an example of what I want it to be like(red dots). I do not need the filtered data to be as perfect as in the example but I do want it to be as straight as possible. I have provided sample data in the code.
What I have tried:
Firstly, you can see that the data isn't 'continuous', so I first divided them into individual 'segments' ( 4 of them in this example), and then applied a filter to each 'segment'. Someone suggested that I use a Savitzky-Golay filter. The full, run-able code is below:
import scipy as sc
import scipy.signal
import numpy as np
import matplotlib.pyplot as plt
# Sample Data
ydata = np.array([1,0,1,2,1,2,1,0,1,1,2,2,0,0,1,0,1,0,1,2,7,6,8,6,8,6,6,8,6,6,8,6,6,7,6,5,5,6,6, 10,11,12,13,12,11,10,10,11,10,12,11,10,10,10,10,12,12,10,10,17,16,15,17,16, 17,16,18,19,18,17,16,16,16,16,16,15,16])
xdata = np.array([1,2,3,1,5,4,7,8,6,10,11,12,13,10,12,13,17,16,19,18,21,19,23,21,25,20,26,27,28,26,26,26,29,30,30,29,30,32,33, 1,2,3,1,5,4,7,8,6,10,11,12,13,10,12,13,17,16,19,18,21,19,23,21,25,20,26,27,28,26,26,26,29,30,30,29,30,32])
# Used a diff array to find where there is a big change in Y.
# If there's a big change in Y, then there must be a change of 'segment'.
diffy = np.diff(ydata)
# Create empty numpy arrays to append values into
filteredX = np.array([])
filteredY = np.array([])
# Chose 3 to be the value indicating the change in Y
index = np.where(diffy >3)
# Loop through the array
start = 0
for i in range (0, (index[0].size +1) ):
# Check if last segment is reached
if i == index[0].size:
print xdata[start:]
partSize = xdata[start:].size
# Window length must be an odd integer
if partSize % 2 == 0:
partSize = partSize - 1
filteredDataX = sc.signal.savgol_filter(xdata[start:], partSize, 3)
filteredDataY = sc.signal.savgol_filter(ydata[start:], partSize, 3)
filteredX = np.append(filteredX, filteredDataX)
filteredY = np.append(filteredY, filteredDataY)
else:
print xdata[start:index[0][i]]
partSize = xdata[start:index[0][i]].size
if partSize % 2 == 0:
partSize = partSize - 1
filteredDataX = sc.signal.savgol_filter(xdata[start:index[0][i]], partSize, 3)
filteredDataY = sc.signal.savgol_filter(ydata[start:index[0][i]], partSize, 3)
start = index[0][i]
filteredX = np.append(filteredX, filteredDataX)
filteredY = np.append(filteredY, filteredDataY)
# Plots
plt.plot(xdata,ydata, 'bo', label = 'Input Data')
plt.plot(filteredX, filteredY, 'ro', label = 'Filtered Data')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Result')
plt.legend()
plt.show()
This is my result:
When each point is connected, the result looks as follows.
I have played around with the order, but it seems like a third order gave the best result.
I have also tried these filters, among a few others:
scipy.signal.medfilt
scipy.ndimage.filters.uniform_filter1d
But so far none of the filters I have tried were close to what I really wanted. What is the best way to filter data such as this? Looking forward to your help.
One way to get something looking close to your ideal would be clustering + linear regression.
Note that you have to provide the number of clusters and I also cheated a bit in scaling up y before clustering.
import numpy as np
from scipy import cluster, stats
ydata = np.array([1,0,1,2,1,2,1,0,1,1,2,2,0,0,1,0,1,0,1,2,7,6,8,6,8,6,6,8,6,6,8,6,6,7,6,5,5,6,6, 10,11,12,13,12,11,10,10,11,10,12,11,10,10,10,10,12,12,10,10,17,16,15,17,16, 17,16,18,19,18,17,16,16,16,16,16,15,16])
xdata = np.array([1,2,3,1,5,4,7,8,6,10,11,12,13,10,12,13,17,16,19,18,21,19,23,21,25,20,26,27,28,26,26,26,29,30,30,29,30,32,33, 1,2,3,1,5,4,7,8,6,10,11,12,13,10,12,13,17,16,19,18,21,19,23,21,25,20,26,27,28,26,26,26,29,30,30,29,30,32])
def split_to_lines(x, y, k):
yo = np.empty_like(y, dtype=float)
# get the cluster centers and the labels for each point
centers, map_ = cluster.vq.kmeans2(np.array((x, y * 2)).T.astype(float), k)
# for each cluster, use the labels to select the points belonging to
# the cluster and do a linear regression
for i in range(k):
slope, interc, *_ = stats.linregress(x[map_==i], y[map_==i])
# use the regression parameters to construct y values on the
# best fit line
yo[map_==i] = x[map_==i] * slope + interc
return yo
import pylab
pylab.plot(xdata, ydata, 'or')
pylab.plot(xdata, split_to_lines(xdata, ydata, 4), 'ob')
pylab.show()