creating a large pdf matrix efficiently - python

I have a dataset of 60,000 examples of the form:
mu1 mu2 std1 std2
0 -0.745 0.729 0.0127 0.0149
1 -0.711 0.332 0.1240 0.0433
...
They are essentially parameters of 2-dimensional normal distributions. What I want to do is create a (NxN) matrix P such that P_ij = Normal( mu_i | mean=mu_j, cov=diagonal(std_j)), where mu_i is (mu1, mu2) for data 'i'.
I can do this with the following code for example:
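from scipy import stats
import numpy as np
# convert to NumPy arrays so positional indexing like mu_all[i, :] works
mu_all = data[['mu1', 'mu2']].to_numpy()
std_all = data[['std1', 'std2']].to_numpy()
P = []
for i in range(len(data)):
    mu_i = mu_all[i, :]
    std_i = std_all[i, :]
    # density of every point under the Gaussian centred at mu_i
    prob_i = stats.multivariate_normal.pdf(mu_all, mean=mu_i, cov=np.diag(std_i))
    P.append(prob_i)
P = np.array(P).T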
But this is too expensive (my machine freezes). How can I do this more efficiently? My guess is that scipy cannot handle computing pdf of 60000 at once. Is there an alternative?

Just realized that a matrix of that size (60,000 x 60,000) cannot be handled in Python:
Very large matrices using Python and NumPy
So I don't think this can be done
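The full matrix is indeed out of reach on most machines: 60,000 x 60,000 float64 entries need 60,000² × 8 bytes ≈ 28.8 GB. If the downstream step only needs a block of rows at a time (row sums, a top-k per row, writing to disk), a sketch along the following lines might help. It assumes the covariance is diagonal, as in the question, so the 2-D density factorises into 1-D normals and can be evaluated with broadcasting instead of one scipy call per row. The helper name pdf_block, the block size and the random stand-in data are illustrative assumptions, not part of the original post.

```python
import numpy as np

def pdf_block(mu, var, rows):
    """Rows `rows` of P, where P[i, j] = N(mu[i] | mean=mu[j], cov=diag(var[j])).

    `var` holds the diagonal of each covariance. The question passes the std
    columns straight to np.diag, so pass std_all to reproduce that behaviour,
    or std_all**2 if those columns are true standard deviations.
    """
    # diff[r, j, :] = mu[rows[r]] - mu[j]
    diff = mu[rows, None, :] - mu[None, :, :]
    # log-density of a diagonal Gaussian, summed over the two dimensions
    log_p = -0.5 * np.sum(diff**2 / var[None, :, :]
                          + np.log(2.0 * np.pi * var[None, :, :]), axis=-1)
    return np.exp(log_p)

# Random stand-in data (assumption): 1000 points instead of 60,000
rng = np.random.default_rng(0)
mu = rng.normal(size=(1000, 2))
var = rng.uniform(0.01, 0.2, size=(1000, 2))

# Process the matrix block by block; only `block` rows exist at any time
block = 256
row_sums = np.empty(len(mu))
for start in range(0, len(mu), block):
    rows = np.arange(start, min(start + block, len(mu)))
    row_sums[rows] = pdf_block(mu, var, rows).sum(axis=1)
```

With 60,000 points and a block of 256 rows, each block holds about 15 million values, so peak memory stays in the hundreds of megabytes rather than tens of gigabytes.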

Related

Matrix operations using parameters modified through moving horizon estimation

I've recently started trying out moving horizon estimation with GEKKO. My specified manipulated variables are used in a heat balance equation within my model, and I am having some issues with the matrix operations in the model.
Example code:
from gekko import GEKKO
import numpy as np
#creating a sample array of input values
nt = 51
u_meas = np.zeros(nt)
u_meas[3:10] = 1.0
u_meas[10:20] = 2.0
u_meas[20:40] = 0.5
u_meas[40:] = 3.0
p = GEKKO(remote=False)
p.time = np.linspace(0,10,nt)
n = 1 #process model order
#designating u as my input, and that I'm going to be using these measurements to estimate my parameters with MHE
p.u = p.MV(value=u_meas)
p.u.FSTATUS=1
#parameters I'm looking to modulate
p.K = p.FV(value=1, lb = 1, ub = 3) #gain
p.tau = p.FV(value=5, lb = 1, ub = 10) #time constant
p.x = [p.Intermediate(p.u)]
#constants within the model that do not change
X_O2 = 0.5
X_SiO2 = 0.25
X_N2 = 0.1
m_feed = 100
#creating an array with my feed separated into components. This creates a 1D array with the individual feed streams of my components.
mdot_F_i = np.tile(m_feed,3)*np.array([X_O2, X_SiO2, X_N2])
#at this point, I want to add my MV values to the end of my component feed array for later heat and mass balance equations. Normally, in my previous model without MHE, I would put
mdot_c_i = np.concatenate(mdot_F_i, x, (other MV variables after))
However, now that u is a specified MV in GEKKO, and not a set value, I get an error at the mdot_c_i line that says that the array at index 0 has 1 dimension, and the array at index 1 has 2 dimensions.
I'm guessing that I have to specify mdot_c_i as an intermediate variable within Gekko. I've tried a couple different variations, alternately specifying mdot_c_i as an intermediate and trying to use only the values of the MV; however, I keep getting that error.
Has anyone experienced similar issues?
Thank you!
You can resolve this by using np.append() instead of np.concatenate(). Try something like:
mdot_c_i = np.append(mdot_F_i, p.u)
Here is a minimum and complete example if you'd like to try it.
import numpy as np
from gekko import GEKKO
m = GEKKO(remote=False)
x = m.Array(m.Var,3,lb=-10,ub=10)
y = m.Var(5,lb=-5,ub=5)
z = np.append(x,y)
m.Minimize(np.dot([1,1,-1,1],z))
m.solve(disp=False)
print([zi.value[0] for zi in z])
# solution: [-10.0, -10.0, 10.0, -5.0]
Gekko variables need to be stored as objects, not as numerical values. The error may be because the np.concatenate() function is trying to access the length of the Gekko manipulated variable data p.u.value to concatenate those values instead of concatenating p.u as an object.
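The error text quoted in the question is NumPy's own dimension-mismatch message, and it can be reproduced without GEKKO. The sketch below uses plain arrays as stand-ins (the 2-D array `extra` is an assumption, standing in for whatever shape NumPy infers from the GEKKO object) to show why np.append gets past it: with no axis argument it flattens both inputs before joining them.

```python
import numpy as np

feed = np.array([50.0, 25.0, 10.0])   # 1-D, like mdot_F_i
extra = np.array([[1.0, 2.0]])        # 2-D stand-in (assumption, not a GEKKO object)

# np.concatenate refuses inputs with different numbers of dimensions
try:
    np.concatenate((feed, extra))
    message = None
except ValueError as err:
    message = str(err)
print(message)

# np.append without an axis ravels both inputs first, so the join succeeds
combined = np.append(feed, extra)
print(combined.shape)
```

Note also that np.concatenate expects its arrays as a single tuple or list, so a call of the form np.concatenate(a, b) would pass b as the axis argument.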

Similar matrix computation using numpy

I am trying to find a matrix B similar to a 3 x 3 matrix A, using a random invertible matrix P:
B = P_inv.A.P
import numpy as np
from scipy import linalg as LA
from numpy.linalg import inv
A = np.random.randint(1,10,9).reshape(3,3)
P = np.random.randn(3,3)
P_inv = inv(P)
eig1 = LA.eigvalsh(A)
eig1 = np.sort(eig1)
B1 = P_inv.dot(A)
B = B1.dot(P)
eig2 = LA.eigvalsh(B)
eig2 = np.sort(eig2)
print(np.round(eig1 ,3))
print(np.round(eig2,3))
However, I notice that eig1 and eig2 are never equal.
What am I missing, or is it a numerical error?
Thanks
Kedar
You're using eigvalsh, which requires that the matrix be real symmetric (or complex Hermitian), which your randomly generated matrix is not.
Deleting the h and using eigvals instead fixes this.
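As a quick check of that fix, here is a minimal sketch using NumPy's general eigenvalue routine. The fixed matrices A and P are stand-ins chosen for reproducibility (the question uses random ones); any non-symmetric A and invertible P should behave the same way. The symmetric routines read only one triangle of the input, which is presumably why they return unrelated values for A and B.

```python
import numpy as np

# Fixed, non-symmetric stand-in for the random A in the question
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 10.0]])
# Fixed invertible stand-in for the random P (its determinant is 18)
P = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])

B = np.linalg.inv(P) @ A @ P   # similarity transform B = P^-1 A P

# eigvals makes no symmetry assumption; results may be complex in general
eig_A = np.sort_complex(np.linalg.eigvals(A))
eig_B = np.sort_complex(np.linalg.eigvals(B))
print(np.round(eig_A, 3))
print(np.round(eig_B, 3))
```

The two printed arrays should agree up to floating-point rounding, since similar matrices share the same eigenvalues.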

Eig in Python giving different Eigenvalues?

Essentially, the problem is that the eig functions in Matlab and Python are giving me different results. I am reproducing data from a paper to confirm that my numerical method is correct (so I know the answers; I have them via Matlab).
I have tried eigh, still no improvement.
Below is the data matrix used:
2852 170.380000000000 77.3190000000000 -51.0710000000000 -191.560000000000 105.410000000000 240.950000000000 102.700000000000
2842 169.640000000000 76.6120000000000 -50.3980000000000 -191.310000000000 105.660000000000 240.850000000000 102.960000000000
2838.80000000000 176.950000000000 80.4150000000000 -51.5700000000000 -192.190000000000 104.870000000000 239.700000000000 104.110000000000
2837.40000000000 182.930000000000 88.4070000000000 -54.1410000000000 -194.460000000000 104.230000000000 238.760000000000 105.020000000000
2890.80000000000 167.270000000000 122 -67.7490000000000 -275.150000000000 160.960000000000 248.010000000000 95.9470000000000
2962.10000000000 113.910000000000 177.060000000000 -98.9930000000000 -259.270000000000 80.7860000000000 262.890000000000 80.9180000000000
3013.90000000000 72.9740000000000 225.260000000000 -135.700000000000 -233.520000000000 0.0469300000000000 272.110000000000 71.5160000000000
3026.50000000000 112.420000000000 243.020000000000 -169.460000000000 -218.060000000000 0.0465190000000000 271.250000000000 71.8280000000000
3367.10000000000 -0.310680000000000 479.870000000000 0.494350000000000 -0.603940000000000 -0.147820000000000 282.700000000000 -64.1680000000000
import scipy.io as sc
import math as m
import numpy as np
from numpy import diag, power
from scipy.linalg import expm, sinm, cosm
import matplotlib.pyplot as plt
import pandas as pd
###########################. Import Data from Excel Sheet.
###################################
df = pd.read_excel('DataCompanionMatrix.xlsx', header=None)
data = np.array(df)
###########################. FUNCTION DEFINE.
#################################################
m = data.shape[0]
n = data.shape[1]
x = data[0:-1,:]
y = data[-1,:]
A = np.dot(x,np.transpose(x))
xx = np.dot(x,np.transpose(y))
Co_values = np.dot(np.linalg.pinv(A),xx)
C = np.zeros((n,n))
for i in range(0,n-1):
    C[i,i-1] = 1
C[:,n-1] = Co_values
eigV,eigW = np.linalg.eig(C)
print(eigV)
The data is a 9x8 matrix, x is an 8x8 matrix, y is a 1x8 array, A is 8x8, C is 8x8, and Co_values is a 1x8 array.
In Matlab the eigenvalues are a 1x8 array of complex values. In Python, I get a 1x8 array filled with 7 zeros and 1 integer.
When plotted, the eigenvalues should sit on the unit circle; I have already done this in Matlab.
[Images: C matrix in Matlab and Python (both look the same); Python eigenvalues; Matlab eigenvalues]
The array C you create in Python does not correspond to the one you have in MATLAB.
If I modify your Python code as follows, I get the same array C and the same eigenvalues:
C = np.zeros((n,n))
for i in range(0,n-1):
    C[i+1,i] = 1 # This is where the differences are!
C[:,n-1] = Co_values
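The same subdiagonal pattern can be built without an index loop, which leaves less room for off-by-one slips. The sketch below is one possible way to do it; the helper name `companion` and the small test column are assumptions made for illustration, not part of the original code.

```python
import numpy as np

def companion(last_column):
    """Companion-style matrix: ones on the subdiagonal, coefficients in the last column."""
    n = len(last_column)
    C = np.eye(n, k=-1)          # C[i+1, i] = 1 for i = 0 .. n-2
    C[:, -1] = last_column       # same assignment as C[:, n-1] = Co_values
    return C

# Loop version from the answer, for comparison
def companion_loop(last_column):
    n = len(last_column)
    C = np.zeros((n, n))
    for i in range(0, n - 1):
        C[i + 1, i] = 1
    C[:, n - 1] = last_column
    return C

# Small illustrative column: this choice gives a cyclic permutation matrix,
# whose eigenvalues are the cube roots of unity
col = np.array([1.0, 0.0, 0.0])
C = companion(col)
eig = np.linalg.eigvals(C)
print(np.abs(eig))   # moduli of the eigenvalues
```

For this particular column every eigenvalue has modulus 1, so the plot would show three points on the unit circle.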

python bandpass filter - singular matrix error

I've been trying to design a bandpass filter using scipy, but I keep getting a LinAlg Singular Matrix error. I read that a singular matrix is one that is not invertible, but I'm not sure how that error is coming up or what I can do to fix it.
The code takes in an EEG signal (which, in the code below, I have just replaced with an int array for testing) and filters out frequencies < 8Hz and > 12Hz (alpha band)
Can anyone shed some light on where the singular matrix error is coming from? Or alternatively, if you know of a better way to filter a signal like this I'd love to test out other options too
import numpy as np
from scipy import signal
from scipy.signal import filter_design as fd
import matplotlib.pylab as plt
#bandpass
Wp = [8, 12] # Cutoff frequency
Ws = [7.5, 12.5] # Stop frequency
Rp = 1 # passband maximum loss (gpass)
As = 100 # stopband min attenuation (gstop)
b,a = fd.iirdesign(Wp,Ws,Rp,As,ftype='butter')
w,H = signal.freqz(b,a) # filter response
plt.plot(w,H)
t = np.linspace(1,256,256)
x = np.arange(256)
plt.plot(t,x)
y = signal.filtfilt(b,a,x)
plt.plot(t,y)
As indicated in the iirdesign documentation, Wp and Ws "are normalized from 0 to 1, where 1 is the Nyquist frequency".
If your sampling rate is Fs (for example 100Hz) you can normalize the cutoff and stop frequencies using:
Wp = [x / (Fs/2.0) for x in Wp]
Ws = [x / (Fs/2.0) for x in Ws]
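To see the normalization in a complete run, here is a minimal sketch. It assumes a sampling rate of 100 Hz (the example value above) and, as a deliberate substitution, uses a modest 4th-order Butterworth design via signal.butter in second-order sections rather than the iirdesign call from the question, since very tight specs can lead to high-order filters that are numerically fragile in (b, a) form. The test signal is synthetic.

```python
import numpy as np
from scipy import signal

fs = 100.0                        # assumed sampling rate in Hz
nyquist = fs / 2.0
band_hz = np.array([8.0, 12.0])   # alpha band edges in Hz
wn = band_hz / nyquist            # normalized edges: 1.0 is the Nyquist frequency

# 4th-order Butterworth band-pass, returned as second-order sections
sos = signal.butter(4, wn, btype='bandpass', output='sos')

# Synthetic signal: a 10 Hz component inside the band plus a 30 Hz component outside it
t = np.arange(1024) / fs
alpha = np.sin(2 * np.pi * 10.0 * t)
other = np.sin(2 * np.pi * 30.0 * t)

# Zero-phase filtering (forward and backward pass)
y = signal.sosfiltfilt(sos, alpha + other)
```

Away from the edges of the record, y should follow the 10 Hz component closely while the 30 Hz component is strongly attenuated.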

Is there a Python equivalent to the mahalanobis() function in R? If not, how can I implement it?

I have the following code in R that calculates the mahalanobis distance on the Iris dataset and returns a numeric vector with 150 values, one for every observation in the dataset.
x=read.csv("Iris Data.csv")
mean<-colMeans(x)
Sx<-cov(x)
D2<-mahalanobis(x,mean,Sx)
I tried to implement the same in Python using 'scipy.spatial.distance.mahalanobis(u, v, VI)' function, but it seems this function takes only one-dimensional arrays as parameters.
I used the Iris dataset from R; I suppose it is the same one you are using.
First, this is my R benchmark, for comparison:
x <- read.csv("IrisData.csv")
x <- x[,c(2,3,4,5)]
mean<-colMeans(x)
Sx<-cov(x)
D2<-mahalanobis(x,mean,Sx)
Then, in python you can use:
from scipy.spatial.distance import mahalanobis
from scipy import linalg
import pandas as pd
x = pd.read_csv('IrisData.csv')
x = x.iloc[:,1:]  # drop the first column; .ix has been removed from pandas
Sx = x.cov().values
Sx = linalg.inv(Sx)  # inverse covariance matrix
mean = x.mean().values
def mahalanobisR(X,meanCol,IC):
    m = []
    for i in range(X.shape[0]):
        # square the result to match R's mahalanobis(), which returns D^2
        m.append(mahalanobis(X.iloc[i,:],meanCol,IC) ** 2)
    return(m)
mR = mahalanobisR(x,mean,Sx)
I defined a function so you can use it on other datasets (note that I use pandas DataFrames as inputs).
Comparing results:
In R
> D2[c(1,2,3,4,5)]
[1] 2.134468 2.849119 2.081339 2.452382 2.462155
In Python:
In [43]: mR[0:5]
Out[45]:
[2.1344679233248431,
2.8491186861585733,
2.0813386639577991,
2.4523816316796712,
2.4621545347140477]
Just be careful that what you get in R is the squared Mahalanobis distance.
A simpler solution would be:
import numpy as np
from scipy.spatial.distance import cdist
x = ...
mean = x.mean(axis=0).reshape(1, -1) # make sure 2D
vi = np.linalg.inv(np.cov(x.T))
cdist(mean, x, 'mahalanobis', VI=vi) # distances (not squared), shape (1, n)
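For a self-contained comparison of the two conventions, here is a sketch on synthetic data (a random 150 x 4 array stands in for the Iris measurements, which is an assumption). It computes the squared distances directly with np.einsum, which is what R's mahalanobis() returns, and checks them against the square of the cdist result.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Synthetic stand-in for the 150 x 4 Iris measurements (assumption)
rng = np.random.default_rng(0)
x = rng.normal(size=(150, 4))

mean = x.mean(axis=0)
vi = np.linalg.inv(np.cov(x, rowvar=False))   # inverse sample covariance

# Squared Mahalanobis distance of every row: diff_i^T . VI . diff_i
diff = x - mean
d2 = np.einsum('ij,jk,ik->i', diff, vi, diff)

# cdist returns the (unsquared) distance, as a 1 x n array
d = cdist(mean.reshape(1, -1), x, 'mahalanobis', VI=vi).ravel()

print(np.allclose(d**2, d2))
```

A handy sanity check: with the sample covariance (ddof=1), the squared distances to the sample mean sum to (n - 1) × p, here 149 × 4 = 596.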
