Get the Transformation Matrix From the SciPy Procrustes Implementation - python

The Procrustes library has an example demonstrating how to obtain the transformation matrix relating two matrices by solving the Procrustes problem. The library appears to be unmaintained and doesn't work in Python 3.
I was wondering whether there's a way to use the SciPy implementation of the Procrustes problem to solve the exact problem discussed in the library's example.
Another StackOverflow question seems to ask for exactly what I need here, but I can't get it to give me the proper transformation matrix that would transform the source matrix to nearly match the target matrix.
In summary, I'd like to be able to implement this example using the SciPy library.

You could use scipy.linalg.orthogonal_procrustes. Here's a demonstration. Note that the function generateAB only exists to generate the arrays A and B for the demo. The key steps of the calculation are to center A and B, and then call orthogonal_procrustes.
import numpy as np
from scipy.stats import ortho_group
from scipy.linalg import orthogonal_procrustes


def generateAB(shape, noise=0, rng=None):
    # Generate A and B for the example.
    if rng is None:
        rng = np.random.default_rng()
    m, n = shape
    # Random matrix A
    A = 3 + 2*rng.random(shape)
    Am = A.mean(axis=0, keepdims=True)
    # Random orthogonal matrix T
    T = ortho_group.rvs(n, random_state=rng)
    # Target matrix B
    B = ((A - Am) @ T + rng.normal(scale=noise, size=A.shape)
         + 3*rng.random((1, n)))
    # Include T in the return, but in a real problem, T would not be known.
    return A, B, T


# For reproducibility, use a seeded RNG.
rng = np.random.default_rng(0x1ce1cebab1e)
A, B, T = generateAB((7, 5), noise=0.01, rng=rng)

# Find Q. Note that `orthogonal_procrustes` does not include
# dilation or translation. To handle translation, we center
# A and B by subtracting the means of the points.
A0 = A - A.mean(axis=0, keepdims=True)
B0 = B - B.mean(axis=0, keepdims=True)
Q, scale = orthogonal_procrustes(A0, B0)

with np.printoptions(precision=3, suppress=True):
    print('T (used to generate B from A):')
    print(T)
    print('Q (computed by orthogonal_procrustes):')
    print(Q)
    print('\nCompare A0 @ Q with B0.')
    print('A0 @ Q:')
    print(A0 @ Q)
    print('B0 (should be close to A0 @ Q if the noise parameter was small):')
    print(B0)
Output:
T (used to generate B from A):
[[-0.873 0.017 0.202 -0.44 -0.054]
[-0.129 0.606 -0.763 -0.047 -0.18 ]
[ 0.055 -0.708 -0.567 -0.408 0.088]
[ 0.024 0.24 -0.028 -0.168 0.955]
[ 0.466 0.272 0.235 -0.78 -0.21 ]]
Q (computed by orthogonal_procrustes):
[[-0.871 0.022 0.203 -0.443 -0.052]
[-0.129 0.604 -0.765 -0.046 -0.178]
[ 0.053 -0.709 -0.565 -0.409 0.087]
[ 0.027 0.239 -0.029 -0.166 0.956]
[ 0.47 0.273 0.233 -0.779 -0.21 ]]
Compare A0 @ Q with B0.
A0 @ Q:
[[-0.622 0.224 0.946 1.038 0.578]
[ 0.263 0.143 -0.031 -0.949 0.492]
[-0.49 0.758 0.473 -0.221 -0.755]
[ 0.205 -0.74 0.065 -0.192 -0.551]
[-0.295 -0.434 -1.103 0.444 0.547]
[ 0.585 -0.378 -0.645 -0.233 0.651]
[ 0.354 0.427 0.296 0.113 -0.963]]
B0 (should be close to A0 @ Q if the noise parameter was small):
[[-0.627 0.226 0.949 1.032 0.576]
[ 0.268 0.135 -0.028 -0.95 0.492]
[-0.493 0.765 0.475 -0.201 -0.75 ]
[ 0.214 -0.743 0.071 -0.196 -0.55 ]
[-0.304 -0.433 -1.115 0.451 0.551]
[ 0.589 -0.375 -0.645 -0.235 0.651]
[ 0.354 0.426 0.292 0.1 -0.969]]
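If you also need the translation that maps the original (uncentered) A onto B, you can assemble it from Q and the two centroids. A minimal sketch, assuming it runs after the code above so A, B and Q already exist:
A_mean = A.mean(axis=0, keepdims=True)
B_mean = B.mean(axis=0, keepdims=True)
# Rotate the centered points by Q, then shift by B's centroid.
B_approx = (A - A_mean) @ Q + B_mean
# Should print True when the noise parameter is small.
print(np.allclose(B_approx, B, atol=0.05))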

Related

How to introduce missing values in time series data

I'm new to Python and also new to this site. My colleague and I are working on a time series dataset. We want to introduce some missing values into the dataset and then use various techniques to fill them in, to see how well those techniques perform at the data imputation task. The challenge at the moment is how to introduce missing values in a consecutive manner, not just randomly. For example, we want to replace the data for a period of time with NaNs, e.g. 3 consecutive days. I would really appreciate it if anyone could point us in the right direction. We are working in Python.
Here is my sample data
There is a method for filling NaNs:
dataframe['name_of_column'].fillna('value')
See set_missing_data function below:
import numpy as np

np.set_printoptions(precision=3, linewidth=1000)

def set_missing_data(data, missing_locations, missing_length):
    # Overwrite `missing_length` consecutive entries starting at each location.
    for i in missing_locations:
        data[i:i+missing_length] = np.nan

np.random.seed(0)
n_data_points = np.random.randint(40, 50)
data = np.random.normal(size=[n_data_points])
n_missing = np.random.randint(3, 6)
missing_length = 3
# Choose start indices so each run of NaNs fits inside the array.
missing_locations = np.random.choice(
    n_data_points - missing_length,
    size=n_missing,
    replace=False
)
print(data)
set_missing_data(data, missing_locations, missing_length)
print(data)
Console output:
[ 0.118 0.114 0.37 1.041 -1.517 -0.866 -0.055 -0.107 1.365 -0.098 -2.426 -0.453 -0.471 0.973 -1.278 1.437 -0.078 1.09 0.097 1.419 1.168 0.947 1.085 2.382 -0.406 0.266 -1.356 -0.114 -0.844 0.706 -0.399 -0.827 -0.416 -0.525 0.813 -0.229 2.162 -0.957 0.067 0.206 -0.457 -1.06 0.615 1.43 -0.212]
[ 0.118 nan nan nan -1.517 -0.866 -0.055 -0.107 nan nan nan -0.453 -0.471 0.973 -1.278 1.437 -0.078 1.09 0.097 nan nan nan 1.085 2.382 -0.406 0.266 -1.356 -0.114 -0.844 0.706 -0.399 -0.827 -0.416 -0.525 0.813 -0.229 2.162 -0.957 0.067 0.206 -0.457 -1.06 0.615 1.43 -0.212]
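Since the question specifically asks about blanking out a stretch such as 3 consecutive days, here is a minimal pandas sketch of the same idea on a series with a DatetimeIndex. The dates and values are made up for illustration:
import numpy as np
import pandas as pd

# Hypothetical daily series; substitute your own data.
idx = pd.date_range('2020-01-01', periods=30, freq='D')
ts = pd.Series(np.random.normal(size=30), index=idx, name='value')

# Replace 3 consecutive days with NaN.
ts.loc['2020-01-10':'2020-01-12'] = np.nan

print(ts['2020-01-08':'2020-01-14'])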

Print two arrays side by side using numpy

I'm trying to create a table of cosines using numpy in python. I want to have the angle next to the cosine of the angle, so it looks something like this:
0.0 1.000 5.0 0.996 10.0 0.985 15.0 0.966
20.0 0.940 25.0 0.906 and so on.
I'm trying to do it using a for loop but I'm not sure how to get this to work.
Currently, I have .
Any suggestions?
Let's say you have:
>>> d = np.linspace(0, 360, 10, endpoint=False)
>>> c = np.cos(np.radians(d))
If you don't mind having some brackets and such on the side, then you can simply concatenate column-wise using np.c_, and display:
>>> print(np.c_[d, c])
[[ 0.00000000e+00 1.00000000e+00]
[ 3.60000000e+01 8.09016994e-01]
[ 7.20000000e+01 3.09016994e-01]
[ 1.08000000e+02 -3.09016994e-01]
[ 1.44000000e+02 -8.09016994e-01]
[ 1.80000000e+02 -1.00000000e+00]
[ 2.16000000e+02 -8.09016994e-01]
[ 2.52000000e+02 -3.09016994e-01]
[ 2.88000000e+02 3.09016994e-01]
[ 3.24000000e+02 8.09016994e-01]]
But if you care about removing them, one possibility is to use a simple regex:
>>> import re
>>> print(re.sub(r' *\n *', '\n',
...       np.array_str(np.c_[d, c]).replace('[', '').replace(']', '').strip()))
0.00000000e+00 1.00000000e+00
3.60000000e+01 8.09016994e-01
7.20000000e+01 3.09016994e-01
1.08000000e+02 -3.09016994e-01
1.44000000e+02 -8.09016994e-01
1.80000000e+02 -1.00000000e+00
2.16000000e+02 -8.09016994e-01
2.52000000e+02 -3.09016994e-01
2.88000000e+02 3.09016994e-01
3.24000000e+02 8.09016994e-01
I'm removing the brackets, and then passing it to the regex to remove the spaces on either side in each line.
np.array_str also lets you set the precision. For more control, you can use np.array2string instead.
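For example, a sketch of the same table using np.array2string with a custom float formatter (the field width here is arbitrary):
import numpy as np

d = np.linspace(0, 360, 10, endpoint=False)
c = np.cos(np.radians(d))

# 'float_kind' applies the formatter to every float element.
s = np.array2string(np.c_[d, c],
                    formatter={'float_kind': lambda x: f'{x:10.3f}'})
print(s.replace('[', ' ').replace(']', ''))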
Side-by-Side Array Comparison using Numpy
A built-in NumPy approach: the column_stack method.
numpy.column_stack((A, B)) stacks arrays as columns, which lets you compare two or more matrices/arrays side by side.
Pass the arrays to numpy.column_stack((A, B)) as a single tuple: the parentheses form one argument containing as many matrices/arrays as you want.
import numpy as np
A = np.random.uniform(size=(10,1))
B = np.random.uniform(size=(10,1))
C = np.random.uniform(size=(10,1))
np.column_stack((A, B, C)) ## <-- Compare Side-by-Side
The result looks like this:
array([[0.40323596, 0.95947336, 0.21354263],
[0.18001121, 0.35467198, 0.47653884],
[0.12756083, 0.24272134, 0.97832504],
[0.95769626, 0.33855075, 0.76510239],
[0.45280595, 0.33575171, 0.74295859],
[0.87895151, 0.43396391, 0.27123183],
[0.17721346, 0.06578044, 0.53619146],
[0.71395251, 0.03525021, 0.01544952],
[0.19048783, 0.16578012, 0.69430883],
[0.08897691, 0.41104408, 0.58484384]])
NumPy's column_stack is useful in AI/ML applications for comparing predicted results with the expected answers, which helps gauge the effectiveness of neural-net training; it is a quick way to detect where errors occur in the network calculations.
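As a quick sketch of that use case (the arrays are invented for illustration):
import numpy as np

y_true = np.array([1.0, 0.0, 1.0, 1.0])  # expected answers
y_pred = np.array([0.9, 0.2, 0.7, 1.1])  # model predictions

# Columns: expected, predicted, absolute error.
print(np.column_stack((y_true, y_pred, np.abs(y_true - y_pred))))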
Pandas is a very convenient module for such tasks:
In [174]: import pandas as pd
     ...:
     ...: x = pd.DataFrame({'angle': np.linspace(0, 355, 355//5+1),
     ...:                   'cos': np.cos(np.deg2rad(np.linspace(0, 355, 355//5+1)))})
     ...:
     ...: pd.options.display.max_rows = 20
     ...:
     ...: x
     ...:
Out[174]:
angle cos
0 0.0 1.000000
1 5.0 0.996195
2 10.0 0.984808
3 15.0 0.965926
4 20.0 0.939693
5 25.0 0.906308
6 30.0 0.866025
7 35.0 0.819152
8 40.0 0.766044
9 45.0 0.707107
.. ... ...
62 310.0 0.642788
63 315.0 0.707107
64 320.0 0.766044
65 325.0 0.819152
66 330.0 0.866025
67 335.0 0.906308
68 340.0 0.939693
69 345.0 0.965926
70 350.0 0.984808
71 355.0 0.996195
[72 rows x 2 columns]
You can use python's zip function to go through the elements of both lists simultaneously.
import numpy as np

degreesVector = np.linspace(0.0, 360.0, 73)
cosinesVector = np.cos(np.radians(degreesVector))
for d, c in zip(degreesVector, cosinesVector):
    print(d, c)
And if you want to make a numpy array out of the degrees and cosine values, you can modify the for loop in this way:
table = []
for d, c in zip(degreesVector, cosinesVector):
    table.append([d, c])
table = np.array(table)
And now on one line!
np.array([[d, c] for d, c in zip(degreesVector, cosinesVector)])
You were close - but if you iterate over angles, just generate the cosine for that angle:
In [293]: for angle in range(0,60,10):
     ...:     print('{0:8}{1:8.3f}'.format(angle, np.cos(np.radians(angle))))
     ...:
0 1.000
10 0.985
20 0.940
30 0.866
40 0.766
50 0.643
To work with arrays, you have lots of options:
In [294]: angles=np.linspace(0,60,7)
In [295]: cosines=np.cos(np.radians(angles))
iterate over an index:
In [297]: for i in range(angles.shape[0]):
     ...:     print('{0:8}{1:8.3f}'.format(angles[i], cosines[i]))
Use zip to dish out the values 2 by 2:
for a, c in zip(angles, cosines):
    print('{0:8}{1:8.3f}'.format(a, c))
A slight variant on that:
for ac in zip(angles, cosines):
    print('{0:8}{1:8.3f}'.format(*ac))
You could concatenate the arrays together into a 2d array, and display that:
In [302]: np.vstack((angles, cosines)).T
Out[302]:
array([[ 0. , 1. ],
[ 10. , 0.98480775],
[ 20. , 0.93969262],
[ 30. , 0.8660254 ],
[ 40. , 0.76604444],
[ 50. , 0.64278761],
[ 60. , 0.5 ]])
In [318]: print(np.vstack((angles, cosines)).T)
[[ 0. 1. ]
[ 10. 0.98480775]
[ 20. 0.93969262]
[ 30. 0.8660254 ]
[ 40. 0.76604444]
[ 50. 0.64278761]
[ 60. 0.5 ]]
np.column_stack can do that without the transpose.
And you can pass that array to your formatting with:
for ac in np.vstack((angles, cosines)).T:
    print('{0:8}{1:8.3f}'.format(*ac))
or you could write that to a csv style file with savetxt (which just iterates over the 'rows' of the 2d array and writes with fmt):
In [310]: np.savetxt('test.txt', np.vstack((angles, cosines)).T, fmt='%8.1f %8.3f')
In [311]: cat test.txt
0.0 1.000
10.0 0.985
20.0 0.940
30.0 0.866
40.0 0.766
50.0 0.643
60.0 0.500
Unfortunately savetxt requires the old style formatting. And trying to write to sys.stdout runs into byte v unicode string issues in Py3.
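One workaround, if you do want savetxt-style printing in Python 3, is to write to sys.stdout.buffer (the underlying binary stream) instead of sys.stdout; this sidesteps the bytes-versus-str mismatch that older NumPy versions hit with text-mode streams:
import sys
import numpy as np

angles = np.linspace(0, 60, 7)
cosines = np.cos(np.radians(angles))
np.savetxt(sys.stdout.buffer, np.vstack((angles, cosines)).T,
           fmt='%8.1f %8.3f')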
Just in numpy with some format ideas, using @MaxU's syntax:
a = np.array([[i, np.cos(np.deg2rad(i)), np.sin(np.deg2rad(i))]
              for i in range(0, 361, 30)])
args = ["Angle", "Cos", "Sin"]
frmt = ("{:>8.0f}" + "{:>8.3f}"*2)
print(("{:^8}"*3).format(*args))
for i in a:
    print(frmt.format(*i))
Angle Cos Sin
0 1.000 0.000
30 0.866 0.500
60 0.500 0.866
90 0.000 1.000
120 -0.500 0.866
150 -0.866 0.500
180 -1.000 0.000
210 -0.866 -0.500
240 -0.500 -0.866
270 -0.000 -1.000
300 0.500 -0.866
330 0.866 -0.500
360 1.000 -0.000

MUSIC Algorithm Spectrum Python Implementation

I am working on a small radar project that can measure the Doppler shift created by the heart and chest. Since I know the number of sources in advance, I decided to use the MUSIC algorithm for spectral analysis. I am acquiring data and sending it to Python for analysis. However, my Python code says that the power is equal for ALL frequencies of a signal that mixes two sinusoids at 1 Hz and 2 Hz. My code is below, with a sample output:
from scipy import signal
import numpy as np
from numpy import linalg as LA
import matplotlib.pyplot as plt
import cmath
import scipy

N = 5
z = np.linspace(0, 2*np.pi, num=N)
x = np.sin(2*np.pi * z) + np.sin(1 * np.pi * z) + np.random.random(N) * 0.3  # sample signal
conj = np.conj(x)
l = len(conj)
sRate = 25  # sampling rate
p = 2
flipped = conj[::-1]
acf = signal.convolve(x, flipped, 'full')
# Autocorrelation matrix that will be decomposed into eigenvectors
a1 = scipy.linalg.toeplitz(c=np.asarray(acf), r=np.asarray(acf))
eigenValues, eigenVectors = LA.eig(a1)
# Sort the eigenvectors and eigenvalues from greatest to least eigenvalue
idx = eigenValues.argsort()[::-1]
eigenValues = eigenValues[idx]
eigenVectors = eigenVectors[:, idx]
# These vectors make up the signal subspace; the number of principal
# components, p = 2, splits the eigenvectors
signal_eigen = eigenVectors[0:p]
noise_eigen = eigenVectors[p:len(eigenVectors)]  # noise subspace
for f in range(0, sRate):
    sum1 = 0
    frequencyVector = np.zeros(len(noise_eigen[0]), dtype=np.complex_)
    for i in range(0, len(noise_eigen[0])):
        # Create a frequency vector e^(2*pi*i*f), taking the conjugate of each component
        frequencyVector[i] = np.conjugate(complex(np.cos(2 * np.pi * i * f), np.sin(2 * np.pi * i * f)))
    for u in range(0, len(noise_eigen)):
        # Sum the squared magnitudes of the dot products of each noise eigenvector with the frequency vector
        sum1 += (abs(np.dot(np.asarray(frequencyVector).transpose(), np.asarray(noise_eigen[u]))))**2
    print(1/sum1)
    print("\n")
"""
(OUTPUT OF THE ABOVE CODE)
0.120681885992
0
0.120681885992
1
0.120681885992
2
0.120681885992
3
0.120681885992
4
0.120681885992
5
0.120681885992
6
0.120681885992
7
0.120681885992
8
0.120681885992
9
0.120681885992
10
0.120681885992
11
0.120681885992
12
0.120681885992
13
0.120681885992
14
0.120681885992
15
0.120681885992
16
0.120681885992
17
0.120681885992
18
0.120681885992
19
0.120681885992
20
0.120681885992
21
0.120681885992
22
0.120681885992
23
0.120681885992
24
Process finished with exit code 0
"""
Here is the formula for the MUSIC Algorithm:
https://drive.google.com/file/d/0B5EG2FEWlIZwYmkteUludHNXS0k/view?usp=sharing
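For reference, in case the link is unavailable: the standard MUSIC pseudospectrum, which the code above is attempting to compute, has the form

P_{\mathrm{MUSIC}}(f) = \frac{1}{\mathbf{e}(f)^{H} \mathbf{E}_n \mathbf{E}_n^{H} \mathbf{e}(f)} = \frac{1}{\sum_{k=p+1}^{M} \left| \mathbf{e}(f)^{H} \mathbf{v}_k \right|^{2}}, \qquad \mathbf{e}(f) = \left[ 1,\; e^{j 2 \pi f},\; \dots,\; e^{j 2 \pi (M-1) f} \right]^{T}

where the \mathbf{v}_k are the noise-subspace eigenvectors (the columns of \mathbf{E}_n).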
Mathematically, the problem is that i and f are both integers. Thus, 2*π*i*f is an integral multiple of 2π. Allowing for a tiny bit of round-off error, this gives you a cosine very close to 1.0 and a sine very close to 0.0. These values yield virtually no variation in frequencyVector from one iteration to the next.
I also see a problem in that you set up your signal_eigen matrix, but never use it. Isn't the signal itself required by this algorithm? As a result, all you're doing is sampling the noise at intervals of 2πi.
Let's try chopping up one cycle into sRate evenly-spaced sampling points. This results in spikes at 0.24 and 0.76 (out of the range 0.0 - 0.99). Does this match your intuition about how this should work?
signal_eigen = eigenVectors[0:p]
noise_eigen = eigenVectors[p:len(eigenVectors)]  # noise subspace
print("Signal\n", signal_eigen)
print("Noise\n", noise_eigen)
for f_int in range(0, sRate * p + 1):
    sum1 = 0
    frequencyVector = np.zeros(len(noise_eigen[0]), dtype=np.complex_)
    f = float(f_int) / sRate
    for i in range(0, len(noise_eigen[0])):
        # create a frequency vector e^(2*pi*i*f), taking the conjugate of each component
        frequencyVector[i] = np.conjugate(complex(np.cos(2 * np.pi * i * f), np.sin(2 * np.pi * i * f)))
        # print(f, i, np.pi, np.cos(2 * np.pi * i * f))
    # print(frequencyVector)
    for u in range(0, len(noise_eigen)):
        # sum the squared dot product of each noise eigenvector and frequency vector.
        sum1 += (abs(np.dot(np.asarray(frequencyVector).transpose(), np.asarray(noise_eigen[u]))))**2
    print(f, 1/sum1)
Output
Signal
[[ -3.25974386e-01 3.26744322e-01 -5.24205744e-16 -1.84108176e-01
-7.07106781e-01 -6.86652798e-17 2.71561652e-01 3.78607948e-16
4.23482344e-01]
[ 3.40976541e-01 5.42419088e-02 -5.00000000e-01 -3.62655793e-01
-1.06880232e-16 3.53553391e-01 -3.89304223e-01 -3.53553391e-01
3.12595284e-01]]
Noise
[[ -3.06261935e-01 -5.16768248e-01 7.82012443e-16 -3.72989138e-01
-3.12515753e-16 -5.00000000e-01 5.19589478e-03 -5.00000000e-01
-2.51205535e-03]
[ 3.21775774e-01 8.19916352e-02 5.00000000e-01 -3.70053622e-01
1.44550753e-16 3.53553391e-01 4.33613344e-01 -3.53553391e-01
-2.54514258e-01]
[ -4.00349040e-01 4.82750272e-01 -8.71533036e-16 -3.42123880e-01
-2.68725150e-16 2.42479504e-16 -4.16290671e-01 -4.89739378e-16
-5.62428795e-01]
[ 3.21775774e-01 8.19916352e-02 -5.00000000e-01 -3.70053622e-01
-2.80456498e-16 -3.53553391e-01 4.33613344e-01 3.53553391e-01
-2.54514258e-01]
[ -3.06261935e-01 -5.16768248e-01 1.08027782e-15 -3.72989138e-01
-1.25036869e-16 5.00000000e-01 5.19589478e-03 5.00000000e-01
-2.51205535e-03]
[ 3.40976541e-01 5.42419088e-02 5.00000000e-01 -3.62655793e-01
-2.64414807e-16 -3.53553391e-01 -3.89304223e-01 3.53553391e-01
3.12595284e-01]
[ -3.25974386e-01 3.26744322e-01 -4.97151703e-16 -1.84108176e-01
7.07106781e-01 -1.62796158e-16 2.71561652e-01 2.06561854e-16
4.23482344e-01]]
0.0 0.115397176866
0.04 0.12355071192
0.08 0.135377011677
0.12 0.136669716901
0.16 0.148772917566
0.2 0.195742574649
0.24 0.237792763699
0.28 0.181921271171
0.32 0.12959840172
0.36 0.121070836044
0.4 0.139075881122
0.44 0.139216853056
0.48 0.117815494324
0.52 0.117815494324
0.56 0.139216853056
0.6 0.139075881122
0.64 0.121070836044
0.68 0.12959840172
0.72 0.181921271171
0.76 0.237792763699
0.8 0.195742574649
0.84 0.148772917566
0.88 0.136669716901
0.92 0.135377011677
0.96 0.12355071192
I'm also unsure of the correct implementation; having more of the paper for formula context would help. I'm not certain about the range and sampling of the f values. When I worked on FFT software, f was swept over the wave form in small increments, typically 2π/sRate.
I'm not getting those distinctive spikes now -- not sure what I did before. I made a small parametrized change, adding a num_slice variable:
num_slice = sRate * N
for f_int in range(0, num_slice + 1):
    sum1 = 0
    frequencyVector = np.zeros(len(noise_eigen[0]), dtype=np.complex_)
    f = float(f_int) / num_slice
You can compute it however you like, of course, but the ensuing loop runs through just the one cycle. Here's my output:
0.0 0.136398199883
0.008 0.136583829848
0.016 0.13711117893
0.024 0.137893463111
0.032 0.138792904453
0.04 0.139633157335
0.048 0.140219450839
0.056 0.140365986349
0.064 0.139926689416
0.072 0.138822121693
0.08 0.137054535152
0.088 0.13470609994
0.096 0.131921188389
0.104 0.128879079596
0.112 0.125765649854
0.12 0.122750994163
0.128 0.119976226317
0.136 0.117549199221
0.144 0.115546862203
0.152 0.114021482029
0.16 0.113008398728
0.168 0.112533730494
0.176 0.112621097254
0.184 0.113296863522
0.192 0.114593615279
0.2 0.116551634665
0.208 0.119218062482
0.216 0.12264326497
0.224 0.126873674308
0.232 0.131940131305
0.24 0.137840727381
0.248 0.144517728837
0.256 0.151830000359
0.264 0.159526062508
0.272 0.167228413981
0.28 0.174444818009
0.288 0.180621604818
0.296 0.185241411664
0.304 0.187943197745
0.312 0.188619481273
0.32 0.187445977812
0.328 0.184829467764
0.336 0.181300320748
0.344 0.177396490666
0.352 0.173576190425
0.36 0.170171993077
0.368 0.167379359825
0.376 0.165265454514
0.384 0.163786582966
0.392 0.16280869726
0.4 0.162130870823
0.408 0.161514399035
0.416 0.160719375729
0.424 0.159546457646
0.432 0.157875982968
0.44 0.155693319037
0.448 0.153091632029
0.456 0.150251065569
0.464 0.147402137481
0.472 0.144785618099
0.48 0.14261932062
0.488 0.141076562538
0.496 0.140275496354
0.504 0.140275496354
0.512 0.141076562538
0.52 0.14261932062
0.528 0.144785618099
0.536 0.147402137481
0.544 0.150251065569
0.552 0.153091632029
0.56 0.155693319037
0.568 0.157875982968
0.576 0.159546457646
0.584 0.160719375729
0.592 0.161514399035
0.6 0.162130870823
0.608 0.16280869726
0.616 0.163786582966
0.624 0.165265454514
0.632 0.167379359825
0.64 0.170171993077
0.648 0.173576190425
0.656 0.177396490666
0.664 0.181300320748
0.672 0.184829467764
0.68 0.187445977812
0.688 0.188619481273
0.696 0.187943197745
0.704 0.185241411664
0.712 0.180621604818
0.72 0.174444818009
0.728 0.167228413981
0.736 0.159526062508
0.744 0.151830000359
0.752 0.144517728837
0.76 0.137840727381
0.768 0.131940131305
0.776 0.126873674308
0.784 0.12264326497
0.792 0.119218062482
0.8 0.116551634665
0.808 0.114593615279
0.816 0.113296863522
0.824 0.112621097254
0.832 0.112533730494
0.84 0.113008398728
0.848 0.114021482029
0.856 0.115546862203
0.864 0.117549199221
0.872 0.119976226317
0.88 0.122750994163
0.888 0.125765649854
0.896 0.128879079596
0.904 0.131921188389
0.912 0.13470609994
0.92 0.137054535152
0.928 0.138822121693
0.936 0.139926689416
0.944 0.140365986349
0.952 0.140219450839
0.96 0.139633157335
0.968 0.138792904453
0.976 0.137893463111
0.984 0.13711117893
0.992 0.136583829848
1.0 0.136398199883

numpy and sklearn PCA return different covariance vector

I'm trying to learn PCA through and through, but interestingly enough, when I use numpy and sklearn I get different covariance matrix results.
The numpy results match this explanatory text here, but the sklearn results differ from both.
Is there any reason why this is so?
d = pd.read_csv("example.txt", header=None, sep = " ")
print(d)
0 1
0 0.69 0.49
1 -1.31 -1.21
2 0.39 0.99
3 0.09 0.29
4 1.29 1.09
5 0.49 0.79
6 0.19 -0.31
7 -0.81 -0.81
8 -0.31 -0.31
9 -0.71 -1.01
Numpy Results
print(np.cov(d, rowvar = 0))
[[ 0.61655556 0.61544444]
[ 0.61544444 0.71655556]]
sklearn Results
from sklearn.decomposition import PCA
clf = PCA()
clf.fit(d.values)
print(clf.get_covariance())
[[ 0.5549 0.5539]
[ 0.5539 0.6449]]
Because for np.cov:
Default normalization is by (N - 1), where N is the number of observations given (unbiased estimate). If bias is 1, then normalization is by N.
Set bias=1 and the result is the same as PCA's:
In [9]: np.cov(df, rowvar=0, bias=1)
Out[9]:
array([[ 0.5549, 0.5539],
[ 0.5539, 0.6449]])
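The two results differ by exactly the factor (N - 1)/N = 9/10 for these N = 10 points, which you can verify directly:
import numpy as np

c_unbiased = np.array([[0.61655556, 0.61544444],
                       [0.61544444, 0.71655556]])
# Scaling the unbiased estimate by (N - 1)/N recovers sklearn's result.
print(c_unbiased * 9 / 10)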
So I've encountered the same issue, and I think it returns different values because the covariance is calculated in a different way. According to the sklearn documentation, the get_covariance() method uses the noise variances to obtain the covariance matrix.

Python programming - numpy polyfit saying NAN

I am having some issues with a fairly simple piece of code I have written. I have 4 sets of data and want to generate polynomial best-fit lines using numpy polyfit. Three of the four data sets yield coefficients when using polyfit, but the third data set yields NaN. Below are the code and the printout. Any ideas?
Code:
All of the 'ind_#' variables are lists of data. The code below converts them into numpy arrays from which the polynomial best-fit lines can then be generated:
ind_1 = np.array(ind_1, float)
dep_1 = np.array(dep_1, float)
x_1 = np.arange(min(ind_1)-1, max(ind_1)+1, .01)
ind_2 = np.array(ind_2, float)
dep_2 = np.array(dep_2, float)
x_2 = np.arange(min(ind_2)-1, max(ind_2)+1, .01)
ind_3 = np.array(ind_3, float)
dep_3 = np.array(dep_3, float)
x_3 = np.arange(min(ind_3)-1, max(ind_3)+1, .01)
ind_4 = np.array(ind_4, float)
dep_4 = np.array(dep_4, float)
x_4 = np.arange(min(ind_4)-1, max(ind_4)+1, .01)
The code below prints the arrays generated above, as well as the contents of the polyfit result, which are normally the coefficients of the polynomial equation; for the third case, however, all of the polyfit values print as NaN:
print(ind_1)
print(dep_1)
print(np.polyfit(ind_1,dep_1,2))
print(ind_2)
print(dep_2)
print(np.polyfit(ind_2,dep_2,2))
print(ind_3)
print(dep_3)
print(np.polyfit(ind_3,dep_3,2))
print(ind_4)
print(dep_4)
print(np.polyfit(ind_4,dep_4,2))
Print out:
[ 1.405 1.871 2.713 ..., 5.367 5.404 2.155]
[ 0.274 0.07 0.043 ..., 0.607 0.614 0.152]
[ 0.01391925 -0.00950728 0.14803846]
[ 0.9760001 2.067 8.8 ..., 1.301 1.625 2.007 ]
[ 0.219 0.05 0.9810001 ..., 0.163 0.161 0.163 ]
[ 0.00886807 -0.00868727 0.17793324]
[ 1.143 0.9120001 2.162 ..., 2.915 2.865 2.739 ]
[ 0.283 0.3 0.27 ..., 0.227 0.213 0.161]
[ nan nan nan]
[ 0.167 0.315 1.938 ..., 2.641 1.799 2.719]
[ 0.6810001 0.7140001 0.309 ..., 0.283 0.313 0.251 ]
[ 0.00382331 0.00222269 0.16940372]
Why are the polyfit coefficients for the third case listed as NaN? All the data sets contain the same type of data, and the code is consistent. Please help.
Just looked at your data. This is happening because you have a NaN in dep_3 (element 713). You can make sure that you only use finite values in the fit like this:
idx = np.isfinite(ind_3) & np.isfinite(dep_3)
print(np.polyfit(ind_3[idx], dep_3[idx], 2))
As for finding bad values in large datasets, numpy makes that really easy. You can find the indices like this:
print(np.where(~np.isfinite(dep_3)))
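A minimal, self-contained sketch of both steps, using made-up data with one NaN planted in it:
import numpy as np

ind = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
dep = np.array([0.5, np.nan, 1.1, 1.8, 2.4])  # one bad value

print(np.polyfit(ind, dep, 2))        # typically [nan nan nan]
print(np.where(~np.isfinite(dep)))    # (array([1]),)

# Fit using only the finite values.
idx = np.isfinite(ind) & np.isfinite(dep)
print(np.polyfit(ind[idx], dep[idx], 2))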
