Does fancyimpute's SoftImpute require normalized data? - python

The page https://pypi.python.org/pypi/fancyimpute has the lines
# Instead of solving the nuclear norm objective directly, instead
# induce sparsity using singular value thresholding
X_filled_softimpute = SoftImpute().complete(X_incomplete_normalized)
which kind of suggests that I need to normalize the input data. However, I could not find any details on what exactly is meant by that. Do I have to normalize my data beforehand, and what exactly is expected?

Yes, you should definitely normalize the data. Consider the following example:
from fancyimpute import SoftImpute
import numpy as np
v=np.random.normal(100,0.5,(5,3))
v[2,1:3]=np.nan
v[0,0]=np.nan
v[3,0]=np.nan
SoftImpute().complete(v)
The result is
array([[ 81.78428587, 99.69638878, 100.67626769],
[ 99.82026281, 100.09077899, 99.50273223],
[ 99.70946085, 70.98619873, 69.57668189],
[ 81.82898539, 99.66269922, 100.95263318],
[ 99.14285815, 100.10809651, 99.73870089]])
Note that the imputed values at the places where I put nan are completely off. However, if instead you run
from fancyimpute import SoftImpute
import numpy as np
v=np.random.normal(0,1,(5,3))
v[2,1:3]=np.nan
v[0,0]=np.nan
v[3,0]=np.nan
SoftImpute().complete(v)
(the same code as before, the only difference being that v is now drawn from a standard normal distribution, i.e. each column already has mean roughly 0 and std roughly 1), you get the following reasonable result:
array([[ 0.07705556, -0.53449412, -0.20081351],
[ 0.9709198 , -1.19890962, -0.25176222],
[ 0.41839224, -0.11786451, 0.03231515],
[ 0.21374759, -0.66986997, 0.78565414],
[ 0.30004524, 1.28055845, 0.58625942]])
Thus, when you are using SoftImpute, don't forget to normalize your data (you can do that by shifting and scaling each column so that its mean is 0 and its standard deviation is 1).
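For completeness, here is a minimal sketch of that workflow on the first example: standardize each column while ignoring the NaNs, impute, then undo the scaling to get back to the original units. It assumes the same .complete() API used above (newer fancyimpute releases expose fit_transform() instead):
from fancyimpute import SoftImpute
import numpy as np

v = np.random.normal(100, 0.5, (5, 3))
v[2, 1:3] = np.nan
v[0, 0] = np.nan
v[3, 0] = np.nan

# column-wise mean and std, computed over the observed entries only
col_mean = np.nanmean(v, axis=0)
col_std = np.nanstd(v, axis=0)
v_normalized = (v - col_mean) / col_std

# impute on the normalized data, then restore the original scale
v_filled = SoftImpute().complete(v_normalized) * col_std + col_mean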

Related

Numpy applying a time interval sequence to a multidimensional ndarray (such as coordinates)

EDIT: added prefix / suffix values to the interval arrays to make them the same length as their corresponding data arrays, as per #user1319128 's suggestion, and indeed interp does the job. His solution was workable and good; I just couldn't see it because I was tired and stupid.
I am sure this is a fairly mundane application, but I have failed to find or come up with a way to do this without going outside of numpy. Maybe my brain just needs a rest; anyway, here is the problem, with an example and the solution requirements.
So I have two arrays of different lengths, and I want to apply a common set of time intervals to them, so that the result is versions of these arrays that are all the same length and whose values relate to each other at the same row (if that makes sense). In the example below I have named this functionality "apply_timeintervals_to_array". The example code:
import numpy as np
from colorsys import hsv_to_rgb
num_xy = 20
num_colors = 12
#xy = np.random.rand(num_xy, 2) * 1080
xy = np.array([[ 687.32758344, 956.05651214],
[ 226.97671414, 698.48071588],
[ 648.59878864, 175.4882185 ],
[ 859.56600997, 487.25205922],
[ 794.43015178, 16.46114312],
[ 884.7166732 , 634.59100322],
[ 878.94218682, 835.12886098],
[ 965.47135726, 542.09202328],
[ 114.61867445, 601.74092126],
[ 134.02663822, 334.27221884],
[ 940.6589034 , 245.43354493],
[ 285.87902276, 550.32600784],
[ 785.00104142, 993.19960822],
[1040.49576307, 486.24009511],
[ 165.59409198, 156.79786175],
[1043.54280058, 313.09073855],
[ 645.62878826, 100.81909068],
[ 625.78003257, 252.17917611],
[1056.77009875, 793.02218098],
[ 2.93152052, 596.9795026 ]])
xy_deltas = np.sum((xy[1:] - xy[:-1])**2, axis=-1)
xy_ti = np.concatenate(([0.0],
                        xy_deltas / np.sum(xy_deltas)))
colors_ti = np.concatenate((np.linspace(0, 1, num_colors),
                            [1.0]))
common_ti = np.unique(np.sort(np.concatenate((xy_ti,
                                              colors_ti))))
common_colors = (np.array(tuple(hsv_to_rgb(t, 0.9, 0.9)
                                for t in np.concatenate(([0.0],
                                                         common_ti,
                                                         [1.0]))))
                 * 255).astype(int)[1:-1]
common_xy = apply_timeintervals_to_array(common_ti, xy)
So one could then use the common arrays for additional computations or for rendering.
The question is: what could accomplish the "apply_timeintervals_to_array" functionality, or alternatively, is there a better way to generate the same data?
I hope this is clear enough, let me know if it isn't. Thank you in advance.
I think numpy.interp should meet your expectations. For example, if I have a 2D array of length 20 and would like to interpolate it at different common_ti values, whose length is 30, the code would be as follows.
xy = np.arange(0, 400, 10).reshape(20, 2)
xy_ti = np.arange(20) / 19
common_ti = np.linspace(0, 1, 30)
x = np.interp(common_ti, xy_ti, xy[:, 0])  # interpolate the first column
y = np.interp(common_ti, xy_ti, xy[:, 1])  # interpolate the second column
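Applied to the question's setup, apply_timeintervals_to_array could then be a small wrapper that interpolates every column onto the common intervals. This is only a sketch: it assumes the source intervals (passed here as an extra arr_ti argument, which the call in the question does not have) are increasing, as np.interp requires:
import numpy as np

def apply_timeintervals_to_array(common_ti, arr, arr_ti):
    # interpolate each column of arr (sampled at arr_ti) onto common_ti
    return np.column_stack([np.interp(common_ti, arr_ti, arr[:, col])
                            for col in range(arr.shape[1])])

# e.g. common_xy = apply_timeintervals_to_array(common_ti, xy, xy_ti)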

3D interpolation between two clouds of points

I want to interpolate a set of temperatures, defined on each node of the mesh of a CFD simulation, onto a different mesh.
Data from the original set are in csv (X1,Y1,Z1,T1) and I want to find new T2 values on an X2,Y2,Z2 mesh.
Of the many possibilities that SciPy provides, which is the most suitable for this application? What are the differences between a linear and a nearest-node approach?
Thank you for your time.
EDIT
Here is an example:
import numpy as np
from scipy.interpolate import griddata
from scipy.interpolate import LinearNDInterpolator
data = np.array([
[ -3.5622760653000E-02, 8.0497122655290E-02, 3.0788827491158E-01],
[ -3.5854682326000E-02, 8.0591522802259E-02, 3.0784350432341E-01],
[ -2.8168760240000E-02, 8.0819296043557E-02, 3.0988532075795E-01],
[ -2.8413346037000E-02, 8.0890746063578E-02, 3.1002054434659E-01],
[ -2.8168663383000E-02, 8.0981744777379E-02, 3.1015319609412E-01],
[ -3.4150537103000E-02, 8.1385114641365E-02, 3.0865343388355E-01],
[ -3.4461673349000E-02, 8.1537336777452E-02, 3.0858242919307E-01],
[ -3.4285601228000E-02, 8.1655884824782E-02, 3.0877386496235E-01],
[ -2.1832991391000E-02, 8.0380712111108E-02, 3.0867371621337E-01],
[ -2.1933870390000E-02, 8.0335713699008E-02, 3.0867959866155E-01]])
temp = np.array([1.4285955811000E+03,
1.4281038818000E+03,
1.4543135986000E+03,
1.4636379395000E+03,
1.4624763184000E+03,
1.3410919189000E+03,
1.3400545654000E+03,
1.3505817871000E+03,
1.2361110840000E+03,
1.2398562012000E+03])
linInter = LinearNDInterpolator(data, temp)
print(linInter(np.array([[-2.8168760240000E-02, 8.0819296043557E-02, 3.0988532075795E-01]])))
This code works, but I have a dataset of 10 million points to be interpolated onto a dataset of the same size.
The problem is that this operation is very slow to do for all of my points: is there a way to improve my code?
I used LinearNDInterpolator because it seems to be faster than NearestNDInterpolator (LinearVSNearest).
One solution would be to use RegularGridInterpolator (if your grid is regular). Another approach I can think of is to reduce your data size by taking intervals:
step = 4  # you can increase this based on your data size (e.g. 100)
# boolean mask that keeps only a subset of the rows, thinning the point cloud
m = ((data.argsort(0) % step) == 0).any(1)
linInter = LinearNDInterpolator(data[m], temp[m])
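If a pure nearest-node lookup is acceptable, another option (not mentioned above, so treat it as a hedged alternative) is to skip the interpolator objects entirely and query a KD-tree: for millions of target points this is usually much cheaper than building a LinearNDInterpolator over the full cloud. The sketch below reuses data and temp from the example; new_points is a hypothetical array standing in for the real X2, Y2, Z2 target mesh:
import numpy as np
from scipy.spatial import cKDTree

# new_points: (M, 3) array of target mesh coordinates (placeholder for the real X2, Y2, Z2 mesh)
new_points = np.array([[-2.8168760240000E-02, 8.0819296043557E-02, 3.0988532075795E-01]])

tree = cKDTree(data)                  # build the tree once over the source nodes
_, idx = tree.query(new_points, k=1)  # index of the nearest source node for each target point
T2_nearest = temp[idx]                # nearest-node "interpolation" of the temperatures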

How to extract exact frequencies from signal?

I have a 1D signal array. This array holds information about some features that I want to analyze with np.fft.
As an example I tried the following:
My function should be the simple sine wave lambda x: sin(x). In theory, when I put an input array through this function I get a signal array which, when transformed with an FFT, should tell me that the main component of that signal was (in pseudocode) signal = 1*sin(x).
So far I couldn't get any wiser from any of the answers here, so I put this question up.
Now my question: how do I get the "raw" sine component weights from my signal?
This is where I'm stuck:
>>> y = f(x)
>>> fqs = np.fft.fft(y)
>>> fqs
array([ 3.07768354+0.00000000e+00j, 3.68364588+8.32272378e-16j,
8.73514635-7.15951776e-15j, -7.34287625+1.04901868e-14j,
-2.15156054+5.10742080e-15j, -1.1755705 +4.87611209e-16j,
-0.78767676+3.40334406e-16j, -0.58990993+4.25167217e-16j,
-0.476018 -3.43242308e-16j, -0.40636656+1.13055751e-15j,
-0.36327126+1.55440604e-16j, -0.33804202-1.07128132e-16j,
-0.32634218+2.76861429e-16j, -0.32634218+8.99298797e-16j,
-0.33804202+5.02435797e-16j, -0.36327126-1.55440604e-16j,
-0.40636656-3.06536611e-16j, -0.476018 -4.57882679e-17j,
-0.58990993+4.31587904e-16j, -0.78767676+9.75500354e-16j,
-1.1755705 -4.87611209e-16j, -2.15156054-1.87113952e-15j,
-7.34287625+1.79193327e-15j, 8.73514635-6.76648711e-15j,
3.68364588-6.60371698e-15j])
>>> np.abs(_)
array([3.07768354, 3.68364588, 8.73514635, 7.34287625, 2.15156054,
1.1755705 , 0.78767676, 0.58990993, 0.476018 , 0.40636656,
0.36327126, 0.33804202, 0.32634218, 0.32634218, 0.33804202,
0.36327126, 0.40636656, 0.476018 , 0.58990993, 0.78767676,
1.1755705 , 2.15156054, 7.34287625, 8.73514635, 3.68364588])
>>> where do I find my 1*sin(x) ?
Even though your x variable is not shown here, I think you're not generating a periodic function (i.e. your samples do not span a whole number of periods, so the energy leaks into neighbouring bins). This works fine for me:
import numpy as np
x=np.linspace(0,np.pi*2,100,endpoint=False)
y=np.sin(x)
yf=np.fft.rfft(y)
The output (only the first few of the 51 rfft values are shown) is
(-1.5265566588595902e-16+0.0j)
(-1.8485213360008856e-14+-50.0j)
(5.8988036787285649e-15+-3.4015634637549994e-16j)
(-1.0781745022416177e-14+-3.176912458933349e-15j)
(6.9770353907875146e-15+-3.6920723832369405e-15j)
The only non-negligible value is at mode 1, and its imaginary part is -50 = -N/2, which corresponds to a sine component of amplitude 1.
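To turn that into the "raw" weights the question asks for, the bin magnitudes just need the standard 2/N scaling (a minimal sketch, using the same signal as above):
import numpy as np

x = np.linspace(0, 2 * np.pi, 100, endpoint=False)
y = np.sin(x)
yf = np.fft.rfft(y)

amplitudes = 2 * np.abs(yf) / len(y)  # scale the bins back to sine/cosine amplitudes
amplitudes[0] /= 2                    # the DC bin must not be doubled
print(np.round(amplitudes, 6)[:5])    # ~[0, 1, 0, 0, 0]: a single sin(x) of amplitude 1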

Matlab and Python produce different results for PCA

I am using PCA and I found that PCA in sklearn in Python and pca() in Matlab produce different results. Here is the test matrix I am using.
a = np.array([[-1,-1], [-2,-1], [-3, -2], [1,1], [2,1], [3,2]])
For Python sklearn, I got
p = PCA()
print(p.fit_transform(a))
[[-1.38340578 0.2935787 ]
[-2.22189802 -0.25133484]
[-3.6053038 0.04224385]
[ 1.38340578 -0.2935787 ]
[ 2.22189802 0.25133484]
[ 3.6053038 -0.04224385]]
For Matlab, I got
pca(a', 'Centered', false)
[0.2196 0.5340
0.3526 -0.4571
0.5722 0.0768
-0.2196 -0.5340
-0.3526 0.4571
-0.5722 -0.0768]
Why is such a difference observed?
Thanks for Dan's answer. The results look quite reasonable now. However, if I test with a random matrix, it seems that Matlab and Python produce results that are not scalar multiples of each other. Why does this happen?
test matrix a:
[[ 0.36671885 0.77268624 0.94687497]
[ 0.75741855 0.63457672 0.88671836]
[ 0.20818031 0.709373 0.45114135]
[ 0.24488718 0.87400025 0.89382836]
[ 0.16554686 0.74684393 0.08551401]
[ 0.07371664 0.1632872 0.84217978]]
Python results:
p = PCA()
p.fit_transform(a)
[[ 0.25305509 -0.10189215 -0.11661895]
[ 0.36137036 -0.20480169 0.27455458]
[-0.25638649 -0.02923213 -0.01619661]
[ 0.14741593 -0.12777308 -0.2434731 ]
[-0.6122582 -0.08568121 0.06790961]
[ 0.10680331 0.54938026 0.03382447]]
Matlab results:
pca(a', 'Centered', false)
0.504156973865138 -0.0808159771243340 -0.107296852182663
0.502756555190181 -0.174432053627297 0.818826939851221
0.329948209311847 0.315668718703861 -0.138813345638127
0.499181592718705 0.0755364557146097 -0.383301081533716
0.232039797509016 0.694464307249012 -0.0436361728092353
0.284905319274925 -0.612706345940607 -0.387190971583757
Thanks to Dan for his help all through this. In fact I found it was a misuse of the Matlab function: Matlab returns the principal component coefficients (the loadings) by default. Using [~, score] = pca(a, 'Centered', true) gives the same results as Python.
PCA works off eigenvectors. As long as the vectors are parallel, the magnitude is irrelevant (it is just a different normalization).
In your case, the two are scalar multiples of each other. Try (in MATLAB)
Python = [-1.38340578 0.2935787
-2.22189802 -0.25133484
-3.6053038 0.04224385
1.38340578 -0.2935787
2.22189802 0.25133484
3.6053038 -0.04224385]
Matlab = [ 0.2196 0.5340
0.3526 -0.4571
0.5722 0.0768
-0.2196 -0.5340
-0.3526 0.4571
-0.5722 -0.0768]
Now notice that Matlab(:,1)*-6.2997 is basically equal to Python(:,1). Or put another way
Python(:,n)./Matlab(:,n)
gives you (roughly) the same number for each row. This means the two vectors have the same direction (i.e. they are just scalar multiples of each other) and so you are getting the same principal components.
See here for another example: https://math.stackexchange.com/a/1183707/118848
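To see the coefficient/score distinction from the Python side, here is a small sketch (using sklearn's documented fit_transform and components_ attributes; the check itself is my addition, not part of the original answers): the scores returned by fit_transform are just the centered data projected onto the loadings, which is what MATLAB's second output in [~, score] = pca(a) gives.
import numpy as np
from sklearn.decomposition import PCA

a = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

p = PCA()
score_py = p.fit_transform(a)   # scores: the transformed data, as printed in the question
coeff_py = p.components_.T      # loadings: the analogue of MATLAB's first pca() output

# the scores are the centered data times the loadings
manual_score = (a - a.mean(axis=0)) @ coeff_py
print(np.allclose(score_py, manual_score))  # True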

pymc impute passing back 1e20

Can't tell if I am doing something wrong with pymc's impute functionality or if this is a bug. Imputing via the masked array passes 1e20 values for the missing elements, while the inefficient Impute method seems to pass back correct samples. Below is a small example.
import numpy as np
import pymc as py

disasters_array = np.random.random((3, 3))
disasters_array[1, 1] = None

# The inefficient way, using the Impute function:
D = py.Impute('D', py.Normal, disasters_array, mu=.5, tau=1E5)

# The efficient way, using masked arrays:
# Generate masked array. Where the mask is true,
# the value is taken as missing.
print(disasters_array)
masked_values = np.ma.masked_invalid(disasters_array)

# Pass masked array to data stochastic, and it does the right thing
disasters = py.Normal('disasters', mu=.5, tau=1E5, value=masked_values, observed=True)

@py.deterministic
def test(disasters=disasters, D=D):
    print(D)
    print(disasters)

mcmc = py.MCMC(py.Model(set([test, disasters])))
Output:
Original Matrix:
[[ 0.23507836 0.2024624 0.90518228]
[ 0.95816 **nan** 0.43145808]
[ 0.99566308 0.25431568 0.25464137]]
D with imputations:
[[array(0.23507836309832741) array(0.20246240248367342)
array(0.9051822818081371)]
[array(0.9581599997650212) **array(0.5005324083232756)**
array(0.43145807852698237)]
[array(0.9956630757864052) array(0.2543156788973996)
array(0.25464136701826867)]]
Masked Array approach:
[[ 2.35078363e-01 2.02462402e-01 9.05182282e-01]
[ 9.58160000e-01 **1.00000000e+20** 4.31458079e-01]
[ 9.95663076e-01 2.54315679e-01 2.54641367e-01]]
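A side note on the 1e20 value itself (this is about numpy, not necessarily a resolution of the pymc behaviour): 1e20 is simply numpy's default fill value for float masked arrays, so any code path that fills the mask instead of sampling the missing entry will surface exactly this number. A minimal sketch:
import numpy as np

a = np.random.random((3, 3))
a[1, 1] = np.nan
masked = np.ma.masked_invalid(a)

print(masked.fill_value)  # 1e+20, numpy's default fill value for float dtypes
print(masked.filled())    # the missing entry comes back as 1e+20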
