I have a symmetric matrix (1877 x 1877); here is the matrix file. I am trying to scale the values to the range 0-1, but after I apply the method below, the matrix is no longer symmetric. Any help is appreciated.
print((dist.transpose() == dist).all()) # this prints 'True'
def sci_minmax(X):
minmax_scale = preprocessing.MinMaxScaler()
return minmax_scale.fit_transform(X)
sci_dist_scaled = sci_minmax(dist)
(sci_dist_scaled.transpose() == sci_dist_scaled).all() # this prints 'False'
sci_dist_scaled.dtype, dist.dtype # (dtype('float64'), dtype('float64'))
Looking at this description, MinMaxScaler works column-by-column, so naturally you can't expect it to preserve symmetry.
What's best to do in your case depends a bit on what you are trying to achieve, really. If having the values between 0 and 1 is all you require, you can rescale by hand:
mn, mx = dist.min(), dist.max()
dist01 = (dist - mn) / (mx - mn)
but depending on your ultimate problem this may be too simplistic...
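A quick sanity check (just an illustration with a small random stand-in for the real matrix) that this by-hand rescaling preserves symmetry, since it uses a single global min and max:
import numpy as np
dist = np.random.rand(5, 5)
dist = (dist + dist.T) / 2            # small symmetric stand-in matrix
mn, mx = dist.min(), dist.max()
dist01 = (dist - mn) / (mx - mn)
print(np.allclose(dist01, dist01.T))  # True: symmetry is preserved
print(dist01.min(), dist01.max())     # 0.0 1.0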
My goal is to compute a derivative of a moving window of a multidimensional dataset along a given dimension, where the dataset is stored as an Xarray DataArray or Dataset.
In the simplest case, given a 2D array I would like to compute a moving difference across multiple entries in one dimension, e.g.:
data = np.kron(np.linspace(0,1,10), np.linspace(1,4,6) ).reshape(10,6)
T=3
reducedArray = np.zeros_like(data)
for i in range(data.shape[1]):
if i < T:
reducedArray[:,i] = data[:,i] - data[:,0]
else:
reducedArray[:,i] = data[:,i] - data[:,i-T]
where the if i < T condition ensures that input and output contain proper values (i.e., no NaNs) and have identical shape.
Xarray's diff aims to perform a finite-difference approximation of a given derivative order using nearest-neighbours, so it is not suitable here, hence the question:
Is it possible to perform this operation using Xarray functions only?
The rolling weighted average example appears to be something similar, but still too distinct due to the usage of NumPy routines. I've been thinking that something along the lines of the following should work:
xr2DDataArray = xr.DataArray(
data,
dims=('x','y'),
coords={'x':np.linspace(0,1,10), 'y':np.linspace(1,4,6)}
)
r = xr2DDataArray.rolling(x=T,min_periods=2)
r.reduce( redFn )
I am struggling with the definition of redFn here, though.
Caveat: The actual dataset to which the operation is to be applied will have a size of ~10 GiB, so a solution that does not blow up the memory requirements will be highly appreciated!
Update/Solution
Using Xarray rolling
After sleeping on it and a bit more fiddling, the post linked above actually contains a solution. To obtain a finite difference we just have to define the weights to be $\pm 1$ at the ends and $0$ elsewhere:
def fdMovingWindow(data, **kwargs):
    T = kwargs.pop('T')          # window length; remaining kwargs are forwarded to np.sum
    weights = np.zeros(T)
    weights[0] = -1              # difference between the last and first element of the window
    weights[-1] = 1
    axis = kwargs['axis']
    if data.shape[axis] == T:    # full window available
        return np.sum(data * weights, **kwargs)
    else:                        # incomplete window at the edges
        return 0
r.reduce(fdMovingWindow, T=4)
Alternatively, using construct and a dot product:
weights = np.zeros(T)
weights[0] = -1
weights[-1] = 1
xrWeights = xr.DataArray(weights, dims=['window'])
xr2DDataArray.rolling(y=T,min_periods=1).construct('window').dot(xrWeights)
This carries a massive caveat: the procedure essentially creates a list of arrays representing the moving window. This is fine for a modest 2D / 3D array, but for a 4D array that takes up ~10 GiB in memory this will lead to an OOM death!
Simplistic - memory efficient
A less memory-intensive way is to copy the array and work on it in a way similar to plain NumPy arrays:
xrDiffArray = xr2DDataArray.copy()
dy = xr2DDataArray.y.values[1] - xr2DDataArray.y.values[0] #equidistant sampling
for src in xr2DDataArray:
if src.y.values < xr2DDataArray.y.values[0] + T*dy:
xrDiffArray.loc[dict(y = src.y.values)] = src.values - xr2DDataArray.values[0]
else:
xrDiffArray.loc[dict(y = src.y.values)] = src.values - xr2DDataArray.sel(y = src.y.values - dy*T).values
This will produce the intended result without dimensional errors, but it requires a copy of the dataset.
I was hoping to utilise Xarray to prevent a copy and instead just chain operations that are then evaluated if and when values are actually requested.
A suggestion as to how to accomplish this will still be welcomed!
I have never used xarray, so maybe I am mistaken, but I think you can get the result you want while avoiding loops and conditionals. This is at least twice as fast as your example for numpy arrays:
data = np.kron(np.linspace(0,1,10), np.linspace(1,4,6)).reshape(10,6)
reducedArray = np.empty_like(data)
reducedArray[:, T:] = data[:, T:] - data[:, :-T]
reducedArray[:, :T] = data[:, :T] - data[:, 0, np.newaxis]
I imagine the improvement will be higher when using DataArrays.
It does not use xarray functions, and apart from np.empty_like it does not depend on numpy routines either. I was confident that translating this to xarray would be straightforward; I know it works if there are no coords, but once you include them you get an error because of the coords mismatch (the coords of data[:, T:] and of data[:, :-T] are different). Sadly, I can't do better right now.
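Not from the answer above, but one possible xarray-only sketch (my own assumption, untested on the real data) is to let shift handle the offset so the coordinates stay aligned and no coords mismatch occurs:
shifted = xr2DDataArray.shift(y=T)   # shift along y by T; the first T columns become NaN
xrDiff = xr2DDataArray - shifted     # coords line up, so no mismatch error
# note: the boundary handling differs from the loop version (NaN instead of
# "minus column 0"), so the first T columns would still need special treatment;
# with a dask-backed DataArray this should also stay lazy until values are requested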
I'm running the following code:
import numpy as np
import matplotlib
matplotlib.use("TkAgg")
import matplotlib.pyplot as plt
N = 100
t = 1
a1 = np.full((N-1,), -t)
a2 = np.full((N,), 2*t)
Hamiltonian = np.diag(a1, -1) + np.diag(a2) + np.diag(a1, 1)
eval, evec = np.linalg.eig(Hamiltonian)
idx = eval.argsort()[::-1]
eval, evec = eval[idx], evec[:,idx]
wave2 = evec[2] / np.sum(abs(evec[2]))
prob2 = evec[2]**2 / np.sum(evec[2]**2)
_ = plt.plot(wave2)
_ = plt.plot(prob2)
plt.show()
And the plot that comes out is this:
But I'd expect the blue line to be a sinusoid as well. This has got me confused and I can't find what's causing the sudden sign changes. Plotting the absolute value of the function shows that the values associated with each x are fine, but the signs are screwed up.
Any ideas on what might cause this or how to solve it?
Here's a modified version of your script that does what you expected. The changes are:
Corrected the indexing for the eigenvectors; they are the columns of evec.
Use np.linalg.eigh instead of np.linalg.eig. This isn't strictly necessary, but you might as well use the more efficient code.
Don't reverse the order of the sorted eigenvalues. I keep the eigenvalues sorted from lowest to highest. Because eigh returns the eigenvalues in ascending order, I just commented out the code that sorts the eigenvalues.
(Only the first change is a required correction.)
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
N = 100
t = 1
a1 = np.full((N-1,), -t)
a2 = np.full((N,), 2*t)
Hamiltonian = np.diag(a1, -1) + np.diag(a2) + np.diag(a1, 1)
eval, evec = np.linalg.eigh(Hamiltonian)
#idx = eval.argsort()[::-1]
#eval, evec = eval[idx], evec[:,idx]
k = 2
wave2 = evec[:, k] / np.sum(abs(evec[:, k]))
prob2 = evec[:, k]**2 / np.sum(evec[:, k]**2)
_ = plt.plot(wave2)
_ = plt.plot(prob2)
plt.show()
The plot:
I may be wrong, but aren't they all valid eigenvectors/values? The sign shouldn't matter, as the definition of an eigenvector is:
In linear algebra, an eigenvector or characteristic vector of a linear transformation is a non-zero vector that only changes by an overall scale when that linear transformation is applied to it.
Just because the scale is negative doesn't mean it isn't valid.
See this post about Matlab's eig, which has a similar problem.
One way to fix this is to simply pick a sign convention at the start, and multiply by -1 everything that doesn't fit that sign (or take the abs of every element and multiply by your expected sign). For your results this should work (nothing crosses 0).
Neither Matlab nor numpy cares about what you are trying to solve; it's simple mathematics that dictates that both signed eigenvector/value combinations are valid. Your values are sinusoidal; it's just that there exist two sets of eigenvectors/values that work (negative and positive).
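For example, a minimal sketch of that sign-fixing idea (my own illustration, not from the answer; it assumes the eigenvector is not the zero vector):
import numpy as np
def fix_sign(v):
    # flip the eigenvector so that its largest-magnitude component is positive
    return v * np.sign(v[np.argmax(np.abs(v))])
# usage with the corrected indexing from the first answer:
# wave2 = fix_sign(evec[:, 2]) / np.sum(np.abs(evec[:, 2]))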
Very simple regression task. I have three variables x1, x2, x3 with some random noise, and I know the target equation: y = q1*x1 + q2*x2 + q3*x3. Now I want to find the target coefficients q1, q2, q3 and evaluate the performance of the prediction using the mean Relative Squared Error (RSE), (Prediction/Real - 1)^2.
From my research, I see that this is an ordinary least squares problem, but I can't work out from examples on the internet how to solve this particular problem in Python. Let's say I have the data:
import numpy as np
sourceData = np.random.rand(1000, 3)
koefs = np.array([1, 2, 3])
target = np.dot(sourceData, koefs)
(In real life the data are noisy, with a non-normal distribution.) How do I find these koefs using a least squares approach in Python? Any library may be used.
@ayhan made a valuable comment.
And there is a problem with your code: there is actually no noise in the data you collect. The input data is noisy, but after the multiplication you don't add any additional noise.
I've added some noise to your measurements and used the least squares formula to fit the parameters, here's my code:
data = np.random.rand(1000,3)
true_theta = np.array([1,2,3])
true_measurements = np.dot(data, true_theta)
noise = np.random.rand(1000) * 1
noisy_measurements = true_measurements + noise
estimated_theta = np.linalg.inv(data.T @ data) @ data.T @ noisy_measurements
The estimated_theta will be close to true_theta. If you don't add noise to the measurements, they will be equal.
I've used the python3 matrix multiplication syntax.
You could use np.dot instead of @.
That makes the code longer, so I've split the formula:
MTM_inv = np.linalg.inv(np.dot(data.T, data))
MTy = np.dot(data.T, noisy_measurements)
estimated_theta = np.dot(MTM_inv, MTy)
You can read up on least squares here: https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#The_general_problem
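In symbols, the three lines above compute the standard normal-equations solution $\hat{\theta} = (X^T X)^{-1} X^T y$, with $X$ = data and $y$ = noisy_measurements.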
UPDATE:
Or you could just use the builtin least squares function:
np.linalg.lstsq(data, noisy_measurements)
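As a usage note, lstsq returns a tuple, so the coefficients are the first element (passing rcond=None opts into the newer default and avoids a FutureWarning):
estimated_theta, residuals, rank, singular_values = np.linalg.lstsq(
    data, noisy_measurements, rcond=None)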
In addition to @lhk's answer, I have found scipy's great least_squares function. It is easy to get the requested behavior with it.
This way we can provide a custom function that returns the residuals and forms the Relative Squared Error instead of the absolute squared difference:
import numpy as np
from scipy.optimize import least_squares
data = np.random.rand(1000,3)
true_theta = np.array([1,2,3])
true_measurements = np.dot(data, true_theta)
noise = np.random.rand(1000) * 1
noisy_measurements = true_measurements + noise
#noisy_measurements[-1] = data[-1] @ (1000 * true_theta)  # uncomment this outlier to see how much better the Relative Squared Error estimator works than the default abs diff for this case
def my_func(params, x, y):
    res = (x @ params) / y - 1  # change this line to (x @ params) - y to get the same result as np.linalg.lstsq
    return res
x0 = np.ones(3)  # initial guess for the three coefficients
res = least_squares(my_func, x0, args=(data, noisy_measurements))
estimated_theta = res.x
Also, we can provide a custom loss via the loss argument: a function that processes the residuals and forms the final loss.
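For instance, using one of scipy's built-in robust loss presets (just an illustration of the keyword; 'soft_l1' and f_scale are standard least_squares options):
res_robust = least_squares(my_func, x0, args=(data, noisy_measurements),
                           loss='soft_l1', f_scale=0.1)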
I wanted to apply median-absolute-deviation (MAD) based outlier detection using the answer from @Joe Kington given below:
Pythonic way of detecting outliers in one dimensional observation data
However, something is going wrong with my code, and I could not figure out how to assign the outliers as NaN values for my data:
import numpy as np
data = np.array([55,32,4,5,6,7,8,9,11,0,2,1,3,4,5,6,7,8,25,25,25,25,10,11,12,25,26,27,28],dtype=float)
median = np.median(data, axis=0)
diff = np.sum((data - median)**2, axis=-1)
diff = np.sqrt(diff)
med_abs_deviation = np.median(diff)
modified_z_score = 0.6745 * diff / med_abs_deviation
data_without_outliers = data[modified_z_score < 3.5]
?????
print data_without_outliers
What is the problem with using:
data[modified_z_score > 3.5] = np.nan
Note that this will only work if data is a floating point array (which it should be if you are calculating MAD).
The problem might be this line:
diff = np.sum((data - median)**2, axis=-1)
Applying np.sum() will collapse the result to a scalar.
Remove the top-level sum, and your code will work.
Another way around it is to ensure that data is at least a 2D array; you can use numpy.atleast_2d() for that.
In order to assign NaNs, follow the answer at https://stackoverflow.com/a/22804327/4989451
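Putting both suggestions together, a minimal sketch of the corrected 1D version (my own assembly, not verbatim from either answer):
import numpy as np
data = np.array([55, 32, 4, 5, 6, 7, 8, 9, 11, 0, 2, 1, 3, 4, 5, 6, 7, 8,
                 25, 25, 25, 25, 10, 11, 12, 25, 26, 27, 28], dtype=float)
median = np.median(data)
diff = np.abs(data - median)                 # element-wise deviation, no summing
med_abs_deviation = np.median(diff)
modified_z_score = 0.6745 * diff / med_abs_deviation
data[modified_z_score > 3.5] = np.nan        # works because data is a float array
print(data)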
Profiling some computational work I'm doing showed me that one bottleneck in my program was a function that basically did this (np is numpy, sp is scipy):
def mix1(signal1, signal2):
spec1 = np.fft.fft(signal1, axis=1)
spec2 = np.fft.fft(signal2, axis=1)
return np.fft.ifft(spec1*spec2, axis=1)
Both signals have shape (C, N) where C is the number of sets of data (usually less than 20) and N is the number of samples in each set (around 5000). The computation for each set (row) is completely independent of any other set.
I figured that this was just a simple convolution, so I tried to replace it with:
def mix2(signal1, signal2):
outputs = np.empty_like(signal1)
for idx, row in enumerate(outputs):
outputs[idx] = sp.signal.convolve(signal1[idx], signal2[idx], mode='same')
return outputs
...just to see if I got the same results. But I didn't, and my questions are:
Why not?
Is there a better way to compute the equivalent of mix1()?
(I realise that mix2 probably wouldn't have been faster as-is, but it might have been a good starting point for parallelisation.)
Here's the full script I used to quickly check this:
import numpy as np
import scipy as sp
import scipy.signal
N = 4680
C = 6
def mix1(signal1, signal2):
spec1 = np.fft.fft(signal1, axis=1)
spec2 = np.fft.fft(signal2, axis=1)
return np.fft.ifft(spec1*spec2, axis=1)
def mix2(signal1, signal2):
outputs = np.empty_like(signal1)
for idx, row in enumerate(outputs):
outputs[idx] = sp.signal.convolve(signal1[idx], signal2[idx], mode='same')
return outputs
def test(num, chans):
sig1 = np.random.randn(chans, num)
sig2 = np.random.randn(chans, num)
res1 = mix1(sig1, sig2)
res2 = mix2(sig1, sig2)
np.testing.assert_almost_equal(res1, res2)
if __name__ == "__main__":
np.random.seed(0x1234ABCD)
test(N, C)
So I tested this out and can now confirm a few things:
1) numpy.convolve is not circular, whereas circular convolution is what the fft code is giving you.
2) FFT does not internally pad to a power of 2. Compare the vastly different speeds of the following operations:
x1 = np.random.uniform(size=2**17-1)
x2 = np.random.uniform(size=2**17)
np.fft.fft(x1)
np.fft.fft(x2)
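For instance, a rough way to see the difference (just a sketch; the numbers depend on the machine and the NumPy build):
import timeit
print(timeit.timeit(lambda: np.fft.fft(x1), number=10))  # length 2**17 - 1
print(timeit.timeit(lambda: np.fft.fft(x2), number=10))  # length 2**17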
3) Normalization is not the difference -- if you do a naive circular convolution by summing a(k)*b(i-k) over k, you will get the result of the FFT code.
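A quick numerical check of point 3 (my own sketch, not from the answer):
import numpy as np
def circular_convolve(a, b):
    # naive circular convolution: out[i] = sum over k of a[k] * b[(i - k) mod n]
    n = len(a)
    return np.array([sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)])
a = np.random.randn(128)
b = np.random.randn(128)
via_fft = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real
np.testing.assert_allclose(circular_convolve(a, b), via_fft, atol=1e-9)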
The thing is, padding to a power of 2 is going to change the answer. I've heard tales that there are ways to deal with this by cleverly using prime factors of the length (mentioned but not coded in Numerical Recipes), but I've never seen people actually do that.
scipy.signal.fftconvolve does the convolution by FFT, and it's Python code. You can study the source code and correct your mix1 function.
As mentioned before, the scipy.signal.convolve function does not perform a circular convolution. If you want a circular convolution performed in real space (in contrast to using FFTs), I suggest using the scipy.ndimage.convolve function. It has a mode parameter which can be set to 'wrap', making it a circular convolution.
for idx, row in enumerate(outputs):
outputs[idx] = sp.ndimage.convolve(signal1[idx], signal2[idx], mode='wrap')
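For completeness, a sketch of wrapping that loop into a drop-in function (mix3 is my own name; it's worth checking the output indexing against mix1, since ndimage centres the kernel and the result can come out circularly shifted):
import numpy as np
import scipy.ndimage
def mix3(signal1, signal2):
    # row-wise wrap-around convolution in real space
    outputs = np.empty_like(signal1)
    for idx in range(signal1.shape[0]):
        outputs[idx] = scipy.ndimage.convolve(signal1[idx], signal2[idx], mode='wrap')
    return outputs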