I have two NumPy arrays.
The first one is called data:
data= array([ 17. , nan, 8.1, 25.1, nan, 6.9, nan, 27.1, 46.6,
34.1, 25.7, nan, ... , 25.3 ])
Array of float64, size (366,)
To get the second one I did an interpolation, so I first had to drop the NaN values:
data = data[~numpy.isnan(data)]
So now I have the data like this:
data = array([ 17. , 8.1, 25.1, 6.9, 27.1, 46.6,
34.1, 25.7, ... , 25.3 ])
Array of float64, size (283,)
And after the interpolation I get the second one:
interpolated_data = array([ 16 , 7.1, 24.1, 7.9, 26.1, 45.6,
33.1, 27.7, ... , 24.3 ])
Array of float64, size (283,)
Now I want to put the NaN values back at the same index positions in both arrays.
Expected values:
data = array([ 17. , nan, 8.1, 25.1, nan, 6.9, nan, 27.1, 46.6,
34.1, 25.7, nan, ... , 25.3 ])
Array of float64, size (366,)
interpolated_data = array([ 16 , nan, 7.1, 24.1, nan, 7.9, nan, 26.1, 45.6,
33.1, 27.7, nan, ... , 24.3 ])
Array of float64, size (366,)
Would you mind helping me? Thanks in advance.
First you extract the values from your data array with a NaN mask:
data= array([ 17. , nan, 8.1, 25.1, nan, 6.9, nan, 27.1, 46.6,
34.1, 25.7, nan, ... , 25.3 ])
nan_mask = np.isnan(data)
data1 = data[~nan_mask]
From there you get your interpolated_data. Then you can create an empty array of the same size as the initial data array, and put your interpolated_data and the np.nan values back into it:
interpolated_array = np.empty(data.shape)
interpolated_array[~nan_mask] = interpolated_data
interpolated_array[nan_mask] = np.nan
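Putting it together, a minimal end-to-end sketch (np.full is used instead of np.empty so the array starts out nan-filled, and the interpolation step here is only a placeholder):

import numpy as np

data = np.array([17., np.nan, 8.1, 25.1, np.nan, 6.9])
nan_mask = np.isnan(data)
interpolated_data = data[~nan_mask] - 1.0   # stand-in for the real interpolation

interpolated_array = np.full(data.shape, np.nan)  # nan everywhere to begin with
interpolated_array[~nan_mask] = interpolated_data
print(interpolated_array)   # [16.  nan  7.1 24.1  nan  5.9]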
Keep the indices where there is no nan, do the computation, and recreate an array with the same dimensions as the initial one, filled with nan. Then use those indices to copy the values into the new array.
# initial array
a = np.array([1., 2., np.nan, 4., np.nan, 6.])
# indices where there is no nan
idx = np.where(~np.isnan(a))
# new array without nan
m = a[idx]
print(m)
array([1., 2., 4., 6.])
# ... interpolation produces i ...
print(i)
array([10, 20, 40, 60])
# nan-filled array of the original length
b = np.array([np.nan] * len(a))
# copy the interpolated values back to the non-nan positions
b[idx] = i
print(b)
array([10., 20., nan, 40., nan, 60.])
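As a small variation, the nan-filled array can also be created directly with np.full:

b = np.full(len(a), np.nan)  # same result as np.array([np.nan] * len(a))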
Here you go:
import numpy as np

# generate data with nan values
data = np.ones(10)
data[4] = np.nan
# get boolean selection where data is nan
boolean_selection = np.isnan(data)
# apply some interpolation on the data that is not nan
# this is just a placeholder
interpolated_data = data[np.logical_not(boolean_selection)]
# fill back the interpolated data
data[np.logical_not(boolean_selection)] = interpolated_data
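If the interpolation you have in mind is filling the gaps from the neighbouring valid samples, a concrete sketch using np.interp (assuming plain linear interpolation over the index) could look like this:

import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])
mask = np.isnan(data)
idx = np.arange(len(data))

filled = data.copy()
# linearly interpolate the nan positions from the non-nan neighbours
filled[mask] = np.interp(idx[mask], idx[~mask], data[~mask])
print(filled)  # [1. 2. 3. 4. 5. 6.]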
I'm trying to get the correlation between two matrices from the Boston dataset, so I'm doing this:
import sklearn as skl
from sklearn.datasets import load_boston
import numpy as np
import scipy as sc
import matplotlib.pyplot as plt
boston_dataset = load_boston()
X = boston_dataset.data
Y = boston_dataset.target
# Correlation between RM and Y
RM = X[:, 5:6]
np.corrcoef(RM, Y.reshape((506,1)))
But I get NaN for every value of the matrix:
/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py:2526: RuntimeWarning: Degrees of freedom <= 0 for slice
c = cov(x, y, rowvar)
/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py:2455: RuntimeWarning: divide by zero encountered in true_divide
c *= np.true_divide(1, fact)
/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py:2455: RuntimeWarning: invalid value encountered in multiply
c *= np.true_divide(1, fact)
array([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]])
What's happening? Thanks!
You are trying to find the correlation of single values which, as the warning says, has 0 degrees of freedom (each row is treated as one variable with a single observation), and hence the divide by zero results in nan, which is expected. Maybe you meant to find the correlation of columns instead of rows, like this:
np.corrcoef(RM, Y.reshape((506,1)), rowvar=False)
output:
[[1. 0.69535995]
[0.69535995 1. ]]
Explanation: by default, np.corrcoef takes the row-wise correlation of the two matrices. According to the NumPy docs, if you want column-wise correlation, you can use the rowvar argument:
If rowvar is True (default), then each row represents a variable, with observations in the columns. Otherwise, the relationship is transposed: each column represents a variable, while the rows contain observations.
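A tiny self-contained demo of the difference (synthetic data, not the Boston dataset):

import numpy as np

x = np.arange(5.0).reshape(5, 1)            # one variable, 5 observations, as a column
y = (2 * np.arange(5.0) + 1).reshape(5, 1)  # a second, perfectly correlated column

# default rowvar=True treats every row as a variable with a single
# observation, giving an all-nan 10x10 matrix; rowvar=False treats
# the two columns as the variables
print(np.corrcoef(x, y, rowvar=False))
# [[1. 1.]
#  [1. 1.]]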
Try slicing your X array on a single index (X[:, 5] instead of X[:, 5:6]). Then it will have the same shape as your Y array, with no need to reshape it. The following works:
# Correlation between RM and Y
RM = X[:, 5]
np.corrcoef(RM, Y)
This might be a silly question, but I can't seem to find an answer for it. I have a large array that I've previously saved using np.save, and now I'd like to load the data into a new file, creating a separate list from each column. The only issue is that some of the rows in my large array only have a single nan value, so the array looks something like this (as an extremely simplified example):
np.array([[5, 12, 3],
          [np.nan],
          [10, 13, 9],
          [np.nan],
          [np.nan]])
I can use a for loop to achieve what I want, but I was wondering if there was a better way than this:
import numpy as np

results = np.load('data.npy', allow_pickle=True)  # object array, so pickle-based loading is required
depth, upper, lower = [], [], []
for item in results:
    if len(item) > 1:
        depth.append(item[0])
        upper.append(item[1])
        lower.append(item[2])
    else:
        depth.append(np.nan)
        upper.append(np.nan)
        lower.append(np.nan)
My desired output would look like:
depth = [5,nan,10,nan,nan]
upper = [12,nan,13,nan,nan]
lower = [3,nan,9,nan,nan]
Thanks for your help! I realize I should have previously altered the code that creates the "data.npy" file, so that it has the same number of columns for each row, but that code already takes hours to run and I'd rather avoid that!
With varying-length subarrays, this is a dtype=object array. For most purposes it is the same as a list of those subarrays, so most actions will require iteration.
A variant on your loop would be a list comprehension:
In [61]: dd=[[nan,nan,nan] if len(i)==1 else i for i in d]
In [62]: dd
Out[62]: [[5, 12, 3], [nan, nan, nan], [10, 13, 9], [nan, nan, nan], [nan, nan, nan]]
Your three target arrays are then columns of:
In [63]: np.array(dd)
Out[63]:
array([[ 5., 12., 3.],
[ nan, nan, nan],
[ 10., 13., 9.],
[ nan, nan, nan],
[ nan, nan, nan]])
Another approach is to make an array of that shape filled with nan, and then copy over the non-nan values. But that too requires iteration to find the lengths of the subarrays:
In [65]: [len(i)>1 for i in d]
Out[65]: [True, False, True, False, False]
np.nan is a float, so a 2d array containing nan will have dtype float.
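A minimal sketch of that approach, assuming three columns and the same object array d as above:

import numpy as np

d = np.array([[5, 12, 3], [np.nan], [10, 13, 9], [np.nan], [np.nan]], dtype=object)
mask = np.array([len(row) > 1 for row in d])  # rows that hold real data
out = np.full((len(d), 3), np.nan)            # float result, pre-filled with nan
out[mask] = np.array([list(row) for row in d[mask]], dtype=float)
depth, upper, lower = out.T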
A shorter way using pandas:
import numpy as np
import pandas as pd
data = np.array([[5,12,3], [np.nan], [10,13,9], [np.nan], [np.nan]], dtype=object)  # ragged rows need dtype=object
df = pd.DataFrame.from_records(data.tolist())
df.columns = ['depth','upper','lower']
Output:
>>> df
depth upper lower
0 5.0 12.0 3.0
1 NaN NaN NaN
2 10.0 13.0 9.0
3 NaN NaN NaN
4 NaN NaN NaN
You can now address each column to get your desired output:
>>> df.depth
0 5.0
1 NaN
2 10.0
3 NaN
4 NaN
If you need lists:
>>> df.depth.tolist()
[5.0, nan, 10.0, nan, nan]
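Or, to unpack all three columns into lists at once:
>>> depth, upper, lower = (df[c].tolist() for c in df.columns)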
What is the right way to add two NumPy arrays a and b (both 2D) that use numpy.nan as the missing value?
a + b
or
numpy.ma.sum(a,b)
Since the inputs are 2D arrays, you can stack them along the third axis with np.dstack and then use np.nansum, which ensures NaNs are ignored, unless a position is NaN in both input arrays, in which case the output has NaN there as well. Thus, the implementation would look something like this:
np.nansum(np.dstack((A,B)),2)
Sample run -
In [157]: A
Out[157]:
array([[ 0.77552455, 0.89241629, nan, 0.61187474],
[ 0.62777982, 0.80245533, nan, 0.66320306],
[ 0.41578442, 0.26144272, 0.90260667, nan],
[ 0.65122428, 0.3211213 , 0.81634856, nan],
[ 0.52957704, 0.73460363, 0.16484994, 0.20701344]])
In [158]: B
Out[158]:
array([[ 0.55809925, 0.1339353 , nan, 0.35154039],
[ 0.94484722, 0.23814073, 0.36048809, 0.20412318],
[ 0.25191484, nan, 0.43721322, 0.95810905],
[ 0.69115038, 0.51490958, nan, 0.44613473],
[ 0.01709308, 0.81771896, 0.3229837 , 0.64013882]])
In [159]: np.nansum(np.dstack((A,B)),2)
Out[159]:
array([[ 1.3336238 , 1.02635159, nan, 0.96341512],
[ 1.57262704, 1.04059606, 0.36048809, 0.86732624],
[ 0.66769925, 0.26144272, 1.33981989, 0.95810905],
[ 1.34237466, 0.83603089, 0.81634856, 0.44613473],
[ 0.54667013, 1.55232259, 0.48783363, 0.84715226]])
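One caveat: in NumPy 1.9 and later, np.nansum returns 0 rather than NaN for an all-NaN slice, so on a recent NumPy the positions where both A and B are NaN come out as 0.0 instead of the nan shown above. If you need those NaNs preserved, you can restore them with a mask:

out = np.nansum(np.dstack((A, B)), 2)
out[np.isnan(A) & np.isnan(B)] = np.nan  # keep nan where both inputs were nan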
Just replace the NaNs with zeros in both arrays:
a[np.isnan(a)] = 0 # replace all nan in a with 0
b[np.isnan(b)] = 0 # replace all nan in b with 0
And then perform the addition:
a + b
This relies on the fact that 0 is the "identity element" for addition.
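Note that this overwrites a and b in place. If you need to keep the originals, the same idea works non-destructively with np.nan_to_num, which replaces nan with 0 by default (it also replaces inf with large finite values, which is harmless here):

c = np.nan_to_num(a) + np.nan_to_num(b)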
I have a bunch of matrices eq1, eq2, etc. defined like this:
from numpy import meshgrid, sqrt, arange
# from numpy import isnan, logical_not
xs = arange(-7.25, 7.25, 0.01)
ys = arange(-5, 5, 0.01)
x, y = meshgrid(xs, ys)
eq1 = ((x/7.0)**2.0*sqrt(abs(abs(x)-3.0)/(abs(x)-3.0))+(y/3.0)**2.0*sqrt(abs(y+3.0/7.0*sqrt(33.0))/(y+3.0/7.0*sqrt(33.0)))-1.0)
eq2 = (abs(x/2.0)-((3.0*sqrt(33.0)-7.0)/112.0)*x**2.0-3.0+sqrt(1-(abs(abs(x)-2.0)-1.0)**2.0)-y)
where eq1, eq2, eq3, etc. are large matrices. As you can see, there are many nan elements surrounding a 'block' of plottable values. I want to remove all the nan values while keeping the shape of the block of valid values in the matrix. Note that these 'blocks' can be located anywhere in the eq1, eq2 matrices.
I've looked at answers given in Removing nan values from an array and Removing NaN elements from a matrix, but these don't seem to be completely relevant to my case.
IIUC, you can use boolean indexing with np.isnan to keep the slices. There are probably slicker ways to do this, but starting from something like:
>>> eq = np.zeros((5,6)) + np.nan
>>> eq[2:4, 1:3].flat = [1,np.nan,3,4]
>>> eq
array([[ nan, nan, nan, nan, nan, nan],
[ nan, nan, nan, nan, nan, nan],
[ nan, 1., nan, nan, nan, nan],
[ nan, 3., 4., nan, nan, nan],
[ nan, nan, nan, nan, nan, nan]])
You could select the rows and columns with data using something like
>>> eq = eq[:,~np.isnan(eq).all(0)]
>>> eq = eq[~np.isnan(eq).all(1)]
>>> eq
array([[ 1., nan],
[ 3., 4.]])
Short and sweet:
eq1_c = eq1[~np.isnan(eq1)]
np.isnan returns a bool array that can be used to index your original array. Take its negation and you will get back the non-nan values. Note that boolean indexing like this returns a flattened 1-D array, so it does not preserve the 2-D shape of the block.
One option is to manually iterate through the grid and check for NaN values. A NaN value is easy to spot because comparing it to itself results in False. You could use this to set all NaN values to 0.0, for example:
for x in range(len(eq1)):
    for y in range(len(eq1[x])):
        v = eq1[x][y]
        if v != v:  # nan is the only value that is not equal to itself
            eq1[x][y] = 0.0
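The same replacement can be done without explicit loops by using a boolean mask, a vectorized equivalent of the loop above:

import numpy as np

eq1[np.isnan(eq1)] = 0.0   # in place: zero out every nan
# or, without modifying eq1:
eq1_clean = np.nan_to_num(eq1)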