Numpy reading data from '.npy' file directly into arrays - python

This might be a silly question, but I can't seem to find an answer for it. I have a large array that I previously saved using np.save, and now I'd like to load the data in a new script, creating a separate list from each column. The only issue is that some of the rows in my large array have only a single nan value, so the array looks something like this (as an extremely simplified example):
np.array([[5, 12, 3],
          [nan],
          [10, 13, 9],
          [nan],
          [nan]])
I can use a for loop to achieve what I want, but I was wondering if there was a better way than this:
import numpy as np

# allow_pickle=True is required to load object arrays on NumPy >= 1.16.3
results = np.load('data.npy', allow_pickle=True)

depth, upper, lower = [], [], []
for item in results:
    if len(item) > 1:
        depth.append(item[0])
        upper.append(item[1])
        lower.append(item[2])
    else:
        depth.append(np.nan)
        upper.append(np.nan)
        lower.append(np.nan)
My desired output would look like:
depth = [5,nan,10,nan,nan]
upper = [12,nan,13,nan,nan]
lower = [3,nan,9,nan,nan]
Thanks for your help! I realize I should have previously altered the code that creates the "data.npy" file, so that it has the same number of columns for each row, but that code already takes hours to run and I'd rather avoid that!

With varying-length subarrays, this is a dtype=object array. For most purposes it behaves like a list of those subarrays, so most operations will require iteration.
A variant on your loop is a list comprehension:
In [61]: dd=[[nan,nan,nan] if len(i)==1 else i for i in d]
In [62]: dd
Out[62]: [[5, 12, 3], [nan, nan, nan], [10, 13, 9], [nan, nan, nan], [nan, nan, nan]]
Your three target arrays are then columns of:
In [63]: np.array(dd)
Out[63]:
array([[  5.,  12.,   3.],
       [ nan,  nan,  nan],
       [ 10.,  13.,   9.],
       [ nan,  nan,  nan],
       [ nan,  nan,  nan]])
Another approach is to make a float array of the right shape filled with nan, and then copy over the non-nan values. But that too requires iteration to find the lengths of the subarrays.
In [65]: [len(i)>1 for i in d]
Out[65]: [True, False, True, False, False]
np.nan is a float, so a 2d array with nan will be dtype float.
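A minimal sketch of that fill-and-copy approach, reusing the example data (variable names are illustrative):

import numpy as np
from numpy import nan

d = np.array([[5, 12, 3], [nan], [10, 13, 9], [nan], [nan]], dtype=object)

out = np.full((len(d), 3), np.nan)   # float result, pre-filled with nan
for i, row in enumerate(d):
    if len(row) > 1:                 # nan-only rows are left as all-nan
        out[i] = row
depth, upper, lower = out.T          # the three columns as 1-D arrays
print(depth)                         # [ 5. nan 10. nan nan]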

A shorter way using pandas:
import numpy as np
import pandas as pd
# dtype=object lets NumPy build a ragged array (required on newer NumPy versions)
data = np.array([[5, 12, 3], [np.nan], [10, 13, 9], [np.nan], [np.nan]], dtype=object)
df = pd.DataFrame.from_records(data.tolist())
df.columns = ['depth', 'upper', 'lower']
Output:
>>> df
   depth  upper  lower
0    5.0   12.0    3.0
1    NaN    NaN    NaN
2   10.0   13.0    9.0
3    NaN    NaN    NaN
4    NaN    NaN    NaN
You can now access each column to get your desired output:
>>> df.depth
0     5.0
1     NaN
2    10.0
3     NaN
4     NaN
If you need lists:
>>> df.depth.tolist()
[5.0, nan, 10.0, nan, nan]
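If you want all three lists at once, a short sketch:

depth, upper, lower = (df[col].tolist() for col in ['depth', 'upper', 'lower'])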

Related

Replacing all elements except NaN in Python

I want to replace all elements in array X except nan with 10.0. Is there a one-step way to do it? I present the expected output.
import numpy as np
from numpy import nan
X = np.array([[3.25774286e+02, 3.22008654e+02, nan, 1.85356823e+02,
               1.85356823e+02, 3.22008654e+02, nan, 3.22008654e+02]])
The expected output is
X = array([[10.0, 10.0, nan, 10.0, 10.0, 10.0, nan, 10.0]])
You can get an array of True/False for the nan locations using np.isnan, invert it, and use it to replace all other values with 10.0:
indices = np.isnan(X)
X[~indices] = 10.0
print(X) # [[10. 10. nan 10. 10. 10. nan 10.]]
You can use a combination of numpy.isnan and numpy.where.
>>> np.where(np.isnan(X), X, 10)
array([[10., 10., nan, 10., 10., 10., nan, 10.]])
One-liner, in-place for X:
np.place(X, np.invert(np.isnan(X)), 10.0)
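A quick check that the three approaches agree (a sketch; np.where returns a new array, while boolean indexing and np.place modify the array in place):

import numpy as np
from numpy import nan

X = np.array([[3.0, 2.0, nan, 1.0]])

print(np.where(np.isnan(X), X, 10.0))   # new array: [[10. 10. nan 10.]]

Y = X.copy()
Y[~np.isnan(Y)] = 10.0                  # in-place via boolean indexing
print(Y)                                # [[10. 10. nan 10.]]

Z = X.copy()
np.place(Z, ~np.isnan(Z), 10.0)         # in-place via np.place
print(Z)                                # [[10. 10. nan 10.]]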

pandas simple pairwise occurrence

There is a function corr in pandas to create a table of mutual correlation coefficients in the presence of sparse data. But how can I calculate the number of mutual occurrences in the data instead of the correlation coefficient?
i.e.
A = [NaN, NaN, 3]
B = [NaN, NaN, 8]
F(A,B) = 1
A = [1, NaN, NaN]
B = [NaN, NaN, 8]
F(A,B) = 0
I need pandas.DataFrame([A,B]).<function>() -> matrix of occurrences
In pandas, you may want to use dropna: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
You can do something like
co_occur = df.dropna(how = "any")
the_count = co_occur.shape[0] # number of remaining rows
This will drop all rows where there is any NaN (thereby leaving you only with rows that contain values for every variable) and then count the number of remaining rows.
Alternatively, you could do it with lists (as in your code above), assuming the lists are the same length. Note that a plain truthiness test such as if A[i] and B[i] would be wrong here, since float('nan') is truthy (and a legitimate 0 value is falsy); test for NaN explicitly:
import math
A = [NaN, NaN, 3]
B = [NaN, NaN, 8]
co_occur = len([i for i in range(len(A)) if not math.isnan(A[i]) and not math.isnan(B[i])])
Using numpy:
sum(np.sum(~np.isnan(np.array([A,B])),0)==2)
Out[335]: 1
For your second case:
sum(np.sum(~np.isnan(np.array([A,B])),0)==2)
Out[337]: 0
With pandas
(df.A.notnull() & df.B.notnull()).sum()
Or
df.notnull().all(axis=1).sum()
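To get the full matrix of pairwise co-occurrence counts across all columns (the question's F for every column pair), one sketch multiplies the not-null indicator matrix by its transpose (the column names here are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, np.nan, 8]})

present = df.notnull().astype(int)   # 1 where a value exists, 0 where NaN
co_occurrence = present.T @ present  # joint non-NaN counts for each column pair
print(co_occurrence)
#    A  B
# A  2  1
# B  1  1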

Add two matrices containing NaN in python

I have two 5D matrices which I would like to add elementwise. The matrices have the exact same dimensions and number of elements, but they both contain randomly distributed NaN values.
I would like to add these two matrices elementwise in an efficient way. I am currently adding them by looping through them elementwise, but this loop takes about 40 minutes and I just thought there must be a more efficient way of doing it.
What I think would be an efficient way is if it was possible to use numpy.nansum to add them, but from what I can find, numpy.nansum only works with 1D arrays.
I would prefer it if the adding went down as it does with numpy.nansum (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.nansum.html). Namely, (1) if two values are added I want the sum to be a value, (2) if a value and a NaN are added I want the sum to be the value and (3) if two NaN are added I want the sum to be NaN.
Below is exemplary code:
import numpy as np

# Creating fake data (float dtype, so that NaN can be assigned)
A = np.arange(0, 720, 1, dtype=float).reshape(2, 3, 4, 5, 6)
B = np.arange(720, 1440, 1, dtype=float).reshape(2, 3, 4, 5, 6)

# Assigning some elements as NaN
A[0, 1, 2, 3, 4] = np.nan
A[1, 2, 3, 4, 5] = np.nan
B[1, 2, 3, 4, 5] = np.nan
So, if I now add A and B (let's say C = A + B), I want element C[0,1,2,3,4] to be the value of B[0,1,2,3,4], element C[1,2,3,4,5] to be NaN, and all other elements in C to be the sums of the respective elements in A and B.
Does anyone have an efficient solution for this addition?
np.where(np.isnan(A), B, A + np.nan_to_num(B))
This works in two parts:
For the nan part of A, we fill in values from B. If A and B are nan at the same position, the stored value will be nan; where B has a value and A is nan, the value from B is taken.
For the non-nan part of A, we fill in A + np.nan_to_num(B). np.nan_to_num(B) turns B's nan entries into 0, so A + np.nan_to_num(B) is not nan where only B is nan.
Thanks to Paul Panzer for the correction.
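A small check of the four cases (value+value, NaN+value, value+NaN, NaN+NaN) with throwaway 1-D arrays:

import numpy as np

A = np.array([1.0, np.nan, 4.0, np.nan])
B = np.array([2.0, 3.0, np.nan, np.nan])

C = np.where(np.isnan(A), B, A + np.nan_to_num(B))
print(C)   # [ 3.  3.  4. nan]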
I was thinking of something more prosaic
In [22]: A=np.arange(10.) # make sure A is float
In [23]: B=np.arange(100,110.)
In [24]: A[[1,3,9]]=np.nan
In [25]: B[[2,5,9]]=np.nan
In [26]: A
Out[26]: array([ 0., nan, 2., nan, 4., 5., 6., 7., 8., nan])
In [27]: B
Out[27]: array([100., 101., nan, 103., 104., nan, 106., 107., 108., nan])
In [29]: C=A+B
In [30]: C
Out[30]: array([100., nan, nan, nan, 108., nan, 112., 114., 116., nan])
In [31]: mask1 = np.isnan(A) & ~np.isnan(B)
In [32]: C[mask1] = B[mask1]
In [33]: mask2 = np.isnan(B) & ~np.isnan(A)
In [34]: C[mask2] = A[mask2]
In [35]: C
Out[35]: array([100., 101., 2., 103., 108., 5., 112., 114., 116., nan])
I like the stack and nansum approach, but I'm not sure it's faster:
In [36]: s=np.stack((A,B))
In [37]: C1 = np.nansum(s, axis=0)
In [38]: C1
Out[38]: array([100., 101., 2., 103., 108., 5., 112., 114., 116., 0.])
In [40]: C1[np.all(np.isnan(s), axis=0)] = np.nan
In [41]: C1
Out[41]: array([100., 101., 2., 103., 108., 5., 112., 114., 116., nan])
Look at s if this approach is puzzling:
In [42]: s
Out[42]:
array([[  0.,  nan,   2.,  nan,   4.,   5.,   6.,   7.,   8.,  nan],
       [100., 101.,  nan, 103., 104.,  nan, 106., 107., 108.,  nan]])
s is a new array, with a new 0 dimension. sum on that dimension is the same as A+B. This stacking lets us take advantage of the nansum. Unfortunately you still want to keep some nan, so we still have to do a masked assignment to handle that detail.
s = np.stack((A, B))
C = np.nansum(s, axis=0)
C[np.all(np.isnan(s), axis=0)] = np.nan
This will treat np.nan as 0.0 for purposes of summing, and then the final line adds back the places where np.nan existed for all entries along the new "depth" axis that spans across A and B.
Note that this last operation is necessary for NumPy versions > 1.8, as it says in the documentation:
In NumPy versions <= 1.8.0 Nan is returned for slices that are all-NaN or empty. In later versions zero is returned.
If you can guarantee NumPy version <= 1.8, then just the nansum part alone would suffice.
Just add a new axis before summing:
np.nansum(np.concatenate((A[None,:], B[None,:])), axis=0)
(Per the documentation note above, on NumPy versions after 1.8.0 this returns 0, not NaN, where both inputs are NaN, so the masked re-assignment is still needed to keep those NaNs.)

Adding two 2D NumPy arrays ignoring NaNs in them

What is the right way to add 2 numpy arrays a and b (both 2D) with numpy.nan as missing value?
a + b
or
numpy.ma.sum(a,b)
Since the inputs are 2D arrays, you can stack them along the third axis with np.dstack and then use np.nansum, which ignores NaNs; where there are NaNs in both input arrays, the output also has NaN (on NumPy <= 1.8.0; see the note after the sample run). Thus, the implementation would look something like this -
np.nansum(np.dstack((A,B)),2)
Sample run -
In [157]: A
Out[157]:
array([[ 0.77552455,  0.89241629,         nan,  0.61187474],
       [ 0.62777982,  0.80245533,         nan,  0.66320306],
       [ 0.41578442,  0.26144272,  0.90260667,         nan],
       [ 0.65122428,  0.3211213 ,  0.81634856,         nan],
       [ 0.52957704,  0.73460363,  0.16484994,  0.20701344]])
In [158]: B
Out[158]:
array([[ 0.55809925,  0.1339353 ,         nan,  0.35154039],
       [ 0.94484722,  0.23814073,  0.36048809,  0.20412318],
       [ 0.25191484,         nan,  0.43721322,  0.95810905],
       [ 0.69115038,  0.51490958,         nan,  0.44613473],
       [ 0.01709308,  0.81771896,  0.3229837 ,  0.64013882]])
In [159]: np.nansum(np.dstack((A,B)),2)
Out[159]:
array([[ 1.3336238 ,  1.02635159,         nan,  0.96341512],
       [ 1.57262704,  1.04059606,  0.36048809,  0.86732624],
       [ 0.66769925,  0.26144272,  1.33981989,  0.95810905],
       [ 1.34237466,  0.83603089,  0.81634856,  0.44613473],
       [ 0.54667013,  1.55232259,  0.48783363,  0.84715226]])
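As noted in the previous question's answer, on NumPy versions after 1.8.0 np.nansum returns 0 rather than NaN for all-NaN slices, so the both-NaN positions would come out as 0 there. A sketch of restoring them afterwards:

C = np.nansum(np.dstack((A, B)), 2)
C[np.isnan(A) & np.isnan(B)] = np.nan   # keep NaN where both inputs were NaN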
Just replace the NaNs with zeros in both arrays:
a[np.isnan(a)] = 0 # replace all nan in a with 0
b[np.isnan(b)] = 0 # replace all nan in b with 0
And then perform the addition:
a + b
This relies on the fact that 0 is the "identity element" for addition.
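If you'd rather not modify a and b in place, a non-destructive sketch of the same idea uses np.nan_to_num; like the in-place version, it yields 0 rather than NaN where both inputs are NaN:

import numpy as np

a = np.array([[1.0, np.nan], [np.nan, 4.0]])
b = np.array([[10.0, 20.0], [np.nan, np.nan]])

c = np.nan_to_num(a) + np.nan_to_num(b)   # NaNs treated as 0
print(c)   # [[11. 20.]
           #  [ 0.  4.]]  <- the both-NaN cell becomes 0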

Removing nan elements from matrix

I have a bunch of matrices eq1, eq2, etc. defined like
from numpy import meshgrid, sqrt, arange
# from numpy import isnan, logical_not
xs = arange(-7.25, 7.25, 0.01)
ys = arange(-5, 5, 0.01)
x, y = meshgrid(xs, ys)
eq1 = ((x/7.0)**2.0*sqrt(abs(abs(x)-3.0)/(abs(x)-3.0))+(y/3.0)**2.0*sqrt(abs(y+3.0/7.0*sqrt(33.0))/(y+3.0/7.0*sqrt(33.0)))-1.0)
eq2 = (abs(x/2.0)-((3.0*sqrt(33.0)-7.0)/112.0)*x**2.0-3.0+sqrt(1-(abs(abs(x)-2.0)-1.0)**2.0)-y)
where eq1, eq2, eq3, etc. are large matrices. As you can see, there are many nan elements surrounding a 'block' of plot-able values. I want to remove all the nan values whilst keeping the shape of the block of valid values in the matrix. Note that these 'blocks' can be located anywhere in the eq1, eq2 matrices.
I've looked at answers given in Removing nan values from an array and Removing NaN elements from a matrix, but these don't seem to be completely relevant to my case.
IIUC, you can use boolean indexing with np.isnan to keep the slices. There are probably slicker ways to do this, but starting from something like:
>>> eq = np.zeros((5,6)) + np.nan
>>> eq[2:4, 1:3].flat = [1,np.nan,3,4]
>>> eq
array([[ nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,   1.,  nan,  nan,  nan,  nan],
       [ nan,   3.,   4.,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan]])
You could select the rows and columns with data using something like
>>> eq = eq[:,~np.isnan(eq).all(0)]
>>> eq = eq[~np.isnan(eq).all(1)]
>>> eq
array([[  1.,  nan],
       [  3.,   4.]])
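Wrapped up as a small helper (a sketch; the function name is mine):

import numpy as np

def trim_all_nan(a):
    """Drop the rows and columns of a 2-D array that are entirely NaN."""
    keep = ~np.isnan(a)
    return a[keep.any(axis=1)][:, keep.any(axis=0)]

eq_trimmed = trim_all_nan(eq)   # same result as the two-step version above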
Short and sweet:
eq1_c = eq1[~np.isnan(eq1)]
np.isnan returns a bool array that can be used to index your original array; take its negation and you get back the non-nan values. Note that this returns a flattened 1-D array, so it does not preserve the 2-D shape of the block.
One option is to manually iterate through the grid and check for NaN values. A NaN value is easy to spot because comparing it to itself results in False. You can use this to set all NaN values to 0.0, for example:
for x in range(len(eq1)):
    for y in range(len(eq1[x])):
        v = eq1[x][y]
        if v != v:              # NaN is the only value not equal to itself
            eq1[x][y] = 0.0
