Removing nan elements from matrix - python

I have a bunch of matrices eq1, eq2, etc. defined like
from numpy import meshgrid, sqrt, arange
# from numpy import isnan, logical_not
xs = arange(-7.25, 7.25, 0.01)
ys = arange(-5, 5, 0.01)
x, y = meshgrid(xs, ys)
eq1 = ((x/7.0)**2.0*sqrt(abs(abs(x)-3.0)/(abs(x)-3.0))+(y/3.0)**2.0*sqrt(abs(y+3.0/7.0*sqrt(33.0))/(y+3.0/7.0*sqrt(33.0)))-1.0)
eq2 = (abs(x/2.0)-((3.0*sqrt(33.0)-7.0)/112.0)*x**2.0-3.0+sqrt(1-(abs(abs(x)-2.0)-1.0)**2.0)-y)
where eq1, eq2, eq3, etc. are large square matrices. As you can see, there are many nan elements surrounding a 'block' of plot-able values. I want to remove all the nan values whilst keeping the shape of the block of the valid values in the matrix. Note that these 'blocks' can be located anywhere in the eq1, eq2 matrix.
I've looked at answers given in Removing nan values from an array and Removing NaN elements from a matrix, but these don't seem to be completely relevant to my case.

IIUC, you can use boolean indexing with np.isnan to keep the slices. There are probably slicker ways to do this, but starting from something like:
>>> import numpy as np
>>> eq = np.zeros((5,6)) + np.nan
>>> eq[2:4, 1:3].flat = [1,np.nan,3,4]
>>> eq
array([[ nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,   1.,  nan,  nan,  nan,  nan],
       [ nan,   3.,   4.,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan]])
You could select the rows and columns with data using something like
>>> eq = eq[:,~np.isnan(eq).all(0)]
>>> eq = eq[~np.isnan(eq).all(1)]
>>> eq
array([[  1.,  nan],
       [  3.,   4.]])
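Note that the row/column selection above also drops any all-NaN rows or columns that happen to lie inside the block. If the interior should be kept intact, one possible variant is to slice to the bounding box of the valid values. A minimal sketch, assuming a 2-D float array; the helper name crop_to_valid is made up for illustration:

```python
import numpy as np

def crop_to_valid(a):
    """Slice a 2-D array to the bounding box of its non-NaN values."""
    valid = ~np.isnan(a)
    rows = np.flatnonzero(valid.any(axis=1))  # rows with at least one value
    cols = np.flatnonzero(valid.any(axis=0))  # columns with at least one value
    if rows.size == 0:
        return a[:0, :0]  # nothing valid: return an empty array
    return a[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

eq = np.full((5, 6), np.nan)
eq[2:4, 1:3] = [[1, np.nan], [3, 4]]
block = crop_to_valid(eq)  # 2x2 block; the NaN inside it is kept
```

Because this uses plain slicing, block is a view of eq rather than a copy.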

Short and sweet,
eq1_c = eq1[~np.isnan(eq1)]
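np.isnan returns a boolean array that can be used to index your original array. Negate it and you get back the non-NaN values. Note that boolean indexing on a 2-D array returns a flattened 1-D array, so the shape of the block is not preserved.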

One option is to iterate through the grid manually and check for NaN values. A NaN value is easy to spot because comparing it to itself yields False (so v != v is True only for NaN). You could use this to set all NaN values to 0.0, for example.
for i in range(len(eq1)):
    for j in range(len(eq1[i])):
        v = eq1[i][j]
        if v != v:  # True only when v is NaN
            eq1[i][j] = 0.0
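For large grids a Python loop like this can be slow. A vectorized sketch of the same self-comparison idea, using a small made-up array in place of eq1:

```python
import numpy as np

grid = np.array([[1.0, np.nan], [np.nan, 4.0]])  # stand-in for eq1

nan_mask = grid != grid                 # NaN is the only value not equal to itself
filled = np.where(nan_mask, 0.0, grid)  # copy with NaN replaced by 0.0
```

np.isnan(grid) produces the same mask and is usually easier to read.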

Related

Replacing all elements except NaN in Python

I want to replace all elements in array X except nan with 10.0. Is there a one-step way to do it? I present the expected output.
import numpy as np
from numpy import nan
X = np.array([[3.25774286e+02, 3.22008654e+02, nan, 1.85356823e+02,
1.85356823e+02, 3.22008654e+02, nan, 3.22008654e+02]])
The expected output is
X = array([[10.0, 10.0, nan, 10.0,
10.0, 10.0, nan, 10.0]])
You can get an array of True/False for nan location using np.isnan, invert it, and use it to replace all other values with 10.0
indices = np.isnan(X)
X[~indices] = 10.0
print(X) # [[10. 10. nan 10. 10. 10. nan 10.]]
You can use a combination of numpy.isnan and numpy.where.
>>> np.where(np.isnan(X), X, 10)
array([[10., 10., nan, 10., 10., 10., nan, 10.]])
One-liner, in place, for X:
np.place(X, ~np.isnan(X), 10.0)
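These approaches differ in whether they modify X. A small sketch with made-up sample values, showing that np.where builds a new array while boolean-mask assignment changes the array in place:

```python
import numpy as np

X = np.array([[325.8, 322.0, np.nan, 185.4]])  # made-up sample values

replaced = np.where(np.isnan(X), X, 10.0)  # new array, X is untouched
untouched_first = X[0, 0]                  # still the original value

X[~np.isnan(X)] = 10.0                     # in place: X itself now holds 10.0
```

Which one fits depends on whether the original values are still needed afterwards.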

Python, numpy correlation returns nan

I'm trying to get the correlation between two matrices from the boston dataset. So I'm doing this.
import sklearn as skl
from sklearn.datasets import load_boston
import numpy as np
import scipy as sc
import matplotlib.pyplot as plt
boston_dataset = load_boston()
X = boston_dataset.data
Y = boston_dataset.target
# Correlation between RM and Y
RM = X[:, 5:6]
np.corrcoef(RM, Y.reshape((506,1)))
But I got NAN in every value of the matrix.
/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py:2526: RuntimeWarning: Degrees of freedom <= 0 for slice
c = cov(x, y, rowvar)
/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py:2455: RuntimeWarning: divide by zero encountered in true_divide
c *= np.true_divide(1, fact)
/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py:2455: RuntimeWarning: invalid value encountered in multiply
c *= np.true_divide(1, fact)
array([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]])
What's happening? Thanks!
You are computing the correlation between rows that each hold a single value. As the warning says, a single observation leaves 0 degrees of freedom, so the division by zero produces nan, which is expected. You probably meant to correlate columns instead of rows, like this:
np.corrcoef(RM, Y.reshape((506,1)), rowvar=False)
Output:
[[1.         0.69535995]
 [0.69535995 1.        ]]
Explanation: By default, np.corrcoef computes row-wise correlation of the two matrices. According to the NumPy documentation, if you want column-wise correlation, you can use the rowvar argument:
If rowvar is True (default), then each row represents a variable, with observations in the columns. Otherwise, the relationship is transposed: each column represents a variable, while the rows contain observations.
Try slicing your X array on a single index (so X[:, 5] instead of X[:, 5:6]). Then it will be the same shape as your Y array, without needing to reshape it. The following works:
# Correlation between RM and Y
RM = X[:, 5]
np.corrcoef(RM, Y)
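To see the effect of rowvar without the Boston data, here is a sketch on synthetic data. The arrays rm and y are made up and only mimic the shapes in the question:

```python
import warnings
import numpy as np

rng = np.random.default_rng(0)
rm = rng.normal(size=(50, 1))             # column vector, like X[:, 5:6]
y = 3.0 * rm[:, 0] + rng.normal(size=50)  # 1-D target, like Y

# Default rowvar=True: every row is a variable with a single observation.
with warnings.catch_warnings(), np.errstate(all="ignore"):
    warnings.simplefilter("ignore")
    by_rows = np.corrcoef(rm, y.reshape(-1, 1))

by_cols = np.corrcoef(rm, y.reshape(-1, 1), rowvar=False)  # columns as variables
flat = np.corrcoef(rm[:, 0], y)                            # 1-D inputs
```

Here by_rows has shape (100, 100) and is entirely NaN, while by_cols and flat are the same 2x2 correlation matrix.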

How to create a count function for an array that maintains the time and location of the points (python)

I have two netCDF files which I have used to calculate the Humidex of the Houston area. From there I need to find a way to count the number of days at each lat/lon that meet a certain threshold (41). I then need to plot a spatial map of that count over the region so I can compare the number of extremely hot days at each point in the region. I've used xarray.where to isolate the days at this threshold, but when I apply a count function I lose my time and lat/lon variables and just get the total number of data points at this threshold.
humidex is calculated from two different netCDF files; it has latitude and longitude variables.
>>> hotday = xr.DataArray(humidex)
>>> hotday.where(hotday >=41)
<xarray.DataArray 'tasmax' (lat: 960, lon: 1920)>
array([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], dtype=float32)
Coordinates:
* lat (lat) float64 -89.86 -89.67 -89.48 -89.3 ... 89.3 89.48 89.67 89.86
* lon (lon) float64 0.0 0.1875 0.375 0.5625 ... 359.2 359.4 359.6 359.8
height float64 2.0
>>> for ii in hotday:
...     counting = xr.DataArray.count(ii)
...
>>> counting
<xarray.DataArray 'tasmax' ()>
array(1920)
Coordinates:
lat float64 89.86
height float64 ...
I hope this makes sense, I'm still new to coding and this has really thrown me.
Welcome to SO. There are numerous ways to solve your problem.
Here's one proposed method:
import xarray as xr
data = xr.tutorial.open_dataset('air_temperature')
high_temps = xr.where(data > 300, 1, 0) #set all temps over 300K = 1; others to 0
summed_temps = high_temps.sum(dim='time')
You could then plot the heat map directly.
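Applied to the humidex case, the same idea might look like the following sketch. It assumes the data has a time dimension in addition to lat and lon; the array here is random data made up for illustration, not real Humidex values:

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
humidex = xr.DataArray(
    rng.uniform(30, 50, size=(10, 3, 4)),  # 10 days on a 3x4 grid (made up)
    dims=("time", "lat", "lon"),
    coords={"lat": [29.0, 29.5, 30.0], "lon": [264.0, 264.5, 265.0, 265.5]},
    name="humidex",
)

# Summing the boolean mask over time counts the hot days per grid cell.
hot_day_count = (humidex >= 41).sum(dim="time")

# Equivalent here: mask with where(), then count the non-NaN entries over time.
via_count = humidex.where(humidex >= 41).count(dim="time")
```

Reducing only over dim="time" is what keeps the lat and lon coordinates, so hot_day_count can be plotted as a map, for example with hot_day_count.plot().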

Add two matrices containing NaN in python

I have two 5D matrices which I would like to add elementwise. The matrices have the exact same dimensions and number of elements, but they both contain randomly distributed NaN values.
I would like to add these two matrices elementwise in an efficient way. I am currently adding them by looping through them elementwise, but this loop takes about 40 minutes and I just thought there must be a more efficient way of doing it.
What I think would be an efficient way is if it was possible to use numpy.nansum to add them, but from what I can find, numpy.nansum only works with 1D arrays.
I would prefer it if the adding went down as it does with numpy.nansum (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.nansum.html). Namely, (1) if two values are added I want the sum to be a value, (2) if a value and a NaN are added I want the sum to be the value and (3) if two NaN are added I want the sum to be NaN.
Below is some example code:
import numpy as np
# Creating fake data (float dtype, since integer arrays cannot hold NaN)
A = np.arange(0, 720, 1, dtype=float).reshape(2,3,4,5,6)
B = np.arange(720, 1440, 1, dtype=float).reshape(2,3,4,5,6)
# Assigning some elements as NaN
A[0,1,2,3,4] = np.nan
A[1,2,3,4,5] = np.nan
B[1,2,3,4,5] = np.nan
So, if I now add A and B (let's say C = A + B), I want element C[0,1,2,3,4] to be the value of B[0,1,2,3,4], element C[1,2,3,4,5] to be NaN, and all other elements in C to be the sums of the corresponding elements in A and B.
Does anyone have an efficient solution for this addition?
np.where(np.isnan(A), B, A + np.nan_to_num(B))
We see how this works in two parts:
For the nan part of A, we fill in values from B.
If B and A are nan at the same time, the values stored will be nan. If values in B are not nan while those from A are nan, the values of B will be taken.
For the part of A that is non-nan, we fill in A + np.nan_to_num(B).
np.nan_to_num(B) will turn B's nan part into 0. Thus, A + np.nan_to_num(B) will not be nan when B is nan.
Thanks to Paul Panzer for the correction.
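A small worked example of the np.where expression, using made-up 1-D arrays that cover each of the cases from the question:

```python
import numpy as np

A = np.array([1.0, np.nan, 3.0, np.nan])
B = np.array([10.0, 20.0, np.nan, np.nan])

C = np.where(np.isnan(A), B, A + np.nan_to_num(B))
# index 0: both are values      -> 1 + 10 = 11
# index 1: A is NaN, B is not   -> B's value, 20
# index 2: B is NaN, A is not   -> 3 + 0 = 3
# index 3: both are NaN         -> NaN (taken from B)
```

One caveat: np.nan_to_num also replaces infinities with large finite numbers, so this sketch assumes the data contains no inf values.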
I was thinking of something more prosaic
In [22]: A=np.arange(10.) # make sure A is float
In [23]: B=np.arange(100,110.)
In [24]: A[[1,3,9]]=np.nan
In [25]: B[[2,5,9]]=np.nan
In [26]: A
Out[26]: array([ 0., nan, 2., nan, 4., 5., 6., 7., 8., nan])
In [27]: B
Out[27]: array([100., 101., nan, 103., 104., nan, 106., 107., 108., nan])
In [29]: C=A+B
In [30]: C
Out[30]: array([100., nan, nan, nan, 108., nan, 112., 114., 116., nan])
In [31]: mask1 = np.isnan(A) & ~np.isnan(B)
In [32]: C[mask1] = B[mask1]
In [33]: mask2 = np.isnan(B) & ~np.isnan(A)
In [34]: C[mask2] = A[mask2]
In [35]: C
Out[35]: array([100., 101., 2., 103., 108., 5., 112., 114., 116., nan])
I like the stack and nansum approach, but I'm not sure it's faster:
In [36]: s=np.stack((A,B))
In [37]: C1 = np.nansum(s, axis=0)
In [38]: C1
Out[38]: array([100., 101., 2., 103., 108., 5., 112., 114., 116., 0.])
In [40]: C1[np.all(np.isnan(s), axis=0)] = np.nan
In [41]: C1
Out[41]: array([100., 101., 2., 103., 108., 5., 112., 114., 116., nan])
Look at s if this approach is puzzling:
In [42]: s
Out[42]:
array([[ 0., nan, 2., nan, 4., 5., 6., 7., 8., nan],
[100., 101., nan, 103., 104., nan, 106., 107., 108., nan]])
s is a new array, with a new 0 dimension. sum on that dimension is the same as A+B. This stacking lets us take advantage of the nansum. Unfortunately you still want to keep some nan, so we still have to do a masked assignment to handle that detail.
s = np.stack((A, B))
C = np.nansum(s, axis=0)
C[np.all(np.isnan(s), axis=0)] = np.nan
This will treat np.nan as 0.0 for purposes of summing, and then the final line adds back the places where np.nan existed for all entries along the new "depth" axis that spans across A and B.
Note that this last operation is necessary for NumPy versions > 1.8, as it says in the documentation:
In NumPy versions <= 1.8.0 Nan is returned for slices that are all-NaN or empty. In later versions zero is returned.
If you can guarantee NumPy version <= 1.8, then just the nansum part alone would suffice.
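Since the question is about 5-D arrays, here is a sketch of the stack-and-nansum approach wrapped in a helper and applied to float versions of the question's arrays. The name nan_add is made up for illustration:

```python
import numpy as np

def nan_add(a, b):
    """Add a and b, treating NaN as 0 unless both entries are NaN."""
    stacked = np.stack((a, b))                     # new leading axis of length 2
    total = np.nansum(stacked, axis=0)             # NaN counted as 0
    total[np.isnan(stacked).all(axis=0)] = np.nan  # restore NaN where both were NaN
    return total

A = np.arange(0, 720, dtype=float).reshape(2, 3, 4, 5, 6)
B = np.arange(720, 1440, dtype=float).reshape(2, 3, 4, 5, 6)
A[0, 1, 2, 3, 4] = np.nan
A[1, 2, 3, 4, 5] = np.nan
B[1, 2, 3, 4, 5] = np.nan

C = nan_add(A, B)
```

The helper works for any number of dimensions because np.stack and np.nansum only touch the new leading axis.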
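Just add a new axis before summing (note that, on NumPy versions later than 1.8, this returns 0 rather than NaN where both inputs are NaN):
np.nansum(np.concatenate((A[None,:],B[None,:])),axis=0)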

Numpy reading data from '.npy' file directly into arrays

This might be a silly question, but I can't seem to find an answer for it. I have a large array that I've previously saved using np.save, and now I'd like to load the data into a new file, creating a separate list from each column. The only issue is that some of the rows in my large array only have a single nan value, so the array looks something like this (as an extremely simplified example):
np.array([[5,12,3],
[nan],
[10,13,9],
[nan],
[nan]])
I can use a for loop to achieve what I want, but I was wondering if there was a better way than this:
import numpy as np
# object arrays need allow_pickle=True on recent NumPy versions
results = np.load('data.npy', allow_pickle=True)
depth, upper, lower = [], [], []
for item in results:
    if len(item) > 1:
        depth.append(item[0])
        upper.append(item[1])
        lower.append(item[2])
    else:
        depth.append(np.nan)
        upper.append(np.nan)
        lower.append(np.nan)
My desired output would look like:
depth = [5,nan,10,nan,nan]
upper = [12,nan,13,nan,nan]
lower = [3,nan,9,nan,nan]
Thanks for your help! I realize I should have previously altered the code that creates the "data.npy" file, so that it has the same number of columns for each row, but that code already takes hours to run and I'd rather avoid that!
With subarrays of varying length, this is a dtype=object array. For most purposes it behaves like a list of these subarrays, so most operations will require iteration.
A variant on your action would be a list comprehension
In [61]: dd=[[nan,nan,nan] if len(i)==1 else i for i in d]
In [62]: dd
Out[62]: [[5, 12, 3], [nan, nan, nan], [10, 13, 9], [nan, nan, nan], [nan, nan, nan]]
Your three target arrays are then columns of:
In [63]: np.array(dd)
Out[63]:
array([[  5.,  12.,   3.],
       [ nan,  nan,  nan],
       [ 10.,  13.,   9.],
       [ nan,  nan,  nan],
       [ nan,  nan,  nan]])
Another approach is to make an array of that type filled with nan, and then copy over the non-nan values. But that too requires iteration to find the lengths of the subarrays.
In [65]: [len(i)>1 for i in d]
Out[65]: [True, False, True, False, False]
np.nan is a float, so a 2d array with nan will be dtype float.
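The second approach, preallocating a NaN-filled array and copying over the complete rows, might be sketched as follows. The list rows stands in for the loaded object array, and the width of 3 is an assumption taken from the example data:

```python
import numpy as np

# Stand-in for the ragged data loaded from the .npy file.
rows = [[5, 12, 3], [np.nan], [10, 13, 9], [np.nan], [np.nan]]

n_cols = 3  # assumed number of columns in a complete row
table = np.full((len(rows), n_cols), np.nan)  # float array, all NaN to start
for i, row in enumerate(rows):
    if len(row) == n_cols:
        table[i] = row  # copy complete rows; short rows stay NaN

depth, upper, lower = table.T  # one 1-D array per column
```

This still iterates once over the rows, but everything after that works on a regular 2-D float array.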
A shorter way using pandas:
import numpy as np
import pandas as pd
# dtype=object is needed for ragged rows on recent NumPy versions
data = np.array([[5,12,3], [np.nan], [10,13,9], [np.nan], [np.nan]], dtype=object)
df = pd.DataFrame.from_records(data.tolist())
df.columns = ['depth','upper','lower']
Output:
>>> df
   depth  upper  lower
0    5.0   12.0    3.0
1    NaN    NaN    NaN
2   10.0   13.0    9.0
3    NaN    NaN    NaN
4    NaN    NaN    NaN
You can now address each column to get your desired output
>>> df.depth
0     5.0
1     NaN
2    10.0
3     NaN
4     NaN
If you need lists:
>>> df.depth.tolist()
[5.0, nan, 10.0, nan, nan]
