I have a multidimensional array and want to mask all values which are NOT NaN. I know there is masked_invalid for masking the NaN values, but I want the opposite: to keep only the NaNs. I've tried using where, but I am not sure I am writing it correctly.
Code (tt and tt2 produce the same thing):
tt = np.ma.array([[[0, 1, 2], [3, np.nan, 5], [6, 7, 8]],
                  [[10, 11, 12], [13, np.nan, 15], [16, 17, 18]],
                  [[20, 21, 22], [23, np.nan, 25], [26, 27, 28]]])
tt2 = np.ma.where(tt == np.nan, tt == np.nan, tt)
[[[ 0. 1. 2.]
[ 3. nan 5.]
[ 6. 7. 8.]]
[[10. 11. 12.]
[13. nan 15.]
[16. 17. 18.]]
[[20. 21. 22.]
[23. nan 25.]
[26. 27. 28.]]]
Desired Result:
All integers to be masked (--), leaving only NaN
I think you want:
tt2 = np.ma.masked_where(~np.isnan(tt), tt)
Note the use of np.isnan (i.e., note that np.nan == np.nan is False!) and the not (~) operator. In other words, this says: mask where the array tt is not NaN. Good luck.
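For completeness, here is a minimal sketch of that answer applied to the tt from the question; only the NaNs stay unmasked:
import numpy as np

tt = np.ma.array([[[0, 1, 2], [3, np.nan, 5], [6, 7, 8]],
                  [[10, 11, 12], [13, np.nan, 15], [16, 17, 18]],
                  [[20, 21, 22], [23, np.nan, 25], [26, 27, 28]]])
tt2 = np.ma.masked_where(~np.isnan(tt), tt)  # mask everything that is not NaN
print(tt2[0])
# [[-- -- --]
#  [-- nan --]
#  [-- -- --]]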
I have a long array (could be pandas or numpy, as convenient) where some rows have the first two columns identical (x-y position) and the third unique (time), e.g.:
x y t
0. 0. 10.
0. 0. 11.
0. 0. 12.
0. 1. 13.
0. 1. 14.
1. 1. 15.
Positions are grouped, but there may be 1, 2 or 3 time values listed for each, meaning there may be 1, 2 or 3 rows with identical x and y. The array needs to be reshaped/condensed such that each position has its own row, with the min and max values of time - i.e., the target is:
x y t1 t2
0. 0. 10. 12.
0. 1. 13. 14.
1. 1. 15. inf
Is there a simple/elegant way of doing this in pandas or numpy? I've tried loops but they're messy and terribly inefficient, and I've tried using np.unique:
target_array = np.unique(initial_array[:, 0:2], axis=0)
That yields
x y
0. 0.
0. 1.
1. 1.
which is a good start, but then I'm stuck on generating the last two columns.
IIUC, you can use
out = (df.groupby(['x', 'y'])['t']
         .agg(t1='min', t2='max', c='count')
         .reset_index()
         .pipe(lambda df: df.assign(t2=df['t2'].mask(df['c'].eq(1), np.inf)))
         .drop(columns='c')
      )
print(out)
x y t1 t2
0 0.0 0.0 10.0 12.0
1 0.0 1.0 13.0 14.0
2 1.0 1.0 15.0 inf
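If you'd rather stay in numpy, here is a possible sketch that builds on the np.unique call you already have. It assumes the data sits in initial_array as in the example; np.minimum.at / np.maximum.at accumulate the per-group min and max of the time column:
import numpy as np

initial_array = np.array([[0., 0., 10.],
                          [0., 0., 11.],
                          [0., 0., 12.],
                          [0., 1., 13.],
                          [0., 1., 14.],
                          [1., 1., 15.]])

# unique positions, a row -> group mapping, and per-group counts
positions, inverse, counts = np.unique(initial_array[:, 0:2], axis=0,
                                       return_inverse=True, return_counts=True)
inverse = inverse.ravel()  # flatten, to be safe across numpy versions

t1 = np.full(len(positions), np.inf)
t2 = np.full(len(positions), -np.inf)
np.minimum.at(t1, inverse, initial_array[:, 2])  # per-group min of t
np.maximum.at(t2, inverse, initial_array[:, 2])  # per-group max of t
t2[counts == 1] = np.inf  # a single observation gets t2 = inf, as in the target

target_array = np.column_stack([positions, t1, t2])
print(target_array)
# [[ 0.  0. 10. 12.]
#  [ 0.  1. 13. 14.]
#  [ 1.  1. 15. inf]]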
Say I have a large dataframe and some lists of columns, and I want to be able to put them into patsy's dmatrices without having to write out each name individually. That is, I want to build the terms from a list of column names, rather than write out every single term from my dataframe's columns.
For example take the following df
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8],
                   'c': [8, 4, 5, 3], 'd': [1, 3, 55, 3],
                   'e': [8, 4, 5, 3]})
df
>>
a b c d e
0 1 5 8 1 8
1 2 6 4 3 4
2 3 7 5 55 5
3 4 8 3 3 3
As I understand it, getting this into a design matrix requires me to do the following:
y,x = dmatrices('a~b+c+d+e', data=df)
However I would like to be able to run something more along the lines of:
regress=['b', 'c']
control=['e', 'd']
y, x = dmatrices('a~{}+{}'.format(' '.join(e for e in regress),
                                  ' '.join(c for c in control)), data=df)
However, this was unsuccessful.
I also attempted to use a dictionary with two entries, say regress and control, filled with lists of the column names, and to pass that as the first argument of dmatrices, but it didn't work either.
Does anyone have any suggestions for a more efficient way to get things into patsy's dmatrices, rather than writing out each and every column name we would like to include in the matrix?
Thanks in advance and let me know if I was not clear on anything.
Doing it with a for loop here:
for z in regress:
    for t in control:
        y, x = dmatrices('a~{}+{}'.format(z, t), data=df)
        print('a~{}+{}'.format(z, t))
        print(y, x)
a~b+e
[[1.]
[2.]
[3.]
[4.]] [[1. 5. 8.]
[1. 6. 4.]
[1. 7. 5.]
[1. 8. 3.]]
a~c+e
[[1.]
[2.]
[3.]
[4.]] [[1. 8. 8.]
[1. 4. 4.]
[1. 5. 5.]
[1. 3. 3.]]
a~d+e
[[1.]
[2.]
[3.]
[4.]] [[ 1. 1. 8.]
[ 1. 3. 4.]
[ 1. 55. 5.]
[ 1. 3. 3.]]
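If the goal is instead a single design matrix containing all the listed terms at once (rather than one small model per pair), joining the names with '+' should work. A minimal sketch, reusing df, regress and control from the question:
from patsy import dmatrices

formula = 'a~' + '+'.join(regress + control)  # builds 'a~b+c+e+d'
y, x = dmatrices(formula, data=df)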
I'm aligning multiple datasets (model and observations) and I thought it would make a lot of sense if xarray.align had a method to propagate NaNs/missing data in one dataset to the others. For now, I'm using xr.dataset.where in combination with np.isfinite, but especially my attempt to generalize this for more than two arrays feels a bit tricky. Is there a better way to do this?
a = xr.DataArray(np.arange(10).astype(float))
b = xr.DataArray(np.arange(10).astype(float))
a[[4, 5]] = np.nan
print(a.values)
print(b.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
# Default behaviour
c, d = xr.align(a, b)
print(c.values)
print(d.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
# Desired behaviour
e, f = xr.align(a.where(np.isfinite(b)), b.where(np.isfinite(a)))
print(e.values)
print(f.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
# Attempt to generalize for multiple arrays
c = b.copy()
c[[1, -1]] = np.nan

def align_better(*dataarrays):
    allvalid = np.all(np.array([np.isfinite(x) for x in dataarrays]), axis=0)
    return xr.align(*[da.where(allvalid) for da in dataarrays])
g, h, i = align_better(a, b, c)
print(g.values)
print(h.values)
print(i.values)
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
From the xarray docs:
Given any number of Dataset and/or DataArray objects, returns new objects with aligned indexes and dimension sizes.
Array from the aligned objects are suitable as input to mathematical operators, because along each dimension they have the same index and size.
Missing values (if join != 'inner') are filled with NaN.
Nothing about this function deals with the values in the arrays, just the dimensions and coordinates. This function is used for setting up arrays for operations against each other.
If your desired behavior is a function that returns NaN for all arrays where any arrays are NaN, your align_better function seems like a decent way to do it.
The function in my initial attempt was slow because the dataarrays were cast to numpy arrays. In this modified version, I first align the datasets; then I can safely use the .values attribute. This is much faster.
def align_better(*dataarrays):
    """Align datasets and propagate NaNs."""
    aligned = xr.align(*dataarrays)
    allvalid = np.all(np.asarray([np.isfinite(x).values for x in aligned]), axis=0)
    return [da.where(allvalid) for da in aligned]
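Usage is unchanged; a quick check with the a, b and c from above:
g, h, i = align_better(a, b, c)
print(g.values)
>> [ 0. nan  2.  3. nan nan  6.  7.  8. nan]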
I have a 2d array a and a 2d array b. I need to calculate c = a/b, so the result contains some inf or NaN values. How can I check for them with numpy and set them to np.nan?
Here is my code:
import numpy as np
a=np.asarray([[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5]])
b=np.asarray([[1,2,0,4,5],[1,2,0,4,5],[1,2,0,4,5],[1,2,3,4,5]])
c=a/b
b=np.where(isinstance(c, float),np.nan,c)
I am not sure, correct me if I am wrong: you are referring to the inf values in c, i.e. after calculating c = a/b.
Following is the sample code:
import numpy as np
np.seterr(divide='ignore', invalid='ignore')  # suppress the RuntimeWarning: divide by zero encountered in true_divide
a=np.asarray([[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5]])
b=np.asarray([[1,2,0,4,5],[1,2,0,4,5],[1,2,0,4,5],[1,2,3,4,5]])
c=a/b
print(c)
[[ 1. 1. inf 1. 1.]
[ 1. 1. inf 1. 1.]
[ 1. 1. inf 1. 1.]
[ 1. 1. 1. 1. 1.]]
c[np.isinf(c)] = np.nan  # find the inf entries and replace them with nan
print(c)
[[ 1. 1. nan 1. 1.]
[ 1. 1. nan 1. 1.]
[ 1. 1. nan 1. 1.]
[ 1. 1. 1. 1. 1.]]
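For what it's worth, here is a one-step alternative sketch using np.where with np.isfinite; NaNs already present in c are simply kept:
c = np.where(np.isfinite(c), c, np.nan)  # replace inf/-inf with nan, keep the rest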
Hope it helps!
Is this a feature or a bug? Can someone explain to me this behavior of a numpy masked_array? It seems to change the fill_value after applying the sum operation, which is confusing if you intend to use the filled result.
import numpy as np
import numpy.ma as ma

data = np.ones((5, 5))
m = np.zeros((5, 5), dtype=bool)
m[3, :] = True  # mask out row 3
arr = ma.masked_array(data, mask=m, fill_value=np.nan)
print(arr)
print('Fill value:', arr.fill_value)
print(arr.filled())
farr = arr.sum(axis=1)
print(farr)
print('Fill value:', farr.fill_value)
print(farr.filled())
# I was expecting this
print(np.nansum(arr.filled(), axis=1))
Prints output:
[[1.0 1.0 1.0 1.0 1.0]
[1.0 1.0 1.0 1.0 1.0]
[1.0 1.0 1.0 1.0 1.0]
[-- -- -- -- --]
[1.0 1.0 1.0 1.0 1.0]]
Fill value: nan
[[ 1. 1. 1. 1. 1.]
[ 1. 1. 1. 1. 1.]
[ 1. 1. 1. 1. 1.]
[ nan nan nan nan nan]
[ 1. 1. 1. 1. 1.]]
[5.0 5.0 5.0 -- 5.0]
Fill value: 1e+20
[ 5.00000000e+00 5.00000000e+00 5.00000000e+00 1.00000000e+20
5.00000000e+00]
[ 5. 5. 5. nan 5.]
The array returned by arr.sum is a new array which does not inherit the fill_value of arr (though I agree that might be a nice improvement to np.ma). As a workaround, you could use
In [18]: farr.filled(arr.fill_value)
Out[18]: array([ 5., 5., 5., nan, 5.])
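If you want later .filled() calls on the result to behave consistently, you can also copy the fill value over once; fill_value is a writable attribute on masked arrays:
farr.fill_value = arr.fill_value  # now farr.filled() fills with nan as well
print(farr.filled())
# [ 5.  5.  5. nan  5.]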