Suppose we have an input array containing some (but not all) nan values, and we want to write its non-nan entries into a nan-initialized output array. After writing the non-nan data, the selected entries of the output array are still nan, and I don't understand why:
# minimal example just for testing purposes
import numpy as np
# fix state of seed
np.random.seed(1000)
# create input array and nan-filled output array
a = np.random.rand(6,3,5)
b = np.zeros((6,3,5)) * np.nan
x = (np.arange(6), 1, 2)
# select data in one dimension with others fixed
y_temp = a[x]
# set arbitrary index to nan
y_temp[1] = np.nan
ind_valid = ~np.isnan(y_temp)
# select non-nan values
y = y_temp[ind_valid]
# write input to output at corresponding indices
b[x][ind_valid] = y
print(b[x])
# surprise, surprise :(
# [ nan nan nan nan nan nan]
# workaround (that will of course cost computation time, even if not much)
c = np.zeros(len(y_temp)) * np.nan
c[ind_valid] = y
b[x] = c
print(b[x])
# and this is what we want to have
# [ 0.39719446 nan 0.39820488 0.68190824 0.86534558 0.69910395]
I thought the array b reserves a block of memory and that indexing it with x "knows" those locations, so selecting only some of them with ind_valid should still let me write into exactly those memory addresses. Maybe it's something similar to the unexpected behaviour of nested Python lists? Please explain, and ideally suggest a nicer solution than the workaround above. Thanks!
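Another workaround I can think of (a sketch; it folds the boolean mask into the integer index so the assignment targets b in a single indexing step, though I am not sure this is the idiomatic way):
# sketch of another workaround: combine mask and index so we assign to b directly
b = np.zeros((6, 3, 5)) * np.nan        # fresh nan-filled output
rows = np.arange(6)[ind_valid]          # first-axis indices where y_temp is valid
b[rows, 1, 2] = y
print(b[x])
# same output as the workaround above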
What could be the reason why the max and min of my numpy array are nan?
I checked my array with:
for i in range(data[0]):
    if data[i] == numpy.nan:
        print("nan")
And there is no nan in my data.
Is my search wrong?
If not: What could be the reason for max and min being nan?
Here you go:
import numpy as np
a = np.array([1, 2, 3, np.nan, 4])
print(f'a.max() = {a.max()}')
print(f'np.nanmax(a) = {np.nanmax(a)}')
print(f'a.min() = {a.min()}')
print(f'np.nanmin(a) = {np.nanmin(a)}')
Output:
a.max() = nan
np.nanmax(a) = 4.0
a.min() = nan
np.nanmin(a) = 1.0
Balaji Ambresh showed precisely how to find the min / max even
if the source array contains NaN; there is nothing to add on that matter.
But your code sample also contains other flaws that deserve to be pointed out.
Your loop contains for i in range(data[0]):.
You probably wanted to execute this loop for each element of data,
but your loop will be executed as many times as the value of
the initial element of data.
Variations:
If it is e.g. 1, it will be executed only once.
If it is 0 or negative, it will not be executed at all.
If it is >= the size of data, an IndexError exception
will be raised.
If your array contains at least one NaN, then the whole array
is of float type (NaN is a special kind of float) and you get
a TypeError exception: 'numpy.float64' object cannot be interpreted
as an integer.
Remedy (one possible variant): this loop should start with
for elem in data: and the code inside should use elem as the
current element of data.
The next line contains if data[i] == numpy.nan:.
Even if you corrected it to if elem == np.nan:, the code inside
the if block will never be executed.
The reason is that np.nan is by definition not equal to any
other value, even if this other value is another np.nan.
Remedy: change it to if np.isnan(elem): (Balaji wrote in his comment
how to change your code; I added why).
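Putting both remedies together, the corrected loop might look like this (a sketch, assuming data is a 1-D float array):
for elem in data:
    if np.isnan(elem):
        print("nan")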
And finally: how to quickly check an array for NaNs:
To get a detailed list, whether each element is NaN, run np.isnan(data)
and you will get a bool array.
To get a single answer, whether data contains at least one NaN,
no matter where, run np.isnan(data).any().
This code is shorter and runs significantly faster.
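A minimal sketch of both checks (assuming data is a NumPy float array):
mask = np.isnan(data)             # bool array: True where the element is NaN
has_nan = np.isnan(data).any()    # single bool: does data contain any NaN at all?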
The reason is that np.nan == x is always False, even when x is np.nan. This is in line with the definition of NaN (see the Wikipedia article on NaN).
Check yourself:
In [4]: import numpy as np
In [5]: np.nan == np.nan
Out[5]: False
If you want to check if a number x is np.nan, you must use
np.isnan(x)
If you want to get the max/min of an np.array containing NaNs, use np.nanmax() / np.nanmin():
minval = np.nanmin(data)
Simply use np.nanmax(variable_name) and np.nanmin(variable_name):
import numpy as np
z = np.arange(10, 20)
z = np.where(z < 15, np.nan, z)  # set values below 15 to nan
print(z)
print("z max value excluding nan :", np.nanmax(z))
print("z min value excluding nan :", np.nanmin(z))
I have an xarray dataset. I want to make a copy of that so it has the same dimensions/coordinates/shape as the original. That's easy.
import numpy as np
import xarray as xr
n_segs = 4
n_dates = 5
num_vars = 4
dims = (n_segs, n_dates)
das = [xr.DataArray(np.random.rand(*dims), dims=['seg_id', 'date'])
       for i in range(num_vars)]
ds_orig = xr.Dataset({'a': das[0], 'b': das[1], 'c': das[2], 'd': das[3]})
ds_copy = ds_orig.copy(deep=True)
Then I want to assign all the values in the copy a constant value (let's say 1). I've figured out how to do this with where:
ds_copy.where(ds_copy == np.nan, other=1)
but this assumes that none of my values will be nan and is a little counter-intuitive IMO. Is there a more robust way?
I suppose I can also loop through the data variables (which is what this suggests for Pandas)...:
for v in ds_copy.data_vars:
    ds_copy[v].loc[:, :] = 1
Maybe what I'm looking for here is a replace method.
I would recommend the loop approach because it will preserve the dtypes of the original values. A single full slice in the loc is enough, and the .data_vars can be omitted (datasets have a dictionary interface):
for v in ds_copy:
    ds_copy[v].loc[:] = 1
To get a more robust version of the where approach, you can pass False directly to make sure other will always be used:
ds_copy.where(False, 1)
For int and float variables, whether or not the dtype is preserved will probably not have any effect; however, if there are also string or boolean variables, the results may change drastically.
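A small sketch of that dtype difference, using a hypothetical dataset with a boolean variable (the exact promotion behaviour may vary between xarray versions):
flags = xr.Dataset({'flag': xr.DataArray(np.array([True, False, True]))})

looped = flags.copy(deep=True)
for v in looped:
    looped[v].loc[:] = 1           # assigns into the existing bool array
print(looped['flag'].dtype)        # bool -- dtype preserved

masked = flags.where(False, 1)     # `other` is used everywhere
print(masked['flag'].dtype)        # typically promoted to an integer dtype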
I have the following example:
import pandas as pd
import numpy as np
df = pd.DataFrame([(0,2,5), (2,4,None),(7,-5,4), (1,None,None)])
def clean(series):
    start = np.min(list(series.index[pd.isnull(series)]))
    end = len(series)
    series[start:] = series[start-1]
    return series
My objective is to obtain a dataframe in which each row that contains a None value is filled in with the last available numerical value.
So, for example, running this function on just the row with index 3 of the dataframe, I would produce the following:
row = df.loc[3]
test = clean(row)
test
0 1.0
1 1.0
2 1.0
Name: 3, dtype: float64
I cannot get this to work using the .apply() method, i.e. df.apply(clean, axis=1).
I should mention that this is a toy example - the custom function I would write in the real one is more dynamic in how it fills the values - so I am not looking for basic utilities like .ffill or .fillna.
The apply method didn't work because, when a row is completely filled, pd.isnull(series) selects nothing, so your clean function calls np.min on an empty array and fails.
So use a condition before altering the series data, i.e.
def clean(series):
    # Create a copy for the sake of safety
    series = series.copy()
    # Alter the series only if there exists a None value
    if pd.isnull(series).any():
        # for a completely filled row,
        # series.index[pd.isnull(series)] would return
        # Int64Index([], dtype='int64')
        start = np.min(list(series.index[pd.isnull(series)]))
        end = len(series)
        series[start:] = series[start-1]
    return series

df.apply(clean, axis=1)
Output:
     0    1    2
0  0.0  2.0  5.0
1  2.0  4.0  4.0
2  7.0 -5.0  4.0
3  1.0  1.0  1.0
Hope that clarifies why apply didn't work. I also suggest considering pandas' built-in methods for cleaning the data rather than writing functions from scratch.
First, here is a one-liner that solves your toy problem, though you said it isn't what you want:
df.ffill(axis=1)
Next, let's test your code:
df.apply(clean,axis=1)
#...start = np.min(list(series.index[pd.isnull(series)]))...
#=>ValueError: ('zero-size array to reduction operation minimum
# which has no identity', 'occurred at index 0')
To understand the situation, test with a lambda function:
df.apply(lambda series:list(series.index[pd.isnull(series)]),axis=1)
0 []
1 [2]
2 []
3 [1, 2]
dtype: object
And the following expression raises the same ValueError:
import numpy as np
np.min([])
In conclusion, DataFrame.apply() works fine; it is the clean function that fails on rows that contain no null values.
Could you use something like fillna with backfill? I think this might be more efficient, if backfill meets your scenario.
i.e.
df.fillna(method='backfill')
However, this assumes the missing cells contain np.nan?
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
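For what it's worth, the None values in the toy df above are already stored as NaN once the numeric DataFrame is constructed, so fillna does apply; a quick sketch:
import pandas as pd

df = pd.DataFrame([(0, 2, 5), (2, 4, None), (7, -5, 4), (1, None, None)])
print(df.isna().sum().sum())          # 3 -- the None cells were converted to NaN
print(df.fillna(method='backfill'))   # note: newer pandas versions prefer df.bfill()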
I am training a neural network to do regression (1 input and 1 output). Let x and y be the usual input and output datasets, respectively.
My problem is that the y dataset (not the x) has some values set to nan, so the fitting goes to nan. I wonder if there is an option to ignore the nan values in the fitting, in a similar way to the numpy functions such as np.nanmean that calculate the mean ignoring nans.
If that option does not exist I suppose I would have to find the nan values and erase them manually, and at the same time erase the values in x corresponding to the nan position in y.
x y
2 4
3 2
4 np.nan
5 7
6 np.nan
7 np.nan
In this simple example the nan values in the y column should be removed and at the same time the corresponding values in the x column (4, 6, 7).
Thank you.
EDIT: OK, I have a problem filtering the nans. I do:
for index, x in np.ndenumerate(a):
    if x == np.nan:
        print(index, x)
and it doesn't print anything and I am sure there are nan values...
EDIT (SELF ANSWER): OK, I have found a way to locate the nans:
for index, x in np.ndenumerate(a):
    if x != x:
        print(index, x)
As said in the comments, simply remove the nan as a preprocessing step:
import numpy as np

x = list(range(2, 8))
y = [4, 2, np.nan, 7, np.nan, np.nan]
# iterate over a materialized list of the pairs so removing items
# from x and y does not disturb the iteration
for a, b in list(zip(x, y)):
    if str(b) == 'nan':
        x.remove(a)
        y.remove(b)
print(x, y)
produces [2, 3, 5] [4, 2, 7].
If you're using some tool to preprocess the data which gives you the np.nan, check whether the API allows you to disable this behavior, and take a minute to think whether this is really the behavior you want (or whether you e.g. want to map the missing labels to constants because you find those inputs valuable even though they have no labels).
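A vectorized alternative sketch using a boolean mask (assuming x and y can be converted to float arrays):
import numpy as np

x = np.array([2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([4, 2, np.nan, 7, np.nan, np.nan])
mask = ~np.isnan(y)               # True where y has a real value
x_clean, y_clean = x[mask], y[mask]
print(x_clean, y_clean)           # [2. 3. 5.] [4. 2. 7.]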
I have an array of size: (50, 50). Within this array there is a slice of size (20,10).
Only this slice contains data, the remainder is all set to nan.
How do I cut this slice out of my large array?
You can get this using boolean (fancy) indexing to collect the items that are not NaN:
a = a[np.logical_not(np.isnan(a))].reshape(20, 10)
or, alternatively, as suggested by Joe Kington:
a = a[~np.isnan(a)]
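A quick usage sketch on a synthetic array (the location of the data block below is made up for illustration):
import numpy as np

a = np.full((50, 50), np.nan)
a[5:25, 10:20] = np.random.rand(20, 10)   # hypothetical location of the (20, 10) block
block = a[~np.isnan(a)].reshape(20, 10)   # row-major flattening visits the block one row at a time
print(block.shape)                        # (20, 10)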
Do you know where the NaNs are? If so, something like this should work:
newarray = np.copy(oldarray[xstart:xend,ystart:yend])
where xstart and xend are the beginning and end of the slice you want in the x dimension and similarly for y. You can then delete the old array to free up memory if you don't need it anymore.
If you don't know where the NaNs are, this should do the trick:
# in this example, the starting array is A, numpy is imported as np
boolA = np.isnan(A)                        # boolean array: True where the NaNs are
nonnanidxs = list(zip(*np.where(~boolA)))  # all the indices which are non-NaN
# slice out the nans
corner1 = nonnanidxs[0]
corner2 = nonnanidxs[-1]
xdist = corner2[0] - corner1[0] + 1
ydist = corner2[1] - corner1[1] + 1
B = np.copy(A[corner1[0]:corner1[0]+xdist, corner1[1]:corner1[1]+ydist])
# B is now the array you want
Note that this would be pretty slow for large arrays because np.where looks through the whole thing. There's an open issue in the numpy bug tracker for a method that finds the first index equal to some value and then stops. There might be a more elegant way to do this; this is just the first thing that came to my head.
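A more compact variant of the same bounding-box idea (a sketch, again assuming the non-NaN region is a single contiguous rectangle):
rows, cols = np.where(~np.isnan(A))
B = A[rows.min():rows.max() + 1, cols.min():cols.max() + 1].copy()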
EDIT: ignore, sgpc's answer is much better.