Keras fitting ignoring nan values - python

I am training a neural network to do regression (1 input and 1 output). Let x and y be the usual input and output datasets, respectively.
My problem is that the y dataset (not the x) has some values set to nan, so the loss during fitting becomes nan. I wonder if there is an option to ignore the nan values during fitting, similar to numpy functions such as np.nanmean that compute the mean while ignoring nans.
If that option does not exist, I suppose I would have to find the nan values and remove them manually, and at the same time remove the values in x corresponding to the nan positions in y.
x y
2 4
3 2
4 np.nan
5 7
6 np.nan
7 np.nan
In this simple example the nan values in the y column should be removed and at the same time the corresponding values in the x column (4, 6, 7).
Thank you.
EDIT: OK, I have a problem filtering the nans. I do:
for index, x in np.ndenumerate(a):
    if x == np.nan:
        print(index, x)
and it doesn't print anything, and I am sure there are nan values...
EDIT (SELF ANSWER): OK, I have found a way to locate the nans:
for index, x in np.ndenumerate(a):
    if x != x:
        print(index, x)
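A more direct way to locate the nans is np.isnan, which avoids relying on the x != x trick; a minimal sketch, assuming a is a numpy array:
import numpy as np

a = np.array([4, 2, np.nan, 7, np.nan, np.nan])

# positions of every nan entry (one row per nan)
print(np.argwhere(np.isnan(a)))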

As said in the comments, simply remove the nans as a preprocessing step:
import numpy as np

x = list(range(2, 8))
y = [4, 2, np.nan, 7, np.nan, np.nan]

# iterate over a snapshot of the pairs so that removing from x and y is safe
for a, b in list(zip(x, y)):
    if np.isnan(b):
        x.remove(a)
        y.remove(b)
print(x, y)
which prints [2, 3, 5] [4, 2, 7].
If you're using some tool to preprocess the data and that tool is what produces the np.nan values, check whether its API allows you to disable this behavior, and take a minute to think about whether this is really the behavior you want (you might, for example, prefer to map the nans to constants because you find those inputs valuable even though they have no labels).
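If x and y are numpy arrays rather than plain lists, the same filtering can be done without a Python loop by masking with np.isnan (a minimal sketch under that assumption):
import numpy as np

x = np.arange(2, 8, dtype=float)
y = np.array([4, 2, np.nan, 7, np.nan, np.nan])

mask = ~np.isnan(y)              # True where y holds a real value
x_clean, y_clean = x[mask], y[mask]
print(x_clean, y_clean)          # [2. 3. 5.] [4. 2. 7.]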

Related

How to not impute NaN values with pandas cut function?

I'm trying to use the cut function to convert numeric data into categories. My input data may have NaN values, which I would like to stay NaN after the cut. From what I understand reading the documentation, this is the default behavior and the following code should work:
intervals = [(i, i+1) for i in range(101)]
bins = pd.IntervalIndex.from_tuples(intervals)
pd.cut(pd.Series([np.nan,0.5,10]),bins)
However, the output I get is:
(49, 50]
(0, 1]
(9, 10]
Notice that the NaN value is converted to the middle interval.
One strange thing is that it appears as though once the number of intervals is 100 or less, I get the desired output:
intervals = [(i, i+1) for i in range(100)]
bins = pd.IntervalIndex.from_tuples(intervals)
pd.cut(pd.Series([np.nan,0.5,10]),bins)
output:
NaN
(0, 1]
(9, 10]
Is there a way to specify that I don't want NaN values to be imputed?
This seems like a bug that originates from numpy.searchsorted():
pandas-dev/pandas#31586 - pd.cut returning incorrect output in some cases
numpy/numpy#15499 - BUG: searchsorted with object arrays containing nan
As a workaround, you could replace np.nan with some other value that is guaranteed not to match any bin, e.g. .replace(np.nan, 'foo'):
intervals = [(i, i+1) for i in range(101)]
bins = pd.IntervalIndex.from_tuples(intervals)
pd.cut(pd.Series([np.nan,0.5,10]).replace(np.nan,'foo'),bins)
0 NaN
1 (0.0, 1.0]
2 (9.0, 10.0]
dtype: category
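Another workaround (a sketch, not taken from the linked issues) is to cut only the non-NaN values and then reindex the result back onto the original index, so the NaN positions stay NaN:
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 0.5, 10])
intervals = [(i, i + 1) for i in range(101)]
bins = pd.IntervalIndex.from_tuples(intervals)

# cut only the rows that are not NaN, then realign with the full index
result = pd.cut(s.dropna(), bins).reindex(s.index)
print(result)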

Pandas - change cell value based on conditions from cell and from column

I have a DataFrame with a lot of "bad" cells. Let's say they all have -99.99 as their value, and I want to remove them (set them to NaN).
This works fine:
df[df == -99.99] = None
But actually I want to delete all these cells ONLY if another cell in the same row is marked as 1 (e.g. in the column "Error").
I want to delete all -99.99 cells, but only if df["Error"] == 1.
The most straightforward solution, I think, is something like
df[(df == -99.99) & (df["Error"] == 1)] = None
but it gives me the error:
ValueError: cannot reindex from a duplicate axis
I tried every solution given on the internet but I can't get it to work! :(
Since my DataFrame is big I don't want to iterate over it (which, of course, would work, but would take a lot of time).
Any hint?
Try using broadcasting while passing numpy values:
import numpy as np
import pandas as pd

# sample data, special value is -99
df = pd.DataFrame([[-99, -99, 1], [2, -99, 2],
                   [1, 1, 1], [-99, 0, 1]],
                  columns=['a', 'b', 'Errors'])

# note the double square brackets: df[['Errors']] keeps the comparison
# two-dimensional, so its .values broadcast across all columns
df[(df == -99) & (df[['Errors']] == 1).values] = np.nan
Output:
     a     b  Errors
0  NaN   NaN       1
1  2.0 -99.0       2
2  1.0   1.0       1
3  NaN   0.0       1
At least, this works (though it iterates over the columns):
for i in df.columns:
    df.loc[df[i].isin([-99.99]) & df["Error"].isin([1]), i] = None
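A loop-free alternative (a sketch, using the column name "Error" and the value -99.99 from the question rather than the sample data above) is to select the flagged rows once and replace inside them:
import numpy as np

# rows flagged as errors
mask = df["Error"] == 1

# replace -99.99 with NaN only within those rows
df.loc[mask] = df.loc[mask].replace(-99.99, np.nan)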

Why is max and min of numpy array nan?

What could be the reason why the max and min of my numpy array are nan?
I checked my array with:
for i in range(data[0]):
    if data[i] == numpy.nan:
        print("nan")
And there is no nan in my data.
Is my search wrong?
If not: What could be the reason for max and min being nan?
Here you go:
import numpy as np
a = np.array([1, 2, 3, np.nan, 4])
print(f'a.max() = {a.max()}')
print(f'np.nanmax(a) = {np.nanmax(a)}')
print(f'a.min() = {a.min()}')
print(f'np.nanmin(a) = {np.nanmin(a)}')
Output:
a.max() = nan
np.nanmax(a) = 4.0
a.min() = nan
np.nanmin(a) = 1.0
Balaji Ambresh showed precisely how to find the min / max even if the source array contains NaN; there is nothing to add on that matter.
But your code sample also contains other flaws that deserve to be pointed out.
Your loop contains for i in range(data[0]):.
You probably wanted to execute this loop once for each element of data, but it will actually be executed as many times as the value of the initial element of data.
Depending on that value:
If it is e.g. 1, the loop body will be executed only once.
If it is 0 or negative, it will not be executed at all.
If it is greater than or equal to the size of data, an IndexError exception will be raised.
If your array contains at least one NaN, then the whole array is of float type (NaN is a special case of float) and you get a TypeError exception: 'numpy.float64' object cannot be interpreted as an integer.
Remedy (one possible variant): the loop should start with for elem in data: and the code inside should use elem as the current element of data.
The next line contains if data[i] == numpy.nan:.
Even if you corrected it to if elem == np.nan:, the code inside the if block would never be executed.
The reason is that np.nan is by definition not equal to any other value, even if this other value is another np.nan.
Remedy: change it to if np.isnan(elem): (Balaji wrote in his comment how to change your code; I added why). A corrected version of the loop is sketched below.
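Putting both remedies together, the corrected check looks like this (a minimal sketch, assuming data is a numpy array):
import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0])

# iterate over the elements themselves and test each one with np.isnan
for elem in data:
    if np.isnan(elem):
        print("nan")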
And finally, how to quickly check an array for NaNs:
To get a detailed, element-wise answer (whether each element is NaN), run np.isnan(data) and you will get a bool array.
To get a single answer (whether data contains at least one NaN, no matter where), run np.isnan(data).any().
This is shorter and runs significantly faster than an explicit loop.
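For example (a small sketch):
import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0])

print(np.isnan(data))        # element-wise: [False False  True False]
print(np.isnan(data).any())  # True if at least one NaN is present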
The reason is that np.nan == x is always False, even when x is np.nan. This is in line with the definition of NaN (see the NaN article on Wikipedia).
Check yourself:
In [4]: import numpy as np
In [5]: np.nan == np.nan
Out[5]: False
If you want to check if a number x is np.nan, you must use
np.isnan(x)
If you want to get max/min of an np.array with nan's, use np.nanmax()/ np.nanmin():
minval = np.nanmin(data)
Simply use np.nanmax(variable_name) and np.nanmin(variable_name):
import numpy as np

z = np.arange(10, 20)
z = np.where(z < 15, np.nan, z)  # set the values below 15 to nan
print(z)
print("z max value excluding nan:", np.nanmax(z))
print("z min value excluding nan:", np.nanmin(z))

Python nested array indexing - unexpected behaviour

Suppose we have an input array with some (but not all) nan values, from which we want to write into a nan-initialized output array. After writing the non-nan data into the output array, it still contains nan values at those positions and I don't understand at all why:
# minimal example just for testing purposes
import numpy as np
# fix state of seed
np.random.seed(1000)
# create input array and nan-filled output array
a = np.random.rand(6,3,5)
b = np.zeros((6,3,5)) * np.nan
x = [np.arange(6),1,2]
# select data in one dimension with others fixed
y_temp = a[x]
# set arbitrary index to nan
y_temp[1] = np.nan
ind_valid = ~np.isnan(y_temp)
# select non-nan values
y = y_temp[ind_valid]
# write input to output at corresponding indices
b[x][ind_valid] = y
print(b[x][ind_valid])
# surprise, surprise :(
# [ nan nan nan nan nan nan]
# workaround (that will of course cost computation time, even if not much)
c = np.zeros(len(y_temp)) * np.nan
c[ind_valid] = y
b[x] = c
print(b[x][ind_valid])
# and this is what we want to have
# [ 0.39719446 nan 0.39820488 0.68190824 0.86534558 0.69910395]
I thought the array b would reserve some block in memory and, by indexing it with x, would "know" those indices. Then it should also know them when selecting only some of them with ind_valid and be able to write into exactly those addresses in memory. No idea, but maybe it's something similar to python nested list unexpected behaviour? Please explain and maybe also provide a nice solution instead of the proposed workaround! Thanks!
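For reference, indexing b with a list like x uses NumPy's advanced (fancy) indexing, which returns a copy rather than a view, so b[x][ind_valid] = y assigns into a temporary array that is thrown away. A sketch of a single combined assignment that writes into b directly (reusing the variables defined above):
# first-axis indices whose y_temp value is not nan
valid_rows = np.arange(6)[ind_valid]

# one advanced-indexing assignment writes straight into b, no temporary copy
b[valid_rows, 1, 2] = y_temp[ind_valid]
print(b[x][ind_valid])  # the non-nan values that were written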

Correct Syntax for clearing SettingWithCopyWarning in Pandas

I'm currently learning how to use Pandas, and I'm in a situation where I'm attempting to replace missing data (Horsepower feature) using a best-fit line generated from linear regression with the Displacement column. What I'm doing is iterating through only the parts of the dataframe that are marked as NaN in the Horsepower column and replacing the data by feeding in the value of Displacement in that same row into the best-fit algorithm. My code looks like this:
for row, value in auto_data.HORSEPOWER[pd.isnull(auto_data.HORSEPOWER)].iteritems():
    auto_data.HORSEPOWER[row] = int(round(slope * auto_data.DISPLACEMENT[row] + intercept))
Now, the code works and the data is replaced as expected, but it generates the SettingWithCopyWarning when I run it. I understand why the warning is generated, and that in this case I'm fine, but if there is a better way to iterate through the subset, or a method that's just more elegant, I'd rather avoid chained indexing that could cause a real problem in the future.
I've looked through the docs, and through other answers on Stack Overflow. All solutions to this seem to use .loc, but I just can't seem to figure out the correct syntax to get the subset of NaN rows using .loc. Any help is appreciated. If it helps, the dataframe looks like this:
auto_data.dtypes
Out[15]:
MPG             float64
CYLINDERS         int64
DISPLACEMENT    float64
HORSEPOWER      float64
WEIGHT            int64
ACCELERATION    float64
MODELYEAR         int64
NAME             object
dtype: object
IIUC you should be able to just do:
auto_data.loc[auto_data['HORSEPOWER'].isnull(), 'HORSEPOWER'] = np.round(slope * auto_data['DISPLACEMENT'] + intercept)
The above is vectorised and avoids looping; the warning you get comes from doing this:
auto_data.HORSEPOWER[row]
I think if you did:
auto_data.loc[row,'HORSEPOWER']
then the warning should not be raised
Instead of looping through the DataFrame row-by-row, it would be more efficient to calculate the extrapolated values in a vectorized way for the entire column:
y = (slope * auto_data['DISPLACEMENT'] + intercept).round()
and then use update to replace the NaN values:
auto_data['HORSEPOWER'].update(y)
Using update works for the particular case of replacing NaN values.
Ed Chum's solution shows how to replace the value in arbitrary rows by using a boolean mask and auto_data.loc.
For example,
import numpy as np
import pandas as pd

auto_data = pd.DataFrame({
    'HORSEPOWER': [1, np.nan, 2],
    'DISPLACEMENT': [3, 4, 5]})

slope, intercept = 2, 0.5
y = (slope * auto_data['DISPLACEMENT'] + intercept).round()
auto_data['HORSEPOWER'].update(y)
print(auto_data)
yields
   DISPLACEMENT  HORSEPOWER
0             3           6
1             4           8
2             5          10
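For comparison, the .loc approach mentioned above, applied to the same sample data (a sketch; it touches only the rows where HORSEPOWER is NaN and leaves the existing values intact):
import numpy as np
import pandas as pd

auto_data = pd.DataFrame({
    'HORSEPOWER': [1, np.nan, 2],
    'DISPLACEMENT': [3, 4, 5]})
slope, intercept = 2, 0.5

mask = auto_data['HORSEPOWER'].isnull()
auto_data.loc[mask, 'HORSEPOWER'] = (slope * auto_data['DISPLACEMENT'] + intercept).round()
print(auto_data)  # HORSEPOWER becomes [1.0, 8.0, 2.0]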
