Why are the max and min of my numpy array nan? - python

What could be the reason why the max and min of my numpy array are nan?
I checked my array with:
for i in range(data[0]):
    if data[i] == numpy.nan:
        print("nan")
And there is no nan in my data.
Is my search wrong?
If not: What could be the reason for max and min being nan?

Here you go:
import numpy as np
a = np.array([1, 2, 3, np.nan, 4])
print(f'a.max() = {a.max()}')
print(f'np.nanmax(a) = {np.nanmax(a)}')
print(f'a.min() = {a.min()}')
print(f'np.nanmin(a) = {np.nanmin(a)}')
Output:
a.max() = nan
np.nanmax(a) = 4.0
a.min() = nan
np.nanmin(a) = 1.0

Balaji Ambresh showed precisely how to find the min / max even
if the source array contains NaN; there is nothing to add on this matter.
But your code sample also contains other flaws that deserve to be pointed out.
Your loop contains for i in range(data[0]):.
You probably wanted to execute this loop for each element of data,
but your loop will be executed as many times as the value of
the initial element of data.
Variations:
If it is e.g. 1, it will be executed only once.
If it is 0 or negative, it will not be executed at all.
If it is greater than or equal to the size of data, an IndexError exception
will be raised.
If your array contains at least 1 NaN, then the whole array
is of float type (NaN is a special case of float) and you get
a TypeError exception: 'numpy.float64' object cannot be interpreted
as an integer.
Remedy (one of the possible variants): this loop should start with
for elem in data: and the code inside should use elem as the
current element of data.
The next line contains if data[i] == numpy.nan:.
Even if you corrected it to if elem == np.nan:, the code inside
the if block will never be executed.
The reason is that np.nan is by definition not equal to any
other value, even if this other value is another np.nan.
Remedy: change to if np.isnan(elem): (Balaji wrote in his comment
how to change your code; I added why).
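Putting both remedies together, a minimal sketch (the data array here is made up just for illustration):
import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0])   # stand-in for your data
for elem in data:
    if np.isnan(elem):
        print("nan")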
And finally: How to check quickly an array for NaNs:
To get a detailed list, whether each element is NaN, run np.isnan(data)
and you will get a bool array.
To get a single answer, whether data contains at least one NaN,
no matter where, run np.isnan(data).any().
This code is shorter and runs significantly faster.
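For example, with a small made-up array:
import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0])
print(np.isnan(data))        # [False False  True False]
print(np.isnan(data).any())  # True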

The reason is that np.nan == x is always False, even when x is np.nan. This is aligned with the NaN definition in Wikipedia.
Check yourself:
In [4]: import numpy as np
In [5]: np.nan == np.nan
Out[5]: False
If you want to check if a number x is np.nan, you must use
np.isnan(x)
If you want to get max/min of an np.array with nan's, use np.nanmax()/ np.nanmin():
minval = np.nanmin(data)

It is easy: use np.nanmax(variable_name) and np.nanmin(variable_name).
import numpy as np
z=np.arange(10,20)
z = np.where(z < 15, np.nan, z)  # set z values below 15 to nan
print(z)
print("z max value excluding nan :",np.nanmax(z))
print("z min value excluding nan :",np.nanmin(z))

Related

Are the outcomes of the numpy.where method on a pandas dataframe calculated on the full array or the filtered array?

I want to use a numpy.where on a pandas dataframe to check for the existence of a certain string in a column. If the string is present, apply a split function and take the second list element; if not, just take the first character. However, the following code doesn't work; it throws an IndexError: list index out of range because the first entry contains no underscore:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['a','a_1','b_','b_2_3']})
df["B"] = np.where(df.A.str.contains('_'),df.A.apply(lambda x: x.split('_')[1]),df.A.str[0])
Calling np.where with just the condition returns an array of indices for which the condition holds true, so I was under the impression that the split command would only be used on that subset of the data:
np.where(df.A.str.contains('_'))
Out[14]: (array([1, 2, 3], dtype=int64),)
But apparently the split-command is used on the entire unfiltered array which seems odd to me as that seems like a potentially big number of unnecessary operations that would slow down the calculation.
I'm not asking for an alternative solution, coming up with that isn't hard.
I'm merely wondering if this is an expected outcome or an issue with either pandas or numpy.
Python isn't a "lazy" language, so code is evaluated immediately. Generators/iterators do introduce some laziness, but that doesn't apply here.
if we split your line of code, we get the following statements:
df.A.str.contains('_')
df.A.apply(lambda x: x.split('_')[1])
df.A.str[0]
Python has to evaluate these statements before it can pass them as arguments to np.where.
To see all this happening, we can rewrite the above as little functions that display some output:
def fn_contains(x):
    print('contains', x)
    return '_' in x

def fn_split(x):
    s = x.split('_')
    print('split', x, s)
    # check for errors here
    if len(s) > 1:
        return s[1]

def fn_first(x):
    print('first', x)
    return x[0]
and then you can run them on your data with:
s = pd.Series(['a','a_1','b_','b_2_3'])
np.where(
    s.apply(fn_contains),
    s.apply(fn_split),
    s.apply(fn_first)
)
and you'll see everything being executed in order. This is basically what's happening "inside" numpy/pandas when you execute things.
In my opinion numpy.where only sets values by condition, so the second and third arrays are evaluated for all of the data - filtered and non-filtered alike.
If you need to apply some function only to the filtered values:
mask = df.A.str.contains('_')
df.loc[mask, "B"] = df.loc[mask, "A"].str.split('_').str[1]
There is an error in your solution, but the problem is not connected with np.where. After splitting by _, if the value does not exist you get a one-element list, so selecting the second value of the list with [1] raises the error:
print (df.A.apply(lambda x: x.split('_')))
0 [a]
1 [a, 1]
2 [b, ]
3 [b, 2, 3]
Name: A, dtype: object
print (df.A.apply(lambda x: x.split('_')[1]))
IndexError: list index out of range
So a pandas solution is possible here, if performance is not important, because string functions are slow:
df["B"] = np.where(df.A.str.contains('_'),
df.A.str.split('_').str[1],
df.A.str[0])

Python Sklearn MinMaxScaler ValueError: Input contains infinity or a value too large for dtype('float64') [duplicate]

I am using sklearn and having a problem with the affinity propagation. I have built an input matrix and I keep getting the following error.
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I have run
np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True
I tried using
mat[np.isfinite(mat) == True] = 0
to remove the infinite values but this did not work either.
What can I do to get rid of the infinite values in my matrix, so that I can use the affinity propagation algorithm?
I am using anaconda and python 2.7.9.
This might happen inside scikit, and it depends on what you're doing. I recommend reading the documentation for the functions you're using. You might be using one which depends e.g. on your matrix being positive definite and not fulfilling that criterion.
EDIT: How could I miss that:
np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True
is obviously wrong. Right would be:
np.any(np.isnan(mat))
and
np.all(np.isfinite(mat))
You want to check whether any of the elements are NaN, and not whether the return value of the any function is a number...
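To illustrate the difference with a small made-up matrix (the original mat is not shown):
import numpy as np

mat = np.array([1.0, np.nan, 3.0])
# Wrong: mat.any() collapses the array to a single boolean first,
# so np.isnan never sees the NaN element
print(np.isnan(mat.any()))    # False
# Right: apply isnan element-wise, then reduce
print(np.any(np.isnan(mat)))  # True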
I got the same error message when using sklearn with pandas. My solution is to reset the index of my dataframe df before running any sklearn code:
df = df.reset_index()
I encountered this issue many times when I removed some entries in my df, such as
df = df[df.label=='desired_one']
This is my function (based on this) to clean the dataset of nan, Inf, and missing cells (for skewed datasets):
import pandas as pd
import numpy as np
def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)
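A usage example (the frame below is made up just to show the effect; it relies on the imports and the clean_dataset function defined above):
raw = pd.DataFrame({'a': [1.0, np.nan, np.inf], 'b': [4.0, 5.0, 6.0]})
cleaned = clean_dataset(raw)
print(cleaned)   # only the first row survives: the NaN and inf rows are dropped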
In most cases, getting rid of infinite and null values solves this problem.
Get rid of infinite values:
df.replace([np.inf, -np.inf], np.nan, inplace=True)
Get rid of null values in whatever way you like: a specific value such as 999, the mean, or your own function to impute missing values:
df.fillna(999, inplace=True)
This is the check on which it fails:
https://github.com/scikit-learn/scikit-learn/blob/0.17.X/sklearn/utils/validation.py#L51
Which says
def _assert_all_finite(X):
    """Like assert_all_finite, but only for ndarray."""
    X = np.asanyarray(X)
    # First try an O(n) time, O(1) space solution for the common case that
    # everything is finite; fall back to O(n) space np.isfinite to prevent
    # false positives from overflow in sum method.
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
            and not np.isfinite(X).all()):
        raise ValueError("Input contains NaN, infinity"
                         " or a value too large for %r." % X.dtype)
So make sure that you have no NaN values in your input, that all of the values are actually float values, and that none of them are Inf either.
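One way to catch this before scikit-learn does is a small pre-check of your own; a sketch (the check_finite name and the sample data are made up):
import numpy as np

def check_finite(X):
    # raise early, with the offending positions, instead of inside scikit-learn
    X = np.asarray(X, dtype=np.float64)
    if not np.isfinite(X).all():
        bad = np.argwhere(~np.isfinite(X))
        raise ValueError("non-finite values at positions %s" % bad.tolist())
    return X

check_finite([[1.0, 2.0], [3.0, 4.0]])   # passes silently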
The dimensions of my input array were skewed, as my input csv had empty spaces.
With this version of python 3:
/opt/anaconda3/bin/python --version
Python 3.6.0 :: Anaconda 4.3.0 (64-bit)
Looking at the details of the error, I found the lines of codes causing the failure:
/opt/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
56 and not np.isfinite(X).all()):
57 raise ValueError("Input contains NaN, infinity"
---> 58 " or a value too large for %r." % X.dtype)
59
60
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
From this, I was able to extract the correct way to test what was going on with my data, using the same test that fails, as given by the error message: np.isfinite(X)
Then with a quick and dirty loop, I was able to find that my data indeed contains nans:
print(p[:,0].shape)
index = 0
for i in p[:,0]:
    if not np.isfinite(i):
        print(index, i)
    index += 1
(367340,)
4454 nan
6940 nan
10868 nan
12753 nan
14855 nan
15678 nan
24954 nan
30251 nan
31108 nan
51455 nan
59055 nan
...
Now all I have to do is remove the values at these indexes.
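Removing them can also be done without the loop; a sketch, assuming the same layout as p above (rows, with the first column being the one checked; the array here is a made-up stand-in):
import numpy as np

p = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])
bad = np.where(~np.isfinite(p[:, 0]))[0]   # row indices with NaN/inf in column 0
p_clean = np.delete(p, bad, axis=0)        # drop those rows
print(bad)                                 # [1]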
None of the answers here worked for me. This was what worked.
Test_y = np.nan_to_num(Test_y)
It replaces the infinite values with large finite values and the NaN values with zero.
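For illustration, on a small made-up array:
import numpy as np

a = np.array([np.nan, np.inf, -np.inf, 1.0])
print(np.nan_to_num(a))   # NaN -> 0.0, +inf -> largest float64, -inf -> most negative float64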
I had the same error, and in my case X and y were dataframes so I had to convert them to matrices first:
X = X.values.astype(np.float)
y = y.values.astype(np.float)
Edit: the originally suggested X.as_matrix() is deprecated.
The problem seems to occur in the DecisionTreeClassifier input check. Try:
X_train = X_train.replace((np.inf, -np.inf, np.nan), 0).reset_index(drop=True)
I had the error after trying to select a subset of rows:
df = df.reindex(index=my_index)
Turns out that my_index contained values that were not contained in df.index, so the reindex function inserted some new rows and filled them with nan.
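A minimal illustration of that behaviour (the frame and index here are made up):
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0]}, index=[0, 1])
# index 5 does not exist in df.index, so reindex creates that row filled with NaN
print(df.reindex(index=[0, 1, 5]))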
Remove all infinite values:
(and replace with min or max for that column)
import numpy as np
# generate example matrix
matrix = np.random.rand(5,5)
matrix[0,:] = np.inf
matrix[2,:] = -np.inf
>>> matrix
array([[ inf, inf, inf, inf, inf],
[0.87362809, 0.28321499, 0.7427659 , 0.37570528, 0.35783064],
[ -inf, -inf, -inf, -inf, -inf],
[0.72877665, 0.06580068, 0.95222639, 0.00833664, 0.68779902],
[0.90272002, 0.37357483, 0.92952479, 0.072105 , 0.20837798]])
# find min and max values for each column, ignoring nan, -inf, and inf
mins = [np.nanmin(matrix[:, i][matrix[:, i] != -np.inf]) for i in range(matrix.shape[1])]
maxs = [np.nanmax(matrix[:, i][matrix[:, i] != np.inf]) for i in range(matrix.shape[1])]
# go through matrix one column at a time and replace + and -infinity
# with the max or min for that column
for i in range(matrix.shape[1]):
matrix[:, i][matrix[:, i] == -np.inf] = mins[i]
matrix[:, i][matrix[:, i] == np.inf] = maxs[i]
>>> matrix
array([[0.90272002, 0.37357483, 0.95222639, 0.37570528, 0.68779902],
[0.87362809, 0.28321499, 0.7427659 , 0.37570528, 0.35783064],
[0.72877665, 0.06580068, 0.7427659 , 0.00833664, 0.20837798],
[0.72877665, 0.06580068, 0.95222639, 0.00833664, 0.68779902],
[0.90272002, 0.37357483, 0.92952479, 0.072105 , 0.20837798]])
I found that after calling pct_change on a new column, NaN existed in one of the rows. I removed the NaN rows with the following code:
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna()
df = df.reset_index()
I got the same error. It worked with df.fillna(-99999, inplace=True) before doing any replacement, substitution, etc.
I would like to propose a solution for numpy that worked well for me. The line
from numpy import inf
inputArray[inputArray == inf] = np.finfo(np.float64).max
substitutes all infinite values of a numpy array with the maximum float64 number.
Puff!! In my case the problem was about NaN values...
You can list the columns that have NaN with this function:
your_data.isnull().sum()
and then you can fill these NaN values in your dataset file.
Here is the code for how to "Replace NaN with zero and infinity with large finite numbers."
your_data[:] = np.nan_to_num(your_data)
from numpy.nan_to_num
In my case the problem was that many scikit functions return numpy arrays, which are devoid of pandas index. So there was an index mismatch when I used those numpy arrays to build new DataFrames and then I tried to mix them with the original data.
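A small sketch of that index mismatch (names and values are made up for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=[10, 20, 30])
scaled = np.array([0.1, 0.2, 0.3])                    # scikit-style output: plain ndarray, no index
bad = pd.DataFrame({'x_scaled': scaled})              # gets the default index 0, 1, 2
print(df.join(bad))                                   # x_scaled is all NaN - the indexes don't align
good = pd.DataFrame({'x_scaled': scaled}, index=df.index)
print(df.join(good))                                  # aligned correctly, no NaN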
dataset = dataset.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
This worked for me
I had the same issue, in my case the answer was simply that I had a cell in my CSV with no value ("x,y,z,,"). Putting a default value in fixed it for me.
Using isneginf may help.
http://docs.scipy.org/doc/numpy/reference/generated/numpy.isneginf.html#numpy.isneginf
x[numpy.isneginf(x)] = 0 #0 is the value you want to replace with
Note: This solution only applies if you consciously want to keep NaN entries in your dataset.
This error happened to me when I was using some of the scikit-learn functionality (in my case: GridSearchCV). Under the hood I was using an xgboost XGBClassifier, which handles NaN data gracefully. However, GridSearchCV was using the sklearn.utils.validation module, which enforced a lack of missing data in the input by calling the _assert_all_finite function. This was ultimately causing an error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
Sidenote: _assert_all_finite accepts an allow_nan argument, which, if set to True, would not be causing issues. However, scikit-learn API does not allow us to have control over this argument.
Solution
My solution was to use patch module to silence the _assert_all_finite function so that it does not raise ValueError. Here is a snippet
import sklearn
from unittest import mock  # mock.patch is used below, so this import is needed

with mock.patch("sklearn.utils.validation._assert_all_finite"):
    ...  # your code that raises ValueError goes here
This will replace _assert_all_finite with a dummy mock function, so it won't get executed.
Please note that patching is not a recommended practice and might result in unpredictable behaviour!
EDIT:
This Pull Request should resolve the issue (though the fix has not been released as of Jan 2022)
If you're running an estimator, it could be that your learning rate is too high. I passed in the wrong array to a grid search by accident and ended up training with a learning rate of 500, which I could see causing issues with the training process.
Basically it's not necessarily only your inputs that have to all be valid, but the intermediate data as well.
After a long time of dealing with this problem, I realized that it is because, in the splits of the training and testing sets, there are columns of data which are the same for all rows. Some calculations in some algorithms may then lead to infinite results. If your data is such that nearby rows are more likely to be similar, then shuffling the data can help. This is a bug with scikit; I'm using version 0.23.2.
If you happen to use the "kc_house_data.csv" dataset (which some commenters and many data-science newcomers seem to use, because it's presented in lots of popular course material), the data is faulty and is the true source of the error.
To fix it, as of 2022:
Delete the last (empty) line in the csv file
There are two lines that contain one empty data value "x,x,,x,x" - to fix it, don't delete the comma, instead add a random integer value like 2000, so it looks like this "x,x,2000,x,x"
Don't forget to save and reload in your project.
All the other answers are helpful and correct, but not in this case:
If you use kc_house_data.csv you need to fix the data in the file; nothing else will help. The empty data field will shift the other data around and generate weird bugs that are hard to trace to their source!
In my case the algorithm required data to be between (0,1), noninclusive. My quite brutal solution was to add a small random number to all desired values:
y_train = pd.DataFrame(y_train).applymap(lambda x: x + np.random.rand()/100000.0)["col_name"]
y_train[y_train >= 1] = 0.999999
while y_train is in the range of [0,1].
This is definitely not suitable for all cases, as you are messing with your input data, but it can be a solution if you have sparse data and only need a quick forecast.
Try
mat.sum()
If the sum of your data is infinity (greater than the maximum value for your float dtype - about 3.4e+38 for float32 and about 1.8e+308 for float64), you will get that error.
see the _assert_all_finite function in validation.py from the scikit source code:
if is_float and np.isfinite(X.sum()):
    pass
elif is_float:
    msg_err = "Input contains {} or a value too large for {!r}."
    if (allow_nan and np.isinf(X).any() or
            not allow_nan and not np.isfinite(X).all()):
        type_err = 'infinity' if allow_nan else 'NaN, infinity'
        # print(X.sum())
        raise ValueError(msg_err.format(type_err, X.dtype))

Python nested array indexing - unexpected behaviour

Suppose we have an input array with some (but not all) nan values from which we want to write into a nan-initialized output array. After writing non-nan data into the output array there are still nan values and I don't understand at all why:
# minimal example just for testing purposes
import numpy as np
# fix state of seed
np.random.seed(1000)
# create input array and nan-filled output array
a = np.random.rand(6,3,5)
b = np.zeros((6,3,5)) * np.nan
x = [np.arange(6),1,2]
# select data in one dimension with others fixed
y_temp = a[x]
# set arbitrary index to nan
y_temp[1] = np.nan
ind_valid = ~np.isnan(y_temp)
# select non-nan values
y = y_temp[ind_valid]
# write input to output at corresponding indices
b[x][ind_valid] = y
print b[x][ind_valid]
# surprise, surprise :(
# [ nan nan nan nan nan nan]
# workaround (that will of course cost computation time, even if not much)
c = np.zeros(len(y_temp)) * np.nan
c[ind_valid] = y
b[x] = c
print b[x][ind_valid]
# and this is what we want to have
# [ 0.39719446 nan 0.39820488 0.68190824 0.86534558 0.69910395]
I thought the array b would reserve some block in memory and by indexing with x it "knows" those indices. Then it should also know them when selecting only some of them with ind_valid and be able to write into exactly those addresses in memory. No idea, but maybe it's something similar to python nested list unexpected behaviour? Please explain and maybe also provide a nice solution instead of the proposed workaround! Thanks!
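For reference, the behaviour can be checked directly: x triggers advanced (fancy) indexing, and advanced indexing returns a copy rather than a view, so b[x][ind_valid] = y writes into a temporary array. A quick check (using a tuple index, since indexing with a plain list is deprecated in recent numpy):
import numpy as np

b = np.zeros((6, 3, 5))
x = (np.arange(6), 1, 2)

# fancy indexing copies, so chained assignment through b[x][...] never reaches b
print(np.shares_memory(b, b[x]))   # False
# b[x] = c, on the other hand, goes through __setitem__ and modifies b itself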

Unique values from pandas.Series [duplicate]

This question already has answers here:
in operator, float("NaN") and np.nan
(2 answers)
Closed 5 years ago.
Consider the following pandas.Series:
import pandas as pd
import numpy as np
s = pd.Series([np.nan, 1, 1, np.nan])
s
0 NaN
1 1.0
2 1.0
3 NaN
dtype: float64
I want to find only unique values in this particular series using the built-in set function:
unqs = set(s)
unqs
{nan, 1.0, nan}
Why are there duplicate NaNs in the resultant set? Using a similar function (pandas.unique) does not produce this result, so what's the difference, here?
pd.unique(s)
array([ nan, 1.])
As in Java and JavaScript, nan in numpy does not equal itself.
>>> np.nan == np.nan
False
This means when the set constructor checks "do I have an instance of nan in this set yet?" it always returns False.
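You can see the effect of that identity-then-equality check with a tiny illustration (plain float('nan') behaves the same as np.nan here):
a = float('nan')
b = float('nan')
print(a == a)   # False
print({a, a})   # {nan}      - the same object is deduplicated by identity
print({a, b})   # {nan, nan} - two distinct NaN objects both stay in the set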
So… why?
nan in both cases means "value which cannot be represented by 'float'". This means an attempt to convert it to float necessarily fails. It's also unable to be sorted, because there's no way to tell if nan is supposed to be larger or smaller than any number.
After all, which is bigger "cat" or 7? And is "goofy" == "pluto"?
SO… what do I do?
There are a couple of ways to resolve this problem. Personally, I generally try to fill nan before processing: DataFrame.fillna will help with that, and I would always use df.unique() to get a set of unique values.
no_nas = s.dropna().unique()
with_nas = s.unique()
with_replaced_nas = s.fillna(-1).unique() # using a placeholder
(Note: all of the above can be passed into the set constructor.)
What if I don't want to use the Pandas way?
There are reasons not to use Pandas, or to rely on native objects instead of Pandas. These should suffice.
Your other option is to filter and remove the nan.
unqs = set(item for item in s if not np.isnan(item))
You could also replace things inline:
placeholder = '{placeholder}' # There are a variety of placeholder options.
unqs = set(item if not np.isnan(item) else placeholder for item in s)

Erratic NaN behaviour in numpy/pandas

I've been trying to replace missing values in a Pandas dataframe, but without success. I tried the .fillna method and also tried to loop through the entire data set, checking each cell and replacing NaNs with a chosen value. However, in both cases, Python executes the script without throwing up any errors, but the NaN values remain.
When I dug a bit deeper, I discovered behaviour that seems erratic to me, best demonstrated with an example:
In[ ] X['Smokinginpregnancy'].head()
Out[ ]
Index
E09000002 NaN
E09000003 5.216126
E09000004 10.287496
E09000005 3.090379
E09000006 6.080041
Name: Smokinginpregnancy, dtype: float64
I know for a fact that the first item in this column is missing and pandas recognises it as NaN. In fact, if I call this item on its own, python tells me it's NaN:
In [ ] X['Smokinginpregnancy'][0]
Out [ ]
nan
However, when I test whether it's NaN, python returns False.
In [ ] X['Smokinginpregnancy'][0] == np.nan
Out [ ] False
I suspect that when .fillna is being executed, python checks whether the item is NaN but gets back a False, so it continues, leaving the cell alone.
Does anyone know what's going on? Any solutions? (apart from opening the csv file in excel and then manually replacing the values.)
I'm using Anaconda's Python 3 distribution.
You are doing:
X['Smokinginpregnancy'][0] == np.nan
This is guaranteed to return False because all NaNs compare unequal to everything by IEEE754 standard:
>>> x = float('nan')
>>> x == x
False
>>> x == 1
False
>>> x == float('nan')
False
See also here.
You have to use math.isnan to check for NaNs:
>>> math.isnan(x)
True
Or numpy.isnan
So use:
numpy.isnan(X['Smokinginpregnancy'][0])
Regarding pandas.fillna note that this function returns the filled array. Maybe you did something like:
X.fillna(...)
without reassigning X? Alternatively you must pass inplace=True to mutate the dataframe on which you are calling the method.
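A minimal sketch of the difference (the one-column frame here is made up):
import pandas as pd
import numpy as np

X = pd.DataFrame({'Smokinginpregnancy': [np.nan, 5.216126]})
X.fillna(100)                  # returns a new, filled frame; X itself is unchanged
X = X.fillna(100)              # reassign to keep the result
# or equivalently: X.fillna(100, inplace=True)
print(X)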
NaN in pandas can be checked with the function pandas.isnull. I created a boolean mask and returned the subset with NaN values.
The function fillna can be used for one column, Smokinginpregnancy (more info in the doc):
X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
or
X['Smokinginpregnancy'].fillna('100', inplace=True)
Warning:
Sometimes inplace=True can be ignored; it is better not to use it. - link, github, github 3 comments.
All together:
print X['Smokinginpregnancy'].head()
#Index
#E09000002 NaN
#E09000003 5.216126
#E09000004 10.287496
#E09000005 3.090379
#E09000006 6.080041
#check NaN in column Smokinginpregnancy by boolean mask
mask = pd.isnull(X['Smokinginpregnancy'])
XNaN = X[mask]
print XNaN
# Smokinginpregnancy
#Index
#E09000002 NaN
#use function fillna for column Smokinginpregnancy
#X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
X['Smokinginpregnancy'].fillna('100', inplace=True)
print X
# Smokinginpregnancy
#Index
#E09000002 100
#E09000003 5.216126
#E09000004 10.2875
#E09000005 3.090379
#E09000006 6.080041
More information, why comparison doesn't work:
One has to be mindful that in Python (and numpy), NaNs don't compare equal, but Nones do. Note that pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan. More info in Bakuriu's answer.
In [11]: None == None
Out[11]: True
In [12]: np.nan == np.nan
Out[12]: False
