I want to use numpy.where on a pandas DataFrame to check whether a certain string occurs in a column. If the string is present, apply a split function and take the second list element; if not, just take the first character. However, the following code doesn't work. It throws an IndexError: list index out of range, because the first entry contains no underscore:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['a','a_1','b_','b_2_3']})
df["B"] = np.where(df.A.str.contains('_'),df.A.apply(lambda x: x.split('_')[1]),df.A.str[0])
Calling np.where with only the condition returns an array of indices for which the condition holds true, so I was under the impression that the split command would only be applied to that subset of the data:
np.where(df.A.str.contains('_'))
Out[14]: (array([1, 2, 3], dtype=int64),)
But apparently the split-command is used on the entire unfiltered array which seems odd to me as that seems like a potentially big number of unnecessary operations that would slow down the calculation.
I'm not asking for an alternative solution, coming up with that isn't hard.
I'm merely wondering if this is an expected outcome or an issue with either pandas or numpy.
Python isn't a "lazy" language, so code is evaluated immediately. Generators and iterators do introduce some laziness, but that doesn't apply here.
If we split your line of code, we get the following statements:
df.A.str.contains('_')
df.A.apply(lambda x: x.split('_')[1])
df.A.str[0]
Python has to evaluate these statements before it can pass them as arguments to np.where.
To see all this happening, we can rewrite the above as little functions that print some output:
def fn_contains(x):
    print('contains', x)
    return '_' in x
def fn_split(x):
    s = x.split('_')
    print('split', x, s)
    # check for errors here
    if len(s) > 1:
        return s[1]
def fn_first(x):
    print('first', x)
    return x[0]
and then you can run them on your data with:
s = pd.Series(['a','a_1','b_','b_2_3'])
np.where(
s.apply(fn_contains),
s.apply(fn_split),
s.apply(fn_first)
)
and you'll see everything being executed in order. This is basically what's happening "inside" numpy/pandas when you execute things.
In my opinion numpy.where only selects values by condition, so the second and third arrays are computed for all the data, filtered and unfiltered alike.
If you need to apply a function only to the filtered values:
mask = df.A.str.contains('_')
df.loc[mask, "B"] = df.loc[mask, "A"].str.split('_').str[1]
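As a minimal sketch of the mask approach (assuming the same example frame as in the question, and that the unmasked rows should get the first character), the full column could be built like this:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a_1', 'b_', 'b_2_3']})

# Boolean mask of the rows that contain an underscore (plain substring match).
mask = df.A.str.contains('_', regex=False)

# Default for every row: the first character.
df['B'] = df.A.str[0]

# Overwrite only the masked rows, so split is never applied to 'a'.
df.loc[mask, 'B'] = df.loc[mask, 'A'].str.split('_').str[1]

result = df['B'].tolist()
print(result)
```

Here the split only runs on the rows selected by the mask, which is the behaviour the question expected from np.where.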
There is an error in your solution, but the problem is not connected with np.where. If the value contains no _, split returns a one-element list, so selecting the second element with [1] raises an error:
print (df.A.apply(lambda x: x.split('_')))
0 [a]
1 [a, 1]
2 [b, ]
3 [b, 2, 3]
Name: A, dtype: object
print (df.A.apply(lambda x: x.split('_')[1]))
IndexError: list index out of range
So a pure pandas solution is possible here, if performance is not important, because string functions are slow:
df["B"] = np.where(df.A.str.contains('_'),
df.A.str.split('_').str[1],
df.A.str[0])
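The reason this version does not raise can be checked directly. The sketch below (using a standalone Series as a stand-in for df.A) shows that indexing through the .str accessor yields a missing value for lists that are too short, instead of raising IndexError:

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'a_1', 'b_', 'b_2_3'])

# .str[1] returns a missing value where the split list has only one element.
second = s.str.split('_').str[1]
first_is_missing = bool(pd.isna(second.iloc[0]))

# Both branches are still computed for every row; np.where only selects.
out = np.where(s.str.contains('_'), second, s.str[0])
print(first_is_missing, out.tolist())
```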
I am using sklearn and having a problem with the affinity propagation. I have built an input matrix and I keep getting the following error.
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I have run
np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True
I tried using
mat[np.isfinite(mat) == True] = 0
to remove the infinite values but this did not work either.
What can I do to get rid of the infinite values in my matrix, so that I can use the affinity propagation algorithm?
I am using anaconda and python 2.7.9.
This might happen inside scikit, and it depends on what you're doing. I recommend reading the documentation for the functions you're using. You might be using one that requires, e.g., your matrix to be positive definite, while your matrix does not fulfil that criterion.
EDIT: How could I miss that:
np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True
is obviously wrong. Right would be:
np.any(np.isnan(mat))
and
np.all(np.isfinite(mat))
You want to check whether any of the elements are NaN, and not whether the return value of the any function is a number...
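To make the difference concrete, here is a small sketch with a made-up matrix (the values are an assumption for illustration only) that contains both a NaN and an infinity:

```python
import numpy as np

# Hypothetical matrix with one NaN and one infinite entry.
mat = np.array([[1.0, np.nan],
                [np.inf, 4.0]])

# Wrong: mat.any() and mat.all() collapse the matrix to a single bool first.
wrong_nan = bool(np.isnan(mat.any()))        # isnan(True)
wrong_finite = bool(np.isfinite(mat.all()))  # isfinite(True)

# Right: test each element, then reduce.
has_nan = bool(np.any(np.isnan(mat)))
all_finite = bool(np.all(np.isfinite(mat)))

print(wrong_nan, wrong_finite, has_nan, all_finite)
```

The first pair reports a clean matrix even though it is not, which matches what the question observed.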
I got the same error message when using sklearn with pandas. My solution is to reset the index of my dataframe df before running any sklearn code:
df = df.reset_index()
I encountered this issue many times when I removed some entries in my df, such as
df = df[df.label=='desired_one']
This is my function (based on this) to clean the dataset of nan, Inf, and missing cells (for skewed datasets):
import pandas as pd
import numpy as np
def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)
In most cases, getting rid of infinite and null values solves this problem.
Get rid of infinite values:
df.replace([np.inf, -np.inf], np.nan, inplace=True)
Get rid of null values the way you like: a specific value such as 999, the mean, or your own function to impute missing values:
df.fillna(999, inplace=True)
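Put together, the two steps might look like the sketch below. The frame and its column names are made up for illustration, and the column mean is used as the fill value instead of 999:

```python
import numpy as np
import pandas as pd

# Hypothetical data containing NaN, +inf and -inf.
df = pd.DataFrame({'a': [1.0, np.inf, 3.0],
                   'b': [np.nan, 2.0, -np.inf]})

# Step 1: turn infinities into NaN so they are handled like missing values.
cleaned = df.replace([np.inf, -np.inf], np.nan)

# Step 2: impute the missing values, here with each column's mean.
cleaned = cleaned.fillna(cleaned.mean())

print(cleaned)
```

This version reassigns the result rather than relying on inplace=True, which keeps the original frame untouched.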
This is the check on which it fails:
https://github.com/scikit-learn/scikit-learn/blob/0.17.X/sklearn/utils/validation.py#L51
Which says
def _assert_all_finite(X):
    """Like assert_all_finite, but only for ndarray."""
    X = np.asanyarray(X)
    # First try an O(n) time, O(1) space solution for the common case that
    # everything is finite; fall back to O(n) space np.isfinite to prevent
    # false positives from overflow in sum method.
    if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
            and not np.isfinite(X).all()):
        raise ValueError("Input contains NaN, infinity"
                         " or a value too large for %r." % X.dtype)
So make sure there are no NaN values in your input, that all the values are actually floats, and that none of them is Inf.
The Dimensions of my input array were skewed, as my input csv had empty spaces.
With this version of python 3:
/opt/anaconda3/bin/python --version
Python 3.6.0 :: Anaconda 4.3.0 (64-bit)
Looking at the details of the error, I found the lines of code causing the failure:
/opt/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
56 and not np.isfinite(X).all()):
57 raise ValueError("Input contains NaN, infinity"
---> 58 " or a value too large for %r." % X.dtype)
59
60
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
From this, I was able to extract the correct way to test what was going on with my data using the same test which fails given by the error message: np.isfinite(X)
Then with a quick and dirty loop, I was able to find that my data indeed contains nans:
print(p[:,0].shape)
index = 0
for i in p[:,0]:
    if not np.isfinite(i):
        print(index, i)
    index +=1
(367340,)
4454 nan
6940 nan
10868 nan
12753 nan
14855 nan
15678 nan
24954 nan
30251 nan
31108 nan
51455 nan
59055 nan
...
Now all I have to do is remove the values at these indexes.
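The same search can be done without a loop. This is only a sketch on a small made-up array standing in for p; np.flatnonzero lists the offending row indexes and a boolean mask drops them:

```python
import numpy as np

# Hypothetical stand-in for p: rows 1 and 3 have a non-finite first column.
p = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, 5.0],
              [np.inf, 6.0]])

# Indexes of rows whose first column is NaN or infinite.
bad_rows = np.flatnonzero(~np.isfinite(p[:, 0]))

# Keep only rows in which every value is finite.
p_clean = p[np.isfinite(p).all(axis=1)]

print(bad_rows, p_clean.shape)
```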
None of the answers here worked for me. This was what worked.
Test_y = np.nan_to_num(Test_y)
It replaces the infinity values with very large finite values and the NaN values with zero.
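A quick way to see what np.nan_to_num does with its default arguments, using a small made-up array in place of Test_y:

```python
import numpy as np

# Hypothetical stand-in for Test_y.
Test_y = np.array([1.0, np.nan, np.inf, -np.inf])

cleaned = np.nan_to_num(Test_y)

# NaN becomes 0.0; +inf / -inf become the largest / smallest finite float64.
print(cleaned)
```

Whether zero is a sensible replacement for a missing target value depends on the data, so this is best treated as a quick workaround.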
I had the same error, and in my case X and y were dataframes so I had to convert them to matrices first:
# np.float was removed in NumPy 1.24; use np.float64 (or the builtin float)
X = X.values.astype(np.float64)
y = y.values.astype(np.float64)
Edit: The originally suggested X.as_matrix() is Deprecated
The problem seems to occur in the DecisionTreeClassifier input check. Try:
X_train = X_train.replace((np.inf, -np.inf, np.nan), 0).reset_index(drop=True)
I had the error after trying to select a subset of rows:
df = df.reindex(index=my_index)
Turns out that my_index contained values that were not contained in df.index, so the reindex function inserted some new rows and filled them with nan.
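A small sketch of that effect, with a made-up frame and index list. Restricting the labels to those that actually exist (here via Index.intersection) avoids the inserted rows:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=[0, 1, 2])
my_index = [1, 2, 5]  # 5 is not in df.index

# reindex silently adds a row for label 5, filled with NaN.
sub = df.reindex(index=my_index)
n_missing = int(sub['x'].isna().sum())

# Selecting only labels present in df.index adds no rows.
safe = df.loc[df.index.intersection(my_index)]

print(n_missing, len(safe))
```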
Remove all infinite values:
(and replace with min or max for that column)
import numpy as np
# generate example matrix
matrix = np.random.rand(5,5)
matrix[0,:] = np.inf
matrix[2,:] = -np.inf
>>> matrix
array([[ inf, inf, inf, inf, inf],
[0.87362809, 0.28321499, 0.7427659 , 0.37570528, 0.35783064],
[ -inf, -inf, -inf, -inf, -inf],
[0.72877665, 0.06580068, 0.95222639, 0.00833664, 0.68779902],
[0.90272002, 0.37357483, 0.92952479, 0.072105 , 0.20837798]])
# find min and max values for each column, ignoring nan, -inf, and inf
mins = [np.nanmin(matrix[:, i][matrix[:, i] != -np.inf]) for i in range(matrix.shape[1])]
maxs = [np.nanmax(matrix[:, i][matrix[:, i] != np.inf]) for i in range(matrix.shape[1])]
# go through matrix one column at a time and replace + and -infinity
# with the max or min for that column
for i in range(matrix.shape[1]):
    matrix[:, i][matrix[:, i] == -np.inf] = mins[i]
    matrix[:, i][matrix[:, i] == np.inf] = maxs[i]
>>> matrix
array([[0.90272002, 0.37357483, 0.95222639, 0.37570528, 0.68779902],
[0.87362809, 0.28321499, 0.7427659 , 0.37570528, 0.35783064],
[0.72877665, 0.06580068, 0.7427659 , 0.00833664, 0.20837798],
[0.72877665, 0.06580068, 0.95222639, 0.00833664, 0.68779902],
[0.90272002, 0.37357483, 0.92952479, 0.072105 , 0.20837798]])
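The same replacement can also be written without explicit loops. This is a sketch on a smaller made-up matrix, assuming every column has at least one finite value; it relies on broadcasting the per-column minima and maxima across the rows:

```python
import numpy as np

# Hypothetical matrix with infinities in both columns.
matrix = np.array([[np.inf, 1.0],
                   [0.5, -np.inf],
                   [0.2, 3.0]])

# Mask out non-finite entries so they are ignored by nanmin / nanmax.
finite = np.where(np.isfinite(matrix), matrix, np.nan)
col_min = np.nanmin(finite, axis=0)
col_max = np.nanmax(finite, axis=0)

# Replace +inf with the column max and -inf with the column min.
out = np.where(matrix == np.inf, col_max, matrix)
out = np.where(matrix == -np.inf, col_min, out)

print(out)
```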
I found that after calling pct_change on a new column, NaN existed in one of the rows. I removed the NaN row with the following code:
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna()
df = df.reset_index()
I got the same error. It worked with df.fillna(-99999, inplace=True) before doing any replacement, substitution, etc.
I would like to propose a solution for numpy that worked well for me. The lines
import numpy as np
inputArray[inputArray == np.inf] = np.finfo(np.float64).max
substitute all positive-infinite values of a numpy array with the maximum float64 number.
Puff !! In my case the problem was about NaN values...
You can list your columns that had NaN with this function
your_data.isnull().sum()
and then you can fill these NAN values in your dataset file.
Here is the code for how to "Replace NaN with zero and infinity with large finite numbers."
your_data[:] = np.nan_to_num(your_data)
(quoted from the numpy.nan_to_num documentation)
In my case the problem was that many scikit functions return numpy arrays, which are devoid of pandas index. So there was an index mismatch when I used those numpy arrays to build new DataFrames and then I tried to mix them with the original data.
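The mismatch can be reproduced with a plain array standing in for an estimator's output (the doubling below is just a placeholder for whatever the scikit function computes):

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=[10, 11, 12])

# Placeholder for a scikit-learn result: a bare ndarray with no index.
scaled = df[['x']].to_numpy() * 2

# Wrong: the new frame gets a default 0..2 index, so concat cannot align rows.
wrong = pd.DataFrame(scaled, columns=['x2'])
mixed = pd.concat([df, wrong], axis=1)

# Right: reuse the original index when wrapping the array.
right = pd.DataFrame(scaled, columns=['x2'], index=df.index)
aligned = pd.concat([df, right], axis=1)

print(mixed.shape, aligned.shape)
```

Passing index=df.index is one way to keep the rows aligned; calling reset_index(drop=True) on the original frame before combining is another.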
dataset = dataset.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
This worked for me
I had the same issue, in my case the answer was simply that I had a cell in my CSV with no value ("x,y,z,,"). Putting a default value in fixed it for me.
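To locate such empty cells before they reach scikit-learn, something like the following sketch may help. The CSV text is made up and read from a string here; with a real file you would pass its path to pd.read_csv instead:

```python
import io
import pandas as pd

# Hypothetical CSV with one empty cell in the second data row.
csv_text = "a,b,c\n1,2,3\n4,,6\n7,8,9\n"
df = pd.read_csv(io.StringIO(csv_text))

# Rows that contain at least one missing value.
rows_with_gaps = df[df.isna().any(axis=1)]

print(rows_with_gaps)
```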
Using isneginf may help.
http://docs.scipy.org/doc/numpy/reference/generated/numpy.isneginf.html#numpy.isneginf
x[numpy.isneginf(x)] = 0 #0 is the value you want to replace with
Note: This solution only applies if you consciously want to keep NaN entries in your dataset.
This error happened to me when I was using some of the scikit-learn functionality (in my case: GridSearchCV). Under the hood I was using an xgboost XGBClassifier, which handles NaN data gracefully. However, GridSearchCV was using the sklearn.utils.validation module, which enforced the absence of missing data in the input by calling the _assert_all_finite function. This ultimately caused an error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
Sidenote: _assert_all_finite accepts an allow_nan argument, which, if set to True, would not be causing issues. However, scikit-learn API does not allow us to have control over this argument.
Solution
My solution was to use mock.patch to silence the _assert_all_finite function so that it does not raise ValueError. Here is a snippet:
from unittest import mock
import sklearn
with mock.patch("sklearn.utils.validation._assert_all_finite"):
    pass  # your code that raises ValueError goes here
This replaces _assert_all_finite with a dummy mock function, so the real check is not executed.
Please note that patching is not a recommended practice and might result in unpredictable behaviour!
EDIT:
This Pull Request should resolve the issue (though the fix has not been released as of Jan 2022)
If you're running an estimator, it could be that your learning rate is too high. I passed in the wrong array to a grid search by accident and ended up training with a learning rate of 500, which I could see causing issues with the training process.
Basically it's not necessarily only your inputs that have to all be valid, but the intermediate data as well.
After a long time of dealing with this problem, I realized that it happens because, in the splits of the training and testing sets, some columns have the same value in every row. Some calculations in some algorithms may then produce infinite results. If neighbouring rows in your data are likely to be similar, shuffling the data can help. This looks like a bug in scikit to me. I'm using version 0.23.2.
If you happen to use the "kc_house_data.csv" dataset (which some commenters and many data-science newcomers seem to use, because it's presented in lots of popular course material), the data is faulty and the true source for the error.
To fix it, as of 2022:
Delete the last (empty) line in the csv file
There are two lines that contain one empty data value "x,x,,x,x" - to fix it, don't delete the comma, instead add a random integer value like 2000, so it looks like this "x,x,2000,x,x"
Don't forget to save and reload in your project.
All the other answers are helpful and correct, but not in this case:
If you use kc_house_data.csv you need to fix the data in the file, nothing else will help, the empty data field will shift the other data around randomly and generate weird bugs that are hard to track to the source!
In my case the algorithm required the data to lie in the open interval (0, 1). My quite brutal solution was to add a small random number to all desired values:
y_train = pd.DataFrame(y_train).applymap(lambda x: x + np.random.rand()/100000.0)["col_name"]
y_train[y_train >= 1] = 0.999999
This assumes y_train starts out in the range [0, 1].
This is definitely not suitable for all cases, as you are messing with your input data but can be a solution if you have sparse data and only need a quick forecast
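Try
mat.sum()
If the sum of your data overflows to infinity (that is, it exceeds the maximum for the dtype, about 3.402823e+38 for float32 and about 1.797693e+308 for float64), the fast path in the check fails and scikit-learn falls back to testing every element, which is where this error is raised.
See the _assert_all_finite function in validation.py from the scikit source code: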
    if is_float and np.isfinite(X.sum()):
        pass
    elif is_float:
        msg_err = "Input contains {} or a value too large for {!r}."
        if (allow_nan and np.isinf(X).any() or
                not allow_nan and not np.isfinite(X).all()):
            type_err = 'infinity' if allow_nan else 'NaN, infinity'
            # print(X.sum())
            raise ValueError(msg_err.format(type_err, X.dtype))
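The two checks in that excerpt can be tried by hand. In this sketch (made-up values, not scikit-learn's own code) every element is finite, yet the sum overflows, so the sum test alone would not be a reliable indicator:

```python
import numpy as np

# Two large but finite float64 values whose sum exceeds the float64 maximum.
X = np.array([1e308, 1e308], dtype=np.float64)

# Suppress the overflow warning for the demonstration.
with np.errstate(over='ignore'):
    total = X.sum()

sum_is_finite = bool(np.isfinite(total))      # the fast-path test
elements_finite = bool(np.isfinite(X).all())  # the element-wise fallback

print(total, sum_is_finite, elements_finite)
```

Running both tests on your own matrix shows whether the problem is an actual NaN or infinity, or only an overflowing sum.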
I've been trying to replace missing values in a Pandas dataframe, but without success. I tried the .fillna method and also tried to loop through the entire data set, checking each cell and replacing NaNs with a chosen value. However, in both cases, Python executes the script without throwing up any errors, but the NaN values remain.
When I dug a bit deeper, I discovered behaviour that seems erratic to me, best demonstrated with an example:
In[ ] X['Smokinginpregnancy'].head()
Out[ ]
Index
E09000002 NaN
E09000003 5.216126
E09000004 10.287496
E09000005 3.090379
E09000006 6.080041
Name: Smokinginpregnancy, dtype: float64
I know for a fact that the first item in this column is missing and pandas recognises it as NaN. In fact, if I call this item on its own, python tells me it's NaN:
In [ ] X['Smokinginpregnancy'][0]
Out [ ]
nan
However, when I test whether it's NaN, python returns False.
In [ ] X['Smokinginpregnancy'][0] == np.nan
Out [ ] False
I suspect that when .fillna is being executed, python checks whether the item is NaN but gets back a False, so it continues, leaving the cell alone.
Does anyone know what's going on? Any solutions? (apart from opening the csv file in excel and then manually replacing the values.)
I'm using Anaconda's Python 3 distribution.
You are doing:
X['Smokinginpregnancy'][0] == np.nan
This is guaranteed to return False, because under the IEEE 754 standard a NaN compares unequal to everything, including itself:
>>> x = float('nan')
>>> x == x
False
>>> x == 1
False
>>> x == float('nan')
False
See also here.
You have to use math.isnan to check for NaNs:
>>> import math
>>> math.isnan(x)
True
Or numpy.isnan
So use:
numpy.isnan(X['Smokinginpregnancy'][0])
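As a short sketch of the options (using a bare np.nan in place of the value pulled from the DataFrame), all three checks agree, while the equality test does not:

```python
import math
import numpy as np
import pandas as pd

value = np.nan  # stand-in for X['Smokinginpregnancy'][0]

equal_to_nan = bool(value == np.nan)  # always False for NaN

checks = (
    math.isnan(value),      # standard library
    bool(np.isnan(value)),  # numpy, also works element-wise on arrays
    bool(pd.isna(value)),   # pandas, also treats None as missing
)

print(equal_to_nan, checks)
```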
Regarding fillna, note that this method returns the filled object rather than modifying the original. Maybe you did something like:
X.fillna(...)
without reassigning X? Alternatively you must pass inplace=True to mutate the dataframe on which you are calling the method.
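The difference is easy to reproduce on a small made-up frame (only the column name is taken from the question; the fill value 0 is arbitrary):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({'Smokinginpregnancy': [np.nan, 5.216126, 10.287496]})

# The result is discarded, so X itself is unchanged.
X.fillna(0)
still_missing = int(X['Smokinginpregnancy'].isna().sum())

# Reassigning keeps the filled copy.
X = X.fillna(0)
now_missing = int(X['Smokinginpregnancy'].isna().sum())

print(still_missing, now_missing)
```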
NaN values in pandas can be checked with the function pandas.isnull. I created a boolean mask and returned the subset with NaN values.
The fillna function can be used for the single column Smokinginpregnancy (more info in the docs):
X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
or
X['Smokinginpregnancy'].fillna('100', inplace=True)
Warning:
Sometimes inplace=True can be ignored, so it is better not to use it.
All together:
print(X['Smokinginpregnancy'].head())
#Index
#E09000002 NaN
#E09000003 5.216126
#E09000004 10.287496
#E09000005 3.090379
#E09000006 6.080041
#check NaN in column Smokinginpregnancy by boolean mask
mask = pd.isnull(X['Smokinginpregnancy'])
XNaN = X[mask]
print(XNaN)
# Smokinginpregnancy
#Index
#E09000002 NaN
#use function fillna for column Smokinginpregnancy
#X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
X['Smokinginpregnancy'].fillna('100', inplace=True)
print(X)
# Smokinginpregnancy
#Index
#E09000002 100
#E09000003 5.216126
#E09000004 10.2875
#E09000005 3.090379
#E09000006 6.080041
More information, why comparison doesn't work:
One has to be mindful that in python (and numpy), the nan's don’t compare equal, but None's do. Note that Pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan. More info in Bakuriu's answer.
In [11]: None == None
Out[11]: True
In [12]: np.nan == np.nan
Out[12]: False
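A small sketch of how pandas treats None and np.nan alike in a numeric Series (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# None is converted to NaN when the Series is built as float64.
s = pd.Series([1.0, None, np.nan])

flags = s.isna().tolist()
dtype_name = str(s.dtype)

print(flags, dtype_name)
```

So isna (or its alias isnull) is the reliable test, regardless of whether the missing value started out as None or NaN.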