Remove row values containing non-numeric values in pandas dataframe [duplicate] - python

I have a large dataframe in pandas that apart from the column used as index is supposed to have only numeric values:
df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')
How can I find the row of the dataframe df that has a non-numeric value in it?
In this example it's the fourth row in the dataframe, which has the string 'bad' in the a column. How can this row be found programmatically?

You could use np.isreal to check the type of each element (applymap applies a function to each element in the DataFrame):
In [11]: df.applymap(np.isreal)
Out[11]:
a b
item
a True True
b True True
c True True
d False True
e True True
If all in the row are True then they are all numeric:
In [12]: df.applymap(np.isreal).all(1)
Out[12]:
item
a True
b True
c True
d False
e True
dtype: bool
So to get the sub-DataFrame of rogues (note: the negation, ~, of the above finds the rows which have at least one rogue non-numeric value):
In [13]: df[~df.applymap(np.isreal).all(1)]
Out[13]:
a b
item
d bad 0.4
To find the location of the first offender, you could use argmin:
In [14]: np.argmin(df.applymap(np.isreal).all(1))
Out[14]: 'd'
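In newer pandas, np.argmin on a Series returns a positional integer rather than the label, so here is a minimal sketch of the label-based lookup using Series.idxmin (an assumption about your version, not part of the original answer):
# idxmin returns the index label of the first False (False sorts before True),
# i.e. the first row containing a non-numeric value.
first_bad = df.applymap(np.isreal).all(axis=1).idxmin()
print(first_bad)  # 'd' (if every row were numeric, this would just be the first label)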
As @CTZhu points out, it may be slightly faster to check whether each element is an instance of either int or float (there is some additional overhead with np.isreal):
df.applymap(lambda x: isinstance(x, (int, float)))
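For completeness, a small sketch combining the isinstance check with the same row filter as above:
# Rows with at least one value that is neither an int nor a float.
rogues = df[~df.applymap(lambda x: isinstance(x, (int, float))).all(axis=1)]
print(rogues)  # the row 'd' containing the string 'bad'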

There are already some great answers to this question; however, here is a snippet I use regularly to drop rows that have non-numeric values in certain columns:
# Eliminate invalid data from dataframe (see Example below for more context)
num_df = (df.drop(data_columns, axis=1)
            .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))
num_df = num_df[num_df[data_columns].notnull().all(axis=1)]
The way this works: we first drop all the data_columns from df, then use a join to put them back in after passing them through pd.to_numeric (with errors='coerce', so that all non-numeric entries are converted to NaN). The result is saved to num_df.
On the second line we use a filter that keeps only rows where all values are not null.
Note that pd.to_numeric is coercing to NaN everything that cannot be converted to a numeric value, so strings that represent numeric values will not be removed. For example '1.25' will be recognized as the numeric value 1.25.
Disclaimer: pd.to_numeric was introduced in pandas version 0.17.0
Example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"item": ["a", "b", "c", "d", "e"],
...: "a": [1,2,3,"bad",5],
...: "b":[0.1,0.2,0.3,0.4,0.5]})
In [3]: df
Out[3]:
a b item
0 1 0.1 a
1 2 0.2 b
2 3 0.3 c
3 bad 0.4 d
4 5 0.5 e
In [4]: data_columns = ['a', 'b']
In [5]: num_df = (df
...: .drop(data_columns, axis=1)
...: .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))
In [6]: num_df
Out[6]:
item a b
0 a 1 0.1
1 b 2 0.2
2 c 3 0.3
3 d NaN 0.4
4 e 5 0.5
In [7]: num_df[num_df[data_columns].notnull().all(axis=1)]
Out[7]:
item a b
0 a 1 0.1
1 b 2 0.2
2 c 3 0.3
4 e 5 0.5

# Original code
df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')
Convert to numeric using errors='coerce', which fills bad values with NaN:
a = pd.to_numeric(df.a, errors='coerce')
Use isna to return a boolean index:
idx = a.isna()
Apply that index to the data frame:
df[idx]
Output: the row with the bad data in it:
a b
item
d bad 0.4
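Since the question is about removing such rows, the complementary step is a short sketch away: negate the boolean index to keep only the clean rows.
# Keep only the rows whose 'a' value converted cleanly to a number.
clean = df[~idx]
print(clean)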

Sorry about the confusion; this should be the correct approach. Do you want to capture only 'bad' (and not things like 'good'), or any non-numeric value at all?
In[15]:
np.where(np.any(np.isnan(df.convert_objects(convert_numeric=True)), axis=1))
Out[15]:
(array([3]),)
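Note that convert_objects has since been removed from pandas; a minimal sketch of the same idea with current versions, coercing every column with pd.to_numeric:
# Coerce each column to numeric and locate the positional rows that picked up a NaN.
coerced = df.apply(pd.to_numeric, errors='coerce')
print(np.where(coerced.isna().any(axis=1)))  # (array([3]),)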

If you are working with a column of string values, you can use the very useful Series.str.isnumeric() method, like:
a = pd.Series(['hi','hola','2.31','288','312','1312', '0,21', '0.23'])
What I do is copy that column to a new column, apply str.replace('.', '') and str.replace(',', ''), and then select the numeric values:
a = a.str.replace('.', '', regex=False)
a = a.str.replace(',', '', regex=False)
a.str.isnumeric()
Out[15]:
0 False
1 False
2 True
3 True
4 True
5 True
6 True
7 True
dtype: bool
Good luck all!
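A minimal sketch applying the same idea to the question's DataFrame: treat column 'a' as strings, strip '.' and ',', and keep the rows that are not purely numeric (caveat: a leading '-' would also be flagged).
s = (df['a'].astype(str)
            .str.replace('.', '', regex=False)
            .str.replace(',', '', regex=False))
print(df[~s.str.isnumeric()])  # the row 'd'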

Just to give an idea: convert the column to string, since strings are easier to work with. However, this does not catch strings that contain digits, like 'bad123'. The ~ takes the complement of the selection.
df['a'] = df['a'].astype(str)
df[~df['a'].str.contains('0|1|2|3|4|5|6|7|8|9')]
df['a'] = df['a'].astype(object)
You can also use '|'.join([str(i) for i in range(10)]) to generate '0|1|...|8|9'.
Or use the np.isreal() function, as in the top-voted answer:
df[~df['a'].apply(lambda x: np.isreal(x))]
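Another option along the same lines, as a sketch: use a full-string regex instead of the digit alternation, so strings like 'bad123' are caught as well.
# Flag rows whose 'a' does not look like a plain int or float literal.
pattern = r'^-?\d+(\.\d+)?$'
print(df[~df['a'].astype(str).str.match(pattern)])  # the row 'd'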

Did you convert your data using .astype()?
The great answers above should solve 99% of cases, but if you are still in trouble, also check whether you converted your data types.
Sometimes I force the data to type float16 to save memory. Using:
df[col] = df[col].astype(np.float16)
But this might silently break your code. So if you did any kind of data type transformation, double check for overflows. Disable the conversion and try again.
It worked for me!
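A minimal sketch of the overflow check suggested above (col is the column name from the snippet; only downcast when every value fits in float16's range):
limit = np.finfo(np.float16).max  # about 65504
if df[col].abs().max() > limit:
    print(f"column {col!r} would overflow float16; keeping the wider dtype")
else:
    df[col] = df[col].astype(np.float16)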

Related

How to conditionally change pandas DataFrame values into f-strings?

I have a pandas DataFrame whose values I want to conditionally change into strings without looping over every value.
Example input:
In [1]: df = pd.DataFrame(data = [[1,2], [4,5]], columns = ['a', 'b'])
Out[2]:
a b
0 1 2
1 4 5
This is my best attempt, which doesn't work properly:
df['a'] = np.where(df['a'] < 3, f'string-{df["a"]}', df['a'])
In [1]: df
Out[2]:
a b
0 string0 1\n1 4\nName: a, dtype: int64 2
1 4 5
Desired output:
Out[2]:
a b
0 string-1 2
1 4 5
I am using np.where() since looping is not feasible due to the size of the actual DataFrame. The actual f-string I am using is also more complex and has two variables that include column names, but the problem is the same.
Are there other ways to conditionally change pandas values into f-strings without looping over each value?
You can use .map() together with f-string, as follows:
df['a'] = df['a'].map(lambda x: f'string-{x}' if x < 3 else x)
Alternatively, you can also use .loc together with string concatenation, as follows:
df.loc[df['a'] < 3, 'a'] = 'string-' + df['a'].astype(str)
#OR
df['a']=np.where(df['a'] < 3, 'string-'+df['a'].astype(str), df['a'])
Result:
print(df)
a b
0 string-1 2
1 4 5
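For the case mentioned in the question where the f-string pulls from two columns, here is a vectorized sketch (the second column 'b' is just the example's; adapt to your own):
# Build the label strings for every row in one go, then apply them only where
# the condition holds; mask leaves the other rows untouched.
labels = 'string-' + df['a'].astype(str) + '-' + df['b'].astype(str)
df['a'] = df['a'].mask(df['a'] < 3, labels)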

Fill values in pandas column with condition involving 2 other columns

I am trying to fill the 'C' column in such a way that when the value in 'A' is not NaN, 'C' takes the value from 'B'; otherwise the value in 'C' remains unchanged.
Here's the code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['greek', 'indian', np.nan, np.nan, 'australian'],
                   'B': np.random.random(5)})
df['C'] = np.nan
df
I tried df.C = df.B.where(df.A != np.nan, np.nan), but it isn't working, as the condition involves another column, I think; a for loop isn't yielding the desired result either.
How can I get there in as few lines of code as possible?
The problem is not with where; the problem is that you are comparing the value directly against np.nan using !=:
>>> np.nan == np.nan
False
So, use a function/method that allows you to check if the value is nan or not:
>>> df.C = df.B.where(df.A.notna(), np.nan)
A B C
0 greek 0.030809 0.030809
1 indian 0.545261 0.545261
2 NaN 0.470802 NaN
3 NaN 0.716640 NaN
4 australian 0.148297 0.148297
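An equivalent one-liner with np.where, as a sketch, which keeps the existing value of 'C' where 'A' is NaN:
df['C'] = np.where(df['A'].notna(), df['B'], df['C'])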

How do I keep original pandas DataFrame values while using df.astype()? I need to raise a ValueError for the example below

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd', 'e'],
    'B': [1, 2.5, 3, 4, 5],
    'C': ['abc', 'def', 'ghi', 'jkl', 'mno']})
col_type = {'A':str, 'B':int, 'C':str}
df = df.astype(col_type)
df
Output is:
A B C
0 a 1 abc
1 b 2 def
2 c 3 ghi
3 d 4 jkl
4 e 5 mno
But I want to raise a ValueError at index 1 for column B. I don't need the integer value; I want to do it automatically (e.g. loop through all columns).
Pandas' built-in .astype() doesn't appear to have a 'safe casting' option like the one you want.
In numpy you can use
np.ndarray.astype(preferred_type, casting='safe')
So unfortunately I don't have a pretty solution for you, but I would do something like:
coltypes = [str, int, str]
colnames = ['a', 'b', 'c']
# Cast column by column so numpy's 'safe' casting rule can object to lossy conversions
data_for_df = {name: df.values[:, i].astype(coltypes[i], casting='safe')
               for i, name in enumerate(colnames)}
df = pd.DataFrame(data_for_df)
Someone might be able to give a better answer than me :)
If you want to check that columns of floating-point values contain only integers, you can simply examine the difference between the original column(s) and the same one(s) after int conversion:
(df['B'] - df['B'].astype('int')) == 0
gives the following Series:
0 True
1 False
2 True
3 True
4 True
Name: B, dtype: bool
From there, you can raise an exception
tmp = (df['B'] - df['B'].astype('int')) == 0
if not tmp.all():
    raise TypeError("Non int value at " +
                    ', '.join(df[(df['B'] - df['B'].astype('int')) != 0].index.astype(str)))
With the sample data, it gives as expected:
TypeError: Non int value at 1
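To do this automatically for every column, as the question asks, here is a minimal sketch built on the same idea and the col_type mapping from the question:
# Raise if an int-typed column would lose information on conversion.
for col, typ in col_type.items():
    if typ is int:
        bad = df.index[df[col] != df[col].astype(int)]
        if len(bad):
            raise ValueError(f"Non-int value in column {col!r} at index "
                             + ", ".join(bad.astype(str)))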

pandas DataFrame set value on boolean mask

I'm trying to set a number of different values in a pandas DataFrame all to the same value. I thought I understood boolean indexing for pandas, but I haven't found any resources on this specific error.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df[mask] = 30
Traceback (most recent call last):
...
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
Above, I want to replace all of the True entries in the mask with the value 30.
I could do df.replace instead, but masking feels a bit more efficient and intuitive here. Can someone explain the error, and provide an efficient way to set all of the values?
You can't use a boolean mask on mixed dtypes for this, unfortunately; you can use pandas where to set the values:
In [59]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df = df.where(mask, other=30)
df
Out[59]:
A B
0 1 a
1 30 30
2 3 30
Note that the above will fail if you pass inplace=True to the where method, so df.where(mask, other=30, inplace=True) will raise:
TypeError: Cannot do inplace boolean setting on mixed-types with a non
np.nan value
EDIT
OK, after a little misunderstanding, you can still use where by just inverting the mask:
In [2]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df.where(~mask, other=30)
Out[2]:
A B
0 30 30
1 2 b
2 30 f
If you want to use different columns to create your mask, you need to call the values property of the dataframe.
Example
Let's say we want to replace values in A_1 and A_2 according to a mask over B_1 and B_2: for example, replace those values in A (with 999) that correspond to nulls in B.
The original dataframe:
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 5 n NaN
2 3 6 NaN NaN
The desired dataframe
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 999 n NaN
2 999 999 NaN NaN
The code:
df = pd.DataFrame({
    'A_1': [1, 2, 3],
    'A_2': [4, 5, 6],
    'B_1': ['y', 'n', np.nan],
    'B_2': ['n', np.nan, np.nan]})
_mask = df[['B_1', 'B_2']].notnull().values
df[['A_1', 'A_2']] = df[['A_1','A_2']].where(_mask, other=999)
A_1 A_2
0 1 4
1 2 999
2 999 999
I'm not 100% sure but I suspect the error message relates to the fact that there is not identical treatment of missing data across different dtypes. Only float has NaN, but integers can be automatically converted to floats so it's not a problem there. But it appears mixing number dtypes and object dtypes does not work so easily...
Regardless of that, you could get around it pretty easily with np.where:
df[:] = np.where( mask, 30, df )
A B
0 30 30
1 2 b
2 30 f
pandas uses NaN to mark invalid or missing data, and NaN can be used across types; since your DataFrame mixes int and string dtypes, it will not accept the assignment of a single value (other than NaN) through an in-place boolean setting, as this would create a mixed type (int and str) in B.
@JohnE's method using np.where creates a new DataFrame in which the dtype of column B is object, not string as in the initial example.
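As another option, DataFrame.mask is the inverse of where: it sets values where the mask is True, so there is no need to invert by hand. A minimal sketch:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
print(df.mask(mask, other=30))
#     A   B
# 0  30  30
# 1   2   b
# 2  30   f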

