I'm working with individual rows of pandas data frames, but I'm stumbling over coercion issues while indexing and inserting rows. Pandas seems to always want to coerce from a mixed int/float to all-float types, and I can't see any obvious controls on this behaviour.
For example, here is a simple data frame with a as int and b as float:
import pandas as pd
pd.__version__ # '0.25.2'
df = pd.DataFrame({'a': [1], 'b': [2.2]})
print(df)
# a b
# 0 1 2.2
print(df.dtypes)
# a int64
# b float64
# dtype: object
Here is a coercion issue while indexing one row:
print(df.loc[0])
# a 1.0
# b 2.2
# Name: 0, dtype: float64
print(dict(df.loc[0]))
# {'a': 1.0, 'b': 2.2}
And here is a coercion issue while inserting one row:
df.loc[1] = {'a': 5, 'b': 4.4}
print(df)
# a b
# 0 1.0 2.2
# 1 5.0 4.4
print(df.dtypes)
# a float64
# b float64
# dtype: object
In both instances, I want the a column to remain as an integer type, rather than being coerced to a float type.
After some digging, here are some terribly ugly workarounds. (A better answer will be accepted.)
A quirk I found is that the presence of a non-numeric column stops the coercion, so here is how to index one row to a dict:
dict(df.assign(_='').loc[0].drop('_', axis=0))
# {'a': 1, 'b': 2.2}
And inserting a row can be done by creating a new data frame with one row:
df = df.append(pd.DataFrame({'a': 5, 'b': 4.4}, index=[1]))
print(df)
# a b
# 0 1 2.2
# 1 5 4.4
Neither of these tricks is optimised for large data frames, so I would greatly appreciate a better answer!
Whenever you are getting data from a dataframe, or appending data to it, and need to keep the dtypes intact, avoid converting to other internal structures that are not aware of the dtypes you need.
When you do df.loc[0], the row is converted to a pd.Series:
>>> type(df.loc[0])
<class 'pandas.core.series.Series'>
A Series can only have a single dtype, so the int is coerced to float.
Instead, keep the structure as a pd.DataFrame:
>>> type(df.loc[[0]])
<class 'pandas.core.frame.DataFrame'>
Select the row you need as a frame and then convert it to a dict:
>>> df.loc[[0]].to_dict(orient='records')
[{'a': 1, 'b': 2.2}]
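If you need a plain dict rather than a one-element list, just index into the result (a small follow-up sketch):
>>> df.loc[[0]].to_dict(orient='records')[0]
{'a': 1, 'b': 2.2}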
Similarly, to add a new row, use the pd.DataFrame.append function:
>>> df = df.append([{'a': 5, 'b': 4.4}]) # NOTE: To append as a row, use []
a b
0 1 2.2
0 5 4.4
The above will not cause any type conversion:
>>> df.dtypes
a int64
b float64
dtype: object
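Note that DataFrame.append is deprecated in newer pandas (and removed in 2.0); a minimal sketch of the same idea with pd.concat, assuming you build the new row as a one-row frame so the int column keeps its dtype:
>>> new_row = pd.DataFrame([{'a': 5, 'b': 4.4}])
>>> df = pd.concat([df, new_row], ignore_index=True)
>>> df.dtypes
a      int64
b    float64
dtype: object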
The root of the problem is that
The indexing of pandas dataframe returns a pandas series
We can see that:
type(df.loc[0])
# pandas.core.series.Series
And a series can only have one dtype, in your case either int64 or float64.
Two workarounds come to mind:
print(df.loc[[0]])
# this will return a dataframe instead of series
# so the result will be
# a b
# 0 1 2.2
# but the dictionary is hard to read
print(dict(df.loc[[0]]))
# {'a': 0 1
# Name: a, dtype: int64, 'b': 0 2.2
# Name: b, dtype: float64}
or
print(df.astype(object).loc[0])
# this will change the type of value to object first and then print
# so the result will be
# a 1
# b 2.2
# Name: 0, dtype: object
print(dict(df.astype(object).loc[0]))
# in this way the dictionary is as expected
# {'a': 1, 'b': 2.2}
When you append a dictionary to a dataframe, it will convert the dictionary to a Series first and then append. (So the same problem happens again)
https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L6973
if isinstance(other, dict):
other = Series(other)
So your workaround is actually a solid one. Alternatively, we could:
df.append(pd.Series({'a': 5, 'b': 4.4}, dtype=object, name=1))
# a b
# 0 1 2.2
# 1 5 4.4
A different approach with slight data manipulations:
Assume you have a list of dictionaries (or dataframes)
lod=[{'a': [1], 'b': [2.2]}, {'a': [5], 'b': [4.4]}]
where each dictionary represents a row (note that the values are wrapped in lists). Then you can create a dataframe easily via:
pd.concat([pd.DataFrame(dct) for dct in lod])
a b
0 1 2.2
0 5 4.4
and you maintain the types of the columns (see pd.concat).
So if you have a dataframe and a list of dicts, you could just use
pd.concat([df] + [pd.DataFrame(dct) for dct in lod])
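A quick sanity check that the column types survive the concat (a small sketch using the lod from above):
pd.concat([pd.DataFrame(dct) for dct in lod]).dtypes
# a      int64
# b    float64
# dtype: object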
For the first case, you can work with the nullable integer data type (Int64). The Series selection doesn't coerce to float; instead, the values are placed in an object container. The dictionary is then created properly, with the underlying value stored as a np.int64.
df = pd.DataFrame({'a': [1], 'b': [2.2]})
df['a'] = df['a'].astype('Int64')
d = dict(df.loc[0])
#{'a': 1, 'b': 2.2}
type(d['a'])
#numpy.int64
With your syntax, this almost works for the second case too, but this upcasts to object, so not great:
df.loc[1] = {'a': 5, 'b': 4.4}
# a b
#0 1 2.2
#1 5 4.4
df.dtypes
#a object
#b float64
#dtype: object
However, we can make a small change to the syntax for adding a row at the end (with a RangeIndex) and now types are dealt with properly.
df = pd.DataFrame({'a': [1], 'b': [2.2]})
df['a'] = df['a'].astype('Int64')
df.loc[df.shape[0], :] = [5, 4.4]
# a b
#0 1 2.2
#1 5 4.4
df.dtypes
#a Int64
#b float64
#dtype: object
I have a list 'abc' and a dataframe 'df':
abc = ['foo', 'bar']
df =
A B
0 12 NaN
1 23 NaN
I want to insert the list into cell 1B, so I want this result:
A B
0 12 NaN
1 23 ['foo', 'bar']
How can I do that?
1) If I use this:
df.ix[1,'B'] = abc
I get the following error message:
ValueError: Must have equal len keys and value when setting with an iterable
because it tries to insert the list (that has two elements) into a row / column but not into a cell.
2) If I use this:
df.ix[1,'B'] = [abc]
then it inserts a list that has only one element, which is the 'abc' list itself ( [['foo', 'bar']] ).
3) If I use this:
df.ix[1,'B'] = ', '.join(abc)
then it inserts a string ( foo, bar ), not a list.
4) If I use this:
df.ix[1,'B'] = [', '.join(abc)]
then it inserts a list, but it has only one element ( ['foo, bar'] ), not two as I want ( ['foo', 'bar'] ).
Thanks for help!
EDIT
My new dataframe and the old list:
abc = ['foo', 'bar']
df2 =
A B C
0 12 NaN 'bla'
1 23 NaN 'bla bla'
Another dataframe:
df3 =
A B C D
0 12 NaN 'bla' ['item1', 'item2']
1 23 NaN 'bla bla' [11, 12, 13]
I want to insert the 'abc' list into df2.loc[1,'B'] and/or df3.loc[1,'B'].
If the dataframe has columns only with integer values and/or NaN values and/or list values, then inserting a list into a cell works perfectly. If the dataframe has columns only with string values and/or NaN values and/or list values, then inserting a list into a cell also works perfectly. But if the dataframe has columns with both integer and string values, plus other columns, then the error message appears when I use df2.loc[1,'B'] = abc or df3.loc[1,'B'] = abc.
Another dataframe:
df4 =
A B
0 'bla' NaN
1 'bla bla' NaN
These inserts work perfectly: df.loc[1,'B'] = abc or df4.loc[1,'B'] = abc.
set_value has been deprecated since version 0.21.0, so you should now use at. It can insert a list into a cell without raising a ValueError as loc does. I think this is because at always refers to a single value, while loc can refer to values as well as rows and columns.
df = pd.DataFrame(data={'A': [1, 2, 3], 'B': ['x', 'y', 'z']})
df.at[1, 'B'] = ['m', 'n']
df =
A B
0 1 x
1 2 [m, n]
2 3 z
You also need to make sure the column you are inserting into has dtype=object. For example
>>> df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [1,2,3]})
>>> df.dtypes
A int64
B int64
dtype: object
>>> df.at[1, 'B'] = [1, 2, 3]
ValueError: setting an array element with a sequence
>>> df['B'] = df['B'].astype('object')
>>> df.at[1, 'B'] = [1, 2, 3]
>>> df
A B
0 1 1
1 2 [1, 2, 3]
2 3 3
Pandas >= 0.21
set_value has been deprecated. You can now use DataFrame.at to set by label, and DataFrame.iat to set by integer position.
Setting Cell Values with at/iat
# Setup
>>> df = pd.DataFrame({'A': [12, 23], 'B': [['a', 'b'], ['c', 'd']]})
>>> df
A B
0 12 [a, b]
1 23 [c, d]
>>> df.dtypes
A int64
B object
dtype: object
If you want to set a value in second row of the "B" column to some new list, use DataFrame.at:
>>> df.at[1, 'B'] = ['m', 'n']
>>> df
A B
0 12 [a, b]
1 23 [m, n]
You can also set by integer position using DataFrame.iat
>>> df.iat[1, df.columns.get_loc('B')] = ['m', 'n']
>>> df
A B
0 12 [a, b]
1 23 [m, n]
What if I get ValueError: setting an array element with a sequence?
I'll try to reproduce this with:
>>> df
A B
0 12 NaN
1 23 NaN
>>> df.dtypes
A int64
B float64
dtype: object
>>> df.at[1, 'B'] = ['m', 'n']
# ValueError: setting an array element with a sequence.
This is because your column is of float64 dtype, whereas lists are objects, so there is a mismatch. What you have to do in this situation is convert the column to object first.
>>> df['B'] = df['B'].astype(object)
>>> df.dtypes
A int64
B object
dtype: object
Then, it works:
>>> df.at[1, 'B'] = ['m', 'n']
>>> df
A B
0 12 NaN
1 23 [m, n]
Possible, But Hacky
Even more wacky, I've found that you can hack through DataFrame.loc to achieve something similar if you pass nested lists.
>>> df.loc[1, 'B'] = [['m'], ['n'], ['o'], ['p']]
>>> df
A B
0 12 [a, b]
1 23 [m, n, o, p]
You can read more about why this works here.
df3.set_value(1, 'B', abc) works for any dataframe. Take care of the data type of column 'B': for example, a list cannot be inserted into a float column; in that case df['B'] = df['B'].astype(object) can help.
Quick workaround
Simply enclose the list within a new list, as done for col2 in the data frame below. This works because python takes the outer list (of lists) and builds the column as if it held ordinary scalar items, which in our case happen to be lists rather than scalars.
mydict={'col1':[1,2,3],'col2':[[1, 4], [2, 5], [3, 6]]}
data=pd.DataFrame(mydict)
data
col1 col2
0 1 [1, 4]
1 2 [2, 5]
2 3 [3, 6]
I was also getting
ValueError: Must have equal len keys and value when setting with an iterable
Using .at rather than .loc did not make any difference in my case, but enforcing the dtype of the dataframe column did the trick:
df['B'] = df['B'].astype(object)
Then I could set lists, numpy array and all sorts of things as single cell values in my dataframes.
As mentioned in the post pandas: how to store a list in a dataframe?, the dtypes in the dataframe may influence the results, as may whether or not you assign the result back to a dataframe.
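A minimal sketch of that pattern, assuming a column that starts out as float64 NaNs (the frame and values are just for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [12, 23], 'B': [np.nan, np.nan]})
df['B'] = df['B'].astype(object)  # make the column able to hold arbitrary Python objects
df.at[1, 'B'] = ['foo', 'bar']    # now a list fits into a single cell
print(df)
#     A           B
# 0  12         NaN
# 1  23  [foo, bar]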
I've got a solution that's pretty simple to implement.
Make a temporary class just to wrap the list object and later call the value from the class.
Here's a practical example:
Let's say you want to insert a list object into the dataframe.
df = pd.DataFrame([
    {'a': 1},
    {'a': 2},
    {'a': 3},
])

df.loc[:, 'b'] = [
    [1, 2, 4, 2],
    [1, 2],
    [4, 5, 6],
]  # This works, because the list has the same length as the rows of the dataframe

df.loc[:, 'c'] = [1, 2, 4, 5, 3]  # This does not work.
# ValueError: Must have equal len keys and value when setting with an iterable

# To force pandas to have a list as the value in each cell, wrap the list with a temporary class.
class Fake(object):
    def __init__(self, li_obj):
        self.obj = li_obj

df.loc[:, 'c'] = Fake([1, 2, 5, 3, 5, 7])  # This works.
df.c = df.c.apply(lambda x: x.obj)  # Now extract the value from the class. This works.
Creating a fake class to do this might look like a hassle, but it can have some practical applications. For example, you can use this with apply when the return value is a list.
Pandas would normally refuse to insert a list into a cell, but with this method you can force the insert.
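For instance, a rough sketch of that apply use case (the column names and the wrapped lists are purely illustrative):
import pandas as pd

class Fake(object):
    def __init__(self, li_obj):
        self.obj = li_obj

df = pd.DataFrame({'a': [1, 2, 3]})

# DataFrame.apply over rows may try to expand a returned list into columns,
# so wrap it to keep one object per cell, then unwrap afterwards.
df['b'] = df.apply(lambda row: Fake([row['a'], row['a'] * 10]), axis=1)
df['b'] = df['b'].apply(lambda f: f.obj)
print(df)
#    a        b
# 0  1  [1, 10]
# 1  2  [2, 20]
# 2  3  [3, 30]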
I prefer .at and .loc. It is important to note that the target column needs dtype object, which can hold a list.
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'A': [0, 1, 2, 3],
    'B': np.array([np.nan] * 3 + [[3, 33]], dtype=object),
})
print('df to start with:', df, '\ndtypes:', df.dtypes, sep='\n')

df.at[0, 'B'] = [0, 100]        # at assigns a single element
df.loc[1, 'B'] = [[ [1, 11] ]]  # loc expects 2d input
print('df modified:', df, '\ndtypes:', df.dtypes, sep='\n')
output
df to start with:
A B
0 0 NaN
1 1 NaN
2 2 NaN
3 3 [3, 33]
dtypes:
A int64
B object
dtype: object
df modified:
A B
0 0 [0, 100]
1 1 [[1, 11]]
2 2 NaN
3 3 [3, 33]
dtypes:
A int64
B object
dtype: object
First set the cell to blank, then use at to assign the abc list to the cell at 1, 'B':
import numpy as np
import pandas as pd

abc = ['foo', 'bar']
df = pd.DataFrame({'A': [12, 23], 'B': [np.nan, np.nan]})
df.loc[1,'B']=''
df.at[1,'B']=abc
print(df)
Consider the following sequence of operations:
Create a data frame with two columns with the following types int64, float64
Create a new frame by converting all columns to object
Inspect the new data frame
Persist the new data frame
Expect the second column to be persisted as it is shown in the 3rd step, i.e. as a string, not as float64
Illustrated below:
# Step 1
df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})
# Step 2
df2 = df.astype(object)
# Step 3
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 a 4 non-null object
1 b 4 non-null object
dtypes: object(2)
memory usage: 192.0+ bytes
# NOTE notice how column `b` is rendered
df2
a b
0 3 1
1 2 500.43
2 1 256.13
3 0 5
# Step 4
df2.to_csv("/tmp/df2", index=False, sep="\t")
Now let us inspect the generated output:
$ cat df2
a b
3 1.0
2 500.43
1 256.13
0 5.0
Notice how column b is persisted: the decimal places are still present for round numbers even though the datatype is object. Why does this happen? What am I missing here?
I'm using Pandas 1.1.2 with Python 3.7.9.
I think 'object' is a NumPy/pandas dtype, not one of the Python data types.
If you run:
type(df2.iloc[0,1])
before step 4, you will get the 'float' data type even though the column has already been changed to 'object'.
You can use:
df.to_csv("df.csv",float_format='%g', index=False, sep="\t")
instead of casting in step 2.
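For instance, a small sketch of what that gives for the original frame from step 1 (skipping the astype(object) cast; %g drops the trailing .0 on whole numbers):
df.to_csv("/tmp/df", float_format='%g', index=False, sep="\t")
# $ cat /tmp/df
# a    b
# 3    1
# 2    500.43
# 1    256.13
# 0    5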
I am not great with pandas and still learning. I looked at a few solutions and thought: why not apply a conversion to the data before sending it to the CSV file?
Here's what I did to get the values printed as 1 and 5 instead of 1.0 and 5.0
Values in df are a mix of strings, floats and ints:
import pandas as pd
df = pd.DataFrame.from_dict({'a': [3, 2, 1, 's', 't'], 'b': [1, 500.43, 256.13, 5, 'txt']})
df2 = df.astype(object)
def convert(x):
    a = []
    for i in x.to_list():
        a.append(coerce(i))
    return pd.Series(a)
    #return pd.Series([str(int(i)) if int(i) == i else i for i in x.to_list()])

def coerce(y):
    try:
        p = float(y)
        q = int(y)
        if p != q:
            return str(p)
        else:
            return str(q)
    except:
        return str(y)

df2.apply(convert).to_csv("abc.txt", index=False, sep="\t")
Output in the file will be:
a b
3 1
2 500.43
1 256.13
s 5
t txt
All values in df are numeric (integers or floats):
import pandas as pd
df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})
df2 = df.astype(object)
def convert(x):
    return pd.Series([str(int(i)) if int(i) == i else i for i in x.to_list()])

df2.apply(convert).to_csv("abc.txt", index=False, sep="\t")
The output is as follows:
a b
3 1
2 500.43
1 256.13
0 5
Here I am assuming all values in df2 are numeric. If it has a string value, then int(i) will fail.
I have a large dataframe in pandas that apart from the column used as index is supposed to have only numeric values:
df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
'b': [0.1, 0.2, 0.3, 0.4, 0.5],
'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')
How can I find the row of the dataframe df that has a non-numeric value in it?
In this example it's the fourth row in the dataframe, which has the string 'bad' in the a column. How can this row be found programmatically?
You could use np.isreal to check the type of each element (applymap applies a function to each element in the DataFrame):
In [11]: df.applymap(np.isreal)
Out[11]:
a b
item
a True True
b True True
c True True
d False True
e True True
If all in the row are True then they are all numeric:
In [12]: df.applymap(np.isreal).all(1)
Out[12]:
item
a True
b True
c True
d False
e True
dtype: bool
So to get the sub-DataFrame of rogues (note: the negation, ~, of the above finds the rows which have at least one rogue non-numeric value):
In [13]: df[~df.applymap(np.isreal).all(1)]
Out[13]:
a b
item
d bad 0.4
You could also find the location of the first offender using argmin:
In [14]: np.argmin(df.applymap(np.isreal).all(1))
Out[14]: 'd'
As #CTZhu points out, it may be slightly faster to check whether it's an instance of either int or float (there is some additional overhead with np.isreal):
df.applymap(lambda x: isinstance(x, (int, float)))
There are already some great answers to this question, but here is a nice snippet that I use regularly to drop rows if they have non-numeric values in some columns:
# Eliminate invalid data from dataframe (see Example below for more context)
num_df = (df.drop(data_columns, axis=1)
.join(df[data_columns].apply(pd.to_numeric, errors='coerce')))
num_df = num_df[num_df[data_columns].notnull().all(axis=1)]
The way this works is we first drop all the data_columns from the df, and then use a join to put them back in after passing them through pd.to_numeric (with option 'coerce', such that all non-numeric entries are converted to NaN). The result is saved to num_df.
On the second line we use a filter that keeps only rows where all values are not null.
Note that pd.to_numeric is coercing to NaN everything that cannot be converted to a numeric value, so strings that represent numeric values will not be removed. For example '1.25' will be recognized as the numeric value 1.25.
Disclaimer: pd.to_numeric was introduced in pandas version 0.17.0
Example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"item": ["a", "b", "c", "d", "e"],
...: "a": [1,2,3,"bad",5],
...: "b":[0.1,0.2,0.3,0.4,0.5]})
In [3]: df
Out[3]:
a b item
0 1 0.1 a
1 2 0.2 b
2 3 0.3 c
3 bad 0.4 d
4 5 0.5 e
In [4]: data_columns = ['a', 'b']
In [5]: num_df = (df
...: .drop(data_columns, axis=1)
...: .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))
In [6]: num_df
Out[6]:
item a b
0 a 1 0.1
1 b 2 0.2
2 c 3 0.3
3 d NaN 0.4
4 e 5 0.5
In [7]: num_df[num_df[data_columns].notnull().all(axis=1)]
Out[7]:
item a b
0 a 1 0.1
1 b 2 0.2
2 c 3 0.3
4 e 5 0.5
# Original code
df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
'b': [0.1, 0.2, 0.3, 0.4, 0.5],
'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')
Convert to numeric using errors='coerce', which fills bad values with NaN:
a = pd.to_numeric(df.a, errors='coerce')
Use isna to return a boolean index:
idx = a.isna()
Apply that index to the data frame:
df[idx]
output
Returns the row with the bad data in it:
a b
item
d bad 0.4
Sorry about the confusion, this should be the correct approach. Do you want to capture only 'bad', and not things like 'good', or just any non-numerical values?
In[15]:
np.where(np.any(np.isnan(df.convert_objects(convert_numeric=True)), axis=1))
Out[15]:
(array([3]),)
In case you are working with a column of string values, you can use the very useful function series.str.isnumeric(). For example:
a = pd.Series(['hi','hola','2.31','288','312','1312', '0,21', '0.23'])
What I do is copy that column to a new column, apply str.replace('.','') and str.replace(',',''), and then select the numeric values:
a = a.str.replace('.','')
a = a.str.replace(',','')
a.str.isnumeric()
Out[15]:
0 False
1 False
2 True
3 True
4 True
5 True
6 True
7 True
dtype: bool
Good luck all!
Just to give an idea: convert the column to string, since working with strings is easier. However, this does not work with strings containing numbers, like bad123. Note that ~ takes the complement of the selection.
df['a'] = df['a'].astype(str)
df[~df['a'].str.contains('0|1|2|3|4|5|6|7|8|9')]
df['a'] = df['a'].astype(object)
You can use '|'.join([str(i) for i in range(10)]) to generate '0|1|...|8|9'.
Or use the np.isreal() function, as in the most voted answer:
df[~df['a'].apply(lambda x: np.isreal(x))]
Did you convert your data using .astype()?
The great answers above should solve 99% of cases, but if you are still in trouble, please also check whether you converted your data types.
Sometimes I force the data to type float16 to save memory. Using:
df[col] = df[col].astype(np.float16)
But this might silently break your code. So if you did any kind of data type transformation, double check for overflows. Disable the conversion and try again.
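For illustration, a minimal sketch of how such a cast can overflow silently (float16 tops out around 65504):
import numpy as np
import pandas as pd

s = pd.Series([1.5, 70000.0])
print(s.astype(np.float16))  # the second value silently becomes inf
# 0    1.5
# 1    inf
# dtype: float16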
It worked for me!
Does anyone know if there's an equivalent for the Dict.get() method for pandas Series? I've got a series that looks like this:
In [25]: s1
Out[25]:
a 1
b 2
c 3
dtype: int64
And I'd like to return a default value, such as 0, if I try to access an index that isn't there. For example, I'd like s1.ix['z'] to return 0 instead of KeyError. I know pandas has great support for dealing with missing values in other circumstances, but I couldn't find anything specifically about this.
Thank you!
As mentioned in the comments, pandas implements get directly for Series.
So s1.get('x', 0) would return 0
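For instance, a minimal sketch with the series from the question:
>>> s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
>>> s1.get('a', 0)
1
>>> s1.get('z', 0)
0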
I couldn't make sense of the previous answers since they were missing the reference to the dict you use as a source. Here is my answer; it works for me at least.
In[12]: s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'z'])
my_dict = {'a': 1, 'b': 2, 'c': 3}
result = s1.index.map(my_dict).fillna(0)
new_series = pd.Series(result, index=s1.index)
new_series
Out[13]:
a 1.0
b 2.0
c 3.0
z 0.0
dtype: float64