I'm working with individual rows of pandas data frames, but I'm stumbling over coercion issues while indexing and inserting rows. Pandas seems to always want to coerce from a mixed int/float to all-float types, and I can't see any obvious controls on this behaviour.
For example, here is a simple data frame with a as int and b as float:
import pandas as pd
pd.__version__ # '0.25.2'
df = pd.DataFrame({'a': [1], 'b': [2.2]})
print(df)
# a b
# 0 1 2.2
print(df.dtypes)
# a int64
# b float64
# dtype: object
Here is a coercion issue while indexing one row:
print(df.loc[0])
# a 1.0
# b 2.2
# Name: 0, dtype: float64
print(dict(df.loc[0]))
# {'a': 1.0, 'b': 2.2}
And here is a coercion issue while inserting one row:
df.loc[1] = {'a': 5, 'b': 4.4}
print(df)
# a b
# 0 1.0 2.2
# 1 5.0 4.4
print(df.dtypes)
# a float64
# b float64
# dtype: object
In both instances, I want the a column to remain as an integer type, rather than being coerced to a float type.
After some digging, here are some terribly ugly workarounds. (A better answer will be accepted.)
A quirk found here is that non-numeric columns stop coercion, so here is how to index one row to a dict:
dict(df.assign(_='').loc[0].drop('_', axis=0))
# {'a': 1, 'b': 2.2}
And inserting a row can be done by creating a new data frame with one row:
df = df.append(pd.DataFrame({'a': 5, 'b': 4.4}, index=[1]))
print(df)
# a b
# 0 1 2.2
# 1 5 4.4
Neither of these tricks is optimised for large data frames, so I would greatly appreciate a better answer!
Whenever you are getting data from a dataframe or appending data to a dataframe and need to keep the data types the same, avoid conversion to other internal structures which are not aware of the data types needed.
When you do df.loc[0], the row is converted to a pd.Series:
>>> type(df.loc[0])
<class 'pandas.core.series.Series'>
And a Series can only have a single dtype, thus coercing the int to float.
Instead, keep the structure as a pd.DataFrame:
>>> type(df.loc[[0]])
<class 'pandas.core.frame.DataFrame'>
Select the needed row as a frame and then convert it to a dict:
>>> df.loc[[0]].to_dict(orient='records')
[{'a': 1, 'b': 2.2}]
Similarly, to add a new row, use the pd.DataFrame.append function:
>>> df = df.append([{'a': 5, 'b': 4.4}]) # NOTE: To append as a row, use []
a b
0 1 2.2
0 5 4.4
The above will not cause type conversion,
>>> df.dtypes
a int64
b float64
dtype: object
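If a plain per-row dict is the end goal, df.itertuples is another option; it pulls values column by column, so each value keeps its column's type (a minimal sketch, using the same df):
>>> row = next(df.itertuples(index=False))  # a namedtuple built column by column
>>> dict(row._asdict())
{'a': 1, 'b': 2.2}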
The root of the problem is that indexing a row of a pandas dataframe returns a pandas Series.
We can see that:
type(df.loc[0])
# pandas.core.series.Series
And a series can only have one dtype, in your case either int64 or float64.
Two workarounds come to mind:
print(df.loc[[0]])
# this will return a dataframe instead of series
# so the result will be
# a b
# 0 1 2.2
# but the dictionary is hard to read
print(dict(df.loc[[0]]))
# {'a': 0 1
# Name: a, dtype: int64, 'b': 0 2.2
# Name: b, dtype: float64}
or
print(df.astype(object).loc[0])
# this will change the type of value to object first and then print
# so the result will be
# a 1
# b 2.2
# Name: 0, dtype: object
print(dict(df.astype(object).loc[0]))
# in this way the dictionary is as expected
# {'a': 1, 'b': 2.2}
When you append a dictionary to a dataframe, it will convert the dictionary to a Series first and then append. (So the same problem happens again)
https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L6973
if isinstance(other, dict):
other = Series(other)
So your workaround is actually a solid one; alternatively, we could:
df.append(pd.Series({'a': 5, 'b': 4.4}, dtype=object, name=1))
# a b
# 0 1 2.2
# 1 5 4.4
A different approach with slight data manipulations:
Assume you have a list of dictionaries (or dataframes)
lod=[{'a': [1], 'b': [2.2]}, {'a': [5], 'b': [4.4]}]
where each dictionary represents a row (note that each value is a list, even though each dictionary holds a single row). Then you can create a dataframe easily via:
pd.concat([pd.DataFrame(dct) for dct in lod])
a b
0 1 2.2
0 5 4.4
and you maintain the types of the columns. See pd.concat for details.
So if you have a dataframe and a list of dicts, you could just use
pd.concat([df] + [pd.DataFrame(dct) for dct in lod])
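A quick check that the column types survive the concat (a small sketch, using the lod above):
pd.concat([pd.DataFrame(dct) for dct in lod]).dtypes
# a      int64
# b    float64
# dtype: object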
In the first case, you can work with the nullable integer data type. The Series selection doesn't coerce to float and values are placed in an object container. The dictionary is then properly created, with the underlying value stored as a np.int64.
df = pd.DataFrame({'a': [1], 'b': [2.2]})
df['a'] = df['a'].astype('Int64')
d = dict(df.loc[0])
#{'a': 1, 'b': 2.2}
type(d['a'])
#numpy.int64
With your syntax, this almost works for the second case too, but this upcasts to object, so not great:
df.loc[1] = {'a': 5, 'b': 4.4}
# a b
#0 1 2.2
#1 5 4.4
df.dtypes
#a object
#b float64
#dtype: object
However, we can make a small change to the syntax for adding a row at the end (with a RangeIndex) and now types are dealt with properly.
df = pd.DataFrame({'a': [1], 'b': [2.2]})
df['a'] = df['a'].astype('Int64')
df.loc[df.shape[0], :] = [5, 4.4]
# a b
#0 1 2.2
#1 5 4.4
df.dtypes
#a Int64
#b float64
#dtype: object
I am new to stackoverflow.
I noticed this behavior of pandas combine_first() and would simply like to understand why.
When I have the following dataframe,
df = pd.DataFrame({'A':[6,'',7,''], 'B':[1, 3, 5, 3]})
df['A'].combine_first(df['B'])
Out[1]:
0 6
1
2 7
3
Name: A, dtype: object
Whereas initialising with np.nan instead of '' gives the expected behavior of combine_first():
df = pd.DataFrame({'A':[6,np.nan,7,np.nan], 'B':[1, 3, 5, 3]})
df['A'].combine_first(df['B'])
Out[2]:
0 6.0
1 3.0
2 7.0
3 3.0
Name: A, dtype: float64
And replacing the '' with np.nan and then applying combine_first() doesn't seem to work either:
df = pd.DataFrame({'A':[6,'',7,''], 'B':[1, 3, 5, 3]})
df.replace('', np.nan)
df['A'].combine_first(df['B'])
Out[3]:
0 6
1
2 7
3
Name: A, dtype: object
I would like to understand why this happens before using an alternate method for this purpose.
This seems to have been pretty obvious to people here, but thank you for posting the comments!
My mistake was in the 3rd dataframe I posted, as pointed out by @W-B:
df = pd.DataFrame({'A':[6,'',7,''], 'B':[1, 3, 5, 3]})
df = df.replace('', np.nan)
df['A'].combine_first(df['B'])
Also, as @ALollz pointed out, df['A'] contains empty strings '', which are not null values. It sounds simple in hindsight, but I couldn't figure it out earlier!
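To see why, a minimal sketch: combine_first only fills in positions where the calling Series is null, and an empty string does not count as null:
pd.isna('')      # False -- '' is a real value, so combine_first keeps it
pd.isna(np.nan)  # True  -- only these positions get filled from df['B']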
Thank-you!
I have a large dataframe in pandas that apart from the column used as index is supposed to have only numeric values:
df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
'b': [0.1, 0.2, 0.3, 0.4, 0.5],
'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')
How can I find the row of the dataframe df that has a non-numeric value in it?
In this example it's the fourth row in the dataframe, which has the string 'bad' in the a column. How can this row be found programmatically?
You could use np.isreal to check the type of each element (applymap applies a function to each element in the DataFrame):
In [11]: df.applymap(np.isreal)
Out[11]:
a b
item
a True True
b True True
c True True
d False True
e True True
If all in the row are True then they are all numeric:
In [12]: df.applymap(np.isreal).all(1)
Out[12]:
item
a True
b True
c True
d False
e True
dtype: bool
So to get the sub-DataFrame of rogues (note: the negation, ~, of the above finds the rows which have at least one rogue non-numeric value):
In [13]: df[~df.applymap(np.isreal).all(1)]
Out[13]:
a b
item
d bad 0.4
To find the location of the first offender, you could use argmin:
In [14]: np.argmin(df.applymap(np.isreal).all(1))
Out[14]: 'd'
As @CTZhu points out, it may be slightly faster to check whether it's an instance of either int or float (there is some additional overhead with np.isreal):
df.applymap(lambda x: isinstance(x, (int, float)))
There are already some great answers to this question; however, here is a nice snippet that I use regularly to drop rows that have non-numeric values in some columns:
# Eliminate invalid data from dataframe (see Example below for more context)
num_df = (df.drop(data_columns, axis=1)
.join(df[data_columns].apply(pd.to_numeric, errors='coerce')))
num_df = num_df[num_df[data_columns].notnull().all(axis=1)]
The way this works is we first drop all the data_columns from the df, and then use a join to put them back in after passing them through pd.to_numeric (with option 'coerce', such that all non-numeric entries are converted to NaN). The result is saved to num_df.
On the second line we use a filter that keeps only rows where all values are not null.
Note that pd.to_numeric is coercing to NaN everything that cannot be converted to a numeric value, so strings that represent numeric values will not be removed. For example '1.25' will be recognized as the numeric value 1.25.
Disclaimer: pd.to_numeric was introduced in pandas version 0.17.0
Example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"item": ["a", "b", "c", "d", "e"],
...: "a": [1,2,3,"bad",5],
...: "b":[0.1,0.2,0.3,0.4,0.5]})
In [3]: df
Out[3]:
a b item
0 1 0.1 a
1 2 0.2 b
2 3 0.3 c
3 bad 0.4 d
4 5 0.5 e
In [4]: data_columns = ['a', 'b']
In [5]: num_df = (df
...: .drop(data_columns, axis=1)
...: .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))
In [6]: num_df
Out[6]:
item a b
0 a 1 0.1
1 b 2 0.2
2 c 3 0.3
3 d NaN 0.4
4 e 5 0.5
In [7]: num_df[num_df[data_columns].notnull().all(axis=1)]
Out[7]:
item a b
0 a 1 0.1
1 b 2 0.2
2 c 3 0.3
4 e 5 0.5
# Original code
df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
'b': [0.1, 0.2, 0.3, 0.4, 0.5],
'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')
Convert to numeric using errors='coerce', which fills bad values with NaN:
a = pd.to_numeric(df.a, errors='coerce')
Use isna to return a boolean index:
idx = a.isna()
Apply that index to the data frame:
df[idx]
Output: the row with the bad data in it:
a b
item
d bad 0.4
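The same approach also works as a single chained expression (a minimal sketch):
df[pd.to_numeric(df.a, errors='coerce').isna()]
# returns the same single row, item 'd'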
Sorry about the confusion, this should be the correct approach. Do you want to capture only 'bad', not things like 'good', or just any non-numerical values?
In[15]:
np.where(np.any(np.isnan(df.convert_objects(convert_numeric=True)), axis=1))
Out[15]:
(array([3]),)
In case you are working with a column of string values, you can use the very useful function Series.str.isnumeric() like:
a = pd.Series(['hi','hola','2.31','288','312','1312', '0,21', '0.23'])
What I do is copy that column to a new column, do a str.replace('.', '') and str.replace(',', '') to strip the separators, and then select the numeric values:
a = a.str.replace('.', '', regex=False)  # regex=False so '.' is treated literally
a = a.str.replace(',', '', regex=False)
a.str.isnumeric()
Out[15]:
0 False
1 False
2 True
3 True
4 True
5 True
6 True
7 True
dtype: bool
Good luck all!
Just to give an idea: convert the column to string, since strings are easier to work with. However, this does not work with strings that contain digits, like bad123. The ~ takes the complement of the selection.
df['a'] = df['a'].astype(str)
df[~df['a'].str.contains('0|1|2|3|4|5|6|7|8|9')]
df['a'] = df['a'].astype(object)
You can use '|'.join([str(i) for i in range(10)]) to generate the '0|1|...|8|9' pattern.
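As a side note (my own sketch), a regex character class does the same job without listing every digit:
df[~df['a'].astype(str).str.contains(r'\d')]  # rows whose 'a' contains no digit at all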
Or use the np.isreal() function, just like in the most-voted answer:
df[~df['a'].apply(lambda x: np.isreal(x))]
Did you convert your data using .astype()?
All the great answers above should solve 99% of cases, but if you are still in trouble, also check whether you converted your data types.
Sometimes I force the data to float16 to save memory, using:
df[col] = df[col].astype(np.float16)
But this might silently break your code. So if you did any kind of data type transformation, double check for overflows. Disable the conversion and try again.
It worked for me!
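For instance, float16 can only hold values up to 65504, so anything larger silently becomes inf after the cast (a minimal sketch):
import numpy as np
np.float16(70000)           # inf -- overflows the float16 range
np.finfo(np.float16).max    # the largest finite float16 value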
How can I easily perform an operation on a Pandas DataFrame Index? Let's say I create a DataFrame like so:
import numpy as np
from pandas import DataFrame
df = DataFrame(np.random.rand(5, 3), index=[0, 1, 2, 4, 5])
and I want to find the mean sampling rate. The way I do this now doesn't seem quite right.
fs = 1./np.mean(np.diff(df.index.values.astype(float)))
I feel like there must be a better way to do this, but I can't figure it out.
Thanks for any help.
@BrenBarn is correct that it's better to make a column in the frame, but you can do this:
In [2]: df = DataFrame(np.random.rand(5,3), index=[0, 1, 2, 4, 5])
In [3]: df.index.to_series()
Out[3]:
0 0
1 1
2 2
4 4
5 5
dtype: int64
In [4]: s = df.index.to_series()
In [5]: 1./s.diff().mean()
Out[5]: 0.80000000000000004
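And a sketch of the 'column in a frame' route that @BrenBarn suggests (assuming the index holds the sample positions):
df2 = df.reset_index().rename(columns={'index': 't'})  # the index becomes a regular column 't'
fs = 1./df2['t'].diff().mean()  # 0.8, same result as above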