Pandas: Coerce errors while reading CSV - python

The pandas.to_datetime function has an errors keyword argument that, if set to 'coerce', will replace any values it fails to cast with NaT.
Is there a way to replicate that functionality in pandas.read_csv while it's casting the columns?
For example, if I have the following data in a CSV file:
a,c
0,a
1,b
2,c
a,d
And I try:
pd.read_csv("file.csv", dtype={"a":"int64", "c":'object'})
It throws an error saying that it was unable to convert column a to type int64.
Is there a way to read a CSV with pandas so that, if casting a column fails, the failed values are filled with NaN or something else that I specify?

Here is a solution that might work for you, or at least get you going in the right direction.
Caveat:
AFAIK, what you're after is not possible: an int64 column cannot hold a NaN value, because NaN is a float data type. Additionally, there is no need to convert column c to object, as this is implied.
Suggested Solution:
First, read your CSV without casting data types. Then, clean your data / convert your data types.
import numpy as np
import pandas as pd
# Just pretend this is reading from a CSV.
data = {'a': [0, 1, 2, 'a'],
        'c': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)
Original Dataset:
   a  c
0  0  a
1  1  b
2  2  c
3  a  d

a    object
c    object
dtype: object
Convert column a:
Using the pd.to_numeric function, you can do something similar to to_datetime by coercing any errors to NaN. However, this converts your column to float64, as NaN is a float data type.
df['a'] = pd.to_numeric(df['a'], errors='coerce')
Output:
     a  c
0  0.0  a
1  1.0  b
2  2.0  c
3  NaN  d

a    float64
c     object
dtype: object
Convert column a to int64:
If you must have column a as an integer, you can do this:
df['a'] = df['a'].replace(np.nan, 0).astype(np.int64)
Output:
   a  c
0  0  a
1  1  b
2  2  c
3  0  d

a     int64
c    object
dtype: object
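Putting it all together, the whole read-then-clean pipeline for the original file might look like this (a sketch, assuming the file name "file.csv" from the question and 0 as the fill value):
import numpy as np
import pandas as pd

df = pd.read_csv("file.csv")  # read without casting; column a is inferred as object
df['a'] = pd.to_numeric(df['a'], errors='coerce')  # failed casts become NaN (float64)
df['a'] = df['a'].fillna(0).astype(np.int64)  # optional: fill NaN and cast back to int64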
Hope this gets you started.

Here's another solution that does it at read time. You can pass a manual conversion function to the CSV reader as pd.read_csv(..., converters=...).
For your case, you should pass converters={'a': convert_to_none_coerce_if_not}, where convert_to_none_coerce_if_not can be:
import numpy as np

def convert_to_none_coerce_if_not(val: str):
    try:
        parsed = float(val)
        if parsed == int(parsed):
            # string parses as an integer
            return np.int16(parsed)
        else:
            # string is numeric, but a float
            return np.nan
    except ValueError:
        # string cannot be parsed as a number, return nan
        return np.nan
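A minimal usage sketch (file name assumed from the original question):
import pandas as pd

df = pd.read_csv("file.csv", converters={'a': convert_to_none_coerce_if_not})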

Related

Why fillna does not work on float values?

I am trying to replace every empty cell of a dataset with the mean of its column.
I use modifiedData = data.fillna(data.mean())
but it only works on integer-type columns.
I also have a column with float values, and fillna does not work on it.
Why?
.fillna() works on values that are NaN. The concept of NaN can't exist in an int column; the pandas int dtype does not support NaN.
If you have a column with what seems to be integers, it is more likely an object column, perhaps even filled with strings, some of which are empty.
Empty strings are not filled by .fillna():
In [8]: pd.Series(["2", "1", ""]).fillna(0)
Out[8]:
0    2
1    1
2
dtype: object
An easy way to figure out what's going on is to use the df.Column.isna() method.
If that method gives you all False, you know there are no NaN values to fill.
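For example, checking the same series as above (a small sketch; the In [9] numbering is illustrative):
In [9]: pd.Series(["2", "1", ""]).isna()
Out[9]:
0    False
1    False
2    False
dtype: bool
The empty string shows up as False, confirming there is nothing for .fillna() to fill.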
To turn empty strings into NaN values:
In [11]: s = pd.Series(["2", "1", ""])
In [12]: empty_string_mask = s.str.len() == 0
In [21]: s.loc[empty_string_mask] = float('nan')
In [22]: s
Out[22]:
0      2
1      1
2    NaN
dtype: object
After that, you can fillna:
In [23]: s.fillna(0)
Out[23]:
0    2
1    1
2    0
dtype: object
Another way of going about this problem is to check the dtype:
df.column.dtype
If it says 'object', that confirms your issue.
You can then cast the column to a float column:
df.column = df.column.astype(float)
Though manipulating dtypes in pandas usually leads to pain, this may be an easier route to take for this particular problem.
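Note that .astype(float) will raise a ValueError on empty strings, so in practice you may need to combine it with the masking step above (a sketch):
s = pd.Series(["2", "1", ""])
s = s.replace("", float('nan')).astype(float)  # empty strings -> NaN, then cast
s = s.fillna(0)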

Adding an array to pandas dataframe

I have a dataframe, and I want to create a new column and add arrays to each row of this new column. I know that to do this I have to change the datatype of the column to 'object'. I tried the following, but it doesn't work:
import pandas
import numpy as np
df = pandas.DataFrame({'a':[1,2,3,4]})
df['b'] = np.nan
df['b'] = df['b'].astype(object)
df.loc[0,'b'] = [[1,2,4,5]]
The error is
ValueError: Must have equal len keys and value when setting with an ndarray
However, it works if I convert the datatype of the whole dataframe into 'object':
df = pandas.DataFrame({'a':[1,2,3,4]})
df['b'] = np.nan
df = df.astype(object)
df.loc[0,'b'] = [[1,2,4,5]]
So my question is: why do I have to change the datatype of the whole DataFrame?
try this:
In [12]: df.at[0,'b'] = [1,2,4,5]
In [13]: df
Out[13]:
   a             b
0  1  [1, 2, 4, 5]
1  2           NaN
2  3           NaN
3  4           NaN
PS: be aware that as soon as you put a non-scalar value in any cell, the corresponding column's dtype will be changed to object in order to be able to contain non-scalar values:
In [14]: df.dtypes
Out[14]:
a     int64
b    object
dtype: object
PPS: generally, it's a bad idea to store non-scalar values in cells, because the vast majority of Pandas/NumPy methods will not work properly with such data.
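For instance, a vectorized operation on such a column will typically fail (a small illustration using the df from above; the In [15] numbering is illustrative):
In [15]: df['b'] + 1  # raises a TypeError, since the list in row 0 does not support scalar addition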

Pandas rounds number to 0

I'm trying to assign a value to a cell, yet Pandas rounds it to zero. (I'm using Python 3.6)
in: df['column1']['row1'] = 1 / 331616
in: print(df['column1']['row1'])
out: 0
But if I try to assign this value to a standard Python dictionary key, it works fine.
in: {'column1': {'row1': 1/331616}}
out: {'column1': {'row1': 3.0155360416867704e-06}}
I've already done this, but it didn't help:
pd.set_option('precision',50)
pd.set_option('chop_threshold', .00000000005)
Please, help.
pandas appears to be presuming that your datatype is an integer (int).
There are several ways to address this, either by setting the datatype to a float when the DataFrame is constructed OR by changing (or casting) the datatype (also referred to as a dtype) to a float on the fly.
setting the datatype (dtype) during construction:
>>> import pandas as pd
In making this simple DataFrame, we provide a single example value (1), and the columns are defined as containing floats during creation:
>>> df = pd.DataFrame([[1]], columns=['column1'], index=['row1'], dtype=float)
>>> df['column1']['row1'] = 1 / 331616
>>> df
       column1
row1  0.000003
converting the datatype on the fly:
>>> df = pd.DataFrame([[1]], columns=['column1'], index=['row1'], dtype=int)
>>> df['column1'] = df['column1'].astype(float)
>>> df['column1']['row1'] = 1 / 331616
>>> df
       column1
row1  0.000003
Your column's datatype is most likely set to int. You'll need to either convert it to float or to the mixed-type object dtype before assigning the value:
df = pd.DataFrame([1,2,3,4,5,6])
df.dtypes
# 0 int64
# dtype: object
df[0][4] = 7/125
df
#    0
# 0  1
# 1  2
# 2  3
# 3  4
# 4  0
# 5  6
df[0] = df[0].astype('O')
df[0][4] = 7 / 22
df
#           0
# 0         1
# 1         2
# 2         3
# 3         4
# 4  0.318182
# 5         6
df.dtypes
# 0 object
# dtype: object

How can I check the dtype of the contents of a column in python pandas?

This question is related to how to check the dtype of a column in python pandas.
An empty pandas dataframe is created. Following this, it's filled with data. How can I then check if any of its columns contain complex types?
import numpy as np
import pandas as pd

index = [np.array(['foo', 'qux'])]
columns = ["A", "B"]
df = pd.DataFrame(index=index, columns=columns)
df.loc['foo']["A"] = 1 + 1j
df.loc['foo']["B"] = 1
df.loc['qux']["A"] = 2
df.loc['qux']["B"] = 2
print(df)
for type in df.dtypes:
    if type == complex:
        print(type)
At the moment, I get the type as object which isn't useful.
          A  B
foo  (1+1j)  1
qux       2  2
Consider the series s
s = pd.Series([1, 3.4, 2 + 1j], dtype=object)
s
0         1
1       3.4
2    (2+1j)
dtype: object
If I use pd.to_numeric, it will upcast the dtype to complex if any of the values are complex:
pd.to_numeric(s).dtype
dtype('complex128')
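To check whole columns, as in the original question, the same idea can be applied column by column (a sketch using the df built in the question):
for col in df.columns:
    if pd.to_numeric(df[col]).dtype == np.complex128:
        print(col, 'contains complex values')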

Elegant way to create empty pandas DataFrame with NaN of type float

I want to create a Pandas DataFrame filled with NaNs. During my research I found an answer:
import pandas as pd
df = pd.DataFrame(index=range(0,4),columns=['A'])
This code results in a DataFrame filled with NaNs of type "object", so they cannot be used later on, for example, with the interpolate() method. Therefore, I created the DataFrame with this more complicated code (inspired by this answer):
import pandas as pd
import numpy as np
dummyarray = np.empty((4,1))
dummyarray[:] = np.nan
df = pd.DataFrame(dummyarray)
This results in a DataFrame filled with NaN of type "float", so it can be used later on with interpolate(). Is there a more elegant way to create the same result?
Simply pass the desired value as the first argument, like 0, math.inf or, here, np.nan. The constructor then initializes and fills the value array to the size specified by the index and columns arguments:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.nan, index=[0, 1, 2, 3], columns=['A', 'B'])
>>> df
    A   B
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
>>> df.dtypes
A    float64
B    float64
dtype: object
You could specify the dtype directly when constructing the DataFrame:
>>> df = pd.DataFrame(index=range(0,4),columns=['A'], dtype='float')
>>> df.dtypes
A float64
dtype: object
Specifying the dtype forces Pandas to try creating the DataFrame with that type, rather than trying to infer it.
Hope this can help!
pd.DataFrame(np.nan, index = np.arange(<num_rows>), columns = ['A'])
You can try this line of code:
pdDataFrame = pd.DataFrame([np.nan] * 7)
This will create a pandas DataFrame of size 7, filled with NaNs of type float.
If you print pdDataFrame, the output will be:
     0
0  NaN
1  NaN
2  NaN
3  NaN
4  NaN
5  NaN
6  NaN
Also the output for pdDataFrame.dtypes is:
0 float64
dtype: object
For multiple columns you can do:
df = pd.DataFrame(np.zeros([nrow, ncol])*np.nan)
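A variant of the same idea (a sketch; np.full writes the fill value directly instead of multiplying zeros by NaN):
import numpy as np
import pandas as pd

nrow, ncol = 4, 2  # example sizes, assumed for illustration
df = pd.DataFrame(np.full((nrow, ncol), np.nan))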
