Find mixed types in Pandas columns - python

Every so often I get this warning when parsing data files:
WARNING:py.warnings:/usr/local/python3/miniconda/lib/python3.4/site-packages/pandas-0.16.0_12_gdcc7431-py3.4-linux-x86_64.egg/pandas/io/parsers.py:1164: DtypeWarning: Columns (0,2,14,20) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
But if the data is large (I have 50k rows), how can I find WHERE in the data the change of dtype occurs?

I'm not entirely sure what you're after, but it's easy enough to find the rows which contain elements which don't share the type of the first row. For example:
>>> df = pd.DataFrame({"A": np.arange(500), "B": np.arange(500.0)})
>>> df.loc[321, "A"] = "Fred"
>>> df.loc[325, "B"] = True
>>> weird = (df.applymap(type) != df.iloc[0].apply(type)).any(axis=1)
>>> df[weird]
        A     B
321  Fred   321
325   325  True

In addition to DSM's answer, with a many-column dataframe it can be helpful to find the columns that change type like so:
for col in df.columns:
    weird = (df[[col]].applymap(type) != df[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df[weird]) > 0:
        print(col)

This approach uses pandas.api.types.infer_dtype to find the columns which have mixed dtypes. It was tested with Pandas 1 under Python 3.8.
Note that this answer has multiple uses of assignment expressions which work only with Python 3.8 or newer. It can however trivially be modified to not use them.
if mixed_dtypes := {c: dtype for c in df.columns if (dtype := pd.api.types.infer_dtype(df[c])).startswith("mixed")}:
    raise TypeError(f"Dataframe has one or more mixed dtypes: {mixed_dtypes}")
This approach doesn't however find a row with the changed dtype.
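To also get the rows, one option is to combine infer_dtype with the per-value type check used in the other answers. This is only a minimal sketch: find_mixed_rows is a hypothetical helper (not a pandas function), and it assumes that the most common Python type in a column is the expected one.

```python
import pandas as pd
from pandas.api.types import infer_dtype


def find_mixed_rows(df):
    """Map each mixed-type column to the index labels of its odd-typed rows."""
    report = {}
    for col in df.columns:
        # infer_dtype labels mixed columns "mixed" or "mixed-integer"
        if not infer_dtype(df[col]).startswith("mixed"):
            continue
        types = df[col].apply(type)
        # Assumption: the most frequent type is the "expected" one
        majority = types.value_counts().idxmax()
        report[col] = df.index[types != majority].tolist()
    return report


# Sample data modelled on the first answer: one string hidden among integers
df = pd.DataFrame({"A": list(range(500)), "B": [float(i) for i in range(500)]})
df["A"] = df["A"].astype(object)
df.loc[321, "A"] = "Fred"

mixed_rows = find_mixed_rows(df)
print(mixed_rows)
```

The returned labels can be passed straight to df.loc[...] to inspect the offending rows.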

Create sample data with a column that has 2 data types
import seaborn
iris = seaborn.load_dataset("iris")
# Change one row to another type
iris.loc[0,"sepal_length"] = iris.loc[0,"sepal_length"].astype(str)
When columns use more than one type, print the column name and the types used:
for col in iris.columns:
    unique_types = iris[col].apply(type).unique()
    if len(unique_types) > 1:
        print(col, unique_types)
To fix the column types you can:
use df[col] = df[col].astype(str) to change the data type,
or, if the data frame was read from a CSV file, pass the dtype argument as a dictionary mapping column names to types.
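As a rough illustration of the second option, here is a sketch that reads a small in-memory CSV with an explicit dtype dictionary. The column names and contents are made up for the example; with a real file you would pass the path instead of the StringIO object.

```python
import io

import pandas as pd

# Hypothetical CSV: the "code" column mixes numbers and text
csv_text = "code,value\n1,0.5\n2,1.5\nabc,2.5\n"

# Declaring the types up front means pandas does not have to guess per chunk
df = pd.read_csv(io.StringIO(csv_text), dtype={"code": str, "value": float})

types_in_code = df["code"].apply(type).unique().tolist()
print(df.dtypes)
print(types_in_code)
```

Every value in the code column comes back as a string, so the column no longer mixes types.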

Related

Why are my lambda and map() functions returning floats instead of integers on pandas dataframe? [duplicate]

I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is the id series has missing/empty values.
When I try to cast the id column to integer while reading the .csv, I get:
df= pd.read_csv("data.csv", dtype={'id': int})
error: Integer column has NA values
Alternatively, I tried to convert the column type after reading as below, but this time I get:
df= pd.read_csv("data.csv")
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer
How can I tackle this?
In version 0.24+, pandas gained the ability to hold integer dtypes with missing values.
Nullable Integer Data Type.
Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers, and will not be inferred; you must explicitly pass the dtype into array() or Series:
arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)
0 1
1 2
2 NaN
dtype: Int64
To convert a column to nullable integers, use:
df['myCol'] = df['myCol'].astype('Int64')
The lack of a NaN representation in integer columns is a pandas "gotcha".
The usual workaround is to simply use floats.
My use case is munging data prior to loading into a DB table:
df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)
Replace NaNs with -1, convert to int, convert to str and then reinsert NaNs.
It's not pretty but it gets the job done!
It is now possible to create a pandas column containing NaNs with an integer dtype, since this was officially added in pandas 0.24.0.
pandas 0.24.x release notes
Quote: "Pandas has gained the ability to hold integer dtypes with missing values."
Whether your pandas Series has object dtype or simply float dtype, the method below will work:
df = pd.read_csv("data.csv")
df['id'] = df['id'].astype(float).astype('Int64')
I had the problem a few weeks ago with a few discrete features which were formatted as 'object'. This solution seemed to work.
for col in discrete:
    df[col] = pd.to_numeric(df[col], errors='coerce').astype(pd.Int64Dtype())
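For what it's worth, here is a small self-contained sketch of that conversion on made-up data, assuming the values that do parse are whole numbers (as far as I know, astype('Int64') raises if a value such as 1.5 would have to be truncated):

```python
import pandas as pd

# Made-up object column: numeric strings, an unparseable entry and a missing value
raw = pd.Series(["1", "2", "n/a", None], dtype=object)

# errors="coerce" turns anything unparseable into NaN (float64 at this point),
# and the cast to the nullable Int64 dtype turns those NaNs into <NA>
as_int = pd.to_numeric(raw, errors="coerce").astype("Int64")
print(as_int)
```

Note that coercion silently discards the unparseable entries, so it may be worth checking raw[as_int.isna()] before relying on the result.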
If you absolutely want to combine integers and NaNs in a column, you can use the 'object' data type:
df['col'] = (
    df['col'].fillna(0)
    .astype(int)
    .astype(object)
    .where(df['col'].notnull())
)
This will replace NaNs with an integer (doesn't matter which), convert to int, convert to object and finally reinsert NaNs.
You could use .dropna() if it is OK to drop the rows with the NaN values.
df = df.dropna(subset=['id'])
Alternatively, use .fillna() and .astype() to replace the NaN with values and convert them to int.
I ran into this problem when processing a CSV file with large integers, some of which were missing (NaN). Using float as the type was not an option, because I might lose precision.
My solution was to use str as the intermediate type.
Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.
df = pd.read_csv(filename, dtype={'id':str})
df["id"] = df["id"].fillna("0").astype(int)
To illustrate, here is an example of how floats may lose precision:
s = "12345678901234567890"
f = float(s)
i = int(f)
i2 = int(s)
print (f, i, i2)
And the output is:
1.2345678901234567e+19 12345678901234567168 12345678901234567890
As of Pandas 1.0.0 you can now use pandas.NA values. This does not force integer columns with missing values to be floats.
When reading in your data all you have to do is:
df= pd.read_csv("data.csv", dtype={'id': 'Int64'})
Notice that 'Int64' is surrounded by quotes and the I is capitalized. This distinguishes pandas' 'Int64' from numpy's int64.
As a side note, this will also work with .astype()
df['id'] = df['id'].astype('Int64')
Documentation here
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
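Here is a small sketch of the above using an in-memory CSV in place of data.csv (the contents are made up). As far as I can tell, isna() and fillna() handle the resulting <NA> values in the pandas versions that support this dtype.

```python
import io

import pandas as pd

# Stand-in for data.csv: the second row has an empty id
csv_text = "id,name\n1,a\n,b\n3,c\n"

df = pd.read_csv(io.StringIO(csv_text), dtype={"id": "Int64"})
print(df["id"])

# Count the missing ids, then fill them with a value of your choice
missing = int(df["id"].isna().sum())
filled = df["id"].fillna(0).tolist()
print(missing, filled)
```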
If you can modify your stored data, use a sentinel value for a missing id. A common case, inferred from the column name, is that id is an integer strictly greater than zero; you could then use 0 as the sentinel value so that you can write
if row['id']:
    regular_process(row)
else:
    special_process(row)
Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful, though, if you can't be sure the placeholder won't show up in your source data. My method formats floats without their decimal part and converts nulls to None. The result is an object column that will look like an integer field with null values when written to a CSV.
keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))
import pandas as pd
df= pd.read_csv("data.csv")
df['id'] = pd.to_numeric(df['id'])
If you want to use it when you chain methods, you can use assign:
df = (
    df.assign(col=lambda x: x['col'].astype('Int64'))
)
The issue with Int64, as with many of the other solutions, is that null values get replaced with <NA> values, which do not work with pandas' default NaN functions such as isnull() or fillna(). And if you convert values to -1, you may end up deleting information. My solution is a little lame, but it provides int values alongside np.nan, allowing the NaN functions to work without compromising your values.
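def to_int(x):
    try:
        return int(x)
    except (ValueError, TypeError, OverflowError):
        # NaN, None, infinities and non-numeric strings all end up here
        return np.nan
df[column] = df[column].apply(to_int)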
Use .fillna() to replace all NaN values with 0 and then convert it to int using astype(int)
df['id'] = df['id'].fillna(0).astype(int)
For anyone needing int values in NULL/NaN-containing columns who cannot use the nullable integer features of pandas 0.24.0 mentioned in other answers, I suggest converting the columns to object type using DataFrame.where:
df = df.where(pd.notnull(df), None)
This converts all NaNs in the dataframe to None, treating mixed-type columns as objects, but leaving the int values as int, rather than float.
First you need to specify the newer integer type, Int8 (...Int64) that can handle null integer data (pandas version >= 0.24.0)
df = df.astype('Int8')
But you may want to only target specific columns which have integer data mixed with NaN/nulls:
df = df.astype({'col1':'Int8','col2':'Int8','col3':'Int8'})
At this point, the NaN's are converted into <NA> and if you want to change the default null value with df.fillna(), you need to coerce the object datatype on the columns you wish to change, otherwise you will see
TypeError: <U1 cannot be converted to an IntegerDtype
You can do this by
df = df.astype(object) if you don't mind changing every column datatype to object (individually, each value's type is still preserved) ... OR
df = df.astype({"col1": object,"col2": object}) if you prefer to target individual columns.
This should help with forcing your integer columns mixed with nulls to stay formatted as integers and change the null values to whatever you like. I can't speak to the efficiency of this method, but it worked for my formatting and printing purposes.
I ran into this issue working with pyspark. As this is a python frontend for code running on a jvm, it requires type safety and using float instead of int is not an option. I worked around the issue by wrapping the pandas pd.read_csv in a function that will fill user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:
def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    else:
        assert 'dtype' not in kwargs.keys()
        df = pd.read_csv(file_path, dtype={}, **kwargs)
        for col, typ in custom_dtype.items():
            if fill_values is None or col not in fill_values.keys():
                fill_val = -1
            else:
                fill_val = fill_values[col]
            df[col] = df[col].fillna(fill_val).astype(typ)
        return df
Try this:
df[['id']] = df[['id']].astype(pd.Int64Dtype())
If you print its dtypes, you will get id Int64 instead of the normal int64.
First remove the rows which contain NaN. Then do the integer conversion on the remaining rows.
Finally, insert the removed rows again.
Hope it will work
Had a similar problem. That was my solution:
def toint(zahl=1.1):
    try:
        zahl = int(zahl)
    except:
        zahl = np.nan
    return zahl
print(toint(4.776655), toint(np.nan), toint('test'))
4 nan nan
df = pd.read_csv("data.csv")
df['id'] = df['id'].astype(float)
df['id'] = df['id'].apply(toint)
Since I didn't see the answer here, I might as well add it:
One-liner to convert NaNs to empty strings if, for some reason, you still can't handle np.nan or pd.NA (like me, when relying on a library with an older version of pandas):
df.select_dtypes('number').fillna(-1).astype(str).replace('-1', '')
I think the approach of #Digestible1010101 is more appropriate for pandas 1.2+ versions; something like this should do the job:
df = df.astype({
    'col_1': 'Int64',
    'col_2': 'Int64',
    'col_3': 'Int64',
    'col_4': 'Int64',
})
Similar to #hibernado's answer, but keeping it as integers (instead of strings)
df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = np.where(df[col] == -1, np.nan, df[col])
df.loc[~df['id'].isna(), 'id'] = df.loc[~df['id'].isna(), 'id'].astype('int')
Assuming your DateColumn, formatted like 3312018.0, should be converted to 03/31/2018 as a string, and some records are missing or 0:
df['DateColumn'] = df['DateColumn'].astype(int)
df['DateColumn'] = df['DateColumn'].astype(str)
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))
df.loc[df['DateColumn'] == '00000000','DateColumn'] = '01011980'
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))
use pd.to_numeric()
df["DateColumn"] = pd.to_numeric(df["DateColumn"])
simple and clean

Change DataTypes of Pandas Columns by selecting columns by regex

I have a Pandas dataframe with a lot of columns looking like p_d_d_c0, p_d_d_c1, ... p_d_d_g1, p_d_d_g2, ....
df =
a b c p_d_d_c0 p_d_d_c1 p_d_d_c2 ... p_d_d_g0 p_d_d_g1 ...
All the columns that conform to the regex need to be selected and their datatypes changed from object to float. In particular, the columns that look like p_d_d_c* and p_d_d_g* are all object types, and I would like to change them to float types. Is there a way to select columns in bulk using a regular expression and change them to float types?
I tried the answer from here, but it takes a lot of time and memory as I have hundreds of these columns.
df.filter(regex=("p_d_d_.*"))
I also tried:
df.select(lambda col: col.startswith('p_d_d_g'), axis=1)
But, it gives an error:
AttributeError: 'DataFrame' object has no attribute 'select'
My Pandas version is 1.0.1
So, how can I select columns in bulk and change their data types using regex?
Try this:
import pandas as pd
# sample dataframe
df = pd.DataFrame(data={"co1":[1,2,3,4], "co22":[4,3,2,1], "co3":[2,3,2,4], "abc":[5,4,3,2]})
# select all columns which have co in it
floatcols = [col for col in df.columns if "co" in col]
for floatcol in floatcols:
    df[floatcol] = df[floatcol].astype(float)
From the same link, and with some astype magic.
column_vals = df.columns.map(lambda x: x.startswith("p_d_d_"))
train_temp = df.loc(axis=1)[column_vals]
train_temp = train_temp.astype(float)
EDIT:
To modify the original dataframe, do something like this:
column_vals = [x for x in df.columns if x.startswith("p_d_d_")]
df[column_vals] = df[column_vals].astype(float)
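If you really want a regular expression rather than startswith, a sketch along the following lines might work. The sample frame and the exact pattern are assumptions based on the column names in the question; DataFrame.filter is used only to collect the matching names, and a single astype call with a dictionary does the conversion.

```python
import pandas as pd

# Made-up frame using the naming scheme from the question
df = pd.DataFrame({
    "a": ["x", "y"],
    "p_d_d_c0": ["1.5", "2.5"],
    "p_d_d_g1": ["3", "4"],
})

# filter(regex=...) keeps the columns whose name matches the pattern
cols = df.filter(regex=r"^p_d_d_[cg]\d+$").columns

# One astype call converts every matched column and leaves the rest untouched
df = df.astype({col: float for col in cols})
print(df.dtypes)
```

Converting all matched columns in one call avoids a Python-level loop over hundreds of columns.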

Converting column dtypes as 'objects' to 'floats' or 'ints' [duplicate]


column with missing values to INT dtype without modifying the missing values [duplicate]

I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is the id series has missing/empty values.
When I try to cast the id column to integer while reading the .csv, I get:
df= pd.read_csv("data.csv", dtype={'id': int})
error: Integer column has NA values
Alternatively, I tried to convert the column type after reading as below, but this time I get:
df= pd.read_csv("data.csv")
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer
How can I tackle this?
In version 0.24.+ pandas has gained the ability to hold integer dtypes with missing values.
Nullable Integer Data Type.
Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension types implemented within pandas. It is not the default dtype for integers, and will not be inferred; you must explicitly pass the dtype into array() or Series:
arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)
0 1
1 2
2 NaN
dtype: Int64
For convert column to nullable integers use:
df['myCol'] = df['myCol'].astype('Int64')
The lack of NaN rep in integer columns is a pandas "gotcha".
The usual workaround is to simply use floats.
My use case is munging data prior to loading into a DB table:
df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)
Remove NaNs, convert to int, convert to str and then reinsert NANs.
It's not pretty but it gets the job done!
It is now possible to create a pandas column containing NaNs as dtype int, since it is now officially added on pandas 0.24.0
pandas 0.24.x release notes
Quote: "Pandas has gained the ability to hold integer dtypes with missing values
Whether your pandas series is object datatype or simply float datatype the below method will work
df = pd.read_csv("data.csv")
df['id'] = df['id'].astype(float).astype('Int64')
I had the problem a few weeks ago with a few discrete features which were formatted as 'object'. This solution seemed to work.
for col in discrete:
df[col] = pd.to_numeric(df[col],errors='coerce').astype(pd.Int64Dtype())
If you absolutely want to combine integers and NaNs in a column, you can use the 'object' data type:
df['col'] = (
df['col'].fillna(0)
.astype(int)
.astype(object)
.where(df['col'].notnull())
)
This will replace NaNs with an integer (doesn't matter which), convert to int, convert to object and finally reinsert NaNs.
You could use .dropna() if it is OK to drop the rows with the NaN values.
df = df.dropna(subset=['id'])
Alternatively,
use .fillna() and .astype() to replace the NaN with values and convert them to int.
I ran into this problem when processing a CSV file with large integers, while some of them were missing (NaN). Using float as the type was not an option, because I might loose the precision.
My solution was to use str as the intermediate type.
Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.
df = pd.read_csv(filename, dtype={'id':str})
df["id"] = df["id"].fillna("0").astype(int)
For the illustration, here is an example how floats may loose the precision:
s = "12345678901234567890"
f = float(s)
i = int(f)
i2 = int(s)
print (f, i, i2)
And the output is:
1.2345678901234567e+19 12345678901234567168 12345678901234567890
As of Pandas 1.0.0 you can now use pandas.NA values. This does not force integer columns with missing values to be floats.
When reading in your data all you have to do is:
df= pd.read_csv("data.csv", dtype={'id': 'Int64'})
Notice the 'Int64' is surrounded by quotes and the I is capitalized. This distinguishes Panda's 'Int64' from numpy's int64.
As a side note, this will also work with .astype()
df['id'] = df['id'].astype('Int64')
Documentation here
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
If you can modify your stored data, use a sentinel value for missing id. A common use case, inferred by the column name, being that id is an integer, strictly greater than zero, you could use 0 as a sentinel value so that you can write
if row['id']:
regular_process(row)
else:
special_process(row)
Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you're uncertain that integer won't show up in your source data though. My method with will format floats without their decimal values and convert nulls to None's. The result is an object datatype that will look like an integer field with null values when loaded into a CSV.
keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))
import pandas as pd
df= pd.read_csv("data.csv")
df['id'] = pd.to_numeric(df['id'])
If you want to use it when you chain methods, you can use assign:
df = (
df.assign(col = lambda x: x['col'].astype('Int64'))
)
The issue with Int64, like many other's solutions, is that if you have null values, they get replaced with <NA> values, which do not work with pandas default 'NaN' functions, like isnull() or fillna(). Or if you convert values to -1 you end up in a situation where you may be deleting your information. My solution is a little lame, but will provide int values with np.nan, allowing for nan functions to work without compromising your values.
def to_int(x):
try:
return int(x)
except:
return np.nan
df[column] = df[column].apply(to_int)
Use .fillna() to replace all NaN values with 0 and then convert it to int using astype(int)
df['id'] = df['id'].fillna(0).astype(int)
For anyone needing to have int values within NULL/NaN-containing columns, but working under the constraint of being unable to use pandas version 0.24.0 nullable integer features mentioned in other answers, I suggest converting the columns to object type using pd.where:
df = df.where(pd.notnull(df), None)
This converts all NaNs in the dataframe to None, treating mixed-type columns as objects, but leaving the int values as int, rather than float.
First you need to specify the newer integer type, Int8 (...Int64) that can handle null integer data (pandas version >= 0.24.0)
df = df.astype('Int8')
But you may want to only target specific columns which have integer data mixed with NaN/nulls:
df = df.astype({'col1':'Int8','col2':'Int8','col3':'Int8')
At this point, the NaN's are converted into <NA> and if you want to change the default null value with df.fillna(), you need to coerce the object datatype on the columns you wish to change, otherwise you will see
TypeError: <U1 cannot be converted to an IntegerDtype
You can do this by
df = df.astype(object) if you don't mind changing every column datatype to object (individually, each value's type is still preserved) ... OR
df = df.astype({"col1": object,"col2": object}) if you prefer to target individual columns.
This should help with forcing your integer columns mixed with nulls to stay formatted as integers and change the null values to whatever you like. I can't speak to the efficiency of this method, but it worked for my formatting and printing purposes.
I ran into this issue working with pyspark. As this is a python frontend for code running on a jvm, it requires type safety and using float instead of int is not an option. I worked around the issue by wrapping the pandas pd.read_csv in a function that will fill user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:
def custom_read_csv(file_path, custom_dtype = None, fill_values = None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    else:
        assert 'dtype' not in kwargs.keys()
        df = pd.read_csv(file_path, dtype = {}, **kwargs)
        for col, typ in custom_dtype.items():
            if fill_values is None or col not in fill_values.keys():
                fill_val = -1
            else:
                fill_val = fill_values[col]
            df[col] = df[col].fillna(fill_val).astype(typ)
        return df
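The core of that wrapper is "fill first, then cast". A small sketch of just that step, assuming an in-memory CSV (the csv_text contents, the -1 sentinel and the int32 target type are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical CSV: the second row has no id, the third row has no score.
csv_text = "id,score\n1,0.5\n,0.7\n3,\n"

raw = pd.read_csv(io.StringIO(csv_text))
# The missing value forces 'id' to float64 on a plain read.
print(raw["id"].dtype)

# Fill the gap with a sentinel, then cast to the strict integer type.
filled = raw.copy()
filled["id"] = filled["id"].fillna(-1).astype("int32")
print(filled["id"].tolist())
```

The sentinel has to be a value that cannot occur in the real data, otherwise genuine ids and missing ids become indistinguishable.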
Try this:
df[['id']] = df[['id']].astype(pd.Int64Dtype())
If you print its dtypes, you will get id Int64 instead of the normal int64.
First remove the rows which contain NaN. Then do the integer conversion on the remaining rows.
Finally, insert the removed rows again.
Hope it will work
Had a similar problem. That was my solution:
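import numpy as np
import pandas as pd
def toint(zahl = 1.1):
    # return the value as an int, or NaN if it cannot be converted
    try:
        zahl = int(zahl)
    except (ValueError, TypeError):
        zahl = np.nan
    return zahl
print(toint(4.776655), toint(np.nan), toint('test'))
4 nan nan
df = pd.read_csv("data.csv")
df['id'] = df['id'].astype(float)
# apply toint element by element; toint(df['id']) would try to convert the whole column at once
df['id'] = df['id'].apply(toint)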
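For whole columns, a vectorised variant of the same idea might look like the sketch below. It swaps in pd.to_numeric(errors="coerce") and the nullable Int64 dtype, neither of which is part of the answer above, and the sample values are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing a float, a NaN and a non-numeric string.
raw = pd.Series([4.776655, np.nan, "test"])

# Anything that cannot be parsed as a number becomes NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# Truncate like int() does, then store as nullable integers.
as_int = np.trunc(numeric).astype("Int64")
print(as_int)
```

Unlike the element-wise function, this keeps an integer dtype for the column, with <NA> marking the values that could not be converted.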
Since I didn't see the answer here, I might as well add it:
One-liner to convert NaNs to empty strings if, for some reason, you still can't handle np.nan or pd.NA, like me when relying on a library with an older version of pandas:
df.select_dtypes('number').fillna(-1).astype(str).replace('-1', '')
I think the approach of @Digestible1010101 is the more appropriate one for pandas 1.2+ versions. Something like this should do the job:
df = df.astype({
    'col_1': 'Int64',
    'col_2': 'Int64',
    'col_3': 'Int64',
    'col_4': 'Int64',
})
Similar to @hibernado's answer, but keeping it as integers (instead of strings)
df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = np.where(df[col] == -1, np.nan, df[col])
df.loc[~df['id'].isna(), 'id'] = df.loc[~df['id'].isna(), 'id'].astype('int')
Assuming your DateColumn is formatted like 3312018.0 and should be converted to 03/31/2018 as a string, and some records are missing or 0:
df['DateColumn'] = df['DateColumn'].astype(int)
df['DateColumn'] = df['DateColumn'].astype(str)
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))
df.loc[df['DateColumn'] == '00000000','DateColumn'] = '01011980'
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))
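The same steps on a tiny made-up frame, written with the .str and .dt accessors instead of apply (the sample values are assumptions; 01011980 is the placeholder date used above for zero records):

```python
import pandas as pd

# Hypothetical input: MMDDYYYY stored as floats, with 0 meaning "no date".
dates = pd.DataFrame({"DateColumn": [3312018.0, 0.0, 12252019.0]})

# Drop the ".0", turn into text and left-pad to eight digits.
text = dates["DateColumn"].astype(int).astype(str).str.zfill(8)

# Replace the all-zero placeholder with a default date.
text = text.replace("00000000", "01011980")

# Parse as month/day/year and format back as a string.
dates["DateColumn"] = pd.to_datetime(text, format="%m%d%Y").dt.strftime("%m/%d/%Y")
print(dates)
```

Note that astype(int) fails on real NaN values, so truly missing records would need a fillna(0) before the first step.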
use pd.to_numeric()
df["DateColumn"] = pd.to_numeric(df["DateColumn"])
simple and clean

Pandas Dataframe object types fillna exception over different datatypes

I have a Pandas Dataframe with different dtypes for the different columns. E.g. df.dtypes returns the following.
Date datetime64[ns]
FundID int64
FundName object
CumPos int64
MTMPrice float64
PricingMechanism object
Several of these columns have missing values in them. Doing group operations on them with NaN values in place causes problems. Getting rid of them with the .fillna() method is the obvious choice. The problem is that the obvious choice for strings is .fillna("") while .fillna(0) is the correct choice for ints and floats. Using either method on the whole DataFrame throws an exception. Are there any elegant solutions besides doing them individually (I have about 30 columns)? I have a lot of code depending on the DataFrame and would prefer not to retype the columns, as that is likely to break some other logic.
Can do:
df.FundID.fillna(0)
df.FundName.fillna("")
etc
You can iterate through them and use an if statement!
for col in df:
    #get dtype for column
    dt = df[col].dtype
    #check if it is a number
    if dt == int or dt == float:
        #fillna returns a new column, so assign it back
        df[col] = df[col].fillna(0)
    else:
        df[col] = df[col].fillna("")
When you iterate through a pandas DataFrame, you will get the names of each of the columns, so to access those columns, you use df[col]. This way you don't need to do it manually and the script can just go through each column and check its dtype!
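A variant of that loop, sketched with pandas.api.types.is_numeric_dtype so that every numeric width (int32, float32, and so on) is covered rather than only the ones equal to int or float. The frame below is made up, borrowing two column names from the question:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical data with gaps in a text column and a numeric column.
funds = pd.DataFrame({
    "FundName": ["Alpha", None, "Gamma"],
    "MTMPrice": [1.5, np.nan, 3.0],
})

for col in funds.columns:
    # 0 for numeric columns, empty string for everything else.
    fill_value = 0 if is_numeric_dtype(funds[col]) else ""
    funds[col] = funds[col].fillna(fill_value)

print(funds)
```

Datetime columns would also fall into the "everything else" branch here, so they may need a separate check before filling with an empty string.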
You can grab the float64 and object columns using:
In [11]: float_cols = df.blocks['float64'].columns
In [12]: object_cols = df.blocks['object'].columns
and int columns won't have NaNs else they would be upcast to float.
Now you can apply the respective fillnas, one cheeky way:
In [13]: d1 = dict((col, '') for col in object_cols)
In [14]: d2 = dict((col, 0) for col in float_cols)
In [15]: df.fillna(value=dict(d1, **d2))
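As far as I can tell, df.blocks is no longer available in recent pandas releases. A sketch of the same dictionary trick using select_dtypes instead, on a made-up frame (the column names come from the question, the values are assumptions, and the text column is created with dtype=object explicitly so the selection below matches it):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one object, one int and one float column.
df = pd.DataFrame({
    "FundName": pd.Series(["Alpha", None], dtype=object),
    "CumPos": [5, 7],
    "MTMPrice": [1.5, np.nan],
})

float_cols = df.select_dtypes(include="float64").columns
object_cols = df.select_dtypes(include="object").columns

# One fill value per column name, passed to a single fillna call.
fill_map = {**{col: "" for col in object_cols}, **{col: 0 for col in float_cols}}
filled = df.fillna(value=fill_map)
print(filled)
```

Columns that are not in the dictionary, such as the int column here, are left untouched by fillna.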
A compact version example:
#replace Nan with '' for columns of type 'object'
df=df.select_dtypes(include='object').fillna('')
However, after the above operation, the dataframe will only contain the 'object' type columns. For keeping all columns, use the solution proposed by @Ryan Saxe.
@Ryan Saxe's answer is accurate. To get it to work on my data I had to set inplace=True and pass the fill values as value=0 and value="". See the code below:
for col in df:
    #get dtype for column
    dt = df[col].dtype
    #check if it is a number
    if dt == int or dt == float:
        df[col].fillna(value=0, inplace=True)
    else:
        df[col].fillna(value="", inplace=True)
Similar to @Guddi: a bit verbose, but still more concise than @Ryan's answer, and it keeps all columns:
df[df.select_dtypes("object").columns] = df.select_dtypes("object").fillna("")
Rather than running the conversion one column at a time, which is inefficient, here is a way to grab all of the int or float columns and change in one shot.
int_float_cols = df.select_dtypes(include=['int', 'float']).columns
df[int_float_cols] = df[int_float_cols].fillna(value=0)
It should be obvious how to adapt this to handle object columns.
I'm aware that older versions of pandas did not allow NAs in integer columns, so grabbing the "ints" is not strictly necessary and it may accidentally promote ints to floats. However, in our use case, it is better to be safe than sorry.
I ran into this because the ordinary approach, df.fillna(0), corrupted all of the datetime variables.
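A small sketch of that last point, on a made-up frame with one datetime column and one numeric column (names borrowed from the question, values assumed). Restricting fillna to the numeric columns leaves the missing date as NaT:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a missing date and a missing number.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", None]),
    "CumPos": [1.0, np.nan],
})

# Select only numeric columns and fill those in one shot.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(0)

print(df)
print(df.dtypes)
```

Here include="number" is used as a broader selector than ['int', 'float']; that is a substitution on my part, not what the answer above uses.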
