Skip rows with missing values in read_csv - python

I have a very large CSV which I need to read in. To make this fast and save RAM, I am using read_csv and setting the dtype of some columns to np.uint32. The problem is that some rows have missing values, and pandas uses a float to represent those.
Is it possible to simply skip rows with missing values? I know I could do this after reading in the whole file but this means I couldn't set the dtype until then and so would use too much RAM.
Is it possible to convert missing values to some other value I choose while reading the data?

It would be handy if you could fill NaN with, say, 0 during the read itself. Perhaps a feature request on pandas' GitHub is in order...
Using a converter function
However, for the time being, you can define your own function to do that and pass it to the converters argument in read_csv:
def conv(val):
    if val == '':   # converters receive the raw string; an empty field means a missing value
        return 0    # or whatever else you want to represent your NaN with
    return val

df = pd.read_csv(file, converters={colWithNaN: conv}, dtype=...)
Note that converters takes a dict, so you need to specify it for each column that has NaN to be dealt with. It can get a little tiresome if a lot of columns are affected. You can specify either column names or numbers as keys.
Also note that this might slow down your read_csv performance, depending on how the converter function is handled. Further, if you just have one column that needs NaN handling during the read, you can skip a proper function definition and use a lambda function instead:
df = pd.read_csv(file, converters={colWithNaN: lambda x: 0 if x == '' else x}, dtype=...)
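Putting it together, a minimal sketch might look like the following (the file and column names are made up; since the converter receives raw strings, the cast to np.uint32 is applied afterwards):
import numpy as np
import pandas as pd

def conv(val):
    # converters receive each field as a raw string; treat an empty field as 0
    return int(val) if val.strip() != '' else 0

df = pd.read_csv('big.csv', converters={'count': conv})
df['count'] = df['count'].astype(np.uint32)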
Reading in chunks
You could also read the file in small chunks that you stitch together to get your final output. You can do a bunch of things this way. Here is an illustrative example:
result = pd.DataFrame()
df = pd.read_csv(file, chunksize=1000)
for chunk in df:
    chunk.dropna(axis=0, inplace=True)  # Dropping all rows with any NaN value
    chunk[colToConvert] = chunk[colToConvert].astype(np.uint32)
    result = result.append(chunk)
del df, chunk
Note that this method does not duplicate the whole dataset in memory. Right after the result.append statement the data in chunk exists twice, but only chunksize rows are repeated, which is a fair bargain. This method may also work out to be faster than using a converter function.

There is no feature in Pandas that does that. You can implement it in regular Python like this:
import csv

import pandas as pd

def filter_records(records):
    """Given an iterable of dicts, converts values to int.
    Discards any record which has an empty field."""
    for record in records:
        for k, v in record.items():
            if v == '':
                break
            record[k] = int(v)
        else:  # this executes whenever the inner loop did not break
            yield record

with open('t.csv') as infile:
    records = csv.DictReader(infile)
    df = pd.DataFrame.from_records(filter_records(records))
Pandas uses the csv module internally anyway. If the performance of the above turns out to be a problem, you could probably speed it up with Cython (which Pandas also uses).

If you show some sample data, people could help more.
pd.read_csv('FILE', keep_default_na=False)
For starters, try these:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
na_values : str or list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘nan’.
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.
na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file
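As a rough sketch of how those options could address the original question (the file and column names are assumptions): read the affected column as strings with keep_default_na=False, drop the rows whose field is empty, and only then cast to np.uint32.
import numpy as np
import pandas as pd

# keep_default_na=False leaves missing fields as empty strings instead of NaN
df = pd.read_csv('big.csv', keep_default_na=False, dtype={'count': str})
df = df[df['count'] != '']                     # skip rows with a missing value
df['count'] = df['count'].astype(np.uint32)    # now the cast is safe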

Related

Why are my lambda and map() functions returning floats instead of integers on a pandas dataframe? [duplicate]

I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is the id series has missing/empty values.
When I try to cast the id column to integer while reading the .csv, I get:
df= pd.read_csv("data.csv", dtype={'id': int})
error: Integer column has NA values
Alternatively, I tried to convert the column type after reading as below, but this time I get:
df= pd.read_csv("data.csv")
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer
How can I tackle this?
In version 0.24.+ pandas has gained the ability to hold integer dtypes with missing values.
Nullable Integer Data Type.
Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype into array() or Series:
arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)
0      1
1      2
2    NaN
dtype: Int64
To convert a column to nullable integers, use:
df['myCol'] = df['myCol'].astype('Int64')
The lack of NaN rep in integer columns is a pandas "gotcha".
The usual workaround is to simply use floats.
My use case is munging data prior to loading into a DB table:
df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)
Remove NaNs, convert to int, convert to str, and then reinsert NaNs.
It's not pretty but it gets the job done!
It is now possible to create a pandas column containing NaNs with dtype int, since it was officially added in pandas 0.24.0.
pandas 0.24.x release notes
Quote: "Pandas has gained the ability to hold integer dtypes with missing values"
Whether your pandas series has object dtype or simply float dtype, the method below will work:
df = pd.read_csv("data.csv")
df['id'] = df['id'].astype(float).astype('Int64')
I had the problem a few weeks ago with a few discrete features which were formatted as 'object'. This solution seemed to work.
for col in discrete:
    df[col] = pd.to_numeric(df[col], errors='coerce').astype(pd.Int64Dtype())
If you absolutely want to combine integers and NaNs in a column, you can use the 'object' data type:
df['col'] = (
    df['col'].fillna(0)
             .astype(int)
             .astype(object)
             .where(df['col'].notnull())
)
This will replace NaNs with an integer (doesn't matter which), convert to int, convert to object and finally reinsert NaNs.
You could use .dropna() if it is OK to drop the rows with the NaN values.
df = df.dropna(subset=['id'])
Alternatively, use .fillna() and .astype() to replace the NaNs with a value and convert the column to int.
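For example, assuming 0 is an acceptable placeholder for a missing id:
df['id'] = df['id'].fillna(0).astype(int)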
I ran into this problem when processing a CSV file with large integers, while some of them were missing (NaN). Using float as the type was not an option, because I might lose precision.
My solution was to use str as the intermediate type.
Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.
df = pd.read_csv(filename, dtype={'id':str})
df["id"] = df["id"].fillna("0").astype(int)
For illustration, here is an example of how floats may lose precision:
s = "12345678901234567890"
f = float(s)
i = int(f)
i2 = int(s)
print (f, i, i2)
And the output is:
1.2345678901234567e+19 12345678901234567168 12345678901234567890
As of Pandas 1.0.0 you can now use pandas.NA values. This does not force integer columns with missing values to be floats.
When reading in your data all you have to do is:
df= pd.read_csv("data.csv", dtype={'id': 'Int64'})
Notice the 'Int64' is surrounded by quotes and the I is capitalized. This distinguishes pandas' 'Int64' from numpy's int64.
As a side note, this will also work with .astype()
df['id'] = df['id'].astype('Int64')
Documentation here
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
If you can modify your stored data, use a sentinel value for a missing id. In the common case, suggested by the column name, where id is an integer strictly greater than zero, you could use 0 as the sentinel value so that you can write:
if row['id']:
    regular_process(row)
else:
    special_process(row)
Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you're uncertain that the integer won't show up in your source data, though. My method will format floats without their decimal values and convert nulls to None. The result is an object dtype that will look like an integer field with null values when written to a CSV.
keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))
import pandas as pd
df= pd.read_csv("data.csv")
df['id'] = pd.to_numeric(df['id'])
If you want to use it when you chain methods, you can use assign:
df = (
    df.assign(col=lambda x: x['col'].astype('Int64'))
)
The issue with Int64, like many other solutions, is that if you have null values, they get replaced with <NA> values, which do not work with pandas' default NaN functions, like isnull() or fillna(). And if you convert values to -1, you end up in a situation where you may be deleting your information. My solution is a little lame, but will provide int values alongside np.nan, allowing NaN functions to work without compromising your values.
def to_int(x):
    try:
        return int(x)
    except:
        return np.nan

df[column] = df[column].apply(to_int)
Use .fillna() to replace all NaN values with 0 and then convert it to int using astype(int)
df['id'] = df['id'].fillna(0).astype(int)
For anyone needing to have int values within NULL/NaN-containing columns, but working under the constraint of being unable to use the pandas 0.24.0 nullable integer features mentioned in other answers, I suggest converting the columns to object type using DataFrame.where with pd.notnull:
df = df.where(pd.notnull(df), None)
This converts all NaNs in the dataframe to None, treating mixed-type columns as objects, but leaving the int values as int, rather than float.
First you need to specify the newer integer type, Int8 (...Int64) that can handle null integer data (pandas version >= 0.24.0)
df = df.astype('Int8')
But you may want to only target specific columns which have integer data mixed with NaN/nulls:
df = df.astype({'col1': 'Int8', 'col2': 'Int8', 'col3': 'Int8'})
At this point, the NaNs are converted into <NA>, and if you want to change the default null value with df.fillna(), you need to coerce the columns you wish to change to the object datatype, otherwise you will see:
TypeError: <U1 cannot be converted to an IntegerDtype
You can do this by
df = df.astype(object) if you don't mind changing every column datatype to object (individually, each value's type is still preserved) ... OR
df = df.astype({"col1": object,"col2": object}) if you prefer to target individual columns.
This should help with forcing your integer columns mixed with nulls to stay formatted as integers and change the null values to whatever you like. I can't speak to the efficiency of this method, but it worked for my formatting and printing purposes.
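A minimal sketch of that flow, with made-up data and a made-up fill label:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan], 'col2': [np.nan, 5, 6]})
df = df.astype({'col1': 'Int8', 'col2': 'Int8'})  # NaN becomes <NA>
df = df.astype({'col1': object, 'col2': object})  # coerce to object before filling
df = df.fillna('missing')                         # replace <NA> with any label you like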
I ran into this issue working with pyspark. As this is a Python frontend for code running on a JVM, it requires type safety, and using float instead of int is not an option. I worked around the issue by wrapping pd.read_csv in a function that fills user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:
def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    else:
        assert 'dtype' not in kwargs.keys()
        df = pd.read_csv(file_path, dtype={}, **kwargs)
        for col, typ in custom_dtype.items():
            if fill_values is None or col not in fill_values.keys():
                fill_val = -1
            else:
                fill_val = fill_values[col]
            df[col] = df[col].fillna(fill_val).astype(typ)
        return df
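For example, a hypothetical call filling missing id values with 0 before casting to an unsigned integer might look like:
df = custom_read_csv('data.csv', custom_dtype={'id': np.uint32}, fill_values={'id': 0})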
Try this:
df[['id']] = df[['id']].astype(pd.Int64Dtype())
If you print its dtypes, you will get id Int64 instead of the normal int64.
First remove the rows which contain NaN. Then do the integer conversion on the remaining rows.
At last, insert the removed rows again.
Hope it works.
Had a similar problem. That was my solution:
def toint(zahl=1.1):
    try:
        zahl = int(zahl)
    except:
        zahl = np.nan
    return zahl
print(toint(4.776655), toint(np.nan), toint('test'))
4 nan nan
df = pd.read_csv("data.csv")
df['id'] = df['id'].astype(float)
df['id'] = df['id'].apply(toint)
Since I didn't see the answer here, I might as well add it:
One-liner to convert NaNs to empty strings if, for some reason, you still can't handle np.nan or pd.NA (like me, when relying on a library built against an older version of pandas):
df.select_dtypes('number').fillna(-1).astype(str).replace(['-1', '-1.0'], '')
I think the approach of #Digestible1010101 is more appropriate for pandas 1.2+ versions; something like this should do the job:
df = df.astype({
    'col_1': 'Int64',
    'col_2': 'Int64',
    'col_3': 'Int64',
    'col_4': 'Int64',
})
Similar to #hibernado's answer, but keeping it as integers (instead of strings)
df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = np.where(df[col] == -1, np.nan, df[col])
df.loc[~df['id'].isna(), 'id'] = df.loc[~df['id'].isna(), 'id'].astype('int')
Assuming your DateColumn, formatted like 3312018.0, should be converted to the string 03/31/2018, and some records are missing or 0:
df['DateColumn'] = df['DateColumn'].astype(int)
df['DateColumn'] = df['DateColumn'].astype(str)
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))
df.loc[df['DateColumn'] == '00000000','DateColumn'] = '01011980'
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))
Use pd.to_numeric():
df["DateColumn"] = pd.to_numeric(df["DateColumn"])
Simple and clean.

How to change dtype of each dataframe column to float if it best fits?

I have an Excel file that has columns with values like 1.32131**, among columns of dtype string. As a result, the dtypes of these columns in the dataframe are object. I have cleaned the asterisks from the dataframe and now I need to convert the dtypes of these columns to float64. I am aware of ways to do it if I define the columns that need to be changed or the desired dtypes for each column (like the functions mentioned here), but I have too many columns to use such solutions. Thus I am looking for a more efficient and clean way.
For example, if I wanted to convert to int64 I would use convert_dtypes(), but it seems that it doesn't support floats and it returns these columns with object dtype.
"Then, if possible, convert to StringDtype, BooleanDtype or an appropriate integer extension type, otherwise leave as object."
Right now I am using the following script, which works, but I think it's too big for its purpose and a bit slow.
# Create df and clean it
# note that the data exist in an excel normally and the dict is only for reproducibility purposes
dict = {'Name': ['BPh1', 'BPh2', 'BPh3', 'BPh4', 'BPh5', 'BPh6', 'BPh7'],
        'BBB': ['2.00755**', '2.7766**', '0.490127**', '0.490127**', '0.87667**', '0.899189**', '3.084**'],
        'Buffer_solubility_mg_L': ['0.00112934**', '0.000798559**', '0.000218191**', '0.000122249**', '0.00382848**', '0.00109165**', '0.000665366**'],
        'CYP_2C19_inhibition': ['Inhibitor', 'Inhibitor', 'Non', 'Non', 'Inhibitor', 'Inhibitor', 'Inhibitor']}
ss = pd.DataFrame(dict).replace("\*", '', regex=True)
# Convert dtype to float when possible
for col in ss.columns[1:]:
    print(col, '\n', ss[col].dtypes)
    try:
        ss[col] = pd.to_numeric(ss[col])
    except:
        pass
    print(ss[col].dtypes, '\n')
Is there a cleaner way to do this conversion?
I'd change/clean the values before creating the dataframe; that way you're not first creating one and then converting it to something else (which might save a little time as well). The advantage is that you can do it in a single line. I don't think you can get much faster than this, given the input data that you have to work with.
import pandas
# Create df and clean it
dict = {'Name': ['BPh1', 'BPh2', 'BPh3', 'BPh4', 'BPh5', 'BPh6', 'BPh7'],
        'BBB': ['2.00755**', '2.7766**', '0.490127**', '0.490127**', '0.87667**', '0.899189**', '3.084**'],
        'Buffer_solubility_mg_L': ['0.00112934**', '0.000798559**', '0.000218191**', '0.000122249**', '0.00382848**', '0.00109165**', '0.000665366**'],
        'CYP_2C19_inhibition': ['Inhibitor', 'Inhibitor', 'Non', 'Non', 'Inhibitor', 'Inhibitor', 'Inhibitor']}

# Perform the conversion on creation
df = pandas.DataFrame(
    {
        col: pandas.to_numeric([v.replace("*", "") for v in values], errors="ignore")
        for col, values in dict.items()
    }
)
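If the dataframe already exists (as in the question, where the asterisks are already stripped in ss), a one-line sketch of the same idea is to apply to_numeric across all columns; errors="ignore" leaves non-numeric columns untouched:
ss = ss.apply(pandas.to_numeric, errors="ignore")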

Converting column dtypes as 'objects' to 'floats or 'ints [duplicate]


column with missing values to INT dtype without modifying the missing values [duplicate]


Can't drop NAN with dropna in pandas

I import pandas as pd, run the code below, and get the following result.
Code:
traindataset = pd.read_csv('/Users/train.csv')
print(traindataset.dtypes)
print(traindataset.shape)
print(traindataset.iloc[25, 3])
traindataset.dropna(how='any')
print(traindataset.iloc[25, 3])
print(traindataset.shape)
Output
TripType                   int64
VisitNumber                int64
Weekday                   object
Upc                      float64
ScanCount                  int64
DepartmentDescription     object
FinelineNumber           float64
dtype: object
(647054, 7)
nan
nan
(647054, 7)
[Finished in 2.2s]
From the result, the dropna line doesn't work: the row count doesn't change and there are still NaNs in the dataframe. How does that happen? It's driving me crazy.
You need to read the documentation (emphasis added):
Return object with labels on given axis omitted
dropna returns a new DataFrame. If you want it to modify the existing DataFrame, all you have to do is read further in the documentation:
inplace : boolean, default False
If True, do operation inplace and return None.
So to modify it in place, do traindataset.dropna(how='any', inplace=True).
pd.DataFrame.dropna uses inplace=False by default. This is the norm with most Pandas operations; exceptions do exist, e.g. update.
Therefore, you must either assign back to your variable, or state explicitly inplace=True:
df = df.dropna(how='any') # assign back
df.dropna(how='any', inplace=True) # set inplace parameter
Stylistically, the former is often preferred as it supports operator chaining, and the latter often does not yield any or significant performance benefits.
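For instance, the assign-back form chains naturally (a small sketch reusing the question's file path):
traindataset = (pd.read_csv('/Users/train.csv')
                  .dropna(how='any')
                  .reset_index(drop=True))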
Alternatively, you can also use the notnull() method to select the rows which are not null.
For example if you want to select Non null values from columns country and variety of the dataframe reviews:
answer=reviews.loc[(reviews.country.notnull()) & (reviews.variety.notnull())]
But here we are just selecting the relevant data; to remove null values you should use the dropna() method.
This is my first post. I just spent a few hours debugging this exact issue and I would like to share how I fixed it.
I was converting my entire dataframe to strings and then placing those values back into the dataframe using code similar to what is displayed below (please note, the code below only converts the values to strings):
row_counter = 0
for ind, row in dataf.iterrows():
    cell_value = str(row['column_header'])
    dataf.loc[row_counter, 'column_header'] = cell_value
    row_counter += 1
After converting the entire dataframe to a string, I then used the dropna() function. The values that were previously NaN (considered a null value by pandas) were converted to the string 'nan'.
In conclusion, drop blank values FIRST, before you start manipulating data in the CSV and converting its data type.
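A small sketch of that gotcha: once a NaN has been cast to a string, pandas no longer treats it as missing, so dropna() removes nothing.
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan]).astype(str)  # NaN becomes the literal string 'nan'
print(s.tolist())        # ['1.0', 'nan']
print(s.dropna().shape)  # (2,) -- nothing is dropped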
