pandas.read_csv: keep column as integer while having NaN values - python

I just converted to Python from R, and now I'm trying to read in data from a csv file.
I was very annoyed with all my integer columns being treated as floats, and after some digging I see that this is the problem:
NumPy or Pandas: Keeping array type as integer while having a NaN value
I see that the accepted answer gives me a hint as to where to go, but the problem is that I have data with hundreds of columns, as is typical when doing data science, I suppose. So I don't want to specify the type for every column when reading in data with read_csv. This is handled automatically in R.
Is it really this hard to use pandas to read in data in a proper way in Python?
Source: https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support

You can try using:
df = pd.read_csv('./file.csv', dtype='Int64')
Edit: So that doesn't work for strings. Instead, try something like this:
for col in df.columns[df.isna().any()].tolist():
    if df[col].dtype == 'float':
        df[col] = df[col].astype('Int64')
Loop through each column that has NA values, check whether its dtype is float, and convert those columns to Int64.
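If you are on pandas 1.0 or newer, convert_dtypes() can do this sweep in one call; a minimal sketch, assuming the same file as above:
import pandas as pd

df = pd.read_csv('./file.csv')
# convert_dtypes() re-infers each column using pandas' nullable extension
# types, so float columns holding only whole numbers and NaN become Int64.
df = df.convert_dtypes()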

Related

Pandas.read_sql_query reads int as float

I am exporting a SQL database into csv using pandas.read_sql_query + df.to_csv, and I have the problem that integer fields are represented as floats in the DataFrame.
My code:
import pandas
import pymysql

conn = pymysql.connect(host=server, port=port)
chunk = pandas.read_sql_query('''select * from table''', conn)
df = pandas.DataFrame(chunk)  # here int values are float
df.to_csv()
I am exporting a number of tables this way, and the problem is that int fields are exported as float (with a dot). Also, I don't know upfront which column has which type (the code is supposed to be generic for all tables).
However, my target is to export everything as is - in strings
The things I've tried (without a success):
df.applymap(str)
read_sql_query(coerce_float=False)
df.fillna('')
DataFrame(dtype=object) / DataFrame(dtype=str)
Of course, I can post-process the data afterwards to type-cast the integers, but it would be better to do it during the initial import.
UPD: My dataset has NULL values. They should be replaced with empty strings (as the purpose is to typecast all columns to strings)
Pandas infers the datatype from a sample of the data. If an integer column has null values, pandas assigns it the float datatype, because NaN is itself a float and because pandas is built on NumPy, which has no nullable integer type. So try making sure that you don't have any nulls, or that nulls are replaced with 0s if that makes sense in your dataset, 0 being an integer.
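A rough sketch of that first suggestion, with a hypothetical column name; note the column stays float after reading, so you have to cast it back explicitly:
# only sensible if 0 is a valid stand-in for NULL in your data
df['some_int_col'] = df['some_int_col'].fillna(0).astype(int)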
Another way to go about this is to specify the dtypes on data import, but you have to use a special kind of integer, such as Int64Dtype, as per the docs:
"If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas"
So I found a solution by post-processing the data (force-converting to strings) using applymap:
def convert_to_int(x):
    try:
        return str(int(x))  # e.g. 3.0 -> "3"
    except (ValueError, TypeError):
        return x  # leave non-numeric values untouched
df = df.applymap(convert_to_int)
This step takes significant time, but it solves my problem.
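If the applymap pass is too slow, a column-wise variant of the same idea may help; a hedged sketch that assumes the float columns really hold whole numbers (missing values in an Int64 column render as '<NA>' when cast to string, which we blank out):
import pandas as pd

for col in df.columns:
    if pd.api.types.is_float_dtype(df[col]):
        # 3.0 -> "3"; missing values become "<NA>", replaced by ""
        df[col] = df[col].astype('Int64').astype(str).replace('<NA>', '')
    else:
        df[col] = df[col].fillna('').astype(str)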

Need help with a sorting error in Pandas

I have a data frame that looks like this:
[screenshot: Pandas DF]
I exported it to Excel to be able to see it more easily. Basically I am trying to sort it by SeqNo ascending, and it isn't counting correctly. So instead of going 0,0,0,0,1,1,1,1,2,2,2,2 it's going 0,0,0,0,0,1,1,1,1,10,10,10,10. Please help out if possible. Here is the code that I have to sort it. I have tried many other methods but it just isn't sorting correctly.
final_df = df.sort_values(by=['SeqNo'])
Based on your description I think it is treating the column values as "String" instead of "int". You can confirm this by checking the datatype of your column (Ex: use df.info() to check datatype of all the columns in dataframe)
One option to resolve this is to convert that particular column from string to int before sorting and exporting to Excel, using pandas' to_numeric() function. Please check the pandas documentation for to_numeric() (and refer to https://www.linkedin.com/pulse/change-data-type-columns-pandas-mohit-sharma/ for a sample).
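Put together, that is only two lines; a minimal sketch:
import pandas as pd

df['SeqNo'] = pd.to_numeric(df['SeqNo'])
final_df = df.sort_values(by=['SeqNo'])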
First of all, run the command below to verify the type of the data you were given, because it's important to understand your data first:
print(df.dtypes)
This displays the datatypes of all columns. Find SeqNo in the output. If it shows something like:
SeqNo    object
dtype: object
then your data is in string format and you have to convert it to an integer or numeric format. There are two ways to do that:
1. With the astype(int) method:
df['SeqNo'] = df['SeqNo'].astype(int)
2. With the to_numeric method:
df['SeqNo'] = pd.to_numeric(df['SeqNo'])
After this step, verify that the datatype has changed by running print(df.dtypes) again; it should now show something like:
SeqNo     int32
dtype: object
Now you can sort in ascending order:
final_df = df.sort_values(by=['SeqNo'], ascending=True)

How to create mixed type data in pandas

This is a rather non-standard question. For educational purposes, I'm trying to create a mixed type column in a csv file, so that I get a warning message when importing the dataset in a pandas DataFrame and later on, deal with that column to show how it's done.
The problem is that I type 0s into a string column in Excel, save and close the file, but the clever pandas still imports that column as a string column, so it doesn't detect that there are in fact floats in it.
I also tried to change the format of only these 0s in pandas using astype('float'), exporting and re-importing. Still doesn't work.
Does anyone have an idea how I can create a column that pandas will read as mixed type?
Thanks in advance!
"I'm trying to create a mixed type column in a csv file, so that I get a warning message when importing the dataset in a pandas DataFrame"
Pandas will always infer the type of a column (a Series object), and this is always a single type. If every value in the column is a string, pandas will load it as a column of strings.
If there are "mixed" values that can't reasonably be loaded as strings, integers, etc., then the inferred type will simply be dtype: object. Which also means that you will get no warning.
You can force the type when loading the dataframe via the dtype parameter.
pd.read_csv("test_file.csv", index_col=0, dtype=int)
Now pandas will try to convert everything to int, and if there are values that can't be converted to int, you will get an exception such as
ValueError: invalid literal for int() with base 10: 'a'
when trying to load a dataset that contains the string 'a'. But again, this will not produce a warning; the operation will simply fail.
Here is how you can create a mixed column.
import pandas as pd

df = pd.DataFrame()
df["mix"] = ["a", "b", 1, True]  # strings, an int and a bool in one column
df.to_csv("test_file.csv")
df_again = pd.read_csv("test_file.csv", index_col=0)
print(df_again["mix"])
Type of the mix column is object
...
Name: mix, dtype: object
If you change the read_csv in the above code into
df_again = pd.read_csv("test_file.csv", index_col=0, dtype=int)
you will get the mentioned error.
You can create a DataFrame with mixed values.
pandas tries to guess the type of each column in chunks; if different chunks yield different guessed types, a warning is emitted but the values are preserved.
df = pd.DataFrame({'a': (['1'] * 100000 +
                         ['X'] * 100000 +
                         ['1'] * 100000),
                   'b': ['b'] * 300000})
df.to_csv('test.csv', index=False)
# when reading it pandas emits a DtypeWarning: Columns (0) have mixed types
df2 = pd.read_csv('test.csv')
>>> type(df2.iloc[262140, 0])
<class 'str'>
>>> type(df2.iloc[262150, 0])
<class 'int'>
It's probably not something you'd do in production-ready code, but it can be useful in tests and when debugging your code.
See the documentation: https://pandas.pydata.org/docs/reference/api/pandas.errors.DtypeWarning.html
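Conversely, if you hit this warning on real data and just want a consistent type, passing an explicit dtype sidesteps the chunked guessing; a minimal sketch:
# force column 'a' to be read as strings, so no chunk-by-chunk inference runs
df2 = pd.read_csv('test.csv', dtype={'a': str})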

Convert an object dtype column to a number dtype in a pandas DataFrame

While trying to answer the question Get List of Unique String per Column, we ran into a different problem with my dataset. When I import this CSV file into a dataframe, every column is of object type. We need to convert the columns that contain only numbers to a real (numeric) dtype, and leave those that are not numbers as strings.
Is there a way to achieve this?
Download the data sample from here
I have tried the following code from the article Pandas: change data type of columns, but it did not work.
df = pd.DataFrame(a, columns=['col1','col2','col3'])
As always thanks for your help
Option 1
use pd.to_numeric in an apply
df.apply(pd.to_numeric, errors='ignore')
Option 2
use pd.to_numeric on df.values.ravel()
import numpy as np

cvrtd = pd.to_numeric(df.values.ravel(), errors='coerce').reshape(-1, len(df.columns))
pd.DataFrame(np.where(np.isnan(cvrtd), df.values, cvrtd), df.index, df.columns)
Note
These are not exactly the same. For a column that contains mixed values, option 2 converts what it can, while option 1 leaves everything in that column as object. Looking at your file, I'd choose option 1.
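A tiny illustration of that difference, using a hypothetical column mixing digits and letters:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['1', 'x', '3']})

# Option 1: the column as a whole fails to convert, so it stays object.
opt1 = df.apply(pd.to_numeric, errors='ignore')
print(opt1['col1'].tolist())  # ['1', 'x', '3']

# Option 2: element-wise coercion keeps the numbers, falls back for 'x'.
cvrtd = pd.to_numeric(df.values.ravel(), errors='coerce').reshape(-1, len(df.columns))
opt2 = pd.DataFrame(np.where(np.isnan(cvrtd), df.values, cvrtd), df.index, df.columns)
print(opt2['col1'].tolist())  # [1.0, 'x', 3.0]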
Timing
df = pd.read_csv('HistorianDataSample/HistorianDataSample.csv', skiprows=[1, 2])

String problem / Select all values > 8000 in pandas dataframe

I want to select all values bigger than 8000 within a pandas dataframe.
new_df = df.loc[df['GM'] > 8000]
However, it is not working. I think the problem is that the values come from an Excel file and the numbers are interpreted as strings, e.g. "1.111,52". Do you know how I can convert such a string to float/int in order to compare it properly?
Taken from the documentation of pd.read_excel:
Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.
This means that pandas checks the type of the format stored in Excel. If the column was numeric in Excel, the conversion should work correctly. If your column was string, try:
df = pd.read_excel('filename.xlsx', thousands='.')
If you have a csv file, you can solve this by specifying thousands + decimal character:
df = pd.read_csv('filename.csv', thousands='.', decimal=',')
You can inspect df.dtypes to see the type of each column. Then, if a column's type is not what you want, you can change it with df['GM'].astype(float), after which new_df = df.loc[df['GM'].astype(float) > 8000] should work as you want.
You can convert the entire column to a numeric data type:
import pandas as pd
df['GM'] = pd.to_numeric(df['GM'])
You can check the data type of your column with df['GM'].dtype. To convert it to float, use astype as follows:
df['GM'].astype(float)
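If re-reading the file isn't an option, the thousands/decimal cleanup can also be done on the already-loaded column; a hedged sketch assuming values like "1.111,52":
# strip thousands separators, switch the decimal comma to a point, then cast
df['GM'] = (df['GM'].str.replace('.', '', regex=False)
                    .str.replace(',', '.', regex=False)
                    .astype(float))
new_df = df.loc[df['GM'] > 8000]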
