Avoid converting data to int automatically while reading using pandas data frame - python

I have a csv file with no headers. It has around 35 columns.
I am reading this file using pandas.
Currently, issue is that when it reads the file, it automatically assigns datatype to each columns.
How to avoid assigning automatic data types?
I have a column C, which I want to store as string instead of int. But pandas automatically assigns it to int
I tried 2 things.
1)
my_df = pd.DataFrame()
my_df = pd.read_csv('my_csv_file.csv',names=['A','B','C'...'Z'],converters={'C':str},engine = 'python')
Above code gives me error
ValueError: Expected 37 fields in line 1, saw 35
If I remove, converters={'C':str},engine = 'python' there is no error
2)
old_df['C'] = old_df['C'].astype(int)
Issue with this approach is that, if the value in column is '00123', it has already been converted to 123 and then it converts it to '123'. It would lose initial Zeroes , because it thinks it is integer.

use dtype option or converters in read_csv read_csv doc, works regardless of using python engine or not:
df = pd.DataFrame({'col1':['00123','00125'],'col2':[1,2],'col3':[1.0,2.0]})
df.to_csv('test.csv',index=False)
new_df = pd.read_csv('test.csv',dtype={'col1':str,'col2':np.int64,'col3':np.float64})
If you simply use dtype=str then it will read every column in as a string (object). But you can not do that with converters as it expects a dictionary. You could substitute converters for dtype in above code and get same result.

Related

Python Pandas read_excel without converting int to float

im reading an Excel file:
df = pd.read_excel(r'C:\test.xlsx', 'Sheet0', skiprows = 1)
The Excel file contains a column formatted General and a value like "405788", after reading this with pandas the output looks like "405788.0" so its converted as float. I need any value as String without changing the values, can someone help me out with this?
[Edit]
If i copy the values in a new Excel file and load this, the integers does not get converted to float. But i need to get the Values correct of the original file, so is there anything i can do?
Options dtype and converted changes the type as i need in str but as a floating number with .0
You can try to use the dtype attribute of the read_excel method.
df = pd.read_excel(r'C:\test.xlsx', 'Sheet0', skiprows = 1,
dtype={'Name': str, 'Value': str})
More information in the pandas docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

Pandas.read_sql_query reads int as float

I am exporting SQL database into csv using Pandas.read_sql_query + df.to_csv and have a problem that integer fields are represented as float in DataFrame.
My code:
conn = pymysql.connect(server, port)
chunk = pandas.read_sql_query('''select * from table''', conn)
df = pandas.DataFrame(chunk) # here int values are float
df.to_csv()
I am exporting a number of tables this way and the problem is that int fields are exported as float (with a dot). Also, I don't know upfront which column has which type (code supposed to be generic for all tables)
However, my target is to export everything as is - in strings
The things I've tried (without a success):
df.applymap(str)
read_sql_query(coerce_float=False)
df.fillna('')
DataFrame(dtype=object) / DataFrame(dtype=str)
Of course, I can then post-process the data to type-cast integers, but would be better to do it during the initial import
UPD: My dataset has NULL values. They should be replaced with empty strings (as the purpose is to typecast all columns to strings)
Pandas infers the datatype from a sample of the data. If an integer column has null values, pandas assigns it the float datatype because NaN is of float type and because pandas is based on numpy which has no nullable integer type. So try making sure that you dont have any nulls or that nulls are replaced with 0's if it makes sense in your dataset, 0 being an integer.
ALSO, another way to go about this would be to specify the dtypes on data import, but you have to use a special kind of integer, like Int64Dtype, as per the docs:
"If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas"
So I found a solution by post-processing the data (force converting to strings) using applymap:
def convert_to_int(x):
try:
return str(int(x))
except:
return x
df = df.applymap(convert_to_int)
The step takes significant time for processing, however solves my problem

Pandas while importing ignores 0 from the values if it starts with 0 [duplicate]

I am importing study data into a Pandas data frame using read_csv.
My subject codes are 6 numbers coding, among others, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").
When I import into Pandas, the leading zero is stripped of and the column is formatted as int64.
Is there a way to import this column unchanged maybe as a string?
I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.
As indicated in this answer by Lev Landau, there could be a simple solution to use converters option for a certain column in read_csv function.
converters={'column_name': str}
Let's say I have csv file projects.csv like below:
project_name,project_id
Some Project,000245
Another Project,000478
As for example below code is trimming leading zeros:
from pandas import read_csv
dataframe = read_csv('projects.csv')
print dataframe
Result:
project_name project_id
0 Some Project 245
1 Another Project 478
Solution code example:
from pandas import read_csv
dataframe = read_csv('projects.csv', converters={'project_id': str})
print dataframe
Required result:
project_name project_id
0 Some Project 000245
1 Another Project 000478
To have all columns as str:
pd.read_csv('sample.csv', dtype=str)
To have certain columns as str:
# column names which need to be string
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
pd.read_csv('sample.csv', dtype=dict_dtypes)
here is a shorter, robust and fully working solution:
simply define a mapping (dictionary) between variable names and desired data type:
dtype_dic= {'subject_id': str,
'subject_number' : 'float'}
use that mapping with pd.read_csv():
df = pd.read_csv(yourdata, dtype = dtype_dic)
et voila!
If you have a lot of columns and you don't know which ones contain leading zeros that might be missed, or you might just need to automate your code. You can do the following:
df = pd.read_csv("your_file.csv", nrows=1) # Just take the first row to extract the columns' names
col_str_dic = {column:str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic) # Now you can read the compete file
You could also do:
df = pd.read_csv("your_file.csv", dtype=str)
By doing this you will have all your columns as strings and you won't lose any leading zeros.
You Can do This , Works On all Versions of Pandas
pd.read_csv('filename.csv', dtype={'zero_column_name': object})
You can use converters to convert number to fixed width if you know the width.
For example, if the width is 5, then
data = pd.read_csv('text.csv', converters={'column1': lambda x: f"{x:05}"})
This will do the trick. It works for pandas==0.23.0 and also read_excel.
Python3.6 or higher required.
I don't think you can specify a column type the way you want (if there haven't been changes reciently and if the 6 digit number is not a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there.
EDIT: Take a look at Wes Mckinney's blog, there might be something for you. It seems to be that there is a new parser from pandas 0.10 coming in November.
As an example, consider the following my_data.txt file:
id,A
03,5
04,6
To preserve the leading zeros for the id column:
df = pd.read_csv("my_data.txt", dtype={"id":"string"})
df
id A
0 03 5
1 04 6

Receiving KeyError when converting a column from float to int

created a pandas data frame using read_csv.
I then changed the column name of the 0th column from 'Unnamed' to 'Experiment_Number'.
The values in this column are floating point numbers and I've been trying to convert them to integers using:
df['Experiment_Number'] = df['Experiment_Number'].astype(int)
I get this error:
KeyError: 'Experiment_Number'
I've been trying every way since yesterday, for example also
df['Experiment_Number'] = df.astype({'Experiment_Number': int})
and many other variations.
Can someone please help, I'm new using pandas and this close to giving up on this :(
Any help will be appreciated
I had used this for renaming the column before:
df.columns.values[0] = 'Experiment_Number'
This should have worked. The fact that it didn't can only mean there were special characters/unprintable characters in your column names.
I can offer another possible suggestion, using df.rename:
df = df.rename(columns={df.columns[0] : 'Experiment_Number'})
You can convert the type during your read_csv() call then rename it afterward. As in
df = pandas.read_csv(filename,
dtype = {'Unnamed': 'float'}, # inform read_csv this field is float
converters = {'Unnamed': int}) # apply the int() function
df.rename(columns = {'Unnamed' : 'Experiment_Number'}, inplace=True)
The dtype is not strictly necessary, because the converter will override it in this case, but it is wise to get in the habit of always providing a dtype for every field of your input. It is annoying, for example, how pandas treats integers as floats by default. Also, you may later remove the converters option without worry, if you specified dtype.

String problem / Select all values > 8000 in pandas dataframe

I want to select all values bigger than 8000 within a pandas dataframe.
new_df = df.loc[df['GM'] > 8000]
However, it is not working. I think the problem is, that the value comes from an Excel file and the number is interpreted as string e.g. "1.111,52". Do you know how I can convert such a string to float / int in order to compare it properly?
Taken from the documentation of pd.read_excel:
Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.
This means that pandas checks the type of the format stored in excel. If this was numeric in Excel, the conversion should go correct. If your column was string, try to use:
df = pd.read_excel('filename.xlsx', thousands='.')
If you have a csv file, you can solve this by specifying thousands + decimal character:
df = pd.read_csv('filename.csv', thousands='.', decimal=',')
You can see value of df.dtypes to see what is the type of each column. Then, if the column type is not as you want to, you can change it by df['GM'].astype(float), and then new_df = df.loc[df['GM'].astype(float) > 8000] should work as you want to.
you can convert entire column data type to numeric
import pandas as pd
df['GM'] = pd.to_numeric(df['GM'])
You can see the data type of your column by using type function. In order to convert it to float use astype function as follows:
df['GM'].astype(float)

Categories

Resources