Python Pandas read_excel without converting int to float - python

I'm reading an Excel file:
df = pd.read_excel(r'C:\test.xlsx', 'Sheet0', skiprows = 1)
The Excel file contains a column formatted as General with a value like "405788". After reading it with pandas, the output looks like "405788.0", so it has been converted to float. I need every value as a string, without the values being changed. Can someone help me out with this?
[Edit]
If I copy the values into a new Excel file and load that, the integers are not converted to float. But I need to read the values correctly from the original file, so is there anything I can do?
The dtype and converters options change the type to str as I need, but the value is still rendered as a floating-point number with .0.

You can try the dtype parameter of the read_excel function:
df = pd.read_excel(r'C:\test.xlsx', 'Sheet0', skiprows = 1,
dtype={'Name': str, 'Value': str})
More information in the pandas docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
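As the edit to the question notes, dtype=str can still yield "405788.0" when the workbook itself stores the cell as a floating-point number. One possible workaround is a converter function that renders whole-number floats without the decimal part. The following is a minimal sketch, not a definitive fix: to_clean_str is a hypothetical helper name, and the demonstration runs on an in-memory Series instead of the original workbook.

```python
import pandas as pd


def to_clean_str(value):
    """Render a cell as text, dropping '.0' from whole-number floats.

    Hypothetical helper for illustration; missing cells become ''.
    """
    if pd.isna(value):
        return ""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


# Stand-in for the values pandas would hand to a converter.
cells = pd.Series([405788.0, 12.5, "abc", None], dtype=object)
cleaned = cells.map(to_clean_str)
print(cleaned.tolist())
```

Under the same assumptions, the helper could be passed to read_excel through its converters parameter, for example converters={'Value': to_clean_str}, where 'Value' is the column name used in the answer above.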

Related

How to stop Pandas converting integer to decimal when reading in an .xlsx file?

I have an .xlsx file that I am loading into a dataframe using the pd.read_excel method. However, when I do so, one of my columns appears to change format, with pandas adding a decimal point. Does anyone know why this is happening and how to stop it please?
Example of data in the .xlsx file:
191001
191002
191003
Example of the same data in the dataframe:
191001.0
191002.0
191003.0
The relevant column is using the 'General' format option in Excel.
I tried removing the decimal point with the following method; however I got the error message "pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer".
df.column1 = df.column1.astype(int)
Any help would be appreciated!
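Your column most likely contains NaN values (for example from empty cells) or infinite values. This is also why pandas reads it as float: a plain integer column cannot hold NaN.
You will need to replace them first: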
import numpy as np
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.fillna(0, inplace = True)
df.column1 = df.column1.astype(int)
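Filling the gaps with 0 changes the data, which may not be wanted. If the missing values should stay missing, an alternative is the nullable integer dtype "Int64" (capital I). A minimal sketch, assuming a column named column1 as in the question and sample values made up for illustration:

```python
import numpy as np
import pandas as pd

# Sample data standing in for the column read from Excel.
df = pd.DataFrame({"column1": [191001.0, 191002.0, np.nan]})

# "Int64" is pandas' nullable integer dtype: whole numbers plus <NA>.
df["column1"] = df["column1"].astype("Int64")
print(df["column1"])
```

The cast only succeeds when every non-missing value is a whole number; infinite values would still have to be replaced beforehand.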

Errors reading CSV with Pandas

I have a dataset of 100 million rows that I need to analyze. I use this function to read the file:
csv2020=pd.read_csv('filename.txt',
sep="\t",
error_bad_lines=False,
usecols=['field1', 'field2', 'field3', 'field4'],
dtype={'field1': int,'field2': float, 'field3': float, 'field4': float})
But I'm getting an error saying that a value in one of the lines cannot be converted to a float:
ValueError: could not convert string to float: 'ORCH'
I would like to omit any lines where this error occurs, but I don't know how, apart from the error_bad_lines argument. Help?
Thanks!
The error_bad_lines option is not meant for this purpose: it only applies to lines with an incorrect number of fields. (It was also deprecated in pandas 1.3 in favor of on_bad_lines.)
Read your file without the dtype option and do the conversion afterwards using pandas.to_numeric with the errors='coerce' option:
df = pd.read_csv(…)
df['field1'] = pd.to_numeric(df['field1'], errors='coerce')
df['field2'] = …
Some of the columns you are trying to import as float contain strings and therefore cannot be converted.
Read the CSV first without the dtype argument and inspect the resulting dataframe.
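To actually omit the offending lines, the coerced NaN values can be dropped after conversion. The following is a rough sketch under stated assumptions: the tab-separated sample text is invented for illustration and the column names are taken from the question.

```python
import io

import pandas as pd

# Invented sample standing in for 'filename.txt'; the second row is bad.
raw = io.StringIO(
    "field1\tfield2\tfield3\tfield4\n"
    "1\t1.5\t2.5\t3.5\n"
    "2\tORCH\t4.0\t5.0\n"
    "3\t6.0\t7.0\t8.0\n"
)
df = pd.read_csv(raw, sep="\t")

numeric_cols = ["field1", "field2", "field3", "field4"]
# Unparseable entries such as 'ORCH' become NaN instead of raising.
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
# Drop every row in which any of the columns failed to convert.
df = df.dropna(subset=numeric_cols)
df["field1"] = df["field1"].astype(int)
print(df)
```

Note that dropna also removes rows that were genuinely empty in the source, not only rows with stray text.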

Why pandas adding '.0' at the end of strings?

I'm processing a CSV file. The source file contains values such as '20190801'. Pandas detects the column as int or float depending on the file. Before writing the output, I convert all columns to string, and the dtypes show every column as object. But the output still contains .0 at the end. Why is that?
e.g: 20190801.0
for col in data.columns:
    data[col] = data[col].astype(str)
print(data.dtypes)  # prints every column's dtype as object
data.to_csv(neo_path, index=False)
I fixed it like this:
I added the converters parameter to make sure the problematic columns remain strings.
data = pd.read_csv(filepath, converters={"SiteCode":str,'Date':str,'Tank ID':str,'SIRA RECORD ID':str})
....
data.to_csv(neo_path,index=False)
With this in place I could drop the loop from my question that converted every column to string:
for col in data.columns:
    data[col] = data[col].astype(str)
That loop did not help when writing the output to CSV: a column that had already been read as float was converted to text including the trailing .0.
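A likely explanation, offered as an assumption since the source file is not shown: a column with a missing value is inferred as float, and astype(str) then formats each float with its decimal part. The sketch below uses an invented two-row CSV to contrast the inferred read with a read that keeps the column as text.

```python
import io

import pandas as pd

# Invented sample: the second row has an empty Date field.
csv_text = "Date,Amount\n20190801,5\n,7\n"

# Default inference: the empty field forces the column to float64.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["Date"].astype(str).iloc[0])

# Reading the column as text keeps the digits exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"Date": str})
print(as_text["Date"].iloc[0])
```

Declaring the type at read time, through dtype or converters, avoids the float stage entirely, which matches the fix described above.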

Avoid converting data to int automatically while reading using pandas data frame

I have a csv file with no headers. It has around 35 columns.
I am reading this file using pandas.
The issue is that when pandas reads the file, it automatically assigns a datatype to each column.
How can I avoid the automatic assignment of data types?
I have a column C which I want to store as string instead of int, but pandas automatically reads it as int.
I tried 2 things.
1)
my_df = pd.DataFrame()
my_df = pd.read_csv('my_csv_file.csv',names=['A','B','C'...'Z'],converters={'C':str},engine = 'python')
The above code gives me this error:
ValueError: Expected 37 fields in line 1, saw 35
If I remove converters={'C':str},engine = 'python', there is no error.
2)
old_df['C'] = old_df['C'].astype(str)
The issue with this approach is that if the value in the column is '00123', it has already been converted to 123 and is then turned into '123'. It loses the leading zeroes, because pandas treats it as an integer.
Use the dtype or converters option of read_csv (see the read_csv documentation). This works regardless of whether you use the python engine:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1':['00123','00125'],'col2':[1,2],'col3':[1.0,2.0]})
df.to_csv('test.csv',index=False)
new_df = pd.read_csv('test.csv',dtype={'col1':str,'col2':np.int64,'col3':np.float64})
If you simply pass dtype=str, every column is read as a string (object dtype). You cannot do that with converters, because it expects a dictionary. You could substitute converters for dtype in the code above and get the same result.
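For the headerless file from the question, the same idea can be sketched as follows. The three-column sample and the column names are assumptions made for brevity; the real file has around 35 columns.

```python
import io

import pandas as pd

# Invented headerless sample; column C holds zero-padded codes.
csv_text = "1,1.0,00123\n2,2.0,00125\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,            # the file has no header row
    names=["A", "B", "C"],  # assumed column names
    dtype={"C": str},       # keep C as text, inference for the rest
)
print(df["C"].tolist())
print(df.dtypes)
```

Because only C is listed in dtype, the other columns still get their inferred numeric types.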

String problem / Select all values > 8000 in pandas dataframe

I want to select all values bigger than 8000 within a pandas dataframe.
new_df = df.loc[df['GM'] > 8000]
However, it is not working. I think the problem is that the values come from an Excel file and the numbers are interpreted as strings, e.g. "1.111,52". Do you know how I can convert such a string to float / int in order to compare it properly?
Taken from the documentation of pd.read_excel:
Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.
This means that pandas checks the cell type stored in Excel. If the column was numeric in Excel, the conversion should work correctly. If your column was stored as text, try:
df = pd.read_excel('filename.xlsx', thousands='.')
If you have a csv file, you can solve this by specifying thousands + decimal character:
df = pd.read_csv('filename.csv', thousands='.', decimal=',')
You can inspect df.dtypes to see the type of each column. If the column type is not what you want, you can change it with df['GM'].astype(float), and then new_df = df.loc[df['GM'].astype(float) > 8000] should work as expected.
You can convert the entire column to a numeric data type:
import pandas as pd
df['GM'] = pd.to_numeric(df['GM'])
You can see the data type of your column through its dtype attribute. To convert it to float, use the astype function as follows:
df['GM'].astype(float)
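Note that astype(float) and pd.to_numeric cannot parse a string like "1.111,52" directly, since the separators follow the European convention. If the data is already loaded as strings, one option is to normalize the separators first. A minimal sketch with invented sample values, assuming "." is always the thousands separator and "," the decimal mark:

```python
import pandas as pd

# Invented sample values in European number format.
df = pd.DataFrame({"GM": ["1.111,52", "9.250,00", "8.000,01"]})

df["GM"] = (
    df["GM"]
    .str.replace(".", "", regex=False)   # drop thousands separators
    .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
    .astype(float)
)

new_df = df.loc[df["GM"] > 8000]
print(new_df)
```

The regex=False arguments matter here: with regular-expression matching, "." would match every character.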
