Errors reading CSV with Pandas - python

I have a dataset of 100 million rows that I need to analyze. I use this function to read the file:
csv2020=pd.read_csv('filename.txt',
sep="\t",
error_bad_lines=False,
usecols=['field1', 'field2', 'field3', 'field4'],
dtype={'field1': int,'field2': float, 'field3': float, 'field4': float})
But I'm getting an error saying that a value on one of the lines cannot be converted to a float:
ValueError: could not convert string to float: 'ORCH'
I would like to omit any lines where this error occurs, but I don't know how, other than with the error_bad_lines argument. Help?
Thanks!

The error_bad_lines option is not meant for this; it only applies to lines with an incorrect number of fields. (It was also deprecated in pandas 1.3 in favor of on_bad_lines and removed in 2.0.)
Read your file without the dtype option and do the conversion afterwards, using pandas.to_numeric with the errors='coerce' option:
df = pd.read_csv(…)
df['field1'] = pd.to_numeric(df['field1'], errors='coerce')
df['field2'] = …
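To actually omit the offending lines, as the question asks, one option is to drop the rows where coercion produced NaN. The following is a minimal sketch, assuming a small in-memory sample in place of the real file; the sample data and the names raw and cols are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical sample standing in for 'filename.txt'; 'ORCH' is the bad value.
raw = (
    "field1\tfield2\tfield3\tfield4\n"
    "1\t1.5\t2.5\t3.5\n"
    "2\tORCH\t4.5\t5.5\n"
    "3\t6.5\t7.5\t8.5\n"
)

cols = ['field1', 'field2', 'field3', 'field4']
df = pd.read_csv(io.StringIO(raw), sep="\t", usecols=cols)

# Coerce every column to numeric; strings that cannot be parsed become NaN.
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

# Drop the rows where any column failed to convert, then restore the int column.
df = df.dropna(subset=cols).astype({'field1': int})
print(df)
```

With the sample above, the row containing 'ORCH' is removed and the other two rows are kept.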

Some of the columns you are trying to import as float contain strings and therefore cannot be converted.
Read the CSV first without the dtype argument and inspect your dataframe.

Related

How to stop Pandas converting integer to decimal when reading in an .xlsx file?

I have an .xlsx file that I am loading into a dataframe using the pd.read_excel method. However, when I do so, one of my columns appears to change format, with pandas adding a decimal point. Does anyone know why this is happening and how to stop it please?
Example of data in the .xlsx file:
191001
191002
191003
Example of the same data in the dataframe:
191001.0
191002.0
191003.0
The relevant column is using the 'General' format option in Excel.
I tried removing the decimal point with the following method; however I got the error message "pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer".
df.column1 = df.column1.astype(int)
Any help would be appreciated!
Your file most likely has infinite or NaN values in that column.
You will need to replace them first:
import numpy as np
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.fillna(0, inplace = True)
df.column1 = df.column1.astype(int)
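If filling missing values with 0 would distort the data, an alternative is the pandas nullable integer dtype 'Int64' (capital I), which keeps missing values as <NA>. A sketch, assuming the column holds only whole numbers and NaN (no infinities); the sample values are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical column as read_excel might return it: floats because of the NaN.
df = pd.DataFrame({'column1': [191001.0, 191002.0, np.nan, 191003.0]})

# Nullable integer dtype: whole numbers become integers, NaN becomes <NA>.
df['column1'] = df['column1'].astype('Int64')
print(df)
```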

Python Pandas read_excel without converting int to float

I'm reading an Excel file:
df = pd.read_excel(r'C:\test.xlsx', 'Sheet0', skiprows = 1)
The Excel file contains a column formatted as General with a value like "405788". After reading it with pandas, the output looks like "405788.0", so it has been converted to float. I need every value as a string, unchanged. Can someone help me out with this?
[Edit]
If I copy the values into a new Excel file and load that, the integers do not get converted to float. But I need to read the values correctly from the original file, so is there anything I can do?
The dtype and converters options change the type to str as I need, but the value still comes out as a floating-point number with .0
You can try the dtype parameter of the read_excel method:
df = pd.read_excel(r'C:\test.xlsx', 'Sheet0', skiprows = 1,
dtype={'Name': str, 'Value': str})
More information in the pandas docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
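Since the question reports that dtype still yields strings ending in .0 for this particular file, a fallback is to clean the values after loading. This is only a sketch: to_clean_str is a hypothetical helper, and the sample frame stands in for what read_excel returned in the question.

```python
import pandas as pd

def to_clean_str(value):
    """Return str(value), dropping the '.0' from whole-number floats."""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)

# Hypothetical frame mimicking the float column from the question.
df = pd.DataFrame({'Value': [405788.0, 12.5, 7.0]})
df['Value'] = df['Value'].map(to_clean_str)
print(df['Value'].tolist())
```

Values that are not whole numbers, such as 12.5, are left as they are.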

Why pandas adding '.0' at the end of strings?

I'm processing a CSV file. The source file contains values such as '20190801'. Pandas detects the column as int or float depending on the file. Before writing the output, I convert all columns to string, and the dtypes show all columns as object. But the output still has .0 at the end. Why is that?
e.g: 20190801.0
for col in data.columns:
    data[col] = data[col].astype(str)
print(data.dtypes)  # prints all columns' dtypes as object
data.to_csv(neo_path, index=False)
I fixed it like this:
I added the converters parameter to make sure the problematic columns remain strings.
data = pd.read_csv(filepath, converters={"SiteCode":str,'Date':str,'Tank ID':str,'SIRA RECORD ID':str})
....
data.to_csv(neo_path,index=False)
With this in place, I removed the loop from my question that converted every column to string:
for col in data.columns:
    data[col] = data[col].astype(str)
That loop didn't help when writing the output to CSV: the values still came out with the float formatting.
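A likely explanation (an assumption, since the source file isn't shown) is that a column with any empty cell is read as float64, so 20190801 is already 20190801.0 before astype(str) runs, and the string conversion just preserves that. A small sketch with made-up data:

```python
import io
import pandas as pd

# Hypothetical CSV with one empty Date cell.
raw = "Date,SiteCode\n20190801,A1\n,B2\n20190803,C3\n"

# Default type inference: the empty cell forces the column to float64.
inferred = pd.read_csv(io.StringIO(raw))
print(inferred['Date'].dtype)                 # float64
print(inferred['Date'].astype(str).tolist())  # the '.0' is already baked in

# Reading the column as text keeps the original digits.
as_text = pd.read_csv(io.StringIO(raw), dtype={'Date': str})
print(as_text['Date'].tolist())
```

Reading the column as text at load time, whether with dtype or converters, avoids the float round trip altogether.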

Receiving KeyError when converting a column from float to int

I created a pandas DataFrame using read_csv.
I then changed the column name of the 0th column from 'Unnamed' to 'Experiment_Number'.
The values in this column are floating point numbers and I've been trying to convert them to integers using:
df['Experiment_Number'] = df['Experiment_Number'].astype(int)
I get this error:
KeyError: 'Experiment_Number'
I've been trying every way since yesterday, for example also
df['Experiment_Number'] = df.astype({'Experiment_Number': int})
and many other variations.
Can someone please help? I'm new to pandas and this close to giving up on this :(
Any help will be appreciated
I had used this for renaming the column before:
df.columns.values[0] = 'Experiment_Number'
This should have worked. The fact that it didn't can only mean there were special characters/unprintable characters in your column names.
I can offer another possible suggestion, using df.rename:
df = df.rename(columns={df.columns[0] : 'Experiment_Number'})
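Another plausible cause of the KeyError is that assigning into df.columns.values edits the underlying array in place, which can leave the index's internal lookup out of date. Going through df.rename avoids that. A sketch with made-up data, where the first header cell is empty so pandas generates the name 'Unnamed: 0':

```python
import io
import pandas as pd

# Hypothetical CSV whose first column has no header.
raw = ",result\n1.0,0.5\n2.0,0.7\n"
df = pd.read_csv(io.StringIO(raw))
print(df.columns.tolist())  # first name is the auto-generated 'Unnamed: 0'

# Rename by position, so the exact auto-generated name does not matter.
df = df.rename(columns={df.columns[0]: 'Experiment_Number'})

# The lookup now succeeds and the floats can be cast to int.
df['Experiment_Number'] = df['Experiment_Number'].astype(int)
print(df)
```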
You can convert the type during your read_csv() call and rename the column afterward:
df = pandas.read_csv(filename,
dtype = {'Unnamed': 'float'}, # inform read_csv this field is float
converters = {'Unnamed': int}) # apply the int() function
df.rename(columns = {'Unnamed' : 'Experiment_Number'}, inplace=True)
The dtype is not strictly necessary, because the converter overrides it in this case, but it is wise to get into the habit of always providing a dtype for every field of your input. It is annoying, for example, how pandas reads integer columns that contain missing values as floats by default. Also, if you specified dtype, you can later remove the converters option without worry.

String problem / Select all values > 8000 in pandas dataframe

I want to select all values bigger than 8000 within a pandas dataframe.
new_df = df.loc[df['GM'] > 8000]
However, it is not working. I think the problem is that the values come from an Excel file and the numbers are interpreted as strings, e.g. "1.111,52". Do you know how I can convert such a string to float/int in order to compare it properly?
Taken from the documentation of pd.read_excel:
Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.
This means that pandas checks the type stored in the Excel file. If the column is numeric in Excel, the conversion should work correctly. If your column is stored as text, try:
df = pd.read_excel('filename.xlsx', thousands='.')
If you have a csv file, you can solve this by specifying thousands + decimal character:
df = pd.read_csv('filename.csv', thousands='.', decimal=',')
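As a quick check of the CSV variant, here is a sketch with made-up data. The semicolon separator is an assumption; it is common in files that use the comma as the decimal mark.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data with European number formatting.
raw = "name;GM\na;1.111,52\nb;9.000,00\nc;12.345,67\n"

# '.' is stripped as the thousands separator, ',' is read as the decimal mark.
df = pd.read_csv(io.StringIO(raw), sep=';', thousands='.', decimal=',')
print(df.dtypes)

# The comparison now works on real floats.
new_df = df.loc[df['GM'] > 8000]
print(new_df)
```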
You can inspect df.dtypes to see the type of each column. If the column type is not what you want, you can change it with df['GM'].astype(float); then new_df = df.loc[df['GM'].astype(float) > 8000] should work as you want.
You can convert the entire column to a numeric data type:
import pandas as pd
df['GM'] = pd.to_numeric(df['GM'])
You can see the data type of your column through its dtype attribute. To convert it to float, use the astype function as follows:
df['GM'].astype(float)
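Note that a string such as "1.111,52" cannot be parsed by float() as-is, so astype(float) and pd.to_numeric will raise on it. If the data is already loaded as strings, one option is to normalize the separators first. A sketch with made-up values, assuming '.' is always the thousands separator and ',' the decimal mark:

```python
import pandas as pd

# Hypothetical column of European-formatted number strings.
df = pd.DataFrame({'GM': ['1.111,52', '9.000,00', '750,00']})

df['GM'] = (
    df['GM']
    .str.replace('.', '', regex=False)   # drop the thousands separators
    .str.replace(',', '.', regex=False)  # turn the decimal comma into a point
    .astype(float)
)

new_df = df.loc[df['GM'] > 8000]
print(new_df)
```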
