String problem / Select all values > 8000 in pandas dataframe - python

I want to select all values bigger than 8000 within a pandas dataframe.
new_df = df.loc[df['GM'] > 8000]
However, it is not working. I think the problem is that the values come from an Excel file and the numbers are interpreted as strings, e.g. "1.111,52". Do you know how I can convert such a string to float/int in order to compare it properly?

Taken from the documentation of pd.read_excel:
Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.
This means that pandas checks how the value is stored in Excel. If it was numeric in Excel, the conversion should work correctly. If your column was stored as text, try:
df = pd.read_excel('filename.xlsx', thousands='.')
If you have a csv file, you can solve this by specifying thousands + decimal character:
df = pd.read_csv('filename.csv', thousands='.', decimal=',')
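If the file is already loaded and the column still contains strings like "1.111,52", a minimal cleanup sketch, assuming a dot thousands separator, a comma decimal separator, and the GM column from the question, could look like this:
import pandas as pd

# Hypothetical example data in the European format described above
df = pd.DataFrame({'GM': ['1.111,52', '9.876,10', '812,00']})

# Drop the '.' thousands separator, swap the ',' decimal comma for a dot,
# then cast the cleaned strings to float
df['GM'] = (df['GM'].str.replace('.', '', regex=False)
                    .str.replace(',', '.', regex=False)
                    .astype(float))

new_df = df.loc[df['GM'] > 8000]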

You can check df.dtypes to see the type of each column. Then, if the column type is not what you want, you can change it with df['GM'].astype(float), and new_df = df.loc[df['GM'].astype(float) > 8000] should work as expected.

You can convert the entire column to a numeric data type:
import pandas as pd
df['GM'] = pd.to_numeric(df['GM'])
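Note that pd.to_numeric raises on strings it cannot parse, such as the "1.111,52" format from the question; a hedged variant that coerces unparseable values to NaN instead:
import pandas as pd

# errors='coerce' turns values that cannot be parsed into NaN instead of raising
df['GM'] = pd.to_numeric(df['GM'], errors='coerce')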

You can see the data type of your column by checking its dtype attribute (df['GM'].dtype). In order to convert it to float, use the astype function as follows:
df['GM'].astype(float)

Related

Python Pandas read_excel without converting int to float

I'm reading an Excel file:
df = pd.read_excel(r'C:\test.xlsx', 'Sheet0', skiprows = 1)
The Excel file contains a column formatted as General with values like "405788". After reading it with pandas, the output looks like "405788.0", so it has been converted to float. I need every value as a string without changing the values. Can someone help me out with this?
[Edit]
If I copy the values into a new Excel file and load that, the integers do not get converted to float. But I need the correct values from the original file, so is there anything I can do?
The dtype and converters options change the type to str as I need, but the value still appears as a floating-point number with .0
You can try the dtype parameter of the read_excel method.
df = pd.read_excel(r'C:\test.xlsx', 'Sheet0', skiprows=1,
                   dtype={'Name': str, 'Value': str})
More information in the pandas docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
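If the column has already been read as float (the situation in the edit above), one possible workaround, assuming the values are whole numbers, is a round-trip through pandas' nullable integer dtype:
import pandas as pd

# 'Int64' (capital I) is the nullable integer dtype, so missing values survive
# the cast as <NA>; 405788.0 becomes 405788 and then the string '405788'
df['Value'] = df['Value'].astype('Int64').astype(str)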

Need help with a sorting error in Pandas

I have a data frame that looks like this:
[screenshot of the dataframe]
I exported it to Excel to be able to see it more easily. Basically I am trying to sort it by SeqNo ascending, and it isn't ordering correctly: instead of going 0,0,0,0,1,1,1,1,2,2,2,2 it goes 0,0,0,0,0,1,1,1,1,10,10,10,10. Please help out if possible. Here is the code I use to sort it; I have tried many other methods, but it just isn't sorting correctly.
final_df = df.sort_values(by=['SeqNo'])
Based on your description, I think it is treating the column values as strings instead of ints. You can confirm this by checking the data type of your column (e.g. use df.info() to check the data types of all the columns in the dataframe).
One option is to convert that particular column from string to int before sorting and exporting to Excel, using pandas' to_numeric() function. Please check the pandas documentation for to_numeric() (refer to https://www.linkedin.com/pulse/change-data-type-columns-pandas-mohit-sharma/ for a sample).
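To see why string values sort this way, compare lexicographic and numeric ordering on a small made-up sample:
# Strings compare character by character, so '10' sorts before '2'
print(sorted(['0', '1', '10', '2']))  # ['0', '1', '10', '2']
# The same values as integers sort numerically
print(sorted([0, 1, 10, 2]))          # [0, 1, 2, 10]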
First of all, try the command given below to verify the data types, because it's important to understand your data first:
print(df.dtypes)
The command above will display the data types of all columns. Then look for the SeqNo data type. If the output for SeqNo shows something like:
SeqNo object
dtype: object
then your data is stored as strings and you have to convert it to an integer or numeric format. There are two ways to convert it:
1. With the astype(int) method:
df['SeqNo'] = df['SeqNo'].astype(int)
2. With the to_numeric method:
df['SeqNo'] = pd.to_numeric(df['SeqNo'])
After this step, verify that the data type has changed by typing print(df.dtypes) again; it should now show output similar to:
SeqNo int32
dtype: object
Now you can sort the data in ascending order:
final_df = df.sort_values(by=['SeqNo'], ascending=True)

How to convert datatype of all columns in a pandas dataframe

I have a pandas dataframe with 200+ columns, all of type int, and I need to convert them to float. I could not find a way to do it.
I tried
for column in X_data:
    X_data[column].astype('float64')
But after the for loop, when I print X_data.dtypes, all columns show as int only.
I also tried X_data = X_data.apply(pd.to_numeric) but it did not convert to float.
The dataframe is loaded from a CSV file.
If you want to convert specific columns to specific types you can use:
new_type_dict = {
    'col1': float,
    'col2': float,
}
df = df.astype(new_type_dict)
This will convert the selected columns to the new types.
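Since all 200+ columns need the same target type, a simpler sketch, assuming every column is numeric, is to cast the whole frame in one call:
# astype on the whole dataframe casts every column and returns a new frame
X_data = X_data.astype('float64')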
The values aren't being saved in place. Try the following:
for column in X_data:
    X_data[column] = X_data[column].astype('float64')

Why is pandas adding '.0' at the end of strings?

I'm processing a CSV file. The source file contains values like '20190801'. Pandas detects them as int or float depending on the file. Before writing the output, I convert all columns to string, and the dtypes show all columns as object, but the output still contains .0 at the end. Why is that?
e.g. 20190801.0
for col in data.columns:
    data[col] = data[col].astype(str)
print(data.dtypes)  # prints all column dtypes as object
data.to_csv(neo_path, index=False)
I fixed it like this:
I added the converters parameter, making sure all the problematic columns remain strings in my case.
data = pd.read_csv(filepath, converters={"SiteCode": str, 'Date': str, 'Tank ID': str, 'SIRA RECORD ID': str})
....
data.to_csv(neo_path,index=False)
This way I got rid of converting all column types to string, as I had tried in my question:
for col in data.columns:
    data[col] = data[col].astype(str)
That didn't work when writing the output to CSV: by the time astype(str) runs, the value is already the float 20190801.0, so the resulting string is '20190801.0'.
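If re-reading the file with converters is not an option, a post-hoc sketch, assuming the affected column (here 'Date', from the question) holds only whole numbers, is to strip the trailing .0 after stringifying:
# str(20190801.0) is '20190801.0'; remove the trailing '.0' with a regex
data['Date'] = data['Date'].astype(str).str.replace(r'\.0$', '', regex=True)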

Avoid converting data to int automatically while reading using pandas data frame

I have a csv file with no headers. It has around 35 columns.
I am reading this file using pandas.
Currently, the issue is that when pandas reads the file, it automatically assigns a data type to each column.
How can I avoid this automatic type assignment?
I have a column C that I want to store as string instead of int, but pandas automatically makes it int.
I tried 2 things.
1)
my_df = pd.DataFrame()
my_df = pd.read_csv('my_csv_file.csv',names=['A','B','C'...'Z'],converters={'C':str},engine = 'python')
The code above gives me this error:
ValueError: Expected 37 fields in line 1, saw 35
If I remove converters={'C':str}, engine='python', there is no error.
2)
old_df['C'] = old_df['C'].astype(str)
The issue with this approach is that if the value in the column is '00123', it has already been read as the integer 123, and this just converts it to '123'. The leading zeroes are lost because pandas thinks the column is integer.
Use the dtype or converters option of read_csv (see the read_csv documentation); this works regardless of whether you use the python engine:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['00123', '00125'], 'col2': [1, 2], 'col3': [1.0, 2.0]})
df.to_csv('test.csv', index=False)
new_df = pd.read_csv('test.csv', dtype={'col1': str, 'col2': np.int64, 'col3': np.float64})
If you simply use dtype=str, it will read every column in as a string (object). You cannot do that with converters, as it expects a dictionary. You could substitute converters for dtype in the code above and get the same result.
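For example, continuing the snippet above, dtype=str keeps the leading zeroes intact:
new_df = pd.read_csv('test.csv', dtype=str)
print(new_df['col1'].tolist())  # ['00123', '00125'] - leading zeroes preserved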
