Need help with a sorting error in pandas - Python

I have a dataframe that looks like this (screenshot of the dataframe omitted):
I exported it to Excel to be able to see it more easily. Basically, I am trying to sort it by SeqNo ascending, and it isn't counting correctly: instead of going 0,0,0,0,1,1,1,1,2,2,2,2 it goes 0,0,0,0,0,1,1,1,1,10,10,10,10. Please help out if possible. Here is the code I have to sort it. I have tried many other methods, but it just isn't sorting correctly.
final_df = df.sort_values(by=['SeqNo'])

Based on your description, pandas is treating the column values as strings instead of ints. You can confirm this by checking the datatype of the column (e.g. use df.info() to check the datatypes of all the columns in the dataframe).
One option is to convert that column from string to int before sorting and exporting to Excel, for example with pandas' to_numeric() function. See the pandas documentation for to_numeric() (and https://www.linkedin.com/pulse/change-data-type-columns-pandas-mohit-sharma/ for a sample).
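For instance, a minimal sketch (with made-up data) of converting before sorting:

```python
import pandas as pd

# String SeqNo values sort lexicographically, so '10' comes before '2'.
df = pd.DataFrame({'SeqNo': ['0', '1', '10', '2'], 'Val': ['a', 'b', 'c', 'd']})

# Convert the column to a numeric dtype first, then sort.
df['SeqNo'] = pd.to_numeric(df['SeqNo'])
final_df = df.sort_values(by=['SeqNo'])

print(final_df['SeqNo'].tolist())  # [0, 1, 2, 10]
```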

First, run the command below to verify the types of the data you were given, because it's important to understand your data first:
print(df.dtypes)
This displays the datatype of every column. Find the SeqNo datatype; if the output for SeqNo shows something like:
SeqNo     object
dtype: object
then your data is in string format and you have to convert it to an integer or numeric format. There are two ways to convert it:
1. With the astype(int) method:
df['SeqNo'] = df['SeqNo'].astype(int)
2. With the to_numeric method:
df['SeqNo'] = pd.to_numeric(df['SeqNo'])
After this step, verify that the datatype has changed by running print(df.dtypes) again; it should now show something like:
SeqNo      int32
dtype: object
Now you can sort the data in ascending order:
final_df = df.sort_values(by=['SeqNo'], ascending=True)

Related

pandas.read_csv: keep column as integer while having NaN values

I just converted to Python from R, and now I'm trying to read in data from a csv file.
I was very annoyed with all my integer columns being treated as floats, and after some digging I see that this is the problem:
NumPy or Pandas: Keeping array type as integer while having a NaN value
I see that the accepted answer gives me a hint as to where to go, but the problem is that I have data with hundreds of columns, as is typical in data science. So I don't want to specify the type for every single column when reading in data with read_csv. This is handled automatically in R.
Is it really this hard to use pandas to read in data in a proper way in Python?
Source: https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support
You can try using:
df = pd.read_csv('./file.csv', dtype='Int64')
Edit: So that doesn't work for strings. Instead, try something like this:
for col in df.columns[df.isna().any()].tolist():
    if df[col].dtype == 'float':
        df[col] = df[col].astype('Int64')
Loop through each column that has an NA value, check whether its type is float, and convert those columns to Int64.
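Put together as a runnable sketch (with made-up data), this converts only the NaN-carrying float columns to the nullable Int64 dtype:

```python
import pandas as pd

# Toy frame: 'a' became float only because of the missing value; 'b' is text.
df = pd.DataFrame({'a': [1.0, None, 3.0], 'b': ['x', 'y', None]})

# Convert only float columns that contain missing values to nullable Int64.
for col in df.columns[df.isna().any()].tolist():
    if df[col].dtype == 'float':
        df[col] = df[col].astype('Int64')

print(df['a'].dtype)  # Int64 -- integers again, with the gap kept as <NA>
```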

Python Dataframe: Using astype to change column type fails

I have been using astype to change the type of columns for a while. However, I ran into an unexpected result today. I have a column named modularity_class, and I am trying to convert it from float to int and assign it to a new column:
communities_to_analyze['modularity'] = communities_to_analyze['modularity_class'].astype(int)
However, this gives an interesting result:
>>> print(communities_to_analyze['modularity'][0])
94
>>> print(communities_to_analyze.iloc[0]['modularity'])
94.0
This looks so ridiculous. I am using pandas 1.1.1, and this has never happened to me before. I was wondering if anyone has run into the same problem before?
communities_to_analyze.iloc[0]['modularity']
Here you first access the first row of your dataframe. It is converted to a pandas Series, and if there is a float anywhere in the row, every value is upcast to float. So when you then access the index 'modularity' of that Series, it returns a float, not an integer.
communities_to_analyze['modularity'][0]
Here you do it the other way round: you first select the column 'modularity'. The values remain integers because there is no float in this column, so accessing the first value returns an integer.
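A small sketch (with a made-up second column) that reproduces the effect:

```python
import pandas as pd

# 'modularity' is int64; the float 'score' column forces the row Series to float.
df = pd.DataFrame({'modularity': [94, 7], 'score': [0.5, 0.6]})

print(df['modularity'][0])       # 94   -- column first, dtype stays int64
print(df.iloc[0]['modularity'])  # 94.0 -- row first, everything upcast to float
```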
communities_to_analyze['modularity_class'][0] returns the content of a single cell. If the column was read in as text, that cell holds the string '94.0', and calling int() on it (or on the whole Series) fails. In that case, convert the column element-wise, going through float first:
communities_to_analyze['modularity'] = communities_to_analyze['modularity_class'].astype(float).astype(int)

Receiving KeyError when converting a column from float to int

I created a pandas dataframe using read_csv.
I then changed the column name of the 0th column from 'Unnamed' to 'Experiment_Number'.
The values in this column are floating point numbers and I've been trying to convert them to integers using:
df['Experiment_Number'] = df['Experiment_Number'].astype(int)
I get this error:
KeyError: 'Experiment_Number'
I've been trying every way since yesterday, for example also
df['Experiment_Number'] = df.astype({'Experiment_Number': int})
and many other variations.
Can someone please help, I'm new using pandas and this close to giving up on this :(
Any help will be appreciated
I had used this for renaming the column before:
df.columns.values[0] = 'Experiment_Number'
This is likely the source of the KeyError: assigning into df.columns.values mutates the underlying array without rebuilding the column index, so the new label shows up when you print the dataframe but lookups like df['Experiment_Number'] can still fail. Rename through df.rename instead:
df = df.rename(columns={df.columns[0] : 'Experiment_Number'})
You can convert the type during your read_csv() call then rename it afterward. As in
df = pandas.read_csv(filename,
                     dtype={'Unnamed': 'float'},   # inform read_csv this field is float
                     converters={'Unnamed': int})  # apply the int() function
df.rename(columns={'Unnamed': 'Experiment_Number'}, inplace=True)
The dtype is not strictly necessary, because the converter will override it in this case, but it is wise to get in the habit of always providing a dtype for every field of your input. It is annoying, for example, how pandas treats integers as floats by default. Also, you may later remove the converters option without worry, if you specified dtype.

Convert an object dtype column to a numeric dtype in a pandas dataframe

While trying to answer the question Get List of Unique String per Column, we ran into a different problem with my dataset. When I import this CSV file into a dataframe, every column has object dtype. We need to convert the columns that contain only numbers to a real numeric dtype, and keep those that are not numbers as strings.
Is there a way to achieve this?
Download the data sample from here
I have tried the following code from the article Pandas: change data type of columns, but it did not work:
df = pd.DataFrame(a, columns=['col1','col2','col3'])
As always thanks for your help
Option 1
use pd.to_numeric in an apply
df.apply(pd.to_numeric, errors='ignore')
Option 2
use pd.to_numeric on df.values.ravel
cvrtd = pd.to_numeric(df.values.ravel(), errors='coerce').reshape(-1, len(df.columns))
pd.DataFrame(np.where(np.isnan(cvrtd), df.values, cvrtd), df.index, df.columns)
Note
These are not exactly the same. For a column that contains mixed values, option 2 converts what it can, while option 1 leaves everything in that column as object. Looking at your file, I'd choose option 1.
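To illustrate the difference on a small made-up frame, option 2's element-wise coercion keeps the parseable cells numeric even inside a mixed column:

```python
import numpy as np
import pandas as pd

# col1 is purely numeric strings; col2 mixes numbers and text.
df = pd.DataFrame({'col1': ['1', '2', '3'], 'col2': ['4', 'x', '6']})

# Option 2: coerce element-wise, then fall back to the original value
# wherever the parse failed (NaN in cvrtd).
cvrtd = pd.to_numeric(df.values.ravel(), errors='coerce').reshape(-1, len(df.columns))
out = pd.DataFrame(np.where(np.isnan(cvrtd), df.values, cvrtd), df.index, df.columns)

print(out['col2'].tolist())  # [4.0, 'x', 6.0] -- numbers converted, text kept
```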
Timing
df = pd.read_csv('HistorianDataSample/HistorianDataSample.csv', skiprows=[1, 2])

String problem / Select all values > 8000 in pandas dataframe

I want to select all values bigger than 8000 within a pandas dataframe.
new_df = df.loc[df['GM'] > 8000]
However, it is not working. I think the problem is that the values come from an Excel file and the numbers are interpreted as strings, e.g. "1.111,52". Do you know how I can convert such a string to float/int in order to compare it properly?
Taken from the documentation of pd.read_excel:
Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.
This means that pandas checks the type stored in Excel. If the column was numeric in Excel, the conversion should go correctly. If your column was stored as text, try:
df = pd.read_excel('filename.xlsx', thousands='.')
If you have a csv file, you can solve this by specifying thousands + decimal character:
df = pd.read_csv('filename.csv', thousands='.', decimal=',')
You can look at df.dtypes to see the type of each column. Then, if the column type is not what you want, you can change it with df['GM'].astype(float), and new_df = df.loc[df['GM'].astype(float) > 8000] should work as you want. Note that astype(float) only succeeds if the strings are plain numbers; for values like "1.111,52" you need the thousands/decimal handling shown above.
You can convert the entire column to a numeric data type:
import pandas as pd
df['GM'] = pd.to_numeric(df['GM'])
You can see the data type of your column with df.dtypes. To convert it to float, use the astype function as follows:
df['GM'].astype(float)
