TypeError: Jupyter notebook - python

I am doing text preprocessing but it is challenging. Can someone explain why I get the TypeError? I checked the type of the column and it is int, so what is wrong with the code?
I am using a Jupyter notebook.
fav = df[['favourites_count','text']].sort_values('favourites_count',
                                                  ascending=False)[:5].reset_index()
for i in range(5):
    print(i, fav['text'][i], '\n')
TypeError: '<=' not supported between instances of 'str' and 'int'

This is most likely due to your column favourites_count having mixed data types. I suggest you convert it to numeric before sorting:
df['favourites_count'] = pd.to_numeric(df['favourites_count'])

When you sort your dataframe df by the column "favourites_count", the sorting algorithm compares the values in that column.
As it compares one numeric value with another, it must have come across a value that is not of int type.
Check the type of the column with the following syntax:
df["favourites_count"].dtypes
If the output says
dtype('O')
That means the column holds mixed data.
As mentioned in https://stackoverflow.com/a/62833412/13905190, convert the datatype of "favourites_count" into numeric datatype using "pd.to_numeric()" function.
Now if you check the "dtypes" of your column, it should output:
dtype('int64')
Since you successfully converted the datatype of the numeric column, you can sort it without any errors.
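A minimal sketch of the whole fix on a made-up dataframe (the data below is invented just to reproduce the mixed-type situation, it is not the asker's actual data):

import pandas as pd

# Hypothetical frame: one stray string in favourites_count makes the dtype object
df = pd.DataFrame({
    'favourites_count': [120, '45', 300, 7, 98, 12],
    'text': ['a', 'b', 'c', 'd', 'e', 'f'],
})
print(df['favourites_count'].dtypes)   # object -> mixed data, sorting would raise TypeError

# Convert to numeric first, so the sort compares numbers with numbers
df['favourites_count'] = pd.to_numeric(df['favourites_count'])

fav = df[['favourites_count', 'text']].sort_values('favourites_count',
                                                   ascending=False)[:5].reset_index()
for i in range(5):
    print(i, fav['text'][i], '\n')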

Related

Filtering rows of pandas data frame by column value without specifying column name

I am trying to filter the rows of a data frame where the last column is less than 4, without specifying the column name.
See this question for context on the data frame: Convert the last column of a data frame from hex to decimal.
I used the solution provided by @not_speshal, but I am getting this error:
TypeError: '>=' not supported between instances of 'str' and 'int'.
How can I fix it?
The solution is wrong. When you use this:
df[df.columns[-2]<=4]
what you do is compare the name of the column at index -2 with 4, like "Color" <= 4, which gives an error.
Try this one:
df[df[df.columns[-2]]<=4]
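A quick illustration of the difference, using an invented toy frame (the column names here are assumptions, not the asker's data):

import pandas as pd

df = pd.DataFrame({'Color': ['red', 'blue'], 'Value': [3, 7], 'Hex': ['0x3', '0x7']})

print(df.columns[-2])            # 'Value' -- just the column *label*, a plain string
# df[df.columns[-2] <= 4]        # would compare the string 'Value' with 4 -> TypeError

mask = df[df.columns[-2]] <= 4   # compares the column's values with 4
print(df[mask])                  # keeps only the rows where Value <= 4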

Need help with a sorting error in Pandas

I have a data frame that looks like this:
[screenshot: Pandas DF]
I exported it to Excel to be able to see it more easily. Basically, I am trying to sort it by SeqNo ascending, but it isn't ordering correctly: instead of going 0,0,0,0,1,1,1,1,2,2,2,2 it goes 0,0,0,0,0,1,1,1,1,10,10,10,10. Please help if possible. Here is the code I have for sorting it; I have tried many other methods but it just isn't sorting correctly.
final_df = df.sort_values(by=['SeqNo'])
Based on your description, I think it is treating the column values as strings instead of ints. You can confirm this by checking the datatype of your column (e.g. use df.info() to check the datatypes of all the columns in the dataframe).
One option to resolve this is to convert that particular column from string to int before sorting and exporting to Excel. You can apply the pandas to_numeric() function before sorting and exporting to Excel. Please check the pandas documentation for to_numeric() (refer to https://www.linkedin.com/pulse/change-data-type-columns-pandas-mohit-sharma/ for a sample).
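A short sketch of that suggestion, using a tiny stand-in frame (the data and the output file name are illustrative only):

import pandas as pd

# Stand-in frame; in the real case SeqNo arrives as strings, which is why '10' sorts before '2'
df = pd.DataFrame({'SeqNo': ['0', '10', '2', '1'], 'Value': [4, 3, 2, 1]})

df['SeqNo'] = pd.to_numeric(df['SeqNo'])          # or df['SeqNo'].astype(int)
final_df = df.sort_values(by=['SeqNo'])           # now orders 0, 1, 2, 10 numerically
final_df.to_excel('sorted_output.xlsx', index=False)  # needs an Excel writer such as openpyxl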
First of all, try the command given below to verify the type of data you have been given, because it's important to understand your data first:
print(df.dtypes)
The command above displays the datatypes of all the columns. Then look for the SeqNo datatype. If the output for SeqNo shows something like:
SeqNo object
dtype: object
then your data is in string format and you have to convert it to an integer or numeric format. There are two ways to do the conversion:
1. By astype(int) Method
df['SeqNo'] = df['SeqNo'].astype(int)
2. By to_numeric Method
df['SeqNo'] = pd.to_numeric(df['SeqNo'])
After this step, verify that the datatype has actually changed by typing print(df.dtypes) again; it should now show output similar to:
SeqNo int32
dtype: object
Now you can print the data after sorting it in ascending order:
final_df = df.sort_values(by=['SeqNo'], ascending=True)

Python Dataframe: Using astype to change column type fails

I have been using astype to change the type of columns for a while. However, I ran into an unexpected result today. I have a column named modularity_class, and I am trying to convert it from float to int and assign it to a new column:
communities_to_analyze['modularity'] = communities_to_analyze['modularity_class'].astype(int)
However, this gives me an interesting result:
>>> print(communities_to_analyze['modularity'][0])
94
>>> print(communities_to_analyze.iloc[0]['modularity'])
94.0
This looks so ridiculous. I am using pandas 1.1.1, and this has never happened to me before. I was wondering if anyone has run into the same problem before?
communities_to_analyze.iloc[0]['modularity']
Here you first access the first row of your data frame, which is converted to a pandas Series. If there is a float somewhere in the row, every value is converted to float. Thus, if you then access the index 'modularity' of that Series, it will return a float and not an integer.
communities_to_analyze['modularity'][0]
Here you do it the other way round. First, select the column 'modularity'. The values remain integer because there is no float in this column.
Thus, if you then access the first value it will return an integer.
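A small sketch of that behaviour on an invented frame (the column names are assumptions):

import pandas as pd

df = pd.DataFrame({'modularity': [94, 12], 'score': [0.5, 0.7]})

print(df['modularity'][0])        # 94   -- the column stays int64, so the cell is an int
print(df.iloc[0]['modularity'])   # 94.0 -- the row Series is upcast because of the float column
print(df.iloc[0].dtype)           # float64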
df['modularity_class'][0] returns the content of one cell of the dataframe, in this case the string '94.0'. That str object cannot be assigned as an int directly; you can convert it like below:
communities_to_analyze['modularity'] = communities_to_analyze['modularity_class'].astype(float).astype(int)

Find non-numeric values in pandas dataframe column

I have a column in a dataframe that contains numbers and strings. So I replaced the strings with numbers via df.column.replace(["A", "B", "C", "D"], [1, 2, 3, 4], inplace=True).
But the column is still dtype "object", and I cannot sort it (TypeError: '<' not supported between instances of 'str' and 'int').
Now how can I identify those numbers that are strings? I tried print(df[pd.to_numeric(df['column']).isnull()]) and it gives back an empty dataframe, as expected. However, I read that this does not work in my case (actual numbers saved as strings). So how can I identify those numbers saved as strings?
Am I right that if a column only contains REAL numbers (int or float) it will automatically change to dtype int or float?
Thank you!
You can use pd.to_numeric with something like:
df['column'] = pd.to_numeric(df['column'], errors='coerce')
For the errors argument you have a few options; see the pandas reference documentation for details.
Expanding on Francesco's answer, it's possible to create a mask of non-numeric values and identify unique instances to handle or remove.
This uses the fact that where values can't be coerced, they are treated as nulls.
is_non_numeric = pd.to_numeric(df['column'], errors='coerce').isnull()
df[is_non_numeric]['column'].unique()
Or alternatively in a single line:
df[pd.to_numeric(df['column'], errors='coerce').isnull()]['column'].unique()
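For example, on a small invented frame with a column literally named 'column', this might look like:

import pandas as pd

df = pd.DataFrame({'column': [1, 2, '3', 'A', 'B', 4.5]})

is_non_numeric = pd.to_numeric(df['column'], errors='coerce').isnull()
print(df[is_non_numeric]['column'].unique())   # ['A' 'B'] -- '3' coerces fine, so only real non-numbers remain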
You can also change the dtype with astype:
df['column'] = df['column'].astype(int)

Receiving KeyError when converting a column from float to int

I created a pandas data frame using read_csv.
I then changed the column name of the 0th column from 'Unnamed' to 'Experiment_Number'.
The values in this column are floating point numbers and I've been trying to convert them to integers using:
df['Experiment_Number'] = df['Experiment_Number'].astype(int)
I get this error:
KeyError: 'Experiment_Number'
I've been trying every approach since yesterday, including, for example:
df['Experiment_Number'] = df.astype({'Experiment_Number': int})
and many other variations.
Can someone please help? I'm new to pandas and this close to giving up on this :(
Any help will be appreciated
You said you had used this for renaming the column before:
df.columns.values[0] = 'Experiment_Number'
This should have worked. The fact that it didn't can only mean there were special characters/unprintable characters in your column names.
I can offer another possible suggestion, using df.rename:
df = df.rename(columns={df.columns[0] : 'Experiment_Number'})
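As a rough sketch of that route on an invented frame (the float-to-int step mirrors what the question was attempting):

import pandas as pd

# Illustrative stand-in for the CSV: the first column is unnamed and comes in as floats
df = pd.DataFrame({'Unnamed: 0': [1.0, 2.0, 3.0], 'value': [10, 20, 30]})

df = df.rename(columns={df.columns[0]: 'Experiment_Number'})
df['Experiment_Number'] = df['Experiment_Number'].astype(int)
print(df.dtypes)   # Experiment_Number is now int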
You can convert the type during your read_csv() call and then rename it afterward, as in:
df = pandas.read_csv(filename,
                     dtype={'Unnamed': 'float'},   # inform read_csv this field is float
                     converters={'Unnamed': int})  # apply the int() function
df.rename(columns={'Unnamed': 'Experiment_Number'}, inplace=True)
The dtype is not strictly necessary, because the converter will override it in this case, but it is wise to get in the habit of always providing a dtype for every field of your input. It is annoying, for example, how pandas treats integers as floats by default. Also, you may later remove the converters option without worry, if you specified dtype.
