Find non-numeric values in pandas dataframe column - python

I have a column in a dataframe that contains numbers and strings. So I replaced the strings with numbers via df.column.replace(["A", "B", "C", "D"], [1, 2, 3, 4], inplace=True).
But the column is still dtype "object" and I cannot sort it (TypeError: '<' not supported between instances of 'str' and 'int').
Now how can I identify those numbers that are strings? I tried print(df[pd.to_numeric(df['column']).isnull()]) and it gives back an empty dataframe, as expected. However, I read that this does not work in my case (actual numbers saved as strings). So how can I identify those numbers saved as strings?
Am I right that if a column only contains REAL numbers (int or float) it will automatically change to dtype int or float?
Thank you!

You can use pd.to_numeric with something like:
df['column'] = pd.to_numeric(df['column'], errors='coerce')
For the errors argument you have a few options; see the reference documentation here
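For example (made-up data), errors='coerce' turns entries that can't be parsed into NaN instead of raising an error:
import pandas as pd

s = pd.Series([1, 2, "3", "B"])
print(pd.to_numeric(s, errors="coerce"))  # "3" becomes 3.0, "B" becomes NaN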

Expanding on Francesco's answer, it's possible to create a mask of non-numeric values and identify unique instances to handle or remove.
This uses the fact that values which can't be coerced are treated as nulls (NaN).
is_non_numeric = pd.to_numeric(df['column'], errors='coerce').isnull()
df[is_non_numeric]['column'].unique()
Or alternatively in a single line:
df[pd.to_numeric(df['column'], errors='coerce').isnull()]['column'].unique()
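For example (hypothetical data), the mask picks out exactly the values that refuse to convert:
import pandas as pd

df = pd.DataFrame({"column": [1, "2", "A", 3.5, "D"]})  # made-up mixed column

is_non_numeric = pd.to_numeric(df["column"], errors="coerce").isnull()
print(df[is_non_numeric]["column"].unique())  # ['A' 'D']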

You can change the dtype with astype:
df['column'] = df['column'].astype(int)
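Note that astype(int) only works once the column is fully numeric; it raises a ValueError if any non-numeric strings remain (small made-up example):
import pandas as pd

pd.Series(["1", "2", "3"]).astype(int)  # works: every value parses as an integer
# pd.Series(["1", "B"]).astype(int)     # would raise ValueError: invalid literal for int()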

Changing dataframe column dtypes in Pandas

I am using df.columns to fetch the header of the dataframe and store it in a list. A is the list of the dataframe's header values.
A=list(df.columns)
But every element of the list is a string, even though my header also contains int values. Below is an example of the header:
A=['ABC','1345','Acv-1234']
But I want '1345' to end up in the list as an int, not as a string,
like this
A=['ABC',1345,'Acv-1234']
Can anyone suggest an approach for this?
A simple way to do it is to iterate through the columns and check whether the column name (a string) contains only digits (str.isdecimal()); if so, convert it to int, otherwise keep it as a string.
In one line:
A = [int(x) if x.isdecimal() else x for x in df.columns ]
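A quick self-contained check, using the headers from the question as sample data:
import pandas as pd

df = pd.DataFrame(columns=["ABC", "1345", "Acv-1234"])  # headers read in as strings

A = [int(x) if x.isdecimal() else x for x in df.columns]
print(A)  # ['ABC', 1345, 'Acv-1234']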
I suspect that '1345' is already a string in your df.columns before you assign them to the list A. You should look at the source of your df and at how the columns are assigned, in order to control the column types.
However, you can always change df.columns however you like at any time with:
df.columns=['ABC', 1345 ,'Acv-1234']

TypeError: Jupyter notebook

I am doing text preprocessing but it is challenging. Can someone explain why I get the type error below? I checked the type of the column and it is int, so what is wrong with the code?
I am using Jupyter Notebook.
fav = df[['favourites_count', 'text']].sort_values('favourites_count',
                                                   ascending=False)[:5].reset_index()
for i in range(5):
    print(i, fav['text'][i], '\n')
TypeError: '<=' not supported between instances of 'str' and 'int'
This is most likely due to your column favourites_count having mixed data types. I suggest you convert it to numeric before sorting:
df['favourites_count'] = pd.to_numeric(df['favourites_count'])
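As a quick sketch with made-up data (one count read in as a string), converting first makes the sort succeed:
import pandas as pd

df = pd.DataFrame({"favourites_count": [10, "250", 3],
                   "text": ["a", "b", "c"]})

df["favourites_count"] = pd.to_numeric(df["favourites_count"])
fav = df.sort_values("favourites_count", ascending=False)[:5].reset_index()
print(fav["favourites_count"].tolist())  # [250, 10, 3]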
When you sort your dataframe df along the column "favourites_count", the sorting algorithm compares the values in that column.
While comparing one numeric value with another, it must have come across a value that is not of an int data type.
Check the type of the column with following syntax:
df["favourites_count"].dtypes
If the output says
dtype('O')
That means the data in the column is mixed data.
As mentioned in https://stackoverflow.com/a/62833412/13905190, convert the datatype of "favourites_count" into numeric datatype using "pd.to_numeric()" function.
Now if you check the "dtypes" of your column, it should output:
dtype('int64')
Since you successfully converted the datatype of the numeric column, you can sort it without any errors.

Converting strings to ints in a DataFrame

How to convert a DataFrame column containing strings and "-" values to floats.
I have tried pd.to_numeric and pd.Series.astype(int) but I haven't had any success.
What do you recommend?
If I correctly understand, you want pandas to convert the string 7.629.352 to the float value 7629352.0 and the string (21.808.956) to the value -21808956.0.
For the first part, it is directly possible with the thousands parameter, and it is even possible to process - as a NaN:
m = read_csv(..., thousands='.', na_values='-')
The real problem is the parens for negative values.
You could use a Python function to convert the values (see the sketch after the code below). A possible alternative would be to post-process the dataframe column-wise:
m = read_csv(..., thousands='.', na_values='-')
for col in m.columns:
    if m[col].dtype == np.dtype('object'):
        # regex=True so the patterns are treated as regular expressions
        m[col] = (m[col].str.replace(r'\.', '', regex=True)
                        .str.replace(r'\((.*)\)', r'-\1', regex=True)
                        .astype('float64'))
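Taking the first suggestion, here is a hedged sketch of such a Python function, passed through read_csv's converters argument; the file name and column name are made up:
import pandas as pd

def parse_amount(s):
    # Turn '7.629.352' into 7629352.0 and '(21.808.956)' into -21808956.0;
    # treat a bare '-' as missing.
    s = s.strip()
    if s == '-':
        return float('nan')
    sign = -1.0 if s.startswith('(') and s.endswith(')') else 1.0
    return sign * float(s.strip('()').replace('.', ''))

m = pd.read_csv("data.csv", converters={"amount": parse_amount})  # hypothetical names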

Why did pandas give "0.66-0.36" when I tried to add two columns?

I am trying to do a simple summation with column name Tangible Book Value and Earnings Per Share:
df['price_asset_EPS'] = (df["Tangible Book Value"]) + (df["Earnings Per Share"])
However, the result doesn't evaluate the numbers; the plus sign seems to be ignored, as below:
0.66-0.36
1.440.0
What have I missed in between?
Looks like both columns are strings (not float):
0.66-0.36
1.440.0
see how '+' on those columns did string concatenation instead of addition? It concatenated "0.66" and "-0.36", then "1.44" and "0.0".
As to why those columns are strings not float, look at the dtype that pandas.read_csv gave them. There are many duplicate questions here telling you how to specify the right dtypes to read_csv.
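If the file is otherwise clean, a hedged sketch of forcing those dtypes at read time (the file name is made up; the column names are from the question):
import pandas as pd

df = pd.read_csv("data.csv",
                 dtype={"Tangible Book Value": "float64",
                        "Earnings Per Share": "float64"})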
Your columns are not being treated as numbers but strings. Try running df.dtypes. Against each column, you'll have its type. If you don't see a float or int, that means these columns have probably been read in as strings.
import pandas as pd
dff = pd.DataFrame([[1,'a'], [2, 'b']])
dff.dtypes
0 int64
1 object
Below I have created a dataframe with numbers inside quotes. Take a look at the dtypes.
dff = pd.DataFrame([['1','a'], ['2', 'b']])
dff.dtypes
0 object
1 object
Here you can see that the numbers column is not marked int/float because of the quotes. Now, if I take the sum of the first column
dff.iloc[:,0].sum()
'12'
I get '12', which is the same case as yours. To convert these columns to numeric, look into pd.to_numeric
dff.iloc[:,0] = pd.to_numeric(dff.iloc[:,0], errors='ignore')
dff.iloc[:,0].sum()
3
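Applied to the question's frame (made-up values shaped like the output above), converting both columns first makes the addition numeric:
import pandas as pd

df = pd.DataFrame({"Tangible Book Value": ["0.66", "1.44"],
                   "Earnings Per Share": ["-0.36", "0.0"]})  # numbers stored as strings

df["Tangible Book Value"] = pd.to_numeric(df["Tangible Book Value"], errors="coerce")
df["Earnings Per Share"] = pd.to_numeric(df["Earnings Per Share"], errors="coerce")
df["price_asset_EPS"] = df["Tangible Book Value"] + df["Earnings Per Share"]
print(df["price_asset_EPS"])  # now 0.30 and 1.44, as numbers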

Trouble calling `str.len` on a pandas object column

I have a Pandas DataFrame with a string column called title and I want to convert each row's entry to that string's length. So "abcd" would be converted to 4, etc.
I'm doing this:
result_df['title'] = result_df['title'].str.len()
But unfortunately, I get this error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
which seems to imply that I don't actually have strings in my column...
How should I go about this?
Thanks!
You're either trying to convert the whole column to str and not the values or have mixed types in the column. Try:
result_df['title'] = result_df['title'].apply(lambda x: len(str(x)))
If your column has strings and numeric data, you can first convert everything to strings and then get the length.
result_df['title'] = result_df['title'].astype(str).str.len()
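A quick sketch with made-up data showing the second approach; note the caveat that missing values become the string 'None' (length 4):
import pandas as pd

result_df = pd.DataFrame({"title": ["abcd", 123, None]})  # mixed-type column
result_df["title"] = result_df["title"].astype(str).str.len()
print(result_df["title"].tolist())  # [4, 3, 4]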
To find the data that is not a string, try this:
result_df.loc[result_df['title'].apply(
    lambda x: not isinstance(x, str)), 'title']
