Trouble calling `str.len` on a pandas object column - python

I have a Pandas DataFrame with a string column called title and I want to convert each row's entry to that string's length. So "abcd" would be converted to 4, etc.
I'm doing this:
result_df['title'] = result_df['title'].str.len()
But unfortunately, I get this error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
which seems to imply that I don't actually have strings in my column...
How should I go about this?
Thanks!

Your column probably contains non-string values or mixed types. Try converting each value to a string before taking its length:
result_df['title'] = result_df['title'].apply(lambda x: len(str(x)))

If your column has strings and numeric data, you can first convert everything to strings and then get the length.
result_df['title'] = result_df['title'].astype(str).str.len()
To find the values that are not strings, try this (on Python 2, check against (str, unicode) instead of str):
result_df.loc[result_df['title'].apply(
    lambda x: not isinstance(x, str)), 'title']
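
To see both suggestions side by side, here is a minimal sketch on made-up data. The sample values in result_df are assumptions for illustration only, not taken from the question.

```python
import pandas as pd

# Hypothetical data: a "title" column mixing strings with an int and a float.
result_df = pd.DataFrame({"title": ["abcd", 123, "xy", 4.5]})

# Rows whose value is not a str (Python 3 has no separate unicode type).
non_strings = result_df.loc[
    result_df["title"].apply(lambda x: not isinstance(x, str)), "title"
]

# Cast every value to str first, then measure its length.
lengths = result_df["title"].astype(str).str.len()

print(non_strings.tolist())
print(lengths.tolist())
```

Inspecting non_strings first is usually worth it, since the length of str(123) may not be what you actually want for those rows.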

Related

Replace value and convert type of a dataframe as a value in a dict

I have the following dict:
I need to remove the "," that separates the thousands. So instead of 10,599.01, I need 10599.01. I also need to convert it to float.
I tried the following code, but it did not work:
assets["nikkei"] = assets["nikkei"].str.replace(',', '').astype(float)
How should I do it? Each value of the dict is a pandas DataFrame.

Attribute Error in Pandas using Strip When Column Empty

I have a column in a Pandas dataframe that sometimes contains blank rows. I want to use str.strip() to tidy up the rows that contain strings but this gives me the following error when a row is empty:
AttributeError: Can only use .str accessor with string values!
This is the code:
ts_df['Message'] = ts_df['Message'].str.strip()
How do I ignore the blank rows?
str.strip() can handle NaN values if your column contains only strings and NaN. So most probably your column is mixed with other non-string types (e.g. actual int or float values, not strings that happen to contain digits).
If you want to clean up the column and keep only string values, you can cast it with .astype(str). However, NaN will also be cast to the string 'nan'. Hence, replace NaN with an empty string using .fillna('') before casting, as follows:
ts_df['Message'] = ts_df['Message'].fillna('').astype(str).str.strip()
Maybe your column contains null values, which can result in the dtype being float64 instead of str. Try converting the column to string first using astype(str):
ts_df['Message'] = ts_df['Message'].astype(str).str.strip()
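
A small sketch of the difference between the two approaches, on assumed sample data (the values in ts_df are hypothetical). How a direct astype(str) treats missing values may depend on your pandas version, so the sketch only prints that result.

```python
import numpy as np
import pandas as pd

# Hypothetical data: padded strings, a missing value, and a stray integer.
ts_df = pd.DataFrame({"Message": ["  hello ", np.nan, 42, " bye"]})

# Casting directly, as noted above, can turn the missing value into 'nan'.
direct = ts_df["Message"].astype(str).str.strip()

# Filling missing values first keeps them as empty strings instead.
filled = ts_df["Message"].fillna("").astype(str).str.strip()

print(direct.tolist())
print(filled.tolist())
```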

Changing dataframe column dtypes in Pandas

I am using df.columns to fetch the header of the dataframe and storing it in a list. A is the list of the header values.
A=list(df.columns)
But each element of the list is a string, and my header also contains integer values. Below is an example of the header:
A=['ABC','1345','Acv-1234']
But I want '1345' to come into the list as an int, not as a string, like this:
A=['ABC',1345,'Acv-1234']
Can anyone suggest an approach for this?
A simple way to do it is to iterate through the columns and check whether each column name (a string) contains only digits (str.isdecimal()). If so, convert it to int; otherwise keep it as a string.
In one line:
A = [int(x) if x.isdecimal() else x for x in df.columns ]
I suspect that '1345' is already a string in your df.columns before you assign them to list A. You would need to look at the source of your df, and how the columns are assigned, in order to control the column types.
However, you can always change df.columns to whatever you want at any time with:
df.columns=['ABC', 1345 ,'Acv-1234']
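
Putting the two answers together, a minimal sketch might look like this. The one-row frame is an assumption made up for illustration; only the header values come from the question.

```python
import pandas as pd

# Hypothetical frame whose headers were all read in as strings.
df = pd.DataFrame([[1, 2, 3]], columns=["ABC", "1345", "Acv-1234"])

# Convert purely numeric labels to int; leave every other label untouched.
A = [int(x) if x.isdecimal() else x for x in df.columns]

# Assign the converted labels back so the frame itself uses them.
df.columns = A

print(A)
```

Note that after this, the column has to be selected with df[1345] rather than df['1345'].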

Find non-numeric values in pandas dataframe column

I have a column in a dataframe that contains numbers and strings. So I replaced the strings with numbers via df.column.replace(["A", "B", "C", "D"], [1, 2, 3, 4], inplace=True).
But the column still has dtype "object". I cannot sort the column (TypeError: '<' not supported between instances of 'str' and 'int').
Now how can I identify those numbers that are strings? I tried print(df[pd.to_numeric(df['column']).isnull()]) and it gives back an empty dataframe, as expected. However I read that this does not work in my case (actual numbers saved as strings). So how can I identify those numbers saved as a string?
Am I right that if a column only contains REAL numbers (int or float) it will automatically change to dtype int or float?
Thank you!
You can use pd.to_numeric with something like:
df['column'] = pd.to_numeric(df['column'], errors='coerce')
The errors argument accepts a few options, such as 'raise' (the default) and 'coerce'; see the pd.to_numeric reference documentation.
Expanding on Francesco's answer, it's possible to create a mask of non-numeric values and identify unique instances to handle or remove.
This uses the fact that values that can't be coerced are treated as nulls.
is_non_numeric = pd.to_numeric(df['column'], errors='coerce').isnull()
df[is_non_numeric]['column'].unique()
Or alternatively in a single line:
df[pd.to_numeric(df['column'], errors='coerce').isnull()]['column'].unique()
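You can also change the dtype by reassigning the column (this raises an error if any value cannot be converted to int):
df['column'] = df['column'].astype(int)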
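
The mask approach finds values that are not numeric at all, but numbers stored as strings parse cleanly and so slip through it. A rough sketch that also flags those by checking the Python type; the sample data and the names is_numeric_string and numeric_strings are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column mixing real numbers, numeric strings, and text.
df = pd.DataFrame({"column": [1, "2", "A", 3.5, "x7"]})

# Values that cannot be parsed as numbers become NaN under errors='coerce'.
is_non_numeric = pd.to_numeric(df["column"], errors="coerce").isnull()
non_numeric_values = df.loc[is_non_numeric, "column"].unique()

# Numbers stored as strings parse fine, so detect them by type instead.
is_str = df["column"].apply(lambda v: isinstance(v, str))
is_numeric_string = is_str & ~is_non_numeric
numeric_strings = df.loc[is_numeric_string, "column"].tolist()

print(list(non_numeric_values))
print(numeric_strings)
```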

Converting strings to ints in a DataFrame

How do I convert a DataFrame column containing strings and "-" values to floats?
I have tried pd.to_numeric and pd.Series.astype(int), but I haven't had any success.
What do you recommend?
If I understand correctly, you want pandas to convert the string 7.629.352 to the float value 7629352.0 and the string (21.808.956) to the value -21808956.0.
For the first part, it is directly possible with the thousands parameter, and it is even possible to process - as a NaN:
m = pd.read_csv(..., thousands='.', na_values='-')
The real problem is the parens for negative values.
You could use a Python function to convert the values. A possible alternative is to post-process the dataframe column by column:
import numpy as np
import pandas as pd
m = pd.read_csv(..., thousands='.', na_values='-')
for col in m.columns:
    # columns with parenthesised values stay as strings (object dtype)
    if m[col].dtype == np.dtype('object'):
        m[col] = m[col].str.replace('.', '', regex=False).str.replace(r'\((.*)\)', r'-\1', regex=True).astype('float64')
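
To try the post-processing step in isolation, here is a sketch that skips read_csv and builds the strings directly. The column name amount and its values are assumptions that mimic what a column with parenthesised negatives might look like after loading; the numeric-dtype check is a swapped-in alternative to comparing against the object dtype.

```python
import pandas as pd

# Hypothetical column as it might look after loading: strings plus a missing value.
m = pd.DataFrame({"amount": ["7.629.352", "(21.808.956)", None]})

for col in m.columns:
    # Only touch columns that are not already numeric.
    if not pd.api.types.is_numeric_dtype(m[col]):
        m[col] = (
            m[col]
            .str.replace(".", "", regex=False)             # drop thousands separators
            .str.replace(r"\((.*)\)", r"-\1", regex=True)  # (x) becomes -x
            .astype("float64")
        )

values = m["amount"].tolist()
print(values)
```

Passing regex explicitly avoids depending on the default of str.replace, which has differed between pandas versions.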
