Attribute Error in Pandas using Strip When Column Empty - python

I have a column in a Pandas dataframe that sometimes contains blank rows. I want to use str.strip() to tidy up the rows that contain strings but this gives me the following error when a row is empty:
AttributeError: Can only use .str accessor with string values!
This is the code:
ts_df['Message'] = ts_df['Message'].str.strip()
How do I ignore the blank rows?

str.strip() can handle NaN values as long as your column contains only strings and NaN. So most probably your column is mixed with other non-string types (e.g. actual int or float values, not strings such as '1' or '1.0').
If you want to clean up the column and keep only string values, you can cast it to string with .astype(str). However, NaN would also be cast to the string 'nan', so replace NaN with an empty string via .fillna('') first, then cast to string, as follows:
ts_df['Message'] = ts_df['Message'].fillna('').astype(str).str.strip()
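A minimal sketch of that fix (the column name matches the question, but the sample values are invented):
import numpy as np
import pandas as pd

# hypothetical data reproducing the problem: strings mixed with NaN and a raw int
ts_df = pd.DataFrame({'Message': ['  hello ', np.nan, 42, ' world  ']})

# ts_df['Message'].str.strip() alone raises the AttributeError because of the int 42;
# fill NaN with '', cast everything to str, then strip
ts_df['Message'] = ts_df['Message'].fillna('').astype(str).str.strip()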

Maybe your column contains null values, which results in the dtype being float64 instead of str. Try converting the column to string first using astype(str):
ts_df['Message'] = ts_df['Message'].astype(str).str.strip()

Related

Changing dataframe column dtypes in Pandas

I am using df.columns to fetch the header of the dataframe and store it in a list. A is the list of the dataframe's header values.
A=list(df.columns)
But each element of the list is a string, while my header also has int values. Below is an example of the header:
A=['ABC','1345','Acv-1234']
But I want '1345' to come into the list as an int, not a string,
like this:
A=['ABC',1345,'Acv-1234']
Can anyone suggest an approach for this?
A simple way to do it is to iterate through the columns and check whether the column name (a string) contains only digits (str.isdecimal()); if so, convert it to int, otherwise keep it as a string.
In one line:
A = [int(x) if x.isdecimal() else x for x in df.columns ]
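For example (a small made-up DataFrame, assuming every column label is a string so that .isdecimal() applies):
import pandas as pd

df = pd.DataFrame(columns=['ABC', '1345', 'Acv-1234'])

A = [int(x) if x.isdecimal() else x for x in df.columns]
print(A)  # ['ABC', 1345, 'Acv-1234']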
I suspect that '1345' is already a string in your df.columns before you assign them to list A. You should look at the source of your df, and how its columns are assigned, in order to control the column types.
However, you can always change df.columns whenever you want with:
df.columns = ['ABC', 1345, 'Acv-1234']

Find non-numeric values in pandas dataframe column

I have a column in a dataframe that contains numbers and strings. So I replaced the strings with numbers via df.column.replace(["A", "B", "C", "D"], [1, 2, 3, 4], inplace=True).
But the column still has dtype "object", and I cannot sort it (TypeError: '<' not supported between instances of 'str' and 'int').
Now, how can I identify those numbers that are stored as strings? I tried print(df[pd.to_numeric(df['column']).isnull()]) and it gives back an empty dataframe, as expected. However, I read that this does not work in my case (actual numbers saved as strings). So how can I identify those numbers saved as strings?
Am I right that if a column only contains real numbers (int or float) it will automatically change to dtype int or float?
Thank you!
You can use pd.to_numeric with something like:
df['column'] = pd.to_numeric(df['column'], errors='coerce')
For the errors argument you have a few options; see the reference documentation.
Expanding on Francesco's answer, it's possible to create a mask of non-numeric values and identify unique instances to handle or remove.
This uses the fact that where values can't be coerced, they are treated as nulls.
is_non_numeric = pd.to_numeric(df['column'], errors='coerce').isnull()
df[is_non_numeric]['column'].unique()
Or alternatively in a single line:
df[pd.to_numeric(df['column'], errors='coerce').isnull()]['column'].unique()
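As a rough end-to-end sketch (the column name follows the question, the sample values are made up):
import pandas as pd

# mix of real ints, a number stored as a string, and a leftover letter
df = pd.DataFrame({'column': [1, 2, '3', 'B', 4]})

mask = pd.to_numeric(df['column'], errors='coerce').isnull()
print(df.loc[mask, 'column'].unique())  # ['B'] -- values to_numeric cannot parse

# note: numbers saved as strings (like '3') parse fine, so to single those out
# you can test the element types instead
print(df['column'].apply(lambda x: isinstance(x, str)))

# once the strays are handled, coerce the whole column to numeric
df['column'] = pd.to_numeric(df['column'], errors='coerce')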
You can change the dtype with astype:
df['column'] = df['column'].astype(int)

Converting strings to ints in a DataFrame

How do I convert a DataFrame column containing strings and "-" values to floats?
I have tried pd.to_numeric and pd.Series.astype(int), but without success.
What do you recommend?
If I correctly understand, you want pandas to convert the string 7.629.352 to the float value 7629352.0 and the string (21.808.956) to the value -21808956.0.
For the first part, it is directly possible with the thousands parameter, and you can even treat - as NaN via na_values:
m = pd.read_csv(..., thousands='.', na_values='-')
The real problem is the parentheses used for negative values.
You could use a Python function to convert the values. A possible alternative would be to post-process the dataframe column-wise:
m = pd.read_csv(..., thousands='.', na_values='-')
for col in m.columns:
    if m[col].dtype == np.dtype('object'):
        m[col] = (m[col].str.replace('.', '', regex=False)
                        .str.replace(r'\((.*)\)', r'-\1', regex=True)
                        .astype('float64'))
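A self-contained sketch of the same idea (the separator, column names, and sample file contents are assumptions for illustration):
import io
import numpy as np
import pandas as pd

# hypothetical data: '.' as thousands separator, parentheses for negatives, '-' for missing
csv_data = "a;b\n7.629.352;-\n(21.808.956);1.234\n"

m = pd.read_csv(io.StringIO(csv_data), sep=';', thousands='.', na_values='-')

for col in m.columns:
    if m[col].dtype == np.dtype('object'):
        m[col] = (m[col].str.replace('.', '', regex=False)
                        .str.replace(r'\((.*)\)', r'-\1', regex=True)
                        .astype('float64'))

print(m)  # column a becomes 7629352.0 and -21808956.0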

Trouble calling `str.len` on a pandas object column

I have a Pandas DataFrame with a string column called title and I want to convert each row's entry to that string's length. So "abcd" would be converted to 4, etc.
I'm doing this:
result_df['title'] = result_df['title'].str.len()
But unfortunately, I get this error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
which seems to imply that I don't actually have strings in my column...
How should I go about this?
Thanks!
You're either trying to convert the whole column to str and not the values or have mixed types in the column. Try:
result_df['title'] = result_df['title'].apply(lambda x: len(str(x)))
If your column has strings and numeric data, you can first convert everything to strings and then get the length.
result_df['title'] = result_df['title'].astype(str).str.len()
To find the data that is not a string, try this:
result_df.loc[result_df['title'].apply(lambda x: not isinstance(x, str)), 'title']
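Putting those together on a toy frame (the sample titles are made up), a rough sketch:
import numpy as np
import pandas as pd

# hypothetical data: a title column polluted with NaN and a number
result_df = pd.DataFrame({'title': ['abcd', np.nan, 123]})

# show the rows that are not strings
print(result_df.loc[result_df['title'].apply(lambda x: not isinstance(x, str)), 'title'])

# cast everything to str, then take the length (NaN becomes the 3-character string 'nan')
result_df['title'] = result_df['title'].astype(str).str.len()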

Pyspark: Remove UTF null character from pyspark dataframe

I have a pyspark dataframe similar to the following:
df = sql_context.createDataFrame([
    Row(a=3, b=[4, 5, 6], c=[10, 11, 12], d='bar', e='utf friendly'),
    Row(a=2, b=[1, 2, 3], c=[7, 8, 9], d='foo', e=u'ab\u0000the')
])
Where one of the values for column e contains the UTF null character \u0000. If I try to load this df into a postgresql database, I get the following error:
ERROR: invalid byte sequence for encoding "UTF8": 0x00
which makes sense. How can I efficiently remove the null character from the pyspark dataframe before loading the data into postgres?
I have tried using some of the pyspark.sql.functions to clean the data first, without success. encode, decode, and regexp_replace did not work:
df.select(regexp_replace(col('e'), u'\u0000', ''))
df.select(encode(col('e'), 'UTF-8'))
df.select(decode(col('e'), 'UTF-8'))
Ideally, I would like to clean the entire dataframe without specifying exactly which columns or what the violating character is, since I don't necessarily know this information ahead of time.
I am using a postgres 9.4.9 database with UTF8 encoding.
Ah wait - I think I have it. If I do something like this, it seems to work:
null = u'\u0000'
new_df = df.withColumn('e', regexp_replace(df['e'], null, ''))
And then mapping to all string columns:
string_columns = ['d', 'e']
new_df = df.select(
    *(regexp_replace(col(c), null, '').alias(c) if c in string_columns else c
      for c in df.columns)
)
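If you'd rather not hardcode string_columns, one possible sketch (assuming only string-typed columns need cleaning) is to read them off the DataFrame schema:
from pyspark.sql.functions import col, regexp_replace
from pyspark.sql.types import StringType

# pick up every column whose declared type is string
string_columns = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]

new_df = df.select(
    *(regexp_replace(col(c), u'\u0000', '').alias(c) if c in string_columns else col(c)
      for c in df.columns)
)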
You can use DataFrame.fillna() to replace null values.
Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.
Parameters:
value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, or string.
subset – optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.
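For example, a minimal sketch (using the column names from the question) that replaces nulls in the string columns with an empty string:
# nulls in the string columns d and e become ''; non-string columns in subset are ignored
new_df = df.fillna('', subset=['d', 'e'])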
