How to strip customized missing values from a pandas dataframe - Python

I have a dataset with a customized missing value, the character `?`, but cells containing the missing value also have an inconsistent number of surrounding space characters. As in my example picture, at row 11, it could have 3 spaces or 4 spaces.
So my idea was to apply str.strip() to each cell so that the value could be identified as missing and dropped, but it is still not recognized as a missing value:
df = pd.read_csv('full_name', header=None, na_values=['?'])
df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
df.dropna(axis=0, inplace=True, how='any')
df.head(20)
What is an efficient way to solve this?

dropna drops NaN values. Since your missing values are actually ?, you could replace them with NaN and then use dropna:
df = df.replace('?', np.nan).dropna()
Or mask them and use dropna:
df = df.mask(df.eq('?')).dropna()
Or simply filter those rows out and select only the rows without any ?:
df = df[df.ne('?').all(axis=1)]
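Putting the pieces together for the padded values from the question, here is a minimal sketch (the toy frame is invented) that treats any whitespace-padded ? as missing in a single pass, without a separate strip step:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['x', '  ? ', 'y'], 'b': ['1', '2', ' ?  ']})
# any cell that is '?' with arbitrary surrounding whitespace becomes NaN
df = df.replace(r'^\s*\?\s*$', np.nan, regex=True).dropna()
print(df)  # only the first row survives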

Use:
df = pd.DataFrame({'test': [1,2, ' ? ', ' ? ']})
df[~df['test'].str.contains(r'\?', na=False)]

Related

How to remove all rows of a dataframe column that contain a question mark instead of an occupation

This is my attempt:
df['occupation'] = df['occupation'].str.replace('?', '')
df.dropna(subset=['occupation'], inplace=True)
but it is not working. How do I remove all rows of the occupation column (read from a CSV file) that contain a ? rather than an occupation?
If you're reading the CSV with pd.read_csv(), you can pass na_values.
# to treat '?' as NaN in all columns:
pd.read_csv(fname, na_values='?')
# to treat '?' as NaN in just the occupation column:
pd.read_csv(fname, na_values={'occupation': '?'})
Then, you can dropna or fillna('') on that column as you see fit.
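For example, a minimal sketch with inline CSV data (the rows here are invented):
import io
import pandas as pd

csv = io.StringIO('name,occupation\nAlice,engineer\nBob,?\n')
df = pd.read_csv(csv, na_values={'occupation': '?'})
df = df.dropna(subset=['occupation'])  # keeps only Alice's row
# ...or keep the rows and blank out the column instead:
# df['occupation'] = df['occupation'].fillna('')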
Clean up the white space and use an 'unselect' filter:
import pandas as pd
bugs = ['grasshopper','cricket','ant','spider']
fruit = ['lemon','komquat','watermelon','apple']
squashed = [' ? ','Yes','No','Eww']
df = pd.DataFrame(list(zip(bugs,fruit,squashed)), columns = ['Bugs','Fruit','Squashed'])
print(df.head())
df = df[df['Squashed'].apply(lambda x: x.strip()) != '?']
print('after stripping white space and after unselect')
print(df.head())
Why
The dataframe method .dropna() won't detect blanks (i.e. ''), but will look for NaN, NaT, or None.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
However, using .replace() to set the value to missing won't work here, because .replace() requires the type to match, and None doesn't match any type you'll already have in the column.
Better to clean up the white space (which is the simple case) using lambda on each entry to apply the string transformation.
You can try this...
df = df[df.occupation != "?"]

Pandas: concatenate multiple columns together with pipes while skipping the empty values

Hi, I want to concatenate multiple columns together using pipes as the connector in pandas (Python), and skip a column when its value is blank.
I tried the following code; it does not skip the values when they are empty, and it still adds a '|' to connect with the other fields. What I want is to pass over the empty fields completely.
For example: currently it gives me 'N|911|WALLACE|AVE||||MT|031|000600',
while I want 'N|911|WALLACE|AVE|MT|031|000600'.
df['key'] = df[['fl_predir','fl_prim_range','fl_prim_name','fl_addr_suffix','fl_postdir','fl_unit_desig','fl_sec_range','fl_st','fl_fips_county','blk']].agg('|'.join, axis=1)
Can anybody help me with this?
One option is to join only the truthy (non-empty) values row by row:
cols = ['fl_predir','fl_prim_range','fl_prim_name','fl_addr_suffix','fl_postdir','fl_unit_desig','fl_sec_range','fl_st','fl_fips_county','blk']
df['key'] = df[cols].apply(lambda row: '|'.join(x for x in row if x), axis=1, raw=True)
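A quick check with a one-row frame built from the example in the question:
import pandas as pd

cols = ['fl_predir','fl_prim_range','fl_prim_name','fl_addr_suffix','fl_postdir',
        'fl_unit_desig','fl_sec_range','fl_st','fl_fips_county','blk']
row = ['N','911','WALLACE','AVE','','','','MT','031','000600']
df = pd.DataFrame([row], columns=cols)
df['key'] = df[cols].apply(lambda r: '|'.join(x for x in r if x), axis=1, raw=True)
print(df['key'][0])  # N|911|WALLACE|AVE|MT|031|000600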
You can use melt to flatten your dataframe, drop the null values, then group by index and finally concatenate the values:
cols = ['fl_predir', 'fl_prim_range', 'fl_prim_name', 'fl_addr_suffix' ,
'fl_postdir', 'fl_unit_desig', 'fl_sec_range', 'fl_st',
'fl_fips_county', 'blk']
df['key'] = (df[cols].melt(ignore_index=False)['value'].dropna()
             .astype(str).groupby(level=0).agg('|'.join))
Output:
>>> df['key']
0 N|911|WALLACE|AVE|MT|31|600
Name: key, dtype: object
Alternative (Pandas < 1.1.0)
df['key'] = (df[cols].unstack().dropna().astype(str)
             .groupby(level=1).agg('|'.join))

How can I calculate the percentage of empty values in a pandas dataframe?

I have a dataframe df which I know contains empty values, i.e. '' (blank strings).
I want to calculate the percentage per column of those observations and replace them with NaN.
To get the percentage I've tried:
for col in df:
    empty = round((df[col] == '').sum()/df.shape[0]*100, 1)
I have similar code that calculates the percentage of zeros, and it does work:
zeros = round((df[col] == 0).sum()/df.shape[0]*100, 1)
I think you need Series.isna to test for missing values (but it does not catch empty strings):
nans = round(df[col].isna().sum()/df.shape[0]*100, 1)
The solution can be simplified with mean:
nans = round(df[col].isna().mean()*100, 1)
To replace empty strings or whitespace-only values with NaNs, use:
df = df.replace(r'^\s*$', np.nan, regex=True)
nans = round(df[col].isna().mean()*100, 1)
If you need to test all columns:
nans = df.isna().mean().mul(100).round()
The full answer to your problem would be:
df = df[df != '']  # this replaces all the empty values with NaN
for col in df:
    empty_avg = round(df[col].isna().mean()*100, 1)  # the percentage of empty values per column
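Combining the two steps on a toy frame (the values are invented) shows the per-column percentages:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['x', '', ' '], 'b': ['1', '2', '']})
df = df.replace(r'^\s*$', np.nan, regex=True)  # blanks and whitespace-only cells -> NaN
print(df.isna().mean().mul(100).round(1))      # a: 66.7, b: 33.3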

Pandas: replace by regex for 'category' dtype

What is the best way to do df.replace for a dataframe with the category dtype?
Suppose I create dataframe:
df = pandas.DataFrame(
    [
        ['a'], [' '], ['']
    ],
    columns=['x'],
    dtype='category'
)
print(df.replace(r'^\s*$', numpy.nan, regex=True))
The result:
x
0 a
1
2
I.e. the values in rows 1 and 2 are not replaced (because, according to the documentation, only strings get replaced).
If I remove dtype='category', the values are replaced by NaN as expected.
I wonder: what is the best way to replace blanks with NaNs in a whole dataframe where all columns are of category dtype?
Is it:
for col in df.columns:
    df[col] = df[col].str.replace(r'^\s*$', numpy.nan, regex=True)
From the documentation:
Renaming categories is done by assigning new values to the
Series.cat.categories property or by using the rename_categories()
method
However
Categories must also not be NaN or a ValueError is raised
In case the number of white spaces is fixed, you could alternatively do:
df[df == ' '] = numpy.nan
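Alternatively, a sketch based on the documentation quoted above: since the categories themselves cannot be NaN, you can drop the blank categories instead, and their values become NaN automatically (the loop and the whitespace regex are my assumptions for the variable-whitespace case):
import re
# assumes df is the all-category frame from the question
for col in df.columns:
    blanks = [c for c in df[col].cat.categories if re.fullmatch(r'\s*', c)]
    df[col] = df[col].cat.remove_categories(blanks)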

Python pandas: replace zero with NaN in multiple columns

A list with attributes of persons is loaded into the pandas dataframe df2. For cleanup I want to replace the value zero (0 or '0') with np.nan.
df2.dtypes
ID object
Name object
Weight float64
Height float64
BootSize object
SuitSize object
Type object
dtype: object
Working code to set value zero to np.nan:
df2.loc[df2['Weight'] == 0,'Weight'] = np.nan
df2.loc[df2['Height'] == 0,'Height'] = np.nan
df2.loc[df2['BootSize'] == '0','BootSize'] = np.nan
df2.loc[df2['SuitSize'] == '0','SuitSize'] = np.nan
I believe this can be done in a similar, shorter way:
df2[["Weight","Height","BootSize","SuitSize"]].astype(str).replace('0', np.nan)
However, the above does not work; the zeros remain in df2. How can I tackle this?
I think you need to replace using a dict:
cols = ["Weight","Height","BootSize","SuitSize","Type"]
df2[cols] = df2[cols].replace({'0':np.nan, 0:np.nan})
You could use the replace method and pass the values that you want to replace in a list as the first parameter, along with the desired replacement as the second parameter:
cols = ["Weight","Height","BootSize","SuitSize","Type"]
df2[cols] = df2[cols].replace(['0', 0], np.nan)
Try:
df2.replace(to_replace={
    'Weight': {0: np.nan},
    'Height': {0: np.nan},
    'BootSize': {'0': np.nan},
    'SuitSize': {'0': np.nan},
})
data['amount'] = data['amount'].replace(0, np.nan)
data['duration'] = data['duration'].replace(0, np.nan)
In the column "age", replace zero with blanks:
df['age'].replace(['0', 0], '', inplace=True)
Replace zero with NaN for a single column:
df['age'] = df['age'].replace(0, np.nan)
Replace zero with NaN for multiple columns:
cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[cols] = df[cols].replace(['0', 0], np.nan)
Replace zero with NaN for the whole dataframe:
df.replace(0, np.nan, inplace=True)
If you just want to replace the zeros in the whole dataframe, you can do so directly without specifying any columns:
df = df.replace({0:pd.NA})
Another alternative:
cols = ["Weight","Height","BootSize","SuitSize","Type"]
df2[cols] = df2[cols].mask(df2[cols].eq(0) | df2[cols].eq('0'))
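A quick check of the mask variant with toy data (the two columns are invented):
import pandas as pd

df2 = pd.DataFrame({'Weight': [0, 70.5], 'BootSize': ['0', '42']})
cols = ['Weight', 'BootSize']
df2[cols] = df2[cols].mask(df2[cols].eq(0) | df2[cols].eq('0'))
print(df2)  # the zeros in both columns become NaN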
