A list with attributes of persons is loaded into the pandas DataFrame df2. For cleanup I want to replace the value zero (0 or '0') with np.nan.
df2.dtypes
ID            object
Name          object
Weight       float64
Height       float64
BootSize      object
SuitSize      object
Type          object
dtype: object
Working code that sets zero values to np.nan:
df2.loc[df2['Weight'] == 0,'Weight'] = np.nan
df2.loc[df2['Height'] == 0,'Height'] = np.nan
df2.loc[df2['BootSize'] == '0','BootSize'] = np.nan
df2.loc[df2['SuitSize'] == '0','SuitSize'] = np.nan
I believe this can be done in a similar, shorter way:
df2[["Weight","Height","BootSize","SuitSize"]].astype(str).replace('0',np.nan)
However, the above does not work: the zeros remain in df2. How to tackle this?
I think you need replace with a dict:
cols = ["Weight","Height","BootSize","SuitSize","Type"]
df2[cols] = df2[cols].replace({'0':np.nan, 0:np.nan})
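For illustration, a minimal sketch with made-up data (the values are assumptions, not the asker's actual frame):
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'Weight': [80.0, 0.0], 'BootSize': ['0', '42']})
cols = ['Weight', 'BootSize']
df2[cols] = df2[cols].replace({'0': np.nan, 0: np.nan})
print(df2)
#    Weight BootSize
# 0    80.0      NaN
# 1     NaN       42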
You could use the replace method, passing the values you want to replace as a list in the first parameter and the desired replacement as the second:
cols = ["Weight","Height","BootSize","SuitSize","Type"]
df2[cols] = df2[cols].replace(['0', 0], np.nan)
Try:
df2 = df2.replace(to_replace={
    'Weight': {0: np.nan},
    'Height': {0: np.nan},
    'BootSize': {'0': np.nan},
    'SuitSize': {'0': np.nan},
})
Note that replace is not in-place by default, so the result must be assigned back.
data['amount'] = data['amount'].replace(0, np.nan)
data['duration'] = data['duration'].replace(0, np.nan)
In column "age", replace zero with blanks:
df['age'] = df['age'].replace(['0', 0], '')
Replace zero with NaN for a single column:
df['age'] = df['age'].replace(0, np.nan)
Replace zero with NaN for multiple columns:
cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[cols] = df[cols].replace(['0', 0], np.nan)
Replace zero with NaN for the whole DataFrame:
df.replace(0, np.nan, inplace=True)
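For example, on a small made-up frame (column names assumed for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, 0], 'BMI': [0.0, 21.4]})
df.replace(0, np.nan, inplace=True)
print(df)
#     age   BMI
# 0  25.0   NaN
# 1   NaN  21.4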
If you just want to replace the zeros in the whole DataFrame, you can replace them directly without specifying any columns:
df = df.replace({0:pd.NA})
Another alternative:
cols = ["Weight","Height","BootSize","SuitSize","Type"]
df2[cols] = df2[cols].mask(df2[cols].eq(0) | df2[cols].eq('0'))
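By default mask replaces values where the condition is True with NaN, so no replacement value is needed. A quick sketch on assumed data:
import pandas as pd

df2 = pd.DataFrame({'Weight': [0.0, 70.5], 'SuitSize': ['52', '0']})
cols = ['Weight', 'SuitSize']
df2[cols] = df2[cols].mask(df2[cols].eq(0) | df2[cols].eq('0'))
print(df2)
#    Weight SuitSize
# 0     NaN       52
# 1    70.5      NaN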
Related
I have a dataset with a custom missing-value marker, the character `?`, but a cell with the missing value also contains an inconsistent number of whitespace characters. As in my example picture, at row 11 it could have 3 spaces or 4 spaces.
So my idea was to apply str.strip() to each cell so the value would be identified as missing and dropped, but it is still not recognized as a missing value.
df = pd.read_csv('full_name', header=None, na_values=['?'])
df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
df.dropna(axis=0, inplace=True, how='any')
df.head(20)
What is an efficient way to solve this?
dropna drops NaN values. Since your missing values are actually ?, you could replace them with NaN first and then use dropna:
df = df.replace('?', np.nan).dropna()
Or mask them and use dropna:
df = df.mask(df.eq('?')).dropna()
Or simply filter those rows out and select only the rows without any ?:
df = df[df.ne('?').all(axis=1)]
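Since the cells pad the ? with spaces, the strip step from the question can be combined with either approach. A minimal sketch on assumed data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['1', '  ? ', '3'], 'b': ['x', 'y', ' ?  ']})
df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
df = df.replace('?', np.nan).dropna()
print(df)
#    a  b
# 0  1  x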
Use:
df = pd.DataFrame({'test': [1,2, ' ? ', ' ? ']})
df[~df['test'].str.contains(r'\?', na=False)]
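Because str.contains performs a substring match, the surrounding spaces in ' ? ' do not prevent a hit, so no prior strip is needed with this approach.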
I'm trying to create a new column based on other columns existing in my df.
My new column, col, should be 1 if there is at least one 1 in columns A ~ E.
If all values in columns A ~ E are 0, then the value of col should be 0.
I've attached image for a better understanding.
What is the most efficient way to do this with Python, without using a loop? Thanks.
If you need to test all columns, use DataFrame.max or DataFrame.any with a cast to integers to map True/False to 1/0:
df['col'] = df.max(axis=1)
df['col'] = df.any(axis=1).astype(int)
Or if you need to test only the columns between A and E, add DataFrame.loc:
df['col'] = df.loc[:, 'A':'E'].max(axis=1)
df['col'] = df.loc[:, 'A':'E'].any(axis=1).astype(int)
If you need to specify the columns as a list, use a subset:
cols = ['A','B','C','D','E']
df['col'] = df[cols].max(axis=1)
df['col'] = df[cols].any(axis=1).astype(int)
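A small sketch of the idea on made-up 0/1 data:
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0], 'B': [0, 0, 0], 'C': [1, 1, 0]})
df['col'] = df[['A', 'B', 'C']].max(axis=1)
print(df)
#    A  B  C  col
# 0  0  0  1    1
# 1  1  0  1    1
# 2  0  0  0    0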
I have a DataFrame df that I know contains empty values, i.e. '' (blank strings).
I want to calculate the percentage per column of those observations and replace them with NaN.
To get the percentage I've tried:
for col in df:
    empty = round((df[df[col]] == '').sum()/df.shape[0]*100, 1)
I have similar code that calculates the zeros, and that does work:
zeros = round((df[col] == 0).sum()/df.shape[0]*100, 1)
I think you need Series.isna to test for missing values (it does not match empty strings):
nans = round(df[col].isna().sum()/df.shape[0]*100, 1)
The solution can be simplified with mean:
nans = round(df[col].isna().mean()*100, 1)
To replace empty strings or whitespace-only values with NaN, use (the regex ^\s*$ matches both):
df = df.replace(r'^\s*$', np.nan, regex=True)
nans = round(df[col].isna().mean()*100, 1)
If you need to test all columns:
nans = df.isna().mean().mul(100).round()
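Putting it together on a tiny assumed example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['x', '', ' '], 'b': ['', 'y', 'z']})
df = df.replace(r'^\s*$', np.nan, regex=True)
print(df.isna().mean().mul(100).round(1))
# a    66.7
# b    33.3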
The full answer to your problem (replace first, so the empty values are counted as NaN):
df = df[df != '']  # This replaces all the empty values with NaN.
for col in df:
    empty_avg = round(df[col].isna().mean()*100, 1)  # percentage of empty values per column
I have to set the values of the first 3 rows of the dataset in column "alcohol" to NaN.
newdf=pd.DataFrame({'alcohol':[np.nan]},index=[0,1,2])
wine.update(newdf)
wine
After running the code, no error is raised but the DataFrame is not updated.
Assuming alcohol is a column:
df.loc[:2, "alcohol"] = np.nan  # label-based; assumes a default RangeIndex
# alternative (chained assignment; may raise SettingWithCopyWarning)
df.alcohol.iloc[:3] = np.nan
Use iloc with get_loc to get the position of the column alcohol:
wine.iloc[:3, wine.columns.get_loc('alcohol')] = np.nan
Or use loc with the first values of the index:
wine.loc[wine.index[:3], 'alcohol'] = np.nan
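A minimal sketch (the wine data itself is assumed here):
import numpy as np
import pandas as pd

wine = pd.DataFrame({'alcohol': [14.2, 13.2, 13.2, 14.4]})
wine.iloc[:3, wine.columns.get_loc('alcohol')] = np.nan
print(wine)
#    alcohol
# 0      NaN
# 1      NaN
# 2      NaN
# 3     14.4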
This is my code:
import re

for col in df:
    if col.startswith('event'):
        df[col].fillna(0, inplace=True)
        df[col] = df[col].map(lambda x: re.sub(r"\D", "", str(x)))
I have event columns 0 to 10: "event_0, event_1, ...".
When I fill NaN with this code, it fills all NaN cells in all event columns with 0, but it does not change event_0, which is the first column of that selection and is also filled with NaN.
I made these columns from the 'events' column with the following code:
event_seperator = lambda x: pd.Series([i for i in str(x).strip().split('\n')]).add_prefix('event_')
df_events = df['events'].apply(event_seperator)
df = pd.concat([df.drop(columns=['events']), df_events], axis=1)
Please tell me what is wrong. You can see the DataFrame before the change in the picture.
I don't know why that happened since I made all those columns the same.
Your data suggests this is precisely what has not been done: str(x) turns a real NaN into the string 'nan', so event_0 holds 'nan' strings that fillna cannot replace.
You have a few options depending on what you are trying to achieve.
1. Convert all non-numeric values to 0
Use pd.to_numeric with errors='coerce':
df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
2. Replace either string ('nan') or null (NaN) values with 0
Use pd.Series.replace to turn the string 'nan' into a real NaN, then fillna:
df[col] = df[col].replace('nan', np.nan).fillna(0)
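For example, on a column that mixes a real NaN with the string 'nan' (sample data assumed):
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 'nan', '7'])
print(pd.to_numeric(s, errors='coerce').fillna(0))  # option 1: 0.0, 0.0, 7.0
print(s.replace('nan', np.nan).fillna(0))           # option 2: 0, 0, '7' (object dtype)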