If I want to apply a lambda with multiple conditions, how should I do it?
df.train.age.apply(lambda x:0 (if x>=0 and x<500))
Or is there a much better method?
Create a mask, select from your Series with the mask, and only apply to the result:
mask = (df.train.age >=0) & (df.train.age < 500)
df.train.age[mask].apply(something)
If you just need to set the ones that don't match to zero, that's even easier:
df.train.age[~mask] = 0
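As a runnable sketch of this mask approach, on a hypothetical flat column named age (the original df.train.age depends on data we do not have):

```python
import pandas as pd

# made-up sample data standing in for df.train.age
df = pd.DataFrame({"age": [-5, 10, 250, 600, 499]})

# rows whose age falls in the accepted range
mask = (df["age"] >= 0) & (df["age"] < 500)

# zero out the rows that do not match
df.loc[~mask, "age"] = 0
```

Using df.loc[~mask, "age"] = 0 rather than chained indexing like df.train.age[~mask] = 0 avoids pandas' SettingWithCopyWarning.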
Your syntax needs to have an else:
df.train.age.apply(lambda x: 0 if x >= 0 and x < 500 else x)
This is a good way to do it
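A minimal demonstration on a toy Series (values are illustrative):

```python
import pandas as pd

ages = pd.Series([-5, 10, 600])

# zero the in-range values, keep the rest unchanged
result = ages.apply(lambda x: 0 if x >= 0 and x < 500 else x)
```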
The same result can be obtained without apply, using np.where, as below.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'age' : [-10,10,20,30,40,100,110]
})
df['age'] = np.where((df['age'] >= 100) | (df['age'] < 0), 0, df['age'])
df
If you have any trouble using the above code, please post your sample dataframe and I'll update my answer.
Related
I have a df as below
I want to make this df binary as follows
I tried
df[:]=np.where(df>0, 1, 0)
but with this I am losing my df index.
I could apply this to each column one by one, or use a loop, but I think there should be an easy and quick way to do this.
You can convert the boolean mask from DataFrame.gt to integers:
df1 = df.gt(0).astype(int)
Or use DataFrame.clip if the values are integers with no negatives:
df1 = df.clip(upper=1)
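A quick check on a small frame (sample data made up here) shows both forms agree and keep the index:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 2, 5], "b": [3, 0, 1]}, index=["x", "y", "z"])

via_gt = df.gt(0).astype(int)
via_clip = df.clip(upper=1)  # valid here because all values are non-negative integers
```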
Your solution should work with DataFrame.loc:
df.loc[:] = np.where(df>0, 1, 0)
Of course it is possible with a function, but it can also be done with just an operator:
(df > 0) * 1
Without using numpy (assuming the values that are not positive are already 0):
df[df > 0] = 1
I am doing the following conditional fill in pyspark; how would I do this in pandas?
colIsAcceptable = when(col("var") < 0.9, 1).otherwise(0)
You can use:
df['new_col'] = df['col'].lt(0.9).astype(int)
or with numpy.where:
import numpy as np
df['new_col'] = np.where(df['col'].lt(0.9), 1, 0)
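For instance, on a hypothetical df whose column is named var as in the pyspark snippet:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"var": [0.5, 0.95, 0.89, 1.2]})

# Series.lt + astype, or np.where -- both yield the same 0/1 column
df["colIsAcceptable"] = df["var"].lt(0.9).astype(int)
df["colIsAcceptable_np"] = np.where(df["var"] < 0.9, 1, 0)
```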
You can use numpy.where.
import numpy as np
df['colIsAcceptable'] = np.where(df['col'] < 0.9, 1, 0)
colIsAcceptable = df['var'].apply(lambda x: 1 if x < 0.9 else 0)
apply can be slow on very large datasets; vectorized alternatives are more efficient, but it is fine for general purposes.
Assuming the first column of your dataframe is named 'var' and the second column is 'colIsAcceptable', you can use the .map() function:
df['colIsAcceptable']= df['var'].map(lambda x: 1 if x<0.9 else 0)
df['col2'] = 0
df.loc[df['col1'] < 0.9, 'col2'] = 1
This is a simple example to do something like what you are asking.
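Run end to end on made-up values, the two-step version looks like this:

```python
import pandas as pd

df = pd.DataFrame({"col1": [0.2, 0.9, 1.5]})

df["col2"] = 0                        # default everywhere
df.loc[df["col1"] < 0.9, "col2"] = 1  # overwrite where the condition holds
```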
I have a dataframe df_ac, and the logic for this dataframe is:
df_ac['annfact'] = np.where((df_ac['annfact'] == 0) & (df_ac['cert'] == 0), 1, df_ac['annfact'])
How can I convert the above logic to use pandas filtering, something like this?
df_ac['annfact'] = df_ac[(df_ac['annfact'] == 0) & (df_ac['cert'] == 0)] =1 ?
I hope the pandas filtering approach will be faster than np.where.
Can anyone help convert the code, or offer any suggestions?
You can use a boolean mask to update certain values. This will modify the "annfact" column directly.
mask = (df_ac["annfact"] == 0) & (df_ac["cert"] == 0)
df_ac.loc[mask, "annfact"] = 1
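A runnable sketch with made-up values for df_ac:

```python
import pandas as pd

df_ac = pd.DataFrame({"annfact": [0, 0, 2], "cert": [0, 1, 0]})

# set annfact to 1 only where both columns are zero
mask = (df_ac["annfact"] == 0) & (df_ac["cert"] == 0)
df_ac.loc[mask, "annfact"] = 1
```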
I am working with Pandas (python) on a dataset containing some fMRI results.
I am trying to drop rows when the value of a specific column is lower than a threshold I set. The thing is, I would also like to keep NaN values.
df = df[(dfr['in_mm3'] > 270) or (np.isnan(df['in_mm3']) == True)]
Obviously this doesn't work, but it is for you to better understand what I'm trying to achieve.
Any help would be appreciated.
Thanks.
You are almost there. This should work:
import numpy as np
df = df[np.logical_or(df['in_mm3'] > 270, np.isnan(df['in_mm3']))]
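With some illustrative values, including a NaN, the filter keeps both the large values and the missing ones:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"in_mm3": [100.0, 300.0, np.nan]})

# keep rows above the threshold OR rows with a missing value
kept = df[np.logical_or(df["in_mm3"] > 270, np.isnan(df["in_mm3"]))]
```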
df = df[(df['in_mm3'] > 270) | (pd.isna(df['in_mm3']))]
Here we keep the rows where either df['in_mm3'] > 270 or pd.isna(df['in_mm3']) is true.
I have a DataFrame, and I want to keep only columns, when their mean is over a certain treshhold.
My code looks like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((20,20)))
mean_keep= (df.mean() > 0.5)
mean_keep= mean_keep[mean_keep == True]
df_new = df[mean_keep.index]
and it is working. However I wonder if there is a function like "TAKE_ONLY_COLUMNS" that can reduce this to one line like
df_new = df[TAKE_ONLY_COLUMNS(df.mean() > 0.5)]
Use df.loc[] here:
df_new = df.loc[:, df.mean() > 0.5]
print(df_new)
This will automatically keep the columns where the condition is True.
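For a reproducible check, seed the random generator (the question's snippet did not):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.random((20, 20)))

# keep only the columns whose mean exceeds the threshold
df_new = df.loc[:, df.mean() > 0.5]
```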