I'm doing the following conditional fill in PySpark; how would I do this in pandas?
colIsAcceptable = when(col("var") < 0.9, 1).otherwise(0)
You can use:
df['new_col'] = df['col'].lt(0.9).astype(int)
or with numpy.where:
import numpy as np
df['new_col'] = np.where(df['col'].lt(0.9), 1, 0)
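For example, a minimal sketch on made-up data (the column name 'var' is taken from the question):
import numpy as np
import pandas as pd
df = pd.DataFrame({'var': [0.5, 0.95, 0.89, 1.2]})  # hypothetical sample data
df['colIsAcceptable'] = df['var'].lt(0.9).astype(int)
# equivalently, with numpy.where
df['colIsAcceptable'] = np.where(df['var'].lt(0.9), 1, 0)
print(df)
#     var  colIsAcceptable
# 0  0.50                1
# 1  0.95                0
# 2  0.89                1
# 3  1.20                0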
You can use numpy.where.
import numpy as np
df['colIsAcceptable'] = np.where(df['col'] < 0.9, 1, 0)
colIsAcceptable = df['var'].apply(lambda x: 1 if x < 0.9 else 0)
apply can be slow on very large datasets, and the vectorized approaches above are more efficient, but it is good for general purposes.
Assuming the first column of your dataframe is named 'var' and the second is 'colIsAcceptable', you can use the .map() function:
df['colIsAcceptable'] = df['var'].map(lambda x: 1 if x < 0.9 else 0)
df['col2'] = 0
df.loc[df['col1'] < 0.9, 'col2'] = 1
This is a simple example of doing what you are asking.
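Put together as a runnable sketch (the sample data here is made up):
import pandas as pd
df = pd.DataFrame({'col1': [0.5, 0.95, 0.89]})
df['col2'] = 0
df.loc[df['col1'] < 0.9, 'col2'] = 1
print(df)
#    col1  col2
# 0  0.50     1
# 1  0.95     0
# 2  0.89     1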
I have a df as below
I want to make this df binary as follows
I tried
df[:]=np.where(df>0, 1, 0)
but with this I am losing my df index.
I could apply this to each column one at a time or use a loop, but I think there should be an easier and quicker way to do this.
You can convert the boolean mask from DataFrame.gt to integers:
df1 = df.gt(0).astype(int)
Or use DataFrame.clip if the values are integers with no negatives:
df1 = df.clip(upper=1)
Your solution should work with loc:
df.loc[:] = np.where(df>0, 1, 0)
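For a quick check, all three options on a small made-up frame (note that clip only matches when the values are non-negative integers):
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [0, 2, 5], 'b': [3, 0, 1]}, index=['x', 'y', 'z'])
print(df.gt(0).astype(int))          # boolean mask cast to int
print(df.clip(upper=1))              # same result here: non-negative ints
df.loc[:] = np.where(df > 0, 1, 0)   # in place, index preserved
print(df)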
Of course it is possible with a function, but it can also be done with just an operator:
(df > 0) * 1
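Multiplying the boolean frame by 1 casts True/False to 1/0 while keeping the index and column labels intact.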
Without using numpy (this sets the positive values to 1; it assumes the remaining values are already 0):
df[df > 0] = 1
I have a dataframe df_ac, and the logic for this dataframe is:
df_ac['annfact'] = np.where((df_ac['annfact'] == 0) & (df_ac['cert'] == 0), 1, df_ac['annfact'])
How can I use pandas filtering to express the above logic? Something like this:
df_ac['annfact'] = df_ac[(df_ac['annfact'] == 0) & (df_ac['cert'] == 0)] =1 ?
And I hope the pandas filtering way will be faster than np.where.
Can anyone help convert the code, or offer a suggestion?
You can use a boolean mask to update certain values. This will modify the "annfact" column directly.
mask = (df_ac["annfact"] == 0) & (df_ac["cert"] == 0)
df_ac.loc[mask, "annfact"] = 1
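As a self-contained sketch with toy data standing in for df_ac:
import pandas as pd
df_ac = pd.DataFrame({'annfact': [0.0, 0.0, 2.5], 'cert': [0, 1, 0]})  # made-up data
mask = (df_ac['annfact'] == 0) & (df_ac['cert'] == 0)
df_ac.loc[mask, 'annfact'] = 1
print(df_ac)
#    annfact  cert
# 0      1.0     0
# 1      0.0     1
# 2      2.5     0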
         A             B
0    0.119  5.344960e+08
1    0.008  7.950629e+09
2  318.575  1.996548e+05
3  153.644  4.139767e+05
my_sum = 63605028.818
df['B'] = df['A'].rdiv(my_sum).replace(np.inf, 0).round(3)
This gives values in exponential (scientific) notation as a Series; I want plain numeric values in column B, like 534496040.49.
You can do something like this:
df['B'] = df['A'].rdiv(my_sum).replace(np.inf, 0).astype('int64')
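Note that astype('int64') drops the decimal part entirely; if you want to keep decimals while avoiding scientific notation, use the display option below instead.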
You can also change the view option of pandas:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
import pandas as pd
pd.options.display.float_format = '{:,.3f}'.format
Set pandas' float_format option and it will display all floats in this format; you won't need to round each column explicitly.
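For example, using the values from column B above:
import pandas as pd
pd.options.display.float_format = '{:,.3f}'.format
s = pd.Series([5.344960e+08, 7.950629e+09])
print(s)
# 0      534,496,000.000
# 1    7,950,629,000.000
# dtype: float64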
Alternatively, use map()
df['B'] = df['A'].rdiv(my_sum).replace(np.inf, 0)
df['B'] = df['B'].map('{:,.3f}'.format)
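Keep in mind that map with a format string converts the column to strings (object dtype), so this is best left as a final display step rather than done before further computation.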
If I want to apply a lambda with multiple conditions, how should I do it?
df.train.age.apply(lambda x:0 (if x>=0 and x<500))
Or is there a much better method?
Create a mask, select from your Series with the mask, and apply only to the result:
mask = (df.train.age >=0) & (df.train.age < 500)
df.train.age[mask].apply(something)
If you just need to set the ones that don't match to zero, that's even easier:
df.train.loc[~mask, 'age'] = 0
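Putting this together as a minimal sketch (a plain DataFrame named df_train with an 'age' column is assumed here, since the question's df.train attribute access is hard to reproduce):
import pandas as pd
df_train = pd.DataFrame({'age': [-10, 10, 20, 600]})  # made-up data
mask = (df_train['age'] >= 0) & (df_train['age'] < 500)
df_train.loc[~mask, 'age'] = 0  # zero out rows outside the valid range
print(df_train)
#    age
# 0    0
# 1   10
# 2   20
# 3    0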
Your syntax needs to have an else:
df.train.age.apply(lambda x: x if x >= 0 and x < 500 else 0)
This is a good way to do it.
The same result can be obtained without apply, using np.where as below.
import numpy as np
import pandas as pd
df = pd.DataFrame({'age': [-10, 10, 20, 30, 40, 100, 110]})
df['age'] = np.where((df['age'] >= 100) | (df['age'] < 0), 0, df['age'])
df
If anything about the above code is unclear, please post your sample dataframe and I'll update my answer.
I have a DataFrame, and I want to keep only the columns whose mean is over a certain threshold.
My code looks like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((20, 20)))
mean_keep = (df.mean() > 0.5)
mean_keep = mean_keep[mean_keep == True]
df_new = df[mean_keep.index]
and it works. However, I wonder if there is a function like "TAKE_ONLY_COLUMNS" that could reduce this to one line, like
df_new = df[TAKE_ONLY_COLUMNS(df.mean() > 0.5)]
Use df.loc[] here:
df_new = df.loc[:, df.mean() > 0.5]
print(df_new)
This will automatically keep the columns where the condition is True.
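A runnable version of the whole thing (with the numpy import added and the same random frame as the question):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((20, 20)))
df_new = df.loc[:, df.mean() > 0.5]  # keep only columns whose mean exceeds 0.5
print(df_new.shape)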