I have a dataframe named "test" and a list of words list_w = ['monthly', 'moon']. I want to add a new column "revised cosine" such that: for each word in list_w whose 'weak'-condition row has cosine == 'Na', the revised cosine of the corresponding 'unrel_weak' row is also set to 'Na'; similarly, for each word in list_w whose 'strong'-condition row has cosine == 'Na', the revised cosine of the corresponding 'unrel_strong' row is also set to 'Na'.
isi prime target condition meanRT cosine
0 50 weekly monthly strong 676.2 0.9
1 1050 weekly monthly strong 643.5 0.9
2 50 daily monthly weak 737.2 Na
3 1050 daily monthly weak 670.6 Na
4 50 bathtub monthly unrel_strong 692.2 0.1
5 1050 bathtub monthly unrel_strong 719.1 0.1
6 50 sponge monthly unrel_weak 805.8 0.3
7 1050 sponge monthly unrel_weak 685.7 0.3
8 50 crescent moon strong 625.0 Na
9 1050 crescent moon strong 537.2 Na
10 50 sunset moon weak 698.4 0.2
11 1050 sunset moon weak 704.3 0.2
12 50 premises moon unrel_strong 779.2 0.7
13 1050 premises moon unrel_strong 647.6 0.7
14 50 descent moon unrel_weak 686.0 0.5
15 1050 descent moon unrel_weak 725.4 0.5
My code is as below:
for w in list_w:
    if test.loc[(test['target'] == w) & (test['condition'] == 'strong'), 'cosine'] == 'Na':
        test.loc[(test['target'] == w) & (test['condition'] == 'unrel_strong'), 'cosine'] = 'Na'
My code returns this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
My expected output should be like the dataframe below (with the "revised cosine" column added):
isi prime target condition meanRT cosine revised cosine
0 50 weekly monthly strong 676.2 0.9 0.9
1 1050 weekly monthly strong 643.5 0.9 0.9
2 50 daily monthly weak 737.2 Na Na
3 1050 daily monthly weak 670.6 Na Na
4 50 bathtub monthly unrel_strong 692.2 0.1 0.1
5 1050 bathtub monthly unrel_strong 719.1 0.1 0.1
6 50 sponge monthly unrel_weak 805.8 0.3 Na
7 1050 sponge monthly unrel_weak 685.7 0.3 Na
8 50 crescent moon strong 625.0 Na Na
9 1050 crescent moon strong 537.2 Na Na
10 50 sunset moon weak 698.4 0.2 0.2
11 1050 sunset moon weak 704.3 0.2 0.2
12 50 premises moon unrel_strong 779.2 0.7 Na
13 1050 premises moon unrel_strong 647.6 0.7 Na
14 50 descent moon unrel_weak 686.0 0.5 0.5
15 1050 descent moon unrel_weak 725.4 0.5 0.5
Any ideas to help me out? I checked np.logical_and, but it seems to work with only two conditions.
Overwriting the cosine column is also fine, as long as the output matches the revised cosine. Thanks in advance!
This error message comes from the fact that you can't use a boolean Series in an if statement like this with Pandas: the comparison produces a whole Series of booleans, which has no single truth value.
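For instance, a minimal illustration of the ambiguity:

import pandas as pd

s = pd.Series([True, False])
# if s: ...     # raises ValueError: the truth value of a Series is ambiguous
print(s.any())  # reduce the Series to a single boolean with .any() or .all()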
Try like this:
test["revised cosine"] = test["cosine"]
for w in list_w:
    for c in ["weak", "strong"]:
        # does this word have cosine == 'Na' under condition c?
        has_na = (
            (test["target"] == w) & (test["condition"] == c) & (test["cosine"] == "Na")
        ).any()
        if has_na:
            # if so, also mark the matching 'unrel_' rows as 'Na'
            test.loc[(test["target"] == w) & (test["condition"] == "unrel_" + c), "revised cosine"] = "Na"
Solution
m = test['cosine'].eq('Na') & \
test['target'].isin(list_w) & \
test['condition'].isin(['weak', 'strong'])
i1 = test.set_index(['isi', 'target', 'condition']).index
i2 = test[m].set_index(['isi', 'target', test.loc[m, 'condition'].radd('unrel_')]).index
test['revised_cosine'] = test['cosine'].mask(i1.isin(i2), 'Na')
Explanations
Let us create a boolean mask m which holds True when the cosine column contains Na, the target column contains one of the words from list_w, and the condition column is either weak or strong:
>>> m
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 False
8 True
9 True
10 False
11 False
12 False
13 False
14 False
15 False
dtype: bool
Create a MultiIndex from the columns isi, target and condition; let's call it i1. Then filter the rows of the test dataframe using mask m, add the prefix unrel_ to the condition values of those filtered rows, and build a second MultiIndex i2 in the same way.
>>> i1
MultiIndex([( 50, 'monthly', 'strong'),
(1050, 'monthly', 'strong'),
( 50, 'monthly', 'weak'),
(1050, 'monthly', 'weak'),
( 50, 'monthly', 'unrel_strong'),
(1050, 'monthly', 'unrel_strong'),
( 50, 'monthly', 'unrel_weak'),
(1050, 'monthly', 'unrel_weak'),
( 50, 'moon', 'strong'),
(1050, 'moon', 'strong'),
( 50, 'moon', 'weak'),
(1050, 'moon', 'weak'),
( 50, 'moon', 'unrel_strong'),
(1050, 'moon', 'unrel_strong'),
( 50, 'moon', 'unrel_weak'),
(1050, 'moon', 'unrel_weak')],
names=['isi', 'target', 'condition'])
>>> i2
MultiIndex([( 50, 'monthly', 'unrel_weak'),
(1050, 'monthly', 'unrel_weak'),
( 50, 'moon', 'unrel_strong'),
(1050, 'moon', 'unrel_strong')],
names=['isi', 'target', 'condition'])
Finally, mask the values in the cosine column with a boolean array obtained by testing the membership of i1 in i2:
isi prime target condition meanRT cosine revised_cosine
0 50 weekly monthly strong 676.2 0.9 0.9
1 1050 weekly monthly strong 643.5 0.9 0.9
2 50 daily monthly weak 737.2 Na Na
3 1050 daily monthly weak 670.6 Na Na
4 50 bathtub monthly unrel_strong 692.2 0.1 0.1
5 1050 bathtub monthly unrel_strong 719.1 0.1 0.1
6 50 sponge monthly unrel_weak 805.8 0.3 Na
7 1050 sponge monthly unrel_weak 685.7 0.3 Na
8 50 crescent moon strong 625.0 Na Na
9 1050 crescent moon strong 537.2 Na Na
10 50 sunset moon weak 698.4 0.2 0.2
11 1050 sunset moon weak 704.3 0.2 0.2
12 50 premises moon unrel_strong 779.2 0.7 Na
13 1050 premises moon unrel_strong 647.6 0.7 Na
14 50 descent moon unrel_weak 686.0 0.5 0.5
15 1050 descent moon unrel_weak 725.4 0.5 0.5
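For reference, a minimal sketch to reproduce the solution end to end (meanRT omitted for brevity; note the question stores the literal string 'Na', so we test with eq('Na') — with real NaN values you would use isna() instead):

import pandas as pd

test = pd.DataFrame({
    'isi': [50, 1050] * 8,
    'target': ['monthly'] * 8 + ['moon'] * 8,
    'condition': ['strong', 'strong', 'weak', 'weak',
                  'unrel_strong', 'unrel_strong', 'unrel_weak', 'unrel_weak'] * 2,
    'cosine': [0.9, 0.9, 'Na', 'Na', 0.1, 0.1, 0.3, 0.3,
               'Na', 'Na', 0.2, 0.2, 0.7, 0.7, 0.5, 0.5],
})
list_w = ['monthly', 'moon']

m = test['cosine'].eq('Na') & test['target'].isin(list_w) & test['condition'].isin(['weak', 'strong'])
i1 = test.set_index(['isi', 'target', 'condition']).index
i2 = test[m].set_index(['isi', 'target', test.loc[m, 'condition'].radd('unrel_')]).index
test['revised_cosine'] = test['cosine'].mask(i1.isin(i2), 'Na')
print(test)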
Related
I'm currently working with 3 data frames named doctorate, high_school and bachelor that look a bit like this:
ID age education marital_status occupation annual_income Age_25 Age_30 Age_35 Age_40 Age_45 Age_50
1 2 50 doctorate married professional mid 25 and over 30 and over 35 and over 40 and over 45 and over 50 and over
7 8 40 doctorate married professional high 25 and over 30 and over 35 and over 40 and over under 45 under 50
11 12 45 doctorate married professional mid 25 and over 30 and over 35 and over 40 and over 45 and over under 50
16 17 44 doctorate divorced transport mid 25 and over 30 and over 35 and over 40 and over under 45 under 50
I'm trying to create probabilities based on the annual_income column using the following for loop:
income_levels = ['low','mid','high']
education_levels = [bachelor,doctorate,high_school]
for inc_level in income_levels:
    for ed_level in education_levels:
        print(inc_level, len(ed_level[ed_level['annual_income'] == inc_level]) / len(ed_level))
It produces this, which is what I want:
low 0.125
low 0.0
low 0.25
mid 0.625
mid 0.75
mid 0.5
high 0.25
high 0.25
high 0.25
However, I want to append these values to one of three lists depending on the income category: low_income, mid_income, high_income. I'm sure there's a way to modify my for loop to do this, but I can't bridge the gap to getting there. Could anyone help me?
In this case, you're trying to look up a list via a key/string, so why not just use a dict of lists?
income_levels = ['low', 'mid', 'high']
education_levels = [bachelor, doctorate, high_school]

# initialise the dictionary: one empty list per income level
inc_level_rates = {il: list() for il in income_levels}

for inc_level in income_levels:
    for ed_level in education_levels:
        rate = len(ed_level[ed_level['annual_income'] == inc_level]) / len(ed_level)
        inc_level_rates[inc_level].append(rate)
        print(inc_level, rate)

print(inc_level_rates)
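If the three frames share the same columns, a pandas-native alternative is to tag each frame with its education level, stack them, and normalise the counts; a sketch, assuming bachelor, doctorate and high_school exist as above:

import pandas as pd

# Tag each frame with its education level, then stack them
frames = {'bachelor': bachelor, 'doctorate': doctorate, 'high_school': high_school}
combined = pd.concat(
    [f.assign(education_level=name) for name, f in frames.items()],
    ignore_index=True,
)
# Share of each annual_income value within each education level
rates = combined.groupby('education_level')['annual_income'].value_counts(normalize=True)
print(rates)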
I have 2 separate dataframes and want to do a correlation between them:
Time temperature | Time ratio
0 32 | 0 0.02
1 35 | 1 0.1
2 30 | 2 0.25
3 31 | 3 0.17
4 34 | 4 0.22
5 34 | 5 0.07
I want to bin my data every 0.05 (by ratio), with time as the index, and average all the temperature values that correspond to each bin.
I will therefore obtain one averaged value for each 0.05 step.
Could anyone help out, please? Thanks!
****Edit: what the data look like**** (df1 on the left, df2 on the right)
Time device-1 device-2... | Time device-1 device-2...
0 32 34 | 0 0.02 0.01
1 35 31 | 1 0.1 0.23
2 30 30 | 2 0.25 0.15
3 31 32 | 3 0.17 0.21
4 34 35 | 4 0.22 0.13
5 34 31 | 5 0.07 0.06
This could work with the pandas library:
import pandas as pd
import numpy as np
temp = [32,35,30,31,34,34]
ratio = [0.02,0.1,0.25,0.17,0.22,0.07]
times = range(6)
# Create your dataframe
df = pd.DataFrame({'Time': times, 'Temperature': temp, 'Ratio': ratio})
# Bins every 0.05; np.arange excludes the stop value,
# so go up to 0.30 to cover ratios up to 0.25
bins = pd.cut(df.Ratio, np.arange(0, 0.30, 0.05))
# get the mean temperature of each group and the list of each time
df.groupby(bins).agg({"Temperature": "mean", "Time": list})
Output:
             Temperature    Time
Ratio
(0.0, 0.05]         32.0     [0]
(0.05, 0.1]         34.5  [1, 5]
(0.1, 0.15]          NaN      []
(0.15, 0.2]         31.0     [3]
(0.2, 0.25]         32.0  [2, 4]
You can discard the empty bins with .dropna() like this:
df.groupby(bins).agg({"Temperature": "mean", "Time": list}).dropna()
             Temperature    Time
Ratio
(0.0, 0.05]         32.0     [0]
(0.05, 0.1]         34.5  [1, 5]
(0.15, 0.2]         31.0     [3]
(0.2, 0.25]         32.0  [2, 4]
EDIT: In the case of multiple machines, here is a solution:
import pandas as pd
import numpy as np
n_machines = 3
# Generate random data for temperature and ratios
temperature_df = pd.DataFrame( {'Machine_{}'.format(i):
pd.Series(np.random.randint(30,40,10))
for i in range(n_machines)} )
ratio_df = pd.DataFrame( {'Machine_{}'.format(i):
pd.Series(np.random.uniform(0.01,0.5,10))
for i in range(n_machines)} )
# Ratios lie between 0 and 1, so bin them every 0.05
# (go up to 1.05 so that the top edge 1.0 is included)
def get_bins(s):
    return pd.cut(s, np.arange(0, 1.05, 0.05))

# Get the bin assignment of each reading, machine by machine
bins = ratio_df.apply(get_bins)
# Get the mean of each group for each machine
df = temperature_df.apply(lambda x: x.groupby(bins[x.name]).agg("mean"))
Then if you want to display the result, you could use the seaborn package:
import matplotlib.pyplot as plt
import seaborn as sns
df_reshaped = df.reset_index().melt(id_vars='index')
df_reshaped.columns = [ 'Ratio bin','Machine','Mean temperature' ]
sns.barplot(data=df_reshaped,x="Ratio bin",y="Mean temperature",hue="Machine")
plt.show()
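If your readings actually live in two separate frames keyed by Time, as in the question's edit, a minimal sketch for one device (df1 holding temperatures, df2 holding ratios, using the question's numbers):

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Time': range(6), 'device-1': [32, 35, 30, 31, 34, 34]})
df2 = pd.DataFrame({'Time': range(6), 'device-1': [0.02, 0.10, 0.25, 0.17, 0.22, 0.07]})

temp = df1.set_index('Time')['device-1']
ratio = df2.set_index('Time')['device-1']

# Bin the ratios every 0.05 and average the time-aligned temperatures per bin
bins = pd.cut(ratio, np.arange(0, 0.30, 0.05))
print(temp.groupby(bins).mean())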
I'm attempting to extract a series of Bayesian averages, based on a dataframe (by row).
For example, say I have a series of (0 to 1) user ratings of candy bars, stored in a dataframe like so:
User1 User2 User3
Snickers 0.01 NaN 0.7
Mars Bars 0.25 0.4 0.1
Milky Way 0.9 1.0 NaN
Almond Joy NaN NaN NaN
Babe Ruth 0.5 0.1 0.3
I'd like to create a column in a different DF which represents each candy bar's Bayesian Average from the above data.
To calculate the BA, I'm using the equation presented here, S = w*R + (1-w)*C, where:
S = score of the candy bar
R = average of user ratings for the candy bar
C = average of user ratings for all candy bars
w = weight assigned to R, computed as v/(v+m), where v is the number of user ratings for that candy bar and m is the average number of ratings per candy bar.
I've translated that into python as such:
def bayesian_average(df):
"""given a dataframe, returns a series of bayesian averages"""
R = df.mean(axis=1)
C = df.sum(axis=1).sum()/df.count(axis=1).sum()
w = df.count(axis=1)/(df.count(axis=1)+(df.count(axis=1).sum()/len(df.dropna(how='all', inplace=False))))
return ((w*R) + ((1-w)*C))
other_df['bayesian_avg'] = bayesian_average(ratings_df)
However, my calculation seems to be off, in such a way that as the number of User columns in my initial dataframe grows, the final calculated Bayesian average grows as well (into numbers greater than 1).
Is this a problem with the fundamental equation I'm using, or with how I've translated it into python? Or is there an easier way to handle this in general (e.g. a preexisting package/function)?
Thanks!
I began with the dataframe you gave as an example:
d = {
'Bar': ['Snickers', 'Mars Bars', 'Milky Way', 'Almond Joy', 'Babe Ruth'],
'User1': [0.01, 0.25, 0.9, np.nan, 0.5],
'User2': [np.nan, 0.4, 1.0, np.nan, 0.1],
'User3': [0.7, 0.1, np.nan, np.nan, 0.3]
}
df = pd.DataFrame(data=d)
Which looks like this:
Bar User1 User2 User3
0 Snickers 0.01 NaN 0.7
1 Mars Bars 0.25 0.4 0.1
2 Milky Way 0.90 1.0 NaN
3 Almond Joy NaN NaN NaN
4 Babe Ruth 0.50 0.1 0.3
The first thing I did was create a list of all columns that had user reviews:
user_cols = []
for col in df.columns.values:
    if 'User' in col:
        user_cols.append(col)
Next, I found it most straightforward to create each variable of the Bayesian Average equation either as a column in the dataframe, or as a standalone variable:
Calculate the value of v for each bar:
df['v'] = df[user_cols].count(axis=1)
Calculate the value of m (equals 2.0 in this example):
m = np.mean(df['v'])
Calculate the value of w for each bar:
df['w'] = df['v']/(df['v'] + m)
And calculate the value of R for each bar:
df['R'] = np.mean(df[user_cols], axis=1)
Finally, get the value of C (equals 0.426 in this example):
C = np.nanmean(df[user_cols].values.flatten())
And now we're ready to calculate the Bayesian Average score, S, for each candy bar:
df['S'] = df['w']*df['R'] + (1 - df['w'])*C
This gives us a dataframe that looks like this:
Bar User1 User2 User3 v w R S
0 Snickers 0.01 NaN 0.7 2 0.5 0.355 0.3905
1 Mars Bars 0.25 0.4 0.1 3 0.6 0.250 0.3204
2 Milky Way 0.90 1.0 NaN 2 0.5 0.950 0.6880
3 Almond Joy NaN NaN NaN 0 0.0 NaN NaN
4 Babe Ruth 0.50 0.1 0.3 3 0.6 0.300 0.3504
The final column S contains all the S-scores for the candy bars. If you want, you can then drop the temporary v, w, and R columns with df = df.drop(['v', 'w', 'R'], axis=1):
Bar User1 User2 User3 S
0 Snickers 0.01 NaN 0.7 0.3905
1 Mars Bars 0.25 0.4 0.1 0.3204
2 Milky Way 0.90 1.0 NaN 0.6880
3 Almond Joy NaN NaN NaN NaN
4 Babe Ruth 0.50 0.1 0.3 0.3504
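As a more compact alternative, the same math can be wrapped in one function without the helper columns; a sketch reproducing the steps above:

import numpy as np
import pandas as pd

def bayesian_average(ratings):
    # `ratings` holds only the user-rating columns
    v = ratings.count(axis=1)        # number of ratings per bar
    m = v.mean()                     # average number of ratings per bar
    w = v / (v + m)                  # weight given to each bar's own mean
    R = ratings.mean(axis=1)         # per-bar mean rating
    C = np.nanmean(ratings.values)   # global mean rating over all bars
    return w * R + (1 - w) * C

df['S'] = bayesian_average(df[user_cols])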
I want to find out how many samples will be taken from each level using the proportional allocation method.
I have 3 levels in total: [Small, Medium, Large].
First, I want to take the sum of Workers for each of the 3 levels.
Next, I want to find the probability for each of the 3 levels.
Next, I want to multiply each probability by the number of villages in that level to get its sample size.
And the last step is: the sample is selected as the top villages of each level.
Data :
Village Workers Level
Aagar 10 Small
Dhagewadi 32 Small
Sherewadi 34 Small
Shindwad 42 Small
Dhokari 84 Medium
Khanapur 65 Medium
Ambikanagar 45 Medium
Takali 127 Large
Gardhani 122 Large
Pi.Khand 120 Large
Pangri 105 Large
Let me explain (I worked this example out in Excel first).
In the first step I want to get the sum for each level -> Small, Medium and Large, i.e. (10+32+34+42) = 118 for the Small level.
In the next step I want to find the probability for each level, rounded to 2 decimals,
i.e. (118/786) = 0.15 for the Small level.
Then the length (size) of each level is multiplied by its probability to find how many samples (villages) are taken from that level,
i.e. for the Medium level the probability is 0.25 and there are 3 villages, so 0.25*3 = 0.75 samples are taken from the Medium level.
Rounding up to the next whole number, 0.75 ~ 1 sample is taken from the Medium level, choosing the top village of that level; so in the Medium level the village "Dhokari" will be selected.
I have done some work,
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.read_csv("/home/desktop/Desktop/t.csv")
df = df.sort_values('Workers', ascending=True)  # df.sort() was removed in newer pandas; use sort_values
df['level'] = pd.qcut(df['Workers'], 3, ['Small','Medium','Large'])
df
I used this command to get the sum for each level; what to do next is where I am confused:
df = df.groupby(['level'])['Workers'].aggregate(['sum']).unstack()
Is it possible in Python to get the village names the way I got them in Excel?
You can use:
transform with 'sum' to broadcast the per-level total to the full length of the column
division by the overall sum via div, then round
another transform with 'size' for the number of villages per level
and finally a custom function per group
df['Sum_Level_wise'] = df.groupby('Level')['Workers'].transform('sum')
df['Probability'] = df['Sum_Level_wise'].div(df['Workers'].sum()).round(2)
df['Sample'] = df['Probability'] * df.groupby('Level')['Workers'].transform('size')
df['Selected villages'] = df['Sample'].apply(np.ceil).astype(int)
df['Selected village'] = (df.groupby('Level')
                            .apply(lambda x: x['Village'].head(x['Selected villages'].iat[0]))
                            .reset_index(level=0)['Village'])
df['Selected village'] = df['Selected village'].fillna('')
print (df)
Village Workers Level Sum_Level_wise Probability Sample \
0 Aagar 10 Small 118 0.15 0.60
1 Dhagewadi 32 Small 118 0.15 0.60
2 Sherewadi 34 Small 118 0.15 0.60
3 Shindwad 42 Small 118 0.15 0.60
4 Dhokari 84 Medium 194 0.25 0.75
5 Khanapur 65 Medium 194 0.25 0.75
6 Ambikanagar 45 Medium 194 0.25 0.75
7 Takali 127 Large 474 0.60 2.40
8 Gardhani 122 Large 474 0.60 2.40
9 Pi.Khand 120 Large 474 0.60 2.40
10 Pangri 105 Large 474 0.60 2.40
Selected villages Selected village
0 1 Aagar
1 1
2 1
3 1
4 1 Dhokari
5 1
6 1
7 3 Takali
8 3 Gardhani
9 3 Pi.Khand
10 3
You can debug it with a custom function:
def f(x):
    n = x['Selected villages'].iat[0]
    a = x['Village'].head(n)
    print (x['Village'])
    print (a)
    if len(x) < n:
        print ('original villages cannot fill Selected village, because the requested length is higher')
    return a
df['Selected village'] = df.groupby('Level').apply(f).reset_index(level=0)['Village']
df['Selected village'] = df['Selected village'].fillna('')
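If the rows are not guaranteed to be sorted by Workers within each level, a variant that sorts explicitly first might look like this (a sketch, reusing the column names from above):

# Sort each level by Workers (descending) before taking the top rows,
# so the selection does not depend on the input order.
def top_villages(g):
    n = int(g['Selected villages'].iat[0])
    return g.sort_values('Workers', ascending=False)['Village'].head(n)

picked = df.groupby('Level', group_keys=False).apply(top_villages)
df['Selected village'] = picked.reindex(df.index).fillna('')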
I have a pandas data frame my_df, where I can find the mean(), median(), mode() of a given column:
my_df['field_A'].mean()
my_df['field_A'].median()
my_df['field_A'].mode()
I am wondering, is it possible to find more detailed stats such as the 90th percentile? Thanks!
You can use the pandas.DataFrame.quantile() function, as shown below.
import pandas as pd
import random
A = [ random.randint(0,100) for i in range(10) ]
B = [ random.randint(0,100) for i in range(10) ]
df = pd.DataFrame({ 'field_A': A, 'field_B': B })
df
# field_A field_B
# 0 90 72
# 1 63 84
# 2 11 74
# 3 61 66
# 4 78 80
# 5 67 75
# 6 89 47
# 7 12 22
# 8 43 5
# 9 30 64
df.field_A.mean() # Same as df['field_A'].mean()
# 54.399999999999999
df.field_A.median()
# 62.0
# You can call `quantile(i)` to get the i'th quantile,
# where `i` should be a fractional number.
df.field_A.quantile(0.1) # 10th percentile
# 11.9
df.field_A.quantile(0.5) # same as median
# 62.0
df.field_A.quantile(0.9) # 90th percentile
# 89.10000000000001
Assume a Series s:
s = pd.Series(np.arange(100))
Get quantiles for [.1, .2, .3, .4, .5, .6, .7, .8, .9]
s.quantile(np.linspace(.1, 1, 9, 0))  # the 4th argument is endpoint=False, giving [0.1, 0.2, ..., 0.9]
0.1 9.9
0.2 19.8
0.3 29.7
0.4 39.6
0.5 49.5
0.6 59.4
0.7 69.3
0.8 79.2
0.9 89.1
dtype: float64
Or, without interpolating between data points:
s.quantile(np.linspace(.1, 1, 9, 0), 'lower')  # interpolation='lower' takes the lower of the two surrounding values
0.1 9
0.2 19
0.3 29
0.4 39
0.5 49
0.6 59
0.7 69
0.8 79
0.9 89
dtype: int32
I figured out that the below would work:
my_df.dropna().quantile([0.0, .9])
You can even give multiple columns with null values and get multiple quantile values (I use the 95th percentile for outlier treatment):
my_df[['field_A','field_B']].dropna().quantile([0.0, .5, .90, .95])
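A quick toy check of the multi-column form (hypothetical data, just to show the shape of the result):

import pandas as pd

my_df = pd.DataFrame({'field_A': range(10), 'field_B': range(0, 100, 10)})
print(my_df[['field_A', 'field_B']].quantile([0.0, .5, .9, .95]))
#       field_A  field_B
# 0.00     0.00      0.0
# 0.50     4.50     45.0
# 0.90     8.10     81.0
# 0.95     8.55     85.5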
A very easy and efficient way is to call the describe function on the particular column:
df['field_A'].describe()
This will give you the count, mean, std, min and max, plus the 25th, 50th (median) and 75th percentiles.
describe gives you quartiles by default; if you want other percentiles, you can do something like:
df['YOUR_COLUMN_HERE'].describe(percentiles=[.1, .2, .3, .4, .5, .6 , .7, .8, .9, 1])
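For a concrete, deterministic illustration (a toy Series, so the numbers are reproducible):

import pandas as pd

s = pd.Series(range(1, 101))
print(s.describe(percentiles=[.1, .9]))
# count    100.000000
# mean      50.500000
# std       29.011492
# min        1.000000
# 10%       10.900000
# 50%       50.500000   <- the median is always included
# 90%       90.100000
# max      100.000000
# dtype: float64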