Bayesian Averaging in a Dataframe - python

I'm attempting to extract a series of Bayesian averages, based on a dataframe (by row).
For example, say I have a series of (0 to 1) user ratings of candy bars, stored in a dataframe like so:
            User1  User2  User3
Snickers     0.01    NaN    0.7
Mars Bars    0.25    0.4    0.1
Milky Way    0.9     1.0    NaN
Almond Joy    NaN    NaN    NaN
Babe Ruth    0.5     0.1    0.3
I'd like to create a column in a different DF which represents each candy bar's Bayesian Average from the above data.
To calculate the BA, I'm using the equation S = w*R + (1 - w)*C, where:
S = score of the candy bar
R = average of user ratings for the candy bar
C = average of user ratings for all candy bars
w = weight assigned to R, computed as v/(v + m), where v is the number of user ratings for that candy bar and m is the average number of ratings across all candy bars.
I've translated that into python as such:
def bayesian_average(df):
    """given a dataframe, returns a series of bayesian averages"""
    R = df.mean(axis=1)
    C = df.sum(axis=1).sum() / df.count(axis=1).sum()
    w = df.count(axis=1) / (df.count(axis=1) + (df.count(axis=1).sum() / len(df.dropna(how='all', inplace=False))))
    return (w * R) + ((1 - w) * C)
other_df['bayesian_avg'] = bayesian_average(ratings_df)
However, my calculation seems to be off, in such a way that as the number of User columns in my initial dataframe grows, the final calculated Bayesian average grows as well (into numbers greater than 1).
Is this a problem with the fundamental equation I'm using, or with how I've translated it into python? Or is there an easier way to handle this in general (e.g. a preexisting package/function)?
Thanks!

I began with the dataframe you gave as an example:
import numpy as np
import pandas as pd

d = {
    'Bar': ['Snickers', 'Mars Bars', 'Milky Way', 'Almond Joy', 'Babe Ruth'],
    'User1': [0.01, 0.25, 0.9, np.nan, 0.5],
    'User2': [np.nan, 0.4, 1.0, np.nan, 0.1],
    'User3': [0.7, 0.1, np.nan, np.nan, 0.3]
}
df = pd.DataFrame(data=d)
Which looks like this:
Bar User1 User2 User3
0 Snickers 0.01 NaN 0.7
1 Mars Bars 0.25 0.4 0.1
2 Milky Way 0.90 1.0 NaN
3 Almond Joy NaN NaN NaN
4 Babe Ruth 0.50 0.1 0.3
The first thing I did was create a list of all columns that had user reviews:
user_cols = []
for col in df.columns.values:
    if 'User' in col:
        user_cols.append(col)
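The same list can also be built with a one-line comprehension, if you prefer:
user_cols = [col for col in df.columns if 'User' in col]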
Next, I found it most straightforward to create each variable of the Bayesian Average equation either as a column in the dataframe, or as a standalone variable:
Calculate the value of v for each bar:
df['v'] = df[user_cols].count(axis=1)
Calculate the value of m (equals 2.0 in this example):
m = np.mean(df['v'])
Calculate the value of w for each bar:
df['w'] = df['v']/(df['v'] + m)
And calculate the value of R for each bar:
df['R'] = np.mean(df[user_cols], axis=1)
Finally, get the value of C (equals 0.426 in this example):
C = np.nanmean(df[user_cols].values.flatten())
And now we're ready to calculate the Bayesian Average score, S, for each candy bar:
df['S'] = df['w']*df['R'] + (1 - df['w'])*C
This gives us a dataframe that looks like this:
Bar User1 User2 User3 v w R S
0 Snickers 0.01 NaN 0.7 2 0.5 0.355 0.3905
1 Mars Bars 0.25 0.4 0.1 3 0.6 0.250 0.3204
2 Milky Way 0.90 1.0 NaN 2 0.5 0.950 0.6880
3 Almond Joy NaN NaN NaN 0 0.0 NaN NaN
4 Babe Ruth 0.50 0.1 0.3 3 0.6 0.300 0.3504
The final column S contains the S-scores for all the candy bars. If you want, you can then drop the temporary v, w, and R columns with df = df.drop(['v', 'w', 'R'], axis=1):
Bar User1 User2 User3 S
0 Snickers 0.01 NaN 0.7 0.3905
1 Mars Bars 0.25 0.4 0.1 0.3204
2 Milky Way 0.90 1.0 NaN 0.6880
3 Almond Joy NaN NaN NaN NaN
4 Babe Ruth 0.50 0.1 0.3 0.3504
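For completeness, the same steps can be rolled into a single function that returns just the score column, so it can be assigned directly to another dataframe as in your original code. This is a minimal sketch, assuming the ratings dataframe has the same user_cols columns identified above:

import numpy as np

def bayesian_average(df, user_cols):
    """given a ratings dataframe, returns a series of Bayesian averages (one per row)"""
    v = df[user_cols].count(axis=1)        # number of ratings per candy bar
    m = v.mean()                           # average number of ratings per candy bar
    R = df[user_cols].mean(axis=1)         # per-bar mean rating
    C = np.nanmean(df[user_cols].values)   # overall mean rating
    w = v / (v + m)
    return w * R + (1 - w) * C

other_df['bayesian_avg'] = bayesian_average(ratings_df, user_cols)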

Related

T Test on Multiple Columns in Dataframe

Dataframe looks something like:
decade rain snow
1910 0.2 0.2
1910 0.3 0.4
2000 0.4 0.5
2010 0.1 0.1
I'd love some help with a function in Python to run a t-test comparing decade combinations for a given column. This function works great, except that it does not take an input column such as rain or snow.
from itertools import combinations

def ttest_run(c1, c2):
    results = st.ttest_ind(cat1, cat2, nan_policy='omit')
    df = pd.DataFrame({'dec1': c1,
                       'dec2': c2,
                       'tstat': results.statistic,
                       'pvalue': results.pvalue},
                      index=[0])
    return df

df_list = [ttest_run(i, j) for i, j in combinations(data['decade'].unique().tolist(), 2)]
final_df = pd.concat(df_list, ignore_index=True)
I think you want something like this:
import pandas as pd
from itertools import combinations
from scipy import stats as st

d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'],
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4],
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)

def all_pairwise(df, compare_col='decade'):
    decade_pairs = [(i, j) for i, j in combinations(df[compare_col].unique().tolist(), 2)]
    # or add a list of colnames to the function signature
    cols = list(df.columns)
    cols.remove(compare_col)
    list_of_dfs = []
    for pair in decade_pairs:
        for col in cols:
            c1 = df[df[compare_col] == pair[0]][col]
            c2 = df[df[compare_col] == pair[1]][col]
            results = st.ttest_ind(c1, c2, nan_policy='omit')
            tmp = pd.DataFrame({'dec1': pair[0],
                                'dec2': pair[1],
                                'tstat': results.statistic,
                                'pvalue': results.pvalue}, index=[col])
            list_of_dfs.append(tmp)
    df_stats = pd.concat(list_of_dfs)
    return df_stats

df_stats = all_pairwise(df)
df_stats
Now if you execute that code you'll get runtime warnings from division-by-zero errors, which occur when there are too few data points for some of the t-statistics; those are what produce the NaNs in the output:
>>> df_stats
dec1 dec2 tstat pvalue
rain 1910 2000 NaN NaN
snow 1910 2000 NaN NaN
rain 1910 2010 NaN NaN
snow 1910 2010 NaN NaN
rain 1910 1990 0.000000 1.000000
snow 1910 1990 0.436436 0.685044
rain 2000 2010 NaN NaN
...
If you don't want all columns but only some specified set, change the function signature/definition line to read:
def all_pairwise(df, cols, compare_col = 'decade'):
where cols should be an iterable of string column names (a list will work fine). You'll also need to remove the two lines:
cols = list(df.columns)
cols.remove(compare_col)
from the function body; otherwise it will work as before.
You'll always get the runtime warnings unless you filter out decades with too few records before passing to the function.
Here is an example call, using the version that accepts a list of columns as an argument, which shows the runtime warnings:
>>> all_pairwise(df, cols=['rain'])
/usr/local/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3723: RuntimeWarning: Degrees of freedom <= 0 for slice
return _methods._var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.8/site-packages/numpy/core/_methods.py:254: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
dec1 dec2 tstat pvalue
rain 1910 2000 NaN NaN
rain 1910 2010 NaN NaN
rain 1910 1990 0.0 1.0
rain 2000 2010 NaN NaN
rain 2000 1990 NaN NaN
rain 2010 1990 NaN NaN
>>>
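If you want to avoid those warnings entirely, one option is to drop decades with too few rows before calling the function. This is a minimal sketch (the min_rows threshold is an assumption, not part of the original question):

# Keep only decades that have at least min_rows observations;
# a t-test needs at least two values per group.
min_rows = 2
counts = df['decade'].value_counts()
df_filtered = df[df['decade'].isin(counts[counts >= min_rows].index)]
df_stats = all_pairwise(df_filtered)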

How to display `.value_counts()` in interval in pandas dataframe

I need to display .value_counts() per interval in a pandas dataframe. Here's my code:
prob['bucket'] = pd.qcut(prob['prob good'], 20)
grouped = prob.groupby('bucket', as_index = False)
kstable = pd.DataFrame()
kstable['min_prob'] = grouped.min()['prob good']
kstable['max_prob'] = grouped.max()['prob good']
kstable['counts'] = prob['bucket'].value_counts()
My Output
min_prob max_prob counts
0 0.26 0.48 NaN
1 0.49 0.52 NaN
2 0.53 0.54 NaN
3 0.55 0.56 NaN
4 0.57 0.58 NaN
I know that I have a problem with the kstable['counts'] syntax, but how do I solve this?
Use named aggregation to simplify your code; for the counts, use GroupBy.size applied to the bucket column, assigned to a new column counts:
prob['bucket'] = pd.qcut(prob['prob good'], 20)
kstable = prob.groupby('bucket', as_index=False).agg(min_prob=('prob good', 'min'),
                                                     max_prob=('prob good', 'max'),
                                                     counts=('bucket', 'size'))
In your solution, it should work with DataFrame.assign:
kstable = kstable.assign(counts = prob['bucket'].value_counts())
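For reference, here is a self-contained sketch of the named-aggregation approach on made-up data (the prob good values below are random stand-ins, since the original prob dataframe isn't shown; 'prob good' is used for the size aggregation, which counts rows per bucket just the same):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prob = pd.DataFrame({'prob good': rng.uniform(0, 1, 200)})  # hypothetical scores

prob['bucket'] = pd.qcut(prob['prob good'], 20)
kstable = prob.groupby('bucket', as_index=False).agg(min_prob=('prob good', 'min'),
                                                     max_prob=('prob good', 'max'),
                                                     counts=('prob good', 'size'))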

average of binned values

I have 2 separate dataframes and want to compute a correlation between them:
Time temperature | Time ratio
0 32 | 0 0.02
1 35 | 1 0.1
2 30 | 2 0.25
3 31 | 3 0.17
4 34 | 4 0.22
5 34 | 5 0.07
I want to bin my data every 0.05 (by ratio), with time as the index, and take the average of all the temperature values that fall in each bin.
I will therefore obtain one averaged value for each 0.05-wide bin.
Could anyone help out please? Thanks!
Edit (what the data look like; df1 on the left, df2 on the right):
Time device-1 device-2... | Time device-1 device-2...
0 32 34 | 0 0.02 0.01
1 35 31 | 1 0.1 0.23
2 30 30 | 2 0.25 0.15
3 31 32 | 3 0.17 0.21
4 34 35 | 4 0.22 0.13
5 34 31 | 5 0.07 0.06
This could work with the pandas library:
import pandas as pd
import numpy as np
temp = [32,35,30,31,34,34]
ratio = [0.02,0.1,0.25,0.17,0.22,0.07]
times = range(6)
# Create your dataframe
df = pd.DataFrame({'Time': times, 'Temperature': temp, 'Ratio': ratio})
# Bins
bins = pd.cut(df.Ratio,np.arange(0,0.25,0.05))
# get the mean temperature of each group and the list of each time
df.groupby(bins).agg({"Temperature": "mean", "Time": list})
Output:
Temperature Time
Ratio
(0.0, 0.05] 32.0 [0]
(0.05, 0.1] 34.5 [1, 5]
(0.1, 0.15] NaN []
(0.15, 0.2] 31.0 [3]
You can discard the empty bins with .dropna() like this:
df.groupby(bins).agg({"Temperature": "mean", "Time": list}).dropna()
Temperature Time
Ratio
(0.0, 0.05] 32.0 [0]
(0.05, 0.1] 34.5 [1, 5]
(0.15, 0.2] 31.0 [3]
EDIT: In the case of multiple machines, here is a solution:
import pandas as pd
import numpy as np
n_machines = 3
# Generate random data for temperature and ratios
temperature_df = pd.DataFrame({'Machine_{}'.format(i):
                               pd.Series(np.random.randint(30, 40, 10))
                               for i in range(n_machines)})
ratio_df = pd.DataFrame({'Machine_{}'.format(i):
                         pd.Series(np.random.uniform(0.01, 0.5, 10))
                         for i in range(n_machines)})
# If ratio is between 0 and 1, we get the bins spaced by .05
def get_bins(s):
    return pd.cut(s, np.arange(0, 1, 0.05))
# Get bin assignments for each machine
bins = ratio_df.apply(get_bins,axis=1)
# Get the mean of each group for each machine
df = temperature_df.apply(lambda x: x.groupby(bins[x.name]).agg("mean"))
Then if you want to display the result, you could use the seaborn package:
import matplotlib.pyplot as plt
import seaborn as sns
df_reshaped = df.reset_index().melt(id_vars='index')
df_reshaped.columns = [ 'Ratio bin','Machine','Mean temperature' ]
sns.barplot(data=df_reshaped,x="Ratio bin",y="Mean temperature",hue="Machine")
plt.show()

Pandas GroupBy to calculate weighted percentages meeting a certain condition

I have a dataframe with survey data like so, with each row being a different respondent.
weight race Question_1 Question_2 Question_3
0.9 white 1 5 4
1.1 asian 5 4 3
0.95 white 2 1 5
1.25 black 5 4 3
0.80 other 4 5 2
Each question is on a scale from 1 to 5 (there are several more questions in the actual data). For each question, I am trying to calculate the percentage of respondents who responded with a 5, grouped by race and weighted by the weight column.
I believe that the code below works for calculating the percentage who responded with a 5 for each question, grouped by race. But I do not know how to weight it by the weight column.
df.groupby('race').apply(lambda x: ((x == 5).sum()) / x.count())
I am new to pandas. Could someone please explain how to do this? Thanks for any help.
Edit: The desired output for the above dataframe would look something like this. Obviously the real data has far more respondents (rows) and many more questions.
Question_1 Question_2 Question_3
white 0.00 0.49 0.51
black 1.00 0.00 0.00
asian 1.00 0.00 0.00
other 0.00 1.00 0.00
Thank you.
Here is a solution that defines a custom function and applies it to each question column, then concatenates the results into a dataframe:
def wavg(x, col):
    return (x['weight'] * (x[col] == 5)).sum() / x['weight'].sum()

grouped = df.groupby('race')
pd.concat([grouped.apply(wavg, col) for col in df.columns if col.startswith('Question')], axis=1)\
  .rename(columns={num: f'Question_{num+1}' for num in range(3)})
Output:
Question_1 Question_2 Question_3
race
asian 1.0 0.000000 0.000000
black 1.0 0.000000 0.000000
other 0.0 1.000000 0.000000
white 0.0 0.486486 0.513514
Here's how you could do it for question 1. You can easily generalize it for the other questions.
# Define a dummy indicating a '5 response'
df['Q1'] = np.where(df['Question_1']==5 ,1, 0)
# Create a weighted version of the above dummy
df['Q1_w'] = df['Q1'] * df['weight']
# Compute the sum by race
ds = df.groupby(['race'])[['Q1_w', 'weight']].sum()
# Compute the weighted average
ds['avg'] = ds['Q1_w'] / ds['weight']
Basically, you first take the sum of the weighted '5' dummy and the sum of the weights by race, and then divide the former by the latter.
This gives you the weighted average.
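If you want this for every question at once, the same dummy/weight/divide logic can be vectorized across columns. This is a minimal sketch, assuming the question columns all start with 'Question':

question_cols = [c for c in df.columns if c.startswith('Question')]

# weight where the response is 5, otherwise 0, for every question column
weighted_fives = df[question_cols].eq(5).mul(df['weight'], axis=0)

# sum of those weights per race, divided by the total weight per race
result = weighted_fives.groupby(df['race']).sum().div(df.groupby('race')['weight'].sum(), axis=0)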

Pandas: rolling windows with a sum product

There are a number of answers that each provide me with a portion of my desired result, but I am having trouble putting them all together. My core pandas dataframe looks like this, where I am trying to estimate volume_step_1:
date volume_step_0 volume_step_1
2018-01-01 100 a
2018-01-02 101 b
2018-01-03 105 c
2018-01-04 123 d
2018-01-05 121 e
I then have a reference table with the conversion rates, e.g.:
step conversion
0 0.60
1 0.81
2 0.18
3 0.99
4 0.75
I have another table containing point estimates of a Poisson distribution:
days_to_complete step_no pc_cases
0 0 0.50
1 0 0.40
2 0 0.07
Using these data, I now want to estimate
volume_step_1 =
(volume_step_0(today) * days_to_complete(step0, day0) * conversion(step0)) +
(volume_step_0(yesterday) * days_to_complete(step0,day1) * conversion(step0))
and so forth.
How do I write some Python code to do so?
Calling your dataframes (from top to bottom as df1, df2, and df3):
df1['volume_step_1'] = (
    (df1['volume_step_0'] *
     df3.loc[(df3['days_to_complete'] == 0) & (df3['step_no'] == 0), 'pc_cases'] *
     df2.loc[df2['step'] == 0, 'conversion']) +
    df1['volume_step_0'].shift(1) *
    df3.loc[(df3['days_to_complete'] == 1) & (df3['step_no'] == 0), 'pc_cases'] *
    df2.loc[df2['step'] == 0, 'conversion'])
EDIT:
IIUC, you are trying to get a 'dot product' of sorts between the volume_step_0 column and the product of pc_cases and conversion for a particular step_no. You can merge df2 and df3 to match steps:
df_merged = df2.merge(df3, how='left', left_on='step', right_on='step_no')
df_merged.head(3)
step conversion days_to_complete step_no pc_cases
0 0.0 0.6 0.0 0.0 0.50
1 0.0 0.6 1.0 0.0 0.40
2 0.0 0.6 2.0 0.0 0.07
I'm guessing you're only using step k to get volume_step_(k+1), and you want to iterate the sum over the days. The following code generates a vector of days_to_complete(step0, day k) and conversion(step0) values for all values of k that are available in days_to_complete, and finds their product:
df_fin = df_merged[df_merged['step'] == 0][['conversion', 'pc_cases']].product(axis = 1)
0 0.300
1 0.240
2 0.042
df_fin = df_fin[::-1].reset_index(drop = True)
Finally, you want to take the dot product of that days_to_complete * conversion vector with the volume_step_0 vector, over a rolling window (using as many values as exist in days_to_complete):
vol_step_1 = pd.Series([df1['volume_step_0'][i:i+len(df3)].reset_index(drop=True).dot(df_fin) for i in range(0, len(df3))])
df1['volume_step_1'] = vol_step_1
df1['volume_step_1'] = df1['volume_step_1'][::-1].reset_index(drop=True)
Output:
df1
date volume_step_0 volume_step_1
0 2018-01-01 100 NaN
1 2018-01-02 101 NaN
2 2018-01-03 105 70.230
3 2018-01-04 123 66.342
4 2018-01-05 121 59.940
While this is by no means a comprehensive solution, the code is meant to provide the logic to "sum multiple products", as you had asked.
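As a footnote, the same "sum of products over a trailing window" can also be expressed with pandas' rolling machinery. This is a minimal sketch rather than the answer above verbatim: it assumes the df1 and df_merged built above, and it pairs today's volume with the day-0 probability (as in the formula in the question), so the row alignment may differ from the listing above.

import numpy as np

# weights[k] = pc_cases(step0, day k) * conversion(step0)
weights = (df_merged.loc[df_merged['step'] == 0, ['conversion', 'pc_cases']]
           .product(axis=1)
           .to_numpy())

# Rolling dot product: each row gets the trailing window of volume_step_0
# dotted with the reversed weights (newest day paired with the day-0 weight).
df1['volume_step_1_alt'] = (
    df1['volume_step_0']
    .rolling(window=len(weights))
    .apply(lambda window: np.dot(window, weights[::-1]), raw=True)
)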
