Plot with pandas: group and mean

The data in my 'combos' DataFrame looks like this:
pr = [1.0,2.0,3.0,4.0,1.0,2.0,3.0,4.0,1.0,2.0,3.0,4.0,.....1.0,2.0,3.0,4.0]
lmi = [200, 200, 200, 250, 250,.....780, 780, 780, 800, 800, 800]
pred = [0.16, 0.18, 0.25, 0.43, 0.54......., 0.20, 0.34, 0.45, 0.66]
I plot the results like this:
fig, ax = plt.subplots()
for pr in [1.0, 2.0, 3.0, 4.0]:
    ax.plot(combos[combos.pr == pr].lmi, combos[combos.pr == pr].pred, label=pr)
ax.set_xlabel('lmi')
ax.set_ylabel('pred')
ax.legend(loc='best')
And I get this plot:
How can I plot means of "pred" for each "lmi" data point when keeping the pairs (lmi, pr) intact?

You can first group your DataFrame by lmi and then compute the mean of each group, just as your title suggests:

combos.groupby('lmi').pred.mean().plot()

In one line, this:
- groups the combos DataFrame by the lmi column,
- selects the pred column for each lmi,
- computes the mean of pred within each lmi group, and
- plots the mean for each lmi group.

Following your updates to the question, it is now clear that you want to calculate the means for each (pr, lmi) pair. This can be done by grouping over both columns and then simply calling mean(). With reset_index(), we then restore the DataFrame to its previous form.
>>> combos.groupby(['lmi', 'pr']).mean().reset_index()
   lmi   pr  pred
0  200  1.0  0.16
1  200  2.0  0.18
2  200  3.0  0.25
3  250  1.0  0.54
4  250  4.0  0.43
5  780  2.0  0.20
6  780  3.0  0.34
7  780  4.0  0.45
8  800  1.0  0.66
In this new DataFrame, pred contains the means, and you can use the same plotting procedure you used before.
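For example, a minimal sketch that keeps one line per pr value, reusing the column names from your question:

means = combos.groupby(['lmi', 'pr']).mean().reset_index()

fig, ax = plt.subplots()
for pr in [1.0, 2.0, 3.0, 4.0]:
    subset = means[means.pr == pr]  # mean pred per lmi for this pr
    ax.plot(subset.lmi, subset.pred, label=pr)
ax.set_xlabel('lmi')
ax.set_ylabel('pred')
ax.legend(loc='best')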


Bar chart with ticks based on multiple dataframe columns

How can I make a bar chart in matplotlib (or pandas) from the bins in my dataframe?
I want something like this, where the x-axis labels come from the low and high columns in my dataframe (so the first tick would read [-1.089, 0)) and the y values come from the percent column.
Here is an example dataset. The dataset is already in this format (I don't have an uncut version).
df = pd.DataFrame(
    {
        "low": [-1.089, 0, 0.3, 0.5, 0.6, 0.8],
        "high": [0, 0.3, 0.5, 0.6, 0.8, 10.089],
        "percent": [0.509, 0.11, 0.074, 0.038, 0.069, 0.202],
    }
)
display(df)
Create a new column from the low and high columns:
Convert the numeric values in low and high to str and format them in the [<low>, <high>) notation that you want.
From there, you can create a bar plot directly from df using df.plot.bar() (docs linked below), assigning the newly created column as x and percent as y; a sketch follows after the link.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html
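A minimal sketch of that approach (the label column name is an arbitrary choice):

import pandas as pd

df = pd.DataFrame(
    {
        "low": [-1.089, 0, 0.3, 0.5, 0.6, 0.8],
        "high": [0, 0.3, 0.5, 0.6, 0.8, 10.089],
        "percent": [0.509, 0.11, 0.074, 0.038, 0.069, 0.202],
    }
)

# Build the "[<low>, <high>)" tick labels as strings
df["label"] = "[" + df["low"].astype(str) + ", " + df["high"].astype(str) + ")"

# Bar chart with the interval labels on the x-axis
df.plot.bar(x="label", y="percent")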
Recreate the bins using IntervalArray.from_arrays:
df['label'] = pd.arrays.IntervalArray.from_arrays(df.low, df.high)
#       low    high  percent          label
# 0  -1.089   0.000    0.509  (-1.089, 0.0]
# 1   0.000   0.300    0.110     (0.0, 0.3]
# 2   0.300   0.500    0.074     (0.3, 0.5]
# 3   0.500   0.600    0.038     (0.5, 0.6]
# 4   0.600   0.800    0.069     (0.6, 0.8]
# 5   0.800  10.089    0.202  (0.8, 10.089]
Then plot with x as these bins:
df.plot.bar(x='label', y='percent')
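Note that these intervals are closed on the right by default, i.e. (low, high]. If you specifically want the [low, high) notation from the question, from_arrays accepts a closed parameter:

df['label'] = pd.arrays.IntervalArray.from_arrays(df.low, df.high, closed='left')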

How to display `.value_counts()` per interval in a pandas dataframe

I need to display .value_counts() per interval in a pandas dataframe. Here's my code:
prob['bucket'] = pd.qcut(prob['prob good'], 20)
grouped = prob.groupby('bucket', as_index = False)
kstable = pd.DataFrame()
kstable['min_prob'] = grouped.min()['prob good']
kstable['max_prob'] = grouped.max()['prob good']
kstable['counts'] = prob['bucket'].value_counts()
My Output
min_prob max_prob counts
0 0.26 0.48 NaN
1 0.49 0.52 NaN
2 0.53 0.54 NaN
3 0.55 0.56 NaN
4 0.57 0.58 NaN
I know there is a problem with the kstable['counts'] line, but how do I solve it?
Use named aggregation to simplify your code: counts comes from applying 'size' to the bucket column, alongside the min and max of 'prob good':

prob['bucket'] = pd.qcut(prob['prob good'], 20)
kstable = prob.groupby('bucket', as_index=False).agg(min_prob=('prob good', 'min'),
                                                     max_prob=('prob good', 'max'),
                                                     counts=('bucket', 'size'))
Your original approach can also work with DataFrame.assign, provided the counts are sorted into bucket order and stripped of their interval index; otherwise pandas aligns on the row index, which is what produced the NaNs in your output:

kstable = kstable.assign(counts=prob['bucket'].value_counts().sort_index().values)

average of binned values

I have 2 separate dataframes and want to compute a correlation between them:

Time  temperature  |  Time  ratio
0     32           |  0     0.02
1     35           |  1     0.1
2     30           |  2     0.25
3     31           |  3     0.17
4     34           |  4     0.22
5     34           |  5     0.07
I want to bin my data every 0.05 (on ratio), with time as the index, and average, within each bin, all the temperature values that fall into that bin.
I will therefore obtain one averaged temperature for each 0.05-wide ratio bin.
Could anyone help out, please? Thanks!
Edit (what the data look like; df1 on the left, df2 on the right):

Time  device-1  device-2 ...  |  Time  device-1  device-2 ...
0     32        34            |  0     0.02      0.01
1     35        31            |  1     0.1       0.23
2     30        30            |  2     0.25      0.15
3     31        32            |  3     0.17      0.21
4     34        35            |  4     0.22      0.13
5     34        31            |  5     0.07      0.06
This could work with the pandas library:
import pandas as pd
import numpy as np

temp = [32, 35, 30, 31, 34, 34]
ratio = [0.02, 0.1, 0.25, 0.17, 0.22, 0.07]
times = range(6)

# Create your dataframe
df = pd.DataFrame({'Time': times, 'Temperature': temp, 'Ratio': ratio})

# Bins
bins = pd.cut(df.Ratio, np.arange(0, 0.25, 0.05))

# Get the mean temperature of each group and the list of times in it
df.groupby(bins).agg({"Temperature": "mean", "Time": list})
Output:
Temperature Time
Ratio
(0.0, 0.05] 32.0 [0]
(0.05, 0.1] 34.5 [1, 5]
(0.1, 0.15] NaN []
(0.15, 0.2] 31.0 [3]
You can discard the empty bins with .dropna() like this:
df.groupby(bins).agg({"Temperature": "mean", "Time": list}).dropna()
Temperature Time
Ratio
(0.0, 0.05] 32.0 [0]
(0.05, 0.1] 34.5 [1, 5]
(0.15, 0.2] 31.0 [3]
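One caveat, in case it matters for your data: np.arange excludes its stop value, so np.arange(0, 0.25, 0.05) only generates bin edges up to 0.2, and the ratios above 0.2 (times 2 and 4 here) fall outside every bin and are silently dropped. Extending the stop value keeps them:

bins = pd.cut(df.Ratio, np.arange(0, 0.30, 0.05))  # edges 0.0, 0.05, ..., 0.25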
EDIT: In the case of multiple machines, here is a solution:
import pandas as pd
import numpy as np

n_machines = 3

# Generate random data for temperatures and ratios
temperature_df = pd.DataFrame({'Machine_{}'.format(i):
                               pd.Series(np.random.randint(30, 40, 10))
                               for i in range(n_machines)})
ratio_df = pd.DataFrame({'Machine_{}'.format(i):
                         pd.Series(np.random.uniform(0.01, 0.5, 10))
                         for i in range(n_machines)})

# If ratio is between 0 and 1, we get bins spaced by 0.05
def get_bins(s):
    return pd.cut(s, np.arange(0, 1, 0.05))

# Get bin assignments for each machine
bins = ratio_df.apply(get_bins, axis=1)

# Get the mean temperature of each group for each machine
df = temperature_df.apply(lambda x: x.groupby(bins[x.name]).agg("mean"))
Then if you want to display the result, you could use the seaborn package:

import matplotlib.pyplot as plt
import seaborn as sns

df_reshaped = df.reset_index().melt(id_vars='index')
df_reshaped.columns = ['Ratio bin', 'Machine', 'Mean temperature']
sns.barplot(data=df_reshaped, x="Ratio bin", y="Mean temperature", hue="Machine")
plt.show()

Bayesian Averaging in a Dataframe

I'm attempting to extract a series of Bayesian averages, based on a dataframe (by row).
For example, say I have a series of (0 to 1) user ratings of candy bars, stored in a dataframe like so:
            User1  User2  User3
Snickers     0.01    NaN    0.7
Mars Bars    0.25    0.4    0.1
Milky Way    0.90    1.0    NaN
Almond Joy    NaN    NaN    NaN
Babe Ruth    0.50    0.1    0.3
I'd like to create a column in a different DF which represents each candy bar's Bayesian Average from the above data.
To calculate the BA, I'm using the equation presented here:

S = w*R + (1 - w)*C

where:
S = score of the candy bar
R = average of user ratings for the candy bar
C = average of user ratings for all candy bars
w = weight assigned to R, computed as v/(v+m), where v is the number of user ratings for that candy bar and m is the average number of ratings across all candy bars
I've translated that into python as such:
def bayesian_average(df):
    """given a dataframe, returns a series of bayesian averages"""
    R = df.mean(axis=1)
    C = df.sum(axis=1).sum() / df.count(axis=1).sum()
    w = df.count(axis=1) / (df.count(axis=1)
                            + (df.count(axis=1).sum() / len(df.dropna(how='all', inplace=False))))
    return (w * R) + ((1 - w) * C)

other_df['bayesian_avg'] = bayesian_average(ratings_df)
However, my calculation seems to be off, in such a way that as the number of User columns in my initial dataframe grows, the final calculated Bayesian average grows as well (into numbers greater than 1).
Is this a problem with the fundamental equation I'm using, or with how I've translated it into python? Or is there an easier way to handle this in general (e.g. a preexisting package/function)?
Thanks!
I began with the dataframe you gave as an example:
import numpy as np
import pandas as pd

d = {
    'Bar': ['Snickers', 'Mars Bars', 'Milky Way', 'Almond Joy', 'Babe Ruth'],
    'User1': [0.01, 0.25, 0.9, np.nan, 0.5],
    'User2': [np.nan, 0.4, 1.0, np.nan, 0.1],
    'User3': [0.7, 0.1, np.nan, np.nan, 0.3]
}
df = pd.DataFrame(data=d)
Which looks like this:
Bar User1 User2 User3
0 Snickers 0.01 NaN 0.7
1 Mars Bars 0.25 0.4 0.1
2 Milky Way 0.90 1.0 NaN
3 Almond Joy NaN NaN NaN
4 Babe Ruth 0.50 0.1 0.3
The first thing I did was create a list of all columns that had user reviews:
user_cols = []
for col in df.columns.values:
    if 'User' in col:
        user_cols.append(col)
Next, I found it most straightforward to create each variable of the Bayesian Average equation either as a column in the dataframe, or as a standalone variable:
Calculate the value of v for each bar:
df['v'] = df[user_cols].count(axis=1)
Calculate the value of m (equals 2.0 in this example):
m = np.mean(df['v'])
Calculate the value of w for each bar:
df['w'] = df['v']/(df['v'] + m)
And calculate the value of R for each bar:
df['R'] = np.mean(df[user_cols], axis=1)
Finally, get the value of C (equals 0.426 in this example):
C = np.nanmean(df[user_cols].values.flatten())
And now we're ready to calculate the Bayesian Average score, S, for each candy bar:
df['S'] = df['w']*df['R'] + (1 - df['w'])*C
This gives us a dataframe that looks like this:
Bar User1 User2 User3 v w R S
0 Snickers 0.01 NaN 0.7 2 0.5 0.355 0.3905
1 Mars Bars 0.25 0.4 0.1 3 0.6 0.250 0.3204
2 Milky Way 0.90 1.0 NaN 2 0.5 0.950 0.6880
3 Almond Joy NaN NaN NaN 0 0.0 NaN NaN
4 Babe Ruth 0.50 0.1 0.3 3 0.6 0.300 0.3504
The final column S contains the S-scores for all the candy bars. If you want, you can then drop the temporary v, w, and R columns with df = df.drop(['v', 'w', 'R'], axis=1):
Bar User1 User2 User3 S
0 Snickers 0.01 NaN 0.7 0.3905
1 Mars Bars 0.25 0.4 0.1 0.3204
2 Milky Way 0.90 1.0 NaN 0.6880
3 Almond Joy NaN NaN NaN NaN
4 Babe Ruth 0.50 0.1 0.3 0.3504
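If you want this as a single reusable function, here is a sketch that packages the steps above (user_cols is the list of rating columns built earlier; the function name mirrors the question's):

import numpy as np
import pandas as pd

def bayesian_average(df, user_cols):
    """Return a Series of Bayesian-average scores, one per row of df."""
    ratings = df[user_cols]
    v = ratings.count(axis=1)        # number of ratings per bar
    m = v.mean()                     # average number of ratings per bar
    w = v / (v + m)                  # weight given to each bar's own mean
    R = ratings.mean(axis=1)         # per-bar mean rating
    C = np.nanmean(ratings.values)   # global mean rating across all bars
    return w * R + (1 - w) * C

df['S'] = bayesian_average(df, user_cols)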

How to calculate statistical significance for two lists in an A/B test using Python

I ran an A/B test and have the following data:

      control_conversion  test_conversion
day1                 100              101
day3                 140              200
day5                 200              320
day7                 400              800

Both the control and test groups had 1000 visitors, so the conversion rates are:

      control_conversion  test_conversion
day1                0.10             0.10
day3                0.14             0.20
day5                0.20             0.32
day7                0.40             0.80
I want to use Python to calculate the statistical significance on day1, day3, day5, and day7 for control vs. test, so I would build two lists:

control = [0.1, 0.14, 0.20, 0.40]
test = [0.1, 0.2, 0.32, 0.8]

How can I calculate the four p-values for these two lists? What I want to see is a list of p-values, e.g.:

pvalues = [0.1, 0.2, 0.1, 0.2]
Quick and dirty, assuming control and test hold the same number of items:

control = [0.1, 0.14, 0.20, 0.40]
test = [0.1, 0.2, 0.32, 0.8]

for idx in range(len(control)):
    val_co = control[idx]
    val_te = test[idx]
    # do whatever you want to do with val_co and val_te
You can try using the binomial test function in SciPy:

from scipy.stats import binom_test

n = 1000
control = [100, 140, 200, 400]
test = [101, 200, 320, 800]

pvals = []
for idx in range(len(control)):
    pvals.append(binom_test(test[idx], n=n, p=control[idx] / n))
print(pvals)

[0.9160130517865064, 1.8593423831091924e-07, 4.004795877115897e-19, 1.644604962019165e-147]
(I just did a basic 101 blog post on this)
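Note that binom_test has since been deprecated and removed from SciPy in favor of scipy.stats.binomtest, which takes the same count arguments but returns a result object (a sketch, assuming SciPy >= 1.7):

from scipy.stats import binomtest

n = 1000
control = [100, 140, 200, 400]
test = [101, 200, 320, 800]

# binomtest returns a result object; the p-value is its .pvalue attribute
pvals = [binomtest(t, n=n, p=c / n).pvalue for t, c in zip(test, control)]
print(pvals)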
