I have a table:
Sex Value1 Value2 City
M 2 1 Berlin
W 3 5 Paris
W 1 3 Paris
M 2 5 Berlin
M 4 2 Paris
I want to calculate the average of Value1 and Value2 for different groups. In my original dataset I have 10 group variables (each with at most 5 characteristics, e.g. 5 cities), which I have shortened to Sex and City (2 characteristics) in this example. The result should look like this:
AvgOverall AvgM AvgW AvgBerlin AvgParis
Value1 2,4 2,6 2 2 2,66
Value2 3,2 2,6 4 3 3,3
I am familiar with groupby and tried
df.groupby('City').mean()
But here we have the problem that Sex is also getting into the calculation. Does anyone have an idea how to solve this? Thanks in advance!
You can group by each of the 2 columns separately into 2 DataFrames and then use concat, together with the means of the numeric columns (non-numeric columns are excluded):
# numeric_only=True keeps only the numeric columns (required in pandas >= 2.0;
# older versions dropped non-numeric columns like Sex automatically)
df1 = df.groupby('City').mean(numeric_only=True).T
df2 = df.groupby('Sex').mean(numeric_only=True).T
df3 = pd.concat([df.mean(numeric_only=True).rename('Overall'), df2, df1], axis=1).add_prefix('Avg')
print(df3)
AvgOverall AvgM AvgW AvgBerlin AvgParis
Value1 2.4 2.666667 2.0 2.0 2.666667
Value2 3.2 2.666667 4.0 3.0 3.333333
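Since the original dataset has 10 group variables, here is a minimal sketch generalizing the same idea, assuming the grouping columns are collected in a list (group_cols is a name introduced for illustration):

group_cols = ['Sex', 'City']  # extend with the remaining grouping variables
pieces = [df.mean(numeric_only=True).rename('Overall')]
for col in group_cols:
    # one transposed block of means per grouping variable
    pieces.append(df.groupby(col).mean(numeric_only=True).T)
result = pd.concat(pieces, axis=1).add_prefix('Avg')
print(result)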
I have a dataset similar to this:
name group val1 val2
John A 3 2
Cici B 4 3
Ian C 2 2
Zhang D 2 1
Zhang E 1 2
Ian F 1 2
John B 2 1
Ian B 1 2
I did a pivot table, and it now looks like this, using this piece of code:
df_pivot = pd.pivot_table(df, values=['val1', 'val2'], index=['name', 'group']).reset_index()
df_pivot
name group val1 val2
John A 3 2
John B 2 1
Ian C 2 2
Ian F 1 2
Ian B 1 2
Zhang D 2 1
Zhang E 1 2
Cici B 4 3
After the pivot table, I need to 1) group by name and 2) calculate the delta between groups. Take John as an example.
The output should be:
John A-B 1 1
Ian C-F 1 0
F-B 0 0
B-C 1 0 (the delta is -1, but we only do absolute value)
How do I move forward from my pivot table?
Getting each combination to subtract (A-B, A-C, B-C) isn't directly possible with a simple groupby. I suggest that you pivot your data and use a custom function to calculate each possible combination of differences:
import pandas as pd
import itertools

def combo_subtraction(df, level=0):
    # all unique group labels at the given column level
    unique_groups = df.columns.levels[level]
    combos = itertools.combinations(unique_groups, 2)
    pieces = {}
    for g1, g2 in combos:
        name = "{}-{}".format(g1, g2)
        pieces[name] = df.xs(g1, level=level, axis=1) - df.xs(g2, level=level, axis=1)
    return pd.concat(pieces)

out = (df.pivot(index="name", columns="group")  # convert data to wide format
         .pipe(combo_subtraction, level=1)      # apply our combination subtraction
         .dropna()                              # clean up the result
         .swaplevel()
         .sort_index())
print(out)
val1 val2
name
Ian A-B 0.0 0.0
A-C -1.0 0.0
B-C -1.0 0.0
John A-B 1.0 1.0
Zhang A-B 1.0 -1.0
The combo_subtraction function simply iterates over all possible pairs of group labels ("A"-"B", "A"-"C", and so on) and performs the subtraction. It then sticks the results of these combinations back together, forming our result.
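The expected output in the question keeps only absolute values (see the note next to B-C above); if that is what you need, you can take the absolute value of the result afterwards:

# keep only absolute deltas, as in the question's expected output
out = out.abs()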
I have a problem that you hopefully can help with.
I have a dataframe with multiple columns that looks something like this:
education experience ExpenseA ExpenseB ExpenseC
uni yes 3 2 5
uni no 7 6 8
middle yes 2 0 8
high no 12 5 8
uni yes 3 7 5
The Expenses A, B and C should add up to 10 per row, but often they don't because the data was not gathered correctly. For the rows where this is not the case, I want to rescale the values proportionally.
The formula for this should be (cell value) / ((sum of [ExpenseA] to [ExpenseC]) / 10).
Example, row two: total = 21 --> each cell should become (value / 2.1).
How can I iterate this over all the rows for these specific columns?
I think you need to divide by the row sums divided by 10, excluding the first 2 columns, selected by DataFrame.iloc:
df.iloc[:, 2:] = df.iloc[:, 2:].div(df.iloc[:, 2:].sum(axis=1).div(10), axis=0)
print(df)
education experience ExpenseA ExpenseB ExpenseC
0 uni yes 3.000000 2.000000 5.000000
1 uni no 3.333333 2.857143 3.809524
2 middle yes 2.000000 0.000000 8.000000
3 high no 4.800000 2.000000 3.200000
4 uni yes 2.000000 4.666667 3.333333
Or select the columns whose names contain the Expense substring by DataFrame.filter, and divide by their row sums:
df1 = df.filter(like='Expense')
df[df1.columns] = df1.div(df1.sum(axis=1).div(10), axis=0)
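As a quick sanity check (a sketch assuming one of the rescalings above has been applied in place):

# every row of the Expense columns should sum to 10 after rescaling,
# up to floating-point rounding
assert df.filter(like='Expense').sum(axis=1).round(6).eq(10).all()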
I have the following dataframe. The values are ratings by customers.
Ind Department Value1 Value2 Value3 Value4
1 Electronics 5 4 3 2
2 Clothing 4 3 2 1
3 Grocery 3 3 5 1
Here I would like to add a column range that is the difference between the max and min values in each row. The expected output is as below:
Ind Department Value1 Value2 Value3 Value4 range
1 Electronics 5 4 3 2 3
2 Clothing 4 3 2 1 3
3 Grocery 3 3 5 1 4
df['range'] = df.max(axis=1, numeric_only=True) - df.min(axis=1, numeric_only=True)
Note that this takes every numeric column into account, including Ind. If you want to specify column numbers to calculate the range:
df['range'] = df.iloc[:,col1index:col2index].max(axis=1) - df.iloc[:,col1index:col2index].min(axis=1)
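For example, with the table above the Value columns sit at positions 2 through 5 (0-based), so a concrete call, assuming that column order, would be:

# positions 2:6 select Value1..Value4, skipping Ind and Department
df['range'] = df.iloc[:, 2:6].max(axis=1) - df.iloc[:, 2:6].min(axis=1)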
You can try numpy.ptp (peak-to-peak, i.e. max minus min per row):
import numpy as np

np.ptp(df.loc[:, 'Value1':].values, axis=1)
array([3, 3, 4], dtype=int64)

df['range'] = np.ptp(df.loc[:, 'Value1':].values, axis=1)
Filter for only the Value columns and compute the difference of the max and min per row:
boxes = df.filter(like="Value")
df["range"] = boxes.max(1) - boxes.min(1)
df
Ind Department Value1 Value2 Value3 Value4 range
0 1 Electronics 5 4 3 2 3
1 2 Clothing 4 3 2 1 3
2 3 Grocery 3 3 5 1 4
Same end result, but longer route, in my opinion - set the first two columns as index, get the difference of the max and min for each row, and reset the index :
(df
 .set_index(["Ind", "Department"])
 .assign(max_min=lambda x: x.max(axis=1) - x.min(axis=1))
 .reset_index()
)
I have two dataframes
df1
Name class value
Sri 1 5
Ram 2 8
viv 3 4
df2
Name class value
Sri 1 5
viv 4 4
My desired output is:
Name class value
Sri 2 10
Ram 2 8
viv 7 8
Please help, thanks in advance!
I think you need set_index for both DataFrames, then add, and last reset_index:
df = df1.set_index('Name').add(df2.set_index('Name'), fill_value=0).reset_index()
print(df)
Name class value
0 Ram 2.0 8.0
1 Sri 2.0 10.0
2 viv 7.0 8.0
If values in Name are not unique, use groupby and aggregate sum first:
df = df1.groupby('Name').sum().add(df2.groupby('Name').sum(), fill_value=0).reset_index()
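A minimal sketch of why the groupby step matters, using hypothetical frames in which 'Sri' appears twice in df1 (data invented for illustration):

import pandas as pd

df1 = pd.DataFrame({'Name': ['Sri', 'Sri', 'Ram'], 'class': [1, 1, 2], 'value': [5, 2, 8]})
df2 = pd.DataFrame({'Name': ['Sri', 'viv'], 'class': [1, 4], 'value': [5, 4]})

# aggregate per Name first, so the index is unique before aligning and adding
df = (df1.groupby('Name').sum()
         .add(df2.groupby('Name').sum(), fill_value=0)
         .reset_index())
print(df)
#   Name  class  value
# 0  Ram    2.0    8.0
# 1  Sri    3.0   12.0
# 2  viv    4.0    4.0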
pd.concat + groupby + sum
You can concatenate your individual dataframes and then group by your key column (note the double brackets: recent pandas versions require selecting multiple columns with a list rather than a bare tuple):
df = pd.concat([df1, df2])\
       .groupby('Name')[['class', 'value']]\
       .sum().reset_index()
print(df)
Name class value
0 Ram 2 8
1 Sri 2 10
2 viv 7 8
I have a pandas DataFrame. I'm trying to fill the nans of the Price column based on the average price of the corresponding level in the Section column. What's an efficient and elegant way to do this? My data looks something like this
Name Sex Section Price
Joe M 1 2
Bob M 1 nan
Nancy F 2 5
Grace F 1 6
Jen F 2 3
Paul M 2 nan
You could combine groupby, transform, and mean. Note that I've modified your example because otherwise both Sections would have the same mean value. Starting from
In [21]: df
Out[21]:
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 NaN
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 NaN
we can use
df["Price"] = (df["Price"].fillna(df.groupby("Section")["Price"].transform("mean"))
to produce
In [23]: df
Out[23]:
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 4.0
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 7.5
This works because we can compute the mean by Section:
In [29]: df.groupby("Section")["Price"].mean()
Out[29]:
Section
1 4.0
2 7.5
Name: Price, dtype: float64
and broadcast this back up to a full Series that we can pass to fillna() using transform:
In [30]: df.groupby("Section")["Price"].transform("mean")
Out[30]:
0 4.0
1 4.0
2 7.5
3 4.0
4 7.5
5 7.5
Name: Price, dtype: float64
pandas, surgical but slower
Refer to DSM's answer above for a quicker pandas solution.
This is a more surgical approach that may provide some perspective and possibly be useful.
use groupby
calculate our mean for each Section
means = df.groupby('Section').Price.mean()
identify nulls
use isnull to get a boolean mask for slicing
nulls = df.Price.isnull()
use map
slice the Section column to limit to just those rows with null Price
fills = df.Section[nulls].map(means)
use loc
fill in the spots in df only where nulls are
df.loc[nulls, 'Price'] = fills
All together
means = df.groupby('Section').Price.mean()
nulls = df.Price.isnull()
fills = df.Section[nulls].map(means)
df.loc[nulls, 'Price'] = fills
print(df)
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 4.0
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 7.5
by "corresponding level" i am assuming you mean with equal section value.
if so, you can solve this by
for section_value in sorted(set(df.Section)):
    mask = df['Section'] == section_value
    df.loc[mask, 'Price'] = df.loc[mask, 'Price'].fillna(df.loc[mask, 'Price'].mean())
hope it helps! peace