I have a table:
Sex Value1 Value2 City
M 2 1 Berlin
W 3 5 Paris
W 1 3 Paris
M 2 5 Berlin
M 4 2 Paris
I want to calculate the average of Value1 and Value2 for different groups. In my original dataset I have 10 group variables (each with at most 5 characteristics, e.g. 5 cities), which I have shortened to Sex and City (2 characteristics) in this example. The result should look like this:
AvgOverall AvgM AvgW AvgBerlin AvgParis
Value1 2,4 2,6 2 2 2,66
Value2 3,2 2,6 4 3 3,3
I am familiar with groupby and tried
df.groupby('City').mean()
But here we have the problem that Sex is also getting into the calculation. Does anyone have an idea how to solve this? Thanks in advance!
You can group by each of the 2 columns separately into 2 DataFrames and then use concat, together with the means of the numeric columns (non-numeric columns are excluded):
# numeric_only=True keeps only the numeric columns (required in pandas >= 2.0;
# older versions dropped non-numeric columns like Sex automatically)
df1 = df.groupby('City').mean(numeric_only=True).T
df2 = df.groupby('Sex').mean(numeric_only=True).T
df3 = pd.concat([df.mean(numeric_only=True).rename('Overall'), df2, df1], axis=1).add_prefix('Avg')
print(df3)
AvgOverall AvgM AvgW AvgBerlin AvgParis
Value1 2.4 2.666667 2.0 2.0 2.666667
Value2 3.2 2.666667 4.0 3.0 3.333333
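Since the original dataset has 10 group variables, here is a minimal sketch generalizing the same idea, assuming the grouping columns are collected in a list (group_cols is a name introduced for illustration):

group_cols = ['Sex', 'City']  # extend with the remaining grouping variables
pieces = [df.mean(numeric_only=True).rename('Overall')]
for col in group_cols:
    # one transposed block of means per grouping variable
    pieces.append(df.groupby(col).mean(numeric_only=True).T)
result = pd.concat(pieces, axis=1).add_prefix('Avg')
print(result)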
I have a dataset similar to this:
name group val1 val2
John A 3 2
Cici B 4 3
Ian C 2 2
Zhang D 2 1
Zhang E 1 2
Ian F 1 2
John B 2 1
Ian B 1 2
I did a pivot table, and it now looks like this, using this piece of code:
df_pivot = pd.pivot_table(df, values=['val1', 'val2'], index=['name', 'group']).reset_index()
df_pivot
name group val1 val2
John A 3 2
John B 2 1
Ian C 2 2
Ian F 1 2
Ian B 1 2
Zhang D 2 1
Zhang E 1 2
Cici B 4 3
After the pivot table, I need to 1) group by name and 2) calculate the delta between groups. Take John as an example.
The output should be:
John A-B 1 1
Ian C-F 1 0
F-B 0 0
B-C 1 0 (the delta is -1, but we only do absolute value)
How do I move forward from my pivot table?
Getting each combination to subtract (A-B, A-C, B-C) isn't directly possible with a simple groupby. I suggest that you pivot your data and use a custom function to calculate each possible combination of differences:
import pandas as pd
import itertools

def combo_subtraction(df, level=0):
    # all unique group labels at the given column level
    unique_groups = df.columns.levels[level]
    combos = itertools.combinations(unique_groups, 2)
    pieces = {}
    for g1, g2 in combos:
        name = "{}-{}".format(g1, g2)
        pieces[name] = df.xs(g1, level=level, axis=1) - df.xs(g2, level=level, axis=1)
    return pd.concat(pieces)

out = (df.pivot(index="name", columns="group")  # convert data to wide format
         .pipe(combo_subtraction, level=1)      # apply our combination subtraction
         .dropna()                              # clean up the result
         .swaplevel()
         .sort_index())
print(out)
val1 val2
name
Ian A-B 0.0 0.0
A-C -1.0 0.0
B-C -1.0 0.0
John A-B 1.0 1.0
Zhang A-B 1.0 -1.0
The combo_subtraction function simply iterates over all possible pairs of group labels ("A"-"B", "A"-"C", and so on) and performs the subtraction. It then sticks the results of these combinations back together, forming our result.
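The expected output in the question keeps only absolute values (see the note next to B-C above); if that is what you need, you can take the absolute value of the result afterwards:

# keep only absolute deltas, as in the question's expected output
out = out.abs()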
I have a problem that you hopefully can help with.
I have a dataframe with multiple columns that looks something like this:
education experience ExpenseA ExpenseB ExpenseC
uni yes 3 2 5
uni no 7 6 8
middle yes 2 0 8
high no 12 5 8
uni yes 3 7 5
The Expenses A, B and C should add up to 10 per row, but often they don't because the data was not gathered correctly. For the rows where this is not the case, I want to rescale the values proportionally.
The formula for this should be (cell value) / ((sum of [ExpenseA] to [ExpenseC]) / 10).
Example, row two: total = 21 --> each cell should become (value / 2.1).
How can I iterate this over all the rows for these specific columns?
I think you need to divide by the row sums divided by 10, excluding the first 2 columns, selected by DataFrame.iloc:
df.iloc[:, 2:] = df.iloc[:, 2:].div(df.iloc[:, 2:].sum(axis=1).div(10), axis=0)
print(df)
education experience ExpenseA ExpenseB ExpenseC
0 uni yes 3.000000 2.000000 5.000000
1 uni no 3.333333 2.857143 3.809524
2 middle yes 2.000000 0.000000 8.000000
3 high no 4.800000 2.000000 3.200000
4 uni yes 2.000000 4.666667 3.333333
Or select the columns whose names contain the Expense substring by DataFrame.filter, and divide by their row sums:
df1 = df.filter(like='Expense')
df[df1.columns] = df1.div(df1.sum(axis=1).div(10), axis=0)
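As a quick sanity check (a sketch assuming one of the rescalings above has been applied in place):

# every row of the Expense columns should sum to 10 after rescaling,
# up to floating-point rounding
assert df.filter(like='Expense').sum(axis=1).round(6).eq(10).all()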
I have the following dataframe. The values are ratings by customers.
Ind Department Value1 Value2 Value3 Value4
1 Electronics 5 4 3 2
2 Clothing 4 3 2 1
3 Grocery 3 3 5 1
Here I would like to add a column range that is the difference between the max and min values in each row. The expected output is as below:
Ind Department Value1 Value2 Value3 Value4 range
1 Electronics 5 4 3 2 3
2 Clothing 4 3 2 1 3
3 Grocery 3 3 5 1 4
df['range'] = df.max(axis=1, numeric_only=True) - df.min(axis=1, numeric_only=True)
Note that this takes every numeric column into account, including Ind. If you want to specify column numbers to calculate the range:
df['range'] = df.iloc[:,col1index:col2index].max(axis=1) - df.iloc[:,col1index:col2index].min(axis=1)
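For example, with the table above the Value columns sit at positions 2 through 5 (0-based), so a concrete call, assuming that column order, would be:

# positions 2:6 select Value1..Value4, skipping Ind and Department
df['range'] = df.iloc[:, 2:6].max(axis=1) - df.iloc[:, 2:6].min(axis=1)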
You can try numpy.ptp (peak-to-peak, i.e. max minus min per row):
import numpy as np

np.ptp(df.loc[:, 'Value1':].values, axis=1)
array([3, 3, 4], dtype=int64)

df['range'] = np.ptp(df.loc[:, 'Value1':].values, axis=1)
Filter for only the Value columns and compute the difference of the max and min per row:
boxes = df.filter(like="Value")
df["range"] = boxes.max(1) - boxes.min(1)
df
Ind Department Value1 Value2 Value3 Value4 range
0 1 Electronics 5 4 3 2 3
1 2 Clothing 4 3 2 1 3
2 3 Grocery 3 3 5 1 4
Same end result, but longer route, in my opinion - set the first two columns as index, get the difference of the max and min for each row, and reset the index :
(df
 .set_index(["Ind", "Department"])
 .assign(max_min=lambda x: x.max(axis=1) - x.min(axis=1))
 .reset_index()
)
I have two dataframes
df1
Name class value
Sri 1 5
Ram 2 8
viv 3 4
df2
Name class value
Sri 1 5
viv 4 4
My desired output is:
Name class value
Sri 2 10
Ram 2 8
viv 7 8
Please help, thanks in advance!
I think you need set_index for both DataFrames, then add, and last reset_index:
df = df1.set_index('Name').add(df2.set_index('Name'), fill_value=0).reset_index()
print(df)
Name class value
0 Ram 2.0 8.0
1 Sri 2.0 10.0
2 viv 7.0 8.0
If values in Name are not unique, use groupby and aggregate sum first:
df = df1.groupby('Name').sum().add(df2.groupby('Name').sum(), fill_value=0).reset_index()
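A minimal sketch of why the groupby step matters, using hypothetical frames in which 'Sri' appears twice in df1 (data invented for illustration):

import pandas as pd

df1 = pd.DataFrame({'Name': ['Sri', 'Sri', 'Ram'], 'class': [1, 1, 2], 'value': [5, 2, 8]})
df2 = pd.DataFrame({'Name': ['Sri', 'viv'], 'class': [1, 4], 'value': [5, 4]})

# aggregate per Name first, so the index is unique before aligning and adding
df = (df1.groupby('Name').sum()
         .add(df2.groupby('Name').sum(), fill_value=0)
         .reset_index())
print(df)
#   Name  class  value
# 0  Ram    2.0    8.0
# 1  Sri    3.0   12.0
# 2  viv    4.0    4.0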
pd.concat + groupby + sum
You can concatenate your individual dataframes and then group by your key column (note the double brackets: recent pandas versions require selecting multiple columns with a list rather than a bare tuple):
df = pd.concat([df1, df2])\
       .groupby('Name')[['class', 'value']]\
       .sum().reset_index()
print(df)
Name class value
0 Ram 2 8
1 Sri 2 10
2 viv 7 8
I have a pandas DataFrame. I'm trying to fill the nans of the Price column based on the average price of the corresponding level in the Section column. What's an efficient and elegant way to do this? My data looks something like this
Name Sex Section Price
Joe M 1 2
Bob M 1 nan
Nancy F 2 5
Grace F 1 6
Jen F 2 3
Paul M 2 nan
You could combine groupby, transform, and mean. Note that I've modified your example because otherwise both Sections would have the same mean value. Starting from
In [21]: df
Out[21]:
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 NaN
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 NaN
we can use
df["Price"] = (df["Price"].fillna(df.groupby("Section")["Price"].transform("mean"))
to produce
In [23]: df
Out[23]:
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 4.0
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 7.5
This works because we can compute the mean by Section:
In [29]: df.groupby("Section")["Price"].mean()
Out[29]:
Section
1 4.0
2 7.5
Name: Price, dtype: float64
and broadcast this back up to a full Series that we can pass to fillna() using transform:
In [30]: df.groupby("Section")["Price"].transform("mean")
Out[30]:
0 4.0
1 4.0
2 7.5
3 4.0
4 7.5
5 7.5
Name: Price, dtype: float64
pandas, surgical but slower
Refer to DSM's answer above for a quicker pandas solution.
This is a more surgical approach that may provide some perspective and possibly be useful.
use groupby
calculate our mean for each Section
means = df.groupby('Section').Price.mean()
identify nulls
use isnull to get a boolean mask for slicing
nulls = df.Price.isnull()
use map
slice the Section column to limit to just those rows with null Price
fills = df.Section[nulls].map(means)
use loc
fill in the spots in df only where nulls are
df.loc[nulls, 'Price'] = fills
All together
means = df.groupby('Section').Price.mean()
nulls = df.Price.isnull()
fills = df.Section[nulls].map(means)
df.loc[nulls, 'Price'] = fills
print(df)
Name Sex Section Price
0 Joe M 1 2.0
1 Bob M 1 4.0
2 Nancy F 2 5.0
3 Grace F 1 6.0
4 Jen F 2 10.0
5 Paul M 2 7.5
by "corresponding level" i am assuming you mean with equal section value.
if so, you can solve this by
for section_value in sorted(set(df.Section)):
    mask = df['Section'] == section_value
    df.loc[mask, 'Price'] = df.loc[mask, 'Price'].fillna(df.loc[mask, 'Price'].mean())
hope it helps! peace