Calculate weighted sum using two columns in pandas dataframe - python

I am trying to calculate a weighted sum using two columns in a pandas dataframe.
Dataframe structure:
unique_id  weight       value
1          0.061042375  20.16094523
1          0.3064548    19.50932003
1          0.008310739  18.76469039
1          0.624192086  21.25
2          0.061042375  20.23776924
2          0.3064548    19.63366165
2          0.008310739  18.76299395
2          0.624192086  21.25
.......
The output I want is:
Weighted sum for each unique_id = sum(weight * value)
Example: weighted sum for unique_id 1 = (0.061042375 * 20.16094523) + (0.3064548 * 19.50932003) + (0.008310739 * 18.76469039) + (0.624192086 * 21.25) ≈ 20.629427
I checked out this answer (Calculate weighted average using a pandas/dataframe) but could not figure out the correct way of applying it to my specific scenario.
This is what I am doing, based on the above answer:
#Assume temp_weighted_sum_dataframe is the dataframe stated above
grouped_data = temp_weighted_sum_dataframe.groupby('unique_id') #I think this groups data based on unique_id values
weighted_sum_output = (grouped_data.weight * grouped_data.value).transform("sum") #This should let me multiply weight and value for every record within each group and sum them up to one value per group.
# On the above line I get the error: TypeError: unsupported operand type(s) for *: 'SeriesGroupBy' and 'SeriesGroupBy'
Any help is appreciated, thanks
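The error occurs because grouped_data.weight and grouped_data.value are SeriesGroupBy objects, which do not support elementwise multiplication. A minimal sketch of one direct fix, multiplying inside each group via apply (the answers below show alternatives that multiply before grouping):

# Assume temp_weighted_sum_dataframe is the dataframe stated above
weighted_sum_output = temp_weighted_sum_dataframe.groupby('unique_id').apply(
    lambda g: (g['weight'] * g['value']).sum()  # weighted sum within each group
)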

The accepted answer in the linked question would indeed solve your problem. However, I would solve it differently, with just one groupby. Note that this computes a weighted average (the product sum divided by the summed weights); since each group's weights sum to 1 in your data, it equals the weighted sum:
u = (df.assign(s=df['weight'] * df['value'])
       .groupby('unique_id')
       [['s', 'weight']]
       .sum())
u['s'] / u['weight']
Output:
unique_id
1    20.629427
2    20.672208
dtype: float64

You could do it this way:
df['partial_sum'] = df['weight'] * df['value']
out = df.groupby('unique_id')['partial_sum'].agg('sum')
Output:
unique_id
1    20.629427
2    20.672208
Or:
df['weight'].mul(df['value']).groupby(df['unique_id']).sum()
This produces the same output.
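For a self-contained check, here is a sketch that rebuilds the sample data from the question and runs the one-liner; it reproduces the values above:

import pandas as pd

df = pd.DataFrame({
    'unique_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'weight': [0.061042375, 0.3064548, 0.008310739, 0.624192086] * 2,
    'value': [20.16094523, 19.50932003, 18.76469039, 21.25,
              20.23776924, 19.63366165, 18.76299395, 21.25],
})

print(df['weight'].mul(df['value']).groupby(df['unique_id']).sum())
# unique_id
# 1    20.629427
# 2    20.672208
# dtype: float64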

You may take advantage of agg together with @ (the matrix-multiplication operator, i.e. a dot product):
df.groupby('unique_id')[['weight']].agg(lambda x: x.weight @ x.value)
Out[24]:
              weight
unique_id
1          20.629427
2          20.672208
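For reference, @ between two aligned Series computes their dot product, which is the same as multiplying elementwise and summing. A minimal sketch:

import pandas as pd

w = pd.Series([0.25, 0.75])
v = pd.Series([10.0, 20.0])

print(w @ v)          # 17.5
print((w * v).sum())  # 17.5, identical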


Cannot set a DataFrame with multiple columns to the single column total_servings

I am a beginner getting familiar with pandas.
It throws an error when I try to create a new column this way:
drinks['total_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].apply(calculate, axis=1)
Below is my code; I get the following error for line number 9:
"Cannot set a DataFrame with multiple columns to the single column total_servings"
Any help or suggestion would be appreciated :)
import pandas as pd

drinks = pd.read_csv('drinks.csv')

def calculate(drinks):
    return drinks['beer_servings'] + drinks['spirit_servings'] + drinks['wine_servings']

print(drinks)
drinks['total_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
drinks
In your code, when the function calculate is called with axis=1, each row of the DataFrame is passed to it as the argument. Here, the function calculate returns a DataFrame with multiple columns, but you are trying to assign it to a single column, which is not possible. You can update your code to this:
def calculate(each_row):
    return each_row['beer_servings'] + each_row['spirit_servings'] + each_row['wine_servings']

drinks['total_servings'] = drinks.apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
print(drinks)
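As a side note, the row-wise total does not need apply at all; a vectorized sketch of the same calculation:

# Row-wise sum of the three serving columns, no Python-level loop
drinks['total_servings'] = drinks[['beer_servings', 'spirit_servings', 'wine_servings']].sum(axis=1)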
I suppose the reason is a wrong argument name inside the calculate method: the given argument is drink, but drinks is used to calculate the sum of the columns.
The reason is that drink is a Series object that represents a row, and the sum of its elements is a scalar; meanwhile, drinks is a DataFrame, and the sum of its columns is a Series object.
The sample code below shows that this method works.
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 1, 1],
    "B": [2, 2, 2, 2, 2],
    "C": [3, 3, 3, 3, 3],
})

def calculate(to_calc_df):
    return to_calc_df["A"] + to_calc_df["B"] + to_calc_df["C"]

df["total"] = df.loc[:, "A":"C"].apply(calculate, axis=1)
print(df)
Result:
   A  B  C  total
0  1  2  3      6
1  1  2  3      6
2  1  2  3      6
3  1  2  3      6
4  1  2  3      6

Do calculations for multiple columns with some conditions in pandas dataframe

My question is related to my previous question, but it is different, so I created a new post even though the data is the same.
I would like to do some calculations for multiple columns with some conditions in pandas dataframe.
My table:
id1   date_time           address     a_size    flag
reom  2005-8-20 22:51:10  75157.5413  ceifwekd  1
reom  2005-8-20 1:01:25   3571.37946  ceifwekd  1
reom  2005-8-20 11:21:01  3571.37946  tnohcve   0
reom  2005-8-20 8:29:09   97439.219   tnohcve   0
penr  2005-8-20 17:07:16  97439.219   ceifwekd  1
penr  2005-8-20 9:10:37   7391.6258   ceifwekd  0
I need to get the percentage of flag == 1 by "address":
df['ratio'] = df['address'].map(df.groupby('address').apply(lambda x: x[x['flag'] == 1].count() / x['flag'].count()))
But I got an error:
TypeError: 'DataFrame' object is not callable
Thanks
Just use df.groupby('address')['flag'].mean(). Since flag is either 0 or 1, the group mean is exactly the fraction of rows with flag == 1.
I would use transform with 'mean':
df['ratio'] = df.groupby('address')['flag'].transform('mean')
You can try transform, spelling the ratio out on the flag column:
df['ratio'] = df.groupby('address')['flag'].transform(lambda x: x[x == 1].count() / x.count())
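A self-contained sketch with a tiny made-up frame, showing the per-address ratio broadcast back to every row:

import pandas as pd

df = pd.DataFrame({'address': ['a', 'a', 'b', 'b', 'b'],
                   'flag':    [1,   0,   1,   1,   0]})

# The mean of a 0/1 flag per group is the fraction of rows with flag == 1
df['ratio'] = df.groupby('address')['flag'].transform('mean')
print(df)
#   address  flag     ratio
# 0       a     1  0.500000
# 1       a     0  0.500000
# 2       b     1  0.666667
# 3       b     1  0.666667
# 4       b     0  0.666667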

Building complex subsets in Pandas DataFrame

I'm finding my way around GroupBy, but I still need some help. Let's say that I have a DataFrame with a column Group, giving each object's group number, some parameter R, and spherical coordinates RA and Dec. Here is a mock DataFrame:
df = pd.DataFrame({
    'R': (-21.0, -21.5, -22.1, -23.7, -23.8, -20.4, -21.8, -19.3, -22.5, -24.7, -19.9),
    'RA': (154.362789, 154.409301, 154.419191, 154.474165, 154.424842, 162.568516,
           8.355454, 8.346812, 8.728223, 8.759622, 8.799796),
    'Dec': (-0.495605, -0.453085, -0.481657, -0.614827, -0.584243, 8.214719,
            8.355454, 8.346812, 8.728223, 8.759622, 8.799796),
    'Group': (1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
})
I want to build a selection containing, for each group, the "brightest" object, i.e. the one with the smallest R (or the greatest absolute value, since R is negative), plus the 3 objects of the group closest to it (so I keep 4 objects in each group; we can assume no group has fewer than 4 objects if needed).
We assume here that we have defined the following functions:
# deg to rad
def d2r(x):
    return x * np.pi / 180.0

# rad to deg
def r2d(x):
    return x * 180.0 / np.pi

# Computes separation on a sphere
def calc_sep(phi1, theta1, phi2, theta2):
    return np.arccos(np.sin(theta1)*np.sin(theta2) +
                     np.cos(theta1)*np.cos(theta2)*np.cos(phi2 - phi1))
and that the separation between two objects is given by r2d(calc_sep(RA1, Dec1, RA2, Dec2)), with RA1 being the RA of the first object, and so on.
I can't figure out how to use GroupBy to achieve this...
What you can do here is build a more specific helper function that gets applied to each "sub-frame" (each group).
GroupBy is really just a facility that creates something like an iterator of (group id, DataFrame) pairs, and a function is applied to each of these when you call .groupby().apply. (That glazes over a lot of details, see here for some details on internals if you're interested.)
So after defining your three NumPy-based functions, also define:
def sep_df(df, keep=3):
    min_r = df.loc[df.R.idxmin()]
    RA1, Dec1 = min_r.RA, min_r.Dec
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    idx = sep.nsmallest(keep + 1).index
    return df.loc[idx]
Then just apply and you get a MultiIndex DataFrame where the first index level is the group.
print(df.groupby('Group').apply(sep_df))
                Dec  Group     R         RA
Group
1     3    -0.61483      1 -23.7  154.47416
      2    -0.48166      1 -22.1  154.41919
      0    -0.49561      1 -21.0  154.36279
      4    -0.58424      1 -23.8  154.42484
2     8     8.72822      2 -22.5    8.72822
      10    8.79980      2 -19.9    8.79980
      6     8.35545      2 -21.8    8.35545
      9     8.75962      2 -24.7    8.75962
With some comments interspersed:
def sep_df(df, keep=3):
    # Applied to each sub-DataFrame (this is what GroupBy does under the hood)
    # Get RA and Dec values at minimum R
    min_r = df.loc[df.R.idxmin()]    # Series - the row at which R is minimum
    RA1, Dec1 = min_r.RA, min_r.Dec  # The relevant 2 scalars within this row
    # Calculate separation for each pair including the minimum R row
    # The result is a Series of separations, same length as `df`
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    # Get index values of the `keep` (default 3) smallest results
    # Retain `keep + 1` values because one will be the minimum R
    # row, where separation == 0
    idx = sep.nsmallest(keep + 1).index
    # Restrict the result to those 3 index labels + your minimum R row
    return df.loc[idx]
For speed, consider passing sort=False to GroupBy if the result still works for you.
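For instance (the result is unchanged here, since the groups already appear in sorted order):

df.groupby('Group', sort=False).apply(sep_df)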
"I want to build a selection containing for each group the "brightest" object...and the 3 closest objects of the group"
Step 1:
create a dataframe for the brightest object in each group
maxR = df.sort_values('R').groupby('Group')[['Group', 'Dec', 'RA']].head(1)
Step 2:
merge the two frames on Group and calculate the separation
merged = df.merge(maxR, on='Group', suffixes=['', '_max'])
merged['sep'] = merged.apply(
    lambda x: r2d(calc_sep(x.RA, x.Dec, x.RA_max, x.Dec_max)),
    axis=1
)
Step 3:
order the data frame, group by 'Group', (optionally) discard intermediate fields, and take the first 4 rows from each group
finaldf = (merged.sort_values(['Group', 'sep'], ascending=[1, 1])
                 .groupby('Group')[df.columns]
                 .head(4))
Produces the following data frame with your sample data:
          Dec  Group     R          RA
4   -0.584243      1 -23.8  154.424842
3   -0.614827      1 -23.7  154.474165
2   -0.481657      1 -22.1  154.419191
0   -0.495605      1 -21.0  154.362789
9    8.759622      2 -24.7    8.759622
8    8.728223      2 -22.5    8.728223
10   8.799796      2 -19.9    8.799796
6    8.355454      2 -21.8    8.355454

Summarize a column in pandas data frame based on other columns

I have a small data frame tbl:
         CatAreaSqKm  CatMean  CatPctFull  CatCount       CatSum
COMID
1861888       0.2439   0.0000    0.000000         0     0.000000
1862004       0.4050  27.9765   18.222222        82  2294.072964
1862014       0.0720  27.9765   28.750000        23   643.459490

         UpCatAreaSqKm  UpCatMean  UpCatPctFull  UpCatCount      UpCatSum
COMID
1861888    105360.5349  29.177349     97.901832   114610993  3.344045e+09
1862004    105445.4517  29.174944     97.902537   114704191  3.346488e+09
1862014    105360.2127  29.177349     97.902093   114610948  3.344044e+09
I want to do the following operation:
tbl['WsMean'] = ((tbl.CatSum + tbl.UpCatSum)/(tbl.CatCount + tbl.UpCatCount))
However, if CatCount + UpCatCount is zero I will be dividing by zero, so for that particular row I want to set WsMean to zero, but for the others I would like it to be computed by the statement above. How can I do this? I can only think of a statement like:
tbl['WsMean'] = 0
but that would blanket all records in the table with 0.
Any ideas? Thanks
Dividing zero by zero results in a NaN value (and here, whenever the combined count is zero, the combined sum should be zero as well). You could use fillna(0) to replace the NaNs with zeros:
tbl['WsMean'] = ((tbl.CatSum + tbl.UpCatSum)/(tbl.CatCount + tbl.UpCatCount)).fillna(0)
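If the numerator could ever be nonzero while the count is zero, the division would produce inf rather than NaN; a sketch of a more explicit guard (assuming numpy is imported as np):

denom = tbl.CatCount + tbl.UpCatCount
# Compute the ratio everywhere, then overwrite the zero-count rows with 0
tbl['WsMean'] = np.where(denom == 0, 0, (tbl.CatSum + tbl.UpCatSum) / denom)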

Annualized return by group

I have monthly returns and want to calculate the annualized return for two groups. Below is the sample data.
Return_M Rise
0.097425 1
0.188547 1
-0.1509 1
0.28011 1
-0.09596 1
0.041459 1
0.106838 1
0.046581 0
-0.16068 0
0.009242 0
0.006104 0
-0.00709 0
0.050352 0
-0.01023 0
-0.00731 0
0.031946 0
0.048552 0
This is what I tried, but the code actually counts the length of df1, not the length within each group. I am hoping for a method that can be applied broadly.
df2 = df1.groupby(['Rise'])[['Return_M']].apply(lambda x:np.prod(1+x)**(12/len(x)))
This is the expected output:
Rise  Return_M
1     0.249862
0    -0.00443
You only have to group by the Rise column and aggregate on the Return_M column.
The following snippet assumes you want to divide by 12 (based on your question):
df2 = df1.groupby('Rise').agg({'Return_M': 'sum'}).reset_index()
df2['avg'] = df2['Return_M'] / 12
df2[['Rise', 'avg']]
But if you need the average based on however many records you have for each group of Rise, you can simply do:
df2 = df1.groupby('Rise').agg({'Return_M': 'mean'})
EDIT: Editing the answer based on OP's comment:
To get the geometric annualized return as per your formula, the following will work:
df.groupby('Rise').Return_M.apply(lambda x: (1+x).product() ** (12/float(len(x))))
However, the output is different from the expected output you posted in your question:
Rise
0 0.986765
1 1.952498
This however is exactly the correct output as per the formula you described.
I did this calculation manually too, for Rise = 1:
I took the product of each (1 + Return_M) value.
I raised the product to (12 divided by the length of the group, which is 7 for this group).
(1 + 0.097425) * (1 + 0.188547) * (1 + -0.1509) * (1 + 0.28011) * (1 + -0.09596) * (1 + 0.041459) * (1 + 0.106838) = 1.4774446702
1.4774446702 ^ (12/7) = 1.9524983367
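The same check in code, a quick sketch using the seven Rise == 1 returns from the sample:

import numpy as np

returns = np.array([0.097425, 0.188547, -0.1509, 0.28011,
                    -0.09596, 0.041459, 0.106838])
total = np.prod(1 + returns)         # ~1.4774446702
print(total ** (12 / len(returns)))  # ~1.9524983367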
So just check if your logic is correct. Please mark this answer as accepted if it solves your problem.
