Weighted Mean as a Column in Pandas - python

I am trying to add a column containing the weighted average of 4 value columns, using 4 corresponding columns of weights:
import pandas as pd

df = pd.DataFrame.from_dict(dict([('A', [2000, 1000, 2509, 2145]),
                                  ('A_Weight', [37, 47, 33, 16]),
                                  ('B', [2100, 1500, 2000, 1600]),
                                  ('B_weights', [17, 21, 6, 2]),
                                  ('C', [2500, 1400, 0, 2300]),
                                  ('C_weights', [5, 35, 0, 40]),
                                  ('D', [0, 1600, 2100, 2000]),
                                  ('D_weights', [0, 32, 10, 5])]))
I want the weighted average to be in a new column named "WA", but every time I try, I get NaN.
The formula I used: ((A * A_Weight) + (B * B_weights) + (C * C_weights) + (D * D_weights)) / (sum of all four weights)
The desired DataFrame would have a new column with the following values:
df['WA'] = [2071.19, 1323.70, 2363.20, 2214.60]
Thank you

A straightforward and simple way to do it is as follows.
(Since your weight columns are not consistently named, e.g. some with an 's' and some without, some with a capital 'W' and some with a lowercase 'w', it is not convenient to group the columns, e.g. with .filter().)
df['WA'] = (
    (df['A'] * df['A_Weight']) + (df['B'] * df['B_weights'])
    + (df['C'] * df['C_weights']) + (df['D'] * df['D_weights'])
) / (df['A_Weight'] + df['B_weights'] + df['C_weights'] + df['D_weights'])
Result:
print(df)
A A_Weight B B_weights C C_weights D D_weights WA
0 2000 37 2100 17 2500 5 0 0 2071.186441
1 1000 47 1500 21 1400 35 1600 32 1323.703704
2 2509 33 2000 6 0 0 2100 10 2363.204082
3 2145 16 1600 2 2300 40 2000 5 2214.603175
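By the way, if you are free to rename the columns, making the weight names consistent would unlock the .filter() approach ruled out above. A minimal sketch, assuming a hypothetical uniform 'X_weight' naming scheme:

df = df.rename(columns={'A_Weight': 'A_weight', 'B_weights': 'B_weight',
                        'C_weights': 'C_weight', 'D_weights': 'D_weight'})
values = df[['A', 'B', 'C', 'D']]
weights = df.filter(like='_weight')
weights.columns = weights.columns.str.replace('_weight', '')  # align labels with the value columns
df['WA'] = (values * weights).sum(axis=1) / weights.sum(axis=1)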

The not so straightforward way:
- Group columns by prefix via str.split.
- Get the column-wise product via groupby prod.
- Get the row-wise sum of the products with sum on axis 1.
- Use filter + sum on axis 1 to get the sum of the "weights" columns.
- Divide the group product sums by the weight sums.
df['WA'] = (
df.groupby(df.columns.str.split('_').str[0], axis=1).prod().sum(axis=1)
/ df.filter(regex='_[wW]eight(s)?$').sum(axis=1)
)
A A_Weight B B_weights C C_weights D D_weights WA
0 2000 37 2100 17 2500 5 0 0 2071.186441
1 1000 47 1500 21 1400 35 1600 32 1323.703704
2 2509 33 2000 6 0 0 2100 10 2363.204082
3 2145 16 1600 2 2300 40 2000 5 2214.603175
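Note that groupby(..., axis=1) is deprecated in pandas 2.x. A sketch of the same idea that should stay compatible, transposing so the grouping happens along the index (it assumes df still has only the original eight columns, i.e. no WA yet):

prefixes = df.columns.str.split('_').str[0]
products = df.T.groupby(prefixes).prod().T.sum(axis=1)  # per-prefix product, then row-wise sum
weights = df.filter(regex='_[wW]eight(s)?$').sum(axis=1)
df['WA'] = products / weights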

Another option to an old question:
Split data into numerator and denominator:
numerator = df.filter(regex=r"[A-Z]$")
denominator = df.filter(like='_')
Convert denominator into a MultiIndex; this comes in handy when computing with the numerator:
denominator.columns = denominator.columns.str.split('_', expand=True)
Multiply the numerator by the denominator, and divide the sum of the outcome by the sum of the denominator:
outcome = numerator.mul(denominator, level=0, axis=1).sum(1)
outcome = outcome.div(denominator.sum(1))
df.assign(WA = outcome)
A A_Weight B B_weights C C_weights D D_weights WA
0 2000 37 2100 17 2500 5 0 0 2071.186441
1 1000 47 1500 21 1400 35 1600 32 1323.703704
2 2509 33 2000 6 0 0 2100 10 2363.204082
3 2145 16 1600 2 2300 40 2000 5 2214.603175
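As a sanity check, numpy.average supports weights directly, so the same figures can be reproduced row by row. A small sketch, assuming the column layout above:

import numpy as np

value_cols = ['A', 'B', 'C', 'D']
weight_cols = ['A_Weight', 'B_weights', 'C_weights', 'D_weights']
df['WA'] = [np.average(row[value_cols], weights=row[weight_cols])
            for _, row in df.iterrows()]  # sum(v*w)/sum(w) per row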

Related

How to efficiently combine dataframe rows based on conditions?

I have the following dataset, which contains a column with the cluster number, the number of observations in that cluster and the maximum value of another variable x grouped by that cluster.
import numpy as np
import pandas as pd

clust = np.arange(0, 10)
obs = np.array([1041, 544, 310, 1648, 1862, 2120, 2916, 5148, 12733, 1])
x_max = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
df = pd.DataFrame(np.c_[clust, obs, x_max], columns=['clust', 'obs', 'x_max'])
clust obs x_max
0 0 1041 10
1 1 544 20
2 2 310 30
3 3 1648 40
4 4 1862 50
5 5 2120 60
6 6 2916 70
7 7 5148 80
8 8 12733 90
9 9 1 100
My task is to combine the clust row values with adjacent rows, so that each cluster contains at least 1000 observations.
My current attempt gets stuck in an infinite loop because the last cluster has only 1 observation.
condition = True
while condition:
    condition = False
    for i in np.arange(0, len(df)):
        if df.loc[i, 'obs'] < 1000:
            df.loc[i, 'clust'] = df.loc[i, 'clust'] + 1
            df = df.groupby('clust', as_index=False).agg({'obs': 'sum', 'x_max': 'max'})
            condition = True
            break
Is there perhaps a more efficient way of doing this? I come from a SAS background, where such situations would be solved with the if last.row condition, but it seems there is no such condition in Python.
The resulting table should look like this
clust obs x_max
0 1041 10
1 2502 40
2 1862 50
3 2120 60
4 2916 70
5 5148 80
6 12734 100
Here is another way. A vectorized approach is difficult to implement here, but using a for loop over an array (or a list) will be faster than using loc at each iteration. Also, it is not good practice to modify df within the loop; it can only cause problems.
# define variables
s = 0    # running sum of observations
gr = []  # final grouping values
i = 0    # group index

# loop over observations from an array
for obs in df['obs'].to_numpy():
    s += obs
    gr.append(i)
    # check that the size of the group is big enough
    if s > 1000:
        s = 0
        i += 1

# condition to deal with the last rows if the last group is not big enough
if s != 0:
    gr = [i - 1 if val == i else val for val in gr]

# now create your new df
new_df = (
    df.groupby(gr).agg({'obs': 'sum', 'x_max': 'max'})
      .reset_index().rename(columns={'index': 'cluster'})
)
print(new_df)
print(new_df)
# cluster obs x_max
# 0 0 1041 10
# 1 1 2502 40
# 2 2 1862 50
# 3 3 2120 60
# 4 4 2916 70
# 5 5 5148 80
# 6 6 12734 100
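Since the accumulation is inherently sequential, one option is to wrap the loop in a small reusable helper instead of chasing a vectorized form. A sketch under the same logic (the helper name and the >= min_size threshold are mine; the original closed a group on s > 1000, which gives the same result here):

def group_min_size(values, min_size):
    # a running sum decides where each group closes
    labels, s, i = [], 0, 0
    for v in values:
        s += v
        labels.append(i)
        if s >= min_size:
            s, i = 0, i + 1
    # merge a too-small trailing group into the previous one
    if s != 0 and i > 0:
        labels = [i - 1 if lab == i else lab for lab in labels]
    return labels

new_df = (df.groupby(group_min_size(df['obs'].to_numpy(), 1000))
            .agg({'obs': 'sum', 'x_max': 'max'}))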

How to use multiple conditions, including selecting on quantile in Python

Imagine the following dataset df:
Row  Population_density  Distance
1    400                 50
2    500                 30
3    300                 40
4    200                 120
5    500                 60
6    1000                50
7    3300                30
8    500                 90
9    700                 100
10   1000                110
11   900                 200
12   850                 30
How can I make a new dummy column that represents a 1 when values of df['Population_density'] are above the third quantile (>75%) AND the df['Distance'] is < 100, while a 0 is given to the remainder of the data? Consequently, rows 6 and 7 should have a 1 while the other rows should have a 0.
Creating a dummy variable with only one criterion can be fairly easy. For instance, the following works for creating a new dummy variable that contains a 1 when the Distance is < 100 and a 0 otherwise: df['Distance_Below_100'] = np.where(df['Distance'] < 100, 1, 0). However, I do not know how to combine conditions when one of them involves a quantile selection (in this case, the upper 25% of the variable Population_density).
import pandas as pd

# assign data as lists
data = {'Row': range(1, 13, 1),
        'Population_density': [400, 500, 300, 200, 500, 1000, 3300, 500, 700, 1000, 900, 850],
        'Distance': [50, 30, 40, 120, 60, 50, 30, 90, 100, 110, 200, 30]}

# create the DataFrame
df = pd.DataFrame(data)
You can use & or | to join the conditions
import numpy as np

df['Distance_Below_100'] = np.where(
    df['Population_density'].gt(df['Population_density'].quantile(0.75))
    & df['Distance'].lt(100),
    1, 0)
print(df)
Row Population_density Distance Distance_Below_100
0 1 400 50 0
1 2 500 30 0
2 3 300 40 0
3 4 200 120 0
4 5 500 60 0
5 6 1000 50 1
6 7 3300 30 1
7 8 500 90 0
8 9 700 100 0
9 10 1000 110 0
10 11 900 200 0
11 12 850 30 0
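Since the combined condition is already a boolean Series, np.where is optional; casting the mask gives the same 0/1 column (a small sketch):

mask = (df['Population_density'] > df['Population_density'].quantile(0.75)) \
       & (df['Distance'] < 100)
df['Distance_Below_100'] = mask.astype(int)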
Hey, to apply a function over a DataFrame, I recommend using a lambda.
For example, given a function:
def myFunction(value):
    pass
To create a new column 'new_column', where pick_cell is the cell you want to apply the function to (note that apply needs axis=1 for row-wise access):
df['new_column'] = df.apply(lambda x: myFunction(x.pick_cell), axis=1)
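For a concrete, runnable version of that idea on the dataset above (is_close and its threshold are just an illustration, not part of the question):

def is_close(distance):
    # flag rows whose Distance is under 100
    return 1 if distance < 100 else 0

df['new_column'] = df.apply(lambda x: is_close(x.Distance), axis=1)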

Pandas Group By With Overlapping Bins

I want to sum up data across overlapping bins. Basically the question here, but instead of the bins being (0-8 years old), (9-17 years old), (18-26 years old), (27-35 years old), and (36-44 years old), I want them to be (0-8 years old), (1-9 years old), (2-10 years old), (3-11 years old), and (4-12 years old).
Starting with a df like this
id  awards  age
1   100     24
1   150     26
1   50      54
2   193     34
2   209     50
I am using the code from this answer to calculate summation across non-overlapping bins.
bins = [9 * i for i in range(0, df['age'].max() // 9 + 2)]
cuts = pd.cut(df['age'], bins, right=False)
print(cuts)
0 [18, 27)
1 [18, 27)
2 [54, 63)
3 [27, 36)
4 [45, 54)
Name: age, dtype: category
Categories (7, interval[int64, left]): [[0, 9) < [9, 18) < [18, 27) < [27, 36) < [36, 45) < [45, 54) < [54, 63)]
df_out = (df.groupby(['id', cuts])
.agg(total_awards=('awards', 'sum'))
.reset_index(level=0)
.reset_index(drop=True)
)
df_out['age_interval'] = df_out.groupby('id').cumcount()
Result
print(df_out)
id total_awards age_interval
0 1 0 0
1 1 0 1
2 1 250 2
3 1 0 3
4 1 0 4
5 1 0 5
6 1 50 6
7 2 0 0
8 2 0 1
9 2 0 2
10 2 193 3
11 2 0 4
12 2 209 5
13 2 0 6
Is it possible to work off the existing code to do this with overlapping bins?
First pivot_table your data to get a row per id with the ages as columns. Then reindex to get all possible ages, from 0 to at least the max in the age column (here I use the max plus the interval length). Now you can use rolling along the columns. Rename the columns to create meaningful names. Finally, stack and reset_index to get a dataframe with the expected shape.
interval = 9 #include both bounds like 0 and 8 for the first interval
res = (
df.pivot_table(index='id', columns='age', values='awards',
aggfunc=sum, fill_value=0)
.reindex(columns=range(0, df['age'].max()+interval), fill_value=0)
.rolling(interval, axis=1, min_periods=interval).sum()
.rename(columns=lambda x: f'{x-interval+1}-{x} y.o.')
.stack()
.reset_index(name='awards')
)
and you get with the input data provided in the question
print(res)
# id age awards
# 0 1 0-8 y.o. 0.0
# 1 1 1-9 y.o. 0.0
# ...
# 15 1 15-23 y.o. 0.0
# 16 1 16-24 y.o. 100.0
# 17 1 17-25 y.o. 100.0
# 18 1 18-26 y.o. 250.0
# 19 1 19-27 y.o. 250.0
# 20 1 20-28 y.o. 250.0
# 21 1 21-29 y.o. 250.0
# 22 1 22-30 y.o. 250.0
# 23 1 23-31 y.o. 250.0
# 24 1 24-32 y.o. 250.0
# 25 1 25-33 y.o. 150.0
# 26 1 26-34 y.o. 150.0
# 27 1 27-35 y.o. 0.0
# ...
# 45 1 45-53 y.o. 0.0
# 46 1 46-54 y.o. 50.0
# 47 1 47-55 y.o. 50.0
# 48 1 48-56 y.o. 50.0
# 49 1 49-57 y.o. 50.0
# ...
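Note that rolling(..., axis=1) is deprecated in recent pandas. A sketch of the same computation that rolls over the index of the transposed frame instead (it reuses the interval variable defined above):

res = (
    df.pivot_table(index='id', columns='age', values='awards',
                   aggfunc='sum', fill_value=0)
      .reindex(columns=range(0, df['age'].max() + interval), fill_value=0)
      .T.rolling(interval, min_periods=interval).sum().T  # roll over rows of the transpose
      .rename(columns=lambda x: f'{x - interval + 1}-{x} y.o.')
      .stack()
      .reset_index(name='awards')
)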
I think the best would be to first compute per-age sums, and then a rolling window to get all 9-year intervals. This only works because all your intervals have the same size; otherwise it would be much harder.
>>> totals = df.groupby('age')['awards'].sum()
>>> totals = totals.reindex(np.arange(0, df['age'].max() + 9)).fillna(0, downcast='infer')
>>> totals
0 6
1 2
2 4
3 6
4 4
..
98 0
99 0
100 0
101 0
102 0
Name: age, Length: 103, dtype: int64
>>> totals.rolling(9).sum().dropna().astype(int).rename(lambda age: f'{age-8}-{age}')
0-8 42
1-9 43
2-10 45
3-11 47
4-12 47
..
90-98 31
91-99 27
92-100 20
93-101 13
94-102 8
Name: age, Length: 95, dtype: int64
This is slightly complicated by the fact you also want to group by id, but the idea stays the same:
>>> idx = pd.MultiIndex.from_product([df['id'].unique(), np.arange(0, df['age'].max() + 9)], names=['id', 'age'])
>>> totals = df.groupby(['id', 'age']).sum().reindex(idx).fillna(0, downcast='infer')
>>> totals
awards
1 0 128
1 204
2 136
3 367
4 387
... ...
2 98 0
99 0
100 0
101 0
102 0
[206 rows x 1 columns]
>>> totals.groupby('id').rolling(9).sum().droplevel(0).dropna().astype(int).reset_index('id')
id awards
age
8 1 3112
9 1 3390
10 1 3431
11 1 3609
12 1 3820
.. .. ...
98 2 1786
99 2 1226
100 2 900
101 2 561
102 2 317
[190 rows x 2 columns]
This is the same as #Ben.T's answer, except we keep the Series shape while his answer pivots it to a dataframe. At any step you could .stack('age') or .unstack('age') to switch between the two answers' formats.
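For example, converting the final Series-shaped result into the wide per-interval layout is one unstack away (a sketch based on the totals frame above):

wide = (totals.groupby('id').rolling(9).sum()
              .droplevel(0).dropna().astype(int)
              .unstack('age'))  # rows: id, columns: interval end age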
IIUC, you can use pd.IntervalIndex with some list comprehension:
ii = pd.IntervalIndex.from_tuples(
    [
        (s, e)
        # .items() replaces Series.iteritems(), which was removed in pandas 2.0
        for e, s in pd.Series(np.arange(51)).rolling(9).agg(min).dropna().items()
    ]
)
df_out = pd.concat(
[
pd.Series(ii.contains(x["age"]) * x["awards"], index=ii)
for i, x in df[["age", "awards"]].iterrows()
],
axis=1,
).groupby(level=0).sum().T
df_out.stack()
Output:
0 (0.0, 8.0] 0
(1.0, 9.0] 0
(2.0, 10.0] 0
(3.0, 11.0] 0
(4.0, 12.0] 0
...
4 (38.0, 46.0] 0
(39.0, 47.0] 0
(40.0, 48.0] 0
(41.0, 49.0] 0
(42.0, 50.0] 209
Length: 215, dtype: int64
An old-school way without pd.cut, using a for loop and query.
import pandas as pd

max_age = df["age"].max()
interval_length = 8

values = []
for min_age in range(max_age - interval_length + 1):
    upper = min_age + interval_length
    # query references local variables with @, not #
    awards = df.query("@min_age <= age <= @upper")["awards"].sum()
    values.append([min_age, upper, awards])

df_out = pd.DataFrame(values, columns=["min_age", "max_age", "awards"])
Let me know if this is what you want :)
Let df be a DataFrame:
import pandas as pd
import random

def r(b, e):
    return [random.randint(b, e) for _ in range(300)]

df = pd.DataFrame({'id': r(1, 3), 'awards': r(0, 400), 'age': r(1, 99)})
For binning by age, I would advise creating a new column since it is clearer (and faster):
df['bin'] = df['age'].apply(lambda x: x // 9)
print(df)
The number of awards per id per bin can be obtained using simply:
totals_separate = df.groupby(['id', 'bin'])['awards'].sum()
print(totals_separate)
If I understand correctly, you would like the sum for each window of size 9 rows:
totals_rolling = df.groupby(['id', 'bin'])['awards'].rolling(9, min_periods=1).sum()
print(totals_rolling)
Reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.rolling.html

How to extract mean and fluctuation by equal index?

I have a CSV file like the one below (after sorting the dataframe by iy):
iy,u
1,80
1,90
1,70
1,50
1,60
2,20
2,30
2,35
2,15
2,25
I'm trying to compute the mean and the fluctuation when iy are equal. For example, for the CSV above, what I want is something like this:
iy,u,U,u'
1,80,70,10
1,90,70,20
1,70,70,0
1,50,70,-20
1,60,70,-10
2,20,25,-5
2,30,25,5
2,35,25,10
2,15,25,-10
2,25,25,0
Where U is the average of u when iy are equal, and u' is simply u - U, the fluctuation. I know there's a groupby.mean() function in pandas, but I don't want to group the dataframe; I just want to take the mean, put the values in a new column, and then calculate the fluctuation.
How can I proceed?
Use groupby with transform to calculate the mean for each group and assign it to a new column 'U', then subtract the two columns:
df['U'] = df.groupby('iy').transform('mean')
df["u'"] = df['u'] - df['U']
df
Output:
iy u U u'
0 1 80 70 10
1 1 90 70 20
2 1 70 70 0
3 1 50 70 -20
4 1 60 70 -10
5 2 20 25 -5
6 2 30 25 5
7 2 35 25 10
8 2 15 25 -10
9 2 25 25 0
You could get fancy and do it in one line:
df.assign(U=df.groupby('iy').transform('mean')).eval("u_prime = u-U")
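One caveat worth hedging: df.groupby('iy').transform('mean') averages every non-grouping column, which happens to work here because u is the only one. With more columns in the frame, selecting the column explicitly is safer:

df['U'] = df.groupby('iy')['u'].transform('mean')  # mean of u only
df["u'"] = df['u'] - df['U']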

Python find mean of all rows by a column and then find distance

I have a dataframe as below. I understand that df.groupby("degree").mean() would give me the mean by degree. I would like to take those means and find the distance between each data point and each of them. In this case, for each data point I would like to get 3 distances, one from each of the means (4, 40), (2, 80) and (4, 94) (the output of df.groupby("degree").mean()), and create 3 new columns. Distance should be calculated by the formulas BCA_mean = (name-4)^3 + (score-40)^3, M.Tech_mean = (name-2)^3 + (score-80)^3, MBA_mean = (name-4)^3 + (score-94)^3.
import pandas as pd

# dictionary of lists ('data' rather than 'dict', which shadows the builtin)
data = {'name': [5, 4, 2, 3],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score': [90, 40, 80, 98]}

# creating a dataframe from a dictionary
df = pd.DataFrame(data)
print(df)
name degree score
0 5 MBA 90
1 4 BCA 40
2 2 M.Tech 80
3 3 MBA 98
df.groupby("degree").mean()
degree name score
BCA 4 40
M.Tech 2 80
MBA 4 94
Update 1:
My real dataset has more than 100 columns, so I would prefer something that scales to that. The logic is still the same: for each mean, subtract the mean value from the column, take the cube of each cell, and add.
I found something like the below, but I am not sure whether there is a more efficient way.
import numpy as np

y = df.groupby("degree").mean()
print(y)

df["mean0"] = (np.square(df[['name', 'score']].subtract(y.iloc[0, :], axis=1))).sum(axis=1)
df
import pandas as pd

# dictionary of lists
data = {'degree': ["MBA", "BCA", "M.Tech", "MBA", "BCA"],
        'name': [5, 4, 2, 3, 2],
        'score': [90, 40, 80, 98, 60],
        'game': [100, 200, 300, 100, 400],
        'money': [100, 200, 300, 100, 400],
        'loan': [100, 200, 300, 100, 400],
        'rent': [100, 200, 300, 100, 400],
        'location': [100, 200, 300, 100, 400]}

# creating a dataframe from a dictionary
df = pd.DataFrame(data)
print(df)

dfx = df.groupby("degree").mean()
print(dfx)

def fun(x):
    if x[0] == 'BCA':
        return x[1:] - dfx.iloc[0, :].tolist()
    if x[0] == 'M.Tech':
        return x[1:] - dfx.iloc[1, :].tolist()
    if x[0] == 'MBA':
        return x[1:] - dfx.iloc[2, :].tolist()

df_added = df.apply(fun, axis=1)
df_added
Result (df):

   degree  name  score  game  money  loan  rent  location
0     MBA     5     90   100    100   100   100       100
1     BCA     4     40   200    200   200   200       200
2  M.Tech     2     80   300    300   300   300       300
3     MBA     3     98   100    100   100   100       100
4     BCA     2     60   400    400   400   400       400

Mean (dfx):

        name  score  game  money  loan  rent  location
degree
BCA        3     50   300    300   300   300       300
M.Tech     2     80   300    300   300   300       300
MBA        4     94   100    100   100   100       100

df_added (difference of each element from its mean column value):

   name  score  game  money  loan  rent  location
0     1     -4     0      0     0     0         0
1     1    -10  -100   -100  -100  -100      -100
2     0      0     0      0     0     0         0
3    -1      4     0      0     0     0         0
4    -1     10   100    100   100   100       100
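A more scalable alternative for the 100-plus-column case (a sketch; it reproduces df_added above without the per-degree if chain, using transform to broadcast each group's mean):

num_cols = df.columns.drop('degree')
# subtract each degree's mean from every numeric column in one shot
df_added = df[num_cols] - df.groupby('degree')[num_cols].transform('mean')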
