function in groupby pandas - python

I would like to calculate a mean value of "bonus" grouped by the column "first_name", but the denominator is not the count of cases, because not all cases have a weight of 1; instead they may have a weight of 0.5.
For instance, in the case of Jason the value I want is the sum of his bonus divided by 2.5.
Since in real life I have to group by several columns, like area, etc., I would like to adapt a groupby to this situation.
Here is my try, but it gives me the normal mean:
import pandas as pd

raw_data = {'area': [1, 2, 3, 3, 4],
            'first_name': ['Jason', 'Jason', 'Jason', 'Jake', 'Jake'],
            'bonus': [10, 20, 10, 30, 20],
            'weight': [1, 1, 0.5, 0.5, 1]}
df = pd.DataFrame(raw_data, columns=['area', 'first_name', 'bonus', 'weight'])
df
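The attempt itself is missing from the question; presumably it was the plain groupby mean, which divides by the row count instead of the summed weights. A minimal sketch of that naive version, for contrast:
# Naive mean: divides by the number of rows, not the summed weights,
# so Jason gets (10 + 20 + 10) / 3 ≈ 13.33 instead of 40 / 2.5 = 16
df.groupby('first_name')['bonus'].mean()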

Use:
(df.groupby('first_name')[['bonus', 'weight']].sum()
   #.add_prefix('sum_')  # optionally label the sums (adjust the column names below to match)
   .assign(result=lambda x: x['bonus'].div(x['weight'])))
or
(df[['first_name', 'bonus', 'weight']].groupby('first_name').sum()
   #.add_prefix('sum_')
   .assign(result=lambda x: x['bonus'].div(x['weight'])))
Output:
            bonus  weight     result
first_name
Jake           50     1.5  33.333333
Jason          40     2.5  16.000000

One way is to use groupby().apply with np.average. Note that np.average computes the weighted mean sum(weight * bonus) / sum(weight), which is not the same quantity as the sum(bonus) / sum(weight) above, hence the different numbers:
import numpy as np

df.groupby('first_name').apply(lambda x: np.average(x.bonus, weights=x.weight))
Output:
first_name
Jake 23.333333
Jason 14.000000
dtype: float64
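To make the two formulas concrete, here is Jason's group worked out both ways (a quick check, not part of either answer):
import numpy as np

bonus = np.array([10, 20, 10])    # Jason's bonuses
weight = np.array([1, 1, 0.5])    # Jason's weights

print(bonus.sum() / weight.sum())         # 16.0 -> sum(bonus) / sum(weight)
print(np.average(bonus, weights=weight))  # 14.0 -> sum(weight * bonus) / sum(weight)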

Related

BMI calculation from two columns of a pandas data frame with missing values

I am still at the beginning of my Python career and I am trying to add a column with the BMI to a DataFrame, calculated from two other columns. However, this does not work with my code yet, and maybe someone can help me. My data also quite often has no information, in which case I just want "NaN", and I think that's the reason my code doesn't work.
import numpy as np
import pandas as pd

df = pd.DataFrame({'gender': ['m', 'w', 'm', 'm'],
                   'bodyheight': [180, 170, 185, 'NaN'],
                   'bodyweight': [75, 59, 83, 90]},
                  columns=['gender', 'bodyheight', 'bodyweight'])
df.apply(lambda x: (x.bodyweight/(x.bodyheight**2)), axis=1)
That's because you have a string NaN in there. You could replace that with an actual NaN value; then use vectorized division (also I feel like you forgot to divide the height by 100 to convert centimeters to meters):
df = df.replace('NaN', np.nan)
df['BMI'] = df['bodyweight'] / df['bodyheight'].div(100).pow(2)
Output:
  gender  bodyheight  bodyweight        BMI
0      m       180.0          75  23.148148
1      w       170.0          59  20.415225
2      m       185.0          83  24.251278
3      m         NaN          90        NaN
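As a side note, a slightly more robust variant (my addition, not part of the answer above) is to coerce the whole column to numeric, which turns any non-numeric placeholder into a real NaN in one step:
# errors='coerce' converts anything that can't be parsed as a number to NaN
df['bodyheight'] = pd.to_numeric(df['bodyheight'], errors='coerce')
df['BMI'] = df['bodyweight'] / df['bodyheight'].div(100).pow(2)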

With pandas, how to create a table of average occurrences, using more than one column?

Say I have a df that looks like this:
name day var_A var_B
0 Pete Wed 4 5
1 Luck Thu 1 10
2 Pete Sun 10 10
And I want to sum var_A and var_B for every name/person and then divide this sum by the number of occurrences of that name/person.
Let's take Pete for example. Sum his variables (in this case, (4 + 10) + (5 + 10) = 29) and divide this sum by the number of occurrences of Pete in the df (29/2 = 14.5). The "day" column would be eliminated; there would be only one column for the name and another for the average.
Would look like this:
>>> df.method().method()
   name   avg
0  Pete  14.5
1  Luck  11.0
I've been trying to do this using groupby and other methods, but I eventually got stuck.
Any help would be appreciated.
I came up with
df.groupby('name')[['var_A', 'var_B']].apply(lambda g: g.stack().sum()/len(g)).rename('avg').reset_index()
which produces the correct result, but I'm not sure it's the most elegant way.
pandas' groupby is a lazy expression, and as such it is reusable:
# create group
group = df.drop(columns="day").groupby("name")
# compute output
group.sum().sum(axis=1) / group.size()
name
Luck 11.0
Pete 14.5
dtype: float64
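An equivalent formulation (my own restatement, not from the answer) is to build the per-row total first and take an ordinary group mean, since the sum of sums divided by the group size is just the mean of the row totals:
# total per row, then the plain mean of that total per name
(df.assign(total=df['var_A'] + df['var_B'])
   .groupby('name', as_index=False)['total'].mean()
   .rename(columns={'total': 'avg'}))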

Finding the correlation between two dataframes

TeamA TeamB TeamC
12 17 19
13 20 21
14 21 26
15 22 15
difference = numpy.abs(data['TeamA'] - data['TeamB'])
teamC = data['TeamC']
df1 = pd.DataFrame(difference)
df1.columns = ['diff']
df2 = pd.DataFrame(teamC)
correlation = df1.corrwith(df2,axis=0)
I am looking to return the correlation between (the absolute points difference between team A and Team B) and the number of points of team C. However, my code is not returning any number. Any suggestion?
When corrwith is passed a DataFrame, pandas first aligns the two frames on both axes and then correlates the matching columns. Since df1 only has the column 'diff' and df2 only has 'TeamC', nothing lines up and the result is all NaN, which is why no number comes back.
To correlate the columns of df1 against a single column, pass that column as a Series instead:
df1.corrwith(df2["TeamC"])
Output:
diff 0.18221
dtype: float64
This answer is just an extension of this thread:
pandas.DataFrame corrwith() method
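For a single pair of columns you do not need corrwith at all; Series.corr does the same job directly (a minor simplification, reusing the variables defined in the question):
# correlate the absolute difference directly against TeamC's points
difference.corr(teamC)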

Apply function to a filtered group after groupby

I know there are tons of groupby-filter questions about pandas, but I've gone through a number of them and they don't have what I need.
Anyway here's what I have for the dataframe df:
user1 user2 date quantity
-----------------------------
Alice Bob 2018-05-21 100
Alice Bob 2018-05-19 20
Alice Carol 2018-01-01 1000
Bob Carol 2018-02-01 100
I want to calculate a function (let's say some function func) of the quantity for a given user1-user2 pair for weekdays only.
So far what I have are:
df['day'] = df['date'].dt.weekday
df.groupby(['user1','user2']).filter(lambda x: (x.day < 5).any() )
But I don't get what I expect. Apparently, what the filter does is to select only those pairs where at least one day entry is < 5. What I need though, are all rows where the day column is less than 5 for one particular user1-user2 pair.
One straightforward solution is to filter your dataframe before you perform the groupby:
res = df[df['date'].dt.weekday < 5].groupby(...)
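The groupby(...) part is deliberately left open above; as an illustration only (the question never specifies what func is, so sum is my stand-in), summing the weekday quantities per pair would look like:
# keep only weekday rows (Monday=0 .. Friday=4), then aggregate per pair;
# sum() is a placeholder for whatever `func` is actually needed
res = (df[df['date'].dt.weekday < 5]
       .groupby(['user1', 'user2'])['quantity']
       .sum())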

Passing multiple columns as arguments to aggregation function groupby

I am still struggling to get really familiar with pandas groupby operations. When passing a function to agg, what happens if the aggregation function needs to consider values in columns other than the one being aggregated?
Consider the following data frame example, which lists sales of products by two salesmen:
import datetime

import numpy as np
import pandas as pd

DateList = np.array([(datetime.date.today() - datetime.timedelta(7)) + datetime.timedelta(days=x) for x in [1, 2, 2, 3, 4, 4, 5]] +
                    [(datetime.date.today() - datetime.timedelta(7)) + datetime.timedelta(days=x) for x in [1, 1, 2, 3, 4, 5, 5]])
Names = np.array(['Joe' for x in range(7)] + ['John' for x in range(7)])
Product = np.array(['Product1', 'Product1', 'Product2', 'Product2', 'Product2', 'Product3', 'Product3',
                    'Product1', 'Product2', 'Product2', 'Product2', 'Product2', 'Product2', 'Product3'])
Volume = np.array([100, 0, 150, 175, 15, 120, 150, 75, 0, 115, 130, 135, 10, 120])
Prices = {'Product1': 25.99, 'Product2': 13.99, 'Product3': 8.99}

SalesDF = pd.DataFrame({'Date': DateList, 'Seller': Names, 'Product': Product, 'Volume': Volume})
SalesDF.sort_values(['Date', 'Seller'], inplace=True)
SalesDF['Prices'] = SalesDF.Product.map(Prices)
On some days each seller sells more than one item. Suppose you wanted to aggregate the data set into a single observation per day/seller, based on which product sold the most volume. This would be simple for the volume measure itself: just pass a max function to agg. But determining which Product and Price should remain means finding the row with the highest volume and returning the values that correspond to that maximum.
I am able to get the result I want by using the index values in the column that is passed to the function when agg is called and referencing the underlying data frame:
def AggFunc(x, df, col1):
    # Index values of the data in the column passed as x
    IndexVals = list(x.index)
    # Values of col1 in the underlying data frame at those index positions
    ColList = list(df[col1][IndexVals])
    # Max value of col1 within this group
    MaxVal = np.max(ColList)
    # Position of that max value within the group
    MaxValIndex = ColList.index(MaxVal)
    # Return the entry of x at the position of col1's maximum
    return list(x)[MaxValIndex]

FunctionDict = {'Product': lambda x: AggFunc(x, SalesDF, 'Volume'),
                'Volume': 'max',
                'Prices': lambda x: AggFunc(x, SalesDF, 'Volume')}

SalesDF.groupby(['Date', 'Seller'], as_index=False).agg(FunctionDict)
But I'm wondering if there is a better way where I can pass 'Volume' as an argument to the function that aggregates Product without having to get the index values and create lists from the data in the underlying dataframe? Something tells me no, as agg passes each column as a series to the aggregation function, rather than the dataframe itself.
Any ideas?
Thanks
Maybe extracting the right indices first using .idxmax would be simpler?
>>> grouped = SalesDF.groupby(["Date", "Seller"])["Volume"]
>>> max_idx = grouped.apply(pd.Series.idxmax)
>>> SalesDF.loc[max_idx]
          Date   Product Seller  Volume  Prices
0   2013-11-04  Product1    Joe     100   25.99
7   2013-11-04  Product1   John      75   25.99
2   2013-11-05  Product2    Joe     150   13.99
9   2013-11-05  Product2   John     115   13.99
3   2013-11-06  Product2    Joe     175   13.99
10  2013-11-06  Product2   John     130   13.99
5   2013-11-07  Product3    Joe     120    8.99
11  2013-11-07  Product2   John     135   13.99
6   2013-11-08  Product3    Joe     150    8.99
13  2013-11-08  Product3   John     120    8.99
idxmax gives the index of the first occurrence of the maximum value. If you want to keep multiple products if they all obtain the maximum volume, it'd be a little different, something more like
>>> max_vols = SalesDF.groupby(["Date", "Seller"])["Volume"].transform(max)
>>> SalesDF[SalesDF.Volume == max_vols]
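In more recent pandas versions the apply step can be dropped, since a grouped Series exposes idxmax directly (a small simplification of the same idea):
# one index label per (Date, Seller) group, pointing at that group's max-Volume row
max_idx = SalesDF.groupby(["Date", "Seller"])["Volume"].idxmax()
SalesDF.loc[max_idx]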
