Pandas: Calculate Median Based on Multiple Conditions in Each Row

I am trying to calculate median values on the fly based on multiple conditions in each row of a data frame and am not getting there.
Basically, for every row, I am counting the number of people in the same department with rank B with pay greater than the pay listed in that row. I was able to get the count to work properly with a lambda function:
df['B Count'] = df.apply(lambda x: sum(df[(df['Department'] == x['Department']) & (df['Rank'] == 'B')]['Pay'] > x['Pay']), axis=1)
However, I now need to calculate the median for each case satisfying those conditions. So in row x of the data frame, I need the median of df['Pay'] for all others matching x['Department'] and df['Rank'] == 'B'. I can't apply .median() instead of sum(), as that gives me the median count, not the median pay. Any thoughts?
Using the fake data below, the 'B Count' code from above counts the number of B's in each Department with higher pay than each A. That part works fine. What I want is to then construct the 'B Median' column, calculating the median pay of the B's in each Department with higher pay than each A in the same Department.
Person  Department  Rank   Pay  B Count  B Median
     1  One         A     1000        1      1500
     2  One         B      800
     3  One         A      500        2      1150
     4  One         A     3000        0
     5  One         B     1500
     6  Two         B     2000
     7  Two         B     1800
     8  Two         A     1500        3      1800
     9  Two         B     1700
    10  Two         B     1000

Well, I was able to do what I wanted to do with a function:
import numpy as np

def median_b(x):
    # no B's in this department earn more than this row -> no median to take
    if x['B Count'] == 0:
        return np.nan
    return df[(df['Department'] == x['Department'])
              & (df['Rank'] == 'B')
              & (df['Pay'] > x['Pay'])]['Pay'].median()

df['B Median'] = df.apply(median_b, axis=1)
Do any of you know of better ways to achieve this result?
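For what it's worth, one way to consolidate the two steps is a single apply pass that returns both columns at once. This is a sketch along the same lines as the function above, not a performance improvement; the helper name b_stats is made up:
import numpy as np
import pandas as pd

def b_stats(row):
    # pay of every rank-B person in the same department earning more than this row
    higher = df.loc[(df['Department'] == row['Department'])
                    & (df['Rank'] == 'B')
                    & (df['Pay'] > row['Pay']), 'Pay']
    # median() of an empty Series is NaN, so no special case is needed
    return pd.Series({'B Count': len(higher), 'B Median': higher.median()})

df[['B Count', 'B Median']] = df.apply(b_stats, axis=1)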

Related

Calculate average based on available data points

Imagine I have the following data frame:
Product  Month 1  Month 2  Month 3  Month 4  Total
Stuff A        5        0        3        3     11
Stuff B       10       11        4        8     33
Stuff C        0        0       23       30     53
that can be constructed from:
import pandas as pd

df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
                   'Month 1': [5, 10, 0],
                   'Month 2': [0, 11, 0],
                   'Month 3': [3, 4, 23],
                   'Month 4': [3, 8, 30],
                   'Total': [11, 33, 53]})
This data frame shows the amount of units sold per product, per month.
Now, what I want to do is to create a new column called "Average" that calculates the average units sold per month. HOWEVER, notice in this example that Stuff C's values for months 1 and 2 are 0. This product was probably introduced in Month 3, so its average should be calculated based on months 3 and 4 only. Also notice that Stuff A's units sold in Month 2 were 0, but that does not mean the product was introduced in Month 3 since 5 units were sold in Month 1. That is, its average should be calculated based on all four months. Assume that the provided data frame may contain any number of months.
Based on these conditions, I have come up with the following solution in pseudo-code:
months = ["list of index names of months to calculate"]
x = len(months)
if df["Month 1"] != 0:
    df["Average"] = df["Total"] / x
elif df["Month 2"] != 0:
    df["Average"] = df["Total"] / (x - 1)
...
elif df["Month " + str(x)] != 0:
    df["Average"] = df["Total"] / 1
else:
    df["Average"] = 0
That way, the average would be calculated starting from the first month where units sold are different from 0. However, I haven't been able to translate this logical abstraction into actual working code. I couldn't manage to iterate over len(months) while maintaining the elif conditions. Or maybe there is a better, more practical approach.
I would appreciate any help, since I've been trying to crack this problem for a while with no success.
There is a NumPy function, np.trim_zeros, that trims leading and/or trailing zeros. Using a list comprehension, you can iterate over the relevant DataFrame rows, trim the leading zeros, and take the mean of what remains for each row.
Note that since 'Month 1' to 'Month 4' are consecutive, you can slice the columns between them using .loc.
import numpy as np

# trim='f' trims zeros from the front (leading) of each row only
df['Average Sales'] = [np.trim_zeros(row, trim='f').mean()
                       for row in df.loc[:, 'Month 1':'Month 4'].to_numpy()]
Output:
Product Month 1 Month 2 Month 3 Month 4 Total Average Sales
0 Stuff A 5 0 3 3 11 2.75
1 Stuff B 10 11 4 8 33 8.25
2 Stuff C 0 0 23 30 53 26.50
Try:
df = df.set_index(['Product', 'Total'])
df['Average'] = df.where(df.ne(0).cummax(axis=1)).mean(axis=1)
df_out = df.reset_index()
print(df_out)
Output:
Product Total Month 1 Month 2 Month 3 Month 4 Average
0 Stuff A 11 5 0 3 3 2.75
1 Stuff B 33 10 11 4 8 8.25
2 Stuff C 53 0 0 23 30 26.50
Details:
Move Product and Total into the dataframe index, so we can do the calculation on the rest of the dataframe.
First, create a boolean matrix by comparing each value to zero with ne. Then apply cummax along the rows: once a non-zero value appears, the entry stays True through the end of the row; if a row starts with zeros, the entries stay False until the first non-zero value, then turn True and remain True.
Next, use pd.DataFrame.where to keep only the values where that boolean matrix is True; the other values (the leading zeros) become NaN and are excluded from the mean calculation.
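To make those details concrete, here is the intermediate mask for the sample data (computed after the set_index call above):
mask = df.ne(0).cummax(axis=1)
print(mask)
#                Month 1  Month 2  Month 3  Month 4
# Product Total
# Stuff A 11        True     True     True     True
# Stuff B 33        True     True     True     True
# Stuff C 53       False    False     True     True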
If you don't mind it being a little memory inefficient, you could put your dataframe into a NumPy array. NumPy has a built-in way to drop zeros from an array, and you can then call mean on what is left. Note that this drops all zeros, not just the leading ones, so it handles a row like Stuff A's differently from the answers above. It could look something like this:
import numpy as np

# Stuff_A_DF is a placeholder for a single row of month values
arr = np.array(Stuff_A_DF)
mean = arr[np.nonzero(arr)].mean()   # keep only the non-zero entries
Alternatively, you could manually extract the row to a list, then loop through to remove the zeroes.
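A minimal sketch of that manual approach, trimming only the leading zeros so a zero month after launch still counts (the helper name average_after_launch is made up):
def average_after_launch(row):
    values = list(row)
    # drop leading zeros only; a zero after the first sale still counts as a month
    while values and values[0] == 0:
        values.pop(0)
    return sum(values) / len(values) if values else 0

df['Average'] = [average_after_launch(r)
                 for r in df.loc[:, 'Month 1':'Month 4'].to_numpy()]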

Finding the row wise mean

So I have a dataframe as follows:
e.g.:
x  y
a  2
b  4
c  7
I need a third column which is the mean of the other values of y:
x  y  mean
a  2  5.5
b  4  4.5
c  7  3
I am able to do this for the first row, but how do I do it for all rows, given that my dataframe contains 100,000 rows and each mean is calculated from the values of all the other rows?
You can calculate each row's remaining mean without a loop: take the total sum, subtract the current value, and divide by the count of the remaining items. The total sum and the total count are the same for every row.
total_sum = df['y'].sum()
total_length = len(df.index)
# use column assignment; df.mean would shadow the DataFrame's mean method
df['mean'] = (total_sum - df['y']) / (total_length - 1)
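For example, on the small frame above, this reproduces the expected column:
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': [2, 4, 7]})

total_sum = df['y'].sum()       # 13
total_length = len(df.index)    # 3
df['mean'] = (total_sum - df['y']) / (total_length - 1)
print(df)
#    x  y  mean
# 0  a  2   5.5
# 1  b  4   4.5
# 2  c  7   3.0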

Conditional Sampling of Data Frame in Python

I have a DataFrame of the names, sexes, and ages of individuals.
I would like to create a new Dataframe by sampling a fixed number of samples such that the average age of the new DataFrame is the same as the original DataFrame.
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({'Var': ['A', 'B', 'C', 'D', 'E'],
                          'Ages': [22, 35, 43, 18, np.nan]})
sample_df
Out[410]:
Var Ages
0 A 22
1 B 35
2 C 43
3 D 18
4 E NaN
I would like to sample only 3 rows such that the age of 'E' is equal to the mean of A, B, C, and D.
Consider an indefinite iteration using while True, breaking once the needs are met; depending on the variability of your data, this may take some time to process. The code below builds a list of 100-row samples and breaks after ten matching samples are collected.
samples = []
while True:
    sample_df = df.sample(n=100)
    if sample_df['Age'].mean() == df['Age'].mean():
        samples.append(sample_df)
    if len(samples) == 10:
        break
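Note that an exact float-equality match between two means may rarely (or never) occur, so the loop above can spin for a long time. A tolerance-based variant may be more practical; the 1% tolerance here is an assumption, not part of the original answer:
import numpy as np

samples = []
target = df['Age'].mean()
while len(samples) < 10:
    sample_df = df.sample(n=100)
    # accept a sample whose mean is within 1% of the population mean
    if np.isclose(sample_df['Age'].mean(), target, rtol=0.01):
        samples.append(sample_df)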

How to calculate pandas DF values based on grouped column

I'm relatively new to Pandas dataframes and I have to do simple calculation, but so far I haven't found a good way to go about it.
Basically what I have is:
type group amount
1 A real 55
2 A fake 12
3 B real 610
4 B fake 23
5 B real 45
Now, I have to add a new column showing the percentage of fake in each type's total. The formula for this table would be 12 / (55 + 12) * 100 for A and 23 / (610 + 23 + 45) * 100 for B, and the table should look something like this:
type group amount percentage
1 A real 55
2 A fake 12 17.91
3 B real 610
4 B fake 23
5 B real 45 3.39
I know about groupby statements and basically all the components I need for this (I guess...), but can't figure out how to combine to get this result.
df['percentage'] = (df.amount
                    / df.groupby(['type']).amount.transform('sum')
                   ).loc[df.group.eq('fake')].mul(100).round(2)
# rows outside the fake mask come back NaN; blank them to match the desired output
df['percentage'] = df['percentage'].fillna('')
df
If handling multiple fake rows per group per type, we can be a bit more careful. I'll set the index to preserve the type and group columns while I transform.
c = ['type', 'group']
d1 = df.set_index(c, append=True)
d1.amount /= d1.groupby(level=['type']).amount.transform('sum')
d1.reset_index(c)
From here, you can choose to leave that alone or consolidate the group column.
d1.groupby(level=c).sum().reset_index()
Try this out:
percentage = {}
for t in df.type.unique():   # `t` rather than `type`, to avoid shadowing the built-in
    numerator = df[(df.type == t) & (df.group == 'fake')].amount.sum()
    denominator = df[df.type == t].amount.sum()
    percentage[t] = numerator / denominator * 100
df['percentage'] = df.type.map(percentage)
If you wanted to make sure you accounted for multiple fake groups per type, you can do the following:
# select the amount column explicitly so both transforms return aligned Series
type_group_total = df.groupby(['type', 'group'])['amount'].transform('sum')
type_total = df.groupby('type')['amount'].transform('sum')
df['percentage'] = type_group_total / type_total
Output
type group amount percentage
0 A real 55 0.820896
1 A fake 12 0.179104
2 B real 610 0.899705
3 B fake 23 0.100295
4 B fake 45 0.100295
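For completeness, a one-pass sketch that combines the ratio with the blank cells shown in the question's desired output (the rounding and blank-fill are presentation choices, not requirements):
pct = df['amount'] / df.groupby('type')['amount'].transform('sum') * 100
df['percentage'] = pct.where(df['group'].eq('fake')).round(2).fillna('')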

Is there any quick function to do a look-back calculation in a pandas dataframe?

I want to implement a calculation like the following simple scenario:
a value computed as the sum of the daily data over the previous N days (N = 3 in the example below).
Dataframe df: (df.index is 'date')
date value
20140718 1
20140721 2
20140722 3
20140723 4
20140724 5
20140725 6
20140728 7
......
and I want to compute the following:
date value new
20140718 1 0
20140721 2 0
20140722 3 0
20140723 4 6 (3+2+1)
20140724 5 9 (4+3+2)
20140725 6 12 (5+4+3)
20140728 7 15 (6+5+4)
......
Currently I do this with a for loop:
N = 3
df['new'] = 0
for idx in df.index:
    loc = df.index.get_loc(idx)
    if loc - N >= 0:
        # sum the previous N rows, excluding the current one
        df.loc[idx, 'new'] = df['value'].iloc[loc - N:loc].sum()
But when the dataframe is long or N is big, this calculation becomes very slow. How can I implement it faster, with a built-in function or some other approach?
Besides, what if the scenario is more complex? Thanks.
Since you want the sum of the previous three values excluding the current one, you can use a rolling window of four and sum all but the last value:
new = df['value'].rolling(4, min_periods=4).apply(lambda x: x[:-1].sum())
This is the same as summing over a window of three and then shifting:
new = df['value'].rolling(3, min_periods=3).sum().shift()
Then
df['new'] = new.fillna(0)
