Imagine I have the following data frame:
Product  Month 1  Month 2  Month 3  Month 4  Total
Stuff A        5        0        3        3     11
Stuff B       10       11        4        8     33
Stuff C        0        0       23       30     53
that can be constructed from:
df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
'Month 1': [5, 10, 0],
'Month 2': [0, 11, 0],
'Month 3': [3, 4, 23],
'Month 4': [3, 8, 30],
'Total': [11, 33, 53]})
This data frame shows the amount of units sold per product, per month.
Now I want to create a new column called "Average" that holds the average units sold per month. However, notice that Stuff C's values for months 1 and 2 are 0. This product was probably introduced in Month 3, so its average should be calculated over months 3 and 4 only. Also notice that Stuff A sold 0 units in Month 2, but that does not mean the product was introduced in Month 3, since 5 units were sold in Month 1; its average should be calculated over all four months. Assume that the data frame may contain any number of months.
Based on these conditions, I have come up with the following solution in pseudo-code:
months = ["list of index names of months to calculate"]
x = len(months)
if df["Month 1"] != 0:
    df["Average"] = df["Total"] / x
elif df["Month 2"] != 0:
    df["Average"] = df["Total"] / (x - 1)
...
elif df["Month " + str(x)] != 0:
    df["Average"] = df["Total"] / 1
else:
    df["Average"] = 0
That way, the average would be calculated starting from the first month where units sold are different from 0. However, I haven't been able to translate this logical abstraction into actual working code. I couldn't manage to iterate over len(months) while maintaining the elif conditions. Or maybe there is a better, more practical approach.
I would appreciate any help, since I've been trying to crack this problem for a while with no success.
NumPy has a function, np.trim_zeros, that trims leading and/or trailing zeros. Using a list comprehension, you can iterate over the relevant DataFrame rows, trim the leading zeros and take the mean of what remains for each row.
Note that since 'Month 1' to 'Month 4' are consecutive columns, you can slice them using .loc.
import numpy as np
df['Average Sales'] = [np.trim_zeros(row, trim='f').mean() for row in df.loc[:, 'Month 1':'Month 4'].to_numpy()]
Output:
Product Month 1 Month 2 Month 3 Month 4 Total Average Sales
0 Stuff A 5 0 3 3 11 2.75
1 Stuff B 10 11 4 8 33 8.25
2 Stuff C 0 0 23 30 53 26.50
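As a small sketch of what np.trim_zeros does to a single row, and of one way to handle a product that never sold anything (the question's pseudo-code wants 0 there, whereas the mean of an empty array is NaN). The helper name leading_trimmed_mean is made up for this illustration and is not part of NumPy:

```python
import numpy as np

def leading_trimmed_mean(values):
    """Mean of `values` after dropping leading zeros; 0.0 if nothing is left.

    Hypothetical helper for illustration only.
    """
    # trim='f' removes zeros from the front only; interior zeros are kept
    trimmed = np.trim_zeros(np.asarray(values), trim='f')
    return float(trimmed.mean()) if trimmed.size else 0.0

stuff_c = leading_trimmed_mean([0, 0, 23, 30])    # averaged over months 3 and 4
never_sold = leading_trimmed_mean([0, 0, 0, 0])   # all zeros: falls back to 0.0
print(stuff_c, never_sold)
```

The explicit size check avoids the RuntimeWarning that NumPy emits when taking the mean of an empty array.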
Try:
df = df.set_index(['Product','Total'])
df['Average'] = df.where(df.ne(0).cummax(axis=1)).mean(axis=1)
df_out=df.reset_index()
print(df_out)
Output:
Product Total Month 1 Month 2 Month 3 Month 4 Average
0 Stuff A 11 5 0 3 3 2.75
1 Stuff B 33 10 11 4 8 8.25
2 Stuff C 53 0 0 23 30 26.50
Details:
Move Product and Total into the dataframe index, so we can do the calculation on the rest of the dataframe.
First create a boolean matrix by comparing to zero with ne. Then use cummax along the rows, which means that once there is a non-zero value, it will remain True until the end of the row. If the row starts with a zero, the False will stay until the first non-zero value, then turn to True and remain True.
Next, use pd.DataFrame.where to select only those values for which the boolean matrix is True; the other values (leading zeros) will be NaN and are not used in the calculation of the mean.
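To make the intermediate steps visible, here is a sketch that builds the mask one step at a time. The extra row 'Stuff D' (all zeros) is hypothetical and only added to show the fallback to 0 that the question's pseudo-code asks for:

```python
import pandas as pd

df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C', 'Stuff D'],
                   'Month 1': [5, 10, 0, 0],
                   'Month 2': [0, 11, 0, 0],
                   'Month 3': [3, 4, 23, 0],
                   'Month 4': [3, 8, 30, 0],
                   'Total': [11, 33, 53, 0]})   # 'Stuff D' is a made-up all-zero row

months = df.set_index(['Product', 'Total'])
nonzero = months.ne(0)               # True where units were sold
started = nonzero.cummax(axis=1)     # True from the first non-zero month onwards
masked = months.where(started)       # leading zeros become NaN

# mean skips NaN; a row that is entirely NaN yields NaN, which is replaced by 0
df['Average'] = masked.mean(axis=1).fillna(0).to_numpy()
print(df)
```

Printing nonzero, started and masked separately is a quick way to check that only the leading zeros are dropped.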
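If you don't mind it being a little memory-inefficient, you could put the row into a NumPy array. np.nonzero returns the indices of the non-zero elements, so indexing with it drops the zeros, and you can then call mean on what remains. Note that this removes every zero, not only the leading ones, so a product with a zero month after its launch (such as Stuff A) gets a higher average than the question asks for. It could look something like this:
import numpy as np
arr = np.array(Stuff_A_DF)  # Stuff_A_DF: placeholder for one product's monthly values
mean = arr[np.nonzero(arr)].mean()
Alternatively, you could manually extract the row to a list, then loop through it to remove the zeros.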
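A quick check of how the np.nonzero approach behaves on Stuff A's row from the question, compared with trimming only the leading zeros (a sketch, using the month values given in the question):

```python
import numpy as np

stuff_a = np.array([5, 0, 3, 3])

# Drops every zero, including the one in Month 2: (5 + 3 + 3) / 3
all_zeros_dropped = stuff_a[np.nonzero(stuff_a)].mean()

# Drops leading zeros only, so all four months count: 11 / 4
leading_only = np.trim_zeros(stuff_a, trim='f').mean()

print(all_zeros_dropped, leading_only)
```

The two results only agree for rows without zeros after the first sale.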
I have this dataframe:
lst = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,3,3,3,3,3,3,3,3,3,3,3,3,3]
ser = pd.Series(lst)
df1 = pd.DataFrame(ser, columns=['Quantity'])
When I check the unique values of the Quantity variable, I get the following distribution:
df1.groupby(['Quantity'])['Quantity'].count() / sum ( df1['Quantity'])
Quantity
0 0.741935
1 0.338710
2 0.016129
3 0.209677
Name: Quantity, dtype: float64
Because the value 2 represents only 0.016 of the total, I want to create a new categorical variable that groups the values into "bins" like:
Quantity
0
1-2
3+
How the bins are created is not important; the rule of thumb is:
If a number has low representation, it should be aggregated with other values into a class (bin).
Another example:
Quantity
0 2662035
1 1200
2 2
could be converted into:
Quantity
0
1+
You can define the bins the way you want with pandas.cut; by default the right edge of each bin is included:
import numpy as np
(pd.cut(df['Quantity'], bins=[-1, 0, 2, np.inf], labels=['0', '1-2', '3+'])
.value_counts()
)
Output:
0 57
1-2 29
3+ 5
Name: Quantity, dtype: int64
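For reference, a self-contained sketch of the same pd.cut call. The series is rebuilt here from the value counts reported elsewhere in this thread (46 zeros, 21 ones, one 2 and 13 threes), which is an assumption about the question's data, so the counts differ from the output quoted above:

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data from its value counts
df1 = pd.DataFrame({'Quantity': [0] * 46 + [1] * 21 + [2] * 1 + [3] * 13})

# Bins are right-inclusive by default: (-1, 0], (0, 2], (2, inf]
binned = pd.cut(df1['Quantity'],
                bins=[-1, 0, 2, np.inf],
                labels=['0', '1-2', '3+'])

# sort_index orders the result by category rather than by frequency
counts = binned.value_counts().sort_index()
print(counts)
```

Starting the first bin at -1 is what lets the value 0 fall into a bin of its own.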
Combining counts based on a threshold:
threshold = 0.05
c = df1['Quantity'].value_counts(sort=False).sort_index()
group = c.div(c.sum()).gt(threshold).cumsum()
(c.reset_index()
.groupby(group)
.agg({'index': lambda x: f'{x.iloc[0]}-{x.iloc[-1]}' if len(x)>1 else str(x.iloc[0]),
'Quantity': 'sum',
})
.set_index('index')
)
Output:
Quantity
index
0 46
1-2 22
3 13
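The grouping trick above can be spelled out step by step. This is only a sketch: the variable names are my own, the data is rebuilt from the counts shown in the output, and it avoids reset_index because the column names it produces differ between pandas versions:

```python
import pandas as pd

# Assumed reconstruction of the data from the counts shown above
quantity = pd.Series([0] * 46 + [1] * 21 + [2] * 1 + [3] * 13, name='Quantity')
threshold = 0.05

counts = quantity.value_counts().sort_index()
share = counts / counts.sum()

# Every value whose share exceeds the threshold opens a new group;
# a rare value keeps the group number of the frequent value before it.
group = (share > threshold).cumsum()

def label(values):
    """Label a group as 'first-last', or just the value if it stands alone."""
    if len(values) > 1:
        return f'{values.iloc[0]}-{values.iloc[-1]}'
    return str(values.iloc[0])

labels = {g: label(vals) for g, vals in counts.index.to_series().groupby(group)}
binned = counts.groupby(group).sum()
binned.index = [labels[g] for g in binned.index]
print(binned)
```

One consequence of the cumsum design is that a rare value is always merged with the nearest frequent value below it, never the one above.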
So I have the following:
#this is the data we have
df = pd.DataFrame(data=(['A','1-50', 10],['B','25-200', 15],['C','25-300', 5]), columns=['Category','Range', 'Qty'])
#these are the different range categories we need to have.
list_of_ranges = ['1-10', '10-25', '25-50', '50-100', '100-200', '200-300', '300-400']
# insert magic spells here
#this is what the result needs to look like
results = pd.DataFrame(data=(['A','1-10', 10],['A','10-25', 10],['A','25-50', 10],['B','25-50', 15],['B','50-100', 15],['B','100-200', 15],['C','25-50', 5],['C','50-100', 5],['C','100-200', 5],['C','200-300', 5]), columns=['Category','Range', 'Qty'])
As per the example above:
I have a df with ranges that need to be broken into subranges. All columns need to be duplicated for each subrange, except for the range itself, which takes the new value.
How can I do that?
Edit1:
Example of the logic
Area "A" has temperatures ranging from 1-50 degrees Celsius for 10 days per year.
This is a single row that reads:
1: A,1-50,10
This same row can be interpreted as: in Area "A" temperature ranges can be 1-10, 10-25, or 25-50 for 10 days per year.
So I would like to have 3 rows:
1: A,1-10,10
2: A,10-25,10
3: A,25-50,10
We need a couple of functions to work with 'ranges' as you defined them. Otherwise it is a matter of creating, for each 'Range' in the df, a list of the 'small ranges' that are 'inside' it, and then exploding the df:
def split_range(r):
    """
    Split a range into a tuple. The range is a string 'xx-yy'.
    """
    tokens = r.split('-')
    return (int(tokens[0]), int(tokens[1]))

def is_inside(r1, r2):
    """
    True if range r1 is inside r2. Each range is a string 'xx-yy'.
    """
    t1, t2 = split_range(r1), split_range(r2)
    return (t1[0] >= t2[0]) and (t1[1] <= t2[1])
df['small_ranges'] = df.apply(lambda row: [rng for rng in list_of_ranges if is_inside(rng, row['Range']) ], axis=1)
this produces
Category Range Qty small_ranges
-- ---------- ------- ----- -----------------------------------------
0 A 1-50 10 ['1-10', '10-25', '25-50']
1 B 25-200 15 ['25-50', '50-100', '100-200']
2 C 25-300 5 ['25-50', '50-100', '100-200', '200-300']
now we explode
df.explode('small_ranges')
output
Category Range Qty small_ranges
-- ---------- ------- ----- --------------
0 A 1-50 10 1-10
0 A 1-50 10 10-25
0 A 1-50 10 25-50
1 B 25-200 15 25-50
1 B 25-200 15 50-100
1 B 25-200 15 100-200
2 C 25-300 5 25-50
2 C 25-300 5 50-100
2 C 25-300 5 100-200
2 C 25-300 5 200-300
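To get from the exploded frame to the layout requested in the question (a single Range column holding the subrange), one option is to build the list directly in the Range column before exploding. A sketch under that assumption, with a slightly condensed version of the helpers above:

```python
import pandas as pd

df = pd.DataFrame(data=[['A', '1-50', 10], ['B', '25-200', 15], ['C', '25-300', 5]],
                  columns=['Category', 'Range', 'Qty'])
list_of_ranges = ['1-10', '10-25', '25-50', '50-100', '100-200', '200-300', '300-400']

def bounds(r):
    """Parse a range string 'xx-yy' into an (int, int) tuple."""
    low, high = r.split('-')
    return int(low), int(high)

def is_inside(inner, outer):
    """True if range `inner` lies entirely within range `outer`."""
    (a, b), (c, d) = bounds(inner), bounds(outer)
    return a >= c and b <= d

# Replace each original range with the list of subranges it contains,
# then give every list element its own row.
df['Range'] = df['Range'].apply(
    lambda outer: [r for r in list_of_ranges if is_inside(r, outer)])
result = df.explode('Range').reset_index(drop=True)
print(result)
```

Category and Qty are repeated for every subrange, which matches the three A rows described in the question's edit.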
Here is a solution using pandas.Interval, which seems pretty useful for this case. First we convert your strings to pd.Interval:
list_of_ranges = [pd.Interval(*tuple(map(int, r.split('-')))) for r in list_of_ranges]
df['Range'] = df['Range'].apply(lambda r: pd.Interval(*tuple(map(int, r.split('-')))))
Then we create a new DataFrame including all desired ranges for each original range:
my_temps = []
for idx, row in df.iterrows():
    _df = pd.DataFrame(columns=df.columns)
    _df['Range'] = [r for r in list_of_ranges if r.overlaps(row['Range'])]
    _df['Category'], _df['Qty'] = row['Category'], row['Qty']
    my_temps.append(_df)
final_df = pd.concat(my_temps).reset_index(drop=True)
Then we finally convert the ranges again to their original string format:
final_df['Range'] = final_df['Range'].apply(lambda r: '{}-{}'.format(r.left, r.right))
Which results in the following dataframe:
  Category    Range  Qty
0        A     1-10   10
1        A    10-25   10
2        A    25-50   10
3        B    25-50   15
4        B   50-100   15
5        B  100-200   15
6        C    25-50    5
7        C   50-100    5
8        C  100-200    5
9        C  200-300    5
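The reason overlaps does not also pick up the neighbouring ranges (for example 50-100 for the row 1-50) is that pd.Interval is closed on the right only by default, so two intervals that merely touch at an endpoint share no point. A short sketch of that behaviour:

```python
import pandas as pd

# Default closed='right': (1, 10] and (10, 25] have no point in common
a = pd.Interval(1, 10)
b = pd.Interval(10, 25)
touching_default = a.overlaps(b)

# With both ends closed, [1, 10] and [10, 25] share the point 10
a_closed = pd.Interval(1, 10, closed='both')
b_closed = pd.Interval(10, 25, closed='both')
touching_closed = a_closed.overlaps(b_closed)

print(touching_default, touching_closed)
```

If the intervals were built with closed='both', the filter would need a stricter test than overlaps to keep adjacent ranges out.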
Let us know if you have any further issues!
I am trying to calculate median values on the fly based on multiple conditions in each row of a data frame and am not getting there.
Basically, for every row, I am counting the number of people in the same department with rank B with pay greater than the pay listed in that row. I was able to get the count to work properly with a lambda function:
df['B Count'] = df.apply(lambda x: sum(df[(df['Department'] == x['Department']) & (df['Rank'] == 'B')]['Pay'] > x['Pay']), axis=1)
However, I now need to calculate the median for each case satisfying those conditions. So in row x of the data frame, I need the median of df['Pay'] for all others matching x['Department'] and df['Rank'] == 'B'. I can't apply .median() instead of sum(), as that gives me the median count, not the median pay. Any thoughts?
Using the fake data below, the 'B Count' code from above counts the number of B's in each Department with higher pay than each A. That part works fine. What I want is to then construct the 'B Median' column, calculating the median pay of the B's in each Department with higher pay than each A in the same Department.
Person Department Rank Pay B Count B Median
1 One A 1000 1 1500
2 One B 800
3 One A 500 2 1150
4 One A 3000 0
5 One B 1500
6 Two B 2000
7 Two B 1800
8 Two A 1500 3 1800
9 Two B 1700
10 Two B 1000
Well, I was able to do what I wanted to do with a function:
import numpy as np

def median_b(x):
    # No B in the same department earns more, so there is no median to report
    if x['B Count'] == 0:
        return np.nan
    else:
        return df[(df['Department'] == x['Department']) & (df['Rank'] == 'B') & (
            df['Pay'] > x['Pay'])]['Pay'].median()

df['B Median'] = df.apply(median_b, axis = 1)
Do any of you know of better ways to achieve this result?
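One possible variation, offered only as a sketch: collect the B pays per department once, so that each row filters a small array instead of re-scanning the whole data frame. The names b_pay and higher_b_median are made up for this example, and it fills the column for rank A rows only, as in the sample table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Person': range(1, 11),
    'Department': ['One'] * 5 + ['Two'] * 5,
    'Rank': ['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'B'],
    'Pay': [1000, 800, 500, 3000, 1500, 2000, 1800, 1500, 1700, 1000],
})

# Pay of every rank B person, collected once per department
b_rows = df[df['Rank'] == 'B']
b_pay = {dept: pay.to_numpy() for dept, pay in b_rows.groupby('Department')['Pay']}

def higher_b_median(row):
    """Median pay of the B's in the row's department who earn more than the row."""
    pays = b_pay.get(row['Department'], np.array([]))
    higher = pays[pays > row['Pay']]
    return float(np.median(higher)) if higher.size else np.nan

# Compute for rank A rows only; assignment aligns on the index,
# so the remaining rows are left as NaN.
is_a = df['Rank'] == 'A'
df['B Median'] = df[is_a].apply(higher_b_median, axis=1)
print(df)
```

This still uses apply row by row, so the gain is mainly in not rebuilding the boolean masks over the full frame for every row.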
I am working with a database that looks like the below. For each fruit (just apple and pears below, for conciseness), we have:
1. yearly sales,
2. current sales,
3. monthly sales and
4. the standard deviation of sales.
Their ordering may vary, but it's always 4 values per fruit.
dataset = {'apple_yearly_avg': [57],
'apple_sales': [100],
'apple_monthly_avg':[80],
'apple_st_dev': [12],
'pears_monthly_avg': [33],
'pears_yearly_avg': [35],
'pears_sales': [40],
'pears_st_dev':[8]}
df = pd.DataFrame(dataset).T  # transpose
df = df.reset_index()  # clear index
df.columns = ['Description', 'Value']  # name 2 columns
I would like to perform two sets of operations.
For the first set of operations, we isolate a fruit price, say 'pears', and subtract each average sales from current sales.
df_pear = df[df.loc[:, 'Description'].str.contains('pear')]
df_pear['temp'] = df_pear['Value'].where(df_pear.Description.str.contains('sales')).bfill()
df_pear['some_op'] = df_pear['Value'] - df_pear['temp']
The above works, by creating a temporary column holding pear_sales of 40, backfill it and then use it to subtract values.
Question 1: is there a cleaner way to perform this operation without a temporary column? I also get the common warning saying I should use .loc[row_indexer, col_indexer], even though the output is still correct.
For the second set of operations, I need to append new_purchases (here 5) rows to the bottom of the dataframe, and then fill df_pear['some_op'] with sales * (1 + std_dev * some_multiplier).
df_pear['temp2'] = df_pear['Value'].where(df_pear['Description'].str.contains('st_dev')).bfill()
new_purchases = 5
for i in range(new_purchases):
    df_pear = df_pear.append(df_pear.iloc[-1])  # appends 5 copies of the last row
counter = 1
for i in range(len(df_pear)-1, len(df_pear)-new_purchases, -1):  # backward loop from the bottom
    df_pear.some_op.iloc[i] = df_pear['temp'].iloc[0] * (1 + df_pear['temp2'].iloc[i] * counter)
    counter += 1
This 'backwards' loop achieves it, but again, I'm worried about readability since there's another temporary column created, and then the indexing is rather ugly?
Thank you.
I think there is a cleaner way to perform both of your tasks, for each fruit in one go.
Add 2 columns, Fruit and Descr, by splitting Description at the first "_":
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
To see the result you may print df now.
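As a small illustration of that split (a sketch on three of the question's rows only), n=1 limits the split to the first underscore, so descriptions such as yearly_avg stay in one piece:

```python
import pandas as pd

df = pd.DataFrame({'Description': ['apple_yearly_avg', 'apple_st_dev', 'pears_sales'],
                   'Value': [57, 12, 40]})

# expand=True returns one column per piece; n=1 allows at most one split
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
print(df)
```

Without n=1, 'apple_yearly_avg' would be split into three pieces and the two-column assignment would fail.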
Define the following function to "reformat" the current group:
def reformat(grp):
    wrk = grp.set_index('Descr')
    sal = wrk.at['sales', 'Value']
    dev = wrk.at['st_dev', 'Value']
    avg = wrk.at['yearly_avg', 'Value']
    # Subtract (yearly) average
    wrk['some_op'] = wrk.Value - avg
    # New rows
    wrk2 = pd.DataFrame([wrk.loc['st_dev']] * 5).assign(
        some_op=[sal * (1 + dev * i) for i in range(5, 0, -1)])
    return pd.concat([wrk, wrk2])  # Old and new rows
Apply this function to each group, grouped by Fruit, drop the Fruit column and save the result back in df:
df = df.groupby('Fruit').apply(reformat)\
    .reset_index(drop=True).drop(columns='Fruit')
Now, when you print(df), the result is:
Description Value some_op
0 apple_yearly_avg 57 0
1 apple_sales 100 43
2 apple_monthly_avg 80 23
3 apple_st_dev 12 -45
4 apple_st_dev 12 6100
5 apple_st_dev 12 4900
6 apple_st_dev 12 3700
7 apple_st_dev 12 2500
8 apple_st_dev 12 1300
9 pears_monthly_avg 33 -2
10 pears_sales 40 5
11 pears_yearly_avg 35 0
12 pears_st_dev 8 -27
13 pears_st_dev 8 1640
14 pears_st_dev 8 1320
15 pears_st_dev 8 1000
16 pears_st_dev 8 680
17 pears_st_dev 8 360
Edit
I'm not sure whether Description should also be copied into the new rows from the "st_dev" row. If you want some other content there, set it in the reformat function, after wrk2 is created.
I want to implement a calculation for a simple scenario:
a value computed as the sum of the daily data over the previous N days (N = 3 in the following example).
Dataframe df (df.index is 'date'):
date value
20140718 1
20140721 2
20140722 3
20140723 4
20140724 5
20140725 6
20140728 7
......
The calculation should give:
date value new
20140718 1 0
20140721 2 0
20140722 3 0
20140723 4 6 (3+2+1)
20140724 5 9 (4+3+2)
20140725 6 12 (5+4+3)
20140728 7 15 (6+5+4)
......
So far I have done this with a for loop:
N = 3
df['new'] = 0
for idx in df.index:
    loc = df.index.get_loc(idx)
    if loc - N >= 0:
        # sum the N rows that precede the current one
        total = df['value'].iloc[loc - N:loc].sum()
    else:
        total = 0
    df.loc[idx, 'new'] = total
But when the dataframe is very long or N is very large, this calculation is very slow. How can I implement it faster, using a built-in function or some other approach?
Also, what if the scenario is more complex? Thanks.
Since you want the sum of the previous three values excluding the current one, you can use rolling_apply over a window of four and sum all but the last value.
new = rolling_apply(df, 4, lambda x:sum(x[:-1]), min_periods=4)
This is the same as shifting afterwards with a window of three:
new = rolling_apply(df, 3, sum, min_periods=3).shift()
Then
df["new"] = new["value"].fillna(0)
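The top-level rolling_apply function belongs to older pandas releases and has since been replaced by the .rolling() method. A sketch of the same shift-after-rolling idea with the newer API, using the sample data from the question:

```python
import pandas as pd

dates = [20140718, 20140721, 20140722, 20140723, 20140724, 20140725, 20140728]
df = pd.DataFrame({'value': [1, 2, 3, 4, 5, 6, 7]},
                  index=pd.Index(dates, name='date'))

N = 3
# Rolling sum over N rows, shifted down one row so the current row is excluded;
# rows without N predecessors become NaN and are filled with 0.
df['new'] = (df['value']
             .rolling(N, min_periods=N)
             .sum()
             .shift()
             .fillna(0)
             .astype(int))
print(df)
```

Note that the window counts rows, not calendar days, which matches the expected output in the question (the weekend gap between 20140718 and 20140721 is ignored).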