groupby and mean returning NaN - python

I am trying to use groupby to group by symbol and return the average of prior high volume days using pandas.
I create my data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "date": ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
    "symbol": ['ABC', 'ABC', 'ABC', 'AAA', 'AAA', 'AAA'],
    "change": [20, 1, 2, 3, 50, 100],
    "volume": [20000000, 100, 3000, 500, 40000000, 60000000],
})
Filter by high volume and change:
high_volume_days = df[(df['volume'] >= 20000000) & (df['change'] >= 20)]
Then I get the last day's volume (this works):
high_volume_days['previous_high_volume_day'] = high_volume_days.groupby('symbol')['volume'].shift(1)
But when I try to calculate the average of all the days per symbol:
high_volume_days['avg_volume_prior_days'] = df.groupby('symbol')['volume'].mean()
I am getting NaNs:
date symbol change volume previous_high_volume_day avg_volume_prior_days
0 2022-01-01 ABC 20 20000000 NaN NaN
4 2022-01-05 AAA 50 40000000 NaN NaN
5 2022-01-06 AAA 100 60000000 40000000.0 NaN
What am I missing here?
Desired output:
date symbol change volume previous_high_volume_day avg_volume_prior_days
0 2022-01-01 ABC 20 20000000 NaN 20000000
4 2022-01-05 AAA 50 40000000 NaN 40000000
5 2022-01-06 AAA 100 60000000 40000000.0 50000000

high_volume_days['avg_volume_prior_days'] = high_volume_days.groupby('symbol', sort=False)['volume'].expanding().mean().droplevel(0)
high_volume_days
date symbol change volume previous_high_volume_day avg_volume_prior_days
0 2022-01-01 ABC 20 20000000 NaN 20000000.0
4 2022-01-05 AAA 50 40000000 NaN 40000000.0
5 2022-01-06 AAA 100 60000000 40000000.0 50000000.0

Index misalignment: high_volume_days is indexed by integers, while the result of df.groupby(...) is indexed by symbol.
Use merge instead:
high_volume_days = pd.merge(
    high_volume_days,
    df.groupby("symbol")["volume"].mean().rename("avg_volume_prior_days"),
    left_on="symbol",
    right_index=True,
)

df.groupby('symbol')['volume'].mean() returns:
symbol
AAA 33333500.0
ABC 6667700.0
Name: volume, dtype: float64
which is an aggregation of each group to a single value. Note that the groups (symbol) are the index of this series. When you try to assign it back to high_volume_days, there is an index misalignment.
Instead of an aggregation (.mean() is equivalent to .agg("mean")), you should use a transformation: .transform("mean").
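For example, a minimal sketch using the df and high_volume_days defined above (note this is still the per-symbol mean over all of df's rows; the edit below refines it to the filtered days only):
high_volume_days['avg_volume_prior_days'] = df.groupby('symbol')['volume'].transform('mean')
Because transform keeps the original row index, the assignment aligns on index instead of producing NaN.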
==== EDIT ====
Instead of the mean for all values, you're looking for the mean "thus far". You can typically do that using .expanding().mean(), but since you're reassigning back to a column in high_volume_days, you need to either drop the level that contains the symbols, or use a lambda:
high_volume_days.groupby('symbol')['volume'].expanding().mean().droplevel(0)
# or
high_volume_days.groupby('symbol')['volume'].transform(lambda x: x.expanding().mean())

Related

How to compute rolling average in pandas just for a specific date

I have this example dataframe below. I created a function that does what I want: it computes a Sales rolling average (7- and 14-day windows) for each Store for the previous day and shifts it to the current date. How can I compute this only for a specific date, 2022-12-31, for example? I have a lot of rows and I don't want to recalculate everything each time I add a date.
import numpy as np
import pandas as pd
ex = pd.DataFrame({'Date': pd.date_range('2022-10-01', '2022-12-31'),
                   'Store': np.random.choice(2, len(pd.date_range('2022-10-01', '2022-12-31'))),
                   'Sales': np.random.choice(10000, len(pd.date_range('2022-10-01', '2022-12-31')))})
ex.sort_values(['Store','Date'], ascending=False, inplace=True)
for days in [7, 14]:
    ex['Sales_mean_' + str(days) + '_days'] = ex.groupby('Store')[['Sales']].apply(lambda x: x.shift(-1).rolling(days).mean().shift(-days+1))
I redefined a similar dataframe, because using a random variable generator makes debugging difficult: the dataframe changes on every run.
To keep things simple, I will use moving-average periods of 2 and 3.
Starting dataframe
Date Store Sales
9 2022-10-10 1 5347
8 2022-10-09 1 1561
7 2022-10-08 1 5648
6 2022-10-07 1 8123
5 2022-10-06 1 1401
4 2022-10-05 0 2745
3 2022-10-04 0 7848
2 2022-10-03 0 3151
1 2022-10-02 0 4296
0 2022-10-01 0 9028
It gives:
ex = pd.DataFrame({
    "Date": pd.date_range('2022-10-01', '2022-10-10'),
    "Store": [0]*5+[1]*5,
    "Sales": [9028, 4296, 3151, 7848, 2745, 1401, 8123, 5648, 1561, 5347],
})
ex.sort_values(['Store','Date'], ascending=False, inplace=True)
Proposed code
import pandas as pd
import numpy as np
ex = pd.DataFrame({
    "Date": pd.date_range('2022-10-01', '2022-10-10'),
    "Store": [0]*5+[1]*5,
    "Sales": [9028, 4296, 3151, 7848, 2745, 1401, 8123, 5648, 1561, 5347],
})
ex.sort_values(['Store','Date'], ascending=False, inplace=True)
periods = (2, 3)

### STEP 1 -- Initialization : exhaustive Mean() Calculation
for per in periods:
    ex["Sales_mean_{0}_days".format(per)] = (
        ex.groupby(['Store'])['Sales']
        .apply(lambda g: g.shift(-1)
                          .rolling(per)
                          .mean()
                          .shift(-per+1))
    )

### STEP 2 -- New Row Insertion
def fmt_newRow(g, newRow, periods):
    return {
        "Date": pd.Timestamp(newRow[0]),
        "Store": newRow[1],
        "Sales": newRow[2],
        "Sales_mean_{0}_days".format(periods[0]): g['Sales'].iloc[0:periods[0]].mean(),
        "Sales_mean_{0}_days".format(periods[1]): g['Sales'].iloc[0:periods[1]].mean(),
    }

def add2DF(ex, newRow):
    # g : sub-Store group
    g = (
        ex.loc[ex.Store==newRow[1]]
        .sort_values(['Store','Date'], ascending=False)
    )
    # Append newRow as a dictionary and sort by ['Store','Date']
    ex = (
        ex.append(fmt_newRow(g, newRow, periods), ignore_index=True)
        .sort_values(['Store','Date'], ascending=False)
        .reset_index(drop=True)
    )
    return ex
newRow = ['2022-10-11', 1, 2803] # [Date, Store, Sales]
ex = add2DF(ex, newRow)
print(ex)
Result
Date Store Sales Sales_mean_2_days Sales_mean_3_days
0 2022-10-11 1 2803 3454.0 4185.333333
1 2022-10-10 1 5347 3604.5 5110.666667
2 2022-10-09 1 1561 6885.5 5057.333333
3 2022-10-08 1 5648 4762.0 NaN
4 2022-10-07 1 8123 NaN NaN
5 2022-10-06 1 1401 NaN NaN
6 2022-10-05 0 2745 5499.5 5098.333333
7 2022-10-04 0 7848 3723.5 5491.666667
8 2022-10-03 0 3151 6662.0 NaN
9 2022-10-02 0 4296 NaN NaN
10 2022-10-01 0 9028 NaN NaN
Comments
A new row is a list like this one: [Date, Store, Sales]
Each time you need to save a new row to the dataframe, you pass it to the fmt_newRow function with the corresponding subgroup g
fmt_newRow returns the new row in the form of a dictionary, which is appended to the dataframe with the pandas append function (see the note after this list for pandas 2.0+)
No need to recalculate all the averages, because only the last per values of g are used to compute the new row's averages
The moving averages for periods 2 and 3 were checked and are correct.
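Note that DataFrame.append was removed in pandas 2.0. On newer versions, the insertion inside add2DF can be done with pd.concat instead; a rough equivalent of the append line:
ex = (
    pd.concat([ex, pd.DataFrame([fmt_newRow(g, newRow, periods)])], ignore_index=True)
    .sort_values(['Store', 'Date'], ascending=False)
    .reset_index(drop=True)
)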

Pandas create rows based on interval between two dates

I am trying to expand a dataframe containing a number of columns by creating rows based on the interval between two date columns.
For this I am currently using a method that basically creates a Cartesian product, which works well on small datasets but is very inefficient on large ones.
This method will be used on a ~ 2-million row by 50 column Dataframe spanning multiple years from min to max date. The resulting dataset will be about 3 million rows, so a more effective approach is required.
I have not succeeded in finding an alternative method which is less resource intensive.
What would be the best approach for this?
My current method here:
from datetime import date
import pandas as pd
raw_data = {'id': ['aa0', 'aa1', 'aa2', 'aa3'],
            'number': [1, 2, 2, 1],
            'color': ['blue', 'red', 'yellow', "green"],
            'date_start': [date(2022,1,1), date(2022,1,1), date(2022,1,7), date(2022,1,12)],
            'date_end': [date(2022,1,2), date(2022,1,4), date(2022,1,9), date(2022,1,14)]}
df = pd.DataFrame(raw_data)
This gives the following result
Now to create a set containing all possible dates between the min and max date of the set:
df_d = pd.DataFrame({'date': pd.date_range(df['date_start'].min(), df['date_end'].max() + pd.Timedelta('1d'), freq='1d')})
This results in an expected frame containing all the possible dates
Finally to cross merge the original set with the date set and filter resulting rows based on start and end date per row
df_total = pd.merge(df, df_d,how='cross')
df = df_total[(df_total['date_start']<df_total['date']) & (df_total['date_end']>=df_total['date']) ]
This leads to the final dataframe, which is exactly what is needed.
Efficient Solution
d = df['date_end'].sub(df['date_start']).dt.days
df1 = df.reindex(df.index.repeat(d))
i = df1.groupby(level=0).cumcount() + 1
df1['date'] = df1['date_start'] + pd.to_timedelta(i, unit='d')
How does it work?
Subtract date_start from date_end to get the number of days elapsed, then reindex the dataframe by repeating each index exactly that many times. Now group df1 by index and use cumcount to create a sequential counter, build a timedelta series from that counter, and add it to date_start to get the result.
Result
id number color date_start date_end date
0 aa0 1 blue 2022-01-01 2022-01-02 2022-01-02
1 aa1 2 red 2022-01-01 2022-01-04 2022-01-02
1 aa1 2 red 2022-01-01 2022-01-04 2022-01-03
1 aa1 2 red 2022-01-01 2022-01-04 2022-01-04
2 aa2 2 yellow 2022-01-07 2022-01-09 2022-01-08
2 aa2 2 yellow 2022-01-07 2022-01-09 2022-01-09
3 aa3 1 green 2022-01-12 2022-01-14 2022-01-13
3 aa3 1 green 2022-01-12 2022-01-14 2022-01-14
I don't know if this is an improvement, but here the pd.date_range only gets created between the start and end date of each row. The resulting list gets exploded and joined to the original df.
from datetime import date
import pandas as pd
raw_data = {'id': ['aa0', 'aa1', 'aa2', 'aa3'],
            'number': [1, 2, 2, 1],
            'color': ['blue', 'red', 'yellow', "green"],
            'date_start': [date(2022,1,1), date(2022,1,1), date(2022,1,7), date(2022,1,12)],
            'date_end': [date(2022,1,2), date(2022,1,4), date(2022,1,9), date(2022,1,14)]}
df = pd.DataFrame(raw_data)
s = df.apply(lambda x: pd.date_range(x['date_start'], x['date_end'], freq='1d',inclusive='right').date,axis=1).explode()
df.join(s.rename('date'))

Counting Specific Values by Month

I have some data I want to count by month. The column I want count has three different possible values, each representing a different car sold. Here is an example of my dataframe:
Date Type_Car_Sold
2015-01-01 00:00:00 2
2015-01-01 00:00:00 1
2015-01-01 00:00:00 1
2015-01-01 00:00:00 3
... ...
I want to make it so I have a dataframe that counts each specific car type sold by month separately, so looking like this:
Month Car_Type_1 Car_Type_2 Car_Type_3 Total_Cars_Sold
1 15 12 17 44
2 9 18 20 47
... ... ... ... ...
How exactly would I go about doing this? I've tried doing:
cars_sold = car_data['Type_Car_Sold'].groupby(car_data.Date.dt.month).agg('count')
but that just sums up all the cars sold in the month, rather than breaking it down by the total amount of each type sold. Any thoughts?
Maybe not the cleanest solution, but this should get you pretty close
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
    "Date": [datetime(2022,1,1), datetime(2022,1,1), datetime(2022,2,1), datetime(2022,2,1)],
    "Type": [1, 2, 1, 1],
})
df['Date'] = df["Date"].dt.to_period('M')
df['Value'] = 1
print(pd.pivot_table(df, values='Value', index=['Date'], columns=['Type'], aggfunc='count'))
Type 1 2
Date
2022-01 1.0 1.0
2022-02 2.0 NaN
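To get closer to the layout asked for in the question (named car-type columns, zeros instead of NaN, and a row total), a variation on the same pivot might look like this (the Car_Type_ prefix is just illustrative):
out = pd.pivot_table(df, values='Value', index=['Date'], columns=['Type'],
                     aggfunc='count', fill_value=0)
out = out.add_prefix('Car_Type_')
out['Total_Cars_Sold'] = out.sum(axis=1)
print(out)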
Alternatively you can also pass multiple columns to groupby:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
    "Date": [datetime(2022,1,1), datetime(2022,1,1), datetime(2022,2,1), datetime(2022,2,1)],
    "Type": [1, 2, 1, 1],
})
df['Date'] = df["Date"].dt.to_period('M')
df.groupby(['Date', 'Type']).size()
Date Type
2022-01 1 1
2 1
2022-02 1 2
dtype: int64
This seems to have the unfortunate side effect of excluding keys with zero counts. Also, the result has a MultiIndex on the rows rather than Date as the row index and Type as columns.
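One way to address both points is to unstack the Type level into columns with a fill value, along these lines:
df.groupby(['Date', 'Type']).size().unstack(fill_value=0)
This keeps Date as the row index, turns each Type into a column, and fills missing Date/Type combinations with 0.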
For more information on this approach, check this question.

pandas - finding most recent (but previous) date in a second reference dataframe

I have two dataframes and for one I want to find the closest (previous) date in the other.
If the date matches then I need to take the previous date
df_main contains the reference information
For df_sample I want to look up the Time in df_main for the closest (but previous) entry. I can do this using method='ffill', but where the date for the Time field is the same day it returns that day; I want it to return the previous one, basically a < rather than <=.
In my example df_res I want the closest_val column to contain [ "n/a", 90, 90, 280, 280, 280]
import pandas as pd
dsample = {'Index': [1, 2, 3, 4, 5, 6],
           'Time': ["2020-06-01", "2020-06-02", "2020-06-03", "2020-06-04", "2020-06-05", "2020-06-06"],
           'Pred': [100, -200, 300, -400, -500, 600]
           }
dmain = {'Index': [1, 2, 3],
         'Time': ["2020-06-01", "2020-06-03", "2020-06-06"],
         'Actual': [90, 280, 650]
         }
def find_closest(x, df2):
    df_res = df2.iloc[df2.index.get_loc(x['Time'], method='ffill')]
    x['closest_time'] = df_res['Time']
    x['closest_val'] = df_res['Actual']
    return x
df_sample = pd.DataFrame(data=dsample)
df_main = pd.DataFrame(data=dmain)
df_sample = df_sample.set_index(pd.DatetimeIndex(df_sample['Time']))
df_main = df_main.set_index(pd.DatetimeIndex(df_main['Time']))
df_res = df_sample.apply(find_closest, df2=df_main ,axis=1)
Use pd.merge_asof (make sure 'Time' is indeed a datetime):
pd.merge_asof(df_sample, df_main, left_on="Time", right_on="Time", allow_exact_matches=False)
The output is:
Index_x Time Pred Index_y Actual
0 1 2020-06-01 100 NaN NaN
1 2 2020-06-02 -200 1.0 90.0
2 3 2020-06-03 300 1.0 90.0
3 4 2020-06-04 -400 2.0 280.0
4 5 2020-06-05 -500 2.0 280.0
5 6 2020-06-06 600 2.0 280.0
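If Time is still stored as strings (as it is when built directly from the dicts in the question), both frames need converting first, for example:
df_sample['Time'] = pd.to_datetime(df_sample['Time'])
df_main['Time'] = pd.to_datetime(df_main['Time'])
merge_asof also expects both frames to be sorted by the merge key, which they already are here.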
IIUC, we can do a Cartesian product of both your dataframes, then filter out the exact matches, then apply some logic to figure out the closest date.
Finally, we will join the exact and non-exact matches into a final dataframe.
import numpy as np

s = pd.merge(
    df_sample.assign(key="var1"),
    df_main.assign(key="var1").rename(columns={"Time": "TimeDelta"}).drop("Index", 1),
    on="key",
    how="outer",
).drop("key", 1)
extact_matches = s[s['Time'].eq(s['TimeDelta'])]
non_exact_matches_cart = s[~s['Time'].isin(extact_matches['Time'])]
non_exact_matches = non_exact_matches_cart.assign(
    delta=(non_exact_matches_cart["Time"] - non_exact_matches_cart["TimeDelta"])
    / np.timedelta64(1, "D")
).query("delta >= 0").sort_values(["Time", "delta"]).drop_duplicates(
    "Time", keep="first"
).drop('delta', 1)
There is a lot going on in the variable above, but essentially we are finding the difference in time, removing any difference that goes into the future, and dropping duplicates to keep the closest date in the past.
df = pd.concat([extact_matches, non_exact_matches], axis=0).sort_values("Time").rename(
    columns={"TimeDelta": "closest_time", "Actual": "closest val"}
)
print(df)
Index Time Pred closest_time closest val
0 1 2020-06-01 100 2020-06-01 90
3 2 2020-06-02 -200 2020-06-01 90
7 3 2020-06-03 300 2020-06-03 280
10 4 2020-06-04 -400 2020-06-03 280
13 5 2020-06-05 -500 2020-06-03 280
17 6 2020-06-06 600 2020-06-06 650

Accumulate values according to condition in a Pandas data frame

I have a Pandas data frame (to illustrate the expected behavior) as follow:
df = pd.DataFrame({
    'Id': ['001', '001', '002', '002'],
    'Date': ['2013-01-07', '2013-01-14', '2013-01-07', '2013-01-14'],
    'Purchase_Quantity': [12, 13, 10, 6],
    'lead_time': [4, 2, 6, 4],
    'Order_Quantity': [21, 34, 21, 13]
})
df['Date'] = pd.to_datetime(df['Date'])
df = df.groupby(['Id', 'Date']).agg({
    'Purchase_Quantity': sum,
    'lead_time': sum,
    'Order_Quantity': sum})
Purchase_Quantity lead_time Order_Quantity
Id Date
001 2013-01-07 12 4 21
2013-01-14 13 2 34
002 2013-01-07 10 6 21
2013-01-14 6 4 13
Where lead_time is a duration in days.
I would like to add a column that keep track of the "quantity on hand" which is:
Remaining quantity from previous weeks
Plus ordered quantity that are finally available
Minus purchased quantity of the current week
The expected result should be:
Purchase_Quantity lead_time Order_Quantity OH
Id Date
001 2013-01-07 12 4 21 0
2013-01-14 13 2 34 9
002 2013-01-07 10 6 21 0
2013-01-14 6 4 13 11
I think you should look at itertools.accumulate to build your new column (instead of iterating over your data frame rows).
This is a first attempt; I will update it to better match what you are trying to achieve in your edit.
import itertools

diff = df['Order_Quantity'] - df['Purchase_Quantity']
acc = list(itertools.accumulate(diff))
df['on_hand'] = acc
print(df)
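As a side note, a vectorized pandas equivalent that restarts the accumulation per Id (which the plain itertools.accumulate above does not) could be, given the grouped df from the question:
diff = df['Order_Quantity'] - df['Purchase_Quantity']
df['on_hand'] = diff.groupby(level='Id').cumsum()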
Edit
I think I misunderstood what you are trying to achieve.
Here is your base data frame:
Purchase_Quantity lead_time Order_Quantity
Id Date
001 2013-01-07 12 4 21
2013-01-14 13 2 34
002 2013-01-07 10 6 21
2013-01-14 6 4 13
From what I understand, your On Hand column must report the number of "Purchased" items which have not arrived yet, looking something like this:
Purchase_Quantity lead_time On_Hand
Id Date
001 2013-01-07 12 4 12
2013-01-14 13 2 25 # (12 + 13)
002 2013-01-07 10 6 10
2013-01-14 6 4 16 # (10 + 6)
Did I understand well? If so, what is the Order_Quantity for?
Edit 2
Here is a new example, heavily inspired by this post, which seems to match your use case.
I changed the column names to avoid confusion (what is the difference between "Order" and "Purchase", which translate to the same word in my language...).
You should also convert your lead time to a datetime.timedelta object, making units and computations clearer.
import pandas as pd

def main():
    df = pd.DataFrame({
        'Id': ['001', '001', '002', '002'],
        'Date': ['2013-01-07', '2013-01-14', '2013-01-07', '2013-01-14'],
        'Ordered': [21, 34, 21, 13],
        'LeadTime': [4, 2, 6, 4],
        'Sold': [12, 13, 10, 6],
    })
    df['Date'] = pd.to_datetime(df['Date'])
    df['LeadTime'] = pd.to_timedelta(df['LeadTime'], unit="days")
    print(df)
    df['Received'] = df.apply(lambda x: df.loc[(df['Date']+df['LeadTime'] <= x['Date']) & (df['Id'] == x['Id']), "Ordered"].sum(), axis=1)
    df['Diff'] = df['Received'] - df['Sold']
    print(df)

if __name__ == '__main__':
    main()
As shown here, you probably have to do it in two steps: first build a new column whose value depends on the current values of the row (see the linked post), then do the other computations, which can be vectorized.
This still does not produce the expected output, but I think it provides a good starting point.
