cumulative count and sum based on date frame - python

I have a DataFrame df, that, once sorted by date, looks like this:
User Date price
0 2 2020-01-30 50
1 1 2020-02-02 30
2 2 2020-02-28 50
3 2 2020-04-30 10
4 1 2020-12-28 10
5 1 2020-12-30 20
I want to compute, for each row:
the number of rows in the last month, and
the sum of price in the last month.
On the example above, the output that I'm looking for:
User Date price NumlastMonth Totallastmonth
0 2 2020-01-30 50 0 0
1 1 2020-02-02 30 0 0 # not 1, 50 ???
2 2 2020-02-28 50 1 50
3 2 2020-04-30 10 0 0
4 1 2020-12-28 10 0 0
5 1 2020-12-30 20 1 10 # not 0, 0 ???
I tried this, but the result covers all previous rows, not only those in the last month.
df['NumlastMonth'] = df.sort_values('Date')\
    .groupby(['User']).price.cumcount()
df['Totallastmonth'] = df.sort_values('Date')\
    .groupby(['User']).price.cumsum()

Taking the question literally (and acknowledging that the example doesn't quite match the description), we could do:
tally = df.groupby(pd.Grouper(key='Date', freq='M')).agg({'User': 'count', 'price': sum})
tally.index += pd.offsets.Day(1)
tally = tally.reindex(index=df.Date, method='ffill', fill_value=0)
On your input, that gives:
>>> tally
User price
Date
2020-01-30 0 0
2020-02-02 1 50
2020-02-28 1 50
2020-04-30 0 0
2020-12-28 0 0
2020-12-30 0 0
After that, it's easy to change the column names and concat:
df2 = pd.concat([
df.set_index('Date'),
tally.rename(columns={'User': 'NumlastMonth', 'price': 'Totallastmonth'})
], axis=1)
# out:
User price NumlastMonth Totallastmonth
Date
2020-01-30 2 50 0 0
2020-02-02 1 30 1 50
2020-02-28 2 50 1 50
2020-04-30 2 10 0 0
2020-12-28 1 10 0 0
2020-12-30 1 20 0 0
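If "last month" is instead read as the 30 days before each row, counted per user, a time-based rolling window is one possible sketch. This reading is an assumption, although it does reproduce the expected output shown in the question. The helper column n and the 30-day window length are illustrative choices and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "User": [2, 1, 2, 2, 1, 1],
    "Date": pd.to_datetime(["2020-01-30", "2020-02-02", "2020-02-28",
                            "2020-04-30", "2020-12-28", "2020-12-30"]),
    "price": [50, 30, 50, 10, 10, 20],
})

# Sort by user, then date, so the rolled rows come back in the same order.
# "n" is a helper column of ones (hypothetical name) used to count rows.
ordered = df.sort_values(["User", "Date"]).assign(n=1)

# closed="left" excludes the current row: the window is [Date - 30 days, Date).
rolled = (ordered.set_index("Date")
          .groupby("User")[["n", "price"]]
          .rolling("30D", closed="left")
          .sum()
          .fillna(0))  # an empty window yields NaN, which we treat as 0

ordered["NumlastMonth"] = rolled["n"].to_numpy().astype(int)
ordered["Totallastmonth"] = rolled["price"].to_numpy()

# Restore the original row order.
result = ordered.drop(columns="n").sort_index()
print(result)
```

The positional assignment relies on the groupby returning groups in sorted key order with rows in their original order inside each group, which is why the frame is sorted by User and Date first.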


Pandas DataFrame Change Values Based on Values in Different Rows

I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
Piece of DF
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you, if not please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back+1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
# assign the result instead of calling fillna(inplace=True) on a column selection
df["StateHolidayNew"] = df["StateHolidayNew"].fillna("0")
print(df)
The column StateHolidayNew should now contain the information you need to start analyzing your data.
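The question asks for a distinct code such as b- on the days shortly before a holiday. A minimal sketch of that variation follows, assuming the same column names as above and assuming every holiday date applies to all stores. The "-" suffix and the small sample frame are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": [1, 1, 1, 1, 1],
    "Date": pd.to_datetime(["2013-03-25", "2013-03-26", "2013-03-27",
                            "2013-03-28", "2013-03-29"]),
    "StateHoliday": ["0", "0", "0", "0", "b"],
})

look_back = 2  # how many days before each holiday to mark

# One row per distinct (date, holiday code) pair.
holidays = df.loc[df["StateHoliday"] != "0", ["Date", "StateHoliday"]].drop_duplicates()

# Build the look-back dates, tagging them with the holiday code plus "-".
frames = []
for k in range(1, look_back + 1):
    frames.append(pd.DataFrame({
        "Date": holidays["Date"] - pd.Timedelta(days=k),
        "StateHolidayNew": holidays["StateHoliday"] + "-",
    }))
# Keep one tag per date so the merge cannot duplicate rows.
pre = pd.concat(frames).drop_duplicates("Date")

out = df.merge(pre, how="left", on="Date")
# Dates that are not in a look-back window keep their original code.
out["StateHolidayNew"] = out["StateHolidayNew"].fillna(out["StateHoliday"])
print(out)
```

Note that in this sketch a look-back tag takes precedence if a pre-holiday date is itself a holiday. Swap the order of the fill if the holiday code should win instead.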
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the rows that fall between the different letters representing the holidays, and then use groupby to find the sales for each group. An improvement would be to backfill the labels before each group, e.g. groups=0.0 would become b_0, which would make it easier to see which holiday each group precedes, but I am not sure how to do that.
import numpy as np
df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0

Is there a python function to shift revenue column by specific timedelta?

Please refer to the image attached
I have a data frame that has yearly revenue in columns (2020 to 2025). I want to shift the revenue in those columns by a given time delta(column Time Shift). The time delta I have is in terms of days. Is there an efficient way to make the shift?
E.G
What I want to achieve is to shift the yearly revenue in columns by the value of days in the Time Shift column i.e. 4 days of revenue to shift from column to column ( i.e. 1.27[116/365 * 4] should be shifted from 2022 to 2023 for the 1st row)
Thanks in Advance
Text Input data
Launch Date Launch Date Base Time Shift 2020 2021 2022 2023 2024 2025
2022-06-01 2022-06-01 4 0 0 115.98 122.93 119.22 35.31
2025-02-01 2025-02-01 4 0 0 0 0 0 66.18859318
2022-09-01 2022-09-01 4 49.42 254.86 191.12 248.80 206.53 98.22
2025-01-01 2025-01-01 4 0 0 0 0 14.47 54.24
2022-06-01 2022-06-01 4 0 0 50.25 53.26 51.65 15.30
2025-02-01 2025-02-01 4 0 0 0 0 0 28.67
2022-09-01 2022-09-01 4 148.20 758.22 535.45 676.73 545.42 251.83
2025-01-01 2025-01-01 4 0 0 0 0 38.23 139.07
2022-06-01 2022-06-01 4 0 0 140.78 144.88 136.41 39.23
You can figure out how much to shift per year, then subtract it from the current year and add it to the next year.
Get the column names of interest:
ycols = [str(n) for n in range(2020,2026)]
Calculate the amount that needs shifting, per year (to the next year):
shift_df = df[ycols].multiply(df['Time_Shift']/365.0, axis=0)
This looks like:
2020 2021 2022 2023 2024 2025
-- -------- ------- -------- -------- -------- --------
0 0 0 1.27101 1.34718 1.30652 0.386959
1 0 0 0 0 0 0.725354
2 0.541589 2.79299 2.09447 2.72658 2.26334 1.07638
3 0 0 0 0 0.158575 0.594411
4 0 0 0.550685 0.583671 0.566027 0.167671
5 0 0 0 0 0 0.314192
6 1.62411 8.30926 5.86795 7.41622 5.97721 2.75978
7 0 0 0 0 0.418959 1.52405
8 0 0 1.54279 1.58773 1.4949 0.429918
Now create a copy of df (could use the original if you want of course) and apply the operations:
df2 = df.copy()
df2[ycols] = df2[ycols] - shift_df[ycols]
df2[ycols[1:]] = df2[ycols[1:]] + shift_df[ycols[:-1]].values
The slightly tricky bits are in the last line: we use the slices [1:] and [:-1] to pair each year with the previous year's shift, and we use the .values attribute because otherwise pandas would align on the column labels, which do not match, and the addition would not line up.
After this we get df2:
Launch_Date Launch_Date_Base Time_Shift 2020 2021 2022 2023 2024 2025
-- ------------- ------------------ ------------ -------- ------- -------- ------- -------- --------
0 2022-06-01 2022-06-01 4 0 0 114.709 122.854 119.261 36.2296
1 2025-02-01 2025-02-01 4 0 0 0 0 0 65.4632
2 2022-09-01 2022-09-01 4 48.8784 252.609 191.819 248.168 206.993 99.407
3 2025-01-01 2025-01-01 4 0 0 0 0 14.3114 53.8042
4 2022-06-01 2022-06-01 4 0 0 49.6993 53.227 51.6676 15.6984
5 2025-02-01 2025-02-01 4 0 0 0 0 0 28.3558
6 2022-09-01 2022-09-01 4 146.576 751.535 537.891 675.182 546.859 255.047
7 2025-01-01 2025-01-01 4 0 0 0 0 37.811 137.965
8 2022-06-01 2022-06-01 4 0 0 139.237 144.835 136.503 40.295
As you may have noticed, the amount shifted out of year 2025 (into 2026) is 'lost', i.e. we do not assign it to any new column.
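As a quick check of the approach, here is a self-contained sketch on the first row of the sample data. It assumes the underscore column name Time_Shift used in the answer. The variable lost is an illustrative name for the amount that leaves the last column.

```python
import pandas as pd

df = pd.DataFrame({
    "Time_Shift": [4],
    "2020": [0.0], "2021": [0.0], "2022": [115.98],
    "2023": [122.93], "2024": [119.22], "2025": [35.31],
})
ycols = [str(y) for y in range(2020, 2026)]

# Amount that moves out of each year into the following year.
shift = df[ycols].multiply(df["Time_Shift"] / 365.0, axis=0)

out = df.copy()
out[ycols] = df[ycols] - shift
# Add the previous year's shift; to_numpy() avoids alignment on column labels.
out[ycols[1:]] = out[ycols[1:]] + shift[ycols[:-1]].to_numpy()

# Whatever leaves the last year has no column to land in.
lost = shift[ycols[-1]]

print(out[ycols].round(4))
print("total before:", df[ycols].sum(axis=1).iloc[0])
print("total after + lost:", (out[ycols].sum(axis=1) + lost).iloc[0])
```

The two totals agree, which confirms that revenue is only moved between years and that the only difference from the original total is the amount shifted out of 2025.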

calculate the date diff between the minimum in the same column depending on another column's condition using Python

d = {'country': ['US', 'US', 'US','US', 'US', 'US', 'UK', 'UK','UK','UK','UK'],
'status': [0, 0, 0, 0, 1,1,0, 0, 0, 1,1],
'count':[0, 1, 10, 20,30,40,0,1,2,4,6],
'date':['2020-04-05', '2020-04-06', '2020-04-07', '2020-04-11', '2020-04-12',
'2020-04-13', '2020-04-02', '2020-04-03', '2020-04-05', '2020-04-06', '2020-04-07']}
df = pd.DataFrame(data=d)
The first thing I thought is to use
df['day_diff']=df.groupby(['country', 'status'])['date'].apply(lambda x: x - x.min())
However, the 0 days for US status0 should appear on 2020-04-06 rather than 2020-04-05, in which the count is 0 (sorry I don't know how to put the output properly here). So there should be two conditions for the group applied to the returned day_diff
count>0
date - date.min()
country status count date day_diff
0 US 0 0 2020-04-05 0 days
1 US 0 1 2020-04-06 1 days
2 US 0 10 2020-04-07 2 days
3 US 0 20 2020-04-11 6 days
4 US 1 30 2020-04-12 0 days
5 US 1 40 2020-04-13 1 days
6 UK 0 0 2020-04-02 0 days
7 UK 0 1 2020-04-03 1 days
8 UK 0 2 2020-04-05 3 days
9 UK 1 4 2020-04-06 0 days
10 UK 1 6 2020-04-07 1 days
I hope to get a result similar to this (of course it does not work):
df.groupby(['country', 'status'])['date'].apply(lambda x: x - x.min() if df['count']>0)
That is, I want to calculate the difference between each date and the earliest date in the group whose count is above 0.
Thank you in advance.
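You can use groupby().transform() to extract the minimum date per group, ignoring the rows where count is 0:
df['date'] = pd.to_datetime(df['date'])  # needed if 'date' is still stored as strings
min_dates = (df['date'].mask(df['count'].eq(0))
               .groupby([df['country'], df['status']])
               .transform('min')
            )
df['date_diff'] = df['date'] - min_dates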
Output:
country status count date date_diff
0 US 0 0 2020-04-05 -1 days
1 US 0 1 2020-04-06 0 days
2 US 0 10 2020-04-07 1 days
3 US 0 20 2020-04-11 5 days
4 US 1 30 2020-04-12 0 days
5 US 1 40 2020-04-13 1 days
6 UK 0 0 2020-04-02 -1 days
7 UK 0 1 2020-04-03 0 days
8 UK 0 2 2020-04-05 2 days
9 UK 1 4 2020-04-06 0 days
10 UK 1 6 2020-04-07 1 days
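If the negative values on the zero-count rows are not wanted, one possible variation is to blank them out. The sketch below rebuilds the sample data so that it runs on its own. Treating zero-count rows as missing (NaT) is an assumption about the desired result, and first_active is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US"] * 6 + ["UK"] * 5,
    "status": [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
    "count": [0, 1, 10, 20, 30, 40, 0, 1, 2, 4, 6],
    "date": pd.to_datetime([
        "2020-04-05", "2020-04-06", "2020-04-07", "2020-04-11", "2020-04-12",
        "2020-04-13", "2020-04-02", "2020-04-03", "2020-04-05", "2020-04-06",
        "2020-04-07"]),
})

# Earliest date per (country, status) among rows with count > 0.
first_active = (df["date"].where(df["count"] > 0)
                .groupby([df["country"], df["status"]])
                .transform("min"))

# Rows with count == 0 get NaT instead of a negative difference.
df["date_diff"] = (df["date"] - first_active).where(df["count"] > 0)
print(df)
```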

How to check a time-range in Pandas?

I have a dataframe like this:
time
2018-06-25 20:42:00
2016-06-26 23:51:00
2017-05-34 12:29:00
2016-03-11 10:14:00
Now I created a column like this
df['isEIDRange'] = 0
Let's say the Eid festival is on 15 June 2018.
I want to set the isEIDRange column to 1 if the date is between 10 June 2018 and 20 June 2018 (5 days before and 5 days after Eid).
How can I do it?
Something like?
df.loc[ (df.time > 15 June - 5 days) & (df.time < 15 June + 5 days), 'isEIDRange' ] = 1
Use Series.between to test the values, then cast the boolean mask to integers:
df['isEIDRange'] = df['time'].between('2018-06-10', '2018-06-20').astype(int)
If you want a dynamic solution:
df = pd.DataFrame({"time": pd.date_range("2018-06-08", "2018-06-22")})
#print (df)
date = '15 June 2018'
d = pd.to_datetime(date)
diff = pd.Timedelta(5, unit='d')
df['isEIDRange1'] = df['time'].between(d - diff, d + diff).astype(int)
# pandas >= 1.3 expects a string here; older versions used inclusive=False
df['isEIDRange2'] = df['time'].between(d - diff, d + diff, inclusive="neither").astype(int)
print (df)
time isEIDRange1 isEIDRange2
0 2018-06-08 0 0
1 2018-06-09 0 0
2 2018-06-10 1 0
3 2018-06-11 1 1
4 2018-06-12 1 1
5 2018-06-13 1 1
6 2018-06-14 1 1
7 2018-06-15 1 1
8 2018-06-16 1 1
9 2018-06-17 1 1
10 2018-06-18 1 1
11 2018-06-19 1 1
12 2018-06-20 1 0
13 2018-06-21 0 0
14 2018-06-22 0 0
Or set values by numpy.where:
df['isEIDRange'] = np.where(df['time'].between(d - diff, d + diff), 1, 0)
You can use loc or np.where:
import numpy as np
df['isEIDRange'] = np.where((df['time'] > '2018-06-10') & (df['time'] < '2018-06-20'), 1, df['isEIDRange'])
This means that when the column time is between 2018-06-10 and 2018-06-20, the column isEIDRange will be equal to 1; otherwise it will retain its original value (0).
You can use pandas date_range for this:
eid = pd.date_range("2019-10-15", "2019-10-20")
df = pd.DataFrame({"dates": pd.date_range("2019-10-13", "2019-10-20")})
df["eid"] = 0
df.loc[df["dates"].isin(eid), "eid"] = 1
and output:
dates eid
0 2019-10-13 0
1 2019-10-14 0
2 2019-10-15 1
3 2019-10-16 1
4 2019-10-17 1
5 2019-10-18 1
6 2019-10-19 1
7 2019-10-20 1
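One detail to watch with the timestamps in the question is the time of day: a comparison against '2018-06-20' stops at midnight, so 2018-06-20 12:29 would fall outside the range. The sketch below shows one way to compare on whole days instead. The helper name in_window and the sample timestamps are hypothetical.

```python
import pandas as pd

def in_window(times, center, days=5, whole_days=True):
    """Return 1 where `times` falls within `days` of `center`, else 0."""
    center = pd.Timestamp(center)
    delta = pd.Timedelta(days=days)
    # normalize() drops the time of day so the whole last day is included.
    values = times.dt.normalize() if whole_days else times
    return values.between(center - delta, center + delta).astype(int)

df = pd.DataFrame({"time": pd.to_datetime([
    "2018-06-25 20:42:00",
    "2018-06-12 23:51:00",
    "2018-06-20 12:29:00",
    "2016-03-11 10:14:00",
])})

df["isEIDRange"] = in_window(df["time"], "2018-06-15")
df["isEIDRangeExact"] = in_window(df["time"], "2018-06-15", whole_days=False)
print(df)
```

The third row is flagged only when whole days are compared, because 12:29 on 20 June is later than midnight at the start of that day.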

Pandas groupby date

I have a DataFrame with events. One or more events can occur on a date (so the date can't be an index). The date range is several years. I want to group by year and month and count the Category values. Thanks.
In [12]: df = pd.read_excel('Pandas_Test.xls', 'sheet1')
In [13]: df
Out[13]:
EventRefNr DateOccurence Type Category
0 86596 2010-01-02 00:00:00 3 Small
1 86779 2010-01-09 00:00:00 13 Medium
2 86780 2010-02-10 00:00:00 6 Small
3 86781 2010-02-09 00:00:00 17 Small
4 86898 2010-02-10 00:00:00 6 Small
5 86898 2010-02-11 00:00:00 6 Small
6 86902 2010-02-17 00:00:00 9 Small
7 86908 2010-02-19 00:00:00 3 Medium
8 86908 2010-03-05 00:00:00 3 Medium
9 86909 2010-03-06 00:00:00 8 Small
10 86930 2010-03-12 00:00:00 29 Small
11 86934 2010-03-16 00:00:00 9 Small
12 86940 2010-04-08 00:00:00 9 High
13 86941 2010-04-09 00:00:00 17 Small
14 86946 2010-04-14 00:00:00 10 Small
15 86950 2011-01-19 00:00:00 12 Small
16 86956 2011-01-24 00:00:00 13 Small
17 86959 2011-01-27 00:00:00 17 Small
I tried:
df.groupby(df['DateOccurence'])
For the month and year break out I often add additional columns to the data frame that break out the dates into each piece:
df['year'] = [t.year for t in df.DateOccurence]
df['month'] = [t.month for t in df.DateOccurence]
df['day'] = [t.day for t in df.DateOccurence]
This adds space complexity (extra columns in the df) but is less time-complex (less processing in the groupby) than a datetime index; it's really up to you. A datetime index is the more idiomatic pandas way to do things.
After breaking out by year, month, day you can do any groupby you need.
df.groupby(['year','month']).Category.apply(pd.value_counts)
To get months across multiple years:
df.groupby(['month']).Category.apply(pd.value_counts)
Or in Andy Hayden's datetime index
df.groupby(di.month).Category.apply(pd.value_counts)
You can simply pick which method fits your needs better.
You can apply value_counts to the SeriesGroupby (for the column):
In [11]: g = df.groupby('DateOccurence')
In [12]: g.Category.apply(pd.value_counts)
Out[12]:
DateOccurence
2010-01-02 Small 1
2010-01-09 Medium 1
2010-02-09 Small 1
2010-02-10 Small 2
2010-02-11 Small 1
2010-02-17 Small 1
2010-02-19 Medium 1
2010-03-05 Medium 1
2010-03-06 Small 1
2010-03-12 Small 1
2010-03-16 Small 1
2010-04-08 High 1
2010-04-09 Small 1
2010-04-14 Small 1
2011-01-19 Small 1
2011-01-24 Small 1
2011-01-27 Small 1
dtype: int64
I actually hoped this to return the following DataFrame, but you need to unstack it:
In [13]: g.Category.apply(pd.value_counts).unstack(-1).fillna(0)
Out[13]:
High Medium Small
DateOccurence
2010-01-02 0 0 1
2010-01-09 0 1 0
2010-02-09 0 0 1
2010-02-10 0 0 2
2010-02-11 0 0 1
2010-02-17 0 0 1
2010-02-19 0 1 0
2010-03-05 0 1 0
2010-03-06 0 0 1
2010-03-12 0 0 1
2010-03-16 0 0 1
2010-04-08 1 0 0
2010-04-09 0 0 1
2010-04-14 0 0 1
2011-01-19 0 0 1
2011-01-24 0 0 1
2011-01-27 0 0 1
If there were multiple different Categories with the same Date they would be on the same row...
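For the year-and-month grouping the question asks about, one possible sketch is to group on a monthly period derived from the date column, which avoids adding helper columns. The small frame below is a subset of the question's data, and the names month and counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "DateOccurence": pd.to_datetime([
        "2010-01-02", "2010-01-09", "2010-02-10",
        "2010-02-09", "2010-04-08", "2011-01-19"]),
    "Category": ["Small", "Medium", "Small", "Small", "High", "Small"],
})

# One period per calendar month, e.g. 2010-01; the year is kept implicitly.
month = df["DateOccurence"].dt.to_period("M")

# Count rows per (month, Category) and spread the categories into columns.
counts = df.groupby([month, "Category"]).size().unstack(fill_value=0)
print(counts)
```

Because the period keeps the year, January 2010 and January 2011 end up on separate rows, and months with no events simply do not appear.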
