Pandas group by last 6 months from a reference date - python

I need to sum a column "qtd" taking into account the last 6 months of a reference date.
prod date qtd sum
proda 2018-01-01 2 2
proda 2018-02-01 2 4
proda 2018-04-01 1 5
proda 2018-05-01 4 9
proda 2018-06-01 2 11
proda 2018-07-01 1 11
I need to figure out how to calculate the column "sum".
Note that I don't always have every month on my dataframe, for example I don't have March.
Given a reference date (date) I need to calculate 6 months back and sum the column "qtd"
Thanks!

The cumsum() method returns the cumulative sum of a given column. It is available on a pandas Series (NumPy has a function of the same name).
df['sum'] = df['qtd'].cumsum()
Ok. In case you want to extract only the slice and calc cumsum(), you can use:
start_date = '2018-01-01'
end_date = '2018-05-01'
between = (df['date'] >= start_date) & (df['date'] <= end_date)
df2 = df[between].copy()  # copy, so adding a column does not trigger SettingWithCopyWarning
df2['sum'] = df2['qtd'].cumsum()
df2
prod date qtd sum
0 proda 2018-01-01 2 2
1 proda 2018-02-01 2 4
2 proda 2018-04-01 1 5
3 proda 2018-05-01 4 9
Or if you want to calculate it only between specific dates and add it to your data frame, you can use:
start_date = '2/1/18'
end_date = '6/1/18'
def total(start, end, df):
    sum_col = []
    for i in range(df.shape[0]):  # loop over all rows
        if df['date'][i] < start:
            # before the start date: NaN (you could use 0 instead)
            sum_col.append('NaN')
        elif df['date'][i] == start:  # start summing
            sum_col.append(df['qtd'][i])
        # sum between the start and end dates
        elif (df['date'][i] > start) and (df['date'][i] <= end):
            sum_col.append(df['qtd'][i] + sum_col[i-1])
        # after the end date: NaN (you could repeat the last total instead)
        elif df['date'][i] > end:
            sum_col.append('NaN')
    return sum_col
df['sum'] = total(start_date, end_date, df)
df
output:
prod date qtd sum
0 proda 1/1/18 2 NaN
1 proda 2/1/18 2 2
2 proda 4/1/18 1 3
3 proda 5/1/18 4 7
4 proda 6/1/18 2 9
5 proda 7/1/18 1 NaN
Hope this helps.
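Note that a plain cumsum() never drops old months, so it only matches the requested column while fewer than six months have elapsed. Below is a minimal sketch of a true trailing window. It assumes the window for a reference date covers the dates strictly after `date - 6 months` up to and including `date`; the helper name `trailing_sum` is made up for this illustration. Under that assumption the 2018-07-01 row comes out as 10 (February to July), not the 11 shown in the question, so adjust the boundary if your definition differs.

```python
import pandas as pd

df = pd.DataFrame({
    "prod": ["proda"] * 6,
    "date": pd.to_datetime(["2018-01-01", "2018-02-01", "2018-04-01",
                            "2018-05-01", "2018-06-01", "2018-07-01"]),
    "qtd": [2, 2, 1, 4, 2, 1],
})

def trailing_sum(df, months=6):
    """Sum qtd per product over (date - months, date] for every row (assumed window)."""
    totals = []
    for _, row in df.iterrows():
        start = row["date"] - pd.DateOffset(months=months)
        mask = (
            (df["prod"] == row["prod"])
            & (df["date"] > start)
            & (df["date"] <= row["date"])
        )
        totals.append(df.loc[mask, "qtd"].sum())
    return pd.Series(totals, index=df.index)

df["sum"] = trailing_sum(df)
print(df)
```

Missing months (such as March here) need no special handling, because the mask compares dates rather than counting rows. The loop is quadratic in the number of rows, which is fine for small frames but may be slow on large ones.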

Related

Get Week of month column from date

I want to extract week of month column from the date.
Dummy Data:
data = pd.DataFrame(pd.date_range('1/1/2000', periods=100, freq='D'))
Code I tried:
import math
def add_week_of_month(df):
    df['monthweek'] = pd.to_numeric(df.index.day/7)
    df['monthweek'] = df['monthweek'].apply(lambda x: math.ceil(x))
    return df
But this code counts 7-day periods within a month: for the first 7 days of a month the column is 1, from day 8 to day 14 it is 2, and so on.
What I want instead is calendar weeks per month: on the first day of the month the feature would be 1, from the first Monday after that it would be 2, and so on.
Can anyone help me with this?
You can convert the dates to weekly periods, subtract the weekly period of the first day of the month, and add 1 if the month starts on a Monday.
If you want weeks starting on Sundays, use 'W-SAT' as the period frequency and start.dt.dayofweek.eq(6).
# get first day of month
start = data[0]+pd.offsets.MonthBegin()+pd.offsets.MonthBegin(-1)
# or
# start = data[0].dt.to_period('M').dt.to_timestamp()
data['monthweek'] = ((data[0].dt.to_period('W')-start.dt.to_period('W'))
.apply(lambda x: x.n)
.add(start.dt.dayofweek.eq(0))
)
NB. in your input, column 0 is the date.
output:
0 monthweek
0 2000-01-01 0
1 2000-01-02 0
2 2000-01-03 1 # Monday
3 2000-01-04 1
4 2000-01-05 1
5 2000-01-06 1
6 2000-01-07 1
7 2000-01-08 1
8 2000-01-09 1
9 2000-01-10 2 # Monday
10 2000-01-11 2
.. ... ...
95 2000-04-05 1
96 2000-04-06 1
97 2000-04-07 1
98 2000-04-08 1
99 2000-04-09 1
[100 rows x 2 columns]
Example for 2001 (starts on a Monday):
0 monthweek
0 2001-01-01 1 # Monday
1 2001-01-02 1
2 2001-01-03 1
3 2001-01-04 1
4 2001-01-05 1
5 2001-01-06 1
6 2001-01-07 1
7 2001-01-08 2 # Monday
8 2001-01-09 2
9 2001-01-10 2
10 2001-01-11 2
11 2001-01-12 2
12 2001-01-13 2
13 2001-01-14 2
14 2001-01-15 3
Get the weekday of the first day of the month, add it to the day of the month, divide by 7 and round up:
import math
def week_of_month(dt):
    first_day = dt.replace(day=1)
    dom = dt.day
    adjusted_dom = dom + first_day.weekday()
    return int(math.ceil(adjusted_dom/7.0))
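The same arithmetic can be applied to a whole column without a Python-level loop. The sketch below assumes the dates live in a column named `date` (a name chosen here for illustration; in the question the column is `0`) and that weeks start on Monday, with the first day of the month in week 1.

```python
import pandas as pd

data = pd.DataFrame({"date": pd.date_range("2000-01-01", periods=100, freq="D")})

# Weekday (Monday = 0) of the first day of each date's month
first_of_month = data["date"] - pd.to_timedelta(data["date"].dt.day - 1, unit="D")
first_weekday = first_of_month.dt.dayofweek

# ceil((day + first_weekday) / 7), written with integer arithmetic
data["monthweek"] = (data["date"].dt.day + first_weekday - 1) // 7 + 1

print(data.head(10))
```

January 2000 starts on a Saturday, so 1 and 2 January fall in week 1 and Monday 3 January starts week 2.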

How to Get Day of Week as Integer with First of Month Changing Values?

I am using .weekday() to get the day of the week as an integer (Monday = 0 ... Sunday = 6) for every day from today until next year (+365 days from today). The problem is that if the 1st of the month falls mid-week, I need the day of the week counted with the 1st of the month as 0.
Ex. If the month starts on a Wednesday, then Wednesday = 0 ... Sunday = 4 (for that week only).
Annotated Picture of Month Explaining What I Want to Do
I originally had the code below, but it is wrong because the first branch always runs (weekday() is always less than 7).
import datetime
from datetime import date
for day in range(1, 365):
    departure_date = date.today() + datetime.timedelta(days=day)
    if departure_date.weekday() < 7:
        day_of_week = departure_date.day
    else:
        day_of_week = departure_date.weekday()
The following seems to do the job properly:
import datetime as dt
def custom_weekday(date):
    if date.weekday() > (date.day-1):
        return date.day - 1
    else:
        return date.weekday()
for day in range(1, 366):
    departure_date = dt.date.today() + dt.timedelta(days=day)
    day_of_week = custom_weekday(date=departure_date)
    print(departure_date, day_of_week, departure_date.weekday())
Your code had two small bugs:
the if condition was wrong
days are represented inconsistently: date.weekday() is 0-based, date.day is 1-based
For every date, get the first week of that month. Then, check if the date is within that first week. If it is, use the .day - 1 value (since you are 0-based). Otherwise, use the .weekday().
from datetime import date, datetime, timedelta
for day in range(-5, 40):
    departure_date = date.today() + timedelta(days=day)
    first_week = date(departure_date.year, departure_date.month, 1).isocalendar()[1]
    if first_week == departure_date.isocalendar()[1]:
        day_of_week = departure_date.day - 1
    else:
        day_of_week = departure_date.weekday()
    print(departure_date, day_of_week)
2021-08-27 4
2021-08-28 5
2021-08-29 6
2021-08-30 0
2021-08-31 1
2021-09-01 0
2021-09-02 1
2021-09-03 2
2021-09-04 3
2021-09-05 4
2021-09-06 0
2021-09-07 1
2021-09-08 2
2021-09-09 3
2021-09-10 4
2021-09-11 5
2021-09-12 6
2021-09-13 0
2021-09-14 1
2021-09-15 2
2021-09-16 3
2021-09-17 4
2021-09-18 5
2021-09-19 6
2021-09-20 0
2021-09-21 1
2021-09-22 2
2021-09-23 3
2021-09-24 4
2021-09-25 5
2021-09-26 6
2021-09-27 0
2021-09-28 1
2021-09-29 2
2021-09-30 3
2021-10-01 0
2021-10-02 1
2021-10-03 2
2021-10-04 0
2021-10-05 1
2021-10-06 2
2021-10-07 3
2021-10-08 4
2021-10-09 5
2021-10-10 6
For any date D.M.Y, get the weekday W of 1.M.Y.
Then you need to adjust weekday value only for the first 7-W days of that month. To adjust, simply subtract the value W.
Example for September 2021: the first date of month (1.9.2021) is a Wednesday, so W is 2. You need to adjust weekdays for dates 1.9.2021 to 5.9.2021 (because 7-2 is 5) in that month by minus 2.
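A minimal sketch of that rule, assuming Monday = 0 as returned by `date.weekday()`; the function name `weekday_from_month_start` is hypothetical.

```python
from datetime import date

def weekday_from_month_start(d):
    """Weekday of d, counted from the 1st of the month during the month's first week."""
    w = date(d.year, d.month, 1).weekday()  # weekday W of the first of the month
    if d.day <= 7 - w:                      # only the first 7 - W days are adjusted
        return d.weekday() - w
    return d.weekday()

for d in (date(2021, 9, 1), date(2021, 9, 5), date(2021, 9, 6), date(2021, 10, 3)):
    print(d, weekday_from_month_start(d))
```

For September 2021 this gives 0 for the 1st (a Wednesday), 4 for the 5th (a Sunday), and 0 again for Monday the 6th.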

Create a date counter variable starting with a particular date

I have a variable as:
start_dt = 201901 which is basically Jan 2019
I have an initial data frame as:
month
0
1
2
3
4
I want to add a new column (date) to the dataframe where for month 0, the date is the start_dt - 1 month, and for subsequent months, the date is a month + 1 increment.
I want the resulting dataframe as:
month date
0 12/1/2018
1 1/1/2019
2 2/1/2019
3 3/1/2019
4 4/1/2019
You can subtract 1 from the month column, add the start date converted to a monthly period with Timestamp.to_period, and then convert the result back to timestamps with to_timestamp:
start_dt = 201801
start_dt = pd.to_datetime(start_dt, format='%Y%m')
s = df['month'].sub(1).add(start_dt.to_period('m')).dt.to_timestamp()
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
Alternatively, you can convert the column to month offsets (after subtracting 1) and add the start datetime:
s = df['month'].apply(lambda x: pd.DateOffset(months=x-1)).add(start_dt)
print (s)
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
Name: month, dtype: datetime64[ns]
Here is how you can use the third-party library dateutil to increment a datetime by one month:
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
start_dt = '201801'
number_of_rows = 10
start_dt = datetime.strptime(start_dt, '%Y%m')
df = pd.DataFrame({'date': [start_dt + relativedelta(months=+n)
                            for n in range(-1, number_of_rows-1)]})
print(df)
Output:
date
0 2017-12-01
1 2018-01-01
2 2018-02-01
3 2018-03-01
4 2018-04-01
5 2018-05-01
6 2018-06-01
7 2018-07-01
8 2018-08-01
9 2018-09-01
As you can see, in each iteration of the for loop, the initial datetime is being incremented by the corresponding number (starting at -1) of the iteration.
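Another way to avoid adding integers to period or offset objects is plain month arithmetic. This is a sketch under the assumption that `start_dt` is an integer in YYYYMM form, as in the question; the intermediate name `months_total` is introduced here for illustration.

```python
import pandas as pd

start_dt = 201901
df = pd.DataFrame({"month": [0, 1, 2, 3, 4]})

# Split YYYYMM into year and month, then count months since year 0
year, mon = divmod(start_dt, 100)
months_total = year * 12 + (mon - 1) + df["month"] - 1  # month 0 -> one month before start_dt

# to_datetime assembles dates from year / month / day columns
df["date"] = pd.to_datetime(pd.DataFrame({
    "year": months_total // 12,
    "month": months_total % 12 + 1,
    "day": 1,
}))
print(df)
```

With `start_dt = 201901`, month 0 maps to 2018-12-01 and month 4 to 2019-04-01, matching the table in the question.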

How to check a time-range in Pandas?

I have a dataframe like this:
time
2018-06-25 20:42:00
2016-06-26 23:51:00
2017-05-34 12:29:00
2016-03-11 10:14:00
Now I created a column like this
df['isEIDRange'] = 0
Let's say the Eid festival is on 15 June 2018.
I want to set isEIDRange to 1 if the date is between 10 June 2018 and 20 June 2018 (5 days before and 5 days after Eid).
How can I do it?
Something like?
df.loc[ (df.time > 15 June - 5 days) & (df.time < 15 June + 5 days), 'isEIDRange' ] = 1
Use Series.between to test the values and cast the boolean mask to integers:
df['isEIDRange'] = df['time'].between('2018-06-10', '2018-06-20').astype(int)
If you want a dynamic solution:
df = pd.DataFrame({"time": pd.date_range("2018-06-08", "2018-06-22")})
#print (df)
date = '15 June 2018'
d = pd.to_datetime(date)
diff = pd.Timedelta(5, unit='d')
df['isEIDRange1'] = df['time'].between(d - diff, d + diff).astype(int)
df['isEIDRange2'] = df['time'].between(d - diff, d + diff, inclusive='neither').astype(int)  # pandas >= 1.3; older versions used inclusive=False
print (df)
time isEIDRange1 isEIDRange2
0 2018-06-08 0 0
1 2018-06-09 0 0
2 2018-06-10 1 0
3 2018-06-11 1 1
4 2018-06-12 1 1
5 2018-06-13 1 1
6 2018-06-14 1 1
7 2018-06-15 1 1
8 2018-06-16 1 1
9 2018-06-17 1 1
10 2018-06-18 1 1
11 2018-06-19 1 1
12 2018-06-20 1 0
13 2018-06-21 0 0
14 2018-06-22 0 0
Or set values by numpy.where:
df['isEIDRange'] = np.where(df['time'].between(d - diff, d + diff), 1, 0)
You can use loc or np.where:
import numpy as np
df['isEIDRange'] = np.where((df['time'] > '2018-06-10') & (df['time'] < '2018-06-20'), 1, df['isEIDRange'])
This means that when the column time is strictly between 2018-06-10 and 2018-06-20, the column isEIDRange is set to 1; otherwise it retains its original value (0).
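The loc variant mentioned above can be sketched as follows. The sample timestamps are made up for illustration, and the sketch assumes the whole of 10 June and 20 June should count as in range, so the time of day is dropped with `dt.normalize()` before comparing. Without that step, a timestamp such as 2018-06-20 12:29 would fall outside a window that ends at 2018-06-20 00:00.

```python
import pandas as pd

# Illustrative data only; not the asker's exact rows
df = pd.DataFrame({"time": pd.to_datetime([
    "2018-06-25 20:42:00",
    "2016-06-26 23:51:00",
    "2018-06-20 12:29:00",
    "2018-06-10 10:14:00",
])})

eid = pd.Timestamp("2018-06-15")
window = pd.Timedelta(days=5)

df["isEIDRange"] = 0
in_range = df["time"].dt.normalize().between(eid - window, eid + window)
df.loc[in_range, "isEIDRange"] = 1
print(df)
```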
You can use pandas date_range for this:
eid = pd.date_range("2019-10-15", "2019-10-20")
df = pd.DataFrame({"dates": pd.date_range("2019-10-13", "2019-10-20")})
df["eid"] = 0
df.loc[df["dates"].isin(eid), "eid"] = 1
and output:
dates eid
0 2019-10-13 0
1 2019-10-14 0
2 2019-10-15 1
3 2019-10-16 1
4 2019-10-17 1
5 2019-10-18 1
6 2019-10-19 1
7 2019-10-20 1

Pandas, add date column to a series

I have a time-series dataframe that is date-agnostic: it uses a period number instead of a date.
At some point I would like to add dates derived from the period.
My dataframe looks like
period custid
1 1
2 1
3 1
1 2
2 2
1 3
2 3
3 3
4 3
I would like to be able to pick an arbitrary starting date, for example 1/1/2018, which would be period 1, so you would end up with
period custid date
1 1 1/1/2018
2 1 2/1/2018
3 1 3/1/2018
1 2 1/1/2018
2 2 2/1/2018
1 3 1/1/2018
2 3 2/1/2018
3 3 3/1/2018
4 3 4/1/2018
You could create a column of timedeltas based on the period column, where each row is a timedelta of period days (minus 1, so that it starts at 0). Then add that timedelta to your start date, which you can define as a datetime object:
start_date = pd.to_datetime('1/1/2018')
df['date'] = pd.to_timedelta(df['period'] - 1, unit='D') + start_date
>>> df
period custid date
0 1 1 2018-01-01
1 2 1 2018-01-02
2 3 1 2018-01-03
3 1 2 2018-01-01
4 2 2 2018-01-02
5 1 3 2018-01-01
6 2 3 2018-01-02
7 3 3 2018-01-03
8 4 3 2018-01-04
Edit: In your comment, you said you were trying to add months, not days. For this, you could use your method, or alternatively, the following:
from pandas.tseries.offsets import MonthBegin
df['date'] = start_date + (df['period'] -1) * MonthBegin()
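To check that a month-based variant produces the dates in the question, here is a small self-contained sketch. It uses `pd.DateOffset` per row rather than multiplying `MonthBegin`, which is a substitution made here for illustration, and it assumes period 1 maps to the start date itself.

```python
import pandas as pd

df = pd.DataFrame({
    "period": [1, 2, 3, 1, 2, 1, 2, 3, 4],
    "custid": [1, 1, 1, 2, 2, 3, 3, 3, 3],
})
start_date = pd.Timestamp("2018-01-01")

# Period 1 -> start_date, period 2 -> one month later, and so on
df["date"] = df["period"].apply(
    lambda p: start_date + pd.DateOffset(months=int(p) - 1)
)
print(df)
```

Because the offset depends only on the period value, every customer gets the same date for the same period.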
