I have seen a lot of similar posts on "nth weekday of the month", but my question pertains to "nth weekday of the year".
Background:
I have a table that has daily sales data. There are 3 columns: date, day of week (Mon, Tue, Wed etc.) and sales. I would like to match nth weekday of Year 1 with Year 2 and compare sales that way.
Example1: 01/06/2020 matches with 01/04/2021, both are the 1st Monday of that year.
Example2: 11/02/2019 matches with 10/31/2020, both are the 44th Saturday of that year.
As you can see, I can't simply do a "nth weekday of the MONTH" because sometimes the matched nth weekday would fall in different months (as seen in 11/02/2019 & 10/31/2020).
I am manipulating the table in pandas. I am wondering if there's a quick way to create a column that calculates the "nth weekday of the year" for me, so that I could later match rows based on that value?
Thanks for your help.
The pandas package has some good time/date functions.
For example
import pandas as pd
s = pd.date_range('2020-01-01', '2020-12-31', freq='D').to_series()
print(s.dt.dayofweek)
gives you the weekdays as integers.
2020-01-01 2
2020-01-02 3
2020-01-03 4
2020-01-04 5
2020-01-05 6
2020-01-06 0
2020-01-07 1
2020-01-08 2
2020-01-09 3
2020-01-10 4
(Monday=0)
Then you can do
mondays = s.dt.dayofweek.eq(0)
If you want to find the first Monday of the year, use:
print(mondays.idxmax())
Timestamp('2020-01-06 00:00:00', freq='D')
Or the 5th Monday:
n = 4
print(s[mondays].iloc[n])
Timestamp('2020-02-03 00:00:00')
If your sales dataframe is df then to compare sales on the first 5 Mondays of two different years you could do something like this:
mondays = df['Date'].dt.dayofweek.eq(0)
mondays_in_y1 = (df['Year'] == 2019) & mondays
mondays_in_y2 = (df['Year'] == 2020) & mondays
pd.DataFrame({
2019: df.loc[mondays_in_y1, 'Sales'].values[:5],
2020: df.loc[mondays_in_y2, 'Sales'].values[:5]
})
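If you would rather have a reusable "nth weekday of the year" column (so you can join the two years on it), here is a minimal sketch of one way to build it with groupby/cumcount; the toy data and the column names Dow and NthWeekdayOfYear are placeholders, and it assumes Date is already a datetime column:
import pandas as pd

# toy sales table spanning two years (placeholder data)
df = pd.DataFrame({'Date': pd.date_range('2019-01-01', '2020-12-31', freq='D')})
df['Sales'] = range(len(df))

df['Year'] = df['Date'].dt.year
df['Dow'] = df['Date'].dt.dayofweek                  # Monday=0 ... Sunday=6
# nth occurrence of this weekday within its year (1-based)
df['NthWeekdayOfYear'] = df.groupby(['Year', 'Dow']).cumcount() + 1

# rows with the same (Dow, NthWeekdayOfYear) now match across years,
# e.g. the 44th Saturday of 2019 lines up with the 44th Saturday of 2020
matched = df.pivot_table(index=['Dow', 'NthWeekdayOfYear'],
                         columns='Year', values='Sales')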
IIUC, you can start from:
import pandas as pd
import numpy as np
df = pd.DataFrame({"date":pd.date_range(start="2020-01-01",
end="2020-12-31")})
# weekday number (Monday is 0)
df["dow"] = df["date"].dt.weekday
# 1 if the day is a weekday (Mon-Fri), else 0
df["is_weekday"] = (df["dow"]<5).astype(int)
# running count of weekdays so far in the year
df["n"] = df["is_weekday"].cumsum()
# blank out the counter on weekend rows (where it did not increment)
df["n"] = np.where(df["n"]==df["n"].shift(), np.nan, df["n"])
df[df["n"]==100]["date"]
Edit
In two lines only
df["n"] = (df["date"].dt.weekday<5).astype(int).cumsum()
df["n"] = np.where(df["n"]==df["n"].shift(), np.nan, df["n"])
You can try using dt.week. It returns a series, but you can simply define a new column with these values.
For example:
import numpy as np
import pandas as pd
rng = pd.date_range('2015-02-24', periods=5, freq='D')
df = pd.DataFrame({'Date': rng, 'Val': np.random.randn(len(rng))})
Output:
Date Val
0 2015-02-24 -0.977278
1 2015-02-25 0.950088
2 2015-02-26 -0.151357
3 2015-02-27 -0.103219
4 2015-02-28 0.410599
Then you can add df['Week_Number'] = df['Date'].dt.week, which creates a new column with the week number:
Date Val Week_Number
0 2015-02-24 -0.977278 9
1 2015-02-25 0.950088 9
2 2015-02-26 -0.151357 9
3 2015-02-27 -0.103219 9
4 2015-02-28 0.410599 9
Hope it helps. It's my first contribution.
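A side note, in case it matters for newer environments: Series.dt.week has been deprecated in recent pandas releases, so a roughly equivalent sketch (same toy data as above) would use dt.isocalendar() instead:
import numpy as np
import pandas as pd

rng = pd.date_range('2015-02-24', periods=5, freq='D')
df = pd.DataFrame({'Date': rng, 'Val': np.random.randn(len(rng))})
# isocalendar() returns year/week/day columns; take the ISO week number
df['Week_Number'] = df['Date'].dt.isocalendar().week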
Related
I've got a dataframe in pandas that stores the Id of a person, the quality of interaction, and the date of the interaction. A person can have multiple interactions across multiple dates, so to help visualise and plot this I converted it into a pivot table grouping first by Id then by date to analyse the pattern over time.
e.g.
import pandas as pd
df = pd.DataFrame({'Id':['A4G8','A4G8','A4G8','P9N3','P9N3','P9N3','P9N3','C7R5','L4U7'],
'Date':['2016-1-1','2016-1-15','2016-1-30','2017-2-12','2017-2-28','2017-3-10','2019-1-1','2018-6-1','2019-8-6'],
'Quality':[2,3,6,1,5,10,10,2,2]})
pt = df.pivot_table(values='Quality', index=['Id','Date'])
print(pt)
Leads to this:
                Quality
Id   Date
A4G8 2016-1-1         2
     2016-1-15        4
     2016-1-30        6
P9N3 2017-2-12        1
     2017-2-28        5
     2017-3-10       10
     2019-1-1        10
C7R5 2018-6-1         2
L4U7 2019-8-6         2
However, I'd also like to...
Measure the time from the first interaction for each interaction per Id
Measure the time from the previous interaction with the same Id
So I'd get a table similar to the one below
Id   Date       Quality  Time From First  Time To Prev
A4G8 2016-1-1         2   0 days          NA days
     2016-1-15        4  14 days          14 days
     2016-1-30        6  29 days          14 days
P9N3 2017-2-12        1   0 days          NA days
     2017-2-28        5  15 days          15 days
     2017-3-10       10  24 days           9 days
The Id column is a string type, and I've converted the date column into datetime, and the Quality column into an integer.
The table is rather large (>10,000 unique Ids), so for performance reasons I'm trying to avoid using for loops. I'm guessing the solution somehow uses pd.eval, but I'm stuck as to how to apply it correctly.
Apologies, I'm a Python, pandas, & Stack Overflow noob, and I haven't found the answer anywhere yet, so even some pointers on where to look would be great :-).
Many thanks in advance
Convert Date to datetimes, then subtract the per-group minimum datetime (via GroupBy.transform) from the Date column; for the second new column use DataFrameGroupBy.diff:
df['Date'] = pd.to_datetime(df['Date'])
df['Time From First'] = df['Date'].sub(df.groupby('Id')['Date'].transform('min'))
df['Time To Prev'] = df.groupby('Id')['Date'].diff()
print (df)
Id Date Quality Time From First Time To Prev
0 A4G8 2016-01-01 2 0 days NaT
1 A4G8 2016-01-15 3 14 days 14 days
2 A4G8 2016-01-30 6 29 days 15 days
3 P9N3 2017-02-12 1 0 days NaT
4 P9N3 2017-02-28 5 16 days 16 days
5 P9N3 2017-03-10 10 26 days 10 days
6 P9N3 2019-01-01 10 688 days 662 days
7 C7R5 2018-06-01 2 0 days NaT
8 L4U7 2019-08-06 2 0 days NaT
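If you later need the gaps as plain integers rather than Timedeltas (for plotting, say), a small follow-up sketch; the new column names are just placeholders:
# convert the Timedelta columns to whole days (NaT becomes NaN)
df['Days From First'] = df['Time From First'].dt.days
df['Days To Prev'] = df['Time To Prev'].dt.days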
df["Date"] = pd.to_datetime(df.Date)
df = df.merge(
df.groupby(["Id"]).Date.first(),
on="Id",
how="left",
suffixes=["", "_first"]
)
df["Time From First"] = df.Date-df.Date_first
df['Time To Prev'] = df.groupby('Id').Date.diff()
df.set_index(["Id", "Date"], inplace=True)
df
output: the same Time From First and Time To Prev values as above, now indexed by Id and Date (the intermediate Date_first column is kept as well).
I'm working with the following dataset:
Date
2016-01-04
2016-01-05
2016-01-06
2016-01-07
2016-01-08
and a list holidays = ['2016-01-01','2016-01-18'....'2017-11-23','2017-12-25']
Objective: Create a column indicating whether a particular date is within +- 7 days of any holiday present in the list.
Mock output:
Date        Within a week of Holiday
2016-01-04                         1
2016-01-05                         1
2016-01-06                         1
2016-01-07                         1
2016-01-08                         0
I'm working with a lot of date records and am thus trying to find a quick (most optimized) way to do this.
My Current Solution:
One way I figured to do this quickly would be to create another list with only the unique dates for my desired duration (say 2 years). This way, I can implement a simple solution with 2 for loops to check if a date is within +-7 days of a holiday, and it wouldn't be computationally heavy as both lists would be relatively small (730 unique dates and ~20 dates in the holiday list).
Once I have my desired list of dates, all I have to do is run a single check on my 'Date' column to see if that date is a part of this new list I created. However, any suggestions to do this even quicker?
Turn holidays into a DataFrame and then merge_asof with a tolerance of 6 days:
new_df = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
tolerance=pd.Timedelta(days=6))
new_df['Holiday'] = np.where(new_df['Holiday'].notnull(), 1, 0)
new_df = new_df.rename(columns={'Holiday': 'Within a week of Holiday'})
Complete Working Example:
import numpy as np
import pandas as pd
holidays = pd.DataFrame(pd.to_datetime(['2016-01-01', '2016-01-18']),
columns=['Holiday'])
df = pd.DataFrame({
'Date': ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
'2016-01-08']
})
df['Date'] = pd.to_datetime(df['Date'])
new_df = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
tolerance=pd.Timedelta(days=6))
new_df['Holiday'] = np.where(new_df['Holiday'].notnull(), 1, 0)
new_df = new_df.rename(columns={'Holiday': 'Within a week of Holiday'})
print(new_df)
new_df:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Or turn holidays into a NumPy datetime array, broadcast the subtraction across the 'Date' column, compare the absolute difference to 7 days, and see if there are any matches:
holidays = np.array(['2016-01-01', '2016-01-18']).astype('datetime64')
df['Within a week of Holiday'] = (
abs(df['Date'].values - holidays[:, None]) < pd.Timedelta(days=7)
).any(axis=0).astype(int)
Complete Working Example:
import numpy as np
import pandas as pd
holidays = np.array(['2016-01-01', '2016-01-18']).astype('datetime64')
df = pd.DataFrame({
'Date': ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
'2016-01-08']
})
df['Date'] = pd.to_datetime(df['Date'])
df['Within a week of Holiday'] = (
abs(df['Date'].values - holidays[:, None]) < pd.Timedelta(days=7)
).any(axis=0).astype(int)
print(df)
df:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Make a function that calculates the dates within about +-7 days of a given date and checks whether any of them is in holidays, returning True or False, then apply that function to the DataFrame:
import datetime
import pandas as pd
holidays = ['2016-01-01','2016-01-18','2017-11-23','2017-12-25']
def holiday_present(date):
date = datetime.datetime.strptime(date, '%Y-%m-%d')
for i in range(-7,7):
datte = (date - datetime.timedelta(days=i)).strftime('%Y-%m-%d')
if datte in holidays:
return True
return False
data = {
"Date":[
"2016-01-04",
"2016-01-05",
"2016-01-06",
"2016-01-07",
"2016-01-08"]
}
df= pd.DataFrame(data)
df["Within a week of Holiday"] = df["Date"].apply(holiday_present).astype(int)
Output:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Try this:
Sample:
import pandas as pd
df = pd.DataFrame({'Date': {0: '2016-01-04',
1: '2016-01-05',
2: '2016-01-06',
3: '2016-01-07',
4: '2016-01-08'}})
Code:
def get_date_range(holidays):
h = [pd.to_datetime(x) for x in holidays]
h = [pd.date_range(x - pd.DateOffset(6), x + pd.DateOffset(6)) for x in h]
h = [x.strftime('%Y-%m-%d') for y in h for x in y]
return h
df['Within a week of Holiday'] = df['Date'].isin(get_date_range(holidays))*1
Result:
Out[141]:
0 1
1 1
2 1
3 1
4 0
Name: Within a week of Holiday, dtype: int32
What would be the correct way to show the average sales volume in Carlisle city for each year between 2010 and 2020?
Here is an abbreviated form of the large data frame showing only the columns and rows relevant to the question:
import pandas as pd
df = pd.DataFrame({'Date': ['01/09/2009','01/10/2009','01/11/2009','01/12/2009','01/01/2010','01/02/2010','01/03/2010','01/04/2010','01/05/2010','01/06/2010','01/07/2010','01/08/2010','01/09/2010','01/10/2010','01/11/2010','01/12/2010','01/01/2011','01/02/2011'],
'RegionName': ['Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle'],
'SalesVolume': [118,137,122,132,83,81,105,114,110,106,137,130,129,121,129,100,84,62]})
This is what I've tried:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv ('C:/Users/user/AppData/Local/Programs/Python/Python39/Scripts/uk_hpi_dataset_2021_01.csv')
df.Date = pd.to_datetime(df.Date)
df['Year'] = pd.to_datetime(df['Date']).apply(lambda x:
'{year}'.format(year=x.year).zfill(2))
carlisle_vol = df[df['RegionName'].str.contains('Carlisle')]
carlisle_vol.groupby('Year')['SalesVolume'].mean()
print(sales_vol)
When I try to run this code, it doesn't filter the 'Date' column so as to only calculate the average SalesVolume for the years from '01/01/2010' to '01/12/2020'. For some reason, it also prints out every other column as well. Can anyone please help me to answer this question correctly?
This is the result I've got
>>> df.loc[(df["Date"].dt.year.between(2010, 2020))
& (df["RegionName"] == "Carlisle")] \
.groupby([pd.Grouper(key="Date", freq="Y")])["SalesVolume"].mean()
Date
2010-01-01 112.083333
2011-01-01 73.000000
Freq: A-DEC, Name: SalesVolume, dtype: float64
Going further:
The only difference from @nocibambi's answer is the groupby parameter, in particular the freq argument of pd.Grouper. Imagine your accounting year starts on the 1st of September.
Sales every 3 months:
>>> df
Date Sales
0 2010-09-01 1 # 1st group: mean=2.5
1 2010-12-01 2
2 2011-03-01 3
3 2011-06-01 4
4 2011-09-01 5 # 2nd group: mean=6.5
5 2011-12-01 6
6 2012-03-01 7
7 2012-06-01 8
>>> df.groupby(pd.Grouper(key="Date", freq="AS-SEP")).mean()
Sales
Date
2010-09-01 2.5
2011-09-01 6.5
Check the documentation for all the freq aliases and anchoring suffixes.
You can access the year with the datetime accessor:
df[
(df["RegionName"] == "Carlisle")
& (df["Date"].dt.year >= 2010)
& (df["Date"].dt.year <= 2020)
].groupby(df.Date.dt.year)["SalesVolume"].mean()
>>>
Date
2010 112.083333
2011 73.000000
Name: SalesVolume, dtype: float64
Rookie here so please excuse my question format:
I got an event time series dataset for two months (columns for "date/time" and "# of events", each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort the column with
import pandas as pd
df = pd.DataFrame({'col' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
'datetime' : ['2019-01-01 00:00:00','2015-02-01 00:00:00','2015-03-01 00:00:00','2015-04-01 00:00:00',
'2018-05-01 00:00:00','2016-06-01 00:00:00','2017-07-01 00:00:00','2013-08-01 00:00:00',
'2015-09-01 00:00:00','2015-10-01 00:00:00','2015-11-01 00:00:00','2015-12-01 00:00:00',
'2014-01-01 00:00:00','2020-01-01 00:00:00','2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10,:]
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest, and it has an nsmallest counterpart: pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
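To get the 10 smallest hours within each week rather than overall, one possible sketch combines nsmallest with a weekly Grouper; the column names datetime and n_events are assumptions, so adjust them to your data:
import numpy as np
import pandas as pd

# hypothetical hourly event counts over two months
idx = pd.date_range('2020-06-01', '2020-07-31 23:00:00', freq='H')
df = pd.DataFrame({'datetime': idx,
                   'n_events': np.random.randint(0, 20, len(idx))})

# 10 hours with the fewest events within each calendar week
lowest = (df.groupby(pd.Grouper(key='datetime', freq='W'))
            .apply(lambda g: g.nsmallest(10, 'n_events'))
            .reset_index(drop=True))
print(lowest)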
My bad, so your DatetimeIndex is an hourly sampling, and you need the hour(s) with the fewest events per week.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well, I'd start by converting each hour into columns.
1. Create an hour column that holds the hour of the day.
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, with n_events as the values.
So you'll then have 1 datetime index and 24 hour columns, with values denoting the number of events. pandas.DataFrame.pivot_table
...
Date        hour0 ... hour8  hour9  hour10 ... hour23
2020-06-06      0         3      3       2        0
...
Then you can resample it to weekly level and aggregate using sum.
df.resample('W').sum()
The last part is a bit tricky to do on the dataframe. But fairly simple if you just need the output.
for row in df.itertuples():
print(sorted(row[1:]))
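Putting those steps together, a rough end-to-end sketch (assuming columns named date and n_events; both names are placeholders):
import numpy as np
import pandas as pd

# hypothetical hourly data
idx = pd.date_range('2020-06-01', periods=14 * 24, freq='H')
df = pd.DataFrame({'date': idx, 'n_events': np.random.randint(0, 10, len(idx))})

# hour of day and calendar day
df['hour'] = df['date'].dt.hour
df['day'] = df['date'].dt.normalize()

# one row per day, one column per hour of day
wide = df.pivot_table(index='day', columns='hour', values='n_events')

# weekly totals per hour-of-day column
weekly = wide.resample('W').sum()

# the 10 smallest weekly totals, per week
for row in weekly.itertuples():
    print(row[0].date(), sorted(row[1:])[:10])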
I have a pandas dataframe with columns 'Date' and 'Skew' (a float). I want to average the skew values between every Tuesday and then store them in a list or dataframe. I tried using lambda as given in this question (Pandas, groupby and summing over specific months), but it only helps to sum over a particular week; I cannot go across weeks, i.e. from one Tuesday to another. Can you show how to do this?
Here's an example with random data
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('20130101', periods=100),
                   'Skew': 10 + np.random.randn(100)})
min_date = df.Date.min()
start = min_date.dayofweek
if start < 1:
    min_date = min_date - np.timedelta64(6 + start, 'D')
elif start > 1:
    min_date = min_date - np.timedelta64(start - 1, 'D')
df.groupby((df.Date - min_date).dt.days // 7).mean()
Input:
>>> df
Date Skew
0 2013-01-01 10.082080
1 2013-01-02 10.907402
2 2013-01-03 8.485768
3 2013-01-04 9.221740
4 2013-01-05 10.137910
5 2013-01-06 9.084963
6 2013-01-07 9.457736
7 2013-01-08 10.092777
Output:
Skew
Date
0 9.625371
1 9.993275
2 10.041077
3 9.837709
4 9.901311
5 9.985390
6 10.123757
7 9.782892
8 9.889291
9 9.853204
10 10.190098
11 10.594125
12 10.012265
13 9.278008
14 10.530251
Logic: compute the week number relative to the first week's Tuesday, group by it, and take each group's (i.e. each week's) mean.
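An alternative sketch that should give the same Tuesday-to-Monday buckets, using a week frequency anchored on Monday (so each bucket starts on a Tuesday); the group labels are week-end dates rather than 0, 1, 2, ...:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('20130101', periods=100),
                   'Skew': 10 + np.random.randn(100)})
# 'W-MON' ends each weekly bucket on a Monday, i.e. each bucket
# runs from a Tuesday through the following Monday
weekly_skew = df.groupby(pd.Grouper(key='Date', freq='W-MON'))['Skew'].mean()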