Convert week number in dataframe to start date of week (Monday) - python

I'm looking to convert daily data into weekly data. Here is the code I've used to achieve this:
daily_data['Week_Number'] = pd.to_datetime(daily_data['candle_date']).dt.week
daily_data['Year'] = pd.to_datetime(daily_data['candle_date']).dt.year
df2 = daily_data.groupby(['Year', 'Week_Number']).agg({'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last', 'volume': 'sum', 'market_cap': 'sum'})
Currently, the dataframe output looks as below -
open high low close volume market_cap
Year Week_Number
2020 31 11106.793367 12041.230145 10914.007709 11059.660924 86939673211 836299315108
32 11059.658520 11903.881608 11011.841384 11653.660942 125051146775 1483987715241
33 11665.874956 12047.515879 11199.052457 11906.236593 141819289223 1513036354035
34 11915.898402 12382.422676 11435.685834 11671.520767 136888268138 1533135548697
35 11668.211439 11806.669046 11183.114210 11704.963980 122232543594 1490089199926
36 11713.540300 12044.196936 9951.201578 10277.329333 161912442921 1434502733759
I'd like the output to have a column week_date that shows the date of the Monday that starts each week. For example, show 27-07-2020 in place of week 31 of 2020, and so on. It's this final piece that I'm stuck on. Could I please have some help achieving this?
**SOLUTION FOR THOSE WHO NEED IT**
The entire function used to convert daily data to weekly data is below:
from datetime import datetime

import pandas as pd


def convert_dailydata_to_weeklydata(daily_data):
    # Print function name (project-specific logging helper)
    SupportMethods.print_func_name()
    # Skip rows until the first row whose date falls on a Monday
    row_counter_start = 0
    while datetime.weekday(daily_data['candle_date'][row_counter_start]) != 0:
        row_counter_start += 1
    # Copy all rows from the first Monday onwards
    daily_data_temp = daily_data[row_counter_start:].copy()
    # Get the ISO week number and ISO year. Week numbers repeat across years,
    # so year + week number together form the unique key. Using the ISO year
    # rather than the calendar year keeps the pair consistent with the
    # '%G%V%w' format below at year boundaries. (dt.isocalendar() replaces
    # dt.week, which was removed in pandas 2.0.)
    iso = pd.to_datetime(daily_data_temp['candle_date']).dt.isocalendar()
    daily_data_temp['Week_Number'] = iso.week
    daily_data_temp['Year'] = iso.year
    # Grouping based on required values
    df = daily_data_temp.groupby(['Year', 'Week_Number']).agg(
        {'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last',
         'volume': 'sum', 'market_cap': 'sum'})
    # Reset index
    df = df.reset_index()
    # Create week date (start of week). The appended "1" is the day of the
    # week for '%w': days are numbered 0-6 with 0 being Sunday, so 1 is
    # Monday. Zero-padding the week number keeps the string unambiguous.
    df['week_date'] = pd.to_datetime(
        df['Year'].astype(str) + df['Week_Number'].astype(str).str.zfill(2) + "1",
        format='%G%V%w')
    # Set indexes
    df = df.set_index(['Year', 'Week_Number'])
    # Re-order columns into a new dataframe
    weekly_data = df[["week_date", "open", "high", "low", "close", "volume", "market_cap"]]
    weekly_data = weekly_data.rename({'week_date': 'candle_date'}, axis=1)
    # Drop index columns
    weekly_data.reset_index(drop=True, inplace=True)
    # Return data after dropping the current (incomplete) week: if the last
    # daily candle does not fall on a Sunday, the final weekly row is partial
    if datetime.weekday(daily_data['candle_date'].iloc[-1]) != 6:
        return weekly_data.head(-1)
    return weekly_data
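A minimal usage sketch, with the project-specific SupportMethods helper stubbed out and a made-up two-week daily frame (all of this is illustrative, not from the original post):
import pandas as pd

class SupportMethods:  # stub for the project-specific logging helper
    @staticmethod
    def print_func_name():
        pass

dates = pd.date_range('2020-07-27', '2020-08-09')  # Mon 27 Jul - Sun 9 Aug
daily_data = pd.DataFrame({
    'candle_date': dates,
    'open': range(14), 'high': range(14), 'low': range(14),
    'close': range(14), 'volume': range(14), 'market_cap': range(14),
})
print(convert_dailydata_to_weeklydata(daily_data))  # two weekly candles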

The same idea also works with the week number first and the ISO weekday directive '%u' (1 = Monday):
df['week_date-Week'] = pd.to_datetime(df['Week_Number'].astype(str) + df['Year'].astype(str).add('-1'), format='%V%G-%u')

Try using pd.to_datetime on the 'Year' and 'Week_Number' columns with a format string for ISO year, ISO week of year, and day of week ('%G%V%w'):
df = df.reset_index()
df['week_date'] = pd.to_datetime(
    df['Year'].astype(str) + df['Week_Number'].astype(str) + "1",
    format='%G%V%w'
)
df = df.set_index(['Year', 'Week_Number'])
The + "1" is for the day of the week: with '%w', days are numbered 0-6 with 0 being Sunday and 6 being Saturday, so 1 is Monday. (Ref. the strftime/strptime Format Codes in the Python docs.)
df:
open close week_date
Year Week_Number
2020 31 11106.793367 11059.660924 2020-07-27
32 11059.658520 11653.660942 2020-08-03
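Two caveats worth knowing with this approach. Single-digit week numbers make a shorter string ('202051' for week 5) that strptime still parses, but only via regex backtracking, so zero-padding the week is a safer defensive tweak. And near year boundaries the calendar year from dt.year can differ from the ISO year that '%G' expects (e.g. 2021-01-01 belongs to ISO week 53 of ISO year 2020), so dt.isocalendar() is the consistent source for both columns. A minimal sketch of the defensive version (assuming the same column layout as the question):
import pandas as pd

dates = pd.to_datetime(['2020-07-27', '2021-01-01'])
iso = dates.isocalendar()  # DataFrame with ISO 'year', 'week', 'day'

week_date = pd.to_datetime(
    iso['year'].astype(str) + iso['week'].astype(str).str.zfill(2) + '1',
    format='%G%V%w'
)
print(list(week_date))  # [2020-07-27, 2020-12-28]: Mondays of ISO weeks 31 and 53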

Alternatively, try via apply() and the datetime.strptime() method:
import datetime
df = df.reset_index()
df['week_date'] = (df[['Year', 'Week_Number']].astype(str)
                   .apply(lambda x: datetime.datetime.strptime('-W'.join(x) + '-1', "%Y-W%W-%w"), axis=1))
df = df.set_index(['Year', 'Week_Number'])
Note that '%W' numbers weeks from the first Monday of the calendar year rather than by the ISO rules of '%V', so the two approaches can disagree around year boundaries.

Try using dt.strftime with '%V' to go the other way (date → ISO week number):
pd.to_datetime(pd.Series(['27-07-2020'])).dt.strftime('%V')
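On a related note, since Series.dt.week (used in the question) is deprecated and was removed in pandas 2.0, dt.isocalendar().week is the current way to get the ISO week number (dayfirst=True makes the day-first parse explicit); a minimal sketch:
pd.to_datetime(pd.Series(['27-07-2020']), dayfirst=True).dt.isocalendar().week  # 31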

pyspark SQL
When the data is very large, .apply takes a long time to process. I used the code below to get the first date of the month and the week date starting from Monday.
from pyspark.sql.functions import trunc, next_day, date_sub

df = df.withColumn('month_date', trunc('date', 'month'))
Output
date month_date
2019-05-28 2019-05-01
The following gets us the week start date from the date column, keeping the week start on Monday:
df = df.withColumn("week_end", next_day("date", "SUN")).withColumn("week_start_date", date_sub("week_end", 6))
I used these on Databricks. .apply took more than 2 hours on ~200 billion rows of data, while this approach took only around 5 minutes.
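To tie this back to the original question, the weekly OHLC aggregation can then be grouped on the week start. A hedged sketch, assuming the frame has date/open/high/low/close/volume columns and Spark 3.0+ for min_by/max_by; date_trunc('week', ...) truncates to the week's Monday and sidesteps the next_day edge case where a date that is already a Sunday rolls into the following week:
from pyspark.sql import functions as F

weekly = (df
    .withColumn('week_date', F.date_trunc('week', F.col('date')).cast('date'))
    .groupBy('week_date')
    .agg(F.expr('min_by(open, date)').alias('open'),    # open of the earliest day
         F.max('high').alias('high'),
         F.min('low').alias('low'),
         F.expr('max_by(close, date)').alias('close'),  # close of the latest day
         F.sum('volume').alias('volume'))
    .orderBy('week_date'))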

Related

How to get calendar years as column names and month and day as index for one timeseries

I have looked for solutions but found none that point me in the right direction; hopefully someone on here can help. I have a stock price data set with a frequency of Month Start. I am trying to get an output where the calendar years are the column names and the day and month are the index (there will only be 12 rows since it is monthly data). The rows will be filled with the stock prices corresponding to the year and month. I unfortunately have no code, since I have looked at for loops, groupby, etc. but can't seem to figure this one out.
You might want to split the date into month and year and apply a pivot:
s = pd.to_datetime(df.index)
out = (df
       .assign(year=s.year, month=s.month)
       .pivot_table(index='month', columns='year', values='Close', fill_value=0)
      )
output:
year 2003 2004
month
1 0 2
2 0 3
3 0 4
12 1 0
Used input:
df = pd.DataFrame({'Close': [1, 2, 3, 4]},
                  index=['2003-12-01', '2004-01-01', '2004-02-01', '2004-03-01'])
You need multiple steps to do that.
First split your column into the right format.
Then convert this column into two separate columns.
Then pivot the table accordingly.
import pandas as pd

# Test dataframe
df = pd.DataFrame({'Date': ['2003-12-01', '2004-01-01', '2004-02-01', '2004-12-01'],
                   'Close': [6.661, 7.053, 6.625, 8.999]})
# Split the date string into a list of the form [year, month-day]
df = df.assign(Date=df.Date.str.split(pat='-', n=1))
# Separate the date-list column into two columns
df = pd.DataFrame(df.Date.to_list(), columns=['Year', 'Date'], index=df.index).join(df.Close)
# Pivot the table
df = df.pivot(columns='Year', index='Date')
df
Output:
Close
Year 2003 2004
Date
01-01 NaN 7.053
02-01 NaN 6.625
12-01 6.661 8.999

Resample daily time series to business day

I have the following daily time series that I want to resample (aggregate sum) for business days only (Mon - Fri),
but this code also aggregates the weekends (Sat & Sun):
df_resampled = df.resample('5B').sum()
You can exclude weekends with boolean indexing on DatetimeIndex.dayofweek:
df_resampled = df[~df.index.dayofweek.isin([5, 6])].resample('5B').sum()
Or, equivalently:
df_resampled = df[df.index.dayofweek < 5].resample('5B').sum()
You can pivot the table on day of week and remove weekends. Check this out.
Step 0: generate a random example (you already have data, so you can skip this step)
import pandas as pd
import numpy as np
def random_dates(start, end, n, freq, seed=None):
    if seed is not None:
        np.random.seed(seed)
    dr = pd.date_range(start, end, freq=freq)
    return pd.to_datetime(np.sort(np.random.choice(dr, n, replace=False)))
dates = random_dates('2015-01-01', '2018-01-01', 10, 'H', seed=[3, 1415])
df = pd.DataFrame()
df.index = pd.DatetimeIndex(dates.date)
df['Sales'] = np.random.randint(1, 5, size=len(df))
Step 1: Get days of week
df['Day of week'] = df.index.to_series().dt.dayofweek
# 0 is Monday - 6 is Sunday
Step 2: Get the result you asked for
# remove days 5 and 6 (Sat and Sun) and pivot on day of week
result = df[df['Day of week'] < 5].pivot_table(index='Day of week', values='Sales', aggfunc='sum')
print(result)
Example output:
Sales
Day of week
0 11
3 1
4 14
Again, remember: 0 is Monday and 6 is Sunday. You can map these numbers to day names for a more readable output.

Setting Time with interval of 1 minute

I have a dataset comprising minutely data for 2 stocks over 3 months. I have to create the date in the first column and the time (at 1-minute intervals) in the next column for 3 months. I am attaching a snap of one such data set. Kindly help me solve this problem.
Data Format
- Create a 3-month date range with minute frequency:
date_rng = pd.date_range(start='1/1/2021', end='3/31/2021', freq='min')
- Isolate the dates:
date = date_rng.date
- Isolate the times:
time = date_rng.time
- Create a pandas dataframe with 2 columns (date and time):
pd.DataFrame({'date': date, 'time': time})
- Then simply concat the new dataframe with your existing dataframe on the column axis, as sketched below.
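A one-line sketch of that concat step (your_existing_df is a hypothetical stand-in for the frame in the screenshot; both frames need matching row order):
combined = pd.concat([pd.DataFrame({'date': date, 'time': time}),
                      your_existing_df.reset_index(drop=True)], axis=1)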
***** Remove Saturday and Sunday *****
You could remove weekends by creating a column with the day names and then taking a slice of the dataframe excluding Saturday and Sunday:
date_rng = pd.date_range(start='1/1/2021', end='3/31/2021', freq='min')
date = date_rng.date
time = date_rng.time
day = date_rng.day_name()
df = pd.DataFrame({'date': date, 'time': time, 'day': day})
Remove Sat and Sun with this code:
sat = df.day != 'Saturday'
sun = df.day != 'Sunday'
df = df[sat & sun]
As for holidays, you could use the same method, but you would need a list of the holidays applicable in your region; see the sketch below.
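A hedged sketch of that holiday filter (the dates in holiday_list are made-up placeholders, not a real holiday calendar):
holiday_list = pd.to_datetime(['2021-01-26', '2021-03-29']).date  # hypothetical holidays
df = df[~df.date.isin(holiday_list)]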
****** Trading times ******
from datetime import datetime

marketOpen = datetime.strptime('9:15:00', "%H:%M:%S").time()
marketClose = datetime.strptime('15:59:00', "%H:%M:%S").time()
df = df[(df.time >= marketOpen) & (df.time <= marketClose)]
******* Exclude specific day ******
holiday = datetime.strptime("03/30/2021", "%m/%d/%Y").date()
df = df[df.date != holiday]
Lastly, don't forget to reset your dataframe's index; after all the slicing above, a one-liner like the following does it.
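df = df.reset_index(drop=True)  # discard the stale row labels left over from the slices above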

Is there some Python function like .to_period that could help me extract a fiscal year's week number based on a date?

Essentially, I want to apply a lambda function of some sort to a column in my dataframe that contains dates. Originally, I used dt.week to extract the week number, but the calendar weeks don't match up with the fiscal year I'm using (Apr 2019 - Mar 2020).
I have tried pandas' to_period('Q-MAR') function, but that seems to be a little off. I have been researching other ways, but nothing seems to work properly.
Apr 1 2019 -> Week 1
Apr 3 2019 -> Week 1
Apr 30 2019 -> Week 5
May 1 2019 -> Week 5
May 15 2019 -> Week 6
Thank you for any advice or tips in advance!
You can create a DataFrame which contains the dates with a frequency of weeks (note the unambiguous ISO date strings; '01/04/2019' would be parsed as January 4th):
date_rng = pd.date_range(start='2019-04-01', end='2020-03-31', freq='W')
df = pd.DataFrame(date_rng, columns=['date'])
You can then query df for the indices where the date is smaller than or equal to the value:
df.index[df.date <= query_date][-1]
This outputs the largest index whose date is smaller than or equal to the date you want to examine. I imagine you can pour this into a lambda yourself? (A sketch follows the note below.)
NOTE
This solution has limitations, the biggest one being you have to manually define the datetime dataframe.
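Here is a hedged sketch of that lambda, using searchsorted against Monday-based week boundaries (the start date 2019-04-01 for fiscal week 1 is taken from the question's examples; everything else is illustrative):
import pandas as pd

# Monday-based week boundaries of the fiscal year; week 1 starts 2019-04-01
week_starts = pd.date_range(start='2019-04-01', end='2020-03-31', freq='W-MON')

df = pd.DataFrame({'date': pd.to_datetime(['2019-04-01', '2019-04-03',
                                           '2019-04-30', '2019-05-01'])})
# the number of week starts <= date gives a 1-based fiscal week number
df['fiscal_week'] = df['date'].apply(lambda d: week_starts.searchsorted(d, side='right'))
print(df)  # fiscal weeks 1, 1, 5, 5, matching the question's examples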
I created a fiscal calendar that can later be adapted into a function in Spark.
from pyspark.sql.functions import col, explode, expr, weekofyear, year
import numpy as np
import pandas as pd

beginDate = '2016-01-01'
endDate = '2021-12-31'
# create an empty dataframe
df = spark.createDataFrame([()])
# create dates for the given date range
df1 = df.withColumn("date", explode(expr(f"sequence(to_date('{beginDate}'), to_date('{endDate}'), interval 1 day)")))
# get week and year
df1 = df1.withColumn('week', weekofyear(col("date"))).withColumn('year', year(col("date")))
# translate to use pandas in python
df1 = df1.toPandas()
# get fiscal year
df1['financial_year'] = df1['date'].map(lambda x: x.year if x.month > 3 else x.year - 1)
df1['date'] = pd.to_datetime(df1['date'])
# get calendar quarter
df1['quarter_old'] = df1['date'].dt.quarter
# get fiscal quarter
df1['quarter'] = np.where(df1['financial_year'] < df1['year'], df1['quarter_old'] + 3, df1['quarter_old'])
df1['quarter'] = np.where(df1['financial_year'] == df1['year'], df1['quarter_old'] - 1, df1['quarter'])
# get fiscal week by shifting as per the number of days the fiscal year is offset from the calendar year
df1["fiscal_week"] = df1.week.shift(91)
df1 = df1.loc[(df1['date'] >= '2020-01-01')]
df1.display()

Merging two dataframe by date

I have two dataframes in pandas: one (df_1) is the average temperature by day of the year up to some point in time (for example, the average temperature for all the days of 2014 until 03/01/2014), and the other (df_2) is the average temperature by day over the last 30 years.
What I want to do is complete the first dataframe with the mean values by day from the second. I can't use the day of the year directly because of leap years, but I'm not sure this is the right way. I have found a way to get the average temperature by day (Get the average year (mean of days over multiple years) in Pandas), giving df_3. My end goal is to complete df_1 for the missing days (04/01/2014, ..., 31/12/2014).
df_1 = pd.DataFrame({
    'Date': ['01/01/2014', '02/01/2014', '03/01/2014'], 'T_Avg_2014': [5, 6, 0.7]})
df_2 = pd.DataFrame({
    'Date': ['01/01/2009', '02/01/2010', '01/01/2011'], 'T_Avg': [5, -8, -7]})
index = pd.MultiIndex.from_tuples([('1', '1'),
                                   ('1', '2'),
                                   ('1', '3'),
                                   ('2', '1')],
                                  names=['month', 'day'])
columns = [('T_Avg')]
df_3 = pd.DataFrame([3, 4, 8, 10],
                    index=index,
                    columns=columns)
Here is a method to accomplish this:
from datetime import datetime
import numpy as np
import pandas as pd
# Create date ranges
date1 = pd.date_range(datetime(2014,1,1), datetime(2014,3,1)) # 2014
date2 = pd.date_range(datetime(1983,1,1), datetime(2013,12,31)) # 30 years
# Create data frames
df1 = pd.DataFrame({'temperature': np.random.rand(len(date1))*100}, index = date1)
df2 = pd.DataFrame({'temperature': np.random.rand(len(date2))*100}, index = date2)
# Compute average daily temperature from 30 year data
df3 = df2.groupby([df2.index.month, df2.index.day]).mean()
df3 = df3.reset_index().rename(columns={'level_0': 'month', 'level_1': 'day'})
# Get data to use to complete df1
idx = df3.index[(df3.month == 3) & (df3.day == 1)][0] + 1 # All past March 1st
data_fill = df3.loc[idx:, ['month', 'day', 'temperature']]
data_fill['date_time'] = pd.to_datetime(data_fill.month.map(str)+'-'+data_fill.day.map(str)+'-2014')
data_fill = data_fill.set_index('date_time')
data_fill = data_fill.drop(columns=['month', 'day'])
# Combine data frames
df4 = pd.concat([df1, data_fill])
# Visualize data
df4.plot()
Notice how the data after March 1st is smoother, because it is a 30-year average of the randomly generated data, while the data for the first two months has not been averaged.
