Iterating a groupby datetime over several weeks - python

I'm trying to group my data by a week that I predefined using to_datetime and timedelta. After copying my script a few times, though, I'm hoping there is a way to iterate this process over multiple weeks. Is this something that can be done?
The data set I'm working with lists sales revenue and spend by day for each data source and its corresponding id.
Below is what I have so far, but my knowledge of loops is pretty limited since I'm self-taught.
Let me know if what I'm asking is feasible or if I have to continue to copy my code every week.
Code
import pandas as pd
from datetime import timedelta

startdate = '2021-09-26'
enddate = pd.to_datetime(startdate) + timedelta(days=6)

# keep only the rows that fall inside the 7-day window
last7 = (df.date >= startdate) & (df.date <= enddate)
df = df.loc[last7, ['datasource', 'id', 'revenue', 'spend']]

# the groupby result has to be assigned back, and the column here is
# 'datasource', not 'datasource_name'
df = df.groupby(by=['datasource', 'id'], as_index=False).sum()
df['start_date'] = startdate
df['end_date'] = enddate
df

If I have understood your issue correctly, you are basically trying to aggregate daily data into weekly data. You can try the following code:
import datetime as dt
import pandas as pd
# get the week-end date for each date
df['week_end_date'] = df['date'].apply(lambda x: pd.Period(x, freq='W').end_time.date().strftime('%Y-%m-%d'))
# aggregate revenue and spend at the weekly level
df_agg = df.groupby(['datasource_name', 'id', 'week_end_date']).agg({'revenue': 'sum', 'spend': 'sum'}).reset_index()
df_agg will have all your revenue and spend numbers aggregated by the week-end date corresponding to each day.
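If you would rather not build the helper column yourself, a pd.Grouper with a weekly frequency does the same bucketing in one step. This is a minimal sketch, assuming df['date'] is already datetime64; 'W-SAT' makes each week run Sunday through Saturday, matching a start date of Sunday 2021-09-26.
import pandas as pd

# let pandas build the weekly buckets itself; the Grouper slots in
# alongside the ordinary grouping columns
df_weekly = (df.groupby(['datasource_name', 'id',
                         pd.Grouper(key='date', freq='W-SAT')])
               .agg({'revenue': 'sum', 'spend': 'sum'})
               .reset_index()
               .rename(columns={'date': 'week_end_date'}))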

Related

Python PySpark subtract 1 year from given end date to work with one year of data range

What I want to do is get 1 year of data: take the latest date in the date column as my end date, then use the end date minus 1 year as the start date. After that, I can filter the data between those start and end dates.
I did manage to get the end date, but I can't find how to get the start date.
Below is the code that I have used so far; the "-1 year" part is what needs to be solved.
Tips on how to filter in PySpark are also welcome.
import pyspark.sql.functions as F

# convert string to date type
df = df.withColumn('risk_date', F.to_date(F.col('chosen_risk_prof_date'), 'dd.MM.yyyy'))
# filter only 1 year of data from the big data set:
# calculate the start date and end date; latest_date = end date
latest_date = df.select(F.max("risk_date")).show()  # note: .show() only prints and returns None
start_date = latest_date - *1 year*  # <-- this is the part that needs to be solved
new_df = df.filter((df.date > start_date) & (df.date < end_date))
Then, after this, get all the data between the start date and the end date.
You can use relativedelta as below:
from datetime import datetime
from dateutil.relativedelta import relativedelta
print(datetime.now() - relativedelta(years=1))
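To wire that into the PySpark filter, one approach is to collect the latest date to the driver, subtract the relativedelta there, and filter with the result. This is a hedged sketch assuming the 'risk_date' column from the question; F.lit on Python date objects is used for the comparison.
import pyspark.sql.functions as F
from dateutil.relativedelta import relativedelta

# collect the latest date as a Python datetime.date (not .show(), which returns None)
end_date = df.agg(F.max("risk_date")).collect()[0][0]
start_date = end_date - relativedelta(years=1)

# keep only the last year of data
one_year_df = df.filter((F.col("risk_date") > F.lit(start_date)) &
                        (F.col("risk_date") <= F.lit(end_date)))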

How to identify actual day of week that markets end on?

As of Dec 30, 2021 ----
I did figure this out. I'm new to Python, so this is not optimized or the most elegant, but it does return just the day that ends each market week. Because of how I specify the start and end dates, the dataframe always starts with a Monday and ends with the last market day. Basically, it looks at each date in consecutive rows and assigns the difference in days to a new column. Each row returns a -1, except for the last day of the market week. The very last row of all the data also returns a NaN, which I had to deal with. I then delete just the rows with -1 in the Days column. Thank you for the feedback; here is the rest of the code that does the work, which follows the code I previously supplied.
import numpy as np  # needed for np.nan below

data['Date'] = pd.to_datetime(data['Date'])
# day-of-month for each row
data['Days_from_date'] = pd.DatetimeIndex(data['Date']).day
# difference to the next row: -1 for consecutive market days,
# anything else marks the last market day of the week
data['Days'] = data['Days_from_date'] - data['Days_from_date'].shift(-1)
data = data.replace(np.nan, -1)
data["Days"] = data["Days"].astype(int)
data = data[data['Days'] != -1]
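One caveat with the day-of-month approach: across a month boundary (the 31st followed by the 1st, say) the difference is not -1, so a mid-week day can be kept by mistake. A hedged alternative sketch is to diff the actual dates instead, keeping the same keep/drop logic:
# diff the real dates: consecutive market days give -1 regardless of month boundaries
data['Days'] = (data['Date'] - data['Date'].shift(-1)).dt.days
# the last row's NaN is filled with -1 so it is dropped, mirroring the original
data = data[data['Days'].fillna(-1).ne(-1)]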
This is the previous post:
I currently have python code that gets historical market info for various ETF tickers over a set period of time (currently 50 days). I run this code through Power BI. When I get done testing, I will be getting approximately 40 weeks of data for 60-ish ETFs. Current code is copied below.
I would like to minimize the amount of data returned to just the CLOSE data generated on the last market day of each week. Usually this is Friday, but sometimes it can be Thursday, and I think possibly Wednesday.
I am coming up short on how to identify each week's last market day and then pulling in just that data into a dataframe. Alternatively, I suppose it could pull in all data, and then drop the unwanted rows - I'm not sure which would be a better solution, and, in any case, I can't figure out how to do it!
Current code is here, using Python 3.10 and Visual Studio Code for testing:
import yfinance as yf
import pandas as pd
from datetime import date
from datetime import timedelta

enddate = date.today()
startdate = enddate - timedelta(days=50)
tickerStrings = ['VUG', 'VV', 'MGC', 'MGK', 'VOO', 'VXF', 'VBK', 'VB']
df_list = list()
for ticker in tickerStrings:
    data = yf.download(ticker, start=startdate, group_by="Ticker")
    data['Ticker'] = ticker
    df_list.append(data)
data = pd.concat(df_list)
data = data.drop(columns=["Adj Close", "High", "Low", "Open", "Volume"])
data = data.reset_index()
As I commented, I think you can get the desired data by deriving the week number from the date, grouping by it, and taking the last row of each group. For example, if Friday is a holiday, Thursday becomes the last data point for that week number.
import yfinance as yf
import pandas as pd
from datetime import date
from datetime import timedelta

enddate = date.today()
startdate = enddate - timedelta(days=50)
tickerStrings = ['VUG', 'VV', 'MGC', 'MGK', 'VOO', 'VXF', 'VBK', 'VB']
df_list = []
for ticker in tickerStrings:
    data = yf.download(ticker, start=startdate, progress=False)['Close'].to_frame('Close')
    data['Ticker'] = ticker
    df_list.append(data)
# DataFrame.append is deprecated; concatenate the per-ticker frames instead
df = pd.concat(df_list)
df.reset_index(inplace=True)

# ISO week number, then keep the last row of each (ticker, week) group
df['week_no'] = df['Date'].dt.isocalendar().week
data = df.groupby(['Ticker', 'week_no']).tail(1).sort_values('Date', ascending=True)
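An alternative, hedged sketch: resample each ticker to a weekly frequency anchored on Friday and take the last available close, which also lands on Thursday when Friday is a holiday. It assumes the Date/Close/Ticker frame built above.
# 'W-FRI' labels each week by its Friday; .last() returns the final
# close actually present in that week (Thursday if Friday is a holiday)
weekly = (df.set_index('Date')
            .groupby('Ticker')['Close']
            .resample('W-FRI')
            .last()
            .reset_index())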

How would I do date time math on a DF column using today's date?

Essentially I want to create a new column that has the number of days remaining until maturity from today. The code below doesn't work; I'm kind of stuck on what to do next, as nearly all examples showcase doing math on two DataFrame columns.
today = date.today()
today = today.strftime("%m/%d/%y")
df['Maturity Date'] = df['Maturity Date'].apply(pd.to_datetime)
df['Remaining Days til Maturity'] = (df['Maturity Date'] - today)
You're mixing types; it's like subtracting apples from pears. In your example, today is a string representing (to us humans) a date, in what looks like the format used in the USA. The pandas Series of interest (the column of your DataFrame) has datetime64[ns] dtype after your apply(pd.to_datetime). Incidentally, you can do that conversion more efficiently without apply, which runs non-vectorized over every element of the Series; in the example below the strings are converted to datetime64[ns] in a vectorized way.
The main idea is that whenever you do operations with multiple objects, they should be of the same type. Sometimes frameworks will automatically convert types for you, but don't rely on it.
import pandas as pd
df = pd.DataFrame({"date": ["2000-01-01"]})
df["date"] = pd.to_datetime(df["date"])
today = pd.Timestamp.today().floor("D") # That's one way to do it
today
# Timestamp('2021-11-02 00:00:00')
today - df["date"]
# 0 7976 days
# Name: date, dtype: timedelta64[ns]
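From there, getting the remaining days as plain integers is one more step; a hedged follow-up using the names from the example above:
# timedelta64 Series -> integer days (negative once the date has passed)
remaining = (df["date"] - today).dt.days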
Parse the Maturity Date as a datetime (using the month/day/year format), then subtract today from it as a date type and store the difference in days as Remaining Days til Maturity:
from datetime import date
today = date.today()
df=pd.DataFrame({'Maturity Date':'11/04/2021'},index=[0])
df['Maturity Date'] = pd.to_datetime(df['Maturity Date'], format='%m/%d/%Y')
df['Remaining Days til Maturity'] = (df['Maturity Date'].dt.date - today).dt.days
print(df)
output:
  Maturity Date  Remaining Days til Maturity
0    2021-11-04                            2

Python Dataframe Date plus months variable which comes from the other column

I have a dataframe with the date and month_diff variable. I would like to get a new date (name it as Target_Date) based on the following logic:
For example, the date is 2/13/2019, month_diff is 3, then the target date should be the month-end of the original date plus 3 months, which is 5/31/2019
I tried the following method to get the target date first:
df["Target_Date"] = df["Date"] + pd.DateOffset(months = df["month_diff"])
But it failed: as far as I know, the parameter of DateOffset should be a variable or a fixed number, not a column.
I also tried:
df["Target_Date"] = df["Date"] + relativedelta(months = df["month_diff"])
It fails too.
Can anyone help? Thank you.
Edit:
This is a large dataset with millions of rows.
You could try this:
import pandas as pd
from dateutil.relativedelta import relativedelta

# pd.datetime is deprecated; use pd.Timestamp (or datetime.datetime) instead
df = pd.DataFrame({'Date': [pd.Timestamp(2019, 1, 1), pd.Timestamp(2019, 2, 1)], 'month_diff': [1, 2]})
df.apply(lambda row: row.Date + relativedelta(months=row.month_diff), axis=1)
Or a list comprehension:
[date + relativedelta(months=month_diff) for date, month_diff in df[['Date', 'month_diff']].values]
I would compute your Target_Date in the following way:
1. Apply the target month offset (in your case +3 months), using your pd.DateOffset.
2. Get the last day of that target month (using, for example, calendar.monthrange; see also "Get last day of the month"). This provides the "flexible" part of the date offset.
3. Apply the flexible day offset by comparing the results of step 1 and step 2. This could be a new pd.DateOffset.
A solution could look something like this:
import calendar
from dateutil.relativedelta import relativedelta

for ii in df.index:
    # step 1: shift by the row's month_diff
    new_ = df.at[ii, 'Date'] + relativedelta(months=df.at[ii, 'month_diff'])
    # step 2: number of days in the target month
    max_date = calendar.monthrange(new_.year, new_.month)[1]
    # step 3: pad out to the last day of that month
    end_ = new_ + relativedelta(days=max_date - new_.day)
    print(end_)
Further "cleaning" into a function and / or list comprehension will probably make it much faster
This is my approach to solving your issue. Note that it approximates a month as 30 days, so the result will not land exactly on the month end.
import pandas as pd
from datetime import datetime
from datetime import timedelta

today = datetime.now()
today = today.strftime("%d/%m/%Y")
month_diff = [30, 5, 7]
n = 30

rows = []
for i in month_diff:
    # one row per month_diff value; my first version rebuilt a single dict
    # on every pass and stored the whole list, which is why i never seemed
    # to get updated in the output
    rows.append({'Date': today, 'month_diff': i,
                 'Target_Date': datetime.now() + timedelta(days=i * n)})
df = pd.DataFrame(rows)
I was looking for a solution I could write in one line only, and apply does the job. However, by default the apply function acts on each column, so you have to remember to specify the correct axis: axis=1.
from datetime import datetime
from dateutil.relativedelta import relativedelta
# Create a new column with date adjusted by number of months from 'month_diff' column and later adjust to the last day of month
df['Target_Date'] = df.apply(lambda row: row.Date # to current date
+ relativedelta(months=row.month_diff) # add month_diff
+ relativedelta(day=+31) # and adjust to the last day of month
, axis=1) # 1 or ‘columns’: apply function to each row.

Group DataFrame by Business Day of Month

I am trying to group a Pandas DataFrame that is indexed by date by the business day of the month, approximately 22 per month.
I would like to return a result that contains 22 rows with the mean of some value in the DataFrame.
I can group by day of the month but can't seem to figure out how to group by business day.
Is there a function that will return the business day of the month of a date?
If someone could provide a simple example, that would be most appreciated.
Assuming your dates are in the index (if not, use set_index):
df.groupby(pd.Grouper(freq='B'))
See time series functionality.
I think what the question is asking is to group by business day of month; the other answer just seems to resample the data to the nearest business day (at least for me).
This code groups by business day of month and returns 22 rows:
from datetime import date
import pandas as pd
import numpy as np

d = pd.Series(np.random.randn(1000), index=pd.bdate_range(start='01 Jan 2018', periods=1000))

def to_bday_of_month(dt):
    # number of business days from the month start to dt (0-based)
    month_start = date(dt.year, dt.month, 1)
    return np.busday_count(month_start, dt)

day_of_month = [to_bday_of_month(dt) for dt in d.index.date]
d.groupby(day_of_month).mean()
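If market holidays matter, np.busday_count also accepts a holidays list, so those dates can be excluded from the count as well; a hedged sketch with purely illustrative dates:
import numpy as np

# hypothetical market holidays, for illustration only
holidays = ['2018-01-15', '2018-02-19']

# business days between the two dates, skipping the listed holidays
np.busday_count('2018-01-01', '2018-01-31', holidays=holidays)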
