Get hourly averages only of Mondays of entire year - python

I have a dataset that looks as follows:
[image: the dataset]
I have a big list of orders with their traveled distance for the entire year of 2018. To predict future orders, I want to calculate the average number of orders per hour across all the Mondays of the year: the average number of orders between 00:00:00 - 01:00:00, between 01:00:00 - 02:00:00, and so on up to 23:00:00 - 24:00:00, on Mondays only. Orders on other weekdays should not be included.
What I have so far is:
df_data = pd.read_csv('Finalorders.csv', parse_dates=['datetime'])
week_dfsum = df_data.groupby(df_data['datetime'].dt.weekday_name).sum()
week_dfmean = df_data.groupby(df_data['datetime'].dt.weekday_name).mean()
pprint(week_dfsum)
pprint(week_dfmean)
But I don't know how to only include the orders on Monday.

You're close. After you produce a day-of-week column, filter it for Mondays. Note that pandas numbers weekdays from Monday = 0, so with a numeric "Day_of_Week" column that is:
df[df['Day_of_Week'] == 0]
This will return only the rows from Mondays.
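
As a fuller sketch, assuming each row of Finalorders.csv is one order (if you instead have a numeric orders column, replace .size() with ['orders'].sum()):
import pandas as pd

df_data = pd.read_csv('Finalorders.csv', parse_dates=['datetime'])

# keep Mondays only (pandas numbers weekdays from Monday = 0)
mondays = df_data[df_data['datetime'].dt.weekday == 0]

# count orders per (Monday date, hour of day), then average those counts per hour
day = mondays['datetime'].dt.date.rename('day')
hour = mondays['datetime'].dt.hour.rename('hour')
hourly_avg = mondays.groupby([day, hour]).size().groupby(level='hour').mean()
print(hourly_avg)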

Related

First row is calculated differently from all other rows, and all rows after the first are conditioned on the previous row

I have a df whose columns are tickers, whose rows are daily returns, and whose index is a datetime index:
SPY IWM TLT
2016-01-04 0.914939 0.998960 1.014094
2016-01-05 1.014062 1.002650 1.002819
2016-01-06 0.991911 0.999906 1.014441
2016-01-07 0.937087 0.995280 1.014140
2016-01-08 1.005388 0.999147 0.995572
I have initial weights for each ticker on day one
SPY 50
IWM 25
TLT 25
Each weight has a band
SPY = 40, 60
IWM = 20, 30
TLT = 20, 30
The dataframe goes on daily for 5 years. For the first day I want to calculate the original weight times the return for that day. For every day after that, I want to calculate the value for that day (the previous day's value times that day's return) and check each day whether the weight of any one of the three is outside its band. The weight for each day is the value for that ticker divided by the sum of that day's values. If one of the ticker weights violates its band for 5 days straight, I want to rebalance all three, and the next day's row should be the original weight divided by the previous day's portfolio value.
Example:
          SPY         IWM         TLT         PortValue  SPYW  IWMW   TLTW
XX Date   51.45       27.25       21.54      100.24      51.3  27.18  21.4   No rebal, next day = prev day * return
XX Date   59          29          15         103         57    28     14.5   Rebal, next day below
NEXT DAY  50/103*ret  25/103*ret  25/103*ret
I have tried everything: lambda functions, np.where, for loops, if statements, and nested variations of all of the above. I can't get past the boolean test for the first day's index and make it work with the rest of the days, where each next row is contingent on the calculation of the previous row rather than on the datetime index location.
Interesting question. Something along the lines of the following should work - obviously, with many modifications to reflect your actual data. Note that this totally disregards the datetime index since it's informational and doesn't affect the outcome:
portfolio = [50, 25, 25]       # starting amount of each investment in the portfolio
allocations = [.50, .25, .25]  # base allocations among the portfolio investments
tickers = list(df.columns.values)
bands = [[.40, .60], [.20, .30], [.20, .30]]    # permissible band for each investment/ticker
violations = {ticker: 0 for ticker in tickers}  # initialize a violations counter for each ticker

# start iterating through each day in the dataframe:
for i, row in df.iterrows():
    yields = row.to_list()  # extract the daily yields
    portfolio = [investment * yld for investment, yld in zip(portfolio, yields)]  # recalculate the new value of each investment
    weights = [investment / sum(portfolio) for investment in portfolio]  # recalculate the new relative weight of each investment
    for ticker, weight, band in zip(tickers, weights, bands):
        # check whether each ticker is outside its permitted band
        if weight < band[0] or weight > band[1]:
            violations[ticker] += 1  # if it is, increment its specific violations counter
        else:
            violations[ticker] = 0   # if not, reset the counter to zero, so only consecutive violations count
        if violations[ticker] == 5:  # 5 consecutive violations detected...
            portfolio = [sum(portfolio) * allocation for allocation in allocations]  # rebalance the portfolio
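
For reference, the loop above assumes df holds one column of daily returns per ticker; a minimal setup built from the sample data in the question:
import pandas as pd

df = pd.DataFrame(
    {"SPY": [0.914939, 1.014062, 0.991911, 0.937087, 1.005388],
     "IWM": [0.998960, 1.002650, 0.999906, 0.995280, 0.999147],
     "TLT": [1.014094, 1.002819, 1.014441, 1.014140, 0.995572]},
    index=pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06",
                          "2016-01-07", "2016-01-08"]),
)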

Subtracting specific datetime objects

I am required to calculate the 'time to maturity' by finding the difference between 'maturity' and 'trd_exctn_dt' as shown below.
If I had the following sample data:
cusip_id  trd_exctn_dt  maturity    time_to_maturity
0007TAA2  2015-01-26    2023-05-15  3031 days
0007TAA2  2015-03-26    2023-05-15  2972 days
0007TAA2  2015-05-01    2023-05-15  2936 days
0007TAA2  2015-07-27    2023-05-15  2849 days
My desired output would be:
cusip_id  trd_exctn_dt  maturity    time_to_maturity
0007TAA2  2015-05-01    2023-05-15  2936 days
For this specific cusip_id, because the maturity date is in the 5th month, I am looking for the trd_exctn_dt in the 5th month in order to calculate the time to maturity. However, I want to do this for several bond issues, where 'maturity' will not necessarily occur within the 5th month. For example, for another bond issue the maturity date may be 2023-11-06, so I would be looking for the trd_exctn_dt in the 11th month for that bond issue.
Any ideas on how I would do this would be much appreciated!
This solution assumes that you want every row where maturity month equals trd_exctn_dt month.
Code
df.columns = [c.strip() for c in df.columns] # Remove whitespace from column names
# Convert to datetime
df['trd_exctn_dt'] = pd.to_datetime(df['trd_exctn_dt'])
df['maturity'] = pd.to_datetime(df['maturity'])
df['time_to_maturity'] = df['maturity'] - df['trd_exctn_dt'] # If you need to recalculate
df[df['trd_exctn_dt'].dt.month == df['maturity'].dt.month] # Filter for same month in both columns
Output
cusip_id trd_exctn_dt maturity time_to_maturity
2 0007TAA2 2015-05-01 2023-05-15 2936 days
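
If several trades can fall inside the maturity month for the same bond and you only want one row per cusip_id, a possible refinement (keeping the earliest such trade, which is an assumption about the intended behavior):
# one row per bond: the earliest trade in the maturity month (assumed intent)
same_month = df[df['trd_exctn_dt'].dt.month == df['maturity'].dt.month]
result = same_month.sort_values('trd_exctn_dt').groupby('cusip_id').head(1)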

pandas: a rolling window of hour-of-day average

Related to: daily data, resample every 3 days, calculate over trailing 5 days efficiently, but here the averaging is over strided, non-consecutive data.
I have an hourly time series. For every hour I would like to have the average of the same hour of the day, in the last preceding 10 days window. E.g. at 2019-08-14 23:00, I would like to have an average of all 23:00 data from 2019-08-04 till 2019-08-13.
Is there an efficient way to do so in pandas/numpy? Or should I roll up my sleeves and write my own loops and data structures?
Extra points: if today is a workday (Mon-Fri), the average should be for the previous 10 workdays. If it's a weekend (Sat-Sun), for the previous 10 weekend-days (span about 2.5 months)
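
One possible vectorized approach, as a minimal sketch: group the series by hour of day, so that within each group consecutive samples are exactly one day apart, then take a shifted 10-sample rolling mean. This assumes a gap-free hourly index; the workday/weekend variant would additionally need to split the series by day type before grouping.
import numpy as np
import pandas as pd

# hypothetical hourly series, purely for illustration
idx = pd.date_range('2019-06-01', '2019-08-15', freq='h')
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# within each hour-of-day group, shift(1) excludes the current day and
# rolling(10).mean() averages that same hour over the 10 preceding days
trailing = s.groupby(s.index.hour).transform(
    lambda g: g.shift(1).rolling(10).mean()
)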

Propagate dates pandas and interpolate

We have some readily available sales data for certain periods, like 1 week, 1 month ... 1 year:
time_pillars = pd.Series(['1W', '1M', '3M', '1Y'])
sales = pd.Series([4.75, 5.00, 5.10, 5.75])
data = {'time_pillar': time_pillars, 'sales': sales}
df = pd.DataFrame(data)
I would like to do two operations.
Firstly, create a new column of date type, df['date'], that corresponds to the actual date 1 week, 1 month ... 1 year from now.
Then, I'd like to create another column df['days_from_now'], holding how many days away these pillars are (1 week would be 7 days, 1 month around 30 days, 1 year around 365 days).
The goal is then to use any day as input for a simple linear_interpolation_method() to obtain sales data for any given day (e.g., what are the sales for 4 October 2018? ---> We would interpolate between 3 months and 1 year).
Many thanks.
I'm not exactly sure what you mean regarding your interpolation, but here is a way to make your dataframe in pandas (starting from your original df you provided in your post):
from datetime import datetime
from dateutil.relativedelta import relativedelta

def create_dates(df):
    df['date'] = [i.date() for i in
                  [d + delt for d, delt in zip([datetime.now()] * 4,
                                               [relativedelta(weeks=1), relativedelta(months=1),
                                                relativedelta(months=3), relativedelta(years=1)])]]
    df['days_from_now'] = df['date'] - datetime.now().date()
    return df

create_dates(df)
sales time_pillar date days_from_now
0 4.75 1W 2018-04-11 7 days
1 5.00 1M 2018-05-04 30 days
2 5.10 3M 2018-07-04 91 days
3 5.75 1Y 2019-04-04 365 days
I wrapped it in a function, so that you can call it on any given day and get your results for 1 week, 1 month, etc. from that exact day.
Note: if you want your days_from_now to simply be an integer of the number of days, use df['days_from_now'] = [i.days for i in df['date'] - datetime.now().date()] in the function, instead of df['days_from_now'] = df['date'] - datetime.now().date()
Explanation:
df['date'] = [i.date() for i in
              [d + delt for d, delt in zip([datetime.now()] * 4,
                                           [relativedelta(weeks=1), relativedelta(months=1),
                                            relativedelta(months=3), relativedelta(years=1)])]]
Takes a list of the date today (datetime.now()) repeated 4 times, and adds a relativedelta (a time difference) of 1 week, 1 month, 3 months, and 1 year, respectively, extracts the date (i.date() for ...), finally creating a new column using the resulting list.
df['days_from_now'] = df['date'] - datetime.now().date()
is much more straightforward: it simply subtracts those new dates from today's date. The result is a timedelta object, which pandas conveniently formats as "n days".
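
For the interpolation goal itself, a minimal sketch: with the days_from_now pillars above, np.interp can estimate sales for any target date (sales_for is a hypothetical helper name, not part of the answer above):
import numpy as np
from datetime import date, datetime

def sales_for(df, target_date):
    # x-axis: pillar distances in days; y-axis: the known sales figures
    days = np.array([d.days for d in df['days_from_now']], dtype=float)
    order = days.argsort()
    target = (target_date - datetime.now().date()).days
    return np.interp(target, days[order], df['sales'].to_numpy()[order])

# e.g. with the 2018 pillar dates shown above, sales_for(df, date(2018, 10, 4))
# would interpolate between the 3M and 1Y pillars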

Converting Pandas daily timestep to 52 week per year timestep with the last week of 8 days

I have a Pandas dataframe with the index of daily timestep just as below:
oldman.head()
Value
date
1992-01-01 1080.4
1992-01-02 1080.4
1992-01-03 1080.4
1992-01-04 1080.0
1992-01-05 1079.6
...
starting from 1992-01-01 to 2016-12-31. I want to extract weekly mean values for each year. However, my weeks should be counted in a special way: there should be 52 weeks in a 365-day year, but with the last week having 8 days! The first week should start from January 1st of each year.
I am wondering how am I supposed to extract this kind of weeks from a daily timestep data.
Thanks,
I modified COLDSPEED's solution a bit and added in the last week as 8 days. It's worth noting that on leap years that last "week" is actually 9 days. The following example will only work when you include all of a year, because my function assumes the last row in the groupby is actually the last week of the year.
import pandas as pd

# make some data
df = pd.DataFrame(index=pd.date_range("1992-1-1", "1992-12-31"))
df["value"] = 1
# add a counting variable
df["count"] = 1

# sum values and counts over 7-day bins, restarting at each year boundary
df = df.groupby(pd.Grouper(freq='Y'))\
       .resample('7D')\
       .sum()\
       .reset_index(level=0, drop=True)

def chop_last_week(df):
    # fold the short leftover bin into the 52nd week, then drop it
    df1 = df.copy()
    df1.iloc[-2] += df1.iloc[-1]
    return df1.iloc[:-1]

df = df.groupby(df.index.year)\
       .apply(chop_last_week)\
       .reset_index(level=0, drop=True)

df["mean"] = df["value"] / df["count"]
df.tail(5)
It's not the cleanest solution but it runs quickly.
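
A quick way to convince yourself the leftover days were folded in correctly (a sanity check on the example year; 1992 is a leap year, so the day counts sum to 366):
print(df.groupby(df.index.year).size())          # 52 bins per year
print(df.groupby(df.index.year)["count"].sum())  # 366 days for 1992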
