I am trying to fetch the same weekday's data from previous weeks and then take the average of the value ("current_demand") as today's forecast (prediction).
For example:
Today is Monday, so I want to fetch the last two Mondays' data for the same time block, and then take the average of the value ["current_demand"] to predict today's value.
Input Data:
current_demand Date Blockno weekday
18839 01-06-2018 1 4
18836 01-06-2018 2 4
12256 02-06-2018 1 5
12266 02-06-2018 2 5
17957 08-06-2018 1 4
17986 08-06-2018 2 4
18491 09-06-2018 1 5
18272 09-06-2018 2 5
Expected result:
18398 15-06-2018 1 4
something like that. I want to take the value for the same block and same weekday from the previous two weeks, then average them to get the next value.
Here is what I have tried:
from datetime import datetime, timedelta

def forecast(DATA):
    df = DATA
    day = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
    today = datetime.today()
    df['friday'] = today - timedelta(days=today.weekday() + 3)
    print(df)

forecast(DATA)
Please suggest something. Thank you in advance.
I like relativedelta for this kind of job:

import datetime
from dateutil.relativedelta import relativedelta

(datetime.datetime.today() + relativedelta(weeks=-2)).date()
Output:
datetime.date(2018, 7, 23)
Without the actual structure of your df, it's hard to provide a solution tailored to your needs.
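For the averaging itself, here is a minimal sketch, assuming a frame shaped like the sample above (the forecast helper and its signature are made up for illustration): select the rows dated exactly one and two weeks before the target date, which necessarily share its weekday, and average current_demand per block.

import pandas as pd

df = pd.DataFrame({
    'current_demand': [18839, 18836, 12256, 12266, 17957, 17986, 18491, 18272],
    'Date': ['01-06-2018', '01-06-2018', '02-06-2018', '02-06-2018',
             '08-06-2018', '08-06-2018', '09-06-2018', '09-06-2018'],
    'Blockno': [1, 2, 1, 2, 1, 2, 1, 2],
})
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')

def forecast(df, target_date):
    # Dates exactly one and two weeks earlier fall on the same weekday.
    target_date = pd.Timestamp(target_date)
    prev = df[df['Date'].isin([target_date - pd.Timedelta(weeks=1),
                               target_date - pd.Timedelta(weeks=2)])]
    return prev.groupby('Blockno')['current_demand'].mean()

print(forecast(df, '2018-06-15'))  # Blockno 1 -> 18398.0, matching the expected row above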
I have a column "date" with values in MM/DD/YYYY format. How can I create a new column called "weeks" with the week number, where the week starts on Sunday rather than Monday?
I would need more data, but I believe you could do something like this:
import datetime

df['Sale_Date'] = pd.to_datetime(df['Sale_Date'], infer_datetime_format=True)
# Shift dates forward one day so a Sunday takes the following Monday's ISO week.
df['Week'] = df['Sale_Date'] + datetime.timedelta(days=1)
df['Week'] = df['Week'].dt.isocalendar().week
This will push all the dates forward one day and then get the week, so when it reads a Sunday's date it will treat it as a Monday, giving you the expected week. However, I received a different week number than you, so I'm not sure whether you are using a different week function than I am.
df['Sale_Date'] = pd.to_datetime(df.Sale_Date)
# %U numbers the weeks of the year with Sunday as the first day of the week.
df['Week'] = df.Sale_Date.dt.strftime('%U')
df
ID sale_amt Sale_Date Week
0 1 100 2022-06-10 23
1 2 200 2022-06-05 23
2 3 250 2022-06-04 22
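A quick sanity check of the %U convention (any days before the year's first Sunday fall in week 00), reproducing the two June rows above:

import pandas as pd

pd.Timestamp('2022-06-05').strftime('%U')  # '23' -- a Sunday opens week 23
pd.Timestamp('2022-06-04').strftime('%U')  # '22' -- the Saturday before is still week 22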
I need to import a .xlsx sheet into pandas which has a column for the processing time of an associated activity. All entries in this column look somewhat like this:
01:20:34
12:22:30
25:01:02
155:20:56
These say how many hours, minutes, and seconds were needed. When I use pd.read_excel, pandas correctly interprets each of the timestamps with less than 24 hours and reads them as above in the first two cases. The timestamps with more than 24h (the last two), on the other hand, are converted into a datetime object, which in turn looks like this: 1900-01-02T14:58:03 instead of 62:58:03.
Is there a simple solution?
I think part of the problem is not in Python/pandas but in Excel. The date '1900-01-01' is the base date used by Excel, represented by the number '1'. You can check that: if you write '0' in a cell and then format that cell as a date, you get '1900-01-00', and with '1' you get '1900-01-01'.
So, try exporting your Excel file to a CSV file before importing to pandas, and then import this way:
import pandas as pd
df1 = pd.read_csv('sample_data.csv')
In this case, you get a DataFrame with the column duration as a string (I added a column id for reference).
duration id
0 01:20:34 1
1 12:22:30 2
2 25:01:02 3
3 155:20:56 4
Then, for your purpose, I suggest you do not try to convert those values to the datetime type, but to a timedelta. One strategy is to split the strings by colons and then build a timedelta instance from those three fields: hours, minutes, and seconds.
import datetime as dt

def converter1(x):
    # Split 'HH:MM:SS' (hours may exceed 24) into its three integer fields.
    vals = x.split(':')
    vals = [int(val) for val in vals]
    out = dt.timedelta(hours=vals[0], minutes=vals[1], seconds=vals[2])
    return out

df1['deltat'] = df1['duration'].apply(converter1)
duration id deltat
0 01:20:34 1 0 days 01:20:34
1 12:22:30 2 0 days 12:22:30
2 25:01:02 3 1 days 01:01:02
3 155:20:56 4 6 days 11:20:56
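As a shorter alternative (worth verifying on your pandas version, though to_timedelta is documented to parse 'HH:MM:SS' strings, including hour counts above 24), the converter can be replaced with a single call:

import pandas as pd

# Parses strings like '155:20:56' directly into Timedelta values.
df1['deltat'] = pd.to_timedelta(df1['duration'])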
If you need to convert those values to decimal hours or other new fields, use the total_seconds() method of timedelta:
df1['deltat_hr'] = df1['deltat'].apply(lambda x: x.total_seconds()/3600)
duration id deltat deltat_hr
0 01:20:34 1 0 days 01:20:34 1.342778
1 12:22:30 2 0 days 12:22:30 12.375000
2 25:01:02 3 1 days 01:01:02 25.017222
3 155:20:56 4 6 days 11:20:56 155.348889
I am trying to work out how to get year-to-date (YTD) versus last-year-to-date (LYTD) values from a dataframe.
Dataframe:
ID start_date distance
1 2019-7-25 2
2 2019-7-26 2
3 2020-3-4 1
4 2020-3-4 1
5 2020-3-5 3
6 2020-3-6 3
There is data back to 2017, and more data will keep getting added, so I would like the YTD and LYTD to be dynamic based upon the current year.
I know how to get the cumulative sum for each year and month, but I am really struggling with how to calculate the YTD and LYTD.
year_month_distance_df = distance_kpi_df.groupby(["Start_Year","Start_Month"]).agg({"distance":"sum"}).reset_index()
The other code I tried:
cum_sum_distance_ytd = distance_kpi_df[["start_date_local", "distance"]]
cum_sum_distance_ytd = cum_sum_distance_ytd.set_index("start_date_local")
cum_sum_distance_ytd = cum_sum_distance_ytd.groupby(pd.Grouper(freq="D")).sum()
When I try this logic and add Start_Day into the groupby, it obviously just sums all the data for that day.
Expected output:
Year to Date = 8
Last Year to Date = 4
You could split the date into its components and get the YTD for all years with:

df['start_date'] = pd.to_datetime(df.start_date)
expanding = df.groupby([
    df.start_date.dt.month, df.start_date.dt.day, df.start_date.dt.year
]).distance.sum().unstack().cumsum()
Unstacking will fill with np.nan wherever a year does not have a value for the row's date... if that is a problem, you can use the fill_value parameter:
.unstack(fill_value=0).cumsum()
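To then read off the two headline numbers, one could look up today's month/day row. A sketch, assuming the expanding frame built above with fill_value=0 so that every calendar date seen in any year has a row:

from datetime import date

today = date.today()
# Assumes today's (month, day) appears somewhere in the data; .loc raises KeyError otherwise.
ytd = expanding.loc[(today.month, today.day), today.year]
lytd = expanding.loc[(today.month, today.day), today.year - 1]
print(f"Year to Date = {ytd}, Last Year to Date = {lytd}")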
I'm working on a machine learning regression problem where I predict a sales value based on input features. Date is one of the features, and I want to extract the month and week number from it. Month comes as 1 to 12, which is fine, but for weeks I get a number between 1 and 52. That is also correct, but I'm trying to get the week number within the month, in the range 1 to 5: some months have 4 weeks and some have 5.
I have tried the available methods for getting the week number, but they give a value in the range 1 to 52 only. I cannot just divide by 4, otherwise no month would ever have 5 weeks.
This code gives output in the range 1 to 52, and I have also tried several other methods:
df['Week'] = df['Date'].dt.week
It should return, for example, week number 5 if a particular date belongs to the fifth week of its month.
"Week number" refers to the week of the year in most contexts. Week of the month is not a standard notion and is thus not implemented in pandas. You can implement it yourself; see e.g. this question on Stack Overflow.
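A minimal sketch of one common convention (an assumption on my part; other definitions exist, e.g. counting calendar rows): treat days 1-7 as week 1, days 8-14 as week 2, and so on, which yields week 5 for the 29th through the 31st:

import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2019-03-01', '2019-03-14', '2019-03-31'])})
# Days 1-7 -> 1, 8-14 -> 2, ..., 29-31 -> 5.
df['WeekOfMonth'] = (df['Date'].dt.day - 1) // 7 + 1
print(df)  # WeekOfMonth: 1, 2, 5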
I am trying to group hospital staff working hours semi-monthly. I have raw data on a daily basis, which looks like below.
date hours_spent emp_id
9/11/2016 8 1
15/11/2016 8 1
22/11/2016 8 2
23/11/2016 8 1
How I want it grouped:
cycle hours_spent emp_id
1/11/2016-15/11/2016 16 1
16/11/2016-30/11/2016 8 2
16/11/2016-30/11/2016 8 1
I am trying to do this with a grouper and frequency in pandas, something like below.

data.set_index('date', inplace=True)
print(data.head())
dt = data.groupby(['emp_id', pd.Grouper(key='date', freq='MS')])['hours_spent'].sum().reset_index().sort_values('date')
#df.resample('10d').mean().interpolate(method='linear',axis=0)
print(dt.resample('SMS').sum())
I also tried resampling
df1 = dt.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
data.set_index('date',inplace=True)
df1 = data.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
But this gives data at 15-day intervals, not buckets like the 1st to the 15th and the 16th to the end of the month.
Please let me know what I am doing wrong here.
You were almost there. This will do it:
dt = df.groupby(['emp_id', pd.Grouper(key='date', freq='SM')])['hours_spent'].sum().reset_index().sort_values('date')
emp_id date hours_spent
1 2016-10-31 8
1 2016-11-15 16
2 2016-11-15 8
freq='SM' is the semi-month end frequency, which anchors periods on the 15th and the last day of every month.
Put Datetime Values into Bins
If I understood you correctly, you basically want to put the values in your date column into bins. For this, pandas includes the pd.cut() function, which does exactly what you want.
Here's an approach which might help you:
import pandas as pd

df = pd.DataFrame({
    'hours': 8,
    'emp_id': [1, 1, 2, 1],
    'date': [pd.Timestamp(2016, 11, 9),
             pd.Timestamp(2016, 11, 15),
             pd.Timestamp(2016, 11, 22),
             pd.Timestamp(2016, 11, 23)],
})
bins_dt = pd.date_range('2016-10-16', freq='SM', periods=3)
cycle = pd.cut(df.date, bins_dt)
df.groupby([cycle, 'emp_id']).sum()
Which gets you:
cycle emp_id hours
------------------------ ------ ------
(2016-10-31, 2016-11-15] 1 16
2 NaN
(2016-11-15, 2016-11-30] 1 8
2 8
Had a similar question, here was my solution:
df1['BiMonth'] = df1['Date'] + pd.DateOffset(days=-1) + pd.offsets.SemiMonthEnd()
df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')
The construction "df1['Date'] + pd.DateOffset(days=-1)" takes whatever is in the date column and subtracts one day.
Adding "pd.offsets.SemiMonthEnd()" rolls the result forward into a semi-monthly bucket; it would be off by a day for dates already on an anchor unless you reduce the reference date by 1 first.
The construction "df1['BiMonth'].dt.to_period('D')" strips out the time component so you just have days.