Add column to dataframe based on date range - python

I want to add a column to my data frame prod_data based on a range of dates. Here is an example of the data in the ['Mount Time'] column that the new column will be derived from:
0 2022-08-17 06:07:00
1 2022-08-17 06:12:00
2 2022-08-17 06:40:00
3 2022-08-17 06:45:00
4 2022-08-17 06:47:00
The new column is named ['Week'] and I want it to run from M-S, with week 1 starting on 9/5/22, running through 9/11/22 and then week 2 the next M-S, and so on until the last week which would be 53. I would also like weeks previous to 9/5 to have negative week numbers, so 8/29/22 would be the start of week -1 and so on.
The only thing I could think of was to create 2 massive lists and use np.select to define the parameters of the column, but there has to be a cleaner way of doing this, right?

You can use pandas datetime objects to figure out how many days away a date is from your start date, 9/5/2022, and then use floor division to convert that to week numbers. I made the "mount_time" column just to emphasize that the original column should be a datetime object.
prod_data["mount_time"] = pd.to_datetime( prod_data[ "Mount Time" ] )
start_date = pd.to_datetime( "9/5/2022" )
days_away = prod_data.mount_time - start_date
prod_data["Week"] = ( days_away.dt.days // 7 ) + 1
As intended, 9/5/2022 through 9/11/2022 will have a value of 1. 8/29/2022 would start week 0 (not -1 as you wrote) unless you want 9/5/2022 to start as week 0 (in which case just delete the + 1 from the code). Some more examples:
>>> test[["date", "Week"]]
date Week
0 2022-08-05 -4
1 2022-08-14 -3
2 2022-08-28 -1
3 2022-08-29 0
4 2022-08-30 0
5 2022-09-05 1
6 2022-09-11 1
7 2022-09-12 2
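If you do want 8/29/2022 to start week -1 (i.e. no week 0 at all, as described in the question), one sketch of an adjustment (my addition, not part of the original answer) is to add 1 only to the non-negative week numbers:

import numpy as np

raw_week = days_away.dt.days // 7  # 9/5/2022's week is 0, the week before is -1, ...
prod_data["Week"] = np.where(raw_week >= 0, raw_week + 1, raw_week)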

Related

How to select pandas dataframe rows between two dates without knowing the exact time

Here is a snippet of the dataframe:
TIME Value1 Value2
0 2014-10-02 12:45:03 5 6
1 2014-10-02 12:45:05 6 7
2 2014-10-02 12:45:08 3 5
3 2014-10-02 12:45:09 7 4
..................... ... ...
45 2014-11-03 00:51:09 7 8
Now, I would like to get all the dataframe rows between 2014-10-02 and 2014-11-02 without knowing the exact time to the second. I tried this method as shown below, but I am getting 0 rows.
start_date = pd.to_datetime('2014-10-02 00:00:01')
end_date = pd.to_datetime('2014-11-02 00:00:01')
df.loc[(df['TIME'] > start_date) & (df['TIME'] < end_date)]
But when I put in an exact datetime value like '2014-10-02 12:45:03' for the start and end dates, I do get output. The data has millions of rows, and I can't possibly find the exact time to the second for the start and end dates; I just need to get the rows between the two dates.
You can try boolean masking on the date part of the timestamps:
df.loc[(df['TIME'].dt.date > start_date.date()) & (df['TIME'].dt.date < end_date.date())]
Or you can use boolean masking with the between() method (note that between() is inclusive of both endpoints, while the mask above uses strict inequalities):
df[df['TIME'].dt.date.between(start_date.date(), end_date.date())]
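Assuming df['TIME'] is already a datetime64 column, you can also skip the .date() calls entirely: comparisons accept plain date strings, which pandas parses as midnight timestamps. A sketch (note the exclusive end bound, chosen so that all of 2014-11-02 is included):

mask = (df['TIME'] >= '2014-10-02') & (df['TIME'] < '2014-11-03')
df.loc[mask]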

Problem with tuple indices in loop in Python Pandas?

I am trying to calculate the number of days until the next holiday and since the last holiday. My method of calculating it is as below:
holidays = pd.Series(pd.to_datetime([
    "01.01.2013", "06.01.2013", "14.02.2013", "29.03.2013",
    "31.03.2013", "01.04.2013", "01.05.2013", "03.05.2013",
    "19.05.2013", "26.05.2013", "30.05.2013", "23.06.2013",
    "15.07.2013", "27.10.2013", "01.11.2013", "11.11.2013",
    "24.12.2013", "25.12.2013", "26.12.2013", "31.12.2013",
    "01.01.2014", "06.01.2014", "14.02.2014", "30.03.2014",
    "18.04.2014", "20.04.2014", "21.04.2014", "01.05.2014",
    "03.05.2014", "03.05.2014", "26.05.2014", "08.06.2014",
    "19.06.2014", "23.06.2014", "15.08.2014", "26.10.2014",
    "01.11.2014", "11.11.2014", "24.12.2014", "25.12.2014",
    "26.12.2014", "31.12.2014",
    "01.01.2015", "06.01.2015", "14.02.2015", "29.03.2015",
    "03.04.2015", "05.04.2015", "06.04.2015", "01.05.2015",
    "03.05.2015", "24.05.2015", "26.05.2015", "04.06.2015",
    "23.06.2015", "15.08.2015", "25.10.2015", "01.11.2015",
    "11.11.2015", "24.12.2015", "25.12.2015", "26.12.2015",
    "31.12.2015"
], dayfirst=True))
#Number of days until next holiday
d_until_next_holiday = []
#Number of days since last holiday
d_since_last_holiday = []
for row in data.itertuples():
    next_special_date = holidays[holidays >= row["Date"]].iloc[0]
    d_until_next_holiday.append((next_special_date - row["Date"]) / pd.Timedelta('1D'))
    previous_special_date = holidays[holidays <= row.index].iloc[-1]
    d_since_last_holiday.append((row["Date"] - previous_special_date) / pd.Timedelta('1D'))
#Add new cols to DF
sto2STG14["d_until_next_holiday"] = d_until_next_holiday
sto2STG14["d_since_last_holiday"] = d_since_last_holiday
Nevertheless, I get an error like the one below:
TypeError: tuple indices must be integers or slices, not str
Why do I get this error? I know that row is a tuple, but I use .iloc[0] and .iloc[-1] in my code. What can I do?
The error itself comes from row["Date"]: itertuples() yields namedtuples, which support attribute access (row.Date) and integer indexing, but not string keys. That said, with pandas you rarely need to loop at all. In this case, the .shift method allows you to compute the gaps between consecutive holidays in one go:
import pandas

holidays = pandas.Series(pandas.to_datetime([
    "01.01.2013", "06.01.2013", "14.02.2013", "29.03.2013",
    "31.03.2013", "01.04.2013", "01.05.2013", "03.05.2013",
    "19.05.2013", "26.05.2013", "30.05.2013", "23.06.2013",
    "15.07.2013", "27.10.2013", "01.11.2013", "11.11.2013",
    "24.12.2013", "25.12.2013", "26.12.2013", "31.12.2013",
    "01.01.2014", "06.01.2014", "14.02.2014", "30.03.2014",
    "18.04.2014", "20.04.2014", "21.04.2014", "01.05.2014",
    "03.05.2014", "03.05.2014", "26.05.2014", "08.06.2014",
    "19.06.2014", "23.06.2014", "15.08.2014", "26.10.2014",
    "01.11.2014", "11.11.2014", "24.12.2014", "25.12.2014",
    "26.12.2014", "31.12.2014",
    "01.01.2015", "06.01.2015", "14.02.2015", "29.03.2015",
    "03.04.2015", "05.04.2015", "06.04.2015", "01.05.2015",
    "03.05.2015", "24.05.2015", "26.05.2015", "04.06.2015",
    "23.06.2015", "15.08.2015", "25.10.2015", "01.11.2015",
    "11.11.2015", "24.12.2015", "25.12.2015", "26.12.2015",
    "31.12.2015"
], dayfirst=True))

results = (
    holidays
    .sort_values()
    .to_frame('holiday')
    .assign(
        days_since_prev=lambda df: df['holiday'] - df['holiday'].shift(1),
        days_until_next=lambda df: df['holiday'].shift(-1) - df['holiday'],
    )
)
results.head(10)
And I get:
holiday days_since_prev days_until_next
0 2013-01-01 NaT 5 days
1 2013-01-06 5 days 39 days
2 2013-02-14 39 days 43 days
3 2013-03-29 43 days 2 days
4 2013-03-31 2 days 1 days
5 2013-04-01 1 days 30 days
6 2013-05-01 30 days 2 days
7 2013-05-03 2 days 16 days
8 2013-05-19 16 days 7 days
9 2013-05-26 7 days 4 days
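If, as in the original question, you need the distances measured from each row of your own data rather than between consecutive holidays, you still don't need a loop. A sketch using pandas.merge_asof, where the `data` frame below is a hypothetical stand-in for the question's data:

import pandas as pd

# Hypothetical stand-in for the question's `data` frame:
data = pd.DataFrame({"Date": pd.to_datetime(["2013-01-03", "2013-02-01", "2013-03-30"])})

hol = holidays.sort_values().to_frame("holiday")
data = data.sort_values("Date").reset_index(drop=True)

# Last holiday on or before each date, and next holiday on or after it:
prev = pd.merge_asof(data, hol, left_on="Date", right_on="holiday", direction="backward")
nxt = pd.merge_asof(data, hol, left_on="Date", right_on="holiday", direction="forward")

data["d_since_last_holiday"] = (data["Date"] - prev["holiday"]).dt.days
data["d_until_next_holiday"] = (nxt["holiday"] - data["Date"]).dt.days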

How to count business days per month for the whole year with different weekmask every week?

I'm trying to make a program that will equally distribute employees' days off. There are 4 groups, and each group has its own weekmask for each week of the month. So far I've made code that changes the weekmask when it finds a 0 (Sunday) in the DataFrame. I'm stuck on structuring the np.busday_count(start, end, weekmask=) call so that the start and end dates change automatically.
My Dataframe looks like this:
And here's my code:
a: int = 0
week_mask: str = '1100111'

def _change_week_mask():
    global a, week_mask
    a += 1
    if a == 1:
        week_mask = '1111000'
    elif a == 2:
        week_mask = '1111111'
    elif a == 3:
        week_mask = '0011111'
    else:
        a = 0

for line in rows['Workday']:
    if line == '0':
        _change_week_mask()
Edit: changed the value of start week from 6 to 0.
OK, so to answer your problem I have created a sample data frame with the code below.
Then I have added the following columns to the data frame:
dayofweek - to get data similar to yours, where every Sunday is marked; in this case Monday is set to zero and Sunday is six.
weeknum - week of the year.
week - instead of counting and then changing the weekmask, I have assigned each row a week value from 0 to 3, and the mask can be calculated from it.
weekmask - the mask calculated from the week value; you might need to align this with your own logic.
weekenddate - the end date, calculated by adding up to 7 days to the start date; if the month changes mid-week, this will hold the month-end date instead.
After this we can create a new data frame keeping one row per week segment (in this case Monday is 0, so I have taken dayofweek == 0, plus the first day of each month), and then apply the function and store the result in the data frame.
import datetime
import pandas as pd
import numpy as np
df_ = pd.DataFrame({'startdate':pd.date_range(pd.to_datetime('2018-10-01'), pd.to_datetime('2018-11-30'))})
df_['dayofweek'] = df_.startdate.dt.dayofweek
df_['remaining_days_in_month'] = df_.startdate.dt.days_in_month - df_.startdate.dt.day
df_['week'] = df_.startdate.dt.week%4
df_['day'] = df_.startdate.dt.day
df_['weekmask'] = df_.week.map({0 : '1100111', 1 : '1111000' , 2 : '1111111', 3: '0011111'})
df_['weekenddate'] = [x[0] + datetime.timedelta(days=(7-x[1])) if x[2] > 7-x[1] else x[0] + datetime.timedelta(days=(x[2])) for x in df_[['startdate','dayofweek','remaining_days_in_month']].values]
final_df = df_[(df_['dayofweek']==0) | ( df_['day']==1)][['startdate','weekenddate','weekmask']]
final_df['numberofdays'] = [ np.busday_count((x[0]).astype('<M8[D]'), x[1].astype('<M8[D]'), weekmask=x[2]) for x in final_df.values.astype(str)]
Output:
startdate weekenddate weekmask numberofdays
0 2018-10-01 2018-10-08 1100111 5
7 2018-10-08 2018-10-15 1111000 4
14 2018-10-15 2018-10-22 1111111 7
21 2018-10-22 2018-10-29 0011111 5
28 2018-10-29 2018-10-31 1100111 2
31 2018-11-01 2018-11-05 1100111 3
35 2018-11-05 2018-11-12 1111000 4
42 2018-11-12 2018-11-19 1111111 7
49 2018-11-19 2018-11-26 0011111 5
56 2018-11-26 2018-11-30 1100111 2
Let me know if this needs some changes as per your requirements.
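As a minimal illustration of the weekmask argument on its own (my example, matching the first row of the output above): it is seven 0/1 flags for Monday through Sunday, and the end date is exclusive:

import numpy as np

# 2018-10-01 is a Monday; mask '1100111' counts Mon, Tue, Fri, Sat and Sun.
np.busday_count('2018-10-01', '2018-10-08', weekmask='1100111')  # -> 5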

pandas resample to specific weekday in month

I have a Pandas dataframe where I'd like to resample to every third Friday of the month.
import numpy as np
import pandas as pd

np.random.seed(0)
#requested output:
dates = pd.date_range("2018-01-01", "2018-08-31")
dates_df = pd.DataFrame(data=np.random.random(len(dates)), index=dates)
mask = (dates.weekday == 4) & (14 < dates.day) & (dates.day < 22)
dates_df.loc[mask]
But when a third Friday is missing (e.g. dropping Feb's third Friday), I want to have the latest value (so as of 2018-02-15). Using the mask gives me the next value (Feb 17 instead of Feb 15):
# remove February third Friday:
dates_df = dates_df.drop([pd.to_datetime("2018-02-16")])
mask = (dates.weekday == 4) & (14 < dates.day) & (dates.day < 22)
dates_df.loc[mask]
Using monthly resample in combination with loffset gives the end of month values with offsetting the index, which is also not what I want:
from pandas.tseries.offsets import WeekOfMonth
dates_df.resample("M", loffset=WeekOfMonth(week=2, weekday=4)).last()
Is there an alternative (preferably using resample) without having to resample to daily values first and then applying a mask? That approach takes a long time to complete on my dataframe.
Your second attempt is in the right direction, IIUC; you just need to resample using WeekOfMonth as the rule, rather than using it as an offset:
dates_df.resample(WeekOfMonth(week=2, weekday=4)).asfreq().dropna()
This approach will not offset the index; it should just return the data for the third Friday of every month.
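For reference, the offset's parameters are zero-based: week=2 means the third occurrence and weekday=4 means Friday, so the offset rolls a date forward to the third Friday of its month. A quick sanity check:

import pandas as pd
from pandas.tseries.offsets import WeekOfMonth

pd.Timestamp('2018-05-01') + WeekOfMonth(week=2, weekday=4)  # Timestamp('2018-05-18')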
Dealing with Missing 3rd Friday:
With the above code, if you have a missing 3rd Friday the whole month will be excluded. Depending on how you want to deal with missing data, you can bfill, ffill, or pad; you can amend the above to the following:
dates_df.resample(rule=WeekOfMonth(week=2,weekday=4)).bfill().asfreq(freq='D').dropna()
The above will bfill the missing 3rd Friday with the next value.
Update: let's work with a fixed data set instead of np.random:
# create a smaller daterange
dates = pd.date_range("2018-05-01", "2018-08-31")
# create a data with only 1,2,3 values
data = [1,2,3] * int(len(dates)/3)
dates_df = pd.DataFrame(data=data, index=dates)
dates_df.head()
# Output:
2018-05-01 1
2018-05-02 2
2018-05-03 3
2018-05-04 1
2018-05-05 2
Now let's check what the data looks like for the 3rd Friday of each month by selecting it manually:
dates_df.loc[[
pd.Timestamp('2018-05-18'),
pd.Timestamp('2018-06-15'),
pd.Timestamp('2018-07-20'),
pd.Timestamp('2018-08-17')
]]
Output:
2018-05-18 3
2018-06-15 1
2018-07-20 3
2018-08-17 1
If you don't have any missing 3rd Fridays, running the code provided earlier:
dates_df.resample(rule=WeekOfMonth(week=2,weekday=4)).asfreq().dropna()
Will produce the following output:
2018-05-18 3
2018-06-15 1
2018-07-20 3
2018-08-17 1
As you can see the index has not been shifted here and it returned the exact values for the 3rd Friday of each month.
Now say you do have some 3rd Fridays missing. Depending on how you want to handle it (use the previous value: ffill, or the next value: bfill):
pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use NEXT valid observation to fill gap
dates_df.drop(index=pd.Timestamp('2018-08-17')).resample(rule=WeekOfMonth(week=2, weekday=4)).ffill().asfreq(freq='D').dropna()
2018-05-18 3
2018-06-15 1
2018-07-20 3
2018-08-17 3
dates_df.drop(index=pd.Timestamp('2018-08-17')).resample(rule=WeekOfMonth(week=2, weekday=4)).bfill().asfreq(freq='D').dropna()
2018-04-20 1
2018-05-18 3
2018-06-15 1
2018-07-20 3
2018-08-17 2
If, say, the whole index was shifted, like in your example:
dates_df.resample(rule='M', loffset=WeekOfMonth(week=2, weekday=4)).asfreq().dropna()
# Output:
2018-06-15 1
2018-07-20 1
2018-08-17 2
2018-09-21 3
What's happening there is that you're resampling by rule 'M' (month end) and then offsetting (shifting forward) the index by the 3rd Friday of each month.
As you can see, before the offset this is how it looks:
dates_df.resample(rule='M').asfreq().dropna()
# Output
2018-05-31 1
2018-06-30 1
2018-07-31 2
2018-08-31 3

Python Pandas: groupby date and count new records for each period

I'm trying to use Python Pandas to count daily returning visitors to my website over a time period.
Example data:
df1 = pd.DataFrame({'user_id': [1, 2, 3, 1, 3],
                    'date': ['2012-09-29', '2012-09-30', '2012-09-30', '2012-10-01', '2012-10-01']})
print(df1)
date user_id
0 2012-09-29 1
1 2012-09-30 2
2 2012-09-30 3
3 2012-10-01 1
4 2012-10-01 3
What I'd like to have as final result:
df1_result = pd.DataFrame({'count_new': [1, 2, 0],
                           'date': ['2012-09-29', '2012-09-30', '2012-10-01']})
print(df1_result)
count_new date
0 1 2012-09-29
1 2 2012-09-30
2 0 2012-10-01
In the first day there is 1 new user because user 1 appears for the first time.
In the second day there are 2 new users: user 2 and user 3 both appear for the first time.
Finally in the third day there are 0 new users: user 1 and user 3 have both already appeared in previous periods.
So far I have been looking into merging two copies of the same dataframe and shifting one by a date, but without success:
pd.merge(df1, df1.user_id.shift(-date), on = 'date').groupby('date')['user_id_y'].nunique()
Any help would be much appreciated,
Thanks
>>> (df1
        .groupby(['user_id'], as_index=False)['date']  # Group by `user_id` and get first date.
        .first()
        .groupby(['date'])                             # Group result on `date` and take counts.
        .count()
        .reindex(df1['date'].unique())                 # Reindex on original dates.
        .fillna(0))                                    # Fill null values with zero.
user_id
date
2012-09-29 1
2012-09-30 2
2012-10-01 0
It is better to add a new column, Isreturning, in case you need to analyze returning customers in the future:
df['Isreturning'] = df.groupby('user_id').cumcount()
Only show new customers:
df.loc[df.Isreturning==0,:].groupby('date')['user_id'].count()
Out[840]:
date
2012-09-29 1
2012-09-30 2
Name: user_id, dtype: int64
Or you can:
df.groupby('date')['Isreturning'].apply(lambda x : len(x[x==0]))
Out[843]:
date
2012-09-29 1
2012-09-30 2
2012-10-01 0
Name: Isreturning, dtype: int64
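A compact alternative (my sketch, equivalent in spirit to the answers above): mark each user's first appearance with ~duplicated() and sum it per date:

import pandas as pd

df1 = pd.DataFrame({'user_id': [1, 2, 3, 1, 3],
                    'date': ['2012-09-29', '2012-09-30', '2012-09-30',
                             '2012-10-01', '2012-10-01']})

# True on the first row where each user_id appears, False on repeats:
df1['is_new'] = ~df1['user_id'].duplicated()
print(df1.groupby('date')['is_new'].sum())
# 2012-09-29    1
# 2012-09-30    2
# 2012-10-01    0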
