Convert hours-minutes-seconds duration in dataframe to minutes - python

I have a csv with a column that represents time durations of two discrete events.
Day,Duration
Mon,"S: 3h0s, P: 18m0s"
Tues,"S: 3h0s, P: 18m0s"
Wed,"S: 4h0s, P: 18m0s"
Thurs,"S: 30h, P: 10m0s"
Fri,"S: 15m, P: 3h0s"
I want to split that duration into two distinct columns and consistently represent the time in minutes. Right now, it is shown in hours, minutes, and seconds, like S: 3h0s, P: 18m0s. So the output should look like this:
Day Duration S(min) P(min)
0 Mon S: 3h0s, P: 18m0s 180 18
1 Tues S: 3h0s, P: 18m0s 180 18
2 Wed S: 4h0s, P: 18m0s 240 18
3 Thur S: 30h0s, P: 10m0s 1800 10
4 Fri S: 15m, P: 3h0s 15 180
But when I do it with str.replace:
import pandas as pd
df = pd.read_csv("/file.csv")
df["S(min)"] = df['Duration'].str.split(',').str[0]
df["P(min)"] = df['Duration'].str.split(',').str[-1]
df['S(min)'] = df['S(min)'].str.replace("S: ", '').str.replace("h", '*60').str.replace('m','*1').str.replace('s','*(1/60)').apply(eval)
df['P(min)'] = df['P(min)'].str.replace("P: ", '').str.replace("h", '*60').str.replace('m','*1').str.replace('s','*(1/60)').apply(eval)
Some of the calculations are off:
Day Duration S(min) P(min)
0 Mon S: 3h0s, P: 18m0s 30.0 3.000000
1 Tues S: 3h0s, P: 18m0s 30.0 3.000000
2 Wed S: 4h0s, P: 18m0s 40.0 3.000000
3 Thurs S: 30h, P: 10m0s 1800.0 1.666667
4 Fri S: 15m, P: 3h0s 15.0 30.000000
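The root cause is visible on a single value: the chained str.replace builds a multiplication chain rather than a sum, so the units multiply each other away. A minimal demonstration outside pandas:

```python
# "3h0s" -> "3*600s" -> "3*600*(1/60)": a pure product, not hours + seconds
expr = "3h0s".replace("h", "*60").replace("m", "*1").replace("s", "*(1/60)")
print(expr)        # 3*600*(1/60)
print(eval(expr))  # 30.0 -- not the expected 180

# Appending '+' after each unit (plus a trailing 0) turns it into a sum
fixed = "3h0s".replace("h", "*60+").replace("m", "*1+").replace("s", "*(1/60)+") + "0"
print(eval(fixed))  # 180.0
```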

Using regex and pd.to_timedelta
df[['S', 'P']] = df['Duration'].str.extract(r'(S: .*?), P:( .*)')
df['S(min)'] = pd.to_timedelta(df['Duration'].str.replace('[SP]: ', '', regex=True).str.split(',').str[0]).dt.total_seconds() / 60
df['P(min)'] = pd.to_timedelta(df['Duration'].str.replace('[SP]: ', '', regex=True).str.split(',').str[-1]).dt.total_seconds() / 60
df.drop(['S', 'P'], axis=1, inplace=True)
print(df)
A more simplified approach with a different regex pattern:
df[['S', 'P']] = df['Duration'].str.extract(r'S: (.*?), P: (.*)')
df['S'] = pd.to_timedelta(df['S']).dt.total_seconds()/60
df['P'] = pd.to_timedelta(df['P']).dt.total_seconds()/60
df = df.rename(columns = {'S': 'S(min)', 'P': 'P(min)'})
print(df)
Day Duration S(min) P(min)
0 Mon S: 3h0s, P: 18m0s 180.0 18.0
1 Tues S: 3h0s, P: 18m0s 180.0 18.0
2 Wed S: 4h0s, P: 18m0s 240.0 18.0
3 Thur S: 30h0s, P: 10m0s 1800.0 10.0
4 Fri S: 15m, P: 3h0s 15.0 180.0

A possible solution:
df.assign(
    **df['Duration'].str.split(':|,', expand=True)[[1, 3]]
        .apply(pd.to_timedelta)
        .apply(lambda x: x.dt.total_seconds().div(60))
        .rename({1: 'S(min)', 3: 'P(min)'}, axis=1))
Output:
Day Duration S(min) P(min)
0 Mon S: 3h0s, P: 18m0s 180.0 18.0
1 Tues S: 3h0s, P: 18m0s 180.0 18.0
2 Wed S: 4h0s, P: 18m0s 240.0 18.0
3 Thur S: 30h0s, P: 10m0s 1800.0 10.0
4 Fri S: 15m, P: 3h0s 15.0 180.0
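To see why the split positions [1, 3] are the right ones, here is a quick sketch of the intermediate frame on one row (standalone, with a hand-built Series):

```python
import pandas as pd

s = pd.Series(["S: 3h0s, P: 18m0s"])
# splitting on ':' or ',' leaves the duration strings in columns 1 and 3
parts = s.str.split(':|,', expand=True)
print(parts)
# to_timedelta parses the "3h0s"-style strings directly
print(pd.to_timedelta(parts[1].str.strip()).dt.total_seconds() / 60)  # 180.0
```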

You forgot the '+' in '*60' and '*1', but even that does not solve the problem when the hour or minute part is missing. You can also use a more complex regex to explode all the values into different columns:
S_pat = r'S:\s*(?:(?P<S_h>\d+)(?=h)h)?\s*(?:(?P<S_m>\d+)(?=m)m)?\s*(?:(?P<S_s>\d+)(?=s)s)?'
P_pat = r'P:\s*(?:(?P<P_h>\d+)(?=h)h)?\s*(?:(?P<P_m>\d+)(?=m)m)?\s*(?:(?P<P_s>\d+)(?=s)s)?'
SP_pat = S_pat + r',\s*' + P_pat  # combine the two patterns into one
duration = df['Duration'].str.extract(SP_pat).fillna(0).astype(int)
duration = (duration.mul([60, 1, 1/60, 60, 1, 1/60])
.groupby(duration.columns.str[0], axis=1)
.sum().add_suffix('(min)'))
out = pd.concat([df, duration], axis=1)
Output:
>>> out
Day Duration P(min) S(min)
0 Mon S: 3h0s, P: 18m0s 18.0 180.0
1 Tues S: 3h0s, P: 18m0s 18.0 180.0
2 Wed S: 4h0s, P: 18m0s 18.0 240.0
3 Thur S: 30h0s, P: 10m0s 10.0 1800.0
4 Fri S: 15m, P: 3h0s 180.0 15.0
# duration after extraction, before computing
>>> duration
S_h S_m S_s P_h P_m P_s
0 3 0 0 0 18 0
1 3 0 0 0 18 0
2 4 0 0 0 18 0
3 30 0 0 0 10 0
4 0 15 0 3 0 0
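Note that groupby(..., axis=1) is deprecated in recent pandas versions; the same column-wise sum can be written by transposing first. A sketch on a hand-built one-row frame mirroring the extracted columns (the frame here is an assumption, standing in for the first row of duration):

```python
import pandas as pd

# stand-in for the extracted `duration` frame (first row only)
duration = pd.DataFrame({'S_h': [3], 'S_m': [0], 'S_s': [0],
                         'P_h': [0], 'P_m': [18], 'P_s': [0]})
minutes = duration.mul([60, 1, 1/60, 60, 1, 1/60])
# group the transposed rows by their first letter, then transpose back
out = minutes.T.groupby(minutes.columns.str[0]).sum().T.add_suffix('(min)')
print(out)  # P(min)=18.0, S(min)=180.0
```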

Related

Convert day of the year to datetime

I have data files containing year, day of the year (DOY), hour and minutes, as follows:
BuoyID Year Hour Min DOY POS_DOY Lat Lon Ts
0 300234065718160 2019 7 0 216.2920 216.2920 58.559 -23.914 14.61
1 300234065718160 2019 9 0 216.3750 216.3750 58.563 -23.905 14.60
2 300234065718160 2019 10 0 216.4170 216.4170 58.564 -23.903 14.60
3 300234065718160 2019 11 0 216.4580 216.4580 58.563 -23.906 14.60
4 300234065718160 2019 12 0 216.5000 216.5000 58.561 -23.910 14.60
In order to make my datetime, I used:
dt_raw = pd.to_datetime(df_buoy['Year'] * 1000 + df_buoy['DOY'], format='%Y%j')
# Convert to datetime
dt_buoy = [d.date() for d in dt_raw]
date = datetime.datetime.combine(dt_buoy[0], datetime.time(df_buoy.Hour[0], df_buoy.Min[0]))
My problem arises when the hours are not int, but float instead. For example:
BuoyID Year Hour Min DOY POS_DOY Lat Lon BP Ts
0 300234061876910 2014 23.33 0 226.972 226.972 71.93081 -141.0792 1016.9 -0.01
1 300234061876910 2014 23.50 0 226.979 226.979 71.93020 -141.0826 1016.8 3.36
2 300234061876910 2014 23.67 0 226.986 226.986 71.92968 -141.0856 1016.8 3.28
3 300234061876910 2014 23.83 0 226.993 226.993 71.92934 -141.0876 1016.8 3.22
4 300234061876910 2014 0.00 0 227.000 227.000 71.92904 -141.0894 1016.8 3.18
What I tried was to convert the hours to str and take the first two characters to obtain the whole hour, then subtract that from 'Hour' and multiply by 60 to get the minutes.
int_hour = [(int(str(i)[0:2])) for i in df_buoy.Hour]
minutes = map(lambda x, y: (x - y)*60, df_buoy.Hour, int_hour)
But, of course, if you have '0.' as your hour, Python will complain:
ValueError: invalid literal for int() with base 10: '0.'
My question is: does anyone know a simple way to convert year, DOY, hour (either int or float) and minutes to datetime?
Use to_timedelta to convert the hours column and add it to the datetimes; this works well with both integers and floats:
df['d'] = (pd.to_datetime(df['Year'] * 1000 + df['DOY'], format='%Y%j') +
pd.to_timedelta(df['Hour'], unit='h'))
print (df)
BuoyID Year Hour Min DOY POS_DOY Lat Lon Ts \
0 300234065718160 2019 7 0 216.292 216.292 58.559 -23.914 14.61
1 300234065718160 2019 9 0 216.375 216.375 58.563 -23.905 14.60
2 300234065718160 2019 10 0 216.417 216.417 58.564 -23.903 14.60
3 300234065718160 2019 11 0 216.458 216.458 58.563 -23.906 14.60
4 300234065718160 2019 12 0 216.500 216.500 58.561 -23.910 14.60
d
0 2019-08-04 07:00:00
1 2019-08-04 09:00:00
2 2019-08-04 10:00:00
3 2019-08-04 11:00:00
4 2019-08-04 12:00:00
df['d'] = (pd.to_datetime(df['Year'] * 1000 + df['DOY'], format='%Y%j') +
pd.to_timedelta(df['Hour'], unit='h'))
print (df)
BuoyID Year Hour Min DOY POS_DOY Lat Lon \
0 300234061876910 2014 23.33 0 226.972 226.972 71.93081 -141.0792
1 300234061876910 2014 23.50 0 226.979 226.979 71.93020 -141.0826
2 300234061876910 2014 23.67 0 226.986 226.986 71.92968 -141.0856
3 300234061876910 2014 23.83 0 226.993 226.993 71.92934 -141.0876
4 300234061876910 2014 0.00 0 227.000 227.000 71.92904 -141.0894
BP Ts d
0 1016.9 -0.01 2014-08-14 23:19:48
1 1016.8 3.36 2014-08-14 23:30:00
2 1016.8 3.28 2014-08-14 23:40:12
3 1016.8 3.22 2014-08-14 23:49:48
4 1016.8 3.18 2014-08-15 00:00:00
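The reason this handles floats is that to_timedelta accepts fractional units; a quick standalone check on one value from the data:

```python
import pandas as pd

# 23.33 hours = 23 h + 0.33 * 60 min = 23:19:48
td = pd.to_timedelta(23.33, unit='h')
print(td)  # 0 days 23:19:48
print(round(td.total_seconds()))  # 83988
```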

How to append keys which are not previously in dataframe 1 but are in dataframe 2 against each name

I have a Dataframe df1 like this
id name day marks mean_marks
0 1 John Wed 28 28
1 1 John Fri 30 30
2 2 Alex Fri 40 50
3 2 Alex Fri 60 50
and another dataframe df2 as:
day we
0 Mon 29
1 Wed 21
2 Fri 31
Now when I do:
z = pd.merge(df1, df2, how='outer', on=['day']).fillna(0)
I get:
id name day marks mean_marks we
0 1.0 John Wed 28.0 28.0 21
1 1.0 John Fri 30.0 30.0 31
2 2.0 Alex Fri 40.0 50.0 31
3 2.0 Alex Fri 60.0 50.0 31
4 0.0 0 Mon 0.0 0.0 29
But I wanted something that would look like:
id name day marks mean_marks we
0 1.0 John Wed 28.0 28.0 21
1 1.0 John Mon 0.0 0.0 29
2 1.0 John Fri 30.0 30.0 31
3 2.0 Alex Mon 0.0 0.0 29
4 2.0 Alex Wed 0.0 0.0 21
5 2.0 Alex Fri 40.0 50.0 31
6 2.0 Alex Fri 60.0 50.0 31
That is, 'day' values that are not already in df1 but are in df2 should be appended against each name.
Can someone please help me with this?
You might need a cross join to create all combinations of days per id and name; then a merge should work:
u = df1[['id','name']].drop_duplicates().assign(k=1).merge(df2.assign(k=1), on='k')
out = df1.merge(u.drop('k', axis=1), on=['day','name','id'], how='outer').fillna(0)
print(out.sort_values(['id','name']))
id name day marks mean_marks we
0 1 John Wed 28.0 28.0 21
1 1 John Fri 30.0 30.0 31
4 1 John Mon 0.0 0.0 29
2 2 Alex Fri 40.0 50.0 31
3 2 Alex Fri 60.0 50.0 31
5 2 Alex Mon 0.0 0.0 29
6 2 Alex Wed 0.0 0.0 21
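On pandas 1.2+, the helper k column can be dropped in favour of how='cross'. A sketch on hand-built frames mirroring the example data (the frames here are reconstructed from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 2, 2], 'name': ['John', 'John', 'Alex', 'Alex'],
                    'day': ['Wed', 'Fri', 'Fri', 'Fri'],
                    'marks': [28, 30, 40, 60], 'mean_marks': [28, 30, 50, 50]})
df2 = pd.DataFrame({'day': ['Mon', 'Wed', 'Fri'], 'we': [29, 21, 31]})

# all (id, name) x day combinations, no helper column needed
u = df1[['id', 'name']].drop_duplicates().merge(df2, how='cross')
out = df1.merge(u, on=['day', 'name', 'id'], how='outer').fillna(0)
print(out.sort_values(['id', 'name']))
```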
The following code should do it:
z = df1.groupby(['name']).apply(lambda grp: grp.merge(df2, how='outer', on='day')
                                            .fillna({'name': grp.name, 'id': grp.id})).reset_index(drop=True).fillna(0)
It gives the following output:
id name day marks mean_marks we
0 2.0 Alex Fri 40 50 31
1 2.0 Alex Fri 60 50 31
2 2.0 Alex Mon 0 0 29
3 2.0 Alex Wed 0 0 21
4 1.0 John Wed 28 28 21
5 1.0 John Fri 30 30 31
6 1.0 John Mon 0 0 29
You can create df3 with all name and day combinations:
df3 = pd.DataFrame([[name, day] for name in df1.name.unique() for day in df2.day.unique()], columns=['name', 'day'])
Then add id's from df1:
df3 = df3.merge(df1[['id', 'name']]).drop_duplicates()[['id', 'name', 'day']]
Then add marks and mean marks from df1:
df3 = df3.merge(df1, how='left')
Then merge:
z = df3.merge(df2, how='outer', on=['day']).fillna(0).sort_values('id')
Out:
id name day marks mean_marks we
0 1 John Mon 0.0 0.0 29
2 1 John Wed 28.0 28.0 21
4 1 John Fri 30.0 30.0 31
1 2 Alex Mon 0.0 0.0 29
3 2 Alex Wed 0.0 0.0 21
5 2 Alex Fri 40.0 50.0 31
6 2 Alex Fri 60.0 50.0 31
To have the result ordered by weekdays (within each group by id), we should convert the day column in both DataFrames to a Categorical type. I think this is better than your original concept, which does not take the order of the days into account.
To do it, run:
wDays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
days = pd.CategoricalDtype(wDays, ordered=True)
df1.day = df1.day.astype(days)
df2.day = df2.day.astype(days)
Then define the following function, which performs the merge within a group by id and fills NaN values (using either ffill or fillna):
def myMerge(grp):
    res = pd.merge(grp, df2, how='right', on=['day'])
    res[['id', 'name']] = res[['id', 'name']].ffill()
    res[['marks', 'mean_marks']] = res[['marks', 'mean_marks']].fillna(0)
    return res.sort_values('day')
Then group df1 by id and apply the above function to each group:
df1.groupby('id', sort=False).apply(myMerge).reset_index(drop=True)
The final step above is reset_index, to re-create an "ordinary" index.
I also added sort=False to keep your desired (original) order of groups.
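As a standalone illustration of the ordered-categorical trick, sorting then follows weekday order instead of alphabetical order:

```python
import pandas as pd

wDays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
days = pd.CategoricalDtype(wDays, ordered=True)
s = pd.Series(['Fri', 'Mon', 'Wed']).astype(days)
print(s.sort_values().tolist())  # ['Mon', 'Wed', 'Fri'], not alphabetical
```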

Pandas: Count days in each month between given start and end date

I have a pandas dataframe with some beginning and ending dates.
ActualStartDate ActualEndDate
0 2019-06-30 2019-08-15
1 2019-09-01 2020-01-01
2 2019-08-28 2019-11-13
Given these start & end dates I need to count how many days in each month between beginning and ending dates. I can't figure out a good way to approach this, but resulting dataframe should be something like:
ActualStartDate ActualEndDate 2019-06 2019-07 2019-08 2019-09 2019-10 2019-11 2019-12 2020-01 etc
0 2019-06-30 2019-08-15 1 31 15 0 0 0 0 0
1 2019-09-01 2020-01-01 0 0 0 30 31 30 31 1
2 2019-08-28 2019-11-13 0 0 4 30 31 13 0 0
Note that actual dataframe has ~1,500 rows with varying beginning & end dates. Open to different df output, but showing the above to give you the idea of what I need to accomplish. Thank you in advance for any help!
The idea is to create month periods with DatetimeIndex.to_period from date_range and count them with Index.value_counts, then build a DataFrame with concat, replacing missing values with DataFrame.fillna, and finally join to the original with DataFrame.join:
L = {r.Index: pd.date_range(r.ActualStartDate, r.ActualEndDate).to_period('M').value_counts()
for r in df.itertuples()}
df = df.join(pd.concat(L, axis=1).fillna(0).astype(int).T)
print (df)
ActualStartDate ActualEndDate 2019-06 2019-07 2019-08 2019-09 2019-10 \
0 2019-06-30 2019-08-15 1 31 15 0 0
1 2019-09-01 2020-01-01 0 0 0 30 31
2 2019-08-28 2019-11-13 0 0 4 30 31
2019-11 2019-12 2020-01
0 0 0 0
1 30 31 1
2 13 0 0
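The core of the approach, shown on a single row for clarity (a standalone sketch using the first row's dates):

```python
import pandas as pd

# days in each month between 2019-06-30 and 2019-08-15, inclusive
counts = pd.date_range('2019-06-30', '2019-08-15').to_period('M').value_counts()
print(counts.sort_index())  # 2019-06 -> 1, 2019-07 -> 31, 2019-08 -> 15
```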
Performance:
df = pd.concat([df] * 1000, ignore_index=True)
In [44]: %%timeit
...: L = {r.Index: pd.date_range(r.ActualStartDate, r.ActualEndDate).to_period('M').value_counts()
...: for r in df.itertuples()}
...: df.join(pd.concat(L, axis=1).fillna(0).astype(int).T)
...:
689 ms ± 5.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [45]: %%timeit
...: df.join(
...: df.apply(lambda v: pd.Series(pd.date_range(v['ActualStartDate'], v['ActualEndDate'], freq='D').to_period('M')), axis=1)
...: .apply(pd.value_counts, axis=1)
...: .fillna(0)
...: .astype(int))
...:
994 ms ± 5.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Probably not the most efficient, but it shouldn't be too bad for ~1,500 rows: expand out a date range, convert it to a monthly period, take the counts of those, and rejoin back to your original DF, e.g.:
res = df.join(
    df.apply(lambda v: pd.Series(pd.date_range(v['ActualStartDate'], v['ActualEndDate'], freq='D').to_period('M')), axis=1)
      .apply(pd.value_counts, axis=1)
      .fillna(0)
      .astype(int)
)
Gives you:
ActualStartDate ActualEndDate 2019-06 2019-07 2019-08 2019-09 2019-10 2019-11 2019-12 2020-01 2020-02 2020-03 2020-04 2020-05 2020-06 2020-07 2020-08 2020-09 2020-10 2020-11
0 2019-06-30 2020-08-15 1 31 31 30 31 30 31 31 29 31 30 31 30 31 15 0 0 0
1 2019-09-01 2020-01-01 0 0 0 30 31 30 31 1 0 0 0 0 0 0 0 0 0 0
2 2019-08-28 2020-11-13 0 0 4 30 31 30 31 31 29 31 30 31 30 31 31 30 31 13
import pandas as pd
import calendar
date_info = pd.DataFrame({
    'ActualStartDate': [
        pd.Timestamp('2019-06-30'),
        pd.Timestamp('2019-09-01'),
        pd.Timestamp('2019-08-28'),
    ],
    'ActualEndDate': [
        pd.Timestamp('2019-08-15'),
        pd.Timestamp('2020-01-01'),
        pd.Timestamp('2019-11-13'),
    ]
})
# ============================================================
result = {}  # keep the result in a dict, in case of too many columns
for index, timepair in date_info.iterrows():
    start = timepair['ActualStartDate']
    end = timepair['ActualEndDate']
    current = start
    result[index] = {}  # delta days for this pair
    while True:
        # find the delta days; the current day also counts, so + 1
        _, days = calendar.monthrange(current.year, current.month)
        days = min(days, (end - current).days + 1)
        delta = days - current.day + 1
        result[index]['%s-%s' % (current.year, current.month)] = delta
        current += pd.Timedelta(delta, unit='d')
        if current > end:  # '>' (not '>='), so a one-day final month still counts
            break
# you can save the result in a dataframe, if you insist
columns = set()
for value in result.values():
    columns.update(value.keys())
for col in columns:
    date_info[col] = 0
for index, delta in result.items():
    for date, days in delta.items():
        date_info.loc[index, date] = days
print(date_info)

Python - Creating list of week numbers of months using dates

I have got a start date ('2019-11-18') and an end date ('2021-02-19'). I am trying to create a list of all the weeks of each month that exist between the start and end date. My expected result should be something like this:
list = ['2019.Nov.3','2019.Nov.4', '2019.Nov.5' .... '2021.Feb.2','2021.Feb.3']
If the first or last date of a month lands on a Wednesday, I will assume that the week belongs to this month (as 3 out of the 5 working days will belong to this month).
I was actually successful in creating a dataframe with all the weeks of the year that exist between the start and end date using the following code:
date_1 = '18-11-19'
first_date = datetime.strptime(date_1, '%d-%m-%y')
date_2 = '19-02-21'
last_date = datetime.strptime(date_2, '%d-%m-%y')
timeline = pd.DataFrame(columns=['Year', 'Week'])
def create_list(df):
    start_year = int(first_date.isocalendar()[0])
    start_week = int(first_date.isocalendar()[1])
    end_year = int(last_date.isocalendar()[0])
    end_week = int(last_date.isocalendar()[1])
    while start_year < (end_year + 1):
        if start_year == end_year:
            while start_week < (end_week + 1):
                if len(str(start_week)) == 1:
                    week = f'{start_year}' + '.0' + f'{start_week}'
                else:
                    week = f'{start_year}' + '.' + f'{start_week}'
                df = df.append({'Year': start_year, 'Week': week}, ignore_index=True)
                start_week += 1
        else:
            while start_week < 53:
                if len(str(start_week)) == 1:
                    week = f'{start_year}' + '.0' + f'{start_week}'
                else:
                    week = f'{start_year}' + '.' + f'{start_week}'
                df = df.append({'Year': start_year, 'Week': week}, ignore_index=True)
                start_week += 1
        start_year += 1
        start_week = 1
    return df
timeline = create_list(timeline)
I was able to use this as the x-axis for my line graph. However, the axis is a bit hard to read, and it's very difficult to tell which week belongs to which month.
I would really appreciate if someone can give me a hand with this!
Edit:
So here is the solution, with the guidance of @Serge Ballesta. I hope it helps anyone who might need something similar in the future!
import pandas as pd
import dateutil.relativedelta
from datetime import datetime
def year_week(date):
    if len(str(date.isocalendar()[1])) == 1:
        return f'{date.isocalendar()[0]}' + '.0' + f'{date.isocalendar()[1]}'
    else:
        return f'{date.isocalendar()[0]}' + '.' + f'{date.isocalendar()[1]}'
date_1 = '18-11-19'
first_date = datetime.strptime(date_1, '%d-%m-%y')
date_2 = '19-02-21'
last_date = datetime.strptime(date_2, '%d-%m-%y')
set_first_date = str((first_date - dateutil.relativedelta.relativedelta(months=1)).date())
set_last_date = str((last_date + dateutil.relativedelta.relativedelta(months=1)).date())
s = pd.date_range(set_first_date, set_last_date, freq='W-WED'
).to_series(name='wed').reset_index(drop=True)
df = s.to_frame()
df['week'] = df.apply(lambda x: year_week(x['wed']), axis=1)
df = df.assign(week_of_month=s.groupby(s.dt.strftime('%Y%m')
).cumcount() + 1)
df = df[(s >= pd.Timestamp('2019-11-18'))
& (s <= pd.Timestamp('2021-02-19'))]
df['month_week'] = (df['wed'].dt.strftime('%Y.%b.') + df['week_of_month'].astype(str)).tolist()
df = df.drop(['wed', 'week_of_month'], axis = 1)
print (df)
Printed df:
week month_week
4 2019.47 2019.Nov.3
5 2019.48 2019.Nov.4
6 2019.49 2019.Dec.1
7 2019.50 2019.Dec.2
8 2019.51 2019.Dec.3
.. ... ...
65 2021.03 2021.Jan.3
66 2021.04 2021.Jan.4
67 2021.05 2021.Feb.1
68 2021.06 2021.Feb.2
69 2021.07 2021.Feb.3
I would build a Series of timestamps with a frequency of W-WED, so that the day of the week is consistently Wednesday. That way, we immediately get the correct month for the week.
To have the number of the week in the month, I would start one month before the required start, and use a cumcount on year-month + 1. Then it would be enough to filter only the expected range and properly format the values:
# produce a series of wednesdays starting in 2019-10-01
s = pd.date_range('2019-10-01', '2021-03-31', freq='W-WED'
).to_series(name='wed').reset_index(drop=True)
# compute the week number in the month
df = s.to_frame().assign(week_of_month=s.groupby(s.dt.strftime('%Y%m')
).cumcount() + 1)
# filter the required range
df = df[(s >= pd.Timestamp('2019-11-18'))
& (s <= pd.Timestamp('2021-02-19'))]
# here is the expected list
lst = (df['wed'].dt.strftime('%Y.%b.')+df['week_of_month'].astype(str)).tolist()
lst is as expected:
['2019.Nov.3', '2019.Nov.4', '2019.Dec.1', '2019.Dec.2', '2019.Dec.3', '2019.Dec.4',
'2020.Jan.1', '2020.Jan.2', '2020.Jan.3', '2020.Jan.4', '2020.Jan.5', '2020.Feb.1',
'2020.Feb.2', '2020.Feb.3', '2020.Feb.4', '2020.Mar.1', '2020.Mar.2', '2020.Mar.3',
'2020.Mar.4', '2020.Apr.1', '2020.Apr.2', '2020.Apr.3', '2020.Apr.4', '2020.Apr.5',
'2020.May.1', '2020.May.2', '2020.May.3', '2020.May.4', '2020.Jun.1', '2020.Jun.2',
'2020.Jun.3', '2020.Jun.4', '2020.Jul.1', '2020.Jul.2', '2020.Jul.3', '2020.Jul.4',
'2020.Jul.5', '2020.Aug.1', '2020.Aug.2', '2020.Aug.3', '2020.Aug.4', '2020.Sep.1',
'2020.Sep.2', '2020.Sep.3', '2020.Sep.4', '2020.Sep.5', '2020.Oct.1', '2020.Oct.2',
'2020.Oct.3', '2020.Oct.4', '2020.Nov.1', '2020.Nov.2', '2020.Nov.3', '2020.Nov.4',
'2020.Dec.1', '2020.Dec.2', '2020.Dec.3', '2020.Dec.4', '2020.Dec.5', '2021.Jan.1',
'2021.Jan.2', '2021.Jan.3', '2021.Jan.4', '2021.Feb.1', '2021.Feb.2', '2021.Feb.3']
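The W-WED anchored frequency is what guarantees every generated timestamp is a Wednesday; a standalone check on the first few weeks of the range:

```python
import pandas as pd

wed = pd.date_range('2019-11-18', '2019-12-04', freq='W-WED')
print(list(wed.strftime('%Y-%m-%d')))  # ['2019-11-20', '2019-11-27', '2019-12-04']
print(set(wed.dayofweek))              # {2}, i.e. all Wednesdays
```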
This may not give you exactly what you need (because of the "3 out of 5 days in the last week" condition), but maybe you can get an idea of how to tweak it to get your desired result.
You can export column res to a list with df['res'].to_list().
df = pd.DataFrame({'date': pd.date_range('2019-11-18','2021-02-19', freq=pd.offsets.Week(weekday=0))})
df['year_wk']= df.date.apply(lambda x: x.strftime("%W")).astype(int)
df['mon_beg_wk']= df.date.dt.to_period('M').dt.to_timestamp().dt.strftime("%W").astype(int)
df['mon_wk']= df['year_wk']-df['mon_beg_wk']
df['res']= df['date'].dt.strftime("%Y.%b")+'.'+df['mon_wk'].astype(str)
df
Output
date year_wk mon_beg_wk mon_wk res
0 2019-11-18 46 43 3 2019.Nov.3
1 2019-11-25 47 43 4 2019.Nov.4
2 2019-12-02 48 47 1 2019.Dec.1
3 2019-12-09 49 47 2 2019.Dec.2
4 2019-12-16 50 47 3 2019.Dec.3
5 2019-12-23 51 47 4 2019.Dec.4
6 2019-12-30 52 47 5 2019.Dec.5
7 2020-01-06 1 0 1 2020.Jan.1
8 2020-01-13 2 0 2 2020.Jan.2
9 2020-01-20 3 0 3 2020.Jan.3
10 2020-01-27 4 0 4 2020.Jan.4
11 2020-02-03 5 4 1 2020.Feb.1
12 2020-02-10 6 4 2 2020.Feb.2
13 2020-02-17 7 4 3 2020.Feb.3
14 2020-02-24 8 4 4 2020.Feb.4
15 2020-03-02 9 8 1 2020.Mar.1
16 2020-03-09 10 8 2 2020.Mar.2
17 2020-03-16 11 8 3 2020.Mar.3
18 2020-03-23 12 8 4 2020.Mar.4
19 2020-03-30 13 8 5 2020.Mar.5
20 2020-04-06 14 13 1 2020.Apr.1
21 2020-04-13 15 13 2 2020.Apr.2
22 2020-04-20 16 13 3 2020.Apr.3
23 2020-04-27 17 13 4 2020.Apr.4
24 2020-05-04 18 17 1 2020.May.1
25 2020-05-11 19 17 2 2020.May.2
26 2020-05-18 20 17 3 2020.May.3
27 2020-05-25 21 17 4 2020.May.4
28 2020-06-01 22 22 0 2020.Jun.0
29 2020-06-08 23 22 1 2020.Jun.1
... ... ... ... ... ...
36 2020-07-27 30 26 4 2020.Jul.4
37 2020-08-03 31 30 1 2020.Aug.1
38 2020-08-10 32 30 2 2020.Aug.2
39 2020-08-17 33 30 3 2020.Aug.3
40 2020-08-24 34 30 4 2020.Aug.4
41 2020-08-31 35 30 5 2020.Aug.5
42 2020-09-07 36 35 1 2020.Sep.1
43 2020-09-14 37 35 2 2020.Sep.2
44 2020-09-21 38 35 3 2020.Sep.3
45 2020-09-28 39 35 4 2020.Sep.4
46 2020-10-05 40 39 1 2020.Oct.1
47 2020-10-12 41 39 2 2020.Oct.2
48 2020-10-19 42 39 3 2020.Oct.3
49 2020-10-26 43 39 4 2020.Oct.4
50 2020-11-02 44 43 1 2020.Nov.1
51 2020-11-09 45 43 2 2020.Nov.2
52 2020-11-16 46 43 3 2020.Nov.3
53 2020-11-23 47 43 4 2020.Nov.4
54 2020-11-30 48 43 5 2020.Nov.5
55 2020-12-07 49 48 1 2020.Dec.1
56 2020-12-14 50 48 2 2020.Dec.2
57 2020-12-21 51 48 3 2020.Dec.3
58 2020-12-28 52 48 4 2020.Dec.4
59 2021-01-04 1 0 1 2021.Jan.1
60 2021-01-11 2 0 2 2021.Jan.2
61 2021-01-18 3 0 3 2021.Jan.3
62 2021-01-25 4 0 4 2021.Jan.4
63 2021-02-01 5 5 0 2021.Feb.0
64 2021-02-08 6 5 1 2021.Feb.1
65 2021-02-15 7 5 2 2021.Feb.2
I used datetime.timedelta to do this. It is supposed to work for all start and end dates.
import datetime
import math

date_1 = '18-11-19'
first_date = datetime.datetime.strptime(date_1, '%d-%m-%y')
date_2 = '19-02-21'
last_date = datetime.datetime.strptime(date_2, '%d-%m-%y')
start_week_m = math.ceil(int(first_date.strftime("%d")) / 7)  # week number within the first month
daysTill_nextWeek = 7 - int(first_date.strftime("%w"))  # number of days to the next Sunday
date_template = '%Y.%b.'
tempdate = first_date
weeks = ['%s%d' % (tempdate.strftime(date_template), start_week_m)]
tempdate = tempdate + datetime.timedelta(days=daysTill_nextWeek)  # tempdate becomes the next Sunday
while tempdate < last_date:
    temp_month = int(tempdate.strftime("%m"))
    weeks.append('%s%d' % (tempdate.strftime(date_template), start_week_m + 1))
    start_week_m += 1
    tempdate = tempdate + datetime.timedelta(days=7)
    if temp_month != int(tempdate.strftime("%m")):
        start_week_m = 0
print(weeks)
prints
['2019.Nov.3', '2019.Nov.4', '2019.Dec.1', '2019.Dec.2', '2019.Dec.3', '2019.Dec.4', '2019.Dec.5', '2020.Jan.1', '2020.Jan.2', '2020.Jan.3', '2020.Jan.4', '2020.Feb.1', '2020.Feb.2', '2020.Feb.3', '2020.Feb.4', '2020.Mar.1', '2020.Mar.2', '2020.Mar.3', '2020.Mar.4', '2020.Mar.5', '2020.Apr.1', '2020.Apr.2', '2020.Apr.3', '2020.Apr.4', '2020.May.1', '2020.May.2', '2020.May.3', '2020.May.4', '2020.May.5', '2020.Jun.1', '2020.Jun.2', '2020.Jun.3', '2020.Jun.4', '2020.Jul.1', '2020.Jul.2', '2020.Jul.3', '2020.Jul.4', '2020.Aug.1', '2020.Aug.2', '2020.Aug.3', '2020.Aug.4', '2020.Aug.5', '2020.Sep.1', '2020.Sep.2', '2020.Sep.3', '2020.Sep.4', '2020.Oct.1', '2020.Oct.2', '2020.Oct.3', '2020.Oct.4', '2020.Nov.1', '2020.Nov.2', '2020.Nov.3', '2020.Nov.4', '2020.Nov.5', '2020.Dec.1', '2020.Dec.2', '2020.Dec.3', '2020.Dec.4', '2021.Jan.1', '2021.Jan.2', '2021.Jan.3', '2021.Jan.4', '2021.Jan.5', '2021.Feb.1', '2021.Feb.2']

Nested for loop python pandas not functioning as desired

Code to generate random database for question (minimum reproducible issue):
df_random = pd.DataFrame(np.random.random((2000,3)))
df_random['order_date'] = pd.date_range(start='1/1/2015',
periods=len(df_random), freq='D')
df_random['customer_id'] = np.random.randint(1, 20, df_random.shape[0])
df_random
Output df_random
0 1 2 order_date customer_id
0 0.018473 0.970257 0.605428 2015-01-01 12
... ... ... ... ... ...
1999 0.800139 0.746605 0.551530 2020-06-22 11
Code to extract the mean number of days between transactions, month and year wise
for y in (2015,2019):
    for x in (1,13):
        df2 = df_random[(df_random['order_date'].dt.month == x) & (df_random['order_date'].dt.year == y)]
        df2.sort_values(['customer_id', 'order_date'], inplace=True)
        df2["days"] = df2.groupby("customer_id")["order_date"].apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D"))
        df_mean = round(df2['days'].mean(), 2)
        data2 = data.append(pd.DataFrame({'Mean': df_mean, 'Month': x, 'Year': y}, index=[0]), ignore_index=True)
print(data2)
Expected output
Mean Month Year
0 5.00 1 2015
.......................
11 6.62 12 2015
..............Mean values of days after which one transaction occurs in order_date for years 2016 and 2017 Jan to Dec
36 6.03 1 2018
..........................
47 6.76 12 2018
48 8.40 1 2019
.......................
48 8.40 12 2019
Basically I want a single dataframe running from Jan 2015 through Dec 2019.
Instead of the expected output, I am getting a dataframe from Jan 2015 to Dec 2018, then Jan 2015 data again, and then the entire dataset repeats from 2015 to 2018 many more times.
Please help
Try this:
data2 = pd.DataFrame([])
for y in range(2015, 2020):
    for x in range(1, 13):
        df2 = df_random[(df_random['order_date'].dt.month == x) & (df_random['order_date'].dt.year == y)]
        df_mean = df2.groupby("customer_id")["order_date"].apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D")).mean().round(2)
        data2 = data2.append(pd.DataFrame({'Mean': df_mean, 'Month': x, 'Year': y}, index=[0]), ignore_index=True)
print(data2)
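The key change is range: in the original code, `for y in (2015,2019)` iterates over a two-element tuple, not over the years in between, which is likely part of what produced the odd output. A minimal demonstration:

```python
# A tuple literal yields only its own elements; range yields the full span.
print(list((2015, 2019)))       # [2015, 2019]
print(list(range(2015, 2020)))  # [2015, 2016, 2017, 2018, 2019]
```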
Try this :
df_random.order_date = pd.to_datetime(df_random.order_date)
df_random = df_random.set_index(pd.DatetimeIndex(df_random['order_date']))
output = df_random.groupby(pd.Grouper(freq="M"))[[0,1,2]].agg(np.mean).reset_index()
output['month'] = output.order_date.dt.month
output['year'] = output.order_date.dt.year
output = output.drop('order_date', axis=1)
output
Output
0 1 2 month year
0 0.494818 0.476514 0.496059 1 2015
1 0.451611 0.437638 0.536607 2 2015
2 0.476262 0.567519 0.528129 3 2015
3 0.519229 0.475887 0.612433 4 2015
4 0.464781 0.430593 0.445455 5 2015
... ... ... ... ... ...
61 0.416540 0.564928 0.444234 2 2020
62 0.553787 0.423576 0.422580 3 2020
63 0.524872 0.470346 0.560194 4 2020
64 0.530440 0.469957 0.566077 5 2020
65 0.584474 0.487195 0.557567 6 2020
Avoid any looping and simply include year and month in the groupby calculation:
np.random.seed(1022020)
...
# ASSIGN MONTH AND YEAR COLUMNS, THEN SORT COLUMNS
df_random = (df_random.assign(month = lambda x: x['order_date'].dt.month,
year = lambda x: x['order_date'].dt.year)
.sort_values(['customer_id', 'order_date']))
# GROUP BY CALCULATION
df_random["days"] = (df_random.groupby(["customer_id", "year", "month"])["order_date"]
.apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D")))
# FINAL MEAN AGGREGATION BY YEAR AND MONTH
final_df = (df_random.groupby(["year", "month"], as_index=False)["days"].mean().round(2)
.rename(columns={"days":"mean"}))
print(final_df.head())
# year month mean
# 0 2015 1 8.43
# 1 2015 2 5.87
# 2 2015 3 4.88
# 3 2015 4 10.43
# 4 2015 5 8.12
print(final_df.tail())
# year month mean
# 61 2020 2 8.27
# 62 2020 3 8.41
# 63 2020 4 8.81
# 64 2020 5 9.12
# 65 2020 6 7.00
For multiple aggregates, replace the single groupby.mean() with groupby.agg():
final_df = (df_random.groupby(["year", "month"], as_index=False)["days"]
.agg(['count', 'min', 'mean', 'median', 'max'])
.rename(columns={"days":"mean"}))
print(final_df.head())
# count min mean median max
# year month
# 2015 1 14 1.0 8.43 5.0 25.0
# 2 15 1.0 5.87 5.0 17.0
# 3 16 1.0 4.88 5.0 9.0
# 4 14 1.0 10.43 7.5 23.0
# 5 17 2.0 8.12 8.0 17.0
print(final_df.tail())
# count min mean median max
# year month
# 2020 2 15 1.0 8.27 6.0 21.0
# 3 17 1.0 8.41 7.0 16.0
# 4 16 1.0 8.81 7.0 20.0
# 5 16 1.0 9.12 7.0 22.0
# 6 7 2.0 7.00 7.0 17.0
