Change index in a pandas dataframe and add additional time column - python

I have a pandas dataframe that currently has no specific index (thus when printing, an automatic index is created which begins with 0). Now I would like to have a "timeslot" index that begins with 1 and an additional "time of the day" column in the dataframe. Here you can see a screenshot of what the output CSV should look like. Can you tell me how to do this?

Try with pd.date_range:
df['time of day'] = pd.date_range('1970-1-1', periods=len(df), freq='H') \
                      .strftime('%H:%M')
Setup:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 50, (30, 2)), columns=['Column 1', 'Column 2'])
df.insert(0, 'time of day', pd.date_range('1970-1-1', periods=len(df), freq='H').strftime('%H:%M'))
df.index.name = 'timeslot'
df.index += 1
print(df)
# Output:
time of day Column 1 Column 2
timeslot
1 00:00 43 33
2 01:00 20 11
3 02:00 40 10
4 03:00 19 28
5 04:00 10 27
6 05:00 27 10
7 06:00 1 10
8 07:00 33 36
9 08:00 32 2
10 09:00 23 32
11 10:00 1 17
12 11:00 48 42
13 12:00 21 3
14 13:00 48 28
15 14:00 41 46
16 15:00 48 43
17 16:00 47 6
18 17:00 33 21
19 18:00 38 19
20 19:00 17 40
21 20:00 8 24
22 21:00 28 22
23 22:00 2 13
24 23:00 24 3
25 00:00 4 1
26 01:00 8 9
27 02:00 19 36
28 03:00 30 36
29 04:00 43 39
30 05:00 43 3
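Since the goal is a CSV file, note that the named index round-trips through to_csv as the first header field; a minimal sketch of the setup above (writing to a string here rather than a file):

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the setup above, then round-trip through to_csv.
df = pd.DataFrame(np.random.randint(1, 50, (30, 2)), columns=['Column 1', 'Column 2'])
df.insert(0, 'time of day',
          pd.date_range('1970-1-1', periods=len(df), freq='H').strftime('%H:%M'))
df.index = pd.RangeIndex(1, len(df) + 1, name='timeslot')
csv_text = df.to_csv()  # the index name 'timeslot' becomes the first header field
```

Passing a path instead of nothing (`df.to_csv('output.csv')`) writes the same text to disk.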

Assuming your dataframe is df:
df['time of day'] = df.index.astype(str).str.rjust(2, '0')+':00'
df.index += 1
if there are more than 24 rows:
df['time of day'] = (df.index%24).astype(str).str.rjust(2, '0')+':00'
df.index += 1
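Putting the modulo variant together on a hypothetical 30-row frame (so the clock wraps past midnight):

```python
import numpy as np
import pandas as pd

# 30 rows of dummy data, mirroring the question's setup
df = pd.DataFrame(np.random.randint(1, 50, (30, 2)), columns=['Column 1', 'Column 2'])
# index % 24 wraps the hour back to 0 after 23:00
df['time of day'] = (df.index % 24).astype(str).str.rjust(2, '0') + ':00'
df.index += 1
df.index.name = 'timeslot'
```

Timeslot 1 gets 00:00, timeslot 24 gets 23:00, and timeslot 25 wraps back to 00:00.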


create multiple columns from datetime object function

local_time
5398 2019-02-14 14:35:42+01:00
5865 2021-09-22 04:28:53+02:00
6188 2018-05-04 09:34:53+02:00
6513 2019-11-09 15:54:51+01:00
6647 2019-09-18 09:25:43+02:00
df_with_local_time['local_time'].loc[6647] returns
datetime.datetime(2019, 9, 18, 9, 25, 43, tzinfo=<DstTzInfo 'Europe/Oslo' CEST+2:00:00 DST>)
Based on the column, I would like to generate multiple date-related columns:
def datelike_variables(i):
    year = i.year
    month = i.month
    #dayofweek = i.dayofweek
    day = i.day
    hour = i.hour
    return year, month, day, hour
df_with_local_time[['year','month','day','hour']]=df_with_local_time['local_time'].apply(datelike_variables,axis=1,result_type="expand")
returns TypeError: datelike_variables() got an unexpected keyword argument 'result_type'
Expected result:
local_time year month day hour
5398 2019-02-14 14:35:42+01:00 2019 02 14 14
5865 2021-09-22 04:28:53+02:00 2021 09 22 04
6188 2018-05-04 09:34:53+02:00 2018 05 04 09
6513 2019-11-09 15:54:51+01:00 2019 11 09 15
6647 2019-09-18 09:25:43+02:00 2019 09 18 09
The error is because you use Series.apply, which has no result_type parameter:
def datelike_variables(i):
    year = i.year
    month = i.month
    #dayofweek = i.dayofweek
    day = i.day
    hour = i.hour
    return pd.Series([year, month, day, hour])
df_with_local_time[['year','month','day','hour']]=df_with_local_time['local_time'].apply(datelike_variables)
print (df_with_local_time)
local_time year month day hour
5398 2019-02-14 14:35:42+01:00 2019 2 14 14
5865 2021-09-22 04:28:53+02:00 2021 9 22 4
6188 2018-05-04 09:34:53+02:00 2018 5 4 9
6513 2019-11-09 15:54:51+01:00 2019 11 9 15
6647 2019-09-18 09:25:43+02:00 2019 9 18 9
Your solution is possible with a lambda function in DataFrame.apply:
def datelike_variables(i):
    year = i.year
    month = i.month
    #dayofweek = i.dayofweek
    day = i.day
    hour = i.hour
    return year, month, day, hour
df_with_local_time[['year','month','day','hour']]=df_with_local_time.apply(lambda x: datelike_variables(x['local_time']), axis=1,result_type="expand")
print (df_with_local_time)
local_time year month day hour
5398 2019-02-14 14:35:42+01:00 2019 2 14 14
5865 2021-09-22 04:28:53+02:00 2021 9 22 4
6188 2018-05-04 09:34:53+02:00 2018 5 4 9
6513 2019-11-09 15:54:51+01:00 2019 11 9 15
6647 2019-09-18 09:25:43+02:00 2019 9 18 9
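For completeness: when the column holds a single uniform offset, the .dt accessor extracts all four parts with no apply at all. A sketch on two made-up timestamps (both +01:00, since mixed offsets would not form a tz-aware column):

```python
import pandas as pd

# Hypothetical tz-aware column with one uniform offset
df = pd.DataFrame({'local_time': pd.to_datetime(
    ['2019-02-14 14:35:42+01:00', '2019-11-09 15:54:51+01:00'])})

# Vectorized extraction, no Python-level function calls per row
df['year'] = df['local_time'].dt.year
df['month'] = df['local_time'].dt.month
df['day'] = df['local_time'].dt.day
df['hour'] = df['local_time'].dt.hour
```

This is usually both shorter and faster than either apply variant.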

Group columns based on the headers if they are found in the same list. Pandas Python

So I have a data frame that is something like this
Resource 2020-06-01 2020-06-02 2020-06-03
Name1 8 7 8
Name2 7 9 9
Name3 10 10 10
Imagine that the header is literal all the days of the month. And that there are way more names than just three.
I need to reduce the columns to five: the first covering the days from 2020-06-01 to 2020-06-05, then Saturday through Friday of each following week (or through the last day of the month if it falls before Friday). So for June the weeks would be:
week 1: 2020-06-01 to 2020-06-05
week 2: 2020-06-06 to 2020-06-12
week 3: 2020-06-13 to 2020-06-19
week 4: 2020-06-20 to 2020-06-26
week 5: 2020-06-27 to 2020-06-30
I have no problem defining these weeks. The problem is grouping the columns based on them.
I couldn't come up with anything.
Does someone have any ideas about this?
I had to use this code to generate your dataframe.
import numpy as np
import pandas as pd

dates = pd.date_range(start='2020-06-01', end='2020-06-30')
df = pd.DataFrame({
    'Name1': np.random.randint(1, 10, size=len(dates)),
    'Name2': np.random.randint(1, 10, size=len(dates)),
    'Name3': np.random.randint(1, 10, size=len(dates)),
})
df = df.set_index(dates).transpose().reset_index().rename(columns={'index': 'Resource'})
Then, the solution starts from here.
# Set the first column as index
df = df.set_index(df['Resource'])
# Remove the unused column
df = df.drop(columns=['Resource'])
# Transpose the dataframe
df = df.transpose()
# Output:
Resource Name1 Name2 Name3
2020-06-01 00:00:00 3 2 7
2020-06-02 00:00:00 5 6 8
2020-06-03 00:00:00 2 3 6
...
# Bring "Resource" from index to column
df = df.reset_index()
df = df.rename(columns={'index': 'Resource'})
# Add a column "week of year"
df['week_no'] = df['Resource'].dt.isocalendar().week  # .dt.weekofyear is deprecated in newer pandas
# You can simply group by the week no column
df.groupby('week_no').sum().reset_index()
# Output:
Resource week_no Name1 Name2 Name3
0 23 38 42 41
1 24 37 30 43
2 25 38 29 23
3 26 29 40 42
4 27 2 8 3
I don't know what you want to do next. If you want your original form, just transpose() it back.
EDIT: OP clarified that the week should start on Saturday and end on Friday
# 0: Monday
# 1: Tuesday
# 2: Wednesday
# 3: Thursday
# 4: Friday
# 5: Saturday
# 6: Sunday
df['weekday'] = df['Resource'].dt.weekday.apply(lambda day: 0 if day <= 4 else 1)
df['customised_weekno'] = df['week_no'] + df['weekday']
Output:
Resource Resource Name1 Name2 Name3 week_no weekday customised_weekno
0 2020-06-01 4 7 7 23 0 23
1 2020-06-02 8 6 7 23 0 23
2 2020-06-03 5 9 5 23 0 23
3 2020-06-04 7 6 5 23 0 23
4 2020-06-05 6 3 7 23 0 23
5 2020-06-06 3 7 6 23 1 24
6 2020-06-07 5 4 4 23 1 24
7 2020-06-08 8 1 5 24 0 24
8 2020-06-09 2 7 9 24 0 24
9 2020-06-10 4 2 7 24 0 24
10 2020-06-11 6 4 4 24 0 24
11 2020-06-12 9 5 7 24 0 24
12 2020-06-13 2 4 6 24 1 25
13 2020-06-14 6 7 5 24 1 25
14 2020-06-15 8 7 7 25 0 25
15 2020-06-16 4 3 3 25 0 25
16 2020-06-17 6 4 5 25 0 25
17 2020-06-18 6 8 2 25 0 25
18 2020-06-19 3 1 2 25 0 25
So, you can use customised_weekno for grouping.
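A condensed, self-contained sketch of the customised-week idea above (June 2020 only, with .dt.isocalendar().week standing in for the deprecated weekofyear):

```python
import pandas as pd

# Hypothetical daily frame for June 2020; Sat/Sun roll into the next ISO week
dates = pd.date_range('2020-06-01', '2020-06-30')
df = pd.DataFrame({'Resource': dates, 'value': 1})

week_no = df['Resource'].dt.isocalendar().week.astype(int)
shift = df['Resource'].dt.weekday.ge(5).astype(int)  # weekday 5/6 = Sat/Sun
df['customised_weekno'] = week_no + shift

# days falling into each customised week
days_per_week = df.groupby('customised_weekno')['value'].sum()
```

The first custom week (23) then covers Mon 2020-06-01 through Fri 2020-06-05, as required.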

Python - Creating list of week numbers of months using dates

I have got a start date ('2019-11-18') and an end date ('2021-02-19'). I am trying to create a list of all the weeks of each month that exist between the start and end date. My expected result should be something like this:
list = ['2019.Nov.3','2019.Nov.4', '2019.Nov.5' .... '2021.Feb.2','2021.Feb.3']
If the first or last date of a month lands on a Wednesday, I will assume that the week belongs to that month (as 3 out of the 5 working days will belong to it).
I was actually successful in creating a dataframe with all the weeks of the year that exist between the start and end date using the following code:
date_1 = '18-11-19'
first_date = datetime.strptime(date_1, '%d-%m-%y')
date_2 = '19-02-21'
last_date = datetime.strptime(date_2, '%d-%m-%y')
timeline = pd.DataFrame(columns=['Year', 'Week'])
def create_list(df):
    start_year = int(first_date.isocalendar()[0])
    start_week = int(first_date.isocalendar()[1])
    end_year = int(last_date.isocalendar()[0])
    end_week = int(last_date.isocalendar()[1])
    while start_year < (end_year + 1):
        if start_year == end_year:
            while start_week < (end_week + 1):
                if len(str(start_week)) == 1:
                    week = f'{start_year}' + '.0' + f'{start_week}'
                else:
                    week = f'{start_year}' + '.' + f'{start_week}'
                df = df.append({'Year': start_year, 'Week': week}, ignore_index=True)
                start_week += 1
        else:
            while start_week < 53:
                if len(str(start_week)) == 1:
                    week = f'{start_year}' + '.0' + f'{start_week}'
                else:
                    week = f'{start_year}' + '.' + f'{start_week}'
                df = df.append({'Year': start_year, 'Week': week}, ignore_index=True)
                start_week += 1
        start_year += 1
        start_week = 1
    return df
timeline = create_list(timeline)
I was successfully able to use this as an x-axis for my line graph. However, the axis is a bit hard to read and it's very difficult to know which week belongs to which month.
I would really appreciate if someone can give me a hand with this!
Edit:
So here is the solution with the guidance of @Serge Ballesta. I hope it helps anyone who might need something similar in the future!
import pandas as pd
import dateutil.relativedelta
from datetime import datetime
def year_week(date):
    if len(str(date.isocalendar()[1])) == 1:
        return f'{date.isocalendar()[0]}' + '.0' + f'{date.isocalendar()[1]}'
    else:
        return f'{date.isocalendar()[0]}' + '.' + f'{date.isocalendar()[1]}'
date_1 = '18-11-19'
first_date = datetime.strptime(date_1, '%d-%m-%y')
date_2 = '19-02-21'
last_date = datetime.strptime(date_2, '%d-%m-%y')
set_first_date = str((first_date - dateutil.relativedelta.relativedelta(months=1)).date())
set_last_date = str((last_date + dateutil.relativedelta.relativedelta(months=1)).date())
s = pd.date_range(set_first_date, set_last_date, freq='W-WED'
).to_series(name='wed').reset_index(drop=True)
df = s.to_frame()
df['week'] = df.apply(lambda x: year_week(x['wed']), axis=1)
df = df.assign(week_of_month=s.groupby(s.dt.strftime('%Y%m')
).cumcount() + 1)
df = df[(s >= pd.Timestamp('2019-11-18'))
& (s <= pd.Timestamp('2021-02-19'))]
df['month_week'] = (df['wed'].dt.strftime('%Y.%b.') + df['week_of_month'].astype(str)).tolist()
df = df.drop(['wed', 'week_of_month'], axis = 1)
print (df)
Printed df:
week month_week
4 2019.47 2019.Nov.3
5 2019.48 2019.Nov.4
6 2019.49 2019.Dec.1
7 2019.50 2019.Dec.2
8 2019.51 2019.Dec.3
.. ... ...
65 2021.03 2021.Jan.3
66 2021.04 2021.Jan.4
67 2021.05 2021.Feb.1
68 2021.06 2021.Feb.2
69 2021.07 2021.Feb.3
I would build a Series of timestamps with a frequency of W-WED to have consistently Wednesday as day of week. That way, we immediately get the correct month for the week.
To have the number of the week in the month, I would start one month before the required start, and use a cumcount on year-month + 1. Then it would be enough to filter only the expected range and properly format the values:
# produce a series of wednesdays starting in 2019-10-01
s = pd.date_range('2019-10-01', '2021-03-31', freq='W-WED'
).to_series(name='wed').reset_index(drop=True)
# compute the week number in the month
df = s.to_frame().assign(week_of_month=s.groupby(s.dt.strftime('%Y%m')
).cumcount() + 1)
# filter the required range
df = df[(s >= pd.Timestamp('2019-11-18'))
& (s <= pd.Timestamp('2021-02-19'))]
# here is the expected list
lst = (df['wed'].dt.strftime('%Y.%b.')+df['week_of_month'].astype(str)).tolist()
lst is as expected:
['2019.Nov.3', '2019.Nov.4', '2019.Dec.1', '2019.Dec.2', '2019.Dec.3', '2019.Dec.4',
'2020.Jan.1', '2020.Jan.2', '2020.Jan.3', '2020.Jan.4', '2020.Jan.5', '2020.Feb.1',
'2020.Feb.2', '2020.Feb.3', '2020.Feb.4', '2020.Mar.1', '2020.Mar.2', '2020.Mar.3',
'2020.Mar.4', '2020.Apr.1', '2020.Apr.2', '2020.Apr.3', '2020.Apr.4', '2020.Apr.5',
'2020.May.1', '2020.May.2', '2020.May.3', '2020.May.4', '2020.Jun.1', '2020.Jun.2',
'2020.Jun.3', '2020.Jun.4', '2020.Jul.1', '2020.Jul.2', '2020.Jul.3', '2020.Jul.4',
'2020.Jul.5', '2020.Aug.1', '2020.Aug.2', '2020.Aug.3', '2020.Aug.4', '2020.Sep.1',
'2020.Sep.2', '2020.Sep.3', '2020.Sep.4', '2020.Sep.5', '2020.Oct.1', '2020.Oct.2',
'2020.Oct.3', '2020.Oct.4', '2020.Nov.1', '2020.Nov.2', '2020.Nov.3', '2020.Nov.4',
'2020.Dec.1', '2020.Dec.2', '2020.Dec.3', '2020.Dec.4', '2020.Dec.5', '2021.Jan.1',
'2021.Jan.2', '2021.Jan.3', '2021.Jan.4', '2021.Feb.1', '2021.Feb.2', '2021.Feb.3']
This may not give you exactly what you need (because of the "3 out of the 5 working days" condition for border weeks), but maybe you can get an idea of how to tweak it to get your desired result.
You can export column res to a list with df['res'].to_list()
df = pd.DataFrame({'date': pd.date_range('2019-11-18','2021-02-19', freq=pd.offsets.Week(weekday=0))})
df['year_wk']= df.date.apply(lambda x: x.strftime("%W")).astype(int)
df['mon_beg_wk']= df.date.dt.to_period('M').dt.to_timestamp().dt.strftime("%W").astype(int)
df['mon_wk']= df['year_wk']-df['mon_beg_wk']
df['res']= df['date'].dt.strftime("%Y.%b")+'.'+df['mon_wk'].astype(str)
df
Output
date year_wk mon_beg_wk mon_wk res
0 2019-11-18 46 43 3 2019.Nov.3
1 2019-11-25 47 43 4 2019.Nov.4
2 2019-12-02 48 47 1 2019.Dec.1
3 2019-12-09 49 47 2 2019.Dec.2
4 2019-12-16 50 47 3 2019.Dec.3
5 2019-12-23 51 47 4 2019.Dec.4
6 2019-12-30 52 47 5 2019.Dec.5
7 2020-01-06 1 0 1 2020.Jan.1
8 2020-01-13 2 0 2 2020.Jan.2
9 2020-01-20 3 0 3 2020.Jan.3
10 2020-01-27 4 0 4 2020.Jan.4
11 2020-02-03 5 4 1 2020.Feb.1
12 2020-02-10 6 4 2 2020.Feb.2
13 2020-02-17 7 4 3 2020.Feb.3
14 2020-02-24 8 4 4 2020.Feb.4
15 2020-03-02 9 8 1 2020.Mar.1
16 2020-03-09 10 8 2 2020.Mar.2
17 2020-03-16 11 8 3 2020.Mar.3
18 2020-03-23 12 8 4 2020.Mar.4
19 2020-03-30 13 8 5 2020.Mar.5
20 2020-04-06 14 13 1 2020.Apr.1
21 2020-04-13 15 13 2 2020.Apr.2
22 2020-04-20 16 13 3 2020.Apr.3
23 2020-04-27 17 13 4 2020.Apr.4
24 2020-05-04 18 17 1 2020.May.1
25 2020-05-11 19 17 2 2020.May.2
26 2020-05-18 20 17 3 2020.May.3
27 2020-05-25 21 17 4 2020.May.4
28 2020-06-01 22 22 0 2020.Jun.0
29 2020-06-08 23 22 1 2020.Jun.1
... ... ... ... ... ...
36 2020-07-27 30 26 4 2020.Jul.4
37 2020-08-03 31 30 1 2020.Aug.1
38 2020-08-10 32 30 2 2020.Aug.2
39 2020-08-17 33 30 3 2020.Aug.3
40 2020-08-24 34 30 4 2020.Aug.4
41 2020-08-31 35 30 5 2020.Aug.5
42 2020-09-07 36 35 1 2020.Sep.1
43 2020-09-14 37 35 2 2020.Sep.2
44 2020-09-21 38 35 3 2020.Sep.3
45 2020-09-28 39 35 4 2020.Sep.4
46 2020-10-05 40 39 1 2020.Oct.1
47 2020-10-12 41 39 2 2020.Oct.2
48 2020-10-19 42 39 3 2020.Oct.3
49 2020-10-26 43 39 4 2020.Oct.4
50 2020-11-02 44 43 1 2020.Nov.1
51 2020-11-09 45 43 2 2020.Nov.2
52 2020-11-16 46 43 3 2020.Nov.3
53 2020-11-23 47 43 4 2020.Nov.4
54 2020-11-30 48 43 5 2020.Nov.5
55 2020-12-07 49 48 1 2020.Dec.1
56 2020-12-14 50 48 2 2020.Dec.2
57 2020-12-21 51 48 3 2020.Dec.3
58 2020-12-28 52 48 4 2020.Dec.4
59 2021-01-04 1 0 1 2021.Jan.1
60 2021-01-11 2 0 2 2021.Jan.2
61 2021-01-18 3 0 3 2021.Jan.3
62 2021-01-25 4 0 4 2021.Jan.4
63 2021-02-01 5 5 0 2021.Feb.0
64 2021-02-08 6 5 1 2021.Feb.1
65 2021-02-15 7 5 2 2021.Feb.2
I used datetime.timedelta to do this. It is supposed to work for all start and end dates.
import datetime
import math
date_1 = '18-11-19'
first_date = datetime.datetime.strptime(date_1, '%d-%m-%y')
date_2 = '19-02-21'
last_date = datetime.datetime.strptime(date_2, '%d-%m-%y')
start_week_m = math.ceil(int(first_date.strftime("%d")) / 7)  # Week number within the first month
daysTill_nextWeek = 7 - int(first_date.strftime("%w"))  # Number of days to the next Sunday
date_template = '%Y.%b.'
tempdate = first_date
weeks = ['%s%d' % (tempdate.strftime(date_template), start_week_m)]
tempdate = tempdate + datetime.timedelta(days=daysTill_nextWeek)  # tempdate becomes the next Sunday
while tempdate < last_date:
    temp_year, temp_month = int(tempdate.strftime("%Y")), int(tempdate.strftime("%m"))
    print(start_week_m)
    weeks.append('%s%d' % (tempdate.strftime(date_template), start_week_m + 1))
    start_week_m += 1
    tempdate = tempdate + datetime.timedelta(days=7)
    if temp_month != int(tempdate.strftime("%m")):
        print(temp_year, int(tempdate.strftime("%Y")))
        start_week_m = 0
print(weeks)
prints
['2019.Nov.3', '2019.Nov.4', '2019.Dec.1', '2019.Dec.2', '2019.Dec.3', '2019.Dec.4', '2019.Dec.5', '2020.Jan.1', '2020.Jan.2', '2020.Jan.3', '2020.Jan.4', '2020.Feb.1', '2020.Feb.2', '2020.Feb.3', '2020.Feb.4', '2020.Mar.1', '2020.Mar.2', '2020.Mar.3', '2020.Mar.4', '2020.Mar.5', '2020.Apr.1', '2020.Apr.2', '2020.Apr.3', '2020.Apr.4', '2020.May.1', '2020.May.2', '2020.May.3', '2020.May.4', '2020.May.5', '2020.Jun.1', '2020.Jun.2', '2020.Jun.3', '2020.Jun.4', '2020.Jul.1', '2020.Jul.2', '2020.Jul.3', '2020.Jul.4', '2020.Aug.1', '2020.Aug.2', '2020.Aug.3', '2020.Aug.4', '2020.Aug.5', '2020.Sep.1', '2020.Sep.2', '2020.Sep.3', '2020.Sep.4', '2020.Oct.1', '2020.Oct.2', '2020.Oct.3', '2020.Oct.4', '2020.Nov.1', '2020.Nov.2', '2020.Nov.3', '2020.Nov.4', '2020.Nov.5', '2020.Dec.1', '2020.Dec.2', '2020.Dec.3', '2020.Dec.4', '2021.Jan.1', '2021.Jan.2', '2021.Jan.3', '2021.Jan.4', '2021.Jan.5', '2021.Feb.1', '2021.Feb.2']

Count different consecutive rows after grouping by a column

I'm trying to sum the number of different consecutive rows for each customer.
So my data looks like this dummy one:
df = pd.DataFrame({'Customer': ['A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B'],
                   'Time': ['00:00','01:00','02:00','03:00','04:00','05:00','06:00','07:00',
                            '00:00','01:00','02:00','03:00','04:00','05:00','06:00','07:00'],
                   'Lat': [20,20,30,30,30,40,20,20,20,20,30,30,30,40,20,20],
                   'Lon': [40,40,50,50,50,60,40,40,40,40,50,50,50,60,40,40]})
Customer Time Lat Lon
0 A 00:00 20 40
1 A 01:00 20 40
2 A 02:00 30 50
3 A 03:00 30 50
4 A 04:00 30 50
5 A 05:00 40 60
6 A 06:00 20 40
7 A 07:00 20 40
8 B 00:00 20 40
9 B 01:00 20 40
10 B 02:00 30 50
11 B 03:00 30 50
12 B 04:00 30 50
13 B 05:00 40 60
14 B 06:00 20 40
15 B 07:00 20 40
And I want to count the number of different rows (according to both Lat and Lon) by customer that aren't consecutive. So, in the example it would return 4 for both customers even though there are only 3 different pairs of Lat and Lon.
This:
test = (df['Lat'] != df['Lat'].shift(1)).values.sum()
Takes care of only one column and doesn't group by Customer.
But I can't seem to do
df[['Lat','Lon']] != df[['Lat','Lon']]
it gives:
ValueError: Wrong number of items passed 2, placement implies 1
or group by Customer. Can somebody help?
I am using shift to create a new key, then drop_duplicates:
df['key']=df.groupby('Customer').apply(lambda x : x[['Lat','Lon']].ne(x[['Lat','Lon']].shift()).all(1).cumsum()).reset_index(level=0,drop=True)
df.drop_duplicates(['Customer','key'])
Customer Time Lat Lon key
0 A 00:00 20 40 1
2 A 02:00 30 50 2
5 A 05:00 40 60 3
6 A 06:00 20 40 4
8 B 00:00 20 40 1
10 B 02:00 30 50 2
13 B 05:00 40 60 3
14 B 06:00 20 40 4
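The question asked for counts, so one extra groupby/size step finishes the job. A sketch rebuilt on the dummy data, using a per-group shift for the keying (equivalent to the apply above, just without groupby-apply):

```python
import pandas as pd

# Dummy data from the question, Time column omitted for brevity
df = pd.DataFrame({'Customer': ['A']*8 + ['B']*8,
                   'Lat': [20,20,30,30,30,40,20,20]*2,
                   'Lon': [40,40,50,50,50,60,40,40]*2})

# True where (Lat, Lon) differs from the previous row of the same customer
changed = df[['Lat', 'Lon']].ne(df.groupby('Customer')[['Lat', 'Lon']].shift()).all(axis=1)
# Running count of "blocks" per customer, same role as the key column above
df['key'] = changed.astype(int).groupby(df['Customer']).cumsum()

counts = df.drop_duplicates(['Customer', 'key']).groupby('Customer').size()
```

For the dummy data this yields 4 for both customers, matching the expected result.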
IIUC,
df.groupby('Customer')[['Lat', 'Lon']].apply(lambda s: s.diff().ne(0).all(1).sum())
Customer
A 4
B 4
dtype: int64

pandas multiple date ranges from column of dates

Current df:
ID Date
11 3/19/2018
22 1/5/2018
33 2/12/2018
.. ..
I have the df with ID and Date. ID is unique in the original df.
I would like to create a new df based on date. Each ID has a max Date; I would like to use that date and go back 4 days (5 rows per ID).
There are thousands of IDs.
Expect to get:
ID Date
11 3/15/2018
11 3/16/2018
11 3/17/2018
11 3/18/2018
11 3/19/2018
22 1/1/2018
22 1/2/2018
22 1/3/2018
22 1/4/2018
22 1/5/2018
33 2/8/2018
33 2/9/2018
33 2/10/2018
33 2/11/2018
33 2/12/2018
… …
I tried the following method; I think date_range might be the right direction, but I keep getting an error.
pd.date_range
def date_list(row):
    list = pd.date_range(row["Date"], periods=5)
    return list
df["Date_list"] = df.apply(date_list, axis = "columns")
Here is another approach, using df.assign to overwrite Date and pd.concat to glue the ranges together. cᴏʟᴅsᴘᴇᴇᴅ's solution wins in performance, but I think this might be a nice addition as it is quite easy to read and understand.
df = pd.concat([df.assign(Date=df.Date - pd.Timedelta(days=i)) for i in range(5)])
Alternative:
dates = (pd.date_range(*x) for x in zip(df['Date']-pd.Timedelta(days=4), df['Date']))
df = (pd.DataFrame(dict(zip(df['ID'],dates)))
.T
.stack()
.reset_index(0)
.rename(columns={'level_0': 'ID', 0: 'Date'}))
Full example:
import pandas as pd
data = '''\
ID Date
11 3/19/2018
22 1/5/2018
33 2/12/2018'''
# Recreate dataframe
from io import StringIO  # pd.compat.StringIO was removed in newer pandas
df = pd.read_csv(StringIO(data), sep='\s+')
df['Date']= pd.to_datetime(df.Date)
df = pd.concat([df.assign(Date=df.Date - pd.Timedelta(days=i)) for i in range(5)])
df.sort_values(by=['ID','Date'], ascending = [True,True], inplace=True)
print(df)
Returns:
ID Date
0 11 2018-03-15
0 11 2018-03-16
0 11 2018-03-17
0 11 2018-03-18
0 11 2018-03-19
1 22 2018-01-01
1 22 2018-01-02
1 22 2018-01-03
1 22 2018-01-04
1 22 2018-01-05
2 33 2018-02-08
2 33 2018-02-09
2 33 2018-02-10
2 33 2018-02-11
2 33 2018-02-12
reindexing with pd.date_range
Let's try creating a flat list of date-ranges and reindexing this DataFrame.
from itertools import chain
v = df.assign(Date=pd.to_datetime(df.Date)).set_index('Date')
# assuming ID is a string column
v.reindex(chain.from_iterable(
pd.date_range(end=i, periods=5) for i in v.index)
).bfill().reset_index()
Date ID
0 2018-03-14 11
1 2018-03-15 11
2 2018-03-16 11
3 2018-03-17 11
4 2018-03-18 11
5 2018-03-19 11
6 2017-12-31 22
7 2018-01-01 22
8 2018-01-02 22
9 2018-01-03 22
10 2018-01-04 22
11 2018-01-05 22
12 2018-02-07 33
13 2018-02-08 33
14 2018-02-09 33
15 2018-02-10 33
16 2018-02-11 33
17 2018-02-12 33
concat based solution on keys
Just for fun. My reindex solution is definitely more performant and easier to read, so if you were to pick one, use that.
v = df.assign(Date=pd.to_datetime(df.Date))
v_dict = {
j : pd.DataFrame(
pd.date_range(end=i, periods=5), columns=['Date']
)
for j, i in zip(v.ID, v.Date)
}
(pd.concat(v_dict, axis=0)
.reset_index(level=1, drop=True)
.rename_axis('ID')
.reset_index()
)
ID Date
0 11 2018-03-14
1 11 2018-03-15
2 11 2018-03-16
3 11 2018-03-17
4 11 2018-03-18
5 11 2018-03-19
6 22 2017-12-31
7 22 2018-01-01
8 22 2018-01-02
9 22 2018-01-03
10 22 2018-01-04
11 22 2018-01-05
12 33 2018-02-07
13 33 2018-02-08
14 33 2018-02-09
15 33 2018-02-10
16 33 2018-02-11
17 33 2018-02-12
Group by ID, select the Date column, and for each group generate a series of five days leading up to the greatest date.
Rather than writing a long lambda, I've written a helper function.
def drange(x):
    e = x.max()
    s = e - pd.Timedelta(days=4)
    return pd.Series(pd.date_range(s, e))
res = df.groupby('ID').Date.apply(drange)
Then drop the extraneous level from the resulting multiindex and we get our desired output
res.reset_index(level=0).reset_index(drop=True)
# outputs:
ID Date
0 11 2018-03-15
1 11 2018-03-16
2 11 2018-03-17
3 11 2018-03-18
4 11 2018-03-19
5 22 2018-01-01
6 22 2018-01-02
7 22 2018-01-03
8 22 2018-01-04
9 22 2018-01-05
10 33 2018-02-08
11 33 2018-02-09
12 33 2018-02-10
13 33 2018-02-11
14 33 2018-02-12
Compact alternative
# Helper function to return a Series with the date range
func = lambda x: pd.date_range(x.iloc[0]-pd.Timedelta(days=4), x.iloc[0]).to_series()
res = df.groupby('ID').Date.apply(func).reset_index().drop('level_1',1)
You can try groupby with date_range
df.groupby('ID').Date.apply(lambda x : pd.Series(pd.date_range(end=x.iloc[0],periods=5))).reset_index(level=0)
Out[793]:
ID Date
0 11 2018-03-15
1 11 2018-03-16
2 11 2018-03-17
3 11 2018-03-18
4 11 2018-03-19
0 22 2018-01-01
1 22 2018-01-02
2 22 2018-01-03
3 22 2018-01-04
4 22 2018-01-05
0 33 2018-02-08
1 33 2018-02-09
2 33 2018-02-10
3 33 2018-02-11
4 33 2018-02-12
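On pandas 0.25 or newer, Series.explode offers one more compact alternative not shown above; a sketch on the question's three rows:

```python
import pandas as pd

df = pd.DataFrame({'ID': [11, 22, 33],
                   'Date': pd.to_datetime(['3/19/2018', '1/5/2018', '2/12/2018'])})

# Replace each max date with the list of 5 dates ending on it, then explode to rows
df['Date'] = df['Date'].apply(lambda d: list(pd.date_range(end=d, periods=5)))
out = df.explode('Date').reset_index(drop=True)
```

Each ID now occupies five consecutive rows ending on its original date.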
