Resampling a dataframe into a new one while doing some additional operations - python

I am working with a dataframe where each entry (row) has a start time, a duration and other attributes. I would like to create a new dataframe in which each original entry is split into 15-minute intervals, keeping all other attributes the same. The number of entries in the new dataframe per entry in the old one would depend on the duration of the original entry.
At first I tried pd.resample, but it did not do exactly what I expected. I then wrote a function using itertuples() that works quite well, but it took about half an hour on a dataframe of around 3000 rows. Now I want to do the same for 2 million rows, so I am looking for other possibilities.
Let's say I have the following dataframe:
testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'], 'duration':[22,8,35,2], 'Attribute_A':['abc', 'def', 'hij', 'klm'], 'id': [1,2,3,4]}
testdf = pd.DataFrame(testdict)
testdf.loc[:,['start']] = pd.to_datetime(testdf['start'])
print(testdf)
>>>testdf
start duration Attribute_A id
0 2018-01-05 11:48:00 22 abc 1
1 2018-05-04 09:05:00 8 def 2
2 2018-08-09 07:15:00 35 hij 3
3 2018-09-27 15:00:00 2 klm 4
And I would like my outcome to be like the following:
>>>resultdf
start duration Attribute_A id
0 2018-01-05 11:45:00 12 abc 1
1 2018-01-05 12:00:00 10 abc 1
2 2018-05-04 09:00:00 8 def 2
3 2018-08-09 07:15:00 15 hij 3
4 2018-08-09 07:30:00 15 hij 3
5 2018-08-09 07:45:00 5 hij 3
6 2018-09-27 15:00:00 2 klm 4
This is the function that I built with itertuples which produced the desired result (the one I showed just above this):
def min15_divider(df, newdf):
    for row in df.itertuples():
        orig_min = row.start.minute
        remains = orig_min % 15  # Check if it is already a multiple of 15
        if remains == 0:
            new_time = row.start.replace(second=0)
            if row.duration < 15:  # if it is shorter than 15 min just use that for the duration
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,
                             'duration': row.duration, 'id': row.id}
                newdf = newdf.append(to_append, ignore_index=True)
            else:  # if not, divide that in 15 min intervals until duration is exceeded
                cumu_dur = 15
                while cumu_dur < row.duration:
                    to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'id': row.id}
                    if cumu_dur < 15:
                        to_append['duration'] = cumu_dur
                    else:
                        to_append['duration'] = 15
                    new_time = new_time + pd.Timedelta('15 minutes')
                    cumu_dur = cumu_dur + 15
                    newdf = newdf.append(to_append, ignore_index=True)
                else:  # add the remainder in the last 15 min interval
                    final_dur = row.duration - (cumu_dur - 15)
                    to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'duration': final_dur, 'id': row.id}
                    newdf = newdf.append(to_append, ignore_index=True)
        else:  # When it is not an exact multiple of 15 min
            new_min = orig_min - remains  # convert to a multiple of 15
            new_time = row.start.replace(minute=new_min)
            new_time = new_time.replace(second=0)
            cumu_dur = 15 - remains  # remaining minutes in the initial interval
            while cumu_dur < row.duration:  # divide total in 15 min intervals until duration is exceeded
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'id': row.id}
                if cumu_dur < 15:
                    to_append['duration'] = cumu_dur
                else:
                    to_append['duration'] = 15
                new_time = new_time + pd.Timedelta('15 minutes')
                cumu_dur = cumu_dur + 15
                newdf = newdf.append(to_append, ignore_index=True)
            else:  # last interval reached, or the starting duration was less than the remaining minutes
                if row.duration < 15:
                    final_dur = row.duration  # original duration less than remaining minutes in first interval
                else:
                    final_dur = row.duration - (cumu_dur - 15)  # remaining duration in last interval
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'duration': final_dur, 'id': row.id}
                newdf = newdf.append(to_append, ignore_index=True)
    return newdf
Is there any other way to do this without using itertuples that could save me some time?
Thanks in advance.
PS. I apologize for anything that may seem a bit weird in my post, as this is the first time I have asked a question myself here on Stack Overflow.
EDIT
Many entries can have the same starting time, so grouping by 'start' could be problematic. There is, however, a column with unique values for each entry, simply called "id".

Using pd.resample is a good idea, but since you only have the starting time of each row, you need to build the end row before you can use it.
The code below assumes that each starting time in the 'start' column is unique, so that groupby can be used in a somewhat unusual way: each group it extracts contains only one row.
I use groupby because it automatically reassembles the dataframes produced by the custom function passed to apply.
Note also that the 'duration' column is converted to a timedelta in minutes so that some math can be done on it more easily later.
import pandas as pd
testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'], 'duration':[22,8,35,2], 'Attribute_A':['abc', 'def', 'hij', 'klm']}
testdf = pd.DataFrame(testdict)
testdf['start'] = pd.to_datetime(testdf['start'])
testdf['duration'] = pd.to_timedelta(testdf['duration'], 'T')
print(testdf)
def calcduration(df, starttime):
    if len(df) == 1:
        return
    elif len(df) == 2:
        df['duration'].iloc[0] = pd.Timedelta(15, 'T') - (starttime - df.index[0])
        df['duration'].iloc[1] = df['duration'].iloc[1] - df['duration'].iloc[0]
    elif len(df) > 2:
        df['duration'].iloc[0] = pd.Timedelta(15, 'T') - (starttime - df.index[0])
        df['duration'].iloc[1:-1] = pd.Timedelta(15, 'T')
        df['duration'].iloc[-1] = df['duration'].iloc[-1] - df['duration'].iloc[:-1].sum()

def expandtime(x):
    frow = x.copy()
    frow['start'] = frow['start'] + frow['duration']
    gdf = pd.concat([x, frow], axis=0)
    gdf = gdf.set_index('start')
    resdf = gdf.resample('15T').nearest()
    calcduration(resdf, x['start'].iloc[0])
    return resdf
findf = testdf.groupby('start', as_index=False).apply(expandtime)
print(findf)
This code produces:
duration Attribute_A
start
0 2018-01-05 11:45:00 00:12:00 abc
2018-01-05 12:00:00 00:10:00 abc
1 2018-05-04 09:00:00 00:08:00 def
2 2018-08-09 07:15:00 00:15:00 hij
2018-08-09 07:30:00 00:15:00 hij
2018-08-09 07:45:00 00:05:00 hij
3 2018-09-27 15:00:00 00:02:00 klm
A bit of explanation
expandtime is the first custom function. It takes a one-row dataframe (because we assume the 'start' values are unique), builds a second row whose 'start' equals the 'start' of the first row plus the duration, and then uses resample to sample it into 15-minute intervals. The values of all other columns are duplicated.
calcduration is used to do some math on the column 'duration' in order to calculate the correct duration of each row.
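Since the question's edit notes that 'start' values may repeat while 'id' is unique, a minimal adaptation (an untested sketch, assuming testdf also carries the 'id' column from the question) would be to group on 'id' instead:
# sketch: 'id' is unique per original entry, so each group still contains exactly one row
findf = testdf.groupby('id', as_index=False).apply(expandtime)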

So, starting with your df:
testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'], 'duration':[22,8,35,2], 'Attribute_A':['abc', 'def', 'hij', 'klm']}
df = pd.DataFrame(testdict)
df.loc[:,['start']] = pd.to_datetime(df['start'])
print(df)
First calculate an ending time for each row:
df['dur'] = pd.to_timedelta(df['duration'], unit='m')
df['end'] = df['start'] + df['dur']
Then create two new columns that hold the regular interval (15 minute) start and end dates:
df['start15'] = df['start'].dt.floor('15min')
df['end15'] = df['end'].dt.floor('15min')
At this point, the dataframe looks like:
Attribute_A duration start dur end start15 end15
0 abc 22 2018-01-05 11:48:00 00:22:00 2018-01-05 12:10:00 2018-01-05 11:45:00 2018-01-05 12:00:00
1 def 8 2018-05-04 09:05:00 00:08:00 2018-05-04 09:13:00 2018-05-04 09:00:00 2018-05-04 09:00:00
2 hij 35 2018-08-09 07:15:00 00:35:00 2018-08-09 07:50:00 2018-08-09 07:15:00 2018-08-09 07:45:00
3 klm 2 2018-09-27 15:00:00 00:02:00 2018-09-27 15:02:00 2018-09-27 15:00:00 2018-09-27 15:00:00
The start15 and end15 columns combine to have the right times, but you need to merge them:
df = pd.melt(df, ['dur', 'start', 'Attribute_A', 'end'], ['start15', 'end15'], value_name='start15')
df = df.drop('variable', axis=1).drop_duplicates('start15').sort_values('start15').set_index('start15')
Output:
dur start Attribute_A
start15
2018-01-05 11:45:00 00:22:00 2018-01-05 11:48:00 abc
2018-01-05 12:00:00 00:22:00 2018-01-05 11:48:00 abc
2018-05-04 09:00:00 00:08:00 2018-05-04 09:05:00 def
2018-08-09 07:15:00 00:35:00 2018-08-09 07:15:00 hij
2018-08-09 07:45:00 00:35:00 2018-08-09 07:15:00 hij
2018-09-27 15:00:00 00:02:00 2018-09-27 15:00:00 klm
Looking good, but the 2018-08-09 07:30:00 row is missing. Fill in this and any other missing rows with groupby and resample:
df = df.groupby('start').resample('15min').ffill().reset_index(0, drop=True).reset_index()
Get the end15 column back, it was dropped during the melt operation earlier:
df['end15'] = df['end'].dt.floor('15min')
Then calculate the correct durations for each row. I split this into two calculations (durations that spread across multiple timesteps, and ones that don't) to keep it readable:
import numpy as np

df.loc[df['start15'] != df['end15'], 'duration'] = np.minimum(df['end15'] - df['start'], pd.Timedelta('15min').to_timedelta64())
df.loc[df['start15'] == df['end15'], 'duration'] = np.minimum(df['end'] - df['end15'], df['end'] - df['start'])
Then just some clean-up to make it look like you wanted:
df['duration'] = (df['duration'].dt.seconds/60).astype(int)
df = df[['start15', 'duration', 'Attribute_A']].copy()
print(df)
Result:
start15 duration Attribute_A
0 2018-01-05 11:45:00 12 abc
1 2018-01-05 12:00:00 10 abc
2 2018-05-04 09:00:00 8 def
3 2018-08-09 07:15:00 15 hij
4 2018-08-09 07:30:00 15 hij
5 2018-08-09 07:45:00 5 hij
6 2018-09-27 15:00:00 2 klm
Please note, portions of this answer were based on this answer
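If you also need the question's unique 'id' column in the result, a sketch (not run end-to-end) is to carry it through the melt as an extra id variable and keep it in the final selection; the intermediate drop_duplicates/groupby steps would then also need to key on 'id' when start times repeat:
# assumption: df carries the 'id' column from the question
df = pd.melt(df, ['dur', 'start', 'Attribute_A', 'end', 'id'], ['start15', 'end15'], value_name='start15')
# ...same intermediate steps as above...
df = df[['start15', 'duration', 'Attribute_A', 'id']].copy()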

Related

How can I check if a timestamp entry is within a time range and with a person filter in two different dataframe?

I need to check if an entry is within a person's shift:
The data looks like this:
timestamp = pd.DataFrame({
    'Timestamp': ['01/02/2022 16:08:56','01/02/2022 16:23:31','01/02/2022 16:41:35','02/02/2022 16:57:41','02/02/2022 17:38:22','02/02/2022 17:50:56'],
    'Person': ['A','B','A','B','B','A']
})
shift = pd.DataFrame({
    'Date': ['01/02/2022','02/02/2022','01/02/2022','02/02/2022'],
    'in': ['13:00:00','13:00:00','14:00:00','14:00:00'],
    'out': ['21:00:00','21:00:00','22:00:00','22:00:00'],
    'Person': ['A','A','B','B']
})
For this kind of merge, an efficient method is to use merge_asof:
timestamp['Timestamp'] = pd.to_datetime(timestamp['Timestamp'])
(pd.merge_asof(timestamp.sort_values(by='Timestamp'),
               shift.assign(Timestamp=pd.to_datetime(shift['Date']+' '+shift['in']),
                            ts_out=pd.to_datetime(shift['Date']+' '+shift['out']),
                            ).sort_values(by='Timestamp')
                    [['Person', 'Timestamp', 'ts_out']],
               on='Timestamp', by='Person'
               )
   .assign(in_shift=lambda d: d['ts_out'].ge(d['Timestamp']))
   .drop(columns=['ts_out'])
)
output:
Timestamp Person in_shift
0 2022-01-02 16:08:56 A True
1 2022-01-02 16:23:31 B True
2 2022-01-02 16:41:35 A True
3 2022-02-02 16:57:41 B True
4 2022-02-02 17:38:22 B True
5 2022-02-02 17:50:56 A True
I assume that there is only one shift per person per day.
First I split the day and time from the timestamp dataframe. Then merge this with the shift dataframe on columns Person and Date. Then we only need to check whether the time from timestamp is between in and out.
timestamp[['Date', 'Time']] = timestamp.Timestamp.str.split(' ', n=1, expand=True)
df_merge = timestamp.merge(shift, on=['Date', 'Person'])
df_merge['Timestamp_in_shift'] = (df_merge.Time <= df_merge.out) & (df_merge.Time >= df_merge['in'])
df_merge.drop(columns=['Date', 'Time'])
Output:
Timestamp Person in out Timestamp_in_shift
0 01/02/2022 16:08:56 A 13:00:00 21:00:00 True
1 01/02/2022 16:41:35 A 13:00:00 21:00:00 True
2 01/02/2022 16:23:31 B 14:00:00 22:00:00 True
3 02/02/2022 16:57:41 B 14:00:00 22:00:00 True
4 02/02/2022 17:38:22 B 14:00:00 22:00:00 True
5 02/02/2022 17:50:56 A 13:00:00 21:00:00 True

group datetime column by 5 minutes increment only for time of day (ignoring date) and count

I have a dataframe with one column timestamp (of type datetime) and some other columns, but their contents don't matter. I'm trying to group by 5-minute intervals and count, ignoring the date and only caring about the time of day.
One can generate an example dataframe using this code:
import numpy as np
import pandas as pd

def get_random_dates_df(
        n=10000,
        start=pd.to_datetime('2015-01-01'),
        period_duration_days=5,
        seed=None
):
    if not seed:  # from piR's answer
        np.random.seed(0)
    end = start + pd.Timedelta(period_duration_days, 'd')
    n_seconds = int(period_duration_days * 3600 * 24)
    random_dates = pd.to_timedelta(n_seconds * np.random.rand(n), unit='s') + start
    return pd.DataFrame(data={"timestamp": random_dates}).reset_index()

df = get_random_dates_df()
df = get_random_dates_df()
it would look like this:
   index                      timestamp
0      0  2015-01-03 17:51:27.433696604
1      1  2015-01-04 13:49:21.806272885
2      2  2015-01-04 00:19:53.778462950
3      3  2015-01-03 17:23:09.535054659
4      4  2015-01-03 02:50:18.873314407
I think I have a working solution but it seems overly complicated:
gpd_df = df.groupby(pd.Grouper(key="timestamp", freq="5min")).agg(
    count=("index", "count")
).reset_index()
gpd_df["time_of_day"] = gpd_df["timestamp"].dt.time
res_df = gpd_df.groupby("time_of_day").sum()
Output:
count
time_of_day
00:00:00 38
00:05:00 39
00:10:00 48
00:15:00 33
00:20:00 27
... ...
23:35:00 34
23:40:00 38
23:45:00 37
23:50:00 41
23:55:00 41
[288 rows x 1 columns]
Is there a better way to solve this?
You could groupby the floored 5Min datetime's time portion:
df2 = df.groupby(df['timestamp'].dt.floor('5Min').dt.time)['index'].count()
I'd suggest something like this, to avoid trying to merge the results of two groupbys together:
gpd_df = df.copy()
gpd_df["time_of_day"] = gpd_df["timestamp"].apply(lambda x: x.replace(year=2000, month=1, day=1))
gpd_df = gpd_df.set_index("time_of_day")
res_df = gpd_df.resample("5min").size()
It works by setting the year/month/day to fixed values and applying the built-in resampling function.
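As a small follow-up (a sketch based on the res_df above), the pinned date can be hidden again by converting the index back to plain times of day:
# res_df has a DatetimeIndex pinned to 2000-01-01; keep only the time component for display
res_df.index = res_df.index.time
print(res_df)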
What about flooring the datetimes to 5min, extracting the time only and using value_counts:
out = (df['timestamp']
       .dt.floor('5min')
       .dt.time.value_counts(sort=False)
       .sort_index()
)
Output:
00:00:00 38
00:05:00 39
00:10:00 48
00:15:00 33
00:20:00 27
..
23:35:00 34
23:40:00 38
23:45:00 37
23:50:00 41
23:55:00 41
Name: timestamp, Length: 288, dtype: int64

How to do the mean over consecutive date time and is there a "simple" sql statement for this?

I have a MySQL database with records associated with the date and time of the record. When several values fall within a time range of 3 minutes, I want to take the mean of those values. I made a fake file to illustrate.
#dataSample.csv
;y;datetime
0;1.885539280369374;2020-12-18 00:16:59
1;88.87944658745302;2020-12-18 00:18:26
2;5.4934801892366645;2020-12-18 00:21:47
3;27.481240675960745;2020-12-22 02:22:43
4;78.20955112191257;2021-03-12 00:01:45
5;69.20174844202616;2021-03-12 00:03:01
6;92.452056802478;2021-03-12 00:04:10
7;65.44391665410022;2021-03-12 00:06:12
8;40.59036279552053;2021-03-13 11:07:40
9;97.28850548113896;2021-03-13 11:08:46
10;94.73214209590618;2021-03-13 11:09:52
11;15.032038741334246;2021-03-14 00:50:10
12;26.96629037360529;2021-03-14 00:51:17
13;57.257554884427755;2021-03-14 00:52:20
14;18.845976481042804;2021-03-17 13:52:00
15;57.19160644979182;2021-03-17 13:53:48
16;3.81419643210113;2021-03-17 13:54:50
17;46.65212265222033;2021-03-17 20:00:06
18;78.99788944141437;2021-03-17 20:01:28
19;72.57950242929162;2021-03-17 20:02:18
20;31.953619913660063;2021-03-20 16:40:04
21;71.03880579866258;2021-03-20 16:41:14
22;80.07721218822367;2021-03-20 16:42:03
23;84.4974927845413;2021-03-23 23:51:04
24;23.332882564418554;2021-03-23 23:52:37
25;24.84651458538292;2021-03-23 23:53:44
26;3.2905723920299073;2021-04-13 01:07:13
27;95.00543057651691;2021-04-13 01:08:53
28;46.02579988887248;2021-04-13 01:10:03
29;71.73362449536457;2021-04-13 07:54:22
30;93.17353939667422;2021-04-13 07:56:03
31;28.06669274690586;2021-04-13 07:57:04
32;10.733532291051478;2021-04-21 23:52:19
33;92.92374999199961;2021-04-21 23:53:02
34;59.68694726616824;2021-04-21 23:54:12
35;30.01172074266929;2021-11-29 00:21:09
36;34.905022198511915;2021-11-29 00:23:09
37;25.149590827473055;2021-11-29 00:24:13
38;82.09740354280564;2021-12-01 08:30:00
39;25.58339148753002;2021-12-01 08:32:00
40;72.7009145748645;2021-12-01 08:34:00
41;8.43474445404563;2021-12-01 13:18:58
42;57.95936012084567;2021-12-01 13:19:45
43;31.118114587376713;2021-12-01 13:21:19
44;42.082098854369576;2021-12-01 20:24:46
45;75.8402567179772;2021-12-01 20:25:45
46;55.29546227636972;2021-12-01 20:26:20
47;72.52918512264547;2021-12-02 08:35:42
48;77.81077056479849;2021-12-02 08:36:35
49;34.63717484559066;2021-12-02 08:37:22
50;71.65724478546072;2021-12-06 00:05:00
51;19.54082334014094;2021-12-06 00:08:00
52;48.28967362303979;2021-12-06 00:10:00
53;34.894095185290105;2021-12-03 08:36:00
54;58.187428474357375;2021-12-03 08:40:00
55;94.53441120864328;2021-12-03 08:45:00
56;12.272217150555866;2021-12-03 13:10:00
57;87.21292441413424;2021-12-03 13:11:00
58;86.35470090744712;2021-12-03 13:12:00
59;50.23396755270806;2021-12-06 23:46:00
60;73.30424413459407;2021-12-06 23:48:00
61;60.48531615320234;2021-12-06 23:49:00
62;56.10336877052336;2021-12-06 23:51:00
63;87.6451368964707;2021-12-07 08:37:00
64;11.902048844734905;2021-12-07 10:48:00
65;57.596744167099494;2021-12-07 10:58:00
66;61.77125104854312;2021-12-07 11:05:00
67;21.542193987296695;2021-12-07 11:28:00
68;91.64520146457525;2021-12-07 11:29:00
69;78.42486998655676;2021-12-07 16:06:00
70;79.51721853991806;2021-12-07 16:08:00
71;54.46969194684532;2021-12-07 16:09:00
72;56.092025088935785;2021-12-07 16:12:00
73;2.546437552510464;2021-12-07 18:35:00
74;11.598686235757118;2021-12-07 18:40:00
75;40.26003639570842;2021-12-07 18:45:00
76;30.697636730470848;2021-12-07 23:39:00
77;66.3177096178856;2021-12-07 23:42:00
78;73.16870525875022;2021-12-07 23:47:00
79;61.68994018242363;2021-12-08 13:47:00
80;38.06598256433572;2021-12-08 13:48:00
81;43.91268499464372;2021-12-08 13:49:00
82;33.166594417250735;2021-12-15 00:23:00
83;52.68422837459157;2021-12-15 00:24:00
84;86.01398356923765;2021-12-15 00:26:00
85;21.444108620566542;2021-12-15 00:31:00
86;86.6839608035921;2021-12-18 01:09:00
87;43.83047571188636;2022-01-06 00:24:00
Here is my code:
import pandas as pd
import numpy as np
import datetime
from datetime import datetime, timedelta
fileName = "dataSample.csv"
df = pd.read_csv(fileName, sep=";", index_col=0)
df['datetime_object'] = df['datetime'].apply(datetime.fromisoformat)
def define_mask(d, delta_minutes):
    return (d <= df["datetime_object"]) & (df["datetime_object"] <= d + timedelta(minutes=delta_minutes))

group = []
i = 0
while i < len(df):
    d = df.loc[i]["datetime_object"]
    mask = define_mask(d, 3)
    for k in range(len(df[mask].index)):
        group.append(i)
    i += len(df[mask].index)

df["group"] = group
df_new = df.groupby("group").apply(np.mean)
It works well, but I am wondering if this is good "pandas" practice.
I have 2 questions:
Is there another way to do that with pandas ?
Is there an SQL command to do that directly ?
You can use resample:
df = pd.read_csv('data.csv', sep=';', index_col=0, parse_dates=['datetime'])
out = df.resample('3min', on='datetime').mean().dropna().reset_index()
print(out)
# Output
datetime y
0 2020-12-18 00:15:00 1.885539
1 2020-12-18 00:18:00 88.879447
2 2020-12-18 00:21:00 5.493480
3 2020-12-22 02:21:00 27.481241
4 2021-03-12 00:00:00 78.209551
.. ... ...
59 2021-12-15 00:21:00 33.166594
60 2021-12-15 00:24:00 69.349106
61 2021-12-15 00:30:00 21.444109
62 2021-12-18 01:09:00 86.683961
63 2022-01-06 00:24:00 43.830476
[64 rows x 2 columns]
Another way to get the first datetime value of a group of 3 minutes:
out = df.groupby(pd.Grouper(freq='3min', key='datetime'), as_index=False) \
        .agg({'y': 'mean', 'datetime': 'first'}) \
        .dropna(how='all').reset_index(drop=True)
print(out)
# Output
y datetime
0 1.885539 2020-12-18 00:16:59
1 88.879447 2020-12-18 00:18:26
2 5.493480 2020-12-18 00:21:47
3 27.481241 2020-12-22 02:22:43
4 78.209551 2021-03-12 00:01:45
.. ... ...
59 33.166594 2021-12-15 00:23:00
60 69.349106 2021-12-15 00:24:00
61 21.444109 2021-12-15 00:31:00
62 86.683961 2021-12-18 01:09:00
63 43.830476 2022-01-06 00:24:00
[64 rows x 2 columns]
Or
out = df.resample('3min', on='datetime') \
        .agg({'y': 'mean', 'datetime': 'first'}) \
        .dropna(how='all').reset_index(drop=True)
In MySQL you can achieve it like this:
SELECT
    FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(`datetime`)/180)*180) AS 'datetime',
    AVG(`y`) AS 'y'
FROM `table`
GROUP BY
    FLOOR(UNIX_TIMESTAMP(`datetime`)/180)
AVG() is an aggregate function in MySQL that, when used with GROUP BY, returns an aggregate result for each group of rows.
One way to round the date to 3-minute intervals is to convert it to a unix timestamp and use the FLOOR function:
UNIX_TIMESTAMP to convert the date to unix timestamp (number of seconds since 1970-01-01 00:00:00)
Divide by # of seconds to group by
FLOOR() function to get the closest integer value not greater than the input.
Multiply the result by # of seconds to convert back to a unix timestamp
FROM_UNIXTIME() to convert the unix timestamp back to a MySQL datetime
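To sanity-check that bucketing from the pandas side (a sketch, assuming the 'datetime_object' column built in the question's code), flooring to 3 minutes reproduces the same group starts as the FLOOR(UNIX_TIMESTAMP(...)/180)*180 expression:
# same 3-minute buckets as the SQL expression, computed in pandas
bucket_start = df['datetime_object'].dt.floor('3min')
check = df.groupby(bucket_start)['y'].mean()
print(check.head())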

Elegant way to shift multiple date columns - Pandas

I have a dataframe like as shown below
df = pd.DataFrame({'person_id': [11,11,11,21,21],
                   'offset': ['-131 days','29 days','142 days','20 days','-200 days'],
                   'date_1': ['05/29/2017', '01/21/1997', '7/27/1989','01/01/2013','12/31/2016'],
                   'dis_date': ['05/29/2017', '01/24/1999', '7/22/1999','01/01/2015','12/31/1991'],
                   'vis_date': ['05/29/2018', '01/27/1994', '7/29/2011','01/01/2018','12/31/2014']})
df['date_1'] = pd.to_datetime(df['date_1'])
df['dis_date'] = pd.to_datetime(df['dis_date'])
df['vis_date'] = pd.to_datetime(df['vis_date'])
I would like to shift all the dates of each subject based on their offset.
Though my code works (credit: SO), I am looking for a more elegant approach. You can see I am repeating almost the same line three times.
df['offset_to_shift'] = pd.to_timedelta(df['offset'],unit='d')
#am trying to make the below lines elegant/efficient
df['shifted_date_1'] = df['date_1'] + df['offset_to_shift']
df['shifted_dis_date'] = df['dis_date'] + df['offset_to_shift']
df['shifted_vis_date'] = df['vis_date'] + df['offset_to_shift']
I expect my output to be as shown below.
Use DataFrame.add along with DataFrame.add_prefix and DataFrame.join:
cols = ['date_1', 'dis_date', 'vis_date']
df = df.join(df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_'))
OR, it is also possible to use pd.concat:
df = pd.concat([df, df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_')], axis=1)
OR, we can also directly assign the new shifted columns to the dataframe:
df[['shifted_' + col for col in cols]] = df[cols].add(df['offset_to_shift'], axis=0)
Result:
# print(df)
person_id offset date_1 dis_date vis_date offset_to_shift shifted_date_1 shifted_dis_date shifted_vis_date
0 11 -131 days 2017-05-29 2017-05-29 2018-05-29 -131 days 2017-01-18 2017-01-18 2018-01-18
1 11 29 days 1997-01-21 1999-01-24 1994-01-27 29 days 1997-02-19 1999-02-22 1994-02-25
2 11 142 days 1989-07-27 1999-07-22 2011-07-29 142 days 1989-12-16 1999-12-11 2011-12-18
3 21 20 days 2013-01-01 2015-01-01 2018-01-01 20 days 2013-01-21 2015-01-21 2018-01-21
4 21 -200 days 2016-12-31 1991-12-31 2014-12-31 -200 days 2016-06-14 1991-06-14 2014-06-14

How do I store the first use date in a dictionary using a for loop

I have a dataset of user ids and all the times they use a particular pass. I need to find out how many days it has been since each of them first used the pass. I was thinking of running through the dataset, storing the first use in a dictionary and subtracting it from today's date. I can't seem to get it to work.
Userid Start use Day
1712 2019-01-04 Friday
1712 2019-01-05 Saturday
9050 2019-01-04 Friday
9050 2019-01-04 Friday
9050 2019-01-06 Sunday
9409 2019-01-05 Saturday
9683 2019-05-20 Monday
8800 2019-05-17 Friday
8800 2019-05-17 Friday
This is part of the dataset. The date format is Y-m-d.
usedict = {}
keys = df.user_id
values = df.start_date
for i in keys:
    if (usedict[i] == keys):
        continue
    else:
        usedict[i] = values[i]
prints(usedict)
user_id use_count days_used Ave Daily Trips register_date days_since_reg
12 42 23 1.826087 NaT NaT
17 28 13 2.153846 NaT NaT
114 54 24 2.250000 2019-02-04 107 days
169 31 17 1.823529 NaT NaT
1414 49 20 2.450000 NaT NaT
1712 76 34 2.235294 NaT NaT
2388 24 12 2.000000 NaT NaT
6150 10 5 2.000000 2019-02-05 106 days
You can achieve what you want with the following. I have used only 2 user ids from the example given by you, but the same will apply to all.
import pandas as pd
import datetime
df = pd.DataFrame([{'Userid':'1712','use_date':'2019-01-04'},
                   {'Userid':'1712','use_date':'2019-01-05'},
                   {'Userid':'9050','use_date':'2019-01-04'},
                   {'Userid':'9050','use_date':'2019-01-04'},
                   {'Userid':'9050','use_date':'2019-01-06'}])
df.use_date = pd.to_datetime(df.use_date).dt.date
group_df = df.sort_values(by='use_date').groupby('Userid', as_index=False).agg({'use_date':'first'}).rename(columns={'use_date':'first_use_date'})
group_df['diff_from_today'] = datetime.datetime.today().date() - group_df.first_use_date
The output is:
print(group_df)
Userid first_use_date diff_from_today
0 1712 2019-01-04 139 days
1 9050 2019-01-04 139 days
Check sort_values and groupby for more details.
I am only looking at two columns, but you could find the min for each id with groupby and then use apply to get the difference (I have computed the difference in days).
import pandas as pd
import datetime
user_id = [1712, 1712, 9050, 9050, 9050, 9409, 9683, 8800, 8800]
start = ['2019-01-04', '2019-01-05', '2019-01-04', '2019-01-04', '2019-01-06', '2019-01-05', '2019-05-20', '2019-05-17', '2019-05-17']
df = pd.DataFrame(list(zip(user_id, start)), columns = ['UserId', 'Start'])
df['Start']= pd.to_datetime(df['Start'])
df = df.groupby('UserId')['Start'].agg([pd.np.min])
now = datetime.datetime.now()
df['days'] = df['amin'].apply(lambda x: (now - x).days)
a_dict = pd.Series(df.days.values,index = df.index).to_dict()
print(a_dict)
References:
to_dict() method taken from #jeff
Output:
