Python Pandas interpolation: redistribute value forwards over missing date range - python

I have time trend data on facility traffic (admissions to and releases from a facility over time), with gaps. Because of the structure of this data, when a gap appears, the "releases" one day prior to the gap are artificially high (accounting for all unseen individuals released over the period of the gap), and the "admissions" one day after the gap are artificially high (for the same reason: any individual who was admitted during the gap and remains in the facility will appear as an "admission" on this date).
Here is a sample Pandas series involving such a data gap (with zeroes implying missing data on 2020-01-04 through 2020-01-07):
date(index) releases admissions
2020-01-01 15 23
2020-01-02 8 20
2020-01-03 50 14
2020-01-04 0 0
2020-01-05 0 0
2020-01-06 0 0
2020-01-07 0 0
2020-01-08 8 100
2020-01-09 11 19
2020-01-10 9 17
A visualization of this (ignore the separate linear interpolation over the missing total population) looks like the following:
I want to smooth this data, but I'm not sure what interpolation method to use. What I want to accomplish is redistribution forwards of the "releases" on date gap(0)-1 and redistribution backwards of "admissions" on date gap(n)+1. For instance, if a gap is 4 days long and on day gap(n)+1 there are 100 admissions, I want to redistribute such that, on each day of the gap, there are 20 admissions, and on day gap(n)+1 admissions are revised to show 20.
Using the above example series, redistribution would look like the following:
date(index) releases admissions
2020-01-01 15 23
2020-01-02 8 20
2020-01-03 10 14
2020-01-04 10 20
2020-01-05 10 20
2020-01-06 10 20
2020-01-07 10 20
2020-01-08 8 20
2020-01-09 11 19
2020-01-10 9 17

You can create groups with consecutive zeros + one value before for releases and one value after for admissions, and then use transform('mean') to calculate average for each group:
# releases
df['releases'] = df.groupby(
df['releases'].replace(0, np.nan).notna().cumsum()
)['releases'].transform('mean')
# admissions
df['admissions'] = df.groupby(
df['admissions'].replace(0, np.nan).notna().iloc[::-1].cumsum().iloc[::-1]
)['admissions'].transform('mean')
Output:
releases admissions
date
2020-01-01 15 23
2020-01-02 8 20
2020-01-03 10 14
2020-01-04 10 20
2020-01-05 10 20
2020-01-06 10 20
2020-01-07 10 20
2020-01-08 8 20
2020-01-09 11 19
2020-01-10 9 17
Update: For keeping the existing NA values:
# releases
df['releases_i'] = df.groupby(
df['releases'].ne(0).cumsum()
)['releases'].transform('mean')
# admissions
df['admissions_i'] = df.groupby(
df['admissions'].ne(0).iloc[::-1].cumsum().iloc[::-1]
)['admissions'].transform('mean')

Related

How to divide 24 hours into 96 quarters?

I'm trying to come up with some logic to cleanly divide 24 hours into 96 quarters but I can't figure it out. I have a Python Pandas dataframe showing the hours and quarters of each time stamp. It looks like this
Timestamp | Hour | Quarter
----------------------------------------
2020-11-01 05:00:00+01 5 1
2020-11-01 05:15:00+01 5 2
2020-11-01 05:30:00+01 5 3
2020-11-01 05:45:00+01 5 4
2020-11-01 06:00:00+01 6 1
2020-11-01 06:15:00+01 6 2
2020-11-01 06:30:00+01 6 3
2020-11-01 06:45:00+01 6 4
So here it shows the quarters for each hour (every hour has 4 quarters). But now I want to have 96 quarters for the entire day. So I would add a column:
Timestamp | Hour | Quarter | Q's
------------------------------------------------
2020-11-01 05:00:00+01 5 1 21
2020-11-01 05:15:00+01 5 2 22
2020-11-01 05:30:00+01 5 3 23
2020-11-01 05:45:00+01 5 4 24
2020-11-01 06:00:00+01 6 1 25
2020-11-01 06:15:00+01 6 2 26
2020-11-01 06:30:00+01 6 3 27
2020-11-01 06:45:00+01 6 4 28
Because I'm working with timestamps which are timezone sensitive, I can't just do this index wise. Also I don't like for loops. What's the logic here that I am completely missing?
Isn't it simply this?
df["Q's"] = 4 * df["Hour"] + df["Quarter"]

Get the Minimum and Maximum value within specific date range in DataFrame

I have a DataFrame that has the columns 'From' (datetime), 'To' (datetime). There are some overlapping in the ranges of different rows of the table.
Here is the simplified version of criteria dataframe (the date range is vary and overlapping with each other):
df1= pd.DataFrame({'From': pd.date_range(start='2020-01-01', end='2020-01-31',freq='2D'), 'To': pd.date_range(start='2020-01-05', end='2020-02-04',freq='2D')})
From To
0 2020-01-01 2020-01-05
1 2020-01-03 2020-01-07
2 2020-01-05 2020-01-09
3 2020-01-07 2020-01-11
4 2020-01-09 2020-01-13
5 2020-01-11 2020-01-15
6 2020-01-13 2020-01-17
7 2020-01-15 2020-01-19
8 2020-01-17 2020-01-21
9 2020-01-19 2020-01-23
10 2020-01-21 2020-01-25
11 2020-01-23 2020-01-27
12 2020-01-25 2020-01-29
13 2020-01-27 2020-01-31
14 2020-01-29 2020-02-02
15 2020-01-31 2020-02-04
And I have a dataframe which keep the daily high and low value like this
random.seed(0)
df2= pd.DataFrame({'Date': pd.date_range(start='2020-01-01', end='2020-01-31'), 'High': [random.randint(7,15)+5 for i in range(31)], 'Low': [random.randint(0,7)-1 for i in range(31)]})
Date High Low
0 2020-01-01 18 6
1 2020-01-02 18 6
2 2020-01-03 12 3
3 2020-01-04 16 -1
4 2020-01-05 20 -1
5 2020-01-06 19 0
6 2020-01-07 18 5
7 2020-01-08 16 -1
8 2020-01-09 19 6
9 2020-01-10 17 4
10 2020-01-11 15 2
11 2020-01-12 20 4
12 2020-01-13 14 0
13 2020-01-14 16 2
14 2020-01-15 14 2
15 2020-01-16 13 2
16 2020-01-17 16 1
17 2020-01-18 20 6
18 2020-01-19 14 0
19 2020-01-20 16 0
20 2020-01-21 13 4
21 2020-01-22 13 6
22 2020-01-23 17 0
23 2020-01-24 19 3
24 2020-01-25 20 3
25 2020-01-26 13 0
26 2020-01-27 17 4
27 2020-01-28 18 2
28 2020-01-29 17 3
29 2020-01-30 15 6
30 2020-01-31 20 0
Then I hope to get the maximum and minimum value based on the From Date and To Date in df1, Here is the expected result:
result = pd.DataFrame({'From': pd.date_range(start='2020-01-01', end='2020-01-31',freq='2D'), 'To': pd.date_range(start='2020-01-05', end='2020-02-04',freq='2D'), 'High':[20,20,20,19,20,20,16,20,20,17,20,20,20,20,20,20], 'Low':[-1,-1,-1,-1,0,0,1,0,0,0,0,0,0,0,0,0]})
From To High Low
0 2020-01-01 2020-01-05 20 -1
1 2020-01-03 2020-01-07 20 -1
2 2020-01-05 2020-01-09 20 -1
3 2020-01-07 2020-01-11 19 -1
4 2020-01-09 2020-01-13 20 0
5 2020-01-11 2020-01-15 20 0
6 2020-01-13 2020-01-17 16 1
7 2020-01-15 2020-01-19 20 0
8 2020-01-17 2020-01-21 20 0
9 2020-01-19 2020-01-23 17 0
10 2020-01-21 2020-01-25 20 0
11 2020-01-23 2020-01-27 20 0
12 2020-01-25 2020-01-29 20 0
13 2020-01-27 2020-01-31 20 0
14 2020-01-29 2020-02-02 20 0
15 2020-01-31 2020-02-04 20 0
I have tried to use resampling method, but it seems not support custom date range. I'm looking for a reasonably efficient and elegant way of doing this. Thank you very much.
With the size of the data, I think you should consider another approach, the idea is to vectorize by chunk over df1 the comparison between dates with df2. It is lot more lines than other solutions, but it will be way faster for large dataframes.
# this is a parameter you can play with,
# but if your df1 is in memory, this value should work
nb_split = int((len(df1)*len(df2))//4e6)+1
# work with arrays of flaot
arr1 = df1[['From','To']].astype('int64').to_numpy().astype(float)
arr2 = df2.astype('int64').to_numpy().astype(float)
# create result array
arr_out = np.zeros((len(arr1), 2), dtype=float)
i = 0 #index position
for arr1_sp in np.array_split(arr1, nb_split, axis=0):
# get length of the chunk
lft = len(arr1_sp)
# get the min datetime in From and max in To
min_from = arr1_sp[:, 0].min()
max_to = arr1_sp[:, 1].max()
# select the rows of arr2 tht are within the min and max date of the split
arr2_sp = arr2[(arr2[:,0]>=min_from)&(arr2[:,0]<=max_to), :]
# create an bool arraywith True when the date in arr2_sp is above from and below to
# each row is the reuslt for each row of arr1_sp
m = np.less_equal.outer(arr1_sp[:,0], arr2_sp[:, 0])\
&np.greater_equal.outer(arr1_sp[:,1], arr2_sp[:, 0])
# use this mask to get the values high and low within the range row-wise
# and replace where the mask was False by np.nan
arr_high = arr2_sp[:,1]*m
arr_high[~m] = np.nan
arr_low = arr2_sp[:,2]*m
arr_low[~m] = np.nan
# put the result in the result array
arr_out[i:i+lft, 0] = np.nanmax(arr_high, axis=1)
arr_out[i:i+lft, 1] = np.nanmin(arr_low, axis=1)
i += lft #update first idx position for next loop
# create the columns in df1
df1['High'] = arr_out[:, 0]
df1['Low'] = arr_out[:, 1]
I tried with df1 with 10000 rows and df2 5000 rows, and this method is about 102ms while the method with apply getHighLow2is about 8s, so 80 time faster this way. Adn the results where the same.
Here is a function which does this:
Checks the dates which are in the from/to interval
Gets the maximum and minimum values of the High and Low columns respectively
def get_high_low(d1):
high = df2.loc[df2["Date"].isin(pd.date_range(d1["From"], d1["To"])), "High"].max()
low = df2.loc[df2["Date"].isin(pd.date_range(d1["From"], d1["To"])), "Low"].max()
return pd.Series([high, low], index=["High", "Low"])
Then we can just apply this function and concatenate the result with the dates.
pd.concat([df1, df1.apply(get_high_low, axis=1)], axis=1)
The result
From To High Low
0 2020-01-01 2020-01-05 19 4
1 2020-01-03 2020-01-07 17 5
2 2020-01-05 2020-01-09 19 5
3 2020-01-07 2020-01-11 19 2
4 2020-01-09 2020-01-13 17 4
5 2020-01-11 2020-01-15 19 4
6 2020-01-13 2020-01-17 19 5
7 2020-01-15 2020-01-19 18 5
8 2020-01-17 2020-01-21 18 0
9 2020-01-19 2020-01-23 19 3
10 2020-01-21 2020-01-25 19 5
11 2020-01-23 2020-01-27 19 5
12 2020-01-25 2020-01-29 17 5
13 2020-01-27 2020-01-31 17 3
14 2020-01-29 2020-02-02 17 1
15 2020-01-31 2020-02-04 13 -1
I would do a cross merge and query, then groupby:
(df1.assign(dummy=1)
.merge(df2.assign(dummy=1), on='dummy') # this is cross merge
.drop('dummy', axis=1) # remove the `dummy` column
.query('From<=Date<=To') # only choose valid data
.groupby(['From','To']) # groupby `From` and `To`
.agg({'High':'max','Low':'min'}) # aggregation
.reset_index()
)
Output:
From To High Low
0 2020-01-01 2020-01-05 20 -1
1 2020-01-03 2020-01-07 20 -1
2 2020-01-05 2020-01-09 20 -1
3 2020-01-07 2020-01-11 19 -1
4 2020-01-09 2020-01-13 20 0
5 2020-01-11 2020-01-15 20 0
6 2020-01-13 2020-01-17 16 0
7 2020-01-15 2020-01-19 20 0
8 2020-01-17 2020-01-21 20 0
9 2020-01-19 2020-01-23 17 0
10 2020-01-21 2020-01-25 20 0
11 2020-01-23 2020-01-27 20 0
12 2020-01-25 2020-01-29 20 0
13 2020-01-27 2020-01-31 20 0
14 2020-01-29 2020-02-02 20 0
15 2020-01-31 2020-02-04 20 0
You can create a simple function that gets the min and max within a given date renge. Than use the apply function to add the columns.
def MaxMin(row):
dfRange = df2[(df2['Date']>=row['From'])&(df2['Date']<=row['To'])] # df2 rows within a given date range
row['High'] = dfRange['High'].max()
row['Low'] = dfRange['Low'].min()
return row
df1 = df1.apply(MaxMin, axis =1)
Define the following function:
def getHighLow(row):
wrk = df2[df2.Date.between(row.From, row.To)]
return pd.Series([wrk.High.max(), wrk.Low.min()], index=['High', 'Low'])
Then run:
df1.join(df1.apply(getHighLow, axis=1))
According to the DRY rule, it is better to find wrk (a set of rows between
given dates) once and then (form wrk) extract maximal High and
minimal Low.
Another advantage over the other solution: My code runs quicker by about
30 % (at least on my computer, measurements performed using %timeit).
Edit
Yet quicker solution is when the search in df2 can be performed by index
instead of "from regular column".
As a preparatory step run:
df2a = df2.set_index('Date')
Then define another variant of getHighLow function:
def getHighLow2(row):
wrk = df2a.loc[row.From : row.To]
return pd.Series([wrk.High.max(), wrk.Low.min()], index=['High', 'Low'])
To get the result, run:
df1.join(df1.apply(getHighLow2, axis=1))
For your data, the execution time is about a half of the other solution
(not including the time to create df2a, but it can be created just in this form (with Date as the index)).

How to continue the week number when the year changes using pandas

Example: By using
df['Week_Number'] = df['Date'].dt.strftime('%U')
for 29/12/2019 the week is 52. and this week is from 29/12/2019 to 04/01/2020.
but for 01/01/2020 the week is getting as 00.
I require the week for 01/01/2020 also as 52. and for 05/01/2020 to 11/01/2020 as 53. This need to be continued.
I used a logic to solve the question.
First of all, let's write a function to create an instance of Dataframe involving dates from 2019-12-01 to 2020-01-31 by a function
def create_date_table(start='2019-12-01', end='2020-01-31'):
df = pd.DataFrame({"Date": pd.date_range(start, end)})
df["Week_start_from_Monday"] = df.Date.dt.isocalendar().week
df['Week_start_from_Sunday'] = df['Date'].dt.strftime('%U')
return df
Run the function and observe the Dataframe
date_df=create_date_table()
date_df.head(n=40)
There are two fields in the Dataframe about weeks, Week_start_from_Monday and Week_start_from_Sunday, the difference come from they count Monday or Sunday as the first day of a week.
In this case, Week_start_from_Sunday is the one we need to focus on.
Now we write a function to add a column containing weeks continuing from last year, not reset to 00 when we enter a new year.
def add_continued_week_field(date: Timestamp, df_start_date: str = '2019-12-01') -> int:
start_date = datetime.strptime(df_start_date, '%Y-%m-%d')
year_of_start_date = start_date.year
year_of_date = date.year
week_of_date = date.strftime("%U")
year_diff = year_of_date - year_of_start_date
if year_diff == 0:
continued_week = int(week_of_date)
else:
continued_week = year_diff * 52 + int(week_of_date)
return continued_week
Let's apply the function add_continued_week_field to the dates' Dataframe.
date_df['Week_continue'] = date_df['Date'].apply(add_continued_week_field)
We can see the new added field in the dates' Dataframe
As stated in converting a pandas date to week number, you can use df['Date'].dt.week to get week numbers.
To let it continue you maybe could sum up the last week number with new week-values, something like this? I cannot test this right now...
if(df['Date'].dt.strftime('%U') == 53):
last = df['Date'].dt.strftime('%U')
df['Week_Number'] = last + df['Date'].dt.strftime('%U')
You can do this with isoweek and isoyear.
I don't see how you arrive at the values you present with '%U' so I will assume that you want to map the week starting on Sunday 2019-12-29 ending on 2020-01-04 to 53, and that you want to map the following week to 54 and so on.
For weeks to continue past the year you need isoweek.
isocalendar() provides a tuple with isoweek in the second element and a corresponding unique isoyear in the first element.
But isoweek starts on Monday so we have to add one day so the Sunday is interpreted as Monday and counted to the right week.
2019 is subtracted to have years starting from 0, then every year is multiplied with 53 and the isoweek is added. Finally there is an offset of 1 so you arrive at 53.
In [0]: s=pd.Series(["29/12/2019", "01/01/2020", "05/01/2020", "11/01/2020"])
dts = pd.to_datetime(s,infer_datetime_format=True)
In [0]: (dts + pd.DateOffset(days=1)).apply(lambda x: (x.isocalendar()[0] -2019)*53 + x.isocalendar()[1] -1)
Out[0]:
0 53
1 53
2 54
3 54
dtype: int64
This of course assumes that all iso years have 53 weeks which is not the case, so instead you would want to compute the number of iso weeks per iso year since 2019 and sum those up.
Maybe you are looking for this. I fixed an epoch. If you have dates earlier than 2019, you can choose other epoch.
epoch= pd.Timestamp("2019-12-23")
# Test data:
df=pd.DataFrame({"Date":pd.date_range("22/12/2019",freq="1D",periods=25)})
df["Day_name"]=df.Date.dt.day_name()
# Calculation:
df["Week_Number"]=np.where(df.Date.astype("datetime64").le(epoch), \
df.Date.dt.week, \
df.Date.sub(epoch).dt.days//7+52)
df
Date Day_name Week_Number
0 2019-12-22 Sunday 51
1 2019-12-23 Monday 52
2 2019-12-24 Tuesday 52
3 2019-12-25 Wednesday 52
4 2019-12-26 Thursday 52
5 2019-12-27 Friday 52
6 2019-12-28 Saturday 52
7 2019-12-29 Sunday 52
8 2019-12-30 Monday 53
9 2019-12-31 Tuesday 53
10 2020-01-01 Wednesday 53
11 2020-01-02 Thursday 53
12 2020-01-03 Friday 53
13 2020-01-04 Saturday 53
14 2020-01-05 Sunday 53
15 2020-01-06 Monday 54
16 2020-01-07 Tuesday 54
17 2020-01-08 Wednesday 54
18 2020-01-09 Thursday 54
19 2020-01-10 Friday 54
20 2020-01-11 Saturday 54
21 2020-01-12 Sunday 54
22 2020-01-13 Monday 55
23 2020-01-14 Tuesday 55
24 2020-01-15 Wednesday 55
I got here wanting to know how to label consecutive weeks - I'm not sure if that's exactly what the question is asking but I think it might be. So here is what I came up with:
# Create dataframe with example dates
# It has a datetime index and a column with day of week (just to check that it's working)
dates = pd.date_range('2019-12-15','2020-01-10')
df = pd.DataFrame(dates.dayofweek,index=dates,columns=['dow'])
# Add column
# THESE ARE THE RELEVANT LINES
woy = df.index.weekofyear
numbered = np.cumsum(np.diff(woy,prepend=woy[0])!=0)
# Append for easier comparison
df['week_num'] = numbered
df then looks like this:
dow week_num
2019-12-15 6 0
2019-12-16 0 1
2019-12-17 1 1
2019-12-18 2 1
2019-12-19 3 1
2019-12-20 4 1
2019-12-21 5 1
2019-12-22 6 1
2019-12-23 0 2
2019-12-24 1 2
2019-12-25 2 2
2019-12-26 3 2
2019-12-27 4 2
2019-12-28 5 2
2019-12-29 6 2
2019-12-30 0 3
2019-12-31 1 3
2020-01-01 2 3
2020-01-02 3 3
2020-01-03 4 3
2020-01-04 5 3
2020-01-05 6 3
2020-01-06 0 4
2020-01-07 1 4
2020-01-08 2 4
2020-01-09 3 4
2020-01-10 4 4

How do i find the first duplicate value in a data frame based on time stamp in Python 3.x?

I am new to Python 3.6 and I have been trying to solve an assignment without any success using Pandas.
My dataframe looks like this:
Index ID Time Account Key City County
0 10 2016-01-01 12:30 11 55 a NZ
1 2 2016-01-02 13:30 14 34 b AL
2 33 2016-01-03 11:20 4 55 a NZ
3 4 2016-01-01 14:30 11 40 b AL
4 18 2016-01-20 23:30 14 34 b AL
..
100 41 2016-03-20 13:50 11 55 a NZ
I want to identify that Account 11 and 14 are reoccurring and to count them in different buckets in a new column (Ie: occurring with changes in Key and occurring without changes in Key) but I want 11 to be counted once.
I want to calculate the time difference in hours between the first and second occurrence of Account 11 but to ignore all other occurrences of 11. The results should be placed in a new data frame with columns 'Account' and 'Time_diff'
Any ideas on how to proceed? I am using Spyder if that makes any difference =)
So for Q1 it would look like:
Index ID Time Account Key City County ChangeKey
0 10 2016-01-01 12:30 11 55 a NZ 0
1 2 2016-01-02 13:30 14 34 b AL 0
2 33 2016-01-03 11:20 4 55 a NZ 0
3 4 2016-01-01 14:30 11 40 b AL 1
4 18 2016-01-20 23:30 14 34 b AL 0
The key changes for account 11 but not account 14.
For Q2 the final result would look like
Index Time Account Timediff
0 2016-01-01 12:30 11 0
1 2016-01-02 13:30 14 0
2 2016-01-03 11:20 4 NA
3 2016-01-01 14:30 11 2
4 2016-01-20 23:30 14 320

How long are Pandas groupby objects remembered?

I have the following example Python 3.4 script. It does the following:
creates a dataframe,
converts the date variable to datetime64 format,
creates a groupby object based on two categorical variables,
produces a dataframe that contains a count of the number items in each group,
merges count dataframe back with original dataframe to create a column containing the number of rows in each group
creates a column containing the difference in dates between sequential rows.
Here is the script:
import numpy as np
import pandas as pd
# Create dataframe consisting of id, date and two categories (gender and age)
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Convert date to datetime
tempDF['date'] = pd.to_datetime(tempDF['date'])
# Create groupby object based on two categorical variables
tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
# Count number in each group and merge with original dataframe to create 'count' column
tempCountsDF = tempGroupby['id'].count().reset_index(drop=False)
tempCountsDF = tempCountsDF.rename(columns={'id': 'count'})
tempDF = tempDF.merge(tempCountsDF, on=['gender','age'])
# Calculate difference between consecutive rows in each group. (First row in each
# group should have date difference = NaT)
tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
tempDF['diff'] = tempGroupby['date'].diff()
print(tempDF)
This script produces the following output:
age date gender id count diff
0 young 2015-02-04 02:34:00 male 1 2 NaT
1 young 2015-10-05 08:52:00 male 10 2 243 days 06:18:00
2 old 2015-06-04 12:34:00 female 2 3 NaT
3 old 2015-09-04 23:03:00 female 3 3 92 days 10:29:00
4 old 2015-04-21 12:59:00 female 6 3 -137 days +13:56:00
5 old 2015-12-04 01:00:00 male 4 6 NaT
6 old 2015-04-15 07:12:00 male 5 6 -233 days +06:12:00
7 old 2015-06-05 11:12:00 male 9 6 51 days 04:00:00
8 old 2015-05-19 19:22:00 male 12 6 -17 days +08:10:00
9 old 2015-04-06 12:57:00 male 15 6 -44 days +17:35:00
10 old 2015-06-15 03:23:00 male 17 6 69 days 14:26:00
11 young 2015-12-05 14:19:00 female 11 4 NaT
12 young 2015-05-27 22:31:00 female 13 4 -192 days +08:12:00
13 young 2015-01-06 11:09:00 female 14 4 -142 days +12:38:00
14 young 2015-06-19 05:37:00 female 18 4 163 days 18:28:00
And this exactly what I'd expect. However, it seems to rely on creating the groupby object twice (in exactly the same way). If the second groupby definition is commented out, it seems to lead to a very different output in the diff column:
import numpy as np
import pandas as pd
# Create dataframe consisting of id, date and two categories (gender and age)
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Convert date to datetime
tempDF['date'] = pd.to_datetime(tempDF['date'])
# Create groupby object based on two categorical variables
tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
# Count number in each group and merge with original dataframe to create 'count' column
tempCountsDF = tempGroupby['id'].count().reset_index(drop=False)
tempCountsDF = tempCountsDF.rename(columns={'id': 'count'})
tempDF = tempDF.merge(tempCountsDF, on=['gender','age'])
# Calculate difference between consecutive rows in each group. (First row in each
# group should have date difference = NaT)
# ****** THIS TIME THE FOLLOWING GROUPBY DEFINITION IS COMMENTED OUT *****
# tempGroupby = tempDF.sort_values(['gender','age','id']).groupby(['gender','age'])
tempDF['diff'] = tempGroupby['date'].diff()
print(tempDF)
And, this time the output is very different (and NOT what I wanted at all)
age date gender id count diff
0 young 2015-02-04 02:34:00 male 1 2 NaT
1 young 2015-10-05 08:52:00 male 10 2 NaT
2 old 2015-06-04 12:34:00 female 2 3 92 days 10:29:00
3 old 2015-09-04 23:03:00 female 3 3 NaT
4 old 2015-04-21 12:59:00 female 6 3 -233 days +06:12:00
5 old 2015-12-04 01:00:00 male 4 6 -137 days +13:56:00
6 old 2015-04-15 07:12:00 male 5 6 NaT
7 old 2015-06-05 11:12:00 male 9 6 NaT
8 old 2015-05-19 19:22:00 male 12 6 51 days 04:00:00
9 old 2015-04-06 12:57:00 male 15 6 243 days 06:18:00
10 old 2015-06-15 03:23:00 male 17 6 NaT
11 young 2015-12-05 14:19:00 female 11 4 -17 days +08:10:00
12 young 2015-05-27 22:31:00 female 13 4 -192 days +08:12:00
13 young 2015-01-06 11:09:00 female 14 4 -142 days +12:38:00
14 young 2015-06-19 05:37:00 female 18 4 -44 days +17:35:00
(In my real-life script the results seem to be a little erratic, sometimes it works and sometimes it doesn't. But in the above script, the different outputs seem to occur consistently.)
Why is it necessary to recreate the groupby object on what is, essentially, the same dataframe (albeit with an additional column added) immediately before using the .diff() function? This seems very dangerous to me.
Not the same, the index has changed. For example:
tempDF.loc[1].id # before
10
tempDF.loc[1].id # after
2
So if you compute tempGroupby with the old tempDF and then change the indexes in tempDF when you do this:
tempDF['diff'] = tempGroupby['date'].diff()
the indexes do not match as you expect. You are assigning to each row the difference corresponding to the row that had that index in the old tempDF.

Categories

Resources