I have two pandas dataframes, both with two columns: datetime and value (float). I want to subtract the value of dataframe B from the value of dataframe A, matching rows on the nearest datetime.
Example:
dataframe A:
datetime | value
01-01-2016 00:00 | 10
01-01-2016 01:00 | 12
01-01-2016 02:00 | 14
01-01-2016 03:00 | 12
01-01-2016 04:00 | 12
01-01-2016 05:00 | 16
01-01-2016 06:00 | 18
dataframe B:
datetime | value
01-01-2016 00:20 | 5
01-01-2016 00:50 | -5
01-01-2016 01:20 | 12
01-01-2016 01:50 | 30
01-01-2016 02:20 | 1
01-01-2016 02:50 | 6
01-01-2016 03:50 | 0
In the case of the first row of A, the nearest datetime in B is also the first row, and thus: 10 - 5 = 5. In the case of the fourth row of A (01-01-2016 03:00), the sixth row of B is nearest and the difference is: 12 - 6 = 6.
I currently do this using a for loop:
for i, row in data.iterrows():
    # i is the index, a Timestamp
    data['h'][i] = row['h'] - baro.iloc[baro.index.get_loc(i, method='nearest')]['h']
It works fine, but would it be possible to do this faster?
New with pandas 0.19: pd.merge_asof
pd.merge_asof(dfa, dfb, on='datetime')
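Note that by default merge_asof matches each left row to the nearest earlier key (direction='backward'); true nearest matching needs direction='nearest', available from pandas 0.20. A minimal runnable sketch on the sample data above (column names as in the question):
import pandas as pd

dfa = pd.DataFrame({'datetime': pd.date_range('2016-01-01', periods=7, freq='H'),
                    'value': [10, 12, 14, 12, 12, 16, 18]})
dfb = pd.DataFrame({'datetime': pd.to_datetime(['2016-01-01 00:20', '2016-01-01 00:50',
                                                '2016-01-01 01:20', '2016-01-01 01:50',
                                                '2016-01-01 02:20', '2016-01-01 02:50',
                                                '2016-01-01 03:50']),
                    'value': [5, -5, 12, 30, 1, 6, 0]})

# Both frames must already be sorted on the key column.
merged = pd.merge_asof(dfa, dfb, on='datetime', direction='nearest',
                       suffixes=('_a', '_b'))
merged['diff'] = merged['value_a'] - merged['value_b']
# merged['diff'] is 5, 17, -16, 6, 12, 16, 18, as in the answer below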
IIUC you can use the reindex(..., method='nearest') method if you are using a Pandas version < 0.19.0; starting from 0.19.0 it definitely makes sense to use pd.merge_asof, which is much more convenient and much more efficient too:
df1 = df1.set_index('datetime')
df2 = df2.set_index('datetime')
In [214]: df1.join(df2.reindex(df1.index, method='nearest'), rsuffix='_right')
Out[214]:
value value_right
datetime
2016-01-01 00:00:00 10 5
2016-01-01 01:00:00 12 -5
2016-01-01 02:00:00 14 30
2016-01-01 03:00:00 12 6
2016-01-01 04:00:00 12 0
2016-01-01 05:00:00 16 0
2016-01-01 06:00:00 18 0
In [224]: df1.value - df2.reindex(df1.index, method='nearest').value
Out[224]:
datetime
2016-01-01 00:00:00 5
2016-01-01 01:00:00 17
2016-01-01 02:00:00 -16
2016-01-01 03:00:00 6
2016-01-01 04:00:00 12
2016-01-01 05:00:00 16
2016-01-01 06:00:00 18
Name: value, dtype: int64
In [218]: merged = df1.join(df2.reindex(df1.index, method='nearest'), rsuffix='_right')
In [220]: merged.value.subtract(merged.value_right)
Out[220]:
datetime
2016-01-01 00:00:00 5
2016-01-01 01:00:00 17
2016-01-01 02:00:00 -16
2016-01-01 03:00:00 6
2016-01-01 04:00:00 12
2016-01-01 05:00:00 16
2016-01-01 06:00:00 18
dtype: int64
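A related refinement, as a sketch (recent pandas versions also accept a tolerance argument alongside method='nearest'): cap how far the nearest match may reach, so rows of df1 with no sufficiently close row in df2 get NaN instead of a far-away value:
nearest = df2.reindex(df1.index, method='nearest', tolerance=pd.Timedelta('30min'))
diff = df1.value - nearest.value
# here the 05:00 and 06:00 rows become NaN: their nearest match (03:50) is more than 30 minutes away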
Related
I have got a time series of meteorological observations with date and value columns:
df = pd.DataFrame({'date':['11/10/2017 0:00','11/10/2017 03:00','11/10/2017 06:00','11/10/2017 09:00','11/10/2017 12:00',
'11/11/2017 0:00','11/11/2017 03:00','11/11/2017 06:00','11/11/2017 09:00','11/11/2017 12:00',
'11/12/2017 00:00','11/12/2017 03:00','11/12/2017 06:00','11/12/2017 09:00','11/12/2017 12:00'],
'value':[850,np.nan,np.nan,np.nan,np.nan,500,650,780,np.nan,800,350,690,780,np.nan,np.nan],
'consecutive_hour': [ 3,0,0,0,0,3,6,9,0,3,3,6,9,0,0]})
With this DataFrame, I want a third column consecutive_hour such that whenever the value at a timestamp is below 1000 we record 3 (hours) in consecutive_hour, and consecutive such occurrences accumulate to 6, 9, and so on, as shown above.
Lastly, I want to summarize the table counting consecutive hours occurrence and number of days such that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions and methods like shift(), diff() etc., as mentioned in: How to groupby consecutive values in pandas DataFrame, and more. I spent several days on this but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by @jezrael:
Python pandas cumsum with reset everytime there is a 0
def cumcount_reset(b):
    # cumulative count of consecutive True values, resetting at each False
    return b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)
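To see what cumcount_reset does in isolation, here is a tiny demonstration on a hand-made boolean series (illustrative values only):
b = pd.Series([True, True, False, True, False, True, True, True])
print(cumcount_reset(b).tolist())
# [1, 2, 0, 1, 0, 1, 2, 3] -- the count restarts after every False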
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
.groupby(pd.Grouper(freq="D")) \
.apply(lambda b: cumcount_reset(b)).mul(3) \
.reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
Summary table: keep only the rows where a streak ends (the consecutive_hour value is greater than the next value within the same day), then count the occurrences of each streak length:
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"] \
.apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
"consecutive_hour"] \
.value_counts().reindex([3, 6, 9, 12], fill_value=0) \
.rename("number_of_day") \
.rename_axis("consecutive_hour") \
.reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0
I have this DataFrame recording the time periods in which some tasks happened:
Date Start Time End Time
0 2016-01-01 0:00:00 2016-01-01 0:10:00 2016-01-01 0:25:00
1 2016-01-01 0:00:00 2016-01-01 1:17:00 2016-01-01 1:31:00
2 2016-01-02 0:00:00 2016-01-02 0:30:00 2016-01-02 0:32:00
... ... ... ...
I want to convert this df to 30-minute intervals.
Expected outcome
Date Hours
1 2016-01-01 0:30:00 0:15
2 2016-01-01 1:00:00 0:00
3 2016-01-01 1:30:00 0:13
4 2016-01-01 2:00:00 0:01
5 2016-01-01 2:30:00 0:00
6 2016-01-01 3:00:00 0:00
... ...
47 2016-01-01 23:30:00 0:00
48 2016-01-02 00:00:00 0:00
49 2016-01-02 00:30:00 0:00
50 2016-01-02 01:00:00 0:02
... ...
I was trying to do this with a for loop, which was getting tedious. Is there a simple way to do it in pandas?
IIUC you can discard the Date column, get the time difference between start and end, group by 30 minutes on the start time, and aggregate with first (assuming you always have at most one entry per 30-minute slot):
print(df.assign(Diff=df["End Time"] - df["Start Time"])
        .groupby(pd.Grouper(key="Start Time", freq="30T"))
        .agg({"Diff": "first"})
        .fillna(pd.Timedelta(seconds=0)))
Diff
Start Time
2016-01-01 00:00:00 0 days 00:15:00
2016-01-01 00:30:00 0 days 00:00:00
2016-01-01 01:00:00 0 days 00:14:00
2016-01-01 01:30:00 0 days 00:00:00
2016-01-01 02:00:00 0 days 00:00:00
2016-01-01 02:30:00 0 days 00:00:00
...
2016-01-02 00:30:00 0 days 00:02:00
The idea is to create a series of 0s with a minute-frequency DatetimeIndex between the min start time and the max end time. Then add 1 at each Start Time and subtract 1 at each End Time. You can then use cumsum to mark the active minutes between each Start and End, resample.sum per 30 minutes, and reset_index. The last line of code just formats the Hours column.
# create a series of 0s with a minute-frequency DatetimeIndex
res = pd.Series(data=0,
                index=pd.DatetimeIndex(pd.date_range(df['Start Time'].min(),
                                                     df['End Time'].max(),
                                                     freq='T'),
                                       name='Dates'),
                name='Hours')
# add 1 at each start time and subtract 1 at each end time
res[df['Start Time']] += 1
res[df['End Time']] -= 1
# cumsum to get the right value for each minute then resample per 30 minutes
res = (res.cumsum()
       .resample('30T', label='right').sum()
       .reset_index('Dates'))
# change the format of the Hours column, honestly not necessary
res['Hours'] = pd.to_datetime(res['Hours'], format='%M').dt.strftime('%H:%M') # or .dt.time
print(res)
Dates Hours
0 2016-01-01 00:30:00 00:15
1 2016-01-01 01:00:00 00:00
2 2016-01-01 01:30:00 00:13
3 2016-01-01 02:00:00 00:01
4 2016-01-01 02:30:00 00:00
5 2016-01-01 03:00:00 00:00
...
48 2016-01-02 00:30:00 00:00
49 2016-01-02 01:00:00 00:02
I have two dataframes (df and df1) as shown below
import numpy as np
import pandas as pd
from datetime import timedelta

df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],
'start_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM', '06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = df.start_date + timedelta(days=5)
df['enc_id'] = ['ABC1','ABC2','ABC3','ABC4','DEF1','DEF2','DEF3']
df1 = pd.DataFrame({'person_id': [101,101,101,101,101,101,101,202,202,202,202,202,202,202,202],'date_1':['07/07/2013 11:20:00 AM','05/07/2013 02:30:00 PM','06/07/2013 02:40:00 PM','08/06/2014 12:00:00 AM','11/06/2014 12:00:00 AM','02/03/2013 12:30:00 PM','13/06/2014 12:00:00 AM','12/11/2011 12:00:00 AM','13/10/2012 07:00:00 AM','13/12/2015 12:00:00 AM','13/12/2012 12:00:00 AM','13/12/2012 06:30:00 PM','13/07/2011 10:00:00 AM','18/12/2012 10:00:00 AM', '19/12/2013 11:00:00 AM']})
df1['date_1'] = pd.to_datetime(df1['date_1'])
df1['within_id'] = ['ABC','ABC','ABC','ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF','DEF','DEF','DEF',np.nan]
What I would like to do is
a) Pick each person from df1 who doesn't have NA in the 'within_id' column and check whether their date_1 is between (df.start_date - 1 day) and (df.end_date + 1 day) for the same person in df and for the same within_id or enc_id.
For example, for subject 101 and within_id = ABC, the first date_1 is 7/7/2013; you check whether it falls between 4/7/2013 (df.start_date - 1) and 11/7/2013 (df.end_date + 1).
As the first-row comparison itself gave us the result, we don't have to compare date_1 with the rest of the records in df for subject 101. If not, we keep scanning until we find the interval within which date_1 falls.
b) If date interval found, then assign the corresponding enc_id from df to the within_id in df1
c) If not then assign, "Out of Range"
I tried the below
t1 = df.groupby('person_id').apply(pd.DataFrame.sort_values, 'start_date')
t2 = df1.groupby('person_id').apply(pd.DataFrame.sort_values, 'date_1')
t3= pd.concat([t1, t2], axis=1)
t3['within_id'] = np.where((t3['date_1'] >= t3['start_date']) &
                           (t3['person_id'] == t3['person_id_x']) &
                           (t3['date_2'] >= t3['end_date']),
                           t3['enc_id'], 'Out of Range')
I expect my output (also see the 14th row at the bottom of my screenshot) to be as shown below. As I intend to apply the solution to big data (4-5 million records, and there might be 5000-6000 unique person_ids), any efficient and elegant solution is helpful.
14 202 2013-12-19 11:00:00 NA
Let's do:
d = df1.merge(df.assign(within_id=df['enc_id'].str[:3]),
on=['person_id', 'within_id'], how='left', indicator=True)
m = d['date_1'].between(d['start_date'] - pd.Timedelta(days=1),
d['end_date'] + pd.Timedelta(days=1))
d = df1.merge(d[m | d['_merge'].ne('both')], on=['person_id', 'date_1'], how='left')
d['within_id'] = d['enc_id'].fillna('out of range').mask(d['_merge'].eq('left_only'))
d = d[df1.columns]
Details:
Left merge the dataframe df1 with df on person_id and within_id:
print(d)
person_id date_1 within_id start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
1 101 2013-07-07 11:20:00 ABC 2013-09-08 11:21:00 2013-09-13 11:21:00 ABC2 both
2 101 2013-07-07 11:20:00 ABC 2014-06-06 08:00:00 2014-06-11 08:00:00 ABC3 both
3 101 2013-07-07 11:20:00 ABC 2014-06-06 05:00:00 2014-06-11 10:00:00 DEF1 both
....
47 202 2012-12-18 10:00:00 DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
48 202 2012-12-18 10:00:00 DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
49 202 2013-12-19 11:00:00 NaN NaT NaT NaN left_only
Create a boolean mask m to represent the condition where date_1 is between df.start_date - 1 days and df.end_date + 1 days:
print(m)
0 False
1 False
2 False
3 False
...
47 False
48 True
49 False
dtype: bool
Again left merge the dataframe df1 with the dataframe filtered using mask m on columns person_id and date_1:
print(d)
person_id date_1 within_id_x within_id_y start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC NaN NaT NaT NaN NaN
1 101 2013-05-07 14:30:00 ABC ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
2 101 2013-06-07 14:40:00 ABC NaN NaT NaT NaN NaN
3 101 2014-08-06 00:00:00 ABC NaN NaT NaT NaN NaN
4 101 2014-11-06 00:00:00 ABC NaN NaT NaT NaN NaN
5 101 2013-02-03 12:30:00 ABC NaN NaT NaT NaN NaN
6 101 2014-06-13 00:00:00 ABC NaN NaT NaT NaN NaN
7 202 2011-12-11 00:00:00 DEF DEF 2011-12-11 10:00:00 2011-12-16 10:00:00 DEF1 both
8 202 2012-10-13 07:00:00 DEF DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
9 202 2015-12-13 00:00:00 DEF NaN NaT NaT NaN NaN
10 202 2012-12-13 00:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
11 202 2012-12-13 18:30:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
12 202 2011-07-13 10:00:00 DEF NaN NaT NaT NaN NaN
13 202 2012-12-18 10:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
14 202 2013-12-19 11:00:00 NaN NaN NaT NaT NaN left_only
Populate the within_id column from enc_id; using Series.fillna, fill the NaN values with 'out of range' (while masking back to NaN the rows that had no match in df at all); finally select the original columns to get the result:
print(d)
person_id date_1 within_id
0 101 2013-07-07 11:20:00 out of range
1 101 2013-05-07 14:30:00 ABC1
2 101 2013-06-07 14:40:00 out of range
3 101 2014-08-06 00:00:00 out of range
4 101 2014-11-06 00:00:00 out of range
5 101 2013-02-03 12:30:00 out of range
6 101 2014-06-13 00:00:00 out of range
7 202 2011-12-11 00:00:00 DEF1
8 202 2012-10-13 07:00:00 DEF2
9 202 2015-12-13 00:00:00 out of range
10 202 2012-12-13 00:00:00 DEF3
11 202 2012-12-13 18:30:00 DEF3
12 202 2011-07-13 10:00:00 out of range
13 202 2012-12-18 10:00:00 DEF3
14 202 2013-12-19 11:00:00 NaN
I used df and df1 as provided above.
The basic approach is to iterate over df1 and extract the matching values of enc_id.
I added a 'rule' column to show how each value got populated.
Unfortunately, I was not able to reproduce the expected results. Perhaps the general approach will be useful.
df1['rule'] = 0
for t in df1.itertuples():
    person = (t.person_id == df.person_id)
    b = (t.date_1 >= df.start_date) & (t.date_2 <= df.end_date)
    c = (t.date_1 >= df.start_date) & (t.date_2 >= df.end_date)
    d = (t.date_1 <= df.start_date) & (t.date_2 <= df.end_date)
    e = (t.date_1 <= df.start_date) & (t.date_2 <= df.start_date)  # start_date at BOTH ends
    if (m := person & b).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 1
    elif (m := person & c).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 10
    elif (m := person & d).any():
        df1.at[t.Index, 'within_id'] = df.loc[m, 'enc_id'].values[0]
        df1.at[t.Index, 'rule'] += 100
    elif (m := person & e).any():
        df1.at[t.Index, 'within_id'] = 'out of range'
        df1.at[t.Index, 'rule'] += 1_000
    else:
        df1.at[t.Index, 'within_id'] = 'impossible!'
        df1.at[t.Index, 'rule'] += 10_000
df1['within_id'] = df1['within_id'].astype('Int64')
The results are:
print(df1)
person_id date_1 date_2 within_id rule
0 11 1961-12-30 00:00:00 1962-01-01 00:00:00 11345678901 1
1 11 1962-01-30 00:00:00 1962-02-01 00:00:00 11345678902 1
2 12 1962-02-28 00:00:00 1962-03-02 00:00:00 34567892101 100
3 12 1989-07-29 00:00:00 1989-07-31 00:00:00 34567892101 1
4 12 1989-09-03 00:00:00 1989-09-05 00:00:00 34567892101 10
5 12 1989-10-02 00:00:00 1989-10-04 00:00:00 34567892103 1
6 12 1989-10-01 00:00:00 1989-10-03 00:00:00 34567892103 1
7 13 1999-03-29 00:00:00 1999-03-31 00:00:00 56432718901 1
8 13 1999-04-20 00:00:00 1999-04-22 00:00:00 56432718901 10
9 13 1999-06-02 00:00:00 1999-06-04 00:00:00 56432718904 1
10 13 1999-06-03 00:00:00 1999-06-05 00:00:00 56432718904 1
11 13 1999-07-29 00:00:00 1999-07-31 00:00:00 56432718905 1
12 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
13 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
I have a table, indexed by date, that has values of price that I want to use when creating a new column, previous_close.
date | price
2019-01-01 00:00:00 | 2
2019-01-01 04:00:00 | 3
2019-01-02 00:00:00 | 4
2019-01-02 04:00:00 | 5
I want to generate a column previous_close that holds, for each row, the last price of the previous day, so the output will be as follows:
date | price | previous_close
2019-01-01 00:00:00 | 2 | NaN
2019-01-01 04:00:00 | 3 | NaN
2019-01-02 00:00:00 | 4 | 3
2019-01-02 04:00:00 | 5 | 3
So far the only way I've figured out is to use df.apply, which iterates row-wise and for every row filters the index for the last row of the latest preceding day. However, even though the DataFrame is date-indexed this takes a lot of time; for a table with a hundred thousand rows it takes several minutes to populate.
I was wondering if there was any way to create the new series in a vectorized form; something like df.shift(num_periods) but with the num_periods adjusted according to the row's date value.
For the reindexing part, I suggest the same approach as in the question above:
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({"date": pd.date_range("2019-01-01 22:00:00", periods=10, freq="H"),
"price": np.random.randint(1, 100, 10)})
df = df.set_index("date")
df = pd.concat([df.price,
                df.resample("d").last().shift()
                  .rename(columns={"price": "close"})
                  .reindex(df.index, method='ffill')],
               axis=1)
And you get the result:
price close
date
2019-01-01 22:00:00 67 NaN
2019-01-01 23:00:00 93 NaN
2019-01-02 00:00:00 99 93.0
2019-01-02 01:00:00 18 93.0
2019-01-02 02:00:00 84 93.0
2019-01-02 03:00:00 58 93.0
2019-01-02 04:00:00 87 93.0
2019-01-02 05:00:00 98 93.0
2019-01-02 06:00:00 97 93.0
2019-01-02 07:00:00 48 93.0
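For reference, the same idea can be written with a plain groupby instead of resample; this is only a sketch, and it assumes every calendar day in the index has at least one observation, because shift(1) on the daily series steps back one observed day rather than one calendar day:
import numpy as np
import pandas as pd

np.random.seed(123)
prices = pd.Series(np.random.randint(1, 100, 10),
                   index=pd.date_range("2019-01-01 22:00:00", periods=10, freq="H"),
                   name="price")

# Last price of each calendar day, shifted down one observed day.
daily_close = prices.groupby(prices.index.normalize()).last().shift()

out = prices.to_frame()
# Broadcast each day's previous close back to every row of that day.
out["close"] = prices.index.normalize().map(daily_close)
# reproduces the close column shown above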
EDIT:
If your business day ends at 02:00 and you want the close for that hour, I suggest using DateOffset (as in here) on the date and applying the same method:
df = pd.DataFrame({"date": pd.date_range("2019-01-01 22:00:00", periods=10, freq="H"),
"price": np.random.randint(1, 100, 10)})
df["proxy"] = df.date + pd.DateOffset(hours=-3)
df = df.set_index("proxy")
df = pd.concat([df[["price", "date"]],
(df.price.resample("d").last().shift()
.rename({"price":"close"})
.reindex(df.index, method='ffill'))],
axis = 1).reset_index(drop=True).set_index("date")
You get the result:
price close
date
2019-01-01 22:00:00 67 NaN
2019-01-01 23:00:00 93 NaN
2019-01-02 00:00:00 99 NaN
2019-01-02 01:00:00 18 NaN
2019-01-02 02:00:00 84 NaN
2019-01-02 03:00:00 58 84.0
2019-01-02 04:00:00 87 84.0
2019-01-02 05:00:00 98 84.0
2019-01-02 06:00:00 97 84.0
2019-01-02 07:00:00 48 84.0
I'd like to merge one data frame with another, where the merge is conditional on the date/time falling in a particular range.
For example, let's say I have the following two data frames.
import pandas as pd
import datetime
# Create main data frame.
data = pd.DataFrame()
time_seq1 = pd.DataFrame(pd.date_range('1/1/2016', periods=3, freq='H'))
time_seq2 = pd.DataFrame(pd.date_range('1/2/2016', periods=3, freq='H'))
data = data.append(time_seq1, ignore_index=True)
data = data.append(time_seq1, ignore_index=True)
data = data.append(time_seq1, ignore_index=True)
data = data.append(time_seq2, ignore_index=True)
data['myID'] = ['001','001','001','002','002','002','003','003','003','004','004','004']
data.columns = ['Timestamp', 'myID']
# Create second data frame.
data2 = pd.DataFrame()
data2['time'] = [pd.to_datetime('1/1/2016 12:06 AM'), pd.to_datetime('1/1/2016 1:34 AM'), pd.to_datetime('1/2/2016 12:25 AM')]
data2['myID'] = ['002', '003', '004']
data2['specialID'] = ['foo_0', 'foo_1', 'foo_2']
# Show data frames.
data
Timestamp myID
0 2016-01-01 00:00:00 001
1 2016-01-01 01:00:00 001
2 2016-01-01 02:00:00 001
3 2016-01-01 00:00:00 002
4 2016-01-01 01:00:00 002
5 2016-01-01 02:00:00 002
6 2016-01-01 00:00:00 003
7 2016-01-01 01:00:00 003
8 2016-01-01 02:00:00 003
9 2016-01-02 00:00:00 004
10 2016-01-02 01:00:00 004
11 2016-01-02 02:00:00 004
data2
time myID specialID
0 2016-01-01 00:06:00 002 foo_0
1 2016-01-01 01:34:00 003 foo_1
2 2016-01-02 00:25:00 004 foo_2
I would like to construct the following output.
# Desired output.
Timestamp myID special_ID
0 2016-01-01 00:00:00 001 NaN
1 2016-01-01 01:00:00 001 NaN
2 2016-01-01 02:00:00 001 NaN
3 2016-01-01 00:00:00 002 NaN
4 2016-01-01 01:00:00 002 foo_0
5 2016-01-01 02:00:00 002 NaN
6 2016-01-01 00:00:00 003 NaN
7 2016-01-01 01:00:00 003 NaN
8 2016-01-01 02:00:00 003 foo_1
9 2016-01-02 00:00:00 004 NaN
10 2016-01-02 01:00:00 004 foo_2
11 2016-01-02 02:00:00 004 NaN
In particular, I want to merge special_ID into data such that Timestamp is the first time occurring after the value of time. For example, foo_0 would be in the row corresponding to 2016-01-01 01:00:00 with myID = 002 since that is the next time in data immediately following 2016-01-01 00:06:00 (the time of special_ID = foo_0) among the rows containing myID = 002.
Note, Timestamp is not the index of data and time is not the index of data2. Most other related posts seem to rely on using the datetime object as the index of the data frame.
You can use merge_asof, which is new in Pandas 0.19, to do most of the work. Then, combine loc and duplicated to remove secondary matches:
import numpy as np

# Data needs to be sorted for merge_asof.
data = data.sort_values(by='Timestamp')
# Perform the merge_asof.
df = pd.merge_asof(data, data2, left_on='Timestamp', right_on='time', by='myID').drop('time', axis=1)
# Make the additional matches null.
df.loc[df['specialID'].duplicated(), 'specialID'] = np.nan
# Get the original ordering.
df = df.set_index(data.index).sort_index()
The resulting output:
Timestamp myID specialID
0 2016-01-01 00:00:00 001 NaN
1 2016-01-01 01:00:00 001 NaN
2 2016-01-01 02:00:00 001 NaN
3 2016-01-01 00:00:00 002 NaN
4 2016-01-01 01:00:00 002 foo_0
5 2016-01-01 02:00:00 002 NaN
6 2016-01-01 00:00:00 003 NaN
7 2016-01-01 01:00:00 003 NaN
8 2016-01-01 02:00:00 003 foo_1
9 2016-01-02 00:00:00 004 NaN
10 2016-01-02 01:00:00 004 foo_2
11 2016-01-02 02:00:00 004 NaN
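As an aside, a sketch assuming pandas >= 0.20, where merge_asof gained the direction parameter: the "first Timestamp after time" semantics can be expressed directly by asof-merging in the other direction, which avoids the deduplication step (provided at most one event maps to a given (Timestamp, myID) pair, as here):
# Match each special event to the first Timestamp at or after its time.
events = pd.merge_asof(data2.sort_values('time'),
                       data.sort_values('Timestamp'),
                       left_on='time', right_on='Timestamp',
                       by='myID', direction='forward')

# Attach specialID back onto the original rows of data.
out = data.merge(events[['Timestamp', 'myID', 'specialID']],
                 on=['Timestamp', 'myID'], how='left')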
Not very beautiful, but I think it works.
data['specialID'] = None
foolist = list(data2['myID'])
for i in data.index:
    if data.myID[i] in foolist:
        if data.Timestamp[i] > list(data2[data2['myID'] == data.myID[i]].time)[0]:
            data['specialID'][i] = list(data2[data2['myID'] == data.myID[i]].specialID)[0]
            foolist.remove(list(data2[data2['myID'] == data.myID[i]].myID)[0])
In [95]: data
Out[95]:
Timestamp myID specialID
0 2016-01-01 00:00:00 001 None
1 2016-01-01 01:00:00 001 None
2 2016-01-01 02:00:00 001 None
3 2016-01-01 00:00:00 002 None
4 2016-01-01 01:00:00 002 foo_0
5 2016-01-01 02:00:00 002 None
6 2016-01-01 00:00:00 003 None
7 2016-01-01 01:00:00 003 None
8 2016-01-01 02:00:00 003 foo_1
9 2016-01-02 00:00:00 004 None
10 2016-01-02 01:00:00 004 foo_2
11 2016-01-02 02:00:00 004 None