I'd like to merge one data frame with another, where the merge is conditional on the date/time falling in a particular range.
For example, let's say I have the following two data frames.
import pandas as pd
import datetime
# Create main data frame.
data = pd.DataFrame()
time_seq1 = pd.DataFrame(pd.date_range('1/1/2016', periods=3, freq='H'))
time_seq2 = pd.DataFrame(pd.date_range('1/2/2016', periods=3, freq='H'))
# Stack three copies of time_seq1 (for IDs 001-003) and one of time_seq2 (for ID 004).
data = data.append(time_seq1, ignore_index=True)
data = data.append(time_seq1, ignore_index=True)
data = data.append(time_seq1, ignore_index=True)
data = data.append(time_seq2, ignore_index=True)
data['myID'] = ['001','001','001','002','002','002','003','003','003','004','004','004']
data.columns = ['Timestamp', 'myID']
# Create second data frame.
data2 = pd.DataFrame()
data2['time'] = [pd.to_datetime('1/1/2016 12:06 AM'), pd.to_datetime('1/1/2016 1:34 AM'), pd.to_datetime('1/2/2016 12:25 AM')]
data2['myID'] = ['002', '003', '004']
data2['specialID'] = ['foo_0', 'foo_1', 'foo_2']
# Show data frames.
data
Timestamp myID
0 2016-01-01 00:00:00 001
1 2016-01-01 01:00:00 001
2 2016-01-01 02:00:00 001
3 2016-01-01 00:00:00 002
4 2016-01-01 01:00:00 002
5 2016-01-01 02:00:00 002
6 2016-01-01 00:00:00 003
7 2016-01-01 01:00:00 003
8 2016-01-01 02:00:00 003
9 2016-01-02 00:00:00 004
10 2016-01-02 01:00:00 004
11 2016-01-02 02:00:00 004
data2
time myID specialID
0 2016-01-01 00:06:00 002 foo_0
1 2016-01-01 01:34:00 003 foo_1
2 2016-01-02 00:25:00 004 foo_2
I would like to construct the following output.
# Desired output.
Timestamp myID specialID
0 2016-01-01 00:00:00 001 NaN
1 2016-01-01 01:00:00 001 NaN
2 2016-01-01 02:00:00 001 NaN
3 2016-01-01 00:00:00 002 NaN
4 2016-01-01 01:00:00 002 foo_0
5 2016-01-01 02:00:00 002 NaN
6 2016-01-01 00:00:00 003 NaN
7 2016-01-01 01:00:00 003 NaN
8 2016-01-01 02:00:00 003 foo_1
9 2016-01-02 00:00:00 004 NaN
10 2016-01-02 01:00:00 004 foo_2
11 2016-01-02 02:00:00 004 NaN
In particular, I want to merge specialID into data such that Timestamp is the first time occurring after the value of time. For example, foo_0 belongs in the row with Timestamp 2016-01-01 01:00:00 and myID = 002, since that is the first time in data after 2016-01-01 00:06:00 (the time of specialID = foo_0) among the rows with myID = 002.
Note, Timestamp is not the index of data and time is not the index of data2. Most other related posts seem to rely on using the datetime object as the index of the data frame.
You can use merge_asof, which is new in Pandas 0.19, to do most of the work. Then, combine loc and duplicated to remove secondary matches:
import numpy as np

# merge_asof requires both frames to be sorted on the merge keys.
data = data.sort_values(by='Timestamp')
# For each row of data, find the most recent data2['time'] at or before
# the Timestamp, matching within each myID group.
df = pd.merge_asof(data, data2, left_on='Timestamp', right_on='time', by='myID').drop('time', axis=1)
# merge_asof repeats a match for every later row in the group;
# keep only the first occurrence and null out the rest.
df.loc[df['specialID'].duplicated(), 'specialID'] = np.nan
# Restore the original ordering.
df = df.set_index(data.index).sort_index()
The resulting output:
Timestamp myID specialID
0 2016-01-01 00:00:00 001 NaN
1 2016-01-01 01:00:00 001 NaN
2 2016-01-01 02:00:00 001 NaN
3 2016-01-01 00:00:00 002 NaN
4 2016-01-01 01:00:00 002 foo_0
5 2016-01-01 02:00:00 002 NaN
6 2016-01-01 00:00:00 003 NaN
7 2016-01-01 01:00:00 003 NaN
8 2016-01-01 02:00:00 003 foo_1
9 2016-01-02 00:00:00 004 NaN
10 2016-01-02 01:00:00 004 foo_2
11 2016-01-02 02:00:00 004 NaN
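A hedged aside: on pandas 0.20+, merge_asof also takes direction='forward', which lets you match each event in data2 directly to the first Timestamp at or after it and skip the duplicated() cleanup. A minimal sketch, assuming the same frames as above:
# For each data2 event, find the first data row (within the same myID)
# whose Timestamp is at or after the event time, then merge back into data.
matches = pd.merge_asof(data2.sort_values('time'),
                        data.sort_values('Timestamp'),
                        left_on='time', right_on='Timestamp',
                        by='myID', direction='forward')
df = data.merge(matches[['Timestamp', 'myID', 'specialID']],
                on=['Timestamp', 'myID'], how='left')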
Not very beautiful, but I think it works.
data['specialID'] = None
foolist = list(data2['myID'])
for i in data.index:
    if data.myID[i] in foolist:
        # Assign the specialID to the first row whose Timestamp falls
        # after the event time, then retire that myID from the list.
        if data.Timestamp[i] > data2[data2['myID'] == data.myID[i]].time.iloc[0]:
            data.loc[i, 'specialID'] = data2[data2['myID'] == data.myID[i]].specialID.iloc[0]
            foolist.remove(data.myID[i])
In [95]: data
Out[95]:
Timestamp myID specialID
0 2016-01-01 00:00:00 001 None
1 2016-01-01 01:00:00 001 None
2 2016-01-01 02:00:00 001 None
3 2016-01-01 00:00:00 002 None
4 2016-01-01 01:00:00 002 foo_0
5 2016-01-01 02:00:00 002 None
6 2016-01-01 00:00:00 003 None
7 2016-01-01 01:00:00 003 None
8 2016-01-01 02:00:00 003 foo_1
9 2016-01-02 00:00:00 004 None
10 2016-01-02 01:00:00 004 foo_2
11 2016-01-02 02:00:00 004 None
Related
I have this DataFrame where some tasks happened within a time period:
Date Start Time End Time
0 2016-01-01 0:00:00 2016-01-01 0:10:00 2016-01-01 0:25:00
1 2016-01-01 0:00:00 2016-01-01 1:17:00 2016-01-01 1:31:00
2 2016-01-02 0:00:00 2016-01-02 0:30:00 2016-01-02 0:32:00
... ... ... ...
I want to convert this df to 30-minute intervals.
Expected outcome
Date Hours
1 2016-01-01 0:30:00 0:15
2 2016-01-01 1:00:00 0:00
3 2016-01-01 1:30:00 0:13
4 2016-01-01 2:00:00 0:01
5 2016-01-01 2:30:00 0:00
6 2016-01-01 3:00:00 0:00
... ...
47 2016-01-01 23:30:00 0:00
48 2016-01-02 00:00:00 0:00
49 2016-01-02 00:30:00 0:00
50 2016-01-02 01:00:00 0:02
... ...
I was trying to do this with a for loop, which was getting tedious. Is there a simple way to do it in pandas?
IIUC you can discard the Date column, get the time difference between start and end, group by 30-minute bins, and agg on first (assuming you only ever have one entry per 30-minute slot):
print(df.assign(Diff=df["End Time"] - df["Start Time"])
        .groupby(pd.Grouper(key="Start Time", freq="30T"))
        .agg({"Diff": "first"})
        .fillna(pd.Timedelta(seconds=0)))
Diff
Start Time
2016-01-01 00:00:00 0 days 00:15:00
2016-01-01 00:30:00 0 days 00:00:00
2016-01-01 01:00:00 0 days 00:14:00
2016-01-01 01:30:00 0 days 00:00:00
2016-01-01 02:00:00 0 days 00:00:00
2016-01-01 02:30:00 0 days 00:00:00
...
2016-01-02 00:30:00 0 days 00:02:00
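If more than one task can start inside the same 30-minute slot, a small hedged variation on the same idea: aggregate with 'sum' instead of 'first' so the durations add up.
# Sum every task duration that starts within each 30-minute slot.
out = (df.assign(Diff=df["End Time"] - df["Start Time"])
         .groupby(pd.Grouper(key="Start Time", freq="30T"))["Diff"]
         .sum()
         .fillna(pd.Timedelta(seconds=0)))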
The idea is to create a series of 0s with a DatetimeIndex at minute frequency between the min start time and the max end time. Then add 1 at each Start Time and subtract 1 at each End Time. You can then use cumsum to flag the minutes between Start and End, resample.sum per 30 minutes, and reset_index. The last line of code just gets the proper format in the Hours column.
# Create a series of 0s with a minute-frequency DatetimeIndex.
res = pd.Series(data=0,
                index=pd.DatetimeIndex(pd.date_range(df['Start Time'].min(),
                                                     df['End Time'].max(),
                                                     freq='T'),
                                       name='Dates'),
                name='Hours')
# Add 1 at each start time and subtract 1 at each end time.
res[df['Start Time']] += 1
res[df['End Time']] -= 1
# cumsum marks each in-task minute with 1; resample per 30 minutes
# (labelled on the right edge) to count the active minutes per slot.
res = (res.cumsum()
          .resample('30T', label='right').sum()
          .reset_index('Dates'))
# Change the format of the Hours column (honestly not necessary).
res['Hours'] = pd.to_datetime(res['Hours'], format='%M').dt.strftime('%H:%M')  # or .dt.time
print(res)
Dates Hours
0 2016-01-01 00:30:00 00:15
1 2016-01-01 01:00:00 00:00
2 2016-01-01 01:30:00 00:13
3 2016-01-01 02:00:00 00:01
4 2016-01-01 02:30:00 00:00
5 2016-01-01 03:00:00 00:00
...
48 2016-01-02 00:30:00 00:00
49 2016-01-02 01:00:00 00:02
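One hedged caveat: the format='%M' conversion only works while each 30-minute slot sums to fewer than 60 minutes. If overlapping tasks could push a slot past that, here is a sketch of a safer replacement for the formatting line, applied to the integer minute counts before any to_datetime conversion:
# Format the integer minute counts as H:MM directly.
res['Hours'] = res['Hours'].map(lambda m: '{}:{:02d}'.format(m // 60, m % 60))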
I have a table, indexed by date, that has values of price that I want to use when creating a new column, previous_close.
date | price
2019-01-01 00:00:00 | 2
2019-01-01 04:00:00 | 3
2019-01-02 00:00:00 | 4
2019-01-02 04:00:00 | 5
I want to generate a column previous_close that holds the last price of the previous day, so the output will be as follows:
date | price | previous_close
2019-01-01 00:00:00 | 2 | NaN
2019-01-01 04:00:00 | 3 | NaN
2019-01-02 00:00:00 | 4 | 3
2019-01-02 04:00:00 | 5 | 3
So far the only way I've figured out how to do this is with df.apply, which iterates row-wise and, for every row, filters the index for the latest preceding day's last row. However, even though the DataFrame is date-indexed, this takes a lot of time; for a table with a hundred thousand rows it takes several minutes to populate.
I was wondering if there was any way to create the new series in a vectorized form; something like df.shift(num_periods) but with the num_periods adjusted according to the row's date value.
For the reindexing part, I suggest the approach below:
import pandas as pd
import numpy as np

np.random.seed(123)
df = pd.DataFrame({"date": pd.date_range("2019-01-01 22:00:00", periods=10, freq="H"),
                   "price": np.random.randint(1, 100, 10)})
df = df.set_index("date")
# Take the last price of each day, shift it to the next day, and
# forward-fill it onto the original hourly index.
df = pd.concat([df.price,
                df.resample("d").last().shift()
                  .rename(columns={"price": "close"})
                  .reindex(df.index, method='ffill')],
               axis=1)
And you get the result:
price close
date
2019-01-01 22:00:00 67 NaN
2019-01-01 23:00:00 93 NaN
2019-01-02 00:00:00 99 93.0
2019-01-02 01:00:00 18 93.0
2019-01-02 02:00:00 84 93.0
2019-01-02 03:00:00 58 93.0
2019-01-02 04:00:00 87 93.0
2019-01-02 05:00:00 98 93.0
2019-01-02 06:00:00 97 93.0
2019-01-02 07:00:00 48 93.0
EDIT:
If your business day ends at 2 AM and you want the close as of that hour, I suggest shifting the timestamps with DateOffset and applying the same method:
df = pd.DataFrame({"date": pd.date_range("2019-01-01 22:00:00", periods=10, freq="H"),
                   "price": np.random.randint(1, 100, 10)})
# Shift timestamps back 3 hours so each proxy day runs 03:00-02:59
# and its last price is the 02:00 price.
df["proxy"] = df.date + pd.DateOffset(hours=-3)
df = df.set_index("proxy")
df = pd.concat([df[["price", "date"]],
                (df.price.resample("d").last().shift()
                   .rename("close")  # Series.rename takes the new name directly
                   .reindex(df.index, method='ffill'))],
               axis=1).reset_index(drop=True).set_index("date")
You get the result:
price close
date
2019-01-01 22:00:00 67 NaN
2019-01-01 23:00:00 93 NaN
2019-01-02 00:00:00 99 NaN
2019-01-02 01:00:00 18 NaN
2019-01-02 02:00:00 84 NaN
2019-01-02 03:00:00 58 84.0
2019-01-02 04:00:00 87 84.0
2019-01-02 05:00:00 98 84.0
2019-01-02 06:00:00 97 84.0
2019-01-02 07:00:00 48 84.0
I have two pandas DataFrames, both with two columns: datetime and value (float). I want to subtract the value of DataFrame B from the value of DataFrame A based on the nearest datetime.
Example:
dataframe A:
datetime | value
01-01-2016 00:00 | 10
01-01-2016 01:00 | 12
01-01-2016 02:00 | 14
01-01-2016 03:00 | 12
01-01-2016 04:00 | 12
01-01-2016 05:00 | 16
01-01-2016 06:00 | 18
dataframe B:
datetime | value
01-01-2016 00:20 | 5
01-01-2016 00:50 | -5
01-01-2016 01:20 | 12
01-01-2016 01:50 | 30
01-01-2016 02:20 | 1
01-01-2016 02:50 | 6
01-01-2016 03:50 | 0
For the first row of A, the nearest datetime in B is also its first row, and thus: 10 - 5 = 5. For the fourth row of A (01-01-2016 03:00), the sixth row of B is nearest and the difference would be: 12 - 6 = 6.
I currently do this using a for loop:
for i, row in data.iterrows():
    # i is the index, a Timestamp
    data['h'][i] = row['h'] - baro.iloc[baro.index.get_loc(i, method='nearest')]['h']
It works fine, but would it be possible to do this faster?
New in pandas 0.19: pd.merge_asof.
pd.merge_asof(dfa, dfb, on='datetime')
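Note that merge_asof matches backward by default (the last row of dfb at or before each row of dfa). A minimal sketch of the full subtraction, assuming pandas 0.20+ for direction='nearest' and the column names from the question:
# Both inputs must be sorted on the merge key.
merged = pd.merge_asof(dfa.sort_values('datetime'),
                       dfb.sort_values('datetime'),
                       on='datetime', direction='nearest',
                       suffixes=('_a', '_b'))
merged['diff'] = merged['value_a'] - merged['value_b']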
IIUC you can use the reindex(..., method='nearest') method if you are on a pandas version < 0.19.0; starting from 0.19.0 it definitely makes sense to use pd.merge_asof, which is much more convenient and much more efficient too:
df1 = df1.set_index('datetime')
df2 = df2.set_index('datetime')
In [214]: df1.join(df2.reindex(df1.index, method='nearest'), rsuffix='_right')
Out[214]:
value value_right
datetime
2016-01-01 00:00:00 10 5
2016-01-01 01:00:00 12 -5
2016-01-01 02:00:00 14 30
2016-01-01 03:00:00 12 6
2016-01-01 04:00:00 12 0
2016-01-01 05:00:00 16 0
2016-01-01 06:00:00 18 0
In [224]: df1.value - df2.reindex(df1.index, method='nearest').value
Out[224]:
datetime
2016-01-01 00:00:00 5
2016-01-01 01:00:00 17
2016-01-01 02:00:00 -16
2016-01-01 03:00:00 6
2016-01-01 04:00:00 12
2016-01-01 05:00:00 16
2016-01-01 06:00:00 18
Name: value, dtype: int64
In [218]: merged = df1.join(df2.reindex(df1.index, method='nearest'), rsuffix='_right')
In [220]: merged.value.subtract(merged.value_right)
Out[220]:
datetime
2016-01-01 00:00:00 5
2016-01-01 01:00:00 17
2016-01-01 02:00:00 -16
2016-01-01 03:00:00 6
2016-01-01 04:00:00 12
2016-01-01 05:00:00 16
2016-01-01 06:00:00 18
dtype: int64
I start with the following pandas DataFrame. I wish to group by each day and make a new column called 'label', which labels the group with a sequential number. How do I do this?
df = pd.DataFrame({'val': [10,40,30,10,11,13]}, index=pd.date_range('2016-01-01 00:00:00', periods=6, freq='12H' ) )
# df['label'] = df.groupby(pd.TimeGrouper('D')) # what do I do here???
print(df)
output:
val
2016-01-01 00:00:00 10
2016-01-01 12:00:00 40
2016-01-02 00:00:00 30
2016-01-02 12:00:00 10
2016-01-03 00:00:00 11
2016-01-03 12:00:00 13
desired output:
val label
2016-01-01 00:00:00 10 1
2016-01-01 12:00:00 40 1
2016-01-02 00:00:00 30 2
2016-01-02 12:00:00 10 2
2016-01-03 00:00:00 11 3
2016-01-03 12:00:00 13 3
Try this:
df = pd.DataFrame({'val': [10,40,30,10,11,13]}, index=pd.date_range('2016-01-01 00:00:00', periods=6, freq='12H' ) )
If you just want to group by date:
df['label'] = df.groupby(df.index.date).grouper.group_info[0] + 1
print(df)
To group by time more generally, you can use TimeGrouper:
df['label'] = df.groupby(pd.TimeGrouper('D')).grouper.group_info[0] + 1
print(df)
Both of the above should give you the following:
val label
2016-01-01 00:00:00 10 1
2016-01-01 12:00:00 40 1
2016-01-02 00:00:00 30 2
2016-01-02 12:00:00 10 2
2016-01-03 00:00:00 11 3
2016-01-03 12:00:00 13 3
I think this is undocumented (or hard to find, at least). Check out:
Get group id back into pandas dataframe
for more discussion.
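A hedged aside: newer pandas (0.20.2+) exposes the same group numbers through the public GroupBy.ngroup method, so you can avoid the internal grouper attribute, and pd.Grouper(freq='D') replaces the since-deprecated TimeGrouper:
# ngroup() numbers the groups 0, 1, 2, ... in sort order; add 1 to match.
df['label'] = df.groupby(pd.Grouper(freq='D')).ngroup() + 1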
Maybe a simpler and more intuitive approach is this:
df['label'] = df.groupby(df.index.day).keys
Note that this groups on the day of the month, so it only yields the desired sequential labels within a single month.
I have such a DataFrame:
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-02 00:00:00 2
2016-01-02 12:00:00 3
2016-01-03 00:00:00 4
2016-01-03 12:00:00 5
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
The reason I separate 2016-01-02 00:00:00 through 2016-01-03 12:00:00 is that those two days are a weekend.
So here is what I wish to do:
I wish to compute rolling_sum with a window of 2 business days.
For example, I wish to sum
A
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
2016-01-05 00:00:00 8
2016-01-05 12:00:00 9
and then sum (we skip any non-business days)
A
2016-01-01 00:00:00 0
2016-01-01 12:00:00 1
2016-01-04 00:00:00 6
2016-01-04 12:00:00 7
And the result is
A
2016-01-01 NaN
2016-01-04 14
2016-01-05 30
How can I achieve that?
I tried rolling_sum(df, window=2, freq=BDay(1)); it seems to just pick one row per day, rather than summing the two rows (00:00 and 12:00) within the same day.
You could first select only business days, resample to (business) daily frequency for the remaining data points and sum, and then apply rolling_sum:
Starting with some sample data:
import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame(data={'A': np.random.randint(0, 10, 500)},
                  index=pd.date_range(datetime(2016, 1, 1), freq='6H', periods=500))
A
2016-01-01 00:00:00 6
2016-01-01 06:00:00 9
2016-01-01 12:00:00 3
2016-01-01 18:00:00 9
2016-01-02 00:00:00 7
2016-01-02 06:00:00 5
2016-01-02 12:00:00 8
2016-01-02 18:00:00 6
2016-01-03 00:00:00 2
2016-01-03 06:00:00 0
2016-01-03 12:00:00 0
2016-01-03 18:00:00 0
2016-01-04 00:00:00 5
2016-01-04 06:00:00 4
2016-01-04 12:00:00 1
2016-01-04 18:00:00 4
2016-01-05 00:00:00 6
2016-01-05 06:00:00 9
2016-01-05 12:00:00 7
2016-01-05 18:00:00 2
....
First select the values on business days:
tsdays = df.index.values.astype('<M8[D]')  # truncate timestamps to dates
bdays = pd.bdate_range(tsdays[0], tsdays[-1]).values.astype('<M8[D]')
df = df[np.in1d(tsdays, bdays)]
Then apply rolling_sum() to the resampled data, where each value represents the sum for an individual business day:
pd.rolling_sum(df.resample('B', how='sum'), window=2)
to get:
A
2016-01-01 NaN
2016-01-04 41
2016-01-05 38
2016-01-06 56
2016-01-07 52
2016-01-08 37
See also the related questions on the datetime64 type conversion and on extracting business days.
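A hedged modern footnote: pd.rolling_sum and the how= argument to resample were removed in later pandas versions; an equivalent under the current API would be:
# Per-business-day totals, then a rolling sum over a 2-day window.
df.resample('B').sum().rolling(window=2).sum()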