How to merge two different dataframes with a slight difference in timestamp - python

I have calculated a 15-minute moving average from data recorded every 10 seconds. Now I want to merge the two time series (the 15-minute average and the 15-minute moving average) from different files into a new file based on the nearest timestamp.
The 15-minute moving average data is shown below. Because it is a moving average, the first few rows are NaN:
RecTime NO2_RAW NO2 Ox_RAW Ox CO_RAW CO SO2_RAW SO2
2019-06-03 00:00:08 NaN NaN NaN NaN NaN NaN NaN NaN
2019-06-03 00:00:18 NaN NaN NaN NaN NaN NaN NaN NaN
2019-06-03 00:00:28 NaN NaN NaN NaN NaN NaN NaN NaN
2019-06-03 00:00:38 NaN NaN NaN NaN NaN NaN NaN NaN
The 15-minute average data is shown below:
Site Species ReadingDateTime Value Units Provisional or Ratified
0 CR9 NO2 2019-03-06 00:00:00 8.2 ug m-3 P
1 CR9 NO2 2019-03-06 00:15:00 7.6 ug m-3 P
2 CR9 NO2 2019-03-06 00:30:00 5.9 ug m-3 P
3 CR9 NO2 2019-03-06 00:45:00 5.1 ug m-3 P
4 CR9 NO2 2019-03-06 01:00:00 5.2 ug m-3 P
I want a table like this:
ReadingDateTime Value NO2_Raw NO2
2019-06-03 00:00:00
2019-06-03 00:15:00
2019-06-03 00:30:00
2019-06-03 00:45:00
2019-06-03 01:00:00
I tried to match the two dataframes on the nearest time:
df3 = pd.merge_asof(df1, df2, left_on = 'RecTime', right_on = 'ReadingDateTime', tolerance=pd.Timedelta('59s'), allow_exact_matches=False)
I got a new dataframe
RecTime NO2_RAW NO2 Ox_RAW Ox CO_RAW CO SO2_RAW SO2 Site Species ReadingDateTime Value Units Provisional or Ratified
0 2019-06-03 00:14:58 1.271111 21.557111 65.188889 170.011111 152.944444 294.478000 -124.600000 -50.129444 NaN NaN NaT NaN NaN NaN
1 2019-06-03 00:15:08 1.294444 21.601778 65.161111 169.955667 152.844444 294.361556 -124.595556 -50.117556 NaN NaN NaT NaN NaN NaN
2 2019-06-03 00:15:18 1.318889 21.648556 65.104444 169.842556 152.750000 294.251556 -124.593333 -50.111667 NaN NaN NaT NaN NaN NaN
But all the columns coming from df2 are NaN. Can someone please help?

Assuming the minutes are correct, you could strip the seconds and then merge on the resulting timestamps:
df.RecTime.map(lambda x: x.replace(second=0))
You could either create a new column or replace the existing one before merging. Also double-check the dates: df1 shows 2019-06-03 while df2 shows 2019-03-06, which suggests one file was parsed with day and month swapped (see the dayfirst argument of pd.to_datetime).
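A minimal, self-contained sketch of that approach (column names follow the question, but the values are made-up): round RecTime to the nearest 15 minutes, then use merge_asof with direction='nearest'.

```python
import pandas as pd

# Toy versions of the two frames; values are illustrative only.
df1 = pd.DataFrame({
    "RecTime": pd.to_datetime(["2019-06-03 00:14:58", "2019-06-03 00:15:08"]),
    "NO2": [21.557, 21.602],
})
df2 = pd.DataFrame({
    "ReadingDateTime": pd.to_datetime(["2019-06-03 00:00:00", "2019-06-03 00:15:00"]),
    "Value": [8.2, 7.6],
})

# Round the 10-second timestamps to the nearest 15-minute mark,
# then match each row to the closest reading within one minute.
df1["RecTime"] = df1["RecTime"].dt.round("15min")
df3 = pd.merge_asof(
    df1, df2,
    left_on="RecTime", right_on="ReadingDateTime",
    direction="nearest", tolerance=pd.Timedelta("1min"),
)
print(df3["Value"].tolist())  # [7.6, 7.6]
```

Both 10-second rows land on the same 15-minute mark here, so both pick up the 00:15:00 reading.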

Take time points and make labels against a datetime object to correlate things around those points

I'm trying to use the usual times I take medication (plus the 4 hours it stays active) to fill a column in a data frame with a label: 1 while I am on the medication, 2 for the hour just after coming off it, and 0 otherwise.
As an example of the dataframe I am trying to add this column to:
id sentiment magnitude angry disgusted fearful \
created
2020-05-21 12:00:00 23.0 -0.033333 0.5 NaN NaN NaN
2020-05-21 12:15:00 NaN NaN NaN NaN NaN NaN
2020-05-21 12:30:00 NaN NaN NaN NaN NaN NaN
2020-05-21 12:45:00 NaN NaN NaN NaN NaN NaN
2020-05-21 13:00:00 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
2021-04-20 00:45:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:00:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:15:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:30:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:45:00 46022.0 -1.000000 1.0 NaN NaN NaN
happy neutral sad surprised
created
2020-05-21 12:00:00 NaN NaN NaN NaN
2020-05-21 12:15:00 NaN NaN NaN NaN
2020-05-21 12:30:00 NaN NaN NaN NaN
2020-05-21 12:45:00 NaN NaN NaN NaN
2020-05-21 13:00:00 NaN NaN NaN NaN
... ... ... ... ...
2021-04-20 00:45:00 NaN NaN NaN NaN
2021-04-20 01:00:00 NaN NaN NaN NaN
2021-04-20 01:15:00 NaN NaN NaN NaN
2021-04-20 01:30:00 NaN NaN NaN NaN
2021-04-20 01:45:00 NaN NaN NaN NaN
[32024 rows x 10 columns]
And the timestamps for when I usually take my medication:
['09:00 AM', '12:00 PM', '03:00 PM']
How would I use those time stamps to get this sort of column information?
Update
Building on the question: how would I make sure labels are only added where data is available, and that the one-hour after-medication window is applied correctly?
Thanks
Use np.select() to choose the appropriate label for a given condition.
First dropna() if all values after created are null (subset=df.columns[1:]). You can change the subset depending on your needs (e.g., subset=['id'] if rows should be dropped just for having a null id).
Then generate datetime arrays for taken-, active-, and after-medication periods based on the duration of the medication. Check whether the created times match any of the times in active (label 1) or after (label 2), otherwise default to 0.
# drop rows that are empty except for column 0 (i.e., except for df.created)
df.dropna(subset=df.columns[1:], inplace=True)
# convert times to datetime
df.created = pd.to_datetime(df.created)
taken = pd.to_datetime(['09:00:00', '12:00:00', '15:00:00'])
# generate time arrays
duration = 2 # hours
active = np.array([(taken + pd.Timedelta(f'{h}H')).time for h in range(duration)]).ravel()
after = (taken + pd.Timedelta(f'{duration}H')).time
# define boolean masks by label
conditions = {
    1: df.created.dt.floor('H').dt.time.isin(active),
    2: df.created.dt.floor('H').dt.time.isin(after),
}
# create medication column with np.select()
df['medication'] = np.select(conditions.values(), conditions.keys(), default=0)
Here is the output with some slightly modified data that better demonstrates the active / after / NaN scenarios:
created id sentiment magnitude medication
0 2020-05-21 12:00:00 23.0 -0.033333 0.5 1
3 2020-05-21 12:45:00 39.0 -0.500000 0.5 1
4 2020-05-21 13:00:00 90.0 -0.500000 0.5 1
5 2020-05-21 13:15:00 100.0 -0.033333 0.1 1
9 2020-05-21 14:15:00 1000.0 0.033333 0.5 2
10 2020-05-21 14:30:00 3.0 0.001000 1.0 2
17 2021-04-20 01:00:00 46022.0 -1.000000 1.0 0
20 2021-04-20 01:45:00 46022.0 -1.000000 1.0 0
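For reference, here is a minimal, self-contained version of the same labeling logic, using a single hypothetical 12:00 dose time and a 2-hour duration:

```python
import numpy as np
import pandas as pd

# Toy data: one dose at 12:00, active for 2 hours (hypothetical values).
df = pd.DataFrame({
    "created": pd.to_datetime([
        "2020-05-21 12:00:00",  # active (dose hour)
        "2020-05-21 13:30:00",  # active (second hour)
        "2020-05-21 14:15:00",  # hour just after the active window
        "2020-05-21 16:00:00",  # off medication
    ]),
})
taken = pd.to_datetime(["12:00:00"])
duration = 2  # hours

# Hour-of-day values covered by each label.
active = np.array([(taken + pd.Timedelta(hours=h)).time for h in range(duration)]).ravel()
after = (taken + pd.Timedelta(hours=duration)).time

conditions = {
    1: df.created.dt.floor("h").dt.time.isin(active),
    2: df.created.dt.floor("h").dt.time.isin(after),
}
df["medication"] = np.select(list(conditions.values()), list(conditions.keys()), default=0)
print(df["medication"].tolist())  # [1, 1, 2, 0]
```

Flooring created to the hour and comparing the time-of-day component is what lets rows anywhere inside an hour match that hour's label.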

Best way to set up a 'Buy' signal/backtest when multiple conditions equal True (stock data)

Working with daily [Date, Open, High, Low, Close] stock data, I am trying to understand a good method for structuring the statements used when backtesting multiple conditions. For example:
#Signal:
Today's Close > Today's Open AND
Yesterday's Close > Yesterday's Open AND
Today's Close >= Today's High - 10%
#Position:
If ALL of the signal conditions above are true, then "Buy" tomorrow at (today's High + 5%) and "Sell" at the Close of that day.
**To take the position I would have to test that the "Buy" condition was satisfied on the 'tomorrow' bar
#Calculate Return:
If Position taken, calculate profit or loss for the day
I've seen sample algorithms, but many examples are just basic moving-average crossover systems (a single condition), which are very simple to handle with a vectorized approach.
When you have multiple conditions as above, can someone show me a good way to code this?
Assuming your data has been sorted by date and indexed sequentially, try this:
cond1 = df['Close'] > df['Open']
cond2 = df['Close'].shift() > df['Open'].shift()
cond3 = df['Close'] >= (df['High'] * 0.9)
signal = df[cond1 & cond2 & cond3]
df.loc[signal.index + 1, 'BuyAt'] = (signal['High'] * 1.05).values
df.loc[signal.index + 1, 'SellAt'] = df.loc[signal.index + 1, 'Close']
df['PnL'] = df['SellAt'] - df['BuyAt']
Result (from MSFT stock price courtesy of Yahoo Finance):
Date Open High Low Close BuyAt SellAt PnL
0 2019-01-02 99.550003 101.750000 98.940002 101.120003 NaN NaN NaN
1 2019-01-03 100.099998 100.190002 97.199997 97.400002 NaN NaN NaN
2 2019-01-04 99.720001 102.510002 98.930000 101.930000 NaN NaN NaN
3 2019-01-07 101.639999 103.269997 100.980003 102.059998 NaN NaN NaN
4 2019-01-08 103.040001 103.970001 101.709999 102.800003 108.433497 102.800003 -5.633494
5 2019-01-09 103.860001 104.879997 103.239998 104.269997 NaN NaN NaN
6 2019-01-10 103.220001 103.750000 102.379997 103.599998 NaN NaN NaN
7 2019-01-11 103.190002 103.440002 101.639999 102.800003 108.937500 102.800003 -6.137497
8 2019-01-14 101.900002 102.870003 101.260002 102.050003 NaN NaN NaN
9 2019-01-15 102.510002 105.050003 101.879997 105.010002 NaN NaN NaN
10 2019-01-16 105.260002 106.260002 104.959999 105.379997 110.302503 105.379997 -4.922506
11 2019-01-17 105.000000 106.629997 104.760002 106.120003 111.573002 106.120003 -5.452999
12 2019-01-18 107.459999 107.900002 105.910004 107.709999 111.961497 107.709999 -4.251498
13 2019-01-22 106.750000 107.099998 104.860001 105.680000 113.295002 105.680000 -7.615002
14 2019-01-23 106.120003 107.040001 105.339996 106.709999 NaN NaN NaN
15 2019-01-24 106.860001 107.000000 105.339996 106.199997 NaN NaN NaN
16 2019-01-25 107.239998 107.879997 106.199997 107.169998 NaN NaN NaN
17 2019-01-28 106.260002 106.480003 104.660004 105.080002 NaN NaN NaN
18 2019-01-29 104.879997 104.970001 102.169998 102.940002 NaN NaN NaN
19 2019-01-30 104.620003 106.379997 104.330002 106.379997 NaN NaN NaN
It seems like a losing strategy to me!
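If the index is not a simple RangeIndex (e.g. Date is the index), a shift-based variant of the same logic avoids positional index arithmetic entirely. A sketch with made-up prices:

```python
import pandas as pd

# Toy daily bars; the numbers are illustrative only.
df = pd.DataFrame({
    "Open":  [100.0, 101.0, 100.5, 103.0],
    "High":  [102.0, 103.0, 104.0, 105.0],
    "Close": [101.0, 102.5, 103.8, 102.0],
})

cond1 = df["Close"] > df["Open"]                  # today closed up
cond2 = df["Close"].shift() > df["Open"].shift()  # yesterday closed up
cond3 = df["Close"] >= df["High"] * 0.9           # close within 10% of high
signal = cond1 & cond2 & cond3

# Shift the signal-day values forward one bar to trade "tomorrow".
df["BuyAt"] = (df["High"] * 1.05).where(signal).shift()
df["SellAt"] = df["Close"].where(signal.shift(fill_value=False))
df["PnL"] = df["SellAt"] - df["BuyAt"]
```

shift() keeps everything label-aligned, so the same code works whether the frame is indexed by integers or by dates.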

Filling missing dates by imputing on previous dates in Python

I have a time series that I want to lag and predict on, one year ahead, that looks like:
Date Energy Pred Energy Lag Error (= pred - true)
...
2017-09-01 9 8.4
2017-10-01 10 9
2017-11-01 11 10
2017-12-01 12 11.5
2018-01-01 1 1.3
NaT
NaT
NaT
NaT
...
All I want to do is impute dates into the NaT entries, continuing from 2018-01-01 to 2019-01-01 (filling them down the way Excel's drag-and-fill does), because there are enough NaT positions to reach that point.
I've tried model['Date'].fillna() with various methods, and it either just repeats the same previous date or drops things I don't want to drop.
Is there a way to fill these NaTs in 1-month increments, continuing the previous data?
Make the df and set the index (there are better ways to set the index):
"""
Date,Energy,Pred Energy,Lag Error
2017-09-01,9,8.4
2017-10-01,10,9
2017-11-01,11,10
2017-12-01,12,11.5
2018-01-01,1,1.3
"""
import pandas as pd
df = pd.read_clipboard(sep=",", parse_dates=True)
df.set_index(pd.DatetimeIndex(df['Date']), inplace=True)
df.drop("Date", axis=1, inplace=True)
df
Reindex to a new date_range:
idx = pd.date_range(start='2017-09-01', end='2019-01-01', freq='MS')
df = df.reindex(idx)
Output:
Energy Pred Energy Lag Error
2017-09-01 9.0 8.4 NaN
2017-10-01 10.0 9.0 NaN
2017-11-01 11.0 10.0 NaN
2017-12-01 12.0 11.5 NaN
2018-01-01 1.0 1.3 NaN
2018-02-01 NaN NaN NaN
2018-03-01 NaN NaN NaN
2018-04-01 NaN NaN NaN
2018-05-01 NaN NaN NaN
2018-06-01 NaN NaN NaN
2018-07-01 NaN NaN NaN
2018-08-01 NaN NaN NaN
2018-09-01 NaN NaN NaN
2018-10-01 NaN NaN NaN
2018-11-01 NaN NaN NaN
2018-12-01 NaN NaN NaN
2019-01-01 NaN NaN NaN
Help from:
Pandas Set DatetimeIndex
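The same idea as a self-contained sketch, skipping the clipboard step: build the frame directly and extend the index one year past the last observation instead of hard-coding the end date.

```python
import pandas as pd

# Toy frame matching the question's data.
df = pd.DataFrame(
    {"Energy": [9, 10, 11, 12, 1], "Pred Energy": [8.4, 9, 10, 11.5, 1.3]},
    index=pd.to_datetime(
        ["2017-09-01", "2017-10-01", "2017-11-01", "2017-12-01", "2018-01-01"]
    ),
)

# Month-start range from the first observation to one year past the last.
idx = pd.date_range(
    start=df.index.min(),
    end=df.index.max() + pd.DateOffset(years=1),
    freq="MS",
)
df = df.reindex(idx)
print(len(df))  # 17 monthly rows, 2017-09-01 through 2019-01-01
```

reindex keeps the existing rows aligned by date and inserts NaN rows for the new months.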

Pandas timespan and groups: Need to groupby/pivot with index as group id with columns that correspond to most recent period values

I have a table that looks like this:
Index Group_Id Period Start Period End Value Value_Count
42 1016833 2012-01-01 2013-01-01 127491.00 17.0
43 1016833 2013-01-01 2014-01-01 48289.00 9.0
44 1016833 2014-01-01 2015-01-01 2048.00 2.0
45 1016926 2012-02-01 2013-02-01 913.00 1.0
46 1016926 2013-02-01 2014-02-01 6084.00 5.0
47 1016926 2014-02-01 2015-02-01 29942.00 3.0
48 1016971 2014-03-01 2015-03-01 0.00 0.0
I am trying to end up with a 'wide' df where each Group_Id has one observation and the values/value counts are converted to columns that correspond to their respective periods in order of recency. So the end result would look like:
Index Group_Id Value_P0 Value_P1 Value_P2 Count_P0 Count_P1 ...
42 1016833 2048.00 48289.00 127491.00 2.0 9.0
45 1016926 29942.00 6084.00 913.00 3.0 5.0
48 1016971 0.0 0.00 0.0 0.0 0.0
Where Value_P0 is the most recent value, Value_P1 is the next most recent value after that, and the Count columns work the same way.
I've tried pivoting the table so that Group_Id is the index, Period Start is the columns, and Value or Value_Count supplies the values.
Period Start 2006-07-01 2008-07-01 2009-02-01 2009-12-17 2010-02-01 2010-06-01 2010-07-01 2010-08-13 2010-09-01 2010-12-01 ... 2016-10-02 2016-10-20 2016-12-29 2017-01-05 2017-02-01 2017-03-28 2017-04-10 2017-05-14 2017-08-27 2017-09-15
Group_Id
1007310 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1007318 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1007353 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
This way I have each Group_Id as one record, but would then need to loop through each row of the many columns and pull out the non-NaN values, whose order would correspond to oldest to newest. This seems like an incorrect way to go about it, though.
I've also considered grouping by Group_Id and somehow creating a timedelta relative to the most recent date, then pivoting/unstacking so that the columns are the timedeltas and the values are Value or Value_Count. I'm not sure how to do this, though. I appreciate the help.
Still using pivot, but sorting by Period Start in descending order first, so that P0 is the most recent period as in the desired output (note that pivot takes keyword-only arguments as of pandas 2.0):
df['ID'] = df.sort_values('Period Start', ascending=False).groupby('Group_Id').cumcount()
d1 = df.pivot(index='Group_Id', columns='ID', values='Value').add_prefix('Value_P')
d2 = df.pivot(index='Group_Id', columns='ID', values='Value_Count').add_prefix('Count_P')
pd.concat([d1, d2], axis=1).fillna(0)
Out[347]:
ID Value_P0 Value_P1 Value_P2 Count_P0 Count_P1 Count_P2
Group_Id
1016833 2048.0 48289.0 127491.0 2.0 9.0 17.0
1016926 29942.0 6084.0 913.0 3.0 5.0 1.0
1016971 0.0 0.0 0.0 0.0 0.0 0.0
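A quick self-contained check that sorting Period Start in descending order before cumcount yields P0 = most recent period (toy subset of the question's data, Value columns only):

```python
import pandas as pd

df = pd.DataFrame({
    "Group_Id": [1016833, 1016833, 1016833, 1016926],
    "Period Start": pd.to_datetime(
        ["2012-01-01", "2013-01-01", "2014-01-01", "2012-02-01"]
    ),
    "Value": [127491.0, 48289.0, 2048.0, 913.0],
})

# Rank periods newest-first within each group; assignment aligns on index.
df["ID"] = df.sort_values("Period Start", ascending=False).groupby("Group_Id").cumcount()
wide = df.pivot(index="Group_Id", columns="ID", values="Value").add_prefix("Value_P")
print(wide.loc[1016833].tolist())  # most recent value first
```

Without the descending sort, cumcount numbers the rows oldest-first, which reverses the P0/P1/P2 ordering.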

pandas MultiIndex rolling mean

Preface: I'm newish, but have searched for hours here and in the pandas documentation without success. I've also read Wes's book.
I am modeling stock market data for a hedge fund and have a simple MultiIndexed DataFrame with tickers, dates (daily), and fields. The sample here is from Bloomberg: 3 months (Dec. 2016 through Feb. 2017), 3 tickers (AAPL, IBM, MSFT).
import numpy as np
import pandas as pd
import os
# get data from Excel
curr_directory = os.getcwd()
filename = 'Sample Data File.xlsx'
filepath = os.path.join(curr_directory, filename)
df = pd.read_excel(filepath, sheet_name = 'Sheet1', index_col = [0,1], usecols = 'A:D')
# sort
df.sort_index(inplace=True)
# sample of the data
df.head(15)
Out[4]:
PX_LAST PX_VOLUME
Security Name date
AAPL US Equity 2016-12-01 109.49 37086862
2016-12-02 109.90 26527997
2016-12-05 109.11 34324540
2016-12-06 109.95 26195462
2016-12-07 111.03 29998719
2016-12-08 112.12 27068316
2016-12-09 113.95 34402627
2016-12-12 113.30 26374377
2016-12-13 115.19 43733811
2016-12-14 115.19 34031834
2016-12-15 115.82 46524544
2016-12-16 115.97 44351134
2016-12-19 116.64 27779423
2016-12-20 116.95 21424965
2016-12-21 117.06 23783165
df.tail(15)
Out[5]:
PX_LAST PX_VOLUME
Security Name date
MSFT US Equity 2017-02-07 63.43 20277226
2017-02-08 63.34 18096358
2017-02-09 64.06 22644443
2017-02-10 64.00 18170729
2017-02-13 64.72 22920101
2017-02-14 64.57 23108426
2017-02-15 64.53 17005157
2017-02-16 64.52 20546345
2017-02-17 64.62 21248818
2017-02-21 64.49 20655869
2017-02-22 64.36 19292651
2017-02-23 64.62 20273128
2017-02-24 64.62 21796800
2017-02-27 64.23 15871507
2017-02-28 63.98 23239825
When I calculate daily price changes, like this, it seems to work; only the first day is NaN, as it should be:
df.head(5)
Out[7]:
PX_LAST PX_VOLUME px_change_%
Security Name date
AAPL US Equity 2016-12-01 109.49 37086862 NaN
2016-12-02 109.90 26527997 0.003745
2016-12-05 109.11 34324540 -0.007188
2016-12-06 109.95 26195462 0.007699
2016-12-07 111.03 29998719 0.009823
But the daily 30-day volume doesn't. It should be NaN only for the first 29 days of each ticker, but it is NaN for all of it:
# daily change from 30 day volume - doesn't work
df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean()
df['volume_change_%'] = (df['PX_VOLUME'] - df['30_day_volume']) / df['30_day_volume']
df.iloc[:,3:].tail(40)
Out[12]:
30_day_volume volume_change_%
Security Name date
MSFT US Equity 2016-12-30 NaN NaN
2017-01-03 NaN NaN
2017-01-04 NaN NaN
2017-01-05 NaN NaN
2017-01-06 NaN NaN
2017-01-09 NaN NaN
2017-01-10 NaN NaN
2017-01-11 NaN NaN
2017-01-12 NaN NaN
2017-01-13 NaN NaN
2017-01-17 NaN NaN
2017-01-18 NaN NaN
2017-01-19 NaN NaN
2017-01-20 NaN NaN
2017-01-23 NaN NaN
2017-01-24 NaN NaN
2017-01-25 NaN NaN
2017-01-26 NaN NaN
2017-01-27 NaN NaN
2017-01-30 NaN NaN
2017-01-31 NaN NaN
2017-02-01 NaN NaN
2017-02-02 NaN NaN
2017-02-03 NaN NaN
2017-02-06 NaN NaN
2017-02-07 NaN NaN
2017-02-08 NaN NaN
2017-02-09 NaN NaN
2017-02-10 NaN NaN
2017-02-13 NaN NaN
2017-02-14 NaN NaN
2017-02-15 NaN NaN
2017-02-16 NaN NaN
2017-02-17 NaN NaN
2017-02-21 NaN NaN
2017-02-22 NaN NaN
2017-02-23 NaN NaN
2017-02-24 NaN NaN
2017-02-27 NaN NaN
2017-02-28 NaN NaN
As pandas seems to have been designed specifically for finance, I'm surprised this isn't straightforward.
Edit: I've tried some other ways as well.
Tried converting it into a Panel (3D), but didn't find any built-in rolling-window functions there except by converting to a DataFrame and back, so no advantage.
Tried to create a pivot table, but couldn't find a way to reference just the first level of the MultiIndex; df.index.levels[0] and ...levels[1] weren't working.
Thanks!
Can you try the following to see if it works?
df['30_day_volume'] = df.groupby(level=0)['PX_VOLUME'].rolling(window=30).mean().values
df['volume_change_%'] = (df['PX_VOLUME'] - df['30_day_volume']) / df['30_day_volume']
I can verify Allen's answer works when using pandas_datareader, after adjusting the index level in the groupby for the datareader's MultiIndex ordering.
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2016, 12, 1)
end = datetime.datetime(2017, 2, 28)
data = web.DataReader(['AAPL', 'IBM', 'MSFT'], 'yahoo', start, end).to_frame()
data['30_day_volume'] = data.groupby(level=1).rolling(window=30)['Volume'].mean().values
data['volume_change_%'] = (data['Volume'] - data['30_day_volume']) / data['30_day_volume']
# double-check that it computed starting at 30 trading days.
data.loc['2017-1-17':'2017-1-30']
The original poster might try editing this line:
df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean()
to the following, using mean().values:
df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean().values
Without .values, the rolling result carries an extra group level in its index, so it doesn't align with df and the assignment produces NaNs.
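An alternative that sidesteps .values entirely is groupby().transform, which returns a result indexed like the original frame so the assignment aligns by label rather than by row order. A sketch with a toy MultiIndex (3-day window for brevity):

```python
import pandas as pd

# Toy MultiIndexed frame shaped like the question's data.
idx = pd.MultiIndex.from_product(
    [["AAPL", "MSFT"], pd.date_range("2017-01-02", periods=4, freq="D")],
    names=["Security Name", "date"],
)
df = pd.DataFrame({"PX_VOLUME": [10, 20, 30, 40, 100, 200, 300, 400]}, index=idx)

# transform keeps the original index, so no .values workaround is needed.
df["3_day_volume"] = (
    df.groupby(level=0)["PX_VOLUME"]
      .transform(lambda s: s.rolling(window=3).mean())
)
```

Each ticker's first two rows are NaN and the rolling mean restarts at each group boundary, which is exactly the per-ticker behavior the question wants.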
