Pandas merge single column dataframe with another dataframe of multiple columns - python

I have one dataframe_1 as
date
0 2020-01-01
1 2020-01-02
2 2020-01-03
3 2020-01-04
4 2020-01-05
and another dataframe_2 as
date source dest price
634647 2020-09-18 EUR USD 1.186317
634648 2020-09-19 EUR USD 1.183970
634649 2020-09-20 EUR USD 1.183970
I want to merge them on 'date', but the problem is that dataframe_1's last date is '2021-02-15' while dataframe_2's last date is '2021-02-01'.
I want the resulting dataframe as
date source dest price
634647 2021-02-01 EUR USD 1.186317
634648 2021-02-02 NaN NaN NaN
634649 2021-02-03 NaN NaN NaN
...
2021-02-13 NaN NaN NaN
2021-02-14 NaN NaN NaN
2021-02-15 NaN NaN NaN
But I am not able to do it using pd.merge; please ignore the indices in the dataframes.
Thanks a lot in advance.

You can use join for this; DataFrame.join defaults to a left join, so every date from df1 is kept:
df1.set_index('date').join(df2.set_index('date'))
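For reference, a minimal sketch of the same left-aligned result with pd.merge, assuming both frames keep 'date' as an ordinary column:
import pandas as pd

# how='left' keeps every date from dataframe_1; dates with no match in
# dataframe_2 come back with NaN in source/dest/price, as in the desired output
result = pd.merge(dataframe_1, dataframe_2, on='date', how='left')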

Related

Best way to setup 'Buy' signal/backtest when multiple conditions equal True (stock data)

Working with daily [Date, Open, High, Low, Close] stock data, I am trying to better understand a good method for the type of statements to use when I am backtesting multiple conditions. For example:
#Signal:
Today's Close > Today's Open AND
Yesterday's Close > Yesterday's Open AND
Today's Close >= Today's High - 10%
#Position:
If ALL of the signal conditions above are true, then "Buy" tomorrow at (today's High + 5%) and "Sell" at the Close of the day.
(To take the position, I would have to test that the "Buy" condition was satisfied on the 'tomorrow' bar.)
#Calculate Return:
If the position was taken, calculate the profit or loss for the day.
I've seen sample algorithms, but many examples are just basic moving-average crossover systems (one condition), which are very simple to do with a vectorized approach.
When you have multiple conditions as above, can someone show me a good way to code this?
Assuming your data has been sorted by date and indexed sequentially, try this:
cond1 = df['Close'] > df['Open']                  # today's close above today's open
cond2 = df['Close'].shift() > df['Open'].shift()  # yesterday's close above yesterday's open
cond3 = df['Close'] >= (df['High'] * 0.9)         # close within 10% of today's high
signal = df[cond1 & cond2 & cond3]
# place orders on the next bar: buy at the signal day's high + 5%, sell at that bar's close
df.loc[signal.index + 1, 'BuyAt'] = (signal['High'] * 1.05).values
df.loc[signal.index + 1, 'SellAt'] = df.loc[signal.index + 1, 'Close']
df['PnL'] = df['SellAt'] - df['BuyAt']
Result (from MSFT stock price courtesy of Yahoo Finance):
Date Open High Low Close BuyAt SellAt PnL
0 2019-01-02 99.550003 101.750000 98.940002 101.120003 NaN NaN NaN
1 2019-01-03 100.099998 100.190002 97.199997 97.400002 NaN NaN NaN
2 2019-01-04 99.720001 102.510002 98.930000 101.930000 NaN NaN NaN
3 2019-01-07 101.639999 103.269997 100.980003 102.059998 NaN NaN NaN
4 2019-01-08 103.040001 103.970001 101.709999 102.800003 108.433497 102.800003 -5.633494
5 2019-01-09 103.860001 104.879997 103.239998 104.269997 NaN NaN NaN
6 2019-01-10 103.220001 103.750000 102.379997 103.599998 NaN NaN NaN
7 2019-01-11 103.190002 103.440002 101.639999 102.800003 108.937500 102.800003 -6.137497
8 2019-01-14 101.900002 102.870003 101.260002 102.050003 NaN NaN NaN
9 2019-01-15 102.510002 105.050003 101.879997 105.010002 NaN NaN NaN
10 2019-01-16 105.260002 106.260002 104.959999 105.379997 110.302503 105.379997 -4.922506
11 2019-01-17 105.000000 106.629997 104.760002 106.120003 111.573002 106.120003 -5.452999
12 2019-01-18 107.459999 107.900002 105.910004 107.709999 111.961497 107.709999 -4.251498
13 2019-01-22 106.750000 107.099998 104.860001 105.680000 113.295002 105.680000 -7.615002
14 2019-01-23 106.120003 107.040001 105.339996 106.709999 NaN NaN NaN
15 2019-01-24 106.860001 107.000000 105.339996 106.199997 NaN NaN NaN
16 2019-01-25 107.239998 107.879997 106.199997 107.169998 NaN NaN NaN
17 2019-01-28 106.260002 106.480003 104.660004 105.080002 NaN NaN NaN
18 2019-01-29 104.879997 104.970001 102.169998 102.940002 NaN NaN NaN
19 2019-01-30 104.620003 106.379997 104.330002 106.379997 NaN NaN NaN
It seems like a losing strategy to me!
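One caveat the asker flagged remains: this books a trade even when tomorrow's bar never trades up to the buy price. A minimal sketch of a fill check, assuming the BuyAt/SellAt/PnL columns created above:
# only keep trades where that day's High actually reached the buy price
filled = df['High'] >= df['BuyAt']
df.loc[~filled, ['BuyAt', 'SellAt', 'PnL']] = float('nan')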

How to merge two different dataframe with a slight difference in timestamp

I have calculated a 15-minute moving average from 10-second recorded data. Now I want to merge two time series (the 15-minute average and the 15-minute moving average) from different files into a new file, matched on the nearest timestamp.
The 15-minute moving-average data is shown below. Because it is a moving average, the first few rows are NaN:
RecTime NO2_RAW NO2 Ox_RAW Ox CO_RAW CO SO2_RAW SO2
2019-06-03 00:00:08 NaN NaN NaN NaN NaN NaN NaN NaN
2019-06-03 00:00:18 NaN NaN NaN NaN NaN NaN NaN NaN
2019-06-03 00:00:28 NaN NaN NaN NaN NaN NaN NaN NaN
2019-06-03 00:00:38 NaN NaN NaN NaN NaN NaN NaN NaN
The 15-minute average data is shown below:
Site Species ReadingDateTime Value Units Provisional or Ratified
0 CR9 NO2 2019-03-06 00:00:00 8.2 ug m-3 P
1 CR9 NO2 2019-03-06 00:15:00 7.6 ug m-3 P
2 CR9 NO2 2019-03-06 00:30:00 5.9 ug m-3 P
3 CR9 NO2 2019-03-06 00:45:00 5.1 ug m-3 P
4 CR9 NO2 2019-03-06 01:00:00 5.2 ug m-3 P
I want a table like this:
ReadingDateTime Value NO2_Raw NO2
2019-06-03 00:00:00
2019-06-03 00:15:00
2019-06-03 00:30:00
2019-06-03 00:45:00
2019-06-03 01:00:00
I tried to match the two dataframes on the nearest time:
df3 = pd.merge_asof(df1, df2, left_on='RecTime', right_on='ReadingDateTime', tolerance=pd.Timedelta('59s'), allow_exact_matches=False)
I got a new dataframe
RecTime NO2_RAW NO2 Ox_RAW Ox CO_RAW CO SO2_RAW SO2 Site Species ReadingDateTime Value Units Provisional or Ratified
0 2019-06-03 00:14:58 1.271111 21.557111 65.188889 170.011111 152.944444 294.478000 -124.600000 -50.129444 NaN NaN NaT NaN NaN NaN
1 2019-06-03 00:15:08 1.294444 21.601778 65.161111 169.955667 152.844444 294.361556 -124.595556 -50.117556 NaN NaN NaT NaN NaN NaN
2 2019-06-03 00:15:18 1.318889 21.648556 65.104444 169.842556 152.750000 294.251556 -124.593333 -50.111667 NaN NaN NaT NaN NaN NaN
But the values of df2 became NaN. Can someone please help?
Assuming the minutes are correct, you could remove the seconds, and then you would be able to merge:
df['RecTime'] = df['RecTime'].map(lambda x: x.replace(second=0))
You could either create a new column or replace the existing one before merging. It may also be worth double-checking that both timestamp columns parsed with the same day/month order: in the samples above, df1 shows 2019-06-03 while df2 shows 2019-03-06.
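A minimal sketch of that approach, assuming the column names from the question, that both columns are datetime64 dtype, and a hypothetical helper column RecTimeMin:
# drop the seconds so e.g. 00:15:08 lines up with 00:15:00, then merge normally
df1['RecTimeMin'] = df1['RecTime'].dt.floor('min')
df3 = pd.merge(df1, df2, left_on='RecTimeMin', right_on='ReadingDateTime', how='left')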

Filling missing dates by imputing on previous dates in Python

I have a time series that I want to lag and use to predict data one year ahead; it looks like:
Date        Energy  Pred Energy  Lag Error (pred - true)
...
2017-09-01  9       8.4
2017-10-01  10      9
2017-11-01  11      10
2017-12-01  12      11.5
2018-01-01  1       1.3
NaT
NaT
NaT
NaT
...
All I want to do is impute dates into the NaT entries, continuing from 2018-01-01 to 2019-01-01 (just fill them down like an Excel drag-and-drop), because there are enough NaT positions to fill up to that point.
I've tried model['Date'].fillna() with various methods, and it either repeats the same previous date or drops things I don't want to drop.
Is there any way to fill these NaTs in 1-month increments, continuing from the previous data?
Make the df and set the index (there are better ways to set the index):
"""
Date,Energy,Pred Energy,Lag Error
2017-09-01,9,8.4
2017-10-01,10,9
2017-11-01,11,10
2017-12-01,12,11.5
2018-01-01,1,1.3
"""
import pandas as pd

# read the sample above from the clipboard; short rows are padded with NaN
df = pd.read_clipboard(sep=",", parse_dates=True)
# use the Date column as a DatetimeIndex, then drop the original column
df.set_index(pd.DatetimeIndex(df['Date']), inplace=True)
df.drop("Date", axis=1, inplace=True)
df
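As an aside, one of those "better ways" is to parse and set the index in a single call (a sketch, assuming the same comma-separated sample is on the clipboard):
df = pd.read_clipboard(sep=",", parse_dates=["Date"], index_col="Date")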
Reindex to a new date_range:
idx = pd.date_range(start='2017-09-01', end='2019-01-01', freq='MS')
df = df.reindex(idx)
Output:
Energy Pred Energy Lag Error
2017-09-01 9.0 8.4 NaN
2017-10-01 10.0 9.0 NaN
2017-11-01 11.0 10.0 NaN
2017-12-01 12.0 11.5 NaN
2018-01-01 1.0 1.3 NaN
2018-02-01 NaN NaN NaN
2018-03-01 NaN NaN NaN
2018-04-01 NaN NaN NaN
2018-05-01 NaN NaN NaN
2018-06-01 NaN NaN NaN
2018-07-01 NaN NaN NaN
2018-08-01 NaN NaN NaN
2018-09-01 NaN NaN NaN
2018-10-01 NaN NaN NaN
2018-11-01 NaN NaN NaN
2018-12-01 NaN NaN NaN
2019-01-01 NaN NaN NaN
Help from:
Pandas Set DatetimeIndex

How do I delete rows of unmatched dates in dataframes?

I have two dataframes loaded from CSV file:
time_df: consists of all the dates I want, as shown below
0 2017-01-31
1 2017-01-26
2 2017-01-12
3 2017-01-09
4 2017-01-02
price_df: consists of other fields and many dates that I do not need
Date NYSEARCA:TPYP NYSEARCA:MENU NYSEARCA:SLYV NYSEARCA:CZA
0 2017-01-31 NaN 16.56 117.75 55.96
1 2017-01-26 NaN 16.68 116.89 55.84
2 2017-01-27 NaN 16.70 118.47 56.04
3 2017-01-12 NaN 16.81 119.14 56.13
5 2017-01-09 NaN 16.91 120.00 56.26
6 2017-01-08 NaN 16.91 120.00 56.26
7 2017-01-02 NaN 16.91 120.00 56.26
My aim is to delete the rows where the date in price_df does not match any date in time_df.
I tried:
del price_df['Date'] if price_df['Date'] != time_df['Date']
but that is not valid syntax, so I tried print(price_df['Date'] != time_df['Date'])
but it raises the error: Can only compare identically-labeled Series objects
Sounds like a problem an inner join can fix:
time_df.merge(price_df, on='Date', copy=False)
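If you prefer not to merge, boolean filtering with isin gives the same row selection (assuming time_df's date column is also named 'Date' and both columns share a dtype):
# keep only the price_df rows whose date appears in time_df
price_df = price_df[price_df['Date'].isin(time_df['Date'])]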

pandas MultiIndex rolling mean

Preface: I'm newish, but have searched for hours here and in the pandas documentation without success. I've also read Wes's book.
I am modeling stock market data for a hedge fund, and have a simple MultiIndexed DataFrame with tickers, dates (daily), and fields. The sample here is from Bloomberg: three months (Dec. 2016 through Feb. 2017) and 3 tickers (AAPL, IBM, MSFT).
import numpy as np
import pandas as pd
import os

# get data from Excel
curr_directory = os.getcwd()
filename = 'Sample Data File.xlsx'
filepath = os.path.join(curr_directory, filename)
# sheet_name/usecols are the current argument names (formerly sheetname/parse_cols)
df = pd.read_excel(filepath, sheet_name='Sheet1', index_col=[0, 1], usecols='A:D')

# sort
df.sort_index(inplace=True)

# sample of the data
df.head(15)
Out[4]:
PX_LAST PX_VOLUME
Security Name date
AAPL US Equity 2016-12-01 109.49 37086862
2016-12-02 109.90 26527997
2016-12-05 109.11 34324540
2016-12-06 109.95 26195462
2016-12-07 111.03 29998719
2016-12-08 112.12 27068316
2016-12-09 113.95 34402627
2016-12-12 113.30 26374377
2016-12-13 115.19 43733811
2016-12-14 115.19 34031834
2016-12-15 115.82 46524544
2016-12-16 115.97 44351134
2016-12-19 116.64 27779423
2016-12-20 116.95 21424965
2016-12-21 117.06 23783165
df.tail(15)
Out[5]:
PX_LAST PX_VOLUME
Security Name date
MSFT US Equity 2017-02-07 63.43 20277226
2017-02-08 63.34 18096358
2017-02-09 64.06 22644443
2017-02-10 64.00 18170729
2017-02-13 64.72 22920101
2017-02-14 64.57 23108426
2017-02-15 64.53 17005157
2017-02-16 64.52 20546345
2017-02-17 64.62 21248818
2017-02-21 64.49 20655869
2017-02-22 64.36 19292651
2017-02-23 64.62 20273128
2017-02-24 64.62 21796800
2017-02-27 64.23 15871507
2017-02-28 63.98 23239825
When I calculate daily price changes, like this, it seems to work, only the first day is NaN, as it should be:
df.head(5)
Out[7]:
PX_LAST PX_VOLUME px_change_%
Security Name date
AAPL US Equity 2016-12-01 109.49 37086862 NaN
2016-12-02 109.90 26527997 0.003745
2016-12-05 109.11 34324540 -0.007188
2016-12-06 109.95 26195462 0.007699
2016-12-07 111.03 29998719 0.009823
But the daily 30-day volume doesn't work. It should be NaN only for the first 29 rows of each ticker, but it is NaN everywhere:
# daily change from 30 day volume - doesn't work
df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean()
df['volume_change_%'] = (df['PX_VOLUME'] - df['30_day_volume']) / df['30_day_volume']
df.iloc[:,3:].tail(40)
Out[12]:
30_day_volume volume_change_%
Security Name date
MSFT US Equity 2016-12-30 NaN NaN
2017-01-03 NaN NaN
2017-01-04 NaN NaN
2017-01-05 NaN NaN
2017-01-06 NaN NaN
2017-01-09 NaN NaN
2017-01-10 NaN NaN
2017-01-11 NaN NaN
2017-01-12 NaN NaN
2017-01-13 NaN NaN
2017-01-17 NaN NaN
2017-01-18 NaN NaN
2017-01-19 NaN NaN
2017-01-20 NaN NaN
2017-01-23 NaN NaN
2017-01-24 NaN NaN
2017-01-25 NaN NaN
2017-01-26 NaN NaN
2017-01-27 NaN NaN
2017-01-30 NaN NaN
2017-01-31 NaN NaN
2017-02-01 NaN NaN
2017-02-02 NaN NaN
2017-02-03 NaN NaN
2017-02-06 NaN NaN
2017-02-07 NaN NaN
2017-02-08 NaN NaN
2017-02-09 NaN NaN
2017-02-10 NaN NaN
2017-02-13 NaN NaN
2017-02-14 NaN NaN
2017-02-15 NaN NaN
2017-02-16 NaN NaN
2017-02-17 NaN NaN
2017-02-21 NaN NaN
2017-02-22 NaN NaN
2017-02-23 NaN NaN
2017-02-24 NaN NaN
2017-02-27 NaN NaN
2017-02-28 NaN NaN
As pandas seems to have been designed specifically for finance, I'm surprised this isn't straightforward.
Edit: I've tried some other ways as well.
I tried converting it into a Panel (3D), but didn't find any built-in functions for rolling windows except converting back to a DataFrame, so no advantage there.
I tried to create a pivot table, but couldn't find a way to reference just the first level of the MultiIndex; df.index.levels[0] and ...levels[1] weren't working.
Thanks!
Can you try the following to see if it works?
df['30_day_volume'] = df.groupby(level=0)['PX_VOLUME'].rolling(window=30).mean().values
df['volume_change_%'] = (df['PX_VOLUME'] - df['30_day_volume']) / df['30_day_volume']
I can verify Allen's answer works when using pandas_datareader, after adjusting the index level in the groupby operation to match the datareader's MultiIndex ordering.
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2016, 12, 1)
end = datetime.datetime(2017, 2, 28)
data = web.DataReader(['AAPL', 'IBM', 'MSFT'], 'yahoo', start, end).to_frame()
data['30_day_volume'] = data.groupby(level=1).rolling(window=30)['Volume'].mean().values
data['volume_change_%'] = (data['Volume'] - data['30_day_volume']) / data['30_day_volume']
# double-check that it computed starting at 30 trading days.
data.loc['2017-1-17':'2017-1-30']
The original poster might try editing this line:
df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean()
to the following, using mean().values:
df['30_day_volume'] = df.groupby(level=0,group_keys=True)['PX_VOLUME'].rolling(window=30).mean().values
The data don't get properly aligned without this: the rolling result carries an extra group index level, so assigning it back to df produces all NaNs.
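If you would rather rely on index alignment than on row order (which .values implicitly assumes), a sketch that drops the duplicated group level instead, using the frame from the question:
roll = df.groupby(level=0)['PX_VOLUME'].rolling(window=30).mean()
# groupby().rolling() prepends the group key as an extra index level;
# dropping it restores df's original MultiIndex so the assignment aligns
df['30_day_volume'] = roll.reset_index(level=0, drop=True)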
