I'm trying to calculate the cumulative price of a time series in the face of external cashflows.
This is the sample dataset:
reportdate fund mtd_return cashflow Desired Output
30/11/2018 Fund X -0.00860 15687713 15552798.98
31/12/2018 Fund X -0.00900 15412823.78
31/01/2019 Fund X 0.00920 15554621.76
28/02/2019 Fund X 0.00630 15652615.88
31/03/2019 Fund X 0.00700 15762184.19
30/04/2019 Fund X 0.01220 15954482.84
31/05/2019 Fund X 0.00060 1000000 16964655.53
30/06/2019 Fund X 0.00570 1200000 18268194.07
31/07/2019 Fund X 0.00450 18350400.94
31/08/2019 Fund X 0.00210 18388936.78
30/09/2019 Fund X 0.00530 18486398.15
31/10/2019 Fund X 0.00200 18523370.94
30/11/2019 Fund X 0.00430 18603021.44
31/12/2019 Fund X 0.00660 18725801.38
31/01/2020 Fund X 0.01070 18926167.45
29/02/2020 Fund X -0.00510 18829644.00
31/03/2020 Fund X -0.10700 16814872.09
30/04/2020 Fund X 0.02740 3400000 20768759.59
31/05/2020 Fund X 0.02180 2000000 23265118.55
30/06/2020 Fund X 0.02270 23793236.74
31/07/2020 Fund X 0.01120 24059720.99
31/08/2020 Fund X 0.01260 24362873.47
30/09/2020 Fund X 0.00750 24545595.02
31/10/2020 Fund X 0.00410 -8110576 16502402.68
30/11/2020 Fund X 0.02790 16962819.72
31/12/2020 Fund X 0.01230 17171462.40
In the above, the Desired Output column is calculated by taking the previous row's Desired Output, adding any cashflow in the current period, and multiplying the sum by (1 + mtd_return). Effectively, I'm looking for a good way to compound returns in the presence of external cashflows.
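For example, the 31/05/2019 value is (15954482.84 + 1000000) * (1 + 0.0006) = 16964655.53, and the first row, having no prior value, is simply its cashflow compounded once: 15687713 * (1 - 0.0086).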
Any help implementing this in Python would be appreciated.
Many thanks!
Mike
import pandas as pd

df = pd.read_csv('df7.txt', sep=',', header=0)
df['reportdate'] = pd.to_datetime(df['reportdate'], dayfirst=True)  # dates are dd/mm/yyyy
df = df.fillna(0)  # empty cashflows become zero

qqq = []

def func_data(x):
    # each group is a single row, since every reportdate is unique
    a = 0
    ind = x.index[0] - 1
    if x.index[0] > 0:
        # (previous cumulative value + current cashflow) compounded by the current return
        a = (qqq[ind] + x['cashflow']) * (1 + x['mtd_return'])
        qqq.append(a.values[0])
    else:
        # first row: no prior value, so just compound the initial cashflow
        qqq.append(df.loc[0, 'cashflow'] * (1 + df.loc[0, 'mtd_return']))
    return a

df.groupby(['reportdate']).apply(func_data)
df['new'] = qqq
print(df)
Output
reportdate fund mtd_return cashflow Desired Output new
0 2018-11-30 Fund X -0.0086 15687713.0 15552798.98 1.555280e+07
1 2018-12-31 Fund X -0.0090 0.0 15412823.78 1.541282e+07
2 2019-01-31 Fund X 0.0092 0.0 15554621.76 1.555462e+07
3 2019-02-28 Fund X 0.0063 0.0 15652615.88 1.565262e+07
4 2019-03-31 Fund X 0.0070 0.0 15762184.19 1.576218e+07
5 2019-04-30 Fund X 0.0122 0.0 15954482.84 1.595448e+07
6 2019-05-31 Fund X 0.0006 1000000.0 16964655.53 1.696466e+07
7 2019-06-30 Fund X 0.0057 1200000.0 18268194.07 1.826819e+07
8 2019-07-31 Fund X 0.0045 0.0 18350400.94 1.835040e+07
9 2019-08-31 Fund X 0.0021 0.0 18388936.78 1.838894e+07
10 2019-09-30 Fund X 0.0053 0.0 18486398.15 1.848640e+07
11 2019-10-31 Fund X 0.0020 0.0 18523370.94 1.852337e+07
12 2019-11-30 Fund X 0.0043 0.0 18603021.44 1.860302e+07
13 2019-12-31 Fund X 0.0066 0.0 18725801.38 1.872580e+07
14 2020-01-31 Fund X 0.0107 0.0 18926167.45 1.892617e+07
15 2020-02-29 Fund X -0.0051 0.0 18829644.00 1.882964e+07
16 2020-03-31 Fund X -0.1070 0.0 16814872.09 1.681487e+07
17 2020-04-30 Fund X 0.0274 3400000.0 20768759.59 2.076876e+07
18 2020-05-31 Fund X 0.0218 2000000.0 23265118.55 2.326512e+07
19 2020-06-30 Fund X 0.0227 0.0 23793236.74 2.379324e+07
20 2020-07-31 Fund X 0.0112 0.0 24059720.99 2.405972e+07
21 2020-08-31 Fund X 0.0126 0.0 24362873.47 2.436287e+07
22 2020-09-30 Fund X 0.0075 0.0 24545595.02 2.454559e+07
23 2020-10-31 Fund X 0.0041 -8110576.0 16502402.68 1.650240e+07
24 2020-11-30 Fund X 0.0279 0.0 16962819.72 1.696282e+07
25 2020-12-31 Fund X 0.0123 0.0 17171462.40 1.717146e+07
In your file, all the values are comma-separated, including the empty ones (that is, an empty field sits between two commas). The file is read into a pandas dataframe; header=0 means the first row supplies the column headers. Next, the 'reportdate' column is converted to datetime format and empty values are replaced with zero. The data is grouped by date and func_data is applied to each group. On the first call the else branch runs; every subsequent call takes the if branch. The results are collected in the qqq list, which then populates the 'new' column.
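As an aside, since each value depends only on the previous one, a plain loop over the rows does the same job without the groupby machinery. A minimal sketch, assuming the same df7.txt layout:

import pandas as pd

df = pd.read_csv('df7.txt', sep=',', header=0)
df['reportdate'] = pd.to_datetime(df['reportdate'], dayfirst=True)
df = df.fillna(0)

values = []
prev = 0.0
for cashflow, ret in zip(df['cashflow'], df['mtd_return']):
    # (previous value + current cashflow) compounded by the current return
    prev = (prev + cashflow) * (1 + ret)
    values.append(prev)
df['new'] = values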
Related
I have this code:
test = {"number": ['1333','1444','1555','1666','1777', '1444', '1999', '2000', '2000'],
"order_amount": ['1000.00','-500.00','100.00','200.00','-200.00','-500.00','150.00','-100.00','-20.00'],
"number_of_refund": ['','1333','', '', '1666', '1333', '','1999','1999']
}
df = pd.DataFrame(test)
Which returns the following dataframe:
number order_amount number_of_refund
0 1333 1000.00
1 1444 -500.00 1333
2 1555 100.00
3 1666 200.00
4 1777 -200.00 1666
5 1444 -500.00 1333
6 1999 150.00
7 2000 -100.00 1999
8 2000 -20.00 1999
From this I need to remove any order whose "number" appears in the "number_of_refund" column, provided the refund records pointing at it (there can be several) sum to the negative of that order's amount, along with the refund rows themselves. So the outcome should be:
number order_amount number_of_refund
2 1555 100.00
6 1999 150.00
7 2000 -100.00 1999
8 2000 -20.00 1999
You can do the same as in my previous answer, but use an intermediate dataframe with aggregated refunds:
# ensure numeric values
df['order_amount'] = pd.to_numeric(df['order_amount'], errors='coerce')
# aggregate values
df2 = df.groupby(['number', 'number_of_refund'], as_index=False).sum()
# is the row a refund?
m1 = df2['number_of_refund'].ne('')
# get mapping of refunds
s = df2[m1].set_index('number_of_refund')['order_amount']
# get reimbursements and find which ones will equal the original value
reimb = df['number'].map(s)
m2 = reimb.eq(-df['order_amount'])
m3 = df['number_of_refund'].isin(df.loc[m2, 'number'])
# keep rows that do not match any m2 or m3 mask
df = df[~(m2|m3)]
output:
number order_amount number_of_refund
2 1555 100.0
6 1999 150.0
7 2000 -100.0 1999
8 2000 -20.0 1999
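For reference, the aggregated intermediate df2 here works out as follows. Note how the two partial -500.00 refunds of order 1333 (rows 1 and 5 of the input) collapse into a single -1000.00 row, which is exactly what lets m2 match the original 1000.00:

  number number_of_refund  order_amount
0   1333                         1000.0
1   1444             1333       -1000.0
2   1555                          100.0
3   1666                          200.0
4   1777             1666        -200.0
5   1999                          150.0
6   2000             1999        -120.0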
Story that pertains to a new design solution
The goal is to use weather data to fit an ARIMA model on each group of identically named 'station' values with their associated precipitation data, and then execute a 30-day forward forecast. I'm looking to process all rows for one station name, then move on to the next unique station name, and so on.
The algorithm question
How do I write an algorithm that runs the ARIMA model for each UNIQUE 'station', perhaps by grouping rows into one group per station, running the ARIMA model on each group, and then fitting a 30-day forward forecast? ARIMA(2,1,1) is a working set of order terms from auto.arima().
How do I group same-named 'station' rows before running the ARIMA model, fit, and forecast? Or what other approach would let me process one station name at a time and then move on to the next unique station name?
Working code that executes but needs the broader algorithm
The code was working, but on the last run predict(start=start_date, end=end_date) raised a KeyError. I have since removed NA dates, which may fix the predict(start, end) call.
wd.weather_data = wd.weather_data[wd.weather_data['date'].notna()]
forecast_models = [50000]
n = 1
df_all_stations = data_prcp.drop(['level_0', 'index', 'prcp'], axis=1)
wd.weather_data.sort_values("date", axis=0, ascending=True, inplace=True)
for station_name in wd.weather_data['station']:
    start_date = pd.to_datetime(wd.weather_data['date'])
    number_of_days = 31
    end_date = pd.to_datetime(start_date) + pd.DateOffset(days=30)
    model = statsmodels.tsa.arima_model.ARIMA(wd.weather_data['prcp'], order=(2, 1, 1))
    model_fit = model.fit()
    forecast = model_fit.predict(start=start_date, end=end_date)
    forecast_models.append(forecast)
Data Source
<bound method NDFrame.head of station date tavg tmin tmax prcp snow
0 Anchorage, AK 2018-01-01 -4.166667 -8.033333 -0.30 0.3 80.0
35328 Grand Forks, ND 2018-01-01 -14.900000 -23.300000 -6.70 0.0 0.0
86016 Key West, FL 2018-01-01 20.700000 16.100000 25.60 0.0 0.0
59904 Wilmington, NC 2018-01-01 -2.500000 -7.100000 0.00 0.0 0.0
66048 State College, PA 2018-01-01 -13.500000 -17.000000 -10.00 4.5 0.0
... ... ... ... ... ... ... ...
151850 Kansas City, MO 2022-03-30 9.550000 3.700000 16.55 21.1 0.0
151889 Springfield, MO 2022-03-30 12.400000 4.500000 17.10 48.9 0.0
151890 St. Louis, MO 2022-03-30 14.800000 8.000000 17.60 24.9 0.0
151891 State College, PA 2022-03-30 0.400000 -5.200000 6.20 0.2 0.0
151899 Wilmington, NC 2022-03-30 14.400000 6.200000 20.20 0.0 0.0
wdir wspd pres
0 143.0 5.766667 995.133333
35328 172.0 33.800000 1019.200000
86016 4.0 13.000000 1019.900000
59904 200.0 21.600000 1017.000000
66048 243.0 12.700000 1015.200000
... ... ... ...
151850 294.5 24.400000 998.000000
151889 227.0 19.700000 997.000000
151890 204.0 20.300000 996.400000
151891 129.0 10.800000 1020.400000
151899 154.0 16.400000 1021.900000
Error
KeyError: 'The `start` argument could not be matched to a location related to the index of the data.'
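For the grouping itself, pandas groupby('station') yields one sub-frame per unique station name. Below is a minimal sketch, assuming the current statsmodels.tsa.arima.model.ARIMA API (the old statsmodels.tsa.arima_model module is deprecated) and at most one row per station per day:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

forecasts = {}
for station_name, group in wd.weather_data.groupby('station'):
    # one precipitation series per station, indexed by date at daily frequency
    series = (group.assign(date=pd.to_datetime(group['date']))
                   .set_index('date')['prcp']
                   .sort_index()
                   .asfreq('D')
                   .fillna(0))  # assumption: treat missing days as zero precipitation
    model = ARIMA(series, order=(2, 1, 1))
    model_fit = model.fit()
    # 30-day forward forecast from the end of this station's history
    forecasts[station_name] = model_fit.forecast(steps=30)

Because each series then carries a DatetimeIndex with an explicit daily frequency, forecast(steps=30) labels the 30 predicted values with the next 30 dates, and there is no start/end matching against the index to go wrong.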
I have two pandas dataframes that I want to merge. My first dataframe, names, is a list of stock tickers and corresponding dates. Example below:
Date Symbol DateSym
0 2017-01-05 AGRX AGRX01-05-2017
1 2017-01-05 TMDI TMDI01-05-2017
2 2017-01-06 ATHE ATHE01-06-2017
3 2017-01-06 AVTX AVTX01-06-2017
4 2017-01-09 CVM CVM01-09-2017
5 2017-01-10 DFFN DFFN01-10-2017
6 2017-01-10 VKTX VKTX01-10-2017
7 2017-01-11 BIOC BIOC01-11-2017
8 2017-01-11 CVM CVM01-11-2017
9 2017-01-11 PALI PALI01-11-2017
I created another dataframe, price1, by looping through the tickers and building a dataframe with the open, high, low, close, and other relevant info I need. When I merge the two dataframes, I want to show only the names dataframe on the left with the corresponding price data on the right. When I ran a test on the first 10 tickers, I noticed that the combined dataframe outputs redundant rows (see CVM in rows 4 and 5 below), even though the price1 dataframe doesn't have duplicate values. What am I doing wrong?
import numpy as np
import pandas as pd
import yfinance as yf

def price_stats(df):
    # df['ticker'] = df
    df['Gap Up%'] = df['Open'] / df['Close'].shift(1) - 1
    df['HOD%'] = df['High'] / df['Open'] - 1
    df['Close vs Open%'] = df['Close'] / df['Open'] - 1
    df['Close%'] = df['Close'] / df['Close'].shift(1) - 1
    df['GU and Goes Red'] = np.where((df['Low'] < df['Close'].shift(1)) & (df['Open'] > df['Close'].shift(1)), 1, 0)
    df['Yday Intraday>30%'] = np.where((df['Close vs Open%'].shift(1) > .30), 1, 0)
    df['Gap Up?'] = np.where((df['Gap Up%'] > 0), 1, 0)
    df['Sloppy $ Vol'] = (df['High'] + df['Low'] + df['Close']) / 3 * df['Volume']
    df['Prev Day Sloppy $ Vol'] = (df['High'].shift(1) + df['Low'].shift(1) + df['Close'].shift(1)) / 3 * df['Volume'].shift(1)
    df['Prev Open'] = df['Open'].shift(1)
    df['Prev High'] = df['High'].shift(1)
    df['Prev Low'] = df['Low'].shift(1)
    df['Prev Close'] = df['Close'].shift(1)
    df['Prev Vol'] = df['Volume'].shift(1)
    df['D-2 Close'] = df['Close'].shift(2)
    df['D-2 Vol'] = df['Volume'].shift(2)
    df['D-3 Close'] = df['Close'].shift(3)
    df['D-2 Open'] = df['Open'].shift(2)
    df['D-2 High'] = df['High'].shift(2)
    df['D-2 Low'] = df['Low'].shift(2)
    df['D-2 Intraday Range'] = df['D-2 Close'] / df['D-2 Open'] - 1
    df['D-2 Close%'] = df['D-2 Close'] / df['D-3 Close'] - 1
    df.dropna(inplace=True)
vol_excel = pd.read_excel('C://U******.xlsx')
names = vol_excel.Symbol.to_list()
price1 = []
price1 = pd.DataFrame(price1)
for name in names[0:10]:
    print(name)
    price = yf.download(name, start="2016-12-01", end="2022-03-04")
    price['ticker'] = name
    price_stats(price)
    price1 = pd.concat([price1, price])

price1 = price1.reset_index()
orig_day = pd.to_datetime(price1['Date'])
# `nyse` is assumed to be an exchange-calendar object defined elsewhere
price1['Prev Day Date'] = orig_day - pd.tseries.offsets.CustomBusinessDay(1, holidays=nyse.holidays().holidays)
price1['DateSym'] = price1['ticker'] + price1['Date'].dt.strftime('%m-%d-%Y')
price1 = price1.rename(columns={'ticker': 'Symbol'})
datesym = price1['DateSym']
price1.drop(labels=['DateSym'], axis=1, inplace=True)
price1.insert(0, 'DateSym', datesym)
vol_excel['DateSym'] = vol_excel['Symbol'] + vol_excel['Date'].dt.strftime('%m-%d-%Y')
dfcombo = vol_excel.merge(price1, on=['Date', 'Symbol'], how='inner')
See how CVM is duplicated when I print out dfcombo:
Date Symbol DateSym_x DateSym_y Open High Low Close Adj Close Volume ... Prev Vol D-2 Close D-2 Vol D-3 Close D-2 Open D-2 High D-2 Low D-2 Intraday Range D-2 Close% Prev Day Date
0 2017-01-05 AGRX AGRX01-05-2017 AGRX01-05-2017 2.71 2.71 2.40 2.52 2.52 2408400 ... 18584900.0 5.000 2390400.0 5.700000 5.770 5.813000 4.460 -0.133449 -0.122807 2017-01-04
1 2017-01-05 TMDI TMDI01-05-2017 TMDI01-05-2017 15.60 16.50 12.90 13.50 13.50 43830 ... 114327.0 10.500 61543.0 7.200000 7.500 10.500000 7.500 0.400000 0.458333 2017-01-04
2 2017-01-06 ATHE ATHE01-06-2017 ATHE01-06-2017 2.58 2.60 2.23 2.42 2.42 222500 ... 1750700.0 1.930 53900.0 1.750000 1.790 1.950000 1.790 0.078212 0.102857 2017-01-05
3 2017-01-06 AVTX AVTX01-06-2017 AVTX01-06-2017 1.24 1.24 1.02 1.07 1.07 480500 ... 1246100.0 0.883 44900.0 0.890000 0.896 0.950000 0.827 -0.014509 -0.007865 2017-01-05
4 2017-01-09 CVM CVM01-09-2017 CVM01-09-2017 2.75 3.00 2.75 2.75 2.75 376520 ... 414056.0 2.000 77360.0 2.000000 2.000 2.250000 2.000 0.000000 0.000000 2017-01-06
5 2017-01-09 CVM CVM01-09-2017 CVM01-09-2017 2.75 3.00 2.75 2.75 2.75 376520 ... 414056.0 2.000 77360.0 2.000000 2.000 2.250000 2.000 0.000000 0.000000 2017-01-06
6 2017-01-10 DFFN DFFN01-10-2017 DFFN01-10-2017 111.00 232.50 108.75 125.25 125.25 165407 ... 43167.0 30.900 67.0 34.650002 31.500 34.349998 30.900 -0.019048 -0.108225 2017-01-09
7 2017-01-10 VKTX VKTX01-10-2017 VKTX01-10-2017 1.64 1.64 1.43 1.56 1.56 981700 ... 1550400.0 1.260 264900.0 1.230000 1.250 1.299000 1.210 0.008000 0.024390 2017-01-09
8 2017-01-11 BIOC BIOC01-11-2017 BIOC01-11-2017 858.00 1017.00 630.00 813.00 813.00 210182 ... 78392.0 306.000 5368.0 285.000000 285.000 315.000000 285.000 0.073684 0.073684 2017-01-10
9 2017-01-11 CVM CVM01-11-2017 CVM01-11-2017 4.25 4.50 3.00 3.75 3.75 487584 ... 672692.0 2.750 376520.0 2.750000 2.750 3.000000 2.750 0.000000 0.000000 2017-01-10
I'm wondering whether the cause is that the names dataframe can contain the same ticker with different dates, so each pass of the loop pulls that ticker's price data and appends it to price1 again.
For example, in the names dataframe, AGRX can be listed for both 2017-01-05 and 2020-12-20. My loop as shown pulls from Yahoo and appends to the price1 dataframe each time, even though it's the same set of data. By the same token, is there a way for me to skip appending the duplicate ticker, and will that solve the issue?
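That is almost certainly the cause: CVM appears twice in names (2017-01-09 and 2017-01-11), so its full price history is downloaded and appended twice, and the merge then matches each copy. A minimal sketch of one fix, looping over unique tickers only and dropping any residual duplicates (this reuses price_stats and vol_excel from above):

import pandas as pd
import yfinance as yf

unique_names = pd.unique(vol_excel['Symbol'])[:10]  # first 10 unique tickers for the test
frames = []
for name in unique_names:
    price = yf.download(name, start="2016-12-01", end="2022-03-04")
    price['ticker'] = name
    price_stats(price)
    frames.append(price)

price1 = pd.concat(frames).reset_index()
# belt and braces: keep one row per ticker per trading day
price1 = price1.drop_duplicates(subset=['ticker', 'Date'])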
I'm trying to parse scraped data from finviz quote profile pages. I think I can use pandas' pivot-table mechanisms to get the output I need, but I have never used pivot tables, so I'm unsure if or how to format the output.
The table I receive is below. I would like each even-numbered column to supply the column headers of the output table, ending with a dataframe of one row and 72 columns, since there are 72 values. Unless anyone can recommend a better output structure and how to access the values?
0 1 2 3 4 5 6 7 8 9 10 11
0 Index S&P 500 P/E 23.06 EPS (ttm) 3.15 Insider Own 0.20% Shs Outstand 1.01B Perf Week 3.94%
1 Market Cap 73.24B Forward P/E 21.29 EPS next Y 3.41 Insider Trans -34.81% Shs Float 996.66M Perf Month 4.85%
2 Income 3.21B PEG 2.31 EPS next Q 0.81 Inst Own 89.30% Short Float 2.05% Perf Quarter 4.47%
3 Sales 13.15B P/S 5.57 EPS this Y 9.70% Inst Trans 0.42% Short Ratio 4.21 Perf Half Y 25.19%
4 Book/sh 10.26 P/B 7.08 EPS next Y 7.88% ROA 20.10% Target Price 74.34 Perf Year 28.65%
5 Cash/sh 3.11 P/C 23.35 EPS next 5Y 10.00% ROE 32.10% 52W Range 45.51 - 72.28 Perf YTD 36.02%
6 Dividend 2 P/FCF 31.29 EPS past 5Y 1.50% ROI 21.60% 52W High 0.44% Beta 1.26
7 Dividend % 2.75% Quick Ratio 2.5 Sales past 5Y -1.40% Gross Margin 60.70% 52W Low 59.54% ATR 1.34
8 Employees 29977 Current Ratio 3.3 Sales Q/Q 7.20% Oper. Margin 35.20% RSI (14) 67.26 Volatility 1.65% 1.93%
9 Optionable Yes Debt/Eq 0.35 EPS Q/Q 23.80% Profit Margin 24.40% Rel Volume 1.13 Prev Close 72.08
10 Shortable Yes LT Debt/Eq 0.29 Earnings Oct 26 AMC Payout 47.70% Avg Volume 4.84M Price 72.6
11 Recom 2.5 SMA20 3.86% SMA50 4.97% SMA200 16.49% Volume 5476883 Change 0.72%
I know the formatting is hard to see.
Try reshaping:
d1 = pd.DataFrame(df.values.reshape(-1, 2), columns=['key', 'value'])  # pair each label cell with the value to its right
d1.set_index('key').T  # transpose into a single row with the labels as columns
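To see the idea on a small stand-in for the scraped table (a hypothetical 2x4 frame; the real 12x12 table gives 72 pairs the same way):

import pandas as pd

df = pd.DataFrame([['Index', 'S&P 500', 'P/E', 23.06],
                   ['Market Cap', '73.24B', 'Forward P/E', 21.29]])
d1 = pd.DataFrame(df.values.reshape(-1, 2), columns=['key', 'value'])
print(d1.set_index('key').T)
# one row ('value'); columns: Index, P/E, Market Cap, Forward P/E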
I have a question on arithmetic within a dataframe. Please note that each of the columns below in my dataframe is based on the others, except for 'holdings'.
Here is a shortened version of my dataframe
'holdings' & 'cash' & 'total'
0.0 10000.0 10000.0
0.0 10000.0 10000.0
1000 9000.0 10000.0
1500 10000.0 11500.0
2000 10000.0 12000.0
initial_cap = 10000.0
But here is my problem: the first time I have holdings, the cash is calculated correctly, where cash of 10000.0 - holdings of 1000.0 = 9000.0.
I need cash to remain at 9000.0 until my holdings go back to 0.0 again.
Here are my calculations:
cash = initial_cap - holdings
In other words, how would you calculate cash so that it remains at 9000.0 until holdings goes back to 0.0?
Here is how I want it to look:
'holdings' & 'cash' & 'total'
0.0 10000.0 10000.0
0.0 10000.0 10000.0
1000 9000.0 10000.0
1500 9000.0 10500.0
2000 9000.0 11000.0
So let me try to rephrase: you start with initial capital 10 and a given sequence of holdings {0, 0, 1, 1.5, 2}, and you want to create a cash variable that is 10 whenever holdings is 0. As soon as holdings increases in an initial period by x, you want cash to be 10 - x until holdings equals 0 again.
If this is correct, this is what I would do (the logic of total and all of this is still unclear to me, but it is what you added at the end, so I focus on it).
PS: Providing code to create your sample is considered nice.
import pandas as pd

df = pd.DataFrame([0, 1, 2, 2, 0, 2, 3, 3], columns=['holdings'])
x = 10
# triggers are when cash is supposed to be zero
triggers = df['holdings'] == 0
# inits are when holdings change for the first time after a zero
inits = df.index[triggers].values + 1
df['cash'] = 0
for i in inits:
    # .loc avoids chained-assignment warnings; assumes the series does not end on a zero
    df.loc[i:, 'cash'] = x - df.loc[i, 'holdings']
df.loc[triggers, 'cash'] = 0
df
Out[339]:
holdings cash
0 0 0
1 1 9
2 2 9
3 2 9
4 0 0
5 2 8
6 3 8
7 3 8
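For what it's worth, the same logic can be vectorized with a forward fill instead of the loop. A sketch under the same assumptions (a zero row resets cash, and the value right after each zero sets the cash level):

import pandas as pd

df = pd.DataFrame([0, 1, 2, 2, 0, 2, 3, 3], columns=['holdings'])
x = 10

zero = df['holdings'].eq(0)
# rows immediately after a zero start a new position; row 0 counts as a start too
after_zero = zero.shift(fill_value=True)
# take the holdings value at each start and carry it forward
start_vals = df['holdings'].where(after_zero).ffill()
df['cash'] = (x - start_vals).mask(zero, 0)

This reproduces the output above: 0 on the zero rows, 9 for the first position and 8 for the second.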