I have two pandas dataframes that I want to merge. My first dataframe, names, is a list of stock tickers and corresponding dates. Example below:
Date Symbol DateSym
0 2017-01-05 AGRX AGRX01-05-2017
1 2017-01-05 TMDI TMDI01-05-2017
2 2017-01-06 ATHE ATHE01-06-2017
3 2017-01-06 AVTX AVTX01-06-2017
4 2017-01-09 CVM CVM01-09-2017
5 2017-01-10 DFFN DFFN01-10-2017
6 2017-01-10 VKTX VKTX01-10-2017
7 2017-01-11 BIOC BIOC01-11-2017
8 2017-01-11 CVM CVM01-11-2017
9 2017-01-11 PALI PALI01-11-2017
I created another dataframe, price1, that loops through the tickers and creates a dataframe with the open, high, low, close and other relevant info I need. When I merge the two dataframes together, I want to only show the names dataframe on the left with the corresponding price data on the right. When I ran a test of the first 10 tickers, I noticed that the combined dataframe contains redundant rows (see CVM in rows 4 and 5 below), even though the price1 dataframe doesn't have duplicate values. What am I doing wrong?
import numpy as np
import pandas as pd
import yfinance as yf
import pandas_market_calendars as mcal  # assumed: source of the `nyse` calendar used below

nyse = mcal.get_calendar('NYSE')

def price_stats(df):
    df['Gap Up%'] = df['Open'] / df['Close'].shift(1) - 1
    df['HOD%'] = df['High'] / df['Open'] - 1
    df['Close vs Open%'] = df['Close'] / df['Open'] - 1
    df['Close%'] = df['Close'] / df['Close'].shift(1) - 1
    df['GU and Goes Red'] = np.where((df['Low'] < df['Close'].shift(1)) & (df['Open'] > df['Close'].shift(1)), 1, 0)
    df['Yday Intraday>30%'] = np.where((df['Close vs Open%'].shift(1) > .30), 1, 0)
    df['Gap Up?'] = np.where((df['Gap Up%'] > 0), 1, 0)
    df['Sloppy $ Vol'] = (df['High'] + df['Low'] + df['Close']) / 3 * df['Volume']
    df['Prev Day Sloppy $ Vol'] = (df['High'].shift(1) + df['Low'].shift(1) + df['Close'].shift(1)) / 3 * df['Volume'].shift(1)
    df['Prev Open'] = df['Open'].shift(1)
    df['Prev High'] = df['High'].shift(1)
    df['Prev Low'] = df['Low'].shift(1)
    df['Prev Close'] = df['Close'].shift(1)
    df['Prev Vol'] = df['Volume'].shift(1)
    df['D-2 Close'] = df['Close'].shift(2)
    df['D-2 Vol'] = df['Volume'].shift(2)
    df['D-3 Close'] = df['Close'].shift(3)
    df['D-2 Open'] = df['Open'].shift(2)
    df['D-2 High'] = df['High'].shift(2)
    df['D-2 Low'] = df['Low'].shift(2)
    df['D-2 Intraday Range'] = df['D-2 Close'] / df['D-2 Open'] - 1
    df['D-2 Close%'] = df['D-2 Close'] / df['D-3 Close'] - 1
    df.dropna(inplace=True)
vol_excel = pd.read_excel('C://U******.xlsx')
names = vol_excel.Symbol.to_list()
price1 = pd.DataFrame()
for name in names[0:10]:
    print(name)
    price = yf.download(name, start="2016-12-01", end="2022-03-04")
    price['ticker'] = name
    price_stats(price)
    price1 = pd.concat([price1, price])
price1 = price1.reset_index()
orig_day = pd.to_datetime(price1['Date'])
price1['Prev Day Date'] = orig_day - pd.tseries.offsets.CustomBusinessDay(1, holidays=nyse.holidays().holidays)
price1['DateSym'] = price1['ticker'] + price1['Date'].dt.strftime('%m-%d-%Y')
price1 = price1.rename(columns={'ticker': 'Symbol'})
datesym = price1['DateSym']
price1.drop(labels=['DateSym'], axis=1, inplace=True)
price1.insert(0, 'DateSym', datesym)
vol_excel['DateSym'] = vol_excel['Symbol'] + vol_excel['Date'].dt.strftime('%m-%d-%Y')
dfcombo = vol_excel.merge(price1, on=['Date', 'Symbol'], how='inner')
See how CVM is duplicated when I print out dfcombo:
Date Symbol DateSym_x DateSym_y Open High Low Close Adj Close Volume ... Prev Vol D-2 Close D-2 Vol D-3 Close D-2 Open D-2 High D-2 Low D-2 Intraday Range D-2 Close% Prev Day Date
0 2017-01-05 AGRX AGRX01-05-2017 AGRX01-05-2017 2.71 2.71 2.40 2.52 2.52 2408400 ... 18584900.0 5.000 2390400.0 5.700000 5.770 5.813000 4.460 -0.133449 -0.122807 2017-01-04
1 2017-01-05 TMDI TMDI01-05-2017 TMDI01-05-2017 15.60 16.50 12.90 13.50 13.50 43830 ... 114327.0 10.500 61543.0 7.200000 7.500 10.500000 7.500 0.400000 0.458333 2017-01-04
2 2017-01-06 ATHE ATHE01-06-2017 ATHE01-06-2017 2.58 2.60 2.23 2.42 2.42 222500 ... 1750700.0 1.930 53900.0 1.750000 1.790 1.950000 1.790 0.078212 0.102857 2017-01-05
3 2017-01-06 AVTX AVTX01-06-2017 AVTX01-06-2017 1.24 1.24 1.02 1.07 1.07 480500 ... 1246100.0 0.883 44900.0 0.890000 0.896 0.950000 0.827 -0.014509 -0.007865 2017-01-05
4 2017-01-09 CVM CVM01-09-2017 CVM01-09-2017 2.75 3.00 2.75 2.75 2.75 376520 ... 414056.0 2.000 77360.0 2.000000 2.000 2.250000 2.000 0.000000 0.000000 2017-01-06
5 2017-01-09 CVM CVM01-09-2017 CVM01-09-2017 2.75 3.00 2.75 2.75 2.75 376520 ... 414056.0 2.000 77360.0 2.000000 2.000 2.250000 2.000 0.000000 0.000000 2017-01-06
6 2017-01-10 DFFN DFFN01-10-2017 DFFN01-10-2017 111.00 232.50 108.75 125.25 125.25 165407 ... 43167.0 30.900 67.0 34.650002 31.500 34.349998 30.900 -0.019048 -0.108225 2017-01-09
7 2017-01-10 VKTX VKTX01-10-2017 VKTX01-10-2017 1.64 1.64 1.43 1.56 1.56 981700 ... 1550400.0 1.260 264900.0 1.230000 1.250 1.299000 1.210 0.008000 0.024390 2017-01-09
8 2017-01-11 BIOC BIOC01-11-2017 BIOC01-11-2017 858.00 1017.00 630.00 813.00 813.00 210182 ... 78392.0 306.000 5368.0 285.000000 285.000 315.000000 285.000 0.073684 0.073684 2017-01-10
9 2017-01-11 CVM CVM01-11-2017 CVM01-11-2017 4.25 4.50 3.00 3.75 3.75 487584 ... 672692.0 2.750 376520.0 2.750000 2.750 3.000000 2.750 0.000000 0.000000 2017-01-10
I'm wondering whether the cause is that the names dataframe can list the same ticker under different dates, so each pass through the loop pulls that ticker's full price history from Yahoo and appends it to price1 again, even though it's the same set of data.
For example, in the names dataframe, AGRX can be listed for both 2017-01-05 and 2020-12-20. By the same token, is there a way for me to skip appending that duplicate ticker, and will that solve the issue?
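For what it's worth, that hypothesis is easy to test: download each ticker only once and check that price1 has no repeated (Date, ticker) pairs before the merge. A minimal sketch, reusing the names, price_stats, and yf objects from above:
# Keep each ticker once while preserving order, so repeated symbols aren't re-downloaded.
unique_names = list(dict.fromkeys(names[0:10]))

price1 = pd.DataFrame()
for name in unique_names:
    price = yf.download(name, start="2016-12-01", end="2022-03-04")
    price['ticker'] = name
    price_stats(price)
    price1 = pd.concat([price1, price])
price1 = price1.reset_index()

# Safety net: drop any duplicate (Date, ticker) rows that slipped through,
# so the later merge on ['Date', 'Symbol'] can match each names row only once.
price1 = price1.drop_duplicates(subset=['Date', 'ticker'])
assert not price1.duplicated(subset=['Date', 'ticker']).any()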
Related
I'm trying to calculate the cumulative price of a time series in the face of external cashflows.
This is the sample dataset:
reportdate fund mtd_return cashflow Desired Output
30/11/2018 Fund X -0.00860 15687713 15552798.98
31/12/2018 Fund X -0.00900 15412823.78
31/01/2019 Fund X 0.00920 15554621.76
28/02/2019 Fund X 0.00630 15652615.88
31/03/2019 Fund X 0.00700 15762184.19
30/04/2019 Fund X 0.01220 15954482.84
31/05/2019 Fund X 0.00060 1000000 16964655.53
30/06/2019 Fund X 0.00570 1200000 18268194.07
31/07/2019 Fund X 0.00450 18350400.94
31/08/2019 Fund X 0.00210 18388936.78
30/09/2019 Fund X 0.00530 18486398.15
31/10/2019 Fund X 0.00200 18523370.94
30/11/2019 Fund X 0.00430 18603021.44
31/12/2019 Fund X 0.00660 18725801.38
31/01/2020 Fund X 0.01070 18926167.45
29/02/2020 Fund X -0.00510 18829644.00
31/03/2020 Fund X -0.10700 16814872.09
30/04/2020 Fund X 0.02740 3400000 20768759.59
31/05/2020 Fund X 0.02180 2000000 23265118.55
30/06/2020 Fund X 0.02270 23793236.74
31/07/2020 Fund X 0.01120 24059720.99
31/08/2020 Fund X 0.01260 24362873.47
30/09/2020 Fund X 0.00750 24545595.02
31/10/2020 Fund X 0.00410 -8110576 16502402.68
30/11/2020 Fund X 0.02790 16962819.72
31/12/2020 Fund X 0.01230 17171462.40
In the above, the Desired Output column is calculated by taking the previous row's Desired Output, adding any cashflow in the current period, and multiplying by (1 + mtd_return). Effectively, I'm looking for a good way to calculate compounded returns in the presence of external cashflows.
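For example, the 31/12/2018 row is (15552798.98 + 0) * (1 + (-0.009)) = 15412823.79, and the 31/05/2019 row is (15954482.84 + 1000000) * (1 + 0.0006) = 16964655.53, matching the Desired Output column.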
Any help on implementing this in Python would be appreciated.
Many thanks!
Mike
import pandas as pd
df = pd.read_csv('df7.txt', sep=',', header=0)
df['reportdate'] = pd.to_datetime(df['reportdate'])
df = df.fillna(0)
qqq = []
def func_data(x):
    a = 0
    ind = x.index[0] - 1
    if x.index[0] > 0:
        a = (qqq[ind] + x['cashflow']) * (1 + x['mtd_return'])
        qqq.append(a.values[0])
    else:
        qqq.append(df.loc[0, 'cashflow'] * (1 + df.loc[0, 'mtd_return']))
    return a
df.groupby(['reportdate']).apply(func_data)
df['new'] = qqq
print(df)
Output
reportdate fund mtd_return cashflow Desired Output new
0 2018-11-30 Fund X -0.0086 15687713.0 15552798.98 1.555280e+07
1 2018-12-31 Fund X -0.0090 0.0 15412823.78 1.541282e+07
2 2019-01-31 Fund X 0.0092 0.0 15554621.76 1.555462e+07
3 2019-02-28 Fund X 0.0063 0.0 15652615.88 1.565262e+07
4 2019-03-31 Fund X 0.0070 0.0 15762184.19 1.576218e+07
5 2019-04-30 Fund X 0.0122 0.0 15954482.84 1.595448e+07
6 2019-05-31 Fund X 0.0006 1000000.0 16964655.53 1.696466e+07
7 2019-06-30 Fund X 0.0057 1200000.0 18268194.07 1.826819e+07
8 2019-07-31 Fund X 0.0045 0.0 18350400.94 1.835040e+07
9 2019-08-31 Fund X 0.0021 0.0 18388936.78 1.838894e+07
10 2019-09-30 Fund X 0.0053 0.0 18486398.15 1.848640e+07
11 2019-10-31 Fund X 0.0020 0.0 18523370.94 1.852337e+07
12 2019-11-30 Fund X 0.0043 0.0 18603021.44 1.860302e+07
13 2019-12-31 Fund X 0.0066 0.0 18725801.38 1.872580e+07
14 2020-01-31 Fund X 0.0107 0.0 18926167.45 1.892617e+07
15 2020-02-29 Fund X -0.0051 0.0 18829644.00 1.882964e+07
16 2020-03-31 Fund X -0.1070 0.0 16814872.09 1.681487e+07
17 2020-04-30 Fund X 0.0274 3400000.0 20768759.59 2.076876e+07
18 2020-05-31 Fund X 0.0218 2000000.0 23265118.55 2.326512e+07
19 2020-06-30 Fund X 0.0227 0.0 23793236.74 2.379324e+07
20 2020-07-31 Fund X 0.0112 0.0 24059720.99 2.405972e+07
21 2020-08-31 Fund X 0.0126 0.0 24362873.47 2.436287e+07
22 2020-09-30 Fund X 0.0075 0.0 24545595.02 2.454559e+07
23 2020-10-31 Fund X 0.0041 -8110576.0 16502402.68 1.650240e+07
24 2020-11-30 Fund X 0.0279 0.0 16962819.72 1.696282e+07
25 2020-12-31 Fund X 0.0123 0.0 17171462.40 1.717146e+07
In your file all the values are separated by commas, including the empty ones (that is, there is nothing between consecutive commas). The file is read into a pandas dataframe; header=0 means the first row is used as the column headers. Next, the 'reportdate' column is converted to datetime format and empty values are replaced with zero. The data is then grouped by date and func_data is applied to each group. On the first call the else branch runs, seeding the result with the first row's cashflow times (1 + mtd_return); every subsequent call runs the if branch, which applies the recurrence to the previous result. The calculations are collected in the qqq list, which then populates the 'new' column.
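Since each value depends only on the previous one, a plain loop over the rows is arguably simpler than groupby here. A minimal alternative sketch under the same assumptions (df already read, with NaN cashflows filled with 0):
vals = []
prev = 0.0
for cashflow, ret in zip(df['cashflow'], df['mtd_return']):
    # new value = (previous value + current cashflow) * (1 + current return)
    prev = (prev + cashflow) * (1 + ret)
    vals.append(prev)
df['new'] = vals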
AD AP AR MD MS iS AS
0 169.88 0.00 50.50 814.0 57.3 32.3 43.230
1 12.54 0.01 84.75 93.0 51.3 36.6 43.850
2 321.38 0.00 65.08 986.0 56.7 28.9 42.070
I would like to change the dataframe above to a transposed version where, for each column, the values are put in a single row; e.g. for columns AD and AP it will look like this:
d1_AD d2_AD d3_AD d1_AP d2_AP d3_AP
169.88 12.54 321.38 0.00 0.01 0.00
I can do a transpose, but how do I get the column names and output structure like above?
NOTE: The output is truncated for legibility but the actual output should include all the other columns like AR MD MS iS AS
We can rename to give the index the desired form, then stack and sort_index, then collapse the MultiIndex, to_frame, and transpose:
new_df = df.rename(lambda x: f'd{x + 1}').stack().sort_index(level=1)
new_df.index = new_df.index.map('_'.join)
new_df = new_df.to_frame().transpose()
Input df:
df = pd.DataFrame({
'AD': [169.88, 12.54, 321.38], 'AP': [0.0, 0.01, 0.0],
'AR': [50.5, 84.75, 65.08], 'MD': [814.0, 93.0, 986.0],
'MS': [57.3, 51.3, 56.7], 'iS': [32.3, 36.6, 28.9],
'AS': [43.23, 43.85, 42.07]
})
new_df:
d1_AD d2_AD d3_AD d1_AP d2_AP ... d2_MS d3_MS d1_iS d2_iS d3_iS
0 169.88 12.54 321.38 0.0 0.01 ... 51.3 56.7 32.3 36.6 28.9
[1 rows x 21 columns]
If lexicographic sorting does not work (with 10 or more rows, 'd10' sorts before 'd2' as a string), we can wait to convert the MultiIndex to strings until after sort_index:
new_df = df.stack().sort_index(level=1)  # sort by column label while the row level is still numeric
new_df.index = new_df.index.map(lambda x: f'd{x[0]+1}_{x[1]}')
new_df = new_df.to_frame().transpose()
Larger frame:
df = pd.concat([df] * 4, ignore_index=True)
Truncated output:
d1_AD d2_AD d3_AD d4_AD d5_AD ... d8_iS d9_iS d10_iS d11_iS d12_iS
0 169.88 12.54 321.38 169.88 12.54 ... 36.6 28.9 32.3 36.6 28.9
[1 rows x 84 columns]
If you need the columns in the same order as in df, use melt with ignore_index=False so the original row index can be reused (no need to recalculate groups) and melt handles the ordering:
new_df = df.melt(value_name=0, ignore_index=False)
new_df = new_df[[0]].set_axis(
# Create the new index
'd' + (new_df.index + 1).astype(str) + '_' + new_df['variable']
).transpose()
Truncated output on the larger frame:
d1_AD d2_AD d3_AD d4_AD d5_AD ... d8_AS d9_AS d10_AS d11_AS d12_AS
0 169.88 12.54 321.38 169.88 12.54 ... 43.85 42.07 43.23 43.85 42.07
[1 rows x 84 columns]
You could try melt and set_index with groupby:
x = df.melt().set_index('variable').rename_axis(index=None).T.set_axis([0])
x.set_axis(x.columns + x.columns.to_series().groupby(level=0).transform('cumcount').add(1).astype(str), axis=1)
AD1 AD2 AD3 AP1 AP2 AP3 AR1 AR2 AR3 ... MS1 MS2 MS3 iS1 iS2 iS3 AS1 AS2 AS3
0 169.88 12.54 321.38 0.0 0.01 0.0 50.5 84.75 65.08 ... 57.3 51.3 56.7 32.3 36.6 28.9 43.23 43.85 42.07
[1 rows x 21 columns]
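Note the naming differs from the first approach: the cumcount-based suffix numbers within each column label (AD1, AD2, AD3) rather than prefixing the row number (d1_AD), so pick whichever matches the desired output.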
I have two dataframes
1st
dt SRNE CRSR GME ... ASO TH DTE ATH
0 2021-04-12 00:00:00 6.940 33.67 141.09 ... 32.29 3.42 135.63 50.80
1 2021-04-13 00:00:00 6.930 33.71 140.99 ... 31.68 3.39 137.63 50.88
2 2021-04-14 00:00:00 7.385 33.93 166.53 ... 30.82 3.23 138.72 53.35
3 2021-04-15 00:00:00 7.440 34.16 156.44 ... 30.54 3.26 139.48 54.14
4 2021-04-16 00:00:00 7.490 32.60 154.69 ... 30.77 2.79 140.68 55.45
2nd
dt text compare
0 2021-03-19 14:59:49+00:00 i only need uxy to hit 20 eod to make up for a... 1
1 2021-03-19 14:59:51+00:00 oh this isn’t good 0
2 2021-03-19 14:59:51+00:00 lads why is my account covered in more red ink... 0
3 2021-03-19 14:59:51+00:00 i'm tempted to drop my last 800 into some stup... 0
4 2021-03-19 14:59:52+00:00 the sell offs will continue until moral improves. 0
I want to remove the rows that don't match across the two dataframes, comparing on the date column.
I tried
discussion = discussion[discussion['dt'] == price['dt']]
It gives an error ValueError: Can only compare identically-labeled Series objects
I assume it is because the column names don't match
Appreciate your help
import pandas as pd
discussion = pd.DataFrame([['2021-04-12 00:00:00',6.940,33.67,141.09,32.29, 3.42, 135.63, 50.80],
['2021-04-13 00:00:00',6.930,33.71,140.99,31.68, 3.39, 137.63, 50.88],
['2021-04-14 00:00:00',7.385,33.93,166.53,30.82, 3.23, 138.72, 53.35],
['2021-04-15 00:00:00',7.440,34.16,156.44,30.54, 3.26, 139.48, 54.14],
['2021-04-16 00:00:00',7.490,32.60,154.69,30.77, 2.79, 140.68, 55.45]],
columns=['dt', 'SRNE', 'CRSR', 'GME', 'ASO', 'TH', 'DTE', 'ATH'])
discussion['dt'] = pd.to_datetime(discussion['dt'])
price = pd.DataFrame([['2021-04-12 23:30:00','i only need uxy to hit 20 eod to make up for a...', 1],
['2021-03-19 14:59:51+00:00','oh this isn’t good ',0],
['2021-03-19 14:59:51+00:00','lads why is my account covered in more red ink... ', 0],
['2021-03-19 14:59:51+00:00','im tempted to drop my last 800 into some stup... ', 0],
['2021-04-16 12:45:00','the sell offs will continue until moral improves. ', 0]],
columns=['dt', 'text', 'compare'])
price['dt'] = pd.to_datetime(price['dt'], utc=True)
discussion = discussion[discussion['dt'].dt.date.isin(price['dt'].dt.date)]
discussion
Output
dt SRNE CRSR GME ASO TH DTE ATH
0 2021-04-12 6.94 33.67 141.09 32.29 3.42 135.63 50.80
4 2021-04-16 7.49 32.60 154.69 30.77 2.79 140.68 55.45
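If you want the matching rows side by side rather than just the filtered frame, an inner merge on a normalized date key is an alternative. A sketch, reusing the discussion and price frames defined above (date_key is a hypothetical helper column):
# Build a plain-date key on both sides, then inner-join on it.
discussion['date_key'] = discussion['dt'].dt.date
price['date_key'] = price['dt'].dt.date
matched = discussion.merge(price, on='date_key', suffixes=('_disc', '_price'))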
TOTALLY REWROTE ORIGINAL QUESTION
I read raw data from a csv file "CloseWeights4.csv":
import numpy as np
import pandas as pd

df = pd.read_csv('CloseWeights4.csv')
Date Symbol ClosingPrice Weight
3/1/2010 OGDC 116.51 0.1820219
3/2/2010 OGDC 117.32 0.1820219
3/3/2010 OGDC 116.4 0.1820219
3/4/2010 OGDC 116.58 0.1820219
3/5/2010 OGDC 117.61 0.1820219
3/1/2010 WTI 78.7 0.5348142
3/2/2010 WTI 79.68 0.5348142
3/3/2010 WTI 80.87 0.5348142
3/4/2010 WTI 80.21 0.5348142
3/5/2010 WTI 81.5 0.5348142
3/1/2010 FX 85.07 0.1312427
3/2/2010 FX 85.1077 0.1312427
3/3/2010 FX 85.049 0.1312427
3/4/2010 FX 84.9339 0.1312427
3/5/2010 FX 84.8 0.1312427
3/1/2010 PIB 98.1596499 0.1519211
3/2/2010 PIB 98.1596499 0.1519211
3/3/2010 PIB 98.1764222 0.1519211
3/4/2010 PIB 98.1770656 0.1519211
3/5/2010 PIB 98.1609364 0.1519211
From which I generate a dataframe df2:
df2 = df.iloc[:, 0:3].pivot(index='Date', columns='Symbol', values='ClosingPrice')
df2
Out[10]:
Symbol FX OGDC PIB WTI
Date
2010-03-01 85.0700 116.51 98.159650 78.70
2010-03-02 85.1077 117.32 98.159650 79.68
2010-03-03 85.0490 116.40 98.176422 80.87
2010-03-04 84.9339 116.58 98.177066 80.21
2010-03-05 84.8000 117.61 98.160936 81.50
From this I calculate log returns using:
ret = np.log(df2 / df2.shift(1))
In [12]: ret
Out[12]:
Symbol FX OGDC PIB WTI
Date
2010-03-01 NaN NaN NaN NaN
2010-03-02 0.000443 0.006928 0.000000 0.012375
2010-03-03 -0.000690 -0.007873 0.000171 0.014824
2010-03-04 -0.001354 0.001545 0.000007 -0.008195
2010-03-05 -0.001578 0.008796 -0.000164 0.015955
I have the weights of each security from df:
df3 = df.iloc[:, [1, 3]].drop_duplicates().set_index('Symbol')
df3
Out[14]:
Weight
Symbol
OGDC 0.182022
WTI 0.534814
FX 0.131243
PIB 0.151921
I am trying to get the following weighted return results for each day but don't know how to do the math in pandas:
Date Portfolio_weighted_returns
2010-03-02 0.008174751
2010-03-03 0.006061657
2010-03-04 -0.005002414
2010-03-05 0.009058151
where the Portfolio_weighted_returns of 2010-03-02 is calculated as follows:
0.006928*0.182022+.012375*0.534814+0.000443*0.131243+0*0.151921 = 0.007937512315
I then need to have these results multiplied by a decay factor, where the decay factor is defined as decFac = decay^t. Using decay = 0.5 gives decFac values of:
Date decFac
2010-03-02 0.0625
2010-03-03 0.125
2010-03-04 0.25
2010-03-05 0.5
I then need to take the SQRT of the sum of the squared Portfolio_weighted_returns for each day, each multiplied by the respective decFac, as such:
SQRT(0.008174751^2*0.0625 + 0.006061657^2*0.125 + (-0.005002414)^2*0.25 + 0.009058151^2*0.5) = 0.007487
IIUC you can do it this way:
In [267]: port_ret = ret.dot(df3)
In [268]: port_ret
Out[268]:
Weight
Date
2010-03-01 NaN
2010-03-02 0.007938
2010-03-03 0.006431
2010-03-04 -0.004278
2010-03-05 0.009902
In [269]: decay = 0.5
In [270]: decay_df = pd.DataFrame({'decFac':decay**np.arange(len(ret), 0, -1)}, index=ret.index)
In [271]: decay_df
Out[271]:
decFac
Date
2010-03-01 0.03125
2010-03-02 0.06250
2010-03-03 0.12500
2010-03-04 0.25000
2010-03-05 0.50000
In [272]: (port_ret.Weight**2 * decay_df.decFac).sum() ** 0.5
Out[272]: 0.007918790111274962
In [277]: port_ret.Weight**2 * decay_df.decFac
Out[277]:
Date
2010-03-01 NaN
2010-03-02 0.000004
2010-03-03 0.000005
2010-03-04 0.000005
2010-03-05 0.000049
dtype: float64
import numpy as np
import pandas as pd
define the variables
data = np.array([[85.0700, 116.51, 98.159650, 78.70],
                 [85.1077, 117.32, 98.159650, 79.68],
                 [85.0490, 116.40, 98.176422, 80.87],
                 [84.9339, 116.58, 98.177066, 80.21],
                 [84.8000, 117.61, 98.160936, 81.50]])
cols = ['FX', 'OGDC' , 'PIB' , 'WTI']
dts = pd.Series( data=pd.date_range('2010-03-01', '2010-03-05'), name='Date' )
df2 = pd.DataFrame( data=data, columns=cols, index=dts )
# this is your df3 variable
wgt = pd.DataFrame( data=[0.131243, 0.182022, 0.151921, 0.534814], index=pd.Series(cols, name='Symbol') , columns=['Weight'] )
To calculate daily returns I use the .shift operator
# Calculate the daily returns for each security
df_ret = np.log( df2 / df2.shift(1) )
# FX OGDC PIB WTI
# Date
# 2010-03-01 NaN NaN NaN NaN
# 2010-03-02 0.000443 0.006928 0.000000 0.012375
# 2010-03-03 -0.000690 -0.007873 0.000171 0.014824
# 2010-03-04 -0.001354 0.001545 0.000007 -0.008195
# 2010-03-05 -0.001578 0.008796 -0.000164 0.015955
You need to multiply the Weight column of wgt with df_ret to get the desired result. wgt['Weight'] returns a pd.Series, which is essentially a 1-D array (a pd.DataFrame being the 2-D counterpart); multiplying a Series by a DataFrame broadcasts it across the columns, aligning on the column labels.
df_wgt_ret = wgt['Weight'] * df_ret
#                   FX      OGDC       PIB       WTI
# Date
# 2010-03-01       NaN       NaN       NaN       NaN
# 2010-03-02  0.000058  0.001261  0.000000  0.006619
# 2010-03-03 -0.000091 -0.001433  0.000026  0.007928
# 2010-03-04 -0.000178  0.000281  0.000001 -0.004383
# 2010-03-05 -0.000207  0.001601 -0.000025  0.008533
Sum over the columns (axis=1) to get the portfolio returns. Note this returns a pd.Series, not a dataframe (and that summing an all-NaN row yields 0.0):
port_ret = df_wgt_ret.sum(axis=1)
# Date
# 2010-03-01    0.000000
# 2010-03-02    0.007938
# 2010-03-03    0.006431
# 2010-03-04   -0.004278
# 2010-03-05    0.009902
Finally, square the portfolio returns, multiply by the decay factors, and sum; note that because port_ret is a pd.Series indexed by date, the decay factors need to live in a Series aligned on the same index (sr_dec below, built like decay_df in the previous answer):
sr_dec = pd.Series(0.5 ** np.arange(len(port_ret), 0, -1), index=port_ret.index)
total_ret = (port_ret ** 2 * sr_dec).sum()
final_res = total_ret ** 0.5
The one-liner
I'm assuming decFac is a dataframe with a column named decFac, and I'm using df3 and ret as you've defined them.
result = (( (df3.Weight * ret).sum(axis=1)**2 * decFac.decFac ).sum())**.5
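On the sample data this evaluates to roughly 0.0079, matching the step-by-step Out[272] result above.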
I'm using Pandas for data analysis. I have an input file like this snippet:
VEH SEC POS ACCELL SPEED
2 8.4 36.51 -0.2929 27.39
3 8.4 23.57 -0.7381 33.09
4 8.4 6.18 0.6164 38.8
1 8.5 47.76 0 25.57
I need to reorganize the data so that the rows are keyed by the unique (ordered) values from SEC as the 1st column, and the other columns would be VEH1_POS, VEH1_SPEED, VEH1_ACCELL, VEH2_POS, VEH2_SPEED, VEH2_ACCELL, etc.:
TIME VEH1_POS VEH1_SPEED VEH1_ACCEL VEH2_POS, VEH2_SPEED, etc.
0.1 6.2 3.7 0.0 7.5 2.1
0.2 6.8 3.2 -0.5 8.3 2.1
etc.
So, for example, the value for VEH1_POS for each row in the new dataframe would be filled in by selecting values from the POS column in the original dataframe using the row where the SEC value matches the TIME value for the row in the new dataframe and the VEH value == 1.
To set up the rows in the new data frame I'm doing this:
start = inputdf['SIMSEC'].min()
end = inputdf['SIMSEC'].max()
time_steps = frange(start, end, 0.1)
outputdf['TIME'] = time_steps
But I'm lost at how to select the right values from the input dataframe and create the rest of the new dataframe for further analysis. Note also that the input file will NOT have data for every VEH for every SEC (time stamp). So the solution needs to handle that as well. My best guess was:
outputdf['veh1_pos'] = np.where((inputdf['VEH NO'] == 1) & (inputdf['SIMSEC'] == row['Time Step']))
but that doesn't work.
import pandas as pd
# your data
# ==========================
print(df)
Out[272]:
VEH SEC POS ACCELL SPEED
0 2 8.4 36.51 -0.2929 27.39
1 3 8.4 23.57 -0.7381 33.09
2 4 8.4 6.18 0.6164 38.80
3 1 8.5 47.76 0.0000 25.57
# reshaping
# ==========================
result = df.set_index(['SEC','VEH']).unstack()
Out[278]:
POS ACCELL SPEED
VEH 1 2 3 4 1 2 3 4 1 2 3 4
SEC
8.4 NaN 36.51 23.57 6.18 NaN -0.2929 -0.7381 0.6164 NaN 27.39 33.09 38.8
8.5 47.76 NaN NaN NaN 0 NaN NaN NaN 25.57 NaN NaN NaN
So here, the columns have a multi-level index: the 1st level is POS, ACCELL, SPEED and the 2nd level is VEH = 1, 2, 3, 4.
# if you want to rename the column
temp_z = result.columns.get_level_values(0)
temp_y = result.columns.get_level_values(1)
temp_x = ['VEH'] * len(temp_y)
result.columns = ['{}{}_{}'.format(x,y,z) for x,y,z in zip(temp_x, temp_y, temp_z)]
Out[298]:
VEH1_POS VEH2_POS VEH3_POS VEH4_POS VEH1_ACCELL VEH2_ACCELL VEH3_ACCELL VEH4_ACCELL VEH1_SPEED VEH2_SPEED VEH3_SPEED VEH4_SPEED
SEC
8.4 NaN 36.51 23.57 6.18 NaN -0.2929 -0.7381 0.6164 NaN 27.39 33.09 38.8
8.5 47.76 NaN NaN NaN 0 NaN NaN NaN 25.57 NaN NaN NaN
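To finish with the layout from the question, with the time stamps as an ordinary first column, one small extra step helps; a sketch (renaming SEC to TIME is taken from the question's desired output):
result = result.reset_index().rename(columns={'SEC': 'TIME'})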