How can we convert a Pandas DataFrame containing a MultiIndex column, such as
                FB               AAPL
              open    volume     open    volume
date
2019-10-30  189.56  28734995   244.76  31130522
2019-10-31  196.70  42286529   247.24  34790520
2019-11-01  192.85  21711829   249.54  37781334
to one with regular columns, where one of the column index levels (the ticker) becomes an ordinary column repeated on every row:
open volume ticker
date
2019-10-30 189.56 28734995 FB
2019-10-31 196.70 42286529 FB
2019-11-01 192.85 21711829 FB
2019-10-30 244.76 31130522 AAPL
2019-10-31 247.24 34790520 AAPL
2019-11-01 249.54 37781334 AAPL
The main idea is to use DataFrame.stack to move the ticker level into the row index, then DataFrame.reset_index to turn that second index level into a column:
df1 = df.stack(0).rename_axis(('date','ticker')).reset_index(level=1)
print (df1)
ticker open volume
date
2019-10-30 AAPL 244.76 31130522
2019-10-30 FB 189.56 28734995
2019-10-31 AAPL 247.24 34790520
2019-10-31 FB 196.70 42286529
2019-11-01 AAPL 249.54 37781334
2019-11-01 FB 192.85 21711829
If ordering is important, convert the tickers to an ordered categorical, sort by it, and use DataFrame.pop to reassign the column so that it moves to the last position:
df1 = df.stack(0).rename_axis(('date','ticker')).reset_index(level=1)
df1['ticker'] = pd.Categorical(df1.pop('ticker'),
                               ordered=True,
                               categories=df.columns.get_level_values(0).unique())
df1 = df1.sort_values(['ticker','date'])
print (df1)
open volume ticker
date
2019-10-30 189.56 28734995 FB
2019-10-31 196.70 42286529 FB
2019-11-01 192.85 21711829 FB
2019-10-30 244.76 31130522 AAPL
2019-10-31 247.24 34790520 AAPL
2019-11-01 249.54 37781334 AAPL
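For reference, here is a self-contained sketch of the steps above that rebuilds the sample frame by hand. The variable names `long_df` and `tickers` are only illustrative, and the row order straight out of `stack` should not be relied on, which is why the sort is explicit.

```python
import pandas as pd

# Rebuild the sample frame: column levels are (ticker, field).
columns = pd.MultiIndex.from_product([["FB", "AAPL"], ["open", "volume"]])
df = pd.DataFrame(
    [
        [189.56, 28734995, 244.76, 31130522],
        [196.70, 42286529, 247.24, 34790520],
        [192.85, 21711829, 249.54, 37781334],
    ],
    index=pd.Index(
        pd.to_datetime(["2019-10-30", "2019-10-31", "2019-11-01"]), name="date"
    ),
    columns=columns,
)

# Move the ticker level into the rows, then turn it into a column.
long_df = (
    df.stack(0)
    .rename_axis(["date", "ticker"])
    .reset_index(level="ticker")
)

# Keep the tickers in their original column order (FB before AAPL).
tickers = df.columns.get_level_values(0).unique()
long_df["ticker"] = pd.Categorical(
    long_df.pop("ticker"), categories=tickers, ordered=True
)
long_df = long_df.sort_values(["ticker", "date"])
print(long_df)
```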
I'm trying to convert daily prices into weekly, monthly, quarterly, semesterly, yearly, but the code only works when I run it for one stock. When I add another stock to the list the code crashes and gives two errors. 'ValueError: Length of names must match number of levels in MultiIndex.' and 'TypeError: other must be a MultiIndex or a list of tuples.' I'm not experienced with MultiIndexing and have searched everywhere with no success.
This is the code:
import pandas as pd
import yfinance as yf
from pandas_datareader import data as pdr
symbols = ['AMZN', 'AAPL']
yf.pdr_override()
df = pdr.get_data_yahoo(symbols, start = '2014-12-01', end = '2021-01-01')
df = df.reset_index()
df.Date = pd.to_datetime(df.Date)
df.set_index('Date', inplace = True)
res = {'Open': 'first', 'Adj Close': 'last'}
dfw = df.resample('W').agg(res)
dfw_ret = (dfw['Adj Close'] / dfw['Open'] - 1)
dfm = df.resample('BM').agg(res)
dfm_ret = (dfm['Adj Close'] / dfm['Open'] - 1)
dfq = df.resample('Q').agg(res)
dfq_ret = (dfq['Adj Close'] / dfq['Open'] - 1)
dfs = df.resample('6M').agg(res)
dfs_ret = (dfs['Adj Close'] / dfs['Open'] - 1)
dfy = df.resample('Y').agg(res)
dfy_ret = (dfy['Adj Close'] / dfy['Open'] - 1)
print(dfw_ret)
print(dfm_ret)
print(dfq_ret)
print(dfs_ret)
print(dfy_ret)
This is what the original df prints:
```
Adj Close Open
AAPL AMZN AAPL AMZN
Date
2014-12-01 26.122288 326.000000 29.702499 338.119995
2014-12-02 26.022408 326.309998 28.375000 327.500000
2014-12-03 26.317518 316.500000 28.937500 325.730011
2014-12-04 26.217640 316.929993 28.942499 315.529999
2014-12-05 26.106400 312.630005 28.997499 316.799988
... ... ... ... ...
2020-12-24 131.549637 3172.689941 131.320007 3193.899902
2020-12-28 136.254608 3283.959961 133.990005 3194.000000
2020-12-29 134.440399 3322.000000 138.050003 3309.939941
2020-12-30 133.294067 3285.850098 135.580002 3341.000000
2020-12-31 132.267349 3256.929932 134.080002 3275.000000
```
And this is what the different df_ret objects print when I go from daily to weekly/monthly/etc. It only works for one stock, but the idea is to be able to do it for multiple stocks:
```
Date
2014-12-07 -0.075387
2014-12-14 -0.013641
2014-12-21 -0.029041
2014-12-28 0.023680
2015-01-04 0.002176
...
2020-12-06 -0.014306
2020-12-13 -0.012691
2020-12-20 0.018660
2020-12-27 -0.008537
2021-01-03 0.019703
Freq: W-SUN, Length: 318, dtype: float64
Date
2014-12-31 -0.082131
2015-01-30 0.134206
2015-02-27 0.086016
2015-03-31 -0.022975
2015-04-30 0.133512
...
2020-08-31 0.085034
2020-09-30 -0.097677
2020-10-30 -0.053569
2020-11-30 0.034719
2020-12-31 0.021461
Freq: BM, Length: 73, dtype: float64
Date
2014-12-31 -0.082131
2015-03-31 0.190415
2015-06-30 0.166595
2015-09-30 0.165108
2015-12-31 0.322681
2016-03-31 -0.095461
2016-06-30 0.211909
2016-09-30 0.167275
2016-12-31 -0.103026
2017-03-31 0.169701
2017-06-30 0.090090
2017-09-30 -0.011760
2017-12-31 0.213143
2018-03-31 0.234932
2018-06-30 0.199052
2018-09-30 0.190349
2018-12-31 -0.257182
2019-03-31 0.215363
2019-06-30 0.051952
2019-09-30 -0.097281
2019-12-31 0.058328
2020-03-31 0.039851
2020-06-30 0.427244
2020-09-30 0.141676
2020-12-31 0.015252
Freq: Q-DEC, dtype: float64
Date
2014-12-31 -0.082131
2015-06-30 0.388733
2015-12-31 0.538386
2016-06-30 0.090402
2016-12-31 0.045377
2017-06-30 0.277180
2017-12-31 0.202181
2018-06-30 0.450341
2018-12-31 -0.107405
2019-06-30 0.292404
2019-12-31 -0.039075
2020-06-30 0.471371
2020-12-31 0.180907
Freq: 6M, dtype: float64
Date
2014-12-31 -0.082131
2015-12-31 1.162295
2016-12-31 0.142589
2017-12-31 0.542999
2018-12-31 0.281544
2019-12-31 0.261152
2020-12-31 0.737029
Freq: A-DEC, dtype: float64
```
Without knowing what your df DataFrame looks like, I am assuming the issue is with resampling on a MultiIndex, similar to the one discussed in this question.
The solution listed there is to use pd.Grouper with the freq and level parameters filled out correctly.
# This is just from the listed solution, so I am not sure if this is the correct level to choose
df.groupby(pd.Grouper(freq='W', level=-1))
If this doesn't work, I think you would need to provide some more detail or a dummy data set to reproduce the issue.
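For a frame shaped like the one printed in the question, where the column MultiIndex is (field, ticker), one possible workaround is to resample each field separately instead of passing a dict to agg. The sketch below uses made-up prices and a hypothetical helper named `period_returns`; it illustrates the idea under those assumptions and has not been run against the actual yfinance download.

```python
import numpy as np
import pandas as pd

# Made-up daily data: two fields x two tickers over ten business days.
dates = pd.date_range("2021-01-04", periods=10, freq="B")
columns = pd.MultiIndex.from_product([["Adj Close", "Open"], ["AAPL", "AMZN"]])
values = np.arange(1.0, 41.0).reshape(10, 4)
df = pd.DataFrame(values, index=dates, columns=columns)


def period_returns(prices, freq):
    """Return last 'Adj Close' / first 'Open' - 1 per period, one column per ticker."""
    first_open = prices["Open"].resample(freq).first()
    last_close = prices["Adj Close"].resample(freq).last()
    return last_close / first_open - 1


weekly = period_returns(df, "W")
print(weekly)
```

Selecting `prices["Open"]` drops the field level, so both intermediate frames have plain ticker columns and the division aligns ticker by ticker. Other frequencies can be passed the same way.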
Given a DataFrame of stock prices, I am interested in filtering based on the latest closing price. I am aware of how to do this for a simple DataFrame, but cannot figure out how to do it for a multi-indexed dataframe.
Simple dataframe:
AAPL AMZN GOOG MSFT
2021-02-08 136.91 3322.94 2092.91 242.47
2021-02-09 136.01 3305.00 2083.51 243.77
2021-02-10 135.39 3286.58 2095.38 242.82
2021-02-11 135.13 3262.13 2095.89 244.49
2021-02-12 135.37 3277.71 2104.11 244.99
Operation: df.loc[:,df.iloc[-1] < 250]
Output:
AAPL MSFT
2021-02-08 136.91 242.47
2021-02-09 136.01 243.77
2021-02-10 135.39 242.82
2021-02-11 135.13 244.49
2021-02-12 135.37 244.99
However, I cannot figure out how to accomplish this on a DataFrame with MultiIndex columns (such as OHLC data).
Multiindex DataFrame:
Close High Low ... Open Volume
AAPL AMZN GOOG MSFT AAPL AMZN GOOG MSFT AAPL ... MSFT AAPL AMZN GOOG MSFT AAPL AMZN GOOG MSFT
2021-02-08 136.91 3322.94 2092.91 242.47 136.96 3365.00 2123.55 243.68 134.92 ... 240.81 136.03 3358.50 2105.91 243.15 71297200 3257400 1241900 22211900
2021-02-09 136.01 3305.00 2083.51 243.77 137.88 3338.00 2105.13 244.76 135.85 ... 241.38 136.62 3312.49 2078.54 241.87 76774200 2203500 889900 23565000
2021-02-10 135.39 3286.58 2095.38 242.82 136.99 3317.95 2108.37 245.92 134.40 ... 240.89 136.48 3314.00 2094.21 245.00 73046600 3151600 1135500 22186700
2021-02-11 135.13 3262.13 2095.89 244.49 136.39 3292.00 2102.03 245.15 133.77 ... 242.15 135.90 3292.00 2099.51 244.78 64280000 2301400 945700 15751100
2021-02-12 135.37 3277.71 2104.11 244.99 135.53 3280.25 2108.82 245.30 133.69 ... 242.73 134.35 3250.00 2090.25 243.93 60029300 2329300 855700 16552000
[5 rows x 20 columns]
Filter: df_filter = df.iloc[-1].loc['Close'] < 250
AAPL True
AMZN False
GOOG False
MSFT True
Name: 2021-02-12 00:00:00, dtype: bool
Operation???:
Maybe something like df.loc[:,df_filter], but I receive the error:
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
I understand it's a multi-index so I also tried using pd.IndexSlice: df.loc[:,idx[:,df_filter]] but still get:
ValueError: cannot index with a boolean indexer that is not the same length as the index
Desired Output:
             Close            High             Low            Open            Volume
              AAPL    MSFT    AAPL    MSFT    AAPL    MSFT    AAPL    MSFT      AAPL      MSFT
2021-02-08  136.91  242.47  136.96  243.68  134.92  240.81  136.03  243.15  71297200  22211900
2021-02-09  136.01  243.77  137.88  244.76  135.85  241.38  136.62  241.87  76774200  23565000
2021-02-10  135.39  242.82  136.99  245.92  134.40  240.89  136.48  245.00  73046600  22186700
2021-02-11  135.13  244.49  136.39  245.15  133.77  242.15  135.90  244.78  64280000  15751100
2021-02-12  135.37  244.99  135.53  245.30  133.69  242.73  134.35  243.93  60029300  16552000
I'm not sure if IndexSlice works with boolean indexing. You can try passing the labels that satisfy the condition instead:
df.loc[:,pd.IndexSlice[:, df_filter.index[df_filter]]]
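A minimal sketch of that suggestion on a cut-down frame, assuming only the Close and Open fields and the last two dates from the question. The name `keep` is introduced here for illustration.

```python
import pandas as pd

# Cut-down version of the question's frame: (field, ticker) columns.
columns = pd.MultiIndex.from_product(
    [["Close", "Open"], ["AAPL", "AMZN", "GOOG", "MSFT"]]
)
df = pd.DataFrame(
    [
        [135.13, 3262.13, 2095.89, 244.49, 135.90, 3292.00, 2099.51, 244.78],
        [135.37, 3277.71, 2104.11, 244.99, 134.35, 3250.00, 2090.25, 243.93],
    ],
    index=pd.to_datetime(["2021-02-11", "2021-02-12"]),
    columns=columns,
)

# Boolean Series indexed by ticker: is the latest close below 250?
df_filter = df.iloc[-1].loc["Close"] < 250

# Turn the mask into a list of ticker labels and select them on level 1.
keep = df_filter[df_filter].index.tolist()
result = df.loc[:, pd.IndexSlice[:, keep]]
print(result)
```

Passing labels rather than the boolean Series sidesteps the length mismatch: the mask has one entry per ticker, while the column index has one entry per (field, ticker) pair.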
Imagine I have a dataframe with a MultiIndex (stock and date). I would like to calculate the pct_change of the Close column. I use this code:
data['win'] = data.Close.pct_change()*100
Open Close High Low Volume Adj_Close win
Ticker Date
AAPL 2018-12-14 169.000000 165.479996 169.080002 165.279999 40703700 165.479996 NaN
2018-12-17 165.449997 163.940002 168.350006 162.729996 44287900 163.940002 -0.930622
2018-12-18 165.380005 166.070007 167.529999 164.389999 33841500 166.070007 1.299259
2018-12-19 166.000000 160.889999 167.449997 159.089996 48889400 160.889999 -3.119171
AMZN 2018-12-14 1638.000000 1591.910034 1642.569946 1585.000000 6367200 1591.910034 **889.440017**
2018-12-17 1566.000000 1520.910034 1576.130005 1505.010010 8829800 1520.910034 -4.460051
2018-12-18 1540.000000 1551.479980 1567.550049 1523.010010 6523000 1551.479980 2.009977
2018-12-19 1543.050049 1495.079956 1584.530029 1483.180054 8654400 1495.079956 -3.635240
It mostly works, but the first pct_change value for AMZN is incorrect because it uses the last value of AAPL.
How can I change the formula to calculate the pct_change correctly?
The desired result is:
Open Close High Low Volume Adj_Close win
Ticker Date
AAPL 2018-12-14 169.000000 165.479996 169.080002 165.279999 40703700 165.479996 NaN
2018-12-17 165.449997 163.940002 168.350006 162.729996 44287900 163.940002 -0.930622
2018-12-18 165.380005 166.070007 167.529999 164.389999 33841500 166.070007 1.299259
2018-12-19 166.000000 160.889999 167.449997 159.089996 48889400 160.889999 -3.119171
AMZN 2018-12-14 1638.000000 1591.910034 1642.569946 1585.000000 6367200 1591.910034 NaN
2018-12-17 1566.000000 1520.910034 1576.130005 1505.010010 8829800 1520.910034 -4.460051
2018-12-18 1540.000000 1551.479980 1567.550049 1523.010010 6523000 1551.479980 2.009977
2018-12-19 1543.050049 1495.079956 1584.530029 1483.180054 8654400 1495.079956 -3.635240
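One common way to get this result is to group by the Ticker index level before calling pct_change, so that each ticker's series starts with its own NaN. The sketch below is not taken from the thread; it uses only the Close values of the first three dates shown above.

```python
import pandas as pd

# Rebuild a small (Ticker, Date) indexed frame with the Close column only.
index = pd.MultiIndex.from_product(
    [["AAPL", "AMZN"], pd.to_datetime(["2018-12-14", "2018-12-17", "2018-12-18"])],
    names=["Ticker", "Date"],
)
data = pd.DataFrame(
    {
        "Close": [
            165.479996, 163.940002, 166.070007,
            1591.910034, 1520.910034, 1551.479980,
        ]
    },
    index=index,
)

# pct_change is computed within each ticker, never across the boundary.
data["win"] = data.groupby(level="Ticker")["Close"].pct_change() * 100
print(data)
```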
I have the following code:
import pandas as pd
import pandas_datareader.data as web
pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2009', '6/1/2012'))
for stk in ['AAPL', 'GOOG', 'MSFT']))
pdata
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 861 (major_axis) x 3 (minor_axis)
Items axis: Open to Volume
Major_axis axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: AAPL to MSFT
AAPL GOOG MSFT
Date minor
2009-01-02 Open 12.268572 153.302917 19.530001
High 13.005714 159.870193 20.400000
Low 12.165714 151.762924 19.370001
Close 12.964286 159.621811 20.330000
Adj Close 11.621618 159.621811 16.140903
Normally the following method would give me what I need:
pdata = pdata.swapaxes('items', 'minor')
But I get the following warning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are
with a MultiIndex on a DataFrame, via the Panel.to_frame() method
My objective is to have a DataFrame shaped like the Panel, using the date and stock ticker as major and minor row indices, and the Open price, etc. as the columns, like this:
minor Open High Low Close Adj Close
Date
2009-01-02 AAPL 12.268572 19.530001 12.165714 12.964286 11.621618
GOOG 153.302917 ... ... ... ...
MSFT 19.530001 ... ... ... ...
I converted the Panel object into a DataFrame and tried the pivot_table and set_index methods, but I can't get the stock tickers to be the inner row index. When I use the swapaxes method on the DataFrame, the Date is also swapped into the columns. Is there an easy way to get the format I need?
Option 1
unstack + swaplevel + sort_index
pdata.to_frame().unstack(0).T\
.swaplevel(0, 1).sort_index(level=[0]).head(6)
minor Open High Low Close Adj Close \
Date
2009-01-02 AAPL 12.268572 13.005714 12.165714 12.964286 11.621618
GOOG 153.302917 159.870193 151.762924 159.621811 159.621811
MSFT 19.530001 20.400000 19.370001 20.330000 16.140903
2009-01-05 AAPL 13.310000 13.740000 13.244286 13.511429 12.112095
GOOG 159.462845 164.549759 156.482239 162.965073 162.965073
MSFT 20.200001 20.670000 20.059999 20.520000 16.291746
minor Volume
Date
2009-01-02 AAPL 186503800.0
GOOG 7267900.0
MSFT 50084000.0
2009-01-05 AAPL 295402100.0
GOOG 9841400.0
MSFT 61475200.0
Option 2
Wen's wonderful stack equivalent.
pdata.to_frame().stack().unstack(-2).head(6)
minor Open High Low Close Adj Close \
Date
2009-01-02 AAPL 12.268572 13.005714 12.165714 12.964286 11.621618
GOOG 153.302917 159.870193 151.762924 159.621811 159.621811
MSFT 19.530001 20.400000 19.370001 20.330000 16.140903
2009-01-05 AAPL 13.310000 13.740000 13.244286 13.511429 12.112095
GOOG 159.462845 164.549759 156.482239 162.965073 162.965073
MSFT 20.200001 20.670000 20.059999 20.520000 16.291746
minor Volume
Date
2009-01-02 AAPL 186503800.0
GOOG 7267900.0
MSFT 50084000.0
2009-01-05 AAPL 295402100.0
GOOG 9841400.0
MSFT 61475200.0
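If Panel is not available at all, a similar (Date, ticker) layout can be sketched directly from a dict of per-ticker DataFrames with pd.concat. This is an alternative to the two options above, not part of them. The hand-typed frames below stand in for the downloaded data and keep only the Open and Close columns.

```python
import pandas as pd

dates = pd.to_datetime(["2009-01-02", "2009-01-05"])

# One small frame per ticker; stand-ins for the downloaded price data.
frames = {
    "AAPL": pd.DataFrame(
        {"Open": [12.268572, 13.310000], "Close": [12.964286, 13.511429]},
        index=dates,
    ),
    "GOOG": pd.DataFrame(
        {"Open": [153.302917, 159.462845], "Close": [159.621811, 162.965073]},
        index=dates,
    ),
    "MSFT": pd.DataFrame(
        {"Open": [19.530001, 20.200001], "Close": [20.330000, 20.520000]},
        index=dates,
    ),
}

# The dict keys become the outer index level; name both levels explicitly.
stacked = pd.concat(frames, names=["minor", "Date"])

# Put Date on the outside and the ticker on the inside, then sort.
result = stacked.swaplevel("minor", "Date").sort_index()
print(result)
```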
I have the dataframe below, which holds daily stock values for about 200 companies. I am trying to find a way to loop over it and build a new dataframe with yearly features for each company.
Date Symbol Open High Low Close Volume Daily Return
2016-01-04 AAPL 102.61 105.37 102.00 105.35 67281190 0.025703
2016-01-05 AAPL 105.75 105.85 102.41 102.71 55790992 0.019960
2016-12-28 AMZN 776.25 780.00 770.50 772.13 3301025 0.009122
2016-12-29 AMZN 772.40 773.40 760.85 765.15 3158299 0.020377
I have tried different ways; the closest I have come is:
stocks_features = pd.DataFrame(data=stocks_data.Symbol.unique(), columns = ['Symbol'])
stocks_features['Max_Yearly_Price'] = stocks_data['High'].max()
stocks_features['Min_Yearly_Price'] = stocks_data['Low'].min()
stocks_features
But it gives me the same values for all stocks:
Symbol Max_Yearly_Price Min_Yearly_Price
AAPL 847.21 89.47
AMZN 847.21 89.47
What am I doing wrong, and how can I accomplish this?
Use groupby with agg:
df.groupby('Symbol').agg({'High':'max','Low':'min'}).\
rename(columns={'High':'Max_Yearly_Price','Low':'Min_Yearly_Price'})
Out[861]:
Max_Yearly_Price Min_Yearly_Price
Symbol
AAPL 105.85 102.00
AMZN 780.00 760.85
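The same aggregation can also be sketched with named aggregation, which sets the output column names in one step and avoids the separate rename. The frame below is typed in from the four sample rows of the question; named aggregation assumes pandas 0.25 or later.

```python
import pandas as pd

# The four sample rows from the question (only the columns needed here).
stocks_data = pd.DataFrame(
    {
        "Date": ["2016-01-04", "2016-01-05", "2016-12-28", "2016-12-29"],
        "Symbol": ["AAPL", "AAPL", "AMZN", "AMZN"],
        "High": [105.37, 105.85, 780.00, 773.40],
        "Low": [102.00, 102.41, 770.50, 760.85],
    }
)

# One row per symbol, with the output columns named directly.
stocks_features = stocks_data.groupby("Symbol").agg(
    Max_Yearly_Price=("High", "max"),
    Min_Yearly_Price=("Low", "min"),
)
print(stocks_features)
```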
Wen's answer is great as well. I had a different way of solving it. I'll explain as I go along:
# creates a dictionary of all the symbols and their max values
value_maps = dict(stocks_features.loc[stocks_features.\
groupby('Symbol').High.agg('idxmax')][['Symbol', 'High']].values)
# sets Max_Yearly_Price equal to the symbol
stocks_features['Max_Yearly_Price'] = stocks_features['Symbol']
# replaces the symbol with the corresponding value from the dictionary
stocks_features['Max_Yearly_Price'] = stocks_features['Max_Yearly_Price'].map(value_maps)
# output
         Date Symbol    Open    High     Low   Close    Volume     Daily  Return  Max_Yearly_Price
0  2016-01-04   AAPL  102.61  105.37  102.00  105.35  67281190  0.025703     NaN            105.85
1  2016-01-05   AAPL  105.75  105.85  102.41  102.71  55790992  0.019960     NaN            105.85
2  2016-12-28   AMZN  776.25  780.00  770.50  772.13   3301025  0.009122     NaN            780.00
3  2016-12-29   AMZN  772.40  773.40  760.85  765.15   3158299  0.020377     NaN            780.00
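If the goal is to keep every daily row and attach the per-symbol maximum to each one, as in the output above, groupby with transform is a shorter route than building and mapping a dictionary. This is a sketch on the question's four sample rows, not the answerer's method.

```python
import pandas as pd

# The four sample rows from the question (only the columns needed here).
stocks_data = pd.DataFrame(
    {
        "Date": ["2016-01-04", "2016-01-05", "2016-12-28", "2016-12-29"],
        "Symbol": ["AAPL", "AAPL", "AMZN", "AMZN"],
        "High": [105.37, 105.85, 780.00, 773.40],
    }
)

# transform returns a result aligned to the original rows,
# so each row receives the maximum High of its own symbol.
stocks_data["Max_Yearly_Price"] = (
    stocks_data.groupby("Symbol")["High"].transform("max")
)
print(stocks_data)
```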