Pandas Panel to DataFrame indexing - python

I have the following code:
import pandas as pd
import pandas_datareader.data as web
pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2009', '6/1/2012'))
for stk in ['AAPL', 'GOOG', 'MSFT']))
pdata
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 861 (major_axis) x 3 (minor_axis)
Items axis: Open to Volume
Major_axis axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: AAPL to MSFT
AAPL GOOG MSFT
Date minor
2009-01-02 Open 12.268572 153.302917 19.530001
High 13.005714 159.870193 20.400000
Low 12.165714 151.762924 19.370001
Close 12.964286 159.621811 20.330000
Adj Close 11.621618 159.621811 16.140903
Normally the following method would give me what I need:
pdata = pdata.swapaxes('items', 'minor')
And I get the following warning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are
with a MultiIndex on a DataFrame, via the Panel.to_frame() method
My objective is to have a Data Frame in the form of a Panel, using the date and stock ticker as Mayor and Minor row indices, and the Open price, etc as the columns as this:
minor Open High Low Close Adj Close
Date
2009-01-02 AAPL 12.268572 19.530001 12.165714 12.964286 11.621618
GOOG 153.302917 ... ... ... ...
MSFT 19.530001 ... ... ... ...
I did convert the Panel object into a DataFrame and tried to use the pivot_table or set_index methods but I can't get the stock tickers to be the inner row index. When I use the swapaxes method on the DF, the Date is also swaped to the columns. Is there any easy way I can get the format I need?

Option 1
unstack + swaplevel + sort_index
pdata.to_frame().unstack(0).T\
.swaplevel(0, 1).sort_index(level=[0]).head(6)
minor Open High Low Close Adj Close \
Date
2009-01-02 AAPL 12.268572 13.005714 12.165714 12.964286 11.621618
GOOG 153.302917 159.870193 151.762924 159.621811 159.621811
MSFT 19.530001 20.400000 19.370001 20.330000 16.140903
2009-01-05 AAPL 13.310000 13.740000 13.244286 13.511429 12.112095
GOOG 159.462845 164.549759 156.482239 162.965073 162.965073
MSFT 20.200001 20.670000 20.059999 20.520000 16.291746
minor Volume
Date
2009-01-02 AAPL 186503800.0
GOOG 7267900.0
MSFT 50084000.0
2009-01-05 AAPL 295402100.0
GOOG 9841400.0
MSFT 61475200.0
Option 2
Wen's wonderful stack equivalent.
pdata.to_frame().stack().unstack(-2).head(6)
minor Open High Low Close Adj Close \
Date
2009-01-02 AAPL 12.268572 13.005714 12.165714 12.964286 11.621618
GOOG 153.302917 159.870193 151.762924 159.621811 159.621811
MSFT 19.530001 20.400000 19.370001 20.330000 16.140903
2009-01-05 AAPL 13.310000 13.740000 13.244286 13.511429 12.112095
GOOG 159.462845 164.549759 156.482239 162.965073 162.965073
MSFT 20.200001 20.670000 20.059999 20.520000 16.291746
minor Volume
Date
2009-01-02 AAPL 186503800.0
GOOG 7267900.0
MSFT 50084000.0
2009-01-05 AAPL 295402100.0
GOOG 9841400.0
MSFT 61475200.0

Related

Indexing while using pandas loc

I have a pd.DataFrame which had a Datetime index. When I select a particular Datetime and when there are two matches, it correctly selects date as the index.
a = df.loc[date]
a
Symbol Order Shares
Date
2011-01-10 AAPL BUY 1500
2011-01-10 AAPL SELL 1500
However, when there is only one match for that date, it assigns the columns as the index.
orders
Symbol GOOG
Order BUY
Shares 1000
Name: 2011-01-26 00:00:00, dtype: object
How can I force it to take Date as the index all the time? I don't have a name for the index.
You can use one element list:
a = df.loc[[date]]
Sample:
print (df)
Symbol Order Shares
Date
2011-01-09 AAPL BUY 1500
2011-01-10 AAPL BUY 1500
2011-01-10 AAPL SELL 1500
date = '2011-01-09'
a = df.loc[[date]]
print (a)
Symbol Order Shares
Date
2011-01-09 AAPL BUY 1500
date = '2011-01-10'
a = df.loc[[date]]
print (a)
Symbol Order Shares
Date
2011-01-10 AAPL BUY 1500
2011-01-10 AAPL SELL 1500

How do I filter multiIndexed pandas DataFrame by a column value

Given a DataFrame of stock prices, I am interested in filtering based on the latest closing price. I am aware of how to do this for a simple DataFrame, but cannot figure out how to do it for a multi-indexed dataframe.
Simple dataframe:
AAPL AMZN GOOG MSFT
2021-02-08 136.91 3322.94 2092.91 242.47
2021-02-09 136.01 3305.00 2083.51 243.77
2021-02-10 135.39 3286.58 2095.38 242.82
2021-02-11 135.13 3262.13 2095.89 244.49
2021-02-12 135.37 3277.71 2104.11 244.99
Operation: df.loc[:,df.iloc[-1] < 250]
Output:
AAPL MSFT
2021-02-08 136.91 242.47
2021-02-09 136.01 243.77
2021-02-10 135.39 242.82
2021-02-11 135.13 244.49
2021-02-12 135.37 244.99
However I cannot figure out how to accomplish this on a DataFrame with a MultiIndex (such as OHLC)
Multiindex DataFrame:
Close High Low ... Open Volume
AAPL AMZN GOOG MSFT AAPL AMZN GOOG MSFT AAPL ... MSFT AAPL AMZN GOOG MSFT AAPL AMZN GOOG MSFT
2021-02-08 136.91 3322.94 2092.91 242.47 136.96 3365.00 2123.55 243.68 134.92 ... 240.81 136.03 3358.50 2105.91 243.15 71297200 3257400 1241900 22211900
2021-02-09 136.01 3305.00 2083.51 243.77 137.88 3338.00 2105.13 244.76 135.85 ... 241.38 136.62 3312.49 2078.54 241.87 76774200 2203500 889900 23565000
2021-02-10 135.39 3286.58 2095.38 242.82 136.99 3317.95 2108.37 245.92 134.40 ... 240.89 136.48 3314.00 2094.21 245.00 73046600 3151600 1135500 22186700
2021-02-11 135.13 3262.13 2095.89 244.49 136.39 3292.00 2102.03 245.15 133.77 ... 242.15 135.90 3292.00 2099.51 244.78 64280000 2301400 945700 15751100
2021-02-12 135.37 3277.71 2104.11 244.99 135.53 3280.25 2108.82 245.30 133.69 ... 242.73 134.35 3250.00 2090.25 243.93 60029300 2329300 855700 16552000
[5 rows x 20 columns]
Filter: df_filter = df.iloc[-1].loc['Close'] < 250
AAPL True
AMZN False
GOOG False
MSFT True
Name: 2021-02-12 00:00:00, dtype: bool
Operation???:
Maybe something like df.loc[:,filter] but I receive the error:
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
I understand it's a multi-index so I also tried using pd.IndexSlice: df.loc[:,idx[:,df_filter]] but still get:
ValueError: cannot index with a boolean indexer that is not the same length as the index
Desired Output:
Close High Low Open Volume
AAPL MSFT AAPL MSFT AAPL MSFT AAPL MSFT AAPL MSFT
2021-02-08 136.91 242.47 136.96 243.68 134.92 240.81 136.03 243.15 71297200 22211900
2021-02-09 136.01 243.77 137.88 244.76 135.85 241.38 136.62 241.87 76774200 23565000
2021-02-10 135.39 242.82 136.99 245.92 134.40 240.89 136.48 245.00 73046600 22186700
2021-02-11 135.13 244.49 136.39 245.15 133.77 242.15 135.90 244.78 64280000 15751100
2021-02-12 135.37 244.99 135.53 245.30 133.69 242.73 134.35 243.93 60029300 16552000
I'm not sure if IndexSlice works with boolean indexing. You can try passing the valid index:
df.loc[:,pd.IndexSlice[:, df_filter.index[df_filter]]]

How to loop over unique ticker values in a Pandas DataFrame?

I have the dataframe below, which has several stocks value for about 200 companies, I am trying to find a way to for loop and build a new dataframe which includes these companies' different yearly feature
Date Symbol Open High Low Close Volume Daily Return
2016-01-04 AAPL 102.61 105.37 102.00 105.35 67281190 0.025703
2016-01-05 AAPL 105.75 105.85 102.41 102.71 55790992 0.019960
2016-12-28 AMZN 776.25 780.00 770.50 772.13 3301025 0.009122
2016-12-29 AMZN 772.40 773.40 760.85 765.15 3158299 0.020377
I have tried different way, the closest I have come is:
stocks_features = pd.DataFrame(data=stocks_data.Symbol.unique(), columns = ['Symbol'])
stocks_features['Max_Yearly_Price'] = stocks_data['High'].max()
stocks_features['Min_Yearly_Price'] = stocks_data['Low'].min()
stocks_features
But it gives me the same values for all stocks:
Symbol Max_Yearly_Price Min_Yearly_Price
AAPL 847.21 89.47
AMZN 847.21 89.47
What I am doing wrong, how can I accomplish this?
By using groupby agg
df.groupby('Symbol').agg({'High':'max','Low':'min'}).\
rename(columns={'High':'Max_Yearly_Price','Low':'Min_Yearly_Price'})
Out[861]:
Max_Yearly_Price Min_Yearly_Price
Symbol
AAPL 105.85 102.00
AMZN 780.00 760.85
Wen's answer is great as well. I had a different way of solving it. I'll explain as I go along:
# creates a dictionary of all the symbols and their max values
value_maps = dict(stocks_features.loc[stocks_features.\
groupby('Symbol').High.agg('idxmax')][['Symbol', 'High']].values)
# sets Max_Yearly_Price equal to the symbol
stocks_features['Max_Yearly_Price'] = stocks_features['Symbol']
# replaces the symbol wiht the corresponding value from the dicitonary
stocks_features['Max_Yearly_Price'] = stocks_features['Max_Yearly_Price'].map(value_maps)
# ouput
Date Symbol Open High Low Close Volume Daily Return Max_Yearly_Price
0 2016-01-04 AAPL 102.61 105.37 102.00 105.35 672811900.025703 NaN 105.85
1 2016-01-05 AAPL 105.75 105.85 102.41 102.71 557909920.019960 NaN 105.85
2 2016-12-28 AMZN 776.25 780.00 770.50 772.13 33010250.009122 NaN 780.00
3 2016-12-29 AMZN 772.40 773.40 760.85 765.15 31582990.020377 NaN 780.00

pandas - Inconsistent behavior when applying timeseries operations to a multi-indexed DataFrame

This might be a potential bug: doing grouped timeseries operations fails silently on a multi-indexed DataFrame.
import pandas as pd
import pandas.io.data as web
# Get some market data
df = web.DataReader(['AAPL', 'GOOG'], 'yahoo', pd.Timestamp('2013'), pd.Timestamp('2014')).to_frame()
df.index.names = ('dt', 'symbol')
In [21]: df.head()
Out[21]:
Open High Low Close Volume \
dt symbol
2013-01-02 AAPL 553.82001 555.00000 541.62994 549.03003 140129500
2013-01-03 AAPL 547.88000 549.67004 541.00000 542.10004 88241300
2013-01-04 AAPL 536.96997 538.63000 525.82996 527.00000 148583400
2013-01-07 AAPL 522.00000 529.30005 515.20001 523.90002 121039100
2013-01-08 AAPL 529.21002 531.89001 521.25000 525.31000 114676800
Adj Close
dt symbol
2013-01-02 AAPL 74.63931
2013-01-03 AAPL 73.69719
2013-01-04 AAPL 71.64438
2013-01-07 AAPL 71.22294
2013-01-08 AAPL 71.41463
Let's say we want to resample this to monthly data. This fails and returns an empty DataFrame:
df_M = df.groupby(level='symbol').resample('M', how='mean')
In [23]: df_M
Out[23]:
Empty DataFrame
Columns: []
Index: []
This, however, works, but requires a seemingly-unnecessary re-indexing:
df_M = df.reset_index().set_index('dt').groupby('symbol').resample('M', how='mean')
In [26]: df_M.head()
Out[26]:
Adj Close Close High Low Open \
symbol dt
AAPL 2013-01-31 67.677750 497.822382 504.407623 492.969997 500.083329
2013-02-28 62.388477 456.808942 463.231056 452.106325 458.503692
2013-03-31 60.417287 441.841000 446.803495 437.337996 442.011512
2013-04-30 57.398619 419.765001 425.553183 414.722271 419.766820
2013-05-31 61.340151 446.452734 451.658190 441.495455 446.400919
Volume
symbol dt
AAPL 2013-01-31 1.562312e+08
2013-02-28 1.229478e+08
2013-03-31 1.147110e+08
2013-04-30 1.245851e+08
2013-05-31 1.073583e+08
The fact that you need to do the reset_index().set_index('dt') and then groupby('symbol') instead of groupby(level='symbol') seems to defeat the purpose of a multi-index! What gives?
I also realize that data like this is perhaps more well-suited to a Panel than a DataFrame, but when dealing with very large amounts of (often sparse) data, the 3D Panel structure presents performance and memory issues compared to the flat DataFrame.
This was indeed a bug, and has been fixed: https://github.com/pydata/pandas/issues/10063

How do you convert time series data with pandas and show graph via highstock

I have some time series data as(financial stock trading data):
TIMESTAMP PRICE VOLUME
1294311545 24990 1500000000
1294317813 25499 5000000000
1294318449 25499 100000000
I need to convert them to OHLC values (JSON list) based on price column,ie,(open,high,low,close), and show that as OHLC graph with highstock JS framework.
The output should be as following:
[{'time':'2013-09-01','open':24999,'high':25499,'low':24999,'close':25000,'volume':15000000},
{'time':'2013-09-02','open':24900,'high':25600,'low':24800,'close':25010,'volume':16000000},
{...}]
For example,my sample have 10 data for day 2013-09-01,the output will have one object for the day with high is the highest price of all 10 data, low is the lowest price,open is the first price of the day, close is the last price of that day,volume should be the TOTAL volume of all 10 data.
I know there is a python library pandas maybe could do that,but i still could not try it out.
Updated: As suggestion, i use resample() as:
df['VOLUME'].resample('H', how='sum')
df['PRICE'].resample('H', how='ohlc')
But how to merge the result?
At the moment you can only perform ohlc on a column/Series (will be fixed in 0.13).
First, coerce TIMESTAMP columns to a pandas Timestamp:
In [11]: df.TIMESTAMP = pd.to_datetime(df.TIMESTAMP, unit='s')
In [12]: df.set_index('TIMESTAMP', inplace=True)
In [13]: df
Out[13]:
PRICE VOLUME
TIMESTAMP
2011-01-06 10:59:05 24990 1500000000
2011-01-06 12:43:33 25499 5000000000
2011-01-06 12:54:09 25499 100000000
The resample via ohlc (here I've resampled by hour):
In [14]: df['VOLUME'].resample('H', how='ohlc')
Out[14]:
open high low close
TIMESTAMP
2011-01-06 10:00:00 1500000000 1500000000 1500000000 1500000000
2011-01-06 11:00:00 NaN NaN NaN NaN
2011-01-06 12:00:00 5000000000 5000000000 100000000 100000000
In [15]: df['PRICE'].resample('H', how='ohlc')
Out[15]:
open high low close
TIMESTAMP
2011-01-06 10:00:00 24990 24990 24990 24990
2011-01-06 11:00:00 NaN NaN NaN NaN
2011-01-06 12:00:00 25499 25499 25499 25499
You can apply to_json to any DataFrame:
In [16]: df['PRICE'].resample('H', how='ohlc').to_json()
Out[16]: '{"open":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"high":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"low":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"close":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0}}'
*This would probably be a straightforward enhancement for a DataFrame atm its NotImplemented.
Updated: from your desired output (or at least very close to), can be achieved as follows:
In [21]: price = df['PRICE'].resample('D', how='ohlc').reset_index()
In [22]: price
Out[22]:
TIMESTAMP open high low close
0 2011-01-06 00:00:00 24990 25499 24990 25499
Use the records orientation and the iso date_format:
In [23]: price.to_json(date_format='iso', orient='records')
Out[23]: '[{"TIMESTAMP":"2011-01-06T00:00:00.000Z","open":24990,"high":25499,"low":24990,"close":25499}]'
In [24]: price.to_json('foo.json', date_format='iso', orient='records') # save as json file

Categories

Resources