I have the dataframe below, which contains stock values for about 200 companies. I am trying to find a way to loop over it and build a new dataframe that holds different yearly features for each company:
Date Symbol Open High Low Close Volume Daily Return
2016-01-04 AAPL 102.61 105.37 102.00 105.35 67281190 0.025703
2016-01-05 AAPL 105.75 105.85 102.41 102.71 55790992 0.019960
2016-12-28 AMZN 776.25 780.00 770.50 772.13 3301025 0.009122
2016-12-29 AMZN 772.40 773.40 760.85 765.15 3158299 0.020377
I have tried different ways; the closest I have come is:
stocks_features = pd.DataFrame(data=stocks_data.Symbol.unique(), columns = ['Symbol'])
stocks_features['Max_Yearly_Price'] = stocks_data['High'].max()
stocks_features['Min_Yearly_Price'] = stocks_data['Low'].min()
stocks_features
But it gives me the same values for all stocks:
Symbol Max_Yearly_Price Min_Yearly_Price
AAPL 847.21 89.47
AMZN 847.21 89.47
What am I doing wrong, and how can I accomplish this?
By using groupby with agg. Your version calls max()/min() over the whole High/Low columns, so the single scalar result is broadcast to every row; grouping by Symbol first computes one value per company:
df.groupby('Symbol').agg({'High':'max','Low':'min'}).\
rename(columns={'High':'Max_Yearly_Price','Low':'Min_Yearly_Price'})
Out[861]:
Max_Yearly_Price Min_Yearly_Price
Symbol
AAPL 105.85 102.00
AMZN 780.00 760.85
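If you want Symbol back as an ordinary column, matching the stocks_features frame you started from, a named-aggregation variant works too (a sketch, not part of the answer above; named aggregation needs pandas >= 0.25):
stocks_features = (
    stocks_data.groupby('Symbol', as_index=False)
               .agg(Max_Yearly_Price=('High', 'max'),
                    Min_Yearly_Price=('Low', 'min'))
)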
Wen's answer is great as well. I had a different way of solving it. I'll explain as I go along:
# create a dictionary mapping each symbol to its max High
value_maps = dict(stocks_features.loc[stocks_features.\
groupby('Symbol').High.agg('idxmax')][['Symbol', 'High']].values)
# initialize Max_Yearly_Price with the symbol itself
stocks_features['Max_Yearly_Price'] = stocks_features['Symbol']
# replace the symbol with the corresponding value from the dictionary
stocks_features['Max_Yearly_Price'] = stocks_features['Max_Yearly_Price'].map(value_maps)
# output
        Date Symbol    Open    High     Low   Close    Volume  Daily Return  Max_Yearly_Price
0 2016-01-04   AAPL  102.61  105.37  102.00  105.35  67281190      0.025703            105.85
1 2016-01-05   AAPL  105.75  105.85  102.41  102.71  55790992      0.019960            105.85
2 2016-12-28   AMZN  776.25  780.00  770.50  772.13   3301025      0.009122            780.00
3 2016-12-29   AMZN  772.40  773.40  760.85  765.15   3158299      0.020377            780.00
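As a side note, the dictionary-and-map step above can be collapsed into a single groupby transform, which broadcasts each symbol's maximum back onto its rows (a sketch, equivalent to the dictionary approach under the same data):
# Broadcast each symbol's max High onto every row of that symbol
stocks_features['Max_Yearly_Price'] = (
    stocks_features.groupby('Symbol')['High'].transform('max')
)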
I have transaction data that I hope to resample in a fashion similar to OHLC stock market prices. The goal is to display a meaningful price summary for each day. However, my challenge is that the transactional data is sparse:
There are days without transactions.
The opening price of the day might happen in the middle of the day and does not roll over from the previous day automatically.
I can do the naive OHLC with resample(), as seen in the example below. Because the data is sparse, the straightforward resample gives less-than-ideal results. To get a meaningful price sample, the following conditions should also be met:
The opening price is always set to the closing price of the previous day
If any day does not have transactions, all OHLC values are the closing price of the previous day ("price does not move")
I can do this in pure Python, as it is not that difficult, but it is not very computationally efficient for high volumes of data. Thus, my question is, would Pandas offer any clever way of doing resample or aggregate satisfying the conditions above, but without needing to loop values in Python manually?
The example code is below:
import pandas as pd
# Transactions do not have regular intervals and may miss days
data = {
"timestamp": [
pd.Timestamp("2020-01-01 01:00"),
pd.Timestamp("2020-01-01 05:00"),
pd.Timestamp("2020-01-02 03:00"),
pd.Timestamp("2020-01-04 04:00"),
pd.Timestamp("2020-01-05 00:00"),
],
"transaction": [
100.00,
102.00,
103.00,
102.80,
99.88
]
}
df = pd.DataFrame.from_dict(data, orient="columns")
df.set_index("timestamp", inplace=True)
print(df)
transaction
timestamp
2020-01-01 01:00:00 100.00
2020-01-01 05:00:00 102.00
2020-01-02 03:00:00 103.00
2020-01-04 04:00:00 102.80
2020-01-05 00:00:00 99.88
# https://stackoverflow.com/a/36223274/315168
naive_resample = df["transaction"].resample("1D").agg({'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last'})
print(naive_resample)
In this result, you can see that:
open/close do not match across daily boundaries
if a day has no transactions, the price is marked as NaN
open high low close
timestamp
2020-01-01 100.00 102.00 100.00 102.00
2020-01-02 103.00 103.00 103.00 103.00
2020-01-03 NaN NaN NaN NaN
2020-01-04 102.80 102.80 102.80 102.80
2020-01-05 99.88 99.88 99.88 99.88
You can use the following logic:
Shift "close" to the next row as "prev_close", to use when processing that row.
If "open" is NaN (no transactions that day), fill in "prev_close" for all OHLC values.
Otherwise, replace "open" with "prev_close".
# Fill missing days "close" with last known value.
naive_resample["close"] = naive_resample["close"].fillna(method="ffill")
# Shift "close" to next row, to use it for next row processing.
naive_resample["prev_close"] = naive_resample["close"].shift(1)
# First transaction has no "prev_close". Adjust it to prevent NaN spill over.
naive_resample.iloc[0, naive_resample.columns.get_loc("prev_close")] = naive_resample.iloc[0, naive_resample.columns.get_loc("open")]
import math

def adjust_ohlc(row):
    # Process a day with no transactions
    if math.isnan(row["open"]):
        return pd.Series([row["prev_close"]] * 4)
    else:
        # Adjust "open" with "prev_close"
        return pd.Series([row["prev_close"], row["high"], row["low"], row["close"]])
naive_resample[["open", "high", "low", "close"]] = naive_resample.apply(adjust_ohlc, axis=1)
naive_resample = naive_resample.drop("prev_close", axis=1)
Output:
open high low close
timestamp
2020-01-01 100.0 102.00 100.00 102.00
2020-01-02 102.0 103.00 103.00 103.00
2020-01-03 103.0 103.00 103.00 103.00
2020-01-04 103.0 102.80 102.80 102.80
2020-01-05 102.8 99.88 99.88 99.88
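For larger frames, the row-wise apply can be replaced with plain column operations, which scale much better; below is a vectorized sketch of the same logic (not part of the original answer), starting again from the naive daily resample:
# Built-in OHLC aggregation gives the same naive table as above
res = df["transaction"].resample("1D").ohlc()

prev_close = res["close"].ffill().shift(1)   # last known close, shifted down one day
missing = res["open"].isna()                 # days without any transaction

# The opening price is always the previous day's close (except the very first day)
res["open"] = prev_close.fillna(res["open"])

# Days without transactions: freeze high/low/close at the previous close
for col in ["high", "low", "close"]:
    res.loc[missing, col] = prev_close[missing]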
I have a pd.DataFrame with a Datetime index. When I select a particular date and there are two matches, it correctly keeps Date as the index.
a = df.loc[date]
a
Symbol Order Shares
Date
2011-01-10 AAPL BUY 1500
2011-01-10 AAPL SELL 1500
However, when there is only one match for that date, it returns a Series with the columns as the index.
a
Symbol GOOG
Order BUY
Shares 1000
Name: 2011-01-26 00:00:00, dtype: object
How can I force it to take Date as the index all the time? I don't have a name for the index.
You can use a one-element list:
a = df.loc[[date]]
Sample:
print (df)
Symbol Order Shares
Date
2011-01-09 AAPL BUY 1500
2011-01-10 AAPL BUY 1500
2011-01-10 AAPL SELL 1500
date = '2011-01-09'
a = df.loc[[date]]
print (a)
Symbol Order Shares
Date
2011-01-09 AAPL BUY 1500
date = '2011-01-10'
a = df.loc[[date]]
print (a)
Symbol Order Shares
Date
2011-01-10 AAPL BUY 1500
2011-01-10 AAPL SELL 1500
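A boolean mask on the index is an equivalent trick that also always returns a DataFrame; unlike df.loc[[date]], it yields an empty frame instead of raising a KeyError when the date is absent (a sketch, same df and date as above):
# Compare the DatetimeIndex against the date; the string is coerced to a Timestamp
a = df[df.index == date]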
Given a DataFrame of stock prices, I am interested in filtering based on the latest closing price. I am aware of how to do this for a simple DataFrame, but cannot figure out how to do it for a multi-indexed dataframe.
Simple dataframe:
AAPL AMZN GOOG MSFT
2021-02-08 136.91 3322.94 2092.91 242.47
2021-02-09 136.01 3305.00 2083.51 243.77
2021-02-10 135.39 3286.58 2095.38 242.82
2021-02-11 135.13 3262.13 2095.89 244.49
2021-02-12 135.37 3277.71 2104.11 244.99
Operation: df.loc[:,df.iloc[-1] < 250]
Output:
AAPL MSFT
2021-02-08 136.91 242.47
2021-02-09 136.01 243.77
2021-02-10 135.39 242.82
2021-02-11 135.13 244.49
2021-02-12 135.37 244.99
However, I cannot figure out how to accomplish this on a DataFrame with a MultiIndex (such as OHLC).
Multiindex DataFrame:
Close High Low ... Open Volume
AAPL AMZN GOOG MSFT AAPL AMZN GOOG MSFT AAPL ... MSFT AAPL AMZN GOOG MSFT AAPL AMZN GOOG MSFT
2021-02-08 136.91 3322.94 2092.91 242.47 136.96 3365.00 2123.55 243.68 134.92 ... 240.81 136.03 3358.50 2105.91 243.15 71297200 3257400 1241900 22211900
2021-02-09 136.01 3305.00 2083.51 243.77 137.88 3338.00 2105.13 244.76 135.85 ... 241.38 136.62 3312.49 2078.54 241.87 76774200 2203500 889900 23565000
2021-02-10 135.39 3286.58 2095.38 242.82 136.99 3317.95 2108.37 245.92 134.40 ... 240.89 136.48 3314.00 2094.21 245.00 73046600 3151600 1135500 22186700
2021-02-11 135.13 3262.13 2095.89 244.49 136.39 3292.00 2102.03 245.15 133.77 ... 242.15 135.90 3292.00 2099.51 244.78 64280000 2301400 945700 15751100
2021-02-12 135.37 3277.71 2104.11 244.99 135.53 3280.25 2108.82 245.30 133.69 ... 242.73 134.35 3250.00 2090.25 243.93 60029300 2329300 855700 16552000
[5 rows x 20 columns]
Filter: df_filter = df.iloc[-1].loc['Close'] < 250
AAPL True
AMZN False
GOOG False
MSFT True
Name: 2021-02-12 00:00:00, dtype: bool
Operation???:
Maybe something like df.loc[:,df_filter], but I receive the error:
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
I understand it's a MultiIndex, so I also tried using pd.IndexSlice: df.loc[:,idx[:,df_filter]], but still get:
ValueError: cannot index with a boolean indexer that is not the same length as the index
Desired Output:
Close High Low Open Volume
AAPL MSFT AAPL MSFT AAPL MSFT AAPL MSFT AAPL MSFT
2021-02-08 136.91 242.47 136.96 243.68 134.92 240.81 136.03 243.15 71297200 22211900
2021-02-09 136.01 243.77 137.88 244.76 135.85 241.38 136.62 241.87 76774200 23565000
2021-02-10 135.39 242.82 136.99 245.92 134.40 240.89 136.48 245.00 73046600 22186700
2021-02-11 135.13 244.49 136.39 245.15 133.77 242.15 135.90 244.78 64280000 15751100
2021-02-12 135.37 244.99 135.53 245.30 133.69 242.73 134.35 243.93 60029300 16552000
I'm not sure if IndexSlice works with boolean indexing. You can try passing the valid index:
df.loc[:,pd.IndexSlice[:, df_filter.index[df_filter]]]
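Spelled out with the frame from the question, that might look like the sketch below (df and df_filter as defined above; a tuple of slices works the same way as pd.IndexSlice):
# Boolean Series over tickers -> plain Index of the tickers that passed the filter
keep = df_filter.index[df_filter]        # e.g. Index(['AAPL', 'MSFT'])

# Keep every field (Close, High, ...) but only the selected tickers
result = df.loc[:, pd.IndexSlice[:, keep]]

# Equivalent without IndexSlice
result = df.loc[:, (slice(None), keep)]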
I have the following code:
import pandas as pd
import pandas_datareader.data as web
pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2009', '6/1/2012'))
for stk in ['AAPL', 'GOOG', 'MSFT']))
pdata
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 861 (major_axis) x 3 (minor_axis)
Items axis: Open to Volume
Major_axis axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00
Minor_axis axis: AAPL to MSFT
AAPL GOOG MSFT
Date minor
2009-01-02 Open 12.268572 153.302917 19.530001
High 13.005714 159.870193 20.400000
Low 12.165714 151.762924 19.370001
Close 12.964286 159.621811 20.330000
Adj Close 11.621618 159.621811 16.140903
Normally the following method would give me what I need:
pdata = pdata.swapaxes('items', 'minor')
And I get the following warning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are
with a MultiIndex on a DataFrame, via the Panel.to_frame() method
My objective is to have a DataFrame in the form of a Panel, using the date and stock ticker as Major and Minor row indices, and the Open price, etc. as the columns, like this:
minor Open High Low Close Adj Close
Date
2009-01-02 AAPL 12.268572 19.530001 12.165714 12.964286 11.621618
GOOG 153.302917 ... ... ... ...
MSFT 19.530001 ... ... ... ...
I did convert the Panel object into a DataFrame and tried to use the pivot_table and set_index methods, but I can't get the stock tickers to be the inner row index. When I use the swapaxes method on the DataFrame, the Date is also swapped to the columns. Is there any easy way I can get the format I need?
Option 1
unstack + swaplevel + sort_index
pdata.to_frame().unstack(0).T\
.swaplevel(0, 1).sort_index(level=[0]).head(6)
minor Open High Low Close Adj Close \
Date
2009-01-02 AAPL 12.268572 13.005714 12.165714 12.964286 11.621618
GOOG 153.302917 159.870193 151.762924 159.621811 159.621811
MSFT 19.530001 20.400000 19.370001 20.330000 16.140903
2009-01-05 AAPL 13.310000 13.740000 13.244286 13.511429 12.112095
GOOG 159.462845 164.549759 156.482239 162.965073 162.965073
MSFT 20.200001 20.670000 20.059999 20.520000 16.291746
minor Volume
Date
2009-01-02 AAPL 186503800.0
GOOG 7267900.0
MSFT 50084000.0
2009-01-05 AAPL 295402100.0
GOOG 9841400.0
MSFT 61475200.0
Option 2
Wen's wonderful stack equivalent.
pdata.to_frame().stack().unstack(-2).head(6)
minor Open High Low Close Adj Close \
Date
2009-01-02 AAPL 12.268572 13.005714 12.165714 12.964286 11.621618
GOOG 153.302917 159.870193 151.762924 159.621811 159.621811
MSFT 19.530001 20.400000 19.370001 20.330000 16.140903
2009-01-05 AAPL 13.310000 13.740000 13.244286 13.511429 12.112095
GOOG 159.462845 164.549759 156.482239 162.965073 162.965073
MSFT 20.200001 20.670000 20.059999 20.520000 16.291746
minor Volume
Date
2009-01-02 AAPL 186503800.0
GOOG 7267900.0
MSFT 50084000.0
2009-01-05 AAPL 295402100.0
GOOG 9841400.0
MSFT 61475200.0
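Since Panel was removed entirely in pandas 1.0, the same (Date, Ticker) layout can nowadays be built directly with concat; a sketch, assuming frames is a hypothetical dict mapping each ticker to its per-ticker OHLC DataFrame indexed by Date (however obtained):
import pandas as pd

# frames = {'AAPL': <DataFrame>, 'GOOG': <DataFrame>, 'MSFT': <DataFrame>}  # hypothetical
combined = pd.concat(frames, names=['Ticker', 'Date'])  # (Ticker, Date) MultiIndex
combined = combined.swaplevel(0, 1).sort_index()        # (Date, Ticker), as desired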
I have some time series data (financial stock trading data):
TIMESTAMP PRICE VOLUME
1294311545 24990 1500000000
1294317813 25499 5000000000
1294318449 25499 100000000
I need to convert it to OHLC values (a JSON list) based on the price column, i.e. (open, high, low, close), and show that as an OHLC graph with the Highstock JS framework.
The output should be as following:
[{'time':'2013-09-01','open':24999,'high':25499,'low':24999,'close':25000,'volume':15000000},
{'time':'2013-09-02','open':24900,'high':25600,'low':24800,'close':25010,'volume':16000000},
{...}]
For example, if my sample has 10 rows for the day 2013-09-01, the output will have one object for that day, where high is the highest price of all 10 rows, low is the lowest price, open is the first price of the day, close is the last price of that day, and volume is the TOTAL volume of all 10 rows.
I know the Python library pandas can probably do this, but I haven't managed to work it out.
Updated: As suggested, I used resample() like this:
df['VOLUME'].resample('H', how='sum')
df['PRICE'].resample('H', how='ohlc')
But how do I merge the results?
At the moment you can only perform ohlc on a column/Series (will be fixed in 0.13).
First, coerce TIMESTAMP columns to a pandas Timestamp:
In [11]: df.TIMESTAMP = pd.to_datetime(df.TIMESTAMP, unit='s')
In [12]: df.set_index('TIMESTAMP', inplace=True)
In [13]: df
Out[13]:
PRICE VOLUME
TIMESTAMP
2011-01-06 10:59:05 24990 1500000000
2011-01-06 12:43:33 25499 5000000000
2011-01-06 12:54:09 25499 100000000
The resample via ohlc (here I've resampled by hour):
In [14]: df['VOLUME'].resample('H', how='ohlc')
Out[14]:
open high low close
TIMESTAMP
2011-01-06 10:00:00 1500000000 1500000000 1500000000 1500000000
2011-01-06 11:00:00 NaN NaN NaN NaN
2011-01-06 12:00:00 5000000000 5000000000 100000000 100000000
In [15]: df['PRICE'].resample('H', how='ohlc')
Out[15]:
open high low close
TIMESTAMP
2011-01-06 10:00:00 24990 24990 24990 24990
2011-01-06 11:00:00 NaN NaN NaN NaN
2011-01-06 12:00:00 25499 25499 25499 25499
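As for the "how to merge" update: both resampled results share the same index, so they can simply be concatenated column-wise. A sketch using the modern resample API (the how= keyword was removed in later pandas versions):
# OHLC of the price plus total volume on the same hourly grid
ohlc = df['PRICE'].resample('H').ohlc()
volume = df['VOLUME'].resample('H').sum().rename('volume')

# Identical indices on both sides, so a column-wise concat lines them up
merged = pd.concat([ohlc, volume], axis=1)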
You can apply to_json to any DataFrame:
In [16]: df['PRICE'].resample('H', how='ohlc').to_json()
Out[16]: '{"open":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"high":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"low":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0},"close":{"1294308000000000000":24990.0,"1294311600000000000":null,"1294315200000000000":25499.0}}'
*This would probably be a straightforward enhancement for a DataFrame; at the moment it's NotImplemented.
Updated: your desired output (or at least something very close to it) can be achieved as follows:
In [21]: price = df['PRICE'].resample('D', how='ohlc').reset_index()
In [22]: price
Out[22]:
TIMESTAMP open high low close
0 2011-01-06 00:00:00 24990 25499 24990 25499
Use the records orientation and the iso date_format:
In [23]: price.to_json(date_format='iso', orient='records')
Out[23]: '[{"TIMESTAMP":"2011-01-06T00:00:00.000Z","open":24990,"high":25499,"low":24990,"close":25499}]'
In [24]: price.to_json('foo.json', date_format='iso', orient='records') # save as json file