Reindexing a specific level of a MultiIndex dataframe - python

I have a DataFrame with two indices and would like to reindex it by one of the indices.
from pandas_datareader import data
import matplotlib.pyplot as plt
import pandas as pd
# Instruments to download
tickers = ['AAPL']
# Online source one should use
data_source = 'yahoo'
# Data range
start_date = '2000-01-01'
end_date = '2018-01-09'
# Load the desired data
panel_data = data.DataReader(tickers, data_source, start_date, end_date).to_frame()
panel_data.head()
The reindexing goes as follows:
# Get just the adjusted closing prices
adj_close = panel_data['Adj Close']
# Get all weekdays between start and end dates
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')
# Align the existing prices in adj_close with our new set of dates
adj_close = adj_close.reindex(all_weekdays, method="ffill")
The last line gives the following error:
TypeError: '<' not supported between instances of 'tuple' and 'int'
This is because the DataFrame index is a list of tuples:
panel_data.index[0]
(Timestamp('2018-01-09 00:00:00'), 'AAPL')
Is it possible to reindex adj_close? By the way, if I don't convert the Panel object to a DataFrame using to_frame(), the reindexing works as it is. But it seems that Panel objects are deprecated...

If you're looking to reindex on a certain level, then reindex accepts a level argument you can pass -
adj_close.reindex(all_weekdays, level=0)
When passing a level argument, you cannot pass a method argument at the same time (reindex throws a TypeError), so you can chain a ffill call after -
adj_close.reindex(all_weekdays, level=0).ffill()
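As a side note (not part of the answer above), a minimal alternative sketch that sidesteps the level argument entirely is to unstack the ticker level so the index becomes a plain DatetimeIndex, reindex, forward-fill, and stack back. The toy Series below stands in for adj_close.
import pandas as pd
# Toy stand-in for adj_close: a Series with a (Date, ticker) MultiIndex
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2018-01-02', '2018-01-03', '2018-01-05']), ['AAPL']])
adj_close = pd.Series([170.0, 171.0, 172.5], index=idx)
all_weekdays = pd.date_range('2018-01-02', '2018-01-08', freq='B')
# Pivot the ticker level into columns, reindex the flat date index,
# forward-fill the inserted business days, then stack back to a MultiIndex
filled = adj_close.unstack().reindex(all_weekdays).ffill().stack()
print(filled)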

Related

Timeseries dataframe returns an error when using Pandas Align - ValueError: cannot join with no overlapping index names

My goal:
I have two time-series data frames, one with a 1m interval and the other with a 5m interval. The 5m data frame is a resampled version of the 1m data. I compute a set of RSI values for the 5m df with the vectorbt library, then align and broadcast these values back to the 1m df using df.align.
The Problem:
When I run this line by line, it works perfectly and produces the aligned result I expect.
However, when it runs inside the function, it raises the following error even though the index names do overlap:
ValueError: cannot join with no overlapping index names
Here's the complete code:
import vectorbt as vbt
import numpy as np
import pandas as pd
import datetime

end_date = datetime.datetime.now()
start_date = end_date - datetime.timedelta(days=3)

btc_price = vbt.YFData.download('BTC-USD',
                                interval='1m',
                                start=start_date,
                                end=end_date,
                                missing_index='drop').get('Close')

def custom_indicator(close, rsi_window=14, ma_window=50):
    close_5m = close.resample('5T').last()
    rsi = vbt.RSI.run(close_5m, window=rsi_window).rsi
    rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill')
    print(rsi)    # to check
    print(close)  # to check
    return

# setting up the indicator factory
ind = vbt.IndicatorFactory(
    class_name='Combination',
    short_name='comb',
    input_names=['close'],
    param_names=['rsi_window', 'ma_window'],
    output_names=['value']).from_apply_func(custom_indicator,
                                            rsi_window=14,
                                            ma_window=50,
                                            keep_pd=True)

res = ind.run(btc_price, rsi_window=21, ma_window=50)
print(res)
Thank you for taking the time to read this. Any help would be appreciated!
If you check the columns of both rsi and close,
print('close is', close.columns)
print('rsi is', rsi.columns)
you will find
rsi is MultiIndex([(14, 'Close')],
names=['rsi_window', None])
close is Index(['Close'], dtype='object')
Since rsi has two column levels, one of them needs to be dropped, which can be done with:
rsi.columns = rsi.columns.droplevel()
This drops one level of the column index so that the two objects can be aligned.
The problem is that close needs to be a time series (a Series), not a DataFrame, for the align join to work.
You need to fix the data type:
# Time Series
close = close['Close']
close_5m = close.resample('15min').last()
rsi = vbt.RSI.run(close_5m, window=rsi_window).rsi
rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill', join='right')
When you are aligning the data, make sure to include join='right':
rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill', join='right')
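Putting these pieces back into the question's custom_indicator, a minimal sketch of the revised function (assembled from the suggestions above, meant to slot into the script from the question, and not verified against the current vectorbt API):
def custom_indicator(close, rsi_window=14, ma_window=50):
    # keep_pd=True hands the function a DataFrame; work on the 'Close' Series instead
    close = close['Close']
    close_5m = close.resample('5T').last()
    rsi = vbt.RSI.run(close_5m, window=rsi_window).rsi
    # if rsi still carries an extra parameter level in its columns, drop it:
    # rsi.columns = rsi.columns.droplevel()
    # join='right' keeps the 1m index from close
    rsi, _ = rsi.align(close, broadcast_axis=0, method='ffill', join='right')
    print(rsi)    # to check
    print(close)  # to check
    return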

pandas-ta with multiindex dataframe

I am wanting to use pandas-ta.
Although most aspects of this library make technical analysis easier, I can only get it to work on single-ticker dataframes.
I would like to figure out how to get pandas-ta to work over multiple tickers in a multiindex dataframe.
I get the data using the following, where [stocks] comes from a CSV list:
df = yf.download(stocks, '2021-1-1', interval='1d')
The pandas-ta download method below only creates a single-ticker dataframe and only picks up the first ticker when using [stocks].
df.ta.ticker('GOOG', period = '1y', interval = "1h")
My current dataframe appears something like below. (where the list of tickers will change)
Adj Close Close High Low Open Volume
BTC-USD ETH-USD BTC-USD ETH-USD BTC-USD ETH-USD BTC-USD ETH-USD BTC-USD ETH-USD BTC-USD ETH-USD
Date
2020-12-31 29001.720703 737.803406 29001.720703 737.803406 29244.876953 754.299438 28201.992188 726.511902 28841.574219 751.626648 46754964848 13926846861
2021-01-01 29374.152344 730.367554 29374.152344 730.367554 29600.626953 749.201843 28803.585938 719.792236 28994.009766 737.708374 40730301359 13652004358
2021-01-02 32127.267578 774.534973 32127.267578 774.534973 33155.117188 786.798462 29091.181641 718.109497 29376.455078 730.402649 67865420765 19740771179
2021-01-03 32782.023438 975.507690 32782.023438 975.507690 34608.558594 1006.565002 32052.316406 771.561646 32129.408203 774.511841 78665235202 45200463368
2021-01-04 31971.914062 1040.233032 31971.914062 1040.233032 33440.218750 1153.189209 28722.755859 912.305359 32810.949219 977.058838 81163475344 56945985763
When I try to apply a pandas-ta function such as:
df[stocks] = data[stocks].ta.sma(length=10)
I get the error.
AttributeError: 'Series' object has no attribute 'ta'
When I use the documentation standard method
sma10 = ta.sma(df["Close"], length=10)
I don't know how to target the specific 'Close' column of each ticker (e.g. BTC-USD) for all tickers in the .csv list, i.e. something like df['Close'].
In both examples pandas-ta sma is using the 'close' value but I'm hoping to be able to apply all pandas-ta methods to a multiindex.
I can download 'Close' only data -
data = yf.download(stocks, '2021-1-1', interval='1d')['Close']
however, the columns are then the ticker names containing the 'Close' data, and I still have the same issue with pandas-ta looking for a 'close' column.
I don't know how to make pandas-ta function over multiple tickers in the same dataframe.
Is there a solution to this?
Thanks for any help!
Since each column of the multi-column frame is a tuple, you can work with the wide-format frame by addressing columns in tuple form with .loc and the like. The loop below adds two types of technical analysis (SMA and VWMA) for every ticker; the last step is to reorder the columns. If you need to handle more than just the closing price, change which column the loop targets.
import pandas as pd
import pandas_ta as ta
import yfinance as yf

stocks = 'BTC-USD ETH-USD XRP-USD XEM-USD'
df = yf.download(stocks, '2021-1-1', interval='1d')

technicals = ['sma10', 'sma25', 'vwma']
tickers = stocks.split(' ')

for ticker in tickers:
    for t in technicals:
        if t[:3] == 'sma':
            length = int(t[3:])
            df[(t, ticker)] = ta.sma(df.loc[:, ('Close', ticker)], length=length)
        else:
            df[(t, ticker)] = ta.vwma(df.loc[:, ('Close', ticker)],
                                      df.loc[:, ('Volume', ticker)])
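The reordering mentioned above can then be done, for example, by sorting the column MultiIndex; this is one possible ordering, shown as a sketch:
# Sort the column MultiIndex so the new technical columns are grouped by field,
# with the tickers ordered within each field
df = df.sort_index(axis=1)
print(df.tail())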

Trying to find mean returns for the following stocks from Yahoo Finance

tickers = ['BIOCON.NS', 'HDFCBANK.NS', 'RELIANCE.NS', 'RADICO.NS', 'LTI.NS', 'TCS.NS', 'DRREDDY.NS','BAJFINANCE.NS']
pfolio_data = pd.DataFrame()
for t in tickers:
    pfolio_data[t] = wb.DataReader(t, data_source='yahoo', start='2017-1-1')['Adj Close']
pfolio_data_returns = (pfolio_data / pfolio_data.shift(1)) - 1
pfolio_data_returns
pfolio_data_returns[[[[[[[['BIOCON.NS', 'HDFCBANK.NS', 'RELIANCE.NS', 'RADICO.NS', 'LTI.NS', 'TCS.NS', 'DRREDDY.NS','BAJFINANCE.NS']]]]]]]].mean()
The last line gives me the error:
unhashable type: 'list'
How should I go ahead with it?
What does the data in pfolio_data_returns look like?
Maybe this is what you are looking for:
Type error: unhashable type 'list' while selecting subset from specific columns pandas dataframe
This should work:
import pandas as pd
import pandas_datareader as wb
tickers = ['BIOCON.NS', 'HDFCBANK.NS', 'RELIANCE.NS', 'RADICO.NS', 'LTI.NS', 'TCS.NS', 'DRREDDY.NS','BAJFINANCE.NS']
data = {ticker: wb.DataReader(ticker, data_source='yahoo', start ='2017-1-1')['Adj Close'] for ticker in tickers}
df_raw = pd.DataFrame(data)
df_returns = (
    df_raw
    .apply(lambda df: df/df.shift(1) - 1)
)
df_returns.mean()
Out[28]:
BIOCON.NS 0.001252
HDFCBANK.NS 0.000694
RELIANCE.NS 0.001477
RADICO.NS 0.001759
LTI.NS 0.001306
TCS.NS 0.000831
DRREDDY.NS 0.000503
BAJFINANCE.NS 0.001425
dtype: float64
I changed the way you create your dataframe. Looks a lot cleaner with a dictionary comprehension. df_raw contains raw data, df_returns computes the returns and the following .mean() statement computes the mean of each column.
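As a side note, the shift-and-divide step is exactly what the built-in pct_change method computes, so the apply could be replaced with the following (same period-over-period returns, assuming pct_change's default behaviour):
df_returns = df_raw.pct_change()
df_returns.mean()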
There are too many opening/closing brackets. The following works:
pfolio_data_returns[['BIOCON.NS', 'HDFCBANK.NS', 'RELIANCE.NS', 'RADICO.NS', 'LTI.NS', 'TCS.NS', 'DRREDDY.NS', 'BAJFINANCE.NS']].mean()

How do I make this function iterable (getting IndexError)

I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (300~).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
from datetime import datetime
import urllib
import datetime

def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'D:\Project\Data\Short_Interest\exampledata.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    d = df
    df = pd.DataFrame(d)
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index of the row of interest
    ix = short[short.Date.eq(issue_date)].index[0]
    # get the positional index for that row's label
    iloc_ix = short.index.get_loc(ix)
    # get the +/-10 iloc rows (+11 because that is how slices work), i.e. 10 trading days either side
    short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    return [short_data]
I want to create a script that iterates over a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as follows:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row, iterate over the csv file, pass each row's values to the function, and accumulate the resulting slices in a list.
I modified your function a bit: I removed the DataFrame creation so it is only done once and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like day/month/year, so I converted them with that format.
import pandas as pd
import datetime, csv

def get_data(df, issue_date, stock_ticker):
    '''Return the slice of df for the ticker centered on the issue date.
    '''
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index of the row of interest
    try:
        ix = short[short.Date.eq(issue_date)].index[0]
        # get the positional index for that row's label
        iloc_ix = short.index.get_loc(ix)
        # get the +/-10 iloc rows (+11 because that is how slices work), i.e. 10 trading days either side
        short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    except IndexError:
        msg = f'no data for {stock_ticker} on {issue_date}'
        #log.info(msg)
        print(msg)
        short_data = None
    return short_data

df = pd.read_csv(datafile)
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")

results = []
with open('issues.csv') as issues:
    for ticker, date in csv.reader(issues):
        day, month, year = map(int, date.split('/'))
        # dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
        date = datetime.date(year, month, day)
        s = get_data(df, date, ticker)
        results.append(s)
        # print(s)
Creating a single DataFrame or table for all that info may be problematic especially since the date ranges are all different. Probably should ask a separate question regarding that. Its mcve should probably just include a few minimal Pandas Series with a couple of different date ranges and tickers.
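If you do want everything in one place despite that caveat, a minimal sketch is to concatenate the non-empty slices; each row keeps its Date and Symbol columns, so the slices simply stack on top of each other:
# Drop the None entries (requests with no match) and stack the remaining slices
valid = [r for r in results if r is not None]
combined = pd.concat(valid, ignore_index=True)
print(combined.head())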

Pandas TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'

I've got some order data that I want to analyse.
Currently of interest is: How often has which SKU been bought in which month?
Here a small example:
import datetime
import pandas as pd
import numpy as np
d = {'sku': ['RT-17']}
df_skus = pd.DataFrame(data=d)
print(df_skus)
d = {'date': ['2017/02/17', '2017/03/17', '2017/04/17', '2017/04/18', '2017/05/02'], 'item_sku': ['HT25', 'RT-17', 'HH30', 'RT-17', 'RT-19']}
df_orders = pd.DataFrame(data=d)
print(df_orders)
for i in df_orders.index:
    print("\n toll")
    df_orders.loc[i, 'date'] = pd.to_datetime(df_orders.loc[i, 'date'])
df_orders = df_orders[df_orders["item_sku"].isin(df_skus["sku"])]
monthly_sales = df_orders.groupby(["item_sku", pd.Grouper(key="date",freq="M")]).size()
monthly_sales = monthly_sales.unstack(0)
print(monthly_sales)
That works fine, but if I use my real order data (from CSV) I get after some minutes:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'
That problem comes from the line:
monthly_sales = df_orders.groupby(["item_sku", pd.Grouper(key="date",freq="M")]).size()
Is it possible to skip over the error?
I tried a try except block:
try:
    monthly_sales = df_orders.groupby(["item_sku", pd.Grouper(key="date", freq="M")]).size()
    monthly_sales = monthly_sales.unstack(0)
except:
    print("\n Here seems to be one issue")
Then print(monthly_sales) gives:
Empty DataFrame
Columns: [txn_id, date, item_sku, quantity]
Index: []
So it seems something in my data empties or breaks the grouping?
How can I 'clean' my data?
Or I'd even be fine with losing the data of a sale here and there if I can just 'skip' over the error - is this possible?
When reading your CSV, use the parse_dates argument -
df_order = pd.read_csv('file.csv', parse_dates=['date'])
Which automatically converts date to datetime. If that doesn't work, then you'll need to load it in as a string, and then use the errors='coerce' argument with pd.to_datetime -
df_order['date'] = pd.to_datetime(df_order['date'], errors='coerce')
Note that you can pass Series objects (amongst other things) to pd.to_datetime.
Next, filter and group as you've been doing, and it should work.
df_orders[df_orders["item_sku"].isin(df_skus["sku"])]\
.groupby(['item_sku', pd.Grouper(key='date', freq='M')]).size()
item_sku date
RT-17 2017-03-31 1
2017-04-30 1
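If you end up using errors='coerce', any unparseable dates become NaT; dropping those rows before grouping matches the 'lose a sale here and there' approach from the question. A sketch, assuming the column names above:
df_orders['date'] = pd.to_datetime(df_orders['date'], errors='coerce')
# drop rows whose dates could not be parsed, then filter and group as before
df_orders = df_orders.dropna(subset=['date'])
monthly_sales = (df_orders[df_orders['item_sku'].isin(df_skus['sku'])]
                 .groupby(['item_sku', pd.Grouper(key='date', freq='M')])
                 .size())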
