align returns tuple instead of Pandas DataFrame - python

I am a bit puzzled because I would expect the following code to produce a pandas DataFrame or Series, but instead it returns a tuple.
import pandas_datareader.data as web
import random
import pandas as pd
start_date = '2018-01-01'
end_date = '2018-06-06'
SPY = web.DataReader('SPY', 'yahoo', start_date, end_date)
SPY = SPY[['Adj Close']]
SPY.columns = ['Price']
SPY1 = SPY.iloc[random.sample(range(len(SPY.index)), 80), ]
SPY2 = SPY.iloc[random.sample(range(len(SPY.index)), 80), ]
SPY3 = SPY1.align(SPY2, join = 'inner', axis = 0)
type(SPY3)
tuple
I can transform the tuple to a Series as follows:
SPY3 = pd.Series(SPY3[0])
Still I wonder why a tuple is returned in the first place.

The method align returns a tuple according to the documentation:
Returns: (left, right) : (DataFrame, type of other)
Aligned objects
(left, right) is a tuple
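A minimal sketch of this behaviour, using two small made-up DataFrames in place of the Yahoo download: align always hands back a (left, right) pair, so the idiomatic pattern is to unpack it rather than wrap it in pd.Series.

```python
import pandas as pd

# Two DataFrames with partially overlapping indices (standing in for the
# random SPY samples above).
df1 = pd.DataFrame({'Price': [1.0, 2.0, 3.0]}, index=[0, 1, 2])
df2 = pd.DataFrame({'Price': [10.0, 20.0, 30.0]}, index=[1, 2, 3])

# align returns a (left, right) tuple; unpack it directly.
left, right = df1.align(df2, join='inner', axis=0)
print(left.index.tolist())        # [1, 2]
print(left['Price'].tolist())     # [2.0, 3.0]
print(right['Price'].tolist())    # [10.0, 20.0]
```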


Function to add a column based on the input from a specific column

I have the following dataframe:
import pandas as pd
import numpy as np
from pandas_datareader import data as pdr
from datetime import date, timedelta
import yfinance as yf
yf.pdr_override()
end = date.today()
start = end - timedelta(days=7300)
# download dataframe
data = pdr.get_data_yahoo('^GSPC', start=start, end=end)
Now, that I have the dataframe, I want to create a function to add the logarithmic return based on a column to the dataframe called 'data', with the following code:
data['log_return'] = np.log(data['Adj Close'] / data['Adj Close'].shift(1))
I think the function should look like this:
def add_log_return(df):
    # add returns in a logarithmic fashion
    added = df.copy()
    added["log_return"] = np.log(df[column] / df[column].shift(1))
    added["log_return"] = added["log_return"].apply(lambda x: x*100)
    return added
How can I select a specific column as an input of the function add_log_return(df['Adj Close']), so the function adds the logarithmic return to my 'data' dataframe?
data = add_log_return(df['Adj Close'])
Just add an argument column to your function!
def add_log_return(df, column):
    # add returns in a logarithmic fashion
    added = df.copy()
    added["log_return"] = np.log(df[column] / df[column].shift(1)) * 100
    return added
new_df = add_log_return(old_df, 'Adj_Close')
Note I removed the line in your function that applied a lambda just to multiply by 100. It's much faster to do this in a vectorized manner, by including the * 100 in the np.log(...) line.
However, if I were you, I'd just return the Series object instead of copying the dataframe, modifying it, and returning the copy.
def log_return(col: pd.Series) -> pd.Series:
    return np.log(col / col.shift(1)) * 100
Now, the caller can do what they want with it:
df['log_ret'] = log_return(df['Adj_Close'])
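The standalone-function version can be exercised without any data download. A quick sketch with made-up prices (each step is a 10% gain, so every return is ln(1.1) * 100 ≈ 9.531):

```python
import numpy as np
import pandas as pd

def log_return(col: pd.Series) -> pd.Series:
    # percentage log return: ln(p_t / p_{t-1}) * 100
    return np.log(col / col.shift(1)) * 100

# Made-up prices standing in for an 'Adj_Close' column.
prices = pd.Series([100.0, 110.0, 121.0], name='Adj_Close')
ret = log_return(prices)
print(ret.round(3).tolist())  # first value is NaN (no prior price)
```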

pandas.DataFrame.loc: TypeError returned when trying to modify values of column based on datetime

I have a data frame that contains a column of dates and another column that I'd like to modify according to the date. However when I try to do this using the .loc method, I get
TypeError: '<' not supported between instances of 'str' and 'datetime.datetime'
Could anyone please explain 1) why this error comes up - the dates are datetime objects, and 2) how I can modify the second column. I include a MWE below.
Many thanks
from datetime import datetime as DT
import numpy as np
import pandas as pd
def random_dates(start, end, n, unit='D', seed=None):
    if not seed:  # from piR's answer
        np.random.seed(0)
    ndays = (end - start).days + 1
    return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) + start
start_date = pd.to_datetime('2010-01-01')
end_date = pd.to_datetime('2020-01-01')
df = pd.DataFrame(columns=['Date', 'Names'])
N = 10
df['Date'] = random_dates(start_date, end_date, N)
df = df.assign(Names = ['A'] * N)
df.loc['Date' < DT(2015, 1, 1), 'Names'] = 'B'
(random_dates function from this post)
Switch the line
df.loc['Date' < DT(2015, 1, 1), 'Names'] = 'B'
to
df.loc[df.Date < DT(2015, 1, 1), 'Names'] = 'B'
This would solve it.
You are not using the df.loc statement properly. For your particular logic, you want to change the values of column Names to "B" when a row's corresponding date is before 2015-01-01. When you want to include a condition in df.loc, the proper way to use it is like this:
df.loc[(df['Date'] < DT(2015, 1, 1)), 'Names'] = 'B'
For a detailed guide on how to use conditions with df.loc, you can refer to this link
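A small self-contained sketch of the fix, with a made-up frame instead of random dates: the key point is that the condition must be a boolean Series built from the column, not the bare string 'Date' compared to a datetime.

```python
from datetime import datetime as DT
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2012-06-01', '2016-03-15', '2014-12-31']),
    'Names': ['A', 'A', 'A'],
})

# df['Date'] < DT(...) yields a boolean mask over the rows; .loc then
# assigns 'B' only where the mask is True.
df.loc[df['Date'] < DT(2015, 1, 1), 'Names'] = 'B'
print(df['Names'].tolist())  # ['B', 'A', 'B']
```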

Reindexing a specific level of a MultiIndex dataframe

I have a DataFrame with two indices and would like to reindex it by one of the indices.
from pandas_datareader import data
import matplotlib.pyplot as plt
import pandas as pd
# Instruments to download
tickers = ['AAPL']
# Online source one should use
data_source = 'yahoo'
# Data range
start_date = '2000-01-01'
end_date = '2018-01-09'
# Load the desired data
panel_data = data.DataReader(tickers, data_source, start_date, end_date).to_frame()
panel_data.head()
The reindexing goes as follows:
# Get just the adjusted closing prices
adj_close = panel_data['Adj Close']
# Get all weekdays between start and end dates
all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')
# Align the existing prices in adj_close with our new set of dates
adj_close = adj_close.reindex(all_weekdays, method="ffill")
The last line gives the following error:
TypeError: '<' not supported between instances of 'tuple' and 'int'
This is because the DataFrame index is a list of tuples:
panel_data.index[0]
(Timestamp('2018-01-09 00:00:00'), 'AAPL')
Is it possible to reindex adj_close? By the way, if I don't convert the Panel object to a DataFrame using to_frame(), the reindexing works as it is. But it seems that Panel objects are deprecated...
If you're looking to reindex on a certain level, then reindex accepts a level argument you can pass -
adj_close.reindex(all_weekdays, level=0)
When passing a level argument, you cannot pass a method argument at the same time (reindex throws a TypeError), so you can chain a ffill call after -
adj_close.reindex(all_weekdays, level=0).ffill()
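As a minimal sketch of the level= behaviour, with a made-up two-level (Date, ticker) index in place of the Yahoo data: passing level=0 matches the target against the Date level only, carrying the ticker level along.

```python
import pandas as pd

# Made-up MultiIndex Series standing in for the panel data above.
dates = pd.to_datetime(['2018-01-02', '2018-01-03', '2018-01-04'])
idx = pd.MultiIndex.from_product([dates, ['AAPL']], names=['Date', 'minor'])
adj_close = pd.Series([170.0, 171.0, 172.0], index=idx)

# Reindex on level 0 only: rows are selected by their Date level value.
target = pd.to_datetime(['2018-01-02', '2018-01-04'])
result = adj_close.reindex(target, level=0)
print(result.tolist())  # [170.0, 172.0]
```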

Pandas data.frame, has incorrect indices

I'm pulling data from Yahoo finance, but the data.frame I'm creating won't load because my indices are incorrect.
I know what I need to fix, I just don't know how :/
Here is my code and error:
from scipy import stats
import scipy as sp
import numpy as np
import pandas as pd
import datetime as datetime
from matplotlib.finance import quotes_historical_yahoo
ticker = 'IBM'
begdate = (1962,1,1)
enddate = (2013,11,22)
x = quotes_historical_yahoo(ticker, begdate, enddate, asobject = True, adjusted = True)
logret = np.log(x.aclose[1:] / x.aclose[:-1])
date = []
d0 = x.date
print len(logret)
for i in range(0, len(logret)):
    t1 = ''.join([d0[i].strftime("%Y"), d0[i].strftime("%m"), "01"])
    date.append(datetime.datetime.strptime(t1, "%Y%m%d"))
    y = pd.DataFrame(logret, date)
retM = y.groupby(y.index).sum()
ret_Jan = retM[retM.index.month == 1]
ret_others = retM[retM.index.month != 1]
print sp.stats.bartlett(ret_Jan.values, ret_others.values)
The error comes from this line:
y = pd.DataFrame(logret, date)
And produces this:
ValueError: Shape of passed values is (1, 13064), indices imply (1, 1)
I believe I need to change logret into a list? ... or a tuple?
But my efforts to convert, using tuple(logret) or creating an empty list and populating it, have not worked so far.
Suggestions?
ValueError: Shape of passed values is (1, 13064), indices imply (1, 1)
means that you've given pd.DataFrame a series of length 13064 and an index of length 1, and asked it to index the series by the index. Indeed, that is what you've done: date starts off as [], and then you append one value to it, so the index you're passing to the dataframe is just a singleton list.
I think you probably didn't mean to create the DataFrame inside the loop.
I think you're making this a lot harder by going in and out of Pandas objects. If you just stay in Pandas this is pretty simple. I think all you need to do is:
import pandas.io.data as web
import numpy as np
import scipy as sp
import scipy.stats
import datetime
start = datetime.datetime(1962, 1, 1)
end = datetime.datetime(2013, 11, 22)
f = web.DataReader("IBM", 'yahoo', start, end)
f['returns'] = np.log(f.Close/f.Close.shift(1))
ret_Jan = f.returns[f.index.month == 1]
ret_others = f.returns[f.index.month != 1]
print sp.stats.bartlett(ret_Jan, ret_others)
(122.77708966467267, 1.5602965581388475e-28)
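The same split-by-month workflow can be reproduced offline. This is a sketch on synthetic prices (the ticker download and the resulting statistic above are the original author's; the numbers here are not comparable):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic business-day prices standing in for the IBM download.
idx = pd.date_range('2012-01-01', '2013-12-31', freq='B')
rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(idx)))),
                   index=idx)

# Daily log returns; drop the leading NaN from the shift.
returns = np.log(prices / prices.shift(1)).dropna()

# Split January returns from the rest and compare variances.
ret_jan = returns[returns.index.month == 1]
ret_others = returns[returns.index.month != 1]
stat, pvalue = stats.bartlett(ret_jan, ret_others)
print(stat, pvalue)
```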

Python Pandas join dataframes on index

I am trying to join two dataframes on the same column "Date"; the code is as follows:
import pandas as pd
from datetime import datetime
df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'],index_col='Date')
start = datetime(2010, 2, 5)
end = datetime(2012, 10, 26)
df_train_fly = pd.date_range(start, end, freq="W-FRI")
df_train_fly = pd.DataFrame(pd.Series(df_train_fly), columns=['Date'])
merged = df_train_csv.join(df_train_fly.set_index(['Date']), on = ['Date'], how = 'right', lsuffix='_x')
It complains that dataframe df_train_csv has no column named "Date". I'd like to set "Date" as the index in both dataframes, and I am wondering what is the best way to join dataframes with a date index?
UPDATE:
That is the sample data
Date,Weekly_Sales
2010-02-05,24924.5
2010-02-12,46039.49
2010-02-19,41595.55
2010-02-26,19403.54
2010-03-05,21827.9
2010-03-12,21043.39
2010-03-19,22136.64
2010-03-26,26229.21
2010-04-02,57258.43
2010-04-09,42960.91
2010-04-16,17596.96
2010-04-23,16145.35
2010-04-30,16555.11
2010-05-07,17413.94
2010-05-14,18926.74
2010-05-21,14773.04
2010-05-28,15580.43
2010-06-04,17558.09
2010-06-11,16637.62
2010-06-18,16216.27
2010-06-25,16328.72
2010-07-02,16333.14
2010-07-09,17688.76
2010-07-16,17150.84
2010-07-23,15360.45
2010-07-30,15381.82
2010-08-06,17508.41
2010-08-13,15536.4
2010-08-20,15740.13
2010-08-27,15793.87
2010-09-03,16241.78
2010-09-10,18194.74
2010-09-17,19354.23
2010-09-24,18122.52
2010-10-01,20094.19
2010-10-08,23388.03
2010-10-15,26978.34
2010-10-22,25543.04
2010-10-29,38640.93
2010-11-05,34238.88
2010-11-12,19549.39
2010-11-19,19552.84
2010-11-26,18820.29
2010-12-03,22517.56
2010-12-10,31497.65
2010-12-17,44912.86
2010-12-24,55931.23
2010-12-31,19124.58
2011-01-07,15984.24
2011-01-14,17359.7
2011-01-21,17341.47
2011-01-28,18461.18
2011-02-04,21665.76
2011-02-11,37887.17
2011-02-18,46845.87
2011-02-25,19363.83
2011-03-04,20327.61
2011-03-11,21280.4
2011-03-18,20334.23
2011-03-25,20881.1
2011-04-01,20398.09
2011-04-08,23873.79
2011-04-15,28762.37
2011-04-22,50510.31
2011-04-29,41512.39
2011-05-06,20138.19
2011-05-13,17235.15
2011-05-20,15136.78
2011-05-27,15741.6
2011-06-03,16434.15
2011-06-10,15883.52
2011-06-17,14978.09
2011-06-24,15682.81
2011-07-01,15363.5
2011-07-08,16148.87
2011-07-15,15654.85
2011-07-22,15766.6
2011-07-29,15922.41
2011-08-05,15295.55
2011-08-12,14539.79
2011-08-19,14689.24
2011-08-26,14537.37
2011-09-02,15277.27
2011-09-09,17746.68
2011-09-16,18535.48
2011-09-23,17859.3
2011-09-30,18337.68
2011-10-07,20797.58
2011-10-14,23077.55
2011-10-21,23351.8
2011-10-28,31579.9
2011-11-04,39886.06
2011-11-11,18689.54
2011-11-18,19050.66
2011-11-25,20911.25
2011-12-02,25293.49
2011-12-09,33305.92
2011-12-16,45773.03
2011-12-23,46788.75
2011-12-30,23350.88
2012-01-06,16567.69
2012-01-13,16894.4
2012-01-20,18365.1
2012-01-27,18378.16
2012-02-03,23510.49
2012-02-10,36988.49
2012-02-17,54060.1
2012-02-24,20124.22
2012-03-02,20113.03
2012-03-09,21140.07
2012-03-16,22366.88
2012-03-23,22107.7
2012-03-30,28952.86
2012-04-06,57592.12
2012-04-13,34684.21
2012-04-20,16976.19
2012-04-27,16347.6
2012-05-04,17147.44
2012-05-11,18164.2
2012-05-18,18517.79
2012-05-25,16963.55
2012-06-01,16065.49
2012-06-08,17666
2012-06-15,17558.82
2012-06-22,16633.41
2012-06-29,15722.82
2012-07-06,17823.37
2012-07-13,16566.18
2012-07-20,16348.06
2012-07-27,15731.18
2012-08-03,16628.31
2012-08-10,16119.92
2012-08-17,17330.7
2012-08-24,16286.4
2012-08-31,16680.24
2012-09-07,18322.37
2012-09-14,19616.22
2012-09-21,19251.5
2012-09-28,18947.81
2012-10-05,21904.47
2012-10-12,22764.01
2012-10-19,24185.27
2012-10-26,27390.81
I will read it from a csv file. But sometimes, some weeks may be missing. Therefore, I am trying to generate a date range like this:
df_train_fly = pd.date_range(start, end, freq="W-FRI")
This generated dataframe contains all weeks in the range, so I need to merge those two dataframes into one.
If I check df_train_csv['Date'] and df_train_fly['Date'] from the iPython console, they both showed as dtype: datetime64[ns]
So let's dissect this:
df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'],index_col='Date')
OK, the first problem here is that you have specified that the index column should be 'Date'; this means that you will no longer have a 'Date' column.
start = datetime(2010, 2, 5)
end = datetime(2012, 10, 26)
df_train_fly = pd.date_range(start, end, freq="W-FRI")
df_train_fly = pd.DataFrame(pd.Series(df_train_fly), columns=['Date'])
merged = df_train_csv.join(df_train_fly.set_index(['Date']), on = ['Date'], how = 'right', lsuffix='_x')
So the above join will not work, as the error reported. In order to fix this:
# remove the index_col param
df_train_csv = pd.read_csv('./train.csv',parse_dates=['Date'])
# don't set the index on df_train_fly
merged = df_train_csv.join(df_train_fly, on = ['Date'], how = 'right', lsuffix='_x')
OR don't set the 'on' param:
merged = df_train_csv.join(df_train_fly, how = 'right', lsuffix='_x')
the above will use the index of both df's to join on
You can also achieve the same result by performing a merge instead:
merged = df_train_csv.merge(df_train_fly.set_index(['Date']), left_index=True, right_index=True, how = 'right', suffixes = ('_x', ''))
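A compact sketch of the index-on-index right join, with a tiny hand-built frame in place of train.csv (one week deliberately missing, which is exactly the gap the join exposes):

```python
import pandas as pd

# Weekly sales with the 2010-02-19 week missing, standing in for train.csv.
df_train_csv = pd.DataFrame(
    {'Weekly_Sales': [24924.5, 46039.49, 19403.54]},
    index=pd.to_datetime(['2010-02-05', '2010-02-12', '2010-02-26']),
)
df_train_csv.index.name = 'Date'

# Every Friday in the range; the right join keeps all of them, so the
# missing week shows up as a NaN row.
all_fridays = pd.DataFrame(
    index=pd.date_range('2010-02-05', '2010-02-26', freq='W-FRI'))
merged = df_train_csv.join(all_fridays, how='right')
print(merged)
```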
