Pandas DataFrame has incorrect indices - python

I'm pulling data from Yahoo Finance, but the DataFrame I'm creating won't load because my indices are incorrect.
I know what I need to fix; I just don't know how :/
Here is my code and the error:
from scipy import stats
import scipy as sp
import numpy as np
import pandas as pd
import datetime as datetime
from matplotlib.finance import quotes_historical_yahoo
ticker = 'IBM'
begdate = (1962,1,1)
enddate = (2013,11,22)
x = quotes_historical_yahoo(ticker, begdate, enddate, asobject = True, adjusted = True)
logret = np.log(x.aclose[1:] / x.aclose[:-1])
date = []
d0 = x.date
print len(logret)
for i in range(0, len(logret)):
    t1 = ''.join([d0[i].strftime("%Y"), d0[i].strftime("%m"), "01"])
    date.append(datetime.datetime.strptime(t1, "%Y%m%d"))
    y = pd.DataFrame(logret, date)
retM = y.groupby(y.index).sum()
ret_Jan = retM[retM.index.month == 1]
ret_others = retM[retM.index.month != 1]
print sp.stats.bartlett(ret_Jan.values, ret_others.values)
The error comes from this line:
y = pd.DataFrame(logret, date)
And produces this:
ValueError: Shape of passed values is (1, 13064), indices imply (1, 1)
I believe I need to change logret into a list? ... or a tuple?
But my efforts to convert, using tuple(logret) or creating an empty list and populating it, have not worked so far.
Suggestions?

ValueError: Shape of passed values is (1, 13064), indices imply (1, 1)
means that you've given pd.DataFrame a series of length 13064 and an index of length 1, and asked it to index the series by the index. Indeed, that is what you've done: date starts off as [], and then you append one value to it, so the index you're passing to the dataframe is just a singleton list.
I think you probably didn't mean to create the DataFrame inside the loop.
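A minimal sketch of that fix, assuming synthetic prices in place of the Yahoo download (the `prices` series and its dates are made up for illustration). The point is only that the DataFrame is built once, after all index labels exist, so the values and the index have the same length:

```python
import numpy as np
import pandas as pd

# Synthetic daily prices stand in for the Yahoo download (assumption).
dates = pd.date_range("2013-01-01", periods=90, freq="D")
prices = pd.Series(np.linspace(100.0, 110.0, len(dates)), index=dates)

# Daily log returns; the first row has no predecessor, so drop it.
logret = np.log(prices / prices.shift(1)).dropna()

# Label every return with the first day of its month.
month_start = logret.index.to_period("M").to_timestamp()

# Build the frame once: values and index now have the same length.
y = pd.DataFrame({"logret": logret.values}, index=month_start)

# Sum the daily log returns within each month.
retM = y.groupby(level=0).sum()
print(retM)
```

Because the index is computed for all rows before the frame is created, there is no loop in which a half-built index could be passed to pd.DataFrame.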

I think you're making this a lot harder by going in and out of Pandas objects. If you just stay in Pandas this is pretty simple. I think all you need to do is:
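import datetime
import numpy as np
from scipy import stats
# pandas.io.data was later split out into the pandas_datareader package
import pandas.io.data as web
start = datetime.datetime(1962, 1, 1)
end = datetime.datetime(2013, 11, 22)
f = web.DataReader("IBM", 'yahoo', start, end)
# daily log return: log of today's close over yesterday's close
f['returns'] = np.log(f.Close / f.Close.shift(1))
ret_Jan = f.returns[f.index.month == 1]
ret_others = f.returns[f.index.month != 1]
print(stats.bartlett(ret_Jan, ret_others))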
(122.77708966467267, 1.5602965581388475e-28)
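The same January-versus-other-months split can be tried without a data download. The following is only a sketch on randomly generated prices (the `close` series is an assumption, not IBM data), so its numbers will not match the output above. It also drops the first return, which is NaN because shift(1) has nothing to compare the first row against:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Random-walk prices on business days; purely illustrative data.
rng = np.random.RandomState(0)
dates = pd.date_range("2010-01-01", "2012-12-31", freq="B")
steps = rng.normal(0.0, 0.01, len(dates))
close = pd.Series(100.0 * np.exp(np.cumsum(steps)), index=dates)

# Log returns, without the leading NaN produced by shift(1).
returns = np.log(close / close.shift(1)).dropna()

# Boolean masks on the DatetimeIndex select the two groups.
ret_jan = returns[returns.index.month == 1]
ret_others = returns[returns.index.month != 1]

# Bartlett's test for equal variances between the two groups.
statistic, p_value = stats.bartlett(ret_jan, ret_others)
print(statistic, p_value)
```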

Related

Find a certain date inside Timestamp vector

I have a certain timestamp vector and I need to find the position index of the date inside this vector. Let's say I want to find inside this vector the position index of 2017-01-01.
Here below is the basic code that creates a ts vector:
import numpy as np
import pandas as pd
ts_vec = []
t = pd.Timestamp('2016-03-03 00:00:00')
for i in range(1000):
    ts_vec = [*ts_vec, t]
    t = t + pd.Timedelta(days=1)
ts_vec = np.array(ts_vec)
How should I do this? Thank You
outp = np.where(ts_vec == pd.Timestamp('2017-01-01 00:00:00'))
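Note that np.where returns a tuple of arrays, so the integer position is outp[0][0]. As a sketch of an alternative, assuming the dates can be held in a pandas DatetimeIndex instead of a NumPy object array, the position can also be looked up directly:

```python
import numpy as np
import pandas as pd

# 1000 consecutive days starting at 2016-03-03, as in the question.
idx = pd.date_range("2016-03-03", periods=1000, freq="D")
target = pd.Timestamp("2017-01-01")

# Element-wise comparison gives a boolean array; flatnonzero lists the hits.
positions = np.flatnonzero(idx == target)

# get_loc returns the integer position of a label in the index.
pos = idx.get_loc(target)
print(positions, pos)
```

get_loc raises a KeyError when the date is absent, whereas the comparison approach simply returns an empty array.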

pandas.DataFrame.loc: TypeError returned when trying to modify values of column based on datetime

I have a data frame that contains a column of dates and another column that I'd like to modify according to the date. However when I try to do this using the .loc method, I get
TypeError: '<' not supported between instances of 'str' and 'datetime.datetime'
Could anyone please explain 1) why this error comes up - the dates are datetime objects, and 2) how I can modify the second column. I include a MWE below.
Many thanks
from datetime import datetime as DT
import numpy as np
import pandas as pd
def random_dates(start, end, n, unit='D', seed=None):
    if not seed: # from piR's answer
        np.random.seed(0)
    ndays = (end - start).days + 1
    return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) + start
start_date = pd.to_datetime('2010-01-01')
end_date = pd.to_datetime('2020-01-01')
df = pd.DataFrame(columns=['Date', 'Names'])
N = 10
df['Date'] = random_dates(start_date, end_date, N)
df = df.assign(Names = ['A'] * N)
df.loc['Date' < DT(2015, 1, 1), 'Names'] = 'B'
(random_dates function from this post)
Switch the line
df.loc['Date' < DT(2015, 1, 1), 'Names'] = 'B'
to
df.loc[df.Date < DT(2015, 1, 1), 'Names'] = 'B'
This would solve it.
You are not using df.loc properly. The expression 'Date' < DT(2015, 1, 1) compares the string literal 'Date' with a datetime object, not the column itself, which is why Python raises the TypeError even though the column holds datetimes. To change the values of column Names to "B" where a row's date is before 2015-1-1, pass df.loc a boolean condition built from the column, like this:
df.loc[(df['Date'] < DT(2015, 1, 1)), 'Names'] = 'B'
For a detailed guide on how to use conditions with df.loc, you can refer to this link.
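A small self-contained sketch of the boolean-mask pattern, using four hand-picked dates instead of the random ones from the question (the dates are assumptions chosen so the result is easy to check):

```python
from datetime import datetime as DT
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2012-06-01", "2014-12-31", "2015-01-01", "2018-03-15"]),
    "Names": ["A"] * 4,
})

# Comparing the column (not the string 'Date') yields one boolean per row.
mask = df["Date"] < DT(2015, 1, 1)

# .loc[row_mask, column] assigns only where the mask is True.
df.loc[mask, "Names"] = "B"
print(df)
```

Only the two rows dated before 2015-01-01 are changed; the row dated exactly 2015-01-01 keeps "A" because the comparison is strict.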

align returns tuple instead of Pandas DataFrame

I am a bit puzzled because I would expect the following code to produce a Pandas DataFrame or Series, but instead it returns a tuple.
import pandas_datareader.data as web
import random
import pandas as pd
start_date = '2018-01-01'
end_date = '2018-06-06'
SPY = web.DataReader('SPY', 'yahoo', start_date, end_date)
SPY = SPY[['Adj Close']]
SPY.columns = ['Price']
SPY1 = SPY.iloc[random.sample(range(len(SPY.index)), 80), ]
SPY2 = SPY.iloc[random.sample(range(len(SPY.index)), 80), ]
SPY3 = SPY1.align(SPY2, join = 'inner', axis = 0)
type(SPY3)
tuple
I can transform the tuple to a Series as follows:
SPY3 = pd.Series(SPY3[0])
Still I wonder why a tuple is returned in the first place.
The method align returns a tuple according to the documentation:
Returns: (left, right) : (DataFrame, type of other)
Aligned objects
(left, right) is a tuple
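Since the result is a pair, it can be unpacked directly into two names. A minimal sketch with two tiny hand-made frames (the prices and dates are invented for illustration):

```python
import pandas as pd

a = pd.DataFrame(
    {"Price": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2018-01-02", "2018-01-03", "2018-01-04"]),
)
b = pd.DataFrame(
    {"Price": [20.0, 30.0, 40.0]},
    index=pd.to_datetime(["2018-01-03", "2018-01-04", "2018-01-05"]),
)

# align returns both inputs, each reindexed to the shared (inner) row index.
left, right = a.align(b, join="inner", axis=0)
print(left)
print(right)
```

Here left holds the rows of a and right the rows of b, both restricted to the two dates the frames have in common.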

Excel xlwings data input for Python Technical Indicators

I am trying to replicate a simple Technical-Analysis indicator using xlwings. However, the function does not seem to be able to read the Excel values. Below is the code:
import pandas as pd
import datetime as dt
import numpy as np
import xlwings as xw
@xw.func
def EMA(df, n):
    EMA = pd.Series(pd.ewma(df['Close'], span = n, min_periods = n - 1), name = 'EMA_' + str(n))
    df = df.join(EMA)
    return df
When I enter a list of Excel data, =EMA({1,2,3,4,5}, 5), I get the following error message:
TypeError: list indices must be integers, not str
EMA = pd.Series(pd.ewma(df['Close'], span = n, min_periods = n - 1), name = 'EMA_' + str(n))
(Expert) help much appreciated! Thanks.
EMA() expects a DataFrame df and a scalar n, and it returns the source DataFrame with the EMA added as a separate column. You are passing a plain list of values, and a list cannot be indexed by the column name 'Close', hence the TypeError.
Construct a DataFrame and assign the values to the Close column:
v = range(100) # use your list of values instead
df = pd.DataFrame(v, columns=['Close'])
Call EMA() with this DataFrame:
EMA(df, 5)
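Newer pandas releases no longer provide pd.ewma. As a sketch only, the same function can be written with the Series.ewm accessor; the lowercase name `ema` and the sample values are assumptions made for this example, and the xlwings decorator is left out so the snippet runs outside Excel:

```python
import pandas as pd

def ema(df, n):
    """Return df with an extra EMA_<n> column computed from df['Close']."""
    values = df["Close"].ewm(span=n, min_periods=n - 1).mean()
    return df.join(values.rename("EMA_" + str(n)))

# Wrap the plain list of values in a DataFrame with a 'Close' column.
df = pd.DataFrame({"Close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
result = ema(df, 5)
print(result)
```

With min_periods = n - 1, the first rows of the new column stay NaN until enough observations are available.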

Describing gaps in a time series pandas

I'm trying to write a function that takes a continuous time series and returns a data structure which describes any missing gaps in the data (e.g. a DF with columns 'start' and 'end'). It seems like a fairly common issue for time series, but despite messing around with groupby, diff, and the like -- and exploring SO -- I haven't been able to come up with much better than the below.
It's a priority for me that this use vectorized operations to remain efficient. There has got to be a more obvious solution using vectorized operations -- hasn't there? Thanks for any help, folks.
import pandas as pd
def get_gaps(series):
    """
    @param series: a continuous time series of data with the index's freq set
    @return: a series where the index is the start of gaps, and the values are
        the ends
    """
    missing = series.isnull()
    different_from_last = missing.diff()
    # any row not missing while the last was is a gap end
    gap_ends = series[~missing & different_from_last].index
    # count the start as different from the last
    different_from_last[0] = True
    # any row missing while the last wasn't is a gap start
    gap_starts = series[missing & different_from_last].index
    # check and remedy if series ends with missing data
    if len(gap_starts) > len(gap_ends):
        gap_ends = gap_ends.append(series.index[-1:] + series.index.freq)
    return pd.Series(index=gap_starts, data=gap_ends)
For the record, Pandas==0.13.1, Numpy==1.8.1, Python 2.7
This problem can be reduced to finding runs of consecutive numbers in a list. Find all the positions where the series is null; if a run such as (3, 4, 5, 6) is entirely null, you only need to extract its start and end, (3, 6).
import numpy as np
import pandas as pd
from operator import itemgetter
from itertools import groupby
def find_gap(s):
    """ just treat it as a list of null positions
    """
    nullindex = np.where(s.isnull())[0]
    ranges = []
    # consecutive positions share the same (enumeration index - position) difference
    for k, g in groupby(enumerate(nullindex), lambda ix: ix[0] - ix[1]):
        group = list(map(itemgetter(1), g))
        ranges.append((group[0], group[-1]))
    startgap, endgap = zip(*ranges)
    return pd.Series(endgap, index=startgap)
# create an example
data = [2, 3, 4, 5, 12, 13, 14, 15, 16, 17]
s = pd.Series(data, index=data)
s = s.reindex(range(18))
print(find_gap(s))
Reference: Identify groups of continuous numbers in a list
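Since the question asks for vectorized operations, here is one possible sketch that avoids the Python-level loop. The function name `gap_table` is hypothetical, and unlike get_gaps in the question it reports the last missing label of each gap as its end, not the first present label after it:

```python
import numpy as np
import pandas as pd

def gap_table(series):
    """Return a DataFrame with the first and last missing label of each gap."""
    missing = series.isnull().values
    # Pad with False on both sides so gaps touching either edge are detected.
    padded = np.concatenate(([False], missing, [False])).astype(int)
    change = np.diff(padded)
    # +1 marks a switch from present to missing, -1 from missing to present.
    starts = np.flatnonzero(change == 1)
    ends = np.flatnonzero(change == -1) - 1
    return pd.DataFrame({"start": series.index[starts], "end": series.index[ends]})

# Same example data as above: values exist only at labels 2-5 and 12-17.
data = [2, 3, 4, 5, 12, 13, 14, 15, 16, 17]
s = pd.Series(data, index=data).reindex(range(18))
gaps = gap_table(s)
print(gaps)
```

A series without any missing values yields an empty table, so no special case is needed for that input.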
