How do I find the structural breaks in this time series data? - python

I have yearly average closing values for an asset in a dataframe, and I need to find the structural breaks in the time series. I intended to do this using the statsmodels 'seasonal_decompose' function, but I am having trouble implementing it.
Example data below:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
data = {'Year':['1991','1992','1993','1994','1995','1996','1997','1998','1999','2000','2001','2002','2003','2004'],'Close':[11,22,55,34,447,85,99,86,83,82,81,34,33,36]}
df = pd.DataFrame(data)
df['Year'] = df['Year'].astype(str)
sd = seasonal_decompose(df)
plt.show()
ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None
When I convert the 'Year' column to datetime, I get the following error instead:
TypeError: float() argument must be a string or a number, not 'Timestamp'
I do not know what the issue is; I have no missing values. Separately, does anybody know a more efficient method to identify structural breaks in time series data?
Thanks

You need to convert the string values in the Year column to datetime and then set that column as the index. The ValueError message says as much: x must be a pandas object with a DatetimeIndex.
For example:
from statsmodels.tsa.seasonal import seasonal_decompose
import pandas as pd
data = {'Year':['1991','1992','1993','1994','1995','1996','1997','1998','1999','2000','2001','2002','2003','2004'],'Close':[11,22,55,34,447,85,99,86,83,82,81,34,33,36]}
df = pd.DataFrame(data)
df['Year'] = pd.to_datetime(df['Year'])
df.set_index('Year', drop=True, inplace=True)
sd = seasonal_decompose(df)
sd.plot()
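A quick way to see why this works, sketched with pandas only: after the conversion the index is a DatetimeIndex whose yearly frequency pandas can infer, which is what the ValueError asked for. The format='%Y' argument is an addition made here to keep the parsing explicit; it is not part of the answer above.

```python
import pandas as pd

data = {'Year': ['1991', '1992', '1993', '1994', '1995', '1996', '1997',
                 '1998', '1999', '2000', '2001', '2002', '2003', '2004'],
        'Close': [11, 22, 55, 34, 447, 85, 99, 86, 83, 82, 81, 34, 33, 36]}
df = pd.DataFrame(data)

# Parse the year strings explicitly, then move them into the index.
df['Year'] = pd.to_datetime(df['Year'], format='%Y')
df = df.set_index('Year')

# The index is now a DatetimeIndex with an inferable yearly frequency.
# The exact alias string depends on the pandas version.
print(type(df.index).__name__)
print(df.index.inferred_freq)
```

If the second print shows None for your own data, the dates are probably irregular, and the decomposition would likely still need an explicit period.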

Related

Python Pandas Error: 'DataFrame' object has no attribute 'Datetime' when trying to create an average of time periods

I am trying to create a dataframe (df) containing the sample sums, means, and standard deviations of 12 monthly return series, grouped by month, from a CSV file called QUESTION1DATA.csv. The head of the performance data is not shown here.
So far I have written the following code:
import time
df['Timestamp'] = portlist.to_datetime(df['Year'],format='%Y')
new_df = df.groupby([df.index.year, df.index.month]).agg(['sum', 'mean', 'std'])
new_df.index.set_names(['Year', 'Month'], inplace = True)
new_df.reset_index(inplace = True)
However, when I run this code I get this error and don't know where to go from there:
AttributeError: 'list' object has no attribute 'to_datetime'
to_datetime is a pandas function and should be called on the pandas module, for example:
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
datetime.datetime(1300, 1, 1, 0, 0)
In your case portlist is a list, which has no to_datetime attribute, so the call raises the error.
It should be:
df['Timestamp'] = pd.to_datetime(df['Year'],format='%Y')
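Note that the groupby in the question reads df.index.year and df.index.month, which only exist on a DatetimeIndex, so the parsed dates also need to become the index. A minimal sketch of the whole pipeline follows; the column names and values are hypothetical, since the real QUESTION1DATA.csv is not shown, and it groups by calendar month only.

```python
import pandas as pd

# Hypothetical monthly returns standing in for the real CSV.
df = pd.DataFrame({
    'Date': ['2019-01-31', '2019-02-28', '2020-01-31', '2020-02-29'],
    'Return': [0.01, 0.02, 0.03, 0.04],
})

# Parse the dates with pandas (not a list) and use them as the index.
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df = df.set_index('Date')

# Sum, mean and standard deviation of the returns for each calendar month.
monthly_stats = df.groupby(df.index.month).agg(['sum', 'mean', 'std'])
monthly_stats.index.name = 'Month'
print(monthly_stats)
```

To keep years separate, as in the question, group by [df.index.year, df.index.month] instead.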

Pandas ValueError on df.join(), then TypeError on pd.concat()

I'm sure I have a simple error here but I'm not seeing it. Maybe a fresh pair of eyes can pick it out in a minute or two. I've been working on various solutions for a while now and I need some help.
I have a Pandas dataframe and I'm attempting to add a newly created calculated column to the dataframe based on an existing column in that dataframe.
import pandas as pd
df = pd.read_csv('mydata.csv')
df.date = pd.to_datetime(df.date, format='%d.%m.%Y %H:%M:%S.%f')
# calculate Simple Moving Average with a 20 day window
sma = df_train.close.rolling(window=20).mean()
res = df_train.join(sma, on='date', how='left', lsuffix='_left', rsuffix='_right')
print(res)
ValueError: You are trying to merge on datetime64[ns] and int64 columns. If you wish to proceed you should use pd.concat
Ok, so I tried using pd.concat:
import pandas as pd
df = pd.read_csv('mydata.csv')
df.date = pd.to_datetime(df.date, format='%d.%m.%Y %H:%M:%S.%f')
# calculate Simple Moving Average with 20 days window
sma = df_train.close.rolling(window=20).mean()
frames = [df_train, sma]
res = pd.concat(frames)
print("Printing result of concat")
print(res)
TypeError: concat() missing 1 required positional argument: 'objs'
What positional argument is needed? I can't figure this out based on the research and documentation I've seen online.
Thanks in advance.
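One possible way around the join, offered as a sketch rather than a confirmed fix: join(..., on='date') matches the values of the 'date' column against the index of sma, and that index is the default integer index, which explains the datetime64[ns] versus int64 message. Because the rolling mean keeps the same index as the dataframe it came from, it can simply be assigned as a new column. The data and the window of 3 below are hypothetical stand-ins for mydata.csv and the 20-day window.

```python
import pandas as pd

# Hypothetical data standing in for mydata.csv.
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03',
                            '2020-01-04', '2020-01-05']),
    'close': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# The rolling mean shares df's index, so direct assignment aligns row by row.
df['sma'] = df['close'].rolling(window=3).mean()

# Equivalent result with concat: pass the objects as one list and
# concatenate along columns (axis=1) instead of stacking rows.
sma = df['close'].rolling(window=3).mean().rename('sma')
res = pd.concat([df[['date', 'close']], sma], axis=1)
print(res)
```

The first window - 1 rows of the moving average are NaN because the window is not yet full.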

finding the modulus of time delta in Pandas

I want to find the modulus of two time delta columns in Pandas. My data frame looks something like this:
import pandas as pd
d={'date':(['01/01/2018','05/02/2018','01/01/2018','05/01/2018']),\
'fct date':(['01/06/2019','01/06/2019','01/06/2019','01/06/2019']),\
'tenor':[1,2,3,4]}
df=pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'],format='%d/%m/%Y')
df['fct date'] = pd.to_datetime(df['fct date'],format='%d/%m/%Y')
df['tenor']=pd.to_timedelta(df['tenor'],unit="d")
df
When I try to apply the modulus operator, I get an "unsupported operand" error mentioning TimedeltaIndex. Any idea why this raises an error?
a=(df['date']-df['fct date'])%df['tenor']
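One workaround, if whole days are all that matter, is to take the modulus of integer day counts instead of timedelta values. This is a sketch under that assumption; the variable names diff_days, tenor_days and remainder are introduced here for illustration.

```python
import pandas as pd

d = {'date': ['01/01/2018', '05/02/2018', '01/01/2018', '05/01/2018'],
     'fct date': ['01/06/2019', '01/06/2019', '01/06/2019', '01/06/2019'],
     'tenor': [1, 2, 3, 4]}
df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df['fct date'] = pd.to_datetime(df['fct date'], format='%d/%m/%Y')
df['tenor'] = pd.to_timedelta(df['tenor'], unit='D')

# Convert both sides to whole days (plain integers), then take the modulus.
diff_days = (df['date'] - df['fct date']).dt.days
tenor_days = df['tenor'].dt.days
remainder = diff_days % tenor_days
print(remainder)
```

The differences here are negative, and the integer modulus takes the sign of the divisor, so the remainders are non-negative.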

Plotting using Pandas and datetime format

I have a dataframe with just two columns, Date, and ClosingPrice. I am trying to plot them using df.plot() but keep getting this error:
ValueError: view limit minimum -36785.37852 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units
I have found matplotlib documentation about this, but it only explains how to make sure that the format is datetime. Here is the code I use to ensure the format is datetime; it also prints the data type of each column before plotting.
df.Date = pd.to_datetime(df.Date)
print(df['ClosingPrice'].dtypes)
print(df['Date'].dtypes)
The output of these print statements is:
float64
datetime64[ns]
I am not sure what the problem is since I am verifying the data type before plotting. Here is also what the first few rows of the data set look like:
Date ClosingPrice
0 2013-09-10 64.7010
1 2013-09-11 61.1784
2 2013-09-12 61.8298
3 2013-09-13 60.8108
4 2013-09-16 58.8776
5 2013-09-17 59.5577
6 2013-09-18 60.7821
7 2013-09-19 61.7788
Any help is appreciated.
EDIT 2, after seeing more people ending up here: to be clear for people new to Python, you should first import pandas for the code below to work:
import pandas as pd
EDIT 1: (short quick answer)
If you don't want to drop your original index (this makes sense after reading the original, longer answer below) you could:
df[['Date','ClosingPrice']].plot('Date', figsize=(15,8))
Original and long answer:
Try setting your index as your Datetime column first:
df.set_index('Date', inplace=True, drop=True)
Just to be sure, try setting the index dtype (edit: this probably won't be needed, as you did it previously):
df.index = pd.to_datetime(df.index)
And then plot it
df.plot()
If this solves the issue, it is because DataFrame.plot() uses the DataFrame's index as the x axis by default.
If your DataFrame had a DatetimeIndex and 2 other columns (say ['Currency','pct_change_1']) and you wanted to plot just one of them (maybe pct_change_1) you could:
# single [ ] transforms the column into series, double [[ ]] into DataFrame
df[['pct_change_1']].plot(figsize=(15,8))
Here figsize=(15,8) sets the size of the plot (width, height).
Here is a simple solution:
my_dict = {'Date':['2013-09-10', '2013-09-11', '2013-09-12', '2013-09-13', '2013-09-16', '2013-09-17', '2013-09-18',
'2013-09-19'], 'ClosingPrice': [ 64.7010, 61.1784, 61.8298, 60.8108, 58.8776, 59.5577, 60.7821, 61.7788]}
df = pd.DataFrame(my_dict)
df.set_index('Date', inplace=True)
df.plot()

Python3 - Pandas Resample function

This is my code:
import pandas as pd
data = pd.read_csv("temp.data",sep=';')
data['Date'] = pd.to_datetime(data['Date']+' '+data['Time'])
del data['Time']
data.rename(columns={'Date':'TimeStamp'}, inplace=True)
data = data.reset_index()
data['TimeStamp'] = pd.to_datetime(data['TimeStamp'], format='%d/%m/%Y %H:%M:%S')
data['num'] = pd.to_numeric(data['num'])
data = data.resample('1M')
Error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Sample data (Original data has 200 thousand rows, with no missing values):
17/12/2006;01:58:00;3.600
17/12/2006;01:59:00;2.098
17/12/2006;02:00:00;1.334
17/12/2006;02:01:00;4.362
17/12/2006;02:02:00;1.258
17/12/2006;02:03:00;2.448
17/12/2006;02:04:00;5.426
17/12/2006;02:05:00;9.704
As mentioned in the error, when trying to resample, you have a RangeIndex, which is just consecutive integers, while you need one that represents points in time.
Adding
data.set_index('TimeStamp', inplace=True)
before you resample will set your 'TimeStamp' column as the index. You can also do away with
data = data.reset_index()
unless you want an extra column labeled 'index' floating around.
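Put together, the fix can be sketched on the sample rows from the question. The column names Date, Time and num are taken from the question's code, and assigning them via names= is an assumption because the sample shows no header row. The sketch resamples to daily means to stay independent of pandas version; the month-end alias is '1M' in the question and is spelled 'ME' in newer pandas releases.

```python
import io
import pandas as pd

raw = """17/12/2006;01:58:00;3.600
17/12/2006;01:59:00;2.098
17/12/2006;02:00:00;1.334
17/12/2006;02:01:00;4.362
17/12/2006;02:02:00;1.258
17/12/2006;02:03:00;2.448
17/12/2006;02:04:00;5.426
17/12/2006;02:05:00;9.704"""

data = pd.read_csv(io.StringIO(raw), sep=';', header=None,
                   names=['Date', 'Time', 'num'])

# Build one timestamp column, parsing day-first dates with an explicit format.
data['TimeStamp'] = pd.to_datetime(data['Date'] + ' ' + data['Time'],
                                   format='%d/%m/%Y %H:%M:%S')

# resample() needs a DatetimeIndex, so make the timestamps the index.
data = data.drop(columns=['Date', 'Time']).set_index('TimeStamp')

daily_mean = data.resample('D').mean()
print(daily_mean)
```

Note that resample() on its own returns a resampler object; an aggregation such as .mean() or .sum() is what produces the resampled dataframe.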
