Python3 - Pandas Resample function

This is my code:
import pandas as pd
data = pd.read_csv("temp.data",sep=';')
data['Date'] = pd.to_datetime(data['Date']+' '+data['Time'])
del data['Time']
data.rename(columns={'Date':'TimeStamp'}, inplace=True)
data = data.reset_index()
data['TimeStamp'] = pd.to_datetime(data['TimeStamp'], format='%d/%m/%Y %H:%M:%S')
data['num'] = pd.to_numeric(data['num'])
data = data.resample('1M')
Error:
"TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'".
Sample data (Original data has 200 thousand rows, with no missing values):
17/12/2006;01:58:00;3.600
17/12/2006;01:59:00;2.098
17/12/2006;02:00:00;1.334
17/12/2006;02:01:00;4.362
17/12/2006;02:02:00;1.258
17/12/2006;02:03:00;2.448
17/12/2006;02:04:00;5.426
17/12/2006;02:05:00;9.704

As the error message says, resample requires a DatetimeIndex, TimedeltaIndex, or PeriodIndex, but your DataFrame currently has a RangeIndex, which is just consecutive integers rather than points in time.
Adding
data.set_index('TimeStamp', inplace=True)
before you resample will set your 'TimeStamp' column as the index. You can also do away with
data = data.reset_index()
unless you want an extra column labeled 'index' floating around.
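Putting the fix together, here is a minimal sketch (it assumes temp.data has a Date;Time;num header row, as the original code implies, and uses mean() as the monthly aggregation purely as an example):
import pandas as pd
data = pd.read_csv("temp.data", sep=';')
# build a single timestamp column from the separate date and time columns
data['TimeStamp'] = pd.to_datetime(data['Date'] + ' ' + data['Time'], format='%d/%m/%Y %H:%M:%S')
data = data.drop(columns=['Date', 'Time'])
data['num'] = pd.to_numeric(data['num'])
# resample only works on a time-based index, so make 'TimeStamp' the index first
data.set_index('TimeStamp', inplace=True)
# resample('1M') alone returns a Resampler object; apply an aggregation to get a DataFrame back
monthly = data.resample('1M').mean()
print(monthly)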

Related

selecting a df row by month formatted with (lambda x: datetime.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z'))

I'm having some issues with geopandas and pandas datetime objects; I kept getting the error
pandas Invalid field type <class 'pandas._libs.tslibs.timedeltas.Timedelta'>
when I tried to save with gpd.to_file(). Apparently this is a known issue between pandas and geopandas date types, so I used
df.DATE = df.DATE.apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z'))
to get a datetime object I could manipulate without getting the aforementioned error when I save the results. Due to that change, my selection by month
months = [4]
for month in months:
    df = df[[(pd.DatetimeIndex(df.DATE).month == month)]]
no longer works, throwing a value error.
ValueError: Item wrong length 1 instead of 108700.
I tried dropping the pd.DatetimeIndex, but this throws a Series attribute error
AttributeError: 'Series' object has no attribute 'month'
and
df = df[(df.DATE.month == month)]
gives me the same error.
I know it converted over to a datetime object because print(df.dtypes) shows DATE datetime64[ns, UTC], and
for index, row in df.iterrows():
    print(row.DATE.month)
prints the month as an integer to the terminal.
Without going back to pd.Datetime how can I fix my select statement for the month?
The statement df.DATE returns a Series object. That doesn't have a .month attribute. The dates inside the Series do, which is why row.DATE.month works. Try something like:
filter = [x.month == month for x in df.DATE]
df_filtered = df[filter]
As for your earlier attempt with pd.DatetimeIndex(df.DATE).month == month, I'm not sure exactly what you were trying to accomplish there, but a similar fix should take care of it.
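As an aside (not part of the original answer), pandas also exposes the datetime fields of a Series through the .dt accessor, so a vectorized version of the same filter could look like this sketch:
import pandas as pd
# a tiny stand-in frame; in the question df comes from a geopandas file
df = pd.DataFrame({'DATE': pd.to_datetime(['2021-03-15T00:00:00+0000', '2021-04-01T12:30:00+0000'])})
month = 4
# .dt.month yields an integer Series, so this boolean mask matches the frame's length
df_filtered = df[df.DATE.dt.month == month]
print(df_filtered)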

How do I find the structural breaks in this time series data?

I have yearly average closing values for an asset in a dataframe, and I need to find the structural breaks in the time series. I intended to do this using the statsmodels seasonal_decompose function, but I am having trouble implementing it.
Example data below
from statsmodels.tsa.seasonal import seasonal_decompose
import pandas as pd
import matplotlib.pyplot as plt
data = {'Year':['1991','1992','1993','1994','1995','1996','1997','1998','1999','2000','2001','2002','2003','2004'],'Close':[11,22,55,34,447,85,99,86,83,82,81,34,33,36]}
df = pd.DataFrame(data)
df['Year'] = df['Year'].astype(str)
sd = seasonal_decompose(df)
plt.show()
ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None
When I change the 'Year' column to date time, I get the following issue:
TypeError: float() argument must be a string or a number, not 'Timestamp'
I do not know what the issue is; I have no missing values. As a secondary question, does anybody know a more efficient method to identify structural breaks in time series data?
Thanks
The problem is that you need to set the Year column as the index after converting its string values to datetime (as the ValueError message says: a pandas object with a DatetimeIndex).
So, e.g.:
from statsmodels.tsa.seasonal import seasonal_decompose
import pandas as pd
import matplotlib.pyplot as plt
data = {'Year':['1991','1992','1993','1994','1995','1996','1997','1998','1999','2000','2001','2002','2003','2004'],'Close':[11,22,55,34,447,85,99,86,83,82,81,34,33,36]}
df = pd.DataFrame(data)
# convert the year strings to datetime and use them as the index
df['Year'] = pd.to_datetime(df['Year'])
df.set_index('Year', drop=True, inplace=True)
sd = seasonal_decompose(df)
sd.plot()
plt.show()
Plot: (decomposition figure omitted)

Error when .loc() rows with a list of dates in pandas

I have the following code:
import pandas as pd
from pandas_datareader import data as web
df = web.DataReader('^GSPC', 'yahoo')
df['pct'] = df['Close'].pct_change()
dates_list = df.index[df['pct'].gt(0.002)]
df2 = web.DataReader('^GDAXI', 'yahoo')
df2['pct2'] = df2['Close'].pct_change()
I was trying to run this:
df2.loc[dates_list, 'pct2']
But I keep getting this error:
KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported,
I am guessing this is because there are missing data for dates in dates_list. To resolve this:
idx1 = df.index
idx2 = df2.index
missing = idx2.difference(idx1)
df.drop(missing, inplace = True)
df2.drop(missing, inplace = True)
However, I am still getting the same error. I don't understand why that is.
Note that dates_list has been created from df, so it contains dates present in the index of df.
Then you read df2 and attempt to retrieve pct2 from rows on just these dates.
But there is a chance that the index of df2 does not contain all of the dates in dates_list, and that is exactly the cause of your exception.
To avoid it, retrieve only rows on dates that are actually present in the index of df2. To narrow the row specification down to these "allowed" dates, pass:
dates_list[dates_list.isin(df2.index)]
Run this alone and you will see the "allowed" dates (some dates will be eliminated).
So change the offending instruction to:
df2.loc[dates_list[dates_list.isin(df2.index)], 'pct2']
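Equivalently (a sketch, not from the original answer), the overlap of the two indexes can be computed with Index.intersection before indexing:
import pandas as pd
from pandas_datareader import data as web
df = web.DataReader('^GSPC', 'yahoo')
df['pct'] = df['Close'].pct_change()
dates_list = df.index[df['pct'].gt(0.002)]
df2 = web.DataReader('^GDAXI', 'yahoo')
df2['pct2'] = df2['Close'].pct_change()
# keep only the dates that actually exist in df2's index before passing them to .loc
common_dates = dates_list.intersection(df2.index)
result = df2.loc[common_dates, 'pct2']
print(result)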

Pandas ValueError on df.join(), then TypeError on pd.concat()

I'm sure I have a simple error here but I'm not seeing it. Maybe a fresh pair of eyes can pick it out in a minute or two. I've been working on various solutions for a while now and I need some help.
I have a Pandas dataframe and I'm attempting to add a newly created calculated column to the dataframe based on an existing column in that dataframe.
import pandas as pd
df = pd.read_csv('mydata.csv')
df.date = pd.to_datetime(df.date, format='%d.%m.%Y %H:%M:%S.%f')
# calculate Simple Moving Average with a 20 day window
sma = df_train.close.rolling(window=20).mean()
res = df_train.join(sma, on='date', how='left', lsuffix='_left', rsuffix='_right')
print(res)
ValueError: You are trying to merge on datetime64[ns] and int64
columns. If you wish to proceed you should use pd.concat
Ok, so I tried using pd.concat:
import pandas as pd
df = pd.read_csv('mydata.csv')
df.date = pd.to_datetime(df.date, format='%d.%m.%Y %H:%M:%S.%f')
# calculate Simple Moving Average with a 20-day window
sma = df_train.close.rolling(window=20).mean()
frames = [df_train, sma]
res = pd.concat(frames)
print("Printing result of concat")
print(res)
TypeError: concat() missing 1 required positional argument: 'objs'
What positional argument is needed? I can't figure this out based on the research and documentation I've seen online.
Thanks in advance.
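For reference, a minimal sketch of adding a rolling mean as a new column by direct assignment (the sma20 column name and the stand-in data are assumptions, not taken from the question):
import pandas as pd
# a small stand-in frame; the question's data comes from mydata.csv
df = pd.DataFrame({'date': pd.date_range('2021-01-01', periods=30, freq='D'),
                   'close': range(30)})
# a rolling mean computed from a column is already aligned on df's index,
# so it can be assigned directly instead of joined or concatenated
df['sma20'] = df['close'].rolling(window=20).mean()
print(df.tail())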

Pandas - Data Series - TypeError: Index must be DatetimeIndex

C is a pandas Series with shape (10000000,) and dtype datetime64[ns] (<M8[ns]). I want to create a Series that contains only one hour of C.
C.between_time('22:00:00','23:00:00')
This is the error that I get
TypeError: Index must be DatetimeIndex
How should I fix it?
Try creating a DataFrame, starting from an empty one, and give it a DatetimeIndex built from the times themselves. I created a sample of 3 dummy times in a Series:
import pandas as pd
C = pd.Series(['22:00:00','22:30:00','23:00:00'])
df = pd.DataFrame()
df['C'] = C
# set the index to the observations after converting them to a DatetimeIndex
df.index = pd.to_datetime(C)
print(df)
print(df.between_time(df['C'][0], df['C'][2]))
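Since C in the question is already a datetime64 Series, the same idea can be applied to it directly by using the timestamps as their own index (a sketch with made-up sample values):
import pandas as pd
# stand-in for the question's Series of timestamps
C = pd.Series(pd.to_datetime(['2020-01-01 21:30:00', '2020-01-01 22:15:00', '2020-01-01 23:30:00']))
# between_time requires a DatetimeIndex, so index the values by themselves
s = pd.Series(C.values, index=pd.DatetimeIndex(C))
one_hour = s.between_time('22:00:00', '23:00:00')
print(one_hour)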
