I am developing a machine learning model to do some time series forecasting. I wrote a general function to create a time series for my data. When I wrote it, the pandas DatetimeIndex constructor accepted different parameters and my function ran properly. That constructor has since changed, and I am not sure how to rewrite the call. Can someone please help?
Here is the full time series function I wrote:
def make_time_series(mean_power_df, years, freq='D', start_idx=4):
    '''Creates as many time series as there are complete years. This code
    accounts for the leap year, 2012.
    :param mean_power_df: A dataframe of bitcoin weighted price, averaged by day.
        This dataframe should also be indexed by a datetime.
    :param years: A list of years to make time series out of, ex. ['2013', '2014'].
    :param freq: The frequency of data recording (D = daily)
    :param start_idx: The starting dataframe index of the first point in the first time series.
        The default, 4, points to the first day of the first year in years.
    :return: A list of pd.Series(), time series data.
    '''
    # store time series
    time_series = []
    # store leap year in this dataset
    leap = '2012'
    # create time series for each year in years
    for i in range(len(years)):
        year = years[i]
        if(year == leap):
            end_idx = start_idx+366
        else:
            end_idx = start_idx+365
        # create start and end datetimes
        t_start = year + '-01-01' # Jan 1st of each year = t_start
        t_end = year + '-12-31'   # Dec 31st = t_end
        # get the year's worth of daily data
        data = mean_power_df[start_idx:end_idx]
        # create time series for the year
        index = pd.DatetimeIndex(start=t_start, end=t_end, freq=freq) ## this is the line causing problems; it was based on the previous constructor signature
        time_series.append(pd.Series(data=data, index=index))
        start_idx = end_idx
    # return list of time series
    return time_series
When I call this function the following way:
group = data.groupby('date') #the data was read and processed...
Daily_Price_mean = group['Weighted_Price'].mean()
full_years = ['2012', '2013', '2014']
freq='D' # daily recordings
# make time series
time_series = make_time_series(Daily_Price_mean, full_years, freq=freq)
I get the error: TypeError: __new__() got an unexpected keyword argument 'start'
Can someone please let me know how I can fix my function?
Thank you
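Note: as the updated version of this function further down shows, one fix is to build the index with pd.date_range, which still accepts start, end, and freq, instead of the DatetimeIndex constructor (which no longer does). A minimal sketch of the replacement lines, using the same t_start, t_end, and freq as above:

# pd.date_range builds the same daily index that DatetimeIndex(start=..., end=..., freq=...) used to
index = pd.date_range(start=t_start, end=t_end, freq=freq)
time_series.append(pd.Series(data=data, index=index))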
Related
I want to make a time series with temperature data from 1850 to 2014. The issue is that when I plot the time series, the x-axis starts at 0, which corresponds to day 1 of January 1850, and stops at day 60,230, which corresponds to 31 December 2014.
I tried to write a loop to create a new list with the time in month-year format, and then to plot this new list against my initial temperature list, but it didn't work.
This is the kind of loop that I tested:
days = list(range(1,365+1))
years = []
y = 1850
years.append(y)
while y < 2015:
    for i in days:
        years.append(y+i)
    y = y+1
del years[-1]
dsetyears = Dataset(years)
I also tried the datetime module, but that didn't work either (maybe that tool is better, since it would take leap years into account...).
day_number = "0"
year = "1850"
res = datetime.strptime(year + "-" + day_number, "%Y-%j").strftime("%m-%d-%Y")
If anyone has a clue or a lead I can look into, I'm interested.
Thanks in advance!
You can achieve that using the datetime module. Let's declare the starting and ending dates.
import datetime
dates = []
starting_date = datetime.datetime(1850, 1, 1)
ending_date = datetime.datetime(2014, 12, 31)
Then we can create a while loop that runs as long as the ending date is greater than or equal to the current date, adding one day with timedelta on every iteration. Before incrementing, we append the formatted date as a string to the dates list.
while starting_date <= ending_date:
    dates.append(starting_date.strftime("%m-%d-%Y"))
    starting_date += datetime.timedelta(days=1)
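If the end goal is just a date axis for plotting, another option (a short sketch; temperatures stands in for your existing list of values) is to let pandas build the dates, which also handles leap years. If your dataset actually uses a 365-day model calendar, you would need to account for that separately.

import pandas as pd

# daily dates from 1 Jan 1850 through 31 Dec 2014, leap days included
dates = pd.date_range(start="1850-01-01", end="2014-12-31", freq="D")
# temperatures is your existing list of daily values, one value per date
# series = pd.Series(temperatures, index=dates)
# series.plot()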
I'm trying to group an xarray.Dataset object into a custom 5-month period spanning from October-January with an annual frequency. This is complicated because the period crosses New Year.
I've been trying to use the approach
wb_start = temperature.sel(time=temperature.time.dt.month.isin([10,11,12,1]))
wb_start1 = wb_start.groupby('time.year')
But this predictably groups January with its own calendar year, rather than with the preceding October-December (i.e. year + 1). Any help would be appreciated!
I fixed this in a somewhat clunky albeit effective way by adding a year to the October-December months. My method essentially moves the months 10, 11, 12 up one year while leaving the January data in place, and then does a groupby(year) on the reindexed time data.
import pandas as pd
from dateutil.relativedelta import relativedelta

wb_start = temperature.sel(time=temperature.time.dt.month.isin([10,11,12,1]))
# convert cftime to datetime
datetimeindex = wb_start.indexes['time'].to_datetimeindex()
wb_start['time'] = pd.to_datetime(datetimeindex)
# add custom group-by-year functionality
custom_year = wb_start['time'].dt.year
# convert time type to pd.Timestamp
time1 = [pd.Timestamp(i) for i in custom_year['time'].values]
# add a year to the Oct-Dec Timestamps (relativedelta does not work on np.datetime64)
time2 = [i + relativedelta(years=1) if i.month >= 10 else i for i in time1]
wb_start['time'] = time2
# group by the new time index
wb_start1 = wb_start.groupby('time.year')
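With the October-December timestamps shifted forward, each group now holds a full October-January season labeled by the year of its January, so a seasonal aggregate is just a reduction over each group (a sketch; mean is only an example statistic):

# e.g. the mean of each Oct-Jan season, labeled by the year of its January
seasonal_mean = wb_start1.mean('time')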
I have been trying to do some time series prediction using AWS SageMaker's DeepAR. DeepAR requires the data as one-line JSON inputs. I keep getting an "Error in parsing JSON" error, and I think it's because my "target" has NaN values in it. Does anyone know how I could properly preprocess the data or get rid of the NaNs so that I can get rid of this error? Here is how I preprocessed the initial data:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
import pandas as pd
data = pd.read_csv('./data/bitstampUSD_1-min_data_2012-01-01_to_2020-04-22.csv')
data.isnull().values.any()
from datetime import datetime
data.dropna(subset = ["Weighted_Price"], inplace=True)
data.reset_index(drop=True, inplace=True) ##Too many null values so wanted to drop it and reindex
data['date'] = pd.to_datetime(data['Timestamp'],unit='s').dt.date
data = data.drop(['Timestamp'], axis=1)
daily_price = data['Weighted_Price'].copy()
print(daily_price.shape)
group = data.groupby('date')
Daily_Price_mean = group['Weighted_Price'].mean()
Then I created a function to create a time series:
def make_time_series(mean_power_df, years, freq='D', start_idx=4):
    '''Creates as many time series as there are complete years. This code
    accounts for the leap year, 2012.
    :param mean_power_df: A dataframe of bitcoin weighted price, averaged by day.
        This dataframe should also be indexed by a datetime.
    :param years: A list of years to make time series out of, ex. ['2013', '2014'].
    :param freq: The frequency of data recording (D = daily)
    :param start_idx: The starting dataframe index of the first point in the first time series.
        The default, 4, points to the first day of the first year in years.
    :return: A list of pd.Series(), time series data.
    '''
    # store time series
    time_series = []
    # store leap year in this dataset
    leap = '2012'
    # create time series for each year in years
    for i in range(len(years)):
        year = years[i]
        if(year == leap):
            end_idx = start_idx+366
        else:
            end_idx = start_idx+365
        # create start and end datetimes
        t_start = year + '-01-01' # Jan 1st of each year = t_start
        t_end = year + '-12-31'   # Dec 31st = t_end
        # get the year's worth of daily data
        data = mean_power_df[start_idx:end_idx]
        # create time series for the year
        index = pd.date_range(start=t_start, end=t_end, freq=freq)
        time_series.append(pd.Series(data=data, index=index))
        start_idx = end_idx
    # return list of time series
    return time_series
full_years = ['2012', '2013', '2014']
freq='D' # daily recordings
# make time series
time_series = make_time_series(Daily_Price_mean, full_years, freq=freq)
def create_training_series(complete_time_series, prediction_length):
    '''Given a complete list of time series data, create training time series.
    :param complete_time_series: A list of all complete time series.
    :param prediction_length: The number of points we want to predict.
    :return: A list of training time series.
    '''
    # get training series
    time_series_training = []
    for ts in complete_time_series:
        # truncate the trailing prediction_length points
        time_series_training.append(ts[:-prediction_length])
    return time_series_training
# set prediction length
prediction_length = 30 # 30 days ~ a month
time_series_training = create_training_series(time_series, prediction_length)
I then wrote a function to convert a series to JSON this way:
def series_to_json_obj(ts):
    '''Returns a dictionary of values in DeepAR, JSON format.
    :param ts: A single time series.
    :return: A dictionary of values with "start" and "target" keys.
    '''
    # get start time and target from the time series, ts
    json_obj = {"start": str(ts.index[0]), "target": list(ts)}
    return json_obj
ts = time_series[0]
json_obj = series_to_json_obj(ts)
tst = time_series_training[0]
json_objt = series_to_json_obj(tst)
print(json_obj)
print(json_objt)
When I start training, the training job fails with "Error in parsing JSON". When I looked at the json_obj output, I saw that it was in this form:
{'start': '2012-01-01 00:00:00', 'target': [nan, nan, nan, 5.208159313655556, 6.2841271510761905....]}. I think the nan values are causing this to fail, but I don't know how to get rid of them, especially since I dropped all NaN values at the beginning. Can someone please help?
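Most likely the first few days of 2012 have no averaged price left after the dropna, so when the data is aligned against the full Jan 1 - Dec 31 index those positions come back as NaN even though the raw NaN rows were dropped. A simple option (a sketch; clean_series is a hypothetical helper, and forward-fill is just one possible treatment) is to trim and fill each series before converting it to the JSON object, so that the "start" field then points at the first real observation:

# trim leading/trailing NaNs and forward-fill any interior gaps before serializing
def clean_series(ts):
    ts = ts.loc[ts.first_valid_index():ts.last_valid_index()]
    return ts.ffill()

time_series_clean = [clean_series(ts) for ts in time_series]
json_obj = series_to_json_obj(time_series_clean[0])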
I have the following function that is called multiple times:
def next_valid_date(self, date_object):
    """Returns next valid date based on valid_dates.
    If argument date_object is valid, the original date_object will be returned."""
    while date_object not in self.valid_dates.tolist():
        date_object += datetime.timedelta(days=1)
    return date_object
For reference, valid_dates is a numpy array that holds all recorded dates for a given stock pulled from yfinance. In the case of the example I've been working with, NVDA (nvidia stock), the valid_dates array has 5395 elements (dates).
I have another function, and its purpose is to create a series of start dates and end dates. In this example self.interval is a timedelta with a length of 365 days, and self.sub_interval is a timedelta with a length of 1 day:
def get_date_range_series(self):
    """Retrieves a series containing lists of start dates and corresponding end dates over a given interval."""
    interval_start = self.valid_dates[0]
    interval_end = self.next_valid_date(self.valid_dates[0] + self.interval)
    dates = [[interval_start, interval_end]]
    while interval_end < datetime.date.today():
        interval_start = self.next_valid_date(interval_start + self.sub_interval)
        interval_end = self.next_valid_date(interval_start + self.interval)
        dates.append([interval_start, interval_end])
    return pd.Series(dates)
My main issue is that it takes a lengthy period of time to execute (about 2 minutes), and I'm sure there's a far better way of doing this... Any thoughts?
I just created an alternate next_valid_date() method that uses .loc[] on a pandas dataframe (the dataframe's index is the list of valid dates, which is where valid_dates comes from in the first place):
def next_valid_date_alt(self, date_object):
    while True:
        try:
            self.stock_yf_df.loc[date_object]
            break
        except KeyError:
            date_object += datetime.timedelta(days=1)
    return date_object
Checking for the next valid date when 6/28/20 is passed in (which isn't valid: it is a weekend, and the stock market is closed) resulted in the original method taking 0.0099754 seconds to complete and the alternate method taking 0.0019944 seconds to complete.
What this means is that get_date_range_series() takes just over 1 second to complete when using the next_valid_date_alt() method as opposed to 70 seconds when using the next_valid_date() method. I'll definitely look into the other optimizations mentioned as well. I appreciate everyone else's responses!
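Since valid_dates is already sorted, another option is a binary search with np.searchsorted rather than probing one day at a time. This is only a sketch against an assumed self.valid_dates numpy array of comparable date values (next_valid_date_sorted is a made-up name), but it avoids both the list scan and the repeated .loc lookups:

import numpy as np

def next_valid_date_sorted(self, date_object):
    # index of the first valid date that is >= date_object
    idx = np.searchsorted(self.valid_dates, date_object)
    if idx == len(self.valid_dates):
        raise ValueError("date_object falls after the last valid date")
    return self.valid_dates[idx]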
I have financial trade data (timestamped with the trade time, so there are duplicate times and the datetimes are irregularly spaced). Basically I have just a datetime column and a price column in a pandas dataframe, and I've calculated returns, but I want to linearly interpolate the data so that I can get an estimate of prices every second, minute, day, etc...
It seems the best way to do this is to treat the beginning of a Tuesday as occurring just after the end of Monday, essentially modding out by the time between days. Does pandas provide an easy way to do this? I've searched the documentation and found BDay, but that doesn't seem to do what I want.
Edit: Here's a sample of my code:
df = pd.read_csv(filePath, usecols=[0,4]) # column 0 is date_time and column 4 is price
df.date_time = pd.to_datetime(df.date_time,format = '%m-%d-%Y %H:%M:%S.%f')
def get_returns(df):
    return np.log(df.Price.shift(1) / df.Price)
But my issue is that this is trade data: I have every trade that occurs for a given stock over some time period, trading happens only during the trading day (9:30 am - 4 pm), and the data is timestamped. I can take the price at which every trade happens and make a price series, but when I calculate kurtosis and other stylized facts, I get very strange results, because these sorts of statistics are usually run on evenly spaced time series data.
What I started to do was write code to interpolate my data linearly so that I could get the price every 10 seconds, minute, 10 minutes, hour, day, etc. However, with business days, weekends, holidays, and all the time when trading can't happen, I want to make Python think that the only time which exists is during a business day, so that my real-world times still match up with the correct datetimes, but without needing a price stamp for all the times when trading is closed.
def lin_int_tseries(series, timeChange):
    tDelta = datetime.timedelta(seconds=timeChange)
    data_times = series['date_time']
    new_series = []
    sample_times = []
    sample_times.append(data_times[0])
    while max(sample_times) < max(data_times):
        sample_times.append(sample_times[-1] + tDelta)
    for position, time in enumerate(sample_times):
        try:
            ind = data_times.index(time)
            new_series.append(series[ind])
        except:
            t_next = getnextTime(time, data_times)  # get next largest timestamp in data
            t_prev = getprevTime(time, data_times)  # get next smallest timestamp in data
            ind_next = data_times.index(t_next)     # index of next largest timestamp
            ind_prev = data_times.index(t_prev)     # index of next smallest timestamp
            p_next = series[ind_next][1]            # price at next timestamp
            p_prev = series[ind_prev][1]            # price at prev timestamp
            omega = (float(time) - t_prev)/(t_next - t_prev)  # linear interpolation weight
            p_interp = (1 - omega)*p_prev + omega*p_next
            new_series.append([time, p_interp])
    return new_series
Sorry if it's still unclear. I just want to find some way to stitch the end of one trading day to the beginning of the next trading day, while not losing the actual datetime information.
You should use pandas resample:
df = df.set_index('date_time').resample("D").mean()
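As a fuller sketch of that idea (the 10-second grid and the trading-hours window are just examples, not part of the original answer), you can resample to a finer grid, interpolate between trades, and then keep only the trading session:

# starting again from the raw df with its date_time and Price columns
df = df.set_index('date_time').sort_index()
# resample to a regular 10-second grid and linearly interpolate between trades
prices_10s = df['Price'].resample('10S').mean().interpolate(method='linear')
# keep only regular trading hours, dropping nights, weekends and holidays
prices_10s = prices_10s.between_time('09:30', '16:00')

Note this drops the closed periods rather than stitching one session onto the next, so the timestamps stay real; if you truly need the sessions glued end to end, you would still have to remap the index afterwards.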