I need to run a time-series model using StatsModels, and it requires my indices to be dates. However, my dates are currently all strings. Is there a quick way to convert them into the format that statsmodels time-series models expect?
My date strings currently look like the following:
1/8/2015
1/15/2015
1/22/2015
1/29/2015
2/5/2015
I've found a way to solve it using the following code:
df.index = pd.to_datetime(df.index, format='%m/%d/%Y', errors='ignore')
After this, I'm able to run the time-series modules under StatsModels. (Note that errors='ignore' is optional here, and is deprecated in recent pandas versions.)
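For reference, here is a minimal, self-contained sketch of that conversion on a toy frame (the column name and values are made up for illustration):

```python
import pandas as pd

# toy frame whose index holds the date strings
df = pd.DataFrame({'value': [10, 20, 30]},
                  index=['1/8/2015', '1/15/2015', '1/22/2015'])

# convert the string index to a DatetimeIndex
df.index = pd.to_datetime(df.index, format='%m/%d/%Y')
print(df.index.dtype)  # datetime64[ns]
```

Once the index is a DatetimeIndex, statsmodels can infer date-based frequencies from it.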
You can use the datetime module to convert those dates:
Code:
import datetime as dt

def make_date(date_string):
    m, d, y = tuple(int(x) for x in date_string.split('/'))
    return dt.date(year=y, month=m, day=d)

for my_date in my_dates:
    print(make_date(my_date))
Test Data:
my_dates = """
1/8/2015
1/15/2015
1/22/2015
1/29/2015
2/5/2015
""".split('\n')[1:-1]
Related
I am calling some financial data from an API which is storing the time values as (I think) Unix timestamps in milliseconds, e.g. 1645804609719.
I cannot seem to convert the entire column into a usable date. I can do it for a single value using the code below, so I know the approach works, but I have thousands of rows with this problem and thought pandas would offer an easier way to update all the values at once.
from datetime import datetime
tx = int('1645804609719')/1000
print(datetime.utcfromtimestamp(tx).strftime('%Y-%m-%d %H:%M:%S'))
Any help would be greatly appreciated.
Simply use pandas.DataFrame.apply:
df['date'] = df.date.apply(lambda x: datetime.utcfromtimestamp(int(x)/1000).strftime('%Y-%m-%d %H:%M:%S'))
Another way to do it is by using pd.to_datetime as recommended by Panagiotos in the comments:
df['date'] = pd.to_datetime(df['date'],unit='ms')
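As a quick sanity check, the following sketch (with made-up sample values) shows the unit='ms' conversion end to end; the column is converted to numeric first since the raw values arrive as strings:

```python
import pandas as pd

df = pd.DataFrame({'date': ['1584199972000', '1645804609719']})

# strings -> integers -> datetimes, interpreting the numbers as ms since the epoch
df['date'] = pd.to_datetime(pd.to_numeric(df['date']), unit='ms')
print(df['date'].iloc[0])  # 2020-03-14 15:32:52
```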
You can use "to_numeric" to convert the column to integers, "div" to divide it by 1000, and finally a loop over the dataframe column with datetime to get the format you want.
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'date': ['1584199972000', '1645804609719'], 'values': [30, 40]})
df['date'] = pd.to_numeric(df['date']).div(1000)
for i in range(len(df)):
    df.iloc[i, 0] = datetime.utcfromtimestamp(df.iloc[i, 0]).strftime('%Y-%m-%d %H:%M:%S')
print(df)
print(df)
Output:
date values
0 2020-03-14 15:32:52 30
1 2022-02-25 15:56:49 40
I have a column of dates in the following format:
Jan-85
Apr-99
Nov-01
Feb-65
Apr-57
Dec-19
I want to convert this to a pandas datetime object.
The following syntax works to convert them:
pd.to_datetime(temp, format='%b-%y')
where temp is the pd.Series object of dates. The glaring issue here of course is that dates that are prior to 1970 are being wrongly converted to 20xx.
I tried updating the function call with the following parameter:
pd.to_datetime(temp, format='%b-%y', origin='1950-01-01')
However, I am getting the error:
Name: temp, Length: 42537, dtype: object' is not compatible with origin='1950-01-01'; it must be numeric with a unit specified
I tried specifying a unit as it said, but I got a different error citing that the unit cannot be specified alongside a format.
Any ideas how to fix this?
Using @DudeWah's logic, but improving upon the code:
def days_of_future_past(date, chk_y=pd.Timestamp.today().year):
    return date.replace(year=date.year - 100) if date.year > chk_y else date

temp = pd.to_datetime(temp, format='%b-%y').map(days_of_future_past)
Output:
>>> temp
0 1985-01-01
1 1999-04-01
2 2001-11-01
3 1965-02-01
4 1957-04-01
5 2019-12-01
6 1965-05-01
Name: date, dtype: datetime64[ns]
Gonna go ahead and answer my own question so others can use this solution if they come across this same issue. Not the greatest, but it gets the job done. It should work until 2069, so hopefully pandas will have a better solution to this by then lol
Perhaps someone else will post a better solution.
def wrong_date_preprocess(data):
    """Correct date issues with pre-1970 dates with whacky mon-yy format."""
    df1 = data.copy()
    dates = df1['date_column_of_interest']
    # use particular datetime format with data; ex: jan-91
    dates = pd.to_datetime(dates, format='%b-%y')
    # look at wrongly defined python dates (pre 1970) and get indices
    date_dummy = dates[dates > pd.Timestamp.today().floor('D')]
    idx = list(date_dummy.index)
    # fix wrong dates by offsetting 100 years back dates that defaulted to > 2069
    dummy2 = date_dummy.apply(lambda x: x.replace(year=x.year - 100)).to_list()
    dates.loc[idx] = dummy2
    df1['date_column_of_interest'] = dates
    return df1
I am currently writing a "Split - Apply - Combine" pipeline for my data analysis, which also involves dates. Here's some sample data:
In [1]:
import pandas as pd
import numpy as np
import datetime as dt
startdate = np.datetime64("2018-01-01")
randdates = np.random.randint(1, 365, 100) + startdate
df = pd.DataFrame({'Type': np.random.choice(['A', 'B', 'C'], 100),
                   'Metric': np.random.rand(100),
                   'Date': randdates})
df.head()
Out[1]:
Type Metric Date
0 A 0.442970 2018-08-02
1 A 0.611648 2018-02-11
2 B 0.202763 2018-03-16
3 A 0.295577 2018-01-09
4 A 0.895391 2018-11-11
Now I want to aggregate by 'Type' and get summary statistics for the respective variables. This is easy for numerical variables like 'Metric':
df.groupby('Type')['Metric'].agg(('mean', 'std'))
For datetime objects however, calculating a mean, standard deviation, or other statistics doesn't really make sense and throws an error. The context I need this operation for, is that I am modelling a Date based on some distance metric. When I repeat this modelling with random sampling (monte-carlo simulation), I later want to reassign a mean and confidence interval to the modeled dates.
So my Question is: What useful statistics can be built with datetime data? How do you represent the statistical distribution of modelled dates? And how do you implement the aggregation operation?
My Ideal output would be to get a Date_mean and Date_stdev column representing a range for my modeled dates.
You can use (Unix) timestamps.
Epoch time, also known as a Unix timestamp, is the number of seconds (not milliseconds!) that have elapsed since January 1, 1970 at 00:00:00 GMT (1970-01-01 00:00:00 GMT).
You can convert all your dates to timestamps like this:
import time
import datetime
d = "2018-08-02"
time.mktime(datetime.datetime.strptime(d, "%Y-%m-%d").timetuple()) #1533160800
And from there you can calculate what you need.
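Once everything is a plain number of seconds, ordinary arithmetic applies. A minimal sketch (the sample dates are made up; note that time.mktime interprets the strings in local time, so UTC-based variants may be preferable if the timezone matters):

```python
import time
import datetime

dates = ["2018-08-02", "2018-02-11", "2018-03-16"]

# parse each date string and convert to a Unix timestamp (seconds)
ts = [time.mktime(datetime.datetime.strptime(d, "%Y-%m-%d").timetuple())
      for d in dates]

# average the timestamps, then convert the mean back into a date
mean_ts = sum(ts) / len(ts)
print(datetime.datetime.fromtimestamp(mean_ts).date())
```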
You can compute min, max, and mean using the built-in operations of the datetime:
date = dt.datetime.date  # unbound method: date(ts) returns just the date part of a timestamp
df.groupby('Type')['Date'].agg(lambda x: (date(x.mean()), date(x.min()), date(x.max())))
Out[490]:
Type
A (2018-06-10, 2018-01-11, 2018-11-08)
B (2018-05-20, 2018-01-20, 2018-12-31)
C (2018-06-22, 2018-01-04, 2018-12-05)
Name: Date, dtype: object
I used date(x) to make sure the output fits here, it's not really needed.
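If the goal from the question is explicitly a Date_mean and Date_stdev column, one possible sketch is to take the mean directly and compute the spread on the underlying nanosecond integers (the sample data and column names here are assumptions):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Type': ['A', 'A', 'B', 'B'],
                   'Date': pd.to_datetime(['2018-01-01', '2018-03-01',
                                           '2018-02-01', '2018-06-01'])})

# mean works natively on datetimes; for the spread, view the datetimes as
# int64 nanoseconds, take the (population) std, and express it as a timedelta
stats = df.groupby('Type')['Date'].agg(
    Date_mean='mean',
    Date_stdev=lambda x: pd.to_timedelta(np.std(x.astype('int64')), unit='ns'))
print(stats)
```

This gives each group a central date plus a timedelta-valued spread, which maps naturally onto a confidence interval around the modelled dates.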
I'm looking for a way to convert dates given in the format YYYYmmdd to an np.array with dtype='datetime64'. The dates are stored in another np.array but with dtype='float64'.
I am looking for a way to achieve this by avoiding Pandas!
I already tried something similar as suggested in this answer but the author states that "[...] if (the date format) was in ISO 8601 you could parse it directly using numpy, [...]".
As the date format in my case is YYYYmmdd which IS(?) ISO 8601 it should be somehow possible to parse it directly using numpy. But I don't know how as I am a total beginner in python and coding in general.
I really try to avoid Pandas because I don't want to bloat my script when there is a way to get the task done by using the modules I am already using. I also read it would decrease the speed here.
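One caveat worth knowing: numpy's datetime64 parser only accepts the hyphenated (extended) ISO form, so '19700101' needs hyphens inserted before numpy will parse it. A pedestrian string-based sketch, assuming the floats are whole YYYYmmdd numbers:

```python
import numpy as np

dates = np.array([19700101., 19840404.])

# rebuild each number as 'YYYY-MM-DD' and let numpy parse it
iso = np.array([f"{s[:4]}-{s[4:6]}-{s[6:]}"
                for s in dates.astype(int).astype('U8')])
result = iso.astype('datetime64[D]')
print(result)  # ['1970-01-01' '1984-04-04']
```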
If no one else comes up with something more built-in, here is a pedestrian method:
>>> dates
array([19700101., 19700102., 19700103., 19700104., 19700105., 19700106.,
19700107., 19700108., 19700109., 19700110., 19700111., 19700112.,
19700113., 19700114.])
>>> y, m, d = dates.astype(int) // np.c_[[10000, 100, 1]] % np.c_[[10000, 100, 100]]
>>> y.astype('U4').astype('M8') + (m-1).astype('m8[M]') + (d-1).astype('m8[D]')
array(['1970-01-01', '1970-01-02', '1970-01-03', '1970-01-04',
'1970-01-05', '1970-01-06', '1970-01-07', '1970-01-08',
'1970-01-09', '1970-01-10', '1970-01-11', '1970-01-12',
'1970-01-13', '1970-01-14'], dtype='datetime64[D]')
You can go via the python datetime module.
from datetime import datetime
import numpy as np
datestrings = np.array(["18930201", "19840404"])
dtarray = np.array([datetime.strptime(d, "%Y%m%d") for d in datestrings], dtype="datetime64[D]")
print(dtarray)
# out: ['1893-02-01' '1984-04-04'] datetime64[D]
Since the real question seems to be how to get the given strings into the matplotlib datetime format:
from datetime import datetime
import numpy as np
from matplotlib import dates as mdates
datestrings = np.array(["18930201", "19840404"])
mpldates = mdates.datestr2num(datestrings)
print(mpldates)
# out: [691071. 724370.]
I've recently started coding with Python, and I'm struggling to calculate the number of years between the current date and a given date.
I would like to calculate the number of years for each datetime column in my dataframe.
I tried this but it's not working:
def Number_of_years(d1, d2):
    if d1 is not None:
        return relativedelta(d2, d1).years

for col in df.select_dtypes(include=['datetime64[ns]']):
    df[col] = Number_of_years(df[col], date.today())
Can anyone help me find a solution to this?
I see that the format of dates is day/month/year.
Given that this format is the same for all the cells, you can parse the date using the datetime module like so:
from datetime import datetime  # import module
import numpy as np

def numberOfYears(element):
    # parse the date string according to the fixed format
    date = datetime.strptime(element, '%d/%m/%Y')
    # return the difference in the years
    return datetime.today().year - date.year

# make things more interesting by vectorizing this function
function = np.vectorize(numberOfYears)

# This returns a numpy array containing the difference between years.
# Call this for each column, and you should be good.
difference = function(df.Date_creation)
Your code is basically right, but you're operating over a pandas Series, so you can't just call relativedelta on it directly:
from datetime import date
from dateutil.relativedelta import relativedelta

def number_of_years(d1, d2):
    return relativedelta(d2, d1).years

for col in df.select_dtypes(include=['datetime64[ns]']):
    df[col] = df[col].apply(lambda d: number_of_years(d, date.today()))
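Alternatively, the per-row relativedelta call can be avoided entirely with a vectorized sketch (a fixed reference date is used here so the result is reproducible; substitute pd.Timestamp.today() for the real use case, and note the sample dates are made up):

```python
import pandas as pd

df = pd.DataFrame({'d': pd.to_datetime(['15/06/2000', '01/01/2010'],
                                       format='%d/%m/%Y')})
ref = pd.Timestamp('2024-06-01')  # stand-in for "today"

# complete years elapsed: subtract one if the anniversary hasn't occurred yet
before_anniv = (df['d'].dt.month > ref.month) | \
               ((df['d'].dt.month == ref.month) & (df['d'].dt.day > ref.day))
df['years'] = ref.year - df['d'].dt.year - before_anniv.astype(int)
print(df['years'].tolist())  # [23, 14]
```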