Python / Pandas - Datetime statistics. How to aggregate means of datetime columns - python

I am currently writing a "Split - Apply - Combine" pipeline for my data analysis, which also involves dates. Here's some sample data:
In [1]:
import pandas as pd
import numpy as np
import datetime as dt
startdate = np.datetime64("2018-01-01")
randdates = np.random.randint(1, 365, 100) + startdate
df = pd.DataFrame({'Type': np.random.choice(['A', 'B', 'C'], 100),
'Metric': np.random.rand(100),
'Date': randdates})
df.head()
Out[1]:
Type Metric Date
0 A 0.442970 2018-08-02
1 A 0.611648 2018-02-11
2 B 0.202763 2018-03-16
3 A 0.295577 2018-01-09
4 A 0.895391 2018-11-11
Now I want to aggregate by 'Type' and get summary statistics for the respective variables. This is easy for numerical variables like 'Metric':
df.groupby('Type')['Metric'].agg(('mean', 'std'))
For datetime objects, however, calculating a mean, standard deviation, or other statistics doesn't really make sense and throws an error. I need this operation because I am modelling a date based on some distance metric. When I repeat this modelling with random sampling (Monte Carlo simulation), I later want to assign a mean and confidence interval to the modelled dates.
So my question is: What useful statistics can be built with datetime data? How do you represent the statistical distribution of modelled dates? And how do you implement the aggregation operation?
My ideal output would be a Date_mean and a Date_stdev column representing a range for my modelled dates.

You can use timestamps (Unix)
Unix time, also known as epoch time, is the number of seconds (not milliseconds!) that have elapsed since January 1, 1970 at 00:00:00 UTC.
You can convert all your dates to timestamps like this:
import time
import datetime
d = "2018-08-02"
time.mktime(datetime.datetime.strptime(d, "%Y-%m-%d").timetuple())  # 1533160800.0 on a machine set to UTC+2; mktime interprets the date in local time
And from there you can calculate what you need.
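As a minimal sketch of "calculating what you need" from timestamps, the snippet below averages two dates using only the standard library. It assumes the dates should be read as midnight UTC (so it uses calendar.timegm instead of time.mktime, which depends on the local time zone). The helper names to_unix_seconds and mean_date and the sample values are made up for illustration.

```python
import calendar
from datetime import datetime, timezone


def to_unix_seconds(date_string):
    """Parse 'YYYY-MM-DD' as midnight UTC and return seconds since the epoch."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    return calendar.timegm(parsed.timetuple())


def mean_date(date_strings):
    """Average the timestamps and convert the result back to a UTC datetime."""
    stamps = [to_unix_seconds(s) for s in date_strings]
    mean_stamp = sum(stamps) / len(stamps)
    return datetime.fromtimestamp(mean_stamp, tz=timezone.utc)


sample_dates = ["2018-08-02", "2018-08-04"]  # illustrative values only
print(to_unix_seconds("2018-08-02"))  # 1533168000
print(mean_date(sample_dates))        # 2018-08-03 00:00:00+00:00
```

Other statistics (standard deviation, percentiles) can be computed on the list of timestamps in the same way; a spread in seconds is then best reported as a timedelta rather than as a date.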

You can compute min, max, and mean using the built-in operations of the datetime:
date = dt.datetime.date
df.groupby('Type')['Date'].agg(lambda x:(date(x.mean()), date(x.min()), date(x.max())))
Out[490]:
Type
A (2018-06-10, 2018-01-11, 2018-11-08)
B (2018-05-20, 2018-01-20, 2018-12-31)
C (2018-06-22, 2018-01-04, 2018-12-05)
Name: Date, dtype: object
I used date(x) to make sure the output fits here, it's not really needed.
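The Date_mean and Date_stdev columns asked for in the question could be built along the following lines. This is only a sketch under stated assumptions: dates are converted to a float number of days since an arbitrary reference date (here 1970-01-01), aggregated like any numeric column, and converted back. The small frame and the helper column Date_days are made up for illustration; the standard deviation is returned as a timedelta, since a spread is a duration and not a point in time.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    "Type": ["A", "A", "B", "B"],
    "Date": pd.to_datetime(["2018-01-01", "2018-01-03", "2018-03-01", "2018-03-11"]),
})

epoch = pd.Timestamp("1970-01-01")  # arbitrary reference point

# Express each date as a float number of days since the reference point.
df["Date_days"] = (df["Date"] - epoch) / pd.Timedelta(days=1)

# Aggregate the numeric representation like any other metric.
stats = df.groupby("Type")["Date_days"].agg(["mean", "std"])

# Convert back: the mean is a date, the standard deviation is a duration.
summary = pd.DataFrame({
    "Date_mean": epoch + pd.to_timedelta(stats["mean"], unit="D"),
    "Date_stdev": pd.to_timedelta(stats["std"], unit="D"),
})
print(summary)
```

A range such as Date_mean ± 2 * Date_stdev can then be formed directly, because adding a timedelta column to a datetime column is supported.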

Related

Python: Convert numeric value to date like SAS

I have a question. I have a set of numeric values that are a date, but apparently the date is wrongly formatted and coming out of SAS. For example, I have the value 5893 that is in SAS 19.02.1976 when formatted correctly. I want to achieve this in Python/PySpark. From what I've found until now, there is a function fromtimestamp.
However, when I do this, it gives a wrong date:
value = 5893
date = datetime.datetime.fromtimestamp(value)
print(date)
1970-01-01 02:38:13
Any proposals to get the correct date? Thank you! :-)
EDIT: And what would the code look like when this operation is applied to a dataframe column rather than a single variable?
The Epoch, as far as SAS is concerned, is 1st January 1960. The number you have (5893) is the number of elapsed days since that Epoch. Therefore:
from datetime import timedelta, date
print(date(1960, 1, 1) + timedelta(days=5893))
...will give you the desired result
import numpy as np
import pandas as pd
ser = pd.Series([19411.0, 19325.0, 19325.0, 19443.0, 19778.0])
ser = pd.to_timedelta(ser, unit='D') + pd.Timestamp('1960-1-1')
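An alternative sketch for the column case uses the unit and origin parameters of pd.to_datetime, which lets pandas count days from the SAS epoch directly. The names SAS_EPOCH and sas_days are made up for illustration, and the values are taken from the examples above.

```python
import pandas as pd

SAS_EPOCH = pd.Timestamp("1960-01-01")  # day 0 in SAS date values

# SAS date values: number of days elapsed since 1960-01-01.
sas_days = pd.Series([5893, 19411, 19325])

# Interpret the numbers as days counted from the SAS epoch.
dates = pd.to_datetime(sas_days, unit="D", origin=SAS_EPOCH)
print(dates)
```

The first value, 5893, comes out as 1976-02-19, matching the date stated in the question.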

date_range won't include last date of interval for custom frequency

I want to create a date vector with a given fixed spacing depending on the frequency I choose. So far, this is what I got:
import pandas as pd
import datetime as dt
from datetime import date, timedelta
def getDates(sD, eD, f):
    # creating the datetime object to later on make the date vector
    sD = dt.datetime.strptime(sD, '%m/%d/%Y')
    eD = dt.datetime.strptime(eD, '%m/%d/%Y')
    sd_t = date(sD.year,sD.month,sD.day) # start date
    ed_t = date(eD.year,eD.month,eD.day) # end date
    # we hardcode a frequency dictionary for the frequencies and spacing that
    # date vectors are going to have.
    freqDict = {'1h':'40D', '4h':'162D', '1d':'1000D'}
    dateVector = pd.date_range(sd_t, ed_t, freq = freqDict[f])
    return dateVector
As you can see, I am only interested in 3 frequencies, and the spacing between them works well. I have to work within API limitations, and the limit I set for requests is 1000. This is why I chose these custom spacings between dates: to allow as many data points as possible for each frequency within the API request limits these dates are meant for.
Unfortunately, I can't get the final date on the dateVector in some cases. If I run this function with these inputs:
getDates('01/01/2020', '01/01/2021', '4h')
I get this outcome, which is missing the final date on the array ('01/01/2021'):
0 2020-01-01
1 2020-06-11
2 2020-11-20
I thought of using the closed = parameter, but it didn't get me where I wanted.
A workaround I thought of consists of using periods instead of freq, and dynamically computing the periods according to the distance (in terms of days) between the start date and the end date. But I would like to know if I can make date_range work in my favor without having to write such a routine.
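pd.date_range only emits points that lie on the start + k * freq grid, so the end date is included only when it happens to land on that grid. One possible approach, sketched here as an assumption and not as a built-in date_range feature, is to append the end date whenever it is missing. The function name date_vector_with_end is hypothetical.

```python
import pandas as pd


def date_vector_with_end(start, end, freq):
    """Build a date_range and make sure the end date is the last element."""
    start_ts = pd.to_datetime(start, format="%m/%d/%Y")
    end_ts = pd.to_datetime(end, format="%m/%d/%Y")
    dates = pd.date_range(start_ts, end_ts, freq=freq)
    # Append the end date only if the regular grid did not land on it.
    if len(dates) == 0 or dates[-1] != end_ts:
        dates = dates.append(pd.DatetimeIndex([end_ts]))
    return dates


print(date_vector_with_end("01/01/2020", "01/01/2021", "162D"))
```

Note that the last interval is then shorter than the chosen spacing, and the resulting index no longer carries a regular frequency.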

Is there a way to display time data in Pandas in hours above 24 hours?

I am trying to format my time data to be displayed in hours:minutes:seconds (e.g. 36:30:30). The main goal is to be able to aggregate the times so that totals can be displayed in number of hours. I do not want to have totals in number of days.
My time data start as strings, in the format "HH:MM:SS". With pandas, I convert these to timedelta values using:
df["date column"] = pd.to_timedelta(df["date column"])
There is one record that is "24:00:00", but the above line of code gives that as "1 day".
Is there a way to display this time as 24:00:00?
IIUC, we can use np.timedelta64 to change your timedelta object into a numerical representation of itself.
import numpy as np
df = pd.DataFrame({'hours' : ['34:00:00','23:45:22','11:00:11'] })
hours = pd.to_timedelta(df['hours']) / np.timedelta64(1,'h')
print(hours)
0 34.000000
1 23.756111
2 11.003056
Name: hours, dtype: float64
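If the goal is an "HH:MM:SS" string whose hour part can exceed 24, one option is to format the total number of seconds by hand. The helper format_hms below is a hypothetical name, and the sketch assumes non-negative durations with whole seconds.

```python
import pandas as pd


def format_hms(td):
    """Format a timedelta as HH:MM:SS, letting hours run past 24."""
    total_seconds = int(td.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


durations = pd.to_timedelta(pd.Series(["24:00:00", "36:30:30", "11:00:11"]))

print(durations.apply(format_hms).tolist())  # ['24:00:00', '36:30:30', '11:00:11']
print(format_hms(durations.sum()))           # 71:30:41
```

Keeping the column as timedelta values and formatting only for display means sums and other aggregations still work on the underlying data.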

Change the index to dates when running time-series models

I need to run a time-series model using StatsModels, and it requires my indices to be dates. However, currently my dates are all in string form. Is there any quick way for me to convert the dates to the format satisfied by statsmodel timeseries models?
My date string is currently like the following:
1/8/2015
1/15/2015
1/22/2015
1/29/2015
2/5/2015
I've found a way to solve it by using the following code:
df.index = pd.to_datetime(df.index, format='%m/%d/%Y', errors='ignore')
After this, I'm able to run the time-series modules under StatsModels.
You can use the datetime module to convert those dates:
Code:
import datetime as dt
def make_date(date_string):
    # split 'M/D/YYYY' into its integer parts (uses the argument, not a global)
    m, d, y = tuple(int(x) for x in date_string.split('/'))
    return dt.date(year=y, month=m, day=d)
for my_date in my_dates:
    print(make_date(my_date))
Test Data:
my_dates = """
1/8/2015
1/15/2015
1/22/2015
1/29/2015
2/5/2015
""".split('\n')[1:-1]

How can I get all the dates within a week of a certain day using datetime?

I have some measurements that happened on specific days in a dictionary. It looks like
date_dictionary['YYYY-MM-DD'] = measurement.
I want to calculate the variance between the measurements within 7 days from a given date. When I convert the date strings to a datetime.datetime, the result looks like a tuple or an array, but doesn't behave like one.
Is there an easy way to generate all the dates one week from a given date? If so, how can I do that efficiently?
You can do this using timedelta. Example:
>>> from datetime import datetime,timedelta
>>> d = datetime.strptime('2015-07-22','%Y-%m-%d')
>>> for i in range(1,8):
...     print(d + timedelta(days=i))
...
2015-07-23 00:00:00
2015-07-24 00:00:00
2015-07-25 00:00:00
2015-07-26 00:00:00
2015-07-27 00:00:00
2015-07-28 00:00:00
2015-07-29 00:00:00
You do not actually need to print it, datetime object + timedelta object returns a datetime object. You can use that returned datetime object directly in your calculation.
Using datetime, to generate the 7 dates starting at (and including) a given date, you can do:
import datetime
dt = datetime.datetime(...)
week_dates = [ dt + datetime.timedelta(days=i) for i in range(7) ]
There are libraries providing nicer APIs for performing datetime/date operations, most notably pandas (though it includes much much more). See pandas.date_range.
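Putting the pieces together for the original goal, a minimal sketch of computing the variance of the measurements within a week of a given date might look like this. It assumes the dictionary keys are 'YYYY-MM-DD' strings as described in the question, that days without a measurement are simply skipped, and that the sample variance is wanted. The function name variance_within_week and the data are made up for illustration.

```python
from datetime import datetime, timedelta
from statistics import variance

# Illustrative data; keys follow the 'YYYY-MM-DD' layout from the question.
date_dictionary = {
    "2015-07-22": 1.0,
    "2015-07-24": 2.0,
    "2015-07-28": 3.0,
    "2015-08-05": 100.0,  # outside the 7-day window used below
}


def variance_within_week(measurements, start, days=7):
    """Sample variance of the measurements on `start` and the following days."""
    start_dt = datetime.strptime(start, "%Y-%m-%d")
    # Rebuild the string keys for each day in the window.
    keys = [(start_dt + timedelta(days=i)).strftime("%Y-%m-%d") for i in range(days)]
    values = [measurements[k] for k in keys if k in measurements]
    # statistics.variance needs at least two data points.
    return variance(values)


print(variance_within_week(date_dictionary, "2015-07-22"))  # 1.0
```

Generating the keys from the date range avoids parsing every key in the dictionary, which keeps the lookup cheap when the dictionary is large.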
