Having an issue converting a set of very large integers into timedeltas - python

I am working with a dataframe that includes a column of integers whose unit is the number of days since 0001-01-01. I need to convert these integers into calendar dates. When I attempt to use the pd.to_timedelta function to convert these integers into timedeltas that I can then add to the start date, the results are not what I expect. Please run the following code for an example:
import pandas as pd
df = pd.DataFrame([735110,735111,735112,735114], columns=["days_since_0001-01-01"])
df['time_added'] = pd.to_timedelta(df['days_since_0001-01-01'], unit='d')
print(df[0:1])
#output: 735110 94598 days 01:16:18.871345152
As you can see, the timedelta for the first row is 94,598 days, plus spurious hours, minutes, and seconds. What I expected for that row was 735,110 days, with zeros for the rest of the timestamp. Further, if I use a DataFrame with a smaller number of days, the output is as I would expect. I have come to the conclusion that to_timedelta cannot handle very large numbers of days; however, I do not know of an alternative method. I could simply reduce the days by an arbitrary amount and increase the start date to compensate, but I would still need to know the proper amount by which to reduce the days integer and the corresponding amount by which to advance the start date. Any help would be appreciated.
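The failure mode is visible from pandas' own limits: timedeltas are stored as int64 nanosecond counts, which caps the representable span far below 735,110 days, so large day counts overflow. A quick check:

```python
import pandas as pd

# Timedelta values are int64 nanosecond counts, so the largest
# representable span is roughly 106,752 days (about 292 years):
print(pd.Timedelta.max)       # 106751 days 23:47:16.854775807
print(pd.Timedelta.max.days)  # 106751
```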

Every year is 365 days, except for leap years, which are 366 days and occur roughly every four years (under the Gregorian rules, century years are leap years only when divisible by 400). So I just need to pick a point within the representable timedelta range (e.g. 1850), calculate the number of regular years and leap years between 0001-01-01 and 1850-01-01, multiply the regular years by 365 and the leap years by 366, then subtract that total from my days_since_0001-01-01 column. Then I can set the start date to 1850-01-01 and add the remaining days to it to get the current date.
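Since "days since 0001-01-01" is essentially the proleptic Gregorian ordinal that Python's standard library already uses, a sketch that avoids the manual leap-year bookkeeping is datetime.date.fromordinal (this assumes the column counts 0001-01-01 itself as day 1; if day 0 means 0001-01-01, add 1 first):

```python
import datetime as dt
import pandas as pd

df = pd.DataFrame([735110, 735111, 735112, 735114],
                  columns=["days_since_0001-01-01"])

# date.fromordinal(1) is 0001-01-01, so each ordinal maps straight to a date,
# with the leap-year arithmetic handled by the standard library
df["date"] = df["days_since_0001-01-01"].map(dt.date.fromordinal)
print(df["date"].iloc[0])  # 2013-08-30
```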

Related

Can the day of the month be encoded similarly to other cyclical variables?

I am working with time-series data in Python to see if variables like time of day, day of the month, and month of the year affect attendance at a gym. I have read up on encoding time-series data cyclically using sine and cosine. I was wondering if you can do the same thing for the day of the month. The reason I ask is that, unlike the number of months in a year or the number of days in a week, the number of days in a month is variable (for example, February has 28, whereas March has 31). Is there any way to deal with that?
Here is a link describing what I mean by cyclic encoding: https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/
Essentially, what this is saying is that you can't just convert the hour into a series of values like 1, 2, 3, ..., 24 when you are doing machine learning because that implies that the 24th hour is further away (from a euclidean geometric perspective) from the 1st hour than the 1st hour is from the 2nd hour, which is not true. Cyclical encoding (assigning sine and cosine values to each hour) allows you to represent the fact that the 24th hour and the 2nd hour are equidistant from the 1st hour.
My question is that I do not know if this cyclical conversion will work for days in a month, seeing as different months can have different numbers of days.
You can implement this by dividing each month into 2π radians: in a 28-day month a day spans 2π/28 ≈ 0.2244 radians, while in a 31-day month a day spans 2π/31 ≈ 0.2027 radians.
This obviously introduces a skew where a shorter month will appear to take up as much time as a longer one; but it will satisfy your requirement. If you only use this metric for normalizing a single feature, that should be inconsequential, and let you achieve the stated goal.
If you have points in time with a finer granularity than a day, you obviously can and probably should normalize those into the same projection.
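A minimal sketch of that normalization in pandas (the sample dates and names here are made up; days_in_month absorbs the varying month lengths so day 1 of any month maps to angle 0 and the month's end approaches 2π):

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.to_datetime(["2021-02-01", "2021-02-14", "2021-03-31"]))

# Fraction of the month elapsed: day 1 -> 0.0, last day -> just under 1.0,
# regardless of whether the month has 28 or 31 days
frac = (dates.dt.day - 1) / dates.dt.days_in_month
angle = 2 * np.pi * frac

# The pair (sin, cos) places each day on the unit circle, so the end of
# one month sits next to the start of the next
day_sin = np.sin(angle)
day_cos = np.cos(angle)
```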

How to convert an int64 into a datetime?

I'm trying to convert the column Year (type: int64) into a date type so that I can use the Groupby function to group by decade.
I'm using the following code to convert the datatype:
import datetime as dt
crime["Date"]=pd.TimedeltaIndex(crime["Year"], unit='d')+dt.datetime(1960,1,1)
crime[["Year","Date"]].head(10)
The dates it returns are not correct: they don't start at the right year, and only the day increases from row to row.
I want the year to start at 1960, and for each row the year to increase by 1.
I tried substituting unit='d' in the code above with unit='y' and I get the following result:
ValueError: Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta value durations.
I think @kate's answer is what you want; I wrote mine before it came along. I thought my answer might still be worth something to explain why unit='y' isn't supported, and why unit='d' isn't working for you either...
I wouldn't think this would be right:
TimedeltaIndex(crime["Year"], unit='d')
as I expect this interprets your year count as a count of days. There's a good reason you can't use unit='y': years don't all have the same number of days, so a count of years is ambiguous as a duration; it only makes exact sense once you anchor it to an actual date.
The same holds true, even more so, for months, since month lengths vary even within a single year, so a timedelta in months is ambiguous too.
I would add the column in the following way:
crime['Date'] = crime['Year'].map(lambda x: dt.datetime(1960 + x,1,1))
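A self-contained sketch of that approach, with a toy crime frame standing in for the real data (the Year values here are hypothetical offsets from 1960), including the decade grouping the question was after:

```python
import datetime as dt
import pandas as pd

crime = pd.DataFrame({"Year": [0, 1, 2, 35]})  # hypothetical year offsets

# Build a proper datetime from each offset; the column becomes datetime64
crime["Date"] = crime["Year"].map(lambda x: dt.datetime(1960 + x, 1, 1))

# Group by decade by flooring the year to a multiple of ten
crime["Decade"] = (crime["Date"].dt.year // 10) * 10
print(crime.groupby("Decade").size())
```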

convert pandas.tseries.offsets.Day datatype to integer datatype for simple calculations

I took the difference of two columns, each of type pandas._libs.tslibs.period.Period. The result is of the pandas.tseries.offsets.Day datatype. Now I want to use the integer value of the calculated time difference in other calculations. How do I do that?
I want the last column's values to be plain integers.
Here is what I have tried:
## Check if all dates are in same format and take time upto days only, which will be suitable for given data
data_dates['ExaminDate'] = pd.to_datetime(data_dates["ExaminDate"],errors='coerce', infer_datetime_format= True)
data_dates["DeathDate"] = pd.to_datetime(data_dates["DeathDate"],errors='coerce',infer_datetime_format= True)
data_dates['ExaminMY']= data_dates['ExaminDate'].dt.to_period('D')
data_dates['DeathMY']= data_dates['DeathDate'].dt.to_period('D')
## Make a new column representing time of observation for each patient, which will be difference of two columns (ExaminDate and DeathDate)
data_dates['Time(days)'] = data_dates['DeathMY'] - data_dates['ExaminMY']
It's unclear why you chose to convert your dates to time periods in the first place; it prevents you from achieving the goal of calculating the time difference (in days) between two dates. The following two lines should therefore be removed:
data_dates['ExaminMY']= data_dates['ExaminDate'].dt.to_period('D')
data_dates['DeathMY']= data_dates['DeathDate'].dt.to_period('D')
Explanation: with Period objects there's no clear definition of the time difference (in days or otherwise) between two time periods (e.g. Q42019 and Q12020). You could be referring to the start dates, the end dates, or some combination of the two. Plus, periods (offsets, really) like '1 month' or '1 quarter' can differ in the number of days they contain.
If what you're interested in is the time difference, in days, between DeathDate and ExaminDate, just do the calculation on the original datetime columns:
# I don't think you need these three lines, as you're reading the date from a file. It's just
# to make sure the example works.
df = pd.DataFrame({"ExamineDate": ['2020-01-15'], "DeathDate": ["2020-04-20"]})
df.ExamineDate = pd.to_datetime(df.ExamineDate)
df.DeathDate = pd.to_datetime(df.DeathDate)
# This is where the real stuff begins
df["days_diff"] = df.DeathDate - df.ExamineDate
df["days_diff_int"] = df.days_diff.dt.days
print (df)
The result is:
ExamineDate DeathDate days_diff days_diff_int
0 2020-01-15 2020-04-20 96 days 96
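That said, if you do want to keep the Period columns, the difference of two daily-frequency periods comes back as a Day-offset object, and its integer count is exposed on the .n attribute. A small sketch with made-up dates:

```python
import pandas as pd

examin = pd.Period("2020-01-15", freq="D")
death = pd.Period("2020-04-20", freq="D")

diff = death - examin  # an offsets.Day-style object, not a plain int
print(diff.n)          # 96 -- the underlying integer day count
```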

Python: Date conversion to year-weeknumber, issue at switch of year

I am trying to convert a dataframe column with a date and timestamp to a year-weeknumber format, i.e., 01-05-2017 03:44 = 2017-1. This is pretty easy; however, I am stuck on dates that fall in a new year while their week number still belongs to the last week of the previous year.
I did the following:
df['WEEK_NUMBER'] = df.date.dt.year.astype(str).str.cat(df.date.dt.week.astype(str), sep='-')
Where df['date'] is a very large column with date and times, ranging over multiple years.
A date which gives a problem is for example:
Timestamp('2017-01-01 02:11:27')
The output for my code will be 2017-52, while it should be 2016-52. Since the data covers multiple years, and weeknumbers and their corresponding dates change every year, I cannot simply subtract a few days.
Does anybody have an idea of how to fix this? Thanks!
Replace df.date.dt.year by this:
(df.date.dt.year- ((df.date.dt.week>50) & (df.date.dt.month==1)))
Basically, this subtracts 1 from the year value when the week number is greater than 50 and the month is January. Note that the symmetric case, a late-December date that falls in week 1 of the following year, would need 1 added to the year instead.
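Another route, assuming pandas >= 1.1, is to let dt.isocalendar() supply the ISO year directly, which handles both boundary cases (January dates in the old year's last week, and December dates in the new year's week 1):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(
    ["2017-01-01 02:11:27", "2014-12-29 12:00:00"])})

# isocalendar() returns the ISO 8601 year, week, and day per row,
# so the year already agrees with the week number at year boundaries
iso = df["date"].dt.isocalendar()
df["WEEK_NUMBER"] = iso["year"].astype(str) + "-" + iso["week"].astype(str)
print(df["WEEK_NUMBER"].tolist())  # ['2016-52', '2015-1']
```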

Selecting data for one hour in a timeseries dataframe

I'm having trouble selecting data in a dataframe based on the hour.
I have a month's worth of data in 10-minute intervals.
I would like to select the data (creating another dataframe) for each hour of a specific day, but I am having trouble writing the expression.
This is how I did it to select the day:
x = all_data.resample('D').index
for day in range(20):
    c = x.day[day]
    d = x.month[day]
    print(data['%(a)s-%(b)s-2009' % {'a': c, 'b': d}])
but if I do the same for hours, it does not work:
x = data['04-09-2009'].resample('H').index
for hour in range(8):
    daydata = data['4-9-2009 %(a)s' % {'a': x.hour[hour]}]
I get the error:
raise KeyError('no item named %s' % com.pprint_thing(item))
KeyError: u'no item named 4-9-2009 0'
which is true, as the index is in dd/mm/yyyy hh:mm:ss format.
I'm sure this should be easy and something to do with resample. The trouble is I don't want to do anything with the data, just select the dataframe (to correlate it afterwards).
Cheers
You don't need to resample your data unless you want to aggregate it into a daily value (e.g., sum, max, median).
If you just want a specific day's worth of data, you can use the following example of the .loc attribute to get started:
import numpy
import pandas
N = 3700
data = numpy.random.normal(size=N)
time = pandas.date_range(start='2013-02-15 14:30', freq='10T', periods=N)
ts = pandas.Series(data=data, index=time)
ts.loc['2013-02-16']
The great thing about using .loc on a time series is that you can be as general or as specific as you want with the dates. So for a particular hour, you'd say:
ts.loc['2013-02-16 13'] # notice that i didn't put any minutes in there
Similarly, you can pull out a whole month with:
ts.loc['2013-02']
The issue you're having with the string formatting is the zero-padding of the hour: if you manually prepend a 0, a two-digit afternoon hour ends up three characters wide, which doesn't match the index. So if I wanted to loop through a specific set of hours, I would do:
hours = [2, 7, 12, 22]
for hr in hours:
print(ts.loc['2013-02-16 {0:02d}'.format(hr)])
The 02d format spec tells Python to render the integer at least two characters wide, padding with a 0 on the left side if necessary. Also, you probably need to format your date as YYYY-mm-dd instead of the other way around.
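Since the stated goal was one sub-frame per hour (to correlate afterwards), grouping on the index's hour avoids the string formatting entirely. A sketch on synthetic 10-minute data (the date and values are made up):

```python
import numpy as np
import pandas as pd

# One full day of 10-minute samples
idx = pd.date_range(start="2009-09-04", periods=144, freq="10min")
ts = pd.Series(np.arange(144), index=idx)

# One chunk per hour of the day: 24 groups of six samples each
hourly = {hour: chunk for hour, chunk in ts.groupby(ts.index.hour)}
print(len(hourly), len(hourly[0]))  # 24 6
```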
