Elegant way to add years as timedelta units to shift dates - Pandas

I have a dataframe like the one shown below:
df1 = pd.DataFrame({'person_id': [11,11,11,21,21],
'admit_dates': ['03/21/2015', '01/21/2016', '7/20/2018','01/11/2017','12/31/2011'],
'discharge_dates': ['05/09/2015', '01/29/2016', '7/27/2018','01/12/2017','01/31/2016'],
'drug_start_dates': ['05/29/1967', '01/21/1957', '7/27/1959','01/01/1961','12/31/1961'],
'offset':[223,223,223,310,310]})
What I would like to do is add the offset, which is in years, to the date columns.
So I tried converting the offset to a timedelta object with unit='y' or unit='Y' and then shifting admit_dates:
df1['offset'] = pd.to_timedelta(df1['offset'],unit='Y') #also tried with `y` (small y)
df1['shifted_date'] = df1['admit_dates'] + df1['offset']
However, I get the below error
ValueError: Units 'M' and 'Y' are no longer supported, as they do not
represent unambiguous timedelta values durations.
Is there any other elegant way to shift dates by years?

The max Timestamp supported in pandas is Timestamp('2262-04-11 23:47:16.854775807'), so you cannot add 310 years to the date 12/31/2011. One possible way is to use Python's datetime objects, which support years up to 9999, so you should be able to add 310 years with those.
from dateutil.relativedelta import relativedelta
df['admit_dates'] = pd.to_datetime(df['admit_dates'])
df['admit_dates'] = df['admit_dates'].dt.date.add(
df['offset'].apply(lambda y: relativedelta(years=y)))
Result:
df
person_id admit_dates discharge_dates drug_start_dates offset
0 11 2238-03-21 05/09/2015 05/29/1967 223
1 11 2239-01-21 01/29/2016 01/21/1957 223
2 11 2241-07-20 7/27/2018 7/27/1959 223
3 21 2327-01-11 01/12/2017 01/01/1961 310
4 21 2321-12-31 01/31/2016 12/31/1961 310
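For readers who would rather avoid the dateutil dependency, here is a minimal standard-library sketch of the same idea. The helper name add_years is hypothetical (it is not part of pandas or dateutil), and clamping Feb 29 to Feb 28 in non-leap target years is an assumption chosen to mirror how relativedelta behaves.

```python
from datetime import date

def add_years(d: date, years: int) -> date:
    """Shift d by whole calendar years (hypothetical helper).

    Feb 29 is clamped to Feb 28 when the target year is not a leap year.
    """
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Raised when d is Feb 29 and the target year has no Feb 29
        return d.replace(year=d.year + years, day=28)

admit_dates = [date(2015, 3, 21), date(2017, 1, 11), date(2011, 12, 31)]
offsets = [223, 310, 310]

# Plain date objects are not limited to the year 2262
shifted = [add_years(d, y) for d, y in zip(admit_dates, offsets)]
print(shifted)
```

The resulting list can be assigned to a dataframe column, which will then hold Python date objects (object dtype) rather than datetime64 values.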

One thing you can do is extract the year out of the date, and add it to the offset:
df1 = pd.DataFrame({'person_id': [11,11,11,21,21],
'admit_dates': ['03/21/2015', '01/21/2016', '7/20/2018','01/11/2017','12/31/2011'],
'discharge_dates': ['05/09/2015', '01/29/2016', '7/27/2018','01/12/2017','01/31/2016'],
'drug_start_dates': ['05/29/1967', '01/21/1957', '7/27/1959','01/01/1961','12/31/1961'],
'offset':[10,20,2,31,12]})
df1.admit_dates = pd.to_datetime(df1.admit_dates)
df1["new_year"] = df1.admit_dates.dt.year + df1.offset
df1["date_with_offset"] = pd.to_datetime(pd.DataFrame({"year": df1.new_year,
"month": df1.admit_dates.dt.month,
"day":df1.admit_dates.dt.day}))
The catch - with your original offsets, some of the dates cause the following error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2328-01-11 00:00:00
According to the documentation, the maximum date in pandas is April 11th, 2262 (at about quarter to midnight, to be specific). This is because pandas stores timestamps as 64-bit integers counting nanoseconds, and the out-of-bounds error occurs once a date no longer fits in that representation.
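When the shifted dates stay inside the supported range, pd.DateOffset(years=...) is another option worth considering. The sketch below is not taken from the answers above; the sample rows and the column names shifted_date and too_far are made up for illustration, and the offsets are deliberately small.

```python
import pandas as pd

df1 = pd.DataFrame({
    'admit_dates': ['03/21/2015', '01/21/2016', '02/29/2016'],
    'offset': [10, 20, 1],  # illustrative offsets in years, small enough to stay in range
})
df1['admit_dates'] = pd.to_datetime(df1['admit_dates'], format='%m/%d/%Y')

# Conservative check: flag rows whose target year reaches the last supported year
too_far = (df1['admit_dates'].dt.year + df1['offset']) >= pd.Timestamp.max.year

# Row-wise calendar shift; Feb 29 lands on Feb 28 in a non-leap target year
df1['shifted_date'] = [
    d + pd.DateOffset(years=int(y))
    for d, y in zip(df1['admit_dates'], df1['offset'])
]
print(df1)
```

Rows flagged by too_far would need the Python date approach instead, since they cannot be stored as nanosecond timestamps.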

The units 'Y' and 'M' have been deprecated since pandas 0.25.0.
However, numpy's timedelta64 still accepts these units, and its result can be passed to the pandas Timedelta:
import pandas as pd
import numpy as np
# Builds your dataframe
df1 = pd.DataFrame({'person_id': [11,11,11,21,21],
'admit_dates': ['03/21/2015', '01/21/2016', '7/20/2018','01/11/2017','12/31/2011'],
'discharge_dates': ['05/09/2015', '01/29/2016', '7/27/2018','01/12/2017','01/31/2016'],
'drug_start_dates': ['05/29/1967', '01/21/1957', '7/27/1959','01/01/1961','12/31/1961'],
'offset':[223,223,223,310,310]})
>>> df1
person_id admit_dates discharge_dates drug_start_dates offset
0 11 03/21/2015 05/09/2015 05/29/1967 223
1 11 01/21/2016 01/29/2016 01/21/1957 223
2 11 7/20/2018 7/27/2018 7/27/1959 223
3 21 01/11/2017 01/12/2017 01/01/1961 310
4 21 12/31/2011 01/31/2016 12/31/1961 310
>>> df1['shifted_date'] = df1.apply(lambda r: pd.Timedelta(np.timedelta64(r['offset'], 'Y'))+ pd.to_datetime(r['admit_dates']), axis=1)
>>> df1['shifted_date'] = df1['shifted_date'].dt.date
>>> df1
person_id admit_dates discharge_dates drug_start_dates offset shifted_date
0 11 03/21/2015 05/09/2015 05/29/1967 223 2238-03-21
1 11 01/21/2016 01/29/2016 01/21/1957 223 2239-01-21
2 11 7/20/2018 7/27/2018 7/27/1959 223 2241-07-20
....

Related

How to convert timedelta to integer in pandas?

I have a column 'Time' in pandas that includes both integer and time deltas in days:
index Time
1 91
2 28
3 509 days 00:00:00
4 341 days 00:00:00
5 250 days 00:00:00
I want to change all of the time deltas to integers, but I keep getting errors when I try to pick which values to convert, because the conversion fails on the values that are already integers rather than timedeltas.
I want this:
index Time
1 91
2 28
3 509
4 341
5 250
I've tried a few variations of this to check if it's an integer, as I'm not concerned with those:
for x in finished['Time Future']:
    if isinstance(x, int):
        continue
    else:
        finished['Time'][x] = finished['Time'][x].astype(int)
But it is not working at all. I can't seem to find a solution.

This seems to work:
# If the day counts are actual integers:
m = ~df.Time.apply(lambda x: isinstance(x, int))
# OR, in case the day counts are strings:
m = ~df.Time.str.isdigit()
df.loc[m, 'Time'] = df.Time[m].apply(lambda x: pd.Timedelta(x).days)
Results in:
Time
1 91
2 28
3 509
4 341
5 250
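A self-contained sketch of the same conversion, assuming the column really holds a mix of Python integers and pd.Timedelta objects (the sample frame below is reconstructed from the question, not the asker's real data):

```python
import pandas as pd

# Mixed column: plain integers and Timedelta objects (object dtype)
df = pd.DataFrame({'Time': [91, 28,
                            pd.Timedelta(days=509),
                            pd.Timedelta(days=341),
                            pd.Timedelta(days=250)]})

# Take .days from Timedelta values and leave integers untouched
df['Time'] = df['Time'].map(lambda x: x.days if isinstance(x, pd.Timedelta) else int(x))
print(df)
```

Testing for pd.Timedelta rather than int avoids surprises when the integers turn out to be numpy integers, which are not instances of the built-in int.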

using pd.to_datetime to form a date by taking input of year,months,day present in different columns in a data frame

I have a problem combining the day, month, and year columns to form a date column in a data frame using pd.to_datetime. Below is the dataframe I'm working on; the columns Yr, Mo, Dy represent year, month, and day.
data = pd.read_table("/ALabs/wind.data",sep = ',')
Yr Mo Dy RPT VAL ROS KIL
61 1 1 15.04 14.96 13.17 9.29
61 1 2 14.71 NaN 10.83 6.50
61 1 3 18.50 16.88 12.33 10.13
So I tried the code below, and I get the following error: "to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing"
Step 1:
data['Date'] = pd.to_datetime(data[['Yr','Mo','Dy']],format="%y-%m-%d")
Next, I tried converting the Yr, Mo, Dy columns from int64 to datetime64 and assigning the result to new columns Year, Month, Day. Now when I combine the columns I get the proper date format in the new date column, and I have no idea how I got the desired result.
Step2:
data['Year'] = pd.to_datetime(data.Yr,format='%y').dt.year
data['Month'] = pd.to_datetime(data.Mo,format='%m').dt.month
data['Day'] = pd.to_datetime(data.Dy,format ='%d').dt.day
data['Date'] =pd.to_datetime(data[['Year','Month','Day']])
Result:
Yr Mo Dy Year Month Day Date
61 1 1 2061 1 1 2061-01-01
61 1 2 2061 1 2 2061-01-02
61 1 3 2061 1 3 2061-01-03
61 1 4 2061 1 4 2061-01-04
But if I use the same method and change the column names from Year, Month, Day to Yy, Mh, Di, as in the code below, I get the same error: "to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing"
Step3:
data['Yy'] = pd.to_datetime(data.Yr,format='%y').dt.year
data['Mh'] = pd.to_datetime(data.Mo,format='%m').dt.month
data['Di'] = pd.to_datetime(data.Dy,format ='%d').dt.day
data['Date'] =pd.to_datetime(data[['Yy','Mh','Di']])
What I want to know:
1) Is it mandatory for the argument names to be 'Year', 'Month', 'Day' when using pd.to_datetime?
2) Is there any other way to combine the columns in a dataframe to form a date, rather than using this long method?
3) Is this error specific to Python version 3.7?
4) Where have I gone wrong in Step 1 and Step 3, and why do I get output when I follow Step 2?

As per the pandas.to_datetime docs, the column names really do have to be 'year', 'month', and 'day' (capitalizing the first letter is fine). This explains the answer to all of your questions, and no it has nothing to do with the version of Python (and all recent versions of Pandas behave the same).
Also, you should be aware that when you call to_datetime with a sequence of columns (as opposed to a single column/list of strings), the format argument seems to be ignored. So you'll need to normalize your years (to 1961 or 2061 or 1061, etc) yourself. Here's a complete example of how you could do the conversion in a single line:
from io import StringIO
import pandas as pd
d = '''Yr Mo Dy RPT VAL ROS KIL
61 1 1 15.04 14.96 13.17 9.29
61 1 2 14.71 NaN 10.83 6.50
61 1 3 18.50 16.88 12.33 10.13 '''
# pd.compat.StringIO is gone from current pandas; io.StringIO does the same job
data = pd.read_csv(StringIO(d), sep=r'\s+')
dtime = pd.to_datetime({k:data[c]+v for c,k,v in zip(('Yr', 'Mo', 'Dy'), ('Year', 'Month', 'Day'), (1900, 0, 0))})
print(dtime)
Output:
0 1961-01-01
1 1961-01-02
2 1961-01-03
dtype: datetime64[ns]
In the above code, instead of adding the appropriately named columns to the dataframe data, I just made a dict where the key/value pairs are eg. ('Year', data['Yr']), and also added 1900 to the years.
You can simplify the dict comprehension a bit by just adding 1900 directly to the appropriate column:
data['Yr'] += 1900
dtime = pd.to_datetime({k:data[c] for c,k in zip(('Yr', 'Mo', 'Dy'), ('year', 'month', 'day'))})
This code will have the same output as the previous.
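If the dict comprehension feels dense, the same assembly can be written with rename. This is only a sketch under the same assumption as above, namely that two-digit years belong to the 1900s; the intermediate name parts is made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({'Yr': [61, 61, 61], 'Mo': [1, 1, 1], 'Dy': [1, 2, 3]})

# to_datetime assembles dates from columns named year, month, day
parts = (
    data[['Yr', 'Mo', 'Dy']]
    .rename(columns={'Yr': 'year', 'Mo': 'month', 'Dy': 'day'})
    .assign(year=lambda p: p['year'] + 1900)  # assumption: years are 19xx
)
data['Date'] = pd.to_datetime(parts)
print(data)
```

The original Yr, Mo, Dy columns are left unchanged, which can be useful if other code still depends on them.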

I don't really know how Python deals with years, but the reason it wasn't working had to do with the fact that you were using the year 61.
This works for me
d = {'Day': ["1", "2","3"],
'Month': ["1", "1","1"],
'Year':["61", "61", "61"]}
df = pd.DataFrame(data=d)
df["Year"] = pd.to_numeric(df["Year"])
df.Year = df.Year+2000
df['Date'] = pd.to_datetime(df[['Year','Month','Day']], format='%Y%m%d')

grouping time-series data based on starting and ending date

I have time-series data of a yearly sports tournament, with the date when each game was played. I want to group the games by the season (year) they were played in. Each season starts in August and ends the NEXT year in July.
How would I go about grouping the games by season, like -
season(2016-2017), season(2017-2018), etc..
This Answer involving df.resample() may be related, but I'm not sure how I'd go about doing it.
This is what the date column looks like:
DATE
26/09/09
04/10/09
17/10/09
25/10/09
31/10/09
...
29/09/18
07/10/18
28/10/18
03/11/18
I want to group by seasons so that I can perform visualization operations over the aggregated data.
UPDATE: For the time being my solution is to split up the dataframe into groups of 32 as I know each season has 32 games. This is the code I've used:
split_df = np.array_split(df, np.arange(0, len(df),32))
But I'd prefer something more elegant that makes better use of the time-series data, so I'll keep the question open.

The key to success is proper grouping, in your case pd.Grouper(key='DATA', freq='AS-AUG').
Note that freq='AS-AUG' states that your groups should start from the start of
August each year.
Look at the following script:
import numpy as np
import pandas as pd
# Source columns
dates = [ '01/04/09', '31/07/09', '01/08/09', '26/09/09', '04/10/09', '17/12/09',
'25/01/10', '20/04/10', '31/07/10', '01/08/10', '28/10/10', '03/11/10',
'25/12/10', '20/04/11', '31/07/11' ]
scores_x = np.random.randint(0, 20, len(dates))
scores_y = np.random.randint(0, 20, len(dates))
# Source DataFrame
df = pd.DataFrame({'DATA': dates, 'SCORE_X': scores_x, 'SCORE_Y': scores_y})
# Convert string date to datetime
df.DATA = pd.to_datetime(df.DATA, format='%d/%m/%y')
# Grouping
gr = df.groupby(pd.Grouper(key='DATA', freq='AS-AUG'))
If you print the results:
for name, group in gr:
    print()
    print(name)
    print(group)
you will get:
2008-08-01 00:00:00
DATA SCORE_X SCORE_Y
0 2009-04-01 16 11
1 2009-07-31 10 7
2009-08-01 00:00:00
DATA SCORE_X SCORE_Y
2 2009-08-01 19 6
3 2009-09-26 14 5
4 2009-10-04 8 11
5 2009-12-17 12 19
6 2010-01-25 0 0
7 2010-04-20 17 6
8 2010-07-31 18 2
2010-08-01 00:00:00
DATA SCORE_X SCORE_Y
9 2010-08-01 15 18
10 2010-10-28 2 4
11 2010-11-03 8 16
12 2010-12-25 13 1
13 2011-04-20 19 7
14 2011-07-31 8 3
As you can see, each group starts on the 1st of August and ends on
the 31st of July.
Then you can do whatever you want with your groups.

Use -
df.groupby(df['DATE'].dt.year).count()
Output
DATE
DATE
2009 5
2018 4
Custom Season Grouping
min_year = df['DATE'].dt.year.min()
max_year = df['DATE'].dt.year.max()
rng = pd.date_range(start='{}-07'.format(min_year), end='{}-08'.format(max_year), freq='12M').to_series()
df.groupby(pd.cut(df['DATE'], rng)).count()
Output
DATE
DATE
(2009-07-31, 2010-07-31] 3
(2010-07-31, 2011-07-31] 0
(2011-07-31, 2012-07-31] 0
(2012-07-31, 2013-07-31] 0
(2013-07-31, 2014-07-31] 0
(2014-07-31, 2015-07-31] 0
(2015-07-31, 2016-07-31] 0
(2016-07-31, 2017-07-31] 0
(2017-07-31, 2018-07-31] 1

Resampling using 'A-JUL' as an anchored offset alias should do the trick:
>>> df
SAMPLE
DATE
2009-01-30 1
2009-07-10 4
2009-11-20 3
2010-01-01 5
2010-05-13 1
2010-08-01 1
>>> df.resample('A-JUL').sum()
SAMPLE
DATE
2009-07-31 5
2010-07-31 9
2011-07-31 1
A indicates it is a yearly interval, -JUL indicates it ends in July.

You could build a season column and group by that. In the code below, I used pandas.DateOffset() to move all dates 7 months back, so a game that happened in August looks like it happened in January, which aligns the season year with the calendar year. Building the season string is fairly straightforward after that.
import pandas as pd
from datetime import date
dates = pd.date_range(date(2009, 8, 1), date(2018, 7, 30), freq='17d')
df = pd.DataFrame(dates, columns=['date'])
# copy the date column to a separate dataframe to do the work
df_tmp = df[['date']].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
df_tmp['season_start_year'] = (df_tmp['date'] - pd.DateOffset(months=7)).dt.year
df_tmp['season_end_year'] = df_tmp['season_start_year'] + 1
df_tmp['season'] = df_tmp['season_start_year'].map(str) + '-' + df_tmp['season_end_year'].map(str)
# copy season column to the main dataframe
df['season'] = df_tmp['season']
df.groupby('season').count()
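The season label can also be derived from the year and month directly, without shifting the dates. This is a sketch of that variant; the sample dates are made up, and the August cutoff comes from the question.

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(['26/09/09', '31/07/10', '01/08/10', '03/11/18']),
    format='%d/%m/%y',
)

# Games before August belong to the season that started the previous year
start = dates.dt.year - (dates.dt.month < 8).astype(int)
season = start.astype(str) + '-' + (start + 1).astype(str)
print(season.tolist())
```

The season Series can then be passed to groupby in the same way as the season column above.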

Trying to create a new dataframe column in pandas based on a dataframe related if statement

I'm learning Python & pandas and practicing with different stock calculations. I've tried to search help with this but just haven't found a response similar enough or then didn't understand how to deduce the correct approach based on the previous responses.
I have read stock data for a given time frame with datareader into the dataframe df. In df I have Date, Volume, and Adj Close columns, which I want to use to create a new column "OBV" based on given criteria. OBV is a cumulative value that adds or subtracts today's volume to or from the previous day's OBV, depending on the adjusted close price.
The calculation of OBV is simple:
If Adj Close is higher today than Adj Close of yesterday then add the Volume of today to the (cumulative) volume of yesterday.
If Adj Close is lower today than Adj Close of yesterday then subtract the Volume of today from the (cumulative) volume of yesterday.
On day 1 the OBV = 0
This is then repeated along the time frame and OBV gets accumulated.
Here's the basic imports and start
import numpy as np
import pandas as pd
import pandas_datareader
import datetime
from pandas_datareader import data, wb
start = datetime.date(2012, 4, 16)
end = datetime.date(2017, 4, 13)
# Reading in Yahoo Finance data with DataReader
df = data.DataReader('GOOG', 'yahoo', start, end)
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# This is what I cannot get to work, and I've tried two different ways.
# ATTEMPT 1
def obv1(column):
    if column["Adj Close"] > column["Adj close"].shift(-1):
        val = column["Volume"].shift(-1) + column["Volume"]
    else:
        val = column["Volume"].shift(-1) - column["Volume"]
    return val
df["OBV"] = df.apply(obv1, axis=1)
# ATTEMPT 2
def obv1(df):
    if df["Adj Close"] > df["Adj close"].shift(-1):
        val = df["Volume"].shift(-1) + df["Volume"]
    else:
        val = df["Volume"].shift(-1) - df["Volume"]
    return val
df["OBV"] = df.apply(obv1, axis=1)
Both give me an error.

Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(dict(
Volume=np.random.randint(100, 200, 10),
AdjClose=np.random.rand(10)
))
print(df)
AdjClose Volume
0 0.951710 111
1 0.346711 198
2 0.289758 174
3 0.662151 190
4 0.171633 115
5 0.018571 155
6 0.182415 113
7 0.332961 111
8 0.150202 113
9 0.810506 126
Multiply the Volume by -1 when the change in AdjClose is not positive, then take the cumulative sum:
(df.Volume * (~df.AdjClose.diff().le(0) * 2 - 1)).cumsum()
0 111
1 -87
2 -261
3 -71
4 -186
5 -341
6 -228
7 -117
8 -230
9 -104
dtype: int64
Include this alongside the rest of the df:
df.assign(new=(df.Volume * (~df.AdjClose.diff().le(0) * 2 - 1)).cumsum())
AdjClose Volume new
0 0.951710 111 111
1 0.346711 198 -87
2 0.289758 174 -261
3 0.662151 190 -71
4 0.171633 115 -186
5 0.018571 155 -341
6 0.182415 113 -228
7 0.332961 111 -117
8 0.150202 113 -230
9 0.810506 126 -104
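A variant that follows the question's rules more literally (OBV is 0 on day 1) can be sketched with np.sign. The prices and volumes below are made up, and leaving OBV unchanged on days where the close does not move is an assumption, since the question does not cover that case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Adj Close': [10.0, 11.0, 10.5, 10.5, 12.0],
    'Volume': [100, 200, 150, 300, 250],
})

# +1 when the close rose, -1 when it fell, 0 when unchanged or on day 1
direction = np.sign(df['Adj Close'].diff()).fillna(0)

# Signed volume accumulated over time
df['OBV'] = (direction * df['Volume']).cumsum()
print(df)
```

Because diff() is NaN on the first row, fillna(0) is what makes the first OBV value 0 rather than the first day's volume.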

Calculate day of year in dataframe from a datetime column to another column in Python

I have a dataframe with a datetime column in it, like 2014-01-01, 2016-06-05, etc. Now I want to add a column in the dataframe calculating the day of year (for that given year).
On this forum I did find some hints for sure, but I'm struggling with the types and dataframe stuff.
So this works fine
from datetime import datetime
day_to_calc = datetime.today()
day_of_year = day_to_calc.timetuple().tm_yday
day_of_year
But my day_to_calc is not today, but df['Date']. However, if I try this
df['DOY'] = df['Date'].timetuple().tm_yday
I get
AttributeError: 'Series' object has no attribute 'timetuple'
Ok, so I guess I need a map function perhaps?
So I'm trying something like ..
df['DOY'] = map (datetime.timetuple().tm_yday,df['Date'])
And surely you guys see how stupid that is ;-) (but I'm still learning Python)
TypeError: descriptor 'timetuple' of 'datetime.datetime' object needs an argument
So that makes sense sort of because I need to pass the date as parameter, sooo .. trying
df['DOY'] = datetime.timetuple(df['Date']).tm_yday
TypeError: descriptor 'timetuple' requires a 'datetime.datetime' object but received a 'Series'
There must be a simple way, but I just can't figure out the syntax :-(

Use the dt.dayofyear accessor of the datetime column:
import pandas as pd
# first convert date string to datetime with a proper format string
df = pd.DataFrame({'Date':pd.to_datetime(['2014-01-01', '2016-06-05'], format='%Y-%m-%d')})
# calculate day of year
df['DOY'] = df['Date'].dt.dayofyear
print(df)
Output:
Date DOY
0 2014-01-01 1
1 2016-06-05 157
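To connect this back to the timetuple() attempt in the question, here is a small sketch comparing both routes. The third date and the column name DOY_apply are added for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2014-01-01', '2016-06-05', '2016-12-31'])})

# Vectorized accessor: works on the whole column at once
df['DOY'] = df['Date'].dt.dayofyear

# Element-wise equivalent of the question's timetuple() idea
df['DOY_apply'] = df['Date'].apply(lambda d: d.timetuple().tm_yday)
print(df)
```

Both columns should agree; the accessor is usually the better choice because it avoids a Python-level call per row.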

I noticed the above answer does not go into great detail, so I've provided a more explanatory answer below.
Try the following:
import pandas as pd
# Create a pandas datetime range for the year 2022
passed_2022 = pd.date_range('2022-01-01', '2022-12-31')
# Convert the datetime range to a list of strings in the format 'YYYY-MM-DD'
passed_2022_list = [i.strftime('%Y-%m-%d') for i in passed_2022]
# Create a DataFrame
data = pd.DataFrame({'datetime': passed_2022_list})
# Filter the data DataFrame to only include dates in the passed_2022 list
data = data[data['datetime'].isin(passed_2022_list)]
# Count the number of rows in the filtered DataFrame
num_days_passed = len(data)
# Create a new DataFrame with 'datetime' and 'DAYS_OF_YEAR' columns
result = pd.DataFrame({'datetime': passed_2022_list,
'DAYS OF YEAR': range(1, num_days_passed+1)})
# Print the result of the DataFrame
print(result)
Output:
datetime DAYS OF YEAR
0 2022-01-01 1
1 2022-01-02 2
2 2022-01-03 3
3 2022-01-04 4
4 2022-01-05 5
.. ... ...
360 2022-12-27 361
361 2022-12-28 362
362 2022-12-29 363
363 2022-12-30 364
364 2022-12-31 365
[365 rows x 2 columns]
