Python: calculate the number of years in a date column - python

I've recently started coding with Python, and I'm struggling to calculate the number of years between the current date and a given date.
Dataframe
I would like to calculate the number of years for each column.
I tried this but it's not working:
def Number_of_years(d1,d2):
    if d1 is not None:
        return relativedelta(d2,d1).years
for col in df.select_dtypes(include=['datetime64[ns]']):
    df[col]=Number_of_years(df[col],date.today())
Can anyone help me find a solution to this?

I see that the format of the dates is day/month/year.
Assuming this format is the same for every cell, you can parse the dates using the datetime module like so:
from datetime import datetime  # import module
import numpy as np  # needed for np.vectorize below
def numberOfYears(element):
    # parse the date string according to the fixed format
    date = datetime.strptime(element, '%d/%m/%Y')
    # return the difference between the calendar years
    return datetime.today().year - date.year
# np.vectorize applies the function element-wise (a convenience wrapper, not a speed-up)
function = np.vectorize(numberOfYears)
# This returns a numpy array containing the difference between years.
# Call this for each column, and you should be good
difference = function(df.Date_creation)
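If pandas is already loaded, the same calendar-year difference can be sketched without a Python-level function. This is only an illustration: it assumes a column of day/month/year strings, and the frame `df_example` and the fixed `today` value are made up so the result is reproducible.

```python
import pandas as pd

# Hypothetical data: day/month/year strings, as assumed in the answer above.
df_example = pd.DataFrame({"Date_creation": ["15/03/2010", "01/01/2020"]})

# Parse the whole column at once with the fixed format.
parsed = pd.to_datetime(df_example["Date_creation"], format="%d/%m/%Y")

# Fixed reference date for reproducibility; pd.Timestamp.today() would be used in practice.
today = pd.Timestamp("2021-06-30")

# Difference between calendar years (not completed years).
year_difference = today.year - parsed.dt.year
```

Note that this, like the answer above, subtracts calendar years only, so a date from last December counts as one year even in January.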

Your code is basically right, but you're operating on a pandas Series, so you can't call relativedelta on it directly; apply the function element by element instead:
from datetime import date
from dateutil.relativedelta import relativedelta
def number_of_years(d1,d2):
    return relativedelta(d2,d1).years
for col in df.select_dtypes(include=['datetime64[ns]']):
    df[col]= df[col].apply(lambda d: number_of_years(d, date.today()))
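Here is a small self-contained sketch of the same element-wise idea, with a guard for missing dates added. The helper name `full_years_between`, the sample frame, and the fixed reference date are all assumptions for illustration, not part of the original question.

```python
from datetime import date

import pandas as pd
from dateutil.relativedelta import relativedelta


def full_years_between(start, end):
    """Return the completed years from start to end, or None for a missing date."""
    if pd.isna(start):
        return None
    return relativedelta(end, start).years


# Hypothetical frame with one datetime column that contains a missing value.
df_example = pd.DataFrame(
    {"Date_creation": pd.to_datetime(["2010-03-15", "2020-01-01", None])}
)
reference = date(2021, 3, 14)  # fixed instead of date.today() for reproducibility

years = df_example["Date_creation"].apply(
    lambda d: full_years_between(d, reference)
)
```

Unlike a plain subtraction of calendar years, relativedelta counts completed years, so 2010-03-15 to 2021-03-14 gives 10 rather than 11.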

Related

date_range won't include last date of interval for custom frequency

I want to create a date vector with a given fixed spacing depending on the frequency I choose. So far, this is what I got:
import pandas as pd
import datetime as dt
from datetime import date, timedelta
def getDates(sD, eD, f):
    # creating the datetime object to later on make the date vector
    sD = dt.datetime.strptime(sD, '%m/%d/%Y')
    eD = dt.datetime.strptime(eD, '%m/%d/%Y')
    sd_t = date(sD.year,sD.month,sD.day) # start date
    ed_t = date(eD.year,eD.month,eD.day) # end date
    # we hardcode a frequency dictionary for the frequencies and spacing that
    # date vectors are going to have.
    freqDict = {'1h':'40D', '4h':'162D', '1d':'1000D'}
    dateVector = pd.date_range(sd_t, ed_t, freq = freqDict[f])
    return dateVector
As you can see, I'm only interested in 3 frequencies, and the spacing between them works well. I have to work within API limitations, and the limit I set for requests is 1000. This is why I chose these custom spacings between dates: to allow as many data points as possible given the frequency and the API request limit these dates are meant for.
Unfortunately, I can't get the final date in the dateVector in some cases. If I run this function with these inputs:
getDates('01/01/2020', '01/01/2021', '4h')
I get this outcome, which is missing the final date on the array ('01/01/2021'):
0 2020-01-01
1 2020-06-11
2 2020-11-20
I thought of using the closed = parameter, but it didn't get me where I wanted.
A workaround I thought of consists of using periods instead of freq, and dynamically computing the periods according to the distance (in terms of days) between the start date and the end date. But I would like to know if I can make date_range work in my favor without having to write such a routine.
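One possible sketch, offered as an assumption rather than a built-in date_range feature: build the range as usual and append the end date when the fixed spacing does not land on it. The function name `date_range_with_end` is hypothetical.

```python
import pandas as pd


def date_range_with_end(start, end, freq):
    """Build a date_range and append the end date when the grid does not land on it."""
    end_ts = pd.Timestamp(end)
    dates = pd.date_range(start, end_ts, freq=freq)
    if len(dates) == 0 or dates[-1] != end_ts:
        dates = dates.append(pd.DatetimeIndex([end_ts]))
    return dates


dates = date_range_with_end("2020-01-01", "2021-01-01", "162D")
exact = date_range_with_end("2020-01-01", "2020-01-11", "5D")
```

The last interval is then shorter than the others, which should be harmless if the spacing only exists to stay under a request limit.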

How to add datetimes in Python?

I have a program that generates datetimes in several formats, like below.
1 day, 21:21:00.561566
11:19:26.056148
They may also be in a month or year format, and I want to know if there is any way to add up all the times that I get from the program.
- 1 day, 21:21:00.561566 is the string representation of a datetime.timedelta object. If you need to parse from string to timedelta, pandas has a suitable method. There are other third party parsers; I'm just using this one since pandas is quite common.
import pandas as pd
td = pd.to_timedelta('- 11:19:26.056148')
# Timedelta('-1 days +12:40:33.943852')
td.total_seconds()
# -40766.056148
If you need to find the sum of multiple timedelta values, you can sum up their total_seconds and convert them back to timedelta:
td_strings = ['- 1 day, 21:21:00.561566', '- 11:19:26.056148']
td_sum = pd.Timedelta(seconds=sum([pd.to_timedelta(s).total_seconds() for s in td_strings]))
td_sum
# Timedelta('-1 days +10:01:34.505418')
...or leverage some tools from the Python standard lib:
from functools import reduce
from operator import add
td_sum = reduce(add, map(pd.to_timedelta, td_strings))
# Timedelta('-1 days +10:01:34.505418')
td_sum.total_seconds()
# -50305.494582
You can subtract datetimes as shown here to find how far apart two times are:
https://stackoverflow.com/a/1345852/2415706
Adding two dates doesn't really make any sense though. Like, if you try to add Jan 1st of 2020 to Jan 1st of 1995, what are you expecting?
You can use the datetime.timedelta class for this purpose.
You can find the documentation here.
You will need to parse your string and build a timedelta object.
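A minimal sketch of that parsing step using only the standard library. It assumes the strings follow the format Python itself produces with str(timedelta), for example "1 day, 21:21:00.561566" or "11:19:26.056148"; the function name `parse_timedelta` and the regular expression are illustrative, and other formats (months, years, a leading "- ") are not handled.

```python
import re
from datetime import timedelta

# Matches the output of str(timedelta): optional "N day(s), " then H:MM:SS[.ffffff]
_PATTERN = re.compile(
    r"^(?:(?P<days>-?\d+) days?, )?"
    r"(?P<hours>\d+):(?P<minutes>\d{2}):(?P<seconds>\d{2})"
    r"(?:\.(?P<micro>\d{1,6}))?$"
)


def parse_timedelta(text):
    """Parse a str(timedelta)-style string back into a timedelta."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError("unsupported timedelta string: {!r}".format(text))
    # Pad the fractional part to six digits so it can be read as microseconds.
    micro = (match.group("micro") or "0").ljust(6, "0")
    return timedelta(
        days=int(match.group("days") or 0),
        hours=int(match.group("hours")),
        minutes=int(match.group("minutes")),
        seconds=int(match.group("seconds")),
        microseconds=int(micro),
    )


td_strings = ["1 day, 21:21:00.561566", "11:19:26.056148"]
# sum() needs a timedelta start value, because the default start is the int 0.
total = sum((parse_timedelta(s) for s in td_strings), timedelta())
```

Once the values are timedelta objects, ordinary addition works, which is why the sum needs nothing beyond a timedelta() starting value.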

Python Dataframe Date plus months variable which comes from the other column

I have a dataframe with the date and month_diff variable. I would like to get a new date (name it as Target_Date) based on the following logic:
For example, the date is 2/13/2019, month_diff is 3, then the target date should be the month-end of the original date plus 3 months, which is 5/31/2019
I tried the following method to get the target date first:
df["Target_Date"] = df["Date"] + pd.DateOffset(months = df["month_diff"])
But it failed; as far as I know, the parameter in DateOffset should be a variable or a fixed number.
I also tried:
df["Target_Date"] = df["Date"] + relativedelta(months = df["month_diff"])
It fails too.
Can anyone help? Thank you.
edit:
This is a large dataset with millions of rows.
You could try this:
import pandas as pd
from dateutil.relativedelta import relativedelta
# pd.datetime is no longer available in recent pandas; pd.Timestamp works instead
df = pd.DataFrame({'Date': [pd.Timestamp(2019,1,1), pd.Timestamp(2019,2,1)], 'month_diff': [1,2]})
df.apply(lambda row: row.Date + relativedelta(months=row.month_diff), axis=1)
Or a list comprehension:
[date + relativedelta(months=month_diff) for date, month_diff in df[['Date', 'month_diff']].values]
I would use the following approach to compute your "target_date":
1. Apply the target month offset (in your case +3 months), using your pd.DateOffset.
2. Get the last day of that target month (using, for example, calendar.monthrange; see also "Get last day of the month"). This gives you the "flexible" part of the date offset.
3. Apply the flexible day offset, obtained by comparing the results of steps 1 and 2. This could be a new pd.DateOffset.
A solution could look something like this:
import calendar
from dateutil.relativedelta import relativedelta
for ii in df.index:
    new_ = df.at[ii, 'start_date'] + relativedelta(months=df.at[ii, 'month_diff'])
    max_date = calendar.monthrange(new_.year, new_.month)[1]
    end_ = new_ + relativedelta(days=max_date - new_.day)
    print(end_)
Further "cleaning" into a function and / or list comprehension will probably make it much faster
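As a rough sketch of that clean-up, the loop body can be wrapped in a function and used in a list comprehension. The name `month_end_after` and the sample rows are assumptions for illustration; whether this is actually faster on a large frame would need measuring.

```python
import calendar
from datetime import date

from dateutil.relativedelta import relativedelta


def month_end_after(start, months):
    """Shift start by a number of months, then move to the last day of that month."""
    shifted = start + relativedelta(months=months)
    last_day = calendar.monthrange(shifted.year, shifted.month)[1]
    return shifted.replace(day=last_day)


# Hypothetical (date, month_diff) pairs standing in for the DataFrame rows.
rows = [(date(2019, 2, 13), 3), (date(2019, 12, 31), 2)]
targets = [month_end_after(d, m) for d, m in rows]
```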
import pandas as pd
from datetime import datetime
from datetime import timedelta
This is my approach in solving your issue.
However for some reason I am getting a semantic error in my output even though I am sure it is the correct way. Please everyone correct me if you notice something wrong.
today = datetime.now()
today = today.strftime("%d/%m/%Y")
month_diff =[30,5,7]
n = 30
for i in month_diff:
    b = {'Date': today, 'month_diff':month_diff,"Target_Date": datetime.now()+timedelta(days=i*n)}
df = pd.DataFrame(data=b)
Output:
For some reason the i is not getting updated.
I was looking for a solution I could write in one line, and apply does the job. However, by default the apply function acts on each column, so you have to remember to specify the correct axis: axis=1.
from datetime import datetime
from dateutil.relativedelta import relativedelta
# Create a new column with the date shifted by the number of months in 'month_diff', then adjusted to the last day of the month
df['Target_Date'] = df.apply(lambda row: row.Date # to current date
                             + relativedelta(months=row.month_diff) # add month_diff
                             + relativedelta(day=+31) # and adjust to the last day of month
                             , axis=1) # 1 or 'columns': apply function to each row.
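Since the question mentions millions of rows, a row-wise apply may be slow. The following is a sketch of a vectorized alternative using NumPy month arithmetic; it assumes timezone-naive dates and integer month counts, and the frame `df_example` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data in the shape described in the question.
df_example = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2019-02-13", "2019-12-31"]),
        "month_diff": [3, 2],
    }
)

# Truncate each date to its month, then add the per-row month counts.
months = df_example["Date"].to_numpy().astype("datetime64[M]")
shifted = months + df_example["month_diff"].to_numpy().astype("timedelta64[M]")

# The first day of the following month minus one day is the last day of the target month.
next_month_start = (shifted + np.timedelta64(1, "M")).astype("datetime64[D]")
month_end = next_month_start - np.timedelta64(1, "D")

df_example["Target_Date"] = month_end
```

All of the arithmetic happens on whole arrays, so no Python function is called per row.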

How to check if a certain date is present in a dictionary and if not, return the closest date available?

I have a dictionary with many sorted dates. How could I write a loop in Python that checks whether a certain date is in the dictionary and, if not, returns the closest available date? I want it to subtract one day from the date, check again whether the new date exists in the dictionary, and keep subtracting until it finds an existing date.
Thanks in advance
from datetime import timedelta
def function(date):
    if date not in dictio:
        date -= timedelta(days=1)
    return date
I've made a recursive function to solve your problem:
import datetime
def find_date(date, date_dict):
    if date not in date_dict.keys():
        return find_date(date-datetime.timedelta(days=1), date_dict)
    else:
        return date
I don't know what your dictionary contains, but the following example should show you how this works:
import numpy as np
# creates a dictionary of random dates
months = np.random.randint(3,5,20)
days = np.random.randint(1,30,20)
dates = {
    datetime.date(2019,m,d): '{}_{:02}_{:02}'.format(2019,m,d)
    for m,d in zip(months,days)}
# select the date to find
target_date = datetime.date(2019, np.random.randint(3,5), np.random.randint(1,30))
# print the result
print("The date I wanted: {}".format(target_date))
print("The date I got: {}".format(find_date(target_date, dates)))
What you are looking for is possibly a while loop, but beware: if it never finds the date, it will run forever. Perhaps you want to define a limit of attempts after which the script should give up?
from datetime import timedelta, date
d1 = {
    date(2019, 4, 1): None
}
def function(date, dictio):
    while date not in dictio:
        date -= timedelta(days=1)
    return date
res_date = function(date.today(), d1)
print(res_date)
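A sketch of the "limit of attempts" idea: step back one day at a time, but give up after a fixed number of days and return None. The function name `closest_earlier_date` and the default of 365 days are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def closest_earlier_date(target, date_dict, max_days=365):
    """Return target or the nearest earlier key, or None if none is found within max_days."""
    current = target
    for _ in range(max_days + 1):
        if current in date_dict:
            return current
        current -= timedelta(days=1)
    return None


d1 = {date(2019, 4, 1): None}
found = closest_earlier_date(date(2019, 4, 10), d1)
missing = closest_earlier_date(date(2019, 3, 1), d1, max_days=30)
```

Returning None leaves it to the caller to decide what a miss means, instead of looping forever or hitting the recursion limit.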

Python Pandas: Using a user defined function to fill in a blank variable

I am trying to figure out a way to fill in a blank column using a user defined function. I have a Start Date column and an End Date column. The End Date is currently blank. The data has been read in as a csv into a pandas data-frame called df.
What I am wanting to do specifically is build a user defined function that takes the date in the Start Date column and adds 1 year to it and puts that into the end date column. Something to the effect of this:
Beginning Data-frame:
Start_Date End_Date
12/4/2013 NaN
07/16/2012 NaN
03/05/1999 NaN
Output with one year added:
Start_Date End_Date
12/04/2013 12/03/2014
07/16/2012 07/15/2013
03/05/1999 03/04/2000
I realize this can be done with the following code:
from datetime import timedelta
df['END_DATE'] = df['START_DATE'] + timedelta(days=365)
But I would really like to use a user defined function (if it is possible) along the lines of:
def add_1_year(x):
    ed = [x['START_DATE']+ timedelta(days=365)
    return pd.Series(ed)
df['END_DATE'].apply(add_1_year)
df[['START_DATE','END_DATE']]
I hope this makes sense; any suggestions will be greatly appreciated.
Thanks
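Assuming 'Start_Date' is already a datetime:
from datetime import timedelta
def add_1_year(x):
    x['End_Date'] = x['Start_Date'] + timedelta(days=365)
    return x
# apply returns a new DataFrame, so assign the result back
df = df.apply(add_1_year, axis=1)
That should do it.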
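If a calendar year is wanted rather than exactly 365 days, one option is a user-defined function that takes the whole column and adds pd.DateOffset(years=1). This is only a sketch: the sample frame is made up, and it lands on the same day of the following year, which may differ from the end dates shown in the question.

```python
import pandas as pd

# Hypothetical data in month/day/year format, parsed to datetimes first.
df_example = pd.DataFrame({"Start_Date": ["12/04/2013", "07/16/2012", "02/29/2012"]})
df_example["Start_Date"] = pd.to_datetime(df_example["Start_Date"], format="%m/%d/%Y")


def add_1_year(start_dates):
    """Add one calendar year to every date in the column."""
    return start_dates + pd.DateOffset(years=1)


df_example["End_Date"] = add_1_year(df_example["Start_Date"])
```

With a calendar offset, February 29 maps to February 28 of the following year, while timedelta(days=365) drifts by a day whenever the span covers a leap day.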
