Calculating difference in years for the whole dataframe - python

I have a dataframe with two dates and I want to add a new column that is the difference between the two in years.
birthDate | created_at | diff_in_years
2000-10 | 2019-06-17 13:15:04.598799+00:00 |
I have written the following code to compute the difference. Since I do not know the exact day from birthDate, I manually set it to 1 for both. It works great for one row.
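import datetime
from dateutil.relativedelta import relativedelta
def convert_to_datetime(s):
    # the first 7 characters are "YYYY-MM"; the day is pinned to 1
    x = int(s[0:4])
    y = int(s[5:7])
    z = 1
    return datetime.datetime(x, y, z)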
start_date = "2000-10"
start_date = convert_to_datetime(start_date)
end_date = "2019-06-17 13:15:04.598799+00:00"
end_date = convert_to_datetime(end_date)
diff = relativedelta(end_date,start_date)
But how can I run this computation for the whole dataframe? I've tried the apply function, but it doesn't work, so I must not be using it properly.
data.apply(relativedelta(convert_to_datetime(data["created_at"]),convert_to_datetime(data["birthDate"]), axis=1))

Try the following, using the pandas built-in function for the datetime conversion:
df['birthDate'] = pd.to_datetime(df['birthDate'])
# created_at carries a UTC offset; drop it so both columns are timezone-naive and can be subtracted
df['created_at'] = pd.to_datetime(df['created_at'], utc=True).dt.tz_localize(None)
# from here you can simply subtract
df['difference'] = df['created_at'] - df['birthDate']
# note that this gives a timedelta; take .dt.days and divide by 365.25 for an approximate number of years
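If whole elapsed years are wanted (what `relativedelta(...).years` would report) rather than days divided by a constant, the comparison can be done column-wise. The following is a minimal sketch under assumptions: the two-row sample frame is made up for illustration, and the names `birth`, `created` and `anniversary_pending` are not from the original post.

```python
import pandas as pd

# Made-up sample data in the same shape as the question's frame.
df = pd.DataFrame({
    "birthDate": ["2000-10", "1990-01"],
    "created_at": [
        "2019-06-17 13:15:04.598799+00:00",
        "2020-01-01 00:00:00.000000+00:00",
    ],
})

# "YYYY-MM" parses to the first day of that month.
birth = pd.to_datetime(df["birthDate"], format="%Y-%m")
# Parse as UTC, then drop the timezone so both columns are naive.
created = pd.to_datetime(df["created_at"], utc=True).dt.tz_localize(None)

# True where this year's anniversary has not been reached yet.
anniversary_pending = (created.dt.month < birth.dt.month) | (
    (created.dt.month == birth.dt.month) & (created.dt.day < birth.dt.day)
)

# Year difference, minus one where the anniversary is still pending.
df["diff_in_years"] = (
    created.dt.year - birth.dt.year - anniversary_pending.astype(int)
)
print(df[["birthDate", "diff_in_years"]])
```

No `apply` is needed here, because every step operates on whole columns.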

Related

simplify function in python to return mean age per product

I have data like so:
product|birth-date|id| etc. etc.
_____________________
tv | 01-01-2000|1|...
book | 30-04-1980|2|...
I want to create a function that sorts product type bought by mean age of the buyer, but as a result return a new data frame with those two columns.
Is there any way to simplify this to be one function instead of two?
The first one aims at creating an age column from the birth date column:
from datetime import date
def age(born):
    today = date.today()
    # subtract one year if this year's birthday has not happened yet
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))
df['age'] = pd.to_datetime(df.birth_date)
df['age'] = df['age'].apply(lambda x: age(x))
df['age'] = df['age'].astype('int')
The second function creates a new dataframe with the mean age per product:
def create_new_df(df):
    data = df.groupby(['product'])['age'].mean()
    new_df = pd.DataFrame(data)
    return new_df
create_new_df(df)
You can also compute the age this way:
df["age"] = (
    np.datetime64(dt.date.today())  # convert today's date to a numpy datetime64 so the subtraction is vectorized
    - pd.to_datetime(df.birth_date)  # the datetime column we want to subtract
).astype('<m8[Y]')  # convert the timedelta to years
This is slightly less readable (though the comments help), but it should be faster because the operations are vectorized. As soon as I get home I can compare with some tests.
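For the original request (one function instead of two), the age calculation and the grouping can be combined. This is a hedged sketch, not the answerer's code: the function name `mean_age_per_product`, the `today` parameter and the sample rows are assumptions added for illustration, and dates are assumed to be in day-month-year order as in the question's table. It avoids `astype('<m8[Y]')`, which recent pandas versions may reject.

```python
import pandas as pd

def mean_age_per_product(df, today=None):
    """Return one row per product with the mean buyer age, sorted by age."""
    today = pd.Timestamp.today().normalize() if today is None else pd.Timestamp(today)
    birth = pd.to_datetime(df["birth_date"], format="%d-%m-%Y")
    # True where this year's birthday is still ahead of `today`.
    birthday_pending = (birth.dt.month > today.month) | (
        (birth.dt.month == today.month) & (birth.dt.day > today.day)
    )
    age = today.year - birth.dt.year - birthday_pending.astype(int)
    return (
        df.assign(age=age)
        .groupby("product", as_index=False)["age"]
        .mean()
        .sort_values("age")
        .reset_index(drop=True)
    )

# Made-up sample rows; `today` is fixed so the result is reproducible.
df = pd.DataFrame({
    "product": ["tv", "book", "tv"],
    "birth_date": ["01-01-2000", "30-04-1980", "15-06-1990"],
})
result = mean_age_per_product(df, today="2020-03-01")
print(result)
```

Passing `today` explicitly is only a convenience for testing; leaving it out uses the current date.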

For a list of dates, check if it is between another list of 2 dates

I'm trying to compare 2 lists of dates by checking whether each date in the 'timekey' column of the first dataframe falls between 2 dates, where the 2 dates are a date in datelist and that same date minus 1 year.
An example would be checking if 30Aug2020 is between 30Nov2020 and 30Nov2020 minus 1 year, i.e. 30Nov2019.
I then want to have a 3rd column in the original df where it shows the difference between the timekey date and the compared timelist date.
I'm doing all of this in python using pandas.
import pandas as pd
import datetime as dt
datelist = pd.date_range(start = dt.datetime(2016,8,31), end = dt.datetime(2020,11,30), freq = '3M')
data = {'ID': ['1', '2', '3'], 'timekey': ['31Dec2016', '30Jun2017', '30Aug2018']}
df = pd.DataFrame(data)
df['timekey'] = pd.to_datetime(df['timekey'])
print(df)
print(datelist)
Here is the code I tried, but I get a ValueError saying that lengths must match to compare. What's going on?
for date in datelist:
    if (df['timekey'] <= datelist) & (df['timekey'] >= (datelist - pd.offsets.DateOffset(years=1))):
        df['diff'] = df['timekey'] - (datelist - pd.offsets.DateOffset(years=1))
The expected output should be that for each timekey, if it is within the date range specified by the datelist, it should generate an entire new row with the same ID and timekey with the 3rd new column being the difference in months.
For example, if the timekey is 30Jun2020, it would be between 30Nov2019-30Nov2020, 30Aug2019-30Aug2020. There would be 2 rows created whereby the time difference in months would be 5 and 2 respectively.
The easiest way I can think of to solve your problem is to compare Unix timestamps (the seconds elapsed since 1970-01-01), so you would first convert your dates to Unix time.
Something like this would work:
unixTime = (pd.to_datetime(<yourTime>) - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')
So a working example that checks whether a date is in between two dates could look like this:
def checkIfInbetween(date1, date2, dateToCheck):
    epoch = pd.Timestamp('1970-01-01')  # timezone-naive, to match timezone-naive input dates
    date1 = (pd.to_datetime(date1) - epoch) // pd.Timedelta('1s')
    date2 = (pd.to_datetime(date2) - epoch) // pd.Timedelta('1s')
    dateToCheck = (pd.to_datetime(dateToCheck) - epoch) // pd.Timedelta('1s')
    # True only when dateToCheck lies strictly between date1 and date2
    return date1 < dateToCheck < date2
# axis=1 passes each row (not each column) to the lambda
df['isInbetween'] = df.apply(lambda x: checkIfInbetween(x['date1'], x['date2'], x['dateToCheck']), axis=1)
(Code not tested)
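The expected output described in the question (one new row per matching window, with the difference in months) can also be produced without a loop by pairing every timekey with every window and filtering. This is a sketch under assumptions, not the answerer's method: the column names `window_end`, `window_start` and `diff_months` are invented here, the window end dates are written out as a short explicit list instead of `pd.date_range`, and `merge(how="cross")` needs pandas 1.2 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["1", "2"],
    "timekey": pd.to_datetime(["2016-12-31", "2017-06-30"]),
})

# Each window ends on one of these dates and starts one year earlier.
windows = pd.DataFrame({
    "window_end": pd.to_datetime(
        ["2016-11-30", "2017-02-28", "2017-05-31", "2017-08-31"]
    )
})
windows["window_start"] = windows["window_end"] - pd.DateOffset(years=1)

# Pair every row of df with every window, then keep the pairs
# where the timekey falls inside the window (bounds included).
pairs = df.merge(windows, how="cross")
inside = (pairs["timekey"] >= pairs["window_start"]) & (
    pairs["timekey"] <= pairs["window_end"]
)
result = pairs.loc[inside].copy()

# Difference in calendar months between window end and timekey.
result["diff_months"] = (
    (result["window_end"].dt.year - result["timekey"].dt.year) * 12
    + (result["window_end"].dt.month - result["timekey"].dt.month)
)
result = result.reset_index(drop=True)
print(result[["ID", "timekey", "window_end", "diff_months"]])
```

The cross join grows as rows times windows, so for large inputs it may be worth filtering the candidate windows first.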

PYTHON Numpy where time condition

I have the following target: I need to compare two date columns in the same table and create a 3rd column based on the result of the comparison. I do not know how to compare dates in a np.where statement.
This is my current code:
now = datetime.datetime.now() #set the date to compare
delta = datetime.timedelta(days=7) #set delta
time_delta = now+delta #now+7 days
And here is the np.where statement:
DB['s_date'] = np.where((DB['Start Date']<=time_delta | DB['Start Date'] = (None,"")),DB['Start Date'],RW['date'])
There is an OR condition to take into account the possibility that Start Date column might be empty
Would apply with a lambda work for you, Filippo? Applied to a single column (a Series), it calls a function of your choice on every value, and whatever the function returns fills the resulting Series.
def compare(date):
    # test for a missing value first, before attempting the <= comparison
    if pd.isnull(date) or date <= time_delta:
        pass  # return something
    else:
        pass  # return something else
DB['s_date'] = DB['Start Date'].apply(lambda x: compare(x))
EDIT: This will work as well (thanks EyuelDK)
DB['s_date'] = DB['Start Date'].apply(compare)
Thank you for the insights. I updated the code (and adjusted it for my purposes) as follows, and it works:
now = datetime.datetime.now() #set the date to compare
delta = datetime.timedelta(days=7) #set delta
time_delta = now+delta #now+7 days
DB['Start'] = np.where(((DB['Start Date']<=time_delta) | (DB['Start Date'].isnull()) | (DB['Start Date'] == "")),DB['Start'],DB['Start Date'])
The key was to wrap each condition separated by | in parentheses. Without them, | binds more tightly than <=, so the expression ended up comparing two different data types and raised an error.
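A small self-contained illustration of the parenthesised conditions follows. It is only a sketch: the frame, the `Fallback` column and the fixed `cutoff` timestamp are assumptions made up for the example, standing in for `RW['date']` and `now + delta`.

```python
import numpy as np
import pandas as pd

cutoff = pd.Timestamp("2024-01-08")  # stands in for now + 7 days

DB = pd.DataFrame({
    "Start Date": pd.to_datetime(["2024-01-05", None, "2024-02-01"]),
    "Fallback": pd.to_datetime(["2024-03-01", "2024-03-02", "2024-03-03"]),
})

# Each comparison is evaluated on its own before the results are OR-ed.
keep_start = (DB["Start Date"] <= cutoff) | (DB["Start Date"].isna())

# Where keep_start is True take "Start Date", otherwise take "Fallback".
DB["s_date"] = np.where(keep_start, DB["Start Date"], DB["Fallback"])
print(DB)
```

Building the boolean mask on its own line first keeps the `np.where` call short and makes each condition easy to inspect.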

Loop over a list in order to apply the same function to multiple Datasets

I'm looking for a way to apply a function to multiple RDDs (RDD: Resilient Distributed Dataset). I'm using PySpark and I have to get 6 new RDDs by applying the same function to all of my original datasets. I have something like this:
def define_CohortPeriods(d_date):
    # do something
    return something
if __name__ == '__main__':
    try:
        first_OrderPeriod = define_CohortPeriods(d_date = '2016-10-19')
        second_OrderPeriod = define_CohortPeriods(d_date = '2016-10-20')
        third_OrderPeriod = define_CohortPeriods(d_date = '2016-10-21')
        fourth_OrderPeriod = define_CohortPeriods(d_date = '2016-10-22')
        fifth_OrderPeriod = define_CohortPeriods(d_date = '2016-10-23')
        sixth_OrderPeriod = define_CohortPeriods(d_date = '2016-10-24')
    except ValueError:
        print("Error")
I want to give just two arguments to my code, for example the first and the last date, and do something like this:
import datetime
from datetime import timedelta as td
first_date = datetime.datetime.strptime('2016-10-19', '%Y-%m-%d')
last_date = datetime.datetime.strptime('2016-10-24', '%Y-%m-%d')
deltaDate = last_date - first_date
for i in range(deltaDate.days + 1):
    print(first_date + td(days=i))
which gives :
2016-10-19 00:00:00
2016-10-20 00:00:00
2016-10-21 00:00:00
2016-10-22 00:00:00
2016-10-23 00:00:00
2016-10-24 00:00:00
Finally, I want to iterate through this list of dates, pass each date in turn as d_date, and get my expected outputs separately: first_OrderPeriod, second_OrderPeriod, third_OrderPeriod, etc.
What is the most efficient way to do this? Thanks!
Use a list to store your orderPeriod values and then access them by index. Since we're storing them in a list, we can use a list comprehension to build that list.
dates_list = [first_date + td(days=i) for i in range(deltaDate.days + 1)]
orderPeriods = [define_CohortPeriods(d_date) for d_date in dates_list]
It's not quite clear whether define_CohortPeriods accepts strings or datetime objects. You should probably be using date objects for both though, as you aren't using the time part of the datetime object.
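If looking results up by date is more convenient than by position, a dictionary keyed by date is one alternative to the list. The sketch below uses plain `date` objects, as suggested above. The body of `define_cohort_periods` is a hypothetical stand-in that only returns a label, since the real function is not shown in the question.

```python
from datetime import date, timedelta

def define_cohort_periods(d_date):
    # Hypothetical stand-in: the real function would build an RDD.
    return "cohort for " + d_date.isoformat()

first_date = date(2016, 10, 19)
last_date = date(2016, 10, 24)

# Every day from first_date to last_date, both included.
n_days = (last_date - first_date).days + 1
dates_list = [first_date + timedelta(days=i) for i in range(n_days)]

# One result per date, retrievable by the date itself.
order_periods = {d: define_cohort_periods(d) for d in dates_list}

print(order_periods[date(2016, 10, 21)])
```

The first and last dates are the only two inputs, which matches the two-argument interface asked for in the question.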

Python Pandas: Using a user defined function to fill in a blank variable

I am trying to figure out a way to fill in a blank column using a user defined function. I have a Start Date column and an End Date column. The End Date is currently blank. The data has been read in as a csv into a pandas data-frame called df.
What I am wanting to do specifically is build a user defined function that takes the date in the Start Date column and adds 1 year to it and puts that into the end date column. Something to the effect of this:
Beginning Data-frame:
Start_Date End_Date
12/4/2013 NaN
07/16/2012 NaN
03/05/1999 NaN
Output with one year added:
Start_Date End_Date
12/04/2013 12/03/2014
07/16/2012 07/15/2013
03/05/1999 03/04/2000
I realize this can be done with the following code:
from datetime import timedelta
df['END_DATE'] = df['START_DATE'] + timedelta(days=365)
But I would really like to use a user defined function (if it is possible) along the lines of:
def add_1_year(x):
    ed = [x['START_DATE'] + timedelta(days=365)]
    return pd.Series(ed)
df['END_DATE'].apply(add_1_year)
df[['START_DATE','END_DATE']]
I hope this makes sense; any suggestions will be greatly appreciated.
Thanks
Assuming 'Start_Date' is already a datetime:
def add_1_year(x):
    x['End_Date'] = x['Start_Date'] + timedelta(days=365)
    return x
df = df.apply(add_1_year, axis=1)  # apply returns a new frame, so assign it back
That should do it.
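A user-defined function can also be applied to the single column instead of to whole rows. The sketch below swaps in `pd.DateOffset(years=1)` for `timedelta(days=365)`, which is a different technique from the answer above: it moves to the same calendar day one year later, so the result differs by a day when a 29 February falls inside the interval. The sample frame is rebuilt from the dates in the question.

```python
import pandas as pd

def add_1_year(start):
    # Calendar-aware: lands on the same month and day, one year later.
    return start + pd.DateOffset(years=1)

df = pd.DataFrame({
    "Start_Date": pd.to_datetime(
        ["12/04/2013", "07/16/2012", "03/05/1999"], format="%m/%d/%Y"
    )
})

# Series.apply calls the function once per value in the column.
df["End_Date"] = df["Start_Date"].apply(add_1_year)
print(df)
```

Which of the two behaviours is wanted depends on whether "one year" means 365 days or one calendar year.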
