How to subtract month correctly in Pandas - python

My dataframe has two datetime columns. When I subtract them to get the number of months in between, I get some weird numbers. Here is an example:
test = pd.DataFrame({'reg_date': [datetime(2017,3,1), datetime(2016,9,1)],
'leave_date':[datetime(2017,7,1), datetime(2017,6,1)]})
test['diff_month'] = test.leave_date.dt.month - test.reg_date.dt.month
test
The output:
If a user's reg_date is in the previous year, I get a negative number, which is also incorrect.
What operation should I perform to get the correct time difference in months between two datetime columns?
Update: I changed the example a bit so it reflects more about the issue I am facing. Don't down vote so fast guys.
A hack I did to fix this is:
test['real_diff'] = test.diff_month.apply(lambda x: x if x > 0 else 12+x)
I don't like the hack so I am curious if there is any other way of doing it.

IIUC you can call apply and use relativedelta as @zipa suggested:
In[29]:
from dateutil import relativedelta
test['real_diff'] = test.apply(lambda row: relativedelta.relativedelta(row['leave_date'], row['reg_date']).months, axis=1)
test
Out[29]:
leave_date reg_date real_diff
0 2017-07-01 2017-03-01 4
1 2017-06-01 2016-09-01 9

To get your result you can use relativedelta from dateutil:
import datetime
from dateutil import relativedelta
a = datetime.datetime(2016, 12, 1)
b = datetime.datetime(2017, 5, 1)
relativedelta.relativedelta(b, a).months
#5
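One caveat with the answers above: relativedelta(...).months is only the months component of the difference (the years are held separately in .years), so for spans longer than a year it has to be combined as years * 12 + months. A minimal vectorized sketch that avoids apply, assuming the difference should count calendar months and ignore the day of the month; the third row is added here purely to illustrate a span of more than one year:

```python
from datetime import datetime

import pandas as pd

test = pd.DataFrame({
    'reg_date': [datetime(2017, 3, 1), datetime(2016, 9, 1), datetime(2015, 1, 1)],
    'leave_date': [datetime(2017, 7, 1), datetime(2017, 6, 1), datetime(2017, 3, 1)],
})

# Calendar months between the columns: 12 per year of difference
# plus the difference of the month numbers (day of month is ignored).
test['diff_month'] = (
    (test['leave_date'].dt.year - test['reg_date'].dt.year) * 12
    + (test['leave_date'].dt.month - test['reg_date'].dt.month)
)

print(test['diff_month'].tolist())  # [4, 9, 26]
```

Because this works on whole columns, it should also scale better than a row-wise apply on large frames.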

Related

Python Dataframe Date plus months variable which comes from the other column

I have a dataframe with the date and month_diff variable. I would like to get a new date (name it as Target_Date) based on the following logic:
For example, the date is 2/13/2019, month_diff is 3, then the target date should be the month-end of the original date plus 3 months, which is 5/31/2019
I tried the following method to get the target date first:
df["Target_Date"] = df["Date"] + pd.DateOffset(months = df["month_diff"])
But it failed. As far as I know, the parameter of DateOffset has to be a single variable or a fixed number, not a column.
I also tried:
df["Target_Date"] = df["Date"] + relativedelta(months = df["month_diff"])
It fails too.
Can anyone help? Thank you.
Edit:
This is a large dataset with millions of rows.
You could try this
import pandas as pd
from dateutil.relativedelta import relativedelta
# pd.datetime was removed from recent pandas versions; pd.Timestamp works everywhere
df = pd.DataFrame({'Date': [pd.Timestamp(2019, 1, 1), pd.Timestamp(2019, 2, 1)], 'month_diff': [1, 2]})
df.apply(lambda row: row.Date + relativedelta(months=row.month_diff), axis=1)
Or list comprehension
[date + relativedelta(months=month_diff) for date, month_diff in df[['Date', 'month_diff']].values]
I would compute your "target_date" in three steps:
1. Apply the target month offset (in your case +3 months), using your pd.DateOffset.
2. Get the last day of that target month (using for example calendar.monthrange, see also "Get last day of the month"). This gives you the "flexible" part of the date offset.
3. Apply the flexible day offset, i.e. the difference between the results of step 1 and step 2. This could be a new pd.DateOffset.
A solution could look something like this:
import calendar
from dateutil.relativedelta import relativedelta
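for ii in df.index:
    # Step 1: shift the date by the row's month offset
    new_ = df.at[ii, 'start_date'] + relativedelta(months=df.at[ii, 'month_diff'])
    # Step 2: number of days in the target month
    max_date = calendar.monthrange(new_.year, new_.month)[1]
    # Step 3: move forward to the last day of that month
    end_ = new_ + relativedelta(days=max_date - new_.day)
    print(end_)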
Further "cleaning" into a function and / or list comprehension will probably make it much faster
import pandas as pd
from datetime import datetime
from datetime import timedelta
This is my approach in solving your issue.
However for some reason I am getting a semantic error in my output even though I am sure it is the correct way. Please everyone correct me if you notice something wrong.
today = datetime.now()
today = today.strftime("%d/%m/%Y")
month_diff =[30,5,7]
n = 30
for i in month_diff:
    b = {'Date': today, 'month_diff':month_diff,"Target_Date": datetime.now()+timedelta(days=i*n)}
df = pd.DataFrame(data=b)
Output:
For some reason the i is not getting updated.
I was looking for a solution I can write in one line only, and apply does the job. However, by default apply operates on each column, so you have to remember to specify the correct axis: axis=1.
from datetime import datetime
from dateutil.relativedelta import relativedelta
# Create a new column with date adjusted by number of months from 'month_diff' column and later adjust to the last day of month
df['Target_Date'] = df.apply(lambda row: row.Date # to current date
+ relativedelta(months=row.month_diff) # add month_diff
+ relativedelta(day=+31) # and adjust to the last day of month
, axis=1) # 1 or ‘columns’: apply function to each row.
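Since the question mentions millions of rows, a row-wise apply may be slow. A minimal vectorized sketch of the same month-end logic using NumPy month arithmetic, assuming the Date column is already datetime64 and month_diff holds integers; the sample rows and the intermediate names (months, offsets, next_month_start) are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2019-02-13', '2019-11-30']),
    'month_diff': [3, 2],
})

# Truncate each date to its month and turn the offsets into month deltas.
months = df['Date'].to_numpy().astype('datetime64[M]')
offsets = df['month_diff'].to_numpy().astype('timedelta64[M]')

# First day of the month after the target month, minus one day,
# is the last day of the target month.
next_month_start = (months + offsets + np.timedelta64(1, 'M')).astype('datetime64[D]')
df['Target_Date'] = pd.to_datetime(next_month_start - np.timedelta64(1, 'D'))

print(df)
```

For the example in the question, 2/13/2019 with month_diff 3 ends up as 2019-05-31.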

How to Parse 0 hour with dateutil

I'm trying to merge my dataframe columns which contain time info (UTC) into a single column containing datetime object/string. The columns of my df are like this:
YY MM DD HH
98 12 05 11
98 12 05 10
So, I would like a single column containing that time information.
What I've tried so far:
I've merged into a string so that I can parse them into a datetime object by
from dateutil.parser import parse
d_test = (list(df[0].map(str) + " " + df[1].map(str) + " " + df[2].map(str)
+ " " + df[3].map(str)))
Now I just have to parse the list of date strings
parse_d = []
for d in d_test:
parse_d.append(parse(d))
But this raises an "unknown string error". I looked into it and it arises because some of the dates are like:
d_test[5] = '98 12 5 0'
I've tried reading the detailed documentation of dateutil (https://labix.org/python-dateutil) and what I understood is that I've to make a dictionary specifying the timezone as key (UTC in my case) and that might solve the error.
tzinfo ={}
parse(d_test[5], tzinfo=tzinfo)
Maybe, I'm missing something very basic but I'm not able to understand how to create this dictionary.
In general, if you know the format of a string, you don't need to use dateutil.parser.parse to parse it, because you can use datetime.strptime with a specified format string.
In this case, the only slightly unfortunate thing is that you have 2-digit years, some of which are from before 2000. In this case, I'd probably do something like this:
cent_21_mask = df['YY'] < 50
df.loc[:, 'YY'] = df.loc[:, 'YY'] + 1900
df.loc[cent_21_mask, 'YY'] = df.loc[cent_21_mask, 'YY'] + 100
Once you've done that, you can use one of the solutions from this question (specifically this one) to convert your individual datetime columns into pandas Timestamps / datetimes.
If these are in UTC, you then use pandas.Series.dt.tz_localize with 'UTC' to get timezone-aware datetimes.
Putting it all together:
import pandas as pd
df = pd.DataFrame(
[[98, 12, 5, 11],
[98, 12, 5, 10],
[4, 12, 5, 00]],
columns=['YY', 'MM', 'DD', 'HH'])
# Convert 2-digit years to 4-digit years
cent_21_mask = df['YY'] < 50
df.loc[:, 'YY'] = df.loc[:, 'YY'] + 1900
df.loc[cent_21_mask, 'YY'] = df.loc[cent_21_mask, 'YY'] + 100
# Retrieve the date columns and rename them
col_renames = {'YY': 'year', 'MM': 'month', 'DD': 'day', 'HH': 'hour'}
dt_subset = df.loc[:, list(col_renames.keys())].rename(columns=col_renames)
dt_series = pd.to_datetime(dt_subset)
# Convert to UTC
dt_series = dt_series.dt.tz_localize('UTC')
# Result:
# 0 1998-12-05 11:00:00+00:00
# 1 1998-12-05 10:00:00+00:00
# 2 2004-12-05 00:00:00+00:00
# dtype: datetime64[ns, UTC]
Also, to clarify two things about this statement:
I've tried reading the detailed documentation of dateutil (https://labix.org/python-dateutil) and what I understood is that I've to make a dictionary specifying the timezone as key (UTC in my case) and that might solve the error.
The correct documentation for python-dateutil is now https://dateutil.readthedocs.io.
If you are using parse, in your situation there is no reason to add UTC into a dictionary and pass it to tzinfos. If you know that your datetimes are going to be naive but that they represent times in UTC, parse them as normal to get naive datetimes, then use datetime.replace(tzinfo=dateutil.tz.tzutc()) to get aware datetimes. The tzinfos dictionary is for when the timezone information is actually represented in the string.
An example of what to do when you have strings representing UTC that don't contain timezone information:
from dateutil.parser import parse
from dateutil import tz
dt = parse('1998-12-05 11:00')
dt = dt.replace(tzinfo=tz.tzutc())
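To illustrate the earlier point about datetime.strptime, here is a minimal sketch for the space-separated strings from the question, including the '98 12 5 0' case that dateutil's parse rejected. The helper name parse_yy_mm_dd_hh is hypothetical. Note that %y uses its own century rule (69 to 99 map to 19xx, 00 to 68 map to 20xx), which differs from the cutoff of 50 used in the mask above, so treat this as an alternative under that assumption rather than a drop-in equivalent:

```python
from datetime import datetime, timezone


def parse_yy_mm_dd_hh(text):
    """Parse 'YY MM DD HH' (fields need not be zero-padded) as a UTC datetime."""
    naive = datetime.strptime(text, '%y %m %d %H')
    # The strings carry no timezone; the question states they are UTC.
    return naive.replace(tzinfo=timezone.utc)


print(parse_yy_mm_dd_hh('98 12 5 0'))
print(parse_yy_mm_dd_hh('04 12 05 11'))
```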
How about if you parse the date in this format?
parse("98/12/05 00h")

Get the Week Number between two Dates Pandas

I have a basic code snippet that I need to recreate in pandas:
import datetime as dt
date1 = dt.date(2016,10,10)
date2 = dt.date.today()
print('Week#', round((date2 - date1).days / 7 +.5))
# output: Week# 36
I have a datetime64[ns] datatype column and I cannot crack it. Using this basic example I'm stumped:
import pandas as pd
import numpy as np
import datetime as dt
dfp = pd.DataFrame({'A' : [dt.date(2016,10,6)]})
dfp['A'] = pd.to_datetime(dfp['A'])
def week(col):
print((col.dt.date - dt.date(2015, 10, 6)))
week(dfp['A']) #output: 366 days
When I try re-creating the week number calculation I'm running into multiple errors: print((col.dt.date - dt.date(2015, 10, 6)).days) returns AttributeError: 'Series' object has no attribute 'days'
I'd like to try and solve this on my own so I can learn from the pain.
How do I return the pandas column values in terms of "number of days" or as an int like using the first calculation in the first code snippet? (ie, instead of 366 days, just 366)
If you're feeling adventurous, how could I get the Week# xxx output in a more efficient way?
I think you forgot .dt:
dfp = pd.DataFrame({'A' : [date2]})
dfp['A'] = pd.to_datetime(dfp['A'])
print (dfp)
print (((dfp['A'].dt.date - dt.date(2016, 10, 10)).dt.days / 7 + .5).round().astype(int))
0 36
Name: A, dtype: int32
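Depending on the pandas version, subtracting a datetime.date from col.dt.date may produce an object column on which .dt is not available. A sketch that stays in datetime64 the whole way by subtracting a pd.Timestamp instead; the fixed sample dates are assumptions chosen so the result is reproducible (the original used today's date):

```python
import pandas as pd

dfp = pd.DataFrame({'A': pd.to_datetime(['2017-06-15', '2016-10-20'])})
start = pd.Timestamp('2016-10-10')

# datetime64 column minus a Timestamp gives a timedelta64 column,
# so .dt.days returns plain integers (e.g. 248 instead of "248 days").
elapsed_days = (dfp['A'] - start).dt.days

# Same formula as the plain-Python snippet in the question.
dfp['week'] = (elapsed_days / 7 + 0.5).round().astype(int)

print(dfp['week'].tolist())  # [36, 2]
```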

Converting date formats python - Unusual date formats - Extract %Y%M%D

I have a large data set with a variety of Date information in the following formats:
DAYS since Jan 1, 1900 - ex: 41213 - I believe these are from Excel http://www.kirix.com/stratablog/jd-edwards-date-conversions-cyyddd
YYDayofyear - ex 2012265
I am familiar with Python's time module and the strptime() and strftime() methods. However, I am not sure what the date formats above are called, or whether there is a Python module I can use to convert them.
Any idea how to get the %Y%M%D format from these unusual date formats without writing my own calculator?
Thanks.
You can try something like the following:
In [1]: import datetime
In [2]: s = '2012265'
In [3]: datetime.datetime.strptime(s, '%Y%j')
Out[3]: datetime.datetime(2012, 9, 21, 0, 0)
In [4]: d = '41213'
In [5]: datetime.date(1900, 1, 1) + datetime.timedelta(int(d))
Out[5]: datetime.date(2012, 11, 2)
The first one is the trickier one, but it uses the %j parameter to interpret the day of the year you provide (after a four-digit year, represented by %Y). The second one is simply the number of days since January 1, 1900.
This is the general conversion - not sure of your input format but hopefully this can be tweaked to suit it.
On the Excel integer to Python datetime bit:
Note that there are two Excel date systems (one 1-Jan-1900 based and another 1-Jan 1904 based); see https://support.microsoft.com/en-us/help/214330/differences-between-the-1900-and-the-1904-date-system-in-excel for more information.
Also note that the system is NOT zero-based. So, in the 1900 system, 1-Jan-1900 is day 1 (not day 0).
import datetime
EXCEL_DATE_SYSTEM_PC=1900
EXCEL_DATE_SYSTEM_MAC=1904
i = 42129 # Excel number for 5-May-2015
d = datetime.date(EXCEL_DATE_SYSTEM_PC, 1, 1) + datetime.timedelta(i-2)
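The i-2 above comes from two things: the 1900 system starts counting at 1, and Excel also counts a 29-Feb-1900 that never existed. If the numbers really are Excel serials, a common way to fold both corrections together is to count from 30-Dec-1899. A minimal sketch under that assumption; the function and constant names are hypothetical, and the 1900 branch is only valid for serials after the phantom leap day (61 and up):

```python
from datetime import date, timedelta

# 1900 system: serial 1 is 1-Jan-1900 and a non-existent 29-Feb-1900 is counted,
# so for serials >= 61 the effective day zero is 30-Dec-1899.
EXCEL_1900_EPOCH = date(1899, 12, 30)
# 1904 system: serial 0 is 1-Jan-1904.
EXCEL_1904_EPOCH = date(1904, 1, 1)


def excel_serial_to_date(serial, date_system=1900):
    """Convert a whole-number Excel serial to a datetime.date."""
    epoch = EXCEL_1900_EPOCH if date_system == 1900 else EXCEL_1904_EPOCH
    return epoch + timedelta(days=serial)


print(excel_serial_to_date(42129))  # 2015-05-05
print(excel_serial_to_date(41213))  # 2012-10-31
```

Under this reading, 41213 from the question is 31-Oct-2012, whereas treating it literally as days since Jan 1, 1900 gives 2-Nov-2012, so it matters which convention the data actually follows.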
Both of these formats seem pretty straightforward to work with. The first one, in fact, is just an integer, so why don't you just do something like this?
import datetime
def days_since_jan_1_1900_to_datetime(d):
    return datetime.datetime(1900, 1, 1) + datetime.timedelta(days=d)
For the second one, the details depend on exactly how the format is defined (e.g. can you always expect 3 digits after the year even when the number of days is less than 100, or is it possible that there are 2 or 1 – and if so, is the year always 4 digits?) but once you've got that part down it can be done very similarly.
According to http://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior, day of the year is "%j", whereas the first case can be solved by toordinal() and fromordinal(): date.fromordinal(date(1900, 1, 1).toordinal() + x)
I'd think timedelta.
import datetime
d = datetime.timedelta(days=41213)
start = datetime.datetime(year=1900, month=1, day=1)
the_date = start + d
For the second one, you can slice the string, e.g. '2012265'[:4], to get the year and use the same method with the remaining day-of-year digits.
edit: See the answer with %j for the second.
import pandas as pd
from datetime import datetime
df['timeelapsed'] = (pd.to_datetime(df['timeelapsed'], format='%H:%M:%S') - datetime(1900, 1, 1)).dt.total_seconds()

How to convert Year and Day of Year to Date?

I have a year value and a day of year and would like to convert to a date (day/month/year).
datetime.datetime(year, 1, 1) + datetime.timedelta(days - 1)
>>> import datetime
>>> datetime.datetime.strptime('2010 120', '%Y %j')
datetime.datetime(2010, 4, 30, 0, 0)
>>> _.strftime('%d/%m/%Y')
'30/04/2010'
The toordinal() and fromordinal() functions of the date class could be used:
from datetime import date
date.fromordinal(date(year, 1, 1).toordinal() + days - 1)
Since it is pretty common these days, here is a pandas option, using pd.to_datetime with a specified unit and origin:
import pandas as pd
day, year = 21, 2021
print(pd.to_datetime(day-1, unit='D', origin=str(year)))
# 2021-01-21 00:00:00
import datetime
# The reverse direction: compute the day of the year from a calendar date
year = int(input())
month = int(input())
day = int(input())
data = datetime.datetime(year, month, day)
daynew = data.toordinal()
yearstart = datetime.datetime(year, 1, 1)
day_yearstart = yearstart.toordinal()
print((daynew - day_yearstart) + 1)
Using the mx.DateTime module to get the date is similar to what has been proposed above using datetime and timedelta. Namely:
import mx.DateTime as dt
date = dt.DateTime(yyyy,mm,dd) + dt.DateTimeDeltaFromDays(doy-1)
So, given that you know the year (say, 2020) and the doy (day of the year, say 234), then:
date = dt.DateTime(2020,1,1) + dt.DateTimeDeltaFromDays(233)
which returns
2020-08-21 00:00:00.00
The advantage of the mx.DateTime library is that it has many useful features. As per the description on its homepage:
Parses date/time string values in an almost seamless way.
Provides conversion routines to and from many different alternative date/time storage formats.
Includes an easy-to-use C API which makes integration a breeze.
Fast, memory efficient, accurate.
Gregorian and Julian calendar support.
Vast range of valid dates (including B.C. dates).
Stable, robust and portable (mxDateTime has been around for almost 15 years now).
Fully interoperates with Python's time and datetime modules.
