pandas: convert datetime to end-of-month - python

I have written a function to convert pandas datetime dates to month-end:
import pandas
import numpy
import datetime
from pandas.tseries.offsets import Day, MonthEnd
def get_month_end(d):
    month_end = d - Day() + MonthEnd()
    if month_end.month == d.month:
        return month_end  # 31/March + MonthEnd() returns 30/April, hence the - Day()
    else:
        # format the dates explicitly; concatenating str and Timestamp fails
        raise ValueError("Something went wrong while converting dates to EOM: "
                         "%s was converted to %s" % (d, month_end))
This function seems to be quite slow, and I was wondering whether there is a faster alternative. I noticed because I am running it on a DataFrame column with 50,000 dates, and the code has been much slower since I introduced that function to convert the dates to end-of-month.
df = pandas.read_csv(inpath, na_values = nas, converters = {open_date: read_as_date})
df[open_date] = df[open_date].apply(get_month_end)
I am not sure if that's relevant, but I am reading the dates in as follows:
def read_as_date(x):
    return datetime.datetime.strptime(x, fmt)

Revised, converting to period and then back to timestamp does the trick
In [104]: df = DataFrame(dict(date = [Timestamp('20130101'),Timestamp('20130131'),Timestamp('20130331'),Timestamp('20130330')],value=randn(4))).set_index('date')
In [105]: df
Out[105]:
               value
date
2013-01-01 -0.346980
2013-01-31  1.954909
2013-03-31 -0.505037
2013-03-30  2.545073
In [106]: df.index = df.index.to_period('M').to_timestamp('M')
In [107]: df
Out[107]:
               value
2013-01-31 -0.346980
2013-01-31  1.954909
2013-03-31 -0.505037
2013-03-31  2.545073
Note that this type of conversion can also be done as follows, although the approach above should be slightly faster.
In [85]: df.index + pd.offsets.MonthEnd(0)
Out[85]: DatetimeIndex(['2013-01-31', '2013-01-31', '2013-03-31', '2013-03-31'], dtype='datetime64[ns]', name=u'date', freq=None, tz=None)
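To make the difference between MonthEnd(0) and MonthEnd(1) concrete, here is a minimal sketch on an ordinary column rather than the index. The frame and the column names (date, eom, eom_plus_one) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample column: midnight timestamps, two of them already at month end.
df = pd.DataFrame({"date": pd.to_datetime(
    ["2013-01-01", "2013-01-31", "2013-03-30", "2013-03-31"])})

# MonthEnd(0) rolls forward to the month end and leaves month-end dates alone.
df["eom"] = df["date"] + pd.offsets.MonthEnd(0)

# MonthEnd(1) always moves forward, so a month-end date lands in the next month.
df["eom_plus_one"] = df["date"] + pd.offsets.MonthEnd(1)

print(df)
```

Both additions are vectorized, so no per-row apply is needed. This is presumably why the function in the question subtracts a day before adding MonthEnd().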

If the date column has a datetime dtype and holds the first day of each month, this moves each date to the last day of the same month:
df['date1']=df['date'] + pd.offsets.MonthEnd(0)

import pandas as pd
df0['Calendar day'] = pd.to_datetime(df0['Calendar day'], format='%m/%d/%Y')
# pd.datetools.normalize_date is deprecated; .dt.normalize() is the vectorized replacement
df0['Calendar day'] = df0['Calendar day'].dt.normalize()
df0['Month Start Date'] = df0['Calendar day'].dt.to_period('M').dt.start_time
This code should work. "Calendar day" is a column whose dates are given in the format %m/%d/%Y; for example, 12/28/2014 is 28 December 2014. The output for that date is 2014-12-01, as a pandas Timestamp. Note that this yields the first day of the month, not the last.

You can also use NumPy to do it faster (this truncates each date to the start of its month):
import numpy as np
date_array = np.array(['2013-01-01', '2013-01-15', '2013-01-30']).astype('datetime64[ns]')
month_start_date = date_array.astype('datetime64[M]')
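The same truncation can be extended to produce month ends. The sketch below is one possible way to do it with plain NumPy datetime arithmetic; the sample dates are arbitrary.

```python
import numpy as np

dates = np.array(["2013-01-15", "2012-02-10", "2013-12-31"], dtype="datetime64[D]")

# Truncate to month precision (the first day of each month).
month_start = dates.astype("datetime64[M]")

# Step to the next month, return to day precision, then back up one day.
month_end = (month_start + 1).astype("datetime64[D]") - 1

print(month_end)
```

Adding an integer to a datetime64[M] array moves it by whole months, which is what makes the "first day of next month minus one day" trick work, leap years included.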

In case the date is not in the index but in another column (works for Pandas 0.25.0):
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(date = [pd.Timestamp('20130101'),
                               pd.Timestamp('20130201'),
                               pd.Timestamp('20130301'),
                               pd.Timestamp('20130401')],
                       value = np.random.rand(4)))
print(df.to_string())
df.date = df.date.dt.to_period('M').dt.to_timestamp('M')
print(df.to_string())
Output:
        date     value
0 2013-01-01  0.295791
1 2013-02-01  0.278883
2 2013-03-01  0.708943
3 2013-04-01  0.483467
        date     value
0 2013-01-31  0.295791
1 2013-02-28  0.278883
2 2013-03-31  0.708943
3 2013-04-30  0.483467

What you are looking for might be:
df.resample('M').last()
The other method, as mentioned earlier by @Jeff:
df.index = df.index.to_period('M').to_timestamp('M')

Related

Subtract 2 datetime lists dd/mm/YYYY in pandas

Basically, I have two DataFrame columns containing dates. The values are strings in dd/mm/YYYY format, and I want to subtract one column from the other. I can't subtract strings, so I converted them to datetime, but after the conversion the dates appear in YYYY-dd-mm format, so the subtraction gives the wrong result. For example:
Initial content:
a: 05/09/2021
b: 30/09/2021
Expected result: 25 days.
Converted to datetime:
a: 2021-05-09
b: 2021-09-30 (for some reason this date stays the same)
Result: 144 days.
I'm using pandas and datetime for this project.
So I would like to know how to subtract these two columns and get the correct result.
--- Answer
When I used
pd.to_datetime(date, format="%d/%m/%Y")
It worked. Thank you all for your time. This is my first project in pandas. :)
df = pd.DataFrame({'Date1': ['05/09/2021'], 'Date2': ['30/09/2021']})
df = df.apply(lambda x:pd.to_datetime(x,format=r'%d/%m/%Y')).assign(Delta=lambda x: (x.Date2-x.Date1).dt.days)
print(df)
       Date1      Date2  Delta
0 2021-09-05 2021-09-30     25
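To see where the 144 days in the question come from, the following sketch parses the same string both ways. It assumes the default month-first reading of ambiguous strings in pd.to_datetime; the variable names are illustrative only.

```python
import pandas as pd

a, b = "05/09/2021", "30/09/2021"  # day/month/year strings

# Without a format, an ambiguous string is read month-first: 9 May 2021.
a_month_first = pd.to_datetime(a)

# An explicit format removes the ambiguity: 5 September 2021.
a_day_first = pd.to_datetime(a, format="%d/%m/%Y")
b_day_first = pd.to_datetime(b, format="%d/%m/%Y")

print((b_day_first - a_month_first).days)  # the surprising result
print((b_day_first - a_day_first).days)    # the intended result
```

The second date has no valid month-first reading (there is no month 30), which is why it appeared to "stay the same" in the question. Passing dayfirst=True to pd.to_datetime is an alternative to spelling out the format.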
I just answered a similar query here subtracting dates in python
from datetime import datetime
date_format_str = '%Y-%m-%d %H:%M:%S.%f'
date_1 = '2016-09-24 17:42:27.839496'
date_2 = '2017-01-18 10:24:08.629327'
start = datetime.strptime(date_1, date_format_str)
end = datetime.strptime(date_2, date_format_str)
# get the interval between the two timestamps as a timedelta object
diff = end - start
# express the interval in hours
diff_in_hours = diff.total_seconds() / 3600
print(diff_in_hours)
# get the difference between the two calendar dates as a timedelta object
diff = end.date() - start.date()
print(diff.days)
With pandas:
import pandas as pd
date_1 = '2016-09-24 17:42:27.839496'
date_2 = '2017-01-18 10:24:08.629327'
start = pd.to_datetime(date_1, format='%Y-%m-%d %H:%M:%S.%f')
end = pd.to_datetime(date_2, format='%Y-%m-%d %H:%M:%S.%f')
# get the difference between two datetimes as timedelta object
diff = end - start
print(diff.days)
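One detail worth spelling out: .days on a timedelta counts complete 24-hour periods, so it can differ by one from the difference between the calendar dates. A small sketch with the same two timestamps (the variable names are illustrative):

```python
import pandas as pd

start = pd.to_datetime("2016-09-24 17:42:27.839496")
end = pd.to_datetime("2017-01-18 10:24:08.629327")

elapsed = end - start
whole_days = elapsed.days               # complete 24-hour periods only
hours = elapsed.total_seconds() / 3600  # the same interval expressed in hours

# Dropping the time of day first counts calendar dates instead.
calendar_days = (end.normalize() - start.normalize()).days

print(whole_days, calendar_days, round(hours, 2))
```

Here the end time of day is earlier than the start time of day, so the elapsed interval is a little under 116 full days while the calendar dates are 116 days apart.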

Convert from float to datetime in Python

I have a DataFrame column of dtype float64 that I want to convert to datetime64. Whichever method I use, every value comes out as the same day, 1970-01-01. Any help, please?
df.product_first_sold_date = [41245,0, 37659.0,40487.0,41701.0,40649.0]
dt.cv = pd.to_datetime(df.product_first_sold_date)
dt.cv
dt.cv2 = df.product_first_sold_date.apply(lambda x: datetime.fromtimestamp(x).strftime('%m-%d-%Y') if x==x else None)
dt.cv2
I believe you are dealing with Excel serial dates, which count days from 1900-01-01. As @Dishin pointed out, the origin to use in practice is 1899-12-30:
# sample data:
df = pd.DataFrame({'date':[41245,37659,40487]})
# convert: add the day counts to the 1899-12-30 origin
df['date'] = pd.to_timedelta(df.date, unit='D') + pd.to_datetime('1899-12-30')
Output:
date
0 2012-12-02
1 2003-02-07
2 2010-11-05
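An equivalent spelling, sketched here under the same assumption that the numbers are Excel day counts, passes the unit and origin to pd.to_datetime directly. It also shows why the question's attempt collapsed to 1970-01-01: plain numbers are read as nanoseconds after the Unix epoch.

```python
import pandas as pd

serials = pd.Series([41245, 37659, 40487])  # Excel-style day counts (sample values)

# Read as-is, the numbers are taken as nanoseconds after 1970-01-01.
as_nanoseconds = pd.to_datetime(serials)

# Treat them as days counted from the 1899-12-30 origin instead.
as_days = pd.to_datetime(serials, unit="D", origin=pd.Timestamp("1899-12-30"))

print(as_days)
```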

Previous month datetime pandas

I have a datetime instance declared as follows:
dtDate = datetime.datetime(2016,1,1,0,0)
How do I get the previous month and previous year from dtDate?
e.g. something like:
dtDate.minusOneMonth()
# to return datetime.datetime(2015,12,1,0,0)
You can use:
dtDate = datetime.datetime(2016,1,1,0,0)
print (dtDate - pd.DateOffset(months=1))
2015-12-01 00:00:00
print (dtDate - pd.DateOffset(years=1))
2015-01-01 00:00:00
The trailing s is important: the singular year=1 sets the year to 1 instead of subtracting one year:
print (dtDate - pd.DateOffset(year=1))
0001-01-01 00:00:00
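The same plural-versus-singular distinction applies to months. A minimal sketch, using an end-of-month date chosen to show how the day is clamped:

```python
import pandas as pd

ts = pd.Timestamp("2016-03-31")

# Plural keyword: a relative shift. The day is clamped to the month's last day.
one_month_back = ts - pd.DateOffset(months=1)

# Singular keyword: an absolute replacement of that field.
month_set_to_january = ts + pd.DateOffset(month=1)

print(one_month_back, month_set_to_january)
```

Since February 2016 has no 31st, the relative shift lands on 2016-02-29 rather than raising an error.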
You can use DateOffset:
In [32]:
dtDate = dt.datetime(2016,1,1,0,0)
dtDate - pd.DateOffset(months=1)
Out[32]:
Timestamp('2015-12-01 00:00:00')
To manipulate an entire pandas Series, use pd.DateOffset() with .dt.to_period("M"):
df['year_month'] = df['timestamp'].dt.to_period("M")
df['prev_year_month'] = (df['timestamp'] - pd.DateOffset(months=1)).dt.to_period("M")
Because the offset is subtracted here, setting months=-1 moves forward one month instead.
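As a possible alternative, the shift can be done after the conversion, since adding or subtracting an integer moves a Period by that many units of its frequency. The sketch below compares both routes on a made-up timestamp column.

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(["2016-01-01", "2016-03-31"])})

# Shift the timestamps first, then convert to monthly periods.
via_offset = (df["timestamp"] - pd.DateOffset(months=1)).dt.to_period("M")

# Convert first, then step back one monthly period.
via_period = df["timestamp"].dt.to_period("M") - 1

print(via_offset.astype(str).tolist())
print(via_period.astype(str).tolist())
```

Both routes should give the same year-month labels; the second avoids any question about how the day of the month is clamped.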
Use relativedelta from dateutil:
import datetime
import dateutil.relativedelta
dtDate = datetime.datetime(2016,1,1,0,0)
# get previous month
print ((dtDate+dateutil.relativedelta.relativedelta(months=-1)).month)
# get previous year
print ((dtDate+dateutil.relativedelta.relativedelta(years=-1)).year)
Output:
12
2015

pandas save date in ISO format?

I'm trying to generate a Pandas DataFrame where date_range is an index. Then save it to a CSV file so that the dates are written in ISO-8601 format.
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
NumberOfSamples = 10
dates = pd.date_range('20130101',periods=NumberOfSamples,freq='90S')
df3 = DataFrame(index=dates)
df3.to_csv('dates.txt', header=False)
The current output to dates.txt is:
2013-01-01 00:00:00
2013-01-01 00:01:30
2013-01-01 00:03:00
2013-01-01 00:04:30
...................
I'm trying to get it to look like:
2013-01-01T00:00:00Z
2013-01-01T00:01:30Z
2013-01-01T00:03:00Z
2013-01-01T00:04:30Z
....................
Use datetime.strftime and call map on the index:
In [72]:
NumberOfSamples = 10
import datetime as dt
dates = pd.date_range('20130101',periods=NumberOfSamples,freq='90S')
df3 = pd.DataFrame(index=dates)
df3.index = df3.index.map(lambda x: dt.datetime.strftime(x, '%Y-%m-%dT%H:%M:%SZ'))
df3
Out[72]:
Empty DataFrame
Columns: []
Index: [2013-01-01T00:00:00Z, 2013-01-01T00:01:30Z, 2013-01-01T00:03:00Z, 2013-01-01T00:04:30Z, 2013-01-01T00:06:00Z, 2013-01-01T00:07:30Z, 2013-01-01T00:09:00Z, 2013-01-01T00:10:30Z, 2013-01-01T00:12:00Z, 2013-01-01T00:13:30Z]
Alternatively, and better in my view (thanks to @unutbu), you can pass a format specifier to to_csv:
df3.to_csv('dates.txt', header=False, date_format='%Y-%m-%dT%H:%M:%SZ')
With pd.Index.strftime:
If you're sure that all your dates are UTC, you can hardcode the format:
df3.index = df3.index.strftime('%Y-%m-%dT%H:%M:%SZ')
which gives you 2013-01-01T00:00:00Z and so on. Note that the "Z" denotes UTC!
With pd.Timestamp.isoformat and pd.Index.map:
df3.index = df3.index.map(lambda timestamp: timestamp.isoformat())
This gives you 2013-01-01T00:00:00. If you attach a timezone to your dates first (e.g. by passing tz="UTC" to date_range), you'll get: 2013-01-01T00:00:00+00:00 which also conforms to ISO-8601 but is a different notation. This should work for any dateutil or pytz timezone, leaving no room for ambiguity when clocks switch from daylight saving to standard time.
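The two notations can be compared side by side. This sketch assumes the data really is in UTC and builds a short timezone-aware index for illustration.

```python
import pandas as pd

# Timezone-aware index in UTC, one sample every 90 seconds.
idx = pd.date_range("2013-01-01", periods=3, freq=pd.Timedelta(seconds=90), tz="UTC")

# Hardcoded "Z" suffix; correct here only because the index really is UTC.
with_z = idx.strftime("%Y-%m-%dT%H:%M:%SZ").tolist()

# isoformat() writes the UTC offset explicitly instead.
with_offset = [ts.isoformat() for ts in idx]

print(with_z)
print(with_offset)
```

If the index were in another timezone, converting it with tz_convert("UTC") before formatting would keep the hardcoded "Z" truthful.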

Pandas 0.15 DataFrame: Remove or reset time portion of a datetime64

I have imported a CSV file into a pandas DataFrame and have a datetime64 column with values such as:
2014-06-30 21:50:00
I simply want to either remove the time or set the time to midnight:
2014-06-30 00:00:00
What is the easiest way of doing this?
Pandas has a builtin function pd.datetools.normalize_date for that purpose:
df['date_col'] = df['date_col'].apply(pd.datetools.normalize_date)
It's implemented in Cython and does the following:
if PyDateTime_Check(dt):
    return dt.replace(hour=0, minute=0, second=0, microsecond=0)
elif PyDate_Check(dt):
    return datetime(dt.year, dt.month, dt.day)
else:
    raise TypeError('Unrecognized type: %s' % type(dt))
Use the .dt accessor, which is vectorized and therefore faster.
# There are better ways of converting the column to datetime;
# they are omitted here to keep the example simple
data['date_column'] = pd.to_datetime(data['date_column'])
data['date_column'].dt.date
pd.datetools.normalize_date has been deprecated. Use df['date_col'] = df['date_col'].dt.normalize() instead.
See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.normalize.html
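The two .dt options above behave differently with respect to dtype, which matters for later date arithmetic. A short sketch on made-up data:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2014-06-30 21:50:00", "2014-07-01 00:00:00"]))

# normalize() keeps the datetime64 dtype and sets the time to midnight.
midnight = s.dt.normalize()

# .dt.date returns Python date objects, so the result has object dtype.
dates_only = s.dt.date

print(midnight.dtype, dates_only.dtype)
```

If the column is going to be used with other pandas datetime operations afterwards, normalize() is usually the more convenient of the two because the .dt accessor remains available.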
I can think of two ways, setting or assigning to a new column just the date() attribute, or calling replace on the datetime object and passing param hour=0, minute=0:
In [106]:
# example data
t = """datetime
2014-06-30 21:50:00"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df
Out[106]:
datetime
0 2014-06-30 21:50:00
In [107]:
# apply a lambda accessing just the date() attribute
df['datetime'] = df['datetime'].apply( lambda x: x.date() )
print(df)
# reset df
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
# call replace with params hour=0, minute=0
df['datetime'] = df['datetime'].apply( lambda x: x.replace(hour=0, minute=0) )
df
datetime
0 2014-06-30
Out[107]:
datetime
0 2014-06-30
Since pd.datetools.normalize_date has been deprecated and you are working with the datetime64 data type, use:
df.your_date_col = df.your_date_col.apply(lambda x: x.replace(hour=0, minute=0, second=0, microsecond=0))
This way you don't need to convert to pandas datetime first. If it's already a pandas datetime, then see answer from Phil.
df.your_date_col = df.your_date_col.dt.normalize()
The fastest way I have found to strip everything but the date is to use the underlying NumPy structure of pandas Timestamps.
import numpy as np
import pandas as pd
dates = pd.to_datetime(['1990-1-1 1:00:11',
'1991-1-1',
'1999-12-31 12:59:59.999'])
dates
DatetimeIndex(['1990-01-01 01:00:11', '1991-01-01 00:00:00',
'1999-12-31 12:59:59.999000'],
dtype='datetime64[ns]', freq=None)
dates = dates.astype(np.int64)
ns_in_day = 24*60*60*np.int64(1e9)
dates //= ns_in_day
dates *= ns_in_day
dates = dates.astype(np.dtype('<M8[ns]'))
dates = pd.Series(dates)
dates
0 1990-01-01
1 1991-01-01
2 1999-12-31
dtype: datetime64[ns]
This might not work when data have timezone information.
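To illustrate the timezone caveat, the sketch below compares midnight in the local timezone with midnight of the underlying UTC instant, which is effectively what integer flooring of the raw nanoseconds computes. The timezone and the value are arbitrary examples.

```python
import pandas as pd

tz = "America/New_York"

# Hypothetical timezone-aware value: 21:50 local time on 30 June.
s = pd.Series(pd.to_datetime(["2014-06-30 21:50:00"])).dt.tz_localize(tz)

# normalize() resets to midnight in the series' own timezone.
local_midnight = s.dt.normalize()

# Flooring the UTC instant instead lands on UTC midnight,
# which is 20:00 local time on the same evening.
utc_midnight = s.dt.tz_convert("UTC").dt.normalize().dt.tz_convert(tz)

print(local_midnight.iloc[0], utc_midnight.iloc[0])
```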
