Cleaning date columns in Python

Kindly assist me in cleaning my date types in python.
My sample data is as follows:
INITIATION DATE    DATE CUT        DATE GIVEN
1/July/2022        21 July 2022    11-July-2022
17-July-2022       16/July/2022    21/July/2022
16-July-2022       01-July-2022    09/July/2022
19-July-2022       31 July 2022    27 July 2022
How do I remove all dashes/slashes/hyphens from dates in the different columns? I have 8 columns and 300 rows.
What I tried:
df[['INITIATION DATE', 'DATE CUT', 'DATE GIVEN']]= df[['INITIATION DATE', 'DATE CUT', 'DATE GIVEN']].apply(pd.to_datetime, format = '%d%b%Y')
Desired output format for all: 1 July 2022
ValueError I'm getting:
time data '18 July 2022' does not match format '%d-%b-%Y' (match)

To remove all dashes, slashes, and hyphens from the strings, you can use the str.replace method:
df.apply(lambda x: x.str.replace('[/-]',' ',regex=True))
Output:
  INITIATION DATE      DATE CUT    DATE GIVEN
0     1 July 2022  21 July 2022  11 July 2022
1    17 July 2022  16 July 2022  21 July 2022
2    16 July 2022  01 July 2022  09 July 2022
3    19 July 2022  31 July 2022  27 July 2022
If you also need to convert the strings to datetime, try this:
df.apply(lambda x: pd.to_datetime(x.str.replace('[/-]',' ',regex=True)))
Output:
  INITIATION DATE    DATE CUT  DATE GIVEN
0      2022-07-01  2022-07-21  2022-07-11
1      2022-07-17  2022-07-16  2022-07-21
2      2022-07-16  2022-07-01  2022-07-09
3      2022-07-19  2022-07-31  2022-07-27

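You can use pd.to_datetime to convert strings to datetime objects. The function takes a format argument that specifies the layout of the datetime string, using the usual strftime format codes. Note that the lines below assume every value in a column uses the same separator; a column that mixes separators will not match a single format.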
df['INITIATION DATE'] = pd.to_datetime(df['INITIATION DATE'], format='%d-%B-%Y').dt.strftime('%d %B %Y')
df['DATE CUT'] = pd.to_datetime(df['DATE CUT'], format='%d %B %Y').dt.strftime('%d %B %Y')
df['DATE GIVEN'] = pd.to_datetime(df['DATE GIVEN'], format='%d/%B/%Y').dt.strftime('%d %B %Y')
Output:
  INITIATION DATE      DATE CUT    DATE GIVEN
0    01 July 2022  21 July 2022  11 July 2022
1    17 July 2022  16 July 2022  21 July 2022
2    16 July 2022  01 July 2022  09 July 2022
3    19 July 2022  31 July 2022  27 July 2022
You get that error because your datetime strings (e.g. '18 July 2022') do not match your format specifiers ('%d-%b-%Y') because of the extra hyphens in the format specifier.
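Since the sample data mixes separators within a column and the desired output has no leading zero on the day ("1 July 2022"), the two answers above can be combined. The following is a minimal sketch, not a definitive solution: the helper name clean_dates is made up for illustration, and it assumes English month names and that every value is day, month name, year.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    'INITIATION DATE': ['1/July/2022', '17-July-2022', '16-July-2022', '19-July-2022'],
    'DATE CUT': ['21 July 2022', '16/July/2022', '01-July-2022', '31 July 2022'],
    'DATE GIVEN': ['11-July-2022', '21/July/2022', '09/July/2022', '27 July 2022'],
})
date_cols = ['INITIATION DATE', 'DATE CUT', 'DATE GIVEN']


def clean_dates(col):
    """Hypothetical helper: normalise separators, parse, and format without a leading zero."""
    # Turn every '/' or '-' into a space so one format fits all values.
    normalised = col.str.replace(r'[/-]', ' ', regex=True)
    parsed = pd.to_datetime(normalised, format='%d %B %Y')
    # Build the day from the integer component to avoid zero padding.
    return parsed.dt.day.astype(str) + ' ' + parsed.dt.strftime('%B %Y')


df[date_cols] = df[date_cols].apply(clean_dates)
print(df)
```

Parsing before formatting also validates the values: anything that is not a real date raises an error instead of passing through silently.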

Related

How can I convert day of week, Month, Date to Year - Month - Date

I have dates from 2018 until 2021 in a pandas column and they look like this:
Date
Sun, Dec 30
Mon, Dec 31
Any idea how I can convert this to:
Date
Dec 30 2018
Dec 31 2018
In other words, knowing the day of the week (Monday, Tuesday, etc.), is it possible to work out the year of that specific date?
I would take a look at this conversation. As mentioned, you will probably need to define a range of years, since it is possible that December 30th (for example) falls on a Sunday in more than one year. Otherwise, it is possible to collect a list of years where the input (Sun, Dec 30) is valid. You will probably need to use datetime to convert your strings to a Python readable format.
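The idea of collecting every year in a range for which the weekday fits can be sketched with the standard library alone. This is only an illustration: the function name candidate_years and its default year bounds are assumptions taken from the 2018 to 2021 range mentioned in the question, and the weekday comparison relies on English abbreviations in the default locale.

```python
from datetime import datetime


def candidate_years(text, first_year=2018, last_year=2021):
    """Hypothetical helper: return the years in which 'Sun, Dec 30' style input is valid."""
    weekday, month_day = [part.strip() for part in text.split(',')]
    years = []
    for year in range(first_year, last_year + 1):
        try:
            candidate = datetime.strptime(f'{year} {month_day}', '%Y %b %d')
        except ValueError:
            # The date does not exist in this year (e.g. Feb 29 in a non-leap year).
            continue
        if candidate.strftime('%a') == weekday:
            years.append(year)
    return years


print(candidate_years('Sun, Dec 30'))
print(candidate_years('Mon, Dec 31'))
```

If the returned list has more than one entry, the input is ambiguous within the chosen range and some extra information (such as row order) is needed to pick a year.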
You can iterate over the years from 2018 to 2021, compute each candidate date's weekday name, and then find the year that matches.
import pandas as pd
df = pd.DataFrame({'Date': {0: 'Sun, Dec 30',
                            1: 'Mon, Dec 31'}})
# Build one column per candidate year holding that year's weekday for the date.
for col in range(2018, 2022):
    df[col] = '%s' % col + df['Date'].str.split(',').str[-1]
    df[col] = pd.to_datetime(df[col], format='%Y %b %d').dt.strftime('%a, %b %d')
# Reshape to long form and keep the rows where the weekday matches the input.
dfn = df.set_index('Date').stack().reset_index()
cond = dfn['Date'] == dfn[0]
obj = dfn[cond].set_index('Date')['level_1'].rename('year')
result:
print(obj)
Date
Sun, Dec 30 2018
Mon, Dec 31 2018
Name: year, dtype: int64
print(df.join(obj, on='Date'))
          Date         2018         2019         2020         2021  year
0  Sun, Dec 30  Sun, Dec 30  Mon, Dec 30  Wed, Dec 30  Thu, Dec 30  2018
1  Mon, Dec 31  Mon, Dec 31  Tue, Dec 31  Thu, Dec 31  Fri, Dec 31  2018
df_result = obj.reset_index()
df_result['Date_new'] = df_result['Date'].str.split(',').str[-1] + ' ' + df_result['year'].astype(str)
print(df_result)
          Date  year      Date_new
0  Sun, Dec 30  2018   Dec 30 2018
1  Mon, Dec 31  2018   Dec 31 2018

Convert date strings with Italian month names to %Y-%m-%d

I would like to convert the date strings in one column (Before) into proper dates in another column (After):
Before After
23 Ottobre 2020 2020-10-23
24 Ottobre 2020 2020-10-24
27 Ottobre 2020 2020-10-27
30 Ottobre 2020 2020-10-30
22 Luglio 2020 2020-07-22
I tried as follows:
from datetime import datetime
date = df.Before.tolist()
dtObject = datetime.strptime(date,"%d %m, %y")
dtConverted = dtObject.strftime("%y-%m-%d")
But it does not work.
Can you explain how to do it?
Similar to this question, you can set the locale to Italian before parsing:
import pandas as pd
import locale
locale.setlocale(locale.LC_ALL, 'it_IT')
df = pd.DataFrame({'Before': ['30 Ottobre 2020', '22 Luglio 2020']})
df['After'] = pd.to_datetime(df['Before'], format='%d %B %Y')
# df
# Before After
# 0 30 Ottobre 2020 2020-10-30
# 1 22 Luglio 2020 2020-07-22
If you want the "After" column as dtype string, use df['After'].dt.strftime('%Y-%m-%d').
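The locale approach only works if the it_IT locale is installed on the machine, and the locale name can differ between platforms. As an alternative, here is a sketch that avoids locale altogether by mapping the Italian month names to numbers. The ITALIAN_MONTHS dictionary and the intermediate column handling are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Assumed lookup table: lower-case Italian month names to month numbers.
ITALIAN_MONTHS = {
    'gennaio': 1, 'febbraio': 2, 'marzo': 3, 'aprile': 4,
    'maggio': 5, 'giugno': 6, 'luglio': 7, 'agosto': 8,
    'settembre': 9, 'ottobre': 10, 'novembre': 11, 'dicembre': 12,
}

df = pd.DataFrame({'Before': ['30 Ottobre 2020', '22 Luglio 2020']})

# Split "30 Ottobre 2020" into day, month name, and year columns.
parts = df['Before'].str.split(expand=True)

# pd.to_datetime can assemble dates from a frame with year/month/day columns.
ymd = pd.DataFrame({
    'year': parts[2].astype(int),
    'month': parts[1].str.lower().map(ITALIAN_MONTHS),
    'day': parts[0].astype(int),
})
df['After'] = pd.to_datetime(ymd)
print(df)
```

A month name that is missing from the dictionary maps to NaN, so misspellings surface as an error when the dates are assembled rather than producing wrong values.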

How to subtract variable number of days from variable dates using timedelta python

I have a dataset that has the original dates in this format:
End_date=['Fri, 19 Aug 2011 19:28:17 -0000', 'Sun, 08 Apr 2012 02:14:00 -0000',
'Wed, 22 Jun 2011 13:33:00 -0000', 'Fri, 30 Dec 2011 04:36:53 -0000'....]
Duration_in_days=[30, 20, 10, 15,....]
How do I use a loop, datetime and timedelta to subtract the duration from the End_date to derive the start date?
from datetime import datetime, timedelta
End_date=['Fri, 19 Aug 2011 19:28:17 -0000', 'Sun, 08 Apr 2012 02:14:00 -0000',
'Wed, 22 Jun 2011 13:33:00 -0000', 'Fri, 30 Dec 2011 04:36:53 -0000']
Duration_in_days=[30, 20, 10, 15]
startDates = []
for date, days in zip(End_date, Duration_in_days):
    # Parse the end date, then subtract the duration in days.
    x = datetime.strptime(date, "%a, %d %b %Y %H:%M:%S -0000") - timedelta(days=days)
    startDates.append(x)
This sets startDates to a list of the start dates as datetime objects.
EDIT: Or use a list comprehension as suggested in the comments:
startDates = [
    datetime.strptime(date, "%a, %d %b %Y %H:%M:%S -0000") - timedelta(days=days)
    for date, days in zip(End_date, Duration_in_days)
]
The following program shows how to do this, subtracting the days and ending up with a string-based collection as per the original:
# Need pprint for pretty printing of lists.
import datetime, pprint
# Set up test data.
End_date=[
'Fri, 19 Aug 2011 19:28:17 -0000',
'Sun, 08 Apr 2012 02:14:00 -0000',
'Wed, 22 Jun 2011 13:33:00 -0000',
'Fri, 30 Dec 2011 04:36:53 -0000',
]
Duration_in_days=[30, 20, 10, 15]
fmt = "%a, %d %b %Y %H:%M:%S %z"
# Add each (adjusted) item to a new list.
Start_date = []
for i in range(len(End_date)):
    # Parse end date, subtract and create string from it.
    dt = datetime.datetime.strptime(End_date[i], fmt)
    dt -= datetime.timedelta(days=Duration_in_days[i])
    Start_date.append(datetime.datetime.strftime(dt, fmt))
pprint.pprint(End_date)
pprint.pprint(Start_date)
And, despite your comment to the contrary, July 20 2011 was a Wednesday :-)
The output shows the original and adjusted times:
['Fri, 19 Aug 2011 19:28:17 -0000',
'Sun, 08 Apr 2012 02:14:00 -0000',
'Wed, 22 Jun 2011 13:33:00 -0000',
'Fri, 30 Dec 2011 04:36:53 -0000']
['Wed, 20 Jul 2011 19:28:17 +0000',
'Mon, 19 Mar 2012 02:14:00 +0000',
'Sun, 12 Jun 2011 13:33:00 +0000',
'Thu, 15 Dec 2011 04:36:53 +0000']
If you're after a more Pythonic way :-), you can replace the loop with:
Start_date = [datetime.datetime.strftime(datetime.datetime.strptime(dt, fmt) - datetime.timedelta(days=delta), fmt) for dt, delta in zip(End_date, Duration_in_days)]
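Because these strings follow the email (RFC 2822) date layout, the standard library's email.utils.parsedate_to_datetime can parse them without a hand-written format string. The sketch below is one possible variation on the answers above, not a replacement for them; the variable name start_dates is made up here. Per the Python documentation, a -0000 offset yields a naive datetime.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

End_date = [
    'Fri, 19 Aug 2011 19:28:17 -0000',
    'Sun, 08 Apr 2012 02:14:00 -0000',
    'Wed, 22 Jun 2011 13:33:00 -0000',
    'Fri, 30 Dec 2011 04:36:53 -0000',
]
Duration_in_days = [30, 20, 10, 15]

# Parse each RFC 2822 style string and subtract the matching duration.
start_dates = [
    parsedate_to_datetime(end) - timedelta(days=days)
    for end, days in zip(End_date, Duration_in_days)
]

for start in start_dates:
    print(start)
```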

Return dataframe with range of dates

I need a Python function to return a Pandas DataFrame with range of dates, only year and month, for example, from November 2016 to March 2017 and have this as result:
year month
2016 11
2016 12
2017 01
2017 02
2017 03
My dates are in string format Y-m (from = '2016-11', to = '2017-03'). I'm not sure whether to turn them into datetime type or to separate them into two different integer values.
Any ideas on how to achieve it properly?
Are you looking at something like this?
pd.date_range('November 2016', 'April 2017', freq = 'M')
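You get (note that in pandas 2.2 and later the month-end alias is 'ME', and 'M' is deprecated for date_range):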
DatetimeIndex(['2016-11-30', '2016-12-31', '2017-01-31', '2017-02-28',
'2017-03-31'],
dtype='datetime64[ns]', freq='M')
To get a DataFrame:
index = pd.date_range('November 2016', 'April 2017', freq = 'M')
df = pd.DataFrame(index = index)
pd.Series(pd.date_range('2016-11', '2017-4', freq='M').strftime('%Y-%m')) \
.str.split('-', expand=True) \
.rename(columns={0: 'year', 1: 'month'})
year month
0 2016 11
1 2016 12
2 2017 01
3 2017 02
4 2017 03
You can use a combination of pd.to_datetime and pd.date_range.
import pandas as pd
start = 'November 2016'
end = 'March 2017'
s = pd.Series(pd.date_range(*(pd.to_datetime([start, end]) \
+ pd.offsets.MonthEnd()), freq='1M'))
Construct a dataframe using the .dt accessor attributes.
df = pd.DataFrame({'year' : s.dt.year, 'month' : s.dt.month})
df
month year
0 11 2016
1 12 2016
2 1 2017
3 2 2017
4 3 2017
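Since only the year and month matter, monthly periods are another way to express the same range without month-end dates. The following is a minimal sketch under that assumption; the function name month_range is hypothetical, and it takes the Y-m strings from the question directly.

```python
import pandas as pd


def month_range(start, end):
    """Hypothetical helper: one row per month from start to end, both inclusive."""
    # A monthly PeriodIndex includes both endpoints, so no end-date adjustment is needed.
    periods = pd.period_range(start=start, end=end, freq='M')
    return pd.DataFrame({'year': periods.year, 'month': periods.month})


df = month_range('2016-11', '2017-03')
print(df)
```

If the month is needed as a zero-padded string such as '01', periods.strftime('%m') could be used in place of periods.month.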

How can I convert the time in a datetime string from 24:00 to 00:00 in Python?

I have a lot of date strings like Mon, 16 Aug 2010 24:00:00 and some of them are in 00-23 hour format and some of them in 01-24 hour format. I want to get a list of date objects of them, but when I try to transform the example string into a date object, I have to transform it from Mon, 16 Aug 2010 24:00:00 to Tue, 17 Aug 2010 00:00:00. What is the easiest way?
import email.utils as eutils
import time
import datetime
ntuple=eutils.parsedate('Mon, 16 Aug 2010 24:00:00')
print(ntuple)
# (2010, 8, 16, 24, 0, 0, 0, 1, -1)
timestamp=time.mktime(ntuple)
print(timestamp)
# 1282017600.0
date=datetime.datetime.fromtimestamp(timestamp)
print(date)
# 2010-08-17 00:00:00
print(date.strftime('%a, %d %b %Y %H:%M:%S'))
# Tue, 17 Aug 2010 00:00:00
Since you say you have a lot of these to fix, you should define a function:
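def standardize_date(date_str):
    # Parse into a time tuple; an hour of 24 is kept as-is at this stage.
    ntuple=eutils.parsedate(date_str)
    # mktime normalizes the out-of-range hour, rolling over to the next day.
    timestamp=time.mktime(ntuple)
    date=datetime.datetime.fromtimestamp(timestamp)
    return date.strftime('%a, %d %b %Y %H:%M:%S')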
print(standardize_date('Mon, 16 Aug 2010 24:00:00'))
# Tue, 17 Aug 2010 00:00:00
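The approach above goes through a local-time timestamp, so its result can depend on the machine's time zone and daylight saving rules. A sketch that stays in plain datetime arithmetic is shown below; the function name parse_24h is made up for illustration, and it assumes every string has the layout 'Mon, 16 Aug 2010 24:00:00' with English month abbreviations.

```python
from datetime import datetime, timedelta


def parse_24h(date_str):
    """Hypothetical helper: parse a date string whose hour may be 24."""
    # Separate 'Mon, 16 Aug 2010' from '24:00:00'.
    date_part, time_part = date_str.rsplit(' ', 1)
    hours, minutes, seconds = (int(piece) for piece in time_part.split(':'))
    # Drop the weekday name; it is recomputed after any rollover.
    day = datetime.strptime(date_part.split(', ', 1)[1], '%d %b %Y')
    # Adding the time as a timedelta lets hour 24 roll over to the next day.
    return day + timedelta(hours=hours, minutes=minutes, seconds=seconds)


result = parse_24h('Mon, 16 Aug 2010 24:00:00')
print(result.strftime('%a, %d %b %Y %H:%M:%S'))
```

Strings already in 00-23 hour format pass through unchanged, so the same function can be applied to the whole list.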
