Python Date cleaner using regex - python

I've been trying to write a script that takes date inputs like 3/14/2015, 03-14-2015,
and 2015/3/14 (using pyperclip to copy and paste) and converts them to a single format. So far this is what I've accomplished:
import re,pyperclip
dateRegex_0 = re.compile(r'''(
#0) 3/14/2015
(\d{1,2})
(-|\/|\.)
(\d{2})
(-|\/|\.)
(\d{4})
)''',re.VERBOSE)
dateRegex_1 = re.compile(r'''(
#1)03-14-2015
(\d{2})
(-|\/|\.)
(\d{2})
(-|\/|\.)
(\d{4})
)''',re.VERBOSE)
dateRegex_2 = re.compile(r'''(
#2)2015/3/14 , format YYYY/MM/DD
(\d{4})
(-|\/|\.)
(\d{1,2})
(-|\/|\.)
(\d{1,2})
)''',re.VERBOSE)
text=str(pyperclip.paste())
matches = []
for groups in dateRegex_0.findall(text):
    cleanDate = '-'.join([groups[3],groups[1],groups[5]])
    matches.append(cleanDate)
for groups in dateRegex_1.findall(text):
    cleanDate = '-'.join([groups[3],groups[1],groups[5]])
    matches.append(cleanDate)
for groups in dateRegex_2.findall(text):
    cleanDate = '-'.join([groups[5],groups[3],groups[1]])
    matches.append(cleanDate)
if len(matches)>0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('There are no dates in your text!')
I managed to create a regex for each date type, and the code transforms the data to this format:
DD-MM-YYYY.
However I have 2 problems:
When I try to clean this type of date: 3/14/2015, 03-14-2015 I get this output:
14-3-2015 , 14-03-2015. I want to either get rid of the 0 that appears before single-digit months or add a 0 before every one of them (basically I want all of my cleaned dates to have the same format).
How can I write a Regex for my date types that doesn't require 3 separate ones? I want a single Regex to identify all of the date types(instead of having dateRegex_0, dateRegex_1, dateRegex_2).

One idea...
import re
#pip install dateparser (if required)
import dateparser
# quite crude pattern; just 1-4 number, then either / or -, then repeated a couple of times
pattern = r'(\d{1,4}(?:/|-)\d{1,4}(?:/|-)\d{1,4})'
# this is just seen as text (could be from the clipboard)...
data = '''
import dateparser
dates = ['1/14/2016', '05-14-2017', '2014/3/18', '2015-06-14 00:00:00', '13-13-2000000']
for date in dates:
print(dateparser.parse(date))
'''
# pull out a list of dates matching the above pattern to a list
extracted_dates = re.findall(pattern, data)
# print out the matched strings if dateparser thinks they are a date
# '13-13-2000' (the part of '13-13-2000000' that the regex matches) is not a real date, so dateparser returns None for it
for date in extracted_dates:
    parsed = dateparser.parse(date)
    if parsed is not None:
        print(parsed)
Outputs:
2016-01-14 00:00:00
2017-05-14 00:00:00
2014-03-18 00:00:00
2015-06-14 00:00:00
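For the two specific problems in the question (one pattern instead of three, and consistent zero padding), a standard-library sketch might look like the following. It assumes only the two orderings from the question occur (month first with the four-digit year last, or year first); the names DATE_RE and clean_dates are made up for illustration, and an ambiguous input such as 01/02/2015 is simply read as month/day.

```python
import re

# One pattern with two alternatives: M/D/YYYY or YYYY/M/D.
# Separators may be -, / or . (as in the question's regexes).
DATE_RE = re.compile(r'''
    \b(?:
        (?P<m1>\d{1,2})[-/.](?P<d1>\d{1,2})[-/.](?P<y1>\d{4})   # 3/14/2015, 03-14-2015
      |
        (?P<y2>\d{4})[-/.](?P<m2>\d{1,2})[-/.](?P<d2>\d{1,2})   # 2015/3/14
    )\b
''', re.VERBOSE)

def clean_dates(text):
    """Return every date found in text as a zero-padded DD-MM-YYYY string."""
    cleaned = []
    for m in DATE_RE.finditer(text):
        if m.group('y1'):   # the month-first alternative matched
            day, month, year = m.group('d1'), m.group('m1'), m.group('y1')
        else:               # the year-first alternative matched
            day, month, year = m.group('d2'), m.group('m2'), m.group('y2')
        # zfill(2) pads single digits, so 3 becomes 03
        cleaned.append('{}-{}-{}'.format(day.zfill(2), month.zfill(2), year))
    return cleaned

print(clean_dates('3/14/2015, 03-14-2015 and 2015/3/14'))
```

To drop the leading zeros instead of adding them, str(int(day)) could replace day.zfill(2). The pattern does not check that the numbers form a real date, so something like 99/99/2015 would pass through unchanged.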

Related

How to search for string between whitespace and marker? Python

My problem is the following:
I have the string:
datetime = "2021/04/07 08:30:00"
I want to save in the variable hour, 08 and
I want to save in the variable minutes, 30
What I've done is the following:
import re
pat = re.compile(' (.*):')
hour = re.search(pat, datetime)
minutes = re.search(pat, datetime)
print(hour.group(1))
print(minutes.group(1))
What I obtain from the prints is
08:30 and 30, so the minutes are correct but for some reason that I'm not understanding, in the hours the first : is skipped and takes everything from the whitespace to the second :.
What am I doing wrong? Thank you.
Please use strptime from the datetime module, which is the recommended way to handle date strings in Python.
strptime returns a datetime object from the date string, and this object comes with all sorts of goodies such as date, time, hour, isoformat and timestamp, which make working with datetimes a breeze.
>>> import datetime
>>> datetime.datetime.strptime("2021/04/07 08:30:00", "%Y/%m/%d %H:%M:%S")
datetime.datetime(2021, 4, 7, 8, 30)
>>> datetime.datetime.strptime("2021/04/07 08:30:00", "%Y/%m/%d %H:%M:%S").hour
8
>>> datetime.datetime.strptime("2021/04/07 08:30:00", "%Y/%m/%d %H:%M:%S").second
0
Ah, no no, python has a much better approach with datetime.strptime
https://www.programiz.com/python-programming/datetime/strptime
So for you:
from datetime import datetime
dt_string = "2021/04/07 08:30:00"
# The date is in yyyy/mm/dd format
dt_object1 = datetime.strptime(dt_string, "%Y/%m/%d %H:%M:%S")
You want hours?
hours = dt_object1.hour
or minutes?
mins = dt_object1.minute
Now, if what you have presented is just an example of where you need to work around whitespace, then you could split the string up. Again with dt_string:
date_part, time_part = dt_string.split(" ")
dateString = date_part.split("/") # A list of [year, month, day]
timeString = time_part.split(":") # A list of [hours, minutes, seconds]
The wildcard . matches any single character, including the :, and * is greedy, so .* matches as much as it can: 08:30.
Use:
hour = re.search(r' ([0-9]+):', datetime)
Output:
>>> hour.group(1)
'08'
You can try the regex below, which makes the match non-greedy so that it stops at the first :
hour = re.search(' (.*?):', datetime)
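The difference between the greedy, non-greedy and digit-only patterns discussed above can be seen side by side in a small sketch. The variable names are only for illustration; the input string is the one from the question.

```python
import re

stamp = "2021/04/07 08:30:00"

# Greedy: .* grabs as much as possible, so it runs up to the LAST colon.
greedy = re.search(r' (.*):', stamp).group(1)

# Non-greedy: .*? grabs as little as possible, so it stops at the FIRST colon.
lazy = re.search(r' (.*?):', stamp).group(1)

# Explicit digits: \d+ cannot cross a colon at all, and a second group
# picks up the minutes in the same search.
explicit = re.search(r' (\d+):(\d+):', stamp)

print(greedy)             # 08:30
print(lazy)               # 08
print(explicit.groups())  # ('08', '30')
```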
An alternative is to split the original datetime string on the space into a variable such as dates, which gives you ['2021/04/07', '08:30:00']. You can then take the second element of dates and split it again on ':' into a variable such as time, whose elements are the hours, minutes, and seconds.
datetime = "2021/04/07 08:30:00"
dates = datetime.split(" ")
print(dates)
time = dates[1].split(":")
print(time)
Running the code prints
print(dates) --> ['2021/04/07', '08:30:00']
print(time) --> ['08', '30', '00']
You can access individual parts of time with time[0] for '08', time[1] for '30' etc.
I would use re.compile with named capture groups and iterate:
import re
inp = "Hello World 2021/04/07 08:30:00 Goodbye"
r = re.compile(r'\b\d{4}/\d{2}/\d{2} (?P<hour>\d{2}):(?P<minute>\d{2}):\d{2}\b')
output = [m.groupdict() for m in r.finditer(inp)]
print(output[0]['hour']) # 08
print(output[0]['minute']) # 30
This is a simple datetime question. Python already has the ability to do exactly what you need. 3 steps:
use strptime to generate your date time from your string.
you can get the formatting options here
return just the hour or minute from the datetime object
from datetime import datetime
dt_string = "2021/04/07 08:30:00"
dt_object = datetime.strptime(dt_string, "%Y/%m/%d %H:%M:%S")
print(dt_object)
print(dt_object.hour, dt_object.minute)
# 2021-04-07 08:30:00
# 8 30
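One thing the strptime answers do not show is what happens when the input does not match the format: strptime raises ValueError. A minimal sketch of guarding against that follows; parse_stamp is a hypothetical helper name, not part of the datetime module. It also shows strftime for getting the zero-padded strings '08' and '30' that the question asked for, since the hour and minute attributes are plain integers.

```python
from datetime import datetime

def parse_stamp(text, fmt="%Y/%m/%d %H:%M:%S"):
    """Return a datetime, or None when text does not match fmt."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None

good = parse_stamp("2021/04/07 08:30:00")
bad = parse_stamp("07-04-2021 08:30")   # wrong layout for the format above

print(good.hour, good.minute)                      # 8 30 (integers)
print(good.strftime("%H"), good.strftime("%M"))    # 08 30 (zero-padded strings)
print(bad)                                         # None
```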
You can do this with strptime. If you need to work on the string directly, you can restrict what is matched with a character class followed by +, for example [0-9]+, [a-z]+, [A-Z]+, or a set of symbols.
import re
datetime = "2021/04/07 08:30:00"
# this is the line that needs to change
hour = re.search(' (.*):+[0-9]+:', datetime)
minutes = re.search(':(.*):', datetime)
print(hour.group(1))
print(minutes.group(1))

Split based on _ and find the difference between dates

I am trying to find the difference between the two dates below. Each is in "dd_mm_yyyy" format. I split the two strings on _ to extract the day, month and year.
previous_date = "filename_03_03_2021"
current_date = "filename_09_03_2021"
previous_array = previous_date.split("_")
Not sure after that what could be done to combine them into a date format and find the difference between dates in "days".
Any leads/suggestions would be appreciated.
You could index into the list after the split, for example previous_array[1], to get the values and pass them to date.
But instead of using split, you might use a pattern with 3 capture groups to make the match a bit more specific, get the numbers, and then subtract the dates and read the days value.
You might make the date like pattern more specific using the pattern on this page
import re
from datetime import date
previous_date = "filename_03_03_2021"
current_date = "filename_09_03_2021"
pattern = r"filename_(\d{2})_(\d{2})_(\d{4})"
pMatch = re.match(pattern, previous_date)
cMatch = re.match(pattern, current_date)
pDate = date(int(pMatch.group(3)), int(pMatch.group(2)), int(pMatch.group(1)))
cDate = date(int(cMatch.group(3)), int(cMatch.group(2)), int(cMatch.group(1)))
print((cDate - pDate).days)
Output
6
See a Python demo
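The split-based route mentioned at the start of this answer can be sketched as follows. It assumes every name ends in _DD_MM_YYYY, as in the question's two examples; date_from_name is a hypothetical helper name.

```python
from datetime import datetime

def date_from_name(name):
    # Assumption: the last three underscore-separated parts are day, month, year.
    day, month, year = name.split("_")[-3:]
    # strptime validates the values as well as converting them.
    return datetime.strptime("{}-{}-{}".format(day, month, year), "%d-%m-%Y").date()

previous = date_from_name("filename_03_03_2021")
current = date_from_name("filename_09_03_2021")

# Subtracting two date objects gives a timedelta; .days is the whole-day count.
print((current - previous).days)  # 6
```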

Python regex for date and numbers, find the date format

How to extract dates alone from text file using regex in Python 3?
Below is my current code:
import datetime
from datetime import date
import re
s = "birthday on 20/12/2018 and wedding aniversry on 04/01/1997 and dob is on 09/07/1897"
match = re.search(r'\d{2}/\d{2}/\d{4}', s)
date = datetime.datetime.strptime(match.group(), '%Y-%m-%d').date()
print (date)
Expected Output is
20/12/2018
04/01/1997
09/07/1897
Building on DirtyBit's answer: I found that with a minor change it will pick up multiple date formats. Change the forward slash to a dot. Note that an unescaped dot matches any character, not only a separator, so this pattern is more permissive than it looks.
import re
s = "birthday on 20.12.2018 and wedding anniversary on 04-01-1997 and dob is on 09/07/1897"
pattern = r'\d{2}.\d{2}.\d{4}'
print("\n".join(re.findall(pattern,s)))
Output
20.12.2018
04-01-1997
09/07/1897
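Because an unescaped dot matches any character, the pattern above can also match plain runs of digits. A stricter variant lists the allowed separators in a character class. The sketch below compares the two on a made-up sentence; the order number in it is invented purely to trigger the false positive.

```python
import re

s = "order 2011220180, shipped 20.12.2018, due 04-01-1997 or 09/07/1897"

# Unescaped dot: any character may sit between the digit groups,
# so the ten-digit order number is reported as a "date".
loose = re.findall(r'\d{2}.\d{2}.\d{4}', s)

# Character class: only ., / or - count as separators, and \b stops
# the match from starting or ending inside a longer number.
strict = re.findall(r'\b\d{2}[./-]\d{2}[./-]\d{4}\b', s)

print(loose)   # ['2011220180', '20.12.2018', '04-01-1997', '09/07/1897']
print(strict)  # ['20.12.2018', '04-01-1997', '09/07/1897']
```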
You have an invalid date format near '%Y-%m-%d' since it should have been '%d/%m/%Y' looking at your provided date: birthday on 20/12/2018 (dd/mm/yyyy)
Change this:
date = datetime.datetime.strptime(match.group(), '%Y-%m-%d').date()
With this:
date = datetime.datetime.strptime(match.group(), '%d/%m/%Y').date()
Your Fix:
import datetime
from datetime import date
import re
s = "birthday on 20/12/2018"
match = re.search(r'\d{2}/\d{2}/\d{4}', s)
date = datetime.datetime.strptime(match.group(), '%d/%m/%Y').date()
print (date)
But:
Why go to all the trouble when there are easier and more elegant ways out there?
Using dateutil's parser:
import dateutil.parser as dparser
dt_1 = "birthday on 20/12/2018"
print("Date: {}".format(dparser.parse(dt_1,fuzzy=True).date()))
OUTPUT:
Date: 2018-12-20
EDIT:
With your edited question which now has multiple dates, you could extract them using regex:
import re
s = "birthday on 20/12/2018 and wedding aniversry on 04/01/1997 and dob is on 09/07/1897"
pattern = r'\d{2}/\d{2}/\d{4}'
print("\n".join(re.findall(pattern,s)))
OUTPUT:
20/12/2018
04/01/1997
09/07/1897
OR
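Using dateutil (with dayfirst=True because the dates are dd/mm/yyyy; by default dateutil reads 04/01/1997 as April 1):
from dateutil.parser import parse
for word in s.split():
    try:
        print(parse(word, dayfirst=True))
    except ValueError:
        pass
OUTPUT:
2018-12-20 00:00:00
1997-01-04 00:00:00
1897-07-09 00:00:00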
You are doing everything right except for one line, which should be:
date = datetime.datetime.strptime(match.group(), '%d/%m/%Y').date()
You have to give datetime.strptime the same format as your input.
'%Y-%m-%d' >> 2018-12-20
'%d/%m/%Y' >> 20/12/2018
Edit
If you are not looking for a datetime object, you can do this:
results = re.findall(r'\d{2}/\d{2}/\d{4}', s)
print('\n'.join(results))
Output
In [20]: results = re.findall(r'\d{2}/\d{2}/\d{4}', s)
In [21]: print('\n'.join(results))
20/12/2018
04/01/1997
09/07/1897
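If datetime objects are wanted for all three dates, findall and strptime can be combined. This is a minimal sketch assuming every match is in dd/mm/yyyy order, as in the question; a match that is not a real date (for example 31/02/2018) would raise ValueError here.

```python
import re
from datetime import datetime

s = "birthday on 20/12/2018 and wedding anniversary on 04/01/1997 and dob is on 09/07/1897"

# findall returns every match as a string; strptime converts each one
# using a format that mirrors the input layout exactly.
dates = [datetime.strptime(text, '%d/%m/%Y').date()
         for text in re.findall(r'\d{2}/\d{2}/\d{4}', s)]

for d in dates:
    # isoformat() gives YYYY-MM-DD; strftime gives back the original layout.
    print(d.isoformat(), d.strftime('%d/%m/%Y'))
```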

Python: Extract two dates from string

I have a string s which contains two dates in it and I am trying to extract these two dates in order to subtract them from each other to count the number of days in between. In the end I am aiming to get a string like this: s = "o4_24d_20170708_20170801"
At the company I work we can't install additional packages so I am looking for a solution using native python. Below is what I have so far by using the datetime package which only extracts one date: How can I get both dates out of the string?
import re, datetime
s = "o4_20170708_20170801"
match = re.search(r'\d{4}\d{2}\d{2}', s)
date = datetime.datetime.strptime(match.group(), '%Y%m%d').date()
print(date)
from datetime import datetime
import re
s = "o4_20170708_20170801"
pattern = re.compile(r'(\d{8})_(\d{8})')
dates = pattern.search(s)
# dates[0] is full match, dates[1] and dates[2] are captured groups
start = datetime.strptime(dates[1], '%Y%m%d')
end = datetime.strptime(dates[2], '%Y%m%d')
difference = end - start
print(difference.days)
will print
24
then, you could do something like:
days = '{}d_'.format(difference.days)
match_index = dates.start()
new_name = s[:match_index] + days + s[match_index:]
print(new_name)
to get
o4_24d_20170708_20170801
import re, datetime
s = "o4_20170708_20170801"
match = re.findall(r'\d{4}\d{2}\d{2}', s)
for a_date in match:
    date = datetime.datetime.strptime(a_date, '%Y%m%d').date()
    print(date)
This will print:
2017-07-08
2017-08-01
Your regex was working correctly at regexpal

find matching string in a file using Python

Using Python I would like to find strings in a file that match the format YYYY-MM-DD
Here is what my sample file looks like
I want to find date 2016-01-01 ,2016-01-05
then I want to find 2016-01-17
then I want to find this date 2016-01-04
Output should be
2016-01-01
2016-01-05
2016-01-17
2016-01-04
Below is the code I am currently using, but I can't find matching records. Any help on this will be appreciated.
#!/usr/bin/python
import sys
import csv
import re
pattern = re.compile("^([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])$")
for i, line in enumerate(open('C:\\Work\\scripts\\logs\\CSI.txt')):
    for match in re.finditer(pattern, line):
        print('Found on line %s: %s' % (i+1, match.groups()))
I would remove ^( and )$, because your dates are not alone on their lines:
re.compile("[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]")
You can use regex and datetime to get valid dates from a string
import re
from datetime import datetime
string = "I want to find date 2016-01-01 ,2016-01-05"
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
raw_dates = pattern.findall(string)
parsed_dates = []
for date in raw_dates:
    try:
        d = datetime.strptime(date, "%Y-%m-%d")
        parsed_dates.append(d)
    except ValueError:
        # skip strings that match the pattern but are not real dates
        pass
print(parsed_dates)
output:
[datetime.datetime(2016, 1, 1, 0, 0), datetime.datetime(2016, 1, 5, 0, 0)]
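Putting the two answers together for the file case, a sketch might scan line by line, report the line number, and keep only strings that are real dates. find_dates is a hypothetical helper name, and a list of sample lines stands in for the file so the example runs anywhere; an open file object could be passed instead, since it is also an iterable of lines.

```python
import re
from datetime import datetime

# No ^ or $ anchors: the date may appear anywhere in the line.
DATE_RE = re.compile(r'\b\d{4}-\d{2}-\d{2}\b')

def find_dates(lines):
    """Yield (line_number, date_text) for each valid YYYY-MM-DD date."""
    for number, line in enumerate(lines, start=1):
        for text in DATE_RE.findall(line):
            try:
                datetime.strptime(text, '%Y-%m-%d')
            except ValueError:
                continue  # looks like a date but is not one, e.g. 2016-13-45
            yield number, text

sample = [
    "I want to find date 2016-01-01 ,2016-01-05",
    "then I want to find 2016-01-17",
    "then I want to find this date 2016-01-04",
    "not a real date: 2016-13-45",
]
for number, text in find_dates(sample):
    print(number, text)
```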
