Python regex to match dates - python

What regular expression in Python do I use to match dates like this: "11/12/98"?

Instead of using regex, it is generally better to parse the string as a datetime.datetime object:
In [140]: datetime.datetime.strptime("11/12/98","%m/%d/%y")
Out[140]: datetime.datetime(1998, 11, 12, 0, 0)
In [141]: datetime.datetime.strptime("11/12/98","%d/%m/%y")
Out[141]: datetime.datetime(1998, 12, 11, 0, 0)
If you assign the result to a name, you can then access the day, month, and year (and hour, minute, and second) as attributes of the datetime.datetime object:
In [142]: date = datetime.datetime.strptime("11/12/98","%m/%d/%y")
In [143]: date.year
Out[143]: 1998
In [144]: date.month
Out[144]: 11
In [145]: date.day
Out[145]: 12
To test if a sequence of digits separated by forward-slashes represents a valid date, you could use a try..except block. Invalid dates will raise a ValueError:
In [159]: try:
   .....:     datetime.datetime.strptime("99/99/99","%m/%d/%y")
   .....: except ValueError as err:
   .....:     print(err)
   .....:
time data '99/99/99' does not match format '%m/%d/%y'
If you need to search a longer string for a date,
you could use regex to search for digits separated by forward-slashes:
In [146]: import re
In [152]: match = re.search(r'(\d+/\d+/\d+)','The date is 11/12/98')
In [153]: match.group(1)
Out[153]: '11/12/98'
Of course, invalid dates will also match:
In [154]: match = re.search(r'(\d+/\d+/\d+)','The date is 99/99/99')
In [155]: match.group(1)
Out[155]: '99/99/99'
To check that match.group(1) returns a valid date string, you could then parse it using datetime.datetime.strptime as shown above.
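Putting the two steps together, a minimal sketch might look like this. The helper name find_valid_dates and the default format string are assumptions made for illustration, not part of the answer above:

```python
import datetime
import re

# Loose pattern: three runs of digits separated by forward slashes.
DATE_CANDIDATE = re.compile(r'\d+/\d+/\d+')

def find_valid_dates(text, fmt="%m/%d/%y"):
    """Return a datetime for every candidate in text that strptime accepts."""
    dates = []
    for candidate in DATE_CANDIDATE.findall(text):
        try:
            dates.append(datetime.datetime.strptime(candidate, fmt))
        except ValueError:
            # The digits looked like a date but are not one, e.g. 99/99/99.
            continue
    return dates

print(find_valid_dates('Good: 11/12/98, bad: 99/99/99'))
```

The regex only proposes candidates; strptime makes the final decision, so impossible dates are dropped without any extra logic.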

The regular expression below works for me for dates in the following formats:
14-11-2017
14.11.2017
14|11|2017
It accepts years from 2000 to 2099.
Do not forget the $ at the end; without it, the pattern would also accept a string with trailing characters, such as 14-11-20177.
import re
date="13-11-2017"
x=re.search(r"^([1-9]|[12][0-9]|3[01])[-.|]([1-9]|1[0-2])[-.|]20[0-9][0-9]$",date)
x.group()
output = '13-11-2017'
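If you would rather not rely on remembering the $, re.fullmatch anchors both ends for you. This is only a sketch: the helper name is_date is made up, and the pattern is a cleaned-up variant of the one above rather than a definitive version:

```python
import re

# Same idea as the pattern above, written without ^ and $.
DATE_RE = re.compile(r"([1-9]|[12][0-9]|3[01])[-.|]([1-9]|1[0-2])[-.|]20[0-9][0-9]")

def is_date(text):
    """True only if the whole string is a date; fullmatch anchors both ends."""
    return DATE_RE.fullmatch(text) is not None

print(is_date("13-11-2017"))   # True
print(is_date("14|11|2017"))   # True
print(is_date("13-11-20177"))  # False: the extra digit is not ignored
```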

I built my solution on top of @aditya Prakash's approach:
print(re.search(r"^([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])$|^([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])$",'01/01/2018'))
The first part (^([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])$) can handle the following formats:
01.10.2019
1.1.2019
1.1.19
12/03/2020
01.05.1950
The second part (^([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])$) can basically do the same, but in inverse order, where the year comes first, followed by month, and then day.
2020/02/12
As delimiters it allows ., / and -. For the year it allows anything from 1900 to 2099; a two-digit year is also accepted.
If you have suggestions for improvement please let me know in the comments, so I can update the answer.
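One possible refinement, sketched here for the day-first half only: named groups make the parts easier to read back, and a backreference forces both delimiters to be the same character. The group names and the helper split_date are assumptions for illustration:

```python
import re

# Day-first variant only; (?P=sep) requires the second delimiter to equal the first.
DAY_FIRST = re.compile(
    r"^(?P<day>0?[1-9]|[12][0-9]|3[01])"
    r"(?P<sep>[./-])"
    r"(?P<month>0?[1-9]|1[0-2])"
    r"(?P=sep)"
    r"(?P<year>(?:19|20)?[0-9]{2})$"
)

def split_date(text):
    """Return (day, month, year) as strings, or None if text does not match."""
    m = DAY_FIRST.match(text)
    return (m.group("day"), m.group("month"), m.group("year")) if m else None

print(split_date("01.10.2019"))  # ('01', '10', '2019')
print(split_date("1.1.19"))      # ('1', '1', '19')
print(split_date("12/03-2020"))  # None: mixed delimiters are rejected
```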

With this regular expression you can validate different kinds of date/time strings; only small changes are needed for each format.
^\d\d\d\d/(0?[1-9]|1[0-2])/(0?[1-9]|[12][0-9]|3[01]) (00|[0-9]|1[0-9]|2[0-3]):([0-9]|[0-5][0-9]):([0-9]|[0-5][0-9])$ --> validates this: 2018/7/12 13:00:00
For your format you can change it to:
^(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[0-2])/\d\d$ --> validates this: 11/12/98

I use something like this
>>> import datetime
>>> regex = datetime.datetime.strptime
>>>
>>> # TEST
>>> assert regex('2020-08-03', '%Y-%m-%d')
>>>
>>> assert regex('2020-08', '%Y-%m-%d')
ValueError: time data '2020-08' does not match format '%Y-%m-%d'
>>> assert regex('08/03/20', '%m/%d/%y')
>>>
>>> assert regex('08-03-2020', '%m/%d/%y')
ValueError: time data '08-03-2020' does not match format '%m/%d/%y'
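If you want a plain True/False instead of an assertion that blows up, the same call can be wrapped in a small helper. The name is_valid_date is an assumption, not something from the snippet above:

```python
import datetime

def is_valid_date(text, fmt):
    """Return True if text parses under fmt, False otherwise."""
    try:
        datetime.datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True

print(is_valid_date('2020-08-03', '%Y-%m-%d'))  # True
print(is_valid_date('2020-08', '%Y-%m-%d'))     # False
print(is_valid_date('2020-02-30', '%Y-%m-%d'))  # False: February has no day 30
```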

Well, as I understand it, if you simply want to match this format in a given string, I prefer this regular expression:
pattern='[0-9/]+'
To match the format more strictly, the following works:
pattern='(?:[0-9]{2}/){2}[0-9]{2}'
Personally, I cannot agree with unutbu's answer, since sometimes we use regular expressions for "finding" and "extracting", not only for "validating".
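When extracting from free text, one thing to watch is that the strict pattern can still match in the middle of a longer run of digits. A hedged sketch using negative lookarounds (the names PLAIN and STRICT and the sample text are made up):

```python
import re

PLAIN = re.compile(r'(?:[0-9]{2}/){2}[0-9]{2}')
# Negative lookarounds keep a match from starting or ending inside a longer digit run.
STRICT = re.compile(r'(?<!\d)(?:[0-9]{2}/){2}[0-9]{2}(?!\d)')

text = 'invoice 2011/12/98345 was paid on 11/12/98'
plain_hits = PLAIN.findall(text)
strict_hits = STRICT.findall(text)
print(plain_hits)   # ['11/12/98', '11/12/98'] (the first comes from inside the invoice number)
print(strict_hits)  # ['11/12/98']
```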

Sometimes we need to get the date from a string.
One example with grouping:
import re
record = '1518-09-06 00:57 some-alphanumeric-charecter'
pattern_date_time = r'([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}) .+'
match = re.match(pattern_date_time, record)
if match is not None:
    # group(1) is the first parenthesised group, i.e. the date and time
    date = match.group(1)
    print(date)  # outputs 1518-09-06 00:57

Since the question title asks for a regex that finds dates, I would like to propose another solution, although there are many already.
To find all dates in a string that fall in this century (2000 - 2099), the following worked for me:
dates = re.findall(r'([1-9]|1[0-9]|2[0-9]|3[0-1]|0[0-9])(\.|-|/)([1-9]|1[0-2]|0[0-9])(\.|-|/)(20[0-9][0-9])',dates_ele)
dates = [''.join(dates[i]) for i in range(len(dates))]
This regex is able to find multiple dates in the same string, like bla Bla 8.05/2020 \n BLAH bla15/05-2020 blaa. As you can see, the separator can be . or - instead of /, and the two separators within one date do not have to be the same.
Some explaining
More specifically, it finds dates in the format day, month, year. Day is a one-digit integer, or a zero followed by one digit, or 1 or 2 followed by one digit, or a 3 followed by 0 or 1. Month is a one-digit integer, or a zero followed by one digit, or 1 followed by 0, 1, or 2. Year is the number 20 followed by any number between 00 and 99.
Useful notes
One can add more separator symbols by appending them, preceded by |, to both (\.|-|/) groups. For example, to add --, use (\.|-|/|--)
To allow years outside of this range, modify (20[0-9][0-9]) to ([0-9][0-9][0-9][0-9])
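The ''.join step is only needed because findall returns one tuple of groups per match. A variation, assuming you only need the full date strings, is to use non-capturing groups and read group(0) from finditer. The pattern below is a slightly tightened sketch (it does not accept 00 as day or month) and find_dates is a hypothetical name:

```python
import re

# Non-capturing groups (?:...) so the whole match is what gets read back;
# the separators sit in a character class, where the dot is literal.
DATE_RE = re.compile(
    r'(?:0?[1-9]|[12][0-9]|3[01])[.\-/](?:0?[1-9]|1[0-2])[.\-/]20[0-9]{2}'
)

def find_dates(text):
    """Return every matched date as a string, via match.group(0)."""
    return [m.group(0) for m in DATE_RE.finditer(text)]

print(find_dates('bla Bla 8.05/2020 \n BLAH bla15/05-2020 blaa'))
# ['8.05/2020', '15/05-2020']
```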

I use something like this :
string="text 24/02/2021 ... 24-02-2021 ... 24_02_2021 ... 24|02|2021 text"
new_string = re.sub(r"[0-9]{1,4}[\_|\-|\/|\|][0-9]{1,2}[\_|\-|\/|\|][0-9]{1,4}", ' ', string)
print(new_string)
out : text ... ... ... text

If you don't want to deal with the ValueError exception raised by the datetime-based methods, you can use re. You should probably also check that the day of the month is at most 31 and that the month number is at most 12:
from re import search as re_search
date_input = '31.12.1998'
re_search(r'^(3[01]|[12][0-9]|0[1-9])\.(1[0-2]|0[1-9])\.[0-9]{4}$', date_input)
For the datetime approach, see the good answer @unutbu gave earlier.
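A regex like this still lets 31.02.1998 through. If you want to stay exception-free, one option (a sketch, with is_real_date as a made-up name) is to compare the day against calendar.monthrange from the standard library:

```python
import calendar
import re

DATE_RE = re.compile(r'^(3[01]|[12][0-9]|0[1-9])\.(1[0-2]|0[1-9])\.([0-9]{4})$')

def is_real_date(text):
    """Check the dd.mm.yyyy shape with re, then the day count with calendar."""
    m = DATE_RE.match(text)
    if m is None:
        return False
    day, month, year = (int(g) for g in m.groups())
    if year < 1:
        return False  # year 0000 has the right shape but is not a valid year
    # monthrange returns (weekday of the first day, number of days in the month).
    return day <= calendar.monthrange(year, month)[1]

print(is_real_date('31.12.1998'))  # True
print(is_real_date('31.02.1998'))  # False
print(is_real_date('29.02.2000'))  # True, 2000 was a leap year
```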

In case anyone wants to match this type of date "24 November 2008"
you can use
import re
date = "24 November 2008"
regex = re.compile(r"\d+\s\w+\s\d+")
matchDate = regex.findall(date)
print(matchDate)
Or
import re
date = "24 November 2008"
matchDate = re.findall(r"\d+\s\w+\s\d+", date)
print(matchDate)
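Since \d+\s\w+\s\d+ matches any "number word number" run, it may be worth passing each hit through strptime. A rough sketch, assuming English month names (the default C locale) and a hypothetical helper find_long_dates:

```python
import datetime
import re

def find_long_dates(text):
    """Find 'day monthname year' candidates and keep those strptime accepts."""
    found = []
    for candidate in re.findall(r"\d+\s\w+\s\d+", text):
        try:
            found.append(datetime.datetime.strptime(candidate, "%d %B %Y").date())
        except ValueError:
            pass  # e.g. '3 blind 4' has the right shape but is not a date
    return found

print(find_long_dates("Signed on 24 November 2008 by 3 blind 4 mice"))
# [datetime.date(2008, 11, 24)]
```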

This regular expression for matching dates in the format "22/10/2021" works for me:
import re
date = "WHATEVER 22/10/2029 WHATEVER"
match = re.search(r"([1-9]|[12][0-9]|3[01])/([1-9]|1[0-2])/([0-9][0-9][0-9][0-9])", date)
print(match)
OUTPUT = <re.Match object; span=(9, 19), match='22/10/2029'>
The string on the fourth line, ([1-9]|[12][0-9]|3[01])/([1-9]|1[0-2])/([0-9][0-9][0-9][0-9]), is the regular expression, which I based on this page. It limits the day to 1-31 and the month to 1-12.

Related

Unable to extract date of birth from a given random format

I have a set of text files from which I have to extract the date of birth. The code below extracts the date of birth from most of the text files, but it fails when the date is given in the format below. How can I extract the DOB? The data is very non-uniform and broken.
Code:
import re
string = """ This is python to extract date
D
.O.B.
:
14
J
u
n
e
199
1
work in a team or as individual
contributor.
And Name is: Zon; DOB: 12/23/
1955 11/15/2014 11:53 AM"""
pattern = re.findall(r'.*?D.O.B.*?:\s+([\d]{1,2}\s(?:JAN|NOV|OCT|DEC|June)\s[\d]{4})', string)
pattern2 = re.findall(r'.*?DOB.*?:\s+([\d/]+)', string)
print(pattern)
print(pattern2)
Expected Output:
['14 June 1991']
['12/23/1955']
Working with dates and times is a recurring headache for developers. In your case, you are trying to extract the date of birth, which is marked by a DOB prefix with or without separators.
I suggest not writing and maintaining a lot of regexes in the code, since you said the date formats can vary. You can use a good library such as python-dateutil; install it from PyPI with pip install python-dateutil.
All you have to do is find a good candidate section of the text and use the library to parse it. E.g., in your case, find the section of text containing the date like this:
import re
from dateutil.parser import parse
in_str = """DOB: 14 June 1991
work in a team or as individual
contributor"""
# find DOB-prefixed string patterns and capture only the text after the prefix
candidates = re.findall(r"D\.?O\.?B\.?:\s*(.*\d{4})\b", in_str)
# parse the dates from the candidates
parsed_dates = [parse(dt) for dt in candidates]
print(parsed_dates)
This will give you an output like
[datetime.datetime(1991, 6, 14, 0, 0)]
From here, you can manipulate or process them easily. Finding the date-containing sections first is not strictly necessary for the date parser to work, but it reduces the work you have to do.
For the first pattern, you can add matching optional whitespace chars between the single characters.
\bD\s*\.\s*O\s*\.\s*B[^:]*:\s+(\d{1,2}\s*(?:JAN|NOV|OCT|DEC|J\s*u\s*n\s*e)(?:\s*\d){4})
Then in the match, remove the newlines.
For the second pattern, you can match optional whitespace chars around the / and then remove the whitespace chars from the matches.
\bDOB.*?:\s+(\d\d\s*/\s*\d\d\s*/\s*\d{4})\b
For example
import re
pattern = r"\bDOB.*?:\s+(\d\d\s*/\s*\d\d\s*/\s*\d{4})\b"
s = (" This is python to extract date\n"
"D\n"
".O.B.\n"
": \n"
"14 \n"
"J\n"
"u\n"
"n\n"
"e \n\n"
"199\n"
"1\n"
"work in a team or as individual \n"
"contributor.\n"
"And Name is: Zon; DOB: 12/23/\n"
" 1955 11/15/2014 11:53 AM")
res = [re.sub(r"\s+", "", s) for s in re.findall(pattern, s)]
print(res)
Output
['12/23/1955']
If there should not be a colon between DOB and matching the "date" part, you can also use a negated character class to exclude matching the colon instead of .*?
\bDOB[^:]*:\s+(\d\d\s*/\s*\d\d\s*/\s*\d{4})\b
I agree with @Kris that you should try to have as few regexes to maintain as possible, and keep them as simple as possible. You should also, as he suggested, divide your problem into 2 steps:
1/ extracting candidates
2/ parsing (using, for example dateutil.parser.parse)
step 1: extracting candidates
One solution for making regex patterns simpler is to manipulate the input string (if possible).
For example in your case, the difficulty comes from varying newlines and spaces. Taking back your example:
import re
s1 = """ This is python to extract date
D
.O.B.
:
14
J
u
n
e
199
1
work in a team or as individual
contributor.
And Name is: Zon; DOB: 12/23/
1955 11/15/2014 11:53 AM"""
You can create s2, which removes newlines and spaces:
s2 = s1.replace("\n", "").replace(" ", "")
Then your pattern becomes simpler:
pattern = re.compile(r"D\.?O\.?B\.?:(?P<date_of_birth>(.*?)(\d{4}))")
(see pattern explanation below)
Match the pattern against your simplified string:
matches = [m.group('date_of_birth') for m in pattern.finditer(s2)]
You get:
>>> print(matches)
['14June1991', '12/23/1955']
step 2: parsing candidates to date objects
@Kris's suggestion works very well:
import dateutil.parser
dobs = [dateutil.parser.parse(m) for m in matches]
You get your expected result:
>>> print(dobs)
[datetime.datetime(1991, 6, 14, 0, 0), datetime.datetime(1955, 12, 23, 0, 0)]
You can then use strftime if you want to turn all your dates into tidy, standardized strings:
dobs_pretty = [d.strftime('%Y-%m-%d') for d in dobs]
print(dobs_pretty)
>>> ['1991-06-14', '1955-12-23']
Pattern explanation
D\.?O\.?B\.?: you look for "DOB", with or without periods (hence the ? operator)
(?P<date_of_birth>(.*?)(\d{4})): you capture everything to the right of "DOB" until you find 4 consecutive digits (representing the year). (.*?) captures everything "up until" (\d{4}) (the 4 consecutive digits)
?P<date_of_birth> lets you name the captured group, which makes retrieving the date much easier. You simply pass the group name (date_of_birth) to the group() method: m.group('date_of_birth'). Note that Python group names must be valid identifiers, so underscores are used rather than hyphens.
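If installing dateutil is not an option, the same two steps can be sketched with the standard library only, using datetime.strptime in place of dateutil's parser. This swaps in a different technique from the one above: you have to list the formats you expect yourself. FORMATS, DOB_RE and extract_dobs are hypothetical names, and the sample string is a shortened version of the one in the question:

```python
import datetime
import re

# Hypothetical format list; extend it to cover the layouts seen in your files.
FORMATS = ("%d%B%Y", "%m/%d/%Y")
DOB_RE = re.compile(r"D\.?O\.?B\.?:(.*?\d{4})")

def extract_dobs(text):
    """Strip whitespace, grab the text after each DOB marker, try each format."""
    squeezed = re.sub(r"\s+", "", text)
    dobs = []
    for candidate in DOB_RE.findall(squeezed):
        for fmt in FORMATS:
            try:
                dobs.append(datetime.datetime.strptime(candidate, fmt).date())
                break
            except ValueError:
                continue  # wrong format for this candidate, try the next one
    return dobs

sample = ("D\n.O.B.\n:\n14\nJ\nu\nn\ne\n199\n1\nwork in a team\n"
          "And Name is: Zon; DOB: 12/23/\n1955 11/15/2014 11:53 AM")
print(extract_dobs(sample))
# [datetime.date(1991, 6, 14), datetime.date(1955, 12, 23)]
```

Candidates that match none of the listed formats are silently skipped here; whether that is acceptable depends on your data.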

How to find date with regular expressions

I have to find a date inside a string with a regular expression in Python:
astring ='L2A_T21HUB_A023645_20210915T135520'
I'm trying to get the part before the T, which has the shape xxxxxxxx where every x is a digit.
desiredOutput = '20210915'
I'm new to regex, so I have no idea how to solve this.
If the astring's format is consistent, meaning it will always have the same shape with respect to the date, you can split the string by '_' and get the last substring and get the date from there as such:
astring ='L2A_T21HUB_A023645_20210915T135520'
date_split = astring.split("_") # --> ['L2A', 'T21HUB', 'A023645', '20210915T135520']
desiredOutput = date_split[3][:8] # --> [3] = '20210915T135520' [:8] gets first 8 chars
print(desiredOutput) # --> 20210915
If you wanted an actual datetime object
>>> from datetime import datetime
>>> astring = 'L2A_T21HUB_A023645_20210915T135520'
>>> date_str = astring.split('_')[-1]
>>> datetime.strptime(date_str, '%Y%m%dT%H%M%S')
datetime.datetime(2021, 9, 15, 13, 55, 20)
From that, you can use datetime.strftime to reformat to a new string, or you can use split('T')[0] to get the string you want.
The trouble with Regex is that there can be unexpected patterns that match your expected pattern and throw things off. However, if you know that only the date portion will ever have 8 sequential digits, you can do this:
import re
date_patt = re.compile(r'\d{8}')
date = date_patt.search(astring).group(0)
You can develop more robust patterns based on your knowledge of the formatting of the incoming strings. For instance, if you know that the date will always follow an underscore, you could use a look-behind assertion:
date_patt = re.compile(r'(?<=\_)\d{8}') # look for '_' before the date, but don't capture
Hope this helps. Regex can be finicky and may take some tweaking, but hope this sets you in the right direction.
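As one example of a more specific pattern, here is a sketch that captures the date and time parts separately and also builds a datetime. It assumes the stamp is always 8 digits, a literal T, then 6 digits, right after an underscore; extract_stamp is a made-up name:

```python
import re
from datetime import datetime

# Assumes: underscore, 8 digits (date), literal T, 6 digits (time).
STAMP = re.compile(r'(?<=_)(\d{8})T(\d{6})')

def extract_stamp(name):
    """Return (date string, datetime) for the first stamp found, or None."""
    m = STAMP.search(name)
    if m is None:
        return None
    return m.group(1), datetime.strptime(m.group(1) + m.group(2), '%Y%m%d%H%M%S')

print(extract_stamp('L2A_T21HUB_A023645_20210915T135520'))
# ('20210915', datetime.datetime(2021, 9, 15, 13, 55, 20))
```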

Obtain part of a string using regular expressions in Python

I am dealing with strings that need to be converted into dates in Python. In a normal situation, my string would have %d/%m/%Y %H:%M:%S. For instance:
18/02/2013 09:21:14
However, on some occasions I get something like %d/%m/%Y %H:%M:%S:%ms, such as: 06/01/2014 09:52:14:78
I would like to get rid of that ms bit but I need to figure out how. I have been able to create a regular expression which can test if the date matches:
mydate = re.compile("^((((31\/(0?[13578]|1[02]))|((29|30)\/(0?[1,3-9]|1[0-2])))\/(1[6-9]|[2-9]\d)?\d{2})|(29\/0?2\/(((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))|(0?[1-9]|1\d|2[0-8])\/((0?[1-9])|(1[0-2]))\/((1[6-9]|[2-9]\d)?\d{2})) (20|21|22|23|[0-1]?\d):[0-5]?\d:[0-5]?\d$")
s = "06/01/2014 09:52:14:78"
bool(mydate.match(s))
>>> False
However I do not know how to obtain only the interesting part, i.e 06/01/2014 09:52:14
Any ideas?
You can use a positive lookbehind and re.sub():
>>> re.sub(r'(?<=\d{2}:\d{2}:\d{2}).*','','06/01/2014 09:52:14:78')
'06/01/2014 09:52:14'
How about the re.sub function?
>>> re.sub(r'( \d{2}(:\d{2}){2}).*$',r'\1','06/01/2014 09:52:14:78')
'06/01/2014 09:52:14'
>>> re.sub(r'( \d{2}(:\d{2}){2}).*$',r'\1','8/02/2013 09:21:14')
'8/02/2013 09:21:14'
( \d{2}(:\d{2}){2}) matches hours:min:sec, saved in capture group 1
.*$ matches the milliseconds
r'\1' replaces the match with the contents of the first capture group
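Since the end goal is a date object, the trimming and the conversion can be combined. A minimal sketch, assuming the extra part is always a colon followed by digits (parse_stamp is a hypothetical helper):

```python
import re
from datetime import datetime

# Capture the date and time; an optional trailing ":<digits>" part is ignored.
STAMP = re.compile(r'^(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2})(?::\d+)?$')

def parse_stamp(text):
    """Return a datetime without the trailing fraction; None if it does not fit."""
    m = STAMP.match(text)
    if m is None:
        return None
    try:
        return datetime.strptime(m.group(1), '%d/%m/%Y %H:%M:%S')
    except ValueError:
        return None  # right shape, impossible date

print(parse_stamp('06/01/2014 09:52:14:78'))  # 2014-01-06 09:52:14
print(parse_stamp('18/02/2013 09:21:14'))     # 2013-02-18 09:21:14
```

Here strptime does the calendar validation, so the long validating regex from the question is not needed for this step.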

Matching dates with regular expressions in Python?

I know that there are similar questions to mine that have been answered, but after reading through them I still don't have the solution I'm looking for.
Using Python 3.2.2, I need to match "Month, Day, Year" with the Month being a string, Day being two digits not over 30, 31, or 28 for February and 29 for February on a leap year. (Basically a REAL and Valid date)
This is what I have so far:
pattern = "(January|February|March|April|May|June|July|August|September|October|November|December)[,][ ](0[1-9]|[12][0-9]|3[01])[,][ ]((19|20)[0-9][0-9])"
expression = re.compile(pattern)
matches = expression.findall(sampleTextFile)
I'm still not too familiar with regex syntax so I may have characters in there that are unnecessary (the [,][ ] for the comma and spaces feels like the wrong way to go about it), but when I try to match "January, 26, 1991" in my sample text file, the printing out of the items in "matches" is ('January', '26', '1991', '19').
Why does the extra '19' appear at the end?
Also, what things could I add to or change in my regex that would allow me to validate dates properly? My plan right now is to accept nearly all dates and weed them out later using high level constructs by comparing the day grouping with the month and year grouping to see if the day should be <31,30,29,28
Any help would be much appreciated including constructive criticism on how I am going about designing my regex.
Here's one way to make a regular expression that will match any date of your desired format (though you could obviously tweak whether commas are optional, add month abbreviations, and so on):
years = r'((?:19|20)\d\d)'
pattern = r'(%%s) +(%%s), *%s' % years
thirties = pattern % (
"September|April|June|November",
r'0?[1-9]|[12]\d|30')
thirtyones = pattern % (
"January|March|May|July|August|October|December",
r'0?[1-9]|[12]\d|3[01]')
fours = '(?:%s)' % '|'.join('%02d' % x for x in range(4, 100, 4))
feb = r'(February) +(?:%s|%s)' % (
r'(?:(0?[1-9]|1\d|2[0-8])), *%s' % years, # 1-28 any year
r'(?:(29), *((?:(?:19|20)%s)|2000))' % fours) # 29 leap years only
result = '|'.join('(?:%s)' % x for x in (thirties, thirtyones, feb))
r = re.compile(result)
print(result)
Then we have:
>>> r.match('January 30, 2001') is not None
True
>>> r.match('January 31, 2001') is not None
True
>>> r.match('January 32, 2001') is not None
False
>>> r.match('February 32, 2001') is not None
False
>>> r.match('February 29, 2001') is not None
False
>>> r.match('February 28, 2001') is not None
True
>>> r.match('February 29, 2000') is not None
True
>>> r.match('April 30, 1908') is not None
True
>>> r.match('April 31, 1908') is not None
False
And what is this glorious regexp, you may ask?
>>> print(result)
(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:February +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))
(I initially intended to do a tongue-in-cheek enumeration of the possible dates, but I basically ended up hand-writing that whole gross thing except for the multiples of four, anyway.)
Here are some quick thoughts:
Everyone who is suggesting you use something other than regular expression is giving you very good advice. On the other hand, it's always a good time to learn more about regular expression syntax...
An expression in square brackets -- [...] -- matches any single character inside those brackets. So writing [,], which only contains a single character, is exactly identical to writing a simple unadorned comma: ,.
The .findall method returns a list of all matching groups in the string. A group is identified by parentheses -- (...) -- and they count from left to right, outermost first. Your final expression looks like this:
((19|20)[0-9][0-9])
The outermost parentheses match the entire year, and the inside parentheses match the first two digits. Hence, for a date like "1989", the final two match groups are going to be 1989 and 19. Since you don't want the inner group (first two digits), you should use a non-capturing group instead. Non-capturing groups start with ?:, used like this: (?:a|b|c)
By the way, there is some good documentation on how to use regular expressions here.
Python has a date parser as part of the time module:
import time
time.strptime("December 31, 2012", "%B %d, %Y")
The above is all you need if the date format is always the same.
So, in real production code, I would write a regular expression that parses the date, and then use the results from the regular expression to build a date string that is always the same format.
Now that you said, in the comments, that this is homework, I'll post another answer with tips on regular expressions.
You have this regular expression:
pattern = "(January|February|March|April|May|June|July|August|September|October|November|December)[,][ ](0[1-9]|[12][0-9]|3[01])[,][ ]((19|20)[0-9][0-9])"
One feature of regular expressions is a "character class". Characters in square brackets make a character class. Thus [,] is a character class matching a single character, , (a comma). You might as well just put the comma.
Perhaps you wanted to make the comma optional? You can do that by putting a question mark after it: ,?
Anything you put into parentheses makes a "match group". I think the mysterious extra "19" came from a match group you didn't mean to have. You can make a non-capturing group using this syntax: (?:
So, for example:
r'(?:red|blue) socks'
This would match "red socks" or "blue socks" but does not make a match group. If you then put that inside plain parentheses:
r'((?:red|blue) socks)'
That would make a match group, whose value would be "red socks" or "blue socks"
I think if you apply these comments to your regular expression, it will work. It is mostly correct now.
As for validating the date against the month, that is way beyond the scope of a regular expression. Your pattern will match "February 31" and there is no easy way to fix that.
First of all, as others have said, I don't think regular expressions are the best choice for this problem, but to answer your question: by using parentheses you split the pattern into several subgroups, and when you call findall, it returns, for each match, a tuple containing every group you created.
((19|20)[0-9][0-9])
Here is your problem: the regex captures both the entire year and 19 or 20, depending on whether the year starts with 19 or 20.
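To make the advice in this thread concrete, here is a hedged sketch: the question's pattern with the century in a non-capturing group, followed by the "weed them out later" step done with strptime. The helper valid_dates and the sample text are assumptions:

```python
import datetime
import re

# Same shape as the question's pattern, with the century in a non-capturing group.
PATTERN = re.compile(
    r"(January|February|March|April|May|June|July|August|September|October"
    r"|November|December), (0[1-9]|[12][0-9]|3[01]), ((?:19|20)[0-9][0-9])"
)

def valid_dates(text):
    """Return a date for each match that names a real calendar day."""
    dates = []
    for month, day, year in PATTERN.findall(text):
        try:
            stamp = "{} {} {}".format(month, day, year)
            dates.append(datetime.datetime.strptime(stamp, "%B %d %Y").date())
        except ValueError:
            pass  # e.g. February, 31
    return dates

print(PATTERN.findall("January, 26, 1991"))
# [('January', '26', '1991')]
print(valid_dates("January, 26, 1991 and February, 31, 2001"))
# [datetime.date(1991, 1, 26)]
```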

date matching using python regex

What am I doing wrong in the regular expression matching below?
>>> import re
>>> d="30-12-2001"
>>> re.findall(r"\b[1-31][/-:][1-12][/-:][1981-2011]\b",d)
[]
[1-31] matches 1-3 and 1, which is basically 1, 2 or 3. You cannot match a number range unless it's a subset of 0-9. The same applies to [1981-2011], which matches exactly one character that is 0, 1, 2, 8 or 9.
The best solution is simply matching any number and then checking the numbers later using python itself. A date such as 31-02-2012 would not make any sense - and making your regex check that would be hard. Making it also handle leap years properly would make it even harder or impossible. Here's a regex matching anything that looks like a dd-mm-yyyy date: \b\d{1,2}[-/:]\d{1,2}[-/:]\d{4}\b
However, I would highly suggest not allowing any of -, : and / as : is usually used for times, / usually for the US way of writing a date (mm/dd/yyyy) and - for the ISO way (yyyy-mm-dd). The EU dd.mm.yyyy syntax is not handled at all.
If the string does not contain anything but the date, you don't need a regex at all - use strptime() instead.
All in all, tell the user what date format you expect and parse that one, rejecting anything else. Otherwise you'll get ambiguous cases such as 04/05/2012 (is it April 5th or May 4th?).
[1-31] does not mean what you think it means. The square bracket syntax matches a range of characters, not a range of numbers. Matching a range of numbers with a regex is possible, but unwieldy.
If you really want to use regular expressions for this (rather than a date parsing library) you'd be better off matching all numbers of the right number of digits, capturing the values, and then checking the values yourself:
>>> import re
>>> d="30-12-2001"
>>> re.findall(r"\b([0-9]{1,2})[-/:]([0-9]{1,2})[-/:]([0-9]{4})\b",d)
[('30', '12', '2001')]
You'll have to do actual date verification anyway, to catch invalid dates like 31-02-2012.
(Note that [/-:] doesn't work either, because it's interpreted as a range. Use [-/:] instead - putting the hyphen at the front prevents it being interpreted as a range separator.)
Regular expressions do not understand numbers; to a regular expression, 1 is just a character of the string - the same kind of thing that a is. Thus, for example, [1-31] is parsed as a character class which contains the range 1-3 and the (redundant) single symbol 1.
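A quick way to convince yourself of this is to run the character classes from the question against a string of digits (the variable names are only for illustration):

```python
import re

digits = "0123456789"

# [1-31] is the range 1-3 plus a redundant 1: it matches ONE character.
day_class = re.findall(r"[1-31]", digits)
# [1981-2011] is the characters 1, 9, 8, the range 1-2, then 0, 1, 1.
year_class = re.findall(r"[1981-2011]", digits)
# [/-:] is the range from '/' to ':', which covers every digit but not '-'.
sep_class = re.findall(r"[/-:]", "-/:5")

print(day_class)   # ['1', '2', '3']
print(year_class)  # ['0', '1', '2', '8', '9']
print(sep_class)   # ['/', ':', '5']
```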
You do not want to use regular expressions to parse dates. There is already a built-in module for handling date parsing:
>>> import datetime
>>> datetime.datetime.strptime('30-12-2001', '%d-%m-%Y')
datetime.datetime(2001, 12, 30, 0, 0) # an object representing the date.
This also does all the secondary checks (for things like an attempt to refer to Feb. 31) for you. If you want to handle multiple types of separators, you can simply .replace them in the original string so that they all turn into the same separator, then use that in your format.
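The .replace idea could look roughly like this. The helper name parse_dmy and the default set of separators are assumptions; adjust them to whatever you decide to accept:

```python
import datetime

def parse_dmy(text, separators="/.:"):
    """Turn every accepted separator into '-' and parse as day-month-year."""
    for sep in separators:
        text = text.replace(sep, "-")
    # strptime raises ValueError for impossible dates such as 31-02-2012.
    return datetime.datetime.strptime(text, "%d-%m-%Y")

print(parse_dmy("30/12/2001"))  # 2001-12-30 00:00:00
print(parse_dmy("30.12.2001"))  # 2001-12-30 00:00:00
```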
You're probably doing it wrong. Some other replies here are helping you with the regex, but I suggest you use the datetime.datetime.strptime method to turn your formatted date into a datetime object, and do further logic with that object:
>>> import datetime
>>> datetime.datetime.strptime('30-12-2001', '%d-%m-%Y')
datetime.datetime(2001, 12, 30, 0, 0)
More info on the strptime method and its format strings.
regexp = r'(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\d\d)'
( #start of group #1
0?[1-9] # 01-09 or 1-9
| # ..or
[12][0-9] # 10-19 or 20-29
| # ..or
3[01] # 30, 31
) #end of group #1
/ # follow by a "/"
( # start of group #2
0?[1-9] # 01-09 or 1-9
| # ..or
1[012] # 10,11,12
) # end of group #2
/ # follow by a "/"
( # start of group #3
(19|20)\d\d # 19[0-9][0-9] or 20[0-9][0-9]
) # end of group #3
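The commented layout above can be kept in the code itself with the re.VERBOSE flag, which ignores whitespace and # comments inside the pattern. A sketch, with the century group made non-capturing so that only three groups come back (that change is mine, not part of the regex above):

```python
import re

DATE_RE = re.compile(r"""
    (0?[1-9]|[12][0-9]|3[01])   # group 1: day, 01-09 or 1-9, 10-29, 30-31
    /
    (0?[1-9]|1[012])            # group 2: month, 01-09 or 1-9, 10-12
    /
    ((?:19|20)\d\d)             # group 3: year, 1900-2099
""", re.VERBOSE)

m = DATE_RE.fullmatch("30/12/2001")
print(m.groups())  # ('30', '12', '2001')
```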
Maybe you can try this regex:
^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/(19|20)[0-9]{2}$
This matches (01 to 31)/(01 to 12)/(1900 to 2099).
