Unable to extract date of birth from a given random format

I have a set of text files from which I have to extract the date of birth. The code below extracts the date of birth from most of the files, but it fails when the text is laid out as in the sample below. How can I extract the DOB? The data is very non-uniform and broken.
Code:
import re
string = """ This is python to extract date
D
.O.B.
:
14
J
u
n
e
199
1
work in a team or as individual
contributor.
And Name is: Zon; DOB: 12/23/
1955 11/15/2014 11:53 AM"""
pattern = re.findall(r'.*?D.O.B.*?:\s+([\d]{1,2}\s(?:JAN|NOV|OCT|DEC|June)\s[\d]{4})', string)
pattern2 = re.findall(r'.*?DOB.*?:\s+([\d/]+)', string)
print(pattern)
print(pattern2)
Expected Output:
['14 June 1991']
['12/23/1955']

Working with dates and times is always a nightmare for developers, for many reasons. In your case, you are trying to extract the date of birth, which is marked with a DOB prefix, with or without separators.
I suggest not using and maintaining a lot of regexes in the code, since you said the date formats can vary. You can use a good library like python-dateutil; install it from PyPI with pip install python-dateutil.
All you have to do is find a good candidate section of the text and use the library to parse it. E.g., in your case, find the section of text that contains the date:
import re
from dateutil.parser import parse
in_str = """DOB: 14 June 1991
work in a team or as individual
contributor"""
# find DOB-prefixed string patterns and capture only the date part
candidates = re.findall(r"D\.?O\.?B\.?:\s*(.*\d{4})\b", in_str)
# parse the dates from the candidates
parsed_dates = [parse(dt) for dt in candidates]
print(parsed_dates)
This will give you an output like
[datetime.datetime(1991, 6, 14, 0, 0)]
From here, you can manipulate or process them easily. Locating the date-containing sections first is not strictly necessary for the date parser to work, but it minimizes your work.
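If adding a dependency is not an option, the same candidate-then-parse flow can be sketched with the standard library only. This is a minimal sketch, not a replacement for dateutil: the FORMATS tuple and the parse_candidate helper are hypothetical names introduced here, the sample text is a tidied-up variant of the question's data, and %B assumes an English locale.

```python
import re
from datetime import datetime

# Hypothetical list of formats; extend it to match the files at hand.
FORMATS = ("%d %B %Y", "%m/%d/%Y")

def parse_candidate(text):
    """Try each known format in turn; return None if none of them fits."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

in_str = ("DOB: 14 June 1991\n"
          "work in a team or as individual\n"
          "And Name is: Zon; DOB: 12/23/1955 11/15/2014 11:53 AM")

# Capture only the text between the DOB prefix and the first 4-digit year.
candidates = re.findall(r"D\.?O\.?B\.?:\s*(.*?\d{4})\b", in_str)
parsed = [parse_candidate(c) for c in candidates]
print(candidates)
print(parsed)
```

Candidates that match none of the formats come back as None, so they can be logged and inspected instead of stopping the run.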

For the first pattern, you can match optional whitespace chars between the single characters.
\bD\s*\.\s*O\s*\.\s*B[^:]*:\s+(\d{1,2}\s*(?:JAN|NOV|OCT|DEC|J\s*u\s*n\s*e)(?:\s*\d){4})
Then in the match, remove the newlines.
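As a rough illustration of that clean-up step, here is a sketch that applies a variation of the pattern above to the broken sample. The variation (an assumption of this sketch, not part of the pattern as given) puts day, month and year into separate groups, so that after stripping the whitespace inside each part they can be joined with single spaces.

```python
import re

text = (" This is python to extract date\n"
        "D\n.O.B.\n: \n14 \nJ\nu\nn\ne \n\n199\n1\n"
        "work in a team or as individual \ncontributor.\n")

# Same building blocks as the pattern above, split into three groups.
pattern = re.compile(
    r"\bD\s*\.\s*O\s*\.\s*B[^:]*:\s+"
    r"(\d{1,2})\s*"
    r"(JAN|NOV|OCT|DEC|J\s*u\s*n\s*e)"
    r"((?:\s*\d){4})"
)

def clean(part):
    """Drop every whitespace character inside one captured part."""
    return re.sub(r"\s+", "", part)

dates = [" ".join(clean(p) for p in m.groups()) for m in pattern.finditer(text)]
print(dates)  # ['14 June 1991']
```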
For the second pattern, you can match optional whitespace chars around the / and then remove the whitespace chars from the matches.
\bDOB.*?:\s+(\d\d\s*/\s*\d\d\s*/\s*\d{4})\b
For example
import re
pattern = r"\bDOB.*?:\s+(\d\d\s*/\s*\d\d\s*/\s*\d{4})\b"
s = (" This is python to extract date\n"
"D\n"
".O.B.\n"
": \n"
"14 \n"
"J\n"
"u\n"
"n\n"
"e \n\n"
"199\n"
"1\n"
"work in a team or as individual \n"
"contributor.\n"
"And Name is: Zon; DOB: 12/23/\n"
" 1955 11/15/2014 11:53 AM")
res = [re.sub(r"\s+", "", s) for s in re.findall(pattern, s)]
print(res)
Output
['12/23/1955']
If no other colon should occur between DOB and the colon before the date, you can also use a negated character class that excludes the colon instead of .*?
\bDOB[^:]*:\s+(\d\d\s*/\s*\d\d\s*/\s*\d{4})\b

I agree with @Kris that you should try to have as few regexes to maintain as possible, and make them as simple as possible. You should also, as he suggested, divide your problem into two steps:
1/ extracting candidates
2/ parsing (using, for example dateutil.parser.parse)
step 1: extracting candidates
One solution for making regex patterns simpler is to manipulate the input string (if possible).
For example in your case, the difficulty comes from varying newlines and spaces. Taking back your example:
import re
s1 = """ This is python to extract date
D
.O.B.
:
14
J
u
n
e
199
1
work in a team or as individual
contributor.
And Name is: Zon; DOB: 12/23/
1955 11/15/2014 11:53 AM"""
You can create s2 that removes new lines and spaces:
s2 = s1.replace("\n", "").replace(" ", "")
Then your pattern becomes simpler:
pattern = re.compile(r"D\.?O\.?B\.?:(?P<date_of_birth>(.*?)(\d{4}))")
(see pattern explanation below)
Match the pattern with your simplified string:
matches = [m.group('date_of_birth') for m in pattern.finditer(s2)]
You get:
>>> print(matches)
['14June1991', '12/23/1955']
step 2: parsing candidates to date objects
@Kris's suggestion works very well:
from dateutil import parser
dobs = [parser.parse(m) for m in matches]
You get your expected result:
>>> print(dobs)
[datetime.datetime(1991, 6, 14, 0, 0), datetime.datetime(1955, 12, 23, 0, 0)]
You can then use strftime if you want to turn all your dates into pretty, standardized strings:
dobs_pretty = [d.strftime('%Y-%m-%d') for d in dobs]
>>> print(dobs_pretty)
['1991-06-14', '1955-12-23']
Pattern explanation
D\.?O\.?B\.?: you look for "DOB", with or without periods (hence the ? operator)
(?P<date_of_birth>(.*?)(\d{4})): You capture everything on the right of "DOB" until you find 4 consecutive digits (representing the year). (.*?) captures everything "up until" (\d{4}) (the 4 consecutive digits)
?P<date_of_birth> allows you to name the captured group, making retrieving the date much easier. You simply put the group name (date_of_birth) in the group() method: m.group('date_of_birth'). Group names must be valid Python identifiers, so the name uses underscores rather than hyphens.

Related

Is there a way to find (potentially) multiple results with re.search?

While parsing file names of TV shows, I would like to extract information about them to use for renaming. I have a working model, but it currently uses 28 if/elif statements for every iteration of filename I've seen over the last few years. I'd love to be able to condense this to something that I'm not ashamed of, so any help would be appreciated.
Phase one of this code repentance is to hopefully grab multiple episode numbers. I've gotten as far as the code below, but in the first entry it only displays the first episode number and not all three.
import re
def main():
    pattern = '(.*)\.S(\d+)[E(\d+)]+'
    strings = ['blah.s01e01e02e03', 'foo.s09e09', 'bar.s05e05']
    #print(strings)
    for string in strings:
        print(string)
        result = re.search("(.*)\.S(\d+)[E(\d+)]+", string, re.IGNORECASE)
        print(result.group(2))
if __name__== "__main__":
    main()
This outputs:
blah.s01e01e02e03
01
foo.s09e09
09
bar.s05e05
05
It's probably trivial, but regular expressions might as well be Cuneiform most days. Thanks in advance!

No. You can use findall to find all e\d+, but it cannot find overlapping matches, which makes it impossible to use s\d+ together with it (i.e. you can't distinguish the e007 in "foo.s01e006e007" from the one in "age007.s01e001"), and Python's re doesn't support variable-length lookbehind (which could otherwise check that s\d+ comes before it without overlapping).
The way to do this is to find \.s\d+((?:e\d+)+)$ then split the resultant group 1 in another step (whether by using findall with e\d+, or by splitting with (?<!^)(?=e)).
import re
text = 'blah.s01e01e02e03'
match = re.search(r'\.(s\d+)((?:e\d+)+)$', text, re.I)
season = match.group(1)
episodes = re.findall(r'e\d+', match.group(2), re.I)
print(season, episodes)
# => s01 ['e01', 'e02', 'e03']
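The two-step approach can be wrapped in a small helper and run over the file names from the question. This is only a sketch: EPISODE_RE and parse_episodes are hypothetical names, and converting the numbers to int is an assumption about what a renaming script would want.

```python
import re

# Step 1: season plus the whole run of episode markers at the end of the name.
EPISODE_RE = re.compile(r'\.s(\d+)((?:e\d+)+)$', re.IGNORECASE)

def parse_episodes(name):
    """Return (season, [episodes]) as integers, or None if the name does not fit."""
    match = EPISODE_RE.search(name)
    if match is None:
        return None
    season = int(match.group(1))
    # Step 2: split the captured run into individual episode numbers.
    episodes = [int(n) for n in re.findall(r'e(\d+)', match.group(2), re.IGNORECASE)]
    return season, episodes

for name in ['blah.s01e01e02e03', 'foo.s09e09', 'bar.s05e05']:
    print(name, parse_episodes(name))
```

For the first name this yields (1, [1, 2, 3]); names that do not end in a season/episode block return None rather than raising.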

re.findall instead of re.search will return a list of all matches

If you can use the regex module from PyPI, you can use repeating capture groups in the pattern and then call .captures()
For example:
import regex
s = "blah.s01e01e02e03"
pattern = r"\.(s\d+)(e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
    print(m.captures(1)[0], m.captures(2))
Output:
s01 ['e01', 'e02', 'e03']
Or use .capturesdict() with named capture groups.
For example:
import regex
s = "blah.s01e01e02e03"
pattern = r"\.(?P<season>s\d+)(?P<episodes>e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
    print(m.capturesdict())
Output:
{'season': ['s01'], 'episodes': ['e01', 'e02', 'e03']}
Note that the notation [E(\d+)] that you used is a character class, which matches one of the listed characters: E, (, a digit, + or ).
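Both points can be checked quickly with the standard re module. The sketch below uses made-up test strings: the first check shows that the bracket notation accepts any mix of those characters, and the second shows that a repeated capture group in re keeps only its last repetition, which is why .captures() from the regex module is needed to see all of them.

```python
import re

# [E(\d+)] is a set of single characters: E, (, any digit, + and ).
is_class_match = bool(re.fullmatch(r'[E(\d+)]+', 'E(12)+'))
print(is_class_match)  # True

# With the standard re module, a repeated group only remembers its last repetition.
m = re.search(r'\.s(\d+)(e\d+)+', 'blah.s01e01e02e03', re.IGNORECASE)
last_episode = m.group(2)
print(last_episode)  # e03
```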

regex to match version number

Hi everyone, I have parsed some data that I want to match.
I have a list of two strings that I produced with:
technologytitle=technologytitle.lower()
vulntitle=vulntitle.lower()
ree1=re.split(technologytitle, vulntitle)
This produces the following:
['\nmultiple cross-site scripting (xss) vulnerabilities in', '9.0.1 and earlier\n\n\n\n\n']
I am now trying to write a re.match call to match the second value:
ree2=re.match(r'^[0-9].[0-9]*$', ree1[1])
print("ree2 {}".format(ree2))
However, this returns None.
Any thoughts? Thanks

It is unclear whether you wanted the whole version string or its individual parts, but you can get both without ^ or $
import re
regex = r'((?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+))'
s = '9.0.1 and earlier\n\n\n\n\n'
matches = re.search(regex, s)
print(matches.group(0))
for v in ['major', 'minor', 'patch']:
    print(v, matches.group(v))
Output
9.0.1
major 9
minor 0
patch 1
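If the goal is to compare the extracted version against others (as "9.0.1 and earlier" suggests), one option is to turn the match into a tuple of integers. This is a sketch under that assumption; VERSION_RE and parse_version are hypothetical names, and making the patch component optional is a choice of this sketch rather than something the question asked for.

```python
import re

# major.minor with an optional .patch component
VERSION_RE = re.compile(r'(\d+)\.(\d+)(?:\.(\d+))?')

def parse_version(text):
    """Return the first version found in text as a tuple of ints, or None."""
    m = VERSION_RE.search(text)
    if m is None:
        return None
    return tuple(int(part) for part in m.groups() if part is not None)

found = parse_version('9.0.1 and earlier\n\n\n\n\n')
print(found)               # (9, 0, 1)
print(found <= (9, 0, 1))  # True: tuples compare component by component
```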

I used this one and it worked for me. The dollar sign anchors the match at the end of the string, and your string does not end right after the digits, so re.match returns None.
regexPattern = "[0-9].*[0-9]"

How to separate strings using regex?

I have a datetime string and I want to separate the hours from the minutes and then print each of them on a separate line.
Here is the code:
import re
from datetime import datetime
ct = datetime.now().strftime("%H:%M")
time = "12:30"
minus = datetime.strptime(time,"%H:%M") - datetime.strptime(ct,"%H:%M")
minus = datetime.strptime(str(minus),"%H:%M:%S").strftime("%H:%M")
# print(minus)
regex2 = re.compile(r'(\d)+:(\d)+')
match = regex2.search(minus)
print(match.group(0))
If the variable minus holds 01:22,
then I want 01 and 22 to be printed on different lines.
The output should look like:
01
22

You don't need regex for that.
You can use the timedelta directly:
from datetime import datetime
ct = datetime.now().strftime("%H:%M")
time = "10:30"
minus = datetime.strptime(time,"%H:%M") - datetime.strptime(ct,"%H:%M")
## extract whatever values you want from this delta:
## eg:
seconds = minus.seconds
hours = minus.seconds//3600
minutes = (minus.seconds//60)%60
print(hours)
print(minutes)
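The same arithmetic can be written with divmod. The sketch below uses two fixed times (made up for the example, chosen so the difference is the 01:22 from the question) instead of datetime.now(), so that the result is reproducible, and zero-pads the output.

```python
from datetime import datetime

# Fixed sample times so the result does not depend on when the script runs.
start = datetime.strptime("11:08", "%H:%M")
end = datetime.strptime("12:30", "%H:%M")

delta = end - start                       # a timedelta of 1 hour 22 minutes
hours, remainder = divmod(delta.seconds, 3600)
minutes = remainder // 60

print(f"{hours:02d}")    # 01
print(f"{minutes:02d}")  # 22
```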

Ignoring the fact that your script fails due to a runtime error at line 6 (which prevents any output from being generated from the regex part of the script), to capture the digits surrounding the colon, you need to put the repetition marker (i.e. the plus sign) inside the capturing markers (i.e. the parentheses).
So revising your regex:
regex2 = re.compile(r'(\d)+:(\d)+')
as follows:
regex2 = re.compile(r'(\d+):(\d+)')
Will achieve that.
The second thing you need to do is print the matching groups (as opposed to the entire matched text - i.e. group(0)).
Printing the first and second matching groups is achieved as follows:
print(match.group(1))
print(match.group(2))
The difference between your regular expression and mine is in what each group ends up capturing.
In your case, (\d)+ repeats a group that holds a single digit. Python's re keeps only the last repetition of a repeated group, so the group ends up containing just the final digit.
In my case, (\d+) is a single group containing one or more digits. I then use this pattern twice, once on each side of the colon.
So, for input like 01:22, your regex yields the groups "1" and "2", whereas mine yields "01" and "22", which I think is what you are trying to achieve.
Likewise, for input like "0:122", your regular expression would give "0" and "2", losing most of the digits, while mine gives "0" and "122", correctly telling you what was on each side of the colon.
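A quick way to see what each pattern actually captures is to print groups() for both on the same input. The variable names below are only for illustration.

```python
import re

# Plus sign outside the parentheses: each group keeps only its last single digit.
repeated_group = re.search(r'(\d)+:(\d)+', '01:22').groups()

# Plus sign inside the parentheses: each group keeps the whole run of digits.
grouped_repeat = re.search(r'(\d+):(\d+)', '01:22').groups()

print(repeated_group)  # ('1', '2')
print(grouped_repeat)  # ('01', '22')
```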

Use the regex below:
'([^:]+)'
See regex explanation here: https://regex101.com/r/EgXlcD/46
import re
from datetime import datetime
ct = datetime.now().strftime("%H:%M")
time = "10:30"
minus = datetime.strptime(time,"%H:%M") - datetime.strptime(ct,"%H:%M")
minus = datetime.strptime(str(minus),"%H:%M:%S").strftime("%H:%M")
#print(minus)
matches = re.findall(r'([^:]+)',minus)
for match in matches:
    print(match)

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers and anyNumber is different in each long string. The code should extract the desired string from any string possible and should take into account spaces and any other weird thing that might appear in the long string (should be possible with regex, right?). Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same. Also someDomainName! But there's another http://www. in the string

import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
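To see the pattern in action, here is a sketch with a made-up long_string that contains another http://www. link, as the update to the question describes. The sample text is an assumption for illustration only.

```python
import re

# Hypothetical input: one unrelated link and two links of the wanted form.
long_string = ("junk http://www.other.org/x text "
               "http://www.someDomainName.com/2134 more, "
               "http://www.someDomainName.com/77.")

results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
print(results)
# ['http://www.someDomainName.com/2134', 'http://www.someDomainName.com/77']
```

The trailing \b stops the match at the end of the number, so the final period is not included.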

>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
In the above pattern, we have captured 5 groups:
Group 0 is the complete string that is matched.
The rest are in the order of the brackets you see. (So you are looking for the second one: (\\w*).)
If you want, you can capture only the part of the string you are interested in: remove the brackets from the parts of the pattern you don't need and keep just (\\w*).
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In the above example, you won't have groups 2, 3 and 4 as in the previous example, because we have captured only one group. And yes, group 0 is always captured: it is the complete string that matches.

Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()

If you are sure that there are no dots in SomeDomainName, you can just take the first occurrence of the string ".com/" and keep everything from that index on.
This avoids regexes, which are harder to maintain.
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/')+5:])

Python regex to match dates

What regular expression in Python do I use to match dates like this: "11/12/98"?

Instead of using regex, it is generally better to parse the string as a datetime.datetime object:
In [140]: datetime.datetime.strptime("11/12/98","%m/%d/%y")
Out[140]: datetime.datetime(1998, 11, 12, 0, 0)
In [141]: datetime.datetime.strptime("11/12/98","%d/%m/%y")
Out[141]: datetime.datetime(1998, 12, 11, 0, 0)
You could then access the day, month, and year (and hour, minutes, and seconds) as attributes of the datetime.datetime object:
In [142]: date = datetime.datetime.strptime("11/12/98","%m/%d/%y")
In [143]: date.year
Out[143]: 1998
In [144]: date.month
Out[144]: 11
In [145]: date.day
Out[145]: 12
To test if a sequence of digits separated by forward-slashes represents a valid date, you could use a try..except block. Invalid dates will raise a ValueError:
In [159]: try:
   .....:     datetime.datetime.strptime("99/99/99","%m/%d/%y")
   .....: except ValueError as err:
   .....:     print(err)
   .....:
   .....:
time data '99/99/99' does not match format '%m/%d/%y'
If you need to search a longer string for a date,
you could use regex to search for digits separated by forward-slashes:
In [146]: import re
In [152]: match = re.search(r'(\d+/\d+/\d+)','The date is 11/12/98')
In [153]: match.group(1)
Out[153]: '11/12/98'
Of course, invalid dates will also match:
In [154]: match = re.search(r'(\d+/\d+/\d+)','The date is 99/99/99')
In [155]: match.group(1)
Out[155]: '99/99/99'
To check that match.group(1) returns a valid date string, you could then parse it using datetime.datetime.strptime as shown above.
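Put together, searching and validating might look like the sketch below. The helper name find_valid_dates, the default format and the slightly stricter candidate pattern are assumptions of this example, not part of the answer above.

```python
import re
from datetime import datetime

def find_valid_dates(text, fmt="%m/%d/%y"):
    """Return datetimes for every slash-separated candidate that parses with fmt."""
    found = []
    for candidate in re.findall(r'\b\d{1,2}/\d{1,2}/\d{2}\b', text):
        try:
            found.append(datetime.strptime(candidate, fmt))
        except ValueError:
            # Looks like a date but is not one (e.g. 99/99/99): skip it.
            pass
    return found

valid = find_valid_dates("The date is 11/12/98, not 99/99/99")
print(valid)  # [datetime.datetime(1998, 11, 12, 0, 0)]
```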

I find the RE below works fine for dates in the following formats:
14-11-2017
14.11.2017
14|11|2017
It accepts years from 2000 to 2099.
Please do not forget to add $ at the end; without it, the pattern also accepts strings with trailing characters, such as 14-11-20177.
import re
date="13-11-2017"
x=re.search(r"^([1-9]|1[0-9]|2[0-9]|3[0-1])(\.|-|\|)([1-9]|1[0-2])(\.|-|\|)20[0-9][0-9]$",date)
x.group()
output = '13-11-2017'

I built my solution on top of @aditya Prakash's approach:
print(re.search("^([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])$|^([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])$",'01/01/2018'))
The first part (^([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])$) can handle the following formats:
01.10.2019
1.1.2019
1.1.19
12/03/2020
01.05.1950
The second part (^([0-9][0-9]|19[0-9][0-9]|20[0-9][0-9])(\.|-|/)([1-9]|0[1-9]|1[0-2])(\.|-|/)([1-9]|0[1-9]|1[0-9]|2[0-9]|3[0-1])$) can basically do the same, but in inverse order, where the year comes first, followed by month, and then day.
2020/02/12
As delimiters it allows ., / and -. As years it allows everything from 1900 to 2099; two-digit years are also accepted.
If you have suggestions for improvement please let me know in the comments, so I can update the answer.

Using this regular expression you can validate different kinds of date/time samples; only a small change is needed for each format.
^\d\d\d\d/(0?[1-9]|1[0-2])/(0?[1-9]|[12][0-9]|3[01]) (00|[0-9]|1[0-9]|2[0-3]):([0-9]|[0-5][0-9]):([0-9]|[0-5][0-9])$ --> validates this: 2018/7/12 13:00:00
For your format you can change it to:
^(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[0-2])/\d\d$ --> validates this: 11/12/98

I use something like this
>>> import datetime
>>> regex = datetime.datetime.strptime
>>>
>>> # TEST
>>> assert regex('2020-08-03', '%Y-%m-%d')
>>>
>>> assert regex('2020-08', '%Y-%m-%d')
ValueError: time data '2020-08' does not match format '%Y-%m-%d'
>>> assert regex('08/03/20', '%m/%d/%y')
>>>
>>> assert regex('08-03-2020', '%m/%d/%y')
ValueError: time data '08-03-2020' does not match format '%m/%d/%y'
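When a plain True/False answer is more convenient than an exception, the same strptime call can be wrapped in a small helper. This is a minimal sketch; is_valid_date is a hypothetical name, and the checks reuse the inputs from the session above.

```python
from datetime import datetime

def is_valid_date(text, fmt):
    """Return True if text parses with fmt, False otherwise."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True

print(is_valid_date('2020-08-03', '%Y-%m-%d'))  # True
print(is_valid_date('2020-08', '%Y-%m-%d'))     # False
print(is_valid_date('08/03/20', '%m/%d/%y'))    # True
print(is_valid_date('08-03-2020', '%m/%d/%y'))  # False
```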

Well, from my understanding, simply for matching this format in a given string, I prefer this regular expression:
pattern='[0-9|/]+'
to match the format in a more strict way, the following works:
pattern='(?:[0-9]{2}/){2}[0-9]{2}'
Personally, I cannot agree with unutbu's answer since sometimes we use regular expression for "finding" and "extract", not only "validating".

Sometimes we need to get the date from a string.
One example with grouping:
import re
record = '1518-09-06 00:57 some-alphanumeric-character'
pattern_date_time = r'([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}) .+'
match = re.match(pattern_date_time, record)
if match is not None:
    # group(1) is the captured date and time, without the trailing text
    date = match.group(1)
    print(date)  # outputs 1518-09-06 00:57

As the question title asks for a regex that finds many dates, I would like to propose a new solution, although there are many solutions already.
To find all dates in a string whose year lies between 2000 and 2099, the following worked for me:
dates = re.findall(r'([1-9]|1[0-9]|2[0-9]|3[0-1]|0[0-9])(\.|-|\/)([1-9]|1[0-2]|0[0-9])(\.|-|\/)(20[0-9][0-9])',dates_ele)
dates = [''.join(dates[i]) for i in range(len(dates))]
This regex is able to find multiple dates in the same string, like bla Bla 8.05/2020 \n BLAH bla15/05-2020 blaa. As one can observe, instead of / the date can have . or -, and the two separators do not have to be the same.
Some explaining
More specifically, it finds dates in day, month, year order. Day is a one-digit integer, or a zero followed by a one-digit integer, or 1 or 2 followed by a one-digit integer, or a 3 followed by 0 or 1. Month is a one-digit integer, or a zero followed by a one-digit integer, or 1 followed by 0, 1, or 2. Year is the number 20 followed by any number between 00 and 99.
Useful notes
One can add more date-splitting symbols by appending | and the symbol at the end of both (\.|-|\/) groups. For example, to add -- one would write (\.|-|\/|--)
To match years outside of that range, modify (20[0-9][0-9]) to ([0-9][0-9][0-9][0-9])
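Here is a runnable sketch of this approach on the example string from the answer. It assumes the dots in the separator groups are escaped so that they match only a literal period; DATE_RE and dates_ele are names chosen for the illustration.

```python
import re

DATE_RE = re.compile(
    r'([1-9]|1[0-9]|2[0-9]|3[0-1]|0[0-9])'  # day
    r'(\.|-|/)'                             # first separator
    r'([1-9]|1[0-2]|0[0-9])'                # month
    r'(\.|-|/)'                             # second separator
    r'(20[0-9][0-9])'                       # year 2000-2099
)

dates_ele = "bla Bla 8.05/2020 \n BLAH bla15/05-2020 blaa"

# findall returns one tuple of groups per match; join each tuple back into a string.
dates = [''.join(groups) for groups in DATE_RE.findall(dates_ele)]
print(dates)  # ['8.05/2020', '15/05-2020']
```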

I use something like this :
string="text 24/02/2021 ... 24-02-2021 ... 24_02_2021 ... 24|02|2021 text"
new_string = re.sub(r"[0-9]{1,4}[\_|\-|\/|\|][0-9]{1,2}[\_|\-|\/|\|][0-9]{1,4}", ' ', string)
print(new_string)
out : text ... ... ... text

If you don't want to deal with the ValueError exception raised by the datetime-based methods, you can use re. You should probably also check that the day of the month is no greater than 31 and the month number is no greater than 12:
from re import search as re_search
date_input = '31.12.1998'
re_search(r'^(3[01]|[12][0-9]|0[1-9])\.(1[0-2]|0[1-9])\.[0-9]{4}$', date_input)
For a datetime-based approach, see the good answer @unutbu gave earlier.

In case anyone wants to match this type of date "24 November 2008"
you can use
import re
date = "24 November 2008"
regex = re.compile(r"\d+\s\w+\s\d+")
matchDate = regex.findall(date)
print(matchDate)
Or
import re
date = "24 November 2008"
matchDate = re.findall(r"\d+\s\w+\s\d+", date)
print(matchDate)

This regular expression for matching dates in this format "22/10/2021" works for me :
import re
date = "WHATEVER 22/10/2029 WHATEVER"
match = re.search("([0-9]|1[0-9]|2[0-9]|3[0-5])/([0-9]|1[0-9]|2[0-9]|3[0-5])/([0-9][0-9][0-9][0-9])", date)
print(match)
OUTPUT = <re.Match object; span=(9, 19), match='22/10/2029'>
You can see in the fourth line that there is this string ([0-9]|1[0-9]|2[0-9]|3[0-5])/([0-9]|1[0-9]|2[0-9]|3[0-5])/([0-9][0-9][0-9][0-9]), this is the regular expression that I made based in this page.
