How to Extract Date From String Python - python

I have a string like:
"Originally Posted on 09 May, 2016. By query 3 j...."
How can I extract the date using Python?
I tried this code:
import re
dStr = "Originally Posted on 09 May, 2016. By query 3 j...."
date_st = re.findall(r"(\d+ \w+,)", dStr)
Printing date_st gives:
['09 May,']
What should I do to get the year as well?

You forgot to match the year after the comma. Adding ' \d+\.' to the pattern is enough:
re.findall(r"(\d+ \w+, \d+\.)", dStr)
This returns:
['09 May, 2016.']

You were almost there. Just add four digits for the year after the comma. It is also better to use [a-z]+ (with the re.I flag) instead of \w+ for the month name, because \w also matches _ and the digits 0-9 in addition to letters.
re.findall(r'\d+\s[a-z]+,\s\d{4}', dStr, re.I)
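Once the date text is isolated, it can be turned into a real date object. The following is a minimal sketch, not part of the original answers: it uses a close variant of the pattern above and assumes the month is a full English name (the `%B` directive depends on the active locale).

```python
import re
from datetime import datetime

text = "Originally Posted on 09 May, 2016. By query 3 j...."

# Day, month name, comma, four-digit year (case-insensitive month).
match = re.search(r"\d{1,2}\s[a-z]+,\s\d{4}", text, re.I)

# %B expects a full month name; this assumes an English/C locale.
posted = datetime.strptime(match.group(), "%d %B, %Y") if match else None
print(posted)
```

If the text may contain abbreviated months such as "Sep", `%b` would be the directive to try instead.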

Related

Is it possible to tag an already masked text in python?

I'm trying to preprocess a text file for NLP, and as part of this effort we are tagging various items such as dates, addresses and sensitive personal information (SPI). The problem is that the text has already masked some of this information. For example:
Jan 6, xxxx or (xxx)xxx-1234
My question is: is it possible to use regular expressions in Python to unmask them so that we can proceed with tagging them properly?
So I need something like this:
Jan 6, 1111 or (111)111-1234
To tag them as #US_DATE and #PHONE
I've tried simple possible solutions such as:
re.sub(r'xx', '11', '(xxx)xxx-1234')
re.sub(r'xx+', '11', 'January 9 xxxx')
but neither gives me the right pattern!
Thanks in advance.
Perhaps one option is to match the different formats that you have using an alternation and use re.sub with a callback to replace all the x characters with 1.
For the pattern I have used character classes with quantifiers to specify what you would allow to match, but you could update that to make it more specific.
\b[A-Za-z]{3,} [a-zA-Z\d]{1,2},? [a-zA-Z\d]{4}\b|\([a-zA-Z\d]+\)[a-zA-Z\d]{3}-[a-zA-Z\d]{4}\b
For example:
import re
regex = r"\b[A-Za-z]{3,} [a-zA-Z\d]{1,2},? [a-zA-Z\d]{4}\b|\([a-zA-Z\d]+\)[a-zA-Z\d]{3}-[a-zA-Z\d]{4}\b"
test_str = ("Jan 6, xxxx or (xxx)xxx-1234\n"
"Jan 16, xxxx or (xxx)xxx-1234\n"
"January 9 xxxx\n"
"(xxx)xxx-1234")
matches = re.sub(regex, lambda x: x.group().replace('x', '1'), test_str)
print(matches)
Result
Jan 6, 1111 or (111)111-1234
Jan 16, 1111 or (111)111-1234
January 9 1111
(111)111-1234
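If the end goal is the tags themselves, the unmasking step could possibly be skipped by letting the pattern accept the mask character directly. This is only a sketch under assumptions: the group names `US_DATE` and `PHONE` and the helper `tag` are made up for illustration, and the pattern is narrowed to digits and a literal `x`.

```python
import re

# One named group per tag; [x\d] accepts either a digit or the mask character.
TAGGER = re.compile(
    r"(?P<US_DATE>\b[A-Za-z]{3,} [x\d]{1,2},? [x\d]{4}\b)"
    r"|(?P<PHONE>\([x\d]{3}\)[x\d]{3}-[x\d]{4}\b)"
)

def tag(text):
    # m.lastgroup is the name of the alternative that matched.
    return TAGGER.sub(lambda m: "#" + m.lastgroup, text)

tagged = tag("Jan 6, xxxx or (xxx)xxx-1234")
print(tagged)
```

The same callback idea as in the answer is used here; only the replacement value differs.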

Extract strings that follow changing time strings

So I've been trying to extract the strings that follow the "dot" character in the text file, but only for lines that follow the pattern as below, that is, after the date and time:
09 May 2018 10:37AM • 6PR, Perth (Mornings)
The problem is for each of those lines, the date and time would change so the only common pattern is that there would be AM or PM right before the "dot".
However, if I search for "AM" or "PM" it wouldn't recognize the lines because the "AM" and "PM" are attached to the time.
This is my current code:
for i, s in enumerate(open(file)):
    for words in ['PM', 'AM']:
        if re.findall(r'\b' + words + r'\b', s):
            source = s.split('•')[0]
Any idea how to get around this problem? Thank you.
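I guess your regex is the problem here: \b cannot match between the minutes and AM/PM, because digits and letters are both word characters. Match the two digits before AM/PM instead:
for i, s in enumerate(open(file)):
    if re.findall(r'\d{2}[AP]M', s):
        source = s.split('•')[0]
        # 09 May 2018 10:37AM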
If you are trying to extract the date and time itself, use a regex with a named group.
Ex:
import re
s = "09 May 2018 10:37AM • 6PR, Perth (Mornings)"
m = re.search(r"(?P<datetime>\d{2}\s+(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}\s+\d{2}:\d{2}(AM|PM))", s)
if m:
    print(m.group("datetime"))
Output:
09 May 2018 10:37AM
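The question asks for the text that follows the "•" on such lines, while the snippets above keep the part before it. A possible sketch for that case, assuming every wanted line has a time ending in AM or PM directly before the bullet (the names `LINE` and `sources` are illustrative only):

```python
import re

# A time such as 10:37AM, the bullet, then everything after it on the line.
LINE = re.compile(r"\d{1,2}:\d{2}[AP]M\s*•\s*(?P<source>.+)")

def sources(lines):
    found = []
    for line in lines:
        m = LINE.search(line)
        if m:  # lines without a time before the bullet are skipped
            found.append(m.group("source").strip())
    return found

sample = [
    "09 May 2018 10:37AM • 6PR, Perth (Mornings)",
    "Some unrelated line • ignored",
]
result = sources(sample)
print(result)
```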

joining multiple regular expression for readability

I need to match dates that can be in either of the following formats:
mm/dd/yyyy or dd Mon YYYY
A few examples are shown below:
04/20/2009 and 24 Jan 2001
To handle this I have written the regular expression shown below.
A few sample texts:
txt1 = '''Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox
+ fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical
Review of Systems Constitutional:'''
txt2 = """s The patient is a 44 year old married Caucasian woman,
unemployed Decorator, living with husband and caring for two young
children, who is referred by Capitol Hill Hospital PCP, Dr. Heather
Zubia, for urgent evaluation/treatment till first visit with Dr. Toney
Winkler IN EIGHT WEEKS on 24 Jan 2001."""
date = re.findall(r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}', txtData)
I am not getting 24 Jan 2001, whereas if I run the second alternative on its own, (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}, I do get the output.
Question 1: What is the bug in the above expression?
Question 2: I want to combine both patterns in a more readable way, since I will have to parse other formats too, so I used join as shown below:
RE1 = '(?:\b(?<!\.)[\d{0,2}]+) (?:[/-]\d{0,}[/-]\d{2,4})'
RE2 = '(?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
regex_all = '|'.join([RE1, RE2])
regex_all = re.compile(regex_all)
date = regex_all.findall(txtData)  # txtData can be either of the strings above
With the joined pattern I get NaN as the output for date.
Please suggest what the mistake is when I join the patterns.
Thanks for your help.
Note that it is a very bad idea to join such long patterns when they can also match at the same location within the string. That can cause the regex engine to backtrack too much, leading to slowdowns and possibly crashes. If there is a way to rewrite the alternatives so that they can only match at different locations, or to get rid of them completely, do it.
Besides, you should use grouping constructs (...) to group sequences of patterns, and only use [...] character classes when you need to match specific chars.
Also, your alternatives overlap, so you can combine them easily. See the fixed regex:
\b(?<!\.)\d{1,2}(?:[/-]\d+[/-]|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b
Details
\b - a word boundary
(?<!\.) - no . immediately to the left of the current location
\d{1,2} - 1 or 2 digits
(?: - start of a non-capturing alternation group:
  [/-]\d+[/-] - / or -, 1+ digits, - or /
  | - or
  (?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*) - th, st, nd or rd (optionally), followed by 0+ whitespaces, and then a month name
) - end of the alternation group
\s* - 0+ whitespaces
(?:\d{4}|\d{2}) - 2 or 4 digits
\b - trailing word boundary.
Another note: if you want to match date-like strings only when both delimiters are the same, you need to capture the first one and use a backreference to match the second one. Because the pattern then contains a capturing group, re.findall would return only the captured delimiter, so in Python you need re.finditer to get the whole matches.
For example:
import re
rx = r"\b(?<!\.)\d{1,2}(?:([/-])\d+\1|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b"
s = "Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox\nfluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical\nReview of Systems Constitutional:\n\nThe patient is a 44 year old married Caucasian woman, unemployed Decorator, living with husband and caring for two young children, who is referred by Capitol Hill Hospital PCP, Dr. Heather Zubia, for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001"
print([x.group(0) for x in re.finditer(rx, s, re.I)])
# => ['7/11/77', '24 Jan 2001']
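The effect of the backreference is easier to see on a reduced pattern. The sketch below covers only the numeric format and is an illustration, not the full regex from the answer; `SAME_DELIM` is a name chosen here.

```python
import re

# ([/-]) captures the first delimiter; \1 requires the same one again.
SAME_DELIM = re.compile(r"\b\d{1,2}([/-])\d{1,2}\1(?:\d{4}|\d{2})\b")

samples = ["04/20/2009", "04-20-2009", "04/20-2009"]
accepted = [s for s in samples if SAME_DELIM.fullmatch(s)]

text = "seen on 04/20/2009 and 7-11-77"
# findall returns the contents of the capturing group, not the whole match.
groups_only = SAME_DELIM.findall(text)
# finditer yields match objects, so group(0) gives the full date text.
whole_matches = [m.group(0) for m in SAME_DELIM.finditer(text)]
print(accepted, groups_only, whole_matches)
```

The mixed-delimiter sample is rejected, and the difference between `findall` and `finditer` shows why the answer uses the latter.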
I think your approach is too complicated. I suggest using a combination of a simple regex and strptime().
import re
from datetime import datetime
date_formats = ['%m/%d/%Y', '%d %b %Y']
pattern = re.compile(r'\b(\d\d?/\d\d?/\d{4}|\d\d? \w{3} \d{4})\b')
data = "... your string ..."
for match in pattern.findall(data):
    print("Trying to parse '%s'" % match)
    for fmt in date_formats:
        try:
            date = datetime.strptime(match, fmt)
            print("  OK:", date)
            break
        except ValueError:
            # strptime raises ValueError when the text does not fit this format
            pass
The advantage of this approach is, besides a much more manageable regex, that it won't pick dates that look plausible but do not exist, like 2/29/2001 (whereas 2/29/2004 works, because 2004 is a leap year).
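The validation step can be isolated into a small helper. This is a sketch under the same two formats as above; `parse_date` and `DATE_FORMATS` are illustrative names, and `%b` assumes English month abbreviations.

```python
from datetime import datetime

DATE_FORMATS = ["%m/%d/%Y", "%d %b %Y"]

def parse_date(text):
    """Return a datetime for the first format that fits, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # wrong format or a date that does not exist
    return None

print(parse_date("2/29/2004"))    # a real leap day
print(parse_date("2/29/2001"))    # no such day, so None
print(parse_date("24 Jan 2001"))
```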
r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
You should use raw strings (r'foo') for each string literal, not only the first one. That way backslashes (\) are kept as ordinary characters and passed through to the re library. In particular, '\b' in a normal string is a backspace character rather than a word boundary.
[abc|def] matches any single character listed between the brackets, while (one|two|three) matches any one of the alternatives (one, two, or three).
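Both points can be checked quickly in an interpreter. The snippet below is a small illustration written for this page, not code from the answer:

```python
import re

# A character class matches exactly one character from the set t, h, |, s, n, d.
in_class = re.fullmatch(r"[th|st|nd]", "t")
not_sequence = re.fullmatch(r"[th|st|nd]", "th")

# A group with alternation matches one of the whole sequences.
as_group = re.fullmatch(r"(?:th|st|nd)", "th")

# Without the r prefix, \b is a single backspace character.
backspace = "\b"
boundary = r"\b"

with_raw = re.search(r"\bJan", "24 Jan")
without_raw = re.search("\bJan", "24 Jan")
print(bool(in_class), bool(not_sequence), bool(as_group))
print(len(backspace), len(boundary), bool(with_raw), bool(without_raw))
```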

Python regex to find phrases contain exact words

I have a list of strings and wish to find exact phrases.
So far my code finds the month and year only, but I need the whole phrase including “- Recorded”, like “March 2016 - Recorded”.
How can I add “- Recorded” to the regex?
import re
texts = [
"Shawn Dookhit took annual leave in March 2016 - Recorded The report",
"Soondren Armon took medical leave in February 2017 - Recorded It was in",
"David Padachi took annual leave in May 2016 - Recorded It says",
"Jack Jagoo",
"Devendradutt Ramgolam took medical leave in August 2016 - Recorded Day back",
"Kate Dudhee",
"Vinaye Ramjuttun took annual leave in - Recorded Answering"
]
regex = re.compile(r'(?P<month>[a-zA-Z]+)\s+(?P<year>\d{4})\s')
for t in texts:
    try:
        m = regex.search(t)
        print(m.group())
    except AttributeError:
        print("keyword's not found")
You have two named groups here, month and year, which capture the month and the year from your strings. To capture - Recorded in a named group called recorded, you can do this:
regex = re.compile(r'(?P<month>[a-zA-Z]+)\s+(?P<year>\d{4})\s(?P<recorded>- Recorded)')
Or you can just add - Recorded to your regex without a named group:
regex = re.compile(r'(?P<month>[a-zA-Z]+)\s+(?P<year>\d{4})\s- Recorded')
Or you can add a named group other that matches a hyphen followed by one capitalized word:
regex = re.compile(r'(?P<month>[a-zA-Z]+)\s+(?P<year>\d{4})\s(?P<other>- [A-Z][a-z]+)')
I think the first or third option is preferable because you already use named groups. I also recommend http://pythex.org/, which really helps when constructing a regex :).
Use a list comprehension with the corrected regex:
regex = re.compile(r'(?P<month>[a-zA-Z]+)\s+(?P<year>\d{4})\s* - Recorded')
matches = [match.groups() for text in texts for match in [regex.search(text)] if match]
print(matches)
# [('March', '2016'), ('February', '2017'), ('May', '2016'), ('August', '2016')]
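To get the whole phrase rather than only the groups, `group(0)` (or `group()`) returns the full matched text, while `groupdict()` still exposes the named parts. A minimal sketch, using a slightly more whitespace-tolerant variant of the patterns above:

```python
import re

regex = re.compile(r"(?P<month>[a-zA-Z]+)\s+(?P<year>\d{4})\s+-\s+Recorded")

text = "Shawn Dookhit took annual leave in March 2016 - Recorded The report"
m = regex.search(text)

phrase = m.group(0) if m else None   # the entire matched phrase
parts = m.groupdict() if m else {}   # the named groups only
print(phrase, parts)
```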

Python regex: Matching one named group or another

I have a string that can contain either of the following:
lots of text Nov 30 2011 lots more of text
or
lots of text Nov 30 12:48 lots more of text
What I want to match is the date inside that line. What I want to get is the following for the first line:
{'date': 'Nov 30 2011', 'time': None}
or for the second line:
{'date': None, 'time': 'Nov 30 12:48'}
So my attempt was this:
re.match(
    '^.+((?P<date>\w{3} \d{1,2} \d{4})|(?P<time>\w{3} \d{1,2}:\d{2})).+',
    line
)
But this does not work; it returns None. I tried some other combinations, but none worked.
How can I do this?
Your <time> group is missing the day: as written it only matches something like "Nov 12:48". Add the day:
(?P<date>\w{3} \d{1,2} \d{4})|(?P<time>\w{3} \d{1,2} \d{1,2}:\d{2})
Also, you can probably match that pattern without the ^.+(...).+ wrapper (using re.search instead of re.match), since the wrapper doesn't add much beyond requiring at least one character before and after your date.
I'd also recommend replacing the literal spaces with \s+ or  + (space plus, or [ ]+ if you want it visible): since you have double spaces in some places, matching a single literal space isn't too robust.
Another option is to avoid repetition: keep the date in its own group, and add an alternation between the time and the year:
(?P<date>\w{3}\s+\d{1,2})\s+(?:(?P<year>\d{4})|(?P<time>\d{1,2}:\d{2}))
Working example: http://rubular.com/r/g81Kudu0dY (without names)
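In Python, the dictionaries asked for in the question come straight from `groupdict()`. The following sketch uses the two-alternative pattern with `\s+` in place of literal spaces; `PATTERN` and `extract` are names chosen for illustration.

```python
import re

PATTERN = re.compile(
    r"(?P<date>\w{3}\s+\d{1,2}\s+\d{4})"
    r"|(?P<time>\w{3}\s+\d{1,2}\s+\d{1,2}:\d{2})"
)

def extract(line):
    # search scans the whole line, so no ^.+ ... .+ wrapper is needed.
    m = PATTERN.search(line)
    return m.groupdict() if m else None

print(extract("lots of text Nov 30 2011 lots more of text"))
print(extract("lots of text Nov 30 12:48 lots more of text"))
```

The group that did not take part in the match is reported as None, which gives the shape shown in the question.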
