Find date from image/text

I have dates like these, and I need a regex that finds all of these formats:
12-23-2019
29 10 2019
1:2:2018
9/04/2019
22.07.2019
Here's what I did.
First I removed all spaces from the text, which gives:
12-23-2019291020191:02:2018
and this is my regex:
re.findall(r'((\d{1,2})([.\/-])(\d{2}|\w{3,9})([.\/-])(\d{4}))',new_text)
It finds 12-23-2019, 9/04/2019 and 22.07.2019, but it cannot find 29 10 2019 or 1:02:2018.

You may use
(?<!\d)\d{1,2}([.:/ -])(?:\d{1,2}|\w{3,})\1\d{4}(?!\d)
See the regex demo
Details
(?<!\d) - no digit right before
\d{1,2} - 1 or 2 digits
([.:/ -]) - a dot, colon, slash, space or hyphen (captured in Group 1)
(?:\d{1,2}|\w{3,}) - 1 or 2 digits or 3 or more word chars
\1 - same value as in Group 1
\d{4} - four digits
(?!\d) - no digit allowed right after
Python sample usage:
import re
text = 'Aaaa 12-23-2019, bddd 29 10 2019 <=== 1:2:2018'
pattern = r'(?<!\d)\d{1,2}([.:/ -])(?:\d{1,2}|\w{3,})\1\d{4}(?!\d)'
results = [x.group() for x in re.finditer(pattern, text)]
print(results) # => ['12-23-2019', '29 10 2019', '1:2:2018']
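To see what the backreference \1 and the two digit lookarounds contribute, here is a small check against the five formats from the question plus two strings that should be rejected. It is only a sketch: the helper name find_dates and the rejected sample strings are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as above; group 1 captures the first separator and \1
# forces the second separator to be identical.
DATE_RE = re.compile(r'(?<!\d)\d{1,2}([.:/ -])(?:\d{1,2}|\w{3,})\1\d{4}(?!\d)')


def find_dates(text):
    """Return every full date match found in text (hypothetical helper)."""
    return [m.group() for m in DATE_RE.finditer(text)]


# The five formats listed in the question.
accepted = find_dates('12-23-2019, 29 10 2019, 1:2:2018, 9/04/2019, 22.07.2019')

# Mixed separators, and a date glued to an extra leading digit.
rejected = find_dates('12-23/2019 and 112-23-2019')

print(accepted)
print(rejected)
```

Note that the pattern only works if the spaces are left in the text, since the space is one of the allowed separators.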

Related

Parsing dates in Different format from Text

I have a dataframe whose raw text column contains dates in different formats. I am looking to extract these dates into a separate column.
Sample raw text:
"Sales Assistant # DFS Duration - June 2021 - 2023 Currently
working in XYZ Within the role I am expected to achieve sales targets
which I currently have no problems reaching. Job Role/Establishment -
Plasterer # XX Plasterer’s Duration - September 2016 - Nov 2016
Job Role/Establishment - Customer Advisor # AA Duration - (2015 –
2016) Job Role/Establishment - Warehouse Operative # xyz Duration -
03/2014 to 08/2015 In the xyz warehouse Job Role/Establishment - Airport Terminal Assistant # port Duration - 01/2012 - 06/2013
Working at the airport . Job Role/Establishment - Apprentice Floorer #
YY Floors Duration - DEC 2010 – APRIL 2012 "
Expected Dataframe :
id Raw_text Dates
01 "sample_raw_text" June 2021 - 2023 , September 2016 - Nov 2016,(2015 – 2016),03/2014 to 08/2015 , 01/2012 - 06/2013, DEC 2010 – APRIL 2012
I have tried the pattern below:
def extract_dates(df, column):
    # Define the regex pattern to match dates in different month formats
    pattern = r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{1,2}[-,\s]*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{2,4}\s*[-–]\s*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{1,2}[-,\s]*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{2,4}'
    # Extract the dates from the specified column
    df['Dates'] = df[column].str.extract(pattern)
With the above I am unable to get the required output. Please advise what I am missing.

Try this:
\(?(?:\b[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?\s*(?:–|-|[Tt][Oo])\s*\(?(?:[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?|\(\s*[A-Za-z]{3,9}\s*[–-]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*\)
\(? an optional (.
(?:\b[A-Za-z]{3,9}\s*)? an optional non-capturing group (the month name).
    \b a word boundary.
    [A-Za-z]{3,9} between 3 and 9 letters.
    \s* zero or more whitespace characters.
    ? makes the whole group optional.
(?:\d\d?\/){0,2} a non-capturing group repeated 0 to 2 times.
    \d a digit between 0-9.
    \d? an optional second digit between 0-9.
    \/ a literal forward slash /.
[12]\d{3}\)?\s*
    [12] match one digit from the listed digits 1 or 2.
    \d{3} three digits between 0-9.
    \)? an optional ).
    \s* zero or more whitespace characters.
(?:–|-|[Tt][Oo])\s*
    (?:–|-|[Tt][Oo]) match –, -, TO, to, To or tO.
    \s* zero or more whitespace characters.
\(? an optional (.
(?:[A-Za-z]{3,9}\s*)? explained above.
(?:\d\d?\/){0,2} explained above.
[12]\d{3} explained above.
\)? an optional ).
The alternative after the |, \(\s*[A-Za-z]{3,9}\s*[–-]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*\), matches a parenthesized range of the form (month – month year).
See regex demo
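A minimal sketch of how the pattern above could be applied in Python. The sample text is a shortened, single-line version of the raw text from the question, and the names RANGE_RE and sample are illustrative assumptions. Because the pattern contains no capturing groups, re.findall returns the full matches.

```python
import re

# The pattern from this answer, split over several raw strings for readability.
RANGE_RE = re.compile(
    r'\(?(?:\b[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?'
    r'\s*(?:–|-|[Tt][Oo])\s*'
    r'\(?(?:[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?'
    r'|\(\s*[A-Za-z]{3,9}\s*[–-]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*\)'
)

# Shortened sample based on the question's raw text (assumed, not verbatim).
sample = (
    "Sales Assistant Duration - June 2021 - 2023 Currently working. "
    "Plasterer Duration - September 2016 - Nov 2016 "
    "Customer Advisor Duration - (2015 – 2016) "
    "Warehouse Operative Duration - 03/2014 to 08/2015 "
    "Apprentice Floorer Duration - DEC 2010 – APRIL 2012"
)

ranges = RANGE_RE.findall(sample)
for r in ranges:
    print(r)

# Joined the way the expected "Dates" column in the question looks.
dates_cell = ', '.join(ranges)
```

For a dataframe column, the same pattern could presumably be passed to pandas' Series.str.findall and the resulting lists joined with Series.str.join; that variant is not tested here.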

Python Regular expression of group to match text before amount

I am trying to write a Python regular expression which captures multiple values from a few columns in a dataframe. The regular expression below attempts to do that. The string has 4 parts:
group 1: Date - month and day
group 2: Date - month and day
group 3: description text before amount i.e. group 4
group 4: amount - this group is optional
Some peculiar conditions for group 3 (the text):
(1) The text itself might contain characters like "-" and "$", so we cannot use - and $ as the boundary of the text.
(2) The text (group 3) is sometimes not followed by an amount.
(3) The whitespace between groups 3 and 4 is optional.
Below is the Python function, which takes a dataframe with 4 columns c1, c2, c3, c4 and adds the columns dt, txt and amt to it after processing.
def parse_values(args):
    re_1='(([JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC]{3}\s{0,}[\d]{1,2})\s{0,}){2}(.*[\s]|.*[^\$]|.*[^-]){1}([-+]?\$[\d|,]+(?:\.\d+)?)?'
    srch=re.search(re_1, args[0])
    if srch is None:
        return args
    m = re.match(re_1, args[0])
    args['dt']=m.group(1)
    args['txt']=m.group(3)
    args['amt']=m.group(4)
    if m.group(4) is None:
        if pd.isnull(args['c3']):
            args['amt']=args.c2
        else:
            args['amt']=args.c3
    return args
To test the results I have the 6 rows below, which need to come back with a properly formatted amt column.
tt=[{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL ','c2':'$16.84'},
{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL','c2':'$16.84'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK-$2070.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK$2070.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK $80,00,7770.70'}
]
t=pd.DataFrame(tt,columns=['c1','c2','c3','c4'])
t=t.apply(parse_values,1)
t
However, due to the error in my regular expression re_1, the amt and txt columns are not parsed properly: in some rows of the output they come back as NaN or miss some words.

How about this:
(((?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s*[\d]{1,2})\s*){2}(.*?)\s*(?=[\-$])([-+]?\$[\d|,]+(?:\.\d+)?)
As seen at regex101.com
Explanation:
First off, I've shortened the regex by changing a few minor details like using \s* instead of \s{0,}, which mean the exact same thing.
The whole [JAN|...|DEC] construct was using a character class, i.e. [], which only matches a single character from the entire set. A non-capturing group with alternation, (?:JAN|...|DEC), is the correct way of selecting between sequences of multiple letters, which in your case are the months.
The meat of the regex: LOOKAHEADS
(?=[\-$]) tells the regex that the lazy (.*?) before it should only extend until it reaches a position (after optional whitespace) that is followed by a dash or a dollar sign and where the amount can be matched. Lookaheads don't consume whatever they're looking for; they just assert that the lookahead's argument follows the current position.
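A short sketch of the pattern above applied to some of the question's sample rows. The function name split_row is hypothetical. Note that in this pattern the amount group is required, so a row without an amount does not match at all and has to be handled separately (for example by falling back to the c2/c3 columns as the question's function does).

```python
import re

# The pattern from this answer, split over two raw strings for readability.
ROW_RE = re.compile(
    r'(((?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s*[\d]{1,2})\s*){2}'
    r'(.*?)\s*(?=[\-$])([-+]?\$[\d|,]+(?:\.\d+)?)'
)


def split_row(text):
    """Return (description, amount), or None when no amount is present."""
    m = ROW_RE.search(text)
    if m is None:
        return None
    return m.group(3), m.group(4)


rows = [
    'OCT 7 OCT 8 HURRY CURRY THORNHILL',
    'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70',
    'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK$2070.70',
]
results = [split_row(r) for r in rows]
for row, res in zip(rows, results):
    print(row, '->', res)
```

The dash inside "INC - EAST" does not end the description, because after the lookahead the amount pattern still has to find a dollar sign.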

Insert space to separate conjoined alpha and numeric strings - Python RegEx

In Python, I need to create a regex that inserts a space between any concatenated AlphaNum combinations. For example, this is what I want:
8min15sec ==> 8 min 15 sec
7m12s ==> 7 m 12 s
15mi25s ==> 15 mi 25 s
RegEx101 demo
I am blundering around with solutions found online, but they are a bit too complex for me to parse/modify. For example, I have this:
[a-zA-Z][a-zA-Z\d]*
but it only identifies the first insertion point: 8Xmin15sec (the X)
And this
(?<=[a-z])(?=[A-Z0-9])|(?<=[0-9])(?=[A-Z])
but it only finds this point: 8minX15sec (the X)
I could sure use a hand with the full syntax for finding each insertion point and inserting the spaces.
RegEx101 demo (same link as above)

How about the following approach:
import re
for test in ['8min15sec', '7m12s', '15mi25s']:
    print(re.sub(r'(\d+|\D+)', r'\1 ', test).strip())
Which would give you:
8 min 15 sec
7 m 12 s
15 mi 25 s

You can use this regex, which matches the boundaries between digits and letters in either order, i.e. a digit followed by a letter or a letter followed by a digit.
(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)
The first alternative, (?<=\d)(?=[a-zA-Z]), matches a position using a positive lookbehind for a digit and a positive lookahead for a letter.
Similarly, (?<=[a-zA-Z])(?=\d) does the same in the opposite order.
Then just replace each matched position with a space.
Demo
Here is sample Python code for the same.
import re
arr = ['8min15sec', '7m12s', '15mi25s']
for s in arr:
    print(s + ' --> ' + re.sub(r'(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)', ' ', s))
Which prints the following output:
8min15sec --> 8 min 15 sec
7m12s --> 7 m 12 s
15mi25s --> 15 mi 25 s
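One property of the zero-width approach worth illustrating: because it only inserts a space at digit/letter boundaries, input that is already partly separated is left alone. The sketch below contrasts it with the (\d+|\D+) substitution from the other answer; the function name split_alnum and the input '8min 15sec' are made up for the comparison.

```python
import re

# Zero-width match at every digit/letter boundary, in either order.
BOUNDARY = re.compile(r'(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)')


def split_alnum(s):
    """Insert a space wherever a digit touches a letter (hypothetical helper)."""
    return BOUNDARY.sub(' ', s)


joined = split_alnum('8min15sec')
already_spaced = split_alnum('8min 15sec')

# For comparison: appending a space after every run of digits or non-digits
# doubles a space that was already present in the input.
run_based = re.sub(r'(\d+|\D+)', r'\1 ', '8min 15sec').strip()

print(repr(joined))
print(repr(already_spaced))
print(repr(run_based))
```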

How about:
"(\d+)([a-zA-Z]+)"
to
"\1 \2 "
https://regex101.com/r/yvqCtQ/2
And in python:
In [59]: re.sub(r'(\d+)([a-zA-Z]+)', r'\1 \2 ', '8min15sec')
Out[59]: '8 min 15 sec '

ReGex for surrounding numbers with whitespaces

I would like to find a Regex to convert string like the following one:
wienerstr256pta 18 graz austria8051 4
Into the following one:
wienerstr 256 pta 18 graz austria 8051 4
So I just want to surround every run of digits with spaces.
I know I can easily find the digits by:
/[0-9]+/g
But how can I replace this match with the same content plus extra whitespaces?

You may find all the positions between a non-digit/non-whitespace and a digit, or between a digit and a non-digit/non-whitespace and insert a space there:
(?<=[^0-9\s])(?=[0-9])|(?<=[0-9])(?=[^0-9\s])
Replace with a space.
See the regex demo.
Details
(?<=[^0-9\s]) - matches a position that is immediately preceded with a char other than a digit and a whitespace...
(?=[0-9]) - and is followed with a digit
| - or
(?<=[0-9]) - matches a position immediately preceded with a digit and
(?=[^0-9\s]) - followed with a char other than a digit and a whitespace.
A Pandas test:
>>> import pandas as pd
>>> col_list = ['wienerstr256pta 18 graz austria8051 4']
>>> rx = r'(?<=[^0-9\s])(?=[0-9])|(?<=[0-9])(?=[^0-9\s])'
>>> df = pd.DataFrame(col_list, columns=['col'])
>>> df['col'] = df['col'].replace(rx, " ", regex=True)
>>> df['col']
0 wienerstr 256 pta 18 graz austria 8051 4
Name: col, dtype: object
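The same substitution also works without pandas. The following is a minimal sketch using only the re module; the helper name space_out_numbers and the second test string are assumptions added for illustration. Substituting on zero-width matches like this relies on Python 3.7 or later.

```python
import re

# Positions between a non-digit/non-whitespace char and a digit, or the reverse.
DIGIT_EDGE = re.compile(r'(?<=[^0-9\s])(?=[0-9])|(?<=[0-9])(?=[^0-9\s])')


def space_out_numbers(text):
    """Insert one space at each digit boundary that has no whitespace yet."""
    return DIGIT_EDGE.sub(' ', text)


address = space_out_numbers('wienerstr256pta 18 graz austria8051 4')
kept_blanks = space_out_numbers('a1  2b')  # existing double blank is untouched

print(address)
print(repr(kept_blanks))
```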

echo "wienerstr256pta18graz austria8051 4" \
| sed -r "s/([^0-9])([0-9])/\1 \2/g;s/([0-9])([^0-9])/\1 \2/g;s/ +/ /g"
wienerstr 256 pta 18 graz austria 8051 4
At every transition from a non-digit to a digit or from a digit to a non-digit, both characters are written back with a blank in between. Multiple blanks are condensed into one at the end, since a blank is a non-digit too.
To keep multiple blanks that might be in the input together:
echo "wienerstr256pta18graz austria8051 4" | sed -r "s/([^0-9 ])([0-9])/\1 \2/g;s/([0-9])([^0-9 ])/\1 \2/g;"
wienerstr 256 pta 18 graz austria 8051 4

joining multiple regular expression for readability

My date requirement is that a date can be in either of the following formats:
mm/dd/yyyy or dd Mon YYYY
A few examples are shown below:
04/20/2009 and 24 Jan 2001
To handle this I have written the regular expression shown further down.
A few text scenarios are given below:
txt1 = 'Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox
+ fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical
Review of Systems Constitutional:'
txt2 = "s The patient is a 44 year old married Caucasian woman,
unemployed Decorator, living with husband and caring for two young
children, who is referred by Capitol Hill Hospital PCP, Dr. Heather
Zubia, for urgent evaluation/treatment till first visit with Dr. Toney
Winkler IN EIGHT WEEKS on 24 Jan 2001."
date = re.findall(r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}', txtData)
I am not getting 24 Jan 2001, whereas if I run (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4} on its own I do get the output.
Question 1: What is the bug in the above expression?
Question 2: I want to combine both patterns in a more readable way, since I have to parse other formats as well, so I used join as shown below:
RE1 = '(?:\b(?<!\.)[\d{0,2}]+) (?:[/-]\d{0,}[/-]\d{2,4})'
RE2 = '(?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
regex_all = '|'.join([RE1, RE2])
regex_all = re.compile(regex_all)
date = regex_all.findall(txtData)  # note: txtData can be either of the strings above
With the above I am getting NaN as the output for date.
Please suggest what the mistake is when I join the patterns.
Thanks for your help.

Note that it is a very bad idea to join such long patterns when they can also match at the same location within the string. That can make the regex engine backtrack excessively and lead to slowdowns or even crashes. If there is a way to rewrite the alternatives so that they can only match at different locations, or to get rid of the alternation completely, do it.
Besides, you should use grouping constructs (...) to group sequences of patterns, and only use [...] character classes when you need to match specific chars.
Also, your alternatives overlap, so you can combine them easily. See the fixed regex:
\b(?<!\.)\d{1,2}(?:[/-]\d+[/-]|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b
See the regex demo.
Details
\b - a word boundary
(?<!\.) - no . immediately to the left of the current location
\d{1,2} - 1 or 2 digits
(?: - start of a non-capturing alternation group:
[/-]\d+[/-] - / or -, 1+ digits, - or /
| - or
(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)) - th, st, nd or rd (optionally), followed with 0+ whitespaces, and then month names
\s* - 0+ whitespaces
(?:\d{4}|\d{2}) - 2 or 4 digits
\b - trailing word boundary.
Another note: if you want to match date-like strings whose two delimiters are the same, you need to capture the first one and use a backreference to match the second one, see this regex demo. In Python, you would then use re.finditer to get those matches, because re.findall returns only the captured group when the pattern contains one.
See this Python demo:
import re
rx = r"\b(?<!\.)\d{1,2}(?:([/-])\d+\1|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b"
s = "Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox\nfluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical\nReview of Systems Constitutional:\n\nThe patient is a 44 year old married Caucasian woman, unemployed Decorator, living with husband and caring for two young children, who is referred by Capitol Hill Hospital PCP, Dr. Heather Zubia, for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001"
print([x.group(0) for x in re.finditer(rx, s, re.I)])
# => ['7/11/77', '24 Jan 2001']
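The reason for re.finditer here can be shown with a much smaller pattern. This is a sketch with an assumed, simplified regex (numeric dates only) and a made-up sample sentence: once a pattern contains a capturing group, re.findall returns the group contents rather than the whole match.

```python
import re

# Simplified pattern: group 1 captures the delimiter, \1 requires it again.
rx = r"\b\d{1,2}([/-])\d{1,2}\1\d{4}\b"
s = "Seen 04/20/2009 and 04-20/2009."

# findall returns what group 1 captured, i.e. just the delimiter.
groups_only = re.findall(rx, s)

# finditer yields match objects, so group(0) gives the whole date.
full = [m.group(0) for m in re.finditer(rx, s)]

print(groups_only)
print(full)
```

The second date in the sample is skipped because its two delimiters differ.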

I think your approach is too complicated. I suggest using a combination of a simple regex and strptime().
import re
from datetime import datetime
date_formats = ['%m/%d/%Y', '%d %b %Y']
pattern = re.compile(r'\b(\d\d?/\d\d?/\d{4}|\d\d? \w{3} \d{4})\b')
data = "... your string ..."
for match in pattern.findall(data):
    print("Trying to parse '%s'" % match)
    for fmt in date_formats:
        try:
            date = datetime.strptime(match, fmt)
            print(" OK:", date)
            break
        except ValueError:
            # strptime raises ValueError when the text does not fit the format
            pass
The advantage of this approach is, besides a much more manageable regex, that it won't pick dates that look plausible but do not exist, like 2/29/2001 (whereas 2/29/2004 works).
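As a concrete illustration of that validation step, the loop can be wrapped in a function and run on a sentence that contains one impossible date. The function name parse_dates and the sample sentence are assumptions for this sketch; note also that %b depends on the current locale, so month abbreviations such as "Jan" are assumed to be parsed under an English or C locale.

```python
import re
from datetime import datetime

DATE_FORMATS = ['%m/%d/%Y', '%d %b %Y']
CANDIDATE = re.compile(r'\b(\d\d?/\d\d?/\d{4}|\d\d? \w{3} \d{4})\b')


def parse_dates(text):
    """Return datetime objects for every candidate that is a real date."""
    found = []
    for candidate in CANDIDATE.findall(text):
        for fmt in DATE_FORMATS:
            try:
                found.append(datetime.strptime(candidate, fmt))
                break
            except ValueError:
                # Wrong format for this candidate, or a non-existent date.
                continue
    return found


# 2/29/2001 matches the regex but is not a real date (2001 is not a leap year).
parsed = parse_dates("Visits on 04/20/2009, 2/29/2001 and 24 Jan 2001.")
print(parsed)
```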

r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
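You should use raw strings (r'foo') for each string literal, not only the first one. In a regular string literal Python interprets backslash escapes itself (for example, '\b' becomes a backspace character instead of a word boundary); in a raw string the backslashes are kept as-is and reach the re library unchanged.
[abc|def] matches any single character listed between the [], while (one|two|three) matches any one of the whole expressions (one, two, or three).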
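Both points can be checked in a few lines. This is a small sketch with made-up sample strings, not code from the question:

```python
import re

# '\b' in a normal string is one character (backspace); r'\b' keeps both
# characters, which the re module then reads as a word boundary.
plain_len = len('\b')
raw_len = len(r'\b')

no_match = re.findall('\b24', 'on 24 Jan')      # looks for a literal backspace
word_boundary = re.findall(r'\b24', 'on 24 Jan')

# A character class matches one character at a time, the pipe included;
# a group with alternation matches whole alternatives.
char_class = re.findall(r'[Jan|Feb]', 'Feb')
alternation = re.findall(r'(?:Jan|Feb)', 'Feb')

print(plain_len, raw_len)
print(no_match, word_boundary)
print(char_class, alternation)
```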
