I am reading 'Automate the Boring Stuff with Python' and I'm stuck on chapter 7 (the regex part). The book gives a template for American phone numbers, which I want to adapt for Ukrainian phone numbers.
Ukrainian numbers can appear in different formats, such as: +380445371428, +38(044)5371428, +38(044)537 14 28, +38(044)537-14-28, +38(044) 537.14.28, 044.537.14.28, 0445371428, 044-537-1428, (044)537-1428, 044 537-1428, etc.
My implementation is below, but it's not quite correct. What do I need to change?
When I copy text from website pages, I want to extract the Ukrainian numbers from everything I have copied, in this format: 044-537-1428.
phoneRegex = re.compile(r'''(
(^\+38?) # area code(not necessarily)
(\d{3}|\(\d{3}\)) # carrier code(usually starts with 0
(\s|-|\.)? # separator
(\d{3}|\(\d{3}\)) # first 3 digits
(\s|-|\.) # separator
(\d{4}) # last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))? # extension
)''', re.VERBOSE)
The template for American numbers (according to the book) looks like the following:
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?
(\s|-|\.)
(\d{3})
(\s|-|\.)
(\d{4})
(\s*(ext|x|ext.)\s*(\d{2,5}))?
)''', re.VERBOSE)
Maybe, an option would be to incorporate alternation, based on the types of patterns that we might have, such as:
^(?:\+38)?(?:\(044\)[ .-]?[0-9]{3}[ .-]?[0-9]{2}[ .-]?[0-9]{2}|044[ .-]?[0-9]{3}[ .-]?[0-9]{2}[ .-]?[0-9]{2}|044[0-9]{7})$
Or something even more restrictive than that, if we are validating.
Test
import re
regex = r'^(?:\+38)?(?:\(044\)[ .-]?[0-9]{3}[ .-]?[0-9]{2}[ .-]?[0-9]{2}|044[ .-]?[0-9]{3}[ .-]?[0-9]{2}[ .-]?[0-9]{2}|044[0-9]{7})$'
string = '''
+380445371428
+38(044)5371428
+38(044)537 14 28
+38(044)537-14-28
+38(044) 537.14.28
044.537.14.28
0445371428
044-537-1428
(044)537-1428
044 537-1428
+83(044)537 14 28
088 537-1428
'''
print(re.findall(regex, string, re.M))
Output
['+380445371428', '+38(044)5371428', '+38(044)537 14 28',
'+38(044)537-14-28', '+38(044) 537.14.28', '044.537.14.28',
'0445371428', '044-537-1428', '(044)537-1428', '044 537-1428']
If you wish to simplify, modify, or explore the expression, regex101.com explains it in its top-right panel, and you can also watch there how it matches against some sample inputs.
Here is my regex for all Ukrainian numbers:
^\+?3?8?(0[\s\.-]\d{2}[\s\.-]\d{3}[\s\.-]\d{2}[\s\.-]\d{2})$
This allows:
+380 XX XXX XX XX or
+380-XX-XXX-XX-XX
or same without "plus"
I am not familiar with Python, but I think the following regex would solve your problem:
((\+38)?\(?\d{3}\)?[\s\.-]?(\d{7}|\d{3}[\s\.-]\d{2}[\s\.-]\d{2}|\d{3}-\d{4}))
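To get every variant into the 044-537-1428 form the question asks for, one option is to match loosely and then normalize the digits. This is only a minimal sketch, not the book's method: the pattern, the `normalize` helper, and the sample text are assumptions made for illustration, and it assumes a three-digit code starting with 0 followed by seven digits.

```python
import re

# Hypothetical loose pattern: optional +38, optional parentheses around the
# three-digit code, then 3 + 2 + 2 digits with optional space/dot/dash separators.
UA_PHONE = re.compile(r'(?:\+38)?\(?0\d{2}\)?[ .-]?\d{3}[ .-]?\d{2}[ .-]?\d{2}')

def normalize(raw):
    """Keep only the digits, drop a leading 38, and format as 0XX-XXX-XXXX."""
    digits = re.sub(r'\D', '', raw)[-10:]  # the last 10 digits are the national number
    return f'{digits[:3]}-{digits[3:6]}-{digits[6:]}'

sample = 'Call +38(044)537 14 28, 044.537.14.28 or (044)537-1428.'
numbers = [normalize(m) for m in UA_PHONE.findall(sample)]
print(numbers)
```

Separating "find" from "format" keeps the regex short, because the pattern no longer has to describe the output layout.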
I am trying to extract the comma-delimited numbers inside () brackets from a string. I can get the numbers if they are alone on a line, but I can't seem to find a solution that works when other surrounding text is involved. Any help will be appreciated. Below is the code that I currently use in Python.
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
line = each.strip()
regex_criteria = r'"^([1-9][0-9]*|\([1-9][0-9]*\}|\(([1-9][0-9]*,?)+[1-9][0-9]*\))$"gm'
if (line.__contains__('(') and line.__contains__(')') and not re.search('[a-zA-Z]', refline)):
refline = line[line.find('(')+1:line.find(')')]
if not re.search('[a-zA-Z]', refline):
Remove the ^ and $ anchors; they are what is preventing you from getting all the numbers. Also, the gm flags won't work in Python's re module; flags are passed as a separate argument (for example re.M).
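As a small illustration of both points, here is a sketch with a made-up `text` value and a simplified pattern (both are assumptions, not the asker's data):

```python
import re

text = "(101065,101066)\nsome text (101067)\n(42)"

# A JavaScript-style /.../gm suffix has no meaning inside a Python pattern string;
# multiline mode is requested through the flags argument instead.
anchored = re.findall(r'^\(\d+(?:,\d+)*\)$', text, flags=re.MULTILINE)

# Without ^ and $ the same pattern also finds lists embedded in other text.
unanchored = re.findall(r'\(\d+(?:,\d+)*\)', text)

print(anchored)
print(unanchored)
```

The anchored version only reports lines that consist of nothing but the bracketed list, which is why the numbers surrounded by prose were being missed.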
You can change your regex to :([1-9][0-9]*|\([1-9][0-9]*\}|\(?:([1-9][0-9]*,?)+[1-9][0-9]*\)) if you want to get each number separately.
Or you can simplify your pattern to (?<=[(,])[1-9][0-9]+(?=[,)])
Test regex here: https://regex101.com/r/RlGwve/1
Python code:
import re
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
print(re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', line))
# ['101065', '101066', '101067', '101065']
(?<=[(,])[1-9][0-9]+(?=[,)])
The above pattern matches a number that begins with 1-9 followed by one or more digits, but only if it is immediately preceded by ( or , and immediately followed by , or ).
Here's another option:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)*(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?<=\(): Lookbehind for (
[1-9]+\d*: At least one digit, the first of which is not 0 (\d+ would work too if leading zeros are acceptable)
(?:,[1-9]\d*)*: Zero or more further numbers, each preceded by a ,
(?=\)): Lookahead for )
Result for your line:
[['101065', '101066', '101067'], ['101065']]
If you only want the comma separated numbers:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)+(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?:,[1-9]\d*)+: One or more numbers after a ,
Result:
[['101065', '101066', '101067']]
Now, if your line could also look like
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
then you have to sprinkle the pattern with \s* and remove the whitespace afterwards (here with str.translate and str.maketrans):
pattern = re.compile(r"(?<=\()\s*[1-9]+\d*(?:\s*,\s*[1-9]\d*\s*)*(?=\))")
table = str.maketrans("", "", " ")
results = [match[0].translate(table).split(",") for match in pattern.finditer(line)]
Result:
[['101065', '101066', '101067'], ['101065']]
Using the PyPI regex module, you could also reuse a named capture group and read all of its captures:
\((?P<num>\d+)(?:,(?P<num>\d+))*\)
The pattern matches:
\( Match (
(?P<num>\d+) Capture group, match 1+ digits
(?:,(?P<num>\d+))* Optionally repeat matching , and 1+ digits in a capture group
\) Match )
Example code
import regex
pattern = r"\((?P<num>\d+)(?:,(?P<num>\d+))*\)"
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
matches = regex.finditer(pattern, line)
for m in matches:
    print(m.capturesdict())
Output
{'num': ['101065', '101066', '101067']}
{'num': ['101065']}
I need a Python regex which matches to mobile phone numbers from Germany and Austria.
In order to do so, we first have to understand the structure of a phone number:
A mobile number can be written with a country calling code at the beginning. However, this code is optional!
If we use the country calling code, the trunk prefix is redundant!
The prefix is composed of the trunk prefix and the company code.
The prefix is followed by an individual, unique number with 7 or 8 digits.
List of German prefixes:
0151, 0160, 0170, 0171, 0175, 0152, 0162, 0172, 0173, 0174, 0155, 0157, 0159, 0163, 0176, 0177, 0178, 0179, 0164, 0168, 0169
List of Austrian prefixes:
0664, 0680, 0688, 0681, 0699, 0664, 0667, 0650, 0678, 0650, 0677, 0676, 0660, 0699, 0690, 0665, 0686, 0670
Now that we know all the rules needed to build a regex, we have to consider that humans sometimes write numbers in very strange ways, with multiple whitespaces, / or (). For example:
0176 98 600 18 9
+49 17698600189
+(49) 17698600189
0176/98600189
0176 / 98600189
and many more ways to write the same number
I am looking for a Python regex which can match all Austrian and German mobile numbers.
What I have so far is this:
^(?:\+4[39]|004[39]|0|\+\(49\)|\(\+49\))\s?(?=(?:[^\d\n]*\d){10,11}(?!\d))(\()?[19][1567]\d{1,2}(?(1)\))\s?\d(?:[ /-]?\d)+
You can use
(?x)^ # Free spacing mode on and start of string
(?: # A container group:
(\+49|0049|\+\(49\)|\(\+49\))? [ ()\/-]* # German: country code
(?(1)|0)1(?:5[12579]|6[023489]|7[0-9]) # trunk prefix and company code
| # or
(\+43|0043|\+\(43\)|\(\+43\))? [ ()\/-]* # Austrian: country code
(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])) # trunk prefix and company code
)
[ ()\/-]* # zero or more spaces, parens, / and -
\d(?:[ \/-]*\d){6,7} # a digit and then six or seven occurrences of space, / or - and a digit
\s* # zero or more whitespace characters
$ # end of string
See the regex demo.
A one-line version of the pattern is
^(?:(\+49|0049|\+\(49\)|\(\+49\))?[ ()\/-]*(?(1)|0)1(?:5[12579]|6[023489]|7[0-9])|(\+43|0043|\+\(43\)|\(\+43\))?[ ()\/-]*(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])))[ ()\/-]*\d(?:[ \/-]*\d){6,7}\s*$
See this demo.
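For a quick check in Python, the one-line pattern can be compiled as-is, since the re module supports the conditional construct (?(1)|0). The sketch below uses the sample spellings from the question; the Austrian number and the two rejected inputs are made up for illustration.

```python
import re

# One-line pattern copied from above, split over three raw-string literals.
MOBILE = re.compile(
    r'^(?:(\+49|0049|\+\(49\)|\(\+49\))?[ ()\/-]*(?(1)|0)1(?:5[12579]|6[023489]|7[0-9])'
    r'|(\+43|0043|\+\(43\)|\(\+43\))?[ ()\/-]*(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])))'
    r'[ ()\/-]*\d(?:[ \/-]*\d){6,7}\s*$'
)

accepted = [
    '0176 98 600 18 9',
    '+49 17698600189',
    '+(49) 17698600189',
    '0176/98600189',
    '0176 / 98600189',
    '+43 664 1234567',   # made-up Austrian number
]
rejected = [
    '+49 017698600189',  # country code AND trunk prefix 0
    '0199 98600189',     # 0199 is not in either prefix list
]

for number in accepted + rejected:
    print(f'{number!r}: {bool(MOBILE.search(number))}')
```

The conditional is what enforces the "trunk prefix is redundant" rule: the leading 0 is required only when no country code was captured.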
How to create company code regex
Go to the "Optimize long lists of fixed string alternatives in regex" post
Click the Run code snippet button at the bottom of the answer to run the last code snippet
Re-size the input box if you wish
Get the list of your supported numbers, either comma or linebreak separated, and paste it into the field
Click the Generate button, and grab the pattern that appears below.
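If you would rather stay in Python, the steps above can be approximated with a plain, non-optimized alternation built from the prefix list. This is a sketch only and not the linked tool's algorithm; `plain_alternation` is a hypothetical helper name. It also checks that the result accepts the same prefixes as the hand-optimized German fragment used in the pattern.

```python
import re

# German prefixes from the list in the question.
german_prefixes = [
    "0151", "0160", "0170", "0171", "0175", "0152", "0162", "0172", "0173",
    "0174", "0155", "0157", "0159", "0163", "0176", "0177", "0178", "0179",
    "0164", "0168", "0169",
]

def plain_alternation(prefixes):
    """Drop the trunk prefix 0, de-duplicate, and join the codes with |."""
    codes = sorted({p[1:] for p in prefixes})
    return "(?:" + "|".join(re.escape(c) for c in codes) + ")"

generated = re.compile("0" + plain_alternation(german_prefixes))
# Hand-optimized fragment taken from the answer's pattern.
optimized = re.compile(r"01(?:5[12579]|6[023489]|7[0-9])")

# Compare both patterns on every possible four-digit string.
candidates = [f"{n:04d}" for n in range(10000)]
same = all(
    bool(generated.fullmatch(c)) == bool(optimized.fullmatch(c))
    for c in candidates
)
print(plain_alternation(german_prefixes))
print(same)
```

The plain alternation is longer but easier to regenerate when the prefix list changes; the optimized form mainly saves characters.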
I have a string from a NWS bulletin:
LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley
My aim is to extract a couple fields with regular expressions. In the first string I want "AAD" and from the second string I want "RECHNX". I have tried:
( )\w{3} #for the first string
and
\w{6} #for the 2nd string
But these find all 3 and 6 character strings leading up to the string I want.
Assuming the fields you want to extract are always in capital letters and preceded by 6 digits and a space, this regular expression would do the trick:
(?<=\d{6}\s)[A-Z]+
Demo: https://regex101.com/r/dsDHTs/1
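In Python this could look like the following sketch, which runs the lookbehind pattern over both bulletin lines from the question (the variable names are arbitrary):

```python
import re

bulletin = """LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley"""

# The lookbehind is fixed-width (six digits plus one whitespace character),
# which Python's re module requires for lookbehind assertions.
fields = re.findall(r'(?<=\d{6}\s)[A-Z]+', bulletin)
print(fields)
```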
Edit: if you want to match up to two alpha-numeric uppercase words preceded by 6 digits, you can use:
(?<=\d{6}\s)([A-Z0-9]+\b)\s(?:([A-Z0-9]+\b))*
Demo: https://regex101.com/r/dsDHTs/5
If you have a specific list of valid fields, you could also simply use:
(AAD|TMLB|RECHNX|RR4HNX)
https://regex101.com/r/dsDHTs/3
Since the substring you want to extract is a word that follows a number, separated by a space, you can use re.search with the following regex (given your input stored in s):
re.search(r'\b\d+ (\w+)', s).group(1)
To read the first few groups of word characters from each line, you can use a pattern like
(\w+) (\w+) (\w+) (\w+)
Then, from the first line read group 4, and from the second line read group 3.
Look at the following program. It prints four groups from each source line:
import re
txt = """LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley"""
n = 0
pat = re.compile(r'(\w+) (\w+) (\w+) (\w+)')
for line in txt.splitlines():
    n += 1
    print(f'{n:2}: {line}')
    mtch = pat.search(line)
    if mtch:
        gr = [ mtch.group(i) for i in range(1, 5) ]
        print(f' {gr}')
The result is:
1: LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
['LTUS41', 'KCAR', '141558', 'AAD']
2: KHNX 141001 RECHNX Weather Service San Joaquin Valley
['KHNX', '141001', 'RECHNX', 'Weather']
I am trying to extract phone numbers from a web page using Python & RegEx
Australian number format
+61 (international code - shown below as 'i')
02, 03, 07 or 08 (state codes - shown below as 's')
1234-5678 (8 digit local number - shown below as 'x')
Common variations of format (in order of commonality):
Format 1: ss xxxx xxxx (e.g. 02 1234 5678)
Format 2: +ii s xxxx xxxx (e.g. +61 2 1234 5678) (note the first 's' digit is removed here)
Format 3: (seen rarely) +ii (s)s xxxx-xxxx (e.g. +61 (0)2 1234 5678)
My RegEx:
re.findall(r'[0][2]\d{8}|[0][3]\d{8}|[0][7]\d{8}|[0][8]\d{8}|[6][1][2]\d{8}|[6][1][3]\d{8}|[6][1][7]\d{8}|[6][1][8]\d{8}|[0][4]\d{8}|[6][1][4]\d{8}|[1][3][0][0]\d{6}|[1][8][0][0]\d{6}', re.sub(r'\W+', '', sample_text))
works well on a simple sample_text:
sample_text = "610212345678ABC##610312345678ABC##610712345678ABC##610812345678ABC##0212345678ABC##0312345678ABC##0712345678ABC##0812345678ABC##61212345678ABC##61312345678ABC##61712345678ABC##61812345678ABC##0412345678ABC##61412345678ABC##130012345678ABC##180012345678ABC##"
Result:
['0212345678', '0312345678', '0712345678', '0812345678',
'0212345678', '0312345678', '0712345678', '0812345678',
'61212345678', '61312345678', '61712345678', '61812345678',
'0412345678', '61412345678', '1300123456', '1800123456']
The Goal
Using http://www.outware.com.au/contact as an example ...
The 2 actual numbers on the page are:
+61 (0)3 8684 9912 and +61 (0)2 8064 7043 (both numbers appear twice - once in the main section of the page and once in the footer)
The Problem
# take the visible text of the body element
b = driver.find_element_by_css_selector('body').text
# remove all non-word characters (punctuation and whitespace)
b = re.sub(r'\W+', '', b)
Result:
"PORTFOLIOINNOVATIONSERVICESCAREERSINSIGHTSNEWSABOUTCONTACTCONTACTOUTWAREMelbourneLe......AFRFast100Nov92017EXPLOREOUTWAREPortfolioInnovationWorkingatOutwareAboutSitemapCONNECTMELBOURNELevel3469LaTrobeStMelbourneVIC3000610386849912SYDNEYLevel41SmailStUltimoNSW2007610280647043"
Now if I apply my regex to this string
re.findall(r'[0][2]\d{8}|[0][3]\d{8}|[0][7]\d{8}|[0][8]\d{8}|[6][1][2]\d{8}|[6][1][3]\d{8}|[6][1][7]\d{8}|[6][1][8]\d{8}|[0][4]\d{8}|[6][1][4]\d{8}|[1][3][0][0]\d{6}|[1][8][0][0]\d{6}', re.sub(r'\W+', '', b))
Result:
[u'0386849912', u'0761028064', u'0386849912', u'0761028064']
I am getting a false positive because I have concatenated a postcode "NSW2007" onto the start of the phone number.
I presume that because the regex has already consumed the first part of "NSW2007610280647043" when matching "0761028064", it doesn't then match "0280647043", which overlaps it in the same substring.
I actually don't mind the false positive (i.e. getting "0761028064") but I do need to solve the false negative (i.e. not getting "0280647043")
I know there's some RegEx gurus here who can help on this. :-)
Please help!!
Don't strip characters from the text before applying the regex; that removes the separators that delimit the numbers and makes your input unusable. Try this:
(?:(?:\+?61 )?(?:0|\(0\))?)?[2378] \d{4}[ -]?\d{4}
https://regex101.com/r/1Q4HuD/3
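A short sketch of that approach in Python: the pattern is applied to the unmodified text, and the separators are stripped only afterwards, from each match. The `page_text` string is a made-up stand-in for the scraped page.

```python
import re

pattern = re.compile(r'(?:(?:\+?61 )?(?:0|\(0\))?)?[2378] \d{4}[ -]?\d{4}')

# Stand-in for the raw page text, with spaces and punctuation left intact.
page_text = 'MELBOURNE +61 (0)3 8684 9912 SYDNEY +61 (0)2 8064 7043'

found = pattern.findall(page_text)
# Normalize AFTER matching, one number at a time.
digits_only = [re.sub(r'\D', '', number) for number in found]

print(found)
print(digits_only)
```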
It might help to use a negative lookahead to make sure the character following the match is not a digit, for example (?!\d).
This could create a problem, though, if the data following a phone number starts with a digit.
The lookahead looks like this when added to your regex:
(02\d{8}|03\d{8}|07\d{8}|08\d{8}|612\d{8}|613\d{8}|617\d{8}|618\d{8}|04\d{8}|614\d{8}|1300\d{6}|1800\d{6})(?!\d)
(I removed the square brackets as you do not need them when trying to match a single character)
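To see the effect, here is a sketch that runs the pattern with and without the trailing lookahead on a shortened version of the stripped page text from the question (the `stripped` string is abbreviated for illustration):

```python
import re

alternatives = (
    r'02\d{8}|03\d{8}|07\d{8}|08\d{8}|612\d{8}|613\d{8}|617\d{8}|618\d{8}'
    r'|04\d{8}|614\d{8}|1300\d{6}|1800\d{6}'
)

# Postcodes are glued to the phone numbers, as in the question.
stripped = 'VIC3000610386849912SYDNEYNSW2007610280647043'

without_lookahead = re.findall(alternatives, stripped)
with_lookahead = re.findall(r'(?:' + alternatives + r')(?!\d)', stripped)

print(without_lookahead)
print(with_lookahead)
```

With the lookahead, the candidate that starts inside the postcode is rejected because more digits follow it, so the scan continues and reaches the real number.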
This answer should be a comment, it isn't because of my low reputation!
I've seen you're updating the regex and I think this variation can help you. It should match very uncommon formats!
(\+61 )?(?:0|\(0\))?[2378] (?:[\s-]*\d){8}
I need to find dates that can be in either of the following formats:
mm/dd/yyyy or dd Mon YYYY
A few examples are shown below:
04/20/2009 and 24 Jan 2001
To handle this, I have written the regular expression shown further below.
A few text scenarios are given here:
txt1 = '''Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox
+ fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical
Review of Systems Constitutional:'''
txt2 = """s The patient is a 44 year old married Caucasian woman,
unemployed Decorator, living with husband and caring for two young
children, who is referred by Capitol Hill Hospital PCP, Dr. Heather
Zubia, for urgent evaluation/treatment till first visit with Dr. Toney
Winkler IN EIGHT WEEKS on 24 Jan 2001."""
date = re.findall(r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}', txtData)
I am not getting 24 Jan 2001, whereas if I run (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4} on its own, I do get the output.
Question 1: What is the bug in the above expression?
Question 2: I want to combine both patterns in a more readable way, since I will have to parse other formats too, so I used join as shown below
RE1 = '(?:\b(?<!\.)[\d{0,2}]+) (?:[/-]\d{0,}[/-]\d{2,4})'
RE2 = '(?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
regex_all = '|'.join([RE1, RE2])
regex_all = re.compile(regex_all)
date = regex_all.findall(txtData)  # txtData can be any one of the strings above
With the joined pattern, I get NaN as the output for date.
Please suggest what the mistake is when I join the patterns.
Thanks for your help.
Note that it is a very bad idea to join such long patterns that also match at the same location within the string. That would cause the regex engine to backtrack too much, and possibly lead to crashes and slowdown. If there is a way to re-write the alternations so that they could only match at different locations, or even get rid of them completely, do it.
Besides, you should use grouping constructs (...) to group sequences of patterns, and only use [...] character classes when you need to match specific chars.
Also, your alternatives overlap, so you can combine them easily. See the fixed regex:
\b(?<!\.)\d{1,2}(?:[/-]\d+[/-]|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b
See the regex demo.
Details
\b - a word boundary
(?<!\.) - no . immediately to the left of the current location
\d{1,2} - 1 or 2 digits
(?: - start of a non-capturing alternation group:
[/-]\d+[/-] - / or -, 1+ digits, - or /
| - or
(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)) - th, st, nd or rd (optionally), followed by 0+ whitespaces, and then month names
\s* - 0+ whitespaces
(?:\d{4}|\d{2}) - 2 or 4 digits
\b - trailing word boundary.
Another note: if you want to match the date-like strings with two matching delimiters, you will need to capture the first one and use a backreference to match the second one; see this regex demo. In Python, you would then need re.finditer to get the whole matches, because re.findall returns only the captured group when the pattern contains one.
See this Python demo:
import re
rx = r"\b(?<!\.)\d{1,2}(?:([/-])\d+\1|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b"
s = "Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox\nfluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical\nReview of Systems Constitutional:\n\nThe patient is a 44 year old married Caucasian woman, unemployed Decorator, living with husband and caring for two young children, who is referred by Capitol Hill Hospital PCP, Dr. Heather Zubia, for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001"
print([x.group(0) for x in re.finditer(rx, s, re.I)])
# => ['7/11/77', '24 Jan 2001']
I think your approach is too complicated. I suggest using a combination of a simple regex and strptime().
import re
from datetime import datetime
date_formats = ['%m/%d/%Y', '%d %b %Y']
pattern = re.compile(r'\b(\d\d?/\d\d?/\d{4}|\d\d? \w{3} \d{4})\b')
data = "... your string ..."
for match in re.findall(pattern, data):
    print("Trying to parse '%s'" % match)
    for fmt in date_formats:
        try:
            date = datetime.strptime(match, fmt)
            print(" OK:", date)
            break
        except ValueError:
            # strptime raises ValueError when the text does not fit this format
            pass
The advantage of this approach is, besides a much more manageable regex, that it won't pick dates that look plausible but do not exist, like 2/29/2001 (whereas 2/29/2004 works).
r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
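You should use raw strings (r'foo') for every string piece, not only the first one. That way backslashes (\) are passed through unchanged to the re library instead of being interpreted as string escapes; for example, '\b' in a normal string is a backspace character, not a word boundary.
[abc|def] matches any single character listed between the [], while (one|two|three) matches any one of the whole expressions (one, two, or three)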
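Both points are easy to verify interactively. The snippet below is a small illustrative sketch with made-up inputs:

```python
import re

# A character class matches ONE character out of the listed set ...
class_hits = re.findall(r'[Jan|Feb]', 'Feb')
# ... while a group with | matches one whole alternative.
group_hits = re.findall(r'(?:Jan|Feb)', 'Feb')

# In a normal string '\b' is a single backspace character; in a raw string it
# stays as backslash + b, which re interprets as a word boundary.
plain_hits = re.findall('\bJan', '24 Jan 2001')
raw_hits = re.findall(r'\bJan', '24 Jan 2001')

print(class_hits, group_hits)
print(len('\b'), len(r'\b'))
print(plain_hits, raw_hits)
```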