Regex that matches all German and Austrian mobile phone numbers - python

I need a Python regex which matches to mobile phone numbers from Germany and Austria.
In order to do so, we first have to understand the structure of a phone number:
a mobile number can be written with a country calling code in the beginning. However, this code is optional!
if we use the country calling code the trunk prefix is redundant!
The prefix is composed out of the trunk prefix and the company code
The prefix is followed by an individual and unique number with 7 or 8 digits, respectivley.
List of German prefixes:
0151, 0160, 0170, 0171, 0175, 0152, 0162, 0172, 0173, 0174, 0155, 0157, 0159, 0163, 0176, 0177, 0178, 0179, 0164, 0168, 0169
List of Austrian prefixes:
0664, 0680, 0688, 0681, 0699, 0664, 0667, 0650, 0678, 0650, 0677, 0676, 0660, 0699, 0690, 0665, 0686, 0670
Now that we know all rules to build a regex, we have to consider, that humans sometimes write numbers in a very strange ways with multiple whitespaces, / or (). For example:
0176 98 600 18 9
+49 17698600189
+(49) 17698600189
0176/98600189
0176 / 98600189
many more ways to write the same number
I am looking for a Python regex which can match all Austian and German mobile numbers.
What I have so far is this:
^(?:\+4[39]|004[39]|0|\+\(49\)|\(\+49\))\s?(?=(?:[^\d\n]*\d){10,11}(?!\d))(\()?[19][1567]\d{1,2}(?(1)\))\s?\d(?:[ /-]?\d)+

You can use
(?x)^ # Free spacing mode on and start of string
(?: # A container group:
(\+49|0049|\+\(49\)|\(\+49\))? [ ()\/-]* # German: country code
(?(1)|0)1(?:5[12579]|6[023489]|7[0-9]) # trunk prefix and company code
| # or
(\+43|0043|\+\(43\)|\(\+43\))? [ ()\/-]* # Austrian: country code
(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])) # trunk prefix and company code
)
[ ()\/-]* # zero or more spaces, parens, / and -
\d(?:[ \/-]*\d){6,7} # a digit and then six or seven occurrences of space, / or - and a digit
\s* # zero or more whites
$ # end of string
See the regex demo.
A one-line version of the pattern is
^(?:(\+49|0049|\+\(49\)|\(\+49\))?[ ()\/-]*(?(1)|0)1(?:5[12579]|6[023489]|7[0-9])|(\+43|0043|\+\(43\)|\(\+43\))?[ ()\/-]*(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])))[ ()\/-]*\d(?:[ \/-]*\d){6,7}\s*$
See this demo.
How to create company code regex
Go to the Optimize long lists of fixed string alternatives in regex
Click the Run code snippet button at the bottom of the answer to run the last code snippet
Re-size the input box if you wish
Get the list of your supported numbers, either comma or linebreak separated and paste it into the field
Click Generate button, and grab the pattern that will appear below.

Related

How to extract all comma delimited numbers inside () bracked and ignore any text

I am trying to extract the comma delimited numbers inside () brackets from a string. I can get the numbers if that are alone in a line. But i cant seem to find a solution to get the numbers when other surrounding text is involved. Any help will be appreciated. Below is the code that I current use in python.
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
line = each.strip()
regex_criteria = r'"^([1-9][0-9]*|\([1-9][0-9]*\}|\(([1-9][0-9]*,?)+[1-9][0-9]*\))$"gm'
if (line.__contains__('(') and line.__contains__(')') and not re.search('[a-zA-Z]', refline)):
refline = line[line.find('(')+1:line.find(')')]
if not re.search('[a-zA-Z]', refline):
Remove the ^, $ is whats preventing you from getting all the numbers. And gm flags wont work in python re.
You can change your regex to :([1-9][0-9]*|\([1-9][0-9]*\}|\(?:([1-9][0-9]*,?)+[1-9][0-9]*\)) if you want to get each number separately.
Or you can simplify your pattern to (?<=[(,])[1-9][0-9]+(?=[,)])
Test regex here: https://regex101.com/r/RlGwve/1
Python code:
import re
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
print(re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', line))
# ['101065', '101066', '101067', '101065']
(?<=[(,])[1-9][0-9]+(?=[,)])
The above pattern tells to match numbers which begin with 1-9 followed by one or more digits, only if the numbers begin with or end with either comma or brackets.
Here's another option:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)*(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?<=\(): Lookbehind for (
[1-9]+\d*: At least one number (would \d+ work too?)
(?:,[1-9]\d*)*: Zero or multiple numbers after a ,
(?=\)): Lookahead for )
Result for your line:
[['101065', '101066', '101067'], ['101065']]
If you only want the comma separated numbers:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)+(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?:,[1-9]\d*)+: One or more numbers after a ,
Result:
[['101065', '101066', '101067']]
Now, if your line could also look like
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
then you have to sprinkle the pattern with \s* and remove the whitespace afterwards (here with str.translate and str.maketrans):
pattern = re.compile(r"(?<=\()\s*[1-9]+\d*(?:\s*,\s*[1-9]\d*\s*)*(?=\))")
table = str.maketrans("", "", " ")
results = [match[0].translate(table).split(",") for match in pattern.finditer(line)]
Result:
[['101065', '101066', '101067'], ['101065']]
Using the pypi regex module you could also use capture groups:
\((?P<num>\d+)(?:,(?P<num>\d+))*\)
The pattern matches:
\( Match (
(?P<num>\d+) Capture group, match 1+ digits
(?:,(?P<num>\d+))* Optionally repeat matching , and 1+ digits in a capture group
\) Match )
Regex demo | Python demo
Example code
import regex
pattern = r"\((?P<num>\d+)(?:,(?P<num>\d+))*\)"
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
matches = regex.finditer(pattern, line)
for _, m in enumerate(matches, start=1):
print(m.capturesdict())
Output
{'num': ['101065', '101066', '101067']}
{'num': ['101065']}

Regular expression in Python, 2-3 numbers then 2 letters

I am trying to do autodetection of bra size in a list of clothes. While I managed to extract only the bra items, I am now looking at extracting the size information and I think I am almost there (thanks to the stackoverflow community). However, there is a particular case that I could not find on another post.
I am using:
regexp = re.compile(r' \d{2,3} ?[a-fA-F]([^bce-zBCE-Z]|$)')
So
Possible white space if not at the beginning of the description
two or three numbers
Another possible white space or not
Any letters (lower or upper case) between A and F
and then another letter for the two special case AA and FF or the end of the string.
My question is, is there a way to have the second letter to be a match of the first letter (AA or FF) because in my case, my code output me some BA and CA size which are not existing
Examples:
Not working:
"bh sexig top matchande h&m genomskinligt parti svart detaljer 42 basic plain" return "42 ba" instead of not found
"puma, sport-bh, strl: 34cd, svart/grĂ¥", I guess the customer meant c/d
Working fine:
"victoria's secret, bh, strl: 32c, gul/vit" returns "32 c"
"pink victorias secret bh 75dd burgundy" returns "75 dd"
Thanks!
You might use
\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])
Explanation
\d{2,3} ? Match a space, 2-3 digits and optional space
([a-fA-F])\1? Capture a-fA-F in group 1 followed by an optional backreference to group 1
(?![a-fA-F]) Negative lookahead, assert what is on the right is not a-fA-F
Regex demo

regex template for Ukrainian phone numbers

I am reading 'Automate the boring stuff with python'. Right now, I'm stuck on chapter 7 (Regex part). There is a template for American phone numbers, which I want to implement for Ukrainian phone numbers.
Ukrainian numbers can appear in the different formats, such as : +380445371428, +38(044)5371428, +38(044)537 14 28, +38(044)537-14-28, +38(044) 537.14.28, 044.537.14.28, 0445371428, 044-537-1428, (044)537-1428, 044 537-1428, etc.
Following, is my implementation, but it's not quite correct. What do I need?
When I'm copying some website pages, from all of the info I have copied, I want to extract the Ukrainian number appearing in this (044-537-1428) format.
phoneRegex = re.compile(r'''(
(^\+38?) # area code(not necessarily)
(\d{3}|\(\d{3}\)) # carrier code(usually starts with 0
(\s|-|\.)? # separator
(\d{3}|\(\d{3}\)) # first 3 digits
(\s|-|\.) # separator
(\d{4}) # last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))? # extension
)''', re.VERBOSE)
template for American number (according to the book) looks like the following
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?
(\s|-|\.)
(\d{3})
(\s|-|\.)
(\d{4})
(\s*(ext|x|ext.)\s*(\d{2,5}))?
)''', re.VERBOSE)
Maybe, an option would be to incorporate alternation, based on the types of patterns that we might have, such as:
^(?:\+38)?(?:\(044\)[ .-]?[0-9]{3}[ .-]?[0-9]{2}[ .-]?[0-9]{2}|044[ .-]?[0-9]{3}[ .-]?[0-9]{2}[ .-]?[0-9]{2}|044[0-9]{7})$
Or even more restricted than that, if we'd be validating.
Demo
Test
import re
regex = r'^(?:\+38)?(?:\(044\)[ .-]?[0-9]{3}[ .-]?[0-9]{2}[ .-]?[0-9]{2}|044[ .-]?[0-9]{3}[ .-]?[0-9]{2}[ .-]?[0-9]{2}|044[0-9]{7})$'
string = '''
+380445371428
+38(044)5371428
+38(044)537 14 28
+38(044)537-14-28
+38(044) 537.14.28
044.537.14.28
0445371428
044-537-1428
(044)537-1428
044 537-1428
+83(044)537 14 28
088 537-1428
'''
print(re.findall(regex, string, re.M))
Output
['+380445371428', '+38(044)5371428', '+38(044)537 14 28',
'+38(044)537-14-28', '+38(044) 537.14.28', '044.537.14.28',
'0445371428', '044-537-1428', '(044)537-1428', '044 537-1428']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:
Here is my regex for all Ukrainian numbers:
^\+?3?8?(0[\s\.-]\d{2}[\s\.-]\d{3}[\s\.-]\d{2}[\s\.-]\d{2})$
This allows:
+380 XX XXX XX XX or
+380-XX-XXX-XX-XX
or same without "plus"
I am not familiar with python but I think following regex would resolve your problem
((\+38)?\(?\d{3}\)?[\s\.-]?(\d{7}|\d{3}[\s\.-]\d{2}[\s\.-]\d{2}|\d{3}-\d{4}))
you can check it working here

Python RegEx for Australian Phone Numbers - False negative - 2 matches in the same substring

I am trying to extract phone numbers from a web page using Python & RegEx
Australian number format
+61 (international code - shown below as 'i')
02, 03, 07 or 08 (state codes - shown below as 's')
1234-5678 (8 digit local number - shown below as 'x')
Common variations of format (in order of commonality):
Format 1: ss xxxx xxxx (e.g. 02 1234 5678)
Format 2: +ii s xxxx xxxx (e.g. +61 2 1234 5678) (note the first 's' digit is removed here)
Format 3: (seen rarely) +ii (s)s xxxx-xxxx (e.g. +61 (0)2 1234 5678
My RegEx:
re.findall(r'[0][2]\d{8}|[0][3]\d{8}|[0][7]\d{8}|[0][8]\d{8}|[6][1][2]\d{8}|[6][1][3]\d{8}|[6][1][7]\d{8}|[6][1][8]\d{8}|[0][4]\d{8}|[6][1][4]\d{8}|[1][3][0][0]\d{6}|[1][8][0][0]\d{6}', re.sub(r'\W+', '', sample_text))
works well on a simple sample_text:
sample_text =
"610212345678ABC##610312345678ABC##610712345678ABC##610812345678ABC##0212345678ABC##0312345678ABC##0712345678ABC##0812345678ABC##61212345678ABC##61312345678ABC##61712345678ABC##61812345678ABC##0412345678ABC##61412345678ABC##130012345678ABC##180012345678ABC##"
Result:
['0212345678', '0312345678', '0712345678', '0812345678',
'0212345678', '0312345678', '0712345678', '0812345678',
'61212345678', '61312345678', '61712345678', '61812345678',
'0412345678', '61412345678', '1300123456', '1800123456']
The Goal
Using http://www.outware.com.au/contact as an example ...
The 2 actual numbers on the page are:
+61 (0)3 8684 9912 and +61 (0)2 8064 7043 (both numbers appear twice - once in the main section of the page and once in the footer)
The Problem
#take HTML markup from body tags
b = driver.find_element_by_css_selector('body').text
#remove all non-alpha + white space.
b = re.sub(r'\W+', '', b)
Result:
"PORTFOLIOINNOVATIONSERVICESCAREERSINSIGHTSNEWSABOUTCONTACTCONTACTOUTWAREMelbourneLe......AFRFast100Nov92017EXPLOREOUTWAREPortfolioInnovationWorkingatOutwareAboutSitemapCONNECTMELBOURNELevel3469LaTrobeStMelbourneVIC3000610386849912SYDNEYLevel41SmailStUltimoNSW2007610280647043"
Now if I apply my regex to this string
re.findall(r'[0][2]\d{8}|[0][3]\d{8}|[0][7]\d{8}|[0][8]\d{8}|[6][1][2]\d{8}|[6][1][3]\d{8}|[6][1][7]\d{8}|[6][1][8]\d{8}|[0][4]\d{8}|[6][1][4]\d{8}|[1][3][0][0]\d{6}|[1][8][0][0]\d{6}', re.sub(r'\W+', '', b))
Result:
[u'0386849912', u'0761028064', u'0386849912', u'0761028064']
I am getting a false positive because I have concatenated a postcode "NSW2007" onto the start of the phone number.
I presume because the regex has parsed the first part of "NSW2007610280647043" matching "0761028064" it doesn't then match "0280647043" which is also part of the same substring
I actually don't mind the false positive (i.e. getting "0761028064") but I do need to solve the false negative (i.e. not getting "0280647043")
I know there's some RegEx gurus here who can help on this. :-)
Please help!!
Don't search/replace any text prior to using the regex. That will make your input unusable. Try this:
(?:(?:\+?61 )?(?:0|\(0\))?)?[2378] \d{4}[ -]?\d{4}
https://regex101.com/r/1Q4HuD/3
It might help if you use a negative look ahead to check to see make sure the following character is not a number. For example: (?!\d).
This could create a problem though if some data following a phone number starts with a number.
The look behind looks like this when implemented in your regex:
(02\d{8}|03\d{8}|07\d{8}|08\d{8}|612\d{8}|613\d{8}|617\d{8}|618\d{8}|04\d{8}|614\d{8}|1300\d{6}|1800\d{6})(?!\d)
(I removed the square brackets as you do not need them when trying to match a single character)
This answer should be a comment, it isn't because of my low reputation!
I've seen you're updating the regex and I think this variation can help you. It should match very uncommon formats!
(\+61 )?(?:0|\(0\))?[2378] (?:[\s-]*\d){8}

Python regex string groups capture

I have a number of medical reports from each which i am trying to capture 6 groups (groups 5 and 6 are optional):
(clinical details | clinical indication) + (text1) + (result|report) + (text2) + (interpretation|conclusion) + (text3).
The regex I am using is:
reportPat=re.compile(r'(Clinical details|indication)(.*?)(result|description|report)(.*?)(Interpretation|conclusion)(.*)',re.IGNORECASE|re.DOTALL)
works except on strings missing the optional groups on whom it fails.i have tried putting a question mark after group5 like so: (Interpretation|conclusion)?(.*) but then this group gets merged into group4. I am pasting two conflicting strings (one containing group 5/6 and the other without it) for people to test their regex. thanks for helping
text 1 (all groups present)
Technical Report:\nAdministrations:\n1.04 ml of Fluorine 18, fluorodeoxyglucose with aco - Bronchus and lung\nJA - Staging\n\nClinical Details:\nSquamous cell lung cancer, histology confirmed ?stage\nResult:\nAn FDG scan was acquired from skull base to upper thighs together with a low dose CT scan for attenuation correction and image fusion. \n\nThere is a large mass noted in the left upper lobe proximally, with lower grade uptake within a collapsed left upper lobe. This lesi\n\nInterpretation: \nThe scan findings are in keeping with the known lung primary in the left upper lobe and involvement of the lymph nodes as dThere is no evidence of distant metastatic disease.
text 2 (without group 5 and 6)
Technical Report:\nAdministrations:\n0.81 ml of Fluorine 18, fluorodeoxyglucose with activity 312.79\nScanner: 3D Static\nPatient Position: Supine, Head First. Arms up\n\n\nDiagnosis Codes:\n- Bronchus and lung\nJA - Staging\n\nClinical Indication:\nNewly diagnosed primary lung cancer with cranial metastasis. PET scan to assess any further metastatic disease.\n\nScanner DST 3D\n\nSession 1 - \n\n.\n\nResult:\nAn FDG scan was acquired from skull base to upper thighs together with a low dose CT scan for attenuation correction and image fusion.\n\nThere is increased FDG uptake in the right lower lobe mass abutting the medial and posterior pleura with central necrosis (maximum SUV 18.2). small nodule at the right paracolic gutte
It seems like that what you were missing is basically an end of pattern match to fool the greedy matches when combining with the optional presence of the group 5 & 6. This regexp should do the trick, maintaining your current group numbering:
reportPat=re.compile(
r'(Clinical details|indication)(.*)'
r'(result|description|report)(.*?)'
r'(?:(Interpretation|conclusion)(.*))?$',
re.IGNORECASE|re.DOTALL)
Changes done are adding the $ to the end, and enclosing the two last groups in a optional non-capturing group, (?: ... )?. Also note how you easily can make the entire regexp more readable by splitting the lines (which the interpreter will autoconnect when compiling).
Added: When reviewing the result of the matches I saw some :\n or : \n, which can easily be cleaned up by adding (?:[:\s]*)? inbetween the header and text groups. This is an optional non-capturing group of colons and whitespace. Your regexp does then look like this:
reportPat=re.compile(
r'(Clinical details|indication)(?:[:\s]*)?(.*)'
r'(result|description|report)(?:[:\s]*)?(.*?)'
r'(?:(Interpretation|conclusion)(?:[:\s]*)?(.*))?$',
re.IGNORECASE|re.DOTALL)
Added 2: At this link: https://regex101.com/r/gU9eV7/3, you can see the regex in action. I've also added some unit test cases to verify that it works against both texts, and that in for text1 it has a match for text1, and that for text2 it has nothing. I used this parallell to direct editing in a python script to verify my answer.
The following pattern works for both your test cases though given the format of the data you're having to parse I wouldn't be confident that the pattern will work for all cases (for example I've added : after each of the keyword matches to try to prevent inadvertent matches against more common words like result or description):
re.compile(
r'(Clinical details|indication):(.+?)(result|description|report):(.+?)((Interpretation|conclusion):(.+?)){0,1}\Z',
re.IGNORECASE|re.DOTALL
)
I grouped the last 2 groups and marked them as optional using {0,1}. This means the output groups will vary a little from your original pattern (you'll have an extra group, the 4th group will now contain the output of both the last 2 groups and the data for the last 2 groups will be in groups 5 and 6).

Categories

Resources