I've come up with a regular expression that works well enough for my purposes for finding phone numbers.
I would like to take it a step further and use it on large text blocks to identify matching strings that follow the words 'cell' or 'mobile' by at most 10 characters. It should return the number in Cell Phone: (954) 555-4444 as well as the one in Mobile 555-777-9999, but not the one in Fax: (555) 444-6666.
Something like this (in pseudocode):
regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
bigstring = # Some giant string added together from many globbed files
matches = regex.search(bigstring)
for match in matches:
    if match follows 'cell' or match follows 'mobile':
        print match.group(0)
You can do:
txt='''\
Call me on my mobile anytime: 555-666-1212
The office is best at 555-222-3333
Dont ever call me at 555-666-2345 '''
import re
print(re.findall(r'(?:(mobile|office).{0,15}(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4}))', txt))
Prints:
[('mobile', '555-666-1212'), ('office', '555-222-3333')]
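Adapting this to the requirement in the question (only 'cell' or 'mobile', at most 10 characters in between) could look like the sketch below. The optional opening parenthesis in PHONE and the sample text are assumptions added for illustration; they are not part of the original pattern.

```python
import re

# Phone pattern from the question, plus an optional opening parenthesis
# (an assumption, so that "(954) 555-4444" is returned whole).
PHONE = r'\(?\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4}'

# Keyword, then at most 10 arbitrary characters (lazy), then the captured number.
pattern = re.compile(r'(?:cell|mobile).{0,10}?(' + PHONE + r')', re.I)

# Hypothetical sample text built from the examples in the question.
text = 'Cell Phone: (954) 555-4444, Mobile 555-777-9999, Fax: (555) 444-6666'

print(pattern.findall(text))
# ['(954) 555-4444', '555-777-9999']
```

The dot does not match newlines by default, so a keyword and its number have to be on the same line unless re.S is added.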
You can do that with your regular expression. In the re documentation, you will find that the pattern r'(?<=abc)def' matches 'def' only if it is preceded by 'abc'.
Similarly r'Hello (?=World)' matches 'Hello ' if followed by 'World'
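The two lookaround forms can be tried out with a short sketch like this one. The sample strings are made up for illustration. One caveat: in the standard re module a lookbehind must have a fixed width, so something like (?<=cell.{0,10}) is rejected, which is why a capture group is often used instead for the "within 10 characters" case.

```python
import re

# Lookbehind: 'def' is matched only where 'abc' comes immediately before it.
behind = re.findall(r'(?<=abc)def', 'abcdef xyzdef')
print(behind)   # ['def']

# Lookahead: 'Hello ' is matched only where 'World' comes right after it.
ahead = re.findall(r'Hello (?=World)', 'Hello World, Hello there')
print(ahead)    # ['Hello ']

# A variable-width lookbehind is not accepted by the re module.
variable_width_ok = True
try:
    re.compile(r'(?<=cell.{0,10})\d+')
except re.error as exc:
    variable_width_ok = False
    print('rejected:', exc)
```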
Given a string with the following structure:
" (subsidiary of <holding_company>) <post_>"
where
holding_company may contain letters and some special characters, including brackets
post_ may contain any character
Example string: " google (subsidiary of alphabet (inc.)) xyz"
How can I extract the holding company name using a regex?
The regular expression to extract that is as follows:
"subsidiary of\s+(.*)\)\s+\S+"
In Python code, you'd do something like:
import re

regex = r"subsidiary of\s+(.*)\)\s+\S+"
test_str = "\" (subsidiary of <holding_company>) <post_>\""
m = re.search(regex, test_str)
if m:
    # if it found the pattern, the company name is in group(1)
    print(m.group(1))
See it in action here: https://repl.it/repls/ShyFocusedInstructions#main.py
This should get you there:
(?<=\(subsidiary of)(.*)(?=\) )
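As a quick check, both patterns can be run against the example string from the question. This is only a sketch; the variable names are made up for illustration. Note that the lookaround version keeps the space after "of" inside the match, so it is stripped afterwards.

```python
import re

s = " google (subsidiary of alphabet (inc.)) xyz"

# Greedy capture up to the last ")" that is followed by whitespace and text.
m1 = re.search(r"subsidiary of\s+(.*)\)\s+\S+", s)
print(m1.group(1))           # alphabet (inc.)

# Lookaround version: the leading space is part of the match, so strip it.
m2 = re.search(r"(?<=\(subsidiary of)(.*)(?=\) )", s)
print(m2.group(1).strip())   # alphabet (inc.)
```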
This creates capture groups for your holding company and post.
You may need to expand the regex to include additional special characters.
Here's the regex on regex101 if you need to expand it
https://regex101.com/r/xpVfqU/1
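#!/usr/bin/python3
import re

# Named text rather than str, to avoid shadowing the built-in str type.
text = " (subsidiary of <holding_company>) <post_>"
# Group 1 is the holding company, group 2 is everything after the closing bracket.
pattern = r'\s\(subsidiary\ of\ ([\w<>]*)\)\s*(.*)'
holding_company = re.sub(pattern, r'\1', text)
post = re.sub(pattern, r'\2', text)
print(holding_company)
print(post)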
I am trying to find multiple word matches in a given text. For example:
text = "oracle sql"
regex = "(oracle\\ sql|sql)"
re.findall(regex,text,re.I)
Actual output:
oracle sql
Expected output:
oracle sql,sql
Can anyone tell me what the problem is with this regular expression?
Updated:
@jim it won't work if there are multiple overlapping matches, for example:
re.findall("(?=(spark|spark sql|sql))","spark sql",re.I)
Actual output:
['spark','sql']
Expected output:
['spark','sql','spark sql']
Note: in the above case, if both words are matched individually, the combination of the words is not matched.
Updated:
Check link: repl.it/repls/NewFaithfulMath
You don't need to escape whitespace.
import re
text = "oracle sql"
regex = "(oracle sql|sql)"
print(re.findall(regex, text, re.I))
From the documentation:
Return all non-overlapping matches of pattern in string, as a list of
strings.
Here 'sql' lies inside the text already consumed by 'oracle sql', so it counts as an overlapping match and is not returned.
Returning overlapping matches
You can use a lookahead to capture the string you're looking for. Because a lookahead is zero-width, the matches themselves consume no characters, so they never overlap even when the captured strings do.
import re
text = "oracle sql"
regex = "(?=(oracle sql|sql))"
print(re.findall(regex, text, re.I))
Output:
['oracle sql', 'sql']
The downside of this implementation is that it finds at most one match starting at any given position in the string, because the alternation stops at the first alternative that matches there.
For example, (my test|my|test) applied to "my test" will only find ['my test', 'test'].
You could also use a replacement for the re module that supports overlapping matches, such as the third-party regex module, but this will still only find ['my test', 'test'] with the pattern (my test|my|test):
import regex as re
text = "oracle sql"
regex = "(oracle sql|sql)"
print(re.findall(regex, text, re.I, overlapped=True))
Recursion
A regex finds at most one match per starting position. It has already matched "oracle sql" at the first character, so you can't also get a match on just "oracle" there. A single pass can't find every item.
However, you could use a recursive function that searches the same string again with all of the items minus those that have already been matched.
I am not sure how performant this code will be, as it could execute a lot of regex searches.
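import re

def find_all_matches(text, items):
    # Escape the items so that any regex metacharacters are matched literally.
    regex_items = '|'.join(re.escape(i) for i in items)
    regex = "(?=({}))".format(regex_items)
    matches = re.findall(regex, text, re.I)
    if not matches:
        # None of the remaining items occur; stop instead of recursing forever.
        return matches
    # Compare case-insensitively, since the search itself ignores case.
    found = set(m.lower() for m in matches)
    new_items = [i for i in items if i.lower() not in found]
    if new_items:
        return matches + find_all_matches(text, new_items)
    return matches

print(find_all_matches("oracle sql", ['oracle sql', 'oracle', 'sql']))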
Output:
['oracle sql', 'sql', 'oracle']
No regex
Lastly, you could implement this without regex. Again, I haven't looked at the performance of this.
def find_all_matches(text, items):
    return [i for i in items if i in text]

print(find_all_matches("oracle sql", ['oracle sql', 'oracle', 'sql']))
Output:
['oracle sql', 'oracle', 'sql']
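If the positions of the matches matter as well, one possible variation is to search for each item separately and record where it occurs. The sketch below is an illustration rather than part of the answers above; find_all_occurrences is a made-up name.

```python
import re

def find_all_occurrences(text, items):
    """Return (item, start) for every occurrence of every item in text."""
    results = []
    for item in items:
        # The lookahead keeps each match zero-width, so repeated or
        # self-overlapping occurrences of the same item are all reported.
        for m in re.finditer('(?=' + re.escape(item) + ')', text, re.I):
            results.append((item, m.start()))
    return results

print(find_all_occurrences("spark sql", ["spark", "spark sql", "sql"]))
# [('spark', 0), ('spark sql', 0), ('sql', 6)]
```

Each item gets its own search, so 'spark' and 'spark sql' can both be reported at position 0, at the cost of one pass over the text per item.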
I have the following regex:
r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
When I apply this to a text string such as
"this is www.website1.com and this is website2.com", I get:
['www.website1.com']
['website2.com']
How can I modify the regex to exclude the 'www', so that I get 'website1.com' and 'website2.com'? I'm missing something pretty basic ...
Try this one (thanks @SunDeep for the update):
\s(?:www\.)?(\w+\.com)
Explanation
\s matches any whitespace character
(?:www\.)? non-capturing group, matches www. zero or one time
(\w+\.com) capturing group, matches one or more word characters followed by .com
And in action:
import re
s = 'this is www.website1.com and this is website2.com'
matches = re.findall(r'\s(?:www\.)?(\w+\.com)', s)
print(matches)
Output:
['website1.com', 'website2.com']
A couple notes about this. First of all, matching all valid domain names is very difficult to do, so while I chose to use \w+ to capture for this example, I could have chosen something like: [a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}.
This answer has a lot of helpful info about matching domains:
What is a regular expression which will match a valid domain name without a subdomain?
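Next, I only look for .com domains. You could adjust my regular expression to something like the following (the inner group is non-capturing so that findall still returns the whole domain):
\s(?:www\.)?(\w+\.(?:com|org|net))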
To match whichever types of domains you were looking for.
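A minimal sketch of the multi-TLD variant follows. The sample string, the (?:^|\s) prefix (so a domain at the very start of the text is also found) and the trailing \b are assumptions added for illustration.

```python
import re

s = 'this is www.website1.com and this is website2.org'

# The TLD alternatives are in a non-capturing group, so findall returns
# only the one capturing group: the domain without the optional "www.".
pattern = r'(?:^|\s)(?:www\.)?(\w+\.(?:com|org|net))\b'

domains = re.findall(pattern, s)
print(domains)
# ['website1.com', 'website2.org']
```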
Here is an attempt:
import re
s = "www.website1.com"
k = re.findall(r'(www\.)?(.*?)$', s, re.DOTALL)[0][1]
print(k)
Output:
website1.com
If s = "website1.com", the output is the same:
website1.com
I am trying to write a Python script that reads a text file input.txt, scans it for phone numbers, and writes all matching phone numbers to output.txt.
Let's say the text file looks like this:
Hey my number is 1234567890 and another number is +91-1234567890. but if none of these is available you can call me on +91 5645454545 (or) mail me at abc@xyz.com
It should match 1234567890, +91-1234567890 and +91 5645454545.
import re
no = '^(\+[1-9]\d{0,2}[- ]?)?[1-9][0-9]{9}' #i think problem is here
f2 = open('output.txt','w+')
for line in open('input.txt'):
    out = re.findall(no,line)
    for i in out :
        f2.write(i + '\n')
The regexp no works like this: an optional country code of up to 3 digits, followed by an optional - or space, and then a 10-digit number.
Yes, the problem is with your regex. Fortunately, it's a small one. You just need to remove the ^ character:
'(\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}'
The ^ signifies that you want to match only at the beginning of the string. You want to match multiple times throughout the string. Here's a 101demo.
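The effect of the anchor can be seen in a small sketch. The sample line here is made up for illustration and contains two plain 10-digit numbers.

```python
import re

line = '1234567890 and 9876543210'

# With ^ the pattern can only match at the very start of the string.
anchored = re.findall(r'^[1-9]\d{9}', line)
print(anchored)     # ['1234567890']

# Without ^ every occurrence in the string is found.
unanchored = re.findall(r'[1-9]\d{9}', line)
print(unanchored)   # ['1234567890', '9876543210']
```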
For python, you'll need to specify a non-capturing group as well with ?:. Otherwise, re.findall does not return the complete match:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups.
Bold emphasis mine. Here's a relevant question.
This is what you get when you specify non-capturing groups for your problem:
In [485]: re.findall('(?:\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}', text)
Out[485]: ['1234567890', '+91-1234567890', '+91 5645454545']
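To see the difference side by side, here is a small sketch comparing the capturing and non-capturing versions. The text variable is a shortened version of the sample input and is an assumption for illustration.

```python
import re

text = ('Hey my number is 1234567890 and another number is +91-1234567890. '
        'You can call me on +91 5645454545')

capturing = r'(\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}'
non_capturing = r'(?:\+[1-9]\d{0,2}[- ]?)?[1-9]\d{9}'

# With a capturing group, findall returns only what the group matched
# (an empty string where the optional group did not take part).
groups_only = re.findall(capturing, text)
print(groups_only)     # ['', '+91-', '+91 ']

# With a non-capturing group, findall returns the whole match.
whole_matches = re.findall(non_capturing, text)
print(whole_matches)   # ['1234567890', '+91-1234567890', '+91 5645454545']
```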
This code will work:
import re

# Non-capturing group, so that findall returns the whole match.
no = r'(?:\+[1-9]\d{0,2}[- ]?)?[1-9][0-9]{9}'
# The with block closes both files when the loop is done.
with open('input.txt') as f1, open('output.txt', 'w') as f2:
    for line in f1:
        for i in re.findall(no, line):
            f2.write(i + '\n')
The output will be:
1234567890
+91-1234567890
+91 5645454545
You can use:
(?:\+[1-9]\d{1,2}-?)?\s?[1-9][0-9]{9}
import re
pattern = r'\d{10}|\+\d{2}[- ]+\d{10}'
matches = re.findall(pattern, text)  # text holds the contents of input.txt
Output:
['1234567890', '+91-1234567890', '+91 5645454545']
I need to extract people's names from a really long string.
Their names are in this format: LAST, FIRST.
Some of these people have hyphenated names. Some don't.
My attempt with a smaller string:
Input:
import re
text = 'Smith-Jones, Robert&Epson, Robert'
pattern = r'[A-Za-z]+(-[A-Za-z]+)?,\sRobert'
print(re.findall(pattern, text))
Expected output:
['Smith-Jones, Robert', 'Epson, Robert']
Actual output:
['-Jones', '']
What am I doing wrong?
Use
import re
text = 'Smith-Jones, Robert&Epson, Robert'
pattern = r'[A-Za-z]+(?:-[A-Za-z]+)?,\sRobert'
print(re.findall(pattern, text))
# => ['Smith-Jones, Robert', 'Epson, Robert']
Just make the capturing group non-capturing. The thing is that findall returns the capture group values if any groups are specified in the regex pattern. So, the best way to solve this in this pattern is just to replace (...)? with (?:...)?.
See IDEONE demo
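If the capturing group has to stay for some other reason, an alternative sketch is to use re.finditer and read group(0), which is always the whole match regardless of any groups in the pattern.

```python
import re

text = 'Smith-Jones, Robert&Epson, Robert'
# Original pattern from the question, capturing group left in place.
pattern = r'[A-Za-z]+(-[A-Za-z]+)?,\sRobert'

# group(0) is the full match, so the optional group does not get in the way.
names = [m.group(0) for m in re.finditer(pattern, text)]
print(names)
# ['Smith-Jones, Robert', 'Epson, Robert']
```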