I need to extract people's names from a really long string.
Their names are in this format: LAST, FIRST.
Some of these people have hyphenated names. Some don't.
My attempt with a smaller string:
Input:
import re
text = 'Smith-Jones, Robert&Epson, Robert'
pattern = r'[A-Za-z]+(-[A-Za-z]+)?,\sRobert'
print(re.findall(pattern, text))
Expected output:
['Smith-Jones, Robert', 'Epson, Robert']
Actual output:
['-Jones', '']
What am I doing wrong?
Use
import re
text = 'Smith-Jones, Robert&Epson, Robert'
pattern = r'[A-Za-z]+(?:-[A-Za-z]+)?,\sRobert'
print(re.findall(pattern, text))
# => ['Smith-Jones, Robert', 'Epson, Robert']
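Just make the capturing group non-capturing. When the pattern contains a capturing group, findall returns the text captured by that group rather than the whole match. With a single group, that gives '-Jones' for the first name and an empty string for the second, where the optional group did not take part in the match. So the simplest fix for this pattern is to replace (...)? with (?:...)?.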
See IDEONE demo
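As a rough illustration of how capture groups change what findall returns, here is a small sketch. The group-free pattern and the finditer variant are my own additions for comparison, not part of the answer above.

```python
import re

text = 'Smith-Jones, Robert&Epson, Robert'

# No groups: findall returns the full matches (the hyphenated part is
# simply not covered by this shorter pattern).
no_groups = re.findall(r'[A-Za-z]+,\sRobert', text)

# One capturing group: findall returns only that group's text.
one_group = re.findall(r'[A-Za-z]+(-[A-Za-z]+)?,\sRobert', text)

# finditer yields match objects, so group(0) gives the full match
# even when the pattern keeps its capturing group.
full = [m.group(0)
        for m in re.finditer(r'[A-Za-z]+(-[A-Za-z]+)?,\sRobert', text)]

print(no_groups)
print(one_group)
print(full)
```

If you need both the whole match and the group contents, finditer is probably the more convenient choice.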
Related
Given a string with the following structure:
" (subsidiary of <holding_company>) <post_>"
where
holding_company may contain letters and some special characters, including brackets
post_ may contain any character
Example string: " google (subsidiary of alphabet (inc.)) xyz"
How to extract the holding company name using regex?
The regular expression to extract that is as follows:
"subsidiary of\s+(.*)\)\s+\S+"
In Python, you'd do something like:
import re
regex = r"subsidiary of\s+(.*)\)\s+\S+"
test_str = "\" (subsidiary of <holding_company>) <post_>\""
m = re.search(regex, test_str)
if m:
    # if the pattern was found, the company name is in group(1)
    print(m.group(1))
See it in action here: https://repl.it/repls/ShyFocusedInstructions#main.py
This should get you there:
(?<=\(subsidiary of)(.*)(?=\) )
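A quick sketch of how this lookaround pattern behaves on the example string from the question. The second pattern, with the space moved into the lookbehind, is an assumed variant of mine to avoid the leading space in the result.

```python
import re

text = " google (subsidiary of alphabet (inc.)) xyz"

# Lookbehind as posted: the space after "of" ends up in the result.
as_posted = re.search(r'(?<=\(subsidiary of)(.*)(?=\) )', text)

# Assumed variant: the trailing space is part of the lookbehind instead.
trimmed = re.search(r'(?<=\(subsidiary of )(.*)(?=\) )', text)

print(repr(as_posted.group(1)))
print(repr(trimmed.group(1)))
```

Both versions rely on the greedy .* backing off to the last ") " in the string, so a post that itself contains ") " would be pulled into the company name.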
This creates capture groups for your holding company and post.
You may need to expand the regex to include additional special characters.
Here's the regex on regex101 if you need to expand it
https://regex101.com/r/xpVfqU/1
#!/usr/bin/python3
import re
# named text rather than str, so the built-in str type is not shadowed
text = " (subsidiary of <holding_company>) <post_>"
pattern = r'\s\(subsidiary\ of\ ([\w<>]*)\)\s*(.*)'
# group 1 is the holding company, group 2 is the post
holding_company = re.sub(pattern, r'\1', text)
post = re.sub(pattern, r'\2', text)
print(holding_company)
print(post)
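If running the same substitution twice feels wasteful, one possible alternative is a single search with named groups. This is only a sketch: the helper name split_subsidiary and the group names are hypothetical, and the company part uses .* instead of [\w<>]* on the assumption that brackets and dots should be allowed.

```python
import re

# Greedy .* lets the company name contain ")" as long as the final
# ")" before the post is followed by whitespace.
PATTERN = re.compile(r'\(subsidiary of (?P<holding>.*)\)\s+(?P<post>.*)')

def split_subsidiary(text):
    """Return (holding_company, post), or None if the text does not match."""
    m = PATTERN.search(text)
    return (m.group('holding'), m.group('post')) if m else None

print(split_subsidiary(" google (subsidiary of alphabet (inc.)) xyz"))
print(split_subsidiary(" (subsidiary of <holding_company>) <post_>"))
```

Because the company part is greedy, a post that contains ")" followed by whitespace would be partly absorbed into the company name, so this is best treated as a starting point.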
I'm supposed to extract groups of text from a file with a top ten list: name, rank, etc. for each. You can see the file and the regex here https://regex101.com/r/fXK5YV/1. It works in there and you can see the capturing groups.
import re
pattern = '''
(?P<list><li\sclass="regular-search-result">(.|\n)*?(?<=\<span class=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span class=\"review-count rating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.match(yelp_html)
This prints None.
There is definitely text inside of yelp_html.
What am I doing wrong?
I see a few issues:
You're not using a raw string (prefix the string with an r), which means Python interprets escapes such as \n before the regex engine sees them. With re.VERBOSE, the resulting literal newline is then discarded as insignificant whitespace.
re.VERBOSE ignores whitespace in the pattern (outside character classes), so the literal spaces in parts such as <span class="indexed-biz-name"> are dropped, which is not how the regex behaves in your link. Write them as \s instead.
match only matches at the start of the string. Use search or finditer to find the list items anywhere in the HTML.
import re
pattern = r'''
(?P<list><li\sclass=\"regular-search-result\">(.|\n)*?(?<=\<span\sclass=\"indexed-biz-name\"\>)
(?P<rank>\d{1,2})
(.|\n)*?\<span\>
(?P<name>.+)
\<\/span\>(.|\n)*?alt=\"
(?P<stars>\d\.\d)
\sstar\srating\"(.|\n)*?\<span\sclass=\"review-count\srating-qualifier\"\>(\s|\t|\n)*?
(?P<numrevs>\d{1,7})
(.|\n)*?\<span\sclass=\"business-attribute\sprice-range\">
(?P<price>\${1,6})
\<\/span\>(.|\n)*?<\/li>)
'''
pattern_matcher = re.compile(pattern, re.VERBOSE)
matches = pattern_matcher.finditer(yelp_html)
for item in matches:
    print(item.group('rank', 'name', 'stars', 'numrevs', 'price'))
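To see these effects in isolation, here is a minimal sketch on a made-up HTML fragment. The html string and the short patterns are invented for illustration and are not taken from the Yelp page.

```python
import re

html = 'x <span class="rank">7</span>'

# VERBOSE drops the unescaped space, so this looks for '<spanclass=...'.
loose = re.compile(r'<span class="rank">(\d+)', re.VERBOSE)
# Writing the space as \s keeps it significant.
strict = re.compile(r'<span\sclass="rank">(\d+)', re.VERBOSE)

print(loose.search(html))            # no match
print(strict.search(html).group(1))  # the rank digits
print(strict.match(html))            # no match: the tag is not at position 0

# Without the r prefix, Python turns \n into a real newline, which
# VERBOSE then ignores; the raw version reaches the regex engine intact.
cooked = re.compile('a(.|\n)b', re.VERBOSE)
raw = re.compile(r'a(.|\n)b', re.VERBOSE)
print(cooked.search('a\nb'))
print(raw.search('a\nb') is not None)
```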
I'm new to regular expressions. I've read the documentation but I still have some questions.
I have the following string:
[('15000042', 19)]
And I need to get the key, the comma and the value as a string.
like this:
15000042,19
I need this to enter these value as a comma separated value in a database.
I've tried the following regular expression:
([\w,]+)
but this only splits the string into 3 substrings. Is there a way to get the full match?
https://regex101.com/r/vtYKOG/1
I'm using python
Match the parts you don't want to keep, capture the parts you do want in 3 groups instead of 1, and assemble your value from these 3 groups:
\[\('(\d+)'(,) (\d+)\)\]
Regex demo
For example:
import re
test_str = "[('15000042', 19)]"
result = re.sub(r"\[\('(\d+)'(,) (\d+)\)\]", r"\1\2\3", test_str)
if result:
    print(result)
Result
15000042,19
Another option is to negate your character class, [^\w,]+, so that it matches everything that is not a word character or a comma.
Then replace those characters with an empty string:
import re
test_str = "[('15000042', 19)]"
result = re.sub(r"[^\w,]+", "", test_str)
if result:
    print(result)
Regex demo
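If the string can hold more than one tuple (an assumption on my part, since the question only shows one), a findall-based sketch along these lines may be easier to extend. The helper name to_csv_rows is hypothetical.

```python
import re

def to_csv_rows(text):
    """Return one 'key,value' string per ('key', value) pair found in text."""
    # Two groups, so findall returns a list of (key, value) tuples.
    pairs = re.findall(r"'(\d+)',\s*(\d+)", text)
    return [",".join(pair) for pair in pairs]

print(to_csv_rows("[('15000042', 19)]"))
print(to_csv_rows("[('15000042', 19), ('15000043', 7)]"))
```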
I'm trying to find multiple word matches in a given text. For example:
text = "oracle sql"
regex = "(oracle\\ sql|sql)"
re.findall(regex,text,re.I)
Output actual
oracle sql
Expected output
oracle sql,sql
Can anyone tell me where the problem with the regex is?
Updated:
@jim it won't work if there are multiple overlapping matches, for example:
re.findall("(?=(spark|spark sql|sql))","spark sql",re.I)
Actual Output
['spark','sql']
Expected Output :
['spark','sql','spark sql']
Note: In the above case, if both individual words are matched, the combination of the words is not matched.
Updated :
Check link : repl.it/repls/NewFaithfulMath
You don't need to escape whitespace.
import re
text = "oracle sql"
regex = "(oracle sql|sql)"
print(re.findall(regex, text, re.I))
From the documentation:
Return all non-overlapping matches of pattern in string, as a list of
strings.
Here sql lies inside the text already matched by oracle sql, so it would be an overlapping match and is not returned.
Returning overlapping matches
You can use a lookahead to capture the string you're looking for. Because the lookahead itself consumes no characters, the (empty) matches do not overlap, while the capture group inside it still returns the text.
import re
text = "oracle sql"
regex = "(?=(oracle sql|sql))"
print(re.findall(regex, text, re.I))
Output:
['oracle sql', 'sql']
See it in action.
The downside of this implementation is that it finds at most one match per starting position in the string, because the alternation stops at the first alternative that succeeds there.
For example (my test|my|test) will only find ['my test', 'test'].
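A short sketch of that limitation, using the example pattern above. The reordered second pattern is my own addition, to show that the order of the alternatives decides which one wins at a given position.

```python
import re

text = "my test"

# Longest alternative first: "my test" wins at position 0, so "my" is lost.
longest_first = re.findall(r"(?=(my test|my|test))", text)

# Shortest alternative first: "my" wins at position 0, so "my test" is lost.
shortest_first = re.findall(r"(?=(my|my test|test))", text)

print(longest_first)
print(shortest_first)
```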
You could also use a drop-in replacement for re that can find overlapping matches, such as the third-party regex module, but this will still only find ['my test', 'test'] with the pattern (my test|my|test):
import regex as re
text = "oracle sql"
regex = "(oracle sql|sql)"
print(re.findall(regex, text, re.I, overlapped=True))
Recursion
Regex will only find one match per starting position. It has already found a match at the first character with "oracle sql", so you can't also get a match on just oracle there. A single pass can't find every item.
However, you could use a recursive function that searches the same string again with all of the items except those that have already been matched.
I am not sure how performant this code will be, as you could execute a lot of regex searches.
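import re
def find_all_matches(text, items):
    # re.escape keeps special characters in the items literal
    regex_items = '|'.join(re.escape(i) for i in items)
    regex = "(?=({}))".format(regex_items)
    matches = re.findall(regex, text, re.I)
    # compare case-insensitively, in line with re.I
    matched = set(m.lower() for m in matches)
    new_items = [i for i in items if i.lower() not in matched]
    # recurse only if this pass found something, so that items which
    # never match cannot cause endless recursion
    if matches and new_items:
        return matches + find_all_matches(text, new_items)
    return matches
print(find_all_matches("oracle sql", ['oracle sql', 'oracle', 'sql']))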
Output:
['oracle sql', 'sql', 'oracle']
No regex
Lastly you could implement this without regex. Again I haven't looked at the performance of this.
def find_all_matches(text, items):
    return [i for i in items if i in text]
print(find_all_matches("oracle sql", ['oracle sql', 'oracle', 'sql']))
Output:
['oracle sql', 'oracle', 'sql']
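Note that the plain substring test is case-sensitive and also matches inside longer words (sql inside mysql). If that matters, one possible variant goes back to re but searches for each item separately, so overlaps are no longer a problem. This is a sketch of mine: the function name find_whole_word_matches is hypothetical, and the \b anchors assume every item starts and ends with a word character.

```python
import re

def find_whole_word_matches(text, items):
    """Return every item that occurs in text as a whole word or phrase."""
    found = []
    for item in items:
        # re.escape keeps characters such as '.' literal;
        # \b stops 'sql' from matching inside 'mysql'.
        if re.search(r"\b{}\b".format(re.escape(item)), text, re.I):
            found.append(item)
    return found

print(find_whole_word_matches("Oracle SQL", ['oracle sql', 'oracle', 'sql', 'spark']))
print(find_whole_word_matches("mysql", ['sql']))
```

Each item costs one search, so the work grows with the number of items, but every item is checked independently of the others.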
I've come up with a regex expression that works well enough for my purposes for finding phone numbers.
I would like to take it a step further and use it in large text blocks to identify matching strings that follow the words 'cell' or 'mobile' by at most 10 characters. I would like it to return the number in Cell Phone: (954) 555-4444 as well as Mobile 555-777-9999 but not Fax: (555) 444-6666
something like (in pseudocode)
regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
bigstring = # Some giant string added together from many globbed files
matches = regex.search(bigstring)
for match in matches:
    if match follows 'cell' or match follows 'mobile':
        print match.group(0)
You can do:
txt='''\
Call me on my mobile anytime: 555-666-1212
The office is best at 555-222-3333
Dont ever call me at 555-666-2345 '''
import re
print(re.findall(r'(?:(mobile|office).{0,15}(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4}))', txt))
Prints:
[('mobile', '555-666-1212'), ('office', '555-222-3333')]
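Adapted to the question's wording (cell or mobile, at most 10 characters before the number), the same idea might look like the sketch below. The sample lines come from the question. Replacing the leading \+? with \(? so that "(954)" is accepted, and the names PHONE and CELL_NUMBER, are assumptions of mine.

```python
import re

# Assumption: area codes may be wrapped in parentheses, so the original
# leading \+? is replaced by \(? here.
PHONE = r'\(?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4}'

# Keyword, then at most 10 other characters, then the captured number.
CELL_NUMBER = re.compile(r'(?:cell|mobile).{0,10}?(' + PHONE + r')', re.I)

bigstring = (
    "Cell Phone: (954) 555-4444\n"
    "Mobile 555-777-9999\n"
    "Fax: (555) 444-6666\n"
)

cell_numbers = CELL_NUMBER.findall(bigstring)
print(cell_numbers)
```

Since . does not match a newline by default, the keyword and the number have to be on the same line. A fax number that happens to sit within 10 characters of the word cell on the same line would still be picked up.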
You can do that with your regular expression. In the re documentation, you will find that the pattern r'(?<=abc)def' matches 'def' only if it is preceded by 'abc'.
Similarly r'Hello (?=World)' matches 'Hello ' if followed by 'World'
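A small sketch of both lookarounds, using the patterns quoted above. The last part is my own addition: Python's built-in re module only accepts fixed-width lookbehinds, so "preceded by cell within 10 characters" cannot be written as a single lookbehind there.

```python
import re

# 'def' is returned only where it directly follows 'abc'.
behind = re.findall(r'(?<=abc)def', 'abcdef xyzdef')

# The lookahead checks for 'World' without including it in the match.
ahead = re.search(r'Hello (?=World)', 'Hello World').group(0)

# A variable-width lookbehind is rejected when the pattern is compiled.
rejected = False
try:
    re.compile(r'(?<=cell.{0,10})\d+')
except re.error:
    rejected = True

print(behind)
print(repr(ahead))
print(rejected)
```

For the variable-distance case, a capturing group after the keyword is the usual workaround.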