Regex not grabbing all groups, not working in multiline - python

I have the following regex pattern:
pattern = r'''
(?P<name>.+?)\n
SKU\s#\s+(?P<sku_hidden>\d+)\n
Quantity:\s+(?P<quantity>\d+)\n
Gift\sWrap:\s+(?P<gift_wrap>.+?)\n
Shipping\sMethod:.+?\n
Price:.+?\n
Total:\s+(?P<total_price>\$[\d.]+)
'''
I retrieve them using:
re.finditer(pattern, plain, re.M | re.X)
Yet using re.findall yields the same result.
It should match texts like this:
Red Retro Citrus Juicer
SKU # 403109
Quantity: 1
Gift Wrap: No
Shipping Method:Standard
Price: $24.99
Total: $24.99
The first thing that is happening is that using re.M and re.X it doesn't work, but if I put it all in one line it does. The other thing is that when it does work only the first group is caught and the rest ignored. Any thoughts?
ADDITIONAL INFORMATION:
If I change my pattern to be just:
pattern = r'''
(?P<name>.+?)\n
SKU\s#\s+(?P<sku_hidden>\d+)\n
'''
My output comes out like this: [u'Red Retro Citrus Juicer'] it matches yet the SKU does not appear. If I put everything on the same line, like so:
pattern = r'(?P<name>.+?)\nSKU\s#\s+(?P<sku_hidden>\d+)\n'
It does match and grab everything.

When using the X flag, you need to escape the #, which otherwise starts a comment that runs to the end of the line.
Right now your two-line regex is equivalent to
(?P<name>.+?)\n
SKU\s
What you want is
pattern = r'''
(?P<name>.+?)\n
SKU\s\#\s+(?P<sku_hidden>\d+)\n
Quantity:\s+(?P<quantity>\d+)\n
Gift\sWrap:\s+(?P<gift_wrap>.+?)\n
Shipping\sMethod:.+?\n
Price:.+?\n
Total:\s+(?P<total_price>\$[\d.]+)
'''
Notice the \#...
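As a quick check, here is a minimal sketch that runs the corrected verbose pattern against the sample order from the question. The names `plain` and `orders` are only illustrative, and re.M is left out because the pattern has no ^ or $ for it to affect.

```python
import re

# Verbose pattern: whitespace is ignored and an unescaped hash starts a comment.
pattern = re.compile(r'''
    (?P<name>.+?)\n
    SKU\s\#\s+(?P<sku_hidden>\d+)\n      # the hash is escaped, so it is literal
    Quantity:\s+(?P<quantity>\d+)\n
    Gift\sWrap:\s+(?P<gift_wrap>.+?)\n
    Shipping\sMethod:.+?\n
    Price:.+?\n
    Total:\s+(?P<total_price>\$[\d.]+)
''', re.X)

# Sample text copied from the question.
plain = """Red Retro Citrus Juicer
SKU # 403109
Quantity: 1
Gift Wrap: No
Shipping Method:Standard
Price: $24.99
Total: $24.99
"""

# groupdict() returns every named group, which findall would give as tuples.
orders = [m.groupdict() for m in pattern.finditer(plain)]
print(orders)
```

With finditer, each match object exposes all named groups through groupdict(), so nothing is lost once the # is escaped.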


How to extract all comma delimited numbers inside () brackets and ignore any text

I am trying to extract the comma delimited numbers inside () brackets from a string. I can get the numbers if they are alone on a line, but I can't seem to find a solution for getting the numbers when other surrounding text is involved. Any help will be appreciated. Below is the code that I currently use in Python.
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
line = each.strip()
regex_criteria = r'"^([1-9][0-9]*|\([1-9][0-9]*\}|\(([1-9][0-9]*,?)+[1-9][0-9]*\))$"gm'
if (line.__contains__('(') and line.__contains__(')') and not re.search('[a-zA-Z]', refline)):
refline = line[line.find('(')+1:line.find(')')]
if not re.search('[a-zA-Z]', refline):
Remove the ^ and $ anchors; they are what is preventing you from getting all the numbers. The g and m flags also won't work in Python's re.
You can change your regex to :([1-9][0-9]*|\([1-9][0-9]*\}|\(?:([1-9][0-9]*,?)+[1-9][0-9]*\)) if you want to get each number separately.
Or you can simplify your pattern to (?<=[(,])[1-9][0-9]+(?=[,)])
Test regex here: https://regex101.com/r/RlGwve/1
Python code:
import re
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
print(re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', line))
# ['101065', '101066', '101067', '101065']
(?<=[(,])[1-9][0-9]+(?=[,)])
The above pattern matches a number that begins with 1-9 and is followed by one or more digits (so at least two digits in total), but only if it is immediately preceded by ( or , and immediately followed by , or ).
Here's another option:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)*(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?<=\(): Lookbehind for (
[1-9]+\d*: At least one digit from 1-9, then any further digits (\d+ would work too if leading zeros are acceptable)
(?:,[1-9]\d*)*: Zero or multiple numbers after a ,
(?=\)): Lookahead for )
Result for your line:
[['101065', '101066', '101067'], ['101065']]
If you only want the comma separated numbers:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)+(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?:,[1-9]\d*)+: One or more numbers after a ,
Result:
[['101065', '101066', '101067']]
Now, if your line could also look like
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
then you have to sprinkle the pattern with \s* and remove the whitespace afterwards (here with str.translate and str.maketrans):
pattern = re.compile(r"(?<=\()\s*[1-9]+\d*(?:\s*,\s*[1-9]\d*\s*)*(?=\))")
table = str.maketrans("", "", " ")
results = [match[0].translate(table).split(",") for match in pattern.finditer(line)]
Result:
[['101065', '101066', '101067'], ['101065']]
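A possible variation, sketched here under the assumption that leading zeros do not need to be rejected: match the whole bracketed list once, whitespace included, and then pull the digits out of the captured text with a second findall, so no translate step is needed. The name `group_pat` is just illustrative.

```python
import re

line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""

# One bracketed list: digits separated by commas, with optional whitespace anywhere.
group_pat = re.compile(r"\(\s*(\d+(?:\s*,\s*\d+)*)\s*\)")

# For each bracketed list, collect the individual numbers.
results = [re.findall(r"\d+", m.group(1)) for m in group_pat.finditer(line)]
print(results)
# [['101065', '101066', '101067'], ['101065']]
```

Because the brackets are matched directly rather than through lookarounds, a trailing space before ) is also tolerated for a single number.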
Using the pypi regex module you could also use capture groups:
\((?P<num>\d+)(?:,(?P<num>\d+))*\)
The pattern matches:
\( Match (
(?P<num>\d+) Capture group, match 1+ digits
(?:,(?P<num>\d+))* Optionally repeat matching , and 1+ digits in a capture group
\) Match )
Example code
import regex
pattern = r"\((?P<num>\d+)(?:,(?P<num>\d+))*\)"
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
matches = regex.finditer(pattern, line)
for m in matches:
    print(m.capturesdict())
Output
{'num': ['101065', '101066', '101067']}
{'num': ['101065']}

How to extract person name using regular expression?

I am new to regular expressions and I have a kind of phone directory. I want to extract the names out of it. I wrote this (below), but it extracts lots of unwanted text rather than just names. Can you kindly tell me what I am doing wrong and how to correct it? Here is my code:
import re
directory = '''Mark Adamson
Home: 843-798-6698
(424) 345-7659
265-1864 ext. 4467
326-665-8657x2986
E-mail:madamson#sncn.net
Allison Andrews
Home: 612-321-0047
E-mail: AEA#anet.com
Cellular: 612-393-0029
Dustin Andrews'''
nameRegex = re.compile('''
(
[A-Za-z]{2,25}
\s
([A-Za-z]{2,25})+
)
''',re.VERBOSE)
print(nameRegex.findall(directory))
the output it gives is:
[('Mark Adamson', 'Adamson'), ('net\nAllison', 'Allison'), ('Andrews\nHome', 'Home'), ('com\nCellular', 'Cellular'), ('Dustin Andrews', 'Andrews')]
Would be really grateful for help!
Your problem is that \s will also match newlines. Instead of \s, just use a literal space (and drop re.VERBOSE, under which an unescaped space in the pattern is ignored). That is
name_regex = re.compile('[A-Za-z]{2,25} [A-Za-z]{2,25}')
This works if the names have exactly two words. If the names have more than two words (middle names or hyphenated last names) then you may want to expand this to something like:
name_regex = re.compile(r"^([A-Za-z \-]{2,25})+$", re.MULTILINE)
This looks for one or more words and will stretch from the beginning to end of a line (e.g. will not just get 'John Paul' from 'John Paul Jones')
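If you would rather keep the verbose layout from the question, a rough sketch could look like the following. It assumes every name is exactly two words on its own line; the literal space is written as [ ] because a bare space is ignored under re.VERBOSE.

```python
import re

directory = '''Mark Adamson
Home: 843-798-6698
(424) 345-7659
265-1864 ext. 4467
326-665-8657x2986
E-mail:madamson#sncn.net
Allison Andrews
Home: 612-321-0047
E-mail: AEA#anet.com
Cellular: 612-393-0029
Dustin Andrews'''

name_regex = re.compile(r'''
    ^[A-Za-z]{2,25}     # first name, anchored to the start of a line
    [ ]                 # one literal space (never a newline, unlike \s)
    [A-Za-z]{2,25}$     # last name, anchored to the end of the line
''', re.VERBOSE | re.MULTILINE)

names = name_regex.findall(directory)
print(names)
# ['Mark Adamson', 'Allison Andrews', 'Dustin Andrews']
```

There are no capturing groups here, so findall returns plain strings instead of the tuples shown in the question.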
I suggest trying the following regex; it works for me:
"([A-Z][a-z]+\s[A-Z][a-z]+)"
The following regex works as expected.
Related part of the code:
nameRegex = re.compile(r"^[a-zA-Z]+[',. -][a-zA-Z ]?[a-zA-Z]*$", re.MULTILINE)
print(nameRegex.findall(directory))
Output:
>>> python3 test.py
['Mark Adamson', 'Allison Andrews', 'Dustin Andrews']
Try:
nameRegex = re.compile(r'^((?:\w+\s*){2,})$', flags=re.MULTILINE)
This will only choose complete lines that are made up of two or more names composed of 'word' characters.

Trouble scooping out a certain portion of text out of a chunk

How can I get the address that appears before Telephone in the portion of text I've pasted below? I tried the following, but it gives me nothing.
This is the code I've tried so far:
import re
content="""
Campbell, Bellam Associés Inc.
3003 Rue College
Sherbrooke, QC J1M 1T8
Telephone: 819-569-9255
Website: http://www.assurancescb.com
"""
pattern = re.compile(r"(.*)(?=Telephone)")
for item in pattern.finditer(content):
    print(item.group())
Expected output:
Campbell, Bellam Associés Inc.
3003 Rue College
Sherbrooke, QC J1M 1T8
The blocks of text always look like the one pasted above, and there is no marker in front of the address that I could use for a positive lookbehind, so I tried the lookahead above instead.
The dot does not match a line break character, so you could use the inline modifier (?s) or pass re.S (also spelled re.DOTALL):
pattern = re.compile(r"(.*)(?=Telephone)", re.S)
or
pattern = re.compile(r"(?s)(.*)(?=Telephone)")
You could also get the match without using a group:
(?s).*(?=Telephone)
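Put together, a minimal sketch might look like this. It uses re.search because only one block is wanted, and strip() to drop the newlines that surround the address in the match; the variable name `address` is just illustrative.

```python
import re

content = """
Campbell, Bellam Associés Inc.
3003 Rue College
Sherbrooke, QC J1M 1T8
Telephone: 819-569-9255
Website: http://www.assurancescb.com
"""

# re.S lets the dot cross line breaks, so .* can span the whole address block.
match = re.search(r".*(?=Telephone)", content, re.S)

# The raw match starts and ends with a newline, so trim it.
address = match.group().strip() if match else None
print(address)
```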
Change the line
pattern = re.compile(r"(.*)(?=Telephone)")
To
pattern = re.compile(r"(.*)(?=Telephone)", re.DOTALL)
So that your regex wildcard (the dot) would also match newline characters.
:)

Using Regex to extract certain phrases but exclude phrases followed by the word "of"

I am basically trying to extract Section references from a long document.
The following code does so quite well:
example1 = 'Sections 21(1), 54(2), 78(1) of Harry Potter'
res = re.search(r'Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*', example1)
res.group(0)
Output: 'Sections 21(1), 54(2), 78(1)'
However, frequently the sections refer to outside books and I would like to either indicate those or exclude them. Generally, the section reference is followed by an "of" if it refers to another book (example below):
example2 = 'Sections 21(1), 54(2), 78(1) of Harry Potter'
So in this case, I would like to exclude these sections because they refer to Harry Potter and not to sections within the document. The following should achieve this but it doesn't work.
example2 = 'Sections 21(1), 54(2), 78(1) of Harry Potter'
res = re.search(r'Sections?(\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*)(?!\s+of)', example2)
res.group(0)
Actual output: 'Sections 21(1), 54(2), 78' --> (?!\s+of) only removes the (1) behind 78 instead of rejecting the entire reference.
You can emulate atomic groups with capturing groups and lookahead:
(?=(?P<section>Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*))(?P=section)(?! of)
Long story short:
* in positive lookahead you create a capturing group called section that finds a section pattern
* then you match the group contents in (?P=section)
* then in negative lookahead you check that there is no of following
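The three steps above can be sketched as follows. The second test string (`internal`) is a made-up example of a reference that is not followed by "of"; the other names are illustrative too.

```python
import re

section_pat = re.compile(
    # 1. Lookahead captures the whole section reference; once a lookahead
    #    has succeeded, the engine does not backtrack into it.
    r'(?=(?P<section>Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*))'
    # 2. Consume exactly what was captured, 3. then reject a following " of".
    r'(?P=section)(?! of)'
)

external = 'Sections 21(1), 54(2), 78(1) of Harry Potter'
internal = 'Sections 21(1), 54(2), 78(1) apply here'  # hypothetical example

external_match = section_pat.search(external)
internal_match = section_pat.search(internal)

print(external_match)                    # None: the whole reference is rejected
print(internal_match.group('section'))   # Sections 21(1), 54(2), 78(1)
```

Since the captured text cannot be shortened afterwards, the reference is either matched in full or not at all.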
Here is a really good answer that explains that technique.
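This happens because, once (?!\s+of) fails, the engine backtracks into the optional (\(\w+\))? group and gives up the final (1). At that position the following text is (1) rather than " of", so the negative lookahead succeeds and a shorter match is returned.
Atomic groups are available in other regex engines, and in Python's re from version 3.11 onwards, but not in earlier versions.
Another solution, which likewise needs Python 3.11+ or the third-party regex module, is to make the optional part possessive by adding a + after the ?:
r'Sections?(\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?+)*)(?!\s+of)'
Note the + after the ?. The other quantifiers, such as \w+, can still backtrack (78 can shrink to 7), so they may need to be made possessive as well.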

Python Regex not working to extract information from a wiki page

I'm trying to get a block of text out of a string. I'm trying to use:
def findPersonInfo(self):
    if (self.isPerson == True):
        regex = re.compile("\{\{persondata(.*)\}\}",re.IGNORECASE|re.MULTILINE|re.UNICODE)
        result = regex.search(self._rawPage)
        if result:
            print 'Match found: ', result.group()
The string is: (yes, its a wiki page)
*[http://www.jsc.nasa.gov/Bios/htmlbios/acaba-jm.html NASA biography]
{{NASA Astronaut Group 19}}
{{Persondata
|NAME= Acaba, Joseph Michael "Joe"
|ALTERNATIVE NAMES=
|SHORT DESCRIPTION=[[Hydrogeologist]]
|DATE OF BIRTH={{Birth date and age|1967|5|17}}
|PLACE OF BIRTH=[[Inglewood, California]]
|DATE OF DEATH=
|PLACE OF DEATH=
}}
{{DEFAULTSORT:Acaba, Joseph M.}}
[[Category:1967 births]]
but I keep getting no matches.
Add re.DOTALL to the regex options:
In [193]: regex = re.compile(r"{{persondata(.*)}}",re.IGNORECASE|re.MULTILINE|re.UNICODE|re.DOTALL)
In [194]: regex.search(text).group()
Out[194]: '{{Persondata\n|NAME= Acaba, Joseph Michael "Joe"\n|ALTERNATIVE NAMES=\n|SHORT DESCRIPTION=[[Hydrogeologist]]\n|DATE OF BIRTH={{Birth date and age|1967|5|17}}\n|PLACE OF BIRTH=[[Inglewood, California]]\n|DATE OF DEATH=\n|PLACE OF DEATH=\n}}\n{{DEFAULTSORT:Acaba, Joseph M.}}'
DOTALL causes . to match any character at all, including the newline. (Without DOTALL, . does not match newlines.)
MULTILINE causes ^ to match the beginning of lines as well as that of the string, and $ to match the end of lines as well as that of the string. That's okay but it does not influence the match here.
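To see the two flags side by side, here is a small sketch on a shortened, made-up version of the wiki text (only two fields are kept). DOTALL decides whether the dot can cross line breaks; MULTILINE only changes where ^ and $ can match.

```python
import re

# Shortened stand-in for the wiki page in the question.
text = '{{Persondata\n|NAME= Acaba, Joseph Michael "Joe"\n|DATE OF DEATH=\n}}'

# Without DOTALL the dot stops at the first newline, so there is no match.
no_dotall = re.search(r"\{\{persondata(.*)\}\}", text, re.IGNORECASE)

# With DOTALL the dot also matches newlines and the block is found.
with_dotall = re.search(r"\{\{persondata(.*)\}\}", text, re.IGNORECASE | re.DOTALL)

# MULTILINE lets ^ match at the start of every line, which is handy for fields.
fields = re.findall(r"^\|([^=\n]+)=", text, re.MULTILINE)
fields_without_m = re.findall(r"^\|([^=\n]+)=", text)

print(no_dotall)               # None
print(with_dotall is not None)  # True
print(fields)                  # ['NAME', 'DATE OF DEATH']
print(fields_without_m)        # []
```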
PS. The backslashes are not necessary, so for the sake of readability, I've omitted them.
PPS. If the findPersonInfo method is called a lot, you may want to lift the call to re.compile out of the method since it does not depend on self:
class Foo:
    # Compiled once at class level; re.DOTALL is needed so . can cross newlines.
    info_pat = re.compile(r"{{persondata(.*)}}",
                          re.IGNORECASE|re.MULTILINE|re.UNICODE|re.DOTALL)
    def findPersonInfo(self):
        result = None
        if self.isPerson:
            result = self.info_pat.search(self._rawPage)
        if result:
            print 'Match found: ', result.group()
