findall and regular expressions, getting the correct pattern - python

I'm working out of Magnus Lie Hetland's book, "Beginning Python" 2nd edition, and on page 244 he says the first pattern listed in my code should produce the desired output listed at the bottom of this code, but it doesn't. So I tried a couple of other patterns to try to get the desired output, but they don't work either. I checked the errata for the book and there are no corrections for this page. I'm using Python 2.7.6. Any suggestions?
import re
s1 = 'http://www.python.org http://python.org www.python.org python.org .python.org ww.python.org w.python.org wwww.python.org'
# choose a pattern and comment out the other two
# output using Hetland's pattern
pat = r'(http://)?(www\.)?python\.org'
''' [('http://', 'www.'), ('http://', ''), ('', 'www.'), ('', ''), ('', ''), ('', ''), ('', ''), ('', 'www.')] '''
# output using this pattern
# pat = r'http://?www\.?python\.org'
''' ['http://www.python.org'] '''
# output using this pattern
# pat = r'http://?|www\.?|python\.org'
''' ['http://', 'www.', 'python.org', 'www.', 'http://', 'python.org', 'www.', 'python.org', 'python.org', 'python.org', 'python.org', 'python.org', 'www', 'python.org'] '''
print '\n', re.findall(pat, s1)
# desired output
''' ['http://www.python.org', 'http://python.org', 'www.python.org', 'python.org'] '''

The pattern works if you make the first two optional groups non-capturing groups (?:...):
pat = r'(?:http://)?(?:www\.)?python\.org'
matches = re.findall(pat, s1)
# ['http://www.python.org', 'http://python.org', 'www.python.org', 'python.org', 'python.org', 'python.org', 'python.org', 'www.python.org']
That is, if that's the desired result - with the change, the pattern has no capture groups left (the original had two), so findall returns the whole match for each hit instead of a tuple of group captures...
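If you want to keep the capture groups and still get the full matched text, one option (my own sketch, not from the book or the answer above; s1 is shortened here to the four well-formed URLs) is to iterate with re.finditer and take group(0):
import re

s1 = 'http://www.python.org http://python.org www.python.org python.org'
pat = r'(http://)?(www\.)?python\.org'
# finditer yields Match objects, so group(0) is the whole match while
# group(1) and group(2) still give the individual captures
print([m.group(0) for m in re.finditer(pat, s1)])
# ['http://www.python.org', 'http://python.org', 'www.python.org', 'python.org']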

Regex pattern to match multiple characters and split

I haven't used regex much and was having issues trying to split out 3 specific pieces of info in a long list of text I need to parse.
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
pattern1 = Parse the **Name Text**
pattern2 = Parse the number `#x`
pattern3 = Grab everything else until the next pattern 1
What I have doesn't seem to work well: there are empty elements, the matches aren't grouped together, and I can't figure out how to grab the text for the last pattern without it affecting the first two. I'd also like all three matches to be in a single tuple rather than separated. Here's what I have so far:
all = r"\*\*(.+?)\*\*|\`#(.+?)\`:"
l = re.findall(all, note)
Output:
[('Jane Greiz', ''), ('', '1'), ('Thomas Fitzpatrick', ''), ('', '90'), ('Anthony Smith', ''), ('', '91')]
Don't use alternation. Put the name and number patterns one after the other in a single pattern, and add another group for the text up to the next **.
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
all = r"\*\*(.+?)\*\*.*?\`#(.+?)\`:(.*)"
print(re.findall(all, note))
Output is:
[('Jane Greiz', '1', ' Should be open here .'), ('Thomas Fitzpatrick', '90', ' Anim: Can we start the movement.'), ('Anthony Smith', '91', ' Her left shoulder.')]
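If you prefer labelled results, a variation of my own on the same pattern (not part of the answer above) is to use named groups with re.finditer:
import re

note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
pattern = r"\*\*(?P<name>.+?)\*\*.*?`#(?P<num>.+?)`:(?P<rest>.*)"
for m in re.finditer(pattern, note):
    # named groups make the three pieces self-documenting
    print(m.group('name'), m.group('num'), m.group('rest').strip())
# Jane Greiz 1 Should be open here .
# Thomas Fitzpatrick 90 Anim: Can we start the movement.
# Anthony Smith 91 Her left shoulder.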

Match names, dialogues, and actions from transcript using regex

Given a string dialogue such as below, I need to find the sentence that corresponds to each user.
text = '''CHRIS: Hello, how are you...
PETER: Great, you? PAM: He is resting.
[PAM SHOWS THE COUCH]
[PETER IS NODDING HIS HEAD]
CHRIS: Are you ok?'''
For the above dialogue, I would like to return tuples with three elements:
The name of the person
The sentence in lower case and
The sentences within Brackets
Something like this:
('CHRIS', 'Hello, how are you...', None)
('PETER', 'Great, you?', None)
('PAM', 'He is resting', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD')
('CHRIS', 'Are you ok?', None)
etc...
I am trying to use regex to achieve the above. So far I was able to get the names of the users with the below code. I am struggling to identify the sentence between two users.
actors = re.findall(r'\w+(?=\s*:[^/])',text)
You can do this with re.findall:
>>> re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
[('CHRIS', ' Hello, how are you...', ''),
('PETER', ' Great, you? ', ''),
('PAM',
' He is resting.',
'[PAM SHOWS THE COUCH]\n[PETER IS NODDING HIS HEAD]\n'),
('CHRIS', ' Are you ok?', '')]
You will have to figure out how to remove the square braces yourself; that cannot be done with the regex while still matching everything (see the sketch after the breakdown below).
Regex Breakdown
\b              # Word boundary
(\S+)           # First capture group, string of characters not having a space
:               # Colon
(               # Second capture group
    [^          # Match anything that is not...
        :       # a colon
        \[\]    # or square braces
    ]+?         # Non-greedy match
)
\n?             # Optional newline
(               # Third capture group
    \[          # Literal opening brace
    [^:]+?      # Similar to above - exclude colon from match
    \]
    \n?         # Optional newline
)?              # Third capture group is optional
(?=             # Lookahead for...
    \b          # a word boundary, followed by
    \S+         # one or more non-space chars, and
    :           # a colon
    |           # Or,
    $           # EOL
)
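As for stripping the braces, here is a quick post-processing sketch of my own (the ']. [' substitution and the strip('[]\n ') cleanup are assumptions about the desired formatting, not part of the answer itself):
import re

text = '''CHRIS: Hello, how are you...
PETER: Great, you? PAM: He is resting.
[PAM SHOWS THE COUCH]
[PETER IS NODDING HIS HEAD]
CHRIS: Are you ok?'''

matches = re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
cleaned = [(name, line.strip(),
            # turn ']...[' boundaries into '. ' and peel the outer braces;
            # an empty actions string becomes None
            re.sub(r'\]\s*\[', '. ', actions).strip('[]\n ') or None)
           for name, line, actions in matches]
# [('CHRIS', 'Hello, how are you...', None),
#  ('PETER', 'Great, you?', None),
#  ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD'),
#  ('CHRIS', 'Are you ok?', None)]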
Regex is one way to approach this problem, but you can also think about it as iterating through each token in your text and applying some logic to form groups.
For example, we could first find groups of names and text:
from itertools import groupby

def isName(word):
    # Names end with ':'
    return word.endswith(":")

text_split = [
    " ".join(list(g)).rstrip(":")
    for i, g in groupby(text.replace("]", "] ").split(), isName)
]
print(text_split)
#['CHRIS',
# 'Hello, how are you...',
# 'PETER',
# 'Great, you?',
# 'PAM',
# 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]',
# 'CHRIS',
# 'Are you ok?']
Next you can collect pairs of consecutive elements in text_split into tuples:
print([(text_split[i*2], text_split[i*2+1]) for i in range(len(text_split)//2)])
#[('CHRIS', 'Hello, how are you...'),
# ('PETER', 'Great, you?'),
# ('PAM', 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]'),
# ('CHRIS', 'Are you ok?')]
We're almost at the desired output. We just need to deal with the text in the square brackets. You can write a simple function for that. (Regular expressions are admittedly an option here, but I'm purposely avoiding them in this answer.)
Here's something quick that I came up with:
def isClosingBracket(word):
    return word.endswith("]")

def processWords(words):
    if "[" not in words:
        return [words, None]
    else:
        return [
            " ".join(g).replace("]", ".")
            for i, g in groupby(map(str.strip, words.split("[")), isClosingBracket)
        ]

print(
    [(text_split[i*2], *processWords(text_split[i*2+1])) for i in range(len(text_split)//2)]
)
#[('CHRIS', 'Hello, how are you...', None),
# ('PETER', 'Great, you?', None),
# ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD.'),
# ('CHRIS', 'Are you ok?', None)]
Note that using the * to unpack the result of processWords into the tuple is strictly a Python 3 feature.

regex to split text based on month abbreviations and extract following text?

I am working on a personal project, and am stuck on extracting the text surrounding month abbreviations.
A sample input text is of the form:
text = "apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt"
I expect output of the form:
[ ("apr25, 2016\nblah blah\npow\n"), ("may22, 2017\nasdf rtys\nqwer\n"), ("jan9, 2018\npoiu\nlkjhj yertt") ]
I tried a simple regex, but it is incorrect:
import re
# Greedy version
REGEX_MONTHS_TEXT = re.compile(r'(apr[\w\W]*)|(may[\w\W]*)|(jan[\w\W]*)')
REGEX_MONTHS_TEXT.findall(text)
# output: [('apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt', '', '')]
# Non-Greedy version
REGEX_MONTHS_TEXT = re.compile(r'(apr[\w\W]*?)|(may[\w\W]*?)|(jan[\w\W]*?)')
REGEX_MONTHS_TEXT.findall(text)
# output: [('apr', '', ''), ('', 'may', ''), ('', '', 'jan')]
Can you help me produce the desired output with Python 3 regex?
Or do I need to write custom Python 3 code to produce the desired output?
The problem was getting my regex to stop before the next month abbreviation once it had matched one.
I referred to Python RegEx Stop before a Word and used the tempered greedy token solution mentioned there.
import re
REGEX_MONTHS_TEXT = re.compile(r'(apr|may|jan)((?:(?!apr|may|jan)[\w\W])+)')
text = "apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt"
arr = REGEX_MONTHS_TEXT.findall(text)
# arr = [ ('apr', '25, 2016\nblah blah\npow\n'), ('may', '22, 2017\nasdf rtys\nqwer\n'), ('jan', '9, 2018\npoiu\nlkjhj yertt')]
# The above arr can be combined using list comprehension to form
# list of singleton tuples as expected in the original question
output = [ (x + y,) for (x, y) in arr ]
# output = [('apr25, 2016\nblah blah\npow\n',), ('may22, 2017\nasdf rtys\nqwer\n',), ('jan9, 2018\npoiu\nlkjhj yertt',)]
Additional Resource for Tempered Greedy Token: Tempered Greedy Token - What is different about placing the dot before the negative lookahead
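Alternatively, if you only need the chunks themselves, a sketch of my own splits on a lookahead for the abbreviations (note that splitting on zero-width matches requires Python 3.7+):
import re

text = "apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt"
# split at every position followed by a month abbreviation at a word boundary;
# the lookahead keeps the abbreviation attached to its chunk
chunks = [c for c in re.split(r'(?=\b(?:apr|may|jan))', text) if c]
# ['apr25, 2016\nblah blah\npow\n', 'may22, 2017\nasdf rtys\nqwer\n', 'jan9, 2018\npoiu\nlkjhj yertt']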

Python: Regular Expression not working properly

I'm using the following regex. It's supposed to find the string 'U.S.A.', but it only gets 'A.'. Does anyone know what's wrong?
#INPUT
import re
text = 'That U.S.A. poster-print costs $12.40...'
print re.findall(r'([A-Z]\.)+', text)
#OUTPUT
['A.']
Expected Output:
['U.S.A.']
I'm following the NLTK Book, Chapter 3.7 here; it has a set of regexes, but they're just not working. I've tried it in both Python 2.7 and 3.4.
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
nltk.regexp_tokenize() works the same as re.findall(). I think somehow my Python here does not recognize the regex as expected. The regex listed above outputs this:
[('', '', ''),
('A.', '', ''),
('', '-print', ''),
('', '', ''),
('', '', '.40'),
('', '', '')]
Possibly, it's something to do with how regexes were previously compiled using nltk.internals.compile_regexp_to_noncapturing(), which was abolished in v3.1 (see here).
>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-#&*] # special characters with meanings
... '''
>>>
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
But it doesn't work in NLTK v3.1:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-#&*] # special characters with meanings
... '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
With slight modification of how you define your regex groups, you could get the same pattern to work in NLTK v3.1, using this regex:
pattern = r"""(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
|\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
|(?:[+/\-#&*]) # special characters with meanings
"""
In code:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-#&*]) # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
Without NLTK, using Python's re module, we see that the old regex patterns are not supported natively:
>>> pattern1 = r"""(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... |\w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... |[+/\-#&*] # special characters with meanings
... | \S\w* # any sequence of word characters
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
>>> pattern2 = r"""(?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-#&*]) # special characters with meanings
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
Note: The change in how NLTK's RegexpTokenizer compiles the regexes would make the examples on NLTK's Regular Expression Tokenizer obsolete too.
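For intuition, the old behaviour effectively rewrote plain capturing groups as non-capturing ones before compiling. A hypothetical sketch of that conversion (my approximation, not NLTK's actual source):
import re

def to_noncapturing(pattern):
    # Turn a bare '(' into '(?:', leaving '(?...' constructs and
    # escaped '\(' untouched -- roughly what
    # nltk.internals.compile_regexp_to_noncapturing() used to do.
    return re.sub(r'(?<!\\)\((?!\?)', '(?:', pattern)

print(to_noncapturing(r'([A-Z]\.)+'))  # (?:[A-Z]\.)+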
Drop the trailing +, or put it inside the group:
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> re.findall(r'([A-Z]\.)+', text)
['A.'] # wrong
>>> re.findall(r'([A-Z]\.)', text)
['U.', 'S.', 'A.'] # without '+'
>>> re.findall(r'((?:[A-Z]\.)+)', text)
['U.S.A.'] # with '+' inside the group
The first part of the text that the regexp matches is "U.S.A." because ([A-Z]\.)+ matches the first group (the part within parentheses) three times. However, findall can only return one capture per group, so Python keeps the last one.
If you instead change the regular expression to include the "+" in the group, then the group will only match once and the full match will be returned. For example (([A-Z]\.)+) or ((?:[A-Z]\.)+).
If you instead want three separate results, then just get rid of the "+" sign in the regular expression, and it will match one letter and one dot each time.
The problem is the "capturing group", aka the parentheses, which have an unexpected effect on the result of findall(): When a capturing group is utilized multiple times in a match, the regexp engine loses track and strange things happen. Specifically: the regexp correctly matches the entire U.S.A., but findall drops it on the floor and only returns the last group capture.
As this answer says, the re module doesn't support repeated capturing groups, but you could install the alternative regexp module that does handle this correctly. (However, this would be no help to you if you want to pass your regexp to nltk.tokenize.regexp.)
Anyway, to match U.S.A. correctly, use this: r'(?:[A-Z]\.)+'.
>>> re.findall(r'(?:[A-Z]\.)+', text)
['U.S.A.']
You can apply the same fix to all repeated patterns in the NLTK regexp, and everything will work correctly. As @alvas suggested, NLTK used to make this substitution behind the scenes, but this feature was recently dropped and replaced with a warning in the documentation of the tokenizer. The book is clearly out of date; @alvas filed a bug report about it back in November, but it hasn't been acted on yet...
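To illustrate the third-party regex module mentioned above (a sketch; requires pip install regex), it keeps every repetition of a repeated capture group:
import regex  # third-party module: pip install regex

text = 'That U.S.A. poster-print costs $12.40...'
m = regex.search(r'([A-Z]\.)+', text)
print(m.group(0))     # 'U.S.A.' -- the full match
print(m.captures(1))  # ['U.', 'S.', 'A.'] -- every repetition of group 1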

Regex works fine on Pythex, but not in Python

I used the following regular expression on pythex to test it:
(\d|t)(_\d+){1}\.
It works fine there, and I am primarily interested in group 2.
However, I can't get Python to actually show me the correct results. Here's a MWE:
fn_list = ['IMG_0064.png',
'IMG_0064.JPG',
'IMG_0064_1.JPG',
'IMG_0064_2.JPG',
'IMG_0064_2.PNG',
'IMG_0064_2.BMP',
'IMG_0064_3.JPEG',
'IMG_0065.JPG',
'IMG_0065.JPEG',
'IMG-20150623-00176-preview-left.jpg',
'IMG-20150623-00176-preview-left_2.jpg',
'thumb_2595.bmp',
'thumb_2595_1.bmp',
'thumb_2595_15.bmp']
pattern = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)
for line in fn_list:
    search_obj = re.match(pattern, line)
    if search_obj:
        matching_group = search_obj.groups()
        print matching_group
The output is nothing.
However, Pythex clearly shows two groups returned for each match, and the pattern should hit many more of the files. What am I doing wrong?
You need to use re.search(), not re.match(). re.search() matches anywhere in the string, whereas re.match() matches only at the beginning.
import re
fn_list = ['IMG_0064.png',
'IMG_0064.JPG',
'IMG_0064_1.JPG',
'IMG_0064_2.JPG',
'IMG_0064_2.PNG',
'IMG_0064_2.BMP',
'IMG_0064_3.JPEG',
'IMG_0065.JPG',
'IMG_0065.JPEG',
'IMG-20150623-00176-preview-left.jpg',
'IMG-20150623-00176-preview-left_2.jpg',
'thumb_2595.bmp',
'thumb_2595_1.bmp',
'thumb_2595_15.bmp']
pattern = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)
for line in fn_list:
    search_obj = re.search(pattern, line) # CHANGED HERE
    if search_obj:
        matching_group = search_obj.groups()
        print matching_group
Result:
('4', '_1')
('4', '_2')
('4', '_2')
('4', '_2')
('4', '_3')
('t', '_2')
('5', '_1')
('5', '_15')
Since you are compiling the regular expression, you can do search_obj = pattern.search(line) instead of search_obj = re.search(pattern, line). As for your regular expression itself, r'([\dt])(_\d+)\.' is equivalent to the one you're using, and a bit cleaner.
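To see the difference in isolation, a minimal sketch (filename taken from the question's list):
import re

pat = re.compile(r'([\dt])(_\d+)\.', re.IGNORECASE)

print(pat.match('thumb_2595_15.bmp'))            # None: match() is anchored at the start
print(pat.search('thumb_2595_15.bmp').groups())  # ('5', '_15'): search() scans the whole string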
You need to use the following code:
import re
fn_list = ['IMG_0064.png',
'IMG_0064.JPG',
'IMG_0064_1.JPG',
'IMG_0064_2.JPG',
'IMG_0064_2.PNG',
'IMG_0064_2.BMP',
'IMG_0064_3.JPEG',
'IMG_0065.JPG',
'IMG_0065.JPEG',
'IMG-20150623-00176-preview-left.jpg',
'IMG-20150623-00176-preview-left_2.jpg',
'thumb_2595.bmp',
'thumb_2595_1.bmp',
'thumb_2595_15.bmp']
pattern = re.compile(r'([\dt])(_\d+)\.', re.IGNORECASE) # OPTIMIZED REGEX A BIT
for line in fn_list:
    search_obj = pattern.search(line) # YOU NEED SEARCH WITH THE COMPILED REGEX
    if search_obj:
        matching_group = search_obj.group(2) # YOU NEED TO ACCESS GROUP 2 IF YOU ARE INTERESTED JUST IN GROUP 2
        print matching_group
See IDEONE demo
As for the regex, (\d|t) is the same as ([\dt]), but the latter is more efficient. Also, {1} is redundant in regex.
