Python: Regular Expression not working properly

I'm using the following regex. It is supposed to find the string 'U.S.A.', but it only gets 'A.'. Does anyone know what's wrong?
#INPUT
import re
text = 'That U.S.A. poster-print costs $12.40...'
print(re.findall(r'([A-Z]\.)+', text))
#OUTPUT
['A.']
Expected Output:
['U.S.A.']
I'm following the NLTK Book, Chapter 3.7 here. It has a set of regexes, but they just don't work for me. I've tried it in both Python 2.7 and 3.4.
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
nltk.regexp_tokenize() works the same way as re.findall(), so I think my Python somehow does not interpret the regex as expected. The regex listed above outputs this:
[('', '', ''),
('A.', '', ''),
('', '-print', ''),
('', '', ''),
('', '', '.40'),
('', '', '')]

Possibly it has something to do with how regexes were previously compiled using nltk.internals.compile_regexp_to_noncapturing(), which was removed in v3.1 (see here). The pattern works in NLTK v3.0.5:
>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-#&*] # special characters with meanings
... '''
>>>
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
But it doesn't work in NLTK v3.1:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-#&*] # special characters with meanings
... '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
With slight modification of how you define your regex groups, you could get the same pattern to work in NLTK v3.1, using this regex:
pattern = r"""(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
|\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
|(?:[+/\-#&*]) # special characters with meanings
"""
In code:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""(?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-#&*]) # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
Without NLTK, using Python's re module directly (after import re), the old pattern with capturing groups returns tuples of group contents, while the non-capturing version returns the tokens:
>>> pattern1 = r"""(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... |\w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... |[+/\-#&*] # special characters with meanings
... |\S\w* # any sequence of word characters#
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
>>> pattern2 = r"""(?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-#&*]) # special characters with meanings
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
Note: The change in how NLTK's RegexpTokenizer compiles the regexes would make the examples on NLTK's Regular Expression Tokenizer obsolete too.
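A quick way to check whether a pattern will produce tuples instead of tokens is to count its capturing groups. The sketch below uses only the standard re module; the helper name count_capturing_groups is made up for illustration, and the two patterns are variants of the ones shown above.

```python
import re

def count_capturing_groups(pattern):
    """Return how many capturing groups a regex pattern defines."""
    return re.compile(pattern).groups

old_pattern = r"""(?x)
    ([A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \$?\d+(\.\d+)?%?      # numbers, incl. currency and percentages
  | \w+([-']\w+)*         # words with optional internal hyphens/apostrophes
  | [+/\-#&*]             # special characters
"""

new_pattern = r"""(?x)
    (?:[A-Z]\.)+          # same alternatives, non-capturing groups only
  | \$?\d+(?:\.\d+)?%?
  | \w+(?:[-']\w+)*
  | [+/\-#&*]
"""

print(count_capturing_groups(old_pattern))  # 3
print(count_capturing_groups(new_pattern))  # 0
```

A pattern that reports 0 groups makes re.findall return whole matches as plain strings.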

Drop the trailing +, or put it inside the group:
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> re.findall(r'([A-Z]\.)+', text)
['A.'] # wrong
>>> re.findall(r'([A-Z]\.)', text)
['U.', 'S.', 'A.'] # without '+'
>>> re.findall(r'((?:[A-Z]\.)+)', text)
['U.S.A.'] # with '+' inside the group

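The first part of the text that the regexp matches is "U.S.A.", because ([A-Z]\.)+ matches the first group (the part within parentheses) three times. However, a group can hold only one captured substring, so Python keeps the last one matched by that group, and findall returns the group's content rather than the whole match.
If you instead change the regular expression so that the "+" is inside a group, that group matches only once and captures the full abbreviation. For example, ((?:[A-Z]\.)+) returns 'U.S.A.'; (([A-Z]\.)+) also works, but because it has two capturing groups, findall returns a tuple for each match, with the full text as the first element.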
If you instead want three separate results, then just get rid of the "+" sign in the regular expression and it will only match one letter and one dot for each time.
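The difference between the whole match and the group capture can be seen directly on a match object. This is a small sketch using re.search on the text from the question; the variable names are illustrative.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

m = re.search(r'([A-Z]\.)+', text)
whole = m.group(0)   # the entire match: 'U.S.A.'
last = m.group(1)    # the repeated group keeps only its last capture: 'A.'
print(whole, last)

# With an outer capturing group, findall returns one tuple per match:
# (outer group, inner group).
with_outer = re.findall(r'(([A-Z]\.)+)', text)
print(with_outer)    # [('U.S.A.', 'A.')]
```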

The problem is the "capturing group", aka the parentheses, which have an unexpected effect on the result of findall(): when a pattern contains a capturing group, findall returns the group's content instead of the whole match, and when a capturing group is repeated within a match, the group keeps only the text from its last repetition. Specifically: the regexp correctly matches the entire U.S.A., but findall drops it on the floor and only returns the last group capture.
As this answer says, the re module doesn't keep every capture of a repeated group, but you could install the alternative regex module, which does. (However, this would be no help to you if you want to pass your regexp to nltk.tokenize.regexp.)
Anyway, to match U.S.A. correctly, use a non-capturing group: r'(?:[A-Z]\.)+'.
>>> re.findall(r'(?:[A-Z]\.)+', text)
['U.S.A.']
You can apply the same fix to all repeated patterns in the NLTK regexp, and everything will work correctly. As @alvas suggested, the NLTK used to make this substitution behind the scenes, but this feature was recently dropped and replaced with a warning in the documentation of the tokenizer. The book is clearly out of date; @alvas filed a bug report about it back in November, but it hasn't been acted on yet...
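As a sketch of that fix applied to the book's pattern, the version below makes every group non-capturing and runs it through plain re.findall instead of NLTK. One assumption beyond the original: the punctuation class is rewritten with escaped brackets and the hyphen moved to the end, so that it is not read as a character range.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

pattern = r'''(?x)          # verbose mode
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [\]\[.,;"'?():_`-]      # separate punctuation tokens
'''

tokens = re.findall(pattern, text)
print(tokens)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```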

Related

Regex pattern to match multiple characters and split

I haven't used regex much and was having issues trying to split out 3 specific pieces of info in a long list of text I need to parse.
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
pattern1 = Parse the **Name Text**
pattern2 = Parse the number `#x`
pattern3 = Grab everything else until the next pattern 1
What I have doesn't seem to work well. There are empty elements? They are not grouped together? And I can't figure out how to grab the last pattern text without it affecting the first 2 patterns. I'd also like it if all 3 matches were in a tuple together rather than separated. Here's what I have so far:
all = r"\*\*(.+?)\*\*|\`#(.+?)\`:"
l = re.findall(all, note)
Output:
[('Jane Greiz', ''), ('', '1'), ('Thomas Fitzpatrick', ''), ('', '90'), ('Anthony Smith', ''), ('', '91')]
Don't use alternation. Put the name and number patterns one after the other in a single pattern, and add a third group that captures the rest of the line.
import re
note = "**Jane Greiz** `#1`: Should be open here .\n**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com"
# one pattern, three groups: name, number, rest of the line
pattern = r"\*\*(.+?)\*\*.*?\`#(.+?)\`:(.*)"
print(re.findall(pattern, note))
Output is:
[('Jane Greiz', '1', ' Should be open here .'), ('Thomas Fitzpatrick', '90', ' Anim: Can we start the movement.'), ('Anthony Smith', '91', ' Her left shoulder.')]
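If the third piece should run up to the next **Name** even across line breaks, one possible variant is a lazy group followed by a lookahead. This is only a sketch: the pattern and the names ENTRY_RE and entries are assumptions, not part of the answer above.

```python
import re

note = ("**Jane Greiz** `#1`: Should be open here .\n"
        "**Thomas Fitzpatrick** `#90`: Anim: Can we start the movement.\n"
        "**Anthony Smith** `#91`: Her left shoulder.\nhttps://google.com")

# name, number, then everything up to the next line starting with "**"
# (or the end of the string). re.DOTALL lets "." cross newlines.
ENTRY_RE = re.compile(r"\*\*(.+?)\*\*\s*`#(.+?)`:\s*(.*?)(?=\n\*\*|\Z)", re.DOTALL)

entries = ENTRY_RE.findall(note)
for name, number, body in entries:
    print((name, number, body))
```

With this variant the trailing URL line becomes part of the last entry's text instead of being dropped.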

Using regex on Python to find any numerical value in an expression

I am trying to get all numerical values (integers, decimals, floats, scientific notation) from an expression and want to differentiate them from digits that are not really numbers but part of a name. For example, in the expression below:
230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1
the first 230 is not a numerical value, as it is part of a tag (230FIC000.PV).
Using the web tool regexp.com, I came up with the following expression, which works for the expression above.
(?!\s)(?<!\w)[+-]?((\d+\.\d*)|(\.\d+)|(\d+))([eE][+-]?\d+)?(\s)|(?<!\w)[0-9]\d+(?<!\s)$
However, when I use the above expression in Python's re.findall(), I get a list of 5 tuples with 6 elements each.
import re
pat = r'(?!\s)(?<!\w)[+-]?((\d+\.\d*)|(\.\d+)|(\d+))([eE][+-]?\d+)?(\s)|(?<!\w)[0-9]\d+(?<!\s)$'
exp = '230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1 '
matches = re.findall(pat,exp)
The result is
0:('2', '', '', '2', 'e3', ' ')
1:('20', '', '', '20', '', ' ')
2:('20.4', '20.4', '', '', '', ' ')
3:('45', '', '', '45', '', ' ')
4:('2', '', '', '2', 'e4', ' ')
len():5
I would like some help to understand what is happening, and whether there is any way to get the same result as on regexp.com.
This should take care of it. (All the items are strings)
import re
st = '230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1'
re.findall(r'-?[0-9]+\.?[0-9]*(?:[Ee]\ *-?\ *[0-9]+)|-?\d+\.\d+|\b\d+\b', st)
referred: How to extract numbers from strings,
Extracting scientific numbers from string,
and Extracting decimal values from string
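The tuples in the question come from the capturing groups: re.findall returns one tuple of group contents per match when a pattern has more than one group. Below is a sketch of an alternative that uses only non-capturing groups plus lookarounds to skip digits embedded in tags. The pattern and the name NUMBER_RE are assumptions for illustration, not the pattern from the answer above.

```python
import re

NUMBER_RE = re.compile(r"""(?x)
    (?<![\w.])                      # not preceded by a word character or dot
    [+-]?                           # optional sign
    (?:\d+\.\d*|\.\d+|\d+)          # 12.5, 12., .5 or 12
    (?:[eE][+-]?\d+)?               # optional exponent
    (?![\w.])                       # not followed by a word character or dot
""")

exp = ('230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 '
       'tic200 >=45 tic100 <-2E-4 fic123 >1')

numbers = NUMBER_RE.findall(exp)
print(numbers)
# ['-2e3', '20', '-20.4', '45', '-2E-4', '1']
```

Because the pattern has no capturing groups, each result is the whole matched string and can be passed to float() if numeric values are needed.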

How to allow characters and whitespaces in an exception in regex?

Given the input:
1993年8月にデビュー。。。同年11月から1995年3月にかけてクラシック三冠を含むGI5連勝、10連続連対を達成し、1993年JRA賞最優秀3歳牡馬[† 3]、1994年JRA賞年度代表馬および最優秀4歳牡馬[† 3]に選出された。1995年春に故障(股関節炎)を発症したあとはその後遺症から低迷し、6戦して重賞を1勝するにとどまった(GI は5戦して未勝利)が、第44回阪神大賞典におけるマヤノトップガンとのマッチレースや短距離戦である第26回高松宮杯への出走によってファンの話題を集めた。第26回高松宮杯出走後に発症した屈腱炎が原因となって1996年10月に競走馬を引退した。競走馬を引退したあとは種牡馬となったが、1998年9月に胃破裂を発症し、安楽死の措置がとられた。
Desired output is:
["1993年8月にデビュー。",
"同年11月から1995年3月にかけてクラシック三冠を含むGI5連勝、", "10連続連対を達成し、",
"1993年JRA賞最優秀3歳牡馬[† 3]、", "1994年JRA賞年度代表馬および最優秀4歳牡馬[† 3]に選出された。",
"1995年春に故障(股関節炎)を発症したあとはその後遺症から低迷し、", "6戦して重賞を1勝するにとどまった",
"(GI は5戦して未勝利)が、", "第44回阪神大賞典におけるマヤノトップガンとのマッチレースや短距離戦である第26回高松宮杯への出走によってファンの話題を集めた。",
"第26回高松宮杯出走後に発症した屈腱炎が原因となって1996年10月に競走馬を引退した。",
"競走馬を引退したあとは種牡馬となったが、", "1998年9月に胃破裂を発症し、", "安楽死の措置がとられた。"]
I've tried the following regex:
import re
text= str("1993年8月にデビュー。"
"同年11月から1995年3月にかけてクラシック三冠を含むGI5連勝、10連続連対を達成し、"
"1993年JRA賞最優秀3歳牡馬[† 3]、1994年JRA賞年度代表馬および最優秀4歳牡馬[† 3]に選出された。"
"1995年春に故障(股関節炎)を発症したあとはその後遺症から低迷し、6戦して重賞を1勝するにとどまった"
"(GI は5戦して未勝利)が、第44回阪神大賞典におけるマヤノトップガンとのマッチレースや短距離戦である第26回高松宮杯への出走によってファンの話題を集めた。"
"第26回高松宮杯出走後に発症した屈腱炎が原因となって1996年10月に競走馬を引退した。"
"競走馬を引退したあとは種牡馬となったが、1998年9月に胃破裂を発症し、安楽死の措置がとられた。")
re.split(r'([^! ? 。、]*[!?。、]{1,3})', text)
That splits on the punctuation correctly but also splits on the space. It outputs:
['',
'1993年8月にデビュー。',
'',
'同年11月から1995年3月にかけてクラシック三冠を含むGI5連勝、',
'',
'10連続連対を達成し、',
'1993年JRA賞最優秀3歳牡馬[† ',
'3]、',
'1994年JRA賞年度代表馬および最優秀4歳牡馬[† ',
'3]に選出された。',
'',
'1995年春に故障(股関節炎)を発症したあとはその後遺症から低迷し、',
'6戦して重賞を1勝するにとどまった(GI ',
'は5戦して未勝利)が、',
'',
'第44回阪神大賞典におけるマヤノトップガンとのマッチレースや短距離戦である第26回高松宮杯への出走によってファンの話題を集めた。',
'',
'第26回高松宮杯出走後に発症した屈腱炎が原因となって1996年10月に競走馬を引退した。',
'',
'競走馬を引退したあとは種牡馬となったが、',
'',
'1998年9月に胃破裂を発症し、',
'',
'安楽死の措置がとられた。',
'']
These segments were broken wrongly because space wasn't included in the allowed characters of the first optional group:
'1993年JRA賞最優秀3歳牡馬[† 3]、',
'1994年JRA賞年度代表馬および最優秀4歳牡馬[† 3]に選出された。',
...,
'6戦して重賞を1勝するにとどまった(GI は5戦して未勝利)が、'
How to allow characters and whitespaces in an exception in regex?
Your desired output shows a split before a parenthesis that wasn't in your regular expression attempt. Assuming that is an error, this works:
#coding:utf8
import re
text = '''1993年8月にデビュー。。。同年11月から1995年3月にかけてクラシック三冠を含むGI5連勝、10連続連対を達成し、1993年JRA賞最優秀3歳牡馬[† 3]、1994年JRA賞年度代表馬および最優秀4歳牡馬[† 3]に選出された。1995年春に故障(股関節炎)を発症したあとはその後遺症から低迷し、6戦して重賞を1勝するにとどまった(GI は5戦して未勝利)が、第44回阪神大賞典におけるマヤノトップガンとのマッチレースや短距離戦である第26回高松宮杯への出走によってファンの話題を集めた。第26回高松宮杯出走後に発症した屈腱炎が原因となって1996年10月に競走馬を引退した。競走馬を引退したあとは種牡馬となったが、1998年9月に胃破裂を発症し、安楽死の措置がとられた。'''
desired = ["1993年8月にデビュー。",
"同年11月から1995年3月にかけてクラシック三冠を含むGI5連勝、",
"10連続連対を達成し、",
"1993年JRA賞最優秀3歳牡馬[† 3]、",
"1994年JRA賞年度代表馬および最優秀4歳牡馬[† 3]に選出された。",
"1995年春に故障(股関節炎)を発症したあとはその後遺症から低迷し、",
"6戦して重賞を1勝するにとどまった(GI は5戦して未勝利)が、",
"第44回阪神大賞典におけるマヤノトップガンとのマッチレースや短距離戦である第26回高松宮杯への出走によってファンの話題を集めた。",
"第26回高松宮杯出走後に発症した屈腱炎が原因となって1996年10月に競走馬を引退した。",
"競走馬を引退したあとは種牡馬となったが、",
"1998年9月に胃破裂を発症し、",
"安楽死の措置がとられた。"]
actual = re.findall(r'([^!?。、]*[!?。、])[!?。、]*', text)
print(desired == actual)
Output:
True
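The effect of the spaces inside the negated character class can be reproduced on a short string. This sketch uses a made-up sample rather than the full text: the first call uses the pattern from the question, the second the pattern from the answer.

```python
import re

sample = 'a b、c。。。d。'

# Pattern from the question: the spaces inside [^! ? 。、] exclude ' ',
# so a segment can never contain a space and 'a ' is left over.
with_spaces = re.split(r'([^! ? 。、]*[!?。、]{1,3})', sample)
print(with_spaces)

# Pattern from the answer: no spaces in the class, and repeated
# punctuation after the first mark is matched outside the group.
fixed = re.findall(r'([^!?。、]*[!?。、])[!?。、]*', sample)
print(fixed)  # ['a b、', 'c。', 'd。']
```

re.findall returns only the group here, which is why the extra 。。 after the first mark does not appear in the result.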

regex expression not recognising the other lines

I have a regex which I would like to match a couple of things:
Here is a link to the examples and the code I have started with. For reasons I cannot determine, my regex is not recognising some lines: http://regex101.com/r/oL4bB5/1
The string examples:
eg1: Tommy Berry
eg2: Ms Winona Costin (a3/47kg)
eg3: Ms Kathy O'Hara
End result using findall in python:
eg1: ['Tommy Berry']
eg2: ['Ms','Winona Costin', '3', '47']
eg3: ['Ms', "Kathy O'Hara"]
As you can see, I want to isolate the Ms at the beginning of the string, the digits within the parenthesis and maintain the full name.
I appreciate the help, thanks!
EDIT
The name may contain numbers and special characters such as '-. etc.:
eg: Samuel L. Jackson-Pitt
I think you want something like this,
^(Ms)?\s*([\w '-]+)(?= \(|$)(?: *\(\D*(\d+)\D*(\d+)[^\n]*)?$
DEMO
>>> import re
>>> s = """Brodie Loy (a3/53kg)
Hugh Bowman
Ms Winona Costin (a3/47kg)
James McDonald
Ms Kathy O'Hara"""
>>> m = re.findall(r"^(Ms)?\s*([\w '-]+)(?= \(|$)(?: *\(\D*(\d+)\D*(\d+)[^\n]*)?$", s, re.M)
>>> m
[('', 'Brodie Loy', '3', '53'), ('', 'Hugh Bowman', '', ''), ('Ms', 'Winona Costin', '3', '47'), ('', 'James McDonald', '', ''), ('Ms', "Kathy O'Hara", '', '')]
>>> [tuple(s for s in tup if s) for tup in m]
[('Brodie Loy', '3', '53'), ('Hugh Bowman',), ('Ms', 'Winona Costin', '3', '47'), ('James McDonald',), ('Ms', "Kathy O'Hara")]
What you are looking for is: (demo)
^(Ms)?([\w '-]+)(?:.*?(\d+)\/(\d+))?
Remember to use re.MULTILINE.
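To get lists shaped like the ones in the question, one option is to match each line separately and drop the groups that did not participate. The following is a sketch under assumptions: the pattern, the group names, and the helper parse_rider are hypothetical, and the suffix is assumed to always look like "(a3/47kg)".

```python
import re

# Hypothetical pattern: optional "Ms" title, a name, and an optional
# "(a<claim>/<weight>kg)" suffix. The group names are illustrative only.
LINE_RE = re.compile(
    r"^(?:(?P<title>Ms)\s+)?"
    r"(?P<name>[\w .'-]+?)"
    r"\s*(?:\(a?(?P<claim>\d+)/(?P<weight>\d+)kg\))?$"
)

def parse_rider(line):
    """Return the non-empty parts of one line, or [] if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return []
    return [g for g in m.groups() if g]

for line in ["Tommy Berry", "Ms Winona Costin (a3/47kg)",
             "Ms Kathy O'Hara", "Samuel L. Jackson-Pitt"]:
    print(parse_rider(line))
```

Matching line by line avoids the need for re.MULTILINE, at the cost of one call per line.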

findall and regular expressions, getting the correct pattern

I'm working out of Magnus Lie Hetland's book, "Beginning Python" 2nd edition, and on page 244 he says the first pattern listed in my code should produce the desired output listed at the bottom of this code, but it doesn't. So I tried a couple of other patterns in order to try and get the desired output, but they don't work either. I checked the errata for the book and there are no corrections for this page. I'm using python 2.7.6. Any suggestions?
import re
s1 = 'http://www.python.org http://python.org www.python.org python.org .python.org ww.python.org w.python.org wwww.python.org'
# choose a pattern and comment out the other two
# output using Hetland's pattern
pat = r'(http://)?(www\.)?python\.org'
''' [('http://', 'www.'), ('http://', ''), ('', 'www.'), ('', ''), ('', ''), ('', ''), ('', ''), ('', 'www.')] '''
# output using this pattern
# pat = r'http://?www\.?python\.org'
''' ['http://www.python.org'] '''
# output using this pattern
# pat = r'http://?|www\.?|python\.org'
''' ['http://', 'www.', 'python.org', 'www.', 'http://', 'python.org', 'www.', 'python.org', 'python.org', 'python.org', 'python.org', 'python.org', 'www', 'python.org'] '''
print '\n', re.findall(pat, s1)
# desired output
''' ['http://www.python.org', 'http://python.org', 'www.python.org', 'python.org'] '''
The pattern works if you make the first two optional groups non-capture groups (?:...):
pat = r'(?:http://)?(?:www\.)?python\.org'
matches = re.findall(pat, s1)
# ['http://www.python.org', 'http://python.org', 'www.python.org', 'python.org', 'python.org', 'python.org', 'python.org', 'www.python.org']
That is, if that's the desired result: the change means the pattern no longer has any capture groups (instead of two), so findall returns each whole match as a string rather than a tuple of groups.
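If the book's pattern has to stay as it is, the whole matches are still available through re.finditer and group(0). This is a small sketch of that alternative; the variable names groups_only and whole are illustrative.

```python
import re

s1 = ('http://www.python.org http://python.org www.python.org python.org '
      '.python.org ww.python.org w.python.org wwww.python.org')
pat = r'(http://)?(www\.)?python\.org'

# findall returns the contents of the two groups for each match
groups_only = re.findall(pat, s1)

# finditer yields match objects; group(0) is always the whole match
whole = [m.group(0) for m in re.finditer(pat, s1)]
print(whole)
```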
