regex expression not recognising the other lines - python

I have a regex which I would like to match a couple of things:
Here is a link to the examples and the code I have started; for reasons I cannot determine, my regex is not recognising some lines: http://regex101.com/r/oL4bB5/1
The string examples:
eg1: Tommy Berry
eg2: Ms Winona Costin (a3/47kg)
eg3: Ms Kathy O'Hara
End result using findall in python:
eg1: ['Tommy Berry']
eg2: ['Ms', 'Winona Costin', '3', '47']
eg3: ['Ms', "Kathy O'Hara"]
As you can see, I want to isolate the Ms at the beginning of the string, the digits within the parentheses, and maintain the full name.
I appreciate the help, thanks!
EDIT
The name may contain numbers and special characters such as ', - and .:
eg: Samuel L. Jackson-Pitt

I think you want something like this,
^(Ms)?\s*([\w '-]+)(?= \(|$)(?: *\(\D*(\d+)\D*(\d+)[^\n]*)?$
DEMO
>>> import re
>>> s = """Brodie Loy (a3/53kg)
Hugh Bowman
Ms Winona Costin (a3/47kg)
James McDonald
Ms Kathy O'Hara"""
>>> m = re.findall(r"^(Ms)?\s*([\w '-]+)(?= \(|$)(?: *\(\D*(\d+)\D*(\d+)[^\n]*)?$", s, re.M)
>>> m
[('', 'Brodie Loy', '3', '53'), ('', 'Hugh Bowman', '', ''), ('Ms', 'Winona Costin', '3', '47'), ('', 'James McDonald', '', ''), ('Ms', "Kathy O'Hara", '', '')]
>>> [tuple(s for s in tup if s) for tup in m]
[('Brodie Loy', '3', '53'), ('Hugh Bowman',), ('Ms', 'Winona Costin', '3', '47'), ('James McDonald',), ('Ms', "Kathy O'Hara")]

What you are looking for is: (demo)
^(Ms)?([\w '-]+)(?:.*?(\d+)\/(\d+))?
Remember to use re.MULTILINE.
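For example, a rough usage sketch on a few of the sample lines (this pattern can leave surrounding spaces in the name group, so the results are stripped and empty slots dropped):
import re
s = """Brodie Loy (a3/53kg)
Hugh Bowman
Ms Winona Costin (a3/47kg)"""
pattern = r"^(Ms)?([\w '-]+)(?:.*?(\d+)\/(\d+))?"
rows = [tuple(g.strip() for g in m if g.strip())
        for m in re.findall(pattern, s, re.MULTILINE)]
print(rows)
# should give [('Brodie Loy', '3', '53'), ('Hugh Bowman',), ('Ms', 'Winona Costin', '3', '47')]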

Related

Using regex on Python to find any numerical value in an expression

I am trying to get all numerical values (integers, decimals, floats, scientific notation) from an expression and want to differentiate them from digits that are not really numbers but part of a name. For example, in the expression below,
230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1
the first 230 is not a numerical value as it is part of a tag (230FIC000.PV).
Using the web tool regexp.com I came up with the following expression, which works for the expression above.
(?!\s)(?<!\w)[+-]?((\d+\.\d*)|(\.\d+)|(\d+))([eE][+-]?\d+)?(\s)|(?<!\w)[0-9]\d+(?<!\s)$
However, when I try to use the above expression with Python's re.findall(), I get a list of 5 tuples, each with 6 elements.
import re
pat = r'(?!\s)(?<!\w)[+-]?((\d+\.\d*)|(\.\d+)|(\d+))([eE][+-]?\d+)?(\s)|(?<!\w)[0-9]\d+(?<!\s)$'
exp = '230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1 '
matches = re.findall(pat,exp)
The result is:
[('2', '', '', '2', 'e3', ' '),
 ('20', '', '', '20', '', ' '),
 ('20.4', '20.4', '', '', '', ' '),
 ('45', '', '', '45', '', ' '),
 ('2', '', '', '2', 'e4', ' ')]
I would like some help to understand what is happening, and whether there is any way to get this done in Python similar to the way it works on regexp.com.
This should take care of it. (All the items are strings)
import re
st = '230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1'
re.findall(r'-?[0-9]+\.?[0-9]*(?:[Ee]\ *-?\ *[0-9]+)|-?\d+\.\d+|\b\d+\b', st)
referred: How to extract numbers from strings,
Extracting scientific numbers from string,
and Extracting decimal values from string
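As for why the original attempt returned tuples: when a pattern contains capturing groups, re.findall returns one tuple slot per group rather than the whole match. A rough sketch of the question's own pattern with every group made non-capturing:
import re
exp = '230FIC000.PV>=-2e3 211FIC00.PV <= 20 100fic>-20.4 tic200 >=45 tic100 <-2E-4 fic123 >1 '
# Same structure as the question's pattern, but with (?:...) instead of (...),
# so findall returns whole matches rather than per-group tuples.
pat = r'(?!\s)(?<!\w)[+-]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][+-]?\d+)?\s|(?<!\w)[0-9]\d+(?<!\s)$'
print([m.strip() for m in re.findall(pat, exp)])
# likely ['-2e3', '20', '-20.4', '45', '-2E-4', '1'] (the final 1 is caught thanks to the trailing space)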

python regex for incomplete decimals numbers

I have a string of numbers which may have an incomplete decimal representation,
for example
a = '1. 1,00,000.00 1 .99 1,000,000.999'
desired output
['1','1,00,000.00','1','.99','1,000,000.999']
So far I have tried the following two patterns.
re.findall(r'[-+]?(\d+(?:[.,]\d+)*)',a)
which gives
['1', '1,00,000.00', '1', '99', '1,000,000.999']
which turns .99 into 99, which is not desired,
while
re.findall(r'[-+]?(\d*(?:[.,]\d+)*)',a)
gives
['1', '', '', '1,00,000.00', '', '', '1', '', '.99', '', '1,000,000.999', '']
which gives undesirable empty string results as well
This is for finding currency values in a string, so the comma separators don't have a set pattern or may not be present at all.
My suggestion is to use the regex below with re.findall.
I've implemented a snippet in Python:
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
result = re.findall(r'\.?\d[\d,]*(?:\.\d+)?', a)
print(result)
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']
You can use re.split:
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
d = re.split('(?<=\d)\.\s+|(?<=\d)\s+', a)
Output:
['1', '1,00,000.00', '1', '.99', '1,000,000.999']
This regex will give you your desired output:
([0-9]+(?=\.))|([0-9,]+\.[0-9]+)|([0-9]+)|(\.[0-9]+)
You can test it here: https://regex101.com/r/VfQIJC/6
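With re.findall in Python, the four capturing groups above would come back as tuples with empty slots; a small sketch with the groups made non-capturing should return the flat list directly:
import re
a = '1. 1,00,000.00 1 .99 1,000,000.999'
# Same alternation as above, with non-capturing groups so findall
# returns whole matches instead of per-group tuples.
pattern = r'(?:[0-9]+(?=\.))|(?:[0-9,]+\.[0-9]+)|(?:[0-9]+)|(?:\.[0-9]+)'
print(re.findall(pattern, a))
# ['1', '1,00,000.00', '1', '.99', '1,000,000.999']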

Python: Regular Expression not working properly

I'm using the following regex; it is supposed to find the string 'U.S.A.', but it only gets 'A.'. Does anyone know what's wrong?
#INPUT
import re
text = 'That U.S.A. poster-print costs $12.40...'
print re.findall(r'([A-Z]\.)+', text)
#OUTPUT
['A.']
Expected Output:
['U.S.A.']
I'm following the NLTK Book, Chapter 3.7 here; it has a set of regexes, but they're just not working. I've tried it in both Python 2.7 and 3.4.
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
nltk.regexp_tokenize() works the same way as re.findall(), so I think somehow my Python here does not process the regex as expected. The regex listed above outputs this:
[('', '', ''),
('A.', '', ''),
('', '-print', ''),
('', '', ''),
('', '', '.40'),
('', '', '')]
Possibly, it's something to do with how regexes were previously compiled using nltk.internals.compile_regexp_to_noncapturing(), which was abolished in v3.1 (see here).
>>> import nltk
>>> nltk.__version__
'3.0.5'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-#&*] # special characters with meanings
... '''
>>>
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
But it doesn't work in NLTK v3.1:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... | \w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... | [+/\-#&*] # special characters with meanings
... '''
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
With a slight modification of how you define your regex groups, you can get the same pattern to work in NLTK v3.1, using this regex:
pattern = r"""(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
|\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
|(?:[+/\-#&*]) # special characters with meanings
"""
In code:
>>> import nltk
>>> nltk.__version__
'3.1'
>>> pattern = r"""
... (?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-#&*]) # special characters with meanings
... """
>>> from nltk.tokenize.regexp import RegexpTokenizer
>>> tokeniser=RegexpTokenizer(pattern)
>>> line="My weight is about 68 kg, +/- 10 grams."
>>> tokeniser.tokenize(line)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
Without NLTK, using Python's re module, we see that the old regex patterns are not supported natively:
>>> pattern1 = r"""(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
... |\w+([-']\w+)* # words w/ optional internal hyphens/apostrophe
... |[+/\-#&*] # special characters with meanings
... |\S\w* # any sequence of word characters#
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern1, text)
[('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
>>> pattern2 = r"""(?x) # set flag to allow verbose regexps
... (?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
... |\d+(?:\.\d+)?%? # numbers, incl. currency and percentages
... |\w+(?:[-']\w+)* # words w/ optional internal hyphens/apostrophe
... |(?:[+/\-#&*]) # special characters with meanings
... """
>>> text="My weight is about 68 kg, +/- 10 grams."
>>> re.findall(pattern2, text)
['My', 'weight', 'is', 'about', '68', 'kg', '+', '/', '-', '10', 'grams']
Note: The change in how NLTK's RegexpTokenizer compiles the regexes would make the examples on NLTK's Regular Expression Tokenizer obsolete too.
Drop the trailing +, or put it inside the group:
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> re.findall(r'([A-Z]\.)+', text)
['A.'] # wrong
>>> re.findall(r'([A-Z]\.)', text)
['U.', 'S.', 'A.'] # without '+'
>>> re.findall(r'((?:[A-Z]\.)+)', text)
['U.S.A.'] # with '+' inside the group
The first part of the text that the regexp matches is "U.S.A." because ([A-Z]\.)+ matches the first group (part within parenthesis) three times. However you can only return one match per group, so Python picks the last match for that group.
If you instead change the regular expression to include the "+" in the group, then the group will only match once and the full match will be returned. For example (([A-Z]\.)+) or ((?:[A-Z]\.)+).
If you instead want three separate results, then just get rid of the "+" sign in the regular expression and it will only match one letter and one dot for each time.
The problem is the "capturing group", aka the parentheses, which have an unexpected effect on the result of findall(): when a capturing group is repeated within a match, only its last capture is kept. Specifically, the regexp correctly matches the entire U.S.A., but findall drops it on the floor and only returns the last group capture.
As this answer says, the re module doesn't support repeated capturing groups, but you could install the third-party regex module, which does handle this correctly. (However, this would be no help to you if you want to pass your regexp to nltk.tokenize.regexp.)
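For reference, a small sketch of how the regex module exposes the repeated captures, assuming it is installed (pip install regex):
import regex  # third-party module, not the stdlib re
text = 'That U.S.A. poster-print costs $12.40...'
m = regex.search(r'([A-Z]\.)+', text)
print(m.group(0))     # 'U.S.A.' -- the full match
print(m.captures(1))  # ['U.', 'S.', 'A.'] -- every capture of group 1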
Anyway, to match U.S.A. correctly, use r'(?:[A-Z]\.)+':
>>> re.findall(r'(?:[A-Z]\.)+', text)
['U.S.A.']
You can apply the same fix to all repeated patterns in the NLTK regexp, and everything will work correctly. As @alvas suggested, NLTK used to make this substitution behind the scenes, but this feature was recently dropped and replaced with a warning in the documentation of the tokenizer. The book is clearly out of date; @alvas filed a bug report about it back in November, but it hasn't been acted on yet...
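As a quick check of that claim, here is the book's pattern from the question with every repeated group made non-capturing, run through plain re.findall (a sketch reusing the question's text):
import re
text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)        # set flag to allow verbose regexps
    (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*          # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                # ellipsis
  | [][.,;"'?():-_`]      # these are separate tokens; includes ], [
'''
print(re.findall(pattern, text))
# expected: ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']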

Grouping data with a regex in Python

I have some raw data like this:
Dear John Buy 1 of Coke, cost 10 dollars
Ivan Buy 20 of Milk
Dear Tina Buy 10 of Coke, cost 100 dollars
Mary Buy 5 of Milk
The rule of the data is:
Not every line starts with "Dear", but when a line does, it ends with a cost.
The item is not always a normal word; it can be an arbitrary string (letters, numbers, etc.).
I want to group the information, and I tried to use a regex. This is what I tried before:
for line in file.readlines():
    match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>\w+)(?:\D+)(?P<costs>\d*)', line)
    if match is not None:
        print(match.groups())
file.close()
Now the output looks like:
('John', '1', 'Coke', '10')
('Ivan', '20', 'Milk', '')
('Tina', '10', 'Coke', '100')
('Mary', '5', 'Milk', '')
The output above is what I want. However, if the item is replaced by some strange string like A1~A10, some of the outputs get the wrong info:
('Ivan', '20', 'A1', '10')
('Mary', '5', 'A1', '10')
I think the one constant in the item field is that it will always be followed by a comma (if there is a cost part), but I just don't know how to take advantage of that.
Though the code above works for now, I think the (?P<item>\w+) has to be replaced with something like (?P<item>.+). If I do so, it puts the wrong string in the tuple, like:
('John', '1', 'Coke, cost 10 dollars', '')
How could I read the data into the format I want by using the regex in Python?
I have tried this regular expression
^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?
Explanation
^(Dear)? matches Dear at the start of the line, if present
(?P<name>\w*) a named capture group to capture the name
\D* matches any non-digit characters
(?P<num>\d+) a named capture group to get the num
\sof\s matches the string " of "
(?P<drink>\w*) to get the drink
(,\D*(?P<cost>\d+)\D*)? this is an optional group to get the cost of the drink
with
>>> reobject = re.compile(r'^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?')
First data snippet
>>> data1 = 'Dear John Buy 1 of Coke, cost 10 dollars'
>>> match_object = reobject.search(data1)
>>> print (match_object.group('name') , match_object.group('num'), match_object.group('drink'), match_object.group('cost'))
('John', '1', 'Coke', '10')
Second data snippet
>>> data2 = ' Ivan Buy 20 of Milk'
>>> match_object = reobject.search(data2)
>>> print (match_object.group('name') , match_object.group('num'), match_object.group('drink'), match_object.group('cost'))
('Ivan', '20', 'Milk', None)
Without regex:
with open('commandes.txt') as f:
    results = []
    for line in f:
        parts = line.split(None, 5)
        price = ''
        if parts[0] == 'Dear':
            tmp = parts[5].split(',', 1)
            for tok in tmp[1].split():
                if tok.isnumeric():
                    price = tok
                    break
            results.append((parts[1], parts[3], tmp[0], price))
        else:
            results.append((parts[0], parts[2], parts[4].split(',')[0], price))
print(results)
It doesn't care what characters are used except the spaces before the product name; that's why each line is split on whitespace with at most 5 splits. When the line starts with "Dear", the last part is split on the comma to extract the product name and the price. Note that if the price is always in the same place (i.e. after "cost"), you can avoid the innermost for loop and replace it with price = tmp[1].split()[1]
Note: if you want to prevent empty lines from being processed, you can change the for loop to:
for line in (x for x in f if x.rstrip()):
I would use this regex:
r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?'
Demo
>>> line = 'Dear Tina Buy 10 of A1~A10'
>>> match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?', line)
>>> match.groups()
('Tina', '10', 'A1~A10', None)
>>> line = 'Dear Tina Buy 10 of A1~A10, cost 100 dollars'
>>> match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?', line)
>>> match.groups()
('Tina', '10', 'A1~A10', '100')
Explanation
The first section of your regex is perfectly fine, here’s the tricky part:
(?P<item>[^,]+) Since we're sure that the string will contain a comma when the cost part is present, here we match anything but a comma to set the item value.
(?:,\D+)?(?P<costs>\d+)? Here we're using two groups. The important thing is the ? after the parentheses enclosing each group:
'?' Causes the resulting RE to match 0 or 1 repetitions of the
preceding RE. ab? will match either ‘a’ or ‘ab’.
So we use ? to match both possibilities (with the cost string present or not)
(?:,\D+) is a non-capturing group that will match a comma followed by anything but digits.
(?P<costs>\d+) will capture any digit in the named group cost.
If you use .+, the subpattern will grab the whole rest of the line as . matches any character but a newline without the re.S flag.
You can replace the \w+ with a negated character class subpattern [^,]+ to match one or more characters other than a comma:
r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)\D*(?P<costs>\d*)'
See the IDEONE demo:
import re
file = "Dear John Buy 1 of A1~A10, cost 10 dollars\n Ivan Buy 20 of Milk\nDear Tina Buy 10 of Coke, cost 100 dollars\n Mary Buy 5 of Milk"
for line in file.split("\n"):
    match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)\D*(?P<costs>\d*)', line)
    if match:
        print(match.groups())
Output:
('John', '1', 'A1~A10', '10')
('Ivan', '20', 'Milk', '')
('Tina', '10', 'Coke', '100')
('Mary', '5', 'Milk', '')

Regex to match a capturing group one or more times

I'm trying to match pairs of digits in a string and capture them in groups; however, I only seem to be able to capture the last group.
Regex:
(\d\d){1,3}
Input String: 123456 789101
Match 1: 123456
Group 1: 56
Match 2: 789101
Group 1: 01
What I want is to capture all the groups like this:
Match 1: 123456
Group 1: 12
Group 2: 34
Group 3: 56
Update
It looks like Python does not let you capture repeated groups (in .NET, for example, you can capture all the groups in a single pass), hence re.findall(r'\d\d', '123456') does the job.
You cannot do that using just a single regular expression: a repeated capturing group only keeps its last capture, so a single match of (\d\d){1,3} can never hand back 12, 34 and 56 as separate groups.
The re library in Python comes with a non-overlapping matching routine, re.findall(), that does the trick, as in:
re.findall('\d\d', '123456')
will return ['12', '34', '56']
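If the pairs need to stay grouped per run of digits, one rough way is to find the runs first and then split each run into pairs:
import re
s = '123456 789101'
pairs_per_match = [re.findall(r'\d\d', token)
                   for token in re.findall(r'\d{2,6}', s)]
print(pairs_per_match)  # [['12', '34', '56'], ['78', '91', '01']]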
(\d{2})+(\d)?
I'm not sure how Python handles its matching, but this is how I would do it.
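For reference, here is roughly what Python's re does with that pattern: the repeated group only keeps its last capture, so findall still loses the earlier pairs:
import re
print(re.findall(r'(\d{2})+(\d)?', '123456 789101'))
# [('56', ''), ('01', '')]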
Try this:
import re
re.findall(r'\d\d','123456')
Is this what you want?
import re
regx = re.compile('(?:(?<= )|(?<=\A)|(?<=\r)|(?<=\n))'
                  '(\d\d)(\d\d)?(\d\d)?'
                  '(?= |\Z|\r|\n)')
for s in (' 112233 58975 6677 981 897899\r',
          '\n123456 4433 789101 41586 56 21365899 362547\n',
          '0101 456899 1 7895'):
    print repr(s),'\n',regx.findall(s),'\n'
result
' 112233 58975 6677 981 897899\r'
[('11', '22', '33'), ('66', '77', ''), ('89', '78', '99')]
'\n123456 4433 789101 41586 56 21365899 362547\n'
[('12', '34', '56'), ('44', '33', ''), ('78', '91', '01'), ('56', '', ''), ('36', '25', '47')]
'0101 456899 1 7895'
[('01', '01', ''), ('45', '68', '99'), ('78', '95', '')]
