Python regex group() works explanation - python

Could someone please explain why each print below gives a different result? Thanks.
import re
s = "-h5ello"
m = re.match(r"-\w(\d\w+)", s)
print(' m.group(): ', m.group())
print(' m.group(0): ', m.group(0))
print(' m.group(1): ', m.group(1))

m.group() and m.group(0) both return the entire text matched by the pattern (not necessarily the whole input string), provided there was a match.
They are identical because the group argument defaults to zero; conceptually, the method is defined as:
def group(num=0):
As for the matches:
m.group(1), m.group(2), ... return the captured groups (in your case there is only one).
More about match groups can be found in the docs.

m.group() and m.group(0) should be, and are, identical.
m.group(1) only gives you the match from inside the first pair of parentheses.
EDIT to clarify what a "matched group" is:
In regular expressions, plain parentheses are called "capturing groups", because they capture submatches into numbered groups. Consider this:
import re
m = re.match(r'a(b)c(d(e)f)g', 'abcdefg')
print(m.group())
# => 'abcdefg'
print(m.groups())
# => ('b', 'def', 'e')
m.group(0), or equivalently m.group(), is the whole match. Parentheses pick out submatches, numbered by the position of their opening parenthesis: the first pair yields m.group(1), the second m.group(2), and the third m.group(3).
In your example, you have parentheses too. They do not include -\w, so your m.group(1) does not include the -h part of your string; it only includes the submatch for \d\w+, which is 5ello.
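A minimal Python 3 sketch of the same point, using the string from the question. The variable names (whole, first, all_groups) are only for illustration:

```python
import re

s = "-h5ello"
m = re.match(r"-\w(\d\w+)", s)

whole = m.group()        # same as m.group(0): everything the pattern matched
first = m.group(1)       # only what the first pair of parentheses captured
all_groups = m.groups()  # tuple of all captured groups, excluding group 0

print(whole)       # -h5ello
print(first)       # 5ello
print(all_groups)  # ('5ello',)
```

Note that groups() never includes the whole match; it starts at group 1.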

Related

get the value instead of the object from python regex search [duplicate]

I am trying to use a regular expression to extract words inside of a pattern.
I have some string that looks like this
someline abc
someother line
name my_user_name is valid
some more lines
I want to extract the word my_user_name. I do something like
import re
s = #that big string
p = re.compile("name .* is valid", re.flags)
p.match(s) # this gives me <_sre.SRE_Match object at 0x026B6838>
How do I extract my_user_name now?
You need a capture group in the regex. Search for the pattern and, if it is found, retrieve the captured string with group(index). Assuming the necessary checks have been performed:
>>> p = re.compile("name (.*) is valid")
>>> result = p.search(s)
>>> result
<_sre.SRE_Match object at 0x10555e738>
>>> result.group(1) # group(1) will return the 1st capture (stuff within the brackets).
# group(0) will return the entire matched text.
'my_user_name'
You can use matching groups:
p = re.compile('name (.*) is valid')
e.g.
>>> import re
>>> p = re.compile('name (.*) is valid')
>>> s = """
... someline abc
... someother line
... name my_user_name is valid
... some more lines"""
>>> p.findall(s)
['my_user_name']
Here I use re.findall rather than re.search to get all instances of my_user_name. Using re.search, you'd need to get the data from the group on the match object:
>>> p.search(s) #gives a match object or None if no match is found
<_sre.SRE_Match object at 0xf5c60>
>>> p.search(s).group() #entire string that matched
'name my_user_name is valid'
>>> p.search(s).group(1) #first group that match in the string that matched
'my_user_name'
As mentioned in the comments, you might want to make your regex non-greedy:
p = re.compile('name (.*?) is valid')
to only pick up the text between 'name ' and the next ' is valid' (rather than allowing the group to swallow an earlier ' is valid' when the text contains more than one).
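To make the difference concrete, here is a small sketch comparing the two patterns. The input string is made up for illustration and deliberately contains two ' is valid' phrases on one line:

```python
import re

# Hypothetical input with two "is valid" phrases on the same line
text = "name alice is valid and name bob is valid"

greedy = re.search(r"name (.*) is valid", text).group(1)
lazy = re.search(r"name (.*?) is valid", text).group(1)
every_name = re.findall(r"name (.*?) is valid", text)

print(greedy)      # alice is valid and name bob
print(lazy)        # alice
print(every_name)  # ['alice', 'bob']
```

The greedy group runs to the last ' is valid' it can reach, while the non-greedy one stops at the first.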
You could use something like this:
import re
# the sample text from the question
s = """someline abc
someother line
name my_user_name is valid
some more lines"""
# the parentheses create a group with what was matched
# and '\w' matches only letters, digits and underscores
p = re.compile(r"name +(\w+) +is valid")
# use search(), so the match doesn't have to happen
# at the beginning of "big string"
m = p.search(s)
# search() returns a Match object with information about what was matched,
# or None if nothing matched
if m:
    name = m.group(1)
else:
    raise Exception('name not found')
You can use groups (indicated with '(' and ')') to capture parts of the string. The match object's group() method then gives you the group's contents:
>>> import re
>>> s = 'name my_user_name is valid'
>>> match = re.search('name (.*) is valid', s)
>>> match.group(0) # the entire match
'name my_user_name is valid'
>>> match.group(1) # the first parenthesized subgroup
'my_user_name'
In Python 3.6+ you can also index into a match object instead of using group():
>>> match[0] # the entire match
'name my_user_name is valid'
>>> match[1] # the first parenthesized subgroup
'my_user_name'
Maybe that's a bit shorter and easier to understand:
import re
text = '... someline abc... someother line... name my_user_name is valid.. some more lines'
>>> re.search('name (.*) is valid', text).group(1)
'my_user_name'
You want a capture group.
p = re.compile("name (.*) is valid") # parentheses for capture groups
# search(), not match(): the line is not at the very start of s
print(p.search(s).groups()) # This gives you a tuple of the captured groups.
Here's a way to do it without using groups (Python 3.6 or above):
>>> re.search(r'2\d\d\d[01]\d[0-3]\d', 'report_20191207.xml')[0]
'20191207'
You can also use a capture group (?P<user>pattern) and access the group like a dictionary match['user'].
import re
string = '''someline abc\n
someother line\n
name my_user_name is valid\n
some more lines\n'''
pattern = r'name (?P<user>.*) is valid'
# re.DOTALL is not needed here: the name sits on a single line
matches = re.search(pattern, string)
print(matches['user'])
# my_user_name
I found this answer via Google because I wanted to unpack a re.search() result with multiple groups directly into multiple variables. While this might be obvious to some, it was not to me because I had always used group() in the past, so maybe it helps someone in the future who also did not know about groups().
s = "2020:12:30"
year, month, day = re.search(r"(\d+):(\d+):(\d+)", s).groups()
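One caveat with unpacking directly: re.search() returns None when nothing matches, and calling .groups() on None raises AttributeError. A minimal sketch of a guarded version follows; the helper name parse_date is hypothetical and not part of the re module:

```python
import re

def parse_date(s):
    """Return (year, month, day) as strings, or None if s does not match."""
    m = re.search(r"(\d+):(\d+):(\d+)", s)
    if m is None:
        return None
    return m.groups()

result = parse_date("2020:12:30")
if result is not None:
    year, month, day = result
    print(year, month, day)  # 2020 12 30

print(parse_date("no date here"))  # None
```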
It seems like you're actually trying to extract a name rather than simply find a match. If so, having the span indexes of your match is helpful, and I'd recommend using re.finditer. As a shortcut, you know the 'name ' part of your regex has length 5 and ' is valid' has length 9, so you can slice the matching text to extract the name.
Note - In your example, it looks like s is a string with line breaks, so that's what's assumed below.
## convert s to a list of strings, one per line:
s2 = s.splitlines()
## find matches by line:
for i, j in enumerate(s2):
    ## finditer() simply yields nothing for lines without a match
    for k in re.finditer("name (.*) is valid", j):
        ## get text
        match_txt = k.group(0)
        ## get line span
        match_span = k.span(0)
        ## extract username
        my_user_name = match_txt[5:-9]
        ## compare with original text
        print(f'Extracted Username: {my_user_name} - found on line {i}')
        print('Match Text:', match_txt)
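As an alternative to slicing with hard-coded lengths, the match object can report the span of the capture group itself. The sketch below assumes the sample text from the question; the names found and line_no are only for illustration:

```python
import re

# Sample text from the question
s = """someline abc
someother line
name my_user_name is valid
some more lines"""

found = []
for line_no, line in enumerate(s.splitlines()):
    for m in re.finditer(r"name (.*) is valid", line):
        # span(1) gives the (start, end) offsets of group 1 within the line
        start, end = m.span(1)
        found.append((line_no, line[start:end], (start, end)))

print(found)  # [(2, 'my_user_name', (5, 17))]
```

This keeps working if the literal text around the group changes, since no lengths are baked into the code.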

python regular expression grouping

My regular expression goal:
"If the sentence has a '#' in it, group all the stuff to the left of the '#' and group all the stuff to the right of the '#'. If the sentence doesn't have a '#', then just return the entire sentence as one group"
Examples of the two cases:
A) '120x4#Words' -> ('120x4', 'Words')
B) '120x4#9.5' -> ('120x4#9.5')
I made a regular expression that parses case A correctly
(.*)(?:#(.*))
# List the groups found
>>> r.groups()
(u'120x4', u'words')
But of course this won't work for case B -- I need to make "# and everything to the right of it" optional
So I tried to use the '?' "zero or one" operator on that second grouping to indicate it's optional.
(.*)(?:#(.*))?
But it gives me bad results. The first grouping eats up the entire string.
# List the groups found
>>> r.groups()
(u'120x4#words', None)
Guess I'm either misunderstanding the none-or-one '?' operator and how it works on groupings or I am misunderstanding how the first group is acting greedy and grabbing the entire string. I did try to make the first group 'reluctant', but that gave me a total no-match.
(.*?)(?:#(.*))?
# List the groups found
>>> r.groups()
(u'', None)
Simply use the standard str.split function:
s = '120x4#Words'
x = s.split( '#' )
If you still want a regex solution, use the following pattern:
([^#]+)(?:#(.*))?
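A short sketch of why this pattern behaves: [^#]+ cannot run past a '#', so the first group stops there and the optional part picks up the rest. The helper name split_on_hash and the hash-free input '120x4' are made up for illustration:

```python
import re

PATTERN = re.compile(r"([^#]+)(?:#(.*))?")

def split_on_hash(text):
    m = PATTERN.fullmatch(text)
    if m is None:  # e.g. an empty string or a string starting with '#'
        return None
    # Drop the second group when the optional '#...' part was absent
    return tuple(g for g in m.groups() if g is not None)

print(split_on_hash("120x4#Words"))  # ('120x4', 'Words')
print(split_on_hash("120x4"))        # ('120x4',)
```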
(.*?)#(.*)|(.+)
This should work. See demo.
http://regex101.com/r/oC3nN4/14
use re.split :
>>> import re
>>> a='120x4#Words'
>>> re.split('#',a)
['120x4', 'Words']
>>> b='120x4#9.5'
>>> re.split('#',b)
['120x4#9.5']
>>>
Here's a verbose re solution. But, you're better off using str.split.
import re
REGEX = re.compile(r'''
\A
(?P<left>.*?)
(?:
[#]
(?P<right>.*)
)?
\Z
''', re.VERBOSE)
def parse(text):
    match = REGEX.match(text)
    if match:
        return tuple(filter(None, match.groups()))
print(parse('120x4#Words'))
print(parse('120x4#9.5'))
Better solution
def parse(text):
    return text.split('#', maxsplit=1)
print(parse('120x4#Words'))
print(parse('120x4#9.5'))

Match any word in string except those preceded by a curly brace in python

I have a string like
line = u'I need to match the whole line except for {thisword for example'
I have a difficulty doing this. What I've tried and it doesn't work:
# in general case there will be Unicode characters in the pattern
matchobj = re.search(ur'[^\{].+', line)
matchobj = re.search(ur'(?!\{).+', line)
Could you please help me figure out what's wrong and how to do it right?
P.S. I don't think I need to substitute "{thisword" with empty string
I am not exactly clear on what you need. From your question's title, it looks like you want to find "all words in a string (e.g. line) that don't start with {", but you are using the re.search() function, which confuses me.
re.search() and re.findall()
The function re.search() returns a corresponding MatchObject instance. re.search is usually used to find and return one occurrence of a pattern in a long string. It doesn't return all possible matches. See the simple example below:
>>> re.search('a', 'aaa').group(0) # only first match
'a'
>>> re.search('a', 'aaa').group(1) # group(1) is a capture group, not the second match; this pattern has none
Traceback (most recent call last):
File "<console>", line 1, in <module>
IndexError: no such group
With the regex 'a', search returns only one match, 'a', in the string 'aaa'; it doesn't return all possible matches.
If your objective is to find "all words in a string that don't start with {", you should use the re.findall() function, which matches all occurrences of a pattern, not just the first one as re.search() does. See the example:
>>> re.findall('a', 'aaa')
['a', 'a', 'a']
Edit: Based on a comment, adding one more example to demonstrate the use of re.search and re.findall:
>>> re.search('a+', 'not itnot baaal laaaaaaall ').group()
'aaa' # only the first run ('aaa' in 'baaal') is returned; the later 'aaaaaaa' is not
>>> re.findall('a+', 'not itnot baaal laaaaaaall ')
['aaa', 'aaaaaaa'] # both runs are matched
Here is a good tutorial for Python re module: re – Regular Expressions
Additionally, there is the concept of a group in Python regex: "a matching pattern within parentheses". If one or more groups are present in your regex pattern, re.findall() returns a list of groups; this will be a list of tuples if the pattern has more than one group. See below:
>>> re.findall('(a(b))', 'abab') # 2 groups according to 2 pair of ( )
[('ab', 'b'), ('ab', 'b')] # list of tuples of groups captured
In Python, the regex (a(b)) contains two groups, one per pair of parentheses (this is unlike regular expressions in formal languages; regexes are not exactly the same as regular expressions in formal languages, but that is a different matter).
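The shape of the re.findall() result depends on how many groups the pattern has. A small Python 3 sketch on the same 'abab' input; the variable names are only for illustration:

```python
import re

text = "abab"

no_groups = re.findall(r"ab", text)       # list of whole matches
one_group = re.findall(r"a(b)", text)     # list of that single group's text
two_groups = re.findall(r"(a(b))", text)  # list of tuples, one item per group

print(no_groups)   # ['ab', 'ab']
print(one_group)   # ['b', 'b']
print(two_groups)  # [('ab', 'b'), ('ab', 'b')]
```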
Answer: The words in the sentence line are separated by spaces (or appear at the start of the string), so the regex should be:
ur"(^|\s)(\w+)"
Regex description:
(^|\s) means: the word is either at the start of the string or follows a whitespace character.
\w+: matches one or more alphanumeric characters, including "_".
On applying regex r to your line:
>>> import pprint # for pretty-print, you can ignore these two lines
>>> pp = pprint.PrettyPrinter(indent=4)
>>> r = ur"(^|\s)(\w+)"
>>> L = re.findall(r, line)
>>> pp.pprint(L)
[ (u'', u'I'),
(u' ', u'need'),
(u' ', u'to'),
(u' ', u'match'),
(u' ', u'the'),
(u' ', u'whole'),
(u' ', u'line'),
(u' ', u'except'),
(u' ', u'for'), # notice 'for' twice in a row:
(u' ', u'for'), # '{thisword' between them is not included
(u' ', u'example')]
>>>
To get just the words from the line, use:
>>> [t[1] for t in re.findall(r, line)]
Note: it will skip { or any other special character in the line because \w only matches alphanumeric and '_' characters.
If you specifically want to skip a word only when { appears at its start (in the middle it is allowed), then use the regex: r = ur"(^|\s+)(?P<word>[^{]\S*)".
To understand the difference between this regex and the other one, check this example:
>>> r = ur"(^|\s+)(?P<word>[^{]\S*)"
>>> [t[1] for t in re.findall(r, "I am {not yes{ what")]
['I', 'am', 'yes{', 'what']
Without Regex:
You could achieve the same thing simply without any regex as follows:
>>> [w for w in line.split() if w[0] != '{']
re.sub() to replace pattern
If you just want to remove one (or more) words starting with {, you should use re.sub() to replace patterns that start with { with the empty string "". Check the following code:
>>> r = ur"{\w+"
>>> re.findall(r, line)
[u'{thisword']
>>> re.sub(r, "", line)
u'I need to match the whole line except for for example'
Edit Adding Comment's reply:
The (?P<name>...) is Python's Regex extension: (it has meaning in Python) - (?P<name>...) is similar to regular parentheses - create a group (a named group). The group is accessible via the symbolic group name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. example-1:
>>> r = "(?P<capture_all_A>A+)"
>>> mo = re.search(r, "aaaAAAAAAbbbaaaaa")
>>> mo.group('capture_all_A')
'AAAAAA'
example-2: suppose you want to extract the name from a name line that may also contain a title, e.g. mr. Use the regex: name_re = "(?P<title>(mr|ms)\.?)? ?(?P<name>[a-z ]*)"
We can read the name in the input string using group('name'):
>>> re.search(name_re, "mr grijesh chauhan").group('name')
'grijesh chauhan'
>>> re.search(name_re, "grijesh chauhan").group('name')
'grijesh chauhan'
>>> re.search(name_re, "ms. xyz").group('name')
'xyz'
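When a pattern has several named groups, groupdict() returns them all at once. A minimal Python 3 sketch with the same name_re; groups that did not take part in the match come back as None:

```python
import re

name_re = r"(?P<title>(mr|ms)\.?)? ?(?P<name>[a-z ]*)"

with_title = re.search(name_re, "mr grijesh chauhan").groupdict()
without_title = re.search(name_re, "grijesh chauhan").groupdict()

print(with_title)     # {'title': 'mr', 'name': 'grijesh chauhan'}
print(without_title)  # {'title': None, 'name': 'grijesh chauhan'}
```

Only named groups appear in the dictionary; the unnamed (mr|ms) group is still reachable by number.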
You can simply do:
(?<!{)(\b\w+\b) with the g flag enabled (all matches)
Demo: http://regex101.com/r/zA0sL6
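Python's re module has no g flag; re.findall() plays that role. A sketch of the same idea in Python 3, with the brace escaped and the capturing parentheses dropped so that findall() returns the whole matches:

```python
import re

line = "I need to match the whole line except for {thisword for example"

# (?<!\{) rejects a word that starts right after '{'; the leading \b keeps
# the regex from matching the tail of that word ('hisword') instead
words = re.findall(r"(?<!\{)\b\w+\b", line)

print(words)
# ['I', 'need', 'to', 'match', 'the', 'whole', 'line', 'except', 'for', 'for', 'example']
```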
Try this pattern:
(.*)(?:\{\w+)\s(.*)
Code:
import re
p = re.compile(r'(.*)(?:\{\w+)\s(.*)')
line = "I need to match the whole line except for {thisword for example"
p.match(line)
Example:
http://regex101.com/r/wR8eP6

Regex: How do I exclude commas when in my pattern?

I want to find the first number that is sandwiched with commas on either end, and I came up with this:
m = re.search("\,([0-9])*\,",line)
However, this returns to me the number with the commas, how do I exclude them?
m.group(0) returns
',1620693,'
group(0) will always return the entire match.
See python documentation:
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0) # The entire match
'Isaac Newton'
>>> m.group(1) # The first parenthesized subgroup.
'Isaac'
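Use m.group(1). You also don't need to escape (backslash) the commas. m.group(0) refers to the entire match, and each number after that refers to a captured group. Note that the quantifier must go inside the parentheses, as in ,([0-9]+), because with ([0-9])* the group itself is repeated and m.group(1) holds only the last digit.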
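A small sketch showing what group(1) holds depending on where the quantifier sits. The input line is made up for illustration, built around the number from the question:

```python
import re

line = "a,1620693,b"  # hypothetical input line

repeated_group = re.search(r",([0-9])*,", line)  # quantifier outside the group
whole_number = re.search(r",([0-9]+),", line)    # quantifier inside the group

print(repeated_group.group(0))  # ,1620693,
print(repeated_group.group(1))  # 3
print(whole_number.group(1))    # 1620693
```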

Python regular expression for a sentence does not want to match

Can anyone explain why this re (in Python):
pattern = re.compile(r"""
^
([[a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+\s{1}]+)
([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+) # Last word.
\.{1}
$
""", re.VERBOSE + re.UNICODE)
if re.match(pattern, line):
does not match "A sentence."
I would actually like to return the entire sentence (including the period) as a returned group (), but have been failing miserably.
I think that maybe you meant to do this:
(([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+\s{1})+)
^ ^
I don't think the nested square brackets you had do what you think they do.
This regex works:
pattern = re.compile(r"""
^
([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+\s{1})+
([a-zA-Zàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]+) # Last word.
\.{1}
$
""", re.VERBOSE + re.UNICODE)
line = "A sentence."
match = re.match(pattern, line)
>>> print("'%s'" % match.group(0))
'A sentence.'
>>> print("'%s'" % match.group(1))
'A '
>>> print("'%s'" % match.group(2))
'sentence'
To return the entire match (line in this case), use match.group(0).
Because the first match group can match multiple times (once for each word except the last one), match.group(1) only holds its final repetition, that is, the next-to-last word of the sentence.
Btw, the {1} notation is not necessary in this case; matching once and only once is the default behavior, so this bit can be removed.
The extra set of square brackets definitely weren't helping you :)
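To see the "last repetition wins" behavior, and one way around it, here is a sketch with a made-up three-word sentence and a plain [a-zA-Z] class for brevity. Wrapping a non-capturing repetition in a single capturing group keeps all of the leading words:

```python
import re

line = "A short sentence."  # hypothetical three-word input

# Repeated capturing group: group(1) is overwritten on every repetition
m1 = re.match(r"^([a-zA-Z]+\s)+([a-zA-Z]+)\.$", line)
# Non-capturing repetition wrapped in one capturing group keeps all of it
m2 = re.match(r"^((?:[a-zA-Z]+\s)+)([a-zA-Z]+)\.$", line)

print(repr(m1.group(1)))  # 'short '
print(repr(m2.group(1)))  # 'A short '
print(m2.group(2))        # sentence
```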
It turns out the following actually works and includes all the extended ascii characters I wanted
^
([\w+\s{1}]+\w{1}\.{1})
$
