Regex: why does `findall` find nothing, but `search` works? [duplicate]

>>> import re
>>> reg = re.compile(r'^\d{1,3}(,\d{3})*$')
>>> s = '42'
>>> reg.search(s).group()
'42'
>>> reg.findall(s)
['']
Why does reg.findall find nothing, but reg.search works in this piece of code above?

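When the regex contains a capture group (a part wrapped in parentheses), findall returns what the group captured instead of the whole match. In your case the group (,\d{3})* takes no part in matching '42', so findall reports it as an empty string. Make the group non-capturing with ?: if you want findall to return the whole match. re.search, on the other hand, returns a match object, and calling .group() on it with no arguments gives the whole match regardless of capture groups. Both behaviors are reflected in the documentation: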
re.findall:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group.
re.search:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance. Return None if no position in the string matches
the pattern; note that this is different from finding a zero-length
match at some point in the string.
import re
reg = re.compile(r'^\d{1,3}(?:,\d{3})*$')
s = '42'
reg.search(s).group()
# '42'
reg.findall(s)
# ['42']
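To make the difference easier to see, here is a minimal sketch that runs both forms of the pattern over a few inputs. The sample strings and variable names are my own and are not from the question.

```python
import re

capturing = re.compile(r'^\d{1,3}(,\d{3})*$')
non_capturing = re.compile(r'^\d{1,3}(?:,\d{3})*$')

# Illustrative inputs (assumed, not from the original post).
samples = ['42', '1,234', '1,234,567']

# With one capturing group, findall returns what that group captured
# (its last repetition), or '' when the group took no part in the match.
with_group = [capturing.findall(s) for s in samples]

# With no capturing group, findall returns the whole match.
whole_match = [non_capturing.findall(s) for s in samples]

print(with_group)   # [[''], [',234'], [',567']]
print(whole_match)  # [['42'], ['1,234'], ['1,234,567']]
```

Note that a repeated group only keeps its last repetition, which is why '1,234,567' yields ',567' rather than both comma groups.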

Related

RegEx Python not working [duplicate]

My regex pattern is not working. Why?
import re
string = "../../example/tobematched/nonimportant.html"
pattern = r"example\/([a-z]+)\/"
test = re.match(pattern, string)
# None
http://www.regexr.com/39mpu
re.match() only matches at the beginning of the string. You need re.search(), which looks for the first location where the regular expression pattern produces a match and returns a corresponding match object.
>>> import re
>>> s = "../../example/tobematched/nonimportant.html"
>>> re.search(r'example/([a-z]+)/', s).group(1)
'tobematched'
Try this:
test = re.search(pattern, string)
re.match only tries to match at the start of the string, so it returns None here.
Grab the whole match with test.group(), or the captured part with test.group(1).
In short:
search ⇒ finds a match anywhere in the string and returns a match object.
match ⇒ finds a match only at the beginning of the string and returns a match object.
That is why you have to use
foo = re.search(pattern, bar)
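A small sketch of the three anchoring behaviors side by side, using the string from the question. The second pattern (r"\.\./") and the variable names are my own additions for illustration; re.fullmatch needs Python 3.4 or later.

```python
import re

s = "../../example/tobematched/nonimportant.html"
pattern = r"example/([a-z]+)/"

at_start = re.match(pattern, s)     # only tried at index 0, so no match here
anywhere = re.search(pattern, s)    # scans forward until the pattern fits
prefix = re.match(r"\.\./", s)      # match succeeds when the pattern fits at the start
whole = re.fullmatch(r"\.\./", s)   # fullmatch must consume the entire string

print(at_start)            # None
print(anywhere.group(1))   # tobematched
print(prefix.group())      # ../
print(whole)               # None
```

So match is not "match the whole string"; it only pins the start. If you really need the whole string to match, fullmatch (or a trailing $) is the tool for that.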

Modify regular expression

I am trying to get the first pair of digits from "09_135624.jpg".
My code now:
import re
string = "09_135624.jpg"
pattern = r"(?P<pair>(.*))_135624.jpg"
match = re.findall(pattern, string)
print(match)
Output:
[('09', '09')]
Why do I get a tuple in the output?
Can you help me modify my code to get this:
['09']
Or:
'09'
re.findall returns different results depending on the number of capturing groups in the pattern. With a single group it returns a list of strings:
>>> re.findall(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg")
['09']
According to the documentation:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
Alternative using re.search:
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg")
<_sre.SRE_Match object at 0x00000000025D0D50>
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg").group('pair')
'09'
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg").group(1)
'09'
UPDATE
To match . literally, you need to escape it with a backslash: \.
Try this:
(?P<pair>(?:.*))_135624.jpg
You are getting two results because the original pattern captures the same text twice, once in the named group and once in the inner parentheses. I have modified it to capture only once:
http://regex101.com/r/lS5tT3/62
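To summarize how the number of groups changes the shape of what findall returns, here is a short sketch using the filename from the question. The variable names are mine, and the dot is escaped in all three patterns.

```python
import re

name = "09_135624.jpg"

# Two capturing groups (named group plus inner parentheses): list of tuples.
nested = re.findall(r"(?P<pair>(.*))_135624\.jpg", name)

# One capturing group: list of strings holding that group's text.
single = re.findall(r"(?P<pair>.*)_135624\.jpg", name)

# No capturing group: list of strings holding the whole match.
no_group = re.findall(r".*_135624\.jpg", name)

print(nested)    # [('09', '09')]
print(single)    # ['09']
print(no_group)  # ['09_135624.jpg']
```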

Search preceding and following characters of re [duplicate]

I am trying to find the characters immediately before and after a regex match in a given string. This is the code.
>>> import re
>>> s = 'dafddadffdbdasbffsbbfdbabbfsdfadsfdfddf'  # completely garbage test string
>>> re.findall('.{0,5}(abb).{0,5}', s)
['abb']
The test string has an occurrence of 'abb' here: ...fdbabbfsd... My understanding is that the special character . matches any character other than \n, and that {m,n} causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible, as the documentation states.
So I expect my regex to return ['bbfdbabbfsdfa'] and not just ['abb']. What am I missing?
It's because of the capturing group. Just move the parentheses:
re.findall('(.{0,5}abb.{0,5})',s)
When the pattern contains a group, findall returns only what the group captured, so everything you want returned needs to be inside the parentheses.
According to re.findall documentation:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group.
So either wrapping the whole pattern in a group or removing the group will give you what you want.
>>> re.findall('(.{0,5}abb.{0,5})',s) # Entire pattern as a group
['bbfdbabbfsdfa']
>>> re.findall('.{0,5}abb.{0,5}',s) # No capturing group
['bbfdbabbfsdfa']
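If you want the keyword and its context separately, one option is to give each part its own group and read them off a match object. This is only a sketch: the group names before, hit, and after are my own choice, not something from the question.

```python
import re

s = 'dafddadffdbdasbffsbbfdbabbfsdfadsfdfddf'

# Hypothetical group names, chosen here for readability.
pattern = re.compile(r'(?P<before>.{0,5})(?P<hit>abb)(?P<after>.{0,5})')

m = pattern.search(s)
whole = m.group(0)   # group 0 is always the entire match
parts = (m.group('before'), m.group('hit'), m.group('after'))

print(whole)                # bbfdbabbfsdfa
print(parts)                # ('bbfdb', 'abb', 'fsdfa')
print(pattern.findall(s))   # [('bbfdb', 'abb', 'fsdfa')]
```

With three groups, findall returns a list of tuples, while the match object still gives you the full text through group(0).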

Python re.search

I have a string variable containing
string = "123hello456world789"
The string contains no spaces. I want to write a regex that prints only the words made of letters (a-z).
I tried a simple regex:
pat = "([a-z]+){1,}"
match = re.search(pat, string, re.DEBUG)
The match object contains only the word hello; the word world is not matched.
When I used re.findall() I could get both hello and world.
My question is: why can't we do this with re.search()?
How can I do this with re.search()?
re.search() finds the pattern only once in the string. From the documentation:
Scan through string looking for a location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance. Return None if no position in the string matches
the pattern; note that this is different from finding a zero-length
match at some point in the string.
In order to match every occurrence, you need re.findall(). From the documentation:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
Example:
>>> import re
>>> regex = re.compile(r'([a-z]+)', re.I)
>>> # using search we only get the first item.
>>> regex.search("123hello456world789").groups()
('hello',)
>>> # using findall we get every item.
>>> regex.findall("123hello456world789")
['hello', 'world']
UPDATE:
Due to your duplicate question (as discussed at this link) I have added my other answer here as well:
>>> import re
>>> regex = re.compile(r'([a-z][a-z-\']+[a-z])')
>>> regex.findall("HELLO W-O-R-L-D") # this has uppercase
[] # there are no results here, because the string is uppercase
>>> regex.findall("HELLO W-O-R-L-D".lower()) # lets lowercase
['hello', 'w-o-r-l-d'] # now we have results
>>> regex.findall("123hello456world789")
['hello', 'world']
As you can see, the first sample you provided failed because of the uppercase letters. You could simply add the re.IGNORECASE flag, though you mentioned that matches should be lowercase only.
@InbarRose's answer shows why re.search works that way, but if you want match objects rather than just the string output of re.findall, use re.finditer:
>>> for match in re.finditer(pat, string):
...     print(match.groups())
...
('hello',)
('world',)
>>>
Or alternatively if you wanted a list
>>> list(re.finditer(pat, string))
[<_sre.SRE_Match object at 0x022DB320>, <_sre.SRE_Match object at 0x022DB660>]
It's also generally a bad idea to use string as a variable name, given that it is also the name of a standard-library module.
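On the "how to do this with re.search()" part: one possible sketch is to call search repeatedly on a compiled pattern, moving the start position past each match. The names text, word_re, and found are mine. In practice finditer does the same job with less code.

```python
import re

text = "123hello456world789"
word_re = re.compile(r"[a-z]+")

# Repeated search: a compiled pattern's search() accepts a start position.
found = []
pos = 0
while True:
    m = word_re.search(text, pos)
    if m is None:
        break
    found.append(m.group())
    pos = m.end()   # continue right after the previous match

# finditer yields the same matches as match objects, with positions.
hits = [(m.group(), m.span()) for m in word_re.finditer(text)]

print(found)  # ['hello', 'world']
print(hits)   # [('hello', (3, 8)), ('world', (11, 16))]
```

The loop is safe here because [a-z]+ can never match an empty string; a pattern that can match nothing would need extra care to avoid looping forever.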

Simple regex to match a string containing a certain word [duplicate]

I am trying to match text containing a word (let's say 'word'). I am using the following regex:
r = re.compile(r'\bword\b')
When I try this regex I get the following results:
r.match('a word a') > None
r.match(' word ') > None
r.match('word') > match
Shouldn't all three strings match?
From the docs:
re.search(pattern, string, flags=0) Scan through string looking for a
location where the regular expression pattern produces a match, and
return a corresponding MatchObject instance. Return None if no
position in the string matches the pattern; note that this is
different from finding a zero-length match at some point in the
string.
re.match(pattern, string, flags=0) If zero or more characters at the
beginning of string match the regular expression pattern, return a
corresponding MatchObject instance. Return None if the string does not
match the pattern; note that this is different from a zero-length
match.
Note that even in MULTILINE mode, re.match() will only match at the
beginning of the string and not at the beginning of each line.
If you want to locate a match anywhere in string, use search() instead
So, just do r.search(...) and you should get what you want.
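A quick sketch comparing the two calls on the strings from the question. The extra 'swordfish' sample is my own addition to show that \b still does its job with search.

```python
import re

r = re.compile(r'\bword\b')

# First three strings are from the question; 'swordfish' is an added example.
samples = ['a word a', ' word ', 'word', 'swordfish']

match_results = [bool(r.match(s)) for s in samples]
search_results = [bool(r.search(s)) for s in samples]

print(match_results)   # [False, False, True, False]
print(search_results)  # [True, True, True, False]
```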
