I am trying to get first pair of numbers from "09_135624.jpg"
My code now:
import re
string = "09_135624.jpg"
pattern = r"(?P<pair>(.*))_135624.jpg"
match = re.findall(pattern, string)
print match
Output:
[('09', '09')]
Why I have tuple in output?
Can you help me modify my code to get this:
['09']
Or:
'09'
re.findall returns differently according to the number of capturing group in the pattern:
>>> re.findall(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg")
['09']
According to the documentation:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
Alternative using re.search:
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg")
<_sre.SRE_Match object at 0x00000000025D0D50>
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg").group('pair')
'09'
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg").group(1)
'09'
UPDATE
To match . literally, you need to escape it: \..
(?P<pair>(?:.*))_135624.jpg
Try this. You are getting two results because you are capturing them twice. I have modified it to capture only once:
http://regex101.com/r/lS5tT3/62
Related
I am having trouble understanding findall, which says...
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
Why doesn't this basic IP regex work with findall as expected? The matches are not overlapping, and regexpal confirms that pattern is highlighted in re_str.
Expected: ['1.2.2.3', '123.345.34.3']
Actual: ['2.', '34.']
re_str = r'(\d{1,3}\.){3}\d{1,3}'
line = 'blahblah -- 1.2.2.3 blah 123.345.34.3'
matches = re.findall(re_str, line)
print(matches) # ['2.', '34.']
When you use parentheses in your regex, re.findall() will return only the parenthesized groups, not the entire matched string. Put a ?: after the ( to tell it not to use the parentheses to extract a group, and then the results should be the entire matched string.
This is because capturing groups return only the last match if they're repeated.
Instead, you should make the repeating group non-capturing, and use a non-repeated capture at an outer layer:
re_str = r'((?:\d{1,3}\.){3}\d{1,3})'
Note that for findall, if there is no capturing group, the whole match is automatically selected (like \0), so you could drop the outer capture:
re_str = r'(?:\d{1,3}\.){3}\d{1,3}'
I have a regular expression to match all instances of 1 followed by a letter. I would like to remove all these instances.
EXPRESSION = re.compile(r"1([A-Z])")
I can use re.split.
result = EXPRESSION.split(input)
This would return a list. So we could do
result = ''.join(EXPRESSION.split(input))
to convert it back to a string.
or
result = EXPRESSION.sub('', input)
Are there any differences to the end result?
Yes, the results are different. Here is a simple example:
import re
EXPRESSION = re.compile(r"1([A-Z])")
s = 'hello1Aworld'
result_split = ''.join(EXPRESSION.split(s))
result_sub = EXPRESSION.sub('', s)
print('split:', result_split)
print('sub: ', result_sub)
Output:
split: helloAworld
sub: helloworld
The reason is that because of the capture group, EXPRESSION.split(s) includes the A, as noted in the documentation:
re.split = split(pattern, string, maxsplit=0, flags=0)
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings. If
capturing parentheses are used in pattern, then the text of all
groups in the pattern are also returned as part of the resulting
list. If maxsplit is nonzero, at most maxsplit splits occur,
and the remainder of the string is returned as the final element
of the list.
When removing the capturing parentheses, i.e., using
EXPRESSION = re.compile(r"1[A-Z]")
then so far I have not found a case where result_split and result_sub are different, even after reading this answer to a similar question about regular expressions in JavaScript, and changing the replacement string from '' to '-'.
I have as str that I want to get the substring inside single quotes ('):
line = "This is a 'car' which has a 'person' in it!"
so I used:
name = re.findall("\'(.+?)\'", line)
print(name[0])
print(name[1])
car
person
But when I try this approach:
pattern = re.compile("\'(.+?)\'")
matches = re.search(pattern, line)
print(matches.group(0))
print(matches.group(1))
# print(matches.group(2)) # <- this produces an error of course
'car'
car
So, my question is why the pattern behaves differently in each case? I know that the former returns "all non-overlapping matches of pattern in string" and the latter match objects which might explain some difference but I would expect with the same pattern same results (even in different format).
So, to make it more concrete:
In the first case with findall the pattern returns all substrings but in the latter case it only return the first substring.
In the latter case matches.group(0) (which corresponds to the whole match according to the documentation) is different than matches.group(1) (which correspond to the first parenthesized subgroup). Why is that?
re.finditer("\'(.+?)\'", line) returns match objects so it functions like re.search.
I know that there are similar question is SO like this one or this one but they don't seem to answer why (or at least I did not get it).
You already read the docs and other answers, so I will give you a hands-on explanation
Let's first take this example from here
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0) # The entire match
'Isaac Newton'
>>> m.group(1) # The first parenthesized subgroup.
'Isaac'
>>> m.group(2) # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2) # Multiple arguments give us a tuple.
('Isaac', 'Newton')
If you go on this website you will find the correspondence with the previous detections
group(0) is taking the full match, group(1) and group(2) are respectively Group 1 and Group 2 in the picture. Because as said here "Match.group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned)"
Now let's go back to your example
As said by others with re.search(pattern, line) you will find ONLY the first occurrence of the pattern ["Scan through string looking for the first location where the regular expression pattern produces a match" as said here] and following the previous logic you will now understand why matches.group(0) will output the full match and matches.group(1) the Group 1. And you will understand why matches.group(2) is giving you error [because as you can see from the screenshot there is not a group 2 for the first occurrence in this last example]
re.findall returns list of matches (in this particular example, first groups of matches), while re.search returns
only first leftmost match.
As stated in python documentation (re.findall):
Return all non-overlapping
matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found. If
one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result.
matches.group(0) gives you whole fragment of string that matches your pattern, that's why it have quotes, while matches.group(1) gives you first parenthesized substring of matching fragment, that means it will not include quotes because they are outside of parentheses. Check Match.group() docs for more information.
This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 5 years ago.
>>> reg = re.compile(r'^\d{1,3}(,\d{3})*$')
>>> str = '42'
>>> reg.search(str).group()
'42'
>>> reg.findall(str)
['']
>>>
python regex
Why does reg.findall find nothing, but reg.search works in this piece of code above?
When you have capture groups (wrapped with parenthesis) in the regex, findall will return the match of the captured group; And in your case the captured group matches an empty string; You can make it non capture with ?: if you want to return the whole match; re.search ignores capture groups on the other hand. These are reflected in the documentation:
re.findall:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group.
re.search:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance. Return None if no position in the string matches
the pattern; note that this is different from finding a zero-length
match at some point in the string.
import re
reg = re.compile(r'^\d{1,3}(?:,\d{3})*$')
s = '42'
reg.search(s).group()
# '42'
reg.findall(s)
# ['42']
I have a string variable containing
string = "123hello456world789"
string contain no spacess. I want to write a regex such that prints only words containing(a-z)
I tried a simple regex
pat = "([a-z]+){1,}"
match = re.search(r""+pat,word,re.DEBUG)
match object contains only the word Hello and the word World is not matched.
When is used re.findall() I could get both Hello and World.
My question is why we can't do this with re.search()?
How do this with re.search()?
re.search() finds the pattern once in the string, documenation:
Scan through string looking for a location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance. Return None if no position in the string matches
the pattern; note that this is different from finding a zero-length
match at some point in the string.
In order to match every occurrence, you need re.findall(), documentation:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
Example:
>>> import re
>>> regex = re.compile(r'([a-z]+)', re.I)
>>> # using search we only get the first item.
>>> regex.search("123hello456world789").groups()
('hello',)
>>> # using findall we get every item.
>>> regex.findall("123hello456world789")
['hello', 'world']
UPDATE:
Due to your duplicate question (as discussed at this link) I have added my other answer here as well:
>>> import re
>>> regex = re.compile(r'([a-z][a-z-\']+[a-z])')
>>> regex.findall("HELLO W-O-R-L-D") # this has uppercase
[] # there are no results here, because the string is uppercase
>>> regex.findall("HELLO W-O-R-L-D".lower()) # lets lowercase
['hello', 'w-o-r-l-d'] # now we have results
>>> regex.findall("123hello456world789")
['hello', 'world']
As you can see, the reason why you were failing on the first sample you provided is because of the uppercase, you can simply add the re.IGNORECASE flag, though you mentioned that matches should be lowercase only.
#InbarRose answer shows why re.search works that way, but if you want match objects rather than just the string outputs from re.findall, use re.finditer
>>> for match in re.finditer(pat, string):
... print match.groups()
...
('hello',)
('world',)
>>>
Or alternatively if you wanted a list
>>> list(re.finditer(pat, string))
[<_sre.SRE_Match object at 0x022DB320>, <_sre.SRE_Match object at 0x022DB660>]
It's also generally a bad idea to use string as a variable name given that it's a common module.