This question already has answers here:
Difference between * and + regex
(7 answers)
Closed 5 years ago.
I am new to python and was going through "Google for Education" python course
Now, the line below confuses me:
* -- 0 or more occurrences of the pattern to its left
(all the examples are in python3)
e.g. 1
In [1]: re.search(r"pi*", "piiig!!").group()
Out[1]: 'piii'
This is fine since, "pi" has 1 occurrance so it is retured
e.g. 2
In [2]: re.search(r"i*", "piiig!!").group()
Out[2]: ''
Why does it not return "i" in fact - from my understanding, it should be returning "iii". But the result is an empty string.
Also, What exactly does "0 or more" mean? I searched on google but everywhere it is mentioned * -- 0 or more. But if there is 0 occurrence of an expression, does that not become true even if it's not there? What is the point of searching then?
I am so confused with this. Can you please help me with explaining this or point me in the right direction.
i hope the right explanation would also resolve my this issue:
In [3]: re.search(r"i?", "piiig!!").group()
Out[3]: ''
I have tried the examples in Spyder 3.2.4
The explanation is a bit more complicated than the answers we have seen so far.
First, unlike re.match() the primitive operation re.search() checks for a match anywhere in the string (this is what Perl does by default) and finds the pattern once:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance. Return None if no position in the string matches
the pattern; note that this is different from finding a zero-length
match at some point in the string. See: Ref.
If we follow every step of the regex engine while it tries to find a match, we can observe the following for the pattern i* and the test string piigg!!:
As you can see, the first character (at position 0) produces a match because p is zero times i and the result is an empty match (and not p - because we do not search for p or any other character).
At the second character (position 1) the second match (spanning to position 2) is found since ii is zero or more times i... at position 3 there is another empty match, and so far and so forth.
Because re.search only returns the first match it sticks with the first empty match at position 0. That's why you get the (confusing) result you have posted:
In [2]: re.search(r"i*", "piiig!!").group()
Out[2]: ''
In order to match every occurrence, you need re.findall():
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match. See: Ref.
You need to use *(0 or more) and +(1 or more) properly to get your desired output
Eg: 1 Matches because you have defined * only for "i", this patter will capture all the "p" or "pi" combination
Eg: 2 If you need to match only "i" you need to use "+" instead of "*".
If you use "*"
In: re.search(r"pi*g", "piiig!!").group()
This will return if you input is ("pig" or "piig" or "pg")
If you use "+"
In: re.search(r"pi+g", "piiig!!").group()
This will return if you input is ("pig" or "piig")
Because '' is the first matched result of r'i*' and 'iii' is the second matched result.
In [1]: import re
In [2]: re.findall(r'i*', 'piiig!!')
Out[2]: ['', 'iii', '', '', '', '']
This website will also explain the way how regular expression work.
https://regex101.com/r/XVPXMv/1
The special charecter * means 0 or more occurrence of the preceding character. For eg. a* matches with 0 or more occurrence of a which could be '', 'a', 'aa' etc. This happens because '' has 0 occurrence of a.
To get iii you should have used + instead of * and thus would have got the first non zero sequence of 'i' which is iii
re.search("i+", "piiig!!").group()
Related
I have as str that I want to get the substring inside single quotes ('):
line = "This is a 'car' which has a 'person' in it!"
so I used:
name = re.findall("\'(.+?)\'", line)
print(name[0])
print(name[1])
car
person
But when I try this approach:
pattern = re.compile("\'(.+?)\'")
matches = re.search(pattern, line)
print(matches.group(0))
print(matches.group(1))
# print(matches.group(2)) # <- this produces an error of course
'car'
car
So, my question is why the pattern behaves differently in each case? I know that the former returns "all non-overlapping matches of pattern in string" and the latter match objects which might explain some difference but I would expect with the same pattern same results (even in different format).
So, to make it more concrete:
In the first case with findall the pattern returns all substrings but in the latter case it only return the first substring.
In the latter case matches.group(0) (which corresponds to the whole match according to the documentation) is different than matches.group(1) (which correspond to the first parenthesized subgroup). Why is that?
re.finditer("\'(.+?)\'", line) returns match objects so it functions like re.search.
I know that there are similar question is SO like this one or this one but they don't seem to answer why (or at least I did not get it).
You already read the docs and other answers, so I will give you a hands-on explanation
Let's first take this example from here
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0) # The entire match
'Isaac Newton'
>>> m.group(1) # The first parenthesized subgroup.
'Isaac'
>>> m.group(2) # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2) # Multiple arguments give us a tuple.
('Isaac', 'Newton')
If you go on this website you will find the correspondence with the previous detections
group(0) is taking the full match, group(1) and group(2) are respectively Group 1 and Group 2 in the picture. Because as said here "Match.group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned)"
Now let's go back to your example
As said by others with re.search(pattern, line) you will find ONLY the first occurrence of the pattern ["Scan through string looking for the first location where the regular expression pattern produces a match" as said here] and following the previous logic you will now understand why matches.group(0) will output the full match and matches.group(1) the Group 1. And you will understand why matches.group(2) is giving you error [because as you can see from the screenshot there is not a group 2 for the first occurrence in this last example]
re.findall returns list of matches (in this particular example, first groups of matches), while re.search returns
only first leftmost match.
As stated in python documentation (re.findall):
Return all non-overlapping
matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found. If
one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result.
matches.group(0) gives you whole fragment of string that matches your pattern, that's why it have quotes, while matches.group(1) gives you first parenthesized substring of matching fragment, that means it will not include quotes because they are outside of parentheses. Check Match.group() docs for more information.
From the test string:
test=text-AB123-12a
test=text-AB123a
I have to extract only 'AB123-12' and 'AB123', but:
re.findall("[A-Z]{0,9}\d{0,5}(?:-\d{0,2}a)?", test)
returns:
['', '', '', '', '', '', '', 'AB123-12a', '']
What are all these extra empty spaces? How do I remove them?
The quantifier {0,n} will match anywhere from 0 to n occurrences of the preceding pattern. Since the two patterns you match allow 0 occurrences, and the third is optional (?) it will match 0-length strings, i.e. every character in your string.
Editing to find a minimum of one and maximum of 9 and 5 for each pattern yields correct results:
>>> test='text-AB123-12a'
>>> import re
>>> re.findall("[A-Z]{1,9}\d{1,5}(?:-\d{0,2}a)?", test)
['AB123-12a']
Without further detail about what exactly the strings you are matching look like, I can't give a better answer.
Your pattern is set to match zero length characters with the lower limits of your character set quantifier set to 0. Simply setting to 1 will produce the results you want:
>>> import re
>>> test = ''' test=text-AB123-12a
... test=text-AB123a'''
>>> re.findall("[A-Z]{1,9}\d{1,5}(?:-\d{0,2}a)?", test)
['AB123-12a', 'AB123']
RegEx tester: http://www.regexpal.com/ says that your pattern string [A-Z]{0,9}\d{0,5}(?:-\d{0,2}a)? can match 0 characters, and therefore matches infinitely.
Check your expression one more time. Python gives you undefined result.
Since all parts of your pattern are optional (your ranges specify zero to N occurences and you are qualifying the group with ?), each position in the string counts as a match and most of those are empty matches.
How to prevent this from happening depends on the exact format of what you are trying to match. Are all those parts of your match really optional?
Since letters or digits are optional at the beginning, you must ensure that there's at least one letter or one digit, otherwise your pattern will match the empty string at each position in the string. You can do it starting your pattern with a lookahead. Example:
re.findall(r'(?=[A-Z0-9])[A-Z]{0,9}\d{0,5}(?:-\d\d?)?(?=a)', test)
In this way the match can start with a letter or with a digit.
I assume that when there's an hyphen, it is followed by at least one digit (otherwise what is the reason of this hyphen?). In other words, I assume that -a isn't possible at the end. (correct me if I'm wrong.)
To exclude the "a" from the match result, I putted it in a lookahead.
This the text file abc.txt
abc.txt
aa:s0:education.gov.in
bb:s1:defence.gov.in
cc:s2:finance.gov.in
I'm trying to parse this file by tokenizing (correct me if this is the incorrect term :) ) at every ":" using the following regular expression.
parser.py
import re,sys,os,subprocess
path = "C:\abc.txt"
site_list = open(path,'r')
for line in site_list:
site_line = re.search(r'(\w)*:(\w)*:([\w\W]*\.[\W\w]*\.[\W\w]*)',line)
print('Regex found that site_line.group(2) = '+str(site_line.group(2))
Why is the output
Regex found that site_line.group(2) = 0
Regex found that site_line.group(2) = 1
Regex found that site_line.group(2) = 2
Can someone please help me understand why it matches the last character of the second group ? I think its matching 0 from s0 , 1 from s1 & 2 from s2
But Why ?
Let's show a simplified example:
>>> re.search(r'(.)*', 'asdf').group(1)
'f'
>>> re.search(r'(.*)', 'asdf').group(1)
'asdf'
If you have a repetition operator around a capturing group, the group stores the last repetition. Putting the group around the repetition operator does what you want.
If you were expecting to see data from the third group, that would be group(3). group(0) is the whole match, and group(1), group(2), etc. count through the actual parenthesized capturing groups.
That said, as the comments suggest, regexes are overkill for this.
>>> 'aa:s0:education.gov.in'.split(':')
['aa', 's0', 'education.gov.in']
And first group is entire match by default.
If a groupN argument is zero, the corresponding return value is the
entire matching string.
So you should skip it. And check group(3), if you want last one.
Also, you should compile regexp before for-loop. It increase performance of your parser.
And you can replace (\w)* to (\w*), if you want match all symbols between :.
I really don't understand the following example found on docs.python.org:
>>>> p = re.compile('x*')
>>>> p.sub('-', 'abxd')
'-a-b-d-'
Why the regex 'x*' is matching four times?
I thought the output should be: 'ab-'
* meta character matches 0 or more times. So,
a bx d
^ ^ -- ^
^ is the position where x* matches 0 times and -- is the place where x* matches 1 time. That is why the output is -a-b-d-.
To get the output ab-d, you need to use x+ in the regular expression. It means that match one or more times. So, it will match only the following positions
abxd
^
One update about re.sub since Python 3.7.
Empty matches for the pattern are replaced when adjacent to a previous non-empty match.
The result becomes "-a-b--d-" because that "d" is now having an empty match. In the previous versions of python, this empty match is not allowed since it is adjacent to the matching of "x".
I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start till the end of the string
[...] one of the given character
...{3} three time of the phrase before
(...)* 0 till n times of the characters in the brackets
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$'
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
print match.group('value')
So the regex to check if the string matches pattern will be
data_string = "abc,bca,df"
match = re.match(r'^([abc]{3}(,|$))+', data_string)
if match:
print "data string is correct"
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match a string containing three characters of [abc] followed by a comma one ore more times anywhere in the string. So the most important part is to make sure that there is nothing more in the string - as scessor suggests with adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> def matcher(x):
total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
for i in x.split(','):
if i not in total:
return False
return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplet of letters a,b,and c:
for ss in ("abc,bbc,abb,baa,bbb",
"acc",
"abc,bbc,abb,bXa,bbb",
"abc,bbc,ab,baa,bbb"):
print ss,' ',bool(re.match('([abc]{3},?)+\Z',ss))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like contruct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that this pattern has match groups. To actually get the groups, use str.extract. and
the Series.str.extract, Series.str.extractall and Series.str.findall will behave as re.findall.