Regex Quantifiers in Python (trying to match a group) [duplicate]

My question is about metacharacters in Python:
import re
string = 'Python 123'
print(re.search('(\d)+',string)) # It matches perfectly
<re.Match object; span=(7, 10), match='123'>
But with the ? or * quantifiers:
print(re.search('(\d)?',string))
<re.Match object; span=(0, 0), match=''>
Or
print(re.search('(\d)*',string))
<re.Match object; span=(0, 0), match=''>
My question is: why don't ? and * match the digits in the string, and why do they show span=(0, 0) instead?

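The second and third regexes allow the empty string '' to be matched, because ? and * both permit zero repetitions.
re.search scans the string from left to right and returns the first match it finds, as documented: https://docs.python.org/3/library/re.html#re.search. The character at index 0 is 'P', which is not a digit, so zero repetitions succeed right there and the result is an empty match with span (0, 0). With + at least one digit is required, so the search continues until it reaches '123'.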
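To see the empty match concretely, the sketch below reruns the three patterns from the question and then lists every match of (\d)* with re.finditer. The variable names are illustrative only.

```python
import re

string = 'Python 123'

# + needs at least one digit, so the search skips ahead to '123'.
plus = re.search(r'(\d)+', string)
print(plus.span(), repr(plus.group()))

# ? and * accept zero digits, so they succeed immediately at index 0.
for pattern in (r'(\d)?', r'(\d)*'):
    m = re.search(pattern, string)
    print(pattern, m.span(), repr(m.group()))

# Listing all matches shows that the digits are matched later in the string,
# after a run of empty matches at the non-digit positions.
star_matches = [m.group() for m in re.finditer(r'(\d)*', string)]
print(star_matches)
```

If the goal is simply to find the digits, \d+ with re.search (or re.findall) avoids the empty matches altogether.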


Validate string format based on format

I have an issue with the following task.
I have a string:
ABCD[A] or A7D3[A,B,C]
First 4 Characters are 0-9 or A-Z.
5th character is [.
6th to nth characters are letters A-Z, separated by , when there is more than one letter,
e.g. A or E,F or A,B,C,D,F. I don't know whether the middle part has a length limit, so I have to assume 26 letters (A-Z).
last character is ].
I need to verify, that the structure of the string is as stated above.
ABCD[A,B]
BD1F[E,G,A,R]
S4P5[C]
I tried a regex (in Python):
r = re.match('^[0-9A-Z]{4}[[A-Z,]+$',text)
where text is an example of the string, but it is not working.
A true/false or 0/1 result would be fine.
Any ideas how this could be done? From what I've seen on Google so far, a regex should work, but I'm not proficient enough with regexes to solve this by myself.
You can use '[0-9A-Z]{4}\[[A-Z](?:,[A-Z]){,25}\]':
import re
for s in ['ABCD[A,B]', 'BD1F[E,G,A,R]', 'S4P5[C]']:
    print(re.fullmatch(r'[0-9A-Z]{4}\[[A-Z](?:,[A-Z]){,25}\]', s))
Note that (?:,[A-Z]){,25} caps the number of letters inside the square brackets at 26 (one mandatory letter plus up to 25 more) but does not ensure that the letters are distinct.
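If repeated letters should also be rejected, one option is to capture the bracket contents and compare them against a set. This is only a sketch: is_valid_code and CODE_RE are hypothetical names, and the capture group around the letter list is an addition to the pattern above.

```python
import re

# Same pattern as above, with a capture group around the letter list.
CODE_RE = re.compile(r'[0-9A-Z]{4}\[([A-Z](?:,[A-Z]){,25})\]')

def is_valid_code(text):
    """Return True if text has the required shape and no repeated letter."""
    m = CODE_RE.fullmatch(text)
    if m is None:
        return False
    letters = m.group(1).split(',')
    # A set drops duplicates, so the sizes differ when a letter repeats.
    return len(letters) == len(set(letters))

for s in ['ABCD[A,B]', 'S4P5[C]', 'S4P5[CD]', 'ABCD[A,A]']:
    print(s, is_valid_code(s))
```

The function returns a plain bool, which matches the true/false result the question asks for.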
Output:
<re.Match object; span=(0, 9), match='ABCD[A,B]'>
<re.Match object; span=(0, 13), match='BD1F[E,G,A,R]'>
<re.Match object; span=(0, 7), match='S4P5[C]'>
You can try:
import re
lst = ["ABCD[A,B]", "BD1F[E,G,A,R]", "S4P5[C]", "S4P5[CD]"]
pattern = r"^[A-Z0-9]{4}\[[A-Z](?:,[A-Z])*]$"
for string in lst:
    m = re.match(pattern, string)
    print(bool(m), m)
output:
True <re.Match object; span=(0, 9), match='ABCD[A,B]'>
True <re.Match object; span=(0, 13), match='BD1F[E,G,A,R]'>
True <re.Match object; span=(0, 7), match='S4P5[C]'>
False None
Explanation:
^: beginning of the string.
[A-Z0-9]{4}: the first 4 characters.
\[: an escaped (literal) opening bracket.
[A-Z]: the first letter inside the brackets, which is mandatory.
(?:,[A-Z])*: any further letters, each preceded by a comma; this part is optional.
]$: the closing bracket followed by the end of the string.
Note-1: You could cap the number of letters inside the brackets at 26 by changing * to {,25}.
Note-2: I didn't escape the last bracket; escaping it does no harm and is arguably clearer.

Python Regular expression simple

I want to find all occurrences of LocalizedString(...) in a text file. Anything could appear between the parentheses. How can I find these using regular expressions?
I searched online but had no luck.
Thank you
Let's say we have a file with the following content:
class MyClass {
    func myFunc(){
        var text = LocalizedString("arrow.Find")
    }
}
r"LocalizedString\([^)]*\)"
LocalizedString\( matches LocalizedString( literally.
[^)]* matches anything that isn't a closing parenthesis.
\) matches a closing parenthesis literally.
Example:
>>> import re
>>> regex = re.compile(r"LocalizedString\([^)]*\)")
>>> regex.match('LocalizedString(a,b)')
<re.Match object; span=(0, 20), match='LocalizedString(a,b)'>
Note that the match for code like LocalizedString(a(x),b) is not the entire string:
>>> regex.match('LocalizedString(a(x),b)')
<re.Match object; span=(0, 20), match='LocalizedString(a(x)'>
This happens because [^)]* stops at the first closing parenthesis, and regular expressions in Python's re module can't handle arbitrarily nested parentheses such as a(b(c(x()))).
You could also greedily match everything up to the last closing parenthesis in the string, but this over-matches when other parenthesized code follows the call:
>>> regex2 = re.compile(r"LocalizedString\(.*\)")
>>> regex2.match('LocalizedString(a(x),b)')
<re.Match object; span=(0, 23), match='LocalizedString(a(x),b)'>
>>> regex2.match('LocalizedString(a(x),b) && print("Hello!")')
<re.Match object; span=(0, 42), match='LocalizedString(a(x),b) && print("Hello!")'>
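When the arguments may contain nested parentheses, one workaround is to let the regex locate the opening LocalizedString( and then count parentheses by hand. The following is a minimal sketch, not part of the original answer: find_calls is a hypothetical helper, and it assumes that parentheses never appear inside string literals or comments.

```python
import re

def find_calls(source, name='LocalizedString'):
    """Return each name(...) call in source, with balanced parentheses."""
    calls = []
    for m in re.finditer(re.escape(name) + r'\(', source):
        depth = 1
        pos = m.end()
        # Walk forward until the parenthesis opened by the match is closed.
        while pos < len(source) and depth > 0:
            if source[pos] == '(':
                depth += 1
            elif source[pos] == ')':
                depth -= 1
            pos += 1
        # Unbalanced calls (depth never returns to 0) are skipped.
        if depth == 0:
            calls.append(source[m.start():pos])
    return calls

print(find_calls('LocalizedString(a(x),b) && print("Hello!")'))
```

Unlike the two patterns above, this returns the whole call without swallowing the trailing print(...).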
To match just what's inside the parentheses:
r"LocalizedString\(([^)]*)\)"
Then the capture group ([^)]*) will contain the required data:
>>> regex3 = re.compile(r"LocalizedString\(([^)]*)\)")
>>> regex3.match('LocalizedString(a,b)').groups()
('a,b',)
>>> regex3.match('LocalizedString(a(x),b)').groups()
('a(x',)
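Since the question asks for all occurrences in a file, re.findall or re.finditer is a closer fit than match, which only tries the start of the string. The sketch below uses the sample file content as an in-memory string; the second LocalizedString line is an assumption added for illustration, and with a real file the source variable would hold the file's text instead.

```python
import re

source = '''class MyClass {
    func myFunc(){
        var text = LocalizedString("arrow.Find")
        var other = LocalizedString("arrow.Next")
    }
}'''

pattern = re.compile(r'LocalizedString\(([^)]*)\)')

# With exactly one capture group, findall returns that group's contents.
arguments = pattern.findall(source)
print(arguments)

# finditer also exposes the full match and its position.
for m in pattern.finditer(source):
    print(m.span(), m.group(0))
```

The same limitation with nested parentheses applies here, since the pattern is unchanged.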

Matching '//' or EOL '$' in regex giving inconsistent results when grouped (in Python)? [duplicate]

Anyone know why these two regexes give different results when trying to match either '//' or '$'? (Python 3.6.4)
(a)(//|$) : Matches both 'a' and 'a//'
(a)(//)|($) : Matches 'a//' but not 'a'
>>> at = re.compile('(a)(//|$)')
>>> m = at.match('a')
>>> m
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> m = at.match('a//')
>>> m
<_sre.SRE_Match object; span=(0, 3), match='a//'>
>>>
vs
>>> at = re.compile('(a)(//)|($)')
>>> m = at.match('a//')
>>> m
<_sre.SRE_Match object; span=(0, 3), match='a//'>
>>> m = at.match('a')
>>> m
>>> type(m)
<class 'NoneType'>
>>>
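The regex engine treats everything on each side of a pipe as one alternative, up to the enclosing group (or the whole pattern if there is none).
In the first case,
(a)(//|$)
matches a string that has an a followed by either // or $ (end of string).
The alternation is confined to the second group: the first alternative is // and the second is $, and both must follow an a.
In this expression, the capturing groups are:
Group 1: a
Group 2: either // or $
In the second case,
(a)(//)|($)
matches either a// or $.
The alternation spans the whole pattern: the first alternative is a// and the second is $ on its own. With re.match on 'a', the first alternative fails because no // follows, and the second fails because position 0 is not the end of the string, so there is no match.
In this expression, the capturing groups are:
Either
Group 1: a
Group 2: //
or
Group 3: $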
In fact, the grouping doesn't matter in the second example, a//|$ will give the same result, since the regex engine will evaluate it as (a//)|$ (note the parentheses are just symbolic for my example, they do not represent capture group syntax).
Try it out in a regex tester. It'll tell you what the alternatives are for each expression
| has low precedence, so (a)(//)|($) means ((a)(//))|($); it will therefore match either ((a)(//)) or ($). To get the same results as the first regex, use (a)((//)|($)), which is the first regex with extra groups added. The first regex is cleaner and should be preferred unless you need the extra groups.
See here for more details on precedence - https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04_08
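The difference is easy to check by printing the groups each pattern produces. A small sketch (the variable names grouped and top_level are illustrative):

```python
import re

grouped = re.compile(r'(a)(//|$)')      # alternation inside group 2
top_level = re.compile(r'(a)(//)|($)')  # alternation splits the whole pattern

# Group 2 is '' when the $ alternative is the one that matched.
print(grouped.match('a').groups())      # ('a', '')
print(grouped.match('a//').groups())    # ('a', '//')

# Three groups exist; the ones in the unused alternative are None.
print(top_level.match('a//').groups())  # ('a', '//', None)
print(top_level.match('a'))             # None

# search() can still succeed through the bare $ alternative,
# as an empty match at the end of the string.
print(top_level.search('a').span())     # (1, 1)
```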

Python: Difference between "re.match(pattern)" v/s "re.search('^' + pattern)" [duplicate]

While reading the docs, I found out that the whole difference between re.match() and re.search() is that re.match() starts checking only from the beginning of the string.
>>> import re
>>> a = 'abcde'
>>> re.match(r'b', a)
>>> re.search(r'b', a)
<_sre.SRE_Match object at 0xffe25c98>
>>> re.search(r'^b', a)
>>>
Is there anything I am misunderstanding, or is there no difference at all between re.search('^' + pattern) and re.match(pattern)?
Is it a good practice to only use re.search()?
Take a look at the "search() vs. match()" section of the Python re documentation, which describes one more difference:
Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.
>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
<_sre.SRE_Match object; span=(4, 5), match='X'>
The first difference (for future readers) being:
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
For example:
>>> re.match("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
<_sre.SRE_Match object; span=(2, 3), match='c'>
Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:
>>> re.match("c", "abcdef") # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef") # Match
<_sre.SRE_Match object; span=(0, 1), match='a'>
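The MULTILINE case is the one where match(pattern) and search('^' + pattern) actually diverge. The sketch below puts the variants side by side; the \A line is an addition that is not in the quoted documentation excerpt, included because \A anchors at the start of the string regardless of flags.

```python
import re

text = 'A\nB\nX'

# match() only ever tries index 0, even in MULTILINE mode.
m1 = re.match(r'X', text, re.MULTILINE)

# In MULTILINE mode, ^ also matches right after each newline.
m2 = re.search(r'^X', text, re.MULTILINE)

# Without MULTILINE, ^ matches only at the start of the string.
m3 = re.search(r'^X', text)

# \A always means start of string, whatever the flags.
m4 = re.search(r'\AX', text, re.MULTILINE)

print(m1, m2, m3, m4, sep='\n')
```

So search(r'\A' + pattern) behaves like match(pattern) in every mode, while search('^' + pattern) does so only when MULTILINE is off.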
From a code golfing perspective, I'd say there is some use in keeping the two functions separate.
If you're matching from the beginning of the string, re.match is preferable to re.search because its name is one character shorter, which saves a byte. Furthermore, with re.search you also have to add the start-of-line anchor ^ to match from the start. With re.match the anchor is implied, which saves another byte.

Python re (regex) matching particular string containing letters, hyphen, numbers

I am trying to match the following strings in python 2.7 using the python regular expression package re and am having trouble coming up with the regex code:
https://www.this.com/john-smith/e5609239
https://www.this.com/jane-johnson/e426609216
https://www.this.com/wendy-saad/e172645609215
https://www.this.com/nick-madison/e7265609214
https://www.this.com/tom-taylor/e17265709211
https://www.this.com/james-bates/e9212
So the prefix is fixed, "https://www.this.com/", and then there are a variable number of lowercase letters, then "-", then more lowercase letters, then "/e", then a variable number of digits.
Here is what I have tried to no avail:
href=re.compile("https://www.this.com/people-search/[a-z]+[\-](?P<firstNumBlock>\d+)/")
href=re.compile("https://www.this.com/people-search/[a-z][\-][a-z]+/e[0-9]+")
Thanks for your help!
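Your patterns fail for structural reasons rather than because of escaping. The first expects digits immediately after the hyphen, and in the second [a-z] has no +, so it matches only a single letter before the hyphen. Two further cleanups are worthwhile: use a raw string so that backslash sequences such as \d reach the regex engine unchanged, and drop the character class around the hyphen, since a literal - outside a character class needs neither brackets nor a backslash. You can simplify your expression as follows: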
expression = r"https://www.this.com/people-search/[a-z]+-[a-z]+/e\d+"
With the following data:
strings = ['https://www.this.com/people-search/john-smith/e5609239',
           'https://www.this.com/people-search/jane-johnson/e426609216',
           'https://www.this.com/people-search/wendy-saad/e172645609215',
           'https://www.this.com/people-search/nick-madison/e7265609214',
           'https://www.this.com/people-search/tom-taylor/e17265709211',
           'https://www.this.com/people-search/james-bates/e9212']
Result:
>>> for s in strings:
...     print(re.match(expression, s))
...
<_sre.SRE_Match object; span=(0, 54), match='https://www.this.com/people-search/john-smith/e56>
<_sre.SRE_Match object; span=(0, 58), match='https://www.this.com/people-search/jane-johnson/e>
<_sre.SRE_Match object; span=(0, 59), match='https://www.this.com/people-search/wendy-saad/e17>
<_sre.SRE_Match object; span=(0, 59), match='https://www.this.com/people-search/nick-madison/e>
<_sre.SRE_Match object; span=(0, 58), match='https://www.this.com/people-search/tom-taylor/e17>
<_sre.SRE_Match object; span=(0, 52), match='https://www.this.com/people-search/james-bates/e9>
href = re.compile(r"https://www\.this\.com/people-search/[a-z]+-[a-z]+/e[0-9]+")
re.compile(r'https://www.this.com/[a-z-]+/e\d+')
[a-z-]+ matches john-smith
e\d+ matches e5609239
import re
from pprint import pprint

text = '''https://www.this.com/john-smith/e5609239
https://www.this.com/jane-johnson/e426609216
https://www.this.com/wendy-saad/e172645609215
https://www.this.com/nick-madison/e7265609214
https://www.this.com/tom-taylor/e17265709211
https://www.this.com/james-bates/e9212'''
href = re.compile(r'https://www\.this\.com/[a-zA-Z]+\-[a-zA-Z]+/e[0-9]+')
m = href.findall(text)
pprint(m)
Outputs:
['https://www.this.com/john-smith/e5609239',
 'https://www.this.com/jane-johnson/e426609216',
 'https://www.this.com/wendy-saad/e172645609215',
 'https://www.this.com/nick-madison/e7265609214',
 'https://www.this.com/tom-taylor/e17265709211',
 'https://www.this.com/james-bates/e9212']
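If the name and the number are needed separately, named groups combined with re.fullmatch are one way to validate a single URL and extract both parts. This is a sketch under assumptions: PROFILE_RE and the group names name and id are made up for illustration, and re.fullmatch requires Python 3.4 or later, so it does not apply to the Python 2.7 setup in the question as-is.

```python
import re

# Dots are escaped so that '.' matches only a literal dot.
PROFILE_RE = re.compile(
    r'https://www\.this\.com/(?P<name>[a-z]+-[a-z]+)/e(?P<id>\d+)'
)

urls = [
    'https://www.this.com/john-smith/e5609239',
    'https://www.this.com/james-bates/e9212',
    'https://wwwXthis.com/john-smith/e5609239',        # rejected: not a literal dot
    'https://www.this.com/john-smith/e5609239/extra',  # rejected: trailing text
]

for url in urls:
    m = PROFILE_RE.fullmatch(url)
    print(url, m.groupdict() if m else None)
```

With the unescaped www.this.com used in some of the patterns above, the third URL would be accepted, because an unescaped . matches any character.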
