questions about Python regex


Why does the following pattern produce a match of "A cat" rather than "a hat", given that matching is greedy by default?
>>> m = re.match(r'(\w+) (\w+)', "A cat jumpped over a hat")
>>> m
<_sre.SRE_Match object; span=(0, 5), match='A cat'>
Could someone shed some light on this?

From the official Python documentation on regexes:
re.match() checks for a match only at the beginning of the string.

From the official documentation:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

As others have alluded to, re.match starts at the beginning of the string and consumes only as much as the pattern needs. Greediness applies to each quantifier (each \w+ takes as many word characters as it can), not to where in the string the match is found. The match='A cat' at the end of the object's repr shows what r'(\w+) (\w+)' matched in "A cat jumpped over a hat".
If you add a $ to the end of your pattern, requiring the string to end right after the second word, there is no match. And if you apply that same anchored pattern to a string of only two words, it matches again:
>>> re.match(r'(\w+) (\w+)', "A cat jumpped over a hat")
<_sre.SRE_Match object; span=(0, 5), match='A cat'>
>>> re.match(r'(\w+) (\w+)$', "A cat jumpped over a hat")
>>> re.match(r'(\w+) (\w+)$', "A cat")
<_sre.SRE_Match object; span=(0, 5), match='A cat'>
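
To see the difference in practice, here is a minimal sketch contrasting re.match with re.search and re.findall on the same string. The variable names are mine, not the asker's, and it assumes Python 3 with only the standard re module:

```python
import re

text = "A cat jumpped over a hat"
pattern = r'(\w+) (\w+)'

# re.match only tries position 0, so it reports the first two words.
first = re.match(pattern, text)

# re.search with a trailing $ scans forward until the pattern fits at the end.
last = re.search(pattern + '$', text)

# re.findall collects every non-overlapping match, left to right.
pairs = re.findall(pattern, text)

print(first.group())   # A cat
print(last.group())    # a hat
print(pairs)           # [('A', 'cat'), ('jumpped', 'over'), ('a', 'hat')]
```

So "a hat" is reachable, but only with a function that is allowed to look past the start of the string.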

Related

Validate string format based on format

I have an issue with the following task.
I have a string:
ABCD[A] or A7D3[A,B,C]
The first 4 characters are 0-9 or A-Z.
The 5th character is [.
The 6th to nth characters are letters A-Z, separated by , if there is more than one letter,
e.g. A or E,F or A,B,C,D,F. I don't know whether there is a limit on the middle part, so I have to assume it is 26 letters (A-Z).
The last character is ].
I need to verify that the structure of the string is as stated above. Examples:
ABCD[A,B]
BD1F[E,G,A,R]
S4P5[C]
I tried this regex (in Python):
r = re.match('^[0-9A-Z]{4}[[A-Z,]+$',text)
where text is an example of the string, but it is not working.
A True/False or 0/1 result would be fine.
Any ideas how this could be done? From what I've seen on Google so far, a regex would work, but I'm not proficient enough with regexes to solve this by myself.

You can use '[0-9A-Z]{4}\[[A-Z](?:,[A-Z]){,25}\]':
import re
for s in ['ABCD[A,B]', 'BD1F[E,G,A,R]', 'S4P5[C]']:
    print(re.fullmatch(r'[0-9A-Z]{4}\[[A-Z](?:,[A-Z]){,25}\]', s))
Note that (?:,[A-Z]){,25} limits the number of letters in the square brackets to 26 (one mandatory letter plus up to 25 more) but does not ensure that they are distinct.
Output:
<re.Match object; span=(0, 9), match='ABCD[A,B]'>
<re.Match object; span=(0, 13), match='BD1F[E,G,A,R]'>
<re.Match object; span=(0, 7), match='S4P5[C]'>

You can try:
import re
lst = ["ABCD[A,B]", "BD1F[E,G,A,R]", "S4P5[C]", "S4P5[CD]"]
pattern = r"^[A-Z0-9]{4}\[[A-Z](?:,[A-Z])*]$"
for string in lst:
    m = re.match(pattern, string)
    print(bool(m), m)
output:
True <re.Match object; span=(0, 9), match='ABCD[A,B]'>
True <re.Match object; span=(0, 13), match='BD1F[E,G,A,R]'>
True <re.Match object; span=(0, 7), match='S4P5[C]'>
False None
Explanation:
^: beginning of the string.
[A-Z0-9]{4} for getting the first 4 characters.
\[ for escaping the bracket.
[A-Z] first character inside bracket is mandatory.
(?:,[A-Z])* the rest would be optional.
]$: the closing bracket, then the end of the string.
Note-1: You could restrict the letters inside the brackets to at most 26 by changing * to {,25} (one mandatory letter plus up to 25 more).
Note-2: I didn't escape the closing bracket; escaping it as \] doesn't hurt and is arguably clearer.
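
Since the question asks for a plain true/false result, both answers can be wrapped in a small helper. This is only a sketch: the names is_valid and CODE_RE are made up for illustration, and the optional duplicate check is an assumption about what the asker might want, not something the question requires. It uses fullmatch so that no ^ or $ anchors are needed (with re.match and $, a string that ends in a newline would still pass).

```python
import re

# Hypothetical helper: four characters 0-9/A-Z, then 1 to 26
# comma-separated letters inside square brackets.
CODE_RE = re.compile(r'[0-9A-Z]{4}\[([A-Z](?:,[A-Z]){,25})\]')

def is_valid(text, unique=False):
    """Return True if text has the required shape.

    With unique=True (an assumed extra requirement), repeated letters
    inside the brackets are rejected as well.
    """
    m = CODE_RE.fullmatch(text)
    if m is None:
        return False
    if unique:
        letters = m.group(1).split(',')
        return len(letters) == len(set(letters))
    return True

for s in ['ABCD[A,B]', 'A7D3[A,B,C]', 'S4P5[CD]', 'ABCD[A,A]']:
    print(s, is_valid(s), is_valid(s, unique=True))
```

The capture group around the letter list exists only for the duplicate check; drop it if that check is not needed.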

Python Regular expression simple

I want to find all occurrences of LocalizedString(...) in a text file. Anything could appear between the parentheses. How can I find this using regular expressions?
I searched online but had no luck.
Thank you
Let's say we have a file with the following content:
class MyClass {
    func myFunc(){
        var text = LocalizedString("arrow.Find")
    }
}

r"LocalizedString\([^)]*\)"
LocalizedString\( matches LocalizedString( literally.
[^)]* matches anything that isn't a closing parenthesis.
\) matches a closing parenthesis literally.
Example:
>>> import re
>>> regex = re.compile(r"LocalizedString\([^)]*\)")
>>> regex.match('LocalizedString(a,b)')
<re.Match object; span=(0, 20), match='LocalizedString(a,b)'>
Note that the match for code like LocalizedString(a(x),b) is not the entire string:
>>> regex.match('LocalizedString(a(x),b)')
<re.Match object; span=(0, 20), match='LocalizedString(a(x)'>
This happens because [^)]* stops at the first closing parenthesis; Python's re module has no way to match arbitrarily nested parentheses such as a(b(c(x()))).
You could also match greedily with .*, which runs up to the last closing parenthesis on the line. This can overshoot, as the second example here shows:
>>> regex2 = re.compile(r"LocalizedString\(.*\)")
>>> regex2.match('LocalizedString(a(x),b)')
<re.Match object; span=(0, 23), match='LocalizedString(a(x),b)'>
>>> regex2.match('LocalizedString(a(x),b) && print("Hello!")')
<re.Match object; span=(0, 42), match='LocalizedString(a(x),b) && print("Hello!")'>
To match just what's inside the parentheses:
r"LocalizedString\(([^)]*)\)"
Then the capture group ([^)]*) will contain the required data:
>>> regex3 = re.compile(r"LocalizedString\(([^)]*)\)")
>>> regex3.match('LocalizedString(a,b)').groups()
('a,b',)
>>> regex3.match('LocalizedString(a(x),b)').groups()
('a(x',)
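
The examples above use match, which only looks at the start of the string. To find all occurrences in a file, as the question asks, findall or finditer can be applied to the whole text. A minimal sketch, using the snippet from the question as an in-memory string; the second LocalizedString call is added purely for illustration, and reading from a real file is left out:

```python
import re

# Sample text based on the question; the "arrow.Next" line is an
# illustrative addition, not part of the original file.
source = '''class MyClass {
    func myFunc(){
        var text = LocalizedString("arrow.Find")
        var other = LocalizedString("arrow.Next")
    }
}'''

regex = re.compile(r'LocalizedString\(([^)]*)\)')

# findall returns only the captured group: the text between the parentheses.
arguments = regex.findall(source)

# finditer yields match objects, giving access to the full call and its position.
calls = [(m.start(), m.group(0)) for m in regex.finditer(source)]

print(arguments)
for start, call in calls:
    print(start, call)
```

The same caveat about nested parentheses applies to every occurrence found this way.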

How to match everything between A and B but don't cross boundaries of predefined word

I need a regular expression that matches everything from "Hello" up to and including "everyone", with any characters in between. If 'and' is found between 'Hello' and 'everyone', the expression has to fail.
So the string "Hello you and everyone" has to fail, but "Hello you everyone" has to match.
I tried to implement it like this:
Hello.*?((?!and)){1}everyone
but it doesn't fail on 'and':
https://regex101.com/r/mX51ru/150

You can do:
^(?!Hello.*?\band\b.*?everyone)Hello.*?everyone
^ matches the start of the string
(?!Hello.*?\band\b.*?everyone) is a zero-width negative lookahead that makes sure the word and does not come between Hello and everyone
Hello.*?everyone matches the desired input, from Hello through everyone
Example:
In [1925]: str_1 = 'Hello you everyone'
In [1926]: str_2 = 'Hello you and everyone'
In [1927]: re.search(r'^(?!Hello.*?\band\b.*?everyone)Hello.*?everyone', str_1)
Out[1927]: <re.Match object; span=(0, 18), match='Hello you everyone'>
In [1928]: re.search(r'^(?!Hello.*?\band\b.*?everyone)Hello.*?everyone', str_2) is None
Out[1928]: True
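
The attempt in the question fails because (?!and) is zero-width and is only checked at the single position right before everyone, where it trivially succeeds. One common alternative, sketched here as an assumption rather than as part of the answer above, is to repeat the lookahead before every character consumed between the two words:

```python
import re

# Each character after "Hello" is consumed only if the word "and"
# does not start at that position.
pattern = re.compile(r'Hello(?:(?!\band\b).)*?everyone')

for s in ['Hello you everyone', 'Hello you and everyone', 'Hello sandy everyone']:
    print(repr(s), bool(pattern.search(s)))
```

Unlike the anchored pattern, this one can match anywhere in the string, and because of the \b word boundaries it still accepts words that merely contain "and", such as "sandy".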

Python: Difference between "re.match(pattern)" v/s "re.search('^' + pattern)" [duplicate]

While reading the docs, I found out that the whole difference between re.match() and re.search() is that re.match() starts checking only from the beginning of the string.
>>> import re
>>> a = 'abcde'
>>> re.match(r'b', a)
>>> re.search(r'b', a)
<_sre.SRE_Match object at 0xffe25c98>
>>> re.search(r'^b', a)
>>>
Am I misunderstanding something, or is there no difference at all between re.search('^' + pattern) and re.match(pattern)?
Is it good practice to use only re.search()?

Take a look at the search() vs. match() section of Python's re documentation, which spells out the other difference:
Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.
>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
<_sre.SRE_Match object; span=(4, 5), match='X'>
The first difference (for future readers) is:
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
For example:
>>> re.match("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
<_sre.SRE_Match object; span=(2, 3), match='c'>
Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:
>>> re.match("c", "abcdef") # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef") # Match
<_sre.SRE_Match object; span=(0, 1), match='a'>
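
For a search-based equivalent of match that also holds in MULTILINE mode, \A (start of string only) can be used instead of ^. A small sketch, reusing the sample string from the MULTILINE example above:

```python
import re

s = 'A\nB\nX'

# ^ matches at the start of every line under MULTILINE ...
caret = re.search(r'^X', s, re.MULTILINE)

# ... while \A matches only at the very start of the string, like re.match.
start_only = re.search(r'\AX', s, re.MULTILINE)
matched = re.match(r'X', s, re.MULTILINE)

print(caret.span())   # (4, 5)
print(start_only)     # None
print(matched)        # None
```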

If you look at this from a code golfing perspective, I'd say there is some use in keeping the two functions separate.
If you're matching from the beginning of the string, re.match is preferable to re.search because its name is one character shorter, saving a byte. With re.search you also have to add the start-of-line anchor ^ to match from the start; re.match implies it, saving another byte.

Is there a difference between [^[:print:]] and [[:cntrl:]]

Trying to determine if there is a functional difference between the POSIX character groups named above, or more specifically, the following two patterns:
r'[^[\x20-\x7E]]' # Match All non-printable
r'[\x00-\x1F\x7F]' # Match control characters

I'm not sure about the POSIX groups (Python's regex engine doesn't support them anyway), but
r'[^[\x20-\x7E]]'
is definitely wrong (should be r'[^\x20-\x7E]') and matches far more than
r'[\x00-\x1F\x7F]'
because the latter only considers ASCII characters whereas the former will also match anything above codepoint 126:
>>> r1 = re.compile(r'[^\x20-\x7E]')
>>> r2 = re.compile(r'[\x00-\x1F\x7F]')
>>> r1.match("ä")
<_sre.SRE_Match object; span=(0, 1), match='ä'>
>>> r2.match("ä")
>>>
To expand on my point above about why your regex r'[^[\x20-\x7E]]' is faulty: it matches a character that is neither an opening square bracket nor in the range between ASCII 32 (\x20) and ASCII 126 (\x7E) (which already includes [ anyway), and that is followed by a literal closing bracket:
>>> r1 = re.compile(r'[^[\x20-\x7E]]')
>>> r1.match("\x00")
>>> r1.match("\x00]")
<_sre.SRE_Match object; span=(0, 2), match='\x00]'>
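
To check the claim concretely, the corrected negated class and the control-character class can be compared code point by code point. A quick sketch, assuming Python 3 str patterns (the helper name matched is made up for this example):

```python
import re

non_printable = re.compile(r'[^\x20-\x7E]')
control = re.compile(r'[\x00-\x1F\x7F]')

def matched(regex, code_points):
    """Return the set of code points whose character the regex matches."""
    return {cp for cp in code_points if regex.fullmatch(chr(cp))}

ascii_range = range(0x80)
latin1_range = range(0x100)

# Within ASCII the two classes pick out exactly the same 33 characters.
print(matched(non_printable, ascii_range) == matched(control, ascii_range))  # True

# Beyond ASCII only the negated class keeps matching.
extra = matched(non_printable, latin1_range) - matched(control, latin1_range)
print(len(extra))  # 128
```

So the two patterns are interchangeable only if the input is guaranteed to be pure ASCII.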
