Is there a difference between [^[:print:]] and [[:cntrl:]] - python

Trying to determine if there is a functional difference between the POSIX character groups named above, or more specifically, the following two patterns:
r'[^[\x20-\x7E]]' # Match All non-printable
r'[\x00-\x1F\x7F]' # Match control characters

I'm not sure about the POSIX groups (Python's regex engine doesn't support them anyway), but
r'[^[\x20-\x7E]]'
is definitely wrong (should be r'[^\x20-\x7E]') and matches far more than
r'[\x00-\x1F\x7F]'
because the latter only considers ASCII characters whereas the former will also match anything above codepoint 126:
>>> r1 = re.compile(r'[^\x20-\x7E]')
>>> r2 = re.compile(r'[\x00-\x1F\x7F]')
>>> r1.match("ä")
<_sre.SRE_Match object; span=(0, 1), match='ä'>
>>> r2.match("ä")
>>>
To expand on my point above about why your regex r'[^[\x20-\x7E]]' is faulty: it matches a single character that is neither an opening square bracket nor in the range from \x20 (decimal 32) to \x7E (decimal 126), a range that already includes [ anyway, and that is followed by a literal closing bracket:
>>> r1 = re.compile(r'[^[\x20-\x7E]]')
>>> r1.match("\x00")
>>> r1.match("\x00]")
<_sre.SRE_Match object; span=(0, 2), match='\x00]'>
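To make the comparison concrete, here is a minimal sketch that checks where the corrected pattern and the control-character pattern disagree. The helper name disagreements is made up for illustration; it only relies on re.compile and fullmatch from the standard library.

```python
import re

# Corrected "not printable ASCII" pattern and the control-character pattern.
non_printable = re.compile(r'[^\x20-\x7E]')
control = re.compile(r'[\x00-\x1F\x7F]')

def disagreements(code_points):
    """Return the characters that exactly one of the two patterns matches."""
    return [
        chr(cp) for cp in code_points
        if bool(non_printable.fullmatch(chr(cp))) != bool(control.fullmatch(chr(cp)))
    ]

ascii_diff = disagreements(range(0x80))          # 7-bit ASCII
latin1_diff = disagreements(range(0x80, 0x100))  # code points 128-255

print(len(ascii_diff))   # 0: the patterns agree on every ASCII character
print(len(latin1_diff))  # 128: only the negated class matches these
```

So on pure ASCII input the two patterns are interchangeable; they only diverge once the text contains characters above \x7F.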

Related

Validate string format based on format

I have an issue with the following task.
I have a string:
ABCD[A] or A7D3[A,B,C]
First 4 Characters are 0-9 or A-Z.
5th character is [.
6th to nth character is A-Z followed by , in case there is more than one letter
e.g. A, E,F, A,B,C,D,F I don't know if there is a character limit with the middle part, so I have to assume it is 26 (A-Z).
last character is ].
I need to verify, that the structure of the string is as stated above.
ABCD[A,B]
BD1F[E,G,A,R]
S4P5[C]
I tried with regex ( in python)
r = re.match('^[0-9A-Z]{4}[[A-Z,]+$',text)
text being an example of the string, however it is not working.
A true / false or 0 or 1 as result would be fine
Any ideas how this could be done? What I've seen on google so far regex would work, however I'm not proficient enough with it to solve this by myself.
You can use '[0-9A-Z]{4}\[[A-Z](?:,[A-Z]){,25}\]':
import re
for s in ['ABCD[A,B]', 'BD1F[E,G,A,R]', 'S4P5[C]']:
    print(re.fullmatch(r'[0-9A-Z]{4}\[[A-Z](?:,[A-Z]){,25}\]', s))
Note that the (?:,[A-Z]){,25} limits the number of letters in the square brackets but does not ensure that they are non-duplicates.
Output:
<re.Match object; span=(0, 9), match='ABCD[A,B]'>
<re.Match object; span=(0, 13), match='BD1F[E,G,A,R]'>
<re.Match object; span=(0, 7), match='S4P5[C]'>
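If duplicate letters also need to be rejected, the regex alone is not enough. One possible sketch, assuming duplicates should make the string invalid (the question does not say so), captures the bracketed part and compares it against a set. The function name is_valid is hypothetical.

```python
import re

# Same pattern as above, with the bracketed letters captured in group 1.
PATTERN = re.compile(r'[0-9A-Z]{4}\[([A-Z](?:,[A-Z]){,25})\]')

def is_valid(text):
    """Return True if text has the required shape and no repeated letter."""
    m = PATTERN.fullmatch(text)
    if m is None:
        return False
    letters = m.group(1).split(',')
    # A set drops repeats, so the lengths differ when a letter occurs twice.
    return len(letters) == len(set(letters))

for s in ['ABCD[A,B]', 'A7D3[A,B,C]', 'ABCD[A,A]', 'S4P5[CD]']:
    print(s, is_valid(s))
```

With unique letters required, the {,25} bound becomes redundant, since at most 26 distinct letters exist.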
You can try:
import re
lst = ["ABCD[A,B]", "BD1F[E,G,A,R]", "S4P5[C]", "S4P5[CD]"]
pattern = r"^[A-Z0-9]{4}\[[A-Z](?:,[A-Z])*]$"
for string in lst:
    m = re.match(pattern, string)
    print(bool(m), m)
output:
True <re.Match object; span=(0, 9), match='ABCD[A,B]'>
True <re.Match object; span=(0, 13), match='BD1F[E,G,A,R]'>
True <re.Match object; span=(0, 7), match='S4P5[C]'>
False None
Explanation:
^: beginning of the string.
[A-Z0-9]{4} for getting the first 4 characters.
\[ for escaping the bracket.
[A-Z] first character inside bracket is mandatory.
(?:,[A-Z])* the rest would be optional.
]$: end of the string.
Note-1: You could limit the bracketed part to at most 26 letters by changing * to {,25}.
Note-2: I didn't escape the last bracket, but escaping it doesn't hurt (and may be clearer).
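One caveat with ending the pattern in $: in Python, $ also matches just before a trailing newline, so re.match with this pattern accepts input that ends in "\n". The sketch below shows the difference against re.fullmatch and the \Z anchor; whether a trailing newline should be rejected is an assumption about the requirements.

```python
import re

pattern = r"^[A-Z0-9]{4}\[[A-Z](?:,[A-Z])*]$"
with_newline = "S4P5[C]\n"

# $ matches before the final newline, so re.match still succeeds.
matched_by_match = re.match(pattern, with_newline) is not None

# fullmatch requires the match to cover the entire string, newline included.
matched_by_fullmatch = re.fullmatch(pattern, with_newline) is not None

# \Z matches only at the very end of the string.
strict = r"[A-Z0-9]{4}\[[A-Z](?:,[A-Z])*]\Z"
matched_by_z = re.match(strict, with_newline) is not None

print(matched_by_match, matched_by_fullmatch, matched_by_z)  # True False False
```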

Matching '//' or EOL '$' in regex giving inconsistent results when grouped (in Python)? [duplicate]

This question already has answers here:
Regular expression pipe confusion
(5 answers)
Closed 2 years ago.
Anyone know why these two regexes give different results when trying to match either '//' or '$'? (Python 3.6.4)
(a)(//|$) : Matches both 'a' and 'a//'
(a)(//)|($) : Matches 'a//' but not 'a'
>>> at = re.compile('(a)(//|$)')
>>> m = at.match('a')
>>> m
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> m = at.match('a//')
>>> m
<_sre.SRE_Match object; span=(0, 3), match='a//'>
>>>
vs
>>> at = re.compile('(a)(//)|($)')
>>> m = at.match('a//')
>>> m
<_sre.SRE_Match object; span=(0, 3), match='a//'>
>>> m = at.match('a')
>>> m
>>> type(m)
<class 'NoneType'>
>>>
The regex engine will group the expressions on each side of a pipe before evaluating.
In the first case
(a)(//|$)
implies it will match a string that must have an a before either // or $ (i.e., the end of the string).
Hence, the first alternative in this case is // and the second alternative is $; both must follow an a.
In this expression, the capturing groups are
a
Either // or $
In the second case
(a)(//)|($)
implies it will match a string that must be either a// or $.
Hence, the first alternative in this case is a// and the second alternative is $.
In this expression, the capturing groups are
Either
a
//
OR
$
In fact, the grouping doesn't matter in the second example, a//|$ will give the same result, since the regex engine will evaluate it as (a//)|$ (note the parentheses are just symbolic for my example, they do not represent capture group syntax).
Try it out in a regex tester. It'll tell you what the alternatives are for each expression.
| has low precedence, so (a)(//)|($) means ((a)(//))|($); it will therefore match either ((a)(//)) or ($). To get the same results as the first pattern, use (a)((//)|($)), which is the same as the first with extra groups added. The first regex is cleaner and should be preferred unless you need the extra groups.
See here for more details on precedence - https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04_08
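The effect of the precedence can be seen by inspecting the captured groups. This is a small sketch using only re.compile, match, search and groups(); the variable names are made up for illustration.

```python
import re

grouped = re.compile(r'(a)(//|$)')      # a, then either // or end of string
ungrouped = re.compile(r'(a)(//)|($)')  # either a//, or end of string alone

groups_a = grouped.match('a').groups()
groups_a_slashes = grouped.match('a//').groups()
print(groups_a)          # ('a', '')
print(groups_a_slashes)  # ('a', '//')

# match() only tries position 0, where neither alternative fits 'a'.
ungrouped_match = ungrouped.match('a')
print(ungrouped_match)   # None

# search() finds the second alternative: an empty match at the end.
ungrouped_search = ungrouped.search('a')
print(ungrouped_search.span(), ungrouped_search.groups())  # (1, 1) (None, None, '')
```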

questions about Python regex

Why does the following pattern result in a match of "A cat" instead of "a hat", given that matching is greedy by default?
>>> m = re.match(r'(\w+) (\w+)', "A cat jumpped over a hat")
>>> m
<_sre.SRE_Match object; span=(0, 5), match='A cat'>
Could someone shed some light on this?
From the official Python documentation on regexes
re.match() checks for a match only at the beginning of the string
From the official documentation:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
As others have alluded to, re.match starts at the beginning of the string and matches only as much as the pattern requires. Greediness controls how many characters each \w+ consumes; it does not make the engine look for a later match. Notice that match='A cat' at the end of the object's string representation shows what was matched: the part of "A cat jumpped over a hat" that satisfies r'(\w+) (\w+)'.
If you add a $ to the end of your pattern, requiring the match to end at the end of the string, there is no match. And if you shorten the input string to only two words, the same pattern matches once again:
>>> re.match(r'(\w+) (\w+)', "A cat jumpped over a hat")
<_sre.SRE_Match object; span=(0, 5), match='A cat'>
>>> re.match(r'(\w+) (\w+)$', "A cat jumpped over a hat")
>>> re.match(r'(\w+) (\w+)$', "A cat")
<_sre.SRE_Match object; span=(0, 5), match='A cat'>
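If the goal is actually the last two words, or every pair, re.search with a trailing $ and re.findall are the usual tools. A short sketch, assuming that is what the asker was after:

```python
import re

text = "A cat jumpped over a hat"

# match() is anchored at position 0, so it returns the first pair.
first_pair = re.match(r'(\w+) (\w+)', text).group(0)

# search() scans forward; the $ only lets the final pair succeed.
last_pair = re.search(r'(\w+) (\w+)$', text).group(0)

# findall() returns every non-overlapping pair, left to right.
all_pairs = re.findall(r'(\w+) (\w+)', text)

print(first_pair)  # A cat
print(last_pair)   # a hat
print(all_pairs)   # [('A', 'cat'), ('jumpped', 'over'), ('a', 'hat')]
```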

Python: Difference between "re.match(pattern)" v/s "re.search('^' + pattern)" [duplicate]

This question already has answers here:
What is the difference between re.search and re.match?
(9 answers)
Closed 3 years ago.
While reading the docs, I found out that the whole difference between re.match() and re.search() is that re.match() starts checking only from the beginning of the string.
>>> import re
>>> a = 'abcde'
>>> re.match(r'b', a)
>>> re.search(r'b', a)
<_sre.SRE_Match object at 0xffe25c98>
>>> re.search(r'^b', a)
>>>
Is there anything I am misunderstanding, or is there no difference at all between re.search('^' + pattern) and re.match(pattern)?
Is it a good practice to only use re.search()?
Take a look at Python's documentation on re.search() vs. re.match(), which mentions another difference:
Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.
>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
<_sre.SRE_Match object; span=(4, 5), match='X'>
The first difference (for future readers) is:
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
For example:
>>> re.match("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
<_sre.SRE_Match object; span=(2, 3), match='c'>
Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:
>>> re.match("c", "abcdef") # No match
>>> re.search("^c", "abcdef") # No match
>>> re.search("^a", "abcdef") # Match
<_sre.SRE_Match object; span=(0, 1), match='a'>
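Two cases where re.search('^' + pattern) and re.match(pattern) diverge can be sketched as follows: MULTILINE mode, and the optional pos argument of compiled patterns, where ^ still refers to the real start of the string. The \A anchor is included for comparison because it matches only at the start of the string regardless of flags. Variable names are illustrative only.

```python
import re

text = 'A\nB\nX'

# In MULTILINE mode ^ matches after every newline; match() and \A do not.
match_result = re.match('X', text, re.MULTILINE)
caret_result = re.search('^X', text, re.MULTILINE)
start_anchor_result = re.search(r'\AX', text, re.MULTILINE)
print(match_result)         # None
print(caret_result.span())  # (4, 5)
print(start_anchor_result)  # None

# With a start position, match() anchors at pos, while ^ does not.
pos_match = re.compile('b').match('abcde', 1)
pos_caret = re.compile('^b').search('abcde', 1)
print(pos_match.span())     # (1, 2)
print(pos_caret)            # None
```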
If you look at this from a code golfing perspective, I'd say there is some use in keeping the two functions separate.
If you're matching from the beginning of the string, re.match would be preferable to re.search, because the former has one fewer character in its name, thus saving a byte. Furthermore, with re.search, you also have to add the start-of-line anchor ^ to signify matching from the start. You don't need to specify this with re.match because it is implied, saving another byte.

difference b/w [ab] and (a|b) in regex match?

I knew that [] denotes a set of allowable characters -
>>> p = r'^[ab]$'
>>>
>>> re.search(p, '')
>>> re.search(p, 'a')
<_sre.SRE_Match object at 0x1004823d8>
>>> re.search(p, 'b')
<_sre.SRE_Match object at 0x100482370>
>>> re.search(p, 'ab')
>>> re.search(p, 'ba')
But ... today I came across an expression with vertical bars within parentheses to define mutually exclusive patterns -
>>> q = r'^(a|b)$'
>>>
>>> re.search(q, '')
>>> re.search(q, 'a')
<_sre.SRE_Match object at 0x100498dc8>
>>> re.search(q, 'b')
<_sre.SRE_Match object at 0x100498e40>
>>> re.search(q, 'ab')
>>> re.search(q, 'ba')
This seems to mimic the same functionality as above, or am I missing something?
PS: In Python, parentheses themselves are used to define logical groups of matched text. If I use the second technique, then how do I use parentheses for both jobs?
In this case it is the same.
However, the alternation is not just limited to a single character. For instance,
^(hello|world)$
will match "hello" or "world" (and only these two inputs) while
^[helloworld]$
would just match a single character ("h" or "w" or "d" or whatnot).
Happy coding.
[ab] matches one character (a or b) and doesn't capture a group. (a|b) matches a or b and captures it. In this case there is no big difference, but in more complex cases [] can only contain characters and character classes, while (|) can contain arbitrarily complex regexes on either side of the pipe.
In the example you gave they are interchangeable. There are some differences worth noting:
Inside character-class square brackets you don't have to escape anything except a backslash, a dash, square brackets, or the caret ^ (and the caret only if it's the first character).
Parentheses capture matches so you can refer to them later. Character class matches don't do that.
You can match multi-character strings in parentheses but not in character classes
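The differences listed above, and the PS about using parentheses for both jobs, can be illustrated with a short sketch. A non-capturing group (?:...) gives alternation without creating a capture group, so plain parentheses stay free for capturing.

```python
import re

# A character class matches one character and creates no group.
class_groups = re.search(r'^[ab]$', 'a').groups()

# A parenthesised alternation captures what it matched.
capture_groups = re.search(r'^(a|b)$', 'a').groups()

# (?:...) alternates without capturing.
non_capture_groups = re.search(r'^(?:a|b)$', 'a').groups()

print(class_groups)        # ()
print(capture_groups)      # ('a',)
print(non_capture_groups)  # ()

# Alternation can hold whole words; a class is always a single character.
word = re.search(r'^(hello|world)$', 'world').group(1)
class_on_word = re.search(r'^[helloworld]$', 'hello')
print(word)           # world
print(class_on_word)  # None
```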
