Validate string format based on format

I have an issue with the following task.
I have a string such as:
ABCD[A] or A7D3[A,B,C]
The first 4 characters are 0-9 or A-Z.
The 5th character is [.
The 6th to nth characters are letters A-Z, separated by , when there is more than one letter,
e.g. A, E,F, A,B,C,D,F. I don't know if there is a character limit for the middle part, so I have to assume it is 26 (A-Z).
The last character is ].
I need to verify that the structure of the string is as stated above. Examples:
ABCD[A,B]
BD1F[E,G,A,R]
S4P5[C]
I tried with regex (in Python):
r = re.match('^[0-9A-Z]{4}[[A-Z,]+$',text)
where text is an example of the string; however, it is not working.
A True/False or 0/1 result would be fine.
Any ideas how this could be done? From what I've seen on Google so far, regex would work, but I'm not proficient enough with it to solve this by myself.

You can use '[0-9A-Z]{4}\[[A-Z](?:,[A-Z]){,25}\]' with re.fullmatch:
import re
for s in ['ABCD[A,B]', 'BD1F[E,G,A,R]', 'S4P5[C]']:
    print(re.fullmatch(r'[0-9A-Z]{4}\[[A-Z](?:,[A-Z]){,25}\]', s))
Note that (?:,[A-Z]){,25} limits the number of letters inside the square brackets to 26 but does not ensure that they are unique.
Output:
<re.Match object; span=(0, 9), match='ABCD[A,B]'>
<re.Match object; span=(0, 13), match='BD1F[E,G,A,R]'>
<re.Match object; span=(0, 7), match='S4P5[C]'>
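Since the question asks for a plain True/False result, the pattern above can be wrapped in a small helper. This is only a sketch: the function name is_valid_code and the optional uniqueness check are assumptions added for illustration, not part of the original answer.

```python
import re

# Same pattern as above: 4 characters from 0-9/A-Z, then 1 to 26
# comma-separated letters inside square brackets.
PATTERN = re.compile(r'[0-9A-Z]{4}\[[A-Z](?:,[A-Z]){,25}\]')

def is_valid_code(text, require_unique=False):
    """Return True if text has the required structure (hypothetical helper)."""
    if PATTERN.fullmatch(text) is None:
        return False
    if require_unique:
        # The regex alone cannot reject duplicates, so check them separately.
        letters = text[5:-1].split(',')
        return len(letters) == len(set(letters))
    return True

for s in ['ABCD[A,B]', 'S4P5[C]', 'S4P5[CD]', 'ABCD[A,A]']:
    print(s, is_valid_code(s), is_valid_code(s, require_unique=True))
```

Keeping the duplicate check outside the regex keeps the pattern readable; whether duplicates should be rejected at all is not stated in the question.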

You can try:
import re
lst = ["ABCD[A,B]", "BD1F[E,G,A,R]", "S4P5[C]", "S4P5[CD]"]
pattern = r"^[A-Z0-9]{4}\[[A-Z](?:,[A-Z])*]$"
for string in lst:
    m = re.match(pattern, string)
    print(bool(m), m)
output:
True <re.Match object; span=(0, 9), match='ABCD[A,B]'>
True <re.Match object; span=(0, 13), match='BD1F[E,G,A,R]'>
True <re.Match object; span=(0, 7), match='S4P5[C]'>
False None
Explanation:
^: beginning of the string.
[A-Z0-9]{4} for getting the first 4 characters.
\[ for escaping the bracket.
[A-Z] first character inside bracket is mandatory.
(?:,[A-Z])* the rest would be optional.
]$: the closing bracket, then the end of the string.
Note-1: You could cap the list at 26 letters by changing * to {,25}.
Note-2: I didn't escape the last bracket; escaping it (\]) doesn't hurt and is arguably clearer.
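To see why the attempt in the question fails: in [[A-Z,]+ the second [ is not escaped, so [[A-Z,] is read as a single character class containing [, A-Z and the comma, and the pattern never consumes the closing ]. A small comparison, with a helper name (matches) chosen just for this sketch:

```python
import re
import warnings

original = '^[0-9A-Z]{4}[[A-Z,]+$'            # pattern from the question
fixed = r'^[A-Z0-9]{4}\[[A-Z](?:,[A-Z])*]$'   # pattern from this answer

def matches(pattern, text):
    """Return True if pattern matches text from the start (hypothetical helper)."""
    with warnings.catch_warnings():
        # Newer Python versions may warn about the unescaped '[' inside a set.
        warnings.simplefilter('ignore', FutureWarning)
        return re.match(pattern, text) is not None

for text in ['ABCD[A,B]', 'ABCD[A,B', 'ABCD,,[[']:
    print(text, matches(original, text), matches(fixed, text))
```

The original pattern rejects the valid string and accepts the two malformed ones, while the fixed pattern does the opposite.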

Related

Python Regular expression simple

I want to find all occurrences of LocalizedString(...) in a text file. Anything could appear between the parentheses. How can I find this using regular expressions?
I searched online but had no luck.
Thank you
Let's say we have a file with the following content:
class MyClass {
    func myFunc(){
        var text = LocalizedString("arrow.Find")
    }
}

r"LocalizedString\([^)]*\)"
LocalizedString\( matches LocalizedString( literally.
[^)]* matches anything that isn't a closing parenthesis.
\) matches a closing parenthesis literally.
Example:
>>> import re
>>> regex = re.compile(r"LocalizedString\([^)]*\)")
>>> regex.match('LocalizedString(a,b)')
<re.Match object; span=(0, 20), match='LocalizedString(a,b)'>
Note that the match for code like LocalizedString(a(x),b) is not the entire string:
>>> regex.match('LocalizedString(a(x),b)')
<re.Match object; span=(0, 20), match='LocalizedString(a(x)'>
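This happens because [^)]* stops at the first closing parenthesis, and regular expressions (at least as implemented in Python's re module) can't handle arbitrarily nested parentheses such as a(b(c(x()))).
You could also greedily match everything between the first opening parenthesis and the last closing parenthesis on the line, although this can overshoot, as the second example shows: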
>>> regex2 = re.compile(r"LocalizedString\(.*\)")
>>> regex2.match('LocalizedString(a(x),b)')
<re.Match object; span=(0, 23), match='LocalizedString(a(x),b)'>
>>> regex2.match('LocalizedString(a(x),b) && print("Hello!")')
<re.Match object; span=(0, 42), match='LocalizedString(a(x),b) && print("Hello!")'>
To match just what's inside the parentheses:
r"LocalizedString\(([^)]*)\)"
Then the capture group ([^)]*) will contain the required data:
>>> regex3 = re.compile(r"LocalizedString\(([^)]*)\)")
>>> regex3.match('LocalizedString(a,b)').groups()
('a,b',)
>>> regex3.match('LocalizedString(a(x),b)').groups()
('a(x',)
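The question asks for all occurrences, while the examples above use match, which only looks at the start of the string. A minimal sketch using findall and finditer on the sample text from the question (the variable names are chosen for illustration):

```python
import re

# Sample text taken from the question.
source = '''class MyClass {
    func myFunc(){
        var text = LocalizedString("arrow.Find")
    }
}'''

call_pattern = re.compile(r"LocalizedString\(([^)]*)\)")

# findall returns the contents of the capture group for every occurrence.
arguments = call_pattern.findall(source)
print(arguments)

# finditer gives access to the full match and its position as well.
for m in call_pattern.finditer(source):
    print(m.start(), m.group(0))
```

For a real file, source would be the result of reading the file's contents; the same nested-parentheses limitation described above still applies.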

Regex Quantifiers in Python (trying to match a group) [duplicate]

My question is about metacharacters in Python:
import re
string = 'Python 123'
print(re.search(r'(\d)+', string))  # It matches perfectly
<re.Match object; span=(7, 10), match='123'>
But when it comes to the ? or * quantifiers:
print(re.search(r'(\d)?', string))
<re.Match object; span=(0, 0), match=''>
Or
print(re.search(r'(\d)*', string))
<re.Match object; span=(0, 0), match=''>
My question is: why don't ? and * match the digits in the string, showing span=(0, 0) instead?

The second and third regexes allow the empty string '' to be matched, and an empty match is already possible at position 0, before the P.
re.search merely returns the first match, as documented: https://docs.python.org/3/library/re.html#re.search
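A short sketch of that behaviour, using the string from the question (the capture groups are dropped here because they do not affect where the match starts):

```python
import re

string = 'Python 123'

# At index 0 the character is 'P'; \d* is satisfied there by zero digits.
empty = re.search(r'\d*', string)
print(empty.span())      # (0, 0)

# \d+ needs at least one digit, so the scan continues until index 7.
digits = re.search(r'\d+', string)
print(digits.span())     # (7, 10)

# When the string starts with digits, \d* consumes them greedily.
leading = re.search(r'\d*', '123 Python')
print(leading.group())   # 123
```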

questions about Python regex

Why does the following pattern result in a match of "A cat" instead of "a hat", given that matching is greedy by default?
>>> m = re.match(r'(\w+) (\w+)', "A cat jumpped over a hat")
>>> m
<_sre.SRE_Match object; span=(0, 5), match='A cat'>
Could someone shed some light on this?

From the official Python documentation on regexes
re.match() checks for a match only at the beginning of the string

From the official documentation:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

As others have alluded, re.match starts from the beginning of the string-to-match and stops as soon as the pattern is satisfied. Notice that match='A cat' at the end of the object's string representation shows which part of "A cat jumpped over a hat" was matched by r'(\w+) (\w+)'.
If you were to add a $ to the end of your pattern, indicating the string-to-match should end there, it will not result in a match. And if you were to keep that same pattern and shorten the string to only two words, it would match once again:
>>> re.match(r'(\w+) (\w+)', "A cat jumpped over a hat")
<_sre.SRE_Match object; span=(0, 5), match='A cat'>
>>> re.match(r'(\w+) (\w+)$', "A cat jumpped over a hat")
>>> re.match(r'(\w+) (\w+)$', "A cat")
<_sre.SRE_Match object; span=(0, 5), match='A cat'>
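If the goal is actually to reach "a hat", re.search (which scans the whole string) or re.findall can be used instead of re.match. A brief sketch with the same pattern and string; the variable names are only illustrative:

```python
import re

text = "A cat jumpped over a hat"
pattern = r'(\w+) (\w+)'

# re.search with a trailing $ scans forward until the pair at the end fits.
last_pair = re.search(pattern + '$', text)
print(last_pair.group())          # a hat

# re.findall collects every non-overlapping pair from left to right.
pairs = re.findall(pattern, text)
print(pairs)
```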

Python re (regex) matching particular string containing letters, hyphen, numbers

I am trying to match the following strings in Python 2.7 using the regular expression package re and am having trouble coming up with the regex:
https://www.this.com/john-smith/e5609239
https://www.this.com/jane-johnson/e426609216
https://www.this.com/wendy-saad/e172645609215
https://www.this.com/nick-madison/e7265609214
https://www.this.com/tom-taylor/e17265709211
https://www.this.com/james-bates/e9212
So the prefix is fixed "https://www.this.com/" and then there are a variable number of lowercase letters, then "-", then more lowercase letters, then "/e", then a variable number of digits.
Here is what I have tried, to no avail:
href=re.compile("https://www.this.com/people-search/[a-z]+[\-](?P<firstNumBlock>\d+)/")
href=re.compile("https://www.this.com/people-search/[a-z][\-][a-z]+/e[0-9]+")
Thanks for your help!

You are running into issues with escaping special characters. Since you're not using raw strings, the backslash has special meaning in your string literal itself. Additionally, the hyphen doesn't need to be escaped or wrapped in a character class ([]) when it stands on its own in a regular expression. You can simplify your expression as follows:
expression = r"https://www.this.com/people-search/[a-z]+-[a-z]+/e\d+"
With the following data:
strings = ['https://www.this.com/people-search/john-smith/e5609239',
           'https://www.this.com/people-search/jane-johnson/e426609216',
           'https://www.this.com/people-search/wendy-saad/e172645609215',
           'https://www.this.com/people-search/nick-madison/e7265609214',
           'https://www.this.com/people-search/tom-taylor/e17265709211',
           'https://www.this.com/people-search/james-bates/e9212']
Result:
>>> for s in strings:
...     print(re.match(expression, s))
...
<_sre.SRE_Match object; span=(0, 54), match='https://www.this.com/people-search/john-smith/e56>
<_sre.SRE_Match object; span=(0, 58), match='https://www.this.com/people-search/jane-johnson/e>
<_sre.SRE_Match object; span=(0, 59), match='https://www.this.com/people-search/wendy-saad/e17>
<_sre.SRE_Match object; span=(0, 59), match='https://www.this.com/people-search/nick-madison/e>
<_sre.SRE_Match object; span=(0, 58), match='https://www.this.com/people-search/tom-taylor/e17>
<_sre.SRE_Match object; span=(0, 52), match='https://www.this.com/people-search/james-bates/e9>

href=re.compile(r"https://www\.this\.com/people-search/[a-z]+-[a-z]+/e[0-9]+")

re.compile(r'https://www.this.com/[a-z-]+/e\d+')
[a-z-]+ matches john-smith
e\d+ matches e5609239

import re
from pprint import pprint

text = '''https://www.this.com/john-smith/e5609239
https://www.this.com/jane-johnson/e426609216
https://www.this.com/wendy-saad/e172645609215
https://www.this.com/nick-madison/e7265609214
https://www.this.com/tom-taylor/e17265709211
https://www.this.com/james-bates/e9212'''
href = re.compile(r'https://www\.this\.com/[a-zA-Z]+\-[a-zA-Z]+/e[0-9]+')
m = href.findall(text)
pprint(m)
Outputs:
['https://www.this.com/john-smith/e5609239',
 'https://www.this.com/jane-johnson/e426609216',
 'https://www.this.com/wendy-saad/e172645609215',
 'https://www.this.com/nick-madison/e7265609214',
 'https://www.this.com/tom-taylor/e17265709211',
 'https://www.this.com/james-bates/e9212']
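Several of the patterns above leave the dots in the host name unescaped, so each . matches any character. A possible sketch that escapes the fixed prefix with re.escape and pulls out the name and number with named groups. It assumes Python 3.4+ (for fullmatch) rather than the 2.7 mentioned in the question, and the names PREFIX, PROFILE_URL and parse_profile_url are made up for illustration:

```python
import re

# re.escape keeps the dots in the fixed prefix literal.
PREFIX = re.escape('https://www.this.com/')
PROFILE_URL = re.compile(PREFIX + r'(?P<name>[a-z]+-[a-z]+)/e(?P<number>\d+)')

def parse_profile_url(url):
    """Return (name, number) or None (hypothetical helper for illustration)."""
    m = PROFILE_URL.fullmatch(url)
    if m is None:
        return None
    return m.group('name'), m.group('number')

print(parse_profile_url('https://www.this.com/john-smith/e5609239'))
print(parse_profile_url('https://wwwXthis.com/john-smith/e5609239'))
```

The second call returns None because the escaped prefix no longer accepts an arbitrary character in place of the dot.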

Is there a difference between [^[:print:]] and [[:cntrl:]]

Trying to determine if there is a functional difference between the POSIX character groups named above, or more specifically, the following two patterns:
r'[^[\x20-\x7E]]' # Match All non-printable
r'[\x00-\x1F\x7F]' # Match control characters

I'm not sure about the POSIX groups (Python's regex engine doesn't support them anyway), but
r'[^[\x20-\x7E]]'
is definitely wrong (should be r'[^\x20-\x7E]') and matches far more than
r'[\x00-\x1F\x7F]'
because the latter only considers ASCII characters whereas the former will also match anything above codepoint 126:
>>> r1 = re.compile(r'[^\x20-\x7E]')
>>> r2 = re.compile(r'[\x00-\x1F\x7F]')
>>> r1.match("ä")
<_sre.SRE_Match object; span=(0, 1), match='ä'>
>>> r2.match("ä")
>>>
To expand on my point above why your regex r'[^[\x20-\x7E]]' is faulty: it matches a character that is neither an opening square bracket nor in the range between ASCII 32 (\x20) and ASCII 126 (\x7E) (which already includes [ anyway), and that is followed by a literal closing bracket:
>>> r1 = re.compile(r'[^[\x20-\x7E]]')
>>> r1.match("\x00")
>>> r1.match("\x00]")
<_sre.SRE_Match object; span=(0, 2), match='\x00]'>
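To make the comparison concrete, the corrected pattern and the control-character pattern can be checked code point by code point. This is just a sketch; the helper name differing is an assumption for illustration:

```python
import re

non_printable = re.compile(r'[^\x20-\x7E]')
control = re.compile(r'[\x00-\x1F\x7F]')

def differing(code_points):
    """Code points on which the two patterns disagree (hypothetical helper)."""
    return [cp for cp in code_points
            if bool(non_printable.fullmatch(chr(cp))) != bool(control.fullmatch(chr(cp)))]

# Within the ASCII range the two patterns accept exactly the same characters.
print(differing(range(0x80)))

# Above ASCII, only the negated class matches.
print(len(differing(range(0x80, 0x100))))
```

So on pure ASCII input the two are interchangeable, and the difference only shows up for characters above \x7F.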
