Python RegEx Meaning - python

I'm new to python regular expressions and was wondering if someone could help me out by walking me through what this means (I'll state what I think each bit means here as well).
Thanks!
RegExp:
r'(^.*def\W*)(\w+)\W*\((.*)\):'
r'...' = python definition of regular expression within the ''
(...) = a regex term
(^. = match the beginning of any character
*def\W* = ???
(\w+) = match any of [a, z] 1 or more times
\W*\ = ? i think its the same as the line above this but from 0+ more times instead of 1 but since it matches the def\W line above (which i dont really know the meaning of) i'm not sure.
((.*)\): = match any additional character within brackets ()
thanks!

It seems like a failed attempt to match a Python function signature:
import re
regex = re.compile(r""" # r'' means that \n and the like is two chars
# '\\','n' and not a single newline character
( # begin capturing group #1; you can get it: regex.match(text).group(1)
^ # match beginning of the string or a new line if re.MULTILINE is set
.* # match zero or more characters except newline (unless
# re.DOTALL is set)
def # match string 'def'
\W* # match zero or more non-\w chars i.e., [^a-zA-Z0-9_] if no
# re.LOCALE or re.UNICODE
) # end capturing group #1
(\w+) # second capturing group [a-zA-Z0-9_] one or more times if
# no above flags
\W* # see above
\( # match literal paren '('
(.*) # 3rd capturing group NOTE: `*` is greedy `.` matches even ')'
# therefore re.match(r'\((.*)\)', '(a)(b)').group(1) == 'a)(b'
\) # match literal paren ')'
: # match literal ':'
""", re.VERBOSE|re.DEBUG)
re.DEBUG flag causes the output:
subpattern 1
at at_beginning
max_repeat 0 65535
any None
literal 100
literal 101
literal 102
max_repeat 0 65535
in
category category_not_word
subpattern 2
max_repeat 1 65535
in
category category_word
max_repeat 0 65535
in
category category_not_word
literal 40
subpattern 3
max_repeat 0 65535
any None
literal 41
literal 58
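To see why this is only a loose match for a function signature, here is a minimal sketch that runs the pattern from the question against a few inputs. The test strings are made up for illustration and are not from the original post.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'(^.*def\W*)(\w+)\W*\((.*)\):')

# An ordinary signature: the groups are the prefix, the name and the arguments.
m = pattern.match("def foo(x, y):")
groups_ok = m.groups()            # ('def ', 'foo', 'x, y')

# \W* also matches zero characters, so no space is required after 'def'.
m = pattern.match("defoo(x):")
name_without_space = m.group(2)   # 'oo'

# Greedy .* runs to the last ')' that is still followed by ':'.
m = pattern.match("def f(a): g(b):")
greedy_args = m.group(3)          # 'a): g(b'

print(groups_ok, name_without_space, greedy_args)
```

The second and third cases are the reason the pattern can be called a failed attempt: it accepts text that is not a valid definition and can capture more than the argument list.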

r'..' = Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'.
(...) = a capture group which stores the captured value so it can be reused in replacement/matching.
^ = the start of the string.
.* = 0 or more chars of any type.
def = the literal string def
\W* = 0 or more non-word chars (anything other than a-zA-Z0-9 or _)
\w+ = 1 or more word chars (see above)
\( = escapes the (, therefore means a literal (
\) = same as above.
: = literal :
PS: I like the effort that you made to try to understand the regex. It will serve you well, much better than people asking what does r'(^.*def\W*)(\w+)\W*\((.*)\):' mean.

r'...' = python definition of regular expression within the ''
The r'' syntax has nothing to do with regular expressions (or at least, not directly). The r stands for raw and is simply an indicator to Python that backslash escape sequences in the string literal should not be processed.
This is often used with regular expressions so that you don't have to escape backslash (\) characters, which would otherwise be consumed by normal string escape processing.
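A small sketch of the difference, using only the standard re module; the sample text 'x def y' is an arbitrary choice for illustration:

```python
import re

# In a normal string literal '\n' is one character (a newline);
# in a raw string literal it stays as two: a backslash and 'n'.
lengths = (len('\n'), len(r'\n'))      # (1, 2)

# Both spellings hand the same pattern text to the regex engine.
same_pattern = (r'\W*' == '\\W*')      # True

# \b shows the practical difference: '\b' is a backspace character,
# while r'\b' reaches the regex engine as a word-boundary assertion.
found_raw = re.search(r'\bdef\b', 'x def y') is not None     # True
found_plain = re.search('\bdef\b', 'x def y') is not None    # False

print(lengths, same_pattern, found_raw, found_plain)
```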
(^. = match the beginning of any character
I'm not sure what "beginning of any character" means. The ^ character matches the beginning of the string (or the beginning of each line when re.MULTILINE is set).
def\W = ???
def matches the characters def. For \W, take a look at pydoc re, which describes the regular expression language.
\W*
As above.
Other than the above, your interpretation seems largely correct.

r'(^.*def\W*)(\w+)\W*\((.*)\):'
==================================
r' tells python this is a raw string so you don't have to double escape all the \
^ match start of string
() in each case this means group this match where each () is a different group
.* match zero or more of any characters
def match the literal 'def'
\W* match zero or more of any non word character
() more grouping of the contained expression
\w+ match one or more of word character
\W* zero or more of any non word character
\( escape the left paren
() more grouping of the contained expression
.* zero or more of any character
\) escape the right paren
: match a single colon literal
This looks like it is trying to match a python method definition. Here is a link to play with this regular expression. Yes it is powered by Ruby, but the syntax is pretty much the same across all languages, I use this site to test regexes for Python, Java and Ruby.

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it.
( ) groups a statement, it's treated as ONE thing, so you can do ()? or ()* or ()+ if the things in the brackets need to be handled together.
^ matches it if it's the beginning of the string
(Dot.) In the default mode, this matches any character except a newline.
Since it is ".*" the * means match 0 or more matches of the previous thing, which in this case is any character.
def\W* - the literal string "def" (preceded by anything, because of the .* before it), and then \W matches any non-word character and is equivalent to [^a-zA-Z0-9_]. Since we have the * again, this time it matches 0 or more non-word characters.
(\w+), the + is for 1 or more of the previous, which in this case is \w, equivalent to [a-zA-Z0-9_].
\W* - we know that one already.
\( - means just to match a literal "(", to differentiate it from ( ) used for grouping; the same goes for \).
(.*) - match 0 or more characters inside.
: - the matched string ends with a colon at the end.
The whole thing seems to be matching a definition of a function in python i.e. "def foo(x):" will be matched. Dealing with regular expressions is hard - using tools such as http://www.pythonregex.com/ helps me to try out different things. And since RE are slightly different in different languages, it's nice to have tools for those too.

r'...' → the preferred way to define the string of a regular expression in Python
(...) → regex term
^ → matches only with the beginning of the string
So, in the first pair of parentheses (^.*def\W*), it first matches the beginning of the string.
. → matches any character
* → repeat the previous match 0 or more times
Then .* will match anything any number of times. The following 'def' is an exact match, that only matches with itself.
\W → matches anything that is NOT a letter, nor a number, nor the underscore character.
Then \W* will match zero or more of these non-letter-number-underscore characters. The next pair of parentheses (\w+) you got right. In the last part \W*\((.*)\): the initial \W* means the same thing as the previous \W* . Next, \( matches ( , then there is the group (.*) which means, as previously, anything any number of times, followed by \): which matches ): .
An example of string that is matched by this regular expression is:
thing_def = function_name (123 anything in here):
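A quick check of that example string, showing what each of the three groups captures:

```python
import re

pattern = re.compile(r'(^.*def\W*)(\w+)\W*\((.*)\):')
m = pattern.match("thing_def = function_name (123 anything in here):")

# Group 1 is everything up to and including 'def' plus the non-word
# characters after it; group 2 is the name; group 3 is the text in parentheses.
groups = m.groups()
print(groups)   # ('thing_def = ', 'function_name', '123 anything in here')
```

Note that the string is not a function definition at all, which illustrates how permissive the pattern is.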

Related

How to match a string with pythons regex with optional character, but only if that optional character is preceded by another character

I need to match a string that optionally ends with numbers, but only if the numbers aren't preceded by a 0.
so AAAA should match, AAA1 should, AA20 should, but AA02 should not.
I can figure out the optionality of it, but I'm not sure if python has a "preceded by" or "followed by" flag.
if s.isalnum() and re.match("^[A-Z]+[1-9][0-9]*$", s):
    return True
Try:
^[A-Z]+(?:[1-9][0-9]*)?$
^[A-Z]+ - match letters from the beginning of string
(?:[1-9][0-9]*)? - optionally match a number that doesn't start from 0
$ - end of string
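The four strings from the question can be checked against this pattern as follows (a minimal sketch; the names samples and results are only for illustration):

```python
import re

pattern = re.compile(r'^[A-Z]+(?:[1-9][0-9]*)?$')

samples = ["AAAA", "AAA1", "AA20", "AA02"]
# bool() turns the match object (or None) into True/False.
results = {s: bool(pattern.match(s)) for s in samples}

print(results)
# {'AAAA': True, 'AAA1': True, 'AA20': True, 'AA02': False}
```

No lookaround is needed here: "AA02" fails simply because the optional group cannot start with 0, so $ is reached while the 0 is still unmatched.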

python string split only by `/` but not `//` [duplicate]

I have a string like this
"yJdz:jkj8h:jkhd::hjkjh"
I want to split it using colon as a separator, but not a double colon. Desired result:
("yJdz", "jkj8h", "jkhd::hjkjh")
I'm trying with:
re.split(":{1}", "yJdz:jkj8h:jkhd::hjkjh")
but I got a wrong result.
In the meantime I'm working around it by escaping "::" with string.replace("::", "$$")
You could split on (?<!:):(?!:). This uses two negative lookarounds (a lookbehind and a lookahead) which assert that a valid match only has one colon, without a colon before or after it.
To explain the pattern:
(?<!:) # assert that the previous character is not a colon
: # match a literal : character
(?!:) # assert that the next character is not a colon
Both lookarounds are needed, because if there was only the lookbehind, then the regular expression engine would match the first colon in :: (because the previous character isn't a colon), and if there was only the lookahead, the second colon would match (because the next character isn't a colon).
You can do this with lookahead and lookbehind, if you want:
>>> s = "yJdz:jkj8h:jkhd::hjkjh"
>>> l = re.split("(?<!:):(?!:)", s)
>>> print l
['yJdz', 'jkj8h', 'jkhd::hjkjh']
This regex essentially says "match a : that is not followed by a : or preceded by a :"
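The claim that both lookarounds are needed can be checked directly. This sketch (Python 3 syntax) compares the full pattern with the two one-sided variants on the string from the question:

```python
import re

s = "yJdz:jkj8h:jkhd::hjkjh"

# Both assertions: only a lone colon is a separator.
both = re.split(r"(?<!:):(?!:)", s)

# Lookbehind only: the first colon of '::' still matches.
only_lookbehind = re.split(r"(?<!:):", s)

# Lookahead only: the second colon of '::' still matches.
only_lookahead = re.split(r":(?!:)", s)

print(both)             # ['yJdz', 'jkj8h', 'jkhd::hjkjh']
print(only_lookbehind)  # ['yJdz', 'jkj8h', 'jkhd', ':hjkjh']
print(only_lookahead)   # ['yJdz', 'jkj8h', 'jkhd:', 'hjkjh']
```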

strange output regular expression r'[-.\:alnum:](.*)'

I expect to fetch all alphanumeric characters after "-"
For an example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']
First of all, Python re does not support POSIX character classes.
The white space is not considered alphanumeric, your first pattern matches - with [-.\:alnum:] and then (.*) captures into Group 1 all 0 or more chars other than a newline. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against the qwertyuio, u is matched and io is captured into Group 1.
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after - that is followed with 0+ whitespaces you may use
re.findall(r'-\s*([^\W_]+)', s)
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []
Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured mystr in the first capture group. If any groups are in the regex, findall returns list of found groups, not the matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
As @revo notes in comments, [....] is a character class - matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. In order to account for that, -\s*([A-Za-z0-9]*).
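A short sketch that makes the character-class behaviour visible. The probe string "-.:alnumxyz019 " is an arbitrary assumption chosen to contain both members and non-members of the class:

```python
import re

loose = r'[-.\:alnum:](.*)'

first = re.findall(loose, "12 - mystr")    # '-' matches, rest is captured
second = re.findall(loose, "qwertyuio")    # 'u' matches, 'io' is captured

# Which single characters does the class actually accept?
class_members = sorted(set(re.findall(r'[-.\:alnum:]', "-.:alnumxyz019 ")))

# The hand-written replacement suggested above.
fixed = re.findall(r'-\s*([A-Za-z0-9]*)', "12 - mystr")
fixed_no_dash = re.findall(r'-\s*([A-Za-z0-9]*)', "qwertyuio")

print(first, second)         # [' mystr'] ['io']
print(class_members)         # ['-', '.', ':', 'a', 'l', 'm', 'n', 'u']
print(fixed, fixed_no_dash)  # ['mystr'] []
```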
Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
If you want to get all alphanumeric chars after the first '-' and skip all other characters you can do something like this. (Python's re only allows fixed-width lookbehinds, so the whitespace is matched directly rather than asserted.)
re.search(r'-\s*((?:[a-zA-Z\d]+\s*)+)', inputString)
If you want to find each string of alphanumerics directly after each '-' then you can do this.
re.findall(r'(?<=-)[a-zA-Z\d]+', inputString)

python regular expression : how to remove all punctuation characters from a string but keep those between numbers?

I am working on a Chinese NLP project. I need to remove all punctuation characters except those between numbers, and keep only Chinese characters (\u4e00-\u9fff) and alphanumeric characters (0-9a-zA-Z). For example, the hyphen in 12-34 should be kept while the equals sign after 123 should be removed.
Here is my python script.
import re
s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])','',s)
print(res)
the expected output should be
中国中国foo中国bar中123国中国12-34中国
but the result is
中国中国foo中国bar中123=国中国12-34中国
I can't figure out why there is an extra equal sign in the output?
Your regex first checks "=" against [^\u4e00-\u9fff0-9a-zA-Z]+, which succeeds. It then checks the lookbehind (?<=[^0-9]) and the lookahead (?=[^0-9]), and both must succeed for the character to be removed. Here the lookbehind fails because "=" is preceded by the digit 3, so the character is kept. This means your code actually keeps any non-alphanumeric, non-Chinese characters that have a digit on either side.
You can try the following regex:
u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'
You can use it as such:
import re
s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))',s)
print(''.join(res))
I suggest matching and capturing these characters in between digits (to restore them later in the output), and just match them in other contexts.
In Python 2, it will look like
import re
s = u"中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
pat_block = u'[^\u4e00-\u9fff0-9a-zA-Z]+'
pattern = u'([0-9]+{0}[0-9]+)|{0}'.format(pat_block)
res = re.sub(pattern, lambda x: x.group(1) if x.group(1) else u"" ,s)
print(res.encode("utf8")) # => 中国中国foo中国bar中123国中国12-34中国
If you need to preserve those symbols inside any Unicode digits, you need to replace [0-9] with \d and pass the re.UNICODE flag to the regex.
The regex will look like
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+)|[^\u4e00-\u9fff0-9a-zA-Z]+
It works like this:
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+) - Group 1 capturing
[0-9]+ - 1+ digits
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
[0-9]+ - 1+ digits
| - or
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
In Python 2.x, when a group is not matched in re.sub, the backreference to it is None, that is why a lambda expression is required to check if Group 1 matched first.
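For reference, a minimal Python 3 sketch of the same capture-and-restore idea. In Python 3 every str is already Unicode, so the u prefix and the encode call are not needed; the variable names are only illustrative.

```python
import re

s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"

# One or more characters that are neither Chinese nor ASCII alphanumeric.
block = '[^\u4e00-\u9fff0-9a-zA-Z]+'

# Alternative 1 captures "digits, symbols, digits" so it can be put back;
# alternative 2 matches the same symbols in any other context.
pattern = re.compile('([0-9]+{0}[0-9]+)|{0}'.format(block))

# group(1) is None when alternative 2 matched, so fall back to ''.
res = pattern.sub(lambda m: m.group(1) or '', s)

print(res)   # 中国中国foo中国bar中123国中国12-34中国
```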

using reg exp to check if test string is of a fixed format

I want to make sure using regex that a string is of the format- "999.999-A9-Won" and without any white spaces or tabs or newline characters.
There may be 2 or 3 numbers in the range 0 - 9.
Followed by a period '.'
Again followed by 2 or 3 numbers in the range 0 - 9
Followed by a hyphen, character 'A' and a number between 0 - 9 .
This can be followed by anything.
Example: 87.98-A8-abcdef
The code I have come up until now is:
testString = "87.98-A1-help"
regCompiled = re.compile('^[0-9][0-9][.][0-9][0-9][-A][0-9][-]*');
checkMatch = re.match(regCompiled, testString);
if checkMatch:
    print ("FOUND")
else:
    print("Not Found")
This doesn't seem to work. I'm not sure what I'm missing and also the problem here is I'm not checking for white spaces, tabs and new line characters and also hard-coded the number for integers before and after decimal.
With {m,n} you can specify the number of times a pattern can repeat, and the \d character class matches all digits. The \S character class matches anything that is not whitespace. Using these your regular expression can be simplified to:
re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
Note also the \Z anchor, making the \S* expression match all the way to the end of the string. No whitespace (newlines, tabs, etc.) is allowed here. If you combine this with the .match() method you ensure that all characters in your tested string conform to the pattern, nothing more, nothing less. See search() vs. match() for more information on .match().
A small demonstration:
>>> import re
>>> pattern = re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
>>> pattern.match('87.98-A1-help')
<_sre.SRE_Match object at 0x1026905e0>
>>> pattern.match('123.45-A6-no whitespace allowed')
>>> pattern.match('123.45-A6-everything_else_is_allowed')
<_sre.SRE_Match object at 0x1026905e0>
Let's look at your regular expression. If you want:
"2 or 3 numbers in the range 0 - 9"
then you can't start your regular expression with '^[0-9][0-9][.]' because that will only match strings with exactly two digits at the beginning. A second issue with your regex is at the end: [0-9][-]* - if you wish to match anything at the end of the string then you need to finish your regular expression with .* instead. Edit: see Martijn Pieters's answer regarding the whitespace in the regular expressions.
Here is an updated regular expression:
import re

testString = "87.98-A1-help"
# Raw string so that \. reaches the regex engine unchanged
regCompiled = re.compile(r'^[0-9]{2,3}\.[0-9]{2,3}-A[0-9]-.*')
checkMatch = regCompiled.match(testString)
if checkMatch:
    print("FOUND")
else:
    print("Not Found")
Not everything needs to be enclosed inside [ and ], in particular when you know the character(s) that you wish to match (such as the part -A). Furthermore:
the notation {m,n} means: match at least m times and at most n times, and
to explicitly match a dot, you need to escape it: that's why there is \. in the regular expression above.
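Putting the pieces together, here is a small sketch that wraps the stricter \S*\Z pattern from the first answer in a helper. The function name is_valid_code and the extra test strings are hypothetical and only serve the illustration.

```python
import re

# 2-3 digits, '.', 2-3 digits, '-A', one digit, '-', then any
# non-whitespace characters up to the very end of the string.
CODE_RE = re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')

def is_valid_code(text):
    """Return True if text has the fixed format (hypothetical helper)."""
    # match() anchors at the start; \Z anchors at the end.
    return CODE_RE.match(text) is not None

checks = {
    "87.98-A1-help": is_valid_code("87.98-A1-help"),            # True
    "123.456-A9-Won": is_valid_code("123.456-A9-Won"),          # True
    "1.98-A1-help": is_valid_code("1.98-A1-help"),              # False: one digit
    "1234.98-A1-x": is_valid_code("1234.98-A1-x"),              # False: four digits
    "87.98-A1-has space": is_valid_code("87.98-A1-has space"),  # False: whitespace
    "87.98-A1-help\n": is_valid_code("87.98-A1-help\n"),        # False: newline
}

for text, ok in checks.items():
    print(repr(text), ok)
```

Using \Z rather than $ matters for the last case: $ would also match just before a trailing newline.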
