I am confused about the semantics of the following Python regular expression:
r"/actors(\\..+)?"
I looked through the Python documentation section on regular expressions, but couldn't make sense of this expression. Can someone help me out?
/ # literal /
actors # literal actors
( # starting a subpattern
\\ # (escaped) literal \
. # arbitrary character
.+ # 1 or more arbitrary characters
)? # ends the subpattern and makes it optional
This would mean, it matches forward slash, 'actors', and then optionally backslash and 2 or more arbitrary characters.
I suppose there is a typo here. Either the string should not have been marked raw, or there is one backslash too many. In both cases there would be an escaped . instead of an escaped \ followed by an arbitrary character. This in turn would match files called actors with an arbitrary or missing file extension.
So either "/actors(\\..+)?" or r"/actors(\..+)?".
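One way to see the difference between the two readings is to run both patterns against a few sample paths. This is only a sketch: the sample paths and variable names are made up for illustration, and re.fullmatch is used so that a partial match on the bare /actors prefix does not hide the difference.

```python
import re

# Pattern exactly as asked about: in a raw string, \\ is a literal backslash.
as_written = re.compile(r"/actors(\\..+)?")
# Probable intent: \. is a literal dot introducing a file extension.
probable_intent = re.compile(r"/actors(\..+)?")

# Hypothetical sample paths, chosen only for illustration.
with_extension = "/actors.json"
with_backslash = "/actors\\ab"  # '/actors', one backslash, then 'ab'

written_on_extension = as_written.fullmatch(with_extension)
written_on_backslash = as_written.fullmatch(with_backslash)
intent_on_extension = probable_intent.fullmatch(with_extension)
intent_on_bare = probable_intent.fullmatch("/actors")

# The non-raw spelling with a doubled backslash is the same pattern string.
same_pattern = "/actors(\\..+)?" == r"/actors(\..+)?"

print(written_on_extension)           # no match for a dotted extension
print(written_on_backslash.group(1))  # backslash plus two characters
print(intent_on_extension.group(1))   # the extension, dot included
print(intent_on_bare.group(1))        # optional group did not participate
print(same_pattern)
```

The pattern as written only accepts a suffix that starts with a backslash, which supports the guess that a literal dot was intended.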
\\..+
Here, \\ is an escaped \ character, so it matches exactly that one character. Next comes a . that can match any character, followed by another . that must occur at least once (or more often). So ..+ will match two characters or more, and \\..+ will match any two characters or more, prefixed by a backslash.
(\\..+)?
Because all of this is inside an optional capturing group, it can also be left out entirely.
Note that the expression is probably wrong. It looks as if you are trying to match some kind of URL and want to match the file extension, introduced by a . character. However, the \\ inside a raw string r" " matches a literal \ character and does not escape the dot. So you probably want r"/actors(\..+)?" or "/actors(\\..+)?".
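The intended meaning is presumably: the string /actors, followed by an optional capture group, which contains a literal . and then one or more of whatever the non-literal . is configured to match. (As written in a raw string, \\ matches a literal backslash instead of escaping the dot.)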
Related
I have the following regex expression which is meant to find the "IF" keyword (case insensitive) in a string. Some constraints are imposed:
It should be preceded by a whitespace or a ) character (from a previous expression)
It should be followed by whitespace or ( character
The below expression accomplishes these constraints. However, this expression does not find the keyword when it's located at the start of a string (if(foo, 1, 2) for instance).
Using something like ^|(?<=[\s\)])(?i)if(?=[\s\(]) does not seem to work. I tried (?:^|[\s\)]) but that seems to also capture the space in front of the keyword.
This is what I have so far:
(?<=[\s\)])(?i)if(?=[\s\(])
You may use an alternation group with two zero-width assertions:
(?i)(?:^|(?<=[\s)]))if(?=[\s(])
    ^^^^^^^^^^^^^^^^
Here, (?:^|(?<=[\s)])) matches:
^ - start of string
| - or
(?<=[\s)]) - a location that is immediately preceded with a whitespace or ) character.
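Note that the (?i) inline case insensitive modifier in a Python re regex affects the whole pattern regardless of where it is located in it, so I suggest moving it to the pattern start for better visibility. (Python 3.11 and later go further and reject a global inline flag that is not at the start of the pattern.)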
Also, there is no need to escape ( and ) inside character classes, [...] constructs, as they are treated as literal parentheses inside them.
The problem is that | is applied at top level, so it is an alternation between:
^ and (?<=[\s\)])(?i)if(?=[\s\(]).
Just add non-capturing group around ^ and (?<=[\s\)]):
(?:^|(?<=[\s\)]))(?i)if(?=[\s\(])
You can solve the problem (for this particular case that only involves single characters) using a double negation:
(?<![^\s)])
(not preceded by a character that is neither a whitespace nor a closing parenthesis). This condition includes the start of the string too.
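The alternation and the double-negation approaches can be compared side by side. The following is a minimal sketch, assuming the same lookahead (?=[\s(]) is kept for the double-negation variant and that (?i) is placed at the start of the pattern; the sample strings and the helper find_spans are hypothetical.

```python
import re

# Alternation of two zero-width assertions: start of string, or a lookbehind.
IF_ALTERNATION = re.compile(r"(?i)(?:^|(?<=[\s)]))if(?=[\s(])")
# Double negation: not preceded by a character other than whitespace or ')'.
IF_DOUBLE_NEGATION = re.compile(r"(?i)(?<![^\s)])if(?=[\s(])")


def find_spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in pattern.finditer(text)]


# Hypothetical inputs: three that should match, two that should not.
samples = ["if(foo, 1, 2)", "a) IF (b)", "x If y", "elif(x)", "if_else(x)"]

for text in samples:
    print(text, find_spans(IF_ALTERNATION, text),
          find_spans(IF_DOUBLE_NEGATION, text))
```

Both patterns should report the same spans for these inputs, including the keyword at the very start of the string.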
For this question, I am not interested in alternative pythonic methods, I am only interested in solving the Regex in my code. I can't figure out why it does not work.
Let's say I have the following string:
hello.world
I want to get all characters, excluding all characters before the dot, except the first one before it. So, I am trying to extract the following substring:
o.world
This is my code:
re.sub('^.*[^.\..*]', '', string)
My regex logic breaks down as follows: the leading characters (^.*) that are not one character followed by a dot followed by any number of characters ([^.\..*]) are removed.
However, the Regex doesn't work, can someone help me out?
Your current code is not working because your pattern is not matching what you think it is. Putting .* in a character set does not mean "zero or more characters". Instead, it means the characters . or * literally. Also, \. inside a set is just a literal ., the same as an unescaped . there (since . has no special meaning in a character set).
This means that your pattern is actually equivalent to:
^.*[^.*]
which matches:
^ # The start of the string
.* # Zero or more characters
[^.*] # A character that is not . or *
Because .* is greedy, this matches the whole of hello.world (the final d is neither . nor *), so re.sub removes everything.
To do what you want with re.sub, you can use:
>>> import re
>>> re.sub(r'[^.]*(.\..*)', r'\1', 'hello.world')
'o.world'
>>>
Below is an explanation of what the pattern does:
[^.]* # Matches zero or more characters that are not .
( # Starts a capture group
. # Matches any character (save a newline).
\. # Matches a literal .
.* # Matches zero or more characters
) # Closes the capture group
The important part though is the capture group. Inside the replace string, \1 will refer to whatever was matched by it, which in this case is the text that you want to keep. So, the code above can be seen as replacing all of the text with only that which we need.
That said, it seems like it would be better to just use re.search:
>>> import re
>>> re.search(r'[^.]*(.\..*)', 'hello.world').group(1)
'o.world'
>>>
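To check what the original pattern really does, it can help to run it next to a simplified character class. The sketch below is illustrative only: the variable names are made up, and the last substitution (a lookahead that keeps the character before the dot) is an alternative not taken from the answers above.

```python
import re

text = "hello.world"

# The pattern from the question: the set [^.\..*] excludes only '.' and '*'.
broken_result = re.sub(r'^.*[^.\..*]', '', text)

# The same set written without the redundant entries behaves identically.
simplified_result = re.sub(r'^.*[^.*]', '', text)

# A backslash is NOT excluded by the set, so the negated class matches it.
backslash_matches = re.fullmatch(r'[^.\..*]', '\\') is not None

# Searching directly for "one character, a dot, then the rest".
found = re.search(r'.\..*', text).group(0)

# Assumed alternative: delete the prefix, stopping one character before the dot.
kept = re.sub(r'^[^.]*(?=.\.)', '', text)

print(repr(broken_result), repr(simplified_result), backslash_matches)
print(found, kept)
```

The greedy ^.* swallows the whole string and only has to give back one character for the negated set, which is why the original call returns an empty string.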
I have a regular expression that matches letters, numbers, _ and - (with a minimum and maximum length).
^[a-zA-Z0-9_-]{3,100}$
I want to include whitespace in that set of characters.
According to the Python documentation:
Character classes such as \w or \S are also accepted inside a set.
So I tried:
^[a-zA-Z0-9_-\s]{3,100}$
But it gives bad character range error. How can I include whitespace in the above set?
The problem is not the \s but the - which indicates a character range, unless it is at the end or start of the class. Use this:
^[a-zA-Z0-9_\s-]{3,100}$
^[-a-zA-Z0-9_\s]{3,100}$
_-\s was interpreted as a range. A dash representing itself has to be the first or last character inside [...]
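You're on the right track. Escape the hyphen with a backslash so that it is not read as a range operator. In a regular (non-raw) string literal, each backslash has to be doubled, because the backslash is also the string escape character: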
^[a-zA-Z0-9_\\-\\s]{3,100}$
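The variants discussed above can be tried out directly. This is a rough sketch with made-up variable names and sample inputs; it catches re.error to show that the original pattern fails at compile time, then checks that moving or escaping the hyphen gives a usable set.

```python
import re

# The pattern from the question: '_-\s' is parsed as a (malformed) range.
error_message = None
try:
    re.compile(r"^[a-zA-Z0-9_-\s]{3,100}$")
except re.error as exc:
    error_message = str(exc)

# Hyphen moved to the end of the set, where it stands for itself.
hyphen_last = re.compile(r"^[a-zA-Z0-9_\s-]{3,100}$")
# Hyphen escaped in place.
hyphen_escaped = re.compile(r"^[a-zA-Z0-9_\-\s]{3,100}$")

# The doubled-backslash spelling in a non-raw string is the same pattern text.
same_text = "^[a-zA-Z0-9_\\-\\s]{3,100}$" == r"^[a-zA-Z0-9_\-\s]{3,100}$"

# Hypothetical inputs for illustration.
accepted = bool(hyphen_last.match("hello world-1"))
too_short = hyphen_last.match("ab")
escaped_accepts = bool(hyphen_escaped.match("hello world-1"))

print(error_message)
print(accepted, too_short, escaped_accepts, same_text)
```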
Does r'[!-\.&]' mean the range !-\ (characters from 33 = ord('!') to 92 = ord('\\')), plus '.' and '&', in a set?
I think my interpretation is incorrect based on my test.
But python reference doesn't say anything wrong with my interpretation.
http://docs.python.org/library/re.html
In short, r'[!-\.&]' is just a complicated form of writing r'[!-.]'.
It matches all characters with ord between 33 = ord('!') and 46 = ord('.'), i.e. any of the following:
!"#$%&'()*+,-.
The escaping backslash before . is ignored in character classes; it is unnecessary (. matching all characters in a character class wouldn't make any sense). Since the ampersand & is already in the character class, it is superfluous as well.
Tests may show that the pattern matches chr(33) through chr(46), but the pattern is not guaranteed to work that way on all systems. Here's why. Character sets vary from system to system.
This is why the Perl regex documentation specifically recommends “to use only ranges that begin from and end at either alphabetics of equal case ([a-e], [A-E]), or digits ([0-9]). Anything else is unsafe.” (Perl's documentation is relevant because Python's re syntax is modeled on Perl's.)
So, if this pattern is ever run on an EBCDIC based platform, it will match a different set of characters. It is only correct to say that the pattern matches chr(33) through chr(46) on ASCII based platforms.
It seems that the intention of this regex is to match any character between "!" and "." (notice that the backslash is escaping the "." character), which are ! " # $ % & ' ( ) * + , - . (from the Unicode table at http://www.tamasoft.co.jp/en/general-info/unicode.html).
Two comments about the expression:
Usually, you don't need to escape characters within brackets [] (except, maybe, for the \ itself).
The ampersand symbol "&" is already contained in the range defined by "!-.", so it is redundant.
The backslash escapes the dot and the range will thus be from ! to .. The regex will match:
!"#$%&'()*+,-.
The last & is not necessary since it's included in the range, and escaping a dot is not needed either since it's inside a character class.
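The set of matched characters can be listed empirically. The snippet below is a small sketch that tests every ASCII code point against the pattern; the variable names are illustrative, and the result describes Python's str patterns, which compare by code point.

```python
import re

pattern = re.compile(r'[!-\.&]')
simpler = re.compile(r'[!-.]')

# Collect every ASCII character that the class accepts.
matched = ''.join(chr(c) for c in range(128) if pattern.fullmatch(chr(c)))
matched_simpler = ''.join(chr(c) for c in range(128) if simpler.fullmatch(chr(c)))

# The backslash itself (code point 92) lies outside the range.
backslash_matches = pattern.fullmatch('\\') is not None

print(matched)
print(matched == matched_simpler, backslash_matches)
```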
I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.
>>> import re
>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam." We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(".
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python not to treat the \ characters as string escapes.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
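The two approaches described above, lookbehind and capturing group, might look like this next to each other. It is a minimal sketch: the capture-group pattern is one possible spelling and not necessarily the one used in the other answer, and the variable names are made up.

```python
import re

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"

# Lookbehind: the match itself is only the text after "(K(".
lookbehind_match = re.search(r"(?<=\(K\()[^)]*", url).group(0)

# Capturing group: match "(K(...)" as a whole, then pull out the inner part.
capture_match = re.search(r"\(K\(([^)]*)\)", url)
captured_groups = capture_match.groups()

print(lookbehind_match)
print(captured_groups)
print(capture_match.group(0))  # the full text the pattern consumed
```

With the capturing group, the full match still includes the surrounding (K( and ), while .groups() returns only the remembered subpattern.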
It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
import re
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile/view\.aspx", s)
print(mo.group(1))
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.
If you want to match special characters literally in a regex, you need to escape them, such as \( or \\. (In Python, / is not special, so \/ is allowed but unnecessary.)
Matching things inside nested parentheses is quite a pain in regex. If the format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find an open paren, match characters until you find another open paren, capture characters until you see a close paren, then make sure one more close paren follows somewhere after it.
import re
mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
# Replace the whole string with the word characters inside the innermost parentheses.
print(re.sub(r'^.*\((\w+)\).*', r'\1', mystr))