Python regex: Including whitespace inside character range - python

I have a regular expression that matches letters, digits, _ and - (with a minimum and maximum length):
^[a-zA-Z0-9_-]{3,100}$
I want to include whitespace in that set of characters.
According to the Python documentation:
Character classes such as \w or \S are also accepted inside a set.
So I tried:
^[a-zA-Z0-9_-\s]{3,100}$
But it gives a "bad character range" error. How can I include whitespace in the above set?

The problem is not the \s but the - which indicates a character range, unless it is at the end or start of the class. Use this:
^[a-zA-Z0-9_\s-]{3,100}$

^[-a-zA-Z0-9_\s]{3,100}$
_-\s was interpreted as a range. A dash that represents itself has to be the first or last character inside [...] (or be escaped as \-).
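As a quick check of the two answers above, here is a minimal sketch (the names VALID, is_valid and compiles are made up for illustration). It should show that the class from the question is rejected at compile time, while the version with the dash moved to the end accepts whitespace:

```python
import re

# Dash placed last in the class, so it is a literal dash, not a range operator.
VALID = re.compile(r'^[a-zA-Z0-9_\s-]{3,100}$')


def is_valid(text):
    """Return True if text is 3-100 ASCII letters, digits, _, - or whitespace."""
    return VALID.match(text) is not None


def compiles(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(r'^[a-zA-Z0-9_-\s]{3,100}$'))  # the pattern from the question
print(is_valid('foo bar-baz_1'))
print(is_valid('ab'))  # shorter than the minimum of 3
```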

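You're on the right track. If the pattern is written as an ordinary (non-raw) Python string literal, double each backslash, because the backslash is also the escape character of string literals; the regex engine then receives \- and \s. Do not double them inside a raw string (r"..."):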
^[a-zA-Z0-9_\\-\\s]{3,100}$
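Whether the doubled backslashes are right depends on the kind of string literal, which the answer does not spell out. A small sketch (variable names are illustrative) comparing the non-raw and raw spellings:

```python
import re

# In a non-raw literal, "\\-" and "\\s" reach the regex engine as \- and \s.
non_raw = "^[a-zA-Z0-9_\\-\\s]{3,100}$"
raw_equivalent = r"^[a-zA-Z0-9_\-\s]{3,100}$"
print(non_raw == raw_equivalent)

# In a raw literal the same text means something else: \\ is a literal
# backslash and \s is no longer present, so whitespace is not accepted.
raw_doubled = r"^[a-zA-Z0-9_\\-\\s]{3,100}$"

print(re.match(non_raw, "a b-c") is not None)
print(re.match(raw_doubled, "a b-c") is not None)
```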

Related

Python regex: alternative positive lookbehind assertion

I have the following regex expression which is meant to find the "IF" keyword (case insensitive) in a string. Some constraints are imposed:
It should be preceded by a whitespace or a ) character (from a previous expression)
It should be followed by whitespace or ( character
The below expression accomplishes these constraints. However, this expression does not find the keyword when it's located at the start of a string (if(foo, 1, 2) for instance).
Using something like ^|(?<=[\s\)])(?i)if(?=[\s\(]) does not seem to work. I tried (?:^|[\s\)]) but that also includes the space in front of the keyword in the match.
This is what I have so far:
(?<=[\s\)])(?i)if(?=[\s\(])
You may use an alternation group with two zero-width assertions:
(?i)(?:^|(?<=[\s)]))if(?=[\s(])
    ^^^^^^^^^^^^^^^^
Here, (?:^|(?<=[\s)])) matches:
^ - start of string
| - or
(?<=[\s)]) - a location that is immediately preceded with a whitespace or ) character.
Note that in Python versions before 3.11 the (?i) inline case-insensitive modifier affects the whole pattern regardless of where it is located in it. Since Python 3.11, a global inline flag that is not at the start of the pattern raises an error, so move it to the pattern start.
Also, there is no need to escape ( and ) inside character classes ([...] constructs), as they are treated as literal parentheses there.
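To try the alternation group on a few inputs, here is a minimal sketch. The names IF_KEYWORD and find_if_positions are hypothetical, and the sample strings are invented for the demonstration:

```python
import re

# Either at the start of the string, or right after whitespace or ")".
IF_KEYWORD = re.compile(r'(?i)(?:^|(?<=[\s)]))if(?=[\s(])')


def find_if_positions(text):
    """Return the start index of every IF keyword found in text."""
    return [m.start() for m in IF_KEYWORD.finditer(text)]


print(find_if_positions('if(foo, 1, 2)'))  # keyword at the start of the string
print(find_if_positions('x) IF (y)'))      # keyword after whitespace
print(find_if_positions('elif(x)'))        # "if" inside another word
```

Because both branches of the group are zero-width, the match itself contains only the two letters of the keyword.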
The problem is that | is applied at the top level, so it is an alternation between:
^ and (?<=[\s\)])(?i)if(?=[\s\(]).
Just add a non-capturing group around ^ and (?<=[\s\)]) (with (?i) moved to the front, where Python 3.11+ requires it):
(?i)(?:^|(?<=[\s\)]))if(?=[\s\(])
You can solve the problem (for this particular case that only involves single characters) using a double negation:
(?<![^\s)])
(not preceded by a character that is neither whitespace nor a closing parenthesis). This condition is also satisfied at the start of the string, where there is no preceding character at all.
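A sketch of the double-negation variant, assuming it is combined with the same lookahead as in the question (the name DOUBLE_NEG and the sample strings are illustrative only):

```python
import re

# (?<![^\s)]) fails only when the previous character exists and is
# something other than whitespace or ")", so the string start passes too.
DOUBLE_NEG = re.compile(r'(?i)(?<![^\s)])if(?=[\s(])')

starts = [m.start() for m in DOUBLE_NEG.finditer('if(a) if (b)')]
print(starts)

# "if" preceded by a letter is rejected.
print(DOUBLE_NEG.search('diff(x)'))
```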

Is this regex syntax working?

I want to search a string for a substring beginning with ">".
Does this syntax say what I want it to say: this character followed by anything?
regex_firstline = re.compile("[>]{1}.*")
A more Pythonic way to check whether a string begins with ">" is the str.startswith() method, which needs no regex.
As for your regex "[>]{1}.*": you don't need {1} after the character class, and you can anchor the match to the start of the string with ^. So it can be "^>.*".
Using http://regex101.com:
[>]{1} matches the single character > literally, exactly one time (the site also notes that {1} is a meaningless quantifier), and
.* then matches any character as many times as possible.
If a list was provided inside square brackets (as opposed to a single character), regex would attempt to match a single character within the list exactly one time. http://regex101.com has a good listing of tokens and what they mean.
An ideal regex would be ^[>].*, meaning: at the beginning of a string, find exactly one > character followed by anything else. With only one character in the square brackets, you can remove them to simplify it even further: ^>.*
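The suggestions above can be compared side by side. This is only a sketch; the function names and the sample lines are made up:

```python
import re

ORIGINAL = re.compile(r'[>]{1}.*')  # pattern from the question, unanchored
ANCHORED = re.compile(r'^>.*')      # simplified and anchored


def header_via_regex(line):
    """Return the matched text if line starts with '>', else None."""
    m = ANCHORED.match(line)
    return m.group(0) if m else None


def is_header(line):
    """The regex-free check suggested in the first answer."""
    return line.startswith('>')


print(header_via_regex('>seq1 description'))
print(header_via_regex('ACGT>tail'))
print(ORIGINAL.search('ACGT>tail').group(0))  # unanchored: finds '>' anywhere
print(is_header('>seq1 description'))
```

Note that . does not match a newline by default, so on multi-line input each of these patterns stops at the end of the line.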

Pattern for '.' separated words with arbitrary number of whitespaces

It's the first time that I'm using regular expressions in Python and I just can't get it to work.
Here is what I want to achieve: I want to find all strings, where there is a word followed by a dot followed by another word. After that an unknown number of whitespaces followed by either (off) or (on). For example:
word1.word2 (off)
Here is what I have come up with so far:
string_group = re.search(r'\w+\.\w+\s+[(\(on\))(\(off\))]', analyzed_string)
\w+ for the first word
\. for the dot
\w+ for the second word
\s+ for the whitespaces
[(\(on\))(\(off\))] for the (off) or (on)
I think that the last expression might not be doing what I need it to. With the implementation right now, the program does find the right place in the string, but the output of
string_group.group(0)
Is just
word1.word2 (
instead of the whole expression I'm looking for. Could you please give me a hint what I am doing wrong?
[ ... ] denotes a character class and will match any one character inside it, unless you add a quantifier: [ ... ]+ for one or more times.
But simply adding that won't work...
\w+\.\w+\s+[(\(on\))(\(off\))]+
Will match garbage stuff like word1.word2 )(fno(nofn too, so you actually don't want to use a character class, because it'll match the characters in any order. What you can use is a capturing group, and a non-capturing group along with an OR operator |:
\w+\.\w+\s+(\((?:on|off)\))
(?:on|off) will match either on or off
Now, if you don't want the parentheses to be captured in the first group too, you can change that to:
\w+\.\w+\s+\((on|off)\)
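The difference between the two grouping variants shows up in group(1). A short sketch, with an invented sample line:

```python
import re

line = 'word1.word2   (off)'

# Capturing group wraps the literal parentheses as well.
with_parens = re.search(r'\w+\.\w+\s+(\((?:on|off)\))', line)
# Capturing group wraps only the on/off word.
without_parens = re.search(r'\w+\.\w+\s+\((on|off)\)', line)

print(with_parens.group(1))
print(without_parens.group(1))
print(with_parens.group(0) == without_parens.group(0))  # same overall match
```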
You've got your logical OR mixed up.
[(\(on\))(\(off\))]
should be
\((?:on|off)\)
[]s are just for matching single characters.
The square brackets are a character class, which matches any one of the characters in the brackets. You appear to be trying to use it to match one of the sub-regexes (\(on\)) and (\(off\)). The way to do that is with an alternation operation, the pipe symbol: (\(on\)|\(off\)).
I think your problem may be with the square brackets []: they indicate a set of single characters to match. So your expression would match a single instance of any of the following chars: "()ofn"
So for the string "word1.word2 (on)", you are matching only this part: "word1.word2 ("
Try using this one instead:
re.search(r'\w+\.\w+\s+\((on|off)\)', analyzed_string)
This match assumes that the () will be there, and looks for either "on" or "off" inside the parentheses.
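To see why the original attempt stops at the opening parenthesis, the character class and the alternation can be run on the same input. A minimal sketch (variable names are illustrative):

```python
import re

text = 'word1.word2 (on)'

# The bracket expression is a set of single characters: ( ) o n f
char_class = re.search(r'\w+\.\w+\s+[(\(on\))(\(off\))]', text)
# Alternation inside escaped parentheses matches the whole token.
alternation = re.search(r'\w+\.\w+\s+\((on|off)\)', text)

print(repr(char_class.group(0)))
print(repr(alternation.group(0)))
print(alternation.group(1))

# With a + quantifier the class accepts those characters in any order.
garbage = re.fullmatch(r'\w+\.\w+\s+[(\(on\))(\(off\))]+', 'word1.word2 )(fno(nofn')
print(garbage is not None)
```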

Python reference to regex in parentheses

I have a text file that needs to have the letter 't' removed if it is not immediately preceded by a number.
I am trying to do this using re.sub and I have this:
import re
with open('File.txt') as f:
    g = f.read()
g = re.sub('([^0-9])t', '', g)
This identifies the letters to be removed correctly but also removes the preceding character. How can I refer to the parenthesized regex in the replacement String?
Thanks!
Use a lookbehind (or negative lookbehind) instead.
g=re.sub('(?<=[^0-9])t','',g)
or
g=re.sub('(?<![0-9])t','',g)
Three options:
g=re.sub('([^0-9])t','\\1',g)
or
g=re.sub('(?<=[^0-9])t','',g)
or
g=re.sub('(?<![0-9])t','',g)
The first option is what you are looking for, a backreference to the captured string. \\1 will refer to the first captured group.
Lookarounds don't consume characters, so there is nothing to put back in the replacement. Here, I have used a positive lookbehind for the second option and a negative lookbehind for the third. Neither consumes the character it inspects, so the [^0-9] or [0-9] is not part of the text that gets replaced. They may be preferable to the backreference because matches cannot overlap: with ([^0-9])t, a t that was consumed as the preceding character of one match cannot itself be removed, so in "att" only the first t goes.
The positive lookbehind makes sure that t has a non-digit character before it, so it skips a t at the very start of the text. The negative lookbehind makes sure that t does not have a digit before it, which also covers a t at the very start.
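The three options differ on edge cases, which a small experiment makes visible. This is a sketch on invented sample strings, not on the asker's file:

```python
import re

text = 'tat 5t xt'

# Option 1: capture the preceding non-digit and put it back.
backref = re.sub(r'([^0-9])t', r'\1', text)
# Option 2: positive lookbehind, requires some non-digit before the t.
positive = re.sub(r'(?<=[^0-9])t', '', text)
# Option 3: negative lookbehind, only forbids a digit before the t.
negative = re.sub(r'(?<![0-9])t', '', text)

print(repr(backref))
print(repr(positive))
print(repr(negative))

# Consecutive t's: the backreference version consumes the character
# before each match, so it cannot remove both.
print(re.sub(r'([^0-9])t', r'\1', 'att'), re.sub(r'(?<![0-9])t', '', 'att'))
```

Only the negative lookbehind removes the t at the very start of the text, since there is no preceding character for the other two patterns to match.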

Python Regular Expression (\..+)?

I am confused about the semantics of the following Python regular expression:
r"/actors(\\..+)?"
I looked through the Python documentation section on regular expressions, but couldn't make sense of this expression. Can someone help me out?
/ # literal /
actors # literal actors
( # starting a subpattern
\\ # (escaped) literal \
. # arbitrary character
.+ # 1 or more arbitrary characters
)? # ends the subpattern and makes it optional
This would mean it matches a forward slash, 'actors', and then optionally a backslash and 2 or more arbitrary characters.
I suppose there is a typo here. Either the string should not have been marked raw, or there is one backslash too many. With either fix there would be an escaped . instead of an escaped \ followed by an arbitrary character. This in turn would match files called actors with an arbitrary or missing file extension.
So either "/actors(\\..+)?" or r"/actors(\..+)?".
\\..+
Here, \\ is an escaped \ character, so it matches exactly that one character. It is followed by a . that can match any character, followed by another . that must occur at least once (or more often). So ..+ will match two characters or more, and \\..+ will match any two characters or more, prefixed by a backslash.
(\\..+)?
That all of this is inside an optional capturing group means it could be left out as well.
Note that the expression is probably wrong. It looks as if you are trying to match some kind of URL and want to match the file extension, introduced by a . character. However, the \\ inside a raw string r" " will match the \ character and will not escape the dot itself. So you probably want r"/actors(\..+)?" or "/actors(\\..+)?".
Read with a single backslash, as r"/actors(\..+)?", it means: the string /actors, followed by an optional capture group, which contains a literal . and then one or more of whatever the non-literal . is configured to match.
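The difference between the pattern as written and the presumably intended one can be checked directly. A minimal sketch; the sample paths are invented, and that the pattern is meant for URL paths is only the guess made in the answers above:

```python
import re

as_written = re.compile(r"/actors(\\..+)?")  # literal backslash, then 2+ chars
intended = re.compile(r"/actors(\..+)?")     # literal dot, then 1+ chars

# The non-raw spelling with two backslashes equals the intended raw one.
print("/actors(\\..+)?" == r"/actors(\..+)?")

print(intended.fullmatch('/actors.json').group(1))
print(intended.fullmatch('/actors').group(1))  # optional group did not take part
print(as_written.fullmatch('/actors.json'))    # no backslash in the input
print(as_written.fullmatch('/actors\\.json').group(1))
```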
