Regex shorthand for matching special characters - python

I'm building a password validator for the back end of a web app, using the fairly standard requirements: uppercase, lowercase, digit, minimum length and special character. I'd like to refactor the regex a bit so it's more compact. Is there a way to match any special character without writing out a long regex that lists every one of them?
r'^(?=.*[\d])(?=.*[A-Z])(?=.*[a-z])(?=.*[special chars regex])[\w\d special chars regex]{8,255}$'
So far my attempts have revolved around a negated class such as \S. That does match special characters, but since it also matches letters and digits, it still allows passwords with no special characters.
EDIT:
By special characters I mean characters with an ASCII code between 33 and 126, excluding letters and digits (codes 48 to 57 for digits, 65 to 90 for uppercase letters and 97 to 122 for lowercase letters), since those are already covered by existing regex shorthands such as \w and \d.
Here's an ASCII chart for reference.

[^\w\s\d] matches a single character that is neither a word character (a letter, a digit, or an underscore) nor whitespace. The \d is redundant because \w already includes digits, and since the underscore counts as a word character, it is not matched as a special character.
In theory you would then write [\w\d[^\w\d\s]] to describe every character a password may contain, but Python's re does not support nested character classes, so that union doesn't parse as intended. Instead, you can specify the code point range explicitly (in hex, not decimal):
(?=.*[\d])(?=.*[A-Z])(?=.*[a-z])(?=.*[^\w\d\s])[\u0021-\u007E]{8,255}
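As a rough sketch of the definition given in the question's edit (printable ASCII 33 to 126, minus letters and digits), the special characters can also be written as four explicit ranges. The names SPECIAL, PASSWORD_RE and is_valid_password are made up for this illustration, and the re.ASCII flag reflects an assumption that only 0-9 should count as digits.

```python
import re

# Printable ASCII that is neither a letter nor a digit:
# codes 33-47, 58-64, 91-96 and 123-126 (this includes the underscore).
SPECIAL = r"!-/:-@\[-`{-~"

PASSWORD_RE = re.compile(
    r"(?=.*\d)(?=.*[A-Z])(?=.*[a-z])"  # at least one digit, upper, lower
    r"(?=.*[" + SPECIAL + r"])"        # at least one special character
    r"[!-~]{8,255}",                   # only printable, non-space ASCII
    re.ASCII,                          # assumption: \d should mean 0-9 only
)


def is_valid_password(candidate):
    """Return True if the whole candidate satisfies every requirement."""
    return PASSWORD_RE.fullmatch(candidate) is not None


for candidate in ["Abcdef1!", "Abcdef1_", "Abcdefg1", "Ab1!"]:
    print(candidate, is_valid_password(candidate))
```

Using fullmatch avoids having to remember the ^ and $ anchors; whether the underscore should count as special is a policy decision, and here it does.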

Related

Regex matching Unicode variable names

In Python 2, a Python variable name contains only ASCII letters, numbers and underscores, and it must not start with a number. Thus,
re.search(r'[_a-zA-Z][_a-zA-Z0-9]*', s)
will find a matching Python name in the str s.
In Python 3, the letters are no longer restricted to ASCII. I am looking for a new regex that will match any and all legal Python 3 variable names.
According to the docs, \w in a regex will match any Unicode word literal, including numbers and the underscore. I am however unsure whether this character set contains exactly those characters which might be used in variable names.
Even if the character set \w contains exactly the characters from which Python 3 variable names may legally be constructed, how do I use it to create my regex? Using just \w+ will also match "words" which start with a number, which is no good. I have the following solution in mind,
re.search(r'(\w&[^0-9])\w*', s)
where & is the "and" operator (just like | is the "or" operator). The parentheses will thus match any word literal which at the same time is not a number. The problem with this is that the & operator does not exist, and so I'm stuck with no solution.
Edit
Though the "double negative" trick (as explained in the answer by Patrick Artner below) can also be found in this question, note that this only partly answers my question. Using [^\W0-9]\w* only works if I am guaranteed that \w exactly matches the legal Unicode characters, plus the numbers 0-9. I would like a source of this knowledge, or some other regex which gets the job done.
You can use a double negative - \W is anything that \w is not - just disallow it to allow any \w:
[^\W0-9]\w*
In other words: one character that is neither a non-word character nor one of 0-9, followed by any number of word characters.
Docs: regular-expression-syntax
You could try using
^(?![0-9])\w+$
This will not partially match invalid variable names, because it is anchored at both ends.
Alternatively, if you don't need a regex, str.isidentifier() will probably do what you want.
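To compare the two approaches, here is a small sketch that runs the double-negative regex and str.isidentifier() side by side. The helper names looks_like_name and is_usable_name are invented for this example. The regex is only an approximation: \w is not defined in terms of the identifier rules, so the two can disagree on unusual characters, and isidentifier() also returns True for keywords, which is why the second helper consults the keyword module.

```python
import keyword
import re

NAME_RE = re.compile(r"[^\W0-9]\w*")


def looks_like_name(s):
    """Regex-based check; the whole string has to match."""
    return NAME_RE.fullmatch(s) is not None


def is_usable_name(s):
    """Identifier check that also rejects reserved words such as 'class'."""
    return s.isidentifier() and not keyword.iskeyword(s)


for candidate in ["spam", "_private", "na\u00efve", "2fast", "has space", "class"]:
    print(candidate, looks_like_name(candidate), is_usable_name(candidate))
```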

Removing special characters and symbols from a string in python

I am trying to do what my title says. I have a list of about 30 thousand business addresses, and I'm trying to make each address as uniform as possible.
As far as removing weird symbols and characters goes, I have found three suggestions, but I don't understand how they are different.
If somebody can explain the difference, or provide insight into a better way to standardize address information, please and thank you!
address = re.sub(r'([^\s\w]|_)+', '', address)
address = re.sub('[^a-zA-Z0-9-_*.]', '', address)
address = re.sub(r'[^\w]', ' ', address)
The first suggestion uses the \s and \w regex shorthand classes.
\s means "match any whitespace".
\w means "match any letter, digit, or underscore".
These are used inside a negated character class ([^\s\w]), which means "match anything which isn't whitespace, a letter, a digit, or an underscore". It is then combined, using the alternation |, with _, which matches a literal underscore, and the whole group is given a + quantifier, which matches one or more times.
So what this says is: "Match any sequence of one or more characters, each of which is either an underscore or something other than whitespace, a letter, or a digit, and remove it".
The second option says: "Match any character which isn't a letter, number, hyphen, underscore, dot or asterisk and remove it". This is stated by that big negated character class (the stuff between the square brackets).
The third option says "Take any character which is not a letter, digit, or underscore and replace it with a space". It uses the \w shorthand, which I have explained.
All of the options use regular expressions to match character sequences with certain characteristics, and the re.sub function, which substitutes anything matched by the given regex with the second argument.
You can read more about Regular Expressions in Python here.
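The difference between the three suggestions is easiest to see by running them on the same input. The sample address below is made up for illustration.

```python
import re

address = "12 Main_St. #4-B"  # hypothetical sample input

# 1) Remove everything except whitespace, letters and digits (underscores go too).
first = re.sub(r"([^\s\w]|_)+", "", address)

# 2) Keep only letters, digits, '-', '_', '*' and '.'; whitespace is removed.
second = re.sub(r"[^a-zA-Z0-9-_*.]", "", address)

# 3) Replace every non-word character with a single space; underscores stay.
third = re.sub(r"[^\w]", " ", address)

print(repr(first))   # '12 MainSt 4B'
print(repr(second))  # '12Main_St.4-B'
print(repr(third))   # one space per replaced character, so runs of spaces appear
```

For address clean-up, the first form is probably closest to what is wanted, since it is the only one that keeps the original spacing between words.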
The negated class [^a-zA-Z0-9-_*.] enumerates exactly the characters to keep; everything else is removed (though the literal - is better placed at the beginning or end of the character class).
\w is defined as "word character" which in traditional ASCII locales included A-Z and a-z as well as digits and underscore, but with Unicode support, it matches accented characters, Cyrillics, Japanese ideographs, etc.
\s matches space characters, which again with Unicode includes a number of extended characters such as the non-breakable space, numeric space, etc.
Which exactly to choose obviously depends on what you want to accomplish and what you mean by "special characters". Numbers are "symbols", all characters are "special", etc.
Here's a pertinent quotation from the Python re documentation:
\s
For Unicode (str) patterns:
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [ \t\n\r\f\v] may be a better choice).
For 8-bit (bytes) patterns:
Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].
\w
For Unicode (str) patterns:
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).
For 8-bit (bytes) patterns:
Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].
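A short sketch of what the quoted passage means in practice: the same \w and \s patterns give different results on a str depending on whether the ASCII flag is set. The sample text is an assumption chosen to contain an accented letter and a no-break space.

```python
import re

# "café", a no-break space (U+00A0), then "au lait"
text = "caf\u00e9\u00a0au lait"

words_unicode = re.findall(r"\w+", text)
words_ascii = re.findall(r"\w+", text, re.ASCII)

parts_unicode = re.split(r"\s+", text)
parts_ascii = re.split(r"\s+", text, flags=re.ASCII)

print(words_unicode)  # the accented letter is part of the word
print(words_ascii)    # the accented letter ends the word early
print(parts_unicode)  # the no-break space is treated as whitespace
print(parts_ascii)    # the no-break space is not a separator
```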
How you read the re.sub function is like this (more docs):
re.sub(a, b, my_string) # replace any matches of regex a with b in my_string
I would go with the second one. Regexes can be tricky, but this one says:
[^a-zA-Z0-9-_*.] # anything that's NOT a-z, A-Z, 0-9, -, _, * or .
Which seems like it's what you want. Whenever I'm using regexes, I use this site:
http://regexr.com/
You can put in some of your inputs, and make sure they are matching the right kinds of things before throwing them in your code!

Escaping numbers and special chars in python regex

I have a form, and I want its fields to reject any numbers and any special characters.
Currently I have this:
name = forms.RegexField(regex = r'[^0-9]+$')
Right now it only excludes numbers.
How do I write a regex pattern that excludes both numbers and special characters? Any advice?
Assuming that words containing both letters and numbers are allowed, this regex could do the job:
[a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*
This checks that the input contains at least one letter and allows only letters, or a combination of letters and digits. No special characters are allowed: abc, 12abc and abc12 are valid, but 123 and ab#/ are invalid. For this to hold, the pattern has to match the whole input, for example by anchoring it with ^ and $.
What I did was reverse the approach: instead of restricting special characters, I allowed only letters and the combination mentioned above. This automatically excludes special characters.
If excluding numbers as well is the requirement, this regex could be used:
[a-zA-Z]+
This allows only letters and rejects all numbers and all special characters.
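Whether these allow-list patterns reject special characters depends on how they are applied. As a minimal sketch using plain re (how a particular form library applies the pattern is not covered here), the first pattern accepts ab#/ when it is merely searched for, but rejects it when it must match the whole input:

```python
import re

PATTERN = r"[a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*"
samples = ["abc", "12abc", "abc12", "123", "ab#/"]

# fullmatch: the entire string has to consist of allowed characters.
accepted_whole = [s for s in samples if re.fullmatch(PATTERN, s)]

# search: finding the pattern anywhere in the string is enough.
accepted_partial = [s for s in samples if re.search(PATTERN, s)]

print(accepted_whole)
print(accepted_partial)
```

If the pattern is handed to code that searches rather than full-matches, writing it as ^...$ gives the stricter behaviour.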
Here is a table of all the regexp symbols:
http://www.javascriptkit.com/jsref/regexp.shtml
For your problem, if I understand what you want, it should be r'[a-zA-Z]+$'
I'm not sure, but have a look at the table in the link; it's very useful.
As far as I understand, you need to match only letters.
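To be Unicode compatible, use Unicode properties:
\p{L}+
This will match one or more letters in any language. Note that Python's built-in re module does not support \p{...}; the third-party regex module does.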
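With only the standard library, a common stand-in for \p{L}+ is the double negative [^\W\d_]+, i.e. word characters minus digits and the underscore. This is an approximation rather than an exact equivalent of the Unicode Letter categories, and the names LETTERS_ONLY and is_letters_only are invented for this sketch.

```python
import re

# Word characters, minus digits, minus the underscore: roughly "letters".
LETTERS_ONLY = re.compile(r"[^\W\d_]+")


def is_letters_only(value):
    """Return True if value is non-empty and made up of letters only."""
    return LETTERS_ONLY.fullmatch(value) is not None


for value in ["Jos\u00e9", "\u0410\u043d\u043d\u0430", "abc12", "ab#", "snake_case"]:
    print(value, is_letters_only(value))
```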

What is the proper regular expression to match all utf-8/unicode lowercase letter forms

I would like to match all lowercase letter forms in the Latin block. The trivial '[a-z]' only matches characters between U+0061 and U+007A, and not all the other lowercase forms.
I would like to match all lowercase letters, most importantly, all the accented lowercase letters in the Latin block used in EFIGS languages.
[a-zà-ý] is a start, but there are still tons of other lowercase characters (see http://www.unicode.org/charts/PDF/U0000.pdf). Is there a recommended way of doing this?
FYI I'm using Python, but I suspect that this problem is cross-language.
Python's builtin "islower()" method seems to do the right checking:
# Python 3 version: chr() and range() replace unichr() and xrange().
lower = ''
for c in range(2**16):
    if chr(c).islower():
        lower += chr(c)
print(lower)
Python does not currently support Unicode properties in regular expressions. See this answer for a link to the Ponyguruma library which does support them.
Using such a library, you could use \p{Ll} to match any lowercase letter in a Unicode string.
Every character in the Unicode standard is in exactly one category. \p{Ll} is the category of lowercase letters, while \p{L} comprises all the characters in one of the "Letter" categories (Letter, uppercase; Letter, lowercase; Letter, titlecase; Letter, modifier; and Letter, other). For more information see the Character Properties chapter of the Unicode Standard. Or see this page for a good explanation on use of Unicode in regular expressions.
It looks as though this recipe, posted back in 2005 (shown here ported to Python 3),
import sys, re
# Build a character class containing every uppercase code point.
uppers = ['[']
for i in range(sys.maxunicode + 1):
    c = chr(i)
    if c.isupper():
        uppers.append(c)
uppers.append(']')
uppers = ''.join(uppers)
uppers_re = re.compile(uppers)
print(uppers_re.match('A'))
is still relevant.
You might want to have a look at regular-expressions.info.
However, as far as I know there's no character class or modifier that expresses "lower case characters only" (and not every language has lower case characters), so I'd say you might have to use multiple ranges (possibly almost as many as there are Unicode blocks).
Edit:
Reading a bit more on this, there might be a way: [\p{Ll}\p{Lo}], which means lowercase characters that have an uppercase variant, or characters that have no lowercase and uppercase forms at all (Chinese characters, for example).
Regex [\p{Ll}\p{Lo}]+ matches test string àÀhelloHello你好Прывітанне and replacing the matches with x results in xÀxHxПx whereas replacing the matches of [\p{Ll}]+ results in xÀxHx你好Пx (note the Chinese characters that were not matched).
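The [\p{Ll}]+ substitution above can be approximated with the standard library alone by building a character class from every code point whose Unicode category is Ll. This is only a sketch: LOWER_RE is a name made up here, and the exact set of characters depends on the Unicode database bundled with the Python version in use.

```python
import re
import sys
import unicodedata

# Collect every code point in the "Letter, lowercase" (Ll) category.
# None of these characters is a regex metacharacter, so no escaping is needed.
lowercase_chars = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Ll"
]
LOWER_RE = re.compile("[" + "".join(lowercase_chars) + "]+")

sample = "\u00e0\u00c0helloHello你好Прывітанне"
replaced = LOWER_RE.sub("x", sample)
print(replaced)
```

Building the class takes a moment because it scans the whole code point range, so it is worth compiling once and reusing.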
If you use \p{L} it will match any Unicode letter. Check the examples here. You can also combine it with \p{M} to match Hebrew-esque languages that include diacritic marks: (\p{L}|\p{M})+
EDIT:
I missed the part about only lowercase letters the first time around. \p{L} will match all letters, \p{Ll} will match lowercase only.

How to match alphabetical chars without numeric chars with Python regexp?

Using Python module re, how to get the equivalent of the "\w" (which matches alphanumeric chars) WITHOUT matching the numeric characters (those which can be matched by "[0-9]")?
Notice that the basic need is to match any character (including all unicode variation) without numerical chars (which are matched by "[0-9]").
As a final note, I really need a regexp as it is part of a greater regexp.
Underscores should not be matched.
EDIT:
I hadn't thought about how underscores should be handled, so thanks for the warnings that "\w" matches them and for the accepted solution that addresses this issue.
You want [^\W\d]: the group of characters that is not (either a digit or not an alphanumeric). Add an underscore in that negated set if you don't want them either.
A bit twisted, if you ask me, but it works. Should be faster than the lookahead alternative.
(?!\d)\w
A position that is not followed by a digit, and then \w. Effectively cancels out digits but allows the \w range by using a negative look-ahead.
The same could be expressed as a positive look-ahead and \D:
(?=\D)\w
To match several of these in a row, wrap it in a non-capturing group and add a quantifier:
(?:(?!\d)\w)+
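The patterns from these answers can be compared on one input. The sample string below is made up; it mixes ASCII letters, digits, an underscore and an accented letter.

```python
import re

text = "abc123_d\u00e9f"  # hypothetical sample: "abc123_déf"

negated_class = re.findall(r"[^\W\d]+", text)
negative_lookahead = re.findall(r"(?:(?!\d)\w)+", text)
positive_lookahead = re.findall(r"(?:(?=\D)\w)+", text)
no_underscore = re.findall(r"[^\W\d_]+", text)

print(negated_class)       # underscore still counts as a word character
print(negative_lookahead)  # same matches as the negated class
print(positive_lookahead)  # same matches again
print(no_underscore)       # underscore excluded as well
```

The first three are interchangeable here; only the last one satisfies the requirement that underscores should not be matched.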
