Escaping numbers and special chars in Python regex

I have a form, and I want its fields to reject any digits and any special characters.
Currently I have this:
name = forms.RegexField(regex = r'[^0-9]+$')
This only excludes digits.
How do I write a regex pattern that excludes both digits and special characters? Any advice?

Assuming that words containing both letters and digits are allowed, this regex could do the job:
[a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*
Applied to the whole field (for example, anchored with ^ and $), this checks that the input contains at least one letter and consists only of letters, or of a combination of letters and digits. No special characters are allowed: abc, 12abc and abc12 are valid, but 123 and ab#/ are invalid.
I reversed the approach here. Instead of forbidding special characters, I allowed only letters and the combination mentioned above, which automatically excludes special characters.
If digits must be excluded as well, this regex could be used:
[a-zA-Z]+
This allows only letters and rejects all digits and all special characters.
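A minimal sketch of how the two patterns above behave when the whole input has to match. It uses re.fullmatch from the standard library instead of a Django form field, and the helper name is_valid is an assumption made for illustration only.

```python
import re

# Letters and digits are allowed, but at least one letter is required.
ALNUM_WITH_LETTER = re.compile(r"[a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*")
# Letters only.
LETTERS_ONLY = re.compile(r"[a-zA-Z]+")


def is_valid(pattern, value):
    """Hypothetical helper: True only if the entire value matches."""
    return pattern.fullmatch(value) is not None


for sample in ["abc", "12abc", "abc12", "123", "ab#/"]:
    print(sample,
          is_valid(ALNUM_WITH_LETTER, sample),
          is_valid(LETTERS_ONLY, sample))
```

Without whole-string matching (fullmatch, or ^ and $ anchors), a search-style check would accept ab#/ because the ab prefix alone satisfies the pattern.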

Here is a table listing all the regexp symbols:
http://www.javascriptkit.com/jsref/regexp.shtml
For your problem, if I understand what you want, it should be r'[a-zA-Z]+$'
I'm not certain, but have a look at the table in the link; it's very useful.

As far as I understand, you need to match only letters.
To be Unicode compatible, use Unicode properties:
\p{L}+
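This will match one or more letters in any language. Note that Python's built-in re module does not support \p{...} escapes; the third-party regex module does.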
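If only the standard library is available, a close approximation of \p{L}+ can be sketched with a negated character class. This is an assumption-laden substitute rather than an exact equivalent: [^\W\d_] can still admit a few numeric characters that are not decimal digits. The helper name only_letters is hypothetical.

```python
import re

# [^\W\d_] reads as "a word character that is neither a decimal digit
# nor an underscore". For str patterns in Python 3, \w is Unicode-aware.
UNICODE_LETTERS = re.compile(r"[^\W\d_]+")


def only_letters(value):
    """Hypothetical helper: True if value consists of letters only."""
    return UNICODE_LETTERS.fullmatch(value) is not None


for sample in ["naïve", "Привет", "abc1", "a_b", ""]:
    print(repr(sample), only_letters(sample))
```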

Related

Regular expression that accepts tokens of three or more alphabetical characters

I'm trying to build a TfidfVectorizer that only accepts tokens of 3 or more alphabetical characters, using TfidfVectorizer(token_pattern="(?u)\\b\\D\\D\\D+\\b")
But it doesn't behave correctly. I know token_pattern="(?u)\\b\\w\\w\\w+\\b" accepts tokens of 3 or more alphanumeric characters, so I don't understand why the former is not working.
What am I missing?
The problem lies in the \D metacharacter: it matches any non-digit character, not any alphabetical character. Per the Python docs, \D matches any character that is not a decimal digit, which includes spaces and punctuation.
You can go instead with:
token_pattern="(?i)[a-z]{3,}"
Explanation:
(?i) — inline flag to make matching case-insensitive,
[a-z] — matches any Latin letter,
{3,} — makes the previous token match three or more times (greedily, i.e., as many times as possible).
I hope this answers your question. :)
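The difference between the two token patterns can be seen with plain re.findall, without involving the vectorizer at all. This is only a sketch; the sample sentence is made up for illustration.

```python
import re

text = "the cat sat on 42 mats in 2020"

# \D means "any non-digit", so spaces qualify as token characters too.
non_digit_tokens = re.findall(r"(?u)\b\D\D\D+\b", text)
# Restricting to Latin letters yields the intended tokens.
letter_tokens = re.findall(r"(?i)[a-z]{3,}", text)

print(non_digit_tokens)  # "tokens" that swallow whole phrases, spaces included
print(letter_tokens)     # ['the', 'cat', 'sat', 'mats']
```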

Regex matching Unicode variable names

In Python 2, a Python variable name contains only ASCII letters, numbers and underscores, and it must not start with a number. Thus,
re.search(r'[_a-zA-Z][_a-zA-Z0-9]*', s)
will find a matching Python name in the str s.
In Python 3, the letters are no longer restricted to ASCII. I am looking for a new regex that will match any and all legal Python 3 variable names.
According to the docs, \w in a regex will match any Unicode word literal, including numbers and the underscore. I am however unsure whether this character set contains exactly those characters which might be used in variable names.
Even if the character set \w contains exactly the characters from which Python 3 variable names may legally be constructed, how do I use it to create my regex? Using just \w+ will also match "words" which start with a number, which is no good. I have the following solution in mind,
re.search(r'(\w&[^0-9])\w*', s)
where & is the "and" operator (just like | is the "or" operator). The parentheses will thus match any word literal which at the same time is not a number. The problem with this is that the & operator does not exist, and so I'm stuck with no solution.
Edit
Though the "double negative" trick (as explained in the answer by Patrick Artner below) can also be found in this question, note that this only partly answers my question. Using [^\W0-9]\w* only works if I am guaranteed that \w exactly matches the legal Unicode characters, plus the numbers 0-9. I would like a source of this knowledge, or some other regex which gets the job done.
You can use a double negative: \W is anything that \w is not, so negating \W inside a character class allows any \w, and adding 0-9 to the same negated class excludes the digits:
[^\W0-9]\w*
In other words: one character that is neither a non-word character nor a digit 0-9, followed by any word character any number of times.
Docs: regular expression syntax
You could try using
^(?![0-9])\w+$
This will not partially match invalid variable names.
Alternatively, if you don't need to use a regex, str.isidentifier() will probably do what you want.
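A small sketch comparing the double-negative regex with str.isidentifier() on a few hand-picked candidates. The sample list is an assumption chosen for illustration; it includes one name starting with a non-ASCII decimal digit (U+0663), which shows that excluding only 0-9 is not enough to agree with Python's own rules.

```python
import re

NAME_RE = re.compile(r"[^\W0-9]\w*")

candidates = ["x1", "_tmp", "naïve", "变量", "1x", "a-b", "\u0663x"]

for candidate in candidates:
    regex_ok = NAME_RE.fullmatch(candidate) is not None
    # Columns: candidate, regex verdict, Python's own verdict
    print(repr(candidate), regex_ok, candidate.isidentifier())
```

For the last candidate the regex accepts the name while isidentifier() rejects it, so isidentifier() is the safer check whenever a regex is not strictly required.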

What is the proper regular expression to match all utf-8/unicode lowercase letter forms

I would like to match all lowercase letter forms in the Latin block. The trivial '[a-z]' only matches characters between U+0061 and U+007A, and not all the other lowercase forms.
I would like to match all lowercase letters, most importantly, all the accented lowercase letters in the Latin block used in EFIGS languages.
[a-zà-ý] is a start, but there are still tons of other lowercase characters (see http://www.unicode.org/charts/PDF/U0000.pdf). Is there a recommended way of doing this?
FYI I'm using Python, but I suspect that this problem is cross-language.
Python's builtin "islower()" method seems to do the right checking:
lower = ''
for c in range(0, 2**16):  # Python 3: range/chr replace xrange/unichr
    if chr(c).islower():
        lower += chr(c)
print(lower)
Python's built-in re module does not currently support Unicode properties in regular expressions. See this answer for a link to the Ponyguruma library, which does support them.
Using such a library, you could use \p{Ll} to match any lowercase letter in a Unicode string.
Every character in the Unicode standard is in exactly one category. \p{Ll} is the category of lowercase letters, while \p{L} comprises all the characters in one of the "Letter" categories (Letter, uppercase; Letter, lowercase; Letter, titlecase; Letter, modifier; and Letter, other). For more information see the Character Properties chapter of the Unicode Standard. Or see this page for a good explanation on use of Unicode in regular expressions.
It looks as though this recipe, posted back in 2005 (ported here to Python 3),
import sys, re
uppers = ['[']
for i in range(sys.maxunicode + 1):  # Python 3: range/chr replace xrange/unichr
    c = chr(i)
    if c.isupper():
        uppers.append(c)
uppers.append(']')
uppers = ''.join(uppers)  # one big character class of all uppercase characters
uppers_re = re.compile(uppers)
print(uppers_re.match('A'))
is still relevant.
You might want to have a look at regular-expressions.info.
However, as far as I know there's no character class or modifier that expresses "lower case characters only" (and not every language has lower case characters), so I'd say you might have to use multiple ranges (possibly almost as many as there are Unicode blocks).
Edit:
Reading a bit more on this, there might be a way: [\p{Ll}\p{Lo}], which means lowercase characters that have an upper case variant, or characters that have neither lower case nor upper case (Chinese characters, for example).
Regex [\p{Ll}\p{Lo}]+ matches test string àÀhelloHello你好Прывітанне and replacing the matches with x results in xÀxHxПx whereas replacing the matches of [\p{Ll}]+ results in xÀxHx你好Пx (note the Chinese characters that were not matched).
If you use \p{L} it will match any Unicode letter. Check the examples here. You can also combine it with \p{M} to match Hebrew-esque languages that include diacritic marks: (\p{L}|\p{M})+
EDIT:
I missed the part about only lowercase letters the first time around. \p{L} will match all letters, \p{Ll} will match lowercase only.
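Since the built-in re module lacks \p{Ll}, one workaround is to build the character class from the Unicode database, in the spirit of the 2005 recipe. The sketch below assumes Python 3 and the standard unicodedata module; the name LOWER_RE is made up for illustration, and the result depends on the Unicode version bundled with the interpreter.

```python
import re
import sys
import unicodedata

# Collect every code point whose general category is "Ll" (Letter, lowercase).
lowercase_chars = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Ll"
]

# No Ll character is a regex metacharacter, so the class needs no escaping.
LOWER_RE = re.compile("[" + "".join(lowercase_chars) + "]+")

print(LOWER_RE.sub("x", "àÀhelloHello你好Прывітанне"))  # xÀxHx你好Пx
```

This reproduces the [\p{Ll}]+ result quoted above: the Chinese characters are left alone because they are in category Lo, not Ll.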

Python: Regular expression to match alpha-numeric not working?

I am looking to match a string that is inputted from a website to check if is alpha-numeric and possibly contains an underscore.
My code:
if re.match('[a-zA-Z0-9_]',playerName):
    # do stuff
For some reason, this matches names containing crazy characters, for example: nIg○▲ ☆ ★ ◇ ◆
I only want regular A-Z, 0-9 and _ to match. Is there something I am missing here?
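Python has a special sequence \w that matches alphanumerics and the underscore when the LOCALE and UNICODE flags are not specified. (In Python 3, str patterns are Unicode-aware by default, so pass re.ASCII to restrict \w to [a-zA-Z0-9_].) So you can modify your pattern as follows:
pattern = r'^\w+$'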
Your regex only matches one character. Try this instead:
if re.match('^[a-zA-Z0-9_]+$',playerName):
…check if is alpha-numeric and possibly contains an underscore.
Do you mean this literally, so that only one underscore is allowed, total? (Not unreasonable for player names; adjacent underscores in particular can be hard for other players to read.) Should "a_b_c" not match?
If so:
if playerName and re.match("^[a-zA-Z0-9]*_?[a-zA-Z0-9]*$", playerName):
The new first part of the condition checks for an empty value, which simplifies the regex.
This places no restrictions on where the underscore can occur, so all of "_a", "a_", and "_" will match. If you instead want to prevent both leading and trailing underscores, which is again reasonable for player names, change to:
if re.match("^[a-zA-Z0-9]+(?:_[a-zA-Z0-9]+)?$", playerName):
    # this regex doesn't match an empty string, so that check is unneeded
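A short sketch that puts the answers above side by side. It uses re.fullmatch in place of explicit ^...$ anchors, and the variable names are assumptions made for illustration.

```python
import re

name = "nIg○▲ ☆ ★ ◇ ◆"

# re.match only anchors at the start: one valid leading character is enough.
matches_prefix = bool(re.match(r"[a-zA-Z0-9_]", name))
# Requiring the whole string to match rejects the name.
matches_whole = bool(re.fullmatch(r"[a-zA-Z0-9_]+", name))

# \w is Unicode-aware for str patterns in Python 3; re.ASCII narrows it.
unicode_word = bool(re.fullmatch(r"\w+", "héllo"))
ascii_word = bool(re.fullmatch(r"\w+", "héllo", re.ASCII))

# At most one underscore, and never leading or trailing.
SINGLE_UNDERSCORE = re.compile(r"[a-zA-Z0-9]+(?:_[a-zA-Z0-9]+)?")

print(matches_prefix, matches_whole)  # True False
print(unicode_word, ascii_word)       # True False
for candidate in ["a_b", "a_b_c", "_a", "a_"]:
    print(candidate, SINGLE_UNDERSCORE.fullmatch(candidate) is not None)
```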

How to match alphabetical chars without numeric chars with Python regexp?

Using Python module re, how to get the equivalent of the "\w" (which matches alphanumeric chars) WITHOUT matching the numeric characters (those which can be matched by "[0-9]")?
Notice that the basic need is to match any character (including all unicode variation) without numerical chars (which are matched by "[0-9]").
As a final note, I really need a regexp as it is part of a greater regexp.
Underscores should not be matched.
EDIT:
I hadn't thought about how underscores should be handled, so thanks for the warnings that "\w" matches them, and for the accepted solution that addresses this issue.
You want [^\W\d]: the group of characters that is not (either a digit or not an alphanumeric). Add an underscore to that negated set, giving [^\W\d_], if you don't want underscores either.
A bit twisted, if you ask me, but it works. Should be faster than the lookahead alternative.
(?!\d)\w
A position that is not followed by a digit, and then \w. Effectively cancels out digits but allows the \w range by using a negative look-ahead.
The same could be expressed as a positive look-ahead and \D:
(?=\D)\w
To match a run of these, wrap the expression in a (non-capturing) group:
(?:(?!\d)\w)+
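The three variants can be compared directly with re.findall. This is a sketch on a made-up sample string; note that only the negated class with the added underscore drops underscores, while the look-ahead forms keep them.

```python
import re

text = "héllo_w0rld 42"

negated_class = re.findall(r"[^\W\d_]+", text)      # letters only, no "_"
neg_lookahead = re.findall(r"(?:(?!\d)\w)+", text)  # \w minus digits, keeps "_"
pos_lookahead = re.findall(r"(?:(?=\D)\w)+", text)  # same set, expressed via \D

print(negated_class)  # ['héllo', 'w', 'rld']
print(neg_lookahead)  # ['héllo_w', 'rld']
print(pos_lookahead)  # ['héllo_w', 'rld']
```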
