Regex matching Unicode variable names - python

In Python 2, a variable name contains only ASCII letters, digits and underscores, and it must not start with a digit. Thus,
re.search(r'[_a-zA-Z][_a-zA-Z0-9]*', s)
will find a matching Python name in the str s.
In Python 3, the letters are no longer restricted to ASCII. I am looking for a new regex that will match every legal Python 3 variable name.
According to the docs, \w in a regex matches any Unicode word character, including digits and the underscore. However, I am unsure whether this character set contains exactly the characters that may be used in variable names.
Even if \w contains exactly the characters from which Python 3 variable names may legally be constructed, how do I use it to create my regex? Using just \w+ will also match "words" that start with a digit, which is no good. I have the following solution in mind:
re.search(r'(\w&[^0-9])\w*', s)
where & is the "and" operator (just like | is the "or" operator). The parenthesized group would thus match any word character that is not also a digit. The problem is that no & operator exists, so I'm stuck without a solution.
Edit
Though the "double negative" trick (as explained in the answer by Patrick Artner below) can also be found in this question, note that this only partly answers my question. Using [^\W0-9]\w* only works if I am guaranteed that \w exactly matches the legal Unicode characters, plus the numbers 0-9. I would like a source of this knowledge, or some other regex which gets the job done.

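You can use a double negative. \W is everything that \w is not, so putting \W inside a negated class allows exactly the \w characters, and adding 0-9 to the same negated class excludes the ASCII digits: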
[^\W0-9]\w*
In other words: one character that is neither a non-word character nor one of 0-9, followed by any number of word characters.
Docs: regular-expression-syntax
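As a quick illustration of the double-negative class, here is a minimal sketch using re.findall. The sample string and the variable names are made up for the example. Note that without a word boundary the pattern can still pick up the tail of an invalid name such as 2y, so a \b on each side may be worth adding.

```python
import re

# Hypothetical sample text; "2y" is not a valid identifier.
sample = "x1 = 2y + _z"

# First character: a word character that is not one of 0-9.
name = re.compile(r"[^\W0-9]\w*")
print(name.findall(sample))          # ['x1', 'y', '_z']

# Adding \b stops the pattern from matching the "y" inside "2y".
bounded_name = re.compile(r"\b[^\W0-9]\w*\b")
print(bounded_name.findall(sample))  # ['x1', '_z']
```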

You could try using
^(?![0-9])\w+$
which will not partially match invalid variable names, because the anchors force the whole string to match.
Alternatively, if you don't need a regex, str.isidentifier() will probably do what you want.
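To get a feel for how closely \w tracks the real identifier rules, one can compare the regex against str.isidentifier() on a few strings. This is only a sketch: the helper names are hypothetical, and the test strings are chosen by me. As far as I can tell the two do not agree everywhere; for example, a non-ASCII decimal digit such as U+0663 is accepted by [^\W0-9] but cannot start an identifier, and keywords need a separate check.

```python
import keyword
import re

# The pattern from the answer above, applied to the whole string.
candidate = re.compile(r"[^\W0-9]\w*")

def regex_says_name(s):
    """True if the regex accepts the whole string."""
    return candidate.fullmatch(s) is not None

def python_says_name(s):
    """True if Python itself would accept s as a variable name."""
    # isidentifier() is also True for keywords, so exclude them explicitly.
    return s.isidentifier() and not keyword.iskeyword(s)

# "\u0663" is ARABIC-INDIC DIGIT THREE: a word character, but not in 0-9.
for s in ["naïve", "_tmp1", "1abc", "\u0663x", "class"]:
    print(repr(s), regex_says_name(s), python_says_name(s))
```

If the goal is validation rather than searching inside a larger text, the second helper is probably the safer choice.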

Related

Regex python do lookahead in a conditional statement

I'm trying to do lookaheads in a conditional statement.
Explanation in words:
First, a token that must be either a number (integer or decimal) or a single word character; it is captured in a named group. Then, if the named group matched a word character, a lookahead checks that the next token is a number (integer or decimal); otherwise a lookahead checks that the next token is a word character.
To illustrate, here are some examples that should or should not match:
a 6 or 6.4 b -> matched, since the first and second tokens do not have the same "type"
ab 7 or 7 rt -> not matched, only a single word character is allowed
R 7.55t -> not matched, 7.55t is not a valid number
a r or 5 6 -> not matched, the first and second tokens have the same "type" (number and number, or word character and word character)
I've already found the answer for the first string: (?P<var>([a-zA-Z]|(-?\d+(.\d+)?)))
I've found nothing on the Internet about lookaheads in a conditional in Python.
The problem is that Python doesn't support conditionals the way PCRE does:
Python supports conditionals using a numbered or named capturing group. Python does not support conditionals using lookaround, even though Python does support lookaround outside conditionals. Instead of a conditional like (?(?=regex)then|else), you can alternate two opposite lookarounds: (?=regex)then|(?!regex)else. (source: https://www.regular-expressions.info/conditional.html)
Maybe there's a better solution than the one I've planned, or maybe it's just impossible to do what I want. I don't know.
What I tried: (?P<var>([a-zA-Z]|(-?\d+(.\d+)?))) (?(?=[a-zA-Z])(?=(-?\d+(.\d+)?))|(?=[a-zA-Z]))(?P=var) but that doesn't work.
The named capture group (?P<var>...) contains the actual text that matched, not the regex itself, so (?P=var) only matches a repeat of that same text. There is a way to create a named regex, too, but it's probably not necessary or useful here.
Simply spell out the alternatives (with the decimal point escaped as \. so that it matches only a literal dot):
((?<![a-zA-Z0-9])[a-zA-Z]\s+-?\d+(\.\d+)?(?![a-zA-Z.0-9])|(?<![a-zA-Z.0-9])-?\d+(\.\d+)?\s+[a-zA-Z](?![a-zA-Z0-9]))
If you genuinely require the second token to remain unmatched, change the part starting at each \s into a lookahead.
Demo: https://ideone.com/nPNAIN
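For readers who cannot open the demo, here is a small sketch that runs the spelled-out alternation against the examples from the question. The names NUM, PAIR and results are mine, the decimal point is written as \. (my assumption about the intent), and I use non-capturing groups for brevity.

```python
import re

# A number: optional minus sign, digits, optional fractional part.
NUM = r"-?\d+(?:\.\d+)?"

# Letter then number, or number then letter; the lookarounds make sure
# neither token is part of a longer word or number.
letter_then_num = r"(?<![a-zA-Z0-9])[a-zA-Z]\s+" + NUM + r"(?![a-zA-Z.0-9])"
num_then_letter = r"(?<![a-zA-Z.0-9])" + NUM + r"\s+[a-zA-Z](?![a-zA-Z0-9])"
PAIR = re.compile(letter_then_num + "|" + num_then_letter)

cases = ["a 6", "6.4 b", "ab 7", "7 rt", "R 7.55t", "a r", "5 6"]
results = {text: bool(PAIR.search(text)) for text in cases}
for text, matched in results.items():
    print(f"{text!r}: {'matched' if matched else 'not matched'}")
```

Only the first two cases should be reported as matched.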

Regular expression that accepts tokens of three or more alphabetical characters

I'm trying to build a TfidfVectorizer that only accepts tokens of 3 or more alphabetical characters, using TfidfVectorizer(token_pattern="(?u)\\b\\D\\D\\D+\\b")
But it doesn't behave correctly. I know that token_pattern="(?u)\\b\\w\\w\\w+\\b" accepts tokens of 3 or more alphanumeric characters, so I don't understand why the former is not working.
What am I missing?
The problem lies in using the \D metacharacter: it matches any non-digit character, including spaces and punctuation, rather than only alphabetical characters. The Python docs describe \D as matching any character that is not a decimal digit.
Instead, you can use:
token_pattern="(?i)[a-z]{3,}"
Explanation:
(?i) — inline flag to make matching case-insensitive,
[a-z] — matches any Latin letter,
{3,} — makes the previous token match three or more times (greedily, i.e., as many times as possible).
I hope this answers your question. :)
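The difference between the patterns can be checked with plain re.findall, which, as far as I know, is close to how the vectorizer applies token_pattern (treat that as an assumption). The sample text is made up, and the last pattern is a hypothetical variant of mine for the case where tokens such as abc123 should be dropped entirely and non-ASCII letters should count.

```python
import re

text = "ab abc abc123 2020 data"

# \D is "any non-digit", so it also matches the space between two words.
non_digit = re.findall(r"(?u)\b\D\D\D+\b", "ab cd")
print(non_digit)       # ['ab cd']

# The answer's pattern: runs of three or more ASCII letters.
ascii_letters = re.findall(r"(?i)[a-z]{3,}", text)
print(ascii_letters)   # ['abc', 'abc', 'data']

# Hypothetical variant: Unicode letters only, bounded by \b on both sides.
bounded = re.findall(r"\b[^\W\d_]{3,}\b", text)
print(bounded)         # ['abc', 'data']
```

Note that the unbounded pattern still extracts abc from abc123; whether that is wanted depends on the corpus.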

What does the regex [^\s]*? mean?

I am starting to learn how to write a Python spider to download some pictures from the web, and I found the following code. I know some basic regex.
I know that \.jpg means .jpg and | means or. What is the meaning of [^\s]*? in the first line? Why is \s used there?
And what's the difference between the two regexes?
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
Alright, so to answer your first question, I'll break down [^\s]*?.
The square brackets ([]) indicate a character class. A character class basically means that you want to match anything in the class, at that position, one time. [abc] will match the strings a, b, and c. In this case, your character class is negated using the caret (^) at the beginning - this inverts its meaning, making it match anything but the characters in it.
\s is fairly simple - it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, and newlines.
*? is a little harder to explain. The * quantifier is fairly simple - it means "match this token (the character class in this case) zero or more times". The ?, when applied to a quantifier, makes it lazy - it will match as little as it can, going from left to right one character at a time.
In this case, what the whole pattern snippet [^\s]*? means is "match any sequence of non-whitespace characters, including the empty string". As mentioned in the comments, this can more succinctly be written as \S*?.
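The effect of the lazy ? is easiest to see next to its greedy counterpart. In this short sketch the URL is invented and deliberately contains two candidate extensions.

```python
import re

url = "http://a.com/x.jpg.png"   # made-up URL with two candidate extensions

# Lazy: stop at the first position where an extension can match.
lazy = re.search(r"http:[^\s]*?(\.jpg|\.png|\.gif)", url)
# Greedy: consume as much as possible, then backtrack to the last extension.
greedy = re.search(r"http:[^\s]*(\.jpg|\.png|\.gif)", url)

print(lazy.group(0))    # http://a.com/x.jpg
print(greedy.group(0))  # http://a.com/x.jpg.png
```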
To answer the second part of your question, I'll compare the two regexes you give:
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
They both start the same way: attempting to match the protocol at the beginning of a URL and the subsequent colon (:) character. The first then matches any string that does not contain any whitespace and ends with the specified file extensions. The second, meanwhile, will match two literal slash characters (/) before matching any sequence of characters followed by a valid extension.
Now, it's obvious that both patterns are meant to match a URL, but both are incorrect. The first pattern, for instance, will match strings like
http:foo.bar.png
http:.png
both of which are invalid. Likewise, the second pattern will permit spaces, allowing strings like these:
http:// .jpg
http://foo bar.png
which are equally illegal in URLs. A better regex for this (though I caution strongly against trying to match URLs with regexes) might look like:
https?://\S+\.(jpe?g|png|gif)
In this case, it'll match URLs starting with either http or https, as well as files ending in either .jpg or .jpeg.
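Here is a small sketch comparing the three patterns on a few strings. The variable names and the example.com URL are mine; the other two strings are the problem cases mentioned above.

```python
import re

first = re.compile(r"http:[^\s]*?(\.jpg|\.png|\.gif)")
second = re.compile(r"http://.*?(\.jpg|\.png|\.gif)")
better = re.compile(r"https?://\S+\.(jpe?g|png|gif)")

samples = ["http:foo.bar.png", "http://foo bar.png", "https://example.com/cat.jpeg"]

# For each sample: does (first, second, better) find a match?
verdicts = {s: [bool(p.search(s)) for p in (first, second, better)] for s in samples}
for s, flags in verdicts.items():
    print(s, flags)
```

Only the last pattern rejects both malformed strings and accepts the https URL. It is still a rough filter rather than a URL validator.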

Escaping numbers and special chars in python regex

I have a form, and I want its fields to reject any numbers and any special characters.
Currently I have this:
name = forms.RegexField(regex = r'[^0-9]+$')
Right now it only excludes numbers.
How do I write a regex pattern that excludes both numbers and special characters? Any advice?
Assuming that words containing both letters and numbers are allowed, this regex could do the job:
[a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*
This checks that the input contains at least one letter and consists only of letters, or of a combination of letters and digits. No special characters are allowed, provided the pattern is applied to the whole input (for example by anchoring it with ^ and $). abc, 12abc and abc12 are valid, but 123 and ab#/ are invalid.
What I did was reverse the approach. Instead of excluding special characters, I allowed only letters and the combination mentioned above. This automatically excludes special characters.
If excluding numbers as well is the requirement, this regex could be used:
[a-zA-Z]+
This allows only letters and excludes all digits and all special characters.
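Whether these patterns really reject ab#/ depends on how they are applied. The sketch below contrasts re.search with re.fullmatch on the examples from this answer. I believe form validators often use search-style matching, but that is an assumption; if so, the pattern needs ^ and $ (or an equivalent full match).

```python
import re

mixed = r"[a-zA-Z0-9]*[a-zA-Z]+[a-zA-Z0-9]*"
values = ["abc", "12abc", "abc12", "123", "ab#/"]

# search: succeeds if any substring matches.
found_anywhere = {v: bool(re.search(mixed, v)) for v in values}
# fullmatch: succeeds only if the entire input matches.
whole_input = {v: bool(re.fullmatch(mixed, v)) for v in values}

for v in values:
    print(v, found_anywhere[v], whole_input[v])
```

With search, ab#/ is accepted because the substring ab matches; with fullmatch it is rejected.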
Here is a table of all the regexp symbols:
http://www.javascriptkit.com/jsref/regexp.shtml
For your problem, if I understand what you want, it should be r'[a-zA-Z]+$'
I'm not sure, but have a look at the table in the link; it's very useful.
As far as I understand, you need to match only letters.
To be Unicode-compatible, use Unicode properties (these are supported by the third-party regex module, not by Python's built-in re module):
\p{L}+
This will match one or more letters in any language.
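With only the standard library, a rough stand-in for \p{L}+ is the negated class [^\W\d_]+, i.e. word characters minus digits and the underscore. The sketch below probes whether the installed re accepts \p{L} (in the versions I know of it raises re.error) and then tries the stand-in. The variable names are mine, and the two classes may not be identical for every code point.

```python
import re

# Probe: does this Python's re module understand \p{L}?
try:
    re.compile(r"\p{L}+")
    supports_property = True
except re.error:
    supports_property = False
print("re supports \\p{L}:", supports_property)

# Stand-in: word characters that are neither digits nor the underscore.
letters = re.compile(r"[^\W\d_]+")
accented_ok = bool(letters.fullmatch("Ünïcödé"))
digit_ok = bool(letters.fullmatch("abc1"))
print(accented_ok, digit_ok)   # True False
```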

How to match alphabetical chars without numeric chars with Python regexp?

Using Python module re, how to get the equivalent of the "\w" (which matches alphanumeric chars) WITHOUT matching the numeric characters (those which can be matched by "[0-9]")?
Note that the basic need is to match any character (including all Unicode letters) except the numeric characters (those matched by "[0-9]").
As a final note, I really need a regexp as it is part of a greater regexp.
Underscores should not be matched.
EDIT:
I hadn't thought about how underscores should be handled, so thanks for the warnings that they are matched by "\w", and for the accepted solution that addresses this issue.
You want [^\W\d]: the group of characters that is not (either a digit or not an alphanumeric). Add an underscore in that negated set if you don't want them either.
A bit twisted, if you ask me, but it works. Should be faster than the lookahead alternative.
(?!\d)\w
A position that is not followed by a digit, and then \w. Effectively cancels out digits but allows the \w range by using a negative look-ahead.
The same could be expressed as a positive look-ahead and \D:
(?=\D)\w
To match a run of these, wrap the expression in a non-capturing group and repeat it:
(?:(?!\d)\w)+
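The variants above can be compared side by side. This is a minimal sketch with a made-up sample string; the last pattern adds the underscore to the negated set, as suggested for the case where underscores should not match.

```python
import re

text = "héllo_wörld 123abc"

negated_class = re.findall(r"[^\W\d]+", text)
negative_lookahead = re.findall(r"(?:(?!\d)\w)+", text)
positive_lookahead = re.findall(r"(?:(?=\D)\w)+", text)
no_underscore = re.findall(r"[^\W\d_]+", text)

print(negated_class)        # ['héllo_wörld', 'abc']
print(negative_lookahead)   # ['héllo_wörld', 'abc']
print(positive_lookahead)   # ['héllo_wörld', 'abc']
print(no_underscore)        # ['héllo', 'wörld', 'abc']
```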
