How to match alphabetical chars without numeric chars with Python regexp? - python

Using Python module re, how to get the equivalent of the "\w" (which matches alphanumeric chars) WITHOUT matching the numeric characters (those which can be matched by "[0-9]")?
Notice that the basic need is to match any character (including all unicode variation) without numerical chars (which are matched by "[0-9]").
As a final note, I really need a regexp as it is part of a greater regexp.
Underscores should not be matched.
EDIT:
I hadn't thought about underscores state, so thanks for warnings about this being matched by "\w" and for the elected solution that addresses this issue.

You want [^\W\d]: the group of characters that is not (either a digit or not an alphanumeric). Add an underscore in that negated set if you don't want them either.
A bit twisted, if you ask me, but it works. Should be faster than the lookahead alternative.

(?!\d)\w
A position that is not followed by a digit, and then \w. Effectively cancels out digits but allows the \w range by using a negative look-ahead.
The same could be expressed as a positive look-ahead and \D:
(?=\D)\w
To match multiple of these, enclose in parens:
(?:(?!\d)\w)+

Related

Regex backreference to match opposite case

Before I begin — it may be worth stating, that: this technically does not have to be solved using a Regex, it's just that I immediately thought of a Regex when I started solving this problem, and I'm interested in knowing whether it's possible to solve using a Regex.
I've spent the last couple hours trying to create a Regex that does the following.
The regex must match a string that is ten characters long, iff the first five characters and last five characters are identical but each individual character is opposite in case.
In other words, if you take the first five characters, invert the case of each individual character, that should match the last five characters of the string.
For example, the regex should match abCDeABcdE, since the first five characters and the last five characters are the same, but each matching character is opposite in case. In other words, flip_case("abCDe") == "ABcdE"
Here are a few more strings that should match:
abcdeABCDE, abcdEABCDe, zYxWvZyXwV.
And here are a few that shouldn't match:
abcdeABCDZ, although the case is opposite, the strings themselves do not match.
abcdeABCDe, is a very close match, but should not match since the e's are not opposite in case.
Here is the first regex I tried, which is obviously wrong since it doesn't account for the case-swap process.
/([a-zA-Z]{5})\1/g
My next though was whether the following is possible in a regex, but I've been reading several Regex tutorials and I can't seem to find it anywhere.
/([A-Z])[\1+32]/g
This new regex (that obviously doesn't work) is supposed to match a single uppercase letter, immediately followed by itself-plus-32-ascii, so, in other words, it should match an uppercase letter followed immediately by its' lowercase counterpart. But, as far as I'm concerned, you cannot "add an ascii value" to backreference in a regex.
And, bonus points to whoever can answer this — in this specific case, the string in question is known to be 10 characters long. Would it be possible to create a regex that matches strings of an arbitrary length?
You want to use the following pattern with the Python regex module:
^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)
See the regex demo
Details
^ - start of string
(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L})) - a positive lookahead with a sequence of five capturing groups that capture the first five letters individually
(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$) - a ppositive lookahead that make sure that, at the end of the string, there are 5 letters that are the same as the ones captured at the start but are of different case.
In brief, the first (\p{L}) in the first lookahead captures the first a in abcdeABCDE and then, inside the second lookahead, (?!\1)(?i:\1) makes sure the fifth char from the end is the same (with the case insensitive mode on), and (?!\1) negative lookahead make sure this letter is not identical to the one captured.
The re module does not support inline modifier groups, so this expression won't work with that moduue.
Python regex based module demo:
import regex
strs = ['abcdeABCDE', 'abcdEABCDe', 'zYxWvZyXwV', 'abcdeABCDZ', 'abcdeABCDe']
rx = r'^(?=(\p{L})(\p{L})(\p{L})(\p{L})(\p{L}))(?=.*(?!\1)(?i:\1)(?!\2)(?i:\2)(?!\3)(?i:\3)(?!\4)(?i:\4)(?!\5)(?i:\5)$)'
for s in strs:
print("Testing {}...".format(s))
if regex.search(rx, s):
print("Matched")
Output:
Testing abcdeABCDE...
Matched
Testing abcdEABCDe...
Matched
Testing zYxWvZyXwV...
Matched
Testing abcdeABCDZ...
Testing abcdeABCDe...

Regex matching Unicode variable names

In Python 2, a Python variable name contains only ASCII letters, numbers and underscores, and it must not start with a number. Thus,
re.search(r'[_a-zA-Z][_a-zA-Z0-9]*', s)
will find a matching Python name in the str s.
In Python 3, the letters are no longer restricted to ASCII. I am in search for a new regex which will match any and all legal Python 3 variable names.
According to the docs, \w in a regex will match any Unicode word literal, including numbers and the underscore. I am however unsure whether this character set contains exactly those characters which might be used in variable names.
Even if the character set \w contains exactly the characters from which Python 3 variable names may legally be constructed, how do I use it to create my regex? Using just \w+ will also match "words" which start with a number, which is no good. I have the following solution in mind,
re.search(r'(\w&[^0-9])\w*', s)
where & is the "and" operator (just like | is the "or" operator). The parentheses will thus match any word literal which at the same time is not a number. The problem with this is that the & operator does not exist, and so I'm stuck with no solution.
Edit
Though the "double negative" trick (as explained in the answer by Patrick Artner below) can also be found in this question, note that this only partly answers my question. Using [^\W0-9]\w* only works if I am guaranteed that \w exactly matches the legal Unicode characters, plus the numbers 0-9. I would like a source of this knowledge, or some other regex which gets the job done.
You can use a double negative - \W is anything that \w is not - just disallow it to allow any \w:
[^\W0-9]\w*
essentially using any not - non-wordcharacter except 0-9 followed by any word character any number of times.
Doku: regular-expression-syntax
You could try using
^(?![0-9])\w+$
Which will not partial match invalid variable names
Alternatively, if you don't need to use regex. str.isidentifier() will probably do what you want.

Removing special characters and symbols from a string in python

I am trying to do what my title says. I have a list of about 30 thousand business addressess, and I'm trying to make each address as uniform as possible
As far as removing weird symbols and characters goes, I have found three suggestions, but I don't understand how they are different.
If somebody can explain the difference, or provide insight into a better way to standardize address information, please and thank you!
address = re.sub(r'([^\s\w]|_)+', '', address)
address = re.sub('[^a-zA-Z0-9-_*.]', '', address)
address = re.sub(r'[^\w]', ' ', address)
The first suggestion uses the \s and \w regex wildcards.
\s means "match any whitespace".
\w means "match any letter or number".
This is used as an inverted capture group ([^\s\w]), which, all together, means "match anything which isn't whitespace, a letter or a number". Finally, it is combined using an alternative | with _, which will just match an underscore and given a + quantifier which matches one or more times.
So what this says is: "Match any sequence of one or more characters which aren't whitespace, letters, numbers or underscores and remove it".
The second option says: "Match any character which isn't a letter, number, hyphen, underscore, dot or asterisk and remove it". This is stated by that big capture group (the stuff between the brackets).
The third option says "Take anything which is not a letter or number and replace it by a space". It uses the \w wildcard, which I have explained.
All of the options use Regular Expressions in order to match character sequences with certain characteristics, and the re.sub function, which sub-stitutes anything matched by the given regex by the second string argument.
You can read more about Regular Expressions in Python here.
The enumeration [^a-zA-Z0-9-_*.] enumerates exactly the character ranges to remove (though the literal - should be at the beginning or end of the character class).
\w is defined as "word character" which in traditional ASCII locales included A-Z and a-z as well as digits and underscore, but with Unicode support, it matches accented characters, Cyrillics, Japanese ideographs, etc.
\s matches space characters, which again with Unicode includes a number of extended characters such as the non-breakable space, numeric space, etc.
Which exactly to choose obviously depends on what you want to accomplish and what you mean by "special characters". Numbers are "symbols", all characters are "special", etc.
Here's a pertinent quotation from the Python re documentation:
\s
For Unicode (str) patterns:
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [ \t\n\r\f\v] may be a better choice).
For 8-bit (bytes) patterns:
Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].
\w
For Unicode (str) patterns:
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).
For 8-bit (bytes) patterns:
Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].
How you read the re.sub function is like this (more docs):
re.sub(a, b, my_string) # replace any matches of regex a with b in my_string
I would go with the second one. Regexes can be tricky, but this one says:
[^a-zA-Z0-9-_*.] # anything that's NOT a-z, A-Z, 0-9, -, * .
Which seems like it's what you want. Whenever I'm using regexes, I use this site:
http://regexr.com/
You can put in some of your inputs, and make sure they are matching the right kinds of things before throwing them in your code!

Python reference to regex in parentheses

I have a text file that needs to have the letter 't' removed if it is not immediately preceded by a number.
I am trying to do this using re.sub and I have this:
f=open('File.txt').read()
g=f
g=re.sub('([^0-9])t','',g)
This identifies the letters to be removed correctly but also removes the preceding character. How can I refer to the parenthesized regex in the replacement String?
Thanks!
Use a lookbehind (or negative lookbehind) instead.
g=re.sub('(?<=[^0-9])t','',g)
or
g=re.sub('(?<![0-9])t','',g)
Three options:
g=re.sub('([^0-9])t','\\1',g)
or
g=re.sub('(?<=[^0-9])t','',g)
or
g=re.sub('(?<![0-9])t','',g)
The first option is what you are looking for, a backreference to the captured string. \\1 will refer to the first captured group.
Lookarounds don't consume characters, so you don't need to replace them back. Here, I have used a positive lookbehind for the first one and a negative lookbehind for the second one. Those don't consume the characters within their brackets, so you are not taking the [^0-9] or [0-9] in the replacement. It might be better to use those since it prevents overlapping matches.
The positive lookbehind makes sure that t has a non-digit character before it. The negative lookbehind makes sure that t does not have a digit character before it.

Python regex positive look ahead

I have the following regex that is supposed to find sequence of words that are ended with a punctuation. The look ahead function assures that after the match there is a space and a capital letter or digit.
pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])"
What is the function of the following lookahead?
pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])"
Is Python 3.2 supporting variable lookahead (\s+)? I do not get any error. Furthermore I cannot see any differences in both patterns. Both seem to work the same regardless the number of blanks that I have. Is there an explanation for the purpose of the \s+ in the look ahead?
I'm not really sure what you are tying to achieve here.
Sequence of words ended by a punctuation can be matched with something like:
re.findall(r'([\w\s]*[\?\!\.;])', s)
the lookahead requires another string to follow?
In any case:
\s requires one and only one space;
\s+ requires at least one space.
And yes, the lookahead accepts the "+" modifier even in python 2.x
The same as before but with a lookahead:
re.findall(r'([\w\s]*[\?\!\.;])(?=\s\w)', s)
or
re.findall(r'([\w\s]*[\?\!\.;])(?=\s+\w)', s)
you can try them all on something like:
s='Stefano ciao. a domani. a presto;'
Depending on your strings, the lookahead might be necessary or not, and might or might not change to have "+" more than one space option.
The difference is that the first lookahead expects exactly one whitespace character before the digit or capital letter while the second one expects at least one whitespace character but as many as possible.
The + is called a quantifier. It means 1 to n as many as possible.
To recap
\s (Exactly one whitespace character allowed. Will fail without it or with more than one.)
\s+ (At least one but maybe more whitespaces allowed.)
Further studying.
I have multiple blanks, the \w.+? continues to match the blanks until the last blank before the capital letter
To answer this comment please consider :
What does \w.+? actually matches?
A single word character [a-zA-Z0-9_] followed by at least one "any" character(except newline) but with the lazy quantifier +?. So in your case, it leaves one space so that the lookahead later matches. Therefore you consume all the blanks except one. This is why you see them at your output.

Categories

Resources