Removing special characters and symbols from a string in python

Removing special characters and symbols from a string in python - python

I am trying to do what my title says. I have a list of about 30 thousand business addressess, and I'm trying to make each address as uniform as possible
As far as removing weird symbols and characters goes, I have found three suggestions, but I don't understand how they are different.
If somebody can explain the difference, or provide insight into a better way to standardize address information, please and thank you!
address = re.sub(r'([^\s\w]|_)+', '', address)
address = re.sub('[^a-zA-Z0-9-_*.]', '', address)
address = re.sub(r'[^\w]', ' ', address)

The first suggestion uses the \s and \w regex wildcards.
\s means "match any whitespace".
\w means "match any letter or number".
This is used as an inverted capture group ([^\s\w]), which, all together, means "match anything which isn't whitespace, a letter or a number". Finally, it is combined using an alternative | with _, which will just match an underscore and given a + quantifier which matches one or more times.
So what this says is: "Match any sequence of one or more characters which aren't whitespace, letters, numbers or underscores and remove it".
The second option says: "Match any character which isn't a letter, number, hyphen, underscore, dot or asterisk and remove it". This is stated by that big capture group (the stuff between the brackets).
The third option says "Take anything which is not a letter or number and replace it by a space". It uses the \w wildcard, which I have explained.
All of the options use Regular Expressions in order to match character sequences with certain characteristics, and the re.sub function, which sub-stitutes anything matched by the given regex by the second string argument.
You can read more about Regular Expressions in Python here.

The enumeration [^a-zA-Z0-9-_*.] enumerates exactly the character ranges to remove (though the literal - should be at the beginning or end of the character class).
\w is defined as "word character" which in traditional ASCII locales included A-Z and a-z as well as digits and underscore, but with Unicode support, it matches accented characters, Cyrillics, Japanese ideographs, etc.
\s matches space characters, which again with Unicode includes a number of extended characters such as the non-breakable space, numeric space, etc.
Which exactly to choose obviously depends on what you want to accomplish and what you mean by "special characters". Numbers are "symbols", all characters are "special", etc.
Here's a pertinent quotation from the Python re documentation:
\s
For Unicode (str) patterns:
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [ \t\n\r\f\v] may be a better choice).
For 8-bit (bytes) patterns:
Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].
\w
For Unicode (str) patterns:
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).
For 8-bit (bytes) patterns:
Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].

How you read the re.sub function is like this (more docs):
re.sub(a, b, my_string) # replace any matches of regex a with b in my_string
I would go with the second one. Regexes can be tricky, but this one says:
[^a-zA-Z0-9-_*.] # anything that's NOT a-z, A-Z, 0-9, -, * .
Which seems like it's what you want. Whenever I'm using regexes, I use this site:
http://regexr.com/
You can put in some of your inputs, and make sure they are matching the right kinds of things before throwing them in your code!

Related

How to get a complete list of character class of Python regex to lookup a specific special character quickly?

This python doc gives a complete list of the metacharacters
. ^ $ * + ? { } [ ] \ | ( )
Similarly, is there a page giving a complete list of character class?
I assume "Character classes" in that doc refers to a finite numbers of some kind of special characters instead of all possible unicode characters. Please correct me if necessary.
I did search and didn't find the canonical term.
If "character classes" indeed refers to all possible unicode characters, I would like change my question as "a convenient way to lookup regex special characters in python".
It seems that regular-expressions.info call that "Shorthand Character Classes"
More positive examples (that I am looking for) are \d, \s, \S, \A etc; negative examples (that I am not looking for) are abcdefghijklmnopqrstuvwxyz0123456789
I've searched "character class" and "Shorthand Character Classes" on Python doc and stackoverflow and didn't find what I want.
Why do I need this? When I read a section of the doc, such as
Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depends on whether ASCII or LOCALE mode is in force.
I would like to know what does \w stand for. Either searching in the doc or google would take me some time. For example, using search menu command of chrome on that doc, \w gets 41 results.
If there is a list of those characters, I can look up everything by no more than 2 search (lower case and capital).

Categories Visible from the Shell
The code shows all the of the "CATEGORIES". The ones marked "IN" are character categories (the others mark specific slice points between characters):
>>> from pprint import pprint
>>> import sre_parse
>>> pprint(sre_parse.CATEGORIES)
{'\\A': (AT, AT_BEGINNING_STRING),
'\\B': (AT, AT_NON_BOUNDARY),
'\\D': (IN, [(CATEGORY, CATEGORY_NOT_DIGIT)]),
'\\S': (IN, [(CATEGORY, CATEGORY_NOT_SPACE)]),
'\\W': (IN, [(CATEGORY, CATEGORY_NOT_WORD)]),
'\\Z': (AT, AT_END_STRING),
'\\b': (AT, AT_BOUNDARY),
'\\d': (IN, [(CATEGORY, CATEGORY_DIGIT)]),
'\\s': (IN, [(CATEGORY, CATEGORY_SPACE)]),
'\\w': (IN, [(CATEGORY, CATEGORY_WORD)])
The entries with "CATEGORY" are the character categories
This also answers the question of what \w stands for. It is a "word character". See also: In regex, what does \w* mean?
Categories Explained in the Docs
This is in the output of print(re.__doc__). It explains the intended meaning of each category:
The special sequences consist of "\\" and a character from the list
below. If the ordinary character is not on the list, then the
resulting RE will match the second character.
\number Matches the contents of the group of the same number.
\A Matches only at the start of the string.
\Z Matches only at the end of the string.
\b Matches the empty string, but only at the start or end of a word.
\B Matches the empty string, but not at the start or end of a word.
\d Matches any decimal digit; equivalent to the set [0-9] in
bytes patterns or string patterns with the ASCII flag.
In string patterns without the ASCII flag, it will match the whole
range of Unicode digits.
\D Matches any non-digit character; equivalent to [^\d].
\s Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
bytes patterns or string patterns with the ASCII flag.
In string patterns without the ASCII flag, it will match the whole
range of Unicode whitespace characters.
\S Matches any non-whitespace character; equivalent to [^\s].
\w Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
in bytes patterns or string patterns with the ASCII flag.
In string patterns without the ASCII flag, it will match the
range of Unicode alphanumeric characters (letters plus digits
plus underscore).
With LOCALE, it will match the set [0-9_] plus characters defined
as letters for the current locale.
\W Matches the complement of \w.
\\ Matches a literal backslash.
Other Special Character Groups
Besides the short-hand character classes, the sre_parse module details other interesting character groups as well:
SPECIAL_CHARS = ".\\[{()*+?^$|"
REPEAT_CHARS = "*+?{"
DIGITS = frozenset("0123456789")
OCTDIGITS = frozenset("01234567")
HEXDIGITS = frozenset("0123456789abcdefABCDEF")
ASCIILETTERS = frozenset("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
WHITESPACE = frozenset(" \t\n\r\v\f")
ESCAPES = {
r"\a": (LITERAL, ord("\a")),
r"\b": (LITERAL, ord("\b")),
r"\f": (LITERAL, ord("\f")),
r"\n": (LITERAL, ord("\n")),
r"\r": (LITERAL, ord("\r")),
r"\t": (LITERAL, ord("\t")),
r"\v": (LITERAL, ord("\v")),
r"\\": (LITERAL, ord("\\"))
}
FLAGS = {
# standard flags
"i": SRE_FLAG_IGNORECASE,
"L": SRE_FLAG_LOCALE,
"m": SRE_FLAG_MULTILINE,
"s": SRE_FLAG_DOTALL,
"x": SRE_FLAG_VERBOSE,
# extensions
"a": SRE_FLAG_ASCII,
"t": SRE_FLAG_TEMPLATE,
"u": SRE_FLAG_UNICODE,
}

It looks like you're looking for all shorthand character classes Python's re module supports. Things like [abc] also fall under the name "character class", although this might not be obvious from the re docs, and it would be impossible and pointless to try to make a complete list of those.
A character class is regex syntax for matching a single character, usually by specifying that it belongs or doesn't belong to some set of characters. Syntax like [abc] lets you explicitly specify a set of characters to match, while shorthand character classes like \d are shorthand for large, predefined sets of characters.
Python's re module supports 6 shorthand character classes: \d, which matches digits, \s, which matches whitespace, \w, which matches "word" characters, and \D, \S, and \W, which match any character \d, \s, and \w don't match. Exactly which characters count or don't count depend on whether you're using Unicode strings or bytestrings and whether the ASCII or LOCALE flags are set; see the re docs for further details (and expect disappointment with the vague docs for \w).
There are plenty of other backslash-letter sequences with special meaning, but they're not character classes. For example, \b matches a word boundary (or if you forgot to use raw strings, it gets interpreted as a backspace character before the regex engine gets to see it), but that's not a character class.
Other regex implementations may support different shorthand character classes, and their shorthand character classes may match different characters. For example, Perl has way more of these, and Perl's \w matches more characters than Python's, like combining diacritics.

Are you looking for string.printable or perhaps filter(lambda x: not x.isalnum(), string.printable) which returns
!"#$%&\'()*+,-./:;<=>?#[\\]^_``{|}~ \t\n\r\x0b\x0c
?

Regex, better way

How do you separate a regex, that could be matched multiple times within a string, if the delimiter is within the string, ie:
Well then 'Bang bang swing'(BBS) aota 'Bing Bong Bin'(BBB)
With the regex: "'.+'(\S+)"
It would match from Everything from 'Bang ... (BBB) instead of matching 'Bang bang swing'(BBS) and 'Bing Bong Bin'(BBB)
I have a manner of making this work with regex: '[A-z0-9-/?|q~`!##$%^&*()_-=+ ]+'(\S+)
But this is excessive, and honestly I hate that it even works correctly.
I'm fairly new to regexes, and beginning with Pythons implementation of them is apparently not the smartest manner in which to start it.

To get a substring from one character up to another character, where neither can appear in-between, you should always consider using negated character classes.
The [negated] character class matches any character that is not in the character class. Unlike the dot, negated character classes also match (invisible) line break characters. If you don't want a negated character class to match line breaks, you need to include the line break characters in the class. [^0-9\r\n] matches any character that is not a digit or a line break.
So, you can use
'[^']*'\([^()]*\)
See regex demo
Here,
'[^']*' - matches ' followed by 0 or more characters other than ' and then followed by a ' again
\( - matches a literal ) (it must be escaped)
[^()]* - matches 0 or more characters other than ( and ) (they do not have to be escaped inside a character class)
\) - matches a literal ) (must be escaped outside a character class).
If you might have 1 or more single quotes before (...) part, you will need an unrolled lazy matching regex:
'[^']*(?:'(?!\([^()]*\))[^']*)*'\([^()]*\)
See regex demo.
Here, the '[^']*(?:'(?!\([^()]*\))[^']*)*' is matching the same as '.*?' with DOTALL flag, but is much more efficient due to the linear regex execution. See more about unrolling regex technique here.
EDIT:
When input strings are not complex and short, lazy dot matching turns out more efficient. However, when complexity grows, lazy dot matching may cause issues.

How about this regular expression
'.+?'\(\S+\)

Regex Get All Alphabetic characters

I want something like [A-z] that counts for all alphabetic characters plus stuff like ö, ä, ü etc.
If i do [A-ü] i get probably all special characters used by latin languages but it also allows other stuff like ¿¿]|{}[¢§øæ¬°µ©¥
Example: https://regex101.com/r/tN9gA5/2
Edit:
I need this in python2.

Depending on what regular expression engine you are using, you could use the ^\p{L}+$ regular expression. The \p{L} denotes a unicode letter:
In addition to complications, Unicode also brings new possibilities.
One is that each Unicode character belongs to a certain category. You
can match a single character belonging to the "letter" category with
\p{L}
Source
This example should illustrate what I am saying. It seems that the regex engine on Regex101 does support this, you just need to select PCRE (PHP) fromo the top left.

When you use [A-z], you are not only capturing letters from "A" to "z", you also capture some more non-letter characters: [ \ ] ^ _ `.
In Python, you can use [^\W\d_] with re.U option to match Unicode characters (see this post).
Here is a sample based on your input string.
Python example:
import re
r = re.search(
r'(?P<unicode_word>[^\W\d_]*)',
u'TestöäüéàèÉÀÈéàè',
re.U
)
print r.group('unicode_word')
>>> TestöäüéàèÉÀÈéàè

clarifications on the re.findall() method in python

I wanted to strip a string of punctuation marks and I ended up using
re.findall(r"[\w]+|[^\s\w]", text)
It works fine and it does solve my problem. What I don't understand is the details within the parentheses and the whole pattern thing. What does r"[\w]+|[^\s\w]" really mean? I looked it up in the Python standard library and it says:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
I am not sure if I get this and the clarification sounds a little vague to me. Can anyone please tell me what a pattern in this context means and how exactly it is defined in the findall() method?

To break it down, [] creates a character class. You'll often see things like [abc] which will match a, b or c. Conversely, you also might see [^abc] will will match anything that isn't a, b or c. Finally, you'll also see character ranges: [a-cA-C]. This introduces two ranges and it will match any of a, b, c, A, B, C.
In this case, your character class contains special tokens. \w and \s. \w matches anything letter-like. \w actually depends on your locale, but it is usually the same thing as [a-zA-Z0-9_] matches anything in the ranges a-z, A-Z, 0-9 or _. \s is similar, but it matches anything that can be considered whitespace.
The + means that you can repeat the previous match 1 or more times. so [a]+ will match the entire string aaaaaaaaaaa. In your case, you're matching alphanumeric characters that are next to each other.
the | is basically like "or". match the stuff on the left, or match the stuff on the right if the left stuff doesn't match.

\w means Alphanumeric characters plus "_". And \s means Whitespace characters including " \t\r\n\v\f" and space character " ". So, [\w]+|[^\s\w] means a string which contains only words and "_".

How to match alphabetical chars without numeric chars with Python regexp?

Using Python module re, how to get the equivalent of the "\w" (which matches alphanumeric chars) WITHOUT matching the numeric characters (those which can be matched by "[0-9]")?
Notice that the basic need is to match any character (including all unicode variation) without numerical chars (which are matched by "[0-9]").
As a final note, I really need a regexp as it is part of a greater regexp.
Underscores should not be matched.
EDIT:
I hadn't thought about underscores state, so thanks for warnings about this being matched by "\w" and for the elected solution that addresses this issue.

You want [^\W\d]: the group of characters that is not (either a digit or not an alphanumeric). Add an underscore in that negated set if you don't want them either.
A bit twisted, if you ask me, but it works. Should be faster than the lookahead alternative.

(?!\d)\w
A position that is not followed by a digit, and then \w. Effectively cancels out digits but allows the \w range by using a negative look-ahead.
The same could be expressed as a positive look-ahead and \D:
(?=\D)\w
To match multiple of these, enclose in parens:
(?:(?!\d)\w)+

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.