Python remove anything that is not a letter or number - python

I'm having a little trouble with Python regular expressions.
What is a good way to remove all characters in a string that are not letters or numbers?
Thanks!

[\w] matches (alphanumeric or underscore).
[\W] matches (not (alphanumeric or underscore)), which is equivalent to (not alphanumeric and not underscore)
You need [\W_] to remove ALL non-alphanumerics.
When using re.sub(), it will be much more efficient if you reduce the number of substitutions (expensive) by matching using [\W_]+ instead of doing it one at a time.
Now all you need is to define alphanumerics:
str object, only ASCII A-Za-z0-9:
re.sub(r'[\W_]+', '', s)
str object, only locale-defined alphanumerics:
re.sub(r'[\W_]+', '', s, flags=re.LOCALE)
unicode object, all alphanumerics:
re.sub(ur'[\W_]+', u'', s, flags=re.UNICODE)
Examples for str object:
>>> import re, locale
>>> sall = ''.join(chr(i) for i in xrange(256))
>>> len(sall)
256
>>> re.sub('[\W_]+', '', sall)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
>>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
>>> locale.setlocale(locale.LC_ALL, '')
'English_Australia.1252'
>>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\x83\x8a\x8c\x8e\
x9a\x9c\x9e\x9f\xaa\xb2\xb3\xb5\xb9\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\
xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\
xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\
xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
# above output wrapped at column 80
Unicode example:
>>> re.sub(ur'[\W_]+', u'', u'a_b A_Z \x80\xFF \u0404', flags=re.UNICODE)
u'abAZ\xff\u0404'

In the char set matching rule [...] you can specify ^ as first char to mean "not in"
import re
re.sub("[^0-9a-zA-Z]", # Anything except 0..9, a..z and A..Z
"", # replaced with nothing
"this is a test!!") # in this string
--> 'thisisatest'

'\W' is the same as [^A-Za-z0-9_] plus accented chars from your locale.
>>> re.sub('\W', '', 'text 1, 2, 3...')
'text123'
Maybe you want to keep the spaces or have all the words (and numbers):
>>> re.findall('\w+', 'my. text, --without-- (punctuation) 123')
['my', 'text', 'without', 'punctuation', '123']

Also you can try to use isalpha and isnumeric methods the following way:
text = 'base, sample test;'
getVals = lambda x: (c for c in text if c.isalpha() or c.isnumeric())
map(lambda word: ' '.join(getVals(word)): text.split(' '))

There are other ways also you may consider e.g. simply loop thru string and skip unwanted chars e.g. assuming you want to delete all ascii chars which are not letter or digits
>>> newstring = [c for c in "a!1#b$2c%3\t\nx" if c in string.letters + string.digits]
>>> "".join(newstring)
'a1b2c3x'
or use string.translate to map one char to other or delete some chars e.g.
>>> todelete = [ chr(i) for i in range(256) if chr(i) not in string.letters + string.digits ]
>>> todelete = "".join(todelete)
>>> "a!1#b$2c%3\t\nx".translate(None, todelete)
'a1b2c3x'
this way you need to calculate todelete list once or todelete can be hard-coded once and use it everywhere you need to convert string

you can use predefined regex in python : \W corresponds to the set [^a-zA-Z0-9_]. Then,
import re
s = 'Hello dutrow 123'
re.sub('\W', '', s)
--> 'Hellodutrow123'

You need to be more specific:
What about Unicode "letters"? ie, those with diacriticals.
What about white space? (I assume this is what you DO want to delete along with punctuation)
When you say "letters" do you mean A-Z and a-z in ASCII only?
When you say "numbers" do you mean 0-9 only? What about decimals, separators and exponents?
It gets complex quickly...
A great place to start is an interactive regex site, such as RegExr
You can also get Python specific Python Regex Tool

Related

How to remove non alphabetic characters from words-ends

When I have a string "Mary's!!" I want to get "Mary's!", so only one non alphabetic character is removed at the beginning and/or the end of each word in the string, not in the middle of the word.
I have this so far in Python 3
import re
s = "Mary's!! string. With. Punctuation?" # Sample string
out = re.sub(r'[^\w\d\s]','', s)
print(out)
This outputs:
"Marys string With Punctuation"
It removes everything, while it should be like this:
"Mary's! string With Punctuation"
You could require that there is a space next to it (or start/end of string):
re.sub(r'(\s|^)[^\w\d\s]|[^\w\d\s](\s|$)', r'\1\2', s)
Or, alternatively with look-around:
re.sub(r'(?<!\S)[^\w\d\s]|[^\w\d\s](?!\S)', '', s)

regex unicode characters

The following regex working online but not working in python code and shows no matches:
https://regex101.com/r/lY1kY8/2
s=re.sub(r'\x.+[0-9]',' ',s)
required:
re.sub(r'\x.+[0-9]* ',' ',r'cats\xe2\x80\x99 faces')
Out[23]: 'cats faces'
basically wanted to remove the unicode special characters "\xe2\x80\x99"
As another option that doesn't require regex, you could instead remove the unicode characters by removing anything not listed in string.printable
>>> import string
>>> ''.join(i for i in 'cats\xe2\x80\x99 faces' if i in string.printable)
'cats faces'
print re.findall(r'\\x.*?[0-9]* ',r'cats\xe2\x80\x99 faces')
^^
Use raw mode flag.Use findall as match starts matching from beginning
print re.sub(ur'\\x.*?[0-9]+','',r'cats\xe2\x80\x99 faces')
with re.sub
s=r'cats\xe2\x80\x99 faces'
print re.sub(r'\\x.+?[0-9]*','',s)
EDIT:
The correct way would be to decode to utf-8 and then apply regex.
s='cats\xe2\x80\x99 faces'
\xe2\x80\x99 is U+2019
print re.sub(u'\u2019','',s.decode('utf-8'))
Assume you use Python 2.x
>>> s = 'cats\xe2\x80\x99 f'
>>> len(s), s[4]
(9, 'â')
Means chars like \xe2 is with 1 length, instead 3. So that you cannot match it with r'\\x.+?[0-9]*' to match it.
>>> s = '\x63\x61\x74\x73\xe2\x80\x99 f'
>>> ''.join([c for c in s if c <= 'z'])
'cats f'
Help this help a bit.

Extract unicode substrings with the re module

I have a string like this:
s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
I want this text:
result = 'the unicode text I want with an é'
I've tried to use this code:
expr = r'(?<=BEGIN)[\sa-zA-Z]+(?=END)'
result = re.search(expr, s)
result = re.sub(r'(^\s+)|(\s+$)', '', result) # just to strip out leading/trailing white space
But as long as the é is in the string s, re.search always returns None.
Note, I've tried using different combinations of .* instead of [\sa-zA-Z]+ without success.
The character ranges a-z and A-Z only capture ASCII characters. You can use . to capture Unicode characters:
>>> import re
>>> s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
>>> print re.search(r'BEGIN(.+?)END', s).group(1)
the unicode text I want with an é
>>>
Note too that I simplified your pattern a bit. Here is what it does:
BEGIN # Matches BEGIN
(.+?) # Captures one or more characters non-greedily
END # Matches END
Also, you do not need Regex to remove whitespace from the ends of a string. Just use str.strip:
>>> ' a '.strip()
'a'
>>>

Python Regex - checking for a capital letter with a lowercase after

I am trying to check for a capital letter that has a lowercase letter coming directly after it. The trick is that there is going to be a bunch of garbage capital letters and number coming directly before it. For example:
AASKH317298DIUANFProgramming is fun
as you can see, there is a bunch of stuff we don't need coming directly before the phrase we do need, Programming is fun.
I am trying to use regex to do this by taking each string and then substituting it out with '' as the original string does not have to be kept.
re.sub(r'^[A-Z0-9]*', '', string)
The problem with this code is that it leaves us with rogramming is fun, as the P is a capital letter.
How would I go about checking to make sure that if the next letter is a lowercase, then I should leave that capital untouched. (The P in Programming)
Use a negative look-ahead:
re.sub(r'^[A-Z0-9]*(?![a-z])', '', string)
This matches any uppercase character or digit that is not followed by a lowercase character.
Demo:
>>> import re
>>> string = 'AASKH317298DIUANFProgramming is fun'
>>> re.sub(r'^[A-Z0-9]*(?![a-z])', '', string)
'Programming is fun'
You can also use match like this :
>>> import re
>>> s = 'AASKH317298DIUANFProgramming is fun'
>>> r = r'^.*([A-Z][a-z].*)$'
>>> m = re.match(r, s)
>>> if m:
... print(m.group(1))
...
Programming is fun

Python: strip a wildcard word

I have strings with words separated by points.
Example:
string1 = 'one.two.three.four.five.six.eight'
string2 = 'one.two.hello.four.five.six.seven'
How do I use this string in a python method, assigning one word as wildcard (because in this case for example the third word varies). I am thinking of regular expressions, but do not know if the approach like I have it in mind is possible in python.
For example:
string1.lstrip("one.two.[wildcard].four.")
or
string2.lstrip("one.two.'/.*/'.four.")
(I know that I can extract this by split('.')[-3:], but I am looking for a general way, lstrip is just an example)
Use re.sub(pattern, '', original_string) to remove matching part from original_string:
>>> import re
>>> string1 = 'one.two.three.four.five.six.eight'
>>> string2 = 'one.two.hello.four.five.six.seven'
>>> re.sub(r'^one\.two\.\w+\.four', '', string1)
'.five.six.eight'
>>> re.sub(r'^one\.two\.\w+\.four', '', string2)
'.five.six.seven'
BTW, you are misunderstanding str.lstrip:
>>> 'abcddcbaabcd'.lstrip('abcd')
''
str.replace is more appropriate (of course, re.sub, too):
>>> 'abcddcbaabcd'.replace('abcd', '')
'dcba'
>>> 'abcddcbaabcd'.replace('abcd', '', 1)
'dcbaabcd'

Categories

Resources