How to remove all unicode representations in Python

I am trying to remove all representations of special characters in my document. For example, part of the document says "world\u2019s"; when I split this it gives ['world', '\u2019', 's'], but I need only the word (the Unicode escape and the 's' removed).
I am already removing all punctuation, and this works on punctuation that is shown normally, but not on these Unicode representations.
I have also tried to use a regex to match everything that begins with a '\', but that doesn't seem to work either.

import re

# backslashreplace turns the non-ASCII character into the literal escape \u2019,
# then the regex keeps only the part of each word before the backslash
text = str("world\u2019s".encode('ascii', 'backslashreplace'), 'ascii')
print(re.sub(r"\b([^\s]+)\\([^\s]+)\b", r'\1', text))
Output:
world
You can apply this to your whole document; it should work the same way.
import re

text = str("world\u2019s h\u2018e".encode('ascii', 'backslashreplace'), 'ascii')
print(re.sub(r"\b([^\s]+)\\([^\s]+)\b", r'\1', text))
Output:
world h
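If you would rather not round-trip through an ASCII encoding, a minimal alternative sketch (assuming you want to drop the special character together with whatever follows it inside the same word) is to match the non-ASCII character directly:
import re

text = "world\u2019s h\u2018e"
# [^\x00-\x7f] matches any non-ASCII character; \S* consumes the rest of the word
print(re.sub(r"[^\x00-\x7f]\S*", "", text))
# world h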

Related

Remove special quotation marks and other characters

I am trying to download articles using Article from newspaper, and to tokenize the words using the nltk word_tokenizer. The problem is that when I try to print the parsed article text, some articles contain special quotation marks like “, ”, and ’, which aren't filtered out by the tokenizer the way a regular ' and " would be.
Is there a way to replace these special quotes with normal quotes, or better yet, to remove all possible special characters that the tokenizer may miss?
I tried to remove these special characters by mentioning them explicitly in code, but that gives me the error Non-UTF-8 code starting with '\x92'.
Using the unidecode package would normally replace these characters with their closest ASCII equivalents.
from unidecode import unidecode
text = unidecode(text)
A drawback, however, is that it would also change some characters (e.g. accented ones) that you may want to keep. If that is the case, an option is to use a regular expression to specifically erase (or replace) some pre-identified special characters. Note that re.sub expects a single pattern, not a list, so the list of escapes is joined into an alternation:
import re

exotic_quotes = ['\\x92'] # fill this up with the escapes you want to replace
text = re.sub("|".join(exotic_quotes), "'", text) # change the second argument to the kind of quote you want to replace the exotic ones with
I hope this helps!
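Another lightweight option, if the main offenders are curly quotes and dashes, is to translate the usual "smart" punctuation to ASCII before tokenizing. This is only a sketch; the mapping (QUOTE_MAP is my own name) covers a handful of common code points and is meant to be extended:
# hypothetical mapping of common "smart" punctuation to ASCII (extend as needed)
QUOTE_MAP = {
    0x2018: u"'", 0x2019: u"'",   # curly single quotes
    0x201C: u'"', 0x201D: u'"',   # curly double quotes
    0x2013: u'-', 0x2014: u'-',   # en and em dashes
}

text = u"\u201cworld\u2019s\u201d"
print(text.translate(QUOTE_MAP))
# "world's"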

Regular expression including and excluding characters

I have the following regular expression that almost works fine.
WORD_REGEXP = re.compile(r"[a-zA-Zá-úÁ-Úñ]+")
It includes lower- and upper-case letters, with and without accents, plus the Spanish letter «ñ». Unfortunately, it also matches (I don't know why) characters that are used in Spanish like «¡» or «¿», which I would like to exclude as well.
In a line like ¡España, olé! I would like to extract just España and olé, by means of the regular expression.
How can I exclude these two characters («¿», «¡») in the regular expression?
According to stribizhe, it seems the regex itself is OK, so the problem must lie elsewhere. I include the full Python code:
import re

linea = "¡Arriba Éspáña, ¿olé!"
WORD_REGEXP = re.compile(r"([a-zA-Zá-úÁ-Úñ]+)", re.UNICODE)
palabras = WORD_REGEXP.findall(linea)
for pal in palabras:
    pal = unicode(pal, 'latin1').encode('latin1', 'replace')
    print pal
The result is the following:
¡Arriba
Éspáña
¿olé
Use the special sequence '\w'. According to the documentation:
If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
Note, however, that your string must be a unicode string:
import re
linea = u"¡Arriba Éspáña, ¿olé!"
regex = re.compile(r"\w+", re.UNICODE)
regex.findall(linea)
# [u'Arriba', u'\xc9sp\xe1\xf1a', u'ol\xe9']
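Note that \w also matches digits and the underscore. If you want letters only, a common trick (just a sketch) is to subtract those from \w with a double negation:
import re

# [^\W\d_] reads "not a non-word character, not a digit, not an underscore" = letters only
regex = re.compile(ur"[^\W\d_]+", re.UNICODE)
print regex.findall(u"¡Arriba Éspáña, ¿olé! 123_x")
# [u'Arriba', u'\xc9sp\xe1\xf1a', u'ol\xe9', u'x']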
NOTE: The cause of your error is that your regex is being interpreted as UTF-8 bytes:
Your pattern r'([a-zA-Zá-úÁ-Úñ]+)' is not defined as a unicode string, so it is encoded to UTF-8 by your text editor and read by Python as '([a-zA-Z\xc3\xa1-\xc3\xba\xc3\x81-\xc3\x9a\xc3\xb1]+)'. Note the sequences starting with \xc3 (the UTF-8 lead byte for these characters).
You can confirm that by printing the repr of WORD_REGEXP, or by decoding the pattern as latin-1 to make each byte visible:
patt = r"([a-zA-Zá-úÁ-Úñ]+)"
print patt.decode('latin1')
That is, the character class actually contains:
a-z
A-Z
\xc3
\xa1-\xc3
\xba
\xc3
\x81-\xc3
\x9a
\xc3
\xb1
Simplifying it, you are actually using the pattern
a-zA-Z\x81-\xc3
That last range covers a lot of characters, including ¡ (\xa1) and ¿ (\xbf), which is why they are matched!
It's better to use code points. The code points for those characters are
¡ - \xA1
¿ - \xBF
which fall outside the ranges of your accented characters. In Python's re syntax (which uses \xHH rather than \x{HH}), the class becomes:
[a-zA-Z\xE1-\xFA\xC1-\xDA\xF1]+
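As a runnable sketch of that fix (Python 2, assuming a UTF-8 source file), the pattern must be a unicode literal so that the \xHH escapes denote code points rather than UTF-8 bytes:
# -*- coding: utf-8 -*-
import re

# \xE1-\xFA covers á-ú (including ñ at \xF1), \xC1-\xDA covers Á-Ú;
# ¡ (\xA1) and ¿ (\xBF) fall outside both ranges and are not matched
WORD_REGEXP = re.compile(u"[a-zA-Z\xC1-\xDA\xE1-\xFA]+")
print WORD_REGEXP.findall(u"¡Arriba Éspáña, ¿olé!")
# [u'Arriba', u'\xc9sp\xe1\xf1a', u'ol\xe9']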

How to remove this \xa0 from a string in python?

I have the following string:
word = u'Buffalo,\xa0IL\xa060625'
I don't want the "\xa0" in there. How can I get rid of it? The string I want is:
word = 'Buffalo, IL 60625'
The most robust way would be to use the unidecode module to convert all non-ASCII characters to their closest ASCII equivalent automatically.
The character \xa0 is a NO-BREAK SPACE, and the closest ASCII equivalent would of course be a regular space.
import unidecode
word = unidecode.unidecode(word)
If you know for sure that is the only character you don't want, you can .replace it:
>>> word.replace(u'\xa0', ' ')
u'Buffalo, IL 60625'
If you need to handle all non-ASCII characters, encoding and replacing the bad characters might be a good start:
>>> word.encode('ascii', 'replace')
'Buffalo,?IL?60625'
Note that \xa on its own is not a valid escape. If you try to put it into a string literal, you're going to get a syntax error if you're lucky, or it's going to swallow up the next attempted character if you're not, because \x sequences always have to be followed by exactly two hexadecimal digits.
What you have is \xa0, which is an escape sequence for the character U+00A0, aka "NO-BREAK SPACE".
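A quick REPL sketch of that two-hex-digits rule (Python 2 reprs shown):
>>> u'\xa0'    # two hex digits: U+00A0, NO-BREAK SPACE
u'\xa0'
>>> u'\xab'    # "\xa" followed by "b" is read as the single escape \xab, not \xa plus a literal b
u'\xab'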
I think you want to replace them with spaces, but whatever you want to do is pretty easy to write:
word.replace(u'\xa0', u' ') # replaced with space
word.replace(u'\xa0', u'0') # closest to what you were literally asking for
word.replace(u'\xa0', u'') # removed completely
You can also use unicodedata to normalize away compatibility characters like \xa0 (NFKD maps a NO-BREAK SPACE to a regular space):
>>> from unicodedata import normalize
>>> normalize('NFKD', word)
u'Buffalo, IL 60625'
This seems to work for getting rid of non-ASCII characters:
fixedword = word.encode('ascii','ignore')
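One caveat, as a sketch (Python 3 semantics assumed): encode returns a bytes object, and 'ignore' deletes the character outright instead of replacing it with a space, so the words run together:
word = u'Buffalo,\xa0IL\xa060625'
# 'ignore' drops the non-ASCII character entirely; decode to get a str back
fixedword = word.encode('ascii', 'ignore').decode('ascii')
print(fixedword)
# Buffalo,IL60625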

Python regex sub function with matching group

I'm trying to get a python regex sub function to work but I'm having a bit of trouble. Below is the code that I'm using.
import re

string = 'á:tdfrec'
newString = re.sub(ur"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):", ur"\1:\2", string)
#newString = re.sub(ur"([a|e|i|o|ä|ë|ö|á|é|í|ó|à|è|ì|ò])([a|e|i|o|ä|ë|ö|á|é|í|ó|ú|à|è|ì|ò]):", ur"\1:\2", string)
print newString
# a:́tdfrec is printed
So the above code is not working the way I intend. It may not display correctly here, but in the printed string the acute accent has moved from over the a to over the :. For the string I'm declaring, this regex is not supposed to be applied at all. My intention is for this regex statement to apply only to examples like the following:
aä:dtcbd becomes a:ädtcbd
adfseì:gh becomes adfse:ìgh
éò:fdbh becomes é:òfdbh
but my regex statement is being applied, and I don't want it to be. I think my problem is that the second character set followed by the : (i.e. á:) is what's causing the regex to be applied. I've been staring at this for a while and have tried a few other things, and I feel like this should work, but I'm missing something. Any help is appreciated!
The following code, with the re.UNICODE flag, also doesn't achieve the desired output:
>>> import re
>>> original = u'á:tdfrec'
>>> pattern = re.compile(ur"([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):", re.UNICODE)
>>> print pattern.sub(ur'\1:\2', original)
á:tdfrec
Is it because of the diacritic, as in the tony-the-pony example with les misérable? The diacritic ends up on the wrong character after reversing:
>>> original = u'les misérable'
>>> print ''.join([i for i in reversed(original)])
elbarésim sel
edit: Definitely an issue with the combining diacritics; you need to normalize both the regular expression and the strings you are trying to match. For example:
import re
import unicodedata

regex = unicodedata.normalize('NFC', ur'([aeioäëöáéíóàèìò])([aeioäëöáéíóúàèìò]):')
string = unicodedata.normalize('NFC', u'aä:dtcbd')
newString = re.sub(regex, ur'\1:\2', string)
Here is an example that shows why you might hit an issue without the normalization. The string u'á' could either be the single code point LATIN SMALL LETTER A WITH ACUTE (U+00E1), or it could be two code points, LATIN SMALL LETTER A (U+0061) followed by COMBINING ACUTE ACCENT (U+0301). These will probably look the same, but they have very different behaviors in a regex, because you can match the combining accent as its own character. That is what is happening here with the string u'á:tdfrec': a regular 'a' is captured in group 1, and the combining diacritic is captured in group 2.
By normalizing both the regex and the string you are matching you ensure this doesn't happen, because the NFC normalization will replace the diacritic and the character before it with a single equivalent character.
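A small sketch that makes the difference visible (Python 2 literals, to match the rest of this answer):
import unicodedata

s1 = u'\xe1'        # one code point: LATIN SMALL LETTER A WITH ACUTE
s2 = u'a\u0301'     # two code points: 'a' plus COMBINING ACUTE ACCENT
print s1 == s2                                  # False: different code point sequences
print unicodedata.normalize('NFC', s2) == s1    # True: NFC composes the pair into U+00E1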
Original answer below.
I think your issue here is that the string you are attempting to do the replacement on is a byte string, not a Unicode string.
If these are string literals make sure you are using the u prefix, e.g. string = u'aä:dtcbd'. If they are not literals you will need to decode them, e.g. string = string.decode('utf-8') (although you may need to use a different codec).
You should probably also normalize your string, because part of the issue may have something to do with combining diacritics.
Note that in this case the re.UNICODE flag will not make a difference, because that only changes the meaning of character class shorthands like \w and \d. The important thing here is that if you are using a Unicode regular expression, it should probably be applied to a Unicode string.

How to eliminate the ☎ unicode?

During web scraping, and after getting rid of all HTML tags, I got the black telephone character \u260e in unicode (☎). But unlike this response, I do want to get rid of it too.
I used the following regular expression in Scrapy to eliminate HTML tags and entities:
pattern = re.compile("<.*?>|&nbsp;|&amp;", re.DOTALL|re.M)
Then I tried to match \u260e, and I think I got caught by the backslash plague. I unsuccessfully tried these patterns:
pattern = re.compile("<.*?>|&nbsp;|&amp;|\u260e", re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\u260e", re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\\\u260e", re.DOTALL|re.M)
None of this worked and I still have \u260e as an output.
How can I make this disappear?
Using Python 2.7.3, the following works fine for me:
import re

pattern = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e", re.DOTALL|re.M)
s = u"bla ble \u260e blo"
print re.sub(pattern, "", s)
Output:
bla ble  blo
As pointed out by @Zack, this works because the pattern is now a unicode string: the escape sequence \u260e is interpreted by Python as a single character, that little black phone ☎, rather than as six literal characters.
Once both the string being searched and the regular expression contain the black phone itself, and not the character sequence \u260e, they match.
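A quick way to see the difference (the lengths are the same in Python 2 and 3):
>>> len(u"\u260e")     # a unicode literal: one character, the phone symbol
1
>>> len("\\u260e")     # an escaped backslash: six separate characters, no phone anywhere
6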
If your string is already unicode, there are two easy ways. The second one will obviously affect more than just the ☎.
>>> import string
>>> foo = u"Lorum ☎ Ipsum"
>>> foo.replace(u'☎', '')
u'Lorum Ipsum'
>>> "".join(s for s in foo if s in string.printable)
u'Lorum Ipsum'
See Remove non-ascii characters but leave periods and spaces for more information about string.printable.
See The SHORTEST way to remove multiple spaces in a string in Python if you don't want the leftover multiple whitespaces.
You may try BeautifulSoup, as explained here, with something like:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
