In a text file I'm processing, I have characters like ����. Not sure what they are.
I'm wondering how to remove/convert these characters.
I have tried to convert it to ASCII by using .encode('ascii', 'ignore'), but Python told me the character is not in range(128).
I have also tried unicodedata: unicodedata.normalize('NFKD', text).encode('ascii', 'ignore'), with the same error.
Can anyone help?
Thanks!
You can always take a Unicode string and use the code you showed:
my_ascii = my_uni_string.encode('ascii', 'ignore')
If that gave you an error, then you didn't really have a Unicode string to begin with. If that is true, then you have a byte string instead. You'll need to know what encoding it's using, and you can turn it into a Unicode string with:
my_uni_string = my_byte_string.decode('utf8')
(assuming your encoding is UTF-8).
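For example, a minimal sketch (Python 2, as in this thread; the sample byte string is made up) of checking what you actually have before converting:
my_byte_string = 'caf\xc3\xa9'                     # hypothetical input: UTF-8 bytes for u'café'

if isinstance(my_byte_string, str):                # it's a byte string
    my_uni_string = my_byte_string.decode('utf8')  # assuming the input is UTF-8
else:                                              # it's already Unicode
    my_uni_string = my_byte_string

print my_uni_string.encode('ascii', 'ignore')      # u'café' -> 'caf'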
This split between byte string and Unicode string can be confusing. My presentation, Pragmatic Unicode, or, How Do I Stop The Pain can help you to keep it all straight.
It's not perfect (especially for shorter strings) but the chardet library would be of use here:
http://pypi.python.org/pypi/chardet
To have chardet figure out the encoding and then decode to Unicode, you would do:
import chardet
encoding = chardet.detect(some_string)['encoding']
unicode_string = unicode(some_string, encoding)
Of course, you won't be able to encode the characters as ASCII if they're outside the ASCII range.
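Putting that together with the normalization attempted in the question, a sketch (Python 2, as in this thread; the sample byte string is made up, and detection can misfire on short inputs as noted above):
import unicodedata
import chardet

some_string = 'Espa\xc3\xb1ol'                       # hypothetical UTF-8 byte string
encoding = chardet.detect(some_string)['encoding']   # e.g. 'utf-8'
unicode_string = some_string.decode(encoding)        # now a real Unicode string
# NFKD splits accented letters apart so the base letters survive the ASCII filter
print unicodedata.normalize('NFKD', unicode_string).encode('ascii', 'ignore')  # Espanol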
I am using a package in Python that returns a string using ASCII characters as opposed to Unicode (e.g. it returns 'serÃ©' as opposed to 'seré').
Given this is Python 3.8, the string is actually Unicode; the package just seems to output it as if it were ASCII. As such, when I try to perform x.decode('utf-8') or x.encode('ascii'), neither works. Is there a way to make Python treat the string as if it were ASCII, such that I can decode it to Unicode? Or is there a package that can serve this purpose?
I am relatively new to python so I apologise if my explanation is unclear. I am happy to clarify things if needed.
Code
from spanishconjugator import Conjugator as c
verb = c().conjugate('pasar', 'preterite', 'indicative', 'yo')
print(verb)
This returns the string 'pasÃ©' where it should return 'pasé'.
Update
From further searching and from your answers, it appears to be an issue with single 2-byte UTF-8 characters (é) being literally interpreted as two 1-byte Latin-1 characters (Ã©) (nothing to do with ASCII, my mistake).
Managed to fix it with:
verb.encode('latin-1').decode('utf-8')
Thank you to those that commented.
If the input string contains the raw byte ordinals (such as \xc3\xa9/Ã© instead of é), use latin1 to encode it to bytes verbatim, then decode with the desired encoding.
>>> "pasé".encode('latin1').decode()
'pasé'
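A byte-level sketch of why the round trip works (Python 3, matching the question's 3.8; the sample string is made up):
mojibake = 'pas\u00c3\u00a9'        # 'pasÃ©': UTF-8 bytes that were mis-read as Latin-1
raw = mojibake.encode('latin-1')    # recovers the original bytes, b'pas\xc3\xa9'
print(raw.decode('utf-8'))          # decode them as they were meant to be read: pasé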
So I have this text that I pulled from the internet, and some of the words are not using the correct characters, like this one: "experiÃªncias". Is there any function or something in Python where I could tackle strings like that and turn them into the Portuguese version, like "experiências"?
Thanks !
What you "pulled" was not a Unicode string but a string in the Western-European encoding, probably CP1252. You must encode it back to the byte object and then decode correctly.
"experiências".encode("cp1252").decode()
# 'experiências'
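If mojibake like this is scattered through a larger text, the third-party ftfy library automates exactly this repair; a minimal sketch, assuming ftfy is installed and acceptable for your project:
import ftfy

# fix_text() detects and undoes common mojibake such as UTF-8 read as cp1252
print(ftfy.fix_text('experi\u00c3\u00aancias'))   # experiências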
In the middle of writing this I got this to work. Here it is anyway in case it's useful or the solution is less than optimal.
I have a unicode string u'http://en.wikipedia.org/wiki/Espa%C3%B1ol' from which I'd like to have u'http://en.wikipedia.org/wiki/Español'. My attempt using urllib.unquote gives me u'http://en.wikipedia.org/wiki/Espa\xc3\xb1ol'.
The problem is that what %C3%B1 means depends on the encoding of the string.
As Unicode, it means Ã±. As Latin-1, it also means Ã±. As UTF-8, it means ñ.
So, you need to unescape those characters before decoding from UTF-8.
In other words, somewhere, you're doing the equivalent of:
u = urllib.unquote(s.decode('utf-8'))
Don't do that. You should be doing:
u = urllib.unquote(s).decode('utf-8')
If some framework you're using has already decoded the string before you get to see it, re-encode it, unquote it, and re-decode it:
u = urllib.unquote(u.encode('utf-8')).decode('utf-8')
But it would be better to not have the framework hand you charset-decoded but still quote-encoded strings in the first place.
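A side-by-side sketch of the two orderings (Python 2, where urllib.unquote lives; in Python 3, urllib.parse.unquote decodes percent-escapes as UTF-8 by default, so this pitfall largely disappears):
import urllib

s = 'http://en.wikipedia.org/wiki/Espa%C3%B1ol'   # a plain byte string

good = urllib.unquote(s).decode('utf-8')    # u'http://en.wikipedia.org/wiki/Espa\xf1ol'
bad = urllib.unquote(s.decode('utf-8'))     # u'http://en.wikipedia.org/wiki/Espa\xc3\xb1ol'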
The string is unnecessarily Unicode, so convert it to a byte string representation first, then decode to Unicode like so:
urllib.unquote(str(u'http://en.wikipedia.org/wiki/Espa%C3%B1ol')).decode('utf8')
I first tried typing in a Unicode character, encoding it in UTF-8, and decoding it back. Python happily gives back the original character.
I took a look at the encoded string; it is b'\xe6\x88\x91'. I don't understand what this is: it looks like three hex numbers.
Then I did some research and I found that the CJK set starts from 4E00, so now I want Python to show me what this character looks like. How do I do that? Do I need to convert 4E00 to the form of something like the one above?
The text b'\xe6\x88\x91' is the representation of the bytes that are the UTF-8 encoding of the Unicode codepoint \u6211, which is the character 我. So there is no need to convert anything, other than to a Unicode string with .decode('utf-8').
You'll need to decode it using the UTF-8 encoding:
>>> print(b'\xe6\x88\x91'.decode('UTF-8'))
我
By decoding it you're turning the bytes (which is what b'...' is) into a Unicode string and that's how you can display / use the text.
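As for the second part of the question: going from a codepoint number like 4E00 to its character needs no byte-level conversion at all; chr() takes the number directly (Python 3):
>>> chr(0x4E00)
'一'
>>> hex(ord('我'))          # and back from character to codepoint
'0x6211'
>>> '我'.encode('utf-8')    # the bytes from the question, recreated
b'\xe6\x88\x91'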
I am trying to work with several documents that all have various encodings - some UTF-8, some ISO-8859-2, some ASCII, etc. Is there a reliable way of decoding to a standard encoding for processing?
I have tried the following:
import chardet
encoding = chardet.detect(text)
text = unicode(text,encoding['encoding']).decode(sys.getdefaultencoding(),'ignore')
With the above code I still get UnicodeEncodeError errors
Use decode to convert bytes to unicode, and encode to convert unicode to bytes:
text.decode(encoding['encoding'], 'ignore').encode(sys.getdefaultencoding(), 'ignore')
Although I would recommend doing your processing on the unicode objects themselves, or UTF-8 encoded strings if you absolutely need to work with bytes. sys.getdefaultencoding() is 'ascii', which provides a very limited character set. See also: http://wiki.python.org/moin/DefaultEncoding
You probably mean encode:
u = unicode(text, encoding['encoding'], 'ignore')
text = u.encode(sys.getdefaultencoding(), 'ignore')
or equivalently and more commonly,
u = text.decode(encoding['encoding'], 'ignore')
text = u.encode(sys.getdefaultencoding(), 'ignore')
You may want ignore on both, as above: the incoming text may have invalid characters in it, causing it to fail to decode to Unicode, and it may have characters which can't be represented in the default encoding, causing it to fail to encode. (You may not actually want to ignore errors, though, since it looks like you were just trying to work around using the wrong function.)
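A sketch of the "decode early, process as Unicode" workflow both answers point toward (Python 2, as in the question; the sample inputs are made up, and chardet's guesses on short strings are unreliable):
import chardet

def to_unicode(raw):
    guess = chardet.detect(raw)['encoding'] or 'utf-8'   # fall back if detection fails
    return raw.decode(guess, 'ignore')

for raw in ['caf\xc3\xa9', 'caf\xe9']:    # UTF-8 and Latin-1 byte strings
    u = to_unicode(raw)
    # ... do all the real processing on the Unicode object u ...
    print u.encode('utf-8')               # encode only at the output boundary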