I'm reading some strings from a text file.
Some of these strings have some "strange" characters, e.g. "\xc3\xa9comiam".
If I copy that string and paste it into a variable, I can convert it to readable characters:
string = "\xc3\xa9comiam"
print(string.encode("raw_unicode_escape").decode('utf-8'))
écomiam
but if I read it from the file, it doesn't work:
with open(fn) as f:
    for string in f.readlines():
        print(string.encode("raw_unicode_escape").decode('utf-8'))
\xc3\xa9comiam
It seems the solution must be pretty easy, but I can't find it.
What can I do?
Thanks!
Don't reach for the unicode-escape codecs here - as the name suggests, raw_unicode_escape handles Unicode escape sequences like \u00e9, but not byte escapes like \xe9.
What you have is a UTF-8 encoded sequence. The way to decode that is to get it into a bytes object, which can then be decoded to a Unicode string.
# Let's not shadow the string library
s = "\xc3\xa9comiam"
print(bytes(s, 'latin-1').decode('utf-8'))
The 'latin-1' trick is a dirty little secret: it simply converts every character with a code point below 256 to the byte with the same value.
For your file, you could open it in binary mode so you don't have to explicitly convert it to bytes, or you could simply apply the same conversion to the strings you read.
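For the binary-mode route, a minimal sketch, assuming the file really contains raw UTF-8 bytes rather than literal backslash escape text ('data.txt' is just a placeholder name):

with open('data.txt', 'rb') as f:          # binary mode: lines come back as bytes
    for raw_line in f:
        print(raw_line.decode('utf-8'))    # decode each line from UTF-8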
Thanks everyone for your help.
I think I've found a solution (not very elegant, but it does the trick):
print(bytes(tm.strip(), "utf-8").decode("unicode_escape").encode("raw_unicode_escape").decode('utf-8'))
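For reference, here is that chain broken into steps (the sample value below stands in for a line read from the file, which holds the escapes as literal text):

tm = "\\xc3\\xa9comiam\n"                    # what a line from the file looks like
step1 = bytes(tm.strip(), "utf-8")           # b'\\xc3\\xa9comiam' - still literal escape text
step2 = step1.decode("unicode_escape")       # 'Ã©comiam' - escapes turned into characters
step3 = step2.encode("raw_unicode_escape")   # b'\xc3\xa9comiam' - characters back to raw bytes
print(step3.decode("utf-8"))                 # écomiam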
Thanks!
I searched a bit about this, but most people want to convert an original string (테스트) into its Unicode-escaped form (\uD14C\uC2A4\uD2B8).
What I want is the opposite: converting a Unicode-escaped string (such as \uD14C\uC2A4\uD2B8) to the real string (테스트). I have a JSON file in which all the Korean strings are in the form of Unicode escapes (\uXXXX), and I have to parse it into the original strings. How can I do it in Python?
To sum up, I need to convert a Unicode-escaped string to the original string in Python, like this:
\uD14C\uC2A4\uD2B8 -> 테스트
import codecs
import json

file_variable = 'path/to/file.json'
with codecs.open(file_variable, encoding='utf-8') as file:
    json_object = json.load(file)
See if that allows you to handle your JSON as Unicode. You can also use the .encode() and .decode() string methods to go back and forth between Unicode text and encoded bytes:
string = ((some unicode text here))
string.decode('utf-8')
string.encode('utf-8')
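For what it's worth, json.load() and json.loads() already turn \uXXXX escapes into real characters, so no extra decoding step is usually needed. A small check with made-up data:

import json

raw = '{"word": "\\uD14C\\uC2A4\\uD2B8"}'   # the escaped form of 테스트
parsed = json.loads(raw)
print(parsed["word"])                        # 테스트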
I solved it this way:
I read the file byte by byte and appended each byte (variable c) to a variable:
aJSON += encode(c)
Then aJSON.decode('unicode-escape') gives the expected result.
Thanks for the interest anyway.
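For completeness, a minimal illustration of that unicode-escape step (Python 2 style, with a made-up value standing in for the accumulated aJSON bytes):

aJSON = '\\uD14C\\uC2A4\\uD2B8'          # literal backslash escapes held as bytes
print(aJSON.decode('unicode-escape'))    # 테스트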
In the middle of writing this I got it to work. Here it is anyway, in case it's useful or my solution is less than optimal.
I have a unicode string u'http://en.wikipedia.org/wiki/Espa%C3%B1ol' from which I'd like to have u'http://en.wikipedia.org/wiki/Español'. My attempt using urllib.unquote gives me u'http://en.wikipedia.org/wiki/Espa\xc3\xb1ol'.
The problem is that what %C3%B1 means depends on the encoding of the string.
Decoded as UTF-8, those two bytes mean ñ. Treated as Latin-1 or as plain Unicode code points, they mean Ã± - which is exactly the Espa\xc3\xb1ol you're seeing.
So, you need to unescape those characters before decoding from UTF-8.
In other words, somewhere, you're doing the equivalent of:
u = urllib.unquote(s.decode('utf-8'))
Don't do that. You should be doing:
u = urllib.unquote(s).decode('utf-8')
If some framework you're using has already decoded the string before you get to see it, re-encode it, unquote it, and re-decode it:
u = urllib.unquote(u.encode('utf-8')).decode('utf-8')
But it would be better to not have the framework hand you charset-decoded but still quote-encoded strings in the first place.
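A small Python 2 sketch of the two orderings (s here is the percent-encoded byte string from the question):

import urllib

s = 'http://en.wikipedia.org/wiki/Espa%C3%B1ol'
good = urllib.unquote(s).decode('utf-8')   # u'http://en.wikipedia.org/wiki/Espa\xf1ol' (Español)
bad  = urllib.unquote(s.decode('utf-8'))   # u'http://en.wikipedia.org/wiki/Espa\xc3\xb1ol'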
The string is unnecessarily unicode, so convert to a byte string representation first, then decode to unicode like so:
urllib.unquote(str(u'http://en.wikipedia.org/wiki/Espa%C3%B1ol')).decode('utf8')
I'm sure someone should be able to help me here, as it feels like such a simple answer, but I can't find it anywhere. I need to write a Unicode string (null-padded ASCII, basically), but it isn't working as expected; no matter what I try from the internet, it ends up as pure ASCII.
with open('test.txt', 'wb') as oFile:
    name = u'AAA'
    oFile.write(name)  # always writes 0x414141; I want 0x410041004100
Just to clarify, though the question is answered already, in case someone wanders here: the use case is a mixed binary file (an int here, a unicode string there, a struct, etc.) that I am editing in place. I really just wanted to be able to write the string the way it is represented in the file ('AAA' as 0x410041004100 instead of 0x414141).
You can use the .encode() method with an appropriate codec:
>>> name = u"aaa"
>>> name.encode("utf_16")
'\xff\xfea\x00a\x00a\x00'
The \xff\xfe at the beginning is a Byte Order Mark (BOM). Your application may or may not require that, and you can remove it if not needed.
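If the BOM is unwanted, specifying an explicit byte order leaves it out; shown here in the same interactive style, this happens to give exactly the 0x410041004100 bytes from the question:

>>> u'AAA'.encode('utf_16_le')
'A\x00A\x00A\x00'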
You can use the codecs module to specify an encoding when you open the file:
import codecs
with codecs.open('test.txt', 'wb', encoding='utf-16') as oFile:
...
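A short usage sketch along those lines (test.txt as in the question; note that utf-16 also writes a BOM, so utf-16-le is used here to avoid one):

import codecs

with codecs.open('test.txt', 'wb', encoding='utf-16-le') as oFile:
    oFile.write(u'AAA')  # the file now contains the bytes 41 00 41 00 41 00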
Further information:
Unicode HOWTO
Comparison of Unicode encodings
I have a definition that builds a string composed of UTF-8 encoded characters. The output files are opened using 'w+', "utf-8" arguments.
However, when I try to x.write(string) I get the UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)
I assume this is because normally, for example, you would do print(u'something'), but I need to use a variable, and the quotation marks in u'_' negate that...
Any suggestions?
EDIT: Actual code here:
source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)
Essentially all this is supposed to be doing is building me a large amount of strings that look similar to this:
[日木曜 Deliverables]= CASE WHEN things = 11
THEN C ELSE 0 END
Are you using codecs.open()? Python 2.7's built-in open() does not support a specific encoding, meaning you have to manually encode non-ascii strings (as others have noted), but codecs.open() does support that and would probably be easier to drop in than manually encoding all the strings.
As you are actually using codecs.open(), going by your added code, and after a bit of looking things up myself, I suggest opening the input and/or output file with encoding "utf-8-sig", which automatically handles the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would think that would only matter for the input file, but if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) work, then the most likely situation is that your input file is encoded in a different Unicode format with a BOM: Python's default UTF-8 codec interprets a BOM as a regular character, so the input would not have an issue, but the output could.
Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.
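A small sketch of that suggestion (Python 2, with a hypothetical file name and a sample string like the ones in the question):

import codecs

text = u'[\u65e5\u6728\u66dc Deliverables]'   # contains non-ASCII characters
with codecs.open('out.csv', 'w+', 'utf-8') as outTarget:
    # str(text) would raise UnicodeEncodeError ('ascii' codec can't encode ...)
    outTarget.write(unicode(text))            # fine: the codecs file object encodes to UTF-8 itself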
In Python 2.x there are two types of string: byte strings and unicode strings. The first contains bytes, the second contains Unicode code points. It is easy to determine which type a string is - a unicode string starts with u:
# byte string
>>> 'abc'
'abc'
# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'
The 'abc' characters are the same, because they are in the ASCII range. \u0430 is a Unicode code point; it is outside the ASCII range. A "code point" is Python's internal representation of a Unicode character, and code points can't be saved to a file directly - they need to be encoded to bytes first. Here is how the encoded unicode string looks (once encoded, it becomes a byte string):
>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
This encoded string now can be written to file:
>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
... f.write(s.encode('utf8'))
Now it is important to remember what encoding we used when writing to the file, because to read the data back we need to decode the content. Here is what the data looks like without decoding:
>>> with open('text.txt', 'r') as f:
... content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'
You see, we've got encoded bytes, exactly the same as from s.encode('utf8'). To decode, we need to provide the codec name:
>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'
After decoding, we've got back our unicode string with Unicode code points.
>>> print content.decode('utf8')
abc абв
xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM, or byte order mark, and basically it's a holdover from the early days of Unicode, when people couldn't agree which way they wanted their bytes to go. Unicode documents are now often prefaced with a BOM, which reads as \ufeff (or as \ufffe if it is decoded with the opposite byte order).
If you hit an error on that character at the very start of the data, you can be fairly sure the issue is that you are not decoding it as UTF-8, and the file itself is probably still fine.
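A quick way to see the difference between plain utf-8 and utf-8-sig handling of that character (Python 2 style, with a few made-up bytes):

data = '\xef\xbb\xbfabc'                  # UTF-8 bytes with a BOM in front
print(repr(data.decode('utf-8')))         # u'\ufeffabc' - the BOM survives as a character
print(repr(data.decode('utf-8-sig')))     # u'abc'       - the BOM is stripped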
In a text file I'm processing, I have characters like ����. Not sure what they are.
I'm wondering how to remove/convert these characters.
I have tried to convert it to ASCII by using .encode('ascii', 'ignore'), but Python told me the character is not within range(128).
I have also tried unicodedata, with unicodedata.normalize('NFKD', text).encode('ascii', 'ignore'), and got the same error.
Can anyone help?
Thanks!
You can always take a Unicode string and use the code you showed:
my_ascii = my_uni_string.encode('ascii', 'ignore')
If that gave you an error, then you didn't really have a Unicode string to begin with. If that is true, then you have a byte string instead. You'll need to know what encoding it's using, and you can turn it into a Unicode string with:
my_uni_string = my_byte_string.decode('utf8')
(assuming your encoding is UTF-8).
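Putting the two steps together, a small sketch assuming the data really is UTF-8-encoded bytes:

my_byte_string = 'caf\xc3\xa9'                        # UTF-8 bytes for u'café'
my_uni_string = my_byte_string.decode('utf8')         # now a real Unicode string
my_ascii = my_uni_string.encode('ascii', 'ignore')    # 'caf' - the non-ASCII character is dropped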
This split between byte string and Unicode string can be confusing. My presentation, Pragmatic Unicode, or, How Do I Stop The Pain can help you to keep it all straight.
It's not perfect (especially for shorter strings) but the chardet library would be of use here:
http://pypi.python.org/pypi/chardet
To have chardet figure out the encoding and then decode to Unicode, you would do:
import chardet
encoding = chardet.detect(some_string)['encoding']
unicode_string = unicode(some_string, encoding)
Of course, you won't be able to keep those characters when encoding to ASCII if they're outside the ASCII range.