Convert utf-8 string to cp950 encoding in python


I'm handling an encoding problem.
My input is a unicode string, such as:
>>> s
u'\xa6\xe8\xac\xc9'
Actually it is encoded in cp950. I want to decode it: (notice there's no "u")
>>> print unicode('\xa6\xe8\xac\xc9', 'cp950')
西界
However, I don't know how to get rid of that "u".
Direct conversion is not working:
>>> str(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
The result of using encode() is not what I wanted:
>>> s.encode('utf8')
'\xc2\xa6\xc3\xa8\xc2\xac\xc3\x89'
what I want is '\xa6\xe8\xac\xc9'

This is a bit of an abuse of the unicode type. Characters in a unicode string are expected to be Unicode codepoints (e.g. u'\u897f\u754c'), and thus are encoding-agnostic. They are not supposed to be bytes from a specific encoding (Python 3 makes this distinction very clear by separating Unicode strings str, from byte strings bytes).
Since you want to just interpret each codepoint as bytes, you can do
u'\xa6\xe8\xac\xc9'.encode('iso-8859-1')
since the first 256 codepoints of Unicode are defined to be equal to the codepoints of ISO-8859-1. However, please try to fix the issue that gave you this incorrect Unicode string in the first place.
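For readers on Python 3, the same repair might look like the sketch below. It assumes the string really holds cp950 bytes stored one per code point; the variable names are illustrative only.

```python
# Each code point in the broken string is really one cp950 byte.
broken = '\xa6\xe8\xac\xc9'

# ISO-8859-1 maps code points 0-255 to the same byte values,
# so encoding with it recovers the original bytes unchanged.
raw = broken.encode('iso-8859-1')

# Decode the recovered bytes with the encoding they were written in.
fixed = raw.decode('cp950')

print(raw)    # b'\xa6\xe8\xac\xc9'
print(fixed)
```

This only repairs the symptom; the better fix is still to decode the bytes as cp950 at the point where they are first read.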

So let's get this straight: you have a sequence of bytes that were read in as Unicode codepoints, and you need them to be interpreted as cp950 instead?
>>> ''.join(chr(ord(c)) for c in s)
'\xa6\xe8\xac\xc9'
>>> print ''.join(chr(ord(c)) for c in s).decode('cp950')
西界

Related

Not able to convert HEX to ASCII in python 3.6.3

I tried below methods, but no luck.
method 1:
var contains the hex value
bytes.fromhex(var).decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdb in position 0: ordinal not in range(128)
method 2:
codecs.decode(var,"hex")
This is returning me in bytes, not in ASCII.
Can someone help on this conversion?

Did you try:
chr(var)
This should give you the character for an ASCII code.

Based on the question, I assume you are using Python 3.x.
The error occurs because you are trying to decode the byte 0xdb as ASCII, and its value is above 127.
That cannot work: ASCII defines no such byte value.
Your options are:
1. Ignore decode errors:
>>> import codecs
>>> u = 'DB91132598CC' # unicode
>>> b = codecs.decode(u,"hex") # bytes
>>> b
b'\xdb\x91\x13%\x98\xcc'
>>> result = b.decode("ascii", errors="ignore") # unicode
>>> result
'\x13%'
2. Use different encoding:
>>> result = b.decode("cp1252") # for example
>>> result
'Û‘\x13%˜Ì'
If you want only ASCII chars in the result use option #1.
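As a rough comparison of what the error handlers do to the same input, here is a small Python 3 sketch. It uses bytes.fromhex instead of codecs.decode purely for brevity, and the variable names are made up for illustration.

```python
raw = bytes.fromhex('DB91132598CC')

# 'ignore' silently drops every byte that is not valid ASCII.
ignored = raw.decode('ascii', errors='ignore')

# 'replace' substitutes U+FFFD for each undecodable byte.
replaced = raw.decode('ascii', errors='replace')

# 'backslashreplace' keeps the information as a visible escape sequence.
escaped = raw.decode('ascii', errors='backslashreplace')

print(repr(ignored))   # '\x13%'
print(repr(replaced))
print(escaped)
```

Note that 'ignore' loses data: four of the six input bytes vanish without a trace, so it is only appropriate when those bytes really do not matter.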

have you tried
codecs.decode(codecs.decode(var,'hex'),'ascii')

>>> var = int('7A', 16) #var is an integer now
>>> chr(var) #int value to char
'z'
This solution only handles one character. You have to split the string and convert each hex value separately.

The following code:
codecs.decode(var,"hex")
in Python 2.7 results with str:
'\xdb\x91\x13%\x98\xcc\xbfv\xaef\x8bK\x08Qv\xbb\x19\'u"\x1f\xdb\xb5\x0f\xce\x1c9\'\xc0w\xea\xf1\xe3\xda\xc4\xc8\xa8\xe8\x02\x8c?r\x95\xef\x81W\xce\xd5\x97\xa3n\xf1\xc3\xbf\xa4QG{\xff2\xee\xb1\x80l,\xc0D%\x85\x19z+\xcd,C\x92\x14z\xad\xb90f\xd0\xbaZ\xb6\xdb\xfd?o\xce\xb7\x07:\xe6\x1a]J\xa8\xab\xcb\xcf\xf4\xee\xbd\x1a\x16Uh\x9b\xfd~\xab\x82\xd7{\xf7"Ou\xfb\xcd2<\x9b\x9f\xa9\xc0\xb7\xd7\x99\x18\x08x\xa8\x1d]\x07\xcf\x05\xbe9\xee\xf9\x89\xb2\xfc0w\x99},/\x11b\xe5\xb4}\x99\xe4\xb4\x15\xbc\x8c\xe5\xc7UGi1\xbd\x8e\xd1K_\xce\xc1\xc8\xc6TQYF\xabx`\xbb\xbe\xe7\xdc\xcf\xda\xa7\xaaA\x0f\xf6SR\xb1S\xb5\x87(\xd5x\x14\xc6\x10\xf8%(m\x83\x0c0\x84)\xbd\xcf\x11g\x88{\x12^\xfb/\xa3K=\xea\xcd2\x9fWgL\x07\x1b\xefl\x9c\xea\xc0\xc7\xfa\xbbXz\x1do\x8bM\x0bS'
and in Python 3.6 bytes:
b'\xdb\x91\x13%\x98\xcc\xbfv\xaef\x8bK\x08Qv\xbb\x19\'u"\x1f\xdb\xb5\x0f\xce\x1c9\'\xc0w\xea\xf1\xe3\xda\xc4\xc8\xa8\xe8\x02\x8c?r\x95\xef\x81W\xce\xd5\x97\xa3n\xf1\xc3\xbf\xa4QG{\xff2\xee\xb1\x80l,\xc0D%\x85\x19z+\xcd,C\x92\x14z\xad\xb90f\xd0\xbaZ\xb6\xdb\xfd?o\xce\xb7\x07:\xe6\x1a]J\xa8\xab\xcb\xcf\xf4\xee\xbd\x1a\x16Uh\x9b\xfd~\xab\x82\xd7{\xf7"Ou\xfb\xcd2<\x9b\x9f\xa9\xc0\xb7\xd7\x99\x18\x08x\xa8\x1d]\x07\xcf\x05\xbe9\xee\xf9\x89\xb2\xfc0w\x99},/\x11b\xe5\xb4}\x99\xe4\xb4\x15\xbc\x8c\xe5\xc7UGi1\xbd\x8e\xd1K_\xce\xc1\xc8\xc6TQYF\xabx`\xbb\xbe\xe7\xdc\xcf\xda\xa7\xaaA\x0f\xf6SR\xb1S\xb5\x87(\xd5x\x14\xc6\x10\xf8%(m\x83\x0c0\x84)\xbd\xcf\x11g\x88{\x12^\xfb/\xa3K=\xea\xcd2\x9fWgL\x07\x1b\xefl\x9c\xea\xc0\xc7\xfa\xbbXz\x1do\x8bM\x0bS'
The reason is that in Python 2.7
str == bytes #True
while in Python 3.6
str == bytes #False
Python 2 strings are byte strings, while Python 3 strings are Unicode strings. Both results hold the same data, but a byte string in Python 3 has type bytes rather than str, and its literal representation is prefixed with b.
This has nothing to do with ASCII encoding as none of the output variables (regardless of Python version) is ASCII encoded.
Moreover on Python 2.7 you will also get this error:
codecs.decode(var, 'hex').decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdb in position 0: ordinal not in range(128)
I actually pasted it from Python 2.7 interpreter but you can check yourself.
Your output cannot be decoded with the ascii codec in any version of Python, because it is simply not an ASCII-encoded string.

Python string cent symbol conversion

The string I have is
u'3.4\xa2 / each'
The '\xa2' is the "cent" symbol, and I want to show it that way.
I tried
i= "3.4\xa2 / each"
print unicode(i, errors='replace')
In the result, the cent symbol is shown as a question mark inside a solid circle.
I also tried
i= "3.4\xa2 / each"
print i.encode('utf-8')
I get
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa2 in position 3: ordinal not in range(128)
So what is the right way to accomplish this?

'\xa2' is a byte. It may be interpreted as a cent symbol, but only if you specify the right codec. Only by specifying the right codec can you decode it to the equivalent Unicode codepoint. Latin-1 would do:
>>> print '\xa2'.decode('latin1')
¢
There are a whole series of encodings that encode the ¢ cent codepoint as A2, however.
Alternatively, start with a Unicode string to begin with. \xa2 in a Unicode string expression is the same thing as \u00a2, which happens to be the right codepoint:
>>> print u'\xa2'
¢
>>> print u'\u00a2'
¢
That's because the first 256 codepoints of the Unicode standard happen to match the Latin-1 (ISO-8859-1) standard.
You may have trouble printing; if you are using a terminal or console, print is supposed to automatically encode Unicode data to match your terminal or console configuration, but that may not always be correct or be set to a codec that can handle the characters you are trying to print!
Note that I decoded. If you encode, Python tries to be helpful and decodes the bytes to a Unicode object first, so that it can be encoded afterwards. Because \xa2 is not a valid ASCII byte, that decoding failed.
You may want to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO
before continuing.
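To make the point about codecs concrete, here is a short Python 3 sketch showing that the meaning of byte 0xA2 depends entirely on the codec. The three "cent" encodings listed are just examples I picked; cp437 is included as a contrasting case where the same byte maps to ó.

```python
cent_byte = b'\xa2'

# Several single-byte encodings place the cent sign at 0xA2.
for name in ('latin-1', 'cp1252', 'iso-8859-15'):
    assert cent_byte.decode(name) == '\u00a2'

# cp437 assigns a different character to the same byte value.
other = cent_byte.decode('cp437')
print(repr(other))
```

This is why a byte string on its own carries no meaning: the codec has to come from knowledge of where the data originated.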

A few points:
encode is a method to convert unicode strings to bytes. If you call encode on a byte string, Python 2 will first try to decode it with ASCII and then encode it. That's where your error is coming from.
Your string cannot be decoded with UTF-8, because not every sequence of bytes is valid UTF-8.
Demo:
>>> "3.4\xa2 / each".decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 3: invalid start byte
You can use the latin-1 encoding here, because it maps every byte to the corresponding unicode ordinal.
Demo:
>>> print("3.4\xa2 / each".decode('latin-1'))
3.4¢ / each
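The difference between the two codecs can be checked directly. The following is a minimal Python 3 sketch (variable names are illustrative): latin-1 accepts all 256 byte values, while UTF-8 rejects a stray 0xA2.

```python
all_bytes = bytes(range(256))

# latin-1 never fails: byte value n always becomes code point n.
as_text = all_bytes.decode('latin-1')
assert [ord(ch) for ch in as_text] == list(range(256))

# UTF-8 is stricter: a lone 0xA2 is a continuation byte with no lead byte.
try:
    b'3.4\xa2 / each'.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

text = b'3.4\xa2 / each'.decode('latin-1')
print(utf8_ok)  # False
print(text)     # 3.4¢ / each
```

Because latin-1 never raises, a successful decode with it proves nothing about whether latin-1 was the right choice; it only guarantees that no bytes are lost.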

You can try:
print "3.4" + u"\u00A2" +"each"
Works for me.

Python Unicode hex string decoding

I have the following string: u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' encoded in windows-1255 and I want to decode it into Unicode code points (u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9').
>>> u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp1255.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
However, if I try to decode the string: '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' I don't get the exception:
>>> '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'
How do I decode the Unicode hex string (the one that gets the exception) or convert it to a regular string that can be decoded?
Thanks for the help.

I have the following string: u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' encoded in windows-1255
That is self-contradictory. The u indicates it is a Unicode string. But if you say it is encoded in whatever, it must be a byte string (because a Unicode string can only be encoded into a byte string).
And indeed - your given entities - \xe4\xe7 etc. - represent a byte each, and only through the given encoding, windows-1255 they are given their respective meaning.
In other words, if you have a u'\xe4', you can be sure it is the same as u'\u00e4' and NOT u'\u05d4' as it would be the case otherwise.
If, by any chance, you got your erroneous Unicode string from a source which is unaware of this problem, you can derive from it the byte string you really need: with the help of a "1:1 coding", which is latin1.
So
correct_str = u_str.encode("latin1")
# now every byte of the correct_str corresponds to the respective code point in the 0x80..0xFF range
correct_u_str = correct_str.decode("windows-1255")
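The two steps above could be wrapped in a small function. This is a Python 3 sketch only; reinterpret is a hypothetical helper name, not part of any library, and it assumes every code point in the input stands for exactly one byte.

```python
def reinterpret(text, encoding):
    """Hypothetical helper: treat each code point of `text` as one byte
    and decode those bytes with `encoding`."""
    try:
        raw = text.encode('latin-1')
    except UnicodeEncodeError as exc:
        # A code point above U+00FF cannot be a single byte, so the input
        # is not the kind of mis-decoded string this helper handles.
        raise ValueError('input contains code points above U+00FF') from exc
    return raw.decode(encoding)


broken = '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'
fixed = reinterpret(broken, 'windows-1255')
print(fixed)
```

The explicit ValueError is a design choice: it makes the helper refuse strings that already contain real non-Latin text instead of failing with a less obvious encoding error.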

That's because \xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9 is a byte array, not a Unicode string: The bytes represent valid windows-1255 characters rather than valid Unicode code points.
Therefore, when prepending it with a u, the Python interpreter can not decode the string, or even print it:
>>> print u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
So, in order to convert your byte array to UTF-8, you will have to decode it as windows-1255 and then encode it to utf-8:
>>> '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255').encode('utf8')
'\xd7\x94\xd7\x97\xd7\x9c\xd7\xa7 \xd7\x94\xd7\xa9\xd7\x9c\xd7\x99\xd7\xa9\xd7\x99'
Which gives the original Hebrew text:
>>> print '\xd7\x94\xd7\x97\xd7\x9c\xd7\xa7 \xd7\x94\xd7\xa9\xd7\x9c\xd7\x99\xd7\xa9\xd7\x99'
החלק השלישי

Try this
>>> u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.encode('latin-1').decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'

Decode like this,
>>> b'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'

What's the difference between these method to deal with Unicode strings in Python?

I tried print a_str.decode("utf-8"), print uni_str, print uni_str.decode("utf-8"), and print uni_str.encode("utf-8").
But only the first one works.
>>> print '\xe8\xb7\xb3'.decode("utf-8")
跳
>>> print u'\xe8\xb7\xb3\xe8'
è·³è
>>> print u'\xe8\xb7\xb3\xe8'.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
>>> print u'\xe8\xb7\xb3\xe8'.encode("utf-8")
è·³è
I'm really confused with how to display a Unicode string normally. If I have a string like this:
a=u'\xe8\xb7\xb3\xe8', how can I print a?

'\xe8\xb7\xb3' is a Chinese character encoded with utf8, so '\xe8\xb7\xb3'.decode('utf-8') works fine and returns the unicode value of 跳, u'\u8df3'. But u'\xe8\xb7\xb3' is a literal unicode string, which is not the same as the unicode value of 跳. And a unicode string cannot be decoded; it is already unicode.
Finally, a=u'\xe8\xb7\xb3\xe8' is really not a valid unicode string[1].
Where does the u'\xe8\xb7\xb3' come from? Another function?
[1] Check out the first comment.

If you have a string like that then it's broken. You'll need to encode it as Latin-1 to get it to a bytestring with the same byte values, and then decode as UTF-8.

The unicode string u'\xe8\xb7\xb3\xe8' is equivalent to u'\u00e8\u00b7\u00b3\u00e8'. What you want is u'\u8df3' which can be encoded in utf8 as '\xe8\xb7\xb3'.
In Python 2, unicode is a UCS-2 string on narrow builds (this is a build option). So, u'\xe8\xb7\xb3\xe8' is a string of 4 16-bit Unicode characters.
If you got a utf-8 string (8bit string) incorrectly presented as Unicode (16bit string), you have to convert it to 8bit string first:
>>> ''.join([chr(ord(a)) for a in u'\xe8\xb7\xb3']).decode('utf8')
u'\u8df3'
Note that '\xe8\xb7\xb3\xe8' is not a valid utf8 string, as the last byte '\xe8' is the lead byte of a three-byte sequence and cannot terminate a utf8 string.
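In Python 3 the same repair, and the failure caused by the dangling byte, can be sketched like this. The variable names are illustrative, and errors='replace' is shown only as one possible way to tolerate the truncated tail.

```python
# Three code points that are really the three UTF-8 bytes of one character.
good = '\xe8\xb7\xb3'.encode('latin-1').decode('utf-8')

# With the extra trailing 0xE8, the byte sequence ends mid-character.
truncated = '\xe8\xb7\xb3\xe8'.encode('latin-1')
try:
    truncated.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'replace' keeps the valid character and marks the broken tail with U+FFFD.
lenient = truncated.decode('utf-8', errors='replace')

print(good)
print(strict_failed)  # True
print(repr(lenient))
```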

string encoding and decoding?

Here are my attempts with error messages. What am I doing wrong?
string.decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 37: ordinal not in range(128)
string.encode('utf-8', "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 37: ordinal not in range(128)

You can't decode a unicode, and you can't encode a str. Try doing it the other way around.

Guessing at all the things omitted from the original question, but, assuming Python 2.x the key is to read the error messages carefully: in particular where you call 'encode' but the message says 'decode' and vice versa, but also the types of the values included in the messages.
In the first example string is of type unicode and you attempted to decode it which is an operation converting a byte string to unicode. Python helpfully attempted to convert the unicode value to str using the default 'ascii' encoding but since your string contained a non-ascii character you got the error which says that Python was unable to encode a unicode value. Here's an example which shows the type of the input string:
>>> u"\xa0".decode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
u"\xa0".decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
In the second case you do the reverse attempting to encode a byte string. Encoding is an operation that converts unicode to a byte string so Python helpfully attempts to convert your byte string to unicode first and, since you didn't give it an ascii string the default ascii decoder fails:
>>> "\xc2".encode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
"\xc2".encode("ascii", "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
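For contrast, Python 3 removes the implicit conversion that produces these confusing messages: text has no decode method and bytes has no encode method. A minimal sketch, with illustrative variable names:

```python
text = '\xa0'        # str: Unicode text (a no-break space)
data = b'\xc2\xa0'   # bytes: the same character encoded as UTF-8

# The "wrong direction" methods simply do not exist in Python 3.
print(hasattr(text, 'decode'))  # False
print(hasattr(data, 'encode'))  # False

# The only available directions are the correct ones.
assert text.encode('utf-8') == data
assert data.decode('utf-8') == text
```

Calling the missing method raises AttributeError immediately, which points at the real mistake instead of at the ascii codec.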

Aside from getting decode and encode backwards, I think part of the answer here is actually don't use the ascii encoding. It's probably not what you want.
To begin with, think of str like you would a plain text file. It's just a bunch of bytes with no encoding actually attached to it. How it's interpreted is up to whatever piece of code is reading it. If you don't know what this paragraph is talking about, go read Joel's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets right now before you go any further.
Naturally, we're all aware of the mess that created. The answer is to, at least within memory, have a standard encoding for all strings. That's where unicode comes in. I'm having trouble tracking down exactly what encoding Python uses internally for sure, but it doesn't really matter just for this. The point is that you know it's a sequence of bytes that are interpreted a certain way. So you only need to think about the characters themselves, and not the bytes.
The problem is that in practice, you run into both. Some libraries give you a str, and some expect a str. Certainly that makes sense whenever you're streaming a series of bytes (such as to or from disk or over a web request). So you need to be able to translate back and forth.
Enter codecs: it's the translation library between these two data types. You use encode to generate a sequence of bytes (str) from a text string (unicode), and you use decode to get a text string (unicode) from a sequence of bytes (str).
For example:
>>> import codecs
>>> s = "I look like a string, but I'm actually a sequence of bytes. \xe2\x9d\xa4"
>>> codecs.decode(s, 'utf-8')
u"I look like a string, but I'm actually a sequence of bytes. \u2764"
What happened here? I gave Python a sequence of bytes, and then I told it, "Give me the unicode version of this, given that this sequence of bytes is in 'utf-8'." It did as I asked, and those bytes (a heart character) are now treated as a whole, represented by their Unicode codepoint.
Let's go the other way around:
>>> u = u"I'm a string! Really! \u2764"
>>> codecs.encode(u, 'utf-8')
"I'm a string! Really! \xe2\x9d\xa4"
I gave Python a Unicode string, and I asked it to translate the string into a sequence of bytes using the 'utf-8' encoding. So it did, and now the heart is just a bunch of bytes it can't print as ASCII; so it shows me the hexadecimal instead.
We can work with other encodings, too, of course:
>>> s = "I have a section \xa7"
>>> codecs.decode(s, 'latin1')
u'I have a section \xa7'
>>> codecs.decode(s, 'latin1')[-1] == u'\u00A7'
True
>>> u = u"I have a section \u00a7"
>>> u
u'I have a section \xa7'
>>> codecs.encode(u, 'latin1')
'I have a section \xa7'
('\xa7' is the section character, in both Unicode and Latin-1.)
So for your question, you first need to figure out what encoding your str is in.
Did it come from a file? From a web request? From your database? Then the source determines the encoding. Find out the encoding of the source and use that to translate it into a unicode.
s = [get from external source]
u = codecs.decode(s, 'utf-8') # Replace utf-8 with the actual input encoding
Or maybe you're trying to write it out somewhere. What encoding does the destination expect? Use that to translate it into a str. UTF-8 is a good choice for plain text documents; most things can read it.
u = u'My string'
s = codecs.encode(u, 'utf-8') # Replace utf-8 with the actual output encoding
[Write s out somewhere]
Are you just translating back and forth in memory for interoperability or something? Then just pick an encoding and stick with it; 'utf-8' is probably the best choice for that:
u = u'My string'
s = codecs.encode(u, 'utf-8')
newu = codecs.decode(s, 'utf-8')
In modern programming, you probably never want to use the 'ascii' encoding for any of this. It's an extremely small subset of all possible characters, and no system I know of uses it by default or anything.
Python 3 does its best to make this immensely clearer simply by changing the names. In Python 3, str was replaced with bytes, and unicode was replaced with str.
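Under the Python 3 names, the decode-on-input and encode-on-output pattern described above might look like the sketch below. The sample data and variable names are invented for illustration, and UTF-8 is assumed as both the input and output encoding.

```python
import codecs

# Bytes arrive from outside (a file, a socket, a database driver).
incoming = b"bytes from a file or socket \xe2\x9d\xa4"

# Decode once, at the input boundary, using the source's actual encoding.
text = codecs.decode(incoming, 'utf-8')

# Inside the program, work only with text.
shout = text.upper()

# Encode once, at the output boundary, using what the destination expects.
outgoing = codecs.encode(shout, 'utf-8')

print(type(text).__name__, type(outgoing).__name__)  # str bytes
```

The method forms incoming.decode('utf-8') and shout.encode('utf-8') do the same thing; the codecs functions are used here only to mirror the examples above.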

That's because your input string can’t be converted according to the encoding rules (strict by default).
I'm not sure, but I have always converted using the unicode() constructor directly; at least, that is the way the official documentation shows:
unicode(your_str, errors="ignore")
