Convert an int value to unicode - python

I am using pyserial and need to send some values less than 255. If I send the int itself the the ascii value of the int gets sent. So now I am converting the int into a unicode value and sending it through the serial port.
unichr(numlessthan255);
However it throws this error:
'ascii' codec can't encode character u'\x9a' in position 24: ordinal not in range(128)
Whats the best way to convert an int to unicode?

In Python 2 - Turn it into a string first, then into unicode.
str(integer).decode("utf-8")
Best way I think. Works with any integer, plus still works if you put a string in as the input.
Updated edit due to a comment: For Python 2 and 3 - This works on both but a bit messy:
str(integer).encode("utf-8").decode("utf-8")

Just use chr(somenumber) to get a 1 byte value of an int as long as it is less than 256. pySerial will then send it fine.
If you are looking at sending things over pySerial it is a very good idea to look at the struct module in the standard library it handles endian issues an packing issues as well as encoding for just about every data type that you are likely to need that is 1 byte or over.

I think that the best solution is to be explicit and say that you want to represent a number as a byte (and not as a character):
>>> import struct
>>> struct.pack('B', 128)
>>> '\x80'
This makes your code work in both Python 2 and Python 3 (in Python 3, the result is, as it should, a bytes object). An alternative, in Python 3, would be to use the new bytes([128]) to create a single byte of value 128.
I am not a big fan of the chr() solutions: in Python 3, they produce a (character, not byte) string that needs to be encoded before sending it anywhere (file, socket, terminal,…)—chr() in Python 3 is equivalent to the problematic Python 2 unichr() of the question. The struct solution has the advantage of correctly producing a byte whatever the version of Python. If you want to send data over the serial port with chr(), you need to have control over the encoding that must take place subsequently. The code might work when the default encoding used by Python 3 is UTF-8 (which I think is the case), but this is due to the fact that Unicode characters of code point smaller than 256 can be coded as a single byte in UTF-8. This adds an unnecessary layer of subtlety and complexity that I do not recommend (it makes the code harder to understand and, if necessary, debug).
So, I strongly suggest that you use the approach above (which was also hinted at by Steve Barnes and Martijn Pieters): it makes it clear that you want to produce a byte (and not characters). It will not give you any surprise even if you run your code with Python 3, and it makes your intent clearer and more obvious.

Use the chr() function instead; you are sending a value of less than 256 but more than 128, but are creating a Unicode character.
The unicode character has to then be encoded first to get a byte character, and that encoding fails because you are using a value outside the ASCII range (0-127):
>>> str(unichr(169))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)
This is normal Python 2 behaviour; when trying to convert a unicode string to a byte string, an implicit encoding has to take place and the default encoding is ASCII.
If you were to use chr() instead, you create a byte string of one character and that implicit encoding does not have to take place:
>>> str(chr(169))
'\xa9'
Another method you may want to look into is the struct module, especially if you need to send integer values greater than 255:
>>> struct.pack('!H', 1000)
'\x03\xe8'
The above example packs an integer into a unsigned short in network byte order, for example.

Related

Python convert strings of bytes to byte array

For example given an arbitrary string. Could be chars or just random bytes:
string = '\xf0\x9f\xa4\xb1'
I want to output:
b'\xf0\x9f\xa4\xb1'
This seems so simple, but I could not find an answer anywhere. Of course just typing the b followed by the string will do. But I want to do this runtime, or from a variable containing the strings of byte.
if the given string was AAAA or some known characters I can simply do string.encode('utf-8'), but I am expecting the string of bytes to just be random. Doing that to '\xf0\x9f\xa4\xb1' ( random bytes ) produces unexpected result b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1'.
There must be a simpler way to do this?
Edit:
I want to convert the string to bytes without using an encoding
The Latin-1 character encoding trivially (and unlike every other encoding supported by Python) encodes every code point in the range 0x00-0xff to a byte with the same value.
byteobj = '\xf0\x9f\xa4\xb1'.encode('latin-1')
You say you don't want to use an encoding, but the alternatives which avoid it seem far inferior.
The UTF-8 encoding is unsuitable because, as you already discovered, code points above 0x7f map to a sequence of multiple bytes (up to four bytes) none of which are exactly the input code point as a byte value.
Omitting the argument to .encode() (as in a now-deleted answer) forces Python to guess an encoding, which produces system-dependent behavior (probably picks UTF-8 on most systems except Windows, where it will typically instead choose something much more unpredictable, as well as usually much more sinister and horrible).
I found a working solution
import struct
def convert_string_to_bytes(string):
bytes = b''
for i in string:
bytes += struct.pack("B", ord(i))
return bytes
string = '\xf0\x9f\xa4\xb1'
print (convert_string_to_bytes(string)))
output:
b'\xf0\x9f\xa4\xb1'

Is there a way to find a character's Unicode code point in Python 2.7?

I'm working with International Phonetic Alphabet (IPA) symbols in my Python program, a rather strange set of characters whose UTF-8 codes can range anywhere from 1 to 3 bytes long. This thread from several years ago basically asked the reverse question and it seems that ord(character) can retrieve a decimal number that I could convert to hex and thereafter to a code point, but the input for ord() seems to be limited to one byte. If I try ord() on any non-ASCII character, like ɨ for example, it outputs:
TypeError: ord() expected a character, but a string of length 2 found
With that no longer an option, is there any way in Python 2.7 to find the Unicode code point of a given character? (And does that character then have to be a unicode type?) I don't mean by just manually looking it up on a Unicode table, either.
With that no longer an option, is there any way in Python 2.7 to find the Unicode code point of a given character? (And does that character then have to be a unicode type?) I don't mean by just manually looking it up on a Unicode table, either.
You can only find the unicode code point of a unicode object. To convert your byte string to a unicode object, decode it using mystr.decode(encoding), where encoding is the encoding of your string. (You know the encoding of your string, right? It's probably UTF-8. :-) Then you can use ord according to the instructions you already found.
>>> ord(b"ɨ".decode('utf-8'))
616
As an aside, from your question it sounds like you're working with the strings in their UTF-8 encoded bytes form. That's probably going to be a pain. You should decode the strings to unicode objects as soon as you get them, and only encode them if you need to output them somewhere.
This is actually a bug in Python 2, depending on how it was built, for unicode characters outside the BMP (>= 0xFFFF); see: https://bugs.python.org/issue8670#msg105656
For example this works:
>>> ord('\uffff')
65535
>>> len('\uffff')
1
But this does not:
>>> ord(u'\U00010000')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
And even more surprisingly:
>>> len(u'\U00010000')
2
This is because there used to be "narrow" builds of Python versus "wide" builds. In "narrow" builds, unicode strings are represented internally with UCS2 (and thus use less memory, but have to use two UCS2 characters ("surrogate pairs") to represent characters above U+FFFF), whereas in "wide" builds UCS4 is used internally for unicode strings and you won't have this problem.
In newer versions of Python 3 (I think since 3.2 or 3.3 but I can't remember) this is no longer a problem and the situation is much better. The easiest way to check is with sys.maxunicode which will be 0xffff on narrow builds.
This answer demonstrates how to extract the ordinal from surrogate pairs in narrow builds.
>>> u'ɨ'
u'\u0268'
>>> u'i'
u'i'
>>> 'ɨ'.decode('utf-8')
u'\u0268'

output issue when using ASCII convertion in Python 3.4.3

I try to write the ASCII character out to a file base on the decoded input.
outfile = open ('output','w')
This one is working if using the constant value for chr()
c = chr(65) << work
outfile.write(c)
However, this one is not working (note: i is an integer variable)
c = chr(i+65) << not work
outfile.write(c)
It complains "UnicodeEncodeError: 'charmap' codec can't encode character '\x81' in position 0: character maps to "
After the conversion from decimal to ASCII, chr(n), should it be the character already?
Why doesn't work?
In Python 3, the chr function returns a one-character Unicode string (a str instance). At some point in your code, i is taking on the value 64, which when you do chr(i+65) creates the character '\x81'. That's a control character, not something that can be encoded in ASCII, which is why you're getting an error when you go to write it out.
There are various ways you can go about fixing this issue, which is best will depend a great deal on what you're trying to accomplish.
The first solution would be to figure out why you're getting an i value of 64 and stop that from happening. This may be an off-by-one error with a range bound, or something else that's simple to fix.
If you really do want to write '\x81' to your file, another approach would be to change how you're opening your file, specifying an encoding other than the default (or opening in binary mode and handling the encoding yourself). You can also create a binary string from an integer directly with the bytes constructor: bytes((i + 65,)). Note that the argument needs to be an iterable, even if there's just one value.
I add a limitation to input i.
i < 128
Now it works fine.
Thanks all for helping.

Why is Python's .decode('cp037') not working on specific binary array?

When printing out DB2 query results I'm getting the following error on column 'F00002' which is a binary array.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)
I am using the following line:
print result[2].decode('cp037')
...just as I do the first two columns where the same code works fine. Why is this not working on the third column and what is the proper decoding/encoding?
Notice that the error is about encoding to ASCII, not about decoding from cp307. But you're not asking it to encode anywhere, so why is this happening?
Well, there are actually two possible places this could go wrong, and we can't know which of them it is without some help from you.
First, if your result[2] is already a unicode object, calling decode('cp037') on it will first try to encode it with sys.getdefaultencoding(), which is usually 'ascii', so that it has something to decode. So, instead of getting an error saying "Hey, bozo, I'm already decoded", you get an error about encoding to ASCII failing. (This may seem very silly, but it's useful for a handful of codecs that can decode unicode->unicode or unicode->str, like ROT13 and quoted-printable.)
If this is your problem, the solution is to not call decode. You've presumably already decoded the data somewhere along the way to this point, so don't try to do it again. (If you've decoded it wrong, you need to figure out where you decoded it and fix that to do it right; re-decoding it after it's already wrong won't help.)
Second, passing a Unicode string to print will automatically try to encode it with (depending on your Python version) either sys.getdefaultencoding() or sys.stdout.encoding. If Python has failed to guess the right encoding for your console (pretty common on Windows), or if you're redirecting your script's stdout to a file instead of printing to the console (which means Python can't possibly guess the right encoding), you can end up with 'ascii' even in sys.stdout.encoding.
If this is your problem, you have to explicitly specify the right encoding for your console (if you're lucky, it's in sys.stdout.encoding), or the encoding you want for the text file you're redirecting to (probably 'utf-8', but that's up to you), and explicitly encode everything you print.
So, how do you know which one of these it is?
Simple. print type(result[2]) and see whether it's a unicode or a str. Or break it up into two pieces: x = result[2].decode('cp037') and then print x, and see which of the two raises. Or run in a debugger. You have all kinds of options for debugging this, but you have to do something.
Of course it's also possible that, once you fix the first one, you'll immediately run into the second one. But now you know how to deal with that to.
Also, note that cp037 is EBCDIC, one of the few encodings that Python knows about that isn't ASCII-compatible. In fact, '\xe3' is EBCDIC for the letter T.
It seems that your result[2] is already unicode:
>>> u'\xe3'.decode('cp037')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 0: ordinal not in range(128)
>>> u'\xe3'.encode('cp037')
'F'
In fact, as pointed out #abarnert in comments, in python 2.x decode being called for unicode object is performed in two steps:
encoding to string with sys.getdefaultencoding(),
then decoding back to unicode
i.e., you statement is translated as:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> u'\xe3'.encode('ascii').decode('cp037')
and the error you get is from the first part of expression, u'\xe3'.encode('ascii')
All right, so as #abarnert established, you don't really have a Unicode problem, per se. The Unicode only enters the picture when trying to print. After looking at your data, I can see that there is actually not just EBCDIC character data in there, but arbitrary binary data as well. The data definitely seems columnar, so what we probably have here is a bunch of subfields all packed into the field called F00002 in your example. RPG programmers would refer to this as a data structure; it's akin to a C struct.
The F00001 and K00001 columns probably worked fine because they happen to contain only EBCDIC character data.
So if you want to extract the complete data from F00002, you'll have to find out (via documentation or some person who has the knowledge) what the subfields are. Normally, once you've found that out, you could just use Python's struct module to quickly and simply unpack the data, but since the data comes from an IBM i, you may be faced with converting its native data types into Python's types. (The most common of these would be packed decimal for numeric data.)
For now, you can still extract the character portions of F00002 by decoding as before, but then explicitly choosing a new encoding that works with your output (display or file), as #abarnert suggested. My recommendation is to write the values to a file, using result[2].decode('cp037').encode('utf-8') (which will produce a bunch of clearly not human-readable data interspersed with the text; you may be able to use that as-is, or you could use it to at least tell you where the text portions are for further processing).
Edit:
We don't have time to do all your work and research for you. Things you need to just read up on and work out for yourself:
IBM's packed decimal format (crash course: each digit takes up 4 bits using basic hexadecimal; with an additional 4 bits on the right for the sign, which is 'F' for positive and 'D' for negative; the whole thing zero-padded on the left if needed to fill out a whole number of bytes; decimal place is implied)
IBM's zoned decimal format (crash course: each digit is 1 byte and is identical to the EBCDIC representation of the corresponding character; except that on the rightmost digit, the upper 4 bits are used for the sign, 'F' for positive and 'D' for negative; decimal place is implied)
Python's struct module (doesn't automatically handle the above types; you have to use raw bytes for everything (type 's') and handle as needed)
Possibly pick up some ideas (and code) for handling IBM packed and zoned decimals from the add-on api2 module for iSeriesPython 2.7 (in particular, check out the iSeriesStruct class, which is a subclass of struct.Struct, keeping in mind that the whole module is designed to be running on the iSeries, using iSeriesPython, and thus is not necessarily usable as-is from regular Python communicating with the iSeries via pyodbc).

python ascii codes to utf

So when i post a name or text in mod_python in my native language i get:
македонија
And i also get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
When i use:
hparser = HTMLParser.HTMLParser()
req.write(hparser.unescape(text))
How can i decode it?
It's hard to explain UnicodeErrors if you don't understand the underlying mechanism. You should really read either or both of
Pragmatic Unicode (Ned Batchelder)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (Joel Spolsky)
In a (very small) nutshell, a Unicode code point is an abstract "thingy" representing one character1. Programmers like to work with these, because we like to think of strings as coming one character at a time. Unfortunately, it was decreed a long time ago that a character must fit in one byte of memory, so there can be at most 256 different characters. Which is fine for plain English, but doesn't work for anything else. There's a global list of code points -- thousands of them -- which are meant to hold every possible character, but clearly they don't fit in a byte.
The solution: there is a difference between the ordered list of code points that make a string, and its encoding as a sequence of bytes. You have to be clear whenever you work with a string which of these forms it should be in.
To convert between the forms you can .encode() a list of code points (a Unicode string) as a list of bytes, and .decode() bytes into a list of code points. To do so, you need to know how to map code points into bytes and vice versa, which is the encoding. If you don't specify one, Python 2.x will guess that you meant ASCII. If that guess is wrong, you will get a UnicodeError.
Note that Python 3.x is much better at handling Unicode strings, because the distinction between bytes and code points is much more clear cut.
1Sort of.
EDIT: I guess I should point out how this helps. But you really should read the above links! Just throwing in .encode()s and .decode()s everywhere is a terrible way to code, and one day you'll get bitten by a worse problem.
Anyway, if you step through what you're doing in the shell you'll see
>>> from HTMLParser import HTMLParser
>>> text = "македонија"
>>> hparser = HTMLParser()
>>> text = hparser.unescape(text)
>>> text
u'\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430'
I'm using Python 2.7 here, so that's a Unicode string i.e. a sequence of Unicode code points. We can encode them into a regular string (i.e. a list of bytes) like
>>> text.encode("utf-8")
'\xd0\xbc\xd0\xb0\xd0\xba\xd0\xb5\xd0\xb4\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x98\xd0\xb0'
But we could also pick a different encoding!
>>> text.encode("utf-16")
'\xff\xfe<\x040\x04:\x045\x044\x04>\x04=\x048\x04X\x040\x04'
You'll need to decide what encoding you want to use.
What went wrong when you did it? Well, not every encoding understands every code point. In particular, the "ascii" encoding only understands the first 256! So if you try
>>> text.encode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
you just get an error, because you can't encode those code points in ASCII.
When you do req.write, you are trying to write a list of code points down the request. But HTML requests don't understand code points: they just use ASCII. Python 2 will try to be helpful by automatically ASCII-encoding your Unicode strings, which is fine if they really are ASCII but not if they aren't.
So you need to do req.write(hparser.unescape(text).encode("some-encoding")).

Categories

Resources