Output issue when using ASCII conversion in Python 3.4.3

I'm trying to write an ASCII character out to a file based on decoded input.
outfile = open('output', 'w')
This works when using a constant value with chr():
c = chr(65)  # works
outfile.write(c)
However, this does not work (note: i is an integer variable):
c = chr(i + 65)  # fails
outfile.write(c)
It complains: "UnicodeEncodeError: 'charmap' codec can't encode character '\x81' in position 0: character maps to <undefined>"
After the conversion from decimal with chr(n), shouldn't the result already be the character?
Why doesn't it work?

In Python 3, the chr function returns a one-character Unicode string (a str instance). At some point in your code, i takes on the value 64, so chr(i + 65) produces the character '\x81'. That's a control character, not something your file's default encoding can represent, which is why you get an error when you try to write it out.
There are various ways to fix this; which one is best will depend a great deal on what you're trying to accomplish.
The first solution would be to figure out why i is reaching the value 64 and stop that from happening. This may be an off-by-one error in a range bound, or something else that's simple to fix.
If you really do want to write '\x81' to your file, another approach is to change how you open the file, specifying an encoding other than the default (or opening in binary mode and handling the encoding yourself). You can also create a byte string from an integer directly with the bytes constructor: bytes((i + 65,)). Note that the argument needs to be an iterable, even if there's just one value.
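Here is a minimal sketch of both approaches; the file name and the example value of i are assumptions:

i = 64  # example value that triggered the error

# Option 1: open the file with an explicit encoding that can represent
# '\x81' (latin-1 maps code points 0-255 one-to-one onto bytes).
with open('output', 'w', encoding='latin-1') as outfile:
    outfile.write(chr(i + 65))

# Option 2: open in binary mode and write raw bytes directly.
# bytes() needs an iterable of ints, even for a single value.
with open('output', 'wb') as outfile:
    outfile.write(bytes((i + 65,)))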

I added a limit on the input i:
i < 128
Now it works fine.
Thanks, everyone, for helping.

Related

Encoding "\xaf" char results in 2 encoded characters

I've just stumbled across the fact that if I try to encode the "\xaf" character, I get an additional character in my bytes object. That is, when I run this command:
print('\xaf'.encode())
I get the following result:
b'\xc2\xaf'
So I thought maybe it's something the print function is doing, so I opened up IDLE and ran the expression by itself, but the extra character is still there.
I really don't understand why this additional character is popping up, and I would be very thankful if someone could explain it to me.
Thanks in advance!
This is how UTF-8 encoding works; you can read more about it in Python's Unicode HOWTO. The main idea is that UTF-8 uses the following rules to encode a string:
If the code point is < 128, it's represented by the corresponding byte value.
If the code point is >= 128, it's turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.
Your initial character has a code point greater than 128 (it does not fit in one byte), hence it is encoded with two bytes.
Note: you can use the ord function to get the Unicode code point of a one-character string:
>>> help(ord)
Help on built-in function ord in module builtins:
ord(c, /)
Return the Unicode code point for a one-character string.
>>>
>>> ord('\xaf')
175
>>> list('\xaf'.encode())
[194, 175]
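If what you actually want is the single byte 0xaf, one option is to encode with latin-1, which maps code points 0-255 one-to-one onto bytes; a small sketch:

>>> list('\xaf'.encode('latin-1'))  # one byte, as expected
[175]
>>> list('\xaf'.encode())           # UTF-8 needs two bytes here
[194, 175]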

Write unmapped characters to file?

For example: the character \x80, or 128 in decimal, has no printable character assigned to it. But if I understand text files correctly, I should still be able to create a file that contains that character, even if nothing can display it. However, when I try to print an array that contains one of these characters, it prints as '\x80', and when I try to write it directly as a chr, I get the error "UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 0: character maps to <undefined>". Am I doing something fundamentally wrong, or is there a fix I just don't know about here?
The bytes type in Python is what I should have been using for this. Though I didn't quite understand it when I posted the question, what I needed was a sequence of single-byte values. That is exactly what the bytes object provides, and even better, it can be used much like a string.
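A minimal sketch of that approach; the file name is hypothetical:

# Open in binary mode and write raw bytes; no text encoding is
# involved, so values like 0x80 are written out as-is.
data = bytes([0x80, 0x41, 0xFF])
with open('raw.bin', 'wb') as f:
    f.write(data)

# bytes supports string-like operations such as slicing and comparison.
assert data[1:2] == b'A'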

Why is Python's .decode('cp037') not working on specific binary array?

When printing out DB2 query results, I'm getting the following error on column 'F00002', which is a binary array.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)
I am using the following line:
print result[2].decode('cp037')
...just as I do for the first two columns, where the same code works fine. Why is this not working on the third column, and what is the proper decoding/encoding?
Notice that the error is about encoding to ASCII, not about decoding from cp037. But you're not asking it to encode anywhere, so why is this happening?
Well, there are actually two possible places this could go wrong, and we can't know which of them it is without some help from you.
First, if your result[2] is already a unicode object, calling decode('cp037') on it will first try to encode it with sys.getdefaultencoding(), which is usually 'ascii', so that it has something to decode. So, instead of getting an error saying "Hey, bozo, I'm already decoded", you get an error about encoding to ASCII failing. (This may seem very silly, but it's useful for a handful of codecs that can decode unicode->unicode or unicode->str, like ROT13 and quoted-printable.)
If this is your problem, the solution is to not call decode. You've presumably already decoded the data somewhere along the way to this point, so don't try to do it again. (If you've decoded it wrong, you need to figure out where you decoded it and fix that to do it right; re-decoding it after it's already wrong won't help.)
Second, passing a Unicode string to print will automatically try to encode it with (depending on your Python version) either sys.getdefaultencoding() or sys.stdout.encoding. If Python has failed to guess the right encoding for your console (pretty common on Windows), or if you're redirecting your script's stdout to a file instead of printing to the console (which means Python can't possibly guess the right encoding), you can end up with 'ascii' even in sys.stdout.encoding.
If this is your problem, you have to explicitly specify the right encoding for your console (if you're lucky, it's in sys.stdout.encoding), or the encoding you want for the text file you're redirecting to (probably 'utf-8', but that's up to you), and explicitly encode everything you print.
So, how do you know which one of these it is?
Simple. print type(result[2]) and see whether it's a unicode or a str. Or break it up into two pieces: x = result[2].decode('cp037') and then print x, and see which of the two raises. Or run it in a debugger. You have all kinds of options for debugging this, but you have to do something.
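A minimal Python 2 sketch of that check, assuming result is the row you fetched:

print type(result[2])          # <type 'unicode'> or <type 'str'>?

x = result[2].decode('cp037')  # raises here -> it was already unicode
print x                        # raises here -> console/redirect encoding problem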
Of course it's also possible that, once you fix the first one, you'll immediately run into the second one. But now you know how to deal with that too.
Also, note that cp037 is EBCDIC, one of the few encodings that Python knows about that isn't ASCII-compatible. In fact, '\xe3' is EBCDIC for the letter T.
It seems that your result[2] is already unicode:
>>> u'\xe3'.decode('cp037')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 0: ordinal not in range(128)
>>> u'\xe3'.encode('cp037')
'F'
In fact, as pointed out by @abarnert in the comments, in Python 2.x calling decode on a unicode object is performed in two steps:
encoding to str with sys.getdefaultencoding(),
then decoding back to unicode.
That is, your statement is translated as:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> u'\xe3'.encode('ascii').decode('cp037')
and the error you get is from the first part of the expression, u'\xe3'.encode('ascii').
All right, so as @abarnert established, you don't really have a Unicode problem per se. Unicode only enters the picture when you try to print. After looking at your data, I can see that there is actually not just EBCDIC character data in there, but arbitrary binary data as well. The data definitely seems columnar, so what we probably have here is a bunch of subfields all packed into the field called F00002 in your example. RPG programmers would refer to this as a data structure; it's akin to a C struct.
The F00001 and K00001 columns probably worked fine because they happen to contain only EBCDIC character data.
So if you want to extract the complete data from F00002, you'll have to find out (via documentation or some person who has the knowledge) what the subfields are. Normally, once you've found that out, you could just use Python's struct module to quickly and simply unpack the data, but since the data comes from an IBM i, you may be faced with converting its native data types into Python's types. (The most common of these would be packed decimal for numeric data.)
For now, you can still extract the character portions of F00002 by decoding as before, but then explicitly choosing a new encoding that works with your output (display or file), as @abarnert suggested. My recommendation is to write the values to a file using result[2].decode('cp037').encode('utf-8'). This will produce a bunch of clearly non-human-readable data interspersed with the text; you may be able to use it as-is, or at least use it to tell you where the text portions are for further processing.
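A minimal Python 2 sketch of that suggestion; the output file name is an assumption:

# Re-encode the EBCDIC field as UTF-8 and dump it to a file so the
# readable text portions become visible.
with open('f00002_dump.txt', 'wb') as out:
    out.write(result[2].decode('cp037').encode('utf-8'))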
Edit:
We don't have time to do all your work and research for you. Things you need to just read up on and work out for yourself:
IBM's packed decimal format (crash course: each digit takes up 4 bits, using basic hexadecimal, with an additional 4 bits on the right for the sign, which is 'F' for positive and 'D' for negative; the whole thing is zero-padded on the left if needed to fill out a whole number of bytes; the decimal place is implied; see the sketch after this list)
IBM's zoned decimal format (crash course: each digit is 1 byte and is identical to the EBCDIC representation of the corresponding character; except that on the rightmost digit, the upper 4 bits are used for the sign, 'F' for positive and 'D' for negative; decimal place is implied)
Python's struct module (it doesn't automatically handle the above types; you have to use raw bytes for everything (type 's') and process them as needed)
Possibly pick up some ideas (and code) for handling IBM packed and zoned decimals from the add-on api2 module for iSeriesPython 2.7; in particular, check out the iSeriesStruct class, which is a subclass of struct.Struct. Keep in mind that the whole module is designed to run on the iSeries, under iSeriesPython, and thus is not necessarily usable as-is from regular Python communicating with the iSeries via pyodbc.
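As referenced in the first item above, here is a minimal sketch of a packed-decimal decoder following those sign conventions; unpack_packed_decimal is a hypothetical helper, not part of any library:

from decimal import Decimal

def unpack_packed_decimal(raw, scale=0):
    # Split each byte into two 4-bit digits (nibbles).
    nibbles = []
    for byte in bytearray(raw):  # bytearray iterates as ints in Python 2 and 3
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()         # the final nibble holds the sign
    value = int(''.join(str(d) for d in nibbles))
    if sign == 0x0D:             # 0xD marks a negative number
        value = -value
    return Decimal(value).scaleb(-scale)

# 0x12 0x34 0x5D encodes -12345 with one implied decimal place.
print(unpack_packed_decimal(b'\x12\x34\x5d', scale=1))  # -1234.5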

Convert an int value to unicode

I am using pyserial and need to send some values less than 255. If I send the int itself, the ASCII representation of the int gets sent. So now I am converting the int into a unicode value and sending it through the serial port:
unichr(numlessthan255)
However it throws this error:
'ascii' codec can't encode character u'\x9a' in position 24: ordinal not in range(128)
What's the best way to convert an int to unicode?
In Python 2, turn it into a string first, then into unicode:
str(integer).decode("utf-8")
This is the best way, I think. It works with any integer, and it still works if you pass a string as the input.
Updated due to a comment: for Python 2 and 3, this works on both, but it's a bit messy:
str(integer).encode("utf-8").decode("utf-8")
Just use chr(somenumber) to get a one-byte value of an int, as long as it is less than 256. pySerial will then send it fine.
If you are looking at sending things over pySerial, it is a very good idea to look at the struct module in the standard library: it handles endianness and packing issues, as well as encoding for just about every data type you are likely to need that is one byte or larger.
I think that the best solution is to be explicit and say that you want to represent a number as a byte (and not as a character):
>>> import struct
>>> struct.pack('B', 128)
'\x80'
This makes your code work in both Python 2 and Python 3 (in Python 3, the result is, as it should, a bytes object). An alternative, in Python 3, would be to use the new bytes([128]) to create a single byte of value 128.
I am not a big fan of the chr() solutions: in Python 3, they produce a (character, not byte) string that needs to be encoded before sending it anywhere (file, socket, terminal, etc.); chr() in Python 3 is equivalent to the problematic Python 2 unichr() from the question. The struct solution has the advantage of correctly producing a byte whatever the version of Python. If you want to send data over the serial port with chr(), you need to have control over the encoding that takes place subsequently. The code might appear to work when the default encoding used by Python 3 is UTF-8 (which I think is the case), but only because Unicode characters with code points below 128 are encoded as a single byte in UTF-8; values from 128 to 255 are encoded as two bytes, which is not what you want on a serial port. This adds an unnecessary layer of subtlety and complexity that I do not recommend (it makes the code harder to understand and, if necessary, debug).
So, I strongly suggest that you use the approach above (which was also hinted at by Steve Barnes and Martijn Pieters): it makes it clear that you want to produce a byte (and not a character). It will not give you any surprises even if you run your code with Python 3, and it makes your intent clearer and more obvious.
Use the chr() function instead; you are sending a value less than 256 but greater than 128, and unichr() creates a Unicode character from it.
The unicode character then has to be encoded first to get a byte string, and that encoding fails because you are using a value outside the ASCII range (0-127):
>>> str(unichr(169))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)
This is normal Python 2 behaviour; when trying to convert a unicode string to a byte string, an implicit encoding has to take place and the default encoding is ASCII.
If you were to use chr() instead, you create a byte string of one character and that implicit encoding does not have to take place:
>>> str(chr(169))
'\xa9'
Another method you may want to look into is the struct module, especially if you need to send integer values greater than 255:
>>> import struct
>>> struct.pack('!H', 1000)
'\x03\xe8'
The above example packs an integer into an unsigned short in network byte order.
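A minimal usage sketch combining struct with pySerial; the port name and baud rate are hypothetical:

import struct
import serial  # pySerial

ser = serial.Serial('/dev/ttyUSB0', 9600)  # hypothetical port settings
ser.write(struct.pack('B', 200))           # one unsigned byte (0-255)
ser.write(struct.pack('!H', 1000))         # two-byte unsigned short, big-endian
ser.close()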

Linux/Python: encoding a unicode string for print

I have a fairly large python 2.6 application with lots of print statements sprinkled about. I'm using unicode strings throughout, and it usually works great. However, if I redirect the output of the application (like "myapp.py >output.txt"), then I occasionally get errors such as this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa1' in position 0: ordinal not in range(128)
I guess the same issue comes up if someone has set their locale to ASCII. Now, I understand perfectly well the reason for this error: there are characters in my Unicode strings that cannot be encoded in ASCII. Fair enough. But I'd like my Python program to make a best effort to print something understandable, maybe skipping the suspicious characters or replacing them with their Unicode ids.
This problem must be common... What is the best practice for handling this problem? I'd prefer a solution that allows me to keep using plain old "print", but I can modify all occurrences if necessary.
PS: I have now solved this problem. The solution was neither of the answers given. I used the method given at http://wiki.python.org/moin/PrintFails , as given by ChrisJ in one of the comments. That is, I replace sys.stdout with a wrapper that calls unicode encode with the correct arguments. Works very well.
If you're dumping to an ASCII terminal, encode manually using unicode.encode, and specify that errors should be ignored:
u = u'\xa0'
u.encode('ascii')            # This fails
u.encode('ascii', 'ignore')  # This silently drops characters that fail to encode
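Since the question mentions replacing the offending characters with their Unicode ids, the 'replace' and 'backslashreplace' error handlers are also worth a look; a small Python 2 sketch:

u = u'Price: 100\xa1'
print u.encode('ascii', 'replace')           # Price: 100?
print u.encode('ascii', 'backslashreplace')  # Price: 100\xa1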
If you want to store unicode files, try this:
u = u'\xa0'
print >>open('out', 'w'), u # This fails
print >>open('out', 'w'), u.encode('utf-8') # This is ok
Either route all your print statements through a method that performs the unicode -> UTF-8 conversion, or, as a last resort, change the Python default encoding from ascii to utf-8 inside your site.py. In general it is a bad idea to print unicode strings unfiltered to sys.stdout, since Python will trigger an implicit conversion of those strings to the configured default encoding, which is ascii.
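The fix the asker settled on (from the PrintFails wiki page) replaces sys.stdout with an encoding wrapper; a minimal Python 2 sketch of that idea:

import codecs
import sys

# Wrap stdout so everything printed is encoded as UTF-8, replacing
# anything that still cannot be encoded instead of raising.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout, errors='replace')

print u'Works even when redirected: \xa1\u20ac'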
