python ascii codes to utf

So when I post a name or text in mod_python in my native language I get:
&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;
And I also get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
When I use:
hparser = HTMLParser.HTMLParser()
req.write(hparser.unescape(text))
How can I decode it?

It's hard to explain UnicodeErrors if you don't understand the underlying mechanism. You should really read either or both of
Pragmatic Unicode (Ned Batchelder)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (Joel Spolsky)
In a (very small) nutshell, a Unicode code point is an abstract "thingy" representing one character¹. Programmers like to work with these, because we like to think of strings as coming one character at a time. Unfortunately, it was decreed a long time ago that a character must fit in one byte of memory, so there can be at most 256 different characters. Which is fine for plain English, but doesn't work for anything else. There's a global list of code points -- thousands of them -- which are meant to hold every possible character, but clearly they don't fit in a byte.
The solution: there is a difference between the ordered list of code points that make a string, and its encoding as a sequence of bytes. You have to be clear whenever you work with a string which of these forms it should be in.
To convert between the forms you can .encode() a list of code points (a Unicode string) as a list of bytes, and .decode() bytes into a list of code points. To do so, you need to know how to map code points into bytes and vice versa, which is the encoding. If you don't specify one, Python 2.x will guess that you meant ASCII. If that guess is wrong, you will get a UnicodeError.
Note that Python 3.x is much better at handling Unicode strings, because the distinction between bytes and code points is much more clear cut.
¹ Sort of.
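For contrast, here is a quick Python 3 sketch of the same idea: text (str) and bytes are separate types, and nothing gets implicitly decoded as ASCII behind your back.
>>> s = "македонија"                # str: a sequence of code points
>>> b = s.encode("utf-8")           # bytes: one particular encoding of those code points
>>> b
b'\xd0\xbc\xd0\xb0\xd0\xba\xd0\xb5\xd0\xb4\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x98\xd0\xb0'
>>> b.decode("utf-8") == s
True
>>> s + b                           # mixing them fails loudly instead of guessing
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "bytes") to str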
EDIT: I guess I should point out how this helps. But you really should read the above links! Just throwing in .encode()s and .decode()s everywhere is a terrible way to code, and one day you'll get bitten by a worse problem.
Anyway, if you step through what you're doing in the shell you'll see
>>> from HTMLParser import HTMLParser
>>> text = "&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;"
>>> hparser = HTMLParser()
>>> text = hparser.unescape(text)
>>> text
u'\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430'
I'm using Python 2.7 here, so that's a Unicode string, i.e. a sequence of Unicode code points. We can encode them into a regular string (i.e. a list of bytes) like this:
>>> text.encode("utf-8")
'\xd0\xbc\xd0\xb0\xd0\xba\xd0\xb5\xd0\xb4\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x98\xd0\xb0'
But we could also pick a different encoding!
>>> text.encode("utf-16")
'\xff\xfe<\x040\x04:\x045\x044\x04>\x04=\x048\x04X\x040\x04'
You'll need to decide what encoding you want to use.
What went wrong when you did it? Well, not every encoding understands every code point. In particular, the "ascii" encoding only understands the first 128! So if you try
>>> text.encode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
you just get an error, because you can't encode those code points in ASCII.
When you do req.write, you are trying to write a list of code points down the request. But an HTTP response doesn't carry code points: it carries bytes. Python 2 will try to be helpful by automatically ASCII-encoding your Unicode strings, which is fine if they really are ASCII but not if they aren't.
So you need to do req.write(hparser.unescape(text).encode("some-encoding")).
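For example, a minimal mod_python handler along these lines (a sketch only; the handler shape and the hard-coded sample text are assumptions to keep it self-contained):
from mod_python import apache
import HTMLParser

def handler(req):
    # Tell the browser what encoding the bytes we send are in.
    req.content_type = "text/html; charset=utf-8"
    # Stand-in for the escaped text you actually receive from the form.
    text = "&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;"
    hparser = HTMLParser.HTMLParser()
    unescaped = hparser.unescape(text)        # a unicode string (code points)
    req.write(unescaped.encode("utf-8"))      # bytes go down the wire
    return apache.OK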

Related

ValueError: Unpaired high surrogate when decoding 'string' on reading json file

I am currently working with Python 3.8.6. I am getting the following error when reading (thousands of) JSON files in Python:
ValueError: Unpaired high surrogate when decoding 'string' on reading json file
I tried the following solutions from other Stack Overflow posts, but nothing worked:
1) import json
json.loads('{"":"\\ud800"}')
2) import simplejson
simplejson.loads('{"":"\\ud800"}')
The problem is that after getting this error the remaining json files are not read. Is there a way to get rid of this error so I can read all the json files?
I am not sure what information is necessary to provide about the problem, so please feel free to ask.
Unicode code point U+D800 may only occur as part of a surrogate pair (and then only in UTF-16 encoding). So that string inside the JSON is (after decoding it) not valid UTF-8.
The JSON itself might or might not be valid. The spec doesn't mention the case of unpaired surrogates, but it does explicitly allow nonexistent code points:
To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.
Note that the JSON grammar permits code points for which Unicode does not currently provide character assignments.
Now, you can choose your friends, but you can't choose your family and you can't always choose your JSON either. So the next question is: how to parse this mess?
It looks like both the built-in json module in Python (version 3.9) and simplejson (version 3.17.2) have no problems parsing the JSON. The problem only occurs once you try to use the string. So this really doesn't have anything to do with JSON at all:
>>> bork = '\ud800'
>>> bork
'\ud800'
>>> print(bork)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
Fortunately, we can encode the string manually and tell Python how to handle the error. For example, replace the erroneous code point with a question mark:
>>> bork.encode('utf-8', errors='replace')
b'?'
The documentation lists other possible options for the errors argument.
To fix up this broken string, we can encode (into bytes) and then decode (back into str):
>>> bork.encode('utf-8', errors='replace').decode('utf-8')
'?'
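If what you really need is to keep ploughing through thousands of files without one bad string stopping the rest, you could apply that same encode/decode fix-up to every string after parsing. A rough sketch (the scrub helper and the data/*.json glob are my own invention, not part of any library):
import json
from pathlib import Path

def scrub(value):
    # Recursively replace unpaired surrogates in every string of a parsed JSON value.
    if isinstance(value, str):
        return value.encode("utf-8", errors="replace").decode("utf-8")
    if isinstance(value, list):
        return [scrub(item) for item in value]
    if isinstance(value, dict):
        return {scrub(k): scrub(v) for k, v in value.items()}
    return value

records = []
for path in Path("data").glob("*.json"):
    try:
        records.append(scrub(json.loads(path.read_text(encoding="utf-8"))))
    except ValueError as exc:
        # A genuinely unparseable file gets reported and skipped instead of aborting the loop.
        print(f"skipping {path}: {exc}")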
A Unicode surrogate in isolation does not correspond to anything. Every valid high surrogate code point needs to be immediately followed by a low surrogate code point before it can be meaningfully decoded.
The error message simply means that this code point in isolation does not have a well-defined meaning. It's like saying "take" without saying what we should take, or "look at" without the object of the sentence filled in.
You should not be using surrogates in files that are not UTF-16 anyway; they are reserved strictly for that encoding. UTF-16 uses them to represent characters outside the 16-bit range it can otherwise express directly, by splitting such characters across two 16-bit code units.
The simple and obvious fix is to supply the missing information, but we can't know what it is. Perhaps you have more context and can fill in the correct low surrogate. But for example, this works:
>>> json.loads('{"":"\\ud800\\udc00"}')
{'': '𐀀'}
It populates the JSON with the single code point U+10000, but of course we can have no idea whether that's actually the code point your data should contain.

Is there a way to find a character's Unicode code point in Python 2.7?

I'm working with International Phonetic Alphabet (IPA) symbols in my Python program, a rather strange set of characters whose UTF-8 encodings can be anywhere from 1 to 3 bytes long. This thread from several years ago basically asked the reverse question, and it seems that ord(character) can retrieve a decimal number that I could convert to hex and then to a code point, but the input for ord() seems to be limited to one byte. If I try ord() on any non-ASCII character, like ɨ for example, it outputs:
TypeError: ord() expected a character, but a string of length 2 found
With that no longer an option, is there any way in Python 2.7 to find the Unicode code point of a given character? (And does that character then have to be a unicode type?) I don't mean by just manually looking it up on a Unicode table, either.
You can only find the unicode code point of a unicode object. To convert your byte string to a unicode object, decode it using mystr.decode(encoding), where encoding is the encoding of your string. (You know the encoding of your string, right? It's probably UTF-8. :-) Then you can use ord according to the instructions you already found.
>>> ord(b"ɨ".decode('utf-8'))
616
As an aside, from your question it sounds like you're working with the strings in their UTF-8 encoded bytes form. That's probably going to be a pain. You should decode the strings to unicode objects as soon as you get them, and only encode them if you need to output them somewhere.
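For instance, a small sketch of that decode-early/encode-late pattern in Python 2 (the file names and the UTF-8 encoding are assumptions):
import io

# Decode at the input boundary: io.open with an encoding hands you unicode objects.
with io.open("ipa_symbols.txt", encoding="utf-8") as f:
    text = f.read()

# Work in unicode inside the program: ord() now sees one character at a time.
codepoints = [ord(ch) for ch in text if not ch.isspace()]

# Encode only at the output boundary.
with io.open("codepoints.txt", "w", encoding="utf-8") as out:
    out.write(u"\n".join(u"U+%04X" % cp for cp in codepoints))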
This is actually a bug in Python 2, depending on how it was built, for Unicode characters outside the BMP (above 0xFFFF); see: https://bugs.python.org/issue8670#msg105656
For example this works:
>>> ord(u'\uffff')
65535
>>> len(u'\uffff')
1
But this does not:
>>> ord(u'\U00010000')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
And even more surprisingly:
>>> len(u'\U00010000')
2
This is because there used to be "narrow" builds of Python versus "wide" builds. In "narrow" builds, unicode strings are represented internally as UCS-2 (and thus use less memory, but have to use two 16-bit code units, a "surrogate pair", to represent characters above U+FFFF), whereas "wide" builds use UCS-4 internally for unicode strings and don't have this problem.
In Python 3.3 and later (PEP 393) the narrow/wide distinction is gone and this is no longer a problem. The easiest way to check which build you have is sys.maxunicode, which will be 0xffff on narrow builds.
This answer demonstrates how to extract the ordinal from surrogate pairs in narrow builds.
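The arithmetic behind that is small enough to sketch here (only needed on a narrow Python 2 build, where a non-BMP character shows up as a length-2 surrogate pair):
def code_point(ch):
    # ch is a unicode string holding one logical character.
    if len(ch) == 2 and u'\ud800' <= ch[0] <= u'\udbff' and u'\udc00' <= ch[1] <= u'\udfff':
        # Combine the UTF-16 surrogate pair back into the real code point.
        return 0x10000 + ((ord(ch[0]) - 0xD800) << 10) + (ord(ch[1]) - 0xDC00)
    return ord(ch)

# code_point(u'\U00010000') == 0x10000 on both narrow and wide builds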
>>> u'ɨ'
u'\u0268'
>>> u'i'
u'i'
>>> 'ɨ'.decode('utf-8')
u'\u0268'

How to convert a unicode character representation from string to unicode in python?

OK, I've found a lot of threads about how to convert a string from something like "\xe3" to "ã", but how the hell am I supposed to do it the other way around?
My concrete problem: I am using an API and everything works great, except that I provide some strings which then result in a JSON object. The result is sorted by the names (strings) I provided, however they are returned as their Unicode representation, since JSON APIs always work with plain strings. So all I need is a way to get from "ã" to "\xe3", but I can't for the love of god get it to work.
Every type of encoding or decoding I try either defaults back to a normal string, a string without that character, a string with a plain A, or a Unicode error saying ASCII can't decode it. (<- this was due to a horrible shell setup. Yay for old me.)
All I want is the plain encoded string!
(Yeah, no, not at all, past me. All you want is the Unicode representation of a character as a string.)
PS: All in python if that wasn't obvious from the title already.
Edit: Even though this is quite old I wanted to update this to not completely embarrass myself in the future.
The issue was an API that returned Unicode representations of characters as strings in its response. All I wanted to do was check whether they were the same, but I had major issues getting Python to interpret the string as Unicode, especially since those characters appeared inside a longer text that partially contained backslashes.
This did help but I just stumbled across this horribly written question and just couldn't leave it like that.
"\xe3" in python is a string literal that represents a single byte with value 227:
>>> print len("\xe3")
1
>>> print ord("\xe3")
227
This single byte represents the 'ã' character in the latin-1 encoding (http://en.wikipedia.org/wiki/ISO/IEC_8859-1).
"ã" in python is a string literal consisting of two bytes: 0xC3, 0xA3 (195, 163):
>>> print len("ã")
2
>>> print ord("ã"[0])
195
>>> print ord("ã"[1])
163
This byte sequence is the UTF-8 encoding of the character "ã".
So, to go from "ã" in python to "\xe3", you first need to decode the utf-8 byte sequence into a python unicode string:
>>> "ã".decode("utf-8")
u'\xe3'
Now, you can take that unicode string and encode it however you like (e.g. into latin-1):
>>> "ã".decode("utf-8").encode("latin-1")
'\xe3'
Please read http://www.joelonsoftware.com/articles/Unicode.html . You should realize there is no such thing as "a plain encoded string". There is only "a string encoded in a given text encoding". So you really need to understand the concepts of Unicode better.
Among other things, this is plain wrong: "The result is sorted after the names (strings) I provided however they are returned in encoded form." JSON uses Unicode, so you get the string in a decoded form.
Since I assume you are, perhaps unknowingly, working with UTF-8, you should be aware that \xe3 is the Unicode code point (U+00E3) for the character ã. Not to be confused with the actual bytes that UTF-8 uses to represent that code point:
http://hexutf8.com/?q=U+e3
That is, UTF-8 maps the byte sequence c3 a3 to the code point U+00E3, which represents the character ã.
UTF-16 maps a different byte sequence, 00 e3, to that exact same code point. (Note how much simpler, but less space-efficient, the UTF-16 encoding is...)
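You can see all three forms side by side in a Python 2 shell:
>>> u'\xe3'                         # the code point U+00E3
u'\xe3'
>>> u'\xe3'.encode('utf-8')         # UTF-8: two bytes
'\xc3\xa3'
>>> u'\xe3'.encode('utf-16-be')     # UTF-16 (big-endian): a different two bytes
'\x00\xe3'
>>> u'\xe3'.encode('latin-1')       # latin-1: the byte value happens to equal the code point
'\xe3'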

Why is Python's .decode('cp037') not working on specific binary array?

When printing out DB2 query results I'm getting the following error on column 'F00002' which is a binary array.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)
I am using the following line:
print result[2].decode('cp037')
...just as I do for the first two columns, where the same code works fine. Why is this not working on the third column, and what is the proper decoding/encoding?
Notice that the error is about encoding to ASCII, not about decoding from cp037. But you're not asking it to encode anywhere, so why is this happening?
Well, there are actually two possible places this could go wrong, and we can't know which of them it is without some help from you.
First, if your result[2] is already a unicode object, calling decode('cp037') on it will first try to encode it with sys.getdefaultencoding(), which is usually 'ascii', so that it has something to decode. So, instead of getting an error saying "Hey, bozo, I'm already decoded", you get an error about encoding to ASCII failing. (This may seem very silly, but it's useful for a handful of codecs that can decode unicode->unicode or unicode->str, like ROT13 and quoted-printable.)
If this is your problem, the solution is to not call decode. You've presumably already decoded the data somewhere along the way to this point, so don't try to do it again. (If you've decoded it wrong, you need to figure out where you decoded it and fix that to do it right; re-decoding it after it's already wrong won't help.)
Second, passing a Unicode string to print will automatically try to encode it with (depending on your Python version) either sys.getdefaultencoding() or sys.stdout.encoding. If Python has failed to guess the right encoding for your console (pretty common on Windows), or if you're redirecting your script's stdout to a file instead of printing to the console (which means Python can't possibly guess the right encoding), you can end up with 'ascii' even in sys.stdout.encoding.
If this is your problem, you have to explicitly specify the right encoding for your console (if you're lucky, it's in sys.stdout.encoding), or the encoding you want for the text file you're redirecting to (probably 'utf-8', but that's up to you), and explicitly encode everything you print.
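If that second case is your problem, a minimal sketch of encoding explicitly at print time (the UTF-8 fallback is an assumption; use whatever your console or target file really expects):
import sys

text = u'\xe3'                               # stand-in for result[2], already a unicode object
encoding = sys.stdout.encoding or 'utf-8'    # sys.stdout.encoding is None when output is redirected
print text.encode(encoding, errors='replace')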
So, how do you know which one of these it is?
Simple. print type(result[2]) and see whether it's a unicode or a str. Or break it up into two pieces: x = result[2].decode('cp037') and then print x, and see which of the two raises. Or run in a debugger. You have all kinds of options for debugging this, but you have to do something.
Of course it's also possible that, once you fix the first one, you'll immediately run into the second one. But now you know how to deal with that too.
Also, note that cp037 is EBCDIC, one of the few encodings that Python knows about that isn't ASCII-compatible. In fact, '\xe3' is EBCDIC for the letter T.
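You can check that in the interpreter:
>>> '\xe3'.decode('cp037')
u'T'
>>> u'T'.encode('cp037')
'\xe3'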
It seems that your result[2] is already unicode:
>>> u'\xe3'.decode('cp037')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 0: ordinal not in range(128)
>>> u'\xe3'.encode('cp037')
'F'
In fact, as pointed out by @abarnert in the comments, in Python 2.x calling decode on a unicode object is performed in two steps:
encoding to string with sys.getdefaultencoding(),
then decoding back to unicode
i.e., your statement is translated as:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> u'\xe3'.encode('ascii').decode('cp037')
and the error you get comes from the first part of the expression, u'\xe3'.encode('ascii').
All right, so as @abarnert established, you don't really have a Unicode problem, per se. The Unicode only enters the picture when trying to print. After looking at your data, I can see that there is actually not just EBCDIC character data in there, but arbitrary binary data as well. The data definitely seems columnar, so what we probably have here is a bunch of subfields all packed into the field called F00002 in your example. RPG programmers would refer to this as a data structure; it's akin to a C struct.
The F00001 and K00001 columns probably worked fine because they happen to contain only EBCDIC character data.
So if you want to extract the complete data from F00002, you'll have to find out (via documentation or some person who has the knowledge) what the subfields are. Normally, once you've found that out, you could just use Python's struct module to quickly and simply unpack the data, but since the data comes from an IBM i, you may be faced with converting its native data types into Python's types. (The most common of these would be packed decimal for numeric data.)
For now, you can still extract the character portions of F00002 by decoding as before, but then explicitly choosing a new encoding that works with your output (display or file), as @abarnert suggested. My recommendation is to write the values to a file, using result[2].decode('cp037').encode('utf-8') (which will produce a bunch of clearly not human-readable data interspersed with the text; you may be able to use that as-is, or you could use it to at least tell you where the text portions are for further processing).
Edit:
We don't have time to do all your work and research for you. Things you need to just read up on and work out for yourself:
IBM's packed decimal format (crash course: each digit takes up 4 bits using basic hexadecimal; with an additional 4 bits on the right for the sign, which is 'F' for positive and 'D' for negative; the whole thing zero-padded on the left if needed to fill out a whole number of bytes; decimal place is implied; a rough unpacking sketch follows this list)
IBM's zoned decimal format (crash course: each digit is 1 byte and is identical to the EBCDIC representation of the corresponding character; except that on the rightmost digit, the upper 4 bits are used for the sign, 'F' for positive and 'D' for negative; decimal place is implied)
Python's struct module (doesn't automatically handle the above types; you have to use raw bytes for everything (type 's') and handle as needed)
Possibly pick up some ideas (and code) for handling IBM packed and zoned decimals from the add-on api2 module for iSeriesPython 2.7 (in particular, check out the iSeriesStruct class, which is a subclass of struct.Struct, keeping in mind that the whole module is designed to be running on the iSeries, using iSeriesPython, and thus is not necessarily usable as-is from regular Python communicating with the iSeries via pyodbc).
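As a rough illustration of the packed-decimal layout described in the first item above (a sketch only, not IBM's code; the sample bytes and the implied scale are made up):
from decimal import Decimal

def unpack_packed_decimal(raw, scale=0):
    # raw: the packed-decimal field as a byte string; scale: number of implied decimal places.
    nibbles = []
    for byte in raw:
        value = ord(byte)
        nibbles.append(value >> 4)     # high nibble
        nibbles.append(value & 0x0F)   # low nibble
    sign = nibbles.pop()               # rightmost nibble is the sign: 0xD means negative
    number = int(''.join(str(d) for d in nibbles))
    if sign == 0x0D:
        number = -number
    return Decimal(number).scaleb(-scale)

print unpack_packed_decimal('\x12\x34\x5d', scale=1)   # 0x12 0x34 0x5D -> -1234.5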

Python, .format(), and UTF-8

My background is in Perl, but I'm giving Python plus BeautifulSoup a try for a new project.
In this example, I'm trying to extract and present the link targets and link text contained in a single page. Here's the source:
table_row = u'<tr><td>{}</td><td>{}</td></tr>'.encode('utf-8')
link_text = unicode(link.get_text()).encode('utf-8')
link_target = link['href'].encode('utf-8')
line_out = unicode(table_row.format(link_text, link_target))
All those explicit calls to .encode('utf-8') are my attempt to make this work, but they don't seem to help -- it's likely that I'm completely misunderstanding something about how Python 2.7 handles Unicode strings.
Anyway. This works fine up until it encounters U+2013 in a URL (yes, really). At that point it bombs out with:
Traceback (most recent call last):
File "./test2.py", line 30, in <module>
line_out = unicode(table_row.encode('utf-8').format(link_text, link_target.encode('utf-8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 79: ordinal not in range(128)
Presumably .format(), even applied to a Unicode string, is playing silly-buggers and trying to do a .decode() operation. And as ASCII is the default, it's using that, and of course it can't map U+2013 to an ASCII character, and thus...
The options seem to be to remove it or convert it to something else, but really what I want is to simply preserve it. Ultimately (this is just a little test case) I need to be able to present working clickable links.
The BS3 documentation suggests changing the default encoding from ASCII to UTF-8, but from reading comments on similar questions that looks to be a really bad idea, as it'll muck up dictionaries.
Short of using Python 3.2 instead (which means no Django, which we're considering for part of this project) is there some way to make this work cleanly?
First, note that your two code samples disagree on the text of the problematic line:
line_out = unicode(table_row.encode('utf-8').format(link_text, link_target.encode('utf-8')))
vs
line_out = unicode(table_row.format(link_text, link_target))
The first is the one from the traceback, so it's the one to look at. Assuming the rest of your first code sample is accurate, table_row is a byte string, because you took a unicode string and encoded it. A byte string can't be encoded directly, so when you call .encode() on it Python 2 first implicitly converts it to unicode by decoding it as ascii. Hence the error message: a UnicodeDecodeError from the ascii codec.
You need to decide what strings will be byte strings and which will be unicode strings, and be disciplined about it. I recommend keeping all text as Unicode strings as much as possible.
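For example, under the keep-everything-unicode approach your snippet might look roughly like this (link and output_file stand for whatever you already have; BeautifulSoup hands you unicode strings already):
table_row = u'<tr><td>{}</td><td>{}</td></tr>'   # unicode template, never encoded here
link_text = link.get_text()                      # already unicode
link_target = link['href']                       # already unicode
line_out = table_row.format(link_text, link_target)

# Encode exactly once, at the boundary where bytes are actually required.
output_file.write(line_out.encode('utf-8'))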
Here's a presentation I gave at PyCon that explains it all: Pragmatic Unicode, or, How Do I Stop The Pain?
