Python Encoding Issue - python

I am really lost in all the encoding/decoding issues with Python. Having read quite a few docs about how to handle incoming text properly, I still have issues with a few languages, like Korean. Anyhow, here is what I am doing.
korean_text = korean_text.encode('utf-8', 'ignore')
korean_text = unicode(korean_text, 'utf-8')
I save the above data to the database, which goes through fine.
Later, when I need to display the data, I fetch the content from the db and do the following:
korean_text = korean_text.encode( 'utf-8' )
print korean_text
And all I see is '???' echoed on the browser. Can someone please let me know the right way to save and display the above data?
Thanks

Even having read some docs, you seem to be confused about how unicode works.
Unicode is not an encoding. Unicode is the absence of encodings.
utf-8 is not unicode. utf-8 is an encoding.
You decode utf-8 bytestrings to get unicode. You encode unicode using an encoding, say, utf-8, to get an encoded bytestring.
Only bytestrings can be saved to disk or a database, sent over a network, or printed to a printer or screen. Unicode only exists inside your code.
The good practice is to decode everything you get as early as possible, work with it decoded, as unicode, in all your code, and then encode it as late as possible, when the text is ready to leave your program, to screen, database or network.
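For example, a minimal sketch of that lifecycle in Python 2 (the byte string is just a short Korean phrase already encoded as utf-8, chosen for illustration):
raw_bytes = '\xed\x95\x9c\xea\xb5\xad\xec\x96\xb4'   # bytes as they arrive from outside
text = raw_bytes.decode('utf-8')     # decode as early as possible: bytes -> unicode
assert isinstance(text, unicode)
# ... all the real work happens on the unicode object ...
output = text.encode('utf-8')        # encode as late as possible: unicode -> bytes
print output                         # the bytes are what actually leave the program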
Now for your problem:
If you have a text that came from the browser, say, from a form, then it is encoded. It is a bytestring. It is not unicode.
You must then decode it to get unicode. Decode it using the encoding the browser used to encode. The correct encoding comes from the browser itself, in the headers of the HTTP request.
Don't use 'ignore' when decoding. Since the browser said which encoding it is using, you shouldn't get any errors. Using 'ignore' means you will hide a bug if there is one.
Perhaps your web framework of choice already does that. I know that Django, Pylons, Werkzeug and CherryPy all do. In that case you already get unicode.
Now that you have a decoded unicode string, you can encode it using whatever encoding you like to store on the database. utf-8 is a good choice, since it can encode all unicode codepoints.
When you retrieve the data from the database, decode it using the same encoding you used to store it. And then encode it using the encoding you want to use on the page - the one declared in the html meta header <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>. If the encoding is the same used on the previous step, you can skip the decode/reencode since it is already encoded in utf-8.
If you see ??? then the data is being lost on any step above. To know exactly, more information is needed.

Read through this post about handling Unicode in Python.
You basically want to be doing these things:
.encode() text to a particular encoding (such as utf-8) before sending it to the database.
.decode() text back to unicode (from your encoding) when reading it from the database
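A minimal sketch of that round trip in Python 2, using an in-memory sqlite3 database purely for illustration (any DB-API driver follows the same pattern):
import sqlite3

conn = sqlite3.connect(':memory:')
conn.text_factory = str               # hand TEXT columns back as raw bytestrings
conn.execute("CREATE TABLE posts (body TEXT)")

text = u'\ud55c\uad6d\uc5b4'          # unicode inside the program (a Korean word)
conn.execute("INSERT INTO posts (body) VALUES (?)",
             (text.encode('utf-8'),))    # .encode() on the way in

(body,) = conn.execute("SELECT body FROM posts").fetchone()
print body.decode('utf-8') == text    # .decode() on the way out -> True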

The problem is most certainly (especially if other non-ASCII characters appear to work fine) that your browser or OS doesn't have the right fonts to display Korean text, or that the default font used by your browser doesn't support Korean. Try to choose another font until it works.

Related

html DOM: good test webpages to test encoding/decoding

What I'm doing is:
via JavaScript, reading the DOM of the webpage
converting it to a json string
sending it to Python via ajax
in Python, json-decoding the string into an object
What I want is for any text that is part of the json to be in unicode to avoid any character issues. I used to use beautifulsoup for this:
from bs4 import *
from bs4.dammit import UnicodeDammit
text_unicode = UnicodeDammit(text, [None, None], "html", True).unicode_markup
But that doesn't work with the json string. Running the string through UnicodeDammit causes an error when I try to json decode it.
The thing is, I'm not even sure that collecting the DOM doesn't handle this issue automatically.
For starters, I would therefore like a series of test webpages to test this: one encoded with utf-8, another with something else, and so on, using characters that will look wrong if, for example, you assume it's utf-8 but it isn't. Note that I don't even bother considering the webpage's stated encoding. This is too often wrong.
You are trying to solve a problem that does not exist.
The browser is responsible for detecting and handling the web page encoding. It'll determine the correct encoding based on the server headers, meta tags in the HTML page and plain guessing if needed. The DOM gives you Unicode data.
JSON handles Unicode data; sending JSON data to your Python process sends appropriately encoded byte data that any decent JSON library will turn back into Unicode values for you. The Python json module is such a library.
Just load the data from your JavaScript script with the json.load() or json.loads() functions as is. Your browser will already have used the correct encoding (most likely UTF-8), and the Python json module will decode any of the standard encodings used without additional configuration or handling.
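A minimal Python 2 sketch of that; the request body below is made up, yours comes from the ajax POST:
import json

request_body = '{"heading": "Caf\\u00e9 menu"}'   # bytes as received from the browser
data = json.loads(request_body)
print type(data['heading'])                       # <type 'unicode'>
print data['heading'] == u'Caf\xe9 menu'          # True - no UnicodeDammit step needed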

Scraping a website whose encoding is iso-8859-1 instead of utf-8: how do I store the correct unicode in my database?

I'd like to use Python to scrape a website that is full of horrible problems, one being the wrong encoding declared at the top:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
This is wrong because the page is full of occurrences like the following:
Nell’ambito
instead of
Nell'ambito (please notice ’ replaces ')
If I understand correctly, this is happening because utf-8 bytes (probably the database encoding) are interpreted as iso-8859-1 bytes (forced by the charset in the meta tag).
I found some initial explanation at this link http://www.i18nqa.com/debug/utf8-debug.html
I am using BeautifulSoup to navigate the page and Google App Engine's urlfetch to make requests; however, all I need is to understand the correct way to store in my database a string in which ’ has been fixed back to '.
I am using BeautifulSoup to navigate the page, Google App Engine's urlfetch to make requests
Are you feeding the encoding from the Content-Type HTTP header into BeautifulSoup?
If an HTML page has both a Content-Type header and a meta tag, the header should ‘win’, so if you're only taking the meta tag you may get the wrong encoding.
Otherwise, you could either feed the fixed encoding 'utf-8' into BeautifulSoup, or fix up each string individually.
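A hedged sketch of the first option with BeautifulSoup 4 (the 3.x series spelt the argument fromEncoding); the markup is a made-up utf-8 snippet standing in for what urlfetch returns:
from bs4 import BeautifulSoup

html_bytes = '<p>Nell\xe2\x80\x99ambito</p>'              # utf-8 bytes from the server
soup = BeautifulSoup(html_bytes, from_encoding='utf-8')   # trust this, not the meta tag
print repr(soup.p.string)                                 # u'Nell\u2019ambito'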
Annoying note: it's not actually ISO-8859-1. When web pages say ISO-8859-1, browsers actually take it to mean Windows code page 1252, which is similar to 8859-1 but not the same. The € in the garbled sequence would seem to indicate cp1252, because that character is not present in 8859-1.
u'Nell’ambito'.encode('cp1252').decode('utf-8')
If the content is encoded inconsistently with some UTF-8 and some cp1252 on the same page (typically due to poor database content handling), this would be the only way to recover it, catching UnicodeError and returning the original string when it wouldn't transcode.
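A hedged sketch of that per-string fix-up (the helper name is made up, not from any library):
def fix_mojibake(s):
    """Re-decode text that was utf-8 but got read as cp1252."""
    try:
        return s.encode('cp1252').decode('utf-8')
    except UnicodeError:
        return s    # already clean, or genuinely not recoverable

print repr(fix_mojibake(u'Nell\xe2\u20ac\u2122ambito'))   # u'Nell\u2019ambito'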

Peter Piper piped a Python program - and lost all his unicode characters

I have a Python script that loads a web page using urllib2.urlopen, does some various magic, and spits out the results using print. We then run the program on Windows like so:
python program.py > output.htm
Here's the problem:
The urlopen reads data from an IIS web server which outputs UTF8. It spits out this same data to the output, however certain characters (such as the long hyphen that Word always inserts for you against your will because it's smarter than you) get garbled and end up like – instead.
Upon further investigation, I noticed even though the web server spits out UTF8 data, the output.htm file is encoded with the ISO-8859-1 character set.
My questions:
When you redirect a Python program to an output file on Windows, does it always use this character set?
If so, is there any way to change that behavior?
If not, is there a workaround? I suppose I could just pass in output.htm as a command line parameter and write to that file instead of the screen, but I'd have to redo a whole bunch of logic in my program.
Thanks for any help!
UPDATE:
At the top of output.htm I added:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
However, it makes no difference. The characters are still garbled. If I manually switch over to UTF-8 in Firefox, the file displays correctly. Both IE and FF think this file is Western ISO even though it is clearly not.
From your comments and question update it seems that the data is correctly encoded in UTF-8. This means you just need to tell your browser it's UTF-8, either by using a BOM, or better, by adding encoding information to your HTML document:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
You really shouldn't use an XML declaration if the document is not valid XML.
The best and most reliable way would be to serve the file via HTTP and set the Content-Type: header appropriately.
When you pipe a Python program to an output file on Windows, does it always use this character set?
The default encoding is what gets used when unicode output goes to a pipe or file. On my machine:
In [5]: sys.getdefaultencoding()
Out[5]: 'ascii'
If not, is there a workaround?
import sys
try:
    sys.setdefaultencoding('utf-8')
except AttributeError:
    # site.py removes setdefaultencoding at startup; reloading sys brings it back
    sys = reload(sys)
    sys.setdefaultencoding('utf-8')
Now all output is encoded to 'utf-8'.
I think the correct way to handle this situation without having to "redo a whole bunch of logic" is to decode all data from your internet source (from the server or page encoding) to unicode, and then to use the workaround shown above to set the default encoding to utf-8.
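A hedged alternative sketch that avoids touching the default encoding at all: wrap sys.stdout in a utf-8 writer, so any unicode the program prints leaves it as utf-8 bytes even when redirected to a file (Python 2):
import codecs
import sys

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print u'long dash: \u2013'    # lands in output.htm as the utf-8 bytes e2 80 93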
Most programs under Windows will assume that you're using the default Windows encoding, which will be Windows-1252 (commonly mislabelled ISO-8859-1) for an English installation. This goes for the command window output as well. There's no way to set the default encoding to UTF-8, unfortunately - there's a code page defined for it, but it's not well supported.
Some editors will recognize any BOM characters at the start of the file and switch to UTF-8, but that's not guaranteed.
If you're generating HTML you should include the proper charset tag; then the browser will interpret it properly.

Robust way to put contents of any arbitrary text file in the database (using Django/Python)?

As part of my Django app, I have to get the contents of a text file which a user uploads (which could be any charset) and save it to my DB. I keep running into issues (like having to remove UTF8's BOM manually, or having to figure out how to account for non-printable characters, or having to figure out how to make all unicode characters work - not just Latin ones, etc.) and each of these issues requires its own hack.
Is there a robust way to do this that doesn't require each of these case-by-case fixes? Right now I'm just using file.read() to get the contents, then doing all of those workarounds to clean the contents, and then using .save() to save it to the DB (I have a model for this).
What else can I be doing?
It causes some overhead, but you could base64-encode the entire string before persisting it to the db. Then no escaping is required.
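A minimal Python 2 sketch of the base64 idea (the bytes stand in for whatever the user uploaded, BOM and all):
import base64

raw = '\xef\xbb\xbfline one\r\nline two'   # arbitrary uploaded bytes, utf-8 BOM included
encoded = base64.b64encode(raw)            # plain ascii text, safe for any text column
# store `encoded` in your model field and .save(); when reading back:
assert base64.b64decode(encoded) == raw    # the original bytes come back untouched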
If you want to explicitly steer away from any issues with encoding and just see files as bunches of binary data (not strings of text in a specific encoding) you might want to use your database's binary format.
For MySQL this is BINARY and VARBINARY: http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html
For a deeper understanding of unicode & utf-8 issues (recommended) this is a nice read on the subject:
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF

Encode East Asian languages using Python

This may not really be a Python-related question, but pertains to language encoding in general. I'm mining tweets from Twitter, and it appears that there is a large Japanese user community (with messages in Japanese). When I tried encoding the tweets for an XML file I used utf-8, e.g. tweet = tweet.encode('utf-8'), and none of the Japanese tweets appeared as they should have. The question I am posing is: how should I have encoded them? What was my mistake? If I were to store the data in a CSV, what encoding scheme would I use in that case?
Normally you would query the data source for what encoding the data is in. Having said that, Shift-JIS is quite a popular encoding for Japanese text.
>>> u'あいうえお'.encode('shift-jis')
'\x82\xa0\x82\xa2\x82\xa4\x82\xa6\x82\xa8'
There should be a way to query the encoding of the tweets when read from Twitter. You then decode them to Unicode as you read them into your program, then encode them when you write them back out to an XML file. Chinese, for example, might be using gbk encoding:
import codecs

# decode the incoming bytes from their source encoding into unicode
unicode_data = data.decode('gbk')

# codecs.open encodes the unicode back to utf-8 as it is written out
f = codecs.open('out.xml', 'w', 'utf-8')
f.write(unicode_data)
f.close()
