Peter Piper piped a Python program - and lost all his unicode characters

I have a Python script that loads a web page using urllib2.urlopen, does some magic, and spits out the results using print. We then run the program on Windows like so:
python program.py > output.htm
Here's the problem:
The urlopen reads data from an IIS web server which outputs UTF-8. It spits out this same data to the output; however, certain characters (such as the long hyphen that Word always inserts for you against your will because it's smarter than you) get garbled and end up as – instead.
Upon further investigation, I noticed that even though the web server spits out UTF-8 data, the output.htm file is encoded with the ISO-8859-1 character set.
My questions:
When you redirect a Python program to an output file on Windows, does it always use this character set?
If so, is there any way to change that behavior?
If not, is there a workaround? I suppose I could just pass in output.htm as a command line parameter and write to that file instead of the screen, but I'd have to redo a whole bunch of logic in my program.
Thanks for any help!
UPDATE:
At the top of output.htm I added:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
However, it makes no difference. The characters are still garbled. If I manually switch over to UTF-8 in Firefox, the file displays correctly. Both IE and FF think this file is Western ISO even though it is clearly not.

From your comments and question update it seems that the data is correctly encoded in UTF-8. This means you just need to tell your browser it's UTF-8, either by using a BOM, or better, by adding encoding information to your HTML document:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
You really shouldn't use an XML declaration if the document is not valid XML.
The best and most reliable way would be to serve the file via HTTP and set the Content-Type header appropriately.
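As a minimal sketch of that approach with the Python 2 standard library (the file name, port, and handler class are assumptions, not part of the original setup):
import BaseHTTPServer

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        # Declaring the charset in the Content-Type header means the
        # browser never has to guess
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.end_headers()
        self.wfile.write(open('output.htm', 'rb').read())

BaseHTTPServer.HTTPServer(('', 8000), Handler).serve_forever()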

When you redirect a Python program to an output file on Windows, does it always use this character set?
It uses the default encoding when output goes to a pipe or file. On my machine:
In [5]: sys.getdefaultencoding()
Out[5]: 'ascii'
If not, is there a workaround?
import sys
try:
    sys.setdefaultencoding('utf-8')
except AttributeError:
    # site.py deletes sys.setdefaultencoding at startup; reload restores it
    reload(sys)
    sys.setdefaultencoding('utf-8')
Now all output is encoded to 'utf-8'.
I think the correct way to handle this situation without having to "redo a whole bunch of logic" is to decode all data from your internet source (from the server or page encoding) to unicode, and then use the workaround shown above to set the default encoding to utf-8.
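For example, a minimal sketch of that decode-at-the-boundary step (the URL and the assumption that the server sends UTF-8 are mine):
import urllib2

response = urllib2.urlopen('http://example.com/page')  # assumed URL
raw = response.read()         # bytestring exactly as the server sent it
page = raw.decode('utf-8')    # decode to unicode as early as possible
print page                    # encoded on output via the default encoding set above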

Most programs under Windows will assume that you're using the default Windows encoding, which will be CP1252 (very close to ISO-8859-1) for an English installation. This goes for the command window output as well. There's no way to set the default encoding to UTF-8, unfortunately - there's a code page defined for it, but it's not well supported.
Some editors will recognize a BOM at the start of the file and switch to UTF-8, but that's not guaranteed.
If you're generating HTML you should include the proper charset tag; then the browser will interpret it properly.
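If you do rely on a BOM, a minimal sketch (output.htm is the file from the question; the HTML content here is just a stand-in):
import codecs

html = u'<p>an en dash: \u2013</p>'   # stand-in content
f = open('output.htm', 'wb')
f.write(codecs.BOM_UTF8)              # some editors detect UTF-8 from this
f.write(html.encode('utf-8'))
f.close()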

Related

Python encoding character stress

I have some scraped data that I have output into a html file held locally as a 'raw' version before I do some data manipulation.
The issue is that when I process the website, I have trouble dealing with a "'" character.
After much research I am getting to the end of my tether. I have seen much on that apostrophe causing issues, I have tried many versions of encoding and decoding, chardet etc and still cannot get it to work.
A word in a few tables is: CA’BELLAVISTA
When I run the script, the IDE screen prints it correctly once I get the right encoding/decoding pattern; however, when I view the outputted HTML file I get CA\x92BELLAVISTA every time.
The script is simply a urllib.response.read() then encoding.
Is it the web browser doing it, or is the script genuinely not getting the correct character?
The next step involves me loading in the HTML file for further manipulation and output to JSON/csv so I thought nailing the encoding on html file output would be the best option.
I think it's an ISO-8859-1/Latin1 charset, although that seems to change on the odd webpage.
I hope I'm doing the correct thing in trying to put it into UTF-8.
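For what it's worth, \x92 is the right single quotation mark in Windows code page 1252 (it isn't a printable character in true Latin-1), so a sketch of the conversion described above might look like:
raw = 'CA\x92BELLAVISTA'        # bytes as scraped from the page
text = raw.decode('cp1252')     # \x92 -> U+2019, the curly apostrophe
print text.encode('utf-8')      # CA’BELLAVISTA with a proper curly quote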

html DOM: good test webpages to test encoding/decoding

What I'm doing is:
via javascript, reading the DOM of webpage
converting to json string
sending to python as ajax
in Python, json decoding the string into object
What I want is for any text that is part of the json to be in unicode to avoid any character issues. I used to use beautifulsoup for this:
from bs4 import *
from bs4.dammit import UnicodeDammit
text_unicode = UnicodeDammit(text, [None, None], "html", True).unicode_markup
But that doesn't work with the json string. Running the string through UnicodeDammit causes an error when I try to json decode it.
The thing is, I'm not even sure that collecting the DOM doesn't handle this issue automatically.
For starters, I would therefore like a series of test webpages to test this. Where one is encoded with utf-8, another with something else, etc. And that uses characters that will look wrong if, for example, you think it's utf-8 but it's not. Note that I don't even bother considering the webpage's stated encoding. This is too often wrong.
You are trying to solve a problem that does not exist.
The browser is responsible for detecting and handling the web page encoding. It'll determine the correct encoding based on the server headers, meta tags in the HTML page and plain guessing if needed. The DOM gives you Unicode data.
JSON handles Unicode data; sending JSON data to your Python process sends appropriately encoded byte data that any decent JSON library will turn back into Unicode values for you. The Python json module is such a library.
Just load the data from your JavaScript script with the json.load() or json.loads() functions as is. Your browser will already have used the correct encoding (most likely UTF-8), and the Python json module will decode any of the standard encodings used without additional configuration or handling.
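A minimal sketch of that round trip (the payload is a stand-in for whatever the AJAX request delivers):
import json

payload = '{"title": "Caf\\u00e9"}'   # bytes from the browser
data = json.loads(payload)            # string values come back as unicode
print type(data['title']), data['title'].encode('utf-8')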

Scraping a website whose encoding is iso-8859-1 instead of utf-8: how do I store the correct unicode in my database?

I'd like to use Python to scrape a website that is full of horrible problems, one being the wrong encoding declared at the top:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
This is wrong because the page is full of occurrences like the following:
Nell’ambito
instead of
Nell'ambito (please notice ’ replaces ')
If I understand correctly, this is happening because utf-8 bytes (probably the database encoding) are interpreted as iso-8859-1 bytes (forced by the charset in the meta tag).
I found some initial explanation at this link http://www.i18nqa.com/debug/utf8-debug.html
I am using BeautifulSoup to navigate the page and Google App Engine's urlfetch to make requests; however, all I need is to understand the correct way to store in my database a string that fixes ’ by turning it back into '.
I am using BeautifulSoup to navigate the page, Google App Engine's urlfetch to make requests
Are you feeding the encoding from the Content-Type HTTP header into BeautifulSoup?
If an HTML page has both a Content-Type header and a meta tag, the header should ‘win’, so if you're only taking the meta tag you may get the wrong encoding.
Otherwise, you could either feed the fixed encoding 'utf-8' into BeautifulSoup, or fix up each string individually.
Annoying note: it's not actually ISO-8859-1. When web pages say ISO-8859-1, browsers actually take it to mean Windows code page 1252, which is similar to 8859-1 but not the same. The € would seem to indicate cp1252 because it's not present in 8859-1.
u'Nell’ambito'.encode('cp1252').decode('utf-8')
If the content is encoded inconsistently with some UTF-8 and some cp1252 on the same page (typically due to poor database content handling), this would be the only way to recover it, catching UnicodeError and returning the original string when it wouldn't transcode.
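A sketch of that recovery step with the suggested error handling (the function name is mine, not from any library):
def fix_mojibake(s):
    # Undo a cp1252 mis-decode: back to bytes, then decode as UTF-8;
    # return the string untouched when it won't transcode
    try:
        return s.encode('cp1252').decode('utf-8')
    except UnicodeError:
        return s

print fix_mojibake(u'Nell\u00e2\u20ac\u2122ambito')  # -> Nell’ambito (U+2019)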

Special characters in output are like �

I have some .txt files which include Turkish characters. I prepared some HTML and want to include the texts from my txt files. The process is successful, but the HTML files made by Python have character problems (special characters look like this: �).
I have tried adding u before strings in the Python code, but it did not work.
The txt files are made by Python; actually they are my blog entries that I fetched using urllib. They don't have character problems themselves.
Thank you for your answers.
When you serve the content to a web browser, you need to tell it what encoding the file is in. Ideally, you should send a Content-type: HTTP header in the response with something like text/plain; charset=utf-8, where "utf-8" is replaced by whatever encoding you're actually using if it's not utf-8.
Your browser may also need to be set to use a unicode-aware font for displaying text files; if it uses a font that doesn't have the necessary glyphs, obviously it can't display them.
I think it isn't a Python question. Have you specified the document's character encoding?
You have to supply your HTML with something like the following:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Replace UTF-8 with the encoding used in the .txt file.
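Putting that together, a sketch that reads the .txt file and writes HTML declaring the same charset (the file names and the assumption that the .txt file is UTF-8 are mine):
import codecs

# Read the blog entry, assuming the .txt file is UTF-8
f = codecs.open('entry.txt', 'r', encoding='utf-8')
body = f.read()
f.close()

html = (u'<html><head>'
        u'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        u'</head><body>%s</body></html>' % body)

# Write the page in the same encoding the meta tag declares
out = codecs.open('entry.html', 'w', encoding='utf-8')
out.write(html)
out.close()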

Python Encoding Issue

I am really lost in all the encoding/decoding issues with Python. Having read quite a few docs about how to handle incoming text properly, I still have issues with a few languages, like Korean. Anyhow, here is what I am doing.
korean_text = korean_text.encode('utf-8', 'ignore')
korean_text = unicode(korean_text, 'utf-8')
I save the above data to the database, which goes through fine.
Later, when I need to display the data, I fetch the content from the db and do the following:
korean_text = korean_text.encode( 'utf-8' )
print korean_text
And all I see is '???' echoed in the browser. Can someone please let me know the right way to save and display the above data?
Thanks
Even having read some docs, you seem to be confused on how unicode works.
Unicode is not an encoding. Unicode is the absence of encodings.
utf-8 is not unicode. utf-8 is an encoding.
You decode utf-8 bytestrings to get unicode. You encode unicode using an encoding, say, utf-8, to get an encoded bytestring.
Only bytestrings can be saved to disk or a database, sent over a network, printed on a printer or displayed on a screen. Unicode only exists inside your code.
The good practice is to decode everything you get as early as possible, work with it decoded, as unicode, in all your code, and then encode it as late as possible, when the text is ready to leave your program, to screen, database or network.
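A tiny sketch of that decode-early/encode-late pattern (the byte string stands in for whatever arrives from the browser or database):
raw = '\xed\x95\x9c\xea\xb8\x80'    # UTF-8 bytes for the Korean word 한글
text = raw.decode('utf-8')          # decode at the boundary -> unicode
assert isinstance(text, unicode)    # work with unicode everywhere inside
output = text.encode('utf-8')       # encode only when the text leaves
print output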
Now for your problem:
If you have a text that came from the browser, say, from a form, then it is encoded. It is a bytestring. It is not unicode.
You must then decode it to get unicode. Decode it using the encoding the browser used to encode. The correct encoding comes from the browser itself, in the correct HTTP REQUEST header.
Don't use 'ignore' when decoding. Since the browser said which encoding it is using, you shouldn't get any errors. Using 'ignore' means you will hide a bug if there is one.
Perhaps your web framework of choice already does that. I know that django, pylons, werkzeug, cherrypy all do that. In that case you already get unicode.
Now that you have a decoded unicode string, you can encode it using whatever encoding you like to store on the database. utf-8 is a good choice, since it can encode all unicode codepoints.
When you retrieve the data from the database, decode it using the same encoding you used to store it. And then encode it using the encoding you want to use on the page - the one declared in the html meta header <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>. If the encoding is the same used on the previous step, you can skip the decode/reencode since it is already encoded in utf-8.
If you see ???, then the data is being lost in one of the steps above. To know exactly where, more information is needed.
Read through this post about handling Unicode in Python.
You basically want to be doing these things:
.encode() text to a particular encoding (such as utf-8) before sending it to the database.
.decode() text back to unicode (from your encoding) when reading it from the database
The problem is most certainly (especially if other non-ASCII characters appear to work fine) that your browser or OS doesn't have the right fonts to display Korean text, or that the default font used by your browser doesn't support Korean. Try to choose another font until it works.
