Encode East Asian languages using Python

This may not really be a Python-specific question, but it pertains to language encoding in general. I'm mining tweets from Twitter, and it appears that there is a large Japanese user community (with messages in Japanese). When I tried encoding the tweets for an XML file I used UTF-8, e.g. tweet = tweet.encode('utf-8'), and none of the Japanese tweets appeared as they should have. My question is: how should I have encoded them? What was my mistake? If I were to store the data in a CSV instead, what encoding scheme would I use in that case?

Normally you would query the source for the encoding the data is in. Having said that, Shift-JIS is quite a popular encoding for Japanese text:
>>> u'あいうえお'.encode('shift-jis')
'\x82\xa0\x82\xa2\x82\xa4\x82\xa6\x82\xa8'

There should be a way to query the encoding of the tweets when you read them from Twitter. You then decode them to Unicode as you read them into your program, and encode them again when you write them back out to an XML file. Chinese text, for example, might use the gbk encoding:
import codecs
# decode the incoming bytes (here assumed to be gbk) into a unicode object
unicode_data = data.decode('gbk')
# write the unicode text back out as UTF-8
f = codecs.open('out.xml', 'w', 'utf-8')
f.write(unicode_data)
f.close()
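The same decode-then-encode approach covers the CSV case from the question. A minimal sketch (Python 2, matching the examples above; unicode_tweets is a hypothetical list of already-decoded tweets):
import csv
# Python 2's csv module works on bytestrings, so keep the tweets as
# unicode internally and encode to UTF-8 only at the moment of writing
with open('tweets.csv', 'wb') as f:
    writer = csv.writer(f)
    for tweet in unicode_tweets:  # hypothetical list of unicode strings
        writer.writerow([tweet.encode('utf-8')])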

Related

How to remove non-English text from the data in a CSV file

I am cleaning textual data that I scraped from multiple URLs. How can I remove non-English words/symbols from the data in a CSV file?
I saved and read the data using the following code:
To save the data as a CSV file:
df.to_csv("blogdata.csv", encoding = "utf-8")
After saving the data, the CSV file looks as follows, including non-English words and symbols (e.g., '\n\t\t\t', m’, etc.):
The symbols did not appear in the original data, and some of them even show up in data that is in English. Take 'Ross Parker' in the 7th row as an example.
The data saved in the csv file says: ['\n\t\t\t', 'It’s about time I wrote an update on what we’ve been up to over the past few months. We’re about to...
Whereas in the original data scraped from the URL, it shows as follows:
Can anybody explain why this happens and help me solve this issue and clean the non-English data from the file?
Thank you so much in advance!
It looks like pilot error: the data is correct but you are looking at it in a tool which is configured or hard-coded to display the text as Latin-1 (or Windows code page 1252?) even though you saved it as UTF-8.
Some tools - especially on Windows - will do whimsical things with UTF-8 that doesn't have a BOM. Maybe try adding one (and maybe file a bug report if this really helps; the tool should at the very least let you override its default encoding without modifying the input data).
In other words, if the screenshot with the broken data is from Excel, you probably selected a DOS code page (or the horribly mislabelled "ANSI") instead of UTF-8 when it asked how to import this CSV file. Perhaps the best fix would be to devise a workflow which does not involve a spreadsheet.
Or perhaps you used a tool which didn't ask you anything, and tried to "sniff" the data to determine its encoding, and it guessed wrong. Adding an invisible byte sequence called a BOM which is unique to UTF-8 should hopefully allow it to guess right; but this is buggy behavior - you should not be a hostage to its clearly imperfect heuristics. (See also "Bush hid the facts" for a related story.)
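If the broken view is Excel, one low-effort experiment (assuming the file is written with pandas as in the question) is to add that BOM by saving with the utf-8-sig codec instead of plain utf-8:
import pandas as pd
# hypothetical DataFrame standing in for the scraped blog data
df = pd.DataFrame({'text': [u'It\u2019s about time I wrote an update']})
# 'utf-8-sig' prepends a UTF-8 BOM, which lets Excel and similar tools
# detect the encoding instead of falling back to a legacy code page
df.to_csv('blogdata.csv', encoding='utf-8-sig', index=False)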

How to read the encoding format of document if we don't know in which format it is written

When creating an XML document we specify the encoding. But to find that declaration we already have to read the document, and we can't read it correctly without knowing its encoding. How can we determine the encoding if we don't know which encoding the document is written in?
My question doesn't concern only XML (apparently with XML we can simply try all the possible encodings), but applies in general. I know that in Python we specify the encoding as well. And what about other specifications that use a header to declare the encoding?

openERP 7 need to export data in UTF-8 CSV , but how?

I can export a CSV with openERP 7, but it is encoded in ANSI. I would like to have it as a UTF-8 encoded file. How can I achieve this? The default export option in openERP doesn't have any extra options. Which files should be modified? Or is there an app for this? Any help would be appreciated.
Encodings are a complicated thing, and it is difficult to answer an encoding-related question without precise facts. ANSI is not an encoding; I assume you actually mean ASCII. And ASCII itself can be seen as a subset of UTF-8, so technically ASCII is valid UTF-8.
OpenERP 7.0 only exports CSV files in UTF-8, so if you do not get the expected result you are probably facing a different kind of issue, for example:
The original data was imported using a wrong encoding (you can choose the encoding when you import, but again the default is UTF-8), so it is actually corrupted in the database, and OpenERP cannot do anything about it
The CSV file might be exported correctly in UTF-8 but you are opening it with a different encoding (for example on Windows most programs will assume your files are ISO-8859-1/Latin-1/Windows-1252 encoded). Double-check the settings of your program.
If you need more help you'll have to be much more specific: what result do you get (what does the data look like), what did you expect, etc.
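One quick way to check which side is at fault is to read the exported file back in Python and see whether it decodes cleanly as UTF-8 (the file name here is hypothetical):
import codecs
# if this completes without a UnicodeDecodeError, the export really is
# valid UTF-8 and the problem lies with the program used to open it
with codecs.open('export.csv', 'r', 'utf-8') as f:
    f.read()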

Robust way to put contents of any arbitrary text file in the database (using Django/Python)?

As part of my Django app, I have to get the contents of a text file which a user uploads (it could be in any charset) and save it to my DB. I keep running into issues (like having to remove UTF-8's BOM manually, having to figure out how to account for non-printable characters, having to figure out how to make all Unicode characters work - not just Latin ones, etc.), and each of these issues requires its own hack.
Is there a robust way to do this that doesn't require each of these case-by-case fixes? Right now I'm just using file.read() to get the contents, then doing all of those workarounds to clean the contents, and then using .save() to save it to the DB (I have a model for this).
What else can I be doing?
It causes some overhead, but you could base64-encode the entire string before persisting it to the DB. Then no escaping is required.
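A minimal sketch of that idea, assuming uploaded_file is the Django UploadedFile coming from the form (the name is made up):
import base64
raw_bytes = uploaded_file.read()      # arbitrary bytes, any charset
stored = base64.b64encode(raw_bytes)  # ASCII-only text, safe to store in a text column
original = base64.b64decode(stored)   # round-trips back to the exact original bytes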
If you want to explicitly steer away from any issues with encoding and just see files as bunches of binary data (not strings of text in a specific encoding) you might want to use your database's binary format.
For MySQL this is BINARY and VARBINARY: http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html
For a deeper understanding of Unicode and UTF-8 issues (recommended), this is a nice read on the subject:
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
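If your Django version offers it, the binary-column approach above maps to models.BinaryField; a sketch with a made-up model name:
from django.db import models

class UploadedText(models.Model):
    # stores the raw bytes exactly as uploaded: no decoding, no BOM
    # stripping, no escaping concerns
    contents = models.BinaryField()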

Python Encoding Issue

I am really lost in all the encoding/decoding issues with Python. Having read quite a few docs about how to handle incoming text properly, I still have issues with a few languages, like Korean. Anyhow, here is what I am doing.
korean_text = korean_text.encode('utf-8', 'ignore')
korean_text = unicode(korean_text, 'utf-8')
I save the above data to the database, which goes through fine.
Later, when I need to display the data, I fetch the content from the DB and do the following:
korean_text = korean_text.encode( 'utf-8' )
print korean_text
And all I see is '???' echoed in the browser. Can someone please let me know the right way to save and display the above data?
Thanks
Even having read some docs, you seem to be confused on how unicode works.
Unicode is not an encoding. Unicode is the absence of encodings.
utf-8 is not unicode. utf-8 is an encoding.
You decode utf-8 bytestrings to get unicode. You encode unicode using an encoding, say, utf-8, to get an encoded bytestring.
Only bytestrings can be saved to disk or a database, sent over a network, or printed to a printer or screen. Unicode only exists inside your code.
The good practice is to decode everything you get as early as possible, work with it decoded, as unicode, in all your code, and then encode it as late as possible, when the text is ready to leave your program, to screen, database or network.
Now for your problem:
If you have a text that came from the browser, say, from a form, then it is encoded. It is a bytestring. It is not unicode.
You must then decode it to get unicode. Decode it using the encoding the browser used to encode. The correct encoding comes from the browser itself, in the correct HTTP REQUEST header.
Don't use 'ignore' when decoding. Since the browser said which encoding it is using, you shouldn't get any errors. Using 'ignore' means you will hide a bug if there is one.
Perhaps your web framework of choice already does that. I know that django, pylons, werkzeug, cherrypy all do that. In that case you already get unicode.
Now that you have a decoded unicode string, you can encode it using whatever encoding you like to store on the database. utf-8 is a good choice, since it can encode all unicode codepoints.
When you retrieve the data from the database, decode it using the same encoding you used to store it. And then encode it using the encoding you want to use on the page - the one declared in the html meta header <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>. If the encoding is the same used on the previous step, you can skip the decode/reencode since it is already encoded in utf-8.
If you see ??? then the data is being lost on any step above. To know exactly, more information is needed.
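A small sketch of that decode-early/encode-late flow (Python 2, as in the question; the byte literal is just an example of UTF-8 input):
raw = '\xec\x95\x88\xeb\x85\x95'   # bytes as received from the browser (UTF-8 for a Korean greeting)
text = raw.decode('utf-8')         # decode early: work with a unicode object internally
# ... store text.encode('utf-8') in the database, fetch it back, decode again ...
out = text.encode('utf-8')         # encode late, matching the charset in the HTML meta tag
print out                          # displays the Korean text if the terminal/page is UTF-8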
Read through this post about handling Unicode in Python.
You basically want to be doing these things:
.encode() text to a particular encoding (such as utf-8) before sending it to the database.
.decode() text back to unicode (from your encoding) when reading it from the database
The problem is most certainly (especially if other non-ASCII characters appear to work fine) that your browser or OS doesn't have the right fonts to display Korean text, or that the default font used by your browser doesn't support Korean. Try to choose another font until it works.
