Python MySQL CSV export to json strange encoding

Python MySQL CSV export to json strange encoding - python

I received a csv file exported from a MySQL database (I think the encoding is latin1 since the language is spanish). Unfortunately the encoding is wrong and I cannot process it at all. If I use file:
$ file -I file.csv
file.csv: text/plain; charset=unknown-8bit
I have tried to read the file in python and convert it to utf-8 like:
r.decode('latin-1').encode("utf-8")
or using mysql_latin1_codec:
r.decode('mysql_latin1').encode('UTF-8')
I am trying to transform the data into json objects. The error comes when I save the file:
'UnicodeEncodeError: 'ascii' codec can't encode characters in position'
Do you know how can I convert it to normal utf-8 chars? Or how can I convert data to a valid json? Thanks!!

I got really good results by using pandas dataframe from Continuum Analytics.
You coud do something like:
import pandas as pd
from pandas import *
con='Your database connection credentials user, password, host, database to use'
data=pd.read_sql_query('SELECT * FROM YOUR TABLE',conn=con)
Then you could do:
data.to_csv('path_with_file_name')
or to convert to JSON:
data.to_json(orient='records')
or if you prefer to customize your json format see the documentation here:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html

Have you tried using the codecs module?:
import codecs
....
codecs.EncodedFile(r, 'latin1').reader.read()
I remember having a similar issue a while back and the answer was something to do with how encoding was done prior to Python 3. Codecs seems to handle this problem relatively elegantly.
As coder mentioned in the question comments, it's difficult to pinpoint the problem without being able to reproduce it so I may be barking up the wrong tree.

You probably have two problems. But let's back off... We can't tell whether the text was imported incorrectly, exported incorrectly, or merely displayed in a goofy way.
First, I am going to discuss "importing"...
Do not try to alter the encoding. Instead live with the encoding. But first, figure out what the encoding is. It could be latin1 or it could be utf8. (Or any of lots of less likely charsets.)
Find out the hex for the incoming file. In Python, the code is something like this for dumping hex (etc) for string u:
for i, c in enumerate(u):
print i, '%04x' % ord(c), unicodedata.category(c),
print unicodedata.name(c)
You can go here to see a list of hex values for all the latin1 characters, together with the utf8 hex. For example, ó is latin1 F3 or utf8 C2B3.
Now, armed with knowing the encoding, tell MySQL that.
LOAD DATA INFILE ...
...
CHARACTER SET utf8 -- or latin1
...;
Meanwhile, it does not matter what CHARACTER SET ... the table or column is defined to be; mysql will transcode if necessary. All Spanish characters are available in latin1 and utf8.
Go to this Q&A .
I suggested that you have two errors, one is the "black diamond" case mentioned there; there other is something else. But... Follow the "Best Practice" mentioned.
Back to you question of "exporting"...
Again, you need to check the hex of the output file. Again it does not matter whether it is latin1 or utf8. However... If the hex is C383C2B3 for simply ó, you have "double encoding". If you have that, check to see that you have removed any manual conversion function calls, and simply told MySQL what's what.
Here are some more utf8+Python tips you might need.
If you need more help, follow the text step-by-step. Show us the code used to move/convert it at each step, and show us the HEX at each step.

Related

Character encoding and decoding in Python with MySQL

For query:
SHOW VARIABLES LIKE 'char%';
MySQL Database returns:
character_set_client latin1
character_set_connection latin1
character_set_database latin1
character_set_filesystem binary
character_set_results latin1
character_set_server latin1
character_set_system utf8
character_sets_dir /usr/local/mysql-5.7.27-macos10.14-x86_64/share/charsets/
In my Python script:
conn = get_database_connection()
conn.setdecoding(pyodbc.SQL_CHAR, encoding='latin1')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='latin1')
For one of the columns that has following value:
N’a pas
Python returns:
N?a pas
Between N and a, There is a star shaped question-mark. How do I read it as is? What's the best way to handle it? I have been reading about converting my db to utf-8 but that seems like a long shot with a good chance of breaking other things. Is there a more efficient way to do it?
At some of the places in code, I have done :
value = value.encode('utf-8', 'ignore').decode('utf-8')
to handle utf-8 data like accented characters but apostrophe did not get handled with the same and I ended up with ? instead of '

Converting the database to UTF-8 is better for the long run, but risky because you may break other things like you say. What you can do is change the database connection encoding to UTF-8. That way you get UTF-8 encoded strings out of the database, without having changed how the data is actually stored.
conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf8')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf8')
If that seems too risky, but you could consider having two separate database connections, the original and one in utf8, and migrate the app to using utf8 little by little, as you have time to test.
If even that seems too risky, maybe try using a character encoding that's more similar to mysql's version of latin1. MySQL's "latin1" is actually an extended version of cp1252 encoding, which itself is a Microsoft extension of the "standard latin1" that's used in Python (among others).
conn.setdecoding(pyodbc.SQL_CHAR, encoding='cp1252')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='cp1252')

Don't use any form of encoding/decoding; it only complicates your code and hides more errors. In fact, you may be trying to "make two wrongs make a right".
Go with utf8 (or utf8mb4).
Notes on "question mark": Trouble with UTF-8 characters; what I see is not what I stored
Notes on Python: http://mysql.rjweb.org/doc.php/charcoll#python

python - how to convert html string to utf-8? Getting UnicodeDecodeError errors

I have a script thats looping through a database and doing some beautifulsoup processing on the string along with replacing some text with other text, etc.
This works 100% most of the time, however some html blobs seems to contain unicode text which breaks the script with the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 112: ordinal not in range(128)
I'm not sure what to do in this case, does anyone know of a module / function to force all text in the string to be a standardized utf-8 or something?
All the html blobs in the database came from feedparser (downloading rss feeds, storing in db).

Before you do any further processing with your string variable:
clean_str = unicode(str_var_with_strange_coding, errors='ignore')
The messed up characters are skipped. Not elegant, as you don't try to restore any maybe meaningful values, but effective.

Since you don't want to show us your code, I'm going to give a general answer that hopefully helps you find the problem.
When you first get the data out of the database and fetch it with fetchone, you need to convert it into a unicode object. It is good practice to do this as soon as you have your variable, and then re-encode it only when you output it.
db = MySQLdb.connect()
cur = db.cursor()
cur.execute("SELECT col FROM the_table LIMIT 10")
xml = cur.fetchone()[0].decode('utf-8') # Or whatever encoding the text is in, though we're pretty sure it's utf-8. You might use chardet
After you run xml through BeautifulSoup, you might encode the string again if it is being saved into a file or you might just leave it as a Unicode object if you are re-inserting it into the database.

Make sure you really understand what is the difference between unicode and UTF-8 and that it is not the same (what is a surprise for many). That is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
What is encoding of your DB? Is it really UTF-8 or you only assume that it is? If it contains blobs with with random encodings, then you have problem, because you cannot guess the encoding. When you read from the database, then decode the blob to unicode and use unicode later in your code.
But let assume your base is UTF-8. Then you should use unicode everywhere - decode early, encode late. Use unicode everywhere inside you program, and only decode/encode when you read from or write to the database, display, write to file etc.
Unicode and encoding is a bit pain in Python 2.x, fortunately in python 3 all text is unicode
Regarding BeautifulSoup, use the latest version 4.

Well after a couple more hours googling, I finally came across a solution that eliminated all decode errors. I'm still fairly new to python (heavy php background) and didn't understand character encoding.
In my code I had a .decode('utf-8') and after that did some .replace(str(beatiful_soup_tag),'') statements. The solution ended up being so simple as to change all str() to unicode(). After that, not a single issue.
Answer found on:
http://ubuntuforums.org/showthread.php?t=1212933
I sincerely apologize to the commenters who requested I post the code, what I thought was rock solid and not the issue was quite the opposite and I'm sure they would have caught the issue right away! I'll not make that mistake again! :)

decode/encode problems

I currently have serious problems with coding/encoding under Linux (Ubuntu). I never needed to deal with that before, so I don't have any idea why this actually doesn't work!
I'm parsing *.desktop files from /usr/share/applications/ and extracting information which is shown in the Web browser via a HTTPServer. I'm using jinja2 for templating.
First, I received UnicodeDecodeError at the call to jinja2.Template.render() which said that
utf-8 cannot decode character XXX at position YY [...]
So I have made all values that come from my appfind-module (which parses the *.desktop files) returning only unicode-strings.
The problem at this place was solved so far, but at some point I am writing a string returned by a function to the BaseHTTPServer.BaseHTTTPRequestHandler.wfile slot, and I can't get this error fixed, no matter what encoding I use.
At this point, the string that is written to wfile comes from jinja2.Template.render() which, afaik, returns a unicode object.
The bizarre part is, that it is working on my Ubuntu 12.04 LTS but not on my friend's Ubuntu 11.04 LTS. However, that might not be the reason. He has a lot more applications and maybe they do use encodings in their *.desktop files that raise the error.
However, I properly checked for the encoding in the *.desktop files:
data = dict(parser.items('Desktop Entry'))
try:
encoding = data.get('encoding', 'utf-8')
result = {
'name': data['name'].decode(encoding),
'exec': DKENTRY_EXECREPL.sub('', data['exec']).decode(encoding),
'type': data['type'].decode(encoding),
'version': float(data.get('version', 1.0)),
'encoding': encoding,
'comment': data.get('comment', '').decode(encoding) or None,
'categories': _filter_bool(data.get('categories', '').
decode(encoding).split(';')),
'mimetypes': _filter_bool(data.get('mimetype', '').
decode(encoding).split(';')),
}
# ...
Can someone please enlighten me about how I can fix this error? I have read on an answer on SO that I should use unicode() always, but that would be so much pain to implemented, and I don't think it would fix the problem when writing to wfile?
Thanks,
Niklas

This is probably obvious, but anyway: wfile is an ordinary byte stream: everything written must be unicode.encode():ed when written to it.
Reading OP, it is not clear to me what, exactly is afoot. However, there are some tricks that may help you, that I have found to be helpful to debug encoding problems. I appologize in advance if this is stuff you have long since transcended.
cat -v on a file will output all non-ascii characters as '^X' which is the only fool-proof way I have found to decide what encoding a file really has. UTF-8 non-ascii characters are multi-byte. That means that they will be sequences of more than one '^'-entry by cat -v.
Shell environment (LC_ALL, et al) is in my experience the most common cause of problems. Make sure you have a system that has locales with both UTF-8 and e.g. latin-1 available. Always set your LC_ALL to a locale that explicitly names an encoding, e.g. LC_ALL=sv_SE.iso88591.
In bash and zsh, you can run a command with specific environment changes for that command, like so:
$ LC_ALL=sv_SE.utf8 python ./foo.py
This makes it a lot easier to test than having to export different locales, and you won't pollute the shell.
Don't assume that you have unicode strings internally. Write assert statements that verify that strings are unicode.
assert isinstance(foo, unicode)
Learn to recognize mangled/misrepresented versions of common characters in the encodings you are working with. E.g. '\xe4' is latin-1 a diaresis and 'Ã¤' are the two UTF-8 bytes, that make up a diaresis, misstakenly represented in latin-1. I have found that knowing this sort of gorp cuts debugging encoding issues considerably.

You need to take a disciplined approach to your byte strings and Unicode strings. This explains it all: Pragmatic Unicode, or, How Do I Stop the Pain?

By default, when python hits an encoding issue with unicde, it throws an error. However, this behavior can be modified, such as if the error is expected or not important.
Say you are converting between two unicode pages that are supersets of ascii. The both have mostly the same characters, but there is no one-to-one correspondence. Therefore, you would want to ignore errors.
To do so, use the errors variable in the encode function.
mystring = u'This is a test'
print mystring.encode('utf-8', 'ignore')
print mystring.encode('utf-8', 'replace')
print mystring.encode('utf-8', 'xmlcharrefreplace')
print mystring.encode('utf-8', 'backslashreplace')
There are lots of issues with unicode if the wrong encodings are used when reading/writing. Make sure that after you get the unicode string, you convert it to the form of unicode desired by jinja2.
If this doesn't help, could you please add the second error you see, with perhaps a code snippet to clarify what's going on?

Try using .encode(encoding) instead of .decode(encoding) in all its occurences in the snippet.

Working with foreign symbols in python

I'm parsing a JSON feed in Python and it contains this character, causing it not to validate.
Is there a way to handle these symbols? Can they be converted or is they're a tidy way to remove them?
I don't even know what this symbol is called or what causes them, otherwise I would research it myself.
EDIT: Stackover Flow is stripping the character so here:
http://files.getdropbox.com/u/194177/symbol.jpg
It's that [?] symbol in "Classic 80s"

That probably means the text you have is in some sort of encoding, and you need to figure out what encoding, and convert it to Unicode with a thetext.decode('encoding') call.
I not sure, but it could possibly be the [?] character, meaning that the display you have there also doesn't know how to display it. That would probably mean that the data you have is incorrect, and that there is a character in there that doesn't exist in the encoding that you are supposed to use. To handle that you call the decode like this: thetext.decode('encoding', 'ignore'). There are other options than ignore, like "replace", "xmlcharrefreplace" and more.

JSON must be encoded in one of UTF-8, UTF-16, or UTF-32. If a JSON file contains bytes which are illegal in its current encoding, it is garbage.
If you don't know which encoding it's using, you can try parsing using my jsonlib library, which includes an encoding-detector. JSON parsed using jsonlib will be provided to the programmer as Unicode strings, so you don't have to worry about encoding at all.

Reading "raw" Unicode-strings in Python

I am quite new to Python so my question might be silly, but even though reading through a lot of threads I didn't find an answer to my question.
I have a mixed source document which contains html, xml, latex and other textformats and which I try to get into a latex-only format.
Therefore, I have used python to recognise the different commands as regular expresssions and replace them with the adequate latex command. Everything has worked out fine so far.
Now I am left with some "raw-type" Unicode signs, such as the greek letters. Unfortunaltly is just about to much to do it by hand. Therefore, I am looking for a way to do this the smart way too. Is there a way for Python to recognise / read them? And how do I tell python to recognise / read e.g. Pi written as a Greek letter?
A minimal example of the code I use is:
fh = open('SOURCE_DOCUMENT','r')
stuff = fh.read()
fh.close()
new_stuff = re.sub('READ','REPLACE',stuff)
fh = open('LATEX_DOCUMENT','w')
fh.write(new_stuff)
fh.close()
I am not sure whether it is an important information or not, but I am using Python 2.6 running on windows.
I would be really glad, if someone might be able to give me hint, at least where to find the according information or how this might work. Or whether I am completely wrong, and Python can't do this job ...
Many thanks in advance.
Cheers,
Britta

You talk of ``raw'' Unicode strings. What does that mean? Unicode itself is not an encoding, but there are different encodings to store Unicode characters (read this post by Joel).
The open function in Python 3.0 takes an optional encoding argument that lets you specify the encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, have a look at the codecs module, which also provides an open function that allows specifying the encoding of the file.
Edit: alternatively, why not just let those poor characters be, and specify the encoding of your LaTeX file at the top:
\usepackage[utf8]{inputenc}
(I never tried this, but I figure it should work. You may need to replace utf8 by utf8x, though)

Please, first, read this:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, come back and ask questions.

You need to determine the "encoding" of the input document. Unicode can encode millions of characters but files can only story 8-bit values (0-255). So the Unicode text must be encoded in some way.
If the document is XML, it should be in the first line (encoding="..."; "utf-8" is the default if there is no "encoding" field). For HTML, look for "charset".
If all else fails, open the document in an editor where you can set the encoding (jEdit, for example). Try them until the text looks right. Then use this value as the encoding parameter for codecs.open() in Python.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.