sqlite remove non utf-8 characters - python

I have an sqlite db that has some crazy non-UTF-8 characters in it and I would like to remove them, but I have no idea how to go about doing it. I googled some stuff and found some people saying to use REGEXP with MySQL, but that threw an error saying REGEXP wasn't recognized.
Here is the error I get:
sqlalchemy.exc.OperationalError: (OperationalError) Could not decode to UTF-8 column 'table_name' with text ...
Thanks for the help

Well, if you really want to shoehorn a rich unicode string into a plain ascii string
(and don't mind some goofs), you could use this:
import unicodedata as ud

def shoehorn_unicode_into_ascii(s):
    # This removes accents, but also other things, like ß‘’“”
    return ud.normalize('NFKD', s).encode('ascii', 'ignore')
For a more complete solution (with somewhat fewer goofs, but requiring a third-party module unidecode), see this answer.
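If you go that route, here is a minimal sketch of the unidecode approach, assuming the third-party unidecode package is installed (pip install unidecode):

from unidecode import unidecode

# Transliterates rather than drops: ß becomes 'ss', curly quotes become straight ones
print unidecode(u'ß‘’“”')   # prints: ss''""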
Really, though, the best solution is to work with unicode data throughout your code as much as possible, and drop to an encoding only when necessary.

django.utils.encoding has a great set of robust unicode encoding and decoding functions.
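For example, a sketch against the Django 1.x API, where smart_unicode and smart_str live in django.utils.encoding (later Django versions renamed them):

from django.utils.encoding import smart_unicode, smart_str

text = smart_unicode('\xc3\xb3')   # bytes -> unicode, decoded as utf-8 by default
data = smart_str(text)             # unicode -> utf-8 encoded bytes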

Related

Python MySQL CSV export to json strange encoding

I received a csv file exported from a MySQL database (I think the encoding is latin1, since the language is Spanish). Unfortunately the encoding is wrong and I cannot process it at all. If I use file:
$ file -I file.csv
file.csv: text/plain; charset=unknown-8bit
I have tried to read the file in python and convert it to utf-8 like:
r.decode('latin-1').encode("utf-8")
or using mysql_latin1_codec:
r.decode('mysql_latin1').encode('UTF-8')
I am trying to transform the data into json objects. The error comes when I save the file:
'UnicodeEncodeError: 'ascii' codec can't encode characters in position'
Do you know how I can convert it to normal utf-8 chars? Or how I can convert the data to valid json? Thanks!!
I got really good results by using a pandas DataFrame.
You could do something like:
import pandas as pd

# con = your database connection: a SQLAlchemy engine or DBAPI connection
# built from your user, password, host and database
data = pd.read_sql_query('SELECT * FROM YOUR_TABLE', con=con)
Then you could do:
data.to_csv('path_with_file_name')
or to convert to JSON:
data.to_json(orient='records')
or if you prefer to customize your json format see the documentation here:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
Have you tried using the codecs module?:
import codecs
....
codecs.EncodedFile(r, 'latin1').reader.read()
I remember having a similar issue a while back, and the answer had something to do with how encoding was done prior to Python 3. The codecs module seems to handle this problem relatively elegantly.
As coder mentioned in the question comments, it's difficult to pinpoint the problem without being able to reproduce it so I may be barking up the wrong tree.
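If it helps, here is a minimal sketch of transcoding a whole file from latin-1 to utf-8 with codecs.open (file names are placeholders):

import codecs

# Read the bytes as latin-1, getting a unicode string back
text = codecs.open('file.csv', 'r', encoding='latin-1').read()

# Write the unicode string back out, encoded as utf-8
codecs.open('file_utf8.csv', 'w', encoding='utf-8').write(text)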
You probably have two problems. But let's back off... We can't tell whether the text was imported incorrectly, exported incorrectly, or merely displayed in a goofy way.
First, I am going to discuss "importing"...
Do not try to alter the encoding. Instead live with the encoding. But first, figure out what the encoding is. It could be latin1 or it could be utf8. (Or any of lots of less likely charsets.)
Find out the hex for the incoming file. In Python, the code is something like this for dumping hex (etc) for a unicode string u:
import unicodedata

for i, c in enumerate(u):
    print i, '%04x' % ord(c), unicodedata.category(c),
    print unicodedata.name(c)
You can go here to see a list of hex values for all the latin1 characters, together with the utf8 hex. For example, ó is latin1 F3 or utf8 C3B3.
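You can verify that in an interactive session (ó is code point U+00F3):

>>> u'\xf3'.encode('latin-1')
'\xf3'
>>> u'\xf3'.encode('utf-8')
'\xc3\xb3'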
Now, armed with knowing the encoding, tell MySQL that.
LOAD DATA INFILE ...
...
CHARACTER SET utf8 -- or latin1
...;
Meanwhile, it does not matter what CHARACTER SET ... the table or column is defined to be; MySQL will transcode if necessary. All Spanish characters are available in latin1 and utf8.
Go to this Q&A.
I suggested that you have two errors: one is the "black diamond" case mentioned there; the other is something else. But... follow the "Best Practice" mentioned.
Back to your question of "exporting"...
Again, you need to check the hex of the output file. Again it does not matter whether it is latin1 or utf8. However... If the hex is C383C2B3 for simply ó, you have "double encoding". If you have that, check to see that you have removed any manual conversion function calls, and simply told MySQL what's what.
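A quick interactive check of how C383C2B3 arises: the correct utf-8 bytes get mistakenly decoded as latin-1 and then encoded to utf-8 a second time.

>>> s = u'\xf3'.encode('utf-8')            # the correct utf-8 bytes for ó
>>> s
'\xc3\xb3'
>>> s.decode('latin-1').encode('utf-8')    # the double-encoded bytes
'\xc3\x83\xc2\xb3'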
Here are some more utf8+Python tips you might need.
If you need more help, follow the text step-by-step. Show us the code used to move/convert it at each step, and show us the HEX at each step.

Python 2.7 - MySQL utf8 encoding

When calling MySQL from Python I prepare it with "SET NAMES 'utf8'", but still something is not right. I get a sequence like this:
å½å®¶1级è¯ä¹¦
when I am supposed to get Chinese characters, which elsewhere have always been handled fine by utf8.
When I look at the utf8 code/sequence it clearly doesn't match the real one. Same sort of format, but different numbers.
Is this erroneous encoding on Python 2.7's end or bad programming on my end? I know Python 3.x has solved these issues but I cannot use the modules I want in later versions.
I know Python 2.7 can actually display Chinese using the print statement, but it is otherwise stored and viewed as encoded bytes. Look:
>>> '你好'
'\xc4\xe3\xba\xc3'
>>> print '\xc4\xe3\xba\xc3'
你好
Ok.. It seems adding
"SET NAMES 'gbk'"
before the MySQL SELECT query did the trick. Now at least the strings from my dictionary and from the SQL database can be compared. It also seems that gbk is often the preferred char format in China.
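If you are connecting through MySQLdb, a sketch of setting this at connect time instead of issuing SET NAMES by hand (connection details are placeholders):

import MySQLdb

conn = MySQLdb.connect(host='localhost', user='me', passwd='secret',
                       db='mydb', charset='gbk', use_unicode=True)
# String columns now come back as unicode objects, decoded from gbk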

python - how to convert html string to utf-8? Getting UnicodeDecodeError errors

I have a script that's looping through a database and doing some BeautifulSoup processing on the string, along with replacing some text with other text, etc.
This works 100% most of the time; however, some html blobs seem to contain unicode text which breaks the script with the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 112: ordinal not in range(128)
I'm not sure what to do in this case, does anyone know of a module / function to force all text in the string to be a standardized utf-8 or something?
All the html blobs in the database came from feedparser (downloading rss feeds, storing in db).
Before you do any further processing with your string variable:
clean_str = unicode(str_var_with_strange_coding, errors='ignore')
The messed-up characters are simply skipped. Not elegant, since it makes no attempt to recover values that might be meaningful, but effective.
Since you don't want to show us your code, I'm going to give a general answer that hopefully helps you find the problem.
When you first get the data out of the database and fetch it with fetchone, you need to convert it into a unicode object. It is good practice to do this as soon as you have your variable, and then re-encode it only when you output it.
import MySQLdb

db = MySQLdb.connect()
cur = db.cursor()
cur.execute("SELECT col FROM the_table LIMIT 10")
# Decode as soon as you have the value. We're pretty sure it's utf-8;
# if not, a library like chardet can help guess the encoding.
xml = cur.fetchone()[0].decode('utf-8')
After you run xml through BeautifulSoup, you might encode the string again if it is being saved into a file or you might just leave it as a Unicode object if you are re-inserting it into the database.
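A rough sketch of that flow, staying in unicode while processing and encoding only at the output boundary (the file name and parsing details are illustrative; this assumes BeautifulSoup 4):

from bs4 import BeautifulSoup

soup = BeautifulSoup(xml)            # xml is already unicode at this point
result = unicode(soup)               # keep working with unicode
with open('out.html', 'w') as f:
    f.write(result.encode('utf-8'))  # encode only when writing out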
Make sure you really understand the difference between unicode and UTF-8; they are not the same thing (which is a surprise for many). Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
What is the encoding of your DB? Is it really UTF-8, or do you only assume that it is? If it contains blobs with random encodings, then you have a problem, because you cannot reliably guess the encoding. When you read from the database, decode the blob to unicode and use unicode from then on in your code.
But let's assume your database is UTF-8. Then you should use unicode everywhere: decode early, encode late. Use unicode everywhere inside your program, and only decode/encode when you read from or write to the database, display output, write to a file, etc.
Unicode and encodings are a bit of a pain in Python 2.x; fortunately, in Python 3 all text is unicode.
Regarding BeautifulSoup, use the latest version 4.
Well, after a couple more hours of googling, I finally came across a solution that eliminated all decode errors. I'm still fairly new to Python (heavy PHP background) and didn't understand character encoding.
In my code I had a .decode('utf-8') and after that did some .replace(str(beautiful_soup_tag),'') statements. The solution ended up being as simple as changing all str() to unicode(). After that, not a single issue.
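Reconstructing the before/after (variable names are illustrative, and this assumes the html blob was decoded to unicode up front):

from bs4 import BeautifulSoup

html = html_blob.decode('utf-8')      # unicode from here on
tag = BeautifulSoup(html).find('a')   # some tag to strip out
# Before: str(tag) returns encoded bytes, and mixing those bytes into a
# unicode .replace() forces an implicit ascii decode -- the error above.
#   html = html.replace(str(tag), '')
# After: unicode(tag) keeps the whole pipeline in unicode:
html = html.replace(unicode(tag), '')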
Answer found on:
http://ubuntuforums.org/showthread.php?t=1212933
I sincerely apologize to the commenters who requested I post the code, what I thought was rock solid and not the issue was quite the opposite and I'm sure they would have caught the issue right away! I'll not make that mistake again! :)

In Django, why do I get problems with utf-8 encoded strings?

I'm a German developer writing web applications for Germans, which means I cannot by any means rely on plain ASCII encoding. At least characters like ä, ö, ü, ß have to be supported.
Fortunately, Django treats ByteStrings as utf-8 encoded by default (as described in the docs). So it should just work, if I add the # -*- coding: utf-8 -*- line to the beginning of each .py file and set the editor encoding, shouldn't it? Well, it does most of the time...
But I seem to miss something when it comes to URLs. Or maybe it has nothing to do with URLs, but so far I haven't noticed any other encoding misbehavior. There are two cases I can remember as examples:
The URL pattern url(r'^([a-z0-9äöüß_\-]+)/$', views.view_page) doesn't recognize URLs containing ä, ö, ü, ß at all. Those characters are simply ignored.
The following code of a view function throws an Exception:
def do_redirect(request, id):
    return redirect('/page/{0}'.format(id))
Where the id argument is captured from the URL like the one in the first example. If I fix the URL pattern (by specifying it as a unicode string) and then access /ä/, I get the Exception
UnicodeEncodeError at /ä/
'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
However, trying the following code for the view function:
def do_redirect(request, id):
    return redirect('/page/' + id)
everything works out fine. That makes me believe the actual problem lies not within Django but derives from Python treating ByteStrings as ASCII. I'm not that much into encoding, but the problem in the second example is obviously the format() method of the string object. So, in the first example it might fail because of the way Python handles regular expressions (though I don't know if Django uses the re module or something else).
My workaround until now is just prefixing the string with u whenever such an error occurs. That's a bad solution since I might easily overlook something. I tried marking every Python string as unicode but that causes other exceptions and is quite ugly.
Does anyone know exactly what the problem is and how to solve it in a pleasant way (i.e. a way that doesn't let your head explode when the code grows bigger)?
Thanks in advance!
EDIT: For my regular expression I found out why the u is needed. Specifying a string as a raw string (r) makes it be interpreted as ASCII. Leaving the r away makes the regex work without the u, but introduces some headaches with backslashes.
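For reference, Python 2 also allows combining both prefixes as ur'', which keeps the raw-string backslash behaviour; a sketch (the exact import path for url() varies across Django 1.x versions, and views is a hypothetical module):

from django.conf.urls import url
from myapp import views

urlpatterns = [
    # ur'' = a unicode raw string in Python 2: the umlauts survive
    # and backslashes keep their raw-string behaviour
    url(ur'^([a-z0-9äöüß_\-]+)/$', views.view_page),
]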
Prefixing your strings with u is the solution.
If it's a problem for you, then it looks like a symptom of a more general problem: you have a lot of magic constants in your code. That is bad (and you already see why). Try to avoid them; for example, you can use a named URL pattern or a view name for redirecting instead of re-typing part of the URL.
If you can't avoid them, turn them into named constants, and place their assignments in one place. Then, you'll see that all of them are prefixed properly, and it will be difficult to overlook it.
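A minimal sketch of the named-pattern suggestion (the pattern name and regex are illustrative):

# urls.py
url(ur'^page/(?P<id>[a-z0-9äöüß_\-]+)/$', views.view_page, name='page-detail'),

# views.py -- redirect by view name instead of re-typing the URL:
from django.shortcuts import redirect

def do_redirect(request, id):
    return redirect('page-detail', id=id)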
In Django 1.4, one of the new features is better support for URL internationalization, including support for translating URLs.
This would go a long way in helping you out, but it doesn't mean you should ignore the other advice, as that is for Python in general and applies to everything, not just Django.

Python and Unicode: How everything should be Unicode

Forgive me if this is a long question:
I have been programming in Python for around six months. Self taught, starting with the Python tutorial and then SO and then just using Google for stuff.
Here is the sad part: no one told me all strings should be Unicode. No, I am not lying or making this up, but where does the tutorial mention it? And most examples I see also just make use of byte strings instead of Unicode strings. I was just browsing and came across this question on SO, which says how every string in Python should be a Unicode string. This pretty much made me cry!
I read that every string in Python 3.0 is Unicode by default, so my questions are for 2.x:
Should I do: print u'Some text' or just print 'Text'?
Everything should be Unicode; does this mean, say I have a tuple:
t = ('First', 'Second'), it should be t = (u'First', u'Second')?
I read that I can do a from __future__ import unicode_literals and then every string will be a Unicode string, but should I do this inside a container also?
When reading/writing to a file, I should use the codecs module. Right? Or should I just use the standard way of reading/writing and encode or decode where required?
If I get the string from say raw_input(), should I convert that to Unicode also?
What is the common approach to handling all of the above issues in 2.x? The from __future__ import unicode_literals statement?
Sorry for being such a noob, but this changes what I have been doing for a long time, so clearly I am confused.
The "always use Unicode" suggestion is primarily to make the transition to Python 3 easier. If you have a lot of non-Unicode string access in your code, it'll take more work to port it.
Also, you shouldn't have to decide on a case-by-case basis whether a string should be stored as Unicode or not. You shouldn't have to change the types of your strings and their very syntax just because you changed their contents, either.
It's also easy to use the wrong string type, leading to code that mostly works, or code which works in Linux but not in Windows, or in one locale but not another. For example, for c in "漢字" in a UTF-8 locale will iterate over each UTF-8 byte (all six of them), not over each character; whether that breaks things depends on what you do with them.
In principle, nothing should break if you use Unicode strings, but things may break if you use regular strings when you shouldn't.
In practice, however, it's a pain to use Unicode strings everywhere in Python 2. codecs.open doesn't pick the correct locale automatically; this fails:
codecs.open("blar.txt", "w").write(u"漢字")
The real answer is:
import locale, codecs
lang, encoding = locale.getdefaultlocale()
codecs.open("blar.txt", "w", encoding).write(u"漢字")
... which is cumbersome, forcing people to make helper functions just to open files. codecs.open should be using the encoding from locale automatically when one isn't specified; Python's failure to make such a simple operation convenient is one of the reasons people generally don't use Unicode everywhere.
Finally, note that Unicode strings are even more critical on Windows in some cases. For example, if you're in a Western locale and you have a file named "漢字", you must use a Unicode string to access it, e.g. os.stat(u"漢字"). It's impossible to access it with a non-Unicode string; it just won't see the file.
So, in principle I'd say the Unicode string recommendation is reasonable, but with the caveat that I don't generally even follow it myself.
No, not every string "should be Unicode". Within your Python code, you know whether a string literal needs to be Unicode or not, so it doesn't make any sense to turn every string literal into a Unicode literal.
But there are cases where you should use Unicode. For example, if you have arbitrary input that is text, use Unicode for it. You will sooner or later find a non-American using it, and he wants to wrîte têxt ås hé is üsed tö. And you'll get problems in that case unless your input and output happen to use the same encoding, which you can't be sure of.
So in short, no, strings shouldn't be Unicode. Text should be. But YMMV.
Specifically:
1) No need to use Unicode here. You know whether that string is ASCII or not.
2) Depends on whether you need to merge those strings with Unicode or not.
3) Both ways work. But do not encode/decode "when required". Decode ASAP, encode as late as possible. Using codecs works well (or io, from Python 2.7; a short sketch follows this list).
4) Yeah.
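A minimal io equivalent of the codecs.open example from the earlier answer (file name reused from there):

import io

# io.open does the decoding/encoding for you in text mode
with io.open('blar.txt', 'w', encoding='utf-8') as f:
    f.write(u'漢字')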
IMHO (my simple rules):
Should I do: print u'Some text' or just print 'Text'?
Everything should be Unicode; does this mean, say I have a tuple:
t = ('First', 'Second'), it should be t = (u'First', u'Second')?
Well, I use unicode literals only when I have some char above ASCII 128:
print 'New York', u'São Paulo'
t = ('New York', u'São Paulo')
When reading/writing to a file, I should use the codecs module. Right? Or should I just use the standard way of reading/writing and encode or decode where required?
If you expect unicode text, use codecs.
If I get the string from say raw_input(), should I convert that to Unicode also?
Only if you expect unicode text that may get transferred to another system with a distinct default encoding (including databases).
EDITED (about mixing unicode and byte strings):
>>> print 'New York', 'to', u'São Paulo'
New York to São Paulo
>>> print 'New York' + ' to ' + u'São Paulo'
New York to São Paulo
>>> print "Côte d'Azur" + ' to ' + u'São Paulo'
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)
>>> print "Côte d'Azur".decode('utf-8') + ' to ' + u'São Paulo'
Côte d'Azur to São Paulo
So if you mix a byte string that contains utf-8 (or other non-ascii chars) with unicode text without explicit conversion, you will have trouble, because the default codec assumes ascii. The other way around seems to be safe. If you follow the rule of writing every string containing non-ascii as a unicode literal, you should be OK.
DISCLAIMER: I live in Brazil where people speak Portuguese, a language with lots of non-ascii chars. My default encoding is always set to 'utf-8'. Your mileage may vary in English/ascii systems.
I’m just adding my personal opinion here. Not as long and elaborate as the other answers, but maybe it can help, too.
print u'Some text' or just print 'Text'?
I’d indeed prefer the first. If you know that you only have Unicode strings, you have one invariant more. Various other languages (C, C++, Perl, PHP, Ruby, Lua, …) sometimes encounter painful problems because of their lack of separation between code unit sequences and integer sequences. I find the approach of strict distinction between them that exists in .NET, Java, Python etc. quite a bit cleaner.
Everything should be Unicode; does this mean, say I have a tuple:
t = ('First', 'Second'), it should be t = (u'First', u'Second')?
Yes.
I read that I can do a from __future__ import unicode_literals and then every string will be a Unicode string, but should I do this inside a container also?
Yes. Future statements apply only to the file where they’re used, so you can use them without interfering with other modules. I generally import all futures in Python 2.x modules to make the transition to 3.x easier.
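A minimal illustration of that per-file scoping:

# module_a.py
from __future__ import unicode_literals
t = ('First', 'Second')   # both elements are unicode objects here

# module_b.py -- no future import, so it is unaffected by module_a
s = 'Second'              # still a plain byte string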
When reading/writing to a file, I should use the codecs module. Right? Or should I just use the standard way of reading/writing and encode or decode where required?
You should use the codecs module because that makes it impossible (or at least very hard) to accidentally write differently-encoded representations to a single file. It is also the way Python 3.x works when you open a file in text mode.
If I get the string from say raw_input(), should I convert that to Unicode also?
I’d say yes to this too: In most cases it’s easier to deal with only one encoding, so I recommend converting to Python Unicode strings as early as possible.
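For example (a sketch; it falls back to utf-8 when stdin reports no encoding, e.g. when input is piped in):

import sys

raw = raw_input('Name: ')                         # a byte string in Python 2
name = raw.decode(sys.stdin.encoding or 'utf-8')  # unicode as early as possible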
What is the common approach to handling all of the above issues in 2.x? The from __future__ import unicode_literals statement?
I don’t know what the common approach is, but I use that statement all the time. I have encountered only very few issues with this approach, and most of them are related to bugs in external libraries; e.g., NumPy sometimes requires byte strings without documenting that.
The fact that you were writing Python code for 6 months before encountering anything about Unicode means that the Python 2.x ASCII default for strings didn't cause you any problems. For a beginner, grasping the ideas of Unicode, code points, and encodings is a hard issue to tackle, so most tutorials naturally bypass it until you get more of a grounding in the fundamentals. That's why in a book like Dive Into Python, Unicode is only mentioned in later chapters.
If you need to support Unicode in your application, I suggest looking at Kumar McMillan's PyCon 2008 talk on Unicode for a list of best practices. It should answer your remaining questions.
1/2) Personally I've never heard of "always use unicode". That seems pretty stupid to me. I guess I understand if you plan to support other languages that need unicode support. But other than that I wouldn't do that, it seems like more of a pain than it's worth.
3) I would just read/write the standard way and encode when necessary.
