Fix UnicodeDecodeError - python

I have the following code. I am using Python 2.7.
import csv
import sqlite3
conn = sqlite3.connect('torrents.db')
c = conn.cursor()
# Create table
c.execute('''DROP TABLE torrents''')
c.execute('''CREATE TABLE IF NOT EXISTS torrents
             (name text, size long, info_hash text, downloads_count long,
              category_id text, seeders long, leechers long)''')
with open('torrents_mini.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter='|')
    for row in spamreader:
        name = unicode(row[0])
        size = row[1]
        info_hash = unicode(row[2])
        downloads_count = row[3]
        category_id = unicode(row[4])
        seeders = row[5]
        leechers = row[6]
        # Triple quotes let the SQL statement span two lines
        c.execute('''INSERT INTO torrents (name, size, info_hash, downloads_count,
                     category_id, seeders, leechers) VALUES (?,?,?,?,?,?,?)''',
                  (name, size, info_hash, downloads_count, category_id, seeders, leechers))
conn.commit()
conn.close()
The error message I receive is
Traceback (most recent call last):
  File "db.py", line 15, in <module>
    name = unicode(row[0])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
If I don't convert to unicode, then the error I get is
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
Adding name = row[0].decode('UTF-8') gives me another error:
Traceback (most recent call last):
  File "db.py", line 27, in <module>
    for row in spamreader:
_csv.Error: line contains NULL byte
The data contained in the csv file is in the following format:
Tha Twilight New Moon DVDrip 2009 XviD-AMiABLE|694554360|2cae2fc76d110f35917d5d069282afd8335bc306|0|movies|0|1
Edit: I finally dropped the attempt and accomplished the task using the sqlite3 command-line tool (it was quite easy).
I do not yet know what caused the errors, but when sqlite3 was importing the said csv file, it kept showing warnings about an "unescaped character", the character being the double quote (").
Thanks to everyone who tried to help.

Your data is not encoded as ASCII. Use the correct codec for your data.
You can tell Python what codec to use with:
unicode(row[0], correct_codec)
or use the str.decode() method:
row[0].decode(correct_codec)
What that correct codec is, we cannot tell you. You'll have to consult whatever you got the file from.
If you cannot figure out what encoding was used, you could use a package like chardet to make an educated guess, but take into account that such a library is not fail-proof.
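As a small illustration of the two calls above, here is a minimal sketch that assumes the data turns out to be UTF-8. That is only an assumption: the byte 0xc3 from the traceback is a common UTF-8 lead byte, but this is not proof. The sample bytes and the helper name decode_field are hypothetical.

```python
# Hypothetical sample: a name field as raw bytes, as Python 2's csv.reader
# would hand it over. 0xc3 0xa9 is the UTF-8 encoding of U+00E9.
raw = b'Caf\xc3\xa9 del Mar DVDrip'


def decode_field(raw_bytes, codec='utf-8'):
    """Decode one CSV field using an explicitly named codec.

    The default of 'utf-8' is an assumption; pass whatever codec the
    file was really written in.
    """
    return raw_bytes.decode(codec)


# Decoding as ASCII reproduces the failure from the question.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Naming the codec explicitly succeeds.
name = decode_field(raw)
```

If the guess is wrong, decoding may either raise another UnicodeDecodeError or silently produce the wrong characters, so it is worth checking a few decoded values by eye.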

Related

Exporting data from a MySQL database using csv

I need a Python script to generate a csv file from my database XXXX. I wrote this script, but something is wrong:
import mysql.connector
import csv
filename=open('test.csv','wb')
c=csv.writer(filename)
cnx = mysql.connector.connect(user='XXXXXXX', password='XXXXX',
                              host='localhost',
                              database='XXXXX')
cursor = cnx.cursor()
query = ("SELECT `Id_Vendeur`, `Nom`, `Prenom`, `email`, `Num_magasin`, `Nom_de_magasin`, `Identifiant_Filiale`, `Groupe_DV`, `drt_Cartes`.`gain` as 'gain', `Date_Distribution`, `Status_Grattage`, `Date_Grattage` FROM `drt_Cartes_Distribuer`,`drt_Agent`,`drt_Magasin`,`drt_Cartes` where `drt_Cartes_Distribuer`.`Id_Vendeur` = `drt_Agent`.`id_agent` AND `Num_magasin` = `drt_Magasin`.`Numero_de_magasin` AND `drt_Cartes_Distribuer`.`Id_Carte` = `drt_Cartes`.`num_carte`")
cursor.execute(query)
for Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage in cursor:
    c.writerow([Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage] )
cursor.close()
filename.close()
cnx.close()
When I execute the query in phpMyAdmin it seems to work very well, but from my shell I get this message:
# python test.py
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    c.writerow([Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage] )
UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 5: ordinal not in range(128)
It looks like you are using the csv module for Python 2.7. Quoting the docs:
Note This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
Your options (choose one of them):
Follow the doc link, go to the Examples section, and modify your code accordingly.
Use a csv package with Unicode support, such as https://pypi.python.org/pypi/unicodecsv
Your data from the database contains more than just ASCII characters. I suggest you use the 'unicodecsv' Python module, as suggested in the answer to this question: How to write UTF-8 in a CSV file
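Another way to follow the docs' advice that "all input should be UTF-8" is to encode each text value yourself before handing the row to csv.writer. The sketch below is only an illustration of that idea for Python 2.7; the helper name encode_row is hypothetical, and the sample row is made up.

```python
import sys

# On Python 2 the text type is `unicode`; the fallback to `str` only exists
# so that this sketch also runs on Python 3.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


def encode_row(row, encoding='utf-8'):
    """Return a copy of row with every text value encoded to bytes.

    Numbers, dates, None and other non-text values pass through unchanged,
    so csv.writer can format them as usual.
    """
    return [value.encode(encoding) if isinstance(value, text_type) else value
            for value in row]


# Hypothetical row as a MySQL cursor might return it.
sample_row = [42, u'No\xebl', None]
encoded_row = encode_row(sample_row)
```

On Python 2.7 the loop body from the question would then become c.writerow(encode_row([...])), and the resulting file is UTF-8 encoded.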

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014'

I'm getting this error UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014'
I'm trying to load lots of news articles into a MySQL database through MySQLdb. However, I'm having difficulty handling non-standard characters: I get hundreds of these errors for all sorts of characters. I can handle them individually using .replace(), but I would like a more complete solution that handles them correctly.
ubuntu#ip-10-0-0-21:~/scripts/work$ python test_db_load_error.py
Traceback (most recent call last):
  File "test_db_load_error.py", line 27, in <module>
    cursor.execute(sql_load)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 157, in execute
    query = query.encode(charset)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014' in position 158: ordinal not in range(256)
My script:
import MySQLdb as mdb
from goose import Goose
import string
import datetime
host = 'rds.amazonaws.com'
user = 'news'
password = 'xxxxxxx'
db_name = 'news_reader'
conn = mdb.connect(host, user, password, db_name)
url = 'http://www.dailymail.co.uk/wires/ap/article-3060183/Andrew-Lesnie-Lord-Rings-cinematographer-dies.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490'
g = Goose()
article = g.extract(url=url)
body = article.cleaned_text
body = body.replace("'","`")
load_date = str(datetime.datetime.now())
summary = article.meta_description
title = article.title
image = article.top_image
sql_load = "insert into articles " \
           " (title,summary,article,image,source,load_date) " \
           " values ('%s','%s','%s','%s','%s','%s');" % \
           (title,summary,body,image,url,load_date)
cursor = conn.cursor()
cursor.execute(sql_load)
#conn.commit()
Any help would be appreciated.
When you create your MySQLdb connection, pass charset='utf8' to it:
conn = mdb.connect(host, user, password, db_name, charset='utf8')
If your database is actually configured for Latin-1, then you cannot store non-Latin-1 characters in it. That includes U+2014, EM DASH.
The ideal solution is to just switch to a database configured for UTF-8. Create the database with a UTF-8 character set, and pass charset='utf8' every time you connect to it. (If you already have existing data, you probably want to use MySQL tools to migrate the old database to a new one, instead of Python code, but the basic idea is the same.)
However, sometimes that isn't possible. Maybe you have other software that can't be updated, requires Latin-1, and needs to share the same database. Or maybe you've mixed Latin-1 text and binary data in ways that can't be programmatically unmixed, or your database is just too huge to migrate, or whatever. In that case, you have two choices:
Destructively convert your strings to Latin-1 before storing and searching. For example, you might want to convert an em dash to -, or to --, or maybe it's not all that important and you can just convert all non-Latin-1 characters to ? (which is faster and simpler).
Come up with an encoding scheme to smuggle non-Latin-1 characters into the database. This means some searches become more complicated, or just can't be done directly in the database.
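The first of those two choices, destructive conversion, can be sketched with nothing more than str.replace and the 'replace' error handler of encode. The sample sentence and the choice of -- as the substitute are assumptions made for illustration only.

```python
# Hypothetical article text containing U+2014 EM DASH, which Latin-1 lacks,
# and U+00E9, which Latin-1 can represent.
text = u'Andrew Lesnie \u2014 cinematographer, caf\xe9'

# Simplest variant: every character Latin-1 cannot represent becomes '?'.
lossy = text.encode('latin-1', 'replace')

# Slightly gentler variant: map the characters you care about to a readable
# substitute first, then let 'replace' catch whatever is left.
mapped = text.replace(u'\u2014', u'--').encode('latin-1', 'replace')
```

Either way the original character is gone for good, which is why this only makes sense when the database really cannot be moved to UTF-8.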
This might be a heavy read, but at least got me started.
http://www.joelonsoftware.com/articles/Unicode.html

Inserting unicode into sqlite?

I am still learning Python, and as a little project I wrote a script that takes the values I have in a text file and inserts them into a sqlite3 database. But some of the names contain weird letters (I guess you would call them non-ASCII), and they generate an error when they come up. Here is my little script (and please tell me if there is any way it could be more Pythonic):
import sqlite3
f = open('complete', 'r')
fList = f.readlines()
conn = sqlite3.connect('tpb')
cur = conn.cursor()
for i in fList:
    exploaded = i.split('|')
    eList = (
        (exploaded[1], exploaded[5])
    )
    cur.execute('INSERT INTO magnets VALUES(?, ?)', eList)
conn.commit()
cur.close()
And it generates this error:
Traceback (most recent call last):
  File "C:\Users\Admin\Desktop\sortinghat.py", line 13, in <module>
    cur.execute('INSERT INTO magnets VALUES(?, ?)', eList)
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
To get the file contents into unicode you need to decode from whichever encoding it is in.
It looks like you're on Windows so a good bet is cp1252.
If you got the file from somewhere else all bets are off.
Once you have the encoding sorted, an easy way to decode is to use the codecs module, e.g.:
import codecs
# ...
with codecs.open('complete', encoding='cp1252') as fin: # or utf-8 or whatever
    for line in fin:
        to_insert = (line.split('|')[1], line.split('|')[5])
        cur.execute('INSERT INTO magnets VALUES (?,?)', to_insert)
conn.commit()
# ...
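For a version that can be run end to end without the original file, here is a self-contained sketch of the same idea. It uses io.open, which accepts an encoding argument on both Python 2.7 and Python 3, a temporary file in place of 'complete', and an in-memory database. The sample line, the column names of the magnets table, and cp1252 itself are all assumptions.

```python
import io
import os
import sqlite3
import tempfile

# Hypothetical line in the pipe-delimited layout from the question;
# fields 1 and 5 are the ones the script inserts.
sample = u'0|Beyonc\xe9 - Live|x|x|x|magnet:?xt=urn:btih:abc\n'

# Write the sample as cp1252 so there is a real encoded file to read back.
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, 'w', encoding='cp1252') as fout:
    fout.write(sample)

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE magnets (name TEXT, link TEXT)')  # assumed schema

# io.open decodes while reading, so sqlite3 only ever sees Unicode strings.
with io.open(path, encoding='cp1252') as fin:
    for line in fin:
        fields = line.rstrip(u'\n').split(u'|')
        cur.execute('INSERT INTO magnets VALUES (?, ?)', (fields[1], fields[5]))
conn.commit()

stored = cur.execute('SELECT name, link FROM magnets').fetchall()
conn.close()
os.remove(path)
```

Splitting each line once and stripping the trailing newline also keeps a stray line ending out of the last column.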

Python: Encoding problem

I want to copy data from one database to another database. Therefore I wrote a Python script for this purpose.
Names are in German, but I don't think that will be a problem for understanding my question.
The script does the following:
db = mysql.connect(db='', charset="utf8", use_unicode=True, **v.MySQLServer[server]);
...
cursor = db.cursor();
cursor.execute('select * from %s.%s where %s = %d;' % (eingangsDatenbankName, tabelle, syncFeldname, v.NEU))
daten = cursor.fetchall()
for zeile in daten:
    sql = 'select * from %s.%s where ' % (hauptdatenbankName, tabelle)
    ...
    for i in xrange(len(spalten)):
        sql += " %s, " % db_util.formatierFeld(unicode(str(zeile[i]), "utf-8"), feldTypen[i])
The method "db_util.formatierFeld" looks like this
def formatierFeld(inhalt, feldTyp):
    if inhalt.lower() == "none":
        return "NULL"
    # String types
    if "char" in feldTyp.lower() or "text" in feldTyp.lower() or "blob" in feldTyp.lower() or "date".lower() in feldTyp.lower() or "time" in feldTyp.lower():
        return '"%s"' % inhalt
    else:
        return '%s' % inhalt
Well, to some of you this stuff will seem quite odd, but I can assure you I MUST do it this way, so please no discussion about style etc.
Okay, when running this code I get the following error message whenever I run into words with umlauts.
Traceback (most recent call last):
  File "db_import.py", line 222, in <module>
    main()
  File "db_import.py", line 219, in main
    importieren(server, lokaleMaschine, dbEingang, dbHaupt)
  File "db_import.py", line 145, in importieren
    sql += " %s, " % db_util.formatierFeld(unicode(str(zeile[i]), "utf-8"), feldTypen[i])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 1: ordinal not in range(128)
Actually, I do not understand why this string can't be built that way. In my opinion this should work, since I explicitly tell the program to use unicode here.
Does anybody have a guess as to what is going wrong here?
The error is made more difficult to interpret by the deep nesting of expressions you have.
In the line
sql += " %s, " % db_util.formatierFeld(unicode(str(zeile[i]), "utf-8"), feldTypen[i])
where does the exception come from? It's difficult to say. However, I would suppose that it comes from str(zeile[i]). If zeile[i] is unicode containing non-ASCII characters, then you cannot convert it to a byte string using str. Instead, you must encode it to a byte string using a codec which can represent all of the characters it contains.
However...
unicode(str(zeile[i]), "utf-8")
This is pointless, if zeile[i] is a unicode string. First you try to encode it to a byte string, then you try to decode it back into a unicode string. You could skip all that and just do zeile[i]. formatierFeld doesn't really matter because execution never gets that far.
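To make the failure concrete: on Python 2, calling str() on a unicode object encodes it with the ASCII codec behind the scenes. The sketch below spells that implicit step out as an explicit .encode('ascii') so that it behaves the same on Python 2.7 and Python 3; the sample value is hypothetical, chosen so that the umlaut sits at position 1 as in the traceback.

```python
# Hypothetical cell value as the MySQL driver returns it with use_unicode=True.
value = u'M\xfcller'

# What str(value) effectively does on Python 2: an ASCII encode.
try:
    value.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding with a codec that can represent the character works, and decoding
# the result simply gives back the value you started with.
encoded = value.encode('utf-8')
roundtrip = encoded.decode('utf-8')
```

Since the round trip returns the original value, passing zeile[i] directly is equivalent to the successful path and avoids the failing one. Values that are not already unicode (numbers, dates, None) would still need converting, for example with unicode(zeile[i]) on Python 2.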

Fixing a type-error in Python's Pg

Thanks to bobince for solving the first bugs!
How can you use pg.escape_bytea or pg.escape_string in the following?
#1 With both pg.escape_string and pg.escape_bytea
con1.query(
    "INSERT INTO files (file, file_name) VALUES ('%s', '%s')" %
    (pg.escape_bytea(pg.espace_string(f.read())), pg.espace_string(pg.escape_bytea(f.name)))
)
I get the error
AttributeError: 'module' object has no attribute 'espace_string'
I tested the two escapes in the reverse order unsuccessfully too.
#2 Without pg.escape_string()
con1.query(
    "INSERT INTO files (file, file_name) VALUES ('%s', '%s')" %
    (pg.escape_bytea(f.read()), pg.escape_bytea(f.name))
)
I get
WARNING: nonstandard use of \\ in a string literal
LINE 1: INSERT INTO files (file, file_name) VALUES ('%PDF-1.4\\012%\...
^
HINT: Use the escape string syntax for backslashes, e.g., E'\\'.
------------------------
-- Putting pdf files in
#3 With only pg.escape_string
I get the following error
------------------------
-- Putting pdf files in
------------------------
Traceback (most recent call last):
  File "<stdin>", line 30, in <module>
  File "<stdin>", line 27, in put_pdf_files_in
  File "/usr/lib/python2.6/dist-packages/pg.py", line 313, in query
    return self.db.query(qstr)
pg.ProgrammingError: ERROR: invalid byte sequence for encoding "UTF8": 0xc7ec
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
INSERT INTO files('binf','file_name') VALUES(file,file_name)
You've got the (...) sections the wrong way round: you're trying to insert the columns (file, filename) into the string literals ('binf', 'file_name'). You're also not actually inserting the contents of the variables binf and file_name into the query.
The pg module's query call does not support parameterisation. You would have to make the string yourself:
con1.query(
    "INSERT INTO files (file, file_name) VALUES ('%s', '%s')" %
    (pg.escape_string(f.read()), pg.escape_string(f.name))
)
This is assuming f is a file object; I'm not sure where file is coming from in the code above or what .read(binf) is supposed to mean. If you are using a bytea column to hold your file data you must use escape_bytea instead of escape_string.
Better than creating your own queries is letting pg do it for you with the insert method:
con1.insert('files', file=f.read(), file_name=f.name)
Alternatively, consider using the pgdb interface or one of the other DB-API-compliant interfaces that is not PostgreSQL-specific, if you ever want to consider running your app on a different database. DB-API gives you parameterisation in the execute method:
cursor.execute(
    'INSERT INTO files (file, file_name) VALUES (%(content)s, %(name)s)',
    {'content': f.read(), 'name': f.name }
)
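To see DB-API parameterisation at work without a PostgreSQL server, here is a runnable sketch that uses the standard library's sqlite3 module purely as a stand-in. Note the differences, which are assumptions of this sketch rather than anything pgdb does: sqlite3 uses the :name placeholder style instead of %(name)s, and the binary payload is wrapped in sqlite3.Binary. The sample bytes and file name are made up.

```python
import sqlite3

# Hypothetical file content with bytes and characters that would need
# escaping if the query were built by string formatting.
content = b"%PDF-1.4\n%\xc7\xec binary 'quoted' \\ payload"
name = u"report's final.pdf"

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE files (file BLOB, file_name TEXT)')

# The driver binds the values itself; no manual quoting or escaping.
cur.execute(
    'INSERT INTO files (file, file_name) VALUES (:content, :name)',
    {'content': sqlite3.Binary(content), 'name': name},
)
conn.commit()

row = cur.execute('SELECT file, file_name FROM files').fetchone()
stored_content, stored_name = bytes(row[0]), row[1]
conn.close()
```

The quote in the file name and the backslash and non-UTF-8 bytes in the content come back unchanged, which is the point of letting the driver bind the values.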
