UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014' - python

I'm getting this error UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014'
I'm trying to load lots of news articles into a MySQLdb. However I'm having difficulty handling non-standard characters, I get hundreds of these errors for all sorts of characters. I can handle them individually using .replace() although I would like a more complete solution to handle them correctly.
ubuntu#ip-10-0-0-21:~/scripts/work$ python test_db_load_error.py
Traceback (most recent call last):
File "test_db_load_error.py", line 27, in <module>
cursor.execute(sql_load)
File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 157, in execute
query = query.encode(charset)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014' in position 158: ordinal not in range(256)
My script;
import MySQLdb as mdb
from goose import Goose
import string
import datetime
host = 'rds.amazonaws.com'
user = 'news'
password = 'xxxxxxx'
db_name = 'news_reader'
conn = mdb.connect(host, user, password, db_name)
url = 'http://www.dailymail.co.uk/wires/ap/article-3060183/Andrew-Lesnie-Lord-Rings-cinematographer-dies.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490'
g = Goose()
article = g.extract(url=url)
body = article.cleaned_text
body = body.replace("'","`")
load_date = str(datetime.datetime.now())
summary = article.meta_description
title = article.title
image = article.top_image
sql_load = "insert into articles " \
" (title,summary,article,,image,source,load_date) " \
" values ('%s','%s','%s','%s','%s','%s');" % \
(title,summary,body,image,url,load_date)
cursor = conn.cursor()
cursor.execute(sql_load)
#conn.commit()
Any help would be appreciated.

When you create your mysqldb connection pass the charset='utf8' to the connection.
conn = mdb.connect(host, user, password, db_name, charset='utf8')

If your database is actually configured for Latin-1, then you cannot store non-Latin-1 characters in it. That includes U+2014, EM DASH.
The ideal solution is to just switch to a database configured for UTF-8. Just pass charset='utf-8' when initially creating the database, and every time you connect to it. (If you already have existing data, you probably want to use MySQL tools to migrate the old database to a new one, instead of Python code, but the basic idea is the same.)
However, sometimes that isn't possible. Maybe you have other software that can't be updated, requires Latin-1, and needs to share the same database. Or maybe you've mixed Latin-1 text and binary data in ways that can't be programmatically unmixed, or your database is just too huge to migrate, or whatever. In that case, you have two choices:
Destructively convert your strings to Latin-1 before storing and searching. For example, you might want to convert an em dash to -, or to --, or maybe it's not all that important and you can just convert all non-Latin-1 characters to ? (which is faster and simpler).
Come up with an encoding scheme to smuggle non-Latin-1 characters into the database. This means some searches become more complicated, or just can't be done directly in the database.

This might be a heavy read, but at least got me started.
http://www.joelonsoftware.com/articles/Unicode.html

Related

how to avoid the following issue, using python3?

Hello I have the following code:
from __future__ import print_function
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import pandas as pd
import re
import threading
import pickle
import sqlite3
#from treetagger import TreeTagger
conn = sqlite3.connect('Telcel.db')
cursor = conn.cursor()
cursor.execute('select id_comment from Tweets')
id_comment = [i for i in cursor]
cursor.execute('select id_author from Tweets')
id_author = [i for i in cursor]
cursor.execute('select comment_message from Tweets')
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
cursor.execute('select comment_published from Tweets')
comment_published = [i for i in cursor]
That is working well in python 2.7.12, output:
~/data$ python DBtoList.py
8003
8003
8003
8003
However when I run the same code using python3 as follows, I got:
~/data$ python3 DBtoList.py
Traceback (most recent call last):
File "DBtoList.py", line 21, in <module>
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
File "DBtoList.py", line 21, in <listcomp>
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
sqlite3.OperationalError: Could not decode to UTF-8 column 'comment_message' with text 'dancing music ������'
I searched for this line and I found:
"dancing music 😜"
I am not sure why the code is working in python 2, it seems that python Python 3.5.2 is not able to decode this character at this line:
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
so I would like to appreciate suggestions to fix this problem, thanks for the support
Python 3 has no issue with the string itself if you store it using the Python sqlite3 API. I've set utf-8 as my default encoding everywhere.
import sqlite3
conn = sqlite3.connect(':memory:')
conn.execute('create table Tweets (comment_message text)')
conn.execute('insert into Tweets values ("dancing music 😜")')
[(tweet,) ] = conn.execute('select comment_message from tweets')
tweet
output:
'dancing music 😜'
Now, let's see the type:
>>> type(tweet)
str
So everything is fine if you work with Python str from the start.
Now, as an aside, the thing you are trying to do (encode utf-8, decode latin-1) makes very little sense, especially if you have things like emojis in the string. Look what happens to your tweet:
>>> tweet.encode('utf-8').decode('latin-1')
'dancing music ð\x9f\x98\x9c'
But now to your problem: You have stored strings (byte sequences) in your database using an encoding different from utf-8. The error you are seeing is caused by the sqlite3 library attempting to decode these byte sequences and failing because the bytes are not valid utf-8 sequences. The only way to solve this problem is:
Find out what encoding was used to encode the strings in the database
Use that encoding to decode the strings by setting conn.text_factory = lambda x: str(x, 'latin-1'). This assumes you've stored the strings using latin1.
I would then suggest that you run through the database and update the values so that they now are encoded using utf-8 which is the default behaviour.
See also this question.
I also highly recommend that you read this article about how encodings work.

exportation data from mysql database using csv

i neeed a python script to generate a csv file from my database XXXX. i wrote thise script but i have something wrong :
import mysql.connector
import csv
filename=open('test.csv','wb')
c=csv.writer(filename)
cnx = mysql.connector.connect(user='XXXXXXX', password='XXXXX',
host='localhost',
database='XXXXX')
cursor = cnx.cursor()
query = ("SELECT `Id_Vendeur`, `Nom`, `Prenom`, `email`, `Num_magasin`, `Nom_de_magasin`, `Identifiant_Filiale`, `Groupe_DV`, `drt_Cartes`.`gain` as 'gain', `Date_Distribution`, `Status_Grattage`, `Date_Grattage` FROM `drt_Cartes_Distribuer`,`drt_Agent`,`drt_Magasin`,`drt_Cartes` where `drt_Cartes_Distribuer`.`Id_Vendeur` = `drt_Agent`.`id_agent` AND `Num_magasin` = `drt_Magasin`.`Numero_de_magasin` AND `drt_Cartes_Distribuer`.`Id_Carte` = `drt_Cartes`.`num_carte`")
cursor.execute(query)
for Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage in cursor:
c.writerow([Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage] )
cursor.close()
filename.close()
cnx.close()
when i executing the command on phpmyadmin its look working very well but from my shell i got thise message :
# python test.py
Traceback (most recent call last):
File "test.py", line 18, in <module>
c.writerow([Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage] )
UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 5: ordinal not in range(128)
It looks you are using csv for Python 2.7. Quoting docs:
Note This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
Options, choice one of them:
Follow doc link, go to samples section, and modify your code accordantly.
Use a csv packet with unicode supprt like https://pypi.python.org/pypi/unicodecsv
Your data from the database are not only ascii characteres. I suggest you use the 'unicodecvs' python module as suggested in the answer to this question: How to write UTF-8 in a CSV file

Fix UnicodeDecodeError

I have the following code . I use Python 2.7
import csv
import sqlite3
conn = sqlite3.connect('torrents.db')
c = conn.cursor()
# Create table
c.execute('''DROP TABLE torrents''')
c.execute('''CREATE TABLE IF NOT EXISTS torrents
(name text, size long, info_hash text, downloads_count long,
category_id text, seeders long, leechers long)''')
with open('torrents_mini.csv', 'rb') as csvfile:
spamreader = csv.reader(csvfile, delimiter='|')
for row in spamreader:
name = unicode(row[0])
size = row[1]
info_hash = unicode(row[2])
downloads_count = row[3]
category_id = unicode(row[4])
seeders = row[5]
leechers = row[6]
c.execute('INSERT INTO torrents (name, size, info_hash, downloads_count,
category_id, seeders, leechers) VALUES (?,?,?,?,?,?,?)',
(name, size, info_hash, downloads_count, category_id, seeders, leechers))
conn.commit()
conn.close()
The error message I receive is
Traceback (most recent call last):
File "db.py", line 15, in <module>
name = unicode(row[0])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
If I don't convert into unicode then the error i get is
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
adding name = row[0].decode('UTF-8') gives me another error
Traceback (most recent call last):
File "db.py", line 27, in <module>
for row in spamreader:
_csv.Error: line contains NULL byte
the data contained in the csv file is in the following format
Tha Twilight New Moon DVDrip 2009 XviD-AMiABLE|694554360|2cae2fc76d110f35917d5d069282afd8335bc306|0|movies|0|1
Edit:I finally dropped the attempt and accomplished the task using sqlite3 command-line tool(it was quite easy).
I do not yet know what caused the errors , but when sqlite3 was importing the said csv file , it kept popping warnings about "unescaped character", the character being quotes(").
Thanks to everyone who tried to help.
Your data is not encoded as ASCII. Use the correct codec for your data.
You can tell Python what codec to use with:
unicode(row[0], correct_codec)
or use the str.decode() method:
row[0].decode(correct_codec)
What that correct codec is, we cannot tell you. You'll have to consult whatever you got the file from.
If you cannot figure out what encoding was used, you could use a package like chardet to make an educated guess, but take into account that such a library is not fail-proof.

Python: UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

Scenario: I have a list of server names in a JSON file that is getting read by the script and put into a dictionary. I'm then trying to use those server names in what will become a SQL query. However, I'm having a hell of a time with the UTF-8 encoded strings.
Error Traceback:
Traceback (most recent call last):
File "run.py", line 18, in <module>
print(str(len(download.downloadRealmFiles('eu'))) + " EU files downloaded.")
File "/var/www/etherealpost.com/scripts/ahdata/download.py", line 73, in downloadRealmFiles
sql = u"UPDATE realms_lastmodified SET last_modified = '%d', latest_hash = '%s' WHERE region = '%s' AND realm = '%s'" % (lastModified, lastHash.encode('utf-8'), region.encode('utf-8'), realm)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
The code:
realm = data['files'][0]['realm']
lastHash = realmFile.split('/')[-2]
lastModified = data['files'][0]['lastModified']
dataURLs.append(realmFile)
sql = u"UPDATE realms_lastmodified SET last_modified = '%d', latest_hash = '%s' WHERE region = '%s' AND realm = '%s'" % (lastModified, lastHash.encode('utf-8'), region.encode('utf-8'), realm.encode('utf-8'))
lastModified is of type long
The variable realm is the one that contains the Unicode characters.
I'm out of ideas why this isn't working.
Don't interpolate strings into a SQL query! Use SQL parameters instead and leave it up to your database to handle quoting and Unicode values:
sql = """\
UPDATE realms_lastmodified
SET last_modified=?, latest_hash=?
WHERE region=? AND realm=?
"""
cursor.execute(sql, (lastModified, lastHash, region, realm))
I used ? as the parameter placeholders here, but it depends on the exact database library used; you may need to use %s as the placeholder instead (regardless of the type of the column!).
Your error specifically is caused by you interpolating encoded bytestrings into a Unicode value. Don't do that either; interpolate, then encode. Otherwise, Python attempts to decode the UTF8 bytes using the default codec to get Unicode again, and that fails here.

Python: Encoding problem

I want to copy data from one database to another database. Therefore I wrote a Python script for this purpose.
Names are in german, but I don't think that will be a problem for understanding my question.
The script does the following
db = mysql.connect(db='', charset="utf8", use_unicode=True, **v.MySQLServer[server]);
...
cursor = db.cursor();
cursor.execute('select * from %s.%s where %s = %d;' % (eingangsDatenbankName, tabelle, syncFeldname, v.NEU))
daten = cursor.fetchall()
for zeile in daten:
sql = 'select * from %s.%s where ' % (hauptdatenbankName, tabelle)
...
for i in xrange(len(spalten)):
sql += " %s, " % db_util.formatierFeld(unicode(str(zeile[i]), "utf-8"), feldTypen[i])
The method "db_util.formatierFeld" looks like this
def formatierFeld(inhalt, feldTyp):
if inhalt.lower() == "none":
return "NULL" #Stringtypen
if "char" in feldTyp.lower() or "text" in feldTyp.lower() or "blob" in feldTyp.lower() or "date".lower() in feldTyp.lower() or "time" in feldTyp.lower():
return '"%s"' % inhalt
else:
return '%s' % inhalt
Well, to some of you this stuff will seem quite odd, but I can asure you I MUST do it this way, so please no discussion about style etc.
Okay, when running this code I get the following error message when I run into words with umlauts.
Traceback (most recent call last):
File "db_import.py", line 222, in <module>
main()
File "db_import.py", line 219, in main
importieren(server, lokaleMaschine, dbEingang, dbHaupt)
File "db_import.py", line 145, in importieren
sql += " %s, " % db_util.formatierFeld(unicode(str(zeile[i]), "utf-8"), feldTypen[i])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 1: ordinal not in range(128)
Actually I do not understand why this string can't be build that way. I my opinion this should work since I explicitly tell the program to use unicode here.
Anybody has a guess what is going wrong here?
The error is made more difficult to interpret by the deep nesting of expressions you have.
In the line
sql += " %s, " % db_util.formatierFeld(unicode(str(zeile[i]), "utf-8"), feldTypen[i])
where does the exception come from? It's difficult to say. However, I would suppose that it comes from str(zeile[i]). If zeile[i] is unicode containing non-ASCII characters, then you cannot convert it to a byte string using str. Instead, you must encode it to a byte string using a codec which can represent all of the characters it contains.
However...
unicode(str(zeile[i]), "utf-8")
This is pointless, if zeile[i] is a unicode string. First you try to encode it to a byte string, then you try to decode it back into a unicode string. You could skip all that and just do zeile[i]. formatierFeld doesn't really matter because execution never gets that far.

Categories

Resources