How to avoid the following issue, using Python 3? - python

Hello I have the following code:
from __future__ import print_function
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import pandas as pd
import re
import threading
import pickle
import sqlite3
#from treetagger import TreeTagger
conn = sqlite3.connect('Telcel.db')
cursor = conn.cursor()
cursor.execute('select id_comment from Tweets')
id_comment = [i for i in cursor]
cursor.execute('select id_author from Tweets')
id_author = [i for i in cursor]
cursor.execute('select comment_message from Tweets')
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
cursor.execute('select comment_published from Tweets')
comment_published = [i for i in cursor]
This works well in Python 2.7.12. Output:
~/data$ python DBtoList.py
8003
8003
8003
8003
However, when I run the same code with python3 as follows, I get:
~/data$ python3 DBtoList.py
Traceback (most recent call last):
File "DBtoList.py", line 21, in <module>
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
File "DBtoList.py", line 21, in <listcomp>
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
sqlite3.OperationalError: Could not decode to UTF-8 column 'comment_message' with text 'dancing music ������'
I searched for this line and I found:
"dancing music 😜"
I am not sure why the code works in Python 2; it seems that Python 3.5.2 is not able to decode this character at this line:
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
So I would appreciate suggestions to fix this problem. Thanks for the support.

Python 3 has no issue with the string itself if you store it using the Python sqlite3 API. I've set utf-8 as my default encoding everywhere.
import sqlite3
conn = sqlite3.connect(':memory:')
conn.execute('create table Tweets (comment_message text)')
conn.execute('insert into Tweets values ("dancing music 😜")')
[(tweet,) ] = conn.execute('select comment_message from tweets')
tweet
output:
'dancing music 😜'
Now, let's see the type:
>>> type(tweet)
str
So everything is fine if you work with Python str from the start.
Now, as an aside, the thing you are trying to do (encode utf-8, decode latin-1) makes very little sense, especially if you have things like emojis in the string. Look what happens to your tweet:
>>> tweet.encode('utf-8').decode('latin-1')
'dancing music ð\x9f\x98\x9c'
But now to your problem: You have stored strings (byte sequences) in your database using an encoding different from utf-8. The error you are seeing is caused by the sqlite3 library attempting to decode these byte sequences and failing because the bytes are not valid utf-8 sequences. The only way to solve this problem is:
Find out what encoding was used to encode the strings in the database
Use that encoding to decode the strings by setting conn.text_factory = lambda x: str(x, 'latin-1'). This assumes you've stored the strings using latin1.
I would then suggest that you run through the database and update the values so that they now are encoded using utf-8 which is the default behaviour.
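The steps above can be sketched roughly as follows, in Python 3. This is only an illustration under stated assumptions: the legacy data is taken to be Latin-1, the table gets a hypothetical integer primary key id_comment, and the CAST(? AS TEXT) insert is just a way to simulate badly encoded rows in an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Assumption: id_comment is an integer primary key (not confirmed by the question).
conn.execute('create table Tweets (id_comment integer primary key, comment_message text)')

# Simulate legacy data: store Latin-1 bytes as TEXT. SQLite does not validate them.
conn.execute('insert into Tweets (comment_message) values (CAST(? AS TEXT))',
             ('café'.encode('latin-1'),))

# Decode TEXT values with the encoding that was actually used (assumed Latin-1 here).
conn.text_factory = lambda raw: raw.decode('latin-1')
rows = conn.execute('select id_comment, comment_message from Tweets').fetchall()

# Write the values back as str; with the default database encoding they are stored as UTF-8.
conn.executemany('update Tweets set comment_message = ? where id_comment = ?',
                 [(text, row_id) for row_id, text in rows])
conn.commit()

# The default text_factory (str, which expects UTF-8) can now read the column again.
conn.text_factory = str
repaired = [row[0] for row in conn.execute('select comment_message from Tweets')]
print(repaired)
```

If the real encoding is something other than Latin-1, only the codec name in the text_factory line would need to change.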
See also this question.
I also highly recommend that you read this article about how encodings work.

Related

exportation data from mysql database using csv

I need a Python script to generate a CSV file from my database XXXX. I wrote this script, but something is wrong:
import mysql.connector
import csv
filename=open('test.csv','wb')
c=csv.writer(filename)
cnx = mysql.connector.connect(user='XXXXXXX', password='XXXXX',
host='localhost',
database='XXXXX')
cursor = cnx.cursor()
query = ("SELECT `Id_Vendeur`, `Nom`, `Prenom`, `email`, `Num_magasin`, `Nom_de_magasin`, `Identifiant_Filiale`, `Groupe_DV`, `drt_Cartes`.`gain` as 'gain', `Date_Distribution`, `Status_Grattage`, `Date_Grattage` FROM `drt_Cartes_Distribuer`,`drt_Agent`,`drt_Magasin`,`drt_Cartes` where `drt_Cartes_Distribuer`.`Id_Vendeur` = `drt_Agent`.`id_agent` AND `Num_magasin` = `drt_Magasin`.`Numero_de_magasin` AND `drt_Cartes_Distribuer`.`Id_Carte` = `drt_Cartes`.`num_carte`")
cursor.execute(query)
for Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage in cursor:
    c.writerow([Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage] )
cursor.close()
filename.close()
cnx.close()
When I execute the query in phpMyAdmin it works fine, but from my shell I get this message:
# python test.py
Traceback (most recent call last):
File "test.py", line 18, in <module>
c.writerow([Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage] )
UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 5: ordinal not in range(128)
It looks like you are using the csv module for Python 2.7. Quoting the docs:
Note This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
Options (choose one of them):
Follow the doc link, go to the Examples section, and modify your code accordingly.
Use a csv package with Unicode support, like https://pypi.python.org/pypi/unicodecsv
Your data from the database contains more than just ASCII characters. I suggest you use the 'unicodecsv' Python module, as suggested in the answer to this question: How to write UTF-8 in a CSV file
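If moving the script to Python 3 is an option, the standard csv module there works with str directly, so no extra package should be needed. A minimal sketch, assuming Python 3 and using made-up rows in place of the database cursor (the file path and the sample names are hypothetical):

```python
import csv
import os
import tempfile

# Made-up rows standing in for what the cursor would yield.
rows = [(1, 'Noël', 'Zoë')]

path = os.path.join(tempfile.mkdtemp(), 'test.csv')

# Text mode with an explicit encoding; newline='' is what the csv docs recommend.
with open(path, 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Id_Vendeur', 'Nom', 'Prenom'])
    writer.writerows(rows)

# Read the file back to check that the accented characters survived.
with open(path, encoding='utf-8', newline='') as f:
    read_back = list(csv.reader(f))
print(read_back)
```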

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014'

I'm getting this error UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014'
I'm trying to load lots of news articles into a MySQLdb. However I'm having difficulty handling non-standard characters, I get hundreds of these errors for all sorts of characters. I can handle them individually using .replace() although I would like a more complete solution to handle them correctly.
ubuntu#ip-10-0-0-21:~/scripts/work$ python test_db_load_error.py
Traceback (most recent call last):
File "test_db_load_error.py", line 27, in <module>
cursor.execute(sql_load)
File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 157, in execute
query = query.encode(charset)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014' in position 158: ordinal not in range(256)
My script:
import MySQLdb as mdb
from goose import Goose
import string
import datetime
host = 'rds.amazonaws.com'
user = 'news'
password = 'xxxxxxx'
db_name = 'news_reader'
conn = mdb.connect(host, user, password, db_name)
url = 'http://www.dailymail.co.uk/wires/ap/article-3060183/Andrew-Lesnie-Lord-Rings-cinematographer-dies.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490'
g = Goose()
article = g.extract(url=url)
body = article.cleaned_text
body = body.replace("'","`")
load_date = str(datetime.datetime.now())
summary = article.meta_description
title = article.title
image = article.top_image
sql_load = "insert into articles " \
" (title,summary,article,,image,source,load_date) " \
" values ('%s','%s','%s','%s','%s','%s');" % \
(title,summary,body,image,url,load_date)
cursor = conn.cursor()
cursor.execute(sql_load)
#conn.commit()
Any help would be appreciated.
When you create your mysqldb connection pass the charset='utf8' to the connection.
conn = mdb.connect(host, user, password, db_name, charset='utf8')
If your database is actually configured for Latin-1, then you cannot store non-Latin-1 characters in it. That includes U+2014, EM DASH.
The ideal solution is to just switch to a database configured for UTF-8. Just pass charset='utf8' (MySQL spells the name without a hyphen) when initially creating the database, and every time you connect to it. (If you already have existing data, you probably want to use MySQL tools to migrate the old database to a new one, instead of Python code, but the basic idea is the same.)
However, sometimes that isn't possible. Maybe you have other software that can't be updated, requires Latin-1, and needs to share the same database. Or maybe you've mixed Latin-1 text and binary data in ways that can't be programmatically unmixed, or your database is just too huge to migrate, or whatever. In that case, you have two choices:
Destructively convert your strings to Latin-1 before storing and searching. For example, you might want to convert an em dash to -, or to --, or maybe it's not all that important and you can just convert all non-Latin-1 characters to ? (which is faster and simpler).
Come up with an encoding scheme to smuggle non-Latin-1 characters into the database. This means some searches become more complicated, or just can't be done directly in the database.
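Both choices can be sketched with the standard codec error handlers. This is only an illustration in Python 3; the function names are made up for this example, and using numeric character references is just one possible smuggling scheme, not something the answer prescribes.

```python
import html

def to_latin1_lossy(text):
    """Choice 1: destructive conversion. Map the em dash by hand, then replace the rest with '?'."""
    text = text.replace('\u2014', '--')
    return text.encode('latin-1', 'replace').decode('latin-1')

def to_latin1_escaped(text):
    """Choice 2: smuggle non-Latin-1 characters as numeric character references."""
    return text.encode('latin-1', 'xmlcharrefreplace').decode('latin-1')

def from_latin1_escaped(text):
    """Undo to_latin1_escaped. Note: this also unescapes references already present in the text."""
    return html.unescape(text)

sample = 'Lesnie \u2014 caf\u00e9 \u201cLord\u201d'
lossy = to_latin1_lossy(sample)
escaped = to_latin1_escaped(sample)
print(lossy)
print(escaped)
print(from_latin1_escaped(escaped) == sample)
```

The escaped form shows why searches get harder: a query for an em dash would have to look for the escaped text instead.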
This might be a heavy read, but it at least got me started.
http://www.joelonsoftware.com/articles/Unicode.html

Hebrew appears as gibberish, DB importing with PyPyODBC

I'm trying to work with a Hebrew database, but unfortunately the output is gibberish. What am I doing wrong?
# -*- coding: utf-8 -*-
import pypyodbc
conn = pypyodbc.connect('Driver={Microsoft Access Driver (*.mdb)};DBQ=C:\\client.mdb')
cur = conn.cursor()
cur.execute('''SELECT * FROM Client''')
d = cur.fetchone()
for field in d:
    print field
If I look at cur.fetchone():
'\xf0\xf1\xe0\xf8', '\xe0\xe9\xe0\xe3'
Output:
αΘαπ
2001
εδßΘ
αΘ°σ
If either of נסאר or איאד is meaningful, then try:
field.decode('cp1255')
Google Translate suggests this might correspond to a person named Iyad Nassar.
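As a quick check of this suggestion, the two byte strings from the question can be decoded in Python 3 as below. The cp437 line is a guess: the gibberish shown in the question is consistent with a console rendering the same bytes as code page 437, but the question does not confirm which code page was in use.

```python
# Byte strings copied from the cur.fetchone() output in the question.
raw_fields = [b'\xf0\xf1\xe0\xf8', b'\xe0\xe9\xe0\xe3']

# Decoding as Windows-1255 (Hebrew) yields readable names.
decoded = [raw.decode('cp1255') for raw in raw_fields]

# Decoding the same bytes as cp437 (assumed console code page) yields Greek-looking gibberish.
as_cp437 = [raw.decode('cp437') for raw in raw_fields]

print(decoded)
print(as_cp437[1])
```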
Try using:
field.encode('utf-8')

Python stores "ó" as "Ã³"

I use scrapy to take information from a website that, according to the W3 validator, is UTF-8.
My python project has
# -*- coding: utf-8 -*-
I receive some names like López J and when I print it, it shows fine...
But when I want to store it into the mysql I get some error about ascii not being able to encode blah blah blah...
If I use .encode ('ascii', 'ignore') i get: Lpez J
If I use .encode ('ascii', 'replace') i get: L?pez J
If I use .encode ('utf-8') i get: LÃ³pez J
What should I do?
I'm in a big trouble here :'(
When you connect to the database, pass charset='utf8' and use_unicode=True along with the other keywords to the connect() method. This should make the database accept and return unicode values, so you don't have to (and shouldn't) encode them manually.
Example:
>>> import MySQLdb
>>> conn = MySQLdb.connect(... , use_unicode=True, charset='utf8')
>>> cur = conn.cursor()
>>> cur.execute('CREATE TABLE testing(x VARCHAR(20))')
0L
>>> cur.execute('INSERT INTO testing values(%s)', ('López J',))
1L
>>> cur.execute('SELECT * FROM testing')
1L
>>> print cur.fetchall()[0][0]
López J
Check your server, database, table, column and connection character sets.
As a quick test, try executing
SET NAMES 'utf8';
immediately after connecting.
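To see where the three results in the question come from, here is a small Python 3 sketch. It only illustrates the encoding behaviour; it does not touch MySQL, and reading the UTF-8 bytes back as Latin-1 is an assumption about what the viewer or connection was doing.

```python
name = 'López J'

# 'ignore' silently drops characters that ASCII cannot represent.
ascii_ignored = name.encode('ascii', 'ignore').decode('ascii')

# 'replace' substitutes a question mark for each such character.
ascii_replaced = name.encode('ascii', 'replace').decode('ascii')

# UTF-8 bytes misread as Latin-1: each byte of the two-byte sequence becomes its own character.
mojibake = name.encode('utf-8').decode('latin-1')

# The misreading is reversible as long as no bytes were lost along the way.
restored = mojibake.encode('latin-1').decode('utf-8')

print(ascii_ignored, ascii_replaced, mojibake, restored, sep=' | ')
```

The last two lines suggest the UTF-8 data itself is probably intact and that only the character set used when reading or writing it is mismatched, which is what the connection settings above address.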

Inserting unicode into sqlite?

I am still learning Python, and as a little project I wrote a script that takes the values I have in a text file and inserts them into a sqlite3 database. But some of the names have weird letters (I guess you would call them non-ASCII), and they generate an error when they come up. Here is my little script (and please tell me if there is any way it could be more Pythonic):
import sqlite3
f = open('complete', 'r')
fList = f.readlines()
conn = sqlite3.connect('tpb')
cur = conn.cursor()
for i in fList:
    exploaded = i.split('|')
    eList = (
        (exploaded[1], exploaded[5])
    )
    cur.execute('INSERT INTO magnets VALUES(?, ?)', eList)
conn.commit()
cur.close()
And it generates this error:
Traceback (most recent call last):
File "C:\Users\Admin\Desktop\sortinghat.py", line 13, in <module>
cur.execute('INSERT INTO magnets VALUES(?, ?)', eList)
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a te
xt_factory that can interpret 8-bit bytestrings (like text_factory = str). It is
highly recommended that you instead just switch your application to Unicode str
ings.
To get the file contents into unicode you need to decode from whichever encoding it is in.
It looks like you're on Windows so a good bet is cp1252.
If you got the file from somewhere else all bets are off.
Once you have the encoding sorted, an easy way to decode is to use the codecs module, e.g.:
import codecs
# ...
with codecs.open('complete', encoding='cp1252') as fin: # or utf-8 or whatever
    for line in fin:
        to_insert = (line.split('|')[1], line.split('|')[5])
        cur.execute('INSERT INTO magnets VALUES (?,?)', to_insert)
conn.commit()
# ...
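In Python 3 the built-in open() accepts an encoding argument, so the codecs module is not strictly needed there. A self-contained sketch of the same idea, assuming cp1252 input; the sample line, the temporary file, and the magnets column names are made up for illustration:

```python
import os
import sqlite3
import tempfile

# Create a small cp1252-encoded sample file (stand-in for the real 'complete' file).
path = os.path.join(tempfile.mkdtemp(), 'complete')
with open(path, 'wb') as f:
    f.write('0|Café|2|3|4|magnet:?xt=abc\n'.encode('cp1252'))

conn = sqlite3.connect(':memory:')
# Hypothetical schema: the question does not show the real column names.
conn.execute('CREATE TABLE magnets (name TEXT, link TEXT)')

# Decode while reading, so only str values ever reach sqlite3.
with open(path, encoding='cp1252') as fin:
    for line in fin:
        fields = line.rstrip('\n').split('|')
        conn.execute('INSERT INTO magnets VALUES (?, ?)', (fields[1], fields[5]))
conn.commit()

stored = conn.execute('SELECT name, link FROM magnets').fetchall()
print(stored)
```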
