I use Scrapy to scrape information from a website that, according to the W3C validator, is UTF-8.
My Python project has
# -*- coding: utf-8 -*-
I receive some names like López J, and when I print them they display fine.
But when I try to store them in MySQL, I get an error about the ascii codec not being able to encode the character.
If I use .encode('ascii', 'ignore') I get: Lpez J
If I use .encode('ascii', 'replace') I get: L?pez J
If I use .encode('utf-8') I get: López J
What should I do?
I'm in big trouble here :'(
When you connect to the database, pass charset='utf8' and use_unicode=True along with the other keyword arguments to connect(). This should make the database accept and return unicode values, so you don't have to (and shouldn't) encode them manually.
Example:
>>> import MySQLdb
>>> conn = MySQLdb.connect(... , use_unicode=True, charset='utf8')
>>> cur = conn.cursor()
>>> cur.execute('CREATE TABLE testing(x VARCHAR(20))')
0L
>>> cur.execute('INSERT INTO testing values(%s)', ('López J',))
1L
>>> cur.execute('SELECT * FROM testing')
1L
>>> print cur.fetchall()[0][0]
López J
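For reference, the outputs described in the question can be reproduced without a database. The sketch below is written for Python 3 (the session above is Python 2) and only illustrates why encoding by hand damages the name; the variable names are made up for the example.

```python
# Python 3 sketch: what each manual .encode() call does to the name.
name = 'L\u00f3pez J'  # 'López J'

# 'ignore' silently drops every character that ASCII cannot represent.
ignored = name.encode('ascii', 'ignore')

# 'replace' substitutes a question mark for each such character.
replaced = name.encode('ascii', 'replace')

# UTF-8 keeps the character, but as two bytes (0xC3 0xB3 for the accented o).
utf8_bytes = name.encode('utf-8')

# If those bytes are later interpreted as latin-1 (for example by a
# connection whose character set is not utf8), each byte becomes its own
# character and the text turns into mojibake.
mojibake = utf8_bytes.decode('latin-1')

# Bytes objects print as ASCII-safe reprs on any terminal.
print(ignored, replaced, utf8_bytes)
```

None of these conversions is needed once the connection itself is configured for UTF-8, which is the point of the answer above.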
Check your server, database, table, column and connection character sets.
As a quick test, try executing
SET NAMES 'utf8';
immediately after connecting.
I'm trying to develop a (really) simple server that an iOS app will query. The Python script has to connect to the MySQL database and return data in JSON format. I can't get it to work with special characters like è or é. This is a short, simplified version of my code with a lot of debugging prints inside...
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import MySQLdb
import json
print ("Content-Type: application/json; charset=utf-8\n\n")
db = MySQLdb.connect("localhost","root","******","*******" )
cursor = db.cursor()
sql = "SELECT * FROM places WHERE name IN (\"Palazzina Majani\")"
try:
    cursor.execute(sql)
    num_fields = len(cursor.description)
    field_names = [i[0] for i in cursor.description]
    results = cursor.fetchall()
    print ("------------------results:")
    print (results)
    output_json = []
    for row in results:
        output_json.append(dict(zip(field_names,row)))
    print ("------------------output_json:")
    print (output_json)
    output = json.dumps(output_json, ensure_ascii=False)
    print ("------------------output:")
    print (output)
except:
    print ("Error")
db.close()
And this is what I get with terminal and also with browser:
Content-Type: application/json; charset=utf-8
------------------results:
(('Palazzina Majani', 'kasj \xe8.\xe9', 'palazzina_majani'),)
------------------output_json:
[{'imageName': 'palazzina_majani', 'name': 'Palazzina Majani', 'description': 'kasj \xe8.\xe9'}]
------------------output:
[{"imageName": "palazzina_majani", "name": "Palazzina Majani", "description": "kasj ?.?"}]
How can I handle those special characters (mainly the commonly used ones from latin-1)?
What if I simply replace the single quotes with double quotes in "output_json" instead of using json.dumps?
Thank you!
SOLUTION
As Parfait said in the comments, passing charset='utf8' inside connect() solved the problem!
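The raw value in the output above is latin-1 encoded (\xe8 and \xe9 are è and é in latin-1), which is why it does not survive being sent as UTF-8. The following Python 3 sketch, which uses no database, shows the same bytes decoded to text and then serialised. The sample value is copied from the question; the latin-1 assumption and the variable names are illustrative. With charset='utf8' on the connection, this manual decoding step should not be needed.

```python
import json

# Raw bytes as shown in the question's output; assumed to be latin-1.
raw_description = b'kasj \xe8.\xe9'

# Decode to text first; JSON serialisation works on text, not raw bytes.
description = raw_description.decode('latin-1')

row = {'name': 'Palazzina Majani', 'description': description}

# ensure_ascii=False keeps the accented characters as they are ...
readable = json.dumps([row], ensure_ascii=False)

# ... while the default escapes them as \uXXXX, which is equally valid JSON.
escaped = json.dumps([row])

print(escaped)
```

As for swapping quote characters in the printed Python structure instead of calling json.dumps: that tends to break as soon as a value contains an apostrophe, a backslash, or a non-string type, so json.dumps is the safer route.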
Hello I have the following code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import json
import pandas as pd
import re
import threading
import pickle
import sqlite3
#from treetagger import TreeTagger
conn = sqlite3.connect('Telcel.db')
cursor = conn.cursor()
cursor.execute('select id_comment from Tweets')
id_comment = [i for i in cursor]
cursor.execute('select id_author from Tweets')
id_author = [i for i in cursor]
cursor.execute('select comment_message from Tweets')
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
cursor.execute('select comment_published from Tweets')
comment_published = [i for i in cursor]
This works well in Python 2.7.12. Output:
~/data$ python DBtoList.py
8003
8003
8003
8003
However, when I run the same code with python3, I get:
~/data$ python3 DBtoList.py
Traceback (most recent call last):
File "DBtoList.py", line 21, in <module>
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
File "DBtoList.py", line 21, in <listcomp>
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
sqlite3.OperationalError: Could not decode to UTF-8 column 'comment_message' with text 'dancing music ������'
I searched for this line and I found:
"dancing music 😜"
I am not sure why the code works in Python 2. It seems that Python 3.5.2 is not able to decode this character at this line:
comment_message = [i[0].encode('utf-8').decode('latin-1') for i in cursor]
I would appreciate any suggestions for fixing this problem. Thanks for the support.
Python 3 has no issue with the string itself if you store it using the Python sqlite3 API. I've set utf-8 as my default encoding everywhere.
import sqlite3
conn = sqlite3.connect(':memory:')
conn.execute('create table Tweets (comment_message text)')
conn.execute('insert into Tweets values ("dancing music 😜")')
[(tweet,) ] = conn.execute('select comment_message from tweets')
tweet
output:
'dancing music 😜'
Now, let's see the type:
>>> type(tweet)
str
So everything is fine if you work with Python str from the start.
Now, as an aside, the thing you are trying to do (encode utf-8, decode latin-1) makes very little sense, especially if you have things like emojis in the string. Look what happens to your tweet:
>>> tweet.encode('utf-8').decode('latin-1')
'dancing music ð\x9f\x98\x9c'
But now to your problem: You have stored strings (byte sequences) in your database using an encoding different from utf-8. The error you are seeing is caused by the sqlite3 library attempting to decode these byte sequences and failing because the bytes are not valid utf-8 sequences. The only way to solve this problem is:
1. Find out what encoding was used to encode the strings in the database.
2. Use that encoding to decode the strings by setting conn.text_factory = lambda x: str(x, 'latin-1'). This assumes you've stored the strings using latin1.
I would then suggest that you run through the database and update the values so that they now are encoded using utf-8 which is the default behaviour.
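The steps above can be sketched end to end with an in-memory database. This is only an illustration under stated assumptions: the legacy data is assumed to be latin-1 (as in the answer), the table and column names mirror the question, and the CAST(? AS TEXT) insert is just a convenient way to fabricate badly encoded text for the demo.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Tweets (comment_message TEXT)')

# Simulate legacy data: latin-1 bytes stored in a TEXT column.
# CAST(... AS TEXT) makes SQLite treat the bytes as text without validating them.
legacy_bytes = 'dan\u00e7a'.encode('latin-1')  # 'dança' as latin-1 bytes
conn.execute('INSERT INTO Tweets VALUES (CAST(? AS TEXT))', (legacy_bytes,))

# Decode with the encoding the data was actually written in.
conn.text_factory = lambda raw: str(raw, 'latin-1')
rows = conn.execute('SELECT rowid, comment_message FROM Tweets').fetchall()

# Write the values back as ordinary str, which sqlite3 stores as UTF-8.
conn.executemany('UPDATE Tweets SET comment_message = ? WHERE rowid = ?',
                 [(text, rowid) for rowid, text in rows])

# The default text factory works again after the rewrite.
conn.text_factory = str
[(fixed,)] = conn.execute('SELECT comment_message FROM Tweets')
```

On a real database it would be prudent to take a backup first and to confirm the source encoding on a few known rows before updating everything.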
See also this question.
I also highly recommend that you read this article about how encodings work.
I have an SQLite database with records containing Portuguese text like "Hiper-radiação simétrica periocular bem delimitada, homogênea."
Characters like ç ã é ê don't parse correctly in my Python script,
while normal English text works perfectly.
In my terminal window (I use a mac) the
I know it has something to do with the encoding, but the code still doesn't recognise Portuguese.
My sample code:
# -*- coding: UTF-8 -*-
import xml.etree.ElementTree as ET
import sqlite3
#open a database connection to the database translateDB.sqlite
conn = sqlite3.connect('translateDB.sqlite')
#prepare a cursor object using cursor() method
cursor = conn.cursor()
#test input
# this doesn't work
text = ('Hiper-radiação simétrica periocular bem delimitada, homogênea')
# this does work in english
#text = ('Well delimited, homogeneous symmetric periocular hyper- radiation.')
# Execute SQL query using execute() method.
cursor.execute('SELECT * FROM translate WHERE L2_portugese=?', (text,))
# Fetch a single row using fetchone() method and display it.
print cursor.fetchone()
# Disconnect from server
conn.close()
any tips & tricks are greatly appreciated. Ron
I am trying to read a file, parse the data using python 2.7, and import the data into sqlite3. However, I'm running into a problem when inserting the data. After I parse a line from the file, the é in my string is replaced with \xe9. After I split the line from my file, I want a list that contains [73,'Misérables, Les'] but instead I'm getting [73,'Mis\xe9rables, Les'] which is screwing up the SQL INSERT statement. How can I fix this?
#!/usr/bin/python
# -*- coding: latin-1 -*-
import sqlite3
line = '73::Misérables, Les'.decode('latin-1')
vals = line.split("::")
con = sqlite3.connect('myDb.db')
cur = con.cursor()
cur.execute("DROP TABLE IF EXISTS movie")
cur.execute('CREATE TABLE movie (id INT, title TEXT)')
sql = 'INSERT INTO movie VALUES (?,?)'
cur.execute(sql,tuple(vals))
cur.execute('SELECT * FROM movie')
for record in cur:
    print record
Your program inserts data into the db perfectly. It subsequently retrieves the correct data. Your problem is when you display the result.
When you print a tuple, the system displays the repr() of each item, not the str() of each item. Thus you see \xe9 instead of é in the output.
To get what you want, try replacing the loop at the end of your program:
for record in cur:
    print record[0], record[1]
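The difference between the two displays can be seen without a database. This sketch is written for Python 3, where repr() no longer escapes printable non-ASCII characters, so ascii() is used here to mimic what Python 2 shows; the record value is taken from the question and the variable names are illustrative.

```python
# Python 3 sketch; ascii() stands in for Python 2's repr() behaviour.
record = (73, 'Mis\u00e9rables, Les')  # the title is 'Misérables, Les'

# Displaying the tuple as a whole uses the repr of each item, so the
# title appears quoted and with the accented character escaped.
as_container = ascii(record)

# Formatting the items individually uses str(), so the text is shown as is.
as_items = '{} {}'.format(record[0], record[1])

# The stored value is the same either way: a single character, U+00E9.
assert record[1][3] == '\xe9'

print(as_container)
```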
I'm trying to work with a Hebrew database, but unfortunately the output is gibberish. What am I doing wrong?
# -*- coding: utf-8 -*-
import pypyodbc
conn = pypyodbc.connect('Driver={Microsoft Access Driver (*.mdb)};DBQ=C:\\client.mdb')
cur = conn.cursor()
cur.execute('''SELECT * FROM Client''')
d = cur.fetchone()
for field in d:
    print field
If I look at cur.fetchone():
'\xf0\xf1\xe0\xf8', '\xe0\xe9\xe0\xe3'
Output:
αΘαπ
2001
εδßΘ
αΘ°σ
If either of נסאר or איאד is meaningful, then try:
field.decode('cp1255')
Google Translate suggests this might correspond to a person named Iyad Nassar.
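A quick way to check this guess is to decode the raw bytes from the question under both code pages. The sketch below is Python 3; cp1255 is the Windows Hebrew code page suggested above, while cp437 (the legacy DOS console code page) is only an assumption about why the console showed Greek-looking characters.

```python
# Raw byte strings as shown by cur.fetchone() in the question.
raw_fields = [b'\xf0\xf1\xe0\xf8', b'\xe0\xe9\xe0\xe3']

# Decoding as cp1255 (Windows Hebrew) gives Hebrew text.
hebrew = [raw.decode('cp1255') for raw in raw_fields]

# Decoding the same bytes as cp437 gives the kind of gibberish seen in
# the question's output (assumption: the console was using cp437).
gibberish = [raw.decode('cp437') for raw in raw_fields]

# ascii() keeps the printout safe on consoles that cannot show Hebrew.
print([ascii(text) for text in hebrew])
```

If the cp1255 result is the meaningful one, the data is fine and only the display encoding is at fault.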
Try using:
field.encode('utf-8')