Sqlite3 cannot correctly query a UTF-8 string? - python

I'm having a lot of trouble using Python's sqlite3 library with UTF-8 strings. I need this encoding because I am working with people's names in my database.
My SQL schema for the desired table is:
CREATE TABLE senators (id integer, name char);
I would like to do the following in Python (ignore the very ugly way I wrote the select statement. I did it this way for debugging purposes):
statement = u"select * from senators where name like '" + '%'+row[0]+'%'+"'"
c.execute(statement)
row[0] is the name from each row of a file with entries like these:
Dário Berger,1
Edison Lobão,1
Eduardo Braga,1
While I get a non-empty result for names like Eduardo Braga, any time my string has accented (non-ASCII) characters, I get an empty result.
I have checked that my file has in fact been saved with UTF-8 encoding (Microsoft Notepad). On an Apple Mac, in the terminal, I used the PRAGMA command in the sqlite3 shell to check the encoding:
sqlite> PRAGMA encoding;
UTF-8
Does anybody have an idea what I can do here?
EDIT - Complete example:
Python script that creates the databases, and populates with initial data from senators.csv (file):
# -*- coding: utf-8 -*-
import sqlite3
import csv
conn = sqlite3.connect('senators.db')
c = conn.cursor()
c.execute('''CREATE TABLE senators (id integer, name char)''')
c.execute('''CREATE TABLE polls (id integer, senator char, vote integer, FOREIGN KEY(senator) REFERENCES senators(name))''')
with open('senators.csv', encoding='utf-8') as f:
    f_csv = csv.reader(f)
    for row in f_csv:
        c.execute(u"INSERT INTO senators VALUES(?,?)", (row[1], row[0]))
conn.commit()
conn.close()
Script that populates the polls table, using Q1.txt (file).
import csv
import sqlite3
import re
import glob
conn = sqlite3.connect('senators.db')
c = conn.cursor()
POLLS = {
    'senator': 'votes/senator/Q*.txt',
    'deputee': 'votes/deputee/Q*.txt',
}
s_polls = glob.glob(POLLS['senator'])
d_polls = glob.glob(POLLS['deputee'])
for poll in s_polls:
    m = re.match(r'.*Q(\d+)\.txt', poll)
    poll_id = m.groups(0)
    with open(poll, encoding='utf-8') as p:
        f_csv = csv.reader(p)
        for row in f_csv:
            c.execute(u'SELECT id FROM senators WHERE name LIKE ?', ('%'+row[0]+'%',))
            data = c.fetchone()
            print(data)  # I should not get None results here, but I do, exactly when the query has UTF-8 characters.
Note the file paths, if you want to test these scripts out.

Ok guys,
After a lot of trouble, I found out that the problem was that the encodings, although both considered UTF-8, were still different. The difference was that the database held decomposed strings (NFD: ã = a + combining ~), while my input was in precomposed form (NFC: a single code point for ã).
To fix it, I had to convert all my input data to the decomposed form.
from unicodedata import normalize
with open(poll, encoding='utf-8') as p:
    f_csv = csv.reader(p)
    for row in f_csv:
        name = normalize("NFD", row[0])
        c.execute(u'SELECT id FROM senators WHERE name LIKE ?', ('%'+name+'%',))
See this article for some excellent information on the subject.
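For anyone running into the same thing, here is a minimal sketch (the name is just an example) showing why the two forms compare as different strings even though they render identically:
# -*- coding: utf-8 -*-
from unicodedata import normalize

precomposed = "Dário"                         # NFC: 'á' is a single code point
decomposed = normalize("NFD", precomposed)    # NFD: 'a' followed by a combining accent

print(precomposed == decomposed)              # False: different code point sequences
print(len(precomposed), len(decomposed))      # 5 6
print(normalize("NFC", decomposed) == precomposed)   # True once both are in the same form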

From the SQLite docs:
Important Note: SQLite only understands upper/lower case for ASCII characters by default. The LIKE operator is case sensitive by default for unicode characters that are beyond the ASCII range. For example, the expression 'a' LIKE 'A' is TRUE but 'æ' LIKE 'Æ' is FALSE.
Also, use query parameters. Your query is vulnerable to SQL injection.
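A minimal sketch of the parameterized form of the question's query (reusing the senators.db created by the question's script; the search string is just an example):
import sqlite3

conn = sqlite3.connect('senators.db')
c = conn.cursor()

# The placeholder binds the value safely; the wildcards live in the parameter, not in the SQL text.
pattern = '%' + 'Eduardo Braga' + '%'
c.execute('SELECT * FROM senators WHERE name LIKE ?', (pattern,))
print(c.fetchall())
conn.close()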

Related

What is the best way to dump MySQL table data to csv and convert character encoding?

I have a table with about 200 columns. I need to take a dump of the daily transaction data for ETL purposes. It's a MySQL DB. I tried that with Python, both using a pandas DataFrame and a basic write-to-CSV-file approach. I also looked for the same functionality using a shell script; I saw one such solution for an Oracle database using sqlplus. Following are my Python codes for the two approaches:
Using Pandas:
import MySQLdb as mdb
import pandas as pd
host = ""
user = ''
pass_ = ''
db = ''
query = 'SELECT * FROM TABLE1'
conn = mdb.connect(host=host,
                   user=user, passwd=pass_,
                   db=db)
df = pd.read_sql(query, con=conn)
df.to_csv('resume_bank.csv', sep=',')
Using basic python file write:
import MySQLdb
import csv
import datetime
currentDate = datetime.datetime.now().date()
host = ""
user = ''
pass_ = ''
db = ''
table = ''
con = MySQLdb.connect(user=user, passwd=pass_, host=host, db=db, charset='utf8')
cursor = con.cursor()
query = "SELECT * FROM %s;" % table
cursor.execute(query)
with open('Data_on_%s.csv' % currentDate, 'w') as f:
    writer = csv.writer(f)
    for row in cursor.fetchall():
        writer.writerow(row)
print('Done')
The table has about 300,000 records. It's taking too much time with both of the Python approaches.
Also, there's an issue with encoding here. The DB result set has some Latin-1 characters for which I'm getting errors like: UnicodeEncodeError: 'ascii' codec can't encode character '\x96' in position 1078: ordinal not in range(128).
I need to save the CSV in Unicode format. Can you please help me with the best approach to perform this task?
A Unix-based or Python-based solution will work for me. This script needs to be run daily to dump daily data.
You can achieve that just by leveraging MySQL. For example:
SELECT * FROM your_table WHERE...
INTO OUTFILE 'your_file.csv'
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '\\'
LINES TERMINATED BY '\n';
If you need to schedule your query, put it into a file (e.g., csv_dump.sql) and create a cron task like this one:
00 00 * * * mysql -h your_host -u user -ppassword < /foo/bar/csv_dump.sql
For strings this will use the default character encoding, which happens to be ASCII, and that fails when you have non-ASCII characters. You want unicode instead of str.
rows = cursor.fetchall()
f = open('Data_on_%s.csv' % currentDate, 'w')
myFile = csv.writer(f)
for row in rows:
    myFile.writerow([unicode(s).encode("utf-8") for s in row])
f.close()
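If Python 3 is an option, the same idea is usually simpler: open the file with an explicit encoding and let the csv module write the strings directly. A sketch only, reusing the cursor and currentDate from the question's script:
import csv

with open('Data_on_%s.csv' % currentDate, 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(cursor.fetchall())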
You can use mysqldump for this task. (Source for command)
mysqldump -u username -p --tab -T/path/to/directory dbname table_name --fields-terminated-by=','
The arguments are as follows:
-u username for the username
-p to indicate that a password should be used
-ppassword to give the password via command line
--tab Produce tab-separated data files
For more command-line switches see https://dev.mysql.com/doc/refman/5.5/en/mysqldump.html
To run it on a regular basis, create a cron task like written in the other answers.

Portuguese characters in DBsqlite and Python parsing not recognised

I have a database in DBsqlite. In this database I have records containing Portuguese text like "Hiper-radiação simétrica periocular bem delimitada, homogênea.", and characters like ç ã é ê don't parse right in my Python script, while normal English text works perfectly.
In my terminal window (I use a mac) the
I know it has something to do with the encoding, but the code still doesn't recognise Portuguese.
my sample code:
# -*- coding: UTF-8 -*-
import xml.etree.ElementTree as ET
import sqlite3
#open a database connection to the database translateDB.sqlite
conn = sqlite3.connect('translateDB.sqlite')
# prepare a cursor object using the cursor() method
cursor = conn.cursor()
#test input
# this doesn't work
text = ('Hiper-radiação simétrica periocular bem delimitada, homogênea')
# this does work in english
#text = ('Well delimited, homogeneous symmetric periocular hyper- radiation.')
# Execute SQL query using execute() method.
cursor.execute('SELECT * FROM translate WHERE L2_portugese=?', (text,))
# Fetch a single row using fetchone() method and display it.
print cursor.fetchone()
# Disconnect from server
conn.close()
Any tips & tricks are greatly appreciated. Ron

Inserting UNICODE into sqlite3

I am trying to read a file, parse the data using python 2.7, and import the data into sqlite3. However, I'm running into a problem when inserting the data. After I parse a line from the file, the é in my string is replaced with \xe9. After I split the line from my file, I want a list that contains [73,'Misérables, Les'] but instead I'm getting [73,'Mis\xe9rables, Les'] which is screwing up the SQL INSERT statement. How can I fix this?
#!/usr/bin/python
# -*- coding: latin-1 -*-
import sqlite3
line = '73::Misérables, Les'.decode('latin-1')
vals = line.split("::")
con = sqlite3.connect('myDb.db')
cur = con.cursor()
cur.execute("DROP TABLE IF EXISTS movie")
cur.execute('CREATE TABLE movie (id INT, title TEXT)')
sql = 'INSERT INTO movie VALUES (?,?)'
cur.execute(sql,tuple(vals))
cur.execute('SELECT * FROM movie')
for record in cur:
print record
Your program inserts data into the db perfectly. It subsequently retrieves the correct data. Your problem is when you display the result.
When you print a tuple, the system displays the repr() of each item, not the str() of each item. Thus you see \xe9 instead of é in the output.
To get what you want, try replacing the loop at the end of your program:
for record in cur:
    print record[0], record[1]
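A quick way to see the difference in the Python 2 interpreter (the tuple mirrors the question's data):
# -*- coding: latin-1 -*-
record = (73, u'Misérables, Les')
print record      # tuple printing uses repr(): (73, u'Mis\xe9rables, Les')
print record[1]   # printing the item itself shows the accented character: Misérables, Les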

Python to Pull Oracle Data in Unicode (Arabic) format

I am using cx_Oracle to fetch some data stored in Arabic characters from an Oracle database. Below is how I try to connect to the database. When I try to print the results, especially those columns stored in Arabic, I get something like "?????", which suggests the data was not decoded properly.
I tried printing a random Arabic string in Python and it came out fine, which indicates the problem is in the way I am pulling data from the database.
connection = cx_Oracle.connect(username, password, instanceName)
wells = getWells(connection)

def getWells(conn):
    cursor = conn.cursor()
    wells = []
    cursor.execute(sql)
    clmns = len(cursor.description)
    for row in cursor.fetchall():
        print row
        well = {}
        for i in range(0, clmns):
            if type(row[i]) is not datetime.datetime:
                well[cursor.description[i][0]] = row[i]
            else:
                well[cursor.description[i][0]] = row[i].isoformat()
        wells.append(well)
    cursor.close()
    connection.close()
    return wells
In order to force a reset of the default encoding from the environment, you can call the setdefaultencoding method in the sys module.
As this is not recommended, it is not visible by default and a reload is required.
It is recommended that you attempt to fix the encoding set in the shell for the user on the host system rather than modifying it in a script.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
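An alternative worth trying, and the one the cx_Oracle answer further down relies on, is to tell the Oracle client to hand back UTF-8 by setting NLS_LANG before connecting. A minimal sketch, assuming the client picks the variable up from the environment and using the connection arguments from the question:
import os
os.environ['NLS_LANG'] = '.AL32UTF8'   # must be set before the first Oracle connection is made

import cx_Oracle
connection = cx_Oracle.connect(username, password, instanceName)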

Writing accented characters to Oracle

I have to update an existing script so that it writes some data to an Oracle 10g database. The script and the database both run on the same Solaris 10 (Intel) machine. Python is v2.4.4.
I'm using cx_Oracle and can read/write to the database with no problem. But the data I'm writing contains accented characters which are not getting written correctly. The accented character turns into an upside-down question mark.
The value is read from a binary file with this code:
class CustomerHeaderRecord:
    def __init__(self, rec, debug=False):
        self.record = rec
        self.acct = rec[84:104]
And the contents of the acct variable displays on-screen correctly.
Below is the code that writes to the db (the acct value is passed in as the val_1 variable):
class MQ:
    def __init__(self, rec, debug=False):
        self.customer_record = CustomerHeaderRecord(rec, debug)
        self.add_record(self.customer_record.acct, self.cm_custid)

    def add_record(self, val_1, val_2):
        cur = conn.cursor()
        qry = "select count(*) from table_name where value1 = :val1"
        cur.execute(qry, {'val1': val_1})
        count = cur.fetchone()
        if count[0] == 0:
            cur = conn.cursor()
            qry = "insert into table_name (value1, value2) values(:val1, :val2)"
            cur.execute(qry, {'val1': val_1, 'val2': val_2})
            conn.commit()
The acct value doesn't make it to the database correctly. I've googled a bunch of stuff about unicode and UTF-8 but haven't found anything that helps me yet. In the database, the NLS_LANGUAGE is 'American' and the NLS_CHARACTERSET is 'AL32UTF8'.
Do I need to 'do something' to/with the acct variable before/during the insert?
Your input file appears to be encoded in Latin-1. Decode this to unicode data; cx_Oracle will do the rest for you:
acct = rec[ 84:104 ].decode('latin1')
or use the codecs.open() function to open the file for automatic decoding:
inputfile = codecs.open(filename, 'r', encoding='latin1')
Reading from inputfile will give you unicode data.
On insertion, the cx_Oracle library will encode unicode values to the correct encoding that Oracle expects. You do need to set the NLS_LANG environment variable to AL32UTF8 before connecting, either in the shell or in Python with:
os.environ["NLS_LANG"] = ".AL32UTF8"
You may want to review the Python Unicode HOWTO for more details.
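Putting those pieces together, a rough sketch of the write path (username, password, dsn, filename, table_name and cm_custid follow the question's context and are illustrative only):
import os
import codecs
import cx_Oracle

os.environ['NLS_LANG'] = '.AL32UTF8'        # set before connecting

conn = cx_Oracle.connect(username, password, dsn)

with codecs.open(filename, 'r', encoding='latin1') as inputfile:
    rec = inputfile.read()
acct = rec[84:104]                           # already unicode thanks to codecs.open

cur = conn.cursor()
cur.execute("insert into table_name (value1, value2) values (:val1, :val2)",
            {'val1': acct, 'val2': cm_custid})
conn.commit()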
