Python, XML and MySQL - ascii v utf8 encoding issues

Python, XML and MySQL - ascii v utf8 encoding issues - python

I have a MySQL table, with XML content stored in a longtext field, encoded as utf8mb4_general_ci
Database Table
I want to use a Python script to read in the XML data from the transcript field, modify an element, and then write the value back to the database.
When I try to get the XML content into an Element using ElementTree.tostring I get the following encoding error:
Traceback (most recent call last):
File "ImageProcessing.py", line 33,
in <module> root = etree.fromstring(row[1])
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etre‌e/ElementTree.py", line 1300,
in XML parser.feed(text)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etre‌ e/ElementTree.py", line 1640,
in feed self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 9568: ordinal not in range(128)
Code:
import datetime
import mysql.connector
import xml.etree.ElementTree as etree
# Creates the config parameters, connects
# to the database and creates a cursor
config = {
'user': 'username',
'password': 'password',
'host': '127.0.0.1',
'database': 'dbname',
'raise_on_warnings': True,
'use_unicode': True,
'charset': 'utf8',
}
cnx = mysql.connector.connect(**config)
cursor = cnx.cursor()
# Structures the SQL query
query = ("SELECT * FROM transcription")
# Executes the query and fetches the first row
cursor.execute(query)
row = cursor.fetchone()
while row is not None:
print(row[0])
#Some of the things I have tried to resolve the encoding issue
#parser = etree.XMLParser(encoding="utf-8")
#root = etree.fromstring(row[1], parser=parser)
#row[1].encode('ascii', 'ignore')
#Line where the encoding error is being thrown
root = etree.fromstring(row[1])
for img in root.iter('img'):
refno = img.text
img.attrib['href']='http://www.link.com/images.jsp?doc=' + refno
print img.tag, img.attrib, img.text
row = cursor.fetchone()
cursor.close()
cnx.close()

You've got everything well setup and your database connection is returning Unicodes, which is a good thing.
Unfortunately, ElementTree's fromstring() requires a byte str not a Unicode. This is so ElementTree can decode it using the encoding defined in the XML header.
You need to use this instead:
utf_8_xml = row[1].encode("utf-8")
root = etree.fromstring(utf_8_xml)

Related

python3 cannot save to mysql

it supposed to collect data from website and save it to database, it can collect data but cannot save it, I have tested connection between mysql is work fine, but it cannot be save. someone please help...
from bs4 import BeautifulSoup
from urllib import request
from urllib import parse
import MySQLdb
db_connection = MySQLdb.connect(host='localhost', db='database', user='root', passwd='123321')
cursor = db_connection.cursor()
url = "https://srh.bankofchina.com/search/whpj/search_cn.jsp"
Form_Data = {}
Form_Data['erectDate'] = ''
Form_Data['nothing'] = ''
Form_Data['pjname'] = '欧元'
data = parse.urlencode(Form_Data).encode('utf-8')
html = request.urlopen(url,data).read()
soup = BeautifulSoup(html,'html.parser')
div = soup.find('div', attrs = {'class':'BOC_main publish'})
table = div.find('table')
tr = table.find_all('tr')
td = tr[1].find_all('td')
c_name = td[0].text.strip()
c_updated = td[3].text.strip()
print(td[0].get_text(),td[3].get_text(),td[6].get_text())
sql = ("INSERT INTO currency_rate(c_name,c_updated)" "VALUES (%s,%s)" %(c_name,c_updated))
try:
cursor.execute(sql)
db_connection.commit()
except:
db_connection.rollback()
db_connection.close()
that is error message:
Traceback (most recent call last):
File "D:\Python\Python38\test1.py", line 33, in
cursor.execute(sql)
File "D:\Python\Python38\lib\site-packages\MySQLdb\cursors.py", line 191, in execute
query = query.encode(db.encoding)
File "D:\Python\Python38\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 52-53: character maps to

Looks like you have a unicode decode/encode error. Have a look at the docs on the subject.
From what I can see on the traceback, it seems that your database uses an encoding that do not support some of ythe characters you are trying to save.
Try change the encoding of your database (also called collation in MySql world) or encode your strings into something that your db can accept before executing the query.
p.s. : it also seem that your sql is a little malformed, looks like you are missing a couple apexes:
try modify this line:
sql = ("INSERT INTO currency_rate(c_name,c_updated)" "VALUES (%s,%s)" %(c_name,c_updated))
into:
sql = ("INSERT INTO currency_rate(c_name,c_updated)" "VALUES ('%s','%s')" %(c_name,c_updated))

Error loading log file data into mysql using cvs format and python

I am trying to take a data from a log file in cvs format, open the log file and inserting row by row into mysql. I am getting an error like this:
ERROR Traceback (most recent call last): File "/Users/alex/PycharmProjects/PA_REPORTING/padb_populate.py", line 26, in VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)', row) File "/Users/alex/anaconda/lib/python2.7/site-packages/MySQLdb/cursors.py", line 187, in execute query = query % tuple([db.literal(item) for item in args]) TypeError: not all arguments converted during string formatting.
import csv
import MySQLdb
mydb = MySQLdb.connect(host='192.168.56.103',
user='user',
passwd='pass',
db='palogdb')
cursor = mydb.cursor()
csv_data = csv.reader(file('/tmp/PALOG_DEMODATA-100.csv'))
for row in csv_data:
cursor.execute('INSERT INTO palogdb(RECEIVE_TIME,SERIAL,TYPE,SUBTYPE,COL1,TIME_GENERATED,SRC,DST,NATSRC,NATDST,RULE,\
SRCUSR,DSTUSR,APP,VSYS1,FROM,TO,INBOUND_IF,OUTBOUND_IF,LOGSET,COL2,SESSIONID,COL3,REPEATCNT,SOURCEPORT,NATSPORT,NATDPORT, \
FLAGS,PROTO,ACTION,BYTES,BYTES_SENT,BYTES_RECEIVED,PACKETS,START,ELAPSED,CATEGORY,COL4,SEQNO,ACTIONFLAGS,SRCLOC,DSTLOC,NONE, \
PKTS_SENT,PKTS_RECEIVED,SESSION_END_REASON) \
VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)', row)
#close the connection to the database.
mydb.commit()
cursor.close()

Is it possible, that you don't have enough data in row for all your %s's? Maybe your row is interpreted as one value, and thus only the first %s is expanded? Try *row to expand the vector to values.
To debug, you could try to build the string passed to execute by some other method, e.g.
sql_string = 'INSERT ... VALUES ({}, {}, {})'.format(*row)
and print it. If you get such an error, you can check, whether the generated string looks reasonable...

exportation data from mysql database using csv

i neeed a python script to generate a csv file from my database XXXX. i wrote thise script but i have something wrong :
import mysql.connector
import csv
filename=open('test.csv','wb')
c=csv.writer(filename)
cnx = mysql.connector.connect(user='XXXXXXX', password='XXXXX',
host='localhost',
database='XXXXX')
cursor = cnx.cursor()
query = ("SELECT `Id_Vendeur`, `Nom`, `Prenom`, `email`, `Num_magasin`, `Nom_de_magasin`, `Identifiant_Filiale`, `Groupe_DV`, `drt_Cartes`.`gain` as 'gain', `Date_Distribution`, `Status_Grattage`, `Date_Grattage` FROM `drt_Cartes_Distribuer`,`drt_Agent`,`drt_Magasin`,`drt_Cartes` where `drt_Cartes_Distribuer`.`Id_Vendeur` = `drt_Agent`.`id_agent` AND `Num_magasin` = `drt_Magasin`.`Numero_de_magasin` AND `drt_Cartes_Distribuer`.`Id_Carte` = `drt_Cartes`.`num_carte`")
cursor.execute(query)
for Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage in cursor:
c.writerow([Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage] )
cursor.close()
filename.close()
cnx.close()
when i executing the command on phpmyadmin its look working very well but from my shell i got thise message :
# python test.py
Traceback (most recent call last):
File "test.py", line 18, in <module>
c.writerow([Id_Vendeur, Nom, Prenom, email, Num_magasin, Nom_de_magasin, Identifiant_Filiale, Groupe_DV, gain, Date_Distribution, Status_Grattage, Date_Grattage] )
UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 5: ordinal not in range(128)

It looks you are using csv for Python 2.7. Quoting docs:
Note This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
Options, choice one of them:
Follow doc link, go to samples section, and modify your code accordantly.
Use a csv packet with unicode supprt like https://pypi.python.org/pypi/unicodecsv

Your data from the database are not only ascii characteres. I suggest you use the 'unicodecvs' python module as suggested in the answer to this question: How to write UTF-8 in a CSV file

Fix UnicodeDecodeError

I have the following code . I use Python 2.7
import csv
import sqlite3
conn = sqlite3.connect('torrents.db')
c = conn.cursor()
# Create table
c.execute('''DROP TABLE torrents''')
c.execute('''CREATE TABLE IF NOT EXISTS torrents
(name text, size long, info_hash text, downloads_count long,
category_id text, seeders long, leechers long)''')
with open('torrents_mini.csv', 'rb') as csvfile:
spamreader = csv.reader(csvfile, delimiter='|')
for row in spamreader:
name = unicode(row[0])
size = row[1]
info_hash = unicode(row[2])
downloads_count = row[3]
category_id = unicode(row[4])
seeders = row[5]
leechers = row[6]
c.execute('INSERT INTO torrents (name, size, info_hash, downloads_count,
category_id, seeders, leechers) VALUES (?,?,?,?,?,?,?)',
(name, size, info_hash, downloads_count, category_id, seeders, leechers))
conn.commit()
conn.close()
The error message I receive is
Traceback (most recent call last):
File "db.py", line 15, in <module>
name = unicode(row[0])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
If I don't convert into unicode then the error i get is
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
adding name = row[0].decode('UTF-8') gives me another error
Traceback (most recent call last):
File "db.py", line 27, in <module>
for row in spamreader:
_csv.Error: line contains NULL byte
the data contained in the csv file is in the following format
Tha Twilight New Moon DVDrip 2009 XviD-AMiABLE|694554360|2cae2fc76d110f35917d5d069282afd8335bc306|0|movies|0|1
Edit:I finally dropped the attempt and accomplished the task using sqlite3 command-line tool(it was quite easy).
I do not yet know what caused the errors , but when sqlite3 was importing the said csv file , it kept popping warnings about "unescaped character", the character being quotes(").
Thanks to everyone who tried to help.

Your data is not encoded as ASCII. Use the correct codec for your data.
You can tell Python what codec to use with:
unicode(row[0], correct_codec)
or use the str.decode() method:
row[0].decode(correct_codec)
What that correct codec is, we cannot tell you. You'll have to consult whatever you got the file from.
If you cannot figure out what encoding was used, you could use a package like chardet to make an educated guess, but take into account that such a library is not fail-proof.

psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xa0

I have done quite a bit of googling on this error and have boiled it down to the fact that the databases I am working with are in different encodings.
The AIX server I am working with is running
psql 8.2.4
server_encoding | LATIN1 | | Client Connection Defaults / Locale and Formatting | Sets the server (database) character set encoding.
The windows 2008 R2 server I am working with is running
psql (9.3.4)
CREATE DATABASE postgres
WITH OWNER = postgres
ENCODING = 'UTF8'
TABLESPACE = pg_default
LC_COLLATE = 'English_Australia.1252'
LC_CTYPE = 'English_Australia.1252'
CONNECTION LIMIT = -1;
COMMENT ON DATABASE postgres
IS 'default administrative connection database';
Now when i try execute my below python script I get this error
Traceback (most recent call last):
File "datamain.py", line 39, in <module>
sys.exit(main())
File "datamain.py", line 33, in main
write_file_to_table("cms_jobdef.txt", "cms_jobdef", con_S104838)
File "datamain.py", line 21, in write_file_to_table
cur.copy_from(f, table, ",")
psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xa0
CONTEXT: COPY cms_jobdef, line 15209
Here is my script
import psycopg2
import StringIO
import sys
import pdb
def connect_db(db, usr, pw, hst, prt):
conn = psycopg2.connect(database=db, user=usr,
password=pw, host=hst, port=prt)
return conn
def write_table_to_file(file, table, connection):
f = open(file, "w")
cur = connection.cursor()
cur.copy_to(f, table, ",")
f.close()
cur.close()
def write_file_to_table(file, table, connection):
f = open(file,"r")
cur = connection.cursor()
cur.copy_from(f, table, ",")
f.close()
cur.close()
def main():
login = open('login.txt','r')
con_tctmsv64 = connect_db("x", "y",
login.readline().strip(),
"d.domain", "c")
con_S104838 = connect_db("x", "y", "z", "a", "b")
try:
write_table_to_file("cms_jobdef.txt", "cms_jobdef", con_tctmsv64)
write_file_to_table("cms_jobdef.txt", "cms_jobdef", con_S104838)
finally:
con_tctmsv64.close()
con_S104838.close()
if __name__ == "__main__":
sys.exit(main())
have removed some sensitive data.
So I'm not sure how I can proceed. As far as I can tell the copy_expert method might help by exporting as a UTF8 encoding. But because the server I am pulling the data from is running 8.2.4 I Dont think it supports COPY encoding format.
I think my best shot is to try and reinstall the postgre database with an encoding of LATIN1 on the windows server. When I try and do that I get the below error.
So im quite stuck,any help would be greatly appreciated!
Update I installed the postgre db on the windows as LATIN1 encoding by changing the default local to 'C'. This however gave me the below error and doesnt seem like a likely successful/correct approach
I have also tried encoding the files in BINARY using the PSQL COPY function
def write_table_to_file(file, table, connection):
f = open(file, "w")
cur = connection.cursor()
#cur.copy_to(f, table, ",")
cur.copy_expert("COPY cms_jobdef TO STDOUT WITH BINARY", f)
f.close()
cur.close()
def write_file_to_table(file, table, connection):
f = open(file,"r")
cur = connection.cursor()
#cur.copy_from(f, table)
cur.copy_expert("COPY cms_jobdef FROM STDOUT WITH BINARY", f)
f.close()
cur.close()
Still no luck I get the same error
DataError: invalid byte sequence for encoding "UTF8": 0xa0
CONTEXT: COPY cms_jobdef, line 15209, column descript
In relation to Phils answer I have tried this approach with still no success.
import psycopg2
import StringIO
import sys
import pdb
import codecs
def connect_db(db, usr, pw, hst, prt):
conn = psycopg2.connect(database=db, user=usr,
password=pw, host=hst, port=prt)
return conn
def write_table_to_file(file, table, connection):
f = open(file, "w")
#fx = codecs.EncodedFile(f,"LATIN1", "UTF8")
cur = connection.cursor()
cur.execute("SHOW client_encoding;")
print cur.fetchone()
cur.copy_to(f, table)
#cur.copy_expert("COPY cms_jobdef TO STDOUT WITH BINARY", f)
f.close()
cur.close()
def write_file_to_table(file, table, connection):
f = open(file,"r")
cur = connection.cursor()
cur.execute("SET CLIENT_ENCODING TO 'LATIN1';")
cur.execute("SHOW client_encoding;")
print cur.fetchone()
cur.copy_from(f, table)
#cur.copy_expert("COPY cms_jobdef FROM STDOUT WITH BINARY", f)
f.close()
cur.close()
def main():
login = open('login.txt','r')
con_tctmsv64 = connect_db("x", "y",
login.readline().strip(),
"ctmtest1.int.corp.sun", "5436")
con_S104838 = connect_db("x", "y", "z", "t", "5432")
try:
write_table_to_file("cms_jobdef.txt", "cms_jobdef", con_tctmsv64)
write_file_to_table("cms_jobdef.txt", "cms_jobdef", con_S104838)
finally:
con_tctmsv64.close()
con_S104838.close()
if __name__ == "__main__":
sys.exit(main())
output
In [4]: %run datamain.py
('sql_ascii',)
('LATIN1',)
In [5]:
This completes successfully but when i run a
select * from cms_jobdef;
Nothing is in the new database
I have even tried converting the file format from LATIN1 to UTF8. Still no luck
The weird thing is when I do this process manually by only using the postgre COPY function it works. I have no idea why. Once again any help would be greatly appreciated.

Turns out there are a few options to solve this problem.
The option to change the clients encoding suggested by Phil does work.
cur.execute("SET CLIENT_ENCODING TO 'LATIN1';")
Another option is to convert the data in the fly. I used a python module called codecs to do this.
f = open(file, "w")
fx = codecs.EncodedFile(f,"LATIN1", "UTF8")
cur = connection.cursor()
cur.execute("SHOW client_encoding;")
print cur.fetchone()
cur.copy_to(fx, table)
The key line being
fx = codecs.EncodedFile(f,"LATIN1", "UTF8")
My main problem was that I was not committing my changes to the database! Silly me :)

I'm in the process of migrating from an SQL_ASCII database to a UTF8 database, and ran into the same problem. Based on this answer, I simply added this statement to the start of my import script:
set client_encoding to 'latin1'
and everything appears to have imported correctly.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python, XML and MySQL - ascii v utf8 encoding issues - python

Related

python3 cannot save to mysql

Error loading log file data into mysql using cvs format and python

exportation data from mysql database using csv

Fix UnicodeDecodeError

psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xa0

Categories

Resources