Problem encoding characters with Python 2.7 - python

It works fine with regular characters but it doesn't work with
accented characters like é,à etc...
Here is the program:
import sqlite3
import webbrowser

def search():
    connection = sqlite3.connect('vocab.sqlite')
    cursor = connection.cursor()
    sql = "SELECT French, English value FROM Ami "
    cursor.execute(sql)
    data = cursor.fetchall()
    data = sorted(data)
    file_open = open('vraiamis.html', 'w')
    for i in data:
        a = '<a href="' + 'http://www.google.fr/#hl=fr&gs_nf=1&cp=4&gs_id=o&xhr=t&q='
        a = a + str(i[0]).encode('latin-1') + '">' + str(i[0]).encode('latin-1') + '</a>' + '<br>'
        file_open.write(a)
    file_open.close()
    webbrowser.open('vraiamis.html')
When the value in the database contains special characters like é, à, ç, it doesn't work and I get the following error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
Thanks in advance for your help

Try
a = a + i[0].encode('latin-1') + '">' + i[0].encode('latin-1') + '</a>' + '<br>'
etc. Your str() calls are the problem: i[0] is already a unicode object, and str() tries to convert it to a bytestring with the default ASCII codec before your .encode('latin-1') ever runs.
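
For illustration, a minimal sketch of the fixed loop (assuming, as sqlite3 does by default in Python 2, that i[0] comes back as a unicode object):

base = 'http://www.google.fr/#hl=fr&gs_nf=1&cp=4&gs_id=o&xhr=t&q='
for i in data:
    word = i[0].encode('latin-1')  # encode the unicode value once
    file_open.write('<a href="' + base + word + '">' + word + '</a>' + '<br>')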

You can write your vraiamis.html in UTF-8 encoding, so that your special characters can be encoded:
def search():
    import codecs
    connection = sqlite3.connect('vocab.sqlite')
    cursor = connection.cursor()
    sql = "SELECT French, English value FROM Ami "
    cursor.execute(sql)
    data = cursor.fetchall()
    data = sorted(data)
    file_open = codecs.open('vraiamis.html', 'w', encoding='utf-8')
    for i in data:
        a = u'<a href="' + u'http://www.google.fr/#hl=fr&gs_nf=1&cp=4&gs_id=o&xhr=t&q='
        a = a + i[0] + u'">' + i[0] + u'</a>' + u'<br>'
        file_open.write(a)
    file_open.close()
    webbrowser.open('vraiamis.html')
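
If you prefer to avoid the codecs module, the built-in io.open on Python 2.6+ behaves the same way; a drop-in equivalent of the codecs.open call above:

import io
file_open = io.open('vraiamis.html', 'w', encoding='utf-8')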

Related

Python 2.7 'ascii' codec can't encode character u'\xe4'

I have run into an encoding problem in Python 2.7. I am already using UTF-8, but I still get the exception
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 81: ordinal not in range(128)"
My files contain many lines like the one below, and for some reason I'm not allowed to delete them:
desktop,[Search] Store | Automated Titles,google / cpc,Titles > Kesäkaverit,275285048,13
I have tried the methods below to avoid the error, but still haven't fixed it. Can anyone help me?
1. Adding "#!/usr/bin/python" to my file header
2. Setting the default encoding:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
3. content = unicode(s3core.download_file_to_memory(S3_PROFILE, S3_RAW + file), "utf-8", "ignore")
My code is below:
content = unicode(s3core.download_file_to_memory(S3_PROFILE, S3_RAW + file), "utf8", "ignore")
rows = content.split('\n')[1:]
for row in rows:
    if not row:
        continue
    try:
        # fetch variables
        cols = row.rstrip('\n').split(',')
        transaction = cols[0]
        device_category = cols[1]
        campaign = cols[2]
        source = cols[3].split('/')[0].strip()
        medium = cols[3].split('/')[1].strip()
        ad_group = cols[4]
        transactions = cols[5]
        data_list.append('\t'.join(
            ['-'.join([dt[:4], dt[4:6], dt[6:]]), country, transaction, device_category, campaign, source,
             medium, ad_group, transactions]))
    except:
        print 'ignoring row: ' + row
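
For what it's worth, the usual pattern that avoids this whole class of error (rather than reaching for sys.setdefaultencoding) is to decode once at the input boundary, keep everything as unicode in between, and encode to UTF-8 once at the output boundary. A sketch under that assumption; raw_bytes stands in for the S3 download and data_list for the list built in the loop above:

import codecs

content = unicode(raw_bytes, 'utf-8', 'ignore')  # decode once on the way in

# ... build data_list from content as unicode, as in the loop above ...

with codecs.open('output.tsv', 'w', encoding='utf-8') as out:
    out.write(u'\n'.join(data_list) + u'\n')  # encode once on the way out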

Unicode symbols in output file in Python 3.6.1

I need to log connection errors to log.txt. My Windows locale is Russian.
My code:
# e is the "requests.ConnectionError" that is raised if the server is not available;
# I take the error, cut out the text I need, and convert it to str
e_warning = str(e.args[0].reason)
# I search for the text I need in the string with "re"
e_lst = re.findall('>:\s(.+)', e_warning)
# I build a string again from the list "re" gives me
e_str = ''.join(e_lst)
# I convert the string to bytes
e_str_unicode = codecs.encode(e_str, 'utf-8')
# this is the message for the warning window
e_str_utf = codecs.decode(e_str_unicode, encoding='utf-8')
messagebox.showerror(title='Connection error', message=e_str)
with codecs.open('log.txt', 'a', encoding='utf-8') as log:
    log.write(strftime(str("%H:%M:%S %Y-%m-%d") + str(e_str_unicode) + '\n'))
If I use "e_str_utf" in the last line it gives me:
UnicodeEncodeError: 'locale' codec can't encode character '\u041f' in position 72: Illegal byte sequence
Makes sense: position 72 is the first Russian letter.
If I use "e_str_unicode" in the last line there is no error, but in the log file I see:
15:25:18 2017-04-28b'Failed to establish a new connection: [WinError 10060] \xd0\x9f\xd0\xbe\xd0\xbf\xd1\x8b\xd1\x82\xd0\xba\xd0\xb0 \xd1\x83\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xb8\xd1\x82\xd1\x8c \xd1\x81\xd0\xbe\xd0\xb5\xd0\xb4\xd0\xb8\xd0\xbd\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb1\xd1\x8b\xd0\xbb\xd0\xb0 \xd0\xb1\xd0\xb5\xd0\xb7\xd1\x83\xd1\x81\xd0\xbf\xd0\xb5\xd1\x88\xd0\xbd\xd0\xbe\xd0\xb9, \xd1\x82.\xd0\xba. \xd0\xbe\xd1\x82 \xd0\xb4\xd1\x80\xd1\x83\xd0\xb3\xd0\xbe\xd0\xb3\xd0\xbe \xd0\xba\xd0\xbe\xd0\xbc\xd0\xbf\xd1\x8c\xd1\x8e\xd1\x82\xd0\xb5\xd1\x80\xd0\xb0 \xd0\xb7\xd0\xb0 \xd1\x82\xd1\x80\xd0\xb5\xd0\xb1\xd1\x83\xd0\xb5\xd0\xbc\xd0\xbe\xd0\xb5 \xd0\xb2\xd1\x80\xd0\xb5\xd0\xbc\xd1\x8f \xd0\xbd\xd0\xb5 \xd0\xbf\xd0\xbe\xd0\xbb\xd1\x83\xd1\x87\xd0\xb5\xd0\xbd \xd0\xbd\xd1\x83\xd0\xb6\xd0\xbd\xd1\x8b\xd0\xb9 \xd0\xbe\xd1\x82\xd0\xba\xd0\xbb\xd0\xb8\xd0\xba, \xd0\xb8\xd0\xbb\xd0\xb8 \xd0\xb1\xd1\x8b\xd0\xbb\xd0\xbe \xd1\x80\xd0\xb0\xd0\xb7\xd0\xbe\xd1\x80\xd0\xb2\xd0\xb0\xd0\xbd\xd0\xbe \xd1\x83\xd0\xb6\xd0\xb5 \xd1\x83\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xbd\xd0\xbe\xd0\xb5 \xd1\x81\xd0\xbe\xd0\xb5\xd0\xb4\xd0\xb8\xd0\xbd\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb8\xd0\xb7-\xd0\xb7\xd0\xb0 \xd0\xbd\xd0\xb5\xd0\xb2\xd0\xb5\xd1\x80\xd0\xbd\xd0\xbe\xd0\xb3\xd0\xbe \xd0\xbe\xd1\x82\xd0\xba\xd0\xbb\xd0\xb8\xd0\xba\xd0\xb0 \xd1\x83\xd0\xb6\xd0\xb5 \xd0\xbf\xd0\xbe\xd0\xb4\xd0\xba\xd0\xbb\xd1\x8e\xd1\x87\xd0\xb5\xd0\xbd\xd0\xbd\xd0\xbe\xd0\xb3\xd0\xbe \xd0\xba\xd0\xbe\xd0\xbc\xd0\xbf\xd1\x8c\xd1\x8e\xd1\x82\xd0\xb5\xd1\x80\xd0\xb0'
As I understand it, encoding='utf-8' in
with codecs.open('log.txt', 'a', encoding='utf-8') as log:
should save the Unicode text as UTF-8 in my file, but for some reason it seems to ignore the encoding setting... Why?
First: there is no need for codecs.open('log.txt', 'a', encoding='utf-8') in Python 3; the built-in open accepts the same encoding argument.
Second: strftime(str("%H:%M:%S %Y-%m-%d") + str(e_str_unicode) + '\n') is not right. str(e_str_unicode) turns the bytes object into its b'...' repr (which is exactly what you see in the log), and the error text should not be passed through strftime at all. It should be strftime("%H:%M:%S %Y-%m-%d") + e_str + '\n', built from the already-decoded string.
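A quick illustration of that repr behaviour (the Russian word is just an example):

b = 'Попытка'.encode('utf-8')
print(str(b))             # b'\xd0\x9f\xd0\xbe\xd0\xbf\xd1\x8b\xd1\x82\xd0\xba\xd0\xb0'
print(b.decode('utf-8'))  # Попытка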
Here is a short example of how to do it:
from time import strftime

text = input()
print(text)
with open('log.text', 'a', encoding='utf-8') as log:
    message = strftime("%H:%M:%S %Y-%m-%d") + '=>' + text + '\n'
    log.write(message)

How to query unicode database with ascii characters

I am currently running a query on my PostgreSQL database that ignores German characters (umlauts). However, I do not want to lose these characters, and would rather have the German characters, or at least their equivalents (e.g. ä = ae), in the output of the query. Running Python 2.7.12.
When I change the encode object to replace or xmlcharrefreplace I get the following error:
psycopg2.ProgrammingError: syntax error at or near "?"
LINE 1: ?SELECT
Code Snippet:
# -*- coding: utf-8 -*-
connection_str = r'postgresql://' + user + ':' + password + '@' + host + '/' + database

def query_db(conn, sql):
    with conn.cursor() as curs:
        curs.execute(sql)
        rows = curs.fetchall()
        print("fetched %s rows from db" % len(rows))
        return rows

with psycopg2.connect(connection_str) as conn:
    for filename in files:
        # Read SQL
        sql = u""
        f = codecs.open(os.path.join(SQL_LOC, filename), "r", "utf-8")
        for line in f:
            sql += line.encode('ascii', 'replace').replace('\r\n', ' ')
        rows = query_db(conn, sql)
How can I pass a query as a unicode object with German characters?
I also tried decoding the query as utf-8, but then I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
Here is a way to obtain an encoded equivalent; you will be able to re-encode it later, and the query will not raise an error:
SELECT convert_from(BYTEA 'foo ᚠ bar'::bytea, 'latin-1');
+----------------+
| convert_from |
|----------------|
| foo á<U+009A>  bar |
+----------------+
SELECT 1
Time: 0.011s
You just need conn.set_client_encoding("utf-8") and then you can execute unicode strings directly; SQL and results will be encoded and decoded on the fly:
$ cat psycopg2-unicode.py
import sys
import os
import psycopg2
import csv
with psycopg2.connect("") as conn:
    conn.set_client_encoding("utf-8")
    for filename in sys.argv[1:]:
        file = open(filename, "r", encoding="utf-8")
        sql = file.read()
        with conn.cursor() as cursor:
            cursor.execute(sql)
            try:
                rows = cursor.fetchall()
            except psycopg2.ProgrammingError as err:
                # No results
                continue
            with open(filename+".out", "w", encoding="utf-8", newline="") as outfile:
                csv.writer(outfile, dialect="excel-tab").writerows(rows)
$ cat sql0.sql
create temporary table t(v) as
select 'The quick brown fox jumps over the lazy dog.'
union all
select 'Zwölf große Boxkämpfer jagen Viktor quer über den Sylter Deich.'
union all
select 'Любя, съешь щипцы, — вздохнёт мэр, — кайф жгуч.'
union all
select 'Mężny bądź, chroń pułk twój i sześć flag.'
;
$ cat sql1.sql
select * from t;
$ python3 psycopg2-unicode.py sql0.sql sql1.sql
$ cat sql1.sql.out
The quick brown fox jumps over the lazy dog.
Zwölf große Boxkämpfer jagen Viktor quer über den Sylter Deich.
Любя, съешь щипцы, — вздохнёт мэр, — кайф жгуч.
Mężny bądź, chroń pułk twój i sześć flag.
A Python 2 version of this program is a little more complicated, as we need to tell the driver that we'd like return values as unicode objects. Also, the csv module I used for output does not support unicode on Python 2, so it needs a workaround. Here it is:
$ cat psycopg2-unicode2.py
from __future__ import print_function
import sys
import os
import csv
import codecs
import psycopg2
import psycopg2.extensions
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
with psycopg2.connect("") as conn:
    conn.set_client_encoding("utf-8")
    for filename in sys.argv[1:]:
        file = codecs.open(filename, "r", encoding="utf-8")
        sql = file.read()
        with conn.cursor() as cursor:
            cursor.execute(sql)
            try:
                rows = cursor.fetchall()
            except psycopg2.ProgrammingError as err:
                # No results from SQL
                continue
            with open(filename+".out", "wb") as outfile:
                for row in rows:
                    row_utf8 = [v.encode('utf-8') for v in row]
                    csv.writer(outfile, dialect="excel-tab").writerow(row_utf8)
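
As an aside, on Python 2 the third-party unicodecsv package is often used as a drop-in replacement for csv that handles the encoding itself, which would remove the manual encode step (a sketch, assuming the package is installed):

import unicodecsv

with open(filename + ".out", "wb") as outfile:
    unicodecsv.writer(outfile, dialect="excel-tab", encoding="utf-8").writerows(rows)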

How to read special characters encoded in UTF-8 in python

I was trying to extract some data from a MySQL database using Python, but I have a problem with special characters (the data are strings in FR, ES, DE and IT). Whenever a word has a special character (like an accent, á, ñ, etc.) it is not encoded properly in the file (I'm creating a CSV with the extracted data).
This is the code I was using:
import mysql.connector

if __name__ == '__main__':
    cnx = mysql.connector.connect(user='user', password='psswrd',
                                  host='slave',
                                  database='DB',
                                  buffered=True)
    us_id_list = ['496305']
    f = open('missing_cat_mappings.csv', 'w')
    for us_id in us_id_list:
        print us_id
        mapping_cursor = cnx.cursor()
        query = (format(user_id=us_id,))
        success = False
        fails = 0
        while not success:
            try:
                print "try" + str(fails)
                mapping_cursor.execute(query)
                success = True
            except:
                fails += 1
                if fails > 10:
                    raise
        for row in mapping_cursor:
            f.write(str(row) + "\n")
        mapping_cursor.close()
    f.close()
    cnx.close()
I added:
#!/usr/bin/python
# vim: set fileencoding=<UTF-8> :
at the beginning, but it didn't make any difference.
Basically you will need to open the CSV file in binary mode, 'wb', not text mode 'w'.
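A minimal sketch of that, also asking the connector for unicode so each field can be encoded explicitly before writing (charset and use_unicode are standard mysql.connector arguments; the per-field handling below is an assumption about the row shape, and mapping_cursor is the cursor from the question):

cnx = mysql.connector.connect(user='user', password='psswrd',
                              host='slave', database='DB',
                              buffered=True, charset='utf8', use_unicode=True)
f = open('missing_cat_mappings.csv', 'wb')  # binary mode
for row in mapping_cursor:
    # encode each unicode field to UTF-8 bytes before writing
    fields = [v.encode('utf-8') if isinstance(v, unicode) else str(v) for v in row]
    f.write(','.join(fields) + '\n')
f.close()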

Python MySQL Bulk Insertion Error with Character Encoding

I'm starting a new project in Python with MySQL.
I'm trying to insert millions of records from a CSV into MySQL through the MySQLdb package.
My code:
import pandas as pd
import MySQLdb

# Connect with MySQL
db = MySQLdb.connect('localhost', 'root', '****', 'MY_DB')
cur = db.cursor()

# Reading CSV
df = pd.read_csv('/home/shankar/LAB/Python/Rough/******.csv')

for i in df.COMPANY_NAME:
    i = i.replace("'", "")
    i = i.replace("\\", "")
    #i = i.encode('latin-1', 'ignore')
    cur.execute("INSERT INTO polls_company (name) VALUES ('" + i + "')")
    db.commit()
This code works fine on some CSV files, but runs into problems on a few of them.
Errors:
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-7-aac849862588> in <module>()
13 i = i.replace("\\","")
14 #i = i.encode('latin-1', 'ignore')
---> 15 cur.execute("INSERT INTO polls_company (name) VALUES ('" + i + "')")
16 db.commit()
/home/shankar/.local/lib/python3.5/site-packages/MySQLdb/cursors.py in execute(self, query, args)
211
212 if isinstance(query, unicode):
--> 213 query = query.encode(db.unicode_literal.charset, 'surrogateescape')
214
215 res = None
UnicodeEncodeError: 'latin-1' codec can't encode character '\ufffd' in position 49: ordinal not in range(256)
This "character encoding" issue occurs with some CSV files only, but I want the insertion to work automatically across the common encodings, because the CSV files may be encoded as "utf-8", "latin-1" and more.
If I use utf-8, I get errors on latin-1 files, and vice versa.
So, is there any way to handle all kinds of CSV files with a common encoding, or any other way to solve this?
[Thanks in advance...]
I would let pandas take care of the encoding, and you don't need to loop through your DataFrame. Let's do it the pandas way:
import pandas as pd
import MySQLdb

# Connect with MySQL
db = MySQLdb.connect('localhost', 'root', '****', 'MY_DB')
cur = db.cursor()

# Reading CSV
df = pd.read_csv('/home/shankar/LAB/Python/Rough/******.csv')

# strip quotes and backslashes, rename the column to "name" and bulk-insert
(df.COMPANY_NAME.str.replace(r"['\\]", "")
   .to_frame('name')
   .to_sql('polls_company', db, if_exists='append', index=False))
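
One caveat: recent pandas versions expect an SQLAlchemy engine (or a sqlite3 connection) in to_sql, so passing the raw MySQLdb connection may be rejected. A sketch of the engine-based variant, with the connection details assumed from the question:

import pandas as pd
from sqlalchemy import create_engine

# utf8 on the connection so accented company names survive the round trip
engine = create_engine('mysql+mysqldb://root:****@localhost/MY_DB?charset=utf8')
df = pd.read_csv('/home/shankar/LAB/Python/Rough/******.csv')
(df.COMPANY_NAME.str.replace(r"['\\]", "")
   .to_frame('name')
   .to_sql('polls_company', engine, if_exists='append', index=False))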
