How to read special characters encoded in UTF-8 in python - python

I was trying to extract some data from a MySQL database using Python, but I have a problem with special characters (the data are strings in FR, ES, DE and IT). Whenever a word has a special character (like an accent, á, ñ, etc.) it is not encoded properly in the file (I'm creating a CSV with the extracted data).
This is the code I was using:
import mysql.connector

if __name__ == '__main__':
    cnx = mysql.connector.connect(user='user', password='psswrd',
                                  host='slave',
                                  database='DB',
                                  buffered=True)
    us_id_list = ['496305']
    f = open('missing_cat_mappings.csv', 'w')
    for (us_id) in us_id_list:
        print us_id
        mapping_cursor = cnx.cursor()
        query = (format(user_id=us_id,))  # query template elided in the original post
        success = False
        fails = 0
        while not success:
            try:
                print "try" + str(fails)
                mapping_cursor.execute(query)
                success = True
            except:
                fails += 1
                if fails > 10:
                    raise
        for row in mapping_cursor:
            f.write(str(row) + "\n")
        mapping_cursor.close()
    f.close()
    cnx.close()
I added:
#!/usr/bin/python
# vim: set fileencoding=<UTF-8> :
at the beginning but it didn't make any difference.

Basically you will need to open the CSV file in binary mode ('wb', not text mode 'w') and encode each unicode value to UTF-8 before writing it. The fileencoding declaration only tells Python how to decode the script's own source file, not the data coming back from MySQL.
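As a hedged sketch of what that looks like (the query was elided in the original post, so the one below is purely hypothetical, as are the charset options):

import csv
import mysql.connector

# Sketch only: same connection settings as above; charset/use_unicode ask
# the driver to hand back unicode objects instead of raw bytes.
cnx = mysql.connector.connect(user='user', password='psswrd', host='slave',
                              database='DB', buffered=True,
                              charset='utf8', use_unicode=True)
cursor = cnx.cursor()
cursor.execute("SELECT name, category FROM mappings")  # hypothetical query

f = open('missing_cat_mappings.csv', 'wb')  # binary mode, not 'w'
writer = csv.writer(f)
for row in cursor:
    # encode unicode values to UTF-8 bytes before the csv module sees them
    writer.writerow([v.encode('utf-8') if isinstance(v, unicode) else v
                     for v in row])
f.close()
cursor.close()
cnx.close()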

Related

How to query unicode database with ascii characters

I am currently running a query on my PostgreSQL database that ignores German characters (umlauts). However, I do not want to lose these characters and would rather have the German characters, or at least their equivalent (e.g. ä = ae), in the output of the query. Running Python 2.7.12.
When I change the encode error handler to replace or xmlcharrefreplace I get the following error:
psycopg2.ProgrammingError: syntax error at or near "?"
LINE 1: ?SELECT
Code snippet:
# -*- coding: utf-8 -*-
connection_str = r'postgresql://' + user + ':' + password + '@' + host + '/' + database

def query_db(conn, sql):
    with conn.cursor() as curs:
        curs.execute(sql)
        rows = curs.fetchall()
        print("fetched %s rows from db" % len(rows))
        return rows

with psycopg2.connect(connection_str) as conn:
    for filename in files:
        # Read SQL
        sql = u""
        f = codecs.open(os.path.join(SQL_LOC, filename), "r", "utf-8")
        for line in f:
            sql += line.encode('ascii', 'replace').replace('\r\n', ' ')
        rows = query_db(conn, sql)
How can I pass a query as a unicode object with German characters?
I also tried decoding the query as utf-8, but then I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
Here is a solution to obtain their encoded equivalent. You will be able to re-encode it later and the query will not create an error:
SELECT convert_from(BYTEA 'foo ᚠ bar'::bytea, 'latin-1');
+--------------------+
| convert_from       |
|--------------------|
| foo á<U+009A>  bar |
+--------------------+
SELECT 1
Time: 0.011s
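A sketch of issuing that same query from psycopg2 (this assumes an already-open connection named conn):

# Sketch only: `conn` is assumed to be an open psycopg2 connection.
conn.set_client_encoding('UTF8')
cur = conn.cursor()
cur.execute(u"SELECT convert_from(BYTEA 'foo \u16a0 bar'::bytea, 'latin-1')")
print cur.fetchone()[0]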
You just need conn.set_client_encoding("utf-8") and then you can execute unicode strings directly; SQL and results will be encoded and decoded on the fly:
$ cat psycopg2-unicode.py
import sys
import os
import psycopg2
import csv

with psycopg2.connect("") as conn:
    conn.set_client_encoding("utf-8")
    for filename in sys.argv[1:]:
        file = open(filename, "r", encoding="utf-8")
        sql = file.read()
        with conn.cursor() as cursor:
            cursor.execute(sql)
            try:
                rows = cursor.fetchall()
            except psycopg2.ProgrammingError as err:
                # No results
                continue
            with open(filename + ".out", "w", encoding="utf-8", newline="") as outfile:
                csv.writer(outfile, dialect="excel-tab").writerows(rows)
$ cat sql0.sql
create temporary table t(v) as
select 'The quick brown fox jumps over the lazy dog.'
union all
select 'Zwölf große Boxkämpfer jagen Viktor quer über den Sylter Deich.'
union all
select 'Любя, съешь щипцы, — вздохнёт мэр, — кайф жгуч.'
union all
select 'Mężny bądź, chroń pułk twój i sześć flag.'
;
$ cat sql1.sql
select * from t;
$ python3 psycopg2-unicode.py sql0.sql sql1.sql
$ cat sql1.sql.out
The quick brown fox jumps over the lazy dog.
Zwölf große Boxkämpfer jagen Viktor quer über den Sylter Deich.
Любя, съешь щипцы, — вздохнёт мэр, — кайф жгуч.
Mężny bądź, chroń pułk twój i sześć flag.
A Python 2 version of this program is a little bit more complicated, as we need to tell the driver that we'd like return values as unicode objects. Also, the csv module I used for output does not support unicode in Python 2, so it needs a workaround. Here it is:
$ cat psycopg2-unicode2.py
from __future__ import print_function
import sys
import os
import csv
import codecs
import psycopg2
import psycopg2.extensions

psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

with psycopg2.connect("") as conn:
    conn.set_client_encoding("utf-8")
    for filename in sys.argv[1:]:
        file = codecs.open(filename, "r", encoding="utf-8")
        sql = file.read()
        with conn.cursor() as cursor:
            cursor.execute(sql)
            try:
                rows = cursor.fetchall()
            except psycopg2.ProgrammingError as err:
                # No results from SQL
                continue
            with open(filename + ".out", "wb") as outfile:
                for row in rows:
                    row_utf8 = [v.encode('utf-8') for v in row]
                    csv.writer(outfile, dialect="excel-tab").writerow(row_utf8)

Python Encoding Issue with JSON and CSV

I am having an encoding issue when I run my script below.
Here is the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 9: ordinal not in range(128)
Here is my script:
import logging
import urllib
import csv
import json
import io
import codecs

with open('/home/local/apple.csv', 'rb') as csvinput:
    reader = csv.reader(csvinput, delimiter=',')
    firstline = True
    for row in reader:
        if firstline:
            firstline = False
            continue
        address1 = row[0]
        print row[0]
        locality = row[1]
        admin_area = row[2]
        query = ' '.join(str(x) for x in (address1, locality, admin_area))
        normalized = query.replace(" ", "+")
        BaseURL = 'http://localhost:8080/verify?country=JP&freeform='
        URL = BaseURL + normalized
        print URL
        data = urllib.urlopen(URL)
        response = data.getcode()
        print response
        if response == 200:
            file = json.load(data)
            print file
            output_f = open('output.csv', 'wb')
            csvwriter = csv.writer(output_f)
            count = 0
            for f in file:
                if count == 0:
                    header = f.keys()
                    csvwriter.writerow(header)
                    count += 1
                csvwriter.writerow(f.values())
            output_f.close()
        else:
            print 'error'
Can anyone help me fix this? It's getting really annoying. I need to encode to UTF-8.
Looks like you are using Python 2.x. Instead of Python's standard open, use codecs.open, which lets you pass an encoding to use and say what to do when there are errors. This gets a little less confusing in Python 3, where the standard open can do this.
So in your two lines where you are opening, do:
with codecs.open('/home/local/apple.csv', 'rb', 'utf-8') as csvinput:
output_f = codecs.open('output.csv', 'wb', 'utf-8')
The optional errors parameter defaults to "strict", which raises an exception if the bytes can't be mapped to the given encoding. In some contexts you may want to use 'ignore' or 'replace'.
See the Python docs for a bit more info.
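For example, a rough sketch of the difference, using the file from the question:

import codecs

# errors='strict' (the default) raises UnicodeDecodeError on undecodable
# bytes; 'replace' substitutes U+FFFD; 'ignore' silently drops them.
with codecs.open('/home/local/apple.csv', 'rb', 'utf-8',
                 errors='replace') as csvinput:
    for line in csvinput:
        print line.rstrip()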

Encoding UTF-8 when writing to CSV

I have some simple code to ingest some JSON Twitter data, and output some specific fields into separate columns of a CSV file. My problem is that I cannot for the life of me figure out the proper way to encode the output as UTF-8. Below is the closest I've been able to get, with the help of a member here, but it still isn't running correctly and fails because of the unique characters in the tweet text field.
import json
import sys
import csv
import codecs

def main():
    writer = csv.writer(codecs.getwriter("utf-8")(sys.stdout), delimiter="\t")
    for line in sys.stdin:
        line = line.strip()
        data = []
        try:
            data.append(json.loads(line))
        except ValueError as detail:
            continue
        for tweet in data:
            ## deletes any rate limited data
            if tweet.has_key('limit'):
                pass
            else:
                writer.writerow([
                    tweet['id_str'],
                    tweet['user']['screen_name'],
                    tweet['text']
                ])

if __name__ == '__main__':
    main()
From the docs:
https://docs.python.org/2/howto/unicode.html
a = "string"
encodedstring = a.encode('utf-8')
If that does not work:
Python DictWriter writing UTF-8 encoded CSV files
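Applied to the writerow call in the question, that suggestion would look roughly like this (a sketch: it assumes the tweet fields are unicode objects, as json.loads returns, and writes to the plain byte stream rather than the codecs wrapper):

writer = csv.writer(sys.stdout, delimiter="\t")
writer.writerow([
    tweet['id_str'].encode('utf-8'),
    tweet['user']['screen_name'].encode('utf-8'),
    tweet['text'].encode('utf-8'),
])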
I have had the same problem. I have a large amount of data from the Twitter firehose, so every possible complication case has arisen!
I've solved it as follows using try/except:
If the dict value is a string (if isinstance(value, basestring)), I try to encode it straight away. If it is not a string, I make it a string and then encode it.
If this fails, it's because some joker is tweeting odd symbols to mess up my script. If that is the case, I first decode and then re-encode: value.decode('utf-8').encode('utf-8') for strings, and decode, make into a string and re-encode for non-strings: str(value.decode('utf-8')).encode('utf-8').
Have a go with this:
import csv

def export_to_csv(list_of_tweet_dicts, export_name="flat_twitter_output.csv"):
    utf8_flat_tweets = []
    keys = []
    for tweet in list_of_tweet_dicts:
        tmp_tweet = tweet
        for key, value in tweet.iteritems():
            if key not in keys: keys.append(key)
            # convert fields to utf-8 if text
            try:
                if isinstance(value, basestring):
                    tmp_tweet[key] = value.encode('utf-8')
                else:
                    tmp_tweet[key] = str(value).encode('utf-8')
            except:
                if isinstance(value, basestring):
                    tmp_tweet[key] = value.decode('utf-8').encode('utf-8')
                else:
                    tmp_tweet[key] = str(value.decode('utf-8')).encode('utf-8')
        utf8_flat_tweets.append(tmp_tweet)
        del tmp_tweet
    list_of_tweet_dicts = utf8_flat_tweets
    del utf8_flat_tweets
    with open(export_name, 'w') as f:
        dict_writer = csv.DictWriter(f, fieldnames=keys, quoting=csv.QUOTE_ALL)
        dict_writer.writeheader()
        dict_writer.writerows(list_of_tweet_dicts)
    print "exported tweets to '" + export_name + "'"
    return list_of_tweet_dicts
hope that helps you.

Unicode Decode Error in Python with files

So I'm having this trouble with decoding. I found in other threads how to do it for simple strings, with u'string'.encode, but I can't find a way to make it work with files.
Any help would be appreciated!
Here's the code.
text = file.read()
text.replace(txt.encode('utf-8'), novo_txt.encode('utf-8'))
file.seek(0) # rewind
file.write(text.encode('utf-8'))
and here's the whole code, in case it helps.
#!/usr/bin/env python
# coding: utf-8
"""
Script that helps translate some code's methods from
Portuguese to English.
"""
from multiprocessing import Pool
from mock import MagicMock
from goslate import Goslate
import fnmatch
import logging
import os
import re
import urllib2

_MAX_PEERS = 1

try:
    os.remove('traducoes.log')
except OSError:
    pass

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
handler = logging.FileHandler('traducoes.log')
logger.addHandler(handler)

def fileWalker(ext, dirname, names):
    """
    Find the files with the correct extension
    """
    pat = "*" + ext[0]
    for f in names:
        if fnmatch.fnmatch(f, pat):
            ext[1].append(os.path.join(dirname, f))

def encontre_text(file):
    """
    Find in the string the words which have '_' in them
    """
    text = file.read().decode('utf-8')
    return re.findall(r"\w+(?<=_)\w+", text)
    #return re.findall(r"\"\w+\"", text)

def traduza_palavra(txt):
    """
    Translate the word/phrase to English
    """
    try:
        # try to connect to google
        response = urllib2.urlopen('http://google.com', timeout=2)
        pass
    except urllib2.URLError as err:
        print "No network connection"
        exit(-1)
    if txt[0] != '_':
        txt = txt.replace('_', ' ')
    txt = txt.replace('media'.decode('utf-8'), 'média'.decode('utf-8'))
    gs = Goslate()
    #txt = gs.translate(txt, 'en', gs.detect(txt))
    txt = gs.translate(txt, 'en', 'pt-br')  # forcing Brazilian Portuguese as the source language
    txt = txt.replace(' en ', ' br ')
    return txt.replace(' ', '_')  # .lower()

def subistitua(file, txt, novo_txt):
    """
    should rewrite the file with the new text in the future
    """
    text = file.read()
    text.replace(txt.encode('utf-8'), novo_txt.encode('utf-8'))
    file.seek(0)  # rewind
    file.write(text.encode('utf-8'))

def magica(File):
    """
    Thread pool. Every single thread should play around here with
    one element from the list of files
    """
    global _DONE
    if _MAX_PEERS == 1:  # not viable multithreaded
        logger.info('\n---- File %s' % File)
    with open(File, "r+") as file:
        list_txt = encontre_text(file)
        for txt in list_txt:
            novo_txt = traduza_palavra(txt)
            if txt != novo_txt:
                logger.info('%s -> %s [%s]' % (txt, novo_txt, File))
                subistitua(file, txt, novo_txt)
        file.close()
    print File.ljust(70) + '[OK]'.rjust(5)

if __name__ == '__main__':
    try:
        response = urllib2.urlopen('http://www.google.com.br', timeout=1)
    except urllib2.URLError as err:
        print "No network connection"
        exit(-1)
    root = './app'
    ex = ".py"
    files = []
    os.path.walk(root, fileWalker, [ex, files])
    print '%d files found to be translated' % len(files)
    try:
        if _MAX_PEERS > 1:
            _pool = Pool(processes=_MAX_PEERS)
            result = _pool.map_async(magica, files)
            result.wait()
        else:
            result = MagicMock()
            result.successful.return_value = False
            for f in files:
                magica(f)
            result.successful.return_value = True
    except AssertionError, e:
        print e
    else:
        pass
    finally:
        if result.successful():
            print 'Translated all files'
        else:
            print 'Some files were not translated'
Thank you all for the help!
In Python 2, reading from files produces regular (byte) string objects, not unicode objects. There is no need to call .encode() on these; in fact, that'll only trigger an automatic decode to Unicode first, which can fail.
Rule of thumb: use a unicode sandwich. Whenever you read data, you decode to unicode at that stage. Use unicode values throughout your code. Whenever you write data, encode at that point. You can use io.open() to open file objects that encode and decode automatically for you.
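A minimal sketch of that sandwich using io.open (the file name here is just an example):

# -*- coding: utf-8 -*-
import io

with io.open('some_file.py', 'r', encoding='utf-8') as f:
    text = f.read()                       # decoded: text is unicode

text = text.replace(u'media', u'média')  # work on unicode only

with io.open('some_file.py', 'w', encoding='utf-8') as f:
    f.write(text)                         # encoded back to UTF-8 on write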
That also means you can use unicode literals everywhere; for your regular expressions, for your string literals. So use:
def encontre_text(file):
    text = file.read()  # assume `io.open()` was used
    return re.findall(ur"\w+(?<=_)\w+", text)  # use a unicode pattern
and
def subistitua(file, txt, novo_txt):
    text = file.read()  # assume `io.open()` was used
    text = text.replace(txt, novo_txt)
    file.seek(0)  # rewind
    file.write(text)
as all string values in the program are already unicode, and
txt = txt.replace(u'media', u'média')
as u'..' unicode string literals don't need decoding anymore.

Swedish characters in python error

I am making a program that uses words with Swedish characters and stores them in a list. I can print Swedish characters before I put them into a list, but after they are put in, they do not appear normally, just a big mess of characters.
Here is my code:
# coding=UTF-8

def get_word(lines, eng=0):
    if eng == 1:  # function to get word in english
        word_start = lines[1]

def do_format(word, lang):
    if lang == "sv":
        first_word = word
        second_word = translate(word, lang)
        element = first_word + " - " + second_word
    elif lang == "en":
        first_word = translate(word, lang)
        second_word = word
        element = first_word + " - " + second_word
    return element

def translate(word, lang):
    if lang == "sv":
        return "ENGLISH"
    if lang == "en":
        return "SWEDISH"

translated = []
path = "C:\Users\LK\Desktop\Dropbox\Dokumentai\School\Swedish\V47.txt"
doc = open(path, 'r')  # opens the document
doc_list = []  # the variable that will contain the list of words

for lines in doc.readlines():  # repeat as many times as there are lines
    if len(lines) > 1:  # ignore empty lines
        lines = lines.rstrip()  # don't add "\n" at the end
        doc_list.append(lines)  # add to the list

for i in doc_list:
    print i

for i in doc_list:
    if "-" in i:
        if i[0] == "-":
            element = do_format(i[2:], "en")
            translated.append(element)
        else:
            translated.append(i)
    else:
        element = do_format(i, "sv")
        translated.append(element)

print translated
raw_input()
I can reduce the problem to this simple piece of code:
# -*- coding: utf-8 -*-
test_string = "ö"
test_list = ["å"]
print test_string, test_list
If I run that, I get this
ö ['\xc3\xa5']
There are multiple things to notice:
The broken character. This seems to happen because your Python outputs UTF-8 but your terminal is configured in some ISO-8859-X mode (hence the two characters). I'd try to use proper unicode strings in Python 2 (always u"ö" instead of "ö") and check your locale settings (the locale command on Linux).
The weird string in the list. In Python, print e will print out str(e). For lists (such as ["å"]) the implementation of __str__ is the same as __repr__. And since repr(some_list) calls repr on each of the elements contained in the list, you end up with the string you see.
Example for repr(string):
>>> print u"ö"
ö
>>> print repr(u"ö")
u'\xf6'
>>> print repr("ö")
'\xc3\xb6'
If you print a list, it is printed as a structure. You should convert it to a string first, for example with the join() string method. With your test code it might look like:
print test_string, test_list
print('%s, %s, %s' % (test_string, test_list[0], ','.join(test_list)))
And output:
ö ['\xc3\xa5']
ö, å, å
I think in your main program you can:
print('%s' % (', '.join(translated)))
You can use the codecs module to specify the encoding of the bytes being read.
import codecs
doc = codecs.open(path, 'r', encoding='utf-8') #opens the document
Files opened with codecs.open will give you unicode strings after the raw bytes are decoded with the specified encoding.
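As a sketch, reading the word list this way (with the path shortened here) keeps everything unicode:

# -*- coding: utf-8 -*-
import codecs

doc_list = []
with codecs.open('V47.txt', 'r', encoding='utf-8') as doc:
    for line in doc:
        line = line.rstrip()
        if line:
            doc_list.append(line)

# join the unicode strings before printing, so the list's repr() never shows
print u', '.join(doc_list)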
In your code, prefix your string literals with u to make them unicode strings.
# -*- coding: utf-8 -*-
test_string = u"ö"
test_list = [u"å"]
print test_string, test_list[0]
