My requirement is to read some data from a MySQL database and then write it to a file in JSON format. However, while writing to the file, the Unicode data gets garbled.
Actual Unicode Name: ぎぎぎは.exe
Name written to file: ã<81>Žã<81>Žã<81>Žã<81>¯.exe
My database has its charset set to utf8. I am opening the connection like below:
MySQLdb.connect(host="XXXXX", user="XXXXX", passwd="XXXX", cursorclass=cursors.SSCursor, charset='utf8', use_unicode=True)
And the output file is written like below:
with open("XX.json", 'w') as out:
    for r in data:
        d = {}
        d['name'] = r[0]
        d['type'] = 'Work'
        out.write('%s\n' % json.dumps(d, indent=0, ensure_ascii=False).replace('\n', ''))
This works, but as mentioned above the Unicode data gets garbled.
If I do type(r[0]), it comes back as str.
If your solution involves the codecs.open function with encoding='utf-8', then please help me add decode/encode wherever required; that method needs all the data to be unicode.
I am lost in the plethora of solutions available online, but none of them works perfectly for me :(
OS Details: CentOS 6
>>> import sys
>>> sys.getfilesystemencoding()
'UTF-8'
>>>
Try io.open:
# -*- coding: utf8 -*-
import json
import io
with io.open('b.txt', 'at', encoding='utf8') as json_file:
    json_file.write(json.dumps({u'ぎぎぎは': 0}, ensure_ascii=False, encoding='utf8'))
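Applied to the original loop, a minimal sketch might look like this (the data rows are hypothetical stand-ins for the fetched MySQL result set):
# -*- coding: utf8 -*-
import io
import json

# hypothetical rows standing in for the MySQL result set
data = [(u'ぎぎぎは.exe',), (u'report.exe',)]

with io.open('XX.json', 'w', encoding='utf8') as out:
    for r in data:
        d = {'name': r[0], 'type': 'Work'}
        # ensure_ascii=False emits the characters themselves rather than
        # \uXXXX escapes; io.open then encodes the unicode result to UTF-8.
        out.write('%s\n' % json.dumps(d, ensure_ascii=False))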
Related
I'm trying to send data from an XML feed to a MySQL database, but I'm getting wrong pt-BR characters in Python and MySQL.
import MySQLdb
import urllib2
import sys
import codecs
import xmltodict  # missing from the original excerpt, but used below

## default encoding
reload(sys)
sys.setdefaultencoding('utf-8')
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)

file = urllib2.urlopen('feed.xml')
data = file.read()
file.close()

data = xmltodict.parse(data)

db = MySQLdb.connect(host=MYSQL_HOST,     # your host, usually localhost
                     user=MYSQL_USER,     # your username
                     passwd=MYSQL_PASSWD, # your password
                     db=MYSQL_DB)         # name of the database
cur = db.cursor()

# i comes from a loop over the feed items (not shown in this excerpt)
product_name = str(data.items()[0][1].items()[2][1].items()[3][1][i].items()[1][1])
But when I print product_name in Python or insert it into MySQL, I get this:
'Probi\xc3\xb3tica (120caps)'
when it should be:
'Probiótica'
How can I fix this?
'Probi\xc3\xb3tica' is the utf-8 encoded version of 'Probiótica'.
Is your terminal (or whatever you are using to run this) set up to handle utf-8 output?
Try print 'Probi\xc3\xb3tica'.decode('utf-8') to see what happens.
I get Probiótica.
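If the terminal checks out, the likely fix is to stop calling str() on the value: xmltodict hands back unicode, and str() re-encodes it using the default encoding, which is how the raw '\xc3\xb3' bytes leak into your prints and inserts. A sketch under that assumption, reusing the names from the question:
# -*- coding: utf-8 -*-
import MySQLdb

# charset/use_unicode make MySQLdb exchange UTF-8 with the server and
# accept/return unicode objects directly.
db = MySQLdb.connect(host=MYSQL_HOST, user=MYSQL_USER,
                     passwd=MYSQL_PASSWD, db=MYSQL_DB,
                     charset='utf8', use_unicode=True)

# keep the unicode object; don't wrap it in str()
product_name = data.items()[0][1].items()[2][1].items()[3][1][i].items()[1][1]
print product_name  # prints Probiótica (120caps) on a UTF-8 terminal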
In this first example we save two Unicode strings in a file while delegating to codecs the task of encoding them.
# -*- coding: utf-8 -*-
import codecs
cities = [u'Düsseldorf', u'天津市']
with codecs.open("cities", "w", "utf-8") as f:
for c in cities:
f.write(c)
We now do the same thing, first saving the two names to Redis, then reading them back and saving what we've read to a file. Because what we've read back is already UTF-8 encoded, we skip decoding/encoding for that part.
# -*- coding: utf-8 -*-
import redis

r_server = redis.Redis('localhost')  # , decode_responses=True

cities_tag = u'Städte'
cities = [u'Düsseldorf', u'天津市']

for city in cities:
    r_server.sadd(cities_tag.encode('utf8'),
                  city.encode('utf8'))

with open(u'someCities.txt', 'w') as f:
    while r_server.scard(cities_tag.encode('utf8')) != 0:
        city_utf8 = r_server.srandmember(cities_tag.encode('utf8'))
        f.write(city_utf8)
        r_server.srem(cities_tag.encode('utf8'), city_utf8)
How can I replace the line
r_server = redis.Redis('localhost')
with
r_server = redis.Redis('localhost', decode_responses=True)
to avoid the wholesale introduction of .encode/.decode when using redis?
I'm not sure that there is a problem.
If you remove all of the .encode('utf8') calls in your code, it produces a correct file, i.e. the file is the same as the one produced by your current code.
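A stripped-down sketch of what that would look like (my reconstruction, assuming a default Redis client on localhost):
# -*- coding: utf-8 -*-
import redis

r_server = redis.Redis('localhost')

cities_tag = u'Städte'
cities = [u'Düsseldorf', u'天津市']

for city in cities:
    # redis-py encodes unicode arguments to UTF-8 by itself
    r_server.sadd(cities_tag, city)

with open(u'someCities.txt', 'w') as f:
    while r_server.scard(cities_tag) != 0:
        city_utf8 = r_server.srandmember(cities_tag)  # returned as UTF-8 bytes
        f.write(city_utf8)
        r_server.srem(cities_tag, city_utf8)
The interactive session below shows why this works: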
>>> r_server = redis.Redis('localhost')
>>> r_server.keys()
[]
>>> r_server.sadd(u'Hauptstädte', u'東京', u'Godthåb',u'Москва')
3
>>> r_server.keys()
['Hauptst\xc3\xa4dte']
>>> r_server.smembers(u'Hauptstädte')
set(['Godth\xc3\xa5b', '\xd0\x9c\xd0\xbe\xd1\x81\xd0\xba\xd0\xb2\xd0\xb0', '\xe6\x9d\xb1\xe4\xba\xac'])
This shows that keys and values are UTF8 encoded, therefore .encode('utf8') is not required. The default encoding for the redis module is UTF8. This can be changed by passing an encoding when creating the client, e.g. redis.Redis('localhost', encoding='iso-8859-1'), but there's no reason to.
If you enable response decoding with decode_responses=True then the responses will be converted to unicode using the client connection's encoding. This just means that you don't need to explicitly decode the returned data, redis will do it for you and give you back a unicode string:
>>> r_server = redis.Redis('localhost', decode_responses=True)
>>> r_server.keys()
[u'Hauptst\xe4dte']
>>> r_server.smembers(u'Hauptstädte')
set([u'Godth\xe5b', u'\u041c\u043e\u0441\u043a\u0432\u0430', u'\u6771\u4eac'])
So, in your second example where you write data retrieved from redis to a file, if you enable response decoding then you need to open the output file with the desired encoding. If this is the default encoding then you can just use open(). Otherwise you can use codecs.open() or manually encode the data before writing to the file.
import codecs

cities_tag = u'Hauptstädte'
with codecs.open('capitals.txt', 'w', encoding='utf8') as f:
    while r_server.scard(cities_tag) != 0:
        city = r_server.srandmember(cities_tag)
        f.write(city + '\n')
        r_server.srem(cities_tag, city)
How do you change a file's encoding from a Python script?
I've got some files that I'm looping over and doing some other stuff to. But before that, I need to change the encoding of each file from UTF-8 to UTF-16, since SQL Server does not support UTF-8.
I tried this, but it's not working:
data = "UTF-8 data"
udata = data.decode("utf-8")
data = udata.encode("utf-16","ignore")
Cheers!
If you want to convert a file from utf-8 encoding to a file with utf-16 encoding, this script works:
#!/usr/bin/python2.7
import codecs
import shutil
with codecs.open("input_file.utf8.txt", encoding="utf-8") as input_file:
with codecs.open(
"output_file.utf16.txt", "w", encoding="utf-16") as output_file:
shutil.copyfileobj(input_file, output_file)
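Since you mentioned looping over several files, the same pattern extends directly; a sketch with hypothetical file names:
#!/usr/bin/python2.7
import codecs
import shutil

# hypothetical list of input files, each assumed to be UTF-8 encoded
for name in ["a.txt", "b.txt", "c.txt"]:
    with codecs.open(name, encoding="utf-8") as src:
        with codecs.open(name + ".utf16", "w", encoding="utf-16") as dst:
            # the codecs reader yields unicode; the writer encodes it to UTF-16
            shutil.copyfileobj(src, dst)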
About saving in UTF-8
I have a Python script which consumes the Twitter API and saves some tweets in JSON structures.
When it comes to text containing characters like á, é, í, ó or ú, that text isn't saved properly; it gets replaced by an escape like \u00f1.
This is my whole script:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import twitter
import json
from api.search import *
from api.tpt import *
from tweet.tweet import *
from time import *

WORLD_WOE_ID = 23424900

trendingTopics = startTrendsDictionary()
twitterAPI = createAPIConsumer()
print "Twitter API started"

while True:
    mexican_trends = twitterAPI.trends.place(_id=WORLD_WOE_ID)
    print "Current trends fetched"
    with open('../files/trends/trends ' + strftime("%c"), 'w') as trendsFile:
        trendsFile.write(json.dumps(mexican_trends, indent=1))
    print "Current trends at +- {t} saved".format(t=strftime("%c"))
    statuses = harvestTrendingTopicTweets(twitterAPI, trendingTopics, 100, 10)
    print "Harvest done"
    with open('../files/tweets/unclassified ' + strftime("%c"), 'w') as statusesFile:
        statusesFile.write(json.dumps(statuses, indent=1).encode('utf-8'))
    print "File saved"
    print "We're going to wait in order not to overload the Twitter API"
    sleep(2400)
    print "OK, that was enough waiting, here we go again"
I thought that both:
# -*- coding: utf-8 -*-
.encode('utf-8')
would solve it, but they don't.
About reading in UTF-8
When it comes to reading, I'm trying:
import json

with open('file', 'r', buffering=1) as f:
    tweetsJSON = json.load(f)
    for category in tweetsJSON:
        for trend in tweetsJSON[category]:
            for t in tweetsJSON[category][trend]:
                print t['text']
            print
In this case, printed to the console, I can see all those letters displayed properly.
So why, when I open the saved files with a text editor (Sublime Text, in my case), don't they look OK?
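For what it's worth, the \u00f1-style escapes come from json.dumps defaulting to ensure_ascii=True, just as in the first question above; a minimal sketch of the save step with that default overridden (the file name and statuses value are hypothetical stand-ins):
# -*- coding: utf-8 -*-
import codecs
import json

statuses = {u'text': u'mañana á é í ó ú'}  # stand-in for the harvested tweets

with codecs.open('unclassified.json', 'w', encoding='utf-8') as statusesFile:
    # ensure_ascii=False keeps the accented characters literal instead of
    # \uXXXX escapes; codecs.open handles the UTF-8 encoding on write.
    statusesFile.write(json.dumps(statuses, indent=1, ensure_ascii=False))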
I've been trying to scrape data from a website and write out the data that I find to a file. More than 90% of the time I don't run into Unicode errors, but when the data contains characters such as "Burger King®, Hans Café", writing to the file fails, so my error handling prints it to the screen as-is without any further errors.
I've tried the encode and decode functions with various encodings, but to no avail.
Please find an excerpt of my current code below:
import urllib2, sys
import re
import os
import urllib
import string
import time
from BeautifulSoup import BeautifulSoup, NavigableString, SoupStrainer
from string import maketrans
import codecs

f = codecs.open('alldetails7.txt', mode='w', encoding='utf-8', errors='replace')
...
soup5 = BeautifulSoup(html5)
enc_s5 = soup5.originalEncoding
for company in iter(soup5.findAll(height="20px")):
    stream = ""
    count_detail = 1
    for tag in iter(company.findAll('td')):
        if count_detail > 1:
            stream = stream + tag.text.replace(u',', u';')
            if count_detail < 4:
                stream = stream + ","
        count_detail = count_detail + 1
    stream.strip()
    try:
        f.write(str(stnum) + "," + br_name_addr + "," + stream.decode(enc_s5) + os.linesep)
    except:
        print "Unicode error ->" + str(storenum) + "," + branch_name_address + "," + stream
Your f.write() line doesn't make sense to me: stream will be a unicode, since it's built from tag.text and BeautifulSoup gives you Unicode, so you shouldn't call decode on stream. (You use decode to turn a str with a particular character encoding into a unicode.) You've opened the file for writing with codecs.open() and told it to use UTF-8, so you can just write() a unicode and that should work. So, instead, I would try:
f.write(unicode(stnum) + u"," + br_name_addr + u"," + stream + os.linesep)
... or, supposing that instead you had just opened the file with f=open('alldetails7.txt','w'), you would do:
line = unicode(stnum) + u"," + br_name_addr + u"," + stream + os.linesep
f.write(line.encode('utf-8'))
Have you checked the encoding of the file you're writing to, and made sure the characters can be represented in that encoding? Try setting the character encoding to UTF-8 or something else explicitly so that the characters show up.