Getting wrong pt-BR characters from XML in Python

I'm trying to send data from an XML feed to a MySQL database, but I'm getting wrong pt-BR characters in both Python and MySQL.
import MySQLdb
import urllib2
import sys
import codecs
import xmltodict
# force the default encoding to UTF-8 (Python 2 only)
reload(sys)
sys.setdefaultencoding('utf-8')
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
# download and parse the feed
file = urllib2.urlopen('feed.xml')
data = file.read()
file.close()
data = xmltodict.parse(data)
db = MySQLdb.connect(host=MYSQL_HOST,      # your host, usually localhost
                     user=MYSQL_USER,      # your username
                     passwd=MYSQL_PASSWD,  # your password
                     db=MYSQL_DB)          # name of the database
cur = db.cursor()
product_name = str(data.items()[0][1].items()[2][1].items()[3][1][i].items()[1][1])
But when I print product_name in Python or insert it into MySQL, I get this:
'Probi\xc3\xb3tica (120caps)'
It should be:
'Probiótica'
How can I fix this?

'Probi\xc3\xb3tica' is the UTF-8 encoded byte string for 'Probiótica', so the data itself is not corrupted.
Is your terminal (or whatever you use to run this) set up to handle UTF-8 output?
Try print 'Probi\xc3\xb3tica'.decode('utf-8') and see what happens.
When I do that, I get Probiótica.
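The decoding step suggested above can be sketched as follows. This is a minimal illustration written for Python 3 (the thread's code is Python 2), and the variable names raw and text are made up for the example:

```python
# The bytes shown in the question are the UTF-8 encoding of the product name.
raw = b'Probi\xc3\xb3tica (120caps)'

# Decoding turns the byte string into text; the two bytes \xc3\xb3 become "ó".
text = raw.decode('utf-8')

print(text)
print(len(raw), len(text))  # the byte string is one unit longer than the text
```

If the decoded text prints correctly, the feed data is intact and only the bytes/text distinction needs handling.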

Related

Python: Unicode characters garbled while writing to a file

I need to read some data from a MySQL database and write it to a file in JSON format. However, the Unicode data comes out garbled in the file.
Actual Unicode Name: ぎぎぎは.exe
Name written to file: ã<81>Žã<81>Žã<81>Žã<81>¯.exe
My database charset is set to utf8. I open the connection like this:
MySQLdb.connect(host="XXXXX", user="XXXXX", passwd="XXXX", cursorclass=cursors.SSCursor, charset='utf8', use_unicode=True)
And the outfile is opened as below:
for r in data:
    with open("XX.json", 'w') as out:
        d = {}
        d['name'] = r[0]
        d['type'] = 'Work'
        out.write('%s\n' % json.dumps(d, indent=0, ensure_ascii=False).replace('\n', ''))
This works, but as mentioned above, the Unicode data gets garbled.
type(r[0]) reports 'str'.
If your solution involves the codecs.open function with encoding set to 'utf-8', please help me add decode/encode wherever required, since that method needs all data to be unicode.
I am lost in the plethora of solutions available online, but none of them works properly for me :(
OS Details: CentOS 6
>>> import sys
>>> sys.getfilesystemencoding()
'UTF-8'
>>>
Try io.open:
# -*- coding: utf8 -*-
import json
import io
with io.open('b.txt', 'at', encoding='utf8') as json_file:
    json_file.write(json.dumps({u'ぎぎぎは': 0}, ensure_ascii=False, encoding='utf8'))
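As a rough sketch of the same idea applied to the question's records, the helper below writes one JSON object per line through io.open with an explicit encoding. It is written for Python 3, where json.dumps takes no encoding argument; the function name write_json_line and the temporary file path are assumptions made for the example:

```python
import io
import json
import os
import tempfile


def write_json_line(path, record):
    """Append one record as a single line of JSON, keeping non-ASCII text readable."""
    # io.open with an explicit encoding handles text-to-bytes conversion on write.
    with io.open(path, 'a', encoding='utf-8') as out:
        out.write(json.dumps(record, ensure_ascii=False) + u'\n')


# Hypothetical output location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'names.json')
write_json_line(path, {'name': u'ぎぎぎは.exe', 'type': 'Work'})

# Reading the file back with the same encoding recovers the original name.
with io.open(path, encoding='utf-8') as f:
    loaded = json.loads(f.readline())
print(loaded['name'])
```

Opening the file in append mode avoids overwriting earlier records when the helper is called inside a loop.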

Upload file (size <16MB) to MongoDB

I need to upload files to MongoDB. Currently I save files to a folder on the local filesystem using Flask. Is there a way to upload a file to MongoDB without using GridFS? I believe I did something like this long ago, but I can't recall how, since it has been a long time since I last used MongoDB.
No file I select for upload is larger than 16 MB.
Update: I tried to convert an image file using binData, but it throws the error "global name binData is not defined".
import pymongo
import base64
import bson
# establish a connection to the database
connection = pymongo.MongoClient()
#get a handle to the test database
db = connection.test
file_meta = db.file_meta
file_used = "Headshot.jpg"
def main():
    coll = db.sample
    with open(file_used, "r") as fin:
        f = fin.read()
        encoded = binData(f)
    coll.insert({"filename": file_used, "file": f, "description": "test" })
MongoDB's BSON format (https://docs.mongodb.com/manual/reference/bson-types/) has a binary data (binData) type for fields.
The Python driver (http://api.mongodb.com/python/current/api/bson/binary.html) supports it.
You can store the file as an array of bytes.
Your code needs only slight changes:
Add the import: from bson.binary import Binary
Open the file in binary mode ("rb") and wrap its bytes in Binary: encoded = Binary(f.read())
Use the encoded value in the insert statement.
Full example below:
import pymongo
import base64
import bson
from bson.binary import Binary
# establish a connection to the database
connection = pymongo.MongoClient()
#get a handle to the test database
db = connection.test
file_meta = db.file_meta
file_used = "Headshot.jpg"
def main():
    coll = db.sample
    with open(file_used, "rb") as f:
        encoded = Binary(f.read())
    # insert_one replaces insert(), which was removed in PyMongo 4
    coll.insert_one({"filename": file_used, "file": encoded, "description": "test"})
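Because the whole document, not just the file field, has to fit within MongoDB's 16 MB BSON document limit, it may help to check the payload size before inserting. The sketch below uses only the standard library and does not talk to MongoDB; read_payload and reserved_bytes are hypothetical names, and the 1024-byte allowance for the other fields is an assumption rather than a measured value:

```python
import os
import tempfile

# MongoDB's maximum BSON document size is 16 megabytes.
MAX_BSON_DOCUMENT_BYTES = 16 * 1024 * 1024


def read_payload(path, reserved_bytes=1024):
    """Read a file as raw bytes and check that it leaves room for other fields.

    reserved_bytes is an assumed allowance for the filename, description
    and BSON overhead stored alongside the file content.
    """
    # Binary mode ("rb") returns the bytes unchanged, which is what Binary() expects.
    with open(path, "rb") as f:
        payload = f.read()
    if len(payload) + reserved_bytes > MAX_BSON_DOCUMENT_BYTES:
        raise ValueError("file too large for a single document; consider GridFS")
    return payload


# Demonstration with a small stand-in file (not a real image).
demo_path = os.path.join(tempfile.mkdtemp(), "Headshot.jpg")
with open(demo_path, "wb") as f:
    f.write(b"\xff\xd8\xff\xe0" + b"\x00" * 100)

payload = read_payload(demo_path)
print(len(payload))
```

The returned bytes could then be wrapped in Binary and inserted as in the full example above.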

Why isn't my text file being written in UTF-8 encoding in Python, and how can I read it properly?

About saving in UTF-8
I have a Python script that consumes the Twitter API and saves some tweets as JSON structures.
When the text contains characters like á, é, í, ó or ú, it is not saved as is; each such character is replaced by an escape sequence like \u00f1:
This is my whole script:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import twitter
import json
from api.search import *
from api.tpt import *
from tweet.tweet import *
from time import *
WORLD_WOE_ID = 23424900
trendingTopics = startTrendsDictionary()
twitterAPI = createAPIConsumer()
print "Twitter API started"
while True:
    mexican_trends = twitterAPI.trends.place(_id=WORLD_WOE_ID)
    print "Current trends fetched"
    with open('../files/trends/trends ' + strftime("%c"), 'w') as trendsFile:
        trendsFile.write(json.dumps(mexican_trends, indent=1))
    print "Current trends at +- {t} saved".format(t=strftime("%c"))
    statuses = harvestTrendingTopicTweets(twitterAPI, trendingTopics, 100, 10)
    print "Harvest done"
    with open('../files/tweets/unclassified ' + strftime("%c"), 'w') as statusesFile:
        statusesFile.write(json.dumps(statuses, indent=1).encode('utf-8'))
    print "File saved"
    print "We're going to wait in order not to fed up Twitter API"
    sleep(2400)
    print "OK, it was enough waiting, here we go again"
I thought that both:
# -*- coding: utf-8 -*-
.encode('utf-8'))
would solve it, but they don't.
About reading in UTF-8
When it comes to reading, I'm trying:
import json
with open('file', 'r', buffering=1) as f:
    tweetsJSON = json.load(f)
for category in tweetsJSON:
    for trend in tweetsJSON[category]:
        for t in tweetsJSON[category][trend]:
            print t['text']
        print
In this case, printed to the console, I can see all those letters displayed properly.
So why don't the saved files look right when I open them in a text editor (Sublime Text, in my case)?
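One likely explanation, offered as a sketch rather than a confirmed diagnosis of this script: json.dumps escapes every non-ASCII character as \uXXXX by default (ensure_ascii=True). The result is valid JSON, which is why json.load restores the letters on reading, but a text editor shows the escapes literally. The example below is written for Python 3, and the sample tweet text is made up:

```python
import json

# Made-up sample standing in for a tweet's text field.
tweet = {'text': 'ñandú'}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(tweet)

# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(tweet, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode back to the same data.
assert json.loads(escaped) == json.loads(readable) == tweet
```

When writing the readable form to disk, the file would also need to be opened with a UTF-8 encoding, for example through io.open(path, 'w', encoding='utf-8').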

How to fetch xml data to write in a database

I'm working on a Python script that reads XML data from a server and stores it in a database. When I create the database, the script writes the raw XML into the database file instead of storing each entry, and it does not create the database table. It looks like this: http://imageshack.com/a/img401/4210/ofa5.jpg
The XML I get from the server: http://ontv.dk/xmltv/c81e728d9d4c2f636f067f89cc14862c
Here is the current code:
import xbmc
import xbmcgui
import xbmcaddon
import os
import urllib
import urllib2
import StringIO
import sqlite3
import datetime
import time
from xml.etree import ElementTree
ADDON = xbmcaddon.Addon(id = 'script.myaddon')
class MyScript(xbmcgui.WindowXML):
    def __new__(cls):
        return super(MyScript, cls).__new__(cls, 'script-menu.xml', ADDON.getAddonInfo('path'))
    def onInit(self):
        #DOWNLOAD THE XML SOURCE HERE
        url = ADDON.getSetting('ontv.url')
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        data = response.read()
        response.close()
        profilePath = xbmc.translatePath(os.path.join('special://userdata/addon_data/script.tvguide', ''))
        io = StringIO.StringIO(req)
        context = ElementTree.iterparse(io)
        if os.path.exists(profilePath):
            profilePath = profilePath + 'source.db'
            con = sqlite3.connect(profilePath)
            cur = con.cursor()
            cur.execute('CREATE TABLE programs(channel TEXT, title TEXT, start_date TIMESTAMP, end_date TIMESTAMP, description TEXT, image_large TEXT, image_small TEXT, source TEXT, updates_id INTEGER, FOREIGN KEY(channel, source) REFERENCES channels(id, source) ON DELETE CASCADE, FOREIGN KEY(updates_id) REFERENCES updates(id) ON DELETE CASCADE)')
            cur.close()
            fc = open(profilePath, 'w')
            fc.write(data)
            fc.close
After creating the database table, I want to go through each entry in the XML and write it to the database. How do I write the code for XBMC to do that?
I haven't got the xbmc module installed, so this code is based on loading the XML from a file and then parsing through it.
I couldn't see any references to image_large, image_small or updates_id in the XML, so I left those commented out. There is probably a better way of doing this, but this should get you started and hopefully from here you should be able to work out how to loop through the list to write each programme to your database table.
import xml.etree.ElementTree as ET
tree = ET.parse('epg.xml')
root = tree.getroot()
programmes = []
for item in root.findall('programme'):
    programme = {}
    programme["channel"] = item.attrib['channel']
    programme["title"] = item.find('title').text
    programme["start_date"] = item.attrib['start']
    programme["end_date"] = item.attrib['stop']
    programme["description"] = item.find('desc').text
    #programme["image_large"] =
    #programme["image_small"] =
    programme["source"] = item.find('icon').attrib['src']
    #programme["updates_id"] =
    programmes.append(programme)
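The remaining step, looping through the parsed programmes and writing each one to the table, might look like the sketch below. It is written for Python 3 and makes several assumptions: the inline SAMPLE_XML is invented test data in the XMLTV shape used above, the table is reduced to five columns, an in-memory database stands in for source.db, and the function names are made up for the example:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented sample in the same shape as the feed (programme elements with
# channel/start/stop attributes and title/desc children).
SAMPLE_XML = """<tv>
  <programme channel="dr1" start="20140101060000 +0100" stop="20140101063000 +0100">
    <title>Morgennyt</title>
    <desc>Nyheder</desc>
  </programme>
  <programme channel="dr1" start="20140101063000 +0100" stop="20140101070000 +0100">
    <title>Vejret</title>
  </programme>
</tv>"""


def parse_programmes(xml_text):
    """Return one tuple per programme, in table column order."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.findall('programme'):
        rows.append((
            item.attrib['channel'],
            item.findtext('title'),   # findtext returns None if the child is missing
            item.attrib['start'],
            item.attrib['stop'],
            item.findtext('desc'),
        ))
    return rows


def store_programmes(con, rows):
    """Create the (simplified) table if needed and insert all rows."""
    con.execute('CREATE TABLE IF NOT EXISTS programs('
                'channel TEXT, title TEXT, start_date TIMESTAMP, '
                'end_date TIMESTAMP, description TEXT)')
    # Parameter placeholders let sqlite3 handle quoting of the values.
    con.executemany('INSERT INTO programs VALUES (?, ?, ?, ?, ?)', rows)
    con.commit()


con = sqlite3.connect(':memory:')
store_programmes(con, parse_programmes(SAMPLE_XML))
stored = con.execute(
    'SELECT channel, title, description FROM programs ORDER BY title').fetchall()
print(stored)
```

Note that the rows go in through the sqlite3 connection; writing the downloaded XML to the same path with open(..., 'w') would overwrite the database file.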

Splitting text in Python

I'm writing a script that captures data from a website and saves it to a database. Some of the values are merged and I need to split them. I have something like this:
Endokrynologia (bez st.),Położnictwo i ginekologia (II st.)
So I need to get:
Endokrynologia (bez st.)
Położnictwo i ginekologia (II st.)
So I wrote some code in Python:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import MySQLdb as mdb
from lxml import html, etree
import urllib
import sys
import re
Nr = 17268
Link = "http://rpwdl.csioz.gov.pl/rpz/druk/wyswietlKsiegaServletPub?idKsiega="
sock = urllib.urlopen(Link+str(Nr))
htmlSource = sock.read()
sock.close()
root = etree.HTML(htmlSource)
result = etree.tostring(root, pretty_print=True, method="html")
Spec = etree.XPath("string(//html/body/div/table[2]/tr[18]/td[2]/text())")
Specjalizacja = Spec(root)
if re.search(r'(,)\b', Specjalizacja):
    text = Specjalizacja.split()
    print text[0]
    print text[1]
and I get:
Endokrynologia
(bez
What am I doing wrong?
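Try replacing
text = Specjalizacja.split()
with
text = Specjalizacja.split(',')
Called without arguments, split() splits on whitespace, which is why you get 'Endokrynologia' and '(bez'. I don't know whether that alone will fix your problem.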
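The difference between the two calls can be seen in a short sketch, written for Python 3 with the sample string from the question; the variable names are made up, and the strip() call is an extra precaution in case the real data has spaces after the commas:

```python
specializations = 'Endokrynologia (bez st.),Położnictwo i ginekologia (II st.)'

# split() with no argument breaks the string at every run of whitespace.
by_whitespace = specializations.split()

# split(',') breaks only at commas; strip() removes any surrounding spaces.
by_comma = [part.strip() for part in specializations.split(',')]

print(by_whitespace[:2])
for part in by_comma:
    print(part)
```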
