I generate custom XML files that have to be in a certain format with this script. It queries a database and turns the results into one big XML file. I do this against multiple databases, ranging from inventory parts lists to employee records.
import csv
import StringIO
import time
import sys
import MySQLdb
from datetime import datetime
from time import sleep
import string
from lxml import etree
from lxml.builder import E as buildE
import shutil
import glob
import os
import logging
def logWrite(message):
    # Note: basicConfig() only has an effect the first time it is called;
    # once the root logger is configured, later calls are no-ops.
    logging.basicConfig(
        filename="C:\\logs\\XMLSyncOut.log",
        level=logging.DEBUG,
        format='%(asctime)s %(message)s',
        datefmt='%m/%d/%Y %I:%M:%S: %p'
    )
    logging.debug(message)
def buildTag(tag, parent=None, content=None):
    element = buildE(tag)
    if content is not None:
        element.text = unicode(content)
    if parent is not None:
        parent.append(element)
    return element
def fetchXML(cursor):
    logWrite("constructing XML from cursor")
    fields = [x[0] for x in cursor.description]
    doc = buildTag('DATA')
    for record in cursor.fetchall():
        r = buildTag('ROW', parent=doc)
        for (k, v) in zip(fields, record):
            buildTag(k, content=v, parent=r)
    return doc
def updateDatabase1():
    try:
        conn = MySQLdb.connect(host='host', user='user', passwd='passwd', db='database')
        cursor = conn.cursor()
    except MySQLdb.Error:
        logWrite("Cannot connect to database - quitting!")
        sys.exit(1)
    cursor.execute("SELECT * FROM database.table")
    logWrite("Dumping fields from database.table into cursor")
    xmlFile = open("results.xml", "w")
    doc = fetchXML(cursor)
    xmlFile.write(etree.tostring(doc, pretty_print=True))
    logWrite("Writing XML results.xml")
    xmlFile.close()
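For reference, the output has this general shape (illustrative field names; the real tags come straight from the cursor's column names):

<DATA>
  <ROW>
    <part_number>12345</part_number>
    <description>widget</description>
  </ROW>
</DATA>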
For some reason, one of the new databases I imported from an Excel spreadsheet is having some type of encoding error that the others aren't having. This is the error:
element.text = unicode(content)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 21: ordinal not in range(128)
I tried explicitly encoding to ASCII by changing the buildTag function to look like this:
def buildTag(tag, parent=None, content=None):
    element = buildE(tag)
    if content is not None:
        content = str(content).encode('ascii', 'ignore')
        element.text = content
    if parent is not None:
        parent.append(element)
    return element
This still didn't work.
Any ideas on what I can do to stop this? I can't escape them because I can't have "\x92" showing up in records as output.
I think your problem is Windows encoding. You can try this in the shell:
In: print '\x92'.decode('cp1251')
Out: ’
I'm focusing on
element.text = unicode(content)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 21: ordinal not in range(128)
I am assuming that content is of type str, i.e. it contains encoded bytes (this applies to Python 2 only). You have to know which encoding was used to generate those bytes. Then, to create a unicode object from them, you have to explicitly tell Python how to decode them, via e.g.:
element.text = content.decode("utf-8")
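Since the data came in from an Excel spreadsheet, the bytes are quite possibly Windows-1252 (byte 0x96 is an en dash there). A minimal sketch of buildTag under that assumption (cp1252 is my guess; verify it against your actual source):

def buildTag(tag, parent=None, content=None):
    element = buildE(tag)
    if content is not None:
        if isinstance(content, str):
            # assumption: the Excel import produced Windows-1252 bytes;
            # swap in the real source encoding if it differs
            content = content.decode('cp1252')
        element.text = unicode(content)
    if parent is not None:
        parent.append(element)
    return element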
Related
So I am trying to figure out how to properly encode text for the .table property in terminaltables, but so far no luck.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re
from terminaltables import AsciiTable
html = """<a href="#" class="link" title="Some title B">
Some text with åäö</a>"""
soup = BeautifulSoup(html, "html.parser")
test = soup.find('a').get_text(strip=True).encode('utf-8')
test_list = [test]
test_table = AsciiTable(test_list)
print test_list
print test
print test_table.table
This will print
['Some text with \xc3\xa5\xc3\xa4\xc3\xb6']
Some text with åäö
And the traceback for test_table.table:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data
The documentation for AsciiTable states that its argument should be a list of lists of strings.
Changing your definition of test_list to test_list = [[test]] prints out a table for me.
Dropping the encode('utf-8') from the definition of test makes it show up nicely on my terminal.
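Putting both fixes together, the relevant lines would look like this (a sketch; the rest of the script is unchanged):

test = soup.find('a').get_text(strip=True)  # keep it as unicode, no .encode()
test_list = [[test]]  # a list of lists: one row with one cell
test_table = AsciiTable(test_list)
print test_table.table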
I'm working on a new project but I can't fix the error in the title.
Here's the code:
#!/usr/bin/env python3.5.2
import urllib.request, urllib.parse

def start(url):
    source_code = urllib.request.urlopen(url).read()
    info = urllib.parse.parse_qs(source_code)
    print(info)
start('https://www.youtube.com/watch?v=YfRLJQlpMNw')
The error occurred because of .encode, which only works on a unicode object. So we need to convert the byte string to a unicode string using
.decode('unicode_escape')
So the code will be:
#!/usr/bin/env python3.5.2
import urllib.request, urllib.parse

def start(url):
    source_code = urllib.request.urlopen(url).read()
    info = urllib.parse.parse_qs(source_code.decode('unicode_escape'))
    print(info)
start('https://www.youtube.com/watch?v=YfRLJQlpMNw')
Try this
source_code = urllib.request.urlopen(url).read().decode('utf-8')
The error message is self-explanatory: there is a byte 0xf0 in an input string that is expected to be an ASCII string.
You should have given the exact error message and the line it happened on, but I can guess that it happened on info = urllib.parse.parse_qs(source_code), because parse_qs expects either a unicode string or an ASCII byte string.
The first question is why you call parse_qs on data coming from YouTube, because the doc for the Python Standard Library says:
Parse a query string given as a string argument (data of type application/x-www-form-urlencoded). Data are returned as a dictionary. The dictionary keys are the unique query variable names and the values are lists of values for each name.
So you are going to split this on the = and & characters to interpret it as a query string of the form key1=value11&key2=value2&key1=value12, giving { 'key1': ['value11', 'value12'], 'key2': ['value2'] }.
If you know why you want that, you should first decode the byte string into a unicode string using the proper encoding, or, if unsure, Latin-1, which can accept any byte:
def start(url):
    source_code = urllib.request.urlopen(url).read().decode('latin1')
    info = urllib.parse.parse_qs(source_code)
    print(info)
This code is rather weird indeed. You are using a query parser to parse the contents of a web page.
So instead of using parse_qs you should be using an HTML parser, something like the sketch below.
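(The original answer linked to an example; as one stand-in, here is a minimal sketch using the standard library's html.parser to pull out the page title. The choice of parser and of extracting the title are my assumptions, purely for illustration.)

from html.parser import HTMLParser
import urllib.request

class TitleParser(HTMLParser):
    # collects the text inside the <title> tag
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def start(url):
    source_code = urllib.request.urlopen(url).read().decode('utf-8')
    parser = TitleParser()
    parser.feed(source_code)
    print(parser.title)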
I have a web crawler that get a lot of these errors:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 27: ordinal not in range(128)
To mitigate these errors I have implemented a function that encodes them like this:
def properEncode(url):
    url = url.replace("ø", "%C3%B8")
    url = url.replace("å", "%C3%A5")
    url = url.replace("æ", "%C3%A6")
    url = url.replace("é", "%C3%A9")
    url = url.replace("Ø", "%C3%98")
    url = url.replace("Å", "%C3%85")
    url = url.replace("Æ", "%C3%86")
    url = url.replace("í", "%C3%AD")
    return url
These are based on this table: http://www.utf8-chartable.de/
The conversion I'm doing seems to turn them into percent-encoded UTF-8 bytes. Is there a Python function to do this automatically?
You are URL encoding them. You can do so trivially with the urllib.parse.quote() function:
>>> from urllib.parse import quote
>>> quote("ø")
'%C3%B8'
or put into a function to only fix the URL path of a given URL (as this encoding doesn't apply to the host portion, for example):
from urllib.parse import quote, urlparse

def properEncode(url):
    parts = urlparse(url)
    path = quote(parts.path)
    return parts._replace(path=path).geturl()
This limits the encoding to just the path portion of the URL. If you need to encode the query string as well, use the quote_plus() function on the query portion, since query parameters encode spaces as a plus rather than %20.
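For example (hypothetical URL; the percent-encodings follow from quote()'s documented behaviour):

>>> properEncode("http://example.com/søk/blå")
'http://example.com/s%C3%B8k/bl%C3%A5'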
I'm new to Python and have to build an application to get historical data from Twitter. I can see the tweets in my console along with all the information I need!
But the problem now is that I need to write this information to a .csv file, and I encounter the following error while running the code:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)". I know the problem is that the tweets I'm collecting are written in Swedish, so the letters "ÅÄÖ" appear frequently.
Can anyone please help me with this, or give me any pointers as to where I should start looking for a solution?
#!/usr/bin/python
# -*- coding: utf-8 -*-
from TwitterSearch import *
import csv
import codecs
try:
    tuo = TwitterUserOrder('Vismaspcs')  # create a TwitterUserOrder
    ts = TwitterSearch(
        consumer_key='',
        consumer_secret='',
        access_token='',
        access_token_secret=''
    )
    # start asking Twitter about the timeline
    for tweet in ts.search_tweets_iterable(tuo):
        print('#%s tweeted: %s' % (tweet['user']['screen_name'], tweet['text']))
        print(tweet['created_at'], tweet['favorite_count'], tweet['retweet_count'])
        with open('visma.csv', 'w') as fout:
            writer = csv.writer(fout)
            # one CSV row per tweet
            writer.writerow([tweet['user']['screen_name'], tweet['text'], tweet['created_at'], tweet['favorite_count'], tweet['retweet_count']])
except TwitterSearchException as e:  # catch all those ugly errors
    print(e)
The csv module cannot handle unicode in Python 2:
Note: This version of the csv module doesn't support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
You can use tweet['user']['screen_name'].encode("utf-8")...
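A minimal sketch of the write step under that suggestion (Python 2; encode each text field to UTF-8 bytes before handing the row to csv, and leave the integer counts alone):

row = [
    tweet['user']['screen_name'].encode('utf-8'),
    tweet['text'].encode('utf-8'),
    tweet['created_at'].encode('utf-8'),
    tweet['favorite_count'],  # ints need no encoding
    tweet['retweet_count'],
]
writer.writerow(row)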
Thanks,
I use it with file.write(word['value'].encode("utf-8")) and it works too :)
You can also call .encode('utf8') on a string even when it's not for writing something.
I'm trying to download this page - https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8 (it looks like this for me in Russia - http://screencloud.net/v/6a7o) via spynner in Python - it uses some JavaScript checking, so one does not simply download it without full browser emulation.
My code:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from StringIO import StringIO
import spynner
def log(content, filename_end):
    filename = '/tmp/apple_log_%s.html' % filename_end
    print 'logged to %s' % filename
    f = open(filename, 'w')
    f.write(content)
    f.close()
debug_stream = StringIO()
browser = spynner.Browser(debug_level=3, debug_stream=debug_stream)
browser.load("https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8")
ret = browser.contents
log(ret, 'noenc')
print 'content length = %s' % len(ret)
browser.close()
del browser
f=open('/tmp/apple_log_debug', 'w')
f.write(debug_stream.getvalue())
f.close()
print 'log stored in /tmp/debug_log'
So, the problem is: either Apple or spynner handles Cyrillic symbols incorrectly. I see them fine if I call browser.show() after loading, but in the contents and logs they are still wrongly encoded, like <meta content="ÐолÑÑиÑÑ Farm Story⢠в App Store. ÐÑоÑмоÑÑеÑÑ ÑкÑинÑоÑÑ Ð¸ ÑейÑинги, пÑоÑиÑаÑÑ Ð¾ÑзÑÐ²Ñ Ð¿Ð¾ÐºÑпаÑелей." property="og:description">.
http://2cyr.com/ says that it is UTF-8 text displayed as ISO-8859-1...
As you can see, I don't use any headers in my request, but if I take them from Chrome's network debug console and pass them to the load() method, e.g. headers=[('Accept-Encoding', 'utf-8'), ('Accept-Language', 'ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4')], I get the same result.
Also, from the same network console you can see that Chrome uses gzip,deflate,sdch as Accept-Encoding. I can try that too, but I fail to decode what I get: <html><head></head><body>��}ksÇ�g!���4�I/z�O���/)�(yw���é®i��{�<v���:��ٷ�س-?�b�b�� j�... even if I remove the tags from the beginning and end of the result.
Any help?
Basically, browser.webframe.toHtml() returns a QString, in which case str() won't help if ret actually contains non-Latin unicode characters.
If you want to get a Python unicode string you need to do:
ret = unicode(browser.webframe.toHtml().toUtf8(), encoding="UTF-8")
#if you want to get rid of non-latin text
ret = ret.encode("ascii", errors="replace") # encodes to bytestring
In case you suspect it's in Russian, you could encode it to a Russian single-byte string (still a bytestring) by doing:
ret = ret.encode("cp1251", errors="replace") # encodes to Win-1251
# or
ret = ret.encode("cp866", errors="replace") # encodes to windows/dos console
Only then can you save it to a plain ASCII file.
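Putting it together (a sketch; the output path is illustrative, and cp1251 is just one of the choices above):

ret = unicode(browser.webframe.toHtml().toUtf8(), encoding="UTF-8")
out = ret.encode("cp1251", errors="replace")  # bytestring in Windows-1251
f = open('/tmp/apple_log_cp1251.html', 'w')
f.write(out)
f.close()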
str(browser.webframe.toHtml()) saved me