urllib2, Google App Engine, and unicode question - python

Hey guys, I'm just learning Google App Engine, so I'm running into a bunch of problems...
My current predicament is this. I have a datastore model,
class Website(db.Model):
    web_address = db.StringProperty()
    company_name = db.StringProperty()
    content = db.TextProperty()
    div_section = db.StringProperty()
    local_links = db.StringProperty()
    absolute_links = db.BooleanProperty()
    date_updated = db.DateTimeProperty()
and the problem I'm having is with the content property.
I'm using db.TextProperty() because I need to store the contents of a webpage, which can be more than 500 bytes.
The problem I'm running into is that urllib2.readlines() formats the data as unicode; when putting it into a TextProperty() it's converted to ASCII. Some of the characters are >128 and it throws a UnicodeDecodeError.
Is there a simple way to bypass this? For the most part, I don't care about those characters...
my error is:
Traceback (most recent call last):
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 511, in __call__
    handler.get(*groups)
  File "/base/data/home/apps/game-job-finder/1.346504560470727679/main.py", line 61, in get
    x.content = website_data_joined
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 542, in __set__
    value = self.validate(value)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 2407, in validate
    value = self.data_type(value)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/api/datastore_types.py", line 1006, in __new__
    return super(Text, cls).__new__(cls, arg, encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2124: ordinal not in range(128)

It would appear that the lines returned from readlines are not unicode strings, but rather byte strings (i.e. instances of str containing potentially non-ASCII characters). These bytes are the raw data received in the HTTP response body, and will represent different strings depending on the encoding used. They need to be "decoded" before they can be treated as text (bytes != characters).
If the encoding is UTF-8, this code should work properly:
f = urllib2.urlopen('http://www.google.com')
website = Website()
website.content = db.Text(f.read(), encoding='utf-8-sig')  # 'sig' deals with a BOM if present
Note that the actual encoding varies from website to website (sometimes even from page to page). The encoding used should be included in the Content-Type header of the HTTP response (see this question for how to get it), but if it's not, it may be included in a meta tag in the head of the HTML (in which case extracting it properly is much trickier):
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Note that there are sites that do not specify an encoding, or specify the wrong encoding.
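For the common case where the header is present, here is a minimal sketch (Python 2 / urllib2, matching the code above) of pulling the charset out of the Content-Type header; the UTF-8 fallback is an assumption, not something the server guarantees:
f = urllib2.urlopen('http://www.google.com')
# f.headers is a mimetools.Message; getparam reads a parameter of the Content-Type header
encoding = f.headers.getparam('charset') or 'utf-8'  # fallback encoding is an assumption
website = Website()
website.content = db.Text(f.read(), encoding=encoding)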
If you really don't care about any characters but ASCII, you can ignore them and be done with it:
f = urllib2.urlopen('http://www.google.com')
website = Website()
content = unicode(f.read(), errors='ignore')  # ignore characters that cause errors
website.content = db.Text(content)  # no need to specify an encoding since content is already a unicode string

Related

How to get a webpage with unicode chars in python

I am trying to get and parse a webpage that contains non-ASCII characters (the URL is http://www.one.co.il). This is what I have:
url = "http://www.one.co.il"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
encoding = response.headers.getparam('charset')  # windows-1255
html = response.read()  # The length of this is valid - about 31000-32000,
                        # but printing the first characters shows garbage -
                        # '\x1f\x8b\x08\x00\x00\x00\x00\x00' instead of
                        # '<!DOCTYPE'
html_decoded = html.decode(encoding)
The last line gives me an exception:
File "C:/Users/....\WebGetter.py", line 16, in get_page
html_decoded = html.decode(encoding)
File "C:\Python27\lib\encodings\cp1255.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0xdb in position 14: character maps to <undefined>
I tried looking at other related questions such as urllib2 read to Unicode and How to handle response encoding from urllib.request.urlopen() , but didn't find anything helpful about this.
Can someone please shed some light and guide me in this subject? Thanks!
0x1f 0x8b 0x08 is the magic number for a gzipped file. You will need to decompress it before you can use the contents.
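A minimal sketch of that (Python 2, matching the question's code; the gzip check and the fallback encoding are assumptions for illustration):
import gzip
import io
import urllib2

response = urllib2.urlopen("http://www.one.co.il")
raw = response.read()
if raw[:2] == '\x1f\x8b':  # gzip magic number at the start of the body
    raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()  # decompress first
encoding = response.headers.getparam('charset') or 'windows-1255'  # fallback is an assumption
html = raw.decode(encoding)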

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte [duplicate]

I am trying to make a crawler in Python by following a Udacity course. I have this method get_page() which returns the content of the page.
from urllib.request import urlopen

def get_page(url):
    '''
    Open the given url and return the content of the page.
    '''
    data = urlopen(url)
    html = data.read()
    return html.decode('utf8')
The original method just returned data.read(), but that way I could not perform operations like str.find(). After a quick search I found out I need to decode the data. But now I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position
1: invalid start byte
I have found similar questions in SO but none of them were specifically for this. Please help.
You are trying to decode an invalid string.
A valid UTF-8 sequence never begins with a continuation byte (0x80 to 0xBF); it must start either with an ASCII byte (0x00 to 0x7F) or with the leading byte of a multi-byte sequence.
So 0x8B is definitely invalid as a start byte.
From RFC3629 Section 3:
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number.
You should post the string you are trying to decode.
Maybe the page is encoded with a character encoding other than 'utf-8', so the start byte is invalid. You could do this:
def get_page(self, url):
    if url is None:
        return None
    response = urllib.request.urlopen(url)
    if response.getcode() != 200:
        print("Http code:", response.getcode())
        return None
    data = response.read()  # read once; a second read() would return b''
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return data
Web servers often serve HTML pages with a Content-Type header that includes the encoding used to encode the page. The header might look like this:
Content-Type: text/html; charset=UTF-8
We can inspect the content of this header to find the encoding to use to decode the page:
from urllib.request import urlopen

def get_page(url):
    """ Open the given url and return the content of the page."""
    data = urlopen(url)
    content_type = data.headers.get('content-type', '')
    print(f'{content_type=}')
    encoding = 'latin-1'
    if 'charset' in content_type:
        _, _, encoding = content_type.rpartition('=')
    print(f'{encoding=}')
    html = data.read()
    return html.decode(encoding)
Using requests is similar:
response = requests.get(url)
content_type = response.headers.get('content-type', '')
Latin-1 (or ISO-8859-1) is a safe default: it will always decode any bytes (though the result may not be useful).
If the server doesn't serve a content-type header you can try looking for a <meta> tag that specifies the encoding in the HTML. Or pass the response bytes to Beautiful Soup and let it try to guess the encoding.
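A hedged sketch of the Beautiful Soup route, assuming beautifulsoup4 is installed and url is defined as above; its UnicodeDammit helper guesses the encoding from meta tags and byte patterns:
from urllib.request import urlopen
from bs4 import UnicodeDammit

raw = urlopen(url).read()
dammit = UnicodeDammit(raw)             # tries declared encodings, then heuristics
print(dammit.original_encoding)         # the guess, e.g. 'utf-8'
html = dammit.unicode_markup            # the decoded text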

Can't convert 'bytes' object to str implictly HTML Parser Python3 Error

I am trying to create an HTML Parser in Python 3.4.2 on a Macbook Air(OS X):
plaintext.py:
from html.parser import HTMLParser
import urllib.request, formatter, sys
website = urllib.request.urlopen("http://www.profmcmillan.com")
data = website.read()
website.close()
format = formatter.AbstractFormatter(formatter.DumbWriter(sys.stdout))
ptext = HTMLParser(format)
ptext.feed(data)
ptext.close()
But I get the following error:
Traceback (most recent call last):
File "/Users/deannarobertazzi/Documents/plaintext.py", line 9, in <module>
ptext.feed(data)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/html/parser.py", line 164, in feed
self.rawdata = self.rawdata + data
TypeError: Can't convert 'bytes' object to str implicitly
I looked at the Python documentation and apparently the way you parse HTML data in Python 3 is vastly different from doing such a thing in Python 2. I don't know how to modify my code so that it works for Python 3. Thank you.
2.x implicit conversions only worked if all the bytes were in the ASCII range [0-127].
>>> u'a' + 'b'
u'ab'
>>> u'a' + '\xca'
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
u'a' + '\xca'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xca in position 0: ordinal not in range(128)
What often happened, and why this was dropped, is that code would work when tested with ascii data, such as Prof. McMillan's site seems to be today, and later fail, such as if Prof. McMillan were to add a title with a non-ascii char, or if another source were used that were not all-ascii.
The doc for HTMLParser.feed(data) says that the data must be 'text', which in 3.x means a unicode string. So bytes from the web must be decoded to unicode. Decoding the site with utf-8 works today because ascii is a subset of utf-8. However, the page currently has
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">
So if a non-ascii char were to be added, and the encoding not changed, utf-8 would not work. There is really no substitute for paying attention to encoding of bytes. How to discover or guess the encoding of a web page (assuming that there is only one encoding used) is a separate subject.
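A minimal Python 3 sketch of the decode-then-feed approach described above; the TextExtractor subclass is illustrative, not part of the original code:
from html.parser import HTMLParser
import urllib.request

class TextExtractor(HTMLParser):
    def handle_data(self, data):
        print(data, end='')  # print the text content of each element

raw = urllib.request.urlopen("http://www.profmcmillan.com").read()
text = raw.decode('windows-1252')  # per the page's declared charset
TextExtractor().feed(text)         # feed() requires str, not bytes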

Downloading text from webpages using Scrapy raises UnicodeError and text is not stored correctly

I am using a Scrapy crawler to download text from some webpages belonging to different companies and store the text in a csv file using a utf-8 encoding and with the format
'company','company number','extracted text'
My problem is that no matter how I try to take the webpage's character encoding into account, I always get a lot of UnicodeErrors of the type
2014-08-26 13:43:13+0200 [scrapy] INFO: Encoding for page http://militaryaircraftspares.net/index.html is cp1252
2014-08-26 13:43:13+0200 [scrapy] INFO: UNICODE ERROR, in http://militaryaircraftspares.net/index.html error is 'ascii' codec can't encode character u'\xa0' in position 478: ordinal not in range(128)
therefore losing a lot of data.
Furthermore, the final csv file (I implemented a pipeline that uses CsvItemExporter) is completely messed up: there are definitely more than 3 columns, overall, and I have some parts of the extracted_text ending up in the 'company' or 'company number' fields. As if some escape characters in the extracted_text are not properly recognized and produce a new line where it is not needed (that's my guess, at least).
I assume I'm doing something deeply wrong somewhere, but couldn't figure out where...
So here is the crawler's function that is supposed to do the work
def extract_text(self, response):
    """ extract text from webpage"""
    # checks whether the page is actually html
    if type(response) == scrapy.http.response.html.HtmlResponse:
        hxs = HtmlXPathSelector(response)
        page_text = ' '.join(hxs.select("//body//p//text()").extract())
        current_encoding = response.encoding
        log.msg("Encoding for page " + response.url + " is " + current_encoding)
        item = CompanyText()
        item['co_name'] = self.co_name.encode('utf-8')
        item['co_number'] = self.co_number.encode('utf-8')
        if current_encoding != 'utf-8':
            try:
                decoded_page = page_text.decode(current_encoding, errors='ignore')
                encoded_page = decoded_page.encode("utf-8", errors="ignore")
                item['extracted_text'] = encoded_page
            except UnicodeError, e:
                log.msg("UNICODE ERROR, in " + response.url + " error is %s" % e)
                item['extracted_text'] = ''.encode('utf-8')
        else:
            item['extracted_text'] = page_text
    else:
        item = None
The output of selectors is always unicode, so you should join with a unicode string.
This:
page_text = ' '.join(hxs.select("//body//p//text()").extract())
Should be:
page_text = u' '.join(hxs.select("//body//p//text()").extract())
Since page_text is unicode.
This isn't needed:
decoded_page = page_text.decode(current_encoding, errors='ignore')
encoded_page = decoded_page.encode("utf-8",errors="ignore")
CsvItemExporter will try to get the Field's serializer, which is a function that receives the value as an argument. Our default serializer is _to_str_if_unicode, which uses utf-8 as the encoding.
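For illustration, a hedged sketch of attaching a serializer to a Scrapy Field so the exporter, rather than the spider, handles encoding; the to_utf8 helper is hypothetical, and CompanyText mirrors the item used in the question:
import scrapy

def to_utf8(value):
    # receives the stored field value; returns what the exporter writes
    return value.encode('utf-8') if isinstance(value, unicode) else value

class CompanyText(scrapy.Item):
    co_name = scrapy.Field(serializer=to_utf8)
    co_number = scrapy.Field(serializer=to_utf8)
    extracted_text = scrapy.Field(serializer=to_utf8)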

parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)

I send a GET request to the CareerBuilder API:
import requests

url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVLOPER_KEY',
           'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text
And get back an XML that looks like this. However, I have trouble parsing it.
Using either lxml
>>> from lxml import etree
>>> print etree.fromstring(xml)
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
print etree.fromstring(xml)
File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.
or ElementTree:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
print ET.fromstring(xml)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
parser.feed(text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)
So, even though the XML file starts with
<?xml version="1.0" encoding="UTF-8"?>
I have the impression that it contains characters that are not allowed. How do I parse this file with either lxml or ElementTree?
You are using the decoded unicode value. Use the raw response data r.raw instead:
r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True
etree.parse(r.raw)
which will read the data from the response directly; do note the stream=True option to .get().
Setting the r.raw.decode_content = True flag ensures that the raw socket will give you the decompressed content even if the response is gzip or deflate compressed.
You don't have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the un-decoded response body:
r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)
XML parsers always expect bytes as input as the XML format itself dictates how the parser is to decode those bytes to Unicode text.
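A small standalone illustration of that point (not from the original answer): lxml accepts bytes carrying an encoding declaration, but rejects a unicode string with one.
from lxml import etree

doc = b"<?xml version='1.0' encoding='UTF-8'?><root>ok</root>"
etree.fromstring(doc)                  # fine: the parser honors the declaration
etree.fromstring(doc.decode('utf-8'))  # ValueError: encoding declaration in a str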
Correction!
See below how I got it all wrong. Basically, when we use the .text attribute, the result is a unicode string. Using it raises the following exception in lxml:
ValueError: Unicode strings with encoding declaration are not
supported. Please use bytes input or XML fragments without
declaration.
Which basically means that #martijn-pieters was right, we must use the raw response as returned by .content
Incorrect answer (but might be interesting to someone)
For whoever is interested. I believe the reason this error occurs is probably an invalid guess taken by requests as explained in Response.text documentation:
Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using chardet.
The encoding of the response content is determined based solely on
HTTP headers, following RFC 2616 to the letter. If you can take
advantage of non-HTTP knowledge to make a better guess at the
encoding, you should set r.encoding appropriately before accessing
this property.
So, following this, one could also make sure requests' r.text encodes the response content correctly by explicitly setting the encoding with r.encoding = 'UTF-8'
This approach adds another validation that the received response is indeed in the correct encoding prior to parsing it with lxml.
I understand the question has already got its answer, but I faced a similar issue on Python 3, while the same code worked fine on Python 2. My resolution was to encode the string back to bytes first: xml = etree.fromstring(str_xml.encode()), and then do the parsing and extraction of tags and attributes.
