I am using Mac OS X 10.6 and Python 2.7.1 with BeautifulSoup 3.0 and feedparser 5.0.1.
I am trying to parse the New York Times RSS Feed, which validates, and which Beautiful Soup on its own will parse happily.
The minimum code to produce the error is:
import feedparser
from BeautifulSoup import BeautifulSoup
feed = feedparser.parse("http://www.nytimes.com/services/xml/rss/nyt/GlobalHome.xml")
It fails whether I pass the URL directly or use urllib2.urlopen to fetch the contents myself.
I have also tried the character set detector.
The error block is:
/Users/user/Source/python/feed/BeautifulSoup.py:1553: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:3] == '\xef\xbb\xbf':
/Users/user/Source/python/feed/BeautifulSoup.py:1556: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == '\x00\x00\xfe\xff':
/Users/user/Source/python/feed/BeautifulSoup.py:1559: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == '\xff\xfe\x00\x00':
Traceback (most recent call last):
File "parse.py", line 5, in <module>
feed = feedparser.parse("http://www.nytimes.com/services/xml/rss/nyt/GlobalHome.xml")
File "/Users/user/Source/python/feed/feedparser.py", line 3822, in parse
feedparser.feed(data.decode('utf-8', 'replace'))
File "/Users/user/Source/python/feed/feedparser.py", line 1851, in feed
sgmllib.SGMLParser.feed(self, data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 104, in feed
self.goahead(0)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 143, in goahead
k = self.parse_endtag(i)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 320, in parse_endtag
self.finish_endtag(tag)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 360, in finish_endtag
self.unknown_endtag(tag)
File "/Users/user/Source/python/feed/feedparser.py", line 657, in unknown_endtag
method()
File "/Users/user/Source/python/feed/feedparser.py", line 1545, in _end_description
value = self.popContent('description')
File "/Users/user/Source/python/feed/feedparser.py", line 961, in popContent
value = self.pop(tag)
File "/Users/user/Source/python/feed/feedparser.py", line 868, in pop
mfresults = _parseMicroformats(output, self.baseuri, self.encoding)
File "/Users/user/Source/python/feed/feedparser.py", line 2420, in _parseMicroformats
p = _MicroformatsParser(htmlSource, baseURI, encoding)
File "/Users/user/Source/python/feed/feedparser.py", line 2024, in __init__
self.document = BeautifulSoup.BeautifulSoup(data)
File "/Users/user/Source/python/feed/BeautifulSoup.py", line 1228, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/Users/user/Source/python/feed/BeautifulSoup.py", line 892, in __init__
self._feed()
File "/Users/user/Source/python/feed/BeautifulSoup.py", line 917, in _feed
SGMLParser.feed(self, markup)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 103, in feed
self.rawdata = self.rawdata + data
TypeError: cannot concatenate 'str' and 'NoneType' objects
I would appreciate any advice very much.
I tested using Python 2.7.1, feedparser 5.0.1, and BeautifulSoup 3.2.0, and the feed didn't cause a traceback. Try upgrading to BeautifulSoup 3.2.0.
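A quick way to confirm which version is actually being imported after the upgrade (the BeautifulSoup 3.x module exposes a __version__ attribute):

import BeautifulSoup
print BeautifulSoup.__version__  # should print 3.2.0 after upgrading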
Related
I am trying to scrape data from Yelp and put it into a .csv file using urllib, BeautifulSoup, and pandas. I am getting several errors and I am not sure what to do.
I have reviewed all the similar questions I could find online, but none of them seem to be the same issue.
This is the error I get at runtime:
Traceback (most recent call last):
File "/Users/sydney/PycharmProjects/yelpdata/venv/yelp.py", line 7, in <module>
page_soup = bs.BeautifulSoup(source, 'html.parser')
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-
packages/bs4/__init__.py", line 310, in __init__
markup, from_encoding, exclude_encodings=exclude_encodings)):
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-
packages/bs4/builder/_htmlparser.py", line 248, in prepare_markup
exclude_encodings=exclude_encodings)
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-packages/bs4/dammit.py",
line 395, in __init__
for encoding in self.detector.encodings:
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-packages/bs4/dammit.py",
line 278, in encodings
self.markup, self.is_html)
File "/Users/sydney/PycharmProjects/yelpdata/venv/lib/python2.7/site-packages/bs4/dammit.py",
line 343, in find_declared_encoding
declared_encoding_match = xml_re.search(markup, endpos=xml_endpos)
TypeError: expected string or buffer
This is my Python file (linked to GitHub).
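Without the file contents it is hard to be sure, but the final TypeError means bs4's encoding detector received something that is not a string (for example None, or a response object that was never read). A minimal sketch of the pattern that avoids it, using urllib2 and a hypothetical URL:

import urllib2
from bs4 import BeautifulSoup

# Hypothetical URL; .read() returns the page bytes, which is what
# BeautifulSoup expects as markup.
source = urllib2.urlopen('https://www.yelp.com/biz/example').read()
page_soup = BeautifulSoup(source, 'html.parser')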
I'm trying to import an Excel file into pandas. I'm using df=pd.read_excel(file_path), but it keeps giving me this error:
*** No CODEPAGE record, no encoding_override: will use 'ascii'
*** No CODEPAGE record, no encoding_override: will use 'ascii'
Traceback (most recent call last):
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/FindCos/FindCos_Functions.py", line 5468, in <module>
adjust_sheet(y1,y2,y3)
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/FindCos/FindCos_Functions.py", line 5130, in adjust_sheet
y1=pd.read_excel(y1)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/pandas/util/_decorators.py", line 118, in wrapper
return func(*args, **kwargs)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/pandas/io/excel.py", line 230, in read_excel
io = ExcelFile(io, engine=engine)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/pandas/io/excel.py", line 294, in __init__
self.book = xlrd.open_workbook(self._io)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/xlrd/__init__.py", line 162, in open_workbook
ragged_rows=ragged_rows,
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/xlrd/book.py", line 119, in open_workbook_xls
bk.get_sheets()
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/xlrd/book.py", line 719, in get_sheets
self.get_sheet(sheetno)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/xlrd/book.py", line 710, in get_sheet
sh.read(self)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/xlrd/sheet.py", line 815, in read
strg = unpack_string(data, 6, bk.encoding or bk.derive_encoding(), lenlen=2)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/xlrd/biffh.py", line 249, in unpack_string
return unicode(data[pos:pos+nchars], encoding)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/xlrd/timemachine.py", line 30, in <lambda>
unicode = lambda b, enc: b.decode(enc)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
The file I'm trying to import is this one.
Is this an encoding problem, or is some character in the file causing it? What would be the way to solve it?
pd.read_excel('data.csv', encoding='utf-8')
@astrobiologist gave a good hint.
Since I didn't want the hassle of going into patches, the way I found to solve it was to open the file in OpenOffice and save it as an Excel 97 file. That finally worked.
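If patching is off the table but you would still rather fix it in code: xlrd's open_workbook takes an encoding_override argument, and older pandas versions accept an xlrd Book as the io argument to read_excel. A sketch under those assumptions; the filename is a placeholder and cp1252 is only a guess at the real codepage:

import pandas as pd
import xlrd

# encoding_override supplies the codepage the missing CODEPAGE record
# would have declared; adjust 'cp1252' to match the actual file.
book = xlrd.open_workbook('your_file.xls', encoding_override='cp1252')
df = pd.read_excel(book, engine='xlrd')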
I'm using the lxml element factory on Python 3 to create an XML file that contains base64-encoded PDF files. The XML file will be used to import data into a database software, so the schema cannot be changed.
When creating the XML file, lxml complains about the length of the base64 string:
article = E.article(
    E.galley(
        E.label('PDF'),
        E.file(
            ET.XML("<embed filename=\"" + row['galley'] + ".pdf\""
                   + " encoding=\"base64\" mime_type=\"application/pdf\" >"
                   + str(base64fulltext)
                   + "</embed>")
        ),
        self.LOCALE(row['language']),
    ),
    self.LANGUAGE(row['language'])
)
When running the whole script, the error message ('line 45') points to the line containing str(base64fulltext) in the snippet above. The error message is as follows:
(lxml) vboxadmin@linux-x3el:~/repos/x> python3 test-csvFileImport.py
Traceback (most recent call last):
File "test-csvFileImport.py", line 65, in <module>
articlePdfBase64)
File "/home/vboxadmin/repos/x/y/writer.py", line 45, in exportArticle
+ "</embed>")
File "src/lxml/etree.pyx", line 3192, in lxml.etree.XML
File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1757, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1067, in lxml.etree._BaseParser._parseUnicodeDoc
File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: xmlSAX2Characters: huge text node, line 1, column 10027189
The expected result would have been for the base64 string to be written to the XML file.
So far, I have only found the huge_tree option on lxml.etree.iterparse (http://lxml.de/api/lxml.etree.iterparse-class.html), but I am not sure whether or how I can use it to solve my problem.
As a workaround, I am considering using string replacement to insert the base64 string into the XML after it has been written to file. However, I would be happier with a proper lxml solution if anyone can suggest one. Thanks!
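For what it's worth, huge_tree is not specific to iterparse: lxml.etree.XMLParser accepts it as well, and etree.XML takes an optional parser argument. A sketch under those assumptions, with a placeholder filename and payload:

from lxml import etree as ET

# Placeholder standing in for the real base64-encoded PDF.
base64fulltext = 'JVBERi0xLjQK...'

# huge_tree=True lifts libxml2's guard against oversized text nodes.
parser = ET.XMLParser(huge_tree=True)
embed = ET.XML('<embed filename="example.pdf" encoding="base64"'
               ' mime_type="application/pdf">' + base64fulltext + '</embed>',
               parser)

Building the embed element with the factory and assigning the payload to its .text attribute should also sidestep the limit, since the large string then never passes through the parser at all.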
On Python 3.2.3 running on Kubuntu Linux 12.10 with Requests 0.12.1 and BeautifulSoup 4.1.0, I am having some web pages break on parsing:
import bs4
import requests
from pprint import pprint

try:
    response = requests.get('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')
except Exception as error:
    return False  # in the full script this lives inside a function
pprint(str(type(response)))
pprint(response)
pprint(str(type(response.content)))
soup = bs4.BeautifulSoup(response.content)
Note that hundreds of other web pages parse fine. What is in this particular page that is crashing Python, and how can I work around it? Here is the crash:
bruno:scraper$ ./test-broken-site.py
"<class 'requests.models.Response'>"
<Response [200]>
"<class 'bytes'>"
Traceback (most recent call last):
File "./test-broken-site.py", line 146, in <module>
main(sys.argv)
File "./test-broken-site.py", line 138, in main
has_adsense('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')
File "./test-broken-site.py", line 67, in test_page_parse
soup = bs4.BeautifulSoup(response.content)
File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 172, in __init__
self._feed()
File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 185, in _feed
self.builder.feed(self.markup)
File "/usr/lib/python3/dist-packages/bs4/builder/_lxml.py", line 175, in feed
self.parser.close()
File "parser.pxi", line 1171, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:79886)
File "parsertarget.pxi", line 126, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:88932)
File "lxml.etree.pyx", line 282, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:7469)
File "saxparser.pxi", line 288, in lxml.etree._handleSaxDoctype (src/lxml/lxml.etree.c:85572)
File "parsertarget.pxi", line 84, in lxml.etree._PythonSaxParserTarget._handleSaxDoctype (src/lxml/lxml.etree.c:88469)
File "/usr/lib/python3/dist-packages/bs4/builder/_lxml.py", line 150, in doctype
doctype = Doctype.for_name_and_ids(name, pubid, system)
File "/usr/lib/python3/dist-packages/bs4/element.py", line 720, in for_name_and_ids
return Doctype(value)
File "/usr/lib/python3/dist-packages/bs4/element.py", line 653, in __new__
return str.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
TypeError: coercing to str: need bytes, bytearray or buffer-like object, NoneType found
Instead of bs4.BeautifulSoup(response.content) I had tried bs4.BeautifulSoup(response.text). That had the same result (the same crash on this page). What can I do to work around pages that break like this, so that I can parse them?
The website in your output has this doctype:
<!DOCTYPE>
whereas a well-formed page would have something like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
When the BeautifulSoup parser tries to get the doctype here:
File "/usr/lib/python3/dist-packages/bs4/element.py", line 720, in for_name_and_ids
return Doctype(value)
Here value is None, and when str.__new__ tries to coerce it to a string, the parser fails with the TypeError above.
One solution is to fix the doctype manually with a regex before passing the page to BeautifulSoup.
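For example (a sketch; the pattern and the substituted doctype are illustrative, not the only reasonable choices):

import re

import bs4
import requests

response = requests.get('http://www.wbsonline.com/resources/employee-check-tampering-fraud/')

# Rewrite an empty <!DOCTYPE> into a named one before parsing; the bytes
# pattern only matches a doctype that declares no name at all.
fixed = re.sub(br'<!DOCTYPE\s*>', b'<!DOCTYPE html>', response.content, count=1)
soup = bs4.BeautifulSoup(fixed)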
I'm attempting to pull some data off a popular browser-based game, but am having trouble with some decoding errors:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.neopets.com/")
p = BeautifulSoup(r.text)
This produces the following stack trace:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build/bdist.linux-x86_64/egg/bs4/__init__.py", line 172, in __init__
File "build/bdist.linux-x86_64/egg/bs4/__init__.py", line 185, in _feed
File "build/bdist.linux-x86_64/egg/bs4/builder/_lxml.py", line 195, in feed
File "parser.pxi", line 1187, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:87912)
File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:97055)
File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:8862)
File "saxparser.pxi", line 274, in lxml.etree._handleSaxCData (src/lxml/lxml.etree.c:93385)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 476: invalid start byte
Doing the following:
print repr(r.text[476 - 10: 476 + 10])
Produces:
u'ttp-equiv="X-UA-Comp'
I'm really not sure what the issue is here. Any help is greatly appreciated. Thank you.
.text on a response returns a decoded unicode value, but perhaps you should let BeautifulSoup do the decoding for you:
p = BeautifulSoup(r.content, from_encoding=r.encoding)
r.content returns the raw, undecoded bytestring, and r.encoding is the encoding requests detected from the response headers.
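If the header-declared encoding is itself wrong, another option (a sketch, not the only fix) is to decode leniently yourself, replacing any undecodable bytes before parsing:

p = BeautifulSoup(r.content.decode(r.encoding or 'utf-8', 'replace'))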