I thought BeautifulSoup would be able to handle malformed documents, but when I fed it the source of a page, it printed the following traceback:
Traceback (most recent call last):
File "mx.py", line 7, in
s = BeautifulSoup(content)
File "build\bdist.win32\egg\BeautifulSoup.py", line 1499, in __init__
File "build\bdist.win32\egg\BeautifulSoup.py", line 1230, in __init__
File "build\bdist.win32\egg\BeautifulSoup.py", line 1263, in _feed
File "C:\Python26\lib\HTMLParser.py", line 108, in feed
self.goahead(0)
File "C:\Python26\lib\HTMLParser.py", line 150, in goahead
k = self.parse_endtag(i)
File "C:\Python26\lib\HTMLParser.py", line 314, in parse_endtag
self.error("bad end tag: %r" % (rawdata[i:j],))
File "C:\Python26\lib\HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u"", at line 258, column 34
Shouldn't it be able to handle this sort of thing? If it can, how do I do that? If not, is there a module that can handle malformed documents?
EDIT: here's an update. I saved the page locally using Firefox and tried to create a soup object from the contents of the file. That's where BeautifulSoup fails. If I try to create a soup object directly from the website, it works. Here's the document that causes trouble for soup.
It worked fine for me using BeautifulSoup version 3.0.7. The latest is 3.1.0, but there's a note on the BeautifulSoup home page to try 3.0.7a if you're having trouble. I think I ran into a problem similar to yours some time ago and reverted, which fixed it; I'd try that.
If you want to stick with your current version, I suggest removing the large <script> block at the top, since that is where the error occurs, and since you cannot parse that section with BeautifulSoup anyway.
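A minimal sketch of that idea (assuming content holds the saved page source; the regex is my own rough cut, not something from BeautifulSoup itself):
import re
from BeautifulSoup import BeautifulSoup  # the 3.x series discussed here

# Drop <script>...</script> blocks before parsing; BeautifulSoup treats
# script contents as opaque text anyway, so nothing useful is lost.
content = re.sub(r"(?is)<script.*?</script>", "", content)
soup = BeautifulSoup(content)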
In my experience BeautifulSoup isn't that fault tolerant. I had to use it once for a small script and ran into these problems. I think using a regular expression to strip out the tags helped a bit, but I eventually just gave up and moved the script over to Ruby and Nokogiri.
The problem appears to be the
contents = contents.replace(/</g, '&lt;');
in line 258 plus the similar
contents = contents.replace(/>/g, '&gt;');
in the next line.
I'd just use re.sub to clobber all occurrences of r"replace(/[<>]/" with something innocuous before feeding it to BeautifulSoup ... moving away from BeautifulSoup would be like throwing out the baby with the bathwater, IMHO.
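A minimal sketch of that approach (assuming content holds the page source, as in the question; note the "(" has to be escaped in the actual pattern, unlike in the shorthand quoted above):
import re
from BeautifulSoup import BeautifulSoup  # the 3.x series

# Neutralize the JavaScript replace(/</g, ...) and replace(/>/g, ...) calls
# that HTMLParser mistakes for malformed end tags.
content = re.sub(r"replace\(/[<>]/", "replace(/x/", content)
soup = BeautifulSoup(content)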
I'm trying to scrape the title off of a webpage. Initially I tried using BeautifulSoup, but found that the page itself wouldn't load without JavaScript, so I'm using some code that I found through Google that uses the requests-html library:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")
soup.find_all('h1')
But there's always an error along the lines of:
D:\Python\TitleSraping\venv\Scripts\python.exe "D:/Python/TitleSraping/venv/Text Scraping.py"
Traceback (most recent call last):
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:/Python/TitleSraping/venv/Text Scraping.py", line 5, in <module>
resp.html.render()
File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 598, in render
content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
File "D:\Program Files (x86)\Python\lib\asyncio\base_events.py", line 584, in run_until_complete
return future.result()
File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 531, in _async_render
content = await page.content()
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\page.py", line 780, in content
return await frame.content()
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 379, in content
'''.strip())
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 295, in evaluate
pageFunction, *args, force_expr=force_expr)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 55, in evaluate
pageFunction, *args, force_expr=force_expr)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
_rewriteError(e)
File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 238, in _rewriteError
raise type(error)(msg)
pyppeteer.errors.NetworkError: Execution context was destroyed, most likely because of a navigation.
Process finished with exit code 1
Does anyone know what this means? I'm quite new to this, so I apologize if I'm using any terminology improperly.
As Ivan said, here you have the full code: sleep=1, keep_page=True does the trick.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render(sleep=1, keep_page=True)
soup = BeautifulSoup(resp.html.html, "lxml")
print(soup.find_all('title'))
Response:
[<title>
Milled wheat and wheat flour produced</title>]
Seems like a bug in the underlying library, pyppeteer, triggered while processing some JavaScript. Here's one workaround from https://github.com/kennethreitz/requests-html/issues/251; maybe it'll help.
resp.html.render(sleep=1, keep_page=True)
You need to load the JS, because if you don't load it, the HTML code won't load. You can use Selenium.
Try Selenium.
Selenium is a library that allows programs to interact with web pages by taking control of the browser.
Here is an example in an answer to someone else's question.
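A minimal sketch of the Selenium route (my own, since the linked answer isn't quoted here; it assumes chromedriver is installed and on your PATH):
from selenium import webdriver

# A real browser executes the page's JavaScript before we read the DOM
driver = webdriver.Chrome()
driver.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
print(driver.title)  # the rendered page title
driver.quit()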
Using OS X 10.6.8, libxml2 2.7.8, libxslt 1.1.26, and Python 2.6, I'm trying to run the tumblrRestore.py script linked here:
https://github.com/hughsaunders/Tumblr-Restore/blob/master/tumblrRestore.py
It ran successfully and restored 76 posts before crashing.
However, on the second run I got an ExpatError: no element found, and I haven't been able to run it successfully since; it always produces this same error now. Error text:
Tumblr Restore
Traceback (most recent call last):
File "tumblrRestore.py", line 264, in <module>
cli.start()
File "tumblrRestore.py", line 232, in start
bp.parse()
File "tumblrRestore.py", line 51, in parse
postelement=ElementTree.fromstring(xml_string)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/etree/ElementTree.py", line 964, in XM
return parser.close()
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/etree/ElementTree.py", line 1254, in close
self._parser.Parse("", 1) # end of data
xml.parsers.expat.ExpatError: no element found: line 1, column 0
I'm wondering whether I have wrong, competing, or outdated versions of Python or lxml, though that still doesn't explain why the script ran successfully once.
Complete newbie, any advice appreciated.
Check the extract_xml_string method of your BackupParser class. It definitely returns an empty string, because your begin_re regular expression doesn't match the XML header.
Try this one instead:
begin_re = re.compile(r"<\? xml .*\?>")
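If your backup's header has no space after the "<?" (the usual form is <?xml version="1.0" encoding="UTF-8"?>), a slightly more tolerant pattern may be safer; this variant is an assumption on my part, since I haven't seen your file:
import re

# Allow optional whitespace between "<?" and "xml"
begin_re = re.compile(r"<\?\s*xml .*\?>")
print(begin_re.search('<?xml version="1.0" encoding="UTF-8"?>'))  # matches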
Thanks in advance for your help; I am new to Python 3.x.
When I try to use Python 3.x to talk to the TestLink XML-RPC server, I get the error below, but I can run the code under Python 2.x. Any idea?
import xmlrpc.client
server = xmlrpc.client.Server("http://172.16.29.132/SITM/lib/api/xmlrpc.php")  # here is my testlink server
print(server.system.listMethods())  # I can print the methods list here
print(server.tl.ping())  # got the error
Here is the error:
['system.multicall', 'system.listMethods', 'system.getCapabilities', 'tl.repeat', 'tl.sayHello', 'tl.ping', 'tl.setTestMode', 'tl.about', 'tl.checkDevKey', 'tl.doesUserExist', 'tl.deleteExecution', 'tl.getTestSuiteByID', 'tl.getFullPath', 'tl.getTestCase', 'tl.getTestCaseAttachments', 'tl.getFirstLevelTestSuitesForTestProject', 'tl.getTestCaseCustomFieldDesignValue', 'tl.getTestCaseIDByName', 'tl.getTestCasesForTestPlan', 'tl.getTestCasesForTestSuite', 'tl.getTestSuitesForTestSuite', 'tl.getTestSuitesForTestPlan', 'tl.getLastExecutionResult', 'tl.getLatestBuildForTestPlan', 'tl.getBuildsForTestPlan', 'tl.getTotalsForTestPlan', 'tl.getTestPlanPlatforms', 'tl.getProjectTestPlans', 'tl.getTestPlanByName', 'tl.getTestProjectByName', 'tl.getProjects', 'tl.addTestCaseToTestPlan', 'tl.assignRequirements', 'tl.uploadAttachment', 'tl.uploadTestCaseAttachment', 'tl.uploadTestSuiteAttachment', 'tl.uploadTestProjectAttachment', 'tl.uploadRequirementAttachment', 'tl.uploadRequirementSpecificationAttachment', 'tl.uploadExecutionAttachment', 'tl.createTestSuite', 'tl.createTestProject', 'tl.createTestPlan', 'tl.createTestCase', 'tl.createBuild', 'tl.setTestCaseExecutionResult', 'tl.reportTCResult']
Traceback (most recent call last):
File "F:\SQA\Python\Testlink\Test.py", line 5, in <module>
print (server.tl.ping())
File "C:\Python31\lib\xmlrpc\client.py", line 1029, in __call__
return self.__send(self.__name, args)
File "C:\Python31\lib\xmlrpc\client.py", line 1271, in __request
verbose=self.__verbose
File "C:\Python31\lib\xmlrpc\client.py", line 1070, in request
return self.parse_response(resp)
File "C:\Python31\lib\xmlrpc\client.py", line 1164, in parse_response
p.feed(response)
File "C:\Python31\lib\xmlrpc\client.py", line 454, in feed
self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: junk after document element: line 2, column 0
When I've seen this message before, it happened because the transported data wasn't escaped for XML transport. The solution was to wrap the data in an XMLRPC Binary object.
In your case, you don't control the server side, so the above isn't a solution for you but it may suggest what the actual problem is.
Also, the Python 2 versus Python 3 difference suggests that there is a text/bytes issue at work.
To help diagnose the issue, set verbose=True so you can see the actual HTTP request/response headers and the XML request/response. That may show you what is at line 2, column 0. You may find that the PHP script is not wrapping binary data in base64 encoding as required by the XMLRPC spec.
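A minimal sketch of that diagnostic step, reusing the server URL from the question:
import xmlrpc.client

# verbose=True dumps the raw HTTP headers and XML bodies to stdout,
# which shows what actually sits at line 2, column 0 of the response.
server = xmlrpc.client.ServerProxy("http://172.16.29.132/SITM/lib/api/xmlrpc.php", verbose=True)
print(server.tl.ping())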
Thank you. I went through the method list and found that only 'tl.sayHello', 'tl.ping', and 'tl.about' have this problem; all of them pass a string from an empty PHP autoloader file (*.class.php) to the parser, while the other methods pass an XML file. So I gave up on those methods and the script works fine.
I have this very simple Python code to read XML from the Wikipedia API:
import urllib
from xml.dom import minidom
usock = urllib.urlopen("http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500")
xmldoc=minidom.parse(usock)
usock.close()
print xmldoc.toxml()
But this code fails with these errors:
Traceback (most recent call last):
File "/home/user/workspace/wikipediafoundations/src/list.py", line 5, in <module><br>
xmldoc=minidom.parse(usock)<br>
File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse<br>
return expatbuilder.parse(file)<br>
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 928, in parse<br>
result = builder.parseFile(file)<br>
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile<br>
parser.Parse(buffer, 0)<br>
xml.parsers.expat.ExpatError: syntax error: line 1, column 62<br>
I have no clue, as I'm just learning Python. Is there a way to get a more detailed error? Does anyone know the solution? Also, please recommend a better language to do this in.
Thank You,
Venkat Rao
The URL you're requesting is an HTML representation of the XML that would be returned:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500
So the XML parser fails. You can see this by pasting the above into a browser. Try adding format=xml at the end:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500&format=xml
as documented on the linked page:
http://en.wikipedia.org/w/api.php
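A minimal version of the code from the question with that one change (still Python 2, to match the original):
import urllib
from xml.dom import minidom

# format=xml is the only change from the original URL
usock = urllib.urlopen("http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500&format=xml")
xmldoc = minidom.parse(usock)
usock.close()
print xmldoc.toxml()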
I tried to install the Yahoo BOSS mashup framework, but am having trouble running the examples provided. Examples 1, 2, 5, and 6 work, but 3 & 4 give Expat errors. Here is the output from ex3.py:
$ python examples/ex3.py
examples/ex3.py:33: Warning: 'as' will become a reserved keyword in Python 2.6
Traceback (most recent call last):
File "examples/ex3.py", line 27, in <module>
digg = db.select(name="dg", udf=titlef, url="http://digg.com/rss_search?search=google+android&area=dig&type=both&section=news")
File "/usr/lib/python2.5/site-packages/yos/yql/db.py", line 214, in select
tb = create(name, data=data, url=url, keep_standards_prefix=keep_standards_prefix)
File "/usr/lib/python2.5/site-packages/yos/yql/db.py", line 201, in create
return WebTable(name, d=rest.load(url), keep_standards_prefix=keep_standards_prefix)
File "/usr/lib/python2.5/site-packages/yos/crawl/rest.py", line 38, in load
return xml2dict.fromstring(dl)
File "/usr/lib/python2.5/site-packages/yos/crawl/xml2dict.py", line 41, in fromstring
t = ET.fromstring(s)
File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 963, in XML
parser.feed(text)
File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 1245, in feed
self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0
It looks like both examples are failing when trying to query Digg.com. Here is the query that is constructed in ex3.py's code:
diggf = lambda r: {"title": r["title"]["value"], "diggs": int(r["diggCount"]["value"])}
digg = db.select(name="dg", udf=diggf, url="http://digg.com/rss_search?search=google+android&area=dig&type=both&section=news")
The problem is the Digg search parameter: it should be "s=", not "search=".
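So, assuming the rest of the example stays as quoted above, the call would become:
diggf = lambda r: {"title": r["title"]["value"], "diggs": int(r["diggCount"]["value"])}
digg = db.select(name="dg", udf=diggf, url="http://digg.com/rss_search?s=google+android&area=dig&type=both&section=news")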
I believe that must be an error in the example: it's getting a JSON result (indeed, if you copy and paste that URL into your browser, you'll download a file named search.json which starts with
{"results":[{"profile_image_url":
"http://a3.twimg.com/profile_images/255524395/KEN_OMALLEY_REVISED_normal.jpg",
"created_at":"Mon, 14 Sep 2009 14:52:07 +0000","from_user":"twilightlords",
i.e. perfectly normal JSON; but then instead of parsing it with modules such as json or simplejson, it tries to parse it as XML -- and obviously this attempt fails.
I believe the fix (which probably needs to be brought to the attention of whoever maintains that code, so they can incorporate it) is either to ask for XML instead of JSON output, OR to parse the resulting JSON with appropriate means instead of trying to treat it as XML (I'm not sure how best to implement either change, as I'm not familiar with that code).
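For the second option, a minimal sketch (hypothetical, assuming dl holds the downloaded body, as in rest.py's load()):
import simplejson  # the stdlib json module works too on Python 2.6+

def load_json(dl):
    # Parse the body as JSON instead of feeding it to ElementTree
    return simplejson.loads(dl)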