I have this very simple Python code to read XML from the Wikipedia API:
import urllib
from xml.dom import minidom
usock = urllib.urlopen("http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500")
xmldoc=minidom.parse(usock)
usock.close()
print xmldoc.toxml()
But this code fails with the following error:
Traceback (most recent call last):
  File "/home/user/workspace/wikipediafoundations/src/list.py", line 5, in <module>
    xmldoc=minidom.parse(usock)
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: syntax error: line 1, column 62
I have no clue, as I'm just learning Python. Is there a way to get a more detailed error? Does anyone know the solution? Also, please recommend a better language to do this in.
Thank You,
Venkat Rao
The URL you're requesting is an HTML representation of the XML that would be returned:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500
So the XML parser fails. You can see this by pasting the URL above into a browser. Try adding format=xml at the end:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500&format=xml
as documented on the linked page:
http://en.wikipedia.org/w/api.php
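For reference, here is the question's code with only that change applied (a minimal sketch, assuming Python 2 as in the traceback):
import urllib
from xml.dom import minidom

# format=xml asks the API for raw XML instead of its HTML help page
url = ("http://en.wikipedia.org/w/api.php?action=query"
       "&titles=Fractal&prop=links&pllimit=500&format=xml")
usock = urllib.urlopen(url)
xmldoc = minidom.parse(usock)
usock.close()
print xmldoc.toxml()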
Related
I'm trying to write a simple script that parses my XML document to get the name attribute from all <xs:element> tags. I'm using minidom (is there a better way?). Here is my code so far:
import csv
from xml.dom import minidom
xmldoc = minidom.parse('core.xml')
core = xmldoc.getElementsByTagName('xs:element')
print(len(core))
print(core[0].attributes['name'].value)
for x in core:
    print(x.attributes['name'].value)
I'm getting this error:
Traceback (most recent call last):
  File "C:/Users/user/Desktop/XML Parsing/test.py", line 9, in <module>
    print(core[0].attributes['name'].value)
  File "C:\Python27\lib\xml\dom\minidom.py", line 522, in __getitem__
    return self._attrs[attname_or_tuple]
KeyError: 'name'
The KeyError: 'name' means the first <xs:element> node has no name attribute; in XML Schema, elements declared with ref= instead of name= won't have one. Guard against the missing attribute instead of indexing blindly:
for x in core:
    if x.hasAttribute('name'):
        print(x.attributes['name'].value)
I'm using Python 2.4 on a 32-bit Windows PC. I'm trying to parse a very large XML file using the ElementTree module. I downloaded version 1.2.6 of this module from effbot.org.
Here is the code I'm using:
import elementtree.ElementTree as ET
input = '''
<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''
stuff = ET.fromstring(input)
lst = stuff.findall("users/user")
print len(lst)
for item in lst:
    print item.attrib["x"]
item = lst[0]
ET.dump(item)
item.get("x") # get works on attributes
item.find("id").text
item.find("id").tag
for user in stuff.getiterator('user'):
    print "User", user.attrib["x"]
    ET.dump(user)
If the content of input is large, more than 10,000 lines, the fromstring function raises an error (below). Can anyone help me rectify this error?
This is the error generated:
Traceback (most recent call last):
  File "C:\Documents and Settings\hariprar\My Documents\My files\Python Try\xml_try1.py", line 16, in -toplevel-
    stuff = ET.fromstring(input)
  File "C:\Python24\Lib\site-packages\elementtree\ElementTree.py", line 1012, in XML
    return api.fromstring(text)
  File "C:\Python24\Lib\site-packages\elementtree\ElementTree.py", line 182, in fromstring
    parser.feed(text)
  File "C:\Python24\Lib\site-packages\elementtree\ElementTree.py", line 1292, in feed
    self._parser.Parse(data, 0)
ExpatError: not well-formed (invalid token): line 2445, column 39
Take a look at the iterparse function. It will let you parse your input incrementally rather than reading it into memory as one big chunk.
It's described here: http://effbot.org/zone/element-iterparse.htm
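A minimal sketch, assuming the same user records live in a file (the filename users.xml is illustrative):
import elementtree.ElementTree as ET

# iterparse yields (event, element) pairs as the file is read, so the
# whole document never has to be held in memory at once.
for event, elem in ET.iterparse("users.xml"):
    if elem.tag == "user":
        print elem.attrib["x"]
        elem.clear()  # discard the element once processed to free memory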
Using OS X 10.6.8, libxml2 2.7.8, libxslt 1.1.26, and Python 2.6, I'm trying to run the tumblrRestore.py script linked here:
https://github.com/hughsaunders/Tumblr-Restore/blob/master/tumblrRestore.py
It ran successfully and restored 76 posts before crashing.
However, on the second run I got an ExpatError: no element found, and I have not been able to run it successfully since; it always produces this same error now. Error text:
Tumblr Restore
Traceback (most recent call last):
  File "tumblrRestore.py", line 264, in <module>
    cli.start()
  File "tumblrRestore.py", line 232, in start
    bp.parse()
  File "tumblrRestore.py", line 51, in parse
    postelement=ElementTree.fromstring(xml_string)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/etree/ElementTree.py", line 964, in XML
    return parser.close()
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/etree/ElementTree.py", line 1254, in close
    self._parser.Parse("", 1) # end of data
xml.parsers.expat.ExpatError: no element found: line 1, column 0
I'm wondering whether I have wrong, competing, or outdated versions of Python or lxml, though that still doesn't explain why the script ran successfully once.
Complete newbie, any advice appreciated.
Check your extract_xml_string method of the BackupParser class. It definitely returns an empty string, because your begin_re regular expression doesn't match the XML header.
Try the next one:
begin_re = re.compile("<\? xml .*\?>")
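To check whether the pattern matches the header your backup actually starts with, a quick test like this might help (the filename posts.xml is illustrative):
import re

begin_re = re.compile("<\? xml .*\?>")

# Read the first line of the backup and see whether the pattern matches.
with open("posts.xml") as backup:
    first_line = backup.readline()
print bool(begin_re.search(first_line))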
I tried to install the Yahoo BOSS mashup framework, but am having trouble running the examples provided. Examples 1, 2, 5, and 6 work, but 3 & 4 give Expat errors. Here is the output from ex3.py:
$ python examples/ex3.py
examples/ex3.py:33: Warning: 'as' will become a reserved keyword in Python 2.6
Traceback (most recent call last):
  File "examples/ex3.py", line 27, in <module>
    digg = db.select(name="dg", udf=titlef, url="http://digg.com/rss_search?search=google+android&area=dig&type=both&section=news")
  File "/usr/lib/python2.5/site-packages/yos/yql/db.py", line 214, in select
    tb = create(name, data=data, url=url, keep_standards_prefix=keep_standards_prefix)
  File "/usr/lib/python2.5/site-packages/yos/yql/db.py", line 201, in create
    return WebTable(name, d=rest.load(url), keep_standards_prefix=keep_standards_prefix)
  File "/usr/lib/python2.5/site-packages/yos/crawl/rest.py", line 38, in load
    return xml2dict.fromstring(dl)
  File "/usr/lib/python2.5/site-packages/yos/crawl/xml2dict.py", line 41, in fromstring
    t = ET.fromstring(s)
  File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 963, in XML
    parser.feed(text)
  File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 1245, in feed
    self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0
It looks like both examples are failing when trying to query Digg.com. Here is the query that is constructed in ex3.py's code:
diggf = lambda r: {"title": r["title"]["value"], "diggs": int(r["diggCount"]["value"])}
digg = db.select(name="dg", udf=diggf, url="http://digg.com/rss_search?search=google+android&area=dig&type=both&section=news")
The problem is the Digg search parameter: it should be "s=", not "search=".
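With that change, the call from the question would read (only the parameter name differs):
digg = db.select(name="dg", udf=diggf, url="http://digg.com/rss_search?s=google+android&area=dig&type=both&section=news")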
I believe that must be an error in the example: it's getting a JSON result. Indeed, if you copy and paste that URL into your browser, you'll download a file named search.json which starts with
{"results":[{"profile_image_url":
"http://a3.twimg.com/profile_images/255524395/KEN_OMALLEY_REVISED_normal.jpg",
"created_at":"Mon, 14 Sep 2009 14:52:07 +0000","from_user":"twilightlords",
i.e. perfectly normal JSON; but then instead of parsing it with modules such as json or simplejson, it tries to parse it as XML -- and obviously this attempt fails.
I believe the fix (which probably needs to be brought to the attention of whoever maintains that code, so they can incorporate it) is either to ask for XML instead of JSON output, or to parse the resulting JSON with appropriate means instead of trying to treat it as XML. I'm not sure how best to implement either change, as I'm not familiar with that code.
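For the second option, a minimal sketch of the JSON route, assuming the raw response is already in a string named body (the variable name is illustrative):
try:
    import json  # standard library on Python 2.6+
except ImportError:
    import simplejson as json  # common fallback on Python 2.5

# body holds the raw HTTP response shown above, fetched elsewhere
data = json.loads(body)
for result in data["results"]:
    print result.get("from_user")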
I thought BeautifulSoup would be able to handle malformed documents, but when I sent it the source of a page, the following traceback got printed:
Traceback (most recent call last):
  File "mx.py", line 7, in <module>
    s = BeautifulSoup(content)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1499, in __init__
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1230, in __init__
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1263, in _feed
  File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python26\lib\HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "C:\Python26\lib\HTMLParser.py", line 314, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "C:\Python26\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u"", at line 258, column 34
Shouldn't it be able to handle this sort of thing? If it can, how do I do it? If not, is there a module that can handle malformed documents?
EDIT: here's an update. I saved the page locally using Firefox and tried to create a soup object from the contents of the file. That's where BeautifulSoup fails. If I try to create a soup object directly from the website, it works. Here's the document that causes trouble for soup.
Worked fine for me using BeautifulSoup version 3.0.7. The latest is 3.1.0, but there's a note on the BeautifulSoup home page to try 3.0.7a if you're having trouble. I think I ran into a problem similar to yours some time ago and reverted, which fixed it; I'd try that.
If you want to stick with your current version, I suggest removing the large <script> block at the top, since that is where the error occurs, and since you cannot parse that section with BeautifulSoup anyway.
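A minimal sketch of that workaround, assuming the page source is in a string named content (the variable name is illustrative):
import re
from BeautifulSoup import BeautifulSoup

# Drop <script>...</script> blocks before parsing; their contents
# aren't HTML, and BeautifulSoup can't make use of them anyway.
stripped = re.sub(r"(?s)<script.*?</script>", "", content)
soup = BeautifulSoup(stripped)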
In my experience, BeautifulSoup isn't that fault-tolerant. I had to use it once for a small script and ran into these problems. I think using a regular expression to strip out the tags helped a bit, but I eventually just gave up and moved the script over to Ruby and Nokogiri.
The problem appears to be the
contents = contents.replace(/</g, '&lt;');
in line 258, plus the similar
contents = contents.replace(/>/g, '&gt;');
in the next line.
I'd just use re.sub to clobber all occurrences of r"replace(/[<>]/" with something innocuous before feeding it to BeautifulSoup ... moving away from BeautifulSoup would be like throwing out the baby with the bathwater IMHO.
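A minimal sketch of that preprocessing step, again assuming the page source is in a string named content; the replacement text is arbitrary, just something HTMLParser won't mistake for an end tag:
import re
from BeautifulSoup import BeautifulSoup

# Neutralize the JavaScript replace(/</g, ...) and replace(/>/g, ...)
# calls whose "</" sequence HTMLParser misreads as an end tag.
cleaned = re.sub(r"replace\(/[<>]/g", "replace(/X/g", content)
soup = BeautifulSoup(cleaned)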