Using Python to get a field in XML from URL - python

I'm trying to get information from a specific field of an XML file at a URL. I'm getting these weird errors before I even get started. Here is my code:
import urllib
from xml.dom import minidom
url1 = 'http://www.dac.unicamp.br/sistemas/horarios/grad/G5A0/indiceP.htm'
data1 = urllib.urlopen(url1)
xml1 = minidom.parse(data1)
I get this error:
File "C:\Users\Administrator\Desktop\teste.py", line 15, in <module>
xml1 = minidom.parse(data1)
File "C:\Python27\lib\xml\dom\minidom.py", line 1920, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: not well-formed (invalid token): line 4, column 22
Did I do anything wrong? I copied those functions from a tutorial, and it seems like it should be working.

Use lxml.html; it handles invalid XHTML better.
import lxml.html as lh
xml1 = lh.parse('http://www.dac.unicamp.br/sistemas/horarios/grad/G5A0/indiceP.htm')
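The ExpatError from the question can be reproduced without the network. The sketch below (Python 3, standard library only) feeds a small, made-up HTML snippet to minidom and prints the position that expat reports. The snippet and the helper name `try_parse` are illustrative assumptions, not the content of the page in the question.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError, ErrorString

# Made-up HTML: the unquoted attribute value and the unclosed <br>
# are acceptable in HTML but not in well-formed XML.
SAMPLE_HTML = """<html>
<body>
<p align=center>Hello<br>
</body>
</html>"""


def try_parse(markup):
    """Return a minidom Document, or None if the markup is not well-formed XML."""
    try:
        return minidom.parseString(markup)
    except ExpatError as err:
        # lineno and offset locate the first offending token in the input
        print("XML parse failed:", ErrorString(err.code),
              "at line", err.lineno, "column", err.offset)
        return None


print(try_parse(SAMPLE_HTML))          # the HTML snippet is rejected
print(try_parse("<root><p/></root>"))  # well-formed XML parses fine
```

The line and column in the message point at the first token the XML parser could not accept, which is usually enough to tell whether the input is really XML or an ordinary HTML page.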

Related

I am facing this error "xml.parsers.expat.ExpatError: not well-formed (invalid token):" while parsing URL data using minidom

I am getting the error xml.parsers.expat.ExpatError: syntax error: line 1, column 0 while parsing data from a URL using minidom. Can anyone help me with this?
Here is my code:
from xml.dom import minidom
import urllib2
url= 'http://www.awgp.org/about_us'
openurl=urllib2.urlopen(url)
doc=minidom.parse("about_us.xml")
Error:
Traceback (most recent call last):
File "test3.py", line 11, in <module>
doc=minidom.parse("about_us.xml")
File "C:\Python27\lib\xml\dom\minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 211, in parseFile
parser.Parse("", True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0
parser.Parse("", True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0
The above from your traceback indicates to me that your "about_us.xml" file is empty.
You have openurl but you have not shown that you've ever called openurl.read() to actually get at the data.
Nor have you shown where or how you've written said data to your "about_us.xml" file.
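The point above (the data has to be read, and has to be non-empty, before it is parsed) can be sketched as a small guard. This assumes Python 3; `parse_xml_bytes` is a hypothetical helper name and the payload is a stand-in, not the real page.

```python
from xml.dom import minidom


def parse_xml_bytes(data):
    """Parse raw bytes as XML, failing early if nothing was downloaded."""
    if not data or not data.strip():
        raise ValueError("no data to parse: the response or file is empty")
    return minidom.parseString(data)


# In real use the bytes would come from the response, for example:
#   from urllib.request import urlopen
#   data = urlopen(url).read()
data = b"<page><title>About us</title></page>"  # stand-in payload
doc = parse_xml_bytes(data)
print(doc.documentElement.tagName)
```

Failing with an explicit message makes an empty download easier to tell apart from malformed markup.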
from xml.dom import minidom
import urllib2
url= 'http://www.awgp.org/about_us'
openurl=urllib2.urlopen(url)
doc=minidom.parse(openurl)
print doc
gives me
Traceback (most recent call last):
File "main.py", line 5, in <module>
doc=minidom.parse(openurl)
File "/usr/local/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/local/lib/python2.7/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "/usr/local/lib/python2.7/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 51, column 81
which indicates that the page you are trying to parse as XML is not well-formed. Try using Beautiful Soup instead, which, from memory, is very forgiving.
from BeautifulSoup import BeautifulSoup
import urllib2
url= 'http://www.awgp.org/about_us'
openurl=urllib2.urlopen(url)
soup = BeautifulSoup(openurl.read())
for a in soup.findAll('a'):
    print (a.text, a.get('href'))
BTW, the import above requires version 3 of Beautiful Soup, which fits since you're still on Python 2.7.
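On Python 3 without third-party packages, a similar link listing can be approximated with the standard library's html.parser. This is a swapped-in technique, not Beautiful Soup; the class name `LinkCollector` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> element, tolerating sloppy HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_a = False  # True while inside an open <a> element
        self._href = None   # href of that <a>, if it has one
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            self.links.append(("".join(self._text).strip(), self._href))
            self._in_a = False


# Made-up markup with an unclosed <p> and an unquoted attribute value
sample = '<p>See <a href=/about_us>About us</a> and <a href="/contact">Contact</a>'
collector = LinkCollector()
collector.feed(sample)
collector.close()
for text, href in collector.links:
    print(text, href)
```

In real use the string passed to `feed` would be the decoded body of the response rather than the sample above.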

robust DOM parsing with getElementsByTagName

The following (from "Dive into Python")
from xml.dom import minidom
xmldoc = minidom.parse('/path/to/index.html')
reflist = xmldoc.getElementsByTagName('img')
failed with
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/path/to/htmlToNumEmbedded.py", line 2, in <module>
xmldoc = minidom.parse('/path/to/index.html')
File "/usr/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: mismatched tag: line 12, column 4
Using lxml, which is recommended by http://www.ianbicking.org/blog/2008/12/lxml-an-underappreciated-web-scraping-library.html, allows you to parse the document, but it does not seem to have a getElementsByTagName method. The following works:
from lxml import html
xmldoc = html.parse('/path/to/index.html')
root = xmldoc.getroot()
for i in root.iter("img"):
    print i
but seems kludgey: is there a built-in function that I overlooked?
Or another more elegant way to have robust DOM parsing with getElementsByTagName?
If you want a list of Element objects instead of iterating over the return value of iter, call list on it:
from lxml import html
reflist = list(html.parse('/path/to/index.html').iter('img'))
You can use BeautifulSoup for this:
from bs4 import BeautifulSoup
with open('/path/to/index.html') as f:
    soup = BeautifulSoup(f)
soup.find_all("img")
See Going through HTML DOM in Python

etree generating error when using urllib

I am trying to parse an HTML table in Python (2.7) with the solutions in this post.
When I try either of the first two with a string (as in the example), it works perfectly.
But when I try to use etree.XML on an HTML page I read with urllib, I get an error. I checked for each of the solutions, and the variable I pass is a str as well.
For the following code:
from lxml import etree
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = etree.XML(s)
I get this error:
File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 9, in <module>
table = etree.XML(s)
File "lxml.etree.pyx", line 2723, in lxml.etree.XML (src/lxml/lxml.etree.c:52448)
File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79932)
File "parser.pxi", line 1452, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78774)
File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75389)
File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 8 and head, line 8, column 48
and for this code:
from xml.etree import ElementTree as ET
import urllib
yearurl="http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s=urllib.urlopen(yearurl).read()
print type (s)
table = ET.XML(s)
I get this error:
Traceback (most recent call last):
File "C:/Users/user/PycharmProjects/Wikipedia/TestingFile.py", line 6, in <module>
table = ET.XML(s)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
parser.feed(text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 8, column 111
Although they look like the same kind of markup, HTML is not as strict as XML about being well-formed and following markup rules (opening/closing nodes, escaping entities, etc.). Hence, what passes for HTML may not be allowed in XML.
Therefore, consider using etree's HTML() function to parse the page. Additionally, you can use XPath to target the particular area you intend to extract or use. Below is an example that attempts to pull the main page's table. Note that the webpage uses quite a few nested tables.
from lxml import etree
import urllib.request as rq
yearurl = "http://www.boxofficemojo.com/yearly/chart/?yr=2014&p=.htm"
s = rq.urlopen(yearurl).read()
print(type(s))
# PARSE PAGE
htmlpage = etree.HTML(s)
# XPATH TO SPECIFIC CONTENT
htmltable = htmlpage.xpath("//table[tr/td/font/a/b='Rank']//text()")
for row in htmltable:
    print(row)
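One of the well-formedness rules mentioned above, escaping entities, is easy to see in isolation. A minimal sketch using only the Python 3 standard library; the sample string is made up.

```python
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape

raw_text = "Tom & Jerry <live>"

# A bare '&' or '<' in text is tolerated by browsers but rejected by XML parsers.
try:
    ET.fromstring("<title>" + raw_text + "</title>")
except ET.ParseError as err:
    print("rejected:", err)

# escape() replaces &, < and > with &amp;, &lt; and &gt;
element = ET.fromstring("<title>" + escape(raw_text) + "</title>")
print(element.text)
```

Pages written for browsers rarely follow this rule consistently, which is one reason an HTML-aware parser is the safer choice for scraped pages.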

Setting unicode declaration in lxml.html.clean.Cleaner or any other html2plaintext

How can I set the "Unicode strings encoding declaration" in the lxml.html.clean.Cleaner module? I'm looking to read the plaintext of a website, and have used lxml in the past as a way of doing this, scraping out the html and javascript. For some pages, I'm starting to get some weird errors about encoding, but can't make sense of how to set this param correctly in the documentation.
import requests
from lxml.html.clean import Cleaner
cleaner = Cleaner()
cleaner.javascript = True
cleaner.style = True
cleaner.html= True
>>> url = 'http://www.princeton.edu'
>>> r = requests.get(url)
>>> lx = r.text.replace('\t',' ').replace('\n',' ').replace('\r',' ')
>>> #lx = r.text
... lxclean = cleaner.clean_html(lx)
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/home/username/gh/venv/local/lib/python2.7/site-packages/lxml/html/clean.py", line 501, in clean_html
doc = fromstring(html)
File "/home/username/gh/venv/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 672, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/home/username/gh/venv/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 568, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "lxml.etree.pyx", line 2997, in lxml.etree.fromstring (src/lxml/lxml.etree.c:63276)
File "parser.pxi", line 1607, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:93592)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
However, it works for other urls, like 'http://www.google.com'
This seems suddenly broken. Use Beautiful Soup's HTML parser instead.
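The error message itself names two ways out: pass bytes, or pass text without the XML declaration. With requests, `r.content` holds the undecoded bytes, so handing `r.content` instead of `r.text` to the cleaner is one route (untested here against that site). For the other route, a minimal sketch that strips a leading declaration from an already decoded string; `strip_xml_declaration` is a hypothetical helper, not part of lxml, and the sample page is made up.

```python
import re

# Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
# at the very start of the text (optionally preceded by a BOM or whitespace).
_XML_DECLARATION = re.compile(r"^\ufeff?\s*<\?xml[^>]*\?>")


def strip_xml_declaration(text):
    """Return text without a leading XML declaration, if it has one."""
    return _XML_DECLARATION.sub("", text, count=1)


page = '<?xml version="1.0" encoding="UTF-8"?>\n<html><body>Hi</body></html>'
print(strip_xml_declaration(page).lstrip())
```

This would also explain why some URLs work and others do not: only pages that begin with such a declaration would trigger the ValueError.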

Wikipedia with Python

I have this very simple python code to read xml for the wikipedia api:
import urllib
from xml.dom import minidom
usock = urllib.urlopen("http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500")
xmldoc=minidom.parse(usock)
usock.close()
print xmldoc.toxml()
But this code returns with these errors:
Traceback (most recent call last):
File "/home/user/workspace/wikipediafoundations/src/list.py", line 5, in <module>
xmldoc=minidom.parse(usock)
File "/usr/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: syntax error: line 1, column 62
I have no clue, as I am just learning Python. Is there a way to get an error with more detail? Does anyone know the solution? Also, please recommend a better language to do this in.
Thank You,
Venkat Rao
The URL you're requesting returns an HTML representation of the XML rather than the XML itself:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500
so the XML parser fails. You can see this by pasting the URL above into a browser. Try adding format=xml at the end:
http://en.wikipedia.org/w/api.php?action=query&titles=Fractal&prop=links&pllimit=500&format=xml
as documented on the API page:
http://en.wikipedia.org/w/api.php
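A minimal Python 3 sketch of the same idea: build the query string with format=xml included, then read the link titles with minidom. The embedded response is a made-up stand-in with an assumed shape so that the example runs offline; the real API output may contain more attributes and elements.

```python
from urllib.parse import urlencode
from xml.dom import minidom

params = {
    "action": "query",
    "titles": "Fractal",
    "prop": "links",
    "pllimit": 500,
    "format": "xml",   # ask for XML explicitly, as suggested above
}
url = "http://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)

# Real use: xml_text = urllib.request.urlopen(url).read()
# Stand-in response (assumed shape, made-up data) so the sketch runs offline:
xml_text = (
    '<api><query><pages><page title="Fractal"><links>'
    '<pl ns="0" title="Chaos theory"/><pl ns="0" title="Mandelbrot set"/>'
    '</links></page></pages></query></api>'
)
doc = minidom.parseString(xml_text)
titles = [pl.getAttribute("title") for pl in doc.getElementsByTagName("pl")]
print(titles)
```

Building the query from a dict keeps the parameters readable and makes it harder to forget format=xml.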
