robust DOM parsing with getElementsByTagName - python

The following (from "Dive into Python")
from xml.dom import minidom
xmldoc = minidom.parse('/path/to/index.html')
reflist = xmldoc.getElementsByTagName('img')
failed with
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/path/to/htmlToNumEmbedded.py", line 2, in <module>
xmldoc = minidom.parse('/path/to/index.html')
File "/usr/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: mismatched tag: line 12, column 4
Using lxml, which is recommended by http://www.ianbicking.org/blog/2008/12/lxml-an-underappreciated-web-scraping-library.html, allows you to parse the document, but it does not seem to have a getElementsByTagName. The following works:
from lxml import html
xmldoc = html.parse('/path/to/index.html')
root = xmldoc.getroot()
for i in root.iter("img"):
    print i
but seems kludgey: is there a built-in function that I overlooked?
Or another more elegant way to have robust DOM parsing with getElementsByTagName?
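For comparison, robust tag extraction is also possible with nothing but the standard library: html.parser tolerates the kind of mismatched tags that make minidom's expat backend fail. A minimal sketch (Python 3; the subclass below is my own illustration, not from any of the answers):

```python
from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img>, even in broken HTML."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self.srcs.append(dict(attrs).get('src'))

# The stray </p> after the <div> would crash minidom with "mismatched tag"
broken = '<html><body><p><img src="a.png"><div></p><img src="b.png"></body>'
parser = ImgCollector()
parser.feed(broken)
print(parser.srcs)  # ['a.png', 'b.png']
```

On Python 2.7 the module is spelled HTMLParser (from HTMLParser import HTMLParser) and the class is old-style, so you would call HTMLParser.__init__(self) instead of super().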

If you want a list of elements instead of iterating over the return value of Element.iter, call list() on it:
from lxml import html
reflist = list(html.parse('/path/to/index.html').iter('img'))

You can use BeautifulSoup for this:
from bs4 import BeautifulSoup
with open('/path/to/index.html') as f:
    soup = BeautifulSoup(f)
soup.find_all("img")
See Going through HTML DOM in Python
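Going from the tag list to attribute values is then a one-liner; a small sketch with made-up markup (using the stdlib 'html.parser' backend so only bs4 itself is needed):

```python
from bs4 import BeautifulSoup

html = '<html><body><img src="a.png"><p>text</p><img src="b.png"></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# find_all returns Tag objects, which index like dicts for attributes
srcs = [img['src'] for img in soup.find_all('img')]
print(srcs)  # ['a.png', 'b.png']
```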

Related

I am facing the error "xml.parsers.expat.ExpatError: not well-formed (invalid token)" while parsing URL data using minidom

I am facing the error xml.parsers.expat.ExpatError: syntax error: line 1, column 0 while parsing data from a URL using minidom. Can anyone help me with this?
Here is my code:
from xml.dom import minidom
import urllib2
url= 'http://www.awgp.org/about_us'
openurl=urllib2.urlopen(url)
doc=minidom.parse("about_us.xml")
Error:
Traceback (most recent call last):
File "test3.py", line 11, in <module>
doc=minidom.parse("about_us.xml")
File "C:\Python27\lib\xml\dom\minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 211, in parseFile
parser.Parse("", True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0
parser.Parse("", True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0
The above from your traceback indicates to me that your "about_us.xml" file is empty.
You have openurl but you have not shown that you've ever called openurl.read() to actually get at the data.
Nor have you shown where or how you've written said data to your "about_us.xml" file.
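That diagnosis is easy to check offline: empty input reproduces an ExpatError at line 1, column 0, and actually writing the fetched data before parsing makes it go away. A stdlib-only sketch (Python 3 syntax; the XML payload is made up and stands in for the real page data):

```python
import xml.parsers.expat
from xml.dom import minidom

# 1) Parsing empty input fails immediately at line 1, column 0
try:
    minidom.parseString('')
except xml.parsers.expat.ExpatError as e:
    print('empty input:', e)

# 2) The working pattern: write the data you fetched, then parse the file
data = '<about><name>AWGP</name></about>'   # stands in for openurl.read()
with open('about_us.xml', 'w') as f:
    f.write(data)

doc = minidom.parse('about_us.xml')
print(doc.documentElement.tagName)  # about
```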
from xml.dom import minidom
import urllib2
url= 'http://www.awgp.org/about_us'
openurl=urllib2.urlopen(url)
doc=minidom.parse(openurl)
print doc
gives me
Traceback (most recent call last):
File "main.py", line 5, in <module>
doc=minidom.parse(openurl)
File "/usr/local/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/local/lib/python2.7/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "/usr/local/lib/python2.7/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 51, column 81
which indicates that the page you are trying to parse as XML is not well-formed. Try using Beautiful Soup instead, which, from memory, is very forgiving.
from BeautifulSoup import BeautifulSoup
import urllib2
url= 'http://www.awgp.org/about_us'
openurl=urllib2.urlopen(url)
soup = BeautifulSoup(openurl.read())
for a in soup.findAll('a'):
    print (a.text, a.get('href'))
BTW, the import above is for version 3 of Beautiful Soup; on Python 2.7 you can also use version 4 via from bs4 import BeautifulSoup.

Error while parsing XML document in Python

I'm trying to write a simple script that parses my XML document to get the name attribute from all <xs:element> tags. I'm using minidom (is there a better way?). Here is my code so far:
import csv
from xml.dom import minidom
xmldoc = minidom.parse('core.xml')
core = xmldoc.getElementsByTagName('xs:element')
print(len(core))
print(core[0].attributes['name'].value)
for x in core:
    print(x.attributes['name'].value)
I'm getting this error:
Traceback (most recent call last):
File "C:/Users/user/Desktop/XML Parsing/test.py", line 9, in <module>
print(core[0].attributes['name'].value)
File "C:\Python27\lib\xml\dom\minidom.py", line 522, in __getitem__
return self._attrs[attname_or_tuple]
KeyError: 'name'
The KeyError: 'name' means that at least one of your <xs:element> tags has no name attribute (for example, elements declared with ref instead of name). Guard the lookup before indexing:
for x in core:
    if x.hasAttribute('name'):
        print(x.attributes['name'].value)
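A runnable sketch of both the failure and the guard, using a made-up inline schema (any xs:element that uses ref instead of name has no name attribute to index):

```python
from xml.dom import minidom

xsd = '''<xs:schema xmlns:xs="http://www.w3.org/2001/xmlschema">
  <xs:element name="person"/>
  <xs:element ref="person"/>
</xs:schema>'''

xmldoc = minidom.parseString(xsd)
core = xmldoc.getElementsByTagName('xs:element')

# core[1].attributes['name'] would raise KeyError: 'name' here,
# so test for the attribute before indexing it
names = [x.attributes['name'].value for x in core if x.hasAttribute('name')]
print(names)  # ['person']
```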

Setting unicode declaration in lxml.html.clean.Cleaner or any other html2plaintext

How can I set the "Unicode strings encoding declaration" in the lxml.html.clean.Cleaner module? I'm looking to read the plaintext of a website, and have used lxml in the past as a way of doing this, scraping out the HTML and JavaScript. For some pages I'm starting to get weird errors about encoding, but can't make sense from the documentation of how to set this parameter correctly.
import requests
from lxml.html.clean import Cleaner
cleaner = Cleaner()
cleaner.javascript = True
cleaner.style = True
cleaner.html= True
>>> url = 'http://www.princeton.edu'
>>> r = requests.get(url)
>>> lx = r.text.replace('\t',' ').replace('\n',' ').replace('\r',' ')
>>> #lx = r.text
... lxclean = cleaner.clean_html(lx)
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/home/username/gh/venv/local/lib/python2.7/site-packages/lxml/html/clean.py", line 501, in clean_html
doc = fromstring(html)
File "/home/username/gh/venv/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 672, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/home/username/gh/venv/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 568, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "lxml.etree.pyx", line 2997, in lxml.etree.fromstring (src/lxml/lxml.etree.c:63276)
File "parser.pxi", line 1607, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:93592)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
However, it works for other urls, like 'http://www.google.com'
lxml refuses a unicode string that still carries an encoding declaration (the ValueError's own message says to use bytes input). Pass the raw bytes instead, e.g. cleaner.clean_html(r.content) rather than r.text, or fall back to Beautiful Soup's HTML parser, which is more forgiving.
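The error is reproducible, and fixable, without the Cleaner at all: lxml rejects a unicode string that still carries an encoding declaration, but accepts the very same document as bytes. A minimal sketch (requires lxml; the markup is made up):

```python
import lxml.html

raw = (b'<?xml version="1.0" encoding="utf-8"?>\n'
       b'<html><body><p>hello</p></body></html>')

# str input with an encoding declaration: the ValueError from the traceback
try:
    lxml.html.fromstring(raw.decode('utf-8'))
except ValueError as e:
    print('str input failed:', e)

# bytes input (requests' r.content rather than r.text) parses fine
doc = lxml.html.fromstring(raw)
print(doc.text_content().strip())
```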

Using Python to get a field in XML from URL

I'm trying to get information from a specific field in an XML file at a URL. I'm getting these weird errors before I even start. Here is my code:
import urllib
from xml.dom import minidom

url1 = 'http://www.dac.unicamp.br/sistemas/horarios/grad/G5A0/indiceP.htm'
data1 = urllib.urlopen(url1)
xml1 = minidom.parse(data1)
I get this error:
File "C:\Users\Administrator\Desktop\teste.py", line 15, in <module>
xml1 = minidom.parse(data1)
File "C:\Python27\lib\xml\dom\minidom.py", line 1920, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
ExpatError: not well-formed (invalid token): line 4, column 22
Did I do anything wrong? I copied those functions from a tutorial, and it seems like they should be working.
Use lxml.html; it handles invalid XHTML better.
import lxml.html as lh
In [24]: xml1=lh.parse('http://www.dac.unicamp.br/sistemas/horarios/grad/G5A0/indiceP.htm')

cannot run BeautifulSoup using requests.get(url)

import requests
from bs4 import BeautifulSoup

start_url = requests.get('http://www.delicious.com/golisoda')
soup = BeautifulSoup(start_url)
this code is displaying the following error:
Traceback (most recent call last):
File "test2_requests.py", line 10, in <module>
soup=BeautifulSoup(start_url)
File "/usr/local/lib/python2.7/dist-packages/bs4/__init__.py", line 169, in __init__
self.builder.prepare_markup(markup, from_encoding))
File "/usr/local/lib/python2.7/dist-packages/bs4/builder/_lxml.py", line 68, in prepare_markup
dammit = UnicodeDammit(markup, try_encodings, is_html=True)
File "/usr/local/lib/python2.7/dist-packages/bs4/dammit.py", line 203, in __init__
self._detectEncoding(markup, is_html)
File "/usr/local/lib/python2.7/dist-packages/bs4/dammit.py", line 373, in _detectEncoding
xml_encoding_match = xml_encoding_re.match(xml_data)
TypeError: expected string or buffer
Use the .content of the response:
start_url = requests.get('http://www.delicious.com/golisoda')
soup = BeautifulSoup(start_url.content)
Alternatively, you can use the decoded unicode text:
start_url = requests.get('http://www.delicious.com/golisoda')
soup = BeautifulSoup(start_url.text)
See the Response content section of the documentation.
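The distinction is easy to see offline; a small sketch with made-up markup (stdlib 'html.parser' backend) showing that Beautiful Soup accepts bytes like response.content or a str like response.text, just not the Response object itself:

```python
from bs4 import BeautifulSoup

html_bytes = b'<html><body><a href="/golisoda">bookmarks</a></body></html>'
html_text = html_bytes.decode('utf-8')   # what response.text would give you

soup_from_bytes = BeautifulSoup(html_bytes, 'html.parser')
soup_from_text = BeautifulSoup(html_text, 'html.parser')

# Both parses see the same document
print(soup_from_bytes.a['href'])  # /golisoda
print(soup_from_text.a['href'])  # /golisoda
```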
You probably need to use
soup = BeautifulSoup(start_url.content)
or
soup = BeautifulSoup(start_url.text)
or fetch the page with urllib2 instead, whose response object does have a read() method:
from BeautifulSoup import BeautifulSoup
import urllib2
data=urllib2.urlopen('http://www.delicious.com/golisoda').read()
soup=BeautifulSoup(data)
