Parsing XML and HTML in the same project - Python

I want to parse both XML and HTML in one project.
I tried this:
from xml.etree import ElementTree as ET
tree = ET.parse(fpath)
html_file = ET.parse(htmlpath)
and got this error:
Traceback (most recent call last):
File "C:.py", line 55, in
html_file = ET.parse("htmlpath")
File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1197, in parse
tree.parse(source, parser)
File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 598, in parse
self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity : line 690, column 78
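
A likely cause, assuming the HTML file uses entities such as &nbsp; that XML does not define: ElementTree is a strict XML parser, so it rejects most real-world HTML. Below is a minimal sketch that keeps ElementTree for the XML file and uses the standard library's html.parser for the HTML side. The sample strings and the TextCollector class are made up for illustration; a third-party parser such as lxml.html or Beautiful Soup would be an equally reasonable choice.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample inputs standing in for the two files.
XML_TEXT = "<config><name>demo</name></config>"
HTML_TEXT = "<html><body><p>Hello&nbsp;world<br>second line</p></body></html>"


class TextCollector(HTMLParser):
    """Collects the text content of an HTML document, tolerating bad markup."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities like &nbsp;
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_text(markup):
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


# XML side: ElementTree works as usual.
xml_name = ET.fromstring(XML_TEXT).findtext("name")

# HTML side: the strict XML parser fails on &nbsp; and the unclosed <br>.
try:
    ET.fromstring(HTML_TEXT)
    html_is_valid_xml = True
except ET.ParseError:
    html_is_valid_xml = False

# The forgiving HTML parser handles the same input.
page_text = html_text(HTML_TEXT)
```

For files on disk, the same idea would apply: ET.parse(fpath) for the XML file, and the contents of the HTML file read as text and fed to the HTML parser.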

Related

File "<string>", line unknown ParseError: not well-formed (invalid token): line 1, column 5

After running this, I get the errors below. I have tried everything else mentioned here and cannot get past this error.
from xml.etree import ElementTree as ET
from os import path, listdir
path__ = "blogs/"
files = [path.join(path__, f) for f in listdir(path__)
         if f.endswith('.xml')]
for file in files:
    print(file)
    parse = ET.XMLParser(encoding="unicode_escape")
    tree = ET.fromstring(file, parser=parse)
blogs/1000331.female.37.indUnk.Leo.xml
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/IPython/core/interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-3-40e3bf76804f>", line 8, in <module>
tree = ET.fromstring(file, parser=parse)
File "/usr/lib/python3.9/xml/etree/ElementTree.py", line 1347, in XML
parser.feed(text)

fromstring, as the name says, takes a string containing XML, not a file name. Hence it tries to parse the text
blogs/foo.xml
as an XML document. You want to use ET.parse instead, creating a fresh parser for each file because a parser cannot be reused once it has finished a document:
for file in files:
    print(file)
    parser = ET.XMLParser(encoding="unicode_escape")
    tree = ET.parse(file, parser=parser)
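
To make the difference concrete, here is a small self-contained sketch. The file name sample.xml and its contents are invented for the demonstration, and a temporary directory is used so the example does not touch the real blogs/ folder.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

xml_text = "<blog><post>hello</post></blog>"

with tempfile.TemporaryDirectory() as tmp_dir:
    xml_path = os.path.join(tmp_dir, "sample.xml")
    with open(xml_path, "w", encoding="utf-8") as handle:
        handle.write(xml_text)

    # parse() takes a file name (or file object) and returns an ElementTree.
    root_from_file = ET.parse(xml_path).getroot()

    # fromstring() takes the document text itself and returns the root Element.
    root_from_text = ET.fromstring(xml_text)

    # Passing the path to fromstring() makes it parse the path as XML text.
    try:
        ET.fromstring(xml_path)
        path_parsed_as_xml = True
    except ET.ParseError:
        path_parsed_as_xml = False
```

Both successful calls end up with the same root element; only the kind of input differs.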

I am facing this error "xml.parsers.expat.ExpatError: not well-formed (invalid token):" while parsing URL data using minidom

I am getting the error xml.parsers.expat.ExpatError: syntax error: line 1, column 0 while parsing data from a URL using minidom. Can anyone help me with this?
Here is my code:
from xml.dom import minidom
import urllib2
url= 'http://www.awgp.org/about_us'
openurl=urllib2.urlopen(url)
doc=minidom.parse("about_us.xml")
Error:
Traceback (most recent call last):
File "test3.py", line 11, in <module>
doc=minidom.parse("about_us.xml")
File "C:\Python27\lib\xml\dom\minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 211, in parseFile
parser.Parse("", True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0

parser.Parse("", True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0
The above from your traceback indicates to me that your "about_us.xml" file is empty.
You have openurl but you have not shown that you've ever called openurl.read() to actually get at the data.
Nor have you shown where or how you've written said data to your "about_us.xml" file.
from xml.dom import minidom
import urllib2
url= 'http://www.awgp.org/about_us'
openurl=urllib2.urlopen(url)
doc=minidom.parse(openurl)
print doc
gives me
Traceback (most recent call last):
File "main.py", line 5, in <module>
doc=minidom.parse(openurl)
File "/usr/local/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/local/lib/python2.7/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "/usr/local/lib/python2.7/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 51, column 81
which indicates that the page you are trying to parse as XML is not well-formed. Try using Beautiful Soup instead, which, from memory, is very forgiving.
from BeautifulSoup import BeautifulSoup
import urllib2
url= 'http://www.awgp.org/about_us'
openurl=urllib2.urlopen(url)
soup = BeautifulSoup(openurl.read())
for a in soup.findAll('a'):
    print (a.text, a.get('href'))
BTW, the import above is for version 3 of Beautiful Soup, which runs on Python 2.7. With Beautiful Soup 4 the import is from bs4 import BeautifulSoup.
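
For readers on Python 3 who would rather avoid a third-party dependency, the same link extraction can be sketched with the standard library's html.parser. This is only an illustration: SAMPLE_PAGE is an invented stand-in for the downloaded page (in Python 3 the download itself would go through urllib.request.urlopen(url).read()), and LinkCollector is a hypothetical helper, not part of any library.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Invented sample page: the unclosed <br> makes it invalid as XML.
SAMPLE_PAGE = (
    '<html><body><p>About us<br>'
    '<a href="/contact">Contact &amp; help</a> '
    '<a href="/home">Home</a></p></body></html>'
)


class LinkCollector(HTMLParser):
    """Collects (text, href) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# minidom is a strict XML parser, so it rejects the page.
try:
    minidom.parseString(SAMPLE_PAGE)
    xml_error = None
except ExpatError as exc:
    xml_error = exc

# The HTML parser tolerates the same markup.
collector = LinkCollector()
collector.feed(SAMPLE_PAGE)
collector.close()
links = collector.links
```

Beautiful Soup remains the more convenient option for anything beyond simple extraction like this.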

Parsing bad XHTML

My new project is to extract data from the Naxos Glossary of Musical Terms, a great resource whose text data I want to process and extract to a database to use on another, simpler website I'll create.
My only problem is awful XHTML formatting. The W3C XHTML validation raises 318 errors and 54 warnings. Even an HTML tidier I found can't fix it all.
I'm using Python 3.67 and the page I'm parsing was ASP. I've tested LXML and Python XML modules, but both fail.
Can anyone suggest any other tidiers or modules? Or will I have to use some sort of raw text manipulation (yuck!)?
My code:
LXML:
from lxml import etree
file = open("glossary.asp", "r", encoding="ISO-8859-1")
parsed = etree.parse(file)
Error:
Traceback (most recent call last):
File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
parsed = etree.parse(file)
File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1861, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1881, in lxml.etree._parseFilelikeDocument
File "src/lxml/parser.pxi", line 1776, in lxml.etree._parseDocFromFilelike
File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "/media/skuzzyneon/STORE-1/naxos_dict/glossary.asp", line 25
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 25, column 128
>>>
Python XML (using the tidied XHTML):
import xml.etree.ElementTree as ET
file = open("tidy.html", "r", encoding="ISO-8859-1")
root = ET.fromstring(file.read())
# Top-level elements
print(root.findall("."))
Error:
Traceback (most recent call last):
File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
root = ET.fromstring(file.read())
File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
parser.feed(text)
File "<string>", line None
xml.etree.ElementTree.ParseError: undefined entity: line 526, column 33

etree.parse uses lxml's XML parser, so it treats the file as strict XML and rejects it.
Try the lxml.html module instead:
from lxml import html
# doc.cssselect() needs the separate cssselect package to be installed
with open("glossary.asp", "r", encoding="ISO-8859-1") as file:
    doc = html.document_fromstring(file.read())
print(doc.cssselect('title')[0].text_content())
Also, instead of HTML tidiers, just open the page in Chrome and copy the HTML from the Elements panel.

Python: Read a very large XML file

I am trying to read an XML file (a French dictionary) from a Python script, but I am getting the error below. Is there any way I can fix it?
PS: the file is 158,643 KB.
from xml.dom import minidom
doc = minidom.parse("Dic.xml")
Data = doc.getElementsByTagName("title")[0]
titleData = Data.firstChild.data
print (titleData)
The error:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
doc = minidom.parse("Morphalou-2.0.xml")
File "C:\Python27\lib\xml\dom\minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 204, in parseFile
buffer = file.read(16*1024)
MemoryError
Thanks in advance.
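
One common way around a MemoryError like this is to stream the file instead of building the whole DOM, for example with ElementTree's iterparse. The sketch below assumes the goal is still to read title elements; the SAMPLE document and its entry wrapper are invented, and the real structure of the dictionary file may differ.

```python
import io
import xml.etree.ElementTree as ET

# Invented miniature document standing in for the large dictionary file.
SAMPLE = b"""<dictionary>
  <entry><title>chat</title></entry>
  <entry><title>chien</title></entry>
</dictionary>"""


def iter_titles(source, tag="title"):
    """Yield the text of each matching element without keeping the whole tree."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.text
        # Free the element's text and children once it has been processed.
        # The emptied elements stay attached to their parent, so for extremely
        # large files the root may need clearing as well.
        elem.clear()


# All titles, read as a stream.
titles = list(iter_titles(io.BytesIO(SAMPLE)))

# Only the first title, as in the question; parsing stops early.
first_title = next(iter_titles(io.BytesIO(SAMPLE)))
```

With a real file, a path such as "Dic.xml" can be passed to iter_titles in place of the BytesIO object.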

parse xml file and output to text file

Trying to parse an XML file (config.xml) with ElementTree and output it to a text file. I looked at other similar questions here, but none helped me. Using Python 2.7.9.
import xml.etree.ElementTree as ET
tree = ET.parse('config.xml')
notags = ET.tostring(tree,encoding='us-ascii',method='text')
print(notags)
OUTPUT
Traceback (most recent call last):
File "./python_element", line 9, in <module>
notags = ET.tostring(tree,encoding='us-ascii',method='text')
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
ElementTree(element).write(file, encoding, method=method
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 814, in write
_serialize_text(write, self._root, encoding)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1005, in _serialize_text
for part in elem.itertext():
AttributeError: 'ElementTree' object has no attribute 'itertext'

Instead of tree (an ElementTree object), pass an Element object. You can get the root element using the .getroot() method:
notags = ET.tostring(tree.getroot(), encoding='utf-8',method='text')
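
For completeness, here is a rough Python 3 version that also writes the result to a text file, since the question asks for that. The CONFIG_XML contents and the config.txt output name are made up for the example, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Invented stand-in for config.xml.
CONFIG_XML = "<config><host>example.org</host><port>8080</port></config>"

tree = ET.ElementTree(ET.fromstring(CONFIG_XML))
root = tree.getroot()

# encoding="unicode" makes tostring() return str rather than bytes.
notags = ET.tostring(root, encoding="unicode", method="text")

# itertext() gives the same text piece by piece, which allows one value per line.
lines = [text.strip() for text in root.itertext() if text.strip()]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "config.txt")
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")
    with open(out_path, encoding="utf-8") as out:
        written = out.read()
```

method="text" concatenates all text with no separators, so the itertext() variant is usually easier to read in the output file.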
