parse xml file and output to text file - python

I'm trying to parse an XML file (config.xml) with ElementTree and write the output to a text file. I looked at other similar questions here, but none helped. I'm using Python 2.7.9.
import xml.etree.ElementTree as ET
tree = ET.parse('config.xml')
notags = ET.tostring(tree,encoding='us-ascii',method='text')
print(notags)
OUTPUT
Traceback (most recent call last):
File "./python_element", line 9, in <module>
notags = ET.tostring(tree,encoding='us-ascii',method='text')
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
ElementTree(element).write(file, encoding, method=method)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 814, in write
_serialize_text(write, self._root, encoding)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1005, in _serialize_text
for part in elem.itertext():
AttributeError: 'ElementTree' object has no attribute 'itertext'

ET.tostring() expects an Element, not an ElementTree. Instead of tree (an ElementTree object), pass its root element, which you can get with the .getroot() method:
notags = ET.tostring(tree.getroot(), encoding='utf-8',method='text')
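To cover the "output to a text file" part of the question as well, here is a minimal sketch. The helper name xml_text_to_file and the output file name are made up for illustration, and it assumes you only want the concatenated text content of the document.

```python
import xml.etree.ElementTree as ET


def xml_text_to_file(xml_path, out_path):
    """Write the text content of an XML file to out_path (hypothetical helper)."""
    tree = ET.parse(xml_path)
    # tostring() expects an Element, so pass the root, not the ElementTree.
    text = ET.tostring(tree.getroot(), encoding='utf-8', method='text')
    # With encoding='utf-8', tostring() returns a byte string,
    # so the output file is opened in binary mode.
    with open(out_path, 'wb') as out:
        out.write(text)
    return text
```

It could be called as xml_text_to_file('config.xml', 'config.txt'). Writing bytes in binary mode should behave the same way on Python 2.7 and Python 3.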

Related

Parsing XML and HTML in the same project

I want to parse both XML and HTML in the same project.
I tried this:
from xml.etree import ElementTree as ET
tree = ET.parse(fpath)
html_file = ET.parse(htmlpath)
and got this error:
Traceback (most recent call last):
File "C:.py", line 55, in
html_file = ET.parse("htmlpath")
File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1197, in parse
tree.parse(source, parser)
File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 598, in parse
self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity : line 690, column 78

Parsing bad XHTML

My new project is to extract data from the Naxos Glossary of Musical Terms, a great resource whose text data I want to process and extract to a database to use on another, simpler website I'll create.
My only problem is the awful XHTML formatting. The W3C XHTML validator reports 318 errors and 54 warnings. Even an HTML tidier I found can't fix it all.
I'm using Python 3.6.7 and the page I'm parsing was generated by ASP. I've tested lxml and Python's built-in XML modules, but both fail.
Can anyone suggest any other tidiers or modules? Or will I have to use some sort of raw text manipulation (yuck!)?
My code:
LXML:
from lxml import etree
file = open("glossary.asp", "r", encoding="ISO-8859-1")
parsed = etree.parse(file)
Error:
Traceback (most recent call last):
File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
parsed = etree.parse(file)
File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1861, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1881, in lxml.etree._parseFilelikeDocument
File "src/lxml/parser.pxi", line 1776, in lxml.etree._parseDocFromFilelike
File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "/media/skuzzyneon/STORE-1/naxos_dict/glossary.asp", line 25
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 25, column 128
Python XML (using the tidied XHTML):
import xml.etree.ElementTree as ET
file = open("tidy.html", "r", encoding="ISO-8859-1")
root = ET.fromstring(file.read())
# Top-level elements
print(root.findall("."))
Error:
Traceback (most recent call last):
File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
root = ET.fromstring(file.read())
File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
parser.feed(text)
File "<string>", line None
xml.etree.ElementTree.ParseError: undefined entity: line 526, column 33
lxml's etree.parse treats the input as XML, and this page is not well-formed XML.
Parse it with lxml's HTML parser instead:
from lxml import html
# doc.cssselect() needs the third-party cssselect package to be installed
with open("glossary.asp", "r", encoding="ISO-8859-1") as file:
    doc = html.document_fromstring(file.read())
print(doc.cssselect('title')[0].text_content())
Also, instead of using an HTML tidier, you can open the page in Chrome and copy the HTML from the Elements panel.
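To see why the XML parsers fail here, note that named HTML entities and unclosed tags are not valid in plain XML, while an HTML parser tolerates them. The sketch below uses only the standard library (html.parser as a stand-in for lxml.html, so it is not the same code as above). The sample markup, the TextCollector class, and both function names are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Invented sample: a named HTML entity plus an unclosed <br>, both invalid in XML.
SAMPLE = "<html><body><h1>Glossary</h1><p>Caf&eacute; music<br></p></body></html>"


class TextCollector(HTMLParser):
    """Collect every piece of text the HTML parser reports (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def xml_parse_error(markup):
    """Return the ParseError message, or None if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


def html_text(markup):
    """Return the text content of the markup using the lenient HTML parser."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(chunk.strip() for chunk in collector.chunks if chunk.strip())


print(xml_parse_error(SAMPLE))  # the strict XML parser reports an error
print(html_text(SAMPLE))        # the HTML parser still recovers the text
```

For real scraping work, lxml.html as shown above gives you a full tree with CSS and XPath queries; the standard-library parser is only event-based.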

Python: Read a very large XML file

I am trying to read an XML file (a French dictionary) from a Python script, but I am getting the error below. Is there any way I can fix it?
PS: the file is 158,643 KB
from xml.dom import minidom
doc = minidom.parse("Dic.xml")
Data = doc.getElementsByTagName("title")[0]
titleData = Data.firstChild.data
print (titleData)
The error:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
doc = minidom.parse("Morphalou-2.0.xml")
File "C:\Python27\lib\xml\dom\minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 204, in parseFile
buffer = file.read(16*1024)
MemoryError
Thanks in advance.
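One common way around a MemoryError with DOM-style parsers such as minidom is to stream the file instead of building the whole tree. Below is a minimal sketch using xml.etree.ElementTree.iterparse from the standard library. It assumes the element of interest is named title, as in the question, and that the file does not use namespaces (otherwise the tag would look like '{uri}title'). The helper name first_title_text is made up.

```python
import xml.etree.ElementTree as ET


def first_title_text(path, tag="title"):
    """Stream the file and return the text of the first <tag> element found."""
    with open(path, "rb") as source:
        # iterparse yields each element when its end tag is read, so the
        # complete document is never loaded in one go.
        for event, elem in ET.iterparse(source, events=("end",)):
            if elem.tag == tag:
                return elem.text
            # Drop the content of elements that are no longer needed.
            elem.clear()
    return None
```

Whether this is enough depends on how much of the dictionary you need to keep; to process every entry, handle each element inside the loop and clear it afterwards.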

tags on an element's children

The following code (or rather, code very much like it) has been doing my head in all day. I'm out of ideas.
import xml.etree.ElementTree as etree
parent = etree.Element(etree.QName('http://www.example.com', tag='parent'))
child_a = etree.Element(etree.QName('http://www.example.com', tag='child'))
child_a.text='Bill'
parent.append(child_a)
child_b = etree.Element(etree.QName('http://www.example.com', tag='child'))
child_b.text='Barry'
parent.append(child_b)
print(etree.tostring(parent))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "Lib\xml\etree\ElementTree.py", line 1120, in tostring
File "Lib\xml\etree\ElementTree.py", line 812, in write
File "Lib\xml\etree\ElementTree.py", line 880, in namespaces
File "Lib\xml\etree\ElementTree.py", line 1046, in _raise_serialization_error
TypeError: cannot serialize <xml.etree.ElementTree.QName object at 0x0000000222B2198> (type QName)
It's not letting me add two children to an element if both children have the same qualified name. Is there any way around this (using etree, because it's too late to swap out my xml writer before tomorrow)?
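One possible workaround, assuming the error comes from this ElementTree version mishandling a repeated QName object used as a tag, is to pass the tag as a plain string in '{namespace}tag' form. QName.text gives exactly that string. The helper name qualified is made up for illustration.

```python
import xml.etree.ElementTree as etree

NS = 'http://www.example.com'


def qualified(tag):
    """Return the tag as a plain '{namespace}tag' string (hypothetical helper)."""
    return etree.QName(NS, tag).text


parent = etree.Element(qualified('parent'))
for name in ('Bill', 'Barry'):
    # Plain string tags avoid handing QName objects to the serializer.
    child = etree.SubElement(parent, qualified('child'))
    child.text = name

serialized = etree.tostring(parent)
print(serialized)
```

Both children keep the same qualified name in the output; only the way the tag is handed to Element changes.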

how to pass an xml file to lxml to parse?

I'm trying to parse an xml file using lxml. xml.etree allowed me to simply pass the file name as a parameter to the parse function, so I attempted to do the same with lxml.
My code:
from lxml import etree
from lxml import objectify
file = r"C:\Projects\python\cb.xml"
tree = etree.parse(file)
but I get the error:
Traceback (most recent call last):
File "cb.py", line 5, in <module>
tree = etree.parse(file)
File "lxml.etree.pyx", line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:49590)
File "parser.pxi", line 1491, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71205)
File "parser.pxi", line 1520, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71488)
File "parser.pxi", line 1420, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70583)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67736)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63820)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64741)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64084)
lxml.etree.XMLSyntaxError: AttValue: " or ' expected, line 2, column 26
What am I doing wrong?
What you are doing wrong is (1) not checking whether you get the same outcome by using xml.etree on the same file, and (2) not reading the error message, which indicates a syntax error in line 2 of the file, well downstream of any file-opening issue.
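The check suggested above can be sketched with the standard library alone. The function name xml_syntax_error_position is made up; it relies on the position attribute of xml.etree.ElementTree.ParseError, which holds the line and column of the error.

```python
import xml.etree.ElementTree as ET


def xml_syntax_error_position(path):
    """Return (line, column) of the first syntax error, or None if the file parses."""
    try:
        ET.parse(path)
    except ET.ParseError as exc:
        # ParseError.position is a (line, column) tuple pointing into the file.
        return exc.position
    return None
```

If this also reports an error on line 2, the problem is in the file's markup rather than in how the file name is passed to lxml.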
I stumbled across a similar error message this morning, and for me the answer was a malformed DTD. In my DTD, there was an Attribute definition with a default value that was not enclosed in quotes - as soon as I changed that, the error didn't happen anymore.
You have a syntax error in your XML Markup. You aren't doing anything wrong.
lxml allows you to load broken XML by creating a parser instance with recover=True and passing it to the parse call:
etree.XMLParser(recover=True)
While this is not ideal, I use it to load XML for schema/DTD/Schematron validation.
