I know how to parse XML with SAX in Python, but how would I go about inserting elements into the document I'm parsing? Do I have to create a separate file?
Could someone provide a simple example, or alter the one I've put below? Thanks.
from xml.sax.handler import ContentHandler
from xml.sax import make_parser
import sys

class aHandler(ContentHandler):
    # Echo each SAX event as the parser reports it.
    def startElement(self, name, attrs):
        print("<", name, ">")

    def characters(self, content):
        print(content)

    def endElement(self, name):
        print("</", name, ">")

handler = aHandler()
saxparser = make_parser()
saxparser.setContentHandler(handler)
datasource = open("settings.xml", "r")
saxparser.parse(datasource)
<?xml version="1.0"?>
<names>
    <name>
        <first>First1</first>
        <second>Second1</second>
    </name>
    <name>
        <first>First2</first>
        <second>Second2</second>
    </name>
    <name>
        <first>First3</first>
        <second>Second3</second>
    </name>
</names>
With DOM, you have the entire XML structure in memory.
With SAX, you don't have a DOM available, so you don't have anything to append an element to.
The main reason for using SAX is that the XML structure is really, really huge, so that holding the whole DOM in memory would be a serious performance hit. If that isn't the case (and it doesn't appear to be, judging from your small sample XML file), I would always use DOM over SAX.
If you go the DOM route (which seems to me to be the only option), look into lxml. It's one of the best Python XML libraries around.
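As a rough sketch of the in-memory route described above: the snippet below uses the standard library's xml.etree.ElementTree rather than lxml (a swap made only to avoid an external dependency; lxml.etree offers the same calls used here). The <third> element and the add_third helper are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the sample document from the question.
XML = """<names>
  <name><first>First1</first><second>Second1</second></name>
  <name><first>First2</first><second>Second2</second></name>
</names>"""


def add_third(root, text):
    """Append a hypothetical <third> child to every <name> element."""
    for name in root.findall("name"):
        third = ET.SubElement(name, "third")
        third.text = text
    return root


root = add_third(ET.fromstring(XML), "Third")
print(ET.tostring(root, encoding="unicode"))
```

For a file on disk, ET.parse(path) followed by tree.write(path) would write the modified tree back to the same file, so a separate file is not strictly needed once the whole document fits in memory.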
Hello :) This is my first Python program, but it doesn't work.
What I want to do:
import an XML file and grab only Example.swf from
<page id="Example">
    <info>
        <title>page 1</title>
    </info>
    <vector_file>Example.swf</vector_file>
</page>
(the text inside <vector_file>)
then download the associated file from a website (https://website.com/.../.../Example.swf)
then rename it 1.swf (or page 1.swf)
and loop until I reach the last file, at the end of the page (Exampleaa_idontknow.swf → 231.swf)
convert all the files to PDF
What I have done (but it is useless, because of AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'xpath'):
import re
import urllib.request
import requests
import time
import requests
import lxml
import lxml.html
import os
from xml.etree import ElementTree as ET

DIR="C:/Users/mypath.../"
for filename in os.listdir(DIR):
    if filename.endswith(".xml"):
        with open(file=DIR+".xml",mode='r',encoding='utf-8') as file:
            _tree = ET.fromstring(text=file.read())
        _all_metadata_tags = _tree.xpath('.//vector_file')
        for i in _all_metadata_tags:
            print(i.text + '\n')
    else:
        print("skipping for filename")
First of all, you need to make up your mind about which module you're going to use: lxml or xml? Import only one of them. lxml has more features, but it's an external dependency. xml is more basic, but it is built in. Both modules share a lot of their API, so they are easy to confuse. Check that you're looking at the correct documentation.
For what you want to do, the built-in module is good enough. However, the .xpath() method is not supported there; the method you are looking for is called .findall().
Then you need to remember to never parse XML files by opening them as plain text files, reading them into a string, and parsing that string. Not only is this wasteful, it's fundamentally the wrong thing to do. XML parsers have built-in automatic encoding detection. This mechanism makes sure you never have to worry about file encodings, but you have to use it, too.
It's not only better, but also less code to write: use ET.parse() and pass a filename.
import os
from xml.etree import ElementTree as ET

DIR = r'C:\Users\mypath'
for filename in os.listdir(DIR):
    if not filename.lower().endswith(".xml"):
        print("skipping for filename")
        continue
    fullname = os.path.join(DIR, filename)
    tree = ET.parse(fullname)
    for vector_file in tree.findall('.//vector_file'):
        print(vector_file.text + '\n')
If you only expect a single <vector_file> element per file, or if you only care for the first such element, use .find() instead of .findall():
vector_file = tree.find('.//vector_file')
if vector_file is None:
    print('Nothing found')
else:
    print(vector_file.text + '\n')
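The point about encoding detection can be illustrated with a small sketch. It writes a file in ISO-8859-1 (declared in the XML prolog) and lets ET.parse() work out the encoding by itself. The file name sample.xml, the helper names, and the accented file name inside the document are all made up for this illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_latin1_sample(path):
    """Write a small XML file whose bytes are ISO-8859-1, as its prolog declares."""
    xml_text = (
        '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
        '<page><vector_file>Caf\u00e9.swf</vector_file></page>'
    )
    with open(path, "wb") as fh:
        fh.write(xml_text.encode("iso-8859-1"))


def read_vector_files(path):
    """Let the parser open the file itself so it can honour the declared encoding."""
    tree = ET.parse(path)
    return [element.text for element in tree.findall(".//vector_file")]


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.xml")
    write_latin1_sample(sample)
    print(read_vector_files(sample))
```

Opening the same file with open(..., encoding='utf-8') and parsing the resulting string would fail on the accented character, which is the kind of problem the advice above avoids.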
Assume that I have the following XML, which I want to modify using Python's ElementTree:
<root xmlns:prefix="URI">
    <child prefix:name="***"/>
    ...
</root>
I'm doing some modification on the XML file like this:
import xml.etree.ElementTree as ET
tree = ET.parse('filename.xml')
# XML modification here
# save the modifications
tree.write('filename.xml')
Then the XML file looks like:
<root xmlns:ns0="URI">
    <child ns0:name="***"/>
    ...
</root>
As you can see, the namespace prefix changed to ns0. I'm aware of ET.register_namespace().
The problem with ET.register_namespace() is that:
You need to know the prefix and URI in advance.
It cannot be used with a default namespace.
E.g. if the XML looks like:
<root xmlns="http://uri">
    <child name="name">
        ...
    </child>
</root>
It will be transformed into something like:
<ns0:root xmlns:ns0="http://uri">
    <ns0:child name="name">
        ...
    </ns0:child>
</ns0:root>
As you can see, the default namespace is changed to ns0.
Is there any way to solve this problem with ElementTree?
ElementTree replaces the prefixes of namespaces that are not registered with ET.register_namespace. To preserve a namespace prefix, you need to register it before writing your modifications to a file. The following function does the job and registers all namespaces globally:
def register_all_namespaces(filename):
    # Each 'start-ns' event yields a (prefix, uri) tuple.
    namespaces = dict([node for _, node in ET.iterparse(filename, events=['start-ns'])])
    for ns in namespaces:
        ET.register_namespace(ns, namespaces[ns])
This function should be called before ET.parse, so that the namespaces remain unchanged:
import xml.etree.ElementTree as ET
register_all_namespaces('filename.xml')
tree = ET.parse('filename.xml')
# XML modification here
# save the modifications
tree.write('filename.xml')
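To check the approach end to end, here is a self-contained sketch that round-trips a small document through a temporary file. The example.com namespace URIs, the company prefix, and the roundtrip helper are invented for the illustration. The default namespace arrives from iterparse with an empty prefix, and registering that empty prefix is what keeps xmlns="..." intact on output.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample: one default namespace and one prefixed namespace.
SAMPLE = (
    '<root xmlns="http://example.com/default" '
    'xmlns:company="http://example.com/company">'
    '<child company:name="x"/></root>'
)


def register_all_namespaces(filename):
    """Register every (prefix, uri) pair declared in the file."""
    for _, (prefix, uri) in ET.iterparse(filename, events=["start-ns"]):
        ET.register_namespace(prefix, uri)


def roundtrip(xml_text):
    """Write xml_text to a temp file, parse it, write it back, return the result."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.xml")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(xml_text)
        register_all_namespaces(path)
        tree = ET.parse(path)
        tree.write(path, encoding="utf-8")
        with open(path, encoding="utf-8") as fh:
            return fh.read()


print(roundtrip(SAMPLE))
```

Note that the registry is process-wide, so registering prefixes from one file affects how every later tree in the same process is serialized.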
I am trying to parse an XML file using Python's ElementTree.
The XML file looks like this:
<App xmlns="test attribute">
    <name>sagar</name>
</App>
Parser code:
from xml.etree.ElementTree import ElementTree
from xml.etree.ElementTree import Element
import xml.etree.ElementTree as etree

def parser():
    eleTree = etree.parse('app.xml')
    eleRoot = eleTree.getroot()
    print("Tag:"+str(eleRoot.tag)+"\nAttrib:"+str(eleRoot.attrib))

if __name__ == "__main__":
    parser()
Output:
[sagar#linux Parser]$ python test.py
Tag:{test attribute}App <------------- It should print only "App"
Attrib:{}
When I remove the "xmlns" attribute or rename it to something else, eleRoot.tag prints the correct value.
Why is ElementTree unable to parse the tags properly when I have an "xmlns" attribute in the tag? Am I missing some prerequisite for parsing XML of this format with ElementTree?
Your XML uses the attribute xmlns, which defines a default XML namespace. XML namespaces are used to solve naming conflicts and are expected to have a URI as their value, so "test attribute" is not a proper namespace name; at first glance that looked like what was troubling etree.
For more information on XML namespaces, see XML Namespaces at W3 Schools.
Edit:
After looking into the issue further, it appears that the fully qualified name of an element in Python's ElementTree has the form {namespace_url}tag_name. This means that, since you defined a default namespace of "test attribute", the fully qualified name of your "App" tag is in fact {test attribute}App, which is what your program prints.
Source
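A short sketch of the {namespace}tag form described above, using the document from the question as an inline string. The split_tag helper is a hypothetical convenience, not part of ElementTree; it separates the namespace from the local name when only "App" is wanted.

```python
import xml.etree.ElementTree as ET


def split_tag(tag):
    """Split '{namespace}local' into (namespace, local); namespace is None if absent."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag


root = ET.fromstring('<App xmlns="test attribute"><name>sagar</name></App>')
print(root.tag)                # {test attribute}App
print(split_tag(root.tag))     # ('test attribute', 'App')
print(split_tag(root[0].tag))  # ('test attribute', 'name')
```

The child element inherits the default namespace, which is why its tag carries the same prefix in braces.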
I am new to etree. I wanted to read an etree and write that information out in another file format such as HTML or XML. I checked, and I can do that now, but what about the other way around? That is, what if I want to read some other file format and generate or write an etree from it? Please give me some suggestions, ideally with an example, on how to proceed.
Suppose you want to write an XML file test.xml like the following:
<?xml version='1.0' encoding='ASCII'?>
<document category="locations">
  <name>Timbuktu</name>
  <name>Eldorado</name>
</document>
The corresponding code would be:
from lxml import etree

root = etree.Element("document", {"category" : "locations"})
for location in ["Timbuktu", "Eldorado"]:
    name = etree.SubElement(root, "name")
    name.text = location
tree = etree.ElementTree(element=root, file=None, parser=None)
tree.write('test.xml', pretty_print=True, xml_declaration=True)
If you want to add further sub-elements under name, nest another for loop and create the sub-elements under the name element object.
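A minimal sketch of that nested loop, assuming the input is a plain dictionary that maps each location to a list of alternative names. The <alias> element, the sample data, and the build_document helper are hypothetical. The standard library's xml.etree.ElementTree is used here instead of lxml so the snippet runs without extra packages; Element and SubElement behave the same way in both.

```python
import xml.etree.ElementTree as ET


def build_document(locations):
    """Build <document> with one <name> per key and one <alias> per list entry."""
    root = ET.Element("document", {"category": "locations"})
    for place, aliases in locations.items():
        name = ET.SubElement(root, "name")
        name.text = place
        # Inner loop: sub-elements are attached to the <name> element, not the root.
        for alias in aliases:
            ET.SubElement(name, "alias").text = alias
    return root


root = build_document({"Timbuktu": ["Tombouctou"], "Eldorado": []})
print(ET.tostring(root, encoding="unicode"))
```

The same pattern applies to any other source format: read the records with whatever parser suits that format (csv, json, and so on), then create elements from them in loops like these.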
E.g. consider parsing a pom.xml file:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <parent>
        <groupId>com.parent</groupId>
        <artifactId>parent</artifactId>
        <version>1.0-SNAPSHOT</version>
        <relativePath>../pom.xml</relativePath>
    </parent>
    <modelVersion>2.0.0</modelVersion>
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>Some Module</name>
    ...
Code:
import xml.etree.ElementTree as ET
tree = ET.parse(pom)
root = tree.getroot()
groupId = root.find("groupId")
artifactId = root.find("artifactId")
Both groupId and artifactId are None. Why, when they are direct children of the root? I tried replacing root with tree (groupId = tree.find("groupId")), but that didn't change anything.
The problem is that you don't have a child named groupId; you have a child named {http://maven.apache.org/POM/4.0.0}groupId, because etree doesn't ignore XML namespaces. It uses "universal names". See Working with Namespaces and Qualified Names in the effbot docs.
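As a small sketch of what that means in practice, the lookups below run against a cut-down inline copy of the POM. The first uses the universal name directly; the second passes a prefix map to find(). The prefix "m" is an arbitrary choice for this example and does not have to match anything in the file.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the pom.xml from the question.
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <groupId>com.parent.somemodule</groupId>
  <artifactId>some_module</artifactId>
</project>"""

root = ET.fromstring(POM)

# The bare name does not match, because the real tag includes the namespace.
print(root.find("groupId"))  # None

# Option 1: spell out the universal name.
group_id = root.find("{http://maven.apache.org/POM/4.0.0}groupId")

# Option 2: map a prefix of your choosing to the namespace URI.
NS = {"m": "http://maven.apache.org/POM/4.0.0"}
artifact_id = root.find("m:artifactId", NS)

print(group_id.text, artifact_id.text)  # com.parent.somemodule some_module
```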
Just to expand on abarnert's comment about BeautifulSoup: if you DO just want a quick and dirty solution to the problem, this is probably the fastest way to go about it. I have implemented this in a personal script that uses bs4, where you can traverse the tree with
element = dom.getElementsByTagNameNS('*','elementname')
This matches elementname in ANY namespace, which is handy if you know the file contains only one, so there's no ambiguity.
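The answer does not show how dom was created. The method name getElementsByTagNameNS is the one exposed by the standard library's xml.dom.minidom, so the sketch below assumes dom is a minidom document; that is an assumption on my part, not something the answer states. The POM content is a cut-down copy of the question's file.

```python
from xml.dom import minidom

# Cut-down version of the pom.xml from the question.
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <parent><groupId>com.parent</groupId></parent>
  <groupId>com.parent.somemodule</groupId>
</project>"""

# Assumption: `dom` is a minidom document.
dom = minidom.parseString(POM)

# '*' as the namespace argument matches elements in any namespace.
elements = dom.getElementsByTagNameNS("*", "groupId")
group_ids = [element.firstChild.data for element in elements]
print(group_ids)  # ['com.parent', 'com.parent.somemodule']
```

Unlike find() on the root in ElementTree, this searches all descendants, so the groupId nested under <parent> is returned as well.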