Case insensitive XML and Python

I have this piece of code and I am trying to read the 'href' attribute of every 'ref' tag. I am not sure how to make the match case insensitive, as some of my XML files have REF or Ref or ref.
Any suggestions?
f = urllib.urlopen(url)
tree = ET.parse(f)
root = tree.getroot()
for child in root.iter('ref'):
    t = child.get('href')
    if t not in self.href:
        self.href.append(t)
        print self.href[-1]

You can normalize tags and attributes by converting them to lowercase as a preprocessing step, using the functions below:
import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead
import xml.etree.ElementTree as ET
f = urllib.urlopen(url)
tree = ET.parse(f)
root = tree.getroot()
def normalize_tags(root):
    # lowercase this element's tag, then recurse into its children
    root.tag = root.tag.lower()
    for child in root:
        normalize_tags(child)
def normalize_attr(root):
    # iterate over a copy, because the attribute dict is modified in the loop
    for attr, value in list(root.attrib.items()):
        norm_attr = attr.lower()
        if norm_attr != attr:
            root.set(norm_attr, value)
            root.attrib.pop(attr)
    for child in root:
        normalize_attr(child)
normalize_tags(root)
normalize_attr(root)
print(ET.tostring(root))

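The following should help:
f = urllib.urlopen(url)
tree = ET.parse(f)
root = tree.getroot()
# root.iter() walks every element, like root.iter('ref') in the question
for child in root.iter():
    if child.tag.lower() == 'ref':
        # elements expose their attributes as .attrib (there is no .attribute)
        t = child.attrib.get('href')
        if t not in self.href:
            self.href.append(t)
            print self.href[-1]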
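For reference, here is a self-contained Python 3 sketch of the same idea that also ignores case in the attribute name. The sample document and the helper name collect_hrefs are made up for illustration; they are not from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input mixing ref/Ref/REF and href/HREF/Href.
SAMPLE = """<links>
  <ref href="a.html"/>
  <Ref HREF="b.html"/>
  <group><REF Href="c.html"/></group>
  <ref href="a.html"/>
</links>"""

def collect_hrefs(root, tag='ref', attr='href'):
    """Return unique `attr` values of `tag` elements, ignoring case in both names."""
    found = []
    for elem in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(elem.tag, str) or elem.tag.lower() != tag:
            continue
        for name, value in elem.attrib.items():
            if name.lower() == attr and value not in found:
                found.append(value)
    return found

hrefs = collect_hrefs(ET.fromstring(SAMPLE))
print(hrefs)
```

Unlike the normalizing approach, this leaves the tree untouched, which may matter if the document is written back out later.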

If you are using lxml, then one option is to use XPath with regular expressions through the EXSLT extensions (https://stackoverflow.com/a/2756994/2997179):
# match elements named ref, Ref, REF, ... anywhere below root ('i' = ignore case)
root.xpath(".//*[re:test(local-name(), '^ref$', 'i')]",
           namespaces={"re": "http://exslt.org/regular-expressions"})

Related

Python - multiple logs/structures in the same xml file using ElementTree

I wrote an XML file with ElementTree. The problem I'm facing is that I want to write multiple logs to the same XML file.
The code:
import xml.etree.ElementTree as ET  # xml.etree.cElementTree was removed in Python 3.9
#XML
root = ET.Element('LOG')
DATE = ET.SubElement(root, 'DATE')
DATE.text = "child_1"
TIME = ET.SubElement(root, 'TIME')
TIME.text = "child_2"
CC = ET.SubElement(root, 'CC')
CC.text = "child_3"
AMOUNT = ET.SubElement(root, 'AMOUNT')
AMOUNT.text = "child_4"
tree = ET.ElementTree(root)
#Generating XML
tree.write("file_name.xml")
#Print
print(open("file_name.xml").read())
Current output example:
<LOG><DATE>child_1</DATE><TIME>child_2</TIME><CC>child_3</CC><AMOUNT>child_4</AMOUNT></LOG>
Output that I want:
<LOG><DATE>child_1</DATE><TIME>child_2</TIME><CC>child_3</CC><AMOUNT>child_4</AMOUNT></LOG><LOG><DATE>child_1</DATE><TIME>child_2</TIME><CC>child_3</CC><AMOUNT>child_4</AMOUNT></LOG><LOG><DATE>child_1</DATE><TIME>child_2</TIME><CC>child_3</CC><AMOUNT>child_4</AMOUNT></LOG>...
What you want is not how XML works: an XML document has exactly one root element. The closest you can get is to wrap your <LOG> entries in a container:
<ALL_LOGS>
  <LOG>....</LOG>
  <LOG>....</LOG>
</ALL_LOGS>
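A minimal sketch of that container approach with ElementTree might look like the following. The ALL_LOGS name comes from the answer above; the make_log helper is only an illustrative assumption.

```python
import xml.etree.ElementTree as ET

def make_log(date, time, cc, amount):
    """Build one <LOG> element with the four children from the question."""
    log = ET.Element('LOG')
    for tag, text in (('DATE', date), ('TIME', time),
                      ('CC', cc), ('AMOUNT', amount)):
        ET.SubElement(log, tag).text = text
    return log

# Single root element that holds every log entry.
all_logs = ET.Element('ALL_LOGS')
for _ in range(3):
    all_logs.append(make_log('child_1', 'child_2', 'child_3', 'child_4'))

xml_text = ET.tostring(all_logs, encoding='unicode')
print(xml_text)
```

To add a log to an existing file later, you would parse it, append another LOG element to the root, and write the tree back.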

How to extract values from xml file with namespaces?

I have the XML file shown below, which uses namespaces, and I'm trying to extract the values of the Node24 elements.
My current code, below, does not print anything:
import xml.etree.ElementTree as ET
filename = 'ifile.xml'
tree = ET.parse(filename)
root = tree.getroot()
for neighbor in root.iter('Node24'):
    print(neighbor)
My expected output would be:
03-c34ko
04-c64ko
07-c54ko
This is ifile.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<data-main-43:DATAMAINXZ123 xmlns="https://example.com/DATA-MAIN-XZ123" xmlns:data-gen="https://example.com/DATA-GEN" xmlns:data-main-43="https://example.com/DATA-MAIN-XZ123" xmlns:xsi="http://www.w3.org/2011/XMLSchema-instance" xsi:schemaLocation="https://example.com/DATA-MAIN-XZ123 data-main-ir21-12.1.xsd">
<MAINXZ123FileHeader>
<DATAGenSchemaVersion>2.4</DATAGenSchemaVersion>
<DATAMAINXZ123SchemaVersion>12.1</DATAMAINXZ123SchemaVersion>
</MAINXZ123FileHeader>
<Node1>
<Node2>WTRT DDK</Node2>
<Node3>XYZW</Node3>
<Node4>
<Node5>
<Node6>XYZW882</Node6>
<Node5Type>Ter</Node5Type>
<Node5Data>
<Node9>
<Node10>
<Node11>2019-02-18</Node11>
<Node12>
<Node13>
<Node14>
<Node15>Ermso</Node15>
<Node16>
<PrimaryNode16>
<Node18>19.32</Node18>
<Node18>12.11</Node18>
</PrimaryNode16>
<SecondaryNode16>
<Node18>82.97</Node18>
<Node18>12.41</Node18>
</SecondaryNode16>
</Node16>
<Node20>Muuatippw</Node20>
</Node14>
</Node13>
</Node12>
<Node21>
<Node22>
<Node23>
<Node24>03-c34ko</Node24>
<Node24>04-c64ko</Node24>
<Node24>07-c54ko</Node24>
</Node23>
<Node26Node22EdgeAgent>
<Node26>jjkksonem</Node26>
<PrimaryNode18DEANode26>
<Node18>2.40</Node18>
</PrimaryNode18DEANode26>
</Node26Node22EdgeAgent>
</Node22>
</Node21>
<Node28>
<Node29>
<Node30>false</Node30>
<Node31>true</Node31>
</Node29>
</Node28>
</Node10>
</Node9>
</Node5Data>
</Node5>
</Node4>
</Node1>
</data-main-43:DATAMAINXZ123>
How can I do this? Thanks in advance.
Like the duplicate mzjn referenced, just add the namespace uri to the element name...
import xml.etree.ElementTree as ET
filename = 'ifile.xml'
tree = ET.parse(filename)
root = tree.getroot()
for neighbor in root.iter('{https://example.com/DATA-MAIN-XZ123}Node24'):
    print(neighbor.text)
Note: I also added .text to neighbor so you'd get the requested result.
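If writing the full URI into every tag name gets noisy, a prefix-to-URI mapping can be passed to findall() instead. The sketch below uses a trimmed-down stand-in for ifile.xml; the prefix m is an arbitrary choice of mine and only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for ifile.xml with the same default namespace.
SAMPLE = """<root xmlns="https://example.com/DATA-MAIN-XZ123">
  <Node23>
    <Node24>03-c34ko</Node24>
    <Node24>04-c64ko</Node24>
    <Node24>07-c54ko</Node24>
  </Node23>
</root>"""

# The prefix "m" is arbitrary; it is resolved to the URI at query time.
NS = {'m': 'https://example.com/DATA-MAIN-XZ123'}

root = ET.fromstring(SAMPLE)
values = [node.text for node in root.findall('.//m:Node24', NS)]
print(values)
```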
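This is an alternative answer using a regular expression.
I convert the XML into a string and then search for the text between the Node24 tags. Note that ns0 is the prefix ElementTree generates when it serializes the tree, so the pattern depends on that output.
import xml.etree.ElementTree as ET
import re
filename = 'ifile.xml'
tree = ET.parse(filename)
root = tree.getroot()
# encoding='unicode' makes tostring() return str instead of bytes
xml_str = ET.tostring(root, encoding='unicode')
for s in re.findall(r'ns0:Node24>(.*?)</ns0:Node24', xml_str):
    print(s)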
Result:
03-c34ko
04-c64ko
07-c54ko

Copy a node from one xml file to another using lxml

I'm trying to find the simplest way of copying one node to another XML file. Both files will contain the same node - just the contents of that node will be different.
In the past I've done some crazy copying of each element and subelement, but there has to be a better way.
#Master XML
parser = etree.XMLParser(strip_cdata=False)
tree = etree.parse('file1.xml', parser)
# Find the //input node - which has a lot of subelems
inputMaster= tree.xpath('//input')[0]
#Dest XML -
parser2 = etree.XMLParser(strip_cdata=False)
tree2 = etree.parse('file2.xml', parser2)
# this won't work but.. it would be nice
etree.SubElement(tree2,'input') = inputMaster
Here's one way. It's not brilliant, as it loses the position (the node ends up appended at the end), but hey:
from lxml import etree
def getMaster(somefile):
    parser = etree.XMLParser(strip_cdata=False)
    tree = etree.parse(somefile, parser)
    doc = tree.getroot()
    inputMaster = doc.find('input')
    return inputMaster
inputXML = getMaster('master_file.xml')
parser = etree.XMLParser(strip_cdata=False)
tree = etree.parse('file_to_copy_node_to.xml', parser)
doc = tree.getroot()
doc.remove(doc.find('input'))
doc.append(inputXML)
# Now write it; tostring() returns bytes, so open the file in binary mode
newxml = etree.tostring(tree, pretty_print=True)
with open('file_to_copy_node_to.xml', 'wb') as f:
    f.write(newxml)
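If the position matters, one option is to remember the old node's index and insert the new node there. The sketch below uses the standard library's xml.etree.ElementTree and in-memory documents so that it runs without extra dependencies; as far as I know lxml elements offer the same find, remove and insert methods, but treat that as an assumption to verify. The replace_child helper and the sample documents are made up for illustration.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the master and destination files.
master = ET.fromstring('<doc><a/><input><x>new</x></input><b/></doc>')
dest = ET.fromstring('<doc><a/><input><x>old</x></input><b/></doc>')

def replace_child(parent, tag, replacement):
    """Swap parent's first direct child named `tag` for a copy of `replacement`."""
    old = parent.find(tag)
    position = list(parent).index(old)  # remember where the old node sat
    parent.remove(old)
    # deepcopy so the master tree keeps its own node
    parent.insert(position, copy.deepcopy(replacement))

replace_child(dest, 'input', master.find('input'))
result = ET.tostring(dest, encoding='unicode')
print(result)
```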

Python: specify XMLNS on xml.etree elements

In my Python code I'm currently using the xml.etree library to create a tree and then dump it to an XML string. Unfortunately, I can only use modules from the Python standard library to do that.
Here is my code:
import xml.etree.ElementTree as ET
def dump_to_XML():
    root_node = ET.Element("root")
    c1_node = ET.SubElement(root_node, "child1")
    c1_node.text = "foo"
    c2_node = ET.SubElement(root_node, "child2")
    gc1_node = ET.SubElement(c2_node, "grandchild1")
    gc1_node.text = "bar"
    return ET.tostring(root_node, encoding='utf8', method='xml')
which gives the string:
<?xml version='1.0' encoding='utf8'?>
<root>
<child1>foo</child1>
<child2>
<grandchild1>bar</grandchild1>
</child2>
</root>
Now, I have two schema files located at, say, http://myhost.com/p.xsd and http://myhost.com/q.xsd, and I want the output string to be turned into:
<?xml version='1.0' encoding='UTF-8'?>
<root xmlns:p="http://myhost.com/p.xsd" xmlns:q="http://myhost.com/q.xsd">
<p:child1>foo</p:child1>
<p:child2>
<q:grandchild1>bar</q:grandchild1>
</p:child2>
</root>
How can I leverage the etree library in order to achieve that?
Thanks in advance
Here we go:
import xml.etree.ElementTree as ET
xmlns_uris = {'p': 'http://myhost.com/p.xsd',
              'q': 'http://myhost.com/q.xsd'}
def dump_to_XML():
    root_node = ET.Element("root")
    c1_node = ET.SubElement(root_node, "child1")
    c1_node.text = "foo"
    c2_node = ET.SubElement(root_node, "child2")
    gc1_node = ET.SubElement(c2_node, "grandchild1")
    gc1_node.text = "bar"
    annotate_with_XMLNS_prefixes(gc1_node, 'q', False)
    annotate_with_XMLNS_prefixes(root_node, 'p')
    add_XMLNS_attributes(root_node, xmlns_uris)
    return ET.tostring(root_node, encoding='UTF-8', method='xml')
def annotate_with_XMLNS_prefixes(tree, xmlns_prefix, skip_root_node=True):
    if not ET.iselement(tree):
        tree = tree.getroot()
    iterator = tree.iter()
    if skip_root_node: # Add XMLNS prefix also to the root node?
        next(iterator)  # works on Python 2.6+ and 3, unlike iterator.next()
    for e in iterator:
        if ':' not in e.tag:
            e.tag = xmlns_prefix + ":" + e.tag
def add_XMLNS_attributes(tree, xmlns_uris_dict):
    if not ET.iselement(tree):
        tree = tree.getroot()
    for prefix, uri in xmlns_uris_dict.items():
        tree.attrib['xmlns:' + prefix] = uri
Printing the result of dump_to_XML() gives (line breaks added for readability):
<?xml version='1.0' encoding='UTF-8'?>
<root xmlns:p="http://myhost.com/p.xsd" xmlns:q="http://myhost.com/q.xsd">
<p:child1>foo</p:child1>
<p:child2>
<q:grandchild1>bar</q:grandchild1>
</p:child2>
</root>
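A standard-library alternative that avoids writing prefixes into the tag names by hand is to create the elements with namespace-qualified names and register the preferred prefixes. This is only a sketch under the question's assumptions (same URIs, same element names); note that register_namespace changes a registry that is global to the ElementTree module.

```python
import xml.etree.ElementTree as ET

P = 'http://myhost.com/p.xsd'
Q = 'http://myhost.com/q.xsd'

# Tell the serializer which prefixes to emit for these URIs
# (the registry is global to the ElementTree module).
ET.register_namespace('p', P)
ET.register_namespace('q', Q)

def dump_to_xml():
    root = ET.Element('root')
    # '{uri}name' is ElementTree's notation for a namespace-qualified tag.
    child1 = ET.SubElement(root, '{%s}child1' % P)
    child1.text = 'foo'
    child2 = ET.SubElement(root, '{%s}child2' % P)
    grandchild1 = ET.SubElement(child2, '{%s}grandchild1' % Q)
    grandchild1.text = 'bar'
    return ET.tostring(root, encoding='unicode')

xml_text = dump_to_xml()
print(xml_text)
```

Because the elements really are in the namespaces here, the tree can also be queried by namespace afterwards, which the string-prefix approach does not allow.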
With lxml, the prefixes can be declared through nsmap:
from lxml import etree
xmlns_uris = {'p': 'http://myhost.com/p.xsd', 'q': 'http://myhost.com/q.xsd'}
root = etree.Element('root', nsmap = xmlns_uris)
child1 = etree.SubElement(root,'{%s}child1'%xmlns_uris['p'])
child1.text = 'foo'
child2 = etree.SubElement(root,'{%s}child2'%xmlns_uris['p'])
grandchild1 = etree.SubElement(child2,'{%s}grandchild1'%xmlns_uris['q'])
grandchild1.text = 'bar'
# the bytes are UTF-8 encoded, so decode them as UTF-8
print(etree.tostring(root, pretty_print=True, encoding='UTF-8', xml_declaration=True).decode('utf-8'))

Changing a specific xml element using Python 3 ElementTree

I have a set of metadata files in XML which are updated regularly, and I'm trying to automate the updates.
I've worked out how to iteratively find and then replace text in the desired element of the XML, but I thought there must be a direct way to access and change the element. I just can't work it out.
The metadata xml is formatted:
<?xml version="1.0" ?>
<metadata xml:lang="en">
<Esri>
<CreaDate>20120405</CreaDate>
<CreaTime>13113000</CreaTime>
<ArcGISFormat>1.0</ArcGISFormat>
<SyncOnce>TRUE</SyncOnce>
<ModDate>20121129</ModDate>
<ModTime>11433300</ModTime>
<ArcGISProfile>ItemDescription</ArcGISProfile>
</Esri>
<dataIdInfo>
<idPurp>Updated :: 121129_114038</idPurp>
</dataIdInfo>
</metadata>
My iterative approach was:
for child in root:
    for xel in child.iter('idPurp'):
        download_new_datetime = strftime('%y%m%d_%H%M%S')
        download_new_text = 'Downloaded :: '
        xel.text = download_new_text + download_new_datetime
        tree.write(xmlfile)
Ideas appreciated on a better way.
I would write to the file only once I'm done with the loop:
import xml.etree.ElementTree as ET
from time import strftime
xmlfile = '/tmp/file'
tree = ET.parse(xmlfile)
root = tree.getroot()
for child in root:
    for xel in child.iter('idPurp'):
        download_new_datetime = strftime('%y%m%d_%H%M%S')
        download_new_text = 'Downloaded :: '
        xel.text = download_new_text + download_new_datetime
tree.write(xmlfile)
I would even simplify that loop further to:
for child in root:
    for xel in child.iter('idPurp'):
        # strftime was imported directly, so call it without the time. prefix
        xel.text = 'Downloaded :: ' + strftime('%y%m%d_%H%M%S')
Two simpler ways; both are tested and work.
First:
import xml.etree.ElementTree as ET
from time import strftime
xmlfile = 'metadata.xml'
tree = ET.parse(xmlfile)
root = tree.getroot()
xel = root.find('./dataIdInfo/idPurp')
xel.text = 'Downloaded :: ' + strftime('%y%m%d_%H%M%S')
tree.write(xmlfile)
Second:
import xml.etree.ElementTree as ET
from time import strftime
xmlfile = 'metadata.xml'
tree = ET.parse(xmlfile)
root = tree.getroot()
xel = root[1][0]
xel.text = 'Downloaded :: ' + strftime('%y%m%d_%H%M%S')
tree.write(xmlfile)
I prefer the first one, it's more readable in my opinion.
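One thing to watch with the find() approach: it returns None when the path does not match, so a file without an idPurp element would raise an AttributeError on the assignment. A small guard avoids that. The stamp_download helper and the shortened sample document below are my own illustration, not part of the answer above.

```python
import xml.etree.ElementTree as ET
from time import strftime

# Shortened stand-in for the metadata file from the question.
SAMPLE = """<metadata xml:lang="en">
  <Esri><ModDate>20121129</ModDate></Esri>
  <dataIdInfo><idPurp>Updated :: 121129_114038</idPurp></dataIdInfo>
</metadata>"""

def stamp_download(root, path='./dataIdInfo/idPurp'):
    """Set the element at `path` to a 'Downloaded :: <timestamp>' string.

    Returns True if the element existed, False otherwise.
    """
    xel = root.find(path)
    if xel is None:  # find() returns None when nothing matches
        return False
    xel.text = 'Downloaded :: ' + strftime('%y%m%d_%H%M%S')
    return True

root = ET.fromstring(SAMPLE)
updated = stamp_download(root)
print(updated, root.find('./dataIdInfo/idPurp').text)
```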
