Find non-root parent node where child contains some text - python

I have some XML:
<root>
<parent>
<child>foo987654</child>
</parent>
<parent>
<child>bar15245</child>
</parent>
<parent>
<child>baz87742</child>
</parent>
<parent>
<child>foo123456</child>
</parent>
</root>
I'm using Python and the etree module, and I'd like to select all <parent> nodes whose child starts with "foo". I know etree has limited XPath support, but I'm an XPath rookie, so I'm struggling to land on the best solution. I'd think something to this effect
parent[(contains(child,'foo'))]
but I would want to reject parent nodes that contain foo but don't start with foo (i.e. <child>125456foo</child>), so I'm not sure this would work. Furthermore, I'm not sure etree supports this level of XPath...
EDIT:
Another acceptable solution would be to select parents whose children's text is in a list.
pseudo code
parent=>child[text = "foo1" || "bar1" || "bar2"]
Is that possible?

This will get what you want:
[elem for elem in root.findall('parent') if elem.find('child').text.startswith('foo')]
Watch it in action:
s = """<root>
<parent>
<child>foo987654</child>
</parent>
<parent>
<child>bar15245</child>
</parent>
<parent>
<child>baz87742</child>
</parent>
<parent>
<child>foo123456</child>
</parent>
</root>"""
import xml.etree.ElementTree as ET
root = ET.fromstring(s)
elems = [elem for elem in root.findall('parent') if elem.find('child').text.startswith('foo')]
Checking the data:
for elem in elems:
    print(elem.find('child').text)
>>>
foo987654
foo123456

As the xml.etree documentation shows, ElementTree supports only a small subset of XPath, and functions such as contains() and starts-with() are not part of it. My suggestion would be to select all parent elements with findall('parent') and then iterate over the results, keeping only those whose child text starts with foo.
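A minimal sketch of that select-then-filter approach with the standard library, which also covers the list-membership variant from the question's edit. The helper names (child_text, parents_with_prefix, parents_with_text_in) and the extra sample rows are made up for illustration, and skipping parents without a <child> is an assumption about the desired behaviour.

```python
import xml.etree.ElementTree as ET

# Sample data: the question's XML plus two hypothetical edge cases
# (text that merely contains "foo", and a parent with no child).
XML = """<root>
<parent><child>foo987654</child></parent>
<parent><child>bar15245</child></parent>
<parent><child>125456foo</child></parent>
<parent><child>foo123456</child></parent>
<parent></parent>
</root>"""


def child_text(parent):
    # findtext returns the default when there is no <child> element,
    # so parents without a child simply never match.
    return parent.findtext('child', default='')


def parents_with_prefix(root, prefix):
    # Keep only parents whose child text starts with the prefix.
    return [p for p in root.findall('parent') if child_text(p).startswith(prefix)]


def parents_with_text_in(root, allowed):
    # Keep only parents whose child text is one of the allowed values.
    allowed = set(allowed)
    return [p for p in root.findall('parent') if child_text(p) in allowed]


root = ET.fromstring(XML)
prefix_texts = [child_text(p) for p in parents_with_prefix(root, 'foo')]
listed_texts = [child_text(p) for p in parents_with_text_in(root, ['bar15245', 'foo123456'])]
print(prefix_texts)
print(listed_texts)
```

Filtering in Python rather than in the path expression keeps the code within what ElementTree supports and makes the "starts with" versus "contains" distinction explicit.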

With XPath (using lxml):
import lxml.html
doc = lxml.html.document_fromstring(s)
for e in doc.xpath(".//child[starts-with(text(), 'foo')]"):
    print(e.text)

Related

Python -lxml xpath returns empty list

I am reading an XLIFF file and planning to retrieve a specific element. I tried to print all the elements using
from lxml import etree
with open(r'path\to\file\.xliff', 'r', encoding='utf-8') as xml_file:
    tree = etree.parse(xml_file)
    root = tree.getroot()
    for element in root.iter():
        print("child", element)
The output was
child <Element {urn:oasis:names:tc:xliff:document:2.0}segment at 0x6b8f9c8>
child <Element {urn:oasis:names:tc:xliff:document:2.0}source at 0x6b8f908>
When I tried to get the specific element (with the help of many posts here) - source tag
segment = tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
it returns an empty list. Can someone tell me how to retrieve it properly?
Input :
<?xml version='1.0' encoding='UTF-8'?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
<segment id = 1>
<source>
Hello world
</source>
</segment>
<segment id = 2 >
<source>
2nd statement
</source>
</segment>
</xliff>
I want to get the values of each segment and its corresponding source.
This code,
tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
is not accepted by lxml ("lxml.etree.XPathEvalError: Invalid expression"), because XPath does not understand the {namespace}tag notation. Use findall() instead, which does.
The following works (in the XML sample, the segment elements are children of xliff):
from lxml import etree
tree = etree.parse("test.xliff") # XML in the question; ill-formed attributes corrected
segment = tree.findall('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
However, the real XML is apparently more complex (segment is not a direct child of xliff). Then you need to add .// to search the whole tree:
segment = tree.findall('.//{urn:oasis:names:tc:xliff:document:2.0}segment')
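To get each segment together with its source, one option is a prefix map instead of spelling out the full {namespace}tag every time. The sketch below uses the standard-library xml.etree.ElementTree rather than lxml so that it is self-contained. The prefix "x" is an arbitrary choice, and the sample is the question's XML with the id attributes quoted and the XML declaration left out.

```python
import xml.etree.ElementTree as ET

# The question's XML with the ill-formed id attributes quoted.
XLIFF = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
<segment id="1">
<source>
Hello world
</source>
</segment>
<segment id="2">
<source>
2nd statement
</source>
</segment>
</xliff>"""

# "x" is an arbitrary prefix; only the URI has to match the document.
NS = {'x': 'urn:oasis:names:tc:xliff:document:2.0'}

root = ET.fromstring(XLIFF)
pairs = {}
for segment in root.findall('.//x:segment', NS):
    source = segment.find('x:source', NS)
    # Guard against a segment without a source or with empty text.
    text = source.text.strip() if source is not None and source.text else ''
    pairs[segment.get('id')] = text
print(pairs)
```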

Adding a new XML element using python ElementTree library

I'm trying to add a new element to an xml file using the python ElementTree library with the following code.
from xml.etree import ElementTree as et
def UpdateXML(pre):
    xml_file = 'place/file.xml'
    tree = et.parse(xml_file)
    root = tree.getroot()
    for parent in root.findall('Parent'):
        et.SubElement(parent, "NewNode", attribute=pre)
    tree.write(xml_file)
The XML I want it to render is in the following format
<Parent>
<Child1 Attribute="Stuff"/>
<NewNode Attribute="MoreStuff"/> <--- new
<Child3>
<Child4>
<CHild5>
<Child6>
</Parent>
However the xml it actually renders is in this incorrect format
<Parent>
<Child1 Attribute="Stuff"/>
<Child3>
<Child4>
<CHild5>
<Child6>
<NewNode Attribute="MoreStuff"/> <--- new
</Parent>
What do I change in my code to render the correct xml?
You want the insert operation:
node = et.Element('NewNode')
parent.insert(1,node)
Which in my testing gets me:
<Parent>
<Child1 Attribute="Stuff" />
<NewNode /><Child3 />
<Child4 />
<CHild5 />
<Child6 />
</Parent>
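A slightly fuller sketch of the same idea, assuming the new node should carry the Attribute from the question and go directly after Child1 wherever that element sits. The sample XML is a shortened, well-formed stand-in for the question's file, and setting tail is only cosmetic: it puts the next sibling on its own line when the tree is serialized.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the question's file.
XML = """<Parent>
<Child1 Attribute="Stuff"/>
<Child3/>
<Child4/>
</Parent>"""

parent = ET.fromstring(XML)

# Build the element first, then insert it at a specific position.
new_node = ET.Element('NewNode', {'Attribute': 'MoreStuff'})
new_node.tail = '\n'  # cosmetic: line break after the new element

# Find Child1's position instead of hard-coding index 1.
children = list(parent)
index = children.index(parent.find('Child1')) + 1
parent.insert(index, new_node)

tags = [child.tag for child in parent]
print(ET.tostring(parent, encoding='unicode'))
```

SubElement always appends at the end of the parent, which is why the original code produced the node in the last position.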

Accessing Elements with and without namespaces using lxml

Is there a way, using lxml, to search in a single pass for an element that occurs both with and without a namespace in the same document? As an example, I would want to get all occurrences of the element identifier irrespective of whether or not it is associated with a specific namespace. I am currently only able to access them separately, as below.
Code:
from lxml import etree
xmlfile = etree.parse('xmlfile.xml')
root = xmlfile.getroot()
for l in root.iter('identifier'):
    print(l.text)
for l in root.iter('{http://www.openarchives.org/OAI/2.0/provenance}identifier'):
    print(l.text)
File: xmlfile.xml
<?xml version="1.0"?>
<record>
<header>
<identifier>identifier1</identifier>
<datestamp>datastamp1</datestamp>
<setSpec>setspec1</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>title1</dc:title>
<dc:title>title2</dc:title>
<dc:creator>creator1</dc:creator>
<dc:subject>subject1</dc:subject>
<dc:subject>subject2</dc:subject>
</oai_dc:dc>
</metadata>
<about>
<provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance http://www.openarchives.org/OAI/2.0/provenance.xsd">
<originDescription altered="false" harvestDate="2011-08-11T03:47:51Z">
<baseURL>baseURL1</baseURL>
<identifier>identifier3</identifier>
<datestamp>datestamp2</datestamp>
<metadataNamespace>xxxxx</metadataNamespace>
<originDescription altered="false" harvestDate="2010-10-10T06:15:53Z">
<baseURL>xxxxx</baseURL>
<identifier>identifier4</identifier>
<datestamp>2010-04-27T01:10:31Z</datestamp>
<metadataNamespace>xxxxx</metadataNamespace>
</originDescription>
</originDescription>
</provenance>
</about>
</record>
You could use XPath to solve that kind of issue:
from lxml import etree
xmlfile = etree.parse('xmlfile.xml')
identifier_nodes = xmlfile.xpath("//*[local-name() = 'identifier']")
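If lxml's XPath is not available, roughly the same effect can be had by comparing local names by hand. This is a sketch using the standard-library ElementTree. The helper local_name is made up for illustration, and the sample is a cut-down version of the question's file.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the question's file: one identifier without a
# namespace and one inside the provenance default namespace.
XML = """<record>
<header><identifier>identifier1</identifier></header>
<about>
<provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance">
<originDescription><identifier>identifier3</identifier></originDescription>
</provenance>
</about>
</record>"""


def local_name(tag):
    # '{uri}name' -> 'name'; tags without a namespace come back unchanged.
    return tag.rpartition('}')[2]


root = ET.fromstring(XML)
identifiers = [el.text for el in root.iter() if local_name(el.tag) == 'identifier']
print(identifiers)
```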

XML to MYSQL Using Python

I have the following test.xml
<root>
<parent>
<ID>1</ID>
<child1>Value1</child1>
<child2>value11</child2>
<child3>
<subchild>value111</subchild>
</child3>
</parent>
<parent>
<ID>2</ID>
<child1>value2</child1>
<child2>value22</child2>
<child2>value333</child2>
</parent>
<parent>
<ID>3</ID>
<child1>value3</child1>
<child2>value33</child2>
</parent>
<parent>
<ID>4</ID>
<child1>value4</child1>
<child2>value44</child2>
</parent>
</root>
What I'm trying to accomplish is the following: I want to iterate through test.xml and, for every parent, put all of the child nodes in a dictionary where the tag is the key and the text is the value. Once I get to the end of the parent, I want to add that to the database, reset the dictionary, and move on to the next parent.
So for the first parent I would want
insert = {'ID':1,'child1':'value1','child2':'value11','subchild':'value111'}
Use it in an SQL query, then move on to the next parent, reset the dictionary, and do the same thing.
Not every parent has the same number of children, and some children have sub-children.
I have tried with:
value = []
tag = []
from xml.etree import ElementTree as ET
for parent in tree.iter():
    for child in parent:
        value.append(child.text)
        tag.append(child.tag)
But I couldn't figure out how to get my desired results. I left out retrieving and opening the XML in order to keep the post as simple as possible. This is the method I was attempting to use, but I don't think it's the right one because I haven't been able to stop the iteration at the end of the parent tag in order to insert.
Any help would be greatly appreciated! thanks
Try this using the lxml library:
from lxml import etree
source = """
<root>
<parent>
<ID>1</ID>
<child1>Value1</child1>
<child2>value11</child2>
<child3>
<subchild>value111</subchild>
</child3>
</parent>
<parent>
<ID>2</ID>
<child1>value2</child1>
<child2>value22</child2>
<child2>value333</child2>
</parent>
<parent>
<ID>3</ID>
<child1>value3</child1>
<child2>value33</child2>
</parent>
<parent>
<ID>4</ID>
<child1>value4</child1>
<child2>value44</child2>
</parent>
</root>
"""
document = etree.fromstring(source)
inserts = []
id_number = 3  # only the parent with this ID is collected
for parent in document.findall('parent'):
    insert = {}
    cont = 0
    for element in parent.iterdescendants():
        if element.tag == 'ID':
            if element.text == str(id_number):
                cont = 1
        # leaf elements only: containers such as child3 are skipped
        if len(element) == 0:
            insert[element.tag] = element.text
    if cont:
        inserts.append(insert)
print(inserts)
There is also an etree API shipped with python (it does not have pretty printing and some other features that lxml has though): http://docs.python.org/library/xml.etree.elementtree.html
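For reference, a minimal standard-library sketch that builds one dictionary per parent without filtering on an ID. The helper name parent_to_row is made up for illustration, and the database insert is left out because it depends on the driver and schema. Note that values stay strings ('1', not 1) and that a repeated tag such as the second child2 overwrites the earlier value.

```python
import xml.etree.ElementTree as ET

# First two parents from the question's test.xml.
XML = """<root>
<parent>
<ID>1</ID>
<child1>Value1</child1>
<child2>value11</child2>
<child3>
<subchild>value111</subchild>
</child3>
</parent>
<parent>
<ID>2</ID>
<child1>value2</child1>
<child2>value22</child2>
<child2>value333</child2>
</parent>
</root>"""


def parent_to_row(parent):
    # Map tag -> text for every leaf element below this parent.
    row = {}
    for element in parent.iter():
        # iter() also yields the parent itself, so skip it explicitly.
        if element is not parent and len(element) == 0:
            row[element.tag] = element.text
    return row


# A fresh dictionary per parent, so nothing has to be reset by hand.
rows = [parent_to_row(p) for p in ET.fromstring(XML).findall('parent')]
for row in rows:
    print(row)
```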

Python ElementTree find() not matching within kml file

I'm trying to find an element from a kml file using element trees as follows:
from xml.etree.ElementTree import ElementTree
tree = ElementTree()
tree.parse("history-03-02-2012.kml")
p = tree.find(".//name")
A sufficient subset of the file to demonstrate the problem follows:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<name>Location history from 03/03/2012 to 03/10/2012</name>
</Document>
</kml>
A "name" element exists; why does the search come back empty?
The name element you're trying to match is actually within the KML namespace, but you aren't searching with that namespace in mind.
Try:
p = tree.find(".//{http://www.opengis.net/kml/2.2}name")
If you were using lxml's XPath instead of the standard-library ElementTree, you'd instead pass the namespace in as a dictionary:
>>> tree = lxml.etree.fromstring('''<kml xmlns="http://www.opengis.net/kml/2.2">
... <Document>
... <name>Location history from 03/03/2012 to 03/10/2012</name>
... </Document>
... </kml>''')
>>> tree.xpath('//kml:name', namespaces={'kml': "http://www.opengis.net/kml/2.2"})
[<Element {http://www.opengis.net/kml/2.2}name at 0x23afe60>]
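The standard-library ElementTree also accepts a prefix map as the second argument to find() and findall(), which avoids repeating the full URI. A small sketch, where the prefix "kml" is an arbitrary choice:

```python
import xml.etree.ElementTree as ET

KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<name>Location history from 03/03/2012 to 03/10/2012</name>
</Document>
</kml>"""

# The prefix is arbitrary; only the URI has to match the document.
NS = {'kml': 'http://www.opengis.net/kml/2.2'}

root = ET.fromstring(KML)
name = root.find('.//kml:name', NS)  # namespace-aware lookup
unqualified = root.find('.//name')   # the original, failing lookup
print(name.text)
print(unqualified)
```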
