Exact match of substring in string Python

I know this question is quite common, but my example below is a bit more complex than the title of the question suggests.
Suppose I've got the following "test.xml" file:
<?xml version="1.0" encoding="UTF-8"?>
<test:xml xmlns:test="http://com/whatever/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<parent xsi:type="parentType">
<child xsi:type="childtype">
<grandchild>
<greatgrandchildone>greatgrandchildone</greatgrandchildone>
<greatgrandchildtwo>greatgrandchildtwo</greatgrandchildtwo>
</grandchild><!--random comment -->
</child>
<child xsi:type="childtype">
<greatgrandchildthree>greatgrandchildthree</greatgrandchildthree>
<greatgrandchildfour>greatgrandchildfour</greatgrandchildfour><!--another random comment -->
</child>
<child xsi:type="childtype">
<greatgrandchildthree>greatgrandchildthree</greatgrandchildthree>
<greatgrandchildfour>greatgrandchildfour</greatgrandchildfour><!--third random comment -->
</child>
</parent>
</test:xml>
Within my program below, I'm doing two main things:
1. Find all the nodes in the XML that have a "type" attribute.
2. Loop through each node of the XML and find out whether it is a descendant of an element that has a "type" attribute.
This is my code:
from lxml import etree
import re

xmlDoc = etree.parse("test.xml")
root = xmlDoc.getroot()
nsmap = {
    'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
}
nodesWithType = []

def check_type_in_path(nodesWithType, path, root):
    typesInPath = []
    elementType = ""
    for node in nodesWithType:
        print("checking node: ", node, " and path: ", path)
        if re.search(r"\b{}\b".format(node), path, re.IGNORECASE) is not None:
            element = root.find('.//{0}'.format(node))
            elementType = element.attrib.get(f"{{{nsmap['xsi']}}}type")
            if elementType is not None:
                print("found an element for this path. adding to list")
                typesInPath.append(elementType)
        else:
            print("element: ", node, " not found in path: ", path)
    print("path ", path ," has types: ", elementType)
    print("-------------------")
    return typesInPath

def get_all_node_types(xmlDoc):
    nodesWithType = []
    root = xmlDoc.getroot()
    for node in xmlDoc.iter():
        path = "/".join(xmlDoc.getpath(node).strip("/").split('/')[1:])
        if "COMMENT" not in path.upper():
            element = root.find('.//{0}'.format(path))
            elementType = element.attrib.get(f"{{{nsmap['xsi']}}}type")
            if elementType is not None:
                nodesWithType.append(path)
    return nodesWithType

nodesWithType = get_all_node_types(xmlDoc)
print("nodesWithType: ", nodesWithType)
for node in xmlDoc.xpath('//*'):
    path = "/".join(xmlDoc.getpath(node).strip("/").split('/')[1:])
    typesInPath = check_type_in_path(nodesWithType, path, root)
The code should return all the types that are contained within a certain path. For example, consider the path parent/child[3]/greatgrandchildfour. This path is a descendant (either direct or distant) of two nodes that have the "type" attribute: parent and parent/child[3]. I would therefore expect the typesInPath list for that particular node to include both "parentType" and "childtype".
However, based on the prints below, the list for this node only includes "parentType" and doesn't include "childtype". The main focus of this logic is checking whether the path of the node with the type is contained in the path of the node in question (hence checking for an exact match of the string). But this is clearly not working. I'm not sure whether it's because the index predicates (such as [3]) in the path stop the condition from matching, or because of something else.
For the above example, the returned prints are:
checking node: parent and path: parent/child[3]/greatgrandchildfour
found an element for this path. adding to list
checking node: parent/child[1] and path: parent/child[3]/greatgrandchildfour
element: parent/child[1] not found in path: parent/child[3]/greatgrandchildfour
checking node: parent/child[2] and path: parent/child[3]/greatgrandchildfour
element: parent/child[2] not found in path: parent/child[3]/greatgrandchildfour
checking node: parent/child[3] and path: parent/child[3]/greatgrandchildfour
element: parent/child[3] not found in path: parent/child[3]/greatgrandchildfour
path parent/child[3]/greatgrandchildfour has types: parentType
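A likely explanation, sketched with the standard library only: the node string is placed into the regular expression unescaped, so the [3] in parent/child[3] is read as a character class that matches the single character 3, and the pattern effectively looks for parent/child3. Escaping the string with re.escape is not enough on its own, because the trailing \b needs a word character on one side, and in parent/child[3]/greatgrandchildfour the ] is followed by /. Comparing path segments avoids regular expressions altogether. The helper name is_ancestor_path is made up for this illustration.

```python
import re

def is_ancestor_path(candidate, path):
    """True if `candidate` equals `path` or is a leading run of its segments."""
    cand_parts = candidate.split("/")
    path_parts = path.split("/")
    return path_parts[:len(cand_parts)] == cand_parts

path = "parent/child[3]/greatgrandchildfour"

# Unescaped: "[3]" is a character class, so this searches for "parent/child3".
unescaped = re.search(r"\b{}\b".format("parent/child[3]"), path)

# Escaped: the literal text is present, but the final \b sits between "]" and "/",
# two non-word characters, so there is no word boundary and the search still fails.
escaped = re.search(r"\b{}\b".format(re.escape("parent/child[3]")), path)

# Segment comparison gives the intended answers.
matches = [p for p in ["parent", "parent/child[1]", "parent/child[3]"]
           if is_ancestor_path(p, path)]
```

In check_type_in_path, the re.search condition could then be replaced with if is_ancestor_path(node, path):.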

Related

Search for specific text in an element of XML with DOM (Python)

For a program in Python I am looking for a way to find a specific text in an element of XML and to find out which node number it is.
This is the xml:
<shortcut>
    <label>33060</label>
    <label2>Common Shortcut</label2>
</shortcut>
<shortcut>
    <label>Test</label>
</shortcut>
Of course I know it is probably node number 2 in here, but the xml file can be longer.
These are two ways I tried it, but I can't get either to work properly:
from xml.dom import minidom

xmldoc = minidom.parse("/DATA.xml")
Shortcut = xmldoc.getElementsByTagName("shortcut")
Label = xmldoc.getElementsByTagName("label")
print(xmldoc.getElementsByTagName("label")[12].firstChild.nodeValue)  # works

# First attempt
for element in Label:
    if element.getAttributeNode("label") == 'Test':
    # if element.getAttributeNode('label') == "Test":
        print("element found")
    else:
        print("element not found")

# Second attempt
for node in xmldoc.getElementsByTagName("label"):
    if node.nodeValue == "Test":
        print("element found")
    else:
        print("element not found")

This working example demonstrates one possible way to search for an element containing specific text using the minidom module*:
from xml.dom.minidom import parseString

def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

xml = """<root>
<shortcut>
<label>33060</label>
<label2>Common Shortcut</label2>
</shortcut>
<shortcut>
<label>Test</label>
</shortcut>
</root>"""
xmldoc = parseString(xml)
labels = xmldoc.getElementsByTagName("label")
for label in labels:
    text = getText(label.childNodes)
    if text == "Test":
        print("node found : " + label.toprettyxml())
        break
output :
node found : <label>Test</label>
*) getText() function taken from minidom documentation page.
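The question also asks which node number the match is. A minimal sketch, assuming the position among the <shortcut> elements is what is wanted (0-based here, so add 1 for a human-readable count); find_shortcut_index is a hypothetical helper name, not part of minidom:

```python
from xml.dom.minidom import parseString

XML = """<root>
<shortcut>
<label>33060</label>
<label2>Common Shortcut</label2>
</shortcut>
<shortcut>
<label>Test</label>
</shortcut>
</root>"""

def find_shortcut_index(xml_text, wanted):
    """Return the 0-based index of the first <shortcut> whose <label> text equals `wanted`."""
    doc = parseString(xml_text)
    for index, shortcut in enumerate(doc.getElementsByTagName("shortcut")):
        for label in shortcut.getElementsByTagName("label"):
            # Join the text nodes directly under <label>.
            text = "".join(n.data for n in label.childNodes
                           if n.nodeType == n.TEXT_NODE)
            if text == wanted:
                return index
    return None

index_of_test = find_shortcut_index(XML, "Test")
```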

Recursive XML parsing python using ElementTree

I'm trying to parse the XML below using Python ElementTree to produce the output shown further down. I'm trying to write modules for the top-level elements to print them. However, it is slightly tricky, as a category element may or may not have a property, and a category element may contain another category element.
I've referred to previous questions on this topic, but they did not involve nested elements with the same name.
My Code:
http://pastebin.com/Fsv2Xzqf
work.xml:
<suite id="1" name="MainApplication">
<displayNameKey>my Application</displayNameKey>
<displayName>my Application</displayName>
<application id="2" name="Sub Application1">
<displayNameKey>sub Application1</displayNameKey>
<displayName>sub Application1</displayName>
<category id="2423" name="about">
<displayNameKey>subApp.about</displayNameKey>
<displayName>subApp.about</displayName>
<category id="2423" name="comms">
<displayNameKey>subApp.comms</displayNameKey>
<displayName>subApp.comms</displayName>
<property id="5909" name="copyright" type="string_property" width="40">
<value>2014</value>
</property>
<property id="5910" name="os" type="string_property" width="40">
<value>Linux 2.6.32-431.29.2.el6.x86_64</value>
</property>
</category>
<property id="5908" name="releaseNumber" type="string_property" width="40">
<value>9.1.0.3.0.54</value>
</property>
</category>
</application>
</suite>
Output should be as below:
Suite: MainApplication
Application: Sub Application1
Category: about
property: releaseNumber | 9.1.0.3.0.54
category: comms
property: copyright | 2014
property: os | Linux 2.6.32-431.29.2.el6.x86_64
Any pointers in right direction would be of help.

import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='work.xml')
indent = 0
ignoreElems = ['displayNameKey', 'displayName']

def printRecur(root):
    """Recursively prints the tree."""
    # The global declaration has to come before indent is first used.
    global indent
    if root.tag in ignoreElems:
        return
    print(' '*indent + '%s: %s' % (root.tag.title(), root.attrib.get('name', root.text)))
    indent += 4
    # Iterating over the element replaces getchildren(), removed in Python 3.9.
    for elem in root:
        printRecur(elem)
    indent -= 4

root = tree.getroot()
printRecur(root)
OUTPUT:
Suite: MainApplication
    Application: Sub Application1
        Category: about
            Category: comms
                Property: copyright
                    Value: 2014
                Property: os
                    Value: Linux 2.6.32-431.29.2.el6.x86_64
            Property: releaseNumber
                Value: 9.1.0.3.0.54
This is closest I could get in 5 minutes. You should just recursively call a processor function and that would take care. You can improve on from this point :)
You can also define handler function for each tag and put all of them in a dictionary for easy lookup. Then you can check if you have an appropriate handler function for that tag, then call that else just continue with blindly printing. For example:
HANDLERS = {
    'property': 'handle_property',
    # '<tag_name>': '<handler_function>',
}

def handle_property(root):
    """Takes property root element and prints the values."""
    data = ' '*indent + '%s: %s ' % (root.tag.title(), root.attrib['name'])
    values = []
    for elem in root:
        if elem.tag == 'value':
            values.append(elem.text)
    print(data + '| %s' % (', '.join(values)))

# printRecur would get modified accordingly.
def printRecur(root):
    """Recursively prints the tree."""
    global indent
    if root.tag in ignoreElems:
        return
    indent += 4
    if root.tag in HANDLERS:
        # Look up the handler function by name and let it print the element.
        handler = globals()[HANDLERS[root.tag]]
        handler(root)
    else:
        print(' '*indent + '%s: %s' % (root.tag.title(), root.attrib.get('name', root.text)))
        for elem in root:
            printRecur(elem)
    indent -= 4
Output with above:
    Suite: MainApplication
        Application: Sub Application1
            Category: about
                Category: comms
                    Property: copyright | 2014
                    Property: os | Linux 2.6.32-431.29.2.el6.x86_64
                Property: releaseNumber | 9.1.0.3.0.54
I find this very useful rather than putting tons of if/else in the code.
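The output requested in the question lists a category's properties before its nested categories, which the versions above do not do. A minimal sketch of one way to get that ordering, assuming only suite, application, category and property elements matter; the function name render and the shortened sample document are made up for this illustration:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<suite id="1" name="MainApplication">
  <displayName>my Application</displayName>
  <application id="2" name="Sub Application1">
    <category id="2423" name="about">
      <category id="2423" name="comms">
        <property id="5909" name="copyright"><value>2014</value></property>
      </category>
      <property id="5908" name="releaseNumber"><value>9.1.0.3.0.54</value></property>
    </category>
  </application>
</suite>"""

def render(elem, depth=0):
    """Return the output lines for elem, with properties ahead of nested containers."""
    pad = " " * (4 * depth)
    if elem.tag == "property":
        value = elem.findtext("value", default="")
        return ["%sproperty: %s | %s" % (pad, elem.get("name"), value)]
    lines = ["%s%s: %s" % (pad, elem.tag.title(), elem.get("name"))]
    children = [c for c in elem if c.tag in ("application", "category", "property")]
    # Stable sort: properties (key False) come first, everything else keeps its order.
    children.sort(key=lambda c: c.tag != "property")
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines

lines = render(ET.fromstring(SAMPLE))
```

Joining lines with "\n" gives the printable text; returning a list rather than printing keeps the global indent counter out of the picture.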

If you want a barebones XML recursive tree parser snippet:
from xml.etree import ElementTree

tree = ElementTree.parse('english_saheeh.xml')
root = tree.getroot()

def walk_tree_recursive(root):
    # do whatever with .tags here
    for child in root:
        walk_tree_recursive(child)

walk_tree_recursive(root)

If you want a kind of universal XML importer that creates a record per XML element:
import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

def rij(elem,level,tags,rtag,mtag,keys,rootkey,data):
    otag=mtag
    mtag=elem.tag
    # strip a leading "{namespace}" from the tag
    mtag=mtag[mtag.rfind('}')+1:]
    tags.append(mtag)
    if level==1:
        rtag=mtag
        if elem.keys() is not None:
            mkey=[]
            if len(elem.keys())>1:
                for key in elem.keys():
                    mkey.append(elem.attrib.get(key))
                rootkey=mkey
            else:
                for key in elem.keys():
                    rootkey=elem.attrib.get(key)
    else:
        if elem.keys() is not None:
            mkey=[]
            lkey=[]
            for key in elem.keys():
                if len(elem.keys())>1:
                    mkey.append(elem.attrib.get(key))
                    keys=mkey
                else:
                    for key in elem.keys():
                        keys=elem.attrib.get(key)
                        lkey=key
        if elem.text is not None:
            if elem.text!='\n ':
                data.append([rootkey,tags,rtag,otag,mtag,lkey,keys,elem.text])
            else:
                data.append([rootkey,tags,rtag,otag,mtag,lkey,keys,''])
    #print(data)
    level+=1
    # list(elem) replaces getchildren(), which was removed in Python 3.9
    for chil in list(elem):
        data = rij(chil, level,tags,rtag,mtag, keys,rootkey,data)
    level-=1
    mtag=elem.tag
    mtag=mtag[mtag.rfind('}')+1:]
    tags.remove(mtag)
    return data

data = rij(root,0,[],'','', [],[],[])

Python ElementTree

Having trouble with XML config files using ElementTree. I want to have an easy way to find the text of an element regardless of where it is in the XML Tree. From what the documentation says, I should be able to do this with findtext(), but no matter what, I get a return of None. Where am I going wrong here? Everyone was telling me XML is so simple to handle in Python, yet I have had nothing but troubles.
import os
import xml.etree.ElementTree as ET

configFileName = 'file.xml'

def configSet (x):
    if os.path.exists(configFileName):
        tree = ET.parse(configFileName)
        root = tree.getroot()
        return root.findtext(x)

hiTemp = configSet('hiTemp')
print(hiTemp)
and the XML
<configData>
<units>
<temp>F</temp>
</units>
<pins>
<lights>1</lights>
<fan>2</fan>
<co2>3</co2>
</pins>
<events>
<airTemps>
<hiTemp>80</hiTemp>
<lowTemp>72</lowTemp>
<hiTempAlarm>84</hiTempAlarm>
</airTemps>
<CO2>
<co2Hi>1500</co2Hi>
<co2Low>1400</co2Low>
<co2Alarm>600</co2Alarm>
</CO2>
</events>
<settings>
<apikeys>
<prowl>
<apikey>None</apikey>
</prowl>
</apikeys>
</settings>
</configData>
expected result
80
actual result
None

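findtext('hiTemp') only searches the direct children of the element it is called on. hiTemp sits two levels below the root (events/airTemps/hiTemp), so root.findtext('hiTemp') finds nothing and returns None.
You can either provide a path that reaches the element or modify your code to try every element in the tree:
import os
import xml.etree.ElementTree as ET

def configSet(x):
    if os.path.exists(configFileName):
        tree = ET.parse(configFileName)
        root = tree.getroot()
        # iter() replaces getiterator(), which was removed in Python 3.9
        for e in root.iter():
            t = e.findtext(x)
            if t is not None:
                return t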
Update 1:
If you want to have all matched text as a list, the code is a bit different.
def configSet(x):
    matches = []
    if os.path.exists(configFileName):
        tree = ET.parse(configFileName)
        root = tree.getroot()
        # iter() replaces getiterator(), which was removed in Python 3.9
        for e in root.iter():
            t = e.findtext(x)
            if t is not None:
                matches.append(t)
    return matches

You can use xpath to get to your desired element.
return root.find('./events/airTemps/hiTemp').text
There's easy to follow documentation here.
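If the element should be found wherever it sits in the tree, the .// prefix in ElementTree's path syntax searches all descendants. A minimal sketch, using an inline, shortened copy of the config so it runs on its own; config_get is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

CONFIG = """<configData>
  <events>
    <airTemps>
      <hiTemp>80</hiTemp>
      <lowTemp>72</lowTemp>
    </airTemps>
  </events>
</configData>"""

def config_get(root, tag, default=None):
    """Return the text of the first element named `tag` at any depth below root."""
    return root.findtext(".//" + tag, default)

root = ET.fromstring(CONFIG)
direct_only = root.findtext("hiTemp")   # a bare tag name only looks at direct children
any_depth = config_get(root, "hiTemp")  # ".//hiTemp" looks at every descendant
```

Note that this returns the first match in document order, so tag names that occur in several sections still need a fuller path.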

How to copy multiple XML nodes to another file in Python

Bear in mind I am very new to Python. I'm trying to copy a few XML nodes from sample1.xml to out.xml if they don't exist in sample2.xml.
This is how far I got before getting stuck:
import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='sample1.xml')
addtree = ET.ElementTree(file='sample2.xml')
root = tree.getroot()
addroot = addtree.getroot()
for adel in addroot.findall('.//cars/car'):
    for el in root.findall('cars/car'):
        with open('out.xml', 'w+') as f:
            f.write("BEFORE\n")
            f.write(el.tag)
            f.write("\n")
            f.write(adel.tag)
            f.write("\n")
            f.write("\n")
            f.write("AFTER\n")
            el = adel
            f.write(el.tag)
            f.write("\n")
            f.write(adel.tag)
I have no idea what I'm missing, but it's only copying the actual "tag" itself.
outputs this:
BEFORE
car
car
AFTER
car
car
So I'm missing the children nodes, and also the <, >, </, > tags. Expected result is below.
sample1.xml:
<cars>
<car>
<use-car>0</use-car>
<use-gas>0</use-gas>
<car-name />
<car-key />
<car-location>hawaii</car-location>
<car-port>5</car-port>
</car>
</cars>
sample2.xml:
<cars>
<old>
1
</old>
<new>
8
</new>
<car />
</cars>
expected result in out.xml (final product)
<cars>
<old>
1
</old>
<new>
8
</new>
<car>
<use-car>0</use-car>
<use-gas>0</use-gas>
<car-name />
<car-key />
<car-location>hawaii</car-location>
<car-port>5</car-port>
</car>
</cars>
All the other nodes old and new must remain untouched. I'm just trying to replace <car /> with all its children and grandchildren (if existed) nodes.

First, a couple of trivial issues with your XML:
sample1: The closing cars tag is missing a /
sample2: The closing new tag incorrectly reads old, should read new
Second, a disclaimer: my solution below has its limitations - in particular, it wouldn't handle repeatedly substituting the car node from sample1 into multiple spots in sample2. But it works fine for the sample files you've supplied.
Third: thanks to the top couple of answers on access ElementTree node parent node - they informed the implementation of get_node_parent_info below.
Finally, the code:
import xml.etree.ElementTree as ET

def find_child(node, with_name):
    """Recursively find node with given name"""
    for element in list(node):
        if element.tag == with_name:
            return element
        elif list(element):
            sub_result = find_child(element, with_name)
            if sub_result is not None:
                return sub_result
    return None

def replace_node(from_tree, to_tree, node_name):
    """
    Replace node with given node_name in to_tree with
    the same-named node from the from_tree
    """
    # Find nodes of given name ('car' in the example) in each tree
    from_node = find_child(from_tree.getroot(), node_name)
    to_node = find_child(to_tree.getroot(), node_name)
    # Find where to substitute the from_node into the to_tree
    to_parent, to_index = get_node_parent_info(to_tree, to_node)
    # Replace to_node with from_node
    to_parent.remove(to_node)
    to_parent.insert(to_index, from_node)

def get_node_parent_info(tree, node):
    """
    Return tuple of (parent, index) where:
    parent = node's parent within tree
    index = index of node under parent
    """
    parent_map = {c:p for p in tree.iter() for c in p}
    parent = parent_map[node]
    return parent, list(parent).index(node)

from_tree = ET.ElementTree(file='sample1.xml')
to_tree = ET.ElementTree(file='sample2.xml')
replace_node(from_tree, to_tree, 'car')
# ET.dump(to_tree)
to_tree.write('output.xml')
UPDATE: It was recently brought to my attention that the implementation of find_child() in the solution I originally supplied would fail if the "child" in question was not in the first branch of the XML tree that was traversed. I've updated the implementation above to rectify this.
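For comparison, a more compact variant of the same idea, sketched on in-memory strings so it can be run without the sample files. It assigns the replacement by index on the parent and copies the source node so the source tree stays intact. replace_first is a hypothetical name, and like the code above it only handles the first match:

```python
import copy
import xml.etree.ElementTree as ET

def replace_first(to_root, from_root, tag):
    """Replace the first <tag> found below to_root with a copy of the one from from_root."""
    replacement = from_root.find(".//" + tag)
    if replacement is None:
        return False
    for parent in to_root.iter():
        for index, child in enumerate(list(parent)):
            if child.tag == tag:
                # Item assignment swaps the child in place, keeping its position.
                parent[index] = copy.deepcopy(replacement)
                return True
    return False

from_root = ET.fromstring(
    "<cars><car><car-location>hawaii</car-location><car-port>5</car-port></car></cars>")
to_root = ET.fromstring("<cars><old>1</old><new>8</new><car /></cars>")
replaced = replace_first(to_root, from_root, "car")
```

After the call, ET.ElementTree(to_root).write('out.xml') would write the merged document.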

Reading Maven Pom xml in Python

I have a pom file that has the following defined:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.welsh</groupId>
<artifactId>my-site</artifactId>
<version>1.0.0</version>
<packaging>pom</packaging>
<profiles>
<profile>
<build>
<plugins>
<plugin>
<groupId>org.welsh.utils</groupId>
<artifactId>site-tool</artifactId>
<version>1.0</version>
<executions>
<execution>
<configuration>
<mappings>
<property>
<name>homepage</name>
<value>/content/homepage</value>
</property>
<property>
<name>assets</name>
<value>/content/assets</value>
</property>
</mappings>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>
</profiles>
</project>
And I am looking to build a dictionary off the name & value elements under property under the mappings element.
So I'm trying to figure out how to get all possible mappings elements (in case of multiple build profiles) so that I can get all the property elements under them. From reading about the supported XPath syntax, the following should print out the text of every name and value element:
import logging
import xml.etree.ElementTree as xml

pomFile = xml.parse('pom.xml')
root = pomFile.getroot()
for mapping in root.findall('*/mappings'):
    for prop in mapping.findall('.//property'):
        logging.info(prop.find('name').text + " => " + prop.find('value').text)
Which is returning nothing. I tried just printing out all the mappings elements and get:
>>> print root.findall('*/mappings')
[]
And when I print out the everything from root I get:
>>> print root.findall('*')
[<Element '{http://maven.apache.org/POM/4.0.0}modelVersion' at 0x10b38bd50>, <Element '{http://maven.apache.org/POM/4.0.0}groupId' at 0x10b38bd90>, <Element '{http://maven.apache.org/POM/4.0.0}artifactId' at 0x10b38bf10>, <Element '{http://maven.apache.org/POM/4.0.0}version' at 0x10b3900d0>, <Element '{http://maven.apache.org/POM/4.0.0}packaging' at 0x10b390110>, <Element '{http://maven.apache.org/POM/4.0.0}name' at 0x10b390150>, <Element '{http://maven.apache.org/POM/4.0.0}properties' at 0x10b390190>, <Element '{http://maven.apache.org/POM/4.0.0}build' at 0x10b390310>, <Element '{http://maven.apache.org/POM/4.0.0}profiles' at 0x10b390390>]
Which made me try to print:
>>> print root.findall('*/{http://maven.apache.org/POM/4.0.0}mappings')
[]
But that's not working either.
Any suggestions would be great.
Thanks,

The main issues with the code in the question are
- that it doesn't specify namespaces, and
- that it uses */, which descends exactly one level, instead of //, which matches at any depth.
As you can see at the top of the XML file, Maven uses the namespace http://maven.apache.org/POM/4.0.0. The attribute xmlns in the root node defines the default namespace. The attribute xmlns:xsi defines a namespace that is only used for xsi:schemaLocation.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
To specify tags like profile in methods like find, you have to specify the namespace as well. For example, you could write the following to find all profile-tags.
import xml.etree.ElementTree as xml

pom = xml.parse('pom.xml')
for profile in pom.findall('.//{http://maven.apache.org/POM/4.0.0}profile'):
    print(repr(profile))
Also note that I'm using // (written as .// here, because a path that starts with / triggers a FutureWarning on an ElementTree and is rejected on an Element). Using */ would have the same result for your specific XML file above. However, it would not work for other tags like mappings. Since * represents only one level, */tag can be expanded to parent/tag or xyz/tag but not to xyz/parent/tag.
Now, you should be able to come up with something like this to find all mappings:
pom = xml.parse('pom.xml')
map = {}
for mapping in pom.findall('.//{http://maven.apache.org/POM/4.0.0}mappings'
                           '/{http://maven.apache.org/POM/4.0.0}property'):
    name = mapping.find('{http://maven.apache.org/POM/4.0.0}name').text
    value = mapping.find('{http://maven.apache.org/POM/4.0.0}value').text
    map[name] = value
Specifying the namespaces like this is quite verbose. To make it easier to read, you can define a namespace map and pass it as second argument to find and findall:
# ...
nsmap = {'m': 'http://maven.apache.org/POM/4.0.0'}
for mapping in pom.findall('.//m:mappings/m:property', nsmap):
    name = mapping.find('m:name', nsmap).text
    value = mapping.find('m:value', nsmap).text
    map[name] = value
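Put together as a self-contained sketch: the inline POM below is a shortened stand-in for the file in the question, and read_mappings is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <profiles><profile><build><plugins><plugin><executions><execution>
    <configuration>
      <mappings>
        <property><name>homepage</name><value>/content/homepage</value></property>
        <property><name>assets</name><value>/content/assets</value></property>
      </mappings>
    </configuration>
  </execution></executions></plugin></plugins></build></profile></profiles>
</project>"""

# Prefix-to-URI map; the prefix "m" is arbitrary and only used in the path expressions.
NS = {"m": "http://maven.apache.org/POM/4.0.0"}

def read_mappings(root):
    """Collect name -> value from every mappings/property element at any depth."""
    result = {}
    for prop in root.findall(".//m:mappings/m:property", NS):
        name = prop.findtext("m:name", namespaces=NS)
        value = prop.findtext("m:value", namespaces=NS)
        result[name] = value
    return result

mappings = read_mappings(ET.fromstring(POM))
```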

Ok, found out that when I remove the Maven stuff from the project element so it's just <project>, I can do this:
for mapping in root.findall('*//mappings'):
    logging.info(mapping)
    for prop in mapping.findall('./property'):
        logging.info(prop.find('name').text + " => " + prop.find('value').text)
Which would result in:
INFO:root:<Element 'mappings' at 0x10d72d350>
INFO:root:homepage => /content/homepage
INFO:root:assets => /content/assets
However, if I leave the Maven stuff in at the top I can do this:
for mapping in root.findall('*//{http://maven.apache.org/POM/4.0.0}mappings'):
    logging.info(mapping)
    for prop in mapping.findall('./{http://maven.apache.org/POM/4.0.0}property'):
        logging.info(prop.find('{http://maven.apache.org/POM/4.0.0}name').text + " => " + prop.find('{http://maven.apache.org/POM/4.0.0}value').text)
Which results in:
INFO:root:<Element '{http://maven.apache.org/POM/4.0.0}mappings' at 0x10aa7f310>
INFO:root:homepage => /content/homepage
INFO:root:assets => /content/assets
However, I'd love to be able to figure out how to avoid having to account for the maven stuff since it locks me into this one format.
EDIT:
Ok, I managed to get something a bit more verbose:
import xml.etree.ElementTree as xml

def getMappingsNode(node, nodeName):
    # Depth-first search; keep looking in later siblings when a branch has no match.
    for n in node.findall('*'):
        if nodeName in n.tag:
            return n
        found = getMappingsNode(n, nodeName)
        if found is not None:
            return found
    return None

def getMappings(rootNode):
    mappingsNode = getMappingsNode(rootNode, 'mappings')
    mapping = {}
    for prop in mappingsNode.findall('*'):
        key = ''
        val = ''
        for child in prop.findall('*'):
            if 'name' in child.tag:
                key = child.text
            if 'value' in child.tag:
                val = child.text
        if val and key:
            mapping[key] = val
    return mapping

pomFile = xml.parse('pom.xml')
root = pomFile.getroot()
mappings = getMappings(root)
print(mappings)
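On Python 3.8 or later, ElementTree's path syntax also accepts {*} as a namespace wildcard, which avoids both the hard-coded Maven URI and the substring checks on tag names. A minimal sketch under that version assumption; read_mappings_any_ns and the two shortened documents are made up for this illustration:

```python
import xml.etree.ElementTree as ET

WITH_NS = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <build><mappings>
    <property><name>homepage</name><value>/content/homepage</value></property>
  </mappings></build>
</project>"""

WITHOUT_NS = """<project>
  <build><mappings>
    <property><name>assets</name><value>/content/assets</value></property>
  </mappings></build>
</project>"""

def read_mappings_any_ns(root):
    """Collect name -> value pairs; {*} matches the tag in any namespace or in none."""
    result = {}
    for prop in root.findall(".//{*}mappings/{*}property"):
        name = prop.findtext("{*}name")
        value = prop.findtext("{*}value")
        if name and value:
            result[name] = value
    return result

with_ns = read_mappings_any_ns(ET.fromstring(WITH_NS))
without_ns = read_mappings_any_ns(ET.fromstring(WITHOUT_NS))
```

The same function handles both documents, so the code is no longer tied to one POM format.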
