Read an xml element using python - python

I have an xml file. I want to search for a specific word in the file, and if i find it- i want to copy all of the xml element the word was in it.
for example:
<Actions>
<ActionGroup enabled="yes" name="viewsGroup" isExclusive="yes"/>
<ExtAction iconSet="" toolTip="" name="f5-script" text="f5-script"/>
</Actions>
I am looking for the word :"ExtAction", and since it is inside the Actions element I want to copy all of it. How can I do it?

I usually use ElementTree for this kind of job, as it seems the most intuitive to me. I believe this is part of the standard library, so no need to install anything
As a more general approach, the entire .xml file can be parsed as a dictionary of dictionaries, which you can then index accordingly if you so desired. This can be done like this (I just made a copy of your .xml file locally and called it "test.xml" for demonstration purposes. Of course, change this to correspond to your file if you choose this solution):
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
tags = [child.tag for child in root]
file_contents = {}
for tag in tags:
for p in tree.iter(tag=tag):
file_contents[tag] = dict(p.items())
If you print the file contents you will get:
"{'ActionGroup': {'enabled': 'yes', 'name': 'viewsGroup', 'isExclusive': 'yes'}, 'ExtAction': {'iconSet': '', 'toolTip': '', 'name': 'f5-script', 'text': 'f5-script'}}"
From this it is trivial to index out the bits of information you need. For example, if you want to get the name value from the ExtAction tag, you would just do:
print(file_contents['ExtAction']['name']) # or save this as a variable if you need it
Hope this helps!

Related

Python xml.etree.ElementTree 'findall()' method does not work with several namespaces

I am trying to parse an XML file with several namespaces. I already have a function which produces namespace map – a dictionary with namespace prefixes and namespace identifiers (example in the code). However, when I pass this dictionary to the findall() method, it works only with the first namespace but does not return anything if an element on XML path is in another namespace.
(It works only in case of the first namespace which has None as its prefix.)
Here is a code sample:
import xml.etree.ElementTree as ET
file - '.\folder\example_file.xml' # path to the file
xml_path = './DataArea/Order/Item/Price' # XML path to the element node
tree = ET.parse(file)
root = tree.getroot()
nsmap = dict([node for _, node in ET.iterparse(exp_file, events=['start-ns'])])
# This produces a dictionary with namespace prefixes and identifiers, e.g.
# {'': 'http://firstnamespace.example.com/', 'foo': 'http://secondnamespace.example.com/', etc.}
for elem in root.findall(xml_path, nsmap):
# Do something
EDIT:
On the mzjn's suggestion, I'm including sample XML file:
<?xml version="1.0" encoding="utf-8"?>
<SampleOrder xmlns="http://firstnamespace.example.com/" xmlns:foo="http://secondnamespace.example.com/" xmlns:bar="http://thirdnamespace.example.com/" xmlns:sta="http://fourthnamespace.example.com/" languageCode="en-US" releaseID="1.0" systemEnvironmentCode="PROD" versionID="1.0">
<ApplicationArea>
<Sender>
<SenderCode>4457</SenderCode>
</Sender>
</ApplicationArea>
<DataArea>
<Order>
<foo:Item>
<foo:Price>
<foo:AmountPerUnit currencyID="USD">58000.000000</foo:AmountPerUnit>
<foo:TotalAmount currencyID="USD">58000.000000</foo:TotalAmount>
</foo:Price>
<foo:Description>
<foo:ItemCode>259601</foo:ItemCode>
<foo:ItemName>PORTAL GUN 6UBC BLUE</foo:ItemName>
</foo:Description>
</foo:Item>
<bar:Supplier>
<bar:SupplierID>4474</bar:SupplierID>
<bar:SupplierName>APERTURE SCIENCE, INC</bar:SupplierName>
</bar:Supplier>
<sta:DeliveryLocation>
<sta:RecipientID>103</sta:RecipientID>
<sta:RecipientName>WARHOUSE 664</sta:RecipientName>
</sta:DeliveryLocation>
</Order>
</DataArea>
</SampleOrder>
You should specify the namespaces in your xml_path, for example: ./foo:DataArea/Order/Item/bar:Price. The reason it works with the empty namespace is because it is the default, you don't have to specify that one in your path.
Based on Jan Jaap Meijerink's answer and mzjn's comments under the question, the solution is to insert namespace prefixed in the XML path. This can be done by inserting a wildcard {*} as mzjn's comment and this answer (https://stackoverflow.com/a/62117710/407651) suggest.
To document the solution, you can add this simple operation to your code:
xml_path = './DataArea/Order/Item/Price/TotalAmount'
xml_path_splitted_to_list = xml_path.split('/')
xml_path_with_wildcard_prefix = '/{*}'.join(xml_path_splitted_to_list)
In case there are two or more nodes with the same XML path but different namespaces, findall() method (quite naturally) accesses all of those element nodes.

Removing sub-tags from XML and create new XML files

I have an XML input file which I need to split into multiple files based on MAPPING and WORKFLOW tags.
Since I have two MAPPING tags in my input XML and one WORKFLOW tag, I need to generate three files:
m_demo_trans_agg.XML
m_demo_trans_exp.XML
wf_m_demo_trans_agg_exp.XML
So, my mapping file (starting with m_) will have tags SOURCE, TARGET, and MAPPINGS. The workflow file will have tags WORKFLOW and CONFIG.
Please let me know how can I create mapping XML.
I started with workflow XML creation.
My code looks like:
import xml.etree.ElementTree
tree = ET.parse('input.xml')
root = tree.getroot()
target_node_first_parent = 'FOLDER'
target_nodes = ['SOURCE', 'TARGET', 'MAPPING']
for node in root.iter(target_node_first_parent):
for subnode in node.iter():
if subnode.tag in ['SOURCE', 'TARGET', 'MAPPING']:
print(subnode.tag)
node.remove(subnode)
out_tree = ET.ElementTree(root)
out_tree.write('output.xml')
I am getting the TARGET tags in my output.xml.
I am open to using any libraries apart from xml.etree.ElementTree.
Please assist.
Thanks

Python combining html and xml files into a single dictionary

I'm trying to combine html files (text) and xml files (metadata) into a single flat dictionary which I'll write as a Json. The files are contained in the same folder and have following name structure:
abcde.html
abcde.html.xml
Here's my simplified code, my issue is that I had to separate the xml meta-data writing into
### Create a list of dict with one dict per file, first write file content then meta-data
for path, subdirs, files in os.walk("."):
for fname in files:
docname, extension = os.path.splitext(fname)
filename = os.path.join(path,fname)
file_dict = {}
if extension == ".html":
file_dict['type'] = 'circulaire'
file_dict['filename'] = fname
html_dict = parse_html_to_dict(filename)
file_dict.update(html_dict)
list_of_dict.append(file_dict)
#elif extension == ".xml":
# if not any(d['filename'] == docname for d in list_of_dict):
# print("Well Well Well, there's no html file in the list yet !")
# continue
# else:
# index = next((i for i, element in enumerate(list_of_dict) if element['filename'] == docname), None)
# metadata_dict = extract_metadata_xml(filename)
# list_of_dict[index].update(metadata_dict)
else: continue
json.dump(list_of_dict, outfile, indent=3)
outfile.close()
############# Extract Metadata from XML FILES #############
import xmltodict
def extract_metadata_xml(filename):
""" Returns xml file to dict """
with open(filename, encoding='utf-8', errors='ignore') as xml_file:
temp_dict = xmltodict.parse(xml_file.read())
metadata_dict = temp_dict.get('doc', {}).get('fields', {})
xml_file.close()
return metadata_dict
Normally, I would add an elif condition (now commented) below the if loop for html files, which checks for xml and updates the corresponding dictionary (bool condition that filenames are identical) with the metadata, thus sequentially.
But, unfortunately it seems that for most files, the list of dict isn't fully up to date, or at least I can't find a match for 40% of my filenames.
The work-around I use seems a little silly to me, I wrote a second loop with os.walk after the first one which is used exclusively for html files, my second loop then checks for xml extensions and appends the list_of_dict, which is fully up to date and I get 100% of my html filenames matched with xml metadata.
Can I introduce some forced timing to make sure my html is done writing before I start to match any xml filename, is it possible that both if/elif loops are executed in parallel for different files?
Or else what is in terms of processing the best way to have all my html files processed before my xml files (just ordering my list of files by type before proceeding with my if/elif loops?)
I'm quite new to this forum, so please let me know if I can improve my question writing style, be mindful though that I'm trying my best ;).
Thanks for your help!
The way it is now you check all files and handle them differently, depending on whether they are html or xml, hoping that the corresponding html for each xml you encounter has already been processed – but that's not guaranteed, which I suspect to be the cause of the issues.
Instead, you should only look for html files and retrieve the corresponding xml right away:
if extension == ".html":
file_dict['type'] = 'circulaire'
file_dict['filename'] = fname
html_dict = parse_html_to_dict(filename)
file_dict.update(html_dict)
# process the corresponding xml file
# this file should always exist, according to your description of the file structure
xml_filename = os.path.join(path, fname+'.xml')
# extract the meta data
metadata_dict = extract_metadata_xml(xml_filename)
# put it in the file dict we still have
file_dict.update(metadata_dict)
# and finally store it
list_of_dict.append(file_dict)
Alternatively, though less efficiently, you could also iterate over the sorted file list - for fname in sorted(files): - and proceed as you did, as this would also result in the html files preceding the corresponding xml files.

Get path of xml element while dynamically creating the xml file in python

I am using xml.etree in python to generate an xml document, now while I am generating the xml document I want to get the xpath of a element that I have created, is there anyway of getting the absolute path while still creating the Element?
Below is an example of the generation process:
ParameterDeclaration = Element("ParameterDeclarations")
Parameter = Element("ParameterDeclaration")
Parameter.set("name", "Speed")
Parameter.set("value", action.speed)
Parameter.set("parameterType", "double")
ParameterDeclaration.append(Parameter)
Can I get the path of the Parameter element right after that last line?
Please note that this is only a small section of the xml file, the ParameterDeclaration is being appended to another element.

Python Manipulate and save XML without third-party libraries

I have a xml in which I have to search for a tag and replace the value of tag with a new values. For example,
<tag-Name>oldName</tag-Name>
and replace oldName to newName like
<tag-Name>newName</tag-Name>
and save the xml. How can I do that without using thirdparty lib like BeautifulSoup
Thank you
The best option from the standard lib is (I think) the xml.etree package.
Assuming that your example tag occurs only once somewhere in the document:
import xml.etree.ElementTree as etree
# or for a faster C implementation
# import xml.etree.cElementTree as etree
tree = etree.parse('input.xml')
elem = tree.find('//tag-Name') # finds the first occurrence of element tag-Name
elem.text = 'newName'
tree.write('output.xml')
Or if there are multiple occurrences of tag-Name, and you want to change them all if they have "oldName" as content:
import xml.etree.cElementTree as etree
tree = etree.parse('input.xml')
for elem in tree.findall('//tag-Name'):
if elem.text == 'oldName':
elem.text = 'newName'
# some output options for example
tree.write('output.xml', encoding='utf-8', xml_declaration=True)
Python has 'builtin' libraries for working with xml. For this simple task I'd look into minidom. You can find the docs here:
http://docs.python.org/library/xml.dom.minidom.html
If you are certain, I mean, completely 100% positive that the string <tag-Name> will never appear inside that tag and the XML will always be formatted like that, you can always use good old string manipulation tricks like:
xmlstring = xmlstring.replace('<tag-Name>oldName</tag-Name>', '<tag-Name>newName</tag-Name>')
If the XML isn't always as conveniently formatted as <tag>value</tag>, you could write something like:
a = """<tag-Name>
oldName
</tag-Name>"""
def replacetagvalue(xmlstring, tag, oldvalue, newvalue):
start = 0
while True:
try:
start = xmlstring.index('<%s>' % tag, start) + 2 + len(tag)
except ValueError:
break
end = xmlstring.index('</%s>' % tag, start)
value = xmlstring[start:end].strip()
if value == oldvalue:
xmlstring = xmlstring[:start] + newvalue + xmlstring[end:]
return xmlstring
print replacetagvalue(a, 'tag-Name', 'oldName', 'newName')
Others have already mentioned xml.dom.minidom which is probably a better idea if you're not absolutely certain beyond any doubt that your XML will be so simple. If you can guarantee that, though, remember that XML is just a big chunk of text that you can manipulate as needed.
I use similar tricks in some production code where simple checks like if "somevalue" in htmlpage are much faster and more understandable than invoking BeautifulSoup, lxml, or other *ML libraries.

Categories

Resources