Python Manipulate and save XML without third-party libraries - python

I have a xml in which I have to search for a tag and replace the value of tag with a new values. For example,
<tag-Name>oldName</tag-Name>
and replace oldName to newName like
<tag-Name>newName</tag-Name>
and save the xml. How can I do that without using thirdparty lib like BeautifulSoup
Thank you

The best option from the standard lib is (I think) the xml.etree package.
Assuming that your example tag occurs only once somewhere in the document:
import xml.etree.ElementTree as etree
# or for a faster C implementation
# import xml.etree.cElementTree as etree
tree = etree.parse('input.xml')
elem = tree.find('//tag-Name') # finds the first occurrence of element tag-Name
elem.text = 'newName'
tree.write('output.xml')
Or if there are multiple occurrences of tag-Name, and you want to change them all if they have "oldName" as content:
import xml.etree.cElementTree as etree
tree = etree.parse('input.xml')
for elem in tree.findall('//tag-Name'):
if elem.text == 'oldName':
elem.text = 'newName'
# some output options for example
tree.write('output.xml', encoding='utf-8', xml_declaration=True)

Python has 'builtin' libraries for working with xml. For this simple task I'd look into minidom. You can find the docs here:
http://docs.python.org/library/xml.dom.minidom.html

If you are certain, I mean, completely 100% positive that the string <tag-Name> will never appear inside that tag and the XML will always be formatted like that, you can always use good old string manipulation tricks like:
xmlstring = xmlstring.replace('<tag-Name>oldName</tag-Name>', '<tag-Name>newName</tag-Name>')
If the XML isn't always as conveniently formatted as <tag>value</tag>, you could write something like:
a = """<tag-Name>
oldName
</tag-Name>"""
def replacetagvalue(xmlstring, tag, oldvalue, newvalue):
start = 0
while True:
try:
start = xmlstring.index('<%s>' % tag, start) + 2 + len(tag)
except ValueError:
break
end = xmlstring.index('</%s>' % tag, start)
value = xmlstring[start:end].strip()
if value == oldvalue:
xmlstring = xmlstring[:start] + newvalue + xmlstring[end:]
return xmlstring
print replacetagvalue(a, 'tag-Name', 'oldName', 'newName')
Others have already mentioned xml.dom.minidom which is probably a better idea if you're not absolutely certain beyond any doubt that your XML will be so simple. If you can guarantee that, though, remember that XML is just a big chunk of text that you can manipulate as needed.
I use similar tricks in some production code where simple checks like if "somevalue" in htmlpage are much faster and more understandable than invoking BeautifulSoup, lxml, or other *ML libraries.

Related

Python xml.etree.ElementTree 'findall()' method does not work with several namespaces

I am trying to parse an XML file with several namespaces. I already have a function which produces namespace map – a dictionary with namespace prefixes and namespace identifiers (example in the code). However, when I pass this dictionary to the findall() method, it works only with the first namespace but does not return anything if an element on XML path is in another namespace.
(It works only in case of the first namespace which has None as its prefix.)
Here is a code sample:
import xml.etree.ElementTree as ET
file - '.\folder\example_file.xml' # path to the file
xml_path = './DataArea/Order/Item/Price' # XML path to the element node
tree = ET.parse(file)
root = tree.getroot()
nsmap = dict([node for _, node in ET.iterparse(exp_file, events=['start-ns'])])
# This produces a dictionary with namespace prefixes and identifiers, e.g.
# {'': 'http://firstnamespace.example.com/', 'foo': 'http://secondnamespace.example.com/', etc.}
for elem in root.findall(xml_path, nsmap):
# Do something
EDIT:
On the mzjn's suggestion, I'm including sample XML file:
<?xml version="1.0" encoding="utf-8"?>
<SampleOrder xmlns="http://firstnamespace.example.com/" xmlns:foo="http://secondnamespace.example.com/" xmlns:bar="http://thirdnamespace.example.com/" xmlns:sta="http://fourthnamespace.example.com/" languageCode="en-US" releaseID="1.0" systemEnvironmentCode="PROD" versionID="1.0">
<ApplicationArea>
<Sender>
<SenderCode>4457</SenderCode>
</Sender>
</ApplicationArea>
<DataArea>
<Order>
<foo:Item>
<foo:Price>
<foo:AmountPerUnit currencyID="USD">58000.000000</foo:AmountPerUnit>
<foo:TotalAmount currencyID="USD">58000.000000</foo:TotalAmount>
</foo:Price>
<foo:Description>
<foo:ItemCode>259601</foo:ItemCode>
<foo:ItemName>PORTAL GUN 6UBC BLUE</foo:ItemName>
</foo:Description>
</foo:Item>
<bar:Supplier>
<bar:SupplierID>4474</bar:SupplierID>
<bar:SupplierName>APERTURE SCIENCE, INC</bar:SupplierName>
</bar:Supplier>
<sta:DeliveryLocation>
<sta:RecipientID>103</sta:RecipientID>
<sta:RecipientName>WARHOUSE 664</sta:RecipientName>
</sta:DeliveryLocation>
</Order>
</DataArea>
</SampleOrder>
You should specify the namespaces in your xml_path, for example: ./foo:DataArea/Order/Item/bar:Price. The reason it works with the empty namespace is because it is the default, you don't have to specify that one in your path.
Based on Jan Jaap Meijerink's answer and mzjn's comments under the question, the solution is to insert namespace prefixed in the XML path. This can be done by inserting a wildcard {*} as mzjn's comment and this answer (https://stackoverflow.com/a/62117710/407651) suggest.
To document the solution, you can add this simple operation to your code:
xml_path = './DataArea/Order/Item/Price/TotalAmount'
xml_path_splitted_to_list = xml_path.split('/')
xml_path_with_wildcard_prefix = '/{*}'.join(xml_path_splitted_to_list)
In case there are two or more nodes with the same XML path but different namespaces, findall() method (quite naturally) accesses all of those element nodes.

Grab specific text from XML

Hello :) This is my first python program but it doesn't work.
What I want to do :
import a XML file and grab only Example.swf from
<page id="Example">
<info>
<title>page 1</title>
</info>
<vector_file>Example.swf</vector_file>
</page>
(the text inside <vector_file>)
than download the associated file on a website (https://website.com/.../.../Example.swf)
than rename it 1.swf (or page 1.swf)
and loop until I reach the last file, at the end of the page (Exampleaa_idontknow.swf → 231.swf)
convert all the files in pdf
What i have done (but useless, because of AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'xpath'):
import re
import urllib.request
import requests
import time
import requests
import lxml
import lxml.html
import os
from xml.etree import ElementTree as ET
DIR="C:/Users/mypath.../"
for filename in os.listdir(DIR):
if filename.endswith(".xml"):
with open(file=DIR+".xml",mode='r',encoding='utf-8') as file:
_tree = ET.fromstring(text=file.read())
_all_metadata_tags = _tree.xpath('.//vector_file')
for i in _all_metadata_tags:
print(i.text + '\n')
else:
print("skipping for filename")
First of all, you need to make up your mind about what module you're going to use. lxml or xml? Import only one of them. lxml has more features, but it's an external dependency. xml is more basic, but it is built-in. Both modules share a lot of their API, so they are easy to confuse. Check that you're looking at the correct documentation.
For what you want to do, the built-in module is good enough. However, the .xpath() method is not supported there, the method you are looking for here is called .findall().
Then you need to remember to never parse XML files by opening them as plain text files, reading them into into string, and parsing that string. Not only is this wasteful, it's fundamentally the wrong thing to do. XML parsers have built-in automatic encoding detection. This mechanism makes sure you never have to worry about file encodings, but you have to use it, too.
It's not only better, but less code to write: Use ET.parse() and pass a filename.
import os
from xml.etree import ElementTree as ET
DIR = r'C:\Users\mypath'
for filename in os.listdir(DIR):
if not filename.lower().endswith(".xml"):
print("skipping for filename")
continue
fullname = os.path.join(DIR, filename)
tree = ET.parse(fullname)
for vector_file in tree.findall('.//vector_file'):
print(vector_file.text + '\n')
If you only expect a single <vector_file> element per file, or if you only care for the first such element, use .find() instead of .findall():
vector_file = tree.find('.//vector_file')
if vector_file is None:
print('Nothing found')
else:
print(vector_file.text + '\n')

Read an xml element using python

I have an xml file. I want to search for a specific word in the file, and if i find it- i want to copy all of the xml element the word was in it.
for example:
<Actions>
<ActionGroup enabled="yes" name="viewsGroup" isExclusive="yes"/>
<ExtAction iconSet="" toolTip="" name="f5-script" text="f5-script"/>
</Actions>
I am looking for the word :"ExtAction", and since it is inside the Actions element I want to copy all of it. How can I do it?
I usually use ElementTree for this kind of job, as it seems the most intuitive to me. I believe this is part of the standard library, so no need to install anything
As a more general approach, the entire .xml file can be parsed as a dictionary of dictionaries, which you can then index accordingly if you so desired. This can be done like this (I just made a copy of your .xml file locally and called it "test.xml" for demonstration purposes. Of course, change this to correspond to your file if you choose this solution):
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
tags = [child.tag for child in root]
file_contents = {}
for tag in tags:
for p in tree.iter(tag=tag):
file_contents[tag] = dict(p.items())
If you print the file contents you will get:
"{'ActionGroup': {'enabled': 'yes', 'name': 'viewsGroup', 'isExclusive': 'yes'}, 'ExtAction': {'iconSet': '', 'toolTip': '', 'name': 'f5-script', 'text': 'f5-script'}}"
From this it is trivial to index out the bits of information you need. For example, if you want to get the name value from the ExtAction tag, you would just do:
print(file_contents['ExtAction']['name']) # or save this as a variable if you need it
Hope this helps!

Using a python program to rename all XML files within a linux directory

Currently, my code uses the name of an XML file as a parameter in order to take that file, parse some of its content and use it to rename said file, what I mean to do is actually run my program once and that program will search for every XML file (even if its inside a zip) inside the directory and rename it using the same parameters which is what I am having problems with.
#encoding:utf-8
import os, re
from sys import argv
script, nombre_de_archivo = argv
regexFecha = r'\d{4}-\d{2}-\d{2}'
regexLocalidad = r'localidad=\"[\w\s.,-_]*\"'
regexNombre = r'nombre=\"[\w\s.,-_]*\"'
regexTotal = r'total=\"\d+.?\d+\"'
fechas = []; localidades = []; nombres = []; totales = []
archivo = open(nombre_de_archivo)
for linea in archivo.readlines():
fechas.append(re.findall(regexFecha, linea))
localidades.append(re.findall(regexLocalidad, linea))
nombres.append(re.findall(regexNombre, linea))
totales.append(re.findall(regexTotal, linea))
fecha = str(fechas[1][0])
localidad = str(localidades[1][0]).strip('localidad=\"')
nombre = str(nombres[1][0]).strip('nombre=\"')
total = str(totales[1][0]).strip('total=\"')
nombre_nuevo_archivo = fecha+"_"+localidad+"_"+nombre+"_"+total+".xml"
os.rename(nombre_de_archivo, nombre_nuevo_archivo)
EDIT: an example of this would be.
directory contains only 3 files as well as the program.
silly.xml amusing.zip feisty.txt
So, you run the program and it ignores feisty as it is a .txt file and it reads silly.xml, ti then parses "fechas, localidad, nombre, total" concatenate or append or whatever and use that as the new file for silly.xml, then the program checks if zip has an xml file, if it does then it does the same thing.
so in the end we would have
20141211_sonora_walmart_2033.xml 20141008_sonora_starbucks_102.xml feisty txt amusing.zip
Your question is not clear, and the code you posted is too broad.
I can't debug regular expressions with the power of my sight, but there's a number of things you can do to simplify the code. Simple code means less errors, and an easier time debugging.
To locate your target files, use glob.glob:
files = glob.glob('dir/*.xml')
To parse them, ditch regular expressions and use the ElementTree API.
import xml.etree.ElementTree as ET
tree = ET.parse('target.xml')
root = tree.getroot()
There's also modules to navigate XML files with CSS notation and XPATH. Extracting fields form the filename using regex is okay, but check out named groups.

How to display a XML file in a folder like system

I have searched quite a bit for some help on this but most information about XML and Python is about how to read and parse files, not how to write them. I currently have a script that outputs the file system of a folder as an XML file but I want this file to include HTML tags that can be read on the web.
Also, I am looking for a graphical folder style display of the XML information. I have seen this solved in javascript but I don't have any experience in javascript so is this possible in python?
import os
import sys
from xml.sax.saxutils import quoteattr as xml_quoteattr
dirname = sys.argv[1]
a = open("output2.xml", "w")
def DirAsLessXML(path):
result = '<dir name = %s>\n' % xml_quoteattr(os.path.basename(path))
for item in os.listdir(path):
itempath = os.path.join(path, item)
if os.path.isdir(itempath):
result += '\n'.join(' ' + line for line in
DirAsLessXML(os.path.join(path, item)).split('\n'))
elif os.path.isfile(itempath):
result += ' <file name=%s />\n' % xml_quoteattr(item)
result += '</dir>'
return result
if __name__ == '__main__':
a.write('<structure>\n' + DirAsLessXML(dirname))
Generally it's not a good idea to code HTML directly in Python by building strings as you do it. You won't understand what kind of HTML you produce if look at your code in a month or so. I recommend using a templating engine.
Alternatively, as you are converting from XML to XML, XSLT (an XML transformation language) seems like a good solution.
Concerning JavaScript, you'll need it if you want an interactive folder view. But this does not influence how to generate the HTML page. Whatever approach you use, just add a script include tag to the HTML header and implement the JavaScript part in an extra file (the one you include, of course).

Categories

Resources