This is the input sample text. I want to do an object-based cleanup to avoid hierarchy issues:
<p><b><b><i><b><i><b>
<i>sample text</i>
</b></i></b></i></b></b></p>
Required Output
<p><b><i>sample text</i></b></p>
I wrote this object-based cleanup using lxml to handle duplicate tags nested at sub-levels. It may help others.
import lxml.etree as ET
textcont = '<p><b><b><i><b><i><b><i>sample text</i></b></i></b></i></b></b></p>'
soup = ET.fromstring(textcont)
for tname in ['i', 'b']:
    for tagn in soup.iter(tname):
        # grandparent has the same tag: drop the grandparent, keep the parent
        if tagn.getparent().getparent() is not None and tagn.getparent().getparent().tag == tname:
            iparOfParent = tagn.getparent().getparent()
            iParent = tagn.getparent()
            if iparOfParent.text is None:
                iparOfParent.addnext(iParent)
                iparOfParent.getparent().remove(iparOfParent)
        # parent has the same tag: drop the parent, keep the current tag
        elif tagn.getparent() is not None and tagn.getparent().tag == tname:
            iParent = tagn.getparent()
            if iParent.text is None:
                iParent.addnext(tagn)
                iParent.getparent().remove(iParent)
print(ET.tostring(soup))
output:
b'<p><b><i>sample text</i></b></p>'
The markup itself has enough structure to extract the elements it contains.
Using re in Python, you can extract the tags and the text and then recombine them.
For example:
import re
html = """<p><b><b><i><b><i><b>
<i>sample text</i>
</b></i></b></i></b></b></p>"""
# Collect the tag names, keeping only the first occurrence of each opening tag
regex_object = re.compile(r"<(.*?)>")
html_objects = regex_object.findall(html)
set_html = []
for obj in html_objects:
    if obj[0] != "/" and obj not in set_html:
        set_html.append(obj)
# Take the first non-empty text found between a '>' and a '<'
regex_text = re.compile(r">(.*?)<")
text = [result for result in regex_text.findall(html) if result][0]
# Recombine
result = ""
for obj in set_html:
    result += f"<{obj}>"
result += text
for obj in set_html[::-1]:
    result += f"</{obj}>"
# result = '<p><b><i>sample text</i></b></p>'
You can use the regex library re to create a function to search for the matching opening tag and closing tag pair and everything else in between. Storing tags in a dictionary will remove duplicate tags and maintain the order they were found in (if order isn't important then just use a set). Once all pairs of tags are found, wrap what's left with the keys of the dictionary in reverse order.
import re
def remove_duplicates(string):
    tags = {}
    while (match := re.findall(r'\<(.+)\>([\w\W]*)\<\/\1\>', string)):
        tag, string = match[0][0], match[0][1]  # match is [(group0, group1)]
        tags.update({tag: None})
    for tag in reversed(tags):
        string = f'<{tag}>{string}</{tag}>'
    return string
Note: I've used [\w\W]* as a cheat to match everything.
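For comparison, here is a minimal parser-based sketch of the same idea using only the standard library. It assumes the input is well-formed and that every element wraps exactly one child (a single chain, as in the sample); text at intermediate levels and sibling elements are ignored. The function name collapse_nested_duplicates is made up for this illustration.

```python
import xml.etree.ElementTree as ET

def collapse_nested_duplicates(markup):
    """Collapse one chain of nested tags so each tag name appears once."""
    node = ET.fromstring(markup)
    tags = []  # tag names in first-seen order
    while True:
        if node.tag not in tags:
            tags.append(node.tag)
        children = list(node)
        # ASSUMPTION: a single chain; stop at the first node without exactly one child
        if len(children) != 1:
            break
        node = children[0]
    text = (node.text or "").strip()

    # Rebuild the nesting from the unique tag names, outermost first
    root = ET.Element(tags[0])
    current = root
    for name in tags[1:]:
        current = ET.SubElement(current, name)
    current.text = text
    return ET.tostring(root, encoding="unicode")

sample = """<p><b><b><i><b><i><b>
<i>sample text</i>
</b></i></b></i></b></b></p>"""
print(collapse_nested_duplicates(sample))
```

Unlike the regex versions, this fails loudly (ParseError) on malformed input instead of silently producing a wrong result.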
I am messing around with a script in Flask, and I have this portion here:
import os
from textfsm import TextFSM

def get_interfaces_list2(device):
    output_interfaces = device.send_command('show interfaces switchport')
    current_dir = os.getcwd()
    template_file = open(current_dir + "/scripts/textfsm/show_interface_switchport.template", "r")
    template = TextFSM(template_file)
    parsed_interfaces = template.ParseText(output_interfaces)
    interface_list = []
    for interface_data in parsed_interfaces:
        # one dict per interface, built from the parsed columns
        resultDict = {}
        resultDict["interface"] = interface_data[0]
        resultDict["admin_mode"] = interface_data[5]
        resultDict["access_vlan"] = interface_data[6]
        resultDict["voice_vlan"] = interface_data[8]
        resultDict["trunking_vlans"] = interface_data[9]
        interface_list.append(resultDict)
    return interface_list
I would like to run another command to get more info from the switch:
    output_interfaces1 = device.send_command('show interfaces description')
    current_dir = os.getcwd()
    template_file = open(current_dir + "/scripts/textfsm/show_interface_description.template", "r")
    template = TextFSM(template_file)
    parsed_interfaces1 = template.ParseText(output_interfaces1)
    interface_list1 = []
    for interface_data1 in parsed_interfaces1:
        resultDict["descrip"] = interface_data1
        interface_list.append(interface_list1)
    return interface_list
I would like to combine this into a single list and return that info in an HTML page.
If I understood correctly, you are currently saving information about an interface in a dictionary and storing that dict in a list. You then want to add more information about the interface. I think there are two approaches you can take here:
Run a single for loop on both parsed_interfaces and parsed_interfaces1 and store all of the info in one shot.
Store the info from your first loop in another dictionary instead of a list where the key is the interface name. Then in the second loop use that key to access the nested dict and store the new info.
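The second approach could look roughly like the sketch below. It works on already-parsed rows so it runs without a switch; the sample rows are made up, and it is an assumption that the description template puts the port name in column 0 and the description in the last column (check your own template). The helper name merge_by_interface is hypothetical.

```python
def merge_by_interface(switchport_rows, description_rows):
    """Merge two parsed command outputs into one list of dicts, keyed by interface name."""
    interfaces = {}
    for row in switchport_rows:
        # same column indexes as in the question's first loop
        interfaces[row[0]] = {
            "interface": row[0],
            "admin_mode": row[5],
            "access_vlan": row[6],
            "voice_vlan": row[8],
            "trunking_vlans": row[9],
        }
    for row in description_rows:
        # ASSUMPTION: column 0 is the port name, the last column is the description
        entry = interfaces.get(row[0])
        if entry is not None:
            entry["descrip"] = row[-1]
    return list(interfaces.values())

# Made-up rows standing in for template.ParseText(...) output
switchport = [
    ["Gi1/0/1", "x", "x", "x", "x", "static access", "10", "x", "20", "ALL"],
    ["Gi1/0/2", "x", "x", "x", "x", "trunk", "1", "x", "none", "1-100"],
]
descriptions = [
    ["Gi1/0/1", "up", "up", "Uplink to core"],
    ["Gi1/0/2", "down", "down", "Spare"],
]
merged = merge_by_interface(switchport, descriptions)
print(merged)
```

The returned list has one dict per interface, so it can be passed to the template as before. This only works if both commands print the interface name the same way.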
I have an XML file like the following:
<AreaModel>
...
<RecipePhase>
<UniqueName>PHASE1</UniqueName>
...
<NumberOfParameterTags>7</NumberOfParameterTags>
...
<DefaultRecipeParameter>
<Name>PARAM1</Name>
----
</DefaultRecipeParameter>
<DefaultRecipeParameter>
<Name>PARAM2</Name>
----
</DefaultRecipeParameter>
<DefaultRecipeParameter>
<Name>PARAM3</Name>
----
</DefaultRecipeParameter>
</RecipePhase>
<RecipePhase>
....
</RecipePhase>
</AreaModel>
I would like to read this file in sequential order and generate two lists: one with the text of each UniqueName tag, and a list of lists holding, for each RecipePhase element, the text of every Name tag under it.
For example, I might have 10 RecipePhase elements, each with a UniqueName tag and each containing a different set of DefaultRecipeParameter children.
How can I keep track of when parsing enters and leaves a RecipePhase element?
I am trying ElementTree, but I have not been able to find a solution.
cheers,
m
You can use Python's xml module (here, xml.dom.minidom):
See my example:
from xml.dom import minidom as dom
from urllib.request import urlopen

def fetchPage(url):
    # download the document and return its raw bytes
    with urlopen(url) as a:
        return a.read()

def extract(page):
    a = dom.parseString(page)
    item = a.getElementsByTagName('Rate')
    for i in item:
        if i.hasChildNodes():
            print(i.getAttribute('currency') + "-" + i.firstChild.nodeValue)

if __name__ == '__main__':
    page = fetchPage("http://www.bnro.ro/nbrfxrates.xml")
    extract(page)
I partially solved my problem with the following code:
import xml.etree.ElementTree as ET
tree = ET.parse('control_strategies.axml')
root = tree.getroot()
phases = []
for recipephase in root.findall('./RecipePhase/UniqueName'):
    phases.append(recipephase.text)
n_elem = len(phases)
param = [[] for _ in range(n_elem)]
i = 0
for recipephase in root.findall('./RecipePhase'):
    for defparam in recipephase.findall('./DefaultRecipeParameter'):
        for paramname in defparam.findall('./Name'):
            param[i].append(paramname.text)
    i = i + 1
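The same result can probably be had in a single pass over the RecipePhase elements, which removes the index counter. This is a sketch on a cut-down sample document (the SAMPLE string and the function name phases_and_params are made up); it assumes every RecipePhase has a UniqueName child.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the real file
SAMPLE = """<AreaModel>
  <RecipePhase>
    <UniqueName>PHASE1</UniqueName>
    <DefaultRecipeParameter><Name>PARAM1</Name></DefaultRecipeParameter>
    <DefaultRecipeParameter><Name>PARAM2</Name></DefaultRecipeParameter>
  </RecipePhase>
  <RecipePhase>
    <UniqueName>PHASE2</UniqueName>
    <DefaultRecipeParameter><Name>PARAM3</Name></DefaultRecipeParameter>
  </RecipePhase>
</AreaModel>"""

def phases_and_params(root):
    """Return (phase names, list of parameter-name lists), one entry per RecipePhase."""
    phases, params = [], []
    for phase in root.findall("./RecipePhase"):
        phases.append(phase.findtext("UniqueName"))
        # searching from `phase` keeps the parameters grouped by their own phase
        params.append([name.text for name in phase.findall("./DefaultRecipeParameter/Name")])
    return phases, params

phases, params = phases_and_params(ET.fromstring(SAMPLE))
print(phases)
print(params)
```

Because each inner search starts from the current phase element, there is no need to track when the parser enters or leaves a RecipePhase.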
I would like to create a dict by parsing a string
<brns ret = "Herld" other = "very">
<brna name = "ame1">
I would like to create a dict that has the following key-value pairs:
dict = {'brnsret': 'Herld',
'brnsother':'very',
'brnaname':'ame1'}
I have a working script that can handle this:
<brns ret = "Herld">
<brna name = "ame1">
My code to generate the dict:
match_tag = re.search(r'<(\w+)\s(\w+) = "(\w+)">', each_par_line)
if match_tag is not None:
    dict_tag[match_tag.group(1)+match_tag.group(2)] = match_tag.group(3)
But how should I tweak my script to handle more than one attribute pair in a tag?
Thanks
An alternative option and, probably, just for educational reasons - you can pass this kind of string into a lenient HTML parser like BeautifulSoup:
from bs4 import BeautifulSoup
data = """
<brns ret = "Herld" other = "very">
<brna name = "ame1">
"""
d = {tag.name + attr: value
     for tag in BeautifulSoup(data, "html.parser")()
     for attr, value in tag.attrs.items()}
print(d)
Prints:
{'brnaname': 'ame1', 'brnsother': 'very', 'brnsret': 'Herld'}
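If you would rather stay with re, one possible tweak is to match the whole tag first and then pull every attribute pair out of it with a second pattern. This is only a sketch; it assumes double-quoted values and word-character names, and the names TAG_RE, ATTR_RE and tag_attributes are made up here.

```python
import re

# tag name, followed by zero or more name = "value" pairs captured as one string
TAG_RE = re.compile(r'<(\w+)((?:\s+\w+\s*=\s*"[^"]*")*)\s*>')
# a single name = "value" pair
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

def tag_attributes(text):
    """Map tag name + attribute name to the attribute value, for every tag in text."""
    result = {}
    for tag, attrs in TAG_RE.findall(text):
        for name, value in ATTR_RE.findall(attrs):
            result[tag + name] = value
    return result

data = '''<brns ret = "Herld" other = "very">
<brna name = "ame1">'''
print(tag_attributes(data))
```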
I'm trying to generate customized xml files from a template xml file in python.
Conceptually, I want to read in the template xml, remove some elements, change some text attributes, and write the new xml out to a file. I wanted it to work something like this:
conf_base = ConvertXmlToDict('config-template.xml')
conf_base_dict = conf_base.UnWrap()
del conf_base_dict['root-name']['level1-name']['leaf1']
del conf_base_dict['root-name']['level1-name']['leaf2']
conf_new = ConvertDictToXml(conf_base_dict)
Now I want to write it to a file, but I don't see how to get to ElementTree.ElementTree.write():
conf_new.write('config-new.xml')
Is there some way to do this, or can someone suggest doing this a different way?
This will get you a dict, minus attributes. I don't know if this is useful to anyone; I was looking for an XML-to-dict solution myself when I came up with this.
import xml.etree.ElementTree as etree
tree = etree.parse('test.xml')
root = tree.getroot()
def xml_to_dict(el):
    d = {}
    if el.text:
        d[el.tag] = el.text
    else:
        d[el.tag] = {}
    children = list(el)  # getchildren() was removed in Python 3.9
    if children:
        # build a real list; map() returns a lazy iterator on Python 3
        d[el.tag] = [xml_to_dict(child) for child in children]
    return d
This: http://www.w3schools.com/XML/note.xml
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
Would equal this:
{'note': [{'to': 'Tove'},
{'from': 'Jani'},
{'heading': 'Reminder'},
{'body': "Don't forget me this weekend!"}]}
I'm not sure if converting the info set to nested dicts first is easier. Using ElementTree, you can do this:
import xml.etree.ElementTree as ET
doc = ET.parse("template.xml")
lvl1 = doc.findall("level1-name")[0]
lvl1.remove(lvl1.find("leaf1"))
lvl1.remove(lvl1.find("leaf2"))
# or use del lvl1[idx]
doc.write("config-new.xml")
ElementTree was designed so that you don't have to convert your XML trees to lists and attributes first, since it uses exactly that internally.
It also supports a small subset of XPath.
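A self-contained sketch of that edit-in-place approach, using an inline template so it runs without any files. The TEMPLATE contents and the helper name strip_leaves are made up; only the element names come from the question.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for config-template.xml
TEMPLATE = (
    "<root-name><level1-name>"
    "<leaf1>a</leaf1><leaf2>b</leaf2><leaf3>c</leaf3>"
    "</level1-name></root-name>"
)

def strip_leaves(xml_text, parent_path, leaf_tags):
    """Parse xml_text, remove the named children of parent_path, return an ElementTree."""
    root = ET.fromstring(xml_text)
    parent = root.find(parent_path)
    for name in leaf_tags:
        leaf = parent.find(name)
        if leaf is not None:  # skip leaves that are not in the template
            parent.remove(leaf)
    return ET.ElementTree(root)

tree = strip_leaves(TEMPLATE, "level1-name", ["leaf1", "leaf2"])
result = ET.tostring(tree.getroot(), encoding="unicode")
print(result)
# tree.write("config-new.xml") would save the same document to disk
```

Text changes work the same way: find the element and assign to its .text before writing.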
For easy manipulation of XML in python, I like the Beautiful Soup library. It works something like this:
Sample XML File:
<root>
<level1>leaf1</level1>
<level2>leaf2</level2>
</root>
Python code:
from BeautifulSoup import BeautifulStoneSoup, Tag, NavigableString
soup = BeautifulStoneSoup(open('config-template.xml')) # parse the contents of the xml file, not the file name
soup.contents[0].name
# u'root'
You can use the node names as methods:
soup.root.contents[0].name
# u'level1'
It is also possible to use regexes:
import re
tags_starting_with_level = soup.findAll(re.compile('^level'))
for tag in tags_starting_with_level: print tag.name
# level1
# level2
Adding and inserting new nodes is pretty straightforward:
# build and insert a new level with a new leaf
level3 = Tag(soup, 'level3')
level3.insert(0, NavigableString('leaf3'))
soup.root.insert(2, level3)
print soup.prettify()
# <root>
# <level1>
# leaf1
# </level1>
# <level2>
# leaf2
# </level2>
# <level3>
# leaf3
# </level3>
# </root>
My modification of Daniel's answer, to give a marginally neater dictionary:
def xml_to_dictionary(element):
    l = len(namespace)
    dictionary = {}
    tag = element.tag[l:]  # strip the namespace prefix from the tag
    if element.text:
        if (element.text == ' '):
            dictionary[tag] = {}
        else:
            dictionary[tag] = element.text
    children = list(element)  # getchildren() was removed in Python 3.9
    if children:
        subdictionary = {}
        for child in children:
            for k, v in xml_to_dictionary(child).items():
                if k in subdictionary:
                    # repeated subtag: collect the values in a list
                    if (isinstance(subdictionary[k], list)):
                        subdictionary[k].append(v)
                    else:
                        subdictionary[k] = [subdictionary[k], v]
                else:
                    subdictionary[k] = v
        if (dictionary[tag] == {}):
            dictionary[tag] = subdictionary
        else:
            dictionary[tag] = [dictionary[tag], subdictionary]
    if element.attrib:
        attribs = {}
        for k, v in element.attrib.items():
            attribs[k] = v
        if (dictionary[tag] == {}):
            dictionary[tag] = attribs
        else:
            dictionary[tag] = [dictionary[tag], attribs]
    return dictionary
namespace is the xmlns string, including braces, that ElementTree prepends to all tags. I strip it here because there is one namespace for the entire document.
Note that I adjusted the raw XML too, so that 'empty' tags would produce at most a ' ' text property in the ElementTree representation:
import re
from xml.etree import ElementTree
spacepattern = re.compile(r'\s+')
mydictionary = xml_to_dictionary(ElementTree.XML(spacepattern.sub(' ', content)))
This would give, for instance:
{'note': {'to': 'Tove',
'from': 'Jani',
'heading': 'Reminder',
'body': "Don't forget me this weekend!"}}
It's designed for specific XML that is basically equivalent to JSON, but it should also handle element attributes such as:
<elementName attributeName='attributeContent'>elementContent</elementName>
There's the possibility of merging the attribute dictionary and the subtag dictionary, similarly to how repeated subtags are merged, although nesting the lists seems kind of appropriate :-)
Adding this line
d.update(('#' + k, v) for k, v in el.attrib.items())
to user247686's code gives you node attributes too.
Found it in this post https://stackoverflow.com/a/7684581/1395962
Example:
import xml.etree.ElementTree as etree
from urllib.request import urlopen
xml_file = "http://your_xml_url"
tree = etree.parse(urlopen(xml_file))
root = tree.getroot()
def xml_to_dict(el):
    d = {}
    if el.text:
        d[el.tag] = el.text
    else:
        d[el.tag] = {}
    children = list(el)  # getchildren() was removed in Python 3.9
    if children:
        d[el.tag] = [xml_to_dict(child) for child in children]
    # attributes are stored alongside the tag, prefixed with '#'
    d.update(('#' + k, v) for k, v in el.attrib.items())
    return d
Call as
xml_to_dict(root)
Have you tried this?
print(xml.etree.ElementTree.tostring(conf_new))
The most direct way, to me:
root = ET.parse(xh)
data = root.getroot()
xdic = {}
if data is not None:
    # one entry per direct child; repeated tags overwrite each other
    for part in data:
        xdic[part.tag] = part.text
XML has a rich infoset, and it takes some special tricks to represent that in a Python dictionary. Elements are ordered, attributes are distinguished from element bodies, etc.
One project that handles round-trips between XML and Python dictionaries, with configuration options to handle the tradeoffs in different ways, is XML Support in Pickling Tools. Version 1.3 or newer is required. It isn't pure Python (and in fact is designed to make C++ / Python interaction easier), but it might be appropriate for various use cases.
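To make the tradeoff concrete, here is a small standard-library sketch (not the Pickling Tools API) of one dictionary layout that keeps child order and keeps attributes apart from element text, so the round trip is lossless for simple documents. The layout and both function names are assumptions made for this illustration; tail text, comments and processing instructions are ignored.

```python
import xml.etree.ElementTree as ET

def element_to_dict(el):
    """Represent an element as a dict that keeps attributes, text and child order apart."""
    return {
        "tag": el.tag,
        "attrib": dict(el.attrib),
        "text": el.text,
        "children": [element_to_dict(child) for child in el],  # a list preserves order
    }

def dict_to_element(d):
    """Rebuild an element from the layout produced by element_to_dict."""
    el = ET.Element(d["tag"], d["attrib"])
    el.text = d["text"]
    for child in d["children"]:
        el.append(dict_to_element(child))
    return el

source = '<note id="1"><to>Tove</to><from>Jani</from></note>'
as_dict = element_to_dict(ET.fromstring(source))
round_trip = ET.tostring(dict_to_element(as_dict), encoding="unicode")
print(round_trip)
```

The price is a more verbose dictionary than the tag-keyed versions above, which is the kind of tradeoff the configuration options are there to manage.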