Missing wrapper in XML - python

I'm trying to copy specific data from one large XML file into another. My main XML file looks like this:
<MAIN>
<transaction>
<date>20190415</date>
<ticket>1</ticket>
<value>15</value>
</transaction>
<transaction>
<date>20190415</date>
<ticket>2</ticket>
<value>15</value>
</transaction>
<transaction>
<date>20190415</date>
<ticket>3</ticket>
<value>15</value>
</transaction>
<transaction>
<date>20190415</date>
<ticket>4</ticket>
<value>15</value>
</transaction>
<transaction>
<date>20190415</date>
<ticket>5</ticket>
<value>15</value>
</transaction>
</MAIN>
I'm only pulling the <ticket> values and appending them to a fresh XML file.
Below is my code:
import pandas as pd
import xml.etree.ElementTree as ET
from lxml import etree
path_source = 'source\path'
path_dest = 'dest\path'
tree = ET.parse(path_source)
root = tree.getroot()
L_roots = []
for trx in root.iter('transaction'):
    ticket = trx.find('ticket').text
    root_T = ET.Element('MAIN')
    doc = ET.SubElement(root_T, 'Transaction')
    ET.SubElement(doc, 'ticket').text = ticket
    L_roots.append(doc)
with open(path_dest,'wb') as f:
    for i in L_roots:
        ET.Element('MAIN')
        f.write(ET.tostring(i, method="xml"))
What I get is a plain text file without the outer <MAIN> tags, like below:
<Transaction>
<ticket>1</ticket>
</Transaction>
<Transaction>
<ticket>2</ticket>
</Transaction>
<Transaction>
<ticket>3</ticket>
</Transaction>
<Transaction>
<ticket>4</ticket>
</Transaction>
<Transaction>
<ticket>5</ticket>
</Transaction>
What is missing here is the wrapper <MAIN> tag. What should I change in my code to achieve this?

Replace this:
with open(path_dest,'wb') as f:
    for i in L_roots:
        ET.Element('MAIN')
        f.write(ET.tostring(i, method="xml"))
with this:
outroot = ET.Element('MAIN')
outroot.extend(L_roots)
with open(path_dest,'wb') as f:
    f.write(ET.tostring(outroot, method="xml"))
The error in your snippet is that you never assign the new ET.Element('MAIN') to a variable, so it is discarded immediately. f.write then writes only the elements in L_roots, which are the Transaction elements.
In the snippet I propose, all the L_roots elements are inserted into a new MAIN element, and then that element is written (all its subelements are serialized with it).
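For reference, here is a minimal end-to-end sketch of the same idea, where the output root is created once and every Transaction is attached to it directly. The inline SOURCE string and the helper name extract_tickets are assumptions made for illustration; they stand in for the file at path_source and are not part of the original code.

```python
import xml.etree.ElementTree as ET

# Inline sample standing in for the source file (assumption: same layout as in the question).
SOURCE = """<MAIN>
<transaction><date>20190415</date><ticket>1</ticket><value>15</value></transaction>
<transaction><date>20190415</date><ticket>2</ticket><value>15</value></transaction>
</MAIN>"""


def extract_tickets(source_root):
    """Build a new <MAIN> element with one <Transaction><ticket> per source transaction."""
    out_root = ET.Element('MAIN')  # created once, outside the loop
    for trx in source_root.iter('transaction'):
        doc = ET.SubElement(out_root, 'Transaction')
        ET.SubElement(doc, 'ticket').text = trx.findtext('ticket')
    return out_root


out_root = extract_tickets(ET.fromstring(SOURCE))
result = ET.tostring(out_root, encoding='unicode')
print(result)
```

To write the result to disk instead of printing it, something like ET.ElementTree(out_root).write(path_dest, encoding='utf-8', xml_declaration=True) should work.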

Related

Export information from child nodes in xml using Python

I have an xml file called persons.xml in the following format:
<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York"/>
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles"/>
</person>
</persons>
I want to export to a file the list of person names along with the city names
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('./persons.xml')
root = tree.getroot()
df_cols = ["person_name", "city_name"]
rows = []
for node in root:
    person_name = node.attrib.get("name")
    rows.append({"person_name": person_name})
out_df = pd.DataFrame(rows, columns = df_cols)
out_df
Obviously this part of the code will only work for obtaining the name as it’s part of the root, but I can’t figure out how to loop through the child nodes too and obtain this info. Do I need to append something to root to iterate over the child nodes?
I can obtain everything using root.getchildren but it doesn’t allow me to return only the child nodes:
children = root.getchildren()
for child in children:
    ET.dump(child)
Is there a good way to get this information?
See below
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York" />
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles" />
</person>
</persons>'''
root = ET.fromstring(xml)
data = []
for p in root.findall('.//person'):
    data.append({'person': p.attrib['name'], 'city': p.find('city').attrib['name']})
df = pd.DataFrame(data)
print(df)
output
   person         city
0    John     New York
1    Mary  Los Angeles
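If some person elements might lack a city child, p.find('city') returns None and the attribute lookup fails. Below is a small sketch that tolerates this, using only the standard library. The third person ("Ann") and the function name person_city_rows are hypothetical additions for illustration; the resulting list of dicts can be passed to pd.DataFrame as before.

```python
import xml.etree.ElementTree as ET

# Sample data; the third <person> without a <city> is a hypothetical edge case.
XML = '''<persons>
<person id="1" name="John"><city id="21" name="New York"/></person>
<person id="2" name="Mary"><city id="22" name="Los Angeles"/></person>
<person id="3" name="Ann"/>
</persons>'''


def person_city_rows(root):
    """Return one dict per <person>, with city_name set to None when no <city> exists."""
    rows = []
    for person in root.iter('person'):
        city = person.find('city')  # None if the child is missing
        rows.append({
            'person_name': person.get('name'),
            'city_name': city.get('name') if city is not None else None,
        })
    return rows


rows = person_city_rows(ET.fromstring(XML))
for row in rows:
    print(row)
```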

Parse XML with childs that have different tags in Python

I am trying to parse the following XML data from a file with Python, printing only the entries that have a "zip-code" element, together with each entry's name attribute.
<response status="success" code="19"><result total-count="1" count="1">
<address>
<entry name="studio">
<zip-code>14407</zip-code>
<description>Nothing</description>
</entry>
<entry name="mailbox">
<zip-code>33896</zip-code>
<description>Nothing</description>
</entry>
<entry name="garage">
<zip-code>33746</zip-code>
<description>Tony garage</description>
</entry>
<entry name="playstore">
<url>playstation.com</url>
<description>game download</description>
</entry>
<entry name="gym">
<zip-code>33746</zip-code>
<description>Getronics NOC subnet 2</description>
</entry>
<entry name="e-cigars">
<url>vape.com/24</url>
<description>vape juices</description>
</entry>
</address>
</result></response>
The python code that I am trying to run is
from xml.etree import ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
items = root.iter('entry')
for item in items:
    zip = item.find('zip-code').text
    names = (item.attrib)
    print(' {} {} '.format(
        names, zip
    ))
However, it fails as soon as it reaches an entry without a "zip-code" tag.
How can I make this run?
Thanks in advance
As @AmitaiIrron suggested, XPath can help here.
This code searches the document for elements named zip-code and steps back up to the parent of each one. From there, you can get the name attribute and pair it with the text of the zip-code element.
for ent in root.findall(".//zip-code/.."):
    print(ent.attrib.get('name'), ent.find('zip-code').text)
studio 14407
mailbox 33896
garage 33746
gym 33746
Or, as a dict comprehension:
{ent.attrib.get('name'): ent.find('zip-code').text
 for ent in root.findall(".//zip-code/..")}
{'studio': '14407', 'mailbox': '33896', 'garage': '33746', 'gym': '33746'}
Your loop should look like this:
# Find all <entry> tags in the hierarchy
for item in root.findall('.//entry'):
    # Try finding a <zip-code> child
    zipc = item.find('./zip-code')
    # If found a child, print data for it
    if zipc is not None:
        names = (item.attrib)
        print(' {} {} '.format(
            names, zipc.text
        ))
It's all a matter of learning to use xpath properly when searching through the XML tree.
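A slightly more compact variant of the same check, sketched with Element.findtext, which returns None when the child is absent. The trimmed sample XML and the function name zip_codes_by_name are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the data in the question.
XML = """<response status="success" code="19"><result total-count="1" count="1">
<address>
<entry name="studio"><zip-code>14407</zip-code></entry>
<entry name="playstore"><url>playstation.com</url></entry>
<entry name="gym"><zip-code>33746</zip-code></entry>
</address>
</result></response>"""


def zip_codes_by_name(root):
    """Map each entry name to its zip code, skipping entries that have none."""
    pairs = {}
    for entry in root.iter('entry'):
        zip_code = entry.findtext('zip-code')  # None when <zip-code> is missing
        if zip_code is not None:
            pairs[entry.get('name')] = zip_code
    return pairs


zips = zip_codes_by_name(ET.fromstring(XML))
print(zips)
```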
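If you don't mind using regular expressions, the following works for this sample, although a regex is more fragile than a real XML parser. The pattern must not be allowed to run past </entry>; otherwise an entry without a zip-code gets paired with the zip-code of a later entry:
import re
with open('file.xml', 'r') as f:
    data = f.read()
# (?:(?!</entry>).)*? keeps the match inside a single <entry> element
pattern = r'<entry name="([^"]*)">(?:(?!</entry>).)*?<zip-code>(.*?)</zip-code>'
matches = re.findall(pattern, data, re.S)
for m in matches:
    print("{} {}".format(m[0], m[1]))
and produces the result:
studio 14407
mailbox 33896
garage 33746
gym 33746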

Parsing XML in python using minidom

I have an XML file as shown below:
<root>
<entry>
<accession>A</accession>
<accession>B</accession>
<accession>C</accession>
<feature type="cross-link" description="sumo2">
<location>
<position position="15111992"/>
</location>
</feature>
<feature type="temp" description="blah blah sumo">
<location>
<position position="12345"/>
</location>
</feature>
</entry>
<entry>
<accession>X</accession>
<accession>Y</accession>
<accession>Z</accession>
<feature type="test" description="testing">
<location>
<position position="1"/>
</location>
</feature>
<feature type="cross-link" description="sumo hello">
<location>
<position position="11223344"/>
</location>
</feature>
</entry>
</root>
I need to fetch the value of the position attribute for features whose type is "cross-link" and whose description contains the word sumo.
This is what I have tried so far which correctly gives me those value whose feature type is "cross-link" and description contains the word sumo.
from xml.dom import minidom
xmldoc = minidom.parse('P38398.xml')
itemlist = xmldoc.getElementsByTagName('feature')
for s in itemlist:
    feattype = s.attributes['type'].value
    description = s.attributes['description'].value
    if "SUMO" in description:
        if "cross-link" in feattype:
            print(feattype + "," + description)
How can I extract the value of position once I have the feature type as "cross-link" and description containing the word "sumo"?
You are nearly there, except for two points:
Change your "SUMO" search string to lowercase ("sumo") to match the data given above.
Then add something like the following to the body of your loop, inside the two if checks:
posList = s.getElementsByTagName('position')
for p in posList:
    print("-- position is {}".format(p.attributes['position'].value))
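Putting both points together, a complete minidom loop might look like the sketch below. The inline XML mirrors the sample in the question; the function name sumo_cross_link_positions, the exact comparison on type, and the case-insensitive description match are my own assumptions rather than anything the question requires.

```python
from xml.dom import minidom

# Same structure as the sample in the question, inlined so the sketch runs on its own.
XML = """<root><entry>
<feature type="cross-link" description="sumo2"><location><position position="15111992"/></location></feature>
<feature type="temp" description="blah blah sumo"><location><position position="12345"/></location></feature>
</entry><entry>
<feature type="test" description="testing"><location><position position="1"/></location></feature>
<feature type="cross-link" description="sumo hello"><location><position position="11223344"/></location></feature>
</entry></root>"""


def sumo_cross_link_positions(doc):
    """Collect position values of cross-link features whose description mentions sumo."""
    positions = []
    for feature in doc.getElementsByTagName('feature'):
        if feature.getAttribute('type') != 'cross-link':
            continue
        # Assumption: match "sumo" regardless of case.
        if 'sumo' not in feature.getAttribute('description').lower():
            continue
        for pos in feature.getElementsByTagName('position'):
            positions.append(pos.getAttribute('position'))
    return positions


positions = sumo_cross_link_positions(minidom.parseString(XML))
print(positions)
```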
This is a job for XPath. A simple check for attribute matches and substring matches and then we return the attribute as a string.
from lxml import etree
root = etree.parse('P38398.xml').getroot()
xpquery = '//feature[@type="cross-link" and contains(@description, "sumo")]//position/@position'
for att in root.xpath(xpquery):
    print(att)
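If lxml is not available, a rough standard-library equivalent is sketched below. xml.etree.ElementTree supports the [@attr='value'] predicate but not contains(), so the substring test is done in Python. The third feature in the sample (description "other") and the function name cross_link_sumo_positions are hypothetical, added only to show the filter rejecting a non-matching cross-link.

```python
import xml.etree.ElementTree as ET

XML = """<root><entry>
<feature type="cross-link" description="sumo2"><location><position position="15111992"/></location></feature>
<feature type="temp" description="blah blah sumo"><location><position position="12345"/></location></feature>
<feature type="cross-link" description="other"><location><position position="7"/></location></feature>
</entry></root>"""


def cross_link_sumo_positions(root):
    """Return position attributes under cross-link features whose description contains 'sumo'."""
    found = []
    # The attribute predicate is handled by ElementTree's limited XPath support.
    for feature in root.iterfind(".//feature[@type='cross-link']"):
        # contains() is not supported, so check the substring in Python.
        if 'sumo' in feature.get('description', ''):
            found.extend(p.get('position') for p in feature.iter('position'))
    return found


found = cross_link_sumo_positions(ET.fromstring(XML))
print(found)
```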

How to create a subset of document using lxml?

Suppose you have an lxml.etree element with contents like:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</subelement2>
</element2>
</root>
I can use the find or xpath methods to get an element that renders as something like:
<element1>
<subelement1>blabla</subelement1>
</element1>
Is there a simple way to get:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
i.e. the element of interest plus all its ancestors up to the document root?
I am not sure there is something built-in for it, but here is a terrible, "don't ever use it in real life" type of a workaround using the iterancestors() parent iterator:
from lxml import etree as ET
data = """<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</subelement2>
</element2>
</root>"""
root = ET.fromstring(data)
element = root.find(".//subelement1")
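# encoding="unicode" returns str rather than bytes, so it can be used with str.format
result = ET.tostring(element, encoding="unicode")
for node in element.iterancestors():
    result = "<{name}>{text}</{name}>".format(name=node.tag, text=result)
print(ET.tostring(ET.fromstring(result), encoding="unicode", pretty_print=True))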
Prints:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
The following code removes elements that don't have any subelement1 descendants and are not named subelement1.
from lxml import etree
tree = etree.parse("input.xml") # First XML document in question
for elem in list(tree.iter()):  # copy to a list so removal does not disturb the iteration
    if elem.xpath("not(.//subelement1)") and not(elem.tag == "subelement1"):
        if elem.getparent() is not None:
            elem.getparent().remove(elem)
print(etree.tostring(tree, encoding="unicode"))
Output:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
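Another way to approach this, without string formatting and without modifying the original tree, is to copy the element of interest and wrap it in fresh copies of each ancestor. The sketch below uses the standard library's xml.etree.ElementTree, which has no getparent(), so it builds a child-to-parent map first; with lxml the same loop could walk iterancestors() instead. The function name subset_with_ancestors is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def subset_with_ancestors(root, target):
    """Return a new tree holding a copy of target wrapped in copies of its ancestors."""
    # ElementTree elements do not know their parent, so build the mapping up front.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = copy.deepcopy(target)
    node.tail = None  # drop whitespace that followed the element in the source
    current = target
    while current in parent_of:
        current = parent_of[current]
        wrapper = ET.Element(current.tag, current.attrib)  # tag and attributes only
        wrapper.append(node)
        node = wrapper
    return node


root = ET.fromstring(
    "<root><element1><subelement1>blabla</subelement1></element1>"
    "<element2><subelement2>blibli</subelement2></element2></root>"
)
subset = subset_with_ancestors(root, root.find(".//subelement1"))
subset_xml = ET.tostring(subset, encoding="unicode")
print(subset_xml)
```

Text and sibling elements of the ancestors are intentionally left out; only each ancestor's tag and attributes are carried over.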

Parsing nested XML with ElementTree

I have the following XML format, and I want to pull out the values for name, region, and status using python's xml.etree.ElementTree module.
However, my attempt to get this information has been unsuccessful so far.
<feed>
<entry>
<id>uuid:asdfadsfasdf123123</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename</Name>
<Region>US</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
<entry>
<id>uuid:asdfadsfasdf234234</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename2</Name>
<Region>US2</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
</feed>
My code attempt:
NAMESPACE = '{http://www.w3.org/2005/Atom}'
root = et.fromstring(XML_STRING)
entry_root = root.findall('{0}entry'.format(NAMESPACE))
for child in entry_root:
    content_node = child.find('{0}content'.format(NAMESPACE))
    for content in content_node:
        for desc in content.iter():
            print(desc.tag)
            name = desc.find('{0}Name'.format(NAMESPACE))
            print(name)
desc.tag is giving me the nodes I want to access, but name is returning None. Any ideas what's wrong with my code?
Output of desc.tag:
{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Name
{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Region
{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Status
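I don't know why I didn't see this before, but I was able to get the values. Each child of content is the NamespaceDescription element itself, so Name can be looked up on it directly. Note that Name lives in the Service Bus namespace shown in the desc.tag output, not in the Atom namespace:
# Namespace of the NamespaceDescription children, taken from the desc.tag output above
SB_NAMESPACE = '{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}'
root = et.fromstring(XML_STRING)
entry_root = root.findall('{0}entry'.format(NAMESPACE))
for child in entry_root:
    content_node = child.find('{0}content'.format(NAMESPACE))
    for descr in content_node:
        name_node = descr.find('{0}Name'.format(SB_NAMESPACE))
        print(name_node.text)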
You can use lxml.etree along with default namespace mapping to parse the XML as follows:
content = '''
<feed>
<entry>
<id>uuid:asdfadsfasdf123123</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename</Name>
<Region>US</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
<entry>
<id>uuid:asdfadsfasdf234234</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename2</Name>
<Region>US2</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
</feed>'''
from lxml import etree
tree = etree.XML(content)
ns = {'default': 'http://schemas.microsoft.com/netservices/2010/10/servicebus/connect'}
names = tree.xpath('//default:Name/text()', namespaces=ns)
regions = tree.xpath('//default:Region/text()', namespaces=ns)
statuses = tree.xpath('//default:Status/text()', namespaces=ns)
print(names)
print(regions)
print(statuses)
Output
['instancename', 'instancename2']
['US', 'US2']
['Active', 'Active']
This XPath/namespace functionality can be adapted to output the data in any format you require.
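For example, the same prefix mapping also works with the standard library's xml.etree.ElementTree, which accepts a namespaces dict in find, findall, findtext and iterfind. The sketch below groups the three values per NamespaceDescription rather than returning three parallel lists. The prefix 'sb', the function name namespace_records, and the trimmed sample XML are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# 'sb' is an arbitrary prefix chosen here; only the URI has to match the document.
NS = {'sb': 'http://schemas.microsoft.com/netservices/2010/10/servicebus/connect'}

XML = """<feed>
<entry><content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect">
<Name>instancename</Name><Region>US</Region><Status>Active</Status>
</NamespaceDescription></content></entry>
<entry><content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect">
<Name>instancename2</Name><Region>US2</Region><Status>Active</Status>
</NamespaceDescription></content></entry>
</feed>"""


def namespace_records(root):
    """Return one dict per NamespaceDescription with its name, region and status."""
    records = []
    for desc in root.iterfind('.//sb:NamespaceDescription', NS):
        records.append({
            'name': desc.findtext('sb:Name', namespaces=NS),
            'region': desc.findtext('sb:Region', namespaces=NS),
            'status': desc.findtext('sb:Status', namespaces=NS),
        })
    return records


records = namespace_records(ET.fromstring(XML))
for record in records:
    print(record)
```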
