I'm trying to pull specific data from one big XML to another. My Main XML File looks like below
<MAIN>
<transaction>
<date>20190415</date>
<ticket>1</ticket>
<value>15</value>
</transaction>
<transaction>
<date>20190415</date>
<ticket>2</ticket>
<value>15</value>
</transaction>
<transaction>
<date>20190415</date>
<ticket>3</ticket>
<value>15</value>
</transaction>
<transaction>
<date>20190415</date>
<ticket>4</ticket>
<value>15</value>
</transaction>
<transaction>
<date>20190415</date>
<ticket>5</ticket>
<value>15</value>
</transaction>
</MAIN>
I'm only pulling the <ticket> values & Appending it to a Fresh/New xml file.
Below is my code
import pandas as pd
import xml.etree.ElementTree as ET
from lxml import etree
path_source = 'source\path'
path_dest = 'dest\path'
tree = ET.parse(path_source)
root = tree.getroot()
L_roots = []
for trx in root.iter('transaction'):
ticket = trx.find('ticket').text
root_T = ET.Element('MAIN')
doc = ET.SubElement(root_T, 'Transaction')
ET.SubElement(doc, 'ticket').text = ticket
L_roots.append(doc)
with open(path_dest,'wb') as f:
for i in L_roots:
ET.Element('MAIN')
f.write(ET.tostring(i, method="xml"))
what i get is a plain text file without the outer <MAIN> tags. like below
<Transaction>
<ticket>1</ticket>
</Transaction>
<Transaction>
<ticket>2</ticket>
</Transaction>
<Transaction>
<ticket>3</ticket>
</Transaction>
<Transaction>
<ticket>4</ticket>
</Transaction>
<Transaction>
<ticket>5</ticket>
</Transaction>
What is missing here is the wrapper <MAIN> tags. what should be changed in my code to achieve this?
Replace this:
with open(path_dest,'wb') as f:
for i in L_roots:
ET.Element('MAIN')
f.write(ET.tostring(i, method="xml"))
with this:
outroot = ET.Element('MAIN')
outroot.extend(L_roots)
with open(path_dest,'wb') as f:
f.write(ET.tostring(outroot, method="xml"))
The error in your snippet is that you never save the new ET.Element('MAIN') to a variable, so that is lost. When using f.write you are simply writing the elements in L_roots, which have the Transaction tag.
In the snippet I propose, all the L_roots elements are inserted into another MAIN element, and then the main element is written (all its subelements are automatically written).
Related
I have an xml file called persons.xml in the following format:
<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York"/>
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles"/>
</person>
</persons>
I want to export to a file the list of person names along with the city names
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('./persons.xml')
root = tree.getroot()
df_cols = ["person_name", "city_name"]
rows = []
for node in root:
person_name = node.attrib.get("name")
rows.append({"person_name": person_name})
out_df = pd.DataFrame(rows, columns = df_cols)
out_df
Obviously this part of the code will only work for obtaining the name as it’s part of the root, but I can’t figure out how to loop through the child nodes too and obtain this info. Do I need to append something to root to iterate over the child nodes?
I can obtain everything using root.getchildren but it doesn’t allow me to return only the child nodes:
children = root.getchildren()
for child in children:
ElementTree.dump(child)
Is there a good way to get this information?
See below
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York" />
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles" />
</person>
</persons>'''
root = ET.fromstring(xml)
data = []
for p in root.findall('.//person'):
data.append({'parson': p.attrib['name'], 'city': p.find('city').attrib['name']})
df = pd.DataFrame(data)
print(df)
output
parson city
0 John New York
1 Mary Los Angeles
I am trying to parse following xml data from a file with python for print only the elements with tag "zip-code" with his attribute name
<response status="success" code="19"><result total-count="1" count="1">
<address>
<entry name="studio">
<zip-code>14407</zip-code>
<description>Nothing</description>
</entry>
<entry name="mailbox">
<zip-code>33896</zip-code>
<description>Nothing</description>
</entry>
<entry name="garage">
<zip-code>33746</zip-code>
<description>Tony garage</description>
</entry>
<entry name="playstore">
<url>playstation.com</url>
<description>game download</description>
</entry>
<entry name="gym">
<zip-code>33746</zip-code>
<description>Getronics NOC subnet 2</description>
</entry>
<entry name="e-cigars">
<url>vape.com/24</url>
<description>vape juices</description>
</entry>
</address>
</result></response>
The python code that I am trying to run is
from xml.etree import ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
items = root.iter('entry')
for item in items:
zip = item.find('zip-code').text
names = (item.attrib)
print(' {} {} '.format(
names, zip
))
However it fails once it gets to the items without "zip-code" tag.
How I could make this run?
Thanks in advance
As #AmitaiIrron suggested, xpath can help here.
This code searches the document for element named zip-code, and pings back to get the parent of that element. From there, you can get the name attribute, and pair with the text from zip-code element
for ent in root.findall(".//zip-code/.."):
print(ent.attrib.get('name'), ent.find('zip-code').text)
studio 14407
mailbox 33896
garage 33746
gym 33746
OR
{ent.attrib.get('name') : ent.find('zip-code').text
for ent in root.findall(".//zip-code/..")}
{'studio': '14407', 'mailbox': '33896', 'garage': '33746', 'gym': '33746'}
Your loop should look like this:
# Find all <entry> tags in the hierarchy
for item in root.findall('.//entry'):
# Try finding a <zip-code> child
zipc = item.find('./zip-code')
# If found a child, print data for it
if zipc is not None:
names = (item.attrib)
print(' {} {} '.format(
names, zipc.text
))
It's all a matter of learning to use xpath properly when searching through the XML tree.
If you have no problem using regular expressions, the following works just fine:
import re
file = open('file.xml', 'r').read()
pattern = r'name="(.*?)".*?<zip-code>(.*?)<\/zip-code>'
matches = re.findall(pattern, file, re.S)
for m in matches:
print("{} {}".format(m[0], m[1]))
and produces the result:
studio 14407
mailbox 33896
garage 33746
aystore 33746
I have an XML as under;
<root>
<entry>
<accession>A</accession>
<accession>B</accession>
<accession>C</accession>
<feature type="cross-link" description="sumo2">
<location>
<position position="15111992"/>
</location>
</feature>
<feature type="temp" description="blah blah sumo">
<location>
<position position="12345"/>
</location>
</feature>
</entry>
<entry>
<accession>X</accession>
<accession>Y</accession>
<accession>Z</accession>
<feature type="test" description="testing">
<location>
<position position="1"/>
</location>
</feature>
<feature type="cross-link" description="sumo hello">
<location>
<position position="11223344"/>
</location>
</feature>
</entry>
</root>
I need to fetch the value of posiiton attribute whose feature type is "cross-link" and description contains the word sumo.
This is what I have tried so far which correctly gives me those value whose feature type is "cross-link" and description contains the word sumo.
from xml.dom import minidom
xmldoc = minidom.parse('P38398.xml')
itemlist = xmldoc.getElementsByTagName('feature')
for s in itemlist:
feattype = s.attributes['type'].value
description = s.attributes['description'].value
if "SUMO" in description:
if "cross-link" in feattype:
print feattype+","+description
How can I extract the value of position once I have the feature type as "cross-link" and description containing the word "sumo"?
You are nearly there except two points:
You have to change your "sumo" search pattern to lowercase to match the data given above
You then need to add something like the following to your loop body
posList = s.getElementsByTagName('position')
for p in posList:
print "-- position is {}".format(p.attributes['position'].value)
This is a job for XPath. A simple check for attribute matches and substring matches and then we return the attribute as a string.
from lxml import etree
root = etree.parse('P38398.xml').getroot()
xpquery = '//feature[#type="cross-link" and contains(#description, "sumo")]//position/#position'
for att in root.xpath(xpquery):
print(att)
Suppose you have an lmxl.etree element with the contents like:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</sublement2>
</element2>
</root>
I can use find or xpath methods to get something an element rendering something like:
<element1>
<subelement1>blabla</subelement1>
</element1>
Is there a way simple to get:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
i.e The element of interest plus all it's ancestors up to the document root?
I am not sure there is something built-in for it, but here is a terrible, "don't ever use it in real life" type of a workaround using the iterancestors() parent iterator:
from lxml import etree as ET
data = """<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
<element2>
<subelement2>blibli</subelement2>
</element2>
</root>"""
root = ET.fromstring(data)
element = root.find(".//subelement1")
result = ET.tostring(element)
for node in element.iterancestors():
result = "<{name}>{text}</{name}>".format(name=node.tag, text=result)
print(ET.tostring(ET.fromstring(result), pretty_print=True))
Prints:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
The following code removes elements that don't have any subelement1 descendants and are not named subelement1.
from lxml import etree
tree = etree.parse("input.xml") # First XML document in question
for elem in tree.iter():
if elem.xpath("not(.//subelement1)") and not(elem.tag == "subelement1"):
if elem.getparent() is not None:
elem.getparent().remove(elem)
print etree.tostring(tree)
Output:
<root>
<element1>
<subelement1>blabla</subelement1>
</element1>
</root>
I have the following XML format, and I want to pull out the values for name, region, and status using python's xml.etree.ElementTree module.
However, my attempt to get this information has been unsuccessful so far.
<feed>
<entry>
<id>uuid:asdfadsfasdf123123</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename</Name>
<Region>US</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
<entry>
<id>uuid:asdfadsfasdf234234</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename2</Name>
<Region>US2</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
</feed>
My code attempt:
NAMESPACE = '{http://www.w3.org/2005/Atom}'
root = et.fromstring(XML_STRING)
entry_root = root.findall('{0}entry'.format(NAMESPACE))
for child in entry_root:
content_node = child.find('{0}content'.format(NAMESPACE))
for content in content_node:
for desc in content.iter():
print desc.tag
name = desc.find('{0}Name'.format(NAMESPACE))
print name
desc.tag is giving me the nodes I want to access, but name is returning None. Any ideas what's wrong with my code?
Output of desc.tag:
{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Name
{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Region
{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Status
I don't know why I didn't see this before. But, I was able to get the values.
root = et.fromstring(XML_STRING)
entry_root = root.findall('{0}entry'.format(NAMESPACE))
for child in entry_root:
content_node = child.find('{0}content'.format(NAMESPACE))
for descr in content_node:
name_node = descr.find('{0}Name'.format(NAMESPACE))
print name_node.text
You can use lxml.etree along with default namespace mapping to parse the XML as follows:
content = '''
<feed>
<entry>
<id>uuid:asdfadsfasdf123123</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename</Name>
<Region>US</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
<entry>
<id>uuid:asdfadsfasdf234234</id>
<title type="text"></title>
<content type="application/xml">
<NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>instancename2</Name>
<Region>US2</Region>
<Status>Active</Status>
</NamespaceDescription>
</content>
</entry>
</feed>'''
from lxml import etree
tree = etree.XML(content)
ns = {'default': 'http://schemas.microsoft.com/netservices/2010/10/servicebus/connect'}
names = tree.xpath('//default:Name/text()', namespaces=ns)
regions = tree.xpath('//default:Region/text()', namespaces=ns)
statuses = tree.xpath('//default:Status/text()', namespaces=ns)
print(names)
print(regions)
print(statuses)
Output
['instancename', 'instancename2']
['US', 'US2']
['Active', 'Active']
This XPath/namespace functionality can be adapted to output the data in any format you require.