How to parse xml string having deep structures using python

How to parse xml string having deep structures using python - python

A similar question is asked here (Python XML Parsing) but I could not reach to the content I am interested in.
I need to extract all the information that is enclosed between the tag patent-classification if the classification-scheme tag value is CPC. There are multiple such element and are enclosed inside patent-classifications tag.
In the example given below, there are three such values: C 07 K 16 22 I , A 61 K 2039 505 A and C 07 K 2317 21 A
<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
<ops:meta name="elapsed-time" value="21"/>
<exchange-documents>
<exchange-document system="ops.epo.org" family-id="39103486" country="US" doc-number="2009234106" kind="A1">
<bibliographic-data>
<publication-reference>
<document-id document-id-type="docdb">
<country>US</country>
<doc-number>2009234106</doc-number>
<kind>A1</kind>
<date>20090917</date>
</document-id>
<document-id document-id-type="epodoc">
<doc-number>US2009234106</doc-number>
<date>20090917</date>
</document-id>
</publication-reference>
<classifications-ipcr>
<classification-ipcr sequence="1">
<text>C07K 16/ 44 A I </text>
</classification-ipcr>
</classifications-ipcr>
<patent-classifications>
<patent-classification sequence="1">
<classification-scheme office="" scheme="CPC"/>
<section>C</section>
<class>07</class>
<subclass>K</subclass>
<main-group>16</main-group>
<subgroup>22</subgroup>
<classification-value>I</classification-value>
</patent-classification>
<patent-classification sequence="2">
<classification-scheme office="" scheme="CPC"/>
<section>A</section>
<class>61</class>
<subclass>K</subclass>
<main-group>2039</main-group>
<subgroup>505</subgroup>
<classification-value>A</classification-value>
</patent-classification>
<patent-classification sequence="7">
<classification-scheme office="" scheme="CPC"/>
<section>C</section>
<class>07</class>
<subclass>K</subclass>
<main-group>2317</main-group>
<subgroup>92</subgroup>
<classification-value>A</classification-value>
</patent-classification>
<patent-classification sequence="1">
<classification-scheme office="US" scheme="UC"/>
<classification-symbol>530/387.9</classification-symbol>
</patent-classification>
</patent-classifications>
</bibliographic-data>
</exchange-document>
</exchange-documents>
</ops:world-patent-data>

Install BeautifulSoup if you don't have it:
$ easy_install BeautifulSoup4
Try this:
from bs4 import BeautifulSoup
xml = open('example.xml', 'rb').read()
bs = BeautifulSoup(xml)
# find patent-classification
patents = bs.findAll('patent-classification')
# filter the ones with CPC
for pa in patents:
if pa.find('classification-scheme', {'scheme': 'CPC'} ):
print pa.getText()

You can use python xml standard module:
import xml.etree.ElementTree as ET
root = ET.parse('a.xml').getroot()
for node in root.iterfind(".//{http://www.epo.org/exchange}classification-scheme[#scheme='CPC']/.."):
data = []
for d in node.getchildren():
if d.text:
data.append(d.text)
print ' '.join(data)

Related

How to extract data from GML file

I have a text file and would like to extract the <gml:pos>73664.300 836542.700</gml:pos> from it. More precisely I would like to get the GPS coordinate system [73664.300 836542.700] from the pos tag. The file contains multiple <wfs:member> and each of them has a <gml:pos> (deepest layer).
<?xml version='1.0' encoding='UTF-8'?>
<wfs:FeatureCollection xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wfs/2.0 http://schemas.opengis.net/wfs/2.0/wfs.xsd http://www.opengis.net/gml/3.2 http://schemas.opengis.net/gml/3.2.1/gml.xsd http://www.deegree.org/app https://web.de/feature_descr?SERVICE=WFS&VERSION=2.0.0&REQUEST=DescribeFeatureType&OUTPUTFORMAT=application%2Fgml%2Bxml%3B+version%3D3.2&TYPENAME=app:lsa_data&NAMESPACES=xmlns(app,http%3A%2F%2Fwww.deegree.org%2Fapp)" xmlns:wfs="http://www.opengis.net/wfs/2.0" timeStamp="2020-11-18T15:01:17Z" xmlns:gml="http://www.opengis.net/gml/3.2" numberMatched="unknown" numberReturned="0">
<!--NOTE: numberReturned attribute should be 'unknown' as well, but this would not validate against the current version of the WFS 2.0 schema (change upcoming). See change request (CR 144): https://portal.opengeospatial.org/files?fact_id=6798.-->
<wfs:member>
<app:dat_set xmlns:app="http://www.deegree.org/app" gml:id="app:dat_set_1">
<app:point>2</app:point>
<app:art>K </app:art>
<app:L_Name>westt / woustest </app:L_Name>
<app:geom>
<!--Inlined geometry 'data_1_APP_GEOM'-->
<gml:MultiPoint gml:id="data_1_APP_GEOM" srsName="EPSG:25832">
<gml:pointMember>
<gml:Point gml:id="GEOMETRY_ad608059-f297-4554-8464-cdde248cb531" srsName="EPSG:25832">
<gml:pos>73664.300 836542.700</gml:pos>
</gml:Point>
</gml:pointMember>
</gml:MultiPoint>
</app:geom>
</app:lsa_pointdata>
</wfs:member>
<wfs:member>
<app:dat_set xmlns:app="http://www.deegree.org/app" gml:id="app:dat_set_2">
<app:point>3</app:point>
<app:art>K </app:art>
<app:L_Name>route / riztr </app:L_Name>
<app:geom>
<!--Inlined geometry 'data_2_APP_GEOM'-->
<gml:MultiPoint gml:id="data_2_APP_GEOM" srsName="EPSG:25832">
<gml:pointMember>
<gml:Point gml:id="GEOMETRY_440d8630-b674-4768-a5b7-3fab46d9ac8c" srsName="EPSG:25832">
<gml:pos>74354.900 837456.300</gml:pos>
</gml:Point>
</gml:pointMember>
</gml:MultiPoint>
</app:geom>
</app:lsa_pointdata>
</wfs:member>
<wfs:member>
...
...
How could I get those gps coordinates ?
Thank you in advance.

You can use lxml and XPATH.
data = b'''\
<?xml version='1.0' encoding='UTF-8'?>
<wfs:FeatureCollection xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wfs/2.0 http://schemas.opengis.net/wfs/2.0/wfs.xsd http://www.opengis.net/gml/3.2 http://schemas.opengis.net/gml/3.2.1/gml.xsd http://www.deegree.org/app https://web.de/feature_descr?SERVICE=WFS&VERSION=2.0.0&REQUEST=DescribeFeatureType&OUTPUTFORMAT=application%2Fgml%2Bxml%3B+version%3D3.2&TYPENAME=app:lsa_data&NAMESPACES=xmlns(app,http%3A%2F%2Fwww.deegree.org%2Fapp)" xmlns:wfs="http://www.opengis.net/wfs/2.0" timeStamp="2020-11-18T15:01:17Z" xmlns:gml="http://www.opengis.net/gml/3.2" numberMatched="unknown" numberReturned="0">
<!--NOTE: numberReturned attribute should be 'unknown' as well, but this would not validate against the current version of the WFS 2.0 schema (change upcoming). See change request (CR 144): https://portal.opengeospatial.org/files?fact_id=6798.-->
<wfs:member>
<app:dat_set xmlns:app="http://www.deegree.org/app" gml:id="app:dat_set_1">
<app:point>2</app:point>
<app:art>K </app:art>
<app:L_Name>westt / woustest </app:L_Name>
<app:geom>
<!--Inlined geometry 'data_1_APP_GEOM'-->
<gml:MultiPoint gml:id="data_1_APP_GEOM" srsName="EPSG:25832">
<gml:pointMember>
<gml:Point gml:id="GEOMETRY_ad608059-f297-4554-8464-cdde248cb531" srsName="EPSG:25832">
<gml:pos>73664.300 836542.700</gml:pos>
</gml:Point>
</gml:pointMember>
</gml:MultiPoint>
</app:geom>
</app:dat_set>
</wfs:member>
<wfs:member>
<app:dat_set xmlns:app="http://www.deegree.org/app" gml:id="app:dat_set_2">
<app:point>3</app:point>
<app:art>K </app:art>
<app:L_Name>route / riztr </app:L_Name>
<app:geom>
<!--Inlined geometry 'data_2_APP_GEOM'-->
<gml:MultiPoint gml:id="data_2_APP_GEOM" srsName="EPSG:25832">
<gml:pointMember>
<gml:Point gml:id="GEOMETRY_440d8630-b674-4768-a5b7-3fab46d9ac8c" srsName="EPSG:25832">
<gml:pos>74354.900 837456.300</gml:pos>
</gml:Point>
</gml:pointMember>
</gml:MultiPoint>
</app:geom>
</app:dat_set>
</wfs:member>
</wfs:FeatureCollection>
'''
from lxml import etree
from io import BytesIO
f = BytesIO(data)
ns = {"gml": "http://www.opengis.net/gml/3.2"}
tree = etree.parse(f)
for e in tree.findall("//gml:pos", ns):
print(e.text)

Python XML Parser Issue

I am new to python. Sorry for asking this stupid question.
I am trying to read a XML file to python object (preferably to pandas)
For now I am just trying to print the variables, to see if I can read them properly in a tabular form.
I have used xml.etree.ElementTree for this, but I might not be using it as intended.
Code:
import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
ODM = tree.getroot()
ns = {'xmlns': 'http://www.cdisc.org/ns/odm/v1.3',
'mdsol': 'http://www.mdsol.com/ns/odm/metadata'}
for ClinicalData in ODM:
LocationOID=None
#print(ClinicalData.tag, ClinicalData.attrib)
for SubjectData in ClinicalData:
for SiteRef in SubjectData:
LocationOID=SiteRef.attrib.get('LocationOID')
for StudyEventData in SubjectData:
for AuditRecord in StudyEventData:
print(ClinicalData.attrib.get('MetaDataVersionOID'),
ClinicalData.attrib.get('AuditSubCategoryName'), #null ouptput due to namespace issue
SubjectData.attrib.get('SubjectKey'),
SubjectData.attrib.get('SubjectName'), #null ouptput due to namespace issue
LocationOID, #not sure what is the issue
StudyEventData.attrib.get('StudyEventRepeatKey'),
AuditRecord.find('DateTimeStamp') #not sure what is the issue
)
Input:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata"
CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">
<ClinicalData MetaDataVersionOID="1772" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Activated">
<SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-3">
<SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960580">
<AuditRecord>
<UserRef UserOID="systemuser"/>
<LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
<ReasonForChange>Update</ReasonForChange>
<SourceID>394263772</SourceID>
</AuditRecord>
</StudyEventData>
</SubjectData>
</ClinicalData>
</ODM>
I am expecting all the print variables need to have the proper variable assigned values as in XML file. Please let me know is there any other proper way of doing it instead of inner looping multiple times.

Namespaces are a pain using ElementTree. See this discussion.
Short answer:
for ClinicalData in ODM:
#print(ClinicalData.tag, ClinicalData.attrib)
for SubjectData in ClinicalData:
SiteRef = SubjectData.find('{http://www.cdisc.org/ns/odm/v1.3}SiteRef')
LocationOID = SiteRef.attrib.get('LocationOID')
for StudyEventData in SubjectData:
for AuditRecord in StudyEventData:
print(
ClinicalData.attrib.get('MetaDataVersionOID'),
ClinicalData.attrib.
get('{http://www.mdsol.com/ns/odm/metadata}AuditSubCategoryName'
), #null ouptput due to namespace issue
SubjectData.attrib.get('SubjectKey'),
SubjectData.attrib.get(
'{http://www.mdsol.com/ns/odm/metadata}SubjectName'
), #null ouptput due to namespace issue
LocationOID, #not sure what is the issue
StudyEventData.attrib.get('StudyEventRepeatKey'),
AuditRecord.find(
'{http://www.cdisc.org/ns/odm/v1.3}DateTimeStamp').
text #not sure what is the issue
)

I think you can use BeautifulSoup for parsing XML:
from bs4 import BeautifulSoup
temp ="""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata"
CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">
<ClinicalData MetaDataVersionOID="1772" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Activated">
<SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-3">
<SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960580">
<AuditRecord>
<UserRef UserOID="systemuser"/>
<LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
<ReasonForChange>Update</ReasonForChange>
<SourceID>394263772</SourceID>
</AuditRecord>
</StudyEventData>
</SubjectData>
</ClinicalData>
</ODM>"""
temp=BeautifulSoup(temp,"lxml")
ClinicalData = temp.find('ClinicalData'.lower())
SubjectData = ClinicalData.find_all('SubjectData'.lower())
LocationOID=None
for i in SubjectData:
SiteRef = i.find('SiteRef'.lower())
LocationOID = SiteRef.attrs['locationoid']
print('LocationOID',LocationOID)
output:
LocationOID 0ACCSP3MAPPING1SITE1
[Finished in 1.2s]

#Justin
I have applied your suggestions, it worked, until I broke it.
Input:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">
<ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
<SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-1">
<SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960564">
<FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
<ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928808">
<ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="45">
<AuditRecord>
<UserRef UserOID="alscrave2"/>
<LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
<ReasonForChange/>
<SourceID>122841525</SourceID>
</AuditRecord>
<MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
</ItemData>
</ItemGroupData>
</FormData>
</StudyEventData>
</SubjectData>
</ClinicalData>
<ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
<SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-1">
<SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960564">
<FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
<ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928809">
<ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="46">
<AuditRecord>
<UserRef UserOID="alscrave2"/>
<LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
<ReasonForChange/>
<SourceID>122841525</SourceID>
</AuditRecord>
<MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
</ItemData>
</ItemGroupData>
</FormData>
</StudyEventData>
</SubjectData>
</ClinicalData>
</ODM>
Code:
import xml.etree.ElementTree as ET
import pandas as pd
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
tree = ET.parse("data.xml")
ODM = tree.getroot()
xmlns = "{http://www.cdisc.org/ns/odm/v1.3}"
mdsol = "{http://www.mdsol.com/ns/odm/metadata}"
def data_reader():
dfcols = ['CreationDateTime','StudyOID','MetaDataVersionOID','SubjectName','SUBJECTUUID','LocationOID','StudyEventOID',
'StudyEventRepeatKey','FormOID','FormRepeatKey','DataPageId','ItemgroupOID','RecordId','var_name','Value',
'DateTimeStamp','ASC_Name','Measurement_Unit','SourceID','UserOID','InstanceId']
df_xml = pd.DataFrame(columns=dfcols)
CreationDateTime = ODM.attrib.get('CreationDateTime')
for ClinicalData in ODM:
StudyOID = ClinicalData.attrib.get('StudyOID')
MetaDataVersionOID = ClinicalData.attrib.get('MetaDataVersionOID')
ASC_Name = ClinicalData.attrib.get('{0}AuditSubCategoryName'.format(mdsol))
for SubjectData in ClinicalData:
SubjectName = SubjectData.attrib.get('{0}SubjectName'.format(mdsol))
SUBJECTUUID = SubjectData.attrib.get('SubjectKey')
LocationOID = SubjectData.find('{0}SiteRef'.format(xmlns)).attrib.get('LocationOID')
for StudyEventData in SubjectData:
StudyEventOID = StudyEventData.attrib.get('StudyEventOID')
StudyEventRepeatKey = StudyEventData.attrib.get('StudyEventRepeatKey')
InstanceId = StudyEventData.attrib.get('{0}InstanceId'.format(mdsol))
for FormData in StudyEventData:
FormOID = FormData.attrib.get('FormOID')
FormRepeatKey = FormData.attrib.get('FormRepeatKey')
DataPageId = FormData.attrib.get('{0}DataPageId'.format(mdsol))
for ItemGroupData in FormData:
ItemgroupOID = ItemGroupData.attrib.get('ItemgroupOID')
RecordId = ItemGroupData.attrib.get('{0}RecordId'.format(mdsol))
for ItemData in ItemGroupData:
var_name = ItemData.attrib.get('ItemOID')
Value = ItemData.attrib.get('Value')
Measurement_Unit = ItemData.find('MeasurementUnitRef'.format(xmlns)).attrib.get('MeasurementUnitOID')
for AuditRecord in ItemData:
DateTimeStamp = AuditRecord.find('{0}DateTimeStamp'.format(xmlns)).text;
SourceID = AuditRecord.find('{0}SourceID'.format(xmlns)).text;
UserOID = ItemData.find('{0}UserRef'.format(xmlns)).attrib.get('UserOID')
df_xml = df_xml.append(
pd.Series([CreationDateTime,StudyOID,MetaDataVersionOID,SubjectName,
SUBJECTUUID,LocationOID,StudyEventOID,
StudyEventRepeatKey,FormOID,FormRepeatKey,DataPageId,ItemgroupOID,
RecordId,var_name,Value,DateTimeStamp,ASC_Name,Measurement_Unit,
SourceID,UserOID,InstanceId], index=dfcols),
ignore_index=True)
print(df_xml)
data_reader()
Issue: I am getting duplicate records. And variables DateTimeStamp, SourceID, UserOID and Measurement_Unit are throwing run time errors during assignment.

Python and products file XML

I'm using python and I need to find sku, min-order-qty and step-quantity for each occurence of sku.
Input file is:
<product sku="1235997403">
<sku>1235997403</sku>
<name xml:lang="fr-FR">Huile pour entretien des destructeurs de documents HSM</name>
<short-description xml:lang="fr-FR">Flacon 250 ml. Colis de 1 flacon.</short-description>
<category-links>
<category-link name="20319647o.rjpf_20320074o.rjpf" domain="RAJA-FR-WEB-0092-21" default = "1" hotdeal = "0"/>
</category-links>
<online>1</online>
<quantity unit="pcs">
<min-order-quantity>1</min-order-quantity>
<step-quantity>1</step-quantity>
</quantity>
....
</product>
....
I try to use lxml but fail to get min-order-qty and step-quantity
from lxml import etree
tree = etree.parse('./ST2CleanCourt.xml')
elem = tree.getroot()
for child in elem:
print (child.attrib["sku"])
I tried to use the 2 solutions below. It works but I need to read the file so I write
from lxml import etree
import codecs
f=codecs.open('./ST2CleanCourt.xml','r','utf-8')
fichier = f.read()
tree = etree.fromstring(fichier)
for child in tree:
print ('sku:', child.attrib['sku'])
print ('min:', child.find('.//min-order-quantity').text)
and I always get this error
print ('min:', child.find('.//min-order-quantity').text)
AttributeError: 'NoneType' object has no attribute 'text'
what is wrong ?

You can use the xpath method to get the required values.
Example:
from lxml import etree
a = """<product sku="1235997403">
<sku>1235997403</sku>
<name xml:lang="fr-FR">Huile pour entretien des destructeurs de documents HSM</name>
<short-description xml:lang="fr-FR">Flacon 250 ml. Colis de 1 flacon.</short-description>
<category-links>
<category-link name="20319647o.rjpf_20320074o.rjpf" domain="RAJA-FR-WEB-0092-21" default = "1" hotdeal = "0"/>
</category-links>
<online>1</online>
<quantity unit="pcs">
<min-order-quantity>1</min-order-quantity>
<step-quantity>1</step-quantity>
</quantity>
</product>
"""
tree = etree.fromstring(a)
tags = tree.xpath('/product')
for b in tags:
print b.attrib["sku"]
min_order = b.xpath("//quantity/min-order-quantity")
print min_order[0].text
step_quality = b.xpath("//quantity/step-quantity")
print step_quality[0].text
Output:
1235997403
1
1

Using more then 1 product and root node of products you can find this:
x = """
<products>
<product sku="1235997403">
<sku>1235997403</sku>
<name xml:lang="fr-FR">Huile pour entretien des destructeurs de documents HSM</name>
<short-description xml:lang="fr-FR">Flacon 250 ml. Colis de 1 flacon.</short-description>
<category-links>
<category-link name="20319647o.rjpf_20320074o.rjpf" domain="RAJA-FR-WEB-0092-21" default = "1" hotdeal = "0"/>
</category-links>
<online>1</online>
<quantity unit="pcs">
<min-order-quantity>1</min-order-quantity>
<step-quantity>1</step-quantity>
</quantity>
</product>
<product sku="997403">
<sku>1235997403</sku>
<name xml:lang="fr-FR">Huile pour entretien des destructeurs de documents HSM</name>
<short-description xml:lang="fr-FR">Flacon 250 ml. Colis de 1 flacon.</short-description>
<category-links>
<category-link name="20319647o.rjpf_20320074o.rjpf" domain="RAJA-FR-WEB-0092-21" default = "1" hotdeal = "0"/>
</category-links>
<online>1</online>
<quantity unit="pcs">
<min-order-quantity>5</min-order-quantity>
<step-quantity>7</step-quantity>
</quantity>
</product>
</products>
"""
from lxml import etree
tree = etree.fromstring(x)
for child in tree:
print ("sku:", child.attrib["sku"])
print ("min:", child.find(".//min-order-quantity").text) # looks for node below
print ("step:" ,child.find(".//step-quantity").text) # child with the given name
Essentially you look for any node below child that has the correct name and print its text.
Output:
sku:1235997403
min:1
step:1
sku:997403
min:5
step:7
Doku: http://lxml.de/tutorial.html#elementpath

lxml: Get all leaf nodes?

Give an XML file, is there a way using lxml to get all the leaf nodes with their names and attributes?
Here is the XML file of interest:
<?xml version="1.0" encoding="UTF-8"?>
<clinical_study>
<!-- This xml conforms to an XML Schema at:
http://clinicaltrials.gov/ct2/html/images/info/public.xsd
and an XML DTD at:
http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
<id_info>
<org_study_id>3370-2(-4)</org_study_id>
<nct_id>NCT00753818</nct_id>
<nct_alias>NCT00222157</nct_alias>
</id_info>
<brief_title>Developmental Effects of Infant Formula Supplemented With LCPUFA</brief_title>
<sponsors>
<lead_sponsor>
<agency>Mead Johnson Nutrition</agency>
<agency_class>Industry</agency_class>
</lead_sponsor>
</sponsors>
<source>Mead Johnson Nutrition</source>
<oversight_info>
<authority>United States: Institutional Review Board</authority>
</oversight_info>
<brief_summary>
<textblock>
The purpose of this study is to compare the effects on visual development, growth, cognitive
development, tolerance, and blood chemistry parameters in term infants fed one of four study
formulas containing various levels of DHA and ARA.
</textblock>
</brief_summary>
<overall_status>Completed</overall_status>
<phase>N/A</phase>
<study_type>Interventional</study_type>
<study_design>N/A</study_design>
<primary_outcome>
<measure>visual development</measure>
</primary_outcome>
<secondary_outcome>
<measure>Cognitive development</measure>
</secondary_outcome>
<number_of_arms>4</number_of_arms>
<condition>Cognitive Development</condition>
<condition>Growth</condition>
<arm_group>
<arm_group_label>1</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
</arm_group>
<arm_group>
<arm_group_label>2</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
</arm_group>
<arm_group>
<arm_group_label>3</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
</arm_group>
<arm_group>
<arm_group_label>4</arm_group_label>
<arm_group_type>Other</arm_group_type>
<description>Control</description>
</arm_group>
<intervention>
<intervention_type>Other</intervention_type>
<intervention_name>DHA and ARA</intervention_name>
<description>various levels of DHA and ARA</description>
<arm_group_label>1</arm_group_label>
<arm_group_label>2</arm_group_label>
<arm_group_label>3</arm_group_label>
</intervention>
<intervention>
<intervention_type>Other</intervention_type>
<intervention_name>Control</intervention_name>
<arm_group_label>4</arm_group_label>
</intervention>
</clinical_study>
What I would like is a dictionary that looks like this:
{
'id_info_org_study_id': '3370-2(-4)',
'id_info_nct_id': 'NCT00753818',
'id_info_nct_alias': 'NCT00222157',
'brief_title': 'Developmental Effects...'
}
Is this possible with lxml - or indeed any other Python library?
UPDATE:
I ended up doing it this way:
response = requests.get(url)
tree = lxml.etree.fromstring(response.content)
mydict = self._recurse_over_nodes(tree, None, {})
def _recurse_over_nodes(self, tree, parent_key, data):
for branch in tree:
key = branch.tag
if branch.getchildren():
if parent_key:
key = '%s_%s' % (parent_key, key)
data = self._recurse_over_nodes(branch, key, data)
else:
if parent_key:
key = '%s_%s' % (parent_key, key)
if key in data:
data[key] = data[key] + ', %s' % branch.text
else:
data[key] = branch.text
return data

Use the iter method.
http://lxml.de/api/lxml.etree._Element-class.html#iter
Here is a functioning example.
#!/usr/bin/python
from lxml import etree
xml='''
<book>
<chapter id="113">
<sentence id="1" drums='Neil'>
<word id="128160" bass='Geddy'>
<POS Tag="V"/>
<grammar type="STEM"/>
<Aspect type="IMPV"/>
<Number type="S"/>
</word>
<word id="128161">
<POS Tag="V"/>
<grammar type="STEM"/>
<Aspect type="IMPF"/>
</word>
</sentence>
<sentence id="2">
<word id="128162">
<POS Tag="P"/>
<grammar type="PREFIX"/>
<Tag Tag="bi+"/>
</word>
</sentence>
</chapter>
</book>
'''
filename='/usr/share/sri/configurations/saved/test1.xml'
if __name__ == '__main__':
root = etree.fromstring(xml)
# iter will return every node in the document
#
for node in root.iter('*'):
# nodes of length zero are leaf nodes
#
if 0 == len(node):
print node
Here is the output:
$ ./verifyXmlWithDirs.py
<Element POS at 0x176dcf8>
<Element grammar at 0x176da70>
<Element Aspect at 0x176dc20>
<Element Number at 0x176dcf8>
<Element POS at 0x176dc20>
<Element grammar at 0x176dcf8>
<Element Aspect at 0x176da70>
<Element POS at 0x176da70>
<Element grammar at 0x176dc20>
<Element Tag at 0x176dcf8>

Supposed you have done getroot(), something simple like below can construct a dictionary with what you expected:
import lxml.etree
tree = lxml.etree.parse('sample_ctgov.xml')
root = tree.getroot()
d = {}
for node in root:
key = node.tag
if node.getchildren():
for child in node:
key += '_' + child.tag
d.update({key: child.text})
else:
d.update({key: node.text})
Should do the trick, not optimised nor recursively hunt all children nodes, but you get the idea where to start.

Try this out:
from xml.etree import ElementTree
def crawl(root, prefix='', memo={}):
new_prefix = root.tag
if len(prefix) > 0:
new_prefix = prefix + "_" + new_prefix
for child in root.getchildren():
crawl(child, new_prefix, memo)
if len(root.getchildren()) == 0:
memo[new_prefix] = root.text
return memo
e = ElementTree.parse("data.xml")
nodes = crawl(e.getroot())
for k, v in nodes.items():
print k, v
crawl initially takes in the root of an xml tree. It then walks all of its children (recursively) keeping track of all of the tags it went over to get there (that's the whole prefix thing). When it finally finds an element with no children, it saves that data in memo.
Part of the output:
clinical_study_intervention_intervention_name Control clinical_study_phase
N/A clinical_study_arm_group_arm_group_type Other
clinical_study_id_info_nct_id NCT00753818

How to merge 2 xml files with namespaces

I am trying to merge two XML files using ElementTree module. Following are the XMLs:
a.xml:
<?xml version="1.0"?>
<ListOrdersResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
<ListOrdersResult>
<NextToken>token</NextToken>
<CreatedBefore>2014-10-07T08:13:11Z</CreatedBefore>
<Orders>
<Order>
<AmazonOrderId>12345</AmazonOrderId>
<SellerOrderId>R12345</SellerOrderId>
<PurchaseDate>2014-10-02T14:40:37Z</PurchaseDate>
<LastUpdateDate>2014-10-03T09:47:02Z</LastUpdateDate>
<OrderStatus>Shipped</OrderStatus>
<FulfillmentChannel>MFN</FulfillmentChannel>
<SalesChannel>Amazon.in</SalesChannel>
<ShipServiceLevel>IN Exp Dom 2</ShipServiceLevel>
<ShippingAddress>
<Name>name</Name>
<AddressLine1>line1</AddressLine1>
<AddressLine2>line2</AddressLine2>
<City>Pune</City>
<StateOrRegion>Maharashtra</StateOrRegion>
<PostalCode>411027</PostalCode>
<CountryCode>IN</CountryCode>
<Phone>123456789</Phone>
</ShippingAddress>
<OrderTotal>
<CurrencyCode>INR</CurrencyCode>
<Amount>520.00</Amount>
</OrderTotal>
<NumberOfItemsShipped>1</NumberOfItemsShipped>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<PaymentExecutionDetail/>
<PaymentMethod>Other</PaymentMethod>
<MarketplaceId>mid</MarketplaceId>
<BuyerEmail>email#buyer.com</BuyerEmail>
<BuyerName>name</BuyerName>
<ShipmentServiceLevelCategory>Expedited</ShipmentServiceLevelCategory>
<ShippedByAmazonTFM>false</ShippedByAmazonTFM>
<TFMShipmentStatus>Delivered</TFMShipmentStatus>
<OrderType>StandardOrder</OrderType>
<EarliestShipDate>2014-10-05T18:30:00Z</EarliestShipDate>
<LatestShipDate>2014-10-07T18:29:59Z</LatestShipDate>
<EarliestDeliveryDate>2014-10-07T18:30:00Z</EarliestDeliveryDate>
<LatestDeliveryDate>2014-10-11T18:29:59Z</LatestDeliveryDate>
</Order>
</Orders>
</ListOrdersResult>
</ListOrdersResponse>
b.xml:
<?xml version="1.0"?>
<ListOrdersByNextTokenResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
<ListOrdersByNextTokenResult>
<NextToken>token1</NextToken>
<CreatedBefore>2014-10-07T08:13:11Z</CreatedBefore>
<Orders>
<Order>
<AmazonOrderId>oid1</AmazonOrderId>
<PurchaseDate>2014-10-04T13:37:41Z</PurchaseDate>
<LastUpdateDate>2014-10-06T09:52:21Z</LastUpdateDate>
<OrderStatus>Shipped</OrderStatus>
<FulfillmentChannel>MFN</FulfillmentChannel>
<SalesChannel>Amazon.in</SalesChannel>
<ShipServiceLevel>IN Std Dom 2_50k_cod</ShipServiceLevel>
<ShippingAddress>
<Name>name1</Name>
<AddressLine1>line1-1</AddressLine1>
<AddressLine2>line2-1</AddressLine2>
<City>WADHVANCITY,SURENDRANAGAR</City>
<StateOrRegion>Gujarat</StateOrRegion>
<PostalCode>363035</PostalCode>
<CountryCode>IN</CountryCode>
<Phone>987654321</Phone>
</ShippingAddress>
<OrderTotal>
<CurrencyCode>INR</CurrencyCode>
<Amount>242.00</Amount>
</OrderTotal>
<NumberOfItemsShipped>1</NumberOfItemsShipped>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<PaymentExecutionDetail/>
<PaymentMethod>Other</PaymentMethod>
<MarketplaceId>mid1</MarketplaceId>
<BuyerEmail>email1#buyer.com</BuyerEmail>
<BuyerName>name1</BuyerName>
<ShipmentServiceLevelCategory>Standard</ShipmentServiceLevelCategory>
<ShippedByAmazonTFM>false</ShippedByAmazonTFM>
<TFMShipmentStatus>PendingPickUp</TFMShipmentStatus>
<OrderType>StandardOrder</OrderType>
<EarliestShipDate>2014-10-05T18:30:00Z</EarliestShipDate>
<LatestShipDate>2014-10-07T18:29:59Z</LatestShipDate>
<EarliestDeliveryDate>2014-10-09T18:30:00Z</EarliestDeliveryDate>
<LatestDeliveryDate>2014-10-15T18:29:59Z</LatestDeliveryDate>
</Order>
</Orders>
</ListOrdersByNextTokenResult>
</ListOrdersByNextTokenResponse>
I want to add the elements inside Orders elemnt in b.xml to that of a.xml
So, the expected output is:
<?xml version="1.0"?>
<ListOrdersResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
<ListOrdersResult>
<NextToken>token</NextToken>
<CreatedBefore>2014-10-07T08:13:11Z</CreatedBefore>
<Orders>
<Order>
<AmazonOrderId>12345</AmazonOrderId>
<SellerOrderId>R12345</SellerOrderId>
<PurchaseDate>2014-10-02T14:40:37Z</PurchaseDate>
<LastUpdateDate>2014-10-03T09:47:02Z</LastUpdateDate>
<OrderStatus>Shipped</OrderStatus>
<FulfillmentChannel>MFN</FulfillmentChannel>
<SalesChannel>Amazon.in</SalesChannel>
<ShipServiceLevel>IN Exp Dom 2</ShipServiceLevel>
<ShippingAddress>
<Name>name</Name>
<AddressLine1>line1</AddressLine1>
<AddressLine2>line2</AddressLine2>
<City>Pune</City>
<StateOrRegion>Maharashtra</StateOrRegion>
<PostalCode>411027</PostalCode>
<CountryCode>IN</CountryCode>
<Phone>123456789</Phone>
</ShippingAddress>
<OrderTotal>
<CurrencyCode>INR</CurrencyCode>
<Amount>520.00</Amount>
</OrderTotal>
<NumberOfItemsShipped>1</NumberOfItemsShipped>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<PaymentExecutionDetail/>
<PaymentMethod>Other</PaymentMethod>
<MarketplaceId>mid</MarketplaceId>
<BuyerEmail>email#buyer.com</BuyerEmail>
<BuyerName>name</BuyerName>
<ShipmentServiceLevelCategory>Expedited</ShipmentServiceLevelCategory>
<ShippedByAmazonTFM>false</ShippedByAmazonTFM>
<TFMShipmentStatus>Delivered</TFMShipmentStatus>
<OrderType>StandardOrder</OrderType>
<EarliestShipDate>2014-10-05T18:30:00Z</EarliestShipDate>
<LatestShipDate>2014-10-07T18:29:59Z</LatestShipDate>
<EarliestDeliveryDate>2014-10-07T18:30:00Z</EarliestDeliveryDate>
<LatestDeliveryDate>2014-10-11T18:29:59Z</LatestDeliveryDate>
</Order>
<Order>
<AmazonOrderId>oid1</AmazonOrderId>
<PurchaseDate>2014-10-04T13:37:41Z</PurchaseDate>
<LastUpdateDate>2014-10-06T09:52:21Z</LastUpdateDate>
<OrderStatus>Shipped</OrderStatus>
<FulfillmentChannel>MFN</FulfillmentChannel>
<SalesChannel>Amazon.in</SalesChannel>
<ShipServiceLevel>IN Std Dom 2_50k_cod</ShipServiceLevel>
<ShippingAddress>
<Name>name1</Name>
<AddressLine1>line1-1</AddressLine1>
<AddressLine2>line2-1</AddressLine2>
<City>WADHVANCITY,SURENDRANAGAR</City>
<StateOrRegion>Gujarat</StateOrRegion>
<PostalCode>363035</PostalCode>
<CountryCode>IN</CountryCode>
<Phone>987654321</Phone>
</ShippingAddress>
<OrderTotal>
<CurrencyCode>INR</CurrencyCode>
<Amount>242.00</Amount>
</OrderTotal>
<NumberOfItemsShipped>1</NumberOfItemsShipped>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<PaymentExecutionDetail/>
<PaymentMethod>Other</PaymentMethod>
<MarketplaceId>mid1</MarketplaceId>
<BuyerEmail>email1#buyer.com</BuyerEmail>
<BuyerName>name1</BuyerName>
<ShipmentServiceLevelCategory>Standard</ShipmentServiceLevelCategory>
<ShippedByAmazonTFM>false</ShippedByAmazonTFM>
<TFMShipmentStatus>PendingPickUp</TFMShipmentStatus>
<OrderType>StandardOrder</OrderType>
<EarliestShipDate>2014-10-05T18:30:00Z</EarliestShipDate>
<LatestShipDate>2014-10-07T18:29:59Z</LatestShipDate>
<EarliestDeliveryDate>2014-10-09T18:30:00Z</EarliestDeliveryDate>
<LatestDeliveryDate>2014-10-15T18:29:59Z</LatestDeliveryDate>
</Order>
</Orders>
</ListOrdersResult>
</ListOrdersResponse>
I tried:
import xml.etree.ElementTree as ET
import os
import shlex
import subprocess
tree = ET.parse("a.xml")
root = tree.getroot()
combined_xml = root
namespaces = {'resp': 'https://mws.amazonservices.com/Orders/2013-09-01'}
results = combined_xml.find("resp:ListOrdersResult", namespaces=namespaces)
insertion_point = results.find("resp:Orders", namespaces=namespaces)
tree1 = ET.parse("b.xml")
root1 = tree1.getroot()
results1 = root1.find("resp:ListOrdersByNextTokenResult", namespaces=namespaces)
order_array1 = results1.find("resp:Orders", namespaces=namespaces)
for order in order_array1:
insertion_point.extend(order)
print ET.tostring(combined_xml)
But I am getting the following output:
<ns0:ListOrdersResponse xmlns:ns0="https://mws.amazonservices.com/Orders/2013-09-01">
<ns0:ListOrdersResult>
<ns0:NextToken>token</ns0:NextToken>
<ns0:CreatedBefore>2014-10-07T08:13:11Z</ns0:CreatedBefore>
<ns0:Orders>
<ns0:Order>
<ns0:AmazonOrderId>12345</ns0:AmazonOrderId>
<ns0:SellerOrderId>R12345</ns0:SellerOrderId>
<ns0:PurchaseDate>2014-10-02T14:40:37Z</ns0:PurchaseDate>
<ns0:LastUpdateDate>2014-10-03T09:47:02Z</ns0:LastUpdateDate>
<ns0:OrderStatus>Shipped</ns0:OrderStatus>
<ns0:FulfillmentChannel>MFN</ns0:FulfillmentChannel>
<ns0:SalesChannel>Amazon.in</ns0:SalesChannel>
<ns0:ShipServiceLevel>IN Exp Dom 2</ns0:ShipServiceLevel>
<ns0:ShippingAddress>
<ns0:Name>name</ns0:Name>
<ns0:AddressLine1>line1</ns0:AddressLine1>
<ns0:AddressLine2>line2</ns0:AddressLine2>
<ns0:City>Pune</ns0:City>
<ns0:StateOrRegion>Maharashtra</ns0:StateOrRegion>
<ns0:PostalCode>411027</ns0:PostalCode>
<ns0:CountryCode>IN</ns0:CountryCode>
<ns0:Phone>123456789</ns0:Phone>
</ns0:ShippingAddress>
<ns0:OrderTotal>
<ns0:CurrencyCode>INR</ns0:CurrencyCode>
<ns0:Amount>520.00</ns0:Amount>
</ns0:OrderTotal>
<ns0:NumberOfItemsShipped>1</ns0:NumberOfItemsShipped>
<ns0:NumberOfItemsUnshipped>0</ns0:NumberOfItemsUnshipped>
<ns0:PaymentExecutionDetail />
<ns0:PaymentMethod>Other</ns0:PaymentMethod>
<ns0:MarketplaceId>mid</ns0:MarketplaceId>
<ns0:BuyerEmail>email#buyer.com</ns0:BuyerEmail>
<ns0:BuyerName>name</ns0:BuyerName>
<ns0:ShipmentServiceLevelCategory>Expedited</ns0:ShipmentServiceLevelCategory>
<ns0:ShippedByAmazonTFM>false</ns0:ShippedByAmazonTFM>
<ns0:TFMShipmentStatus>Delivered</ns0:TFMShipmentStatus>
<ns0:OrderType>StandardOrder</ns0:OrderType>
<ns0:EarliestShipDate>2014-10-05T18:30:00Z</ns0:EarliestShipDate>
<ns0:LatestShipDate>2014-10-07T18:29:59Z</ns0:LatestShipDate>
<ns0:EarliestDeliveryDate>2014-10-07T18:30:00Z</ns0:EarliestDeliveryDate>
<ns0:LatestDeliveryDate>2014-10-11T18:29:59Z</ns0:LatestDeliveryDate>
</ns0:Order>
<ns0:AmazonOrderId>oid1</ns0:AmazonOrderId>
<ns0:PurchaseDate>2014-10-04T13:37:41Z</ns0:PurchaseDate>
<ns0:LastUpdateDate>2014-10-06T09:52:21Z</ns0:LastUpdateDate>
<ns0:OrderStatus>Shipped</ns0:OrderStatus>
<ns0:FulfillmentChannel>MFN</ns0:FulfillmentChannel>
<ns0:SalesChannel>Amazon.in</ns0:SalesChannel>
<ns0:ShipServiceLevel>IN Std Dom 2_50k_cod</ns0:ShipServiceLevel>
<ns0:ShippingAddress>
<ns0:Name>name1</ns0:Name>
<ns0:AddressLine1>line1-1</ns0:AddressLine1>
<ns0:AddressLine2>line2-1</ns0:AddressLine2>
<ns0:City>WADHVANCITY,SURENDRANAGAR</ns0:City>
<ns0:StateOrRegion>Gujarat</ns0:StateOrRegion>
<ns0:PostalCode>363035</ns0:PostalCode>
<ns0:CountryCode>IN</ns0:CountryCode>
<ns0:Phone>987654321</ns0:Phone>
</ns0:ShippingAddress>
<ns0:OrderTotal>
<ns0:CurrencyCode>INR</ns0:CurrencyCode>
<ns0:Amount>242.00</ns0:Amount>
</ns0:OrderTotal>
<ns0:NumberOfItemsShipped>1</ns0:NumberOfItemsShipped>
<ns0:NumberOfItemsUnshipped>0</ns0:NumberOfItemsUnshipped>
<ns0:PaymentExecutionDetail />
<ns0:PaymentMethod>Other</ns0:PaymentMethod>
<ns0:MarketplaceId>mid1</ns0:MarketplaceId>
<ns0:BuyerEmail>email1#byer.com</ns0:BuyerEmail>
<ns0:BuyerName>name1</ns0:BuyerName>
<ns0:ShipmentServiceLevelCategory>Standard</ns0:ShipmentServiceLevelCategory>
<ns0:ShippedByAmazonTFM>false</ns0:ShippedByAmazonTFM>
<ns0:TFMShipmentStatus>PendingPickUp</ns0:TFMShipmentStatus>
<ns0:OrderType>StandardOrder</ns0:OrderType>
<ns0:EarliestShipDate>2014-10-05T18:30:00Z</ns0:EarliestShipDate>
<ns0:LatestShipDate>2014-10-07T18:29:59Z</ns0:LatestShipDate>
<ns0:EarliestDeliveryDate>2014-10-09T18:30:00Z</ns0:EarliestDeliveryDate>
<ns0:LatestDeliveryDate>2014-10-15T18:29:59Z</ns0:LatestDeliveryDate>
</ns0:Orders>
</ns0:ListOrdersResult>
</ns0:ListOrdersResponse>
Why am I getting ns0? Also, <Order> tag is missing for the second order. How can I get the desired output without ns0. I am ok with suggestions for using another module if it makes life easier.:)
Thanks

ns0 means 'namespace 0' - it's a result of your namespace dict and "resp:tagname" terms.
I'd really recommend using beautifulsoup4 for this, though - it's much nicer for working with xml:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('a.xml'))
insertion_point = soup.listordersresult.orders
orders_b = BeautifulSoup(open('b.xml')).listordersbynexttokenresult.orders
# could probably just be orders_b = BeautifulSoup(open('b.xml'))
orders_to_insert = orders_b.find_all('order')
for order in orders_to_insert:
insertion_point.append(order)
print(soup)

import xml.etree.ElementTree as ET
from StringIO import StringIO
namespaces = {'resp': 'https://mws.amazonservices.com/Orders/2013-09-01'}
tree = ET.parse("a.xml")
root = tree.getroot()
results = root.find("resp:ListOrdersResult", namespaces=namespaces)
order_array = results.find("resp:Orders", namespaces=namespaces).getchildren()
tree1 = ET.parse("b.xml")
root1 = tree1.getroot()
results1 = root1.find("resp:ListOrdersByNextTokenResult", namespaces=namespaces)
order_array1 = results1.find("resp:Orders", namespaces=namespaces).getchildren()
for order in order_array1:
order_array.append(order)
tree.write("temp.xml")
correct_data = open("temp.xml").read().replace('ns0:', '').replace(':ns0','')
filewrite = open("combined.xml", 'w')
filewrite.write(correct_data)
filewrite.close()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to parse xml string having deep structures using python - python

Related

How to extract data from GML file

Python XML Parser Issue

Python and products file XML

lxml: Get all leaf nodes?

How to merge 2 xml files with namespaces

Categories

Resources