Error in parsing xml using python due to namespace present

I am using the script below to remove child nodes from the XML below based on the image type. The script fails with an error because of the xmlns declaration on the root element. When I remove that declaration the script runs, but it removes only 3 of the 5 matching child nodes.
Can you please check?
<?xml version="1.0" encoding="UTF-8"?>
<!-- Copyright (c) All rights reserved. -->
<dummy_list xmlns="https://dummy_list_file"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="template.xsd">
<dummy_capability>
<dummy_type>1</dummy_type>
<dummy_type_string>dummy_3700E</dummy_type_string>
<dummy_image>c3700</dummy_image>
<dummy_string>dummy3702E,dummy3701E</dummy_string>
<dummy_capabilities>
<CSTREAMS>True</CSTREAMS>
<ABC_SUPPORTED>True</ABC_SUPPORTED>
<THRESHOLD_SUPPORTED>True</THRESHOLD_SUPPORTED>
<FABRIC_CABLE>True</FABRIC_CABLE>
</dummy_capabilities>
</dummy_capability>
<dummy_capability>
<dummy_type>2</dummy_type>
<dummy_type_string>dummy_2700E</dummy_type_string>
<dummy_image>c2700</dummy_image>
<dummy_string>dummy2702E,dummy2701E</dummy_string>
<dummy_capabilities>
<CSTREAMS>True</CSTREAMS>
<ABC_SUPPORTED>True</ABC_SUPPORTED>
<THRESHOLD_SUPPORTED>True</THRESHOLD_SUPPORTED>
<FABRIC_CABLE>True</FABRIC_CABLE>
</dummy_capabilities>
</dummy_capability>
<dummy_capability>
<dummy_type>3</dummy_type>
<dummy_type_string>dummy_1700E</dummy_type_string>
<dummy_image>c1700</dummy_image>
<dummy_string>dummy1702E,dummy1701E</dummy_string>
<dummy_capabilities>
<CSTREAMS>True</CSTREAMS>
<ABC_SUPPORTED>True</ABC_SUPPORTED>
<THRESHOLD_SUPPORTED>True</THRESHOLD_SUPPORTED>
<FABRIC_CABLE>True</FABRIC_CABLE>
</dummy_capabilities>
</dummy_capability>
<dummy_capability>
<dummy_type>4</dummy_type>
<dummy_type_string>dummy_4700E</dummy_type_string>
<dummy_image>c4700</dummy_image>
<dummy_string>dummy4702E,dummy4701E</dummy_string>
<dummy_capabilities>
<CSTREAMS>True</CSTREAMS>
<ABC_SUPPORTED>True</ABC_SUPPORTED>
<THRESHOLD_SUPPORTED>True</THRESHOLD_SUPPORTED>
<FABRIC_CABLE>True</FABRIC_CABLE>
</dummy_capabilities>
</dummy_capability>
<dummy_capability>
<dummy_type>4</dummy_type>
<dummy_type_string>dummy_4700E</dummy_type_string>
<dummy_image>c4700</dummy_image>
<dummy_string>dummy4702E,dummy4701E</dummy_string>
<dummy_capabilities>
<CSTREAMS>True</CSTREAMS>
<ABC_SUPPORTED>True</ABC_SUPPORTED>
<THRESHOLD_SUPPORTED>True</THRESHOLD_SUPPORTED>
<FABRIC_CABLE>True</FABRIC_CABLE>
</dummy_capabilities>
</dummy_capability>
<dummy_capability>
<dummy_type>4</dummy_type>
<dummy_type_string>dummy_4700E</dummy_type_string>
<dummy_image>c4700</dummy_image>
<dummy_string>dummy4702E,dummy4701E</dummy_string>
<dummy_capabilities>
<CSTREAMS>True</CSTREAMS>
<ABC_SUPPORTED>True</ABC_SUPPORTED>
<THRESHOLD_SUPPORTED>True</THRESHOLD_SUPPORTED>
<FABRIC_CABLE>True</FABRIC_CABLE>
</dummy_capabilities>
</dummy_capability>
</dummy_list>
#!/router/bin/python3-3.6.3
from xml.etree.ElementTree import ElementTree
tree = ElementTree()
tree.parse('dummy.xml')
root = tree.getroot()
for child in root:
    if (child.find('dummy_image').text == 'c3700'):
        print("Removing child: " + child.find('dummy_image').text)
        root.remove(child)
tree.write('out.xml')
How can I parse the file with these declarations still present?
xmlns="https://dummy_list_file"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="template.xsd"
Why does it not remove all the child nodes for a particular image type?
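A likely explanation for both symptoms, offered as a sketch rather than a verified diagnosis of the script above: with the default xmlns in place, every element name carries the namespace, so child.find('dummy_image') returns None; and calling root.remove(child) while iterating over root shifts the remaining children, so the loop skips the element that follows each removal. The example below uses only the standard library, searches through a prefix map, and iterates over a copy of the children. The prefix d, the helper name remove_by_image and the shortened sample XML are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NS_URI = "https://dummy_list_file"
NS = {"d": NS_URI}  # "d" is an arbitrary prefix chosen for this sketch

# Keep the default namespace unprefixed when the tree is serialized again.
ET.register_namespace("", NS_URI)

# Shortened stand-in for dummy.xml (assumption: same element names).
SAMPLE = """<dummy_list xmlns="https://dummy_list_file">
  <dummy_capability><dummy_image>c3700</dummy_image></dummy_capability>
  <dummy_capability><dummy_image>c4700</dummy_image></dummy_capability>
  <dummy_capability><dummy_image>c4700</dummy_image></dummy_capability>
  <dummy_capability><dummy_image>c4700</dummy_image></dummy_capability>
</dummy_list>"""


def remove_by_image(root, image_name):
    """Remove every dummy_capability whose dummy_image equals image_name."""
    removed = 0
    # list(...) takes a snapshot, so removing children does not disturb the loop.
    for child in list(root.findall("d:dummy_capability", NS)):
        image = child.find("d:dummy_image", NS)
        if image is not None and image.text == image_name:
            root.remove(child)
            removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed_count = remove_by_image(root, "c4700")
remaining = [e.text for e in root.iter("{%s}dummy_image" % NS_URI)]
output_xml = ET.tostring(root, encoding="unicode")
print(removed_count, remaining)
```

For a file on disk, ET.parse('dummy.xml') followed by tree.write('out.xml') would take the place of fromstring and tostring.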

Another method.
from simplified_scrapy import SimplifiedDoc,utils
import json
xml = utils.getFileContent('dummy.xml')
doc = SimplifiedDoc(xml)
dummy_capabilitys = doc.selects('dummy_image').contains('c3700').parent
for dummy_capability in dummy_capabilitys:
    dummy_capability.repleaceSelf("")
utils.saveFile("out.xml",doc.html)
# Get attributes
root = doc.select('dummy_list')
print (root["xmlns"],root["xmlns:xsi"],root["xsi:schemaLocation"])
Result:
https://dummy_list_file http://www.w3.org/2001/XMLSchema-instance template.xsd
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

Related

How to extract data from GML file

I have a text file and would like to extract <gml:pos>73664.300 836542.700</gml:pos> from it. More precisely, I would like to get the GPS coordinates [73664.300 836542.700] from the pos tag. The file contains multiple <wfs:member> elements, and each of them has a <gml:pos> (at the deepest level).
<?xml version='1.0' encoding='UTF-8'?>
<wfs:FeatureCollection xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wfs/2.0 http://schemas.opengis.net/wfs/2.0/wfs.xsd http://www.opengis.net/gml/3.2 http://schemas.opengis.net/gml/3.2.1/gml.xsd http://www.deegree.org/app https://web.de/feature_descr?SERVICE=WFS&VERSION=2.0.0&REQUEST=DescribeFeatureType&OUTPUTFORMAT=application%2Fgml%2Bxml%3B+version%3D3.2&TYPENAME=app:lsa_data&NAMESPACES=xmlns(app,http%3A%2F%2Fwww.deegree.org%2Fapp)" xmlns:wfs="http://www.opengis.net/wfs/2.0" timeStamp="2020-11-18T15:01:17Z" xmlns:gml="http://www.opengis.net/gml/3.2" numberMatched="unknown" numberReturned="0">
<!--NOTE: numberReturned attribute should be 'unknown' as well, but this would not validate against the current version of the WFS 2.0 schema (change upcoming). See change request (CR 144): https://portal.opengeospatial.org/files?fact_id=6798.-->
<wfs:member>
<app:dat_set xmlns:app="http://www.deegree.org/app" gml:id="app:dat_set_1">
<app:point>2</app:point>
<app:art>K </app:art>
<app:L_Name>westt / woustest </app:L_Name>
<app:geom>
<!--Inlined geometry 'data_1_APP_GEOM'-->
<gml:MultiPoint gml:id="data_1_APP_GEOM" srsName="EPSG:25832">
<gml:pointMember>
<gml:Point gml:id="GEOMETRY_ad608059-f297-4554-8464-cdde248cb531" srsName="EPSG:25832">
<gml:pos>73664.300 836542.700</gml:pos>
</gml:Point>
</gml:pointMember>
</gml:MultiPoint>
</app:geom>
</app:dat_set>
</wfs:member>
<wfs:member>
<app:dat_set xmlns:app="http://www.deegree.org/app" gml:id="app:dat_set_2">
<app:point>3</app:point>
<app:art>K </app:art>
<app:L_Name>route / riztr </app:L_Name>
<app:geom>
<!--Inlined geometry 'data_2_APP_GEOM'-->
<gml:MultiPoint gml:id="data_2_APP_GEOM" srsName="EPSG:25832">
<gml:pointMember>
<gml:Point gml:id="GEOMETRY_440d8630-b674-4768-a5b7-3fab46d9ac8c" srsName="EPSG:25832">
<gml:pos>74354.900 837456.300</gml:pos>
</gml:Point>
</gml:pointMember>
</gml:MultiPoint>
</app:geom>
</app:dat_set>
</wfs:member>
<wfs:member>
...
...
How can I get those GPS coordinates?
Thank you in advance.

You can use lxml and XPath.
data = b'''\
<?xml version='1.0' encoding='UTF-8'?>
<wfs:FeatureCollection xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/wfs/2.0 http://schemas.opengis.net/wfs/2.0/wfs.xsd http://www.opengis.net/gml/3.2 http://schemas.opengis.net/gml/3.2.1/gml.xsd http://www.deegree.org/app https://web.de/feature_descr?SERVICE=WFS&VERSION=2.0.0&REQUEST=DescribeFeatureType&OUTPUTFORMAT=application%2Fgml%2Bxml%3B+version%3D3.2&TYPENAME=app:lsa_data&NAMESPACES=xmlns(app,http%3A%2F%2Fwww.deegree.org%2Fapp)" xmlns:wfs="http://www.opengis.net/wfs/2.0" timeStamp="2020-11-18T15:01:17Z" xmlns:gml="http://www.opengis.net/gml/3.2" numberMatched="unknown" numberReturned="0">
<!--NOTE: numberReturned attribute should be 'unknown' as well, but this would not validate against the current version of the WFS 2.0 schema (change upcoming). See change request (CR 144): https://portal.opengeospatial.org/files?fact_id=6798.-->
<wfs:member>
<app:dat_set xmlns:app="http://www.deegree.org/app" gml:id="app:dat_set_1">
<app:point>2</app:point>
<app:art>K </app:art>
<app:L_Name>westt / woustest </app:L_Name>
<app:geom>
<!--Inlined geometry 'data_1_APP_GEOM'-->
<gml:MultiPoint gml:id="data_1_APP_GEOM" srsName="EPSG:25832">
<gml:pointMember>
<gml:Point gml:id="GEOMETRY_ad608059-f297-4554-8464-cdde248cb531" srsName="EPSG:25832">
<gml:pos>73664.300 836542.700</gml:pos>
</gml:Point>
</gml:pointMember>
</gml:MultiPoint>
</app:geom>
</app:dat_set>
</wfs:member>
<wfs:member>
<app:dat_set xmlns:app="http://www.deegree.org/app" gml:id="app:dat_set_2">
<app:point>3</app:point>
<app:art>K </app:art>
<app:L_Name>route / riztr </app:L_Name>
<app:geom>
<!--Inlined geometry 'data_2_APP_GEOM'-->
<gml:MultiPoint gml:id="data_2_APP_GEOM" srsName="EPSG:25832">
<gml:pointMember>
<gml:Point gml:id="GEOMETRY_440d8630-b674-4768-a5b7-3fab46d9ac8c" srsName="EPSG:25832">
<gml:pos>74354.900 837456.300</gml:pos>
</gml:Point>
</gml:pointMember>
</gml:MultiPoint>
</app:geom>
</app:dat_set>
</wfs:member>
</wfs:FeatureCollection>
'''
from lxml import etree
from io import BytesIO
f = BytesIO(data)
ns = {"gml": "http://www.opengis.net/gml/3.2"}
tree = etree.parse(f)
for e in tree.findall("//gml:pos", ns):
    print(e.text)
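If lxml is not available, the same lookup can be sketched with the standard library, here also splitting each gml:pos value into a pair of floats. The shortened sample document and the variable names are assumptions for illustration; only the gml namespace URI is taken from the question.

```python
import xml.etree.ElementTree as ET

GML_NS = {"gml": "http://www.opengis.net/gml/3.2"}

# Shortened stand-in for the WFS response (assumption: same gml:pos layout).
SAMPLE = """<wfs:FeatureCollection xmlns:wfs="http://www.opengis.net/wfs/2.0"
    xmlns:gml="http://www.opengis.net/gml/3.2">
  <wfs:member><gml:Point><gml:pos>73664.300 836542.700</gml:pos></gml:Point></wfs:member>
  <wfs:member><gml:Point><gml:pos>74354.900 837456.300</gml:pos></gml:Point></wfs:member>
</wfs:FeatureCollection>"""

root = ET.fromstring(SAMPLE)

# ".//gml:pos" matches gml:pos at any depth below the root.
coordinates = [
    tuple(float(value) for value in pos.text.split())
    for pos in root.iterfind(".//gml:pos", GML_NS)
]
print(coordinates)
```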

Python XML Parser Issue

I am new to Python, so apologies if this is a basic question.
I am trying to read an XML file into a Python object (preferably a pandas DataFrame).
For now I am just trying to print the variables to check that I can read them properly in tabular form.
I have used xml.etree.ElementTree for this, but I might not be using it as intended.
Code:
import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
ODM = tree.getroot()
ns = {'xmlns': 'http://www.cdisc.org/ns/odm/v1.3',
      'mdsol': 'http://www.mdsol.com/ns/odm/metadata'}
for ClinicalData in ODM:
    LocationOID=None
    #print(ClinicalData.tag, ClinicalData.attrib)
    for SubjectData in ClinicalData:
        for SiteRef in SubjectData:
            LocationOID=SiteRef.attrib.get('LocationOID')
        for StudyEventData in SubjectData:
            for AuditRecord in StudyEventData:
                print(ClinicalData.attrib.get('MetaDataVersionOID'),
                      ClinicalData.attrib.get('AuditSubCategoryName'), #null output due to namespace issue
                      SubjectData.attrib.get('SubjectKey'),
                      SubjectData.attrib.get('SubjectName'), #null output due to namespace issue
                      LocationOID, #not sure what is the issue
                      StudyEventData.attrib.get('StudyEventRepeatKey'),
                      AuditRecord.find('DateTimeStamp') #not sure what is the issue
                      )
Input:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata"
CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">
<ClinicalData MetaDataVersionOID="1772" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Activated">
<SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-3">
<SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960580">
<AuditRecord>
<UserRef UserOID="systemuser"/>
<LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
<ReasonForChange>Update</ReasonForChange>
<SourceID>394263772</SourceID>
</AuditRecord>
</StudyEventData>
</SubjectData>
</ClinicalData>
</ODM>
I expect every printed variable to hold the corresponding value from the XML file. Please let me know if there is a better way of doing this than nesting so many loops.

Namespaces are a pain using ElementTree. See this discussion.
Short answer:
for ClinicalData in ODM:
    #print(ClinicalData.tag, ClinicalData.attrib)
    for SubjectData in ClinicalData:
        # child elements live in the default namespace: use {uri}tag
        SiteRef = SubjectData.find('{http://www.cdisc.org/ns/odm/v1.3}SiteRef')
        LocationOID = SiteRef.attrib.get('LocationOID')
        for StudyEventData in SubjectData:
            for AuditRecord in StudyEventData:
                print(
                    ClinicalData.attrib.get('MetaDataVersionOID'),
                    # mdsol: attributes need the {uri}name form
                    ClinicalData.attrib.get(
                        '{http://www.mdsol.com/ns/odm/metadata}AuditSubCategoryName'),
                    SubjectData.attrib.get('SubjectKey'),
                    SubjectData.attrib.get(
                        '{http://www.mdsol.com/ns/odm/metadata}SubjectName'),
                    LocationOID,
                    StudyEventData.attrib.get('StudyEventRepeatKey'),
                    AuditRecord.find(
                        '{http://www.cdisc.org/ns/odm/v1.3}DateTimeStamp').text
                )
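A variation on the same idea, as a sketch only: ElementTree's find, iterfind and findtext accept a prefix map, which avoids repeating the full {uri} in every element path and also makes each loop visit only the elements it names. Attribute names are not resolved through the prefix map, so namespaced attributes still need the {uri}name form. The prefix odm, the function name read_rows and the trimmed sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}  # "odm" is an arbitrary prefix
MDSOL = "{http://www.mdsol.com/ns/odm/metadata}"  # attributes need {uri}name

# Trimmed stand-in for data.xml (assumption: same structure as the question).
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata">
  <ClinicalData MetaDataVersionOID="1772" mdsol:AuditSubCategoryName="Activated">
    <SubjectData SubjectKey="7735fd9c" mdsol:SubjectName="ACC-SUBJ-3">
      <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
      <StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]">
        <AuditRecord>
          <DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
        </AuditRecord>
      </StudyEventData>
    </SubjectData>
  </ClinicalData>
</ODM>"""


def read_rows(root):
    """Collect one dict per AuditRecord, visiting only the named elements."""
    rows = []
    for clinical in root.iterfind("odm:ClinicalData", NS):
        for subject in clinical.iterfind("odm:SubjectData", NS):
            site = subject.find("odm:SiteRef", NS)
            for event in subject.iterfind("odm:StudyEventData", NS):
                for audit in event.iterfind("odm:AuditRecord", NS):
                    rows.append({
                        "MetaDataVersionOID": clinical.get("MetaDataVersionOID"),
                        "AuditSubCategoryName": clinical.get(MDSOL + "AuditSubCategoryName"),
                        "SubjectKey": subject.get("SubjectKey"),
                        "SubjectName": subject.get(MDSOL + "SubjectName"),
                        "LocationOID": site.get("LocationOID") if site is not None else None,
                        "StudyEventRepeatKey": event.get("StudyEventRepeatKey"),
                        "DateTimeStamp": audit.findtext("odm:DateTimeStamp", namespaces=NS),
                    })
    return rows


rows = read_rows(ET.fromstring(SAMPLE))
print(rows)
```

A list of dicts like this can be handed to pandas.DataFrame(rows) if a DataFrame is wanted.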

I think you can use BeautifulSoup for parsing XML:
from bs4 import BeautifulSoup
temp ="""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata"
CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">
<ClinicalData MetaDataVersionOID="1772" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Activated">
<SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-3">
<SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960580">
<AuditRecord>
<UserRef UserOID="systemuser"/>
<LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
<ReasonForChange>Update</ReasonForChange>
<SourceID>394263772</SourceID>
</AuditRecord>
</StudyEventData>
</SubjectData>
</ClinicalData>
</ODM>"""
temp=BeautifulSoup(temp,"lxml")
ClinicalData = temp.find('ClinicalData'.lower())
SubjectData = ClinicalData.find_all('SubjectData'.lower())
LocationOID=None
for i in SubjectData:
    SiteRef = i.find('SiteRef'.lower())
    LocationOID = SiteRef.attrs['locationoid']
    print('LocationOID',LocationOID)
output:
LocationOID 0ACCSP3MAPPING1SITE1

@Justin
I have applied your suggestions, it worked, until I broke it.
Input:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">
<ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
<SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-1">
<SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960564">
<FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
<ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928808">
<ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="45">
<AuditRecord>
<UserRef UserOID="alscrave2"/>
<LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
<ReasonForChange/>
<SourceID>122841525</SourceID>
</AuditRecord>
<MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
</ItemData>
</ItemGroupData>
</FormData>
</StudyEventData>
</SubjectData>
</ClinicalData>
<ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
<SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-1">
<SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960564">
<FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
<ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928809">
<ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="46">
<AuditRecord>
<UserRef UserOID="alscrave2"/>
<LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
<ReasonForChange/>
<SourceID>122841525</SourceID>
</AuditRecord>
<MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
</ItemData>
</ItemGroupData>
</FormData>
</StudyEventData>
</SubjectData>
</ClinicalData>
</ODM>
Code:
import xml.etree.ElementTree as ET
import pandas as pd
def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None
tree = ET.parse("data.xml")
ODM = tree.getroot()
xmlns = "{http://www.cdisc.org/ns/odm/v1.3}"
mdsol = "{http://www.mdsol.com/ns/odm/metadata}"
def data_reader():
    dfcols = ['CreationDateTime','StudyOID','MetaDataVersionOID','SubjectName','SUBJECTUUID','LocationOID','StudyEventOID',
              'StudyEventRepeatKey','FormOID','FormRepeatKey','DataPageId','ItemgroupOID','RecordId','var_name','Value',
              'DateTimeStamp','ASC_Name','Measurement_Unit','SourceID','UserOID','InstanceId']
    df_xml = pd.DataFrame(columns=dfcols)
    CreationDateTime = ODM.attrib.get('CreationDateTime')
    for ClinicalData in ODM:
        StudyOID = ClinicalData.attrib.get('StudyOID')
        MetaDataVersionOID = ClinicalData.attrib.get('MetaDataVersionOID')
        ASC_Name = ClinicalData.attrib.get('{0}AuditSubCategoryName'.format(mdsol))
        for SubjectData in ClinicalData:
            SubjectName = SubjectData.attrib.get('{0}SubjectName'.format(mdsol))
            SUBJECTUUID = SubjectData.attrib.get('SubjectKey')
            LocationOID = SubjectData.find('{0}SiteRef'.format(xmlns)).attrib.get('LocationOID')
            for StudyEventData in SubjectData:
                StudyEventOID = StudyEventData.attrib.get('StudyEventOID')
                StudyEventRepeatKey = StudyEventData.attrib.get('StudyEventRepeatKey')
                InstanceId = StudyEventData.attrib.get('{0}InstanceId'.format(mdsol))
                for FormData in StudyEventData:
                    FormOID = FormData.attrib.get('FormOID')
                    FormRepeatKey = FormData.attrib.get('FormRepeatKey')
                    DataPageId = FormData.attrib.get('{0}DataPageId'.format(mdsol))
                    for ItemGroupData in FormData:
                        ItemgroupOID = ItemGroupData.attrib.get('ItemgroupOID')
                        RecordId = ItemGroupData.attrib.get('{0}RecordId'.format(mdsol))
                        for ItemData in ItemGroupData:
                            var_name = ItemData.attrib.get('ItemOID')
                            Value = ItemData.attrib.get('Value')
                            Measurement_Unit = ItemData.find('MeasurementUnitRef'.format(xmlns)).attrib.get('MeasurementUnitOID')
                            for AuditRecord in ItemData:
                                DateTimeStamp = AuditRecord.find('{0}DateTimeStamp'.format(xmlns)).text;
                                SourceID = AuditRecord.find('{0}SourceID'.format(xmlns)).text;
                                UserOID = ItemData.find('{0}UserRef'.format(xmlns)).attrib.get('UserOID')
                                df_xml = df_xml.append(
                                    pd.Series([CreationDateTime,StudyOID,MetaDataVersionOID,SubjectName,
                                               SUBJECTUUID,LocationOID,StudyEventOID,
                                               StudyEventRepeatKey,FormOID,FormRepeatKey,DataPageId,ItemgroupOID,
                                               RecordId,var_name,Value,DateTimeStamp,ASC_Name,Measurement_Unit,
                                               SourceID,UserOID,InstanceId], index=dfcols),
                                    ignore_index=True)
    print(df_xml)
data_reader()
Issue: I am getting duplicate records, and the assignments to DateTimeStamp, SourceID, UserOID and Measurement_Unit raise run-time errors.
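Some probable causes can be read off the code, though this is a sketch and not a tested fix for the full file. 'MeasurementUnitRef'.format(xmlns) contains no {0} placeholder, so the namespace is never applied and find returns None. The loop "for AuditRecord in ItemData" visits every child of ItemData, including MeasurementUnitRef, which has no DateTimeStamp or SourceID. UserRef is a child of AuditRecord, not of ItemData. The example below looks each element up by its namespaced path and builds one row per ItemData; the prefix odm, the helper names attr and item_rows, and the trimmed sample are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}  # "odm" is an arbitrary prefix

# Trimmed stand-in for data.xml (assumption: same nesting as the input above).
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata">
  <ClinicalData MetaDataVersionOID="2965">
    <SubjectData SubjectKey="481e4653">
      <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
      <StudyEventData StudyEventOID="FV">
        <FormData FormOID="VS">
          <ItemGroupData ItemGroupOID="VS">
            <ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="45">
              <AuditRecord>
                <UserRef UserOID="alscrave2"/>
                <DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
                <SourceID>122841525</SourceID>
              </AuditRecord>
              <MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
            </ItemData>
          </ItemGroupData>
        </FormData>
      </StudyEventData>
    </SubjectData>
  </ClinicalData>
</ODM>"""


def attr(element, name):
    """Return an attribute value, or None when the element was not found."""
    return element.get(name) if element is not None else None


def item_rows(root):
    """Build one row per ItemData instead of one per child of ItemData."""
    rows = []
    for item in root.iterfind(".//odm:ItemData", NS):
        rows.append({
            "var_name": item.get("ItemOID"),
            "Value": item.get("Value"),
            "Measurement_Unit": attr(item.find("odm:MeasurementUnitRef", NS),
                                     "MeasurementUnitOID"),
            # UserRef, DateTimeStamp and SourceID sit inside AuditRecord.
            "UserOID": attr(item.find("odm:AuditRecord/odm:UserRef", NS), "UserOID"),
            "DateTimeStamp": item.findtext("odm:AuditRecord/odm:DateTimeStamp",
                                           namespaces=NS),
            "SourceID": item.findtext("odm:AuditRecord/odm:SourceID", namespaces=NS),
        })
    return rows


rows = item_rows(ET.fromstring(SAMPLE))
print(rows)
```

The ancestor-level columns (StudyOID, SubjectName and so on) would still be collected in the enclosing loops, as in the original code. Gathering the rows in a list and calling pandas.DataFrame(rows) once at the end also avoids growing the DataFrame inside the loop.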

Create a dataframe from nested xml and generate a csv

I have an XML file like this:
<?xml version="1.0"?>
<PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Class Def" MessageType="Integration Object">
<ListOf_Class_Def>
<ImpExp Type="CLASS_DEF" Name="lp_pkg_cla" Object_Num="1001p">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
<Object_Arrt Orig_Id="6666p" Attr_Name="LP_Portable">
</Object_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Class_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Class Def" MessageType="Integration Object">
<ListOf_Class_Def>
<ImpExp Type="CLASS_DEF" Name="M_pkg_cla" Object_Num="1023i">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
<Object_Arrt Orig_Id="7010p" Attr_Name="O_Portable">
</Object_Arrt>
<Object_Arrt Orig_Id="7012j" Attr_Name="O_wireless">
</Object_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Class_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Laptop" Object_Num="2008a">
<ListOfObject_Def>
<Object_Def Ancestor_Num="1001p" Ancestor_Name="lp_pkg_cla">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Mouse" Object_Num="2987d">
<ListOfObject_Def>
<Object_Def Ancestor_Num="1023i" Ancestor_Name="M_pkg_cla">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Speaker" Object_Num="5463g">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
</PropertySet>
I am hoping to extract the Name, Object_Num, Orig_Id and Attr_Name attributes from it using Python and convert them into .csv format.
The .csv format I'd like to see it in is simply:
ProductId  Product  AttributeId  Attribute
2008a      Laptop   6666p        LP_Portable
2987d      Mouse    7010p        O_Portable
2987d      Mouse    7012j        O_wireless
5463g      Speaker  ""           ""
The XML elements are related as follows:
All products are in elements of the form <ImpExp Type="PROD_DEF" ...>.
All attributes are in elements of the form <ImpExp Type="CLASS_DEF" ...>.
If a product has attributes, it contains an element such as
<Object_Def Ancestor_Num="1023i" ...>
whose Ancestor_Num equals the Object_Num of the corresponding
Type="CLASS_DEF" element.
I have tried this:
from lxml import etree
import pandas
import HTMLParser
inFile = "./newm.xml"
outFile = "./new.csv"
ctx1 = etree.iterparse(inFile, tag=("ImpExp", "ListOfObject_Def", "ListOfObject_Arrt",))
hp = HTMLParser.HTMLParser()
csvData = []
csvData1 = []
csvData2 = []
csvData3 = []
csvData4 = []
csvData5 = []
for event, elem in ctx1:
    value1 = elem.get("Type")
    value2 = elem.get("Name")
    value3 = elem.get("Object_Num")
    value4 = elem.get("Ancestor_Num")
    value5 = elem.get("Orig_Id")
    value6 = elem.get("Attr_Name")
    if value1 == "PROD_DEF":
        csvData.append(value2)
        csvData1.append(value3)
for event, elem in ctx1:
    if value4 is not None:
        csvData2.append(value4)
    elem.clear()
df = pandas.DataFrame({'Product':csvData, 'ProductId':csvData1, 'AncestorId':csvData2})
for event, elem in ctx1:
    if value1 == "Class Def":
        csvData3.append(value3)
        csvData4.append(value5)
        csvData5.append(value6)
    elem.clear()
df1 = pandas.DataFrame({'AncestorId':csvData3, 'AttribId':csvData4, 'AttribName':csvData5})
dff = pandas.merge(df, df1, on="AncestorId")
dff.to_csv(outFile, index = False)

Consider XSLT, a special-purpose language designed to transform XML files. It can convert XML directly to CSV (i.e., a text file) without the pandas DataFrame intermediary. Python's third-party module lxml (which you are already using) can run XSLT 1.0 scripts, and does so without for loops or if logic. However, because the alignment of products and attributes is complex, the XSLT below uses some longer XPath expressions.
XSLT (save as .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="no" method="text"/>
<xsl:strip-space elements="*"/>
<xsl:param name="delimiter">,</xsl:param>
<xsl:template match="/PropertySet">
<xsl:text>ProductId,Product,AttributeId,Attribute
</xsl:text>
<xsl:apply-templates select="*"/>
</xsl:template>
<xsl:template match="PropertySet|Message|ListOf_Class_Def|ListOf_Prod_Def|ImpExp">
<xsl:apply-templates select="*"/>
</xsl:template>
<xsl:template match="ListOfObject_Arrt">
<xsl:apply-templates select="Object_Arrt"/>
<xsl:if test="name(*) != 'Object_Arrt' and preceding-sibling::ListOfObject_Def/Object_Def/@Ancestor_Name = ''">
<xsl:value-of select="concat(ancestor::ImpExp/@Name, $delimiter,
ancestor::ImpExp/@Object_Num, $delimiter,
'', $delimiter,
'')"/><xsl:text>
</xsl:text>
</xsl:if>
</xsl:template>
<xsl:template match="Object_Arrt">
<xsl:variable name="attrName" select="ancestor::ImpExp/@Name"/>
<xsl:value-of select="concat(/PropertySet/PropertySet/Message[@IntObjectName='Prod Def']/ListOf_Prod_Def/
ImpExp[ListOfObject_Def/Object_Def/@Ancestor_Name = $attrName]/@Name, $delimiter,
/PropertySet/PropertySet/Message[@IntObjectName='Prod Def']/ListOf_Prod_Def/
ImpExp[ListOfObject_Def/Object_Def/@Ancestor_Name = $attrName]/@Object_Num, $delimiter,
@Orig_Id, $delimiter,
@Attr_Name)"/><xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
Python
import lxml.etree as et
# LOAD XML AND XSL
xml = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')
# RUN TRANSFORMATION
transform = et.XSLT(xsl)
result = transform(xml)
# OUTPUT TO FILE
with open('Output.csv', 'wb') as f:
    f.write(result)
Output
ProductId,Product,AttributeId,Attribute
Laptop,2008a,6666p,LP_Portable
Mouse,2987d,7010p,O_Portable
Mouse,2987d,7012j,O_wireless
Speaker,5463g,,

You would need to preparse all of the CLASS_DEF entries into a dictionary. These can then be looked up when processing the PROD_DEF entries:
import csv
from lxml import etree
inFile = "./newm.xml"
outFile = "./new.csv"
tree = etree.parse(inFile)
class_defs = {}
# First extract all the CLASS_DEF entries into a dictionary
for impexp in tree.iter("ImpExp"):
    name = impexp.get('Name')
    if impexp.get('Type') == "CLASS_DEF":
        for list_of_object_arrt in impexp.findall('ListOfObject_Arrt'):
            class_defs[name] = [(obj.get('Orig_Id'), obj.get('Attr_Name')) for obj in list_of_object_arrt]
with open(outFile, 'wb') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['ProductId', 'Product', 'AttributeId', 'Attribute'])
    for impexp in tree.iter("ImpExp"):
        object_num = impexp.get('Object_Num')
        name = impexp.get('Name')
        if impexp.get('Type') == "PROD_DEF":
            for list_of_object_def in impexp.findall('ListOfObject_Def'):
                for obj in list_of_object_def:
                    ancestor_num = obj.get('Ancestor_Num')
                    ancestor_name = obj.get('Ancestor_Name')
                    csv_output.writerow([object_num, name] + list(class_defs.get(ancestor_name, [['', '']])[0]))
This would produce new.csv containing:
ProductId,Product,AttributeId,Attribute
2008a,Laptop,6666p,LP_Portable
2987d,Mouse,7010p,O_Portable
5463g,Speaker,,
If you are using Python 3.x, use:
with open(outFile, 'w', newline='') as f_output:
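The output above has a single Mouse row because only the first attribute of each class is written. If one row per attribute is wanted, as in the question's desired table, the dictionary approach can be extended roughly as follows. This is a Python 3 sketch using only the standard library; the function name product_rows, the choice to key the dictionary on Object_Num, and the flattened sample (without the PropertySet/Message wrappers) are assumptions for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Flattened stand-in for newm.xml; iter() searches at any depth, so the
# PropertySet/Message wrappers of the real file should not matter.
SAMPLE = """<PropertySet>
  <ImpExp Type="CLASS_DEF" Name="lp_pkg_cla" Object_Num="1001p">
    <ListOfObject_Arrt>
      <Object_Arrt Orig_Id="6666p" Attr_Name="LP_Portable"/>
    </ListOfObject_Arrt>
  </ImpExp>
  <ImpExp Type="CLASS_DEF" Name="M_pkg_cla" Object_Num="1023i">
    <ListOfObject_Arrt>
      <Object_Arrt Orig_Id="7010p" Attr_Name="O_Portable"/>
      <Object_Arrt Orig_Id="7012j" Attr_Name="O_wireless"/>
    </ListOfObject_Arrt>
  </ImpExp>
  <ImpExp Type="PROD_DEF" Name="Laptop" Object_Num="2008a">
    <ListOfObject_Def><Object_Def Ancestor_Num="1001p" Ancestor_Name="lp_pkg_cla"/></ListOfObject_Def>
  </ImpExp>
  <ImpExp Type="PROD_DEF" Name="Mouse" Object_Num="2987d">
    <ListOfObject_Def><Object_Def Ancestor_Num="1023i" Ancestor_Name="M_pkg_cla"/></ListOfObject_Def>
  </ImpExp>
  <ImpExp Type="PROD_DEF" Name="Speaker" Object_Num="5463g">
    <ListOfObject_Def><Object_Def Ancestor_Num="" Ancestor_Name=""/></ListOfObject_Def>
  </ImpExp>
</PropertySet>"""


def product_rows(root):
    """Return [ProductId, Product, AttributeId, Attribute] rows, one per attribute."""
    # Pass 1: map each CLASS_DEF Object_Num to its list of attributes.
    class_attrs = {}
    for impexp in root.iter("ImpExp"):
        if impexp.get("Type") == "CLASS_DEF":
            class_attrs[impexp.get("Object_Num")] = [
                (a.get("Orig_Id"), a.get("Attr_Name"))
                for a in impexp.iter("Object_Arrt")
            ]
    # Pass 2: join every PROD_DEF to the attributes of its ancestor class.
    rows = []
    for impexp in root.iter("ImpExp"):
        if impexp.get("Type") != "PROD_DEF":
            continue
        attrs = []
        for obj_def in impexp.iter("Object_Def"):
            attrs.extend(class_attrs.get(obj_def.get("Ancestor_Num"), []))
        # Products without attributes still get one row with empty fields.
        for orig_id, attr_name in attrs or [("", "")]:
            rows.append([impexp.get("Object_Num"), impexp.get("Name"),
                         orig_id, attr_name])
    return rows


rows = product_rows(ET.fromstring(SAMPLE))
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["ProductId", "Product", "AttributeId", "Attribute"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a file instead of a string, the StringIO buffer can be replaced by open(outFile, 'w', newline='').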

How to merge 2 xml files with namespaces

I am trying to merge two XML files using ElementTree module. Following are the XMLs:
a.xml:
<?xml version="1.0"?>
<ListOrdersResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
<ListOrdersResult>
<NextToken>token</NextToken>
<CreatedBefore>2014-10-07T08:13:11Z</CreatedBefore>
<Orders>
<Order>
<AmazonOrderId>12345</AmazonOrderId>
<SellerOrderId>R12345</SellerOrderId>
<PurchaseDate>2014-10-02T14:40:37Z</PurchaseDate>
<LastUpdateDate>2014-10-03T09:47:02Z</LastUpdateDate>
<OrderStatus>Shipped</OrderStatus>
<FulfillmentChannel>MFN</FulfillmentChannel>
<SalesChannel>Amazon.in</SalesChannel>
<ShipServiceLevel>IN Exp Dom 2</ShipServiceLevel>
<ShippingAddress>
<Name>name</Name>
<AddressLine1>line1</AddressLine1>
<AddressLine2>line2</AddressLine2>
<City>Pune</City>
<StateOrRegion>Maharashtra</StateOrRegion>
<PostalCode>411027</PostalCode>
<CountryCode>IN</CountryCode>
<Phone>123456789</Phone>
</ShippingAddress>
<OrderTotal>
<CurrencyCode>INR</CurrencyCode>
<Amount>520.00</Amount>
</OrderTotal>
<NumberOfItemsShipped>1</NumberOfItemsShipped>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<PaymentExecutionDetail/>
<PaymentMethod>Other</PaymentMethod>
<MarketplaceId>mid</MarketplaceId>
<BuyerEmail>email@buyer.com</BuyerEmail>
<BuyerName>name</BuyerName>
<ShipmentServiceLevelCategory>Expedited</ShipmentServiceLevelCategory>
<ShippedByAmazonTFM>false</ShippedByAmazonTFM>
<TFMShipmentStatus>Delivered</TFMShipmentStatus>
<OrderType>StandardOrder</OrderType>
<EarliestShipDate>2014-10-05T18:30:00Z</EarliestShipDate>
<LatestShipDate>2014-10-07T18:29:59Z</LatestShipDate>
<EarliestDeliveryDate>2014-10-07T18:30:00Z</EarliestDeliveryDate>
<LatestDeliveryDate>2014-10-11T18:29:59Z</LatestDeliveryDate>
</Order>
</Orders>
</ListOrdersResult>
</ListOrdersResponse>
b.xml:
<?xml version="1.0"?>
<ListOrdersByNextTokenResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
<ListOrdersByNextTokenResult>
<NextToken>token1</NextToken>
<CreatedBefore>2014-10-07T08:13:11Z</CreatedBefore>
<Orders>
<Order>
<AmazonOrderId>oid1</AmazonOrderId>
<PurchaseDate>2014-10-04T13:37:41Z</PurchaseDate>
<LastUpdateDate>2014-10-06T09:52:21Z</LastUpdateDate>
<OrderStatus>Shipped</OrderStatus>
<FulfillmentChannel>MFN</FulfillmentChannel>
<SalesChannel>Amazon.in</SalesChannel>
<ShipServiceLevel>IN Std Dom 2_50k_cod</ShipServiceLevel>
<ShippingAddress>
<Name>name1</Name>
<AddressLine1>line1-1</AddressLine1>
<AddressLine2>line2-1</AddressLine2>
<City>WADHVANCITY,SURENDRANAGAR</City>
<StateOrRegion>Gujarat</StateOrRegion>
<PostalCode>363035</PostalCode>
<CountryCode>IN</CountryCode>
<Phone>987654321</Phone>
</ShippingAddress>
<OrderTotal>
<CurrencyCode>INR</CurrencyCode>
<Amount>242.00</Amount>
</OrderTotal>
<NumberOfItemsShipped>1</NumberOfItemsShipped>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<PaymentExecutionDetail/>
<PaymentMethod>Other</PaymentMethod>
<MarketplaceId>mid1</MarketplaceId>
<BuyerEmail>email1@buyer.com</BuyerEmail>
<BuyerName>name1</BuyerName>
<ShipmentServiceLevelCategory>Standard</ShipmentServiceLevelCategory>
<ShippedByAmazonTFM>false</ShippedByAmazonTFM>
<TFMShipmentStatus>PendingPickUp</TFMShipmentStatus>
<OrderType>StandardOrder</OrderType>
<EarliestShipDate>2014-10-05T18:30:00Z</EarliestShipDate>
<LatestShipDate>2014-10-07T18:29:59Z</LatestShipDate>
<EarliestDeliveryDate>2014-10-09T18:30:00Z</EarliestDeliveryDate>
<LatestDeliveryDate>2014-10-15T18:29:59Z</LatestDeliveryDate>
</Order>
</Orders>
</ListOrdersByNextTokenResult>
</ListOrdersByNextTokenResponse>
I want to add the elements inside the Orders element in b.xml to the Orders element of a.xml.
So, the expected output is:
<?xml version="1.0"?>
<ListOrdersResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
<ListOrdersResult>
<NextToken>token</NextToken>
<CreatedBefore>2014-10-07T08:13:11Z</CreatedBefore>
<Orders>
<Order>
<AmazonOrderId>12345</AmazonOrderId>
<SellerOrderId>R12345</SellerOrderId>
<PurchaseDate>2014-10-02T14:40:37Z</PurchaseDate>
<LastUpdateDate>2014-10-03T09:47:02Z</LastUpdateDate>
<OrderStatus>Shipped</OrderStatus>
<FulfillmentChannel>MFN</FulfillmentChannel>
<SalesChannel>Amazon.in</SalesChannel>
<ShipServiceLevel>IN Exp Dom 2</ShipServiceLevel>
<ShippingAddress>
<Name>name</Name>
<AddressLine1>line1</AddressLine1>
<AddressLine2>line2</AddressLine2>
<City>Pune</City>
<StateOrRegion>Maharashtra</StateOrRegion>
<PostalCode>411027</PostalCode>
<CountryCode>IN</CountryCode>
<Phone>123456789</Phone>
</ShippingAddress>
<OrderTotal>
<CurrencyCode>INR</CurrencyCode>
<Amount>520.00</Amount>
</OrderTotal>
<NumberOfItemsShipped>1</NumberOfItemsShipped>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<PaymentExecutionDetail/>
<PaymentMethod>Other</PaymentMethod>
<MarketplaceId>mid</MarketplaceId>
<BuyerEmail>email@buyer.com</BuyerEmail>
<BuyerName>name</BuyerName>
<ShipmentServiceLevelCategory>Expedited</ShipmentServiceLevelCategory>
<ShippedByAmazonTFM>false</ShippedByAmazonTFM>
<TFMShipmentStatus>Delivered</TFMShipmentStatus>
<OrderType>StandardOrder</OrderType>
<EarliestShipDate>2014-10-05T18:30:00Z</EarliestShipDate>
<LatestShipDate>2014-10-07T18:29:59Z</LatestShipDate>
<EarliestDeliveryDate>2014-10-07T18:30:00Z</EarliestDeliveryDate>
<LatestDeliveryDate>2014-10-11T18:29:59Z</LatestDeliveryDate>
</Order>
<Order>
<AmazonOrderId>oid1</AmazonOrderId>
<PurchaseDate>2014-10-04T13:37:41Z</PurchaseDate>
<LastUpdateDate>2014-10-06T09:52:21Z</LastUpdateDate>
<OrderStatus>Shipped</OrderStatus>
<FulfillmentChannel>MFN</FulfillmentChannel>
<SalesChannel>Amazon.in</SalesChannel>
<ShipServiceLevel>IN Std Dom 2_50k_cod</ShipServiceLevel>
<ShippingAddress>
<Name>name1</Name>
<AddressLine1>line1-1</AddressLine1>
<AddressLine2>line2-1</AddressLine2>
<City>WADHVANCITY,SURENDRANAGAR</City>
<StateOrRegion>Gujarat</StateOrRegion>
<PostalCode>363035</PostalCode>
<CountryCode>IN</CountryCode>
<Phone>987654321</Phone>
</ShippingAddress>
<OrderTotal>
<CurrencyCode>INR</CurrencyCode>
<Amount>242.00</Amount>
</OrderTotal>
<NumberOfItemsShipped>1</NumberOfItemsShipped>
<NumberOfItemsUnshipped>0</NumberOfItemsUnshipped>
<PaymentExecutionDetail/>
<PaymentMethod>Other</PaymentMethod>
<MarketplaceId>mid1</MarketplaceId>
<BuyerEmail>email1#buyer.com</BuyerEmail>
<BuyerName>name1</BuyerName>
<ShipmentServiceLevelCategory>Standard</ShipmentServiceLevelCategory>
<ShippedByAmazonTFM>false</ShippedByAmazonTFM>
<TFMShipmentStatus>PendingPickUp</TFMShipmentStatus>
<OrderType>StandardOrder</OrderType>
<EarliestShipDate>2014-10-05T18:30:00Z</EarliestShipDate>
<LatestShipDate>2014-10-07T18:29:59Z</LatestShipDate>
<EarliestDeliveryDate>2014-10-09T18:30:00Z</EarliestDeliveryDate>
<LatestDeliveryDate>2014-10-15T18:29:59Z</LatestDeliveryDate>
</Order>
</Orders>
</ListOrdersResult>
</ListOrdersResponse>
I tried:
import xml.etree.ElementTree as ET
import os
import shlex
import subprocess
tree = ET.parse("a.xml")
root = tree.getroot()
combined_xml = root
namespaces = {'resp': 'https://mws.amazonservices.com/Orders/2013-09-01'}
results = combined_xml.find("resp:ListOrdersResult", namespaces=namespaces)
insertion_point = results.find("resp:Orders", namespaces=namespaces)
tree1 = ET.parse("b.xml")
root1 = tree1.getroot()
results1 = root1.find("resp:ListOrdersByNextTokenResult", namespaces=namespaces)
order_array1 = results1.find("resp:Orders", namespaces=namespaces)
for order in order_array1:
    insertion_point.extend(order)
print ET.tostring(combined_xml)
But I am getting the following output:
<ns0:ListOrdersResponse xmlns:ns0="https://mws.amazonservices.com/Orders/2013-09-01">
<ns0:ListOrdersResult>
<ns0:NextToken>token</ns0:NextToken>
<ns0:CreatedBefore>2014-10-07T08:13:11Z</ns0:CreatedBefore>
<ns0:Orders>
<ns0:Order>
<ns0:AmazonOrderId>12345</ns0:AmazonOrderId>
<ns0:SellerOrderId>R12345</ns0:SellerOrderId>
<ns0:PurchaseDate>2014-10-02T14:40:37Z</ns0:PurchaseDate>
<ns0:LastUpdateDate>2014-10-03T09:47:02Z</ns0:LastUpdateDate>
<ns0:OrderStatus>Shipped</ns0:OrderStatus>
<ns0:FulfillmentChannel>MFN</ns0:FulfillmentChannel>
<ns0:SalesChannel>Amazon.in</ns0:SalesChannel>
<ns0:ShipServiceLevel>IN Exp Dom 2</ns0:ShipServiceLevel>
<ns0:ShippingAddress>
<ns0:Name>name</ns0:Name>
<ns0:AddressLine1>line1</ns0:AddressLine1>
<ns0:AddressLine2>line2</ns0:AddressLine2>
<ns0:City>Pune</ns0:City>
<ns0:StateOrRegion>Maharashtra</ns0:StateOrRegion>
<ns0:PostalCode>411027</ns0:PostalCode>
<ns0:CountryCode>IN</ns0:CountryCode>
<ns0:Phone>123456789</ns0:Phone>
</ns0:ShippingAddress>
<ns0:OrderTotal>
<ns0:CurrencyCode>INR</ns0:CurrencyCode>
<ns0:Amount>520.00</ns0:Amount>
</ns0:OrderTotal>
<ns0:NumberOfItemsShipped>1</ns0:NumberOfItemsShipped>
<ns0:NumberOfItemsUnshipped>0</ns0:NumberOfItemsUnshipped>
<ns0:PaymentExecutionDetail />
<ns0:PaymentMethod>Other</ns0:PaymentMethod>
<ns0:MarketplaceId>mid</ns0:MarketplaceId>
<ns0:BuyerEmail>email#buyer.com</ns0:BuyerEmail>
<ns0:BuyerName>name</ns0:BuyerName>
<ns0:ShipmentServiceLevelCategory>Expedited</ns0:ShipmentServiceLevelCategory>
<ns0:ShippedByAmazonTFM>false</ns0:ShippedByAmazonTFM>
<ns0:TFMShipmentStatus>Delivered</ns0:TFMShipmentStatus>
<ns0:OrderType>StandardOrder</ns0:OrderType>
<ns0:EarliestShipDate>2014-10-05T18:30:00Z</ns0:EarliestShipDate>
<ns0:LatestShipDate>2014-10-07T18:29:59Z</ns0:LatestShipDate>
<ns0:EarliestDeliveryDate>2014-10-07T18:30:00Z</ns0:EarliestDeliveryDate>
<ns0:LatestDeliveryDate>2014-10-11T18:29:59Z</ns0:LatestDeliveryDate>
</ns0:Order>
<ns0:AmazonOrderId>oid1</ns0:AmazonOrderId>
<ns0:PurchaseDate>2014-10-04T13:37:41Z</ns0:PurchaseDate>
<ns0:LastUpdateDate>2014-10-06T09:52:21Z</ns0:LastUpdateDate>
<ns0:OrderStatus>Shipped</ns0:OrderStatus>
<ns0:FulfillmentChannel>MFN</ns0:FulfillmentChannel>
<ns0:SalesChannel>Amazon.in</ns0:SalesChannel>
<ns0:ShipServiceLevel>IN Std Dom 2_50k_cod</ns0:ShipServiceLevel>
<ns0:ShippingAddress>
<ns0:Name>name1</ns0:Name>
<ns0:AddressLine1>line1-1</ns0:AddressLine1>
<ns0:AddressLine2>line2-1</ns0:AddressLine2>
<ns0:City>WADHVANCITY,SURENDRANAGAR</ns0:City>
<ns0:StateOrRegion>Gujarat</ns0:StateOrRegion>
<ns0:PostalCode>363035</ns0:PostalCode>
<ns0:CountryCode>IN</ns0:CountryCode>
<ns0:Phone>987654321</ns0:Phone>
</ns0:ShippingAddress>
<ns0:OrderTotal>
<ns0:CurrencyCode>INR</ns0:CurrencyCode>
<ns0:Amount>242.00</ns0:Amount>
</ns0:OrderTotal>
<ns0:NumberOfItemsShipped>1</ns0:NumberOfItemsShipped>
<ns0:NumberOfItemsUnshipped>0</ns0:NumberOfItemsUnshipped>
<ns0:PaymentExecutionDetail />
<ns0:PaymentMethod>Other</ns0:PaymentMethod>
<ns0:MarketplaceId>mid1</ns0:MarketplaceId>
<ns0:BuyerEmail>email1#byer.com</ns0:BuyerEmail>
<ns0:BuyerName>name1</ns0:BuyerName>
<ns0:ShipmentServiceLevelCategory>Standard</ns0:ShipmentServiceLevelCategory>
<ns0:ShippedByAmazonTFM>false</ns0:ShippedByAmazonTFM>
<ns0:TFMShipmentStatus>PendingPickUp</ns0:TFMShipmentStatus>
<ns0:OrderType>StandardOrder</ns0:OrderType>
<ns0:EarliestShipDate>2014-10-05T18:30:00Z</ns0:EarliestShipDate>
<ns0:LatestShipDate>2014-10-07T18:29:59Z</ns0:LatestShipDate>
<ns0:EarliestDeliveryDate>2014-10-09T18:30:00Z</ns0:EarliestDeliveryDate>
<ns0:LatestDeliveryDate>2014-10-15T18:29:59Z</ns0:LatestDeliveryDate>
</ns0:Orders>
</ns0:ListOrdersResult>
</ns0:ListOrdersResponse>
Why am I getting the ns0 prefix? Also, the <Order> tag is missing for the second order. How can I get the desired output without ns0? I am open to suggestions for using another module if it makes life easier. :)
Thanks

ns0 is the prefix ElementTree generates when it serializes a namespace that has no registered prefix. The parser stores every tag as {namespace-uri}tagname, so the default xmlns from your file is written back out as xmlns:ns0 unless you register a prefix for that URI.
I'd really recommend using beautifulsoup4 for this, though; I find it much nicer for working with XML:
from bs4 import BeautifulSoup
# The "xml" parser requires lxml and keeps tag names case-sensitive.
soup = BeautifulSoup(open('a.xml'), 'xml')
insertion_point = soup.ListOrdersResult.Orders
orders_b = BeautifulSoup(open('b.xml'), 'xml').ListOrdersByNextTokenResult.Orders
# could probably just be orders_b = BeautifulSoup(open('b.xml'), 'xml')
orders_to_insert = orders_b.find_all('Order')
for order in orders_to_insert:
    insertion_point.append(order)
print(soup)
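On the missing <Order> wrapper in the question's output: a minimal sketch of the difference between extend() and append() in ElementTree. The one-line <Order> string is a made-up stand-in for a real order, not data from a.xml or b.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down order used only for illustration.
order = ET.fromstring("<Order><AmazonOrderId>oid1</AmazonOrderId></Order>")

# extend() iterates over its argument, so passing a single element
# adds that element's children and drops the <Order> wrapper.
with_extend = ET.Element("Orders")
with_extend.extend(order)

# append() adds the element itself, keeping the <Order> wrapper.
with_append = ET.Element("Orders")
with_append.append(order)

extend_tags = [child.tag for child in with_extend]
append_tags = [child.tag for child in with_append]
print(extend_tags)
print(append_tags)
```

So in the question's loop, insertion_point.append(order) is probably what was intended.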

import xml.etree.ElementTree as ET
namespaces = {'resp': 'https://mws.amazonservices.com/Orders/2013-09-01'}
tree = ET.parse("a.xml")
root = tree.getroot()
results = root.find("resp:ListOrdersResult", namespaces=namespaces)
# Keep the <Orders> element itself; getchildren() was removed in Python 3.9.
orders = results.find("resp:Orders", namespaces=namespaces)
tree1 = ET.parse("b.xml")
root1 = tree1.getroot()
results1 = root1.find("resp:ListOrdersByNextTokenResult", namespaces=namespaces)
orders1 = results1.find("resp:Orders", namespaces=namespaces)
# append() adds each <Order> element whole, unlike extend(order).
for order in list(orders1):
    orders.append(order)
tree.write("temp.xml")
# Crude text fix: strip the generated ns0 prefix from the tags and from the xmlns declaration.
with open("temp.xml") as f:
    correct_data = f.read().replace('ns0:', '').replace(':ns0', '')
with open("combined.xml", 'w') as filewrite:
    filewrite.write(correct_data)
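As an alternative to the text replacement above, ElementTree can be told to write the namespace as the default one via ET.register_namespace. The following is a sketch under assumptions: the two XML strings are trimmed stand-ins for a.xml and b.xml, the root element name of the second one is a guess, and merge_orders is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS_URI = "https://mws.amazonservices.com/Orders/2013-09-01"
NS = {"resp": NS_URI}

# Trimmed stand-ins for a.xml and b.xml (assumed shapes, not the full responses).
A_XML = """<ListOrdersResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
  <ListOrdersResult><Orders>
    <Order><AmazonOrderId>12345</AmazonOrderId></Order>
  </Orders></ListOrdersResult>
</ListOrdersResponse>"""

B_XML = """<ListOrdersByNextTokenResponse xmlns="https://mws.amazonservices.com/Orders/2013-09-01">
  <ListOrdersByNextTokenResult><Orders>
    <Order><AmazonOrderId>oid1</AmazonOrderId></Order>
  </Orders></ListOrdersByNextTokenResult>
</ListOrdersByNextTokenResponse>"""


def merge_orders(first_xml, next_xml):
    """Append every <Order> from next_xml to the <Orders> list of first_xml."""
    # Map the URI to the empty prefix so it is written as xmlns="..." on output.
    # Note that this registration is global to the ElementTree module.
    ET.register_namespace("", NS_URI)
    root = ET.fromstring(first_xml)
    target = root.find("resp:ListOrdersResult/resp:Orders", NS)
    source = ET.fromstring(next_xml).find(
        "resp:ListOrdersByNextTokenResult/resp:Orders", NS
    )
    # append() keeps each <Order> wrapper intact.
    for order in list(source):
        target.append(order)
    return ET.tostring(root, encoding="unicode")


combined = merge_orders(A_XML, B_XML)
print(combined)
```

With files instead of strings, ET.parse(...) and tree.write("combined.xml") would take the place of ET.fromstring and ET.tostring.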

How to parse xml string having deep structures using python

A similar question was asked here (Python XML Parsing), but I could not get to the content I am interested in.
I need to extract all the information enclosed in each patent-classification tag whose classification-scheme child has the scheme attribute CPC. There are multiple such elements, all enclosed in the patent-classifications tag.
In the example given below, there are three such values: C 07 K 16 22 I, A 61 K 2039 505 A and C 07 K 2317 92 A
<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
<ops:meta name="elapsed-time" value="21"/>
<exchange-documents>
<exchange-document system="ops.epo.org" family-id="39103486" country="US" doc-number="2009234106" kind="A1">
<bibliographic-data>
<publication-reference>
<document-id document-id-type="docdb">
<country>US</country>
<doc-number>2009234106</doc-number>
<kind>A1</kind>
<date>20090917</date>
</document-id>
<document-id document-id-type="epodoc">
<doc-number>US2009234106</doc-number>
<date>20090917</date>
</document-id>
</publication-reference>
<classifications-ipcr>
<classification-ipcr sequence="1">
<text>C07K 16/ 44 A I </text>
</classification-ipcr>
</classifications-ipcr>
<patent-classifications>
<patent-classification sequence="1">
<classification-scheme office="" scheme="CPC"/>
<section>C</section>
<class>07</class>
<subclass>K</subclass>
<main-group>16</main-group>
<subgroup>22</subgroup>
<classification-value>I</classification-value>
</patent-classification>
<patent-classification sequence="2">
<classification-scheme office="" scheme="CPC"/>
<section>A</section>
<class>61</class>
<subclass>K</subclass>
<main-group>2039</main-group>
<subgroup>505</subgroup>
<classification-value>A</classification-value>
</patent-classification>
<patent-classification sequence="7">
<classification-scheme office="" scheme="CPC"/>
<section>C</section>
<class>07</class>
<subclass>K</subclass>
<main-group>2317</main-group>
<subgroup>92</subgroup>
<classification-value>A</classification-value>
</patent-classification>
<patent-classification sequence="1">
<classification-scheme office="US" scheme="UC"/>
<classification-symbol>530/387.9</classification-symbol>
</patent-classification>
</patent-classifications>
</bibliographic-data>
</exchange-document>
</exchange-documents>
</ops:world-patent-data>

Install BeautifulSoup if you don't have it (the "xml" parser used below also needs lxml):
$ pip install beautifulsoup4 lxml
Try this:
from bs4 import BeautifulSoup
with open('example.xml', 'rb') as f:
    bs = BeautifulSoup(f, 'xml')
# find every patent-classification element
patents = bs.find_all('patent-classification')
# keep the ones whose classification-scheme has scheme="CPC"
for pa in patents:
    if pa.find('classification-scheme', {'scheme': 'CPC'}):
        print(pa.get_text(' ', strip=True))

You can use Python's standard xml module:
import xml.etree.ElementTree as ET
root = ET.parse('a.xml').getroot()
# Select each CPC classification-scheme, then step up to its parent with "/..".
for node in root.iterfind(".//{http://www.epo.org/exchange}classification-scheme[@scheme='CPC']/.."):
    data = []
    for d in node:
        if d.text:
            data.append(d.text)
    print(' '.join(data))
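The same lookup can also be written with a prefix map instead of the {uri} form in the path. This is a sketch on a shortened copy of the sample document; the prefix "ex" and the function name cpc_classifications are arbitrary choices for illustration, not part of the EPO data.

```python
import xml.etree.ElementTree as ET

# "ex" is an arbitrary local prefix for the document's default namespace.
NS = {"ex": "http://www.epo.org/exchange"}

# Shortened copy of the sample: one CPC entry and one UC entry.
SAMPLE = """<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
<exchange-documents><exchange-document><bibliographic-data>
<patent-classifications>
<patent-classification sequence="1">
<classification-scheme office="" scheme="CPC"/>
<section>C</section><class>07</class><subclass>K</subclass>
<main-group>16</main-group><subgroup>22</subgroup>
<classification-value>I</classification-value>
</patent-classification>
<patent-classification sequence="1">
<classification-scheme office="US" scheme="UC"/>
<classification-symbol>530/387.9</classification-symbol>
</patent-classification>
</patent-classifications>
</bibliographic-data></exchange-document></exchange-documents>
</ops:world-patent-data>"""


def cpc_classifications(xml_text):
    """Return one space-joined string per CPC patent-classification."""
    root = ET.fromstring(xml_text)
    found = []
    for pc in root.iterfind(".//ex:patent-classification", NS):
        scheme = pc.find("ex:classification-scheme", NS)
        if scheme is not None and scheme.get("scheme") == "CPC":
            # Collect the non-empty text of each child, in document order.
            parts = [c.text.strip() for c in pc if c.text and c.text.strip()]
            found.append(" ".join(parts))
    return found


result = cpc_classifications(SAMPLE)
print(result)
```

Checking the scheme attribute in Python rather than in the path keeps the expression short and makes it easy to match other schemes later.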
