extract xml to pandas dataframe with unknown number of nodes - python
The code sample below works if there is only one node.
However, in our use case we don't know how many nodes we will receive.
We need to convert an XML file to a pandas data frame in Python.
A sample is below.
How can we parse this into a dataframe?
In particular, we don't know how many nodes
we will receive in the feed file.
<?xml version = '1.0' encoding = 'UTF-8'?>
<EVENT spec="IDL:com/RfcCallEvents:1.0#Z_BAPI_UPDT_SERV_NOTIFICATION">
<eventHeader>
<objectName/>
<objectKey/>
<eventName/>
<eventId/>
</eventHeader>
<TAB_DETAIL_DATA>
<ZNEWFLAG>X</ZNEWFLAG>
<FENUM>2</FENUM>
<BAUTL>661-01727</BAUTL>
<OTEIL/>
<FECOD>KBB</FECOD>
<URCOD>B08</URCOD>
<ZCOMPMDF>A</ZCOMPMDF>
<ZOPREPL/>
<ZWRNCOV>LP</ZWRNCOV>
<ZWRNREF/>
<ZNEWPS>C07XMAAEJCLD</ZNEWPS>
<ZOLDPN/>
<ZOLDPD/>
<ZOLDPS>C07XMAACJCLD</ZOLDPS>
<MAILINFECOD/>
<ZUNITPR/>
<ZNEWPD/>
<ZNEWPN/>
<ZABUSE/>
<ZRPS>S</ZRPS>
<ZEXKGB/>
<ZKGBMM/>
<ZINSTS>000</ZINSTS>
<ZACKBB/>
<ZCHKOVR/>
<ZSNDB/>
<ZNOTAFISCAL/>
<ZCONSGMT/>
<ZPRTCONS/>
<ZZRTNTRNO/>
<ZZRTNCAR/>
<ZZINSPECT/>
<ZZPR_OPT/>
</TAB_DETAIL_DATA>
<TAB_DETAIL_DATA>
<ZNEWFLAG>X</ZNEWFLAG>
<FENUM>1</FENUM>
<BAUTL>661-01727</BAUTL>
<OTEIL/>
<FECOD>KBB</FECOD>
<URCOD>B08</URCOD>
<ZCOMPMDF>A</ZCOMPMDF>
<ZOPREPL/>
<ZWRNCOV>LP</ZWRNCOV>
<ZWRNREF/>
<ZNEWPS>C07XMAAEJCLD</ZNEWPS>
<ZOLDPN/>
<ZOLDPD/>
<ZOLDPS>C07XMAACJCLD</ZOLDPS>
<MAILINFECOD/>
<ZUNITPR/>
<ZNEWPD/>
<ZNEWPN/>
<ZABUSE/>
<ZRPS>S</ZRPS>
<ZEXKGB/>
<ZKGBMM/>
<ZINSTS>000</ZINSTS>
<ZACKBB/>
<ZCHKOVR/>
<ZSNDB/>
<ZNOTAFISCAL/>
<ZCONSGMT/>
<ZPRTCONS/>
<ZZRTNTRNO/>
<ZZRTNCAR/>
<ZZINSPECT/>
<ZZPR_OPT/>
</TAB_DETAIL_DATA>
<TAB_HEADER_DATA>
<QMNUM>030334920069</QMNUM>
<ZGSXREF>CONSUMER</ZGSXREF>
<ZVANTREF>G338005317</ZVANTREF>
<ZSHIPER/>
<ZSHPRNO/>
<ZRVREF/>
<ZTECHID>4HQ2OD6C19</ZTECHID>
<ZADREPAIR/>
<ZZKATR7/>
</TAB_HEADER_DATA>
</EVENT>
I suspect you need to parse the XML data into several dataframes, e.g. as follows:
import pandas as pd
import xmltodict  # install this module first
data = """<?xml version = '1.0' encoding = 'UTF-8'?>
<EVENT spec="IDL:com/RfcCallEvents:1.0#Z_BAPI_UPDT_SERV_NOTIFICATION">
<eventHeader>
<objectName/>
<objectKey/>
<eventName/>
<eventId/>
</eventHeader>
<TAB_DETAIL_DATA>
<ZNEWFLAG>X</ZNEWFLAG>
<FENUM>2</FENUM>
<BAUTL>661-01727</BAUTL>
<OTEIL/>
<FECOD>KBB</FECOD>
<URCOD>B08</URCOD>
<ZCOMPMDF>A</ZCOMPMDF>
<ZOPREPL/>
<ZWRNCOV>LP</ZWRNCOV>
<ZWRNREF/>
<ZNEWPS>C07XMAAEJCLD</ZNEWPS>
<ZOLDPN/>
<ZOLDPD/>
<ZOLDPS>C07XMAACJCLD</ZOLDPS>
<MAILINFECOD/>
<ZUNITPR/>
<ZNEWPD/>
<ZNEWPN/>
<ZABUSE/>
<ZRPS>S</ZRPS>
<ZEXKGB/>
<ZKGBMM/>
<ZINSTS>000</ZINSTS>
<ZACKBB/>
<ZCHKOVR/>
<ZSNDB/>
<ZNOTAFISCAL/>
<ZCONSGMT/>
<ZPRTCONS/>
<ZZRTNTRNO/>
<ZZRTNCAR/>
<ZZINSPECT/>
<ZZPR_OPT/>
</TAB_DETAIL_DATA>
<TAB_DETAIL_DATA>
<ZNEWFLAG>X</ZNEWFLAG>
<FENUM>1</FENUM>
<BAUTL>661-01727</BAUTL>
<OTEIL/>
<FECOD>KBB</FECOD>
<URCOD>B08</URCOD>
<ZCOMPMDF>A</ZCOMPMDF>
<ZOPREPL/>
<ZWRNCOV>LP</ZWRNCOV>
<ZWRNREF/>
<ZNEWPS>C07XMAAEJCLD</ZNEWPS>
<ZOLDPN/>
<ZOLDPD/>
<ZOLDPS>C07XMAACJCLD</ZOLDPS>
<MAILINFECOD/>
<ZUNITPR/>
<ZNEWPD/>
<ZNEWPN/>
<ZABUSE/>
<ZRPS>S</ZRPS>
<ZEXKGB/>
<ZKGBMM/>
<ZINSTS>000</ZINSTS>
<ZACKBB/>
<ZCHKOVR/>
<ZSNDB/>
<ZNOTAFISCAL/>
<ZCONSGMT/>
<ZPRTCONS/>
<ZZRTNTRNO/>
<ZZRTNCAR/>
<ZZINSPECT/>
<ZZPR_OPT/>
</TAB_DETAIL_DATA>
<TAB_HEADER_DATA>
<QMNUM>030334920069</QMNUM>
<ZGSXREF>CONSUMER</ZGSXREF>
<ZVANTREF>G338005317</ZVANTREF>
<ZSHIPER/>
<ZSHPRNO/>
<ZRVREF/>
<ZTECHID>4HQ2OD6C19</ZTECHID>
<ZADREPAIR/>
<ZZKATR7/>
</TAB_HEADER_DATA>
</EVENT>"""
dct = xmltodict.parse(data)
def make_df(name="TAB_DETAIL_DATA", dct=dct):
    df = pd.DataFrame()
    if isinstance(dct['EVENT'][name], list):
        # repeated node: xmltodict returns a list of dicts, one per occurrence
        for j in dct['EVENT'][name]:
            _ = pd.DataFrame({'value': [y for x, y in j.items()]}, index=j.keys())
            df = pd.concat([df, _])
    else:
        # single occurrence: xmltodict returns a plain dict
        df = pd.DataFrame({'value': [y for x, y in dct['EVENT'][name].items()]}, index=dct['EVENT'][name].keys())
    return df
Now, you can experiment with the parser:
make_df(name="TAB_HEADER_DATA") # produces single df
make_df(name="TAB_DETAIL_DATA") # concatenates all content occurred in TAB_DETAIL_DATA sections, returns single df
Related
XML into Pandas dataframe
I have an XML file and I would like to parse it into a table. (Pandas dataframe) Below is just a sample of the XML file. Those are only two of the records. <?xml version="1.0" encoding="UTF-8"?> <file> <C13_335010X321A1_837Y6> <BHT_BeginningOfHierarchicalTransaction> <BHT01__HierarchicalStructureCode>0011</BHT01__HierarchicalStructureCode> <BHT02__TransactionSetPurposeCode>00</BHT02__TransactionSetPurposeCode> <BHT03__OriginatorApplicationTransactionIdentifier>513513TR</BHT03__OriginatorApplicationTransactionIdentifier> <BHT04__TransactionSetCreationDate>20200212</BHT04__TransactionSetCreationDate> <BHT05__TransactionSetCreationTime>1287</BHT05__TransactionSetCreationTime> <BHT06__ClaimOrEncounterIdentifier>DD</BHT06__ClaimOrEncounterIdentifier> </BHT_BeginningOfHierarchicalTransaction> <Loop_1000A> <NM1_SubmitterName_1000A> <NM101__EntityIdentifierCode>27</NM101__EntityIdentifierCode> <NM102__EntityTypeQualifier>9</NM102__EntityTypeQualifier> <NM103__SubmitterLastOrOrganizationName>AAA</NM103__SubmitterLastOrOrganizationName> <NM108__IdentificationCodeQualifier>22</NM108__IdentificationCodeQualifier> <NM109__SubmitterIdentifier>55555500</NM109__SubmitterIdentifier> </NM1_SubmitterName_1000A> <PER_SubmitterEDIContactInformation_1000A> <PER01__ContactFunctionCode>LK</PER01__ContactFunctionCode> <PER02__SubmitterContactName>John Smith</PER02__SubmitterContactName> <PER03__CommunicationNumberQualifier>WW</PER03__CommunicationNumberQualifier> <PER04__CommunicationNumber>2132220011</PER04__CommunicationNumber> <PER05__CommunicationNumberQualifier>DD</PER05__CommunicationNumberQualifier> <PER06__CommunicationNumber>DD_2#GMAIL.COM</PER06__CommunicationNumber> </PER_SubmitterEDIContactInformation_1000A> </Loop_1000A> <Loop_1000B> <NM1_ReceiverName_1000B> <NM101__EntityIdentifierCode>21</NM101__EntityIdentifierCode> <NM102__EntityTypeQualifier>0</NM102__EntityTypeQualifier> <NM103__ReceiverName>AAA</NM103__ReceiverName> <NM108__IdentificationCodeQualifier>32</NM108__IdentificationCodeQualifier> <NM109__ReceiverPrimaryIdentifier>2514521</NM109__ReceiverPrimaryIdentifier> </NM1_ReceiverName_1000B> </Loop_1000B> <Loop_2000A> <HL_BillingProviderHierarchicalLevel_2000A> <HL01__HierarchicalIDNumber>32</HL01__HierarchicalIDNumber> <HL03__HierarchicalLevelCode>54</HL03__HierarchicalLevelCode> <HL04__HierarchicalChildCode>32</HL04__HierarchicalChildCode> </HL_BillingProviderHierarchicalLevel_2000A> <Loop_2010AA> <NM1_BillingProviderName_2010AA> <NM101__EntityIdentifierCode>54</NM101__EntityIdentifierCode> <NM102__EntityTypeQualifier>21</NM102__EntityTypeQualifier> <NM103__BillingProviderLastOrOrganizationalName>AAA</NM103__BillingProviderLastOrOrganizationalName> <NM108__IdentificationCodeQualifier>XX</NM108__IdentificationCodeQualifier> <NM109__BillingProviderIdentifier>515151325</NM109__BillingProviderIdentifier> </NM1_BillingProviderName_2010AA> <N3_BillingProviderAddress_2010AA> <N301__BillingProviderAddressLine>214 SS STREET</N301__BillingProviderAddressLine> </N3_BillingProviderAddress_2010AA> <N4_BillingProviderCityStateZIPCode_2010AA> <N401__BillingProviderCityName>LA</N401__BillingProviderCityName> <N402__BillingProviderStateOrProvinceCode>CA</N402__BillingProviderStateOrProvinceCode> <N403__BillingProviderPostalZoneOrZIPCode>93500</N403__BillingProviderPostalZoneOrZIPCode> </N4_BillingProviderCityStateZIPCode_2010AA> <REF_BillingProviderTaxIdentification_2010AA> <REF01__ReferenceIdentificationQualifier>OI</REF01__ReferenceIdentificationQualifier> 
<REF02__BillingProviderTaxIdentificationNumber>5135151315</REF02__BillingProviderTaxIdentificationNumber> </REF_BillingProviderTaxIdentification_2010AA> </Loop_2010AA> <Loop_2000B> <HL_SubscriberHierarchicalLevel_2000B> <HL01__HierarchicalIDNumber>5</HL01__HierarchicalIDNumber> <HL02__HierarchicalParentIDNumber>5</HL02__HierarchicalParentIDNumber> <HL03__HierarchicalLevelCode>55</HL03__HierarchicalLevelCode> <HL04__HierarchicalChildCode>5</HL04__HierarchicalChildCode> </HL_SubscriberHierarchicalLevel_2000B> <SBR_SubscriberInformation_2000B> <SBR01__PayerResponsibilitySequenceNumberCode>L</SBR01__PayerResponsibilitySequenceNumberCode> <SBR02__IndividualRelationshipCode>32</SBR02__IndividualRelationshipCode> <SBR03__SubscriberGroupOrPolicyNumber>252525Z125</SBR03__SubscriberGroupOrPolicyNumber> <SBR09__ClaimFilingIndicatorCode>NM</SBR09__ClaimFilingIndicatorCode> </SBR_SubscriberInformation_2000B> <Loop_2010BA> <NM1_SubscriberName_2010BA> <NM101__EntityIdentifierCode>DCX</NM101__EntityIdentifierCode> <NM102__EntityTypeQualifier>5</NM102__EntityTypeQualifier> <NM103__SubscriberLastName>SMITH</NM103__SubscriberLastName> <NM104__SubscriberFirstName>JOHN</NM104__SubscriberFirstName> <NM108__IdentificationCodeQualifier>CA</NM108__IdentificationCodeQualifier> <NM109__SubscriberPrimaryIdentifier>3656361.</NM109__SubscriberPrimaryIdentifier> </NM1_SubscriberName_2010BA> <N3_SubscriberAddress_2010BA> <N301__SubscriberAddressLine>111 STREET</N301__SubscriberAddressLine> </N3_SubscriberAddress_2010BA> <N4_SubscriberCityStateZIPCode_2010BA> <N401__SubscriberCityName>LA</N401__SubscriberCityName> <N402__SubscriberStateCode>CA</N402__SubscriberStateCode> <N403__SubscriberPostalZoneOrZIPCode>93000</N403__SubscriberPostalZoneOrZIPCode> </N4_SubscriberCityStateZIPCode_2010BA> <DMG_SubscriberDemographicInformation_2010BA> <DMG01__DateTimePeriodFormatQualifier>K5</DMG01__DateTimePeriodFormatQualifier> <DMG02__SubscriberBirthDate>19851010</DMG02__SubscriberBirthDate> <DMG03__SubscriberGenderCode>U</DMG03__SubscriberGenderCode> </DMG_SubscriberDemographicInformation_2010BA> </Loop_2010BA> <Loop_2010BB> <NM1_PayerName_2010BB> <NM101__EntityIdentifierCode>FF</NM101__EntityIdentifierCode> <NM102__EntityTypeQualifier>3</NM102__EntityTypeQualifier> <NM103__PayerName>AAA</NM103__PayerName> <NM108__IdentificationCodeQualifier>GF</NM108__IdentificationCodeQualifier> <NM109__PayerIdentifier>32514</NM109__PayerIdentifier> </NM1_PayerName_2010BB> </Loop_2010BB> <Loop_2300> <CLM_ClaimInformation_2300> <CLM01__PatientControlNumber>5413</CLM01__PatientControlNumber> <CLM02__TotalClaimChargeAmount>651</CLM02__TotalClaimChargeAmount> <CLM05_HealthCareServiceLocationInformation_2300> <CLM05_01_PlaceOfServiceCode>13</CLM05_01_PlaceOfServiceCode> <CLM05_02_FacilityCodeQualifier>D</CLM05_02_FacilityCodeQualifier> <CLM05_03_ClaimFrequencyCode>3</CLM05_03_ClaimFrequencyCode> </CLM05_HealthCareServiceLocationInformation_2300> <CLM06__ProviderOrSupplierSignatureIndicator>N</CLM06__ProviderOrSupplierSignatureIndicator> <CLM07__AssignmentOrPlanParticipationCode>R</CLM07__AssignmentOrPlanParticipationCode> <CLM08__BenefitsAssignmentCertificationIndicator>N</CLM08__BenefitsAssignmentCertificationIndicator> <CLM09__ReleaseOfInformationCode>N</CLM09__ReleaseOfInformationCode> <CLM10__PatientSignatureSourceCode>X</CLM10__PatientSignatureSourceCode> </CLM_ClaimInformation_2300> <REF_ClaimIdentifierForTransmissionIntermediaries_2300> <REF01__ReferenceIdentificationQualifier>J1</REF01__ReferenceIdentificationQualifier> 
<REF02__ValueAddedNetworkTraceNumber>FVC2514543254</REF02__ValueAddedNetworkTraceNumber> </REF_ClaimIdentifierForTransmissionIntermediaries_2300> <HI_HealthCareDiagnosisCode_2300> <HI01_HealthCareCodeInformation_2300> <HI01_01_DiagnosisTypeCode>CCC</HI01_01_DiagnosisTypeCode> <HI01_02_DiagnosisCode>N111</HI01_02_DiagnosisCode> </HI01_HealthCareCodeInformation_2300> </HI_HealthCareDiagnosisCode_2300> <Loop_2310B> <NM1_RenderingProviderName_2310B> <NM101__EntityIdentifierCode>32</NM101__EntityIdentifierCode> <NM102__EntityTypeQualifier>2</NM102__EntityTypeQualifier> <NM103__RenderingProviderLastOrOrganizationName>JOHN</NM103__RenderingProviderLastOrOrganizationName> <NM104__RenderingProviderFirstName>SMITH</NM104__RenderingProviderFirstName> <NM108__IdentificationCodeQualifier>TT</NM108__IdentificationCodeQualifier> <NM109__RenderingProviderIdentifier>25431251</NM109__RenderingProviderIdentifier> </NM1_RenderingProviderName_2310B> <PRV_RenderingProviderSpecialtyInformation_2310B> <PRV01__ProviderCode>TR</PRV01__ProviderCode> <PRV02__ReferenceIdentificationQualifier>VFD</PRV02__ReferenceIdentificationQualifier> <PRV03__ProviderTaxonomyCode>135454353L</PRV03__ProviderTaxonomyCode> </PRV_RenderingProviderSpecialtyInformation_2310B> </Loop_2310B> <Loop_2400> <LX_ServiceLineNumber_2400> <LX01__AssignedNumber>2</LX01__AssignedNumber> </LX_ServiceLineNumber_2400> <SV1_ProfessionalService_2400> <SV101_CompositeMedicalProcedureIdentifier_2400> <SV101_01_ProductOrServiceIDQualifier>EE</SV101_01_ProductOrServiceIDQualifier> <SV101_02_ProcedureCode>99999</SV101_02_ProcedureCode> <SV101_07_Description>BLOOD</SV101_07_Description> </SV101_CompositeMedicalProcedureIdentifier_2400> <SV102__LineItemChargeAmount>200</SV102__LineItemChargeAmount> <SV103__UnitOrBasisForMeasurementCode>PP</SV103__UnitOrBasisForMeasurementCode> <SV104__ServiceUnitCount>3.5</SV104__ServiceUnitCount> <SV107_CompositeDiagnosisCodePointer_2400> <SV107_01_DiagnosisCodePointer>2</SV107_01_DiagnosisCodePointer> </SV107_CompositeDiagnosisCodePointer_2400> </SV1_ProfessionalService_2400> <DTP_DateServiceDate_2400> <DTP01__DateTimeQualifier>654</DTP01__DateTimeQualifier> <DTP02__DateTimePeriodFormatQualifier>U8</DTP02__DateTimePeriodFormatQualifier> <DTP03__ServiceDate>20191010</DTP03__ServiceDate> </DTP_DateServiceDate_2400> <REF_LineItemControlNumber_2400> <REF01__ReferenceIdentificationQualifier>5F</REF01__ReferenceIdentificationQualifier> <REF02__LineItemControlNumber>DDD.32.123</REF02__LineItemControlNumber> </REF_LineItemControlNumber_2400> </Loop_2400> </Loop_2300> </Loop_2000B> </Loop_2000A> </C13_335010X321A1_837Y6> <C13_335010X321A1_837Y6> <BHT_BeginningOfHierarchicalTransaction> <BHT01__HierarchicalStructureCode>0011</BHT01__HierarchicalStructureCode> <BHT02__TransactionSetPurposeCode>00</BHT02__TransactionSetPurposeCode> <BHT03__OriginatorApplicationTransactionIdentifier>513513TR</BHT03__OriginatorApplicationTransactionIdentifier> <BHT04__TransactionSetCreationDate>20200212</BHT04__TransactionSetCreationDate> <BHT05__TransactionSetCreationTime>1287</BHT05__TransactionSetCreationTime> <BHT06__ClaimOrEncounterIdentifier>DD</BHT06__ClaimOrEncounterIdentifier> </BHT_BeginningOfHierarchicalTransaction> <Loop_1000A> <NM1_SubmitterName_1000A> <NM101__EntityIdentifierCode>27</NM101__EntityIdentifierCode> <NM102__EntityTypeQualifier>9</NM102__EntityTypeQualifier> <NM103__SubmitterLastOrOrganizationName>AAA</NM103__SubmitterLastOrOrganizationName> <NM108__IdentificationCodeQualifier>22</NM108__IdentificationCodeQualifier> 
<NM109__SubmitterIdentifier>55555500</NM109__SubmitterIdentifier> </NM1_SubmitterName_1000A> <PER_SubmitterEDIContactInformation_1000A> <PER01__ContactFunctionCode>LK</PER01__ContactFunctionCode> <PER02__SubmitterContactName>John Smith</PER02__SubmitterContactName> <PER03__CommunicationNumberQualifier>WW</PER03__CommunicationNumberQualifier> <PER04__CommunicationNumber>2132220011</PER04__CommunicationNumber> <PER05__CommunicationNumberQualifier>DD</PER05__CommunicationNumberQualifier> <PER06__CommunicationNumber>DD_2#GMAIL.COM</PER06__CommunicationNumber> </PER_SubmitterEDIContactInformation_1000A> </Loop_1000A> <Loop_1000B> <NM1_ReceiverName_1000B> <NM101__EntityIdentifierCode>21</NM101__EntityIdentifierCode> <NM102__EntityTypeQualifier>0</NM102__EntityTypeQualifier> <NM103__ReceiverName>AAA</NM103__ReceiverName> <NM108__IdentificationCodeQualifier>32</NM108__IdentificationCodeQualifier> <NM109__ReceiverPrimaryIdentifier>2514521</NM109__ReceiverPrimaryIdentifier> </NM1_ReceiverName_1000B> </Loop_1000B> <Loop_2000A> <HL_BillingProviderHierarchicalLevel_2000A> <HL01__HierarchicalIDNumber>32</HL01__HierarchicalIDNumber> <HL03__HierarchicalLevelCode>54</HL03__HierarchicalLevelCode> <HL04__HierarchicalChildCode>32</HL04__HierarchicalChildCode> </HL_BillingProviderHierarchicalLevel_2000A> <Loop_2010AA> <NM1_BillingProviderName_2010AA> <NM101__EntityIdentifierCode>54</NM101__EntityIdentifierCode> <NM102__EntityTypeQualifier>21</NM102__EntityTypeQualifier> <NM103__BillingProviderLastOrOrganizationalName>AAA</NM103__BillingProviderLastOrOrganizationalName> <NM108__IdentificationCodeQualifier>XX</NM108__IdentificationCodeQualifier> <NM109__BillingProviderIdentifier>515151325</NM109__BillingProviderIdentifier> </NM1_BillingProviderName_2010AA> <N3_BillingProviderAddress_2010AA> <N301__BillingProviderAddressLine>214 SS STREET</N301__BillingProviderAddressLine> </N3_BillingProviderAddress_2010AA> <N4_BillingProviderCityStateZIPCode_2010AA> <N401__BillingProviderCityName>LA</N401__BillingProviderCityName> <N402__BillingProviderStateOrProvinceCode>CA</N402__BillingProviderStateOrProvinceCode> <N403__BillingProviderPostalZoneOrZIPCode>93500</N403__BillingProviderPostalZoneOrZIPCode> </N4_BillingProviderCityStateZIPCode_2010AA> <REF_BillingProviderTaxIdentification_2010AA> <REF01__ReferenceIdentificationQualifier>OI</REF01__ReferenceIdentificationQualifier> <REF02__BillingProviderTaxIdentificationNumber>5135151315</REF02__BillingProviderTaxIdentificationNumber> </REF_BillingProviderTaxIdentification_2010AA> </Loop_2010AA> <Loop_2000B> <HL_SubscriberHierarchicalLevel_2000B> <HL01__HierarchicalIDNumber>5</HL01__HierarchicalIDNumber> <HL02__HierarchicalParentIDNumber>5</HL02__HierarchicalParentIDNumber> <HL03__HierarchicalLevelCode>55</HL03__HierarchicalLevelCode> <HL04__HierarchicalChildCode>5</HL04__HierarchicalChildCode> </HL_SubscriberHierarchicalLevel_2000B> <SBR_SubscriberInformation_2000B> <SBR01__PayerResponsibilitySequenceNumberCode>L</SBR01__PayerResponsibilitySequenceNumberCode> <SBR02__IndividualRelationshipCode>32</SBR02__IndividualRelationshipCode> <SBR03__SubscriberGroupOrPolicyNumber>252525Z125</SBR03__SubscriberGroupOrPolicyNumber> <SBR09__ClaimFilingIndicatorCode>NM</SBR09__ClaimFilingIndicatorCode> </SBR_SubscriberInformation_2000B> <Loop_2010BA> <NM1_SubscriberName_2010BA> <NM101__EntityIdentifierCode>DCX</NM101__EntityIdentifierCode> <NM102__EntityTypeQualifier>5</NM102__EntityTypeQualifier> <NM103__SubscriberLastName>SMITH</NM103__SubscriberLastName> 
<NM104__SubscriberFirstName>JOHN</NM104__SubscriberFirstName> <NM108__IdentificationCodeQualifier>CA</NM108__IdentificationCodeQualifier> <NM109__SubscriberPrimaryIdentifier>3656361.</NM109__SubscriberPrimaryIdentifier> </NM1_SubscriberName_2010BA> <N3_SubscriberAddress_2010BA> <N301__SubscriberAddressLine>111 STREET</N301__SubscriberAddressLine> </N3_SubscriberAddress_2010BA> <N4_SubscriberCityStateZIPCode_2010BA> <N401__SubscriberCityName>LA</N401__SubscriberCityName> <N402__SubscriberStateCode>CA</N402__SubscriberStateCode> <N403__SubscriberPostalZoneOrZIPCode>93000</N403__SubscriberPostalZoneOrZIPCode> </N4_SubscriberCityStateZIPCode_2010BA> <DMG_SubscriberDemographicInformation_2010BA> <DMG01__DateTimePeriodFormatQualifier>K5</DMG01__DateTimePeriodFormatQualifier> <DMG02__SubscriberBirthDate>19851010</DMG02__SubscriberBirthDate> <DMG03__SubscriberGenderCode>U</DMG03__SubscriberGenderCode> </DMG_SubscriberDemographicInformation_2010BA> </Loop_2010BA> <Loop_2010BB> <NM1_PayerName_2010BB> <NM101__EntityIdentifierCode>FF</NM101__EntityIdentifierCode> <NM102__EntityTypeQualifier>3</NM102__EntityTypeQualifier> <NM103__PayerName>AAA</NM103__PayerName> <NM108__IdentificationCodeQualifier>GF</NM108__IdentificationCodeQualifier> <NM109__PayerIdentifier>32514</NM109__PayerIdentifier> </NM1_PayerName_2010BB> </Loop_2010BB> <Loop_2300> <CLM_ClaimInformation_2300> <CLM01__PatientControlNumber>5413</CLM01__PatientControlNumber> <CLM02__TotalClaimChargeAmount>651</CLM02__TotalClaimChargeAmount> <CLM05_HealthCareServiceLocationInformation_2300> <CLM05_01_PlaceOfServiceCode>13</CLM05_01_PlaceOfServiceCode> <CLM05_02_FacilityCodeQualifier>D</CLM05_02_FacilityCodeQualifier> <CLM05_03_ClaimFrequencyCode>3</CLM05_03_ClaimFrequencyCode> </CLM05_HealthCareServiceLocationInformation_2300> <CLM06__ProviderOrSupplierSignatureIndicator>N</CLM06__ProviderOrSupplierSignatureIndicator> <CLM07__AssignmentOrPlanParticipationCode>R</CLM07__AssignmentOrPlanParticipationCode> <CLM08__BenefitsAssignmentCertificationIndicator>N</CLM08__BenefitsAssignmentCertificationIndicator> <CLM09__ReleaseOfInformationCode>N</CLM09__ReleaseOfInformationCode> <CLM10__PatientSignatureSourceCode>X</CLM10__PatientSignatureSourceCode> </CLM_ClaimInformation_2300> <REF_ClaimIdentifierForTransmissionIntermediaries_2300> <REF01__ReferenceIdentificationQualifier>J1</REF01__ReferenceIdentificationQualifier> <REF02__ValueAddedNetworkTraceNumber>FVC2514543254</REF02__ValueAddedNetworkTraceNumber> </REF_ClaimIdentifierForTransmissionIntermediaries_2300> <HI_HealthCareDiagnosisCode_2300> <HI01_HealthCareCodeInformation_2300> <HI01_01_DiagnosisTypeCode>CCC</HI01_01_DiagnosisTypeCode> <HI01_02_DiagnosisCode>N111</HI01_02_DiagnosisCode> </HI01_HealthCareCodeInformation_2300> </HI_HealthCareDiagnosisCode_2300> <Loop_2310B> <NM1_RenderingProviderName_2310B> <NM101__EntityIdentifierCode>32</NM101__EntityIdentifierCode> <NM102__EntityTypeQualifier>2</NM102__EntityTypeQualifier> <NM103__RenderingProviderLastOrOrganizationName>JOHN</NM103__RenderingProviderLastOrOrganizationName> <NM104__RenderingProviderFirstName>SMITH</NM104__RenderingProviderFirstName> <NM108__IdentificationCodeQualifier>TT</NM108__IdentificationCodeQualifier> <NM109__RenderingProviderIdentifier>25431251</NM109__RenderingProviderIdentifier> </NM1_RenderingProviderName_2310B> <PRV_RenderingProviderSpecialtyInformation_2310B> <PRV01__ProviderCode>TR</PRV01__ProviderCode> <PRV02__ReferenceIdentificationQualifier>VFD</PRV02__ReferenceIdentificationQualifier> 
<PRV03__ProviderTaxonomyCode>135454353L</PRV03__ProviderTaxonomyCode> </PRV_RenderingProviderSpecialtyInformation_2310B> </Loop_2310B> <Loop_2400> <LX_ServiceLineNumber_2400> <LX01__AssignedNumber>2</LX01__AssignedNumber> </LX_ServiceLineNumber_2400> <SV1_ProfessionalService_2400> <SV101_CompositeMedicalProcedureIdentifier_2400> <SV101_01_ProductOrServiceIDQualifier>EE</SV101_01_ProductOrServiceIDQualifier> <SV101_02_ProcedureCode>99999</SV101_02_ProcedureCode> <SV101_07_Description>BLOOD</SV101_07_Description> </SV101_CompositeMedicalProcedureIdentifier_2400> <SV102__LineItemChargeAmount>200</SV102__LineItemChargeAmount> <SV103__UnitOrBasisForMeasurementCode>PP</SV103__UnitOrBasisForMeasurementCode> <SV104__ServiceUnitCount>3.5</SV104__ServiceUnitCount> <SV107_CompositeDiagnosisCodePointer_2400> <SV107_01_DiagnosisCodePointer>2</SV107_01_DiagnosisCodePointer> </SV107_CompositeDiagnosisCodePointer_2400> </SV1_ProfessionalService_2400> <DTP_DateServiceDate_2400> <DTP01__DateTimeQualifier>654</DTP01__DateTimeQualifier> <DTP02__DateTimePeriodFormatQualifier>U8</DTP02__DateTimePeriodFormatQualifier> <DTP03__ServiceDate>20191010</DTP03__ServiceDate> </DTP_DateServiceDate_2400> <REF_LineItemControlNumber_2400> <REF01__ReferenceIdentificationQualifier>5F</REF01__ReferenceIdentificationQualifier> <REF02__LineItemControlNumber>DDD.32.123</REF02__LineItemControlNumber> </REF_LineItemControlNumber_2400> </Loop_2400> </Loop_2300> </Loop_2000B> </Loop_2000A> </C13_335010X321A1_837Y6> </file> These have to be in two rows, I am using the following python code to convert it into panda data frame, but I am getting empty data frame. import pandas as pd import xml.etree.ElementTree as et def xml_file(file): columns = file.attrib for xml in file.iter('C13_335010X321A1_837Y6'): file_dict = columns.copy() file_dict.update(xml.attrib) yield file_dict tree = et.parse(r"C:\Users\Desktop\test1.xml") root = tree.getroot() df = pd.DataFrame(list(xml_file(root)))
How to convert independent output lists to a dataframe
Hope you are having a great weekend. My problem is as follows: For my designed model i am getting the following predictions: [0.3182012736797333, 0.6817986965179443, 0.5067878365516663, 0.49321213364601135, 0.4795221984386444, 0.520477831363678, 0.532780110836029, 0.46721988916397095, 0.3282901346683502, 0.6717098355293274] [0.362120658159256, 0.6378793120384216, 0.5134761929512024, 0.4865237772464752, 0.46048662066459656, 0.539513349533081, 0.5342788100242615, 0.4657211899757385, 0.34932515025138855, 0.6506748199462891] [0.3647380471229553, 0.6352618932723999, 0.5087167620658875, 0.49128326773643494, 0.4709164798259735, 0.5290834903717041, 0.5408024787902832, 0.4591975510120392, 0.37024226784706116, 0.6297577023506165] [0.43765324354171753, 0.5623468160629272, 0.505147397518158, 0.49485257267951965, 0.45281311869621277, 0.5471869111061096, 0.5416161417961121, 0.45838382840156555, 0.3789178133010864, 0.6210821866989136] [0.44772887229919434, 0.5522711277008057, 0.5119441151618958, 0.48805591464042664, 0.46322566270828247, 0.5367743372917175, 0.5402485132217407, 0.45975151658058167, 0.4145151972770691, 0.5854847431182861] [0.35674020648002625, 0.6432597637176514, 0.48104971647262573, 0.5189502835273743, 0.4554695188999176, 0.54453045129776, 0.5409557223320007, 0.45904430747032166, 0.3258989453315735, 0.6741010546684265] [0.3909384310245514, 0.609061598777771, 0.4915180504322052, 0.5084819793701172, 0.45033228397369385, 0.5496677160263062, 0.5267384052276611, 0.47326159477233887, 0.34493446350097656, 0.6550655364990234] [0.32971733808517456, 0.6702827215194702, 0.5224012732505798, 0.47759872674942017, 0.4692566692829132, 0.5307433605194092, 0.5360044836997986, 0.4639955163002014, 0.41811054944992065, 0.5818894505500793] [0.37096619606018066, 0.6290338039398193, 0.5165190100669861, 0.4834809899330139, 0.4739859998226166, 0.526013970375061, 0.5340168476104736, 0.46598318219184875, 0.3438771069049835, 0.6561229228973389] [0.4189890921115875, 0.5810109376907349, 0.52749103307724, 0.47250890731811523, 0.44485437870025635, 0.5551456212997437, 0.5398098230361938, 0.46019014716148376, 0.3739124536514282, 0.6260875463485718] [0.3979812562465668, 0.6020187139511108, 0.5050275325775146, 0.49497246742248535, 0.4653399884700775, 0.5346599817276001, 0.537341833114624, 0.4626581072807312, 0.33742010593414307, 0.6625799536705017] [0.368088960647583, 0.631911039352417, 0.49925288558006287, 0.5007471442222595, 0.4547160863876343, 0.545283854007721, 0.5408452749252319, 0.45915472507476807, 0.4053747355937958, 0.5946252346038818] As you can see they are independent lists. I want to convert these lists into a dataframe. Although they are independent, they are coming out of a for loop, so i cannot append them because they are not coming at once.
Use: data = [[0.3182012736797333, 0.6817986965179443, 0.5067878365516663, 0.49321213364601135, 0.4795221984386444, 0.520477831363678, 0.532780110836029, 0.46721988916397095, 0.3282901346683502, 0.6717098355293274], [0.362120658159256, 0.6378793120384216, 0.5134761929512024, 0.4865237772464752, 0.46048662066459656, 0.539513349533081, 0.5342788100242615, 0.4657211899757385, 0.34932515025138855, 0.6506748199462891], [0.3647380471229553, 0.6352618932723999, 0.5087167620658875, 0.49128326773643494, 0.4709164798259735, 0.5290834903717041, 0.5408024787902832, 0.4591975510120392, 0.37024226784706116, 0.6297577023506165], [0.43765324354171753, 0.5623468160629272, 0.505147397518158, 0.49485257267951965, 0.45281311869621277, 0.5471869111061096, 0.5416161417961121, 0.45838382840156555, 0.3789178133010864, 0.6210821866989136], [0.44772887229919434, 0.5522711277008057, 0.5119441151618958, 0.48805591464042664, 0.46322566270828247, 0.5367743372917175, 0.5402485132217407, 0.45975151658058167, 0.4145151972770691, 0.5854847431182861], [0.35674020648002625, 0.6432597637176514, 0.48104971647262573, 0.5189502835273743, 0.4554695188999176, 0.54453045129776, 0.5409557223320007, 0.45904430747032166, 0.3258989453315735, 0.6741010546684265], [0.3909384310245514, 0.609061598777771, 0.4915180504322052, 0.5084819793701172, 0.45033228397369385, 0.5496677160263062, 0.5267384052276611, 0.47326159477233887, 0.34493446350097656, 0.6550655364990234], [0.32971733808517456, 0.6702827215194702, 0.5224012732505798, 0.47759872674942017, 0.4692566692829132, 0.5307433605194092, 0.5360044836997986, 0.4639955163002014, 0.41811054944992065, 0.5818894505500793], [0.37096619606018066, 0.6290338039398193, 0.5165190100669861, 0.4834809899330139, 0.4739859998226166, 0.526013970375061, 0.5340168476104736, 0.46598318219184875, 0.3438771069049835, 0.6561229228973389], [0.4189890921115875, 0.5810109376907349, 0.52749103307724, 0.47250890731811523, 0.44485437870025635, 0.5551456212997437, 0.5398098230361938, 0.46019014716148376, 0.3739124536514282, 0.6260875463485718], [0.3979812562465668, 0.6020187139511108, 0.5050275325775146, 0.49497246742248535, 0.4653399884700775, 0.5346599817276001, 0.537341833114624, 0.4626581072807312, 0.33742010593414307, 0.6625799536705017], [0.368088960647583, 0.631911039352417, 0.49925288558006287, 0.5007471442222595, 0.4547160863876343, 0.545283854007721, 0.5408452749252319, 0.45915472507476807, 0.4053747355937958, 0.5946252346038818]] # Create this before your for loop df = pd.DataFrame(columns = range(10)) for pred_list in data: #Add this within your for loop df = df.append(pd.Series(pred_list), ignore_index=True) output:
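A side note, not part of the original answer: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea can be expressed by collecting the lists during the loop and building the frame once at the end:

import pandas as pd

rows = []                      # create this before your for loop
for pred_list in data:         # inside your existing loop
    rows.append(pred_list)

df = pd.DataFrame(rows, columns=range(10))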
different return types for getpath() in lxml
I have folders full of XML files which I want to parse to a dataframe. The following functions iterate through an XML tree recursively and return a dataframe with three columns: path, attributes and text. def XML2DF(filename,df1,MAX_DEPTH=20): with open(filename) as f: xml_str = f.read() tree = etree.fromstring(xml_str) df1 = recursive_parseXML2DF(tree, df1, MAX_DEPTH=MAX_DEPTH) return def recursive_parseXML2DF(element, df1, depth=0, MAX_DEPTH=20): if depth > MAX_DEPTH: return df1 df2 = pd.DataFrame([[element.getroottree().getpath(element), element.attrib, element.text]], columns=["path", "attrib", "text"]) #print(df2) df1 = pd.concat([df1, df2]) for child in element.getchildren(): df1 = recursive_parseXML2DF(child, df1, depth=depth + 1) return df1 The code for the function was adapted from this post. Most of the times the function works fine and returns the entire path but for some documents the returned path looks like this: /*/*[1]/*[3] /*/*[1]/*[3]/*[1] The text tag entry remains valid and correct. The only difference in the XML between working path and widlcard path documents I can make out is that the XML tags are written in all caps. Working example: <?xml version="1.0" encoding="utf-8"?> <root> <Header> <ReceivingApplication>ReceivingApplication</ReceivingApplication> <SendingApplication>SendingApplication</SendingApplication> <MessageControlID>12345</MessageControlID> <ReceivingApplication>ReceivingApplication</ReceivingApplication> <FileCreationDate>2000-01-01T00:00:00</FileCreationDate> </Header> <Einsendung> <Patient> <PatientName>Name</PatientName> <PatientVorname>FirstName</PatientVorname> <PatientGebDat>2000-01-01T00:00:00</PatientGebDat> <PatientSex>4</PatientSex> <PatientPWID>123456</PatientPWID> </Patient> <Visit> <VisitNumber>A2000.0001</VisitNumber> <PatientPLZ>1234</PatientPLZ> <PatientOrt>PatientOrt</PatientOrt> <PatientAdr2> </PatientAdr2> <PatientStrasse>PatientStrasse 01</PatientStrasse> <VisitEinsID>1234</VisitEinsID> <VisitBefund>VisitBefund</VisitBefund> <Befunddatum>2000-01-01T00:00:00</Befunddatum> </Visit> </Einsendung> </root> nonsensical Example: <?xml version="1.0"?> <KRSCHWEIZ xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="krSCHWEIZ"> <KEY_VS>abcdefg</KEY_VS> <KEY_KLR>abcdefg</KEY_KLR> <ABSENDER> <ABSENDER_MELDER_ID>123456</ABSENDER_MELDER_ID> <MELDER> <MELDER_ID>123456</MELDER_ID> <QUELLSYSTEM>ABCDEF</QUELLSYSTEM> <PATIENT> <REFERENZNR>987654</REFERENZNR> <NACHNAME>my name</NACHNAME> <VORNAMEN>my first name</VORNAMEN> <GEBURTSNAME /> <GEBURTSDATUM>my dob</GEBURTSDATUM> <GESCHLECHT>XX</GESCHLECHT> <PLZ>9999</PLZ> <WOHNORT>Mycity</WOHNORT> <STRASSE>mystreet</STRASSE> <HAUSNR>99</HAUSNR> <VERSICHERTENNR>999999999</VERSICHERTENNR> <DATEIEN> <DATEI> <DATEINAME>my_attached_document.html</DATEINAME> <DATEIBASE64>mybase_64_encoded_document</DATEIBASE64> </DATEI> </DATEIEN> </PATIENT> </MELDER> </ABSENDER> </KRSCHWEIZ> How do I get correct explicit path information also for this case?
The presence of namespaces changes the output of .getpath() - you can use .getelementpath() instead, which will include the namespace prefix instead of using wildcards. If the prefix should be discarded completely, you can strip the namespaces out before using .getpath():

import lxml.etree
import pandas as pd

rows = []
tree = lxml.etree.parse("broken.xml")
for node in tree.iter():
    try:
        node.tag = lxml.etree.QName(node).localname
    except ValueError:
        # skip tags with no name
        continue
    rows.append([tree.getpath(node), node.attrib, node.text])

df = pd.DataFrame(rows, columns=["path", "attrib", "text"])

Resulting dataframe:

>>> df
    path                                                attrib  text
0   /KRSCHWEIZ                                          []      \n
1   /KRSCHWEIZ/KEY_VS                                   []      abcdefg
2   /KRSCHWEIZ/KEY_KLR                                  []      abcdefg
3   /KRSCHWEIZ/ABSENDER                                 []      \n
4   /KRSCHWEIZ/ABSENDER/ABSENDER_MELDER_ID              []      123456
5   /KRSCHWEIZ/ABSENDER/MELDER                          []      \n
6   /KRSCHWEIZ/ABSENDER/MELDER/MELDER_ID                []      123456
7   /KRSCHWEIZ/ABSENDER/MELDER/QUELLSYSTEM              []      ABCDEF
8   /KRSCHWEIZ/ABSENDER/MELDER/PATIENT                  []      \n
9   /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/REFERENZNR       []      987654
10  /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/NACHNAME         []      my name
11  /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/VORNAMEN         []      my first name
12  /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/GEBURTSNAME      []      None
13  /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/GEBURTSDATUM     []      my dob
14  /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/GESCHLECHT       []      XX
15  /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/PLZ              []      9999
16  /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/WOHNORT          []      Mycity
17  /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/STRASSE          []      mystreet
18  /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/HAUSNR           []      99
19  /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/VERSICHERTENNR   []      999999999
20  /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN          []      \n
21  /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN/DATEI    []      \n
22  /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN/DAT...   []      my_attached_document.html
23  /KRSCHWEIZ/ABSENDER/MELDER/PATIENT/DATEIEN/DAT...   []      mybase_64_encoded_document
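For completeness, a small sketch of the first suggestion (keeping the namespace information via .getelementpath() rather than stripping it), assuming the same broken.xml file:

import lxml.etree
import pandas as pd

tree = lxml.etree.parse("broken.xml")
# getelementpath() returns an ElementPath expression with {namespace}tag steps
# instead of the positional wildcards that getpath() falls back to.
rows = [[tree.getelementpath(node), node.attrib, node.text] for node in tree.iter()]
df = pd.DataFrame(rows, columns=["path", "attrib", "text"])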
How to convert XML data to a pandas data frame?
I'm trying to analysis XML file with python. I ned to get xml data as a pandas data frame. import pandas as pd import xml.etree.ElementTree as et def parse_XML(xml_file, df_cols): xtree = et.parse(xml_file) xroot = xtree.getroot() rows = [] for node in xroot: res = [] res.append(node.attrib.get(df_cols[0])) for el in df_cols[1:]: if node is not None and node.find(el) is not None: res.append(node.find(el).text) else: res.append(None) rows.append({df_cols[i]: res[i] for i, _ in enumerate(df_cols)}) out_df = pd.DataFrame(rows, columns=df_cols) return out_df parse_XML('/Users/newuser/Desktop/TESTRATP/arrets.xml', ["Name","gml"]) But I'm getting below data frame. Name gml 0 None None 1 None None 2 None None My XML file is : <?xml version="1.0" encoding="UTF-8"?> <PublicationDelivery version="1.09:FR-NETEX_ARRET-2.1-1.0" xmlns="http://www.netex.org.uk/netex" xmlns:core="http://www.govtalk.gov.uk/core" xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:ifopt="http://www.ifopt.org.uk/ifopt" xmlns:siri="http://www.siri.org.uk/siri" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.netex.org.uk/netex"> <PublicationTimestamp>2020-08-05T06:00:01+00:00</PublicationTimestamp> <ParticipantRef>transport.data.gouv.fr</ParticipantRef> <dataObjects> <GeneralFrame id="FR:GeneralFrame:NETEX_ARRET:" version="any"> <members> <Quay id="FR:Quay:zenbus_StopPoint_SP_351400003_LOC:" version="any"> <Name>ST FELICIEN - Centre</Name> <Centroid> <Location> <gml:pos srsName="EPSG:2154">828054.2068251468 6444393.512041969</gml:pos> </Location> </Centroid> <TransportMode>bus</TransportMode> </Quay> <Quay id="FR:Quay:zenbus_StopPoint_SP_361350004_LOC:" version="any"> <Name>ST FELICIEN - Chemin de Juny</Name> <Centroid> <Location> <gml:pos srsName="EPSG:2154">828747.3172982805 6445226.100290826</gml:pos> </Location> </Centroid> <TransportMode>bus</TransportMode> </Quay> <Quay id="FR:Quay:zenbus_StopPoint_SP_343500005_LOC:" version="any"> <Name>ST FELICIEN - Darone</Name> <Centroid> <Location> <gml:pos srsName="EPSG:2154">829036.2709757038 6444724.878001894</gml:pos> </Location> </Centroid> <TransportMode>bus</TransportMode> </Quay> <Quay id="FR:Quay:zenbus_StopPoint_SP_359440004_LOC:" version="any"> <Name>ST FELICIEN - Col de Fontayes</Name> <Centroid> <Location> <gml:pos srsName="EPSG:2154">829504.7993360173 6445490.57188837</gml:pos> </Location> </Centroid> <TransportMode>bus</TransportMode> </Quay> </members> </GeneralFrame> </dataObjects> </PublicationDelivery> I gave you here little part of my xml file. I can't give you full XML file as it exceeding the character limits in stackoverflow. I'm wondering why I got above output and I don't know where the my error is. As I'm new to this please some one can help me? Thank you
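One likely cause of the None values, noted as an aside before the answers below: the sample declares a default namespace, so plain tag names never match with ElementTree. A minimal sketch using a namespace mapping (the file path is the one from the question):

import pandas as pd
import xml.etree.ElementTree as et

# Aside, not from the answers below: the file declares a default namespace
# (xmlns="http://www.netex.org.uk/netex"), so unqualified lookups such as
# node.find("Name") return None. Qualifying the tags fixes that.
ns = {"netex": "http://www.netex.org.uk/netex",
      "gml": "http://www.opengis.net/gml/3.2"}
xroot = et.parse('/Users/newuser/Desktop/TESTRATP/arrets.xml').getroot()

rows = []
for quay in xroot.iter("{http://www.netex.org.uk/netex}Quay"):
    name = quay.find("netex:Name", ns)
    pos = quay.find(".//gml:pos", ns)
    rows.append({"Name": name.text if name is not None else None,
                 "gml": pos.text if pos is not None else None})

out_df = pd.DataFrame(rows, columns=["Name", "gml"])

The answers below take a different route and avoid manual namespace handling altogether.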
My approach is avoid xml parsing and switch straight into pandas by using xmlplain to generate JSON from xml. import xmlplain with open("so_sample.xml") as f: js = xmlplain.xml_to_obj(f, strip_space=True, fold_dict=True) df1 = pd.json_normalize(js).explode("PublicationDelivery.dataObjects.GeneralFrame.members") # cleanup column names... df1 = df1.rename(columns={c:c.replace("PublicationDelivery.", "").replace("dataObjects.GeneralFrame.","").strip() for c in df1.columns}) # drop spurious columns df1 = df1.drop(columns=[c for c in df1.columns if c[0]=="#"]) # expand second level of dictionaries df1 = pd.json_normalize(df1.to_dict(orient="records")) # cleanup columns from second set of dictionaries df1 = df1.rename(columns={c:c.replace("members.Quay.", "") for c in df1.columns}) # expand next list and dicts df1 = pd.json_normalize(df1.explode("Centroid.Location.gml:pos").to_dict(orient="records")) # there are some NaNs - dela with them df1["Centroid.Location.gml:pos.#srsName"].fillna(method="ffill", inplace=True) df1["Centroid.Location.gml:pos"].fillna(method="bfill", inplace=True) # de-dup df1 = df1.groupby("#id", as_index=False).first() # more columns than requested... for SO output print(df1.loc[:,["Name", "Centroid.Location.gml:pos.#srsName", "Centroid.Location.gml:pos"]].to_string(index=False)) output Name Centroid.Location.gml:pos.#srsName Centroid.Location.gml:pos ST FELICIEN - Darone EPSG:2154 829036.2709757038 6444724.878001894 ST FELICIEN - Centre EPSG:2154 828054.2068251468 6444393.512041969 ST FELICIEN - Col de Fontayes EPSG:2154 829504.7993360173 6445490.57188837 ST FELICIEN - Chemin de Juny EPSG:2154 828747.3172982805 6445226.100290826
Alternative solution using pandas-read-xml pip install pandas-read-xml import pandas_read_xml as pdx from pandas_read_xml import fully_flatten df = pdx.read_xml(xml, ['PublicationDelivery', 'dataObjects', 'GeneralFrame', 'members']).pipe(fully_flatten) The list is just the tags that you want to navigate to as the "root". You many need to clean the column names afterwards.
parse xml to pandas data frame in python
I am trying to read the XML file and convert it to pandas. However it returns empty data This is the sample of xml structure: <Instance ID="1"> <MetaInfo StudentID ="DTSU040" TaskID="LP03_PR09.bLK.sh" DataSource="DeepTutorSummer2014"/> <ProblemDescription>A car windshield collides with a mosquito, squashing it.</ProblemDescription> <Question>How does this work tion?</Question> <Answer>tthis is my best </Answer> <Annotation Label="correct(0)|correct_but_incomplete(1)|contradictory(0)|incorrect(0)"> <AdditionalAnnotation ContextRequired="0" ExtraInfoInAnswer="0"/> <Comments Watch="1"> The student forgot to tell the opposite force. Opposite means opposite direction, which is important here. However, one can argue that the opposite is implied. See the reference answers.</Comments> </Annotation> <ReferenceAnswers> 1: Since the windshield exerts a force on the mosquito, which we can call action, the mosquito exerts an equal and opposite force on the windshield, called the reaction. </ReferenceAnswers> </Instance> I have tried this code, however it's not working on my side. It returns empty dataframe. import pandas as pd import xml.etree.ElementTree as et xtree = et.parse("grade_data.xml") xroot = xtree.getroot() df_cols = ["ID", "TaskID", "DataSource", "ProblemDescription", 'Question', 'Answer', 'ContextRequired', 'ExtraInfoInAnswer', 'Comments', 'Watch', 'ReferenceAnswers'] rows = [] for node in xroot: s_name = node.attrib.get("ID") s_student = node.find("StudentID") s_task = node.find("TaskID") s_source = node.find("DataSource") s_desc = node.find("ProblemDescription") s_question = node.find("Question") s_ans = node.find("Answer") s_label = node.find("Label") s_contextrequired = node.find("ContextRequired") s_extraInfoinAnswer = node.find("ExtraInfoInAnswer") s_comments = node.find("Comments") s_watch = node.find("Watch") s_referenceAnswers = node.find("ReferenceAnswers") rows.append({"ID": s_name,"StudentID":s_student, "TaskID": s_task, "DataSource": s_source, "ProblemDescription": s_desc , "Question": s_question , "Answer": s_ans ,"Label": s_label, "s_contextrequired": s_contextrequired , "ExtraInfoInAnswer": s_extraInfoinAnswer , "Comments": s_comments , "Watch": s_watch, "ReferenceAnswers": s_referenceAnswers, }) out_df = pd.DataFrame(rows, columns = df_cols)
The problem in your solution was that the "element data extraction" was not done properly. The xml you mentioned in the question is nested in several layers. And that is why we need to recursively read and extract the data. The following solution should give you what you need in this case. Although I would encourage you to look at this article and the python documentation for more clarity. Method: 1 import numpy as np import pandas as pd #import os import xml.etree.ElementTree as ET def xml2df(xml_source, df_cols, source_is_file = False, show_progress=True): """Parse the input XML source and store the result in a pandas DataFrame with the given columns. For xml_source = xml_file, Set: source_is_file = True For xml_source = xml_string, Set: source_is_file = False <element attribute_key1=attribute_value1, attribute_key2=attribute_value2> <child1>Child 1 Text</child1> <child2>Child 2 Text</child2> <child3>Child 3 Text</child3> </element> Note that for an xml structure as shown above, the attribute information of element tag can be accessed by list(element). Any text associated with <element> tag can be accessed as element.text and the name of the tag itself can be accessed with element.tag. """ if source_is_file: xtree = ET.parse(xml_source) # xml_source = xml_file xroot = xtree.getroot() else: xroot = ET.fromstring(xml_source) # xml_source = xml_string consolidator_dict = dict() default_instance_dict = {label: None for label in df_cols} def get_children_info(children, instance_dict): # We avoid using element.getchildren() as it is deprecated. # Instead use list(element) to get a list of attributes. for child in children: #print(child) #print(child.tag) #print(child.items()) #print(child.getchildren()) # deprecated method #print(list(child)) if len(list(child))>0: instance_dict = get_children_info(list(child), instance_dict) if len(list(child.keys()))>0: items = child.items() instance_dict.update({key: value for (key, value) in items}) #print(child.keys()) instance_dict.update({child.tag: child.text}) return instance_dict # Loop over all instances for instance in list(xroot): instance_dict = default_instance_dict.copy() ikey, ivalue = instance.items()[0] # The first attribute is "ID" instance_dict.update({ikey: ivalue}) if show_progress: print('{}: {}={}'.format(instance.tag, ikey, ivalue)) # Loop inside every instance instance_dict = get_children_info(list(instance), instance_dict) #consolidator_dict.update({ivalue: instance_dict.copy()}) consolidator_dict[ivalue] = instance_dict.copy() df = pd.DataFrame(consolidator_dict).T df = df[df_cols] return df Run the following to generate the desired output. xml_source = r'grade_data.xml' df_cols = ["ID", "TaskID", "DataSource", "ProblemDescription", "Question", "Answer", "ContextRequired", "ExtraInfoInAnswer", "Comments", "Watch", 'ReferenceAnswers'] df = xml2df(xml_source, df_cols, source_is_file = True) df Method: 2 Given you have the xml_string, you could convert xml >> dict >> dataframe. run the following to get the desired output. Note: You will need to install xmltodict to use Method-2. This method is inspired by the solution suggested by #martin-blech at How to convert XML to JSON in Python? [duplicate] . Kudos to #martin-blech for making it. 
pip install -U xmltodict Solution def read_recursively(x, instance_dict): #print(x) txt = '' for key in x.keys(): k = key.replace("#","") if k in df_cols: if isinstance(x.get(key), dict): instance_dict, txt = read_recursively(x.get(key), instance_dict) #else: instance_dict.update({k: x.get(key)}) #print('{}: {}'.format(k, x.get(key))) else: #print('else: {}: {}'.format(k, x.get(key))) # dig deeper if value is another dict if isinstance(x.get(key), dict): instance_dict, txt = read_recursively(x.get(key), instance_dict) # add simple text associated with element if k=='#text': txt = x.get(key) # update text to corresponding parent element if (k!='#text') and (txt!=''): instance_dict.update({k: txt}) return (instance_dict, txt) You will need the function read_recursively() given above. Now run the following. import xmltodict, json o = xmltodict.parse(xml_string) # INPUT: XML_STRING #print(json.dumps(o)) # uncomment to see xml to json converted string consolidated_dict = dict() oi = o['Instances']['Instance'] for x in oi: instance_dict = dict() instance_dict, _ = read_recursively(x, instance_dict) consolidated_dict.update({x.get("#ID"): instance_dict.copy()}) df = pd.DataFrame(consolidated_dict).T df = df[df_cols] df
Several issues:
- Calling .find on the loop variable, node, expects a child node to exist: current_node.find('child_of_current_node'). However, since all the nodes are children of the root they do not maintain their own children, so no loop is required;
- Not checking for NoneType, which can result from missing nodes with find() and prevents retrieving .tag or .text or other attributes;
- Not retrieving node content with .text, otherwise the <Element...> object is returned.

Consider this adjustment using the ternary condition expression, a if condition else b, to ensure each variable has a value regardless:

rows = []

s_name = xroot.attrib.get("ID")
s_student = xroot.find("StudentID").text if xroot.find("StudentID") is not None else None
s_task = xroot.find("TaskID").text if xroot.find("TaskID") is not None else None
s_source = xroot.find("DataSource").text if xroot.find("DataSource") is not None else None
s_desc = xroot.find("ProblemDescription").text if xroot.find("ProblemDescription") is not None else None
s_question = xroot.find("Question").text if xroot.find("Question") is not None else None
s_ans = xroot.find("Answer").text if xroot.find("Answer") is not None else None
s_label = xroot.find("Label").text if xroot.find("Label") is not None else None
s_contextrequired = xroot.find("ContextRequired").text if xroot.find("ContextRequired") is not None else None
s_extraInfoinAnswer = xroot.find("ExtraInfoInAnswer").text if xroot.find("ExtraInfoInAnswer") is not None else None
s_comments = xroot.find("Comments").text if xroot.find("Comments") is not None else None
s_watch = xroot.find("Watch").text if xroot.find("Watch") is not None else None
s_referenceAnswers = xroot.find("ReferenceAnswers").text if xroot.find("ReferenceAnswers") is not None else None

rows.append({"ID": s_name, "StudentID": s_student, "TaskID": s_task, "DataSource": s_source,
             "ProblemDescription": s_desc, "Question": s_question, "Answer": s_ans, "Label": s_label,
             "s_contextrequired": s_contextrequired, "ExtraInfoInAnswer": s_extraInfoinAnswer,
             "Comments": s_comments, "Watch": s_watch, "ReferenceAnswers": s_referenceAnswers})

out_df = pd.DataFrame(rows, columns=df_cols)

Alternatively, run a more dynamic version assigning to an inner dictionary using the iterator variable:

rows = []
for node in xroot:
    inner = {}
    inner[node.tag] = node.text
    rows.append(inner)

out_df = pd.DataFrame(rows, columns=df_cols)

Or a list/dict comprehension:

rows = [{node.tag: node.text} for node in xroot]
out_df = pd.DataFrame(rows, columns=df_cols)