Parse by Python

I have a problem parsing XML with Python. I have this kind of XML file:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans scribe="maria" audio_filename="agora_2007_11_05_a" version="11" version_date="080826" xml:lang="catalan">
<Topics>
<Topic id="to1" desc="music"/>
<Topic id="to2" desc="bgnoise"/>
<Topic id="to4" desc="silence"/>
<Topic id="to5" desc="speech"/>
<Topic id="to6" desc="speech+music"/>
</Topics>
<Speakers>
<Speaker id="spk1" name="Xavi Coral" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
<Speaker id="spk2" name="Ferran Martínez" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
<Speaker id="spk3" name="Jordi Barbeta" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
</Speakers>
<Section type="report" topic="to6" startTime="111.286" endTime="119.308">
<Turn speaker="spk1" startTime="111.286" endTime="119.308" mode="planned" channel="studio">
<Sync time="111.286"/>
ha estat director del diari La Vanguàrdia,
<Sync time="113.56"/>
ha estat director general de Barcelona Televisió i director del Centre Territorial de Televisió Espanyola a Catalunya,
<Sync time="119.308"/>
actualment col·labora en el diari
<Event desc="es" type="language" extent="begin"/>
El Periódico
<Event desc="es" type="language" extent="end"/>
de Catalunya.
</Turn>
</Section>
And this is my Python code:
import xml.etree.ElementTree as etree
import os
import sys

xmlD = etree.parse(sys.stdin)
root = xmlD.getroot()
sections = root.getchildren()[2].getchildren()
for section in sections:
    turns = section.getchildren()
    for turn in turns:
        speaker = turn.get('speaker')
        mode = turn.get('mode')
        childs = turn.getchildren()
        for child in childs:
            time = child.get('time')
            opt = child.get('desc')
            extent = child.get('extent')
            # Map the language code of an opening Event to its output tag
            if opt == 'es' and extent == 'begin':
                opt = "ESP:"
            elif opt == "la" and extent == 'begin':
                opt = "LAT:"
            elif opt == "en" and extent == 'begin':
                opt = "ENG:"
            else:
                opt = ""
            if time:
                time = time
            else:
                time = ""
            print time, opt+child.tail.encode('latin-1')
I need to mark the words pronounced in another language with a LANG: tag. For example:
spanish words ENG:hello, spanish words
But when two consecutive words are pronounced in another language, I don't know how to tag each of them: spanish words ENG:hello ENG:man, spanish words. The change of language is marked by the Event XML tag.
Currently the output is:
actualment col·labora en el diari ESP:El Periódico de Catalunya.
and I want:
actualment col·labora en el diari ESP:El ESP:Periódico de Catalunya.
Could anyone help me?
Thank you!

You can split the tail into words and put the tag in front of each one. Replace your print statement with something like:
print time, opt+(" " + opt).join([c.encode('latin-1').decode('latin-1') for c in child.tail.split(' ')])
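The same idea can be sketched in Python 3 with a small helper that prefixes every word of an element's tail. This is only an illustration, not the asker's or answerer's code: the names `LANG_PREFIX`, `tag_words` and `render_turn` are made up here, and the sample is a cut-down `Turn` taken from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from the Event "desc" code to the output tag
LANG_PREFIX = {"es": "ESP:", "la": "LAT:", "en": "ENG:"}

SAMPLE = """<Turn speaker="spk1">
<Sync time="119.308"/>
actualment col·labora en el diari
<Event desc="es" type="language" extent="begin"/>
El Periódico
<Event desc="es" type="language" extent="end"/>
de Catalunya.
</Turn>"""


def tag_words(text, prefix):
    """Put the prefix in front of every whitespace-separated word."""
    return " ".join(prefix + word for word in text.split())


def render_turn(turn):
    """Join the tails of all children, tagging text that follows a 'begin' Event."""
    parts = []
    for child in turn:
        prefix = ""
        if child.tag == "Event" and child.get("extent") == "begin":
            prefix = LANG_PREFIX.get(child.get("desc"), "")
        tail = (child.tail or "").strip()
        if tail:
            parts.append(tag_words(tail, prefix))
    return " ".join(parts)


result = render_turn(ET.fromstring(SAMPLE))
print(result)
```

Guarding against a missing tail with `child.tail or ""` avoids an error when an element is not followed by any text.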

Related

Parse XML to CSV when XML tag has child attributes

I've written a small Python app to print some XML tags and selected child attributes. The XML files are electronic invoices used here in Mexico; here is an example:
<?xml version="1.0" encoding="UTF-8"?><cfdi:Comprobante xmlns:cfdi="http://www.sat.gob.mx/cfd/4" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Version="4.0" xsi:schemaLocation="http://www.sat.gob.mx/cfd/4 http://www.sat.gob.mx/sitio_internet/cfd/4/cfdv40.xsd" Serie="V" Folio="10030062" Fecha="2022-11-09T18:55:51" Sello="kWjohv/nlmGBVrIUBxeULiiF2HiUGxAsDC4FTirGnF8GMD7tTVDwpzDOVcyJJupQYJKj/xRPIz46i1RjZYX2jIskXxJwb5QkWfSUC6rO3TdHr4nqJQnLCD2cdp66u2/v+8uYJv+as7uXvuGv1JwQ67Mg037b0IPTjHPKaZvRwIBQCrLukLB4bOX8yuBGWWqrAqJPR/eS/wRt3QedyBhUIbUsebRgtirOQ0ywarSPUJ9Dll0KmaWq3rrHN+jkoAUZSgy+mJoR2WldeIbuiHXml/QXezl4o34ICK32gYyzvzrpLTslxPYTKcoKzDvGo2jK5/T7NctbNrrH29i515lugg==" FormaPago="04" NoCertificado="00001000000503805521" Certificado="MIIGITCCBAmgAwIBAgIUMDAwMDEwMDAwMDA1MDM4MDU1MjEwDQYJKoZIhvcNAQELBQAwggGEMSAwHgYDVQQDDBdBVVRPUklEQUQgQ0VSVElGSUNBRE9SQTEuMCwGA1UECgwlU0VSVklDSU8gREUgQURNSU5JU1RSQUNJT04gVFJJQlVUQVJJQTEaMBgGA1UECwwRU0FULUlFUyBBdXRob3JpdHkxKjAoBgkqhkiG9w0BCQEWG2NvbnRhY3RvLnRlY25pY29Ac2F0LmdvYi5teDEmMCQGA1UECQwdQVYuIEhJREFMR08gNzcsIENPTC4gR1VFUlJFUk8xDjAMBgNVBBEMBTA2MzAwMQswCQYDVQQGEwJNWDEZMBcGA1UECAwQQ0lVREFEIERFIE1FWElDTzETMBEGA1UEBwwKQ1VBVUhURU1PQzEVMBMGA1UELRMMU0FUOTcwNzAxTk4zMVwwWgYJKoZIhvcNAQkCE01yZXNwb25zYWJsZTogQURNSU5JU1RSQUNJT04gQ0VOVFJBTCBERSBTRVJWSUNJT1MgVFJJQlVUQVJJT1MgQUwgQ09OVFJJQlVZRU5URTAeFw0yMDA0MTYyMDE1MTdaFw0yNDA0MTYyMDE1MTdaMIHvMTAwLgYDVQQDEydQUkVNSVVNIFJFU1RBVVJBTlQgQlJBTkRTIFMgREUgUkwgREUgQ1YxMDAuBgNVBCkTJ1BSRU1JVU0gUkVTVEFVUkFOVCBCUkFORFMgUyBERSBSTCBERSBDVjEwMC4GA1UEChMnUFJFTUlVTSBSRVNUQVVSQU5UIEJSQU5EUyBTIERFIFJMIERFIENWMSUwIwYDVQQtExxQUkIxMDA4MDJIMjAgLyBSQVpFNjUwNTAzVUY4MR4wHAYDVQQFExUgLyBSQVpFNjUwNTAzSE5FTUNMMDcxEDAOBgNVBAsUB1BSQl9GQUMwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQCacpMWcQqSuS0mc8CfDLBvhLqPL5LyxYcEi/TYqHpje3DeVkkB6uYB19+3MO3oTnGnZgt7Jhs6/eM1+3ch/4EnAxUvVbBAHaXUUmRHTXGwBgqRMHgYYQ/DwsKHjL2fQoCodxSsJCKSg93GO4JXXHIFITALb9aOmPLd/hRc4krOqZT2egVL/HrIY+4Y2L9y9HEH+B8HUC5tbmsal5V9XNQs86nSg8Zc8IPUNMhWRQtKwdIwDwCTccYTTiBK7O2ykiba6/E
f3ORb1bDHv8YSzfjnNpD/yhXn3PyCKR9KjXp1dxGyFsEbqZH5SwUp5/aDDXetI1dal7GYSxqYA54BRKQFAgMBAAGjHTAbMAwGA1UdEwEB/wQCMAAwCwYDVR0PBAQDAgbAMA0GCSqGSIb3DQEBCwUAA4ICAQAU8WJ25ANnPSd09lBj1XsKcDlREx1zr3Tlw9UrFIZZJdsd2f0BeJFtolsWO3afHiVcpk5IfUshjI9fe/uzm8AbbMPpaoBhywoHTBJiG4bGkwQpVEddjufDKKxkuao+NALpwhfFc8kNJTmG0FuOYEVU7pKh/gz2kZOhcKViXGt3OQYLlUZ6+PP99Z2AePkz2x6gtC+A20oxfDLkPXqtEez2mby//bUSgtGsFWTIkrtC7Zro47zNOCYDngKWoke4T91o8xTtcABoeRlZTDovLCFsVm0zg5Cd22PWFkfIvlVZyIRSlJcrq2P3fo0fzeQ+rG+CpntIfOrYZr5eQHOOLUMPavazsTvFDJQpnCbZnNIxnaMKAPmXbgJHyMx24tARd0rGEuM/KLn/ZW2TCUAD5mofsT6++Z/EMsAZ68Tv4ZbwcPlWDbuJEUTsK/z2angnO55xA+NdPz+MltizMUcKXjzzvUanOAXQNIHD2wEbyXHpD3Ytb6BU6OOAx7HNiBnokxkyr7riD/slEL/di09S3Po3Q5X0z4ygUh2lHyxJDDJtNYiYLsscbliVVk0BtPAuTidOlLutw9N19zSE4AZgzIhwIF7oiJlM4EytSIZsM6GUWniN4+tWRDoV+sEgpKnblH4ms3OHB3ZE5LsgHAjcfFyToVaA3GpzLJSkQawmhc+ylw==" SubTotal="145.69" Moneda="MXN" Total="169.00" TipoDeComprobante="I" Exportacion="01" MetodoPago="PUE" LugarExpedicion="11520"><cfdi:Emisor Rfc="PRB100802H20" Nombre="PREMIUM RESTAURANT BRANDS" RegimenFiscal="601"/><cfdi:Receptor Rfc="IST190806QJ7" Nombre="INDRA SISTEMAS TRANSPORTE Y DEFENSA" DomicilioFiscalReceptor="11520" RegimenFiscalReceptor="601" UsoCFDI="G03"/><cfdi:Conceptos><cfdi:Concepto ClaveProdServ="90101503" NoIdentificacion="0385101372231252" ObjetoImp="02" Cantidad="1" ClaveUnidad="XPK" Unidad="Paquete" Descripcion="PQT. 
DE ALIMENTOS (CONSUMO: 2022-11-08) FOLIO(0385101372231252)" ValorUnitario="145.69" Importe="145.69"><cfdi:Impuestos><cfdi:Traslados><cfdi:Traslado Base="145.69" Impuesto="002" TipoFactor="Tasa" TasaOCuota="0.160000" Importe="23.31"/></cfdi:Traslados></cfdi:Impuestos></cfdi:Concepto></cfdi:Conceptos><cfdi:Impuestos TotalImpuestosTrasladados="23.31"><cfdi:Traslados><cfdi:Traslado Base="145.69" Impuesto="002" TipoFactor="Tasa" TasaOCuota="0.160000" Importe="23.31"/></cfdi:Traslados></cfdi:Impuestos><cfdi:Complemento><tfd:TimbreFiscalDigital xmlns:tfd="http://www.sat.gob.mx/TimbreFiscalDigital" FechaTimbrado="2022-11-09T19:05:56" UUID="67B2DDD8-ABCF-4CD1-B435-C228742542B6" NoCertificadoSAT="00001000000503270882" SelloCFD="kWjohv/nlmGBVrIUBxeULiiF2HiUGxAsDC4FTirGnF8GMD7tTVDwpzDOVcyJJupQYJKj/xRPIz46i1RjZYX2jIskXxJwb5QkWfSUC6rO3TdHr4nqJQnLCD2cdp66u2/v+8uYJv+as7uXvuGv1JwQ67Mg037b0IPTjHPKaZvRwIBQCrLukLB4bOX8yuBGWWqrAqJPR/eS/wRt3QedyBhUIbUsebRgtirOQ0ywarSPUJ9Dll0KmaWq3rrHN+jkoAUZSgy+mJoR2WldeIbuiHXml/QXezl4o34ICK32gYyzvzrpLTslxPYTKcoKzDvGo2jK5/T7NctbNrrH29i515lugg==" SelloSAT="LBOVbhfMGU8T2Tsrz6fFTLkCz90Z0sZIkJqLquayWD5GIhdw6UDvp2Lo5r40jjGC1WwvHMsimi6Ho5xMH70nHH9gkeUIRK3BdsPcUjwFSnYzL1TwG70ZGf7hFBh8uflI1jKzLPRFvWhfHyw1Wznof9NtlXCvSRYhmlcxM6/kj/gOOG0hrq+DJaEsTNJgD7XQzUlMJ9/Casc2kgvOYAdwpXdmkNEtEe9oqQiti4VbPXxEUKpE66hik/Rg4txFMCTPlAMpiz3XfDig/gp6lrFnb/TYkSFr3E9/oPJxoig4xTwPuCZ9uOxfExpxtI3ASXpCoh4isWqqlgxc7abxIoA5Tw==" Version="1.1" RfcProvCertif="TLE011122SC2" xsi:schemaLocation="http://www.sat.gob.mx/TimbreFiscalDigital http://www.sat.gob.mx/sitio_internet/cfd/TimbreFiscalDigital/TimbreFiscalDigitalv11.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/></cfdi:Complemento><cfdi:Addenda><Referencia xmlns="https://facturacion.prb.com.mx/XSD/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="https://facturacion.prb.com.mx/XSD/ https://facturacion.prb.com.mx/XSD/prb_addenda.xsd"
ticket="0385101372231252"/></cfdi:Addenda></cfdi:Comprobante>
Here is the code I've written:
while True:
    import xml.dom.minidom
    import csv
    from tkinter import *
    from tkinter import filedialog
    root = Tk()
    root.filename = filedialog.askopenfilename(title = "Select file", filetypes =[('XML Files', '*.xml')])
    print (root.filename)
    root.mainloop()
    print("------------------------------------------------------------------------------")
    def main():
        # use the parse() function to load and parse an XML file
        doc = xml.dom.minidom.parse(root.filename)
        # print out the document node and the name of the first child tag
        print (doc.nodeName)
        print (doc.firstChild.tagName)
        # get a list of XML tags from the document and print each one
        cfd = doc.getElementsByTagName("cfdi:Comprobante")
        print ("%d Monto:" % cfd.length)
        for skill in cfd:
            print (skill.getAttribute("Total"))
        cfd = doc.getElementsByTagName("cfdi:Comprobante")
        print ("%d Fecha:" % cfd.length)
        for skill in cfd:
            print (skill.getAttribute("Fecha"))
        cfd = doc.getElementsByTagName("cfdi:Concepto")
        print ("%d Descripción:" % cfd.length)
        for skill in cfd:
            print (skill.getAttribute("Descripcion"))
        cfd = doc.getElementsByTagName("cfdi:Emisor")
        print ("%d RFC_Emisor:" % cfd.length)
        for skill in cfd:
            print (skill.getAttribute("Rfc"))
        cfd = doc.getElementsByTagName("cfdi:Emisor")
        print ("%d Emisor:" % cfd.length)
        for skill in cfd:
            print (skill.getAttribute("Nombre"))
        cfd = doc.getElementsByTagName("tfd:TimbreFiscalDigital")
        print ("%d UUID:" % cfd.length)
        for skill in cfd:
            print (skill.getAttribute("UUID"))
        cfd = doc.getElementsByTagName("cfdi:Receptor")
        print ("%d RFC_Receptor:" % cfd.length)
        for skill in cfd:
            print (skill.getAttribute("Rfc"))
        cfd = doc.getElementsByTagName("cfdi:Receptor")
        print ("%d Receptor:" % cfd.length)
        for skill in cfd:
            print (skill.getAttribute("Nombre"))
        print("------------------------------------------------------------------------------")
    if __name__ == "__main__":
        main()
    try_again = int(input("Press 1 to try again, 0 to exit."))
    if try_again == 0:
        break # break out of the outer while loop
Now I'm trying to update it to write everything to a CSV file. I've tried the following code:
import xml.etree.ElementTree as Xet
import pandas as pd
import xml.dom.minidom
import csv
from tkinter import *
from tkinter import filedialog

root1 = Tk()
root1.filename = filedialog.askopenfilename(title = "Select file", filetypes =[('XML Files', '*.xml')])
print (root1.filename)
cols = ["Monto","Fecha","Descripcion","RFC_Emisor","Emisor","UUID","RFC_Receptor","Receptor"]
rows = []
# use the parse() function to load and parse an XML file
doc = xml.dom.minidom.parse(root1.filename)
Total = doc.getElementsByTagName("cfdi:Comprobante")
for skill in Total:
    print (skill.getAttribute("Total"))
Date = doc.getElementsByTagName("cfdi:Comprobante")
for skill in Date:
    print (skill.getAttribute("Fecha"))
Desc = doc.getElementsByTagName("cfdi:Concepto")
for skill in Desc:
    print (skill.getAttribute("Descripcion"))
RFC1 = doc.getElementsByTagName("cfdi:Emisor")
for skill in RFC1:
    print (skill.getAttribute("Rfc"))
Name = doc.getElementsByTagName("cfdi:Emisor")
for skill in Name:
    print (skill.getAttribute("Nombre"))
UUI = doc.getElementsByTagName("tfd:TimbreFiscalDigital")
for skill in UUI:
    print (skill.getAttribute("UUID"))
RFC2 = doc.getElementsByTagName("cfdi:Receptor")
for skill in RFC2:
    print (skill.getAttribute("Rfc"))
Name2 = doc.getElementsByTagName("cfdi:Receptor")
for skill in Name2:
    print (skill.getAttribute("Nombre"))
#Parsing the XML file
xmlparse = Xet.parse(root1.filename)
root = xmlparse.getroot()
for i in root:
    Monto = Total
    Fecha = Date
    Descripcion = Desc
    RFC_Emisor = RFC1
    Emisor = Name
    UUID = UUI
    RFC_Receptor = RFC2
    Receptor = Name2
rows.append({"Monto": Monto,
             "Fecha": Fecha,
             "Descripcion": Descripcion,
             "RFC_Emisor": RFC_Emisor,
             "Emisor": Emisor,
             "UUID": UUID,
             "RFC_Receptor": RFC_Receptor,
             "Receptor": Receptor})
df= pd.DataFrame(rows, columns=cols)
df.to_csv('output.csv')
But the CSV file only writes this:
,Monto,Fecha,Descripcion,RFC_Emisor,Emisor,UUID,RFC_Receptor,Receptor
0,[<DOM Element: cfdi:Comprobante at 0x2c76ad9a030>],[<DOM Element: cfdi:Comprobante at 0x2c76ad9a030>],[<DOM Element: cfdi:Concepto at 0x2c76adbdef0>],[<DOM Element: cfdi:Emisor at 0x2c76adbdbd0>],[<DOM Element: cfdi:Emisor at 0x2c76adbdbd0>],[<DOM Element: tfd:TimbreFiscalDigital at 0x2c76adbe5d0>],[<DOM Element: cfdi:Receptor at 0x2c76adbdd10>],[<DOM Element: cfdi:Receptor at 0x2c76adbdd10>]
I know I'm parsing the XML twice, but I cannot figure out how to do it properly.
The expected CSV should look like this:
,Monto,Fecha,Descripcion,RFC_Emisor,Emisor,UUID,RFC_Receptor,Receptor
0,169.00,2022-11-09T18:55:51,PQT. DE ALIMENTOS (CONSUMO: 2022-11-08) FOLIO(0385101372231252),PRB100802H20,PREMIUM RESTAURANT BRANDS,67B2DDD8-ABCF-4CD1-B435-C228742542B6,IST190806QJ7,INDRA SISTEMAS TRANSPORTE Y DEFENSA
EDIT 2022-11-14
I've tried with this code but cannot get the Total or the Fecha value
import pandas as pd
import xml.dom.minidom
import xml.etree.ElementTree as Xet
from tkinter import *
from tkinter import filedialog
from lxml import etree

root1 = Tk()
root1.filename = filedialog.askopenfilename(title="Select file", filetypes=[('XML Files','*.xml')])
print (root1.filename)
cols = ["Monto","Fecha","Descripcion","RFC_Emisor","Emisor","UUID"]
rows = []
row =[]
doc = xml.dom.minidom.parse(root1.filename)
xmlparse = Xet.parse(root1.filename)
root = xmlparse.getroot()
for m in root.findall('.//*[@Total]'):
    row.extend(m.attrib.get("Total")) #,m.attrib.get("Fecha")))
for d in root.findall('.//*[@Descripcion]'):
    row.append(d.attrib.get('Descripcion'))
for rf in root.findall('.//*[@Rfc]'):
    row.extend((rf.attrib.get("Rfc"),rf.attrib.get("Nombre")))
for u in root.findall('.//*[@UUID]'):
    row.append(u.attrib.get("UUID"))
rows.append(row)
df= pd.DataFrame(rows,columns=cols)
df.to_csv('output.csv',mode='a', index=False, header=False)
EDIT 2022-11-16
I was able to run the code, but with this XML:
<?xml version="1.0" encoding="utf-8"?><cfdi:Comprobante xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:cfdi="http://www.sat.gob.mx/cfd/3" xsi:schemaLocation="http://www.sat.gob.mx/cfd/3 http://www.sat.gob.mx/sitio_internet/cfd/3/cfdv33.xsd" Version="3.3" Serie="EE" Folio="3963205" Fecha="2022-08-26T11:56:10" Sello="fg7F9EF8YXkWP4UAw96g97gat8D7nzJ10TjtxB/x4t1G10LS7RmRKa/jQ1dcBpJ96ck8FPBnOirF8Ya4IQ7hJgkoWRDkY4cpTI5UChiZsld7frl8x1yz21HIckqWqBtn/xQT4l0iAXda5xIRA6shOf0YErTU6NOkZNLNp4ToNg6hUbaoc4RXTNWcyc25lyXc9nMY6BkYiDNaCgLnIZ/d1jTIrwIPOyAlhAcdmVaPKxyfpMNrUhPBh4FKRy6MW8iNGXw+ZhPSYUncSiuUYA6O7B1qlHGHQSuN4q+dEv8T3C+4gM0PjInYhRt3XSOmwDfXAvjUFy4tyqXIHBHZCnpG+A==" FormaPago="99" NoCertificado="00001000000507261913" Certificado="MIIF6DCCA9CgAwIBAgIUMDAwMDEwMDAwMDA1MDcyNjE5MTMwDQYJKoZIhvcNAQELBQAwggGEMSAwHgYDVQQDDBdBVVRPUklEQUQgQ0VSVElGSUNBRE9SQTEuMCwGA1UECgwlU0VSVklDSU8gREUgQURNSU5JU1RSQUNJT04gVFJJQlVUQVJJQTEaMBgGA1UECwwRU0FULUlFUyBBdXRob3JpdHkxKjAoBgkqhkiG9w0BCQEWG2NvbnRhY3RvLnRlY25pY29Ac2F0LmdvYi5teDEmMCQGA1UECQwdQVYuIEhJREFMR08gNzcsIENPTC4gR1VFUlJFUk8xDjAMBgNVBBEMBTA2MzAwMQswCQYDVQQGEwJNWDEZMBcGA1UECAwQQ0lVREFEIERFIE1FWElDTzETMBEGA1UEBwwKQ1VBVUhURU1PQzEVMBMGA1UELRMMU0FUOTcwNzAxTk4zMVwwWgYJKoZIhvcNAQkCE01yZXNwb25zYWJsZTogQURNSU5JU1RSQUNJT04gQ0VOVFJBTCBERSBTRVJWSUNJT1MgVFJJQlVUQVJJT1MgQUwgQ09OVFJJQlVZRU5URTAeFw0yMTA0MzAxOTMwNTFaFw0yNTA0MzAxOTMwNTFaMIG2MRwwGgYDVQQDExNCQ0QgVFJBVkVMIFNBIERFIENWMRwwGgYDVQQpExNCQ0QgVFJBVkVMIFNBIERFIENWMRwwGgYDVQQKExNCQ0QgVFJBVkVMIFNBIERFIENWMSUwIwYDVQQtExxCVFI3ODAyMjM3VTIgLyBNRVNFNjkwNjEzNTQ3MR4wHAYDVQQFExUgLyBNRVNFNjkwNjEzSERGTk5OMDUxEzARBgNVBAsTCkJDRCBUUkFWRUwwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDcOxH+0iOfuj7fkxJhU3EoUo/aHFcofV2RkJ0XOOTP7H0oL99KE1AofsUbPZZvn4605puQWp2Fu/xg544Fd24fQ3WinhktnFu/bfP4X7O6hFyiL7//Kcme87/sNkFqaO6JUjkGMAASa3XROPUyYrnPszshF4sne+KZZzHg2347l2qPhN6LMMMyqIN0YdS9AsMTdUnSnZYgfgrxHN8RnWrwgmpELGJ6lZBf4mEpltzXYNYOWgm9t2xlnmMXss7MCsvQh6+ctA8iwEe64F2AesQzFNarer48RI8WHhBeUqO6APnom8tgA9K9SlBYxgR7FyRlrR8Q7NWRy122yTD
iqUD3AgMBAAGjHTAbMAwGA1UdEwEB/wQCMAAwCwYDVR0PBAQDAgbAMA0GCSqGSIb3DQEBCwUAA4ICAQAJruwufHGLVpyZ6ReQ8AyrkMtxONRmLhv7C2nY2c8+O+k/emdZU0Zm2iTQViXTPo0K0i2o4scEVMZAKtTbqzJk4NDHTYHn6ESNsH1whLcBAtGn2b+GYt4TYMZVy7zP/ty5mL8rlMBmg89zi2NGQRqLl4l5DoI1KgcSk7wue2hCOj3kIi4noWZQdh8kgACthei7aPscfNQXivZ17tBDTzmdWIBcn4KACIwFvkTCeGl1gsV5i1CIdex5p011qXsmIcPMIF/gAsVZRbfEKpu2afxZcCC9ig/xmB2blpv2E8QSMa6S27w7dxH3i92OpFBRXONpFXNJRtx7r6UvUw6Shq2oqTfImGzvdfC8oOa3LIq2AdoD2XGRjuaOQLqRhWldxyiW0N7jTreMnZoeUi2TiN8yp8aJFMRf8pj5sxGsUVBUiktYp6FYk3H+hk3DPEngZje2pJog0suuUyHJJEWaSMQlyFe30lyEMvnXt/xSpMwjt0PU3I9VOhXHrLW0QeCV8AyQ/timemYdJlgke6ROygIU01xqG/SBfM8REYYgtoKFKrtBuqowSrbjlJOXZHFZzKapAipbQ4dfOWoiBIlxHwgOtmOqfBlGRSf/MfSDxIERRGybZzBTnrMBPbgXPNeqs2qFAyrIx/qiTqp85KMgwoANVHUy4rk6mrc3gxZtp3isxA==" SubTotal="1483.10" Descuento="0.00" Moneda="MXN" Total="1713.48" TipoDeComprobante="I" MetodoPago="PPD" LugarExpedicion="11560"><cfdi:Emisor Rfc="BTR7802237U2" Nombre="BCD Travel S.A. de C.V." RegimenFiscal="601"/><cfdi:Receptor Rfc="IST190806QJ7" Nombre="INDRA SISTEMAS TRANSPORTE Y DEFENSA" UsoCFDI="G03"/><cfdi:Conceptos><cfdi:Concepto ClaveProdServ="90121502" NoIdentificacion="AZOEIG-3969614-15936193-16737975" Cantidad="1" ClaveUnidad="E48" Unidad="Unidad de Servicio" Descripcion="Reservación Hotel ( Tasa 16% )" ValorUnitario="1439.90" Importe="1439.90" Descuento="0.00"><cfdi:Impuestos><cfdi:Traslados><cfdi:Traslado Base="1439.90" Impuesto="002" TipoFactor="Tasa" TasaOCuota="0.160000" Importe="230.38"/></cfdi:Traslados></cfdi:Impuestos></cfdi:Concepto><cfdi:Concepto ClaveProdServ="90121502" NoIdentificacion="AZOEIG-3969614-15936193-16737975" Cantidad="1" ClaveUnidad="E48" Unidad="Unidad de Servicio" Descripcion="Otros Impuestos" ValorUnitario="43.20" Importe="43.20" Descuento="0.00"><cfdi:Impuestos><cfdi:Traslados><cfdi:Traslado Base="43.20" Impuesto="002" TipoFactor="Tasa" TasaOCuota="0.000000" Importe="0.00"/></cfdi:Traslados></cfdi:Impuestos></cfdi:Concepto></cfdi:Conceptos><cfdi:Impuestos 
TotalImpuestosTrasladados="230.38"><cfdi:Traslados><cfdi:Traslado Impuesto="002" TipoFactor="Tasa" TasaOCuota="0.160000" Importe="230.38"/><cfdi:Traslado Impuesto="002" TipoFactor="Tasa" TasaOCuota="0.000000" Importe="0.00"/></cfdi:Traslados></cfdi:Impuestos><cfdi:Complemento><tfd:TimbreFiscalDigital xmlns:tfd="http://www.sat.gob.mx/TimbreFiscalDigital" xsi:schemaLocation="http://www.sat.gob.mx/TimbreFiscalDigital http://www.sat.gob.mx/sitio_internet/cfd/TimbreFiscalDigital/TimbreFiscalDigitalv11.xsd" Version="1.1" UUID="055DC12A-C9F7-4E70-B23B-EB6CA1ABDC4A" FechaTimbrado="2022-08-26T13:10:51" RfcProvCertif="DET080304395" SelloCFD="fg7F9EF8YXkWP4UAw96g97gat8D7nzJ10TjtxB/x4t1G10LS7RmRKa/jQ1dcBpJ96ck8FPBnOirF8Ya4IQ7hJgkoWRDkY4cpTI5UChiZsld7frl8x1yz21HIckqWqBtn/xQT4l0iAXda5xIRA6shOf0YErTU6NOkZNLNp4ToNg6hUbaoc4RXTNWcyc25lyXc9nMY6BkYiDNaCgLnIZ/d1jTIrwIPOyAlhAcdmVaPKxyfpMNrUhPBh4FKRy6MW8iNGXw+ZhPSYUncSiuUYA6O7B1qlHGHQSuN4q+dEv8T3C+4gM0PjInYhRt3XSOmwDfXAvjUFy4tyqXIHBHZCnpG+A==" NoCertificadoSAT="00001000000503726537" SelloSAT="LyWQC2ExMofC25dv/qhchiKH2yVf29BuRzA1WJaPFOGq5+JF+bJL7nPpV2jE6iP1aKbtD7lyPLHRW8/P9KTR47GtGf3iuPpUWddsUA70cVTk1ol6/FJfrfuE1G2CLlUdhhf8MholjYtJNgbZ7hlfdmv0Zrj5vv3waO9FIRr0J/P6fA0uBK0qX0CxGYxNTsxPrwJ3CNkWFa94rVdM4iCfCZeXNGoqTXF+EEe2yPFJUvMR/BcYoiG8w6mKrojzKetDgg3J6bSDhW8XNGvYNt300fwUlU7arvoCo7f36UbI1lh+xGWEIDPy/IO7bccpfOm3T7xf0bfWBT1kl3o8IqxqgQ=="/></cfdi:Complemento><cfdi:Addenda><BCDTravel:AdditionalInformation xmlns:BCDTravel="https://www.bcdtravelmexico.com.mx/Addenda" xsi:schemaLocation="https://www.bcdtravelmexico.com.mx/Addenda https://www.bcdtravelmexico.com.mx/Addenda/BCDTravel.xsd"><BCDTravel:RecordInformation><BCDTravel:Reservacion ClaveReservacion="AZOEIG" NumeroOS="3969614" Pasajero="TEJEDA/EDGAR LEONARDO"/></BCDTravel:RecordInformation><BCDTravel:PaymentInformation><BCDTravel:MetodoPago Metodo="AR" Monto="1713.48"/></BCDTravel:PaymentInformation></BCDTravel:AdditionalInformation></cfdi:Addenda></cfdi:Comprobante>
I'm getting this error:
8 columns passed, passed data had 9 columns
I assume I have to add an exception but I'm not sure how to do it.

It looks like your code is more complicated than necessary.
Try it this way:
from lxml import etree
import pandas as pd

doc = etree.parse("invoice.xml")  # placeholder name; use the path of your XML file
# Column order follows the order in which values are added to the row below
cols = ["Monto","Fecha","Descripcion","RFC_Emisor","Emisor","RFC_Receptor","Receptor","UUID"]
rows = []
row = []
for t in doc.xpath('//*[@Total]'):
    row.extend((t.attrib.get("Total"),t.attrib.get("Fecha")))
for d in doc.xpath('//*[@Descripcion]'):
    row.append(d.attrib.get('Descripcion'))
# Emisor and Receptor both carry Rfc and Nombre, in document order
for rf in doc.xpath('//*[@Rfc]'):
    row.extend((rf.attrib.get("Rfc"),rf.attrib.get("Nombre")))
for u in doc.xpath('//*[@UUID]'):
    row.append(u.attrib.get("UUID"))
rows.append(row)
pd.DataFrame(rows,columns=cols)
Output (based on your sample xml):
Monto Fecha Descripcion RFC_Emisor Emisor RFC_Receptor Receptor UUID
0 169.00 2022-11-09T18:55:51 PQT. DE ALIMENTOS (CONSUMO: 2022-11-08) FOLIO(... PRB100802H20 PREMIUM RESTAURANT BRANDS IST190806QJ7 INDRA SISTEMAS TRANSPORTE Y DEFENSA 67B2DDD8-ABCF-4CD1-B435-C228742542B6
You'll likely have to modify this to fit your actual XML.
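The "8 columns passed, passed data had 9 columns" error from the second invoice most likely comes from its two Concepto elements: collecting every Descripcion into one flat row yields nine values. One possible way around it, sketched below with the standard library only, is to emit one row per Concepto and repeat the invoice-level fields. The helper names (`local_name`, `first_attr`, `invoice_rows`, `to_csv`) are invented for this example, the sample is a shortened version of the second invoice, and tags are matched by local name so that both the cfd/3 and cfd/4 namespaces work.

```python
import csv
import io
import xml.etree.ElementTree as ET

COLS = ["Monto", "Fecha", "Descripcion", "RFC_Emisor", "Emisor",
        "UUID", "RFC_Receptor", "Receptor"]

# Shortened version of the second invoice from the question
SAMPLE = """<cfdi:Comprobante xmlns:cfdi="http://www.sat.gob.mx/cfd/3" Fecha="2022-08-26T11:56:10" Total="1713.48">
  <cfdi:Emisor Rfc="BTR7802237U2" Nombre="BCD Travel S.A. de C.V."/>
  <cfdi:Receptor Rfc="IST190806QJ7" Nombre="INDRA SISTEMAS TRANSPORTE Y DEFENSA"/>
  <cfdi:Conceptos>
    <cfdi:Concepto Descripcion="Reservación Hotel ( Tasa 16% )"/>
    <cfdi:Concepto Descripcion="Otros Impuestos"/>
  </cfdi:Conceptos>
  <cfdi:Complemento>
    <tfd:TimbreFiscalDigital xmlns:tfd="http://www.sat.gob.mx/TimbreFiscalDigital" UUID="055DC12A-C9F7-4E70-B23B-EB6CA1ABDC4A"/>
  </cfdi:Complemento>
</cfdi:Comprobante>"""


def local_name(element):
    """Return the tag without its '{namespace}' part."""
    return element.tag.rsplit("}", 1)[-1]


def first_attr(root, tag, attr):
    """Attribute of the first element with the given local name, or ''."""
    for element in root.iter():
        if local_name(element) == tag:
            return element.get(attr, "")
    return ""


def invoice_rows(xml_text):
    """Build one dict per Concepto, repeating the invoice-level values."""
    root = ET.fromstring(xml_text)
    shared = {
        "Monto": root.get("Total", ""),
        "Fecha": root.get("Fecha", ""),
        "RFC_Emisor": first_attr(root, "Emisor", "Rfc"),
        "Emisor": first_attr(root, "Emisor", "Nombre"),
        "UUID": first_attr(root, "TimbreFiscalDigital", "UUID"),
        "RFC_Receptor": first_attr(root, "Receptor", "Rfc"),
        "Receptor": first_attr(root, "Receptor", "Nombre"),
    }
    rows = []
    for element in root.iter():
        if local_name(element) == "Concepto":
            rows.append(dict(shared, Descripcion=element.get("Descripcion", "")))
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = invoice_rows(SAMPLE)
print(to_csv(rows))
```

Because every row is a dict keyed by column name, the same list could also be handed to `pd.DataFrame(rows, columns=COLS)` without depending on the order in which values were collected.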

How to determine if a token is part of an entity in spaCy?

I have:
import spacy
nlp = spacy.load("en_core_web_lg")
line = "Rio de Janeiro is the capital of.."
doc = nlp(line)
for tok in doc:
    print(tok.lemma_)
for ent in doc.ents:
    print(ent.lemma_)
I want to obtain this wikified output: "[[Rio de Janeiro]] [[be|is]] [[the]] [[capital]] [[of]].."
How can I determine whether the token "Rio" is part of the entity "Rio de Janeiro"?

Use the ent_type or ent_type_ attribute: if the value is not an empty string, the token is part of an entity.
Edit: for the ent_iob or ent_iob_ attribute, "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.
import spacy
nlp = spacy.load("en_core_web_lg")
line = "Rio de Janeiro is the capital of.."
doc = nlp(line)
for tok in doc:
    print(tok, tok.ent_type_, tok.ent_iob_)
Output:
Rio GPE B
de GPE I
Janeiro GPE I
is O
the O
capital O
of O
.. O
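Building on the B/I/O tags, the wikified string from the question could be assembled roughly as follows. To keep the sketch runnable without a spaCy model, it works on plain tuples of (text, lemma, ent_iob, is_punct), which would be obtained from a real `doc` with `[(t.text, t.lemma_, t.ent_iob_, t.is_punct) for t in doc]`. The function name `wikify` and the hand-written token list are assumptions for illustration, and the lemma values are typed in by hand rather than produced by a model.

```python
def wikify(tokens):
    """Wrap entities and words in [[...]]; tokens are (text, lemma, iob, is_punct)."""
    pieces = []
    entity = []  # texts of the entity currently being collected

    def flush():
        # Emit the collected entity, if any, as one link
        if entity:
            pieces.append("[[" + " ".join(entity) + "]]")
            entity.clear()

    for text, lemma, iob, is_punct in tokens:
        if iob == "B":      # a new entity starts here
            flush()
            entity.append(text)
        elif iob == "I":    # continuation of the current entity
            entity.append(text)
        else:               # "O" or "": an ordinary token
            flush()
            if is_punct:
                pieces.append(text)
            elif lemma != text:
                pieces.append("[[%s|%s]]" % (lemma, text))
            else:
                pieces.append("[[%s]]" % text)
    flush()
    return " ".join(pieces)


# Hand-written stand-in for the spaCy tokens of the example sentence
tokens = [
    ("Rio", "Rio", "B", False),
    ("de", "de", "I", False),
    ("Janeiro", "Janeiro", "I", False),
    ("is", "be", "O", False),
    ("the", "the", "O", False),
    ("capital", "capital", "O", False),
    ("of", "of", "O", False),
    ("..", "..", "O", True),
]
wikified = wikify(tokens)
print(wikified)
```

Joining with single spaces leaves a space before the trailing "..", so reproducing the original spacing exactly would need the tokens' whitespace information as well.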

Entities have start and end properties, which are indices into the token stream.
So I can write:
import spacy
nlp = spacy.load("en_core_web_lg")
line = "Rio de Janeiro is the capital of.."
doc = nlp(line)
j = 0
for ent in doc.ents:
    # print the tokens between the previous entity and this one
    while j < ent.start:
        print(doc[j])
        j += 1
    print(ent)
    # skip the tokens that belong to the entity just printed
    j = ent.end
# print the tokens after the last entity
while j < len(doc):
    print(doc[j])
    j += 1

Python XML Parser Issue

I am new to Python, so sorry if this is a basic question.
I am trying to read an XML file into a Python object (preferably a pandas DataFrame).
For now I am just trying to print the variables, to see if I can read them properly in tabular form.
I have used xml.etree.ElementTree for this, but I might not be using it as intended.
Code:
import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
ODM = tree.getroot()
ns = {'xmlns': 'http://www.cdisc.org/ns/odm/v1.3',
      'mdsol': 'http://www.mdsol.com/ns/odm/metadata'}
for ClinicalData in ODM:
    LocationOID=None
    #print(ClinicalData.tag, ClinicalData.attrib)
    for SubjectData in ClinicalData:
        for SiteRef in SubjectData:
            LocationOID=SiteRef.attrib.get('LocationOID')
        for StudyEventData in SubjectData:
            for AuditRecord in StudyEventData:
                print(ClinicalData.attrib.get('MetaDataVersionOID'),
                      ClinicalData.attrib.get('AuditSubCategoryName'), #null output due to namespace issue
                      SubjectData.attrib.get('SubjectKey'),
                      SubjectData.attrib.get('SubjectName'), #null output due to namespace issue
                      LocationOID, #not sure what is the issue
                      StudyEventData.attrib.get('StudyEventRepeatKey'),
                      AuditRecord.find('DateTimeStamp') #not sure what is the issue
                      )
Input:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata"
CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">
<ClinicalData MetaDataVersionOID="1772" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Activated">
<SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-3">
<SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960580">
<AuditRecord>
<UserRef UserOID="systemuser"/>
<LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
<ReasonForChange>Update</ReasonForChange>
<SourceID>394263772</SourceID>
</AuditRecord>
</StudyEventData>
</SubjectData>
</ClinicalData>
</ODM>
I expect every printed variable to hold the corresponding value from the XML file. Please let me know if there is a better way of doing this than nesting loops several levels deep.

Namespaces are a pain using ElementTree. See this discussion.
Short answer:
for ClinicalData in ODM:
    #print(ClinicalData.tag, ClinicalData.attrib)
    for SubjectData in ClinicalData:
        SiteRef = SubjectData.find('{http://www.cdisc.org/ns/odm/v1.3}SiteRef')
        LocationOID = SiteRef.attrib.get('LocationOID')
        for StudyEventData in SubjectData:
            for AuditRecord in StudyEventData:
                print(
                    ClinicalData.attrib.get('MetaDataVersionOID'),
                    # attributes with the mdsol: prefix need the full {namespace URI} in front
                    ClinicalData.attrib.get(
                        '{http://www.mdsol.com/ns/odm/metadata}AuditSubCategoryName'),
                    SubjectData.attrib.get('SubjectKey'),
                    SubjectData.attrib.get(
                        '{http://www.mdsol.com/ns/odm/metadata}SubjectName'),
                    LocationOID,
                    StudyEventData.attrib.get('StudyEventRepeatKey'),
                    # child elements are in the default namespace, so qualify the tag and read .text
                    AuditRecord.find(
                        '{http://www.cdisc.org/ns/odm/v1.3}DateTimeStamp').text
                )
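As an alternative to spelling out the full URI in every tag, ElementTree's `find`, `findall` and `findtext` accept a prefix-to-URI mapping. The sketch below assumes the prefix `odm` for the default namespace (any prefix would do) and uses a trimmed copy of the question's XML; the function name `audit_rows` is made up. Namespaced attributes still have to be looked up with the `{URI}name` form, because the mapping only applies to path expressions.

```python
import xml.etree.ElementTree as ET

# Prefix map for path expressions; "odm" is an arbitrary prefix chosen here
NS = {
    "odm": "http://www.cdisc.org/ns/odm/v1.3",
    "mdsol": "http://www.mdsol.com/ns/odm/metadata",
}
MDSOL = "{%s}" % NS["mdsol"]  # attribute names need the {URI} form

SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata">
  <ClinicalData MetaDataVersionOID="1772" mdsol:AuditSubCategoryName="Activated">
    <SubjectData SubjectKey="7735fd9c" mdsol:SubjectName="ACC-SUBJ-3">
      <SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
      <StudyEventData StudyEventRepeatKey="VIST[1]/FV[1]">
        <AuditRecord>
          <DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
        </AuditRecord>
      </StudyEventData>
    </SubjectData>
  </ClinicalData>
</ODM>"""


def audit_rows(xml_text):
    """Return one dict per AuditRecord found under each StudyEventData."""
    root = ET.fromstring(xml_text)
    rows = []
    for clinical in root.findall("odm:ClinicalData", NS):
        for subject in clinical.findall("odm:SubjectData", NS):
            site = subject.find("odm:SiteRef", NS)
            location = site.get("LocationOID") if site is not None else None
            for event in subject.findall("odm:StudyEventData", NS):
                # ".//" also finds AuditRecord elements nested deeper in the event
                for audit in event.findall(".//odm:AuditRecord", NS):
                    rows.append({
                        "MetaDataVersionOID": clinical.get("MetaDataVersionOID"),
                        "AuditSubCategoryName": clinical.get(MDSOL + "AuditSubCategoryName"),
                        "SubjectKey": subject.get("SubjectKey"),
                        "SubjectName": subject.get(MDSOL + "SubjectName"),
                        "LocationOID": location,
                        "StudyEventRepeatKey": event.get("StudyEventRepeatKey"),
                        "DateTimeStamp": audit.findtext("odm:DateTimeStamp", namespaces=NS),
                    })
    return rows


audit = audit_rows(SAMPLE)
print(audit)
```

Using `findall` with an explicit tag name, rather than iterating over every child, also means unrelated siblings such as SiteRef are never mistaken for StudyEventData.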

I think you can use BeautifulSoup for parsing XML:
from bs4 import BeautifulSoup
temp ="""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata"
CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">
<ClinicalData MetaDataVersionOID="1772" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Activated">
<SubjectData SubjectKey="7735fd9c-1792-457c-aa58-0ca26ecdc810" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-3">
<SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960580">
<AuditRecord>
<UserRef UserOID="systemuser"/>
<LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<DateTimeStamp>2019-07-10T07:56:54</DateTimeStamp>
<ReasonForChange>Update</ReasonForChange>
<SourceID>394263772</SourceID>
</AuditRecord>
</StudyEventData>
</SubjectData>
</ClinicalData>
</ODM>"""
temp=BeautifulSoup(temp,"lxml")
ClinicalData = temp.find('ClinicalData'.lower())
SubjectData = ClinicalData.find_all('SubjectData'.lower())
LocationOID=None
for i in SubjectData:
    SiteRef = i.find('SiteRef'.lower())
    LocationOID = SiteRef.attrs['locationoid']
    print('LocationOID',LocationOID)
output:
LocationOID 0ACCSP3MAPPING1SITE1
[Finished in 1.2s]

@Justin
I applied your suggestions and it worked, until I broke it.
Input:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata" CreationDateTime="2019-08-23T12:59:09" FileOID="3b2b4161-fad8-4239-9c83-03d0e62624dd" FileType="Transactional" ODMVersion="1.3">
<ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
<SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-1">
<SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960564">
<FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
<ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928808">
<ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="45">
<AuditRecord>
<UserRef UserOID="alscrave2"/>
<LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
<ReasonForChange/>
<SourceID>122841525</SourceID>
</AuditRecord>
<MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
</ItemData>
</ItemGroupData>
</FormData>
</StudyEventData>
</SubjectData>
</ClinicalData>
<ClinicalData MetaDataVersionOID="2965" StudyOID="0ACC SP3 MAPPING1(DEV)" mdsol:AuditSubCategoryName="Entered">
<SubjectData SubjectKey="481e4653-693c-4e15-8762-d8a66c0d2cf1" mdsol:SubjectKeyType="SubjectUUID" mdsol:SubjectName="ACC-SUBJ-1">
<SiteRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<StudyEventData StudyEventOID="FV" StudyEventRepeatKey="VIST[1]/FV[1]" mdsol:InstanceId="2960564">
<FormData FormOID="VS" FormRepeatKey="1" mdsol:DataPageId="15331229">
<ItemGroupData ItemGroupOID="VS" mdsol:RecordId="17928809">
<ItemData ItemOID="VS.WT" TransactionType="Upsert" Value="46">
<AuditRecord>
<UserRef UserOID="alscrave2"/>
<LocationRef LocationOID="0ACCSP3MAPPING1SITE1"/>
<DateTimeStamp>2018-02-02T09:39:30</DateTimeStamp>
<ReasonForChange/>
<SourceID>122841525</SourceID>
</AuditRecord>
<MeasurementUnitRef MeasurementUnitOID="1761.Weight.1"/>
</ItemData>
</ItemGroupData>
</FormData>
</StudyEventData>
</SubjectData>
</ClinicalData>
</ODM>
Code:
import xml.etree.ElementTree as ET
import pandas as pd

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None

tree = ET.parse("data.xml")
ODM = tree.getroot()
xmlns = "{http://www.cdisc.org/ns/odm/v1.3}"
mdsol = "{http://www.mdsol.com/ns/odm/metadata}"

def data_reader():
    dfcols = ['CreationDateTime','StudyOID','MetaDataVersionOID','SubjectName','SUBJECTUUID','LocationOID','StudyEventOID',
              'StudyEventRepeatKey','FormOID','FormRepeatKey','DataPageId','ItemgroupOID','RecordId','var_name','Value',
              'DateTimeStamp','ASC_Name','Measurement_Unit','SourceID','UserOID','InstanceId']
    df_xml = pd.DataFrame(columns=dfcols)
    CreationDateTime = ODM.attrib.get('CreationDateTime')
    for ClinicalData in ODM:
        StudyOID = ClinicalData.attrib.get('StudyOID')
        MetaDataVersionOID = ClinicalData.attrib.get('MetaDataVersionOID')
        ASC_Name = ClinicalData.attrib.get('{0}AuditSubCategoryName'.format(mdsol))
        for SubjectData in ClinicalData:
            SubjectName = SubjectData.attrib.get('{0}SubjectName'.format(mdsol))
            SUBJECTUUID = SubjectData.attrib.get('SubjectKey')
            LocationOID = SubjectData.find('{0}SiteRef'.format(xmlns)).attrib.get('LocationOID')
            for StudyEventData in SubjectData:
                StudyEventOID = StudyEventData.attrib.get('StudyEventOID')
                StudyEventRepeatKey = StudyEventData.attrib.get('StudyEventRepeatKey')
                InstanceId = StudyEventData.attrib.get('{0}InstanceId'.format(mdsol))
                for FormData in StudyEventData:
                    FormOID = FormData.attrib.get('FormOID')
                    FormRepeatKey = FormData.attrib.get('FormRepeatKey')
                    DataPageId = FormData.attrib.get('{0}DataPageId'.format(mdsol))
                    for ItemGroupData in FormData:
                        ItemgroupOID = ItemGroupData.attrib.get('ItemgroupOID')
                        RecordId = ItemGroupData.attrib.get('{0}RecordId'.format(mdsol))
                        for ItemData in ItemGroupData:
                            var_name = ItemData.attrib.get('ItemOID')
                            Value = ItemData.attrib.get('Value')
                            Measurement_Unit = ItemData.find('MeasurementUnitRef'.format(xmlns)).attrib.get('MeasurementUnitOID')
                            for AuditRecord in ItemData:
                                DateTimeStamp = AuditRecord.find('{0}DateTimeStamp'.format(xmlns)).text
                                SourceID = AuditRecord.find('{0}SourceID'.format(xmlns)).text
                                UserOID = ItemData.find('{0}UserRef'.format(xmlns)).attrib.get('UserOID')
                                df_xml = df_xml.append(
                                    pd.Series([CreationDateTime,StudyOID,MetaDataVersionOID,SubjectName,
                                               SUBJECTUUID,LocationOID,StudyEventOID,
                                               StudyEventRepeatKey,FormOID,FormRepeatKey,DataPageId,ItemgroupOID,
                                               RecordId,var_name,Value,DateTimeStamp,ASC_Name,Measurement_Unit,
                                               SourceID,UserOID,InstanceId], index=dfcols),
                                    ignore_index=True)
    print(df_xml)

data_reader()
Issue: I am getting duplicate records, and the assignments to DateTimeStamp, SourceID, UserOID and Measurement_Unit raise runtime errors.
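A possible explanation, offered as a sketch rather than a confirmed diagnosis since the input file is not shown: iterating directly over an element (for x in parent) visits every child regardless of its tag, so a loop meant for AuditRecord also runs for MeasurementUnitRef and similar siblings, which can produce extra rows. Separately, find() returns None when the child is absent, so chaining .attrib or .text onto it raises AttributeError. The example below uses the standard-library ElementTree and a hypothetical, trimmed document; the value of ODM_NS, the helper attr_of and the placement of UserRef inside AuditRecord are assumptions.

```python
import xml.etree.ElementTree as ET

ODM_NS = "{http://www.cdisc.org/ns/odm/v1.3}"     # assumed value of `xmlns`
MDSOL = "{http://www.mdsol.com/ns/odm/metadata}"  # taken from the question

# Hypothetical, heavily trimmed input; the real file is not shown.
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:mdsol="http://www.mdsol.com/ns/odm/metadata">
  <ClinicalData StudyOID="S1">
    <SubjectData SubjectKey="abc" mdsol:SubjectName="001">
      <SiteRef LocationOID="L1"/>
      <StudyEventData StudyEventOID="SE1">
        <FormData FormOID="F1">
          <ItemGroupData ItemGroupOID="IG1">
            <ItemData ItemOID="WEIGHT" Value="70">
              <MeasurementUnitRef MeasurementUnitOID="kg"/>
              <AuditRecord>
                <UserRef UserOID="u1"/>
                <DateTimeStamp>2020-01-01T00:00:00</DateTimeStamp>
                <SourceID>7</SourceID>
              </AuditRecord>
            </ItemData>
            <ItemData ItemOID="AGE" Value="42">
              <AuditRecord>
                <UserRef UserOID="u2"/>
                <DateTimeStamp>2020-01-02T00:00:00</DateTimeStamp>
                <SourceID>8</SourceID>
              </AuditRecord>
            </ItemData>
          </ItemGroupData>
        </FormData>
      </StudyEventData>
    </SubjectData>
  </ClinicalData>
</ODM>"""


def attr_of(parent, tag, attr):
    """Return the attribute of a named child, or None if the child is absent."""
    node = parent.find(ODM_NS + tag)
    return None if node is None else node.get(attr)


def read_rows(xml_text):
    odm = ET.fromstring(xml_text)
    rows = []
    for subject in odm.iter(ODM_NS + "SubjectData"):
        location = attr_of(subject, "SiteRef", "LocationOID")
        # iter() with a tag visits only ItemData elements, at any depth.
        for item in subject.iter(ODM_NS + "ItemData"):
            base = {
                "SubjectName": subject.get(MDSOL + "SubjectName"),
                "LocationOID": location,
                "var_name": item.get("ItemOID"),
                "Value": item.get("Value"),
                "Measurement_Unit": attr_of(item, "MeasurementUnitRef",
                                            "MeasurementUnitOID"),
            }
            # One row per AuditRecord, not one per arbitrary child element.
            for audit in item.findall(ODM_NS + "AuditRecord"):
                row = dict(base)
                row["DateTimeStamp"] = audit.findtext(ODM_NS + "DateTimeStamp")
                row["SourceID"] = audit.findtext(ODM_NS + "SourceID")
                row["UserOID"] = attr_of(audit, "UserRef", "UserOID")
                rows.append(row)
    return rows


rows = read_rows(SAMPLE)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) in one step, which also avoids DataFrame.append (removed in pandas 2.0).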

Python ElementTree parsing order

I have a very basic problem, but I am just getting started and this one is bugging me.
This is my XML:
<SCHEDULE>
<PROGRAMMES>
<PROGRAMME START="00:30:00" END="00:31:00">
<TITLES>
<TITLE>Name 1</TITLE>
</TITLES>
</PROGRAMME>
<PROGRAMME START="00:31:00" END="00:32:00">
<BLOCK>
<BLOCK_NAME>Block1</BLOCK_NAME>
<BLOCK_START>00:31:00</BLOCK_START>
</BLOCK>
<TITLES>
<TITLE>Name 2</TITLE>
</TITLES>
</PROGRAMME>
<PROGRAMME START="00:40:00" BILLEDEND="00:45:00">
<BLOCK>
<BLOCK_NAME>Block 1</BLOCK_NAME>
<BLOCK_START>00:31:00</BLOCK_START>
</BLOCK>
<TITLES>
<TITLE>Name 3</TITLE>
</TITLES>
</PROGRAMME>
<PROGRAMME START="00:45:00" END="00:50:00">
<TITLES>
<TITLE>Name 4</TITLE>
</TITLES>
</PROGRAMME>
</PROGRAMMES>
</SCHEDULE>
This is my Python code:
import xml.etree.ElementTree as ET
try:
    import textwrap
    textwrap.indent
except AttributeError:
    # textwrap.indent is missing on older Pythons, so fall back to a manual version
    def indent(text, amount, ch=' '):
        padding = amount * ch
        return ''.join(padding+line for line in text.splitlines(True))
else:
    def indent(text, amount, ch=' '):
        return textwrap.indent(text, amount * ch)
tree = ET.parse("block.xml")
root = tree.getroot()
for sub in root.findall('.//PROGRAMME'):
    time = sub.get("START")[:5] + " "
    nas = sub.find(".//TITLE").text
    if sub.find(".//BLOCK_NAME") is None:
        block = time + nas
    else:
        block = sub.find(".//BLOCK_NAME").text
        c = indent(time + nas, 3)
        block = time + block + "\n" + c
    print(block)
Result I get:
>>>00:30 Name 1
>>>00:31 Block1
>>>   00:31 Name 2
>>>00:40 Block 1
>>>   00:40 Name 3
>>>00:45 Name 4
Now the question: what am I missing to get this result instead?
>>>00:30 Name 1
>>>00:31 Block1
>>>   00:31 Name 2
>>>   00:40 Name 3
>>>00:45 Name 4
I presume I need to incorporate BLOCK_START somehow, but I have no idea how.
Thanks in advance for your help.

I figured it out on my own: I just compared START with BLOCK_START.
    if sub.find(".//BLOCK_NAME") is None:
        block = time + nas
    else:
        if sub.find(".//BLOCK_START").text != sub.get("START"):
            # the block began in an earlier programme: print only the indented title
            block = indent(time + nas, 4)
        else:
            block = sub.find(".//BLOCK_NAME").text
            c = indent(time + nas, 4)
            block = time + block + "\n" + c
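For reference, here is a self-contained sketch of that comparison. It assumes Python 3 (so textwrap.indent is available) and parses the schedule from an inline string rather than block.xml; the function name format_schedule and the width parameter are made up for illustration.

```python
import textwrap
import xml.etree.ElementTree as ET

SCHEDULE_XML = """<SCHEDULE>
<PROGRAMMES>
<PROGRAMME START="00:30:00" END="00:31:00">
<TITLES><TITLE>Name 1</TITLE></TITLES>
</PROGRAMME>
<PROGRAMME START="00:31:00" END="00:32:00">
<BLOCK><BLOCK_NAME>Block1</BLOCK_NAME><BLOCK_START>00:31:00</BLOCK_START></BLOCK>
<TITLES><TITLE>Name 2</TITLE></TITLES>
</PROGRAMME>
<PROGRAMME START="00:40:00" BILLEDEND="00:45:00">
<BLOCK><BLOCK_NAME>Block 1</BLOCK_NAME><BLOCK_START>00:31:00</BLOCK_START></BLOCK>
<TITLES><TITLE>Name 3</TITLE></TITLES>
</PROGRAMME>
<PROGRAMME START="00:45:00" END="00:50:00">
<TITLES><TITLE>Name 4</TITLE></TITLES>
</PROGRAMME>
</PROGRAMMES>
</SCHEDULE>"""


def format_schedule(xml_text, width=3):
    """Return one output line per programme, plus a header line per new block."""
    root = ET.fromstring(xml_text)
    lines = []
    for prog in root.iter("PROGRAMME"):
        start = prog.get("START")
        entry = "{} {}".format(start[:5], prog.findtext(".//TITLE"))
        block_name = prog.findtext(".//BLOCK_NAME")  # None when there is no BLOCK
        if block_name is None:
            lines.append(entry)
            continue
        # Print the block header only for the programme that opens the block.
        if prog.findtext(".//BLOCK_START") == start:
            lines.append("{} {}".format(start[:5], block_name))
        lines.append(textwrap.indent(entry, " " * width))
    return lines


lines = format_schedule(SCHEDULE_XML)
print("\n".join(lines))
```

Comparing BLOCK_START with START identifies the first programme of a block without having to track state between loop iterations.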

Python and products file XML

I'm using Python and I need to find sku, min-order-quantity and step-quantity for each occurrence of sku.
Input file is:
<product sku="1235997403">
<sku>1235997403</sku>
<name xml:lang="fr-FR">Huile pour entretien des destructeurs de documents HSM</name>
<short-description xml:lang="fr-FR">Flacon 250 ml. Colis de 1 flacon.</short-description>
<category-links>
<category-link name="20319647o.rjpf_20320074o.rjpf" domain="RAJA-FR-WEB-0092-21" default = "1" hotdeal = "0"/>
</category-links>
<online>1</online>
<quantity unit="pcs">
<min-order-quantity>1</min-order-quantity>
<step-quantity>1</step-quantity>
</quantity>
....
</product>
....
I tried to use lxml but failed to get min-order-quantity and step-quantity:
from lxml import etree
tree = etree.parse('./ST2CleanCourt.xml')
elem = tree.getroot()
for child in elem:
    print (child.attrib["sku"])
I tried the two solutions below. They work, but I need to read from the file, so I wrote:
from lxml import etree
import codecs
f=codecs.open('./ST2CleanCourt.xml','r','utf-8')
fichier = f.read()
tree = etree.fromstring(fichier)
for child in tree:
    print ('sku:', child.attrib['sku'])
    print ('min:', child.find('.//min-order-quantity').text)
and I always get this error:
    print ('min:', child.find('.//min-order-quantity').text)
AttributeError: 'NoneType' object has no attribute 'text'
What is wrong?

You can use the xpath method to get the required values.
Example:
from lxml import etree
a = """<product sku="1235997403">
<sku>1235997403</sku>
<name xml:lang="fr-FR">Huile pour entretien des destructeurs de documents HSM</name>
<short-description xml:lang="fr-FR">Flacon 250 ml. Colis de 1 flacon.</short-description>
<category-links>
<category-link name="20319647o.rjpf_20320074o.rjpf" domain="RAJA-FR-WEB-0092-21" default = "1" hotdeal = "0"/>
</category-links>
<online>1</online>
<quantity unit="pcs">
<min-order-quantity>1</min-order-quantity>
<step-quantity>1</step-quantity>
</quantity>
</product>
"""
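tree = etree.fromstring(a)
tags = tree.xpath('/product')
for b in tags:
    print(b.attrib["sku"])
    # a leading "." keeps the search relative to the current product;
    # a leading "//" would search the whole document instead
    min_order = b.xpath("./quantity/min-order-quantity")
    print(min_order[0].text)
    step_quantity = b.xpath("./quantity/step-quantity")
    print(step_quantity[0].text)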
Output:
1235997403
1
1

With more than one product under a root node of products, you can find the values like this:
x = """
<products>
<product sku="1235997403">
<sku>1235997403</sku>
<name xml:lang="fr-FR">Huile pour entretien des destructeurs de documents HSM</name>
<short-description xml:lang="fr-FR">Flacon 250 ml. Colis de 1 flacon.</short-description>
<category-links>
<category-link name="20319647o.rjpf_20320074o.rjpf" domain="RAJA-FR-WEB-0092-21" default = "1" hotdeal = "0"/>
</category-links>
<online>1</online>
<quantity unit="pcs">
<min-order-quantity>1</min-order-quantity>
<step-quantity>1</step-quantity>
</quantity>
</product>
<product sku="997403">
<sku>1235997403</sku>
<name xml:lang="fr-FR">Huile pour entretien des destructeurs de documents HSM</name>
<short-description xml:lang="fr-FR">Flacon 250 ml. Colis de 1 flacon.</short-description>
<category-links>
<category-link name="20319647o.rjpf_20320074o.rjpf" domain="RAJA-FR-WEB-0092-21" default = "1" hotdeal = "0"/>
</category-links>
<online>1</online>
<quantity unit="pcs">
<min-order-quantity>5</min-order-quantity>
<step-quantity>7</step-quantity>
</quantity>
</product>
</products>
"""
from lxml import etree
tree = etree.fromstring(x)
for child in tree:
    print ("sku:", child.attrib["sku"])
    print ("min:", child.find(".//min-order-quantity").text) # looks for node below
    print ("step:" ,child.find(".//step-quantity").text)     # child with the given name
Essentially you look for any node below child that has the correct name and print its text.
Output:
sku: 1235997403
min: 1
step: 1
sku: 997403
min: 5
step: 7
Docs: http://lxml.de/tutorial.html#elementpath
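As for the AttributeError on the real file: find() returns None whenever a child has no matching descendant, which can happen if the root contains elements other than product or if some product lacks a quantity block. That is only one possible cause (a default XML namespace in the file would be another), since the full file is not shown. The sketch below uses the standard-library ElementTree so that it runs without lxml, whose find and findtext calls behave the same way here; the sample catalog and the name read_quantities are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: the second product has no quantity values and the
# root also holds a non-product element.
CATALOG = """<products>
  <product sku="1235997403">
    <quantity unit="pcs">
      <min-order-quantity>1</min-order-quantity>
      <step-quantity>1</step-quantity>
    </quantity>
  </product>
  <product sku="997403">
    <quantity unit="pcs"/>
  </product>
  <header>not a product</header>
</products>"""


def read_quantities(xml_text):
    """Return (sku, min, step) tuples; missing values come back as None."""
    root = ET.fromstring(xml_text)
    result = []
    # iter("product") skips any element that is not a product.
    for product in root.iter("product"):
        result.append((
            product.get("sku"),
            # findtext() returns None instead of raising when nothing matches.
            product.findtext("quantity/min-order-quantity"),
            product.findtext("quantity/step-quantity"),
        ))
    return result


quantities = read_quantities(CATALOG)
for sku, minimum, step in quantities:
    print("sku:", sku, "min:", minimum, "step:", step)
```

Printing the None values makes it easy to spot which products in the real file are missing the expected elements.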


Parse by Python - python

Instead of your print statement, you can do something like this:
print time, opt+(" " + opt).join([c.encode('latin-1').decode('latin-1') for c in child.tail.split(' ')])
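The suggestion relies on the fact that the spoken text sits in the tail of each Sync or Event element, not inside it. A Python 3 sketch of the same idea follows, using a shortened copy of the Turn from the question; the PREFIXES mapping and the name prefix_words are illustrative, and only the two prefixes visible in the question are included.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <Turn> element from the question.
TURN = """<Turn speaker="spk1" mode="planned">
<Sync time="111.286"/>
ha estat director del diari La Vanguàrdia,
<Event desc="es" type="language" extent="begin"/>
El Periódico
<Event desc="es" type="language" extent="end"/>
de Catalunya.
</Turn>"""

# Language prefixes that appear in the question's code.
PREFIXES = {"es": "ESP:", "la": "LAT:"}


def prefix_words(xml_text):
    """Return (time, text) pairs, tagging every word that follows a 'begin' event."""
    turn = ET.fromstring(xml_text)
    out = []
    for child in turn:
        # The text after an empty element is stored in its .tail attribute.
        words = (child.tail or "").split()
        if not words:
            continue
        prefix = ""
        if child.get("extent") == "begin":
            prefix = PREFIXES.get(child.get("desc"), "")
        out.append((child.get("time"), " ".join(prefix + w for w in words)))
    return out


result = prefix_words(TURN)
for time, text in result:
    print(time, text)
```

Using an empty prefix for untagged text avoids concatenating None with a string when an element has no desc attribute.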
