The simple code below prints certain elements and their attributes.
It iterates through an XML file, looks for these elements, and prints them out.
Code
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse('1last.xml')
root = tree.getroot()
for neighbor in root.iter('Description'):
    print(neighbor.attrib, neighbor.text)
for neighbor in root.iter('SetData'):
    print(neighbor.attrib)
for neighbor in root.iter('FileX'):
    print(neighbor.attrib)
for neighbor in root.iter('FileY'):
    print(neighbor.attrib)
I want to export the output to an Excel table, but it doesn’t seem to work.
I have tried this:
export_excel = root.to_excel (r'C:\Users\fsdf.LAPTOP-E8A1PPIN\Desktop\test\export_dataframe.xlsx', index = None, header=True)
but I got the error “AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'to_excel'”.
This is my XML file:
<?xml version="1.0" encoding="utf-8"?>
<ProjectData>
<FINAL>
<START id="ID0001" service_code="0x5196">
<Docs Docs_type="START">
<Rational>225196</Rational>
<Qualify>6251960000A0DE</Qualify>
</Docs>
<Description num="1213f2312">The parameter</Description>
<SetFile dg="" dg_id="">
<SetData value="32" />
</SetFile>
</START>
<START id="DG0003" service_code="0x517B">
<Docs Docs_type="START">
<Rational>23423</Rational>
<Qualify>342342</Qualify>
</Docs>
<Description num="3423423f3423">The third</Description>
<SetFile dg="" dg_id="">
<FileX dg="" axis_pts="2" name="" num="" dg_id="" />
<FileY unit="" axis_pts="20" name="TOOLS" text_id="23423" unit_id="" />
<SetData x="E1" value="21259" />
<SetData x="E2" value="0" />
</SetFile>
</START>
<START id="ID0048" service_code="0x5198">
<RawData rawdata_type="OPDATA">
<Request>225198</Request>
<Response>343243324234234</Response>
</RawData>
<Meaning text_id="434234234">The forth</Meaning>
<ValueDataset unit="m" unit_id="FEDS">
<FileX dg="kg" discrete="false" axis_pts="19" name="weight" text_id="SDF3" unit_id="SDGFDS" />
<SetData xin="sdf" xax="233" value="323" />
<SetData xin="123" xax="213" value="232" />
<SetData xin="2321" xax="232" value="23" />
</ValueDataset>
</START>
</FINAL>
</ProjectData>
This is what I want the table to look like.
One approach would be to use a library such as openpyxl to write the Excel file directly. The following shows how this could be done:
import openpyxl
from bs4 import BeautifulSoup
with open('1last.xml') as f_input:
    soup = BeautifulSoup(f_input, 'lxml')

wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Sheet1"

ws.append(["Description", "num", "text"])
for description in soup.find_all("description"):
    ws.append(["", description['num'], description.text])

ws.append(["SetData", "x", "value", "xin", "xax"])
for setdata in soup.find_all("setdata"):
    ws.append(["", setdata.get('x', ''), setdata.get('value', ''), setdata.get('xin', ''), setdata.get('xax', '')])

wb.save(filename="1last.xlsx")
This creates an Excel file, 1last.xlsx, with the Description rows listed first and the SetData rows below them.
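The AttributeError in the question arises because root is an ElementTree Element, while to_excel is a method of a pandas DataFrame. As an alternative to writing cells with openpyxl, the matching elements can first be gathered into plain rows. The sketch below is only an illustration, not the method of the answer above: the helper name collect_rows is hypothetical, and the inline sample is a trimmed copy of the XML from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML, inlined so the sketch is self-contained.
SAMPLE = """<ProjectData>
  <FINAL>
    <START id="ID0001" service_code="0x5196">
      <Description num="1213f2312">The parameter</Description>
      <SetFile dg="" dg_id="">
        <SetData value="32" />
      </SetFile>
    </START>
    <START id="DG0003" service_code="0x517B">
      <Description num="3423423f3423">The third</Description>
      <SetFile dg="" dg_id="">
        <SetData x="E1" value="21259" />
        <SetData x="E2" value="0" />
      </SetFile>
    </START>
  </FINAL>
</ProjectData>"""


def collect_rows(root, tags=("Description", "SetData")):
    """Return one dict per matching element: tag name, text, and attributes."""
    rows = []
    for tag in tags:
        for elem in root.iter(tag):
            row = {"element": elem.tag, "text": (elem.text or "").strip()}
            row.update(elem.attrib)  # attribute names become column names
            rows.append(row)
    return rows


rows = collect_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

With pandas and openpyxl installed, pd.DataFrame(rows).to_excel('export_dataframe.xlsx', index=False) should then write these rows to a workbook, because to_excel is called on a DataFrame rather than on an XML element.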
I am trying to create an API connection, and the response looks like the one below. I need to parse this data and turn it into a pandas DataFrame and/or loop over it to find specific information belonging to particular tags.
Below is the code I tried to run, but it returns an empty list, and the result does not seem to be iterable.
It also cannot be converted to a DataFrame for now. What steps should I take to handle this data?
import requests
import pandas as pd
import xml.etree.ElementTree as ET
response = """<?xml version = "1.0" encoding = "utf-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<SOAP-ENV:Body>
<Desperados_Clientes_V2.DESPERADOSResponse xmlns="TrainsWebb_V16">
<Sdtdesperadosclient xmlns="TrainsWebb_V16">
<SDTDesperadosClientItem xmlns="TrainsWebb_V16">
<AESA>10555555555 </AESA>
<DOCUMENTO>1666666666</DOCUMENTO>
<REMITENTE>888888888 </REMITENTE>
<NM_REMITENTE>ABDULREZZAK S.A.S. </NM_REMITENTE>
<FECHA_ELABORACION>14/8/2020</FECHA_ELABORACION>
<HORA_ELABORACION>11:27</HORA_ELABORACION>
<CODIGO_DEST>0000000000</CODIGO_DEST>
<NIT_DESTINATARIO>0000000000</NIT_DESTINATARIO>
<NOMBRE_DESTINATARIO>HOST ADMIRALE GORA</NOMBRE_DESTINATARIO>
<DIRECCION_DESTINATARIO>BBA 56 # 21 - 001</DIRECCION_DESTINATARIO>
<DANE_DESTINO>0200000</DANE_DESTINO>
<CIUDAD_DESTINO>GORA </CIUDAD_DESTINO>
<DEPARTAMENTO_DESTINO>ANTIOCHIA </DEPARTAMENTO_DESTINO>
<FECHA_ENTREGA>11/02/2020</FECHA_ENTREGA>
<HORA_ENTREGA>11:44</HORA_ENTREGA>
<FECHA_CITA />
<HORA_CITA />
<CODIGO_ESTADO>Z </CODIGO_ESTADO>
<NOMBRE_ESTADO>CUMPLEANNO </NOMBRE_ESTADO>
<FECHA_ESTADO>11/01/2020</FECHA_ESTADO>
<HORA_ESTADO>11:44</HORA_ESTADO>
<CODIGO_NOVEDAD />
<NOMBRE_NOVEDAD />
<FECHA_NOVEDAD />
<HORA_NOVEDAD />
<COMENTARIO_NOVEDAD />
<OBSERVACIONES />
<ENLACE_IMAGEN>https://ssssss.ssssssss.com/SSSSSS/xxxxxxxxxxxxxx.aspx?1111111,222222222,SIXA_XEOX,SIXAXEOX2016</ENLACE_IMAGEN>
<DOCUMENTO_2>169999999999</DOCUMENTO_2>
<DOCUMENTO_3 />
<DOCUMENTO_4 />
<FECHA_TRANSMISION>18/02/2020</FECHA_TRANSMISION>
<HORA_TRANSMISION>08:12:30</HORA_TRANSMISION>
<MENSAJE_TRANSMISION>KK</MENSAJE_TRANSMISION>
<PROMESA_SERVICIO>15/10/21</PROMESA_SERVICIO>
<CODIGO_DIVISION>011111</CODIGO_DIVISION>
<NOMBRE_DIVISION>ABDURREZZAK </NOMBRE_DIVISION>
</SDTDesperadosClientItem>
<SDTDesperadosClientItem xmlns="TrainsWebb_V16">
<AESA>10555555555 </AESA>
<DOCUMENTO>177777777</DOCUMENTO>
<REMITENTE>9999999999 </REMITENTE>
<NM_REMITENTE>ABDULREZZAK S.A.S. </NM_REMITENTE>
<FECHA_ELABORACION>12/8/2020</FECHA_ELABORACION>
<HORA_ELABORACION>16:27</HORA_ELABORACION>
<CODIGO_DEST>0000000000</CODIGO_DEST>
<NIT_DESTINATARIO>0000000000</NIT_DESTINATARIO>
<NOMBRE_DESTINATARIO>GORA FORA</NOMBRE_DESTINATARIO>
<DIRECCION_DESTINATARIO>BBG 16 # 91 - 021</DIRECCION_DESTINATARIO>
<DANE_DESTINO>0500000</DANE_DESTINO>
<CIUDAD_DESTINO>AROG </CIUDAD_DESTINO>
<DEPARTAMENTO_DESTINO>ANTIOCHIA </DEPARTAMENTO_DESTINO>
<FECHA_ENTREGA>10/02/2020</FECHA_ENTREGA>
<HORA_ENTREGA>10:44</HORA_ENTREGA>
<FECHA_CITA />
<HORA_CITA />
<CODIGO_ESTADO>D </CODIGO_ESTADO>
<NOMBRE_ESTADO>CUMPLEANNI </NOMBRE_ESTADO>
<FECHA_ESTADO>11/01/2020</FECHA_ESTADO>
<HORA_ESTADO>11:44</HORA_ESTADO>
<CODIGO_NOVEDAD />
<NOMBRE_NOVEDAD />
<FECHA_NOVEDAD />
<HORA_NOVEDAD />
<COMENTARIO_NOVEDAD />
<OBSERVACIONES />
<ENLACE_IMAGEN>https://ssssss.ssssssss.com/SSSSSS/xxxxxxxxxxxxxx.aspx?1111111,222222222,SIXA_XEOX,SIXAXEOX2016</ENLACE_IMAGEN>
<DOCUMENTO_2>1677777777</DOCUMENTO_2>
<DOCUMENTO_3 />
<DOCUMENTO_4 />
<FECHA_TRANSMISION>18/02/2020</FECHA_TRANSMISION>
<HORA_TRANSMISION>08:12:30</HORA_TRANSMISION>
<MENSAJE_TRANSMISION>HK</MENSAJE_TRANSMISION>
<PROMESA_SERVICIO>15/10/21</PROMESA_SERVICIO>
<CODIGO_DIVISION>011111</CODIGO_DIVISION>
<NOMBRE_DIVISION>ABDURREZZAK </NOMBRE_DIVISION>
</SDTDesperadosClientItem>
</Sdtdesperadosclient>
</Desperados_Clientes_V2.DESPERADOSResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""
myroot = ET.fromstring(response)
for child in myroot.iter('*'):
    print(child.tag)
sid = myroot.findall(".//{'TrainsWebb_V16'}AESA")
print(sid)
To parse this into a pandas DataFrame, you can use the pandas.read_xml function (recent pandas versions expect a literal XML string to be wrapped in io.StringIO):
data_frame = pd.read_xml(response, xpath="//*[name()='SDTDesperadosClientItem']")
https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html#pandas-read-xml
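As for the empty list in the question: the quotes inside the braces are the likely cause, since ElementTree expects the bare namespace URI, as in {TrainsWebb_V16}AESA. Below is a minimal standard-library sketch on a trimmed, two-item stand-in for the response; the names NS, aesa_values, and items are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the SOAP response in the question (most fields omitted).
response = """<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <Desperados_Clientes_V2.DESPERADOSResponse xmlns="TrainsWebb_V16">
      <Sdtdesperadosclient xmlns="TrainsWebb_V16">
        <SDTDesperadosClientItem xmlns="TrainsWebb_V16">
          <AESA>10555555555 </AESA>
          <DOCUMENTO>1666666666</DOCUMENTO>
          <FECHA_CITA />
        </SDTDesperadosClientItem>
        <SDTDesperadosClientItem xmlns="TrainsWebb_V16">
          <AESA>10555555555 </AESA>
          <DOCUMENTO>177777777</DOCUMENTO>
          <FECHA_CITA />
        </SDTDesperadosClientItem>
      </Sdtdesperadosclient>
    </Desperados_Clientes_V2.DESPERADOSResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

NS = "{TrainsWebb_V16}"  # namespace URI in braces, without quotes

root = ET.fromstring(response)

# Namespace-qualified search: matches every AESA element in the document.
aesa_values = [el.text.strip() for el in root.findall(f".//{NS}AESA")]

# One dict per item; strip the namespace from each child tag for column names.
items = []
for item in root.iter(f"{NS}SDTDesperadosClientItem"):
    items.append({child.tag.replace(NS, ""): (child.text or "").strip() for child in item})

print(aesa_values)
print(items)
```

If a DataFrame is still wanted, items can be handed to pd.DataFrame(items).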
I have an xml file called persons.xml in the following format:
<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York"/>
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles"/>
</person>
</persons>
I want to export the list of person names, along with the city names, to a file:
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('./persons.xml')
root = tree.getroot()
df_cols = ["person_name", "city_name"]
rows = []
for node in root:
    person_name = node.attrib.get("name")
    rows.append({"person_name": person_name})

out_df = pd.DataFrame(rows, columns = df_cols)
out_df
Obviously this part of the code will only work for obtaining the name as it’s part of the root, but I can’t figure out how to loop through the child nodes too and obtain this info. Do I need to append something to root to iterate over the child nodes?
I can obtain everything using root.getchildren but it doesn’t allow me to return only the child nodes:
children = root.getchildren()
for child in children:
    ET.dump(child)
Is there a good way to get this information?
See below
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person id="1" name="John">
<city id="21" name="New York" />
</person>
<person id="2" name="Mary">
<city id="22" name="Los Angeles" />
</person>
</persons>'''
root = ET.fromstring(xml)
data = []
for p in root.findall('.//person'):
    data.append({'person': p.attrib['name'], 'city': p.find('city').attrib['name']})
df = pd.DataFrame(data)
print(df)
output
   person         city
0    John     New York
1    Mary  Los Angeles
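Since the question asks for a file, the same rows can also be written out as CSV. The following is a sketch using the standard csv module. It writes to an in-memory buffer so that it runs as-is; the third person without a city, the column names, and the empty-string fallback are assumptions added for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The question's sample plus one hypothetical person that has no <city> child.
xml = """<persons>
  <person id="1" name="John">
    <city id="21" name="New York" />
  </person>
  <person id="2" name="Mary">
    <city id="22" name="Los Angeles" />
  </person>
  <person id="3" name="Alex" />
</persons>"""

root = ET.fromstring(xml)

rows = []
for person in root.findall("person"):
    city = person.find("city")  # None when the person has no <city> child
    rows.append({
        "person_name": person.get("name"),
        "city_name": city.get("name") if city is not None else "",
    })

# Swap the buffer for open("persons.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["person_name", "city_name"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```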
I have a basic xml file called meals.xml which looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<meals name="Sample Text">
<meal id="1" name="Poached Eggs" type="breakfast"/>
<meal id="2" name="Club Sandwich" type="lunch"/>
<meal id="3" name="Steak" type="dinner"/>
<meal id="4" name="Steak" type="dinner"/>
</meals>
I want to extract both the 'id' and 'name' attributes into a dataframe. I can extract one when specifying one column and one attribute (e.g., name only), but can't figure out the syntax for getting multiple attributes in the for loop. This is what I've tried, adding id to 'df_cols' and to the 'attrib.get' call:
import xml.etree.ElementTree as ET
import pandas as pd
root = ET.parse('meals.xml').getroot()
df_cols = ["id", "name"]
rows = []
for node in root:
    value = node.attrib.get('id', 'name')
    rows.append(value)

df = pd.DataFrame(rows, columns = df_cols)
df
Can someone advise how to do this?
The below may work for you
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<meals name="Sample Text">
<meal id="1" name="Poached Eggs" type="breakfast"/>
<meal id="2" name="Club Sandwich" type="lunch"/>
<meal id="3" name="Steak" type="dinner"/>
<meal id="4" name="Steak" type="dinner"/>
</meals>'''
root = ET.fromstring(xml)
data = [{'id': m.attrib['id'], 'name': m.attrib['name']} for m in root.findall('.//meal')]
df = pd.DataFrame(data)
print(df)
output
   id           name
0   1   Poached Eggs
1   2  Club Sandwich
2   3          Steak
3   4          Steak
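The attempt in the question fails because the second argument of dict.get is a default value, not a second key, so attrib.get('id', 'name') returns only the id. Here is a short sketch of the difference, followed by a per-attribute lookup that tolerates missing attributes. The third meal without an id and the name wanted are hypothetical additions for illustration.

```python
import xml.etree.ElementTree as ET

# Two meals from the question plus one hypothetical meal that lacks an id.
xml = """<meals name="Sample Text">
  <meal id="1" name="Poached Eggs" type="breakfast"/>
  <meal id="2" name="Club Sandwich" type="lunch"/>
  <meal name="No id here" type="snack"/>
</meals>"""

root = ET.fromstring(xml)
first = root[0]

# The second argument is only a fallback used when the key is missing.
print(first.attrib.get("id", "name"))    # id is present, so its value is returned
print(root[2].attrib.get("id", "name"))  # id is missing, so the fallback string is returned

# Look up each wanted attribute separately; missing ones become None.
wanted = ("id", "name")
rows = [{key: meal.get(key) for key in wanted} for meal in root.findall("meal")]
print(rows)
```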
The code below goes through the xml files and parses them into a single csv file
from xml.etree import ElementTree as ET
from collections import defaultdict
import csv
from pathlib import Path
directory = 'path to a folder with xml files'
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)

    headers = ['id', 'service_code', 'rational', 'qualify', 'description_num', 'description_txt', 'set_data_xin', 'set_data_xax', 'set_data_value', 'set_data_x']
    writer.writerow(headers)

    xml_files_list = list(map(str, Path(directory).glob('**/*.xml')))
    print(xml_files_list)
    for xml_file in xml_files_list:
        tree = ET.parse(xml_file)
        root = tree.getroot()

        start_nodes = root.findall('.//START')
        for sn in start_nodes:
            row = defaultdict(str)

            repeated_values = dict()
            for k,v in sn.attrib.items():
                repeated_values[k] = v

            for rn in sn.findall('.//Rational'):
                repeated_values['rational'] = rn.text

            for qu in sn.findall('.//Qualify'):
                repeated_values['qualify'] = qu.text

            for ds in sn.findall('.//Description'):
                repeated_values['description_txt'] = ds.text
                repeated_values['description_num'] = ds.attrib['num']

            for st in sn.findall('.//SetData'):
                for k,v in st.attrib.items():
                    row['set_data_'+ str(k)] = v
                for key in repeated_values.keys():
                    row[key] = repeated_values[key]
                row_data = [row[i] for i in headers]
                writer.writerow(row_data)
                row = defaultdict(str)
This is the xml file.
<?xml version="1.0" encoding="utf-8"?>
<ProjectData>
<Phones>
<Date />
<Prog />
<Box />
<Feature />
<IN>MAFWDS</IN>
<Set>234234</Set>
<Pr>23423</Pr>
<Number>afasfhrtv</Number>
<Simple>dfasd</Simple>
<Nr />
<Get>6070106091</Get>
<Reno>1233</Reno>
</Phones>
<FINAL>
<START id="B001" service_code="0x5196">
<Docs Docs_type="START">
<Rational>225196</Rational>
<Qualify>6251960000A0DE</Qualify>
</Docs>
<Description num="1213f2312">The parameter</Description>
<DataFile dg="12" dg_id="let">
<SetData value="32" />
</DataFile>
</START>
<START id="C003" service_code="0x517B">
<Docs Docs_type="START">
<Rational>23423</Rational>
<Qualify>342342</Qualify>
</Docs>
<Description num="3423423f3423">The third</Description>
<DataFile dg="55" dg_id="big">
<SetData x="E1" value="21259" />
<SetData x="E2" value="02" />
</DataFile>
</START>
<START id="Z048" service_code="0x5198">
<RawData rawdata_type="ASDS">
<Rational>225198</Rational>
<Qualify>343243324234234</Qualify>
</RawData>
<Description num="434234234">The forth</Description>
<DataFile unit="21" unit_id="FEDS">
<FileX unit="eg" discrete="false" axis_pts="19" name="Vsome" text_id="bx5" unit_id="GDFSD" />
<SetData xin="5" xax="233" value="323" />
<SetData xin="123" xax="77" value="555" />
<SetData xin="17" xax="65" value="23" />
</DataFile>
</START>
</FINAL>
</ProjectData>
This is what the output looks like.
I am currently struggling to modify the code so that it goes to Phones (another child of ProjectData), takes the text of the Set and Get elements, joins them with an underscore (_), and writes the result into a new first column with the header Identify.
The picture below shows how it should look.
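Modify your headers line to
headers = ['identify', 'id', 'service_code', 'rational', 'qualify', 'description_num', 'description_txt', 'set_data_xin', 'set_data_xax', 'set_data_value', 'set_data_x']
Then, once each file has been parsed (after tree = ET.parse(xml_file)), read the two values:
p_get = tree.find('.//Phones/Get').text
p_set = tree.find('.//Phones/Set').text
and add this info to row_data just before the line writer.writerow(row_data), like this:
row_data.insert(0, p_get + '_' + p_set)
Update: because 'identify' is now part of headers, row_data already starts with an empty slot for it, so assign to that slot instead of inserting:
row_data[0] = p_get + '_' + p_set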
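Put together, the change could look like the sketch below. It is a condensed, hypothetical variant of the script, not the original code: it parses an inline, trimmed sample instead of a folder of files, writes to an in-memory buffer, uses csv.DictWriter, and reads the values with findtext and an empty default so that a missing Get or Set does not raise.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed sample based on the XML in the question.
SAMPLE = """<ProjectData>
  <Phones>
    <Set>234234</Set>
    <Get>6070106091</Get>
  </Phones>
  <FINAL>
    <START id="C003" service_code="0x517B">
      <Docs Docs_type="START">
        <Rational>23423</Rational>
        <Qualify>342342</Qualify>
      </Docs>
      <Description num="3423423f3423">The third</Description>
      <DataFile dg="55" dg_id="big">
        <SetData x="E1" value="21259" />
        <SetData x="E2" value="02" />
      </DataFile>
    </START>
  </FINAL>
</ProjectData>"""

headers = ["identify", "id", "service_code", "rational", "qualify",
           "description_num", "description_txt", "set_data_x", "set_data_value"]

root = ET.fromstring(SAMPLE)

# Read the Phones values once per file; findtext returns "" if the element is absent.
p_get = root.findtext(".//Phones/Get", default="")
p_set = root.findtext(".//Phones/Set", default="")
identify = p_get + "_" + p_set  # same order as in the answer; swap if Set should come first

buffer = io.StringIO()
# Columns missing from a row are left empty; attributes not in headers are skipped.
writer = csv.DictWriter(buffer, fieldnames=headers, restval="", extrasaction="ignore")
writer.writeheader()

for start in root.findall(".//START"):
    # Values shared by every SetData row of this START element.
    shared = {"identify": identify, **start.attrib}
    shared["rational"] = start.findtext(".//Rational", default="")
    shared["qualify"] = start.findtext(".//Qualify", default="")
    desc = start.find(".//Description")
    if desc is not None:
        shared["description_num"] = desc.get("num", "")
        shared["description_txt"] = desc.text or ""
    for set_data in start.findall(".//SetData"):
        row = dict(shared)
        row.update({"set_data_" + k: v for k, v in set_data.attrib.items()})
        writer.writerow(row)

lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Building each row as a dict keyed by header name sidesteps the insert-versus-assign question entirely, since the column position is decided by headers alone.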
Hi, I have the following XML and I am using the Python ElementTree library to parse it.
<?xml version="1.0" encoding="UTF-8"?>
<bpmn:definitions xmlns:bpmn="http://www.omg.org/spec/BPMN/20100524/MODEL" xmlns:bpmndi="http://www.omg.org/spec/BPMN/20100524/DI" xmlns:di="http://www.omg.org/spec/DD/20100524/DI" xmlns:dc="http://www.omg.org/spec/DD/20100524/DC" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:camunda="http://camunda.org/schema/1.0/bpmn" id="Definitions_0qjsjhs" targetNamespace="http://bpmn.io/schema/bpmn" exporter="Camunda Modeler" exporterVersion="2.2.4">
<bpmn:process id="Process_1" isExecutable="true">
<bpmn:startEvent id="StartEvent_1">
<bpmn:outgoing>SequenceFlow_0bwuv8k</bpmn:outgoing>
</bpmn:startEvent>
<bpmn:sequenceFlow id="SequenceFlow_0bwuv8k" sourceRef="StartEvent_1" targetRef="Task_07owcwp" />
<bpmn:endEvent id="EndEvent_13n8600">
<bpmn:incoming>SequenceFlow_111n6oc</bpmn:incoming>
</bpmn:endEvent>
<bpmn:sequenceFlow id="SequenceFlow_111n6oc" sourceRef="Task_07owcwp" targetRef="EndEvent_13n8600">
<bpmn:conditionExpression xsi:type="bpmn:tFormalExpression">myexpression.</bpmn:conditionExpression>
</bpmn:sequenceFlow>
<bpmn:scriptTask id="Task_07owcwp" scriptFormat="testformat" camunda:resultVariable="output">
<bpmn:incoming>SequenceFlow_0bwuv8k</bpmn:incoming>
<bpmn:outgoing>SequenceFlow_111n6oc</bpmn:outgoing>
<bpmn:script>myscript</bpmn:script>
</bpmn:scriptTask>
</bpmn:process>
<bpmndi:BPMNDiagram id="BPMNDiagram_1">
<bpmndi:BPMNPlane id="BPMNPlane_1" bpmnElement="Process_1">
<bpmndi:BPMNShape id="_BPMNShape_StartEvent_2" bpmnElement="StartEvent_1">
<dc:Bounds x="173" y="102" width="36" height="36" />
</bpmndi:BPMNShape>
<bpmndi:BPMNEdge id="SequenceFlow_0bwuv8k_di" bpmnElement="SequenceFlow_0bwuv8k">
<di:waypoint x="209" y="120" />
<di:waypoint x="266" y="120" />
</bpmndi:BPMNEdge>
<bpmndi:BPMNShape id="EndEvent_13n8600_di" bpmnElement="EndEvent_13n8600">
<dc:Bounds x="423" y="102" width="36" height="36" />
</bpmndi:BPMNShape>
<bpmndi:BPMNEdge id="SequenceFlow_111n6oc_di" bpmnElement="SequenceFlow_111n6oc">
<di:waypoint x="366" y="120" />
<di:waypoint x="423" y="120" />
</bpmndi:BPMNEdge>
<bpmndi:BPMNShape id="ScriptTask_1p2sdkp_di" bpmnElement="Task_07owcwp">
<dc:Bounds x="266" y="80" width="100" height="80" />
</bpmndi:BPMNShape>
</bpmndi:BPMNPlane>
</bpmndi:BPMNDiagram>
</bpmn:definitions>
I am currently at the 'bpmn:scriptTask' element and I am having trouble working out how to extract the value of 'camunda:resultVariable':
<bpmn:scriptTask id="Task_07owcwp" scriptFormat="testformat" camunda:resultVariable="output">
<bpmn:incoming>SequenceFlow_0bwuv8k</bpmn:incoming>
<bpmn:outgoing>SequenceFlow_111n6oc</bpmn:outgoing>
<bpmn:script>myscript</bpmn:script>
</bpmn:scriptTask>
I have tried
node.find('camunda:resultVariable', {'camunda': 'http://camunda.org/schema/1.0/bpmn'})
where node is the ElementTree Element for scriptTask. Do find or findall allow you to look at the current element's attributes? It doesn't seem to be finding it. How would I go about getting this value?
Thank you.
OK, I worked it out. I can use .get, but I need to expand the namespace, like so:
node.get('{http://camunda.org/schema/1.0/bpmn}resultVariable')
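This works because find and findall search for child elements, whereas resultVariable is an attribute of the scriptTask element itself. Attributes are read with get, and a prefixed attribute is keyed by its expanded {namespace-URI}name form. Here is a small sketch on a trimmed copy of the XML above; the NS mapping and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the BPMN document from the question.
xml = """<bpmn:definitions
    xmlns:bpmn="http://www.omg.org/spec/BPMN/20100524/MODEL"
    xmlns:camunda="http://camunda.org/schema/1.0/bpmn">
  <bpmn:process id="Process_1" isExecutable="true">
    <bpmn:scriptTask id="Task_07owcwp" scriptFormat="testformat" camunda:resultVariable="output">
      <bpmn:script>myscript</bpmn:script>
    </bpmn:scriptTask>
  </bpmn:process>
</bpmn:definitions>"""

NS = {
    "bpmn": "http://www.omg.org/spec/BPMN/20100524/MODEL",
    "camunda": "http://camunda.org/schema/1.0/bpmn",
}

root = ET.fromstring(xml)

# Prefix mappings work for locating elements...
node = root.find(".//bpmn:scriptTask", NS)

# ...but attribute keys must be spelled out in {uri}name form.
result_variable = node.get("{%s}resultVariable" % NS["camunda"])
script_format = node.get("scriptFormat")  # un-prefixed attributes have no namespace

print(result_variable, script_format)
print(node.attrib)
```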