Converting unusual XML file to CSV using Python - python

I'm having an issue with my XML file. I would like to achieve the same as in: https://www.delftstack.com/howto/python/xml-to-csv-python/
However, my XML file looks a bit different, for example:
<students>
<student name="Rick Grimes" rollnumber="1" age="15"/>
<student name="Lori Grimes" rollnumber="2" age="16"/>
<student name="Judith Grimes" rollnumber="4" age="13"/>
</students>
The code specified in the link does not work with this formatting.
from xml.etree import ElementTree
tree = ElementTree.parse("input.xml")
root = tree.getroot()
for student in root:
name = student.find("name").text
roll_number = student.find("rollnumber").text
age = student.find("age").text
print(f"{name},{roll_number},{age}")
I have very little coding experience, so hoping someone on here can help me out.
Expected result:
Rick Grimes,1,15
Lori Grimes,2,16
Carl Grimes,3,14
Judith Grimes,4,13
Actual result:
AttributeError: 'NoneType' object has no attribute 'text'

text refers to the actual text of the tag. To make it clear:
<student> text here </student>
You don't have any here since your tags are autoclosing. What you are looking for is the tag attribute attrib: doc here
Something like this should help you get what you're looking for:
for student in root:
print(student.attrib)

You cannot get the text if there aren't any text to get.
Instead you want to use .attrib[key] as you have the values as attributes.
I have modified your example so that it will work with your XML file.
from xml.etree import ElementTree
tree = ElementTree.parse("input.xml")
root = tree.getroot()
for student in root:
name = student.attrib["name"]
roll_number = student.attrib["rollnumber"]
age = student.attrib["age"]
print(f"{name},{roll_number},{age}")
I hope this will help you.

import io
from xml.etree import ElementTree
xml_string = """<students>
<student name="Rick Grimes" rollnumber="1" age="15"/>
<student name="Lori Grimes" rollnumber="2" age="16"/>
<student name="Judith Grimes" rollnumber="4" age="13"/>
</students>"""
file = io.StringIO(xml_string)
tree = ElementTree.parse(file)
root = tree.getroot()
result = ""
for student in root:
result += f"{student.attrib['name']},{student.attrib['rollnumber']},{student.attrib['age']} "
print(result)
result
Rick Grimes,1,15 Lori Grimes,2,16 Judith Grimes,4,13

For such easy structured XML you can use also the build in function from pandas in two lines of code:
import pandas as pd
df = pd.read_xml('caroline.xml', xpath='.//student')
csv = df.to_csv('caroline.csv', index=False)
# For visualization only
with open('caroline.csv', 'r') as f:
lines = f.readlines()
for line in lines:
print(line)
Output:
name,rollnumber,age
Rick Grimes,1,15
Lori Grimes,2,16
Judith Grimes,4,13
With the option header=False you can also switch off to write the header to the csv file.

Related

Python XML: Getting unnecessary new lines when adding new element

Basically I'm trying to add a new element and for it to be properly indented, but with this code I get unnecessary new lines between elements. What is causing it and how do I fix it? Thanks
Example:
from xml.dom import minidom
import xml.etree.ElementTree as ET
def example(name, category):
tree = ET.parse("example1.xml")
root = tree.getroot()
for i in root:
if i.tag == category:
ET.SubElement(i, name).text = name
xmlStr = minidom.parseString(ET.tostring(root)).toprettyxml(indent=" ")
with open("example1.xml", "w") as f:
f.write(xmlStr)
example("test", 'FRUITS')
XML File:
<?xml version="1.0" ?>
<root>
<FRUITS>
<APPLE>apple</APPLE>
<PEAR>pear</PEAR>
<PLUM>plum</PLUM>
</FRUITS>
<VEGETABLES>
<CARROT>carrot</CARROT>
<POTATO>potato</POTATO>
</VEGETABLES>

How to indent my xml data which in xml file python [duplicate]

After reading from an existing file with 'ugly' XML and doing some modifications, pretty printing doesn't work. I've tried etree.write(FILE_NAME, pretty_print=True).
I have the following XML:
<testsuites tests="14" failures="0" disabled="0" errors="0" time="0.306" name="AllTests">
<testsuite name="AIR" tests="14" failures="0" disabled="0" errors="0" time="0.306">
....
And I use it like this:
tree = etree.parse('original.xml')
root = tree.getroot()
...
# modifications
...
with open(FILE_NAME, "w") as f:
tree.write(f, pretty_print=True)
For me, this issue was not solved until I noticed this little tidbit here:
http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
Short version:
Read in the file with this command:
>>> parser = etree.XMLParser(remove_blank_text=True)
>>> tree = etree.parse(filename, parser)
That will "reset" the already existing indentation, allowing the output to generate it's own indentation correctly. Then pretty_print as normal:
>>> tree.write(<output_file_name>, pretty_print=True)
Well, according to the API docs, there is no method "write" in the lxml etree module. You've got a couple of options in regards to getting a pretty printed xml string into a file. You can use the tostring method like so:
f = open('doc.xml', 'w')
f.write(etree.tostring(root, pretty_print=True))
f.close()
Or, if your input source is less than perfect and/or you want more knobs and buttons to configure your out put you could use one of the python wrappers for the tidy lib.
http://utidylib.berlios.de/
import tidy
f.write(tidy.parseString(your_xml_str, **{'output_xml':1, 'indent':1, 'input_xml':1}))
http://countergram.com/open-source/pytidylib
from tidylib import tidy_document
document, errors = tidy_document(your_xml_str, options={'output_xml':1, 'indent':1, 'input_xml':1})
f.write(document)
fp = file('out.txt', 'w')
print(e.tree.tostring(...), file=fp)
fp.close()
Here is an answer that is fixed to work with Python 3:
from lxml import etree
from sys import stdout
from io import BytesIO
parser = etree.XMLParser(remove_blank_text = True)
file_obj = BytesIO(text)
tree = etree.parse(file_obj, parser)
tree.write(stdout.buffer, pretty_print = True)
where text is the xml code as a sequence of bytes.
I am not sure why other answers did not mention this. If you want to obtain the root of the xml there is a method called getroot(). I hope I answered your question (though a little late).
tree = et.parse(xmlFile)
root = tree.getroot()
Of course - pretty print of lxml.etree is possible.
In my case, the old trick with remove_blank_text=True and pretty_print=True was not working as I expected (was too delicate), so I decided to write it by myself.
Here is it - a modern, forcible, native pythonic way to correct lxml.etee.Element tree indentation.
This gives a nicely prettified XML string:
from typing import Optional
import lxml.etree
def indent_lxml(element: lxml.etree.Element, level: int = 0, is_last_child: bool = True) -> None:
space = " "
indent_str = "\n" + level * space
element.text = strip_or_null(element.text)
if element.text:
element.text = f"{indent_str}{space}{element.text}"
num_children = len(element)
if num_children:
element.text = f"{element.text or ''}{indent_str}{space}"
for index, child in enumerate(element.iterchildren()):
is_last = index == num_children - 1
indent_lxml(child, level + 1, is_last)
elif element.text:
element.text += indent_str
tail_level = max(0, level - 1) if is_last_child else level
tail_indent = "\n" + tail_level * space
tail = strip_or_null(element.tail)
element.tail = f"{indent_str}{tail}{tail_indent}" if tail else tail_indent
def strip_or_null(text: Optional[str]) -> Optional[str]:
if text is not None:
return text.strip() or None
It's decent fast, because it doesn't allocate any additional structures in memory and also traversing the tree - it visits each node only once, giving the best possible - O x N computational complexity.
It rearranges all the existing indentation "in place" in the tree (the DOM) by correcting contents of Element.text and Element.tail attributes (affects white-spaces only).
Naturally, it also can be used with HTML parsed by lxml.
In order to use it, do something like that:
root = lxml.etree.parse("path/to/the_file.xml").getroot()
# or
root = lxml.etree.fromstring("<xml><body><leaf1/><leaf2/></body></xml>")
indent_lxml(root) # corrects indentation "in place"
result = lxml.etree.tostring(root, encoding="unicode")
print(result)
Which prints:
<xml>
<body>
<leaf1/>
<leaf2/>
</body>
</xml>

How to properly format xml file using lxml? [duplicate]

After reading from an existing file with 'ugly' XML and doing some modifications, pretty printing doesn't work. I've tried etree.write(FILE_NAME, pretty_print=True).
I have the following XML:
<testsuites tests="14" failures="0" disabled="0" errors="0" time="0.306" name="AllTests">
<testsuite name="AIR" tests="14" failures="0" disabled="0" errors="0" time="0.306">
....
And I use it like this:
tree = etree.parse('original.xml')
root = tree.getroot()
...
# modifications
...
with open(FILE_NAME, "w") as f:
tree.write(f, pretty_print=True)
For me, this issue was not solved until I noticed this little tidbit here:
http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
Short version:
Read in the file with this command:
>>> parser = etree.XMLParser(remove_blank_text=True)
>>> tree = etree.parse(filename, parser)
That will "reset" the already existing indentation, allowing the output to generate it's own indentation correctly. Then pretty_print as normal:
>>> tree.write(<output_file_name>, pretty_print=True)
Well, according to the API docs, there is no method "write" in the lxml etree module. You've got a couple of options in regards to getting a pretty printed xml string into a file. You can use the tostring method like so:
f = open('doc.xml', 'w')
f.write(etree.tostring(root, pretty_print=True))
f.close()
Or, if your input source is less than perfect and/or you want more knobs and buttons to configure your out put you could use one of the python wrappers for the tidy lib.
http://utidylib.berlios.de/
import tidy
f.write(tidy.parseString(your_xml_str, **{'output_xml':1, 'indent':1, 'input_xml':1}))
http://countergram.com/open-source/pytidylib
from tidylib import tidy_document
document, errors = tidy_document(your_xml_str, options={'output_xml':1, 'indent':1, 'input_xml':1})
f.write(document)
fp = file('out.txt', 'w')
print(e.tree.tostring(...), file=fp)
fp.close()
Here is an answer that is fixed to work with Python 3:
from lxml import etree
from sys import stdout
from io import BytesIO
parser = etree.XMLParser(remove_blank_text = True)
file_obj = BytesIO(text)
tree = etree.parse(file_obj, parser)
tree.write(stdout.buffer, pretty_print = True)
where text is the xml code as a sequence of bytes.
I am not sure why other answers did not mention this. If you want to obtain the root of the xml there is a method called getroot(). I hope I answered your question (though a little late).
tree = et.parse(xmlFile)
root = tree.getroot()
Of course - pretty print of lxml.etree is possible.
In my case, the old trick with remove_blank_text=True and pretty_print=True was not working as I expected (was too delicate), so I decided to write it by myself.
Here is it - a modern, forcible, native pythonic way to correct lxml.etee.Element tree indentation.
This gives a nicely prettified XML string:
from typing import Optional
import lxml.etree
def indent_lxml(element: lxml.etree.Element, level: int = 0, is_last_child: bool = True) -> None:
space = " "
indent_str = "\n" + level * space
element.text = strip_or_null(element.text)
if element.text:
element.text = f"{indent_str}{space}{element.text}"
num_children = len(element)
if num_children:
element.text = f"{element.text or ''}{indent_str}{space}"
for index, child in enumerate(element.iterchildren()):
is_last = index == num_children - 1
indent_lxml(child, level + 1, is_last)
elif element.text:
element.text += indent_str
tail_level = max(0, level - 1) if is_last_child else level
tail_indent = "\n" + tail_level * space
tail = strip_or_null(element.tail)
element.tail = f"{indent_str}{tail}{tail_indent}" if tail else tail_indent
def strip_or_null(text: Optional[str]) -> Optional[str]:
if text is not None:
return text.strip() or None
It's decent fast, because it doesn't allocate any additional structures in memory and also traversing the tree - it visits each node only once, giving the best possible - O x N computational complexity.
It rearranges all the existing indentation "in place" in the tree (the DOM) by correcting contents of Element.text and Element.tail attributes (affects white-spaces only).
Naturally, it also can be used with HTML parsed by lxml.
In order to use it, do something like that:
root = lxml.etree.parse("path/to/the_file.xml").getroot()
# or
root = lxml.etree.fromstring("<xml><body><leaf1/><leaf2/></body></xml>")
indent_lxml(root) # corrects indentation "in place"
result = lxml.etree.tostring(root, encoding="unicode")
print(result)
Which prints:
<xml>
<body>
<leaf1/>
<leaf2/>
</body>
</xml>

Python: Writing list to xml file

I'm trying to write the list elements to an xml file. I have written the below code. The xml file is created, but the data is repeated. I'm unable to figure out why is the data written twice in the xml file.
users_list = ['Group1User1', 'Group1User2', 'Group2User1', 'Group2User2']
def create_xml(self):
usrconfig = Element("usrconfig")
usrconfig = ET.SubElement(usrconfig,"usrconfig")
for user in range(len( users_list)):
usr = ET.SubElement(usrconfig,"usr")
usr.text = str(users_list[user])
usrconfig.extend(usrconfig)
tree = ET.ElementTree(usrconfig)
tree.write("details.xml",encoding='utf-8', xml_declaration=True)
Output File: details.xml
-
<usr>Group1User1</usr>
<usr>Group1User2</usr>
<usr>Group2User1</usr>
<usr>Group2User2</usr>
<usr>Group1User1</usr>
<usr>Group1User2</usr>
<usr>Group2User1</usr>
<usr>Group2User2</usr>
enter image description here
usrconfig.extend(usrconfig)
This line looks suspicious to me. if userconfig was a list, this line would be equivalent to "duplicate every element in this list". I suspect that something similar happens for Elements, too. Try deleting that line.
import xml.etree.ElementTree as ET
users_list = ["Group1User1", "Group1User2", "Group2User1", "Group2User2"]
def create_xml():
usrconfig = ET.Element("usrconfig")
usrconfig = ET.SubElement(usrconfig,"usrconfig")
for user in range(len( users_list)):
usr = ET.SubElement(usrconfig,"usr")
usr.text = str(users_list[user])
tree = ET.ElementTree(usrconfig)
tree.write("details.xml",encoding='utf-8', xml_declaration=True)
create_xml()
Result:
<?xml version='1.0' encoding='utf-8'?>
<usrconfig>
<usr>Group1User1</usr>
<usr>Group1User2</usr>
<usr>Group2User1</usr>
<usr>Group2User2</usr>
</usrconfig>
For such a simple xml structure, we can directly write out the file. But this technique might also be useful if one is not up to speed with the python xml modules.
import os
users_list = ["Group1User1", "Group1User2", "Group2User1", "Group2User2"]
os.chdir("C:\\Users\\Mike\\Desktop")
xml_out_DD = open("test.xml", 'wb')
xml_out_DD.write(bytes('<usrconfig>', 'utf-8'))
for i in range(0, len(users_list)):
xml_out_DD.write(bytes('<usr>' + users_list[i] + '</usr>', 'utf-8'))
xml_out_DD.write(bytes('</usrconfig>', 'utf-8'))
xml_out_DD.close()

xml file to csv file python script

I need a python script for extract data from xml file
I have a xml file as shoen below:
<software>
<name>Update Image</name>
<Build>22.02</Build>
<description>Firmware for Delta-M Series </description>
<CommonImages> </CommonImages>
<ModelBasedImages>
<ULT>
<CNTRL_0>
<file type="UI_APP" ver="2.35" crc="1234"/>
<file type="MainFW" ver="5.01" crc="5678"/>
<SIZE300>
<file type="ParamTableDB" ver="1.1.4" crc="9101"/>
</SIZE300>
</CNTRL_0>
<CNTRL_2>
<file type="UI_APP" ver="2.35" crc="1234"/>
<file type="MainFW" ver="5.01" crc="9158"/>
</CNTRL_2>
</ULT>
</ModelBasedImages>
</software>
I want the data in table format like:
type ver crc
UI_APP 2.35 1234
MainFW 5.01 5678
ParamTableDB 1.1.4 9101
UI_APP 2.35 1234
MainFW 5.01 9158
Extract into any type of file csv/doc....
I tried this code:
import xml.etree.ElementTree as ET
import csv
tree = ET.parse("Build_40.01 (copy).xml")
root = tree.getroot()
# open a file for writing
Resident_data = open('ResidentData.csv', 'w')
# create the csv writer object
csvwriter = csv.writer(Resident_data)
resident_head = []
count = 0
for member in root.findall('file'):
resident = []
address_list = []
if count == 0:
name = member.find('type').tag
resident_head.append(name)
ver = member.find('ver').tag
resident_head.append(ver)
crc = member.find('crc').tag
resident_head.append(crc)
csvwriter.writerow(resident_head)
count = count + 1
name = member.find('type').text
resident.append(name)
ver = member.find('ver').text
resident.append(ver)
crc = member.find('crc').text
resident.append(crc)
csvwriter.writerow(resident)
Resident_data.close()
Thanks in advance
edited:xml code updated.
Use the xpath expression .//file to find all <file> elements in the XML document, and then use each element's attributes to populate the CSV file through a csv.DictWriter:
import csv
import xml.etree.ElementTree as ET
tree = ET.parse("Build_40.01 (copy).xml")
root = tree.getroot()
with open('ResidentData.csv', 'w') as f:
w = csv.DictWriter(f, fieldnames=('type', 'ver', 'crc'))
w.writerheader()
w.writerows(e.attrib for e in root.findall('.//file'))
For your sample input the output CSV file will look like this:
type,ver,crc
UI_APP,2.35,1234
MainFW,5.01,5678
ParamTableDB,1.1.4,9101
UI_APP,2.35,1234
MainFW,5.01,9158
which uses the default delimiter (comma) for a CSV file. You can change the delimiter using the delimiter=' ' option to DictWriter(), however, you will not be able to obtain the same formatting as your sample output, which appears to use fixed width fields (but you might get away with using tab as the delimiter).

Categories

Resources