Adjust Python function to parse XML

I need to read an XML file in an external domain.
My code:
from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen
import xml.etree.ElementTree as ET

tree = ET.ElementTree(file=urlopen('http://192.168.2.57:8010/data/camera_state.xml'))
root = tree.getroot()
root.tag, root.attrib
for elem in tree.iter():
    print(elem.tag, elem.attrib)
I could not get the structure I need. The result of my code is shown below:
CameraState {}
Cameras {}
Camera {'Id': '1'}
State {}
Camera {'Id': '2'}
State {}
Camera {'Id': '3'}
State {}
Camera {'Id': '4'}
State {}
I need to adjust this Python function to get a result like the one below:
<CameraState>
  <Cameras>
    <Camera Id="1">
      <State>NO_SIGNAL</State>
    </Camera>
    <Camera Id="2">
      <State>OK</State>
    </Camera>
  </Cameras>
</CameraState>

You do have the parsed structure. It's just a matter of how you access it.
Iterate over an element to access its child nodes (the older getchildren() method was removed in Python 3.9). An example of recursively printing the structure:
import xml.etree.ElementTree as ET

def print_tree(node, prefix=''):
    # node.text is None for an element without text, so fall back to ''
    print(prefix, node.tag, node.attrib, (node.text or '').strip())
    for child in node:
        print_tree(child, prefix + ' ')

tree = ET.ElementTree(file=<your file>)
root = tree.getroot()
print_tree(root)
It gives:
 CameraState {}
  Cameras {}
   Camera {'Id': '1'}
    State {} NO_SIGNAL
   Camera {'Id': '2'}
    State {} OK
However, I recommend you take a look at xmltodict:
import xmltodict

with open(<your file>) as f:
    tree = xmltodict.parse(f.read())
print(tree)
It gives you OrderedDicts (newer versions of xmltodict may return plain dicts instead), with attribute names prefixed by '@':
OrderedDict([('CameraState', OrderedDict([('Cameras', OrderedDict([('Camera', [OrderedDict([('@Id', '1'), ('State', 'NO_SIGNAL')]), OrderedDict([('@Id', '2'), ('State', 'OK')])])]))]))])
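If the goal is just the state of each camera, a minimal sketch with the standard library could look like the following. The inline SAMPLE string stands in for the remote camera_state.xml, and the helper name camera_states is made up for illustration; neither comes from the question.

```python
import xml.etree.ElementTree as ET

# Stand-in for the document served at .../camera_state.xml (assumed shape)
SAMPLE = """<CameraState>
  <Cameras>
    <Camera Id="1"><State>NO_SIGNAL</State></Camera>
    <Camera Id="2"><State>OK</State></Camera>
  </Cameras>
</CameraState>"""


def camera_states(root):
    """Map each Camera's Id attribute to the text of its State child."""
    states = {}
    for camera in root.iter('Camera'):
        # findtext returns the default when there is no State child
        states[camera.get('Id')] = camera.findtext('State', default='').strip()
    return states


root = ET.fromstring(SAMPLE)
states = camera_states(root)
print(states)

# Serialise the parsed tree back to XML text, tags and all
xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```

ET.tostring is what gives back the tagged form shown in the question; the dictionary is usually easier to work with in code.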

Related

Parse xml file to a python list

I have an XML file:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<Document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03">
<CstmrCdtTrfInitn>
<GrpHdr>
<MsgId>637987745078994894</MsgId>
<CreDtTm>2022-09-14T05:48:27</CreDtTm>
<NbOfTxs>205</NbOfTxs>
<CtrlSum>154761.02</CtrlSum>
<InitgPty>
<Nm> Company</Nm>
</InitgPty>
</GrpHdr>
<PmtInf>
<PmtInfId>20220914054827-154016</PmtInfId>
<PmtMtd>TRF</PmtMtd>
<BtchBookg>true</BtchBookg>
<NbOfTxs>205</NbOfTxs>
<CtrlSum>154761.02</CtrlSum>
<PmtTpInf>
<SvcLvl>
<Cd>SEPA</Cd>
</SvcLvl>
<CtgyPurp>
<Cd>SALA</Cd>
</CtgyPurp>
</PmtTpInf>
<CdtTrfTxInf> <----------------------------------
<Amt>
<InstdAmt Ccy="EUR">1536.96</InstdAmt>
</Amt>
<Cdtr>
<Nm>Achternaam, Voornaam </Nm>
</Cdtr>
<CdtrAcct>
<Id>
<IBAN>NL80RABO0134343443</IBAN>
</Id>
</CdtrAcct>
</CdtTrfTxInf> <------------------------------------
<CdtTrfTxInf> <----------------------------------
<Amt>
<InstdAmt Ccy="EUR">1676.96</InstdAmt>
</Amt>
<Cdtr>
<Nm>Achternaam, Voornaam </Nm>
</Cdtr>
<CdtrAcct>
<Id>
<IBAN>NL80RABO013433222243</IBAN>
</Id>
</CdtrAcct>
</CdtTrfTxInf> <------------------------------------
</CstmrCdtTrfInitn>
</Document>
I use ElementTree.
I want a Python list of tuples with the info inside each CdtTrfTxInf tag (everything between the arrows in the example XML file). So in this example I want a list with 2 tuples.
How can I do that?
I can iterate over the tree, but that is it.
My code:
import xml.etree.ElementTree as ET

tree = ET.parse(xml_file)
root = tree.getroot()
for elem in tree.iter():
    print(elem.tag, elem.text)  # --> I get every tag in the whole file

I rather like to use xmltodict.
First of all, your input data as given is missing a closing </PmtInf> tag towards the end, just before your closing </CstmrCdtTrfInitn> tag. After fixing that, I saved your xml data into a file and did the following:
import xmltodict

with open("input_data.xml", "r") as f:
    xml_data = f.read()
xml_dict = xmltodict.parse(xml_data)
You can then access the XML data using dictionary accessors (attribute keys are prefixed with '@'), for example:
>>> xml_dict
{'Document': {'@xmlns:xsi': 'http://www.w3.org/20...a-instance', '@xmlns': 'urn:iso:std:iso:2002...001.001.03', 'CstmrCdtTrfInitn': {...}}}
>>> xml_dict["Document"]
{'@xmlns:xsi': 'http://www.w3.org/20...a-instance', '@xmlns': 'urn:iso:std:iso:2002...001.001.03', 'CstmrCdtTrfInitn': {'GrpHdr': {...}, 'PmtInf': {...}}}
>>> xml_dict["Document"]["CstmrCdtTrfInitn"].keys()
dict_keys(['GrpHdr', 'PmtInf'])
>>> xml_dict["Document"]["CstmrCdtTrfInitn"]["PmtInf"]
{'PmtInfId': '20220914054827-154016', 'PmtMtd': 'TRF', 'BtchBookg': 'true', 'NbOfTxs': '205', 'CtrlSum': '154761.02', 'PmtTpInf': {'SvcLvl': {...}, 'CtgyPurp': {...}}, 'CdtTrfTxInf': [{...}, {...}]}
>>> xml_dict["Document"]["CstmrCdtTrfInitn"]["PmtInf"].keys()
dict_keys(['PmtInfId', 'PmtMtd', 'BtchBookg', 'NbOfTxs', 'CtrlSum', 'PmtTpInf', 'CdtTrfTxInf'])
Then you can loop over your CdtTrfTxInf with:
for item in xml_dict["Document"]["CstmrCdtTrfInitn"]["PmtInf"]["CdtTrfTxInf"]:
    print(item)
giving the output:
{'Amt': {'InstdAmt': {'@Ccy': 'EUR', '#text': '1536.96'}}, 'Cdtr': {'Nm': 'Achternaam, Voornaam'}, 'CdtrAcct': {'Id': {'IBAN': 'NL80RABO0134343443'}}}
{'Amt': {'InstdAmt': {'@Ccy': 'EUR', '#text': '1676.96'}}, 'Cdtr': {'Nm': 'Achternaam, Voornaam'}, 'CdtrAcct': {'Id': {'IBAN': 'NL80RABO013433222243'}}}
which you can process as you want.

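This is just a quick attempt; give it a try:
import xml.etree.ElementTree as ET

tree = ET.parse("fr.xml")
root = tree.getroot()
test = False
for elem in tree.iter():
    # Namespaced tags look like "{uri}CdtTrfTxInf", so compare the local name
    name = elem.tag.split("}")[-1]
    if name == "CdtTrfTxInf":
        test = True
        continue
    # elem.text can be None, so check it before calling strip()
    if test and elem.text and elem.text.strip():
        print(name, elem.text)
With the result as a list of tuples:
import xml.etree.ElementTree as ET

tree = ET.parse("fr.xml")
root = tree.getroot()
test = False
tag = []
textval = []
for elem in tree.iter():
    # Namespaced tags look like "{uri}CdtTrfTxInf", so compare the local name
    name = elem.tag.split("}")[-1]
    if name == "CdtTrfTxInf":
        test = True
        continue
    # elem.text can be None, so check it before calling strip()
    if test and elem.text and elem.text.strip():
        tag.append(name)
        textval.append(elem.text)
data = list(zip(tag, textval))
print(data)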
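To get one tuple per CdtTrfTxInf (rather than one flat list of tag/text pairs), here is a rough sketch using ElementTree's namespace support. The prefix "p", the helper names text_of and transactions, and the trimmed SAMPLE document are illustrative assumptions; the choice of amount, name and IBAN as the tuple fields is also just one possible reading of the question.

```python
import xml.etree.ElementTree as ET

# Map an arbitrary prefix to the document's default namespace
NS = {"p": "urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"}

# Trimmed-down stand-in for the file in the question
SAMPLE = """<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03">
  <CstmrCdtTrfInitn>
    <PmtInf>
      <CdtTrfTxInf>
        <Amt><InstdAmt Ccy="EUR">1536.96</InstdAmt></Amt>
        <Cdtr><Nm>Achternaam, Voornaam </Nm></Cdtr>
        <CdtrAcct><Id><IBAN>NL80RABO0134343443</IBAN></Id></CdtrAcct>
      </CdtTrfTxInf>
      <CdtTrfTxInf>
        <Amt><InstdAmt Ccy="EUR">1676.96</InstdAmt></Amt>
        <Cdtr><Nm>Achternaam, Voornaam </Nm></Cdtr>
        <CdtrAcct><Id><IBAN>NL80RABO013433222243</IBAN></Id></CdtrAcct>
      </CdtTrfTxInf>
    </PmtInf>
  </CstmrCdtTrfInitn>
</Document>"""


def text_of(parent, path):
    """Return the stripped text at path, or '' when the element is missing."""
    return parent.findtext(path, default="", namespaces=NS).strip()


def transactions(root):
    """Build one (amount, creditor name, IBAN) tuple per CdtTrfTxInf."""
    rows = []
    for tx in root.iterfind(".//p:CdtTrfTxInf", NS):
        rows.append((
            text_of(tx, "p:Amt/p:InstdAmt"),
            text_of(tx, "p:Cdtr/p:Nm"),
            text_of(tx, "p:CdtrAcct/p:Id/p:IBAN"),
        ))
    return rows


rows = transactions(ET.fromstring(SAMPLE))
print(rows)
```

For a file on disk, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE).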

Extraction of XML Data using Python

I am trying to create a JSON file containing extracted data of Goodreads XML file, but I am unable to do so. I have never worked with XML and I have tried to go through tutorials but to no avail, I am not able to extract any data.
My XML File looks like this:
<books_count type="integer">11</books_count>
<original_publication_year type="integer">1997</original_publication_year>
<original_publication_month type="integer" nil="true"/>
<original_publication_day type="integer" nil="true"/>
<original_title>There Was an Old Lady Who Swallowed a Fly</original_title>
<popular_shelves>
<shelf name="to-read" count="3504"/>
<shelf name="picture-books" count="397"/>
<shelf name="childrens" count="226"/>
<shelf name="children-s-books" count="213"/>
<shelf name="children" count="149"/>
<shelf name="children-s" count="139"/>
<shelf name="caldecott" count="110"/>
</pouplar_shelves>
How to extract the data and specifically from popular shelves as the data I require is in shelf name?
Edit1:
import os
from xml.etree import ElementTree as ET

path = "books_xml"
for filename in os.listdir(path):
    if not filename.endswith('.xml'):
        continue
    fullname = os.path.join(path, filename)
    tree = ET.parse(fullname)
    print(tree)
    root = tree.getroot()
    for child in root:
        print(child.books_count, child.text)
This is what I was trying to do; I have to run through multiple XML files in a directory. It throws this error:
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'books_count'
Edit2:
import os
from xml.etree import ElementTree as ET
mytree = ET.parse('sample_book.xml')
myroot = mytree.getroot()
print(myroot)
name = myroot.find('original_title').text
print(name)
Giving the following error
AttributeError: 'NoneType' object has no attribute 'text'

First, the tutorial here has examples of all the basic extraction operations:
https://docs.python.org/3/library/xml.etree.elementtree.html#tutorial
Second, your XML needs to be well-formed: for example, the closing tag "pouplar_shelves" is a typo.
Third, here is a small example:
import xml.etree.ElementTree as ET
xml = '''
<popular_shelves>
<shelf name="to-read" count="3504"/>
<shelf name="picture-books" count="397"/>
<shelf name="childrens" count="226"/>
<shelf name="children-s-books" count="213"/>
<shelf name="children" count="149"/>
<shelf name="children-s" count="139"/>
<shelf name="caldecott" count="110"/>
</popular_shelves>
'''
xml_root = ET.fromstring(xml)
for i in xml_root.iter():
    print(i.tag, i.attrib)
As result:
popular_shelves {}
shelf {'name': 'to-read', 'count': '3504'}
shelf {'name': 'picture-books', 'count': '397'}
shelf {'name': 'childrens', 'count': '226'}
shelf {'name': 'children-s-books', 'count': '213'}
shelf {'name': 'children', 'count': '149'}
shelf {'name': 'children-s', 'count': '139'}
shelf {'name': 'caldecott', 'count': '110'}
Edit:
It is worth reading the Python docs linked above more closely, and getting familiar with the source code of ET:
https://github.com/python/cpython/blob/3.8/Lib/xml/etree/ElementTree.py
This is the model of an element:
<tag attrib>text<child/>...</tag>tail
Using shelf as an example:
for i in xml_root.iter('shelf'):
    print(i.attrib)
This will return:
{'name': 'to-read', 'count': '3504'}
{'name': 'picture-books', 'count': '397'}
{'name': 'childrens', 'count': '226'}
{'name': 'children-s-books', 'count': '213'}
{'name': 'children', 'count': '149'}
{'name': 'children-s', 'count': '139'}
{'name': 'caldecott', 'count': '110'}
This is because your XML tag is shelf and its attributes are name and count. In terms of the model above, shelf is the tag and name and count make up attrib.
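Putting this together for the JSON goal in the question, here is a small sketch. The wrapper elements book and work in SAMPLE are assumptions about how the real Goodreads file is nested (the question only shows a fragment). If original_title is nested like this, it would also explain the 'NoneType' error in Edit2: find('original_title') only looks at direct children of the root, while './/original_title' searches all descendants.

```python
import json
import xml.etree.ElementTree as ET

# Assumed nesting; only the inner elements appear in the question
SAMPLE = """<book>
  <work>
    <original_title>There Was an Old Lady Who Swallowed a Fly</original_title>
  </work>
  <popular_shelves>
    <shelf name="to-read" count="3504"/>
    <shelf name="picture-books" count="397"/>
    <shelf name="childrens" count="226"/>
  </popular_shelves>
</book>"""

root = ET.fromstring(SAMPLE)

# Direct-child lookup misses the nested element and returns None
direct = root.find('original_title')

# './/' searches the whole subtree below root
title = root.findtext('.//original_title')

# Shelf data lives in attributes, so read it with get()
shelves = [
    {'name': shelf.get('name'), 'count': int(shelf.get('count'))}
    for shelf in root.iter('shelf')
]

record = {'original_title': title, 'popular_shelves': shelves}
print(json.dumps(record, indent=2))
```

To write a file instead of printing, json.dump(record, f) on an open file handle does the same job.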

Not getting XML output as expected

I have Python3 and am following this XML tutorial, https://docs.python.org/3.7/library/xml.etree.elementtree.html
I wish to output a listing of all DailyIndexRatio
DailyIndexRatio {'CUSIP': '912810FD5','IssueDate': '1998-04-15',
'Date':'2019-03-01','RefCPI':'251.23300','IndexRatio':'1.55331' }
....
Instead my code outputs
DailyIndexRatio {}
....
How to fix?
Here is the code
import xml.etree.ElementTree as ET
tree = ET.parse('CPI_20190213.xml')
root = tree.getroot()
print(root.tag)
print(root.attrib)
for child in root:
    print(child.tag, child.attrib)
And I downloaded the xml file from https://treasurydirect.gov/xml/CPI_20190213.xml

import xml.etree.ElementTree as ET

tree = ET.parse('CPI_20190213.xml')  # Load the XML
root = tree.getroot()  # Get XML root element
e = root.findall('.//DailyIndexRatio')  # use xpath to find relevant elements
# for each element
for i in e:
    # create a dictionary object.
    d = {}
    # for each child of element
    for child in i:
        # add the tag name and text value to the dictionary
        d[child.tag] = child.text
    # print the DailyIndexRatio tag name and dictionary
    print(i.tag, d)
Outputs:
DailyIndexRatio {'CUSIP': '912810FD5', 'IssueDate': '1998-04-15', 'Date': '2019-03-01', 'RefCPI': '251.23300', 'IndexRatio': '1.55331'}
DailyIndexRatio {'CUSIP': '912810FD5', 'IssueDate': '1998-04-15', 'Date': '2019-03-02', 'RefCPI': '251.24845', 'IndexRatio': '1.55341'}
DailyIndexRatio {'CUSIP': '912810FD5', 'IssueDate': '1998-04-15', 'Date': '2019-03-03', 'RefCPI': '251.26390', 'IndexRatio': '1.55351'}
DailyIndexRatio {'CUSIP': '912810FD5', 'IssueDate': '1998-04-15', 'Date': '2019-03-04', 'RefCPI': '251.27935', 'IndexRatio': '1.55360'}
...

You're printing the attributes, but that element does not have any attributes.
This is an element with attributes:
<element name="Bob" age="40" sex="male" />
But the element you're trying to print doesn't have those. It has child elements:
<element>
  <name>Bob</name>
  <age>40</age>
  <sex>male</sex>
</element>
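A quick way to see the difference for yourself, using the two Bob elements above as inline strings (the variable names are just for illustration):

```python
import xml.etree.ElementTree as ET

# Data stored as attributes on the element itself
with_attributes = ET.fromstring('<element name="Bob" age="40" sex="male" />')

# The same data stored as child elements
with_children = ET.fromstring(
    '<element><name>Bob</name><age>40</age><sex>male</sex></element>'
)

print(with_attributes.attrib)  # the attributes as a dict
print(with_children.attrib)    # empty: this element has no attributes

# For the child-element form, build the dict from tag/text pairs instead
children_as_dict = {child.tag: child.text for child in with_children}
print(children_as_dict)
```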

XML to CSV using xml.etree.ElementTree.iterparse functionality

Folks, I am new (brand new) to Python, so after taking a course I decided to create a script to convert an XML file to CSV. The file in question is 2GB in size, so after searching here and on Google I think I need to use the xml.etree.ElementTree.iterparse functionality. For reference, the XML file I am looking to convert looks like this:
<Document>
  <type></type>
  <internal_id></internal_id>
  <name></name>
  <number></number>
  <cadname></cadname>
  <version></version>
  <iteration></iteration>
  <isLatest></isLatest>
  <modifiedBy>
    <username></username>
    <email/>
  </modifiedBy>
  <content>
    <name></name>
    <id></id>
    <uploaded></uploaded>
    <refSize></refSize>
    <storage>
      <vault></vault>
      <folder></folder>
      <filename></filename>
      <location></location>
      <actualLocation></actualLocation>
    </storage>
    <replicatedTo></replicatedTo>
    <copies></copies>
    <status></status>
  </content>
</Document>
I am using the value of isLatest to determine whether I need to add the items to the CSV file. If the value is "true" I want the data to move to the CSV file. Here is the code that works to a point:
import xml.etree.ElementTree as ET
import csv

parser = ET.iterparse("windchill.xml")
# open a file for writing
csvfile = open('windchill.txt', 'w', encoding="utf-8")
# create the csv writer object
csvwriter = csv.writer(csvfile)
count = 0
for event, document in parser:
    if document.tag == 'Document':
        if document.find('isLatest').text == 'true':
            row = []
            name = document.find('content').find('name').text
            row.append(name)
            filename = document.find('content').find('storage').find('filename').text
            row.append(filename)
            folder = document.find('content').find('storage').find('actualLocation').text
            row.append(folder)
            csvwriter.writerow(row)
        document.clear()
csvfile.close()
If I run the code, I get this error:
Traceback (most recent call last):
  File "C:/Users/mike/PycharmProjects/windchill/xml2csv-stream.py", line 17, in <module>
    if document.find('isLatest').text == 'true':
AttributeError: 'NoneType' object has no attribute 'text'
A file is created that has 91,000 entries that look like this:
plate.prt,000000000518e8,/vault/Vlt7
adhesive.prt,0000000005024b,/vault/Vlt7
brd_pad.prt,00000000057862,/vault/Vlt7
support_pad.prt,0000000005024c,/vault/Vlt7
ground.prt,0000000005089b,/vault/Vlt7
There seem to be two issues with the output.
Some items seem to be duplicated, although the source file has no duplications. The name could be duplicated in the source file, but there can only be one name value that is marked isLatest = true.
I don't think the file completed. I looked at the last entry of my TXT (CSV) file and it does not match the last line of my source file. I was assuming the iterator was serial in nature.
So, any idea what the error is telling me, and any idea why I may be seeing duplicates? Originally I thought the error may have been related to me not ending gracefully. I am confident the XML is formed properly throughout, but maybe that is a bad assumption.
******UPDATES******
Here is a sample of the elements.
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>33709881</internal_id>
<name>bga_13x11p137_0_4_0_8.prt</name>
<number>BGA_13X11P137_0_4_0_8.PRT</number>
<cadname>bga_13x11p137_0_4_0_8.prt</cadname>
<version>A</version>
<iteration>1</iteration>
<isLatest>false</isLatest>
<modifiedBy>
<username>ets027 (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>bga_13x11p137_0_4_0_8.prt</name>
<id>5341368</id>
<uploaded>Jan 13, 2006 09:14:41</uploaded>
<refSize>287764</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>000000000505a6</filename>
<location>[wt.fv.FvItem:33709835]::master::master_vault::master_vault7::000000000505a6</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>34570129</internal_id>
<name>d61-2446-02_nest_plate.prt</name>
<number>D61-2446-02_NEST_PLATE.PRT</number>
<cadname>d61-2446-02_nest_plate.prt</cadname>
<version>-</version>
<iteration>1</iteration>
<isLatest>true</isLatest>
<modifiedBy>
<username>esb044c (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>d61-2446-02_nest_plate.prt</name>
<id>5344204</id>
<uploaded>Jan 30, 2006 09:09:24</uploaded>
<refSize>109278</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>000000000518e8</filename>
<location>[wt.fv.FvItem:34566594]::master::master_vault::master_vault7::000000000518e8</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>33512036</internal_id>
<name>d68-2568-07_press_head_adhesive.prt</name>
<number>D68-2568-07_PRESS_HEAD_ADHESIVE.PRT</number>
<cadname>d68-2568-07_press_head_adhesive.prt</cadname>
<version>-</version>
<iteration>2</iteration>
<isLatest>true</isLatest>
<modifiedBy>
<username>e3789c (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>d68-2568-07_press_head_adhesive.prt</name>
<id>5340927</id>
<uploaded>Jan 10, 2006 15:42:31</uploaded>
<refSize>76314</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>0000000005024b</filename>
<location>[wt.fv.FvItem:33512072]::master::master_vault::master_vault7::0000000005024b</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>34715717</internal_id>
<name>dbk_flip_sleeve.prt</name>
<number>DBK_FLIP_SLEEVE.PRT</number>
<cadname>dbk_flip_sleeve.prt</cadname>
<version>-</version>
<iteration>1</iteration>
<isLatest>false</isLatest>
<modifiedBy>
<username>EKA014 (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>dbk_flip_sleeve.prt</name>
<id>5344969</id>
<uploaded>Feb 01, 2006 12:54:43</uploaded>
<refSize>847210</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>00000000051b54</filename>
<location>[wt.fv.FvItem:34714395]::master::master_vault::master_vault7::00000000051b54</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
Here is my updated code:
import xml.etree.ElementTree as ET
import csv

parser = ET.iterparse("windchill.xml", events=('start', 'end'))
csvfile = open('windchill.txt', 'w', encoding="utf-8")
csvwriter = csv.writer(csvfile)
for event, document in parser:
    if event == 'end' and document.tag == 'Document':
        if document.find('type').text == 'wt.epm.EPMDocument' and document.find('isLatest').text == 'true':
            row = []
            version = document.find('version').text
            row.append(version)
            name = document.find('content').find('name').text
            row.append(name)
            filename = document.find('content').find('storage').find('filename').text
            row.append(filename)
            # folder = document.find('content').find('storage').find('actualLocation').text
            folder = document.find('content').find('storage').find('folder').text
            row.append(folder)
            csvwriter.writerow(row)
csvfile.close()
I added a check for type: documents of type wt.epm.EPMDocument have the record I want. I then want to pull the data out of the storage element, specifically name, folder, and filename. I was originally using actualLocation instead of folder, but changed it hoping the shorter value would help with my memory error.

Concerning your first issue: iterparse can report each and every XML element in a document twice, once when that element starts and again when it closes, if both 'start' and 'end' events are requested (by default only 'end' events are reported). This may explain the duplication that you find. Not only must you filter for the element(s) that you want, you must also filter for the appropriate event. You might look at this answer, https://stackoverflow.com/a/46167799/131187, to see how to deal with this.
Concerning the second: When document.find('isLatest') fails to find what you've requested it returns None, rather than an object representing an xml element. None has no properties, including text, therefore, your program croaks at that point, and you receive an incomplete csv file.
Edit in answer to comment: This code parses the xml but does not write the csv. csv records would be written in the save_csv_record function, or its equivalent. It appears only once in the code so should be easy to replace.
Called in the way it is in this code, iterparse returns only 'end' events and their corresponding XML elements. Therefore, the code watches for the 'end' of a 'Document'. When it sees one, it asks whether the document contains an 'isLatest' of 'true'. If it does, it writes it out; if not, it ignores it. Either way it then empties document_content. If the code has not seen the 'end' of a document, it simply saves the content of the tag and keeps reading until it does.
from xml.etree.ElementTree import iterparse

def save_csv_record(record):
    print(record)
    return

document_content = {}
for ev, el in iterparse('windchill.xml'):
    if el.tag == 'Document':
        # .get avoids a KeyError when a Document has no isLatest child
        if document_content.get('isLatest') == 'true':
            save_csv_record(document_content)
        document_content = {}
    else:
        document_content[el.tag] = el.text.strip() if el.text else None
Output:
{'folder': 'master_vault7', 'storage': '', 'refSize': '109278', 'cadname': 'd61-2446-02_nest_plate.prt', 'filename': '000000000518e8', 'replicatedTo': '', 'status': 'Content File Missing', 'number': 'D61-2446-02_NEST_PLATE.PRT', 'location': '[wt.fv.FvItem:34566594]::master::master_vault::master_vault7::000000000518e8', 'vault': 'master_vault', 'uploaded': 'Jan 30, 2006 09:09:24', 'id': '5344204', 'actualLocation': '/vault/Windchill_Vaults/WcVlt7', 'name': 'd61-2446-02_nest_plate.prt', 'modifiedBy': '', 'email': None, 'content': '', 'internal_id': '34570129', 'iteration': '1', 'username': 'esb044c (deleted)', 'type': 'wt.epm.EPMDocument', 'copies': '', 'isLatest': 'true', 'version': '-'}
{'folder': 'master_vault7', 'storage': '', 'refSize': '76314', 'cadname': 'd68-2568-07_press_head_adhesive.prt', 'filename': '0000000005024b', 'replicatedTo': '', 'status': 'Content File Missing', 'number': 'D68-2568-07_PRESS_HEAD_ADHESIVE.PRT', 'location': '[wt.fv.FvItem:33512072]::master::master_vault::master_vault7::0000000005024b', 'vault': 'master_vault', 'uploaded': 'Jan 10, 2006 15:42:31', 'id': '5340927', 'actualLocation': '/vault/Windchill_Vaults/WcVlt7', 'name': 'd68-2568-07_press_head_adhesive.prt', 'modifiedBy': '', 'email': None, 'content': '', 'internal_id': '33512036', 'iteration': '2', 'username': 'e3789c (deleted)', 'type': 'wt.epm.EPMDocument', 'copies': '', 'isLatest': 'true', 'version': '-'}
EDITED FOR LATEST CODE:
Here is the new code that I am using, which still runs out of memory:
from xml.etree.ElementTree import iterparse

def save_csv_record(record):
    print(record)
    return

document_content = {}
for ev, el in iterparse('windchill.xml'):
    if el.tag == 'Document':
        # parentheses let the condition span two lines
        if (document_content['type'] == 'wt.epm.EPMDocument' and
                document_content['isLatest'] == 'true'):
            save_csv_record(document_content)
        document_content = {}
    else:
        document_content[el.tag] = el.text.strip() if el.text else None
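On the memory problem: iterparse still builds the whole tree in the background unless finished elements are discarded, so neither version above frees anything. Below is a sketch of one common pattern, which grabs the root from the first 'start' event and clears it after each completed Document. It assumes the Document elements are direct children of a single root element (called Documents here, which is a guess; the question never shows the root), and it reads from an in-memory sample instead of windchill.xml. The function name latest_rows is made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Small stand-in for windchill.xml; the <Documents> root is an assumption
SAMPLE = b"""<Documents>
  <Document>
    <type>wt.epm.EPMDocument</type>
    <version>A</version>
    <isLatest>false</isLatest>
    <content>
      <name>bga_13x11p137_0_4_0_8.prt</name>
      <storage><folder>master_vault7</folder><filename>000000000505a6</filename></storage>
    </content>
  </Document>
  <Document>
    <type>wt.epm.EPMDocument</type>
    <version>-</version>
    <isLatest>true</isLatest>
    <content>
      <name>d61-2446-02_nest_plate.prt</name>
      <storage><folder>master_vault7</folder><filename>000000000518e8</filename></storage>
    </content>
  </Document>
</Documents>"""


def latest_rows(source):
    """Yield (version, name, filename, folder) for each latest EPM document."""
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != 'end' or elem.tag != 'Document':
            continue
        # findtext returns None for a missing child instead of raising
        if (elem.findtext('type') == 'wt.epm.EPMDocument'
                and elem.findtext('isLatest') == 'true'):
            yield (
                elem.findtext('version'),
                elem.findtext('content/name'),
                elem.findtext('content/storage/filename'),
                elem.findtext('content/storage/folder'),
            )
        # Drop the finished Document so the in-memory tree does not keep growing
        root.clear()


rows = list(latest_rows(io.BytesIO(SAMPLE)))

out = io.StringIO()
csv.writer(out).writerows(rows)
print(out.getvalue())
```

With the real file, passing the filename "windchill.xml" to latest_rows and writing to an open CSV file (opened with newline="") should behave the same way, but that is untested against the 2GB input.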

How to parse and display the content of an lxml object using lxml

I am having difficulty parsing the XML _file below using lxml:
>>> _file = "qv.xml"
file content:
<document reference="suspicious-document00500.txt">
<feature name="plagiarism" type="artificial" obfuscation="none" this_offset="128" this_length="2503" source_reference="source-document00500.txt" source_offset="138339" source_length="2503"/>
<feature name="plagiarism" type="artificial" obfuscation="none" this_offset="8593" this_length="1582" source_reference="source-document00500.txt" source_offset="49473" source_length="1582"/>
</document>
Here is my attempt:
>>> from lxml.etree import XMLParser, parse
>>> parsefile = parse(_file)
>>> print(parsefile)
Output: <lxml.etree._ElementTree object at 0x000000000642E788>
The output is the location of the lxml object, while I am after the actual file content, i.e.
Desired output={'document reference'="suspicious-document00500.txt", 'this_offset': '128', 'obfuscation': 'none', 'source_length': '2503', 'name': 'plagiarism', 'this_length': '2503', 'source_reference': 'source-document00500.txt', 'source_offset': '138339', 'type': 'artificial'}
Any ideas on how to get the desired output? Thanks.

Here's one way of getting the desired outputs:
from lxml import etree

def main():
    doc = etree.parse('qv.xml')
    root = doc.getroot()
    # print(...) with a single argument works in both Python 2 and 3
    print(root.attrib)
    for item in root:
        print(item.attrib)

if __name__ == "__main__":
    main()
Output:
{'reference': 'suspicious-document00500.txt'}
{'this_offset': '128', 'obfuscation': 'none', 'source_length': '2503', 'name': 'plagiarism', 'this_length': '2503', 'source_reference': 'source-document00500.txt', 'source_offset': '138339', 'type': 'artificial'}
{'this_offset': '8593', 'obfuscation': 'none', 'source_length': '1582', 'name': 'plagiarism', 'this_length': '1582', 'source_reference': 'source-document00500.txt', 'source_offset': '49473', 'type': 'artificial'}
It works fine with the contents you gave.
You might want to read up on how etree represents XML objects.
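If you want something closer to the single dictionary in the question, with the document's reference merged into each feature's attributes, here is a rough sketch. It uses the standard library's xml.etree.ElementTree so it runs without extra installs; the fromstring, get, iter and attrib calls used here are also available in lxml.etree. The inline SAMPLE copies the file content from the question, and the key name 'reference' is a choice made for this example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<document reference="suspicious-document00500.txt">
<feature name="plagiarism" type="artificial" obfuscation="none" this_offset="128" this_length="2503" source_reference="source-document00500.txt" source_offset="138339" source_length="2503"/>
<feature name="plagiarism" type="artificial" obfuscation="none" this_offset="8593" this_length="1582" source_reference="source-document00500.txt" source_offset="49473" source_length="1582"/>
</document>"""

root = ET.fromstring(SAMPLE)

# One dict per feature: its own attributes plus the parent's reference
records = [
    dict(feature.attrib, reference=root.get('reference'))
    for feature in root.iter('feature')
]

for record in records:
    print(record)
```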
