Extraction of XML Data using Python

I am trying to create a JSON file containing data extracted from a Goodreads XML file, but I am unable to do so. I have never worked with XML, and although I have gone through tutorials, I have not managed to extract any data.
My XML file looks like this:
<books_count type="integer">11</books_count>
<original_publication_year type="integer">1997</original_publication_year>
<original_publication_month type="integer" nil="true"/>
<original_publication_day type="integer" nil="true"/>
<original_title>There Was an Old Lady Who Swallowed a Fly</original_title>
<popular_shelves>
<shelf name="to-read" count="3504"/>
<shelf name="picture-books" count="397"/>
<shelf name="childrens" count="226"/>
<shelf name="children-s-books" count="213"/>
<shelf name="children" count="149"/>
<shelf name="children-s" count="139"/>
<shelf name="caldecott" count="110"/>
</pouplar_shelves>
How do I extract the data, specifically from popular_shelves, since the data I need is in the name attribute of each shelf?
Edit1:
import os
from xml.etree import ElementTree as ET
path = "books_xml"
for filename in os.listdir(path):
    if not filename.endswith('.xml'): continue
    fullname = os.path.join(path, filename)
    tree = ET.parse(fullname)
    print(tree)
    root = tree.getroot()
    for child in root:
        print(child.books_count, child.text)
This is what I was trying to do; I have to run it over multiple XML files in a directory. It throws this error:
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'books_count'
Edit2:
import os
from xml.etree import ElementTree as ET
mytree = ET.parse('sample_book.xml')
myroot = mytree.getroot()
print(myroot)
name = myroot.find('original_title').text
print(name)
This gives the following error:
AttributeError: 'NoneType' object has no attribute 'text'

First, the ElementTree tutorial covers all the examples you need for basic extraction:
https://docs.python.org/3/library/xml.etree.elementtree.html#tutorial
Second, your XML needs to be well-formed: for example, the closing tag "pouplar_shelves" is a typo.
Third, here is a small example:
import xml.etree.ElementTree as ET
xml = '''
<popular_shelves>
<shelf name="to-read" count="3504"/>
<shelf name="picture-books" count="397"/>
<shelf name="childrens" count="226"/>
<shelf name="children-s-books" count="213"/>
<shelf name="children" count="149"/>
<shelf name="children-s" count="139"/>
<shelf name="caldecott" count="110"/>
</popular_shelves>
'''
xml_root = ET.fromstring(xml)
for i in xml_root.iter():
    print(i.tag, i.attrib)
As result:
popular_shelves {}
shelf {'name': 'to-read', 'count': '3504'}
shelf {'name': 'picture-books', 'count': '397'}
shelf {'name': 'childrens', 'count': '226'}
shelf {'name': 'children-s-books', 'count': '213'}
shelf {'name': 'children', 'count': '149'}
shelf {'name': 'children-s', 'count': '139'}
shelf {'name': 'caldecott', 'count': '110'}
Edit:
It is worth studying the Python documentation linked above more closely, and getting familiar with the ElementTree source code:
https://github.com/python/cpython/blob/3.8/Lib/xml/etree/ElementTree.py
Every element follows this model:
<tag attrib>text<child/>...</tag>tail
Using shelf as an example:
for i in xml_root.iter('shelf'):
    print(i.attrib)
This prints:
{'name': 'to-read', 'count': '3504'}
{'name': 'picture-books', 'count': '397'}
{'name': 'childrens', 'count': '226'}
{'name': 'children-s-books', 'count': '213'}
{'name': 'children', 'count': '149'}
{'name': 'children-s', 'count': '139'}
{'name': 'caldecott', 'count': '110'}
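This is because your XML tag is shelf and its attributes are name and count. In terms of the model above, <shelf name="to-read" count="3504"/> is a tag (shelf) with attributes (name, count) and no text, children or tail.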
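To connect this back to the original goal (a JSON file with the shelf data), here is a minimal sketch. The wrapper elements in SAMPLE (GoodreadsResponse, book, work) and the helper name extract_book are assumptions for illustration, since the question only shows a fragment of the file. If the real file nests elements like this, it would also explain why find('original_title') returned None: find only looks at direct children unless the path starts with './/'.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample: the wrapper elements are assumed, not taken
# from the question, which only shows a fragment of the real file.
SAMPLE = """
<GoodreadsResponse>
  <book>
    <work>
      <original_title>There Was an Old Lady Who Swallowed a Fly</original_title>
    </work>
    <popular_shelves>
      <shelf name="to-read" count="3504"/>
      <shelf name="picture-books" count="397"/>
    </popular_shelves>
  </book>
</GoodreadsResponse>
"""

def extract_book(root):
    """Collect the title and the shelf counts from a parsed tree."""
    return {
        # './/' searches all descendants, not only direct children.
        "title": root.findtext(".//original_title"),
        # iter('shelf') visits every <shelf> element at any depth.
        "shelves": {s.get("name"): int(s.get("count")) for s in root.iter("shelf")},
    }

book = extract_book(ET.fromstring(SAMPLE))
print(json.dumps(book, indent=2))
```

For a directory of files, the same function could be called on ET.parse(fullname).getroot() inside the os.listdir loop, writing each result with json.dump.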

Related

Parse xml file to a python list

I have a xml file:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<Document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03">
<CstmrCdtTrfInitn>
<GrpHdr>
<MsgId>637987745078994894</MsgId>
<CreDtTm>2022-09-14T05:48:27</CreDtTm>
<NbOfTxs>205</NbOfTxs>
<CtrlSum>154761.02</CtrlSum>
<InitgPty>
<Nm> Company</Nm>
</InitgPty>
</GrpHdr>
<PmtInf>
<PmtInfId>20220914054827-154016</PmtInfId>
<PmtMtd>TRF</PmtMtd>
<BtchBookg>true</BtchBookg>
<NbOfTxs>205</NbOfTxs>
<CtrlSum>154761.02</CtrlSum>
<PmtTpInf>
<SvcLvl>
<Cd>SEPA</Cd>
</SvcLvl>
<CtgyPurp>
<Cd>SALA</Cd>
</CtgyPurp>
</PmtTpInf>
<CdtTrfTxInf> <----------------------------------
<Amt>
<InstdAmt Ccy="EUR">1536.96</InstdAmt>
</Amt>
<Cdtr>
<Nm>Achternaam, Voornaam </Nm>
</Cdtr>
<CdtrAcct>
<Id>
<IBAN>NL80RABO0134343443</IBAN>
</Id>
</CdtrAcct>
</CdtTrfTxInf> <------------------------------------
<CdtTrfTxInf> <----------------------------------
<Amt>
<InstdAmt Ccy="EUR">1676.96</InstdAmt>
</Amt>
<Cdtr>
<Nm>Achternaam, Voornaam </Nm>
</Cdtr>
<CdtrAcct>
<Id>
<IBAN>NL80RABO013433222243</IBAN>
</Id>
</CdtrAcct>
</CdtTrfTxInf> <------------------------------------
</CstmrCdtTrfInitn>
</Document>
I use ElementTree.
I want a Python list of tuples with the information inside each CdtTrfTxInf tag (everything between the arrows in the example XML file). So in this example I want a list with 2 tuples.
How can I do that?
I can iterate over the tree, but that is all.
My code:
import xml.etree.ElementTree as ET
tree = ET.parse(xml_file)
root = tree.getroot()
for elem in tree.iter():
    print(elem.tag, elem.text)  # prints every tag in the whole file

I prefer to use xmltodict.
First of all, your input data as given is missing a closing </PmtInf> tag towards the end, just before your closing </CstmrCdtTrfInitn> tag. After fixing that, I saved your xml data into a file and did the following:
import xmltodict
with open("input_data.xml", "r") as f:
    xml_data = f.read()
xml_dict = xmltodict.parse(xml_data)
You can then access the xml data using dictionary accessors, for example:
xml_dict
>>>{'Document': {'@xmlns:xsi': 'http://www.w3.org/20...a-instance', '@xmlns': 'urn:iso:std:iso:2002...001.001.03', 'CstmrCdtTrfInitn': {...}}}
xml_dict["Document"]
>>>{'@xmlns:xsi': 'http://www.w3.org/20...a-instance', '@xmlns': 'urn:iso:std:iso:2002...001.001.03', 'CstmrCdtTrfInitn': {'GrpHdr': {...}, 'PmtInf': {...}}}
xml_dict["Document"]["CstmrCdtTrfInitn"].keys()
>>>dict_keys(['GrpHdr', 'PmtInf'])
xml_dict["Document"]["CstmrCdtTrfInitn"]["PmtInf"]
{'PmtInfId': '20220914054827-154016', 'PmtMtd': 'TRF', 'BtchBookg': 'true', 'NbOfTxs': '205', 'CtrlSum': '154761.02', 'PmtTpInf': {'SvcLvl': {...}, 'CtgyPurp': {...}}, 'CdtTrfTxInf': [{...}, {...}]}
xml_dict["Document"]["CstmrCdtTrfInitn"]["PmtInf"].keys()
dict_keys(['PmtInfId', 'PmtMtd', 'BtchBookg', 'NbOfTxs', 'CtrlSum', 'PmtTpInf', 'CdtTrfTxInf'])
Then you can loop over your CdtTrfTxInf with:
for item in xml_dict["Document"]["CstmrCdtTrfInitn"]["PmtInf"]["CdtTrfTxInf"]:
    print(item)
giving the output:
{'Amt': {'InstdAmt': {'@Ccy': 'EUR', '#text': '1536.96'}}, 'Cdtr': {'Nm': 'Achternaam, Voornaam'}, 'CdtrAcct': {'Id': {'IBAN': 'NL80RABO0134343443'}}}
{'Amt': {'InstdAmt': {'@Ccy': 'EUR', '#text': '1676.96'}}, 'Cdtr': {'Nm': 'Achternaam, Voornaam'}, 'CdtrAcct': {'Id': {'IBAN': 'NL80RABO013433222243'}}}
which you can process as you want.

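This is just a quick attempt; give it a try:
import xml.etree.ElementTree as ET
tree = ET.parse("fr.xml")
root = tree.getroot()
test = False
for elem in tree.iter():
    # Tags include the document namespace, e.g. "{urn:...}CdtTrfTxInf",
    # so compare only the local name after the closing brace.
    name = elem.tag.split('}')[-1]
    if name == "CdtTrfTxInf":
        test = True
        continue
    # elem.text can be None, so check it before calling strip()
    if test and elem.text and elem.text.strip():
        print(name, elem.text)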
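With the result as a list of tuples:
import xml.etree.ElementTree as ET
tree = ET.parse("fr.xml")
root = tree.getroot()
test = False
tag = []
textval = []
for elem in tree.iter():
    # Tags include the document namespace, e.g. "{urn:...}CdtTrfTxInf",
    # so compare only the local name after the closing brace.
    name = elem.tag.split('}')[-1]
    if name == "CdtTrfTxInf":
        test = True
        continue
    # elem.text can be None, so check it before calling strip()
    if test and elem.text and elem.text.strip():
        tag.append(name)
        textval.append(elem.text)
data = list(zip(tag, textval))
print(data)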
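Since the question asks for one tuple per CdtTrfTxInf element, a namespace-aware variant with plain ElementTree may be closer to what is wanted. The sketch below is only one way to do it: the prefix "p", the helper name transfers and the choice of tuple fields are assumptions, and SAMPLE is a shortened copy of the question's file with the missing closing PmtInf tag added.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns attribute of the sample <Document>;
# the prefix "p" is an arbitrary label chosen for this sketch.
NS = {"p": "urn:iso:std:iso:20022:tech:xsd:pain.001.001.03"}

SAMPLE = """<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.001.001.03">
  <CstmrCdtTrfInitn>
    <PmtInf>
      <CdtTrfTxInf>
        <Amt><InstdAmt Ccy="EUR">1536.96</InstdAmt></Amt>
        <Cdtr><Nm>Achternaam, Voornaam </Nm></Cdtr>
        <CdtrAcct><Id><IBAN>NL80RABO0134343443</IBAN></Id></CdtrAcct>
      </CdtTrfTxInf>
      <CdtTrfTxInf>
        <Amt><InstdAmt Ccy="EUR">1676.96</InstdAmt></Amt>
        <Cdtr><Nm>Achternaam, Voornaam </Nm></Cdtr>
        <CdtrAcct><Id><IBAN>NL80RABO013433222243</IBAN></Id></CdtrAcct>
      </CdtTrfTxInf>
    </PmtInf>
  </CstmrCdtTrfInitn>
</Document>"""

def transfers(root):
    """Return one (amount, currency, creditor, IBAN) tuple per CdtTrfTxInf."""
    rows = []
    for tx in root.iterfind(".//p:CdtTrfTxInf", NS):
        amt = tx.find("p:Amt/p:InstdAmt", NS)
        rows.append((
            amt.text,
            amt.get("Ccy"),
            tx.findtext("p:Cdtr/p:Nm", default="", namespaces=NS).strip(),
            tx.findtext("p:CdtrAcct/p:Id/p:IBAN", default="", namespaces=NS),
        ))
    return rows

rows = transfers(ET.fromstring(SAMPLE))
print(rows)
```

Passing the namespace mapping to find, findtext and iterfind avoids comparing against the long "{urn:...}" tag strings by hand.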

Local variables in a dictionary function in Python

I am trying to handle the requirement below. As a beginner in Python programming, I am stuck on how to declare the variables. I have a huge XML file that I need to open and build three dictionaries from.
Here are my programming steps:
Open the file using the built-in open function.
Read each line from the file object created above.
Between certain tags, search for a pattern and fill the data into the dictionary.
The XML file looks like this:
<tag_1>
name=(pattern1)
age=(pattern1.1)
company=(pattern1.2)
<\tag_1>
<tag_2>
name=(pattern2)
age=(pattern2.1)
company=(pattern2.2)
<\tag_2>
<tag_3>
name=(pattern3)
age=(pattern3.1)
company=(pattern3.2)
<\tag_3>
and so on, with the tags above repeated.
From each tag above, I need to create 3 dictionaries like:
dict1[pattern1]['age']=pattern1.1
dict1[pattern1]['company']=pattern1.2
Similarly for dict2 and dict3.
I created a function that takes the line and a dictionary as arguments:
for line in file.readlines():
    dict_instance(line, dictionary_1 )
    dict_instance(line, dictionary_2 )
    dict_instance(line, dictionary_3 )
def dict_instance(line, object):
    #ON TAG START (i have this condition set in my code)
    if re.search(r'name=(.*)', line):
        name=re.search(r'name=(.*)', line).group(1)
    if re.search(r'age=(.*)', line):
        age=re.search(r'age=(.*)', line).group(1)
    if re.search(r'company=(.*)', line):
        company=re.search(r'company=(.*)', line).group(1)
    #ON TAG END (i have this condition set in my code)
    object[name]={}
    if not age:
        object[name]['age']=age
    if not company:
        object[name]['company']=company
Each tag's data should go into its own dictionary: tag1 into dict1, tag2 into dict2 and tag3 into dict3.
My question is: how can I make the "name", "age" and "company" variables local to each dictionary? If I create global variables, they will get mixed up across all three dictionaries and produce incorrect data.
Please ignore any indentation issues in the above.

I'm not sure I understand the requirements. But here are some methods which might be helpful:
xml_content = """<tags>
<tag_1>
name=(pattern1)
age=(pattern1.1)
company=(pattern1.2)
</tag_1>
<tag_2>
name=(pattern2)
age=(pattern2.1)
company=(pattern2.2)
</tag_2>
<tag_3>
name=(pattern3)
age=(pattern3.1)
company=(pattern3.2)
</tag_3>
</tags>
"""
from xml.etree import ElementTree
document = ElementTree.fromstring(xml_content)
You can iterate over the tags and get the desired information:
for tag in document:
    print(tag.tag)
    print(tag.text)
    print(tag.text.split())
    print(dict(line.split('=') for line in tag.text.split()))
    print("---------------------")
It outputs:
tag_1
name=(pattern1)
age=(pattern1.1)
company=(pattern1.2)
['name=(pattern1)', 'age=(pattern1.1)', 'company=(pattern1.2)']
{'name': '(pattern1)', 'age': '(pattern1.1)', 'company': '(pattern1.2)'}
---------------------
tag_2
name=(pattern2)
age=(pattern2.1)
company=(pattern2.2)
['name=(pattern2)', 'age=(pattern2.1)', 'company=(pattern2.2)']
{'name': '(pattern2)', 'age': '(pattern2.1)', 'company': '(pattern2.2)'}
---------------------
tag_3
name=(pattern3)
age=(pattern3.1)
company=(pattern3.2)
['name=(pattern3)', 'age=(pattern3.1)', 'company=(pattern3.2)']
{'name': '(pattern3)', 'age': '(pattern3.1)', 'company': '(pattern3.2)'}
If you want one big list or one big dict:
def tag_to_dict(tag):
    return dict(line.split('=') for line in tag.text.split())
[tag_to_dict(tag) for tag in document]
{tag.tag:tag_to_dict(tag) for tag in document}
These return:
[{'name': '(pattern1)', 'age': '(pattern1.1)', 'company': '(pattern1.2)'},
{'name': '(pattern2)', 'age': '(pattern2.1)', 'company': '(pattern2.2)'},
{'name': '(pattern3)', 'age': '(pattern3.1)', 'company': '(pattern3.2)'}]
and
{'tag_1': {'name': '(pattern1)',
'age': '(pattern1.1)',
'company': '(pattern1.2)'},
'tag_2': {'name': '(pattern2)',
'age': '(pattern2.1)',
'company': '(pattern2.2)'},
'tag_3': {'name': '(pattern3)',
'age': '(pattern3.1)',
'company': '(pattern3.2)'}}
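If the file really uses the <\tag_1> closing form from the question, an XML parser will reject it, so a line-based approach may be needed after all. The sketch below is one possible way to keep the per-block state local: everything lives inside a single function call, so nothing is shared between blocks. The helper name parse_blocks, the two regular expressions and the shape of the result are assumptions for illustration.

```python
import re

# Hypothetical line-based input in the shape shown in the question.
TEXT = r"""<tag_1>
name=(pattern1)
age=(pattern1.1)
company=(pattern1.2)
<\tag_1>
<tag_2>
name=(pattern2)
age=(pattern2.1)
company=(pattern2.2)
<\tag_2>
"""

FIELD = re.compile(r"^(name|age|company)=(.*)$")
OPEN = re.compile(r"^<(tag_\d+)>$")

def parse_blocks(lines):
    """Return {tag: {name: {'age': ..., 'company': ...}}} for each block."""
    result = {}
    current_tag, fields = None, {}   # local to this call, never global
    for line in lines:
        line = line.strip()
        opened = OPEN.match(line)
        if opened:
            # Opening tag: start collecting a fresh set of fields.
            current_tag, fields = opened.group(1), {}
        elif line.startswith("<") and current_tag:
            # Closing tag: file the collected fields under this block only.
            name = fields.pop("name", None)
            if name is not None:
                result.setdefault(current_tag, {})[name] = fields
            current_tag, fields = None, {}
        else:
            field = FIELD.match(line)
            if field and current_tag:
                fields[field.group(1)] = field.group(2)
    return result

dicts = parse_blocks(TEXT.splitlines())
print(dicts["tag_1"])
```

Each top-level key of the result plays the role of one of the three dictionaries, so dicts["tag_1"] corresponds to dict1 in the question.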

Adjust Python function to parse XML

I need to read an XML file from an external host.
My code:
import urllib2
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file=urllib2.urlopen('http://192.168.2.57:8010/data/camera_state.xml'))
root = tree.getroot()
root.tag, root.attrib
for elem in tree.iter():
    print elem.tag, elem.attrib
I could not get the structure I need; the result of my code is below:
CameraState {}
Cameras {}
Camera {'Id': '1'}
State {}
Camera {'Id': '2'}
State {}
Camera {'Id': '3'}
State {}
Camera {'Id': '4'}
State {}
I need to adjust this Python code to get a result like the one below:
<CameraState>
<Cameras>
<Camera Id="1">
<State>NO_SIGNAL</State>
</Camera>
<Camera Id="2">
<State>OK</State>
</Camera>
</Cameras>
</CameraState>

You do have the parsed structure. It's just about the way you are accessing it.
Iterate over an element to access its child nodes (the older getchildren() method is deprecated and was removed in Python 3.9). An example of recursively printing the structure:
import xml.etree.ElementTree as ET
def print_tree(node, prefix=''):
    # node.text can be None, so fall back to an empty string
    print(prefix, node.tag, node.attrib, (node.text or '').strip())
    for child in node:
        print_tree(child, prefix + ' ')
tree = ET.ElementTree(file=<your file>)
root = tree.getroot()
print_tree(root)
It gives:
CameraState {}
Cameras {}
Camera {'Id': '1'}
State {} NO_SIGNAL
Camera {'Id': '2'}
State {} OK
However, I recommend you take a look at xmltodict:
import xmltodict
with open(<your file>) as f:
    tree = xmltodict.parse(f.read())
print(tree)
It gives you OrderedDicts:
OrderedDict([('CameraState', OrderedDict([('Cameras', OrderedDict([('Camera', [OrderedDict([('@Id', '1'), ('State', 'NO_SIGNAL')]), OrderedDict([('@Id', '2'), ('State', 'OK')])])]))]))])
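If only the camera states are needed, the standard library is enough. A minimal sketch follows, with SAMPLE copied from the desired result in the question and the variable name states chosen for illustration: it maps each Id attribute to the text of its State child, and serialises the tree back to markup in case the XML itself is what should be shown.

```python
import xml.etree.ElementTree as ET

# Sample copied from the desired result in the question.
SAMPLE = """<CameraState>
  <Cameras>
    <Camera Id="1"><State>NO_SIGNAL</State></Camera>
    <Camera Id="2"><State>OK</State></Camera>
  </Cameras>
</CameraState>"""

root = ET.fromstring(SAMPLE)

# Map each camera's Id attribute to the text of its <State> child.
states = {cam.get("Id"): cam.findtext("State") for cam in root.iter("Camera")}
print(states)

# Serialise the parsed tree back to XML text if the markup itself is wanted.
print(ET.tostring(root, encoding="unicode"))
```

The original loop printed empty State lines because it showed elem.attrib, which is empty for State; the value lives in elem.text.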

XML to CSV using xml.etree.ElementTree.iterparse functionality

Folks, I am new (brand new) to Python, so after taking a course I decided to create a script to convert an XML file to CSV. The file in question is 2GB in size, so after searching here and on Google I think I need to use the xml.etree.ElementTree.iterparse functionality. For reference, the XML file I am looking to convert looks like this:
<Document>
<type></type>
<internal_id></internal_id>
<name></name>
<number></number>
<cadname></cadname>
<version></version>
<iteration></iteration>
**<isLatest></isLatest>**
<modifiedBy>
<username></username>
<email/>
</modifiedBy>
<content>
**<name></name>**
<id></id>
<uploaded></uploaded>
<refSize></refSize>
<storage>
<vault></vault>
<folder></folder>
**<filename></filename>**
<location></location>
**<actualLocation></actualLocation>**
</storage>
<replicatedTo></replicatedTo>
<copies></copies>
<status></status>
</content>
I am using the value of isLatest to determine whether I need to add the items to the CSV file. If the value is "true" I want the data to move to the CSV file. Here is the code that works to a point:
import xml.etree.ElementTree as ET
import csv
parser = ET.iterparse("windchill.xml")
# open a file for writing
csvfile = open('windchill.txt', 'w', encoding="utf-8")
# create the csv writer object
csvwriter = csv.writer(csvfile)
count = 0
for event, document in parser:
    if document.tag == 'Document':
        if document.find('isLatest').text == 'true':
            row = []
            name = document.find('content').find('name').text
            row.append(name)
            filename = document.find('content').find('storage').find('filename').text
            row.append(filename)
            folder = document.find('content').find('storage').find('actualLocation').text
            row.append(folder)
            csvwriter.writerow(row)
        document.clear()
csvfile.close()
If I run the code, I get this error:
Traceback (most recent call last):
File "C:/Users/mike/PycharmProjects/windchill/xml2csv-stream.py", line 17, in <module>
if document.find('isLatest').text == 'true':
AttributeError: 'NoneType' object has no attribute 'text'
A file is created that has 91,000 entries that look like this:
plate.prt,000000000518e8,/vault/Vlt7
adhesive.prt,0000000005024b,/vault/Vlt7
brd_pad.prt,00000000057862,/vault/Vlt7
support_pad.prt,0000000005024c,/vault/Vlt7
ground.prt,0000000005089b,/vault/Vlt7
There seem to be two issues with the output.
Some items seem to be duplicated, although the source file has no duplicates. A name could appear more than once in the source file, but only one entry per name can be the latest.
I don't think the file completed. I looked at the last entry of my TXT (CSV) file and it does not match the last line of my source file. I was assuming the iterator was serial in nature.
So, any idea what the error is telling me, and any idea why I may be seeing duplicates? Originally I thought the error may have been related to me not ending gracefully. I am confident the XML is formed properly throughout, but maybe that is a bad assumption.
******UPDATES******
Here is a sample of the elements.
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>33709881</internal_id>
<name>bga_13x11p137_0_4_0_8.prt</name>
<number>BGA_13X11P137_0_4_0_8.PRT</number>
<cadname>bga_13x11p137_0_4_0_8.prt</cadname>
<version>A</version>
<iteration>1</iteration>
<isLatest>false</isLatest>
<modifiedBy>
<username>ets027 (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>bga_13x11p137_0_4_0_8.prt</name>
<id>5341368</id>
<uploaded>Jan 13, 2006 09:14:41</uploaded>
<refSize>287764</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>000000000505a6</filename>
<location>[wt.fv.FvItem:33709835]::master::master_vault::master_vault7::000000000505a6</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>34570129</internal_id>
<name>d61-2446-02_nest_plate.prt</name>
<number>D61-2446-02_NEST_PLATE.PRT</number>
<cadname>d61-2446-02_nest_plate.prt</cadname>
<version>-</version>
<iteration>1</iteration>
<isLatest>true</isLatest>
<modifiedBy>
<username>esb044c (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>d61-2446-02_nest_plate.prt</name>
<id>5344204</id>
<uploaded>Jan 30, 2006 09:09:24</uploaded>
<refSize>109278</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>000000000518e8</filename>
<location>[wt.fv.FvItem:34566594]::master::master_vault::master_vault7::000000000518e8</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>33512036</internal_id>
<name>d68-2568-07_press_head_adhesive.prt</name>
<number>D68-2568-07_PRESS_HEAD_ADHESIVE.PRT</number>
<cadname>d68-2568-07_press_head_adhesive.prt</cadname>
<version>-</version>
<iteration>2</iteration>
<isLatest>true</isLatest>
<modifiedBy>
<username>e3789c (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>d68-2568-07_press_head_adhesive.prt</name>
<id>5340927</id>
<uploaded>Jan 10, 2006 15:42:31</uploaded>
<refSize>76314</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>0000000005024b</filename>
<location>[wt.fv.FvItem:33512072]::master::master_vault::master_vault7::0000000005024b</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
<Document>
<type>wt.epm.EPMDocument</type>
<internal_id>34715717</internal_id>
<name>dbk_flip_sleeve.prt</name>
<number>DBK_FLIP_SLEEVE.PRT</number>
<cadname>dbk_flip_sleeve.prt</cadname>
<version>-</version>
<iteration>1</iteration>
<isLatest>false</isLatest>
<modifiedBy>
<username>EKA014 (deleted)</username>
<email/>
</modifiedBy>
<content>
<name>dbk_flip_sleeve.prt</name>
<id>5344969</id>
<uploaded>Feb 01, 2006 12:54:43</uploaded>
<refSize>847210</refSize>
<storage>
<vault>master_vault</vault>
<folder>master_vault7</folder>
<filename>00000000051b54</filename>
<location>[wt.fv.FvItem:34714395]::master::master_vault::master_vault7::00000000051b54</location>
<actualLocation>/vault/Windchill_Vaults/WcVlt7</actualLocation>
</storage>
<replicatedTo>
</replicatedTo>
<copies>
</copies>
<status>Content File Missing</status>
</content>
</Document>
Here is my updated code:
import xml.etree.ElementTree as ET
import csv
parser = ET.iterparse("windchill.xml", events=('start', 'end'))
csvfile = open('windchill.txt', 'w', encoding="utf-8")
csvwriter = csv.writer(csvfile)
for event, document in parser:
    if event=='end' and document.tag=='Document':
        if document.find('type').text == 'wt.epm.EPMDocument' and document.find('isLatest').text == 'true':
            row = []
            version = document.find('version').text
            row.append(version)
            name = document.find('content').find('name').text
            row.append(name)
            filename = document.find('content').find('storage').find('filename').text
            row.append(filename)
            # folder = document.find('content').find('storage').find('actualLocation').text
            folder = document.find('content').find('storage').find('folder').text
            row.append(folder)
            csvwriter.writerow(row)
csvfile.close()
I added a check for type. Documents of type wt.epm.EPMDocument will have the record. I then want to pull the data out of the storage element, specifically name, folder, and filename. I was originally using actualLocation instead of folder, but changed it hoping the shorter value would help with my memory error.

Concerning your first issue: iterparse 'sees' each and every xml element in a document when that element starts and, again, when it closes. This probably explains the duplication that you find. Not only must you filter for the element(s) that you want, you must filter for the appropriate event. You might look at this answer, https://stackoverflow.com/a/46167799/131187, to see how to deal with this.
Concerning the second: When document.find('isLatest') fails to find what you've requested it returns None, rather than an object representing an xml element. None has no properties, including text, therefore, your program croaks at that point, and you receive an incomplete csv file.
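A small sketch of that failure mode and one way around it; the one-element record and the variable names are hypothetical. find returns None for a missing child, while findtext accepts a default and so never raises AttributeError.

```python
import xml.etree.ElementTree as ET

# Hypothetical record without an <isLatest> child.
doc = ET.fromstring("<Document><type>wt.epm.EPMDocument</type></Document>")

missing = doc.find("isLatest")                       # no such child, so None
latest = doc.findtext("isLatest", default="false")   # falls back to the default
doc_type = doc.findtext("type")                      # text of an existing child
print(missing, latest, doc_type)
```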
Edit in answer to comment: This code parses the xml but does not write the csv. csv records would be written in the save_csv_record function, or its equivalent. It appears only once in the code so should be easy to replace.
Called the way it is in this code, iterparse returns only 'end' events and their corresponding XML elements. Therefore, the code watches for the 'end' of a 'Document'. When it sees one, it asks whether the document contains an 'isLatest' of 'true'. If it does, it writes it out; if not, it ignores it and empties document_content. If the code has not seen the 'end' of a document, it simply saves the content of the tag and keeps reading until it does.
from xml.etree.ElementTree import iterparse
def save_csv_record(record):
    print(record)
    return
document_content = {}
for ev, el in iterparse('windchill.xml'):
    if el.tag=='Document':
        # .get avoids a KeyError when a document has no isLatest element
        if document_content.get('isLatest') == 'true':
            save_csv_record(document_content)
        document_content = {}
    else:
        document_content[el.tag] = el.text.strip() if el.text else None
Output:
{'folder': 'master_vault7', 'storage': '', 'refSize': '109278', 'cadname': 'd61-2446-02_nest_plate.prt', 'filename': '000000000518e8', 'replicatedTo': '', 'status': 'Content File Missing', 'number': 'D61-2446-02_NEST_PLATE.PRT', 'location': '[wt.fv.FvItem:34566594]::master::master_vault::master_vault7::000000000518e8', 'vault': 'master_vault', 'uploaded': 'Jan 30, 2006 09:09:24', 'id': '5344204', 'actualLocation': '/vault/Windchill_Vaults/WcVlt7', 'name': 'd61-2446-02_nest_plate.prt', 'modifiedBy': '', 'email': None, 'content': '', 'internal_id': '34570129', 'iteration': '1', 'username': 'esb044c (deleted)', 'type': 'wt.epm.EPMDocument', 'copies': '', 'isLatest': 'true', 'version': '-'}
{'folder': 'master_vault7', 'storage': '', 'refSize': '76314', 'cadname': 'd68-2568-07_press_head_adhesive.prt', 'filename': '0000000005024b', 'replicatedTo': '', 'status': 'Content File Missing', 'number': 'D68-2568-07_PRESS_HEAD_ADHESIVE.PRT', 'location': '[wt.fv.FvItem:33512072]::master::master_vault::master_vault7::0000000005024b', 'vault': 'master_vault', 'uploaded': 'Jan 10, 2006 15:42:31', 'id': '5340927', 'actualLocation': '/vault/Windchill_Vaults/WcVlt7', 'name': 'd68-2568-07_press_head_adhesive.prt', 'modifiedBy': '', 'email': None, 'content': '', 'internal_id': '33512036', 'iteration': '2', 'username': 'e3789c (deleted)', 'type': 'wt.epm.EPMDocument', 'copies': '', 'isLatest': 'true', 'version': '-'}
EDITED FOR LATEST CODE:
Here is the new code that I am using, which still runs out of memory:
from xml.etree.ElementTree import iterparse
def save_csv_record(record):
    print(record)
    return
document_content = {}
for ev, el in iterparse('windchill.xml'):
    if el.tag=='Document':
        # parentheses allow the condition to span two lines
        if (document_content.get('type')=='wt.epm.EPMDocument' and
                document_content.get('isLatest') == 'true'):
            save_csv_record(document_content)
        document_content = {}
    else:
        document_content[el.tag] = el.text.strip() if el.text else None
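On the memory side, iterparse still builds the whole tree unless finished elements are discarded, so a 2GB file can exhaust memory even when it is read incrementally. The sketch below shows one common pattern, offered as an assumption rather than a tested fix for this file: grab the root from the first 'start' event and clear it after each completed Document. The wrapper element Documents, the helper name latest_rows and the shortened sample records are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical wrapper element <Documents>; the real root name is not shown.
SAMPLE = b"""<Documents>
<Document><type>wt.epm.EPMDocument</type><version>A</version><isLatest>false</isLatest>
<content><name>a.prt</name><storage><folder>v7</folder><filename>0505a6</filename></storage></content></Document>
<Document><type>wt.epm.EPMDocument</type><version>-</version><isLatest>true</isLatest>
<content><name>b.prt</name><storage><folder>v7</folder><filename>0518e8</filename></storage></content></Document>
</Documents>"""

def latest_rows(source):
    """Yield one CSV row per latest EPMDocument, freeing memory as it goes."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)          # first event is the start of the root
    for event, elem in context:
        # Act only when a whole <Document> has been read.
        if event != "end" or elem.tag != "Document":
            continue
        if (elem.findtext("type") == "wt.epm.EPMDocument"
                and elem.findtext("isLatest") == "true"):
            yield [
                elem.findtext("version", default=""),
                elem.findtext("content/name", default=""),
                elem.findtext("content/storage/filename", default=""),
                elem.findtext("content/storage/folder", default=""),
            ]
        root.clear()                 # drop finished records from the tree

out = io.StringIO()
csv.writer(out).writerows(latest_rows(io.BytesIO(SAMPLE)))
print(out.getvalue())
```

With a real file, the path "windchill.xml" would replace the BytesIO object, and the writer would wrap a file opened with newline="" as the csv module recommends.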

How to parse and display the content of an lxml object using lxml

I am having difficulty parsing the XML _file below using lxml:
>>_file= "qv.xml"
file content:
<document reference="suspicious-document00500.txt">
<feature name="plagiarism" type="artificial" obfuscation="none" this_offset="128" this_length="2503" source_reference="source-document00500.txt" source_offset="138339" source_length="2503"/>
<feature name="plagiarism" type="artificial" obfuscation="none" this_offset="8593" this_length="1582" source_reference="source-document00500.txt" source_offset="49473" source_length="1582"/>
</document>
Here is my attempt:
>>from lxml.etree import XMLParser, parse
>>parsefile = parse(_file)
>>print parsefile
Output: <lxml.etree._ElementTree object at 0x000000000642E788>
The output is the location of the lxml object, while I am after the actual file content, i.e.
Desired output={'document reference'="suspicious-document00500.txt", 'this_offset': '128', 'obfuscation': 'none', 'source_length': '2503', 'name': 'plagiarism', 'this_length': '2503', 'source_reference': 'source-document00500.txt', 'source_offset': '138339', 'type': 'artificial'}
Any ideas on how to get the desired output? Thanks.

Here's one way of getting the desired outputs:
from lxml import etree
def main():
    doc = etree.parse('qv.xml')
    root = doc.getroot()
    print(root.attrib)
    for item in root:
        print(item.attrib)
if __name__ == "__main__":
    main()
Output:
{'reference': 'suspicious-document00500.txt'}
{'this_offset': '128', 'obfuscation': 'none', 'source_length': '2503', 'name': 'plagiarism', 'this_length': '2503', 'source_reference': 'source-document00500.txt', 'source_offset': '138339', 'type': 'artificial'}
{'this_offset': '8593', 'obfuscation': 'none', 'source_length': '1582', 'name': 'plagiarism', 'this_length': '1582', 'source_reference': 'source-document00500.txt', 'source_offset': '49473', 'type': 'artificial'}
It works fine with the contents you gave.
You might want to read up on how etree represents XML objects.
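The desired output in the question combines the document's reference with each feature's attributes. A minimal sketch of that merge follows; it uses the standard library's xml.etree.ElementTree so it runs without extra packages, on the assumption that the attrib mapping behaves the same way for this purpose in lxml.etree. SAMPLE is a shortened copy of the question's file and the name records is chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's file (some attributes omitted).
SAMPLE = """<document reference="suspicious-document00500.txt">
<feature name="plagiarism" type="artificial" this_offset="128" this_length="2503"/>
<feature name="plagiarism" type="artificial" this_offset="8593" this_length="1582"/>
</document>"""

root = ET.fromstring(SAMPLE)

# Merge the parent's attributes into each feature's attributes.
records = [{**root.attrib, **feature.attrib} for feature in root.iter("feature")]
for record in records:
    print(record)
```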
