My client wants me to parse over 100,00 XML files and convert them into a text file.
I have successfully parsed a couple of files and converted them into a text file. However, I managed to do that only by editing the XML and adding <root></root> to the file.
This seems inefficient, since I would have to edit nearly 100,00 XML files to achieve my desired result.
Is there any way for my Python code to recognize the first node and read it as the root node?
I have tried using the method shown in "Python XML Parsing without root";
however, I do not fully understand it and I do not know where to apply it.
The XML format is as follows:
<Thread>
<ThreadID></ThreadID>
<Title></Title>
<InitPost>
<UserID></UserID>
<Date></Date>
<icontent></icontent>
</InitPost>
<Post>
<UserID></UserID>
<Date></Date>
<rcontent></rcontent>
</Post>
</Thread>
And this is my code on how to parse the XML files:
import os
from xml.etree import ElementTree
saveFile = open('test3.txt','w')
for path, dirs, files in os.walk("data/sample"):
    for f in files:
        fileName = os.path.join(path, f)
        with open(fileName, "r", encoding="utf8") as myFile:
            dom = ElementTree.parse(myFile)
            thread = dom.findall('Thread')
            for t in thread:
                threadID = str(t.find('ThreadID').text)
                threadID = threadID.strip()
                title = str(t.find('Title').text)
                title = title.strip()
                userID = str(t.find('InitPost/UserID').text)
                userID = userID.strip()
                date = str(t.find('InitPost/Date').text)
                date = date.strip()
                initPost = str(t.find('InitPost/icontent').text)
                initPost = initPost.strip()
            post = dom.findall('Thread/Post')
The rest of the code is just writing to the output text file.
Load the XML as text and wrap it in a root element.
'1.xml' is the XML you have posted.
from xml.etree import ElementTree as ET
files = ['1.xml']  # your list of files goes here
for file in files:
    with open(file) as f:
        # wrap it with <r>
        xml = '<r>' + f.read() + '</r>'
        root = ET.fromstring(xml)
        print('Now we are ready to work with the xml')
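A minimal sketch of how the wrapping step could feed the extraction from the question. It assumes each file holds one or more <Thread> elements and no XML declaration (a leading <?xml ...?> would end up inside the wrapper and make the parse fail). The function name parse_threads and the sample string are illustrative and not from the original post.

```python
from xml.etree import ElementTree as ET

def parse_threads(xml_text):
    """Wrap rootless XML text in a dummy <r> element and extract thread fields."""
    root = ET.fromstring('<r>' + xml_text + '</r>')
    rows = []
    for t in root.findall('Thread'):
        rows.append({
            # findtext returns None for a missing child, hence the "or ''"
            'threadID': (t.findtext('ThreadID') or '').strip(),
            'title': (t.findtext('Title') or '').strip(),
            'userID': (t.findtext('InitPost/UserID') or '').strip(),
            'replies': [(p.findtext('rcontent') or '').strip()
                        for p in t.findall('Post')],
        })
    return rows

# Hypothetical sample: two threads with no common root element
sample = (
    '<Thread><ThreadID> 1 </ThreadID><Title>Hello</Title>'
    '<InitPost><UserID>u1</UserID><Date>d</Date><icontent>hi</icontent></InitPost>'
    '<Post><UserID>u2</UserID><Date>d</Date><rcontent>reply</rcontent></Post>'
    '</Thread>'
    '<Thread><ThreadID>2</ThreadID><Title></Title></Thread>'
)
threads = parse_threads(sample)
```

Because the wrapper becomes the root, paths such as 'Thread' and 'Thread/Post' from the question's code resolve without editing the files on disk.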
I don't know if the Python parser supports DTDs, but if it does, then one approach is to define a simple wrapper document like this
<!DOCTYPE root [
<!ENTITY e SYSTEM "realdata.xml">
]>
<root>&e;</root>
and point the parser at this wrapper document instead of at realdata.xml
Not sure about Python, but generally speaking you can use SGML to infer missing tags, whether at the document element (root) level or elsewhere. The basic technique is creating a DTD for declaring the document element like so
<!DOCTYPE root [
<!ELEMENT root O O ANY>
]>
<!-- your document character data goes here -->
where the important things are the O O (letter O) tag omission indicators telling SGML that both the start- and end-element tags for root can be omitted.
See also the following questions with more details:
Querying Non-XML compliant structured data
Adding missing XML closing tags in Javascript
My xml is as below.
<?xml version="1.0" encoding="UTF-8"?>
<ServiceResponse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://xx.xx.xx/xx/xx/x.x/xx/xx.xsd">
<responseCode>SUCCESS</responseCode>
<count>100</count>
<hasMoreRecords>true</hasMoreRecords>
<lastId>12345</lastId>
<data>
<Main>
<sub1>1</sub1>
<sub2>a</sub2>
</Main>
<Main>
<sub1>2</sub1>
<sub2>b</sub2>
</Main>
</data>
</ServiceResponse>
My code is as below.
import csv
import xml.etree.ElementTree as etree
xml_file_name = 'blah.xml'
csv_file_name = 'blah.csv'
main_tag_name = 'Main'
fields = ['sub1', 'sub2']
tree = etree.parse(xml_file_name)
with open(csv_file_name, 'w', newline='', encoding="utf-8") as csv_file:
    csvwriter = csv.writer(csv_file)
    csvwriter.writerow(fields)
    for host in tree.iter(tag=main_tag_name):
        data = []
        for field in fields:
            if host.find(field) is not None:
                data.append(host.find(field).text)
            else:
                data.append('')
        csvwriter.writerow(data)
I suspect this is not the correct way to parse the XML, because it searches for 'Main' anywhere in the tree rather than following a specific path to it.
That is, if 'Main' happens to appear anywhere else in the document, the program will not work as intended.
Please suggest the most efficient approach you know for this use case, preferably a built-in one rather than something heavily customized.
Note:
I want to use this as a common script for multiple XML files, which have various tags before the main tag and various sub-tags beneath it. The script therefore must not hardcode the tree structure; it needs to be configurable.
You can try an XPath-based approach.
For example:
import csv
import xml.etree.ElementTree as ET
with open('some.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    with open("test.xml") as xml_file:
        tree = ET.parse(xml_file)
    root = tree.getroot()
    # explicit paths, so only <Main> elements under <data> are matched
    sub1_nodes = root.findall('.//data/Main/sub1')
    sub2_nodes = root.findall('.//data/Main/sub2')
    for a, b in zip(sub1_nodes, sub2_nodes):
        writer.writerow([a.text, b.text])
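A possible variation on the same idea, sketched under the assumption that the path to the record element and the field names are passed in as configuration. It iterates record by record, so a record that lacks a field produces an empty cell instead of shifting the pairs that zip builds. The names xml_to_rows and record_path, and the extra <other> branch in the sample, are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

def xml_to_rows(xml_text, record_path, fields):
    """Yield one list of field values per element matched by record_path."""
    root = ET.fromstring(xml_text)
    for record in root.findall(record_path):
        # findtext returns the default when the child element is missing
        yield [record.findtext(field, default='') or '' for field in fields]

sample = """<ServiceResponse>
  <count>2</count>
  <data>
    <Main><sub1>1</sub1><sub2>a</sub2></Main>
    <Main><sub1>2</sub1></Main>
  </data>
  <other><Main><sub1>ignored</sub1></Main></other>
</ServiceResponse>"""

fields = ['sub1', 'sub2']
# The path is relative to the root element, so only data/Main is matched
rows = list(xml_to_rows(sample, './data/Main', fields))

# Write to an in-memory buffer here; a real script would open a CSV file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(fields)
writer.writerows(rows)
```

Only record_path and fields change between input files, which keeps the tree structure out of the script body.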
I was wondering how I would go about determining what the root tag of an XML document is using xml.dom.minidom.
<?xml version="1.0" encoding="UTF-8"?>
<root>
<child1></child1>
<child2></child2>
<child3></child3>
</root>
In the example XML above, my root tag could be 3 or 4 different things. All I want to do is pull the tag, and then use that value to get the elements by tag name.
import re
from xml.dom.minidom import parseString
def import_from_XML(self, file_name):
    with open(file_name) as file:
        document = file.read()
    if re.compile(r'^<\?xml').match(document):
        xml = parseString(document)
        root = '' # <-- THIS IS WHERE IM STUCK
        elements = xml.getElementsByTagName(root)
I tried searching through the documentation for xml.dom.minidom, but it is a little hard for me to wrap my head around, and I couldn't find anything that answered this question outright.
I'm using Python 3.6.x, and I would prefer to keep with the standard library if possible.
For the line you commented as Where I am stuck, the following should assign the value of the root tag of the XML document to the variable theNameOfTheRootElement:
theNameOfTheRootElement = xml.documentElement.tagName
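As a small illustration of that attribute, here is a sketch that parses the sample document from the question (inlined as a string, which is an assumption for the sake of a self-contained example) and reads the root tag name. The variable names root_name and children are illustrative.

```python
from xml.dom.minidom import parseString

document = ('<?xml version="1.0" encoding="UTF-8"?>'
            '<root><child1></child1><child2></child2><child3></child3></root>')
xml = parseString(document)

# documentElement is the root element node of the parsed document
root_name = xml.documentElement.tagName

# Direct element children of the root (text nodes are skipped)
children = [node.tagName for node in xml.documentElement.childNodes
            if node.nodeType == node.ELEMENT_NODE]
```

Note that xml.getElementsByTagName(root_name) returns the root element itself (plus any descendants sharing that name), so iterating documentElement.childNodes is one way to reach the elements beneath it.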
This is what I did the last time I processed XML. I didn't use the approach you used, but I hope it helps.
import urllib.error
import urllib.request
from xml.etree import ElementTree as ET
# `site` is assumed to hold the URL of the XML document
req = urllib.request.Request(site)
try:
    with urllib.request.urlopen(req) as response:
        data = response.read()
except urllib.error.URLError as e:
    print(e.reason)
else:
    # the root element is what fromstring returns
    root = ET.fromstring(data)
    print("root", root)
    for child in root.findall('parent element'):
        print(child.text, child.attrib)
I have a script that goes through all the XML files in a directory and parses them to get the value of the icp attribute of the is element. However, there are several thousand of those XML files, and some of them may not have the icp attribute on is. Is there a way to handle this with minidom?
Example of XML I am parsing where the is element has the icp attribute:
<is ico="0000000000" pcz="1" icp="12345678" icz="12345678" oddel="99">
Example of XML I am parsing where the is element has no icp attribute:
<is ico="000000000">
Here my script obviously fails, as there is no icp. How can I check for the presence of the icp attribute?
My script:
import os
from xml.dom import minidom
#for testing purposes
directory = os.getcwd()
print("Source directory is: " + directory)
print("Walking the current directory, looking for XML files...")
print("Going through the XML files, looking for the provider's IČP...")
with open('ICP_all.txt', 'w') as SeznamICP_all:
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.xml'):
                xmldoc = minidom.parse(os.path.join(root, file))
                itemlist = xmldoc.getElementsByTagName('is')
                SeznamICP_all.write(itemlist[0].attributes['icp'].value + '\n')
print("Creating the list of unique IČP values...")
with open('ICP_distinct.txt', 'w') as distinct:
    UnikatniICP = []
    with open('ICP_all.txt', 'r') as SeznamICP_all:
        distinct.writelines(set(SeznamICP_all))
input('Press any key to exit...')
I googled a lot, yet I cannot find a simple answer on how to check whether it is present in the XML using minidom.
Could you please give me some advice?
You can use the hasAttribute(attributeName) method:
....
itemlist = xmldoc.getElementsByTagName('is')
if itemlist[0].hasAttribute("icp"):
    SeznamICP_all.write(itemlist[0].attributes['icp'].value + '\n')
You can check for the presence of icp by using the in operator:
for item in itemlist:
    if 'icp' in item.attributes:
        SeznamICP_all.write(item.attributes['icp'].value + '\n')
        break
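Both suggestions can be tried on the two sample elements from the question. The sketch below uses hasAttribute and also guards against files with no is element at all; the helper name get_icp and the self-closing form of the samples are assumptions made so the snippet runs on its own.

```python
from xml.dom import minidom

def get_icp(xml_text):
    """Return the icp attribute of the first <is> element, or None if absent."""
    doc = minidom.parseString(xml_text)
    items = doc.getElementsByTagName('is')
    # An empty NodeList is falsy, so a missing <is> element is handled too
    if items and items[0].hasAttribute('icp'):
        return items[0].getAttribute('icp')
    return None

with_icp = get_icp('<is ico="0000000000" pcz="1" icp="12345678" icz="12345678" oddel="99"/>')
without_icp = get_icp('<is ico="000000000"/>')
no_is = get_icp('<other/>')
```

In the original script, the write to ICP_all.txt would then happen only when the helper returns something other than None.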
I need to parse a very large (~40GB) XML file, remove certain elements from it, and write the result to a new XML file. I've been trying to use iterparse from Python's ElementTree, but I'm confused about how to modify the tree and then write the resulting tree to a new XML file. I've read the documentation on iterparse, but it hasn't cleared things up. Is there a simple way to do this?
Thank you!
EDIT: Here's what I have so far.
import xml.etree.ElementTree as ET
import re
date_pages = []
f=open('dates_texts.xml', 'w+')
tree = ET.iterparse("sample.xml")
for i, element in tree:
    if element.tag == 'page':
        for page_element in element:
            if page_element.tag == 'revision':
                for revision_element in page_element:
                    if revision_element.tag == '{text':
                        if len(re.findall('20\d\d', revision_element.text.encode('utf8'))) == 0:
                            element.clear()
If you have a large xml that doesn't fit in memory then you could try to serialize it one element at a time. For example, assuming <root><page/><page/><page/>...</root> document structure and ignoring possible namespace issues:
import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9
def getelements(filename_or_file, tag):
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context) # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear() # free memory
with open('output.xml', 'wb') as file:
    # start root
    file.write(b'<root>')
    for page in getelements('sample.xml', 'page'):
        if keep(page):
            file.write(etree.tostring(page, encoding='utf-8'))
    # close root
    file.write(b'</root>')
where keep(page) returns True if page should be kept e.g.:
import re
def keep(page):
    # all <revision> elements must have 20xx in them
    # (rev.text is None for an element without text, hence the "or ''")
    return all(re.search(r'20\d\d', rev.text or '')
               for rev in page.iterfind('revision'))
For comparison, to modify a small xml file, you could:
# parse small xml
tree = etree.parse('sample.xml')
# remove some root/page elements from xml
root = tree.getroot()
for page in root.findall('page'):
    if not keep(page):
        root.remove(page) # modify inplace
# write to a file modified xml tree
tree.write('output.xml', encoding='utf-8')
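To try the streaming approach without a 40GB file, here is a sketch that runs it against an in-memory document. It assumes the nesting from the question's code (page, then revision, then text) and no XML namespaces; the sample data and the use of io.BytesIO in place of real files are illustrative only.

```python
import io
import re
import xml.etree.ElementTree as etree

def getelements(source, tag):
    """Yield each completed <tag> element, clearing the root afterwards."""
    context = iter(etree.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # drop already-processed children to free memory

def keep(page):
    # Keep a page only if every revision text mentions a year of the form 20xx
    return all(re.search(r'20\d\d', text.text or '')
               for text in page.iterfind('revision/text'))

source = io.BytesIO(
    b'<root>'
    b'<page><revision><text>edited in 2012</text></revision></page>'
    b'<page><revision><text>edited in 1999</text></revision></page>'
    b'<page><revision><text>2005</text></revision>'
    b'<revision><text>2019</text></revision></page>'
    b'</root>'
)
output = io.BytesIO()
output.write(b'<root>')
for page in getelements(source, 'page'):
    if keep(page):
        output.write(etree.tostring(page, encoding='utf-8'))
output.write(b'</root>')

# Parse the result back only to inspect it; not needed in a real run
result = etree.fromstring(output.getvalue())
kept_texts = [t.text for t in result.iter('text')]
```

Each page is serialized as soon as it is complete, so memory use stays bounded by the size of a single page rather than the whole file.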
Perhaps the answer to my similar question can help you out.
As for how to write this back to an .xml file, I ended up doing this at the bottom of my script:
with open('File.xml', 'w') as t: # I'd suggest using a different file name here than your original
    # tostring returns bytes by default; encoding='unicode' gives a str for a text-mode file
    t.write(ET.tostring(doc, encoding='unicode'))
print('File.xml Complete') # Console message that file wrote successfully, can be omitted
The variable doc is from earlier on in my script, comparable to where you have tree = ET.iterparse("sample.xml") I have this:
doc = ET.parse(filename)
I've been using lxml instead of ElementTree but I think the write out part should still work (I think it's mainly just xpath stuff that ElementTree can't handle.) I'm using lxml imported with this line:
from lxml import etree as ET
Hopefully this (along with my linked question for some additional code context if you need it) can help you out!
I am trying to extract the name elements under the sequence in xml files. I have pasted in the top of a sample xml to illustrate. With this I want to get the text from 01 Interview_been successful through mentorship and write it to a file. There are multiple sequence tags in the xml and I am trying to figure out how to go through it and extract it. I have tried to figure out how to use xml.etree and xml.dom.minidom but I can't seem to wrap my brain around it. I was able to get all of the id values from the sequence tags but not the name elements. I'm pasting in my code before the xml.
from xml.etree import ElementTree
file = open("xmldump.txt", "r")
filedata = file.read()
file.close()
with open('test.xml', 'rt') as f:
    tree = ElementTree.parse(f)
for node in tree.iter('name'):
    sequenceid = node.attrib.get('name')
    print(' %s' % (sequenceid))
    newLine = sequenceid + "\n"
    file = open("xmldump.txt", "w")
    file.write(newLine)
    file.close()
Here is the XML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xmeml>
<xmeml version="5">
<bin>
<uuid>0F5D72FA-54E4-4DE8-81D7-CC33F5C43836</uuid>
<updatebehavior>add</updatebehavior>
<name>Logged</name>
<children>
<sequence id="01 Interview_been successful through mentorship">
<uuid>12FB944D-83EA-4527-9A54-2130A42E3A06</uuid>
<updatebehavior>add</updatebehavior>
<name>01 Interview_been successful through mentorship</name>
<duration>1195</duration>
<rate>
<ntsc>TRUE</ntsc>
<timebase>24</timebase>
</rate>
<timecode>
Well, I'm not sure whether you want the "id" attribute or the name tag (your code is confusing: it tries to read a "name" attribute, but the "sequence" tag only has an "id" attribute). Below is code that extracts both, which should help you get started on figuring out how ElementTree works.
from xml.etree import ElementTree
with open('test.xml', 'rt') as f:
    tree = ElementTree.parse(f)
for node in tree.iter('sequence'):
    sequenceid = node.attrib.get('id')
    name = node.findtext('name')
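To see what those two lookups return, here is a sketch that runs them on a shortened, in-memory version of the posted XML. The second sequence element ("02 Second clip") is invented for illustration, and the names collection is an assumption about how the results might be gathered.

```python
from xml.etree import ElementTree

sample = """<xmeml version="5">
  <bin>
    <name>Logged</name>
    <children>
      <sequence id="01 Interview_been successful through mentorship">
        <name>01 Interview_been successful through mentorship</name>
        <duration>1195</duration>
      </sequence>
      <sequence id="02 Second clip">
        <name>02 Second clip</name>
      </sequence>
    </children>
  </bin>
</xmeml>"""

root = ElementTree.fromstring(sample)
ids = [seq.attrib.get('id') for seq in root.iter('sequence')]
# findtext('name') looks only at direct children of each <sequence>,
# so the <name> that belongs to <bin> is not picked up
names = [seq.findtext('name', default='') for seq in root.iter('sequence')]
# One name per line, ready to be written to a text file in a single call
lines = ''.join(name + '\n' for name in names)
```

Writing lines once with open('xmldump.txt', 'w') avoids reopening the file inside the loop, which in the question's code would overwrite it on every iteration.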