I have a very large feed file that is sent as an XML document (5GB). What would be the fastest way to parse the structure of the main item node without previously knowing its structure? Is there a means in Python to do so 'on-the-fly' without having the complete xml loaded in memory? For example, what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?
Update: I've included an example XML fragment here: https://hastebin.com/uyalicihow.xml. I'm looking to extract something like a dataframe (or list or whatever other data structure you want to use) similar to the following:
Items/Item/Main/Platform Items/Item/Info/Name
iTunes Chuck Versus First Class
iTunes Chuck Versus Bo
How could this be done? I've added a bounty to encourage answers here.
Several people have misinterpreted this question, and re-reading it, it's really not at all clear. In fact there are several questions.
How to detect an XML schema
Some people have interpreted this as saying you think there might be a schema within the file, or referenced from the file. I interpreted it as meaning that you wanted to infer a schema from the content of the instance.
What would be the fastest way to parse the structure of the main item node without previously knowing its structure?
Just put it through a parser, e.g. a SAX parser. A parser doesn't need to know the structure of an XML file in order to split it up into elements and attributes. But I don't think you actually want the fastest parse possible (in fact, I don't think performance is that high on your requirements list at all). I think you want to do something useful with the information (you haven't told us what): that is, you want to process the information, rather than just parsing the XML.
Is there a python utility that can do so 'on-the-fly' without having
the complete xml loaded in memory?
Yes, according to this page which mentions 3 event-based XML parsers in the Python world: https://wiki.python.org/moin/PythonXml (I can't vouch for any of them)
what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?
I'm not sure you know what the verb "to parse" actually means. Your phrase certainly suggests that you expect the file to contain a schema, which you want to extract. But I'm not at all sure you really mean that. And in any case, if it did contain a schema in the first 5Mb, you could find it just be reading the file sequentially, there would be no need to "save" the first part of the file first.
Question: way to parse the structure of the main item node without previously knowing its structure
This class TopSequenceElement parse a XML File to find all Sequence Elements.
The default is, to break at the FIRST closing </...> of the topmost Element.
Therefore, it is independend of the file size or even by truncated files.
from lxml import etree
from collections import OrderedDict
class TopSequenceElement(etree.iterparse):
"""
Read XML File
results: .seq == OrderedDict of Sequence Element
.element == topmost closed </..> Element
.xpath == XPath to top_element
"""
class Element:
"""
Classify a Element
"""
SEQUENCE = (1, 'SEQUENCE')
VALUE = (2, 'VALUE')
def __init__(self, elem, event):
if len(elem):
self._type = self.SEQUENCE
else:
self._type = self.VALUE
self._state = [event]
self.count = 0
self.parent = None
self.element = None
#property
def state(self):
return self._state
#state.setter
def state(self, event):
self._state.append(event)
#property
def is_seq(self):
return self._type == self.SEQUENCE
def __str__(self):
return "Type:{}, Count:{}, Parent:{:10} Events:{}"\
.format(self._type[1], self.count, str(self.parent), self.state)
def __init__(self, fh, break_early=True):
"""
Initialize 'iterparse' only to callback at 'start'|'end' Events
:param fh: File Handle of the XML File
:param break_early: If True, break at FIRST closing </..> of the topmost Element
If False, run until EOF
"""
super().__init__(fh, events=('start', 'end'))
self.seq = OrderedDict()
self.xpath = []
self.element = None
self.parse(break_early)
def parse(self, break_early):
"""
Parse the XML Tree, doing
classify the Element, process only SEQUENCE Elements
record, count of end </...> Events,
parent from this Element
element Tree of this Element
:param break_early: If True, break at FIRST closing </..> of the topmost Element
:return: None
"""
parent = []
try:
for event, elem in self:
tag = elem.tag
_elem = self.Element(elem, event)
if _elem.is_seq:
if event == 'start':
parent.append(tag)
if tag in self.seq:
self.seq[tag].state = event
else:
self.seq[tag] = _elem
elif event == 'end':
parent.pop()
if parent:
self.seq[tag].parent = parent[-1]
self.seq[tag].count += 1
self.seq[tag].state = event
if self.seq[tag].count == 1:
self.seq[tag].element = elem
if break_early and len(parent) == 1:
break
except etree.XMLSyntaxError:
pass
finally:
"""
Find the topmost completed '<tag>...</tag>' Element
Build .seq.xpath
"""
for key in list(self.seq):
self.xpath.append(key)
if self.seq[key].count > 0:
self.element = self.seq[key].element
break
self.xpath = '/'.join(self.xpath)
def __str__(self):
"""
String Representation of the Result
:return: .xpath and list of .seq
"""
return "Top Sequence Element:{}\n{}"\
.format( self.xpath,
'\n'.join(["{:10}:{}"
.format(key, elem) for key, elem in self.seq.items()
])
)
if __name__ == "__main__":
with open('../test/uyalicihow.xml', 'rb') as xml_file:
tse = TopSequenceElement(xml_file)
print(tse)
Output:
Top Sequence Element:Items/Item
Items :Type:SEQUENCE, Count:0, Parent:None Events:['start']
Item :Type:SEQUENCE, Count:1, Parent:Items Events:['start', 'end', 'start']
Main :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end']
Info :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end']
Genres :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end']
Products :Type:SEQUENCE, Count:1, Parent:Item Events:['start', 'end']
... (omitted for brevity)
Step 2: Now, you know there is a <Main> Tag, you can do:
print(etree.tostring(tse.element.find('Main'), pretty_print=True).decode())
<Main>
<Platform>iTunes</Platform>
<PlatformID>353736518</PlatformID>
<Type>TVEpisode</Type>
<TVSeriesID>262603760</TVSeriesID>
</Main>
Step 3: Now, you know there is a <Platform> Tag, you can do:
print(etree.tostring(tse.element.find('Main/Platform'), pretty_print=True).decode())
<Platform>iTunes</Platform>
Tested with Python:3.5.3 - lxml.etree:3.7.1
For very big files, reading is always a problem. I would suggest a simple algorithmic behavior for the reading of the file itself. The key point is always the xml tags inside the files. I would suggest you read the xml tags and sort them inside a heap and then validate the content of the heap accordingly.
Reading the file should also happen in chunks:
import xml.etree.ElementTree as etree
for event, elem in etree.iterparse(xmL, events=('start', 'end', 'start-ns', 'end-ns')):
store_in_heap(event, element)
This will parse the XML file in chunks at a time and give it to you at every step of the way. start will trigger when a tag is first encountered. At this point elem will be empty except for elem.attrib that contains the properties of the tag. end will trigger when the closing tag is encountered, and everything in-between has been read.
you can also benefit from the namespaces that are in start-ns and end-ns. ElementTree has provided this call to gather all the namespaces in the file.
Refer to this link for more information about namespaces
My interpretation of your needs is that you want to be able to parse the partial file and build up the structure of the document as you go. I've taken some assumptions from the file you uploaded:
Fundamentally you want to be parsing collections of things which have similar properties - I'm inferring this from the way you presented your desired output as a table with rows containing the values.
You expect these collections of things to have the same number of values.
You need to be able to parse partial files.
You don't worry about the properties of elements, just their contents.
I'm using xml.sax as this deals with arbitrarily large files and doesn't need to read the whole file into memory. Note that the strategy I'm following now doesn't actually scale that well as I'm storing all the elements in memory to build the dataframe, but you could just as well output the paths and contents.
In the sample file there is a problem with having one row per Item since there are multiples of the Genre tag and there are also multiple Product tags. I've handled the repeated Genre tags by appending them. This relies on the Genre tags appearing consecutively. It is not at all clear how the Product relationships can be handled in a single table.
import xml.sax
from collections import defaultdict
class StructureParser(xml.sax.handler.ContentHandler):
def __init__(self):
self.text = ''
self.path = []
self.datalist = defaultdict(list)
self.previouspath = ''
def startElement(self, name, attrs):
self.path.append(name)
def endElement(self, name):
strippedtext = self.text.strip()
path = '/'.join(self.path)
if strippedtext != '':
if path == self.previouspath:
# This handles the "Genre" tags in the sample file
self.datalist[path][-1] += f',{strippedtext}'
else:
self.datalist[path].append(strippedtext)
self.path.pop()
self.text = ''
self.previouspath = path
def characters(self, content):
self.text += content
You'd use this like this:
parser = StructureParser()
try:
xml.sax.parse('uyalicihow.xml', parser)
except xml.sax.SAXParseException:
print('File probably ended too soon')
This will read the example file just fine.
Once this has read and probably printed "File probably ended to soon", you have the parsed contents in parser.datalist.
You obviously want to have just the parts which read successfully, so you can figure out the shortest list and build a DataFrame with just those paths:
import pandas as pd
smallest_items = min(len(e) for e in parser.datalist.values())
df = pd.DataFrame({key: value for key, value in parser.datalist.items() if len(value) == smallest_items})
This gives something similar to your desired output:
Items/Item/Main/Platform Items/Item/Main/PlatformID Items/Item/Main/Type
0 iTunes 353736518 TVEpisode
1 iTunes 495275084 TVEpisode
The columns for the test file which are matched here are
>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
'Items/Item/Info/HighestResolution',
'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
'Items/Item/Products/Product/URL'],
dtype='object')
Based on your comments, it appears as though it is more important to you to have all the elements represented, but perhaps just showing a preview, in which case you can perhaps use only the first elements from the data. Note that in this case the Products entries won't match the Item entries.
df = pd.DataFrame({key: value[:smallest_items] for key, value in parser.datalist.items()})
Now we get all the paths:
>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
'Items/Item/Info/HighestResolution',
'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
'Items/Item/Products/Product/URL',
'Items/Item/Products/Product/Offers/Offer/Price',
'Items/Item/Products/Product/Offers/Offer/Currency'],
dtype='object')
There are a number of tools around that will generate a schema from a supplied instance document. I don't know how many of them will work on a 5Gb input file, and I don't know how many of them can be invoked from Python.
Many years ago I wrote a Java-based, fully streamable tool to generate a DTD from an instance document. It hasn't been touched in years but it should still run: https://sourceforge.net/projects/saxon/files/DTDGenerator/7.0/dtdgen7-0.zip/download?use_mirror=vorboss
There are other tools listed here: Any tools to generate an XSD schema from an XML instance document?
As I see it your question is very clear. I give it a plus one up vote for clearness. You are wanting to parse text.
Write a little text parser, we can call that EditorB, that reads in chunks of the file or at least line by line. Then edit or change it as you like and re-save that chunk or line.
It can be easy in Windows from 98SE on. It should be easy in other operating systems.
The process is (1) Adjust (manually or via program), as you currently do, we can call this EditorA, that is editing your XML document, and save it; (2) stop EditorA; (3) Run your parser or editor, EditorB, on the saved XML document either manually or automatically (started via detecting that the XML document has changed via date or time or size, etc.); (4) Using EditorB, save manually or automatically the edits from step 3; (5) Have your EditorA reload the XML document and go on from there; (6) do this as often as is necessary, making edits with EditorA and automatically adjusting them outside of EditorA by using EditorB.
Edit this way before you send the file.
It is a lot of typing to explain, but XML is just a glorified text document. It can be easily parsed and edited and saved, either character by character or by larger amounts line by line or in chunks.
As a further note, this can be applied via entire directory contained documents or system wide documents as I have done in the past.
Make certain that EditorA is stopped before EditorB is allowed to start it's changing. Then stop EditorB before restarting EditorA. If you set this up as I described, then EditorB can be run continually in the background, but put in it an automatic notifier (maybe a message box with options, or a little button that is set formost on the screen when activated) that allows you to turn off (on continue with) EditorA before using EditorB. Or, as I would do it, put in a detector to keep EditorB from executing its own edits as long as EditorA is running.
B Lean
I'm attempting to save data from several lists in XML format, but I cannot understand how to make the XML display properly. An example of my code right now is as follows:
from lxml import etree
#Create XML Root
articles = etree.Element('root')
#Create Lists & Data
t_list = ['title1', 'title2', 'title3', 'title4', 'title5']
c_list = ['content1', 'content2', 'content3', 'content4', 'content5']
sum_list = ['summary1', 'summary2', 'summary3', 'summary4', 'summary5']
s_list = ['source1', 'source2', 'source3', 'source4', 'source5']
i = 0
for t in t_list:
for i in range(len(t_list)):
#Create SubElements of XML Root
article = etree.SubElement(articles, 'Article')
titles = etree.SubElement(article, 'Title')
summary = etree.SubElement(article, 'Summary')
source = etree.SubElement(article, 'Source')
content = etree.SubElement(article, 'Content')
#Add List Data to SubElements
titles.text = t_list[i]
summary.text = sum_list[i]
source.text = s_list[i]
content.text = c_list[i]
print(etree.tostring(articles, pretty_print=True))
My Current Output is written in one very jumbled fashion, all on a single line as follows:
b'<root>\n <Article>\n <Title>title1</Title>\n <Summary>summary1</Summary>\n <Source>source1</Source>\n <Content>content1</Content>\n </Article>\n
It looks like the pretty_print function within lxml is adding proper indentation, as well as \n breaks as I would want, but it doesn't seem to be getting interpreted correctly during output; it write on a single line.
The output I'm trying to get is as follows:
<root>
<Article>
<Title>title1</Title>
<Summary>summary1</Summary>
<Source>source1</Source>
<Content>content1</Content>
</Article>
Ideally, I'd like for my output to be viewed as a valid XML document, and display in proper nested format.
Your "Current Output" is the representation (internal python representation) of the bytestring generated by etree.tostring(), and seems that in Python3 print(somebytestring) prints the representation instead of the actual string.
Hopefully the solution is quite simple: just pass the desired encoding to etree.tostring(), ie:
xml = etree.tostring(articles, encoding="unicode", pretty_print=True)
print(xml)
I've only used the base ET module in Python and can't find an lxml download for python 3.5 (which I'm on) in order to test it, but the b before the line indicates bytes and a quick glance at the documentation indicates that tostring() has an encoding keyword, so you should just need to set that to unicode or utf-8.
I'll also mention that you don't need to set "i" before your for-loop (python will create the "i" it needs for the for-loop), though I- personally- would zip the lists and iterate the items in the lists themselves (though that's not going to have any real impact on the code in this situation).
I've got a large XML file that I need to parse and look for a specific node. Once it has been found, I need to make a copy, edit a couple of values and write the file again.
So far I've managed to get the DOM element that I want. There is actually two of these elements already in the XML so after I'm finished, there will be three. Once I've made a copy of the DOM and edited the value, how do I then write this into the DOM (and thus the file)?
I'm using Python's from xml.dom import minidom at the moment.
In minidom you start with creating Document:
Document doc = Document("your_root")
then if it is a text node you want to add, you append it with:
text_node = doc.createTextNode(str(some content))
doc.appendChild(text_node)
if you had for example <some_elem key="my value">some my text</some_elem>:
do it like this:
text_node = doc.createTextNode('some my text')
elem.appendChild(text_node)
elem.setAttribute('key', 'my value')
if it is complex element create it with:
elem = doc.createElement('your_elem')
if you need to set attributes do:
elem.setAttribute("some-attribute",your_attr)
if you need to append something to it:
elem.appendChild( some_other_elem )
then append the element:
doc.appendChild( elem )
if you need a string representation do:
doc.toxml()
of
doc.toprettyxml()
From the minidom documentation:
from xml.dom.minidom import getDOMImplementation
impl = getDOMImplementation()
newdoc = impl.createDocument(None, "some_tag", None)
top_element = newdoc.documentElement
text = newdoc.createTextNode('Some textual content.')
top_element.appendChild(text)
So I guess appendChild is what you ask for?