I have a very large feed file that is sent as an XML document (5GB). What would be the fastest way to parse the structure of the main item node without previously knowing its structure? Is there a means in Python to do so 'on-the-fly' without having the complete xml loaded in memory? For example, what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?
Update: I've included an example XML fragment here: https://hastebin.com/uyalicihow.xml. I'm looking to extract something like a dataframe (or list or whatever other data structure you want to use) similar to the following:
Items/Item/Main/Platform Items/Item/Info/Name
iTunes Chuck Versus First Class
iTunes Chuck Versus Bo
How could this be done? I've added a bounty to encourage answers here.
Several people have misinterpreted this question, and re-reading it, it's really not at all clear. In fact there are several questions.
How to detect an XML schema
Some people have interpreted this as saying you think there might be a schema within the file, or referenced from the file. I interpreted it as meaning that you wanted to infer a schema from the content of the instance.
What would be the fastest way to parse the structure of the main item node without previously knowing its structure?
Just put it through a parser, e.g. a SAX parser. A parser doesn't need to know the structure of an XML file in order to split it up into elements and attributes. But I don't think you actually want the fastest parse possible (in fact, I don't think performance is that high on your requirements list at all). I think you want to do something useful with the information (you haven't told us what): that is, you want to process the information, rather than just parsing the XML.
Is there a python utility that can do so 'on-the-fly' without having
the complete xml loaded in memory?
Yes, according to this page which mentions 3 event-based XML parsers in the Python world: https://wiki.python.org/moin/PythonXml (I can't vouch for any of them)
what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?
I'm not sure you know what the verb "to parse" actually means. Your phrase certainly suggests that you expect the file to contain a schema, which you want to extract. But I'm not at all sure you really mean that. And in any case, if it did contain a schema in the first 5MB, you could find it just by reading the file sequentially; there would be no need to "save" the first part of the file first.
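To illustrate reading the file sequentially and stopping early, here is a minimal sketch using the standard library's xml.etree.ElementTree.XMLPullParser. The function name element_paths and its max_bytes and chunk_size parameters are my own, not from the question. Because the parser is never closed, input that stops in the middle of a tag does not raise an error; you simply get the paths seen so far.

```python
import io
import xml.etree.ElementTree as ET


def element_paths(stream, max_bytes=5 * 1024 * 1024, chunk_size=64 * 1024):
    """Return the distinct element paths seen in the first max_bytes of stream."""
    parser = ET.XMLPullParser(events=("start", "end"))
    stack = []    # tags of the elements that are currently open
    paths = []    # distinct paths, in the order they were first seen
    seen = set()
    remaining = max_bytes
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            break  # end of input
        remaining -= len(chunk)
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                stack.append(elem.tag)
                path = "/".join(stack)
                if path not in seen:
                    seen.add(path)
                    paths.append(path)
            else:
                stack.pop()
                # Drop the element's text and children; the emptied element
                # itself is still referenced by its parent.
                elem.clear()
    return paths


# Usage with a deliberately truncated document (hypothetical sample data).
truncated = b"<Items><Item><Main><Platform>iTunes</Platform></Main><Info><Name>Chuck Versus"
for path in element_paths(io.BytesIO(truncated)):
    print(path)
```

For the real feed you would pass a file opened in binary mode instead of the BytesIO object. Depending on the Expat version, the last few events before the cut-off may not be delivered, so treat the tail of the list as incomplete.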
Question: way to parse the structure of the main item node without previously knowing its structure
The class TopSequenceElement parses an XML file to find all sequence elements.
By default, it breaks at the FIRST closing </...> of the topmost element.
Therefore, it is independent of the file size and even works with truncated files.
from lxml import etree
from collections import OrderedDict


class TopSequenceElement(etree.iterparse):
    """
    Read XML File
    results: .seq == OrderedDict of Sequence Element
             .element == topmost closed </..> Element
             .xpath == XPath to top_element
    """

    class Element:
        """
        Classify a Element
        """
        SEQUENCE = (1, 'SEQUENCE')
        VALUE = (2, 'VALUE')

        def __init__(self, elem, event):
            if len(elem):
                self._type = self.SEQUENCE
            else:
                self._type = self.VALUE

            self._state = [event]
            self.count = 0
            self.parent = None
            self.element = None

        @property
        def state(self):
            return self._state

        @state.setter
        def state(self, event):
            self._state.append(event)

        @property
        def is_seq(self):
            return self._type == self.SEQUENCE

        def __str__(self):
            return "Type:{}, Count:{}, Parent:{:10} Events:{}"\
                .format(self._type[1], self.count, str(self.parent), self.state)

    def __init__(self, fh, break_early=True):
        """
        Initialize 'iterparse' only to callback at 'start'|'end' Events

        :param fh: File Handle of the XML File
        :param break_early: If True, break at FIRST closing </..> of the topmost Element
                            If False, run until EOF
        """
        super().__init__(fh, events=('start', 'end'))
        self.seq = OrderedDict()
        self.xpath = []
        self.element = None

        self.parse(break_early)

    def parse(self, break_early):
        """
        Parse the XML Tree, doing
          classify the Element, process only SEQUENCE Elements
          record, count of end </...> Events,
                  parent from this Element
                  element Tree of this Element

        :param break_early: If True, break at FIRST closing </..> of the topmost Element
        :return: None
        """
        parent = []

        try:
            for event, elem in self:
                tag = elem.tag
                _elem = self.Element(elem, event)

                if _elem.is_seq:
                    if event == 'start':
                        parent.append(tag)

                        if tag in self.seq:
                            self.seq[tag].state = event
                        else:
                            self.seq[tag] = _elem

                    elif event == 'end':
                        parent.pop()
                        if parent:
                            self.seq[tag].parent = parent[-1]

                        self.seq[tag].count += 1
                        self.seq[tag].state = event

                        if self.seq[tag].count == 1:
                            self.seq[tag].element = elem

                        if break_early and len(parent) == 1:
                            break

        except etree.XMLSyntaxError:
            pass

        finally:
            # Find the topmost completed '<tag>...</tag>' Element
            # Build .seq.xpath
            for key in list(self.seq):
                self.xpath.append(key)
                if self.seq[key].count > 0:
                    self.element = self.seq[key].element
                    break

            self.xpath = '/'.join(self.xpath)

    def __str__(self):
        """
        String Representation of the Result
        :return: .xpath and list of .seq
        """
        return "Top Sequence Element:{}\n{}"\
            .format(self.xpath,
                    '\n'.join(["{:10}:{}"
                              .format(key, elem) for key, elem in self.seq.items()
                               ])
                    )


if __name__ == "__main__":
    with open('../test/uyalicihow.xml', 'rb') as xml_file:
        tse = TopSequenceElement(xml_file)
        print(tse)
Output:
Top Sequence Element:Items/Item
Items :Type:SEQUENCE, Count:0, Parent:None Events:['start']
Item :Type:SEQUENCE, Count:1, Parent:Items Events:['start', 'end', 'start']
Main :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end']
Info :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end']
Genres :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end']
Products :Type:SEQUENCE, Count:1, Parent:Item Events:['start', 'end']
... (omitted for brevity)
Step 2: Now that you know there is a <Main> tag, you can do:
print(etree.tostring(tse.element.find('Main'), pretty_print=True).decode())
<Main>
<Platform>iTunes</Platform>
<PlatformID>353736518</PlatformID>
<Type>TVEpisode</Type>
<TVSeriesID>262603760</TVSeriesID>
</Main>
Step 3: Now that you know there is a <Platform> tag, you can do:
print(etree.tostring(tse.element.find('Main/Platform'), pretty_print=True).decode())
<Platform>iTunes</Platform>
Tested with Python:3.5.3 - lxml.etree:3.7.1
For very big files, reading is always a problem. I would suggest a simple algorithmic behavior for the reading of the file itself. The key point is always the xml tags inside the files. I would suggest you read the xml tags and sort them inside a heap and then validate the content of the heap accordingly.
Reading the file should also happen in chunks:
import xml.etree.ElementTree as etree

# xml_file is the path to (or an open file object for) the XML document
for event, elem in etree.iterparse(xml_file, events=('start', 'end', 'start-ns', 'end-ns')):
    store_in_heap(event, elem)
This parses the XML file incrementally and hands you each event as it occurs. start fires when an opening tag is first encountered; at that point elem is empty except for elem.attrib, which holds the tag's attributes. end fires when the closing tag is encountered and everything in between has been read.
You can also benefit from the namespace events start-ns and end-ns, which ElementTree provides so that you can gather all the namespaces declared in the file.
Refer to this link for more information about namespaces
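As a rough sketch of the namespace part, the snippet below collects every namespace declaration through the start-ns events. For a start-ns event, iterparse yields a (prefix, uri) tuple rather than an element. The helper name collect_namespaces and the sample document are my own, for illustration only.

```python
import io
import xml.etree.ElementTree as ET


def collect_namespaces(source):
    """Return a {prefix: uri} dict of the namespaces declared in source."""
    namespaces = {}
    for event, payload in ET.iterparse(source, events=("start-ns", "end")):
        if event == "start-ns":
            prefix, uri = payload      # the default namespace has prefix ''
            namespaces[prefix] = uri
        else:
            payload.clear()            # discard element content already seen
    return namespaces


# Usage with a small in-memory document (hypothetical sample data).
doc = b'<a xmlns="urn:example:default" xmlns:b="urn:example:b"><b:c/></a>'
print(collect_namespaces(io.BytesIO(doc)))
```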
My interpretation of your needs is that you want to be able to parse the partial file and build up the structure of the document as you go. I've taken some assumptions from the file you uploaded:
Fundamentally you want to be parsing collections of things which have similar properties - I'm inferring this from the way you presented your desired output as a table with rows containing the values.
You expect these collections of things to have the same number of values.
You need to be able to parse partial files.
You don't worry about the properties of elements, just their contents.
I'm using xml.sax as this deals with arbitrarily large files and doesn't need to read the whole file into memory. Note that the strategy I'm following now doesn't actually scale that well as I'm storing all the elements in memory to build the dataframe, but you could just as well output the paths and contents.
In the sample file there is a problem with having one row per Item since there are multiples of the Genre tag and there are also multiple Product tags. I've handled the repeated Genre tags by appending them. This relies on the Genre tags appearing consecutively. It is not at all clear how the Product relationships can be handled in a single table.
import xml.sax

from collections import defaultdict


class StructureParser(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.text = ''
        self.path = []
        self.datalist = defaultdict(list)
        self.previouspath = ''

    def startElement(self, name, attrs):
        self.path.append(name)

    def endElement(self, name):
        strippedtext = self.text.strip()
        path = '/'.join(self.path)
        if strippedtext != '':
            if path == self.previouspath:
                # This handles the "Genre" tags in the sample file
                self.datalist[path][-1] += f',{strippedtext}'
            else:
                self.datalist[path].append(strippedtext)
        self.path.pop()
        self.text = ''
        self.previouspath = path

    def characters(self, content):
        self.text += content
You'd use this like this:
parser = StructureParser()

try:
    xml.sax.parse('uyalicihow.xml', parser)
except xml.sax.SAXParseException:
    print('File probably ended too soon')
This will read the example file just fine.
Once this has read the file and probably printed "File probably ended too soon", you have the parsed contents in parser.datalist.
You obviously want to have just the parts which read successfully, so you can figure out the shortest list and build a DataFrame with just those paths:
import pandas as pd
smallest_items = min(len(e) for e in parser.datalist.values())
df = pd.DataFrame({key: value for key, value in parser.datalist.items() if len(value) == smallest_items})
This gives something similar to your desired output:
Items/Item/Main/Platform Items/Item/Main/PlatformID Items/Item/Main/Type
0 iTunes 353736518 TVEpisode
1 iTunes 495275084 TVEpisode
The columns for the test file which are matched here are
>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
'Items/Item/Info/HighestResolution',
'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
'Items/Item/Products/Product/URL'],
dtype='object')
Based on your comments, it appears as though it is more important to you to have all the elements represented, but perhaps just showing a preview, in which case you can perhaps use only the first elements from the data. Note that in this case the Products entries won't match the Item entries.
df = pd.DataFrame({key: value[:smallest_items] for key, value in parser.datalist.items()})
Now we get all the paths:
>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
'Items/Item/Info/HighestResolution',
'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
'Items/Item/Products/Product/URL',
'Items/Item/Products/Product/Offers/Offer/Price',
'Items/Item/Products/Product/Offers/Offer/Currency'],
dtype='object')
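As noted above, keeping every value in memory doesn't scale well, and you could output the paths and contents instead. A minimal sketch of that variant follows. It writes one path,value row per text-bearing element to a CSV writer as soon as the element closes. The names PathValueWriter and dump_paths are my own, and the Genre merging from the class above is deliberately left out.

```python
import csv
import io
import xml.sax
from xml.sax.handler import ContentHandler


class PathValueWriter(ContentHandler):
    """Write a (path, text) row for each element that contains text."""

    def __init__(self, out):
        super().__init__()
        self.writer = csv.writer(out)
        self.path = []   # tags of the currently open elements
        self.text = []   # text pieces collected since the last tag

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def endElement(self, name):
        value = "".join(self.text).strip()
        if value:
            self.writer.writerow(["/".join(self.path), value])
        self.path.pop()
        self.text = []

    def characters(self, content):
        self.text.append(content)


def dump_paths(source, out):
    """Stream source to out; return False if the XML ended too soon."""
    try:
        xml.sax.parse(source, PathValueWriter(out))
    except xml.sax.SAXParseException:
        return False
    return True


# Usage with a truncated in-memory sample (hypothetical data).
out = io.StringIO()
complete = dump_paths(io.BytesIO(b"<Items><Item><Main><Platform>iTunes</Platform><Type>TV"), out)
print(complete)
print(out.getvalue())
```

Only the open-element stack and the current text are held in memory, so this should cope with the full 5 GB file; the rows can then be loaded or aggregated with whatever tool you prefer.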
There are a number of tools around that will generate a schema from a supplied instance document. I don't know how many of them will work on a 5GB input file, and I don't know how many of them can be invoked from Python.
Many years ago I wrote a Java-based, fully streamable tool to generate a DTD from an instance document. It hasn't been touched in years but it should still run: https://sourceforge.net/projects/saxon/files/DTDGenerator/7.0/dtdgen7-0.zip/download?use_mirror=vorboss
There are other tools listed here: Any tools to generate an XSD schema from an XML instance document?
As I see it your question is very clear. I give it a plus one up vote for clearness. You are wanting to parse text.
Write a little text parser, we can call that EditorB, that reads in chunks of the file or at least line by line. Then edit or change it as you like and re-save that chunk or line.
It can be easy in Windows from 98SE on. It should be easy in other operating systems.
The process is (1) Adjust (manually or via program), as you currently do, we can call this EditorA, that is editing your XML document, and save it; (2) stop EditorA; (3) Run your parser or editor, EditorB, on the saved XML document either manually or automatically (started via detecting that the XML document has changed via date or time or size, etc.); (4) Using EditorB, save manually or automatically the edits from step 3; (5) Have your EditorA reload the XML document and go on from there; (6) do this as often as is necessary, making edits with EditorA and automatically adjusting them outside of EditorA by using EditorB.
Edit this way before you send the file.
It is a lot of typing to explain, but XML is just a glorified text document. It can be easily parsed and edited and saved, either character by character or by larger amounts line by line or in chunks.
As a further note, this can be applied via entire directory contained documents or system wide documents as I have done in the past.
Make certain that EditorA is stopped before EditorB is allowed to start its changes. Then stop EditorB before restarting EditorA. If you set this up as I described, then EditorB can be run continually in the background, but put in it an automatic notifier (maybe a message box with options, or a little button that is set foremost on the screen when activated) that allows you to turn off (or continue with) EditorA before using EditorB. Or, as I would do it, put in a detector to keep EditorB from executing its own edits as long as EditorA is running.
B Lean
I currently am trying to build an XML file from a CSV file. Currently my code reads the CSV file to data and begins creating the XML from the data that is stored within the CSV.
CSV Example:
Element,XMLFile
SubElement,XMLName,XMLFile
SubElement,XMLDate,XMLName
SubElement,XMLInformation,XMLDate
SubElement,XMLTime,XMLName
Expected Output:
<XMLFile>
<XMLName>
<XMLDate>
<XMLInformation />
</XMLDate>
<XMLTime />
</XMLName>
</XMLFile>
Currently my code attempts to look at the CSV to see what the parent is for the new subelement:
# Defines main element
# xmlElement = xml.Element(XMLFile)
xmlElement = xml.Element(csvData[rowNumber][columnNumber])
# Should Define desired parent (FAIL) and SubElement name (PASS)
# xmlSubElement = xml.SubElement(XMLFile, XMLName)
xmlSubElement = xml.SubElement(csvData[rowNumber][columnNumber + 2], csvData[rowNumber][columnNumber + 1])
When the code attempts to use the CSV source string as the parent parameter, Python 3.5 generates the following error:
TypeError: must be xml.etree.ElementTree.Element, not str
The known cause of the error is that the parent parameter is being passed as a string, when it is expected to be an Element or SubElement.
Is it possible to recall the stored value from the CSV and have it reference the Element or SubElement, instead of a string? The goal is to allow the code to read the CSV file and assign any SubElement to the parent listed in the CSV.
I cannot tell for sure, but it looks like you are doing:
ElementTree.SubElement(str, str)
when you should be doing:
ElementTree.SubElement(Element, str)
It also seems like you already know this. The real question, then, is how are you going to reference the parent object when you only know its tag string? You could search for Elements in the ElementTree with that particular tag string, but this is generally not a good idea as XML allows multiple instances of similar elements.
I would suggest you either:
Find a strategy to store references to parent elements
See if there is a way to uniquely identify the parent element using XPath
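A minimal sketch of the first suggestion, storing references to the elements as they are created so that a later row can look up its parent by name. It assumes the CSV layout shown in the question and that every name is unique within the file. The function name build_tree is my own.

```python
import csv
import io
import xml.etree.ElementTree as ET


def build_tree(rows):
    """Build an element tree from (kind, name[, parent]) rows."""
    elements = {}   # name -> Element created for that name
    root = None
    for row in rows:
        if not row:
            continue  # skip blank lines
        kind, name = row[0], row[1]
        if kind == "Element":
            root = ET.Element(name)
            elements[name] = root
        elif kind == "SubElement":
            # Raises KeyError if the parent has not been defined yet.
            parent = elements[row[2]]
            elements[name] = ET.SubElement(parent, name)
    return root


# Usage with the CSV example from the question.
csv_text = """Element,XMLFile
SubElement,XMLName,XMLFile
SubElement,XMLDate,XMLName
SubElement,XMLInformation,XMLDate
SubElement,XMLTime,XMLName
"""
root = build_tree(csv.reader(io.StringIO(csv_text)))
print(ET.tostring(root, encoding="unicode"))
```

If names can repeat, the dictionary would need a different key, for example a row identifier or a full path, which is where the XPath suggestion comes in.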
This is one of my first forays into Python. I'd normally stick with bash, however Minidom seems to perfectly suit my needs for XML parsing, so I'm giving it a shot.
First question which I can't seem to figure out is, what's the equivalent for 'grep -v' when parsing a file?
Each object I'm pulling begins with a specific tag. If, within said tag, I want to exclude a row of data based off of a certain string embedded within the tag, how do I accomplish this?
Pseudo code that I've got now (no exclusion):
mainTag = xml.getElementsByTagName("network_object")
name = network_object.getElementsByTagName("Name")[0].firstChild.data
I'd like to see the data output all "name" fields, with the exception of strings that contain "cluster". Since I'll be doing multiple searches on network_objects, I believe I need to do it at that level, but don't know how.
Etree is giving me a ton of problems, can you give me some logic to do this with minidom?
This obviously doesn't work:
name = network_object.getElementsByTagName("Name")[0].firstChild.data
if name is not 'cluster' in name
continue
First of all, step away from the minidom module. Minidom is great if you already know the DOM from other languages and really do not want to learn any other API. There are easier alternatives available, right there in the standard library. I'd use the ElementTree API instead.
You generally just loop over matches, and skip over the ones that you want to exclude as you do so:
from xml.etree import ElementTree

tree = ElementTree.parse(somefile)

for name in tree.findall('.//network_object//Name'):
    if name.text is not None and 'cluster' in name.text:
        continue  # skip this one
    print(name.text)  # every name that does not contain 'cluster'
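If you do want to stay with minidom, the same loop-and-skip idea carries over. The sketch below assumes each network_object has at most one Name child worth reading. The function name names_without and the sample XML are my own, for illustration only.

```python
from xml.dom import minidom


def names_without(doc, excluded="cluster"):
    """Return the Name text of each network_object, skipping names containing excluded."""
    names = []
    for obj in doc.getElementsByTagName("network_object"):
        name_nodes = obj.getElementsByTagName("Name")
        if not name_nodes or name_nodes[0].firstChild is None:
            continue  # no Name element, or an empty one
        name = name_nodes[0].firstChild.data
        if excluded in name:
            continue  # the 'grep -v' step
        names.append(name)
    return names


# Usage with a small in-memory document (hypothetical sample data).
sample = """<root>
  <network_object><Name>fw1</Name></network_object>
  <network_object><Name>cluster-a</Name></network_object>
  <network_object><Name>host2</Name></network_object>
</root>"""
print(names_without(minidom.parseString(sample)))
```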
I have to do an assignment where I have a .txt file that contains something like this:
p
There is no one who loves pain itself, who seeks after it and wants to
have it, simply because it is pain...
h1
this is another example of what this text file looks like
I am supposed to write Python code that parses this text file and creates an XHTML file.
I need to find a starting point for this project because I am very new to Python and not familiar with a lot of this stuff.
This Python code is supposed to take each of these "tags" from the text file and put them into an XHTML file. I hope that what I ask makes sense to you.
Any help is greatly appreciated,
Thanks in advance!
-bojan
You say you're very new to Python, so I'll start at a very low level. You can iterate over the lines in a file very simply in Python:
fyle = open("contents.txt")
for lyne in fyle:
    # Do string processing here
    pass
fyle.close()
Now how to parse it. If each formatting directive (e.g. p, h1), is on a separate line, you can check that easily. I'd build up a dictionary of handlers and get the handler like so:
handlers= {"p": # p tag handler
"h1": # h1 tag handler
}
# ... in the loop
if lyne.rstrip() in handlers : # strip to remove trailing whitespace
# close current handler?
# start new handler?
else :
# pass string to current handler
You could do what Daniel Pryden suggested and create an in-memory data structure first, and then serialize that to XHTML. In that case, the handlers would know how to build the objects corresponding to each tag. But I think the simpler solution, especially if you don't have lots of time, is to go straight to XHTML, keeping a stack of the currently enclosed tags. In that case your "handler" may just be some simple logic to write the tags to the output file/string.
I can't say more without knowing the specifics of your problem. And besides, I don't want to do all your homework for you. This should give you a good start.
Rather than going directly from the text file you describe to an XHTML file, I would transform it into an intermediate in-memory representation first.
So I would build classes to represent the p and h1 tags, and then go through the text file and build those objects and put them into a list (or even a more complex object, but from the looks of your file a list should be sufficient). Then I would pass the list to another function that would loop through the p and h1 objects and output them as XHTML.
As an added bonus, I would make each tag object (say, Paragraph and Heading1 classes) implement an as_xhtml() method, and delegate the actual formatting to that method. Then the XHTML output loop could be something like:
for tag in input_tags:
    xhtml_file.write(tag.as_xhtml())
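To make the idea a bit more concrete, here is a rough sketch of what such tag objects could look like. Paragraph, Heading1 and as_xhtml() are the names suggested above; the Block base class, TAG_CLASSES and parse_blocks are my own assumptions. It treats any line consisting only of p or h1 as the start of a new block, so a content line that is exactly "p" would be misread.

```python
from xml.sax.saxutils import escape


class Block:
    """Base class: collects text lines and renders them inside one tag."""
    tag = None

    def __init__(self):
        self.lines = []

    def add_line(self, line):
        self.lines.append(line.strip())

    def as_xhtml(self):
        # Join the collected lines and escape &, < and > for XHTML.
        text = escape(" ".join(line for line in self.lines if line))
        return "<{0}>{1}</{0}>\n".format(self.tag, text)


class Paragraph(Block):
    tag = "p"


class Heading1(Block):
    tag = "h1"


TAG_CLASSES = {"p": Paragraph, "h1": Heading1}


def parse_blocks(lines):
    """Turn the lines of the text file into a list of tag objects."""
    blocks = []
    for line in lines:
        key = line.strip()
        if key in TAG_CLASSES:
            blocks.append(TAG_CLASSES[key]())   # a new block starts here
        elif blocks and key:
            blocks[-1].add_line(key)            # text belongs to the current block
    return blocks


# Usage with a few in-memory lines (hypothetical sample data).
input_tags = parse_blocks(["p\n", "There is no one\n", "who loves pain\n", "h1\n", "A & B\n"])
for tag in input_tags:
    print(tag.as_xhtml(), end="")
```

Writing the surrounding html, head and body elements around this output is left out of the sketch.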