How to iteratively parse a large XML file in Python?

I need to process an XML file that is approximately 8 GB.
The file's structure is (simplified) similar to the below:
<TopLevelElement>
    <SomeElementList>
        <Element>zzz</Element>
        ....and so on for thousands of rows
    </SomeElementList>
    <Records>
        <RecordType1>
            <RecordItem id="aaaa">
                <SomeData>
                    <SomeMoreData NameType="xxx">
                        <NameComponent1>zzz</NameComponent1>
                        ....
                        <AnotherNameComponent>zzzz</AnotherNameComponent>
                    </SomeMoreData>
                </SomeData>
            </RecordItem>
            ..... hundreds of thousands of items, some are quite large.
        </RecordType1>
        <RecordType2>
            <RecordItem id="cccc">
                ...hundreds of thousands of RecordType2 elements, slightly different from RecordItems in RecordType1
            </RecordItem>
        </RecordType2>
    </Records>
</TopLevelElement>
I need to extract some of the sub-elements of the RecordType1 and RecordType2 elements. There are conditions that determine which record items need to be processed and which fields need to be extracted. The individual RecordItems do not exceed 120k (some have extensive text data, which I do not need).
Here is the code. The function get_all_records receives the following inputs: a) the path to the XML file; b) the record category ('RecordType1' or 'RecordType2'); c) which name components to pick.
from xml.etree import cElementTree as ET

def get_all_records(xml_file_path, record_category, name_types, name_components):
    context = ET.iterparse(xml_file_path, events=("start", "end"))
    context = iter(context)
    event, root = next(context)
    all_records = []
    for event, elem in context:
        if event == 'end' and elem.tag == record_category and elem.attrib['action'] != 'del':
            record_contents = get_record(elem, name_types=name_types, name_components=name_components, record_id=elem.attrib['id'])
            if record_contents:
                all_records += record_contents
        root.clear()
    return all_records
I have experimented with the number of records: the code processes 100k RecordItems (Type1 only; it simply takes too long to reach Type2) in approximately one minute.
Attempting to process a larger number of records (I tried one million) eventually leads to a MemoryError in ElementTree.py.
So I am guessing no memory is released despite the root.clear() statement.
An ideal solution would be one where the RecordItems are read one at a time, processed, and then discarded from memory, but I have no clue how to do that.
From an XML point of view, the two extra layers of elements (TopLevelElement and Records) seem to complicate the task.
I am new to XML and to the respective Python libraries, so a detailed explanation would be much appreciated!

Iterating over a huge XML file is always painful.
I'll go over the whole process from start to finish, suggesting best practices for keeping memory usage low while maximizing parsing speed.
First, there is no need to store ET.iterparse in a variable. Just iterate over it directly:
for event, elem in ET.iterparse(xml_file, events=("start", "end")):
This iterator is made for, well, iteration, without keeping anything in memory except the current element. You also don't need root.clear() with this new approach, and you can go as far as your hard disk space allows for huge XML files.
Your code should look like:
from xml.etree import cElementTree as ET

def get_all_records(xml_file_path, record_category, name_types, name_components):
    all_records = []
    for event, elem in ET.iterparse(xml_file_path, events=("start", "end")):
        if event == 'end' and elem.tag == record_category and elem.attrib['action'] != 'del':
            record_contents = get_record(elem, name_types=name_types, name_components=name_components, record_id=elem.attrib['id'])
            if record_contents:
                all_records += record_contents
    return all_records
Also, please think carefully about whether you really need to store the whole all_records list. If it's only for writing a CSV file at the end of the process, that reason isn't good enough and can cause memory issues when scaling to even bigger XML files.
Make sure you write each new row to the CSV as that row is produced, turning memory issues into a non-issue.
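As a minimal sketch of that streaming approach (get_record() comes from the question and is assumed to return a list of rows; the csv handling and the elem.clear() call are illustrative additions, not part of the original answer):
import csv
from xml.etree import cElementTree as ET

def stream_records_to_csv(xml_file_path, record_category, name_types, name_components, csv_path):
    with open(csv_path, 'w') as out:
        writer = csv.writer(out)
        for event, elem in ET.iterparse(xml_file_path, events=("end",)):
            if elem.tag == record_category and elem.attrib['action'] != 'del':
                record_contents = get_record(elem, name_types=name_types,
                                             name_components=name_components,
                                             record_id=elem.attrib['id'])
                for row in record_contents or []:
                    writer.writerow(row)
                elem.clear()  # discard the processed subtree so memory stays flat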
P.S.
If you need to remember a few tags that appear before you reach your main tag, so that this earlier information is available as you move down the XML file, just store it locally in some new variables. This comes in handy whenever later data in the XML file requires something from a tag you know has already occurred, as in the sketch below.
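A minimal sketch of that idea, using tag names from the question's simplified structure (the file name and the process() call are hypothetical):
from xml.etree import cElementTree as ET

last_list_entry = None  # data remembered from an earlier element
for event, elem in ET.iterparse('big.xml', events=("start", "end")):
    if event == 'end' and elem.tag == 'Element':
        last_list_entry = elem.text              # keep only the piece you need
    elif event == 'end' and elem.tag == 'RecordItem':
        process(elem, context=last_list_entry)   # hypothetical consumer
        elem.clear()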


Elasticsearch-Py bulk not indexing all documents

I am using the elasticsearch-py Python package to interact with Elasticsearch through code. I have a script that is meant to take each document from one index, generate a field + value, then re-index it into a new index.
The issue is that there are 1,216 documents in the first index, but only 1,000 documents make it to the second one. Typically it is exactly 1,000 documents, occasionally a bit higher, around 1,100, but it never reaches the full 1,216.
I usually keep batch_size at 200, but changing it seems to have some effect on the number of documents that make it to the second index. Changing it to 400 typically results in about 800 documents being transferred. Using parallel_bulk seems to give the same results as using bulk.
I believe the issue is with the generating process I am performing. For each document I am generating its ancestry (they are organized in a tree structure) by recursively getting its parent from the first index. This involves rapid document GET requests interwoven with Bulk API calls to index the documents and Scroll API calls to get the documents from the index in the first place.
Would activity like this cause the documents to not go through? If I remove (comment out) the recursive GET requests, all documents seem to go through every time. I have tried creating multiple Elasticsearch clients, but that wouldn't even help if ES itself is the bottleneck.
Here is the code if you're curious:
def complete_resources():
    for result in helpers.scan(client=es, query=query, index=TEMP_INDEX_NAME):
        resource = result["_source"]
        ancestors = []
        parent = resource.get("parent")
        while parent is not None:
            ancestors.append(parent)
            parent = es.get(
                index=TEMP_INDEX_NAME,
                doc_type=TEMPORARY_DOCUMENT_TYPE,
                id=parent["uid"]
            ).get("_source").get("parent")
        resource["ancestors"] = ancestors
        resource["_id"] = resource["uid"]
        yield resource
This generator is consumed by helpers.parallel_bulk()
for success, info in helpers.parallel_bulk(
        client=es,
        actions=complete_resources(),
        thread_count=10,
        queue_size=12,
        raise_on_error=False,
        chunk_size=INDEX_BATCH_SIZE,
        index=new_primary_index_name,
        doc_type=PRIMARY_DOCUMENT_TYPE,
):
    if success:
        successful += 1
    else:
        failed += 1
        print('A document failed:', info)
This gives me the following result:
Time: 7 seconds
Successful: 1000
Failed: 0
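One way to test the hypothesis that the interleaved GET requests are the cause (a sketch only, not a verified fix; the generator name is new, and it assumes the uid-to-parent pairs fit in memory): pre-load the parent links with a single scan, then resolve each document's ancestry locally so that no GET requests are issued while the scroll and bulk calls are running.
def complete_resources_prefetched():
    # First pass: build an in-memory uid -> parent map with one scan.
    parent_by_uid = {}
    for result in helpers.scan(client=es, query=query, index=TEMP_INDEX_NAME):
        src = result["_source"]
        parent_by_uid[src["uid"]] = src.get("parent")

    # Second pass: same generator as before, but ancestry is resolved from the
    # local map instead of per-document GET requests against the live index.
    for result in helpers.scan(client=es, query=query, index=TEMP_INDEX_NAME):
        resource = result["_source"]
        ancestors = []
        parent = resource.get("parent")
        while parent is not None:
            ancestors.append(parent)
            parent = parent_by_uid.get(parent["uid"])
        resource["ancestors"] = ancestors
        resource["_id"] = resource["uid"]
        yield resource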

How to efficiently detect an XML schema without having the entire file in Python

I have a very large feed file that is sent as an XML document (5GB). What would be the fastest way to parse the structure of the main item node without previously knowing its structure? Is there a means in Python to do so 'on-the-fly' without having the complete xml loaded in memory? For example, what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?
Update: I've included an example XML fragment here: https://hastebin.com/uyalicihow.xml. I'm looking to extract something like a dataframe (or list or whatever other data structure you want to use) similar to the following:
Items/Item/Main/Platform    Items/Item/Info/Name
iTunes                      Chuck Versus First Class
iTunes                      Chuck Versus Bo
How could this be done? I've added a bounty to encourage answers here.
Several people have misinterpreted this question, and re-reading it, it's really not at all clear. In fact there are several questions.
How to detect an XML schema
Some people have interpreted this as saying you think there might be a schema within the file, or referenced from the file. I interpreted it as meaning that you wanted to infer a schema from the content of the instance.
What would be the fastest way to parse the structure of the main item node without previously knowing its structure?
Just put it through a parser, e.g. a SAX parser. A parser doesn't need to know the structure of an XML file in order to split it up into elements and attributes. But I don't think you actually want the fastest parse possible (in fact, I don't think performance is that high on your requirements list at all). I think you want to do something useful with the information (you haven't told us what): that is, you want to process the information, rather than just parsing the XML.
Is there a python utility that can do so 'on-the-fly' without having
the complete xml loaded in memory?
Yes, according to this page which mentions 3 event-based XML parsers in the Python world: https://wiki.python.org/moin/PythonXml (I can't vouch for any of them)
what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?
I'm not sure you know what the verb "to parse" actually means. Your phrase certainly suggests that you expect the file to contain a schema, which you want to extract. But I'm not at all sure you really mean that. And in any case, if it did contain a schema in the first 5 MB, you could find it just by reading the file sequentially; there would be no need to "save" the first part of the file first.
Question: way to parse the structure of the main item node without previously knowing its structure
This class, TopSequenceElement, parses an XML file to find all sequence elements.
The default is to break at the FIRST closing </...> of the topmost element.
It is therefore independent of the file size and even works with truncated files.
from lxml import etree
from collections import OrderedDict

class TopSequenceElement(etree.iterparse):
    """
    Read XML File
    results: .seq == OrderedDict of Sequence Element
             .element == topmost closed </..> Element
             .xpath == XPath to top_element
    """
    class Element:
        """
        Classify a Element
        """
        SEQUENCE = (1, 'SEQUENCE')
        VALUE = (2, 'VALUE')

        def __init__(self, elem, event):
            if len(elem):
                self._type = self.SEQUENCE
            else:
                self._type = self.VALUE
            self._state = [event]
            self.count = 0
            self.parent = None
            self.element = None

        @property
        def state(self):
            return self._state

        @state.setter
        def state(self, event):
            self._state.append(event)

        @property
        def is_seq(self):
            return self._type == self.SEQUENCE

        def __str__(self):
            return "Type:{}, Count:{}, Parent:{:10} Events:{}"\
                .format(self._type[1], self.count, str(self.parent), self.state)

    def __init__(self, fh, break_early=True):
        """
        Initialize 'iterparse' only to callback at 'start'|'end' Events
        :param fh: File Handle of the XML File
        :param break_early: If True, break at FIRST closing </..> of the topmost Element
                            If False, run until EOF
        """
        super().__init__(fh, events=('start', 'end'))
        self.seq = OrderedDict()
        self.xpath = []
        self.element = None
        self.parse(break_early)

    def parse(self, break_early):
        """
        Parse the XML Tree, doing
          classify the Element, process only SEQUENCE Elements
          record, count of end </...> Events,
                  parent from this Element
                  element Tree of this Element
        :param break_early: If True, break at FIRST closing </..> of the topmost Element
        :return: None
        """
        parent = []
        try:
            for event, elem in self:
                tag = elem.tag
                _elem = self.Element(elem, event)
                if _elem.is_seq:
                    if event == 'start':
                        parent.append(tag)
                        if tag in self.seq:
                            self.seq[tag].state = event
                        else:
                            self.seq[tag] = _elem
                    elif event == 'end':
                        parent.pop()
                        if parent:
                            self.seq[tag].parent = parent[-1]
                        self.seq[tag].count += 1
                        self.seq[tag].state = event
                        if self.seq[tag].count == 1:
                            self.seq[tag].element = elem
                        if break_early and len(parent) == 1:
                            break
        except etree.XMLSyntaxError:
            pass
        finally:
            """
            Find the topmost completed '<tag>...</tag>' Element
            Build .seq.xpath
            """
            for key in list(self.seq):
                self.xpath.append(key)
                if self.seq[key].count > 0:
                    self.element = self.seq[key].element
                    break
            self.xpath = '/'.join(self.xpath)

    def __str__(self):
        """
        String Representation of the Result
        :return: .xpath and list of .seq
        """
        return "Top Sequence Element:{}\n{}"\
            .format(self.xpath,
                    '\n'.join(["{:10}:{}"
                              .format(key, elem) for key, elem in self.seq.items()
                              ])
                    )

if __name__ == "__main__":
    with open('../test/uyalicihow.xml', 'rb') as xml_file:
        tse = TopSequenceElement(xml_file)
        print(tse)
Output:
Top Sequence Element:Items/Item
Items :Type:SEQUENCE, Count:0, Parent:None Events:['start']
Item :Type:SEQUENCE, Count:1, Parent:Items Events:['start', 'end', 'start']
Main :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end']
Info :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end']
Genres :Type:SEQUENCE, Count:2, Parent:Item Events:['start', 'end', 'start', 'end']
Products :Type:SEQUENCE, Count:1, Parent:Item Events:['start', 'end']
... (omitted for brevity)
Step 2: Now that you know there is a <Main> tag, you can do:
print(etree.tostring(tse.element.find('Main'), pretty_print=True).decode())
<Main>
  <Platform>iTunes</Platform>
  <PlatformID>353736518</PlatformID>
  <Type>TVEpisode</Type>
  <TVSeriesID>262603760</TVSeriesID>
</Main>
Step 3: Now that you know there is a <Platform> tag, you can do:
print(etree.tostring(tse.element.find('Main/Platform'), pretty_print=True).decode())
<Platform>iTunes</Platform>
Tested with Python:3.5.3 - lxml.etree:3.7.1
For very big files, reading is always a problem. I would suggest a simple algorithmic approach to reading the file itself. The key point is always the XML tags inside the file: read the tags, sort them inside a heap, and then validate the contents of the heap accordingly.
Reading the file should also happen in chunks:
import xml.etree.ElementTree as etree

for event, elem in etree.iterparse(xml_file, events=('start', 'end', 'start-ns', 'end-ns')):
    store_in_heap(event, elem)
This will parse the XML file a chunk at a time and hand it to you at every step of the way. start fires when a tag is first encountered; at this point elem will be empty except for elem.attrib, which contains the properties of the tag. end fires when the closing tag is encountered and everything in between has been read.
You can also benefit from the namespaces delivered by the start-ns and end-ns events, which ElementTree provides so you can gather all the namespaces used in the file.
Refer to this link for more information about namespaces
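For example, a small sketch (the file name is hypothetical) that gathers the namespace declarations as they appear:
import xml.etree.ElementTree as etree

namespaces = {}
for event, ns in etree.iterparse('feed.xml', events=('start-ns',)):
    # For 'start-ns' events the payload is a (prefix, uri) tuple.
    prefix, uri = ns
    namespaces[prefix] = uri
print(namespaces)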
My interpretation of your needs is that you want to be able to parse the partial file and build up the structure of the document as you go. I've taken some assumptions from the file you uploaded:
Fundamentally you want to be parsing collections of things which have similar properties - I'm inferring this from the way you presented your desired output as a table with rows containing the values.
You expect these collections of things to have the same number of values.
You need to be able to parse partial files.
You don't worry about the properties of elements, just their contents.
I'm using xml.sax as this deals with arbitrarily large files and doesn't need to read the whole file into memory. Note that the strategy I'm following now doesn't actually scale that well as I'm storing all the elements in memory to build the dataframe, but you could just as well output the paths and contents.
In the sample file there is a problem with having one row per Item since there are multiples of the Genre tag and there are also multiple Product tags. I've handled the repeated Genre tags by appending them. This relies on the Genre tags appearing consecutively. It is not at all clear how the Product relationships can be handled in a single table.
import xml.sax
from collections import defaultdict

class StructureParser(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.text = ''
        self.path = []
        self.datalist = defaultdict(list)
        self.previouspath = ''

    def startElement(self, name, attrs):
        self.path.append(name)

    def endElement(self, name):
        strippedtext = self.text.strip()
        path = '/'.join(self.path)
        if strippedtext != '':
            if path == self.previouspath:
                # This handles the "Genre" tags in the sample file
                self.datalist[path][-1] += f',{strippedtext}'
            else:
                self.datalist[path].append(strippedtext)
        self.path.pop()
        self.text = ''
        self.previouspath = path

    def characters(self, content):
        self.text += content
You'd use this like this:
parser = StructureParser()
try:
    xml.sax.parse('uyalicihow.xml', parser)
except xml.sax.SAXParseException:
    print('File probably ended too soon')
This will read the example file just fine.
Once this has run and probably printed "File probably ended too soon", you have the parsed contents in parser.datalist.
You obviously want to have just the parts which read successfully, so you can figure out the shortest list and build a DataFrame with just those paths:
import pandas as pd

smallest_items = min(len(e) for e in parser.datalist.values())
df = pd.DataFrame({key: value for key, value in parser.datalist.items() if len(value) == smallest_items})
This gives something similar to your desired output:
  Items/Item/Main/Platform  Items/Item/Main/PlatformID  Items/Item/Main/Type
0                   iTunes                   353736518             TVEpisode
1                   iTunes                   495275084             TVEpisode
The columns for the test file which are matched here are
>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
'Items/Item/Info/HighestResolution',
'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
'Items/Item/Products/Product/URL'],
dtype='object')
Based on your comments, it appears to be more important to you to have all the elements represented, perhaps just showing a preview, in which case you can use only the first few entries from the data. Note that in this case the Products entries won't match the Item entries.
df = pd.DataFrame({key: value[:smallest_items] for key, value in parser.datalist.items()})
Now we get all the paths:
>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
'Items/Item/Info/HighestResolution',
'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
'Items/Item/Products/Product/URL',
'Items/Item/Products/Product/Offers/Offer/Price',
'Items/Item/Products/Product/Offers/Offer/Currency'],
dtype='object')
There are a number of tools around that will generate a schema from a supplied instance document. I don't know how many of them will work on a 5 GB input file, and I don't know how many of them can be invoked from Python.
Many years ago I wrote a Java-based, fully streamable tool to generate a DTD from an instance document. It hasn't been touched in years but it should still run: https://sourceforge.net/projects/saxon/files/DTDGenerator/7.0/dtdgen7-0.zip/download?use_mirror=vorboss
There are other tools listed here: Any tools to generate an XSD schema from an XML instance document?
As I see it, your question is very clear. I give it a plus-one upvote for clarity. You want to parse text.
Write a little text parser, which we can call EditorB, that reads the file in chunks or at least line by line. Then edit or change it as you like and re-save that chunk or line.
It can be easy in Windows from 98SE on. It should be easy in other operating systems.
The process is: (1) adjust, as you currently do (we can call this EditorA), editing your XML document manually or via a program, and save it; (2) stop EditorA; (3) run your parser or editor, EditorB, on the saved XML document, either manually or automatically (started by detecting that the XML document has changed via its date, time, size, etc.); (4) using EditorB, save the edits from step 3, manually or automatically; (5) have EditorA reload the XML document and carry on from there; (6) repeat as often as necessary, making edits with EditorA and automatically adjusting them outside of EditorA by using EditorB.
Edit this way before you send the file.
It is a lot of typing to explain, but XML is just a glorified text document. It can easily be parsed, edited, and saved, either character by character or in larger amounts, line by line or in chunks.
As a further note, this can be applied to all the documents in a directory, or to documents system-wide, as I have done in the past.
Make certain that EditorA is stopped before EditorB is allowed to start its changes. Then stop EditorB before restarting EditorA. If you set this up as I described, EditorB can run continually in the background, but put in an automatic notifier (maybe a message box with options, or a little button that is kept foremost on the screen when activated) that allows you to turn off (or continue with) EditorA before using EditorB. Or, as I would do it, put in a detector to keep EditorB from executing its own edits as long as EditorA is running.
B Lean

Change an attrib value while ET.iterparse() through an XML .osm file in Python

I am trying to write a new attribute value to an XML element while it is being parsed with ET.iterparse() in a for loop. Suggestions on how to do this?
I want to avoid opening the whole XML file because it is quite large, which is why I am only opening a single element at its start event at any one time.
Here is the code that I have:
import xml.etree.cElementTree as ET

def main_function():
    osmfile = 'sample.osm'
    osm_file = open(osmfile, 'r+')
    for event, elem in ET.iterparse(osm_file, events=('start',)):
        if elem.tag == 'node':
            for tag in elem.iter('tag'):
                if is_addr_street_tag(tag):  # Function returns boolean
                    cleaned_street_name = cleaning_street(tag.attrib['v'])  # Function returns cleaned street name
                    ##===================================================##
                    ## Write cleaned_street_name to XML tag attrib value ##
                    ##===================================================##
    osm_file.close()
BLUF: Apparently it is not possible to do that without opening the whole XML file and then later rewriting the whole XML file.
1) You cannot write the attribute back to the element (although you actually can, but it would be difficult, time-consuming, and inelegant)
2) "It is physically impossible to replace a text in a file with a shorter or longer text without rewriting the entire file. (The very only exceptions being "exactly the same length text" and "the data is at the very end".)"
Here is the comment from usr2564301 on a question related to yours about changing an attribute value of an element without opening the whole XML document.
That cannot possibly work. The XML handling is unaware that the data came from a file and so it cannot "write back" the changed value at the exact same position in the file. Even if it could: it is physically impossible to replace a text in a file with a shorter or longer text without rewriting the entire file. (The very only exceptions being "exactly the same length text" and "the data is at the very end".) – usr2564301
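If rewriting the whole file is acceptable, a common pattern is to stream the input with iterparse and write a corrected copy. Below is a sketch only: is_addr_street_tag() and cleaning_street() are the question's own helpers, the output wrapping assumes a plain <osm> root without attributes, and only node/way/relation elements are copied.
import xml.etree.cElementTree as ET

def rewrite_streets(in_path, out_path):
    with open(out_path, 'wb') as out:
        out.write(b'<?xml version="1.0" encoding="UTF-8"?>\n<osm>\n')
        context = ET.iterparse(in_path, events=('start', 'end'))
        _, root = next(context)
        for event, elem in context:
            if event == 'end' and elem.tag in ('node', 'way', 'relation'):
                if elem.tag == 'node':
                    for tag in elem.iter('tag'):
                        if is_addr_street_tag(tag):
                            tag.set('v', cleaning_street(tag.attrib['v']))
                out.write(ET.tostring(elem))
                root.clear()  # drop already-written elements to keep memory flat
        out.write(b'</osm>\n')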

How to get all text inside XML tags

[xml file snapshot]
From the above .xml file I am extracting article-id, article-title, abstract, and keywords. For normal text inside a single tag I get correct results, but not for text spread across multiple tags, such as:
<title-group>
<article-title>
Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,
<italic>Rapidithrix thailandica</italic>
</article-title>
</title-group>
The same applies to the abstract.
I got the following output:
OrderedDict([(u'italic**', u'Rapidithrix thailandica'), ('#text', u'Acetylcholines terase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Ba cterium,')])
The code has treated the tag as text, and the generated output is also out of sequence.
How can I simply extract the text from such an input document as "Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium, Rapidithrix thailandica"?
I am using the Python code below to perform the above task.
import xmltodict
import os
from os.path import basename
import re

with open('2630847.nxml') as fd:
    doc = xmltodict.parse(fd.read())

pmc_id = doc['article']['front']['article-meta']['article-id'][1]['#text']
article_title = doc['article']['front']['article-meta']['title-group']['article-title']

y = doc['article']['front']['article-meta']['abstract']
y = y.items()[0]
article_abstract = [g.encode('ascii', 'ignore') for g in y][1]

z = doc['article']['front']['article-meta']['kwd-group']['kwd']
zz = [g.encode('ascii', 'ignore') for g in z]
article_keywords = ",".join(zz).replace(",", " ")

fout = open(str(pmc_id) + ".txt", "w")
fout.write(str(pmc_id) + "\n" + str(article_title) + ". " + str(article_abstract) + ". " + str(article_keywords))
Can somebody please suggest corrections?
xmltodict will likely be hard to use for your data. PMC journal articles are definitely not what the authors could have had in mind. Putting any but the most trivial XML into xmltodict is pounding a round peg into a square hole -- you might succeed, but it won't be pretty. I explain this further below under "tldr"....
Instead, I suggest you use a library whose data model fits your data better, such as xml.dom.minidom or recent versions of BeautifulSoup. In many such libraries you just load the document with one call and then call some function like innerText() to get all of its text content. You could even just load the document into a browser and call the JavaScript innerText() function to get what you want. If the tool you choose doesn't provide innertext() already, it is:
from xml.dom.minidom import Text, Element

def innertext(node):
    # Concatenate all descendant text nodes, recursing into child elements.
    t = ""
    for curNode in node.childNodes:
        if isinstance(curNode, Text):
            t += curNode.nodeValue
        elif isinstance(curNode, Element):
            t += innertext(curNode)
    return t
You could tweak that to put spaces between the text nodes, depending on your data.
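For example, a hypothetical usage with the question's file, pulling the complete article title out in document order (exact whitespace may differ):
from xml.dom import minidom

doc = minidom.parse('2630847.nxml')
title = doc.getElementsByTagName('article-title')[0]
print(innertext(title).strip())
# e.g. "Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives
#       from a Novel Marine Gliding Bacterium, Rapidithrix thailandica"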
Hope that helps.
==tldr==
xmltodict makes an admirable attempt at making XML "as simple as possible"; but IMHO it errs in making it simpler than possible.
xmltodict basically works by turning every element into a dict, with its children as the dict items, keyed by their element names. But in many cases (such as yours), XML data isn't very much like that at all. For example, an element can have many children with the same name, but a dict can't.
So xmltodict has to do something special. It turns adjacent instances of the same element type into an array (without the element type). Here's an example excerpted from https://github.com/martinblech/xmltodict:
<and>
  <many>elements</many>
  <many>more elements</many>
</and>
becomes:
"and": {
    "many": [
        "elements",
        "more elements"
    ]
},
First off, this means that xmltodict always loses the ordering information about child elements unless they are of the same type. So a section that contains a mix of paragraphs, lists, blockquotes, and so on, will either fail to load in xmltodict, or have all the scattered instances of each kind of child gathered together, completely losing their order.
The xmltodict approach also introduces frequent special-cases -- for example, you can't just get a list of all the children, or use len() to find out how many there are, etc. etc., because at every step you have to check whether you're really at a child element, or at a list of them.
Looking at xmltodict's own examples, you'll see that they mostly consist of walking down the tree by element names, but every now and then there's an integer subscript -- that's for the cases where these arrays are needed. But unless the data is unusually simple (which yours isn't), you won't know where that is. For example, if one DIV in an HTML document happens to contain only one P, the code to access the P needs one fewer subscript than with another DIV that happens to have more than one P.
It seems to me undesirable that the number of subscripts to get to something depends on how many siblings it has, and their types.
Alas, the structure still isn't good enough. Since child elements may have their own child elements, just making them strings in that extra array won't be enough. Sometimes they'll have to be dicts again, with some of their items in turn perhaps being arrays, some of whose items may be dicts, and so on. Writing the correct traversal algorithm to gather up the text is significantly harder than the DOM one shown above.
To be completely fair, there is some XML in which the order doesn't matter logically -- for example, you could export a SQL table into an XML file, using a container element for each record with a child element for each field. The order of fields is not information, so if you load such XML into xmltodict, losing the order doesn't matter. Likewise if you serialized Python data that was already just a dict. But those are very specialized edge cases. xmltodict might be an excellent choice for a case like that -- but the articles you're looking at are very far from that.
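To illustrate that edge case, here is a tiny, self-contained example (made-up data) of record-per-row XML where xmltodict's dict/list model fits naturally:
import xmltodict

xml = """
<table>
  <row><id>1</id><name>ada</name></row>
  <row><id>2</id><name>bob</name></row>
</table>
"""

doc = xmltodict.parse(xml)
for row in doc['table']['row']:   # adjacent <row> elements become a list
    print(row['id'], row['name'])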

RSS is not updated immediately

I have a little script that monitors the RSS feed for 'new questions tagged with python', specifically on SO. It stores the feed in a variable on the first iteration of the loop and then constantly checks the feed against the one stored in the variable. If the feed changes, it updates the variable, outputs the newest entry to the console, and plays a sound file to alert me that there are new questions. All in all, it's quite handy as I don't have to keep an eye on anything. However, there are time discrepancies between new questions actually being posted and my script detecting feed updates. These discrepancies seem to vary in length, but generally it isn't instant and tends not to alert me before there has been enough action on a question that it's been pretty much dealt with. Not always the case, but generally. Is there a way for me to ensure much faster updates/alerts? Or is this as good as it gets? (It's crossed my mind that this particular feed is only updated when there is actually action on a question... anyone know if that's the case?)
Have I misunderstood the way that rss actually works?
import urllib2
import mp3play
import time
from xml.dom import minidom

def SO_notify():
    """ play alarm when rss is updated """
    rss = ''
    filename = "path_to_soundfile"
    mp3 = mp3play.load(filename)
    mp3.volume(25)
    while True:
        html = urllib2.urlopen("http://stackoverflow.com/feeds/tag?tagnames=python&sort=newest")
        new_rss = html.read()
        if new_rss == rss:
            continue
        rss = new_rss
        feed = minidom.parseString(rss)
        new_entry = feed.getElementsByTagName('entry')[0]
        title = new_entry.getElementsByTagName('title')[0].childNodes[0].nodeValue
        print title
        mp3.play()
        time.sleep(30)  # Edit - thanks to all who suggested this

SO_notify()
Something like:
import requests
import mp3play
import time

curr_ids = []
filename = "path_to_soundfile"
mp3 = mp3play.load(filename)
mp3.volume(25)

while True:
    api_json = requests.get("http://api.stackoverflow.com/1.1/questions/unanswered?order=desc&tagged=python").json()
    new_questions = []
    all_questions = []
    for q in api_json["questions"]:
        all_questions.append(q["question_id"])
        if q["question_id"] not in curr_ids:
            new_questions.append(q["question_id"])
    if new_questions:
        print(new_questions)
        mp3.play()
    curr_ids = all_questions
    time.sleep(30)
Used the requests package here because urllib gives me some encoding troubles.
IMHO, you could have 2 solutions to this, depending on which approach you want:
Use JSON - this will give you a nice dict with all entries.
Use RSS (XML). In this case you'd need something like feedparser to process your XML.
Either way, the code should be something like:
import mp3play
from time import sleep

# make curr_ids a dictionary for easier lookup
curr_ids = []
filename = "path_to_soundfile"
mp3 = mp3play.load(filename)
mp3.volume(25)

# Loop
while True:
    # Get the list of entries in objects
    entries = get_list_of_entries()
    new_ids = []
    for entry in entries:
        # Check if we reached the most recent entry
        if entry.id in curr_ids:
            # Force loop end if we did
            break
        new_ids.append(entry.id)
        # Do whatever operations
        print entry.title
    if len(new_ids) > 0:
        mp3.play()
        curr_ids = new_ids
    else:
        # No updates in the meantime
        pass
    sleep(30)
Several notes:
I'd order the entries by "oldest" instead so the printed entries look like a stream, with the most recent one being the last printed out.
the new_ids thing is to keep the list of ids to a minimum; otherwise the lookup will become slower over time.
get_list_of_entries() is a placeholder for fetching the entries from the source (objects from the XML or a dict from the JSON). Accessing them differs depending on which approach you choose, but the principle is the same; a feedparser-based sketch is shown below.
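For the RSS branch, a minimal sketch of get_list_of_entries() using feedparser (assumed installed); feedparser entries expose .id and .title as used in the loop above:
import feedparser

FEED_URL = "http://stackoverflow.com/feeds/tag?tagnames=python&sort=newest"

def get_list_of_entries():
    # Fetch and parse the Atom feed; each entry carries .id and .title.
    feed = feedparser.parse(FEED_URL)
    return feed.entries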
