lxml findall() problem

I'm trying to write a simple program that fetches Wikipedia's recent changes and parses the XML response.
I'm stuck at the point where findall() is not working. What am I doing wrong?
import urllib2
from lxml import etree as ET
result = urllib2.urlopen('http://en.wikipedia.org/w/api.php?action=query&format=xml&list=recentchanges&rcprop=title|ids|sizes|flags|user|timestamp').read()
xml = ET.fromstring(result)
print xml[0][0][0].attrib  # that works!
print xml.findall('api/query/recentchanges/rc')  # that doesn't!

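I suspect the root node is the top-level api node itself, so findall() is looking for a node named "api" inside the root node. If so, a path relative to the root will work:
query/recentchanges/rc
The absolute form only works with lxml's xpath() method, because findall() rejects an absolute path when it is called on an element:
/api/query/recentchanges/rc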
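
To illustrate the point about the root node, here is a minimal sketch. It assumes a made-up, trimmed-down response (SAMPLE is not real API output) and uses the standard library's xml.etree.ElementTree, which shares the findall() path syntax with lxml, so the behaviour should carry over. It is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for the API response.
SAMPLE = """<api><query><recentchanges>
<rc type="edit" title="Page A"/>
<rc type="new" title="Page B"/>
</recentchanges></query></api>"""

root = ET.fromstring(SAMPLE)
print(root.tag)  # fromstring() already returns the <api> element

# Searching for "api" below the root finds nothing, because the
# root element itself is <api>.
missing = root.findall("api/query/recentchanges/rc")

# A path relative to the root matches the <rc> elements.
found = root.findall("query/recentchanges/rc")
titles = [rc.get("title") for rc in found]
print(len(missing), titles)
```

Checking root.tag first is a quick way to see which element a path is evaluated against.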

Related

How to make nested xml structure flat with python

I have an XML file with a large nested structure, like this one:
<root>
  <node1>
    <subnode1>
      <name1>text1</name1>
    </subnode1>
  </node1>
  <node2>
    <subnode2>
      <name2>text2</name2>
    </subnode2>
  </node2>
</root>
I want to convert it to:
<root>
  <node1>
    <name1>text1</name1>
  </node1>
  <node2>
    <name2>text2</name2>
  </node2>
</root>
I tried the following steps:
from xml.etree import ElementTree as et
tr = et.parse(path)
root = tr.getroot()
for node in root.getchildren():
    for element in node.iter():
        if (element.text is not None):
            node.extend(element)
I also tried node.append(element), but that does not work either: it adds the element at the end and I get an infinite loop.
Any help would be appreciated.

A few points to mention here:
Firstly, your test element.text is not None is always True if you parse the XML file as given above using xml.etree.ElementTree: each opening tag is followed by a newline character, so even nodes that appear to have no text contain "\n" as their text. An alternative is to use lxml.etree.parse with an lxml.etree.XMLParser that ignores blank text, as below.
Secondly, it is not a good idea to append to a tree while iterating over it. This code loops forever for the same reason:
>>> a = [1,2,3,4]
>>> for k in a:
...     a.append(5)
See Alex Martelli's answer to the question "Modifying list while iterating" for more on this issue.
Hence, you should build a separate buffer XML tree rather than modifying your tree while traversing it.
from lxml import etree

# Drop whitespace-only text so that element.text is None for container nodes
p = etree.XMLParser(remove_blank_text=True)
path = 'test.xml'
tr = etree.parse(path, parser=p)
root = tr.getroot()
buffer = etree.Element(root.tag)
for node in root:
    bnode = etree.Element(node.tag)
    # Snapshot the descendants first, because append() moves elements in lxml
    for element in list(node.iter()):
        if element.text is not None:
            bnode.append(element)
    buffer.append(bnode)
print etree.tostring(buffer)
Sample run and results:
Chip chip# 01:01:53# ~: python stackoverflow.py
<root><node1><name1>text1</name1></node1><node2><name2>text2</name2></node2></root>
NOTE: the tree printed above is hard to read by eye; you can pretty-print it with the lxml package, as described in "Pretty printing XML in Python".
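
For comparison, here is a minimal sketch of the same "build a separate tree" idea using only the standard library, written for Python 3. The flatten() helper and the SAMPLE string are illustrative assumptions, not part of the answer above. It copies each leaf that carries non-blank text instead of moving it, so the source tree is never modified during iteration.

```python
import xml.etree.ElementTree as ET

# Sample input taken from the question, inlined as a string.
SAMPLE = """<root>
  <node1><subnode1><name1>text1</name1></subnode1></node1>
  <node2><subnode2><name2>text2</name2></subnode2></node2>
</root>"""

def flatten(root):
    """Return a new tree holding only text-bearing descendants of each child."""
    flat = ET.Element(root.tag)
    for node in root:
        new_node = ET.SubElement(flat, node.tag)
        for element in node.iter():
            if element is node:
                continue  # iter() yields the node itself first
            # Whitespace-only text counts as "no text" here.
            if element.text and element.text.strip():
                leaf = ET.SubElement(new_node, element.tag, element.attrib)
                leaf.text = element.text
    return flat

result = ET.tostring(flatten(ET.fromstring(SAMPLE)), encoding="unicode")
print(result)
```

Stripping the text before testing it avoids the need for a parser option such as remove_blank_text, which the standard library parser does not offer.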

Python getparent() not working

I'd like to use getparent() in some code I'm working on to read XML files. When I try what's below I get this error: AttributeError: getparent
I assume I'm making a basic mistake but after an hour of searching and trial and error, I can't figure out what it is. (Using python 2.7 if that matters)
import xml.etree.cElementTree as ET
import lxml.etree
url = [file.xml]
tree = ET.ElementTree(file=url)
txt = 'texthere'
for elem in tree.iter(tag='text'):
    print elem.text
    print elem.getparent()

Element objects created with the standard library module ElementTree do not have a getparent() method. Element objects created with lxml do have this method. You import lxml (import lxml.etree) in your code but you don't use it.
Here is a small working demonstration:
from lxml import etree
XML = """
<root>
<a>
<b>foo</b>
</a>
</root>"""
tree = etree.fromstring(XML)
for elem in tree.iter(tag="b"):
    print "text:", elem.text
    print "parent:", elem.getparent()
Output:
text: foo
parent: <Element a at 0x27a6f08>
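
If switching to lxml is not an option, a common workaround with the standard library is to build a child-to-parent mapping once and look parents up in it. The sketch below assumes Python 3 and reuses the small document from the demonstration above; parent_map is just a name chosen for illustration.

```python
import xml.etree.ElementTree as ET

XML = "<root><a><b>foo</b></a></root>"
root = ET.fromstring(XML)

# Map every child element to its parent. The root has no entry.
parent_map = {child: parent for parent in root.iter() for child in parent}

for elem in root.iter("b"):
    print("text:", elem.text)
    print("parent tag:", parent_map[elem].tag)

parent_tags = [parent_map[elem].tag for elem in root.iter("b")]
```

The mapping is a snapshot: it has to be rebuilt if the tree is modified afterwards.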

I think it is better to try this: there is a problem with the libraries you import. The same thing can be done using DOM; there is a good example here: http://www.mkyong.com/python/python-read-xml-file-dom-example/

In Python - Parsing a response xml and finding a specific text value

I'm new to Python and I'm having a particularly difficult time working with XML. The situation is this: I'm trying to count the number of times a word appears in an XML document. Simple enough, but the XML document is a response from a server. Is it possible to do this without writing to a file? It would be great to do it in memory.
Here is a sample xml code:
<xml>
<title>Info</title>
<foo>aldfj</foo>
<data>Text I want to count</data>
</xml>
Here is what I have in python
import urllib2
import StringIO
import xml.dom.minidom
from xml.etree.ElementTree import parse
usock = urllib.urlopen('http://www.example.com/file.xml')
xmldoc = minidom.parse(usock)
print xmldoc.toxml()
Past this point I have tried using StringIO, ElementTree, and minidom without success, and I'm not sure what else to do.
Any help would be greatly appreciated.

It's quite simple, as far as I can tell:
import urllib2
from xml.dom import minidom
usock = urllib2.urlopen('http://www.example.com/file.xml')
xmldoc = minidom.parse(usock)
for element in xmldoc.getElementsByTagName('data'):
    print element.firstChild.nodeValue
So to count the occurrences of a string, try this (a bit condensed, but I like one-liners). Note that it uses count(), since find() would return a position rather than a number of occurrences:
count = sum(element.firstChild.nodeValue.count('substring') for element in xmldoc.getElementsByTagName('data'))
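
As a rough illustration of doing this entirely in memory, the sketch below parses a string with minidom.parseString instead of reading from a socket or file. The RESPONSE string is a made-up stand-in for the server reply, and the code is written for Python 3.

```python
from xml.dom import minidom

# Hypothetical response body; in practice this would be the bytes or
# text read from the server.
RESPONSE = (
    "<xml><title>Info</title><foo>aldfj</foo>"
    "<data>Text I want to count, count twice</data></xml>"
)

doc = minidom.parseString(RESPONSE)

# Sum substring occurrences over every <data> element that has text.
total = sum(
    element.firstChild.nodeValue.count("count")
    for element in doc.getElementsByTagName("data")
    if element.firstChild is not None
)
print(total)
```

The firstChild check guards against empty data elements, which would otherwise raise an AttributeError.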

If you are just trying to count the number of times a word appears in an XML document, just read the document as a string and do a count:
import urllib2
data = urllib2.urlopen('http://www.example.com/file.xml').read()
print data.count('foobar')
Otherwise, you can just iterate through the tags you are looking for:
from xml.etree import cElementTree as ET
xml = ET.fromstring(urllib2.urlopen('http://www.example.com/file.xml').read())
for data in xml.getiterator('data'):
    # do something with
    data.text

Does this help ...
from xml.etree.ElementTree import XML
txt = """<xml>
<title>Info</title>
<foo>aldfj</foo>
<data>Text I want to count</data>
</xml>"""
# this will give us the contents of the data tag.
data = XML(txt).find("data").text
# ... so here we could do whatever we want
print data

Just replace the string 'count' with whatever word you want to count. If you want to count phrases, then you'll have to adapt this code as this is for word counting. But anyway, the answer to how to get at all the embedded text is XML('<your xml string here>').itertext()
from xml.etree.ElementTree import XML
from re import findall
txt = """<xml>
<title>Info</title>
<foo>aldfj</foo>
<data>Text I want to count</data>
</xml>"""
sum([len(filter(lambda w: w == 'count', findall('\w+', t))) for t in XML(txt).itertext()])
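
The one-liner above relies on Python 2, where filter() returns a list. A possible Python 3 equivalent is sketched below; count_word is a hypothetical helper name, and matching is case-sensitive on whole words, as in the original.

```python
import re
from xml.etree.ElementTree import XML

TXT = """<xml>
<title>Info</title>
<foo>aldfj</foo>
<data>Text I want to count</data>
</xml>"""

def count_word(xml_text, word):
    """Count exact, case-sensitive whole-word matches in all embedded text."""
    root = XML(xml_text)
    return sum(
        1
        for text in root.itertext()
        for token in re.findall(r"\w+", text)
        if token == word
    )

print(count_word(TXT, "count"))
```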

Elementtree displaying elements out of order

I'm using Python's ElementTree to parse xml files. I have a "findall" to find all "revision" subelements, but when I iterate through the result, they are not in document order. What can I be doing wrong?
Here's my code:
allrevisions = page.findall('{http://www.mediawiki.org/xml/export-0.5/}revision')
for rev in allrevisions:
    print rev
    print rev.find('{http://www.mediawiki.org/xml/export-0.5/}timestamp').text
Here's a link to the document I'm parsing: http://pastie.org/2780983
Thanks,
bsg
-Oops. By going through my code and running it piece by piece, I worked out the problem - I had stuck in a reverse() on the elements list in the wrong place, which was causing all the trouble. Thank you so much for your help - I'm sorry it was such a silly issue.

The documentation for ElementTree says that findall returns the elements in document order.
A quick test shows the correct behaviour:
import xml.etree.ElementTree as et
xmltext = """
<root>
<number>1</number>
<number>2</number>
<number>3</number>
<number>4</number>
</root>
"""
tree = et.fromstring(xmltext)
for number in tree.findall('number'):
    print number.text
Result:
1
2
3
4
It would be helpful to see the document you are parsing.
Update:
Using the source data you provided:
from __future__ import with_statement
import xml.etree.ElementTree as et
with open('xmldata.xml', 'r') as f:
    xmldata = f.read()
tree = et.fromstring(xmldata)
for revision in tree.findall('.//{http://www.mediawiki.org/xml/export-0.5/}revision'):
    print revision.find('{http://www.mediawiki.org/xml/export-0.5/}text').text[0:10].encode('utf8')
Result:
‘The Mind
{{db-spam}
‘The Mind
'''The Min
<!-- Pleas
The same order as they appear in the document.
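
The long {namespace}tag strings can be shortened by passing a prefix mapping to find() and findall(). The sketch below assumes Python 3; the "mw" prefix, the NS dictionary, and the timestamps in SAMPLE are made up for illustration, and only the namespace URI comes from the question.

```python
import xml.etree.ElementTree as ET

# Prefix chosen for this example; any name works as long as it maps
# to the document's namespace URI.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.5/"}

# Invented miniature export with three revisions.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/">
<page>
<revision><timestamp>2011-10-01T00:00:00Z</timestamp></revision>
<revision><timestamp>2011-10-02T00:00:00Z</timestamp></revision>
<revision><timestamp>2011-10-03T00:00:00Z</timestamp></revision>
</page>
</mediawiki>"""

root = ET.fromstring(SAMPLE)

# findall() yields matches in document order, so the timestamps come
# out in the order they appear in the file.
timestamps = [
    rev.find("mw:timestamp", NS).text
    for rev in root.findall(".//mw:revision", NS)
]
print(timestamps)
```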

Basic Python file searching and I/O

I'm trying to complete a simple task in Python and I'm new to the language (I come from C++). I hope someone might be able to point me in the right direction.
Problem:
I have an XML file (12 MB) full of data, and within the file there are start tags 'xmltag' and end tags '/xmltag' that mark the start and end of the data sections I would like to pull out.
I would like to navigate through this open file with a loop and for each instance locate a start tag and copy the data within the section to a new file until the end tag. I would then like to repeat this to the end of the file.
I'm happy with the file I/O but not the most efficient looping, searching and extracting of the data.
I really like the look of the language and hopefully I'm going to get more involved so I can give back to the community.
Big thanks!

Check out BeautifulSoup:
from BeautifulSoup import BeautifulSoup
with open('bigfile.xml', 'r') as xml:
    soup = BeautifulSoup(xml)
for xmltag in soup('xmltag'):
    print xmltag.contents

Dive Into Python 3 has a great chapter about this:
http://diveintopython3.org/xml.html#xml-parse
It's a great free book about Python, worth reading!

The BeautifulSoup answer is good but this executes faster and doesn't require an external library:
import xml.etree.cElementTree as ET
tree = ET.parse('xmlfile.xml')
results = (elem for elem in tree.getiterator('xmltag'))
# in Python 2.7+, getiterator() is deprecated; use tree.iter('xmltag')

No need to install BeautifulSoup; Python includes the ElementTree parser in its standard library.
from xml.etree import cElementTree as ET
tree = ET.parse('myfilename')
# Create a new root element to collect the matching sections
new_tree = ET.Element('new_root_element')
for element in tree.findall('.//xmltag'):
    new_tree.append(element)
print ET.tostring(new_tree)
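
For a large file, an incremental parse avoids loading the whole tree at once. The sketch below is one possible approach using ElementTree's iterparse, written for Python 3. The extract_sections helper, the SAMPLE bytes, and the use of io.BytesIO in place of a real file are assumptions for illustration; a filename can be passed to iterparse instead.

```python
import io
import xml.etree.ElementTree as ET

# Invented sample standing in for the 12 MB file.
SAMPLE = b"""<doc>
  <header>skip me</header>
  <xmltag><value>alpha</value></xmltag>
  <other>skip me too</other>
  <xmltag><value>beta</value></xmltag>
</doc>"""

def extract_sections(source, tag="xmltag"):
    """Return the serialized text of every <tag> section, in document order."""
    sections = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            elem.tail = None  # drop whitespace that follows the closing tag
            sections.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # release the section's children once copied
    return sections

sections = extract_sections(io.BytesIO(SAMPLE))
for section in sections:
    print(section)
```

Each string in sections could then be written to the new file with an ordinary write() call.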

xml = open("xmlfile").read()
x = xml.split("</xmltag>")
for block in x:
    if "<xmltag>" in block:
        print block.split("<xmltag>")[-1]
