In Python - Parsing a response XML and finding a specific text value

I'm new to Python and I'm having a particularly difficult time working with XML. The situation is this: I'm trying to count the number of times a word appears in an XML document. Simple enough, but the XML document is a response from a server. Is it possible to do this without writing to a file? It would be great to do it entirely from memory.
Here is a sample xml code:
<xml>
<title>Info</title>
<foo>aldfj</foo>
<data>Text I want to count</data>
</xml>
Here is what I have in python
import urllib2
import StringIO
from xml.dom import minidom
from xml.etree.ElementTree import parse

usock = urllib2.urlopen('http://www.example.com/file.xml')
xmldoc = minidom.parse(usock)
print xmldoc.toxml()
Past this point I have tried using StringIO, ElementTree, and minidom without success, and I've gotten to the point where I'm not sure what else to do.
Any help would be greatly appreciated.

It's quite simple, as far as I can tell:
import urllib2
from xml.dom import minidom
usock = urllib2.urlopen('http://www.example.com/file.xml')
xmldoc = minidom.parse(usock)
for element in xmldoc.getElementsByTagName('data'):
    print element.firstChild.nodeValue
So to count the occurrences of a string, try this (a bit condensed, but I like one-liners):
count = sum(element.firstChild.nodeValue.count('substring') for element in xmldoc.getElementsByTagName('data'))
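Putting it together against the sample document, here is a minimal sketch that parses the XML from an in-memory string with parseString instead of a live URL, and takes 'count' as the word to look for, purely for illustration:
from xml.dom import minidom

txt = """<xml>
<title>Info</title>
<foo>aldfj</foo>
<data>Text I want to count</data>
</xml>"""

# parse straight from the string held in memory - no file needed
xmldoc = minidom.parseString(txt)

# add up the substring occurrences in the text of every <data> element
count = sum(element.firstChild.nodeValue.count('count')
            for element in xmldoc.getElementsByTagName('data'))
print count  # prints 1 for the sample above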

If you are just trying to count the number of times a word appears in an XML document, just read the document as a string and do a count:
import urllib2
data = urllib2.urlopen('http://www.example.com/file.xml').read()
print data.count('foobar')
Otherwise, you can just iterate through the tags you are looking for:
import urllib2
from xml.etree import cElementTree as ET

xml = ET.fromstring(urllib2.urlopen('http://www.example.com/file.xml').read())
for data in xml.getiterator('data'):
    # do something with data.text
    print data.text
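Applied to the original counting question, a minimal sketch along the same lines (again assuming the word being counted is 'count', and using the same example URL) might look like:
import urllib2
from xml.etree import cElementTree as ET

xml = ET.fromstring(urllib2.urlopen('http://www.example.com/file.xml').read())

# total occurrences of the word across the text of every <data> element
total = sum((data.text or '').count('count') for data in xml.getiterator('data'))
print total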

Does this help ...
from xml.etree.ElementTree import XML
txt = """<xml>
<title>Info</title>
<foo>aldfj</foo>
<data>Text I want to count</data>
</xml>"""
# this will give us the contents of the data tag.
data = XML(txt).find("data").text
# ... so here we could do whatever we want
print data

Just replace the string 'count' with whatever word you want to count. If you want to count phrases, you'll have to adapt this code, since it is written for word counting. In any case, the answer to how to get at all the embedded text is XML('<your xml string here>').itertext()
from xml.etree.ElementTree import XML
from re import findall
txt = """<xml>
<title>Info</title>
<foo>aldfj</foo>
<data>Text I want to count</data>
</xml>"""
sum([len(filter(lambda w: w == 'count', findall(r'\w+', t))) for t in XML(txt).itertext()])
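If you need a phrase count rather than a whole-word count, a simpler variant along the same lines (taking 'want to count' as the phrase, just as an example) would be:
from xml.etree.ElementTree import XML

txt = """<xml>
<title>Info</title>
<foo>aldfj</foo>
<data>Text I want to count</data>
</xml>"""

# count substring occurrences across every piece of embedded text
print sum(t.count('want to count') for t in XML(txt).itertext())  # prints 1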

Related

Parsing an xml file with an emphasis tag in it in python

I am currently writing a Python script that extracts all of the text from an XML file. I am using the ElementTree library to interpret the data, but I run into a problem when the data is structured like this...
<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>
When I attempt to read out the text, I only get the first half of the Segment (the text before the <Pause/> tag).
What I am trying to figure out is if there is a way to ignore the tags in the data segments and print out all of the text.
Another solution.
from simplified_scrapy import SimplifiedDoc,req,utils
html = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>'''
doc = SimplifiedDoc(html)
print(doc.Segment)
print(doc.Segment.text)
Result:
{'StartTime': '639.752', 'EndTime': '642.270', 'Participant': 'fe016', 'tag': 'Segment', 'html': "\n But I bet it's a good <Pause /> superset of it.\n"}
But I bet it's a good superset of it.
Here are more examples. https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples
xml = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>'''
# solution using ElementTree: the text before <Pause/> is root.text,
# and the text after it is the <Pause/> element's tail
from xml.etree import ElementTree as ET
root = ET.fromstring(xml)
pause = root.find('./Pause')
print(root.text + pause.tail)
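If the goal is simply to ignore the embedded tags and print all of the segment's text, itertext() is the more general tool; here is a small sketch using the same xml string:
from xml.etree import ElementTree as ET

root = ET.fromstring(xml)
# itertext() yields every piece of text inside the element, skipping the tags themselves
print(''.join(root.itertext()).strip())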

How to save a webpage's text content as a text file using python

I wrote this Python script:
from string import punctuation
from collections import Counter
import urllib
from stripogram import html2text
myurl = urllib.urlopen("https://www.google.co.in/?gfe_rd=cr&ei=v-PPV5aYHs6L8Qfwwrlg#q=samsung%20j7")
html_string = myurl.read()
text = html2text( html_string )
file = open("/home/nextremer/Final_CF/contentBased/contentCount/hi.txt", "w")
file.write(text)
file.close()
Using this script I didn't get the output I wanted, only some HTML code.
I want to save all of the webpage's text content in a text file.
I tried urllib2 and bs4 but I didn't get results.
I don't want the output as an HTML structure; I want all of the text data from the webpage.
What do you mean by "webpage text"?
It seems you don't want the full HTML file. If you just want the text you see in your browser, that is not easily solvable, as parsing an HTML document can be very complex, especially with JavaScript-rich pages.
It starts with assessing whether a string between "<" and ">" is a regular tag, and extends to analyzing the CSS properties changed by JavaScript behavior.
That is why people write very big and complex rendering engines for web browsers.
You don't need to write any hard algorithms to extract data from a search result. Google has an API for this.
Here is an example: https://github.com/google/google-api-python-client/blob/master/samples/customsearch/main.py
But to use it, you must first register with Google for an API key.
All information you can find here: https://developers.google.com/api-client-library/python/start/get_started
import urllib
urllib.urlretrieve("http://www.example.com/test.html", "test.txt")
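Since the question mentions bs4, here is a minimal sketch using BeautifulSoup's get_text() to strip the markup before saving. This assumes BeautifulSoup 4 is installed; the URL and output path are placeholders:
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.example.com/test.html").read()
soup = BeautifulSoup(html, "html.parser")

# get_text() returns only the visible text nodes, with all HTML tags removed
with open("test.txt", "w") as f:
    f.write(soup.get_text().encode("utf-8"))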

Python lxml.etree - Is it more effective to parse XML from string or directly from link?

With the lxml.etree Python framework, is it more efficient to parse XML directly from a link to an online XML file, or is it better to use a different library (such as urllib2) to return a string and then parse from that? Or does it make no difference at all?
Method 1 - Parse directly from link
from lxml import etree as ET
parsed = ET.parse(url_link)
Method 2 - Parse from string
from lxml import etree as ET
import urllib2
xml_string = urllib2.urlopen(url_link).read()
parsed = ET.fromstring(xml_string)
# note: I do not have access to python
# at the moment, so not sure whether
# the .fromstring() function is correct
Or is there a more efficient method than either of these, e.g. saving the XML to a .xml file on disk and then parsing from that?
I ran the two methods with a simple timing wrapper.
Method 1 - Parse XML Directly From Link
from lxml import etree as ET

#timing
def parseXMLFromLink():
    parsed = ET.parse(url_link)
    print parsed.getroot()

for n in range(0,100):
    parseXMLFromLink()
Average of 100 = 98.4035 ms
Method 2 - Parse XML From String Returned By Urllib2
from lxml import etree as ET
import urllib2

#timing
def parseXMLFromString():
    xml_string = urllib2.urlopen(url_link).read()
    parsed = ET.fromstring(xml_string)
    print parsed

for n in range(0,100):
    parseXMLFromString()
Average of 100 = 286.9630 ms
So anecdotally it seems that using lxml to parse directly from the link is the quicker method. It's not clear whether it would be faster to download and then parse large XML documents from the hard drive, but presumably, unless the document is huge and the parsing task more intensive, parseXMLFromLink() would still remain quicker, as it is urllib2 that seems to slow the second function down.
I ran this a few times and the results stayed the same.
If by 'effective' you mean 'efficient', I'm relatively certain you will see no difference between the two at all (unless ET.parse(link) is horribly implemented).
The reason is that the network time is going to be the most significant part of parsing an online XML file, a lot longer than storing the file to disk or keeping it in memory, and a lot longer than actually parsing it.
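One middle ground worth noting: lxml's etree.parse() also accepts a file-like object, so the urllib2 response can be fed straight to the parser without building an intermediate string. A sketch, not benchmarked here; url_link is the same variable as in the question:
from lxml import etree as ET
import urllib2

# hand the open HTTP response to the parser directly instead of .read()-ing it first
response = urllib2.urlopen(url_link)
parsed = ET.parse(response)
print parsed.getroot()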

Reading an XML response and printing required data in Python

I have got XML data as the output of my code, and now I want to get an element value from that XML data.
I have used the following commands:
import xml.dom.minidom

data1 = r1.read()
dom = xml.dom.minidom.parseString(data1)
conference = dom.getElementsByTagName('totalResults')
print conference.nodeValue
But I was unable to get the value.
My XML data looks like this:
<first:totalresults>100</first:totalresults>
and so on.
So now I want the value 100 to be printed.
Can anyone help me solve this? I have been trying since last night.
I'd recommend you use lxml.etree for easier XML parsing:
from lxml import etree
myFile = open("file.xml", 'r')
tree = etree.parse(myFile)
data = tree.xpath('//ns:totalresults', namespaces={'ns': 'http://api.com'})
print data
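Note that xpath() returns a list of matching elements rather than the value itself; to print the 100 you would take the first match's text (assuming the namespace URI above actually matches the document's first prefix):
# data is a list of elements; .text holds the value between the tags
if data:
    print data[0].text  # should print 100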

ElementTree displaying elements out of order

I'm using Python's ElementTree to parse xml files. I have a "findall" to find all "revision" subelements, but when I iterate through the result, they are not in document order. What can I be doing wrong?
Here's my code:
allrevisions = page.findall('{http://www.mediawiki.org/xml/export-0.5/}revision')
for rev in allrevisions:
    print rev
    print rev.find('{http://www.mediawiki.org/xml/export-0.5/}timestamp').text
Here's a link to the document I'm parsing: http://pastie.org/2780983
Thanks,
bsg
Edit: Oops. By going through my code and running it piece by piece, I worked out the problem: I had stuck a reverse() on the elements list in the wrong place, which was causing all the trouble. Thank you so much for your help; I'm sorry it was such a silly issue.
The documentation for ElementTree says that findall returns the elements in document order.
A quick test shows the correct behaviour:
import xml.etree.ElementTree as et
xmltext = """
<root>
<number>1</number>
<number>2</number>
<number>3</number>
<number>4</number>
</root>
"""
tree = et.fromstring(xmltext)
for number in tree.findall('number'):
    print number.text
Result:
1
2
3
4
It would be helpful to see the document you are parsing.
Update:
Using the source data you provided:
from __future__ import with_statement
import xml.etree.ElementTree as et
with open('xmldata.xml', 'r') as f:
    xmldata = f.read()

tree = et.fromstring(xmldata)
for revision in tree.findall('.//{http://www.mediawiki.org/xml/export-0.5/}revision'):
    print revision.find('{http://www.mediawiki.org/xml/export-0.5/}text').text[0:10].encode('utf8')
Result:
‘The Mind
{{db-spam}
‘The Mind
'''The Min
<!-- Pleas
The same order as they appear in the document.
