Parsing strange xml feeds

The feed may not be strange, but I have never used XML or PHP, which are two of the things I am using for an upcoming project.
Anyway, I am parsing this XML feed. Each <item> contains an <enclosure url=...>
Where ... = URLs & image types etc
In Python 3 using feedparser I can use
feed = feedparser.parse("http://www.huffingtonpost.com/feeds/verticals/good-news/index.xml")
l = feed.entries[12]['title']
just fine, but when I try to get the URL of an image using
p = feed.entries[12]['enclosure']
I get an error
Traceback (most recent call last):
File "<pyshell#28>", line 1, in <module>
p = feed.entries[12]['enclosure']
File "C:\Python34\lib\site-packages\feedparser-5.1.3-py3.4.egg\feedparser.py", line 375, in __getitem__
return dict.__getitem__(self, key)
KeyError: 'enclosure'
So obviously enclosure isn't coming back with anything, I suspect this is because in the XML it does not use
<name of object>Text</name of object>
Instead it uses
<enclosure url=... blah blah blah />
How do I get the value of URL? It is equal to a string (url="url is here")

Looking at the feedparser docs, try the entries[i].enclosures[j].href reference, which returns the URL of the linked file:
feed = feedparser.parse("http://www.huffingtonpost.com/feeds/verticals/good-news/index.xml")
l = feed.entries[12].enclosures[0].href  # index 0 is the entry's first enclosure
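
The underlying point is that the URL lives in an attribute of <enclosure>, not in the element's text. As a rough illustration that does not use feedparser at all, here is a minimal sketch with the standard library's xml.etree.ElementTree. The RSS fragment and its URL are made up for the example and are not taken from the real feed.

```python
import xml.etree.ElementTree as ET

# Made-up RSS fragment for illustration; it is not taken from the real feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <item>
      <title>Example story</title>
      <enclosure url="http://example.com/image.jpg" type="image/jpeg" length="0" />
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)

# The URL is an attribute of <enclosure>, so it is read with .get(),
# not from the element's text.
enclosure_urls = [enc.get("url") for enc in root.iter("enclosure")]
print(enclosure_urls)
```

feedparser exposes the same attribute under the name href, which is why the 'enclosure' key lookup in the question fails.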

Related

Python, ElementTree: Find specific content in XML tag?

I'm trying to do something I thought should be very simple in ElementTree: find elements with specific tag content. The docs give the example:
*[tag='text']* Selects all elements that have a child named *tag* whose complete text content, including descendants, equals the given *text*.
Which seems straightforward enough. However, it does not work as I expect. Suppose I want to find all examples of <note>NEW</note>. The following complete example:
#!/usr/bin/env python
import xml.etree.ElementTree as ET
xml = """<?xml version="1.0"?>
<entry>
<foo>blah</foo>
<foo>bblic</foo>
<foo>fjdks<note>NEW</note></foo>
<foo>fdfsd</foo>
<foo>ljklj<note>NEW</note></foo>
</entry>
"""
root = ET.fromstring(xml)
print("Number of 'foo' elements: %d" % len(root.findall('.//foo')))
print("Number of new 'foo' elements: %d" % len(root.findall('.//[note="NEW"]')))
Yields:
$ python foo.py
Number of 'foo' elements: 5
Traceback (most recent call last):
File "/usr/lib/python3.10/xml/etree/ElementPath.py", line 370, in iterfind
selector = _cache[cache_key]
KeyError: ('.//[note="NEW"]',)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/foo.py", line 17, in <module>
print("Number of new 'foo' elements: %d" % len(root.findall('.//[note="NEW"]')))
File "/usr/lib/python3.10/xml/etree/ElementPath.py", line 411, in findall
return list(iterfind(elem, path, namespaces))
File "/usr/lib/python3.10/xml/etree/ElementPath.py", line 384, in iterfind
selector.append(ops[token[0]](next, token))
File "/usr/lib/python3.10/xml/etree/ElementPath.py", line 193, in prepare_descendant
raise SyntaxError("invalid descendant")
SyntaxError: invalid descendant
How am I meant to do this simple task?

The docs also say that
Predicates (expressions within square brackets) must be preceded by a
tag name, an asterisk, or another predicate.
Taking this into account,
root.findall('.//[note="NEW"]')
is illegal. Add * before [ to match any tag, i.e.
root.findall('.//*[note="NEW"]')
or put a tag name before [ to match a specific tag, i.e.
root.findall('.//foo[note="NEW"]')

The main problem seems to be an assumed dependency between the first and the second search, which does not exist.
This works (the [tag='text'] predicate is part of ElementTree's standard XPath subset, so no especially recent Python is needed):
for foo in root.findall('.//foo[note="NEW"]'):
    print(foo.text)
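
For completeness, here is a small self-contained sketch that runs both legal forms of the path against the XML from the question. The variable names any_tag, foo_only and new_texts are introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0"?>
<entry>
<foo>blah</foo>
<foo>bblic</foo>
<foo>fjdks<note>NEW</note></foo>
<foo>fdfsd</foo>
<foo>ljklj<note>NEW</note></foo>
</entry>
"""
root = ET.fromstring(xml_text)

# A predicate must follow a tag name, an asterisk, or another predicate.
any_tag = root.findall('.//*[note="NEW"]')     # any element with a <note>NEW</note> child
foo_only = root.findall('.//foo[note="NEW"]')  # only <foo> elements with such a child

# .text holds the text before the first child element.
new_texts = [foo.text for foo in foo_only]
print(len(any_tag), len(foo_only), new_texts)
```

Both searches return the same two <foo> elements here, because <foo> is the only tag that has a <note> child.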

How to use get_attachment call upon QueryResult (python cloudant) ?

I've been trying to get attachment image data from documents in Cloudant.
I can successfully do it once a document is selected (direct extract with _id, etc).
Now trying to do it in combination with "query" operation using selector, I run into trouble.
Here is my code.
targetName = "chibika33"
targetfile = "chibitest.png"
#--------------------------------------------------
# get all the documents with the specific nameField
#--------------------------------------------------
myDatabase.create_query_index(fields=['nameField'])
selector = {'nameField': {'$eq': targetName}}
docs = myDatabase.get_query_result(selector)
#--------------------------------------------------
# get the attachment files to them, save it locally
#--------------------------------------------------
count = 0
for doc in docs:
    count = count + 1
    result_filename = "result%03d.png" % (count)
    dataContent = doc.get_attachment(targetfile, attachment_type='binary')
    dataContentb = base64.b64decode(dataContent)
    with open(result_filename, 'wb') as output:
        output.write(dataContentb)
This causes the following error:
Traceback (most recent call last):
File "view8.py", line 44, in <module>
dataContent = doc.get_attachment(targetfile, attachment_type='binary')
AttributeError: 'dict' object has no attribute 'get_attachment'
So far, I've been unable to find any API for converting a dict to a document object in the python-cloudant documentation: http://python-cloudant.readthedocs.io/en/latest/index.html
Any advice would be highly appreciated.

The returned structure from get_query_result(...) isn't an array of documents.
Try:
resp = myDatabase.get_query_result(selector)
for doc in resp['docs']:
    # your code here
See the docs at:
http://python-cloudant.readthedocs.io/en/latest/database.html#cloudant.database.CloudantDatabase.get_query_result

Python - Generating RSS with PyRSS2Gen

I'm attempting to write a program that utilizes urllib2 to parse HTML, and then utilizes PyRSS2Gen to create the RSS feed, in XML.
I keep getting the error
Traceback (most recent call last):
File "pythonproject.py", line 46, in <module>
get_rss()
File "pythonproject.py", line 43, in get_rss
rss.write_xml(open("cssnews.rss.xml", "w"))
File "build/lib/PyRSS2Gen.py", line 34, in write_xml
self.publish(handler)
File "build/lib/PyRSS2Gen.py", line 380, in publish
item.publish(handler)
File "build/lib/PyRSS2Gen.py", line 427, in publish
_opt_element(handler, "title", self.title)
File "build/lib/PyRSS2Gen.py", line 58, in _opt_element
_element(handler, name, obj)
File "build/lib/PyRSS2Gen.py", line 53, in _element
obj.publish(handler)
AttributeError: 'builtin_function_or_method' object has no attribute 'publish'
upon trying to run it.
From what I could find, other users came across this issue when trying to create a new tag for the XML, but I am trying to use the default tags given with PyRSS2Gen. Inspecting the PyRSS2Gen.py file shows the write_xml() command I am using, so is the error with how I am assigning values to the rss items by popping them from a list?
def get_rss():
    sys.path.append('build/lib')
    from PyRSS2Gen import RSS2, RSSItem
    rss = RSS2(
        title = 'Python RSS Creator',
        link = 'technews.acm.org',
        description = 'Creates RSS out of HTML',
        items = [],
    )
    for x in range(0, len(rssTitles)):
        rss.items.append(RSSItem(
            title = rssTitles.pop,
            link = rssLinks.pop,
            description = rssDesc.pop,
        ))
    rss.write_xml(open("cssnews.rss.xml", "w"))
# 5 - Call function
get_rss()

I ended up just writing out to a file, like so:
news = open("news.rss.xml", "w")
news.write("<?xml version=\"1.0\" ?>")
news.write("\n")
news.write("<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" version=\"2.0\">")
news.write("\n")
news.write("<channel>")
news.write("\n")
etc.

PyRSS2Gen relies on its inputs either being strings or having a publish method that does all the necessary conversion.
In this case, you did not call the pop method on rssTitles, so you passed a function rather than a string. Adding () after each pop should give you a usable program.
Note that similar errors can also crop up when there are other non-string items around (e.g. byte strings); the AttributeError line gives you a hint as to the object that went into the RSS item, and the backtrace indicates where in the RSS item that is (the title, in this case).
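
The difference between referencing pop and calling it can be seen without PyRSS2Gen at all. This is a minimal sketch; rss_titles and its contents are hypothetical stand-ins for the lists built in the question.

```python
# Hypothetical stand-in data; the real lists come from the asker's HTML scraping.
rss_titles = ["First headline", "Second headline"]

# No parentheses: this is the bound method object itself, not a string.
unbound = rss_titles.pop
print(callable(unbound), isinstance(unbound, str))

# With parentheses: the method is called and returns the last item.
title = rss_titles.pop()
print(title)

remaining = list(rss_titles)
```

Passing the method object as a title is what leads to the 'builtin_function_or_method' object named in the AttributeError.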

Strange failure to make a HIT for Amazon Mechanical Turk with some URLs?

I was trying to include a link in a HIT request in Amazon Mechanical Turk, using boto, and kept getting an error that my XML was invalid. I gradually pared my HTML down to the bare minimum and found that some valid links fail for seemingly no reason. Can anyone with expertise in boto or AWS help me work out why?
I followed these two guides:
http://www.toforge.com/2011/04/boto-mturk-tutorial-create-hits/
https://gist.github.com/j2labs/740267
Here is my example:
from boto.mturk.connection import MTurkConnection
from boto.mturk.question import QuestionContent,Question,QuestionForm,Overview,AnswerSpecification,SelectionAnswer,FormattedContent,FreeTextAnswer
from config import *
HOST = 'mechanicalturk.sandbox.amazonaws.com'
mtc = MTurkConnection(aws_access_key_id=ACCESS_ID,
                      aws_secret_access_key=SECRET_KEY,
                      host=HOST)
title = 'HIT title'
description = ("HIT description.")
keywords = 'keywords'
s1 = """<![CDATA[<p>Here comes a link <a href='%s'>LINK</a></p>]]>""" % "http://www.example.com"
s2 = """<![CDATA[<p>Here comes a link <a href='%s'>LINK</a></p>]]>""" % "https://www.google.com/search?q=example&site=imghp&tbm=isch"
def makeahit(s):
    overview = Overview()
    overview.append_field('Title', 'HIT title itself')
    overview.append_field('FormattedContent', s)
    qc = QuestionContent()
    qc.append_field('Title', 'The title')
    fta = FreeTextAnswer()
    q = Question(identifier="URL",
                 content=qc,
                 answer_spec=AnswerSpecification(fta))
    question_form = QuestionForm()
    question_form.append(overview)
    question_form.append(q)
    mtc.create_hit(questions=question_form,
                   max_assignments=1,
                   title=title,
                   description=description,
                   keywords=keywords,
                   duration=30,
                   reward=0.05)
makeahit(s1) # SUCCESS!
makeahit(s2) # FAIL?
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 25, in makeahit
File "/usr/local/lib/python2.7/dist-packages/boto/mturk/connection.py", line 263, in create_hit
return self._process_request('CreateHIT', params, [('HIT', HIT)])
File "/usr/local/lib/python2.7/dist-packages/boto/mturk/connection.py", line 821, in _process_request
return self._process_response(response, marker_elems)
File "/usr/local/lib/python2.7/dist-packages/boto/mturk/connection.py", line 836, in _process_response
raise MTurkRequestError(response.status, response.reason, body)
boto.mturk.connection.MTurkRequestError: MTurkRequestError: 200 OK
<?xml version="1.0"?>
<CreateHITResponse><OperationRequest><RequestId>19548ab5-034b-49ec-86b2-9e499a3c9a79</RequestId></OperationRequest><HIT><Request><IsValid>False</IsValid><Errors><Error><Code>AWS.MechanicalTurk.XHTMLParseError</Code><Message>There was an error parsing the XHTML data in your request. Please make sure the data is well-formed and validates against the appropriate schema. Details: The reference to entity "site" must end with the ';' delimiter. Invalid content: <FormattedContent><![CDATA[<p>Here comes a link <a href='https://www.google.com/search?q=example&site=imghp&tbm=isch'>LINK</a></p>]]></FormattedContent> (1369323038698 s)</Message></Error></Errors></Request></HIT></CreateHITResponse>
Any idea why s2 fails, but s1 succeeds when both are valid links? Both link contents work:
http://www.example.com
https://www.google.com/search?q=example&site=imghp&tbm=isch
Things with query strings? Https?
UPDATE
I'm going to do some tests, but right now my candidate hypotheses are:
1. HTTPS doesn't work (so, I'll see if I can get another https link to work)
2. URLs with params don't work (so, I'll see if I can get another url with params to work)
3. Google doesn't allow its searches to get posted this way? (if 1 and 2 fail!)

You need to escape ampersands in URLs, i.e. & => &amp;.
At the end of s2, use
q=example&amp;site=imghp&amp;tbm=isch
instead of
q=example&site=imghp&tbm=isch
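
Rather than editing the URL by hand, the escaping can be done in code before the URL is interpolated into the markup. This is a minimal sketch, assuming the standard library's xml.sax.saxutils.escape is an acceptable way to do it; the variable names url and safe_url are introduced only for this example, and nothing here calls boto or Mechanical Turk.

```python
from xml.sax.saxutils import escape

url = "https://www.google.com/search?q=example&site=imghp&tbm=isch"

# escape() replaces &, < and > with their XML entities,
# so each bare & in the query string becomes &amp;.
safe_url = escape(url)

s2 = "<![CDATA[<p>Here comes a link <a href='%s'>LINK</a></p>]]>" % safe_url
print(s2)
```

The resulting s2 no longer contains a bare "&site", which is the sequence the parser was trying to read as an entity reference.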

Adding a XML root attribute with hyphens in python XMLBuilder

I'm integrating with the Google Checkout API, and all of their element and attribute names include hyphens. So to create a request to charge an order I need to send an XML post that looks like:
<?xml version="1.0" encoding="UTF-8"?>
<charge-and-ship-order xmlns="http://checkout.google.com/schema/2" google-order-number="6014423719">
<amount currency="USD">335.55</amount>
</charge-and-ship-order>
I'm having trouble building that xml with the attribute "google-order-number". The following code works if I want to create an empty node:
>>> xml=XMLBuilder()
>>> xml << ('charge-and-ship-order', {'xmlns':'xxx','google-order-number':'3433'})
>>> str(xml)
<charge-and-ship-order google-order-number="3433" xmlns="xxx" />
But if I try to add a child node for the amount using the documented way:
>>> xml=XMLBuilder()
>>> with xml('charge-and-ship-order', xmlns='xxx', google-order-number='3433'):
...     with xml('amount', currency="USD"):
...         xml << '4.54'
I get an error saying:
SyntaxError: keyword can't be an expression
I've also tried:
>>> xml=XMLBuilder()
>>> with xml('charge-and-ship-order', {'xmlns':'xxx', 'google-order-number':'3433'}):
...     xml << 'test'
and I get a traceback in the XMLBuilder library saying
File "/xmlbuilder/xmlbuilder/__init__.py", line 102, in __call__
x(*dt,**mp)
File "/xmlbuilder/xmlbuilder/__init__.py", line 36, in __call__
text = "".join(dt)
TypeError: sequence item 0: expected string, dict found
Any ideas how to use an attribute like that? I'm using the XMLBuilder library located at
http://pypi.python.org/pypi/xmlbuilder

You can pass the attributes in a dictionary like this:
function_call(**{'weird-named-key': 'value'})
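
To show why this works, here is a minimal sketch using a hypothetical helper, make_element, that stands in for any callable accepting **kwargs. It is not part of XMLBuilder, and whether XMLBuilder's own call forwards such keys unchanged is not verified here.

```python
def make_element(tag, **attrs):
    """Hypothetical helper: render a self-closing tag from keyword attributes."""
    rendered = " ".join('%s="%s"' % (k, v) for k, v in sorted(attrs.items()))
    return "<%s %s />" % (tag, rendered)

# google-order-number is not a valid Python identifier, so it cannot be
# written as a keyword argument directly, but ** unpacking accepts any
# string key when the callee collects keywords with **attrs.
element = make_element(
    "charge-and-ship-order",
    **{"xmlns": "xxx", "google-order-number": "3433"}
)
print(element)
```

Writing google-order-number='3433' directly fails because Python parses the hyphens as subtraction, which is the "keyword can't be an expression" error from the question.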
