Python+Flask dynamic generated RSS feed is invalid

I'm trying to create an RSS feed for my blog app at patife.com/rss/. The app is built in Python with Flask. I created a template that dynamically generates the RSS with all entries, but the result is not valid:
I can't seem to convert the date to RFC 822 format using Jinja functions. I was trying strftime.
The entry content that goes inside the description tag doesn't handle HTML well.
This is the current code (I removed the link generator because it works and I can't make posts with too many links):
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Patife.com</title>
<link>http://www.patife.com/</link>
<description>Startups. I can't help myself.</description>
{% for entry in entries %}
<item>
<title>{{ entry.title_en }}</title>
<link>http://www.patife.com/entries/{{ entry.id }}</link>
<guid>http://www.patife.com/entries/{{ entry.id }}</guid>
<pubDate>{{ entry.date_created.strftime('') }}</pubDate>
<description>{{ entry.text_en|safe }}</description>
</item>
{% endfor %}
</channel>
</rss>

Your dates are invalid because you are using naive datetimes: they have no timezone information associated with them. Many databases store datetimes without a timezone, so you'll either need to convert all of your naive datetimes to aware datetimes or, if the stored values are already in UTC, just include the timezone in your template. Use explicit format codes as well: %y emits a two-digit year and %T is not available on every platform. RFC 822 names the zone GMT or UT rather than UTC.
<pubDate>{{ entry.date_created.strftime('%a, %d %b %Y %H:%M:%S') }} GMT</pubDate>
The reason the HTML isn't validating is that when you embed HTML in XML it gets treated as XML. RSS doesn't support arbitrary tags, so validation fails. XML allows you to embed unescaped values in a node by wrapping them in CDATA delimiters.
<description><![CDATA[{{ entry.text_en|safe }}]]></description>
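As an alternative to a hand-written strftime pattern, the standard library can produce the RFC 822 style date directly. The sketch below assumes the stored datetimes are naive values in UTC; the helper names rfc822_date and cdata are made up for illustration. It also shows one way to keep a literal ]]> inside post content from ending the CDATA section early.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def rfc822_date(naive_utc):
    """Format a naive datetime (assumed to be UTC) for an RSS pubDate."""
    aware = naive_utc.replace(tzinfo=timezone.utc)
    # usegmt=True emits "GMT" instead of "+0000"; it requires a UTC-aware value.
    return format_datetime(aware, usegmt=True)


def cdata(text):
    """Wrap text in CDATA, splitting any ']]>' so it cannot end the section."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


print(rfc822_date(datetime(2002, 9, 5, 0, 0, 1)))  # Thu, 05 Sep 2002 00:00:01 GMT
print(cdata("<p>Hello</p>"))
```

In a Flask app these could be registered as Jinja filters (for example through app.jinja_env.filters) so the template only has to pipe entry.date_created and entry.text_en through them.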

Related

orderedContent PyXB return empty lists

I am trying to parse an XML file with simple and complex types using the PyXB library. The following is the XML structure:
</Requirement>
<Section name="Electrical Characteristics">
</Section>
</Section>
</Source>
</Requirements>
After reading it and writing it back through PyXB, I get the XML in the following order, where the order of Requirement and Section has changed.
<Section name="Electrical Characteristics">
</Section>
</Requirement>
</Section>
</Source>
</Requirements>
I have written a script which loads the XML into a PyXB object, and I used orderedContent() to preserve the order in the following way.
ordered_instance = instance.orderedContent()
I am getting an empty list in the ordered_instance variable.

How to obtain field names of RSS feed(xml file) in python dynamically using feedparser?

I have used the feedparser library in Python to read RSS feeds from a particular URL.
The feed is stored in the 'fee' variable by the following line of code:
fee = feedparser.parse('http://www.indiatimes.com/r/python/.rss')
fee contains the feed in a nested format. The structure and the data we get are complex and not fixed.
I want to obtain the names of the fields (keys) of this RSS feed dynamically. How do I do that?
Some field names are fixed, such as link and date, but I need the names of all fields in my code.

First of all, the link you're using returns a 404 error, so you're not going to get any RSS from it.
Secondly, an RSS feed URL often ends in .rss, for example:
http://timesofindia.feedsportal.com/c/33039/f/533916/index.rss
Once you have a working RSS link, all you have to do is this:
fee = feedparser.parse('http://timesofindia.feedsportal.com/c/33039/f/533916/index.rss')
for feed in fee.entries:
    print(feed.title)
    print(feed.link)
The code above gets the item elements.
Here is a more complete example.
import feedparser
rss_document = """
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Sample Feed</title>
<description>For documentation <em>only</em></description>
<link>http://example.org/</link>
<pubDate>Sat, 07 Sep 2002 00:00:01 GMT</pubDate>
<!-- other elements omitted from this example -->
<item>
<title>First entry title</title>
<link>http://example.org/entry/3</link>
<description>Watch out for <span style="background-image:
url(javascript:window.location='http://example.org/')">nasty
tricks</span></description>
<pubDate>Thu, 05 Sep 2002 00:00:01 GMT</pubDate>
<guid>http://example.org/entry/3</guid>
<!-- other elements omitted from this example -->
</item>
</channel>
</rss>
"""
rss = feedparser.parse(rss_document)
# Channel Details
print("-----Channel Details-----")
print(rss.feed.title)
print(rss.feed.description)
print(rss.feed.link)
# Item Details
print("-----Item Details-----")
for fee in rss.entries:
    print(fee.title)
    print(fee.summary)
    print(fee.link)

feeds_all = feedparser.parse('http://www.indiatimes.com/r/python/.rss')
The parsed result is a FeedParserDict, which behaves like a dictionary, so .keys() and .values() work on it. The answer above uses static keys, which you need to know in advance; to get key names that are not known beforehand, I called fee.keys() and it worked.
So the answer is: channel_keys = feeds_all.keys() and feed_keys = feeds_all.feed.keys() give the key names, and feed_values = feeds_all.feed.values() gives the values of those keys.

Use the code below; it will give you all the key names:
import feedparser
feeds_all = feedparser.parse(URL)
feed_all_keys = feeds_all.keys()
feed_keys = feeds_all.feed.keys()
# entries is a list, so take the keys from an individual entry
entries_keys = feeds_all.entries[0].keys() if feeds_all.entries else []
feed_all_keys holds the top-level keys
feed_keys holds the keys related to the feed
entries_keys holds the keys of the first entry (item)
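Individual entries in a feed do not always carry the same fields, so looking at a single entry can miss some keys. A minimal sketch of collecting the union of keys is shown here; collect_keys is a hypothetical helper, and the sample data uses plain dicts as a stand-in for parsed entries so the snippet runs without network access.

```python
def collect_keys(mappings):
    """Return the sorted union of keys across an iterable of dict-like objects."""
    keys = set()
    for mapping in mappings:
        keys.update(mapping.keys())
    return sorted(keys)


# Stand-in data shaped like feed entries (plain dicts, for illustration only).
sample_entries = [
    {"title": "First", "link": "http://example.org/1"},
    {"title": "Second", "link": "http://example.org/2", "summary": "text"},
]
print(collect_keys(sample_entries))  # ['link', 'summary', 'title']
```

With a real feed, the same helper should work on the dict-like objects feedparser returns, for example collect_keys(feeds_all.entries) for the items and collect_keys([feeds_all.feed]) for the channel.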

Generating XML/Feed for your Python Blog

I've been trying to add RSS feeds in my blog(webapp2 application - Jinja2 templates), this is what I have:
class XMLHandler(Handler):
    def get(self):
        posts = db.GqlQuery("SELECT * FROM Post WHERE is_draft = FALSE ORDER BY created DESC")
        self.render("xmltemplate.xml", posts=posts, mimetype='application/xml')
xmltemplate.xml looks like this:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<channel>
<title>Blag</title>
<link>http://www.blagonudacity.appspot.com/</link>
<description>Yet Another Blag</description>
{%for post in posts %}
<entry>
<title>{{post.subject}}></title>
<link href="http://www.blagonudacity.appspot.com/post/{{ post.key().id()}}" rel="alternate" />
<updated>{{post.created}}</updated>
<author><name>Prakhar Srivastav</name></author>
<summary type="html"> {{ post.content }} </summary>
</entry>
{%endfor%}
</channel>
</feed>
What I'm getting in my browser when I navigate to the relevant page, /feeds/all.atom.xml, is just an HTML page with the markup. It doesn't look the way XML pages usually look in a browser. What am I doing wrong here?

I saw that the page is delivered with content type text/html, which could be one problem. I suggest you set this to text/xml, or to application/atom+xml since this is an Atom feed.
How the feed is displayed also depends heavily on the browser. I guess you are using Chrome (like me), where the link you provided looks like a web page. If you open it in Firefox you will see the "live bookmark" styled page, but the entries don't show. I'm not sure if this is because of a problem with your markup or a problem with Firefox and Atom feeds.
The XML file itself seems to be OK (checked with the W3C validator).
UPDATE: There is something wrong with your Atom XML. It is valid XML, as mentioned above, but it is not valid Atom data according to the feed validator.
I tried to bookmark it in Firefox and it does not show any entries (just like the preview mentioned above).
So I think you should take a look at the Atom feed format.
I'm not completely sure, but looking at your XML I think you may have mixed up Atom and RSS a little: <channel> and <description> are RSS elements, while <feed> and <entry> belong to Atom.
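To make the Atom structure concrete, here is a minimal sketch that builds a feed with the standard library instead of a template. The function name build_atom_feed, the dictionary keys used for posts, and the example URLs are assumptions for illustration; it is not the asker's code. Note that the entries sit directly under the feed element and there is no channel wrapper.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def q(tag):
    """Qualify a tag name with the Atom namespace."""
    return "{%s}%s" % (ATOM_NS, tag)


def build_atom_feed(title, feed_id, author, posts):
    """Build an Atom document from a non-empty list of post dicts."""
    ET.register_namespace("", ATOM_NS)  # serialize Atom as the default namespace
    feed = ET.Element(q("feed"))
    ET.SubElement(feed, q("title")).text = title
    ET.SubElement(feed, q("id")).text = feed_id
    # Feed-level <updated>: newest post timestamp (all in the same UTC "Z" form).
    ET.SubElement(feed, q("updated")).text = max(p["updated"] for p in posts)
    ET.SubElement(ET.SubElement(feed, q("author")), q("name")).text = author
    for post in posts:
        entry = ET.SubElement(feed, q("entry"))
        ET.SubElement(entry, q("title")).text = post["title"]
        ET.SubElement(entry, q("id")).text = post["url"]
        ET.SubElement(entry, q("updated")).text = post["updated"]
        ET.SubElement(entry, q("link"), href=post["url"], rel="alternate")
        summary = ET.SubElement(entry, q("summary"), type="html")
        summary.text = post["summary"]  # ElementTree escapes < and & on output
    return ET.tostring(feed, encoding="unicode")


xml_text = build_atom_feed(
    "Blag",
    "http://example.org/",
    "Example Author",
    [{"title": "Hello", "url": "http://example.org/post/1",
      "updated": "2013-01-01T00:00:00Z", "summary": "<b>Hi</b>"}],
)
print(xml_text)
```

The resulting string would then be written to the response with the content type set to application/atom+xml.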

Generating RSS feed under Google App Engine

I want to provide rss feed under google app engine/python.
I've tried using an ordinary request handler that generates an XML response. When I access the feed URL directly, I can see the feed correctly. However, when I try to subscribe to the feed in Google Reader, it says:
'The feed being requested cannot be found.'
I wonder whether this approach is right. I was considering using a static XML file and updating it with cron jobs, but since GAE doesn't support file I/O, that approach is not going to work.
How to solve this? Thanks!

There are two solutions I can suggest:
GAE-REST: you can add it to your project and configure it, and it will generate the RSS for you, but the project is old and no longer maintained.
Do what I do and render a list of entities through a template. This way I succeeded in generating a feed (GeoRSS) that Google Reader can read. The template is:
<title>{{host}}</title>
<link href="http://{{host}}" rel="self"/>
<id>http://{{host}}/</id>
<updated>2011-09-17T08:14:49.875423Z</updated>
<generator uri="http://{{host}}/">{{host}}</generator>
{% for entity in entities %}
<entry>
<title><![CDATA[{{entity.title}}]]></title>
<link href="http://{{host}}/vi/{{entity.key.id}}"/>
<id>http://{{host}}/vi/{{entity.key.id}}</id>
<updated>{{entity.modified.isoformat}}Z</updated>
<author><name>{{entity.title|escape}}</name></author>
<georss:point>{{entity.geopt.lon|floatformat:2}},{{entity.geopt.lat|floatformat:2}}</georss:point>
<published>{{entity.added}}</published>
<summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">{{entity.text|escape}}</div>
</summary>
</entry>
{% endfor %}
</feed>
My handler is (with Python 2.7 you can also do this as a plain function outside a handler for a more minimal solution):
import datetime
import os
import webapp2
from google.appengine.api import memcache
from google.appengine.ext.webapp import template

class GeoRSS(webapp2.RequestHandler):
    def get(self):
        start = datetime.datetime.now() - datetime.timedelta(days=60)
        # Default to 1000 entities when no count parameter is given
        count = int(self.request.get('count') or 1000)
        # memcache.get() returns None on a cache miss rather than raising
        entities = memcache.get('entities')
        if entities is None:
            entities = Entity.all().filter('modified >', start) \
                .filter('published =', True).order('-modified').fetch(count)
            memcache.set('entities', entities)
        template_values = {'entities': entities, 'request': self.request,
                           'host': os.environ.get('HTTP_HOST',
                                                  os.environ['SERVER_NAME'])}
        dispatch = 'templates/georss.html'
        path = os.path.join(os.path.dirname(__file__), dispatch)
        output = template.render(path, template_values)
        self.response.headers['Cache-Control'] = 'public,max-age=%s' % 86400
        # The template above is an Atom feed, so serve it as Atom
        self.response.headers['Content-Type'] = 'application/atom+xml'
        self.response.out.write(output)
I hope some of this works for you, both ways worked for me.

I have an Atom feed generator for my blog, which runs on AppEngine/Python. I use the Django 1.2 template engine to construct the feed. My template looks like this:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
xml:lang="en"
xml:base="http://www.example.org">
<id>urn:uuid:4FC292A4-C69C-4126-A9E5-4C65B6566E05</id>
<title>Adam Crossland's Blog</title>
<subtitle>opinions and rants on software and...things</subtitle>
<updated>{{ updated }}</updated>
<author>
<name>Adam Crossland</name>
<email>adam#adamcrossland.net</email>
</author>
<link href="http://blog.adamcrossland.net/" />
<link rel="self" href="http://blog.adamcrossland.net/home/feed" />
{% for each_post in posts %}{{ each_post.to_atom|safe }}
{% endfor %}
</feed>
Note: if you use any of this, you'll need to create your own uuid to go into the id node.
The updated node should contain the date and time at which the contents of the feed were last updated, in RFC 3339 format. Fortunately, there is a Python module, rfc3339, that takes care of this for you (it is a separate install, not part of the standard library). An excerpt from the controller that generates the feed:
from rfc3339 import rfc3339
posts = Post.get_all_posts()
self.context['posts'] = posts
# Initially, we'll assume that there are no posts in the blog and provide
# an empty date.
self.context['updated'] = ""
if posts is not None and len(posts) > 0:
    # But there are posts, so we will pick the most recent one to get a good
    # value for updated.
    self.context['updated'] = rfc3339(posts[0].updated(), utc=True)
response.content_type = "application/atom+xml"
Don't worry about the self.context['updated'] stuff. That's just how my framework provides a shortcut for setting template variables. The important part is that I encode the date that I want to use with the rfc3339 function. Also, I set the content_type property of the Response object to application/atom+xml.
The only other missing piece is that the template uses a method called to_atom to turn the Post object into Atom-formatted data:
def to_atom(self):
    "Create an ATOM entry block to represent this Post."
    from rfc3339 import rfc3339
    url_for = self.url_for()
    atom_out = "<entry>\n\t<title>%s</title>\n\t<link href=\"http://blog.adamcrossland.net/%s\" />\n\t<id>%s</id>\n\t<summary>%s</summary>\n\t<updated>%s</updated>\n </entry>" % (self.title, url_for, self.slug_text, self.summary_for(), rfc3339(self.updated(), utc=True))
    return atom_out
That's all that is required as far as I know, and this code does generate a perfectly-nice and working feed for my blog. Now, if you really want to do RSS instead of Atom, you'll need to change the format of the feed template, the Post template and the content_type, but I think that is the essence of what you need to do to get a feed generated from an AppEngine/Python application.
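One caveat with building entries by string interpolation is that a title or summary containing & or < would produce XML that does not parse. The following sketch is a standard-library variant, not the author's code: rfc3339_utc and atom_entry are hypothetical names, naive datetimes are assumed to be in UTC, and the URL is an example.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape, quoteattr


def rfc3339_utc(dt):
    """Format a datetime as an RFC 3339 UTC timestamp; naive values are assumed UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def atom_entry(title, url, entry_id, summary, updated):
    """Return an Atom <entry> block with text and attribute values escaped."""
    return (
        "<entry>\n"
        "\t<title>%s</title>\n"
        "\t<link href=%s />\n"  # quoteattr() supplies the surrounding quotes
        "\t<id>%s</id>\n"
        "\t<summary>%s</summary>\n"
        "\t<updated>%s</updated>\n"
        "</entry>"
    ) % (escape(title), quoteattr(url), escape(entry_id), escape(summary),
         rfc3339_utc(updated))


print(rfc3339_utc(datetime(2011, 9, 17, 8, 14, 49)))  # 2011-09-17T08:14:49Z
print(atom_entry("Cats & Dogs", "http://example.org/post?a=1&b=2",
                 "cats-and-dogs", "1 < 2", datetime(2011, 9, 17, 8, 14, 49)))
```

Because the block is already escaped, the template would still output it with the safe filter, as in the feed template above.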

There's nothing special about generating XML as opposed to HTML - provided you set the content type correctly. Pass your feed to the validator at http://validator.w3.org/feed/ and it will tell you what's wrong with it.
If that doesn't help, you'll need to show us your source - we can't debug your code for you if you won't show it to us.

Exporting data as an XML file in google appengine

I'm trying to export data to an XML file in Google App Engine, using Python/Django. The file is expected to contain up to 100K records converted to XML. Is there an equivalent in App Engine of:
f = file('blah', 'w+')
f.write('whatever')
f.close()
?
Thanks
Edit
What I'm trying to achieve is exporting some information to an XML document so it can be exported to Google Places (I don't know exactly how this will work, but I've been told that Google will fetch this XML file from time to time).

You could also generate XML with Django templates. There's no special reason that a template has to contain HTML. I use this approach for generating the Atom feed for my blog. The template looks like this. I pass it the collection of posts that go into the feed, and each Post entity has a to_atom method that generates its Atom representation.
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
xml:lang="en"
xml:base="http://www.example.org">
<id>urn:uuid:4FC292A4-C69C-4126-A9E5-4C65B6566E05</id>
<title>Adam Crossland's Blog</title>
<subtitle>opinions and rants on software and...things</subtitle>
<updated>{{ updated }}</updated>
<author>
<name>Adam Crossland</name>
<email>adam#adamcrossland.net</email>
</author>
<link href="http://blog.adamcrossland.net/" />
<link rel="self" href="http://blog.adamcrossland.net/home/feed" />
{% for each_post in posts %}{{ each_post.to_atom|safe }}
{% endfor %}
</feed>

Every datastore model class has an instance method to_xml() that will generate an XML representation of that datastore type.
Run your query to get the records you want.
Set the content type of the response as appropriate. If you want to prompt the user to save the file locally, add a Content-Disposition header as well.
Generate whatever XML preamble you need to come before your record data.
Iterate through the query results, calling to_xml() on each and adding that output to your response.
Write whatever closing tags are needed to match the preamble.
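The steps above can be sketched as follows. Record is a hypothetical stand-in for a datastore entity with a to_xml() method, the records root element is an arbitrary choice, and an in-memory buffer plays the role of the response stream so the snippet runs on its own.

```python
import io
from xml.sax.saxutils import escape


class Record:
    """Hypothetical stand-in for a datastore entity that offers to_xml()."""

    def __init__(self, name):
        self.name = name

    def to_xml(self):
        return "<record><name>%s</name></record>" % escape(self.name)


def export_records(records, out):
    """Write a preamble, each record's XML, then the closing tag to out."""
    out.write('<?xml version="1.0" encoding="utf-8"?>\n<records>\n')
    for record in records:
        out.write(record.to_xml() + "\n")
    out.write("</records>\n")


buffer = io.StringIO()
export_records([Record("first"), Record("second & third")], buffer)
print(buffer.getvalue())
```

In a request handler, out would be the response's file-like object (self.response.out in the handlers shown elsewhere on this page), with the content type header set before writing.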

What the author is talking about is probably Sitemaps.
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
As for what I think you need: write the XML to the response object, like so:
doc.writexml(self.response.out)
In my case I do this based on mime types sent from the client:
_MIME_TYPES = {
    # xml mime type needs lower priority, that's needed for WebKit based browsers,
    # which add application/xml equally to text/html in accept header
    'xml': ('application/xml;q=0.9', 'text/xml;q=0.9', 'application/x-xml;q=0.9',),
    'html': ('text/html',),
    'json': ('application/json',),
}
mime = self.request.accept.best_match(reduce(lambda x, y: x + y, _MIME_TYPES.values()))
if mime:
    for shortmime, mimes in _MIME_TYPES.items():
        if mime in mimes:
            renderer = shortmime
            break
    # call specific render function
    renderer = 'render' + renderer
    logging.info('Using %s for serving response' % renderer)
    try:
        getattr(self.__class__, renderer)(self)
    except AttributeError:
        logging.error("Missing renderer %s" % renderer)
