Automatically generated sitemap on Google App Engine - python

Okay, so I know there are already some questions about this topic, but I don't find any to be specific enough. I want to have a script on my site that will automatically generate my sitemap.xml (or save it somewhere). I know how to upload files and have my site set up on http://sean-behan.appspot.com with Python 2.7. How do I set up the script that will generate the sitemap? If possible, please reference the code. Just ask if you need more info. :) Thanks.

You can have outside services automatically generate one for you by traversing your site.
One such service is at http://www.xml-sitemaps.com/details-sean-behan.appspot.com.html
Alternatively, you can serve your own XML file based on the URLs you want to appear in your sitemap. In which case, see Tim Hoffman's answer.

I can't point you to code, as I don't know how your site is structured, what templating environment you use, or whether your site includes static pages, etc.
The basics: if you have code that can pull together a list of dictionaries containing the metadata about each page you want in your sitemap, then you are halfway there.
Then use a templating language (or straight Python) to generate an XML file per the sitemaps.org spec.
Now you have two choices: dynamically serve this output as it is requested, or cache it, either in the datastore (if it is less than 1MB when compressed) or in Google Cloud Storage, and serve its contents when /sitemap.xml is requested. You would then set up a cron task to regenerate your cached sitemap once a day (or at whatever frequency is appropriate).
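A minimal sketch of the dynamic-serving option, assuming a GAE Python 2.7 app with webapp2; get_pages() is a hypothetical helper standing in for whatever gathers your real page metadata:

    import webapp2

    def get_pages():
        # Hypothetical helper: replace with code that pulls real page
        # metadata (from the datastore, your URL map, etc.).
        return [
            {'loc': 'http://sean-behan.appspot.com/', 'lastmod': '2013-01-01'},
            {'loc': 'http://sean-behan.appspot.com/about', 'lastmod': '2013-01-01'},
        ]

    class SitemapHandler(webapp2.RequestHandler):
        def get(self):
            # Build one <url> entry per page dictionary.
            items = [
                '<url><loc>%(loc)s</loc><lastmod>%(lastmod)s</lastmod></url>' % p
                for p in get_pages()
            ]
            xml = ('<?xml version="1.0" encoding="UTF-8"?>'
                   '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
                   '%s</urlset>') % ''.join(items)
            self.response.headers['Content-Type'] = 'application/xml'
            self.response.write(xml)

    app = webapp2.WSGIApplication([('/sitemap.xml', SitemapHandler)])

For the cached variant, a cron job (configured in cron.yaml) would hit a handler that writes the same XML to the datastore or Cloud Storage instead of the response.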

Related

Search index for flat HTML pages

I'm looking to add search capability into an existing entirely static website. Likely, the new search functionality itself would need to be dynamic, as the search index would need to be updated periodically (as people make changes to the static content), and the search results will need to be dynamically produced when a user interacts with it. I'd hope to add this functionality using Python, as that's my preferred language, though am open to ideas.
The Google Web Search API won't work in this case because the content being indexed is on a private network. Django haystack won't work for this case, as that requires that the content be stored in Django models. A tool called mnoGoSearch might be an option, as I think it can spider a website like Google does, but I'm not sure how active that project is anymore; the project site seems a bit dated.
I'm curious about using tools like Solr, ElasticSearch, or Whoosh, though I believe that those tools are only the indexing engine and don't handle the parsing of the search content. Does anyone have any recommendations as to how one might index static HTML content for retrieval as a set of search results? Thanks for reading and for any feedback you have.
With Solr, you would write code that retrieves the content to be indexed, parses out the target portions from each item, and then sends it to Solr for indexing.
You would then interact with Solr for search, and have it return either the entire indexed document, or an ID or some other identifying information about the original indexed content, using that to display results to the user.
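Whoosh is the one option on your list that is pure Python and handles both indexing and search; what none of these engines do for you is the HTML parsing, so the sketch below pairs Whoosh with BeautifulSoup for that step. The 'site' and 'indexdir' paths are assumptions:

    import os
    from bs4 import BeautifulSoup
    from whoosh.index import create_in
    from whoosh.fields import Schema, ID, TEXT
    from whoosh.qparser import QueryParser

    schema = Schema(path=ID(stored=True, unique=True), body=TEXT)
    ix = create_in('indexdir', schema)  # 'indexdir' must already exist

    # Walk the static site and index the visible text of each page.
    writer = ix.writer()
    for root, dirs, files in os.walk('site'):
        for name in files:
            if name.endswith('.html'):
                path = os.path.join(root, name)
                with open(path) as f:
                    text = BeautifulSoup(f.read(), 'html.parser').get_text()
                writer.add_document(path=path, body=text)
    writer.commit()

    # Query the index and show which pages matched.
    with ix.searcher() as searcher:
        query = QueryParser('body', ix.schema).parse('search term')
        for hit in searcher.search(query):
            print(hit['path'])

Re-running the indexing step periodically (e.g. from cron) covers the "people make changes to the static content" requirement; note that under Python 2, Whoosh wants unicode, so decode the path and text before handing them over.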

Libraries for generating URLs of uploaded files in Django

I need to develop a project for uploading files and generating their URLs, which could then be shared. Are there any particular libraries or simple means in Python (Django) that would be handy and efficient?
~ Newbie trying a Herculean Task~ Thanks in Advance :)
Much of this is done for you in the Django-backed version of this excellent jQuery file upload app.
Try out a demo of the original here.
No. Since the storage backend used may not even save the actual files on the filesystem, there is no universal way to generate a URL to them. You will need to create a view that gets passed a key it can use to identify the record that has the file field, and then respond with the file contents yourself.
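A minimal sketch of that view, assuming a hypothetical Upload model with a FileField (the model, field, and view names are made up for illustration):

    from django.db import models
    from django.http import FileResponse, Http404

    class Upload(models.Model):
        file = models.FileField(upload_to='uploads/')

    def serve_upload(request, key):
        try:
            upload = Upload.objects.get(pk=key)
        except Upload.DoesNotExist:
            raise Http404('No such upload')
        upload.file.open('rb')            # works with any storage backend
        return FileResponse(upload.file)  # stream the contents back

The shareable URL is then whatever your urlconf maps to serve_upload, e.g. url(r'^files/(?P<key>\d+)/$', serve_upload), which you can build with reverse().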

Filling out paragraph text via urllib?

Say I have a paragraph in a site I am managing, and I want a Python program to change the contents of that paragraph. Is this plausible with urllib?
Quite possibly; it depends on how the site is designed.
If the site is just a collection of static pages (i.e. .html files), then you would have to get a copy of the page, modify it, and upload the new version, most likely using SFTP or WebDAV.
If the site is running a content management system (like Wordpress, Drupal, Joomla, etc) then it gets quite a bit simpler - your script can simply post new page content.
If it is static content maintained through a template system (e.g. Dreamweaver), then life gets quite a bit nastier again, because any changes you make will not be reflected in the template files, and will likely get overwritten and disappear the next time the site is updated.
If you have access to any server-side scripting language, it's easy.
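For the content-management-system case, the script really can just POST the new content. A sketch with Python 2's urllib/urllib2, matching the question; the endpoint and field names are invented, and a real CMS would also require authentication:

    import urllib
    import urllib2

    # Hypothetical edit endpoint and form fields, for illustration only.
    data = urllib.urlencode({
        'page_id': '42',
        'content': '<p>The new paragraph text.</p>',
    })
    request = urllib2.Request('http://example.com/cms/edit', data)
    response = urllib2.urlopen(request)  # supplying data makes this a POST
    print response.read()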

Ensuring dynamic image URLs in a web-app: use a blob store?

I want to serve images in a web-app using sessions, such that the links to the images expire once the session has expired.
If I expose the actual links to the filesystem store of the images, say http://www.mywebapp.com/images/foo1.jpg, future requests for the image (once the user has signed out of the session) are clearly difficult to stop. This is why I was considering placing the images in a SQLite db and serving them from there.
It seems that using the db for image storage is considered bad practice (though apparently the GAE blob store provides this functionality), so I was trying to figure out what the alternatives would be.
1)
Perhaps I do some sort of URL rewriting like so:
http://www.mywebapp.com/images/[session_id]/foo1.jpg
Thinking of using nginx, but it seems (on a first look) that this will require some hacking to accomplish?
2)
Copy the files to a physical directory on the filesystem and delete them when the session expires. This seems quite messy though?
Are there any standard methods of accomplishing this dynamic image url thing?
I'm using web.py - if that helps.
Many thanks!
lighty's mod_secdownload has worked well for me to solve this issue. You can read more about it at http://redmine.lighttpd.net/wiki/1/Docs:ModSecDownload
The lighttpd wiki also has a generic article about your problem: http://redmine.lighttpd.net/wiki/1/HowToFightDeepLinking
Why so complicated?
1) Serve the image under the name the user supplied (i.e. http://www.mywebapp.com/images/foo1.jpg).
2) Save the images in a directory using a UUID as the name.
3) Create a map of file names to UUIDs in the session.
4) In the handler for /images/, look up the real file name in the map. Return 404 if no such entry exists; otherwise serve the image.
5) When the session is closed, delete all files in the map.
6) In a cron job, delete all images that are older than one day.
This way, several users can upload the same image (same name), and images get deleted as soon as possible, or by the cron job if the server crashes or something like that.
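Since you mention web.py, here is a rough sketch of steps 1) through 4) in it; the session setup, IMAGE_DIR, and the save_image() helper are assumptions, not a finished implementation:

    import os
    import uuid
    import web

    urls = ('/images/(.+)', 'Image')
    app = web.application(urls, globals())
    session = web.session.Session(app, web.session.DiskStore('sessions'),
                                  initializer={'images': {}})

    IMAGE_DIR = 'uploaded'  # files live here under UUID names

    def save_image(user_name, data):
        # Store the uploaded bytes under a UUID and remember the
        # user-visible name -> UUID mapping in this session.
        real_name = uuid.uuid4().hex + '.jpg'
        with open(os.path.join(IMAGE_DIR, real_name), 'wb') as f:
            f.write(data)
        session.images[user_name] = real_name

    class Image:
        def GET(self, name):
            real_name = session.images.get(name)
            if real_name is None:
                raise web.notfound()  # no entry in this session's map
            web.header('Content-Type', 'image/jpeg')
            return open(os.path.join(IMAGE_DIR, real_name), 'rb').read()

The cleanup in steps 5) and 6) still has to be wired in separately, on session expiry and via cron.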
A combination of your two ideas (copy to a dir, expire when the session expires) could be generalized to creating a new dir (which could be as simple as a symlink) every 15 minutes. When generating the new symlink, also remove the one that is by now an hour old. Always link to the newest name in your code.

How do I output a dynamically generated web page to a .html page instead of a .py CGI page?

So I've just started learning Python on WAMP. I've got the results of an HTML form using CGI, and successfully performed a database search with MySQLdb. I can return the results to a page that ends with .py by using print statements in the Python CGI code, but I want to create a webpage that's .html and have that returned to the user, and/or keep them on the same web address when the database search results return.
Thanks,
Paul
Edit: to clarify, on my local machine I see /localhost/search.html in the address bar when I submit the HTML form, and receive a results page at /localhost/cgi-bin/searchresults.py. I want to see the results on /localhost/results.html or /localhost/search.html. If this was on a public server, I'm ASSUMING it would return .../cgi-bin/searchresults.py; the last time I saw /cgi-bin/ directories in a URL was in the 90s. I've glanced at AddHandler, as David suggested, but I'm not sure if that's what I want.
Edit: thanks, all of you, for your input. Yep, without using frameworks, mod_rewrite seems the way to go, but having looked at that, I decided to save myself the trouble and go with Django with mod_wsgi, mainly because of the size of its userbase and the amount of docs. I might switch to a lighter/more customisable framework once I've got the basics.
First, I'd suggest that you remember that URLs are URLs and that file extensions don't matter, and that you should just leave it.
If that isn't enough, then remember that URLs are URLs and that file extensions don't matter, and configure Apache to use a different rule to decide what is a CGI program rather than a static file to be served up as is. You can use AddHandler to register a handler for files on disk with a .html extension.
Alternatively, you could use mod_rewrite to tell Apache that …/foo.html means …/foo.py.
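A sketch of both approaches as Apache configuration; the directory and rewrite paths are assumptions to adapt to your layout:

    # 1) Treat .html files in a CGI directory as CGI programs:
    <Directory "/var/www/cgi-bin">
        Options +ExecCGI
        AddHandler cgi-script .html
    </Directory>

    # 2) Or rewrite .html URLs to the .py scripts that back them:
    RewriteEngine On
    RewriteRule ^/(.*)\.html$ /cgi-bin/$1.py [PT]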
Finally, I'd suggest that if you do muck around with what URLs look like, you remove any sign of anything that looks like a file extension (so that …/foo is requested rather than …/foo.anything).
As for keeping the user on the same address for results as for the request: that is just a matter of having the program output the basic page, without results, when it doesn't get the query string parameters that indicate a search term was passed.
