Search index for flat HTML pages - python

I'm looking to add search capability to an existing, entirely static website. The new search functionality itself would likely need to be dynamic, since the search index would need to be updated periodically (as people make changes to the static content), and the search results would need to be produced dynamically when a user interacts with the search. I'd hope to add this functionality using Python, as that's my preferred language, though I'm open to ideas.
The Google Web Search API won't work in this case because the content being indexed is on a private network. Django haystack won't work for this case, as that requires that the content be stored in Django models. A tool called mnoGoSearch might be an option, as I think it can spider a website like Google does, but I'm not sure how active that project is anymore; the project site seems a bit dated.
I'm curious about using tools like Solr, ElasticSearch, or Whoosh, though I believe those are only the indexing engines and don't handle parsing the content to be searched. Does anyone have any recommendations on how one might index static HTML content for retrieval as a set of search results? Thanks for reading and for any feedback you have.

With Solr, you would write code that retrieves the content to be indexed, parses out the target portions from each item, and then sends them to Solr for indexing.
You would then interact with Solr for search, and have it return either the entire indexed document, an ID, or some other identifying information about the original indexed content, and use that to display results to the user.
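A minimal sketch of that flow, assuming the pysolr client and BeautifulSoup, a local Solr core, and HTML files on disk; the core name site_pages, the title/content fields, and the file paths are placeholders for whatever your schema and layout actually are:

import glob
import pysolr
from bs4 import BeautifulSoup

# Hypothetical Solr core at a default local address; adjust to your setup.
solr = pysolr.Solr("http://localhost:8983/solr/site_pages", timeout=10)

docs = []
for path in glob.glob("/var/www/site/**/*.html", recursive=True):
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    docs.append({
        "id": path,  # use the file path as the unique key
        "title": soup.title.string if soup.title else path,
        "content": soup.get_text(" ", strip=True),  # visible text with tags stripped
    })

solr.add(docs)  # send the parsed documents to Solr for indexing

# At search time, return the id (path) and title so results can link back to the page.
for hit in solr.search("your query", fl="id,title", rows=10):
    print(hit["id"], hit.get("title"))

Running the indexing part from a scheduled job would keep the index in step with changes to the static pages.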

Related

Best way to store web page content in database using Django and a single template for web pages

I'm building a web site and the bulk of the content will be the same general type and layout on the page. I'm going to use a single template to handle each post and the actual content will be stored in a database.
The content will just be html paragraphs, headers, sub headers, different lists, quotes, code blocks, etc.
Web pages will typically be the same or at least similar. All html components should follow the same guidelines to make sure everything looks and feels the same. Currently I'll be the only author, but in the future I plan to incorporate other authors as well.
At first I thought, just copy and paste this html content into a textfield in the database and I can add new posts/articles on the admin site.
Then I thought, maybe use a textfield and paste in JSON: a list of objects like {'type': ..., 'content': ...}. Then I could have the single template page iterate over this list and display each item based on its 'type'. My idea here is that it would shorten the data I have to add to the database by taking the html tags out of the equation.
Considering I hope to have future authors as well, I'm just curious about ideas on how I can accomplish this in a way that makes it easy for me to post new content.
That sounds pretty much exactly like the example in this fantastic tutorial by Miguel Grinberg. He sets up a Flask environment to be used as his personal blog, with user login and everything you would need.
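If you do stay with Django and the JSON-blocks idea from the question, a minimal sketch might look like this; the Post model, its field names, and the block types are hypothetical names for illustration only:

import json
from django.db import models

class Post(models.Model):
    title = models.CharField(max_length=200)
    # Authors paste a JSON list into the admin, e.g.
    # [{"type": "header", "content": "Intro"}, {"type": "paragraph", "content": "..."}]
    blocks_json = models.TextField()

    def blocks(self):
        # Parsed here so the single template can simply loop over post.blocks
        # and render each item according to its "type".
        return json.loads(self.blocks_json)

The single template would then iterate over post.blocks and choose the markup for each block based on its type, which keeps the raw HTML tags out of the database as the question suggests.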

Search/Filter/Select/Manipulate data from a website using Python

I'm working on a project that basically requires me to go to a website, pick a search mode (name, year, number, etc), search a name, select amongst the results those with a specific type (filtering in other words), pick the option to save those results as opposed to emailing them, pick a format to save them then download them by clicking the save button.
My question is, is there a way to do those steps using a Python program? I am only aware of extracting data and downloading pages/images, but I was wondering if there was a way to write a script that would manipulate the data, and do what a person would manually do, only for a large number of iterations.
I've thought of looking into the URL structures, and finding a way to generate for each iteration the accurate URL, but even if that works, I'm still stuck because of the "Save" button, as I can't find a link that would automatically download the data that I want, and using a function of the urllib2 library would download the page but not the actual file that I want.
Any idea on how to approach this? Any reference/tutorial would be extremely helpful, thanks!
EDIT: When I inspect the save button here is what I get:
Search Button
This would depend a lot on the website you're targeting and how its search is implemented.
For some websites, like Reddit, they have an open API where you can add a .json extension to a URL and get a JSON string response as opposed to pure HTML.
For using a REST API or any JSON response, you can load it as a Python dictionary using the json module like this
import json

# Example JSON payload like the one an API might return
json_response = '{"customers":[{"name":"carlos", "age":4}, {"name":"jim", "age":5}]}'
rdict = json.loads(json_response)

def print_names(data):
    for entry in data["customers"]:
        print(entry["name"])

print_names(rdict)
You should take a look at the Library of Congress docs for developers. If they have an API, you'll be able to learn about how you can do search and filter through their API. This will make everything much easier than manipulating a browser through something like Selenium. If there's an API, then you could easily scale your solution up or down.
If there's no API, then you have to:
Use Selenium with a browser (I prefer Firefox).
Try to get as much info generated, filtered, etc. without actually having to push any buttons on that page, by learning how their search works with GET and POST requests. For example, if you're looking for books within a range, manually conduct that search and look at how the URL changes. If you're lucky, you'll see that your search criteria are in the URL. Using this info you can conduct a search just by visiting that URL, which means your program won't have to fill out a form or push buttons, drop-downs, etc. (see the sketch below).
If you have to use the browser through Selenium (for example, if you want to save the whole page with its html, css, and js files, you have to press ctrl+s and then click the "save" button), then you need to find libraries that allow you to manipulate the keyboard from Python. There are such libraries for Ubuntu. These libraries will let you press any key on the keyboard and even do key combinations.
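A rough sketch of the "learn the URL structure" idea using the requests library; the endpoint and parameter names below are made up and would need to be read off the real site's search URLs:

import requests

BASE = "https://example.org/search"  # hypothetical search endpoint
params = {
    "mode": "name",   # hypothetical parameter names, copied from the real URL
    "q": "Smith",
    "year": "1990",
    "format": "csv",  # some sites expose the "save as" format as a parameter too
}

resp = requests.get(BASE, params=params, timeout=30)
resp.raise_for_status()

# If the endpoint returns the file directly, write it out; otherwise parse resp.text.
with open("results.csv", "wb") as f:
    f.write(resp.content)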
An example of what's possible:
I wrote a script that logs me in to a website, then navigates to some page, downloads specific links on that page, visits every link, saves every page, avoids saving duplicate pages, and avoids getting caught (i.e. it doesn't behave like a bot by, for example, visiting 100 pages per minute).
The whole thing took 3-4 hours to code, and it actually worked in a virtual Ubuntu machine I had running on my Mac, which means that while it was doing all that work I could still use my machine. If you don't use a virtual machine, then you'll either have to leave the script running and not interfere with it, or write a much more robust program that IMO is not worth coding since you can just use a virtual machine.
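If you do end up driving a browser, a minimal Selenium sketch looks something like this; the search page, the form field name, and the save-button id are hypothetical and depend entirely on the target site:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome()
try:
    driver.get("https://example.org/search")  # hypothetical search page
    box = driver.find_element(By.NAME, "q")   # hypothetical form field
    box.send_keys("Smith")
    box.submit()
    # The save control's selector has to be read off the real page's HTML.
    driver.find_element(By.ID, "save-button").click()
finally:
    driver.quit()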

Web scraping: finding element after a DOM Tree change

I am relatively new to web scraping/crawlers and was wondering about 2 issues in the event where a parsed DOM element is not found in the fetched webpage anymore:
1- Is there a clever way to detect if the page has changed? I have read that it's possible to store and compare hashes but I am not sure how effective it is.
2- In case a parsed element is not found in the fetched webpage anymore, if we assume that we know that the same DOM element still exists somewhere in the DOM Tree in a different location, is there a way to somehow traverse the DOM Tree efficiently without having to go over all of its nodes?
I am trying to find out how experienced developers deal with those two issues and would appreciate insights/hints/strategies on how to manage them.
Thank you in advance.
I didn't see this in your tag list so I thought I'd mention this before anything else: a tool called BeautifulSoup, designed specifically for web-scraping.
Web scraping is a messy process. Unless there's some long-standing regularity or direct relationship with the web site, you can't really rely on anything remaining static in the web page - certainly not when you scale to millions of web pages.
With that in mind:
There's no one-fit-all solution. Some ideas:
Use RSS, if available.
Split your scraping into crude categories, where some categories have either implied or explicit timestamps (e.g. news sites) that you can use to trigger an update on your end.
You already mentioned this, but hashing works quite well and is relatively cheap in terms of storage. Another idea here is not to hash the entire page, but rather only the dynamic parts or the elements of interest (see the sketch below).
Fetch the HTTP HEAD (e.g. the Last-Modified or ETag headers), if available.
Download and store previous and current version of the files, then use a utility like diff.
Use a 3rd party service to detect a change and trigger a "refresh" on your end.
Obviously each of the above has its pros and cons in terms of processing, storage, and memory requirements.
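As a rough illustration of hashing only the elements of interest, here is a sketch with requests, BeautifulSoup, and hashlib; the URL and CSS selector are placeholders:

import hashlib
import requests
from bs4 import BeautifulSoup

def content_fingerprint(url, css_selector):
    # Hash only the part of the page you care about, not the whole document.
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    fragment = " ".join(el.get_text(" ", strip=True) for el in soup.select(css_selector))
    return hashlib.sha256(fragment.encode("utf-8")).hexdigest()

old = content_fingerprint("https://example.org/news", "div.article-body")  # placeholder values
# Store old, re-run later, and compare the two digests to decide whether to re-scrape.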
As of version 4.x, BeautifulSoup lets you plug in different HTML parsers, notably lxml; lxml itself also gives you XPath support, which will definitely be more efficient than traversing the entire tree manually in a loop.
Alternatively (and likely even more efficiently), you can use CSS selectors. They are more flexible because they don't depend on the content being in the same place; of course, this assumes the content you're interested in keeps its CSS classes and attributes.
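For example, with BeautifulSoup's CSS-selector support (the markup and the selector here are only stand-ins):

from bs4 import BeautifulSoup

html = "<div class='results'><a class='item-title' href='/a'>First</a></div>"  # stand-in markup
soup = BeautifulSoup(html, "lxml")  # requires the lxml parser to be installed
# Selecting by class rather than by absolute position still matches if the element moves in the tree.
for link in soup.select("div.results a.item-title"):
    print(link.get("href"), link.get_text(strip=True))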
Hope this helps!

Automatically generated sitemap on Google App Engine

Okay, so I know there are already some questions about this topic, but I don't find any of them to be specific enough. I want to have a script on my site that will automatically generate my sitemap.xml (or save it somewhere). I know how to upload files, and I have my site set up at http://sean-behan.appspot.com with Python 2.7. How do I set up a script that will generate the sitemap? If possible, please reference the code. Just ask if you need more info. :) Thanks.
You can have outside services automatically generate them for you by traversing your site.
One such service is at http://www.xml-sitemaps.com/details-sean-behan.appspot.com.html
Alternatively, you can serve your own xml file based on the URL's you want to appear in your site. In which case, see Tim Hoffman's answer.
I can't point you to code, as I don't know how your site is structured, what templating environment you use, whether your site includes static pages, etc.
The basics: if you have code that can pull together a list of dictionaries containing the metadata about each page you want in your sitemap, then you are halfway there.
Then use a templating language (or straight Python) to generate an XML file per the sitemaps.org spec.
Now you have a couple of choices: dynamically serve this output as it is requested, or cache it, either in the datastore (if it is less than 1MB when compressed) or in Google Cloud Storage, and serve its contents when /sitemap.xml is requested. In the cached case, set up a cron task to regenerate the sitemap once a day (or whatever frequency is appropriate).
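A minimal "straight Python" version of that idea, assuming you have already pulled together the list of page dictionaries (the URLs and dates here are just placeholders):

# Hypothetical metadata pulled together from wherever your pages are defined.
pages = [
    {"loc": "http://sean-behan.appspot.com/", "lastmod": "2012-12-01"},
    {"loc": "http://sean-behan.appspot.com/about", "lastmod": "2012-11-01"},
]

def build_sitemap(pages):
    entries = "".join(
        "<url><loc>{loc}</loc><lastmod>{lastmod}</lastmod></url>".format(**p)
        for p in pages
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
            + entries + "</urlset>")

# Serve this string when /sitemap.xml is requested, or cache it and refresh it from the cron task.
xml = build_sitemap(pages)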

Best way to programmatically save a webpage to a Static HTML File

The more research I do, the more grim the outlook becomes.
I am trying to Flat Save, or Static Save a webpage with Python. This means merging all the styles to inline properties, and changing all links to absolute URLs.
I've tried nearly every free conversion website, api, and even libraries on github. None are that impressive. The best python implementation I could find for flattening styles is https://github.com/davecranwell/inline-styler. I adapted that slightly for Flask, but the generated file isn't that great. Here's how it looks:
Obviously, it should look better. Here's what it should look like:
https://dzwonsemrish7.cloudfront.net/items/3U302I3Y1H0J1h1Z0t1V/Screen%20Shot%202012-12-19%20at%205.51.44%20PM.png?v=2d0e3d26
It seems like a never-ending struggle dealing with malformed html, unrecognized CSS properties, Unicode errors, etc. So does anyone have a suggestion on a better way to do this? I understand I can go to File -> Save in my local browser, but when I'm trying to do this en masse and extract a particular xpath, that's not really viable.
It looks like Evernote's web clipper uses iFrames, but that seems more complicated than I think it should be. But at least the clippings look decent on Evernote.
After walking away for a while, I managed to install a Ruby library that flattens the CSS much, much better than anything else I've used. It's the library behind the (very slow) web interface at http://premailer.dialect.ca/
Thank goodness they released the source on Github, it's the best hands down.
https://github.com/alexdunae/premailer
It flattens styles, creates absolute urls, works with a URL or string, and can even create plain text email templates. Very impressed with this library.
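Premailer is Ruby; if you only want to handle the absolute-URL half of the problem in Python, a rough sketch with BeautifulSoup and urljoin might look like the following. The file name and base URL are placeholders, and this is an illustration rather than what premailer does internally:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

def absolutize(html, base_url):
    # Rewrite relative href/src attributes to absolute URLs against base_url.
    soup = BeautifulSoup(html, "html.parser")
    for tag, attr in (("a", "href"), ("img", "src"), ("link", "href"), ("script", "src")):
        for el in soup.find_all(tag):
            if el.get(attr):
                el[attr] = urljoin(base_url, el[attr])
    return str(soup)

with open("page.html", encoding="utf-8") as f:
    snapshot = absolutize(f.read(), "http://example.com/articles/")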
Update Nov 2013
I ended up writing my own bookmarklet that works purely client side. It is compatible with WebKit and Firefox only. It recurses through each node, adds inline styles, and then sends the flattened HTML to the clippy.in API to save to the user's dashboard.
Client Side Bookmarklet
It sounds like inline styles might be a deal-breaker for you, but if not, I suggest taking another look at Evernote Web Clipper. The desktop app has an Export HTML feature for web clips. The output is a bit messy as you'd expect with inline styles, but I've found the markup to be a reliable representation of the saved page.
Regarding inline vs. external styles, for something like this I don't see any way around inline if you're doing a lot of pages from different sites where class names would have conflicting style rules.
You mentioned that Web Clipper uses iFrames, but I haven't found this to be the case for the HTML output. You'd likely have to embed the static page as an iFrame if you're re-publishing on another site (legally I assume), but otherwise that shouldn't be an issue.
Some automation would certainly help so you could go straight from the browser to the HTML output, and perhaps for relocating the saved images to a single repo with updated src links in the HTML. If you end up working on something like this, I'd be grateful to try it out myself.
