I'm trying to do three things.
One: crawl and archive, at least daily, a predefined set of sites.
Two: run overnight batch python scripts on this data (text classification).
Three: expose a Django based front end to users to let them search the crawled data.
I've been playing with Apache Nutch/Lucene but getting it to play nice with Django just seems too difficult when I could just use another crawler engine.
Question 950790 suggests I could just write the crawler in Django itself, but I'm not sure how to go about this.
Basically - any pointers to writing a crawler in Django or an existing python crawler that I could adapt? Or should I incorporate 'turning into Django-friendly stuff' in step two and write some glue code? Or, finally, should I abandon Django altogether? I really need something that can search quickly from the front end, though.
If you insert your Django project's app directories into sys.path, you can write standard Python scripts that use the Django ORM. We have an /admin/ directory that contains scripts to perform various tasks; at the top of each script is a block that looks like:
import os
import sys

sys.path.insert(0, os.path.abspath('../my_django_project'))
sys.path.insert(0, os.path.abspath('../'))
sys.path.insert(0, os.path.abspath('../../'))
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'
Then it's just a matter of using your tool of choice to crawl the web and using the Django database API to store the data.
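For example, a minimal sketch of such a script (the crawler app and its Page model are hypothetical, and this assumes the Python 2-era urllib2 used elsewhere in this thread):

# ... the sys.path / DJANGO_SETTINGS_MODULE block from above goes here ...
# (on Django 1.7+ you would also call django.setup() at this point)
import urllib2
from datetime import datetime

from crawler.models import Page   # hypothetical app and model

def archive(url):
    html = urllib2.urlopen(url).read()
    # the ORM works exactly as it does inside views
    Page.objects.create(url=url, html=html, fetched_at=datetime.now())

if __name__ == '__main__':
    for url in ['http://example.com/']:
        archive(url)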
You could write your own crawler using urllib2 to fetch the pages and Beautiful Soup to parse the HTML for the content you want.
Here's an example of reading a page:
http://docs.python.org/library/urllib2.html#examples
Here's an example of parsing the page:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing HTML
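Putting the two together, a rough sketch (Python 2 / BeautifulSoup 3 era, matching the links above; the URL is just a placeholder):

import urllib2
from BeautifulSoup import BeautifulSoup

def crawl(url):
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    title = soup.find('title')
    links = [a['href'] for a in soup.findAll('a', href=True)]
    return (title.string if title else None), links

title, links = crawl('http://example.com/')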
If you don't want to write the crawler using the Django ORM (or already have a working crawler), you could share a database between the crawler and the Django-powered front end.
To be able to search (and edit) the existing database using the Django admin, you should create Django models for it.
The easy way to do that is described here:
http://docs.djangoproject.com/en/dev/howto/legacy-databases/
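In short, you point Django's settings at the crawler's database and run manage.py inspectdb, which emits model classes you can paste into models.py. A hand-written equivalent for a hypothetical crawled_pages table might look like:

from django.db import models

class CrawledPage(models.Model):
    url = models.URLField()
    html = models.TextField()
    fetched_at = models.DateTimeField()

    class Meta:
        managed = False             # Django won't try to create or alter this table
        db_table = 'crawled_pages'  # hypothetical table written by the crawler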
I am working on a web scraping project. In this project, I have written the necessary code to scrape the required information from a website using Python and Selenium. All of this code resides in multiple methods in a class and is saved as scraper.py.
When I execute this code, the program takes somewhere between 6 and 10 seconds to extract all the necessary information from the website.
I wanted to create a UI for this project, so I used Django. In the web app, there is a form which, when submitted, opens a new browser window and starts the scraping process.
I access the scraper.py file in my Django views, where, depending on the form inputs, the scraping occurs. While this works fine, the execution is very slow and takes almost 2 minutes to finish running.
How do I make the execution of the code through Django faster? Can you point me to a tutorial on how to convert the scraper.py code into an API that Django can access? Will this help make the code faster?
Thanks in advance
A few tiny tips:
How is your scraper.py working in the first place? Does it simply print the site links/details, store them in a text file, or return them? What exactly happens in it?
If you wish to use your scraper.py as an "API", wrap your scraper.py code in a function that returns the details of the scraped site as a dictionary (a sketch of this is below). Django's views.py can easily handle such dictionaries and send them over to your frontend HTML to fill in the parts written in Jinja2.
Further speed can be achieved (in case your scraper does larger jobs) by using multi-threading and/or multi-processing. Do explore both :)
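To illustrate the second tip, a rough sketch of that shape (the field names, template names and "query" form field are made up for illustration):

# scraper.py
def scrape(query):
    # ... existing Selenium code, refactored to take its inputs as arguments ...
    return {'title': '...', 'price': '...'}   # whatever fields you extract

# views.py
from django.shortcuts import render
import scraper   # import path depends on where scraper.py lives

def results(request):
    if request.method == 'POST':
        data = scraper.scrape(request.POST.get('query', ''))
        return render(request, 'results.html', {'data': data})
    return render(request, 'form.html')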
I have a flask application using the jinja2 template engine. The content is dynamic, pulling particular bits from a database.
I'd like to turn a particular page static with all the dynamic data etc. intact. However, I'd like this to run every hour or so as the database will continue to get new data, hence why I'm not just using an existing static generator or building the static page manually - I will cron the job to run automatically.
Any thoughts on how this might be done? I can't provide code examples as I simply don't have a clue on how I might tackle this.
Any help to get me started would be much appreciated.
You can use Frozen-Flask to convert a dynamic Flask app to a static site. It can discover most pages on its own, assuming each page is linked from another page, such as a list of blog posts linking to individual posts. There are also ways to tell it about other pages if they are not discovered automatically. You could run this periodically with a cron job to update the static site regularly.
freeze_app.py:
from flask_frozen import Freezer
from myapp import app

# Build a static copy of every page Frozen-Flask can discover in the app;
# run this script from cron to regenerate the site on a schedule.
freezer = Freezer(app)
freezer.freeze()
I want to create or find an open source web crawler (spider/bot) written in Python. It must find and follow links, collect meta tags and meta descriptions, titles of web pages and the URL of each page, and put all of the data into a MySQL database.
Does anyone know of any open source scripts that could help me? Also, if anyone can give me some pointers as to what I should do then they are more than welcome to.
Yes, I know of a few:
Libraries:
https://github.com/djay/transmogrify.webcrawler
http://code.google.com/p/harvestman-crawler/
http://code.activestate.com/pypm/orchid/
Open source web crawler:
http://scrapy.org/
Tutorials:
http://www.example-code.com/python/pythonspider.asp
PS: I don't know if they use MySQL, because Python projects normally use either SQLite or PostgreSQL, so if you want you could take the libraries I gave you, import the python-mysql module, and do it yourself :D
http://sourceforge.net/projects/mysql-python/
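As a rough sketch of that last step with the MySQLdb (python-mysql) module linked above (the connection details and table layout are made up for illustration):

import MySQLdb

# hypothetical table:
# CREATE TABLE pages (url VARCHAR(255), title TEXT, meta_description TEXT, meta_keywords TEXT)
conn = MySQLdb.connect(host='localhost', user='crawler', passwd='secret', db='crawl')
cur = conn.cursor()

page = {
    'url': 'http://example.com/',
    'title': 'Example',
    'meta_description': 'An example page',
    'meta_keywords': 'example, demo',
}
cur.execute(
    "INSERT INTO pages (url, title, meta_description, meta_keywords) VALUES (%s, %s, %s, %s)",
    (page['url'], page['title'], page['meta_description'], page['meta_keywords']))
conn.commit()
conn.close()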
I would suggest you use Scrapy, which is a powerful scraping framework based on Twisted and lxml. It is particularly well suited for the kind of tasks you want to perform: it features regex-based rules to follow links and lets you use either regular expressions or XPath expressions to extract data from the HTML. It also provides what they call "pipelines" to dump data to whatever you want.
Scrapy doesn't provide a built-in MySQL pipeline, but someone has written one here, on which you could base your own.
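A pipeline is just a class with a process_item method, registered via ITEM_PIPELINES in settings.py. A minimal MySQL-flavoured sketch (the table layout and item fields are hypothetical) might look like:

import MySQLdb

class MySQLPipeline(object):
    def open_spider(self, spider):
        self.conn = MySQLdb.connect(host='localhost', user='crawler', passwd='secret', db='crawl')
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        self.cur.execute("INSERT INTO pages (url, title) VALUES (%s, %s)",
                         (item['url'], item['title']))
        self.conn.commit()
        return item   # pipelines must return the item (or raise DropItem)

    def close_spider(self, spider):
        self.conn.close()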
Scrapy is a web crawling and scraping framework you can extend to insert the selected data into a database.
It's like an inverse of the Django framework.
I am learning python. I have created some scripts that I use to parse various websites that I run daily (as their stats are updated), and look at the output in the Python interpreter. I would like to create a website to display the results. What I want to do is run my script when I go to the site, and display a sortable table of the results.
I have looked at Django and am part way through the tutorial, but it seems like an awful lot of overhead for what should be a simple problem. I know that I could just write a Python script to output simple HTML, but is that really the best way? I would like to be able to sort the table by various columns.
I have years of programming experience (C, Java, etc.), but have very little web development experience.
Thanks in advance.
Have you considered Flask? Like Tornado, it is both a "micro-framework" and a simple web server, so it has everything you need right out of the box. http://flask.pocoo.org/
This example (right off the homepage) pretty much sums up how simple the code can be:
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()
If you are creating non-interactive pages, you can easily set up any modern web server to execute your Python script as a CGI script. Instead of loading a static file, your web server will return the output of your Python script.
This isn't very sophisticated, but if you are simply returning the output without needing browser-submitted data, this is the easiest way (scaling under load is a different story).
You don't even need the "cgi" module from Python if you aren't receiving any data from the browser. Anything more complicated than this and you should use a web framework.
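A minimal sketch of such a CGI script (Python 2 style, matching the era of the tools mentioned below; the HTML is just a placeholder for your script's real output):

#!/usr/bin/env python
# The web server runs this and sends whatever it prints back to the browser.

print "Content-Type: text/html"
print
print "<html><body>"
print "<h1>Results</h1>"
# ... run your parsing code here and print the rows ...
print "</body></html>"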
Examples and other methods
Simple Example: hardest part is webserver configuration
mod_python: cuts down on CGI overhead (otherwise, Apache execs the Python interpreter for each hit)
Python's cgi module: for receiving data sent from the browser to your Python script.
Sorting
JavaScript-side sorting: I've used this JavaScript library to add sortable tables. This is the easiest way to add sorting without requiring additional server-side work or another HTTP GET.
Instructions:
Download this file
Add the downloaded script to your HTML
Add class="sortable" to any table you'd like to make sortable
Click on the headers to sort
You might consider Tornado if Django is too much overhead. I've used both and agree that, if you have something simple/small to do and don't already know Django, it's going to exponentially increase your time to production. On the other hand, you can 'get' Tornado in a couple of hours and get something relatively simple done in a day or two with no prior experience with it. At least, that's been my experience with it.
Note that Tornado is still a tradeoff: you get a lot of simplicity in exchange for giving up the huge cornucopia of features and shortcuts you get with Django.
PS - in addition to being a 'micro-framework', Tornado is also its own web server, so there's no mucking with wsgi/mod-cgi/fcgi.... just write your request handlers and run it. Be sure to see the demos included in the distribution.
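For reference, a minimal Tornado app has roughly this shape (the handler body is a placeholder for your own parsing/output code):

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        # run your parsing code (or read its cached output) and write out the table
        self.write("<table>...</table>")

application = tornado.web.Application([
    (r"/", MainHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()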
Have you seen the Bottle framework? It is a micro-framework and very simple.
If I correctly understood your requirements, you might find Wooey very interesting.
Wooey is a Django app that creates automatic web UIs for Python scripts:
http://wooey.readthedocs.org
Here you can check a demo:
https://wooey.herokuapp.com/
Django is a big web framework, meant to include loads of things because you often need them, even though sometimes you don't.
Look at Pyramid, earlier known as BFG. It's much smaller.
http://pypi.python.org/pypi/pyramid/1.0a1
Other microframeworks to check out are here: http://wiki.python.org/moin/WebFrameworks
On the other hand, in this case it's probably also overkill. It sounds like you could run the script once every ten minutes, write a static HTML file, and just serve it with Apache.
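Something along these lines would do it (get_results() and the output path are placeholders for your own parsing code and Apache setup):

# Run this from cron; Apache just serves the file it writes.
def get_results():
    return [('site A', 42), ('site B', 17)]   # stand-in for your parsing code

rows = "".join("<tr><td>%s</td><td>%s</td></tr>" % r for r in get_results())
html = "<html><body><table>%s</table></body></html>" % rows

with open('/var/www/html/stats.html', 'w') as f:   # path depends on your Apache config
    f.write(html)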
If you are not willing to write your own tool, there is a pretty advanced tool for executing your scripts: http://rundeck.org/
It's pretty simple to start and can be configured for complex scenarios as well.
For the requirement of a custom view (with sortable results), I believe you can implement a simple plugin for translating script output into HTML elements.
Also, for simple setups I could recommend my own tool: https://github.com/bugy/script-server. It doesn't have tons of features, but it is very easy for end-users and supports interactive execution.
If you don't need any input from the browser, this sounds like an almost-static webpage that just happens to change once a day. You'll only need some way to get HTML out of your script, in a place where your webserver can access it.
So you'd use some form of templating; if you need some structure beyond the single page, there are static site / blog generators that you can feed your output to, say in Markdown format, and then call their "make html" step or the like.
You can use DicksonUI (https://dicksonui.gitbook.io) or Remi GUI (search on Google). I am the author of DicksonUI, so I am biased, but I think DicksonUI is the better of the two.
So I've just started learning Python on WAMP. I've got the results of an HTML form using CGI, and successfully performed a database search with MySQLdb. I can return the results to a page that ends with .py by using print statements in the Python CGI code, but I want to create a webpage that's .html and have that returned to the user, and/or keep them on the same web address when the database search results return.
Thanks,
Paul
Edit: to clarify, on my local machine I see /localhost/search.html in the address bar when I submit the HTML form, and receive a results page at /localhost/cgi-bin/searchresults.py. I want to see the results on /localhost/results.html or /localhost/search.html. If this was on a public server I'm ASSUMING it would return .../cgi-bin/searchresults.py; the last time I saw /cgi-bin/ directories in a URL was in the 90s. I've glanced at AddHandler, as David suggested, but I'm not sure if that's what I want.
Edit: thanks all of you for your input. Yep, without using frameworks, mod_rewrite seems the way to go, but having looked at that, I decided to save myself the trouble and go with Django with mod_wsgi, mainly because of the size of its userbase and the amount of docs. I might switch to a lighter/more customisable framework once I've got the basics.
First, I'd suggest that you remember that URLs are URLs, that file extensions don't matter, and that you should just leave it as is.
If that isn't enough, then remember that URLs are URLs and that file extensions don't matter, and configure Apache to use a different rule to decide what is a CGI program rather than a static file to be served up as is. You can use AddHandler to add a handler for files on the hard disk with a .html extension.
Alternatively, you could use mod_rewrite to tell Apache that …/foo.html means …/foo.py
Finally, I'd suggest that if you do muck around with what URLs look like, you remove any sign of something that looks like a file extension (so that …/foo is requested rather than …/foo.anything).
As for keeping the user on the same address for the results as for the request: that is just a matter of having the program output the basic page without results if it doesn't get the query string parameters that indicate a search term was passed.
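A sketch of that last idea with the stdlib cgi module (the "q" parameter name and the search call are hypothetical; Python 2 style to match the question):

#!/usr/bin/env python
import cgi

form = cgi.FieldStorage()
query = form.getvalue("q")

print "Content-Type: text/html"
print
if query:
    # results = run_mysql_search(query)   # your existing MySQLdb code goes here
    print "<p>Results for %s ...</p>" % cgi.escape(query)
else:
    # no search term yet: show the same page with just the form
    print '<form method="get"><input name="q"><input type="submit" value="Search"></form>'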