I have around 700,000 (7 lakh) website URLs in my database, and I'm scraping some short info from each website.
But when I run the script, it naturally takes a huge amount of time to check that many URLs.
Currently, I am doing it with a for loop:
def scrape_short_webinfo(url):
    # scrape some minor data from the given url
    ...

for instance in Link.objects.all():
    scrape_short_webinfo(instance.url)
I want to move this into multiprocessing/multithreading so that the script finishes faster.
Can anyone help me with this?
You can use an executor pool from concurrent.futures (the Python counterpart of Java's ExecutorService) to parallelize your code.
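A minimal sketch, assuming the scrape_short_webinfo function and Link model from the question; since the work is I/O-bound, a thread pool is usually a better fit than processes:

from concurrent.futures import ThreadPoolExecutor

urls = [instance.url for instance in Link.objects.all()]

# map() fans the calls out across worker threads and waits for all of them
with ThreadPoolExecutor(max_workers=20) as executor:
    executor.map(scrape_short_webinfo, urls)

Tune max_workers to what the target sites and your bandwidth tolerate.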
I want to get data from the web in real time and have used Scrapy to extract the information to build a Python utility. The problem is that the extracted data is static, while the underlying information changes over time.
I wanted to know whether it is viable to call my Scrapy spider when the utility is invoked, so that when the utility is called for the first time, the data at that time is stored as JSON for the user, and refreshed when the user calls it the next time.
Please let me know if there is an alternative to it.
Thanks in advance.
Edit-1: To make it clear, the data that I have extracted will change over time. Here is a link to my previous question about building the spider: How to scrape contents from multiple tables in a webpage. The problem is that as the league progresses, the fixtures' status will change (completed or not yet completed). I want the users to get real-time scraped data.
Edit-2: What I previously did was call my spider separately and use the generated JSON for my utility. For users to have real-time data when they use it on the terminal, should I push the Scrapy code into the main repository that will be uploaded to PyPI, and call the spider in the main function of the .py file? Is this possible? What are the alternatives, if any?
You could start your Scrapy spider from code whenever you (or your user) need it:

from scrapy import cmdline

SCRAPY_SPIDER_NAME = 'spider_name'  # name of the spider to run

cmdline.execute("scrapy crawl {}".format(SCRAPY_SPIDER_NAME).split())
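Note that cmdline.execute exits the process when the crawl finishes. If your utility needs to keep running afterwards, here is a sketch using Scrapy's in-process CrawlerProcess API instead:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(SCRAPY_SPIDER_NAME)  # accepts a spider name or a spider class
process.start()  # blocks until the crawl is done, then returns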
I have an HTML webpage. It has a search textbox. I want to allow the user to search within a dataset. The dataset is represented by a bunch of files on my server. I wrote a python script which can make that search.
Unfortunately, I'm not familiar with how to connect the HTML page and a Python script.
The task is to put a Python script behind the HTML page so that:
Python code will be run on the server side
Python code can somehow take the values from the HTML page as input
Python code can somehow put the search results to the HTML webpage as output
Question 1: How can I do this?
Question 2: How should the Python code be stored on the website?
Question 3: How should it take the HTML values as input?
Question 4: How can it output the results to the webpage? Do I need to install/use any additional frameworks?
Thanks!
There are too many things to get wrong if you try to implement that by yourself with only what the standard library provides.
I would recommend using a web framework, like flask or django. I linked to the quickstart sections of the comprehensive documentation of both. Basically, you write code and URL specifications that are mapped to the code, e.g. an HTTP GET on /search is mapped to a method returning the HTML page.
You can then use a form submit button to GET /search?query=<param>, with <param> being the user's input. Based on that input you search the dataset and return a new HTML page with the results.
Both frameworks have template languages that help you put the search results into HTML.
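For illustration, a minimal Flask sketch of that flow; search_dataset and the results.html template are placeholders for your own search code and markup:

from flask import Flask, request, render_template

app = Flask(__name__)

@app.route("/search")
def search():
    query = request.args.get("query", "")  # the user's input from the form
    results = search_dataset(query)        # your existing Python search code
    return render_template("results.html", results=results)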
For testing purposes, web frameworks usually come with a simple web server you can use. For production purposes, there are better solutions like uwsgi and gunicorn.
Also, you should consider putting the data into a database; parsing files for each query can be quite inefficient.
I'm sure you will have more questions on the way, but that's what stackoverflow is for, and if you can ask more specific questions, it is easier to provide more focused answers.
I would look at the cgi module in Python's standard library.
You should check out Django; it's a very flexible and easy Python web framework.
I am learning Python. I have created some scripts that parse various websites; I run them daily (as the sites' stats are updated) and look at the output in the Python interpreter. I would like to create a website to display the results. What I want to do is run my script when I go to the site and display a sortable table of the results.
I have looked at Django and am part way through the tutorial, but it seems like an awful lot of overhead for what should be a simple problem. I know that I could just write a Python script to output simple HTML, but is that really the best way? I would like to be able to sort the table by various columns.
I have years of programming experience (C, Java, etc.), but have very little web development experience.
Thanks in advance.
Have you considered Flask? Like Tornado, it is both a "micro-framework" and a simple web server, so it has everything you need right out of the box. http://flask.pocoo.org/
This example (right off the homepage) pretty much sums up how simple the code can be:
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()
If you are creating non-interactive pages, you can easily set up any modern web server to execute your Python script as a CGI. Instead of loading a static file, your web server will return the output of your Python script.
This isn't very sophisticated, but if you are simply returning output without needing browser-submitted data, it is the easiest way (scaling under load is a different story).
You don't even need Python's cgi module if you aren't receiving any data from the browser. Anything more complicated than this and you should use a web framework.
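A minimal sketch of such a script; where the file lives and how it is executed depend on your server's CGI configuration (e.g. a cgi-bin directory):

#!/usr/bin/env python
# The web server runs this script and sends whatever it prints to the browser.
print("Content-Type: text/html")
print("")  # a blank line separates the headers from the body
print("<html><body><h1>Latest stats</h1>")
# ... run your parsing code here and print the results as HTML ...
print("</body></html>")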
Examples and other methods
Simple Example: the hardest part is web server configuration
mod_python: cuts down on CGI overhead (otherwise, Apache execs the Python interpreter for each hit)
Python's cgi module: for receiving data sent to your Python script from the browser
Sorting
JavaScript-side sorting: I've used this JavaScript library to add sortable tables. It is the easiest way to add sorting without requiring additional server-side work or another HTTP GET.
Instructions:
Download this file
Add to your HTML
Add class="sortable" to any table you'd like to make sortable
Click on the headers to sort
You might consider Tornado if Django is too much overhead. I've used both and agree that, if you have something simple/small to do and don't already know Django, it's going to exponentially increase your time to production. On the other hand, you can 'get' Tornado in a couple of hours and get something relatively simple done in a day or two with no prior experience with it. At least, that's been my experience with it.
Note that Tornado is still a tradeoff: you get a lot of simplicity in exchange for the huge cornucopia of features and shortcuts you get w/ Django.
PS - in addition to being a 'micro-framework', Tornado is also its own web server, so there's no mucking with wsgi/mod-cgi/fcgi.... just write your request handlers and run it. Be sure to see the demos included in the distribution.
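For illustration, a minimal Tornado sketch; the handler name and port are arbitrary:

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello, world")

application = tornado.web.Application([
    (r"/", MainHandler),  # map the root URL to the handler
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.current().start()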
Have you seen the Bottle framework? It is a micro-framework and very simple.
If I correctly understood your requirements you might find Wooey very interesting.
Wooey is a Django app that creates automatic web UIs for Python scripts:
http://wooey.readthedocs.org
Here you can check a demo:
https://wooey.herokuapp.com/
Django is a big web framework, meant to include loads of things because you often need them, even though sometimes you don't.
Look at Pyramid, formerly known as BFG. It's much smaller.
http://pypi.python.org/pypi/pyramid/1.0a1
Other microframeworks to check out are here: http://wiki.python.org/moin/WebFrameworks
On the other hand, in this case it's probably also overkill. It sounds like you could run the script once every ten minutes, write a static HTML file, and just serve it with Apache.
If you are not willing to write your own tool, there is a pretty advanced tool for executing your scripts: http://rundeck.org/
It's pretty simple to start and can be configured for complex scenarios as well.
For the requirement of a custom view (with sortable results), I believe you could implement a simple plugin that translates script output into HTML elements.
Also, for simple setups I can recommend my own tool: https://github.com/bugy/script-server. It doesn't have tons of features, but it is very easy for end users and supports interactive execution.
If you don't need any input from the browser, this sounds like an almost-static webpage that just happens to change once a day. You'll only need some way to get HTML out of your script, in a place where your web server can access it.
So you'd use some form of templating; and if you need some structure beyond a single page, there are static site / blog generators that you can feed your output into, say in Markdown format, and then call their make html step or the like.
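A minimal sketch of that cron-plus-static-file approach; run_daily_parsing and the output path are hypothetical stand-ins for your own code, and the class="sortable" ties in with the JavaScript sorting suggestion above:

# regenerate_stats.py -- run from cron, serve the output as a static file
results = run_daily_parsing()  # hypothetical: returns (label, value) pairs
rows = "".join("<tr><td>%s</td><td>%s</td></tr>" % pair for pair in results)
html = ('<html><body><table class="sortable">'
        '<tr><th>Stat</th><th>Value</th></tr>%s</table></body></html>' % rows)
with open('/var/www/html/stats.html', 'w') as f:
    f.write(html)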
You can use DicksonUI (https://dicksonui.gitbook.io) or Remi GUI (search on Google). Of the two I think DicksonUI is better, though I should disclose that I am its author.
I'm trying to do three things.
One: crawl and archive, at least daily, a predefined set of sites.
Two: run overnight batch python scripts on this data (text classification).
Three: expose a Django based front end to users to let them search the crawled data.
I've been playing with Apache Nutch/Lucene but getting it to play nice with Django just seems too difficult when I could just use another crawler engine.
Question 950790 suggests I could just write the crawler in Django itself, but I'm not sure how to go about this.
Basically - any pointers to writing a crawler in Django or an existing python crawler that I could adapt? Or should I incorporate 'turning into Django-friendly stuff' in step two and write some glue code? Or, finally, should I abandon Django altogether? I really need something that can search quickly from the front end, though.
If you insert your Django project's app directories into sys.path, you can write standard Python scripts that use the Django ORM. We have an /admin/ directory that contains scripts to perform various tasks; at the top of each script is a block that looks like:
import os
import sys

sys.path.insert(0, os.path.abspath('../my_django_project'))
sys.path.insert(0, os.path.abspath('../'))
sys.path.insert(0, os.path.abspath('../../'))
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'
Then it's just a matter of using your tool of choice to crawl the web and using the Django database API to store the data.
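With that block in place, storing crawled data is just ordinary model calls. A minimal sketch; CrawledPage is a hypothetical model, and on Django 1.7+ you would also call django.setup() after setting DJANGO_SETTINGS_MODULE:

import django
django.setup()  # required on Django 1.7+ before using the ORM standalone

from myapp.models import CrawledPage  # hypothetical app and model

def save_page(url, html):
    CrawledPage.objects.create(url=url, content=html)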
You can write your own crawler using urllib2 to fetch the pages and Beautiful Soup to parse the HTML for the content you want.
Here's an example of reading a page:
http://docs.python.org/library/urllib2.html#examples
Here's an example of parsing the page:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing%20HTML
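A minimal sketch combining the two, assuming Beautiful Soup 3 (the version the documentation above covers) and using a placeholder URL:

import urllib2
from BeautifulSoup import BeautifulSoup  # for bs4: from bs4 import BeautifulSoup

html = urllib2.urlopen('http://example.com/').read()
soup = BeautifulSoup(html)

# e.g. collect every link on the page
for anchor in soup.findAll('a'):
    print(anchor.get('href'))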
If you don't want to write a crawler using the Django ORM (or you already have a working crawler), you can share the database between the crawler and the Django-powered front end.
To be able to search (and edit) the existing database using the Django admin, you should create Django models for it.
The easy way to do that is described here:
http://docs.djangoproject.com/en/dev/howto/legacy-databases/
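In short, that howto has Django introspect the existing tables and generate model definitions for you (python manage.py inspectdb > models.py), which you then clean up by hand and drop into your app.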