Python multiprocessing/multithreading in database-driven scraping

I have around 7 lakh (700,000) website URLs in my database, and I scrape some short info from each of those websites.
But when I run the script, it naturally takes a huge amount of time to check that many URLs.
Currently I am doing it with a for loop:
def scrape_short_webinfo(url):
    # a function scraping some minor data from the given URL
    ...

for instance in Link.objects.all():
    scrape_short_webinfo(instance.url)
I want to put this into multiprocessing/multithreading so that the script finishes much faster.
Can anyone help me with this?

You can use a concurrent.futures executor (ThreadPoolExecutor or ProcessPoolExecutor) to parallelize your code.
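For example, here is a minimal sketch, assuming scrape_short_webinfo(url) is mostly waiting on network I/O and Link is the Django model from the question:

from concurrent.futures import ThreadPoolExecutor

def scrape_all():
    # Pull the URLs out of the ORM up front so the worker threads never touch the queryset.
    urls = list(Link.objects.values_list("url", flat=True))

    # Threads suit I/O-bound work like HTTP requests; if the parsing itself
    # becomes CPU-bound, swap in ProcessPoolExecutor instead.
    with ThreadPoolExecutor(max_workers=20) as executor:
        executor.map(scrape_short_webinfo, urls)

Tune max_workers to what the target servers and your machine can tolerate.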

Related

Best approach to save data from webscraper

I have recently hit some obstacles while trying to scrape a website. I'm using Python requests to scrape a website and then bs4 to get the elements and the information I want. I'm using ThreadPoolExecutor to make this work faster, and sqlite3 to save the information and check whether some of it is already in the database.

The problem is that, until now, I've used a multiprocessing Lock to only allow one thread to use the database at a time, since sqlite doesn't allow threading. This, however, is really slow and really brings down the speed of the scraper. I've read about MongoDB, but I would prefer a database that can be saved as a local file, i.e. mydatabase.db, like sqlite. I want the database to be able to support threading so that the threads don't have to wait for each other.
What's the best approach to solve this problem?
Thank you
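One pattern that keeps SQLite but avoids a global lock is to let the scraper threads only parse and push rows onto a queue, while a single dedicated thread owns the database connection and does all the writes. A rough sketch, with a made-up table and row format for illustration:

import queue
import sqlite3
import threading

rows = queue.Queue()
STOP = object()  # sentinel telling the writer thread to stop

def writer(db_path):
    # The only thread that ever touches the connection, so no lock is needed.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")
    while True:
        item = rows.get()
        if item is STOP:
            break
        conn.execute("INSERT OR IGNORE INTO pages (url, title) VALUES (?, ?)", item)
        conn.commit()
    conn.close()

writer_thread = threading.Thread(target=writer, args=("mydatabase.db",))
writer_thread.start()
# scraper threads just call rows.put((url, title)) instead of writing to the database
# ... run the ThreadPoolExecutor scraping work here ...
rows.put(STOP)
writer_thread.join()

The scraper threads never block on the database; checking whether a URL was already seen can be done with an in-memory set loaded at startup, or simply left to the INSERT OR IGNORE.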

How to scrape data faster with selenium and django

I am working on a web scraping project. In this project, I have written the necessary code to scrape the required information from a website using python and selenium. All of this code resides in multiple methods in a class. This code is saved as scraper.py.
When I execute this code, the program takes somewhere between 6 and 10 seconds to extract all the necessary information from the website.
I wanted to create a UI for this project. I used django to create the UI. In the webapp, there is a form which when submitted opens a new browser window and starts the scraping process.
I access the scraper.py file in django views, where depending on the form inputs, the scraping occurs. While this works fine, the execution is very slow and takes almost 2 minutes to finish running.
How do I make the execution of the code through Django faster? Can you point me to a tutorial on how to convert the scraper.py code into an API that Django can access? Will this help in making the code faster?
Thanks in advance
A few tiny tips:
How is your scraper.py working in the first place? Does it simply print the site links/details, store them in a text file, or return them? What exactly happens in it?
If you wish to use your scraper.py as an "API", write your scraper.py code within a function that returns the details of your scraped site as a dictionary. Django's views.py can easily handle such dictionaries and send them over to your frontend HTML to replace the parts written in Jinja2 (see the sketch after these tips).
Further speed can be achieved (in case your scraper does larger jobs) by using multi-threading and/or multi-processing. Do explore both :)
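A rough sketch of the second tip; the function name scrape_site, the form field, and the template names are assumptions, not from the question:

# scraper.py
def scrape_site(query):
    # ... the existing selenium code goes here ...
    return {"title": "...", "price": "..."}  # whatever fields the scraper collects

# views.py
from django.shortcuts import render
from . import scraper

def results(request):
    if request.method == "POST":
        data = scraper.scrape_site(request.POST["query"])
        return render(request, "results.html", {"data": data})
    return render(request, "form.html")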

Not getting results after scraping for a while (Python)

I'm trying to scrape a large database for a project of mine; however, I find that after I scrape a relatively large amount of data, I stop receiving some of the XML information I'm interested in. I'm not sure if it's because the server is limiting my access or because it starts scraping too fast.
I put a "sleep" line between the scraping loops to overcome this, but as I try to reach more data it doesn't work anymore.
I guess this is a known problem in web scraping but I'm very new to this field so any suggestion will be very helpful.
Note: I tried requests with some free proxies but that didn't work either (still some data missing). I also checked the original website and it does have the data I seek.
Edit: It looks like most of the data I'm missing comes from specific attributes that don't load as fast as all the other data. So I think I'm looking for a way to tell whether the XML I'm looking for has already loaded.
I'm using lxml and requests.
Thanks.
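If the server really is throttling you, one thing worth trying is a polite retry with a growing delay whenever the element you expect is missing. A rough sketch with requests and lxml; the XPath is only a placeholder:

import time
import requests
from lxml import etree

def fetch_with_retry(url, xpath, attempts=5, delay=2.0):
    # Fetch the URL and retry with an increasing pause until the expected node shows up.
    for attempt in range(attempts):
        response = requests.get(url, timeout=30)
        tree = etree.fromstring(response.content)
        nodes = tree.xpath(xpath)
        if nodes:
            return nodes
        # Back off a little more on each miss before asking the server again.
        time.sleep(delay * (attempt + 1))
    return []

Bear in mind that requests and lxml only see the raw response: if those attributes are filled in by JavaScript after the page loads, no amount of waiting on your side will make them appear, and a browser-driven tool would be needed instead.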

Code for web crawling with Python 2.7.3 in mac terminal?

I am a social scientist and a complete newbie/noob when it comes to coding. I have searched through the other questions/tutorials but am unable to get the gist of how to crawl a news website targeting the comments section specifically. Ideally, I'd like to tell python to crawl a number of pages and return all the comments as a .txt file. I've tried
from bs4 import BeautifulSoup
import urllib2
url="http://www.xxxxxx.com"
and that's as far as I can go before I get an error message saying bs4 is not a module. I'd appreciate any kind of help on this, and please, if you decide to respond, DUMB IT DOWN for me!
I can run wget on terminal and get all kinds of text from websites which is awesome IF I could actually figure out how to save the individual output html files into one big .txt file. I will take a response to either question.
Try Scrapy. It is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
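A minimal spider sketch; the start URL and the CSS selector are placeholders that you would replace with the news site's actual comment markup:

import scrapy

class CommentSpider(scrapy.Spider):
    name = "comments"
    start_urls = ["http://www.xxxxxx.com/article-page-1"]

    def parse(self, response):
        # Inspect the page to find the element that actually wraps each comment;
        # "div.comment" here is just an assumed example.
        for comment in response.css("div.comment ::text").extract():
            yield {"comment": comment.strip()}

Running scrapy runspider comments_spider.py -o comments.json collects everything the spider yields into one file.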
You will most likely encounter this as you go, but in some cases, if the site is employing 3rd party services for comments, like Disqus, you will find that you will not be able to pull the comments down in this manner. Just a heads up.
I've gone down this route before and have had to tailor the script to a particular site's layout/design/etc.
I've found libcurl to be extremely handy, if you don't mind doing the post-processing using Python's string handler functions.
If you don't need to implement it purely in Python, you can make use of wget's recursive mirroring option to handle the content pull, then write your python code to parse the downloaded files.
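If you go the wget route, the Python side only has to walk the mirrored files and pull the text out. A minimal sketch with BeautifulSoup; the "mirror" directory name and the div class are assumptions about the site's layout:

import os
from bs4 import BeautifulSoup

# Walk the directory wget created and dump every comment's text into one .txt file.
with open("comments.txt", "w") as out:
    for root, dirs, files in os.walk("mirror"):
        for name in files:
            if not name.endswith(".html"):
                continue
            with open(os.path.join(root, name)) as page:
                soup = BeautifulSoup(page.read(), "html.parser")
            for comment in soup.find_all("div", class_="comment"):
                out.write(comment.get_text().strip() + "\n")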
I'll add my two cents here as well.
The first things to check are that you installed beautiful soup, and that it lives somewhere that it can be found. There's all kinds of things that can go wrong here.
My experience is similar to yours: I work at a web startup, and we have a bunch of users who register, but give us no information about their job (which is actually important for us). So my idea was to scrape the homepage and the "About us" page from the domain in their email address, and try to put a learning algorithm around the data that I captured to predict their job. The results for each domain are stored as a text file.
Unfortunately (for you...sorry), the code I ended up with was a bit complicated. The problem is that you'll end up getting a lot of garbage when you do the scraping, and you'll have to filter it out. You'll also end up with encoding issues, and (assuming you want to do some learning here) you'll have to get rid of low-value words. The total code is about 1000 lines, and I'll post some important pieces that may help you out here, if you're interested.

How to display database query results of 100,000 rows or more with HTML?

We're rewriting a website used by one of our clients. The user traffic on it is very low, less than 100 unique visitors a week. It's basically just a nice interface to their data in our databases. It allows them to query and filter on different sets of data of theirs.
We're rewriting the site in Python, re-using the same Oracle database that the data currently lives in. The current version is written in an old, old version of ColdFusion. One of the things ColdFusion does well, though, is displaying tons of database records on a single page. It's capable of displaying hundreds of thousands of rows at once without crashing the browser. It uses a Java applet, and it looks like the contents of the rows are perhaps compressed and passed in through the HTML or something. There is a large block of data in the HTML but it's not displayed - it's just rendered by the Java applet.
I've tried several JavaScript solutions but they all hinge on the fact that the data will be present in an HTML table or something along those lines. This causes browsers to freeze and run out of memory.
Does anyone know of any solutions to this situation? Our client loves the ability to scroll through all of this data without clicking a "next page" link.
I have done just what you are describing using the following (which works very well):
jQuery Datatables
It enables you to do 'fetch as you scroll' pagination, so you can disable the pagination arrows in favor of a 'forever' scroll.
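On the Python side, the server just needs an endpoint that returns one slice of rows at a time for the grid to fetch as the user scrolls. A rough Django sketch; the Record model and its fields are invented for illustration, and the response keys follow DataTables' server-side processing convention:

# views.py
from django.http import JsonResponse
from .models import Record  # hypothetical model backed by the Oracle table

def records_page(request):
    start = int(request.GET.get("start", 0))
    length = int(request.GET.get("length", 100))
    total = Record.objects.count()
    rows = list(
        Record.objects.order_by("id")
        .values("id", "name", "value")[start:start + length]
    )
    return JsonResponse({
        "data": rows,
        "recordsTotal": total,
        "recordsFiltered": total,
    })

The browser only ever holds a few hundred rows at a time, which is what keeps it from freezing.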
Give jQuery scroll a try.
Instead of an image scroll, you need a data scroll: populate the divs with data instead of images.
http://www.smoothdivscroll.com/#quickdemo
It should work, I hope.
You've got a great client anyway :-)
Something related to your question:
http://www.9lessons.info/2009/07/load-data-while-scroll-with-jquery-php.html
http://api.jquery.com/scroll/
I'm using Open Rico's LiveGrid in a project to display a table with thousands of rows in a page as an endless scrolling table. It has been working really well so far. The table requests data on demand when you scroll through the rows. The parameters are sent as simple GET parameters, and the response you have to create on the server side is simple XML. It should be possible to implement a data backend for a Rico LiveGrid in Python.
Most people, in this case, would use a framework. The best documented and most popular framework in Python is Django. It has good database support (including Oracle), and you'll have the easiest time getting help using it since there's such an active Django community.
You can try some other frameworks, but if you're tied to Python I'd recommend Django.
Of course, Jython (if it's an option) would make your job very easy. You could take the existing Java framework you have and just use Jython to build a frontend (and continue to use your Java applet and Java classes and Java server).
The memory problem is an interesting one; I'd be curious to see what you come up with.
Have you tried jqGrid? It can be buggy at times, but overall it's one of the better JavaScript grids. It's fairly efficient in dealing with large datasets. It also has a feature whereby the grid retrieves data asynchronously in chunks, but still allows continuous scrolling. It just asks for more data as the user scrolls down to it.
I did something like this a while ago and successfully implemented YUI's DataTable combined with Django:
http://developer.yahoo.com/yui/datatable/
This gives you column sorting, pagination, scrolling and so on. It also allows you to use a variety of data sources such as JSON or XML.
