Speed up the number of pages I can scrape via threading - python

I'm currently using BeautifulSoup to scrape sourceforge.net for various project information. I'm using the solution in this thread. It works well, but I wish to make it faster still. Right now I'm creating a list of 15 URLs and feeding them into run_parallel_in_threads. All the URLs are sourceforge.net links. I'm currently getting about 2.5 pages per second, and increasing or decreasing the number of URLs in my list doesn't seem to have much effect on the speed. Are there any strategies to increase the number of pages I can scrape? Are there any other solutions that are more suitable for this kind of project?

You could have your parallel threads simply retrieve the web content. Once an HTML page is retrieved, pass it into a queue served by multiple workers, each parsing a single HTML page. You've now essentially pipelined your workflow: instead of each thread doing every step (retrieve page, scrape, store), each thread simply retrieves a page and hands the parsing task off to the queue, whose workers process the tasks in a round-robin fashion.
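For illustration, here is a minimal sketch of that pipeline using the standard library's threading and queue modules together with BeautifulSoup (the URLs and the parsing step are placeholders, not your actual code):
import queue
import threading
import urllib.request

from bs4 import BeautifulSoup

urls = ["https://sourceforge.net/projects/example%d/" % i for i in range(15)]  # placeholder URLs
html_queue = queue.Queue()

def downloader(url):
    # I/O-bound step: just fetch the raw page and hand it off to the parsers
    html = urllib.request.urlopen(url).read()
    html_queue.put(html)

def parser():
    # CPU-bound step: parse whatever the downloaders produce
    while True:
        html = html_queue.get()
        if html is None:              # sentinel: no more work
            break
        soup = BeautifulSoup(html, "html.parser")
        print(soup.title)             # placeholder for real scraping/storing

parser_threads = [threading.Thread(target=parser) for _ in range(4)]
for t in parser_threads:
    t.start()

download_threads = [threading.Thread(target=downloader, args=(u,)) for u in urls]
for t in download_threads:
    t.start()
for t in download_threads:
    t.join()

for _ in parser_threads:
    html_queue.put(None)              # tell each parser to stop
for t in parser_threads:
    t.join()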
Please let me know if you have any questions!

Related

Scrape many pages with different "?id=" parameters without excessive requests?

I'm trying to scrape all pages with different ids from a site that is formatted url.com/page?id=1, but there are millions of ids so even at 1 request per second it will take weeks to get them all.
I am a total noob at this, so I was wondering whether there is a better way than going one by one, such as some kind of bulk request, or whether I should just increase the requests per second to whatever I can get away with.
I am using requests and beautifulsoup in python to scrape the pages currently.
The grequests library is one possible approach you could take. The results are returned in the order they are obtained (which is not the same order as event_list).
import grequests
event_list = [grequests.get(f"http://url.com/page?id={req_id}") for req_id in range(1, 100)]
for r in grequests.imap(event_list, size=5):
    print(r.request.url)
    print(r.text[:100])
    print()
Note: You are likely to be blocked if you attempt this on most sites. A better approach would be to see if the website has a suitable API you could use to obtain the same information; you can often find one by watching your browser's network tools while using the site.

How to optimize web site / make it load faster?

I have a webpage which does web scraping and displays news in a slideshow. It also extracts tweets from Twitter using tweepy.
The code sequence is below:
class extract_news:
    def bcnews(self):
        ...  # code to extract news

    def func2(self):
        ...  # code to extract news

    # ... more functions like these ...

    def extractfromtwitter(self):
        ...  # code to extract using tweepy
I have multiple such functions to extract from different websites using BS4 and to display the news and tweets. I am using Flask to run this code.
But the page takes about 20 seconds or so to load, and if someone tries to access it remotely, it takes too long and the browser shows a "Connection Timed Out" error or just doesn't load.
How can I make this page load faster? Say, in under 5 seconds.
Thanks!
You need to identify the bottlenecks in your code and then figure out how to reduce them. It's difficult to help you with the minimal amount of code that you have provided, but the most likely cause is that each HTTP request takes most of the time, and the parsing is probably negligible in comparison.
See if you can figure out a way to parallelise the HTTP requests, e.g. using the multiprocessing or threading modules.
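For example, a minimal sketch with a thread pool from multiprocessing.dummy (the multiprocessing API backed by threads, which suits I/O-bound HTTP requests); the URLs and fetch function here are placeholders rather than your actual code:
from multiprocessing.dummy import Pool
import requests

def fetch(url):
    # Each call blocks on the network, so running several in parallel
    # overlaps the waiting time.
    return requests.get(url, timeout=10).text

urls = ["http://example.com/news/1", "http://example.com/news/2"]  # placeholder URLs

with Pool(8) as pool:                # 8 worker threads
    pages = pool.map(fetch, urls)    # all fetches run concurrently
# parse `pages` with BS4 afterwards, exactly as before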
I agree with the others. To give a concrete answer/solution we will NEED to see the code.
However, in a nutshell, what you will need to do is profile the application with your browser's DevTools. That will typically lead you to push the synchronous JavaScript below the CSS, markup, and text content in the load order so it doesn't block rendering.
Also create a routine to load an initial chunk of content (approximately one page or slide's worth) so that the user has something to look at. The rest can load in the background, and they will never know the difference; it will almost certainly be available before they can click through to the next slide, even if it does take 10 seconds or so.
Perceived performance is what I am describing here. Yes, I agree, you can and should find ways to improve the overall loading time, but arguably more important is improving the "perceived performance". This is done, as I said, by loading some initial content and then streaming in the rest immediately afterwards.
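As an illustration of that idea in Flask, here is a hedged sketch: render the first slide immediately and let the page's JavaScript pull the rest from a JSON endpoint in the background (the route names and helper functions are hypothetical):
from flask import Flask, jsonify, render_template

app = Flask(__name__)

@app.route("/")
def index():
    # Scrape just enough for the first slide so the page renders quickly.
    first_item = get_first_news_item()            # hypothetical helper
    return render_template("index.html", item=first_item)

@app.route("/more-news")
def more_news():
    # The page's JavaScript calls this after load to fill in the remaining
    # slides and tweets.
    return jsonify(get_remaining_news_items())    # hypothetical helper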

Scraping a large number of urls in Python

I have 630,220 urls that I need to open and scrape. These urls themselves have been scraped, and scraping them was much easier because every single scraped page would return around 3,500 urls.
To scrape these 630,220 urls, I'm currently doing parallel scraping in Python using threads. Using 16 threads, it takes 51 seconds to scrape 200 urls, so it would take about 44 hours to scrape all 630,220 urls, which seems an unnecessarily time-consuming and grossly inefficient way to handle this problem.
Assuming that the server will not be overloaded, is there a way to asynchronously send something like 1000 requests per second? That would bring down the total scraping time to about 10 minutes, which is pretty reasonable.
Use gevent. Enable monkey-patching of the Python standard library, then use your favourite scraping library as before, but replace the threads with 1000 greenlets doing the same thing. And you're done.
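A minimal sketch of that, assuming requests and BeautifulSoup as the scraping stack; the URLs are placeholders:
from gevent import monkey
monkey.patch_all()                    # must run before the networking imports

from gevent.pool import Pool
import requests

def scrape(url):
    html = requests.get(url, timeout=10).text
    # parse `html` with BeautifulSoup here
    return len(html)

urls = ["http://example.com/item/%d" % i for i in range(1, 1001)]  # placeholder URLs
pool = Pool(1000)                     # up to 1000 greenlets in flight at once
results = pool.map(scrape, urls)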

Distributing a python workload across multiple processes

Let us suppose that I want to do a google search for the word 'hello'. I then want to go to every single link on the first 100 pages of Google and download the HTML of that linked page. Since there are 10 results per page, this would mean I'd have to click about 1,000 links.
This is how I would do it with a single process:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://google.com')

# do the search
search = driver.find_element_by_name('q')
search.send_keys('hello')
search.submit()

# click all the items
links_on_page = driver.find_elements_by_xpath('//li/div/h3/a')
for item in links_on_page:
    item.click()
    # do something on the page
    driver.back()

# go to the next page
driver.find_element_by_xpath('//*[@id="pnnext"]').click()
This would obviously take a very long time on 100 pages. How would I distribute the load so that I could have (for example) three drivers open, each 'checking out' a page? For example:
Driver #1 checks out page 1. Starts page 1.
Driver #2 sees that page 1 is checked out and goes to page #2. Starts page 2.
Driver #3 sees that pages 1 and 2 are checked out and goes to page 3. Starts page 3.
Driver #1 finishes work on page 1...starts page 4.
I understand the principle of how this would work, but what would be the actual code to get a basic implementation of this working?
You probably want to use a multiprocessing Pool. To do so, write a method that is parameterised by page number:
def get_page_data(page_number):
    # Fetch page data
    ...
    # Parse page data
    ...
    for linked_page in parsed_links:
        # Fetch page source and save to file
        ...
Then just use a Pool of however many processes you think is appropriate (determining this number will probably require some experimentation):
from multiprocessing import Pool
if __name__ == '__main__':
    pool = Pool(processes=4)
    pool.map(get_page_data, range(1, 101))
This will now set 4 processes going, each fetching a page from Google and then fetching each of the pages it links to.
Not answering your question directly, but proposing an avenue that might make your code usable in a single process, thereby avoiding synchronisation issues between different threads or processes...
You would probably be better off with a framework such as Twisted, which enables asynchronous network operations, in order to keep all operations within the same process. In your code, parsing the HTML is likely to take far less time than the network operations required to fetch the pages. Therefore, using asynchronous IO, you can launch several requests at the same time and parse each result only as its response arrives. In effect, by the time each page is returned, your process is likely to be "idling" in the run loop, ready to handle it.
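A minimal sketch of that approach with Twisted's Agent (the URLs are placeholders, and the "parsing" is just a length check standing in for real scraping):
from twisted.internet import defer, reactor
from twisted.web.client import Agent, readBody

agent = Agent(reactor)

@defer.inlineCallbacks
def fetch_and_parse(url):
    # The reactor keeps other requests moving while this one waits on the network.
    response = yield agent.request(b"GET", url.encode("ascii"))
    body = yield readBody(response)
    # Parse `body` here (e.g. with BeautifulSoup); it is cheap next to the fetch.
    defer.returnValue(len(body))

def main():
    urls = ["http://example.com/page/%d" % i for i in range(1, 11)]  # placeholder URLs
    done = defer.gatherResults([fetch_and_parse(u) for u in urls], consumeErrors=True)
    done.addCallback(print)
    done.addBoth(lambda _: reactor.stop())

if __name__ == "__main__":
    main()
    reactor.run()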

Being a good citizen and web-scraping

I have a two part question.
First, I'm writing a web-scraper based on the CrawlSpider spider in Scrapy. I'm aiming to scrape a website that has many thousands (possibly into the hundreds of thousands) of records. These records are buried 2-3 layers down from the start page. So basically I have the spider start on a certain page, crawl until it finds a specific type of record, and then parse the html. What I'm wondering is what methods exist to prevent my spider from overloading the site. Is there possibly a way to do things incrementally, or to put a pause in between different requests?
Second, and related, is there a method with Scrapy to test a crawler without placing undue stress on a site? I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?
Any advice or resources would be greatly appreciated.
Is there possibly a way to do things incrementally
I'm using Scrapy's caching ability to scrape the site incrementally:
HTTPCACHE_ENABLED = True
Or you can use the new 0.14 feature, Jobs: pausing and resuming crawls.
or put a pause in between different requests?
check these settings:
DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY
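For illustration, these might look as follows in a project's settings.py (the values are examples, not recommendations):
HTTPCACHE_ENABLED = True          # cache responses so repeated runs replay from disk
DOWNLOAD_DELAY = 2                # wait about 2 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True   # vary the delay (0.5x to 1.5x) so requests are less regular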
is there a method with Scrapy to test a crawler without placing undue stress on a site?
You can try debugging your code in the Scrapy shell.
I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?
Also, you can call scrapy.shell.inspect_response at any time in your spider.
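A small sketch of what that looks like inside a spider callback (the spider itself is hypothetical):
import scrapy
from scrapy.shell import inspect_response

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]   # placeholder

    def parse(self, response):
        # Opens an interactive shell on this single response so you can try
        # out selectors without sending any further requests.
        inspect_response(response, self)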
Any advice or resources would be greatly appreciated.
Scrapy documentation is the best resource.
You have to start crawling and log everything. If you get banned, you can add sleep() before page requests.
Changing the User-Agent is good practice, too (http://www.user-agents.org/ http://www.useragentstring.com/).
If you get banned by IP, use a proxy to bypass it. Cheers.
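A small illustration of that advice with requests (the User-Agent string, proxy address, and URLs are placeholders):
import time
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}   # placeholder UA
proxies = {"http": "http://127.0.0.1:8080"}                             # placeholder proxy

page_urls = ["http://example.com/page/%d" % i for i in range(1, 4)]     # placeholder URLs
for url in page_urls:
    response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    time.sleep(2)                                                       # back off between requests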
