I have 630,220 URLs that I need to open and scrape. These URLs were themselves scraped, and collecting them was much easier because every scraped page returned around 3,500 URLs.
To scrape these 630,220 URLs, I'm currently doing parallel scraping in Python using threads. With 16 threads it takes 51 seconds to scrape 200 URLs, so scraping all 630,220 URLs would take about 44 hours, which seems like an unnecessarily time-consuming and grossly inefficient way to handle this problem.
Assuming the server will not be overloaded, is there a way to asynchronously send something like 1,000 requests per second? That would bring the total scraping time down to about 10 minutes, which is pretty reasonable.
Use gevent. Enable monkey-patching of the Python standard library, use your favourite scraping library, and replace the threads with 1,000 greenlets doing the same thing. And you're done.
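A minimal sketch of that approach, assuming the pages can be fetched with plain GET requests; the pool size of 1,000, the timeout, and the example URL are illustrative values only:

from gevent import monkey
monkey.patch_all()  # patch sockets before importing anything that does I/O

from gevent.pool import Pool
import requests

urls = ["http://example.com/page1"]  # replace with your 630,220 scraped URLs

def fetch(url):
    # Each greenlet does the same work a thread did: fetch one page.
    try:
        return url, requests.get(url, timeout=30).text
    except requests.RequestException:
        return url, None

pool = Pool(1000)  # up to 1,000 concurrent greenlets
for url, body in pool.imap_unordered(fetch, urls):
    if body is not None:
        pass  # parse/store the page here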
I'm trying to scrape all pages with different IDs from a site whose URLs are formatted like url.com/page?id=1, but there are millions of IDs, so even at 1 request per second it will take weeks to get them all.
I am a total noob at this, so I was wondering whether there is a better way than going one by one, such as some kind of bulk request, or whether I should just increase the requests per second to whatever I can get away with.
I am currently using requests and BeautifulSoup in Python to scrape the pages.
The grequests library is one possible approach you could take. The results are returned in the order they are obtained (which is not the same order as event_list).
import grequests

# Build all the requests up front, then send them 5 at a time with imap.
event_list = [grequests.get(f"http://url.com/page?id={req_id}") for req_id in range(1, 100)]
for r in grequests.imap(event_list, size=5):
    print(r.request.url)
    print(r.text[:100])
    print()
Note: you are likely to be blocked if you attempt this on most sites. A better approach is to check whether the website has a suitable API that exposes the same information; you can often spot one with your browser's network tools while using the site.
I am a bit new to web scraping and my question might be a bit silly. I want to get information from a rental website, scraping almost 2,000 pages per day, but I do not want to hammer their website. I only need the information inside a specific tag, which is a table. Is there any way to request only that part of the page rather than downloading the whole page?
I will surely add delay and sleep to the script, but reducing the file size would also help.
Doing that would reduce the requested file size from around 300 kB to 11 kB.
Website URL: https://asunnot.oikotie.fi/vuokrattavat-asunnot
example of webpage: https://asunnot.oikotie.fi/vuokrattavat-asunnot/imatra/15733776
required tag: <div class="listing-details-container">...</div>
Thank you for your response in advance :)
I think 2,000 requests a day is not high; it depends on when you do it. If you put a 10-second wait between each request, that should not overload the site, but it would take about 6 hours.
It may be better to do it overnight, when the site should be quieter.
If you do 2,000 requests with no wait, the site owner may be unhappy.
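A minimal sketch of that approach, assuming the listing URLs are already known; note that SoupStrainer only reduces parsing work, not the downloaded size, and the 10-second sleep mirrors the suggestion above:

import time
import requests
from bs4 import BeautifulSoup, SoupStrainer

listing_urls = [
    "https://asunnot.oikotie.fi/vuokrattavat-asunnot/imatra/15733776",
    # ... the rest of the ~2,000 daily URLs
]

# Parse only the tag the question asks about.
only_details = SoupStrainer("div", class_="listing-details-container")

for url in listing_urls:
    resp = requests.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser", parse_only=only_details)
    details = soup.find("div", class_="listing-details-container")
    if details is not None:
        print(details.get_text(strip=True)[:200])
    time.sleep(10)  # 10-second wait between requests, as suggested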
I have a list of approximately 52 websites, which lead to roughly 150 webpages that I need to scrape. Out of ignorance and a lack of research, I started building one crawler per webpage, which is becoming too difficult to complete and maintain.
Based on my analysis so far, I already know what information I want to scrape per webpage, and it is clear that these websites each have their own structure. On the plus side, I noticed that each website shares some commonalities in web structure across its webpages.
My million-dollar question: is there a single technique or single web crawler that I can use to scrape these sites? I already know the information I want, these sites are rarely updated in terms of their web structure, and most of these sites have documents that need to be downloaded.
Alternatively, is there a better solution that would reduce the number of web crawlers I need to build? Additionally, these web crawlers will only be used to download new information from the websites I aim them at.
[…] I started building one crawler per webpage, which is becoming too difficult to complete and maintain […] it is clear that these websites each have their own structure. […] these sites are rarely updated in terms of their web structure […]
If websites have different structures, having separate spiders makes sense, and should make maintenance easier in the long term.
You say completing new spiders (I assume you mean developing them, not crawling or something else) is becoming difficult. However, if a new website is similar to one you already handle, you can simply copy and paste the most similar existing spider and make only the necessary changes.
Maintenance should be easiest with separate spiders for different websites. If a single website changes, you can fix the spider for that website. If you have a spider for multiple websites, and only one of them changes, you need to make sure that your changes for the modified website do not break the rest of the websites, which can be a nightmare.
Also, since you say website structures do not change often, maintenance should not be that hard in general.
If you notice you are repeating a lot of code, you might be able to extract some shared code into a spider middleware, a downloader middleware, an extension, an item loader, or even a base spider class shared by two or more spiders. But I would not try to use a single Spider subclass to scrape multiple different websites that are likely to evolve separately.
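For the base-spider option, here is a minimal sketch assuming Scrapy; the site names, start URLs, and CSS selectors below are purely hypothetical:

import scrapy

class BaseDocumentSpider(scrapy.Spider):
    """Shared logic: find document links on a page and yield them as items."""

    document_css = "a.document::attr(href)"  # hypothetical default selector

    def parse(self, response):
        for href in response.css(self.document_css).getall():
            yield {"source": self.name, "file_urls": [response.urljoin(href)]}

class SiteASpider(BaseDocumentSpider):
    name = "site_a"
    start_urls = ["https://site-a.example/reports"]   # hypothetical
    document_css = "table.reports a::attr(href)"      # site-specific override

class SiteBSpider(BaseDocumentSpider):
    name = "site_b"
    start_urls = ["https://site-b.example/downloads"]  # hypothetical
    # inherits the default document_css

Each site keeps its own small spider, so a change on one site only ever touches one subclass.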
I suggest you crawl specific tags, such as body, h1 through h6, and p, for each link. You can gather all the p tags and associate them with the link they came from; the same can be done for every tag you want to crawl. You can also store the links related to those tags in your database.
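A minimal sketch of that tag-gathering idea, assuming BeautifulSoup; the URL and the record dictionary are just illustrative:

import requests
from bs4 import BeautifulSoup

url = "https://example.com/page"  # hypothetical link
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

# Collect the text of each tag of interest, keyed by tag name, for this link.
record = {"url": url}
for tag_name in ["h1", "h2", "h3", "h4", "h5", "h6", "p"]:
    record[tag_name] = [t.get_text(strip=True) for t in soup.find_all(tag_name)]
record["links"] = [a["href"] for a in soup.find_all("a", href=True)]
# record can now be appended to your database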
I'm currently using BeautifulSoup to scrape sourceforge.net for various project information. I'm using the solution in this thread. It works well, but I wish to do it faster. Right now I'm creating a list of 15 URLs and feeding them into run_parallel_in_threads. All the URLs are sourceforge.net links. I'm currently getting about 2.5 pages per second, and it seems that increasing or decreasing the number of URLs in my list doesn't have much effect on the speed. Is there any strategy to increase the number of pages I can scrape? Are there any other solutions more suitable for this kind of project?
You could have your parallel threads simply retrieve the web content. Once an HTML page is retrieved, pass it into a queue served by multiple workers, each parsing a single HTML page. You've now essentially pipelined your workflow: instead of having each thread do multiple steps (retrieve the page, scrape it, store the result), each of your parallel threads simply retrieves a page and passes the task into a queue, where these tasks are processed in a round-robin fashion.
Please let me know if you have any questions!
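A minimal sketch of that fetch/parse pipeline, assuming the pages can be parsed with BeautifulSoup; the URL list, thread counts, and queue size are illustrative:

import queue
import threading
import requests
from bs4 import BeautifulSoup

urls = ["https://sourceforge.net/projects/example/"]  # your 15-URL batches go here
page_queue = queue.Queue(maxsize=50)

def fetcher():
    # Retrieval-only thread: download HTML and hand it off to the parsers.
    for url in urls:
        try:
            page_queue.put((url, requests.get(url, timeout=30).text))
        except requests.RequestException:
            pass

def parser():
    # Worker thread: parse pages pulled from the queue until a sentinel arrives.
    while True:
        url, html = page_queue.get()
        if url is None:
            break
        soup = BeautifulSoup(html, "html.parser")
        print(url, soup.title.string if soup.title else "")

fetch_thread = threading.Thread(target=fetcher)
parse_threads = [threading.Thread(target=parser) for _ in range(4)]
fetch_thread.start()
for t in parse_threads:
    t.start()
fetch_thread.join()
for _ in parse_threads:
    page_queue.put((None, None))  # one sentinel per parser
for t in parse_threads:
    t.join()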
I have a two-part question.
First, I'm writing a web scraper based on the CrawlSpider spider in Scrapy. I'm aiming to scrape a website that has many thousands (possibly hundreds of thousands) of records. These records are buried 2-3 layers down from the start page, so basically I have the spider start on a certain page and crawl until it finds a specific type of record, then parse the HTML. What I'm wondering is what methods exist to prevent my spider from overloading the site. Is there possibly a way to do things incrementally or put a pause between different requests?
Second, and related: is there a method with Scrapy to test a crawler without placing undue stress on a site? I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?
Any advice or resources would be greatly appreciated.
Is there possibly a way to do things incrementally
I'm using Scrapy's caching ability to scrape the site incrementally:
HTTPCACHE_ENABLED = True
Or you can use the new 0.14 feature, Jobs: pausing and resuming crawls.
or put a pause in between different requests?
Check these settings (a minimal example follows below):
DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY
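A minimal sketch of those settings together, for a project's settings.py; the 2-second delay is just an illustrative value:

# settings.py (excerpt)
DOWNLOAD_DELAY = 2               # wait ~2 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter each delay between 0.5x and 1.5x DOWNLOAD_DELAY
HTTPCACHE_ENABLED = True         # cache responses so re-runs don't hit the site again
# JOBDIR = "crawls/my-spider-1"  # enables the pause/resume Jobs feature mentioned above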
is there a method with Scrapy to test a crawler without placing undue stress on a site?
You can try and debug your code in the Scrapy shell.
I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?
Also, you can call scrapy.shell.inspect_response at any time in your spider.
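For example, a hypothetical callback that inspects the first page matching a made-up selector and then stops the crawl; CloseSpider is a standard Scrapy exception, though it is not mentioned above:

from scrapy.exceptions import CloseSpider
from scrapy.shell import inspect_response

def parse_record(self, response):
    if response.css("div.record"):        # hypothetical "found my data" check
        inspect_response(response, self)  # poke at the response interactively
        raise CloseSpider("stop after the first page with the data I want")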
Any advice or resources would be greatly appreciated.
The Scrapy documentation is the best resource.
You have to start crawling and log everything. If you get banned, you can add sleep() before page requests.
Changing the User-Agent is good practice, too (http://www.user-agents.org/, http://www.useragentstring.com/).
If you get banned by IP, use a proxy to bypass it. Cheers.
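A minimal sketch of that advice (logging, sleeping, rotating the User-Agent, optional proxy); the User-Agent strings, URL, and proxy address are placeholders:

import logging
import random
import time
import requests

logging.basicConfig(filename="crawl.log", level=logging.INFO)  # log everything

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # placeholder strings; see the
    "Mozilla/5.0 (X11; Linux x86_64)",            # user-agent lists linked above
]
proxies = None  # e.g. {"http": "http://127.0.0.1:8080"} if you get banned by IP

for url in ["https://example.com/page1"]:  # hypothetical URL list
    headers = {"User-Agent": random.choice(user_agents)}
    resp = requests.get(url, headers=headers, proxies=proxies, timeout=30)
    logging.info("%s -> %s", url, resp.status_code)
    time.sleep(1)  # back off more if you start getting blocked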