I am using Scrapy to scrape some big brands and import their sale data into my site. Currently I am using:
DOWNLOAD_DELAY = 1.5
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16
I am using an Item Loader to specify the CSS/XPath rules and a pipeline to write the data into a CSV file. The data I collect is the original price, sale price, colour, sizes, name, image URL and brand.
I have written the spider for only one merchant so far, which has around 10k URLs, and it takes about 4 hours.
My question is: does 4 hours sound alright for 10k URLs, or should it be faster than that? If so, what else do I need to do to speed it up?
I am using only one Splash instance locally for testing, but in production I am planning to use 3 Splash instances.
Now the main issue: I have about 125 merchants, with roughly 10k products each on average, and a couple of them have more than 150k URLs to scrape.
I need to scrape all their data every night to update my site.
Since my single spider takes 4 hours to scrape 10k URLs, I am wondering whether it is even a valid dream to cover 125 x 10k URLs every night.
I would really appreciate your experienced input on my problem.
Your DOWNLOAD_DELAY is enforced per IP, so if there is only one IP, then 10,000 requests will take 15,000 seconds (10,000 * 1.5). That is just over 4 hours, so yes, that is about right.
If you are scraping more than one site, they will be on different IP addresses, so they should run more or less in parallel, and the whole run should still take around 4 hours or so.
If you are scraping 125 sites, then you will probably hit a different bottleneck at some point.
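To give a feel for what that looks like in practice, here is a minimal sketch of running several merchant spiders side by side in one process with Scrapy's CrawlerProcess. MerchantSpider, its parse rule and the example domains are placeholders for your real spiders, and the Splash integration is left out; the point is only that spiders for different domains share the process while each site keeps its own DOWNLOAD_DELAY.

# Sketch only: MerchantSpider and the domains below are hypothetical.
import scrapy
from scrapy.crawler import CrawlerProcess

class MerchantSpider(scrapy.Spider):
    name = "merchant"
    custom_settings = {
        "DOWNLOAD_DELAY": 1.5,                # politeness delay, enforced per site as discussed above
        "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
    }

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []

    def parse(self, response):
        # The real extraction rules (Item Loaders, pipelines) would go here.
        yield {"url": response.url}

if __name__ == "__main__":
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    # Different merchants are different sites, so their delays do not stack;
    # the crawls proceed in parallel within the same process.
    for url in ["https://merchant-a.example/", "https://merchant-b.example/"]:
        process.crawl(MerchantSpider, start_url=url)
    process.start()  # blocks until every crawl has finished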
I'm using Scrapy to collect about 150 attributes for Cloud Services listed on the UK Gov Digital Marketplace (G-Cloud). The catalogue holds details of about 40,000 services, accessed through summary pages with 30 products to a page (1,355 pages). Things had been going fine for 5 or 6 years. Now the catalogue has been moved to a new URL. I have repaired all the XPaths and tested, and I am getting all the data I need whenever a product page is hit.
But comparing my downloaded product count to the total on the catalogue, I am getting only 30,000 products - missing 25%.
"custom_settings" includes 'AUTOTHROTTLE_START_DELAY': .5,
'AUTOTHROTTLE_MAX_DELAY': 60,
Would changing these settings improve my harvest?
What else can you suggest may be impacting my low yield?
For completeness, here is the pagination function, which seems to work effectively, so I don't think the fault lies here:
def parse(self, response):
    for link in response.xpath('//h2/a/@href').getall():
        yield scrapy.Request(response.urljoin(link), callback=self.parse_item)
    next_page = response.xpath('//li[@class="dm-pagination__item dm-pagination__item--next"]/a/@href').get()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Historically, I never get the full product count, as a small number are suspended for editing or deletion, but this would be less than 1%. The script takes roughly 15 hours to run from my desktop. I am running the script repeatedly to see if the outcome is consistent and if the missing products are always the same ones. Due to the long run-time, it will be a week or so before I have sufficient data to draw a conclusion.
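For reference, here is a rough sketch of how the product requests could be instrumented with Scrapy's standard errback hook, so that anything which never reaches parse_item gets logged and can be compared against the missing 25%. The start URL, retry and autothrottle values shown are illustrative assumptions, not the real custom_settings.

# Sketch: log product requests that fail outright. The setting values
# and start URL are illustrative; only the errback wiring is the point.
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError

class GCloudSpider(scrapy.Spider):
    name = "gcloud"
    start_urls = ["https://www.example.invalid/g-cloud/search"]  # placeholder start URL
    custom_settings = {
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 0.5,
        "AUTOTHROTTLE_MAX_DELAY": 60,
        "RETRY_TIMES": 5,                    # assumed: retry transient failures a few extra times
    }

    def parse(self, response):
        for link in response.xpath('//h2/a/@href').getall():
            yield scrapy.Request(
                response.urljoin(link),
                callback=self.parse_item,
                errback=self.log_failure,    # fires on network errors and non-2xx responses
            )
        next_page = response.xpath('//li[@class="dm-pagination__item dm-pagination__item--next"]/a/@href').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_item(self, response):
        # Real attribute extraction goes here; stubbed for the sketch.
        yield {"url": response.url}

    def log_failure(self, failure):
        if failure.check(HttpError):
            self.logger.warning("Non-2xx product page: %s", failure.value.response.url)
        else:
            self.logger.warning("Request failed: %s", failure.request.url)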
I'm trying to scrape all pages with different ids from a site whose URLs are formatted like url.com/page?id=1, but there are millions of ids, so even at 1 request per second it will take weeks to get them all.
I am a total noob at this, so I was wondering if there is a better way than going one by one, such as some kind of bulk request, or whether I should just increase the requests per second to whatever I can get away with.
I am currently using requests and BeautifulSoup in Python to scrape the pages.
The grequests library is one possible approach you could take. The results are returned in the order they are obtained (which is not the same order as event_list).
import grequests

event_list = [grequests.get(f"http://url.com/page?id={req_id}") for req_id in range(1, 100)]

for r in grequests.imap(event_list, size=5):
    print(r.request.url)
    print(r.text[:100])
    print()
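The size argument is what keeps this polite: it caps how many of those requests grequests has in flight at once, so it is the main knob for trading speed against the load you place on the server.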
Note: you are likely to be blocked if you attempt this on most sites. A better approach would be to see if the website has a suitable API you could use to obtain the same information; such an API can often be found by watching your browser's network tools whilst using the site.
I am trying to scrape all of the sites listed on a certain website. I will use www.site.com instead of the real domain just to simplify my problem.
Basically, there is a list of around 300,000 sites; each page shows 30 results, so there should be around 10,000 pages.
This is an example:
www.site.com/1 -> sites from 1-30
www.site.com/2 -> sites from 31-60
www.site.com/3 -> sites from 61-90
www.site.com/4 -> sites from 91-120
The problem is that when I reach page 167, no more results are shown after that, so I can only see the list of the first 5,000 sites.
When I write this:
www.site.com/168
I get this error: PHP Warning – yii\base\ErrorException
I was able to create a script in Python that scrapes the first 5,000 sites, but I have no idea how to access the full list.
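Roughly, the kind of script I mean looks like the sketch below (www.site.com is still the placeholder domain and the link selector is made up); it just walks the numbered pages in order and stops once the error page starts coming back around page 168.

# Sketch with a placeholder domain and an invented selector; assumes the
# Yii error page comes back with a non-200 status code.
import requests
from bs4 import BeautifulSoup

collected = []
page = 1
while True:
    resp = requests.get(f"https://www.site.com/{page}", timeout=30)
    if resp.status_code != 200:          # the error page around page 168
        break
    soup = BeautifulSoup(resp.text, "html.parser")
    links = soup.select("a.site-link")   # hypothetical selector for the 30 results per page
    if not links:
        break
    collected.extend(a.get("href") for a in links)
    page += 1

print(len(collected), "sites collected")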
For example, it is possible to search for certain keywords on that page, but again, if there are more than 5,000 results, only the first 5,000 sites are shown.
Any ideas on how to solve this problem?
I am a bit new to web scraping and my question might be a bit silly. I want to get information from a rental website, scraping almost 2,000 pages per day, but I do not want to hammer their website. I just need the information inside one specific tag, which is a table. Is there any way to request only that part of the page rather than getting the whole page?
I will certainly add delays and sleeps to the script, but reducing the file size would also help.
Implementing that would reduce the requested file size from around 300 kB to 11 kB.
Website URL: https://asunnot.oikotie.fi/vuokrattavat-asunnot
example of webpage: https://asunnot.oikotie.fi/vuokrattavat-asunnot/imatra/15733776
required tag: <div class="listing-details-container">...</div>
Thank you for your response in advance :)
I think 2,000 a day is not high; it depends on when you do it. If you put a 10-second wait between each request, that should not overload the site, but the run would take around 6 hours.
It may be better to do it overnight when the site should be quieter.
If you do 2000 with no wait the site owner may be unhappy.
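As a rough sketch of that pattern with requests and BeautifulSoup (the example URL and the listing-details-container selector are the ones given in the question; note that the page still has to be downloaded in full, the saving is only in what you parse and keep):

# Sketch: fetch each listing with a 10-second pause and keep only the
# contents of the listing-details-container block.
import time
import requests
from bs4 import BeautifulSoup

listing_urls = [
    "https://asunnot.oikotie.fi/vuokrattavat-asunnot/imatra/15733776",
    # ... the rest of the ~2,000 listing URLs
]

for url in listing_urls:
    resp = requests.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")
    details = soup.select_one("div.listing-details-container")
    if details is not None:
        print(url, details.get_text(" ", strip=True)[:80])
    time.sleep(10)   # the 10-second politeness delay suggested above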
I have 630,220 URLs that I need to open and scrape. These URLs were themselves scraped, and scraping them was much easier because every single scraped page would return around 3,500 URLs.
To scrape these 630,220 URLs, I'm currently doing parallel scraping in Python using threads. Using 16 threads, it takes 51 seconds to scrape 200 URLs, so it would take about 44 hours to scrape all 630,220 URLs, which seems an unnecessarily time-consuming and grossly inefficient way to handle this problem.
Assuming that the server will not be overloaded, is there a way to asynchronously send something like 1000 requests per second? That would bring down the total scraping time to about 10 minutes, which is pretty reasonable.
Use gevent. Enable monkey-patching of the Python standard library, and just use your favourite scraping library, replacing the threads with 1,000 greenlets doing the same thing. And you're done.
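A minimal sketch of that setup (the URL list is a placeholder, and requests stands in for whatever scraping library you prefer; monkey-patching makes its sockets cooperative):

# Sketch: monkey-patch first, then fan out over a pool of 1,000 greenlets.
from gevent import monkey
monkey.patch_all()          # must run before the other imports

import gevent.pool
import requests

urls = [f"https://example.com/page/{i}" for i in range(1, 201)]   # placeholder for the 630,220 URLs

def fetch(url):
    resp = requests.get(url, timeout=30)
    return url, resp.status_code, len(resp.content)

pool = gevent.pool.Pool(1000)            # up to 1,000 concurrent greenlets
for url, status, size in pool.imap_unordered(fetch, urls):
    print(status, size, url)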