Scrapy (Python) only collecting 75% of target web pages - how to improve

I'm using Scrapy to collect about 150 attributes for Cloud Services listed on the UK Gov Digital Marketplace (G-Cloud). The catalogue holds details of about 40,000 services, accessed through summary pages with 30 products to a page (1,355 pages). Things have been going fine for 5 or 6 years. Now the catalogue has been moved to a new URL. I have repaired all the XPaths and tested them, and I am getting all the data I need when a product is hit.
But comparing my downloaded product count to the total on the catalogue, I am getting only 30,000 products - missing 25%.
"custom_settings" includes 'AUTOTHROTTLE_START_DELAY': .5,
'AUTOTHROTTLE_MAX_DELAY': 60,
Would changing these settings improve my harvest?
What else can you suggest may be impacting my low yield?
For completeness, here is the pagination function, which seems to work effectively, so I don't think the fault lies here:
def parse(self, response):
    # Follow every product link on the summary page
    for link in response.xpath('//h2/a/@href').getall():
        yield scrapy.Request(response.urljoin(link), callback=self.parse_item)
    # Follow the "next" pagination link, if there is one
    next_page = response.xpath('//li[@class="dm-pagination__item dm-pagination__item--next"]/a/@href').get()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Historically, I never get the full product count, as a small number are suspended for editing or deletion, but that accounts for less than 1%. The script takes roughly 15 hours to run from my desktop. I am running it repeatedly to see whether the outcome is consistent and whether the missing products are always the same ones. Due to the long run-time, it will be a week or so before I have enough data to draw a conclusion.
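In the meantime, one way to see where the shortfall comes from is to log every request that ultimately fails and to let Scrapy retry transient errors a few more times. The sketch below is a diagnostic aid, not part of the original spider; the errback name and the retry values are assumptions.

custom_settings = {
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 0.5,
    'AUTOTHROTTLE_MAX_DELAY': 60,
    'RETRY_ENABLED': True,
    'RETRY_TIMES': 5,            # hypothetical: retry transient failures more aggressively
    'DUPEFILTER_DEBUG': True,    # log every request the duplicate filter drops
}

def parse(self, response):
    for link in response.xpath('//h2/a/@href').getall():
        yield scrapy.Request(response.urljoin(link),
                             callback=self.parse_item,
                             errback=self.log_failure)

def log_failure(self, failure):
    # Record the URL of any request that failed after all retries, so the
    # missing products can be compared against this list after the run.
    self.logger.error('Request failed: %s', failure.request.url)

The crawl stats Scrapy prints at the end of the run (for example retry/max_reached, dupefilter/filtered and the httperror counts) usually point to where the missing requests went. Raising the AutoThrottle delays mostly affects speed rather than yield, unless the site is answering with 429/5xx responses that retries would then recover.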

Related

How to scrape data from multiple unrelated sections of a website (using Scrapy)

I have made a Scrapy web crawler which can scrape Amazon. It can scrape by searching for items using a list of keywords and scrape the data from the resulting pages.
However, I would like to scrape a large portion of Amazon's product data. I don't have a preferred list of keywords with which to query for items. Rather, I'd like to scrape the website evenly and collect X number of items that is representative of all products listed on Amazon.
Does anyone know how to scrape a website in this fashion? Thanks.
I'm putting my comment as an answer so that others looking for a similar solution can find it more easily.
One way to achieve this is to go through each category (furniture, clothes, technology, automotive, etc.) and collect a set number of items from each. Amazon has side/top bars with navigation links to the different categories, so you can let the spider run through those.
The process would be as follows:
Follow category urls from initial Amazon.com parse
Use a different parse function for the callback, one that will scrape however many items from that category
Ensure that data is writing to a file (it will probably be a lot of data)
However, such an approach would not preserve the proportions that each category contributes to Amazon's total product listing. Try looking for an "X number of results" label on each category page to compensate for that. Good luck with your project!
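As a rough skeleton of that approach (the CSS selectors and the per-category cap below are placeholders, since Amazon's markup is not shown in the question and changes frequently):

import scrapy

class CategorySamplerSpider(scrapy.Spider):
    name = 'category_sampler'
    start_urls = ['https://www.amazon.com/']
    items_per_category = 200  # hypothetical cap per category

    def parse(self, response):
        # Follow each category link in the navigation bar (placeholder selector).
        for href in response.css('a.nav-category::attr(href)').getall():
            yield response.follow(href, callback=self.parse_category, cb_kwargs={'collected': 0})

    def parse_category(self, response, collected):
        # Collect a fixed number of items from this category (placeholder selectors).
        for product in response.css('div.s-result-item'):
            if collected >= self.items_per_category:
                return
            collected += 1
            yield {
                'title': product.css('h2 ::text').get(),
                'category': response.url,
            }
        next_page = response.css('a.s-pagination-next::attr(href)').get()
        if next_page and collected < self.items_per_category:
            yield response.follow(next_page, callback=self.parse_category, cb_kwargs={'collected': collected})

Writing the results to a file can then be left to a feed export, e.g. scrapy crawl category_sampler -o items.csv.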

Paginating and getting prices from a site using Scrapy

I started to look at Scrapy and want to have one spider to get some prices of MTG Cards.
First, I don't know if I'm 100% correct in using the link that selects all the available cards as the starting point of the spider:
name = 'bazarmtgbot'
allowed_domains = ['www.bazardebagda.com.br']
start_urls = ['https://bazardebagda.com.br/?view=ecom/itens&tcg=1&txt_estoque=1&txt_limit=160&txt_order=1&txt_extras=all&page=1']
1 - Should I use this kind of start_urls?
2 - Then, if you access the site, I could not find how to get the unit count and price of each card; they are blank DIVs...
I got the name using:
titles = response.css(".itemNameP.ellipsis::text").extract()
3 - I couldn't find how to do the pagination on this site to get the next set of items' units/prices. Do I need to copy the start_urls N times?
(1 and 3) It's fine to start on a given page. When scraping, you can queue additional URLs by looking for something like the "next page" button, extracting that link, and yielding a scrapy.Request that you want to follow up on. See the part of the Scrapy tutorial on following links.
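A minimal sketch of that pattern for this spider (the next-page selector is a placeholder that has to be taken from the site's real markup):

def parse(self, response):
    # Card names visible on the current results page
    for title in response.css('.itemNameP.ellipsis::text').getall():
        yield {'title': title.strip()}

    # Queue the next results page instead of copying start_urls N times
    next_page = response.css('a.next-page::attr(href)').get()  # placeholder selector
    if next_page:
        yield response.follow(next_page, callback=self.parse)

Alternatively, since the page number is visible in the URL (page=1), the follow-up request could be built by incrementing that query parameter instead.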
That site may be using a bunch of techniques to thwart price scraping: the blank price divs load an image of the digits and chop parts of it up with gibberish CSS class names to form the number. You may need to do some OCR or find an alternative method. Bear in mind that because they are going to that degree, there may be other anti-scraping countermeasures as well.

How to scrape tens of thousands urls every night using scrapy

I am using scrapy to scrape some big brands to import the sale data for my site. Currently I am using
DOWNLOAD_DELAY = 1.5
CONCURRENT_REQUESTS_PER_DOMAIN = 16
CONCURRENT_REQUESTS_PER_IP = 16
I am using Item loader to specify css/xpath rules and Pipeline to write the data into csv. The data that I collect is original price, sale price, colour, sizes, name, image url and brand.
I have written the spider for only one merchant who has around 10k urls and it takes me about 4 hours.
My question is: does 4 hours sound alright for 10k URLs, or should it be faster than that? If so, what else do I need to do to speed it up?
I am using only one SPLASH instance locally to test. But in production I am planning to use 3 SPLASH instances.
Now the main issue: I have about 125 merchants, with around 10k products each on average, and a couple of them have more than 150k URLs to scrape.
I need to scrape all their data every night to update my site.
Since my single spider takes 4 hours to scrape 10k URLs, I am wondering whether 125 x 10k URLs every night is even an achievable goal.
I would really appreciate your experienced input on my problem.
Your DOWNLOAD_DELAY is enforced per IP, so if there is only 1 IP, then 10,000 requests will take 15000 seconds (10,000 * 1.5). That is just over 4 hours. So yeah that's about right.
If you are scraping more than one site, they will be on different IP addresses, so they should run more or less in parallel, and the whole run should still take around 4 hours.
If you are scraping 125 sites, then you will probably hit a different bottleneck at some point.
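One way to lean on that parallelism is to launch the per-merchant spiders from a single process so their request queues interleave. This is only a sketch; MerchantASpider and MerchantBSpider are hypothetical stand-ins for the real per-merchant spiders.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical per-merchant spiders; each keeps its own DOWNLOAD_DELAY.
from myproject.spiders.merchant_a import MerchantASpider
from myproject.spiders.merchant_b import MerchantBSpider

process = CrawlerProcess(get_project_settings())
process.crawl(MerchantASpider)
process.crawl(MerchantBSpider)
process.start()  # blocks until every scheduled crawl has finished

Because the delay is applied per IP/domain, the 1.5-second DOWNLOAD_DELAY for one merchant does not add up across merchants, as long as CONCURRENT_REQUESTS is high enough to keep all of the download slots busy.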

Scrapy - Scraping links by date

Is it possible to scrape links by the date associated with them? I'm trying to implement a daily-run spider that saves article information to a database, but I don't want to re-scrape articles that I have already scraped before, i.e. yesterday's articles. I ran across this SO post asking the same thing, and the scrapy-deltafetch plugin was suggested.
However, this relies on checking new requests against previously saved request fingerprints stored in a database. I'm assuming that if the daily scraping went on for a while, the database would need significant storage overhead to keep the fingerprints of requests that have already been scraped.
So given a list of articles on a site like cnn.com, I want to scrape all the articles that have been published today 6/14/17, but once the scraper hits later articles with a date listed as 6/13/17, I want to close the spider and stop scraping. Is this kind of approach possible with scrapy? Given a page of articles, will a CrawlSpider start at the top of the page and scrape articles in order?
Just new to Scrapy, so not sure what to try. Any help would be greatly appreciated, thank you!
You can use a custom deltafetch_key that uses the date and the title as the fingerprint.
from scrapy import Request
from w3lib.url import url_query_parameter
...
def parse(self, response):
    ...
    # Use the product id from the URL as the deltafetch fingerprint
    # instead of the default request fingerprint.
    for product_url in response.css('a.product_listing::attr(href)').getall():
        yield Request(
            response.urljoin(product_url),
            meta={'deltafetch_key': url_query_parameter(product_url, 'id')},
            callback=self.parse_product_page,
        )
    ...
I compose a date using datetime.strptime(Item['dateinfo'], "%b-%d-%Y") from cobbled-together information on the item of interest.
After that I just check it against a configured age limit in my settings, which can be overridden per invocation. You can raise a CloseSpider exception when you find an item that is too old, or you can set a finished flag and act on that in any of your other code.
No need to remember anything. I use this on a spider that I run daily, and I simply set a 24-hour age limit.
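A minimal sketch of that age check (the selectors, the item field name and the 24-hour cutoff are assumptions based on the description above):

from datetime import datetime, timedelta
from scrapy.exceptions import CloseSpider

MAX_AGE = timedelta(hours=24)  # hypothetical cutoff matching the daily run

def parse_article(self, response):
    item = {
        'title': response.css('h1::text').get(),
        'dateinfo': response.css('.date::text').get(),  # placeholder selector
    }
    published = datetime.strptime(item['dateinfo'], '%b-%d-%Y')
    if datetime.now() - published > MAX_AGE:
        # The listing is assumed to be newest-first, so anything older than
        # the cutoff means the rest of the crawl can be skipped.
        raise CloseSpider('reached articles older than the age limit')
    yield item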

Recursively call different URL and wait for first site to be finished scraping using scrapy for Python

I am wondering if there is a way to call more than one site recursively, to make it more dynamic. My instructor has asked me to have Scrapy crawl more than one website. This is what I have.
def start_requests(self):
    yield scrapy.Request("http://www.tripadvisor.in/Hotel_Review-g1009352-d1173080-Reviews-Yercaud_Rock_Perch_A_Sterling_Holidays_Resort-Yercaud_Tamil_Nadu.html", self.parse)
    yield scrapy.Request("http://www.tripadvisor.in/Hotel_Review-g297600-d8029162-Reviews-Daman_Casa_Tesoro-Daman_Daman_and_Diu.html", self.parse)
    yield scrapy.Request("http://www.tripadvisor.in/Hotel_Review-g304557-d2519662-Reviews-Darjeeling_Khushalaya_Sterling_Holidays_Resort-Darjeeling_West_Bengal.html", self.parse)
It works for the most part, but it crawls the sites all at once rather than one by one. Is there a way to have it go through one site at a time instead of all at once? To sum it up, I need it to work through one yield at a time, and once that site is done, move on to the next site, and so on.
No, there is no way to do this with the default configuration of Scrapy.
The main idea behind the tool is to make gathering information fast, and to achieve this it downloads pages in parallel. If you yield a Request, it goes into the scheduler's queue, and as soon as a download slot is free it gets fetched. The response then goes into another queue and is processed: the parse method, or whatever callback you defined, is executed with the response.
To have one site scraped after another, try reducing the concurrency to 1 in your settings.py with the CONCURRENT_REQUESTS = 1 setting.
You can read more about the settings in the docs.
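As a sketch, the setting can also live on the spider itself instead of settings.py (the class name here is hypothetical):

import scrapy

class HotelReviewSpider(scrapy.Spider):
    name = 'hotel_reviews'
    # Process one request at a time so the sites are crawled sequentially
    # rather than interleaved.
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
    }

    def start_requests(self):
        yield scrapy.Request("http://www.tripadvisor.in/Hotel_Review-g1009352-d1173080-Reviews-Yercaud_Rock_Perch_A_Sterling_Holidays_Resort-Yercaud_Tamil_Nadu.html", self.parse)
        yield scrapy.Request("http://www.tripadvisor.in/Hotel_Review-g297600-d8029162-Reviews-Daman_Casa_Tesoro-Daman_Daman_and_Diu.html", self.parse)

Note that requests are still queued and merely processed one at a time, so strict "finish site A before starting site B" ordering is not guaranteed, which is what the first sentence of this answer refers to.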
