I'm trying to write a crawler which does the following job:
Start from a specific URL, then crawl pages up to a depth limit (a simple Scrapy task). While doing so, read the first 1 kB of each linked resource and detect its file type with python-magic. If it is a file rather than an HTML page, I want to send it to a pipeline for further processing.
What I did so far is:
def process_links(self, links):
    for link in links:
        try:
            req = urlopen(link.url)
        except Exception as e:
            self.logger.error("--> Exception URL[%s]: %s", link, e)
            req = None
        if req is not None:
            # Sniff the type from the first 1 kB of the response
            ext_type = magic.from_buffer(req.read(1024))
            if 'HTML' in ext_type:
                # HTML pages are yielded back so Scrapy keeps crawling them
                yield link
            else:
                # Anything else is treated as a file and handed to the pipeline manually
                item = FileItem()
                item["type"] = "FILE"
                item["url"] = link.url
                item["extension"] = ext_type
                itemproc = self.crawler.engine.scraper.itemproc
                itemproc.process_item(item, self)
So I basically take the links extracted from the scraped page and, depending on their type, either yield the link back (so Scrapy issues a new Request) or send a new FileItem to the pipeline manually (the last two lines).
I also have a parse_page function for parsing HTML pages:
def parse_page(self, response):
    item = PageItem()
    item["type"] = "PAGE"
    item["url"] = response.request.url
    item["depth"] = response.meta['depth']
    item["response_url"] = response.url
    item["status_code"] = response.status
    item["status_msg"] = "OK"
    item["request_headers"] = response.request.headers
    item["body"] = response.body
    item["links"] = []
    yield item
This yields the PageItem object.
Although the yielded PageItem fires the item_passed signal in Scrapy, manually passed FileItems do not fire the signal. Am I missing something while manually passing the FileItems to the pipeline?
Thanks in advance!
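For what it's worth, here is a minimal sketch (not the code above) of one way the manual hand-off could also emit the signal: itemproc.process_item returns a Deferred, and the engine normally fires item_scraped (item_passed is its older alias) only after that Deferred completes, so you could fire it yourself. The helper name is illustrative, and this assumes Scrapy's SignalManager API:
from scrapy import signals

def send_file_item(self, item, response=None):
    # Sketch only: push the item through the pipelines, then fire the signal
    # once the pipeline Deferred has finished, mirroring what the engine does.
    itemproc = self.crawler.engine.scraper.itemproc
    dfd = itemproc.process_item(item, self)
    dfd.addCallback(lambda processed: self.crawler.signals.send_catch_log(
        signal=signals.item_scraped, item=processed, response=response, spider=self))
    return dfd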
I'm creating a spider using Scrapy to scrape details from rottentomatoes.com. Since the search page is rendered dynamically, I used the Rotten Tomatoes API (e.g. https://www.rottentomatoes.com/api/private/v2.0/search?q=inception) to get the search results and the movie URL. Following that URL via Scrapy, I was able to extract the tomatometer score, audience score, director, cast, etc. However, I also want to extract all the audience reviews. The issue is that the audience reviews page (https://www.rottentomatoes.com/m/inception/reviews?type=user) uses pagination and I'm not able to extract data from the next page; I also couldn't find a way to use the API to extract those details. Could anyone help me with this?
def parseRottenDetail(self, response):
    print("Reached Tomato Parser")
    try:
        if MoviecrawlSpider.current_parse <= MoviecrawlSpider.total_results:
            items = TomatoCrawlerItem()
            MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['tomatometerScore'] = response.css(
                '.mop-ratings-wrap__row .mop-ratings-wrap__half .mop-ratings-wrap__percentage::text').get().strip()
            MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['tomatoAudienceScore'] = response.css(
                '.mop-ratings-wrap__row .mop-ratings-wrap__half.audience-score .mop-ratings-wrap__percentage::text').get().strip()
            MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['tomatoCriticConsensus'] = response.css(
                'p.mop-ratings-wrap__text--concensus::text').get()
            if MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]["type"] == "Movie":
                MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['Director'] = response.xpath(
                    "//ul[@class='content-meta info']/li[@class='meta-row clearfix']/div[contains(text(),'Directed By')]/../div[@class='meta-value']/a/text()").get()
            else:
                MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['Director'] = response.xpath(
                    "//div[@class='tv-series__series-info-castCrew']/div/span[contains(text(),'Creator')]/../a/text()").get()
            reviews_page = response.css('div.mop-audience-reviews__view-all a[href*="reviews"]::attr(href)').get()
            if len(reviews_page) != 0:
                yield response.follow(reviews_page, callback=self.parseRottenReviews)
            else:
                for key in MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse].keys():
                    if "pageURL" not in key and "type" not in key:
                        items[key] = MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse][key]
                yield items
                if MoviecrawlSpider.current_parse <= MoviecrawlSpider.total_results:
                    MoviecrawlSpider.current_parse += 1
                    print("Parse Values are Current Parse " + str(MoviecrawlSpider.current_parse) +
                          " and Total Results " + str(MoviecrawlSpider.total_results))
                    yield response.follow(MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]["pageURL"],
                                          callback=self.parseRottenDetail)
    except Exception as e:
        exc_type, exc_obj, exc_tb = sys.exc_info()
        print(e)
        print(exc_tb.tb_lineno)
After this code runs I reach the reviews page (e.g. https://www.rottentomatoes.com/m/inception/reviews?type=user). From there, a next button loads the following page via pagination. What should my approach be to extract all the reviews?
def parseRottenReviews(self, response):
    print("Reached Rotten Review Parser")
    items = TomatoCrawlerItem()
When you go to the next page, you can notice that it uses the previous page's end cursor value. You can set endCursor to an empty string for the first iteration. Also note that you need the movieId for requesting reviews; this id can be extracted from the JSON embedded in the JS:
import requests
import re
import json

r = requests.get("https://www.rottentomatoes.com/m/inception/reviews?type=user")
data = json.loads(re.search(r'movieReview\s=\s(.*);', r.text).group(1))
movieId = data["movieId"]

def getReviews(endCursor):
    r = requests.get(f"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user",
                     params={
                         "direction": "next",
                         "endCursor": endCursor,
                         "startCursor": ""
                     })
    return r.json()

reviews = []
result = {}
for i in range(0, 5):
    print(f"[{i}] request review")
    result = getReviews(result["pageInfo"]["endCursor"] if i != 0 else "")
    reviews.extend([t for t in result["reviews"]])

print(reviews)
print(f"got {len(reviews)} reviews")
Note that you can also scrape the HTML for the first iteration.
As I'm using Scrapy, I was looking for a way to do this without the requests module. The approach is the same, but I found that the page https://www.rottentomatoes.com/m/inception has an object root.RottenTomatoes.context.fandangoData in a <script> tag, whose "emsId" key holds the id of the movie that is passed to the API to get details. Going through the network tab on each pagination event, I realised that the API uses startCursor and endCursor to filter the results for each page.
pattern = r'\broot.RottenTomatoes.context.fandangoData\s*=\s*(\{.*?\})\s*;\s*\n'
json_data = response.css('script::text').re_first(pattern)
movie_id = json.loads(json_data)["emsId"]
{SpiderClass}.movieId = movie_id
next_page = "https://www.rottentomatoes.com/napi/movie/" + movie_id + "/reviews/user?direction=next&endCursor=&startCursor="
yield response.follow(next_page, callback=self.parseRottenReviews)
For the first iteration, you can leave the startCursor and endCursor parameters blank. Now you enter the parse function, where you can get the startCursor and endCursor parameters for the next page from the current response. Repeat this until the hasNextPage attribute is false.
def parseRottenReviews(self, response):
    print("Reached Rotten Review Parser")
    current_result = json.loads(response.text)
    for review in current_result["reviews"]:
        {SpiderClass}.reviews.append(review)  # spider class member, so it is shared across iterations
    if current_result["pageInfo"]["hasNextPage"] is True:
        next_page = "https://www.rottentomatoes.com/napi/movie/" + \
            str({SpiderClass}.movieId) + "/reviews/user?direction=next&endCursor=" + \
            str(current_result["pageInfo"]["endCursor"]) + "&startCursor=" + \
            str(current_result["pageInfo"]["startCursor"])
        yield response.follow(next_page, callback=self.parseRottenReviews)
Now the {SpiderClass}.reviews list will hold all the reviews.
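If you also want to emit the collected reviews once pagination ends, a minimal sketch (using the same {SpiderClass} placeholder as above; the 'reviews' item field is an assumption, adjust it to your item definition) could add an else branch on hasNextPage:
def parseRottenReviews(self, response):
    current_result = json.loads(response.text)
    {SpiderClass}.reviews.extend(current_result["reviews"])
    page_info = current_result["pageInfo"]
    if page_info["hasNextPage"]:
        next_page = ("https://www.rottentomatoes.com/napi/movie/" + str({SpiderClass}.movieId) +
                     "/reviews/user?direction=next&endCursor=" + str(page_info["endCursor"]) +
                     "&startCursor=" + str(page_info["startCursor"]))
        yield response.follow(next_page, callback=self.parseRottenReviews)
    else:
        # Pagination finished: hand everything collected so far to the pipeline.
        item = TomatoCrawlerItem()
        item["reviews"] = {SpiderClass}.reviews  # assumed field name
        yield item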
So I'm relatively new to Scrapy and am trying to write a crawler that pulls the hyperlinks for businesses on a listing page. Here is the code:
class EmailSpider(CrawlSpider):
    name = "emailcrawler"
    start_urls = [
        'https://www.yellowpages.com/search?search_terms=Computer+Software+%26+Services&geo_location_terms=Florence%2C+KY'
        # 'https://www.yellowpages.com/search?search_terms=Computers+%26+Computer+Equipment-Service+%26+fix&geo_location_terms=FL'
    ]

    def parse(self, response):
        information = response.xpath('//*[@class="info"]')
        for info in information:
            website = info.xpath('.//*[@class="links"]/a/@href').extract_first()
            if website != "None":
                request = Request(url=website, callback=self.parse_email, errback=self.handle_error,
                                  meta={'dont_retry': True, 'dont_redirect': True, 'handle_httpstatus_list': [302]})
                request.meta['data'] = {
                    'Website': website
                }
                # yield response.follow(url=website, callback=self.parse_email)
                yield request
        next_page_url = response.xpath('//*[@class="next ajax-page"]/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield Request(absolute_next_page_url, errback=self.handle_error, meta={'dont_retry': True, 'dont_redirect': True})

    def parse_email(self, response):
        data = response.meta.get('data')
        # try:
        #     emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", response.text, re.I))
        # except AttributeError:
        #     return
        # data['email'] = emails
        selector = Selector(response)
        for found_address in selector.re(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.com'):
            # item = EmailAddressItem()
            data['email_address'] = found_address
            # item['url'] = response.url
            yield data

    def handle_error(self, failure):
        self.log("Request failed: %s" % failure.request)
Before I tried to get Scrapy to follow each link, I had it just return the list of websites it pulled, which worked perfectly: it iterated through the URLs on the page, requested the next page, and yielded the results. What I'm trying to do now is have it visit each website it pulls, extract an email element from that website if one is found, then return to the loop and try the next website. The problem is that when the crawler gets a response error, the crawl just stops. It also seems that even when the Request succeeds, the crawler cannot return to the original iteration through the Yellow Pages URLs; it gets stuck in one of the websites it follows and the for loop dies. How can I keep the crawler on course, continuing to attempt to pull data from the websites it finds while still iterating through each page of the listing site? Put simply, I need to go through every page of the initial listing no matter what request errors come up, while the crawler pops in and out of the websites it finds and tries to scrape data from them.
class EmailSpider(CrawlSpider):
    name = "followwebsite"
    start_urls = [
        # 'https://www.manta.com/mb_35_D000B000_000/offices_and_clinics_of_medical_doctors',
        # 'https://www.chess.com/home'
        # 'https://webscraper.io/test-sites/e-commerce/static'
        'https://www.yellowpages.com/search?search_terms=Computer+Software+%26+Services&geo_location_terms=Florence%2C+KY',
        'https://www.yellowpages.com/search?search_terms=Computers+%26+Computer+Equipment-Service+%26+fix&geo_location_terms=FL'
    ]

    def parse(self, response):
        website = response.xpath('//*[@class="links"]/a/@href')
        yield from response.follow_all(website, self.parse_email)
        next_page_url = response.xpath('//*[@class="next ajax-page"]/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield Request(absolute_next_page_url, errback=self.handle_error)

    def parse_email(self, response):
        selector = Selector(response)
        for found_address in selector.re(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.com'):
            item = EmailAddressItem()
            item['email_address'] = found_address
            # item['url'] = response.url
            yield item

    def handle_error(self, failure):
        self.log("Request failed: %s" % failure.request)
Figured it out no thanks to you bums
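For anyone hitting the same "crawl stops on a bad website" problem: as far as I know, errback output is processed like callback output, so an errback can log the failure and still yield the partially filled data, and one dead site then doesn't stop the listing loop. A rough sketch (the field names follow the dicts used above and are illustrative, not tested):
def handle_error(self, failure):
    request = failure.request
    self.logger.warning("Request failed, continuing crawl: %s (%r)",
                        request.url, failure.value)
    data = request.meta.get('data')
    if data is not None:
        data.setdefault('email_address', None)  # no email scraped for this site
        yield data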
I am trying to make a Scrapy bot that utilizes pagination, but I'm having no success...
The bot crawls through all of the links on the first page but never goes on to the next page. I have read a ton of different threads and I can't figure this out at all. I am very new to web scraping, so please feel free to hammer the crap out of my code.
import time
from scrapy.spiders import CrawlSpider, Rule
#from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http.request import Request
from tutorial.items import TutorialItem
#from scrapy_tutorial.items import ScrapyTutorialItem

class raytheonJobsPageSpider(CrawlSpider):
    name = "raytheonJobsStart"
    allowed_domains = ["jobs.raytheon.com"]
    start_urls = [
        "https://jobs.raytheon.com/search-jobs"
    ]
    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="next"]',)), callback='parse_listings', follow=True),
    )

    def parse_start_url(self, response):
        '''
        Crawl start URLs
        '''
        return self.parse_listings(response)

    def parse_listings(self, response):
        '''
        Extract data from listing pages
        '''
        sel = Selector(response)
        jobs = response.xpath(
            '//*[@id="search-results-list"]/ul/*/a/@href'
        ).extract()
        nextLink = response.xpath('//a[@class="next"]').extract()
        print "This is just the next page link - ", nextLink
        for job_url in jobs:
            job_url = self.__normalise(job_url)
            job_url = self.__to_absolute_url(response.url, job_url)
            yield Request(job_url, callback=self.parse_details)

    def parse_details(self, response):
        '''
        Extract data from details pages
        '''
        sel = Selector(response)
        job = sel.xpath('//*[@id="content"]')
        item = TutorialItem()
        # Populate job fields
        item['title'] = job.xpath('//*[@id="content"]/section[1]/div/h1/text()').extract()
        jobTitle = job.xpath('//*[@id="content"]/section[1]/div/h1/text()').extract()
        item['reqid'] = job.xpath('//*[@id="content"]/section[1]/div/span[1]/text()').extract()
        item['location'] = job.xpath('//*[@id="content"]/section[1]/div/span[last()]/text()').extract()
        item['applink'] = job.xpath('//*[@id="content"]/section[1]/div/a[2]/@href').extract()
        item['description'] = job.xpath('//*[@id="content"]/section[1]/div/div').extract()
        item['clearance'] = job.xpath('//*[@id="content"]/section[1]/div/*/text()').extract()
        #item['page_url'] = response.url
        item = self.__normalise_item(item, response.url)
        time.sleep(1)
        return item

    def __normalise_item(self, item, base_url):
        '''
        Standardise and format item fields
        '''
        # Loop item fields to sanitise data and standardise data types
        for key, value in vars(item).values()[0].iteritems():
            item[key] = self.__normalise(item[key])
        # Convert job URL from relative to absolute URL
        #item['job_url'] = self.__to_absolute_url(base_url, item['job_url'])
        return item

    def __normalise(self, value):
        print self, value
        # Convert list to string
        value = value if type(value) is not list else ' '.join(value)
        # Trim leading and trailing special characters (whitespaces, newlines, tabs, carriage returns)
        value = value.strip()
        return value

    def __to_absolute_url(self, base_url, link):
        '''
        Convert relative URL to absolute URL
        '''
        import urlparse
        link = urlparse.urljoin(base_url, link)
        return link

    def __to_int(self, value):
        '''
        Convert value to integer type
        '''
        try:
            value = int(value)
        except ValueError:
            value = 0
        return value

    def __to_float(self, value):
        '''
        Convert value to float type
        '''
        try:
            value = float(value)
        except ValueError:
            value = 0.0
        return value
You don't need PhantomJS or Splash.
By inspecting the AJAX calls, I found that the jobs are loaded via AJAX calls to the URL shown below.
You can see the CurrentPage parameter at the end of the URL.
The result is returned in JSON format, and all jobs are under the key named results.
I created a project on my side and wrote fully working code for you. Here is the link to it on GitHub; just download and run it ... you don't have to do anything at all :P
Download the whole working project from here: https://github.com/mani619cash/raytheon_pagination
The basic logic is here:
import json

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.exceptions import CloseSpider

class RaytheonspiderSpider(CrawlSpider):
    name = "raytheonJobsStart"
    page = 180
    ajaxURL = "https://jobs.raytheon.com/search-jobs/results?ActiveFacetID=0&RecordsPerPage=15&Distance=50&RadiusUnitType=0&Keywords=&Location=&Latitude=&Longitude=&ShowRadius=False&CustomFacetName=&FacetTerm=&FacetType=0&SearchResultsModuleName=Search+Results&SearchFiltersModuleName=Search+Filters&SortCriteria=5&SortDirection=1&SearchType=5&CategoryFacetTerm=&CategoryFacetType=&LocationFacetTerm=&LocationFacetType=&KeywordType=&LocationType=&LocationPath=&OrganizationIds=&CurrentPage="

    def start_requests(self):
        yield Request(self.ajaxURL + str(self.page), callback=self.parse_listings)

    def parse_listings(self, response):
        resp = json.loads(response.body)
        response = Selector(text=resp['results'])
        jobs = response.xpath('//*[@id="search-results-list"]/ul/*/a/@href').extract()
        if jobs:
            for job_url in jobs:
                job_url = "https://jobs.raytheon.com" + self.__normalise(job_url)
                #job_url = self.__to_absolute_url(response.url, job_url)
                yield Request(url=job_url, callback=self.parse_details)
        else:
            raise CloseSpider("No more pages... exiting...")
        # go to next page...
        self.page = self.page + 1
        yield Request(self.ajaxURL + str(self.page), callback=self.parse_listings)
Change
restrict_xpaths=('//div[@class="next"]',)
to
restrict_xpaths=('//a[@class="next"]',)
If this does not work, then make a recursive call to the parse_listings function:
def parse_listings(self, response):
    '''
    Extract data from listing pages
    '''
    sel = Selector(response)
    jobs = response.xpath(
        '//*[@id="search-results-list"]/ul/*/a/@href'
    ).extract()
    nextLink = response.xpath('//a[@class="next"]').extract()
    print "This is just the next page link - ", nextLink
    for job_url in jobs:
        job_url = self.__normalise(job_url)
        job_url = self.__to_absolute_url(response.url, job_url)
        yield Request(job_url, callback=self.parse_details)
    yield Request(pagination link here, callback=self.parse_listings)
I am on mobile so I can't type code. I hope the logic I described makes sense.
My spider function operates on a page, and I need to follow a link to get some data from that page to add to my item, but I need to visit various pages from the parent page without creating more items. How would I go about doing that? From what I can read in the documentation, I can only go in a linear fashion:
parent page > next page > next page
But I need to:
parent page > next page
> next page
> next page
You should return Request instances and pass the item around in meta. You would have to do it in a linear fashion, building a chain of requests and callbacks. To achieve this, you can pass around a list of requests needed to complete the item and return the item from the last callback:
# inside your Spider class; requires "import scrapy" at module level
def parse_main_page(self, response):
    item = MyItem()
    item['main_url'] = response.url
    url1 = response.xpath('//a[@class="link1"]/@href').extract()[0]
    request1 = scrapy.Request(url1, callback=self.parse_page1)
    url2 = response.xpath('//a[@class="link2"]/@href').extract()[0]
    request2 = scrapy.Request(url2, callback=self.parse_page2)
    url3 = response.xpath('//a[@class="link3"]/@href').extract()[0]
    request3 = scrapy.Request(url3, callback=self.parse_page3)
    request1.meta['item'] = item
    request1.meta['requests'] = [request2, request3]
    return request1

def parse_page1(self, response):
    item = response.meta['item']
    item['data1'] = response.xpath('//div[@class="data1"]/text()').extract()[0]
    # hand the item and the remaining requests on to the next request in the chain
    request = response.meta['requests'].pop(0)
    request.meta['item'] = item
    request.meta['requests'] = response.meta['requests']
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['data2'] = response.xpath('//div[@class="data2"]/text()').extract()[0]
    request = response.meta['requests'].pop(0)
    request.meta['item'] = item
    return request

def parse_page3(self, response):
    item = response.meta['item']
    item['data3'] = response.xpath('//div[@class="data3"]/text()').extract()[0]
    return item
Also see:
How can i use multiple requests and pass items in between them in scrapy python
Almost Asynchronous Requests for Single Item Processing in Scrapy
Using Scrapy Requests, you can perform extra operations on the next URL inside the scrapy.Request's callback.
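A small sketch of that idea (the selectors and field names here are made up for illustration): do the extra work in the callback of the follow-up Request and carry whatever you need from the first page in meta:
import scrapy

def parse(self, response):
    next_url = response.xpath('//a[@class="detail"]/@href').get()
    yield scrapy.Request(response.urljoin(next_url),
                         callback=self.parse_detail,
                         meta={'listing_url': response.url})

def parse_detail(self, response):
    # extra operations on the followed URL happen here
    yield {
        'listing_url': response.meta['listing_url'],
        'detail_url': response.url,
        'title': response.xpath('//h1/text()').get(),
    }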
I need an example in Scrapy of how to get a link from one page, follow that link, get more info from the linked page, and merge it back with some data from the first page.
Partially fill your item on the first page, then put it in your request's meta. When the callback for the next page is called, it can take the partially filled item, add more data to it, and then return it.
An example from the Scrapy documentation:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
More information on passing the meta data and request objects is specifically described in this part of the documentation:
http://readthedocs.org/docs/scrapy/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
This question is also related to: Scrapy: Follow link to get additional Item data?
A brief illustration of the Scrapy documentation code:
def start_requests(self):
    yield scrapy.Request("http://www.example.com/main_page.html", callback=self.parse_page1)

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url  ## extracts http://www.example.com/main_page.html
    request = scrapy.Request("http://www.example.com/some_page.html", callback=self.parse_page2)
    request.meta['my_meta_item'] = item  ## passing item in the meta dictionary
    ## alternatively you can do it as below
    ## request = scrapy.Request("http://www.example.com/some_page.html", meta={'my_meta_item': item}, callback=self.parse_page2)
    return request

def parse_page2(self, response):
    item = response.meta['my_meta_item']
    item['other_url'] = response.url  ## extracts http://www.example.com/some_page.html
    return item