I am trying to store the trail of URLs my spider follows every time it reaches the target page. I am having trouble reading the starting URL and ending URL for each request. I have gone through the documentation, and this is as far as I can get using its examples.
Here is my Spider class:
class MinistryProductsSpider(CrawlSpider):
    name = "ministryproducts"
    allowed_domains = ["www.ministryofsupply.com"]
    start_urls = ["https://www.ministryofsupply.com/"]
    base_url = "https://www.ministryofsupply.com/"

    rules = [
        Rule(
            LinkExtractor(allow="products/"),
            callback="parse_products",
            follow=True,
            process_request="main",
        )
    ]
I have a separate callback function that parses the data on every product page. The documentation doesn't specify whether I can use callback and process_request in the same Rule.
    def main(self, request, response):
        trail = [link for link in response.url]
        return Request(response.url, callback=self.parse_products, meta=dict(trail))

    def parse_products(self, response, trail):
        self.logger.info("Hi this is a product page %s", response.url)
        parser = Parser()
        item = parser.parse_product(response, trail)
        yield item
I have been stuck at this point for the past 4 hours. My Parser class is running absolutely fine. I am also looking for an explanation of best practices in this case.
I solved the problem by creating a new scrapy.Request object for each href value I extract from the <a> tags on the catalogue page.
    parser = Parser()

    def main(self, response):
        href_list = response.css("a.CardProduct__link::attr(href)").getall()
        for link in href_list:
            product_url = self.base_url + link
            request = Request(product_url, callback=self.parse_products)
            visited_urls = [request.meta.get("link_text", "").strip(), request.url]
            trail = copy.deepcopy(response.meta.get("visited_urls", [])) + visited_urls
            request.meta["trail"] = trail
            yield request

    def parse_products(self, response):
        self.logger.info("Hi this is a product page %s", response.url)
        item = self.parser.parse_product(response)
        yield item
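For the record, Scrapy's Rule does accept callback and process_request together, so the trail can also be recorded without issuing the product requests by hand. Below is a minimal, untested sketch of that idea (assuming Scrapy 2.0+, where process_request receives both the request and the response it originated from; add_trail is just an illustrative method name, not anything from the docs):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MinistryProductsSpider(CrawlSpider):
    name = "ministryproducts"
    allowed_domains = ["www.ministryofsupply.com"]
    start_urls = ["https://www.ministryofsupply.com/"]

    rules = [
        Rule(
            LinkExtractor(allow="products/"),
            callback="parse_products",
            follow=True,
            # process_request can coexist with callback; it runs on every
            # request the rule extracts, before the request is scheduled
            process_request="add_trail",
        )
    ]

    def add_trail(self, request, response):
        # extend the trail collected so far with the page the link was found on
        request.meta["trail"] = response.meta.get("trail", []) + [response.url]
        return request

    def parse_products(self, response):
        self.logger.info("Product %s reached via %s", response.url, response.meta.get("trail"))

With something like this, response.meta["trail"] inside parse_products should hold the list of pages the spider went through to reach the product, without the extra requests.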
So I'm relatively new to Scrapy and am trying to get a crawler that pulls hyperlinks for businesses from a listing page. Here is the code:
class EmailSpider(CrawlSpider):
    name = "emailcrawler"

    start_urls = [
        'https://www.yellowpages.com/search?search_terms=Computer+Software+%26+Services&geo_location_terms=Florence%2C+KY'
        # 'https://www.yellowpages.com/search?search_terms=Computers+%26+Computer+Equipment-Service+%26+fix&geo_location_terms=FL'
    ]

    def parse(self, response):
        information = response.xpath('//*[@class="info"]')
        for info in information:
            website = info.xpath('.//*[@class="links"]/a/@href').extract_first()
            if website != "None":
                request = Request(url=website, callback=self.parse_email, errback=self.handle_error,
                                  meta={'dont_retry': True, 'dont_redirect': True, 'handle_httpstatus_list': [302]})
                request.meta['data'] = {
                    'Website': website
                }
                # yield response.follow(url=website, callback=self.parse_email)
                yield request
        next_page_url = response.xpath('//*[@class="next ajax-page"]/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield Request(absolute_next_page_url, errback=self.handle_error, meta={'dont_retry': True, 'dont_redirect': True})

    def parse_email(self, response):
        data = response.meta.get('data')
        # try:
        #     emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", response.text, re.I))
        # except AttributeError:
        #     return
        # data['email'] = emails
        selector = Selector(response)
        for found_address in selector.re('[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.com'):
            # item = EmailAddressItem()
            data['email_address'] = found_address
            # item['url'] = response.url
            yield data

    def handle_error(self, failure):
        self.log("Request failed: %s" % failure.request)
Before I attempted to get Scrapy to follow each link, I had it just return the list of websites it pulled, which worked perfectly: it was able to request the next page after iterating through the URLs on the current page and then yield the results. What I am trying to do now is to get it to go to each website it pulls, extract an email element on that website if one is found, and then return to the loop and try another website.
The problem is that when the crawler gets a response error, the crawl just stops. It also seems that even if the request was successful, the crawler is not able to return to the original iteration through the Yellow Pages URL; it gets stuck on one of the websites it follows, and then the for loop dies.
How can I get the crawler to stay its course and keep attempting to pull from the websites it scrapes, while also staying within the process of iterating through each page of the listing website? To put it simply, I need to be able to go through every single page of the initial listing no matter what request error comes up, while having the crawler pop in and out of the websites it finds and attempt to scrape data from those sites.
class EmailSpider(CrawlSpider):
    name = "followwebsite"

    start_urls = [
        # 'https://www.manta.com/mb_35_D000B000_000/offices_and_clinics_of_medical_doctors',
        # 'https://www.chess.com/home'
        # 'https://webscraper.io/test-sites/e-commerce/static'
        'https://www.yellowpages.com/search?search_terms=Computer+Software+%26+Services&geo_location_terms=Florence%2C+KY',
        'https://www.yellowpages.com/search?search_terms=Computers+%26+Computer+Equipment-Service+%26+fix&geo_location_terms=FL'
    ]

    def parse(self, response):
        website = response.xpath('//*[@class="links"]/a/@href')
        yield from response.follow_all(website, self.parse_email)
        next_page_url = response.xpath('//*[@class="next ajax-page"]/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield Request(absolute_next_page_url, errback=self.handle_error)

    def parse_email(self, response):
        selector = Selector(response)
        for found_address in selector.re('[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.com'):
            item = EmailAddressItem()
            item['email_address'] = found_address
            # item['url'] = response.url
            yield item

    def handle_error(self, failure):
        self.log("Request failed: %s" % failure.request)
Figured it out no thanks to you bums
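For what it's worth, the other half of the original problem (not letting a dead external site stop the crawl) mostly comes down to attaching the errback to the detail requests as well and yielding the pagination request unconditionally. A rough, untested sketch of the parse method above with that change (as far as I know, follow and follow_all pass errback straight through to the underlying Request):

def parse(self, response):
    websites = response.xpath('//*[@class="links"]/a/@href')
    # an errback on each detail request means a dead site only gets logged;
    # it never interrupts the iteration over the listing page
    yield from response.follow_all(websites, callback=self.parse_email,
                                   errback=self.handle_error)

    # pagination is yielded regardless of what happens on the detail sites
    next_page_url = response.xpath('//*[@class="next ajax-page"]/@href').extract_first()
    if next_page_url:
        yield response.follow(next_page_url, callback=self.parse,
                              errback=self.handle_error)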
I'm creating a web scraper with Scrapy and Python. The page I'm scraping presents each item as a card. I'm able to scrape some info from these cards (name, location), but I also want info that is reached by clicking on a card > new page > clicking a button on the new page that opens a form > scraping a value from that form. How should I structure the parse function: do I need nested loops or separate functions?
class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["example.com"]
    start_urls = ["example.com/page"]

    def parse(self, response):
        for page_url in response.css('a[class ~= search-card]::attr(href)').extract():
            page_url = response.urljoin(page_url)
            yield scrapy.Request(url=page_url, callback=self.parse)
        for vc in response.css('div#vc-profile.container').extract():
            item = StackItem()
            item['name'] = vc.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[1]/h1/text()').extract()
            item['firm'] = vc.expath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[1]').extract()
            item['pos'] = vc.expath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[2]').extract()
            em = vc.xpath('/*[@id="vc-profile"]/div/div[1]/div[2]/div[2]/div/div[1]/button').extract()
            item['email'] = em.xpath('//*[@id="email"]/value').extract()
            yield item
The scraper is crawling, but it outputs nothing.
The best approach is to create an item object on the first page, scrape the data you need there and save it to the item, then make a request to the new URL (card > new page > click the button that opens the form) and pass the same item along with that request. Yielding the item from that final callback will fix the issue.
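Roughly, that pattern looks like the sketch below (untested; the search-card selector and the #email value are placeholders, not the real site's markup): the first callback fills part of the item, hands it to the next callback via request.meta, and the second callback completes it and yields it.

def parse(self, response):
    # loop over the cards on the listing page (selector is hypothetical)
    for card in response.css('a.search-card'):
        item = StackItem()
        item['name'] = card.css('::text').extract_first()
        detail_url = response.urljoin(card.attrib['href'])
        # hand the half-filled item to the next callback via meta
        yield scrapy.Request(detail_url, callback=self.parse_detail,
                             meta={'item': item})

def parse_detail(self, response):
    item = response.meta['item']
    # complete the item with the value that only exists on the detail/form page
    item['email'] = response.css('#email::attr(value)').extract_first()
    yield item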
You should probably split the scraper into one 'parse' method and one 'parse_item' method.
Your parse method goes through the listing page and yields requests for the URLs of the items you want details for; the parse_item method then receives each of those responses and extracts the details for that specific item.
Difficult to say what it will look like without knowing the website, but it'll probably look more or less like this:
class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["example.com"]
    start_urls = ["example.com/page"]

    def parse(self, response):
        for page_url in response.css('a[class ~= search-card]::attr(href)').extract():
            page_url = response.urljoin(page_url)
            yield scrapy.Request(url=page_url, callback=self.parse_item)

    def parse_item(self, response):
        item = StackItem()
        item['name'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[1]/h1/text()').extract()
        item['firm'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[1]').extract()
        item['pos'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[2]').extract()
        em = response.xpath('//*[@id="vc-profile"]/div/div[1]/div[2]/div[2]/div/div[1]/button')
        item['email'] = em.xpath('//*[@id="email"]/value').extract()
        yield item
I am writing a web scraper to fetch a group of links (located at tree.xpath('//div[@class="work_area_content"]/a/@href')) from a website and return the Title and URL of all the leaves, sectioned by each leaf's parent. I have two scrapers: one in plain Python and one in Scrapy for Python. What is the purpose of callbacks in the Scrapy Request method? Should the information be in a multidimensional or a single-dimension list (I believe multidimensional, but it adds complexity)? Which of the two pieces of code below is better? If the Scrapy code is better, how do I migrate the plain Python code to it?
From what I understand, a callback passes a function's arguments to another function; however, if the callback refers to itself, the data gets overwritten and is therefore lost, and you're unable to go back to the root data. Is this correct?
python:
url_storage = [ [ [ [] ] ] ]

page = requests.get('http://1.1.1.1:1234/TestSuites')
tree = html.fromstring(page.content)
urls = tree.xpath('//div[@class="work_area_content"]/a/@href').extract()

i = 0
j = 0
k = 0

for i, url in enumerate(urls):
    absolute_url = "".join(['http://1.1.1.1:1234/', url])
    url_storage[i][j][k].append(absolute_url)
    print(url_storage)
    # url_storage.insert(i, absolute_url)
    page = requests.get(url_storage[i][j][k])
    tree2 = html.fromstring(page.content)
    urls2 = tree2.xpath('//div[@class="work_area_content"]/a/@href').extract()
    for j, url2 in enumerate(urls2):
        absolute_url = "".join(['http://1.1.1.1:1234/', url2])
        url_storage[i][j][k].append(absolute_url)
        page = requests.get(url_storage[i][j][k])
        tree3 = html.fromstring(page.content)
        urls3 = tree3.xpath('//div[@class="work_area_content"]/a/@href').extract()
        for k, url3 in enumerate(urls3):
            absolute_url = "".join(['http://1.1.1.1:1234/', url3])
            url_storage[i][j][k].append(absolute_url)
            page = requests.get(url_storage[i][j][k])
            tree4 = html.fromstring(page.content)
            urls3 = tree4.xpath('//div[@class="work_area_content"]/a/@href').extract()
            title = tree4.xpath('//span[@class="page_title"]/text()').extract()
            yield Request(url_storage[i][j][k], callback=self.end_page_parse_TS, meta={"Title": title, "URL": urls3})
            # yield Request(absolute_url, callback=self.end_page_parse_TC, meta={"Title": title, "URL": urls3})

def end_page_parse_TS(self, response):
    print(response.body)
    url = response.meta.get('URL')
    title = response.meta.get('Title')
    yield {'URL': url, 'Title': title}

def end_page_parse_TC(self, response):
    url = response.meta.get('URL')
    title = response.meta.get('Title')
    description = response.meta.get('Description')
    description = response.xpath('//table[@class="wiki_table]/tbody[contains(/td/text(), "description")/parent').extract()
    yield {'URL': url, 'Title': title, 'Description': description}
Scrapy:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from datablogger_scraper.items import DatabloggerScraperItem


class DatabloggerSpider(CrawlSpider):
    # The name of the spider
    name = "datablogger"

    # The domains that are allowed (links to other domains are skipped)
    allowed_domains = ['http://1.1.1.1:1234/']

    # The URLs to start with
    start_urls = ['http://1.1.1.1:1234/TestSuites']

    # This spider has one rule: extract all (unique and canonicalized) links,
    # follow them and parse them using the parse_items method
    rules = [
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback="parse_items"
        )
    ]

    # Method which starts the requests by visiting all URLs specified in start_urls
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    # Method for parsing items
    def parse_items(self, response):
        # The list of items that are found on the particular page
        items = []
        # Only extract canonicalized and unique links (with respect to the current page)
        links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
        # Now go through all the found links
        item = DatabloggerScraperItem()
        item['url_from'] = response.url
        for link in links:
            item['url_to'] = link.url
            items.append(item)
        # Return all the found items
        return items
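As an aside on the callback question itself: a callback is simply the method Scrapy calls with the downloaded response once the request completes, and a callback can safely point back at itself, because every Request carries its own meta dict, so nothing is overwritten. A hedged sketch of the three-level walk above rewritten as one self-referencing callback (the trail and level meta keys are made up for illustration, and it assumes import scrapy inside a spider class):

def parse(self, response):
    # every Request gets its own meta dict, so nothing is lost when the
    # callback refers back to parse itself
    trail = response.meta.get('trail', []) + [response.url]
    level = response.meta.get('level', 0)

    if level < 3:
        for href in response.xpath('//div[@class="work_area_content"]/a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse,
                                 meta={'trail': trail, 'level': level + 1})
    else:
        # leaf page: emit the accumulated data
        title = response.xpath('//span[@class="page_title"]/text()').extract_first()
        yield {'URL': response.url, 'Title': title, 'Trail': trail}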
I have an issue with my spider. I tried to follow a tutorial to understand Scrapy a little better and extended it to also crawl subpages. The issue is that my spider only crawls one element of the entry page and not the 25 that should be on the page.
I have no clue where the mistake is. Perhaps one of you can help me here:
from datetime import datetime as dt

import scrapy

from reddit.items import RedditItem


class PostSpider(scrapy.Spider):
    name = 'post'
    allowed_domains = ['reddit.com']

    def start_requests(self):
        reddit_urls = [
            ('datascience', 'week')
        ]
        for sub, period in reddit_urls:
            url = 'https://www.reddit.com/r/' + sub + '/top/?sort=top&t=' + period
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # get the subreddit from the URL
        sub = response.url.split('/')[4]

        # parse thru each of the posts
        for post in response.css('div.thing'):
            item = RedditItem()
            item['title'] = post.css('a.title::text').extract_first()
            item['commentsUrl'] = post.css('a.comments::attr(href)').extract_first()

            ### scrap comments page.
            request = scrapy.Request(url=item['commentsUrl'], callback=self.parse_comments)
            request.meta['item'] = item
            return request

    def parse_comments(self, response):
        item = response.meta['item']
        item['commentsText'] = response.css('div.comment div.md p::text').extract()
        self.logger.info('Got successful response from {}'.format(response.url))
        yield item
Thanks for your help.
BR
Thanks for your comments: indeed, I have to yield the request rather than return it. Now it is working.
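For anyone reading along, the fix is only the last line of the loop in parse; everything else stays as posted above:

# inside parse(), unchanged except for the final statement
for post in response.css('div.thing'):
    item = RedditItem()
    item['title'] = post.css('a.title::text').extract_first()
    item['commentsUrl'] = post.css('a.comments::attr(href)').extract_first()

    request = scrapy.Request(url=item['commentsUrl'], callback=self.parse_comments)
    request.meta['item'] = item
    yield request  # yield keeps the loop running for all 25 posts; return stopped it after the first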
I am trying to scrape the search results of http://www.ncbi.nlm.nih.gov/pubmed. I gathered all the useful information from the first page, but I am having a problem navigating to the second page (the second page does not return any results; some parameters in the request are missing or wrong).
My code is:
class PubmedSpider(Spider):
    name = "pubmed"
    cur_page = 1
    max_page = 3
    start_urls = [
        "http://www.ncbi.nlm.nih.gov/pubmed/?term=cancer+toxic+drug"
    ]

    def parse(self, response):
        sel = Selector(response)
        pubmed_results = sel.xpath('//div[@class="rslt"]')
        # next_page_url = sel.xpath('//div[@id="gs_n"]//td[@align="left"]/a/@href').extract()[0]
        self.cur_page = self.cur_page + 1
        print 'cur_page ','*' * 30, self.cur_page

        form_data = {'term':'cancer+drug+toxic+',
                     'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page':'results',
                     'email_subj':'cancer+drug+toxic+-+PubMed',
                     'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.CurrPage':str(self.cur_page),
                     'email_subj2':'cancer+drug+toxic+-+PubMed',
                     'EntrezSystem2.PEntrez.DbConnector.LastQueryKey':'2',
                     'EntrezSystem2.PEntrez.DbConnector.Cmd':'PageChanged',
                     'p%24a':'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page',
                     'p%24l':'EntrezSystem2',
                     'p%24':'pubmed',
                     }

        for pubmed_result in pubmed_results:
            item = PubmedItem()
            item['title'] = lxml.html.fromstring(pubmed_result.xpath('.//a')[0].extract()).text_content()
            item['link'] = pubmed_result.xpath('.//p[@class="title"]/a/@href').extract()[0]

            # modify following lines
            if self.cur_page < self.max_page:
                yield FormRequest("http://www.ncbi.nlm.nih.gov/pubmed/?term=cancer+toxic+drug", formdata=form_data,
                                  callback=self.parse2, method="POST")
            yield item

    def parse2(self, response):
        with open('response_html', 'w') as f:
            f.write(response.body)
Cookies are enabled in settings.py.
If you are searching NCBI for information, why not use the E-utilities designed for exactly this type of research? That would also avoid the abuse notifications the site returns (perhaps this happened to your scraper too).
I know the question is quite old; however, somebody may stumble upon the same problem...
Your base URL would be: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer+toxic+drug
You can find a description of the query parameters here: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch (for more results per query and how you can advance)
And using this API would enable you to use some other tools and a newer Python 3, too.
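For anyone landing here now, a rough sketch of paging through that endpoint with plain requests might look like the following (retstart/retmax/retmode are standard ESearch parameters; the JSON key names are as I recall them, so double-check against the chapter linked above):

import requests

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch(term, retstart=0, retmax=20):
    # retstart/retmax page through the results; retmode=json avoids XML parsing
    params = {
        "db": "pubmed",
        "term": term,
        "retstart": retstart,
        "retmax": retmax,
        "retmode": "json",
    }
    return requests.get(BASE, params=params).json()

# first and second "page" of the same query
page1 = esearch("cancer toxic drug", retstart=0)
page2 = esearch("cancer toxic drug", retstart=20)
print(page1["esearchresult"]["idlist"])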