I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links with a domain different from the original domain).
I managed to do that with two rules, but they are based on the domain of the site being crawled. If I want to run this on multiple websites I run into a problem, because I don't know which "start_url" I'm currently on, so I can't adjust the rules appropriately.
Here's what I came up with so far, it works for one website and I'm not sure how to apply it to a list of websites:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HomepagesSpider(CrawlSpider):
    name = 'homepages'
    homepage = 'http://www.somesite.com'
    start_urls = [homepage]

    # strip the scheme and www, and a trailing slash if present
    domain = homepage.replace('http://', '').replace('https://', '').replace('www.', '')
    domain = domain[:-1] if domain[-1] == '/' else domain

    rules = (
        Rule(LinkExtractor(allow_domains=(domain), deny_domains=()), callback='parse_internal', follow=True),
        Rule(LinkExtractor(allow_domains=(), deny_domains=(domain)), callback='parse_external', follow=False),
    )

    def parse_internal(self, response):
        # log internal page...
        pass

    def parse_external(self, response):
        # parse external page...
        pass
This can probably be done by just passing the start_url as an argument when calling the scraper, but I'm looking for a way to do that programmatically within the scraper itself.
Any ideas?
Thanks!
Simon.
I've found a very similar question and used the second option presented in the accepted answer to develop a workaround for this problem, since it's not supported out-of-the-box in scrapy.
I've created a function that gets a url as an input and creates rules for it:
def rules_for_url(self, url):
    domain = Tools.get_domain(url)
    rules = (
        Rule(LinkExtractor(allow_domains=(domain), deny_domains=()), callback='parse_internal', follow=True),
        Rule(LinkExtractor(allow_domains=(), deny_domains=(domain)), callback='parse_external', follow=False),
    )
    return rules
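Tools.get_domain is just a small helper of my own, not something that ships with Scrapy. A minimal sketch of what it might look like (assuming Python 3's urllib.parse; use urlparse on Python 2):

from urllib.parse import urlparse

class Tools:
    @staticmethod
    def get_domain(url):
        # hypothetical helper: 'http://www.somesite.com/page' -> 'somesite.com'
        netloc = urlparse(url).netloc or url
        return netloc[4:] if netloc.startswith('www.') else netloc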
I then override some of CrawlSpider's functions.
I changed _rules into a dictionary where the keys are the website domains and the values are the rules for that domain (built with rules_for_url). The population of _rules is done in _compile_rules.
I then make the appropriate changes in _requests_to_follow and _response_downloaded to support the new way of using _rules.
_rules = {}

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    domain = Tools.get_domain(response.url)
    for n, rule in enumerate(self._rules[domain]):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            # encode both the domain and the rule index into the request meta
            r = self._build_request(domain + ';' + str(n), link)
            yield rule.process_request(r)

def _response_downloaded(self, response):
    # split the meta back into the domain and the rule index
    meta_rule = response.meta['rule'].split(';')
    domain = meta_rule[0]
    rule_n = int(meta_rule[1])
    rule = self._rules[domain][rule_n]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

def _compile_rules(self):
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    # build a separate rule set for every start URL, keyed by its domain
    for url in self.start_urls:
        url_rules = self.rules_for_url(url)
        domain = Tools.get_domain(url)
        self._rules[domain] = [copy.copy(r) for r in url_rules]
        for rule in self._rules[domain]:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)
See the original functions here.
Now the spider simply goes over each URL in start_urls, creates a set of rules specific to that URL, and then uses the appropriate rules for each website being crawled.
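Putting it all together, the spider class itself stays small. A rough skeleton (the site URLs are placeholders):

class HomepagesSpider(CrawlSpider):
    name = 'homepages'
    start_urls = ['http://www.site-a.com', 'http://www.site-b.com']  # placeholders

    def rules_for_url(self, url):
        ...  # as defined above

    # plus the overridden _compile_rules, _requests_to_follow and
    # _response_downloaded shown above, and the parse_internal /
    # parse_external callbacks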
Hope this helps anyone who stumbles upon this problem in the future.
Simon.
Iterate over all the website links in start_urls, populate the allow_domains and deny_domains lists, and then define the Rules:
start_urls = ["www.website1.com", "www.website2.com", "www.website3.com", "www.website4.com"]

allow_domains = []
deny_domains = []
for link in start_urls:
    # strip http and www
    domain = link.replace('http://', '').replace('https://', '').replace('www.', '')
    domain = domain[:-1] if domain[-1] == '/' else domain
    allow_domains.extend([domain])
    deny_domains.extend([domain])

rules = (
    Rule(LinkExtractor(allow_domains=allow_domains, deny_domains=()), callback='parse_internal', follow=True),
    Rule(LinkExtractor(allow_domains=(), deny_domains=deny_domains), callback='parse_external', follow=False),
)
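For completeness, a rough sketch of how this could sit inside a CrawlSpider subclass (the class name and the placeholder domains are mine):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MultiSiteSpider(CrawlSpider):
    name = 'multisite'
    start_urls = ["http://www.website1.com", "http://www.website2.com"]  # placeholders

    # build the domain lists once, at class definition time
    allow_domains = []
    deny_domains = []
    for link in start_urls:
        domain = link.replace('http://', '').replace('https://', '').replace('www.', '')
        domain = domain.rstrip('/')
        allow_domains.append(domain)
        deny_domains.append(domain)

    rules = (
        Rule(LinkExtractor(allow_domains=allow_domains), callback='parse_internal', follow=True),
        Rule(LinkExtractor(deny_domains=deny_domains), callback='parse_external', follow=False),
    )

    def parse_internal(self, response):
        self.logger.info('internal page: %s', response.url)

    def parse_external(self, response):
        self.logger.info('external page: %s', response.url)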
I am trying to store the trail of URLs my spider visits every time it reaches the target page. I am having trouble reading the starting URL and ending URL for each request. I have gone through the documentation, and this is as far as I can get using its examples.
Here is my Spider class
class MinistryProductsSpider(CrawlSpider):
    name = "ministryproducts"
    allowed_domains = ["www.ministryofsupply.com"]
    start_urls = ["https://www.ministryofsupply.com/"]
    base_url = "https://www.ministryofsupply.com/"
    rules = [
        Rule(
            LinkExtractor(allow="products/"),
            callback="parse_products",
            follow=True,
            process_request="main",
        )
    ]
I have a separate function as the callback which parses the data on every product page. The documentation doesn't specify whether I can use callback and process_request in the same Rule.
def main(self, request, response):
    trail = [link for link in response.url]
    return Request(response.url, callback=self.parse_products, meta=dict(trail))

def parse_products(self, response, trail):
    self.logger.info("Hi this is a product page %s", response.url)
    parser = Parser()
    item = parser.parse_product(response, trail)
    yield item
I have been stuck at this point for the past 4 hours. My Parser class is running absolutely fine. I am also looking for an explanation of best practices in this case.
I solved the problem by creating a new scrapy Request object for each href value of the <a> tags on the catalogue page.
parser = Parser()

def main(self, response):
    href_list = response.css("a.CardProduct__link::attr(href)").getall()
    for link in href_list:
        product_url = self.base_url + link
        request = Request(product_url, callback=self.parse_products)
        visited_urls = [request.meta.get("link_text", "").strip(), request.url]
        trail = copy.deepcopy(response.meta.get("visited_urls", [])) + visited_urls
        request.meta["trail"] = trail
        yield request

def parse_products(self, response):
    self.logger.info("Hi this is a product page %s", response.url)
    item = self.parser.parse_product(response)
    yield item
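If the trail itself is needed in the callback, it can be read back from response.meta. A small variation on the parse_products above (keeping the same "trail" meta key):

def parse_products(self, response):
    trail = response.meta.get("trail", [])
    self.logger.info("Product page %s, trail: %s", response.url, " > ".join(trail))
    item = self.parser.parse_product(response)
    yield item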
I am interested in all of the visible text of a website.
The only thing is: I would like to exclude hyperlink text. That way I can exclude the text in menu bars, because they often contain links. On the page below, for example, everything in the menu bar (e.g. "Wohnen & Bauen") could be excluded.
https://www.gross-gerau.de/B%C3%BCrger-Service/Ver-und-Entsorgung/Abfallinformationen/index.php?object=tx,2289.12976.1&NavID=3411.60&La=1
All in all my spider looks like this:
class MySpider(CrawlSpider):
    name = 'my_spider'
    start_urls = ['https://www.gross-gerau.de/B%C3%BCrger-Service/Wohnen-Bauen/']

    rules = (
        Rule(LinkExtractor(allow="B%C3%BCrger-Service", deny=deny_list_sm),
             callback='parse', follow=True),
    )

    def parse(self, response):
        item = {}
        item['scrape_date'] = int(time.time())
        item['response_url'] = response.url
        # old approach
        # item["text"] = " ".join([x.strip() for x in response.xpath("//text()").getall()]).strip()
        # exclude at least javascript code snippets and stuff
        item["text"] = " ".join([x.strip() for x in response.xpath("//*[name(.)!='head' and name(.)!='script']/text()").getall()]).strip()
        yield item
The solution should work for other websites, too. Does anyone have an idea how to solve this challenge? Any ideas are welcome!
You can extend your predicate as follows:
[name()!='head' and name()!='script' and name()!='a']
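Applied to the parse method above, the extraction line would then look something like this:

item["text"] = " ".join(
    x.strip()
    for x in response.xpath(
        "//*[name()!='head' and name()!='script' and name()!='a']/text()"
    ).getall()
).strip()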
I am trying to scrape multiple URLs and I get the following error when passing the start URLs as a calculated list (see the code):
Crawling could not start: 'start_urls' not found or empty (but found 'start_url' attribute instead, did you miss an 's'?)
Everything works fine when I pass the strings as a list literal for start_urls (see the commented part of the code). I am not able to figure out what I am doing wrong. Any help would be appreciated. Code sample:
class CompaniesSSpider(scrapy.Spider):
    name = 'companies_s'
    allowed_domains = ['www.screener.in']

    df = pd.read_csv('Equity.csv')
    comp = df['url'].tolist()  # list of companies
    comp1 = [str(x) for x in comp]
    start_url = ["https://www.screener.in/company/" + e + "/consolidated/" for e in comp1]
    #start_urls = ['https://www.screener.in/company/AVANTIFEED/consolidated/','https://www.screener.in/company/GPIL/consolidated/','https://www.screener.in/company/ACRYSIL/consolidated/']

    def parse(self, response):
        title = response.xpath("//h1[@class='margin-0']/text()").get()
        marketcap = response.xpath("//span/child::span/text()").get()
        yield {
            'name': title,
            'marketcap': marketcap,
        }
Crawling could not start: 'start_urls' not found or empty (but found 'start_url' attribute instead, did you miss an 's'?)
You wrote start_url instead of start_urls.
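So the fix is just renaming the attribute:

# note the trailing "s": Scrapy looks for "start_urls"
start_urls = ["https://www.screener.in/company/" + e + "/consolidated/" for e in comp1]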
I have a Scrapy spider that runs fine in the shell, but when I try to export its output to CSV, it returns an empty file. It exports data when I do not go into a link and parse the description, but once I add the extra method that parses the link contents, it fails to work. Here is the code:
class MonsterSpider(CrawlSpider):
    name = "monster"
    allowed_domains = ["jobs.monster.com"]
    base_url = "http://jobs.monster.com/v-technology.aspx?"
    start_urls = [
        "http://jobs.monster.com/v-technology.aspx"
    ]
    for i in range(1, 5):
        start_urls.append(base_url + "page=" + str(i))

    rules = (Rule(SgmlLinkExtractor(allow=("jobs.monster.com",)),
                  callback='parse_items'),)

    def parse_items(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="col-xs-12"]')
        #items = []
        for site in sites.xpath('.//article[@class="js_result_row"]'):
            item = MonsterItem()
            item['title'] = site.xpath('.//span[@itemprop="title"]/text()').extract()
            item['company'] = site.xpath('.//span[@itemprop="name"]/text()').extract()
            item['city'] = site.xpath('.//span[@itemprop="addressLocality"]/text()').extract()
            item['state'] = site.xpath('.//span[@itemprop="addressRegion"]/text()').extract()
            item['link'] = site.xpath('.//a[@data-m_impr_a_placement_id="jsr"]/@href').extract()
            follow = ''.join(item["link"])
            request = Request(follow, callback=self.parse_dir_contents)
            request.meta["item"] = item
            yield request
            #items.append(item)
        #return items

    def parse_dir_contents(self, response):
        item = response.meta["item"]
        item['desc'] = site.xpath('.//div[@itemprop="description"]/text()').extract()
        return item
Taking out parse_dir_contents and uncommenting the items list and the append/return code gives the original version.
Well, as @tayfun suggests, you should use response.xpath or define the site variable.
By the way, you do not need to use sel = Selector(response). Responses come with an xpath method, so there is no need to wrap them in another selector.
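For example, instead of sel = Selector(response) followed by sel.xpath(...), you can simply write:

sites = response.xpath('//div[@class="col-xs-12"]')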
However, the main problem is that you restrict the domain of the spider. You define allowed_domains = ["jobs.monster.com"], but if you look at the URLs followed by your custom Request you can see that they are something like http://jobview.monster.com/ or http://job-openings.monster.com. In that case parse_dir_contents is not executed (the domain is not allowed), your item never gets returned, and you won't get any results.
Change allowed_domains = ["jobs.monster.com"] to
allowed_domains = ["monster.com"]
and your spider will follow those links and return items.
You have an error in your parse_dir_contents method:
def parse_dir_contents(self, response):
    item = response.meta["item"]
    item['desc'] = response.xpath('.//div[@itemprop="description"]/text()').extract()
    return item
Note the use of response. I don't know where the site variable you are currently using comes from.
Also, try to provide the error details when you post a question. Writing "it fails to work" doesn't say much.
I am trying to scrape the search results of http://www.ncbi.nlm.nih.gov/pubmed. I gathered all the useful information on the first page, but I am having problems navigating to the second page (the second page has no results; some parameters in the request must be missing or wrong).
My code is:
class PubmedSpider(Spider):
    name = "pubmed"
    cur_page = 1
    max_page = 3
    start_urls = [
        "http://www.ncbi.nlm.nih.gov/pubmed/?term=cancer+toxic+drug"
    ]

    def parse(self, response):
        sel = Selector(response)
        pubmed_results = sel.xpath('//div[@class="rslt"]')
        #next_page_url = sel.xpath('//div[@id="gs_n"]//td[@align="left"]/a/@href').extract()[0]
        self.cur_page = self.cur_page + 1
        print 'cur_page ', '*' * 30, self.cur_page

        form_data = {'term': 'cancer+drug+toxic+',
                     'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page': 'results',
                     'email_subj': 'cancer+drug+toxic+-+PubMed',
                     'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.CurrPage': str(self.cur_page),
                     'email_subj2': 'cancer+drug+toxic+-+PubMed',
                     'EntrezSystem2.PEntrez.DbConnector.LastQueryKey': '2',
                     'EntrezSystem2.PEntrez.DbConnector.Cmd': 'PageChanged',
                     'p%24a': 'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page',
                     'p%24l': 'EntrezSystem2',
                     'p%24': 'pubmed',
                     }

        for pubmed_result in pubmed_results:
            item = PubmedItem()
            item['title'] = lxml.html.fromstring(pubmed_result.xpath('.//a')[0].extract()).text_content()
            item['link'] = pubmed_result.xpath('.//p[@class="title"]/a/@href').extract()[0]

            #modify following lines
            if self.cur_page < self.max_page:
                yield FormRequest("http://www.ncbi.nlm.nih.gov/pubmed/?term=cancer+toxic+drug", formdata=form_data,
                                  callback=self.parse2, method="POST")
            yield item

    def parse2(self, response):
        with open('response_html', 'w') as f:
            f.write(response.body)
cookies are enabled in settings.py
If you are searching NCBI for information, why don't you use the E-utilities designed for this kind of research? That would also avoid the abuse notifications the site returns (perhaps this already happened to your scraper).
I know the question is quite old, however it can happen that somebody stumbles upon the same question...
Your base URL would be: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer+toxic+drug
You can find a description of the query parameters here: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch (for more results per query and how you can advance)
Using this API would also let you use other tools and a newer Python 3, too.
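A minimal sketch of a spider using that endpoint (the retmax parameter and the XML field names follow the ESearch documentation; adjust as needed):

import scrapy

class PubmedEutilsSpider(scrapy.Spider):
    name = "pubmed_eutils"
    start_urls = [
        "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
        "?db=pubmed&term=cancer+toxic+drug&retmax=100"
    ]

    def parse(self, response):
        # esearch returns XML; every <Id> inside <IdList> is a PubMed ID
        for pmid in response.xpath("//IdList/Id/text()").getall():
            yield {"pmid": pmid.strip()}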