I am trying to scrape the search results of http://www.ncbi.nlm.nih.gov/pubmed. I gathered all the useful information from the first page, but I am having trouble navigating to the second page (the second page returns no results; some parameters in the request are missing or wrong).
My code is:
class PubmedSpider(Spider):
    name = "pubmed"
    cur_page = 1
    max_page = 3
    start_urls = [
        "http://www.ncbi.nlm.nih.gov/pubmed/?term=cancer+toxic+drug"
    ]

    def parse(self, response):
        sel = Selector(response)
        pubmed_results = sel.xpath('//div[@class="rslt"]')
        #next_page_url = sel.xpath('//div[@id="gs_n"]//td[@align="left"]/a/@href').extract()[0]
        self.cur_page = self.cur_page + 1
        print 'cur_page ', '*' * 30, self.cur_page

        form_data = {'term': 'cancer+drug+toxic+',
                     'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page': 'results',
                     'email_subj': 'cancer+drug+toxic+-+PubMed',
                     'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.CurrPage': str(self.cur_page),
                     'email_subj2': 'cancer+drug+toxic+-+PubMed',
                     'EntrezSystem2.PEntrez.DbConnector.LastQueryKey': '2',
                     'EntrezSystem2.PEntrez.DbConnector.Cmd': 'PageChanged',
                     'p%24a': 'EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Entrez_Pager.Page',
                     'p%24l': 'EntrezSystem2',
                     'p%24': 'pubmed',
                     }

        for pubmed_result in pubmed_results:
            item = PubmedItem()
            item['title'] = lxml.html.fromstring(pubmed_result.xpath('.//a')[0].extract()).text_content()
            item['link'] = pubmed_result.xpath('.//p[@class="title"]/a/@href').extract()[0]

            #modify following lines
            if self.cur_page < self.max_page:
                yield FormRequest("http://www.ncbi.nlm.nih.gov/pubmed/?term=cancer+toxic+drug", formdata=form_data,
                                  callback=self.parse2, method="POST")
            yield item

    def parse2(self, response):
        with open('response_html', 'w') as f:
            f.write(response.body)
Cookies are enabled in settings.py.
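For reference, that corresponds to the following line in settings.py (Scrapy enables cookies by default, so this only makes it explicit):

    COOKIES_ENABLED = True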
If you are searching NCBI for information, why don't you use the E-utilities, which are designed for exactly this kind of query? That would also avoid the abuse notifications the site returns (perhaps this happened with your scraper too).
I know the question is quite old, but somebody may still stumble upon the same problem...
Your base URL would be: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer+toxic+drug
You can find a description of the query parameters here: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch (for more results per query and how you can advance)
Using this API would also let you use other tools and a newer Python 3.
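As a rough sketch of what such a query could look like with the requests library (the db, term, retmax and retmode parameters are described in the ESearch chapter linked above):

    import requests

    # Search PubMed through ESearch instead of scraping the HTML result pages.
    params = {
        "db": "pubmed",
        "term": "cancer toxic drug",
        "retmax": 20,        # number of PMIDs per request
        "retmode": "json",   # ask for JSON instead of the default XML
    }
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params=params,
    )
    resp.raise_for_status()
    pmids = resp.json()["esearchresult"]["idlist"]
    print(pmids)

Paging then becomes a matter of adding a retstart offset to the same query instead of replaying the PubMed pager form.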
Related
I am trying to store the trail of URLs my spider visits each time it reaches the target page. I am having trouble reading the starting URL and ending URL for each request. I have gone through the documentation, and this is as far as I can get using the examples from it.
Here is my spider class:
class MinistryProductsSpider(CrawlSpider):
    name = "ministryproducts"
    allowed_domains = ["www.ministryofsupply.com"]
    start_urls = ["https://www.ministryofsupply.com/"]
    base_url = "https://www.ministryofsupply.com/"
    rules = [
        Rule(
            LinkExtractor(allow="products/"),
            callback="parse_products",
            follow=True,
            process_request="main",
        )
    ]
I have a separate function for the callback, which parses data on every product page. The documentation doesn't specify whether I can use callback and process_request in the same Rule.
def main(self, request, response):
    trail = [link for link in response.url]
    return Request(response.url, callback=self.parse_products, meta=dict(trail))

def parse_products(self, response, trail):
    self.logger.info("Hi this is a product page %s", response.url)
    parser = Parser()
    item = parser.parse_product(response, trail)
    yield item
I have been stuck at this point for the past 4 hours. My Parser class is running absolutely fine. I am also looking for an explanation of best practices in this case.
I solved the problem by creating a new scrapy.Request object for each href value of the a tags on the catalogue page.
parser = Parser()

def main(self, response):
    href_list = response.css("a.CardProduct__link::attr(href)").getall()
    for link in href_list:
        product_url = self.base_url + link
        request = Request(product_url, callback=self.parse_products)
        visited_urls = [request.meta.get("link_text", "").strip(), request.url]
        trail = copy.deepcopy(response.meta.get("visited_urls", [])) + visited_urls
        request.meta["trail"] = trail
        yield request

def parse_products(self, response):
    self.logger.info("Hi this is a product page %s", response.url)
    item = self.parser.parse_product(response)
    yield item
I'm creating a web scraper with Scrapy and Python. The page I'm scraping structures each item as a card. I'm able to scrape some info from these cards (name, location), but I also want info that is reached by clicking on a card > new page > clicking a button on the new page that opens a form > scraping a value from that form. How should I structure the parse function: do I need nested loops or separate functions?
class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["example.com"]
    start_urls = ["example.com/page"]

    def parse(self, response):
        for page_url in response.css('a[class ~= search- card]::attr(href)').extract():
            page_url = response.urljoin(page_url)
            yield scrapy.Request(url=page_url, callback=self.parse)

        for vc in response.css('div#vc-profile.container').extract():
            item = StackItem()
            item['name'] = vc.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[1]/h1/text()').extract()
            item['firm'] = vc.expath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[1]').extract()
            item['pos'] = vc.expath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[2]').extract()
            em = vc.xpath('/*[@id="vc-profile"]/div/div[1]/div[2]/div[2]/div/div[1]/button').extract()
            item['email'] = em.xpath('//*[@id="email"]/value').extract()
            yield item
The scraper is crawling, but it outputs nothing.
The best approach is to create the item object on the first page, scrape the data that is available there, and save it to the item. Then make a request to the new URL (card > new page > button that opens the form) and pass the same item along in that request. Yielding the item from there will fix the issue.
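A minimal sketch of that idea (the card selector and field names are hypothetical placeholders, and it assumes the same StackItem and scrapy import as in the question; the item is handed to the second callback through cb_kwargs, which Scrapy supports since 1.7):

    def parse(self, response):
        for card in response.css("div.card"):                      # hypothetical card selector
            item = StackItem()
            item["name"] = card.css("h2::text").get()               # data available on the card itself
            detail_url = response.urljoin(card.css("a::attr(href)").get())
            # hand the partially filled item over to the detail-page callback
            yield scrapy.Request(detail_url, callback=self.parse_detail,
                                 cb_kwargs={"item": item})

    def parse_detail(self, response, item):
        item["email"] = response.css("#email::attr(value)").get()   # hypothetical form field
        yield item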
You should probably split the scraper into one 'parse' method and one 'parse_item' method.
Your parse method goes through the page and yields the URLs of the items you want details for. The parse_item method then receives the response for each of those URLs and extracts the details of the specific item.
Difficult to say what it will look like without knowing the website, but it'll probably look more or less like this:
class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["example.com"]
    start_urls = ["example.com/page"]

    def parse(self, response):
        for page_url in response.css('a[class ~= search- card]::attr(href)').extract():
            page_url = response.urljoin(page_url)
            yield scrapy.Request(url=page_url, callback=self.parse_item)

    def parse_item(self, response):
        item = StackItem()
        item['name'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[1]/h1/text()').extract()
        item['firm'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[1]').extract()
        item['pos'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[2]').extract()
        em = response.xpath('//*[@id="vc-profile"]/div/div[1]/div[2]/div[2]/div/div[1]/button')
        item['email'] = em.xpath('//*[@id="email"]/value').extract()
        yield item
I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links with a domain different from the original domain).
I managed to do that with two rules, but they are based on the domain of the site being crawled. If I want to run this on multiple websites, I run into a problem because I don't know which "start_url" I'm currently on, so I can't change the rules appropriately.
Here's what I came up with so far. It works for one website, and I'm not sure how to apply it to a list of websites:
class HomepagesSpider(CrawlSpider):
    name = 'homepages'
    homepage = 'http://www.somesite.com'
    start_urls = [homepage]

    # strip http and www
    domain = homepage.replace('http://', '').replace('https://', '').replace('www.', '')
    domain = domain[:-1] if domain[-1] == '/' else domain

    rules = (
        Rule(LinkExtractor(allow_domains=(domain), deny_domains=()), callback='parse_internal', follow=True),
        Rule(LinkExtractor(allow_domains=(), deny_domains=(domain)), callback='parse_external', follow=False),
    )

    def parse_internal(self, response):
        # log internal page...

    def parse_external(self, response):
        # parse external page...
This can probably be done by just passing the start_url as an argument when calling the scraper, but I'm looking for a way to do that programmatically within the scraper itself.
Any ideas?
Thanks!
Simon.
I've found a very similar question and used the second option presented in the accepted answer to develop a workaround for this problem, since it's not supported out-of-the-box in scrapy.
I've created a function that takes a URL as input and creates the rules for it:
def rules_for_url(self, url):
    domain = Tools.get_domain(url)
    rules = (
        Rule(LinkExtractor(allow_domains=(domain), deny_domains=()), callback='parse_internal', follow=True),
        Rule(LinkExtractor(allow_domains=(), deny_domains=(domain)), callback='parse_external', follow=False),
    )
    return rules
I then override some of CrawlSpider's functions.
I changed _rules into a dictionary where the keys are the different website domains and the values are the rules for that domain (using rules_for_url). _rules is populated in _compile_rules.
I then make the appropriate changes in _requests_to_follow and _response_downloaded to support the new way of using _rules.
_rules = {}

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    domain = Tools.get_domain(response.url)
    for n, rule in enumerate(self._rules[domain]):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = self._build_request(domain + ';' + str(n), link)
            yield rule.process_request(r)

def _response_downloaded(self, response):
    meta_rule = response.meta['rule'].split(';')
    domain = meta_rule[0]
    rule_n = int(meta_rule[1])

    rule = self._rules[domain][rule_n]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

def _compile_rules(self):
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    for url in self.start_urls:
        url_rules = self.rules_for_url(url)
        domain = Tools.get_domain(url)
        self._rules[domain] = [copy.copy(r) for r in url_rules]
        for rule in self._rules[domain]:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)
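The Tools.get_domain helper used above is not shown in the answer; a minimal Python 3 sketch, assuming it simply returns the host name without the www. prefix, could be:

    from urllib.parse import urlparse

    class Tools:
        @staticmethod
        def get_domain(url):
            # "http://www.somesite.com/page" -> "somesite.com"
            netloc = urlparse(url).netloc or url
            return netloc[4:] if netloc.startswith("www.") else netloc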
See the original functions here.
Now the spider will simply go over each URL in start_urls, create a set of rules specific to that URL, and then use the appropriate rules for each website being crawled.
Hope this helps anyone who stumbles upon this problem in the future.
Simon.
Iterate over all the website links in start_urls and populate the allow_domains and deny_domains lists, and then define the rules.
start_urls = ["www.website1.com", "www.website2.com", "www.website3.com", "www.website4.com"]
allow_domains = []
deny_domains = []

for link in start_urls:
    # strip http and www
    domain = link.replace('http://', '').replace('https://', '').replace('www.', '')
    domain = domain[:-1] if domain[-1] == '/' else domain

    allow_domains.extend([domain])
    deny_domains.extend([domain])

rules = (
    Rule(LinkExtractor(allow_domains=allow_domains, deny_domains=()), callback='parse_internal', follow=True),
    Rule(LinkExtractor(allow_domains=(), deny_domains=deny_domains), callback='parse_external', follow=False),
)
I have a Scrapy spider that runs in the shell, but when I try to export its output to CSV, it returns an empty file. It exports data when I do not follow the link and parse the description, but once I add the extra method that parses the contents, it stops working. Here is the code:
class MonsterSpider(CrawlSpider):
    name = "monster"
    allowed_domains = ["jobs.monster.com"]
    base_url = "http://jobs.monster.com/v-technology.aspx?"
    start_urls = [
        "http://jobs.monster.com/v-technology.aspx"
    ]
    for i in range(1, 5):
        start_urls.append(base_url + "page=" + str(i))

    rules = (Rule(SgmlLinkExtractor(allow=("jobs.monster.com",)),
                  callback='parse_items'),)

    def parse_items(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="col-xs-12"]')
        #items = []
        for site in sites.xpath('.//article[@class="js_result_row"]'):
            item = MonsterItem()
            item['title'] = site.xpath('.//span[@itemprop = "title"]/text()').extract()
            item['company'] = site.xpath('.//span[@itemprop = "name"]/text()').extract()
            item['city'] = site.xpath('.//span[@itemprop = "addressLocality"]/text()').extract()
            item['state'] = site.xpath('.//span[@itemprop = "addressRegion"]/text()').extract()
            item['link'] = site.xpath('.//a[@data-m_impr_a_placement_id= "jsr"]/@href').extract()
            follow = ''.join(item["link"])
            request = Request(follow, callback=self.parse_dir_contents)
            request.meta["item"] = item
            yield request
            #items.append(item)
        #return items

    def parse_dir_contents(self, response):
        item = response.meta["item"]
        item['desc'] = site.xpath('.//div[@itemprop = "description"]/text()').extract()
        return item
The original code had parse_dir_contents removed and the commented-out empty items list and append/return code uncommented.
Well, as @tayfun suggests, you should use response.xpath or define the site variable.
By the way, you do not need to use sel = Selector(response). Responses come with an xpath method, so there is no need to wrap them in another selector.
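For example, instead of sel = Selector(response) followed by sel.xpath(...), you can write directly:

    sites = response.xpath('//div[@class="col-xs-12"]')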
However, the main problem is that you restrict the domain of the spider. You define allowed_domains = ["jobs.monster.com"], but if you look at the URLs followed by your custom Request, you can see that they are something like http://jobview.monster.com/ or http://job-openings.monster.com. In this case parse_dir_contents is not executed (the domain is not allowed), your item never gets returned, and you won't get any results.
Change allowed_domains = ["jobs.monster.com"] to
allowed_domains = ["monster.com"]
and you will be fine: your app will work and return items.
You have an error in your parse_dir_contents method:
def parse_dir_contents(self, response):
    item = response.meta["item"]
    item['desc'] = response.xpath('.//div[@itemprop="description"]/text()').extract()
    return item
Note the use of response. I don't know where the site variable you are currently using comes from.
Also, try to provide the error details when you post a question. Writing "it fails to work" doesn't say much.
I am trying to scrape TripAdvisor's reviews, but I cannot find the XPath that lets the spider go through all the pages dynamically. I tried yield and callback, but I cannot find the XPath for the link that goes to the next page. I am talking about this site.
Here is my code (UPDATED):
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapingtest.items import ScrapingTestingItem

class scrapingtestspider(Spider):
    name = "scrapytesting"
    allowed_domains = ["tripadvisor.in"]
    base_uri = "tripadvisor.in"
    start_urls = [
        "http://www.tripadvisor.in/Hotel_Review-g297679-d300955-Reviews-Ooty_Fern_Hill_A_Sterling_Holidays_Resort-Ooty_Tamil_Nadu.html"]

    output_json_dict = {}

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//a[contains(text(), "Next")]/@href').extract()
        items = []
        i = 0
        for sites in sites:
            item = ScrapingTestingItem()
            #item['reviews'] = sel.xpath('//p[@class="partial_entry"]/text()').extract()
            item['subjects'] = sel.xpath('//span[@class="noQuotes"]/text()').extract()
            item['stars'] = sel.xpath('//*[@class="rate sprite-rating_s rating_s"]/img/@alt').extract()
            item['names'] = sel.xpath('//*[@class="username mo"]/span/text()').extract()
            items.append(item)
            i += 1

        sites = sel.xpath('//a[contains(text(), "Next")]/@href').extract()

        if(sites and len(sites) > 0):
            yield Request(url="tripadvisor.in" + sites[i], callback=self.parse)
        else:
            yield items
If you want to select the URL behind Next, why don't you try something like this:

next_url = response.xpath('//a[contains(text(), "Next")]/@href').extract()

Then yield a Request with this URL. This way you always get the next page to scrape and do not need the line containing the numbers.
Recently I did something similar on TripAdvisor and this approach worked for me. If this doesn't work for you, update your code with the approach you are trying so we can see where it can be improved.
Update
And change your Request creation block to the following:
if(sites and len(sites) > 0):
    for site in sites:
        yield Request(url="http://tripadvisor.in" + site, callback=self.parse)
Remove the else part and yield the items at the end of the loop, once the method has finished all the parsing.
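Putting that advice together, a rough sketch of the restructured parse method could look like this (the per-review container XPath is a hypothetical placeholder; the other selectors, ScrapingTestingItem and Request are reused from the question, and items are yielded individually rather than as a list, which is what Scrapy expects):

    def parse(self, response):
        # yield one item per review block instead of collecting them in a list
        for review in response.xpath('//div[@class="reviewSelector"]'):   # hypothetical container
            item = ScrapingTestingItem()
            item['subjects'] = review.xpath('.//span[@class="noQuotes"]/text()').extract()
            item['stars'] = review.xpath('.//*[@class="rate sprite-rating_s rating_s"]/img/@alt').extract()
            item['names'] = review.xpath('.//*[@class="username mo"]/span/text()').extract()
            yield item

        # follow every "Next" link found on the page
        for next_url in response.xpath('//a[contains(text(), "Next")]/@href').extract():
            yield Request(url="http://tripadvisor.in" + next_url, callback=self.parse)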
I think it can only work if you put the list of URLs you want to scrape in a .txt file.
class scrapingtestspider(Spider):
    name = "scrapytesting"
    allowed_domains = ["tripadvisor.in"]
    base_uri = "tripadvisor.in"

    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()