I am interested in all of the visible text of a website.
The only catch is that I would like to exclude hyperlink text. That way I can also exclude the text in menu bars, because they mostly consist of links. In the image you can see that everything from a menu bar could be excluded (e.g. "Wohnen & Bauen").
https://www.gross-gerau.de/B%C3%BCrger-Service/Ver-und-Entsorgung/Abfallinformationen/index.php?object=tx,2289.12976.1&NavID=3411.60&La=1
All in all my spider looks like this:
class MySpider(CrawlSpider):
    name = 'my_spider'
    start_urls = ['https://www.gross-gerau.de/B%C3%BCrger-Service/Wohnen-Bauen/']
    rules = (
        Rule(LinkExtractor(allow="B%C3%BCrger-Service", deny=deny_list_sm),
             callback='parse', follow=True),
    )

    def parse(self, response):
        item = {}
        item['scrape_date'] = int(time.time())
        item['response_url'] = response.url
        # old approach
        # item["text"] = " ".join([x.strip() for x in response.xpath("//text()").getall()]).strip()
        # exclude at least javascript code snippets and stuff
        item["text"] = " ".join([x.strip() for x in response.xpath("//*[name(.)!='head' and name(.)!='script']/text()").getall()]).strip()
        yield item
The solution should work for other websites, too. Does anyone have an idea how to solve this? Any ideas are welcome!
You can extend your predicate so that it also excludes a elements:
[name()!='head' and name()!='script' and name()!='a']
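One caveat: a predicate on the parent's name only filters text that sits directly inside an a element, so text wrapped in a nested tag (for example a span inside a link) would still be selected. In XPath, an ancestor test such as //text()[not(ancestor::a)] is one way around that. The same idea can be sketched with Python's standard-library html.parser instead of Scrapy; the class name, the SKIPPED set and the sample HTML below are illustrative assumptions, not part of the original spider.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring everything nested inside skipped tags."""

    # Tags whose whole subtree is ignored (assumption: adjust per site).
    SKIPPED = {"a", "script", "style", "head"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while we are inside a skipped subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the text of `html` without link, script, style or head text."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample page: a menu entry wrapped in a link, plus body text.
sample = (
    "<html><head><title>T</title></head><body>"
    "<ul><li><a href='/x'><span>Wohnen &amp; Bauen</span></a></li></ul>"
    "<p>Abfallinformationen <a href='/y'>mehr</a> lesen</p>"
    "<script>var a = 1;</script>"
    "</body></html>"
)
print(visible_text(sample))
```

In a spider, something like visible_text(response.text) could stand in for the XPath join, at the cost of parsing the page a second time.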
I want to use Scrapy to take the strings from a predefined list, bacteria_species, and match them one by one against the elements of an HTML document from the website http://www.microbiologyresearch.org/content/journal/ijsem. If a string occurs in an element of the HTML, the text of the whole element should be returned.
Here is my code:
import scrapy


class BacteriaSpider(scrapy.Spider):
    name = 'bacteria'
    allowed_domains = ['https://www.microbiologyresearch.org/content/journal/ijsem']
    start_urls = ['http://www.microbiologyresearch.org/content/journal/ijsem/']

    def parse(self, response):
        bacteria_species = ['Abditibacterium utsteinense',
                            'Abiotrophia defectiva',
                            'Abyssibacter profundi',
                            'Abyssicoccus albus',
                            'Abyssivirga alkaniphila',
                            'Acanthopleuribacter pedis',
                            'Acaricomes phytoseiuli',
                            'Acetanaerobacterium elongatum',
                            'Acetanaerobacterium sp.',
                            'Acetatifactor muris']
        for bacteria in bacteria_species:
            response.xpath("//*/text()[contains(., bacteria)]").getall()  # select the text of all nodes
            pass
Unfortunately it doesn't work.
Does anyone have a better idea?
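I think the problem is that your XPath is not in the proper format: inside the string, bacteria is read as part of the XPath expression, not as the Python variable. You need to build the expression dynamically, like this: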
def parse(self, response):
    bacteria_species = ['prokaryotes',
                        'Malaciobacter']
    search_xpath = "//*/text()[contains(., {0})]"
    for bact in bacteria_species:
        searchfor = search_xpath.format('"' + bact + '"')
        print(searchfor)
        results = response.xpath(searchfor).getall()
        for item in results:
            yield {
                "bacteria": bact,
                "results": item
            }
I changed bacteria_species to terms that I saw several times on the web page, and they were found. One thing to note is that XPath is case-sensitive, which could be a problem depending on the data.
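If case sensitivity does turn out to be a problem, one common XPath 1.0 workaround is to lowercase the node text with translate() before calling contains(). The sketch below only builds the expression string; xpath_literal and text_contains_xpath are hypothetical helper names, and the case folding covers ASCII letters only.

```python
def xpath_literal(value):
    """Quote a Python string as an XPath 1.0 string literal."""
    if '"' not in value:
        return '"' + value + '"'
    if "'" not in value:
        return "'" + value + "'"
    # Both quote types occur: XPath 1.0 has no escape sequences,
    # so glue the pieces together with concat().
    parts = value.split('"')
    return "concat(" + ", '\"', ".join('"' + p + '"' for p in parts) + ")"


UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = UPPER.lower()


def text_contains_xpath(term):
    """Build an XPath for text nodes containing `term`, ignoring ASCII case."""
    return '//text()[contains(translate(., "{0}", "{1}"), {2})]'.format(
        UPPER, LOWER, xpath_literal(term.lower())
    )


print(text_contains_xpath("Abiotrophia defectiva"))
```

Inside parse(), the result could be passed straight to response.xpath(text_contains_xpath(bact)).getall(). Quoting through a helper also keeps search terms that contain quotation marks from breaking the expression.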
Is what you posted your exact code?
You have
start_urls = ['http://https://www.{...}']
which is invalid as it contains both http:// and https://.
It should probably just be https://www.{...}.
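A quick sanity check on start URLs can catch this kind of mistake before the crawl starts. This is a minimal sketch using only the standard library; looks_like_valid_start_url is a hypothetical helper, and the rule "no second :// after the scheme" is a heuristic that would also reject legitimate URLs carrying another URL in their query string.

```python
from urllib.parse import urlsplit


def looks_like_valid_start_url(url):
    """Heuristic check for an http(s) URL without a doubled scheme prefix."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    if not parts.hostname:
        return False
    # A second "://" after the scheme usually means a doubled prefix.
    rest = url[len(parts.scheme) + 3:]
    return "://" not in rest


# example.com is a placeholder host used only for illustration.
for candidate in ("http://https://www.example.com",
                  "https://www.example.com/",
                  "www.example.com"):
    print(candidate, looks_like_valid_start_url(candidate))
```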
I am getting a blank CSV, even though the code is not showing any error.
It is unable to crawl through the web page.
This is the code I have written, following a YouTube tutorial:
import scrapy
from Example.items import MovieItem


class ThirdSpider(scrapy.Spider):
    name = "imdbtestspider"
    allowed_domains = ["imdb.com"]
    start_url = ('http://www.imdb.com/chart/top',)

    def parse(self,response):
        links = response.xpath('//tbody[@class="lister-list"]/tr/td[@class="titleColumn"]/a/@href').extract()
        i = 1
        for link in links:
            abs_url = response.urljoin(link)
            #
            url_next = '//*[@id="main"]/div/span/div/div/div[2]/table/tbody/tr['+str(i)+']/td[3]/strong/text()'
            rating = response.xpath(url_next).extact()
            if (i <= len(link)):
                i=i+1
            yield scrapy.Request(abs_url, callback = self.parse_indetail, meta = {'rating': rating})

    def parse_indetail(self,response):
        item = MovieItem()
        #
        item['title'] = response.xpath('//div[@class="title_wrapper"])/h1/text()').extract[0][:-1]
        item['directors'] = response.xpath('//div[@class="credit_summary_items"]/span[@itemprop="director"]/a/span/text()').extract()[0]
        item['writers'] = response.xpath('//div[@class="credit_summary_items"]/span[@itemprop="creator"]/a/span/text()').extract()
        item['stars'] = response.xpath('//div[@class="credit_summary_items"]/span[@itemprop="actors"]/a/span/text()').extract()
        item['popularity'] = response.xpath('//div[@class="titleReviewBarSubItem"]/div/span/text()').extract()[2][21:-8]
        return item
This is the output I get when running the code with
scrapy crawl imdbtestspider -o example.csv -t csv
2019-01-17 18:44:34 [scrapy.core.engine] INFO: Spider opened
2019-01-17 18:44:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Here is another approach you might try. I used CSS selectors instead of XPath to make the script less verbose.
import scrapy


class ImbdsdpyderSpider(scrapy.Spider):
    name = 'imbdspider'
    start_urls = ['http://www.imdb.com/chart/top']

    def parse(self, response):
        for link in response.css(".titleColumn a[href^='/title/']::attr(href)").extract():
            yield scrapy.Request(response.urljoin(link),callback=self.get_info)

    def get_info(self, response):
        item = {}
        title = response.css(".title_wrapper h1::text").extract_first()
        item['title'] = ' '.join(title.split()) if title else None
        item['directors'] = response.css(".credit_summary_item h4:contains('Director') ~ a::text").extract()
        item['writers'] = response.css(".credit_summary_item h4:contains('Writer') ~ a::text").extract()
        item['stars'] = response.css(".credit_summary_item h4:contains('Stars') ~ a::text").extract()
        popularity = response.css(".titleReviewBarSubItem:contains('Popularity') .subText::text").extract_first()
        item['popularity'] = ' '.join(popularity.split()).strip("(") if popularity else None
        item['rating'] = response.css(".ratingValue span::text").extract_first()
        yield item
I have tested the XPaths you gave; I can't tell whether they are typos or genuinely wrong.
For example:
xpath = //*[@id="main"]/div/span/div/div/div[2]/table/tbody/tr['+str(i)+']/td[3]/strong/text()
# There is no table when you reach div[2]
//div[@class="title_wrapper"])/h1/text()  # the `)` after `]` is a syntax error
In addition, your XPaths do not yield any results.
As to why the log says 0 pages were crawled: I have not reproduced your case, so I have to assume that your page iteration is not building the page URLs correctly.
I'm having trouble understanding why you build a list of all the "follow links" and then use len before sending them to parse_indetail(), but here are a couple of things to note.
When you use "meta" to pass items from one function to the next, you have the right idea, but you are missing some of the instantiation in the function you are passing it to (you should also use a standard naming convention for simplicity).
It should be something like this:
def parse(self, response):
    # If you are going to capture an item at the first request, you must
    # instantiate your item class
    item = MovieItem()
    ....
    # You seem to want to pass the rating to the next function, so make sure
    # it is listed in your items.py file and set it here
    item['rating'] = response.xpath(PATH).extract()  # Why did you add url_next?
    ....
    # The standard convention for passing meta with a callback looks like this;
    # it lets you pass an item that holds multiple fields
    yield scrapy.Request(abs_url, callback=self.parse_indetail, meta={'item': item})

def parse_indetail(self, response):
    # Then you must retrieve the item from meta in the function you pass it to
    item = response.meta['item']
    # Then you can continue your scraping
You should not complicate the page iteration logic. You seem to understand how it works but need help fine-tuning this aspect. I have recreated your use case and optimized it.
# items.py file
import scrapy


class TestimbdItem(scrapy.Item):
    title = scrapy.Field()
    directors = scrapy.Field()
    writers = scrapy.Field()
    stars = scrapy.Field()
    popularity = scrapy.Field()
    rating = scrapy.Field()

# The spider file
import scrapy
from testimbd.items import TestimbdItem


class ImbdsdpyderSpider(scrapy.Spider):
    name = 'imbdsdpyder'
    allowed_domains = ['imdb.com']
    start_urls = ['http://www.imdb.com/chart/top']

    def parse(self, response):
        for href in response.css("td.titleColumn a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_movie)

    def parse_movie(self, response):
        item = TestimbdItem()
        item['title'] = [ x.replace('\xa0', '') for x in response.css(".title_wrapper h1::text").extract()][0]
        item['directors'] = response.xpath('//div[@class="credit_summary_item"]/h4[contains(., "Director")]/following-sibling::a/text()').extract()
        item['writers'] = response.xpath('//div[@class="credit_summary_item"]/h4[contains(., "Writers")]/following-sibling::a/text()').extract()
        item['stars'] = response.xpath('//div[@class="credit_summary_item"]/h4[contains(., "Stars")]/following-sibling::a/text()').extract()
        item['popularity'] = response.css(".titleReviewBarSubItem span.subText::text")[2].re('([0-9]+)')
        item['rating'] = response.css(".ratingValue span::text").extract_first()
        yield item
Notice two things:
First, the parse() function. All I'm doing here is looping over the links, referring to each one as href, and passing the urljoined href to the parser function. Given your use case, this is more than enough. If there is a next page, you just create a variable for the "next" page link and call back to parse; it keeps doing that until it can't find a "next" page.
Second, use XPath only when the HTML items have the same tag with different content. This is more of a personal opinion, but I tell people that XPath selectors are like a scalpel and CSS selectors are like a butcher knife. You can get very accurate with a scalpel, but it takes more time, and in many cases it may be easier to use a CSS selector to get the same result.
I'm writing a Scrapy scraper that uses CrawlSpider to crawl sites, go over their internal links, and scrape the contents of any external links (links with a domain different from the original domain).
I managed to do that with 2 rules but they are based on the domain of the site being crawled. If I want to run this on multiple websites I run into a problem because I don't know which "start_url" I'm currently on so I can't change the rule appropriately.
Here's what I came up with so far, it works for one website and I'm not sure how to apply it to a list of websites:
class HomepagesSpider(CrawlSpider):
    name = 'homepages'
    homepage = 'http://www.somesite.com'
    start_urls = [homepage]

    # strip http and www
    domain = homepage.replace('http://', '').replace('https://', '').replace('www.', '')
    domain = domain[:-1] if domain[-1] == '/' else domain

    rules = (
        Rule(LinkExtractor(allow_domains=(domain), deny_domains=()), callback='parse_internal', follow=True),
        Rule(LinkExtractor(allow_domains=(), deny_domains=(domain)), callback='parse_external', follow=False),
    )

    def parse_internal(self, response):
        # log internal page...
        pass

    def parse_external(self, response):
        # parse external page...
        pass
This can probably be done by just passing the start_url as an argument when calling the scraper, but I'm looking for a way to do that programmatically within the scraper itself.
Any ideas?
Thanks!
Simon.
I've found a very similar question and used the second option presented in the accepted answer to develop a workaround for this problem, since it's not supported out-of-the-box in scrapy.
I've created a function that gets a url as an input and creates rules for it:
def rules_for_url(self, url):
    domain = Tools.get_domain(url)
    rules = (
        Rule(LinkExtractor(allow_domains=(domain), deny_domains=()), callback='parse_internal', follow=True),
        Rule(LinkExtractor(allow_domains=(), deny_domains=(domain)), callback='parse_external', follow=False),
    )
    return rules
I then override some of CrawlSpider's functions.
I changed _rules into a dictionary where the keys are the different website domains and the values are the rules for that domain (using rules_for_url). The population of _rules is done in _compile_rules
I then make the appropriate changes in _requests_to_follow and _response_downloaded to support the new way of using _rules.
_rules = {}

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    domain = Tools.get_domain(response.url)
    for n, rule in enumerate(self._rules[domain]):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = self._build_request(domain + ';' + str(n), link)
            yield rule.process_request(r)

def _response_downloaded(self, response):
    meta_rule = response.meta['rule'].split(';')
    domain = meta_rule[0]
    rule_n = int(meta_rule[1])
    rule = self._rules[domain][rule_n]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

def _compile_rules(self):
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    for url in self.start_urls:
        url_rules = self.rules_for_url(url)
        domain = Tools.get_domain(url)
        self._rules[domain] = [copy.copy(r) for r in url_rules]
        for rule in self._rules[domain]:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)
See the original functions here.
Now the spider will simply go over each url in start_urls and create a set of rules specific for that url. Then use the appropriate rules for each website being crawled.
Hope this helps anyone who stumbles upon this problem in the future.
Simon.
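The answer above relies on Tools.get_domain, which is not shown. A minimal sketch of what such a helper might look like, using urllib.parse rather than chained str.replace calls, follows; the function name and its behaviour (dropping a leading "www." and any port or path) are assumptions, not the author's actual implementation.

```python
from urllib.parse import urlsplit


def get_domain(url):
    """Return the host of `url` without a leading "www."."""
    # urlsplit only fills in the host when a scheme (or "//") is present,
    # so prepend "//" to bare hosts like "www.somesite.com".
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


print(get_domain("http://www.somesite.com/"))
print(get_domain("https://somesite.com/a/b?x=1"))
```

Parsing the URL avoids a pitfall of replace('www.', ''), which would also strip "www." from the middle of a host name and would leave any path or port attached to the domain.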
Iterate over all the website links in start_urls and populate the allow_domains and deny_domains lists. Then define the rules.
start_urls = ["http://www.website1.com", "http://www.website2.com", "http://www.website3.com", "http://www.website4.com"]
allow_domains = []
deny_domains = []
for link in start_urls:
    # strip http and www
    domain = link.replace('http://', '').replace('https://', '').replace('www.', '')
    domain = domain[:-1] if domain[-1] == '/' else domain
    allow_domains.extend([domain])
    deny_domains.extend([domain])
rules = (
    Rule(LinkExtractor(allow_domains=allow_domains, deny_domains=()), callback='parse_internal', follow=True),
    Rule(LinkExtractor(allow_domains=(), deny_domains=deny_domains), callback='parse_external', follow=False),
)
import scrapy
from multiple_pages.items import YieldItem


class YelpSpider(scrapy.Spider):
    name = "yelp"
    allowed_domains = ["yelp.com"]
    start_urls = ('http://www.yelp.com/',)

    def parse(self, response):
        item = YieldItem()
        item['restaurents'] = response.xpath('//span[@class="indexed-biz-name"]//text()').extract()
        item['rating'] = response.xpath('//div[@class="rating-large"]').extract()
        item['phonenumber'] = response.xpath('//span[@class="biz-phone"]//a//text()').extract()
        print(item)
When you use // in your XPath, it selects all nodes below the current node that match the selection, no matter how deeply they are nested. So I guess you're selecting several text fields.
Try something more specific, like:
item['phonenumber'] = response.xpath('//span[@class="biz-phone"]/text()').extract()
I have Scrapy code that runs in the shell, but when I try to export the results to CSV, it returns an empty file. It exports data when I do not follow a link to parse the description, but once I add the extra method for parsing the contents, it fails to work. Here is the code:
class MonsterSpider(CrawlSpider):
    name = "monster"
    allowed_domains = ["jobs.monster.com"]
    base_url = "http://jobs.monster.com/v-technology.aspx?"
    start_urls = [
        "http://jobs.monster.com/v-technology.aspx"
    ]
    for i in range(1,5):
        start_urls.append(base_url + "page=" + str(i))

    rules = (Rule(SgmlLinkExtractor(allow=("jobs.monster.com",))
        , callback = 'parse_items'),)

    def parse_items(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="col-xs-12"]')
        #items = []
        for site in sites.xpath('.//article[@class="js_result_row"]'):
            item = MonsterItem()
            item['title'] = site.xpath('.//span[@itemprop = "title"]/text()').extract()
            item['company'] = site.xpath('.//span[@itemprop = "name"]/text()').extract()
            item['city'] = site.xpath('.//span[@itemprop = "addressLocality"]/text()').extract()
            item['state'] = site.xpath('.//span[@itemprop = "addressRegion"]/text()').extract()
            item['link'] = site.xpath('.//a[@data-m_impr_a_placement_id= "jsr"]/@href').extract()
            follow = ''.join(item["link"])
            request = Request(follow, callback = self.parse_dir_contents)
            request.meta["item"] = item
            yield request
            #items.append(item)
        #return items

    def parse_dir_contents(self, response):
        item = response.meta["item"]
        item['desc'] = site.xpath('.//div[@itemprop = "description"]/text()').extract()
        return item
The original code was the same, but without parse_dir_contents and with the empty "items" list and the "append" code uncommented.
Well, as @tayfun suggests, you should use response.xpath or define the site variable.
By the way, you do not need sel = Selector(response). Responses come with the xpath function, so there is no need to wrap them in another selector.
However, the main problem is that you restrict the domain of the spider. You define allowed_domains = ["jobs.monster.com"], but if you look at the URLs followed by your custom Request, they are something like http://jobview.monster.com/ or http://job-openings.monster.com. In this case your parse_dir_contents is not executed (the domain is not allowed) and your item is never returned, so you won't get any results.
Change allowed_domains = ["jobs.monster.com"] to
allowed_domains = ["monster.com"]
and your spider will work and return items.
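The reason this change helps is that an entry in allowed_domains covers that domain and its subdomains, but not sibling subdomains. A simplified sketch of that matching rule, in plain Python, is below; host_is_allowed is a hypothetical helper written for illustration and is not Scrapy's actual implementation.

```python
def host_is_allowed(host, allowed_domains):
    """True if `host` equals an allowed domain or is a subdomain of one."""
    host = host.lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


# jobview.monster.com is a sibling of jobs.monster.com, not a subdomain of it.
print(host_is_allowed("jobview.monster.com", ["jobs.monster.com"]))
print(host_is_allowed("jobview.monster.com", ["monster.com"]))
```

With ["jobs.monster.com"] the detail pages on jobview.monster.com are filtered out as offsite; with ["monster.com"] they pass.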
You have an error in your parse_dir_contents method:
def parse_dir_contents(self, response):
    item = response.meta["item"]
    item['desc'] = response.xpath('.//div[@itemprop="description"]/text()').extract()
    return item
Note the use of response. I don't know where the site variable you are currently using comes from.
Also, try to provide the error details when you post a question. Writing "it fails to work" doesn't say much.