I want to extract all the product URLs from the link "http://www.shopclues.com/diwali-mega-mall/hot-electronics-sale-fs/audio-systems-fs.html" using Scrapy in Python. Below is the function I'm using to do this:
def parse(self, response):
    print("hello")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@id="pagination_contents"]')
    items = []
    i = 3
    for site in sites:
        item = DmozItem()
        item['link'] = site.select('div[2]/div[' + str(i) + ']/a/@href').extract()
        i = int(i) + 1
        print i
        items.append(item)
    return items
The XPath of each product div is: //div[@id="pagination_contents"]/div[2]/div['+str(i)+']/a/@href
But I'm getting only one link, not all the product URLs.
I think your problem is that hxs.select('//div[@id="pagination_contents"]') only returns one result, so you only do one iteration of the loop.
You can select all following <div> elements that contain an <a>, and loop over those:
sites = hxs.select('//div[@id="pagination_contents"]/div[2]/div[a]')
for site in sites:
    ## This loop runs 33 times in my test.
    ## Access each link:
    item['link'] = site.select('./a[2]/@href').extract()
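Putting that together, a minimal corrected parse() might look like this (a sketch that keeps the DmozItem and the old HtmlXPathSelector API from the question):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    # One selector per product <div> containing an <a>, instead of one selector for the whole container
    sites = hxs.select('//div[@id="pagination_contents"]/div[2]/div[a]')
    for site in sites:
        item = DmozItem()
        # Relative selection: the product link is assumed to be the second <a> inside each product <div>
        item['link'] = site.select('./a[2]/@href').extract()
        items.append(item)
    return items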
I'm creating a web scraper with Scrapy and Python. The page I'm scraping has each item structured as a card. I'm able to scrape some info from these cards (name, location), but I also want to get info that is reached by clicking on a card > new page > clicking a button on the new page that opens a form > scraping a value from the form. How should I structure the parse function: do I need nested loops or separate functions?
class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["example.com"]
    start_urls = ["example.com/page"]

    def parse(self, response):
        for page_url in response.css('a[class ~= search-card]::attr(href)').extract():
            page_url = response.urljoin(page_url)
            yield scrapy.Request(url=page_url, callback=self.parse)

        for vc in response.css('div#vc-profile.container').extract():
            item = StackItem()
            item['name'] = vc.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[1]/h1/text()').extract()
            item['firm'] = vc.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[1]').extract()
            item['pos'] = vc.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[2]').extract()
            em = vc.xpath('/*[@id="vc-profile"]/div/div[1]/div[2]/div[2]/div/div[1]/button').extract()
            item['email'] = em.xpath('//*[@id="email"]/value').extract()
            yield item
The scraper is crawling, but outputting nothing.
The best approach is to create an item object on the first page, scrape the needed data there and save it to the item. Then make a request to the new URL (card > new page > click the button to open the form) and pass the same item along with it. Yielding the output from there will fix the issue.
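A minimal sketch of that pattern, carrying the partially filled item to the next callback via Request.meta (the selectors and field names below are placeholders, not taken from the actual site):

def parse(self, response):
    for card in response.css('div.card'):  # placeholder card selector
        item = StackItem()
        item['name'] = card.css('h2::text').extract_first()            # placeholder selector
        item['location'] = card.css('.location::text').extract_first()  # placeholder selector
        detail_url = response.urljoin(card.css('a::attr(href)').extract_first())
        # Carry the partially filled item to the detail page via meta
        yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

def parse_detail(self, response):
    # Retrieve the item started on the listing page and finish filling it
    item = response.meta['item']
    item['email'] = response.css('#email::attr(value)').extract_first()  # placeholder selector
    yield item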
You should probably split the scraper into one parse method and one parse_item method.
Your parse method goes through the page and yields the URLs of the items for which you want the details. The parse_item method then receives the response for each of those item URLs and extracts the details for that specific item.
Difficult to say what it will look like without knowing the website, but it'll probably look more or less like this:
class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["example.com"]
    start_urls = ["example.com/page"]

    def parse(self, response):
        for page_url in response.css('a[class ~= search-card]::attr(href)').extract():
            page_url = response.urljoin(page_url)
            yield scrapy.Request(url=page_url, callback=self.parse_item)

    def parse_item(self, response):
        item = StackItem()
        item['name'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[1]/h1/text()').extract()
        item['firm'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[1]').extract()
        item['pos'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[2]').extract()
        em = response.xpath('//*[@id="vc-profile"]/div/div[1]/div[2]/div[2]/div/div[1]/button')
        item['email'] = em.xpath('.//*[@id="email"]/value').extract()
        yield item
I am trying to scrape TripAdvisor's reviews, but I cannot find the XPath to have it dynamically go through all the pages. I tried yield and callback, but the problem is that I cannot find the XPath for the link that goes to the next page. I am talking about this site.
Here is my code (updated):
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapingtest.items import ScrapingTestingItem

class scrapingtestspider(Spider):
    name = "scrapytesting"
    allowed_domains = ["tripadvisor.in"]
    base_uri = "tripadvisor.in"
    start_urls = [
        "http://www.tripadvisor.in/Hotel_Review-g297679-d300955-Reviews-Ooty_Fern_Hill_A_Sterling_Holidays_Resort-Ooty_Tamil_Nadu.html"]
    output_json_dict = {}

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//a[contains(text(), "Next")]/@href').extract()
        items = []
        i = 0
        for site in sites:
            item = ScrapingTestingItem()
            #item['reviews'] = sel.xpath('//p[@class="partial_entry"]/text()').extract()
            item['subjects'] = sel.xpath('//span[@class="noQuotes"]/text()').extract()
            item['stars'] = sel.xpath('//*[@class="rate sprite-rating_s rating_s"]/img/@alt').extract()
            item['names'] = sel.xpath('//*[@class="username mo"]/span/text()').extract()
            items.append(item)
            i += 1

        sites = sel.xpath('//a[contains(text(), "Next")]/@href').extract()
        if(sites and len(sites) > 0):
            yield Request(url="tripadvisor.in" + sites[i], callback=self.parse)
        else:
            yield items
If you want to select the URL behind Next, why don't you try something like this:
next_url = response.xpath('//a[contains(text(), "Next")]/@href').extract()
And then yield a Request with this URL? That way you always get the next page to scrape and do not need the line containing the numbers.
Recently I did something similar on TripAdvisor and this approach worked for me. If this doesn't work for you, update your code with the approach you are trying so we can see where it can be improved.
Update
And change your Request creation block to the following:
if(sites and len(sites) > 0):
    for site in sites:
        yield Request(url="http://tripadvisor.in" + site, callback=self.parse)
Remove the else part, and yield the items at the end of the loop once the method has finished parsing the page.
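Putting both changes together, the parse() from the question might end up looking roughly like this (a sketch that keeps the question's selectors and assumes the extracted hrefs are site-relative paths; Scrapy's built-in duplicate request filter keeps the recursion from revisiting pages):

def parse(self, response):
    sel = Selector(response)

    # One item holding the reviews visible on the current page (selectors taken from the question)
    item = ScrapingTestingItem()
    item['subjects'] = sel.xpath('//span[@class="noQuotes"]/text()').extract()
    item['stars'] = sel.xpath('//*[@class="rate sprite-rating_s rating_s"]/img/@alt').extract()
    item['names'] = sel.xpath('//*[@class="username mo"]/span/text()').extract()
    yield item

    # Follow every "Next" link so pagination continues automatically
    for href in sel.xpath('//a[contains(text(), "Next")]/@href').extract():
        yield Request(url="http://www.tripadvisor.in" + href, callback=self.parse)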
I think it can only work if you make a list of the URLs you want to scrape in a .txt file:
class scrapingtestspider(Spider):
    name = "scrapytesting"
    allowed_domains = ["tripadvisor.in"]
    base_uri = "tripadvisor.in"

    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()
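This assumes urls.txt sits next to the spider and contains one absolute review-page URL per line, for example (using the start URL from the question):

# urls.txt -- one absolute URL per line
http://www.tripadvisor.in/Hotel_Review-g297679-d300955-Reviews-Ooty_Fern_Hill_A_Sterling_Holidays_Resort-Ooty_Tamil_Nadu.html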
New to scrapy and python and running into an issue here.
I'm trying to get the entire list of PS3 games from Metacritic. Here is my code:
class MetacriticSpider(BaseSpider):
    name = "metacritic"
    allowed_domains = ["metacritic.com"]
    max_id = 10
    start_urls = [
        "http://www.metacritic.com/browse/games/title/ps3?page="
        #"http://www.metacritic.com/browse/games/title/xbox360?page=0"
    ]

    def start_requests(self):
        for c in lowercase:
            for i in range(self.max_id):
                yield Request('http://www.metacritic.com/browse/games/title/ps3/{0}?page={1}'.format(c, i), callback=self.parse)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="product_wrap"]/div')
        items = []
        for site in sites:
            #item = MetacriticItem()
            #titles = site.xpath('a/text()').extract()
            titles = site.xpath('//div[contains(@class, "basic_stat product_title")]/a/text()').extract()
            #cscore = site.xpath('//div[contains(@class, "basic_stat product_score brief_metascore")]/div[1]/text()').extract()
            if titles:
                item = MetacriticItem()
                item['title'] = titles[0].strip()
                items.append(item)
        return items
For some reason when I check the JSON file, I have 81 instances of each title, and it is starting on
Assassin's Creed: Revelations - Ancestors Character Pack
It should start on the first page, which is the numbered titles, then progress to the A list and check each page in that, etc.
Any ideas why it is doing it this way? I can't see what my problem is.
Your XPath should be relative (.//) to each site:
titles = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/text()').extract()
Also, change the sites selection XPath to (note: no div at the end):
//div[@class="product_wrap"]
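Combined, the loop in parse() would then look something like this (a sketch assuming the same MetacriticItem and the markup implied by the question):

def parse(self, response):
    sel = Selector(response)
    # Select one node per product wrapper (no trailing /div)
    sites = sel.xpath('//div[@class="product_wrap"]')
    items = []
    for site in sites:
        # Relative XPath (.//) so each iteration only looks inside the current product
        titles = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/text()').extract()
        if titles:
            item = MetacriticItem()
            item['title'] = titles[0].strip()
            items.append(item)
    return items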
Please look at this image from Firebug.
I want to get the text inside the <a> tag. I used this:
def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//div[@class="item paid-featured-item"]/div[@class="listing-item"]')
    cars = []
    for site in sites:
        car = CarItem()
        car['ATitle'] = xpath('.//div[@class="block item-title"]/h3/span[@class="title"]/a/text()').extract()
        cars.append(car)
    return cars
I think I have used the correct XPath, but apparently not, because I got an empty result.
Any help?
Following the OP's comment, this is probably what you aimed for:
def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//div[@class="item paid-featured-item"]/div[@class="listing-item"]')
    cars = []
    for site in sites:
        car = CarItem()
        car['ATitle'] = site.xpath('.//div[@class="block item-title"]/h3/span[@class="title"]/a/text()').extract()
        cars.append(car)
    return cars
Alternatively, I see you're using a recent Scrapy version, so you may want to try CSS selectors, which usually make selector expressions easier to read and maintain.
In your case, you could use something like
def parse(self, response):
    sel = Selector(response)
    sites = sel.css('div.paid-featured-item div.listing-item')
    cars = []
    for site in sites:
        car = CarItem()
        car['ATitle'] = site.css('div.item-title h3 span.title a::text').extract()
        cars.append(car)
    return cars
Note that the a::text syntax is a Scrapy extension to CSS selectors.
When I write a parse() function, can I yield both a request and items for one single page?
I want to extract some data from page A, store the data in the database, and extract the links to be followed (this can be done with rules in a CrawlSpider).
I call the pages linked from A pages "B pages". I can write another parse_item() to extract data from B pages, but I also want to extract some links from B pages, so can I only use rules to extract the links? And how do I deal with duplicate URLs in Scrapy?
Yes, you can yield both requests and items. From what I've seen:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = response.url
    links = hxs.select(self.toc_xpath)
    for index, link in enumerate(links):
        href, text = link.select('@href').extract(), link.select('text()').extract()
        yield Request(urljoin(base_url, href[0]), callback=self.parse2)

    for item in self.parse2(response):
        yield item
I'm not 100% sure I understand your question, but the code below requests sites from a starting URL using BaseSpider, scans the starting URL for hrefs, then loops over each link calling parse_url. Everything matched in parse_url is sent to your item pipeline.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()  ## only grab urls with "content" in the url name
    for i in urls:
        yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)

def parse_url(self, response):
    hxs = HtmlXPathSelector(response)
    item = ZipgrabberItem()
    item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()  ## this grabs the zip values
    return item
From Steven Almeroth on Google Groups:
You are right, you can yield Requests and return a list of Items, but that is not what you are attempting. You are attempting to yield a list of Items instead of returning them. And since you are already using parse() as a generator function, you cannot have both yield and return together. But you can have many yields.
Try this:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = response.url
    links = hxs.select(self.toc_xpath)
    for index, link in enumerate(links):
        href, text = link.select('@href').extract(), link.select('text()').extract()
        yield Request(urljoin(base_url, href[0]), callback=self.parse2)

    for item in self.parse2(response):
        yield item
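For the snippet above to run, self.toc_xpath must be defined on the spider (an XPath pointing at the table-of-contents links) and parse2 must exist. A minimal hypothetical parse2, with the item class and selector made up purely for illustration, could look like:

def parse2(self, response):
    hxs = HtmlXPathSelector(response)
    item = TocItem()  # hypothetical item class
    # Pull some example data from the followed page (assumed selector)
    item['title'] = hxs.select('//h1/text()').extract()
    yield item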