I'm using this code. The last two Value assignments are there because I was testing whether either of them would work; neither does, though.
def parse_again(self, response):
    sel = Selector(response)
    meta = sel.xpath('//div[@class="LWimg"]')
    items = []
    for m in meta:
        item = PageItem()
        item['link'] = response.url
        item['Stake'] = m.select('//div[@class="stakedLW"]/h1/text()').extract()
        item['Value'] = m.select('//p[@class="value"]/text()').extract()
        item['Value'] = m.select('//div[@class="value"]/span/span/text()').extract()
        items.append(item)
    return items
to retrieve data from this HTML source:
<div class="LWimg">
    <div class="stakedLW">
        <span class="title">Stake</span>
        <span class="value">5.00</span>
        <span class="currency"></span>
My items.py looks like this:
from scrapy.item import Item, Field

class Page(Item):
    Stake = Field()
    Value = Field()
The problem is that no data is retrieved, i.e. nothing is saved to the .csv in the end.
Any input is welcome.
You are populating the Value field twice, so only the last assignment survives. I think the correct way should be:
item['Value'] = response.xpath('//div[@class="stakedLW"]//span[@class="value"]/text()').extract_first()
The other fields are not necessary, apart from the link one.
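Putting it together, a minimal sketch of the corrected callback, assuming the Page item above plus a link = Field() added to it; since the posted HTML has no h1, I've pointed Stake at the title span, so adjust to your real markup. The ./ prefixes keep each query scoped to its own block:
def parse_again(self, response):
    for m in response.xpath('//div[@class="LWimg"]'):
        item = Page()   # assumes link = Field() alongside Stake and Value
        item['link'] = response.url
        # relative paths so each iteration only searches inside this div
        item['Stake'] = m.xpath('.//span[@class="title"]/text()').extract_first()
        item['Value'] = m.xpath('.//span[@class="value"]/text()').extract_first()
        yield item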
I'm using scrapy to build a spider to monitor prices on a website. The website isn't consistent in how it displays its prices. For its standard price, it always uses the same CSS class; however, when a product goes on promotion, it uses one of two CSS classes. The CSS selectors for both are below:
response.css('span.price-num:last-child::text').extract_first()
response.css('.product-highlight-label')
Below is how my items currently look within my spider:
item = ScraperItem()
item['model'] = extract_with_css('.product-id::text')
item['link'] = extract_with_css('head meta[property="og:url"]::attr(content)')
item['price'] = extract_with_css('span.list-price:last-child::text')
item['promo_price'] = extract_with_css('span.price-num:last-child::text')
yield item
I would like to have something like:
IF response.css('span.price-num:last-child::text') returns a match
    item['promo_price'] = extract_with_css('span.price-num:last-child::text')
ELSE
    item['promo_price'] = extract_with_css('.product-highlight-label')
Every way I've tried this, I have failed.
I got it to work. Here's my code:
item = ScraperItem()
item['model'] = extract_with_css('.product-id::text')
item['link'] = extract_with_css('head meta[property="og:url"]::attr(content)')
item['price'] = extract_with_css('span.list-price:last-child::text')
if response.css('span.price-num:last-child::text'):
    item['promo_price'] = extract_with_css('span.price-num:last-child::text')
else:
    item['promo_price'] = extract_with_css('.product-highlight-label::text')
yield item
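Both snippets assume an extract_with_css helper already in scope; a typical definition, along the lines of the one in the Scrapy tutorial, is a small closure over the response:
def parse(self, response):
    def extract_with_css(query):
        # first match as text, or '' if nothing matched
        return response.css(query).extract_first(default='').strip()
    # ... build and yield the item as above ...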
Very new to scrapy, so bear with me.
First, here is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from usdirectory.items import UsdirectoryItem
from scrapy.http import Request

class MySpider(BaseSpider):
    name = "usdirectory"
    allowed_domains = ["domain.com"]
    start_urls = ["url_removed_sorry"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//*[@id="holder_result2"]/a[1]/span/span[1]/text()').extract()
        for title in titles:
            item = UsdirectoryItem()
            item["title"] = title
            item
            yield item
That works...but it only grabs the first item.
I noticed that in the items I am trying to scrape, the XPath changes for each row. For example, the first row uses the XPath you see above:
//*[@id="holder_result2"]/a[1]/span/span[1]/text()
then it increments by 2, all the way to 29. So the second result:
//*[@id="holder_result2"]/a[3]/span/span[1]/text()
Last result:
//*[@id="holder_result2"]/a[29]/span/span[1]/text()
So my question is: how do I get the script to grab all of those? I don't care if I have to copy and paste code for every item. All the other pages are exactly the same. I'm just not sure how to go about it.
Thank you very much.
Edit:
import scrapy
from scrapy.item import Item, Field

class UsdirectoryItem(scrapy.Item):
    title = scrapy.Field()
Given that the pattern is exactly as you described, you can use the XPath modulo operator mod on the position index of a to get all of the target a elements:
//*[#id="holder_result2"]/a[position() mod 2 = 1]/span/span[1]/text()
For a quick demo, consider the following input XML:
<div>
    <a>1</a>
    <a>2</a>
    <a>3</a>
    <a>4</a>
    <a>5</a>
</div>
Given the XPath /div/a[position() mod 2 = 1], the following elements will be returned:
<a>1</a>
<a>3</a>
<a>5</a>
See a live demo at xpathtester.com.
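Applied to the spider from the question, that single query replaces any per-index loop; a sketch in the same HtmlXPathSelector style:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # one query matches a[1], a[3], ..., a[29] in a single pass
    titles = hxs.select('//*[@id="holder_result2"]'
                        '/a[position() mod 2 = 1]/span/span[1]/text()').extract()
    for title in titles:
        item = UsdirectoryItem()
        item["title"] = title
        yield item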
Let me know if this works for you. Notice we are iterating over a[i] instead of a[1], yielding each item as we go:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for i in xrange(15):
        # builds the XPath for a[1], a[3], ..., a[29]
        titles = hxs.select('//*[@id="holder_result2"]/a[' + str(1 + i * 2) + ']/span/span[1]/text()').extract()
        for title in titles:
            item = UsdirectoryItem()
            item["title"] = title
            yield item
I've built a crawler using Scrapy that goes through a sitemap and scrapes the required components from all the links in the sitemap.
class MySpider(SitemapSpider):
    name = "functie"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        item = MyItem()
        sel = Selector(response)
        item['url'] = response.url
        item['h1'] = sel.xpath("//h1[@class='no-bd']/text()").extract()
        item['jobtype'] = sel.xpath('//input[@name=".Keyword"]/@value').extract()
        item['count'] = sel.xpath('//input[@name="Count"]/@value').extract()
        item['location'] = sel.xpath('//input[@name="Location"]/@value').extract()
        yield item
The item['location'] can have null values in some cases. In that particular case, I want to scrape a different component and store it in item['location'].
The code I've tried is:
item['location'] = sel.xpath('//input[@name="Location"]/@value').extract()
if not item['location']:
    item['location'] = sel.xpath('//a[@class="location"]/text()').extract()
But it doesn't enter the if branch, and returns empty output when the value in the location input field is empty. Any help would be highly useful.
You may wish to check the length of item['location'] instead.
item['location'] = sel.xpath('//input[@name="Location"]/@value').extract()
if len(item['location']) < 1:
    item['location'] = sel.xpath('//a[@class="location"]/text()').extract()
Regardless, have you considered combining the two XPaths with a |?
item['location'] = sel.xpath('//input[@name="Location"]/@value | //a[@class="location"]/text()').extract()
Note that the union returns matches from both expressions if both exist, so you may want extract_first() there instead of extract().
Try this approach:
if item['location'] == []:
    item['location'] = sel.xpath('//a[@class="location"]/text()').extract()
I think what you are trying to achieve is best solved with a custom item pipeline.
1) Open pipelines.py and check your desired if condition within a pipeline class. Note that the response (and therefore the selector) is not available inside a pipeline, so the alternative value has to be extracted in the spider and carried along on the item:
class LocPipeline(object):
    def process_item(self, item, spider):
        # check if key "location" is in the item dict and non-empty
        if not item.get("location"):
            # if not, fall back to the value the spider stored in a
            # separate field ("location_fallback" is an illustrative name)
            item['location'] = item.get('location_fallback')
        # if location was already found, leave it as is
        return item
2) The next step is to add the custom LocPipeline() to your settings.py file:
ITEM_PIPELINES = {'myproject.pipelines.LocPipeline': 300}
With the custom pipeline added to your settings, Scrapy will automatically call LocPipeline().process_item() after MySpider().parse() and fill in the alternative value if no location has been found yet.
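For completeness, a sketch of the spider side under that assumption; the location_fallback field name is my own invention and would need a matching Field() on MyItem:
# in MySpider.parse(): extract both candidate values up front and let
# the pipeline decide which to keep ("location_fallback" is a made-up name)
item['location'] = sel.xpath('//input[@name="Location"]/@value').extract()
item['location_fallback'] = sel.xpath('//a[@class="location"]/text()').extract()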
I have an item which gets filled in across several parse functions. I want to return the updated item once parsing is complete. Here is my scenario:
My Item class:
class MyItem(Item):
    name = Field()
    links1 = Field()
    links2 = Field()
I have multiple URLs to crawl after login. In the parse function, I'm doing:
for url in urls:
    yield Request(url=url, callback=self.get_info)
In get_info, I extract 'name' and 'links' from each response:
item = MyItem()
item['name'] = hxs.select("//title/text()").extract()
links = []
for data in json_parsed_from_response:
    link = {}
    link['name'] = data.get('name')
    link['url'] = data.get('url')
    links.append(link)
item['links1'] = links
# similarly, item['links2'] is created
Now, I want to go through each of the URLs in item['links1'] and item['links2'] (these loops are inside get_info):
for link in item['links1']:
    request = Request(url=link['url'], callback=self.get_status)
    request.meta['link'] = link
    yield request
for link in item['links2']:
    request = Request(url=link['url'], callback=self.get_status)
    request.meta['link'] = link
    yield request
# Where do I return the item? I can't return it inside the generator.
def get_status(self, response):
    link = response.meta['link']
    if "good" in response.body:
        link['status'] = 'good'
    else:
        link['status'] = 'bad'
    # Will changes made here be reflected in the item?
    # Also, I can't return the item from here; multiple items would be returned.
I can't figure out where the item has to be returned from so that it carries all the updated data.
Sorry, but unless you give some more details I can't understand the design of your code, and therefore I can't help much... The best suggestion I have is to create a list of *MyItem*s and append each item you create to that list. Since the link dicts are shared by reference between the item and request.meta, the values should change as you change them, so you should be able to iterate over the list and see the updated items.
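One common way to finish this kind of fan-out, sketched under my own assumptions rather than anything confirmed in the question: pass the item and a shared list of unchecked links through request.meta, and let whichever get_status callback empties the list yield the finished item. It assumes all link URLs are distinct; otherwise Scrapy's duplicate filter would drop requests and the item would never be emitted.
def get_info(self, response):
    item = MyItem()
    item['name'] = response.xpath('//title/text()').extract()
    # ... fill item['links1'] and item['links2'] as in the question ...
    pending = item['links1'] + item['links2']  # links still awaiting a status
    for link in list(pending):                 # iterate a copy; callbacks mutate pending
        request = Request(url=link['url'], callback=self.get_status)
        request.meta['link'] = link
        request.meta['item'] = item
        request.meta['pending'] = pending      # the same list object on every request
        yield request

def get_status(self, response):
    link = response.meta['link']
    link['status'] = 'good' if 'good' in response.body else 'bad'
    pending = response.meta['pending']
    pending.remove(link)                       # shrink the shared list
    if not pending:                            # the last callback emits the item once
        yield response.meta['item']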
I'm having a problem iterating a crawl using scrapy. I am extracting a title field and a content field. The problem is that I get a JSON file with all of the titles listed first and then all of the content. I'd like to get {title}, {content}, {title}, {content}, meaning I probably have to iterate within the parse function. The problem is that I cannot figure out what element I am looping over (i.e., for x in [???]). Here is the code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import SitemapSpider
from Foo.items import FooItem

class FooSpider(SitemapSpider):
    name = "foo"
    sitemap_urls = ['http://www.foo.com/sitemap.xml']
    #sitemap_rules = [

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = FooItem()
        item['title'] = hxs.select('//span[@class="headline"]/text()').extract()
        item['content'] = hxs.select('//div[@class="articletext"]/text()').extract()
        items.append(item)
        return items
Your XPath queries return all titles and all contents on the page. I suppose you can do:
titles = hxs.select('//span[@class="headline"]/text()').extract()
contents = hxs.select('//div[@class="articletext"]/text()').extract()
for title, content in zip(titles, contents):
    item = FooItem()
    item['title'] = title
    item['content'] = content
    yield item
But it is not reliable: if the two counts ever differ, titles and contents get paired up wrongly. Try to perform an XPath query that returns a block with the title and content inside it. If you show me the XML source, I can be more specific.
blocks = hxs.select('//div[@class="some_filter"]')
for block in blocks:
    item = FooItem()
    item['title'] = block.select('span[@class="headline"]/text()').extract()
    item['content'] = block.select('div[@class="articletext"]/text()').extract()
    yield item
I'm not sure about the exact XPath queries without seeing your markup, but I think the idea is clear.
You don't need HtmlXPathSelector. Scrapy already has a built-in XPath selector on the response. Try this:
blocks = response.xpath('//div[@class="some_filter"]')
for block in blocks:
    item = FooItem()
    item['title'] = block.xpath('span[@class="headline"]/text()').extract()[0]
    item['content'] = block.xpath('div[@class="articletext"]/text()').extract()[0]
    yield item