I am a newbie with XPath. I want to extract every title, body, link, and release date from this link. Everything seems okay except the body: how do I extract each body with a nested XPath? Thanks in advance :)
Here is my source:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from thehack.items import ThehackItem

class MySpider(BaseSpider):
    name = "thehack"
    allowed_domains = ["thehackernews.com"]
    start_urls = ["http://thehackernews.com/search/label/mobile%20hacking"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.xpath('//article[@class="post item module"]')
        items = []
        for title in titles:
            item = ThehackItem()
            item['title'] = title.select('span/h2/a/text()').extract()
            item['link'] = title.select('span/h2/a/@href').extract()
            item['body'] = title.select('span/div/div/div/div/a/div/text()').extract()
            item['date'] = title.select('span/div/span/text()').extract()
            items.append(item)
        return items
Can anybody fix the body block? Only the body fails.
Thanks in advance, mate.
Here is the picture of the inspected elements from the website:
I think you were struggling with the selectors, right? You should check the documentation for selectors; there is a lot of good information there. In this particular example, using CSS selectors, it would be something like:
class MySpider(scrapy.Spider):
    name = "thehack"
    allowed_domains = ["thehackernews.com"]
    start_urls = ["http://thehackernews.com/search/label/mobile%20hacking"]

    def parse(self, response):
        for article in response.css('article.post'):
            item = ThehackItem()
            item['title'] = article.css('.post-title>a::text').extract_first()
            item['link'] = article.css('.post-title>a::attr(href)').extract_first()
            item['body'] = ''.join(article.css('[id^=summary] *::text').extract()).strip()
            item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
            yield item
It would be a good exercise for you to change them to XPath selectors, and maybe also to read about ItemLoaders; together they are very useful.
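For reference, a rough XPath equivalent of the CSS version above might look like the sketch below; the class names and the summary id prefix are carried over from the CSS selectors, so treat them as assumptions about the page's markup rather than verified facts:

def parse(self, response):
    # XPath translation of the CSS selectors above (markup assumed, not verified)
    for article in response.xpath('//article[contains(@class, "post")]'):
        item = ThehackItem()
        item['title'] = article.xpath('.//*[contains(@class, "post-title")]/a/text()').extract_first()
        item['link'] = article.xpath('.//*[contains(@class, "post-title")]/a/@href').extract_first()
        # join every text node under the summary element, as the CSS version does
        item['body'] = ''.join(article.xpath('.//*[starts-with(@id, "summary")]//text()').extract()).strip()
        item['date'] = article.xpath('.//*[@itemprop="datePublished"]/@content').extract_first()
        yield item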
I'm trying to scrape this site using Scrapy, but it returns all the values in a single cell; I expect each value in a different row.
For example:
mileage: 25
mileage: 377
mileage: 247433
mileage: 464130
But I'm getting the data like this:
mileage: [u'25',
u'377',
u'247433',
u'399109',
u'464130',
u'399631',
u'435238',
u'285000',
u'287470',
u'280000']
Here is my code:
import scrapy
from ..items import ExampleItem
from scrapy.selector import HtmlXPathSelector

url = 'https://example.com'

class Example(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.example.com']
    start_urls = [url]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item_selector = hxs.select('//div[@class="listing_format card5 relative"]')
        for fields in item_selector:
            item = ExampleItem()
            item['Mileage'] = fields.select('//li[strong="Mileage"]/span/text()').extract()
            yield item
You didn't show your site, but maybe you need a relative XPath:
item['Mileage'] = fields.select('.//li[strong="Mileage"]/span/text()').extract_first()
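The leading dot matters: inside the loop, an expression that starts with // still searches the whole document, not just the current div, which is why every item ends up with the full list of values. A quick sketch of the difference, using the selectors from the question:

for fields in item_selector:
    # absolute '//' escapes the current div and matches every listing on the page
    all_on_page = fields.select('//li[strong="Mileage"]/span/text()').extract()
    # relative './/' stays inside the current div, so each item gets its own value
    this_listing = fields.select('.//li[strong="Mileage"]/span/text()').extract_first()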
It sounds like you need to iterate over your mileages.
for fields in item_selector:
    mileages = fields.select('//li[strong="Mileage"]/span/text()').extract()
    for mileage in mileages:
        item = ExampleItem()
        item['Mileage'] = mileage
        yield item
Also, consider making your fields.select('//li[strong="Mileage"]/span/text()').extract() selector more specific.
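Putting the two answers together: keep the XPath relative so it matches only inside the current listing, and yield one item per extracted value. A sketch based on the question's code (the div class comes from the question; the rest is assumed):

def parse(self, response):
    for fields in response.xpath('//div[@class="listing_format card5 relative"]'):
        # relative './/' restricts the match to this listing only
        for mileage in fields.xpath('.//li[strong="Mileage"]/span/text()').extract():
            item = ExampleItem()
            item['Mileage'] = mileage
            yield item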
Very new to Scrapy, so bear with me.
First, here is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from usdirectory.items import UsdirectoryItem
from scrapy.http import Request

class MySpider(BaseSpider):
    name = "usdirectory"
    allowed_domains = ["domain.com"]
    start_urls = ["url_removed_sorry"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//*[@id="holder_result2"]/a[1]/span/span[1]/text()').extract()
        for title in titles:
            item = UsdirectoryItem()
            item["title"] = title
            yield item
That works... but it only grabs the first item.
I noticed that for the items I am trying to scrape, the XPath changes for each row. For example, the first row uses the XPath you see above:
//*[@id="holder_result2"]/a[1]/span/span[1]/text()
Then the index increments by 2, all the way to 29. So the second result is:
//*[@id="holder_result2"]/a[3]/span/span[1]/text()
And the last result:
//*[@id="holder_result2"]/a[29]/span/span[1]/text()
So my question is: how do I get the script to grab all of those? I don't care if I have to copy and paste code for every item. All the other pages are exactly the same. I'm just not sure how to go about it.
Thank you very much.
Edit:
import scrapy
from scrapy.item import Item, Field

class UsdirectoryItem(scrapy.Item):
    title = scrapy.Field()
Given that the pattern is exactly as you described, you can use the XPath modulo operator mod on the position index of a to get all the target a elements:
//*[@id="holder_result2"]/a[position() mod 2 = 1]/span/span[1]/text()
For a quick demo, consider the following input XML:
<div>
    <a>1</a>
    <a>2</a>
    <a>3</a>
    <a>4</a>
    <a>5</a>
</div>
Given this XPath, /div/a[position() mod 2 = 1], the following elements will be returned:
<a>1</a>
<a>3</a>
<a>5</a>
See a live demo at xpathtester.com.
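Plugged back into the spider from the question, the whole thing becomes a single query; a sketch where only the XPath has changed:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # one expression matching every odd-positioned <a> (1, 3, ..., 29)
    titles = hxs.select('//*[@id="holder_result2"]/a[position() mod 2 = 1]/span/span[1]/text()').extract()
    for title in titles:
        item = UsdirectoryItem()
        item["title"] = title
        yield item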
Let me know if this works for you. Notice that we are iterating over a[i] instead of a[1], and yielding one item per title.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for i in xrange(15):
        titles = hxs.select('//*[@id="holder_result2"]/a[' + str(1 + i * 2) + ']/span/span[1]/text()').extract()
        for title in titles:
            item = UsdirectoryItem()
            item["title"] = title
            yield item
I am using Scrapy 1.0.3. Here is the code from my spider file:
from scrapy import Spider
from scrapy.selector import Selector
from parser_xxx.items import XxxItem

class XxxSpider(Spider):
    name = "xxx"
    allowed_domains = ["xxx.xxx.com"]
    start_urls = ["http://xxx.xxx.com/jobs/"]

    def parse(self, response):
        quelist = Selector(response).xpath('//div[@id="job_listings"]')
        for que in quelist:
            item = XxxItem()
            item['title'] = que.xpath('//a//h4/text()').extract()
            item['link'] = que.xpath('//a/@href').extract()
            yield item
But I am getting all the anchor links and all the titles. Where am I going wrong?
Thanks in advance!
You have to make your XPath expressions context-specific by prepending a dot. Plus, I think you should iterate over the links inside the div with id="job_listings":
quelist = response.xpath('//div[@id="job_listings"]//a')
for que in quelist:
    item = XxxItem()
    item['title'] = que.xpath('.//h4/text()').extract()
    item['link'] = que.xpath('@href').extract()
    yield item
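Since the question is on Scrapy 1.0.3, extract_first() is also available; it returns a single string (or None) instead of a one-element list, which is usually what you want per field. The same loop, sketched:

quelist = response.xpath('//div[@id="job_listings"]//a')
for que in quelist:
    item = XxxItem()
    # extract_first() returns one string, or None when nothing matches
    item['title'] = que.xpath('.//h4/text()').extract_first()
    item['link'] = que.xpath('@href').extract_first()
    yield item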
I'm starting out with Scrapy now, and I've worked out how to take the content I want from a sports page (the name and team of a soccer player), but I need to follow links searching for more teams; every team page has a link to its players page. The structure of the website's links is:
team page: http://esporte.uol.com.br/futebol/clubes/vitoria/
players page: http://esporte.uol.com.br/futebol/clubes/vitoria/jogadores/
I've read some Scrapy tutorials, and my idea is that I have to follow links on team pages without parsing anything, and parse the players on players pages without following. I don't know whether the idea is right and the syntax wrong, or whether the idea of following is itself wrong. Any help is welcome.
Here is my code:
class MoneyballSpider(BaseSpider):
    name = "moneyball"
    allowed_domains = ["esporte.uol.com.br", "click.uol.com.br", "uol.com.br"]
    start_urls = ["http://esporte.uol.com.br/futebol/clubes/vitoria/jogadores/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'.*futebol/clubes/.*/', ), deny=(r'.*futebol/clubes/.*/jogadores/', )), follow=True),
        Rule(SgmlLinkExtractor(allow=(r'.*futebol/clubes/.*/jogadores/', )), callback='parse', follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        jogadores = hxs.select('//div[@id="jogadores"]/div/ul/li')
        items = []
        for jogador in jogadores:
            item = JogadorItem()
            item['nome'] = jogador.select('h5/a/text()').extract()
            item['time'] = hxs.select('//div[@class="header clube"]/h1/a/text()').extract()
            items.append(item)
            print item['nome'], item['time']
        return items
First, since you need to follow and extract links, you need a CrawlSpider instead of a BaseSpider. Then you need to define two rules: one for players with a callback, and one for teams without one, just to follow. Also, you should start with a URL listing the teams, like http://esporte.uol.com.br/futebol. Here's a complete spider that returns players from different teams:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector

class JogadorItem(Item):
    nome = Field()
    time = Field()

class MoneyballSpider(CrawlSpider):
    name = "moneyball"
    allowed_domains = ["esporte.uol.com.br", "click.uol.com.br", "uol.com.br"]
    start_urls = ["http://esporte.uol.com.br/futebol"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'.*futebol/clubes/.*?/jogadores/', )), callback='parse_players', follow=True),
        Rule(SgmlLinkExtractor(allow=(r'.*futebol/clubes/.*', )), follow=True),
    )

    def parse_players(self, response):
        hxs = HtmlXPathSelector(response)
        jogadores = hxs.select('//div[@id="jogadores"]/div/ul/li')
        items = []
        for jogador in jogadores:
            item = JogadorItem()
            item['nome'] = jogador.select('h5/a/text()').extract()
            item['time'] = hxs.select('//div[@class="header clube"]/h1/a/text()').extract()
            items.append(item)
            print item['nome'], item['time']
        return items
Quote from the output:
...
[u'Silva'] [u'Vila Nova-GO']
[u'Luizinho'] [u'Vila Nova-GO']
...
[u'Michel'] [u'Guarani']
[u'Wellyson'] [u'Guarani']
...
This is just a hint for you to continue working on the spider; you'll need to tweak it further, e.g. choose an appropriate start URL depending on your needs.
Hope that helps.
I'm having a problem iterating a crawl using Scrapy. I am extracting a title field and a content field. The problem is that I get a JSON file with all of the titles listed, followed by all of the content. I'd like to get {title}, {content}, {title}, {content}, which means I probably have to iterate through the parse function. The problem is that I cannot figure out which element I am looping through (i.e., for x in [???]). Here is the code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import SitemapSpider
from Foo.items import FooItem

class FooSpider(SitemapSpider):
    name = "foo"
    sitemap_urls = ['http://www.foo.com/sitemap.xml']
    #sitemap_rules = [

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = FooItem()
        item['title'] = hxs.select('//span[@class="headline"]/text()').extract()
        item['content'] = hxs.select('//div[@class="articletext"]/text()').extract()
        items.append(item)
        return items
Your XPath queries return all the titles and all the contents on the page. I suppose you can do:
titles = hxs.select('//span[@class="headline"]/text()').extract()
contents = hxs.select('//div[@class="articletext"]/text()').extract()
for title, content in zip(titles, contents):
    item = FooItem()
    item['title'] = title
    item['content'] = content
    yield item
But that is not reliable. Try to write an XPath query that returns a block with the title and content inside it. If you showed me the XML source, I could help you:
blocks = hxs.select('//div[@class="some_filter"]')
for block in blocks:
    item = FooItem()
    item['title'] = block.select('span[@class="headline"]/text()').extract()
    item['content'] = block.select('div[@class="articletext"]/text()').extract()
    yield item
I'm not sure about the exact XPath queries, but I think the idea is clear.
You don't need HtmlXPathSelector. Scrapy already has a built-in XPath selector. Try this:
blocks = response.xpath('//div[@class="some_filter"]')
for block in blocks:
    item = FooItem()
    item['title'] = block.xpath('span[@class="headline"]/text()').extract()[0]
    item['content'] = block.xpath('div[@class="articletext"]/text()').extract()[0]
    yield item
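One caveat with the .extract()[0] pattern: it raises an IndexError whenever a block has no match. If your Scrapy version provides it, extract_first() returns None instead, which fails more gracefully. The same loop, sketched:

for block in response.xpath('//div[@class="some_filter"]'):
    item = FooItem()
    # extract_first() yields None instead of raising when nothing matches
    item['title'] = block.xpath('span[@class="headline"]/text()').extract_first()
    item['content'] = block.xpath('div[@class="articletext"]/text()').extract_first()
    yield item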