I'm trying to scrape pricing info for comic books. What I'm ending up with is a Spider that scrapes through all instances of the top css selector, and then returns the desired value from only the first instance of the selector that contains the pricing info I'm after.
My end goal is to be able to create a pipeline to feed an SQLite db with title, sku, price, and url for the actual listing. Here is my code:
class XmenscrapeSpider(scrapy.Spider):
    name = 'finalscrape'
    allowed_domains = ['mycomicshop.com']
    start_urls = ['https://www.mycomicshop.com/search?TID=222421']
    def parse(self, response):
        for item in response.css('td.highlighted'):
            yield {
                'title' : response.xpath('.//meta[@itemprop="sku"]/@content').get()
            }
        next_page = response.css('li.next a::attr(href)').extract()[1]
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
My output looks like this:
{'title': '100 Bullets (1999 DC Vertigo) 1 CGC 9.8'}
2022-01-24 13:53:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mycomicshop.com/search?TID=222421>
{'title': '100 Bullets (1999 DC Vertigo) 1 CGC 9.8'}
2022-01-24 13:53:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mycomicshop.com/search?TID=222421>
{'title': '100 Bullets (1999 DC Vertigo) 1 CGC 9.8'}
2022-01-24 13:53:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mycomicshop.com/search?TID=222421>
{'title': '100 Bullets (1999 DC Vertigo) 1 CGC 9.8'}
2022-01-24 13:53:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mycomicshop.com/search?TID=222421>
{'title': '100 Bullets (1999 DC Vertigo) 1 CGC 9.8'}
If you look at the URL I'm trying to scrape, you can see that I'm only getting the desired value from the first tag, despite the spider iterating through the five instances of it on the page. I have a feeling that there is a simple solution, but I'm at my wit's end here. Any ideas on what would probably be a simple fix?
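You need to use an XPath relative to item: call item.xpath(...) inside the loop instead of response.xpath(...). response.xpath(...) always searches the whole page, so .get() returns the same first match on every iteration.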
import scrapy
class XmenscrapeSpider(scrapy.Spider):
    name = 'finalscrape'
    allowed_domains = ['mycomicshop.com']
    start_urls = ['https://www.mycomicshop.com/search?TID=222421']
    def parse(self, response):
        for item in response.css('td.highlighted'):
            yield {
                # 'title': response.xpath('.//meta[@itemprop="sku"]/@content').get()
                'title': item.xpath('.//meta[@itemprop="name"]/@content').get()
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Note: You only loop over the highlighted items, and since the next page doesn't have any you won't get anything from it.
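The question also mentions feeding an SQLite database with title, sku, price, and url through a pipeline. What follows is only a minimal sketch of what such an item pipeline could look like, using the standard sqlite3 module and the classic open_spider / process_item / close_spider pipeline methods. The class name, the comics.db file name, the table layout, and the choice of sku as primary key are all assumptions for illustration, and the spider would have to yield the sku, price, and url keys as well.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores scraped listings in SQLite."""

    def __init__(self, db_path="comics.db"):
        # db_path is an assumed file name, not something from the question.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB, create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS comics ("
            "sku TEXT PRIMARY KEY, title TEXT, price TEXT, url TEXT)"
        )

    def process_item(self, item, spider):
        # INSERT OR REPLACE keeps a single row per sku if a listing repeats.
        self.conn.execute(
            "INSERT OR REPLACE INTO comics (sku, title, price, url) "
            "VALUES (?, ?, ?, ?)",
            (item.get("sku"), item.get("title"),
             item.get("price"), item.get("url")),
        )
        self.conn.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

The pipeline would then be enabled through the ITEM_PIPELINES setting of the project, with whatever module path the class actually lives under.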
I'm fairly new to scrapy but have made a few simple scrapers work for me.
I'm trying to go to the next level by getting all the links from one page and scraping the content of the subpages. I've read through a few different examples and Q&As but can't seem to get this code to work for me.
import scrapy
from ..items import remoteworkhub_jobs
class remoteworkhub(scrapy.Spider):
    name = 'remoteworkhub'
    allowed_domains = ['www.remoteworkhub.com']
    #start_urls = ['https://jobs.remoteworkhub.com/']
    start_urls = ['https://jobs.remoteworkhub.com']
    # Scrape the individual job urls and pass them to the spider
    def parse(self, response):
        links = response.xpath('//a[@class="jobList-title"]/@href').extract()
        for jobs in links:
            base_url = 'https://jobs.remoteworkhub.com'
            Url = base_url + jobs
            yield scrapy.Request(Url, callback=self.parsejobpage)
    def parsejobpage(self, response):
        #Extracting the content using css selectors
        titles = response.xpath('//h1[@class="u-mv--remove u-textH2"]/text()').extract()
        companys = response.xpath('/html/body/div[4]/div/div/div[1]/div[1]/div[1]/div[2]/div[2]/div/div[1]/strong/a/text()').extract()
        categories = response.xpath('/html/body/div[4]/div/div/div[1]/div[1]/div[1]/div[3]/ul/li/a/text()').extract()
        worktype = response.xpath('/html/body/div[4]/div/div/div[1]/div[1]/div[1]/div[5]/div[2]/span/text()').extract()
        job_decription = response.xpath('//div[@class="job-body"]//text()').extract()
        #titles = response.css('.jobDetail-headerIntro::text').extract()
        #titles = response.xpath('//title').get()
        #votes = response.css('.score.unvoted::text').extract()
        #times = response.css('time::attr(title)').extract()
        #comments = response.css('.comments::text').extract()
        item = remoteworkhub_jobs()
        #item['jobUrl'] = jobUrl
        item['title'] = titles
        #item['company'] = companys
        #item['category'] = categories
        #item['worktype'] = worktype
        #item['job_description'] = job_decription
        #yield or give the scraped info to scrapy
        yield item
Check out the following implementation, which should let you parse the job titles and the corresponding company names from that site. The way you have defined the XPaths is error-prone, so I've modified them to be more robust. Give it a shot:
import scrapy
class remoteworkhub(scrapy.Spider):
    name = 'remoteworkhub'
    start_urls = ['https://jobs.remoteworkhub.com']
    def parse(self, response):
        for job_link in response.xpath("//*[contains(@class,'job-listing')]//*[@class='jobList-title']/@href").extract():
            Url = response.urljoin(job_link)
            yield scrapy.Request(Url, callback=self.parsejobpage)
    def parsejobpage(self, response):
        d = {}
        d['title'] = response.xpath("//*[@class='jobDetail-headerIntro']/h1/text()").get()
        d['company'] = response.xpath("//*[@class='jobDetail-headerIntro']//strong//text()").get()
        yield d
This is the kind of output I can see in the console if I use print instead of yield:
{'title': 'Sr Full Stack Developer, Node/React - Remote', 'company': 'Clevertech'}
{'title': 'Subject Matter Expert, Customer Experience - Remote', 'company': 'Qualtrics'}
{'title': 'Employee Experience Enterprise Account Executive - Academic and Government - Remote', 'company': 'Qualtrics'}
{'title': 'Senior Solutions Consultant, Brand Experience - Remote', 'company': 'Qualtrics'}
{'title': 'Data Analyst - Remote', 'company': 'Railsware'}
{'title': 'Recruitment Manager - Remote', 'company': 'Railsware'}
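The answer swaps the manual base_url + jobs concatenation for response.urljoin, which resolves each href against the URL of the page it was found on. Scrapy builds this on the standard library's urllib.parse.urljoin, so the difference can be sketched without running a spider. The job paths below are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

page_url = "https://jobs.remoteworkhub.com"

# Hypothetical hrefs of the three shapes a listing page might contain.
rooted = "/job/12345/example-role"
already_absolute = "https://example.com/job/1"
bare = "job/12345/example-role"

# urljoin copes with all three shapes.
print(urljoin(page_url, rooted))
print(urljoin(page_url, already_absolute))
print(urljoin(page_url + "/list/page2", bare))

# Plain concatenation only works for the first shape.
print(page_url + bare)  # host and path run together
```

Concatenation happens to work when every href starts with a slash, but it silently produces a broken URL for the other two shapes, which is why joining is the safer default.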
I am trying to scrape the href for each business on yellowpages. I am very new to Scrapy; this is only my second day using it. I am using requests to build the actual URL for the spider to search with. What am I doing wrong with my code? I eventually want Scrapy to go to each business and scrape its address and other information.
# -*- coding: utf-8 -*-
import scrapy
import requests
search = "Plumbers"
location = "Hammond, LA"
url = "https://www.yellowpages.com/search"
q = {'search_terms': search, 'geo_location_terms': location}
page = requests.get(url, params=q)
page = page.url
class YellowpagesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['yellowpages.com']
    start_urls = [page]
    def parse(self, response):
        self.log("I just visited: " + response.url)
        items = response.css('span.text::text')
        for items in items:
            print(items)
To get the name use:
response.css('a[class=business-name]::text')
To get the href use:
response.css('a[class=business-name]::attr(href)')
In the final call this looks like:
for bas in response.css('a[class=business-name]'):
    item = { 'name' : bas.css('a[class=business-name]::text').extract_first(),
             'url' : bas.css('a[class=business-name]::attr(href)').extract_first() }
    yield item
Result:
2018-09-13 04:12:49 [quotes] DEBUG: I just visited: https://www.yellowpages.com/search?search_terms=Plumbers&geo_location_terms=Hammond%2C+LA
2018-09-13 04:12:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yellowpages.com/search?search_terms=Plumbers&geo_location_terms=Hammond%2C+LA>
{'name': 'Roto-Rooter Plumbing & Water Cleanup', 'url': '/new-orleans-la/mip/roto-rooter-plumbing-water-cleanup-21804163?lid=149760174'}
2018-09-13 04:12:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yellowpages.com/search?search_terms=Plumbers&geo_location_terms=Hammond%2C+LA>
{'name': "AJ's Plumbing And Heating Inc", 'url': '/new-orleans-la/mip/ajs-plumbing-and-heating-inc-16078566?lid=1001789407686'}
...
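As a side note on the question's code, requests.get is used there only to obtain the encoded search URL, which costs an extra HTTP request every time the module is imported. A minimal sketch of building the same URL with the standard library instead, reusing the question's search and location values (the names base_url, query, and start_url are mine):

```python
from urllib.parse import urlencode

search = "Plumbers"
location = "Hammond, LA"
base_url = "https://www.yellowpages.com/search"

# urlencode percent-encodes the comma and turns the space into '+'.
query = urlencode({"search_terms": search, "geo_location_terms": location})
start_url = f"{base_url}?{query}"

print(start_url)
# https://www.yellowpages.com/search?search_terms=Plumbers&geo_location_terms=Hammond%2C+LA
```

The resulting string is the same URL that appears in the log output above, and it could be placed directly in start_urls.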
I am trying to scrape football fixtures from a website, and my spider is not quite right: either I get the same fixture repeated for every selector, or the homeTeam and awayTeam variables are huge arrays that contain all home sides or all away sides respectively. Either way, each item should reflect the Home vs Away format.
This is my current attempt:
class FixtureSpider(CrawlSpider):
    name = "fixturesSpider"
    allowed_domains = ["www.bbc.co.uk"]
    start_urls = [
        "http://www.bbc.co.uk/sport/football/premier-league/fixtures"
    ]
    def parse(self, response):
        for sel in response.xpath('//table[@class="table-stats"]/tbody/tr[@class="preview"]'):
            item = Fixture()
            item['kickoff'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr[@class='preview']/td[3]/text()").extract()[0].strip())
            item['homeTeam'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr/td[2]/p/span/a/text()").extract()[0].strip())
            item['awayTeam'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr/td[2]/p/span/a/text()").extract()[1].strip())
            yield item
This returns the below information repeatedly which is incorrect:
2015-03-20 21:41:40+0000 [fixturesSpider] DEBUG: Scraped from <200 http://www.bbc.co.uk/sport/football/premier-league/fixtures>
{'awayTeam': 'West Brom', 'homeTeam': 'Man City', 'kickoff': '12:45'}
2015-03-20 21:41:40+0000 [fixturesSpider] DEBUG: Scraped from <200 http://www.bbc.co.uk/sport/football/premier-league/fixtures>
{'awayTeam': 'West Brom', 'homeTeam': 'Man City', 'kickoff': '12:45'}
Could someone let me know where I'm going wrong?
The problem is that the XPath expressions you are using in the loop are absolute: they start from the root element, but they should be relative to the current row that sel points to. In other words, you need to search in the context of the current row.
Fixed version:
for sel in response.xpath('//table[@class="table-stats"]/tbody/tr[@class="preview"]'):
    item = Fixture()
    item['kickoff'] = str(sel.xpath("td[3]/text()").extract()[0].strip())
    item['homeTeam'] = str(sel.xpath("td[2]/p/span/a/text()").extract()[0].strip())
    item['awayTeam'] = str(sel.xpath("td[2]/p/span/a/text()").extract()[1].strip())
    yield item
This is the output I'm getting:
{'awayTeam': 'West Brom', 'homeTeam': 'Man City', 'kickoff': '12:45'}
{'awayTeam': 'Swansea', 'homeTeam': 'Aston Villa', 'kickoff': '15:00'}
{'awayTeam': 'Arsenal', 'homeTeam': 'Newcastle', 'kickoff': '15:00'}
...
If you want to grab the match dates, you need to change the strategy - iterate over dates (h2 elements with table-header class) and get the first following sibling table element:
for date in response.xpath('//h2[@class="table-header"]'):
    matches = date.xpath('.//following-sibling::table[@class="table-stats"][1]/tbody/tr[@class="preview"]')
    date = date.xpath('text()').extract()[0].strip()
    for match in matches:
        item = Fixture()
        item['date'] = date
        item['kickoff'] = match.xpath("td[3]/text()").extract()[0].strip()
        item['homeTeam'] = match.xpath("td[2]/p/span/a/text()").extract()[0].strip()
        item['awayTeam'] = match.xpath("td[2]/p/span/a/text()").extract()[1].strip()
        yield item
Try the selectors below. I believe you need ...tbody//tr/... instead of ...tbody/tr/... to get all table rows instead of just the first one.
item['kickoff'] = str(sel.xpath("//table[@class='table-stats']/tbody//tr[@class='preview']/td[3]/text()").extract()[0].strip())
item['homeTeam'] = str(sel.xpath("//table[@class='table-stats']/tbody//tr/td[2]/p/span/a/text()").extract()[0].strip())
item['awayTeam'] = str(sel.xpath("//table[@class='table-stats']/tbody//tr/td[2]/p/span/a/text()").extract()[1].strip())
I have the following structure (sample) and I am using Scrapy to extract the details. I need to extract the 'href' and the text, such as 'Accounting'. I am using the code below. I am new to XPath; any help with extracting these specific fields would be appreciated.
<div class = 'something'>
    <ul>
        <li>Accounting</li>
        <li>Administrative</li>
        <li>Advertising</li>
        <li>Airline</li>
    </ul>
</div>
My code is:
from scrapy.spider import BaseSpider
from jobfetch.items import JobfetchItem
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
class JobFetchSpider(BaseSpider):
    """Spider for regularly updated livingsocial.com site, San Francisco Page"""
    name = "Jobsearch"
    allowed_domains = ["jobsearch.about.com/"]
    start_urls = ['http://jobsearch.about.com/od/job-titles/fl/job-titles-a-z.htm']
    def parse(self, response):
        count = 0
        for sel in response.xpath('//*[@id="main"]/div/div[2]/div[1]/div/div[2]/article/div[2]/ul[1]'):
            item = JobfetchItem()
            item['title'] = sel.extract()
            item['link'] = sel.extract()
            count = count+1
            print item
        yield item
The problems you have in the code:
yield item should be inside the loop since you are instantiating items there
the xpath you have is pretty messy and not reliable, since it relies heavily on each element's position inside its parent tags and starts from almost the top of the document
your xpath is incorrect - it should go down to the a elements inside li inside ul
sel.extract() would only give you that ul element extracted
For the sake of an example, use a CSS selector here to get to the li tags:
import scrapy
from jobfetch.items import JobfetchItem
class JobFetchSpider(scrapy.Spider):
    name = "Jobsearch"
    allowed_domains = ["jobsearch.about.com"]
    start_urls = ['http://jobsearch.about.com/od/job-titles/fl/job-titles-a-z.htm']
    def parse(self, response):
        for sel in response.css('article[itemprop="articleBody"] div.expert-content-text > ul > li > a'):
            item = JobfetchItem()
            item['title'] = sel.xpath('text()').extract()[0]
            item['link'] = sel.xpath('@href').extract()[0]
            yield item
Running the spider produces:
{'link': u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm', 'title': u'Accounting'}
{'link': u'http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm', 'title': u'Administrative'}
...
{'link': u'http://jobsearch.about.com/od/job-title-samples/fl/yacht-job-titles.htm', 'title': u'Yacht Jobs'}
FYI, we could have used xpath() also:
//article[@itemprop="articleBody"]//div[@class="expert-content-text"]/ul/li/a
The expressions below, run in the Scrapy shell, extract the data you want to scrape.
In [1]: response.xpath('//div[@class="expert-content-text"]/ul/li/a/text()').extract()
Out[1]:
[u'Accounting',
u'Administrative',
u'Advertising',
u'Airline',
u'Animal',
u'Alternative Energy',
u'Auction House',
u'Banking',
u'Biotechnology',
u'Business',
u'Business Intelligence',
u'Chef',
u'College Admissions',
u'College Alumni Relations and Development ',
u'College Student Services',
u'Construction',
u'Consulting',
u'Corporate',
u'Cruise Ship',
u'Customer Service',
u'Data Science',
u'Engineering',
u'Entry Level Jobs',
u'Environmental',
u'Event Planning',
u'Fashion',
u'Film',
u'First Job',
u'Fundraiser',
u'Healthcare/Medical',
u'Health/Safety',
u'Hospitality',
u'Human Resources',
u'Human Services / Social Work',
u'Information Technology (IT)',
u'Insurance',
u'International Affairs / Development',
u'International Business',
u'Investment Banking',
u'Law Enforcement',
u'Legal',
u'Maintenance',
u'Management',
u'Manufacturing',
u'Marketing',
u'Media',
u'Museum',
u'Music',
u'Non Profit',
u'Nursing',
u'Outdoor ',
u'Public Administration',
u'Public Relations',
u'Purchasing',
u'Radio',
u'Real Estate ',
u'Restaurant',
u'Retail',
u'Sales',
u'School',
u'Science',
u'Ski and Snow Jobs',
u'Social Media',
u'Social Work',
u'Sports',
u'Television',
u'Trades',
u'Transportation',
u'Travel',
u'Yacht Jobs']
In [2]: response.xpath('//div[@class="expert-content-text"]/ul/li/a/@href').extract()
Out[2]:
[u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm',
u'http://jobsearch.about.com/od/job-titles/a/advertising-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/airline-industry-jobs.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/animal-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/alternative-energy-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/auction-house-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/banking-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/biotechnology-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/business-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/business-intelligence-job-titles.htm',
u'http://culinaryarts.about.com/od/culinaryfundamentals/a/whatisachef.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/college-admissions-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/college-alumni-relations-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/college-student-service-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/construction-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/consulting-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/c-level-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/cruise-ship-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/customer-service-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/data-science-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/engineering-job-titles.htm',
u'http://jobsearch.about.com/od/best-jobs/a/best-entry-level-jobs.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/environmental-job-titles.htm',
u'http://eventplanning.about.com/od/eventcareers/tp/corporateevents.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/fashion-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/film-job-titles.htm',
u'http://jobsearch.about.com/od/justforstudents/a/first-job-list.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/fundraiser-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/health-care-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/health-safety-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/hospitality-job-titles.htm',
u'http://humanresources.about.com/od/HR-Roles-And-Responsibilities/fl/human-resources-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/human-services-social-work-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/it-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/insurance-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/international-affairs-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/international-business-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/investment-banking-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/law-enforcement-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/legal-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/maintenance-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/management-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/manufacturing-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/marketing-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/media-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/museum-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/music-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/nonprofit-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/nursing-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/outdoor-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/public-administration-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/public-relations-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/purchasing-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/radio-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/real-estate-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/restaurant-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/retail-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/sales-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/high-school-middle-school-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/science-job-titles.htm',
u'http://jobsearch.about.com/od/skiandsnowjobs/a/skijob2_2.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/social-media-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/social-work-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/sports-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/television-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/trades-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/transportation-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/travel-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/yacht-job-titles.htm']