Python Scrapy for grabbing table columns and rows

I'm relatively new to Python and this is my first time learning Scrapy. I've done data mining with Perl quite successfully before, but this is a whole different ballgame!
I'm trying to scrape a table and grab the columns of each row. My code is below.
items.py
from scrapy.item import Item, Field
class Cio100Item(Item):
    company = Field()
    person = Field()
    industry = Field()
    url = Field()
scrape.py (the spider)
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from cio100.items import Cio100Item
items = []
class MySpider(BaseSpider):
    name = "scrape"
    allowed_domains = ["cio.co.uk"]
    start_urls = ["http://www.cio.co.uk/cio100/2013/cio/"]
    def parse(self, response):
        sel = Selector(response)
        tables = sel.xpath('//table[@class="bgWhite listTable"]//h2')
        for table in tables:
            # print table
            item = Cio100Item()
            item['company'] = table.xpath('a/text()').extract()
            item['person'] = table.xpath('a/text()').extract()
            item['industry'] = table.xpath('a/text()').extract()
            item['url'] = table.xpath('a/@href').extract()
            items.append(item)
        return items
I'm having some trouble understanding how to write the XPath selection correctly.
I think this line is the problem:
tables = sel.xpath('//table[@class="bgWhite listTable"]//h2')
When I run the scraper as shown above, I get output like this in the terminal:
2014-01-13 22:13:29-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/>
{'company': [u"\nDomino's Pizza\n"],
'industry': [u"\nDomino's Pizza\n"],
'person': [u"\nDomino's Pizza\n"],
'url': [u'/cio100/2013/dominos-pizza/']}
2014-01-13 22:13:29-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/>
{'company': [u'\nColin Rees\n'],
'industry': [u'\nColin Rees\n'],
'person': [u'\nColin Rees\n'],
'url': [u'/cio100/2013/dominos-pizza/']}
Ideally I want only one block, not two, with Domino's in the company slot, Colin in the person slot, and the industry grabbed, which it's not doing.
When I use firebug to inspect the table, I see h2 for columns 1 and 2 (company and person) but column 3 is h3?
When I modify the tables line to h3 at the end, as follows
tables = sel.xpath('//table[@class="bgWhite listTable"]//h3')
I get this
2014-01-13 22:16:46-0500 [scrape] DEBUG: Scraped from <200 http://www.cio.co.uk/cio100/2013/cio/>
{'company': [u'\nRetail\n'],
'industry': [u'\nRetail\n'],
'person': [u'\nRetail\n'],
'url': [u'/cio100/2013/dominos-pizza/']}
Here it only produces 1 block, and it's capturing Industry and the URL correctly. But it's not getting the company name or person.
Any help will be greatly appreciated!
Thanks!

As far as the XPath goes, consider doing something like:
$ scrapy shell http://www.cio.co.uk/cio100/2013/cio/
...
>>> for tr in sel.xpath('//table[@class="bgWhite listTable"]/tr'):
...     item = Cio100Item()
...     item['company'] = tr.xpath('td[2]//a/text()').extract()[0].strip()
...     item['person'] = tr.xpath('td[3]//a/text()').extract()[0].strip()
...     item['industry'] = tr.xpath('td[4]//a/text()').extract()[0].strip()
...     item['url'] = tr.xpath('td[4]//a/@href').extract()[0].strip()
...     print item
...
{'company': u'LOCOG',
'industry': u'Leisure and entertainment',
'person': u'Gerry Pennell',
'url': u'/cio100/2013/locog/'}
{'company': u'Laterooms.com',
'industry': u'Leisure and entertainment',
'person': u'Adam Gerrard',
'url': u'/cio100/2013/lateroomscom/'}
{'company': u'Vodafone',
'industry': u'Communications and IT services',
'person': u'Albert Hitchcock',
'url': u'/cio100/2013/vodafone/'}
...
Other than that, you are better off yielding items one by one rather than accumulating them in a list.
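The row-by-row approach, combined with yielding, can be tried out without a network connection. The following is only a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree (which supports a small XPath subset) as a stand-in for Scrapy's Selector, and both the table markup and the iter_rows helper are made up for illustration rather than taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one <tr> per company, one <td> per column.
SAMPLE = """
<table class="bgWhite listTable">
  <tr>
    <td>1</td>
    <td><h2><a href="/cio100/2013/locog/"> LOCOG </a></h2></td>
    <td><h2><a href="/cio100/2013/locog/"> Gerry Pennell </a></h2></td>
    <td><h3><a href="/cio100/2013/locog/"> Leisure and entertainment </a></h3></td>
  </tr>
  <tr>
    <td>2</td>
    <td><h2><a href="/cio100/2013/vodafone/"> Vodafone </a></h2></td>
    <td><h2><a href="/cio100/2013/vodafone/"> Albert Hitchcock </a></h2></td>
    <td><h3><a href="/cio100/2013/vodafone/"> Communications and IT services </a></h3></td>
  </tr>
</table>
"""

def iter_rows(table):
    """Yield one dict per table row instead of appending to a shared list."""
    for tr in table.findall("tr"):
        # Each path is relative to the current row, so its columns stay together.
        company = tr.find("td[2]//a")
        person = tr.find("td[3]//a")
        industry = tr.find("td[4]//a")
        yield {
            "company": company.text.strip(),
            "person": person.text.strip(),
            "industry": industry.text.strip(),
            "url": industry.get("href"),
        }

rows = list(iter_rows(ET.fromstring(SAMPLE)))
for row in rows:
    print(row)
```

Because iter_rows is a generator, each row is handed over as soon as it is built, which is the same idea as yielding items from parse instead of filling a module-level list.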

Related

Scrapy Spider only pulling the first value from item container

I'm trying to scrape pricing info for comic books. What I'm ending up with is a Spider that scrapes through all instances of the top css selector, and then returns the desired value from only the first instance of the selector that contains the pricing info I'm after.
My end goal is to be able to create a pipeline to feed an SQLite db with title, sku, price, and url for the actual listing. Here is my code:
class XmenscrapeSpider(scrapy.Spider):
    name = 'finalscrape'
    allowed_domains = ['mycomicshop.com']
    start_urls = ['https://www.mycomicshop.com/search?TID=222421']
    def parse(self, response):
        for item in response.css('td.highlighted'):
            yield {
                'title' : response.xpath('.//meta[@itemprop="sku"]/@content').get()
            }
        next_page = response.css('li.next a::attr(href)').extract()[1]
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
My output looks like this:
{'title': '100 Bullets (1999 DC Vertigo) 1 CGC 9.8'}
2022-01-24 13:53:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mycomicshop.com/search?TID=222421>
{'title': '100 Bullets (1999 DC Vertigo) 1 CGC 9.8'}
2022-01-24 13:53:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mycomicshop.com/search?TID=222421>
{'title': '100 Bullets (1999 DC Vertigo) 1 CGC 9.8'}
2022-01-24 13:53:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mycomicshop.com/search?TID=222421>
{'title': '100 Bullets (1999 DC Vertigo) 1 CGC 9.8'}
2022-01-24 13:53:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mycomicshop.com/search?TID=222421>
{'title': '100 Bullets (1999 DC Vertigo) 1 CGC 9.8'}
If you look at the URL I'm trying to scrape, you can see that I'm only getting the desired value from the first tag, even though the spider iterates through the five instances of it on the page. I have a feeling that there is a simple solution, but I'm at my wit's end here. Any ideas on what the fix might be?

You need to use an XPath relative to item rather than to response.
import scrapy
class XmenscrapeSpider(scrapy.Spider):
    name = 'finalscrape'
    allowed_domains = ['mycomicshop.com']
    start_urls = ['https://www.mycomicshop.com/search?TID=222421']
    def parse(self, response):
        for item in response.css('td.highlighted'):
            yield {
                # 'title': response.xpath('.//meta[@itemprop="sku"]/@content').get()
                'title': item.xpath('.//meta[@itemprop="name"]/@content').get()
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Note: You only loop over the highlighted items, and since the next page doesn't have any you won't get anything from it.
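To see why the context node matters, here is a small offline sketch. It is a stand-in built on assumptions: it uses xml.etree.ElementTree from the standard library rather than Scrapy, and the markup and the "Issue N" titles are invented placeholders. Calling find on the document root plays the role of response.xpath(...), while calling it on each cell plays the role of item.xpath(...).

```python
import xml.etree.ElementTree as ET

# Invented markup: three cells, each carrying its own <meta itemprop="name">.
SAMPLE = """
<table>
  <tr><td class="highlighted"><meta itemprop="name" content="Issue 1"/></td></tr>
  <tr><td class="highlighted"><meta itemprop="name" content="Issue 2"/></td></tr>
  <tr><td class="highlighted"><meta itemprop="name" content="Issue 3"/></td></tr>
</table>
"""

root = ET.fromstring(SAMPLE)
cells = root.findall(".//td[@class='highlighted']")
query = ".//meta[@itemprop='name']"

# Querying the whole document inside the loop: always the first match.
from_document = [root.find(query).get("content") for _ in cells]

# Querying the loop variable: one match per cell.
from_cell = [cell.find(query).get("content") for cell in cells]

print(from_document)
print(from_cell)
```

The query string is identical in both cases; only the node it is evaluated against changes, and that alone decides whether every iteration returns the same value.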

Need to Extract contents of subpages using scrapy

I'm fairly new to scrapy but have made a few simple scrapers work for me.
I'm trying to go to the next level by getting all the links from one page and scraping the content of the subpages. I've read up a few different examples and Q&As but can't seem to get this code to work for me.
import scrapy
from ..items import remoteworkhub_jobs
class remoteworkhub(scrapy.Spider):
    name = 'remoteworkhub'
    allowed_domains = ['www.remoteworkhub.com']
    #start_urls = ['https://jobs.remoteworkhub.com/']
    start_urls = ['https://jobs.remoteworkhub.com']
    # Scrape the individual job urls and pass them to the spider
    def parse(self, response):
        links = response.xpath('//a[@class="jobList-title"]/@href').extract()
        for jobs in links:
            base_url = 'https://jobs.remoteworkhub.com'
            Url = base_url + jobs
            yield scrapy.Request(Url, callback=self.parsejobpage)
    def parsejobpage(self, response):
        #Extracting the content using css selectors
        titles = response.xpath('//h1[@class="u-mv--remove u-textH2"]/text()').extract()
        companys = response.xpath('/html/body/div[4]/div/div/div[1]/div[1]/div[1]/div[2]/div[2]/div/div[1]/strong/a/text()').extract()
        categories = response.xpath('/html/body/div[4]/div/div/div[1]/div[1]/div[1]/div[3]/ul/li/a/text()').extract()
        worktype = response.xpath('/html/body/div[4]/div/div/div[1]/div[1]/div[1]/div[5]/div[2]/span/text()').extract()
        job_decription = response.xpath('//div[@class="job-body"]//text()').extract()
        #titles = response.css('.jobDetail-headerIntro::text').extract()
        #titles = response.xpath('//title').get()
        #votes = response.css('.score.unvoted::text').extract()
        #times = response.css('time::attr(title)').extract()
        #comments = response.css('.comments::text').extract()
        item = remoteworkhub_jobs()
        #item['jobUrl'] = jobUrl
        item['title'] = titles
        #item['company'] = companys
        #item['category'] = categories
        #item['worktype'] = worktype
        #item['job_description'] = job_decription
        #yield or give the scraped info to scrapy
        yield item

Check out the following implementation, which should let you parse the job titles and the corresponding company names from that site. The way you have defined your XPaths is error-prone, so I've modified them to be more robust. Give it a shot:
import scrapy
class remoteworkhub(scrapy.Spider):
    name = 'remoteworkhub'
    start_urls = ['https://jobs.remoteworkhub.com']
    def parse(self, response):
        for job_link in response.xpath("//*[contains(@class,'job-listing')]//*[@class='jobList-title']/@href").extract():
            Url = response.urljoin(job_link)
            yield scrapy.Request(Url, callback=self.parsejobpage)
    def parsejobpage(self, response):
        d = {}
        d['title'] = response.xpath("//*[@class='jobDetail-headerIntro']/h1/text()").get()
        d['company'] = response.xpath("//*[@class='jobDetail-headerIntro']//strong//text()").get()
        yield d
This is the kind of output I can see in the console if I use print instead of yield:
{'title': 'Sr Full Stack Developer, Node/React - Remote', 'company': 'Clevertech'}
{'title': 'Subject Matter Expert, Customer Experience - Remote', 'company': 'Qualtrics'}
{'title': 'Employee Experience Enterprise Account Executive - Academic and Government - Remote', 'company': 'Qualtrics'}
{'title': 'Senior Solutions Consultant, Brand Experience - Remote', 'company': 'Qualtrics'}
{'title': 'Data Analyst - Remote', 'company': 'Railsware'}
{'title': 'Recruitment Manager - Remote', 'company': 'Railsware'}
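The answer calls response.urljoin(job_link) instead of concatenating strings by hand. As far as I know, that method resolves a link against the page URL in the same way as the standard library's urllib.parse.urljoin, so the behaviour can be explored offline. In this sketch the job paths are made-up examples, not real links from the site.

```python
from urllib.parse import urljoin

# The page the links were found on (taken from the question's start_urls).
page_url = "https://jobs.remoteworkhub.com"

# Hypothetical href values, as they might appear on a listing page.
hrefs = [
    "/jobs/123-data-analyst",                       # root-relative path
    "jobs/456-recruitment-manager",                 # path relative to the page
    "https://jobs.remoteworkhub.com/jobs/789-sre",  # already absolute
]

# urljoin copes with all three forms; base_url + href only suits the first.
absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

Resolving links this way avoids hard-coding the base URL in the spider and keeps working if the site switches between relative and absolute hrefs.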

Scrapy trying to scrape business names href in python

I am trying to scrape the href for each business on Yellow Pages. I am very new to Scrapy; this is only my second day using it. I am using requests to get the actual URL to search with the spider. What am I doing wrong in my code? Eventually I want Scrapy to go to each business and scrape its address and other information.
# -*- coding: utf-8 -*-
import scrapy
import requests
search = "Plumbers"
location = "Hammond, LA"
url = "https://www.yellowpages.com/search"
q = {'search_terms': search, 'geo_location_terms': location}
page = requests.get(url, params=q)
page = page.url
class YellowpagesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['yellowpages.com']
    start_urls = [page]
    def parse(self, response):
        self.log("I just visited: " + response.url)
        items = response.css('span.text::text')
        for items in items:
            print(items)

To get the name use:
response.css('a[class=business-name]::text')
To get the href use:
response.css('a[class=business-name]::attr(href)')
In the final call this looks like:
for bas in response.css('a[class=business-name]'):
    item = { 'name' : bas.css('a[class=business-name]::text').extract_first(),
             'url' : bas.css('a[class=business-name]::attr(href)').extract_first() }
    yield item
Result:
2018-09-13 04:12:49 [quotes] DEBUG: I just visited: https://www.yellowpages.com/search?search_terms=Plumbers&geo_location_terms=Hammond%2C+LA
2018-09-13 04:12:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yellowpages.com/search?search_terms=Plumbers&geo_location_terms=Hammond%2C+LA>
{'name': 'Roto-Rooter Plumbing & Water Cleanup', 'url': '/new-orleans-la/mip/roto-rooter-plumbing-water-cleanup-21804163?lid=149760174'}
2018-09-13 04:12:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.yellowpages.com/search?search_terms=Plumbers&geo_location_terms=Hammond%2C+LA>
{'name': "AJ's Plumbing And Heating Inc", 'url': '/new-orleans-la/mip/ajs-plumbing-and-heating-inc-16078566?lid=1001789407686'}
...
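A side note on the question's setup: the spider does not strictly need a live requests.get call just to learn the search URL. One possible alternative, sketched here with the standard library only, is to build the query string directly. The parameter names come from the question; build_search_url is a hypothetical helper introduced for this example.

```python
from urllib.parse import urlencode

def build_search_url(search, location):
    """Return the search URL without sending any HTTP request."""
    base = "https://www.yellowpages.com/search"
    # urlencode escapes the comma and turns the space into '+'.
    query = urlencode({"search_terms": search, "geo_location_terms": location})
    return f"{base}?{query}"

start_url = build_search_url("Plumbers", "Hammond, LA")
print(start_url)
```

The resulting string is the same URL that appears in the "I just visited" log line above, so it could be placed in start_urls directly.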

Scrapy: Attempts to extract data from selector list not right

I am trying to scrape football fixtures from a website, but my spider is not quite right: I either get the same fixture repeated for all selectors, or the homeTeam and awayTeam variables are huge arrays that contain all home sides or all away sides respectively. Either way, the output should reflect the Home vs Away format.
This is my current attempt:
class FixtureSpider(CrawlSpider):
    name = "fixturesSpider"
    allowed_domains = ["www.bbc.co.uk"]
    start_urls = [
        "http://www.bbc.co.uk/sport/football/premier-league/fixtures"
    ]
    def parse(self, response):
        for sel in response.xpath('//table[@class="table-stats"]/tbody/tr[@class="preview"]'):
            item = Fixture()
            item['kickoff'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr[@class='preview']/td[3]/text()").extract()[0].strip())
            item['homeTeam'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr/td[2]/p/span/a/text()").extract()[0].strip())
            item['awayTeam'] = str(sel.xpath("//table[@class='table-stats']/tbody/tr/td[2]/p/span/a/text()").extract()[1].strip())
            yield item
This returns the below information repeatedly which is incorrect:
2015-03-20 21:41:40+0000 [fixturesSpider] DEBUG: Scraped from <200 http://www.bbc.co.uk/sport/football/premier-league/fixtures>
{'awayTeam': 'West Brom', 'homeTeam': 'Man City', 'kickoff': '12:45'}
2015-03-20 21:41:40+0000 [fixturesSpider] DEBUG: Scraped from <200 http://www.bbc.co.uk/sport/football/premier-league/fixtures>
{'awayTeam': 'West Brom', 'homeTeam': 'Man City', 'kickoff': '12:45'}
Could someone let me know where I'm going wrong?

The problem is that the XPath expressions you are using in the loop are absolute: they start from the root element, but they should be relative to the current row that sel points to. In other words, you need to search in the context of the current row.
Fixed version:
for sel in response.xpath('//table[@class="table-stats"]/tbody/tr[@class="preview"]'):
    item = Fixture()
    item['kickoff'] = str(sel.xpath("td[3]/text()").extract()[0].strip())
    item['homeTeam'] = str(sel.xpath("td[2]/p/span/a/text()").extract()[0].strip())
    item['awayTeam'] = str(sel.xpath("td[2]/p/span/a/text()").extract()[1].strip())
    yield item
This is the output I'm getting:
{'awayTeam': 'West Brom', 'homeTeam': 'Man City', 'kickoff': '12:45'}
{'awayTeam': 'Swansea', 'homeTeam': 'Aston Villa', 'kickoff': '15:00'}
{'awayTeam': 'Arsenal', 'homeTeam': 'Newcastle', 'kickoff': '15:00'}
...
If you want to grab the match dates, you need to change the strategy - iterate over dates (h2 elements with table-header class) and get the first following sibling table element:
for date in response.xpath('//h2[@class="table-header"]'):
    matches = date.xpath('.//following-sibling::table[@class="table-stats"][1]/tbody/tr[@class="preview"]')
    date = date.xpath('text()').extract()[0].strip()
    for match in matches:
        item = Fixture()
        item['date'] = date
        item['kickoff'] = match.xpath("td[3]/text()").extract()[0].strip()
        item['homeTeam'] = match.xpath("td[2]/p/span/a/text()").extract()[0].strip()
        item['awayTeam'] = match.xpath("td[2]/p/span/a/text()").extract()[1].strip()
        yield item
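The date-then-matches strategy relies on each h2 heading being followed by the table of its matches. The same grouping idea can be sketched offline by walking the siblings in document order. Note the swap: this uses the standard library's xml.etree.ElementTree, which has no following-sibling axis, so the loop remembers the most recent heading instead. The markup, the "Date A" and "Date B" placeholders, and the iter_fixtures name are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Invented markup: each date heading is followed by the table of its matches.
SAMPLE = """
<div>
  <h2 class="table-header">Date A</h2>
  <table class="table-stats"><tbody>
    <tr class="preview"><td/><td><p><span><a>Man City</a></span><span><a>West Brom</a></span></p></td><td> 12:45 </td></tr>
  </tbody></table>
  <h2 class="table-header">Date B</h2>
  <table class="table-stats"><tbody>
    <tr class="preview"><td/><td><p><span><a>Aston Villa</a></span><span><a>Swansea</a></span></p></td><td> 15:00 </td></tr>
    <tr class="preview"><td/><td><p><span><a>Newcastle</a></span><span><a>Arsenal</a></span></p></td><td> 15:00 </td></tr>
  </tbody></table>
</div>
"""

def iter_fixtures(container):
    """Yield one fixture per row, tagged with the heading that precedes its table."""
    current_date = None
    for node in container:  # direct children, in document order
        if node.tag == "h2":
            current_date = node.text.strip()
        elif node.tag == "table":
            for row in node.findall("tbody/tr[@class='preview']"):
                # Both team links live in the second cell of the current row.
                teams = [a.text.strip() for a in row.findall("td[2]/p/span/a")]
                yield {
                    "date": current_date,
                    "kickoff": row.find("td[3]").text.strip(),
                    "homeTeam": teams[0],
                    "awayTeam": teams[1],
                }

fixtures = list(iter_fixtures(ET.fromstring(SAMPLE)))
for fixture in fixtures:
    print(fixture)
```

In Scrapy itself the following-sibling expression shown above does this grouping in a single XPath; the explicit loop here only makes the heading-to-table relationship visible.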

Try the selectors below. I believe you need ...tbody//tr/... instead of ...tbody/tr/... to get all table rows instead of just the first one.
item['kickoff'] = str(sel.xpath("//table[@class='table-stats']/tbody//tr[@class='preview']/td[3]/text()").extract()[0].strip())
item['homeTeam'] = str(sel.xpath("//table[@class='table-stats']/tbody//tr/td[2]/p/span/a/text()").extract()[0].strip())
item['awayTeam'] = str(sel.xpath("//table[@class='table-stats']/tbody//tr/td[2]/p/span/a/text()").extract()[1].strip())

Python scrapy to extract specific Xpath fields

I have the following structure (sample) and am using Scrapy to extract the details. I need to extract the 'href' and the text, such as 'Accounting'. I am using the code below. I am new to XPath, so any help extracting these specific fields would be appreciated.
<div class = 'something'>
<ul>
<li>Accounting</li>
<li>Administrative</li>
<li>Advertising</li>
<li>Airline</li>
</ul>
</div>
My code is:
from scrapy.spider import BaseSpider
from jobfetch.items import JobfetchItem
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
class JobFetchSpider(BaseSpider):
    """Spider for regularly updated livingsocial.com site, San Francisco Page"""
    name = "Jobsearch"
    allowed_domains = ["jobsearch.about.com/"]
    start_urls = ['http://jobsearch.about.com/od/job-titles/fl/job-titles-a-z.htm']
    def parse(self, response):
        count = 0
        for sel in response.xpath('//*[@id="main"]/div/div[2]/div[1]/div/div[2]/article/div[2]/ul[1]'):
            item = JobfetchItem()
            item['title'] = sel.extract()
            item['link'] = sel.extract()
            count = count+1
        print item
        yield item

The problems you have in the code:
yield item should be inside the loop, since you are instantiating items there.
The XPath you have is pretty messy and not very reliable, since it relies heavily on the elements' location inside parent tags and starts from almost the top parent of the document.
Your XPath is incorrect: it should go down to the a elements inside li inside ul.
sel.extract() would only give you that ul element extracted.
For the sake of an example, let's use a CSS selector here to get to the a tags inside the li tags:
import scrapy
from jobfetch.items import JobfetchItem
class JobFetchSpider(scrapy.Spider):
    name = "Jobsearch"
    # allowed_domains takes bare domain names, without a trailing slash
    allowed_domains = ["jobsearch.about.com"]
    start_urls = ['http://jobsearch.about.com/od/job-titles/fl/job-titles-a-z.htm']
    def parse(self, response):
        for sel in response.css('article[itemprop="articleBody"] div.expert-content-text > ul > li > a'):
            item = JobfetchItem()
            item['title'] = sel.xpath('text()').extract()[0]
            item['link'] = sel.xpath('@href').extract()[0]
            yield item
Running the spider produces:
{'link': u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm', 'title': u'Accounting'}
{'link': u'http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm', 'title': u'Administrative'}
...
{'link': u'http://jobsearch.about.com/od/job-title-samples/fl/yacht-job-titles.htm', 'title': u'Yacht Jobs'}
FYI, we could have used xpath() also:
//article[@itemprop="articleBody"]//div[@class="expert-content-text"]/ul/li/a

You can run the following expressions in the Scrapy shell to extract the data you want.
In [1]: response.xpath('//div[@class="expert-content-text"]/ul/li/a/text()').extract()
Out[1]:
[u'Accounting',
u'Administrative',
u'Advertising',
u'Airline',
u'Animal',
u'Alternative Energy',
u'Auction House',
u'Banking',
u'Biotechnology',
u'Business',
u'Business Intelligence',
u'Chef',
u'College Admissions',
u'College Alumni Relations and Development ',
u'College Student Services',
u'Construction',
u'Consulting',
u'Corporate',
u'Cruise Ship',
u'Customer Service',
u'Data Science',
u'Engineering',
u'Entry Level Jobs',
u'Environmental',
u'Event Planning',
u'Fashion',
u'Film',
u'First Job',
u'Fundraiser',
u'Healthcare/Medical',
u'Health/Safety',
u'Hospitality',
u'Human Resources',
u'Human Services / Social Work',
u'Information Technology (IT)',
u'Insurance',
u'International Affairs / Development',
u'International Business',
u'Investment Banking',
u'Law Enforcement',
u'Legal',
u'Maintenance',
u'Management',
u'Manufacturing',
u'Marketing',
u'Media',
u'Museum',
u'Music',
u'Non Profit',
u'Nursing',
u'Outdoor ',
u'Public Administration',
u'Public Relations',
u'Purchasing',
u'Radio',
u'Real Estate ',
u'Restaurant',
u'Retail',
u'Sales',
u'School',
u'Science',
u'Ski and Snow Jobs',
u'Social Media',
u'Social Work',
u'Sports',
u'Television',
u'Trades',
u'Transportation',
u'Travel',
u'Yacht Jobs']
In [2]: response.xpath('//div[@class="expert-content-text"]/ul/li/a/@href').extract()
Out[2]:
[u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm',
u'http://jobsearch.about.com/od/job-titles/a/advertising-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/airline-industry-jobs.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/animal-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/alternative-energy-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/auction-house-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/banking-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/biotechnology-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/business-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/business-intelligence-job-titles.htm',
u'http://culinaryarts.about.com/od/culinaryfundamentals/a/whatisachef.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/college-admissions-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/college-alumni-relations-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/college-student-service-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/construction-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/consulting-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/c-level-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/cruise-ship-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/customer-service-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/data-science-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/engineering-job-titles.htm',
u'http://jobsearch.about.com/od/best-jobs/a/best-entry-level-jobs.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/environmental-job-titles.htm',
u'http://eventplanning.about.com/od/eventcareers/tp/corporateevents.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/fashion-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/film-job-titles.htm',
u'http://jobsearch.about.com/od/justforstudents/a/first-job-list.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/fundraiser-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/health-care-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/health-safety-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/hospitality-job-titles.htm',
u'http://humanresources.about.com/od/HR-Roles-And-Responsibilities/fl/human-resources-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/human-services-social-work-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/it-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/insurance-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/international-affairs-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/international-business-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/investment-banking-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/law-enforcement-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/legal-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/maintenance-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/management-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/manufacturing-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/marketing-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/media-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/museum-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/music-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/nonprofit-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/nursing-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/outdoor-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/public-administration-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/public-relations-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/purchasing-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/radio-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/real-estate-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/restaurant-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/retail-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/sales-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/high-school-middle-school-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/science-job-titles.htm',
u'http://jobsearch.about.com/od/skiandsnowjobs/a/skijob2_2.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/social-media-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/social-work-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/sports-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/television-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/trades-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/transportation-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/travel-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/yacht-job-titles.htm']
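The two shell calls return separate lists, one of titles and one of links. Since both come from the same a elements in the same order, they can be paired by position with zip. Here is a small sketch using three title and link pairs copied from the output above; one of the titles carries a trailing space in the real output, hence the strip.

```python
# Values copied from the shell output above (Python 3 strings, so no u prefix).
titles = ["Accounting", "Administrative", "Outdoor "]
links = [
    "http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm",
    "http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm",
    "http://jobsearch.about.com/od/job-title-samples/fl/outdoor-job-titles.htm",
]

# zip pairs the n-th title with the n-th link; strip removes stray whitespace.
items = [{"title": t.strip(), "link": l} for t, l in zip(titles, links)]
for item in items:
    print(item)
```

This pairing only holds while both lists have the same length and order. If some li had text but no link, the pairs would drift apart, which is why iterating over the a elements one by one, as in the other answer, is the safer approach.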
