I am trying to perform horizontal crawling: starting from the first page and moving through until the last page. The code is as follows:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from scrapy_project.items import metacriticItem
import datetime
# xpaths
main_xpath = '//div[@class = "title_bump"]//td[@class = "clamp-summary-wrap"]'
mt_url_xpath = './a/@href'
class MovieUrlSpider(CrawlSpider):
    name = 'movie_url'
    allowed_domains = ['web']
    start_urls = (
        'https://www.metacritic.com/browse/movies/score/metascore/all',
    )

    # rules for horizontal crawling
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@rel="next"]'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # list of items we want
        main = response.xpath(main_xpath)
        for i in main:
            # create the loader using the response
            l = ItemLoader(item=metacriticItem(), selector=i)
            # key
            l.add_xpath('mt_url', mt_url_xpath)
            # housekeeping fields
            l.add_value('url', response.url)
            l.add_value('spider', self.name)
            l.add_value('date', datetime.datetime.now().strftime('%d/%m/%Y'))
            yield l.load_item()
This script did not parse any items and it returned the following messages:
2021-05-12 02:22:24 [scrapy.core.engine] INFO: Spider opened
2021-05-12 02:22:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-05-12 02:22:24 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-05-12 02:22:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/robots.txt> (referer: None)
2021-05-12 02:22:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/movies/score/metascore/all> (referer: None)
2021-05-12 02:22:24 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.metacritic.com': <GET https://www.metacritic.com/browse/movies/score/metascore/all?page=1>
2021-05-12 02:22:24 [scrapy.core.engine] INFO: Closing spider (finished)
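The "Filtered offsite request" line is the key hint: the OffsiteMiddleware drops the page=1 link because allowed_domains = ['web'] does not match metacritic.com. A minimal sketch of the fix, assuming nothing else in the spider changes:

class MovieUrlSpider(CrawlSpider):
    name = 'movie_url'
    # the offsite filter wants bare domain names, not placeholders or URLs
    allowed_domains = ['metacritic.com']
    start_urls = (
        'https://www.metacritic.com/browse/movies/score/metascore/all',
    )
    # rules and parse_item stay exactly as above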
I am trying to scrape a site with div elements: for each div element I want to scrape some data, then follow its child links and scrape more data from those pages.
Here is the code of quote.py
import scrapy
from ..items import QuotesItem

class QuoteSpider(scrapy.Spider):
    name = 'quote'
    baseurl = 'http://quotes.toscrape.com'
    start_urls = [baseurl]

    def parse(self, response):
        all_div_quotes = response.css('.quote')

        for quote in all_div_quotes:
            item = QuotesItem()

            title = quote.css('.text::text').extract()
            author = quote.css('.author::text').extract()
            tags = quote.css('.tag::text').extract()
            author_details_url = self.baseurl + quote.css('.author+ a::attr(href)').extract_first()

            item['title'] = title
            item['author'] = author
            item['tags'] = tags

            request = scrapy.Request(author_details_url,
                                     callback=self.author_born,
                                     meta={'item': item, 'next_url': author_details_url})
            yield request

    def author_born(self, response):
        item = response.meta['item']
        next_url = response.meta['next_url']
        author_born = response.css('.author-born-date::text').extract()
        item['author_born'] = author_born
        yield scrapy.Request(next_url, callback=self.author_birthplace,
                             meta={'item': item})

    def author_birthplace(self, response):
        item = response.meta['item']
        author_birthplace = response.css('.author-born-location::text').extract()
        item['author_birthplace'] = author_birthplace
        yield item
Here is the code of items.py
import scrapy

class QuotesItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    author_born = scrapy.Field()
    author_birthplace = scrapy.Field()
I ran the command scrapy crawl quote -o data.json, but there was no error message and data.json was empty. I was expecting to get all the data in their corresponding fields.
Can you please help me?
Take a closer look at your logs; you'll find messages like this:
DEBUG: Filtered duplicate request: <GET http://quotes.toscrape.com/author/Albert-Einstein>
Scrapy automatically manages duplicates and tries not to visit one URL twice (for obvious reasons).
In your case you can add dont_filter=True to your requests, and you will see something like this:
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Steve-Martin/> (referer: http://quotes.toscrape.com/author/Steve-Martin/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Albert-Einstein/> (referer: http://quotes.toscrape.com/author/Albert-Einstein/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Marilyn-Monroe/> (referer: http://quotes.toscrape.com/author/Marilyn-Monroe/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/J-K-Rowling/> (referer: http://quotes.toscrape.com/author/J-K-Rowling/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Eleanor-Roosevelt/> (referer: http://quotes.toscrape.com/author/Eleanor-Roosevelt/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Andre-Gide/> (referer: http://quotes.toscrape.com/author/Andre-Gide/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Thomas-A-Edison/> (referer: http://quotes.toscrape.com/author/Thomas-A-Edison/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Jane-Austen/> (referer: http://quotes.toscrape.com/author/Jane-Austen/)
Which is indeed a bit strange, because each author page ends up yielding a request to itself.
Overall you could end up with something like this:
import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quote'
    baseurl = 'http://quotes.toscrape.com'
    start_urls = [baseurl]

    def parse(self, response):
        all_div_quotes = response.css('.quote')

        for quote in all_div_quotes:
            item = dict()

            title = quote.css('.text::text').extract()
            author = quote.css('.author::text').extract()
            tags = quote.css('.tag::text').extract()
            author_details_url = self.baseurl + quote.css('.author+ a::attr(href)').extract_first()

            item['title'] = title
            item['author'] = author
            item['tags'] = tags

            print(item)

            # dont_filter=True in case we get two quotes by the same author.
            # This is not optimal though. A better decision would be to save author data
            # to self.storage and only visit new author info pages when needed,
            # otherwise take the info from the saved dict.
            request = scrapy.Request(author_details_url,
                                     callback=self.author_info,
                                     meta={'item': item},
                                     dont_filter=True)
            yield request

    def author_info(self, response):
        item = response.meta['item']
        author_born = response.css('.author-born-date::text').extract()
        author_birthplace = response.css('.author-born-location::text').extract()
        item['author_born'] = author_born
        item['author_birthplace'] = author_birthplace
        yield item
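Building on the comment in the code above, here is a minimal sketch of that caching idea, assuming a plain dict on the spider (the class name QuoteCachedSpider and the attribute author_cache are made up for illustration): author pages already seen are served from the cache, and dont_filter is kept so early duplicates are simply re-fetched rather than silently dropped.

import scrapy

class QuoteCachedSpider(scrapy.Spider):
    name = 'quote_cached'
    baseurl = 'http://quotes.toscrape.com'
    start_urls = [baseurl]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # hypothetical cache: author page URL -> {'author_born': ..., 'author_birthplace': ...}
        self.author_cache = {}

    def parse(self, response):
        for quote in response.css('.quote'):
            item = {
                'title': quote.css('.text::text').extract(),
                'author': quote.css('.author::text').extract(),
                'tags': quote.css('.tag::text').extract(),
            }
            author_url = self.baseurl + quote.css('.author + a::attr(href)').extract_first()

            if author_url in self.author_cache:
                # author details already fetched once: reuse them, no extra request
                item.update(self.author_cache[author_url])
                yield item
            else:
                # not cached yet: fetch the author page (dont_filter keeps early
                # duplicates from being dropped before the cache is populated)
                yield scrapy.Request(author_url,
                                     callback=self.parse_author,
                                     meta={'item': item, 'author_url': author_url},
                                     dont_filter=True)

    def parse_author(self, response):
        item = response.meta['item']
        details = {
            'author_born': response.css('.author-born-date::text').extract(),
            'author_birthplace': response.css('.author-born-location::text').extract(),
        }
        self.author_cache[response.meta['author_url']] = details
        item.update(details)
        yield item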
import scrapy

class Pttscrapper2Spider(scrapy.Spider):
    name = 'PTTscrapper2'
    allowed_domains = ['https://www.ptt.cc']
    start_urls = ['https://www.ptt.cc/bbs/HatePolitics/index.html/']
    handle_httpstatus_list = [400, 302]

    def parse(self, response):
        urls = response.css('div.r-ent > div.title > a::attr(href)').extract()
        for thread_url in urls:
            url = response.urljoin(thread_url)
            yield scrapy.Request(url=url, callback=self.parse_details)

        next_page_url = response.css('a.wide:nth-child(2)::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {
            'title': response.xpath('//head/title/text()').extract(),
            'stance': response.xpath('//*[@id="main-content"]/div[@class="push"]/span[1]/text()').extract(),
            'userid': response.xpath('//*[@id="main-content"]/div[@class="push"]/span[2]/text()').extract(),
            'comment': response.xpath('//*[@id="main-content"]/div[@class="push"]/span[3]/text()').extract(),
            'time_of_post': response.xpath('//*[@id="main-content"]/div[@class="push"]/span[4]/text()').extract(),
        }
I've been using the above spider to try and crawl a website, but when I run the spider, I get these messages:
2017-10-05 23:14:27 [scrapy.core.engine] INFO: Spider opened
2017-10-05 23:14:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-05 23:14:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-05 23:14:28 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <302 https://www.ptt.cc/bbs/HatePolitics/index.html/>
Set-Cookie: __cfduid=d3ca57dcab04acfaf256438a57c547e4a1507216462; expires=Fri, 05-Oct-18 15:14:22 GMT; path=/; domain=.ptt.cc; HttpOnly
2017-10-05 23:14:28 [scrapy.core.engine] DEBUG: Crawled (302) <GET https://www.ptt.cc/bbs/HatePolitics/index.html/> (referer: None)
2017-10-05 23:14:28 [scrapy.core.engine] INFO: Closing spider (finished)
What I've been thinking is that my spider can't seem to access the sub-forums from the index. I've tested that the selectors point to the correct locations and that response.urljoin creates the correct absolute URL, but it still can't seem to reach the sub-forums on a page. It would be great if someone could tell me why the spider is unable to access the links!
There are two issues with your scraper. First, in start_urls you added a trailing slash after index.html, which is wrong. Second, allowed_domains takes domain names, not URLs.
Change the start of your spider to the following and it will work:
class Pttscrapper2Spider(scrapy.Spider):
    name = 'PTTscrapper2'
    allowed_domains = ['www.ptt.cc']
    start_urls = ['https://www.ptt.cc/bbs/HatePolitics/index.html']
Logs from the run
2017-10-06 13:16:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ptt.cc/bbs/HatePolitics/M.1507268600.A.57C.html> (referer: https://www.ptt.cc/bbs/HatePolitics/index.html)
2017-10-06 13:16:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ptt.cc/bbs/HatePolitics/M.1507268600.A.57C.html>
{'title': ['[黑特] 先刪文,洪慈庸和高潞那個到底撤案了沒? - 看板 HatePolitics - 批踢踢實業坊'], 'stance': ['推 ', '→ ', '噓 ', '→ ', '→ '], 'userid': ['ABA0525', 'gerund', 'AGODFATHER', 'laman45', 'victoryman'], 'comment': [': 垃圾不分藍綠黃', ': 垃圾靠弟傭 中華民國內最沒資格當立委的爛貨', ': 說什麼東西你個板啊', ': 有確定再說', ': 看起來應該是撤了'], 'time_of_post': ['10/06 13:43\n', '10/06 13:50\n', '10/06 13:57\n', '10/06 13:59\n', ' 10/06 15:27\n']}
2017-10-06 13:16:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ptt.cc/bbs/HatePolitics/M.1507275599.A.657.html> (referer: https://www.ptt.cc/bbs/HatePolitics/index.html)
I searched for similar issues on Stack Overflow and other Q&A sites, but I could not find a proper answer to my problem.
I have written the following spider to crawl nautilusconcept.com. The category structure of the site is so bad that I had to apply a rule that parses every link with a callback, and I determine which URLs should actually be parsed with an if statement inside the parse_item method. In any case, the spider doesn't obey my deny rules and still tries to crawl links containing (?brw...).
Here is my spider:
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from vitrinbot.items import ProductItem
from vitrinbot.base import utils
import hashlib

removeCurrency = utils.removeCurrency
getCurrency = utils.getCurrency

class NautilusSpider(CrawlSpider):
    name = 'nautilus'
    allowed_domains = ['nautilusconcept.com']
    start_urls = ['http://www.nautilusconcept.com/']
    xml_filename = 'nautilus-%d.xml'

    xpaths = {
        'category': '//tr[@class="KategoriYazdirTabloTr"]//a/text()',
        'title': '//h1[@class="UrunBilgisiUrunAdi"]/text()',
        'price': '//hemenalfiyat/text()',
        'images': '//td[@class="UrunBilgisiUrunResimSlaytTd"]//div/a/@href',
        'description': '//td[@class="UrunBilgisiUrunBilgiIcerikTd"]//*/text()',
        'currency': '//*[@id="UrunBilgisiUrunFiyatiDiv"]/text()',
        'check_page': '//div[@class="ayrinti"]'
    }

    rules = (
        Rule(
            LinkExtractor(allow=('com/[\w_]+',),
                          deny=('asp$',
                                'login\.asp'
                                'hakkimizda\.asp',
                                'musteri_hizmetleri\.asp',
                                'iletisim_formu\.asp',
                                'yardim\.asp',
                                'sepet\.asp',
                                'catinfo\.asp\?brw',
                                ),
                          ),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        i = ProductItem()
        sl = Selector(response=response)

        if not sl.xpath(self.xpaths['check_page']):
            return i

        i['id'] = hashlib.md5(response.url.encode('utf-8')).hexdigest()
        i['url'] = response.url
        i['category'] = " > ".join(sl.xpath(self.xpaths['category']).extract()[1:-1])
        i['title'] = sl.xpath(self.xpaths['title']).extract()[0].strip()
        i['special_price'] = i['price'] = sl.xpath(self.xpaths['price']).extract()[0].strip().replace(',', '.')

        images = []
        for img in sl.xpath(self.xpaths['images']).extract():
            images.append("http://www.nautilusconcept.com/" + img)
        i['images'] = images

        i['description'] = (" ".join(sl.xpath(self.xpaths['description']).extract())).strip()
        i['brand'] = "Nautilus"
        i['expire_timestamp'] = i['sizes'] = i['colors'] = ''
        i['currency'] = sl.xpath(self.xpaths['currency']).extract()[0].strip()

        return i
Here is a piece of the Scrapy log:
2014-07-22 17:39:31+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=-1&order=&src=&typ=> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=&offset=&order=&src=&stock=1)
2014-07-22 17:39:31+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=&offset=&order=&src=&stock=1)
2014-07-22 17:39:32+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=name&src=&stock=1&typ=> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=&offset=&order=&src=&stock=1)
2014-07-22 17:39:32+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=2&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:32+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=2&kactane=100&mrk=1&offset=-1&order=name&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:33+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&chkBeden=&chkMarka=&chkRenk=&cid=64&cmp=&direction=1&grp=&kactane=100&model=&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:33+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=1&chkBeden=&chkMarka=&chkRenk=&cid=64&cmp=&direction=1&grp=&kactane=100&model=&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:33+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=1&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=name&src=&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=1&chkBeden=&chkMarka=&chkRenk=&cid=64&cmp=&direction=1&grp=&kactane=100&model=&mrk=1&offset=-1&order=name&src=&stock=1&typ=7)
The spider also crawls the proper pages, but it must not try to crawl links that contain catinfo.asp?brw...
I'm using Scrapy==0.24.2 and Python 2.7.6.
It's a canonicalization "issue": by default, LinkExtractor returns canonicalized URLs, but the regexes from deny and allow are applied before canonicalization.
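To see why the anchored pattern misses these links, here is a small illustration (the raw href below is hypothetical; the site's pages may order the parameters differently): the regexes run against the href as it appears in the HTML, while canonicalize_url, which ships with Scrapy's w3lib dependency, produces the alphabetically sorted query strings you see in the log.

import re
from w3lib.url import canonicalize_url

# hypothetical raw href as it could appear in the page HTML,
# with other query parameters in front of brw
raw = 'http://www.nautilusconcept.com/catinfo.asp?cid=64&order=prc&brw=1'

print(re.search(r'catinfo\.asp\?brw', raw))    # None -- brw is not right after the '?'
print(re.search(r'catinfo\.asp\?.*brw', raw))  # matches, whatever the parameter order

# canonicalization sorts the query arguments, which is the form the log shows:
# http://www.nautilusconcept.com/catinfo.asp?brw=1&cid=64&order=prc
print(canonicalize_url(raw))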
I suggest you use these rules:
rules = (
    Rule(
        LinkExtractor(allow=('com/[\w_]+',),
                      deny=('asp$',
                            'login\.asp',
                            'hakkimizda\.asp',
                            'musteri_hizmetleri\.asp',
                            'iletisim_formu\.asp',
                            'yardim\.asp',
                            'sepet\.asp',
                            'catinfo\.asp\?.*brw',
                            ),
                      ),
        callback='parse_item',
        follow=True
    ),
)
My goal is to scrape a list of URLs and headlines from a site, as part of a larger project, which is what drove me to learn Scrapy. As it stands, using BaseSpider to scrape the first page of a given date (the format is /archive/date/) works fine. However, trying to use CrawlSpider (working off some tutorials) to scrape each sequential page of a given date isn't working, and I'm not sure why. I've tried a number of solutions.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from physurlfetch.items import PhysurlfetchItem
from scrapy.http import Request

class PhysURLSpider(CrawlSpider):
    date = raw_input("Please input a date in the format M-DD-YYYY: ")
    name = "PhysURLCrawlSpider"
    allowed_domains = "phys.org"
    start_url_str = ("http://phys.org/archive/%s/") % (date)
    start_urls = [start_url_str]

    rules = (
        Rule(SgmlLinkExtractor(allow=("\d\.html",)),
             callback="parse_items", follow=True),
    )

    # def parse_start_url(self, response):
    #     request = Request(start_urls, callback=self.parse_items)

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//article[@class='news-box news-detail-box  clearfix']/h4")
        items = []
        for titles in titles:
            item = PhysurlfetchItem()
            item["title"] = titles.select("a/text()").extract()
            item["link"] = titles.select("a/@href").extract()
            items.append(item)
        return items
Currently I have parse_start_url commented out because it was failing with the method I was using to build start_urls (with the varying string). Running this currently jumps straight to page 2 of a given day without grabbing any data from page 1, and then stops (no page 2 data, no page 3).
When I ran your spider locally (using scrapy runspider yourspider.py) I got this console output:
2014-01-10 13:30:19+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/> (referer: None)
2014-01-10 13:30:19+0100 [PhysURLCrawlSpider] DEBUG: Filtered offsite request to 'phys.org': <GET http://phys.org/archive/5-12-2013/page2.html>
2014-01-10 13:30:19+0100 [PhysURLCrawlSpider] INFO: Closing spider (finished)
You can see Scrapy is filtering an offsite request. In fact allowed_domains should be a list of domains, so if you change it to allowed_domains = ["phys.org"], you get further:
2014-01-10 13:32:00+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/> (referer: None)
2014-01-10 13:32:00+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page2.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:00+0100 [PhysURLCrawlSpider] DEBUG: Filtered duplicate request: <GET http://phys.org/archive/5-12-2013/page3.html> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page8.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page6.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Redirecting (301) to <GET http://phys.org/archive/5-12-2013/> from <GET http://phys.org/archive/5-12-2013/page1.html>
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page4.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page7.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page5.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page3.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/> (referer: http://phys.org/archive/5-12-2013/page2.html)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] INFO: Closing spider (finished)
But the spider is not picking up any items. It may or may not be a typo, but your XPath expression for titles should probably be //article[@class='news-box news-detail-box clearfix']/h4, i.e. without the extra whitespace before clearfix.
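A side note (my addition, not part of the original answer): an @class comparison matches the literal attribute string, which is why the stray double space breaks the expression. A more whitespace-tolerant sketch, written as a drop-in line inside parse_items using the question's hxs selector, tests each class token with contains(), at the cost of possible substring over-matching, which is one reason the CSS form below reads better:

# tolerant of extra whitespace or reordered classes in the class attribute
titles = hxs.select(
    "//article[contains(@class, 'news-box') and "
    "contains(@class, 'news-detail-box') and "
    "contains(@class, 'clearfix')]/h4"
)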
As a final note, if you use the latest Scrapy version (from version 0.20.0 onwards), you'll be able to use CSS selectors, which can be more readable than XPath when selecting elements with multiple classes:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from physurlfetch.items import PhysurlfetchItem
from scrapy.http import Request

class PhysURLSpider(CrawlSpider):
    date = raw_input("Please input a date in the format M-DD-YYYY: ")
    name = "PhysURLCrawlSpider"
    allowed_domains = ["phys.org"]
    start_url_str = ("http://phys.org/archive/%s/") % (date)
    start_urls = [start_url_str]

    rules = (
        Rule(SgmlLinkExtractor(allow=("\d\.html",)),
             callback="parse_items", follow=True),
    )

    # def parse_start_url(self, response):
    #     request = Request(start_urls, callback=self.parse_items)

    def parse_items(self, response):
        selector = Selector(response)
        # selecting only using "news-detail-box" class
        # you could use "article.news-box.news-detail-box.clearfix > h4"
        titles = selector.css("article.news-detail-box > h4")
        items = []
        for titles in titles:
            item = PhysurlfetchItem()
            item["title"] = titles.xpath("a/text()").extract()
            item["link"] = titles.xpath("a/@href").extract()
            items.append(item)
        self.log("%d items in %s" % (len(items), response.url))
        return items
I am trying to parse all URLs containing "133199" on my site.
Unfortunately, my code only parses one URL within the whole site; there should be well over 20k URLs.
The following code correctly crawls the whole website and somehow parses the first URL containing 133199, but not the rest.
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from wallspider.items import Website

class mydomainSpider(CrawlSpider):
    name = "activewear"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com/",]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), deny=('/[1-9]$', '(bti=)[1-9]+(?:\.[1-9]*)?', '(sort_by=)[a-zA-Z]', '(sort_by=)[1-9]+(?:\.[1-9]*)?', '(ic=32_)[1-9]+(?:\.[1-9]*)?', '(ic=60_)[0-9]+(?:\.[0-9]*)?', '(search_sort=)[1-9]+(?:\.[1-9]*)?', 'browse-ng.do\?', '/page/', '/ip/', 'out\+value', 'fn=', 'customer_rating', 'special_offers', 'search_sort=&', ))),
        Rule(SgmlLinkExtractor(allow=('133199', ),),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []

        for site in sites:
            item = Website()
            item['referer'] = response.request.headers.get('Referer')
            item['url'] = response.url
            item['title'] = site.xpath('/html/head/title/text()').extract()
            item['description'] = site.select('//meta[@name="Description"]/@content').extract()
            item['canonical'] = site.xpath('//head/link[@rel="canonical"]/@href').extract()
            item['response'] = response.status
            items.append(item)

        return items
This is the console log of the only URL that gets parsed. The website has a couple million pages, so I cannot display the whole log.
Scraped from <200 http://www.mydomain.com/browse/apparel/5438?_refineresult=true&facet=special_offers%3AClearance&ic=32_0&path=0%3A5438&povid=cat133199-env200983-moduleC052312-lLinkSubnav1Clearance>
{'canonical': [u'http://www.mydomain.com/browse/apparel/5438/'],
'description': [u"Shop for Apparel - mydomain.com. Buy products such as Disney Girls' Minnie Mouse 2 Piece Pajama Coat Set at mydomain and save."],
'referer': 'http://www.mydomain.com/cp/133199',
'response': 200,
'title': [u'\nApparel - mydomain.com\n'],
'url': 'http://www.mydomain.com/browse/apparel/5438?_refineresult=true&facet=special_offers%3AClearance&ic=32_0&path=0%3A5438&povid=cat133199-env200983-moduleC052312-lLinkSubnav1Clearance'}
2013-12-20 09:45:54-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/cp/Cats/202073?povid=P1171-C1110.2784+1455.2776+1115.2956-L440> (referer: http://www.mydomain.com/)
2013-12-20 09:45:54-0800 [activewear] DEBUG: Redirecting (301) to <GET http://www.mydomain.com/browse/pets/birds/5440_228734/?amp;ic=48_0&ref=243033.244996&catNavId=5440&povid=P1171-C1110.2784+1455.2776+1115.2956-L439> from <GET http://www.mydomain.com/browse/Birds/_/N-591gZaq90Zaqce/Ne-57ix?amp%3Bic=48_0&%3Bref=243033.244996&%3Btab_All=&catNavId=5440&povid=P1171-C1110.2784+1455.2776+1115.2956-L439>
2013-12-20 09:45:54-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/team-sports/soccer/4125_4161_432196?povid=P1171-C1110.2784+1455.2776+1115.2956-L277> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/sports-outdoors/golf/4125_4152?povid=P1171-C1110.2784+1455.2776+1115.2956-L276> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/team-sports/football/4125_4161_434036?povid=P1171-C1110.2784+1455.2776+1115.2956-L275> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/cp/1164750?povid=P1171-C1110.2784+1455.2776+1115.2956-L362> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/gifts-registry/specialty-gift-cards/1094765_96894_972339?povid=P1171-C1110.2784+1455.2776+1115.2956-L361> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/cp/pet-supplies/5440?povid=P1171-C1110.2784+1455.2776+1115.2956-L438> (referer: http://www.mydomain.com/)
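One likely explanation, offered as a guess rather than something the log proves: CrawlSpider evaluates its rules in order and will not hand a link to a later rule once an earlier rule's extractor has claimed it for the same page. The first, callback-less rule here allows everything its deny list does not exclude, so most 133199 links are followed by it and never reach parse_items; only links the first rule denies (such as the clearance URL above, which matches the special_offers deny pattern) fall through to the second rule and get parsed. A sketch of the rules with the parsing rule moved first, assuming the rest of the spider stays unchanged:

rules = (
    # put the parsing rule first so 133199 links are extracted by it
    # before the catch-all rule can claim them
    Rule(SgmlLinkExtractor(allow=('133199',)),
         callback="parse_items", follow=True),
    # catch-all rule for crawling the rest of the site, deny list unchanged
    Rule(SgmlLinkExtractor(allow=(),
                           deny=('/[1-9]$', '(bti=)[1-9]+(?:\.[1-9]*)?',
                                 '(sort_by=)[a-zA-Z]', '(sort_by=)[1-9]+(?:\.[1-9]*)?',
                                 '(ic=32_)[1-9]+(?:\.[1-9]*)?', '(ic=60_)[0-9]+(?:\.[0-9]*)?',
                                 '(search_sort=)[1-9]+(?:\.[1-9]*)?', 'browse-ng.do\?',
                                 '/page/', '/ip/', 'out\+value', 'fn=',
                                 'customer_rating', 'special_offers', 'search_sort=&',))),
)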