import scrapy
class Pttscrapper2Spider(scrapy.Spider):
    name = 'PTTscrapper2'
    allowed_domains = ['https://www.ptt.cc']
    start_urls = ['https://www.ptt.cc/bbs/HatePolitics/index.html/']
    handle_httpstatus_list = [400, 302]
    def parse(self, response):
        urls = response.css('div.r-ent > div.title > a::attr(href)').extract()
        for thread_url in urls:
            url = response.urljoin(thread_url)
            yield scrapy.Request(url=url, callback=self.parse_details)
        next_page_url = response.css('a.wide:nth-child(2)::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
    def parse_details(self, response):
        yield {
            'title' : response.xpath('//head/title/text()').extract(),
            'stance' : response.xpath('//*[@id="main-content"]/div[@class="push"]/span[1]/text()').extract(),
            'userid' : response.xpath('//*[@id="main-content"]/div[@class="push"]/span[2]/text()').extract(),
            'comment' : response.xpath('//*[@id="main-content"]/div[@class="push"]/span[3]/text()').extract(),
            'time_of_post' : response.xpath('//*[@id="main-content"]/div[@class="push"]/span[4]/text()').extract(),
        }
I've been using the spider above to crawl a website, but when I run it, I get these messages:
> 2017-10-05 23:14:27 [scrapy.core.engine] INFO: Spider opened
> 2017-10-05 23:14:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2017-10-05 23:14:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
> 2017-10-05 23:14:28 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <302 https://www.ptt.cc/bbs/HatePolitics/index.html/>
> Set-Cookie: __cfduid=d3ca57dcab04acfaf256438a57c547e4a1507216462; expires=Fri, 05-Oct-18 15:14:22 GMT; path=/; domain=.ptt.cc; HttpOnly
>
> 2017-10-05 23:14:28 [scrapy.core.engine] DEBUG: Crawled (302) <GET https://www.ptt.cc/bbs/HatePolitics/index.html/> (referer: None)
> 2017-10-05 23:14:28 [scrapy.core.engine] INFO: Closing spider (finished)
My guess is that the spider can't reach the sub-forums linked from the index. I've checked that the selectors point to the correct elements and that response.urljoin builds the correct absolute URLs, yet the spider never requests any of the linked pages. It would be great if someone could tell me why the spider is unable to follow the links!
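There are two issues with your scraper. First, the URL in start_urls has a trailing slash after index.html, which is wrong: the server answers it with a 302 (as your log shows), and because 302 is listed in handle_httpstatus_list, that redirect response is handed to parse instead of being followed. Second, allowed_domains takes domain names, not URLs.
Change the start of the spider to the following and it will work: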
class Pttscrapper2Spider(scrapy.Spider):
    name = 'PTTscrapper2'
    allowed_domains = ['www.ptt.cc']
    start_urls = ['https://www.ptt.cc/bbs/HatePolitics/index.html']
Logs from the run
2017-10-06 13:16:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ptt.cc/bbs/HatePolitics/M.1507268600.A.57C.html> (referer: https://www.ptt.cc/bbs/HatePolitics/index.html)
2017-10-06 13:16:15 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.ptt.cc/bbs/HatePolitics/M.1507268600.A.57C.html>
{'title': ['[黑特] 先刪文,洪慈庸和高潞那個到底撤案了沒? - 看板 HatePolitics - 批踢踢實業坊'], 'stance': ['推 ', '→ ', '噓 ', '→ ', '→ '], 'userid': ['ABA0525', 'gerund', 'AGODFATHER', 'laman45', 'victoryman'], 'comment': [': 垃圾不分藍綠黃', ': 垃圾靠弟傭 中華民國內最沒資格當立委的爛貨', ': 說什麼東西你個板啊', ': 有確定再說', ': 看起來應該是撤了'], 'time_of_post': ['10/06 13:43\n', '10/06 13:50\n', '10/06 13:57\n', '10/06 13:59\n', ' 10/06 15:27\n']}
2017-10-06 13:16:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.ptt.cc/bbs/HatePolitics/M.1507275599.A.657.html> (referer: https://www.ptt.cc/bbs/HatePolitics/index.html)
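To illustrate the allowed_domains point, here is a minimal sketch that assumes a simplified model of how a request's hostname is compared with the allowed domains. The is_allowed helper is hypothetical and is not Scrapy's own code; only the hostnames are taken from the spider above.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Simplified stand-in for an offsite check (an assumption, not Scrapy code):
    accept a URL when its hostname equals an allowed domain or is a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


thread = "https://www.ptt.cc/bbs/HatePolitics/M.1507268600.A.57C.html"

# A full URL never equals a bare hostname, so every request looks offsite.
print(is_allowed(thread, ["https://www.ptt.cc"]))  # False
# A plain domain name matches the hostname.
print(is_allowed(thread, ["www.ptt.cc"]))          # True
# A parent domain also covers its subdomains.
print(is_allowed(thread, ["ptt.cc"]))              # True
```

Under this model, an entry such as 'https://www.ptt.cc' can never match, which is consistent with the thread requests only going through once the entry is a bare domain name.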
Related
I am trying to perform horizontal crawling: start from the first page and move on until the last page is reached. The code is as follows:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from scrapy_project.items import metacriticItem
import datetime
# xpaths
main_xpath = '//div[@class = "title_bump"]//td[@class = "clamp-summary-wrap"]'
mt_url_xpath = './a/@href'
class MovieUrlSpider(CrawlSpider):
    name = 'movie_url'
    allowed_domains = ['web']
    start_urls = (
        'https://www.metacritic.com/browse/movies/score/metascore/all',
    )
    # rules for horizontal crawling
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@rel="next"]'),
             callback='parse_item', follow=True),
    )
    def parse_item(self, response):
        # list of items we want
        main = response.xpath(main_xpath)
        for i in main:
            # create the loader using the response
            l = ItemLoader(item = metacriticItem(), selector = i)
            # key
            l.add_xpath('mt_url', mt_url_xpath)
            # housekeeping fields
            l.add_value('url', response.url)
            l.add_value('spider', self.name)
            l.add_value('date', datetime.datetime.now().strftime('%d/%m/%Y'))
            yield l.load_item()
This script did not parse any items and returned the following messages:
2021-05-12 02:22:24 [scrapy.core.engine] INFO: Spider opened
2021-05-12 02:22:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-05-12 02:22:24 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-05-12 02:22:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/robots.txt> (referer: None)
2021-05-12 02:22:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.metacritic.com/browse/movies/score/metascore/all> (referer: None)
2021-05-12 02:22:24 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.metacritic.com': <GET https://www.metacritic.com/browse/movies/score/metascore/all?page=1>
2021-05-12 02:22:24 [scrapy.core.engine] INFO: Closing spider (finished)
I am trying to scrape a site made up of div elements. For each div element I want to scrape some data, then follow its child links and scrape more data from them.
Here is the code of quote.py:
import scrapy
from ..items import QuotesItem
class QuoteSpider(scrapy.Spider):
    name = 'quote'
    baseurl='http://quotes.toscrape.com'
    start_urls = [baseurl]
    def parse(self, response):
        all_div_quotes=response.css('.quote')
        for quote in all_div_quotes:
            item=QuotesItem()
            title = quote.css('.text::text').extract()
            author = quote.css('.author::text').extract()
            tags = quote.css('.tag::text').extract()
            author_details_url=self.baseurl+quote.css('.author+ a::attr(href)').extract_first()
            item['title']=title
            item['author']=author
            item['tags']=tags
            request = scrapy.Request(author_details_url,
                                     callback=self.author_born,
                                     meta={'item':item,'next_url':author_details_url})
            yield request
    def author_born(self, response):
        item=response.meta['item']
        next_url = response.meta['next_url']
        author_born = response.css('.author-born-date::text').extract()
        item['author_born']=author_born
        yield scrapy.Request(next_url, callback=self.author_birthplace,
                             meta={'item':item})
    def author_birthplace(self,response):
        item=response.meta['item']
        author_birthplace= response.css('.author-born-location::text').extract()
        item['author_birthplace']=author_birthplace
        yield item
Here is the code of items.py:
import scrapy
class QuotesItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    author_born = scrapy.Field()
    author_birthplace = scrapy.Field()
I ran the command scrapy crawl quote -o data.json, but there was no error message and data.json was empty. I was expecting to get all the data in its corresponding field.
Can you please help me?
Take a closer look at your logs and you'll find messages like this:
DEBUG: Filtered duplicate request: <GET http://quotes.toscrape.com/author/Albert-Einstein>
Scrapy automatically filters duplicate requests so that it does not visit the same URL twice (for obvious reasons).
In your case you can add dont_filter=True to your requests, and you will then see something like this:
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Steve-Martin/> (referer: http://quotes.toscrape.com/author/Steve-Martin/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Albert-Einstein/> (referer: http://quotes.toscrape.com/author/Albert-Einstein/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Marilyn-Monroe/> (referer: http://quotes.toscrape.com/author/Marilyn-Monroe/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/J-K-Rowling/> (referer: http://quotes.toscrape.com/author/J-K-Rowling/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Eleanor-Roosevelt/> (referer: http://quotes.toscrape.com/author/Eleanor-Roosevelt/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Andre-Gide/> (referer: http://quotes.toscrape.com/author/Andre-Gide/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Thomas-A-Edison/> (referer: http://quotes.toscrape.com/author/Thomas-A-Edison/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Jane-Austen/> (referer: http://quotes.toscrape.com/author/Jane-Austen/)
This is rather strange, though, because each page ends up yielding a request to itself.
Overall, you could end up with something like this:
import scrapy
class QuoteSpider(scrapy.Spider):
    name = 'quote'
    baseurl = 'http://quotes.toscrape.com'
    start_urls = [baseurl]
    def parse(self, response):
        all_div_quotes = response.css('.quote')
        for quote in all_div_quotes:
            item = dict()
            title = quote.css('.text::text').extract()
            author = quote.css('.author::text').extract()
            tags = quote.css('.tag::text').extract()
            author_details_url = self.baseurl + quote.css('.author+ a::attr(href)').extract_first()
            item['title'] = title
            item['author'] = author
            item['tags'] = tags
            print(item)
            # dont_filter=True is needed in case we get two quotes by the same author.
            # This is not optimal, though. A better option would be to save author data in self.storage
            # and only visit an author page when it is new, otherwise take the info from the saved dict.
            request = scrapy.Request(author_details_url,
                                     callback=self.author_info,
                                     meta={'item': item},
                                     dont_filter=True)
            yield request
    def author_info(self, response):
        item = response.meta['item']
        author_born = response.css('.author-born-date::text').extract()
        author_birthplace = response.css('.author-born-location::text').extract()
        item['author_born'] = author_born
        item['author_birthplace'] = author_birthplace
        yield item
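To make the duplicate filtering concrete, here is a toy model of it. The SeenFilter class is hypothetical and far simpler than Scrapy's real filter (which, for one thing, does not compare raw URL strings); it only sketches why a second request to the same author URL is dropped unless dont_filter is set.

```python
class SeenFilter:
    """Toy duplicate filter (an assumption for illustration, not Scrapy code):
    remembers every URL it has let through."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        # dont_filter bypasses the check entirely.
        if dont_filter:
            return True
        if url in self.seen:
            return False
        self.seen.add(url)
        return True


f = SeenFilter()
url = "http://quotes.toscrape.com/author/Albert-Einstein"

first = f.should_schedule(url)                      # first visit to the author page
second = f.should_schedule(url)                     # same URL requested again
forced = f.should_schedule(url, dont_filter=True)   # same URL, filter bypassed
print(first, second, forced)  # True False True
```

In the original spider the second request is the one that leads to the callback yielding the item, so dropping it leaves the output empty.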
I’m trying to scrape different information from several pages of a website.
Up to the sixteenth page, everything works: the pages are crawled and scraped, and the information is stored in my database. After the sixteenth page, however, the spider stops scraping but continues to crawl.
I checked the website and there are more than 470 pages with information. The HTML tags are the same, so I don't understand why it stopped scraping.
Python:
def url_lister():
    url_list = []
    page_count = 1
    while page_count < 480:
        url = 'https://www.active.com/running?page=%s' %page_count
        url_list.append(url)
        page_count += 1
    return url_list
class ListeCourse_level1(scrapy.Spider):
    name = 'ListeCAP_ACTIVE'
    allowed_domains = ['www.active.com']
    start_urls = url_lister()
    def parse(self, response):
        selector = Selector(response)
        for uneCourse in response.xpath('//*[@id="lpf-tabs2-a"]/article/div/div/div/a[@itemprop="url"]'):
            loader = ItemLoader(ActiveItem(), selector=uneCourse)
            loader.add_xpath('nom_evenement', './/div[2]/div/h5[@itemprop="name"]/text()')
            loader.default_input_processor = MapCompose(string)
            loader.default_output_processor = Join()
            yield loader.load_item()
        pass
The shell:
> 2018-01-23 17:22:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.active.com/running?page=15>
> {
> 'nom_evenement': 'Enniscrone 10k run & 5k run/walk',
> }
> 2018-01-23 17:22:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=16> (referer: None)
> --------------------------------------------------
> SCRAPING DES ELEMENTS EVENTS
> --------------------------------------------------
> 2018-01-23 17:22:34 [scrapy.extensions.logstats] INFO: Crawled 17 pages (at 17 pages/min), scraped 155 items (at 155 items/min)
> 2018-01-23 17:22:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=17> (referer: None)
> --------------------------------------------------
> SCRAPING DES ELEMENTS EVENTS
> --------------------------------------------------
> 2018-01-23 17:22:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=18> (referer: None)
> --------------------------------------------------
> SCRAPING DES ELEMENTS EVENTS
> --------------------------------------------------
> 2018-01-23 17:22:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=19> (referer: None)
This is probably because only 17 pages contain the content you are looking for, while you instruct Scrapy to visit all 480 pages of the form https://www.active.com/running?page=NNN. A better approach is to check, on each page you visit, whether there is a next page, and to yield a Request for it only in that case.
So, I would refactor your code to something like this (not tested):
class ListeCourse_level1(scrapy.Spider):
    name = 'ListeCAP_ACTIVE'
    allowed_domains = ['www.active.com']
    base_url = 'https://www.active.com/running'
    start_urls = [base_url]
    def parse(self, response):
        selector = Selector(response)
        for uneCourse in response.xpath('//*[@id="lpf-tabs2-a"]/article/div/div/div/a[@itemprop="url"]'):
            loader = ItemLoader(ActiveItem(), selector=uneCourse)
            loader.add_xpath('nom_evenement', './/div[2]/div/h5[@itemprop="name"]/text()')
            loader.default_input_processor = MapCompose(string)
            loader.default_output_processor = Join()
            yield loader.load_item()
        # check for next page link
        if response.xpath('//a[contains(@class, "next-page")]'):
            next_page = response.meta.get('page_number', 1) + 1
            # base_url is a class attribute, so it must be reached through self
            next_page_url = '{}?page={}'.format(self.base_url, next_page)
            yield scrapy.Request(next_page_url, callback=self.parse, meta={'page_number': next_page})
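The control flow of that refactoring can be sketched without Scrapy. Everything in this example is made up for illustration: FAKE_SITE stands in for the website and maps a page number to its event names plus a flag saying whether a next-page link exists.

```python
# Hypothetical site: page number -> (event names, whether a next-page link exists).
FAKE_SITE = {
    1: (["Race A", "Race B"], True),
    2: (["Race C"], True),
    3: (["Race D"], False),
}


def crawl(site, first_page=1):
    """Yield (page, item) pairs, moving on only while a next-page link exists."""
    page = first_page
    while page in site:
        items, has_next = site[page]
        for item in items:
            yield page, item
        if not has_next:
            break  # no next-page link: stop instead of guessing more page numbers
        page += 1


results = list(crawl(FAKE_SITE))
print(results)
```

The loop visits only as many pages as the site actually links to, rather than a fixed count chosen in advance.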
I searched Stack Overflow and other Q&A sites for similar issues, but I could not find a proper answer to my problem.
I have written the following spider to crawl nautilusconcept.com. The site's category structure is so bad that I had to apply a rule that parses every link with a callback, and I decide which URLs should actually be parsed with an if statement inside the parse_item method. However, the spider ignores my deny rules and still tries to crawl links that contain (?brw....).
Here is my spider:
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from vitrinbot.items import ProductItem
from vitrinbot.base import utils
import hashlib
removeCurrency = utils.removeCurrency
getCurrency = utils.getCurrency
class NautilusSpider(CrawlSpider):
    name = 'nautilus'
    allowed_domains = ['nautilusconcept.com']
    start_urls = ['http://www.nautilusconcept.com/']
    xml_filename = 'nautilus-%d.xml'
    xpaths = {
        'category' :'//tr[@class="KategoriYazdirTabloTr"]//a/text()',
        'title':'//h1[@class="UrunBilgisiUrunAdi"]/text()',
        'price':'//hemenalfiyat/text()',
        'images':'//td[@class="UrunBilgisiUrunResimSlaytTd"]//div/a/@href',
        'description':'//td[@class="UrunBilgisiUrunBilgiIcerikTd"]//*/text()',
        'currency':'//*[@id="UrunBilgisiUrunFiyatiDiv"]/text()',
        'check_page':'//div[@class="ayrinti"]'
    }
    rules = (
        Rule(
            LinkExtractor(allow=('com/[\w_]+',),
                          deny=('asp$',
                                'login\.asp'
                                'hakkimizda\.asp',
                                'musteri_hizmetleri\.asp',
                                'iletisim_formu\.asp',
                                'yardim\.asp',
                                'sepet\.asp',
                                'catinfo\.asp\?brw',
                                ),
                          ),
            callback='parse_item',
            follow=True
        ),
    )
    def parse_item(self, response):
        i = ProductItem()
        sl = Selector(response=response)
        if not sl.xpath(self.xpaths['check_page']):
            return i
        i['id'] = hashlib.md5(response.url.encode('utf-8')).hexdigest()
        i['url'] = response.url
        i['category'] = " > ".join(sl.xpath(self.xpaths['category']).extract()[1:-1])
        i['title'] = sl.xpath(self.xpaths['title']).extract()[0].strip()
        i['special_price'] = i['price'] = sl.xpath(self.xpaths['price']).extract()[0].strip().replace(',','.')
        images = []
        for img in sl.xpath(self.xpaths['images']).extract():
            images.append("http://www.nautilusconcept.com/"+img)
        i['images'] = images
        i['description'] = (" ".join(sl.xpath(self.xpaths['description']).extract())).strip()
        i['brand'] = "Nautilus"
        i['expire_timestamp']=i['sizes']=i['colors'] = ''
        i['currency'] = sl.xpath(self.xpaths['currency']).extract()[0].strip()
        return i
Here is a piece of the Scrapy log:
2014-07-22 17:39:31+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=-1&order=&src=&typ=> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=&offset=&order=&src=&stock=1)
2014-07-22 17:39:31+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=&offset=&order=&src=&stock=1)
2014-07-22 17:39:32+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=name&src=&stock=1&typ=> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=0&cid=64&direction=&kactane=100&mrk=1&offset=&offset=&order=&src=&stock=1)
2014-07-22 17:39:32+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=2&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:32+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=&chkBeden=&chkMarka=&chkRenk=&cid=64&direction=2&kactane=100&mrk=1&offset=-1&order=name&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:33+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=0&chkBeden=&chkMarka=&chkRenk=&cid=64&cmp=&direction=1&grp=&kactane=100&model=&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:33+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=1&chkBeden=&chkMarka=&chkRenk=&cid=64&cmp=&direction=1&grp=&kactane=100&model=&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=prc&src=&stock=1&typ=7)
2014-07-22 17:39:33+0300 [nautilus] DEBUG: Crawled (200) <GET http://www.nautilusconcept.com/catinfo.asp?brw=1&cid=64&direction=1&kactane=100&mrk=1&offset=-1&order=name&src=&typ=7> (referer: http://www.nautilusconcept.com/catinfo.asp?brw=1&chkBeden=&chkMarka=&chkRenk=&cid=64&cmp=&direction=1&grp=&kactane=100&model=&mrk=1&offset=-1&order=name&src=&stock=1&typ=7)
The spider also crawls the proper pages, but it must not try to crawl links that contain (catinfo.asp?brw...).
I'm using Scrapy==0.24.2 and Python 2.7.6.
It's a canonicalizing "issue". By default, LinkExtractor returns canonicalized URLs, but regexes from deny and allow are applied before canonicalization.
I suggest you use these rules:
rules = (
    Rule(
        LinkExtractor(allow=('com/[\w_]+',),
                      deny=('asp$',
                            'login\.asp',
                            'hakkimizda\.asp',
                            'musteri_hizmetleri\.asp',
                            'iletisim_formu\.asp',
                            'yardim\.asp',
                            'sepet\.asp',
                            'catinfo\.asp\?.*brw',
                            ),
                      ),
        callback='parse_item',
        follow=True
    ),
)
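The effect can be reproduced with the standard library alone. In this sketch, sort_query is a hypothetical helper that imitates just one canonicalization step (sorting the query arguments), and raw is a made-up href in which brw is not the first argument; neither comes from Scrapy or from the site itself.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sort_query(url):
    """Rough imitation of one canonicalization step: sort the query arguments."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


# Hypothetical href as it might appear in the page source.
raw = "http://www.nautilusconcept.com/catinfo.asp?offset=-1&cid=64&brw=0"

old_deny = re.compile(r"catinfo\.asp\?brw")
new_deny = re.compile(r"catinfo\.asp\?.*brw")

print(sort_query(raw))             # brw moves to the front of the query string
print(bool(old_deny.search(raw)))  # False: the raw link slips past the old rule
print(bool(new_deny.search(raw)))  # True: the relaxed rule catches it
```

This would explain why the logged URLs all start with catinfo.asp?brw even though the old deny pattern never matched them: the log shows the sorted form, while the pattern was tested against the unsorted one.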
I am trying to parse all URLs containing "133199" within my site.
Unfortunately, my code only parses one URL in the whole site. There should be well over 20k URLs.
The following code correctly crawls the whole website and somehow parses the first URL containing 133199, but not the rest.
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from wallspider.items import Website
class mydomainSpider(CrawlSpider):
    name = "activewear"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com/",]
    rules = (
        Rule(SgmlLinkExtractor(allow=(),deny=('/[1-9]$', '(bti=)[1-9]+(?:\.[1-9]*)?', '(sort_by=)[a-zA-Z]', '(sort_by=)[1-9]+(?:\.[1-9]*)?', '(ic=32_)[1-9]+(?:\.[1-9]*)?', '(ic=60_)[0-9]+(?:\.[0-9]*)?', '(search_sort=)[1-9]+(?:\.[1-9]*)?', 'browse-ng.do\?', '/page/', '/ip/', 'out\+value', 'fn=', 'customer_rating', 'special_offers', 'search_sort=&', ))),
        Rule (SgmlLinkExtractor(allow=('133199', ),)
        , callback="parse_items", follow= True),
    )
    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []
        for site in sites:
            item = Website()
            item['referer'] = response.request.headers.get('Referer')
            item['url'] = response.url
            item['title'] = site.xpath('/html/head/title/text()').extract()
            item['description'] = site.select('//meta[@name="Description"]/@content').extract()
            item['canonical'] = site.xpath('//head/link[@rel="canonical"]/@href').extract()
            item['response'] = response.status
            items.append(item)
        return items
This is my console log of the only URL that gets parsed. The website is a couple million pages so I cannot display the whole log.
Scraped from <200 http://www.mydomain.com/browse/apparel/5438?_refineresult=true&facet=special_offers%3AClearance&ic=32_0&path=0%3A5438&povid=cat133199-env200983-moduleC052312-lLinkSubnav1Clearance>
{'canonical': [u'http://www.mydomain.com/browse/apparel/5438/'],
'description': [u"Shop for Apparel - mydomain.com. Buy products such as Disney Girls' Minnie Mouse 2 Piece Pajama Coat Set at mydomain and save."],
'referer': 'http://www.mydomain.com/cp/133199',
'response': 200,
'title': [u'\nApparel - mydomain.com\n'],
'url': 'http://www.mydomain.com/browse/apparel/5438?_refineresult=true&facet=special_offers%3AClearance&ic=32_0&path=0%3A5438&povid=cat133199-env200983-moduleC052312-lLinkSubnav1Clearance'}
2013-12-20 09:45:54-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/cp/Cats/202073?povid=P1171-C1110.2784+1455.2776+1115.2956-L440> (referer: http://www.mydomain.com/)
2013-12-20 09:45:54-0800 [activewear] DEBUG: Redirecting (301) to <GET http://www.mydomain.com/browse/pets/birds/5440_228734/?amp;ic=48_0&ref=243033.244996&catNavId=5440&povid=P1171-C1110.2784+1455.2776+1115.2956-L439> from <GET http://www.mydomain.com/browse/Birds/_/N-591gZaq90Zaqce/Ne-57ix?amp%3Bic=48_0&%3Bref=243033.244996&%3Btab_All=&catNavId=5440&povid=P1171-C1110.2784+1455.2776+1115.2956-L439>
2013-12-20 09:45:54-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/team-sports/soccer/4125_4161_432196?povid=P1171-C1110.2784+1455.2776+1115.2956-L277> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/sports-outdoors/golf/4125_4152?povid=P1171-C1110.2784+1455.2776+1115.2956-L276> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/team-sports/football/4125_4161_434036?povid=P1171-C1110.2784+1455.2776+1115.2956-L275> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/cp/1164750?povid=P1171-C1110.2784+1455.2776+1115.2956-L362> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/browse/gifts-registry/specialty-gift-cards/1094765_96894_972339?povid=P1171-C1110.2784+1455.2776+1115.2956-L361> (referer: http://www.mydomain.com/)
2013-12-20 09:45:55-0800 [activewear] DEBUG: Crawled (200) <GET http://www.mydomain.com/cp/pet-supplies/5440?povid=P1171-C1110.2784+1455.2776+1115.2956-L438> (referer: http://www.mydomain.com/)