Python method is not called

I have the following methods in a Scrapy spider. parse_category yields a Request object with a callback to parse_product. Sometimes a category page redirects to a product page, so here I detect whether the category page is actually a product page. If it is, I just call the parse_product method. But for some reason it does not call the method.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    anchors = hxs.select('//div[@id="panelMfr"]/div/ul/li[position() != last()]/a')
    for anchor in anchors[2:3]:
        url = anchor.select('@href').extract().pop()
        cat = anchor.select('text()').extract().pop().strip()
        yield Request(urljoin(get_base_url(response), url), callback=self.parse_category, meta={"category": cat})

def parse_category(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = get_base_url(response)
    # check if its a redirected product page
    if hxs.select(self.product_name_xpath):
        self.log("Category-To-Product Redirection")
        self.parse_product(response)  # <<---- This line is not called.
        self.log("Product Parsed")
        return
    products_xpath = '//div[@class="productName"]/a/@href'
    products = hxs.select(products_xpath).extract()
    for url in products:
        yield Request(urljoin(base_url, url), callback=self.parse_product, meta={"category": response.meta['category']})
    next_page = hxs.select('//table[@class="nav-back"]/tr/td/span/a[contains(text(), "Next")]/text()').extract()
    if next_page:
        url = next_page[0]
        yield Request(urljoin(base_url, url), callback=self.parse_category, meta={"category": response.meta['category']})

def parse_product(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = get_base_url(response)
    self.log("Inside parse_product")
In the log I see that Category-To-Product Redirection and Product Parsed are printed, but Inside parse_product is missing. What did I do wrong here?
2013-07-12 21:31:34+0100 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/category.aspx> (referer: None)
2013-07-12 21:31:34+0100 [example.com] DEBUG: Redirecting (302) to <GET http://www.example.com/productinfo.aspx?catref=AM6901> from <GET http://www.example.com/products/Inks-Toners/Apple>
2013-07-12 21:31:35+0100 [example.com] DEBUG: Crawled (200) <GET http://www.example.com/productinfo.aspx?catref=AM6901> (referer: http://www.example.com/category.aspx)
2013-07-12 21:31:35+0100 [example.com] DEBUG: Category-To-Product Redirection
2013-07-12 21:31:35+0100 [example.com] DEBUG: Product Parsed
2013-07-12 21:31:35+0100 [example.com] INFO: Closing spider (finished)
2013-07-12 21:31:35+0100 [-] ERROR: ERROR:root:SPIDER CLOSED: No. of products: 0
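One thing worth checking (an assumption based on the code shown, since parse_product is truncated above): if the full parse_product contains yield statements, it is a generator function, so self.parse_product(response) only creates a generator object; its body never runs and nothing reaches the log or the pipeline unless its output is iterated and re-yielded. A minimal sketch of that change:

def parse_category(self, response):
    hxs = HtmlXPathSelector(response)
    if hxs.select(self.product_name_xpath):
        self.log("Category-To-Product Redirection")
        # parse_product is (presumably) a generator: iterate it and
        # re-yield its items/requests so Scrapy actually runs its body
        for result in self.parse_product(response):
            yield result
        self.log("Product Parsed")
        return
    # ... rest of parse_category unchanged ...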

Related

How can I implement custom proxy on Scrapy?

I'm trying to implement ScraperAPI as a custom proxy, but as a beginner I think I'm doing something wrong, even though I followed their documentation to set everything up. Here is a documentation link
from scrapy import Spider
from scrapy.http import Request
from .config import API
from scraper_api import ScraperAPIClient

client = ScraperAPIClient(API)

class GlassSpider(Spider):
    name = 'glass'
    allowed_domains = ['glassdoor.co.uk']
    start_urls = [client.scrapyGet(url='https://www.glassdoor.co.uk/Job/russian-jobs-SRCH_KE0,7.htm?fromAge=1')]

    def parse(self, response):
        jobs = response.xpath('//*[contains(@class, "react-job-listing")]')
        for job in jobs:
            job_url = job.xpath('.//*[contains(@class, "jobInfoItem jobTitle")]/@href').extract_first()
            absulate_job_url = response.urljoin(job_url)
            yield Request(client.scrapyGet(url=absulate_job_url),
                          callback=self.parse_jobpage,
                          meta={
                              "Job URL": absulate_job_url
                          })

    def parse_jobpage(self, response):
        absulate_job_url = response.meta.get('Job URL')
        job_description = "".join(line for line in response.xpath('//*[contains(@class, "desc")]//text()').extract())
        yield {
            "Job URL": absulate_job_url,
            "Job Description": job_description
        }
That's the output I'm receiving. What's wrong with my code? Please fix it for me so I can follow and get the point. Thank you.
2020-10-01 23:01:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.scraperapi.com/?url=https%3A%2F%2Fwww.glassdoor.co.uk%2FJob%2Frussian-jobs-SRCH_KE0%2C7.htm%3FfromAge%3D1&api_key=bec9dd9f2be095dfc6158a7e609&scraper_sdk=python> (referer: None)
2020-10-01 23:01:45 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'api.scraperapi.com': <GET https://api.scraperapi.com/?url=https%3A%2F%2Fapi.scraperapi.com%2Fpartner%2FjobListing.htm%3Fpos%3D101%26ao%3D1044074%26s%3D149%26guid%3D00000174e51ccd8988e2e5420e67cf0d%26src%3DGD_JOB_AD%26t%3DSRFJ%26vt%3Dw%26cs%3D1_94f59ee8%26cb%3D1601571704401%26jobListingId%3D3696480795&api_key=bec9d9f82b0955c615c8a7e639scraper_sdk=python>
I'm not familiar with this particular lib, but from your execution logs the issue is that your request is being filtered, since it's considered offsite.
[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'api.scraperapi.com': <GET https://api.scraperapi.com/?url=https%3A%2F%2Fapi.scraperapi.com%2Fpartner%2FjobListing.htm%3Fpos%3D101%26ao%3D1044074%26s%3D149%26guid%3D00000174e51ccd8988e2e5420e67cf0d%26src%3DGD_JOB_AD%26t%3DSRFJ%26vt%3Dw%26cs%3D1_94f59ee8%26cb%3D1601571704401%26jobListingId%3D3696480795&api_key=bec9d9f82b0955c615c8a7e639scraper_sdk=python>
Since ScraperAPI makes your request go through their domain, which is outside of what you defined in allowed_domains, the request is filtered as offsite. To avoid this issue you can remove this line entirely:
allowed_domains = ['glassdoor.co.uk']
or try including 'api.scraperapi.com' in it.
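For example, a minimal sketch of the second option (keeping the proxy host alongside the target site; adjust to whatever hosts actually appear in your requests):

class GlassSpider(Spider):
    name = 'glass'
    # include the ScraperAPI host so the offsite middleware
    # doesn't filter requests routed through it
    allowed_domains = ['glassdoor.co.uk', 'api.scraperapi.com']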

scrapy 503 Service Unavailable on starturl

I modified this spider but it gives these errors:
Gave up retrying <GET https://lib.maplelegends.com/robots.txt> (failed 3 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://lib.maplelegends.com/robots.txt> (referer: None)
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 1 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 2 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 3 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://lib.maplelegends.com/?p=etc&id=4004003> (referer: None)
2019-01-06 23:43:56 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://lib.maplelegends.com/?p=etc&id=4004003>: HTTP status code is not handled or not allowed
Crawler code:
#!/usr/bin/env python3

import scrapy
import time

start_url = 'https://lib.maplelegends.com/?p=etc&id=4004003'

class MySpider(scrapy.Spider):
    name = 'MySpider'
    start_urls = [start_url]

    def parse(self, response):
        # print('url:', response.url)
        products = response.xpath('.//div[@class="table-responsive"]/table/tbody')
        for product in products:
            item = {
                #'name': product.xpath('./tr/td/b[1]/a/text()').extract(),
                'link': product.xpath('./tr/td/b[1]/a/@href').extract(),
            }
            # url = response.urljoin(item['link'])
            # yield scrapy.Request(url=url, callback=self.parse_product, meta={'item': item})
            yield response.follow(item['link'], callback=self.parse_product, meta={'item': item})
            time.sleep(5)
        # execute with low
        yield scrapy.Request(start_url, dont_filter=True, priority=-1)

    def parse_product(self, response):
        # print('url:', response.url)
        # name = response.xpath('(//strong)[1]/text()').re(r'(\w+)')
        hp = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "image", " " ))] | //img').re(r':(\d+)')
        scrolls = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "image", " " ))] | //strong+//a//img/@title').re(r'\bScroll\b')
        for price, hp, scrolls in zip(name, hp, scrolls):
            yield {'name': name.strip(), 'hp': hp.strip(), 'scroll': scrolls.strip()}
--- it runs without a project and saves the output in output.csv ---
from scrapy.crawler import CrawlerRunner

def _run_crawler(spider_cls, settings):
    """
    spider_cls: Scrapy Spider class
    returns: Twisted Deferred
    """
    runner = CrawlerRunner(settings)
    return runner.crawl(spider_cls)  # return Deferred

def test_scrapy_crawler():
    deferred = _run_crawler(MySpider, settings)

    @deferred.addCallback
    def _success(results):
        """
        After crawler completes, this function will execute.
        Do your assertions in this function.
        """

    @deferred.addErrback
    def _error(failure):
        raise failure.value

    return deferred
Robots.txt
Your crawler is trying to check the robots.txt file, but the website doesn't have one.
To avoid this you can set the ROBOTSTXT_OBEY setting to False in your settings.py file.
By default it's False, but new Scrapy projects generated with the scrapy startproject command have ROBOTSTXT_OBEY = True set from the template.
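For example, since the spider above runs without a project, the setting can also be attached to the spider itself (a minimal sketch):

class MySpider(scrapy.Spider):
    name = 'MySpider'
    start_urls = [start_url]
    # don't fetch or obey robots.txt for this spider
    custom_settings = {'ROBOTSTXT_OBEY': False}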
503 responses
Further, the website seems to respond with 503 on every first request. The website is using some sort of bot protection:
the first request returns 503, then some JavaScript is executed to make an AJAX request that generates a __shovlshield cookie.
It seems like https://shovl.io/ DDoS protection is being used.
To solve this you need to reverse engineer how the JavaScript generates the cookie, or employ JavaScript rendering techniques/services such as Selenium or Splash.
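As a rough sketch of the Splash route (this assumes a Splash instance running at localhost:8050 and the scrapy-splash package configured as per its README; whether it defeats this particular protection is untested):

from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'MySpider'

    def start_requests(self):
        # render the page in Splash so the site's JavaScript can run
        # (and set its cookie) before the HTML comes back to the spider
        yield SplashRequest(
            'https://lib.maplelegends.com/?p=etc&id=4004003',
            callback=self.parse,
            args={'wait': 2},
        )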

Issue with scrapy spider

I am trying to get volume-weighted average prices for stocks from the moneycontrol.com website. The parse function runs without any issues, but the parse_links function is not getting called. Am I missing something here?
# -*- coding: utf-8 -*-
import scrapy

class MoneycontrolSpider(scrapy.Spider):
    name = "moneycontrol"
    allowed_domains = ["https://www.moneycontrol.com"]
    start_urls = ["https://www.moneycontrol.com/india/stockpricequote"]

    def parse(self, response):
        for link in response.css('td.last > a::attr(href)').extract():
            if link:
                yield scrapy.Request(link, callback=self.parse_links, method='GET')

    def parse_links(self, response):
        VWAP = response.xpath('//*[@id="n_vwap_val"]/text()').extract_first()
        print(VWAP)
        with open('quotes.txt', 'a+') as f:
            f.write('VWAP: {}'.format(VWAP) + '\n')
If you read the log output, the error becomes obvious.
2018-09-08 19:52:38 [py.warnings] WARNING: c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\offsite.py:59: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://www.moneycontrol.com in allowed_domains.
warnings.warn("allowed_domains accepts only domains, not URLs. Ignoring URL entry %s in allowed_domains." % domain, URLWarning)
2018-09-08 19:52:38 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-08 19:52:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.moneycontrol.com/india/stockpricequote> (referer: None)
2018-09-08 19:52:40 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.moneycontrol.com': <GET http://www.moneycontrol.com/india/stockpricequote/chemicals/aartiindustries/AI45>
So just fix your allowed_domains, and you should be fine:
allowed_domains = ["moneycontrol.com"]

Scrapy cannot download picture to local

I am using Scrapy (0.22) to crawl one site. I need to do three things:
I need the category and subcategory of the images
I need to download the images and store them locally
I need to store the category, subcategory, and image URL in Mongo
But now I am blocked: I use a pipeline to download the images, but my code does not work; it cannot download the pictures locally.
Also, since I want to store the information in Mongo, can anyone give me some suggestions on the "Mongo table structure"?
My code is as follows:
settings.py
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 1}
IMAGES_STORE = '/ttt'
items.py
from scrapy.item import Item, Field

class TutorialItem(Item):
    # define the fields for your item here like:
    # name = Field()
    catname = Field()
    caturl = Field()
    image_urls = Field()
    images = Field()
    pass
pipelines.py
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
from pprint import pprint as pp

class TutorialPipeline(object):
    # def get_media_requests(self, item, info):
    #     for image_url in item['image_urls']:
    #         yield Request(image_url)

    # def process_item(self, item, spider):
    #     print '**********************===================*******************'
    #     return item
    #     # pp(item)
    #     # pass

    def get_media_requests(self, item, info):
        # pass
        pp('**********************===================*******************')
        # yield Request(item['image_urls'])
        for image_url in item['image_urls']:
            # pass
            # print image_url
            yield Request(image_url)
spider.py
import scrapy
import os
from pprint import pprint as pp
from scrapy import log
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import Spider
from tutorial.items import TutorialItem

class BaiduSpider(scrapy.spider.Spider):
    name = 'baidu'
    start_urls = [
        # 'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/'
        'http://giphy.com/categories'
    ]
    domain = 'http://giphy.com'

    def parse(self, response):
        selector = Selector(response)
        topCategorys = selector.xpath('//div[@id="None-list"]/a')
        # pp(topCategorys)
        items = []
        for tc in topCategorys:
            item = TutorialItem()
            item['catname'] = tc.xpath('./text()').extract()[0]
            item['caturl'] = tc.xpath('./@href').extract()[0]
            if item['catname'] == u'ALL':
                continue
            reqUrl = self.domain + '/' + item['caturl']
            # pp(reqUrl)
            yield Request(url=reqUrl, meta={'caturl': reqUrl}, callback=self.getSecondCategory)

    def getSecondCategory(self, response):
        selector = Selector(response)
        # pp(response.meta['caturl'])
        # pp('*****************=================**************')
        secondCategorys = selector.xpath('//div[@class="grid_9 omega featured-category-tags"]/div/a')
        # pp(secondCategorys)
        items = []
        for sc in secondCategorys:
            item = TutorialItem()
            item['catname'] = sc.xpath('./div/h4/text()').extract()[0]
            item['caturl'] = sc.xpath('./@href').extract()[0]
            items.append(item)
            reqUrl = self.domain + item['caturl']
            # pp(items)
            # pp(item)
            # pp(reqUrl)
            yield Request(url=reqUrl, meta={'caturl': reqUrl}, callback=self.getImages)

    def getImages(self, response):
        selector = Selector(response)
        # pp(response.meta['caturl'])
        # pp('*****************=================**************')
        # images = selector.xpath('//ul[@class="gifs freeform grid_12"]/div[position()=3]')
        images = selector.xpath('//*[contains(@class, "hoverable-gif")]')
        # images = selector.xpath('//ul[@class="gifs freeform grid_12"]//div[@class="hoverable-gif"]')
        # pp(len(images))
        items = []
        for image in images:
            item = TutorialItem()
            item['image_urls'] = image.xpath('./a/figure/img/@src').extract()[0]
            # item['imgName'] = image.xpath('./a/figure/img/@alt').extract()[0]
            items.append(item)
            # pp(item)
        # pp(items)
        # pp('==============************==============')
        # pp(items)
        # items = [{'images': "hello world"}]
        return items
In addition, there are no errors in the output; it just looks like the following:
2014-12-21 13:49:56+0800 [scrapy] INFO: Enabled item pipelines: TutorialPipeline
2014-12-21 13:49:56+0800 [baidu] INFO: Spider opened
2014-12-21 13:49:56+0800 [baidu] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-21 13:49:56+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-12-21 13:49:56+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-12-21 13:50:07+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com/categories> (referer: None)
2014-12-21 13:50:08+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/science/> (referer: http://giphy.com/categories)
2014-12-21 13:50:08+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/sports/> (referer: http://giphy.com/categories)
2014-12-21 13:50:08+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/news-politics/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/transportation/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/interests/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/memes/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/tv/> (referer: http://giphy.com/categories)
2014-12-21 13:50:09+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/gaming/> (referer: http://giphy.com/categories)
2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/nature/> (referer: http://giphy.com/categories)
2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/emotions/> (referer: http://giphy.com/categories)
2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/movies/> (referer: http://giphy.com/categories)
2014-12-21 13:50:10+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/holiday/> (referer: http://giphy.com/categories)
2014-12-21 13:50:11+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/reactions/> (referer: http://giphy.com/categories)
2014-12-21 13:50:11+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/music/> (referer: http://giphy.com/categories)
2014-12-21 13:50:11+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com//categories/decades/> (referer: http://giphy.com/categories)
2014-12-21 13:50:12+0800 [baidu] DEBUG: Crawled (200) <GET http://giphy.com/search/the-colbert-report/> (referer: http://giphy.com//categories/news-politics/)
2014-12-21 13:50:12+0800 [baidu] DEBUG: Scraped from <200 http://giphy.com/search/the-colbert-report/>
{'image_urls': u'http://media1.giphy.com/media/2BDLDXFaEiuBy/200_s.gif'}
2014-12-21 13:50:12+0800 [baidu] DEBUG: Scraped from <200 http://giphy.com/search/the-colbert-report/>
{'image_urls': u'http://media2.giphy.com/media/WisjAI5QGgsrC/200_s.gif'}
2014-12-21 13:50:12+0800 [baidu] DEBUG: Scraped from <200 http://giphy.com/search/the-colbert-report/>
{'image_urls': u'http://media3.giphy.com/media/ZgDGEMihlZXCo/200_s.gif'}
.............
As far as I see it, there is no need for you to override the ImagesPipeline, because you are not modifying its behavior. But, since you are doing it, you should do it properly.
When overriding ImagesPipeline, two methods should be overridden:
get_media_requests(item, info) should return a Request for every URL in image_urls. This part you have done correctly.
item_completed(results, item, info) is called when all image requests for a single item have completed (either finished downloading, or failed for some reason). From the official documentation:
The item_completed() method must return the output that will be sent to subsequent item pipeline stages, so you must return (or drop) the item, as you would in any pipeline.
So, to make your custom images pipeline work, you need to override the item_completed() method, like this:
def item_completed(self, results, item, info):
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
        raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item
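Put together, a minimal sketch of how the whole pipeline could look (an illustration, not tested here; note that it must subclass ImagesPipeline rather than object, and that TutorialItem would need an image_paths field, or an existing field reused, for the final assignment to work):

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class TutorialPipeline(ImagesPipeline):  # subclass ImagesPipeline, not object

    def get_media_requests(self, item, info):
        # item['image_urls'] should be a list of URLs;
        # one download Request is scheduled per URL
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # keep only the paths of the images that downloaded successfully
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item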
Further on, about other issues in your code that make it not work as expected:
You are not actually creating any useful items.
If you take a look at your parse() and getSecondCategory() functions, you will notice that you are neither returning nor yielding any items. Although you prepared the items list, which you obviously wanted to use to store your items, it is never used to actually pass the items further down the processing path. At one point, you just yield a Request for the next page, and when the function finishes, your items are deleted.
You are not using the caturl info that you are passing via the meta dictionary. You pass this info in both parse() and getSecondCategory(), but you never collect it in the callback function, so it is also being ignored.
So, the only thing that is basically going to work is the images pipeline, if you fix it as I already suggested. In order to fix these issues in your code, follow the guidelines below (please keep in mind that this is not tested, it is just a guideline for your consideration):
def parse(self, response):
    selector = Selector(response)
    topCategorys = selector.xpath('//div[@id="None-list"]/a')
    for tc in topCategorys:
        # no need to create the item just yet,
        # only get the category and the url so we can
        # continue the work in our callback
        catname = tc.xpath('./text()').extract()[0]
        caturl = tc.xpath('./@href').extract()[0]
        if catname == u'ALL':
            continue
        reqUrl = self.domain + '/' + caturl
        # pass the category name in the meta so we can retrieve it
        # from the response in the callback function
        yield Request(url=reqUrl, meta={'catname': catname},
                      callback=self.getSecondCategory)

def getSecondCategory(self, response):
    selector = Selector(response)
    secondCategorys = selector.xpath('//div[@class="grid_9 omega featured-category-tags"]/div/a')
    # retrieve the category name from the response
    # meta dictionary, which was copied from our request
    catname = response.meta['catname']
    for sc in secondCategorys:
        # still no need to create the item,
        # since we are just trying to get to
        # the subcategory
        subcatname = sc.xpath('./div/h4/text()').extract()[0]
        subcaturl = sc.xpath('./@href').extract()[0]
        reqUrl = self.domain + '/' + subcaturl
        # this time pass both the category and the subcategory
        # so we can read them both in the callback function
        yield Request(url=reqUrl, meta={'catname': catname, 'subcatname': subcatname},
                      callback=self.getImages)

def getImages(self, response):
    selector = Selector(response)
    # retrieve the category and subcategory name
    catname = response.meta['catname']
    subcatname = response.meta['subcatname']
    images = selector.xpath('//*[contains(@class, "hoverable-gif")]')
    for image in images:
        # now could be a good time to create the items
        item = TutorialItem()
        # fill the item's category information. You can concatenate
        # the category and subcategory if you like, or you can
        # add another field in your TutorialItem called subcatname
        item['catname'] = catname + ":" + subcatname
        # or alternatively:
        # item['catname'] = catname
        # item['subcatname'] = subcatname
        item['image_urls'] = image.xpath('./a/figure/img/@src').extract()[0]
        # no need to store the items in a list to return
        # later, we can just yield the items as they are created
        yield item

Scrapy is not listening to deny rules [closed]

For some reason Scrapy is parsing data from URLs matching my deny rules:
I'm getting parsed data from URLs containing /browse/, /search/, and /ip/.
I'm not sure where this is going wrong.
Please advise, thanks! Please find my code below:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from wallspider.items import Website

class mydomainSpider(CrawlSpider):
    name = "tp"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com",]

    """/tp/ page type to crawl"""
    rules = (
        Rule(SgmlLinkExtractor(
            allow=('/tp/', ),
            deny=(
                'browse/',
                'browse-ng.do?',
                'search-ng.do?',
                'facet=',
                'ip/',
                'page/'
                'search/',
                '/[1-9]$',
                '(bti=)[1-9]+(?:\.[1-9]*)?',
                '(sort_by=)[a-zA-Z]',
                '(sort_by=)[1-9]+(?:\.[1-9]*)?',
                '(ic=32_)[1-9]+(?:\.[1-9]*)?',
                '(ic=60_)[0-9]+(?:\.[0-9]*)?',
                '(search_sort=)[1-9]+(?:\.[1-9]*)?',
            ),
        ), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []
        for site in sites:
            item = Website()
            item['referer'] = response.request.headers.get('Referer')
            item['url'] = response.url
            item['title'] = site.xpath('/html/head/title/text()').extract()
            item['description'] = site.select('//meta[@name="Description"]/@content').extract()
            items.append(item)
        return items
A part of my console log; it's grabbing /ip/ pages:
2013-12-11 11:21:43-0800 [tp] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/1104329> (referer: http://www.mydomain.com/tp/john-duigan)
2013-12-11 11:21:43-0800 [tp] DEBUG: Scraped from <200 http://www.mydomain.com/ip/1104329>
{'description': [u'Shop Low Prices on: Molly (Widescreen) : Movies'],
'referer': 'http://www.mydomain.com/tp/john-duigan',
'title': [u'Molly (Widescreen): Movies : mydomain.com '],
'url': 'http://www.mydomain.com/ip/1104329'}
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/jon-furmanski>
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/taylor-byrd>
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/greg-byers>
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/tom-bowker>
2013-12-11 11:21:43-0800 [tp] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/21152221> (referer: http://www.mydomain.com/tp/peter-levin)
2013-12-11 11:21:43-0800 [tp] DEBUG: Scraped from <200 http://www.mydomain.com/ip/21152221>
{'description': [u'Shop Low Prices on: Marva Collins Story (1981) : Video on Demand by VUDU'],
'referer': 'http://www.mydomain.com/tp/peter-levin',
'title': [u'Marva Collins Story (1981): Video on Demand by VUDU : mydomain.com '],
'url': 'http://www.mydomain.com/ip/21152221'}
The rules of your SgmlLinkExtractor apply when extracting links from pages. And in your case, some of your .../tp/... requests are being redirected to .../ip/... pages.
Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/tom-bowker>
allow and deny patterns do not apply to URLs after redirections.
You could disable following redirections altogether by setting REDIRECT_ENABLED to False (see RedirectMiddleware).
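For example, a minimal sketch of that setting (note that with redirects disabled, the redirected .../ip/... pages simply won't be fetched at all):

# settings.py
REDIRECT_ENABLED = False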
I found out what was wrong: the pages were redirecting to a page type that was in my deny rules. Thank you for all your help! I appreciate it!
