Problems getting next page when scraping with scrapy - python

I have a Scrapy spider that doesn't follow the pagination links, and I'm stuck.
The source of the page is:
https://www.levenhuk.bg/katalog/teleskopi/?page=1
My code is:
import scrapy
class TelescopesSpider(scrapy.Spider):
    name = 'telescopes'
    allowed_domains = ['https://www.levenhuk.bg/']
    start_urls = ['https://www.levenhuk.bg/katalog/teleskopi/?page=1']
    download_delay = 3
    def parse(self, response):
        for product in response.xpath('//div[@class="catalog-item"]'):
            yield {
                # 'name': product.xpath('.//span[@itemprop="name" and contains(text(), "Levenhuk")]/text()').get(),
                'name': product.xpath('.//span[@itemprop="name"]/text()').get(),
                # 'price': product.xpath('.//div[@class="price"]/span/text()').get(),
                'price': product.xpath('.//span[@itemprop="price"]/text()').re_first(r'[0-9]+,[0-9]+'),
                'short_discr': product.xpath('.//div[@class="opis-item"]/p/strong/text()').get()
            }
        next_page_url = response.xpath('//*[@class="pagesCount"][1]//@href').get()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

I feel like the problem is simply that you are not specifying a callback in your pagination request. Specify your parse function as the callback and that should work. Please comment if it still doesn't work.
Edit:
In this case I feel like your logic needs an overhaul. I suggest separating the pagination and item extraction logic. Try the following:
def parse(self, response):
    # extract_item is a generator, so its items must be re-yielded here;
    # calling it without iterating would silently drop the first page's items.
    yield from self.extract_item(response)
    # getall() always returns a list (possibly empty), so no None check is needed.
    next_page_urls = response.xpath('//*[@class="pagesCount"][1]//@href').getall()
    for url in next_page_urls:
        yield scrapy.Request(response.urljoin(url), callback=self.extract_item)
def extract_item(self, response):
    for product in response.xpath('//div[@class="catalog-item"]'):
        yield {
            # 'name': product.xpath('.//span[@itemprop="name" and contains(text(), "Levenhuk")]/text()').get(),
            'name': product.xpath('.//span[@itemprop="name"]/text()').get(),
            # 'price': product.xpath('.//div[@class="price"]/span/text()').get(),
            'price': product.xpath('.//span[@itemprop="price"]/text()').re_first(r'[0-9]+,[0-9]+'),
            'short_discr': product.xpath('.//div[@class="opis-item"]/p/strong/text()').get()
        }
So now the parse function handles pagination and the extract_item function extracts the items for every page.
Also modify allowed_domains, as specified by Pasindu.

Change this:
allowed_domains = ['https://www.levenhuk.bg/']
to:
allowed_domains = ['levenhuk.bg']
You also need to change:
next_page_url = response.xpath('//*[@class="pagesCount"][1]//@href').get()
This only works on the first page. On pages 2, 3, 4 and so on, it extracts a link back to the first page.
Also add a callback, as mentioned by UzairAhmed.
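As a minimal sketch of the allowed_domains fix above: the setting expects bare domain names, not URLs. The helper below is hypothetical (it is not part of Scrapy) and only shows one way to reduce a URL to such a name with the standard library. Dropping the leading "www." is an assumption made here to match the value suggested in this answer.

```python
from urllib.parse import urlparse


def to_allowed_domain(url_or_domain):
    """Return a bare host name suitable for an allowed_domains entry.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    # urlparse only recognises the host when a scheme or a leading '//' is
    # present, so add '//' for inputs such as 'levenhuk.bg/katalog'.
    candidate = url_or_domain if "//" in url_or_domain else "//" + url_or_domain
    host = urlparse(candidate).hostname or ""
    # Assumption: drop a leading 'www.' so that subdomains are covered too.
    return host[4:] if host.startswith("www.") else host


print(to_allowed_domain("https://www.levenhuk.bg/"))
```

The result can be placed directly in the spider, e.g. allowed_domains = [to_allowed_domain(start_urls[0])], although writing the domain by hand is just as good.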

This is a little tricky, since the standard practice is to keep following the "next page" button in a loop until there isn't one.
Since this site has no next page button, here is an example that works out the total page count instead. This method sends a duplicate request to page 1, so it is not ideal.
import scrapy
class TelescopesSpider(scrapy.Spider):
    name = 'telescopes'
    allowed_domains = ['levenhuk.bg']
    start_urls = ['https://www.levenhuk.bg/katalog/teleskopi/?page=1']
    download_delay = 3
    def parse(self, response):
        # The last pagination link is assumed to hold the number of the final page.
        total_pages = int(response.css('.pagesCount a::text')[-1].get())
        # range() excludes its upper bound, so add 1 to include the last page.
        for i in range(1, total_pages + 1):
            url = 'https://www.levenhuk.bg/katalog/teleskopi/?page={}'.format(i)
            # dont_filter=True lets the duplicate request for page 1 through.
            yield scrapy.Request(url, callback=self.parse_item, dont_filter=True)
    def parse_item(self, response):
        for product in response.xpath('//div[@class="catalog-item"]'):
            yield {
                'name': product.xpath('.//span[@itemprop="name"]/text()').get(),
                'price': product.xpath('.//span[@itemprop="price"]/text()').re_first(r'[0-9]+,[0-9]+'),
                'short_discr': product.xpath('.//div[@class="opis-item"]/p/strong/text()').get()
            }
Another method would be to check by hand how many pages there are and override your start_requests method as follows:
class TelescopesSpider(scrapy.Spider):
    name = 'telescopes'
    allowed_domains = ['levenhuk.bg']
    start_urls = ['https://www.levenhuk.bg/katalog/teleskopi/?page={}']
    download_delay = 3
    def start_requests(self):
        # The page count is hard-coded: range(1, 14) requests pages 1 to 13.
        for i in range(1, 14):
            yield scrapy.Request(self.start_urls[0].format(i), callback=self.parse)
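Both approaches in this answer come down to generating one URL per page number. A small sketch of that step in plain Python, assuming pages are numbered from 1 up to and including the total (page_urls is a hypothetical helper name, and the page count of 3 is only for the demonstration):

```python
def page_urls(template, total_pages):
    """Build one URL per page, numbered 1..total_pages inclusive."""
    # range() excludes its upper bound, hence the + 1.
    return [template.format(page) for page in range(1, total_pages + 1)]


urls = page_urls('https://www.levenhuk.bg/katalog/teleskopi/?page={}', 3)
for url in urls:
    print(url)
```

Inside a spider, each of these URLs would be wrapped in a scrapy.Request with the item-parsing method as its callback.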

Related

Python Scrapy, How to get second image on the page with scrapy?

I want to extract exactly one image from each page that Scrapy visits. For example, I want to extract http://eshop.erhanteknik.com.tr/photo/foto_w720_604e44853371a920a52b0a31a3548b8b.jpg from the page http://eshop.erhanteknik.com.tr/tos_svitavy/tos_svitavy/uc_ayakli_aynalar_t0803?DS7641935, which Scrapy visits first. With this code I currently get all the images using .getall(), but I cannot figure out how to get one specific image.
from scrapy import Spider
from scrapy.http import Request
class BooksSpider(Spider):
    name = 'books'
    allowed_domains = ['eshop.erhanteknik.com.tr']
    start_urls = ['http://eshop.erhanteknik.com.tr/urunlerimiz?categoryId=1']
    def parse(self, response):
        books = response.xpath('//h3/a/@href').extract()
        for book in books:
            absolute_url = response.urljoin(book)
            yield Request(absolute_url, callback=self.parse_book)
        # process next page
        next_page_url = response.xpath('//a[@rel="next"]/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield Request(absolute_next_page_url)
    def parse_book(self, response):
        title = response.css('h1::text').extract_first()
        image_url = response.xpath('//img/@src').getall()
        yield {
            'title': title,
            'image_url': image_url,
        }
You need to target the src of the images under the slide class.
image_url = response.css('.slide img::attr(src)').extract_first()
extract_first() will grab the first item of the list.
If you use extract(), you will get a list.
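If no distinguishing class is available, another option is to pick an item by position from the list that extract() or getall() returns. A minimal sketch, where nth_or_none is a hypothetical helper and the image paths are made-up stand-ins for a real result:

```python
def nth_or_none(values, index):
    """Return values[index], or None when the list is too short."""
    return values[index] if 0 <= index < len(values) else None


# Made-up values standing in for response.xpath('//img/@src').getall()
image_urls = ["/photo/logo.png", "/photo/product.jpg"]

second_image = nth_or_none(image_urls, 1)  # index 1 is the second item
missing_image = nth_or_none(image_urls, 5)  # out of range, so None
print(second_image, missing_image)
```

The bounds check matters because pages with fewer images would otherwise raise an IndexError and abort the callback.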

TypeError('Request url must be str or unicode, got %s:' % type

I am trying to log in to IMDb and scrape some data.
Here is my code
import scrapy
from scrapy.http import FormRequest
class lisTopSpider(scrapy.Spider):
    name = 'imdbLog'
    allowed_domains = ['imdb.com']
    start_urls = [
'https://www.imdb.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https://www.imdb.com/registration/ap-signin-handler/imdb_us&openid.identity=http://specs.openid.net/auth/2.0/identifier_select&openid.assoc_handle=imdb_us&openid.mode=checkid_setup&siteState=eyJvcGVuaWQuYXNzb2NfaGFuZGxlIjoiaW1kYl91cyIsInJlZGlyZWN0VG8iOiJodHRwczovL3d3dy5pbWRiLmNvbS8_cmVmXz1sb2dpbiJ9&openid.claimed_id=http://specs.openid.net/auth/2.0/identifier_select&openid.ns=http://specs.openid.net/auth/2.0&tag=imdbtag_reg-20'
    ]
    def parse(self, response):
        token = response.xpath('//form/input[@name="appActionToken"]/@value').get()
        appAction = response.xpath('//form/input[@name="appAction"]/@value').get()
        siteState = response.xpath('//form/input[@name="siteState"]/@value').get()
        openid = response.xpath('//form/input[@name="openid.return_to"]/@value').get()
        prevRID = response.xpath('//form/input[@name="prevRID"]/@value').get()
        workflowState = response.xpath('//form/input[@name="workflowState"]/@value').get()
        create = response.xpath('//input[@name="create"]/@value').get()
        metadata1 = response.xpath('//input[@name="metadata1"]/@value').get()
        base_url = 'https://www.imdb.com/lists/tt0120852'
        if 'login' in response.url:
            return scrapy.Request(base_url, callback = self.listParse)
        else:
            return scrapy.Request(response,cookies=[{
                'appActionToken':token,
                'appAction':appAction,
                'siteState':siteState,
                'openid.return_to':openid,
                'prevRID':prevRID,
                'workflowState':workflowState,
                'email':'....@gmail.com',
                'create':create,
                'passwrod':'....',
                'metadata1':metadata1,
            }], callback=self.parse)
    def listParse(self, response):
        listsLinks = response.xpath('//div[2]/strong')
        for link in listsLinks:
            list_url = response.urljoin(link.xpath('.//a/@href').get())
            yield scrapy.Request(list_url, callback=self.parse_list, meta={'list_url': list_url})
        next_page_url = response.xpath('//a[@class="flat-button next-page "]/@href').get()
        if next_page_url is not None:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(next_page_url, callback=self.listParse)
    #Link of each list
    def parse_list(self, response):
        list_url = response.meta['list_url']
        myRatings = response.xpath('//div[@class="ipl-rating-star small"]/span[2]/text()').getall()
        yield{
            'list': list_url,
            'ratings': myRatings,
        }
At first I was getting an error along the lines of "no Form object found", so I removed FormRequest and used Request instead.
Now I am getting the error "TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)".
I am sure this code is far from working yet, but I need to fix this error first, and I don't understand why it is happening.
PowerShell points to this line:
}], callback=self.parse)
The problem is this part:
return scrapy.Request(response,cookies=[{
    'appActionToken':token,
    'appAction':appAction,
    'siteState':siteState,
    'openid.return_to':openid,
    'prevRID':prevRID,
    'workflowState':workflowState,
    'email':'....@gmail.com',
    'create':create,
    'passwrod':'....',
    'metadata1':metadata1,
}], callback=self.parse)
Your first argument is a response object, whereas Scrapy expects a URL string there. If you want to make another request to the same URL, you can use return scrapy.Request(response.url, cookies=[{...}], dont_filter=True).
I highly doubt this will work, though. A FormRequest is usually the way to go when you want to log in.
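To illustrate why a form request fits better than cookies: a login form submits its hidden inputs together with the credentials as form data. The sketch below uses only the standard library to show that idea, not Scrapy's own API. The sample HTML, the helper names and the "email"/"password" field names are assumptions made for the example.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("type") == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_formdata(html, email, password):
    """Merge the hidden fields with the credentials (field names assumed)."""
    collector = HiddenInputCollector()
    collector.feed(html)
    formdata = dict(collector.fields)
    formdata["email"] = email
    formdata["password"] = password
    return formdata


# Made-up login page fragment, loosely modelled on the fields in the question.
sample_html = """
<form method="post">
  <input type="hidden" name="appActionToken" value="abc123">
  <input type="hidden" name="workflowState" value="xyz">
  <input type="email" name="email">
  <input type="password" name="password">
</form>
"""

formdata = build_formdata(sample_html, "user@example.com", "secret")
print(formdata)
```

A dictionary like this is what would be sent as the form data of the login request, in place of the cookies argument used in the question.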

Can't scrape next page contents using Scrapy

I want to scrape the contents of the following pages too, but the spider doesn't go to the next page. My code is:
import scrapy
class AggregatorSpider(scrapy.Spider):
    name = 'aggregator'
    allowed_domains = ['startech.com.bd/component/processor']
    start_urls = ['https://startech.com.bd/component/processor']
    def parse(self, response):
        processor_details = response.xpath('//*[@class="col-xs-12 col-md-4 product-layout grid"]')
        for processor in processor_details:
            name = processor.xpath('.//h4/a/text()').extract_first()
            price = processor.xpath('.//*[@class="price space-between"]/span/text()').extract_first()
            print ('\n')
            print (name)
            print (price)
            print ('\n')
        next_page_url = response.xpath('//*[@class="pagination"]/li/a/@href').extract_first()
        # absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(next_page_url)
I didn't use urljoin because next_page_url already holds the full URL. I also tried passing dont_filter=True to the request, which gives me an infinite loop through the first page. The message I'm getting in the terminal is: [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.startech.com.bd': <GET https://www.startech.com.bd/component/processor?page=2>
This is because your allowed_domains value is wrong: it must contain only domain names, not paths. Use allowed_domains = ['www.startech.com.bd'] instead (see the docs).
You can also modify your next page selector in order to avoid going to page one again:
import scrapy
class AggregatorSpider(scrapy.Spider):
    name = 'aggregator'
    allowed_domains = ['www.startech.com.bd']
    start_urls = ['https://startech.com.bd/component/processor']
    def parse(self, response):
        processor_details = response.xpath('//*[@class="col-xs-12 col-md-4 product-layout grid"]')
        for processor in processor_details:
            name = processor.xpath('.//h4/a/text()').extract_first()
            price = processor.xpath('.//*[@class="price space-between"]/span/text()').extract_first()
            yield {'name': name, 'price': price}
        next_page_url = response.css('.pagination li:last-child a::attr(href)').extract_first()
        if next_page_url:
            yield scrapy.Request(next_page_url)
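To see why the original allowed_domains value caused the "Filtered offsite request" message, here is a simplified sketch of a host check. It is an approximation written for illustration, not Scrapy's actual middleware code, and is_onsite is a hypothetical name. The idea is that a request's host must equal an allowed domain or be a subdomain of one, so an entry containing a path can never match.

```python
import re
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    if not allowed_domains:
        return True  # no restriction configured
    host = urlparse(url).hostname or ""
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.search(pattern, host) is not None


next_page = "https://www.startech.com.bd/component/processor?page=2"

# The value from the question contains a path, so no host can match it.
print(is_onsite(next_page, ["startech.com.bd/component/processor"]))
# A bare domain name matches the host or any of its subdomains.
print(is_onsite(next_page, ["www.startech.com.bd"]))
print(is_onsite(next_page, ["startech.com.bd"]))
```

Under this check, only the first call reports the request as offsite, which is consistent with the log message in the question.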

Scrapy returns repeated out of order results when using a for loop, but not when going link by link

I am attempting to use Scrapy to crawl a site. Here is my code:
import scrapy
class ArticleSpider(scrapy.Spider):
    name = "article"
    start_urls = [
        'http://www.irna.ir/en/services/161',
    ]
    def parse(self, response):
        for linknum in range(1, 15):
            next_article = response.xpath('//*[@id="NewsImageVerticalItems"]/div[%d]/div[2]/h3/a/@href' % linknum).extract_first()
            next_article = response.urljoin(next_article)
            yield scrapy.Request(next_article)
        for text in response.xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder_ContentPlaceHolder_NewsContent4_BodyLabel"]'):
            yield {
                'article': text.xpath('./text()').extract()
            }
        for tag in response.xpath('//*[@id="ctl00_ctl00_ContentPlaceHolder_ContentPlaceHolder_NewsContent4_bodytext"]'):
            yield {
                'tag1': tag.xpath('./div[3]/p[1]/a/text()').extract(),
                'tag2': tag.xpath('./div[3]/p[2]/a/text()').extract(),
                'tag3': tag.xpath('./div[3]/p[3]/a/text()').extract(),
                'tag4': tag.xpath('./div[3]/p[4]/a/text()').extract()
            }
        yield response.follow('http://www.irna.ir/en/services/161', callback=self.parse)
But this returns in the JSON a weird mixture of repeated items, out of order and often skipping links: https://pastebin.com/LVkjHrRt
However, when I set linknum to a single number, the code works fine.
Why is iterating changing my results?
As @TarunLalwani already stated, your current approach is not right. Basically you should:
1. In the parse method, extract the links to all articles on a page and yield requests for scraping them, with a callback named e.g. parse_article.
2. Still in the parse method, check whether the button for loading more articles is present and, if so, yield a request for a URL of the pattern http://www.irna.ir/en/services/161/pageN. (This pattern can be found in the browser's developer tools, under XHR requests on the Network tab.)
3. Define a parse_article method where you extract the article text and tags from the details page and finally yield them as an item.
Below is the final spider:
import scrapy
class IrnaSpider(scrapy.Spider):
    name = 'irna'
    base_url = 'http://www.irna.ir/en/services/161'
    def start_requests(self):
        yield scrapy.Request(self.base_url, meta={'page_number': 1})
    def parse(self, response):
        for article_url in response.css('.DataListContainer h3 a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(article_url), callback=self.parse_article)
        page_number = response.meta['page_number'] + 1
        if response.css('#MoreButton'):
            yield scrapy.Request('{}/page{}'.format(self.base_url, page_number),
                                 callback=self.parse, meta={'page_number': page_number})
    def parse_article(self, response):
        yield {
            'text': ' '.join(response.xpath('//p[@id="ctl00_ctl00_ContentPlaceHolder_ContentPlaceHolder_NewsContent4_BodyLabel"]/text()').extract()),
            'tags': [tag.strip() for tag in response.xpath('//div[@class="Tags"]/p/a/text()').extract() if tag.strip()]
        }

In Scrapy, how do I pass the urls generated in one class to the next class in the script?

The following is my spider's code:
import scrapy
class ProductMainPageSpider(scrapy.Spider):
    name = 'ProductMainPageSpider'
    start_urls = ['http://domain.com/main-product-page']
    def parse(self, response):
        for product in response.css('article.isotopeItem'):
            yield {
                'title': product.css('h3 a::text').extract_first().encode("utf-8"),
                'category': product.css('h6 a::text').extract_first(),
                'img': product.css('figure a img::attr("src")').extract_first(),
                'url': product.css('h3 a::attr("href")').extract_first()
            }
class ProductSecondaryPageSpider(scrapy.Spider):
    name = 'ProductSecondaryPageSpider'
    start_urls = """ URLS IN product['url'] FROM PREVIOUS CLASS """
    def parse(self, response):
        for product in response.css('article.isotopeItem'):
            yield {
                'title': product.css('h3 a::text').extract_first().encode("utf-8"),
                'thumbnail': product.css('figure a img::attr("src")').extract_first(),
                'short_description': product.css('div.summary').extract_first(),
                'description': product.css('div.description').extract_first(),
                'gallery_images': product.css('figure a img.gallery-item ::attr("src")').extract_first()
            }
The first class works correctly if I remove the second class. It generates my JSON file correctly with the requested items in it. However, the website I need to crawl has two levels. It has a product archive page that shows each product as a thumbnail, title, and category (and this info is not on the next page). If you click on one of the thumbnails or titles, you are sent to a single product page with specific info on that product.
There are a lot of products, so I would like to pipe (yield?) the URLs in product['url'] to the second class as its start_urls list. But I simply don't know how to do that. My knowledge doesn't go far enough to even know what I'm missing or what is going wrong, so I can't find a solution.
The start_urls placeholder in the second class shows what I want to do.
You don't have to create two spiders for this. You can simply go to the next URL and carry over your item, i.e.:
def parse(self, response):
    item = MyItem()
    item['name'] = response.xpath("//name/text()").extract()
    next_page_url = response.xpath("//a[@class='next']/@href").extract_first()
    yield Request(next_page_url,
                  self.parse_next,
                  meta={'item': item}  # carry over our item
                  )
def parse_next(self, response):
    # get our carried item from response meta
    item = response.meta['item']
    item['description'] = response.xpath("//description/text()").extract()
    yield item
However, if for some reason you really want to split the logic of these two steps, you can save the results in a file (JSON, for example: scrapy crawl first_spider -o results.json) and iterate through it in your second spider's start_requests() method, which would yield the URLs, i.e.:
import json
from scrapy import Request, Spider
class MySecondSpider(Spider):
    name = 'second_spider'  # hypothetical name; every spider needs one
    def start_requests(self):
        # this overrides `start_urls` logic
        with open('results.json', 'r') as f:
            data = json.load(f)
        for item in data:
            yield Request(item['url'])
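The file-reading step can be tried on its own, without running a crawl. A minimal sketch, assuming results.json holds a JSON list of items that each have a 'url' key, as the first spider in the question would produce (load_urls is a hypothetical helper, and the sample items are made up):

```python
import json
import os
import tempfile


def load_urls(path):
    """Return the non-empty 'url' values from a JSON list of items."""
    with open(path, "r", encoding="utf-8") as f:
        items = json.load(f)
    # Skip items whose url is missing or None, which would break a request.
    return [item["url"] for item in items if item.get("url")]


# Made-up items standing in for the first spider's output.
sample = [
    {"title": "A", "url": "http://domain.com/a"},
    {"title": "B", "url": None},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    urls = load_urls(path)

print(urls)
```

In the second spider, start_requests() would then yield one request per entry of this list.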
