crawl spider doesn't proceed to next page - python

I'm scraping all product details on http://www.ulta.com/makeup-eyes-eyebrows?N=26yi. My rules are copied below. I only get data from the first page and the spider doesn't proceed to the next page.
rules = (Rule(LinkExtractor(
    restrict_xpaths='//*[@id="canada"]/div[4]/div[2]/div[3]/div[3]/div[2]/ul/li[3]/a',),
    callback='parse',
    follow=True),)
Can anyone help me with this?

Use CrawlSpider: it will automatically crawl to the other pages. With a plain Spider you have to pass the other links manually, so use
class Scrapy1Spider(CrawlSpider):
instead of
class Scrapy1Spider(scrapy.Spider):
See: Scrapy crawl with next page
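A minimal sketch of what the switched spider could look like (the pagination XPath, item selectors and spider name here are assumptions, not taken from the question; note that CrawlSpider reserves the name parse for its own machinery, so the rule's callback needs a different name):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class Scrapy1Spider(CrawlSpider):
    name = 'ulta'
    allowed_domains = ['ulta.com']
    start_urls = ['http://www.ulta.com/makeup-eyes-eyebrows?N=26yi']

    rules = (
        Rule(
            # assumed: match the "next page" link by class rather than a brittle positional path
            LinkExtractor(restrict_xpaths='//li[contains(@class, "next")]/a'),
            callback='parse_page',  # must not be called 'parse' in a CrawlSpider
            follow=True,
        ),
    )

    def parse_start_url(self, response):
        # pages from start_urls are not matched by the rule, so parse them here too
        return self.parse_page(response)

    def parse_page(self, response):
        # placeholder extraction; the real selectors depend on the page markup
        for product in response.css('.product'):
            yield {
                'name': product.css('::attr(title)').extract_first(),
                'url': response.url,
            }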

Related

I want to select a div tag with a specific class, but my spider returns nothing when I run it

I want to scrape speakers' names from this link:
https://websummit.com/speakers
The name is in a div tag with class="speaker__content__inner".
I made a spider in Scrapy whose code is below:
import scrapy

class Id01Spider(scrapy.Spider):
    name = 'ID01'
    allowed_domains = ['websummit.com']
    start_urls = ['https://websummit.com/speakers']

    def parse(self, response):
        name = response.xpath('//div[@class = "speaker__content__inner"]/text()').extract()
        for Speaker_Details in zip(name):
            yield {'Speaker_Details': Speaker_Details.strip()}
        pass
When I run this spider it runs and returns nothing.
Log file:
https://pastebin.com/JEfL2GBu
P.S: This is my first question on stackoverflow, so please correct my mistakes if I made any while asking.
If you check the source HTML (using Ctrl+U) you'll find that there is no speaker info inside the HTML. This content is loaded dynamically using JavaScript.
You need to call https://api.cilabs.com/conferences/ws19/lists/speakers?per_page=25 and parse the JSON.
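A minimal sketch of that approach (the JSON keys 'items' and 'name' are assumptions; inspect the actual API response in your browser's network tab to confirm the structure and how pagination works):
import json
import scrapy

class Id01Spider(scrapy.Spider):
    name = 'ID01'
    allowed_domains = ['cilabs.com']
    start_urls = ['https://api.cilabs.com/conferences/ws19/lists/speakers?per_page=25']

    def parse(self, response):
        data = json.loads(response.text)
        # 'items' and 'name' are assumed keys, adjust them to the real payload
        for speaker in data.get('items', []):
            yield {'Speaker_Details': speaker.get('name')}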

Crawl data from next page doesn't change URL

I'm trying to change pages on this site to crawl information, but the URL doesn't change when I click the next page.
My code so far:
[...]
paging = response.css('span id.next::attr(href)').extract()
if paging:
    yield scrapy.Request(paging, callback=self.parse_links)
I don't know how to crawl a site like this. Please help me, thank you.
Look at the network request made when the next page loads.
You can try this request for the next page: http://vsd.vn/ModuleArticles/ArticlesList/NextPageHDNVTCPH?pCurrentPage=2
It returns the next page's data.
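A minimal sketch of paging through that endpoint by incrementing pCurrentPage (the spider name, the stop condition and the link selector are assumptions; inspect the actual response to see how the article links are marked up):
import scrapy

class VsdSpider(scrapy.Spider):
    name = 'vsd_next_page'
    page_url = ('http://vsd.vn/ModuleArticles/ArticlesList/'
                'NextPageHDNVTCPH?pCurrentPage={}')

    def start_requests(self):
        yield scrapy.Request(self.page_url.format(1), meta={'page': 1})

    def parse(self, response):
        links = response.css('a::attr(href)').extract()  # placeholder selector
        for href in links:
            yield response.follow(href, callback=self.parse_links)
        if links:  # keep paging until an empty page comes back
            next_page = response.meta['page'] + 1
            yield scrapy.Request(self.page_url.format(next_page),
                                 callback=self.parse,
                                 meta={'page': next_page})

    def parse_links(self, response):
        yield {'url': response.url,
               'title': response.css('title::text').extract_first()}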

Scrapy - Xpath works in shell but not in code

I'm trying to crawl a website (with their authorization), and my code returns what I want in the Scrapy shell, but I get nothing in my spider.
I have also checked all previous questions similar to this one without success; for example, the website doesn't use JavaScript on the home page to load the elements I need.
import scrapy

class MySpider(scrapy.Spider):
    name = 'MySpider'
    start_urls = [  # WRONG URL, SHOULD BE https://shop.app4health.it/ PROBLEM SOLVED!
        'https://www.app4health.it/',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        print('PRE RISULTATI')
        risultati = response.selector.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()
        # risultati = response.css('li a>href').extract()
        # This works on scrapy shell, not in code
        # risultati = response.xpath('//*[@id="nav"]/ol/li[1]/a').extract()
        print(risultati)
        # for pagineitems in risultati:
        #     next_page = pagineitems
        print('NEXT PAGE')
        # Ignores the request cause already done. Insert dont_filter
        yield scrapy.Request(url=risultati, callback=self.prodotti, dont_filter=True)

    def prodotti(self, response):
        self.logger.info('A REEEESPONSEEEEEE from %s just arrived!', response.url)
        return 1
The website I'm trying to crawl is https://shop.app4health.it/
The XPath command that I use is this one:
response.selector.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()
I know there are some problems with the prodotti function etc., but that's not the point. I would like to understand why the XPath selector works in the Scrapy shell (I get exactly the links I need), but when I run it in my spider, I always get an empty list.
If it helps, when I use CSS selectors in my spider, it works fine and finds the elements, but I would like to use XPath (I will need it in the future development of my application).
Thanks for the help :)
EDIT:
I tried to print the body of the first response (from start_urls) and it's correct; I get the page I want. When I use selectors in my code (even the ones that have been suggested) they all work fine in the shell, but I get nothing in my code!
EDIT 2:
I have become more experienced with Scrapy and web crawling, and I have realised that sometimes the HTML page you get in your browser can be different from the one you get with the Scrapy request! In my experience some websites respond with different HTML compared to the one you see in your browser. That's why the "correct" xpath/css query taken from the browser can return nothing when used in your Scrapy code.
Always check whether the body of your response is what you were expecting!
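For example, one quick way to see exactly what Scrapy downloaded (rather than what your browser renders) is the shell's built-in view() helper:
$ scrapy shell "https://shop.app4health.it/"
In [1]: view(response)   # opens the HTML exactly as Scrapy received it; no JavaScript is executed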
SOLVED:
The path is correct. I wrote the wrong start_urls!
As an alternative to Desperado's answer, you can use CSS selectors, which are much simpler but more than enough for your use case:
$ scrapy shell "https://shop.app4health.it/"
In [1]: response.css('.level0 .level-top::attr(href)').extract()
Out[1]:
['https://shop.app4health.it/sonno',
'https://shop.app4health.it/monitoraggio-e-diagnostica',
'https://shop.app4health.it/terapia',
'https://shop.app4health.it/integratori-alimentari',
'https://shop.app4health.it/fitness',
'https://shop.app4health.it/benessere',
'https://shop.app4health.it/ausili',
'https://shop.app4health.it/prodotti-in-offerta',
'https://shop.app4health.it/kit-regalo']
The scrapy shell command is perfect for debugging issues like this.
//nav[@id="mmenu"]//ul/li[contains(@class,"level0")]/a[contains(@class,"level-top")]/@href
Use this XPath; also check the page's view-source before writing your XPath.

Python/Scrapy scraping from Techcrunch

I am trying to build a spider to scrape some data from the website Techcrunch - Heartbleed search.
My thought was to pass a tag when executing the spider from the command line (for example: Heartbleed). The spider should then search through all the associated search results, open each link and get the data contained within.
import scrapy

class TechcrunchSpider(scrapy.Spider):
    name = "tech_search"

    def start_requests(self):
        url = 'https://techcrunch.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + '?s=' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        pass
This code can be executed with: scrapy crawl tech_search -s DOWNLOAD_DELAY=1.5 -o tech_search.jl -a tag=EXAMPLEINPUT
Getting the data from the individual pages is not the problem, but actually getting the URLs to them is (from the search page linked above):
The thing is, when looking at the source HTML file (Ctrl+U) of the search site (link above), I can't find anything about the searched elements (example: "What Is Heartbleed? The Video"). Any suggestions on how to obtain these elements?
I suggest that you define your Scrapy class along the lines shown in this answer, but using the PhantomJS Selenium headless browser. The essential problem is that those pages use JavaScript code to build the HTML (DOM) that you see in the browser, so when Scrapy downloads them the elements cannot be accessed via the route you have chosen.
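A rough sketch of that idea (assumptions: the PhantomJS binary is installed and on PATH, and the '.post-block__title a' selector for result links is a guess; treat it as a starting point rather than a working scraper for the current TechCrunch markup):
import scrapy
from scrapy.selector import Selector
from selenium import webdriver

class TechcrunchSpider(scrapy.Spider):
    name = "tech_search"

    def __init__(self, tag=None, *args, **kwargs):
        super(TechcrunchSpider, self).__init__(*args, **kwargs)
        self.tag = tag
        # PhantomJS has been dropped from recent Selenium releases; a headless
        # Chrome/Firefox driver can be swapped in the same way
        self.driver = webdriver.PhantomJS()

    def start_requests(self):
        url = 'https://techcrunch.com/'
        if self.tag is not None:
            url = url + '?s=' + self.tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # let the browser execute the JavaScript that builds the result list,
        # then feed the rendered HTML back into Scrapy's selectors
        self.driver.get(response.url)
        rendered = Selector(text=self.driver.page_source)
        # assumed selector for the links to individual results
        for href in rendered.css('.post-block__title a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_article)

    def parse_article(self, response):
        yield {'url': response.url,
               'title': response.css('title::text').extract_first()}

    def closed(self, reason):
        self.driver.quit()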

scrapy avoid crawler logging out

I am using the scrapy library to facilitate crawling a website.
The website uses authentication and I can successfully login to the page using scrapy.
The page has a URL which will log out the user and destroy the session.
How do I ensure scrapy avoids the logout page when crawling?
If you are using Link Extractors and simply don't want to follow this particular "logout" link, you can set the deny property:
rules = [Rule(SgmlLinkExtractor(deny=[r'logout/']), follow=True),]
Another option is to check for response.url inside your spider's parse method:
def parse(self, response):
    if 'logout' in response.url:
        return
    # extract items
Hope that helps.
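Putting both ideas together, a minimal sketch might look like this (the spider name and start URL are placeholders, and note that recent Scrapy releases use LinkExtractor from scrapy.linkextractors rather than SgmlLinkExtractor):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class AuthedSpider(CrawlSpider):
    name = 'authed'  # placeholder
    start_urls = ['https://example.com/']  # placeholder

    rules = (
        Rule(LinkExtractor(deny=[r'logout/']),  # never schedule logout links
             callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        if 'logout' in response.url:
            return  # second safety net in case a logout URL still slips through
        yield {'url': response.url}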
