I'm trying to crawl a website (I have their authorization), and my code returns what I want in scrapy shell, but I get nothing in my spider.
I also checked all the previous questions similar to this one without any success; for example, the website doesn't use JavaScript on the home page to load the elements I need.
import scrapy


class MySpider(scrapy.Spider):
    name = 'MySpider'
    start_urls = [
        # WRONG URL, SHOULD BE https://shop.app4health.it/ (PROBLEM SOLVED!)
        'https://www.app4health.it/',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        print('PRE RISULTATI')
        # This works in scrapy shell, but returns an empty list in the spider
        results = response.selector.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()
        # results = response.css('#nav li a::attr(href)').extract()
        print(results)
        print('NEXT PAGE')
        for next_page in results:
            # dont_filter=True so the duplicate filter does not drop requests already seen
            yield scrapy.Request(url=next_page, callback=self.prodotti, dont_filter=True)

    def prodotti(self, response):
        self.logger.info('A REEEESPONSEEEEEE from %s just arrived!', response.url)
The website I'm trying to crawl is https://shop.app4health.it/
The XPath query that I use is this one:
response.selector.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()
I know there are some problems with the prodotti function etc., but that's not the point. I would like to understand why the XPath selector works in scrapy shell (I get exactly the links that I need), but when I run it in my spider, I always get an empty list.
If it can help: when I use CSS selectors in my spider, it works fine and finds the elements, but I would like to use XPath (I need it for the future development of my application).
Thanks for the help :)
EDIT:
I tried to print the body of the first response (from start_urls) and it's correct, I get the page I want. The selectors in my code (even the ones that have been suggested) all work fine in the shell, but I get nothing in my code!
EDIT 2
I have become more experienced with Scrapy and web crawling, and I realised that sometimes the HTML page you get in your browser is different from the one you get with the Scrapy request! In my experience some websites respond with different HTML compared to the one you see in your browser. That's why, if you use the "correct" XPath/CSS query taken from the browser, it can return nothing when used in your Scrapy code.
Always check if the body of your response is what you were expecting!
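For example, a quick way to compare what Scrapy actually received with what your browser shows is the shell; view(response) opens the downloaded body in your browser:
$ scrapy shell "https://shop.app4health.it/"
In [1]: view(response)
In [2]: response.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()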
SOLVED:
The XPath is correct. I had written the wrong start_urls!
As an alternative to Desperado's answer, you can use CSS selectors, which are much simpler but more than enough for your use case:
$ scrapy shell "https://shop.app4health.it/"
In [1]: response.css('.level0 .level-top::attr(href)').extract()
Out[1]:
['https://shop.app4health.it/sonno',
'https://shop.app4health.it/monitoraggio-e-diagnostica',
'https://shop.app4health.it/terapia',
'https://shop.app4health.it/integratori-alimentari',
'https://shop.app4health.it/fitness',
'https://shop.app4health.it/benessere',
'https://shop.app4health.it/ausili',
'https://shop.app4health.it/prodotti-in-offerta',
'https://shop.app4health.it/kit-regalo']
The scrapy shell command is perfect for debugging issues like this.
//nav[@id="mmenu"]//ul/li[contains(@class,"level0")]/a[contains(@class,"level-top")]/@href
Use this XPath. Also, check the page's 'view-source' before writing an XPath.
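For completeness, a minimal sketch of how that XPath could be wired into the spider, assuming the corrected https://shop.app4health.it/ start URL (the spider name and parse_category callback are just placeholders):

import scrapy


class ShopSpider(scrapy.Spider):
    name = 'shop_categories'  # hypothetical name
    start_urls = ['https://shop.app4health.it/']

    def parse(self, response):
        # Category links extracted with the XPath suggested above
        links = response.xpath(
            '//nav[@id="mmenu"]//ul/li[contains(@class,"level0")]'
            '/a[contains(@class,"level-top")]/@href'
        ).extract()
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_category)

    def parse_category(self, response):
        self.logger.info('Category page: %s', response.url)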
Related
I am scraping blog posts and encountered a weird issue. When extracting an entire element instead of only its text, Scrapy returns the selected element plus every element/closing tag that comes after it in the webpage. For example, I have this code:
import scrapy


class postscraperSpider(scrapy.Spider):
    name = 'postscraper'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/blog-post/']

    def parse(self, response):
        yield {
            'title': response.css('.title_container > h1.entry-title::text').get(),
            'content': response.css('div.text_1 .text_inner h2').get(),
        }
When run, title is populated with the proper text. However, content is populated with the correct element, followed by every element and closing tag that comes after it in the page.
If I extract only the text, it populates fine, like so:
def parse(self, response):
    yield {
        'title': response.css('.title_container > h1.entry-title::text').get(),
        'content': response.css('div.text_1 .text_inner h2::text').get(),
    }
The reason I cannot just extract the text is that it won't be only h2s that I'm extracting from text_inner. I will need to extract all children, including their tags. What I really need is code that looks like this, but I felt the above better illustrates my issue:
def parse(self, response):
    yield {
        'title': response.css('.title_container > h1.entry-title::text').get(),
        'content': response.css('div.text_1 .text_inner > *').get(),
    }
Thank you for any help that you can offer.
Related: No text printed when using response.xpath() or response.css in scrapy
Also related:
Python: Scrapy returning all html following element instead of just html of element
It looks like it's an environment bug. I'm going to try reinstalling Anaconda.
Maybe you can try using the .extract_first() method instead of .get(). It is hard to tell whether your CSS selector is correct because of the placeholder website in the array. Try searching for the CSS selector you used in Chrome and see if it returns all the closing tags and elements.
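A minimal sketch of that substitution, reusing the selectors from the question (.extract_first() is the older spelling of .get(), so the behaviour should be equivalent):

def parse(self, response):
    yield {
        # Same selectors as in the question, with .extract_first() instead of .get()
        'title': response.css('.title_container > h1.entry-title::text').extract_first(),
        'content': response.css('div.text_1 .text_inner h2').extract_first(),
    }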
Reinstalling Python + Anaconda fixed this issue for me. I'm not sure what happened. I did have both Python 3.8 and 3.9 installed, so it may have been a conflict between those.
In the picture you can see that the operator has some bad characters in its name. These render fine in Chrome, but in Scrapy, when I run even response.text in the shell, I get
scrapy.exceptions.NotSupported: Response content isn't text
When I check other jobs where the operator doesn't have this text, I can run the script fine and grab all the data.
I am sure it's due to the Unicode characters, but I am not sure how to tell Scrapy to ignore them and treat the rest as text so that I can scrape anything at all.
Below is just a skeleton of my code:
import scrapy


class PrintSpider(scrapy.Spider):
    name = "printer_data"
    start_urls = [
        'http://192.168.4.107/jobid-15547'
    ]

    def parse(self, response):
        job_dict = {}
        url_split = response.request.url.split('/')
        job_dict['job_id'] = url_split[len(url_split) - 1].split('-', 1)[1]
        job_dict['job_name'] = response.xpath("/html/body/fieldset/big[1]/text()").extract_first().split(': ', 1)[1]  # this breaks here.
Update with other things I have already tried
I have worked on this for a while in the scrapy shell. response.text raises the exception I mentioned earlier; the same check is also inside response.xpath.
I have looked at the code a little but cannot find how response.text works. I feel like I need to fix these characters in the response somehow so that Scrapy will see it as text and can process the HTML, instead of ignoring the entire page so that I cannot access anything.
I would also love a way to save the response to a file without opening it in Chrome and saving it manually, so that I can work with the original document for testing.
It could be, but that's not necessarily the cause. Try the following approach to see what your crawler sees:
from scrapy.utils.response import open_in_browser

def parse(self, response):
    open_in_browser(response)
This will open the page in the browser; make sure you are not doing this in a loop, otherwise your browser will get stuck.
Secondly, try to fetch the whole HTML element first instead of just the text node, and see if that works. Instead of:
response.xpath("/html/body/fieldset/big[1]/text()").extract_first()
modify to:
response.xpath("/html/body/fieldset/big[1]")[0].extract()
If the second approach fixes the issue, then use bs4 or lxml to convert the HTML to text.
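If it does, a minimal sketch of that conversion with lxml (the extract_text helper is just an illustration) could look like this:

from lxml import html

def extract_text(fragment):
    # fragment is the raw HTML string returned by
    # response.xpath("/html/body/fieldset/big[1]")[0].extract()
    element = html.fromstring(fragment)
    # text_content() strips the tags and keeps only the text nodes
    return element.text_content()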
Furthermore, if this is a public link, let us know the link along with the complete log for a better understanding of the issue.
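As for saving the raw response to a file for offline testing (the side question above), a minimal sketch, with an arbitrary filename, could be:

def parse(self, response):
    # Dump the raw bytes so the original document can be inspected offline
    with open('response_dump.html', 'wb') as f:
        f.write(response.body)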
I want to scrape speakers' names from this link:
https://websummit.com/speakers
The names are in a div tag with class="speaker__content__inner".
I made a spider in Scrapy whose code is below:
import scrapy


class Id01Spider(scrapy.Spider):
    name = 'ID01'
    allowed_domains = ['websummit.com']
    start_urls = ['https://websummit.com/speakers']

    def parse(self, response):
        names = response.xpath('//div[@class = "speaker__content__inner"]/text()').extract()
        for speaker_details in names:
            yield {'Speaker_Details': speaker_details.strip()}
When I run this spider, it runs but returns nothing.
Log file:
https://pastebin.com/JEfL2GBu
P.S.: This is my first question on Stack Overflow, so please correct my mistakes if I made any while asking.
If you check the source HTML (using Ctrl+U) you'll find that there is no speaker info inside the HTML. This content is loaded dynamically using JavaScript.
You need to call https://api.cilabs.com/conferences/ws19/lists/speakers?per_page=25 and parse the JSON.
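A minimal sketch of that approach; the JSON keys used here ('items', 'name') are assumptions, so inspect the real payload and adjust them:

import json
import scrapy


class SpeakersApiSpider(scrapy.Spider):
    name = 'speakers_api'
    start_urls = ['https://api.cilabs.com/conferences/ws19/lists/speakers?per_page=25']

    def parse(self, response):
        data = json.loads(response.text)
        # 'items' and 'name' are assumed keys; check the actual response first
        for speaker in data.get('items', []):
            yield {'Speaker_Details': speaker.get('name')}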
I am trying to build a spider to scrape some data from the website Techcrunch - Heartbleed search.
My thought was to pass a tag when executing the spider from the command line (example: Heartbleed). The spider should then search through all the associated search results, open each link and get the data contained within.
import scrapy


class TechcrunchSpider(scrapy.Spider):
    name = "tech_search"

    def start_requests(self):
        url = 'https://techcrunch.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + '?s=' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        pass
This code can be executed with: scrapy crawl tech_search -s DOWNLOAD_DELAY=1.5 -o tech_search.jl -a tag=EXAMPLEINPUT
Getting the data from the individual pages is not the problem; actually getting the URLs to them (from the search page linked above) is.
The thing is, when looking at the source HTML file (Ctrl+U) of the search site (link above), I can't find anything about the searched elements (example: "What Is Heartbleed? The Video"). Any suggestions on how to obtain these elements?
I suggest that you define your Scrapy class along the lines shown in this answer, but using the PhantomJS Selenium headless browser. The essential problem is that when Scrapy downloads those pages, the HTML (DOM) you see is built by JavaScript code and cannot be reached via the route you have chosen.
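Below is a rough sketch of that idea; the result-link selector and the driver choice are assumptions (PhantomJS is deprecated in newer Selenium releases, so a headless Chrome or Firefox driver can be swapped in):

import scrapy
from selenium import webdriver


class TechcrunchSeleniumSpider(scrapy.Spider):
    name = "tech_search_js"

    def start_requests(self):
        url = 'https://techcrunch.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + '?s=' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # Render the page with a real browser so the JavaScript-built DOM is available
        driver = webdriver.PhantomJS()  # or a headless Chrome/Firefox driver
        driver.get(response.url)
        rendered = scrapy.Selector(text=driver.page_source)
        driver.quit()
        # '.post-block a::attr(href)' is a guess at the result-link selector;
        # inspect the rendered DOM to find the real one
        for link in rendered.css('.post-block a::attr(href)').extract():
            yield response.follow(link, callback=self.parse_article)

    def parse_article(self, response):
        self.logger.info('Article page: %s', response.url)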
I have a crawler that takes in URLs and then follows the next-page links for each URL in the start URLs, and it's working:
rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="pagnNext"]',)), callback="parse_start_url", follow=True),)
However, as you can imagine, I start getting captchas at some point for some URLs. I've heard that there might be honeypots: links that are not visible to a human but are in the HTML code, designed to make you click and identify yourself as a bot.
I want the extractor to extract the link conditionally, for example not to extract (and click) if the CSS style display:none exists, or something like that.
Is this doable?
I would do something like this:
def parse_page1(self, response):
    if response.css("thing i want to check exists"):
        next_page = response.xpath('//a[@class="pagnNext"]/@href').extract_first()
        if next_page:
            return scrapy.Request(response.urljoin(next_page),
                                  callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
official docs:
https://doc.scrapy.org/en/latest/topics/request-response.html
Note: as for your captcha issue, try adjusting your settings. At least make sure your DOWNLOAD_DELAY is set to something other than 0. Check out the other options: https://doc.scrapy.org/en/latest/topics/settings.html
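If you also want the rule itself to skip honeypot links, a sketch along these lines could work, assuming the honeypot anchors are hidden with an inline display:none style (the actual markup may differ):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NoHoneypotSpider(CrawlSpider):
    # Hypothetical spider; name and start_urls would come from your existing crawler
    name = 'no_honeypot'

    rules = (
        Rule(
            LinkExtractor(
                allow=(),
                # Skip next-page anchors hidden with an inline display:none style
                restrict_xpaths=(
                    '//a[@class="pagnNext"][not(contains(@style, "display:none"))]',
                ),
            ),
            callback='parse_start_url',
            follow=True,
        ),
    )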