I am trying to build a spider to scrape some data from the TechCrunch website (Heartbleed search).
My thought was to pass a tag when executing the spider from the command line (for example: Heartbleed). The spider should then search through all the associated search results, open each link and get the data contained within.
import scrapy

class TechcrunchSpider(scrapy.Spider):
    name = "tech_search"

    def start_requests(self):
        url = 'https://techcrunch.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + '?s=' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        pass
This code can be executed with: scrapy crawl tech_search -s DOWNLOAD_DELAY=1.5 -o tech_search.jl -a tag=EXAMPLEINPUT
Getting the data from the individual pages is not the problem, but actually getting the URLs to them (from the search page linked above) is:
The thing is, when looking at the source HTML file (Ctrl+U) of the search page (link above), I can't find anything about the searched elements (example: "What Is Heartbleed? The Video"). Any suggestions on how to obtain these elements?
I suggest that you define your scrapy class along the lines shown in this answer, but using the PhantomJS selenium headless browser. The essential problem is that the pages Scrapy downloads use JavaScript to build the HTML (DOM) you see in the browser, so that content cannot be accessed via the route you have chosen.
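A rough sketch of what that can look like, assuming Selenium is installed and a driver is available (the answer mentions PhantomJS, which has since been deprecated; headless Firefox or Chrome behave the same way here). The result-link selector is a placeholder, not TechCrunch's real markup:

import scrapy
from selenium import webdriver
from scrapy.http import HtmlResponse

class TechcrunchSeleniumSpider(scrapy.Spider):
    name = "tech_search_js"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()  # or headless Chrome

    def start_requests(self):
        url = 'https://techcrunch.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + '?s=' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # Let the real browser execute the JavaScript, then hand the rendered DOM back to Scrapy.
        self.driver.get(response.url)
        rendered = HtmlResponse(self.driver.current_url,
                                body=self.driver.page_source, encoding='utf-8')
        # '//a[@class="result-link"]' is a placeholder for the actual search-result links.
        for href in rendered.xpath('//a[@class="result-link"]/@href').extract():
            yield scrapy.Request(rendered.urljoin(href), callback=self.parse_article)

    def parse_article(self, response):
        pass  # extract the article data here

    def closed(self, reason):
        self.driver.quit()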
Perhaps yield in Python is remedial for some, but not for me... at least not yet.
I understand yield creates a 'generator'.
I stumbled upon yield when I decided to learn scrapy.
I wrote some code for a Spider which works as follows:
Go to the start hyperlink and extract all hyperlinks - which are not full hyperlinks, just sub-directories concatenated onto the starting hyperlink
Examine the hyperlinks and append those meeting specific criteria to the base hyperlink
Use Request to navigate to the new hyperlink and parse it to find a unique id in an element with 'onclick'
import scrapy
from scrapy import Request

class newSpider(scrapy.Spider):
    name = 'new'
    allowed_domains = ['www.alloweddomain.com']
    start_urls = ['https://www.alloweddomain.com']

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            if link == 'SpecificCriteria':
                next_link = response.urljoin(link)
                yield Request(next_link, callback=self.parse_new)
EDIT 1:

    for uid_dict in self.parse_new(response):
        print(uid_dict['uid'])
        break

End EDIT 1
Running the code here evaluates response as the HTTP response to start_urls and not to next_link.
def parse_new(self, response):
    trs = response.xpath("//*[@class='unit-directory-row']").getall()
    for tr in trs:
        if 'SpecificText' in tr:
            elements = tr.split()
            for element in elements:
                if 'onclick' in element:
                    subelement = element.split('(')[1]
                    uid = subelement.split(')')[0]
                    print(uid)
                    yield {
                        'uid': uid
                    }
            break
It works: scrapy crawls the first page, creates the new hyperlink and navigates to the next page. parse_new parses the HTML for the uid and 'yields' it. scrapy's engine shows that the correct uid is 'yielded'.
What I don't understand is how I can 'use' that uid obtained by parse_new to create and navigate to a new hyperlink, the way I would use a variable; I cannot seem to be able to return a variable with Request.
I'd check out What does the "yield" keyword do? for a good explanation of how exactly yield works.
In the meantime, spider.parse_new(response) is an iterable object. That is, you can acquire its yielded results via a for loop. E.g.,
for uid_dict in spider.parse_new(response):
    print(uid_dict['uid'])
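That said, inside a Scrapy spider you normally don't pull the uid back out of the generator yourself; you let parse_new yield a follow-up Request built from it and pass the value along with cb_kwargs. A rough sketch under those assumptions (both functions are methods of the spider above, the '/details/' URL pattern is made up, and cb_kwargs needs Scrapy 1.7+):

def parse_new(self, response):
    trs = response.xpath("//*[@class='unit-directory-row']").getall()
    for tr in trs:
        if 'SpecificText' not in tr:
            continue
        for element in tr.split():
            if 'onclick' in element:
                uid = element.split('(')[1].split(')')[0]
                # Yield a new Request built from the uid; the engine schedules it
                # and passes the uid to the next callback via cb_kwargs.
                yield scrapy.Request(
                    response.urljoin('/details/' + uid),  # hypothetical URL pattern
                    callback=self.parse_detail,
                    cb_kwargs={'uid': uid},
                )

def parse_detail(self, response, uid):
    # 'uid' arrives here as a keyword argument
    pass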
After much reading and learning I discovered the reason scrapy does not perform the callback in the first parse, and it has nothing to do with yield! It comes down to two issues:
1) robots.txt. Can be 'resolved' with ROBOTSTXT_OBEY = False in settings.py.
2) The logger shows "Filtered offsite request to ...". dont_filter=True may resolve this.
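For reference, a minimal sketch of those two tweaks (assuming the offsite/duplicate filter really is what drops the request):

# settings.py
ROBOTSTXT_OBEY = False  # 1) stop honouring robots.txt (use responsibly)

# in parse(), when yielding the follow-up request:
yield Request(next_link, callback=self.parse_new, dont_filter=True)  # 2) bypass the offsite/duplicate filter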
I want to scrape speakers' names from this link:
https://websummit.com/speakers
The name is basically in a div tag with class="speaker__content__inner".
I made a spider in scrapy whose code is below:
import scrapy

class Id01Spider(scrapy.Spider):
    name = 'ID01'
    allowed_domains = ['websummit.com']
    start_urls = ['https://websummit.com/speakers']

    def parse(self, response):
        name = response.xpath('//div[@class = "speaker__content__inner"]/text()').extract()
        for Speaker_Details in zip(name):
            yield {'Speaker_Details': Speaker_Details.strip()}
        pass
When I run this spider it runs and returns nothing.
Log file:
https://pastebin.com/JEfL2GBu
P.S: This is my first question on stackoverflow, so please correct my mistakes if I made any while asking.
If you check the source HTML (using Ctrl+U) you'll find that there is no speaker info inside the HTML. This content is loaded dynamically using Javascript.
You need to call https://api.cilabs.com/conferences/ws19/lists/speakers?per_page=25 and parse the JSON.
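A minimal sketch of that approach; the exact shape of the JSON is an assumption here (the speaker list and the 'name' field may live under different keys), so inspect the payload first:

import json
import scrapy

class SpeakersApiSpider(scrapy.Spider):
    name = 'ID01_api'
    allowed_domains = ['api.cilabs.com']
    start_urls = ['https://api.cilabs.com/conferences/ws19/lists/speakers?per_page=25']

    def parse(self, response):
        data = json.loads(response.text)
        # Assumed structure: adjust the keys after inspecting the actual payload.
        for speaker in data.get('items', []):
            yield {'Speaker_Details': speaker.get('name')}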
I'm trying to crawl a website (I got their authorization), and my code returns what I want in scrapy shell, but I get nothing in my spider.
I also checked all the previous questions similar to this one without any success, e.g., the website doesn't use javascript on the home page to load the elements I need.
import scrapy

class MySpider(scrapy.Spider):
    name = 'MySpider'

    start_urls = [  # WRONG URL, SHOULD BE https://shop.app4health.it/ PROBLEM SOLVED!
        'https://www.app4health.it/',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        print('PRE RISULTATI')
        risultati = response.selector.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()
        # risultati = response.css('li a>href').extract()
        # This works on scrapy shell, not in code
        # risultati = response.xpath('//*[@id="nav"]/ol/li[1]/a').extract()
        print(risultati)
        # for pagineitems in risultati:
        #     next_page = pagineitems
        print('NEXT PAGE')
        # Ignores the request cause already done. Insert dont_filter
        yield scrapy.Request(url=risultati, callback=self.prodotti, dont_filter=True)

    def prodotti(self, response):
        self.logger.info('A REEEESPONSEEEEEE from %s just arrived!', response.url)
        return 1
The website I'm trying to crawl is https://shop.app4health.it/
The xpath command that I use is this one:
response.selector.xpath('//*[@id="nav"]/ol/li[*]/a/@href').extract()
I know there are some problems with the prodotti function etc., but that's not the point. I would like to understand why the xpath selector works with scrapy shell (I get exactly the links that I need), but when I run it in my spider, I always get an empty list.
If it can help, when I use CSS selectors in my spider, it works fine and it finds the elements, but I would like to use xpath (I need it for the future development of my application).
Thanks for the help :)
EDIT:
I tried to print the body of the first response (from start_urls) and it's correct, I get the page I want. When I use selectors in my code (even the ones that have been suggested) they all work fine in the shell, but I get nothing in my code!
EDIT 2:
I have become more experienced with Scrapy and web crawling, and I realised that sometimes the HTML page you get in your browser might be different from the one you get with the Scrapy request! In my experience some websites respond with a different HTML compared to the one you see in your browser! That's why, if you use the "correct" xpath/css query taken from the browser, it might return nothing when used in your Scrapy code.
Always check that the body of your response is what you were expecting!
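One quick way to do that check, as a sketch inside your parse callback (open_in_browser is Scrapy's own helper; the 'id="nav"' marker is just an example of something you expect to find in the raw body):

from scrapy.utils.response import open_in_browser

def parse(self, response):
    # Dump exactly what Scrapy received into your browser for a visual check...
    open_in_browser(response)
    # ...or assert on a marker you expect to find in the raw body.
    assert 'id="nav"' in response.text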
SOLVED:
Path is correct. I wrote the wrong start_urls!
As an alternative to Desperado's answer, you can use css selectors, which are much simpler but more than enough for your use case:
$ scrapy shell "https://shop.app4health.it/"
In [1]: response.css('.level0 .level-top::attr(href)').extract()
Out[1]:
['https://shop.app4health.it/sonno',
'https://shop.app4health.it/monitoraggio-e-diagnostica',
'https://shop.app4health.it/terapia',
'https://shop.app4health.it/integratori-alimentari',
'https://shop.app4health.it/fitness',
'https://shop.app4health.it/benessere',
'https://shop.app4health.it/ausili',
'https://shop.app4health.it/prodotti-in-offerta',
'https://shop.app4health.it/kit-regalo']
The scrapy shell command is perfect for debugging issues like this.
//nav[@id="mmenu"]//ul/li[contains(@class,"level0")]/a[contains(@class,"level-top")]/@href
Use this xpath; also consider the 'view-source' of the page before creating an xpath.
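For completeness, a sketch of how that xpath would slot into the parse method of the spider posted above (everything else unchanged):

def parse(self, response):
    links = response.xpath(
        '//nav[@id="mmenu"]//ul/li[contains(@class,"level0")]'
        '/a[contains(@class,"level-top")]/@href'
    ).extract()
    for link in links:
        yield scrapy.Request(url=response.urljoin(link), callback=self.prodotti, dont_filter=True)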
I'm new to scrapy and tried to crawl a couple of sites, but wasn't able to get more than a few images from them.
For example, for http://shop.nordstrom.com/c/womens-dresses-new with the following code -
def parse(self, response):
    for dress in response.css('article.npr-product-module'):
        yield {
            'src': dress.css('img.product-photo').xpath('@src').extract_first(),
            'url': dress.css('a.product-photo-href').xpath('@href').extract_first()
        }
I got 6 products. I expect 66.
For URL https://www.renttherunway.com/products/dress with the following code -
def parse(self, response):
    for dress in response.css('div.cycle-image-0'):
        yield {
            'image-url': dress.xpath('.//img/@src').extract_first(),
        }
I got 12. I expect roughly 100.
Even when I changed it to crawl every 'next' page, I got the same number per page but it went through all pages successfully.
I have tried a different USER_AGENT, disabled COOKIES, and a DOWNLOAD_DELAY of 5.
I imagine I will run into the same problem on any site, so folks should have seen this before, but I can't find a reference to it.
What am I missing?
It's one of those weird websites where they store product data as JSON in the HTML source and unpack it with javascript on page load.
To figure this out, usually what you want to do is:
disable javascript and do scrapy view <url>
investigate the results
find the id in the product url and search for that id in the page source to check whether it exists and, if so, where it is hidden (a quick shell check for this is sketched below). If it doesn't exist, that means it's being populated by some AJAX request -> re-enable javascript, go to the page and dig through the browser inspector's network tab to find it.
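For example, a quick check in scrapy shell that the embedded JSON really is in the raw HTML (the 'ProductResults' marker is the one the regex below relies on):

$ scrapy shell "http://shop.nordstrom.com/c/womens-dresses-new"
In [1]: 'ProductResults' in response.text

If this prints True, the data is embedded and the regex approach below will work; if False, look in the network tab instead.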
If you do a regex-based search:
re.findall("ProductResults, (\{.+\})\)", response.body_as_unicode())
You'll get a huge json that contains all products and their information.
import json
import re
data = re.findall("ProductResults, (\{.+\})\)", response.body_as_unicode())
data = json.loads(data[0])['data']
print(len(data['ProductResult']['Products']))
>> 66
That gets the correct number of products!
So in your parse you can do this:
def parse(self, response):
    data = re.findall("ProductResults, (\{.+\})\)", response.body_as_unicode())
    data = json.loads(data[0])['data']
    for product in data['ProductResult']['Products']:
        # find main image
        image_url = [m['Url'] for m in product['Media'] if m['Type'] == 'MainImage']
        yield {'image_url': image_url}
I'm trying to use scrapy to crawl some advertising information from this website.
That website has some div tags with class="product-card new_ outofstock installments_ ".
When I use:
items = response.xpath("//div[contains(@class, 'product-')]")
I get some nodes with class attribute = "product-description", but not "product-card".
When I use:
items = response.xpath("//div[contains(@class, 'product-card')]")
I still get nothing in the result.
Why is that?
As pointed out in the previous answer, the content you are trying to scrape is generated dynamically using javascript. If performance is not a big deal for you, then you can use Selenium to emulate a real user and interact with the site. At the same time you can let Scrapy get the data for you.
If you want a similar example of how to do this, consider this tutorial: http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python/
The data you want is being populated by javascript.
You would have to use a selenium webdriver to extract the data.
If you want to check beforehand whether the data is being populated using javascript, open a scrapy shell and try extracting the data as below.
scrapy shell 'http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT'
>>> response.xpath('//div[contains(@class,"product-card")]')
Output:
[]
Now, if you use the same xpath in the browser's developer tools and it does return matching elements, then the data is populated using scripts and selenium would have to be used to get the data.
Here is an example to extract data using selenium:
import scrapy
from selenium import webdriver
from scrapy.http import TextResponse

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['lazada.vn']
    start_urls = ['http://www.lazada.vn/dien-thoai-may-tinh-bang/?ref=MT']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        page = TextResponse(response.url, body=self.driver.page_source, encoding='utf-8')
        required_data = page.xpath('//div[contains(@class,"product-card")]').extract()
        self.driver.close()
Here are some examples of "selenium spiders":
Executing Javascript Submit form functions using scrapy in python
Snipplr
Scrapy with selenium
Extract data from dynamic webpages