Perhaps yield in Python is remedial for some, but not for me... at least not yet.
I understand yield creates a 'generator'.
I stumbled upon yield when I decided to learn scrapy.
I wrote some code for a Spider which works as follows:
1) Go to the start hyperlink and extract all hyperlinks, which are not full hyperlinks, just sub-directories concatenated onto the starting hyperlink
2) Examine the hyperlinks and append those meeting specific criteria to the base hyperlink
3) Use Request to navigate to the new hyperlink and parse it to find the unique id in an element with 'onclick'
import scrapy

class newSpider(scrapy.Spider):
    name = 'new'
    allowed_domains = ['www.alloweddomain.com']
    start_urls = ['https://www.alloweddomain.com']

    def parse(self, response):
        # Collect every href on the start page
        links = response.xpath('//a/@href').extract()
        for link in links:
            if link == 'SpecificCriteria':
                next_link = response.urljoin(link)
                yield scrapy.Request(next_link, callback=self.parse_new)
EDIT 1:
                for uid_dict in self.parse_new(response):
                    print(uid_dict['uid'])
                    break
End EDIT 1
Running the code here evaluates response as the HTTP response to start_urls, not to next_link.
    def parse_new(self, response):
        trs = response.xpath("//*[@class='unit-directory-row']").getall()
        for tr in trs:
            if 'SpecificText' in tr:
                elements = tr.split()
                for element in elements:
                    if 'onclick' in element:
                        # The uid sits between the parentheses of the onclick call
                        subelement = element.split('(')[1]
                        uid = subelement.split(')')[0]
                        print(uid)
                        yield {
                            'uid': uid
                        }
                break
It works: scrapy crawls the first page, creates the new hyperlink and navigates to the next page. parse_new parses the HTML for the uid and 'yields' it, and scrapy's engine shows that the correct uid is 'yielded'.
What I don't understand is how I can 'use' the uid obtained by parse_new, like I would a variable, to create and navigate to a new hyperlink. I cannot seem to return a variable with Request.
I'd check out What does the "yield" keyword do? for a good explanation of how exactly yield works.
In the meantime, spider.parse_new(response) is an iterable object. That is, you can acquire its yielded results via a for loop. E.g.,
for uid_dict in spider.parse_new(response):
    print(uid_dict['uid'])
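To make the "iterable object" point concrete, here is a minimal sketch in plain Python with no scrapy involved. The parse_new below is only a stand-in for the spider callback, and the 'uid=' row format is made up for the demo.

```python
def parse_new(rows):
    """Stand-in for the spider callback: yield one dict per row holding a uid."""
    for row in rows:
        if 'uid=' in row:
            yield {'uid': row.split('uid=')[1]}

# Calling the function runs none of its body; it only builds a generator.
gen = parse_new(['uid=101', 'no id here', 'uid=202'])

first = next(gen)   # runs the body up to the first yield
rest = list(gen)    # resumes after that yield and drains the generator

print(first)
print(rest)
```

The body only runs as the generator is consumed, which is why a for loop (or scrapy's engine) is needed to get the yielded values out.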
After much reading and learning I discovered the reason scrapy does not perform the callback in the first parse and it has nothing to do with yield! It has a lot to do with two issues:
1) robots.txt. This can be 'resolved' with ROBOTSTXT_OBEY = False in settings.py
2) The logger reports "Filtered offsite request to". Passing dont_filter=True to the Request may resolve this.
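As for 'using' the uid: one option is to have parse_new yield another Request built from the uid (with its own callback) rather than only a dict. Below is a rough sketch of just the two pure steps, so they can be tried without running a crawl. The helper names extract_uid and build_next_url, the sample row and the /unit/<uid> URL pattern are all assumptions for illustration, not taken from the real site.

```python
from urllib.parse import urljoin

def extract_uid(row_html):
    """Return the text between '(' and ')' in the first token containing 'onclick'."""
    for token in row_html.split():
        if 'onclick' in token and '(' in token and ')' in token:
            return token.split('(')[1].split(')')[0]
    return None

def build_next_url(base_url, uid):
    # Hypothetical URL pattern; replace with whatever the target site uses.
    return urljoin(base_url, '/unit/' + uid)

# Made-up row, shaped like the rows the question describes.
row = '<tr class="unit-directory-row" onclick="showUnit(12345)"><td>SpecificText</td></tr>'
uid = extract_uid(row)
next_url = build_next_url('https://www.alloweddomain.com/list', uid)
print(uid, next_url)
```

Inside the spider, the result of build_next_url would be passed to a Request yielded from parse_new, just like next_link is in parse.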
I've been wrestling with trying to get around this 302 redirection. First of all, the point of this particular part of my scraper is to get the next page index so I can flip through pages. The direct URLs aren't available for this site, so I can't just move on to the next page or anything; in order to continue scraping the actual data using a parse_details function, I have to go through each page and simulate requests.
This is all pretty new to me, so I made sure to try anything I could find first. I have tried various settings ("REDIRECT_ENABLED":False, altering handle_httpstatus_list, etc.) but none are getting me through this. Currently I'm trying to follow the location of the redirection, but this isn't working either.
Here is an example of one of the potential solutions I've tried following.
try:
    print('Current page index: ', page_index)
except:  # Will be thrown if page_index wasn't found due to redirection.
    if response.status in (302,) and 'Location' in response.headers:
        location = to_native_str(response.headers['location'].decode('latin1'))
        yield scrapy.Request(response.urljoin(location), method='POST', callback=self.parse)
The code, without the details parsing and such, is as follows:
def parse(self, response):
    table = response.css('td> a::attr(href)').extract()
    additional_page = response.css('span.page_list::text').extract()
    for string_item in additional_page:
        # The text has some non-breaking spaces (&nbsp;) to ignore. We want
        # the text representing the current page index only.
        char_list = list(string_item)
        for char in char_list:
            if char.isdigit():
                page_index = char
                # Now that we have the current page index, we can back out
                # of this loop.
                break
    # Below is where the code breaks; it cannot find page_index since it is
    # not getting to the site for scraping after redirection.
    try:
        print('Current page index: ', page_index)
        # To get to the next page, we submit a form request since it is all
        # set up with javascript instead of simply giving a URL to follow.
        # The event target has 'dgTournament' information where the first
        # piece is always '_ctl1' and the second is '_ctl' followed by
        # the page index number we want to go to minus one (so if we want
        # to go to the 8th page, it's '_ctl7').
        # Thus we can just plug in the current page index, which is equal to
        # the next one we want to hit minus one.
        # Here is how I am making the requests; they work until the (302)
        # redirection...
        form_data = {"__EVENTTARGET": "dgTournaments:_ctl1:_ctl" + page_index,
                     "__EVENTARGUMENT": {";;AjaxControlToolkit, Version=3.5.50731.0, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-US:ec0bb675-3ec6-4135-8b02-a5c5783f45f5:de1feab2:f9cec9bc:35576c48"}}
        yield FormRequest(current_LEVEL, formdata=form_data, method="POST", callback=self.parse, priority=2)
    except NameError:
        # page_index was never assigned, e.g. after the 302 redirection
        pass
Alternatively, a solution may be to follow pagination in a different way, instead of making all of these requests?
The original link is
https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx?typeofsubmit=&action=2&keywords=&tournamentid=&sectiondistrict=&city=&state=&zip=&month=0&startdate=&enddate=&day=&year=2019&division=G16&category=28&surface=&onlineentry=&drawssheets=&usertime=&sanctioned=-1&agegroup=Y&searchradius=-1
if anyone is able to help.
You don't have to follow the 302 redirects; instead, you can send a POST request and receive the details of the page. The following code prints the data from the first 5 pages:
import requests
from bs4 import BeautifulSoup

url = 'https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx'
pages = 5
for i in range(pages):
    params = {'year': '2019', 'division': 'G16', 'month': '0', 'searchradius': '-1'}
    payload = {'__EVENTTARGET': 'dgTournaments:_ctl1:_ctl' + str(i)}
    res = requests.post(url, params=params, data=payload)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', id='ctl00_mainContent_dgTournaments')
    # pretty print the table contents
    for row in table.find_all('tr'):
        for column in row.find_all('td'):
            text = ', '.join(x.strip() for x in column.text.split('\n') if x.strip()).strip()
            print(text)
        print('-' * 10)
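A side note on the page-index step in the question: taking a single digit character stops working once the index reaches two digits. A small sketch of a regex-based alternative is below; the helper names current_page_index and event_target and the sample strings are assumptions for illustration, and which text node actually holds the current page would still need checking against the real page.

```python
import re

def current_page_index(texts):
    """Return the first run of digits found in the given strings, or None."""
    for text in texts:
        match = re.search(r'\d+', text)
        if match:
            return int(match.group())
    return None

def event_target(page_index):
    # Same '__EVENTTARGET' pattern as in the question and the answer above.
    return 'dgTournaments:_ctl1:_ctl' + str(page_index)

# Made-up text nodes, padded with non-breaking spaces like the real ones.
spans = ['\xa0\xa0', '\xa012\xa0']
index = current_page_index(spans)
print(index, event_target(index))
```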
I want to scrape speakers' name from this link:
https://websummit.com/speakers
The name is in a div tag with class="speaker__content__inner"
I made a spider in scrapy whose code is below
import scrapy

class Id01Spider(scrapy.Spider):
    name = 'ID01'
    allowed_domains = ['websummit.com']
    start_urls = ['https://websummit.com/speakers']

    def parse(self, response):
        name = response.xpath('//div[@class = "speaker__content__inner"]/text()').extract()
        # Iterate over the strings directly; zip(name) would yield 1-tuples,
        # which have no strip() method.
        for Speaker_Details in name:
            yield {'Speaker_Details': Speaker_Details.strip()}
When I run this spider it runs and returns nothing.
Log file:
https://pastebin.com/JEfL2GBu
P.S: This is my first question on stackoverflow, so please correct my mistakes if I made any while asking.
If you check the source HTML (using Ctrl+U) you'll find that there is no speaker info inside the HTML. This content is loaded dynamically using JavaScript.
You need to call https://api.cilabs.com/conferences/ws19/lists/speakers?per_page=25 and parse JSON.
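For the "parse JSON" part, a minimal sketch follows. The payload here is made up: the "data" and "name" field names are assumptions, so inspect the real API response and adjust the keys before relying on it. The helper name speaker_names is hypothetical too.

```python
import json

# Hypothetical response body; the real API's structure may differ.
sample_body = '{"data": [{"name": " Ada Example "}, {"name": "Grace Sample"}]}'

def speaker_names(body):
    """Decode a JSON body and return the stripped 'name' of each entry under 'data'."""
    payload = json.loads(body)
    return [entry['name'].strip() for entry in payload.get('data', [])]

print(speaker_names(sample_body))
```

In a spider, the string passed to speaker_names would be response.text from a request to the API URL.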
I am now scraping this website on a daily basis, and am using DeltaFetch to ignore pages which have already been visited (a lot of them).
The issue I am facing is that for this website, I need to first scrape page A, and then scrape page B to retrieve additional information about the item. DeltaFetch works well in ignoring requests to page B, but that also means that every time the scraping runs, it runs requests to page A regardless of whether it has visited it or not.
This is how my code is structured right now:
# Gathering links from a page, creating an item, and passing it to parse_A
def parse(self, response):
    for href in response.xpath(u'//a[text()="詳細を見る"]/@href').extract():
        item = ItemLoader(item=ItemClass(), response=response)
        yield scrapy.Request(response.urljoin(href),
                             callback=self.parse_A,
                             meta={'item': item.load_item()})

# Parsing elements in page A, and passing the item to parse_B
def parse_A(self, response):
    item = ItemLoader(item=response.meta['item'], response=response)
    item.replace_xpath('age', u"//td[contains(@class,\"age\")]/text()")
    page_B = response.xpath(u'//a/img[@alt="周辺環境"]/../@href').extract_first()
    yield scrapy.Request(response.urljoin(page_B),
                         callback=self.parse_B,
                         meta={'item': item.load_item()})

# Parsing elements in page B, and yielding the item
def parse_B(self, response):
    item = ItemLoader(item=response.meta['item'])
    item.add_value('url_B', response.url)
    yield item.load_item()
Any help would be appreciated to ignore the first request to page A when this page has already been visited, using DeltaFetch.
DeltaFetch only keeps record of the requests that yield items in its database, which means only those will be skipped by default.
However, you are able to customize the key used to store a record by using the deltafetch_key meta key. If you make this key the same for the requests that call parse_A() as for those created inside parse_A(), you should be able to achieve the effect you want.
Something like this should work (untested):
from scrapy.utils.request import request_fingerprint

# (...)

def parse_A(self, response):
    # (...)
    yield scrapy.Request(
        response.urljoin(page_B),
        callback=self.parse_B,
        meta={
            'item': item.load_item(),
            'deltafetch_key': request_fingerprint(response.request)
        }
    )
Note: the example above effectively replaces the filtering of requests to parse_B() urls with the filtering of requests to parse_A() urls. You might need to use a different key depending on your needs.
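To see why sharing the key has this effect, here is a toy model of the bookkeeping in plain Python. It is not DeltaFetch's code: the set stands in for its database, the dicts stand in for requests, and fingerprint() is a simplified stand-in for scrapy's request fingerprint. The URLs are made up.

```python
import hashlib

def fingerprint(url):
    # Simplified stand-in for scrapy's request fingerprint (URL hash only).
    return hashlib.sha1(url.encode('utf-8')).hexdigest()

def request_key(request):
    # A custom 'deltafetch_key' in meta wins; otherwise use the fingerprint.
    return request['meta'].get('deltafetch_key') or fingerprint(request['url'])

seen = set()  # stands in for the database of requests that yielded items

def should_skip(request):
    return request_key(request) in seen

def record_item_from(request):
    # Called when the response to this request produced an item.
    seen.add(request_key(request))

page_a = {'url': 'https://example.com/a/1', 'meta': {}}
page_b = {'url': 'https://example.com/b/1',
          'meta': {'deltafetch_key': fingerprint(page_a['url'])}}

# First run: nothing is known yet, and the item comes out of page B.
skipped_first_run = should_skip(page_a)
record_item_from(page_b)

# Second run: page A's own key is now in the store, so it is skipped.
skipped_second_run = should_skip(page_a)
print(skipped_first_run, skipped_second_run)
```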
I am trying to build a spider to scrape some data from the website Techcrunch - Heartbleed search
My thought was to pass a tag when executing the spider from the command line (example: Heartbleed). The spider should then search through all the associated search results, open each link and get the data contained within.
import scrapy

class TechcrunchSpider(scrapy.Spider):
    name = "tech_search"

    def start_requests(self):
        url = 'https://techcrunch.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + '?s=' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        pass
This code can be executed with: scrapy crawl tech_search -s DOWNLOAD_DELAY=1.5 -o tech_search.jl -a tag=EXAMPLEINPUT
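One small aside on the URL built in start_requests: concatenating the raw tag works for a single word, but a tag with spaces or special characters should be URL-encoded. A minimal sketch with the standard library is below; the search_url helper name is an assumption, not part of the spider above.

```python
from urllib.parse import urlencode

def search_url(tag=None, base='https://techcrunch.com/'):
    """Build the search URL, encoding the tag as the 's' query parameter."""
    if tag is None:
        return base
    return base + '?' + urlencode({'s': tag})

print(search_url('Heartbleed'))
print(search_url('open ssl'))
```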
Getting the data from the individual pages is not the problem, but actually getting the URLs to them is (from the search page linked above):
The thing is, when looking at the source HTML (Ctrl+U) of the search page (link above), I can't find anything about the searched elements (example: "What Is Heartbleed? The Video"). Any suggestions on how to obtain these elements?
I suggest that you define your scrapy class along the lines shown in this answer but using the PhantomJS selenium headless browser. The essential problem is that the site uses JavaScript to build the HTML (DOM) that you see in the browser; scrapy only downloads the raw page and does not run that JavaScript, so you cannot access the content via the route you have chosen.
I'm new to scrapy and tried to crawl from a couple of sites, but wasn't able to get more than a few images from there.
For example, for http://shop.nordstrom.com/c/womens-dresses-new with the following code -
def parse(self, response):
    for dress in response.css('article.npr-product-module'):
        yield {
            'src': dress.css('img.product-photo').xpath('@src').extract_first(),
            'url': dress.css('a.product-photo-href').xpath('@href').extract_first()
        }
I got 6 products. I expect 66.
For URL https://www.renttherunway.com/products/dress with the following code -
def parse(self, response):
    for dress in response.css('div.cycle-image-0'):
        yield {
            'image-url': dress.xpath('.//img/@src').extract_first(),
        }
I got 12. I expect roughly 100.
Even when I changed it to crawl every 'next' page, I got the same number per page but it went through all pages successfully.
I have tried a different USER_AGENT, disabled COOKIES, and DOWNLOAD_DELAY of 5.
I imagine I will run into the same problem on any site so folks should have seen this before but can't find a reference to it.
What am I missing?
It's one of those weird websites that store product data as JSON in the HTML source and unpack it with JavaScript after the page loads.
To figure this out, what you usually want to do is:
1) Disable JavaScript and run scrapy view <url>
2) Investigate the results
3) Find the id in the product URL and search for that id in the page source to check whether it exists and, if so, where it is hidden. If it doesn't exist, it is being populated by some AJAX request: re-enable JavaScript, go to the page and dig through the network tab of the browser inspector to find it.
If you do a regex-based search:
re.findall(r"ProductResults, (\{.+\})\)", response.text)
you'll get a huge JSON that contains all the products and their information.
import json
import re

data = re.findall(r"ProductResults, (\{.+\})\)", response.text)
data = json.loads(data[0])['data']
print(len(data['ProductResult']['Products']))
>> 66
That is the correct number of products!
So in your parse you can do this:
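def parse(self, response):
    # Pull the embedded JSON out of the page (needs the json and re imports)
    data = re.findall(r"ProductResults, (\{.+\})\)", response.text)
    data = json.loads(data[0])['data']
    for product in data['ProductResult']['Products']:
        # find main image
        image_url = [m['Url'] for m in product['Media'] if m['Type'] == 'MainImage']
        yield {'image_url': image_url}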
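If you want to try the extraction without hitting the site, here is a self-contained sketch of the same regex-plus-JSON technique run against a made-up page. The sample HTML, the example.com URLs and the main_images helper name are assumptions; only the regex and the key names come from the answer above.

```python
import json
import re

# Build a fake page that embeds JSON the way the answer describes.
products = {'data': {'ProductResult': {'Products': [
    {'Media': [{'Type': 'MainImage', 'Url': 'https://example.com/a.jpg'},
               {'Type': 'Alternate', 'Url': 'https://example.com/a2.jpg'}]},
    {'Media': [{'Type': 'MainImage', 'Url': 'https://example.com/b.jpg'}]},
]}}}
sample_html = '<script>render(ProductResults, ' + json.dumps(products) + '), node);</script>'

def main_images(html):
    """Return the main image URL of every product embedded in the page."""
    match = re.search(r"ProductResults, (\{.+\})\)", html)
    if match is None:
        return []  # no embedded JSON found
    data = json.loads(match.group(1))['data']
    return [media['Url']
            for product in data['ProductResult']['Products']
            for media in product['Media']
            if media['Type'] == 'MainImage']

print(main_images(sample_html))
```

Note that the pattern relies on the JSON sitting on a single line, since '.' does not match newlines by default.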