Scrapy (Python): Iterating over 'next' page without multiple functions

I am using Scrapy to grab stock data from Yahoo! Finance.
Sometimes I need to loop over several pages (19 in this example) in order to get all of the stock data.
Previously (when I knew there would only be two pages), I would use one function for each page, like so:
def stocks_page_1(self, response):
    returns_page1 = []
    # Grabs data here...
    current_page = response.url
    next_page = current_page + "&z=66&y=66"
    yield Request(next_page, self.stocks_page_2, meta={'returns_page1': returns_page1})

def stocks_page_2(self, response):
    # Grab data again...
Now, instead of writing 19 or more functions, I was wondering whether there is a way to loop within a single function and grab all the data from every page available for a given stock.
Something like this:
for x in range(30):  # 30 was randomly selected
    current_page = response.url
    # Grabs data
    # Check if there is a 'next' page:
    if response.xpath('//td[@align="right"]/a[@rel="next"]').extract() != ' ':
        u = x * 66
        next_page = current_page + "&z=66&y={0}".format(u)
        # Go to the next page somehow within the function???
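What the answer below boils down to is a single callback that yields a Request back to itself whenever a 'next' link exists, carrying the data collected so far along in meta. A minimal sketch, assuming the selector from the snippet above and an invented meta key named 'returns':

from scrapy import Request

def stocks_page(self, response):
    # returns collected on previous pages travel along in meta
    returns = response.meta.get('returns', [])
    # ... grab this page's data and append it to returns ...
    next_href = response.xpath('//td[@align="right"]/a[@rel="next"]/@href').extract_first()
    if next_href:
        # the same callback handles every page
        yield Request(response.urljoin(next_href), callback=self.stocks_page,
                      meta={'returns': returns})
    else:
        # last page reached: emit everything that was accumulated
        yield {'returns': returns}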
Updated Code:
Works, but only returns one page of data.
class DmozSpider(CrawlSpider):
    name = "dnot"
    allowed_domains = ["finance.yahoo.com", "http://eoddata.com/"]
    start_urls = ['http://finance.yahoo.com/q?s=CAT']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//td[@align="right"]/a[@rel="next"]'),
             callback='stocks1',
             follow=True),
    ]

    def stocks1(self, response):
        returns = []
        rows = response.xpath('//table[@class="yfnc_datamodoutline1"]//table/tr')[1:]
        for row in rows:
            cells = row.xpath('.//td/text()').extract()
            try:
                values = cells[-1]
                try:
                    float(values)
                    returns.append(values)
                except ValueError:
                    continue
            except ValueError:
                continue
        unformatted_returns = response.meta.get('returns_pages')
        returns = [float(i) for i in returns]
        global required_amount_of_returns, counter
        if counter == 1 and "CAT" in response.url:
            required_amount_of_returns = len(returns)
        elif required_amount_of_returns == 0:
            raise CloseSpider("'Error with initiating required amount of returns'")
        counter += 1
        print counter

        # Iterator to calculate Rate of return
        # ====================================
        if data_intervals == "m":
            k = 12
        elif data_intervals == "w":
            k = 4
        else:
            k = 30
        sub_returns_amount = required_amount_of_returns - k
        sub_returns = returns[:sub_returns_amount]
        rate_of_return = []
        if len(returns) == required_amount_of_returns or "CAT" in response.url:
            for number in sub_returns:
                numerator = number - returns[k]
                rate = numerator/returns[k]
                if rate == '':
                    rate = 0
                rate_of_return.append(rate)
                k += 1

        item = Website()
        items = []
        item['url'] = response.url
        item['name'] = response.xpath('//div[@class="title"]/h2/text()').extract()
        item['avg_returns'] = numpy.average(rate_of_return)
        item['var_returns'] = numpy.cov(rate_of_return)
        item['sd_returns'] = numpy.std(rate_of_return)
        item['returns'] = returns
        item['rate_of_returns'] = rate_of_return
        item['exchange'] = response.xpath('//span[@class="rtq_exch"]/text()').extract()
        item['ind_sharpe'] = ((numpy.average(rate_of_return) - RFR) / numpy.std(rate_of_return))
        items.append(item)
        yield item

You see, a parse callback is just a function that takes the response and returns or yields either Items or Requests or both. There is no issue at all with reusing these callbacks, so you can just pass the same callback for every request.
Now, you could pass the current page info along using the Request meta, but instead I'd leverage the CrawlSpider to crawl across every page. It's really easy; start by generating the spider from the command line:
scrapy genspider --template crawl finance finance.yahoo.com
Then write it like this:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
Scrapy 1.0 has deprecated the scrapy.contrib namespace for the modules above, but if you're stuck with 0.24, use scrapy.contrib.linkextractors and scrapy.contrib.spiders.
from yfinance.items import YfinanceItem

class FinanceSpider(CrawlSpider):
    name = 'finance'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['http://finance.yahoo.com/q/hp?s=PWF.TO&a=04&b=19&c=2005&d=04&e=19&f=2010&g=d&z=66&y=132']

    rules = (
        Rule(LinkExtractor(restrict_css='[rel="next"]'),
             callback='parse_items',
             follow=True),
    )
LinkExtractor will pick up the links to follow from the response, but it can be limited with XPath (or CSS) selectors and regular expressions. See the documentation for more.
Rules will follow the links and call the callback on every response. follow=True will keep extracting links on every new response, but it can be limited by depth. See the documentation again.
    def parse_items(self, response):
        for line in response.css('.yfnc_datamodoutline1 table tr')[1:-1]:
            yield YfinanceItem(date=line.css('td:first-child::text').extract()[0])
Just yield the Items, since Requests for the next pages will be handled by the CrawlSpider Rules.
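To try the spider and dump what it scrapes, the usual Scrapy invocation should work (the output filename here is just an example):

scrapy crawl finance -o items.json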

Related

collecting multiple data from multiple requests into one item in scrapy

Basically, I have a website that contains clothing items. I start my spider on the page that lists all the items and loop over them one by one, entering each item's page through its URL. Then I try to get the image URLs (.jpeg files) and return them (each item comes in multiple colors, so I am trying to take all the images for all the colors of that specific item). The problem is that my code currently returns the URLs for each color on a separate line. What I want is to return all the color URLs of a specific item on one line of the JSON file, and then loop to the next item.
My current code:
import scrapy

class USSpider(scrapy.Spider):
    name = 'US'
    start_urls = ['https://tr.uspoloassn.com/sadece-online-erkek/?attributes_filterable_product_base_type=T-Shirt']

    def parse(self, response):
        for j in range(int(response.css('.js-product-list-load').xpath("@page").extract_first()), int(response.css('.js-product-list-load').xpath("@numpages").extract_first())):
            l = 'https://tr.uspoloassn.com/sadece-online-erkek/?attributes_filterable_product_base_type=T-Shirt' + '&page=' + str(j)
            yield scrapy.Request(url=l, callback=self.parse2)

    def parse2(self, response):
        for i in range(len(response.css('a.js-product-images-wrapper'))):
            link = 'https://tr.uspoloassn.com' + response.css('a.js-product-images-wrapper')[i].attrib['href']
            Url = response.urljoin(link)
            yield scrapy.Request(Url, callback=self.parse3)

    def parse3(self, response):
        colors = list(set(response.xpath('//*[@class="js-variant-area "]').css('ul li').xpath("//a[@class='js-variant ']").xpath(
            "@data-value").extract()))
        link = response.url
        arnold = []
        for i in colors:
            if i[0].lower() == 'v':
                url = link + '?integration_color=' + i
                yield scrapy.Request(url, callback=self.parseImage)

    def parseImage(self, response):
        yield {
            'image links': response.css("a.js-product-thumbnail").xpath("@data-image").extract()
        }
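No answer is quoted for this one, but a common way to get every color's images into a single item is to chain the per-color requests and only yield once the last response arrives. A hedged sketch of that pattern, reusing the selectors above (the meta keys remaining and collected are invented here):

    def parse3(self, response):
        colors = [c for c in set(response.xpath('//*[@class="js-variant-area "]')
                  .css('ul li').xpath("//a[@class='js-variant ']")
                  .xpath("@data-value").extract()) if c and c[0].lower() == 'v']
        if colors:
            yield scrapy.Request(response.url + '?integration_color=' + colors[0],
                                 callback=self.parseImage,
                                 meta={'remaining': colors[1:], 'collected': []})

    def parseImage(self, response):
        # add this color's image links to what earlier responses collected
        collected = response.meta['collected'] + \
            response.css("a.js-product-thumbnail").xpath("@data-image").extract()
        remaining = response.meta['remaining']
        if remaining:
            base = response.url.split('?')[0]
            yield scrapy.Request(base + '?integration_color=' + remaining[0],
                                 callback=self.parseImage,
                                 meta={'remaining': remaining[1:], 'collected': collected})
        else:
            # one line of output per product, with all colors' image links together
            yield {'image links': collected}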

Can Response not return an integer value in Scrapy?

In order to find out how many bills each member of parliament has their signature on, I'm trying to write a scraper for the members of parliament that works in 3 layers:
Accessing the link for each MP from the list
From (1), accessing the page with information including the bills the MP has a signature on
From (2), accessing the page where the bill proposals with the MP's signature are shown, counting them, and assigning their number to the ktsayisi variable (the problem occurs here)
At the last layer, I'm trying to return the number of bills by counting the matching rows with the len() function. But apparently I can't assign the number returned from (3) to a value that can eventually be yielded.
Scrapy returns just the link accessed rather than the number I want the function to return. Why is that? Can't I write a statement like X = Request(url, callback=function) where the callback returns an integer? How can I fix it?
I want a number to be yielded in place of statements like this: <GET https://www.tbmm.gov.tr/Milletvekilleri/KanunTeklifiUyeninImzasiBulunanTeklifler?donemKod=27&sicil=UqVZp9Fvweo=>
Thanks in advance.
from scrapy import Spider
from scrapy.http import Request

class MvSpider(Spider):
    name = 'mv'
    allowed_domains = ['tbmm.gov.tr']  # website of the parliament
    start_urls = ['https://www.tbmm.gov.tr/Milletvekilleri/liste']  # the link which has the list of MPs

    def parse(self, response):
        mv_linkler = response.xpath('//div[@class="col-md-8"]/a/@href').getall()
        for link in mv_linkler:
            mutlak_link = response.urljoin(link)  # absolute url
            yield Request(mutlak_link, callback=self.mv_analiz)

    def mv_analiz(self, response):  # function to analyze the MP
        kteklif_link_path = response.xpath("//a[contains(text(),'İmzası Bulunan Kanun Teklifleri')]/@href").get()
        kteklif_link = response.urljoin(kteklif_link_path)
        ktsayisi = int(Request(kteklif_link, callback=self.kt_say))  # the value of the number of bill proposals to be requested

    def kt_say(self, response):
        kteklifler = response.xpath("//tr[@valign='TOP']")
        return len(kteklifler)
You can't. furas's explanation pretty much covers why (Scrapy schedules requests asynchronously, so a callback's return value never comes back to the code that created the Request; the data has to be passed forward instead) and I don't have anything to add. You need to do something like this:
from scrapy import Spider
from scrapy.http import Request

class MvSpider(Spider):
    name = 'mv'
    allowed_domains = ['tbmm.gov.tr']  # website of the parliament
    start_urls = ['https://www.tbmm.gov.tr/Milletvekilleri/liste']  # the link which has the list of MPs

    def parse(self, response):
        mv_linkler = response.xpath('//div[@class="col-md-8"]/a/@href').getall()
        for link in mv_linkler:
            mutlak_link = response.urljoin(link)  # absolute url
            yield Request(mutlak_link, callback=self.mv_analiz)

    def mv_analiz(self, response):  # function to analyze the MP
        kteklif_link_path = response.xpath("//a[contains(text(),'İmzası Bulunan Kanun Teklifleri')]/@href").get()
        kteklif_link = response.urljoin(kteklif_link_path)
        item = {}
        req = Request(kteklif_link, callback=self.kt_say)  # request the page listing the bill proposals
        req.meta['item'] = item
        yield req

    def kt_say(self, response):
        kteklifler = response.xpath("//tr[@valign='TOP']")
        item = response.meta['item']
        item['ktsayisi'] = len(kteklifler)
        yield item
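As a side note, if you are on Scrapy 1.7 or newer, cb_kwargs is the preferred way to hand data to a callback; a sketch of the same idea:

    def mv_analiz(self, response):
        kteklif_link_path = response.xpath("//a[contains(text(),'İmzası Bulunan Kanun Teklifleri')]/@href").get()
        kteklif_link = response.urljoin(kteklif_link_path)
        # the partially built item is passed as a keyword argument to kt_say
        yield Request(kteklif_link, callback=self.kt_say, cb_kwargs={'item': {}})

    def kt_say(self, response, item):
        item['ktsayisi'] = len(response.xpath("//tr[@valign='TOP']"))
        yield item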

How to scrape website with multiple pages using scrapy

I am trying to scrape this website (which has multiple pages) using Scrapy. The problem is that I can't find the next-page URL.
Do you have an idea of how to scrape a website with multiple pages (with Scrapy), or how to solve the error I'm getting with my code?
I tried the code below, but it's not working:
class AbcdspiderSpider(scrapy.Spider):
    """
    Class docstring
    """
    name = 'abcdspider'
    allowed_domains = ['abcd-terroir.smartrezo.com']
    alphabet = list(string.ascii_lowercase)
    url = "https://abcd-terroir.smartrezo.com/n31-france/annuaireABCD.html?page=1&spe=1&anIDS=31&search="
    start_urls = [url + letter for letter in alphabet]
    main_url = "https://abcd-terroir.smartrezo.com/n31-france/"
    crawl_datetime = str(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
    start_time = datetime.datetime.now()

    def parse(self, response):
        self.crawler.stats.set_value("start_time", self.start_time)
        try:
            page = response.xpath('//div[@class="pageStuff"]/span/text()').get()
            page_max = get_num_page(page)
            for index in range(page_max):
                producer_list = response.xpath('//div[@class="clearfix encart_ann"]/@onclick').getall()
                for producer in producer_list:
                    link_producer = self.main_url + producer
                    yield scrapy.Request(url=link_producer, callback=self.parse_details)
                next_page_url = "/annuaireABCD.html?page={}&spe=1&anIDS=31&search=".format(index)
                if next_page_url is not None:
                    yield scrapy.Request(response.urljoin(self.main_url + next_page_url))
        except Exception as e:
            self.crawler.stats.set_value("error", e.args)
I am getting this error:
'error': ('range() integer end argument expected, got unicode.',)
The error is here:
page = response.xpath('//div[@class="pageStuff"]/span/text()').get()
page_max = get_num_page(page)
The range() function expects an integer value (1, 2, 3, 4, etc.), not a unicode string ('Page 1 / 403').
My proposal for the range error is:
page = response.xpath('//div[@class="pageStuff"]/span/text()').get().split('/ ')[1]
for index in range(int(page)):
    # your actions
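For clarity, that split simply pulls the total page count out of the label text the XPath returns, e.g.:

page_label = 'Page 1 / 403'                 # what the XPath .get() returns
page_max = int(page_label.split('/ ')[1])   # -> 403
for index in range(page_max):
    pass  # your actions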

Scrapy - unable to make additional request in XMLFeedSpider

I have a scrapy spider that uses XMLFeedSpider. As well as the data returned for each node in parse_node(), I also need to make an additional request to get more data. The only issue is that if I yield an additional request from parse_node(), nothing gets returned at all:
class MySpidersSpider(XMLFeedSpider):
    name = "myspiders"
    namespaces = [('g', 'http://base.google.com/ns/1.0')]
    allowed_domains = {"www.myspiders.com"}
    start_urls = [
        "https://www.myspiders.com/productMap.xml"
    ]
    iterator = 'iternodes'
    itertag = 'item'

    def parse_node(self, response, node):
        if(self.settings['CLOSESPIDER_ITEMCOUNT'] and int(self.settings['CLOSESPIDER_ITEMCOUNT']) == self.item_count):
            raise CloseSpider('CLOSESPIDER_ITEMCOUNT limit reached - ' + str(self.settings['CLOSESPIDER_ITEMCOUNT']))
        else:
            self.item_count += 1
        id = node.xpath('id/text()').extract()
        title = node.xpath('title/text()').extract()
        link = node.xpath('link/text()').extract()
        image_link = node.xpath('g:image_link/text()').extract()
        gtin = node.xpath('g:gtin/text()').extract()
        product_type = node.xpath('g:product_type/text()').extract()
        price = node.xpath('g:price/text()').extract()
        sale_price = node.xpath('g:sale_price/text()').extract()
        availability = node.xpath('g:availability/text()').extract()
        item = MySpidersItem()
        item['id'] = id[0]
        item['title'] = title[0]
        item['link'] = link[0]
        item['image_link'] = image_link[0]
        item['gtin'] = gtin[0]
        item['product_type'] = product_type[0]
        item['price'] = price[0]
        item['sale_price'] = '' if len(sale_price) == 0 else sale_price[0]
        item['availability'] = availability[0]
        yield Request(item['link'], callback=self.parse_details, meta={'item': item})

    def parse_details(self, response):
        item = response.meta['item']
        item['price_per'] = 'test'
        return item
If I change the last line of parse_node() to return item it works fine (without setting price_per in the item, naturally).
Any idea what I'm doing wrong?
Have you tried checking the contents of item['link']? If it is a relative link (example: /products?id=5), the URL won't return anything and the request will fail. You need to make sure it's a resolvable link (example: https://www.myspiders.com/products?id=5).
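If the feed did contain relative links, a hedged fix would be to join them against the response URL before yielding the request, e.g.:

yield Request(response.urljoin(item['link']), callback=self.parse_details, meta={'item': item})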
I discovered the issue: I was limiting the number of items processed in my parse_node() function, and because of that limit my spider was terminating before the additional request was made. Moving the item-limiting code to my parse_details() function resolves the issue:
def parse_details(self, response):
    if(self.settings['CLOSESPIDER_ITEMCOUNT'] and int(self.settings['CLOSESPIDER_ITEMCOUNT']) == self.item_count):
        raise CloseSpider('CLOSESPIDER_ITEMCOUNT limit reached - ' + str(self.settings['CLOSESPIDER_ITEMCOUNT']))
    else:
        self.item_count += 1
    item = response.meta['item']
    item['price_per'] = 'test'
    return item

Scrapy: Wait for a specific url to be parsed before parsing others

Brief Explanation:
I have a Scrapy project that takes stock data from Yahoo! Finance. In order for my project to work, I need to ensure that a stock has been around for a desired amount of time. I do this by scraping CAT (Caterpillar Inc. (CAT) - NYSE) first, getting the number of closing prices there are for that time period, and then ensuring that all stocks scraped after that have the same number of closing prices as CAT, thus ensuring that a stock has been publicly traded for the desired length of time.
The Problem:
This all works fine and dandy; however, my problem is that Scrapy begins scraping and parsing other stocks before it has finished parsing CAT. This results in an error, because before I can get the desired number of closing prices from CAT, Scrapy is already trying to check whether other stocks have the same number of closing prices as CAT, a figure which does not exist yet.
The actual question
How can I force Scrapy to finish parsing one URL before beginning the others?
I have also tried:
def start_requests(self):
    global start_time
    yield Request('http://finance.yahoo.com/q?s=CAT', self.parse)
    # Waits 4 seconds to allow CAT to finish crawling
    if time.time() - start_time > 0.2:
        for i in self.other_urls:
            yield Request(i, self.parse)
but the stocks in other_urls are never crawled, because Scrapy never goes back into start_requests to check whether the elapsed time is above 0.2.
The Entire Code:
from scrapy.selector import Selector
from scrapy import Request
from scrapy.exceptions import CloseSpider
from sharpeparser.gen_settings import *
from decimal import Decimal
from scrapy.spider import Spider
from sharpeparser.items import SharpeparserItem
import numpy
import time

if data_intervals == "m":
    required_amount_of_returns = 24
elif data_intervals == "w":
    required_amount_of_returns = 100
else:
    required_amount_of_returns =
counter = 1
start_time = time.time()

class DnotSpider(Spider):
    # ---- >>> ENSURE YOU INDENT 1 ---- >>>
    # =======================================
    name = "dnot"
    allowed_domains = ["finance.yahoo.com", "http://eoddata.com/", "ca.finance.yahoo.com"]
    start_urls = ['http://finance.yahoo.com/q?s=CAT']
other_urls = ['http://eoddata.com/stocklist/TSX.htm', 'http://eoddata.com/stocklist/TSX/B.htm', 'http://eoddata.com/stocklist/TSX/C.htm', 'http://eoddata.com/stocklist/TSX/D.htm', 'http://eoddata.com/stocklist/TSX/E.htm', 'http://eoddata.com/stocklist/TSX/F.htm', 'http://eoddata.com/stocklist/TSX/G.htm', 'http://eoddata.com/stocklist/TSX/H.htm', 'http://eoddata.com/stocklist/TSX/I.htm', 'http://eoddata.com/stocklist/TSX/J.htm', 'http://eoddata.com/stocklist/TSX/K.htm', 'http://eoddata.com/stocklist/TSX/L.htm', 'http://eoddata.com/stocklist/TSX/M.htm', 'http://eoddata.com/stocklist/TSX/N.htm', 'http://eoddata.com/stocklist/TSX/O.htm', 'http://eoddata.com/stocklist/TSX/P.htm', 'http://eoddata.com/stocklist/TSX/Q.htm', 'http://eoddata.com/stocklist/TSX/R.htm', 'http://eoddata.com/stocklist/TSX/S.htm', 'http://eoddata.com/stocklist/TSX/T.htm', 'http://eoddata.com/stocklist/TSX/U.htm', 'http://eoddata.com/stocklist/TSX/V.htm', 'http://eoddata.com/stocklist/TSX/W.htm', 'http://eoddata.com/stocklist/TSX/X.htm', 'http://eoddata.com/stocklist/TSX/Y.htm', 'http://eoddata.com/stocklist/TSX/Z.htm'
'http://eoddata.com/stocklist/NASDAQ/B.htm', 'http://eoddata.com/stocklist/NASDAQ/C.htm', 'http://eoddata.com/stocklist/NASDAQ/D.htm', 'http://eoddata.com/stocklist/NASDAQ/E.htm', 'http://eoddata.com/stocklist/NASDAQ/F.htm', 'http://eoddata.com/stocklist/NASDAQ/G.htm', 'http://eoddata.com/stocklist/NASDAQ/H.htm', 'http://eoddata.com/stocklist/NASDAQ/I.htm', 'http://eoddata.com/stocklist/NASDAQ/J.htm', 'http://eoddata.com/stocklist/NASDAQ/K.htm', 'http://eoddata.com/stocklist/NASDAQ/L.htm', 'http://eoddata.com/stocklist/NASDAQ/M.htm', 'http://eoddata.com/stocklist/NASDAQ/N.htm', 'http://eoddata.com/stocklist/NASDAQ/O.htm', 'http://eoddata.com/stocklist/NASDAQ/P.htm', 'http://eoddata.com/stocklist/NASDAQ/Q.htm', 'http://eoddata.com/stocklist/NASDAQ/R.htm', 'http://eoddata.com/stocklist/NASDAQ/S.htm', 'http://eoddata.com/stocklist/NASDAQ/T.htm', 'http://eoddata.com/stocklist/NASDAQ/U.htm', 'http://eoddata.com/stocklist/NASDAQ/V.htm', 'http://eoddata.com/stocklist/NASDAQ/W.htm', 'http://eoddata.com/stocklist/NASDAQ/X.htm', 'http://eoddata.com/stocklist/NASDAQ/Y.htm', 'http://eoddata.com/stocklist/NASDAQ/Z.htm',
'http://eoddata.com/stocklist/NYSE/B.htm', 'http://eoddata.com/stocklist/NYSE/C.htm', 'http://eoddata.com/stocklist/NYSE/D.htm', 'http://eoddata.com/stocklist/NYSE/E.htm', 'http://eoddata.com/stocklist/NYSE/F.htm', 'http://eoddata.com/stocklist/NYSE/G.htm', 'http://eoddata.com/stocklist/NYSE/H.htm', 'http://eoddata.com/stocklist/NYSE/I.htm', 'http://eoddata.com/stocklist/NYSE/J.htm', 'http://eoddata.com/stocklist/NYSE/K.htm', 'http://eoddata.com/stocklist/NYSE/L.htm', 'http://eoddata.com/stocklist/NYSE/M.htm', 'http://eoddata.com/stocklist/NYSE/N.htm', 'http://eoddata.com/stocklist/NYSE/O.htm', 'http://eoddata.com/stocklist/NYSE/P.htm', 'http://eoddata.com/stocklist/NYSE/Q.htm', 'http://eoddata.com/stocklist/NYSE/R.htm', 'http://eoddata.com/stocklist/NYSE/S.htm', 'http://eoddata.com/stocklist/NYSE/T.htm', 'http://eoddata.com/stocklist/NYSE/U.htm', 'http://eoddata.com/stocklist/NYSE/V.htm', 'http://eoddata.com/stocklist/NYSE/W.htm', 'http://eoddata.com/stocklist/NYSE/X.htm', 'http://eoddata.com/stocklist/NYSE/Y.htm', 'http://eoddata.com/stocklist/NYSE/Z.htm',
'http://eoddata.com/stocklist/HKEX/0.htm', 'http://eoddata.com/stocklist/HKEX/1.htm', 'http://eoddata.com/stocklist/HKEX/2.htm', 'http://eoddata.com/stocklist/HKEX/3.htm', 'http://eoddata.com/stocklist/HKEX/6.htm', 'http://eoddata.com/stocklist/HKEX/8.htm',
'http://eoddata.com/stocklist/LSE/0.htm', 'http://eoddata.com/stocklist/LSE/1.htm', 'http://eoddata.com/stocklist/LSE/2.htm', 'http://eoddata.com/stocklist/LSE/3.htm', 'http://eoddata.com/stocklist/LSE/4.htm', 'http://eoddata.com/stocklist/LSE/5.htm', 'http://eoddata.com/stocklist/LSE/6.htm', 'http://eoddata.com/stocklist/LSE/7.htm', 'http://eoddata.com/stocklist/LSE/8.htm', 'http://eoddata.com/stocklist/LSE/9.htm', 'http://eoddata.com/stocklist/LSE/A.htm', 'http://eoddata.com/stocklist/LSE/B.htm', 'http://eoddata.com/stocklist/LSE/C.htm', 'http://eoddata.com/stocklist/LSE/D.htm', 'http://eoddata.com/stocklist/LSE/E.htm', 'http://eoddata.com/stocklist/LSE/F.htm', 'http://eoddata.com/stocklist/LSE/G.htm', 'http://eoddata.com/stocklist/LSE/H.htm', 'http://eoddata.com/stocklist/LSE/I.htm', 'http://eoddata.com/stocklist/LSE/G.htm', 'http://eoddata.com/stocklist/LSE/K.htm', 'http://eoddata.com/stocklist/LSE/L.htm', 'http://eoddata.com/stocklist/LSE/M.htm', 'http://eoddata.com/stocklist/LSE/N.htm', 'http://eoddata.com/stocklist/LSE/O.htm', 'http://eoddata.com/stocklist/LSE/P.htm', 'http://eoddata.com/stocklist/LSE/Q.htm', 'http://eoddata.com/stocklist/LSE/R.htm', 'http://eoddata.com/stocklist/LSE/S.htm', 'http://eoddata.com/stocklist/LSE/T.htm', 'http://eoddata.com/stocklist/LSE/U.htm', 'http://eoddata.com/stocklist/LSE/V.htm', 'http://eoddata.com/stocklist/LSE/W.htm', 'http://eoddata.com/stocklist/LSE/X.htm', 'http://eoddata.com/stocklist/LSE/Y.htm', 'http://eoddata.com/stocklist/LSE/Z.htm',
'http://eoddata.com/stocklist/AMS/A.htm', 'http://eoddata.com/stocklist/AMS/B.htm', 'http://eoddata.com/stocklist/AMS/C.htm', 'http://eoddata.com/stocklist/AMS/D.htm', 'http://eoddata.com/stocklist/AMS/E.htm', 'http://eoddata.com/stocklist/AMS/F.htm', 'http://eoddata.com/stocklist/AMS/G.htm', 'http://eoddata.com/stocklist/AMS/H.htm', 'http://eoddata.com/stocklist/AMS/I.htm', 'http://eoddata.com/stocklist/AMS/J.htm', 'http://eoddata.com/stocklist/AMS/K.htm', 'http://eoddata.com/stocklist/AMS/L.htm', 'http://eoddata.com/stocklist/AMS/M.htm', 'http://eoddata.com/stocklist/AMS/N.htm', 'http://eoddata.com/stocklist/AMS/O.htm', 'http://eoddata.com/stocklist/AMS/P.htm', 'http://eoddata.com/stocklist/AMS/Q.htm', 'http://eoddata.com/stocklist/AMS/R.htm', 'http://eoddata.com/stocklist/AMS/S.htm', 'http://eoddata.com/stocklist/AMS/T.htm', 'http://eoddata.com/stocklist/AMS/U.htm', 'http://eoddata.com/stocklist/AMS/V.htm', 'http://eoddata.com/stocklist/AMS/W.htm', 'http://eoddata.com/stocklist/AMS/X.htm', 'http://eoddata.com/stocklist/AMS/Y.htm', 'http://eoddata.com/stocklist/AMS/Z.htm',
'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=A', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=B', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=C', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=D', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=E', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=F', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=G', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=H', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=I', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=J', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=K', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=L', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=M', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=N', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=O', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=P', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=Q', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=R', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=S', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=T', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=U', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=V', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=W', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=X', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=Y', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=Z',
'https://ca.finance.yahoo.com/q/cp?s=%5EHSI&alpha=0', 'https://ca.finance.yahoo.com/q/cp?s=%5EHSI&alpha=1', 'https://ca.finance.yahoo.com/q/cp?s=%5EHSI&alpha=2', 'https://ca.finance.yahoo.com/q/cp?s=%5EHSI&alpha=3',
'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=A', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=B', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=C', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=D', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=E', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=F', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=G', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=H', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=I', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=J', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=K', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=L', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=M', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=N', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=O', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=P', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=Q', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=R', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=S', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=T', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=U', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=V', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=W', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=X', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=Y', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=Z',
'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=A', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=B', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=C', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=D', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=E', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=F', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=G', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=H', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=I', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=J', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=K', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=L', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=M', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=N', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=O', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=P', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=Q', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=R', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=S', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=T', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=U', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=V', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=W', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=X', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=Y', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=Z',
'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=A', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=B', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=C', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=D', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=E', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=F', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=G', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=H', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=I', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=J', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=K', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=L', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=M', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=N', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=O', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=P', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=Q', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=R', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=S', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=T', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=U', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=V', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=W', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=X', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=Y', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=Z']
    def start_requests(self):
        global start_time
        yield Request('http://finance.yahoo.com/q?s=CAT', self.parse)
        # Waits 4 seconds to allow CAT to finish crawling
        if time.time() - start_time > 0.2:
            for i in self.other_urls:
                yield Request(i, self.parse)
    def parse(self, response):
        if "eoddata" in response.url:
            companyList = response.xpath('//tr[@class="ro"]/td/a/text()').extract()
            for company in companyList:
                if "TSX" in response.url:
                    go = 'http://finance.yahoo.com/q/hp?s={0}.TO&a={1}&b={2}&c={3}&d={4}&e={5}&f={6}&g={7}'.format(company, beginning_month, beginning_day, beginning_year, ending_month, ending_day, ending_year, data_intervals)
                    yield Request(go, self.stocks1)
                elif "LSE" in response.url:
                    go = 'http://finance.yahoo.com/q/hp?s={0}.L&a={1}&b={2}&c={3}&d={4}&e={5}&f={6}&g={7}'.format(company, beginning_month, beginning_day, beginning_year, ending_month, ending_day, ending_year, data_intervals)
                    yield Request(go, self.stocks1)
                elif "HKEX" in response.url:
                    go = 'http://finance.yahoo.com/q/hp?s={0}.HK&a={1}&b={2}&c={3}&d={4}&e={5}&f={6}&g={7}'.format(company, beginning_month, beginning_day, beginning_year, ending_month, ending_day, ending_year, data_intervals)
                    yield Request(go, self.stocks1)
                elif "AMS" in response.url:
                    go = 'https://ca.finance.yahoo.com/q/hp?s={0}.AS&a={1}&b={2}&c={3}&d={4}&e={5}&f={6}&g={7}'.format(company, beginning_month, beginning_day, beginning_year, ending_month, ending_day, ending_year, data_intervals)
                    yield Request(go, self.stocks1)
                else:
                    go = 'https://ca.finance.yahoo.com/q/hp?s={0}&a={1}&b={2}&c={3}&d={4}&e={5}&f={6}&g={7}'.format(company, beginning_month, beginning_day, beginning_year, ending_month, ending_day, ending_year, data_intervals)
                    yield Request(go, self.stocks1)
        elif "http://finance.yahoo.com/q?s=CAT" in response.url:
            go = 'http://finance.yahoo.com/q/hp?s=CAT&a={0}&b={1}&c={2}&d={3}&e={4}&f={5}&g={6}'.format(beginning_month, beginning_day, beginning_year, ending_month, ending_day, ending_year, data_intervals)
            yield Request(go, self.stocks1)
        else:
            rows = response.xpath('//table[@class="yfnc_tableout1"]//table/tr')[1:]
            for row in rows:
                company = row.xpath('.//td[1]/b/a/text()').extract()
                go = 'http://finance.yahoo.com/q/hp?s={0}&a={1}&b={2}&c={3}&d={4}&e={5}&f={6}&g={7}'.format(company, beginning_day, beginning_month, beginning_year, ending_day, ending_month, ending_year, data_intervals)
                yield Request(go, self.stocks1)
    def stocks1(self, response):
        current_page = response.url
        print current_page
        # If the link is not the same as the first page, ie. stocks1 is requested through stocks2, get the stock data from stocks2
        if initial_ending not in current_page[-iel:]:
            returns_pages = response.meta.get('returns_pages')
            # Remove the last stock price from the stock list, because it is the same as the first on the new list
            if not not returns_pages:
                if len(returns_pages) > 2:
                    returns_pages = returns_pages[:-1]
        else:
            # Else, if the link does match that of the first page, create a new list because one does not exist yet
            returns_pages = []
        # This grabs the stock data from the page
        rows = response.xpath('//table[@class="yfnc_datamodoutline1"]//table/tr')[1:]
        print "stocks1"
        print returns_pages
        for row in rows:
            cells = row.xpath('.//td/text()').extract()
            try:
                values = cells[-1]
                try:
                    float(values)
                    # And adds it to returns_pages
                    returns_pages.append(values)
                except ValueError:
                    continue
            except ValueError:
                continue
        print "after"
        print returns_pages
        # exp determines if there is a 'Next page' or not
        exp = response.xpath('//td[@align="right"]/a[@rel="next"]').extract()
        # If there is a 'Next Page':
        if not not exp:
            # And this is the first page:
            if initial_ending in current_page[-iel:]:
                # create the necessary url for the 2nd page
                next_page = current_page + "&z=66&y=66"
            # If this is not the first page
            else:
                # This increases the end of the link by 66, thereby getting the next 66 results for pages 2 and after
                u = int(current_page[-6:].split("=",1)[1])
                o = len(str(u))
                u += 66
                next_page = current_page[:-o] + str(u)
                print next_page, "66&y in curr_page"
            # Then go back to self.stocks1 to get more data on the next page
            yield Request(next_page, self.stocks2, meta={'returns_pages': returns_pages}, dont_filter=True)
        # Else, if there is no 'Next Link'
        else:
            # Send the returns to finalize_stock to be saved in the item
            yield Request(current_page, callback=self.finalize_stock, meta={'returns_pages': returns_pages}, dont_filter=True)
    def stocks2(self, response):
        # Prints the link of the current url
        current_page = response.url
        print current_page
        # Gets the returns from the previous page
        returns_pages = response.meta.get('returns_pages')
        # Removes the last return from the previous page because it will be a duplicate
        returns_pages = returns_pages[:-1]
        print "stocks2"
        print returns_pages
        # Gets all of the returns on the page
        rows = response.xpath('//table[@class="yfnc_datamodoutline1"]//table/tr')[1:]
        for row in rows:
            cells = row.xpath('.//td/text()').extract()
            try:
                values = cells[-1]
                try:
                    float(values)
                    # And adds it to the previous returns
                    returns_pages.append(values)
                except ValueError:
                    continue
            except ValueError:
                continue
        print "after 2"
        print returns_pages
        # exp determines if there is a 'Next page' or not
        exp = response.xpath('//td[@align="right"]/a[@rel="next"]').extract()
        # If there is a 'Next Page':
        if not not exp:
            # And somehow, this is the first page (should never be true)
            if initial_ending in current_page[-iel:]:
                # Add the necessary link to go to the second page
                next_page = current_page + "&z=66&y=66"
                print next_page, "66&y not in curr_page"
            # Else, this is not the first page (should always be true)
            else:
                # add 66 to the last number on the preceding link in order to access the second or later pages
                u = int(current_page[-6:].split("=",1)[1])
                o = len(str(u))
                u += 66
                next_page = current_page[:-o] + str(u)
                print next_page, "66&y in curr_page"
            # go back to self.stocks1 to get more data on the next page
            yield Request(next_page, self.stocks1, meta={'returns_pages': returns_pages}, dont_filter=True)
        else:
            # If there is no "Next" link, send the returns to finalize_stock to be saved in the item
            yield Request(current_page, callback=self.finalize_stock, meta={'returns_pages': returns_pages}, dont_filter=True)
            print "sending to finalize stock"
    def finalize_stock(self, response):
        current_page = response.url
        print "====================="
        print "finalize_stock called"
        print current_page
        print "====================="
        unformatted_returns = response.meta.get('returns_pages')
        returns = [float(i) for i in unformatted_returns]
        global required_amount_of_returns, counter
        if counter == 1 and "CAT" in response.url:
            required_amount_of_returns = len(returns)
        elif required_amount_of_returns == 0:
            raise CloseSpider("'Error with initiating required amount of returns'")
        counter += 1
        print counter

        # Iterator to calculate Rate of return
        # ====================================
        if data_intervals == "m":
            k = 12
        elif data_intervals == "w":
            k = 4
        else:
            k = 30
        sub_returns_amount = required_amount_of_returns - k
        sub_returns = returns[:sub_returns_amount]
        rate_of_return = []
        RFR = 0.03
        # Make sure list is exact length, otherwise rate_of_return will be inaccurate
        # Returns has not been checked by pipeline yet, so small lists will be in the variable
        if len(returns) > required_amount_of_returns:
            for number in sub_returns:
                numerator = number - returns[k]
                rate = numerator/returns[k]
                if rate == '':
                    rate = 0
                rate_of_return.append(rate)
                k += 1

        item = SharpeparserItem()
        items = []
        item['url'] = response.url
        item['name'] = response.xpath('//div[@class="title"]/h2/text()').extract()
        item['avg_returns'] = numpy.average(rate_of_return)
        item['var_returns'] = numpy.cov(rate_of_return)
        item['sd_returns'] = numpy.std(rate_of_return)
        item['returns'] = unformatted_returns
        item['rate_of_returns'] = rate_of_return
        item['exchange'] = response.xpath('//span[@class="rtq_exch"]/text()').extract()
        item['ind_sharpe'] = ((numpy.average(rate_of_return) - RFR) / numpy.std(rate_of_return))
        items.append(item)
        yield item
Actual Question
As for the actual problem of doing each request in sequence... There are a few questions similar to yours:
crawling sites one-by-one
crawling urls in order
processing urls sequentially
As a general summary there seem to be a couple of options:
Utilise the priority flag in a start_requests() function to iterate through websites in a particular order (see the sketch after this list)
Set CONCURRENT_REQUESTS=1 to ensure that only one request is carried out at a time
If you want to parse all the other sites at full speed once the first CAT ticker has been done, it might be possible to use an if condition to flick the above setting back to a higher value after the first site has been parsed, using the settings API
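A minimal sketch of the first two options together (the spider name is made up, and other_urls is assumed to be defined as in the question):

import scrapy

class SequencedSpider(scrapy.Spider):
    name = 'sequenced'
    # option 2: only one request in flight at a time
    custom_settings = {'CONCURRENT_REQUESTS': 1}

    def start_requests(self):
        # option 1: higher-priority requests are scheduled first
        yield scrapy.Request('http://finance.yahoo.com/q?s=CAT',
                             callback=self.parse, priority=100)
        for url in self.other_urls:
            yield scrapy.Request(url, callback=self.parse)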
General Coding
I can't run your exact code because you are missing the class structure but I can already see a few things that might be tripping you up:
This SO post describes yield. To better understand how your yield function is working, run the following:
def it():
    yield range(2)
    yield range(10)

g = it()
for i in g:
    print i
# now the generator has been consumed.
for i in g:
    print i
This SO post also demonstrates that the start_requests() function overrides the list specified in start_urls. It appears that for this reason the URLs in your start_urls are overridden by this function, which only ever ends up yielding Request('http://finance.yahoo.com/q?s=CAT', self.parse).
Is there any particular reason why you are not listing all the URLs in start_urls in the order you want them parsed and deleting the start_requests() function? The docs on start_urls state:
subsequent URLs will be generated successively from data contained in the start URLs
Sticking things in globals will tend to cause you problems in projects like this; it's usually better to initialise them as attributes on self in a def __init__(self): method, which is called when the class is instantiated (see the sketch below).
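A sketch of that idea, using the counters from the code above (imports as in the question):

class DnotSpider(Spider):
    name = "dnot"

    def __init__(self, *args, **kwargs):
        super(DnotSpider, self).__init__(*args, **kwargs)
        # spider-level state instead of module-level globals
        self.counter = 1
        self.required_amount_of_returns = 0
        self.start_time = time.time()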
This might be petty, but you could save yourself a lot of scrolling and effort by listing all the symbols in a separate file and then just loading them up in your code (a sketch follows). As it stands, you have a lot of repetition in that list that you could cut out, and it would be far easier to read.
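For example, something along these lines, where exchanges.txt is a made-up filename holding one exchange/letter entry per line (e.g. TSX/B):

with open('exchanges.txt') as f:
    pages = [line.strip() for line in f if line.strip()]
other_urls = ['http://eoddata.com/stocklist/{0}.htm'.format(p) for p in pages]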
