I am trying to scrape a website that has multiple pages, using Scrapy. The problem is that I can't find the next page URL.
Do you have an idea of how to scrape a website with multiple pages with Scrapy, or how to solve the error I'm getting with my code?
I tried the code below, but it isn't working:
import string
import datetime

import scrapy

# get_num_page() is a helper defined elsewhere in the project


class AbcdspiderSpider(scrapy.Spider):
    """
    Class docstring
    """
    name = 'abcdspider'
    allowed_domains = ['abcd-terroir.smartrezo.com']
    alphabet = list(string.ascii_lowercase)
    url = "https://abcd-terroir.smartrezo.com/n31-france/annuaireABCD.html?page=1&spe=1&anIDS=31&search="
    start_urls = [url + letter for letter in alphabet]
    main_url = "https://abcd-terroir.smartrezo.com/n31-france/"
    crawl_datetime = str(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
    start_time = datetime.datetime.now()

    def parse(self, response):
        self.crawler.stats.set_value("start_time", self.start_time)
        try:
            page = response.xpath('//div[@class="pageStuff"]/span/text()').get()
            page_max = get_num_page(page)
            for index in range(page_max):
                producer_list = response.xpath('//div[@class="clearfix encart_ann"]/@onclick').getall()
                for producer in producer_list:
                    link_producer = self.main_url + producer
                    yield scrapy.Request(url=link_producer, callback=self.parse_details)
                next_page_url = "/annuaireABCD.html?page={}&spe=1&anIDS=31&search=".format(index)
                if next_page_url is not None:
                    yield scrapy.Request(response.urljoin(self.main_url + next_page_url))
        except Exception as e:
            self.crawler.stats.set_value("error", e.args)
I am getting this error:
'error': ('range() integer end argument expected, got unicode.',)
The error is here:
page = response.xpath('//div[@class="pageStuff"]/span/text()').get()
page_max = get_num_page(page)
The range() function expects an integer value (1, 2, 3, 4, etc.), not a unicode string ('Page 1 / 403').
My proposal for the range error is:

page = response.xpath('//div[@class="pageStuff"]/span/text()').get().split('/ ')[1]
for index in range(int(page)):
    # your actions
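Beyond that int() fix, here is a minimal sketch of one way to walk the pages with Scrapy, assuming the "Page x / y" counter is the only thing that changes in the query string; the XPaths and URL pattern come from the question, while the spider name and the single search letter are illustrative:

import re

import scrapy


class AbcdPaginationSketch(scrapy.Spider):
    # illustrative spider; in practice reuse the class attributes from the question
    name = "abcd_pagination_sketch"
    main_url = "https://abcd-terroir.smartrezo.com/n31-france/"
    start_urls = [main_url + "annuaireABCD.html?page=1&spe=1&anIDS=31&search=a"]

    def parse(self, response):
        # follow every producer link on the current page
        for producer in response.xpath('//div[@class="clearfix encart_ann"]/@onclick').getall():
            yield scrapy.Request(self.main_url + producer, callback=self.parse_details)

        # read "Page 1 / 403", then request the next page number if there is one
        page_text = response.xpath('//div[@class="pageStuff"]/span/text()').get()
        current, total = [int(part) for part in page_text.replace("Page", "").split("/")]
        if current < total:
            next_url = re.sub(r"page=\d+", "page={}".format(current + 1), response.url)
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_details(self, response):
        # producer-page parsing goes here
        pass

Each response then schedules a request for the next page number until the counter reaches the total.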
Related
I'm trying to scrape a website using Scrapy and Selenium, and everything works just fine except the "yield item" part of the code.
In the "def parse_product(self, response)" part, I'm using Selenium find_element_by_css_selector to fill a list and then use the "for element in zip(list1, list2, etc)" approach to generate my items. I have also set up a Pipeline to export the result into a csv.
The problem is that although my spider is scraping the objects correctly (I have tested it with some prints along the way), the item creation part is not working and I'm getting an empty csv.
I have tried another approach that works, but is too slow. It consists of defining a middleware that passes the request through Selenium, loads the page source and returns an HtmlResponse. Then I simply use the response.css() method to fill the lists, the same approach to generate the items, and the same Pipeline to export them as CSV.
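For reference, a minimal sketch of the slower middleware approach described above, assuming a plain Firefox webdriver; the class name is illustrative and it would be enabled through DOWNLOADER_MIDDLEWARES in settings.py:

from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware:
    """Render the page with Selenium and hand Scrapy back an HtmlResponse
    built from the rendered page source."""

    def __init__(self):
        self.browser = webdriver.Firefox()

    def process_request(self, request, spider):
        self.browser.get(request.url)
        return HtmlResponse(
            url=self.browser.current_url,
            body=self.browser.page_source,
            encoding='utf-8',
            request=request,
        )

The trade-off, as noted above, is that every request then goes through the browser, which is what makes this approach slow.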
spider.py
def __init__(self):
    # Headless option
    opt = Options()
    opt.headless = True
    # Get the Firefox profile object
    prf = FirefoxProfile()
    # Disable CSS
    prf.set_preference('permissions.default.stylesheet', 2)
    # Disable images
    prf.set_preference('permissions.default.image', 2)
    self.browser = webdriver.Firefox(firefox_options=opt, firefox_profile=prf)
def parse(self, response):
    self.browser.get(response.url)
    print('Current URL: ' + response.request.url)
    # Find the total number of pages
    # Go to last page (click on '>>')
    self.browser.find_element_by_css_selector('li.ais-pagination__item:nth-child(9) a:nth-child(1)').click()
    n = self.browser.find_element_by_css_selector('li.ais-pagination__item:nth-child(7) a:nth-child(1)').get_attribute('text')
    pages = int(n.strip())
    # Go back to first page (click on '<<')
    self.browser.find_element_by_css_selector('li.ais-pagination__item:nth-child(1) a:nth-child(1)').click()
    # Scrape product links
    pdcts = []
    href = []
    i = 1
    while i <= pages:
        # Append product links
        atag = self.browser.find_elements_by_css_selector('div[class*="ph-proucts"] a')
        for a in atag:
            href.append(a.get_attribute('href'))
        prd = list(set(href))
        for p in prd:
            pdcts.append(p)
        # Load new page of products (click '>')
        self.browser.find_element_by_css_selector('li.ais-pagination__item:nth-child(8) a:nth-child(1)').click()
        i += 1
    for link in pdcts:
        yield scrapy.Request(url=link, callback=self.parse_product)
def parse_product(self, response):
    self.browser.get(response.url)
    print("Current URL: " + response.request.url)
    # Item initialization
    name = []
    # Item filling
    try:
        n = self.browser.find_element_by_css_selector('div[class="flex xs12 sm12 md12"] h1').text
        name.append(string.capwords(n))
    except:
        n1 = response.request.url
        n = n1.split('products/')[1].replace('-', ' ')
        name.append(string.capwords(n))
    # Item creation
    for i in zip(name):
        self.browser.quit()
        item = ProjectItem()
        item['name'] = i[0]
        yield item
The expected result is a csv with the scraped information but I'm getting an empty one instead.
Could anyone help me with this please? I would really appreciate it.
This is another newbie scrapy question:
When I first started with the scrapy tutorial linked here:
https://docs.scrapy.org/en/latest/intro/tutorial.html
I can crawl a webpage and then output the scraped content to a JSON file. But when I modify the tutorial spider to add a couple of rules:
a maximum traversal depth, and
a memory of visited pages so it doesn't traverse them again,
the output to the JSON file stops, although I can still see the output on the console. Can someone give me pointers on what I am doing wrong? The modifications can be seen below:
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # allowed_domains = allowed_domain_list
    start_urls = input_domain_list
    max_depth = 1
    invalid_url = []

    def parse(self, response):
        from_url = ''
        from_text = ''
        depth = 0
        # Extract the meta information from the response, if any
        if 'text' in response.meta:
            from_text = response.meta['text']
        if 'depth' in response.meta:
            depth = response.meta['depth']
        if 'visited' in response.meta:
            visited_dict = response.meta['visited']
        else:
            visited_dict = {}
        if response.status == 404:
            self.invalid_url.append(response.url)
            print('*' * 80)
            print('INVALID LINK')
            print('*' * 80)
        else:
            page = response.url.split("/")[-2]
            web_page = response.request.url
            ext_text = ' '.join([item.strip() for item in response.xpath('//body//text()').extract() if item.strip()])
            visited = visited_dict.get('{0}'.format(web_page))
            print('-' * 80)
            print('VALID LINK; Depth: {0}; Visited: {1}'.format(depth, visited))
            print('-' * 80)
            yield {'text': ext_text,
                   'source': web_page}
            if not visited and depth <= self.max_depth:
                for selector in response.xpath('//a/@href'):
                    if selector is not None:
                        link = selector.get()
                        request = response.follow(link, callback=self.parse)
                        request.meta['visited'] = visited_dict
                        request.meta['visited'].update({'{0}'.format(web_page): 1})
                        request.meta['depth'] = depth + 1
                        print('*' * 80)
                        print(link, request.meta['visited'])
                        print('*' * 80)
                        yield request
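As an aside, a minimal sketch of how those two modifications can also be expressed with Scrapy's built-in machinery: the DEPTH_LIMIT setting handles the traversal depth, and the default request dupefilter already skips URLs that have been visited before. The start URL here is illustrative, not taken from the question:

import scrapy


class QuotesDepthSketch(scrapy.Spider):
    name = "quotes_depth_sketch"
    start_urls = ["https://quotes.toscrape.com/"]  # illustrative start URL
    custom_settings = {
        "DEPTH_LIMIT": 1,  # stop following links beyond this depth
        # duplicate requests are dropped by Scrapy's default dupefilter
    }

    def parse(self, response):
        yield {
            "source": response.url,
            "text": " ".join(t.strip() for t in response.xpath("//body//text()").getall() if t.strip()),
        }
        for href in response.xpath("//a/@href").getall():
            # response.follow() resolves relative links and yields a Request
            yield response.follow(href, callback=self.parse)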
I'm getting the following traceback but am unsure how to refactor it.
ValueError: Missing scheme in request url: #mw-head
Full code:
class MissleSpiderBio(scrapy.Spider):
    name = 'missle_spider_bio'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/...']
This is the part giving me issues (I believe):
    def parse(self, response):
        filename = response.url.split('/')[-1]
        table = response.xpath('///div/table[2]/tbody')
        rows = table.xpath('//tr')
        row = rows[2]
        row.xpath('td//text()')[0].extract()
        wdata = {}
        for row in response.xpath('//*[@class="wikitable"]//tbody//tr'):
            for link in response.xpath('//a/@href'):
                link = link.extract()
                if link.strip() != '':
                    yield Request(link, callback=self.parse)
                    # wdata.append(link)
                else:
                    yield None
                # wdata = {}
                # wdata['link'] = BASE_URL +
                # row.xpath('a/@href').extract()  # [0]
                wdata['link'] = BASE_URL + link
                request = scrapy.Request(wdata['link'],
                                         callback=self.get_mini_bio, dont_filter=True)
                request.meta['item'] = MissleItem(**wdata)
                yield request
here is the second part of the code:
    def get_mini_bio(self, response):
        BASE_URL_ESCAPED = 'http:\/\/en.wikipedia.org'
        item = response.meta['item']
        item['image_urls'] = []
        img_src = response.xpath('//table[contains(@class, "infobox")]//img/@src')
        if img_src:
            item['image_urls'] = ['http:' + img_src[0].extract()]
        mini_bio = ''
        paras = response.xpath('//*[@id="mw-content-text"]/p[text() or normalize-space(.)=""]').extract()
        for p in paras:
            if p == '<p></p>':
                break
            mini_bio += p
        mini_bio = mini_bio.replace('href="/wiki', 'href="' + BASE_URL + '/wiki')
        mini_bio = mini_bio.replace('href="#', item['link'] + '#')
        item['mini_bio'] = mini_bio
        yield item
I tried refactoring but am now getting:
ValueError: Missing scheme in request url: #mw-head
Any help would be immensely appreciated.
Looks like you were on the right track with the commented-out [0].
xpath().extract()  # returns a list of strings
You need to select a single string with [0]:
row.xpath('a/@href').extract()
That expression evaluates to a list, NOT a string. When you pass the URL to the Request object, Scrapy expects a string, not a list.
To fix this, you have a few options:
You can use LinkExtractors which will allow you to search a page for links and automatically create scrapy request objects for those links:
https://doc.scrapy.org/en/latest/topics/link-extractors.html
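For example, a minimal sketch of that approach applied to the spider above; the start URL is elided in the question, so the placeholder is kept, and restricting the extractor to the article body is an illustrative choice:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MissleSpiderBio(CrawlSpider):
    name = 'missle_spider_bio'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/...']  # placeholder, as in the question

    rules = (
        # only follow links inside the article body and parse every page we land on
        Rule(LinkExtractor(restrict_xpaths='//*[@id="mw-content-text"]'),
             callback='parse_page',
             follow=True),
    )

    def parse_page(self, response):
        # the extractor has already resolved relative hrefs to absolute URLs here
        yield {'url': response.url}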
OR
You could run a for loop and go through each of the links:
from scrapy import Request

for link in response.xpath('//a/@href').extract():
    if link.strip() != '':
        # urljoin() turns relative links such as "#mw-head" into absolute URLs,
        # which is what the "Missing scheme" error is complaining about
        yield Request(response.urljoin(link), callback=self.parse)
You can add whatever string filters you want to that code
OR
If you just want the first link, you can use .extract_first() instead of .extract()
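For example, inside the row loop from the question, something along these lines (paired with response.urljoin() so relative links still get a scheme):

link = row.xpath('a/@href').extract_first()  # first href in the row, or None
if link:
    yield scrapy.Request(response.urljoin(link), callback=self.parse)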
I am making a web crawler. I'm not using Scrapy or anything; I'm trying to have my script do most things itself. I have searched for the issue but can't seem to find anything that helps with the error. I've tried switching some of the variables around to narrow down the problem. I am getting an error on line 24 saying IndexError: string index out of range. The functions run on the first URL (the original URL), then the second, and fail on the third in the original array. I'm lost; any help would be greatly appreciated! Note: I'm only printing everything for testing; I'll eventually have it written to a text file.
import requests
from bs4 import BeautifulSoup

# Creating requests from user input
url = raw_input("Please enter a domain to crawl, without the 'http://www' part : ")


def makeRequest(url):
    r = requests.get('http://' + url)
    # Adding in BS4 for finding <a> tags in the HTML
    soup = BeautifulSoup(r.content, 'html.parser')
    # Collects every <a> tag found in the page
    output = soup.find_all('a')
    return output


def makeFilter(link):
    # Creating an array for our links
    found_link = []
    for a in link:
        a = a.get('href')
        a_string = str(a)
        # if statements to filter our links
        if a_string[0] == '/':  # this is the line with the error
            # Relative links
            found_link.append(a_string)
        if 'http://' + url in a_string:
            # Links from the same site
            found_link.append(a_string)
        if 'https://' + url in a_string:
            # Links from the same site with SSL
            found_link.append(a_string)
        if 'http://www.' + url in a_string:
            # Links from the same site
            found_link.append(a_string)
        if 'https://www.' + url in a_string:
            # Links from the same site with SSL
            found_link.append(a_string)
        # else:
        #     found_link.write(a_string + '\n')  # testing only
    output = found_link
    return output


# Function for removing duplicates
def remove_duplicates(values):
    output = []
    seen = set()
    for value in values:
        if value not in seen:
            output.append(value)
            seen.add(value)
    return output


# Run the functions in this order: make the request -> filter the links -> remove duplicates
def createURLList(values):
    requests = makeRequest(values)
    new_list = makeFilter(requests)
    filtered_list = remove_duplicates(new_list)
    return filtered_list


result = createURLList(url)
# print result

# For verifying and crawling resulting pages
for b in result:
    sub_directories = createURLList(url + b)
    crawler = []
    crawler.append(sub_directories)
    print crawler
After a_string = str(a), try adding:
if not a_string:
    continue
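For what it's worth, the IndexError comes from anchors whose href is empty (href=""): a.get('href') then returns an empty string and a_string[0] has nothing to index. A slightly more defensive version of the filter, along the same lines as the fix above, could look like this (only the start of the loop changes):

def makeFilter(link):
    # Creating an array for our links
    found_link = []
    for a in link:
        a = a.get('href')
        if a is None:
            # <a> tags without an href give None
            continue
        a_string = str(a).strip()
        if not a_string:
            # an empty href="" would make a_string[0] raise IndexError
            continue
        if a_string.startswith('/'):
            # Relative links
            found_link.append(a_string)
        # ... the remaining same-site checks stay as they are ...
    return found_link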
I am using Scrapy to grab stock data from Yahoo! Finance.
Sometimes I need to loop over several pages, 19 in this example, in order to get all of the stock data.
Previously (when I knew there would only be two pages), I would use one function for each page, like so:
def stocks_page_1(self, response):
    returns_page1 = []
    # Grabs data here...
    current_page = response.url
    next_page = current_page + "&z=66&y=66"
    yield Request(next_page, self.stocks_page_2, meta={'returns_page1': returns_page1})


def stocks_page_2(self, response):
    # Grab data again...
Now, instead of writing 19 or more functions, I was wondering if there is a way to loop within a single function to grab all of the data from all of the pages available for a given stock.
Something like this:
for x in range(30):  # 30 was randomly selected
    current_page = response.url
    # Grabs data
    # Check if there is a 'next' page:
    if response.xpath('//td[@align="right"]/a[@rel="next"]').extract() != ' ':
        u = x * 66
        next_page = current_page + "&z=66&y={0}".format(u)
        # Go to the next page somehow within the function???
Updated code (it works, but only returns one page of data):
import numpy

from scrapy.exceptions import CloseSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Website, data_intervals, RFR, counter and required_amount_of_returns
# are defined elsewhere in the project.


class DmozSpider(CrawlSpider):
    name = "dnot"
    allowed_domains = ["finance.yahoo.com", "http://eoddata.com/"]
    start_urls = ['http://finance.yahoo.com/q?s=CAT']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//td[@align="right"]/a[@rel="next"]'),
             callback='stocks1',
             follow=True),
    ]

    def stocks1(self, response):
        returns = []
        rows = response.xpath('//table[@class="yfnc_datamodoutline1"]//table/tr')[1:]
        for row in rows:
            cells = row.xpath('.//td/text()').extract()
            try:
                values = cells[-1]
                try:
                    float(values)
                    returns.append(values)
                except ValueError:
                    continue
            except ValueError:
                continue

        unformatted_returns = response.meta.get('returns_pages')
        returns = [float(i) for i in returns]
        global required_amount_of_returns, counter
        if counter == 1 and "CAT" in response.url:
            required_amount_of_returns = len(returns)
        elif required_amount_of_returns == 0:
            raise CloseSpider("'Error with initiating required amount of returns'")

        counter += 1
        print counter

        # Iterator to calculate rate of return
        # ====================================
        if data_intervals == "m":
            k = 12
        elif data_intervals == "w":
            k = 4
        else:
            k = 30
        sub_returns_amount = required_amount_of_returns - k
        sub_returns = returns[:sub_returns_amount]
        rate_of_return = []
        if len(returns) == required_amount_of_returns or "CAT" in response.url:
            for number in sub_returns:
                numerator = number - returns[k]
                rate = numerator / returns[k]
                if rate == '':
                    rate = 0
                rate_of_return.append(rate)
                k += 1

        item = Website()
        items = []
        item['url'] = response.url
        item['name'] = response.xpath('//div[@class="title"]/h2/text()').extract()
        item['avg_returns'] = numpy.average(rate_of_return)
        item['var_returns'] = numpy.cov(rate_of_return)
        item['sd_returns'] = numpy.std(rate_of_return)
        item['returns'] = returns
        item['rate_of_returns'] = rate_of_return
        item['exchange'] = response.xpath('//span[@class="rtq_exch"]/text()').extract()
        item['ind_sharpe'] = ((numpy.average(rate_of_return) - RFR) / numpy.std(rate_of_return))
        items.append(item)
        yield item
You see, a parse callback is just a function that takes the response and returns or yields either Items or Requests or both. There is no issue at all with reusing these callbacks, so you can just pass the same callback for every request.
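For illustration only, since the recommendation below is a CrawlSpider instead: a rough sketch of reusing one callback and carrying the accumulated data forward via Request.meta. The "next" XPath comes from the question; the spider name, start URL and fields are illustrative:

import scrapy


class StocksMetaSketch(scrapy.Spider):
    name = "stocks_meta_sketch"
    start_urls = ["http://finance.yahoo.com/q/hp?s=CAT"]  # illustrative

    def parse(self, response):
        returns = response.meta.get("returns", [])
        # ... grab this page's data and extend `returns` here ...

        next_link = response.xpath('//td[@align="right"]/a[@rel="next"]/@href').extract_first()
        if next_link:
            # same callback for every page; pass the accumulated data along
            yield scrapy.Request(response.urljoin(next_link),
                                 callback=self.parse,
                                 meta={"returns": returns})
        else:
            # last page reached: emit everything collected so far
            yield {"returns": returns}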
Now, you could pass the current page info along using Request meta, as sketched above, but instead I'd leverage CrawlSpider to crawl across every page. It's really easy; start by generating the spider from the command line:
scrapy genspider --template crawl finance finance.yahoo.com
Then write it like this:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
Scrapy 1.0 has deprecated the scrapy.contrib namespace for the modules above, but if you're stuck with 0.24, use scrapy.contrib.linkextractors and scrapy.contrib.spiders.
from yfinance.items import YfinanceItem
class FinanceSpider(CrawlSpider):
    name = 'finance'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['http://finance.yahoo.com/q/hp?s=PWF.TO&a=04&b=19&c=2005&d=04&e=19&f=2010&g=d&z=66&y=132']

    rules = (
        Rule(LinkExtractor(restrict_css='[rel="next"]'),
             callback='parse_items',
             follow=True),
    )
LinkExtractor will pick up the links in the response to follow, but it can be limited with XPath (or CSS) and regular expressions. See documentation for more.
Rules will follow the links and call the callback on every response. follow=True will keep extracting links on every new response, but it can be limited by depth. See documentation again.
    def parse_items(self, response):
        for line in response.css('.yfnc_datamodoutline1 table tr')[1:-1]:
            yield YfinanceItem(date=line.css('td:first-child::text').extract()[0])
Just yield the Items, since Requests for the next pages will be handled by the CrawlSpider Rules.
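To try such a spider out and dump the scraped items, Scrapy's built-in feed export is enough, for example:
scrapy crawl finance -o items.json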