How to make allowed domain dynamic using Scrapy?

I am practicing web scraping. I want to scrape several websites and set allowed_domains so that the spider does not follow URLs on other domains.
import scrapy
class SeleniumSpider(scrapy.Spider):
    name = 'test_selenium'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/page/1/']
    def parse(self, response):
        for quote in response.css('div.quote'):
            result = {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
            print(result)
So I want allowed_domains to change whenever the URLs in start_urls change, including when they point to a different domain rather than the same one.
Thank you.

I am not sure I understand the problem: if you add new start_urls by hand, it is not much extra work to add the matching allowed_domains by hand as well.
But if you want to build allowed_domains automatically from start_urls,
then you can use __init__ to extract the domains from self.start_urls and assign them to self.allowed_domains.
import scrapy
import urllib.parse
class SeleniumSpider(scrapy.Spider):
    name = 'test_selenium'
    start_urls = ['https://quotes.toscrape.com/page/1/']
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        allowed = set()  # `set()` to keep every domain only once
        for url in self.start_urls:
            parts = urllib.parse.urlparse(url)
            allowed.add(parts.netloc)
        self.allowed_domains = list(allowed)
You can also use __init__ to set other values automatically, e.g. to read them from a file or a database, or to take them from the command line.
Full working example:
import scrapy
import urllib.parse
class SeleniumSpider(scrapy.Spider):
    name = 'test_selenium'
    #allowed_domains=['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/page/1/']
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        allowed = set()  # `set()` to keep every domain only once
        for url in self.start_urls:
            parts = urllib.parse.urlparse(url)
            #print(parts)
            allowed.add(parts.netloc)
        self.allowed_domains = list(allowed)
        for domain in self.allowed_domains:
            print("allowed:", domain)
    def parse(self, response):
        print('parse url:', response.url)
        # follow every link on the page; off-site links are filtered by allowed_domains
        for a in response.xpath('//a/@href'):
            yield response.follow(a)
# --- run without project ---
from scrapy.crawler import CrawlerProcess
c = CrawlerProcess()
c.crawl(SeleniumSpider)
c.start()
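One detail worth checking: parts.netloc keeps a port number (and any user:password prefix) if the URL has one, and that is no longer a bare domain. Below is a minimal sketch of the same extraction step using .hostname instead. The helper name domains_from_urls and the third URL are my own, only for illustration, and the code needs nothing but the standard library:

```python
from urllib.parse import urlparse

def domains_from_urls(urls):
    """Return the sorted, de-duplicated host names found in `urls`."""
    hosts = set()
    for url in urls:
        host = urlparse(url).hostname  # lower-cased, without port or credentials
        if host:  # skip strings that have no host part at all
            hosts.add(host)
    return sorted(hosts)

# Illustrative URLs; the port in the last one is only there to show it is dropped
start_urls = [
    'https://quotes.toscrape.com/page/1/',
    'https://quotes.toscrape.com/page/2/',
    'http://books.toscrape.com:80/index.html',
]
allowed_domains = domains_from_urls(start_urls)
print(allowed_domains)
```

Inside the spider you could then write self.allowed_domains = domains_from_urls(self.start_urls) in __init__.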

Related

How to change scrapy closespider itemcount while parsing

I am new to Scrapy.
Is it possible to change CLOSESPIDER_ITEMCOUNT while the spider is running?
class TestSpider(scrapy.Spider):
    name = 'tester'
    custom_settings = {'CLOSESPIDER_ITEMCOUNT': 100,}
    def start_requests(self):
        urls = ['https://google.com', 'https://amazon.com']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)
    def parse(self, response):
        if response.xpath('//*[@id="content"]') or True:  # only for testing
            # set CLOSESPIDER_ITEMCOUNT to 300
            # rest of code
            pass
I want to be able to change the value inside an if condition in the parse method.

You can get access to the crawler settings object, unfreeze the settings, change the value and then freeze the settings object again. Please note that this is not documented, so it may have unexpected effects.
class TestSpider(scrapy.Spider):
    name = 'tester'
    custom_settings = {'CLOSESPIDER_ITEMCOUNT': 100,}
    def start_requests(self):
        urls = ['https://google.com', 'https://amazon.com']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)
    def parse(self, response):
        if response.xpath('//*[@id="content"]') or True:  # only for testing
            self.crawler.settings.frozen = False
            self.crawler.settings.set("CLOSESPIDER_ITEMCOUNT", 300)
            self.crawler.settings.frozen = True
            # add the rest of the code

Crawl data from urls by passing URL to Scrapy from other *.py file

I'm using Scrapy to crawl data from a website. This is my code in the file spider.py, in the spiders folder of the Scrapy project:
class ThumbSpider(scrapy.Spider):
    userInput = readInputData('input/user_input.json')
    name = 'thumb'
    # start_urls = ['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society']
    def __init__(self, *args, **kwargs):
        super(ThumbSpider, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get('start_urls')
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        for cssThumb in self.userInput['cssThumb']:  # browse each cssThumb which user provides
            items = response.css('{0}::attr(href)'.format(cssThumb)).getall()  # access it
            for item in items:
                item = response.urljoin(item)
                yield scrapy.Request(url=item, callback=self.parse_details)
    def parse_details(self, response):
        data = response.css('div.vnnews-text-post p span::text').extract()
        with open('result/page_content.txt', 'a') as outfile:
            json.dump(data, outfile)
        yield data
I use the ThumbSpider class in main.py and run that file from the terminal:
import json
import os
import modules.misc as msc
from scrapy.crawler import CrawlerProcess
from week_7.spiders.spider import NaviSpider, ThumbSpider
process2 = CrawlerProcess()
process2.crawl(ThumbSpider, start_urls=['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society'])
process2.start()
My program gets nothing from the two URLs. But when I uncomment start_urls = ['https://vietnamnews.vn/politics-laws', 'https://vietnamnews.vn/society'], delete the __init__ and start_requests methods in ThumbSpider, and change process2.crawl(ThumbSpider, start_urls=msc.getUserChoices()) to process2.crawl(ThumbSpider) in main.py, it works well. I don't know what is happening. Can anyone help me? Thank you so much.

Having problems with a scrapy-splash script. I only get one result and my scraper does not parse other pages

I am trying to parse a list from a javascript website. When I run it, it only gives me back one entry on each column and then the spider shuts down. I have already set up my middleware settings. I am not sure what is going wrong. Thanks in advance!
import scrapy
from scrapy_splash import SplashRequest
class MalrusSpider(scrapy.Spider):
    name = 'malrus'
    allowed_domains = ['backgroundscreeninginrussia.com']
    start_urls = ['http://www.backgroundscreeninginrussia.com/publications/new-citizens-of-malta-since-january-2015-till-december-2017/']
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html')
    def parse(self, response):
        russians = response.xpath('//table[@id="tablepress-8"]')
        for russian in russians:
            yield {'name': russian.xpath('//*[@class="column-1"]/text()').extract_first(),
                   'source': russian.xpath('//*[@class="column-2"]/text()').extract_first()}
        script = """function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(0.3)
            button = splash:select("a[class=paginate_button next] a")
            splash:set_viewport_full()
            splash:wait(0.1)
            button:mouse_click()
            splash:wait(1)
            return {url = splash:url(),
                    html = splash:html()}
        end"""
        yield SplashRequest(url=response.url,
                            callback=self.parse,
                            endpoint='execute',
                            args={'lua_source': script})

The .extract_first() (now .get()) you used will always return the first result. It's not an iterator, so there is no point in calling it several times. Try the .getall() method instead, something like:
names = response.xpath('//table[@id="tablepress-8"]').xpath('//*[@class="column-1"]/text()').getall()
sources = response.xpath('//table[@id="tablepress-8"]').xpath('//*[@class="column-2"]/text()').getall()
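Once you have the two lists, you probably want one item per table row again. Here is a small sketch of that pairing step with zip. The list values are made-up placeholders standing in for what .getall() returns, and pair_columns is just a name I picked:

```python
# Hypothetical values standing in for the two lists returned by .getall()
names = ['Name A', 'Name B ', 'Name C']
sources = ['Source 1', ' Source 2', 'Source 3']

def pair_columns(names, sources):
    """Combine two parallel column lists into one dict per table row."""
    # zip stops at the shorter list, so a missing cell silently drops rows
    return [{'name': n.strip(), 'source': s.strip()}
            for n, s in zip(names, sources)]

rows = pair_columns(names, sources)
for row in rows:
    print(row)
```

In the spider you would yield each dict from parse instead of printing it. Because zip truncates to the shorter list, it is worth comparing len(names) and len(sources) first if the table can have empty cells.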

2 functions in scrapy spider and the second one not running

I am using scrapy to get the content inside some urls on a page, similar to this question here:
Use scrapy to get list of urls, and then scrape content inside those urls
I am able to get the subURLs from my start URLs (the first method). However, my second method does not seem to run, and the result file is empty. I have tested the body of that method in the Scrapy shell and it gets the info I want, but not when I run the spider.
import scrapy
from scrapy.selector import Selector
#from scrapy import Spider
from WheelsOnlineScrapper.items import Dealer
from WheelsOnlineScrapper.url_list import urls
import logging
from urlparse import urljoin
logger = logging.getLogger(__name__)
class WheelsonlinespiderSpider(scrapy.Spider):
    logger.info('Spider starting')
    name = 'wheelsonlinespider'
    rotate_user_agent = True  # lives in middleware.py and settings.py
    allowed_domains = ["https://wheelsonline.ca"]
    start_urls = urls  # this list is created in url_list.py
    logger.info('URLs retrieved')
    def parse(self, response):
        subURLs = []
        partialURLs = response.css('.directory_name::attr(href)').extract()
        for i in partialURLs:
            subURLs = urljoin('https://wheelsonline.ca/', i)
            yield scrapy.Request(subURLs, callback=self.parse_dealers)
            logger.info('Dealer ' + subURLs + ' fetched')
    def parse_dealers(self, response):
        logger.info('Beginning of page')
        dlr = Dealer()
        #Extracting the content using css selectors
        try:
            dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first() + ' ' + response.css(".dealer_head_aux_name::text").extract_first()
        except TypeError:
            dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first()
        dlr['MailingAddress'] = ','.join(response.css(".dealer_address_right::text").extract())
        dlr['PhoneNumber'] = response.css(".dealer_head_phone::text").extract_first()
        logger.info('Dealer fetched ' + dlr['DealerName'])
        yield dlr
        logger.info('End of page')

Your allowed_domains list contains the protocol (https). It should have only the domain name as per the documentation:
allowed_domains = ["wheelsonline.ca"]
Also, you should've received a message in your log:
URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://wheelsonline.ca in allowed_domains
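If the entries come from a config file or from user input, you can strip them down to the bare domain before assigning them. This is only a sketch using the standard library, and to_bare_domain is a name I made up for it:

```python
from urllib.parse import urlparse

def to_bare_domain(entry):
    """Return only the host name of `entry`, whether it is a URL or a domain."""
    # urlparse only recognises the host when a "//" precedes it,
    # so prefix bare domains with "//" before parsing.
    parsed = urlparse(entry if '://' in entry else '//' + entry)
    return parsed.hostname

raw_entries = ['https://wheelsonline.ca', 'wheelsonline.ca']
allowed_domains = sorted({to_bare_domain(e) for e in raw_entries})
print(allowed_domains)
```

Both spellings end up as the same single entry, wheelsonline.ca, which is the form allowed_domains expects.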

Creating a generic scrapy spider

My question is really how to do the same thing as a previous question, but in Scrapy 0.14.
Using one Scrapy spider for several websites
Basically, I have a GUI that takes parameters like domain, keywords, tag names, etc., and I want to create a generic spider to crawl those domains for those keywords in those tags. I've read conflicting advice, written for older versions of Scrapy, suggesting either overriding the spider manager class or creating a spider dynamically. Which method is preferred, and how do I implement and invoke the proper solution? Thanks in advance.
Here is the code that I want to make generic. It also uses BeautifulSoup. I pared it down, so hopefully I didn't remove anything crucial to understanding it.
class MySpider(CrawlSpider):
    name = 'MySpider'
    allowed_domains = ['somedomain.com', 'sub.somedomain.com']
    start_urls = ['http://www.somedomain.com']
    rules = (
        Rule(SgmlLinkExtractor(allow=('/pages/', ), deny=('', ))),
        Rule(SgmlLinkExtractor(allow=('/2012/03/')), callback='parse_item'),
    )
    def parse_item(self, response):
        contentTags = []
        soup = BeautifulSoup(response.body)
        contentTags = soup.findAll('p', itemprop="myProp")
        for contentTag in contentTags:
            matchedResult = re.search('Keyword1|Keyword2', contentTag.text)
            if matchedResult:
                print('URL Found: ' + response.url)
            pass

You could create a run-time spider which is evaluated by the interpreter. This code piece could be evaluated at runtime like so:
>>> a = open("test.py")
>>> from compiler import compile
>>> d = compile(a.read(), 'spider.py', 'exec')
>>> eval(d)
>>> MySpider
<class '__main__.MySpider'>
>>> print MySpider.start_urls
['http://www.somedomain.com']
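Note that the snippet above is Python 2: the compiler module no longer exists in Python 3, where the built-in compile() and exec() cover the same ground. Here is a rough Python 3 sketch of the idea. To keep it runnable without Scrapy, I use a plain class and an inline source string, both of which are placeholders of mine; in practice the text would come from a file such as test.py and the class would derive from a Scrapy spider:

```python
# Hypothetical source text; in practice it could be read from a file
source = '''
class MySpider:
    name = "MySpider"
    start_urls = ["http://www.somedomain.com"]
'''

namespace = {}  # the definitions land here instead of in the current module
code = compile(source, 'spider.py', 'exec')
exec(code, namespace)

MySpider = namespace['MySpider']
print(MySpider.start_urls)
```

Executing the code into its own dict keeps generated spiders from overwriting names in your module. As with any exec, only do this with source text you trust.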

I use the Scrapy Extensions approach to extend the Spider class to a class named MasterSpider that includes a generic parser.
Below is the very "short" version of my generic extended parser. Note that you'll need a renderer with a JavaScript engine (such as Selenium; BeautifulSoup alone only parses HTML and does not execute JavaScript) as soon as you start working on pages that use AJAX. You will also need a lot of additional code to manage differences between sites (scrape based on column title, handle relative vs. absolute URLs, manage different kinds of data containers, etc.).
What is interesting about the Scrapy Extension approach is that you can still override the generic parser method if something does not fit, although I never had to. The MasterSpider class checks whether certain methods have been defined (e.g. parse_start, next_url_parser...) in the site-specific spider class, to allow site-specific handling: sending a form, constructing the next_url request from elements in the page, etc.
As I'm scraping very different sites, there are always specifics to manage. That's why I prefer to keep one class per scraped site, so that I can write specific methods to handle it (pre-/post-processing apart from pipelines, request generators...).
masterspider/sitespider/settings.py
EXTENSIONS = {
    'masterspider.masterspider.MasterSpider': 500
}
masterspider/masterspider/masterspider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
class MasterSpider(Spider):
    def start_requests(self):
        if hasattr(self, 'parse_start'):  # First page requiring a specific parser
            fcallback = self.parse_start
        else:
            fcallback = self.parse
        return [Request(self.spd['start_url'],
                        callback=fcallback,
                        meta={'itemfields': {}})]
    def parse(self, response):
        sel = Selector(response)
        lines = sel.xpath(self.spd['xlines'])
        # ...
        for line in lines:
            item = genspiderItem(response.meta['itemfields'])
            # ...
            # Get request_url of detailed page and scrape basic item info
            # ...
            yield Request(request_url,
                          callback=self.parse_item,
                          meta={'item': item, 'itemfields': response.meta['itemfields']})
        for next_url in sel.xpath(self.spd['xnext_url']).extract():
            if hasattr(self, 'next_url_parser'):  # Need to process the next page URL before?
                yield self.next_url_parser(next_url, response)
            else:
                # follow the next-page URL with the generic parser
                yield Request(
                    next_url,
                    callback=self.parse,
                    meta=response.meta)
    def parse_item(self, response):
        sel = Selector(response)
        item = response.meta['item']
        for itemname, xitemname in self.spd['x_ondetailpage'].iteritems():
            item[itemname] = "\n".join(sel.xpath(xitemname).extract())
        return item
masterspider/sitespider/spiders/somesite_spider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
from masterspider.masterspider import MasterSpider
class targetsiteSpider(MasterSpider):
    name = "targetsite"
    allowed_domains = ["www.targetsite.com"]
    spd = {
        'start_url': "http://www.targetsite.com/startpage",  # Start page
        'xlines': "//td[something...]",
        'xnext_url': "//a[contains(@href,'something?page=')]/@href",  # Next pages
        'x_ondetailpage': {
            "itemprop123": u"id('someid')//text()"
        }
    }
    # def next_url_parser(self, next_url, response):  # OPTIONAL next_url regexp pre-processor
    #     ...

Instead of having the variables name, allowed_domains, start_urls and rules attached to the class, you should write a MySpider.__init__, call CrawlSpider.__init__ from it with the necessary arguments, and set name, allowed_domains etc. per object.
MyProp and the keywords should also be set within your __init__, so in the end you should have something like the code below. You don't have to add name to the arguments, because name is set by BaseSpider itself from kwargs:
class MySpider(CrawlSpider):
    def __init__(self, allowed_domains=[], start_urls=[],
                 rules=[], findtag='', finditemprop='', keywords='', **kwargs):
        self.allowed_domains = allowed_domains
        self.start_urls = start_urls
        self.rules = rules
        self.findtag = findtag
        self.finditemprop = finditemprop
        self.keywords = keywords
        # CrawlSpider.__init__ compiles self.rules, so set the attributes first
        CrawlSpider.__init__(self, **kwargs)
    def parse_item(self, response):
        contentTags = []
        soup = BeautifulSoup(response.body)
        contentTags = soup.findAll(self.findtag, itemprop=self.finditemprop)
        for contentTag in contentTags:
            matchedResult = re.search(self.keywords, contentTag.text)
            if matchedResult:
                print('URL Found: ' + response.url)

I am not sure which way is preferred, but I will tell you what I have done in the past. I am in no way sure that this is the best (or correct) way of doing this and I would be interested to learn what other people think.
I usually just override the parent class (CrawlSpider) and either pass in arguments and then initialize the parent class via super(MySpider, self).__init__() from within my own init-function or I pull in that data from a database where I have saved a list of links to be appended to start_urls earlier.

As far as crawling specific domains passed as arguments goes, I just override Spider.__init__:
class MySpider(scrapy.Spider):
    """
    This spider will try to crawl whatever is passed in `start_urls` which
    should be a comma-separated string of fully qualified URIs.
    Example: start_urls=http://localhost,http://example.com
    """
    def __init__(self, name=None, **kwargs):
        if 'start_urls' in kwargs:
            self.start_urls = kwargs.pop('start_urls').split(',')
        super(MySpider, self).__init__(name, **kwargs)
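A plain .split(',') keeps stray spaces and turns a trailing comma into an empty entry, which is easy to get when the string is typed by hand (for example after -a start_urls=... on the scrapy crawl command line). Here is a slightly more forgiving sketch of the splitting step; parse_start_urls is a hypothetical helper name, not something Scrapy provides:

```python
def parse_start_urls(raw):
    """Split a comma-separated string of URLs, dropping blanks and stray spaces."""
    return [url.strip() for url in raw.split(',') if url.strip()]

urls = parse_start_urls('http://localhost, http://example.com,')
print(urls)
```

In the spider above you would then write self.start_urls = parse_start_urls(kwargs.pop('start_urls')).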
