How to pass the start url in scrappy rules link extractor?

How to pass the start url in scrappy rules link extractor? - python

I have a set of start urls, like below:
start_urls = [www.example.com,www.example.com/ca,wwww.example.com/ap]
Now I have written code for extracting all the urls occurring inside each start_urls like below:
rules = (Rule(
LinkExtractor(
allow_domains = ('example.com'),
attrs = ('href'),
tags = ('a'),
deny = (),
deny_extensions = (),
unique = True,
),
callback = 'parseHtml', follow = True),)
In the parseHtml function, I am parsing the the content of the links.
Now in the above sites, I have common links occurring. For those common links I need to have some sort of identification to be done based on the start_urls.
How can I achieve this using scrappy?

You could not use the CrawlSpider and pass the start_url information yourself from start_requests through all your callbacks
You could create a Spider Middleware to handle start_requests to do the same but without doing it directly on the spider, you can find a similar behaviour here

Related

Scrapy - Selecting and crawling a specific type of sitemap nodes

This is the sitemap of the website I'm crawling. The 3rd and 4th <sitemap> nodes have the urls which goes to the item details. Is there any way to apply crawling logic only to those
nodes? (like selecting them by their indices)
class MySpider(SitemapSpider):
name = 'myspider'
sitemap_urls = [
'https://www.dfimoveis.com.br/sitemap_index.xml',
]
sitemap_rules = [
('/somehow targeting the 3rd and 4th node', 'parse_item')
]
def parse_item(self, response):
# scraping the item

You don't need to use SitemapSpider, just use regex and standard spider.
def start_requests(self):
sitemap = 'https://www.dfimoveis.com.br/sitemap_index.xml'
yield scrapy.Request(url=sitemap, callback=self.parse_sitemap)
def parse_sitemap(self, response):
sitemap_links = re.findall(r"<loc>(.*?)</loc>", response.text, re.DOTALL)
sitemap_links = sitemap_links[2:4] # Only 3rd and 4th nodes.
for sitemap_link in sitemap_links:
yield scrapy.Request(url=sitemap_link, callback=self.parse)

Scrapy’s Spider subclasses, including SitemapSpider are meant to make very common scenarios very easy.
You want to do something that is rather uncommon, so you should read the source code of SitemapSpider, try to understand what it does, and either subclass SitemapSpider overriding the behavior you want to change or directly write your own spider from scratch based on the code of SitemapSpider.

How to crawl urls from sitemap whose modified date has changed using scrapy?

I'm trying to implement an incremental crawler but in this case instead of matching the url I'm trying to match the attribute of sitemap xml to check if the page is modified or not. Right now the problem is I'm not able to find a way to decipher where should I intercept the request which gets the sitemap url so that I can add logic to look from a stored <lastmod> value and return only those url whose value is changed.
Here's the xml:
<url>
<loc>https://www.example.com/hello?id=1</loc>
<lastmod>2017-12-03</lastmod>
<changefreq>Daily</changefreq>
<priority>1.0</priority>
</url>
Sitemap spider:
class ExampleSpider(SitemapSpider):
name = "example"
allowed_domains = []
sitemap_urls = ["https://www.example.com/sitemaps.xml"]
sitemap_rules = [
('/hello/', 'parse_data')
]
def parse_data(self,response):
pass
My question is: Is it possible to override the sitemap _parse_sitemap function ? As of now I had found that scrapy's sitemap spider only look for <loc> attribute. Can I override it using process_request just like we use to do in normal spiders ?

If all you need is to get the value of the lastmod and then crawl every loc that meets some condition then this should work:
import scrapy
class ExampleSpider(scrapy.spiders.CrawlSpider):
name = "example"
start_urls = ["https://www.example.com/sitemaps.xml"]
def parse(self, response):
sitemap = scrapy.selector.XmlXPathSelector(response)
sitemap.register_namespace(
# ns is just a namespace and the second param should be whatever the
# xmlns of your sitemap is
'ns', 'http://www.sitemaps.org/schemas/sitemap/0.9'
)
# this gets you a list of all the "loc" and "last modified" fields.
locsList = sitemap.select('//ns:loc/text()').extract()
lastModifiedList = sitemap.select('//ns:lastmod/text()').extract()
# zip() the 2 lists together
pageList = list(zip(locsList, lastModifiedList))
for page in pageList:
url, lastMod = page
if r.search(r'\/hello\/', url) and lastMod # ... add the rest of your condition for list modified here:
# crawl the url
yield response.follow(url, callback=self.parse_data)
def parse_data(self,response):
pass

Webscrape from 2500 links - courses of action?

I have nearly 2500 unique links, from which I want to run BeautifulSoup and gather some text captured in paragraphs on each of the 2500 pages. I could create variables for each link, but having 2500 is obviously not the most efficient course of action. The links are contained in a list like the following:
linkslist = ["http://www.website.com/category/item1","http://www.website.com/category/item2","http://www.website.com/category/item3", ...]
Should I just write a for loop like the following?
for link in linkslist:
opened_url = urllib2.urlopen(link).read()
soup = BeautifulSoup(opened_url)
...
I'm looking for any constructive criticism. Thanks!

This is a good use case for Scrapy - a popular web-scraping framework based on Twisted:
Scrapy is written with Twisted, a popular event-driven networking
framework for Python. Thus, it’s implemented using a non-blocking (aka
asynchronous) code for concurrency.
Set the start_urls property of your spider and parse the page inside the parse() callback:
class MySpider(scrapy.Spider):
name = "myspider"
start_urls = ["http://www.website.com/category/item1","http://www.website.com/category/item2","http://www.website.com/category/item3", ...]
allowed_domains = ["website.com"]
def parse(self, response):
print response.xpath("//title/text()").extract()

How about writing a function that would treat each URL separately?
def processURL(url):
pass
# Your code here
map(processURL, linkslist)
This will run your function on each url in your list. If you want to speed things up, this is easy to run in parallel:
from multiprocessing import Pool
list(Pool(processes = 10).map(processURL, linkslist))

Python Scrapy: passing properties into parser

I'm new to Scrapy and web-scraping in general so this might be a stupid question but it wouldn't be the first time so here goes.
I have a simple Scrapy spider, based on the tutorial example, that processes various URLs (in start_urls). I would like to categorise the URLs e.g. URLs A, B, and C are Category 1, while URLS D and E are Category 2, then be able to store the category on the resulting Items when the parser processes the response for each URL.
I guess I could have a separate spider for each category, then just hold the category as an attribute on the class so the parser can pick it up from there. But I was kind of hoping I could have just one spider for all the URLs, but tell the parser which category to use for a given URL.
Right now, I'm setting up the URLs in start_urls via my spider's init() method. How do I pass the category for a given URL from my init method to the parser so that I can record the category on the Items generated from the responses for that URL?

As paul t. suggested:
class MySpider(CrawlSpider):
def start_requests(self):
...
yield Request(url1, meta={'category': 'cat1'}, callback=self.parse)
yield Request(url2, meta={'category': 'cat2'}, callback=self.parse)
...
def parse(self, response):
category = response.meta['category']
...
You use start_requests to have control over the first URLs you're visiting, attaching metadata to each URL, and you can access that metadata through response.meta afterwards.
Same thing if you need to pass data from a parse function to a parse_item, for instance.

Creating a generic scrapy spider

My question is really how to do the same thing as a previous question, but in Scrapy 0.14.
Using one Scrapy spider for several websites
Basically, I have GUI that takes parameters like domain, keywords, tag names, etc. and I want to create a generic spider to crawl those domains for those keywords in those tags. I've read conflicting things, using older versions of scrapy, by either overriding the spider manager class or by dynamically creating a spider. Which method is preferred and how do I implement and invoke the proper solution? Thanks in advance.
Here is the code that I want to make generic. It also uses BeautifulSoup. I paired it down so hopefully didn't remove anything crucial to understand it.
class MySpider(CrawlSpider):
name = 'MySpider'
allowed_domains = ['somedomain.com', 'sub.somedomain.com']
start_urls = ['http://www.somedomain.com']
rules = (
Rule(SgmlLinkExtractor(allow=('/pages/', ), deny=('', ))),
Rule(SgmlLinkExtractor(allow=('/2012/03/')), callback='parse_item'),
)
def parse_item(self, response):
contentTags = []
soup = BeautifulSoup(response.body)
contentTags = soup.findAll('p', itemprop="myProp")
for contentTag in contentTags:
matchedResult = re.search('Keyword1|Keyword2', contentTag.text)
if matchedResult:
print('URL Found: ' + response.url)
pass

You could create a run-time spider which is evaluated by the interpreter. This code piece could be evaluated at runtime like so:
a = open("test.py")
from compiler import compile
d = compile(a.read(), 'spider.py', 'exec')
eval(d)
MySpider
<class '__main__.MySpider'>
print MySpider.start_urls
['http://www.somedomain.com']

I use the Scrapy Extensions approach to extend the Spider class to a class named Masterspider that includes a generic parser.
Below is the very "short" version of my generic extended parser. Note that you'll need to implement a renderer with a Javascript engine (such as Selenium or BeautifulSoup) a as soon as you start working on pages using AJAX. And a lot of additional code to manage differences between sites (scrap based on column title, handle relative vs long URL, manage different kind of data containers, etc...).
What is interresting with the Scrapy Extension approach is that you can still override the generic parser method if something does not fit but I never had to. The Masterspider class checks if some methods have been created (eg. parser_start, next_url_parser...) under the site specific spider class to allow the management of specificies: send a form, construct the next_url request from elements in the page, etc.
As I'm scraping very different sites, there's always specificities to manage. That's why I prefer to keep a class for each scraped site so that I can write some specific methods to handle it (pre-/post-processing except PipeLines, Request generators...).
masterspider/sitespider/settings.py
EXTENSIONS = {
'masterspider.masterspider.MasterSpider': 500
}
masterspider/masterspdier/masterspider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
class MasterSpider(Spider):
def start_requests(self):
if hasattr(self,'parse_start'): # First page requiring a specific parser
fcallback = self.parse_start
else:
fcallback = self.parse
return [ Request(self.spd['start_url'],
callback=fcallback,
meta={'itemfields': {}}) ]
def parse(self, response):
sel = Selector(response)
lines = sel.xpath(self.spd['xlines'])
# ...
for line in lines:
item = genspiderItem(response.meta['itemfields'])
# ...
# Get request_url of detailed page and scrap basic item info
# ...
yield Request(request_url,
callback=self.parse_item,
meta={'item':item, 'itemfields':response.meta['itemfields']})
for next_url in sel.xpath(self.spd['xnext_url']).extract():
if hasattr(self,'next_url_parser'): # Need to process the next page URL before?
yield self.next_url_parser(next_url, response)
else:
yield Request(
request_url,
callback=self.parse,
meta=response.meta)
def parse_item(self, response):
sel = Selector(response)
item = response.meta['item']
for itemname, xitemname in self.spd['x_ondetailpage'].iteritems():
item[itemname] = "\n".join(sel.xpath(xitemname).extract())
return item
masterspider/sitespider/spiders/somesite_spider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
from masterspider.masterspider import MasterSpider
class targetsiteSpider(MasterSpider):
name = "targetsite"
allowed_domains = ["www.targetsite.com"]
spd = {
'start_url' : "http://www.targetsite.com/startpage", # Start page
'xlines' : "//td[something...]",
'xnext_url' : "//a[contains(#href,'something?page=')]/#href", # Next pages
'x_ondetailpage' : {
"itemprop123" : u"id('someid')//text()"
}
}
# def next_url_parser(self, next_url, response): # OPTIONAL next_url regexp pre-processor
# ...

Instead of having the variables name,allowed_domains, start_urls and rules attached to the class, you should write a MySpider.__init__, call CrawlSpider.__init__ from that passing the necessary arguments, and setting name, allowed_domains etc. per object.
MyProp and keywords also should be set within your __init__. So in the end you should have something like below. You don't have to add name to the arguments, as name is set by BaseSpider itself from kwargs:
class MySpider(CrawlSpider):
def __init__(self, allowed_domains=[], start_urls=[],
rules=[], findtag='', finditemprop='', keywords='', **kwargs):
CrawlSpider.__init__(self, **kwargs)
self.allowed_domains = allowed_domains
self.start_urls = start_urls
self.rules = rules
self.findtag = findtag
self.finditemprop = finditemprop
self.keywords = keywords
def parse_item(self, response):
contentTags = []
soup = BeautifulSoup(response.body)
contentTags = soup.findAll(self.findtag, itemprop=self.finditemprop)
for contentTag in contentTags:
matchedResult = re.search(self.keywords, contentTag.text)
if matchedResult:
print('URL Found: ' + response.url)

I am not sure which way is preferred, but I will tell you what I have done in the past. I am in no way sure that this is the best (or correct) way of doing this and I would be interested to learn what other people think.
I usually just override the parent class (CrawlSpider) and either pass in arguments and then initialize the parent class via super(MySpider, self).__init__() from within my own init-function or I pull in that data from a database where I have saved a list of links to be appended to start_urls earlier.

As far as crawling specific domains passed as arguments goes, I just override Spider.__init__:
class MySpider(scrapy.Spider):
"""
This spider will try to crawl whatever is passed in `start_urls` which
should be a comma-separated string of fully qualified URIs.
Example: start_urls=http://localhost,http://example.com
"""
def __init__(self, name=None, **kwargs):
if 'start_urls' in kwargs:
self.start_urls = kwargs.pop('start_urls').split(',')
super(Spider, self).__init__(name, **kwargs)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to pass the start url in scrappy rules link extractor? - python

You could not use the CrawlSpider and pass the start_url information yourself from start_requests through all your callbacks You could create a Spider Middleware to handle start_requests to do the same but without doing it directly on the spider, you can find a similar behaviour here

Related

Scrapy - Selecting and crawling a specific type of sitemap nodes

How to crawl urls from sitemap whose modified date has changed using scrapy?

Webscrape from 2500 links - courses of action?

Python Scrapy: passing properties into parser

Creating a generic scrapy spider

Categories

Resources