I'm trying to scrape a single URL with Scrapy. I don't want it to crawl, just parse the item, run the pipelines and return. My pipeline just updates the database. The following code is what I've done so far; it takes around 3 seconds, but most of that time seems to be spent loading Scrapy. Is there a better way to do this?
Ideally I want to parse a single URL from a Python script and not from the command line.
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]
    def parse(self, response):
        if 'item.asp' in response.url:
            yield Request(response.url, callback=self.parse_item)
Then I'm running it from the command line like this:
    time scrapy crawl --loglevel=DEBUG MySpider -a start_url="www.example.com"
I also tried the following, but it never worked with the pipelines parameter:
    time scrapy parse "www.example.com" --spider=MySpider --callback parse_item --pipelines AddToDB
Check the documentation for scrapy parse: http://doc.scrapy.org/en/latest/topics/commands.html?highlight=parse#std:command-parse
You are misunderstanding the --pipelines argument. It is a flag that takes no value: it sends the scraped items through all of the pipelines enabled in settings.py, so just run the command without AddToDB.
If you want to keep some pipelines from running, it gets trickier. One option is to subclass your spider, add a custom_settings class attribute and restrict the pipelines there.
So in your case something like:
    class MySpider2(MySpider):
        name = 'spider2'
        # ITEM_PIPELINES maps each pipeline path to its order value
        custom_settings = {'ITEM_PIPELINES': {'project.pipelines.AddToDB': 300}}
and then use scrapy parse 'http://example.com' --spider=spider2 --pipelines.
Related
I'm working on a small scraping platform using Django and Scrapy (with scrapyd as the API). The default spider works as expected: using ScrapydAPI (python-scrapyd-api) I pass a URL from Django and scrape the data, and I even save the results as JSON to a Postgres instance. This works for a SINGLE URL passed as a parameter.
When I try to pass a list of URLs, Scrapy only takes the first URL from the list. I don't know whether this is down to how Python or ScrapydAPI treats these arguments.
    # views.py
    # This is how I pass parameters from Django
    task = scrapyd.schedule(
        project=scrapy_project,
        spider=scrapy_spider,
        settings=scrapy_settings,
        url=urls
    )
    # default_spider.py
    def __init__(self, *args, **kwargs):
        super(SpiderMercadoLibre, self).__init__(*args, **kwargs)
        self.domain = kwargs.get('domain')
        self.start_urls = [self.url] # list(kwargs.get('url'))<--Doesn't work
        self.allowed_domains = [self.domain]
    # Setup to tell Scrapy to make calls from same URLs
    def start_requests(self):
        ...
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, meta={'original_url': url}, dont_filter=True)
Of course I can make some changes to my model so I can save every result iterating from the list of URLs and scheduling each URL using ScrapydAPI, but I'm wondering if this is a limitation of scrapyd itself or am I missing something about Python mechanics.
This is how ScrapydAPI is processing the schedule method:
    def schedule(self, project, spider, settings=None, **kwargs):
        """
        Schedules a spider from a specific project to run. First class, maps
        to Scrapyd's scheduling endpoint.
        """
        url = self._build_url(constants.SCHEDULE_ENDPOINT)
        data = {
            'project': project,
            'spider': spider
        }
        data.update(kwargs)
        if settings:
            setting_params = []
            for setting_name, value in iteritems(settings):
                setting_params.append('{0}={1}'.format(setting_name, value))
            data['setting'] = setting_params
        json = self.client.post(url, data=data, timeout=self.timeout)
        return json['jobid']
I think I'm implementing everything as expected, but every time, no matter which approach I use, only the first URL from the list is scraped.
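One possible workaround, sketched under the assumption that scrapyd hands each spider argument to the spider as a single string: join the URLs into one comma-separated value on the Django side and split it again in the spider's __init__. The helper names join_urls and split_urls are hypothetical and not part of Scrapy or python-scrapyd-api.

```python
def join_urls(urls):
    """Pack a list of URLs into one string for a single spider argument."""
    return ",".join(urls)


def split_urls(arg):
    """Unpack the comma-separated argument inside the spider."""
    if not arg:
        return []
    return [u.strip() for u in arg.split(",") if u.strip()]


# Round trip with made-up sample URLs
packed = join_urls(["http://example.com/a", "http://example.com/b"])
start_urls = split_urls(packed)
```

In views.py this would mean passing url=join_urls(urls) to scrapyd.schedule, and in the spider setting self.start_urls = split_urls(kwargs.get('url')). URLs that themselves contain commas would need a different separator or JSON encoding.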
I am running scrapy on Anaconda and have tried to run example code from this DigitalOcean guide as shown below:
    import scrapy
    from scrapy import Spider
    class BrickSetSpider(scrapy.Spider):
        name = "brickset_spider"
        start_urls = ['http://brickset.com/sets/year-2016']
I am a beginner with Scrapy, so keep this in mind. This code executes but no output is shown. There is supposed to be output based on the article I got the code from. Please let me know how to view the information the spider gathers. I am running the module from IDLE; if I try to do "runspider" in cmd, it says it cannot find my Python file, even though I can see the file in its directory and open it in IDLE. Thanks in advance.
Your spider is missing a callback method to handle the response from http://brickset.com/sets/year-2016.
Try defining a callback method like this:
    import scrapy
    from scrapy import Spider
    class BrickSetSpider(scrapy.Spider):
        name = "brickset_spider"
        start_urls = ['http://brickset.com/sets/year-2016']
        def parse(self, response):
            self.log('I visited: {}'.format(response.url))
By default, Scrapy calls the parse method defined in your spider to handle the responses for the requests that your spider generates.
Have a look at the official Scrapy tutorial too: https://doc.scrapy.org/en/latest/intro/tutorial.html
I'm using pipelines to cache the documents from Scrapy crawls into a database, so that I can reparse them if I change the item parsing logic without having to hit the server again.
What's the best way to have Scrapy process from the cache instead of trying to perform a normal crawl?
I like Scrapy's support for CSS and XPath selectors; otherwise I would just hit the database separately with an lxml parser.
For a time, I wasn't caching the document at all and using Scrapy in a normal fashion - parsing the items on the fly - but I've found that changing the item logic requires a time and resource intensive recrawl. Instead, I'm now caching the document body along with the item parse, and I want to have the option to have Scrapy iterate through those documents from a database instead of crawling the target URL.
How do I go about modifying Scrapy to give me the option to pass it a set of documents and then parsing them individually as if it had just pulled them down from the web?
I think a custom Downloader Middleware is a good way to go. The idea is to have this middleware return the source code directly from the database and not let Scrapy make any HTTP requests.
Sample implementation (not tested and definitely needs error-handling):
    import re
    import MySQLdb
    from scrapy.http import HtmlResponse
    from scrapy.exceptions import IgnoreRequest
    class CustomDownloaderMiddleware(object):
        def __init__(self, db_settings):
            # db_settings holds the keyword arguments for MySQLdb.connect()
            self.connection = MySQLdb.connect(**db_settings)
            self.cursor = self.connection.cursor()
        @classmethod
        def from_crawler(cls, crawler):
            # read the DATABASE dict from the project settings
            return cls(crawler.settings.getdict('DATABASE'))
        def process_request(self, request, spider):
            # extracting product id from a url
            match = re.search(r"(\d+)$", request.url)
            if match is None:
                raise IgnoreRequest("no product id in %s" % request.url)
            product_id = match.group(1)
            # getting cached source code from the database by product id
            self.cursor.execute("""
                SELECT
                    source_code
                FROM
                    products
                WHERE
                    product_id = %s
            """, (product_id,))
            row = self.cursor.fetchone()
            if row is None:
                raise IgnoreRequest("no cached copy of %s" % request.url)
            # making HTTP response instance without actually hitting the web-site;
            # HtmlResponse (unlike plain Response) supports .css() and .xpath()
            return HtmlResponse(url=request.url, body=row[0], encoding='utf-8')
And don't forget to activate the middleware.
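Activation happens in the project's settings.py through the DOWNLOADER_MIDDLEWARES setting. A minimal sketch, assuming the class lives in a hypothetical myproject/middlewares.py module; the order value 543 is an arbitrary choice and the DATABASE values are placeholders, not taken from the answer:

```python
# settings.py (module path and all values below are assumptions)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CustomDownloaderMiddleware": 543,
}

# Keyword arguments that the middleware passes to MySQLdb.connect()
DATABASE = {
    "host": "localhost",
    "user": "scrapy",
    "passwd": "secret",
    "db": "cache",
}
```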
I'm trying to save some information between the last spider run and the current one. To make this possible, I found the Stats Collection supported by Scrapy. My code is below:
    class StatsSpider(Spider):
        name = 'stats'
        def __init__(self, crawler, *args, **kwargs):
            Spider.__init__(self, *args, **kwargs)
            self.crawler = crawler
            print(self.crawler.stats.get_value('last_visited_url'))
        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)
        def start_requests(self):
            return [Request(url)
                    for url in ['http://www.google.com', 'http://www.yahoo.com']]
        def parse(self, response):
            self.crawler.stats.set_value('last_visited_url', response.url)
            print('URL: %s' % response.url)
When I run my spider, I can see in the debugger that the stats variable is being refreshed with the new data. However, when I run my spider again (locally), the stats variable starts empty. How should I properly run my spider in order to persist the data?
I'm running it from the console:
    scrapy runspider stats.py
EDIT: If you are running it on Scrapinghub, you can use their Collections API.
You need to save this data to disk in one way or another (in a file or database).
The crawler object you're writing the data to only exists during the execution of your crawl. Once your spider finishes, that object leaves memory and you lose your data.
I suggest loading the stats from your last run in __init__, then updating them in parse as you are doing now, and then hooking up the Scrapy spider_closed signal to persist the data when the spider is done running.
If you need an example of spider_closed let me know and I'll update. But plenty of examples are readily available on the web.
Edit: I'll just give you an example: https://stackoverflow.com/a/12394371/2368836
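The load-and-persist part of that suggestion can be sketched with the standard library alone. This is only an illustration under assumptions: the helper names load_state and save_state, the JSON format and the file name are made up, and the demo writes to a temporary directory rather than a real project path.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the dict saved by the previous run, or {} on the first run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_state(path, state):
    """Write the state to disk so the next run can read it."""
    with open(path, "w") as f:
        json.dump(state, f)


# Simulate two consecutive runs against the same state file
state_file = os.path.join(tempfile.mkdtemp(), "stats_state.json")
first_run = load_state(state_file)  # nothing has been saved yet
save_state(state_file, {"last_visited_url": "http://www.yahoo.com"})
second_run = load_state(state_file)
```

In the spider, load_state would be called in __init__ and save_state from the handler connected to spider_closed (Scrapy also calls a spider method named closed(reason) for that signal), passing the values read back from self.crawler.stats.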
I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes; these will instead be configurable in a GUI.
How do I (as simply as possible) create a spider or a set of spiders with Scrapy where the domains and allowed URL regexes are dynamically configurable? E.g. I write the configuration to a file, and the spider reads it somehow.
WARNING: This answer was for Scrapy v0.7; the spider manager API has changed a lot since then.
Override the default SpiderManager class, load your custom rules from a database or somewhere else, and instantiate a custom spider with your own rules/regexes and domain_name.
In mybot/settings.py:
    SPIDER_MANAGER_CLASS = 'mybot.spidermanager.MySpiderManager'
In mybot/spidermanager.py:
    from mybot.spider import MyParametrizedSpider
    class MySpiderManager(object):
        loaded = True
        def fromdomain(self, name):
            start_urls, extra_domain_names, regexes = self._get_spider_info(name)
            return MyParametrizedSpider(name, start_urls, extra_domain_names, regexes)
        def close_spider(self, spider):
            # Put here code you want to run before spiders is closed
            pass
        def _get_spider_info(self, name):
            # query your backend (maybe a sqldb) using `name` as primary key,
            # and return start_urls, extra_domains and regexes
            ...
            return (start_urls, extra_domains, regexes)
and now your custom spider class, in mybot/spider.py:
    from scrapy.spider import BaseSpider
    class MyParametrizedSpider(BaseSpider):
        def __init__(self, name, start_urls, extra_domain_names, regexes):
            self.domain_name = name
            self.start_urls = start_urls
            self.extra_domain_names = extra_domain_names
            self.regexes = regexes
        def parse(self, response):
            ...
Notes:
You can extend CrawlSpider too if you want to take advantage of its Rules system.
To run a spider use: ./scrapy-ctl.py crawl <name>, where name is passed to SpiderManager.fromdomain and is the key to retrieve more spider info from the backend system.
Because this solution overrides the default SpiderManager, coding a classic spider (a Python module per spider) doesn't work, but I think this is not an issue for you. More info on the default spider manager: TwistedPluginSpiderManager.
What you need is to dynamically create spider classes, subclassing your favorite generic spider class as supplied by scrapy (CrawlSpider subclasses with your rules added, or XmlFeedSpider, or whatever) and adding domain_name, start_urls, and possibly extra_domain_names (and/or start_requests(), etc), as you get or deduce them from your GUI (or config file, or whatever).
Python makes it easy to perform such dynamic creation of class objects; a very simple example might be:
    from scrapy import spider
    def makespider(domain_name, start_urls,
                   basecls=spider.BaseSpider):
        return type(domain_name + 'Spider',
                    (basecls,),
                    {'domain_name': domain_name,
                     'start_urls': start_urls})
    allspiders = []
    for domain, urls in listofdomainurlpairs:
        allspiders.append(makespider(domain, urls))
This gives you a list of very bare-bone spider classes -- you'll probably want to add parse methods to them before you instantiate them. Season to taste...;-).
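The same type() technique, sketched so that it runs without Scrapy installed. GenericSpider is a hypothetical stand-in for whichever Scrapy base class you choose, the attribute is called name here because current Scrapy versions identify spiders by name rather than domain_name, and the pairs list is made-up sample data.

```python
class GenericSpider(object):
    """Hypothetical stand-in for a Scrapy base class such as scrapy.Spider."""
    name = None
    start_urls = ()

    def parse(self, response):
        raise NotImplementedError


def make_spider(name, start_urls, basecls=GenericSpider):
    # type(class_name, bases, namespace) builds a new class object at runtime
    class_name = name.replace(".", "_") + "Spider"
    namespace = {"name": name, "start_urls": list(start_urls)}
    return type(class_name, (basecls,), namespace)


# Sample (domain, urls) pairs standing in for what the GUI would supply
pairs = [
    ("example.com", ["http://example.com/"]),
    ("example.org", ["http://example.org/a", "http://example.org/b"]),
]
all_spiders = [make_spider(domain, urls) for domain, urls in pairs]
```

Each entry in all_spiders is a class, not an instance, so a parse method can still be added through the namespace dict or a mixin base class before anything is instantiated.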
Shameless self promotion on domo! You'll need to instantiate the crawler as given in the examples for your project.
You'll also need to make the crawler configurable at runtime, which simply means passing the configuration to the crawler and overriding the settings at runtime when the configuration changes.
Now it is extremely easy to configure Scrapy for these purposes:
For the first URLs to visit, you can pass them as an attribute on the spider call with -a, and use the start_requests function to set up how the spider starts.
You don't need to set the allowed_domains variable for the spiders. If you don't include that class variable, the spider will not restrict requests to any domain.
It should end up looking something like:
    from scrapy import Spider, Request
    class MySpider(Spider):
        name = "myspider"
        def start_requests(self):
            # start_url is set from the -a start_url=... argument
            yield Request(self.start_url, callback=self.parse)
        def parse(self, response):
            ...
and you should call it with:
    scrapy crawl myspider -a start_url="http://example.com"
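For the allowed URL regexes from the question, one option is to keep them in a file that the GUI writes and the spider reads at start-up. A minimal sketch, assuming a JSON file with the hypothetical keys start_urls and allow_regexes; the file name and the helper names load_config and is_allowed are made up for illustration.

```python
import json
import os
import re
import tempfile


def load_config(path):
    """Read the GUI-written config and compile its URL patterns."""
    with open(path) as f:
        raw = json.load(f)
    return {
        "start_urls": raw.get("start_urls", []),
        "allow": [re.compile(p) for p in raw.get("allow_regexes", [])],
    }


def is_allowed(url, config):
    """True if the URL matches at least one configured pattern."""
    return any(p.search(url) for p in config["allow"])


# Sample config standing in for what the GUI would write
config_path = os.path.join(tempfile.mkdtemp(), "spider_config.json")
with open(config_path, "w") as f:
    json.dump({
        "start_urls": ["http://example.com/"],
        "allow_regexes": [r"^https?://example\.com/items/\d+$"],
    }, f)

config = load_config(config_path)
```

A spider could receive the config path through -a and call is_allowed on each extracted link before yielding a Request for it.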