In Scrapy projects, we can get persistence support by defining a job directory through the JOBDIR setting, e.g.:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
But how do we do the same when running spiders using scrapy.crawler.CrawlerProcess from a Python script, as answered in How to run Scrapy from within a Python script?
As the question you reference points out, you can pass settings to the CrawlerProcess instance. So all you need to do is pass the JOBDIR setting:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'JOBDIR': 'crawls/somespider-1'  # <----- Here
})
process.crawl(MySpider)
process.start()
I have been using the method described on Stack Overflow (https://stackoverflow.com/a/43661172/5037146) to run Scrapy from a script using CrawlerRunner, so that the process can be restarted. However, I don't get any console logs when running the process through CrawlerRunner, whereas CrawlerProcess outputs the status and progress.
Code is available online: https://colab.research.google.com/drive/14hKTjvWWrP--h_yRqUrtxy6aa4jG18nJ
With CrawlerRunner you need to set up logging manually, which you can do with configure_logging(). See https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
When you use CrawlerRunner you have to configure a logger manually. You can do that with the scrapy.utils.log.configure_logging function, for example:
from twisted.internet import reactor

import scrapy.crawler
import scrapy.utils.log

from my_spider import MySpider

# Configure logging before the crawl starts
scrapy.utils.log.configure_logging(
    {
        "LOG_FORMAT": "%(levelname)s: %(message)s",
    },
)

runner = scrapy.crawler.CrawlerRunner()
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()
I am trying to implement a Scrapy spider that is started from a script, as in the code below.
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy_app.scrapy_app.spiders.generic import GenericSpider
....

class MyProcess(object):

    def start_my_process(self, _config, _req_obj, site_urls):
        runner = CrawlerRunner()
        runner.crawl(GenericSpider,
                     config=_config,
                     reqObj=_req_obj,
                     urls=site_urls)
        deferred = runner.join()
        deferred.addBoth(lambda _: reactor.stop())
        reactor.run()

    ....
So, using a CrawlerRunner, I am not receiving the project-level settings.py configuration while executing the spider. The generic spider accepts three parameters, one of which is the list of start URLs.
How can we load settings.py into the CrawlerRunner process, other than setting custom_settings inside the spider?
I am going to try to answer this as best I can, even though my situation is not 100% identical to yours; I was having similar issues.
The typical Scrapy project structure looks like this:

scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
The directory containing the scrapy.cfg file is considered the root directory of the project.
In that file you will see something like this:
[settings]
default = your_project.settings
[deploy]
...
When running a main script that calls a spider with a specific set of settings, the main.py script should be in the same directory as the scrapy.cfg file.
Now, from main.py, your code is going to have to create a CrawlerProcess or CrawlerRunner instance to run a spider; either can be instantiated with a settings object or dict, like so:
process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'items.json'
})
---------------------------------------
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
The dict scenario works but is cumbersome, so the get_project_settings() call is probably of greater interest, and I will expand on it.
I had a large scrapy project that contained multiple spiders that shared a large number of similar settings. So I had a global_settings.py file and then specific settings contained within each spider. Because of the large number of shared settings I liked the idea of keeping everything neat and tidy in one file and not copying and pasting code.
The easiest way I have found after a lot of research is to instantiate the CrawlerProcess/CrawlerRunner object with the get_project_settings() function. The catch is that get_project_settings uses the default value under [settings] in scrapy.cfg to find project-specific settings.
So it's important to make sure that, for your project, the scrapy.cfg default value points to your desired settings file when you call get_project_settings().
I'll also add that if you have multiple settings files for multiple Scrapy projects sharing the same root directory, you can add them to scrapy.cfg as well, like so:
[settings]
default = your_project.settings
project1 = myproject1.settings
project2 = myproject2.settings
Adding all these settings to the root directory's config file gives you the opportunity to switch between settings at will in your scripts.
As I said before, an out-of-the-box call to get_project_settings() will load the settings file named by the default value in scrapy.cfg (your_project.settings in the example above). However, if you want to change the settings used for the next spider run in the same process, you can modify which settings are loaded for that spider.
This is slightly tricky and "hacky", but it has worked for me...
After the first call to get_project_settings(), an environment variable called SCRAPY_SETTINGS_MODULE will be set; its value will be whatever the default value was in the scrapy.cfg file. To alter the settings used for subsequent spiders run in the process instance you created (CrawlerRunner/CrawlerProcess --> process.crawl('next_spider_to_start')), this variable needs to be manipulated.
Here is how to set a new settings module on a process instance that was previously instantiated with get_project_settings():
import os
from scrapy.utils.project import get_project_settings

# Clear the old settings module
del os.environ['SCRAPY_SETTINGS_MODULE']

# Set the project environment variable (new set of settings);
# this should be a key in your scrapy.cfg
os.environ['SCRAPY_PROJECT'] = 'project2'

# Call get_project_settings again and set it on the process object
process.settings = get_project_settings()

# Run the next crawler with the updated settings module
process.crawl('next_spider_to_start')
get_project_settings() has now updated the settings on your crawler process instance to myproject2.settings.
This can all be done from a main script to manipulate spiders and the settings for them. Like I said previously though, I found it easier to just have a global settings file with all the commonalities, and then spider specific settings set in the spiders themselves. This is usually much clearer.
Scrapy docs are kinda rough, hope this helps someone...
Hi all, I have multiple spiders running from a script, which is scheduled to run once daily.
I want to log infos and errors separately. The log filenames must be spider_infolog_[date] and spider_errlog_[date].
I am trying the following code:
spider __init__.py file:
from twisted.python import log
import logging
LOG_FILE = 'logs/spider.log'
ERR_FILE = 'logs/spider_error.log'
logging.basicConfig(level=logging.INFO, filemode='w+', filename=LOG_FILE)
logging.basicConfig(level=logging.ERROR, filemode='w+', filename=ERR_FILE)
observer = log.PythonLoggingObserver()
observer.start()
within spider:
import logging
.
.
.
logging.error(message)
If any exception happens in the spider code [e.g. I am fetching the start URLs from MySQL; if the connection fails I need to close that specific spider, not the others, because I am running all the spiders from one script], I raise:
raise CloseSpider(message)
Is the above code sufficient to close the particular spider?
EDIT #eLRuLL
import logging
from scrapy.utils.log import configure_logging
LOG_FILE = 'logs/spider.log'
ERR_FILE = 'logs/spider_error.log'
configure_logging()
logging.basicConfig(level=logging.INFO, filemode='w+', filename=LOG_FILE)
logging.basicConfig(level=logging.ERROR, filemode='w+', filename=ERR_FILE)
I have put the above code in the script that schedules the spiders. It is not working: the files are not created, but I do get log messages in the console.
EDIT 2
I have added install_root_handler=False to configure_logging(). Now it sends all the console output to the spider.log file, but errors are not differentiated:
configure_logging(install_root_handler=False)
You can do this:
from scrapy import cmdline
cmdline.execute("scrapy crawl myspider --logfile mylog.log".split())
Put that script in the same directory as scrapy.cfg.
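To get separate info and error files, as the question asks, another option is to keep install_root_handler=False and attach two file handlers at different levels yourself. A minimal sketch with the standard logging module (filenames as in the question; the callable filter keeps ERROR records out of the info file):

```python
import logging
import os

LOG_FILE = 'logs/spider.log'
ERR_FILE = 'logs/spider_error.log'
os.makedirs('logs', exist_ok=True)

# INFO and WARNING records go to the info file only
info_handler = logging.FileHandler(LOG_FILE, mode='w+')
info_handler.setLevel(logging.INFO)
info_handler.addFilter(lambda record: record.levelno < logging.ERROR)

# ERROR and above go to the error file only
error_handler = logging.FileHandler(ERR_FILE, mode='w+')
error_handler.setLevel(logging.ERROR)

root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(info_handler)
root.addHandler(error_handler)

logging.info('this line ends up in spider.log')
logging.error('this line ends up in spider_error.log')
```

With install_root_handler=False, Scrapy leaves the root logger alone, so its messages flow through whatever handlers you attach; the date-stamped filenames can be built with time.strftime.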
I am using Scrapy, and it is great! It is so fast to build a crawler. But the number of web sites is increasing, and I need to create new spiders. These web sites are of the same type: all the spiders use the same items, pipelines, and parsing process.
The contents of the project directory:
test/
├── scrapy.cfg
└── test
    ├── __init__.py
    ├── items.py
    ├── mybasespider.py
    ├── pipelines.py
    ├── settings.py
    ├── spider1_settings.py
    ├── spider2_settings.py
    └── spiders
        ├── __init__.py
        ├── spider1.py
        └── spider2.py
To reduce source code redundancy, mybasespider.py has a base spider MyBaseSpider; 95% of the source code is in it, and all the other spiders inherit from it. If a spider needs something special, it overrides some class methods; normally only a few lines of source code are needed to create a new spider.
Place all the common settings in settings.py; one spider's special settings go in [spider name]_settings.py, such as:
The special settings of spider1, in spider1_settings.py:
from settings import *

LOG_FILE = 'spider1.log'
LOG_LEVEL = 'INFO'
JOBDIR = 'spider1-job'

START_URLS = [
    'http://test1.com/',
]
The special settings of spider2, in spider2_settings.py:
from settings import *

LOG_FILE = 'spider2.log'
LOG_LEVEL = 'DEBUG'
JOBDIR = 'spider2-job'

START_URLS = [
    'http://test2.com/',
]
Scrapy uses LOG_FILE, LOG_LEVEL, and JOBDIR before launching a spider.
All the URLs in START_URLS are filled into MyBaseSpider.start_urls; different spiders have different contents, but the name START_URLS used in the base spider MyBaseSpider doesn't change.
The contents of scrapy.cfg:
[settings]
default = test.settings
spider1 = spider1.settings
spider2 = spider2.settings
[deploy]
url = http://localhost:6800/
project = test
To run a spider, such as spider1:
export SCRAPY_PROJECT=spider1
scrapy crawl spider1
But this way can't be used to run spiders under scrapyd: the scrapyd-deploy command always uses the 'default' project name from the [settings] section of scrapy.cfg to build an egg file and deploy it to scrapyd.
I have several questions:
Is this the way to use multiple spiders in one project if I don't create a project per spider? Are there any better ways?
How can I separate a spider's special settings as above, in a way that can run under scrapyd and reduces source code redundancy?
If all the spiders use the same JOBDIR, is it safe to run them all concurrently? Would the persistent spider state get corrupted?
Any insights would be greatly appreciated.
I don't know if this answers your first question, but I use Scrapy with multiple spiders, and in the past I used the command
scrapy crawl spider1
But when I had more than one spider, this command would activate other modules as well, so I started to use this command instead:
scrapy runspider <your full spider1 path with the spiderclass.py>
Example: "scrapy runspider home/Documents/scrapyproject/scrapyproject/spiders/spider1.py"
I hope it will help :)
As all spiders should have their own class, you can set the settings per spider with the custom_settings class attribute, something like:
class MySpider1(Spider):
    name = "spider1"
    custom_settings = {'USER_AGENT': 'user_agent_for_spider1/version1'}

class MySpider2(Spider):
    name = "spider2"
    custom_settings = {'USER_AGENT': 'user_agent_for_spider2/version2'}
These custom_settings will overwrite the ones in the settings.py file, so you can still set some global ones there.
Good job! I didn't find a better way to manage multiple spiders in the documentation.
I don't know about scrapyd, but when running from the command line you should set the environment variable SCRAPY_PROJECT to the target project.
See scrapy/utils/project.py:
ENVVAR = 'SCRAPY_SETTINGS_MODULE'
...

def get_project_settings():
    if ENVVAR not in os.environ:
        project = os.environ.get('SCRAPY_PROJECT', 'default')
        init_env(project)
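The lookup above can be sketched as a plain function; the cfg dict and the helper name here are illustrative stand-ins for what init_env() reads from the [settings] section of scrapy.cfg:

```python
def settings_module_for(environ, cfg_settings):
    """Illustrative stand-in for get_project_settings()'s lookup.

    cfg_settings maps scrapy.cfg [settings] keys to settings modules.
    """
    if 'SCRAPY_SETTINGS_MODULE' in environ:
        return environ['SCRAPY_SETTINGS_MODULE']
    # No explicit module: fall back to SCRAPY_PROJECT, default 'default'
    project = environ.get('SCRAPY_PROJECT', 'default')
    return cfg_settings.get(project)

cfg = {'default': 'test.settings', 'spider1': 'test.spider1_settings'}
print(settings_module_for({}, cfg))                             # test.settings
print(settings_module_for({'SCRAPY_PROJECT': 'spider1'}, cfg))  # test.spider1_settings
```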
I have three different spiders in different Scrapy projects, called REsale, REbuy and RErent, each with its own pipeline that directs its output to various MySQL tables on my server. They all run OK when called using scrapy crawl. Ultimately, I want a script that can run as a service on my Windows 7 machine and run the spiders at different intervals. At the moment, I am stuck at the Scrapy API; I can't even get it to run one of the spiders! Is there somewhere special this script needs to be saved? For now it's just in my root Python directory. Sale, Buy and Rent are the names of the spiders I would call using scrapy crawl, and sale_spider.py etc. are the spiders' .py files.
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from REsale.spiders.sale_spider import Sale
from REbuy.spiders.buy_spider import Buy
from RErent.spiders.sale_spider import Rent
spider = Buy()
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
spider = Rent()
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
spider = Sale()
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
This is returning the error:
c:\Python27>File "real_project.py", line 5, in <module>
from REsale.spiders.sale_spider import Sale
ImportError: No module named REsale.spiders.sale_spider
I am new so any help is greatly appreciated.
I suggest you look at http://scrapyd.readthedocs.org/en/latest/, a ready-made Scrapy daemon for deploying and scheduling Scrapy spiders.