I'm trying to develop my Scrapy application with multiple configurations depending on my environment (e.g. development, production). My problem is that there are some settings I'm not sure how to set. For example, my database host should be "localhost" in development, but a different host in production.
How can I specify these settings when I run scrapy deploy? Can I set them with a command-line variable?
You should set the deploy options in your scrapy.cfg file. For example:
[deploy:dev]
url = http://dev_url/
[deploy:production]
url = http://production_url/
With that, you could do:
scrapyd-deploy dev
or
scrapyd-deploy production
You can refer to the answer at the following link:
https://alanbuxton.wordpress.com/2018/10/09/using-local-settings-in-a-scrapy-project/
I copy it here for quick reference:
Edit the settings.py file so it reads from additional settings files depending on a SCRAPY_ENV environment variable
Move all the settings files to a separate config directory (and change scrapy.cfg so it knows where to look)
The magic happens at the end of settings.py:
from importlib import import_module
from scrapy.utils.log import configure_logging
import logging
import os
SCRAPY_ENV = os.environ.get('SCRAPY_ENV', None)
if SCRAPY_ENV is None:
    raise ValueError("Must set SCRAPY_ENV environment var")
logger = logging.getLogger(__name__)
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
# Load if file exists; incorporate any names started with an
# uppercase letter into globals()
def load_extra_settings(fname):
    if not os.path.isfile("config/%s.py" % fname):
        logger.warning("Couldn't find %s, skipping" % fname)
        return
    mdl = import_module("config.%s" % fname)
    names = [x for x in mdl.__dict__ if x[0].isupper()]
    globals().update({k: getattr(mdl, k) for k in names})
load_extra_settings("secrets")
load_extra_settings("secrets_%s" % SCRAPY_ENV)
load_extra_settings("settings_%s" % SCRAPY_ENV)
Then, in the Python file where you want to read the variables defined in the settings, use the following code:
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
env_variable = settings.get('ENV_VARIABLE')
Related
I have a config.ini file which contains some properties but I want to read the environment variables inside the config file.
[section1]
prop1: (from environment variable) or value1
Is this possible or do I have to write a method to take care of that?
I know this is late to the party, but someone might find this handy.
What you are asking for is value interpolation combined with os.environ.
Pyramid?
The logic of doing so is unusual; generally it comes up when one has a Pyramid server (which is configured by an ini file out of the box) while wanting to imitate a Django one (which is always set up to read os.environ).
If you are using Pyramid, then pyramid.paster.get_app has the argument options: if you pass os.environ as the dictionary you can use %(variable)s in the ini. Note that this is not specific to pyramid.paster.get_app, as the next section shows (but my guess is get_app is your case).
app.py
from pyramid.paster import get_app
from waitress import serve
import os
app = get_app('production.ini', 'main', options=os.environ)
serve(app, host='0.0.0.0', port=8000, threads=50)
production.ini:
[app:main]
sqlalchemy.url = %(SQL_URL)s
...
Configparser?
The above is using configparser basic interpolation.
Say I have a file called config.ini with the line
[System]
conda_location = %(CONDA_PYTHON_EXE)s
The following will work...
import configparser, os
cfg = configparser.ConfigParser()
cfg.read('config.ini')
print(cfg.get('System', 'conda_location', vars=os.environ))
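If you look up several options this way, a tiny wrapper (get_with_env is my own hypothetical name, not part of configparser) keeps the calls short and lets you supply a fallback:
import configparser
import os

cfg = configparser.ConfigParser()
cfg.read('config.ini')

def get_with_env(section, option, default=None):
    # %(VAR)s references in config.ini resolve against os.environ;
    # if the option is missing entirely, return the default instead.
    return cfg.get(section, option, vars=os.environ, fallback=default)

print(get_with_env('System', 'conda_location'))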
Another thought: use a .env file. In config.py, add from dotenv import load_dotenv, call load_dotenv() on the next line, and it will load the environment variables from the .env file.
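A minimal sketch of that idea (the DATABASE_HOST key is just an example):
# config.py
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a .env file into os.environ

DATABASE_HOST = os.getenv('DATABASE_HOST', 'localhost')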
I have the following structure of my project
project
--project
----settings
------base.py
------development.py
------testing.py
------secrets.json
--functional_tests
--manage.py
development.py and testing.py 'inherit' from base.py
from .base import *
So, here is where I have problems.
I have the SECRET_KEY for Django in secrets.json, which is stored in settings folder
I load this key like this (I saw this in "Two Scoops of Django"):
import json
from django.core.exceptions import ImproperlyConfigured
key = "secrets.json"
with open(key) as f:
secrets = json.loads(f.read())
def get_secret(setting, secret=secrets):
try:
return secrets[setting]
except KeyError:
error_msg = "Set the {} environment variable".format(setting)
raise ImproperlyConfigured(error_msg)
SECRET_KEY = get_secret("SECRET_KEY")
But when I run python manage.py runserver
Blah-blah-blah
django.core.exceptions.ImproperlyConfigured: The SECRET_KEY setting must not be empty.
After some investigation I found the following:
If I put print(os.getcwd()) inside base.py I get /media/grimel/Home/project/ instead of /media/grimel/Home/project/project/settings/
This code works only if I replace:
key = "secrets.json"
by
key = "project/settings/secrets.json"
Personally, I don't like this solution.
So, questions:
Why is the current working directory for base.py so confusing?
What's a better approach to solving this problem?
The working directory is based on how you run the program; in your case, python manage.py runserver hints that your working directory is the one containing manage.py. Beware that this can vary when run as a WSGI script or otherwise, so your concern about using key = "project/settings/secrets.json" is valid.
One solution is to use the value of __file__ in base.py, likely to be "project/settings/base.py". I would use something like
import os
BASE_DIR = os.path.dirname(__file__)
key = os.path.join(BASE_DIR, "secrets.json")
To make life simpler, why not move secrets.json to your project root and reference it with
import os
key = os.path.join(BASE_DIR, "secrets.json")
directly. This is platform independent, saving you the need to override BASE_DIR at all in your settings file. Don't forget to add your settings file to version control.
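Putting the first option together, a minimal sketch of the top of settings/base.py, reusing the question's get_secret() helper with a __file__-based path (keeping secrets.json inside the settings folder):
import json
import os

from django.core.exceptions import ImproperlyConfigured

BASE_DIR = os.path.dirname(__file__)  # .../project/settings

with open(os.path.join(BASE_DIR, "secrets.json")) as f:
    secrets = json.loads(f.read())

def get_secret(setting, secrets=secrets):
    try:
        return secrets[setting]
    except KeyError:
        raise ImproperlyConfigured("Set the {} key in secrets.json".format(setting))

SECRET_KEY = get_secret("SECRET_KEY")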
I have this folder structure for django
settings/dev.py
settings/prod.py
settings/test.py
Then I have common settings in settings/common.py, in which I check the ENV variable like:
if PROD:
from settings.prod import *
Based on the ENV variable, one of them will be active.
I want to use something like this in my code
from myapp import settings
rather than
from myapp.settings import dev
This is the method I follow; I learnt it from the book Two Scoops of Django.
Have a file, such as, settings/common.py which will contain the properties/settings which are common in dev, prod and test environment. (You already have this.)
The other 3 files should:
Import the common settings from settings/common.py by adding the line from .common import *
Contain the settings for their own corresponding environment.
The manage.py file decides which settings file to import depending on the OS environment variable DJANGO_SETTINGS_MODULE. So, for the test environment the value of DJANGO_SETTINGS_MODULE should be mysite.settings.test.
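For example, a minimal sketch of the relevant part of manage.py (mysite.settings.dev is an assumed default module name, adjust it to your project):
import os
import sys

if __name__ == "__main__":
    # Use the dev settings unless DJANGO_SETTINGS_MODULE is already set,
    # e.g. DJANGO_SETTINGS_MODULE=mysite.settings.test for the test environment.
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings.dev")
    from django.core.management import execute_from_command_line
    execute_from_command_line(sys.argv)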
Links for reference:
Django documentation for django-admin utility - Link
Two Scoops of Django sample project - Link
Preserve your settings folder structure and create an __init__.py there.
Use the code below in your settings/__init__.py:
import os

# DJANGO_SERVER_TYPE
#   if 1: Production Server
#   else if 2: Test Server
#   else: Development Server
server_type = os.getenv('DJANGO_SERVER_TYPE')

if server_type == '1':
    from .prod import *
elif server_type == '2':
    from .test import *
else:
    from .dev import *
Now you can set an environment variable called DJANGO_SERVER_TYPE to choose between the production, test, or development server and import the settings using:
import settings
I have made a Scrapy spider that can be successfully run from a script located in the root directory of the project. As I need to run multiple spiders from different projects from the same script (this will be a django app calling the script upon the user's request), I moved the script from the root of one of the projects to the parent directory. For some reason, the script is no longer able to get the project's custom settings in order to pipeline the scraped results into the database tables. Here is the code from the scrapy docs I'm using to run the spider from a script:
def spiderCrawl():
    settings = get_project_settings()
    settings.set('USER_AGENT', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
    process = CrawlerProcess(settings)
    process.crawl(MySpider3)
    process.start()
Is there some extra module that needs to be imported in order to get the project settings from outside of the project? Or does there need to be some additions made to this code? Below I also have the code for the script running the spiders, thanks.
from ticket_city_scraper.ticket_city_scraper import *
from ticket_city_scraper.ticket_city_scraper.spiders import tc_spider
from vividseats_scraper.vividseats_scraper import *
from vividseats_scraper.vividseats_scraper.spiders import vs_spider
tc_spider.spiderCrawl()
vs_spider.spiderCrawl()
Thanks to some of the answers already provided here, I realised scrapy wasn't actually importing the settings.py file. This is how I fixed it.
TL;DR: Make sure you set the SCRAPY_SETTINGS_MODULE environment variable to your actual settings module. I'm doing this in the __init__() method of Scraper.
Consider a project with the following structure.
my_project/
    main.py                    # Where we are running scrapy from
    scraper/
        run_scraper.py         # Call from main goes here
        scrapy.cfg             # deploy configuration file
        scraper/               # project's Python module, you'll import your code from here
            __init__.py
            items.py           # project items definition file
            pipelines.py       # project pipelines file
            settings.py        # project settings file
            spiders/           # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py  # Contains the QuotesSpider class
Basically, the command scrapy startproject scraper was executed in the my_project folder. I've added a run_scraper.py file to the outer scraper folder, a main.py file to my root folder, and quotes_spider.py to the spiders folder.
My main file:
from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()
My run_scraper.py file:
from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os

class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings'  # The path seen from root, i.e. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
Also, note that the settings might require a look-over, since the path needs to be according to the root folder (my_project, not scraper).
So in my case:
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'
And repeat for all the settings variables you have!
It should work; can you share your scrapy log file?
Edit:
Your approach will not work, because when you execute the script it looks for the default settings in this order:
if you have set the SCRAPY_SETTINGS_MODULE environment variable, it uses that settings module
if you have a scrapy.cfg file in the present directory from where you are executing your script, and that file points to a valid settings.py module, it will load those settings
else it will run with the vanilla settings provided by scrapy (your case)
Solution 1
Create a cfg file inside that directory (outside the project folder) and give it a path to the valid settings.py file.
Solution 2
Make your parent directory a package, so that an absolute path will not be required and you can use a relative path,
i.e. python -m cron.project1
Solution 3
You can also try something like this:
Leave the script where it is, inside the project directory, where it is working.
Create a sh file:
Line 1: cd to the first project's location (root directory)
Line 2: python script1.py
Line 3: cd to the second project's location
Line 4: python script2.py
Now you can execute the spiders via this sh file when requested by Django.
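If the Django app triggers the runs from Python rather than a sh file, a rough equivalent (the paths and script names below are placeholders) is to run each script with its project root as the working directory, so each one finds its own scrapy.cfg:
import subprocess

# Placeholder paths and script names: point these at each project's root and runner script.
subprocess.run(['python', 'script1.py'], cwd='/path/to/first_project', check=True)
subprocess.run(['python', 'script2.py'], cwd='/path/to/second_project', check=True)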
I have used this code to solve the problem:
import os

from scrapy.settings import Settings

settings = Settings()
settings_module_path = os.environ.get('SCRAPY_ENV', 'project.settings.dev')
settings.setmodule(settings_module_path, priority='project')
print(settings.get('BASE_URL'))
This could happen because you are no longer "inside" a scrapy project, so it doesn't know how to get the settings with get_project_settings().
You can also specify the settings as a dictionary as the example here:
http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
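For example, a minimal sketch following that pattern (MySpider3 stands in for the spider class from the question):
from scrapy.crawler import CrawlerProcess

# Settings passed directly as a dict, so no scrapy.cfg or project lookup is needed.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})
process.crawl(MySpider3)
process.start()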
I used the OS module for this problem.
The Python file you are running is in one directory and your scrapy project is in a different directory. You cannot simply import the spider and run it from this script, because the current directory you are working in does not have the settings.py file or the scrapy.cfg.
import os
To show the current directory you are working in use the following code:
print(os.getcwd())
From here you are going to want to change the current directory:
os.chdir(r'\path\to\spider\folder')
Lastly, tell os which command to execute:
os.system('python scrape_file.py')
This is an addition to the answer of malla.
You can reference the settings, pipelines, and spiders modules as module variables; you don't need to pass them as strings. The big advantage is that you can run that spider from different places without adjusting the strings in the settings. You can do both: run from a script (from anywhere, even from multiple different roots) and run with scrapy crawl, without adjusting anything:
import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from ticket_city_scraper.ticket_city_scraper import settings  # your settings module

def run_spider():
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings.__name__)
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider3)
    process.start()
You can make the settings themselves module-based as well:
from . import spiders    # your spiders module
from . import pipelines  # your pipelines module

def get_full_package_name_for_class(clazz) -> str:
    return ".".join([clazz.__module__, clazz.__name__])

SPIDER_MODULES = [spiders.__name__]
NEWSPIDER_MODULE = spiders.__name__
ITEM_PIPELINES = {
    get_full_package_name_for_class(pipelines.YourScrapyPipeline): 300
}
I'd like to have something in my settings like
if ip in DEV_IPS:
    SOMESETTING = 'foo'
else:
    SOMESETTING = 'bar'
Is there an easy way to get the IP or hostname? Also, is this a bad idea?
import socket
socket.gethostbyname(socket.gethostname())
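A minimal sketch of how that could be used in settings.py (DEV_HOSTNAMES is an invented name, used only for illustration):
import socket

DEV_HOSTNAMES = {'dev-laptop', 'alice-desktop'}  # hypothetical development machines
if socket.gethostname() in DEV_HOSTNAMES:
    SOMESETTING = 'foo'
else:
    SOMESETTING = 'bar'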
However, I'd recommend against this and instead maintain multiple settings files, one for each environment you're working with.
settings/__init__.py
settings/qa.py
settings/production.py
__init__.py has all of your defaults. At the top of qa.py, and any other settings file, the first line has:
from settings import *
followed by any overrides needed for that particular environment.
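For instance, settings/qa.py could look like this (the DEBUG and DATABASE_HOST values are only examples):
# settings/qa.py
from settings import *  # pull in the defaults from settings/__init__.py

DEBUG = False
DATABASE_HOST = 'qa-db.internal'  # example override for the QA environment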
One method some shops use is to have an environment variable set on each machine. Maybe called "environment". In POSIX systems you can do something like ENVIRONMENT=production in the user's .profile file (this will be slightly different for each shell and OS). Then in settings.py you can do something like this:
import os

if os.environ['ENVIRONMENT'] == 'production':
    # Production
    DATABASE_ENGINE = 'mysql'
    DATABASE_NAME = ....
else:
    # Development