Getting scrapy project settings when script is outside of root directory - python

I have made a Scrapy spider that can be successfully run from a script located in the root directory of the project. As I need to run multiple spiders from different projects from the same script (this will be a django app calling the script upon the user's request), I moved the script from the root of one of the projects to the parent directory. For some reason, the script is no longer able to get the project's custom settings in order to pipeline the scraped results into the database tables. Here is the code from the scrapy docs I'm using to run the spider from a script:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def spiderCrawl():
    settings = get_project_settings()
    settings.set('USER_AGENT', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
    process = CrawlerProcess(settings)
    process.crawl(MySpider3)
    process.start()
Is there some extra module that needs to be imported in order to get the project settings from outside of the project? Or do some additions need to be made to this code? Below I also have the code for the script that runs the spiders, thanks.
from ticket_city_scraper.ticket_city_scraper import *
from ticket_city_scraper.ticket_city_scraper.spiders import tc_spider
from vividseats_scraper.vividseats_scraper import *
from vividseats_scraper.vividseats_scraper.spiders import vs_spider
tc_spider.spiderCrawl()
vs_spider.spiderCrawl()

Thanks to some of the answers already provided here, I realised scrapy wasn't actually importing the settings.py file. This is how I fixed it.
TL;DR: Make sure you set the SCRAPY_SETTINGS_MODULE environment variable to point to your actual settings.py module. I'm doing this in the __init__() method of Scraper.
Consider a project with the following structure.
my_project/
    main.py              # Where we are running scrapy from
    scraper/
        run_scraper.py   # Call from main goes here
        scrapy.cfg       # deploy configuration file
        scraper/         # project's Python module, you'll import your code from here
            __init__.py
            items.py     # project items definition file
            pipelines.py # project pipelines file
            settings.py  # project settings file
            spiders/     # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py  # Contains the QuotesSpider class
Basically, the command scrapy startproject scraper was executed in the my_project folder. I then added a run_scraper.py file to the outer scraper folder, a main.py file to the root folder, and quotes_spider.py to the spiders folder.
My main file:
from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()
My run_scraper.py file:
from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings'  # The path seen from root, ie. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
Also, note that the settings might require a look-over, since the path needs to be according to the root folder (my_project, not scraper).
So in my case:
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'
And repeat for all the settings variables you have!
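For example, if the project also registers an item pipeline, its dotted path needs the same prefix (the pipeline class name below is hypothetical):
ITEM_PIPELINES = {
    'scraper.scraper.pipelines.ScraperPipeline': 300,  # hypothetical pipeline class, path seen from my_project
}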

It should work. Can you share your scrapy log file?
Edit:
Your approach will not work, because when you execute the script, Scrapy looks for your default settings in this order:
1. if the SCRAPY_SETTINGS_MODULE environment variable (ENVVAR in Scrapy's source) is set, it loads that settings module;
2. if you have a scrapy.cfg file in the present directory from which you are executing your script, and that file points to a valid settings.py, it will load those settings;
3. otherwise it will run with the vanilla settings provided by scrapy (your case).
Solution 1
Create a scrapy.cfg file inside the directory from which you run the script (outside the project folders) and point it to the valid settings.py file.
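A minimal sketch of such a scrapy.cfg, assuming the layout from the question (the parent directory containing ticket_city_scraper/); the second project could be added under its own key, as a later answer on this page shows:
[settings]
default = ticket_city_scraper.ticket_city_scraper.settings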
Solution 2
Make your parent directory a package, so that an absolute path will not be required and you can use a relative path,
e.g. python -m cron.project1
Solution 3
Also, you can try something like this:
Leave each script where it is, inside its project directory, where it already works.
Create a shell script (the paths below are placeholders):
cd /path/to/first_project      # cd to the first project's location (root directory)
python script1.py
cd /path/to/second_project     # cd to the second project's location
python script2.py
Now you can execute the spiders via this shell script when requested by Django.

I have used this code to solve the problem:
import os

from scrapy.settings import Settings

settings = Settings()
settings_module_path = os.environ.get('SCRAPY_ENV', 'project.settings.dev')
settings.setmodule(settings_module_path, priority='project')
print(settings.get('BASE_URL'))

This could happen because you are no longer "inside" a Scrapy project, so get_project_settings() doesn't know how to find the settings.
You can also specify the settings as a dictionary, as in the example here:
http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
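A minimal sketch of that approach, reusing the USER_AGENT from the question (the MySpider3 import path is an assumption based on the question's imports; the feed settings are just the example values from the linked docs):
from scrapy.crawler import CrawlerProcess
# assuming MySpider3 lives in tc_spider, as the question's imports suggest
from ticket_city_scraper.ticket_city_scraper.spiders.tc_spider import MySpider3

process = CrawlerProcess(settings={
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',   # example feed settings, as in the linked docs
    'FEED_URI': 'items.json',
})
process.crawl(MySpider3)
process.start()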

I used the OS module for this problem.
The Python file you are running is in one directory and your Scrapy project is in a different directory. You cannot simply import the spider and run it from this script, because the current working directory does not contain the settings.py file or the scrapy.cfg.
import os
To show the current directory you are working in use the following code:
print(os.getcwd())
From here you are going to want to change the current directory:
os.chdir(r'\path\to\spider\folder')
Lastly, tell the OS which command to execute:
os.system('python scrape_file.py')

This is an addition to the answer of malla.
You can reference the settings, pipelines, and spiders modules as module objects instead of passing them as strings. The big advantage is that you can run the spider from different places without having to adjust the strings in the settings. You can do both: run it from a script (from anywhere, even from multiple different roots) and run it with scrapy crawl, without adjusting anything:
import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from ticket_city_scraper.ticket_city_scraper import settings  # your settings module


def run_spider():
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings.__name__)
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider3)  # MySpider3 is the spider class from the question
    process.start()
You can also make the settings themselves refer to modules instead of strings (inside settings.py):
from . import spiders    # your spiders module
from . import pipelines  # your pipelines module


def get_full_package_name_for_class(clazz) -> str:
    return ".".join([clazz.__module__, clazz.__name__])


SPIDER_MODULES = [spiders.__name__]
NEWSPIDER_MODULE = spiders.__name__

ITEM_PIPELINES = {
    get_full_package_name_for_class(pipelines.YourScrapyPipeline): 300,
}

Related

Scrapy - How can I load the project level settings.py while using a script to start the spider

I am trying to implement a Scrapy spider that is started from a script, as per the code below.
from twisted.internet import reactor

from scrapy.crawler import CrawlerRunner
from scrapy_app.scrapy_app.spiders.generic import GenericSpider
....

class MyProcess(object):

    def start_my_process(self, _config, _req_obj, site_urls):
        runner = CrawlerRunner()
        runner.crawl(GenericSpider,
                     config=_config,
                     reqObj=_req_obj,
                     urls=site_urls)
        deferred = runner.join()
        deferred.addBoth(lambda _: reactor.stop())
        reactor.run()
....
So, using a CrawlerRunner, I am not receiving the project-level settings.py configuration while executing the spider. The generic spider accepts three parameters, one of which is the list of start URLs.
How can we load settings.py into the CrawlerRunner process, other than by setting custom_settings inside the spider?
I am going to try to answer this as best I can; even though my situation is not 100% identical to yours, I was having similar issues.
The typical scrapy project structure looks like this:
scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
The directory containing the scrapy.cfg file is considered the root directory of the project.
In that file you will see something like this:
[settings]
default: your_project.settings
[deploy]
...
When running your main script that calls on a spider to run with a specific set of settings you should have your main.py script in the same directory as the scrapy.cfg file.
Now from main.py your code is going to have to create a CrawlerProcess or CrawlerRunner instance to run a spider, of which either can be instantiated with a settings object or dict like so:
process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'items.json'
})
---------------------------------------
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
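Since the question uses CrawlerRunner rather than CrawlerProcess, note that CrawlerRunner can be instantiated with a settings object in the same way; a minimal sketch (GenericSpider is the spider class from the question):
from twisted.internet import reactor

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy_app.scrapy_app.spiders.generic import GenericSpider

runner = CrawlerRunner(get_project_settings())  # project settings instead of the defaults
d = runner.crawl(GenericSpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()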
The dict scenario works, but it is cumbersome, so the get_project_settings() call is probably of greater interest, and I will expand upon it.
I had a large scrapy project that contained multiple spiders that shared a large number of similar settings. So I had a global_settings.py file and then specific settings contained within each spider. Because of the large number of shared settings I liked the idea of keeping everything neat and tidy in one file and not copying and pasting code.
The easiest way I have found after a lot of research is to instantiate the CrawlerProcess/Runner object with the get_project_settings() function; the catch is that get_project_settings uses the default value under [settings] in scrapy.cfg to find the project-specific settings.
So it's important to make sure that, for your project, the scrapy.cfg settings default value points to your desired settings file when you call get_project_settings().
I'll also add that if you have multiple settings files for multiple scrapy projects and you want to share the root directory, you can add those to scrapy.cfg as well, like so:
[settings]
default = your_project.settings
project1 = myproject1.settings
project2 = myproject2.settings
Adding all these settings to the root directory config file will allow you the opportunity to switch between settings at will in scripts.
As I said before, an out-of-the-box call to get_project_settings() will load the settings file referenced by the default value in scrapy.cfg (your_project.settings in the example above). However, if you want to change the settings used for the next spider run in the same process, you can modify the settings loaded for that spider before it is started.
This is slightly tricky and "hacky", but it has worked for me...
After the first call to get_project_settings(), an environment variable called SCRAPY_SETTINGS_MODULE will be set. This environment variable value will be set to whatever your default value was in the scrapy.cfg file. To alter the settings used for subsequent spiders that are run in the process instance created (CrawlerRunner/Process --> process.crawl('next_spider_to_start')), this variable will need to be manipulated.
This is what should be done to set a new settings module on a current process instance that previously had get_project_settings() instantiated with it:
import os
# Clear the old settings module
del os.environ['SCRAPY_SETTINGS_MODULE']
# Set the project environment variable (new set of settings), this should be a value in your scrapy.cfg
os.environ['SCRAPY_PROJECT'] = 'project2'
# Call get_project_settings again and set to process object
process.settings = get_project_settings()
# Run the next crawler with the updated settings module
process.crawl('next_spider_to_start')
get_project_settings() just updated the current process settings (Twisted Reactor) to myproject2.settings for your crawler process instance.
This can all be done from a main script to manipulate spiders and the settings for them. Like I said previously though, I found it easier to just have a global settings file with all the commonalities, and then spider specific settings set in the spiders themselves. This is usually much clearer.
Scrapy docs are kinda rough, hope this helps someone...

How can I put settings files in a folder in Django

I have this folder structure for django
settings/dev.py
settings/prod.py
settings/test.py
Then I have common settings in settings/common.py, in which I check the ENV variable like this:
if PROD:
    from settings.prod import *
Based on the ENV variable, one of the environment-specific settings files will be active.
I want to use something like this in my code
from myapp import settings
rather than
from myapp.settings import dev
This is the method which I follow. Learnt this from the book Two Scoops of Django.
Have a file, such as, settings/common.py which will contain the properties/settings which are common in dev, prod and test environment. (You already have this.)
The other 3 files should:
import the common settings from settings/common.py by adding the line from .common import *;
and contain the settings for their own corresponding environments.
The manage.py file decides which settings file to import depending on the OS environment variable, DJANGO_SETTINGS_MODULE. So, for the test environment, the value of DJANGO_SETTINGS_MODULE should be mysite.settings.test.
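For reference, a minimal sketch of how manage.py typically sets that default (using the mysite.settings.test module from the example above):
import os
import sys

def main():
    # falls back to the test settings unless DJANGO_SETTINGS_MODULE is already set in the environment
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'mysite.settings.test')
    from django.core.management import execute_from_command_line
    execute_from_command_line(sys.argv)

if __name__ == '__main__':
    main()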
Links for reference:
Django documentation for django-admin utility - Link
Two Scoops of Django sample project - Link
Preserve your settings folder structure and create __init__.py there.
Please use the code below in your settings/__init__.py:
import os

# DJANGO_SERVER_TYPE
#   '1': Production Server
#   '2': Test Server
#   otherwise: Development Server
server_type = os.getenv('DJANGO_SERVER_TYPE')

if server_type == '1':   # os.getenv returns a string, so compare against strings
    from .prod import *
elif server_type == '2':
    from .test import *
else:
    from .dev import *
Now you can set an environment variable called DJANGO_SERVER_TYPE to choose between the production, test, or development settings, and import them using:
import settings

When and how should I use multiple spiders in one Scrapy project

I am using Scrapy, and it is great! It is so fast to build a crawler. As the number of web sites increases, I need to create new spiders, but these web sites are all the same type,
and all of these spiders use the same items, pipelines, and parsing process.
the contents of the project directory:
test/
├── scrapy.cfg
└── test
├── __init__.py
├── items.py
├── mybasespider.py
├── pipelines.py
├── settings.py
├── spider1_settings.py
├── spider2_settings.py
└── spiders
├── __init__.py
├── spider1.py
└── spider2.py
To reduce source code redundancy, mybasespider.py contains a base spider, MyBaseSpider; 95% of the source code is in it, and all other spiders inherit from it. If a spider needs something special, it overrides some
class methods; normally only a few lines of source code are needed to create a new spider.
Place all common settings in settings.py; a spider's special settings go in [spider name]_settings.py, such as:
the special settings of spider1 in spider1_settings.py:
from settings import *
LOG_FILE = 'spider1.log'
LOG_LEVEL = 'INFO'
JOBDIR = 'spider1-job'
START_URLS = [
    'http://test1.com/',
]
the special settings of spider2 in spider2_settings.py:
from settings import *
LOG_FILE = 'spider2.log'
LOG_LEVEL = 'DEBUG'
JOBDIR = 'spider2-job'
START_URLS = [
    'http://test2.com/',
]
Scrapy uses LOG_FILE, LOG_LEVEL, and JOBDIR before launching a spider.
All URLs in START_URLS are filled into MyBaseSpider.start_urls; different spiders have different contents, but the name START_URLS used in the base spider MyBaseSpider isn't changed.
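The question doesn't show how START_URLS reaches the base spider; a minimal sketch of one way this could be wired up in mybasespider.py (not necessarily the asker's actual code):
import scrapy


class MyBaseSpider(scrapy.Spider):

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MyBaseSpider, cls).from_crawler(crawler, *args, **kwargs)
        # pull the per-spider START_URLS setting into the usual start_urls attribute
        spider.start_urls = crawler.settings.getlist('START_URLS')
        return spider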
the contents of the scrapy.cfg:
[settings]
default = test.settings
spider1 = spider1.settings
spider2 = spider2.settings
[deploy]
url = http://localhost:6800/
project = test
To run a spider, such as spider1:
export SCRAPY_PROJECT=spider1
scrapy crawl spider1
But this way can't be used to run spiders in scrapyd. The scrapyd-deploy command always uses the 'default' project name in the scrapy.cfg 'settings' section to build an egg file and deploy it to scrapyd.
I have several questions:
Is this the way to use multiple spiders in one project if I don't create a project per spider? Are there any better ways?
How can I separate a spider's special settings, as above, in a way that can run in scrapyd and still reduces source code redundancy?
If all spiders use the same JOBDIR, is it safe to run all spiders concurrently? Would the persistent spider state be corrupted?
Any insights would be greatly appreciated.
I don't know if this will answer your first question, but I use Scrapy with multiple spiders, and in the past I used the command
scrapy crawl spider1
but if I had more than one spider, this command would activate it or other modules, so I started to use this command:
scrapy runspider <your full spider1 path with the spiderclass.py>
example: "scrapy runspider home/Documents/scrapyproject/scrapyproject/spiders/spider1.py"
I hope it will help :)
As all spiders should have their own class, you could set the settings per spider with the custom_settings class attribute, so something like:
from scrapy import Spider

class MySpider1(Spider):
    name = "spider1"
    custom_settings = {'USER_AGENT': 'user_agent_for_spider1/version1'}


class MySpider2(Spider):
    name = "spider2"
    custom_settings = {'USER_AGENT': 'user_agent_for_spider2/version2'}
These custom_settings will overwrite the ones in the settings.py file, so you can still set some global ones there.
Good job! I didn't find a better way to manage multiple spiders in the documentation.
I don't know about scrapyd. But when running from the command line, you should set the environment variable SCRAPY_PROJECT to the target project.
see scrapy/utils/project.py
ENVVAR = 'SCRAPY_SETTINGS_MODULE'
...

def get_project_settings():
    if ENVVAR not in os.environ:
        project = os.environ.get('SCRAPY_PROJECT', 'default')
        init_env(project)

Launching Scrapyd with multiple configurations

I'm trying to develop my Scrapy application using multiple configurations depending on my environment (e.g. development, production). My problem is that there are some settings that I'm not sure how to set. For example, if I have to set up my database, in development it should be "localhost", and in production it has to be a different one.
How can I specify these settings when I'm doing scrapy deploy? Can I set them with a command-line variable?
You should set the deploy options in your scrapy.cfg file. For example:
[deploy:dev]
url = http://dev_url/
[deploy:production]
url = http://production_url/
With that, you could do:
scrapyd-deploy dev
or
scrapyd-deploy production
You can refer to the answer at the following link:
https://alanbuxton.wordpress.com/2018/10/09/using-local-settings-in-a-scrapy-project/
I copy it here for quick reference:
Edit the settings.py file so it reads from additional settings files depending on a SCRAPY_ENV environment variable.
Move all the settings files to a separate config directory (and change scrapy.cfg so it knows where to look).
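For example, the adjusted scrapy.cfg might look like this (the config.settings module path is an assumption based on the config directory used in the code below, not something confirmed by the linked post):
[settings]
default = config.settings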
The magic happens at the end of settings.py:
from importlib import import_module
from scrapy.utils.log import configure_logging
import logging
import os

SCRAPY_ENV = os.environ.get('SCRAPY_ENV', None)
if SCRAPY_ENV is None:
    raise ValueError("Must set SCRAPY_ENV environment var")

logger = logging.getLogger(__name__)
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

# Load if file exists; incorporate any names started with an
# uppercase letter into globals()
def load_extra_settings(fname):
    if not os.path.isfile("config/%s.py" % fname):
        logger.warning("Couldn't find %s, skipping" % fname)
        return
    mdl = import_module("config.%s" % fname)
    names = [x for x in mdl.__dict__ if x[0].isupper()]
    globals().update({k: getattr(mdl, k) for k in names})

load_extra_settings("secrets")
load_extra_settings("secrets_%s" % SCRAPY_ENV)
load_extra_settings("settings_%s" % SCRAPY_ENV)
Then, in the Python file where you want to get the variables defined in the settings, use the following code:
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
env_variable = settings.get('ENV_VARIABLE')

Adding project to Python path doesn't work

I want to test my Scrapy spider. I want to import the spider into a test file, make a test spider, and override start_urls, but I have a problem with importing it. Here is the project structure:
...product-scraper\test_spider.py
...product-scraper\oxygen\oxygen\spiders\oxygen_spider.py
...product-scraper\oxygen\oxygen\items.py
The problem is that the spider imports the Product class from items.py:
from oxygen.items import Product
ImportError: No module named items
The command scrapy crawl oxygen_spider works.
I tried changing sys.path or using site.addsitedir in all possible ways:
import os
import sys

basedir = os.path.abspath(os.path.dirname(__file__))
module_path = os.path.join(basedir, "oxygen\\oxygen")
sys.path.append(basedir)  # also tried module_path
no success :(
I use Python 2.7 on Windows.
Do you really get the error "No module named items"? Or is it something like "No module named oxygen.items"?
Also I'm not really sure why you would want to use os.path commands. Wouldn't this just work:
from items import Product
So without the "oxygen." prefix. This would, however, as far as I know, only work if Product is a class in your items.py. If it's not a class, I would suggest just using:
import items
If that does not work, please specify what Product is in your items.py
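As an aside (a sketch, not part of the original answer): since the inner oxygen package lives under product-scraper\oxygen, another option is to add that outer oxygen folder to sys.path in test_spider.py, so that from oxygen.items import Product resolves:
import os
import sys

basedir = os.path.abspath(os.path.dirname(__file__))   # ...product-scraper
# add the folder that contains the inner "oxygen" package, not the package itself
sys.path.insert(0, os.path.join(basedir, "oxygen"))

from oxygen.items import Product  # should now import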
