Scrapy CrawlerRunner: Output missing - python

I have been using the method described on Stack Overflow (https://stackoverflow.com/a/43661172/5037146) to run Scrapy from a script using CrawlerRunner, so that the process can be restarted.
However, I don't get any console logs when running the process through CrawlerRunner, whereas CrawlerProcess outputs the status and progress.
Code is available online: https://colab.research.google.com/drive/14hKTjvWWrP--h_yRqUrtxy6aa4jG18nJ

With CrawlerRunner you need to manually set up logging, which you can do with configure_logging(). See https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

When you use CrawlerRunner you have to configure logging manually. You can do that with the scrapy.utils.log.configure_logging function, for example:
import scrapy.crawler
import scrapy.utils.log

from my_spider import MySpider

# configure logging before starting the crawl
scrapy.utils.log.configure_logging(
    {
        "LOG_FORMAT": "%(levelname)s: %(message)s",
    },
)

runner = scrapy.crawler.CrawlerRunner()
crawler = runner.create_crawler(MySpider)
crawler.crawl()
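For completeness, here is a minimal runnable sketch following the "Run Scrapy from a script" docs (MySpider and the log format are placeholders). Note that with CrawlerRunner you also have to start the Twisted reactor yourself; otherwise nothing runs and no output appears:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from my_spider import MySpider  # placeholder import

configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})

runner = CrawlerRunner()
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())  # stop the reactor when the crawl finishes
reactor.run()  # the script blocks here until the crawl is done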

Related

How to run and save scrapy state from a python script

In Scrapy projects, we can get persistence support by defining a job directory through the JOBDIR setting, e.g.:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
But how do I do the same when running spiders using scrapy.crawler.CrawlerProcess from a Python script, as answered in How to run Scrapy from within a Python script?
As your reference question points out, you can pass settings to the CrawlerProcess instance.
So all you need to do is pass the JOBDIR setting:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'JOBDIR': 'crawls/somespider-1'  # <----- Here
})
process.crawl(MySpider)
process.start()
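The same setting can also be layered on top of the project settings; a rough sketch (spider name and directory are placeholders). Stopping the script gracefully and running it again with the same JOBDIR should resume the crawl from the persisted state:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set('JOBDIR', 'crawls/somespider-1')  # state is persisted here between runs

process = CrawlerProcess(settings)
process.crawl('somespider')  # a spider name also works when project settings are loaded
process.start()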

How to log scrapy spiders running from script

Hi all, I have multiple spiders running from a script, which is scheduled to run once daily.
I want to log info and errors separately. The log filenames must be spider_infolog_[date] and spider_errlog_[date].
I am trying the following code.
Spider __init__ file:
from twisted.python import log
import logging
LOG_FILE = 'logs/spider.log'
ERR_FILE = 'logs/spider_error.log'
logging.basicConfig(level=logging.INFO, filemode='w+', filename=LOG_FILE)
logging.basicConfig(level=logging.ERROR, filemode='w+', filename=ERR_FILE)
observer = log.PythonLoggingObserver()
observer.start()
within spider:
import logging
.
.
.
logging.error(message)
If any exception happens in the spider code (for example, I fetch start URLs from MySQL, and if the connection fails I need to close that specific spider but not the others, because I am running all spiders from the script), I raise:
raise CloseSpider(message)
Is the above code sufficient to close that particular spider?
EDIT #eLRuLL
import logging
from scrapy.utils.log import configure_logging
LOG_FILE = 'logs/spider.log'
ERR_FILE = 'logs/spider_error.log'
configure_logging()
logging.basicConfig(level=logging.INFO, filemode='w+', filename=LOG_FILE)
logging.basicConfig(level=logging.ERROR, filemode='w+', filename=ERR_FILE)
I have put the above code in the script that schedules the spiders. It is not working: the files are not created, but I do get log messages in the console.
EDIT 2
I have added install_root_handler=False to configure_logging(); it writes all the console output to the spider.log file, but errors are not differentiated.
configure_logging(install_root_handler=False)
You can do this:
from scrapy import cmdline
cmdline.execute("scrapy crawl myspider --logfile mylog.log".split())
Put that script in the same path as your scrapy.cfg.
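That only redirects the overall log, though. To actually split info and error output into separate files when running spiders from a script, one approach (a sketch using the standard logging module; the filenames are just examples) is to install the handlers yourself after disabling Scrapy's root handler:
import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)  # stop Scrapy from installing its own handler

info_handler = logging.FileHandler('logs/spider_infolog_2016-01-01.log')
info_handler.setLevel(logging.INFO)    # INFO and above (so errors also appear here unless filtered)

error_handler = logging.FileHandler('logs/spider_errlog_2016-01-01.log')
error_handler.setLevel(logging.ERROR)  # only ERROR and above

root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(info_handler)
root.addHandler(error_handler)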

Getting scrapy project settings when script is outside of root directory

I have made a Scrapy spider that can be successfully run from a script located in the root directory of the project. As I need to run multiple spiders from different projects from the same script (this will be a Django app calling the script upon the user's request), I moved the script from the root of one of the projects to the parent directory. For some reason, the script is no longer able to get the project's custom settings in order to pipeline the scraped results into the database tables. Here is the code from the Scrapy docs I'm using to run the spider from a script:
def spiderCrawl():
    settings = get_project_settings()
    settings.set('USER_AGENT', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
    process = CrawlerProcess(settings)
    process.crawl(MySpider3)
    process.start()
Is there some extra module that needs to be imported in order to get the project settings from outside of the project? Or does there need to be some additions made to this code? Below I also have the code for the script running the spiders, thanks.
from ticket_city_scraper.ticket_city_scraper import *
from ticket_city_scraper.ticket_city_scraper.spiders import tc_spider
from vividseats_scraper.vividseats_scraper import *
from vividseats_scraper.vividseats_scraper.spiders import vs_spider
tc_spider.spiderCrawl()
vs_spider.spiderCrawl()
Thanks to some of the answers already provided here, I realised Scrapy wasn't actually importing the settings.py file. This is how I fixed it.
TL;DR: Make sure you set the SCRAPY_SETTINGS_MODULE environment variable to your actual settings module. I'm doing this in the __init__() of Scraper.
Consider a project with the following structure.
my_project/
    main.py                   # where we run scrapy from
    scraper/
        run_scraper.py        # call from main goes here
        scrapy.cfg            # deploy configuration file
        scraper/              # project's Python module, you'll import your code from here
            __init__.py
            items.py          # project items definition file
            pipelines.py      # project pipelines file
            settings.py       # project settings file
            spiders/          # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py  # contains the QuotesSpider class
Basically, the command scrapy startproject scraper was executed in the my_project folder. I then added a run_scraper.py file to the outer scraper folder, a main.py file to the root folder, and quotes_spider.py to the spiders folder.
My main file:
from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()
My run_scraper.py file:
from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os

class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings'  # the path seen from root, i.e. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # the spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
Also, note that the settings might require a look-over, since the path needs to be according to the root folder (my_project, not scraper).
So in my case:
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'
And repeat for all the settings variables you have!
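For example, if the project also registers pipelines, those dotted paths need the same root-relative treatment (a sketch; the pipeline class name is a placeholder):
# settings.py, with dotted paths written relative to my_project/
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'
ITEM_PIPELINES = {
    'scraper.scraper.pipelines.QuotesPipeline': 300,  # hypothetical pipeline class
}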
It should work; can you share your Scrapy log file?
Edit:
Your approach will not work, because when you execute the script it looks for the default settings in this order:
1. the SCRAPY_SETTINGS_MODULE environment variable, if you have set it;
2. a scrapy.cfg file in the directory you are executing the script from; if that file points to a valid settings.py, those settings are loaded;
3. otherwise it runs with the vanilla settings provided by Scrapy (your case).
Solution 1
Create a scrapy.cfg file inside the directory you run the script from (outside the project folder) and point it to the valid settings.py file.
Solution 2
Make your parent directory a package, so that an absolute path is not required and you can use a relative path, i.e. python -m cron.project1
Solution 3
You can also try something like this:
Leave the script where it is, inside the project directory, where it is working, and create an sh file:
Line 1: cd to the first project's location (its root directory)
Line 2: python script1.py
Line 3: cd to the second project's location
Line 4: python script2.py
Now you can execute the spiders via this sh file when requested by Django.
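If you would rather keep everything in Python than maintain an sh file, here is a rough sketch of the same idea using subprocess (the project paths and script names are placeholders):
import subprocess

# each script is executed from its own project root so its scrapy.cfg is picked up
PROJECTS = [
    ('/path/to/first_project', 'script1.py'),
    ('/path/to/second_project', 'script2.py'),
]

for project_dir, script in PROJECTS:
    subprocess.run(['python', script], cwd=project_dir, check=True)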
I have used this code to solve the problem:
import os
from scrapy.settings import Settings

settings = Settings()
settings_module_path = os.environ.get('SCRAPY_ENV', 'project.settings.dev')
settings.setmodule(settings_module_path, priority='project')
print(settings.get('BASE_URL'))
This could happen because you are no longer "inside" a Scrapy project, so it doesn't know how to get the settings with get_project_settings().
You can also specify the settings as a dictionary, as in the example here:
http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
I used the OS module for this problem.
The Python file you are running is in one directory and your Scrapy project is in a different directory. You cannot simply import the spider and run it from this Python script, because the current directory you are working in does not contain the settings.py file or the scrapy.cfg.
import os
To show the current directory you are working in, use the following code:
print(os.getcwd())
From here you are going to want to change the current directory:
os.chdir('/path/to/spider/folder')
Lastly, tell os which command to execute:
os.system('python scrape_file.py')
This is an addition to the answer of malla.
You can reference the settings, pipelines, spiders, etc. as modules instead of passing them as strings. The big advantage is that you can run the spider from different places without adjusting the dotted paths in the settings. You can do both: run from a script (from anywhere, even from multiple different roots) and run with scrapy crawl, without adjusting anything:
import os
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from ticket_city_scraper.ticket_city_scraper import settings  # your settings module

def run_spider():
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings.__name__)
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider3)  # the spider class from the question
    process.start()
You can make the setting itself variable:
from . import spiders    # your spiders module
from . import pipelines  # your pipelines module

def get_full_package_name_for_class(clazz) -> str:
    return ".".join([clazz.__module__, clazz.__name__])

SPIDER_MODULES = [spiders.__name__]
NEWSPIDER_MODULE = spiders.__name__
ITEM_PIPELINES = {
    get_full_package_name_for_class(pipelines.YourScrapyPipeline): 300,
}

ImportError: cannot import name ScrapyFileLogObserver

I am trying to test the Scrapy log with ScrapyFileLogObserver. In my source code, I set the import correctly:
from scrapy.log import ScrapyFileLogObserver
but I get this error when I launch my spider:
from scrapy.log import ScrapyFileLogObserver
ImportError: cannot import name ScrapyFileLogObserver
For information, I use the latest version of Scrapy (Scrapy 1.0.1).
How can I fix this bug?
In 1.0 Scrapy's logging system was completely rewritten; there is no ScrapyFileLogObserver anymore. Instead, Scrapy now uses Twisted's PythonLoggingObserver directly:
from twisted.python import log as twisted_log
observer = twisted_log.PythonLoggingObserver('twisted')
observer.start()
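If the goal was file logging, as ScrapyFileLogObserver used to provide, one way in Scrapy 1.0+ is to lean on the standard logging module (a sketch; the filename and format are just examples):
import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='log.txt',                   # example filename
    format='%(levelname)s: %(message)s',
    level=logging.INFO,
)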

Logging does not work

I am using standard logging configurations, set in the settings.py file and accessed in the program, but I get the error:
No handlers could be found for logger
It works when run from the console but does not work when run from Eclipse.
The code is as follows:
import logging
from config import settings

logger = logging.getLogger('engine')

class ReplyUser(object):
    def __init__(self):
        logger.info("Initializes ReplyUser")

    def myfun(self):
        logger.info("Hi")
        print "hi"
I am guessing the problem is in the PATH that Eclipse is using: it is unable to find settings.py, and since the handler configuration is stored in settings.py, you get the error.
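One quick way to rule this out, regardless of how Eclipse resolves settings.py, is to configure a handler explicitly at startup; a minimal sketch:
import logging

# configure a root handler so the "No handlers could be found" warning disappears
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(name)s %(levelname)s: %(message)s')

logger = logging.getLogger('engine')
logger.info("handler is now configured")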
