Using the Scrapy "crawl" command from Django - Python

I am trying to run (crawl) a Scrapy spider from Django, and the problem is that the spider can only be crawled from the top directory (the one containing scrapy.cfg). So how can that be achieved?
.../polls/management/commands/mycommand.py
from django.core.management.base import BaseCommand
from scrapy.cmdline import execute
import os


class Command(BaseCommand):
    def run_from_argv(self, argv):
        print('In run_from_argv')
        self._argv = argv
        return self.execute()

    def handle(self, *args, **options):
        # os.environ['SCRAPY_SETTINGS_MODULE'] = '/home/nabin/scraptut/newscrawler'
        execute(self._argv[1:])
And if I try
python manage.py mycommand crawl myspider
it won't work, because to use crawl I need to be in the top directory with the scrapy.cfg file. So I want to know how that is possible.

You don't need to change the working directory, unless you want to use the .cfg, which can include default options for the deploy command.
In your first approach you forgot to add the crawler path to the Python path and to set the Scrapy settings module correctly:
# file: myapp/management/commands/bot.py
import os
import sys

from django.core.management.base import BaseCommand
from scrapy import cmdline


class Command(BaseCommand):
    help = "Run scrapy"

    def handle(self, *args, **options):
        sys.path.insert(0, '/home/user/mybot')
        os.environ['SCRAPY_SETTINGS_MODULE'] = 'mybot.settings'
        # Execute expects the list args[1:] to be the actual command arguments.
        cmdline.execute(['bot'] + list(args))
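With that in place, the command would be invoked along these lines (myspider is a placeholder for whatever spider the mybot project defines; note that recent Django versions also require positional arguments to be declared via add_arguments()):
python manage.py bot crawl myspider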

OK, I have found the solution myself.
In settings.py I defined:
CRAWLER_PATH = os.path.join(os.path.dirname(BASE_DIR), 'required path')
And then did the following:
from django.conf import settings
os.chdir(settings.CRAWLER_PATH)
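Put together with the management command from the question, the whole thing might look roughly like this (a sketch: it is just the question's command with the os.chdir() call added, so the path and command name are whatever your project actually uses):

# polls/management/commands/mycommand.py (sketch)
import os

from django.conf import settings
from django.core.management.base import BaseCommand
from scrapy.cmdline import execute


class Command(BaseCommand):
    def run_from_argv(self, argv):
        self._argv = argv
        return self.execute()

    def handle(self, *args, **options):
        # Move into the directory that contains scrapy.cfg before handing over to Scrapy.
        os.chdir(settings.CRAWLER_PATH)
        execute(self._argv[1:])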

Related

Scrapy: cannot load items to spider

I cannot load Scrapy items into my Scrapy spiders. Here is my project structure:
rosen
    log
    scrapers
        scrapers
            spiders
                __init__.py
                exampleSpider.py
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
    src
        __init__.py
        otherStuff.py
    tmp
This structure was created by running scrapy startproject scrapers inside the rosen project directory.
Now, the items.py has the following code:
import scrapy
from decimal import Decimal


class someItem(scrapy.Item):
    title: str = scrapy.Field(serializer=str)
    bid: Decimal = scrapy.Field(serializer=Decimal)
And the exampleSpider.py has the following code:
from __future__ import absolute_import

import scrapy
from scrapy.loader import ItemLoader
from scrapers.scrapers.items import someItem


class someSpider(scrapy.Spider):
    name = "some"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._some_fields = someItem()

    def parse(self, response) -> None:
        some_loader = ItemLoader(item=self._some_fields, response=response)
        print(self._some_fields.keys())
The error I get is the following: runspider: error: Unable to load 'someSpider.py': No module named 'scrapers.scrapers'
I found "Scrapy: ImportError: No module named items" and tried all three of its solutions, renaming things and adding from __future__ import absolute_import. Nothing helps. Please advise.
The command that I execute is scrapy runspider exampleSpider.py. I tried it from both the spiders and rosen directories.
I do not see any virtualenv inside your directory, so I recommend creating one, e.g. under 'rosen'.
You can try this:
try:
    from scrapers.items import someItem
except ImportError:
    from scrapers.scrapers.items import someItem
Then call it with:
scrapy crawl NameOfSpider
or:
scrapy runspider path/to/spider.py
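For reference, since scrapy startproject scrapers puts scrapy.cfg in the outer scrapers folder, running the spider with scrapy crawl from that folder normally makes the single-level import resolve. A sketch of just the import lines under that assumption:

# exampleSpider.py - import section only (sketch)
import scrapy
from scrapy.loader import ItemLoader
from scrapers.items import someItem  # resolves when run via `scrapy crawl some` from the folder containing scrapy.cfg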

Scrapy: twisted.internet.error.ReactorNotRestartable from running CrawlerProcess()

I am trying to run my Scrapy spider from a script.
I am using CrawlerProcess and I only have one spider to run.
I've been stuck on this error for a while now, and I've tried a lot of changes to the settings, but every time I run the spider I get
twisted.internet.error.ReactorNotRestartable
I've been searching for a way to solve this error, and I believe you should only get it when you try to call process.start() more than once. But I didn't.
Here's my code:
import scrapy
from scrapy.utils.log import configure_logging
from scrapyprefect.items import ScrapyprefectItem
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    start_urls = ['http://www.nigeria-law.org/A.A.%20Macaulay%20v.%20NAL%20Merchant%20Bank%20Ltd..htm']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def parse(self, response):
        item = ScrapyprefectItem()
        ...
        yield item


process = CrawlerProcess(settings=get_project_settings())
process.crawl('spider')
process.start()
Error:
Traceback (most recent call last):
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/scrapyprefect/spiders/spider.py", line 59, in <module>
    process.start()
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/scrapy/crawler.py", line 309, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/twisted/internet/base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/twisted/internet/base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "/Users/pluggle/Documents/Upwork/scrapyprefect/venv/lib/python3.7/site-packages/twisted/internet/base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
I notice that this only happens when I'm trying to save my items to mongodb.
pipeline.py:
import logging

import pymongo


class ScrapyprefectPipeline(object):

    collection_name = 'SupremeCourt'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # pull in information from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        # initializing spider
        # opening db connection
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # clean up when spider is closed
        self.client.close()

    def process_item(self, item, spider):
        # how to handle each post
        self.db[self.collection_name].insert(dict(item))
        logging.debug("Post added to MongoDB")
        return item
If I change the pipeline.py to the default, which is...
import logging

import pymongo


class ScrapyprefectPipeline(object):

    def process_item(self, item, spider):
        return item
...the script runs fine.
I'm thinking this has something to do with how I set up the PyCharm settings to run the code.
So for reference, I'm also including my PyCharm settings.
I hope someone can help me. Let me know if you need more details.
Reynaldo,
thanks a lot - you saved my project!
And you pushed me to the idea that this possibly occurs because the piece of script starting the process sits in the same file as your spider definition. As a result, it is executed each time Scrapy imports your spider definition. I am not a big expert in Scrapy, but possibly it does this a few times internally, and thus we run into this error.
Your suggestion obviously solves the problem!
Another approach is to separate the spider class definition from the script that runs it. Possibly this is the approach Scrapy assumes, and that is why its "Run Scrapy from a script" documentation does not even mention this __name__ check.
So what I did is the following:
In the project root I have a sites folder, and in it a site_spec.py file. This is just a utility file with some target-site-specific information (URL structure, etc.). I mention it here only to show how you can import your various utility modules into your spider class definition;
In the project root I have a spiders folder with the my_spider.py class definition in it. In that file I import the site_spec.py file with the directive:
from sites import site_spec
It is important to mention that the script running the spider (the one you presented) IS REMOVED from the my_spider.py class definition file. Also note that I import site_spec.py with the path relative to the run.py file (see below), not relative to the class definition file where this directive is issued, as one might expect (Python relative import, I guess).
Finally, in the project root I have a run.py file, running Scrapy from a script:
from scrapy.crawler import CrawlerProcess
from spiders.my_spider import MySpider # this is our friend in subfolder **spiders**
from scrapy.utils.project import get_project_settings
# Run that thing!
process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
With this setup I was finally able to get rid of twisted.internet.error.ReactorNotRestartable.
Thank you very much!
Okay, I solved it.
I think that in the pipeline, when the scraper enters open_spider, it runs spider.py again and calls process.start() a second time.
To solve the problem, I added this in the spider so that process.start() will only be executed when you run the spider directly:
if __name__ == '__main__':
    process = CrawlerProcess(settings=get_project_settings())
    process.crawl('spider')
    process.start()
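With that change the spider module is shaped roughly like this (a sketch; the class body is abbreviated):

# spiders/spider.py (sketch)
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    start_urls = [...]  # as in the question

    def parse(self, response):
        ...


# Runs only when the file is executed directly, not when Scrapy
# (or the pipeline machinery) merely imports the module.
if __name__ == '__main__':
    process = CrawlerProcess(settings=get_project_settings())
    process.crawl('spider')
    process.start()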
Try changing the Scrapy and Twisted versions. It isn't a real solution, but it worked for me:
pip install Twisted==22.1.0
pip install Scrapy==2.5.1

How to run a python script in Django?

I am new to Django, and I'm trying to import one of my models in a script, the way we do it in views.py. I'm getting an error:
Traceback (most recent call last):
  File "CallCenter\make_call.py", line 3, in <module>
    from .models import Campaign
ModuleNotFoundError: No module named '__main__.models'; '__main__' is not a package
My file structure is like this:
MyApp\CallCenter\
CallCenter contains __init__.py, make_call.py, models.py and views.py, and MyApp contains manage.py.
from twilio.rest import Client
from twilio.twiml.voice_response import VoiceResponse, Say, Dial, Number

from .models import Campaign


def create_xml():
    # Creates XML
    response = VoiceResponse()
    campaign = Campaign.objects.get(pk=1)
    response.say(campaign.campaign_text)
    return response


xml = create_xml()
print(xml)
In general, it's better to refactor "ad-hoc" scripts – anything you might run manually from a command line, say – into management commands.
That way the Django runtime is set up correctly once things get to your code, and you get command-line parsing for free too.
Your make_call.py might become something like this:
CallCenter/management/commands/make_call.py
from twilio.rest import Client
from twilio.twiml.voice_response import VoiceResponse, Say, Dial, Number

from CallCenter.models import Campaign
from django.core.management import BaseCommand


def create_xml(campaign):
    # Creates XML
    response = VoiceResponse()
    response.say(campaign.campaign_text)
    return response


class Command(BaseCommand):
    def add_arguments(self, parser):
        parser.add_argument("--campaign-id", required=True, type=int)

    def handle(self, campaign_id, **options):
        campaign = Campaign.objects.get(pk=campaign_id)
        xml = create_xml(campaign)
        print(xml)
and it would be invoked with
$ python manage.py make_call --campaign-id=1
from wherever your manage.py is.
(Remember to have an __init__.py file in both the management/ and the management/commands/ folders.)

Getting scrapy project settings when script is outside of root directory

I have made a Scrapy spider that can be successfully run from a script located in the root directory of the project. As I need to run multiple spiders from different projects from the same script (this will be a django app calling the script upon the user's request), I moved the script from the root of one of the projects to the parent directory. For some reason, the script is no longer able to get the project's custom settings in order to pipeline the scraped results into the database tables. Here is the code from the scrapy docs I'm using to run the spider from a script:
def spiderCrawl():
    settings = get_project_settings()
    settings.set('USER_AGENT', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
    process = CrawlerProcess(settings)
    process.crawl(MySpider3)
    process.start()
Is there some extra module that needs to be imported in order to get the project settings from outside of the project? Or do some additions need to be made to this code? Below is also the code for the script that runs the spiders. Thanks.
from ticket_city_scraper.ticket_city_scraper import *
from ticket_city_scraper.ticket_city_scraper.spiders import tc_spider
from vividseats_scraper.vividseats_scraper import *
from vividseats_scraper.vividseats_scraper.spiders import vs_spider
tc_spider.spiderCrawl()
vs_spider.spiderCrawl()
Thanks to some of the answers already provided here, I realised Scrapy wasn't actually importing the settings.py file. This is how I fixed it.
TL;DR: Make sure you set the SCRAPY_SETTINGS_MODULE variable to your actual settings module. I'm doing this in the __init__() function of Scraper.
Consider a project with the following structure.
my_project/
    main.py                  # Where we are running scrapy from
    scraper/
        run_scraper.py       # Call from main goes here
        scrapy.cfg           # deploy configuration file
        scraper/             # project's Python module, you'll import your code from here
            __init__.py
            items.py         # project items definition file
            pipelines.py     # project pipelines file
            settings.py      # project settings file
            spiders/         # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py  # Contains the QuotesSpider class
Basically, the command
scrapy startproject scraper was executed in the my_project folder. I then added a run_scraper.py file to the outer scraper folder, a main.py file to my root folder, and quotes_spider.py to the spiders folder.
My main file:
from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()
My run_scraper.py file:
from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings'  # The path seen from root, i.e. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
Also, note that the settings might require a look-over, since the paths in them need to be relative to the root folder (my_project, not scraper).
So in my case:
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'
And repeat for all the settings variables you have!
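The same root-relative prefix applies to any other setting that names project modules, for example a pipeline entry (SomePipeline is just a placeholder class name here):

ITEM_PIPELINES = {
    'scraper.scraper.pipelines.SomePipeline': 300,
}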
It should work. Can you share your Scrapy log file?
Edit:
Your approach will not work, because when you execute the script it will look for your default settings in this order:
if you have set the environment variable (SCRAPY_SETTINGS_MODULE), it will use that settings module;
if you have a scrapy.cfg file in the directory you are executing your script from, and that file points to a valid settings.py, it will load those settings;
else it will run with the vanilla settings provided by Scrapy (your case).
Solution 1
Create a cfg file inside the directory (outside the project folder) and give it the path to the valid settings.py file.
Solution 2
Make your parent directory a package, so that an absolute path will not be required and you can use a relative path, i.e. python -m cron.project1
Solution 3
You can also try something like this: leave the script where it is, inside the project directory, where it is working, and create an sh file:
Line 1: cd to the first project's location (root directory)
Line 2: python script1.py
Line 3: cd to the second project's location
Line 4: python script2.py
Now you can execute the spiders via this sh file when requested by Django.
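If you prefer to stay in Python rather than a shell file, a rough equivalent sketch (the paths and script names are placeholders) would be:

import subprocess

# Run each project's script from inside its own root directory,
# so each project's scrapy.cfg is picked up in turn.
subprocess.run(["python", "script1.py"], cwd="/path/to/first/project", check=True)
subprocess.run(["python", "script2.py"], cwd="/path/to/second/project", check=True)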
I have used this code to solve the problem:
import os

from scrapy.settings import Settings

settings = Settings()
settings_module_path = os.environ.get('SCRAPY_ENV', 'project.settings.dev')
settings.setmodule(settings_module_path, priority='project')
print(settings.get('BASE_URL'))
This could happen because you are no longer "inside" a Scrapy project, so it doesn't know how to get the settings with get_project_settings().
You can also specify the settings as a dictionary, as in the example here:
http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
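A minimal sketch of that approach (the USER_AGENT value is just the one from the question, and MySpider3 is the question's spider class):

from scrapy.crawler import CrawlerProcess

# MySpider3 is the spider class from the question; import it from wherever it lives.
process = CrawlerProcess(settings={
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})
process.crawl(MySpider3)
process.start()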
I used the os module for this problem.
The Python file you are running is in one directory and your Scrapy project is in a different directory. You cannot simply import the spider and run it from this Python script, because the current directory you are working in does not have the settings.py file or the scrapy.cfg.
import os
To show the current directory you are working in, use the following code:
print(os.getcwd())
From here you are going to want to change the current directory:
os.chdir(r'\path\to\spider\folder')
Lastly, tell os which command to execute:
os.system('python scrape_file.py')
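Putting those pieces together, a minimal sketch (the path and file name are placeholders for your own project):

import os

print(os.getcwd())                    # show where we currently are
os.chdir(r'\path\to\spider\folder')   # move into the folder that has scrapy.cfg / settings.py
os.system('python scrape_file.py')    # run the script that starts the spider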
This is an addition to malla's answer.
You can configure the settings, pipelines, spiders, etc. via module variables; you don't need to pass them as strings. The big advantage is that you can run the spider from different places and you don't need to adjust the strings in the settings. You can do both: run from a script (from anywhere, even from multiple different roots) and run with scrapy crawl, without adjusting anything:
import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from ticket_city_scraper.ticket_city_scraper import settings  # your settings module


def run_spider():
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings.__name__)
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider3)  # MySpider3 is the spider class from the question
    process.start()
You can also make the settings themselves variable:
from . import spiders  # your spiders module
from . import pipelines  # your pipelines module


def get_full_package_name_for_class(clazz) -> str:
    return ".".join([clazz.__module__, clazz.__name__])


SPIDER_MODULES = [spiders.__name__]
NEWSPIDER_MODULE = spiders.__name__
ITEM_PIPELINES = {
    get_full_package_name_for_class(pipelines.YourScrapyPipeline): 300
}

Unable to import items in scrapy

I have a very basic spider, following the instructions in the getting started guide, but for some reason, trying to import my items into my spider returns an error. Spider and items code is shown below:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from myProject.items import item


class MyProject(BaseSpider):
    name = "spider"
    allowed_domains = ["website.com"]
    start_urls = [
        "website.com/start"
    ]

    def parse(self, response):
        print response.body
from scrapy.item import Item, Field


class ProjectItem(Item):
    title = Field()
When I run this code, Scrapy either can't find my spider or can't import my items file. What's going on here? This should be a really simple example to run, right?
I have also hit this several times while working with Scrapy. You could add this line at the beginning of your Python modules:
from __future__ import absolute_import
More info here:
http://www.python.org/dev/peps/pep-0328/#rationale-for-absolute-imports
http://pythonquirks.blogspot.ru/2010/07/absolutely-relative-import.html
You are importing a field; you must import a class from items.py, like:
from myproject.items import class_name
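With the items.py shown in the question (class ProjectItem), that would be something like:
from myProject.items import ProjectItem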
So, this was a problem I came across the other day and was able to fix through some trial and error, but I wasn't able to find any documentation of it, so I thought I'd put this up in case anyone happens to run into the same problem I did.
This isn't so much an issue with Scrapy as it is an issue with naming files and how Python deals with importing modules. Basically, the problem is that if you name your spider file the same thing as the project, your imports are going to break. Python will try to import from the directory closest to your current position, which means it's going to try to import from the spider's directory, which isn't going to work.
Basically, just change the name of your spider file to something else and it'll all be up and running just fine.
If the structure is like this:
package/
    __init__.py
    subpackage1/
        __init__.py
        moduleX.py
        moduleY.py
    subpackage2/
        __init__.py
        moduleZ.py
    moduleA.py
and if you are in moduleX.py, the way to import other modules can be:
from .moduleY import *
from ..moduleA import *
from ..subpackage2.moduleZ import *
Refer to PEP 328 (Imports: Multi-Line and Absolute/Relative).
