Access django models inside of Scrapy - python

Is it possible to access my django models inside of a Scrapy pipeline, so that I can save my scraped data straight to my model?
I've seen this, but I don't really get how to set it up?

If anyone else is having the same problem, this is how I solved it.
I added this to my scrapy settings.py file:
def setup_django_env(path):
    import imp, os
    from django.core.management import setup_environ
    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)
    setup_environ(project)
setup_django_env('/path/to/django/project/')
Note: the path above is to your django project folder, not the settings.py file.
Now you will have full access to your django models inside of your scrapy project.
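With the environment set up this way, a pipeline can write each scraped item to a model. The following is only a minimal sketch: the app news, the model Article, and its url field are hypothetical names that do not come from the answer above. The model is imported lazily in open_spider so that the import runs only after settings.py has configured Django.

```python
class DjangoWriterPipeline:
    """Scrapy item pipeline that stores each scraped item in a Django model."""

    model = None  # resolved in open_spider, once Django is configured

    def open_spider(self, spider):
        if self.model is None:
            # Hypothetical app and model; replace with your own.
            from news.models import Article
            self.model = Article

    def process_item(self, item, spider):
        data = dict(item)  # works for plain dicts and for scrapy Items
        # Use the URL as the lookup key so that a re-crawl updates the
        # existing row instead of inserting a duplicate.
        url = data.pop("url")
        self.model.objects.update_or_create(url=url, defaults=data)
        return item
```

The class still has to be enabled through Scrapy's ITEM_PIPELINES setting, like any other pipeline.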

The opposite solution (setup scrapy in a django management command):
# -*- coding: utf-8 -*-
# myapp/management/commands/scrapy.py
from __future__ import absolute_import
from django.core.management.base import BaseCommand
class Command(BaseCommand):
    def run_from_argv(self, argv):
        self._argv = argv
        self.execute()
    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])
and in django's settings.py:
import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'scrapy_project.settings'
Then, instead of scrapy foo, run ./manage.py scrapy foo.
UPD: fixed the code to bypass django's options parsing.

Set the DJANGO_SETTINGS_MODULE environment variable in your Scrapy project's settings.py:
import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'your_django_project.settings'
Now you can use DjangoItem in your scrapy project.
Edit:
You have to make sure that the directory containing the your_django_project package (and therefore its settings.py) is on PYTHONPATH.
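If changing PYTHONPATH in the shell is inconvenient, the same effect can be had from inside Scrapy's settings.py. This is a sketch under assumptions: the helper name point_scrapy_at_django is made up for illustration, and the path in the trailing comment is a placeholder.

```python
import importlib.util
import os
import sys


def point_scrapy_at_django(project_root, settings_module):
    """Make settings_module importable and select it for Django."""
    project_root = os.path.abspath(project_root)
    if project_root not in sys.path:
        sys.path.insert(0, project_root)
    # find_spec imports the parent package but not the settings module
    # itself, so this fails early if the path is wrong.
    if importlib.util.find_spec(settings_module) is None:
        raise ImportError("%s not found under %s" % (settings_module, project_root))
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)
    return settings_module


# Example call (placeholder path):
# point_scrapy_at_django('/path/to/django/project', 'your_django_project.settings')
```

Because setdefault is used, a DJANGO_SETTINGS_MODULE value that is already set in the environment is left untouched.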

For Django 1.4, the project layout has changed. Instead of /myproject/settings.py, the settings module is in /myproject/myproject/settings.py.
I also added path's parent directory (/myproject) to sys.path to make it work correctly.
def setup_django_env(path):
    import imp, os, sys
    from django.core.management import setup_environ
    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)
    setup_environ(project)
    # Add path's parent directory to sys.path
    sys.path.append(os.path.abspath(os.path.join(path, os.path.pardir)))
setup_django_env('/path/to/django/myproject/myproject/')

Check out django-dynamic-scraper, it integrates a Scrapy spider manager into a Django site.
https://github.com/holgerd77/django-dynamic-scraper

Why not create an __init__.py file in the Scrapy project folder and hook the project up in INSTALLED_APPS? That worked for me. I was then able to simply use:
pipelines.py
from my_app.models import MyModel
Hope that helps.

setup_environ is deprecated (since Django 1.4) and was later removed. For newer versions of Django (django.setup() exists from 1.7 onward), you may need to do the following in Scrapy's settings file:
def setup_django_env():
    import sys, os, django
    sys.path.append('/path/to/django/myapp')
    os.environ['DJANGO_SETTINGS_MODULE'] = 'myapp.settings'
    django.setup()
setup_django_env()

Minor update to solve KeyError. Python(3)/Django(1.10)/Scrapy(1.2.0)
from django.core.management.base import BaseCommand
class Command(BaseCommand):
    help = 'Scrapy commands. Accessible from: "Django manage.py". '
    def __init__(self, stdout=None, stderr=None, no_color=False):
        super().__init__(stdout=None, stderr=None, no_color=False)
        # Optional attribute declaration.
        self.no_color = no_color
        self.stderr = stderr
        self.stdout = stdout
        # Actual declaration of CLI command
        self._argv = None
    def run_from_argv(self, argv):
        self._argv = argv
        self.execute(stdout=None, stderr=None, no_color=False)
    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])
The SCRAPY_SETTINGS_MODULE declaration is still required.
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scrapy_project.settings')

Related

Is it possible to get django info without running it

I have a Django model and the code for it is complete, but I want to access information about the model, for example with code like this to get the field names:
for f in myModel._meta.fields:
    print(f.get_attname())
Is it possible to do this from an external Python script without running the Django server?
Other automated ways of doing this and saving the results to a file are also appreciated.
try1
Because I'm using Docker, I started the containers and opened a Python shell inside the Django container:
>>> from django.conf import settings
>>> settings.configure()
>>> import models
This raised django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.
try2
Following @Klaus D's advice in the comments, I tried a management command, so I created this structure:
users/
    __init__.py
    models.py
    management/
        __init__.py
        commands/
            __init__.py
            _private.py
            modelInfo.py
In modelInfo.py I wrote:
from django.core.management.base import BaseCommand, CommandError
from users import views2
def savelisttxtfile(the_list, path_, type_='w', encoding="utf-8"):
    # Write one list item per line.
    with open(path_, type_, encoding=encoding) as file_handler:
        for item in the_list:
            file_handler.write("{}\n".format(item))
class Command(BaseCommand):
    def handle(self, *args, **options):
        dic=[]
        for f in views2.ChertModel._meta.fields:
            print(f.get_attname())
            dic.append(f.get_attname())
        # Raw string so the backslashes in the Windows path are kept literally.
        savelisttxtfile(dic, r"F:\projects\sd.txt")
and from another Python file I tried:
import os
import sys
from subprocess import run
os.chdir(r'F:\projects\users\management\commands')
run([sys.executable, r'F:\projects\users\management\commands\modelInfo.py'])
It returned:
CompletedProcess(args=['C:\\ProgramData\\Anaconda3\\python.exe', 'F:\projects\users\management\commands\modelInfo.py'], returncode=1)
and the results were not saved in sd.txt.
Thanks to @Klaus D and the management command documentation, I made this structure:
users/
    __init__.py
    models.py
    management/
        __init__.py
        commands/
            __init__.py
            _private.py
            modelInfo.py
and in modelInfo.py I wrote:
from django.core.management.base import BaseCommand, CommandError
from users import views2
def savelisttxtfile(the_list, path_, type_='w', encoding="utf-8"):
    # Write one list item per line.
    with open(path_, type_, encoding=encoding) as file_handler:
        for item in the_list:
            file_handler.write("{}\n".format(item))
class Command(BaseCommand):
    def handle(self, *args, **options):
        dic=[]
        for f in views2.ChertModel._meta.fields:
            print(f.get_attname())
            dic.append(f.get_attname())
        # Raw string so the backslashes in the Windows path are kept literally.
        savelisttxtfile(dic, r"F:\projects\sd.txt")
To run it, I went to the directory that contains manage.py and executed python manage.py modelInfo.
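The earlier attempt probably failed because it ran modelInfo.py directly, which bypasses manage.py and so never sets Django up. If the command still has to be launched from another Python script, the script can go through manage.py instead. A minimal sketch, where the helper names manage_py_argv and run_management_command are made up for illustration:

```python
import subprocess
import sys
from pathlib import Path


def manage_py_argv(command, *args):
    """Build the argv for: python manage.py <command> [args...]."""
    return [sys.executable, "manage.py", command, *args]


def run_management_command(project_dir, command, *args):
    """Run a management command from the directory that contains manage.py."""
    project_dir = Path(project_dir)
    if not (project_dir / "manage.py").is_file():
        raise FileNotFoundError("manage.py not found in %s" % project_dir)
    # cwd makes the relative "manage.py" resolve inside the project.
    return subprocess.run(
        manage_py_argv(command, *args),
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
```

A call such as run_management_command(project_dir, "modelInfo") returns a CompletedProcess whose returncode and stderr show why a run failed, which the bare returncode=1 above does not.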
Regarding your "try1": getting a plain Python shell into the same state as python manage.py shell is a little trickier than what you tried there.
Fortunately you can do this:
python manage.py shell < your_script.py
and your script will be executed as if it had been typed directly into the Django shell. Keep in mind that you still need to import your models relative to your project, e.g. from myapp.models import mymodel.

how to import all django models and more in a script?

I've put the following code at the top of my script file:
import os
import django
os.environ.setdefault("DJANGO_SETTINGS_MODULE", 'momsite.conf.local.settings')
django.setup()
Now I can import my Django apps and run small snippets (mainly to test things).
I'd like to import all the models registered through settings.INSTALLED_APPS.
I know that https://github.com/django-extensions/django-extensions does this: running manage.py shell_plus automatically imports all the models and more.
I'm looking at their code, but I'm not sure I'll be able to make sense of it:
https://github.com/django-extensions/django-extensions/blob/3355332238910f3f30a3921e604641562c79a0a8/django_extensions/management/commands/shell_plus.py#L137
At the moment I'm doing the following. I think it is importing the models, but somehow they are not available in the script:
from django_extensions.management.shells import import_objects
from django.core.management.base import BaseCommand, CommandError
options = {}
style = BaseCommand().style
import_objects(options, style)
Edit: answer adapted from dirkgroten.
import_objects imports the classes internally (via importlib's import_module) and returns them, so apparently we need to populate globals() with the imported classes ourselves:
options = {'quiet_load': True}
style = BaseCommand().style
imported_objects = import_objects(options, style)
globals().update(imported_objects)
After you run django.setup(), do this:
from django.apps import apps
for _class in apps.get_models():
    if _class.__name__.startswith("Historical"):
        continue
    globals()[_class.__name__] = _class
That will make all model classes available as globals in your script.
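One thing the loop above does not handle is two apps defining models with the same class name, where the later one silently overwrites the earlier one. A small sketch of the same idea with that case reported; export_models is a hypothetical helper name, and with Django it would be called as export_models(apps.get_models(), globals()).

```python
def export_models(models, namespace, skip_prefixes=("Historical",)):
    """Copy model classes into namespace, keyed by class name.

    Returns the names that were skipped because the name was already
    taken in the namespace by a different object.
    """
    collisions = []
    for model in models:
        name = model.__name__
        if name.startswith(tuple(skip_prefixes)):
            continue
        if name in namespace and namespace[name] is not model:
            # Keep the first binding and report the clash.
            collisions.append(name)
            continue
        namespace[name] = model
    return collisions
```

Colliding models can still be fetched explicitly with apps.get_model(app_label, model_name).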
Create a management command. It will automatically set up Django and everything else.
You then simply run your command with ./manage.py mytest
#myapp/management/commands/mytest.py
from django.core.management.base import BaseCommand, CommandError
from myapp.sometest import Mycommand
class Command(BaseCommand):
    help = 'my test script'
    def add_arguments(self, parser):
        pass
        # parser.add_argument('poll_ids', nargs='+', type=int)
    def handle(self, *args, **options):
        Mycommand.something(self)
which will call the actual script:
#sometest.py
from .models import *
class Mycommand():
    def something(self):
        print('...something')

Accessing django database from python script

I'm trying to access my Django database from within a regular Python script. So far what I did is:
import os
import django
from django.db import models
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "m_site.settings")
django.setup()
from django.apps import apps
ap=apps.get_model('py','Post')
q=ap.objects.all()
for a in q:
    print(a.title)
There is a Django application called py that holds many posts (its models.py contains class Post(models.Model)).
I would like to be able to access and update this database from a regular Python script. The script above works fine so far: I can insert, update, and query. However, it looks as if it is working on a separate database and not on the database of Django's py application. Am I doing something wrong or missing something? Any help would be much appreciated.
Consider writing your script as a custom management command. This will let you run it via manage.py, with Django all properly wired up for you, e.g.
python manage.py print_post_titles
Something like this should be a good start:
from django.core.management.base import BaseCommand
from py.models import Post
class Command(BaseCommand):
    help = 'Prints the titles of all Posts'
    def handle(self, *args, **options):
        for post in Post.objects.all():
            print(post.title)

Deleting folders in docker container using python

I have this function, which deletes the given directory when a row is deleted in Django Admin. The row is deleted successfully in Django Admin, but the directory still exists.
models.py
from django.db import models
from django.conf import settings
import git, os, shutil
class DIR(models.Model):
    username = models.CharField(max_length=39)
    repository = models.CharField(max_length=100)
    def get_dir_name(self):
        return os.path.join(settings.PLAYBOOK_DIR, self.repository)
    def rm_repository(self):
        DIR_NAME = self.get_dir_name()
        shutil.rmtree(os.path.join(DIR_NAME))
    def delete(self, *args, **kwargs):
        self.rm_repository()
        super(DIR, self).delete(*args, **kwargs)
But when I try it using shell, the directory and contents get deleted
$ docker exec -it <container> python manage.py shell
>>> import os, git, shutil
>>> DIR_NAME = '/opt/app/john'
>>> shutil.rmtree(DIR_NAME)
What is the difference from the shell? No errors were reported; I'm just not sure why deleting through Django Admin doesn't remove the directory while the same call works in the Python shell.
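Apparently a model's delete() method is not called for bulk deletes, which is what I was doing in Django Admin: the "delete selected" action deletes through the queryset (QuerySet.delete()) and bypasses the per-object delete() override.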
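One common way around this is a post_delete signal receiver, since Django sends that signal for every deleted object, including those removed by a queryset delete. This is a sketch and not the fix the asker reported using; the receiver name remove_repository_dir is made up, and it relies on the get_dir_name() method from the model above.

```python
import shutil


def remove_repository_dir(sender, instance, **kwargs):
    """post_delete receiver: remove the directory that belonged to instance."""
    # ignore_errors=True keeps a directory that is already missing
    # from raising an exception.
    shutil.rmtree(instance.get_dir_name(), ignore_errors=True)


# In models.py, after the DIR class is defined, the receiver would be
# connected with:
#   from django.db.models.signals import post_delete
#   post_delete.connect(remove_repository_dir, sender=DIR)
```

With the receiver connected, the custom delete() override is no longer needed for the cleanup, and both single and bulk deletes take the same path.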

How to correctly override Django manage.py's command?

I need to override createsuperuser.py's handle method in Django Command class.
I created myapp\management\commands\createsuperuser.py:
import getpass
import sys
import django.contrib.auth.management.commands.createsuperuser as makesuperuser
from django.contrib.auth.management import get_default_username
from django.contrib.auth.password_validation import validate_password
from django.core import exceptions
from django.core.management.base import CommandError
from django.utils.encoding import force_str
from django.utils.text import capfirst
class Command(makesuperuser.Command):
    def handle(self, *args, **options):
        # the rest of the code is copied from the Django source and is almost
        # standard, except for a few changes related to how the info of
        # REQUIRED_FIELDS is shown
When I run ./manage.py createsuperuser in a terminal, I do not see any changes. If I rename my file to, say, mycmd.py and run ./manage.py mycmd, everything works as I expect.
How can I get the changes I need when running ./manage.py createsuperuser?
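Put your application name at the top of the INSTALLED_APPS list. When several apps provide a management command with the same name, the app listed first takes precedence, so your app must come before django.contrib.auth.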
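As a settings fragment, the ordering might look like the sketch below. "myapp" is the app name from the question; the remaining entries are just the usual default apps and are assumed here for illustration.

```python
# settings.py (fragment)
INSTALLED_APPS = [
    "myapp",  # listed first so its createsuperuser command takes precedence
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
]
```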
