How to update DjangoItem in Scrapy

I've been working with Scrapy but run into a bit of a problem.
DjangoItem has a save method to persist items using the Django ORM. This is great, except that if I run a scraper multiple times, new items will be created in the database even though I may just want to update a previous value.
After looking at the documentation and source code, I don't see any means to update existing items.
I know that I could call out to the ORM to see if an item exists and update it, but it would mean calling out to the database for every single object and then again to save the item.
How can I update items if they already exist?

Unfortunately, the best way that I found to accomplish this is to do exactly what was stated: Check if the item exists in the database using django_model.objects.get, then update it if it does.
In my settings file, I added the new pipeline:
ITEM_PIPELINES = {
    # ...
    # Last pipeline, because further changes won't be saved.
    'apps.scrapy.pipelines.ItemPersistencePipeline': 999
}
I created some helper methods to handle the work of retrieving the item's model, and falling back to a new one if necessary:
from django.forms.models import model_to_dict
def item_to_model(item):
    # Default to None so a non-DjangoItem raises TypeError, not AttributeError.
    model_class = getattr(item, 'django_model', None)
    if not model_class:
        raise TypeError("Item is not a `DjangoItem` or is misconfigured")
    return item.instance
def get_or_create(model):
    model_class = type(model)
    created = False
    # Normally, we would use `get_or_create`. However, `get_or_create` would
    # match all properties of an object (i.e. create a new object
    # anytime it changed) rather than update an existing object.
    #
    # Instead, we do the two steps separately
    try:
        # We have no unique identifier at the moment; use the name for now.
        obj = model_class.objects.get(name=model.name)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model for us.
    return (obj, created)
def update_model(destination, source, commit=True):
    pk = destination.pk
    source_dict = model_to_dict(source)
    for (key, value) in source_dict.items():
        setattr(destination, key, value)
    setattr(destination, 'pk', pk)
    if commit:
        destination.save()
    return destination
Then, the final pipeline is fairly straightforward:
class ItemPersistencePipeline(object):
    def process_item(self, item, spider):
        try:
            item_model = item_to_model(item)
        except TypeError:
            return item
        model, created = get_or_create(item_model)
        update_model(model, item_model)
        return item
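The get-then-update steps above can also be expressed with Django's `QuerySet.update_or_create`, which matches a row on its keyword arguments and applies `defaults` to whatever it finds or creates. A minimal sketch, assuming the item carries a `name` field that identifies the record; the `UpsertPipeline` class and its `lookup_field` attribute are hypothetical names, not part of Scrapy or DjangoItem:

```python
class UpsertPipeline:
    """Hypothetical pipeline: update the matching row, or insert a new one."""

    # Assumption: this field uniquely identifies a scraped record.
    lookup_field = "name"

    def process_item(self, item, spider):
        model_class = getattr(item, "django_model", None)
        if model_class is None:
            return item  # not a DjangoItem; leave it for other pipelines
        values = dict(item)
        lookup = {self.lookup_field: values.pop(self.lookup_field)}
        # Django matches on `lookup` only and writes `values` onto the row.
        model_class.objects.update_or_create(defaults=values, **lookup)
        return item
```

This still issues a lookup and a write per item, so it tidies the code rather than reducing database round trips.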

I think it could be done more simply with:
class DjangoSavePipeline(object):
    def process_item(self, item, spider):
        try:
            product = Product.objects.get(myunique_id=item['myunique_id'])
            # Already exists, just update it
            instance = item.save(commit=False)
            instance.pk = product.pk
        except Product.DoesNotExist:
            pass
        item.save()
        return item
This assumes your Django model (called Product here) has a unique id taken from the scraped data, such as a product id.

For related models with foreign keys:
from django.forms.models import fields_for_model
def update_model(destination, source, commit=True):
    pk = destination.pk
    source_fields = fields_for_model(source)
    for key in source_fields.keys():
        setattr(destination, key, getattr(source, key))
    setattr(destination, 'pk', pk)
    if commit:
        destination.save()
    return destination

Related

Access model instance inside model field

I have a model (Event) that has a ForeignKey to the User model (the owner of the Event).
This User can invite other Users, using the following ManyToManyField:
invites = models.ManyToManyField(
    User, related_name="invited_users",
    verbose_name=_("Invited Users"), blank=True
)
This invite field generates a simple table, containing the ID, event_id and user_id.
In case the Event owner deletes his profile, I don't want the Event to be deleted, but instead to pass the ownership to the first user that was invited.
So I came up with this function:
def get_new_owner():
    try:
        invited_users = Event.objects.get(id=id).invites.order_by("-id").filter(is_active=True)
        if invited_users.exists():
            return invited_users.first()
        else:
            Event.objects.get(id=id).delete()
    except ObjectDoesNotExist:
        pass
This finds the Event instance, and returns the active invited users ordered by the Invite table ID, so I can get the first item of this queryset, which corresponds to the first user invited.
In order to run the function when a User gets deleted, I used on_delete=models.SET:
owner = models.ForeignKey(User, related_name='evemt_owner', verbose_name=_("Owner"), on_delete=models.SET(get_new_owner()))
Then I ran into some problems:
It can't access the ID of the field I'm passing
I couldn't find a way to use it as a classmethod or something similar, so I had to put the function above the model. Obviously this meant that it could no longer access the class below it, so I tried to pass the Event model as a parameter of the function, but could not make it work.
Any ideas?

First we can define a strategy for the owner field that will call the function with the object that has been updated. We can define such a deletion handler, for example in the app_name/deletion.py file:
# app_name/deletion.py
def SET_WITH(value):
    if callable(value):
        def set_with_delete(collector, field, sub_objs, using):
            for obj in sub_objs:
                collector.add_field_update(field, value(obj), [obj])
    else:
        def set_with_delete(collector, field, sub_objs, using):
            collector.add_field_update(field, value, sub_objs)
    set_with_delete.deconstruct = lambda: ('app_name.SET_WITH', (value,), {})
    return set_with_delete
You should pass a callable to SET, not call the function, so you implement this as:
from django.conf import settings
from django.db import models
from django.db.models import Q
from app_name.deletion import SET_WITH
def get_new_owner(event):
    invited_users = event.invites.order_by(
        'eventinvites__id'
    ).filter(~Q(pk=event.owner_id), is_active=True).first()
    if invited_users is not None:
        return invited_users
    else:
        event.delete()
class Event(models.Model):
    # …
    owner = models.ForeignKey(
        settings.AUTH_USER_MODEL,
        related_name='owned_events',
        verbose_name=_('Owner'),
        # SET_WITH is our own helper, not an attribute of django.db.models.
        on_delete=SET_WITH(get_new_owner)
    )
Here we thus look at the invites to find a user to transfer the object to. Perhaps you need to exclude the current .owner of the event in your get_new_owner from the collection of .invites.
As @AbdulAzizBarkat says, it is better to work with a CASCADE than to delete the Event object explicitly, since that avoids infinite recursion where a User delete triggers an Event delete that might in turn trigger a User delete. At the moment this is not possible, but if extra logic is implemented later, one might end up in such a case. In that case we can work with:
from django.db.models import CASCADE
def SET_WITH(value):
    if callable(value):
        def set_with_delete(collector, field, sub_objs, using):
            for obj in sub_objs:
                val = value(obj)
                if val is None:
                    CASCADE(collector, field, [obj], using)
                else:
                    collector.add_field_update(field, val, [obj])
    else:
        def set_with_delete(collector, field, sub_objs, using):
            collector.add_field_update(field, value, sub_objs)
    set_with_delete.deconstruct = lambda: ('app_name.SET_WITH', (value,), {})
    return set_with_delete
and rewrite the get_new_owner to:
def get_new_owner(event):
    invited_users = event.invites.order_by(
        'eventinvites__id'
    ).filter(~Q(pk=event.owner_id), is_active=True).first()
    if invited_users is not None:
        return invited_users
    else:  # strictly speaking not necessary, but explicit over implicit
        return None

Django - Create object if field not in database

I want to create an object only if there's no other object with the same ID already in the database. The code below would create the same item again if one of the parameters below, like the State, was modified.
returns = Return.objects.all()
for ret in returns:
    obj, created = Return.objects.get_or_create(ItemID="UUID",
        ItemName="Hodaddy", State="Started")
    obj.save()

get_or_create works off all the arguments provided to find the object.
What you need to do instead is use the special defaults argument to provide the new value for a field that you don't want to filter on.
In your case, you only want the UUID field to be unique, and so you provide the other two members as defaults.
obj, created = Return.objects.get_or_create(ItemID="UUID",
    defaults={"ItemName": "Hodaddy", "State": "Started"})
Then you can make further decisions based on the value of created. I am not sure why you're iterating over all the Returns in the original question?
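To make the distinction concrete, here is a toy in-memory stand-in for the lookup-versus-defaults behaviour. It is not Django's implementation; the `get_or_create` function and the `rows` list below are hypothetical and only mimic how lookup keywords are used for matching while `defaults` are applied only on creation:

```python
def get_or_create(rows, defaults=None, **lookup):
    """Toy model of the semantics: match on `lookup`, use `defaults` only when creating."""
    for row in rows:
        if all(row.get(key) == value for key, value in lookup.items()):
            return row, False
    row = {**lookup, **(defaults or {})}
    rows.append(row)
    return row, True


rows = []
first, created_first = get_or_create(
    rows, ItemID="UUID", defaults={"ItemName": "Hodaddy", "State": "Started"})
# Same ItemID, different State in defaults: the existing row is returned untouched.
second, created_second = get_or_create(
    rows, ItemID="UUID", defaults={"ItemName": "Hodaddy", "State": "Finished"})
# Putting State in the lookup makes it part of the match, so a second row appears.
third, created_third = get_or_create(rows, ItemID="UUID", State="Finished")
```

The third call is the situation from the question: because State takes part in the match, a changed State produces a new row.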

If you know the id, you can query for it:
In your question, you have:
returns = Return.objects.all()
for ret in returns:
    return_in_database = Return.objects.filter(ItemID="UUID").exists()
    if not return_in_database:
        obj, created = Return.objects.get_or_create(ItemID="UUID",
            ItemName="Hodaddy", State="Started")
        obj.save()
This can be done as:
if not Return.objects.filter(ItemID="UUID").exists():
    obj, created = Return.objects.get_or_create(ItemID="UUID",
        ItemName="Hodaddy", State="Started")
    obj.save()
As you can see, I've removed the for loop, as you were not using the variable ret anywhere, so no need to iterate over all Return objects. The above is functionally equivalent to what you had. :)
OR:
You can write your own manager with a create_or_update method inside your models.py:
class ReturnManager(models.Manager):
    def create_or_update(self, **kwargs):
        new_return = Return(**kwargs)
        existing = Return.objects.filter(ItemID=new_return.ItemID).first()
        if existing:
            # Reuse the existing primary key so save() issues an UPDATE.
            new_return.pk = existing.pk
            new_return.id = existing.id
        new_return.save()
        return new_return
You would then assign this to your Return model:
class Return(models.Model):
    # your object fields here
    objects = ReturnManager()

You need to change the parameters in the query so that the object lookup is constrained on ItemID only. Once a new object is returned, you can set the ItemName and State:
obj, created = Return.objects.get_or_create(ItemID="UUID")
if created:
    obj.ItemName="<item-name>"
    obj.State="<Started>"
    obj.save()

Django Admin app: building a dynamic list of admin actions

I am trying to dynamically build a list of admin actions using the get_actions() method on a ModelAdmin. Each action relates to a particular instance of another model, and as new instances may be added or removed, I want to make sure the list of actions reflects that.
Here's the ModelAdmin:
class PackageAdmin(admin.ModelAdmin):
    list_display = ('name', 'quality')
    def _actions(self, request):
        for q in models.Quality.objects.all():
            action = lambda modeladmin, req, qset: qset.update(quality=q)
            name = "mark_%s" % (q,)
            yield (name, (action, name, "Mark selected as %s quality" % (q,)))
    def get_actions(self, request):
        return dict(action for action in self._actions(request))
(The weird repetitive dict of tuples return value is explained by the Django docs for get_actions().)
As expected, this results in a list of appropriately named admin actions for bulk assignment of Quality foreign keys to Package objects.
The problem is that whichever action I choose, the same Quality object gets assigned to the selected Packages.
I assume that the closures I am creating with the lambda keyword all contain a reference to the same q object, so every iteration changes the value of q for every function.
Can I break this reference, allowing me to still use a list of closures containing different values of q?
Edit: I realise that lambda is not necessary in this example. Instead of:
action = lambda modeladmin, req, qset: qset.update(quality=q)
I could simply use def:
def action(modeladmin, req, qset):
    return qset.update(quality=q)

Try:
def make_action(quality):
    return lambda modeladmin, req, qset: qset.update(quality=quality)
for q in models.Quality.objects.all():
    action = make_action(q)
    name = "mark_%s" % (q,)
    yield (name, (action, name, "Mark selected as %s quality" % (q,)))
If that doesn't work, I suspect the bug has something to do with your use of yield. Maybe try:
def make_action(quality):
    name = 'mark_%s' % quality
    action = lambda modeladmin, req, qset: qset.update(quality=quality)
    return (name, (action, name, "Mark selected as %s quality" % quality))
def get_actions(self, request):
    # Call make_action for each Quality; passing the bare function would fail.
    return dict([make_action(q) for q in models.Quality.objects.all()])

As I mentioned in my comment to andylei's answer, I just found a solution; using another function to create the closure seems to break the reference, meaning that now every action refers to the correct instance of Quality.
def create_action(quality):
    fun = lambda modeladmin, request, queryset: queryset.update(quality=quality)
    name = "mark_%s" % (quality,)
    return (name, (fun, name, "Mark selected as %s quality" % (quality,)))
class PackageAdmin(admin.ModelAdmin):
    list_display = ('name', 'quality')
    def get_actions(self, request):
        return dict(create_action(q) for q in models.Quality.objects.all())
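The underlying Python behaviour can be reproduced without Django. Closures look up `q` when they are called, not when they are defined, so every lambda created in the loop sees the last value. A small sketch with plain strings standing in for the Quality objects (the function names here are illustrative only):

```python
def late_bound_actions(qualities):
    # Every lambda closes over the same variable `q`, which is read at call time.
    actions = []
    for q in qualities:
        actions.append(lambda: q)
    return actions


def make_action(quality):
    # A new scope per call: each lambda gets its own `quality`.
    return lambda: quality


def factory_actions(qualities):
    return [make_action(q) for q in qualities]


qualities = ["low", "medium", "high"]
late = [action() for action in late_bound_actions(qualities)]
fixed = [action() for action in factory_actions(qualities)]
```

Here `late` ends up repeating the last quality three times, while `fixed` preserves each one. Binding the value as a default argument (`lambda q=q: q`) has the same effect as the factory function.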

I am surprised that q stays the same object within the loop.
Does it work with quality=q.id in your lambda?

How would you inherit from and override the django model classes to create a listOfStringsField?

I want to create a new type of field for django models that is basically a ListOfStrings. So in your model code you would have the following:
models.py:
from django.db import models
class ListOfStringsField(???):
    ???
class myDjangoModelClass():
    myName = models.CharField(max_length=64)
    myFriends = ListOfStringsField() #
other.py:
myclass = myDjangoModelClass()
myclass.myName = "bob"
myclass.myFriends = ["me", "myself", "and I"]
myclass.save()
id = myclass.id
loadedmyclass = myDjangoModelClass.objects.filter(id__exact=id)
myFriendsList = loadedmyclass.myFriends
# myFriendsList is a list and should equal ["me", "myself", "and I"]
How would you go about writing this field type, with the following stipulations?
We don't want to create a field which just crams all the strings together and separates them with a token in one field like this. It is a good solution in some cases, but we want to keep the string data normalized so tools other than django can query the data.
The field should automatically create any secondary tables needed to store the string data.
The secondary table should ideally have only one copy of each unique string. This is optional, but would be nice to have.
Looking in the Django code it looks like I would want to do something similar to what ForeignKey is doing, but the documentation is sparse.
This leads to the following questions:
Can this be done?
Has it been done (and if so where)?
Is there any documentation on Django about how to extend and override their model classes, specifically their relationship classes? I have not seen a lot of documentation on that aspect of their code, but there is this.
This comes from this question.

There's some very good documentation on creating custom fields here.
However, I think you're overthinking this. It sounds like you actually just want a standard foreign key, but with the additional ability to retrieve all the elements as a single list. So the easiest thing would be to just use a ForeignKey, and define a get_myfield_as_list method on the model:
class Friends(models.Model):
    name = models.CharField(max_length=100)
    # String reference, because MyModel is defined below.
    my_items = models.ForeignKey('MyModel')
class MyModel(models.Model):
    ...
    def get_my_friends_as_list(self):
        return list(self.friends_set.values_list('name', flat=True))
Now calling get_my_friends_as_list() on an instance of MyModel will return a list of strings, as required.

What you have described sounds to me really similar to the tags.
So why not use django-tagging?
It works like a charm, you can install it independently from your application and its API is quite easy to use.

I also think you're going about this the wrong way. Trying to make a Django field create an ancillary database table is almost certainly the wrong approach. It would be very difficult to do, and would likely confuse third party developers if you are trying to make your solution generally useful.
If you're trying to store a denormalized blob of data in a single column, I'd take an approach similar to the one you linked to, serializing the Python data structure and storing it in a TextField. If you want tools other than Django to be able to operate on the data then you can serialize to JSON (or some other format that has wide language support):
from django.db import models
from django.utils import simplejson
class JSONDataField(models.TextField):
    __metaclass__ = models.SubfieldBase
    def to_python(self, value):
        if value is None:
            return None
        if not isinstance(value, basestring):
            return value
        return simplejson.loads(value)
    def get_db_prep_save(self, value):
        if value is None:
            return None
        return simplejson.dumps(value)
If you just want a django Manager-like descriptor that lets you operate on a list of strings associated with a model then you can manually create a join table and use a descriptor to manage the relationship. It's not exactly what you need, but this code should get you started.

Thanks for all those that answered. Even if I didn't use your answer directly the examples and links got me going in the right direction.
I am not sure if this is production ready, but it appears to be working in all my tests so far.
class ListValueDescriptor(object):
    def __init__(self, lvd_parent, lvd_model_name, lvd_value_type, lvd_unique, **kwargs):
        """
        This descriptor object acts like a django field, but it will accept
        a list of values, instead of a single value.
        For example:
            # define our model
            class Person(models.Model):
                name = models.CharField(max_length=120)
                friends = ListValueDescriptor("Person", "Friend", "CharField", True, max_length=120)
            # Later in the code we can do this
            p = Person("John")
            p.save() # we have to have an id
            p.friends = ["Jerry", "Jimmy", "Jamail"]
            ...
            p = Person.objects.get(name="John")
            friends = p.friends
            # and now friends is a list.
        lvd_parent - The name of our parent class
        lvd_model_name - The name of our new model
        lvd_value_type - The value type of the value in our new model
                         This has to be the name of one of the valid django
                         model field types such as 'CharField', 'FloatField',
                         or a valid custom field name.
        lvd_unique - Set this to true if you want the values in the list to
                     be unique in the table they are stored in. For
                     example if you are storing a list of strings and
                     the strings are always "foo", "bar", and "baz", your
                     data table would only have those three strings listed in
                     it in the database.
        kwargs - These are passed to the value field.
        """
        self.related_set_name = lvd_model_name.lower() + "_set"
        self.model_name = lvd_model_name
        self.parent = lvd_parent
        self.unique = lvd_unique
        # only set this to true if they have not already set it.
        # this helps speed up the searches when unique is true.
        kwargs['db_index'] = kwargs.get('db_index', True)
        filter = ["lvd_parent", "lvd_model_name", "lvd_value_type", "lvd_unique"]
        evalStr = """class %s (models.Model):\n""" % (self.model_name)
        evalStr += """ value = models.%s(""" % (lvd_value_type)
        evalStr += self._params_from_kwargs(filter, **kwargs)
        evalStr += ")\n"
        if self.unique:
            evalStr += """ parent = models.ManyToManyField('%s')\n""" % (self.parent)
        else:
            evalStr += """ parent = models.ForeignKey('%s')\n""" % (self.parent)
        evalStr += "\n"
        evalStr += """self.innerClass = %s\n""" % (self.model_name)
        print evalStr
        exec (evalStr) # build the inner class
    def __get__(self, instance, owner):
        value_set = instance.__getattribute__(self.related_set_name)
        l = []
        for x in value_set.all():
            l.append(x.value)
        return l
    def __set__(self, instance, values):
        value_set = instance.__getattribute__(self.related_set_name)
        for x in values:
            value_set.add(self._get_or_create_value(x))
    def __delete__(self, instance):
        pass # I should probably try and do something here.
    def _get_or_create_value(self, x):
        if self.unique:
            # Try and find an existing value
            try:
                return self.innerClass.objects.get(value=x)
            except django.core.exceptions.ObjectDoesNotExist:
                pass
        v = self.innerClass(value=x)
        v.save() # we have to save to create the id.
        return v
    def _params_from_kwargs(self, filter, **kwargs):
        """Given a dictionary of arguments, build a string which
        represents it as a parameter list, and filter out any
        keywords in filter."""
        params = ""
        for key in kwargs:
            if key not in filter:
                value = kwargs[key]
                params += "%s=%s, " % (key, value.__repr__())
        return params[:-2] # chop off the last ', '
class Person(models.Model):
    name = models.CharField(max_length=120)
    friends = ListValueDescriptor("Person", "Friend", "CharField", True, max_length=120)
Ultimately I think this would still be better if it were pushed deeper into the django code and worked more like the ManyToManyField or the ForeignKey.

I think what you want is a custom model field.

Duplicating model instances and their related objects in Django / Algorithm for recursively duplicating an object

I've models for Books, Chapters and Pages. They are all written by a User:
from django.db import models
class Book(models.Model):
    author = models.ForeignKey('auth.User')
class Chapter(models.Model):
    author = models.ForeignKey('auth.User')
    book = models.ForeignKey(Book)
class Page(models.Model):
    author = models.ForeignKey('auth.User')
    book = models.ForeignKey(Book)
    chapter = models.ForeignKey(Chapter)
What I'd like to do is duplicate an existing Book and update its User to someone else. The wrinkle is I would also like to duplicate all model instances related to the Book: all its Chapters and Pages as well!
Things get really tricky when you look at a Page: not only will the new Pages need to have their author field updated, but they will also need to point to the new Chapter objects!
Does Django support an out of the box way of doing this? What would a generic algorithm for duplicating a model look like?
Cheers,
John
Update:
The classes given above are just an example to illustrate the problem I'm having!

This no longer works in Django 1.3 as CollectedObjects was removed. See changeset 14507
I posted my solution on Django Snippets. It's based heavily on the django.db.models.query.CollectedObjects code used for deleting objects:
from django.db.models.query import CollectedObjects
from django.db.models.fields.related import ForeignKey
def duplicate(obj, value, field):
    """
    Duplicate all related objects of `obj` setting
    `field` to `value`. If one of the duplicate
    objects has an FK to another duplicate object
    update that as well. Return the duplicate copy
    of `obj`.
    """
    collected_objs = CollectedObjects()
    obj._collect_sub_objects(collected_objs)
    related_models = collected_objs.keys()
    root_obj = None
    # Traverse the related models in reverse deletion order.
    for model in reversed(related_models):
        # Find all FKs on `model` that point to a `related_model`.
        fks = []
        for f in model._meta.fields:
            if isinstance(f, ForeignKey) and f.rel.to in related_models:
                fks.append(f)
        # Replace each `sub_obj` with a duplicate.
        sub_obj = collected_objs[model]
        for pk_val, obj in sub_obj.iteritems():
            for fk in fks:
                fk_value = getattr(obj, "%s_id" % fk.name)
                # If this FK has been duplicated then point to the duplicate.
                if fk_value in collected_objs[fk.rel.to]:
                    dupe_obj = collected_objs[fk.rel.to][fk_value]
                    setattr(obj, fk.name, dupe_obj)
            # Duplicate the object and save it.
            obj.id = None
            setattr(obj, field, value)
            obj.save()
            if root_obj is None:
                root_obj = obj
    return root_obj
For Django >= 2, a few minimal changes are needed, so the result looks like this:
def duplicate(obj, value=None, field=None, duplicate_order=None):
    """
    Duplicate all related objects of obj setting
    field to value. If one of the duplicate
    objects has an FK to another duplicate object
    update that as well. Return the duplicate copy
    of obj.
    duplicate_order is a list of models which specify how
    the duplicate objects are saved. For complex objects
    this can matter. Check to save if objects are being
    saved correctly and if not just pass in related objects
    in the order that they should be saved.
    """
    from django.db.models.deletion import Collector
    from django.db.models.fields.related import ForeignKey
    collector = Collector(using='default')
    collector.collect([obj])
    collector.sort()
    related_models = collector.data.keys()
    data_snapshot = {}
    for key in collector.data.keys():
        data_snapshot.update(
            {key: dict(zip([item.pk for item in collector.data[key]], [item for item in collector.data[key]]))})
    root_obj = None
    # Sometimes it's good enough just to save in reverse deletion order.
    if duplicate_order is None:
        duplicate_order = reversed(related_models)
    for model in duplicate_order:
        # Find all FKs on model that point to a related_model.
        fks = []
        for f in model._meta.fields:
            if isinstance(f, ForeignKey) and f.remote_field.related_model in related_models:
                fks.append(f)
        # Replace each `sub_obj` with a duplicate.
        if model not in collector.data:
            continue
        sub_objects = collector.data[model]
        for obj in sub_objects:
            for fk in fks:
                fk_value = getattr(obj, "%s_id" % fk.name)
                # If this FK has been duplicated then point to the duplicate.
                fk_rel_to = data_snapshot[fk.remote_field.related_model]
                if fk_value in fk_rel_to:
                    dupe_obj = fk_rel_to[fk_value]
                    setattr(obj, fk.name, dupe_obj)
            # Duplicate the object and save it.
            obj.id = None
            if field is not None:
                setattr(obj, field, value)
            obj.save()
            if root_obj is None:
                root_obj = obj
    return root_obj

Here's an easy way to copy your object.
Basically:
(1) set the id of your original object to None:
book_to_copy.id = None
(2) change the 'author' attribute and save the object:
book_to_copy.author = new_author
book_to_copy.save()
(3) INSERT performed instead of UPDATE
(It doesn't address changing the author in the Page--I agree with the comments regarding re-structuring the models)

I haven't tried it in django but python's deepcopy might just work for you
EDIT:
You can define custom copy behavior for your models if you implement functions:
__copy__() and __deepcopy__()
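As an illustration of that hook, here is a plain-Python sketch (no Django involved) of a class that customises `__deepcopy__` so that copies drop their identifier, similar in spirit to clearing the primary key before saving. The `Record` class and its attributes are hypothetical, and whether this approach is practical for real model instances is untested here:

```python
import copy


class Record:
    """Hypothetical stand-in for a model instance with a pk and children."""

    def __init__(self, pk, name, children=None):
        self.pk = pk
        self.name = name
        self.children = children if children is not None else []

    def __deepcopy__(self, memo):
        # Build the copy without a pk, as if it had not been saved yet.
        clone = Record(None, self.name)
        memo[id(self)] = clone
        clone.children = copy.deepcopy(self.children, memo)
        return clone


book = Record(1, "Book", [Record(2, "Chapter 1"), Record(3, "Chapter 2")])
duplicate = copy.deepcopy(book)
```

After this runs, `duplicate` and its children carry no `pk`, while `book` and its children are unchanged.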

This is an edit of http://www.djangosnippets.org/snippets/1282/
It's now compatible with the Collector which replaced CollectedObjects in 1.3.
I didn't really test this too heavily, but did test it with an object with about 20,000 sub-objects, but in only about three layers of foreign-key depth. Use at your own risk of course.
For the ambitious reader of this post: you should consider subclassing Collector (or copying the entire class to remove this dependency on this unpublished section of the Django API) into a class called something like "DuplicateCollector" and writing a .duplicate method that works similarly to the .delete method. That would solve this problem in a real way.
from django.db.models.deletion import Collector
from django.db.models.fields.related import ForeignKey
def duplicate(obj, value=None, field=None, duplicate_order=None):
    """
    Duplicate all related objects of obj setting
    field to value. If one of the duplicate
    objects has an FK to another duplicate object
    update that as well. Return the duplicate copy
    of obj.
    duplicate_order is a list of models which specify how
    the duplicate objects are saved. For complex objects
    this can matter. Check to save if objects are being
    saved correctly and if not just pass in related objects
    in the order that they should be saved.
    """
    collector = Collector({})
    collector.collect([obj])
    collector.sort()
    related_models = collector.data.keys()
    data_snapshot = {}
    for key in collector.data.keys():
        data_snapshot.update({ key: dict(zip([item.pk for item in collector.data[key]], [item for item in collector.data[key]])) })
    root_obj = None
    # Sometimes it's good enough just to save in reverse deletion order.
    if duplicate_order is None:
        duplicate_order = reversed(related_models)
    for model in duplicate_order:
        # Find all FKs on model that point to a related_model.
        fks = []
        for f in model._meta.fields:
            if isinstance(f, ForeignKey) and f.rel.to in related_models:
                fks.append(f)
        # Replace each `sub_obj` with a duplicate.
        if model not in collector.data:
            continue
        sub_objects = collector.data[model]
        for obj in sub_objects:
            for fk in fks:
                fk_value = getattr(obj, "%s_id" % fk.name)
                # If this FK has been duplicated then point to the duplicate.
                fk_rel_to = data_snapshot[fk.rel.to]
                if fk_value in fk_rel_to:
                    dupe_obj = fk_rel_to[fk_value]
                    setattr(obj, fk.name, dupe_obj)
            # Duplicate the object and save it.
            obj.id = None
            if field is not None:
                setattr(obj, field, value)
            obj.save()
            if root_obj is None:
                root_obj = obj
    return root_obj
EDIT: Removed a debugging "print" statement.

In Django 1.5 this works for me:
thing.id = None
thing.pk = None
thing.save()

Using the CollectedObjects snippet above no longer works but can be done with the following modification:
from django.contrib.admin.util import NestedObjects
from django.db import DEFAULT_DB_ALIAS
and
collector = NestedObjects(using=DEFAULT_DB_ALIAS)
instead of CollectedObjects

I tried a few of the answers in Django 2.2/Python 3.6 and they didn't seem to copy one-to-many and many-to-many related objects. Also, many included hardcoding / incorporated foreknowledge of the data structures.
I wrote a way to do this in a more generic fashion, handling one-to-many and many-to-many related objects. Comments included, and I'm looking to improve upon it if you have suggestions:
def duplicate_object(self):
    """
    Duplicate a model instance, making copies of all foreign keys pointing to it.
    There are 3 steps that need to occur in order:
        1. Enumerate the related child objects and m2m relations, saving in lists/dicts
        2. Copy the parent object per django docs (doesn't copy relations)
        3a. Copy the child objects, relating to the copied parent object
        3b. Re-create the m2m relations on the copied parent object
    """
    related_objects_to_copy = []
    relations_to_set = {}
    # Iterate through all the fields in the parent object looking for related fields
    for field in self._meta.get_fields():
        if field.one_to_many:
            # One to many fields are backward relationships where many child
            # objects are related to the parent. Enumerate them and save a list
            # so we can copy them after duplicating our parent object.
            print(f'Found a one-to-many field: {field.name}')
            # 'field' is a ManyToOneRel which is not iterable, we need to get
            # the object attribute itself.
            related_object_manager = getattr(self, field.name)
            related_objects = list(related_object_manager.all())
            if related_objects:
                print(f' - {len(related_objects)} related objects to copy')
                related_objects_to_copy += related_objects
        elif field.many_to_one:
            # In testing, these relationships are preserved when the parent
            # object is copied, so they don't need to be copied separately.
            print(f'Found a many-to-one field: {field.name}')
        elif field.many_to_many:
            # Many to many fields are relationships where many parent objects
            # can be related to many child objects. Because of this the child
            # objects don't need to be copied when we copy the parent, we just
            # need to re-create the relationship to them on the copied parent.
            print(f'Found a many-to-many field: {field.name}')
            related_object_manager = getattr(self, field.name)
            relations = list(related_object_manager.all())
            if relations:
                print(f' - {len(relations)} relations to set')
                relations_to_set[field.name] = relations
    # Duplicate the parent object
    self.pk = None
    self.save()
    print(f'Copied parent object ({str(self)})')
    # Copy the one-to-many child objects and relate them to the copied parent
    for related_object in related_objects_to_copy:
        # Iterate through the fields in the related object to find the one that
        # relates to the parent model.
        for related_object_field in related_object._meta.fields:
            if related_object_field.related_model == self.__class__:
                # If the related_model on this field matches the parent
                # object's class, perform the copy of the child object and set
                # this field to the parent object, creating the new
                # child -> parent relationship.
                related_object.pk = None
                setattr(related_object, related_object_field.name, self)
                related_object.save()
                text = str(related_object)
                text = (text[:40] + '..') if len(text) > 40 else text
                print(f'|- Copied child object ({text})')
    # Set the many-to-many relations on the copied parent
    for field_name, relations in relations_to_set.items():
        # Get the field by name and set the relations, creating the new
        # relationships.
        field = getattr(self, field_name)
        field.set(relations)
        text_relations = []
        for relation in relations:
            text_relations.append(str(relation))
        print(f'|- Set {len(relations)} many-to-many relations on {field_name} {text_relations}')
    return self

If you only need a couple of copies in the database you're building, I've found you can just use the back button in the admin interface, change the necessary fields and save the instance again. This has worked for me in cases where, for instance, I need to build a "gimlet" and a "vodka gimlet" cocktail where the only difference is the name and one ingredient. Obviously, this requires a little foresight about the data and isn't as powerful as overriding Django's copy/deepcopy, but it may do the trick for some.

Django does have a built-in way to duplicate an object via the admin - as answered here:
In the Django admin interface, is there a way to duplicate an item?

Simple non generic way
Proposed solutions didn't work for me, so I went the simple, not clever way. This is only useful for simple cases.
For a model with the following structure
Book
|__ CroppedFace
|__ Photo
    |__ AwsReco
        |__ AwsLabel
        |__ AwsFace
            |__ AwsEmotion
this works
def duplicate_book(book: Book, new_user: MyUser):
    # AwsEmotion, AwsFace, AwsLabel, AwsReco, Photo, CroppedFace, Book
    # Load the children before resetting each primary key, while the
    # relations still point at the original rows.
    old_cropped_faces = list(book.croppedface_set.all())
    old_photos = list(book.photo_set.all())
    book.pk = None
    book.user = new_user
    book.save()
    for cf in old_cropped_faces:
        cf.pk = None
        cf.book = book
        cf.save()
    for photo in old_photos:
        # Look up the related AwsReco (if any) before the photo gets a new pk.
        reco = getattr(photo, 'awsreco', None)
        photo.pk = None
        photo.book = book
        photo.save()
        if reco is not None:
            old_aws_labels = list(reco.awslabel_set.all())
            old_aws_faces = list(reco.awsface_set.all())
            reco.pk = None
            reco.photo = photo
            reco.save()
            for label in old_aws_labels:
                label.pk = None
                label.reco = reco
                label.save()
            for face in old_aws_faces:
                old_aws_emotions = list(face.awsemotion_set.all())
                face.pk = None
                face.reco = reco
                face.save()
                for emotion in old_aws_emotions:
                    emotion.pk = None
                    emotion.aws_face = face
                    emotion.save()
    return book

Here is a somewhat simple-minded solution. This does not depend on any undocumented Django APIs. It assumes that you want to duplicate a single parent record, along with its child, grandchild, etc. records. You pass in a whitelist of classes that should actually be duplicated, in the form of a list of names of the one-to-many relationships on each parent object that point to its child objects. This code assumes that, given the above whitelist, the entire tree is self-contained, with no external references to worry about.
This solution doesn't do anything special for the author field above. I'm not sure if it would work with that. Like others have said, that author field probably shouldn't be repeated in different model classes.
One more thing about this code: it is truly recursive, in that it calls itself for each new level of descendants.
from collections import OrderedDict
def duplicate_model_with_descendants(obj, whitelist, _new_parent_pk=None):
    kwargs = {}
    children_to_clone = OrderedDict()
    for field in obj._meta.get_fields():
        if field.name == "id":
            pass
        elif field.one_to_many:
            if field.name in whitelist:
                these_children = list(getattr(obj, field.name).all())
                # `in` replaces the Python 2-only has_key(); lists are merged with +=
                if field.name in children_to_clone:
                    children_to_clone[field.name] += these_children
                else:
                    children_to_clone[field.name] = these_children
            else:
                pass
        elif field.many_to_one:
            if _new_parent_pk:
                kwargs[field.name + '_id'] = _new_parent_pk
        elif field.concrete:
            kwargs[field.name] = getattr(obj, field.name)
        else:
            pass
    new_instance = obj.__class__(**kwargs)
    new_instance.save()
    new_instance_pk = new_instance.pk
    for ky in children_to_clone.keys():
        child_collection = getattr(new_instance, ky)
        for child in children_to_clone[ky]:
            child_collection.add(duplicate_model_with_descendants(child, whitelist=whitelist, _new_parent_pk=new_instance_pk))
    return new_instance
Example usage:
from django.db import models
class Book(models.Model):
    author = models.ForeignKey('auth.User', on_delete=models.CASCADE)
class Chapter(models.Model):
    # author = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    book = models.ForeignKey(Book, related_name='chapters', on_delete=models.CASCADE)
class Page(models.Model):
    # author = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    # book = models.ForeignKey(Book, on_delete=models.CASCADE)
    chapter = models.ForeignKey(Chapter, related_name='pages', on_delete=models.CASCADE)
WHITELIST = ['books', 'chapters', 'pages']
original_record = Book.objects.get(pk=1)
duplicate_record = duplicate_model_with_descendants(original_record, WHITELIST)
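If it helps to see the recursion without the ORM in the way, here is a minimal sketch using plain Python objects. Node, its children dict and clone_tree are made-up names for illustration only, not Django APIs, and the sketch assumes the whole tree is already in memory. It only shows the shape of the idea: copy the node, then recurse into the whitelisted relationships and point each copied child at the new parent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a model instance; not a Django model."""
    name: str
    parent: Optional["Node"] = None
    # Maps a relationship name (e.g. 'chapters') to the child nodes under it.
    children: Dict[str, List["Node"]] = field(default_factory=dict)


def clone_tree(node, whitelist, new_parent=None):
    """Copy `node`, then recurse into whitelisted child relationships only."""
    copy = Node(name=node.name, parent=new_parent)
    for relation, kids in node.children.items():
        if relation not in whitelist:
            continue  # relationships outside the whitelist are not followed
        copy.children[relation] = [
            clone_tree(kid, whitelist, new_parent=copy) for kid in kids
        ]
    return copy


# Build a small Book -> Chapter -> Page tree, plus a relation we skip.
book = Node("book")
chapter = Node("chapter 1", parent=book)
page = Node("page 1", parent=chapter)
note = Node("margin note", parent=chapter)
book.children["chapters"] = [chapter]
chapter.children["pages"] = [page]
chapter.children["notes"] = [note]

copied = clone_tree(book, whitelist={"chapters", "pages"})
```

The 'notes' relationship is not in the whitelist, so the copied chapter ends up without it, which is the same role the whitelist plays in the Django version.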

I think you'd be happier with a simpler data model, also.
Is it really true that a Page is in some Chapter but a different book?
userMe = User(username="me")
userYou = User(username="you")
bookMyA = Book(author=userMe)
bookYourB = Book(author=userYou)
chapterA1 = Chapter(book=bookMyA, author=userYou)  # "me" owns the book, "you" owns the chapter?
chapterB2 = Chapter(book=bookYourB, author=userMe)  # "you" owns the book, "me" owns the chapter?
page1 = Page(book=bookMyA, chapter=chapterB2, author=userMe)  # Book and author agree, chapter doesn't?
It seems like your model is too complex.
I think you'd be happier with something simpler. I'm just guessing at this, since I don't know your entire problem.
class Book(models.Model):
    name = models.CharField(...)
class Chapter(models.Model):
    name = models.CharField(...)
    book = models.ForeignKey(Book)
class Page(models.Model):
    author = models.ForeignKey('auth.User')
    chapter = models.ForeignKey(Chapter)
Each page has distinct authorship. Each chapter, then, has a collection of authors, as does the book. Now you can duplicate Book, Chapter and Pages, assigning the cloned Pages to the new Author.
Indeed, you might want to have a many-to-many relationship between Page and Chapter, allowing you to have multiple copies of just the Page, without cloning book and Chapter.

I had no luck with any of the answers here with Django 2.1.2, so I created a generic way of performing a deep copy of a database model, heavily based on the answers posted above.
The key difference from those answers is that ForeignKey no longer has an attribute called rel, so f.rel.to has to be changed to f.remote_field.model, etc.
Furthermore, because it is difficult to know the order in which the database models should be copied, I created a simple queuing system that pushes the current model to the end of the queue if it cannot be copied yet. The code is posted below:
import queue
from django.contrib.admin.utils import NestedObjects
from django.db.models.fields.related import ForeignKey
def duplicate(obj, field=None, value=None, max_retries=5):
    # Use the Nested Objects collector to retrieve the related models
    collector = NestedObjects(using='default')
    collector.collect([obj])
    related_models = list(collector.data.keys())
    # Create an object to map old primary keys to new ones
    data_snapshot = {}
    model_queue = queue.Queue()
    for key in related_models:
        data_snapshot.update(
            {key: {item.pk: None for item in collector.data[key]}}
        )
        model_queue.put(key)
    # For each of the models in related models copy their instances
    root_obj = None
    attempt_count = 0
    while not model_queue.empty():
        model = model_queue.get()
        # field and value are passed through so copy_instances can apply them
        root_obj, success = copy_instances(model, related_models, collector, data_snapshot, root_obj, field, value)
        # If the copy is not a success, it probably means that not
        # all the related fields for the model have been copied yet.
        # The current model is therefore pushed to the end of the queue to be copied last
        if not success:
            # If the last model is unsuccessful or the number of max retries is reached, raise an error
            if model_queue.empty() or attempt_count > max_retries:
                raise DuplicationError(model)
            model_queue.put(model)
            attempt_count += 1
    return root_obj
def copy_instances(model, related_models, collector, data_snapshot, root_obj, field=None, value=None):
    # Store all foreign keys for the model in a list
    fks = []
    for f in model._meta.fields:
        if isinstance(f, ForeignKey) and f.remote_field.model in related_models:
            fks.append(f)
    # Iterate over the instances of the model
    for obj in collector.data[model]:
        # For each of the model's foreign keys check if the related object has been copied
        # and if so, assign its primary key to the current object's related field
        for fk in fks:
            pk_field = f"{fk.name}_id"
            fk_value = getattr(obj, pk_field)
            # Fetch the dictionary containing the old ids
            fk_rel_to = data_snapshot[fk.remote_field.model]
            # If the value exists and is in the dictionary assign it to the object
            if fk_value is not None and fk_value in fk_rel_to:
                dupe_pk = fk_rel_to[fk_value]
                # If the desired pk is None it means that the related object has not been copied yet
                # so the function returns unsuccessful
                if dupe_pk is None:
                    return root_obj, False
                setattr(obj, pk_field, dupe_pk)
        # Store the old pk and save the object without an id to create a shallow copy of the object
        old_pk = obj.id
        obj.id = None
        if field is not None:
            setattr(obj, field, value)
        obj.save()
        # Store the new id in the data snapshot object for potential use on later objects
        data_snapshot[model][old_pk] = obj.id
        if root_obj is None:
            root_obj = obj
    return root_obj, True
I hope this helps :)
The duplication error is just a simple exception subclass:
class DuplicationError(Exception):
    """
    Is raised when a duplication operation did not succeed
    Attributes:
        model -- The database model that failed
    """
    def __init__(self, model):
        self.error_model = model
    def __str__(self):
        return f'Was not able to duplicate database objects for model {self.error_model}'
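The requeue idea can be tried out without a database. Below is a rough sketch, assuming each model's dependencies are already known as a plain dict (the answer above derives them from foreign keys instead). copy_order and its dependencies argument are hypothetical names for illustration, not part of Django or of the code above.

```python
from collections import deque


def copy_order(dependencies, max_retries=5):
    """Return an order in which items can be copied.

    `dependencies` maps each item to the set of items that must be
    copied before it (a hypothetical input used only for this sketch).
    """
    pending = deque(dependencies)
    done = []
    retries = 0
    while pending:
        item = pending.popleft()
        if all(dep in done for dep in dependencies[item]):
            done.append(item)
            continue
        # Not ready yet: push it to the back and try the others first.
        if not pending or retries >= max_retries:
            raise RuntimeError(f"could not order {item!r}")
        pending.append(item)
        retries += 1
    return done


order = copy_order({
    "Page": {"Chapter"},
    "Chapter": {"Book"},
    "Book": set(),
})
```

A cycle (A depends on B, B depends on A) can never be resolved by requeueing, which is why some retry limit is needed to stop the loop.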

There is an option to create a duplicate/clone/save-as-new in the Django admin.
Create a ModelAdmin class for the model you want to clone in admin.py.
In the class, set the save_as option:
from django.contrib import admin
@admin.register(Book)
class BookAdmin(admin.ModelAdmin):
    save_as = True
This will create a "Save as new" button in your admin panel to completely clone the model object with all its related fields.

django-clone library works perfectly for me with ManyToMany relationships. Just:
Make the model you want to clone a subclass of CloneModel:
from django.db import models
from model_clone.models import CloneModel
class MyModel(CloneModel):
    name = models.CharField(max_length=50)
    tags = models.ManyToManyField(Tag)
    # You must specify all the ManyToManyField fields
    _clone_m2m_fields = ['tags']
Then just call make_clone method:
obj = MyModel.objects.get(pk=some_pk)
cloned = obj.make_clone()
You can also define specific values for the cloned object. Read the docs for more!

I tried Stephen G Tuggy's solution and found it very clever but, unfortunately, it won't work in some special situations.
Let's suppose the following scenario:
class FattAqp(models.Model):
    descr = models.CharField('descrizione', max_length=200)
    ef = models.ForeignKey(Esercizio, ...)
    forn = models.ForeignKey(Fornitore, ...)
class Periodo(models.Model):
    # id used to identify the documents
    # period recorded on the invoice
    data_i_p = models.DateField('data inizio', blank=True)
    idfatt = models.ForeignKey(FattAqp, related_name='periodo')
class Lettura(models.Model):
    mc_i = models.DecimalField(max_digits=7, ...)
    faqp = models.ForeignKey(FattAqp, related_name='lettura')
    an_im = models.ForeignKey('cnd.AnagImm', ..)
class DettFAqp(models.Model):
    imponibile = models.DecimalField(...)
    voce = models.ForeignKey(VoceAqp, ...)
    periodo = models.ForeignKey(Periodo, related_name='dettfaqp')
In this case, if we try to deep-copy a FattAqp instance, the ef, forn, an_im and voce fields will not be set correctly; on the other hand, idfatt, faqp and periodo will.
I solved the problem by adding one more parameter to the function and slightly modifying the code. I tested it with Python 3.6 and Django 2.2.
Here it is:
from collections import OrderedDict
def duplicate_model_with_descendants(obj, whitelist, _new_parent_pk=None, static_fk=None):
    # Treat a missing static_fk as "no foreign keys to preserve"
    static_fk = static_fk or []
    kwargs = {}
    children_to_clone = OrderedDict()
    for field in obj._meta.get_fields():
        if field.name == "id":
            pass
        elif field.one_to_many:
            if field.name in whitelist:
                these_children = list(getattr(obj, field.name).all())
                if field.name in children_to_clone:
                    children_to_clone[field.name] += these_children
                else:
                    children_to_clone[field.name] = these_children
            else:
                pass
        elif field.many_to_one:
            name_with_id = field.name + '_id'
            if _new_parent_pk:
                kwargs[name_with_id] = _new_parent_pk
            # Foreign keys listed in static_fk keep the value of the original object
            if name_with_id in static_fk:
                kwargs[name_with_id] = getattr(obj, name_with_id)
        elif field.concrete:
            kwargs[field.name] = getattr(obj, field.name)
        else:
            pass
    new_instance = obj.__class__(**kwargs)
    new_instance.save()
    new_instance_pk = new_instance.pk
    for ky in children_to_clone.keys():
        child_collection = getattr(new_instance, ky)
        for child in children_to_clone[ky]:
            child_collection.add(
                duplicate_model_with_descendants(child, whitelist=whitelist, _new_parent_pk=new_instance_pk, static_fk=static_fk))
    return new_instance
Example usage:
original_record = FattAqp.objects.get(pk=4)
WHITELIST = ['lettura', 'periodo', 'dettfaqp']
STATIC_FK = ['forn_id', 'ef_id', 'an_im_id', 'voce_id']
duplicate_record = duplicate_model_with_descendants(original_record, WHITELIST, static_fk=STATIC_FK)

Elaborated based on previous answers:
def derive(obj):
    """
    Derive a new model instance from previous one,
    and duplicate all related fields to point to the new instance
    """
    import copy
    from django.contrib.admin.utils import NestedObjects
    from django.db import DEFAULT_DB_ALIAS
    from django.db.models.fields.related import ForeignKey
    obj2 = copy.copy(obj)
    obj2.pk = None
    obj2.save()
    collector = NestedObjects(using=DEFAULT_DB_ALIAS)
    collector.collect([obj])
    collector.sort()
    # Materialize the keys so they can be reversed on any Python 3 version
    related_models = list(collector.data.keys())
    data_snapshot = {}
    for key in collector.data.keys():
        data_snapshot.update({
            key: dict(
                zip(
                    [item.pk for item in collector.data[key]],
                    [item for item in collector.data[key]]
                )
            )
        })
    duplicate_order = reversed(related_models)
    for model in duplicate_order:
        # Find all FKs on model that point to a related_model.
        fks = []
        for f in model._meta.fields:
            if isinstance(f, ForeignKey) and f.rel.to in related_models:
                fks.append(f)
        # Replace each `sub_obj` with a duplicate.
        if model not in collector.data:
            continue
        sub_objects = collector.data[model]
        for obj in sub_objects:
            for fk in fks:
                dupe_obj = copy.copy(obj)
                setattr(dupe_obj, fk.name, obj2)
                dupe_obj.pk = None
                dupe_obj.save()
    return obj2

Julio Marins' suggestion works! Thanks!
For Django >= 2.0, this line:
if isinstance(f, ForeignKey) and f.rel.to in related_models:
should be replaced with:
if isinstance(f, ForeignKey) and f.remote_field.model in related_models:
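If the same code has to run on both older and newer Django versions, one option is a small helper that tries the newer attribute first and falls back to the older one. This is only a sketch: fk_target_model is a hypothetical name, and the SimpleNamespace objects below are stand-ins for real ForeignKey fields so the snippet runs without Django installed.

```python
from types import SimpleNamespace


def fk_target_model(field):
    """Return the model a ForeignKey-like field points to.

    Prefers `remote_field.model` and falls back to the older `rel.to`
    attribute when `remote_field` is absent.
    """
    remote = getattr(field, "remote_field", None)
    if remote is not None:
        return remote.model
    return field.rel.to


# Stand-ins for fields as exposed by newer and older Django versions.
new_style = SimpleNamespace(remote_field=SimpleNamespace(model="Book"))
old_style = SimpleNamespace(rel=SimpleNamespace(to="Book"))

targets = [fk_target_model(new_style), fk_target_model(old_style)]
```

With a helper like this, the check in the loop would read: if isinstance(f, ForeignKey) and fk_target_model(f) in related_models.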
