I've built a scraper that gets product data from different shopping websites.
When I run python scraper.py the program will print a JSON object containing all the data like this:
{ 'ebay': [ { 'advertiser': 'ebay',
'advertiser_url': 'https://rover.ebay.com/rover/1/711-53200-19255-0/1?ff3=2&toolid=10041&campid=5338482617&customid=&lgeo=1&vectorid=229466&item=302847614914',
'description': '30-Day Warranty - Free Charger & Cable - '
'Easy Returns!',
'main_image': 'https://thumbs1.ebaystatic.com/pict/04040_0.jpg',
'price': '290.0',
'title': 'Apple iPhone 8 Plus Smartphone AT&T Sprint '
'T-Mobile Verizon or Unlocked 4G LTE'}
]}
I want this data to be added to the database automatically every time I run the scraper.
Here's my database structure:
models.py
class Product(models.Model):
similarity_id = models.CharField(max_length=255, blank=True, null=True)
name = models.CharField(max_length=255, blank=True, null=True)
url = models.SlugField(blank=True, unique=True, allow_unicode=True)
advertiser_url = models.TextField(blank=True, null=True)
main_image = models.TextField(blank=True, null=True)
second_image = models.TextField(blank=True, null=True)
third_image = models.TextField(blank=True, null=True)
old_price = models.FloatField(default=0.00)
price = models.FloatField(default=0.00)
discount = models.FloatField(default=0.00)
currency = models.CharField(max_length=255, default="$")
description = models.TextField(blank=True, null=True)
keywords = models.CharField(max_length=255, blank=True, null=True)
asin = models.CharField(max_length=80, blank=True, null=True)
iban = models.CharField(max_length=255, blank=True, null=True)
sku = models.CharField(max_length=255, blank=True, null=True)
seller = models.CharField(max_length=255, blank=True, null=True)
free_shipping = models.BooleanField(default=False)
in_stock = models.BooleanField(default=True)
sold_items = models.IntegerField(default=0)
likes_count = models.IntegerField(default=0)
category = models.CharField(max_length=255, blank=True, null=True)
sub_category = models.CharField(max_length=255, blank=True, null=True)
reviews_count = models.IntegerField(default=0)
rating = models.FloatField(default=0)
active = models.BooleanField(default=True)
is_prime = models.BooleanField(default=False)
created_on = models.DateTimeField(auto_now_add=True)
advertiser = models.CharField(max_length=255, blank=True, null=True)
objects = ProductManager()
class Meta:
verbose_name_plural = "products"
def __str__(self):
return self.name
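Add this to scraper.py:
from path.to.models import Product  # replace path.to.models with your app's models module
product = Product()
product.<key> = <value>  # where <key> is the model field and <value> is the value to store
After you have assigned every field, call:
product.save()
If scraper.py runs outside manage.py, configure Django first (set DJANGO_SETTINGS_MODULE and call django.setup()) before importing the model.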
Trick
If all the keys in the json response match the fields in the model, you can do:
for k, v in response.items():
    setattr(product, k, v)
product.save()
That will save you a lot of lines and time :)
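In the scraper output shown in the question, the keys do not all match the model fields: the item has 'title' while Product has name, and 'price' is a string while the field is a FloatField. A small mapping step before setattr handles that. This is only a sketch: the helper to_product_fields and the hand-written FIELD_MAP are hypothetical names, not part of the original scraper.

```python
# Hypothetical mapping from scraper keys to Product field names.
FIELD_MAP = {
    "title": "name",
    "price": "price",
    "description": "description",
    "main_image": "main_image",
    "advertiser": "advertiser",
    "advertiser_url": "advertiser_url",
}


def to_product_fields(item):
    """Translate one scraped item into keyword arguments for Product."""
    fields = {}
    for source_key, field_name in FIELD_MAP.items():
        if source_key in item:
            fields[field_name] = item[source_key]
    # The scraper returns the price as a string; Product.price is a FloatField.
    if "price" in fields:
        fields["price"] = float(fields["price"])
    return fields


# Shortened copy of the scraper output from the question.
scraped = {
    "ebay": [
        {
            "advertiser": "ebay",
            "price": "290.0",
            "title": "Apple iPhone 8 Plus Smartphone",
        }
    ]
}

# Flatten the per-site lists into one list of field dictionaries.
rows = [to_product_fields(item) for items in scraped.values() for item in items]
print(rows)
```

Each dictionary can then be passed to the model, for example Product(**fields).save(), or to Product.objects.update_or_create(...) if re-running the scraper should not create duplicate rows.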
I work with JSON a lot: I have API caching where I receive a lot of JSON-based API data, and I want to store it in a database for querying and caching. If you use PostgreSQL (for instance), you will see that it has extensions for JSON. This means that you can save JSON data in a dedicated JSON field. Better still, there are SQL extensions that let you run queries on the JSON data; that is, PostgreSQL has "NoSQL" capabilities, which lets you work with JSON natively. I find it very compelling and I recommend it highly. There is a learning curve because it uses non-traditional SQL, but heck, we have Stack Overflow.
see: https://django-postgres-extensions.readthedocs.io/en/latest/json.html
Here is a small example:
product_onhand_rows = (
    DearCache.objects
    .filter(object_type=DearObjectType.PRODUCT_AVAILABILITY.value)
    .filter(dear_account_id=self.dear_api.account_id)
    .filter(jdata__Location=warehouse)
    .filter(jdata__SKU=sku)
    .all()
)
In this example, the JSON is stored in a field named jdata.
jdata__Location accesses the key Location in the JSON.
Lookups nest in the same way for deeper keys. For advanced queries, I resort to SQL:
select object_type,last_modified, jdata
from cached_dear_dearcache
where object_type = 'orders'
and jdata->>'Status' in ('ESTIMATING','ESTIMATED')
order by last_modified;
There's more: you can also 'unroll' lists. This is what I would call a complicated example: my JSON has a list of invoices, each of which has a list of lines.
/* 1. listing invoice lines. We have to iterate over the array of invoices to get each invoice, and then inside the invoice object find the array of lines */
select object_type,last_modified, jsonb_array_elements(jsonb_array_elements(cached_dear_dearcache.jdata#>'{Invoices}')->'Lines') as lines,
jsonb_array_elements(cached_dear_dearcache.jdata#>'{Invoices}')->'InvoiceDate' as invoice_date,
jsonb_array_elements(cached_dear_dearcache.jdata#>'{Invoices}')->'InvoiceNumber' as invoice_number
from cached_dear_dearcache
where object_type = 'orders' order by last_modified;
Your approach is to convert the JSON data to a traditional SQL model. That will work too, but it is not very flexible: if the JSON "schema" changes, your database schema may need to change as well. Philosophically, I think it is better to go with the flow and use the JSON extensions, which gives you the best of both worlds. Performance is good, by the way.
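To make the "unrolling" in the second SQL query concrete, here is roughly the same traversal over a single cached document in plain Python. The sample document and the helper name invoice_lines are made up for illustration; only the Invoices, Lines, InvoiceDate, InvoiceNumber, and Status keys come from the queries above.

```python
def invoice_lines(jdata):
    """Yield one (invoice_number, invoice_date, line) tuple per invoice line."""
    for invoice in jdata.get("Invoices", []):
        for line in invoice.get("Lines", []):
            yield invoice.get("InvoiceNumber"), invoice.get("InvoiceDate"), line


# Made-up sample document shaped like the cached JSON.
order = {
    "Status": "ESTIMATED",
    "Invoices": [
        {
            "InvoiceNumber": "INV-1",
            "InvoiceDate": "2020-01-15",
            "Lines": [{"SKU": "A", "Quantity": 2}, {"SKU": "B", "Quantity": 1}],
        },
        {
            "InvoiceNumber": "INV-2",
            "InvoiceDate": "2020-02-01",
            "Lines": [{"SKU": "C", "Quantity": 5}],
        },
    ],
}

# Two invoices with two and one lines give three unrolled rows.
unrolled = list(invoice_lines(order))
for row in unrolled:
    print(row)
```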
Related
I'm currently working on a website where advertisements will be posted to display vehicles for sale and rent. I would like to retrieve a queryset that highlights only one car brand (i.e. Audi) which has the highest number of posts for the respective model. Example:
Displaying the Audi brand because it has the highest number of related posts.
My question is, what's the most efficient way of doing this? I've done some work here but I'm pretty sure this is not the most efficient way. What I have is the following:
# Algorithm that is currently retrieving the name of the brand and the number of related posts it has.
def top_brand_ads():
    queryset = Advertisement.objects.filter(status__iexact="Published", owner__payment_made="True").order_by('-publish', 'name')
    result = {}
    for ad in queryset:
        # Try to update an existing key-value pair
        try:
            count = result[ad.brand.name.title()]
            result[ad.brand.name.title()] = count + 1
        except KeyError:
            # If the key doesn't exist then create it
            result[ad.brand.name.title()] = 1
    # Getting the brand with the highest number of posts from the result dictionary
    top_brand = max(result, key=lambda x: result[x])  # Returns for i.e. (Mercedes Benz)
    context = {
        top_brand: result[top_brand]  # Retrieving the value for the top_brand from the result dict.
    }
    print(context)  # {'Mercedes Benz': 7} -> Mercedes Benz has seven (7) related posts.
    return context
Is there a way I could return a queryset instead without doing what I did here or could this be way more efficient?
If the related models are needed, please see below:
models.py
# Brand
class Brand(models.Model):
name = models.CharField(max_length=255, unique=True)
image = models.ImageField(upload_to='brand_logos/', null=True, blank=True)
slug = models.SlugField(max_length=250, unique=True)
...
# Methods
# Owner
class Owner(models.Model):
user = models.ForeignKey(User, on_delete=models.CASCADE)
telephone = models.CharField(max_length=30, blank=True, null=True)
alternate_telephone = models.CharField(max_length=30, blank=True, null=True)
user_type = models.CharField(max_length=50, blank=True, null=True)
payment_made = models.BooleanField(default=False)
expiring = models.DateTimeField(default=timezone.now)
...
# Methods
# Advertisement (Post)
class Advertisement(models.Model):
STATUS_CHOICES = (
('Draft', 'Draft'),
('Published', 'Published'),
)
owner = models.ForeignKey(Owner, on_delete=models.CASCADE, blank=True, null=True)
name = models.CharField(max_length=150, blank=True, null=True)
brand = models.ForeignKey(Brand, on_delete=models.CASCADE, blank=True, null=True)
publish = models.DateTimeField(default=timezone.now)
status = models.CharField(max_length=10, choices=STATUS_CHOICES, default='Draft')
...
# Other fields & methods
Any help would be greatly appreciated.
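Since you need brands, query the Brand model directly:
from django.db.models import Count
Brand.objects.filter(
    advertisement__status__iexact="Published",
    advertisement__owner__payment_made=True,
).annotate(published_ads=Count('advertisement__id')).order_by('-published_ads')
Both conditions are passed to a single filter() call so that they apply to the same advertisement; chained filter() calls across a multi-valued relation are joined separately, which can inflate the count. The first row of this queryset is the top brand.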
However, even in your proposed solution, you can improve a few things:
Remove the order_by() call from your queryset. It doesn't affect the final result but adds overhead, especially if your Advertisement model is not indexed on those fields.
The first time you access ad.brand on each ad, you hit the database. This is called the N+1 problem: a loop over n ads makes n extra queries. You can use select_related to avoid it. In your case: Advertisement.objects.select_related('brand')...
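If the counting has to stay in Python, collections.Counter removes the try/except bookkeeping. This is a minimal sketch over plain brand-name strings. In the real view the names would come from the queryset, for example via values_list('brand__name', flat=True), which also avoids the per-ad lookup. The sample list is invented.

```python
from collections import Counter

# Invented sample standing in for the brand names of published ads.
brand_names = ["audi", "mercedes benz", "audi", "bmw", "Mercedes Benz", "mercedes benz"]

# Normalise the same way the question does (str.title) and count occurrences.
counts = Counter(name.title() for name in brand_names)

# most_common(1) returns a one-element list holding the (brand, count) pair
# with the highest count.
top_brand, top_count = counts.most_common(1)[0]
context = {top_brand: top_count}
print(context)
```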
Have you tried annotating with Count?
from django.db.models import Count
Car.objects.annotate(num_views=Count('car_posts_related_name')).order_by('-num_views')  # descending, so the entry with the most posts comes first
I have a large dataset with over 1m records. It has a manytomany field that causes duplicate returns on filtering.
models.py:
class Type(models.Model):
name = models.CharField(max_length=100, db_index=True)
class Catalogue(models.Model):
link = models.TextField(null=False)
image = models.TextField(null=True)
title = models.CharField(max_length=100, null=True)
city = models.CharField(db_index=True,max_length=100, null=True)
district = models.CharField(db_index=True,max_length=100, null=True)
type = models.ManyToManyField(Type, db_index=True)
datetime = models.CharField(db_index=True, max_length=100, null=True)
views.py:
last2week_q = Q(datetime__gte=last2week)
type_q = Q(type__in=intersections)
city_district_q = (Q(*[Q(city__contains=x) for x in city_district], _connector=Q.OR) |
Q(*[Q(district__contains=x) for x in city_district], _connector=Q.OR))
models.Catalogue.objects.filter(last2week_q & type_q & city_district_q).order_by('-datetime').distinct()
distinct() is too slow and I'm looking for a different solution to remove duplicates.
P.S:
I also tried using this query in place of type_q, but it is even slower than distinct() because typ_ids is a very large list.
typ_ids = models.Catalogue.objects.only('type').filter(type__in=intersections).values_list('id', flat=True)
type_q = Q(id__in=typ_ids)
The model I'm trying to pull values has a different primary key than the column I'm trying to get the list from.
EDIT: Adding a bit more detail about what I want to do: I want to get a list of all records from CustomerCatalog that have the same value in its "ccname" field as the Server model/form has in its "account" field. This way, as I'm adding a server for a specific account, it will look up all products in the CustomerCatalog that are for this specific account.
This is what I was thinking but it doesn't work:
class Server(models.Model):
    """
    Model representing list of servers per contract
    """
    os_license = models.CharField(max_length=95, blank=True, choices=product_list)
    account = models.ForeignKey('Account', on_delete=models.SET_NULL, null=True)
    @staticmethod
    def product_list():
        return CustomerCatalog.objects.filter(account=Server.account).values_list('ccname', flat=True)
Here's the CustomerCatalog model:
class CustomerCatalog(models.Model):
    """
    Model for representing products that have been sold and are being used by a specific customer.
    """
    id = models.AutoField(primary_key=True)
    ccproductid = models.ForeignKey('ProductCatalog', on_delete=models.SET_NULL, null=True, verbose_name="Product ID")
    ccname = models.CharField(max_length=200, verbose_name="Product name", blank=True)
    account = models.ForeignKey('Account', on_delete=models.SET_NULL, null=True, blank=True, help_text='Account to which this product was sold.')
    unit = models.CharField(max_length=50, help_text='Unit of measurement.', null=True, blank=True, default='VM')
    unit_price = models.DecimalField(max_digits=30, decimal_places=2, help_text="Enter price per unit.", null=True, blank=True)
    total_qty = models.DecimalField(max_digits=30, decimal_places=2, help_text="Enter quantity.", null=True, blank=True)
    total_price = models.DecimalField(max_digits=30, decimal_places=2, help_text="Enter total price (note this will be calculated in a future release).", null=True, blank=True)
    in_contract = models.BooleanField(null=False, blank=False, default=True)
    history = HistoricalRecords()
    @property
    def get_ccname(self):
        return self.ccproductid.name
    def save(self, *args, **kwargs):
        self.ccname = self.get_ccname
        super(CustomerCatalog, self).save(*args, **kwargs)
    class Meta:
        verbose_name_plural = 'Customer Catalog'
    def __str__(self):
        """
        String for representing the Model object (in Admin site etc.)
        """
        return f'{self.ccname}'
Here's the ProductCatalog. This is used as the unique list of all products and what is indexed when assigning products to individual customers.
class ProductCatalog(models.Model):
"""
Model to represent the full product portfolio.
"""
id = models.CharField(max_length=40, help_text="Enter Product ID.", primary_key=True)
name = models.CharField(max_length=255, help_text="Enter name of resource unit that will be used with service definitions.", unique=True)
billing = models.CharField(max_length=70, help_text="Enter billing type.")
unit = models.CharField(max_length=80, help_text='Unit of measurement.')
short_description = models.TextField(max_length=400, help_text="Enter the description of the service.", blank=True)
version = models.DateField(default=timezone.now, help_text="Enter date of the CPS version.")
servicecat = models.ForeignKey('Service', on_delete=models.SET_NULL, null=True, blank=True, help_text='Enter the service in which this product belongs.')
org = models.ForeignKey(OrgUnit, on_delete=models.SET_NULL, null=True, help_text="Enter organization providing the service.")
history = HistoricalRecords()
class Meta:
verbose_name_plural = 'Product Catalog'
def __str__(self):
"""
String for representing the Model object (in Admin site etc.)
"""
return f'{self.id}'
As the error log you provided says:
ERRORS: ServiceCatalog.HistoricalServer.os_license: (fields.E005) 'choices' must be an iterable containing (actual value, human readable name) tuples. ServiceCatalog.Server.os_license: (fields.E005) 'choices' must be an iterable containing (actual value, human readable name) tuples.
* The product_list method returns a flat list of values, not tuples. choices needs the form ((value, human-readable name), ...), so edit the function to return a tuple of such pairs. You can do this either by mapping over the flat list and adding a label to each value, or by requesting exactly two fields from values_list(). (Note: flat=True cannot be used when values_list() is given more than one field.)
* product_list also fails because neither an instance nor anything else is passed to it. Server is the model class, not an instance, so Server.account is not the account of any particular server. You could change the function, for example, like this:
def product_list(self):
    # CustomerCatalog has no os_license field, so ccname is used as both the stored value and the label
    return CustomerCatalog.objects.filter(account=self.account).values_list('ccname', 'ccname')  # or map(lambda x: (x, 'Some human readable string'), ...values_list('ccname', flat=True))
Keep in mind that choices is evaluated once, when the class is defined, so it cannot depend on self; per-account options are normally set on the form field instead.
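The shape that choices expects can be checked without a database. This is a small sketch, assuming a hypothetical helper as_choices that turns a flat list of ccname values into pairs, using the same string as both the stored value and the label:

```python
def as_choices(names):
    """Build (value, label) pairs from a flat iterable of names.

    Blank names and duplicates are skipped; order of first appearance is kept.
    """
    seen = set()
    choices = []
    for name in names:
        if name and name not in seen:
            seen.add(name)
            choices.append((name, name))
    return tuple(choices)


# Invented product names standing in for values_list('ccname', flat=True).
flat_names = ["Windows Server 2016", "RHEL 7", "", "RHEL 7"]
choices = as_choices(flat_names)
print(choices)
```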
I have the following piece of code. I'm using it to return json so Datatables can render it. I pass in query parameters.
def map_query(type_, type_model, type_pk, query_data, request):
    type_results_query = None
    problem_data = get_model_query_data(query_data, Problem)
    problems_filtered = Problem.objects.filter(**problem_data)
    if request.POST:
        model_query_data = get_model_query_data(query_data, type_model)
        type_results_query = Chgbk.objects.filter(**model_query_data)
        print(type_results_query)
    return type_results_query
So type_results_query returns the data I want. But the Problem model has a foreign key that links to the primary key of the Chgbk table. I want to get the data from the Problem table into the Chgbk query as well, as if the two objects were merged, but I cannot figure out how to do this and it is driving me crazy.
Models would be:
class Chgbk(VNCModel):
chgbk_id = models.IntegerField(primary_key=True)
facility = models.ForeignKey('Facility', models.DO_NOTHING)
create_dt = models.DateTimeField(blank=True, null=True)
mod_dt = models.DateTimeField(blank=True, null=True)
carrier_scac = models.CharField(max_length=25, blank=True, null=True)
carrier_name = models.CharField(max_length=25, blank=True, null=True)
class Problem(VNCModel):
problem_id = models.IntegerField(primary_key=True)
chgbk = models.ForeignKey(Chgbk, models.DO_NOTHING, blank=True, null=True)
I have 100k records in both model 'A' and in model 'B'
Ex:
class A(models.Model):
user_email = models.EmailField(null=True, blank=True)
user_mobile = models.CharField(max_length=30, null=True, blank=True)
book_id = models.CharField(max_length=255, null=True, blank=True)
payment_gateway_response = JSONField(blank=True, null=True)
class B(models.Model):
order = models.ForeignKey(A, null=True, blank=True)
pay_id = models.CharField(max_length=250, null=True, blank=True)
user_email = models.EmailField(null=True, blank=True)
user_mobile = models.CharField(max_length=30, null=True, blank=True)
created = models.DateTimeField(blank=True, null=True)
total_payment = models.DecimalField(decimal_places=3, max_digits=20, blank=True, null=True)
I want to get B's objects using A's values
for example
all_a = A.objects.all()
for a in all_a:
    b = B.objects.filter(user_email=a.user_email, user_mobile=a.user_mobile)
This works and I get the results, but with 100k records the for loop takes too much time. Is there a faster way to do this in Django?
You can get a list of each value in a and filter b with those values.
a = A.objects.all()
emails = list(a.values_list('user_email', flat=True))
mobiles = list(a.values_list('user_mobile', flat=True))
b = B.objects.filter(user_email__in=emails, user_mobile__in=mobiles)
However, the results may include email and mobile combinations that are not paired in A. If emails and mobiles are each unique in A, and the email and mobile of every B come from a single A record, you won't have this problem.
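The caveat is easy to see with plain data. In this sketch the tuples are invented (email, mobile) pairs. The first filter mimics the two independent __in lookups, and the second checks the pair as a whole:

```python
# Invented (email, mobile) pairs standing in for rows of A and B.
a_rows = [("ann@example.com", "111"), ("bob@example.com", "222")]
b_rows = [
    ("ann@example.com", "111"),  # a real pair from A
    ("ann@example.com", "222"),  # Ann's email with Bob's mobile
    ("cat@example.com", "333"),  # unrelated
]

emails = {email for email, _ in a_rows}
mobiles = {mobile for _, mobile in a_rows}

# Equivalent of user_email__in=emails, user_mobile__in=mobiles:
# each column is checked on its own, so the mixed row slips through.
loose = [row for row in b_rows if row[0] in emails and row[1] in mobiles]

# Matching on the pair itself excludes the mixed row.
pairs = set(a_rows)
strict = [row for row in b_rows if row in pairs]

print(loose)
print(strict)
```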
If you're not interested in caching the A model, you may have a performance increase using iterator() (see for reference https://docs.djangoproject.com/en/1.11/ref/models/querysets/#iterator):
for a in A.objects.all().iterator():
    b = B.objects.filter(user_email=a.user_email, user_mobile=a.user_mobile)
You can do
import operator
from functools import reduce  # reduce is not a builtin in Python 3
from django.db.models import Q
q = A.objects.all().values('user_email', 'user_mobile')
B.objects.filter(reduce(operator.or_, [Q(**i) for i in q]))
If you need to perform some operation on every B object that depends on its A, this is not the way to do it.
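If each B does have to be processed together with its A, one option is to load both tables once and group the B rows by the (email, mobile) key in memory, so the loop does dictionary lookups instead of one query per A. This is a sketch with plain tuples. The data and the names b_by_key and matches are illustrative, and in Django the rows could come from values_list().

```python
from collections import defaultdict

# Invented rows: (id, email, mobile).
a_rows = [(1, "ann@example.com", "111"), (2, "bob@example.com", "222")]
b_rows = [
    (10, "ann@example.com", "111"),
    (11, "bob@example.com", "222"),
    (12, "ann@example.com", "111"),
    (13, "cat@example.com", "333"),
]

# Index B once by the (email, mobile) pair.
b_by_key = defaultdict(list)
for b_id, email, mobile in b_rows:
    b_by_key[(email, mobile)].append(b_id)

# One dictionary lookup per A row instead of one query per A row.
matches = {a_id: b_by_key.get((email, mobile), []) for a_id, email, mobile in a_rows}
print(matches)
```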