Django: Which days have objects? - python

I'm currently creating a simple calendar for one of my Django projects. The calendar will display the current month and its days. Any day that has an item will be highlighted in red so that the user knows there are items for that day. The number of items, or which items they are, doesn't matter; all we care about is whether a day has items.
Let's say I have the following model:
from datetime import datetime

from django.db import models

class Items(models.Model):
    name = models.CharField(max_length=140)
    datetime = models.DateTimeField(auto_now_add=False)

    def save(self, *args, **kwargs):
        # Default the timestamp to now if it wasn't set explicitly
        if self.datetime is None:
            self.datetime = datetime.now()
        super(Items, self).save(*args, **kwargs)
Here is my current logic for finding which days have items:
from calendar import monthrange

# Find the number of days in June 2015
num_days = monthrange(2015, 6)[1]
days_with_items = []
# Increase num_days by 1 so that the last day of the month
# is included in the range
num_days += 1
for day in range(1, num_days):
    has_items = Items.objects.filter(datetime__day=day,
                                     datetime__month=6,
                                     datetime__year=2015).exists()
    if has_items:
        days_with_items.append(day)
return days_with_items  # (this lives inside a view/helper function)
This returns a list of all the days that have items. It works, but I'm looking for a more efficient way to do it, since Django makes a separate trip to the DB for every .exists() call.
Any suggestions?

I see two possible options. The first is to count at the database level; the second is an efficient loop over the available data at the Python level. Depending on data size and DB efficiency, you can choose whichever suits you best.
Counting in the database is explained here:
Django ORM, group by day
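For reference, a minimal sketch of that approach; it assumes Django 1.10+, where TruncDay is available (the linked answer predates it, so treat this as an illustration rather than the linked solution verbatim):
from django.db.models import Count
from django.db.models.functions import TruncDay

# Group June 2015 items by calendar day and count them in a single query
counts = (Items.objects
          .filter(datetime__year=2015, datetime__month=6)
          .annotate(day=TruncDay('datetime'))
          .values('day')
          .annotate(n=Count('pk')))
days_with_items = [row['day'].day for row in counts]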
Or solution two, a simple script (not so elegant, but just as an example):
days_with_items_hash = {}
items = Items.objects.filter(
    datetime__month=6,
    datetime__year=2015
)
for item in items:
    days_with_items_hash[item.datetime.day] = True
days_with_items = days_with_items_hash.keys()
I would stick with the database solution because it can be optimised (SQL views, an extra column with just the day, etc.).
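As a sketch of the "extra column" idea (the field name and details here are assumptions, not from the original post):
from django.db import models

class Items(models.Model):
    name = models.CharField(max_length=140)
    datetime = models.DateTimeField()
    # Denormalised, indexed day-of-month, so the calendar query can hit
    # an index instead of extracting the day from the timestamp
    day_of_month = models.PositiveSmallIntegerField(db_index=True, default=1)

    def save(self, *args, **kwargs):
        self.day_of_month = self.datetime.day
        super(Items, self).save(*args, **kwargs)

# The calendar query then becomes a single indexed DISTINCT:
# Items.objects.filter(datetime__year=2015, datetime__month=6)
#              .values_list('day_of_month', flat=True).distinct()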

First, let's get all the items for the required month:
items = Items.objects.filter(datetime__month=6, datetime__year=2015)
days = {item.datetime.day for item in items}  # unique days
If you want to make a partial query, specify only the values you need. Here's the concept:
items = Item.objects.filter(
    date_added__month=6, date_added__year=2015
).values('date_added')
days = {item['date_added'].day for item in items}
It will result in the following SQL query:
QUERY = u'SELECT "main_item"."date_added" FROM "main_item" WHERE
(django_datetime_extract(\'month\', "main_item"."date_added", %s) = %s
AND "main_item"."date_added" BETWEEN %s AND %s)'
PARAMS = (u"'UTC'", u'6', u'datetime.datetime(2015, 1, 1, 0, 0, tzinfo=<UTC>)',
          u'datetime.datetime(2015, 12, 31, 23, 59, 59, 999999, tzinfo=<UTC>)')
If you are dealing with a large number of Items, you can break your query into parts (< 15 and >= 15, for example). This costs an extra database hit, but the peak memory usage will be smaller. You can also consider different methods.
Please also note:
- datetime is not the best name for a field. Name it meaningfully, like "date_added", "date_created" or something like that.
- if self.datetime is None is 'almost' equal to if not self.datetime.

Use the dates method.
items = Item.objects.filter(date_added__month=6, date_added__year=2015)
dates = items.dates('date_added', 'day')  # returns a queryset of date objects
days = [d.day for d in dates] # convert to a list of days
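For a sense of the output (hypothetical data): if June 2015 has items on the 1st, 5th and 12th, evaluating these issues a single SELECT DISTINCT query:
>>> list(dates)
[datetime.date(2015, 6, 1), datetime.date(2015, 6, 5), datetime.date(2015, 6, 12)]
>>> days
[1, 5, 12]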

Related

Downloading weekly Sentinel 2 data using SentinelApi

I'm trying to download weekly Sentinel 2 data for one year, i.e. one Sentinel dataset within each week of the year. I can create a list of datasets using this code:
from sentinelsat import SentinelAPI

api = SentinelAPI(user, password, 'https://scihub.copernicus.eu/dhus')
products = api.query(footprint,
                     date=('20211001', '20221031'),
                     platformname='Sentinel-2',
                     processinglevel='Level-2A',
                     cloudcoverpercentage=(0, 10))
products_gdf = api.to_geodataframe(products)
products_gdf_sorted = products_gdf.sort_values(['beginposition'], ascending=[False])
products_gdf_sorted
This creates a list of all datasets available within the year, and since data is captured roughly once every five days, you could argue I can work off this list. But instead I would like to have just one option for each week (Mon - Sun). I thought I could create a dataframe with a start date and an end date for each week and loop it through the api.query code, but I'm not sure how I would do this.
I have created a dataframe using:
import pandas as pd

dates_df = pd.DataFrame({
    'StartDate': pd.date_range(start='20211001', end='20221030', freq='W-MON'),
    'EndDate': pd.date_range(start='20211004', end='20221031', freq='W-SUN'),
})
print(dates_df)
Any tips or advice is greatly appreciated. Thanks!
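One way to use that dataframe, sketched under the assumption that one api.query call per week (with a (start, end) date tuple, as in the snippet above) is acceptable, keeping the least cloudy scene of each week:
weekly_picks = []
for row in dates_df.itertuples():
    products = api.query(footprint,
                         date=(row.StartDate.strftime('%Y%m%d'),
                               row.EndDate.strftime('%Y%m%d')),
                         platformname='Sentinel-2',
                         processinglevel='Level-2A',
                         cloudcoverpercentage=(0, 10))
    if products:  # some weeks may have no scene under 10% cloud cover
        gdf = api.to_geodataframe(products)
        weekly_picks.append(gdf.sort_values('cloudcoverpercentage').iloc[0])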

Faster loop in Pandas looking for ID and older date

So, I have a DataFrame that represents purchases with 4 columns:
date (date of purchase in format %Y-%m-%d)
customer_ID (string column)
claim (1/0 column: 1 means the customer complained about the purchase, 0 means they didn't)
claim_value (for claim = 1, how much the claim cost the company; for claim = 0 it's NaN)
I need to build 3 new columns:
past_purchases (how many purchases the specific customer had before this purchase)
past_claims (how many claims the specific customer had before this purchase)
past_claims_value (how much did the customer's past claims cost)
This has been my approach until now:
past_purchases = []
past_claims = []
past_claims_value = []
for i in range(0, len(df)):
    date = df['date'][i]
    customer_ID = df['customer_ID'][i]
    df_temp = df[(df['date'] < date) & (df['customer_ID'] == customer_ID)]
    past_purchases.append(len(df_temp))
    past_claims.append(df_temp['claim'].sum())
    past_claims_value.append(df_temp['claim_value'].sum())
df['past_purchases'] = pd.DataFrame(past_purchases)
df['past_claims'] = pd.DataFrame(past_claims)
df['past_claims_value'] = pd.DataFrame(past_claims_value)
The code works fine, but it's too slow. Can anyone make it work faster? Thanks!
PS: It's important to check that the date is strictly older; if the customer had 2 purchases on the same date, they shouldn't count for each other.
PPS: I'm willing to use libraries for parallel processing like multiprocessing, concurrent.futures, joblib or dask, but I've never used them in a similar way before.
Expected outcome: (shown as an image in the original post)
Maybe you can try using a cumsum over customers, if the dates are sorted in ascending order:
df.sort_values('date', inplace=True)
# Shift within each customer so a purchase doesn't count itself
new_temp_columns = ['claim_s', 'claim_value_s']
df[new_temp_columns] = df.groupby('customer_ID')[['claim', 'claim_value']].shift()
df['past_claims'] = df.groupby('customer_ID')['claim_s'].transform(pd.Series.cumsum)
df['past_claims_value'] = df.groupby('customer_ID')['claim_value_s'].transform(pd.Series.cumsum)
# Take the min within each (customer_ID, date) group so purchases
# on the same date don't count each other
dfc = df.groupby(['customer_ID', 'date'])[['past_claims', 'past_claims_value']]
df[['past_claims', 'past_claims_value']] = dfc.transform('min')
# Remove temp columns
df = df.loc[:, ~df.columns.isin(new_temp_columns)]
Again, this will only work if the dates are sorted.
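A quick sanity check of the idea on a tiny hypothetical frame (values are made up):
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-02']),
    'customer_ID': ['a', 'a', 'a'],
    'claim': [1, 0, 1],
    'claim_value': [10.0, None, 5.0],
})
df.sort_values('date', inplace=True)
df[['claim_s', 'claim_value_s']] = df.groupby('customer_ID')[['claim', 'claim_value']].shift()
df['past_claims'] = df.groupby('customer_ID')['claim_s'].transform(pd.Series.cumsum)
df['past_claims_value'] = df.groupby('customer_ID')['claim_value_s'].transform(pd.Series.cumsum)
dfc = df.groupby(['customer_ID', 'date'])[['past_claims', 'past_claims_value']]
df[['past_claims', 'past_claims_value']] = dfc.transform('min')
# Both 2020-01-02 purchases now see only the 2020-01-01 claim:
# past_claims = 1.0, past_claims_value = 10.0; a customer's first
# purchase comes out as NaN (fillna(0) would mean "no past claims").
print(df[['date', 'claim', 'past_claims', 'past_claims_value']])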

Django prefetch_related and performance optimisation with multiple QuerySets within loops

Effectively I have multiple queries within loops that I am just not happy with. I was wondering whether someone with expertise in prefetch_related and other Django query-construction optimisation could help me with this issue.
I start with:
users = User.objects.filter(organisation=organisation).filter(is_active=True)
Then, I start my loop over all days starting from a certain date "start_date":
for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
I then, within this loop, loop over a filtered subset of the above users:
for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
    for user in users.filter(created_date__lte=date).iterator():
Ok, so firstly, is there any way to optimise this?
What may make some of the hardcore Django-ers lose their tether: I do all of the above inside another loop!!
for survey in Survey.objects.all().iterator():
    for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
        for user in users.filter(created_date__lte=date).iterator():
Inside the last loop, I perform one final Query filter:
survey_result = SurveyResult.objects.filter(survey=survey, user=user, created_date__lte=date).order_by('-updated_date')[0]
I do this because I feel I need to have the user, survey and date variables ready to filter...
I have started thinking about prefetch_related and the Prefetch object. I've consulted the documentation but I can't seem to wrap my head around how I could apply this to my situation.
Effectively, the query is taking far too long. For an average of 1000 users, 4 surveys and approximately 30 days, this query is taking 1 minute to complete.
Ideally, I would like to shave off 50% of this. Any better, and I will be extremely happy. I'd also like the load on the DB server to be reduced as this query could be running multiple times across different organisations.
Any expert tips on how to organise such horrific queries within loops within loops would be greatly appreciated!
Full "condensed" minimum viable snippet:
users = User.objects.filter(organisation=organisation).filter(is_active=True)
datasets = []
for survey in Survey.objects.all():
    data = []
    for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
        total_score = 0
        participants = 0
        for user in users.filter(created_date__lte=date):
            participants += 1
            survey_result = SurveyResult.objects.filter(survey=survey, user=user, created_date__lte=date).order_by('-updated_date')[0]
            total_score += survey_result.score
        # An average is calculated from total_score and participants and appended to the data array.
        # divide() catches divide-by-zero errors.
        # round() rounds to two decimal places for front-end readability.
        data.append(
            round(
                divide(total_score, participants), 2
            )
        )
    datasets.append(data)
********* ADDENDUM: *********
So, further to #dirkgroten's answer I am currently running with:
for survey in Survey.objects.all():
    results = SurveyResult.objects.filter(
        user__in=users, survey=survey, created_date__range=date_range
    ).values(
        'survey',
        'created_date',
    ).annotate(
        total_score=Sum('normalized_score'),
        participants=Count('user'),
        average_score=Avg('normalized_score'),
    ).order_by(
        'created_date'
    )
    for result in results:
        print(result)
As I ("think I") need a breakdown by survey for each QuerySet.
Are there any other optimisations available to me?
You can actually combine queries and perform the calculations directly inside your query:
from django.db.models import Sum, Count, Avg
from django.utils import timezone

users = User.objects.filter(organisation=organisation).filter(is_active=True)
date_range = [start_date, timezone.now().date()]  # or adapt end time to a different time zone
results = SurveyResult.objects.filter(user__in=users, created_date__range=date_range)\
    .values('survey', 'created_date')\
    .annotate(total_score=Sum('score'), participants=Count('pk'))\
    .order_by('survey', 'created_date')
This will group the results by survey and created_date and add the total_score and participants to each result, something like:
[{'survey': 1, 'created_date': '2019-08-05', 'total_score': 54, 'participants': 20},
{'survey': 1, ... } ... ]
I'm assuming there's only one SurveyResult per user so the number of SurveyResult in each group is the number of participants.
Note that Avg also gives you the average score at once, that is assuming only one possible score per user:
.annotate(average_score=Avg('score')) # instead of total and participants
This should shave off 99.9% of your query time :-)
If you want the dataset as a list of lists, you just do something like this:
dataset = []
data = []
current_survey = None
current_date = start_date
for result in results:
    if not result['survey'] == current_survey:
        # results are ordered by survey, so when it changes, reset data
        if data: dataset.append(data)
        data = []
        current_survey = result['survey']
    if not result['created_date'] == current_date:
        # results are ordered by date, so a missing date won't appear later
        # assume a daterange function to create a list of dates
        for date in daterange(current_date, result['created_date']):
            data.append(0)  # padding data
        current_date = result['created_date']
    data.append(result['average_score'])
The result will be a list of lists:
dataset = [[0, 0, 10.4, 3.9, 0], [20.2, 3.5, ...], ...]
Not super efficient Python, but with a few thousand results this will be super fast anyway, way faster than performing more DB queries.
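The daterange helper assumed in the comment above isn't defined in the answer; a minimal sketch (assuming plain date objects) could be:
import datetime

def daterange(start, end):
    """Yield each date from start (inclusive) up to end (exclusive)."""
    for n in range((end - start).days):
        yield start + datetime.timedelta(days=n)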
EDIT: Since created_date is DateTimeField, you first need to get the corresponding date:
from django.db.models.functions import TruncDate
results = SurveyResult.objects.filter(user__in=users, created_date__range=date_range)\
    .annotate(date=TruncDate('created_date'))\
    .values('survey', 'date')\
    .annotate(average_score=Avg('score'))\
    .order_by('survey', 'date')

How can I change my SQLalchemy query to include the current day?

Currently I have a function which returns the stock ticker with the highest error for the entire data set. What I actually want is to return the stock ticker with the highest error for the current day.
Here is the current function:
@main.route('/api/highest/error')
def get_highest_error():
    """
    API which returns the highest stock error for the current day.
    :return: ticker of the stock matching the query.
    """
    sub = db.session.query(db.func.max(Stock.error).label('max_error')).subquery()
    stock = db.session.query(Stock).join(sub, sub.c.max_error == Stock.error).first()
    return stock.ticker
Here is what I attempted:
todays_stock = db.session.query(db.func.date(Stock.time_stamp) == date.today())
stock = todays_stock.filter(db.func.max(Stock.error))
return stock.ticker
Unfortunately this is operating on a BaseQuery which is not what I expected.
I also tried:
stock = Stock.query.filter(db.func.date(Stock.time_stamp) == date.today()).filter(db.func.max(Stock.error)).first()
But this generated an error with the message: aggregate functions are not allowed in WHERE.
The error is pretty self-explanatory: you cannot use aggregate functions in the WHERE clause. If you have to eliminate group rows based on aggregates, use HAVING. But that's not what you need here: to fetch the row with the greatest error, order by error in descending order and pick the first row:
stock = Stock.query.\
filter(db.func.date(Stock.time_stamp) == date.today()).\
order_by(Stock.error.desc().nullslast()).\
first()
Unless you have a ridiculous amount of Stock per day, the sorting should be plenty fast. Note that db.func.date(Stock.time_stamp) == date.today() is not very index friendly, unless your DB supports functional indexes. Instead you could filter on a half-open range:
today = date.today()
...
filter(Stock.time_stamp >= today,
Stock.time_stamp < today + timedelta(days=1)).\
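Putting both suggestions together (a sketch, assuming the same Stock model and db object as above):
from datetime import date, timedelta

today = date.today()
stock = Stock.query.\
    filter(Stock.time_stamp >= today,
           Stock.time_stamp < today + timedelta(days=1)).\
    order_by(Stock.error.desc().nullslast()).\
    first()
return stock.ticker if stock else None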

Making a string by accessing dictionary values

years, months, days, hours, minutes: these values are accessed from a dict. Now I want to create a string like years=12 months=1 if there are only years and months. Consider a case with only minutes and seconds; then the string should be minutes=1 seconds=1. How can I do this in an effective way?
The sample data may look like this (shown as an image in the original post).
I tried to do something like this, but it's not working:
years, months, days, hours, minutes = self.initial_data["months"], \
self.initial_data["years"], \
self.initial_data["days"], \
self.initial_data["hours"], \
self.initial_data["minutes"],
duration = if years: "years= {}".format(years) + \
if months "months={}".format(months) +\
and so on
So the string changes depending on whether there is a value or not.
Apart from the bad syntax (you cannot use if in such a way), you're making this more complicated than it needs to be. Just create an ordered list of field names and loop through it to extract the existing fields from self.initial_data:
fields = ["years", "months", "days", "hours", "minutes", "seconds"] # ordered list
data = self.initial_data # a shorthand for readability sake
duration = " ".join("{}={}".format(f, data[f]) for f in fields if f in data)
You can add an additional data[f] check to the duration build-up if you want to ignore fields whose values evaluate to False (e.g. 0, None, etc.).
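A quick usage sketch with hypothetical data:
fields = ["years", "months", "days", "hours", "minutes", "seconds"]
data = {"years": 12, "months": 1}  # stands in for self.initial_data
duration = " ".join("{}={}".format(f, data[f]) for f in fields if f in data)
print(duration)  # years=12 months=1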
