Effectively, I have multiple queries inside loops that I am just not happy with. I was wondering whether someone with expertise in prefetch_related and other Django query-construction optimisations could help me with this issue.
I start with:
users = User.objects.filter(organisation=organisation).filter(is_active=True)
Then, I start my loop over all days starting from a certain date "start_date":
for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
Then, within this loop, I loop over a filtered subset of the above users:
for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
    for user in users.filter(created_date__lte=date).iterator():
Ok, so firstly, is there any way to optimise this?
What may make some hardcore Django-ers lose their tether: I do all of the above inside yet another loop!
for survey in Survey.objects.all().iterator():
    for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
        for user in users.filter(created_date__lte=date).iterator():
Inside the last loop, I perform one final Query filter:
survey_result = SurveyResult.objects.filter(survey=survey, user=user, created_date__lte=date).order_by('-updated_date')[0]
I do this because I feel I need to have the user, survey and date variables ready to filter...
I have started thinking about prefetch_related and the Prefetch object. I've consulted the documentation but I can't seem to wrap my head around how I could apply this to my situation.
Effectively, these queries are taking far too long: for roughly 1000 users, 4 surveys and approximately 30 days, the whole thing takes about a minute to complete.
Ideally, I would like to shave off 50% of this. Any better, and I will be extremely happy. I'd also like the load on the DB server to be reduced as this query could be running multiple times across different organisations.
Any expert tips on how to organise such horrific queries within loops within loops would be greatly appreciated!
Full "condensed" minimum viable snippet:
users = User.objects.filter(organisation=organisation).filter(is_active=True)

datasets = []
for survey in Survey.objects.all():
    data = []
    for date in (start_date + datetime.timedelta(n) for n in range((datetime.datetime.now().replace(tzinfo=pytz.UTC) - start_date).days + 1)):
        total_score = 0
        participants = 0
        for user in users.filter(created_date__lte=date):
            participants += 1
            survey_result = SurveyResult.objects.filter(survey=survey, user=user, created_date__lte=date).order_by('-updated_date')[0]
            total_score += survey_result.score
        # An average is calculated from total_score and participants and appended to the data list.
        # divide() catches divide-by-zero errors.
        # round() rounds to two decimal places for front-end readability.
        data.append(
            round(
                divide(total_score, participants), 2
            )
        )
    datasets.append(data)
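For reference, divide() isn't shown in the snippet; a minimal sketch of what such a helper presumably does (this exact implementation is an assumption, not the original code):

def divide(numerator, denominator):
    # Hypothetical helper: return 0 instead of raising ZeroDivisionError
    # when a day has no participants.
    try:
        return numerator / denominator
    except ZeroDivisionError:
        return 0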
********* ADDENDUM: *********
So, further to #dirkgroten's answer I am currently running with:
for survey in Survey.objects.all():
    results = SurveyResult.objects.filter(
        user__in=users, survey=survey, created_date__range=date_range
    ).values(
        'survey',
        'created_date',
    ).annotate(
        total_score=Sum('normalized_score'),
        participants=Count('user'),
        average_score=Avg('normalized_score'),
    ).order_by(
        'created_date'
    )

    for result in results:
        print(result)
As I ("think I") need a breakdown by survey for each QuerySet.
Are there any other optimisations available to me?
You can actually combine queries and perform the calculations directly inside your query:
from django.db.models import Sum, Count, Avg
from django.utils import timezone

users = User.objects.filter(organisation=organisation).filter(is_active=True)
date_range = [start_date, timezone.now().date()]  # or adapt the end time to a different time zone

results = SurveyResult.objects.filter(user__in=users, created_date__range=date_range)\
    .values('survey', 'created_date')\
    .annotate(total_score=Sum('score'), participants=Count('pk'))\
    .order_by('survey', 'created_date')
This will group the results by survey and created_date and add the total_score and participants to each result, something like:
[{'survey': 1, 'created_date': '2019-08-05', 'total_score': 54, 'participants': 20},
{'survey': 1, ... } ... ]
I'm assuming there's only one SurveyResult per user, so the number of SurveyResults in each group is the number of participants.
Note that Avg also gives you the average score directly, again assuming only one score per user:
.annotate(average_score=Avg('score')) # instead of total and participants
This should shave off 99.9% of your query time :-)
If you want the dataset as a list of lists, you just do something like this:
dataset = []
data = []
current_survey = None
current_date = start_date
for result in results:
    if not result['survey'] == current_survey:
        # results are ordered by survey, so if it changes, reset data
        if data: dataset.append(data)
        data = []
        current_survey = result['survey']
        current_date = start_date  # restart the date padding for the new survey
    if not result['created_date'] == current_date:
        # results are ordered by date, so a missing date won't turn up later
        # assume a daterange function to create a list of dates
        for date in daterange(current_date, result['created_date']):
            data.append(0)  # padding data
    current_date = result['created_date']
    data.append(result['average_score'])
if data: dataset.append(data)  # append the final survey's data
The result will be a list of lists:
dataset = [[0, 0, 10.4, 3.9, 0], [20.2, 3.5, ...], ...]
Not super efficient python, but with a few 1000 results, this will be super fast anyway, way faster than performing more db queries.
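The daterange function referenced in the comments isn't defined in the answer; a minimal sketch of one possible implementation (an assumption, yielding each date from start up to but not including end):

import datetime

def daterange(start, end):
    # Yield every date from start (inclusive) up to end (exclusive), one day at a time.
    for n in range((end - start).days):
        yield start + datetime.timedelta(days=n)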
EDIT: Since created_date is a DateTimeField, you first need to get the corresponding date:
from django.db.models.functions import TruncDate
results = SurveyResult.objects.filter(user__in=users, created_date__range=date_range)\
    .annotate(date=TruncDate('created_date'))\
    .values('survey', 'date')\
    .annotate(average_score=Avg('score'))\
    .order_by('survey', 'date')
Related
I have an original function from pandas which worked perfectly for my use case, however it only worked on a small dataset. My current dataset is 50+MM rows, which Pandas is unable to handle.
As stated in the title, the goal of the function is to iterate through each row in a PySpark DataFrame and return the distinct count of users operating within +/- 5 minutes of the timeframe identified in each row. E.g. if a user completes a task at 2021-02-04 12:44:33, I want to know the number of distinct individuals that also completed the same task within 5 minutes either side of this user completing it. This would need to be checked for every row in the DataFrame.
My original Python code for pandas was as follows:
def workers(s):
    # get unique count of employee id working on process/function where start_time after amended_start and end_time after amended_end.
    if s.tracking_type == 'indirect':
        balancedate = s.balancedate
        process = s.process_name
        function = s.function_name
        start_time = s.start_date_utc
        end_time = s.end_date_utc
        amended_start = s.start_date_utc - datetime.timedelta(minutes=5)
        amended_end = s.end_date_utc + datetime.timedelta(minutes=5)
        t = df2_mer[(df2_mer['balancedate'] == balancedate)
                    & (df2_mer['function_name'] == function)
                    & (df2_mer['process_name'] == process)
                    & (df2_mer['start_date_utc'] >= amended_start)
                    & (df2_mer['end_date_utc'] <= amended_end)
                    & (df2_mer['tracking_type'] == 'direct')].employee_id.nunique()
        return t
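Presumably this helper is applied row-wise, along these lines (the result column name, and which DataFrame it is applied to, are assumptions):

# Assumed usage: compute the distinct-worker count for every row.
df2_mer['distinct_workers'] = df2_mer.apply(workers, axis=1)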
My Attempted PySpark Modification:
def workers(s):
    # get unique count of employee id working on process/function where start_time after amended_start and end_time after amended_end.
    balancedate = s.balancedate
    process = s.process_name
    function = s.function_name
    amended_start = s.start_date_convert - f.expr('INTERVAL 5 MINUTES')
    amended_end = s.end_date_convert + f.expr('INTERVAL 5 MINUTES')
    t = df3.filter((f.col('balancedate') == balancedate))\
        .filter(
            (f.col('function_name') == function) & f.col('start_time_convert') > amended_start & f.col('end_time_convert') <= amended_end).select(countDistinct('employee_id'))
    return function
However, upon attempting to run the same/similar code in PySpark (edited for the filters), I am hit with a "Could not serialize object" error as a result of including a DataFrame in the function.
I am not sure how to proceed to achieve my objective, or whether I can use the same methodology as in pandas or need a completely different one.
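One common way around the serialization error is to avoid referencing a DataFrame from inside a row-wise function and to express the lookup as a self-join instead. A rough sketch under the column names used above (the aliases, the left join, and the exact grouping columns are assumptions, not a tested solution):

from pyspark.sql import functions as f

# Rows we want counts for (mirroring the pandas check for tracking_type == 'indirect').
base = df3.filter(f.col('tracking_type') == 'indirect').alias('b')
# Rows that may be counted.
direct = df3.filter(f.col('tracking_type') == 'direct').alias('d')

joined = base.join(
    direct,
    (f.col('b.balancedate') == f.col('d.balancedate'))
    & (f.col('b.process_name') == f.col('d.process_name'))
    & (f.col('b.function_name') == f.col('d.function_name'))
    & (f.col('d.start_time_convert') >= f.col('b.start_time_convert') - f.expr('INTERVAL 5 MINUTES'))
    & (f.col('d.end_time_convert') <= f.col('b.end_time_convert') + f.expr('INTERVAL 5 MINUTES')),
    how='left',
)

# One output row per base row, with the distinct count of matching direct employees.
result = joined.groupBy(
    'b.balancedate', 'b.process_name', 'b.function_name',
    'b.start_time_convert', 'b.end_time_convert',
).agg(f.countDistinct('d.employee_id').alias('distinct_workers'))

Because everything stays inside the Spark SQL engine, nothing needs to be serialized into a Python function.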
So, I have a DataFrame that represents purchases with 4 columns:
date (date of purchase in format %Y-%m-%d)
customer_ID (string column)
claim (1-0 column that means 1-the customer complained about the purchase, 0-customer didn't complain)
claim_value (for claim = 1 it means how much the claim cost to the company, for claim = 0 it's NaN)
I need to build 3 new columns:
past_purchases (how many purchases the specific customer had before this purchase)
past_claims (how many claims the specific customer had before this purchase)
past_claims_value (how much did the customer's past claims cost)
This has been my approach until now:
past_purchases = []
past_claims = []
past_claims_value = []
for i in range(0, len(df)):
    date = df['date'][i]
    customer_ID = df['customer_ID'][i]
    df_temp = df[(df['date'] < date) & (df['customer_ID'] == customer_ID)]

    past_purchases.append(len(df_temp))
    past_claims.append(df_temp['claim'].sum())
    past_claims_value.append(df_temp['claim_value'].sum())

df['past_purchases'] = pd.DataFrame(past_purchases)
df['past_claims'] = pd.DataFrame(past_claims)
df['past_claims_value'] = pd.DataFrame(past_claims_value)
The code works fine, but it's too slow. Can anyone make it work faster? Thanks!
PS: It's important to check that the date is strictly older; if the customer had two purchases on the same date, they shouldn't count for each other.
PPS: I'm willing to use libraries for parallel processing like multiprocessing, concurrent.futures, joblib or dask, but I've never used them for something like this before.
Expected outcome:
Maybe you can try using a cumsum over customers, if the dates are sorted ascending:
df.sort_values('date', inplace=True)

new_temp_columns = ['claim_s', 'claim_value_s']
# shifted copies of the claim columns, so a row doesn't count itself
df['claim_s'] = df.groupby('customer_ID')['claim'].shift()
df['claim_value_s'] = df.groupby('customer_ID')['claim_value'].shift()

df['past_claims'] = df.groupby('customer_ID')['claim_s'].transform(pd.Series.cumsum)
df['past_claims_value'] = df.groupby('customer_ID')['claim_value_s'].transform(pd.Series.cumsum)

# set the min value for the groups so purchases on the same date don't count for each other
dfc = df.groupby(['customer_ID', 'date'])[['past_claims', 'past_claims_value']]
df[['past_claims', 'past_claims_value']] = dfc.transform(min)

# remove the temp columns
df = df.loc[:, ~df.columns.isin(new_temp_columns)]
Again, this will only work if the dates are sorted ascending.
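Another hedged option that also respects the strictly-earlier-date rule: aggregate per customer and day first, subtract the current day's totals from the per-customer cumulative sums, and merge the result back onto the original rows. A sketch using the column names from the question:

import pandas as pd

# Per-customer, per-day totals.
daily = (df.groupby(['customer_ID', 'date'], as_index=False)
           .agg(purchases=('claim', 'size'),
                claims=('claim', 'sum'),
                claims_value=('claim_value', 'sum')))
daily = daily.sort_values(['customer_ID', 'date'])

# Cumulative totals up to and including each day, minus the current day,
# so a row only ever sees strictly earlier dates.
cums = daily.groupby('customer_ID')[['purchases', 'claims', 'claims_value']].cumsum()
daily['past_purchases'] = cums['purchases'] - daily['purchases']
daily['past_claims'] = cums['claims'] - daily['claims']
daily['past_claims_value'] = cums['claims_value'] - daily['claims_value']

# Attach the per-day history back onto the original rows.
df = df.merge(daily[['customer_ID', 'date', 'past_purchases',
                     'past_claims', 'past_claims_value']],
              on=['customer_ID', 'date'], how='left')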
Currently I have a function which returns the stock ticker with the highest error for the entire data set. What I actually want is to return the stock ticker with the highest error for the current day.
Here is the current function:
@main.route('/api/highest/error')
def get_highest_error():
    """
    API which returns the highest stock error for the current day.
    :return: ticker of the stock matching the query.
    """
    sub = db.session.query(db.func.max(Stock.error).label('max_error')).subquery()
    stock = db.session.query(Stock).join(sub, sub.c.max_error == Stock.error).first()
    return stock.ticker
Here is what I attempted:
todays_stock = db.session.query(db.func.date(Stock.time_stamp) == date.today())
stock = todays_stock.filter(db.func.max(Stock.error))
return stock.ticker
Unfortunately this is operating on a BaseQuery which is not what I expected.
I also tried:
stock = Stock.query.filter(db.func.date(Stock.time_stamp) == date.today()).filter(db.func.max(Stock.error)).first()
But this generated an error with the message: aggregate functions are not allowed in WHERE.
The error is pretty self-explanatory: you cannot use aggregate functions in the WHERE clause. If you have to eliminate group rows based on aggregates, use HAVING. But that's not what you need here: to fetch the row with the greatest error, order by error in descending order and pick the first row:
stock = Stock.query.\
    filter(db.func.date(Stock.time_stamp) == date.today()).\
    order_by(Stock.error.desc().nullslast()).\
    first()
Unless you have a ridiculous amount of Stock per day, the sorting should be plenty fast. Note that db.func.date(Stock.time_stamp) == date.today() is not very index friendly, unless your DB supports functional indexes. Instead you could filter on a half open range:
today = date.today()
...
    filter(Stock.time_stamp >= today,
           Stock.time_stamp < today + timedelta(days=1)).\
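Putting the pieces together, the whole view might look like this (a sketch; the empty-result fallback is an assumption, not part of the original answer):

from datetime import date, timedelta

@main.route('/api/highest/error')
def get_highest_error():
    """Return the ticker of the stock with the highest error recorded today."""
    today = date.today()
    stock = Stock.query.\
        filter(Stock.time_stamp >= today,
               Stock.time_stamp < today + timedelta(days=1)).\
        order_by(Stock.error.desc().nullslast()).\
        first()
    # stock may be None if nothing has been recorded yet today.
    return stock.ticker if stock else ''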
I'm currently creating a simple calendar for one of my Django projects. The calendar will display the current month and its days. Any day which has an item will be highlighted in red so that the user knows there are items for that day. The number of items, or what they are, doesn't matter; all we care about is whether a day has items.
Let's say I have the following model:
class Items(models.Model):
    name = models.CharField(max_length=140)
    datetime = models.DateTimeField(auto_now_add=False)

    def save(self, *args, **kwargs):
        if self.datetime is None:
            self.datetime = datetime.now()
        super(Items, self).save()
Here is my current logic for finding which days have items:
from calendar import monthrange

# Find number of days for June 2015
num_days = monthrange(2015, 6)[1]

days_with_items = []

'''
Increase num_days by 1 so that the last day of the month
is included in the range
'''
num_days += 1

for day in range(0, num_days):
    has_items = Items.objects.filter(datetime__day=day,
                                     datetime__month=6,
                                     datetime__year=2015).exists()
    if has_items:
        days_with_items.append(day)

return days_with_items
This returns a list with all the days that have items. It works, however I'm looking for a more efficient way of doing this, since Django makes a separate trip to the DB for each .exists() call.
Any suggestions?
I see two possible options. The first is to add counts at the DB level; the second is an efficient loop over the available data at the Python level. Depending on data size and DB efficiency, you can choose which suits you best.
Counting in the database is explained here:
Django ORM, group by day
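A hedged sketch of that database-level grouping for this model (assuming a Django version that allows date transforms in values(), i.e. 1.11+; this is not code from the linked answer):

from django.db.models import Count

# Group June 2015 items by day of month and count them in a single query.
counts = (Items.objects
          .filter(datetime__year=2015, datetime__month=6)
          .values('datetime__day')
          .annotate(num_items=Count('id'))
          .order_by('datetime__day'))

days_with_items = [row['datetime__day'] for row in counts]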
Or solution two, a simple script (not so elegant, but just as an example):
days_with_items_hash = {}

items = Items.objects.filter(
    datetime__month=6,
    datetime__year=2015
)

for item in items:
    days_with_items_hash[item.datetime.day] = True

days_with_items = days_with_items_hash.keys()
I would stick with the database solution because it can be optimised (sql views, extra column with just the day, etc)
At first, let's get all the items for the required month.
items = Items.objects.filter(datetime__month=6, datetime__year=2015)
days = set([item.datetime.day for item in items]) # unique days
If you want to make a partial query, specify the values you need; here's the concept:
items = Item.objects.filter(
    date_added__month=6, date_added__year=2015
).values('date_added')

days = set([item['date_added'].day for item in items])
It will result in the following SQL query:
QUERY = u'SELECT "main_item"."date_added" FROM "main_item" WHERE
         (django_datetime_extract(\'month\', "main_item"."date_added", %s) = %s
         AND "main_item"."date_added" BETWEEN %s AND %s)'
PARAMS = (u"'UTC'", u'6', u'datetime.datetime(2015, 1, 1, 0, 0, tzinfo=<UTC>)',
          u'datetime.datetime(2015, 12, 31, 23, 59, 59, 999999, tzinfo=<UTC>)')
If you are dealing with a big amount of Items, you can break your query into parts (< 15 and >= 15, for example). This will result in an extra database hit, but the peak memory usage will be smaller. You can also consider different methods.
Please, also note:

datetime is not the best name for a field. Name it meaningfully, like "date_added", "date_created" or something like that.

if self.datetime is None is 'almost' equal to if not self.datetime.
Use the dates method.
items = Item.objects.filter(date_added__month=6, date_added__year=2015)
dates = items.dates('date_added', 'day') # returns a queryset of datetimes
days = [d.day for d in dates] # convert to a list of days
Though this may be elementary, for me it's proving beyond my level, so my thanks go out beforehand.
What I want to achieve: I query a row of data, and when the relevant field in the LiveRoutes table is inactive, I calculate the distance between two sets of longitude/latitude values (I am able to use the haversine formula for this). What I am wary of is that this calculation may take time, and for a larger number of rows in the table I will not be able to display the data on time. So I thought I would save the result of the calculation in another table, and fetch and display the data from that table instead.
I will perform the calculation when a value in the table becomes inactive.
Here is the Django code which tells whether a row is active or inactive for a route in the LiveRoutes model class.
class LiveRoutes(models.Model):
    user = models.ForeignKey(User)
    route = models.ForeignKey(UserRoutes)
    status = models.ForeignKey(LiveRoutesStatus)
    traveller = models.ManyToManyField(LiveRouteTravellers)
    datetime = models.DateTimeField()

    def __unicode__(self):
        return self.route.__unicode__()

    def isActive(self):
        utc = pytz.utc
        os.environ['TZ'] = 'UTC'
        local = pytz.timezone("Asia/Calcutta")
        now = utc.localize(datetime.datetime.today())
        now = now.astimezone(local)
        time_delta = (local.localize(self.datetime.replace(tzinfo=None)) + datetime.timedelta(minutes=self.route.journey_time_day)) - now
        if time_delta.days == -1 and (24 - (time_delta.seconds / 3600)) <= 2:
            return True
        elif time_delta.days >= 0:
            return True
        else:
            return False
Based on the value from the isActive function, I wanted to perform the calculation as follows:
def carbonFootPrint(request):
    if request.method != "GET":
        raise Http404
    routes = LiveRoutes.objects.all()
    routeDetailArr = []
    for lroute in routes:
        routeDetail = dict()
        if lroute.isActive() == False:
            # Now I need to find out the start location and end location for the journey and the number of travellers.
            routeDetail['travellers'] = lroute.traveller.all().count()
            routeDetail['start_loc_lat'] = lroute.route.start_location.latitude
            routeDetail['start_loc_long'] = lroute.route.start_location.longitude
            routeDetail['end_loc_lat'] = lroute.route.end_location.latitude
            routeDetail['end_loc_long'] = lroute.route.end_location.longitude
            routeDetail['distance'] = haversine(routeDetail['start_loc_lat'],
                                                routeDetail['start_loc_long'],
                                                routeDetail['end_loc_lat'],
                                                routeDetail['end_loc_long'])
            routeDetailArr.append(routeDetail)
My problem is how to insert all this data back into another table, so that later on I can fetch those values.
Thanks, any advice will be highly appreciated.
In order to store this in the session, it must be like this:
request.session['travellers'] = lroute.traveller.all().count()
# other session values here
To get and use the data, it must be:
travellers = request.session.get('travellers')
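As for writing the computed values back into another table so they can be fetched later, a hedged sketch (the RouteFootprint model and its field names are hypothetical, not from the question):

from django.db import models

class RouteFootprint(models.Model):
    # Hypothetical model for persisting the computed values.
    live_route = models.ForeignKey('LiveRoutes')  # add on_delete=models.CASCADE on Django 2+
    travellers = models.IntegerField()
    distance = models.FloatField()
    calculated_at = models.DateTimeField(auto_now_add=True)

Then, inside carbonFootPrint, each routeDetail can be persisted as it is built:

RouteFootprint.objects.create(
    live_route=lroute,
    travellers=routeDetail['travellers'],
    distance=routeDetail['distance'],
)

If many rows are involved, collecting unsaved RouteFootprint instances in a list and calling RouteFootprint.objects.bulk_create(...) at the end reduces the number of queries.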