Count of a query as a cron or task [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicates:
Google AppEngine: how to count a database's entries beyond 1000?
how does one get a count of rows in a datastore model in google appengine?
Hi, I've got > 10000 entities and would like to do a count of a query. Since there are going to be more than 1000 results, one way I'm thinking of doing it is to use a task queue or a cron job to refresh the value of the counter. That way I could use a job that takes more time, partitions the count, and adds up parts that are each less than 1000. Do you agree that this is a reasonable way to perform counts that do not need to be exactly up to date all the time and do not need more accuracy than +/- 10 %? Specifically, I want to display the number of recent posts, where a small error wouldn't matter.
Edit: It seems counting > 1000 entities now works. So we will add these functions to the tabs, i.e. displaying the total number of active entities and, for private and company articles, also the number of active entities / articles. This way to count more than 1000 entities is not found in the documentation, but it is described in another answer here. We now do it this simple way, which we plan to combine with memcache so that it only hits the data layer from time to time:
self.response.out.write(A.all().count(100000000))
That will count > 1000.
Thanks
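A minimal sketch of the "count only from time to time" idea from the edit above, written as a plain in-process cache rather than against the real memcache API. The class name CachedCount, the ttl_seconds parameter and the injected clock are all assumptions made for illustration; count_fn stands in for whatever expensive count is being cached.

```python
import time


class CachedCount:
    """Caches the result of an expensive count for ttl_seconds (hypothetical helper)."""

    def __init__(self, count_fn, ttl_seconds=300, clock=time.time):
        self._count_fn = count_fn    # assumption: e.g. lambda: A.all().count(100000000)
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the expiry logic can be tested
        self._value = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        # Recompute only when nothing is cached yet or the cached value has expired.
        if self._value is None or now >= self._expires_at:
            self._value = self._count_fn()
            self._expires_at = now + self._ttl
        return self._value
```

With a five minute TTL the data layer is hit at most once every five minutes per process, which seems compatible with the stated tolerance for a slightly stale number.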

Related

How do I get the count of a result set efficiently?

I've got a function which interacts with a postgres DB.
The function takes a parameter called pagination_data_required (boolean).
If pagination_data_required is set to true, the function executes the query as well as a query.count(), which according to the peewee documentation wraps the query in a count() function.
def list_records(pagination_data_required):
    query = table1.select(table1.columns...).join(table2....).distinct() ## returns nearly 500k rows
    if (filter_request_body.pagination_data_required):
        total_count = query.count()
My problem arises when .count() is called. Without a .count() my api returns results within a second whereas with .count(), the response time skyrockets to ~18 seconds.
I need this total count due to a requirement from the frontend team.
The query is returning roughly 500k records (which is needed, plus there's a .paginate() function being called)
How do I efficiently count the number of rows returned by the query?
I've tried the API with pagination_data_required set to True and False and the results remain the same.
I've tried to call .dicts() on the original query and take the count of items but it gives the same response time.

The only way to count the number of rows returned by a query is to execute the query and count the results. I don't know how your ORM implements pagination, but I assume that it will append a LIMIT clause at the end of the query. That can speed up execution, because only the first few rows of the result set have to be calculated. But calculating the count will take much longer for a large result set.
So there is no good solution for this problem other than not showing an exact result count. See my article for a discussion of the problem and potential workarounds.

This one's a classic.
It is usually not possible to count the rows returned by a query without actually running the query. If it includes things that don't change the count, like left joins, sorts, joins on foreign keys that don't add or remove rows, etc., then you could remove them and get a bit of a speedup, but you will still be running the query. But if it is using a LIMIT'ed index scan for efficient searching of the most recent rows (for example), then that optimization won't work with a count. Also, reading such a large amount of useless data will trash your cache. If the count query is run often, all the data it uses will fill your cache and evict data that other, more useful queries need, which will make those queries slow. Or you will have to upgrade your RAM.
In some cases, like a forum, displaying a topic always uses the same search criteria. It is simply "where topic_id=... order by post_id". In this case, counting the posts is very wasteful, always doing the exact same query all over again, and paginating results with (LIMIT+OFFSET) is also slow as it discards all the selected rows before the requested offset. Since the most often requested page is the last one, the worst case is the most common.
However, with such fixed search and ordering criteria, the row number of any row in the result set is always the same, so it is possible to cache it as "post number in topic" in the posts table. Then, to get one specific page, it is simply a matter of "post_number BETWEEN ... AND ...", and to count the posts in a topic, just select the post_number of the last one. In this case it is possible to get the exact count without actually counting, and to paginate without using OFFSET, which is much faster.
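A rough sketch of the cached "post number in topic" idea, using an in-memory SQLite database only so that it runs anywhere. The table layout, PAGE_SIZE and the helper names are assumptions, and the sketch assumes posts are never deleted or renumbered, so post_number stays dense within a topic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts ("
    " post_id INTEGER PRIMARY KEY,"
    " topic_id INTEGER NOT NULL,"
    " post_number INTEGER NOT NULL,"   # cached position of the post within its topic
    " body TEXT,"
    " UNIQUE (topic_id, post_number))"
)
conn.executemany(
    "INSERT INTO posts (topic_id, post_number, body) VALUES (?, ?, ?)",
    [(1, n, "post %d" % n) for n in range(1, 96)],
)

PAGE_SIZE = 20


def count_posts(topic_id):
    # The count is the post_number of the last post; no rows are counted.
    row = conn.execute(
        "SELECT COALESCE(MAX(post_number), 0) FROM posts WHERE topic_id = ?",
        (topic_id,),
    ).fetchone()
    return row[0]


def fetch_page(topic_id, page):
    # Pages are 1-based; a range condition on the index replaces OFFSET.
    first = (page - 1) * PAGE_SIZE + 1
    last = page * PAGE_SIZE
    return conn.execute(
        "SELECT post_number, body FROM posts"
        " WHERE topic_id = ? AND post_number BETWEEN ? AND ?"
        " ORDER BY post_number",
        (topic_id, first, last),
    ).fetchall()
```

Both helpers can be served from the (topic_id, post_number) index, so the cost should not grow with how deep into the topic the requested page is.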
For a generic search query that can use many criteria, it is not possible to store the row number in such a simple way. However, knowing the exact count is usually not necessary. When the GUI displays:
Page: 1 2 3 4 .... 50000 50001
Will a user ever navigate to page 837? Probably not. What users do in this case is use sort to get the result they want on top, or refine their search criteria to reduce the number of results to something manageable. So the time spent in this huge count() query is almost always wasted. Basically, the information that is relevant to the user is: are there few pages, so it's possible to scan them by eye, or are there a lot, so he should refine his search criteria?
This does not need an accurate count, so the easiest way to fix this is to limit the counted results to something that would fill a number of pages like 5 or 10. Instead of:
SELECT count(*) FROM ...
use:
SELECT count(*) FROM (subquery ORDER BY ... LIMIT ...) AS foo
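As a small runnable illustration of the capped count, here is the same query shape against an in-memory SQLite table. The items table, the LIKE filter and the capped_count helper are made up for the example; the ORDER BY of the subquery is left out because it does not change how many rows come back under the LIMIT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO items (title) VALUES (?)",
    [("item %d" % i,) for i in range(1000)],
)


def capped_count(pattern, cap):
    """Count matching rows, but stop once `cap` rows have been seen."""
    row = conn.execute(
        "SELECT count(*) FROM ("
        " SELECT 1 FROM items WHERE title LIKE ? LIMIT ?"
        ") AS foo",
        (pattern, cap),
    ).fetchone()
    return row[0]
```

A result equal to cap then means "at least this many", which is enough to decide whether to render the full pagination bar or a "many more results" hint.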
The next step is to realize selecting a few pages of results will quite often be almost as fast as selecting one page, so this is a good opportunity to cache the results for at least the first few pages when the first page is requested. This allows getting rid of the count, as you retrieve more results than necessary.
It is also possible to return the first few pages to the client and paginate on the client side using JavaScript, which saves further server-side queries.
Quite often the user will click on the last page instead of reversing the order. In this case you should flip the ORDER BY direction to keep a small LIMIT, rather than counting all the rows and using a huge OFFSET to skip all pages except the last. When using the correct ORDER BY direction depending on which page is requested, the most common ones (first and last page) are fastest, with the worst case being in the middle, which is rarely clicked.
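A sketch of the ORDER BY flip for the last page, again on an in-memory SQLite table. The items table and the helper names are assumptions, and as a simplification the last page here always shows the final PAGE_SIZE rows, so no total count is needed at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, views INTEGER)")
conn.executemany("INSERT INTO items (views) VALUES (?)", [(v,) for v in range(1, 96)])

PAGE_SIZE = 20


def first_page():
    rows = conn.execute(
        "SELECT views FROM items ORDER BY views ASC LIMIT ?", (PAGE_SIZE,)
    ).fetchall()
    return [r[0] for r in rows]


def last_page():
    # Flip the sort direction so the last rows come first, keep the LIMIT small,
    # then reverse in Python to restore the display order.
    rows = conn.execute(
        "SELECT views FROM items ORDER BY views DESC LIMIT ?", (PAGE_SIZE,)
    ).fetchall()
    return [r[0] for r in reversed(rows)]
```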
Another option is to cache the counts. The largest counts will most likely be for queries involving few search criteria, perhaps with common values, which results in a few combinations that can be cached beforehand. In addition, if the user clicks on page 2, reuse the cached count from the previous page. Of course the counts won't be exact, but that doesn't matter. It would only matter if the pagination logic was done wrong, i.e. not flipping the ORDER BY when pages close to the last one are requested.
"I need this total count due to a requirement from the frontend team."
It's not possible, so the frontend team needs to read the answers to your question and act accordingly.

python create random list of numbers along with a fixed increment

I have a long-running (several hours) script that periodically sends queries to a server. The server is very sensitive to load, so the queries are sparse (not more than 1 every 3 minutes).
The server will always take exactly 10 minutes to process the query. So I can check the result of query 1 any time after 10 minutes of sending it.
So there are two types of operations, "sending query" and "checking result of query". I want all operations to happen at random intervals (subject to the constraint that there are at least 3 minutes between adjacent operations).
Following the advice in this answer (https://stackoverflow.com/a/51918697/10690958), I can generate a time-series of integers such that there is a gap of at least 3 between them. Let's call this series 1.
I can also generate a similar time-series of status checking queries (3 minutes between them). Let's call this series 2.
Now series 1 is randomly spaced. Series 2 is also randomly spaced. But there is a correlation between series 1 and 2, i.e. "response time" = "query time" + 10 minutes.
Thus the union of series 1 and 2 won't be random. Furthermore, there is a (very small) possibility of collision. For example, query 2 might be going out exactly when one is checking the result of query 1.
Is there a way to make the union of the two sequences also perfectly random, as well as avoid the possibility of collisions? Ideally all traffic to the server (whether query or status check) should be at perfectly random intervals.
I realize that the title is not very descriptive, but could not figure out a better way to describe the situation. Please edit if you think you have a better description.
For example:
query_sequence=set([3,8,12,21,37])
check_result_sequence=set([13,18,22,31,47])
server_traffic=query_sequence.union(check_result_sequence)
But their union (server_traffic) is not random , since
check_result_sequence=query_sequence+10
P.S.:
Generating time-points with more granularity might help with reducing the probability of collisions (as mentioned in the comment). As regards randomness of the union of the two sequences, I don't see any satisfactory solution. What I finally decided to do was
check_result_sequence=query_sequence+10+( 5*random.random())
This adds a random "jitter" to the responses sequence, and so should help with reducing correlation between the two sequences.
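The jitter approach can be sketched as a runnable function. The name make_schedule, the max_extra spread and the injectable rng are assumptions added for the example; only the minimum gap of 3, the processing time of 10 and the jitter of up to 5 come from the question. Note that this sketch does not enforce the 3 minute gap between a query and a neighbouring status check in the merged sequence.

```python
import random


def make_schedule(n_queries, min_gap=3.0, max_extra=5.0,
                  processing=10.0, jitter=5.0, rng=None):
    """Return (query_times, check_times) in minutes from the start of the run."""
    rng = rng or random.Random()
    query_times = []
    t = 0.0
    for _ in range(n_queries):
        # Each query goes out at least min_gap minutes after the previous one.
        t += min_gap + max_extra * rng.random()
        query_times.append(t)
    # Each result is checked after the processing time plus a random jitter.
    check_times = [q + processing + jitter * rng.random() for q in query_times]
    return query_times, check_times
```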

1) I hardly see the necessity to randomize the interval between the requests
2) You could use a single list: a list which represents the available moments to submit a request
server_traffic=set([3,8,12,15,19,23,26,30,34,40])
for x in range(4):
    send_query(server_traffic)
while(True):
    send_result_request(server_traffic)
    send_query(server_traffic)
Then every time you decide if you want to send a query, or to check the result, with your own policy. This should make everything easier
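One possible policy for the single-list idea, as a sketch only. The function plan_operations and its rule (check the oldest pending query once it is at least 10 minutes old, otherwise send a new query) are assumptions, not something the answer prescribes.

```python
from collections import deque


def plan_operations(slots, processing=10):
    """Assign each time slot to either a new query or a result check."""
    pending = deque()   # send times of queries whose result has not been checked yet
    plan = []
    for t in sorted(slots):
        if pending and t - pending[0] >= processing:
            # The oldest query has finished processing: use this slot to check it.
            plan.append((t, "check", pending.popleft()))
        else:
            pending.append(t)
            plan.append((t, "query", t))
    return plan
```

Because every operation is drawn from the same list of slots, two operations can never collide, and the spacing of all server traffic is exactly the spacing of that list.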

Django: alternatives to count()

I have a Django app that displays paginated search results. Each page displays 20 results and I have a pagination bar at the bottom that displays the 5 pages less than and 5 pages greater than the current page (like Google). The problem is, for the pagination bar I call count() to get the total number of results so I know if there are actually 5 pages of results ahead of the current page.
The problem is that more general queries could take around 10 seconds to perform a count() on. I don't actually care about the exact number, since most of my users will probably never reach the end of the results. Is there any way to estimate the output of count, or more generally, estimate the number of returned results from a query?
This is currently my query to get the actual results.
results = Item.objects.filter(title__icontains=query).order_by('views')[offset:limit]
The offset and limit variables refer to the segment of results that is shown on the current page. The only way I can see to solve my problem is to get the result segment of ~5 pages ahead and check whether it's empty. However, there are a lot of edge cases for that solution, and I really don't want to spend a day coding that if there is an easier solution.
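The look-ahead idea described above may be smaller than it sounds. Here is a sketch that works on anything sliceable; the function name pages_ahead and its parameters are made up for illustration. With a Django QuerySet, slicing is translated into LIMIT/OFFSET, so passing something like a values_list('pk', flat=True) version of the query (an assumption about how you would call it) would fetch at most max_ahead * page_size keys.

```python
def pages_ahead(results, current_page, page_size=20, max_ahead=5):
    """Return how many pages (0..max_ahead) exist after current_page (1-based)."""
    start = current_page * page_size           # index of the first row after this page
    stop = start + max_ahead * page_size       # look at most max_ahead pages further
    ahead = len(results[start:stop])           # bounded fetch instead of a full count()
    return -(-ahead // page_size)              # ceiling division
```

The pagination bar then shows that many page links to the right of the current page, and the pages to the left are known from the current page number alone.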

This is also not an ideal solution, but it might be worth testing for your particular situation. You could introduce a timeout for the function making the query with the help of:
http://code.activestate.com/recipes/576780/
... or a similar solution. You need either the exact number of rows or the information that there are going to be a lot. This requires a bit of benchmarking to get the timeout right and is still vulnerable to some externalities, but it might well work fine in 99.9% of cases.
It also reduces the load that long queries put on the db.

Fast and efficient random queryset - Django [duplicate]

This question already has answers here:
Best way to select random rows PostgreSQL
(13 answers)
Closed 8 years ago.
I looked at some posts on SO about random querysets in Django. All I can find is that order_by('?') is not fast or efficient at all, but my experience does not suggest so.
I have a table with 30.000 entries (final production will have around 200.000 entries). I can create separate tables with around 10-15k entries each.
So, I want to get a random selection of 100 (maybe 200) items very quickly and efficiently.
The idea of creating a list of 100 random numbers is in my opinion not good enough, because some PKs will be missing (because of deletions, etc.).
And I don't want to generate one random number and then take the 99 following items.
I will be using Postgresql (no special reason...i can choose other if they are better).
I tested order_by('?')[:100] and it seems very fast (I think). It took only 0.017s per list.
Why do the docs say that this is not a good operation for random selection?
Which random do you prefer?
Is there any better way to do this?

ORDER BY random()
LIMIT n
is a valid approach but slow because every single row in the table has to be considered.
This is still fast with 30 k rows, but with 30 M rows .. not so much.
I suggest this related question:
Best way to select random rows PostgreSQL

Google App Engine request timing out on count() function

I am using the count() function to calculate the number of results returned by the query. The problem is that count() is taking so long that the request times out. Is there any way to make count() respond quickly, or is there any alternative to count()?
query = MyModel.query().filter(MyModel.name.IN(['john', 'sara', 'alex']))
search_count = query.count()
If I remove the count line and just return the results, it takes just a couple of seconds.

Unfortunately count doesn't scale. You can only count 1000 items without using a cursor. Secondly, if you want to count, do a keys-only query (it pulls less data from the datastore).
Really, to keep a count relatively up to date for a large number of entities, you will need to use a task and run it every so often (or trigger a task to be scheduled each time data is added/modified, if that is infrequent) and store that value away somewhere.
Or think about why you really need a count ;-) and how accurate it is.

If you need count(), you should use the keys_only option as Tim Hoffman already suggested. That should save you enough time for counting small query results.
Be aware that count() actually runs through the complete query until the very last match in the index. This means, if your query matches millions of items in a huge index, you will see terrible request times and time-outs even with the keys_only option.
From a usability perspective it isn't likely that a user wants accurate numbers in large scales. Typically users will not even browse through dozens or even hundreds of pages.
Counter with threshold accuracy
Consider using a counter that is only accurate up to a low limit, e.g. "41 items found", and beyond that limit use a generic display, e.g. "1000 or more items found". This is how text searches in Gmail show the number of matches.
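A minimal sketch of such a threshold display. The helper describe_count and its count_fn argument are hypothetical: count_fn is assumed to take a limit and return the number of matches up to that limit, for example something like lambda limit: query.count(limit=limit) in your own code.

```python
def describe_count(count_fn, limit=1000):
    """Format a match count that is only exact up to `limit`."""
    # Ask for one more than the limit so "exactly limit" and "more" can be told apart.
    n = count_fn(limit + 1)
    if n > limit:
        return "%d or more items found" % limit
    return "%d items found" % n
```

Since the query stops after limit + 1 matches, the cost of the count is bounded regardless of how many entities actually match.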
Pre-calculated counter
Enter a generic term like "spaghetti" into Google search and you will see some incredibly high number, e.g. "5.3 million documents found". Then try to get to page number 1,000 or to match number 1,000,000. It won't work. And the number is inaccurate as well. For calculating number of matches ahead of time, you could write tasks / cron jobs (maybe with map-reduce) that will calculate the counters asynchronously. However, even in business use-cases the counter of an individual search query like in your example doesn't need to be accurate with large numbers because it is very probable that the counter is changing significantly while the user goes through the results.
Shard counters
If you however need an accurate counter, for example the number of all sales orders in the datastore, rather than individual queries, you could write a counter and increase/decrease it with every new sales order that is created or deleted in the datastore. Depending on how you model the entity groups such counter might hit current datastore limitations in large volume writes (~ 1 write op per second per entity group, in reality maybe 3 to 4). See the article Sharding counters which explains how to build a scalable counter.
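To show the idea behind sharding, here is an in-memory sketch only. In the datastore each shard would be a separate entity updated in its own transaction; the class ShardedCounter, the default of 20 shards and the injectable rng are assumptions for illustration and do not reproduce the code from the article.

```python
import random


class ShardedCounter:
    """Spreads writes over several shards; the total is the sum of all shards."""

    def __init__(self, num_shards=20, rng=None):
        self._shards = [0] * num_shards
        self._rng = rng or random.Random()

    def increment(self, delta=1):
        # Pick a random shard so concurrent writers rarely contend on the same one.
        i = self._rng.randrange(len(self._shards))
        self._shards[i] += delta

    def value(self):
        # Reading is more expensive than writing: every shard has to be summed.
        return sum(self._shards)
```

The trade-off is that writes scale with the number of shards while reads touch all of them, which is why the summed value is usually cached as well.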
Use Search API
You could use the full text search service in Google App Engine. Define an index (e.g. "Customer") with fields you want to search. Whenever a customer entity in datastore is updated, put an updated copy of it as document into the search index. In my experience, the Search API is scaling much better for complex searches in large indices. It also shows you a counter and provides your users with full text search capabilities.
