I have around 5000+ videos in my database and I have created a page http://mysite.com/videos to list all the videos. Now I am implementing pagination so that only 20 videos are listed on each page, e.g.
http://mysite.com/videos?page=1 shows the first 20 videos, http://mysite.com/videos?page=2 shows the next 20 videos.
I have a problem choosing the best method to implement pagination. I thought of calling table.scan() each time a new page is requested and then selecting only the required items with some Python logic, but that seems quite expensive.
I am using Python / Django with boto library.
In Dynamo you can execute queries by setting a limit. From the documentation at:
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
you can read:
ExclusiveStartKey
The primary key of the first item that this operation will evaluate. Use the value that was returned for LastEvaluatedKey in the previous operation.
The data type for ExclusiveStartKey must be String, Number or Binary. No set data types are allowed.
Type: String to AttributeValue object map
Required: No
And
Limit
The maximum number of items to evaluate (not necessarily the number of matching items). If Amazon DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before Amazon DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information see Query and Scan in the Amazon DynamoDB Developer Guide.
Type: Number
Required: No
You don't provide any info about how the keys of your table are structured. However, the approach would be to query the table for the elements matching your hash key (and range key condition, if suitable), with the limit set to 20.
In the result you will receive a LastEvaluatedKey, which you then pass as ExclusiveStartKey in the next query, again with the limit set to 20.
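A rough sketch of that with boto3 (the question mentions boto; the table name, key names and the category value here are only assumptions for illustration):

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('videos')  # hypothetical table name

    def get_page(last_key=None, page_size=20):
        # Query one "page" of items; key and attribute names are assumptions.
        kwargs = {
            'KeyConditionExpression': Key('category').eq('all'),
            'Limit': page_size,
        }
        if last_key:
            kwargs['ExclusiveStartKey'] = last_key
        response = table.query(**kwargs)
        # Pass response.get('LastEvaluatedKey') back in to fetch the next page.
        return response['Items'], response.get('LastEvaluatedKey')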
Here are some options:
1. You can pre-load all video objects when the application starts up and then do in-memory pagination the way you want. 5000+ objects shouldn't be a big deal.
2. You can fetch the first page and then asynchronously fetch the rest (via scan) and then again do pagination in-memory.
3. You can create an index table that stores an entry for each page containing the ids of the videos on that page; then, to fetch the videos of a page, you make two calls (see the sketch after this list):
3.1 Fetch the page by page id (a simple get operation). This returns the list of video ids that should be on that page.
3.2 Fetch all the videos from 3.1 by doing a multi-get (batch get) operation.
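A minimal sketch of that index-table idea with boto3 (table names, key names and attributes are assumptions):

    import boto3

    dynamodb = boto3.resource('dynamodb')
    pages = dynamodb.Table('video_pages')   # hypothetical index table
    videos = dynamodb.Table('videos')       # hypothetical videos table

    def get_page(page_number):
        # 3.1: fetch the page entry, which holds the list of video ids
        page = pages.get_item(Key={'page': page_number})['Item']
        # 3.2: multi-get all videos for that page in one batch request
        response = dynamodb.batch_get_item(
            RequestItems={
                'videos': {'Keys': [{'video_id': vid} for vid in page['video_ids']]}
            }
        )
        return response['Responses']['videos']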
For a similar use case, we loaded all the metadata into JavaScript objects and did the pagination and sorting from there, and the end result for the user is quite nice (fast and responsive). We also use the trick of fetching the first page and then the whole content again (as DynamoDB doesn't support cursors at the moment).
Limit is not what you think. Here's what I suggest:
Using DynamoDBMapper issue a
numRows = mapper.count(<SomeClass>.class, scanExpression)
to efficiently get the number of rows in your table.
Then run a
PaginatedScanList<SomeClass> result = mapper.scan(<SomeClass>.class, scanExpression);
to get the List - the key here is that PaginatedScanList is lazily loaded. Be careful not to call .size() on the result, as this will load all the rows. Just use .get() to load only the rows you need.
Iterate over the PaginatedScanList using something like:
int offset = startPage * pageSize;
List<SomeClass> list = new ArrayList<>();
for (int i = 0; i < pageSize; i++) {
    list.add(result.get(offset + i));
}
Check for out-of-bounds etc. Hope that helps.
I am trying to get the number of items in a table from DynamoDB.
Code
def urlfn():
    if request.method == 'GET':
        print("GET REq processing")
        return render_template('index.html', count=table.item_count)
But I am not getting the real count. I found that there is a 6-hour delay in getting the real count. Is there any way to get the real count of items in a table?
Assuming in your code above that table is a boto3 Table resource that is already defined, you can use:
len(table.scan()['Items'])
This will give you an up-to-date count of items in your table. BUT it reads every single item in your table (and scan results are paginated at 1 MB, so for large tables you have to follow LastEvaluatedKey), which can take a long time, AND it uses read capacity on your table to do so. So, for most practical purposes it really isn't a very good way to do it.
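If you do go the scan route, a sketch of a full paginated count with boto3 might look like this (assuming table is a boto3 Table resource):

    def scan_count(table):
        # Select='COUNT' avoids returning item data, but still consumes read capacity.
        response = table.scan(Select='COUNT')
        count = response['Count']
        while 'LastEvaluatedKey' in response:
            response = table.scan(Select='COUNT',
                                  ExclusiveStartKey=response['LastEvaluatedKey'])
            count += response['Count']
        return count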
Depending on your use case, there are a few other options:
Add a meta item that is updated every time a new document is added to the table. This is just a document of whatever hash key / sort key combination you want, with a "value" attribute that you add 1 to every time you add a new item to the database (see the sketch after this list).
You forget about counting with Dynamo. Sorry if that sounds harsh, but DynamoDB is a NoSQL database, and attempting to use it in the same manner as a traditional relational database system is folly. The number of 'rows' is not something Dynamo is designed to report, because that's not in its use-case scope. There are no rows in Dynamo - there are documents, and those documents are partitioned, and you access small chunks of them at a time - meaning the back-end architecture does not lend itself to knowing what the entire system holds at any given time (hence the 6-hour delay).
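A sketch of that counter item using an atomic ADD update with boto3 (the key and attribute names are only placeholders):

    def increment_count(table):
        # Call this each time a new item is written to the table.
        table.update_item(
            Key={'hash_key': 'meta', 'sort_key': 'item_count'},  # placeholder keys
            UpdateExpression='ADD #v :inc',
            ExpressionAttributeNames={'#v': 'value'},
            ExpressionAttributeValues={':inc': 1},
        )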
I can't find any way to set up a TTL on a document within AWS Elasticsearch using the Python elasticsearch library.
I looked at the code of the library itself and there is no argument for it, and I have yet to see any answers on Google.
There is none. You can use an index management policy if you like, which operates at the index level, not at the doc level. You do have a bit of wiggle room, though, in that you can create a pattern data-* and have more than one index, e.g. data-expiring-2020-..., data-keep-me.
You can apply a template to the pattern data-expiring-* and set a transition to delete an index after, let's say, 20 days. If you roll over to a new index each day, you will see the oldest index being deleted at the end of each day once it is over 20 days old.
This method is much preferable because deleting individual documents can consume large amounts of your cluster's capacity, as opposed to deleting entire shards. Other NoSQL databases such as DynamoDB operate in a similar fashion: often what you can do is add another field to your docs, such as deletionDate, and add that to your query to filter out docs which are marked for deletion but are still alive in your index because a deletion job has not yet cleaned them up. That is how the TTL in DynamoDB behaves as well; data is not deleted the moment the TTL expires, but rather in batches to improve performance.
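A sketch of that deletionDate filtering idea with the Python elasticsearch client (the index pattern and field name are assumptions):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # configure hosts/auth for your AWS Elasticsearch domain

    query = {
        "query": {
            "bool": {
                "must": [{"match_all": {}}],
                # exclude docs whose deletionDate has already passed
                "must_not": [{"range": {"deletionDate": {"lte": "now"}}}],
            }
        }
    }
    results = es.search(index="data-*", body=query)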
I have a large database table of more than a million records, and Django is taking a really long time to retrieve the data. When I had fewer records, the data was retrieved quickly.
I am using the get() method to retrieve the data from the database. I did try the filter() method, but when I did that it gave me the entire table rather than filtering on the given condition.
Currently I retrieve the data using the code shown below:
context['variables'] = variable.objects.get(id=self.kwargs['pk'])
I know why it is slow: it's trying to go through all the records and get the record whose id matches. But I was wondering if there is a way I could restrict the search to the last 100 records, or if there is something I am not doing correctly with the filter() function. Any help would be appreciated.
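For reference, a condition-based filter() and a slice over the most recent rows look like this in the Django ORM (a sketch, not tied to the asker's models):

    # filter() with a condition returns a lazily evaluated QuerySet
    matching = variable.objects.filter(id=self.kwargs['pk'])

    # restrict a query to the last 100 records by primary key
    recent = variable.objects.order_by('-id')[:100]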
I am trying to iterate over a SearchQuerySet with haystack, but it throws this error:
Result window is too large, from + size must be less than or equal to: [10000] but was [11010]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.
Is there a way to iterate over all indexed elements? (let's say I have several million records).
max_result_window is an index setting that you can change if you want, but most of the time you don't have to, because if you'd like to iterate over all your documents, the search API is not the way to go.
Try the scan and scroll API instead.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
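With the low-level elasticsearch-py client, the scan helper wraps the scroll API; a sketch (the index name and process() are placeholders):

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    es = Elasticsearch()

    # scan() handles the scroll bookkeeping and yields every matching document
    for hit in scan(es, index='my_index', query={"query": {"match_all": {}}}):
        process(hit['_source'])  # process() stands in for your own logic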
And a personal note: I use Elasticsearch with Django and I found haystack difficult to use compared to elasticsearch-dsl. Try having a look at elasticsearch-dsl-py: https://github.com/elastic/elasticsearch-dsl-py
We are migrating some data from our production database and would like to archive most of this data in the Cloud Datastore.
Eventually we would move all our data there; however, we are initially focusing on the archived data as a test.
Our language of choice is Python, and we have been able to transfer data from MySQL to the Datastore row by row.
We have approximately 120 million rows to transfer, and a one-row-at-a-time approach will take a very long time.
Has anyone found some documentation or examples on how to bulk insert data into cloud datastore using python?
Any comments or suggestions are appreciated, thank you in advance.
There is no "bulk-loading" feature for Cloud Datastore that I know of today, so if you're expecting something like "upload a file with all your data and it'll appear in Datastore", I don't think you'll find anything.
You could always write a quick script using a local queue that parallelizes the work.
The basic gist would be:
Queuing script pulls data out of your MySQL instance and puts it on a queue.
(Many) Workers pull from this queue, and try to write the item to Datastore.
On failure, push the item back on the queue.
Datastore is massively parallelizable, so if you can write a script that will send off thousands of writes per second, it should work just fine. Further, your big bottleneck here will be network IO (after you send a request, you have to wait a bit to get a response), so lots of threads should get a pretty good overall write rate. However, it'll be up to you to make sure you split the work up appropriately among those threads.
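A rough sketch of that queue-and-workers idea using the google-cloud-datastore client and the standard library (the kind name, key field and the MySQL fetch are placeholders):

    import threading
    import queue

    from google.cloud import datastore

    work_queue = queue.Queue(maxsize=10000)
    client = datastore.Client()

    def worker():
        while True:
            row = work_queue.get()
            try:
                entity = datastore.Entity(key=client.key('ArchivedRow', row['id']))
                entity.update(row)
                client.put(entity)
            except Exception:
                work_queue.put(row)  # on failure, push the item back on the queue
            finally:
                work_queue.task_done()

    # start a pool of writer threads
    for _ in range(32):
        threading.Thread(target=worker, daemon=True).start()

    # fetch_rows_from_mysql() is a placeholder for your MySQL extraction logic
    for row in fetch_rows_from_mysql():
        work_queue.put(row)

    work_queue.join()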
Now, that said, you should investigate whether Cloud Datastore is the right fit for your data and durability/availability needs. If you're taking 120m rows and loading it into Cloud Datastore for key-value style querying (aka, you have a key and an unindexed value property which is just JSON data), then this might make sense, but loading your data will cost you ~$70 in this case (120m * $0.06/100k).
If you have properties (which will be indexed by default), this cost goes up substantially.
The cost of operations is $0.06 per 100k, but a single "write" may contain several "operations". For example, let's assume you have 120m rows in a table that has 5 columns (which equates to one Kind with 5 properties).
A single "new entity write" is equivalent to:
+ 2 (1 x 2 write ops fixed cost per new entity)
+ 10 (5 x 2 write ops per indexed property)
= 12 "operations" per entity.
So your actual cost to load this data is:
120m entities * 12 ops/entity * ($0.06/100k ops) = $864.00
I believe what you are looking for is the put_multi() method.
From the docs, you can use put_multi() to batch multiple put operations. This will result in a single RPC for the batch rather than one for each of the entities.
Example:
# a list of many entities
user_entities = [ UserEntity(name='user %s' % i) for i in xrange(10000)]
users_keys = ndb.put_multi(user_entities) # keys are in same order as user_entities
Also of note, from the docs:
Note: The ndb library automatically batches most calls to Cloud Datastore, so in most cases you don't need to use the explicit batching operations shown below.
That said, you may still, as suggested in the other answer, use a task queue (I prefer the deferred library) in order to batch-put a lot of data in the background.
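A sketch of batching puts in the background with the deferred library on App Engine (UserEntity is the example model from above; 'chunks' and the batch size are assumptions):

    from google.appengine.ext import deferred, ndb

    def put_batch(rows):
        # Runs later on a task-queue worker; one RPC for the whole batch.
        entities = [UserEntity(name=row['name']) for row in rows]
        ndb.put_multi(entities)

    # 'chunks' is a placeholder for your data split into lists of a few hundred
    # dicts each; deferred pickles its arguments, so keep batches small.
    for chunk in chunks:
        deferred.defer(put_batch, chunk)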
As an update to the answer of JJ Geewax: as of July 1st, 2016,
the cost of read and write operations has changed, as explained here: https://cloud.google.com/blog/products/gcp/google-cloud-datastore-simplifies-pricing-cuts-cost-dramatically-for-most-use-cases
So writing should have gotten cheaper for the described case, since
writing a single entity only costs 1 write regardless of indexes and now costs $0.18 per 100,000.