I am trying to get the number of items in a DynamoDB table.
Code
def urlfn():
    if request.method == 'GET':
        print("GET REq processing")
        return render_template('index.html', count=table.item_count)
But I am not getting the real count. I found that there is a 6-hour delay before the real count is reflected. Is there any way to get the real count of items in a table?
Assuming that table in your code above is a boto3 Table resource that is already defined, you can use:
len(table.scan()['Items'])
This will give you an up-to-date count of the items in your table, BUT it reads every single item in your table (and a single scan call only returns up to 1 MB of data, so you have to paginate to cover the whole table). For significantly large tables this can take a long time, AND it consumes read capacity on your table to do so. So, for most practical purposes, it really isn't a very good approach.
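If you do want a scan-based count, a minimal sketch that pages through the whole table could look like this (it assumes table is the same boto3 Table resource; Select='COUNT' keeps DynamoDB from returning the item data, but the items are still read and still consume capacity):

count = 0
response = table.scan(Select='COUNT')
count += response['Count']
while 'LastEvaluatedKey' in response:
    # Keep paging past the 1 MB per-call limit until the whole table is covered.
    response = table.scan(Select='COUNT',
                          ExclusiveStartKey=response['LastEvaluatedKey'])
    count += response['Count']
print(count)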
Depending on your use case, there are a few other options:
Add a meta item that is updated every time a new document is added to the table. This is just a document, with whatever hash key / sort key combination you want, that has a "value" attribute which you increment by 1 every time you add a new item to the database (a rough sketch follows the next option).
Forget about using DynamoDB for this. Sorry if that sounds harsh, but DynamoDB is a NoSQL database, and attempting to use it in the same manner as a traditional relational database system is folly. The number of 'rows' is not something DynamoDB is designed to report, because that's not its use case. There are no rows in DynamoDB - there are items, those items are partitioned, and you access small chunks of them at a time - meaning that the back-end architecture does not lend itself to knowing what the entire table contains at any given moment (hence the roughly six-hour lag in item_count).
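A rough sketch of the counter-item idea from the first option, using boto3 (the table name, the meta item's key, and the "value" attribute name are assumptions; "value" is a DynamoDB reserved word, so it needs an expression attribute name):

import boto3

table = boto3.resource('dynamodb').Table('my-table-name')

# Bump the counter atomically every time you insert a new item.
table.update_item(
    Key={'id': 'item-count'},                      # the dedicated meta item
    UpdateExpression='ADD #v :inc',
    ExpressionAttributeNames={'#v': 'value'},      # 'value' holds the running count
    ExpressionAttributeValues={':inc': 1},
)

# Reading the current count is then a single, cheap GetItem.
count = table.get_item(Key={'id': 'item-count'})['Item']['value']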
I am new to DynamoDB and want to compare the values of a Python list with the attribute values in a DynamoDB table.
I am able to compare a single value by using a query with an index key:
response = dynamotable.query(
    IndexName='Classicmovies',
    KeyConditionExpression=Key('DDT').eq('BBB-rrr-jjj-mq'))
but I want to compare an entire list, which would need to go into .eq, as follows:
movies =['ddd-dddss-gdgdg','kkdf-dfdfd-www','dfw-gddf-gssg']
I have searched a lot and have not been able to figure out the right way.
Hard to say what you are trying to do. A query can only retrieve records belonging to a single item collection. Maybe what you need is a scan, but please avoid heavy use of scans unless it's for maintenance purposes.
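If every value in your list is a key value for that index, one hedged option is simply to run one query per value and collect the results (dynamotable, the index name, and the DDT key are taken from your code; everything else is an assumption):

from boto3.dynamodb.conditions import Key

movies = ['ddd-dddss-gdgdg', 'kkdf-dfdfd-www', 'dfw-gddf-gssg']

items = []
for movie in movies:
    # One Query per value; each call only touches a single item collection.
    response = dynamotable.query(
        IndexName='Classicmovies',
        KeyConditionExpression=Key('DDT').eq(movie))
    items.extend(response['Items'])

Alternatively, a scan with a filter such as Attr('DDT').is_in(movies) would return the same items in one request, but it reads the entire table to do so.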
I have a large database table with more than a million records, and Django is taking a really long time to retrieve the data. When I had fewer records, the data was retrieved quickly.
I am using the get() method to retrieve the data from the database. I did try the filter() method, but when I did, it gave me the entire table rather than filtering on the given condition.
Currently I retrieve the data using the code shown below:
context['variables'] = variable.objects.get(id=self.kwargs['pk'])
I know why it is slow: it's trying to go through all the records to find the one whose id matches. But I was wondering if there is a way to restrict the search to the last 100 records, or if there is something I am not doing correctly with the filter() function. Any help would be appreciated.
I'd like to "truncate" (delete all items in) a DynamoDB table. I know that the most efficient way to do this would be to delete the table and re-create it (name, indexes, etc.). However, the table is part of a SAM/CloudFormation deployment. The table (by name) is also referenced within other parts of the application.
If I deleted and re-created it, I could use the same name it had previously; however, I think this would cause problems because (1) the deletion isn't immediate and (2) the ARN would change, and that could have implications for the CloudFormation stack.
It seems that there should be a better solution than the brute-force approach: iterate through all items, deleting them one at a time (with some optimization via the batch_writer).
I've looked at some other solutions here, but they don't address the "part of a CloudFormation stack" part of my question.
Truncate DynamoDb or rewrite data via Data Pipeline
What is the recommended way to delete a large number of items from DynamoDB?
I even provided a brute-force solution myself to another's question on this topic.
delete all items DynamoDB using Python
Here is the brute-force approach:
import boto3

table = boto3.resource('dynamodb').Table('my-table-name')

scan = None
with table.batch_writer() as batch:
    count = 0
    while scan is None or 'LastEvaluatedKey' in scan:
        if scan is not None and 'LastEvaluatedKey' in scan:
            scan = table.scan(
                ProjectionExpression='id',
                ExclusiveStartKey=scan['LastEvaluatedKey'],
            )
        else:
            scan = table.scan(ProjectionExpression='id')

        for item in scan['Items']:
            if count % 5000 == 0:
                print(count)
            batch.delete_item(Key={'id': item['id']})
            count = count + 1
The desired final state is a DynamoDB table (that was previously full of items) with the same name, no items, and still able to be destroyed as part of a CloudFormation delete operation.
No matter whether you created the table as AWS::Serverless::SimpleTable or AWS::DynamoDB::Table, there is no out-of-the-box way to empty it using CloudFormation while keeping its name.
As a general best practice, you shouldn't name DynamoDB tables created by CloudFormation; let CloudFormation assign a name for the resource instead. If that had been the case in your setup, you could simply make a change to the resource that requires "replacement", like temporarily adding a Local Secondary Index, which would recreate the resource and would still work with resources depending on it.
That said, in your situation the best approach is probably to wrap your brute-force code in a CloudFormation custom resource and include that in your CloudFormation stack. With that you can truncate the table once or, depending on the implementation of your custom resource, whenever you want.
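As a very rough sketch only (the TableName property, the key attribute name, and the handler wiring are assumptions, not a tested implementation), a Lambda-backed custom resource handler could look something like this:

import boto3
import cfnresponse  # available to inline Lambda code in CloudFormation

def truncate(table_name):
    table = boto3.resource('dynamodb').Table(table_name)
    kwargs = {'ProjectionExpression': 'id'}
    with table.batch_writer() as batch:
        while True:
            scan = table.scan(**kwargs)
            for item in scan['Items']:
                batch.delete_item(Key={'id': item['id']})
            if 'LastEvaluatedKey' not in scan:
                break
            kwargs['ExclusiveStartKey'] = scan['LastEvaluatedKey']

def handler(event, context):
    try:
        # Truncate on Create/Update; deleting the custom resource itself is a no-op.
        if event['RequestType'] in ('Create', 'Update'):
            truncate(event['ResourceProperties']['TableName'])
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    except Exception:
        cfnresponse.send(event, context, cfnresponse.FAILED, {})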
Keep in mind that deleting all items from a DynamoDB table might take quite a long time, so a Lambda-backed custom resource might run into the limit on Lambda function runtime, depending on the number of items in the table. It might also become quite costly if the table contains a lot of items.
I have around 5000+ videos in my database, and I have created a page, http://mysite.com/videos, to list all the videos. Now I am implementing pagination so that only 20 videos are listed on each page, e.g.
http://mysite.com/videos?page=1 showing first 20 videos, http://mysite.com/videos?page=2 showing next 20 videos.
I am having trouble choosing the best method to implement pagination. I thought of using table.scan() each time a new page is requested and then selecting only the required items with some Python logic, but that seems quite expensive.
I am using Python / Django with boto library.
In DynamoDB you can execute queries with a limit set. From the documentation at:
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
you can read:
ExclusiveStartKey
The primary key of the first item that this operation will evaluate. Use the value that was returned for LastEvaluatedKey in the previous operation. The data type for ExclusiveStartKey must be String, Number or Binary. No set data types are allowed.
Type: String to AttributeValue object map
Required: No
And
Limit
The maximum number of items to evaluate (not necessarily the number of matching items). If Amazon DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before Amazon DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information see Query and Scan in the Amazon DynamoDB Developer Guide.
Type: Number
Required: No
You don't provide any info about how the keys of your table are structured. However, the approach would be to query the table for the elements matching your hash key (and range key if suitable), with the limit set to 20.
In the result you will receive a "LastEvaluatedKey", which you then pass to the next query, again with the limit set to 20.
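A minimal boto3 sketch of that pattern (the table's hash key name and the way you carry the cursor between page requests are assumptions):

from boto3.dynamodb.conditions import Key

def get_page(table, channel, last_key=None, page_size=20):
    kwargs = {
        'KeyConditionExpression': Key('channel').eq(channel),  # hypothetical hash key
        'Limit': page_size,
    }
    if last_key:
        kwargs['ExclusiveStartKey'] = last_key
    response = table.query(**kwargs)
    # Items for this page, plus the cursor for the next page (None after the last page).
    return response['Items'], response.get('LastEvaluatedKey')

Note that LastEvaluatedKey is a dict, so in a web app you would have to serialize it into the page link or a session to pass it along with the next request.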
Here are some options:
You can pre-load all video objects when the application starts up and then do in-memory pagination the way you want. 5000+ objects shouldn't be a big deal.
You can fetch first page and then asynchronously fetch the rest (via scan) and then again do pagination in-memory.
You can create an index table that stores an entry for each page, containing the ids of the videos on that page; then, to fetch the videos for a page, you make two calls:
3.1 Fetch the page entry by page id (a simple get operation). This contains the list of video ids that should be on that page.
3.2 Fetch all the videos from 3.1 with a multi-get (batch get) operation.
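A rough boto3 sketch of option 3 (the table names, key names, and page-entry layout are assumptions):

import boto3

dynamodb = boto3.resource('dynamodb')

# 3.1: one GetItem for the page entry, which lists the video ids on that page.
page = dynamodb.Table('video-pages').get_item(Key={'page': 2})['Item']

# 3.2: one BatchGetItem for all the videos referenced by that page.
response = dynamodb.batch_get_item(RequestItems={
    'videos': {'Keys': [{'id': video_id} for video_id in page['video_ids']]}
})
videos = response['Responses']['videos']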
For a similar use case, we loaded all the metadata into JavaScript objects and did pagination and sorting from there, and the end result for the user is just nice (fast and responsive). We're also using the trick of fetching the first page and then the whole content again (as DynamoDB doesn't support cursors at the moment).
Limit is not what you think. Here's what I suggest:
Using DynamoDBMapper, issue a
int numRows = mapper.count(SomeClass.class, scanExpression);
to efficiently get the number of rows in your table.
Then run a
PaginatedScanList<SomeClass> result = mapper.scan(SomeClass.class, scanExpression);
to get the list - the key here is that PaginatedScanList is lazily loaded. Be careful not to call .size() on the result, as this will load all the rows. Just use .get() to load only the rows you need.
Iterate over the PaginatedScanList using
int offset = startPage * pageSize;
List<SomeClass> list = new ArrayList<>();
for (int i = 0; i < pageSize; i++) {
    list.add(result.get(offset + i));
}
Check for out-of-bounds etc. Hope that helps.
I'm trying to move from Redis to DynamoDB, and so far everything is working great! The only thing I have yet to figure out is key expiration. Currently, I have my data set up with one primary key and no range key, like so:
{
    "key" => string,
    "value" => ["string", "string"],
    "timestamp" => seconds since epoch
}
What I was thinking was to do a scan over the table for items where the timestamp is less than a particular value, and then explicitly delete them. This, however, seems extremely inefficient and would use up a ridiculous number of read/write units for no reason! On top of that, the expirations would only happen when I run the scan, so expired items could conceivably build up.
So, has anyone found a good solution to this problem?
I'm also using DynamoDB the way we used to use Redis.
My suggestion is to write the keys into different time-sliced tables.
For example, say a type of record should last a few minutes, at most an hour; then you can:
Create a new table every day for this type of record and store new records in today's table.
Use a read-repair trick when you read records: if you can't find a record in today's table, try to find it in yesterday's table and copy it into today's table if necessary (a sketch follows below).
If you find the record in either table, verify it against its timestamp. It's not necessary to delete expired records at this point.
Drop entire stale tables in your scheduled tasks.
This is easier to maintain and more cost-efficient.
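A rough sketch of the read-repair lookup (the table naming scheme, key name, and expiry handling are assumptions):

import time
import boto3

dynamodb = boto3.resource('dynamodb')

def get_record(key, ttl_seconds):
    today = dynamodb.Table('records-' + time.strftime('%Y-%m-%d'))
    yesterday = dynamodb.Table(
        'records-' + time.strftime('%Y-%m-%d', time.localtime(time.time() - 86400)))

    item = today.get_item(Key={'key': key}).get('Item')
    if item is None:
        item = yesterday.get_item(Key={'key': key}).get('Item')
        if item is not None:
            today.put_item(Item=item)  # read repair: copy it into today's table

    # Whichever table it came from, treat it as missing if it has already expired.
    if item is None or time.time() - float(item['timestamp']) > ttl_seconds:
        return None
    return item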
You could do lazy expiration and delete entries on request.
For example:
Store key "a" with an attribute "expiration" set to 10 minutes from now.
Fetch it at 9 minutes: check the expiration, it hasn't passed yet, return the item.
Fetch it at 11 minutes: check the expiration; since it's earlier than now, delete the entry and return nothing.
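A minimal boto3 sketch of that lazy-expiration check (the table name, key schema, and epoch-seconds "expiration" attribute are assumptions):

import time
import boto3

table = boto3.resource('dynamodb').Table('cache')

def get(key):
    item = table.get_item(Key={'key': key}).get('Item')
    if item is None:
        return None
    if float(item['expiration']) < time.time():
        # Expired: delete it lazily on read and pretend it was never there.
        table.delete_item(Key={'key': key})
        return None
    return item['value']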
This is what memcached was doing when I looked at the source a few years ago.
You'd still need an occasional scan to remove old entries that are never fetched again.
You could also consider using ElastiCache, which is meant for caching rather than as a permanent data store.
It seems that Amazon just added expiration support (TTL) to DynamoDB (as of February 27, 2017). Take a look at the official blog post:
https://aws.amazon.com/blogs/aws/new-manage-dynamodb-items-using-time-to-live-ttl/
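Once TTL is available on your table, enabling it is a single call; a sketch (the table name and the choice of "timestamp" as the TTL attribute are assumptions):

import boto3

client = boto3.client('dynamodb')

# Tell DynamoDB which numeric attribute holds the expiry time, in epoch seconds.
client.update_time_to_live(
    TableName='my-table-name',
    TimeToLiveSpecification={'Enabled': True, 'AttributeName': 'timestamp'})

DynamoDB then deletes expired items in the background at no extra read/write cost, though the deletion is not instantaneous and can lag the expiry time by a day or two.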
You could use the timestamp as the range key, which would be indexed and allow for easier operations based on time.
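For example, with "key" as the hash key and "timestamp" as the range key, expired entries for a given key could be found with a Query instead of a Scan (a sketch; it assumes a boto3 Table resource named table and follows the structure in the question):

import time
from boto3.dynamodb.conditions import Key

# Items for this key older than 10 minutes, located via the range key.
cutoff = int(time.time()) - 600
response = table.query(
    KeyConditionExpression=Key('key').eq('a') & Key('timestamp').lt(cutoff))
expired = response['Items']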