I'd like to "truncate" (delete all items) in a DynamoDB table. I know that the most efficient way to do this would be to delete the table and re-create it (name, indexes, etc.). However, the table is part of a SAM-CloudFormation deployment. The table (by name) is also referenced within other parts of the application.
If I deleted and re-created it, I could use the same name it had previously; however, I think this would cause problems because (1) the deletion isn't immediate and (2) the ARN would change, which could have implications for the CloudFormation stack.
It seems that there should be a better solution than the brute-force approach: iterate through all items, deleting them one at a time (with some optimization via the batch_writer).
I've looked at some other solutions here, but they don't address the "part of a CloudFormation stack" part of my question.
Truncate DynamoDb or rewrite data via Data Pipeline
What is the recommended way to delete a large number of items from DynamoDB?
I even provided a brute-force solution myself to another's question on this topic.
delete all items DynamoDB using Python
Here is the brute-force approach:
import boto3

table = boto3.resource('dynamodb').Table('my-table-name')
scan = None

with table.batch_writer() as batch:
    count = 0
    # Keep scanning until DynamoDB no longer returns a LastEvaluatedKey (the last page).
    while scan is None or 'LastEvaluatedKey' in scan:
        if scan is not None and 'LastEvaluatedKey' in scan:
            scan = table.scan(
                ProjectionExpression='id',
                ExclusiveStartKey=scan['LastEvaluatedKey'],
            )
        else:
            scan = table.scan(ProjectionExpression='id')

        # Queue a delete for every item in this page; batch_writer flushes in batches of 25.
        for item in scan['Items']:
            if count % 5000 == 0:
                print(count)
            batch.delete_item(Key={'id': item['id']})
            count = count + 1
The desired final state is a DynamoDB table (that was previously full of items) with the same name, no items, and still able to be destroyed as part of a CloudFormation delete operation.
Whether you created the table as AWS::Serverless::SimpleTable or AWS::DynamoDB::Table, there is no out-of-the-box way to empty it using CloudFormation while keeping its name.
As a general best practice, you shouldn't name DynamoDB tables created by CloudFormation, but let CloudFormation assign a name to the resource. If that had been the case in your setup, you could simply make a change to the resource that requires replacement, like temporarily adding a Local Secondary Index; that would recreate the resource and would still work with resources depending on it.
That said, in your situation the best approach is probably to wrap your brute-force approach in a CloudFormation custom resource and include that in your CloudFormation stack. With that you can truncate the table once or, depending on the implementation of your custom resource, whenever you want.
Keep in mind that deleting all items from a DynamoDB table can take quite a long time, so a Lambda-backed custom resource might run into the Lambda runtime limit, depending on the number of items in the table. It can also become quite costly if the table contains a lot of items.
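For illustration, here is a minimal sketch of what the Lambda behind such a custom resource could look like. It assumes the table uses a simple id partition key (as in your brute-force snippet) and that the table name is passed in as a TableName resource property; the cfnresponse module is the helper AWS provides for inline Lambda code, otherwise you have to bundle an equivalent yourself.

import boto3
import cfnresponse  # helper AWS provides for inline (ZipFile) Lambda code

def handler(event, context):
    # Nothing to truncate when the custom resource itself is being deleted.
    if event['RequestType'] == 'Delete':
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
        return
    try:
        table_name = event['ResourceProperties']['TableName']  # assumed property name
        table = boto3.resource('dynamodb').Table(table_name)
        scan_kwargs = {'ProjectionExpression': 'id'}  # assumes a plain 'id' partition key
        with table.batch_writer() as batch:
            while True:
                page = table.scan(**scan_kwargs)
                for item in page['Items']:
                    batch.delete_item(Key={'id': item['id']})
                if 'LastEvaluatedKey' not in page:
                    break
                scan_kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    except Exception:
        cfnresponse.send(event, context, cfnresponse.FAILED, {})

Whether the truncate runs on every deployment or only when you change a property of the custom resource is up to how you wire it into the template.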
Background
I would like to create a DynamoDB trigger such that on each new entry the value is updated before saving.
The DynamoDB table consists of Jobs/tasks and I would like to do calculations and assign the job/task to the respective employee.
The task seems relatively simple; I just need some guidance and assistance creating a Lambda function that can accomplish this.
[...] the value is updated before saving.
I am afraid that DynamoDB streams do not work like this. The stream will contain only items that are already stored in the table.
One way to solve this is to add another property to your table that indicates whether the "job" is ready to process. By default, jobs added to the table are "not ready"; the DynamoDB stream then triggers your Lambda, which does its calculations, assigns the job to an employee AND sets the job to "ready".
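To make that concrete, here is a rough sketch of such a stream-triggered Lambda. The table name (Jobs), the id partition key, the attribute names and the assignment logic are all assumptions, and the stream has to be configured with a view type that includes new images.

import boto3

jobs = boto3.resource('dynamodb').Table('Jobs')  # hypothetical table name

def pick_employee(new_image):
    # Placeholder for the calculation / assignment logic described in the question.
    return 'employee-123'

def handler(event, context):
    for record in event['Records']:
        # Only react to freshly inserted jobs; the update below arrives as a MODIFY
        # event and is skipped here, so this does not loop forever.
        if record['eventName'] != 'INSERT':
            continue
        new_image = record['dynamodb']['NewImage']  # stream images use DynamoDB's typed JSON
        job_id = new_image['id']['S']
        jobs.update_item(
            Key={'id': job_id},
            UpdateExpression='SET assigned_to = :e, job_status = :r',
            ExpressionAttributeValues={':e': pick_employee(new_image), ':r': 'ready'},
        )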
Another option might be a little restructuring. For example: why not use a Step Function with multiple steps, the final one being a step that saves the calculated result of the previous step(s) into the table?
More than likely, you'll end up enabling DynamoDB Streams with a Lambda function that reads from that stream. That gives you the "on each new entry" part; the function would then do the calculations and assign the job as you mentioned. I recommend starting with the DynamoDB documentation on Lambda triggers as well as the AWS Lambda documentation on working with DynamoDB Streams.
If that does not get you going in the right direction, let me know and I will dig up more for you.
I am trying to get the number of items in a table from DynamoDB.
Code
def urlfn():
    if request.method == 'GET':
        print("GET REq processing")
        return render_template('index.html', count=table.item_count)
But I am not getting the real count. I found that there is a 6-hour delay in getting the real count. Is there any way to get the real count of items in a table?
Assuming in your code above that table is a service resource already defined, you can use:
table.scan()['Count']
This gives you an up-to-date count. BUT it reads every single item in your table, and a single scan call only returns up to 1 MB of data, so for larger tables you have to paginate with LastEvaluatedKey (see the sketch below). It also consumes read capacity on your table to do so. So, for most practical purposes it really isn't a very good way to do it.
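A minimal paginated version, using Select='COUNT' so DynamoDB returns only the per-page counts rather than the items themselves (the table name is a placeholder):

import boto3

table = boto3.resource('dynamodb').Table('my-table-name')  # placeholder name

def live_item_count(table):
    # Still consumes read capacity proportional to the table size,
    # but avoids transferring the item data to the client.
    total = 0
    kwargs = {'Select': 'COUNT'}
    while True:
        page = table.scan(**kwargs)
        total += page['Count']
        if 'LastEvaluatedKey' not in page:
            return total
        kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']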
Depending on your use case, there are a few other options:
add a meta item that is updated every time a new document is added to the table. This is just a document with whatever hash key / sort key combination you want, with an attribute of "value" that you add 1 to every time you add a new item to the database (see the sketch after this list).
forget about using Dynamo for this. Sorry if that sounds harsh, but DynamoDB is a NoSQL database, and attempting to use it in the same manner as a traditional relational database system is folly. The number of 'rows' is not something Dynamo is designed to report, because that's not its use case. There are no rows in Dynamo; there are documents, those documents are partitioned, and you access small chunks of them at a time. The back-end architecture does not lend itself to knowing what the entire system contains at any given time (hence the 6-hour delay).
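A sketch of the counter approach from the first option, assuming the meta item lives in a table with an id partition key (adjust to your schema); note that "value" is a DynamoDB reserved word, hence the expression attribute name:

import boto3

table = boto3.resource('dynamodb').Table('my-table-name')  # placeholder name

# Call this alongside every put_item; ADD creates the attribute if it doesn't exist yet.
table.update_item(
    Key={'id': 'item-count'},                  # hypothetical meta item key
    UpdateExpression='ADD #v :one',
    ExpressionAttributeNames={'#v': 'value'},  # 'value' is reserved, so alias it
    ExpressionAttributeValues={':one': 1},
)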
I can't find any way to set a TTL on a document within AWS Elasticsearch using the Python elasticsearch library.
I looked at the code of the library itself; there is no argument for it, and I have yet to see any answers on Google.
There is none. You can use an index management policy if you like, but it operates at the index level, not at the document level. You do have a bit of wriggle room, though, in that you can create a pattern like data-* and have more than one index, e.g. data-expiring-2020-..., data-keep-me.
You can apply a template to the pattern data-expiring-* and set a transition to delete an index after, say, 20 days. If you roll over to a new index each day, you will see the oldest index being deleted once it is more than 20 days old.
This method is much preferable because deleting individual documents can consume large amounts of your cluster's capacity, as opposed to deleting entire shards. Other NoSQL databases such as DynamoDB operate in a similar fashion: often what you can do is add another field to your docs, such as deletionDate, and add it to your query to filter out docs that are marked for deletion but are still alive in your index because a deletion job has not yet cleaned them up. That is how the TTL in DynamoDB behaves as well; data is not deleted the moment the TTL expires, but rather in batches to improve performance.
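For reference, an ISM (Index State Management) policy along these lines could look like the sketch below. The exact endpoint path and policy fields depend on the Open Distro / OpenSearch version behind your AWS Elasticsearch domain, so treat the names here as assumptions rather than a definitive recipe; attaching the policy to the data-expiring-* pattern is then done through an index template or the ISM template mechanism, again depending on version.

from elasticsearch import Elasticsearch

es = Elasticsearch('https://my-domain.es.amazonaws.com')  # placeholder endpoint

# Indices start in "hot" and transition to "delete" once they are 20 days old.
policy = {
    "policy": {
        "description": "Delete expiring data indices after 20 days",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {"state_name": "delete", "conditions": {"min_index_age": "20d"}}
                ],
            },
            {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
        ],
    }
}

# The ISM API isn't wrapped by the Python client, so issue the request directly;
# newer OpenSearch versions use /_plugins/_ism/policies/... instead.
es.transport.perform_request(
    'PUT', '/_opendistro/_ism/policies/expire-after-20d', body=policy
)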
I need some help with designing a DynamoDB Hash+Range key scheme for fast single item write access as well as fast parallel read access to groups of items.
Background:
Currently, each fanning link is stored as an item in the following format:
{
    user_id : NUMBER
    fanned_id : NUMBER
    timestamp : NUMBER
},
where user_id is the hash key and fanned_id is the range key. This scheme allows for fast access to a single fanship item (via user_id + fanned_id), but when the complete fanship is read from DynamoDB, it takes a long time for the data to be transferred if the user has fanned thousands of other users.
Here is how I query DynamoDB using the boto python library:
table = Table("fanship_data", connection=conn)
fanship = []
uid = 10
for fanned in table.query_2(user_id__eq=uid):
    fanship.append((fanned["fanned_id"], fanned["timestamp"]))
Clearly the throughput bottleneck is in the boto query, because the whole fanship of a user must be transferred at 25 items per second, even though I have specified a high throughput capacity for DynamoDB.
My question to you:
Assume that there is large read throughput capacity, and that all the data is present in DynamoDB. I do not mind resorting to multiprocessing, since that will be necessary for transferring the data in parallel. What scheme for the Hash+Range key will allow me to transfer the complete fanship of a user quickly?
I think that your hash/range key schema is the right one for what you're trying to accomplish. I have implemented similar schemas on several of my tables.
According to the docs, "Query performance depends on the amount of data retrieved", and there doesn't seem to be a way to parallelize the read. The only way to do a parallel read is via a Scan, but I'm not sure if that is going to be a better approach for you.
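For what it's worth, a parallel read via Scan would look roughly like the sketch below, using boto3 rather than the older boto library in your snippet and assuming the table and attribute names from your item format. Note that the filter is applied after the items are read, so every segment still consumes read capacity for its whole slice of the table, which is why this may not beat your Query in practice.

import boto3
from boto3.dynamodb.conditions import Attr
from concurrent.futures import ThreadPoolExecutor

TOTAL_SEGMENTS = 4           # number of parallel workers; tune to your table
TABLE_NAME = 'fanship_data'  # table name from the question
UID = 10                     # user whose fanship we want

def scan_segment(segment):
    # One session/resource per worker, since boto3 sessions are not thread-safe.
    table = boto3.session.Session().resource('dynamodb').Table(TABLE_NAME)
    results = []
    kwargs = {
        'Segment': segment,
        'TotalSegments': TOTAL_SEGMENTS,
        'FilterExpression': Attr('user_id').eq(UID),
        'ProjectionExpression': 'fanned_id, #ts',
        'ExpressionAttributeNames': {'#ts': 'timestamp'},  # 'timestamp' is reserved
    }
    while True:
        page = table.scan(**kwargs)
        results.extend((item['fanned_id'], item['timestamp']) for item in page['Items'])
        if 'LastEvaluatedKey' not in page:
            return results
        kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    fanship = [pair for segment in pool.map(scan_segment, range(TOTAL_SEGMENTS))
               for pair in segment]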
Apologies for the longish description.
I want to run a transform on every doc in a largish MongoDB collection: about 10 million records, roughly 10 GB. Specifically, I want to apply a geoip transform to the ip field in every doc and either append the resulting record to that doc or just create a whole other record linked to this one by, say, id (the linking is not critical; I can just create a whole separate record). Then I want to count and group by, say, city (I do know how to do the last part).
The major reason I believe I can't use map-reduce is that I can't call out to the geoip library in my map function (or at least that's the constraint I believe exists).
So the central question is: how do I run through each record in the collection and apply the transform in the most efficient way?
Batching via limit/skip is out of the question, as it does a "table scan" and will get progressively slower.
Any suggestions?
Python or JS preferred, just because I have these geoip libs, but code examples in other languages are welcome.
Since you have to go over "each record", you'll do one full table scan anyway, so a simple cursor (find()), maybe fetching only the few fields you need (_id, ip), should do it. The Python driver will do the batching under the hood, so you can give it a hint about the optimal batch size (batch_size) if the default is not good enough.
If you add a new field and it doesn't fit into the previously allocated space, Mongo will have to move the document to another place, so you might be better off creating a new document.
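A rough sketch of that cursor approach with PyMongo is below. The database/collection names, the geoip2 library, and the GeoLite2 database path are all assumptions, since you mentioned you already have your own geoip libs.

from pymongo import MongoClient
import geoip2.database  # stand-in for whichever geoip lib you already have

client = MongoClient()
src = client.mydb.records      # placeholder database/collection names
dst = client.mydb.records_geo

reader = geoip2.database.Reader('GeoLite2-City.mmdb')  # path is an assumption

# Fetch only what we need and let the driver page through the collection.
cursor = src.find({}, {'_id': 1, 'ip': 1}).batch_size(1000)

bulk = []
for doc in cursor:
    try:
        city = reader.city(doc['ip']).city.name
    except Exception:      # unknown / malformed IPs
        city = None
    bulk.append({'src_id': doc['_id'], 'ip': doc['ip'], 'city': city})
    if len(bulk) >= 1000:
        dst.insert_many(bulk)
        bulk = []
if bulk:
    dst.insert_many(bulk)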
Actually, I am also attempting another approach in parallel (as plan B), which is to use mongoexport. I use it with --csv to dump a large CSV file with just the (id, ip) fields. Then the plan is to use a Python script to do a geoip lookup and post back to Mongo as a new doc, on which map-reduce can then be run for counts etc. Not sure if this will be faster than the cursor. We'll see.