How can I optimally (in terms of financial cost) empty a DynamoDB table with boto? (As we can do in SQL with a TRUNCATE statement.)
boto.dynamodb2.table.delete() or boto.dynamodb2.layer1.DynamoDBConnection.delete_table() deletes the entire table, while boto.dynamodb2.table.delete_item() or boto.dynamodb2.table.BatchTable.delete_item() only deletes the specified items.
While I agree with Johnny Wu that dropping the table and recreating it is much more efficient, there may be cases, such as when many GSIs or trigger events are associated with a table, where you don't want to have to re-associate those. The script below should work: it recursively scans the table and uses the batch function to delete all items in the table. For massively large tables, though, this may not work, as it requires all items in the table to be loaded into your computer's memory.
import boto3

dynamo = boto3.resource('dynamodb')


def truncateTable(tableName):
    table = dynamo.Table(tableName)

    # get the table keys
    tableKeyNames = [key.get("AttributeName") for key in table.key_schema]

    """
    NOTE: there are reserved attributes for key names, please see https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ReservedWords.html
    if a hash or range key is in the reserved word list, you will need to use the ExpressionAttributeNames parameter
    described at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Table.scan
    """

    # Only retrieve the keys for each item in the table (minimize data transfer)
    ProjectionExpression = ", ".join(tableKeyNames)

    response = table.scan(ProjectionExpression=ProjectionExpression)
    data = response.get('Items')

    while 'LastEvaluatedKey' in response:
        response = table.scan(
            ProjectionExpression=ProjectionExpression,
            ExclusiveStartKey=response['LastEvaluatedKey'])
        data.extend(response['Items'])

    with table.batch_writer() as batch:
        for each in data:
            batch.delete_item(
                Key={key: each[key] for key in tableKeyNames}
            )

truncateTable("YOUR_TABLE_NAME")
As Johnny Wu mentioned, deleting a table and re-creating it is more efficient than deleting individual items. You should make sure your code doesn't try to create a new table before it is completely deleted.
import boto3

# assumed setup: the snippet uses both the low-level client and the resource API
client = boto3.client('dynamodb')
dynamodb = boto3.resource('dynamodb')


def deleteTable(table_name):
    print('deleting table')
    return client.delete_table(TableName=table_name)


def createTable(table_name):
    waiter = client.get_waiter('table_not_exists')
    waiter.wait(TableName=table_name)
    print('creating table')
    table = dynamodb.create_table(
        TableName=table_name,
        KeySchema=[
            {
                'AttributeName': 'YOURATTRIBUTENAME',
                'KeyType': 'HASH'
            }
        ],
        AttributeDefinitions=[
            {
                'AttributeName': 'YOURATTRIBUTENAME',
                'AttributeType': 'S'
            }
        ],
        ProvisionedThroughput={
            'ReadCapacityUnits': 1,
            'WriteCapacityUnits': 1
        },
        StreamSpecification={
            'StreamEnabled': False
        }
    )


def emptyTable(table_name):
    deleteTable(table_name)
    createTable(table_name)
Deleting a table is much more efficient than deleting items one-by-one. If you are able to control your truncation points, then you can do something similar to rotating tables as suggested in the docs for time series data.
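To illustrate the rotation idea, here is a minimal sketch; the events base name and the monthly period are assumptions, not something from the question. Each period gets its own table, and an expired period is dropped as a whole table rather than item by item.

import datetime
import boto3

dynamodb = boto3.resource('dynamodb')

def monthly_table_name(base_name):
    # e.g. "events_2024_05" -- one table per month (hypothetical naming scheme)
    return "{}_{}".format(base_name, datetime.date.today().strftime("%Y_%m"))

# write only to the current period's table...
table = dynamodb.Table(monthly_table_name("events"))
# ...and drop an expired period in one call (hypothetical table name) instead of item-by-item deletes
dynamodb.Table("events_2024_04").delete()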
This builds on the answer given by Persistent Plants. If the table already exists, you can extract the table definitions and use them to recreate the table.
import boto3

dynamodb = boto3.resource('dynamodb', region_name='us-east-2')


def delete_table_ddb(table_name):
    table = dynamodb.Table(table_name)
    return table.delete()


def create_table_ddb(table_name, key_schema, attribute_definitions,
                     provisioned_throughput, stream_enabled, billing_mode):
    settings = dict(
        TableName=table_name,
        KeySchema=key_schema,
        AttributeDefinitions=attribute_definitions,
        StreamSpecification={'StreamEnabled': stream_enabled},
        BillingMode=billing_mode
    )
    if billing_mode == 'PROVISIONED':
        settings['ProvisionedThroughput'] = provisioned_throughput
    return dynamodb.create_table(**settings)


def truncate_table_ddb(table_name):
    table = dynamodb.Table(table_name)
    key_schema = table.key_schema
    attribute_definitions = table.attribute_definitions
    if table.billing_mode_summary:
        billing_mode = 'PAY_PER_REQUEST'
    else:
        billing_mode = 'PROVISIONED'
    if table.stream_specification:
        stream_enabled = True
    else:
        stream_enabled = False
    capacity = ['ReadCapacityUnits', 'WriteCapacityUnits']
    provisioned_throughput = {k: v for k, v in table.provisioned_throughput.items() if k in capacity}
    delete_table_ddb(table_name)
    table.wait_until_not_exists()
    return create_table_ddb(
        table_name,
        key_schema=key_schema,
        attribute_definitions=attribute_definitions,
        provisioned_throughput=provisioned_throughput,
        stream_enabled=stream_enabled,
        billing_mode=billing_mode
    )
Now call the function:
table_name = 'test_ddb'
truncate_table_ddb(table_name)
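One caveat as a hedged follow-up: create_table returns before the new table is ACTIVE, so if you plan to write right after truncating, you may want to block on the returned resource first. A minimal sketch reusing the function above:

new_table = truncate_table_ddb(table_name)
new_table.wait_until_exists()  # blocks until the recreated table is ACTIVE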
Related
I have a glue script to create new partitions using create_partition(). The glue script runs successfully, and I can see the partitions in the Athena console when using SHOW PARTITIONS. For the glue script create_partitions, I referred to this sample code: https://medium.com/#bv_subhash/demystifying-the-ways-of-creating-partitions-in-glue-catalog-on-partitioned-s3-data-for-faster-e25671e65574
When I try to run an Athena query for a given partition which was newly added, I am getting no results.
Is it that I need to trigger the MSCK command, even if I add the partitions using create_partitions? Appreciate any suggestions, please.
I have found the solution myself and wanted to share it with the SO community, so it would be useful for someone. The following code, when run as a Glue job, creates partitions that can also be queried in Athena for the new partition columns. Please change/add the parameter values (db name, table name, partition columns) as needed.
import boto3
import urllib.parse
import os
import copy
import sys

# Configure database / table name and emp_id, file_id from workflow params?
DATABASE_NAME = 'my_db'
TABLE_NAME = 'enter_table_name'
emp_id_tmp = ''
file_id_tmp = ''

# Initialise the Glue client using Boto 3
glue_client = boto3.client('glue')


# get current table schema for the given database name & table name
def get_current_schema(database_name, table_name):
    try:
        response = glue_client.get_table(
            DatabaseName=database_name,
            Name=table_name
        )
    except Exception as error:
        print("Exception while fetching table info")
        sys.exit(-1)

    # Parsing table info required to create partitions from table
    table_data = {}
    table_data['input_format'] = response['Table']['StorageDescriptor']['InputFormat']
    table_data['output_format'] = response['Table']['StorageDescriptor']['OutputFormat']
    table_data['table_location'] = response['Table']['StorageDescriptor']['Location']
    table_data['serde_info'] = response['Table']['StorageDescriptor']['SerdeInfo']
    table_data['partition_keys'] = response['Table']['PartitionKeys']

    return table_data


# prepare partition input list using table_data
def generate_partition_input_list(table_data):
    input_list = []  # Initializing empty list
    part_location = "{}/emp_id={}/file_id={}/".format(table_data['table_location'], emp_id_tmp, file_id_tmp)

    input_dict = {
        'Values': [
            emp_id_tmp, file_id_tmp
        ],
        'StorageDescriptor': {
            'Location': part_location,
            'InputFormat': table_data['input_format'],
            'OutputFormat': table_data['output_format'],
            'SerdeInfo': table_data['serde_info']
        }
    }
    input_list.append(input_dict.copy())
    return input_list


# create partition dynamically using the partition input list
table_data = get_current_schema(DATABASE_NAME, TABLE_NAME)
input_list = generate_partition_input_list(table_data)

try:
    create_partition_response = glue_client.batch_create_partition(
        DatabaseName=DATABASE_NAME,
        TableName=TABLE_NAME,
        PartitionInputList=input_list
    )
    print('Glue partition created successfully.')
    print(create_partition_response)
except Exception as e:
    # Handle exception as per your business requirements
    print(e)
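As a sanity check independent of Athena, you could list the partitions back from the Glue catalog after the batch_create_partition call; a small sketch using the same client as above:

# List the partitions now registered in the Glue catalog for this table
partitions = glue_client.get_partitions(
    DatabaseName=DATABASE_NAME,
    TableName=TABLE_NAME
)
for partition in partitions.get('Partitions', []):
    print(partition['Values'], partition['StorageDescriptor']['Location'])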
I have a lambda function that needs to retrieve an item from DynamoDB and update the counter of that item. But..
The DynamoDB table is structured as:
id: int
options: map
    some_option: 0
    some_other_option: 0
I need to first retrieve the item of the table that has a certain id and a certain option listed as a key in the options.
Then I want to increment that counter by some value.
Here is what I have so far:
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('options')

response = None
try:
    response = table.get_item(Key={'id': id})
except ClientError as e:
    print(e.response['Error']['Message'])

option = response.get('Item', None)
if option:
    option['options'][some_option] = int(option['options'][some_option]) + some_value
    # how to update item in DynamoDB now?
My issue is how to update the record now, and more importantly, will such a solution cause data races? Could 2 simultaneous lambda calls that try to update the same item at the same option cause data races? If so, what's the way to solve this?
Any pointers/help is appreciated.
Ok, I found the answer:
All I need is:
from decimal import Decimal  # Decimal is used for the numeric value below

response = table.update_item(
    Key={
        'id': my_id,
    },
    UpdateExpression='SET options.#s = options.#s + :val',
    ExpressionAttributeNames={
        "#s": my_option
    },
    ExpressionAttributeValues={
        ':val': Decimal(some_value)
    },
    ReturnValues="UPDATED_NEW"
)
This is inspired by Step 3.4: Increment an Atomic Counter, which provides an atomic approach to incrementing values. According to the documentation:
DynamoDB supports atomic counters, which use the update_item method to
increment or decrement the value of an existing attribute without
interfering with other write requests. (All write requests are applied
in the order in which they are received.)
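If the option key might not be present on the item yet, the same atomic update can be written with if_not_exists so the first increment starts from zero; a hedged variant of the call above (it still assumes the top-level options map exists on the item):

response = table.update_item(
    Key={'id': my_id},
    # start from 0 when options.#s is missing, then add :val atomically
    UpdateExpression='SET options.#s = if_not_exists(options.#s, :zero) + :val',
    ExpressionAttributeNames={'#s': my_option},
    ExpressionAttributeValues={':val': Decimal(some_value), ':zero': Decimal(0)},
    ReturnValues='UPDATED_NEW'
)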
Whenever I update my insert_one call with a new field, I always have to delete the old posts in the collection. I know there are manual methods of updating such fields using update_many, but I know that's inefficient.
For example:
posts.insert_one({
    "id": random.randint(1, 10000),
    "value1": "value1",
    "value2": "value2"
})
I use the following code to check if the document exists or not. How would this work for a field?
if posts.find({'id': 12312}).count() > 0:
I know I can easily overwrite the previous data but I know people won't enjoy having their data wiped every other month.
Is there a way to add the field to a document in Python?
How would this work for a field?
You can use $exists to check whether a field exists in a doc.
In your case, you can combine this with find:
find({ 'id': 1, "fieldToCheck": {$exists: true}})
It will return the doc with id = 1 if fieldToCheck is present in that doc.
You can skip id = 1; in that case, it will return all docs where fieldToCheck exists.
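In pymongo the same check might look like this (a sketch reusing the posts collection from the question; count_documents stands in for the deprecated cursor count):

# Find docs with id 1 that already have the field
if posts.count_documents({'id': 1, 'fieldToCheck': {'$exists': True}}) > 0:
    print('fieldToCheck is already present on that document')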
Is there a way to add the field to a document in Python?
You could use update with the new field; it will update the value if the field is present, else it will add it.
update({"_id": 1}, {"$set": {field: "x"}})
If field is present, it will be set to x; else it will be added as field: x.
Beware of update options like multi and upsert.
Yes, you can use the update command in the mongoDB shell to do that. Check here.
This is the command to use...
db.collection.update({},{$set : {"newfield":1}},false,true)
The above will work in the mongoDB shell. It will add newfield in all the documents, if it is not present.
If you want to use Python, use pymongo.
For Python, the following command should work:
db.collection.update({},{"$set" : {"newfield":1}},False, True)
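On recent pymongo versions update() is deprecated; a hedged equivalent uses update_many, here only touching documents that are still missing the field:

db.collection.update_many(
    {"newfield": {"$exists": False}},  # only documents without the field
    {"$set": {"newfield": 1}}
)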
Thanks to john's answer, I have made an entire solution that automatically updates documents without the need to run a separate task, meaning you don't update inactive documents.
import datetime
import pymongo

database = pymongo.MongoClient("mongodb://localhost:27017")  # Mongodb connection
db = database.maindb  # Database
posts = db.items  # Collection within a database


# A schema equivalent function that returns the object
def user_details(name, dob):
    return {
        "username": name,  # a username/id
        "dob": dob,  # some data
        "level": 0,  # some other data
        "latest_update": datetime.datetime.fromtimestamp(1615640176)
        # Must be kept to ensure you aren't doing it that often
    }


# The first schema changed for example after adding a new feature
def user_details2(name, dob, cake):
    return {
        "username": name,  # a username/id
        "dob": dob,  # Some data
        "level": 0,  # Some other data
        "cake": cake,  # Some new data that isn't in the document
        "latest_update": datetime.datetime.utcnow()  # Must be kept to ensure you aren't doing it that often
    }


def check_if_update(find, main_document, collection):
    # parameters: What you find a document with, the schema dictionary, then the mongodb collection
    if collection.count_documents(find) > 0:  # How many documents match, only proceed if it exists
        fields = {}  # Init a dictionary
        for x in collection.find(find):  # You only want one for this to work
            fields = x
        if "latest_update" in fields:  # Just in case it doesn't exist yet
            last_time = fields["latest_update"]  # Get the time that it was last updated
            time_diff = datetime.datetime.utcnow() - last_time  # Get the time difference between the utc time now and the time it was last updated
            if time_diff.total_seconds() < 3600:  # If the total seconds of the difference is smaller than an hour
                print("return")
                return
        db_schema = main_document  # Better naming
        db_schema["_id"] = 0  # Adds the _id schema_key into the dictionary
        if db_schema.keys() != fields:
            print("in")
            for schema_key, schema_value in db_schema.items():
                if schema_key not in fields.keys():  # Main key for example if cake is added and doesn't exist in db fetched fields
                    collection.update_one(find, {"$set": {schema_key: schema_value}})
                else:  # Everything exists and you want to check for if a dictionary within that dictionary is changed
                    try:
                        sub_dict = dict(schema_value)  # Make the value of it a dictionary
                        # It exists in the schema dictionary but not in the db fetched document
                        for key2, value2 in sub_dict.items():
                            if key2 not in fields[schema_key].keys():
                                new_value = schema_value
                                new_value[key2] = value2  # Adding the key and value from the schema dictionary that was added
                                collection.update_one(find, {"$set": {schema_key: new_value}})
                        # It exists in the db fetched document but not in the schema dictionary
                        for key2, value2 in fields[schema_key].items():
                            if key2 not in sub_dict.keys():
                                new_dict = {}  # Get all values, filter them so that only the schema existent ones are passed back
                                for item in sub_dict:
                                    if item != key2:
                                        new_dict[item] = sub_dict.get(item)
                                collection.update_one(find, {"$set": {schema_key: new_dict}})
                    except:  # Wasn't a dict
                        pass
            # You removed a value from the schema dictionary and want to update it in the db
            for key2, value2 in fields.items():
                if key2 not in db_schema:
                    collection.update_one(find, {"$unset": {key2: 1}})
    else:
        collection.insert_one(main_document)  # Insert it because it doesn't exist yet


print("start")
print(posts.find_one({"username": "john"}))
check_if_update({"username": "john"}, user_details("john", "13/03/2021"), posts)
print("inserted")
print(posts.find_one({"username": "john"}))
check_if_update({"username": "john"}, user_details2("john", "13/03/2021", "Lemon drizzle"), posts)
print("Results:")
print(posts.find_one({"username": "john"}))
It is available as a gist
I am using Lambda to detect faces and would like to send the response to a DynamoDB table.
This is the code I am using:
rekognition = boto3.client('rekognition', region_name='us-east-1')
dynamodb = boto3.client('dynamodb', region_name='us-east-1')


# --------------- Helper Functions to call Rekognition APIs ------------------
def detect_faces(bucket, key):
    response = rekognition.detect_faces(Image={"S3Object": {"Bucket": bucket,
                                                            "Name": key}}, Attributes=['ALL'])

    TableName = 'table_test'
    for face in response['FaceDetails']:
        table_response = dynamodb.put_item(TableName=TableName, Item='{0} - {1}%')
    return response
My problem is in this line:
for face in response['FaceDetails']:
    table_response = dynamodb.put_item(TableName=TableName, Item= {'key:{'S':'value'}, {'S':'Value')
I am able to see the result in the console.
I don't want to add specific item(s) to the table; I need the whole response to be transferred to the table.
To do this:
1. What to add as a key and partition key in the table?
2. How to transfer the whole response to the table?
I have been stuck on this for three days now and can't figure it out. Please help!
******************* EDIT *******************
I tried this code:
rekognition = boto3.client('rekognition', region_name='us-east-1')
# --------------- Helper Functions to call Rekognition APIs ------------------
def detect_faces(bucket, key):
response = rekognition.detect_faces(Image={"S3Object": {"Bucket": bucket,
"Name": key}}, Attributes=['ALL'])
TableName = 'table_test'
for face in response['FaceDetails']:
face_id = str(uuid.uuid4())
Age = face["AgeRange"]
Gender = face["Gender"]
print('Generating new DynamoDB record, with ID: ' + face_id)
print('Input Age: ' + Age)
print('Input Gender: ' + Gender)
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['test_table'])
table.put_item(
Item={
'id' : face_id,
'Age' : Age,
'Gender' : Gender
}
)
return response
It gave me two errors:
1. Error processing object xxx.jpg
2. cannot concatenate 'str' and 'dict' objects
Can you please help!
When you create a Table in DynamoDB, you must specify, at least, a Partition Key. Go to your DynamoDB table and grab your partition key. Once you have it, you can create a new object that contains this partition key with some value on it and the object you want to pass itself. The partition key is always a MUST upon creating a new Item in a DynamoDB table.
Your JSON object should look like this:
{
    "myPartitionKey": "myValue",
    "attr1": "val1",
    "attr2": "val2"
}
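As a hedged boto3 sketch of the same idea (the attribute names simply mirror the JSON above, and table_test is the table name from the question; swap in your table's real partition key name and value):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('table_test')

table.put_item(
    Item={
        'myPartitionKey': 'myValue',  # the table's partition key -- always required
        'attr1': 'val1',
        'attr2': 'val2'
    }
)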
EDIT: After the OP updated his question, here's some new information:
For problem 1)
Are you sure the image you are trying to process is a valid one? If it is a corrupted file Rekognition will fail and throw that error.
For problem 2)
You cannot concatenate a String with a Dictionary in Python. Your Age and Gender variables are dictionaries, not Strings. So you need to access an inner attribute within these dictionaries. They have a 'Value' attribute. I am not a Python developer, but you need to access the Value attribute inside your Gender object. The Age object, however, has 'Low' and 'High' as attributes.
You can see the complete list of attributes in the docs
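For instance, the loop from the edited question could build printable strings like this (a sketch based on the attribute names described above, not a drop-in fix for the rest of the function):

for face in response['FaceDetails']:
    age_range = face['AgeRange']        # a dict with 'Low' and 'High'
    gender = face['Gender']['Value']    # the string value inside the Gender dict
    print('Input Age: {}-{}'.format(age_range['Low'], age_range['High']))
    print('Input Gender: ' + gender)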
Hope this helps!
In this SO question I had learnt that I cannot delete a Cosmos DB document using SQL.
Using Python, I believe I need the DeleteDocument() method. This is how I'm getting the document ID's that are required (I believe) to then call the DeleteDocument() method.
# set up the client
client = document_client.DocumentClient()
# use a SQL based query to get a bunch of documents
query = { 'query': 'SELECT * FROM server s' }
result_iterable = client.QueryDocuments('dbs/DB/colls/coll', query, options)
results = list(result_iterable);
for x in range(0, len(results)):
    docID = results[x]['id']
Now, at this stage I want to call DeleteDocument().
Its inputs are document_link and options.
I can define document_link as something like
document_link = 'dbs/DB/colls/coll/docs/'+docID
And I can successfully call ReadAttachments(), for example, which has the same inputs as DeleteDocument().
When I call DeleteDocument(), however, I get an error...
The partition key supplied in x-ms-partitionkey header has fewer
components than defined in the the collection
...and now I'm totally lost
UPDATE
Following on from Jay's help, I believe I'm missing the partitionKey element in the options.
In this example, I've created a testing database; it looks like this:
So I think my partition key is /testPART
When I include the partitionKey in the options however, no results are returned, (and so print len(results) outputs 0).
Removing partitionKey means that results are returned, but the delete attempt fails as before.
# Query them in SQL
query = { 'query': 'SELECT * FROM c' }
options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 2
options['partitionKey'] = '/testPART'
result_iterable = client.QueryDocuments('dbs/testDB/colls/testCOLL', query, options)
results = list(result_iterable)
# should be > 0
print len(results)
for x in range(0, len(results)):
    docID = results[x]['id']
    print docID
    client.DeleteDocument('dbs/testDB/colls/testCOLL/docs/'+docID, options=options)
    print 'deleted', docID
According to your description, I tried to use the pydocumentdb module to delete a document in my Azure Document DB, and it works for me.
Here is my code:
import pydocumentdb;
import pydocumentdb.document_client as document_client
config = {
    'ENDPOINT': 'Your url',
    'MASTERKEY': 'Your master key',
    'DOCUMENTDB_DATABASE': 'familydb',
    'DOCUMENTDB_COLLECTION': 'familycoll'
};
# Initialize the Python DocumentDB client
client = document_client.DocumentClient(config['ENDPOINT'], {'masterKey': config['MASTERKEY']})
# use a SQL based query to get a bunch of documents
query = { 'query': 'SELECT * FROM server s' }
options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 2
result_iterable = client.QueryDocuments('dbs/familydb/colls/familycoll', query, options)
results = list(result_iterable);
print(results)
client.DeleteDocument('dbs/familydb/colls/familycoll/docs/id1',options)
print 'delete success'
Console Result:
[{u'_self': u'dbs/hitPAA==/colls/hitPAL3OLgA=/docs/hitPAL3OLgABAAAAAAAAAA==/', u'myJsonArray': [{u'subId': u'sub1', u'val': u'value1'}, {u'subId': u'sub2', u'val': u'value2'}], u'_ts': 1507687788, u'_rid': u'hitPAL3OLgABAAAAAAAAAA==', u'_attachments': u'attachments/', u'_etag': u'"00002100-0000-0000-0000-59dd7d6c0000"', u'id': u'id1'}, {u'_self': u'dbs/hitPAA==/colls/hitPAL3OLgA=/docs/hitPAL3OLgACAAAAAAAAAA==/', u'myJsonArray': [{u'subId': u'sub3', u'val': u'value3'}, {u'subId': u'sub4', u'val': u'value4'}], u'_ts': 1507687809, u'_rid': u'hitPAL3OLgACAAAAAAAAAA==', u'_attachments': u'attachments/', u'_etag': u'"00002200-0000-0000-0000-59dd7d810000"', u'id': u'id2'}]
delete success
Please notice that you need to set the enableCrossPartitionQuery property to True in options if your documents are cross-partitioned.
Must be set to true for any query that requires to be executed across
more than one partition. This is an explicit flag to enable you to
make conscious performance tradeoffs during development time.
You can find the above description here.
Update Answer:
I think you misunderstand the meaning of the partitionkey property in the options.
For example, my container is created like this:
My documents are as below:
{
    "id": "1",
    "name": "jay"
}
{
    "id": "2",
    "name": "jay2"
}
My partitionkey is 'name', so here I have two partitions: 'jay' and 'jay2'.
So, here you should set the partitionkey property to 'jay' or 'jay2', not 'name'.
Please modify your code as below:
options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 2
options['partitionKey'] = 'jay'  # please change here in your code
result_iterable = client.QueryDocuments('dbs/db/colls/testcoll', query, options)
results = list(result_iterable);
print(results)
Hope it helps you.
Using the azure.cosmos library:
Install and import the azure.cosmos package:
from azure.cosmos import exceptions, CosmosClient, PartitionKey
Define the delete-items function, in this case using the partition key in the query:
def deleteItems(deviceid):
    client = CosmosClient(config.cosmos.endpoint, config.cosmos.primarykey)

    # Create a database if not exists
    database = client.create_database_if_not_exists(id='azure-cosmos-db-name')

    # Create a container
    # Using a good partition key improves the performance of database operations.
    container = database.create_container_if_not_exists(id='container-name', partition_key=PartitionKey(path='/your-partition-path'), offer_throughput=400)

    # fetch items
    query = f"SELECT * FROM c WHERE c.device.deviceid IN ('{deviceid}')"
    items = list(container.query_items(query=query, enable_cross_partition_query=False))

    for item in items:
        container.delete_item(item, 'partition-key')
Usage:
deviceid = 10
deleteItems(deviceid)
github full example here: https://github.com/eladtpro/python-iothub-cosmos
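One detail worth noting: the second argument to delete_item is the partition key value of that particular item, not the key path, so the loop above would typically pass the field the container is partitioned on. A hedged variant (the device/deviceid names mirror the query above and are assumptions about your schema):

for item in items:
    # pass each item's own partition key value when deleting it
    container.delete_item(item, partition_key=item['device']['deviceid'])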