Azure Cosmos DB, delete IDs (that definitely exist) - Python

This is probably a very simple and silly mistake, but I am unsure of how this is failing. I have followed the https://github.com/Azure/azure-cosmos-python#insert-data tutorial. How can I query a database, then use those IDs to delete the documents, only for the delete to report that they don't exist?
Can anyone help before the weekend sets in? Thanks, struggling to see how this fails!
Error:
azure.cosmos.errors.HTTPFailure: Status code: 404
{"code":"NotFound","message":"Entity with the specified id does not exist in the system., \r\nRequestStartTime: 2020-02-07T17:08:48.1413131Z,
RequestEndTime: 2020-02-07T17:08:48.1413131Z,
Number of regions attempted:1\r\nResponseTime: 2020-02-07T17:08:48.1413131Z,
StoreResult: StorePhysicalAddress: rntbd://cdb-ms-prod-northeurope1-fd24.documents.azure.com:14363/apps/dedf1644-3129-4bd1-9eaa-8efc450341c4/services/956a2aa9-0cad-451f-a172-3f3c7d8353ef/partitions/bac75b40-384a-4019-a973-d2e85ada9c87/replicas/132248272332111641p/,
LSN: 79, GlobalCommittedLsn: 79, PartitionKeyRangeId: 0, IsValid: True, StatusCode: 404,
SubStatusCode: 0, RequestCharge: 1.24, ItemLSN: -1, SessionToken: 0#79#13=-1,
UsingLocalLSN: False, TransportException: null, ResourceType: Document,
OperationType: Delete\r\n, Microsoft.Azure.Documents.Common/2.9.2"}
This is my code...
def get_images_to_check(database_id, container_id):
    images = client.QueryItems("dbs/" + database_id + "/colls/" + container_id,
                               {
                                   'query': 'SELECT * FROM c r WHERE r.manually_reviewed=@manrev',
                                   'parameters': [
                                       {'name': '@manrev', 'value': False}
                                   ]
                               },
                               {'enableCrossPartitionQuery': True})
    return list(images)

def delete_data(database_id, container_id, data):
    for item in data:
        print(item['id'])
        client.DeleteItem("dbs/" + database_id + "/colls/" + container_id + "/docs/" + item['id'], {'partitionKey': 'class'})

database_id = 'ModelData'
container_id = 'ImagePredictions'
container_id = 'IncorrectPredictions'

images_to_check = get_images_to_check(database_id, container_id)
delete_data(database_id, container_id, images_to_check)

When specifying the partition key in the call to client.DeleteItem(), you chose:
{'partitionKey': 'class'}
The extra parameter to DeleteItem() should specify the value of your partition key, not its name or path.
Per your comments, /class is your partition key. So I believe that if you change your parameter to something like:
{'partitionKey': 'value-of-partition-key'}
this should hopefully work.
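For illustration, a minimal sketch of the corrected loop, assuming the partition key path is /class and each queried document carries its own class property (the names mirror the question's code; this is a guess at the fix, not a verified one):

def delete_data(database_id, container_id, data):
    collection_link = "dbs/" + database_id + "/colls/" + container_id
    for item in data:
        # pass the item's partition key VALUE (its 'class' field),
        # not the partition key path or the literal string 'class'
        client.DeleteItem(collection_link + "/docs/" + item['id'],
                          {'partitionKey': item['class']})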

More than likely the issue is a mismatched id and PartitionKey value for the document. A document is uniquely identified in a collection by the combination of its id and PartitionKey value.
Thus, in order to delete a document, you need to specify correct values for both the document's id and its PartitionKey value.

Related

Accessing specific UID in array

I have the following problem: I'm trying to pull the specific entry in the "warnings" array that has the given UID, but I can't figure out why it's not working.
The output (everything prints out successfully): https://i.imgur.com/ZslJ0rV.png
My MongoDB structure: https://i.imgur.com/3bRegAD.png
import pymongo

client = pymongo.MongoClient("")
database = client["LateNight"]
ModlogsCollection = database["modlogs"]

theUID = "63TF-lYv0-72m7-9f4I"
theGuild = 1063516188988153896

all_mod_docs = ModlogsCollection.find({"_id": str(theGuild)})
all_uids = []

for doc in all_mod_docs:
    doc_keys = [key for key in doc.keys() if key != "_id"]
    for key in doc_keys:
        sub_doc = doc[key]
        if warnings := sub_doc.get("warnings"):
            for warning in warnings:
                if warning["UID"] == theUID:
                    print(warning)
                    print("Warning")
                    result = ModlogsCollection.update_one(
                        {"_id": str(theGuild)},
                        {"$pull": {
                            "warnings": {"UID": theUID}
                        }}
                    )
                    print(result)
                    print(result.modified_count)
As you yourself said, you are trying to "pull the specific field in the warnings array that has the given UID". Before reading the UID value you must first specify index 0. After that you get a dictionary that will have the keys:
moderator, reason, time and UID
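A minimal sketch of what this answer describes, assuming each warnings entry wraps the actual warning document in a one-element list (the screenshots are not reproduced here, so this structure is an assumption):

for entry in sub_doc.get("warnings", []):
    warning = entry[0]  # index 0 first, as described above
    if warning["UID"] == theUID:
        print(warning["moderator"], warning["reason"], warning["time"], warning["UID"])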

AWS IAM Tagging, not tagging as expected using Lambda and Python

I created this script to apply tags if certain conditions are met, but it will not apply the tags if I just reference them; it only applies the tags if I type them in manually. This portion of the code works when it's typed in manually:
tag_user(user['UserName'], 'key', 'value')
Yes, I understand why it works, but if that works, why wouldn't this work as well?
tag_user(user['UserName'], testtag['Key'], testtag['Value'])
Is that not the same thing? I've tried numerous methods, as you can see in the tag_user section, but none of them work except the first one, which is not convenient. I want to be able to reference "testtag", which is a list of key and value. I don't even think I need the tag_user function at the start, since boto3.client('iam') already provides tag_user; I would just reference iam.tag_user(), but again I can't get that to work. I'm not sure what I'm doing wrong here; any help would be much appreciated. Thank you.
import boto3

iam = boto3.resource('iam')
iam_client = boto3.client('iam')
response = iam_client.list_users()

def tag_user(user, key, value):
    client = boto3.client('iam')
    try:
        response = client.tag_user(
            UserName=user,
            Tags=[
                {
                    'Key': key,
                    'Value': value
                },
            ]
        )
    except:
        response = 'We got an error'
    return response

def lambda_handler(event, context):
    return_value = {}  # creates empty dictionary
    everything_dict = {}  # dictionary of instances, which contains a dictionary of categories
    # (missing tag keys, missing tag values, incorrect tag keys, etc), which contains a list with the details
    return_value['missingtagkeys'] = []  # within the return values dictionary, create a missing tag key list
    return_value['missingtagvalues'] = []  # within the return values dictionary, create a missing tag value list
    return_value['incorrecttagkeys'] = []  # within the return values dictionary, create an incorrect tag key list
    return_value['incorrecttagvalues'] = []  # within the return values dictionary, create an incorrect tag value list
    return_value['unknowntagvalues'] = []  # within the return values dictionary, create an unknown tag value list

    testtag = [{
        "Key": 'test',
        "Value": 'testvalue'
    }]

    for user in response['Users']:
        tags = iam_client.list_user_tags(UserName=user['UserName'])
        tags = {x['Key']: x['Value'] for x in tags["Tags"]}
        print(tags)

        # iam user properties
        ids = user['UserName']
        username = user['UserName']
        iam_user_id = user['UserId']
        iam_user_arn = user['Arn']

        try:
            # if instance_ids not in everything_dict:
            if username not in everything_dict:
                # ids = user['UserName']
                everything_dict[username] = {
                    'tags': [],
                    'missingtagkeys': [],
                    'missingtagvalues': [],
                    'incorrecttagkeys': [],
                    'incorrecttagvalues': [],
                    'unknowntagvalues': [],
                }
            everything_dict[username]['tags'].append(tags)
        except:
            pass

        try:
            if tags['contact'] in ['me', 'you']:
                print(username + " (" + user['UserId'] + ")" + " has an approved contact tag value of " + tags['contact'] + ".")
                tagissue = (username + " (" + user['UserId'] + ")" + " (" + user['Arn'] + ")" + " has an approved contact tag value of " + tags['contact'] + ".")
                tag_user(user['UserName'], 'key', 'value')  # hard coded tag key and values, works
                tag_user(user['UserName'], str(testtag['Key']), str(testtag['Value']))  # does not work, why not?
                tag_user(user['UserName'], testtag.get('Key'), testtag.get('Value'))  # does not work, why not?
                tag_user([user['UserName']], testtag)  # does not work, why not?
                iam.tag_user(username, Tags=testtag)  # does not work, why not?
                # Store values
                return_value['incorrecttagvalues'].append(tagissue)
                everything_dict[username]['incorrecttagvalues'].append(tagissue)
        except:
            pass

    return everything_dict
Your "testtag" is actually a list of tag key-value pairs, so you need to iterate through this list.
Renaming testtag to test_tags, with example of second k-v pair:
test_tags = [
    {
        "Key": 'test',
        "Value": 'testvalue'
    },
    {
        "Key": 'test2',
        "Value": 'testvalue2'
    },
]
2a. Utilizing the custom function in the Lambda Function body:
for test_tag in test_tags:
    tag_user(user['UserName'], test_tag['Key'], test_tag['Value'])
2b. Alternatively, as you guessed at, you could just call IAM.Client.tag_user directly and remove the extra custom function.
This works because you already have a Sequence of TagTypeDef to pass into the Tags= keyword argument.
iam_client.tag_user(UserName=user['UserName'], Tags=test_tags)
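To illustrate why the original attempts fail (a quick sketch, not part of the answer above): testtag is a list, so indexing it with a string raises a TypeError, which the surrounding bare except: pass silently swallows.

testtag = [{"Key": 'test', "Value": 'testvalue'}]

testtag[0]['Key']   # 'test' -- the dict must be selected from the list first
testtag['Key']      # TypeError: list indices must be integers or slices, not str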

Passing pandas dataframe to fastapi

I wish to create an API with which I can take a Pandas dataframe as input and store it in my DB.
I am able to do so with a CSV file. However, the problem with that is that my datatype information is lost (column datatypes like int, array, float and so on), which is important for what I am trying to do.
I have already read this: Passing a pandas dataframe to FastAPI for NLP ML
I cannot create a class like this:
class Data(BaseModel):
    # id: str
    project: str
    messages: str
The reason is that I don't have a fixed schema; the dataframe could be of any shape with varying data types. I have created a dynamic query that creates a table matching the incoming dataframe and inserts the data into it as well.
However, being new to FastAPI, I am not able to figure out an efficient way of sending this changing (dynamic) dataframe and storing it via the queries that I have created.
If the information is not sufficient, I can try to provide more examples.
Is there a way I can send the pandas dataframe from my Jupyter notebook itself?
Any guidance on this would be greatly appreciated.
@router.post("/send-df")
async def push_df_funct(
        target_name: Optional[str] = Form(...),
        join_key: str = Form(...),
        local_csv_file: UploadFile = File(None),
        db: Session = Depends(pg.get_db)
):
    """
    API to upload dataframe to database
    """
    return upload_dataframe(db, featureset_name, local_csv_file, join_key)
def registration_cassandra(self, feature_registation_dict):
    '''
    Table creation in cassandra as per the given feature registration JSON
    Takes:
        1. feature_registration_dict: Feature registration JSON
    Returns:
        - Response stating that the table has been created in cassandra
    '''
    logging.info(feature_registation_dict)
    target_table_name = feature_registation_dict.get('featureset_name')
    join_key = feature_registation_dict.get('join_key')
    metadata_list = feature_registation_dict.get('metadata_list')
    table_name_delimiter = "__"
    logging.info(metadata_list)
    column_names = [sub['name'] for sub in metadata_list]
    data_types = [DataType.to_cass_datatype(eval(sub['data_type']).value) for sub in metadata_list]
    logging.info(f"Column names: {column_names}")
    logging.info(f"Data types: {data_types}")
    ls = list(zip(column_names, data_types))
    target_table_name = target_table_name + table_name_delimiter + join_key
    base_query = f"CREATE TABLE {self.keyspace}.{target_table_name} ("
    # CREATE TABLE images_by_month5(tid object PRIMARY KEY , cc_num object,amount object,fraud_label object,activity_time object,month object);
    # create_query_new = "CREATE TABLE vpinference_dev.images_by_month4 (month int,activity_time timestamp,amount double,cc_num varint,fraud_label varint,
    #     tid text,PRIMARY KEY (month, activity_time, tid)) WITH CLUSTERING ORDER BY (activity_time DESC, tid ASC)"
    # CREATE TABLE group_join_dates ( groupname text, joined timeuuid, username text, email text, age int, PRIMARY KEY (groupname, joined) )
    flag = True
    for name, data_type in ls:
        base_query += " " + name
        base_query += " " + data_type
        # if flag:
        #     base_query += " PRIMARY KEY "
        #     flag = False
        base_query += ','
    create_query = base_query.strip(',').rstrip(' ') + ', month varchar, activity_time timestamp,' + ' PRIMARY KEY (' + f'month, activity_time, {join_key}) )' + f' WITH CLUSTERING ORDER BY (activity_time DESC, {join_key} ASC' + ');'
    logging.info(f"Query to create table in cassandra: {create_query}")
    try:
        session = self.get_session()
        session.execute((create_query))
    except Exception as e:
        logging.exception(f"Some error occurred while doing the registration in cassandra. Details :: {str(e)}")
        raise AppException(f"Some error occurred while doing the registration in cassandra. Details :: {str(e)}")
    response = f"Table created successfully in cassandra at: vpinference_dev.{target_table_name}__{join_key};"
    return response
This is the dictionary that I am passing:
feature_registation_dict = {
    'featureSetName': 'data_type_testing_29',
    'teamName': 'Harsh',
    'frequency': 'DAILY',
    'joinKey': 'tid',
    'model_version': 'v1',
    'model_name': 'data type testing',
    'metadata_list': [
        {'name': 'tid',
         'data_type': 'text',
         'definition': 'Credit Card Number (Unique)'},
        {'name': 'cc_num',
         'data_type': 'bigint',
         'definition': 'Aggregated Metric: Average number of transactions for the card aggregated by past 10 minutes'},
        {'name': 'amount',
         'data_type': 'double',
         'definition': 'Aggregated Metric: Average transaction amount for the card aggregated by past 10 minutes'},
        {'name': 'datetime',
         'data_type': 'text',
         'definition': 'Required feature for event timestamp'}
    ]
}
Not sure I understood exactly what you need but I'll give it a try. To send any dataframe to fastapi, you could do something like:
# fastapi
@app.post("/receive_df")
def receive_df(df_in: str):
    df = pd.read_json(df_in)

# jupyter
payload = {"df_in": df.to_json()}
requests.post("http://localhost:8000/receive_df", data=payload)
Can't really test this right now; there are probably some mistakes in there, but the gist is just serializing the DataFrame to JSON and then deserializing it in the endpoint. If you need (JSON) validation, you can also use the pydantic.Json data type. If there is no fixed schema then you can't use BaseModel in any useful way. But just sending a plain JSON string should be all you need, if your data comes only from reliable sources (your Jupyter notebook).
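Since the original concern was losing column dtypes, one option worth mentioning (a sketch, not part of the answer above) is pandas' 'table' orient, which embeds a JSON Table Schema so dtypes survive the round trip:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

payload = df.to_json(orient="table")              # schema (dtypes) included in the JSON
restored = pd.read_json(payload, orient="table")  # dtypes rebuilt from the schema
print(restored.dtypes)                            # a: int64, b: float64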

Cosmos DB - Delete Document with Python

In this SO question I learnt that I cannot delete a Cosmos DB document using SQL.
Using Python, I believe I need the DeleteDocument() method. This is how I'm getting the document IDs that are (I believe) required to then call the DeleteDocument() method.
# set up the client
client = document_client.DocumentClient()

# use a SQL based query to get a bunch of documents
query = { 'query': 'SELECT * FROM server s' }

result_iterable = client.QueryDocuments('dbs/DB/colls/coll', query, options)
results = list(result_iterable)

for x in range(0, len(results)):
    docID = results[x]['id']
Now, at this stage I want to call DeleteDocument(), whose inputs are document_link and options.
I can define document_link as something like
document_link = 'dbs/DB/colls/coll/docs/' + docID
and successfully call ReadAttachments(), for example, which has the same inputs as DeleteDocument().
When I call DeleteDocument(), however, I get an error...
The partition key supplied in x-ms-partitionkey header has fewer
components than defined in the collection
...and now I'm totally lost
UPDATE
Following on from Jay's help, I believe I'm missing the partitionKey element in the options.
In this example, I've created a testing database; it looks like this:
So I think my partition key is /testPART.
When I include the partitionKey in the options, however, no results are returned (and so print len(results) outputs 0).
Removing partitionKey means that results are returned, but the delete attempt fails as before.
# Query them in SQL
query = { 'query': 'SELECT * FROM c' }

options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 2
options['partitionKey'] = '/testPART'

result_iterable = client.QueryDocuments('dbs/testDB/colls/testCOLL', query, options)
results = list(result_iterable)

# should be > 0
print len(results)

for x in range(0, len(results)):
    docID = results[x]['id']
    print docID
    client.DeleteDocument('dbs/testDB/colls/testCOLL/docs/' + docID, options=options)
    print 'deleted', docID
According to your description, I tried to use the pydocumentdb module to delete a document in my Azure DocumentDB and it works for me.
Here is my code:
import pydocumentdb
import pydocumentdb.document_client as document_client

config = {
    'ENDPOINT': 'Your url',
    'MASTERKEY': 'Your master key',
    'DOCUMENTDB_DATABASE': 'familydb',
    'DOCUMENTDB_COLLECTION': 'familycoll'
}

# Initialize the Python DocumentDB client
client = document_client.DocumentClient(config['ENDPOINT'], {'masterKey': config['MASTERKEY']})

# use a SQL based query to get a bunch of documents
query = { 'query': 'SELECT * FROM server s' }

options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 2

result_iterable = client.QueryDocuments('dbs/familydb/colls/familycoll', query, options)
results = list(result_iterable)
print(results)

client.DeleteDocument('dbs/familydb/colls/familycoll/docs/id1', options)
print 'delete success'
Console Result:
[{u'_self': u'dbs/hitPAA==/colls/hitPAL3OLgA=/docs/hitPAL3OLgABAAAAAAAAAA==/', u'myJsonArray': [{u'subId': u'sub1', u'val': u'value1'}, {u'subId': u'sub2', u'val': u'value2'}], u'_ts': 1507687788, u'_rid': u'hitPAL3OLgABAAAAAAAAAA==', u'_attachments': u'attachments/', u'_etag': u'"00002100-0000-0000-0000-59dd7d6c0000"', u'id': u'id1'}, {u'_self': u'dbs/hitPAA==/colls/hitPAL3OLgA=/docs/hitPAL3OLgACAAAAAAAAAA==/', u'myJsonArray': [{u'subId': u'sub3', u'val': u'value3'}, {u'subId': u'sub4', u'val': u'value4'}], u'_ts': 1507687809, u'_rid': u'hitPAL3OLgACAAAAAAAAAA==', u'_attachments': u'attachments/', u'_etag': u'"00002200-0000-0000-0000-59dd7d810000"', u'id': u'id2'}]
delete success
Please notice that you need to set the enableCrossPartitionQuery property to True in options if your documents are cross-partitioned.
Must be set to true for any query that requires to be executed across
more than one partition. This is an explicit flag to enable you to
make conscious performance tradeoffs during development time.
You could find above description from here.
Update Answer:
I think you misunderstand the meaning of the partitionKey property in the options.
For example, my container is created like this:
My documents are as below:
{
    "id": "1",
    "name": "jay"
}
{
    "id": "2",
    "name": "jay2"
}
My partition key is 'name', so here I have two partitions: 'jay' and 'jay2'.
So here you should set the partitionKey property to 'jay' or 'jay2', not 'name'.
Please modify your code as below:
options = {}
options['enableCrossPartitionQuery'] = True
options['maxItemCount'] = 2
options['partitionKey'] = 'jay'  # please change here in your code

result_iterable = client.QueryDocuments('dbs/db/colls/testcoll', query, options)
results = list(result_iterable)
print(results)
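For completeness, a sketch (not part of the original answer) of the matching delete call with pydocumentdb, where docID is the id of a document returned by the query above and the partition key value, not its name, is supplied:

delete_options = {'partitionKey': 'jay'}
client.DeleteDocument('dbs/db/colls/testcoll/docs/' + docID, delete_options)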
Hope it helps you.
Using the azure.cosmos library:
Install and import the azure-cosmos package:
from azure.cosmos import exceptions, CosmosClient, PartitionKey
Define the delete-items function, in this case using the partition key in the query:
def deleteItems(deviceid):
    client = CosmosClient(config.cosmos.endpoint, config.cosmos.primarykey)

    # Create the database if it does not exist
    database = client.create_database_if_not_exists(id='azure-cosmos-db-name')

    # Create the container
    # Using a good partition key improves the performance of database operations.
    container = database.create_container_if_not_exists(id='container-name', partition_key=PartitionKey(path='/your-partition-path'), offer_throughput=400)

    # fetch items
    query = f"SELECT * FROM c WHERE c.device.deviceid IN ('{deviceid}')"
    items = list(container.query_items(query=query, enable_cross_partition_query=False))

    for item in items:
        container.delete_item(item, 'partition-key')  # pass the item's partition key value here
usage:
deviceid = 10
deleteItems(deviceid)
github full example here: https://github.com/eladtpro/python-iothub-cosmos

Empty a DynamoDB table with boto

How can I optimally (in terms of financial cost) empty a DynamoDB table with boto? (As we can do in SQL with a truncate statement.)
boto.dynamodb2.table.delete() or boto.dynamodb2.layer1.DynamoDBConnection.delete_table() deletes the entire table, while boto.dynamodb2.table.delete_item() and boto.dynamodb2.table.BatchTable.delete_item() only delete the specified items.
While I agree with Johnny Wu that dropping the table and recreating it is much more efficient, there may be cases, such as when many GSIs or trigger events are associated with a table, where you don't want to have to re-associate those. The script below should work by recursively scanning the table and using the batch function to delete all items in the table. For massively large tables, though, this may not work, as it requires all items in the table to be loaded onto your computer.
import boto3

dynamo = boto3.resource('dynamodb')

def truncateTable(tableName):
    table = dynamo.Table(tableName)

    # get the table keys
    tableKeyNames = [key.get("AttributeName") for key in table.key_schema]

    """
    NOTE: there are reserved attributes for key names, please see https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ReservedWords.html
    if a hash or range key is in the reserved word list, you will need to use the ExpressionAttributeNames parameter
    described at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Table.scan
    """

    # Only retrieve the keys for each item in the table (minimize data transfer)
    ProjectionExpression = ", ".join(tableKeyNames)

    response = table.scan(ProjectionExpression=ProjectionExpression)
    data = response.get('Items')

    while 'LastEvaluatedKey' in response:
        response = table.scan(
            ProjectionExpression=ProjectionExpression,
            ExclusiveStartKey=response['LastEvaluatedKey'])
        data.extend(response['Items'])

    with table.batch_writer() as batch:
        for each in data:
            batch.delete_item(
                Key={key: each[key] for key in tableKeyNames}
            )

truncateTable("YOUR_TABLE_NAME")
As Johnny Wu mentioned, deleting a table and re-creating it is more efficient than deleting individual items. You should make sure your code doesn't try to create the new table before the old one is completely deleted.
import boto3

# client and resource assumed by the snippet below
client = boto3.client('dynamodb')
dynamodb = boto3.resource('dynamodb')

def deleteTable(table_name):
    print('deleting table')
    return client.delete_table(TableName=table_name)

def createTable(table_name):
    waiter = client.get_waiter('table_not_exists')
    waiter.wait(TableName=table_name)
    print('creating table')
    table = dynamodb.create_table(
        TableName=table_name,
        KeySchema=[
            {
                'AttributeName': 'YOURATTRIBUTENAME',
                'KeyType': 'HASH'
            }
        ],
        AttributeDefinitions=[
            {
                'AttributeName': 'YOURATTRIBUTENAME',
                'AttributeType': 'S'
            }
        ],
        ProvisionedThroughput={
            'ReadCapacityUnits': 1,
            'WriteCapacityUnits': 1
        },
        StreamSpecification={
            'StreamEnabled': False
        }
    )

def emptyTable(table_name):
    deleteTable(table_name)
    createTable(table_name)
Deleting a table is much more efficient than deleting items one-by-one. If you are able to control your truncation points, then you can do something similar to rotating tables as suggested in the docs for time series data.
This builds on the answer given by Persistent Plants. If the table already exists, you can extract the table definitions and use that to recreate the table.
import boto3

dynamodb = boto3.resource('dynamodb', region_name='us-east-2')

def delete_table_ddb(table_name):
    table = dynamodb.Table(table_name)
    return table.delete()

def create_table_ddb(table_name, key_schema, attribute_definitions,
                     provisioned_throughput, stream_enabled, billing_mode):
    settings = dict(
        TableName=table_name,
        KeySchema=key_schema,
        AttributeDefinitions=attribute_definitions,
        StreamSpecification={'StreamEnabled': stream_enabled},
        BillingMode=billing_mode
    )
    if billing_mode == 'PROVISIONED':
        settings['ProvisionedThroughput'] = provisioned_throughput
    return dynamodb.create_table(**settings)

def truncate_table_ddb(table_name):
    table = dynamodb.Table(table_name)
    key_schema = table.key_schema
    attribute_definitions = table.attribute_definitions
    if table.billing_mode_summary:
        billing_mode = 'PAY_PER_REQUEST'
    else:
        billing_mode = 'PROVISIONED'
    if table.stream_specification:
        stream_enabled = True
    else:
        stream_enabled = False
    capacity = ['ReadCapacityUnits', 'WriteCapacityUnits']
    provisioned_throughput = {k: v for k, v in table.provisioned_throughput.items() if k in capacity}

    delete_table_ddb(table_name)
    table.wait_until_not_exists()
    return create_table_ddb(
        table_name,
        key_schema=key_schema,
        attribute_definitions=attribute_definitions,
        provisioned_throughput=provisioned_throughput,
        stream_enabled=stream_enabled,
        billing_mode=billing_mode
    )
Now call the function:
table_name = 'test_ddb'
truncate_table_ddb(table_name)
