Is it possible to paginate put_item in boto3? - python

When I use boto3 I can paginate when I am making a query or scan.
Is it possible to do the same with put_item?

The closest to "paginating" PutItem with boto3 is probably the included BatchWriter class and associated context manager. This class handles buffering and sending items in batches. Aside from PutItem, it supports DeleteItem as well.
Here is an example of how to use it:
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("name")

with table.batch_writer() as batch_writer:
    for _ in range(1000):
        batch_writer.put_item(Item={"HashKey": "...",
                                    "Otherstuff": "..."})

Pagination happens when DynamoDB reaches its 1 MB response-size maximum or when you are using --limit; it allows you to fetch the next "page" of data.
That does not make sense for PutItem, as you are simply putting a single item.
If what you mean is that you want to put more than one item at a time, then use the BatchWriteItem API, where you can pass in a batch of up to 25 items.
You can also use high-level interfaces like the batch_writer in boto3: give it any number of items and it breaks them into chunks of 25 for you, writes those batches, and handles any retry logic:
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("name")

with table.batch_writer() as batch_writer:
    for _ in range(1000):
        batch_writer.put_item(Item=myitem)
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/dynamodb.html#
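For completeness, here is a rough sketch of the low-level BatchWriteItem call mentioned above. The table name "name" and the item attributes are placeholders, and unlike batch_writer you have to chunk to at most 25 items and handle retries of unprocessed items yourself:

import boto3

client = boto3.client("dynamodb")

# Build up to 25 put requests; attribute names and values are illustrative.
items = [{"HashKey": {"S": f"key-{i}"}, "Otherstuff": {"S": "..."}} for i in range(25)]

response = client.batch_write_item(
    RequestItems={
        "name": [{"PutRequest": {"Item": item}} for item in items]
    }
)

# Anything DynamoDB could not process comes back in UnprocessedItems and
# must be retried by the caller (batch_writer does this for you).
print(response.get("UnprocessedItems", {}))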

Related

Cursor not found while reading all documents from a collection

I have a collection students and I want to read this collection into a list in Python, but unfortunately I get the following error: CursorNextError: [HTTP 404][ERR 1600] cursor not found. Is there a way to read a 'huge' collection without this error?
from arango import ArangoClient
# Initialize the ArangoDB client.
client = ArangoClient()
# Connect to database as user.
db = client.db(<db>, username=<username>, password=<password>)
print(db.collections())
students = db.collection('students')
#students.all()
students = db.collection('handlingUnits').all()
list(students)
[OUT] CursorNextError: [HTTP 404][ERR 1600] cursor not found
students = list(db.collection('students'))
[OUT] CursorNextError: [HTTP 404][ERR 1600] cursor not found
As suggested in my comment, if raising the TTL is not an option (which I wouldn't do either), I would fetch the data in chunks instead of all at once. In most cases you don't need the whole collection anyway, so maybe think of limiting that first: do you really need all documents and all their fields?
That being said, I have no experience with arango, but this is what I would do:
entries = db.collection('students').count()  # total number of documents in the collection
limit = 100                                  # block size you want to request
yourlist = []                                # final output

for x in range(int(entries / limit) + 1):
    block = db.collection('students').all(skip=x * limit, limit=limit)
    yourlist.extend(block)  # assuming block behaves like a list; not sure what arango returns
Something like this (based on the documentation here: https://python-driver-for-arangodb.readthedocs.io/_/downloads/en/dev/pdf/).
Limit each request to a reasonable amount and then skip that amount with your next request. You should check whether the range() calculation really works like that; you might have to think of a better way of defining the number of iterations you need.
This also assumes arango sorts the result of all() by default.
So what is the idea?
Determine the number of entries in the collection.
Based on that, determine how many requests you need (e.g. size=1000 -> 10 blocks, each containing 100 entries).
Make x requests where you skip the blocks you already have: first iteration entries 1-100, second iteration 101-200, third iteration 201-300, etc.
By default, AQL queries generate the complete result, which is then held in memory, and provided batch by batch. So the cursor is simply fetching the next batch of the already calculated result. In most of the cases this is fine, but if your query produces a huge result set, then this can take a long time and will require a lot of memory.
As an alternative you can create a streaming cursor. See https://www.arangodb.com/docs/stable/http/aql-query-cursor-accessing-cursors.html and check the stream option.
Streaming cursors calculate the next batch on demand and are therefore better suited to iterate a large collection.
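As a rough sketch (the collection name, batch size, and credentials are placeholders), a streaming cursor can be requested through the stream option of an AQL query in python-arango:

from arango import ArangoClient

client = ArangoClient()
db = client.db('<db>', username='<username>', password='<password>')

# Stream the result: batches are computed on demand instead of materialising
# the whole result set on the server before the first batch is returned.
cursor = db.aql.execute(
    'FOR doc IN students RETURN doc',
    batch_size=100,  # documents fetched per round trip
    stream=True,     # request a streaming cursor
)

students = [doc for doc in cursor]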

How can I get the total number of calls to my dynamodb table?

I'm working with AWS Lambda. I created a lambda function that performs a get operation on my DynamoDB table. Depending on the id (primary key) I pass to this get function, it should return the correct item in JSON format. For that, I'm using the get_item function from boto3:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Client.get_item
So normally, if I call my lambda function via an API (created with API Gateway) and specify an ID, I should get the corresponding item. The problem is that I also need to get the number of times I have retrieved a result. For example, if it's the seventh time I call my lambda function, I should get an item (still depending on the id) and the index 7, like this:
{
    "7": {
        "id": 1246,
        "toy": "car",
        "color": "red"
    }
}
Logically, the number of times that I call my lambda function is the number of times that I call DynamoDB. So I suppose the correct way to get this number is through DynamoDB itself, but I have already spent hours looking everywhere for a way to get the number of events/calls to my table... What can I do to get this number, and how could I implement it using boto3?
There is no out-of-the-box way to get the number of calls to a table in DynamoDB. You need to write a custom counter that is shared across Lambda invocations.
The easiest and fastest option is probably Redis and its INCR operation to perform atomic increments. If you're not familiar with Redis, check the doc for the INCR operation and specifically the Pattern: Counter section.
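A minimal sketch of that counter pattern with redis-py; the key name is illustrative and the Redis endpoint is assumed to be reachable from the Lambda function:

import redis

r = redis.Redis(host='your-redis-host', port=6379)

def next_call_count():
    # INCR is atomic, so concurrent Lambda invocations each get a distinct value.
    return r.incr('dynamodb_table_calls')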
If you only can use the DynamoDb, you need to maintain a counter in a separate single item. Example:
{
    "partionKey": "counter_item",
    "counter": 1
}
Then you can execute update calls to increment the counter like that:
response = table.update_item(
    Key={'partionKey': 'counter_item'},
    UpdateExpression='SET #counter = if_not_exists(#counter, :default) + :incr',
    ExpressionAttributeNames={
        '#counter': 'counter'
    },
    ExpressionAttributeValues={
        ':incr': 1,
        ':default': 0
    },
    ReturnValues='ALL_NEW'
)
There will be an updated item in the response so you can get the counter field from it.
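For example, with ReturnValues='ALL_NEW' as above, the new value is available as:

new_count = response['Attributes']['counter']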
You can check this DynamoDb guide for better examples in python.
I feel there are plenty of ways to find out the number of server calls.
If you have the logs, you can easily get the number of calls for any specific endpoint.
Use the AWS dashboard to get the metrics (it has everything: latency, failure ratio, calls, etc.).
Write your own function which counts the get, post, and update calls (similar to counting profile hits; generally this is done at the initial stage of a project).

Check if DynamoDB table Empty

I have a DynamoDB table and I want to check if there are any items in it (using Python). In other words, return true if the table is empty.
I am not sure how to go about this. Any suggestions?
Using Scan
The best way is to scan and check the count. You might be using boto3, the AWS SDK for Python; use the scan function to scan the table and get the count. This need not be costly: a single scan request reads at most 1 MB of data, and you only need one request to know whether any item exists.
Read the docs for more details: Boto3 Docs
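A minimal sketch of that idea; the table name is passed in by the caller, and Limit=1 stops the scan after the first item so at most one item is ever read:

import boto3

client = boto3.client('dynamodb')

def table_is_empty(table_name):
    # With Limit=1, Count is either 0 (empty table) or 1.
    response = client.scan(TableName=table_name, Limit=1)
    return response['Count'] == 0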
Using describe table
This could be helpful as well to get the count, but
DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
so this should only be used if you don't need the most recently updated value.
Read the docs for more details: describe table dynamodb
You can simply take the count of that particular table using boto3, which is the AWS SDK for Python:
import boto3

def table_is_empty(table_name):
    dynamo_resource = boto3.resource('dynamodb')
    table = dynamo_resource.Table(table_name)
    return table.item_count == 0
Note that the values are updated periodically and the result might not be precise:
The number of items in the specified index. DynamoDB updates this
value approximately every six hours. Recent changes might not be
reflected in this value.
You can use the describe_table function from boto3; in the response you can get the number of items in the table, as you can see in the response example in the link.
Part of the command response:
'TableSizeBytes': 123,
'ItemCount': 123,
'TableArn': 'string',
'TableId': 'string',
As said in the comments, the value is updated approximately every six hours, so recent changes may not be reflected.
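A short sketch of reading that field with boto3 (the table name is a placeholder):

import boto3

client = boto3.client('dynamodb')

response = client.describe_table(TableName='your_table_name')
# ItemCount is only refreshed roughly every six hours.
item_count = response['Table']['ItemCount']
print(item_count == 0)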

Modify item attribute after scan using boto3 in AWS Lambda

The goal is to scan and return all of the items in a DynamoDB table, but before the response is returned, modify a specific attribute of each specific item.
I have this completed already, but I'm curious to know if there is a more cost-effective way without looping through all the items.
Currently I'm returning a complete scan of the table and looping through each list item (found out it is not an object but a list):
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('<table name>')

response = table.scan()
items = response['Items']

for item in items:
    item['Thumbnail'] = 'https://s3.amazonaws.com/<s3bucket>/' + item['Thumbnail']

return items
I doubt the solution can be resolved without looping but if there is a solution that avoids looping I'm eager to hear it!
Your cost for the loop that updates the items will be measured in milliseconds. The DynamoDB scan plus network latency will take much more time.

Moving records from one collection to another PyMongo

What is the proper way of moving a number of records from one collection to another? I have come across several other SO posts such as this which deal with achieving the same goal, but none have a Python implementation.
#Taking a number of records from one database and returning the cursor
cursor_excess_new = db.test_collection_new.find().sort([("_id", 1)]).limit(excess_num)
# db.test.insert_many(doc for doc in cursor_excess_new).inserted_ids
# Iterating cursor and Trying to write to another database
# for doc in cursor_excess_new:
# db.test_collection_old.insert_one(doc)
result = db.test_collection_old.bulk_write([
for doc in cursor_excess_new:
InsertMany(doc for each doc in cur)
pprint(doc)
])
If I use insert_many, I get the following error: pymongo.errors.OperationFailure: Writes to config servers must have batch size of 1, found 10
bulk_write is giving me a syntax error at the start of for loop.
What is the best practice and correct way of transferring records from one collection to another in pymongo so that it is atomic?
Collection.bulk_write accepts as argument an iterable of query operations.
pymongo has pymongo.operations.InsertOne operation not InsertMany.
For your situation, you can build a list of InsertOne operations for each document in the source collection. Then do a bulk_write on the destination using the built-up list of operations.
from pymongo import InsertOne
...
cursor_excess_new = (
    db.test_collection_new
    .find()
    .sort([("_id", 1)])
    .limit(excess_num)
)

queries = [InsertOne(doc) for doc in cursor_excess_new]
db.test_collection_old.bulk_write(queries)
You don't need a "for loop".
myList=list(collection1.find({}))
collection2.insert_many(myList)
collection1.delete_many({})
If you need to filter it you can use the following code:
myList=list(collection1.find({'status':10}))
collection2.insert_many(myList)
collection1.delete_many({'status':10})
But be careful, because there is no guarantee that the move succeeds, so you need to control it with transactions. If you're going to use the following code, keep in mind that your MongoDB must not be standalone: you need to enable replication and have another instance (a replica set).
with myClient.start_session() as mySession:
    with mySession.start_transaction():
        ...yourcode...
Finally, the above code guarantees that the move (insert and delete) happens atomically, but the transaction isn't in your hands and you can't inspect its result, so you can use the following code to control both the move and the transaction:
with myClient.start_session() as mySession:
    mySession.start_transaction()
    try:
        ...yourcode...
        mySession.commit_transaction()
        print("Done")
    except Exception as e:
        mySession.abort_transaction()
        print("Failed", e)
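As a sketch of what the move itself could look like inside that block (the collection names and the filter are illustrative, and a replica set is assumed), note that each operation must receive the session so it takes part in the transaction:

with myClient.start_session() as mySession:
    mySession.start_transaction()
    try:
        # Pass the session to every operation so it runs inside the transaction.
        docs = list(collection1.find({'status': 10}, session=mySession))
        if docs:
            collection2.insert_many(docs, session=mySession)
            collection1.delete_many({'status': 10}, session=mySession)
        mySession.commit_transaction()
        print("Done")
    except Exception as e:
        mySession.abort_transaction()
        print("Failed", e)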
