AWS Lambda Python Boto3 - Item count dynamodb table - python

I am trying to count the total number of items in a DynamoDB table. The Boto3 documentation describes the item_count attribute:
(integer) --
The number of items in the specified table. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
I populated about 100 records into that table, but the output shows 0 records.
import json
import os
import boto3
from pprint import pprint

tableName = os.environ.get('TABLE')
fieldName = os.environ.get('FIELD')
dbclient = boto3.resource('dynamodb')

def lambda_handler(event, context):
    tableresource = dbclient.Table(tableName)
    count = tableresource.item_count
    print('total items in the table are ' + str(count))

As you saw in the AWS documentation, item_count is:
The number of items in the specified table. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
It looks like you added the new items very recently, so that count will not be accurate right now. You would have to wait 6 hours at most (3 hours on average) to get an updated item count.
This is generally how large-scale distributed systems like S3 and DynamoDB work. They don't offer an instantaneous, accurate count of objects or items, because maintaining that count accurately is difficult and calculating it on demand would be prohibitively expensive in the general case.
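If you need an exact, up-to-the-minute count, the only way is to count the items yourself. Here is a minimal sketch (not from the original answer; the helper name is mine), using a paginated Scan with Select='COUNT' so that only counts, not item data, are returned:

import boto3

dynamodb = boto3.resource('dynamodb')

def live_item_count(table_name):
    """Count items by scanning the whole table (reads every item)."""
    table = dynamodb.Table(table_name)
    response = table.scan(Select='COUNT')
    total = response['Count']
    # Keep scanning until DynamoDB stops returning a pagination token.
    while 'LastEvaluatedKey' in response:
        response = table.scan(Select='COUNT',
                              ExclusiveStartKey=response['LastEvaluatedKey'])
        total += response['Count']
    return total

Note that this reads (and bills for) every item in the table, so it is best reserved for small tables or occasional checks.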

Related

Does fetching data from Azure Table Storage with Python take too long? The data has around 1000 rows per hour and I am fetching it one hour at a time

import os, uuid
from azure.data.tables import TableClient
import json
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity, EntityProperty
import pandas as pd

def queryAzureTable(azureTableName, filterQuery):
    table_service = TableService(account_name='accountname', account_key='accountkey')
    tasks = table_service.query_entities(azureTableName, filter=filterQuery)
    return tasks

filterQuery = f"PartitionKey eq '{key}' and Timestamp ge datetime'2022-06-15T09:00:00' and Timestamp lt datetime'2022-06-15T10:00:00'"
entities = queryAzureTable("TableName", filterQuery)

for i in entities:
    print(i)

OR

df = pd.DataFrame(entities)
Above is the code that I am using. The Azure table has only around 1000 entries, which should not take long, but extracting them takes more than an hour with this code.
Both using a 'for' loop and converting the entities directly to a DataFrame take too long.
Could anyone let me know why it is taking so long, or whether it generally takes that much time?
If that's the case, is there an alternative that takes no more than 10-15 minutes, without increasing the number of clusters already in use?
I read that multithreading might resolve it. I tried that too, but it doesn't seem to help; maybe I am writing it wrong. Could anyone help me with the code using multithreading or any alternate approach?
I tried to list all the rows in my table storage. By default, Azure Table Storage returns query results in pages of up to 1000 rows (entities) at a time.
Also, there are limits on the partition key and row key, which should not exceed 1 KiB each. Unfortunately, the type of storage account also matters for the latency of your output. As you're trying to query 1000 rows at once:
Make sure your table storage is in a region near you.
Check the scalability targets and limitations for Azure Table Storage here: https://learn.microsoft.com/en-us/azure/storage/tables/scalability-targets#scale-targets-for-table-storage
Also, AFAIK, in your code you can directly use the list_entities method to list all the entities in the table instead of writing such a complex query.
I tried the code below and was able to retrieve all the table entities successfully within a few seconds using a standard general-purpose V2 storage account:
Code :-
from azure.data.tables import TableClient

table_client = TableClient.from_connection_string(conn_str="<connection-string>", table_name="myTable")

# Query the entities in the table
entities = list(table_client.list_entities())

for i, entity in enumerate(entities):
    print("Entity #{}: {}".format(i, entity))
With Pandas :-
import pandas as pd
from azure.cosmosdb.table.tableservice import TableService

CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=siliconstrg45;AccountKey=<account-key>;EndpointSuffix=core.windows.net"
SOURCE_TABLE = "myTable"

def set_table_service():
    """ Set up the Azure Table Storage service """
    return TableService(connection_string=CONNECTION_STRING)

def get_data_from_table_storage_table(table_service):
    """ Retrieve data from Table Storage """
    for record in table_service.query_entities(SOURCE_TABLE):
        yield record

def get_dataframe_from_table_storage_table(table_service):
    """ Create a dataframe from table storage data """
    return pd.DataFrame(get_data_from_table_storage_table(table_service))

ts = set_table_service()
df = get_dataframe_from_table_storage_table(table_service=ts)
print(df)
If your table storage is within its scalability targets, you can consider a few points from this document to increase the IOPS of your table storage:
Refer here:
https://learn.microsoft.com/en-us/azure/storage/tables/storage-performance-checklist
Also, storage quotas and limits vary by Azure subscription type!
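As a hedged aside (the property names below are placeholders, not from the question): azure-data-tables also lets you project only the properties you need with the select keyword of list_entities, which shrinks the payload and can noticeably reduce latency for wide entities.

from azure.data.tables import TableClient

table_client = TableClient.from_connection_string(
    conn_str="<connection-string>", table_name="myTable")

# Fetch only the columns actually needed (property names are placeholders).
entities = table_client.list_entities(
    select=["PartitionKey", "RowKey", "Temperature"],
    results_per_page=1000)  # page-size hint; iterating handles paging for you

for entity in entities:
    print(entity)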

Is it possible to paginate put_item in boto3?

When I use boto3 I can paginate if I am making a query or scan.
Is it possible to do the same with put_item?
The closest to "paginating" PutItem with boto3 is probably the included BatchWriter class and associated context manager. This class handles buffering and sending items in batches. Aside from PutItem, it supports DeleteItem as well.
Here is an example of how to use it:
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("name")

with table.batch_writer() as batch_writer:
    for _ in range(1000):
        batch_writer.put_item(Item={"HashKey": "...",
                                    "Otherstuff": "..."})
Pagination happens when DynamoDB reaches its maximum response size of 1 MB, or when you are using --limit. It allows you to get the next "page" of data.
That does not make sense for PutItem, as you are simply putting a single item.
If what you mean is that you want to put more than one item at a time, then use the BatchWriteItem API, where you can pass in a batch of up to 25 items.
You can also use high-level interfaces like the batch_writer in boto3, where you can give it a list of items of any size and it breaks the list into chunks of 25 for you and writes those batches while also handling any retry logic:
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("name")

with table.batch_writer() as batch_writer:
    for _ in range(1000):
        # myitem is a dict representing a single DynamoDB item
        batch_writer.put_item(Item=myitem)
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/dynamodb.html#

How to get live count immediately after insertion from dynamodb using boto3

I have a table called details.
I am trying to get the live item count from the table.
Below is the code.
I already had 7 items in the table and I inserted 8 more, so the output should now show 15.
Still my output shows 7, which is the old count. How do I get the live, updated count?
From the UI, when I check it also shows 7, but when I checked the live count with Start Scan, I got 15 entries.
Is there a delay, like some number of hours, before the live count is updated?
dynamo_resource = boto3.resource('dynamodb')
dynamodb_table_name = dynamo_resource.Table('details')
item_count_table = dynamodb_table_name.item_count
print('table_name count for field is', item_count_table)
using client
dynamoDBClient = boto3.client('dynamodb')
table = dynamoDBClient.describe_table(TableName='details')
print(table['Table']['ItemCount'])
In your example you are calling DynamoDB DescribeTable which only updates its data approximately every 6 hours:
Item Count:
The number of items in the specified table. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_TableDescription.html#DDB-Type-TableDescription-ItemCount
In order to get the "Live Count" you have two possible options:
Option 1: Scan the entire table.
dynamoDBClient = boto3.client('dynamodb')
item_count = dynamoDBClient.scan(TableName='details', Select='COUNT')
print(item_count['Count'])
Be aware this will call for a full table Scan and may not be the best option for large tables.
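A hedged sketch for larger tables (the paginator usage is a sketch, not part of the original answer): a single Scan call evaluates at most 1 MB of data, so you can let a boto3 paginator walk every page and sum the per-page counts.

import boto3

dynamoDBClient = boto3.client('dynamodb')
paginator = dynamoDBClient.get_paginator('scan')

total = 0
for page in paginator.paginate(TableName='details', Select='COUNT'):
    total += page['Count']  # each page reports how many items it counted

print('live item count:', total)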
Option 2: Update a counter for each item you write
You would need to use TransactWriteItems and Update a live counter for every Put which you do to the table. This can be useful, but be aware that you will limit your throughput to 1000 WCU per second as you are focusing all your writes on a single item.
You can then return the live item count with a simple GetItem.
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_TransactWriteItems.html
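A minimal sketch of Option 2 (the table name, key attribute, and dedicated counter item are assumptions, not from the original answer): each write puts the item and increments the counter atomically in one transaction.

import boto3

client = boto3.client('dynamodb')

def put_item_with_live_count(item_key):
    """Put an item and bump a dedicated counter item in one transaction."""
    client.transact_write_items(
        TransactItems=[
            {'Put': {
                'TableName': 'details',
                'Item': {'pk': {'S': item_key}},
            }},
            {'Update': {
                'TableName': 'details',
                'Key': {'pk': {'S': 'LIVE_COUNT'}},  # dedicated counter item
                'UpdateExpression': 'ADD itemCount :one',
                'ExpressionAttributeValues': {':one': {'N': '1'}},
            }},
        ])

# The live count is then a single GetItem on the 'LIVE_COUNT' item.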
According to the DynamoDB documentation,
ItemCount - The number of items in the global secondary index. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
You might need to keep track of the item counts yourself if you need to return an updated count shortly after editing the table.

Check if DynamoDB table Empty

I have a DynamoDB table and I want to check if there are any items in it (using Python). In other words, return True if the table is empty.
I am not sure how to go about this. Any suggestions?
Using Scan
The best way is to scan the table and check the count. You might be using the boto3 AWS SDK for Python. Use the scan function to scan the table and get the count. This need not be costly: you scan only once, and a single Scan request returns at most 1 MB of data, so it does not read the entire table and is not time-consuming.
Read the docs for more details: Boto3 Docs
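A minimal sketch of the Scan approach (the table name is a placeholder): evaluating a single item is enough to decide whether the table is empty.

import boto3

dynamodb = boto3.resource('dynamodb')

def table_is_empty(table_name):
    table = dynamodb.Table(table_name)
    response = table.scan(Limit=1)  # stop after evaluating one item
    return response['Count'] == 0

print(table_is_empty('my-table'))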
Using describe table
This could be helpful as well to get the count, but
DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
so it should only be used if you don't need the most recently updated value.
Read the docs for more details: describe table dynamodb
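A minimal sketch of the DescribeTable approach (the table name is a placeholder); keep in mind the ItemCount it returns can be up to about six hours stale.

import boto3

client = boto3.client('dynamodb')
response = client.describe_table(TableName='my-table')
print(response['Table']['ItemCount'] == 0)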
You can simply take the count of that particular table using boto3, which is the AWS SDK for Python:
import boto3

def table_is_empty(table_name):
    dynamo_resource = boto3.resource('dynamodb')
    table = dynamo_resource.Table(table_name)
    return table.item_count == 0
Note that the values are updated periodically and the result might not be precise:
The number of items in the specified index. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
You can use the DescribeTable function from boto3; in the response you can get the number of items in the table, as you can see in the response example in the link.
Part of the command response:
'TableSizeBytes': 123,
'ItemCount': 123,
'TableArn': 'string',
'TableId': 'string',
As said in the comments, the value is updated approximately every 6 hours, so recent changes may not be reflected.

Time-efficient pymongo query for fetching thousands of records and appending them to a list for a Flask REST API

I have hundreds of thousands of records in a collection named "student_details".
I am using the pymongo query below:
students_info = db.student_details.find()
It gives me all the records, which is a huge amount. I don't want any filters there, I mean no where clause; I want to fetch all the records.
Now I am using a for loop and appending each document to a list.
def student_information():
    student_list = []
    for student in students_info:
        '''
        in between there are a number of if/else blocks
        '''
        student_list.append(student)
    return jsonify({"result": student_list})
It takes a huge amount of time, which makes the response time very slow.
Please help me make this time-efficient.
