I have a table called details and I am trying to get the live item count from it. Below is the code.
I already had 7 items in the table and then inserted 8 more, so my output should show 15. My output still shows 7, which is the old count. How do I get the live, updated count?
From the UI it also shows 7, but when I checked the live count with Start Scan, I got 15 entries.
Is there some delay, like a few hours, before the live count is updated?
dynamo_resource = boto3.resource('dynamodb')
dynamodb_table_name = dynamo_resource.Table('details')
item_count_table = dynamodb_table_name.item_count
print('table_name count for field is', item_count_table)
Using the client:
dynamoDBClient = boto3.client('dynamodb')
table = dynamoDBClient.describe_table(TableName='details')
print(table['Table']['ItemCount'])
In your example you are calling DynamoDB DescribeTable, whose ItemCount is only updated approximately every six hours:
Item Count:
The number of items in the specified table. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_TableDescription.html#DDB-Type-TableDescription-ItemCount
In order to get the "Live Count" you have two possible options:
Option 1: Scan the entire table.
dynamoDBClient = boto3.client('dynamodb')
response = dynamoDBClient.scan(TableName='details', Select='COUNT')
print(response['Count'])
Be aware this will call for a full table Scan and may not be the best option for large tables.
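Note also that a single Scan call stops after reading 1 MB of data, so for larger tables you need to follow LastEvaluatedKey and add up the page counts. A rough sketch against the same 'details' table:
import boto3

# Sum the Count of every Scan page; ExclusiveStartKey continues where the
# previous page stopped, until no LastEvaluatedKey is returned.
dynamoDBClient = boto3.client('dynamodb')
total = 0
kwargs = {'TableName': 'details', 'Select': 'COUNT'}
while True:
    page = dynamoDBClient.scan(**kwargs)
    total += page['Count']
    if 'LastEvaluatedKey' not in page:
        break
    kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']
print(total)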
Option 2: Update a counter for each item you write
You would need to use TransactWriteItems and update a live counter with every Put you make to the table. This can be useful, but be aware that it limits your throughput to 1000 WCU per second, since you are focusing all your writes on a single item.
You can then return the live item count with a simple GetItem.
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_TransactWriteItems.html
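A rough sketch of that pattern, assuming the same 'details' table with a string partition key named 'id' and a dedicated counter item (the key name and the live_count attribute are assumptions, adjust to your real schema):
import boto3

dynamodb = boto3.client('dynamodb')

def put_item_with_counter(item):
    # Write the item and bump the live counter atomically in one transaction.
    dynamodb.transact_write_items(
        TransactItems=[
            {'Put': {'TableName': 'details', 'Item': item}},
            {'Update': {
                'TableName': 'details',
                'Key': {'id': {'S': 'COUNT'}},
                'UpdateExpression': 'ADD live_count :one',
                'ExpressionAttributeValues': {':one': {'N': '1'}},
            }},
        ]
    )

def get_live_count():
    # The live item count is then a single GetItem away.
    resp = dynamodb.get_item(TableName='details', Key={'id': {'S': 'COUNT'}})
    return int(resp['Item']['live_count']['N'])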
According to the DynamoDB documentation,
ItemCount - The number of items in the global secondary index. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
You might need to keep track of the item counts yourself if you need to return an updated count shortly after editing the table.
Related
I have a list of storage accounts and I would like to copy the exact table content from source_table to destination_table, exactly as it is. That means if I add an entry to source_table it should be copied to destination_table, and likewise if I delete an entry from the source I want it deleted from the destination.
So far I have in place this code:
# import paths assume the azure-cosmosdb-table package; adjust if you use a different table SDK
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.common.models import ListGenerator

table_service_out = TableService(account_name="sourcestorageaccount",
                                 account_key="source key")
table_service_in = TableService(account_name="destination storage",
                                account_key="destinationKey")
query_size = 1000

# save data to storage2 and check if there is leftover data in the current table; if yes, recurse
def queryAndSaveAllDataBySize(source_table_name, target_table_name, resp_data: ListGenerator,
                              table_out: TableService, table_in: TableService, query_size: int):
    for item in resp_data:
        tb_name = source_table_name
        del item.etag
        del item.Timestamp
        print("INSERT data:" + str(item) + " into TABLE:" + tb_name)
        table_in.insert_or_replace_entity(target_table_name, item)
    if resp_data.next_marker:
        data = table_out.query_entities(table_name=source_table_name, num_results=query_size,
                                        marker=resp_data.next_marker)
        queryAndSaveAllDataBySize(source_table_name, target_table_name, data, table_out, table_in, query_size)

tbs_out = table_service_out.list_tables()
print(tbs_out)

for tb in tbs_out:
    table = tb.name
    # create a table with the same name in storage2
    table_service_in.create_table(table_name=table, fail_on_exist=False)
    # first query
    data = table_service_out.query_entities(tb.name, num_results=query_size)
    queryAndSaveAllDataBySize(tb.name, table, data, table_service_out, table_service_in, query_size)
As you can see, this block of code runs perfectly: it loops over the source storage account's tables and recreates each table and its content in the destination storage account. But I am missing the part where I check whether a record has been deleted from the source storage and remove the same record from the destination table.
I hope my question/issue is clear enough; if not, please just ask me for more information.
Thank you so much for any help you can provide.
UPDATE:
The more I think about this, the messier the logic gets.
One solution I thought about and tried is to keep two lists storing every single table entity:
Source_table_entries
Destination_table_entries
Once I have populated the lists on each run, I can compare the partition keys, and if a partition key is present in Destination_table_entries but not in the source, that entity gets marked for deletion (see the sketch below).
But this logic only works flawlessly as long as the tables are small; unfortunately some tables contain hundreds of thousands of entities (and I have hundreds of storage accounts), which sooner or later will become a mess to manage.
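A rough sketch of that comparison, reusing the TableService objects from the code above and comparing the full (PartitionKey, RowKey) pair so that entities sharing a partition key are not confused. Only the keys are held in memory, which keeps the sets small even for large tables:
def delete_missing_entities(source_table_name, target_table_name,
                            table_out: TableService, table_in: TableService):
    # Collect the key pairs present on each side; query_entities pages
    # through the whole table as it is iterated.
    source_keys = {(e.PartitionKey, e.RowKey)
                   for e in table_out.query_entities(source_table_name,
                                                     select='PartitionKey,RowKey')}
    dest_keys = {(e.PartitionKey, e.RowKey)
                 for e in table_in.query_entities(target_table_name,
                                                  select='PartitionKey,RowKey')}
    # Anything in the destination but not in the source was deleted upstream.
    for pk, rk in dest_keys - source_keys:
        table_in.delete_entity(target_table_name, pk, rk)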
So one solution I thought about is to keep the same code I have above and just create a new table every week and delete the oldest one (from the destination storage). For example:
Table week 1
Table week 2
Table week 3 (this will be deleted)
I read around that I could potentially add metadata with a date to the table and use that to decide which table should be deleted, but I cannot find anything about this in the documentation.
Can anyone please point me toward the best approach for this? Thank you so much, I am losing my mind over this last bit.
I am trying to count the total number of items in a DynamoDB table. The Boto3 documentation says about the item_count attribute:
(integer) --
The number of items in the specified table. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
I populated about 100 records into that table, but the output shows 0 records.
import json
import os
import boto3
from pprint import pprint

tableName = os.environ.get('TABLE')
fieldName = os.environ.get('FIELD')
dbclient = boto3.resource('dynamodb')

def lambda_handler(event, context):
    tableresource = dbclient.Table(tableName)
    count = tableresource.item_count
    print('total items in the table are ' + str(count))
As you saw in the AWS documentation, item_count is:
The number of items in the specified table. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
It looks like you added the new items very recently, so that count will not be accurate right now. You would have to wait for the next refresh, up to six hours (three hours on average), to get an updated item count.
This is generally how large-scale distributed systems like S3 and DynamoDB work. They don't offer you an instantaneous, accurate count of objects or items because it's difficult to maintain that count accurately and the cost to calculate it instantaneously is prohibitive in the general case.
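If you do need an exact figure right after writing, one option (at the cost of reading the whole table) is a Scan with Select='COUNT'. A rough sketch using a paginator so the 1 MB pages are summed for you, reusing the TABLE environment variable from the question:
import os
import boto3

# Paginate over Scan pages and add up the per-page counts.
table_name = os.environ.get('TABLE')
client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')
total = sum(page['Count']
            for page in paginator.paginate(TableName=table_name, Select='COUNT'))
print('total items in the table are ' + str(total))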
I am using multiprocessing.Pool in my current program because I wanted to speed up fetching data from an on-premises data center and dumping it into a database on another server. The current rate is too slow for even a few MB worth of data, and this structure seems to work best for my current requirement:
def fetch_data():
    # select data from on_prem_db (id, name, ..., data)
    # using Pool and starmap, run dump_data in 5 parallel workers
    dump_data()

def dump_data():
    # insert entry in table_f1
    # insert entry in table_g1
    pass
Now I am running into an issue where sometimes multiple workers fetch rows that have already been processed, which leads to unique key violations.
e.g. the first worker fetches [10, 20, 40, 50, 70]
the second worker fetches [30, 40, 60, 70, 80]
The rows with ids 40 and 70 are duplicated. I am supposed to see 10 entries in my db, but I only see 8, and 2 of the inserts raise unique key violations.
How can I make sure that different workers fetch different rows from my source (on-prem) db, so that my program doesn't try to insert already inserted rows?
eg of my select query:
fetch_data_list_of_ids = [list of ids of processed data]
data_list = list(itertools.islice(on_prem_db.get_data(table_name),5))
Is there a way I can keep a list in fetch_data() and append the row ids of already processed data to it?
Then, every time data_list runs a new query to fetch data, I would check whether the newly fetched rows have ids in the fetch_data_list_of_ids list?
Or is there any other way to make sure duplicate entries are not processed? (One alternative is sketched below.)
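One way to sidestep the bookkeeping entirely is to fetch the rows once in the parent process and hand each worker a disjoint chunk, so no id can reach two workers. A rough sketch, assuming the hypothetical on_prem_db.get_data() generator and a per-row dump_data(row) helper like the ones sketched above:
from itertools import islice
from multiprocessing import Pool

def chunks(iterable, size):
    # Yield successive lists of `size` items from one shared iterator,
    # so every row appears in exactly one chunk.
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def dump_chunk(rows):
    for row in rows:
        dump_data(row)   # hypothetical: inserts the row into table_f1 and table_g1

if __name__ == '__main__':
    rows = on_prem_db.get_data(table_name)   # fetched exactly once, in one place
    with Pool(5) as pool:
        pool.map(dump_chunk, chunks(rows, 5))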
I have a DynamoDB table and I want to check whether there are any items in it (using Python). In other words, return True if the table is empty.
I am not sure how to go about this. Any suggestions?
Using Scan
The best way is to Scan and check the count. You are presumably using boto3, the AWS SDK for Python: use its scan function and check the returned count. This need not be costly, because a single Scan call reads at most 1 MB of data, so checking whether the table has any items at all does not require scanning the entire table and is not time consuming.
Read the docs for more details : Boto3 Docs
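A minimal sketch of that check (the table name 'details' is just a placeholder): Limit=1 means at most one item is evaluated, and with Select='COUNT' and no filter the returned Count is 0 only when the table is empty.
import boto3

client = boto3.client('dynamodb')
response = client.scan(TableName='details', Select='COUNT', Limit=1)
print('table is empty' if response['Count'] == 0 else 'table has items')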
Using describe table
This can also give you the count, but
DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
so it should only be used if you don't need the most recently updated value.
Read the docs for more details : describe table dynamodb
You can simply take the count of that particular table using boto3, which is the AWS SDK for Python:
import boto3

def table_is_empty(table_name):
    dynamo_resource = boto3.resource('dynamodb')
    table = dynamo_resource.Table(table_name)
    return table.item_count == 0
Note that the values are updated periodically and the result might not be precise:
The number of items in the specified index. DynamoDB updates this value approximately every six hours. Recent changes might not be reflected in this value.
You can use the DescribeTable function from boto3; in the response you can get the number of items in the table, as you can see in the response example at the link.
Part of the command response:
'TableSizeBytes': 123,
'ItemCount': 123,
'TableArn': 'string',
'TableId': 'string',
As said in the comments, the value is updated approximately every 6 hours, so recent changes may not be reflected.
I have a DynamoDB table to store email attribute information, with a hash key on the email and a range key on the timestamp (number). The initial idea behind using email as the hash key was to be able to query all records per email. But one thing I am trying to do is retrieve all email ids (the hash keys). I am using boto for this, but I am unsure how to retrieve distinct email ids.
My current code to pull 10,000 email records is
import boto.dynamodb2
from boto.dynamodb2.table import Table

conn = boto.dynamodb2.connect_to_region('us-west-2')
email_attributes = Table('email_attributes', connection=conn)
s = email_attributes.scan(limit=10000, attributes=['email'])
But to retrieve the distinct records, I would have to do a full table scan and then pick out the distinct records in code. Another idea is to maintain a second table that just stores these emails and do conditional writes: check whether an email id exists and write it only if it does not. But I am trying to work out whether that would be more expensive, since every write becomes a conditional write.
Q1.) Is there a way to retrieve distinct records using a DynamoDB scan?
Q2.) Is there a good way to calculate the cost per query?
Using a DynamoDB Scan, you would need to filter out duplicates on the client side (in your case, using boto). Even if you create a GSI with the reverse schema, you will still get duplicates. Given a hash+range table of email_id+timestamp called stamped_emails, a list of all unique email_ids is a materialized view of that table. You could enable a DynamoDB Stream on the stamped_emails table and subscribe a Lambda function to that Stream which does a PutItem(email_id) to a hash-only table called emails_only. Then you could Scan emails_only and you would get no duplicates.
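A rough sketch of that Stream-subscribed Lambda; the table name emails_only and the key attribute email_id are taken from the description above and may need adjusting to your actual schema:
import boto3

dynamodb = boto3.resource('dynamodb')
emails_only = dynamodb.Table('emails_only')   # hash-only table of unique emails

def lambda_handler(event, context):
    # Stream records carry the written keys in DynamoDB JSON, e.g. {'S': 'a@b.com'}.
    for record in event['Records']:
        if record['eventName'] == 'INSERT':
            email_id = record['dynamodb']['Keys']['email_id']['S']
            emails_only.put_item(Item={'email_id': email_id})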
Finally, regarding your question about cost: Scan reads entire items even if you only request certain projected attributes, it has to read through every item even if that item is filtered out by a FilterExpression (the filter is applied after the read), and it reads items sequentially, so each Scan call is metered as one big read. The cost implication is that if a Scan call reads 200 different items, it does not necessarily cost 100 RCU (what 200 separate eventually consistent item reads would cost). If each of those items is 100 bytes, the call reads about 20,000 bytes, roughly 19.5 KB, and with eventually consistent reads (8 KB per RCU) that is ROUND_UP(19.5 KB / 8 KB) = 3 RCU. Even if the call only returns, say, 123 items after filtering, if the Scan had to read 200 items you would still incur 3 RCU.
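The arithmetic from that example, spelled out:
import math

# 200 items of 100 bytes each, read with eventually consistent reads
# (one RCU covers 8 KB in that mode).
items, item_size_bytes = 200, 100
scanned_kb = items * item_size_bytes / 1024   # ~19.5 KB read by the Scan
rcu = math.ceil(scanned_kb / 8)               # rounds up to 3 RCU
print(rcu)                                    # 3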