I have a DynamoDB table that stores email attribute information, with a hash key on the email and a range key on the timestamp (a number). The idea behind using email as the hash key is to be able to query all records for a given email. But one thing I am trying to do is retrieve all of the email ids (the hash key values). I am using boto for this, and I am unsure how to retrieve the distinct email ids.
My current code to pull 10,000 email records is
import boto.dynamodb2
from boto.dynamodb2.table import Table

conn = boto.dynamodb2.connect_to_region('us-west-2')
email_attributes = Table('email_attributes', connection=conn)
s = email_attributes.scan(limit=10000, attributes=['email'])
But to retrieve the distinct email ids, I will have to do a full table scan and then pick out the distinct values in my code. Another idea I have is to maintain a second table that stores just the emails and do a conditional write for each one: if the email id does not exist yet, write it. I am trying to work out whether that would end up being more expensive, since every write would be a conditional write.
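For reference, a minimal sketch of that conditional-write idea in boto, reusing the connection above and assuming a hash-only table named emails_only (the table name is mine, not part of the schema above):

from boto.dynamodb2.exceptions import ConditionalCheckFailedException
from boto.dynamodb2.table import Table

emails_only = Table('emails_only', connection=conn)

def record_email(email):
    try:
        # overwrite=False makes this a conditional put: it fails if an item
        # with this hash key already exists, so each email is stored only once.
        emails_only.put_item(data={'email': email}, overwrite=False)
    except ConditionalCheckFailedException:
        pass  # this email id is already recorded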
Q1.) Is there a way to retrieve distinct records using a DynamoDB scan?
Q2.) Is there a good way to calculate the cost per query?
Using a DynamoDB Scan, you would need to filter out duplicates on the client side (in your case, using boto). Even if you create a GSI with the reverse schema, you will still get duplicates. Given a H+R table of email_id+timestamp called stamped_emails, a list of all unique email_ids is a materialized view of the H+R stamped_emails table. You could enable a DynamoDB Stream on the stamped_emails table, subscribe a Lambda function to stamped_emails' Stream that does a PutItem (email_id) to a Hash-only table called emails_only. Then, you could Scan emails_only and you would get no duplicates.
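A rough sketch of what that Lambda function could look like with boto3 (the handler shape and the emails_only / email_id names follow the description above, not your actual schema):

import boto3

dynamodb = boto3.client('dynamodb')

def handler(event, context):
    # Each stream record carries the keys of the item that changed in
    # stamped_emails; copy just the email_id into the hash-only table.
    for record in event['Records']:
        if record['eventName'] in ('INSERT', 'MODIFY'):
            email_id = record['dynamodb']['Keys']['email_id']['S']
            dynamodb.put_item(
                TableName='emails_only',
                Item={'email_id': {'S': email_id}},
            )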
Finally, regarding your question about cost: first, Scan reads entire items even if you only request certain projected attributes from them. Second, Scan has to read through every item, even ones that are filtered out by a FilterExpression (condition expression). Third, Scan reads through items sequentially, and each Scan call is treated as one big read for metering purposes. The cost implication is that if a Scan call reads 200 different items, it will not necessarily cost 100 RCU (which is what 200 individual eventually consistent reads of small items would cost). If each of those items is 100 bytes, that Scan call costs ROUND_UP(20,000 bytes / 1,024 bytes per KB / 8 KB per eventually consistent RCU) = 3 RCU. Even if the call only returns 123 items after filtering, if the Scan had to read 200 items you still incur 3 RCU.
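To make that arithmetic concrete, a small helper (my own, just restating the formula above):

import math

def scan_rcu(items_read, avg_item_bytes, eventually_consistent=True):
    # A Scan is metered on the total bytes it reads; 1 RCU covers 4 KB for
    # strongly consistent reads and 8 KB for eventually consistent reads.
    total_kb = items_read * avg_item_bytes / 1024.0
    kb_per_rcu = 8 if eventually_consistent else 4
    return math.ceil(total_kb / kb_per_rcu)

print(scan_rcu(200, 100))  # 3 RCU, matching the example above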
I am using multiprocessing.Pool in my current program because I wanted to speed up fetching data from an on-premises data center and dumping it into another database on a different server. The current rate is too slow for the megabytes of data involved, and this structure seems to work best for my requirement:
def fetch_data():
    # select data from on_prem_db (id, name...data)
    # using Pool and starmap, run the dump_data function
    # in 5 parallel threads
    dump_data()

def dump_data():
    # insert entry in table_f1
    # insert entry in table_g1
    pass
Now I am running into an issue where, sometimes, multiple threads fetch granules that have already been processed, which leads to a unique key violation.
e.g. the first thread fetches [10, 20, 40, 50, 70]
and the second thread fetches [30, 40, 60, 70, 80]
The rows with ids 40 and 70 are duplicated. I am supposed to see 10 entries in my db, but I see only 8, and the 2 duplicates raise a unique key violation.
How can I make sure that different threads fetch different rows from my source (on-prem) db, so that my program doesn't try to insert rows that have already been inserted?
e.g. my select query looks like:
fetch_data_list_of_ids = [list of ids of processed data]
data_list = list(itertools.islice(on_prem_db.get_data(table_name),5))
Is there a way I can keep a list and append the row ids of already processed data in fetch_data()?
Then, every time data_list runs a new query to fetch data, the next thing I would do is check whether the newly fetched rows have ids that are already in the fetch_data_list_of_ids list?
Or is there any other way to make sure duplicate entries are not processed?
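For clarity, a rough sketch of what I have in mind (on_prem_db, table_name and dump_data are as in my pseudocode above; I am assuming each row is a tuple whose first element is the id and that dump_data takes the row fields as arguments):

import itertools
from multiprocessing import Pool

fetch_data_list_of_ids = set()

def fetch_data():
    with Pool(5) as pool:
        while True:
            batch = list(itertools.islice(on_prem_db.get_data(table_name), 5))
            if not batch:
                break
            # drop any row whose id has already been handed out
            new_rows = [row for row in batch if row[0] not in fetch_data_list_of_ids]
            fetch_data_list_of_ids.update(row[0] for row in new_rows)
            # starmap unpacks each row's fields as arguments to dump_data
            pool.starmap(dump_data, new_rows)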
So I have a DynamoDB table which looks like this (exported to csv):
"email (S)","created_at (N)","firstName (S)","ip_addresses (L)","lastName (S)","updated_at (N)"
"name#email","1628546958.837838381","ddd","[ { ""M"" : { ""expiration"" : { ""N"" : ""1628806158"" }, ""IP"" : { ""S"" : ""127.0.0.1"" } } }]","ddd","1628546958.837940533"
I want to be able to do a "query", not a "scan", for all of the IPs (an attribute attached to users) which are expired. The expiration time is stored as a Unix timestamp.
Right now I'm scanning the entire table, looking at each user one by one, and then looping through all of their IPs to see whether they are expired. But I need to do this using a query; scans are expensive.
The table layout is like this:
primaryKey = email
attributes = firstName, lastName, ip_addresses (an array of maps, where each map has IP and expiration as its two keys).
I have no idea how to do this using a query so I would greatly appreciate if anyone could show me how! :)
I'm currently running the scan using Python and boto3 like this:
import boto3

client = boto3.client('dynamodb')

response = client.scan(
    TableName='users',
    Select='SPECIFIC_ATTRIBUTES',
    AttributesToGet=[
        'ip_addresses',
    ])
As per the boto3 documentation, the Query operation finds items based on primary key values. You can query any table or secondary index that has a composite primary key (a partition key and a sort key).
Use the KeyConditionExpression parameter to provide a specific value for the partition key. The Query operation will return all of the items from the table or index with that partition key value. You can optionally narrow the scope of the Query operation by specifying a sort key value and a comparison operator in KeyConditionExpression. To further refine the Query results, you can optionally provide a FilterExpression. A FilterExpression determines which items within the results should be returned to you. All of the other results are discarded.
So, long story short, a Query will only fetch the items whose partition key you specify when running it.
A Query operation always returns a result set. If no matching items are found, the result set will be empty.
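For illustration, a query against your table would have to look something like this (a sketch only; it can only ever return the items for the single email you pass in, never every user's expired IPs):

import boto3

client = boto3.client('dynamodb')

response = client.query(
    TableName='users',
    KeyConditionExpression='email = :email',
    ExpressionAttributeValues={':email': {'S': 'name#email'}},
    ProjectionExpression='ip_addresses',
)
items = response['Items']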
I have two objects called RECEIPT and PARTICULAR.
Attributes of RECEIPT object are:
receiptNo
particularList
Attributes of PARTICULAR object are:
particularId
particularName
I also have their respective tables. receiptNo is the primary key of the RECEIPT table and a foreign key in the PARTICULAR table, so one receipt has multiple particulars.
I want to fetch data to populate the RECEIPT objects. To achieve this, I can first run a select query against the RECEIPT table and then, iterating over the result with a for loop, run another query against the PARTICULAR table. Here I am calling the DB multiple times.
To avoid the extra DB calls, I also tried a join:
SELECT * FROM RECEIPT r,PARTICULAR p WHERE r.RECEIPT_NO = p.RECEIPT_NO
However, this returns repetitive data for the RECEIPT: for each PARTICULAR row, the corresponding RECEIPT data is fetched as well, and it repeats because multiple particularIds share the same receiptNo. Because of this I am unable to load the data properly into the RECEIPT objects (or maybe I just don't know how to load such a result set into the respective objects).
My actual requirement is to load each RECEIPT object by forming the PARTICULAR list for that receipt.
Is using the for loop and calling the DB multiple times the only way to achieve this?
Can you suggest an efficient way to achieve it?
I think querying the data from the database with the JOIN approach is the most efficient way to do it.
If you make sure to ORDER BY RECEIPT_NO, you only have to loop through the result set once in Python, creating a new Receipt object each time you reach a new RECEIPT_NO.
So the SQL becomes:
SELECT * FROM RECEIPT r, PARTICULAR p WHERE r.RECEIPT_NO = p.RECEIPT_NO ORDER BY r.RECEIPT_NO
And the Python code could look like:
data = query("SELECT * FROM RECEIPT r, PARTICULAR p WHERE r.RECEIPT_NO = p.RECEIPT_NO ORDER BY r.RECEIPT_NO")
last_receipt_nr = ""
for row in data:
    if last_receipt_nr != row.RECEIPT_NO:
        # Replace with code initializing a new Receipt object
        last_receipt_nr = row.RECEIPT_NO
    # Replace with code initializing a new Particular object
    # and adding it to the current Receipt's particular list
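Spelled out a bit more, that loop could look like the sketch below (the Receipt and Particular classes and the row attribute names are placeholders that just mirror the objects described in the question):

class Particular:
    def __init__(self, particular_id, particular_name):
        self.particularId = particular_id
        self.particularName = particular_name

class Receipt:
    def __init__(self, receipt_no):
        self.receiptNo = receipt_no
        self.particularList = []

def load_receipts(rows):
    # rows must already be ordered by RECEIPT_NO for this grouping to work
    receipts = []
    current = None
    for row in rows:
        if current is None or current.receiptNo != row.RECEIPT_NO:
            current = Receipt(row.RECEIPT_NO)
            receipts.append(current)
        current.particularList.append(
            Particular(row.PARTICULAR_ID, row.PARTICULAR_NAME))
    return receipts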
I'm trying to delete a large number of items in a DynamoDB table using boto and Python. My table is set up with the device ID (think MAC address) as the primary hash key. There are multiple entries in the table for each device ID, as the range key is a Unix timestamp.
From my reading this code should work:
from boto.dynamodb2.table import Table

def delete_batch(self, guid):
    table = Table('Integers')
    with table.batch_write() as batch:
        batch.delete_item(Id=guid)
Source: http://docs.pythonboto.org/en/latest/dynamodb2_tut.html#batch-writing
However it returns 'The provided key element does not match the schema' as the error message.
I suspect the problem is because guid is not unique in my table.
Given that, is there a way to delete multiple items with the same hash key without specifying the range key?
You are providing only the hash part of the key, not a full item key (hash + range); this is why you get the error and can't delete the items.
You can't ask DynamoDB to delete all items with a given hash key the way a Query can retrieve them all; you have to query for the items first and then delete each one by its full key.
Read this answer by Steffen for more information
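A rough sketch of that two-step approach with boto's dynamodb2 API (the key attribute names device_id and timestamp are assumptions based on the schema described in the question):

from boto.dynamodb2.table import Table

def delete_all_for_device(guid):
    table = Table('Integers')
    # Query returns every item for this hash key; each one is then deleted
    # by its full key (hash + range) inside a batch write.
    items = table.query_2(device_id__eq=guid)
    with table.batch_write() as batch:
        for item in items:
            batch.delete_item(device_id=guid, timestamp=item['timestamp'])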
I have a few hundred keys, all of the same Model, which I have pre-computed:
candidate_keys = [db.Key(...), db.Key(...), db.Key(...), ...]
Some of these keys refer to actual entities in the datastore, and some do not. I wish to determine which keys do correspond to entities.
It is not necessary to know the data within the entities, just whether they exist.
One solution would be to use db.get():
keys_with_entities = set()
for entity in db.get(candidate_keys):
    if entity:
        keys_with_entities.add(entity.key())
However, this procedure would fetch all of the entity data from the datastore, which is unnecessary and costly.
A second idea is to use a Query with an IN filter on key_name, manually fetching in chunks of 30 to fit the requirements of the IN pseudo-filter. However keys-only queries are not allowed with the IN filter.
Is there a better way?
IN filters are not supported directly by the App Engine datastore; they're a convenience that's implemented in the client library. An IN query with 30 values is translated into 30 equality queries on one value each, resulting in 30 regular queries!
Due to round-trip times and the expense of even keys-only queries, I suspect you'll find that simply attempting to fetch all the entities in one batch fetch is the most efficient. If your entities are large, however, you can make a further optimization: For every entity you insert, insert an empty 'presence' entity as a child of that entity, and use that in queries. For example:
foo = AnEntity(...)
foo.put()
presence = PresenceEntity(key_name='x', parent=foo)
presence.put()
...
def exists(keys):
    test_keys = [db.Key.from_path('PresenceEntity', 'x', parent=x) for x in keys]
    return [x is not None for x in db.get(test_keys)]
At this point, the only solution I have is to manually query by key with keys_only=True, once per key.
for key in candidate_keys:
    if MyModel.all(keys_only=True).filter('__key__ =', key).count():
        keys_with_entities.add(key)
This may in fact be slower than just loading the entities in a batch and discarding them, although the batch load also hammers the Data Received from API quota.
How not to do it (update based on Nick Johnson's answer):
I am also considering adding a property specifically for the purpose of being able to query for it with an IN filter.
class MyModel(db.Model):
    """Some model"""
    # ... all the old stuff
    the_key = db.StringProperty(required=True)  # just a duplicate of the key_name

# ... meanwhile back in the example
for key_batch in batches_of_30(candidate_keys):
    key_names = [x.name() for x in key_batch]
    found_keys = MyModel.all(keys_only=True).filter('the_key IN', key_names)
    keys_with_entities.update(found_keys)
The reason this should be avoided is that an IN filter on a property performs an index scan plus a lookup for each item in your IN set, sequentially. Each lookup takes 160-200 ms, so this very quickly becomes a very slow operation.