Retrieving data from a DynamoDB table using Python

A newbie to DynamoDb and python in general here. I have a task to complete where I have to retrieve information from a DynamoDB using some information provided. I've set up the access keys and such and I've been provided a 'Hash Key' and a table name. I'm looking for a way to use the hash key in order to retrieve the information but I haven't been able to find something specific online.
import boto3

# Table name
table_name = 'Waypoints'

# DynamoDB client
dynamodb_client = boto3.client('dynamodb')

# Hash key
hash_key = {
    ''
}

# Retrieve item
response = dynamodb_client.get_item(TableName=table_name, Key=hash_key)
Above is what I have written, but it doesn't work. get_item only returns one item from what I can gather, but I'm not sure what to pass to make it work in the first place.
Any sort of help would be greatly appreciated.

First of all, in a get_item() request the Key should not be just the key's value, but rather a map from the key's name to its value. For example, if your hash-key attribute is called "p", the Key you should pass would be {'p': hash_key}.
Second, is the hash key the entire key in your table? If you also have a sort key, in a get_item() you must also specify that part of the key - and the result is one item. If you are looking for all the items with a particular hash key but different sort keys, then the function you need to use is query(), not get_item().
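A minimal sketch of both cases, assuming the hash-key attribute is called "p" (both the attribute name and the value below are placeholders, not taken from the question):

import boto3

dynamodb_client = boto3.client('dynamodb')
table_name = 'Waypoints'

# get_item: Key is a map from attribute name to a typed value.
# The low-level client wraps values in a type descriptor such as 'S' for string.
response = dynamodb_client.get_item(
    TableName=table_name,
    Key={'p': {'S': 'my-hash-key-value'}}
)
item = response.get('Item')

# query: returns every item sharing the given hash key
# (needed when the table also has a sort key and you want all of them).
response = dynamodb_client.query(
    TableName=table_name,
    KeyConditionExpression='p = :pval',
    ExpressionAttributeValues={':pval': {'S': 'my-hash-key-value'}}
)
items = response.get('Items', [])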

Related

The most efficient way to compare two dictionaries, verifying dict_2 is a complete subset of dict_1, and return all values of dict_2 which are less?

I'm working on a data pipeline that pulls data from online sources and stores it in MongoDB. To manage the process, I've developed two dictionaries: request_totals and mongo_totals. mongo_totals contains a key for each container in the Mongo database, along with a value for the max(id) that container contains. request_totals has a key for each category data can be pulled from, along with a value for the max(id) of that category. If MongoDB is fully updated, these two dictionaries would be identical.
I've developed code that runs, but I can't shake the feeling that I'm not really being efficient here. I hope that someone can share some tips on how to better write this:
def compare(request_totals, mongo_totals):
    outdated = dict()
    # Verifies MongoDB contains no unique collections
    if request_totals | mongo_totals != request_totals:
        raise AttributeError('mongo_totals does not appear to be a subset of request_totals')

    sharedKeys = set(request_totals.keys()).intersection(mongo_totals.keys())
    unsharedKeys = set(request_totals) - set(mongo_totals)

    # Updates outdated dict with outdated key-value pairs representing MongoDB collections
    for key in sharedKeys:
        if request_totals[key] > mongo_totals[key]:
            outdated.update({key: mongo_totals[key]})
        elif request_totals[key] < mongo_totals[key]:
            raise AttributeError(
                f'mongo_total for {key}: {mongo_totals[key]} exceeds request_totals for {key}: {request_totals[key]}')

    return outdated | dict.fromkeys(unsharedKeys, 0)

compare(request_totals, mongo_totals)
The returned dictionary has key:value pairs that may be used in the following way: query the API using the key, and offset the records by the key's value. This way, it allows me to keep the database updated. Is there a more efficient way to handle this comparison?
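For comparison, here is a more compact sketch of the same check, assuming both dictionaries map category names to integer max(id) values. It keeps the question's names but is not a strict drop-in replacement, since it only reports keys that are actually behind:

def compare(request_totals, mongo_totals):
    # Keys present only in mongo_totals mean Mongo has a collection
    # the request side does not know about.
    extra = set(mongo_totals) - set(request_totals)
    if extra:
        raise AttributeError(f'mongo_totals has keys not in request_totals: {extra}')

    outdated = {}
    for key, requested in request_totals.items():
        stored = mongo_totals.get(key, 0)  # collections missing from Mongo start at 0
        if stored > requested:
            raise AttributeError(
                f'mongo_total for {key}: {stored} exceeds request_total for {key}: {requested}')
        if stored < requested:
            outdated[key] = stored
    return outdated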

DynamoDB Query for users with expired IP addresses

So I have a DynamoDB database table which looks like this (exported to csv):
"email (S)","created_at (N)","firstName (S)","ip_addresses (L)","lastName (S)","updated_at (N)"
"name#email","1628546958.837838381","ddd","[ { ""M"" : { ""expiration"" : { ""N"" : ""1628806158"" }, ""IP"" : { ""S"" : ""127.0.0.1"" } } }]","ddd","1628546958.837940533"
I want to be able to do a "query" not a "scan" for all of the IP's (attribute attached to users) which are expired. The time is stored in unix time.
Right now I'm scanning the entire table and looking through each user, one by one and then I loop through all of their IPs to see if they are expired or not. But I need to do this using a query, scans are expensive.
The table layout is like this:
primaryKey = email
attributes = firstName, lastName, ip_addresses (array of {} maps where each map has IP, and Expiration as two keys).
I have no idea how to do this using a query so I would greatly appreciate if anyone could show me how! :)
I'm currently running the scan using python and boto3 like this:
response = client.scan(
    TableName='users',
    Select='SPECIFIC_ATTRIBUTES',
    AttributesToGet=[
        'ip_addresses',
    ]
)
As per the boto3 documentation, the Query operation finds items based on primary key values. You can query any table or secondary index that has a composite primary key (a partition key and a sort key).
Use the KeyConditionExpression parameter to provide a specific value for the partition key. The Query operation will return all of the items from the table or index with that partition key value. You can optionally narrow the scope of the Query operation by specifying a sort key value and a comparison operator in KeyConditionExpression. To further refine the Query results, you can optionally provide a FilterExpression. A FilterExpression determines which items within the results should be returned to you. All of the other results are discarded.
So, long story short, a query will only fetch items whose primary key you specify when running it.
A Query operation always returns a result set. If no matching items are found, the result set will be empty.
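To make the limitation concrete, a query against this table has to pin the partition key (email); a minimal sketch, with the email value as a placeholder:

import boto3

client = boto3.client('dynamodb')

# A Query must target a single partition key value; it cannot select items
# by an arbitrary attribute such as an IP's expiration timestamp.
response = client.query(
    TableName='users',
    KeyConditionExpression='#e = :e',
    ExpressionAttributeNames={'#e': 'email'},
    ExpressionAttributeValues={':e': {'S': 'name@email'}},
    ProjectionExpression='ip_addresses'
)
items = response.get('Items', [])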

Best practice for DynamoDB composite primary key travelling inside the system (partition key and sort key)

I am working on a system where I am storing data in DynamoDB and it has to be sorted chronologically. For partition_key I have an id (uuid) and for sort_key I have a date_created value. Now originally it was enough to save unique entries using only the ID, but then a problem arose that this data was not being sorted as I wanted, so a sort_key was added.
Using the Python boto3 library, it would be enough for me to get, update or delete items using only the id as the primary key, since I know it is always unique:
import boto3

resource = boto3.resource('dynamodb')
table = resource.Table('my_table_name')

table.get_item(
    Key={'item_id': 'unique_item_id'}
)
table.update_item(
    Key={'item_id': 'unique_item_id'}
)
table.delete_item(
    Key={'item_id': 'unique_item_id'}
)
However, DynamoDB requires the sort key to be provided as well, since the primary key is composed of a partition key and a sort key.
table.get_item(
    Key={
        'item_id': 'unique_item_id',
        'date_created': 12345  # timestamp
    }
)
First of all, is it the right approach to use sort key to sort data chronologically or are there better approaches?
Secondly, what would be the best approach for transmitting partition key and sort key across the system? For example I have an API endpoint which accepts the ID, by this ID the backend performs a get_item query and returns the corresponding data. Now since I also need the sort key, I was thinking about using a hashing algorithm internally, where I would hash a JSON like this:
{
    "item_id": "unique_item_id",
    "date_created": 12345
}
and a single value then becomes my identifier for this database entry. I would then decode this value back into its parts before performing any database queries. Is this approach common?
First of all, is it the right approach to use sort key to sort data chronologically
Sort keys are the means of sorting data in DynamoDB. Using a timestamp as a sort key field is the right thing to do, and a common pattern in DDB.
DynamoDB requires a sort key to be provided ... since primary keys are composed partition key and sort key.
This is true. However, when reading from DDB it is possible to specify only the partition key by using the query operation (as opposed to the get_item operation, which requires the full primary key). This is a powerful construct that lets you specify which items you want to read from a given partition.
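A minimal sketch of that pattern with the boto3 resource interface, reusing the table and attribute names from the question:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('my_table_name')

# Query by partition key only; items come back ordered by the sort key
# (date_created). ScanIndexForward=False returns the newest first.
response = table.query(
    KeyConditionExpression=Key('item_id').eq('unique_item_id'),
    ScanIndexForward=False
)
items = response['Items']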
You may want to look into KSUIDs for your unique identifiers. KSUIDs are like UUIDs, but they contain a time component. This allows them to be sorted by generation time. There are several KSUID libraries in python, so you don't need to implement the algorithm yourself.

Python dictionary key count not the same as rows returned by the query in MySQL

So I am trying to fetch data from MySQL into a Python dictionary. Here is my code.
def getAllLeadsForThisYear():
    charges = {}
    cur.execute("select lead_id,extract(month from transaction_date),pid,extract(Year from transaction_date) from transaction where lead_id is not NULL and transaction_type='CHARGE' and YEAR(transaction_date)='2015'")
    for i in cur.fetchall():
        lead_id = i[0]
        month = i[1]
        pid = i[2]
        year = str(i[3])
        new = {lead_id: [month, pid, year]}
        charges.update(new)
    return charges

x = getAllLeadsForThisYear()
x=getAllLeadsForThisYear()
When I print len(x.keys()), it gives me some number, say 450.
When I run the same query in MySQL, it returns 500 rows. Although I do have some duplicate keys, I expected them to be counted as well, since I never added a check like "if i not in charges.keys()". Please correct me if I am wrong.
Thanks
As I said, the problem is that you are overwriting the value at a key every time a duplicate key pops up. This can be fixed in two ways:
You can do a check before adding a new value and, if the key already exists, append to the existing list.
For example:
# change these lines
new = {lead_id: [month, pid, year]}
charges.update(new)

# to
if lead_id in charges:
    charges[lead_id].extend([month, pid, year])
else:
    charges[lead_id] = [month, pid, year]
Which gives you a structure like this:
charges = {
    '123': [month1, pid1, year1, month2, pid2, year2, ...etc]
}
With this approach, you can reach each separate entry by chunking the value at each key by chunks of 3 (this may be useful)
However, I don't really like this approach because it requires you to do that chunking. Which brings me to approach 2.
Use defaultdict from collections, which acts in exactly the same way as a normal dict except that it creates a default value when you access a key that hasn't already been set.
For example:
# change
charges = {}
# to
charges = defaultdict(list)

# and change
new = {lead_id: [month, pid, year]}
charges.update(new)
# to
charges[lead_id].append((month, pid, year))
which gives you a structure like this:
charges = {
    '123': [(month1, pid1, year1), (month2, pid2, year2), ...etc]
}
With this approach, you can now iterate through each list at each key with:
for key in charges:
    for entities in charges[key]:
        print(entities)  # would print (month, pid, year) for each separate entry
If you are using this approach, don't forget to from collections import defaultdict. If you don't want to import anything, you can mimic this behaviour with:
if lead_id in charges:
    charges[lead_id].append((month, pid, year))
else:
    charges[lead_id] = [(month, pid, year)]
Which is incredibly similar to the first approach, but does the explicit "create a list if the key isn't there" that defaultdict would do implicitly.
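Putting the defaultdict version back into the original function, a minimal sketch (the cursor cur and the SQL text are taken from the question as-is):

from collections import defaultdict

def getAllLeadsForThisYear():
    # cur is an existing MySQL cursor, as in the question
    charges = defaultdict(list)
    cur.execute("select lead_id,extract(month from transaction_date),pid,extract(Year from transaction_date) from transaction where lead_id is not NULL and transaction_type='CHARGE' and YEAR(transaction_date)='2015'")
    for lead_id, month, pid, year in cur.fetchall():
        # Append one tuple per row, so duplicate lead_ids accumulate
        # instead of overwriting each other.
        charges[lead_id].append((month, pid, str(year)))
    return charges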

delete memcache entries by key names starting with a specific string (gae, python)

I built a blog with GAE and stored many items in memcache, including the paged entries.
The key used to store these pages is built from the query object and the page index:
@property
def _query_id(self):
    if not hasattr(self, '__query_id'):
        hsh = hashlib.md5()
        hsh.update(repr(self.query))
        self.__query_id = hsh.hexdigest()
    return self.__query_id

def _get_cache_key(self, page):
    return '%s%s' % (self._query_id, page)
It'll show in the admin console like: NDB9:xxxxxx.
Besides this, I store all other items with keys starting with sitename-obj.
In some cases, I want to clear only the paged cache, but I don't know how.
I wonder if there is a way to delete memcache entries by key names which start with NDB9?
Yes, I've found such a function,
delete_multi(keys, seconds=0, key_prefix='', namespace=None)
but it seems that the key_prefix is just added to every key in the first argument, and I want to delete memcache entries by key_prefix alone.
You cannot delete keys by prefix; you can only delete specific keys, or flush all of the cache.
In this case, you'd have to loop over all page ids to produce all possible keys. Pass those to delete_multi().
The key_prefix argument is just a convenience; you can send shorter 'keys' if they all have the same prefix. If all your keys start with NDB9, use that as the key prefix, and send a list of keys over without that prefix. The prefix will be added to each key by the memcached server when looking for what keys to delete.
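A minimal sketch of that suggestion; the page range and the query_id variable below are placeholders, not taken from the question:

from google.appengine.api import memcache

# Cache keys have the form '<query_id><page>', so only the page suffixes need
# to be enumerated; key_prefix supplies the shared '<query_id>' part.
page_keys = [str(page) for page in range(1, 51)]  # placeholder page range
memcache.delete_multi(page_keys, key_prefix=query_id)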
Alternatively, use memcache itself to store a list of all the other keys you have set:
keys = [key1, key2, key3, ...]
When you need to delete keys by pattern, read this list back, filter it, and use delete_multi to delete the matching keys.
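A sketch of that bookkeeping approach; the registry key name 'paged-cache-keys' is made up for illustration:

from google.appengine.api import memcache

REGISTRY_KEY = 'paged-cache-keys'  # made-up name for the stored list of cache keys

def remember_key(key):
    # Record every paged-cache key we set so it can be found again later.
    keys = memcache.get(REGISTRY_KEY) or []
    if key not in keys:
        keys.append(key)
        memcache.set(REGISTRY_KEY, keys)

def clear_keys_with_prefix(prefix):
    keys = memcache.get(REGISTRY_KEY) or []
    matching = [k for k in keys if k.startswith(prefix)]
    memcache.delete_multi(matching)
    memcache.set(REGISTRY_KEY, [k for k in keys if not k.startswith(prefix)])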
