Background
I would like to create a DynamoDB trigger such that on each new entry the value is updated before saving.
The DynamoDB table consists of Jobs/tasks and I would like to do calculations and assign the job/task to the respective employee.
The task seems relatively simple; I just need some guidance and assistance creating a Lambda function that can accomplish this.
[...] the value is updated before saving.
I am afraid that DynamoDB streams do not work like this. The stream will contain only items that are already stored in the table.
One way to solve this is to add another property to your table that indicates whether the "job" is ready to process. By default, jobs that are added to the table are "not ready"; the DynamoDB stream then triggers your Lambda, which does its calculations, assigns the job to an employee AND sets the job to "ready".
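If it helps, here is a minimal sketch of that pattern as a Python Lambda handler. It assumes the stream is configured with the NEW_IMAGE view type, the table's partition key is a string attribute called "id", the status attribute is called "job_status", and the assignment logic is just a placeholder; adjust the names to your schema.

import boto3

# Table name and key/attribute names below are assumptions, not from the question.
table = boto3.resource('dynamodb').Table('jobs')

def handler(event, context):
    for record in event['Records']:
        # Only react to newly inserted jobs; the update below produces a MODIFY
        # event, which is skipped here, so the function does not loop on itself.
        if record['eventName'] != 'INSERT':
            continue
        new_image = record['dynamodb']['NewImage']
        job_id = new_image['id']['S']
        employee = assign_employee(new_image)
        table.update_item(
            Key={'id': job_id},
            UpdateExpression='SET assigned_to = :emp, job_status = :ready',
            ExpressionAttributeValues={':emp': employee, ':ready': 'ready'},
        )

def assign_employee(new_image):
    # Placeholder for your calculation / assignment logic.
    return 'employee-1'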
Another option to solve this might be a little restructuring. For example: why not use a Step Function that has multiple steps, the final one being a step to save the calculated result of the previous step(s) into the table.
More than likely, you'll end up enabling DynamoDB Streams with a Lambda function to read from that stream. That'll give you the "on each new entry the value is..." The function would then do the calculations and assigning of the job you mentioned. I recommend starting with this from the DynamoDB documentation on Lambda triggers as well as this from the AWS Lambda documentation on working with DynamoDB Streams.
If that does not get you going in the right direction, let me know and I will dig up more for you.
Related
I'd like to "truncate" (delete all items) in a DynamoDB table. I know that the most efficient way to do this would be to delete the table and re-create it (name, indexes, etc.). However, the table is part of a SAM-CloudFormation deployment. The table (by name) is also referenced within other parts of the application.
If I deleted and re-created it, I could use the same name it had previously; however, I think this would cause problems because (1) the deletion isn't immediate and (2) the ARN would change and that could have implications on the CloudFormation stack.
It seems that there should be a better solution than the brute-force approach: iterate through all items, deleting them one at a time (with some optimization via the batch_writer).
I've looked at some other solutions here, but they don't address the "part of a CloudFormation stack" part of my question.
Truncate DynamoDb or rewrite data via Data Pipeline
What is the recommended way to delete a large number of items from DynamoDB?
I even provided a brute-force solution myself to another's question on this topic.
delete all items DynamoDB using Python
Here is the brute-force approach:

import boto3

table = boto3.resource('dynamodb').Table('my-table-name')

scan = None
with table.batch_writer() as batch:
    count = 0
    while scan is None or 'LastEvaluatedKey' in scan:
        # Paginate through the table, fetching only the key attribute.
        if scan is not None and 'LastEvaluatedKey' in scan:
            scan = table.scan(
                ProjectionExpression='id',
                ExclusiveStartKey=scan['LastEvaluatedKey'],
            )
        else:
            scan = table.scan(ProjectionExpression='id')
        for item in scan['Items']:
            if count % 5000 == 0:
                print(count)  # progress indicator
            batch.delete_item(Key={'id': item['id']})
            count = count + 1
The desired final state is a DynamoDB table (that was previously full of items) with the same name, no items, and still able to be destroyed as part of a CloudFormation delete operation.
No matter whether you created the table as AWS::Serverless::SimpleTable or AWS::DynamoDB::Table, there is no out-of-the-box solution to empty it using CloudFormation while keeping its name.
As a general best practice you shouldn't name DynamoDB tables created by CloudFormation, but let CloudFormation assign a name for the resource. If that were the case in your setup, you could simply make a change to the resource that requires "replacement" of the resource, like temporarily adding a Local Secondary Index, which would recreate the resource and would work with resources depending on it.
That said, in your situation the best approach is probably to wrap your brute-force approach in a CloudFormation custom resource and include that in your CloudFormation stack. With that you can truncate the table once or, depending on the implementation of your custom resource, whenever you want.
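For reference, here is a hedged sketch of what such a Lambda-backed custom resource handler could look like. It reuses the brute-force scan/delete loop, assumes the table name is passed in as a resource property, assumes the key attribute is called "id", and relies on the cfnresponse helper module (bundled when the function code is supplied inline via ZipFile); treat it as a starting point, not a drop-in implementation.

import boto3
import cfnresponse

def handler(event, context):
    try:
        # Truncate on stack creation; change the condition (e.g. to 'Update')
        # depending on when you want the table emptied.
        if event['RequestType'] == 'Create':
            truncate(event['ResourceProperties']['TableName'])
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    except Exception:
        cfnresponse.send(event, context, cfnresponse.FAILED, {})

def truncate(table_name):
    table = boto3.resource('dynamodb').Table(table_name)
    scan = None
    with table.batch_writer() as batch:
        while scan is None or 'LastEvaluatedKey' in scan:
            kwargs = {'ProjectionExpression': 'id'}
            if scan is not None:
                kwargs['ExclusiveStartKey'] = scan['LastEvaluatedKey']
            scan = table.scan(**kwargs)
            for item in scan['Items']:
                batch.delete_item(Key={'id': item['id']})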
Keep in mind that deleting all items from a DynamoDB table might take quite long, so using a Lambda-backed custom resource might run into the limit of Lambda function runtime, depending on the number of items in the table. It might also become quite costly if the table contains a lot of items.
I don't have much knowledge of databases, but I wanted to know whether there is any technique by which, when I update or insert a specific entry in a table, my Python application gets notified, so that it can see what was updated and then update that particular row in the data stored in session or some temporary storage.
I need to run filter and sort calls on the data again and again, so I don't want to fetch the whole dataset from SQL every time; I decided to keep it local and process it from there. But I was worried that if the database is updated in the meantime, I could be serving the same old data to those filter requests.
Any suggestions?
A relational database will only be updated by your own program's methods or functions, so you can simply print to the console or log from inside those methods.
If you want to track what was updated, modified, or deleted outside of that, you would have to build another program that is able to track the database's logs.
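For example, a minimal sketch of that idea: since only your own code writes to the database, put the "notification" in the write path itself and refresh your local copy there. The sqlite3 connection and the table/column names are illustrative assumptions.

import sqlite3

conn = sqlite3.connect('app.db')

def update_row(row_id, status, on_change=None):
    with conn:  # commits on success, rolls back on error
        conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, row_id))
    if on_change is not None:
        # e.g. refresh the copy of this row kept in session / temporary storage
        on_change(row_id, status)

The application would register whatever refresh callback it needs, e.g. update_row(42, 'done', on_change=refresh_local_cache).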
Thanks.
I have a seemingly simple problem when constructing my pipeline for Dataflow. I have multiple pipelines that fetch data from external sources, transform the data and write it to several BigQuery tables. When this process is done I would like to run a query that queries the just generated tables. Ideally I would like this to happen in the same job.
Is this the way Dataflow is meant to be used, or should the loading to BigQuery and the querying of the tables be split up between jobs?
If this is possible in the same job how would one solve this, as the BigQuerySink does not generate a PCollection? If this is not possible in the same job, is there some way to trigger a job on the completion of another job (i.e. the writing job and the querying job)?
You alluded to what would need to happen to do this in a single job -- the BigQuerySink would need to produce a PCollection. Even if it is empty, you could then use it as the input to the step that reads from BigQuery in a way that made that step wait until the first sink was done.
You would need to create your own version of the BigQuerySink to do this.
If possible, an easier option might be to have the second step read from the collection that you wrote to BigQuery rather than reading the table you just put into BigQuery. For example:
PCollection<TableRow> rows = ...;
rows.apply(BigQueryIO.Write.to(...));   // write the rows to BigQuery
rows.apply(/* rest of the pipeline */); // and continue processing the same collection
You could even do this earlier if you wanted to continue processing the elements written to BigQuery rather than the table rows.
I am trying to interact with a DynamoDB table from Python using boto. I want all reads/writes to use quorum consistency to ensure that reads sent out immediately after writes always reflect the correct data.
NOTE: my table is set up with "phone_number" as the hash key and first_name+last_name as a secondary index. And for the purposes of this question one (and only one) item exists in the db (first_name="Paranoid", last_name="Android", phone_number="42")
The following code works as expected:
customer = customers.get_item(phone_number="42")
While this statement:
customer = customers.get_item(phone_number="42", consistent_read=True)
fails with the following error:
boto.dynamodb2.exceptions.ValidationException: ValidationException: 400 Bad Request
{u'message': u'The provided key element does not match the schema', u'__type': u'com.amazon.coral.validate#ValidationException'}
Could this be the result of some hidden data corruption due to failed requests in the past? (for example two concurrent and different writes executed at eventual consistency)
Thanks in advance.
It looks like you are calling the get_item method so the issue is with how you are passing parameters.
get_item(hash_key, range_key=None, attributes_to_get=None, consistent_read=False, item_class=<class 'boto.dynamodb.item.Item'>)
Which would mean you should be calling the API like:
customer = customers.get_item(hash_key="42", consistent_read=True)
I'm not sure why the original call you were making was working.
To address your concerns about data corruption and eventual consistency, it is highly unlikely that any API call you could make to DynamoDB could result in it getting into a bad state, outside of you sending it bad data for an item. DynamoDB is a highly tested solution that provides exceptional availability and goes to extraordinary lengths to take care of the data you send it.
Eventual consistency is something to be aware of with DynamoDB, but generally speaking it is not something that causes many issues, depending on the specifics of the use case. While AWS does not provide specific metrics on what "eventually consistent" looks like, in day-to-day use it is normal to be able to read back records that were just written or modified in under a second, even with eventually consistent reads.
As for performing multiple writes simultaneously on the same item, DynamoDB writes are always strongly consistent. If you are worried about an individual item being modified at the same time and causing unexpected behavior, you can use conditional writes, which allow a write to fail and let your application logic deal with any issues that arise.
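For example, here is a hedged sketch of a conditional write using boto3 (not the boto 2.x library used in the question); the table and attribute names are illustrative.

import boto3
from botocore.exceptions import ClientError

table = boto3.resource('dynamodb').Table('customers')

def rename_customer(phone_number, new_first_name, expected_first_name):
    try:
        table.update_item(
            Key={'phone_number': phone_number},
            UpdateExpression='SET first_name = :new',
            ConditionExpression='first_name = :expected',
            ExpressionAttributeValues={
                ':new': new_first_name,
                ':expected': expected_first_name,
            },
        )
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False  # another writer changed the item first; handle in application logic
        raise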
I'm trying to figure out the best approach to keeping track of the number of entities of a certain NDB kind I have in my cloud datastore.
One approach is just, when I want to know how many I have, get the .count() of a query that I know will return all of them, but that costs a ton of datastore small operations (looks like it's proportional to the number of entities of that kind I have). So that's not ideal.
Another option would be having a counter in the datastore that gets updated every time I create or delete an entity, but that's also not ideal because it would add an extra read and write operation to every entity I create or destroy.
As of now, it looks like the second option is my best choice, so my question is--do you agree? Are there any other options that would be more cost-effective?
Thanks a lot.
PS: Working in Python if that makes a difference.
Second option is the way to go.
Other considerations:
If you have many writes per second you may wish to consider using a sharded counter
To reduce datastore writes, you could use a cron job to update the datastore at timed intervals (i.e. count how many entities have been created since the last run)
Also consider using memcache.incr() in conjunction with a cron job to persist the data. The downside of this is that your memcache key could drop, so it is only really an option if the count doesn't have to be accurate.
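Here is a minimal sketch of the counter-entity approach (your second option), assuming a single, non-sharded counter kept in its own NDB kind; the kind and key names are illustrative.

from google.appengine.ext import ndb

class EntityCounter(ndb.Model):
    count = ndb.IntegerProperty(default=0)

COUNTER_KEY = ndb.Key(EntityCounter, 'my-kind-counter')

@ndb.transactional
def increment(delta=1):
    counter = COUNTER_KEY.get()
    if counter is None:
        counter = EntityCounter(key=COUNTER_KEY)
    counter.count += delta
    counter.put()

def get_count():
    counter = COUNTER_KEY.get()
    return counter.count if counter else 0

Call increment(1) right after creating an entity and increment(-1) right after deleting one; if the write rate gets high, split the counter into shards as mentioned above.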
There's actually a better/cheaper/faster way to see the info you are looking for, but it might not work if you need to know the EXACT number of entities at any given moment, since it's only updated a couple of times a day (i.e. you can access it anytime, but it may be a few hours outdated).
The "Datastore Statistics" page in GAE dashboard displays some detailed data about kinds/entities including "count" numbers and there's a way to access it programmatically. See more info here: https://cloud.google.com/appengine/docs/python/datastore/stats