Prevent AWS Lambda code from executing multiple times - python

I have a very important webhook which calls my Lambda function. The issue is that this webhook hits my Lambda function three times with the same data. I don't want to process it three times; I want to exit if it is already being handled. I tried to store the data (paid) in DynamoDB and check whether it's already present, but that isn't working. It's as if the db is not atomic.
I call the method below before executing the code.
def check_duplicate_webhook(user_id, order_id):
    try:
        status = dynamodb_table_payment.get_item(
            Key={'user_id': user_id},
            ProjectionExpression='payments.#order_id.#pay_status',
            ExpressionAttributeNames={
                "#order_id": order_id,
                '#pay_status': "status"
            })
        if "Item" in status and "payments" in status['Item']:
            check = status['Item']['payments'][order_id]
            if check == 'paid':
                return True
        return False
    except Exception as e:
        log(e)
        return False
Updating the database
dynamodb_table_payment.update_item(
    Key={'user_id': user_id},
    UpdateExpression="SET payments.#order_id.#pay_status = :pay_status, "
                     "payments.#order_id.#update_date = :update_date, "
                     "payments.#order_id.reward = :reward_amount",
    ExpressionAttributeNames={
        "#order_id": attr['order_id'],
        '#pay_status': "status",
        '#update_date': 'updated_at'
    },
    ExpressionAttributeValues={
        ":pay_status": 'paid',
        ':update_date': int(time.time()),
        ':reward_amount': reward_amount
    })

A separate read followed by a write isn't atomic, and if the three requests come in very close together, each one can read the old value before any of them has written. For financial transactions it is recommended to use DynamoDB transactions.
May I also suggest that you use Step Functions and decouple the triggering from the actual execution. The webhook will trigger a function that registers the payment for execution; a different function will execute it. You will need some orchestration in the future, if for nothing else, to implement retry logic.
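For illustration, a rough sketch of the transactional route with boto3 (the webhook_dedup and payments table names are assumptions, and it keeps the question's single-record payments map):
import boto3

dynamodb = boto3.client('dynamodb')

def record_payment_once(user_id, order_id):
    """Return True only for the first invocation that records this order."""
    try:
        dynamodb.transact_write_items(TransactItems=[
            {
                # Idempotency marker: cancels the whole transaction if it already exists.
                'Put': {
                    'TableName': 'webhook_dedup',  # assumed dedup table
                    'Item': {'order_id': {'S': order_id}},
                    'ConditionExpression': 'attribute_not_exists(order_id)',
                },
            },
            {
                # The payment update itself, applied only if the marker was inserted.
                'Update': {
                    'TableName': 'payments',  # assumed payments table
                    'Key': {'user_id': {'S': user_id}},
                    'UpdateExpression': 'SET payments.#oid.#st = :paid',
                    'ExpressionAttributeNames': {'#oid': order_id, '#st': 'status'},
                    'ExpressionAttributeValues': {':paid': {'S': 'paid'}},
                },
            },
        ])
        return True
    except dynamodb.exceptions.TransactionCanceledException:
        return False  # a duplicate delivery already recorded this order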

You're separating the retrieve and the update, which can cause a race condition. You should be able to switch to a put_item() with a condition, which will only insert once (or optionally update if the criteria are met).
You could also use a FIFO SQS queue as an intermediary between the webhook and Lambda, and let it do the deduplication. But that's a more complex solution.
It also appears that you're storing all orders for a given customer in a single record in DynamoDB. This seems like a bad idea to me: first because you need more RCUs/WCUs to be able to retrieve larger records, second because you will eventually bump up against the size limit of a DynamoDB record, and third because it makes the update logic more complex. I think you would be better to manage orders separately, using a key of (user_id, order_id).
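A rough sketch of that approach, assuming a new payment_orders table keyed by (user_id, order_id); the condition makes only the first webhook delivery for an order succeed:
import time

import boto3
from botocore.exceptions import ClientError

orders = boto3.resource('dynamodb').Table('payment_orders')  # assumed table name

def claim_order(user_id, order_id, reward_amount):
    """Return True if this delivery is the first one seen for the order."""
    try:
        orders.put_item(
            Item={
                'user_id': user_id,
                'order_id': order_id,
                'status': 'paid',
                'updated_at': int(time.time()),
                'reward': reward_amount,
            },
            # Insert only if no item with this (user_id, order_id) exists yet.
            ConditionExpression='attribute_not_exists(order_id)',
        )
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False  # duplicate webhook delivery; skip processing
        raise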


Deleting an index in Elasticsearch then inserting items doesn't work python [duplicate]

I'm attempting to improve performance on a suite that tests against ElasticSearch.
The tests take a long time because Elasticsearch does not update its indexes immediately after an update. For instance, the following code runs without raising an assertion error.
from elasticsearch import Elasticsearch
elasticsearch = Elasticsearch('es.test')
# Assuming that this is a clean and empty elasticsearch instance
elasticsearch.update(
    index='blog',
    doc_type='blog',
    id=1,
    body={
        ....
    }
)
results = elasticsearch.search()
assert not results
# results are not populated
Currently our hacked-together solution to this issue is dropping a time.sleep call into the code, to give Elasticsearch some time to update its indexes.
from time import sleep
from elasticsearch import Elasticsearch
elasticsearch = Elasticsearch('es.test')
# Assuming that this is a clean and empty elasticsearch instance
elasticsearch.update(
    index='blog',
    doc_type='blog',
    id=1,
    body={
        ....
    }
)
# Don't want to use sleep functions
sleep(1)
results = elasticsearch.search()
assert len(results) == 1
# results are now populated
Obviously this isn't great: it's rather failure prone, since if Elasticsearch hypothetically takes longer than a second to update its indexes (however unlikely that is), the test will fail. It's also extremely slow when you're running hundreds of tests like this.
My attempt to solve the issue has been to query the pending cluster jobs to see if there are any tasks left to be done. However this doesn't work, and this code will run without an assertion error.
from elasticsearch import Elasticsearch
elasticsearch = Elasticsearch('es.test')
# Assuming that this is a clean and empty elasticsearch instance
elasticsearch.update(
    index='blog',
    doc_type='blog',
    id=1,
    body={
        ....
    }
)
# Query if there are any pending tasks
while elasticsearch.cluster.pending_tasks()['tasks']:
    pass
results = elasticsearch.search()
assert not results
# results are not populated
So basically, back to my original question: Elasticsearch updates are not immediate, so how do you wait for Elasticsearch to finish updating its index?
As of version 5.0.0, Elasticsearch has an option:
?refresh=wait_for
on the Index, Update, Delete, and Bulk APIs. This way, the request won't receive a response until the result is visible in Elasticsearch. (Yay!)
See https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-refresh.html for more information.
edit: It seems that this functionality is already part of the latest Python elasticsearch api:
https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.index
Change your elasticsearch.update to:
elasticsearch.update(
    index='blog',
    doc_type='blog',
    id=1,
    refresh='wait_for',
    body={
        ....
    }
)
and you shouldn't need any sleep or polling.
Seems to work for me:
els.indices.refresh(index)
els.cluster.health(wait_for_no_relocating_shards=True,wait_for_active_shards='all')
Elasticsearch does near-real-time search. An updated/indexed document is not immediately searchable, but only after the next refresh operation; the refresh is scheduled every 1 second.
To retrieve a document after updating/indexing it, you should use the GET API instead. By default, the get API is realtime and is not affected by the refresh rate of the index. That means that if the update/index was done correctly, you should see the modifications in the response of a GET request.
If you insist on using the SEARCH API to retrieve a document after updating/indexing, then according to the documentation there are 3 solutions:
1. Wait for the refresh interval.
2. Set the ?refresh option on the index/update/delete request.
3. Use the Refresh API to explicitly complete a refresh (POST _refresh) after the index/update request. However, please note that refreshes are resource-intensive.
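For example, roughly (reusing the index, doc_type and id from the question):
from elasticsearch import Elasticsearch

es = Elasticsearch('es.test')

# The get API is realtime: it returns the updated document even before a refresh.
doc = es.get(index='blog', doc_type='blog', id=1)
print(doc['_source'])

# If search visibility is really needed right away, force a refresh first
# (fine in tests, resource-intensive in hot paths).
es.indices.refresh(index='blog')
results = es.search(index='blog')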
If you use bulk helpers you can do it like this:
from elasticsearch.helpers import bulk
bulk(client=self.es, actions=data, refresh='wait_for')
You can also call elasticsearch.indices.refresh(index='blog') if you don't want to wait for the cluster refresh interval

How to handle creation of db entries through celery async tasks in Django

My system receives a payload from another source. This payload contains item information, including brand and such.
I save this payload inside a buffer. Buffers are processed asynchronously with Celery tasks. Based on the payload we create the entries, or update them if necessary. This is done inside an atomic transaction, in which we create the Item, the Brand and the Category.
The issue I am running into is that two buffers may both contain a Brand which is not yet created in the db. Using update_or_create inside the atomic block I check whether it already exists. Since both buffers run asynchronously at almost exactly the same time, both conclude that the Brand does not exist yet. This means both of them try to create the Brand, yielding the following database error:
postgres_1 | ERROR: duplicate key value violates unique constraint "catalogue_brand_name_key"
postgres_1 | DETAIL: Key (name)=(Bic) already exists.
postgres_1 | STATEMENT: INSERT INTO "catalogue_brand" ("id", "created_date", "modified_date", "name") VALUES ('c8e9f328-cee7-4b9b-ba45-268180c723d8'::uuid, '2018-12-28T08:08:51.519672+00:00'::timestamptz, '2018-12-28T08:08:51.519691+00:00'::timestamptz, 'Bic')
Since this is a db-level exception I am unable to catch it inside my code, and the buffer will be marked as completed (since no exception was raised inside the code).
The easy solution here is to not run the buffers async, but just one at a time.
Here is my part of the code:
@atomic
def create(self, data):
    ...
    brand_obj = Brand.objects.update_or_create(
        name=brand['name'],
        defaults={
            'name': brand['name']
        }
    )[0]
    ...
Which is called through:
process_incoming.delay(buffer.pk)
Is it possible to run my buffers async without hitting db-level constraints?
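For what it's worth, the usual way to survive this kind of race is to attempt the insert inside a nested atomic() block (a savepoint) and fall back to a read when the unique constraint fires. A rough sketch, assuming the Brand model from the error message:
from django.db import IntegrityError, transaction

from catalogue.models import Brand  # assumed import path

def get_or_create_brand(name):
    try:
        # The nested atomic() acts as a savepoint, so an IntegrityError here
        # does not break the outer transaction.
        with transaction.atomic():
            return Brand.objects.create(name=name)
    except IntegrityError:
        # Another worker created it first; just read the existing row.
        return Brand.objects.get(name=name)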

Why does Firebase event return empty object on second and subsequent events?

I have a Python Firebase SDK on the server, which writes to Firebase real-time DB.
I have a Javascript Firebase client on the browser, which registers itself as a listener for "child_added" events.
Authentication is handled by the Python server.
With Firebase rules allowing reads, the client listener gets data on the first event (all data at that FB location), but only a key with empty data on subsequent child_added events.
Here's the listener registration:
firebaseRef.on(
    "child_added",
    function(snapshot, prevChildKey)
    {
        console.log("FIREBASE REF: ", firebaseRef);
        console.log("FIREBASE KEY: ", snapshot.key);
        console.log("FIREBASE VALUE: ", snapshot.val());
    }
);
"REF" is always good.
"KEY" is always good.
But "VALUE" is empty after the first full retrieval of that db location.
I tried instantiating the firebase reference each time anew inside the listen function. Same result.
I tried a "value" event instead of "child_added". No improvement.
The data on the Firebase side looks perfect in the FB console.
Here's how the data is being written by the Python admin to firebase:
def push_value(rootAddr, childAddr, data):
    try:
        ref = db.reference(rootAddr)
        posts_ref = ref.child(childAddr)
        new_post_ref = posts_ref.push()
        new_post_ref.set(data)
    except Exception:
        raise
And as I said, this works perfectly to put the data at the correct place in FB.
Why the empty event objects after the first download of the database, on subsequent events?
I found the answer. Like most things, it turned out to be simple, but took a couple of days to find. Maybe this will save someone else.
On the docs page:
http://firebase.google.com/docs/database/admin/save-data#section-push
"In JavaScript and Python, the pattern of calling push() and then
immediately calling set() is so common that the Firebase SDK lets you
combine them by passing the data to be set directly to push() as
follows..."
I suggest the wording should emphasize that you must do it that way.
The earlier Python example on the same page doesn't work:
new_post_ref = posts_ref.push()
new_post_ref.set({
    'author': 'gracehop',
    'title': 'Announcing COBOL, a New Programming Language'
})
A separate empty push() followed by set(data), as in this example, won't work in Python and JavaScript, because in those SDKs push() implicitly also does a set(): the empty push() triggers the listeners with empty data, and the subsequent set(data) doesn't trigger an event carrying the data either.
In other words, the code in the question:
new_post_ref = posts_ref.push()
new_post_ref.set(data)
must be:
new_post_ref = posts_ref.push(data)
with set() not explicitly called.
Since this push() code happens only when new objects are written to FB, the initial download to the client wasn't affected.
Though the documentation may be trying to convey the evolution of the design, it fails to point out that only the last Python and Javascript example given will work and the others shouldn't be used.
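Applied to the question's helper, the fix would look roughly like this (assuming the same firebase_admin db module):
from firebase_admin import db

def push_value(rootAddr, childAddr, data):
    # push(data) creates the child and sets its value in a single write,
    # so listeners fire once, with the full data.
    ref = db.reference(rootAddr)
    posts_ref = ref.child(childAddr)
    return posts_ref.push(data)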

redis block until key exists

I'm new to Redis and was wondering if there is a way to await getting a value by its key until the key exists. Minimal code:
async def handler():
    data = await self._fetch(key)

async def _fetch(key):
    return self.redis_connection.get(key)
As you know, if such a key doesn't exist, it returns None. But since, in my project, setting the key-value pair in Redis takes place in another application, I want the redis_connection get method to block until the key exists.
Is such an expectation even valid?
It is not possible to do what you are trying to do without implementing some sort of polling of the Redis GET on your client. In that case your client would have to do something like:
import asyncio

async def _fetch(key):
    val = self.redis_connection.get(key)
    while val is None:
        # Sleep and retry here
        await asyncio.sleep(1)
        val = self.redis_connection.get(key)
    return val
However, I would ask you to completely reconsider the pattern you are using for this problem.
It seems to me that what you need is something like Pub/Sub: https://redis.io/topics/pubsub.
So the app that performs the SET becomes a publisher, and the app that does the GET and waits until the key is available becomes the subscriber.
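A rough synchronous sketch of that split with plain redis-py (the channel name key-ready is made up):
import redis

r = redis.Redis()

def wait_for_key(key, channel='key-ready'):
    """Subscriber side: block until the publisher announces the key."""
    p = r.pubsub(ignore_subscribe_messages=True)
    p.subscribe(channel)
    for message in p.listen():  # blocks until a message arrives
        return r.get(key)

def publish_key(key, value, channel='key-ready'):
    """Publisher side (the other application): set the key, then announce it."""
    r.set(key, value)
    r.publish(channel, key)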
I did a bit of research on this and it looks like you can do it with asyncio_redis:
Subscriber https://github.com/jonathanslenders/asyncio-redis/blob/b20d4050ca96338a129b30370cfaa22cc7ce3886/examples/pubsub/receiver.py.
Sender(Publisher): https://github.com/jonathanslenders/asyncio-redis/blob/b20d4050ca96338a129b30370cfaa22cc7ce3886/examples/pubsub/sender.py
Hope this helps.
Besides the keyspace notification method mentioned by @Itamar Haber, another solution is to use blocking operations on a LIST.
The handler method calls BRPOP on an empty LIST: BRPOP notify-list timeout, and blocks until notify-list is NOT empty.
The other application pushes the value to the LIST when it finishes setting the key-value pair as usual: SET key value; LPUSH notify-list value.
The handler wakes from the blocking operation with the value you want, and notify-list is destroyed by Redis automatically.
The advantage of this solution is that you don't need to modify your handler method too much (with the keyspace notification solution, you need to register a callback function). The disadvantage is that you have to rely on the notification from the other application (with the keyspace notification solution, Redis does the notification automatically).
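Roughly, with redis-py (names taken from the commands above):
import redis

r = redis.Redis()

# The other application: set the key, then push a notification.
r.set('key', 'value')
r.lpush('notify-list', 'value')

# The handler: block until something is pushed; timeout=0 waits forever.
_, value = r.brpop('notify-list', timeout=0)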
The closest you can get to this behavior is by enabling keyspace notifications and subscribing to the relevant channels (possibly by pattern).
Note, however, that notifications rely on PubSub that is not guaranteed to deliver messages (at-most-once semantics).
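A rough sketch of that approach with redis-py, assuming db 0 and a key named mykey:
import redis

r = redis.Redis()

# Keyspace notifications are off by default; enable them here or in redis.conf.
r.config_set('notify-keyspace-events', 'KEA')

p = r.pubsub(ignore_subscribe_messages=True)
p.subscribe('__keyspace@0__:mykey')  # events for key 'mykey' in db 0

for message in p.listen():
    if message['data'] == b'set':  # the other application just ran SET mykey ...
        value = r.get('mykey')
        break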
Since Redis 5.0 there is a built-in stream type which supports blocking reads. The following is sample code with redis-py.
# add a value to my_stream
redis.xadd('my_stream', {'key': 'str_value'})
# read from the beginning of the stream
last_id = '0'
# blocking read until there is a value
last_stream_item = redis.xread({"my_stream": last_id}, block=0)
# update last_id
last_id = last_stream_item[0][1][0][0]
# wait for the next value to arrive on the stream
last_stream_item = redis.xread({"my_stream": last_id}, block=0)

Redis, only allow operation on existing keys

I am using the Python package (redis-py) to operate on the Redis database. I have a bunch of clients that set keys and values of a hash in Redis. I want them to set keys and values only when the hash exists. If the hash doesn't exist, setting keys and values will create the hash, which is not what I want to do.
In the redis-py page (https://github.com/andymccurdy/redis-py), the author suggested a way to do atomic operations on the client side. So I wrote a similar function:
with r.pipeline() as pipe:
    while True:
        try:
            pipe.watch("a_hash")
            if pipe.exists("a_hash"):
                pipe.hset("a_hash", "key", "value")
            break
        except redis.WatchError:
            continue
        finally:
            pipe.reset()
However, this does not seem to work. After I delete the hash from another client, the hash still gets created by this piece of code, so I guess this piece of code is not an atomic operation. Could someone help me identify the problem with this code? Or is there a better way to achieve this purpose?
Appreciate your help!
I would suggest reading the definition of a WATCH/MULTI/EXEC block as explained in the Redis documentation.
In such a block, only the commands between MULTI and EXEC are actually processed atomically (and conditionally, with an all-or-nothing semantic depending on the watch).
In your example, the EXISTS and HSET commands are not executed atomically. Actually, you don't need this atomicity: what you want is conditional execution.
This should work better:
with r.pipeline() as pipe:
    while True:
        try:
            pipe.watch("a_hash")
            if pipe.exists("a_hash"):
                pipe.multi()
                pipe.hset("a_hash", "key", "value")
                pipe.execute()
            break
        except redis.WatchError:
            continue
        finally:
            pipe.reset()
If the key is deleted after the EXISTS but before the MULTI, the HSET will not be executed, thanks to the watch.
With Redis 2.6, a Lua server-side script is probably easier to write, and more efficient.
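For example, a small sketch with redis-py's register_script; the script does the HSET only when the hash already exists, atomically on the server:
import redis

r = redis.Redis()

# Returns 1 if the field was written, 0 if the hash did not exist.
hset_if_exists = r.register_script("""
if redis.call('EXISTS', KEYS[1]) == 1 then
    redis.call('HSET', KEYS[1], ARGV[1], ARGV[2])
    return 1
end
return 0
""")

hset_if_exists(keys=['a_hash'], args=['key', 'value'])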
