Prevent multiple services from accessing the same document in MongoDB until processing is done - python

I have multiple instances of a service. This service accesses a collection of unprocessed documents. We are using MongoDB. The role of the service is:
Fetch the first unprocessed document from collection A.
Make a REST call using its uuid.
Get the response and store it in another collection B.
Multiple services may access the same document, leading to duplicates. The approaches I can think of to deal with this situation are:
findAndModify() along with a progress field. We call this function with a query for the progress field equal to "0" and update the value to 1 so that other services cannot access the document. On success of the REST call we delete the record; on failure we call findAndModify() again to set the value back to "0" so another service can pick it up later.
We call find(), which gives us one document. We take the "_id" of that document and store it in another collection. If another service gets the same document and that "_id" is already present, it is not inserted again, and that service calls find() again.
What would be the performance and bottlenecks of these approaches? Is there a better approach that would improve performance?

findAndModify() along with a progress field. We call this function with a query for the progress field equal to "0" and update the value to 1 so that other services cannot access the document. On success of the REST call we delete the record; on failure we call findAndModify() again to set the value back to "0" so another service can pick it up later.
This solution is OK and is simpler and faster than the second.
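A minimal sketch of that first approach using pymongo's find_one_and_update (the driver-level form of findAndModify); the collection names, the integer progress values, and the call_rest_service() helper are assumptions for illustration, not part of the question:
from pymongo import MongoClient

client = MongoClient()
coll_a = client.mydb.collection_a   # unprocessed documents
coll_b = client.mydb.collection_b   # stores REST responses

# Atomically claim one unprocessed document (progress 0 -> 1).
doc = coll_a.find_one_and_update(
    {'progress': 0},
    {'$set': {'progress': 1}},
)
if doc is not None:
    try:
        response = call_rest_service(doc['uuid'])  # hypothetical REST helper
        coll_b.insert_one({'uuid': doc['uuid'], 'response': response})
        coll_a.delete_one({'_id': doc['_id']})     # success: remove the record
    except Exception:
        # Failure: release the claim so another service can retry later.
        coll_a.update_one({'_id': doc['_id']}, {'$set': {'progress': 0}})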
We call find(), which gives us one document. We take the "_id" of that document and store it in another collection. If another service gets the same document and that "_id" is already present, it is not inserted again, and that service calls find() again.
This solution (as you presented it) would not work unless the document is removed from the first collection after it is inserted into the second; otherwise the processes will be caught in an infinite loop. The drawback is that this solution requires two collections and a document-moving process.
In any case, you must take into account the failure of the processes in any of the phases and you must be able to detect failure and recover from it.

Related

Fetching entire changelog for an issue in JIRA using jira-python

Using jira-python, I want to retrieve the entire changelog for a JIRA issue:
issues_returned = jira.search_issues(args.jql, expand='changelog')
I discovered that for issues with more than 100 entries in their changelog I am only receiving the first 100.
My question is how do I specify a startAt and make another call to get subsequent pages of the changelog (using python-jira)?
From this thread at Atlassian I see that API v3 provides an endpoint to get the change log directly:
/rest/api/3/issue/{issueIdOrKey}/changelog
but this doesn't seem to be accessible via jira-python. I'd like to avoid having to do the REST call directly and authenticate separately. Barring a way to do it directly via jira-python, is there a way to make a 'raw' REST API call from jira-python?
In instances where more than 100 results are present, you'll need to edit the 'startAt' parameter when searching issues:
issues_returned = jira.search_issues(args.jql, expand='changelog', startAt=100)
You'll need to set up a check that compares the 'total' and 'maxResults' values, then run another query with a different 'startAt' parameter if the total is higher, and append the results together.
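A minimal pagination sketch along those lines, assuming jira is an authenticated jira.JIRA client and args.jql holds the JQL string (the page size the server actually uses may vary):
all_issues = []
start_at = 0
while True:
    # The returned ResultList exposes 'total' (number of matching issues).
    page = jira.search_issues(args.jql, expand='changelog', startAt=start_at)
    all_issues.extend(page)
    start_at += len(page)
    if start_at >= page.total:
        break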

DynamoDB consistent read results in schema error

I am trying to interact with a DynamoDB table from Python using boto. I want all reads/writes to use quorum consistency to ensure that reads sent out immediately after writes always reflect the correct data.
NOTE: my table is set up with "phone_number" as the hash key and first_name+last_name as a secondary index. For the purposes of this question, one (and only one) item exists in the DB (first_name="Paranoid", last_name="Android", phone_number="42").
The following code works as expected:
customer = customers.get_item(phone_number="42")
While this statement:
customer = customers.get_item(phone_number="42", consistent_read=True)
fails with the following error:
boto.dynamodb2.exceptions.ValidationException: ValidationException: 400 Bad Request
{u'message': u'The provided key element does not match the schema', u'__type': u'com.amazon.coral.validate#ValidationException'}
Could this be the result of some hidden data corruption due to failed requests in the past (for example, two concurrent, conflicting writes executed with eventual consistency)?
Thanks in advance.
It looks like you are calling the get_item method, so the issue is with how you are passing parameters.
get_item(hash_key, range_key=None, attributes_to_get=None, consistent_read=False, item_class=<class 'boto.dynamodb.item.Item'>)
Which would mean you should be calling the API like:
customer = customers.get_item(hash_key="42", consistent_read=True)
I'm not sure why the original call you were making was working.
To address your concerns about data corruption and eventual consistency: it is highly unlikely that any API call you could make to DynamoDB would leave it in a bad state, short of you sending it bad data for an item. DynamoDB is a highly tested solution that provides exceptional availability and goes to extraordinary lengths to take care of the data you send it.
Eventual consistency is something to be aware of with DynamoDB, but generally speaking it does not cause many issues, depending on the specifics of the use case. While AWS does not publish specific figures on what "eventually consistent" looks like, in day-to-day use it is normal to be able to read back records that were just written or modified in under a second, even with eventually consistent reads.
As for performing multiple writes simultaneously on the same object, DynamoDB writes are always strongly consistent. You can use conditional writes if you are worried about an individual item being modified at the same time and causing unexpected behavior; a conditional write fails when its condition is not met, and your application logic can then deal with any issues that arise.
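As a rough illustration of a conditional write with the same boto.dynamodb2 layer the question uses (the table name and attribute values are assumptions), overwrite=False makes put_item fail instead of silently replacing an existing item:
from boto.dynamodb2.table import Table
from boto.dynamodb2.exceptions import ConditionalCheckFailedException

customers = Table('customers')  # hypothetical table name

try:
    # overwrite=False adds a "does not already exist" condition on the key,
    # so the put fails if another writer already created this item.
    customers.put_item(data={
        'phone_number': '42',
        'first_name': 'Paranoid',
        'last_name': 'Android',
    }, overwrite=False)
except ConditionalCheckFailedException:
    # Another writer got there first; resolve the conflict in application logic.
    pass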

Forcing a sqlalchemy ORM get() outside identity map

Background
The get() method is special in SQLAlchemy's ORM because it tries to return objects from the identity map before issuing a SQL query to the database (see the documentation).
This is great for performance, but can cause problems for distributed applications because an object may have been modified by another process, so the local process has no ability to know that the object is dirty and will keep retrieving the stale object from the identity map when get() is called.
Question
How can I force get() to ignore the identity map and issue a call to the DB every time?
Example
I have a Company object defined in the ORM.
I have a price_updater() process which updates the stock_price attribute of all the Company objects every second.
I have a buy_and_sell_stock() process which buys and sells stocks occasionally.
Now, inside this process, I may have loaded a microsoft = Company.query.get(123) object.
A few minutes later, I may issue another call for Company.query.get(123). The stock price has changed since then, but my buy_and_sell_stock() process is unaware of the change because it happened in another process.
Thus, the get(123) call returns the stale version of the Company from the session's identity map, which is a problem.
I've done a search on SO (under the [sqlalchemy] tag) and read the SQLAlchemy docs to try to figure out how to do this, but haven't found a way.
Using session.expire(my_instance) will cause the data to be re-selected on access. However, even if you use expire (or expunge), the next data that is fetched will be based on the transaction isolation level. See the PostgreSQL docs on isolation levels (they apply to other databases as well) and the SQLAlchemy docs on setting isolation levels.
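A minimal sketch, assuming session is the Session behind Company.query and microsoft is the instance returned by an earlier get(123):
session.expire(microsoft)      # mark all attributes stale
print(microsoft.stock_price)   # attribute access triggers a fresh SELECT

# Or re-select immediately:
session.refresh(microsoft)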
You can test if an instance is in the session with in: my_instance in session.
You can use filter instead of get to bypass the cache, but it still has the same isolation level restriction.
Company.query.filter_by(id=123).one()

GAE datastore how to poll for new items

I use python, ndb and the datastore. My model ("Event") has a property:
created = ndb.DateTimeProperty(auto_now_add=True).
Events get saved every now and then, sometimes several within one second.
I want to "poll for new events", without getting the same Event twice, and get an empty result if there aren't any new Events. However, polling again might give me new events.
I have seen Cursors, but I don't know if they can be used somehow to poll for new Events after having reached the end of the first query. The "next_cursor" is None when I've reached the (current) end of the data.
Keeping the last received "created" DateTime property and using it to fetch the next batch works, but that only has a resolution of seconds, so the ordering might get screwed up.
Must I create my own transactional, incrementing counter in Event for this?
Yes, using cursors is a valid option. Even though this link is from the Java documentation, it is valid for Python also. The second paragraph is what you are looking for:
An interesting application of cursors is to monitor entities for unseen changes. If the app sets a timestamp property with the current date and time every time an entity changes, the app can use a query sorted by the timestamp property, ascending, with a Datastore cursor to check when entities are moved to the end of the result list. If an entity's timestamp is updated, the query with the cursor returns the updated entity. If no entities were updated since the last time the query was performed, no results are returned, and the cursor does not move.
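A minimal polling sketch of that idea with ndb; it assumes the Event model from the question and that the caller keeps cursor between polls (e.g. in memcache or on the client):
from google.appengine.ext import ndb

class Event(ndb.Model):
    created = ndb.DateTimeProperty(auto_now_add=True)

def poll_new_events(cursor=None):
    query = Event.query().order(Event.created)
    events, next_cursor, more = query.fetch_page(20, start_cursor=cursor)
    # With no new Events, 'events' is empty; reuse the previous cursor on the
    # next poll so nothing is skipped or returned twice.
    return events, next_cursor or cursor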
EDIT: Prospective search was shut down on December 1, 2015.
Rather than polling, an alternative approach would be to use prospective search:
https://cloud.google.com/appengine/docs/python/prospectivesearch/
From the docs
"Prospective search is a querying service that allows your application
to match search queries against real-time data streams. For every
document presented, prospective search returns the ID of every
registered query that matches the document."

Which data load method is best for performance?

For example, I have a user object stored in a database (Redis).
It has several fields:
String nick
String password
String email
List posts
List comments
Set followers
and so on...
In my Python program I have a class (User) with the same fields for this object. Instances of this class map to objects in the database. The question is how to get data from the DB for the best performance:
Load the values for every field when the instance is created and initialize the fields with them.
Load a field's value each time it is requested.
Same as the second, but after loading, replace the field property with the loaded value.
P.S. Redis runs on localhost.
The method entirely depends on the requirements.
If there is only one client reading and modifying the properties, this is a rather simple problem. When modifying data, just change the instance attributes in your current Python program and -- at the same time -- keep the DB in sync while keeping your program responsive. To that end, you should outsource blocking calls to another thread or make use of greenlets. If there is only one client, there definitely is no need to fetch a property from the DB on each value lookup.
If there are multiple clients reading the data and only one client modifying the data, you have to think about which level of synchronization you need. If you need 100 % synchronization, you will have to fetch data from the DB on each value lookup.
If there are multiple clients changing the data in the database, you should look into a rock-solid, industry-standard solution rather than writing your own DB cache/mapper.
Your distinction between (2) and (3) does not really make sense. If you fetch data on every lookup, there is no need to 'store' data. You see, if there can be multiple clients involved, these things quickly become quite complex and it's really hard to get right.
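As a sketch of option (3), lazy-loading a field on first access and caching it on the instance; the key layout ('user:<nick>' hashes) and the redis-py client setup are assumptions, not given in the question:
import redis

r = redis.Redis(host='localhost')  # per the question, Redis runs on localhost

class User(object):
    def __init__(self, nick):
        self.nick = nick
        self._email = None

    @property
    def email(self):
        # Fetch from Redis only on first access, then reuse the cached value.
        if self._email is None:
            self._email = r.hget('user:%s' % self.nick, 'email')
        return self._email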
