"Get" document from cosmosdb by id (not knowing the _rid) - python

MS Support recently told me that a "GET" is much more efficient in RU usage than a SQL query. I'm wondering whether I can (with the azure.cosmos Python package, or a custom HTTP request to the REST API) get a document by its unique 'id' field (for which I generate GUIDs) without an SQL query.
Every example I've seen uses the link/path of the doc, which is built from the document's '_rid' metadata and not the 'id' field set when creating the doc.
I use a bulk upsert stored procedure I wrote to create my new documents and never retrieve the metadata for each one of them (I have ~100 million docs), so retrieving the _rid would be equivalent to retrieving the doc itself.

The reason that the ReadDocument method is so much more efficient than a SQL query is because it uses _rid instead of a user generated field, even the required id field. This is because the _rid isn't just a unique value, it also encodes information about where that document is physically stored.
To give an example of how this works, let's say you are explaining to someone where a party is this weekend. You could use the name that you use for the house "my friend Ryan's house" or you could use the address "123 ThatOne Street Somewhere, WA 11111". They both are unique identifiers, but for someone trying to get there one is way more efficient than the other.
Telling someone to go to your friend's house is like using your own id. It does map to a specific house, but the person will still need to find out where that physically is to get there. Using the address is like working with the _rid field. Based on that information alone they can get to the party location. Of course, in the real world the person would probably need directions, but the data storage in a database is a lot more organized than most city streets so an address is sufficient to go retrieve the document.
If you want to take advantage of this method you will need to find a way to work with the _rid field.
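For completeness, here is a minimal sketch of a point read with the azure-cosmos 4.x Python SDK, which reads a document by its user-defined id plus its partition key value without issuing a SQL query; the endpoint, key, database/container names and partition key value are placeholders, not taken from the question:

import os
from azure.cosmos import CosmosClient

# Placeholder endpoint/key; use your own account credentials
client = CosmosClient(os.environ["COSMOS_ENDPOINT"], credential=os.environ["COSMOS_KEY"])
container = client.get_database_client("mydb").get_container_client("mycoll")

# Point read keyed on the 'id' field (a GUID here) and the document's partition key value
doc = container.read_item(item="my-guid-value", partition_key="some-partition-value")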

Related

Sorting and Filtering multiple queries of the same collection in Firestore

I'm new to Cloud Firestore and I'm trying to make queries as efficient as possible, but I'm getting kind of desperate with a specific one. I would greatly appreciate your help.
This is the situation:
I want to show a project list that I build from a user field and two queries on the project collection. The user field, let's call it "favorite projects", holds the project ids that reference those projects in their collection. One query retrieves the public projects (==) and the other the private projects where the user is a contributor (array_contains).
I want to sort and filter the combined result of these queries. Is there an option to merge the queries and use sort and filter on them as we do with a collection reference?
Thank you for your time, have a nice day!
Based on this and this documentation, I do not believe there is an out of the box solution for joining the results of queries such as the ones described.
You'll need to achieve that within your own code.
For example, you can run the first query and store all the document data in a map or array. Then use the document references it contains (document_reference) to make the second and third queries.
Once you have all of them you can combine them however you please in Python, but getting them in a single query, or auto-joining the queries, does not seem to be supported yet.
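A rough sketch of doing that merge client-side in Python with the google-cloud-firestore library; the field names ("visibility", "contributors", "name") and the favorite ids are illustrative, not from the question:

from google.cloud import firestore

db = firestore.Client()
uid = "some-user-id"                      # placeholder user id
favorite_ids = ["projectA", "projectB"]   # ids stored in the user's "favorite projects" field

merged = {}

# Query 1: public projects (==)
for doc in db.collection("projects").where("visibility", "==", "public").stream():
    merged[doc.id] = doc.to_dict()

# Query 2: private projects where the user is a contributor (array_contains)
for doc in db.collection("projects").where("contributors", "array_contains", uid).stream():
    merged[doc.id] = doc.to_dict()

# Query 3: the favorite projects, fetched by document id
for pid in favorite_ids:
    snap = db.collection("projects").document(pid).get()
    if snap.exists:
        merged[snap.id] = snap.to_dict()

# Sort and filter the combined result in code, since Firestore cannot join the queries
projects = sorted(merged.values(), key=lambda p: p.get("name", ""))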

How to retrieve objects from the SoftLayer saved quote using Python API

I'm trying to retrieve the objects/items (server name, host name, domain name, location, etc...) that are stored under a saved quote for a particular SoftLayer account. Can someone explain how to retrieve the objects within a quote? I could find a REST API (Python) to retrieve quote details (quote ID, status, etc.) but couldn't find a way to fetch the objects within a quote.
Thanks!
Best regards,
Khelan Patel
If you are trying to retrieve the same order information structure you sent when placing the quote, then you need to use the method getRecalculatedOrderContainer. It should return the packageId, presetId, location, item prices, etc., but as far as I know, the hostname, domain, sshKeys, provisionScripts and vlans aren't in the quote, since those values can change over time, either because the user supplies new values before placing the order or to avoid system errors due to the availability of resources like vlans and subnets.
https://[username]:[apikey]@api.softlayer.com/rest/v3/SoftLayer_Billing_Order_Quote/[quoteID]/getRecalculatedOrderContainer
Method: GET
Now, if you want to retrieve the orderId, items, etc., you need to use the object-mask feature. Whether you use Account::getQuotes or SoftLayer_Billing_Order_Quote::getObject, both return the datatype SoftLayer_Billing_Order_Quote, either in a list or as a single object.
Account::getQuotes
https://[username]:[apikey]@api.softlayer.com/rest/v3/SoftLayer_Account/getQuotes?objectMask=mask[id,name,order[id,status,items[id,description,domainName,hostName,location]]]
Method: GET
SoftLayer_Billing_Order_Quote::getObject
https://[username]:[apikey]@api.softlayer.com/rest/v3/SoftLayer_Billing_Order_Quote/[quoteID]/getObject?objectMask=mask[id,name,order[id,status,items[id,description,location]]]
Method: GET
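If you use the SoftLayer Python client instead of raw REST, the equivalent calls might look roughly like this (a sketch; the quote id is a placeholder and the object mask mirrors the ones above):

import SoftLayer

# Reads SL_USERNAME / SL_API_KEY from the environment (or pass them explicitly)
client = SoftLayer.create_client_from_env()

mask = "mask[id,name,order[id,status,items[id,description,domainName,hostName,location]]]"

# Account::getQuotes with the object mask
quotes = client["Account"].getQuotes(mask=mask)

# SoftLayer_Billing_Order_Quote::getObject for one quote (placeholder id)
quote = client["Billing_Order_Quote"].getObject(id=1234567, mask=mask)

# Recalculated order container for the same quote
container = client["Billing_Order_Quote"].getRecalculatedOrderContainer(id=1234567)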
References:
https://softlayer.github.io/reference/services/SoftLayer_Account/getQuotes/
https://softlayer.github.io/reference/services/SoftLayer_Billing_Order_Quote/
https://softlayer.github.io/reference/datatypes/SoftLayer_Billing_Order_Quote/
Thanks Albert, getRecalculatedOrderContainer is the thing I was looking for.

Method to insert an array or dictionary into db database in Google App Engine - Python

I am new to Python (and OOP) and working on a challenging project (and first post here!). I tried searching, but could not find anything of use, or perhaps did not know what to search for.
Here is what I want to do:
I have two tables in db (from google.appengine.ext). One is "fruits", with the name of each fruit and its nutrition info, and the other is "user", in which I want to store two columns: uid and favFruits (the fruits they like, their score on a 5-star scale, and a comment). The problem I'm having is that each user (uid row) can have multiple fruits they like and comment on. The favFruits will be shown on the user's profile, and when the link is clicked it goes to the nutrition page.
Example:
"favFruits":[
{
"fruit":"fuji apple"
"score":"4"
"comment":"Delicious. Bit tart, but very sweet"
},
{
"fruit":"orange"
"score":"5"
"comment":"I just love it!"
}
]
What would be the best method to store this in the Google App Engine Datastore? Currently I am using db.StringListProperty() with only favFruits['fruit'] as the list input, which does not include ['score'] or ['comment']. What I really want is to store a 2D table inside a column (an array in a column of db) that is efficiently searchable once the uid is identified.
Is JSON the best approach? What about concatenating all three fields into a single string and storing the list (like [u'orange,5,I just love it!'])?
If there is a better approach, please let me know! I am stuck... Any help is greatly appreciated!
Thank you
My first piece of advice: use NDB. With NDB you get structured and JSON properties, built-in caching and much more: https://developers.google.com/appengine/docs/python/ndb/
You can use an NDB structured repeated property for favFruits.
https://developers.google.com/appengine/docs/python/ndb/properties#structured
A JSON property (a blob) cannot be used in a query.
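A minimal sketch of the repeated StructuredProperty approach; the model and property names (FavFruit, User, fav_fruits) are illustrative, not from the question:

from google.appengine.ext import ndb

class FavFruit(ndb.Model):
    fruit = ndb.StringProperty()
    score = ndb.IntegerProperty()     # or StringProperty if you keep "4" as a string
    comment = ndb.StringProperty()

class User(ndb.Model):
    uid = ndb.StringProperty()
    fav_fruits = ndb.StructuredProperty(FavFruit, repeated=True)

# StructuredProperty sub-fields remain queryable, e.g. all users who like oranges:
orange_fans = User.query(User.fav_fruits.fruit == "orange").fetch()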

How do I transform every doc in a large Mongodb collection without map/reduce?

Apologies for the longish description.
I want to run a transform on every doc in a large-ish MongoDB collection with roughly 10 million records (about 10 GB). Specifically, I want to apply a geoip transform to the ip field in every doc and either append the resulting record to that doc or just create a whole other record linked to this one by, say, id (the linking is not critical, I can just create a whole separate record). Then I want to count and group by, say, city (I do know how to do that last part).
The major reason I believe I cant use map-reduce is I can't call out to the geoip library in my map function (or at least that's the constraint I believe exists).
So the central question is: how do I run through each record in the collection and apply the transform in the most efficient way?
Batching via limit/skip is out of the question, as it does a "table scan" and gets progressively slower.
Any suggestions?
Python or Js preferred just bec I have these geoip libs but code examples in other languages welcome.
Since you have to go over "each record", you'll do one full table scan anyway, so a simple cursor (find()), perhaps fetching only a few fields (_id, ip), should do it. The Python driver will do the batching under the hood, so you can give it a hint about the optimal batch size (batch_size) if the default is not good enough.
If you add a new field and it doesn't fit the previously allocated space, Mongo will have to move the document somewhere else, so you might be better off creating a new document.
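A sketch of that cursor approach with pymongo; the collection names and the geoip_lookup() helper are placeholders standing in for whatever geoip library you already have:

from pymongo import MongoClient

def geoip_lookup(ip):
    # Stand-in for your geoip library call, e.g. returning {"city": ..., "country": ...}
    raise NotImplementedError

client = MongoClient()
source = client.mydb.events          # placeholder collection with the 'ip' field
target = client.mydb.events_geo      # separate collection for the transformed records

# Single full scan, projecting only the fields we need; hint the batch size if useful
cursor = source.find({}, {"_id": 1, "ip": 1}).batch_size(1000)
for doc in cursor:
    geo = geoip_lookup(doc["ip"])
    # Write a new linked record rather than growing the original document in place
    target.insert_one({"source_id": doc["_id"], "ip": doc["ip"], "geo": geo})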
Actually, I am also attempting another approach in parallel (as plan B), which is to use mongoexport with --csv to dump a large CSV file with just the (id, ip) fields. Then the plan is to use a Python script to do a geoip lookup and post the results back to Mongo as new docs, on which map-reduce can then be run for counts etc. Not sure whether this or the cursor is faster. We'll see.

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of their must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", and that built a list of possibly interesting searches to run; then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
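As a toy illustration of that "database of queries" idea in Python, assuming each stored query exposes its must-have terms (the names and data here are made up):

from collections import defaultdict

stored_queries = {
    "q1": {"must": {"python", "mongodb"}},
    "q2": {"must": {"django", "signals"}},
}

# Invert the stored queries: term -> ids of queries that mention it
term_index = defaultdict(set)
for qid, q in stored_queries.items():
    for term in q["must"]:
        term_index[term].add(qid)

def candidate_queries(doc_terms):
    """Use the new doc's term list as a query against the index of stored queries."""
    hits = set()
    for term in doc_terms:
        hits |= term_index.get(term, set())
    # Only these candidates need to be executed as full queries against the new doc
    return hits

print(candidate_queries({"python", "flask", "mongodb"}))   # -> {'q1'}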
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late '80s/early '90s.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at the last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
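A rough sketch of that signal-based approach, combining it with the pickled Q objects mentioned in the question; the SavedSearch model and its field names are illustrative, not taken from the post:

import pickle

from django.contrib.contenttypes.models import ContentType
from django.db import models
from django.db.models.signals import post_save
from django.dispatch import receiver

class SavedSearch(models.Model):
    user_id = models.IntegerField()
    content_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    pickled_q = models.BinaryField()   # the pickled Q object for this saved search

def matches(saved_search, instance):
    q = pickle.loads(saved_search.pickled_q)
    # Run the stored Q against just this one new object
    return type(instance).objects.filter(q, pk=instance.pk).exists()

@receiver(post_save)
def check_saved_searches(sender, instance, created, **kwargs):
    if sender is SavedSearch or not created:
        return
    ct = ContentType.objects.get_for_model(sender)
    # Only the searches registered for this object type are loaded and run
    for saved in SavedSearch.objects.filter(content_type=ct):
        if matches(saved, instance):
            pass  # send the notification (email, message queue, etc.) here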
