Flask-MongoEngine & PyMongo Aggregation Query - python

I am trying to make an aggregation query using flask-mongoengine, and from what I have read it does not sound like it is possible.
I have looked over several forum threads, e-mail chains and a few questions on Stack Overflow, but I have not found a really good example of how to implement aggregation with flask-mongoengine.
There is a comment in this question that says you have to use "raw pymongo and aggregation functionality." However, there are no examples of how that might work. I have tinkered with Python and have a basic application up using the Flask framework, but delving into full-fledged applications and connecting/querying to Mongo is pretty new to me.
Can someone provide an example (or link to an example) of how I might utilize my flask-mongoengine models, but query using the aggregation framework with PyMongo?
Will this require two connections to MongoDB (one for PyMongo to perform the aggregation query, and a second for the regular query/insert/updating via MongoEngine)?
An example of the aggregation query I would like to perform is as follows (this query gets me exactly the information I want in the Mongo shell):
db.entry.aggregate([
    { '$group' : {
        '_id' : { 'carrier' : '$carrierA', 'category' : '$category' },
        'count' : { '$sum' : 1 }
    }}
])
An example of the output from this query:
{ "_id" : { "carrier" : "Carrier 1", "category" : "XYZ" }, "count" : 2 }
{ "_id" : { "carrier" : "Carrier 1", "category" : "ABC" }, "count" : 4 }
{ "_id" : { "carrier" : "Carrier 2", "category" : "XYZ" }, "count" : 31 }
{ "_id" : { "carrier" : "Carrier 2", "category" : "ABC" }, "count" : 6 }

The class you define with MongoEngine actually has a _get_collection() method which gets the "raw" collection object as implemented in the pymongo driver.
I'm just using the name Model here as a placeholder for your actual class defined for the connection in this example:
Model._get_collection().aggregate([
    { '$group' : {
        '_id' : { 'carrier' : '$carrierA', 'category' : '$category' },
        'count' : { '$sum' : 1 }
    }}
])
So you can always access the pymongo objects without establishing a separate connection; MongoEngine is itself built upon pymongo.
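For illustration, here is a minimal sketch of how that might look in a small Flask-MongoEngine application. The Entry model, its fields, and the database name are assumptions inferred from the query in the question, not code from the original answer:

import json
from flask import Flask
from flask_mongoengine import MongoEngine

app = Flask(__name__)
app.config['MONGODB_SETTINGS'] = {'db': 'mydb'}  # hypothetical database name
db = MongoEngine(app)

class Entry(db.Document):
    # Hypothetical fields inferred from the aggregation in the question
    carrierA = db.StringField()
    category = db.StringField()

pipeline = [
    {'$group': {
        '_id': {'carrier': '$carrierA', 'category': '$category'},
        'count': {'$sum': 1}
    }}
]

# _get_collection() returns the underlying pymongo Collection, so this reuses
# the connection MongoEngine already holds rather than opening a new one.
# With a reasonably recent pymongo, aggregate() yields plain result documents.
for doc in Entry._get_collection().aggregate(pipeline):
    print(doc)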

aggregate() has been available on querysets since MongoEngine 0.9.
Link to the API Reference.
As there is no example whatsoever around, here is how you perform an aggregation query using the aggregation framework with MongoEngine >= 0.9:
pipeline = [
    { '$group' : {
        '_id' : { 'carrier' : '$carrierA', 'category' : '$category' },
        'count' : { '$sum' : 1 }
    }}
]
Model.objects().aggregate(*pipeline)
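The call returns raw result documents (plain dicts) rather than model instances, so in a Flask view you might consume it along these lines. This is only a sketch; the route name and the jsonify wrapping are assumptions (a Flask app object named app is assumed), while Model and pipeline are the ones from the snippet above:

from flask import jsonify

@app.route('/carrier-counts')
def carrier_counts():
    # Each result looks like the shell output in the question,
    # e.g. {'_id': {'carrier': 'Carrier 1', 'category': 'XYZ'}, 'count': 2}
    results = list(Model.objects().aggregate(*pipeline))
    return jsonify(results=results)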

Related

How to select a particular value/attribute in json data via python?

I have some JSON data which I want to load and inspect in Python. I know Python has a few different ways to handle JSON. If I want to see what the author name is in the following JSON data, how can I directly select the value of name inside author, without having to iterate, if there are multiple topic/blog entries in the data?
{
    "topic": {
        "language": "JSON"
    },
    "blog": [
        {
            "author": {
                "name": "coder"
            }
        }
    ]
}
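For what it's worth, a minimal sketch of one way to pull those values out with Python's standard json module; the raw string below just embeds the sample data from the question, and the list comprehension covers the case of multiple blog entries:

import json

raw = '''{
    "topic": {"language": "JSON"},
    "blog": [{"author": {"name": "coder"}}]
}'''

data = json.loads(raw)

# Direct access when you know there is a single blog entry
first_author = data["blog"][0]["author"]["name"]

# A comprehension handles any number of blog entries without an explicit loop
author_names = [entry["author"]["name"] for entry in data["blog"]]
print(first_author, author_names)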

PyMongo MongoDB - Long DB Query

I've got a fairly long MongoDB query which I have been using in the console:
db.addresses.find({
    $and: [
        { "date": { $gte: "2017-06-01" } },
        { "date": { $lt: "2017-06-60" } },
        { $or: [
            { "address.postal_code": { $regex: /^SW1 /i } },
            { "address.postal_code": { $regex: /^SW2 /i } },
            { "address.postal_code": { $regex: /^SW3 /i } },
            { "address.postal_code": { $regex: /^SW4 /i } }
        ]}
    ]
})
I now need to use this in Python using PyMongo but am understandably getting a:
SyntaxError: invalid syntax
As it's fairly complex, is there an easy way to escape it? I've got several of these to run. I've tried enclosing the above query like this:
"""The query from above between the brackets"""
addresses_to_process = db.addresses.find(query)
But I get an error:
TypeError: filter must be an instance of dict, bson.son.SON, or other type that inherits from collections.Mapping
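The shell query is JavaScript, so it can't be pasted into Python verbatim: the operator keys need to be quoted strings and the /.../i regex literals need to become either compiled Python regexes or $regex/$options operators. A sketch of one way this might look with PyMongo (the connection and database names are assumptions, and the date strings are copied unchanged from the question):

import re
from pymongo import MongoClient

client = MongoClient()   # assumes a local mongod
db = client['mydb']      # hypothetical database name

query = {
    "$and": [
        {"date": {"$gte": "2017-06-01"}},
        {"date": {"$lt": "2017-06-60"}},
        {"$or": [
            # re.compile objects are sent to MongoDB as case-insensitive regexes
            {"address.postal_code": re.compile(r"^SW1 ", re.IGNORECASE)},
            {"address.postal_code": re.compile(r"^SW2 ", re.IGNORECASE)},
            {"address.postal_code": re.compile(r"^SW3 ", re.IGNORECASE)},
            {"address.postal_code": re.compile(r"^SW4 ", re.IGNORECASE)},
        ]},
    ]
}

addresses_to_process = db.addresses.find(query)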

MongoDB Collection with Non-repeated field value

So I'm rather new to MongoDB. Here is an imaginary database with the following format.
{
    "_id": "message_id",
    "headers": {
        "from": <from_email>,
        "to": <to_email>,
        "timestamp": <timestamp>
    },
    "message": {
        "message": <the message contents>,
        "signature": <signature contents>
    }
}
Suppose all emails received are inserted into it, and sometimes emails are double-sent. How can one return a collection of emails from an author without any double sends?
I thought this might do it but it doesn't seem to work as expected:
db.mycoll.find({"headers.from": <authorname>}).distinct("message.message")
Edit:
Please excuse me, it seems I have been making some kind of typo; the above query works, but it only returns message.message without the headers. How would I keep the headers intact as well?
It's hard to really determine from your question which part is the "duplicate" and therefore should be unique. It stands to reason, though, that things such as the message "_id" and "timestamp" are not going to duplicate, so this only really leaves the message content, with the possible additional paranoia of that message being "from" the same person.
Document reshaping is generally best handled by the aggregation framework:
db.collection.aggregate([
    { "$group": {
        "_id": { "message": "$message.message", "from": "$headers.from" },
        "message_id": { "$first": "$_id" },
        "headers": { "$first": "$headers" },
        "message": { "$first": "$message" }
    }},
    { "$project": {
        "_id": "$message_id",
        "headers": 1,
        "message": 1
    }}
])
The $group will filter out any matching message content, with the $first operations selecting only the "first" found item for the matching field on the document grouping boundary.
There is an assumption here that the existing order is by "timestamp", but if not, then you might want to apply a $sort as the first pipeline stage before the others:
{ "$sort": { "headers.timestamp": 1 } }
The final $project really just restores the original document form and removes the "grouping key" that was supplied earlier. Just prettier than duplicating information and/or putting things out of place.
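Run from Python with PyMongo, and with the suggested $sort stage prepended, the same pipeline might look like the following sketch (the client and the database/collection names are assumptions):

from pymongo import MongoClient

client = MongoClient()            # assumes a local mongod
coll = client['mydb']['mycoll']   # hypothetical database/collection names

pipeline = [
    {"$sort": {"headers.timestamp": 1}},
    {"$group": {
        "_id": {"message": "$message.message", "from": "$headers.from"},
        "message_id": {"$first": "$_id"},
        "headers": {"$first": "$headers"},
        "message": {"$first": "$message"},
    }},
    {"$project": {
        "_id": "$message_id",
        "headers": 1,
        "message": 1,
    }},
]

for doc in coll.aggregate(pipeline):
    print(doc)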
You could use distinct() to return an array of distinct messages from a specific author as follows:
db.collection.distinct('message.message', {"headers.from": <authorname>})
What you're looking for is not currently implemented (at least as far as I know). One workaround would be this:
db.mycoll.aggregate([
    { $match: { "headers.from": <authorname> } },
    { $group: {
        _id: "$headers.from",
        "message": { $addToSet: "$message.message" }
    }}
])
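If I read that pipeline correctly, it groups everything from the given author into a single document whose message field is an array of the distinct message bodies, along these lines (values hypothetical):
{ "_id" : "author@example.com", "message" : [ "hello", "a follow-up" ] }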
Building on Neil Lunn's answer above:
I think one can do:
db.collection.aggregate([
    { "$match": { "headers.from": <from email> } },
    { "$group": {
        "_id": "$message.message",
        "headers": { "$first": "$headers" },
        "signature": { "$first": "$message.signature" },
        "message_id": { "$first": "$_id" }
    }},
    { "$project": {
        "_id": "$message_id",
        "headers": "$headers",
        "message": { "message": "$_id", "signature": "$signature" }
    }}
])
Since _id must be unique, the consequence is that duplicate messages will not make the list, and then $project restructures it back to the original object structure with correct key names.
I guess I only have one question in this regard: is there a way to force uniqueness without aggregating into _id, or is this generally considered the correct way to do it in MongoDB?

Python & MongoDb: Query not working at execution time

I have a token saved in MongoDB, like this:
db.user.findOne({'token':'7fd74c28-8ba1-11e2-9073-e840f23c81a0'}['uuid'])
{
"_id" : ObjectId("5140114fae4cb51773d8c4f8"),
"username" : "jjj51#gmail.com",
"name" : "vivek",
"mobile" : "12345",
"is_active" : false,
"token" : BinData(3,"hLL6kIugEeKif+hA8jyBoA==")
}
The above query works fine when I execute it in the MongoDB command line interface.
However, the same query fails when I try to run it in a Django view, like this:
get_user = db.user.findOne({'token':token}['uuid'])
or `get_user = db.user.findOne({'token':'7fd74c28-8ba1-11e2-9073-e840f23c81a0'}['uuid'])`
I am getting an error
KeyError at /activateaccount/
'uuid'
Please help me understand why I am getting this error.
My database
db.user.find()
{ "_id" : ObjectId("5140114fae4cb51773d8c4f8"), "username" : "ghgh#gmail.com", "name" : "Rohit", "mobile" : "12345", "is_active" : false, "token" : BinData(3,"hLL6kIugEeKif+hA8jyBoA==") }
{ "_id" : ObjectId("51401194ae4cb51773d8c4f9"), "username" : "ghg#gmail.com", "name" : "rohit", "mobile" : "12345", "is_active" : false, "token" : BinData(3,"rgBIMIugEeKQBuhA8jyBoA==") }
{ "_id" : ObjectId("514012fcae4cb51874ca3e6f"), "username" : "ghgh#gmail.com", "name" : "rahul", "mobile" : "8528256", "is_active" : false, "token" : BinData(3,"f9dMKIuhEeKQc+hA8jyBoA==") }
TL;DR your query is faulty.
Longer explanation:
{'token':'7fd74c28-8ba1-11e2-9073-e840f23c81a0'}['uuid']
translates to undefined, because you're trying to get the property uuid from an object that doesn't have that property. In the Mongo shell, which uses Javascript, that translates to the following query:
db.user.findOne(undefined)
You'll get some random (okay, not so random, probably the first) result.
Python is a bit more strict when you're trying to get an unknown key from a dictionary:
{'token':token}['uuid']
Since uuid isn't a valid key in the dictionary {'token':token}, you'll get a KeyError when you try to access it.
EDIT: since you've used Python UUID types to store the tokens in the database, you also need to use the same type in your query:
from uuid import UUID
token = '7fd74c28-8ba1-11e2-9073-e840f23c81a0'
get_user = db.user.find_one({'token' : UUID(token) })
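Expanded into a small self-contained snippet, that might look as follows; the MongoClient call, the database name, and the None check are illustrative assumptions rather than part of the original answer:

from uuid import UUID
from pymongo import MongoClient

client = MongoClient()   # assumes a local mongod
db = client['mydb']      # hypothetical database name

token = '7fd74c28-8ba1-11e2-9073-e840f23c81a0'
# UUID(token) is serialized as binary UUID data, matching how the tokens were stored
get_user = db.user.find_one({'token': UUID(token)})
if get_user is None:
    print('No user found for this token')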

ElasticSearch: Index only the fields specified in the mapping

I have an ElasticSearch setup, receiving data to index via a CouchDB river. I have the problem that most of the fields in the CouchDB documents are actually not relevant for search: they are fields internally used by the application (IDs and so on), and I do not want to get false positives because of these fields. Besides, indexing not needed data seems to me a waste of resources.
To solve this problem, I have defined a mapping where I specify the fields which I want to be indexed. I am using pyes to access ElasticSearch. The process that I follow is:
Create the CouchDB river, associated to an index. This apparently creates also the index, and creates a "couchdb" mapping in that index which, as far as I can see, includes all fields, with dynamically assigned types.
Put a mapping, restricting it to the fields which I really want to index.
This is the index definition as obtained by:
curl -XGET http://localhost:9200/notes_index/_mapping?pretty=true
{
    "notes_index" : {
        "default_mapping" : {
            "properties" : {
                "note_text" : {
                    "type" : "string"
                }
            }
        },
        "couchdb" : {
            "properties" : {
                "_rev" : {
                    "type" : "string"
                },
                "created_at_date" : {
                    "format" : "dateOptionalTime",
                    "type" : "date"
                },
                "note_text" : {
                    "type" : "string"
                },
                "organization_id" : {
                    "type" : "long"
                },
                "user_id" : {
                    "type" : "long"
                },
                "created_at_time" : {
                    "type" : "long"
                }
            }
        }
    }
}
The problem that I have is twofold:
The default "couchdb" mapping is indexing all fields. I do not want this. Is it possible to avoid the creation of that mapping? I am confused, because that mapping seems to be the one which is somehow "connecting" to the CouchDB river.
The mapping that I create seems not to have any effect: there are no documents indexed by that mapping.
Do you have any advice on this?
EDIT
This is what I am actually doing, exactly as typed:
server="localhost"
# Create the index
curl -XPUT "$server:9200/index1"
# Create the mapping
curl -XPUT "$server:9200/index1/mapping1/_mapping" -d '
{
    "type1" : {
        "properties" : {
            "note_text" : {"type" : "string", "store" : "no"}
        }
    }
}
'
# Configure the river
curl -XPUT "$server:9200/_river/river1/_meta" -d '{
"type" : "couchdb",
"couchdb" : {
"host" : "localhost",
"port" : 5984,
"user" : "admin",
"password" : "admin",
"db" : "notes"
},
"index" : {
"index" : "index1",
"type" : "type1"
}
}'
The documents in index1 still contain fields other than "note_text", which is the only one that I have specifically mentioned in the mapping definition. Why is that?
The default behavior of the CouchDB river is to use a 'dynamic' mapping, i.e. to index all the fields found in the incoming CouchDB documents. You're right that it can unnecessarily increase the size of the index (your problems with search can be solved by excluding some fields from the query).
To use your own mapping instead of the 'dynamic' one, you need to configure the River plugin to use the mapping you've created (see this article):
curl -XPUT 'elasticsearch-host:9200/_river/notes_index/_meta' -d '{
    "type" : "couchdb",
    ... your CouchDB connection configuration ...
    "index" : {
        "index" : "notes_index",
        "type" : "mapping1"
    }
}'
The name of the type that you're specifying in the URL while doing the mapping PUT overrides the one that you're including in the definition, so the type that you're creating is in fact mapping1. Try executing this command to see for yourself:
> curl 'localhost:9200/index1/_mapping?pretty=true'
{
    "index1" : {
        "mapping1" : {
            "properties" : {
                "note_text" : {
                    "type" : "string"
                }
            }
        }
    }
}
I think that if you get the name of the type right, it will start working fine.
