So I'm rather new to MongoDB. Here is an imaginary database with the following format.
{
"_id": "message_id",
"headers": {
"from": <from_email>,
"to": <to_email>,
"timestamp": <timestamp>
},
"message": {
"message": <the message contents>,
"signature": <signature contents>
}
}
Suppose all received emails are inserted into it, and sometimes emails are sent twice. How can one return a collection of emails from an author without any double sends?
I thought this might do it but it doesn't seem to work as expected:
db.mycoll.find({"headers.from": <authorname>}).distinct("message.message")
Edit:
Please excuse me, it seems I made some kind of typo. The above query works, but it only returns message.message without the headers. How would I keep the headers intact as well?
It is hard to tell from your question which part is the "duplicate" and therefore should be unique. It stands to reason, though, that the message "_id" and "timestamp" are not going to repeat, so this really only leaves the message content, with the possible additional paranoia of requiring that the message be "from" the same person.
Document reshaping is generally best handled by the aggregation framework:
db.collection.aggregate([
{ "$group": {
"_id": { "message": "$message.message", "from": "$headers.from" },
"message_id": { "$first": "$_id" },
"headers": { "$first": "$headers" },
"message": { "$first": "$message" }
}},
{ "$project": {
"_id": "$message_id",
"headers": 1,
"message": 1
}}
])
The $group stage collapses documents with matching message content, and the $first operators keep only the "first" document found for each grouping key.
There is an assumption in here that the existing order is by "timestamp" but if not then you might want to apply a $sort as the first pipeline stage before the others:
{ "$sort": { "headers.timestamp": 1 } }
The final $project really just restores the original document form and removes the "grouping key" that was supplied earlier. Just prettier than duplicating information and/or putting things out of place.
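To make the grouping behaviour concrete, here is a minimal pure-Python sketch of what the $sort, $group and $first stages do to a list of documents. It does not talk to MongoDB at all; the function name first_per_message and the sample emails are made up for illustration.

```python
def first_per_message(docs):
    """Emulate $sort on headers.timestamp followed by $group with $first.

    Documents are grouped on (message.message, headers.from) and only the
    earliest document of each group is kept, in its original shape.
    """
    ordered = sorted(docs, key=lambda d: d["headers"]["timestamp"])
    kept = {}
    for doc in ordered:
        key = (doc["message"]["message"], doc["headers"]["from"])
        # setdefault keeps the first document seen for a key, like $first
        kept.setdefault(key, doc)
    return list(kept.values())


# Hypothetical sample data: "b" is a double send of "a".
emails = [
    {"_id": "b",
     "headers": {"from": "ann@example.com", "to": "bob@example.com", "timestamp": 2},
     "message": {"message": "hello", "signature": "Ann"}},
    {"_id": "a",
     "headers": {"from": "ann@example.com", "to": "bob@example.com", "timestamp": 1},
     "message": {"message": "hello", "signature": "Ann"}},
    {"_id": "c",
     "headers": {"from": "ann@example.com", "to": "bob@example.com", "timestamp": 3},
     "message": {"message": "bye", "signature": "Ann"}},
]

unique = first_per_message(emails)
print([doc["_id"] for doc in unique])
```

The real pipeline does this work on the server, and $group does not promise any particular output order, so a trailing $sort may still be wanted there.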
You could use distinct() to return an array of distinct messages from a specific author as follows:
db.collection.distinct('message.message', {"headers.from": <authorname>})
What you're looking for is not currently implemented (at least as far as I know). One workaround would be this:
db.mycoll.aggregate([
{
$match:{"headers.from": <authorname>}
},{
$group:{
_id:"$headers.from",
"message":{$addToSet:"$message.message"}
}
}
])
Building on Neil Lunn's answer above:
I think one can do
db.collection.aggregate([{"$match": {"headers.from": <from email>} } ,
{"$group": { "_id": "$message.message",
"headers": {"$first": "$headers"},
"signature": {"$first": "$message.signature"},
"message_id": {"$first": "$_id"} }},
{"$project" : { "_id": "$message_id",
"headers": "$headers",
"message": { "message": "$_id", "signature": "$signature" } } }])
Since _id must be unique the consequence is that duplicate messages will not make the list, and then $project will restructure it to the original object structure with correct key names.
I guess I only have one question in this regard: is there a way to force uniqueness without aggregating into _id, or is this generally considered the correct way to do it in MongoDB?
Related
I have a database 'Product', which contains a collection named 'ProductLog'. Inside this collection, there are 2 documents in the following format:
{
"environment": "DevA",
"data": [
{
"Name": "ABC",
"Stream": "Yes"
},
{
"Name": "ZYX",
"Stream": "Yes"
}
]
},
{
"environment": "DevB",
"data": [
{
"Name": "ABC",
"Stream": "Yes"
},
{
"Name": "ZYX",
"Stream": "Yes"
}
]
}
These are stored as 2 documents in the collection. I want to append more data to the 'data' field of an already existing document in MongoDB using Python. Is there a way to do that? I suspect update would remove the existing entries in the "data" field or replace the whole document.
For example: Adding one more array in EmployeeDetails field, while the earlier data in EmployeeDetail also remains.
I want to show how you can append more data in the already existing document 'data' field in MongoDB using python:
First install pymongo:
pip install pymongo
Now let's get our hands dirty:
from pymongo import MongoClient
mongo_uri = "mongodb://user:pass@mongosrv:27017/"
client = MongoClient(mongo_uri)
database = client["Product"]
collection = "ProductLog"
database[collection].update_one({"environment": "DevB"}, {
"$push": {
"data": {"Name": "DEF", "Stream": "NO"}
}
})
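If several entries need to be appended in one call, $push can be combined with the $each modifier. The sketch below only builds the filter and update documents, so it runs without a server; the helper name build_append_update and the second sample entry are hypothetical.

```python
def build_append_update(environment, new_entries):
    """Return (filter, update) documents that append entries to 'data'."""
    query = {"environment": environment}
    # $each makes $push append every element rather than one nested list
    update = {"$push": {"data": {"$each": list(new_entries)}}}
    return query, update


query, update = build_append_update(
    "DevB",
    [{"Name": "DEF", "Stream": "NO"}, {"Name": "GHI", "Stream": "Yes"}],
)
print(query, update)

# With a live connection the pair would be applied like this:
# database["ProductLog"].update_one(query, update)
```

Either way the existing elements of "data" are left alone, because $push only appends.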
There is a SQL library in Python language through which you can insert/add your data in your desired database. For more information, check out the tutorial
I have the collection posts that contains posts that look something like this
{
"_id": "5ae37fd270f3e72399988198",
"moderator": {
"flagged": false,
"reviewed": true,
"pending": false,
"time": "2018-04-27 20:34:38.099000",
"account": "samhamou"
},
"author": "cryptohazard",
"permlink": "security-enhancements-for-steem-messenger",
"title": "Security enhancements for Steem Messenger",
"repository": {
"owner": {
"login": "kingswisdom"
},
"fork": false,
"html_url": "https:\/\/github.com\/kingswisdom\/SteemMessenger",
"full_name": "kingswisdom\/SteemMessenger",
"name": "SteemMessenger",
"id": 127418766
}
}
I am trying to create an index on the collection in one of my Python files with the following code
posts = DB.posts
posts.drop_indexes()
posts.create_index([
("author", "text"),
("moderator.account", "text"),
("repository.full_name", "text")
])
but this is giving me the following error:
pymongo.errors.OperationFailure: language override unsupported: C++
How can I prevent this from happening? I can create the indexes using code found in the answers of this question:
> db.posts.createIndex({
"moderator.account": "text",
author: "text",
"repository.full_name": "text"
}, {
"language_override": "en"
});
However I want to be able to do it from within my Python script instead of having to do it from the MongoDB shell. I tried finding a way to add the "language_override": "en" option from within my Python script, but this doesn't seem possible when checking the documentation for create_index.
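For what it's worth, as far as I know PyMongo's create_index forwards extra keyword arguments to the server as index options, so it may be enough to pass language_override as a keyword. The sketch below assumes that behaviour and has not been run against a live server; the function name create_posts_text_index is made up, and posts is expected to be the PyMongo collection from the question.

```python
TEXT_INDEX_KEYS = [
    ("author", "text"),
    ("moderator.account", "text"),
    ("repository.full_name", "text"),
]


def create_posts_text_index(posts):
    """Create the text index on a PyMongo collection object.

    language_override names the document field that holds a per-document
    language. "en" is the value used in the shell command; the assumption
    is that no document actually has a field called "en".
    """
    # Assumption: extra keyword arguments are sent as index options.
    return posts.create_index(TEXT_INDEX_KEYS, language_override="en")
```

Usage would be create_posts_text_index(DB.posts) after the drop_indexes() call.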
Recently migrated from AWS Elasticsearch Service (used Elasticsearch 1.5.2) to Elastic Cloud (currently using Elasticsearch 5.1.2). Glad I did it, but with that change comes a newer version of Elasticsearch and newer APIs. Struggling to get my head around the new way of requesting stuff. Formerly, I could more or less copy/paste from Kibana's "Elasticsearch Request Body", adjust a few things, run elasticsearch.Elasticsearch.search() and get what I expect.
Here's my Elasticsearch Request Body from Kibana (for brevity, removed some of the extraneous stuff that Kibana usually inserts):
{
"size": 500,
"sort": [
{
"Time.ISO8601": {
"order": "desc",
"unmapped_type": "boolean"
}
}
],
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "Message\\ ID: 2003",
"analyze_wildcard": true
}
},
{
"range": {
"Time.ISO8601": {
"gte": 1484355455678,
"lte": 1484359055678,
"format": "epoch_millis"
}
}
}
],
"must_not": []
}
},
"stored_fields": [
"*"
],
"script_fields": {}
}
Now I want to use elasticsearch-dsl to do it, since that seems to be the recommended method (instead of using elasticsearch-py). How would I translate the above into elasticsearch-dsl?
Here's what I have so far:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
client = Elasticsearch(
hosts=['HASH.REGION.aws.found.io/elasticsearch'],
use_ssl=True,
port=443,
http_auth=('USER','PASS')
)
s = Search(using=client, index="emp*")
s = s.query("query_string", query="Message\ ID:2003", analyze_wildcards=True)
s = s.query("range", **{"Time.ISO8601": {"gte": 1484355455678, "lte": 1484359055678, "format": "epoch_millis"}})
s = s.sort("Time.ISO8601")
response = s.execute()
for hit in response:
    print '%s %s' % (hit['Time']['ISO8601'], hit['Message ID'])
My code written as above is not giving me what I expect. Getting results that include stuff that doesn't match "Message\ ID:2003", and also it's giving me things outside the requested range of Time.ISO8601 as well.
Totally new to elasticsearch-dsl and ES 5.1.2's way of doing things, so I know I've got lots to learn. What am I doing wrong? Thanks in advance for the help!
I don't have elasticsearch running right now but the query looks like what you wanted (you can always see the query produced by looking at s.to_dict()) with the exception of escaping the \ sign. In the original query it was escaped yet in python the result might be different due to different escaping.
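The escaping point can be checked without a cluster. This small sketch uses only the standard json module to compare the string from the Kibana request body with a Python raw string; the variable names are just for illustration.

```python
import json

# In the Kibana JSON body the backslash is written as "\\", which decodes
# to a single backslash followed by a space.
decoded = json.loads(r'"Message\\ ID: 2003"')

# A raw string keeps the backslash literally, so nothing depends on how
# Python treats the unrecognised "\ " escape in a normal string literal.
query = r"Message\ ID: 2003"

print(decoded == query)
print(json.dumps(query))
```

Writing the query as a raw string makes it obvious which characters actually reach Elasticsearch.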
I would strongly advise not having spaces in your field names, and also using a more structured query than query_string:
s = Search(using=client, index="emp*")
s = s.filter("term", message_id=2003)
s = s.query("range", Time__ISO8601={"gte": 1484355455678, "lte": 1484359055678, "format": "epoch_millis"})
s = s.sort("Time.ISO8601")
Note that I also changed query() to filter() for a slight speedup and used __ instead of . in the field name keyword argument. elasticsearch-dsl will automatically expand __ to a dot (.).
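For comparison, this is roughly the request body I would expect that search to correspond to, written as a plain Python dict. It is a sketch, not the verified output of s.to_dict(): the helper name build_search_body is hypothetical, and message_id is the renamed field assumed in the snippet above.

```python
def build_search_body(message_id, start_ms, end_ms):
    """Build a bool query with a term filter and a time range clause."""
    return {
        "query": {
            "bool": {
                # filter clauses do not contribute to scoring
                "filter": [{"term": {"message_id": message_id}}],
                "must": [{
                    "range": {
                        "Time.ISO8601": {
                            "gte": start_ms,
                            "lte": end_ms,
                            "format": "epoch_millis",
                        }
                    }
                }],
            }
        },
        "sort": ["Time.ISO8601"],
    }


body = build_search_body(2003, 1484355455678, 1484359055678)
print(body)
```

Comparing a dict like this with s.to_dict() is a quick way to spot a misspelled option or field name.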
Hope this helps...
I have a stream of events coming in as JSON. The schema for the JSON is well defined, but the source producing them doesn't always behave when it comes to types.
Example Schema:
{
"type":"object",
"$schema": "http://json-schema.org/draft-03/schema",
"properties":{
"FirstName": {
"type":"string",
"id": "http://jsonschema.net/FirstName",
"required":false
},
"MiddleName": {
"type":"string",
"id": "http://jsonschema.net/MiddleName",
"required":false
},
"LastName": {
"type":"string",
"id": "http://jsonschema.net/LastName",
"required":false
},
"Age": {
"type":"number",
"id": "http://jsonschema.net/Age",
"required":false
}
}
}
In some cases the Age shows up as a "-" character, meaning it was left blank when the record was created. Obviously this isn't a number, thus my problem.
I'm not using any formal JSON validation library, but I was considering looping through each element of the event and handling any needed type conversion. In the example above, I would just make Age 0.
Is there a way to validate each element and then apply some type of conversion function if it fails validation?
I ended up using Schematics with custom Types to do this. Works perfectly.
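For anyone who wants the idea without an extra dependency, here is a minimal plain-Python sketch of per-field coercion with a fallback. This is not the Schematics solution; the COERCERS table and the coerce_event helper are hypothetical names, and the default of 0 for Age follows the question.

```python
def to_number(value, default=0):
    """Return value as a number, or default when it cannot be converted."""
    if isinstance(value, bool):
        return default
    if isinstance(value, (int, float)):
        return value
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


# Field name -> coercion function, mirroring the types in the schema.
COERCERS = {
    "FirstName": str,
    "MiddleName": str,
    "LastName": str,
    "Age": to_number,
}


def coerce_event(event):
    """Return a copy of event with each known field coerced to its type."""
    return {
        key: COERCERS[key](value) if key in COERCERS else value
        for key, value in event.items()
    }


cleaned = coerce_event({"FirstName": "Ada", "Age": "-"})
print(cleaned)
```

Fields that are not listed in COERCERS pass through unchanged, so the table only needs entries for the fields the source tends to get wrong.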
I am using python-jsonschema for JSON validation. I have an object with localised texts that are keyed by RFC 1766 language codes, as follows:
"Description": {
"en": "English Description",
"sv": "Swedish Description",
"fr": "French Description"
},
I've read in the documentation that I could use the 'format' attribute to check a custom format using a function. So, I wrote a function which takes a string as a parameter and returns True if it is an RFC 1766 language string.
@_checks_drafts('rfc1766lang')
def rfc1766lang(instance):
    """some logic, return True if rfc1766"""
However I couldn't find any example on how to apply this to do validation on an object key, not a value.
Is this possible?
I have tried something like below but I couldn't succeed
rfc1766_string_schema_v2 = {
'type': 'object',
'format': 'rfc1766lang',
'additionalProperties': False
}
I know that it would be much easier if I had the json string as follows. However, this is not an option for now.
"Description": [{
"lan": "en",
"text": "Description in English"
}, {
"lan": "sv",
"text": "Description in Swedish"
}]
This is a very good and relevant question because this is actually part of the proposed syntax for v5, so the official meta-schema will have to deal with this as well.
JSON Schema cannot specify a "format" for object keys. The only "validation" JSON Schema supports for object keys is patternProperties, which supplies a regular expression.
For language codes, the best you can do is probably something like:
{
"type": "object",
"patternProperties": {
"^[a-zA-Z]+(-[a-zA-Z]+)*$": {...}
},
"additionalProperties": false
}
That would limit the data so that it was only allowed properties matching that pattern - but that's not the full validation you're looking for, I'm afraid.
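To see which keys that pattern would let through, here is a small sketch using only Python's re module rather than a JSON Schema validator. It mimics the patternProperties plus "additionalProperties": false combination for the keys only; the helper name invalid_language_keys and the sample keys are made up.

```python
import re

# Same regular expression as in the patternProperties example.
LANGUAGE_KEY = re.compile(r"^[a-zA-Z]+(-[a-zA-Z]+)*$")


def invalid_language_keys(description):
    """Return the keys that would be rejected as additional properties."""
    return [key for key in description if not LANGUAGE_KEY.search(key)]


description = {
    "en": "English Description",
    "sv-SE": "Swedish Description",
    "fr_FR": "French Description",
}
bad = invalid_language_keys(description)
print(bad)
```

A key such as "zz-nonsense" still passes, which is the limitation mentioned above: the pattern checks the shape of the key, not whether it is a registered language code.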