MongoDB documents combination - python

I have a collection whose documents look like this:
{
"_id" : "Host CPU Utilization (%)",
"count" : 1,
"avg" : NumberDecimal("20.2397439956"),
"flaga" : 4
},
{
"_id" : "Active Sessions Using CPU",
"count" : 1,
"avg" : NumberDecimal("4.0580000000"),
"flaga" : 4
},
{
"_id" : "Wait Time (%)",
"count" : 1,
"avg" : NumberDecimal("1795.2150000000"),
"flaga" : 999
}
Is it possible to use pymongo to transform the data into something like this:
{
"_id" : 4,
"Host CPU Utilization (%)" : NumberDecimal("20.2397439956"),
"Active Sessions Using CPU" : NumberDecimal("4.0580000000")
},
{
"_id" : 999,
"Wait Time (%)" : NumberDecimal("1795.2150000000"),
}
I have tried the update command with $rename, but I can't do it dynamically and can't combine two documents into one. If I use the aggregation framework, I don't know how to $push documents with a variable field name.

Since version 3.4.4 there is the $arrayToObject operator, which might be helpful. You can try grouping by the flaga field and then applying that operator to the collected key/value pairs.
db.myCollection.aggregate([
{
$group: {
"_id": "$flaga",
"values": {
"$push": {
"k": "$_id",
"v": "$avg"
}
}
}
},
{
$project: {
"_id": 1,
"values": { $arrayToObject: "$values" }
}
}
])
This will give you results like:
{
"_id":4,
"values":{
"Host CPU Utilization (%)":NumberDecimal("20.2397439956"),
"Active Sessions Using CPU":NumberDecimal("4.0580000000")
}
}
You can add another pipeline stage with $replaceRoot to get rid of this nesting, but unfortunately you'll lose the _id field, and I bet that's not what you're looking for, so you should probably do this post-processing in your business logic code.
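The post-processing mentioned above could look roughly like the sketch below. It is a minimal illustration, not the answerer's code: flatten_group is a hypothetical helper, and decimal.Decimal stands in for the NumberDecimal values so the snippet runs without a database. The last line shows an assumed server-side alternative that relies on $mergeObjects, which requires MongoDB 3.6 or later.

```python
from decimal import Decimal


def flatten_group(doc):
    """Lift the keys of doc["values"] to the top level while keeping _id."""
    flat = {"_id": doc["_id"]}
    flat.update(doc["values"])
    return flat


# Decimal is a stand-in for bson's Decimal128 (NumberDecimal) in this sketch.
grouped = [
    {"_id": 4, "values": {
        "Host CPU Utilization (%)": Decimal("20.2397439956"),
        "Active Sessions Using CPU": Decimal("4.0580000000"),
    }},
    {"_id": 999, "values": {"Wait Time (%)": Decimal("1795.2150000000")}},
]

flattened = [flatten_group(doc) for doc in grouped]

# Assumed alternative (MongoDB 3.6+): merge _id back in before replacing the root.
replace_root_stage = {
    "$replaceRoot": {"newRoot": {"$mergeObjects": [{"_id": "$_id"}, "$values"]}}
}
```

With pymongo, the same helper could be applied to each document yielded by the aggregate cursor.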

Related

PyMongo not returning results on aggregation

I'm a total beginner in PyMongo. I'm trying to find activities that are registered multiple times. This code is returning an empty list. Could you please help me in finding the mistake:
rows = self.db.Activity.aggregate( [
{ '$group':{
"_id":
{
"user_id": "$user_id",
"transportation_mode": "$transportation_mode",
"start_date_time": "$start_date_time",
"end_date_time": "$end_date_time"
},
"count": {'$sum':1}
}
},
{'$match':
{ "count": { '$gt': 1 } }
},
{'$project':
{"_id":0,
"user_id":"_id.user_id",
"transportation_mode":"_id.transportation_mode",
"start_date_time":"_id.start_date_time",
"end_date_time":"_id.end_date_time",
"count": 1
}
}
]
)
5 rows from db:
{ "_id" : 0, "user_id" : "000", "start_date_time" : "2008-10-23 02:53:04", "end_date_time" : "2008-10-23 11:11:12" }
{ "_id" : 1, "user_id" : "000", "start_date_time" : "2008-10-24 02:09:59", "end_date_time" : "2008-10-24 02:47:06" }
{ "_id" : 2, "user_id" : "000", "start_date_time" : "2008-10-26 13:44:07", "end_date_time" : "2008-10-26 15:04:07" }
{ "_id" : 3, "user_id" : "000", "start_date_time" : "2008-10-27 11:54:49", "end_date_time" : "2008-10-27 12:05:54" }
{ "_id" : 4, "user_id" : "000", "start_date_time" : "2008-10-28 00:38:26", "end_date_time" : "2008-10-28 05:03:42" }
Thank you
When you pass _id: 0 in the $project stage, the sub-fields of _id may not be projected even if you reference them later in the same stage, since the exclusion overrides them. Also note that a field path has to start with $ ("$_id.user_id"); without it the value is not read as a field reference.
Try the $project stage below.
{
'$project': {
"user_id":"$_id.user_id",
"transportation_mode":"$_id.transportation_mode",
"start_date_time":"$_id.start_date_time",
"end_date_time":"$_id.end_date_time",
"count": 1
}
}
rows = self.db.Activity.aggregate( [
{
'$group':{
"_id": {
"user_id": "$user_id",
"transportation_mode": "$transportation_mode",
"start_date_time": "$start_date_time",
"end_date_time": "$end_date_time"
},
"count": {'$sum':1}
}
},
{
'$match':{
"count": { '$gt': 1 }
}
},
{
'$project': {
# field paths need the "$" prefix to be read as references
"user_id":"$_id.user_id",
"transportation_mode":"$_id.transportation_mode",
"start_date_time":"$_id.start_date_time",
"end_date_time":"$_id.end_date_time",
"count": 1,
}
}
])
Your group criteria are likely too narrow.
The $group stage creates a separate output document for each distinct value of the _id field. The pipeline in the question puts two input documents in the same group only if they have exactly the same value in all four of those fields.
For a count to be greater than 1, there must be at least two documents with the same user, the same mode, and exactly the same start and end times.
In the sample data you show, no two documents fall into the same group, so every output document from the $group stage has a count of 1. None of them satisfies the $match, and the result is an empty list.
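The reasoning above can be checked without a database. The sketch below is only an illustration: it mimics the $group and $match stages with collections.Counter over the five sample rows from the question (which have no transportation_mode field, so that key is read as None). The names GROUP_FIELDS and group_counts are made up for this example.

```python
from collections import Counter

# Sample rows copied from the question.
rows = [
    {"_id": 0, "user_id": "000", "start_date_time": "2008-10-23 02:53:04", "end_date_time": "2008-10-23 11:11:12"},
    {"_id": 1, "user_id": "000", "start_date_time": "2008-10-24 02:09:59", "end_date_time": "2008-10-24 02:47:06"},
    {"_id": 2, "user_id": "000", "start_date_time": "2008-10-26 13:44:07", "end_date_time": "2008-10-26 15:04:07"},
    {"_id": 3, "user_id": "000", "start_date_time": "2008-10-27 11:54:49", "end_date_time": "2008-10-27 12:05:54"},
    {"_id": 4, "user_id": "000", "start_date_time": "2008-10-28 00:38:26", "end_date_time": "2008-10-28 05:03:42"},
]

GROUP_FIELDS = ("user_id", "transportation_mode", "start_date_time", "end_date_time")


def group_counts(docs):
    """Count documents per distinct combination of the grouping fields."""
    return Counter(tuple(doc.get(field) for field in GROUP_FIELDS) for doc in docs)


counts = group_counts(rows)
# Equivalent of {'$match': {'count': {'$gt': 1}}}
duplicates = {key: n for key, n in counts.items() if n > 1}
print(len(counts), duplicates)  # 5 {}
```

Every row forms its own group, so nothing survives the count filter.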

Rank records on the basis of a field value in Elasticsearch

I have a field distribution in the record schema that looks like this:
...
"distribution": {
"properties": {
"availability": {
"type": "keyword"
}
}
}
...
I want to rank the records with distribution.availability == "ondemand" lower than other records.
I looked in the Elasticsearch docs but couldn't find a way to reduce the scores of these records at index time so that they appear lower in search results.
How can I achieve this? Any pointers to related sources would be enough as well.
More Info:
I was completely omitting these ondemand records with help of python client in query-time like this:
from elasticsearch_dsl.query import Q
_query = Q("query_string", query=query_string) & ~Q('match', **{'availability.keyword': 'ondemand'})
Now I want to include these records, but place them lower than the other records.
If it is not possible to implement something like this at index time, please suggest how I can achieve it at query time with the python client.
After applying the suggestion from llermaly, the python client query looks like this:
boosting_query = Q(
"boosting",
positive=Q("match_all"),
negative=Q(
"bool", filter=[Q({"term": {"distribution.availability.keyword": "ondemand"}})]
),
negative_boost=0.5,
)
if query_string:
    _query = Q("query_string", query=query_string) & boosting_query
else:
    _query = Q() & boosting_query
EDIT2 : elasticsearch-dsl-py version of boosting query
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
from elasticsearch_dsl import Q
client = Elasticsearch()
q = Q('boosting', positive=Q("match_all"), negative=Q('bool', filter=[Q({"term": {"test.available.keyword": "ondemand"}})]), negative_boost=0.5)
s = Search(using=client, index="test_parths007").query(q)
response = s.execute()
print(response)
for hit in response:
    print(hit.meta.score, hit.test.available)
EDIT : I just read that you need to do it at index time.
Elasticsearch deprecated index-time boosting in 5.0:
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/mapping-boost.html
You can use a Boosting query to achieve the same effect at query time.
Ingest Documents
POST test_parths007/_doc
{
"name": "doc1",
"test": {
"available": "ondemand"
}
}
POST test_parths007/_doc
{
"name": "doc1",
"test": {
"available": "higherscore"
}
}
POST test_parths007/_doc
{
"name": "doc2",
"test": {
"available": "higherscore"
}
}
Query (query time)
POST test_parths007/_search
{
"query": {
"boosting": {
"positive": {
"match_all": {}
},
"negative": {
"term": {
"test.available.keyword": "ondemand"
}
},
"negative_boost": 0.5
}
}
}
Response
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "test_parths007",
"_type" : "_doc",
"_id" : "VMdY7XcB50NMsuQPelRx",
"_score" : 1.0,
"_source" : {
"name" : "doc2",
"test" : {
"available" : "higherscore"
}
}
},
{
"_index" : "test_parths007",
"_type" : "_doc",
"_id" : "Vcda7XcB50NMsuQPiVRB",
"_score" : 1.0,
"_source" : {
"name" : "doc1",
"test" : {
"available" : "higherscore"
}
}
},
{
"_index" : "test_parths007",
"_type" : "_doc",
"_id" : "U8dY7XcB50NMsuQPdlTo",
"_score" : 0.5,
"_source" : {
"name" : "doc1",
"test" : {
"available" : "ondemand"
}
}
}
]
}
}
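The scores in the response above follow from how a boosting query works: a document that matches the negative clause keeps its positive score multiplied by negative_boost. The sketch below only simulates that arithmetic in plain Python for the three test documents; boosted_score is a made-up helper, and it assumes match_all gives every document a base score of 1.0, as the response shows.

```python
def boosted_score(positive_score, matches_negative, negative_boost=0.5):
    """Demote a hit that matches the negative clause by negative_boost."""
    if matches_negative:
        return positive_score * negative_boost
    return positive_score


# The three documents ingested above, reduced to the fields that matter.
docs = [
    {"name": "doc1", "available": "ondemand"},
    {"name": "doc1", "available": "higherscore"},
    {"name": "doc2", "available": "higherscore"},
]

scored = [
    (boosted_score(1.0, doc["available"] == "ondemand"), doc)
    for doc in docs
]
# Highest score first, as in a search response.
scored.sort(key=lambda pair: pair[0], reverse=True)

for score, doc in scored:
    print(score, doc["available"])
```

A smaller negative_boost pushes the ondemand records further down without excluding them.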
For more advanced score manipulation, you can look at the Function Score Query.

Query DSL not working in pyes search

I am trying to use a custom query DSL to get results using the pyes library. I have query DSL that works when I use the command line
curl -XGET localhost:9200/test_index/_search -d '{
"query": {
"function_score": {
"query": {
"match_all": {}
},
"field_value_factor": {
"field": "starred",
"modifier": "none",
"factor": 2
}
}
},
"aggs" : {
"types" : {
"filters" : {
"filters" : {
"category1" : { "type" : { "value" : "category1"}},
"category2" : { "type" : { "value" : "category2"}},
"category3" : { "type" : { "value" : "category3"}},
"category4": { "type" : { "value" : "category4"}},
"category5" : { "type" : { "value" : "category5"}}
}
},
"aggs": {
"topFoundHits": {
"top_hits": {
"size": 5
}
}
}
}
}
}'
The idea here is to search across many categorized documents for all documents matching a particular string query. Then using aggregations I want to find the top five resulting documents by category. Starred items are boosted so that they show up above other search results.
This works great when I enter the command as listed above directly in terminal but it doesn't work when I try to put it in pyes. I'm not sure what the best way is to do it. The documentation for the pyes library is really confusing for me to translate this totally into pyes objects.
I'm trying to do the following:
query_dsl = self.get_text_index_query_dsl()
resulting_docs = conn.search(query=query_dsl)
(where self.get_test_index_query_dsl returns the query dsl dict above)
Searching as is gives me a:
ElasticSearchException: QueryParsingException[[test_index] No query registered for [query]]; }]
If I remove the parent "query" mapping and try:
query_dsl = {
"function_score": {
"query": {
"match_all": {}
},
"field_value_factor": {
"field": "starred",
"modifier": "none",
"factor": 2
}
},
"aggs" : {
"types" : {
"filters" : {
"filters" : {
"category1" : { "type" : { "value" : "category1"}},
"category2" : { "type" : { "value" : "category2"}},
"category3" : { "type" : { "value" : "category3"}},
"category4": { "type" : { "value" : "category4"}},
"category5" : { "type" : { "value" : "category5"}}
}
},
"aggs": {
"topFoundHits": {
"top_hits": {
"size": 5
}
}
}
}
}
}
This also errors out with: ElasticSearchException: ElasticsearchParseException[Expected field name but got START_OBJECT "aggs"]; }]
These errors in addition to the fact that pyes doesn't seem to have a 'topFoundHits' functionality yet (I think) are holding me up.
Any ideas why this is happening and how to fix it?
Thank you so much!
I got this working with the elasticsearch-dsl library, which lets you use your regular query DSL JSON syntax: http://elasticsearch-dsl.readthedocs.org/en/latest/.

Python and Elasticsearch API changes and Autocomplete

To begin: I am trying to add around 7.2k documents, and that part works. The issue is that afterwards I am not able to get any suggestions returned. This is how the information is added:
def addVariantToElasticSearch(self, docId, companyId, companyName, parent, companyIndustry, variants, count, conn):
    body = {
        "company": {
            "company_name": companyName,
            "parent": parent,
            "suggest": {
                "input": variants,
                "output": companyName,
                "weight": count,
                "payload": {
                    "industry_id": companyIndustry,
                    "no_of_jobseekers": count,
                    "company_id": companyId
                }
            }
        }
    }
    res = conn.index(body=body, index="companies", doc_type="company", id=docId)
The mapping and settings is defined as:
def setting():
    return {
        "settings": {
            "index": {
                "number_of_replicas": 0,
                "number_of_shards": 1
            },
            "analysis": {
                "analyzer": {
                    "my_edge_ngram_analyzer": {
                        "tokenizer": "my_edge_ngram_tokenizer",
                        "filter": ["standard", "lowercase"]
                    }
                },
                "tokenizer": {
                    "my_edge_ngram_tokenizer": {
                        "type": "edgeNGram",
                        "min_gram": "1",
                        "max_gram": "5",
                        "token_chars": ["letter", "digit"]
                    }
                }
            }
        },
        "mappings": {
            "company": {
                "properties": {
                    "name": {"type": "string"},
                    "industy": {"type": "integer"},
                    "count": {"type": "long"},
                    "parent": {"type": "string"},
                    "suggest": {
                        "type": "completion",
                        "index_analyzer": "my_edge_ngram_analyzer",
                        "search_analyzer": "my_edge_ngram_analyzer",
                        "payloads": True
                    }
                }
            }
        }
    }
Index creation:
def createMapping(es):
    settings = setting()
    es.indices.create(index="companies", body=settings)
I call createMapping, which uses setting(), and then add each variant inside a try/except; this raises no errors. I can see all my documents in the browser, and the status, settings and mappings look correct.
But when I send the curl request below, I get no results (the curl command and its output follow).
curl -X POST localhost:9200/companies/_suggest -d '
{
"company-suggest" : {
"text" : "1800",
"completion" : {
"field" : "suggest"
}
}
}'
{
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"suggest" : [ {
"text" : "ruby",
"offset" : 0,
"length" : 4,
"options" : [ ]
} ]
I am currently using ES 1.1.0. I have tried both Python API 0.4 and 1.1.0 with no luck (I tried 0.4 because 1.1.0 was not working, although I know that is not ideal because of compatibility issues with the ES version). I have also been able to add the same settings and mappings via curl and add a company, which I was then able to retrieve with the curl request above.
I'm not sure exactly where the issue lies. I have checked the data folder in ES, as well as the browser, to make sure the index was created. I have also made sure only a single ES instance is running.
Any help is greatly appreciated.

How to print minimum result in MongoDB

MongoDB noob here...
So, I'm trying to print out the minimum value score inside a collection that looks like this...
> db.students.find({'_id': 1}).pretty()
{
"_id" : 1,
"name" : "Aurelia Menendez",
"scores" : [
{
"type" : "exam",
"score" : 60.06045071030959
},
{
"type" : "quiz",
"score" : 52.79790691903873
},
{
"type" : "homework",
"score" : 71.76133439165544
},
{
"type" : "homework",
"score" : 34.85718117893772
}
]
}
The incantation I'm using is as such...
db.students.aggregate(
// Initial document match (uses index, if a suitable one is available)
{ $match: {
_id : 1
}},
// Expand the scores array into a stream of documents
{ $unwind: '$scores' },
// Filter to 'homework' scores
{ $match: {
'scores.type': 'homework'
}},
// grab the minimum value score
{ $match: {
'scores.min.score': 1
}}
)
The output I'm getting is this...
{ "result" : [ ], "ok" : 1 }
What am I doing wrong?
You've got the right idea, but in the last step of the aggregation what you want to do is group all the scores by student and find the $min value.
Change the last pipeline operation to:
{ $group: {
_id: "$_id",
minScore: {$min: "$scores.score"}
}}
db.students.aggregate(
{ $unwind: "$scores" },
{ $match:{"scores.type":"homework"} },
{ $group: {
_id : "$_id",
maxScore : { $max : "$scores.score"},
minScore: { $min:"$scores.score"}
}
});
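To see what the $unwind, $match and $group stages above compute, the same steps can be traced in plain Python over the sample document from the question. This is only an illustrative sketch, not a pymongo call; the variable names are made up for the example.

```python
# Sample document copied from the question.
student = {
    "_id": 1,
    "name": "Aurelia Menendez",
    "scores": [
        {"type": "exam", "score": 60.06045071030959},
        {"type": "quiz", "score": 52.79790691903873},
        {"type": "homework", "score": 71.76133439165544},
        {"type": "homework", "score": 34.85718117893772},
    ],
}

# $unwind + $match: one entry per element of scores, homework only.
homework_scores = [
    entry["score"] for entry in student["scores"] if entry["type"] == "homework"
]

# $group by _id with $min and $max accumulators.
result = {
    "_id": student["_id"],
    "minScore": min(homework_scores),
    "maxScore": max(homework_scores),
}
print(result)
```

For this document the minimum homework score is 34.85718117893772 and the maximum is 71.76133439165544.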
how to aggregate on each item in collection in mongoDB
