MongoDB $lookup to join two collections in Python [duplicate]

I am trying to fetch data based on some match conditions. First I tried this, where ending_date is a full date field:
Offer.aggregate([
    {
        $match: {
            carer_id: req.params.carer_id,
            status: 3
        }
    },
    {
        $group: {
            _id: { year: { $year: "$ending_date" }, month: { $month: "$ending_date" } },
            count: { $sum: 1 }
        }
    }
], function (err, res) {
    if (err) ; // TODO handle error
    console.log(res);
});
which gives me the following output:
[ { _id: { year: 2015, month: 11 }, count: 2 } ]
Now I want to check year also, so I am trying this:
Offer.aggregate([
    {
        $project: {
            myyear: { $year: "$ending_date" }
        }
    },
    {
        $match: {
            carer_id: req.params.carer_id,
            status: 3,
            $myyear: "2015"
        }
    },
    {
        $group: {
            _id: { year: { $year: "$ending_date" }, month: { $month: "$ending_date" } },
            count: { $sum: 1 }
        }
    }
], function (err, res) {
    if (err) ; // TODO handle error
    console.log(res);
});
which gives me the following output:
[]
As you can see, _id has 2015 as the year, so when I match on the year it should come back in the array. But I am getting an empty array. Why is that?
Is there any other way to match only the year from a whole datetime?
Here is the sample data
{
"_id": {
"$oid": "56348e7938b1ab3c382d3363"
},
"carer_id": "55e6f647f081105c299bb45d",
"user_id": "55f000a2878075c416ff9879",
"starting_date": {
"$date": "2015-10-15T05:41:00.000Z"
},
"ending_date": {
"$date": "2015-11-19T10:03:00.000Z"
},
"amount": "850",
"total_days": "25",
"status": 3,
"is_confirm": false,
"__v": 0
}
{
"_id": {
"$oid": "563b5747d6e0a50300a1059a"
},
"carer_id": "55e6f647f081105c299bb45d",
"user_id": "55f000a2878075c416ff9879",
"starting_date": {
"$date": "2015-11-06T04:40:00.000Z"
},
"ending_date": {
"$date": "2015-11-16T04:40:00.000Z"
},
"amount": "25",
"total_days": "10",
"status": 3,
"is_confirm": false,
"__v": 0
}

You forgot to project the fields that you're using in $match and $group later on. For a quick fix, use this query instead:
Offer.aggregate([
    {
        $project: {
            myyear: { $year: "$ending_date" },
            carer_id: 1,
            status: 1,
            ending_date: 1
        }
    },
    {
        $match: {
            carer_id: req.params.carer_id,
            myyear: 2015,
            status: 3
        }
    },
    {
        $group: {
            _id: {
                year: { $year: "$ending_date" },
                month: { $month: "$ending_date" }
            },
            count: { $sum: 1 }
        }
    }
], function (err, res) {
    if (err) {} // TODO handle error
    console.log(res);
});
That said, Blakes Seven explained how to make a better query in her answer. I think you should try and use her approach instead.

You are doing so many things wrong here that it really warrants an explanation, so hopefully you learn something.
It's a Pipeline
It's the most basic concept, but the one people fail to pick up on most often (even after continued use): the aggregation "pipeline" is exactly that, a series of "piped" processes that feed their input into each stage as it goes along. Think "unix pipe" |:
ps -ef | grep mongo | tee out.txt
You've no doubt seen something similar before, and it's the same basic concept: the output of the first thing goes to the next thing, which manipulates it to provide input for the next thing, and so on.
So here's the basic problem with what you are asking:
{
    $project: {
        myyear: { $year: "$ending_date" }
    }
},
{
    $match: {
        carer_id: req.params.carer_id,
        status: 3,
        $myyear: "2015"
    }
},
Consider what $project does here. You specify the fields you want in the output and it "spits them out", possibly with manipulation. Does it output these fields in "addition" to the fields already in the document? No, it does not. Only what you ask for actually comes out and can be used in the following pipeline stage(s).
The $match here essentially asks for fields that are no longer present, because you only asked for one thing in the output. The same problem occurs further down: again you ask for fields you removed earlier and there is simply nothing to reference, and besides, everything was already removed by a $match that could not match anything.
Also, $myyear : "2015" is not how you reference a projected field in a $match; you match on the field name itself, and $year produces a number, not a string.
So just use a range for the date
{ "$match": {
"carer_id" : req.params.carer_id,
"status" : 3,
"ending_date": {
"$gte": new Date("2015-01-01"),
"$lt": new Date("2016-01-01")
}
}},
{ "$group": {
"_id": {
"year": { "$year": "$ending_date" },
"month": { "$month": "$ending_date" }
},
"count": { "$sum": 1 }
}}
Why? Because it just makes sense. If you want to match the "year", then supply the date range covering the whole year. We could play silly games with $redact to match on the extracted year value, but that is just wasted processing time.
Doing it this way is the fastest to process and can actually use an index to run faster. So don't overthink the problem, and just ask for the date range you want.
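For reference, here is a minimal PyMongo sketch of the same date-range approach; the database/collection names and the carer_id value are assumptions for illustration only:

# Sketch only: date-range match plus group, via PyMongo (assumed names/values).
from datetime import datetime
from pymongo import MongoClient

offers = MongoClient().mydb.offers  # hypothetical database/collection names

pipeline = [
    {"$match": {
        "carer_id": "55e6f647f081105c299bb45d",  # example value from the sample data
        "status": 3,
        "ending_date": {
            "$gte": datetime(2015, 1, 1),
            "$lt": datetime(2016, 1, 1),
        },
    }},
    {"$group": {
        "_id": {
            "year": {"$year": "$ending_date"},
            "month": {"$month": "$ending_date"},
        },
        "count": {"$sum": 1},
    }},
]

for doc in offers.aggregate(pipeline):
    print(doc)  # e.g. {'_id': {'year': 2015, 'month': 11}, 'count': 2}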

If you want your aggregation to work, you have to use $addFields instead of $project in order to keep status and carer_id in the documents you pass to $match (and note the match has to reference the computed field as myyear: 2015, a number, not $myyear : "2015"):
{
    $addFields: {
        myyear: { $year: "$ending_date" }
    }
},
{
    $match: {
        carer_id: req.params.carer_id,
        status: 3,
        myyear: 2015
    }
},
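A rough PyMongo rendering of the complete $addFields pipeline might look like the following sketch; the collection handle and the carer_id value are assumptions, and $addFields requires MongoDB 3.4+:

# Sketch only: the $addFields approach end-to-end, via PyMongo (assumed names/values).
from pymongo import MongoClient

offers = MongoClient().mydb.offers  # hypothetical database/collection names

pipeline = [
    {"$addFields": {"myyear": {"$year": "$ending_date"}}},
    {"$match": {
        "carer_id": "55e6f647f081105c299bb45d",  # example value from the sample data
        "status": 3,
        "myyear": 2015,  # $year yields a number, so match against an int
    }},
    {"$group": {
        "_id": {"year": {"$year": "$ending_date"}, "month": {"$month": "$ending_date"}},
        "count": {"$sum": 1},
    }},
]
print(list(offers.aggregate(pipeline)))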

Related

How to filter ElasticSearch results without having it affect the document score?

I am trying to filter my results on the "publication_year" field without it affecting the document score. But if I add the "range" to the query or to "filter", it seems to affect the score: documents whose "publication_year" is closer to the "lte" (less than or equal to) upper limit of the "range" are scored higher.
My query:
query = {
    'bool': {
        'should': [
            {
                'match_phrase': {
                    "title": keywords
                }
            },
            {
                'match_phrase': {
                    "abstract": keywords
                }
            },
        ]
    }
}
if publication_year_constraint:
    range_query = {"range": {"publication_year": {"gte": publication_year_constraint, "lte": datetime.datetime.today().year}}}
    query["bool"]["filter"] = [range_query]
I tried putting the "range" inside the "should" block as well, with similar results.
Try using a filter context.
In a filter context, a query clause answers the question “Does this
document match this query clause?” The answer is a simple Yes or
No — no scores are calculated.
Example:
{
    "query": {
        "bool": {
            "must": [
                { "match": { "title": "Search" }},
                { "match": { "content": "Elasticsearch" }}
            ],
            "filter": [
                { "term": { "status": "published" }},
                { "range": { "publish_date": { "gte": "2015-01-01" }}}
            ]
        }
    }
}
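Applied to the query in the question, that means keeping the match_phrase clauses in "should" for scoring and the year range under "filter". A rough sketch with the elasticsearch-py client follows; the index name, the example values, and the 8.x-style search call are assumptions:

# Sketch: scoring clauses stay in "should", the year constraint goes in "filter"
# so it runs in filter context and contributes nothing to _score.
import datetime
from elasticsearch import Elasticsearch  # assumes the elasticsearch-py 8.x client

es = Elasticsearch("http://localhost:9200")
keywords = "machine learning"        # illustrative value
publication_year_constraint = 2015   # illustrative value

query = {
    "bool": {
        "should": [
            {"match_phrase": {"title": keywords}},
            {"match_phrase": {"abstract": keywords}},
        ],
        "filter": [
            {"range": {"publication_year": {
                "gte": publication_year_constraint,
                "lte": datetime.datetime.today().year,
            }}}
        ],
    }
}

response = es.search(index="papers", query=query)  # "papers" is a placeholder index name
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))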

What happens to a $match term in a pipeline?

I'm a newbie to MongoDB and Python scripts. I'm confused how a $match term is handled in a pipeline.
Let's say I manage a library, where books are tracked as JSON files in a MongoDB. There is one JSON for each copy of a book. The book.JSON files look like this:
{
"Title": "A Tale of Two Cities",
"subData":
{
"status": "Checked In"
...more data here...
}
}
Here, status will be one string from a finite set of strings, perhaps just: { "Checked In", "Checked Out", "Missing", etc. }. But also note that there may not be a status field at all:
{
"Title": "Great Expectations",
"subData":
{
...more data here...
}
}
Okay: I am trying to write a MongoDB pipeline within a Python script that does the following:
For each book in the library:
Groups and counts the different instances of the status field
So my target output from my Python script would be something like this:
{ "A Tale of Two Cities" 'Checked In' 3 }
{ "A Tale of Two Cities" 'Checked Out' 4 }
{ "Great Expectations" 'Checked In' 5 }
{ "Great Expectations" '' 7 }
Here's my code:
mydatabase = client.JSON_DB
mycollection = mydatabase.JSON_all_2

listOfBooks = mycollection.distinct("bookname")

for book in listOfBooks:
    match_variable = {
        "$match": { 'Title': book }
    }
    group_variable = {
        "$group": {
            '_id': '$subdata.status',
            'categories': { '$addToSet': '$subdata.status' },
            'count': { '$sum': 1 }
        }
    }
    project_variable = {
        "$project": {
            '_id': 0,
            'categories': 1,
            'count': 1
        }
    }
    pipeline = [
        match_variable,
        group_variable,
        project_variable
    ]
    results = mycollection.aggregate(pipeline)
    for result in results:
        print(str(result['Title'])+" "+str(result['categories'])+" "+str(result['count']))
As you can probably tell, I have very little idea what I'm doing. When I run the code, I get an error because I'm trying to reference my $match term:
Traceback (most recent call last):
File "testScript.py", line 34, in main
print(str(result['Title'])+" "+str(result['categories'])+" "+str(result['count']))
KeyError: 'Title'
So is a $match term not included in the pipeline's output? Or am I failing to include it in the group_variable or project_variable?
And on a general note, the above seems like a lot of code to do something relatively easy. Does anyone see a better way? It's easy to find simple examples online, but this is one step of complexity beyond anything I can locate. Thank you.
Here's one aggregation pipeline to "$group" all the books by "Title" and "subData.status".
db.collection.aggregate([
    {
        "$group": {
            "_id": {
                "Title": "$Title",
                "status": { "$ifNull": ["$subData.status", ""] }
            },
            "count": { "$count": {} }
        }
    },
    {   // not really necessary, but puts output in predictable order
        "$sort": {
            "_id.Title": 1,
            "_id.status": 1
        }
    },
    {
        "$replaceWith": {
            "$mergeObjects": [
                "$_id",
                { "count": "$count" }
            ]
        }
    }
])
Example output for one of the "books":
{
"Title": "mumblecore",
"count": 3,
"status": ""
},
{
"Title": "mumblecore",
"count": 3,
"status": "Checked In"
},
{
"Title": "mumblecore",
"count": 8,
"status": "Checked Out"
},
{
"Title": "mumblecore",
"count": 6,
"status": "Missing"
}
Try it on mongoplayground.net.
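Since the question runs through PyMongo, roughly the same idea can be expressed in Python as one aggregation instead of one query per book. This is only a sketch: it reuses mycollection from the question, uses "$sum": 1 (the "$count": {} accumulator needs MongoDB 5.0+), and does the final reshaping in Python while printing rather than with $replaceWith:

# Sketch: one aggregation over the whole collection instead of one query per Title.
pipeline = [
    {"$group": {
        "_id": {
            "Title": "$Title",
            "status": {"$ifNull": ["$subData.status", ""]},
        },
        "count": {"$sum": 1},
    }},
    {"$sort": {"_id.Title": 1, "_id.status": 1}},
]

for result in mycollection.aggregate(pipeline):
    print(result["_id"]["Title"], result["_id"]["status"], result["count"])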

Aggregation function for Counting of Duplicates in a field based on duplicate items in another field

I am using mongoengine as the ORM with a Flask application. The model class is defined like this:
class MyData(db.Document):
    task_id = db.StringField(max_length=50, required=True)
    url = db.URLField(max_length=500, required=True, unique=True)
    organization = db.StringField(max_length=250, required=True)
    val = db.StringField(max_length=50, required=True)
The organization field can repeat, and I want to get the count of duplicates with respect to the values of another field. For example, if the data in MongoDB is like
[{"task_id":"as4d2rds5","url":"https:example1.com","organization":"Avengers","val":"null"},
{"task_id":"rfre43fed","url":"https:example1.com","organization":"Avengers","val":"valid"},
{"task_id":"uyje3dsxs","url":"https:example2.com","organization":"Metro","val":"valid"},
{"task_id":"ghs563vt6","url":"https:example1.com","organization":"Avengers","val":"invalid"},
{"task_id":"erf6egy64","url":"https:example2.com","organization":"Metro","val":"null"}]
Then I am querying all the objects using
data = MyData.objects()
I want a response like
[{"url":"https:example1.com","Avengers":{"valid":1,"null":1,"invalid":1}},{"url":"https:example2.com",Metro":{"valid":1,"null":1,"invalid":0}}]
I tried this:
db.collection.aggregate([
    {
        "$group": {
            "_id": "$organization",
            "count": [
                {
                    "null": { "$sum": 1 },
                    "valid": { "$sum": 1 },
                    "invalid": { "$sum": 1 }
                }
            ]
        }
    }
])
but I am getting an error
The field 'count' must be an accumulator object
Maybe something like this:
db.collection.aggregate([
    {
        "$group": {
            "_id": {
                k: "$organization",
                v: "$val"
            },
            "cnt": {
                $sum: 1
            }
        }
    },
    {
        $project: {
            _id: 0,
            k: "$_id.k",
            o: {
                k: "$_id.v",
                v: "$cnt"
            }
        }
    },
    {
        $group: {
            _id: "$k",
            v: {
                $push: "$o"
            }
        }
    },
    {
        $addFields: {
            v: {
                "$arrayToObject": "$v"
            }
        }
    },
    {
        $project: {
            _id: 0,
            new: [
                {
                    k: "$_id",
                    v: "$v"
                }
            ]
        }
    },
    {
        "$addFields": {
            "new": {
                "$arrayToObject": "$new"
            }
        }
    },
    {
        "$replaceRoot": {
            "newRoot": "$new"
        }
    }
])
Explained:
Group to count
Project for arrayToObject
Group to join the values
arrayToObject one more time
project additionally
arrayToObject to form the final object
project one more time
replaceRoot to move the object to root.
P.S.
Please note this solution does not show the missing values if they do not exist; if you need the missing values, an additional mapping / $mergeObjects needs to be added.
playground1
Option with missing values (if the possible values are fixed to null, valid, invalid):
just replace the second $addFields with:
{
    $addFields: {
        v: {
            "$mergeObjects": [
                {
                    "null": 0,
                    valid: 0,
                    invalid: 0
                },
                {
                    "$arrayToObject": "$v"
                }
            ]
        }
    }
}
playground2
++url:
playground3
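If it helps, here is a hedged sketch of running the pipeline (with the $mergeObjects variant) from the Flask/mongoengine side; the exact aggregate call depends on the mongoengine version, and the output matches playground2 (grouped by organization only, without the url):

# Sketch: the answer's pipeline (with default 0 counts merged in) run from Python.
pipeline = [
    {"$group": {"_id": {"k": "$organization", "v": "$val"}, "cnt": {"$sum": 1}}},
    {"$project": {"_id": 0, "k": "$_id.k", "o": {"k": "$_id.v", "v": "$cnt"}}},
    {"$group": {"_id": "$k", "v": {"$push": "$o"}}},
    {"$addFields": {"v": {"$mergeObjects": [
        {"null": 0, "valid": 0, "invalid": 0},  # default counts for missing values
        {"$arrayToObject": "$v"},
    ]}}},
    {"$project": {"_id": 0, "new": [{"k": "$_id", "v": "$v"}]}},
    {"$addFields": {"new": {"$arrayToObject": "$new"}}},
    {"$replaceRoot": {"newRoot": "$new"}},
]

# Recent mongoengine versions accept the pipeline as a list; older ones expect the
# stages as positional arguments, i.e. MyData.objects.aggregate(*pipeline).
results = list(MyData.objects.aggregate(pipeline))
print(results)  # e.g. [{"Avengers": {"null": 1, "valid": 1, "invalid": 1}}, ...]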

Get field value in MongoDB without parent object name

I'm trying to find a way to retrieve some data from MongoDB through Python scripts, but I got stuck on the following situation:
I have to retrieve some data, check a field value, and compare it with other data (MongoDB documents).
But the object's name may vary from module to module; see below:
Document 1
{
    "_id": "001",
    "promotion": {
        "Avocado": {
            "id": "01",
            "timestamp": "202005181407",
        },
        "Banana": {
            "id": "02",
            "timestamp": "202005181407",
        }
    },
    "product": {
        "id": "11"
    }
}
Document 2
{
    "_id": "002",
    "promotion": {
        "Grape": {
            "id": "02",
            "timestamp": "202005181407",
        },
        "Dragonfruit": {
            "id": "02",
            "timestamp": "202005181407",
        }
    },
    "product": {
        "id": "15"
    }
}
I'll always have an object called promotion, but the child's name may vary; sometimes it's an ordered number, sometimes it is not. The field whose value I need is the id inside promotion; it will always have the same name.
So if the document matches the criteria, I'll retrieve it with Python and get the rest of the work done.
PS.: I'm not the one responsible for this kind of Document Structure.
I've already tried these docs, but couldn't get them to work the way I need.
$all
$elemMatch
Try this Python pipeline:
[
    {
        '$addFields': {
            'fruits': {
                '$objectToArray': '$promotion'
            }
        }
    }, {
        '$addFields': {
            'FruitIds': '$fruits.v.id'
        }
    }, {
        '$project': {
            '_id': 0,
            'FruitIds': 1
        }
    }
]
Output produced:
{FruitIds:["01","02"]},
{FruitIds:["02","02"]}
Is this the desired output?
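For completeness, here is a sketch of running that pipeline from Python with PyMongo, keeping _id in the projection and doing a simple id check on the client side; the database/collection names and the target id are illustrative assumptions:

# Sketch: extract the promotion ids with $objectToArray, then check them in Python.
from pymongo import MongoClient

collection = MongoClient().mydb.fruits  # hypothetical database/collection names

pipeline = [
    {"$addFields": {"fruits": {"$objectToArray": "$promotion"}}},
    {"$addFields": {"FruitIds": "$fruits.v.id"}},
    {"$project": {"_id": 1, "FruitIds": 1}},  # keep _id so matching docs can be identified
]

target_id = "02"  # illustrative value to compare against
for doc in collection.aggregate(pipeline):
    if target_id in doc["FruitIds"]:
        print(doc["_id"], doc["FruitIds"])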

Getting linked documents in single lookup query in Elastic Search

To provide some context:
I want to write a bulk update query (possibly affecting 0.5 - 1M docs). The update would be in the aspects field (shown below), which is mostly duplicated.
My thinking was that if I normalised it into another entity (aspect_label), the number of docs updated would be reduced drastically (say 500-1000 max).
Query : I want to find out if there is a way to get linked documents via id in Elastic Search.
Eg. if I have documents in index my_db according to the mapping below.
Just to point out : processed_reviews is a child of aspect_label
{
"my_db":{
"mappings":{
"processed_reviews":{
"_all":{
"enabled":false
},
"_parent":{
"type":"aspect_label"
},
"_routing":{
"required":true
},
"properties":{
"data":{
"properties":{
"insights":{
"type":"nested",
"properties":{
"aspects":{
"type":"nested",
"properties":{
"aspect_label_id":{
"type":"keyword"
},
"aspect_term_frequency":{
"type":"long"
}
}
}
}
},
"preprocessed_text":{
"type":"text"
},
"preprocessed_title":{
"type":"text"
}
}
}
}
}
}
}
}
And another entity aspect_label :
{
"my_db": {
"mappings": {
"aspect_label": {
"_all": {
"enabled": false
},
"properties": {
"aspect": {
"type": "keyword"
},
"aspect_label_new": {
"type": "keyword"
},
"aspect_label_old": {
"type": "text"
}
}
}
}
}
}
Now, I want to write a search query on the processed_reviews type such that the aspect_label_id entity is replaced with the value of aspect_label_new in the doc, or with the entire matching doc from aspect_label.
{
"_index":"my_db",
"_type":"processed_reviews",
"_id":"191b3bff-4915-4404-a05a-10e6bd2b19d4",
"_score":1,
"_routing":"5",
"_parent":"5",
"_source":{
"data":{
"preprocessed_text":"Good product I really like so comfortable and so light wait and looks good",
"preprocessed_title":"Good choice",
"insights":[
{
"aspects":[
{
"aspect_label":"color",
"aspect_term_frequency":1
}
]
}
]
}
}
}
Also, if there is a better way to approach this problem, if something is wrong with my approach, or if this is simply not possible, please let me know as well.
