I'm a newbie to MongoDB and Python scripts, and I'm confused about how a $match term is handled in a pipeline.
Let's say I manage a library where books are tracked as JSON documents in MongoDB, one document per copy of a book. The documents look like this:
{
"Title": "A Tale of Two Cities",
"subData":
{
"status": "Checked In"
...more data here...
}
}
Here, status will be one string from a finite set of strings, perhaps just: { "Checked In", "Checked Out", "Missing", etc. } But note also that there may not be a status field at all:
{
"Title": "Great Expectations",
"subData":
{
...more data here...
}
}
Okay: I am trying to write a MongoDB pipeline within a Python script that does the following:
For each book in the library:
Groups and counts the different instances of the status field
So my target output from my Python script would be something like this:
{ "A Tale of Two Cities" 'Checked In' 3 }
{ "A Tale of Two Cities" 'Checked Out' 4 }
{ "Great Expectations" 'Checked In' 5 }
{ "Great Expectations" '' 7 }
Here's my code:
mydatabase = client.JSON_DB
mycollection = mydatabase.JSON_all_2
listOfBooks = mycollection.distinct("bookname")
for book in listOfBooks:
    match_variable = {
        "$match": { 'Title': book }
    }
    group_variable = {
        "$group": {
            '_id': '$subdata.status',
            'categories' : { '$addToSet' : '$subdata.status' },
            'count': { '$sum': 1 }
        }
    }
    project_variable = {
        "$project": {
            '_id': 0,
            'categories' : 1,
            'count' : 1
        }
    }
    pipeline = [
        match_variable,
        group_variable,
        project_variable
    ]
    results = mycollection.aggregate(pipeline)
    for result in results:
        print(str(result['Title'])+" "+str(result['categories'])+" "+str(result['count']))
As you can probably tell, I have very little idea what I'm doing. When I run the code, I get an error because I'm trying to reference my $match term:
Traceback (most recent call last):
File "testScript.py", line 34, in main
print(str(result['Title'])+" "+str(result['categories'])+" "+str(result['count']))
KeyError: 'Title'
So is the $match term not included in the pipeline? Or am I not including it in the group_variable or project_variable?
And on a general note, the above seems like a lot of code to do something relatively easy. Does anyone see a better way? It's easy to find simple examples online, but this is one step of complexity away from anything I can locate. Thank you.
The KeyError happens because each stage only passes along the fields it outputs: after your "$group" and "$project", the documents contain only categories and count, so there is no Title field left to print. Here's one aggregation pipeline that "$group"s all the books by "Title" and "subData.status".
db.collection.aggregate([
{
"$group": {
"_id": {
"Title": "$Title",
"status": {"$ifNull": ["$subData.status", ""]}
},
"count": { "$count": {} }
}
},
{ // not really necessary, but puts output in predictable order
"$sort": {
"_id.Title": 1,
"_id.status": 1
}
},
{
"$replaceWith": {
"$mergeObjects": [
"$_id",
{"count": "$count"}
]
}
}
])
Example output for one of the "books":
{
"Title": "mumblecore",
"count": 3,
"status": ""
},
{
"Title": "mumblecore",
"count": 3,
"status": "Checked In"
},
{
"Title": "mumblecore",
"count": 8,
"status": "Checked Out"
},
{
"Title": "mumblecore",
"count": 6,
"status": "Missing"
}
Try it on mongoplayground.net.
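If it helps, here is the same pipeline run from a Python script with pymongo; the connection string is an assumption, and the database/collection names are taken from your script:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
collection = client.JSON_DB.JSON_all_2             # names taken from the question

pipeline = [
    {"$group": {
        "_id": {
            "Title": "$Title",
            "status": {"$ifNull": ["$subData.status", ""]},
        },
        # {"$count": {}} needs MongoDB 5.0+; {"$sum": 1} is the older equivalent
        "count": {"$sum": 1},
    }},
    {"$sort": {"_id.Title": 1, "_id.status": 1}},
    # $replaceWith requires MongoDB 4.2+
    {"$replaceWith": {"$mergeObjects": ["$_id", {"count": "$count"}]}},
]

for doc in collection.aggregate(pipeline):
    print(doc["Title"], doc["status"], doc["count"])

Note that a single aggregate call covers every title, so the per-book loop and the distinct() call in your script are no longer needed.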
I'm trying to find a way to retrieve some data from MongoDB through Python scripts,
but I got stuck in the following situation:
I have to retrieve some data, check a field value, and compare it with other data (MongoDB documents).
But the object's name may vary from document to document, see below:
Document 1
{
"_id": "001",
"promotion": {
"Avocado": {
"id": "01",
"timestamp": "202005181407",
},
"Banana": {
"id": "02",
"timestamp": "202005181407",
}
},
"product" : {
"id" : "11"
}
}
Document 2
{
"_id": "002",
"promotion": {
"Grape": {
"id": "02",
"timestamp": "202005181407",
},
"Dragonfruit": {
"id": "02",
"timestamp": "202005181407",
}
},
"product" : {
"id" : "15"
}
}
I'll always have an object called promotion, but the child's name may vary; sometimes it's an ordered number, sometimes it is not. The field whose value I need is the id inside promotion; it will always have the same name.
So if the document matches the criteria, I'll retrieve it with Python and get the rest of the work done.
PS.: I'm not the one responsible for this kind of Document Structure.
I've already tried these docs, but couldn't get them to work the way I need.
$all
$elemMatch
Try this aggregation pipeline (written in Python dict syntax). $objectToArray turns the promotion object into an array of {k, v} pairs, so the variable child names don't matter, and "$fruits.v.id" then collects every nested id value:
[
{
'$addFields': {
'fruits': {
'$objectToArray': '$promotion'
}
}
}, {
'$addFields': {
'FruitIds': '$fruits.v.id'
}
}, {
'$project': {
'_id': 0,
'FruitIds': 1
}
}
]
Output produced:
{FruitIds:["01","02"]},
{FruitIds:["02","02"]}
Is this the desired output?
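For completeness, a minimal pymongo sketch that runs this pipeline; the connection string and the database/collection names below are assumptions:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection
collection = client.mydb.products                  # hypothetical collection holding the documents

pipeline = [
    # Turn the promotion object into an array of {k, v} pairs so key names don't matter
    {"$addFields": {"fruits": {"$objectToArray": "$promotion"}}},
    # Collect every nested id value
    {"$addFields": {"FruitIds": "$fruits.v.id"}},
    {"$project": {"_id": 0, "FruitIds": 1}},
]

for doc in collection.aggregate(pipeline):
    print(doc)  # e.g. {'FruitIds': ['01', '02']}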
I'm trying to navigate through a JSON file but cannot properly parse the 'headliner' node.
Here is my JSON file :
{
"resultsPage":{
"results":{
"calendarEntry":[
{
"event":{
"id":38862824,
"artistName":"Raphael",
},
"performance":[
{
"id":73632729,
"headlinerName":"Top-Secret",
}
}
],
"venue":{
"id":4285819,
"displayName":"Sacré"
}
}
}
}
Here is what I am trying to do:
for item in data["resultsPage"]["results"]["calendarEntry"]:
    artistname = item["event"]["artistName"]
    headliner = item["performance"]["headlinerName"]
I don't understand why it works for 'artistName' but not for 'headlinerName'. Thanks for your help and your explanation.
Notice your performance key:
"performance":[
{
"id":73632729,
"headlinerName":"Top-Secret",
}
}
],
The JSON you posted is malformed. Assuming the structure is like:
"performance":[
{
"id":73632729,
"headlinerName":"Top-Secret",
}
],
You can do:
for i in item["performance"]:
    headliner = i["headlinerName"]
or, as @UltraInstinct suggested:
item["performance"][0]["headlinerName"]
A few problems here. First, your JSON is incorrectly formatted. Your square brackets don't match up. Maybe you meant something like this? I am going to assume "calendarEntry" is a list here and everything else is an object. Usually lists are made plural, i.e. "calendarEntries".
{
"resultsPage": {
"results": {
"calendarEntries": [
{
"event": {
"id": 38862824,
"artistName": "Raphael"
},
"performance": {
"id": 73632729,
"headlinerName": "Top-Secret"
},
"venue": {
"id": 4285819,
"displayName": "Sacré"
}
}
]
}
}
}
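Assuming that corrected structure (the key names follow the reformatted JSON above, so "calendarEntries" is hypothetical), the navigation becomes straightforward:

import json

with open("events.json") as f:   # hypothetical file name
    data = json.load(f)

for item in data["resultsPage"]["results"]["calendarEntries"]:
    artistname = item["event"]["artistName"]
    headliner = item["performance"]["headlinerName"]  # "performance" is a single object here
    print(artistname, headliner)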
I am trying to fetch data based on a match condition. First I tried this (here ending_date is a full date):
Offer.aggregate([
{
$match: {
carer_id : req.params.carer_id,
status : 3
}
},
{
$group : {
_id : { year: { $year : "$ending_date" }, month: { $month : "$ending_date" }},
count : { $sum : 1 }
}
}],
function (err, res)
{ if (err) ; // TODO handle error
console.log(res);
});
which gives me following output:
[ { _id: { year: 2015, month: 11 }, count: 2 } ]
Now I want to check year also, so I am trying this:
Offer.aggregate([
{
$project: {
myyear: {$year: "$ending_date"}
}
},
{
$match: {
carer_id : req.params.carer_id,
status : 3,
$myyear : "2015"
}
},
{
$group : {
_id : { year: { $year : "$ending_date" }, month: { $month : "$ending_date" }},
count : { $sum : 1 }
}
}],
function (err, res)
{ if (err) ; // TODO handle error
console.log(res);
});
which gives me following output:
[]
As you can see, _id has 2015 as the year, so when I match on the year the document should come back in the array. But I am getting an empty array. Why is this?
Is there any other way to match only the year from the whole datetime?
Here is the sample data
{
"_id": {
"$oid": "56348e7938b1ab3c382d3363"
},
"carer_id": "55e6f647f081105c299bb45d",
"user_id": "55f000a2878075c416ff9879",
"starting_date": {
"$date": "2015-10-15T05:41:00.000Z"
},
"ending_date": {
"$date": "2015-11-19T10:03:00.000Z"
},
"amount": "850",
"total_days": "25",
"status": 3,
"is_confirm": false,
"__v": 0
}
{
"_id": {
"$oid": "563b5747d6e0a50300a1059a"
},
"carer_id": "55e6f647f081105c299bb45d",
"user_id": "55f000a2878075c416ff9879",
"starting_date": {
"$date": "2015-11-06T04:40:00.000Z"
},
"ending_date": {
"$date": "2015-11-16T04:40:00.000Z"
},
"amount": "25",
"total_days": "10",
"status": 3,
"is_confirm": false,
"__v": 0
}
You forgot to project the fields that you're using in $match and $group later on. For a quick fix, use this query instead:
Offer.aggregate([
{
$project: {
myyear: { $year: "$ending_date" },
carer_id: 1,
status: 1,
ending_date: 1
}
},
{
$match: {
carer_id: req.params.carer_id,
myyear: 2015,
status: 3
}
},
{
$group: {
_id: {
year: { $year: "$ending_date" },
month: { $month: "$ending_date" }
},
count: { $sum: 1 }
}
}],
function (err, res)
{
if (err) {} // TODO handle error
console.log(res);
});
That said, Blakes Seven explained how to make a better query in her answer. I think you should try and use her approach instead.
You are doing so many things wrong here that it really warrants an explanation, so hopefully you learn something.
It's a Pipeline
It's the most basic concept, yet the one people most often fail to pick up on (even after continued use): the aggregation "pipeline" is exactly that, a series of "piped" stages where the output of one stage feeds the input of the next. Think of a Unix pipe, |:
ps -ef | grep mongo | tee out.txt
You've no doubt seen something similar before, and it's the same basic concept: the output of the first command goes to the next, which manipulates it and in turn provides input to the next, and so on.
So here's the basic problem with what you are asking:
{
$project: {
myyear: {$year: "$ending_date"}
}
},
{
$match: {
carer_id : req.params.carer_id,
status : 3,
$myyear : "2015"
}
},
Consider what $project does here. You specify the fields you want in the output and it "spits them out", possibly with manipulation. Does it output these fields in "addition" to the fields already in the document? No, it does not. Only what you ask for actually comes out and can be used in the following pipeline stage(s).
The $match here essentially asks for fields that are no longer present, because you only asked for one thing in the output. The same problem occurs further down, where again you ask for fields you removed earlier and there is simply nothing to reference; besides, everything was already filtered out by a $match that could not match anything.
Also, that is not how field projections work as you have entered them: $myyear is not a valid way to reference a field inside a $match.
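To make the first point concrete, here is roughly what one of your sample documents looks like before and after the $project stage (shapes only, shown as Python dicts):

# Document entering the pipeline (abbreviated from the sample data)
doc_in = {
    "_id": "56348e7938b1ab3c382d3363",
    "carer_id": "55e6f647f081105c299bb45d",
    "status": 3,
    "ending_date": "2015-11-19T10:03:00.000Z",
}

# After {"$project": {"myyear": {"$year": "$ending_date"}}} only _id (kept by default)
# and the projected field remain:
doc_out = {"_id": "56348e7938b1ab3c382d3363", "myyear": 2015}

# carer_id, status and ending_date are gone, so the following $match can never succeed.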
So just use a range for the date
{ "$match": {
"carer_id" : req.params.carer_id,
"status" : 3,
"ending_date": {
"$gte": new Date("2015-01-01"),
"$lt": new Date("2016-01-01")
}
}},
{ "$group": {
"_id": {
"year": { "$year": "$ending_date" },
"month": { "$month": "$ending_date" }
},
"count": { "$sum": 1 }
}}
Why? Because it just makes sense. If you want to match the "year" then supply the date range for the whole year. We could play silliness with $redact to match on the extracted year value, but that is just wasted processing time.
Doing it this way is the fastest to process and can actually use an index. So don't overthink the problem; just ask for the date range you want.
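For anyone doing this from Python, here is the same range-based $match as a pymongo pipeline; the connection, collection name, and carer_id value below are placeholders:

from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection
offers = client.mydb.offers                        # hypothetical collection
carer_id = "55e6f647f081105c299bb45d"              # value taken from the sample data

pipeline = [
    {"$match": {
        "carer_id": carer_id,
        "status": 3,
        # Whole-year range: this can use an index on ending_date
        "ending_date": {"$gte": datetime(2015, 1, 1), "$lt": datetime(2016, 1, 1)},
    }},
    {"$group": {
        "_id": {"year": {"$year": "$ending_date"}, "month": {"$month": "$ending_date"}},
        "count": {"$sum": 1},
    }},
]

print(list(offers.aggregate(pipeline)))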
If you want your aggregation to work, you have to use $addFields instead of $project in order to keep status and carer_id in the documents you pass to the $match:
{
$addFields: {
myyear: {$year: "$ending_date"}
}
},
{
$match: {
carer_id : req.params.carer_id,
status : 3,
myyear : 2015
}
},
I am using a match_phrase query in Elasticsearch, but I have noticed that the results returned are not appropriate.
Code:
res = es.search(index=('indice_1'),
body = {
"_source":["content"],
"query": {
"match_phrase":{
"content":"xyz abc"
}}}
,
size=500,
scroll='60s')
It doesn't return records where the content is:
"hi my name isxyz abc." and "hey wassupxyz abc. how is life"
A similar search in MongoDB using regex gets both records. Any help would be appreciated.
If you didn't specify an analyzer, then you are using the standard analyzer by default. It does grammar-based tokenization, so the terms for the phrase "hi my name isxyz abc." will be something like [hi, my, name, isxyz, abc], while match_phrase is looking for the terms [xyz, abc] right next to each other (unless you specify slop).
You can either use a different analyzer or modify your query. If you use a match query, it will match on the term "abc". If you want the phrase to match, you'll need to use a different analyzer. NGrams should work for you.
Here's an example:
PUT test_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"content": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
PUT test_index/_doc/1
{
"content": "hi my name isxyz abc."
}
PUT test_index/_doc/2
{
"content": "hey wassupxyz abc. how is life"
}
POST test_index/_doc/_search
{
"query": {
"match_phrase": {
"content": "xyz abc"
}
}
}
That results in finding both documents.
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.5753642,
"hits": [
{
"_index": "test_index",
"_type": "_doc",
"_id": "2",
"_score": 0.5753642,
"_source": {
"content": "hey wassupxyz abc. how is life"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "1",
"_score": 0.5753642,
"_source": {
"content": "hi my name isxyz abc."
}
}
]
}
}
EDIT:
If you're looking to do a wildcard query, you can use the standard analyzer. The use case you specified in the comments would be added like this:
PUT test_index/_doc/3
{
"content": "RegionLasit Pant0Q00B000001KBQ1SAO00"
}
And you can query it with wildcard:
POST test_index/_doc/_search
{
"query": {
"wildcard": {
"content.keyword": {
"value": "*Lasit Pant*"
}
}
}
}
Essentially you are doing a substring search without the nGram analyzer. Your query phrase will then just be "*<my search terms>*". I would still recommend looking into nGrams.
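If you are setting this up from your Python script, a sketch along these lines should work (it mirrors the REST example above; the connection details and the 6.x-style _doc mapping/client version are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumed local cluster

# Index with the 3-gram analyzer from the REST example
es.indices.create(index="test_index", body={
    "settings": {
        "analysis": {
            "analyzer": {"my_analyzer": {"tokenizer": "my_tokenizer"}},
            "tokenizer": {
                "my_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 3,
                    "token_chars": ["letter", "digit"]
                }
            }
        }
    },
    "mappings": {
        "_doc": {
            "properties": {
                "content": {"type": "text", "analyzer": "my_analyzer"}
            }
        }
    }
})

es.index(index="test_index", doc_type="_doc", id=1,
         body={"content": "hi my name isxyz abc."})
es.index(index="test_index", doc_type="_doc", id=2,
         body={"content": "hey wassupxyz abc. how is life"})
es.indices.refresh(index="test_index")

res = es.search(index="test_index",
                body={"query": {"match_phrase": {"content": "xyz abc"}}})
print(res["hits"]["total"])  # both documents should match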
You can also set the type parameter to "phrase" inside a match query:
res = es.search(index='indice_1',
                body={
                    "_source": ["content"],
                    "query": {
                        "match": {
                            "content": {
                                "query": "xyz abc",
                                "type": "phrase"  # legacy option; newer ES versions use match_phrase instead
                            }
                        }
                    }
                },
                size=500,
                scroll='60s')
Using elastic search's query DSL this is how I am currently constructing my query:
elastic_sort = [
{ "timestamp": {"order": "desc" }},
"_score",
{ "name": { "order": "desc" }},
{ "channel": { "order": "desc" }},
]
elastic_query = {
"fuzzy_like_this" : {
"fields" : [ "msgs.channel", "msgs.msg", "msgs.name" ],
"like_text" : search_string,
"max_query_terms" : 10,
"fuzziness": 0.7,
}
}
res = self.es.search(index="chat", body={
"from" : from_result, "size" : results_per_page,
"track_scores": True,
"query": elastic_query,
"sort": elastic_sort,
})
I've been trying to implement a filter or an analyzer that will allow the inclusion of "#" in searches (I want a search for "#thing" to return results that include "#thing"), but I am coming up short. The error messages I am getting are not helpful and just telling me that my query is malformed.
I attempted to incorporate the method found here : http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html but it doesn't make any sense to me in context.
Does anyone have a clue how I can do this?
Did you create a mapping for your index? You can specify within your mapping to not analyze certain fields.
For example, a tweet mapping can be something like:
"tweet": {
"properties": {
"id": {
"type": "long"
},
"msg": {
"type": "string"
},
"hashtags": {
"type": "string",
"index": "not_analyzed"
}
}
}
You can then perform a term query on "hashtags" for an exact string match, including the "#" character.
If you want "hashtags" to be tokenized as well, you can always create a multi-field for "hashtags".