I want to query which comments any user has made about the machine learning book between '2020-03-15' and '2020-04-25', ordered from the most recent to the least recent.
Here is my document:
lib_books = db.lib_books
document_book1 = ({
"bookid" : "99051fe9-6a9c-46c2-b949-38ef78858dd0",
"title" : "Machine learning",
"author" : "Tom Michael",
"date_of_first_publication" : "2000-10-02",
"number_of_pages" : 414,
"publisher" : "New York : McGraw-Hill",
"topics" : ["Machine learning", "Computer algorithms"],
"checkout_list" : [
{
"time_checked_out" : "2020-03-20 09:11:22",
"userid" : "ef1234",
"comments" : [
{
"comment1" : "I just finished it and it is worth learning!",
"time_commented" : "2020-04-01 10:35:13"
},
{
"comment2" : "Some cases are a little bit outdated.",
"time_commented" : "2020-03-25 13:19:13"
},
{
"comment3" : "Can't wait to learning it!!!",
"time_commented" : "2020-03-21 08:21:42"
}]
},
{
"time_checked_out" : "2020-03-04 16:18:02",
"userid" : "ab1234",
"comments" : [
{
"comment1" : "The book is a little bit difficult but worth reading.",
"time_commented" : "2020-03-20 12:18:02"
},
{
"comment2" : "It's hard and takes a lot of time to understand",
"time_commented" : "2020-03-15 11:22:42"
},
{
"comment3" : "I just start reading, the principle of model is well explained.",
"time_commented" : "2020-03-05 09:11:42"
}]
}]
})
I tried this code, but it returns nothing.
query_test = lib_books.find({"bookid": "99051fe9-6a9c-46c2-b949-38ef78858dd0", "checkout_list.comments.time_commented" : {"$gte" : "2020-03-20", "$lte" : "2020-04-20"}})
for x in query_test:
    print(x)
Can you try this?
pipeline = [
    {'$match': {'bookid': "99051fe9-6a9c-46c2-b949-38ef78858dd0"}},  # bookid filter
    {'$unwind': '$checkout_list'},
    {'$unwind': '$checkout_list.comments'},
    {'$match': {'checkout_list.comments.time_commented': {"$gte": "2020-03-20", "$lte": "2020-04-20"}}},
    {'$project': {'_id': 0, 'bookid': 1, 'title': 1, 'comment': '$checkout_list.comments'}},
    # after $project the comment lives under 'comment', so sort on that path
    {'$sort': {'comment.time_commented': -1}},
]
query_test = lib_books.aggregate(pipeline)
for x in query_test:
    print(x)
I would recommend keeping the comment text under a single field name rather than 'comment1', 'comment2', etc. If the field were simply 'comment', it could be projected to the root of the result.
The aggregation can then be modified as below:
pipeline = [{'$match':{'bookid':"99051fe9-6a9c-46c2-b949-38ef78858dd0"}},  # bookid filter
{'$unwind':'$checkout_list'},
{'$unwind':'$checkout_list.comments'},
{'$match':{'checkout_list.comments.time_commented':{"$gte" : "2020-03-20", "$lte" : "2020-04-20"}}},
{'$project':{'_id':0,'bookid':1,'title':1,'comment':'$checkout_list.comments.comment','time_commented':'$checkout_list.comments.time_commented'}},
{'$sort':{'time_commented':-1}}]
The equivalent MongoDB shell query, in case it is required (using the lib_books collection from the question):
db.lib_books.aggregate([
{$match:{'bookid':"99051fe9-6a9c-46c2-b949-38ef78858dd0"}},//bookid filter
{$unwind:'$checkout_list'},
{$unwind:'$checkout_list.comments'},
{$match:{'checkout_list.comments.time_commented':{"$gte" : "2020-03-20", "$lte" : "2020-04-20"}}},
{$project:{_id:0,bookid:1,title:1,comment:'$checkout_list.comments.comment',time_commented:'$checkout_list.comments.time_commented'}},
{$sort:{'time_commented':-1}}
])
If there are multiple documents that you need to search, you can use the $in condition:
{$match:{'bookid':{$in:["99051fe9-6a9c-46c2-b949-38ef78858dd0","99051fe9-6a9c-46c2-b949-38ef78858dd1"]}}},//bookid filter
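As a way to check what the pipeline is expected to return, the two $unwind stages, the date $match, and the descending $sort can be emulated in plain Python on a trimmed copy of the sample document. This is only a sketch: comments_between is a hypothetical helper, not part of PyMongo, and it assumes the timestamps stay in the zero-padded "YYYY-MM-DD HH:MM:SS" string form, so that string comparison agrees with chronological order.

```python
# Trimmed copy of the sample document (only the fields the query touches).
book = {
    "bookid": "99051fe9-6a9c-46c2-b949-38ef78858dd0",
    "title": "Machine learning",
    "checkout_list": [
        {"userid": "ef1234", "comments": [
            {"comment1": "I just finished it and it is worth learning!",
             "time_commented": "2020-04-01 10:35:13"},
            {"comment2": "Some cases are a little bit outdated.",
             "time_commented": "2020-03-25 13:19:13"},
            {"comment3": "Can't wait to learning it!!!",
             "time_commented": "2020-03-21 08:21:42"}]},
        {"userid": "ab1234", "comments": [
            {"comment1": "The book is a little bit difficult but worth reading.",
             "time_commented": "2020-03-20 12:18:02"},
            {"comment2": "It's hard and takes a lot of time to understand",
             "time_commented": "2020-03-15 11:22:42"},
            {"comment3": "I just start reading, the principle of model is well explained.",
             "time_commented": "2020-03-05 09:11:42"}]},
    ],
}


def comments_between(doc, start, end):
    """Hypothetical helper: emulate $unwind/$match/$sort on one document."""
    rows = []
    for checkout in doc["checkout_list"]:        # like the first $unwind
        for comment in checkout["comments"]:     # like the second $unwind
            when = comment["time_commented"]
            if start <= when <= end:             # like $gte / $lte on strings
                # The text lives under comment1/comment2/...; take the one
                # value that is not the timestamp.
                text = next(v for k, v in comment.items() if k != "time_commented")
                rows.append({"userid": checkout["userid"],
                             "comment": text,
                             "time_commented": when})
    # like {'$sort': {'time_commented': -1}}
    rows.sort(key=lambda row: row["time_commented"], reverse=True)
    return rows


result = comments_between(book, "2020-03-20", "2020-04-20")
for row in result:
    print(row["time_commented"], row["userid"], row["comment"])
```

Because the bounds are compared as strings, an upper bound of "2020-04-20" excludes comments made later on that same day ("2020-04-20 10:00:00" sorts after "2020-04-20"); a bound such as "2020-04-20 23:59:59" would include them.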
Related
I am wondering what I am doing wrong when trying to print the name field with the following Python code.
import urllib.request, json
with urllib.request.urlopen("<THIS IS A URL IN THE ORIGINAL SCRIPT>") as url:
    data = json.loads(url.read().decode())
print(data['Departure']['Product']['name'])
print(data['Departure']['Stops']['Stop'][0]['depTime'])
And this is the API response I am fetching the data from:
{
"Departure" : [ {
"Product" : {
"name" : "Länstrafik - Buss 201",
"num" : "201",
"catCode" : "7",
"catOutS" : "BLT",
"catOutL" : "Länstrafik - Buss",
"operatorCode" : "254",
"operator" : "JLT",
"operatorUrl" : "http://www.jlt.se"
},
"Stops" : {
"Stop" : [ {
"name" : "Gislaved Lundåkerskolan",
"id" : "740040260",
"extId" : "740040260",
"routeIdx" : 12,
"lon" : 13.530096,
"lat" : 57.298178,
"depTime" : "20:55:00",
"depDate" : "2019-03-05"
}
data["Departure"] is a list, and you are indexing into it like it's a dictionary.
You wrote the dictionary sample confusingly. Here's how I think it looks:
d = {
"Departure" : [ {
"Product" : {
"name" : "Länstrafik - Buss 201",
"num" : "201",
"catCode" : "7",
"catOutS" : "BLT",
"catOutL" : "Länstrafik - Buss",
"operatorCode" : "254",
"operator" : "JLT",
"operatorUrl" : "http://www.jlt.se"
},
"Stops" : {
"Stop" : [ {
"name" : "Gislaved Lundåkerskolan",
"id" : "740040260",
"extId" : "740040260",
"routeIdx" : 12,
"lon" : 13.530096,
"lat" : 57.298178,
"depTime" : "20:55:00",
"depDate" : "2019-03-05"
}]}}]}
And here's how you can print depTime
print(d["Departure"][0]["Stops"]["Stop"][0]["depTime"])
The important part you missed is d["Departure"][0] because d["Departure"] is a list.
As Kyle said in the previous answer, data["Departure"] is a list, but you're trying to use it as a dictionary. There are 2 possible solutions.
1. Change data["Departure"]["Stops"]["Stop"] etc. to data["Departure"][0]["Stops"]["Stop"] etc.
2. Change the JSON file to make Departure a dictionary, which would allow you to keep your original code. This would make the final JSON snippet look like this:
"Departure" : {
"Product" : {
"name" : "Länstrafik - Buss 201",
"num" : "201",
"catCode" : "7",
"catOutS" : "BLT",
"catOutL" : "Länstrafik - Buss",
"operatorCode" : "254",
"operator" : "JLT",
"operatorUrl" : "http://www.jlt.se"
},
"Stops" : {
"Stop" : [ {
"name" : "Gislaved Lundåkerskolan",
"id" : "740040260",
"extId" : "740040260",
"routeIdx" : 12,
"lon" : 13.530096,
"lat" : 57.298178,
"depTime" : "20:55:00",
"depDate" : "2019-03-05"
} ]
}
}
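Since "Departure" is a list, looping over it handles one departure or many without hard-coding the [0] index. The sketch below uses a shortened copy of the response from the question, closed so that it parses; the variable names raw and summary are only illustrative.

```python
import json

# Shortened copy of the response from the question, closed so it parses.
raw = """
{"Departure": [{
    "Product": {"name": "Länstrafik - Buss 201", "num": "201"},
    "Stops": {"Stop": [{"name": "Gislaved Lundåkerskolan",
                        "depTime": "20:55:00",
                        "depDate": "2019-03-05"}]}
}]}
"""

data = json.loads(raw)

summary = []
for departure in data["Departure"]:              # "Departure" is a list
    name = departure["Product"]["name"]
    first_stop = departure["Stops"]["Stop"][0]   # "Stop" is a list as well
    summary.append((name, first_stop["depTime"]))

print(summary)
```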
I have a posts collection. I'd like to add comments to its documents as embedded data. Some of the documents don't have a comments field.
{
"_id" : 6,
"title" : "Emacs tutorial",
"updatedate" : ISODate("2017-10-18T19:05:08.555Z"),
"content" : "Welcome to Emacs tutorial\n",
"comments" : {
"date" : "2016-04-20",
"content" : "first comment",
"_id" : 122
}
}
What I'd like to do is check whether the current document includes a 'comments' field.
Here is what I've tried so far:
document = mongo.db.posts
result = document.find(
    { '_id' : int(number) },
    { "comments": { '$exists': True, '$ne': False } })
print(result)
if result:
    print('I find the comments field')
else:
    print("I shouldn't find")
But I noticed that printing the result shows a pymongo.cursor.Cursor object rather than a document.
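One possible way to get a yes/no answer is to put the $exists condition into the filter and use find_one, which returns a single document or None instead of a cursor. In the attempt above, the $exists condition is passed as the second argument of find, which PyMongo treats as the projection rather than as part of the filter. The helper has_comments below is hypothetical; it only assumes that posts is a PyMongo collection (or anything exposing a compatible find_one).

```python
def has_comments(posts, post_id):
    """Return True if the post with this _id has a 'comments' field.

    `posts` is assumed to be a PyMongo collection, e.g. mongo.db.posts.
    """
    query = {"_id": post_id, "comments": {"$exists": True}}
    # find_one returns the matching document, or None when nothing matches.
    return posts.find_one(query) is not None
```

With the names from the question, it could be called as has_comments(mongo.db.posts, int(number)).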
I am using this approach to get the comments on a page. It works fine, but I need to dump the data into MongoDB. With this approach the data is inserted, but as a single document. I want every comment to be stored as a separate document with the information I get from the API.
from facepy import GraphAPI
import json
import pymongo
connection = pymongo.MongoClient("mongodb://localhost")
facebook = connection.facebook
commen = facebook.comments
access = ''
#message
graph = GraphAPI(access)
page_id = 'micromaxinfo'
datas = graph.get(page_id+'/posts?fields=comments,created_time', page=True, retry=5)
posts = []
for data in datas:
    print(data)
    # stores the whole post, comments included, as one document
    commen.insert_one(data)
    break
Output stored in MongoDB:
{
"created_time" : "2015-11-04T08:04:14+0000",
"id" : "120735417936636_1090909150919253",
"comments" : {
"paging" : {
"cursors" : {
"after" : "WTI5dGJXVnVkRjlqZFhKemIzSTZNVEE1TVRReE5ESTVOelV6TlRRd05Ub3hORFEyTnpFNU5UTTU=",
"before" : "WTI5dGJXVnVkRjlqZFhKemIzSTZNVEE1TURrd09UVTRNRGt4T1RJeE1Eb3hORFEyTmpJME16Z3g="
}
},
"data" : [
{
"created_time" : "2015-11-04T08:06:21+0000",
"message" : "my favorite mobiles on canvas silver",
"from" : {
"name" : "Velchamy Alagar",
"id" : "828304797279948"
},
"id" : "1090909130919255_1090909580919210"
},
{
"created_time" : "2015-11-04T08:10:13+0000",
"message" : "Micromax mob. मैने कुछ दिन पहले Micromax Bolt D321 mob. खरिद लिया | Bt मेरा मोबा. बहुत गरम होता है Without internate. और internate MB कम समय मेँ ज्यादा खर्च होती है | कोई तो help करो.",
"from" : {
"name" : "Amit Gangurde",
"id" : "1637669796485258"
},
"id" : "1090909130919255_1090910364252465"
},
{
"created_time" : "2015-11-04T08:10:27+0000",
"message" : "Nice phones.",
"from" : {
"name" : "Nayan Chavda",
"id" : "1678393592373659"
},
"id" : "1090909130919255_1090910400919128"
},
{
"created_time" : "2015-11-04T08:10:54+0000",
"message" : "sir micromax bolt a089 mobile ki battery price kitna. #micromax mobile",
"from" : {
"name" : "Arit Singha Roy",
"id" : "848776351903695"
},
So technically I want to store only the information in the data field, one document per comment:
{
"created_time" : "2015-11-04T08:10:54+0000",
"message" : "sir micromax bolt a089 mobile ki battery price kitna. #micromax mobile",
"from" : {
"name" : "Arit Singha Roy",
"id" : "848776351903695"
}
How can I get this into my database?
You can use Pentaho Data Integration, an open-source ETL tool, for this. I use it to store specific fields from the JSON output for tweets.
Select the fields you want to parse from the JSON, then choose an output such as CSV or a table output (in Oracle, etc.).
Hope this helps.
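If staying in Python is preferable, one possible approach is to pull each entry out of comments.data and insert those entries instead of the whole post. The sketch below is not taken from the answer above: extract_comments is a hypothetical helper, the post_id field it adds is an assumed convenience for linking a comment back to its post, and the sample is a shortened copy of the stored output shown in the question.

```python
def extract_comments(post):
    """Return one dict per comment found under post["comments"]["data"]."""
    comments = post.get("comments", {}).get("data", [])
    docs = []
    for comment in comments:
        doc = dict(comment)            # copy, so the API response is untouched
        doc["post_id"] = post["id"]    # assumed extra field linking back to the post
        docs.append(doc)
    return docs


# Shortened sample shaped like the stored output above.
sample_post = {
    "created_time": "2015-11-04T08:04:14+0000",
    "id": "120735417936636_1090909150919253",
    "comments": {"data": [
        {"created_time": "2015-11-04T08:10:27+0000",
         "message": "Nice phones.",
         "from": {"name": "Nayan Chavda", "id": "1678393592373659"},
         "id": "1090909130919255_1090910400919128"},
        {"created_time": "2015-11-04T08:10:54+0000",
         "message": "sir micromax bolt a089 mobile ki battery price kitna. #micromax mobile",
         "from": {"name": "Arit Singha Roy", "id": "848776351903695"}},
    ]},
}

comment_docs = extract_comments(sample_post)
print(len(comment_docs), "comment documents")
```

For each post dictionary shaped like the stored output, the returned list could then be passed to commen.insert_many(comment_docs). PyMongo raises an error for an empty list, so it is worth checking that the list is non-empty first.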
I am building a Python script to be executed through Apache Spark, in which I generate an RDD from a JSON file stored in an S3 bucket.
I need to filter that JSON RDD according to some data in each JSON document and generate a new JSON file consisting of the filtered documents. That JSON file needs to be uploaded to the S3 bucket.
Please suggest an appropriate way to implement this with PySpark.
Input JSON
{
"_id" : ObjectId("55a787ee9efccaeb288b457f"),
"data" : {
"N◦ CATEGORIA" : 102.0,
"NOMBRE CATEGORIA" : "GASEOSAS",
"VARIABLE" : "TOP OF HEART",
"VAR." : "TOH",
"MARCA" : "COCA COLA ZERO",
"MES" : "ENERO",
"MES_N" : 1.0,
"AÑO" : 2014.0,
"UNIVERSO_TOTAL" : 1.0433982E7,
"UNIVERSO_FEMENINO" : 5529024.0,
"UNIVERSO_MASCULINO" : 4904958.0,
"PORCENTAJE_TOTAL" : 0.0066,
"PORCENTAJE_FEMENINO" : 0.0125,
"PORCENTAJE_MASCULINO" : null
},
"app_id" : ObjectId("5376349e11bc073138c33163"),
"category" : "excel_RAC",
"subcategory" : "RAC",
"created_time" : NumberLong(1437042670),
"instance_id" : null,
"metric_date" : NumberLong(1437042670),
"campaign_id" : ObjectId("5386602ba102b6cd4528ed93"),
"datasource_id" : ObjectId("559f5c8f9efccacf0a3c9875"),
"duplicate_id" : "695a3f5f562d0a02f1820fe5d91642a5"
}
The input JSON data needs to be filtered on VARIABLE : "TOP OF HEART", thereby generating output JSON like the following:
Output JSON
{
"_id" : ObjectId("55b5d19e9efcca86118b45a2"),
"widget_type" : "rac_toh_excel",
"campaign_id" : ObjectId("558554b29efccab00a3c987c"),
"datasource_id" : ObjectId("55b5d18f9efcca770b3c986a"),
"date_time" : NumberLong(1388530800),
"data" : {
"key" : "COCA COLA ZERO",
"values" : {
"x" : NumberLong(1388530800),
"y" : 1.0433982E7,
"data" : {
"id" : ObjectId("553a151e5c93ffe0408b46f9"),
"month" : 1.0,
"year" : 2014.0,
"total" : 1.0433982E7,
"variable" : "TOH",
"total_percentage" : 0.0066
}
}
},
"filter" : [
]
}
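The filtering step itself can be sketched as a small predicate that is tried out on plain Python data first. This is a sketch under assumptions: is_top_of_heart is a hypothetical name, the input is assumed to contain one valid JSON document per line (the ObjectId(...) and NumberLong(...) wrappers shown above are MongoDB shell notation that json.loads cannot parse), and "OTHER" is only a placeholder value for a record that should be dropped.

```python
import json


def is_top_of_heart(record):
    """True when the nested data.VARIABLE field equals "TOP OF HEART"."""
    return (record.get("data") or {}).get("VARIABLE") == "TOP OF HEART"


# Two shortened records, written one JSON document per line.
lines = [
    '{"data": {"VARIABLE": "TOP OF HEART", "MARCA": "COCA COLA ZERO"}, "category": "excel_RAC"}',
    '{"data": {"VARIABLE": "OTHER", "MARCA": "COCA COLA ZERO"}, "category": "excel_RAC"}',
]

records = [json.loads(line) for line in lines]          # parse each line
filtered = [r for r in records if is_top_of_heart(r)]   # keep matching records
output_lines = [json.dumps(r) for r in filtered]        # back to JSON text
print(output_lines)
```

In PySpark the same steps would roughly map onto sc.textFile(path).map(json.loads).filter(is_top_of_heart).map(json.dumps).saveAsTextFile(out_path), where sc is the SparkContext and both paths are S3 locations you supply. Note that saveAsTextFile writes a directory of part files rather than a single file, and reshaping each record into the output layout shown above would be a further map step.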
My items are stored in MongoDB like this:
{"ProductName":"XXXX",
"Catalogs" : [
{
"50008064" : "Apple"
},
{
"50010566" : "Box"
},
{
"50016422" : "Water"
}
]}
Now I want to query all the items that belong to catalog 50008064. How can I do that?
(The catalog id is "50008064" and the catalog name is "Apple".)
You cannot query this efficiently, and performance will decrease as your data grows. I would therefore consider it a schema bug; you should refactor/migrate to the following model, which does allow for indexing:
{"ProductName":"XXXX",
"Catalogs" : [
{
id : "50008064",
value : "Apple"
},
{
id : "50010566",
value : "Box"
},
{
id : "50016422",
value : "Water"
}
]}
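And then create an index (createIndex is the current name for the deprecated ensureIndex; products is the collection name assumed here):
db.products.createIndex({'Catalogs.id':1})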
Again, I strongly suggest you change your schema as this is a potential performance bottleneck you cannot fix any other way.
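The schema change can be sketched as a small transformation applied to each document before writing it back. The helper name migrate_catalogs is hypothetical, and the sketch assumes every entry in Catalogs is a dict mapping a catalog id to its name, as in the question.

```python
def migrate_catalogs(product):
    """Return a copy of the product with Catalogs as [{'id': ..., 'value': ...}]."""
    migrated = dict(product)
    migrated["Catalogs"] = [
        {"id": catalog_id, "value": name}
        for entry in product.get("Catalogs", [])
        for catalog_id, name in entry.items()
    ]
    return migrated


old_product = {
    "ProductName": "XXXX",
    "Catalogs": [
        {"50008064": "Apple"},
        {"50010566": "Box"},
        {"50016422": "Water"},
    ],
}

new_product = migrate_catalogs(old_product)
# With the new shape, membership is a plain comparison on the 'id' field.
in_catalog = any(c["id"] == "50008064" for c in new_product["Catalogs"])
print(new_product["Catalogs"], in_catalog)
```

After the migration, the lookup becomes a query on 'Catalogs.id' (for example find({'Catalogs.id': '50008064'})), which the index on that field can serve.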
This should probably work according to the linked entry, although it won't be very fast, as stated there.
db.products.find({ "Catalogs.50008064" : { $exists: true } } )