I am using the following approach to get the comments on a page's posts. It works fine, but I need to dump the data into MongoDB. With this approach the data is inserted, but as a single document. I want every comment to be stored as a separate document with the information I am getting from the API.
from facepy import GraphAPI
import json
import pymongo

connection = pymongo.MongoClient("mongodb://localhost")
facebook = connection.facebook
commen = facebook.comments
access = ''
#message
graph = GraphAPI(access)
page_id = 'micromaxinfo'
datas = graph.get(page_id + '/posts?fields=comments,created_time', page=True, retry=5)
posts = []
for data in datas:
    print data
    commen.insert(data)
    break
Output Stored in MongoDB:
{
    "created_time" : "2015-11-04T08:04:14+0000",
    "id" : "120735417936636_1090909150919253",
    "comments" : {
        "paging" : {
            "cursors" : {
                "after" : "WTI5dGJXVnVkRjlqZFhKemIzSTZNVEE1TVRReE5ESTVOelV6TlRRd05Ub3hORFEyTnpFNU5UTTU=",
                "before" : "WTI5dGJXVnVkRjlqZFhKemIzSTZNVEE1TURrd09UVTRNRGt4T1RJeE1Eb3hORFEyTmpJME16Z3g="
            }
        },
        "data" : [
            {
                "created_time" : "2015-11-04T08:06:21+0000",
                "message" : "my favorite mobiles on canvas silver",
                "from" : {
                    "name" : "Velchamy Alagar",
                    "id" : "828304797279948"
                },
                "id" : "1090909130919255_1090909580919210"
            },
            {
                "created_time" : "2015-11-04T08:10:13+0000",
                "message" : "Micromax mob. मैने कुछ दिन पहले Micromax Bolt D321 mob. खरिद लिया | Bt मेरा मोबा. बहुत गरम होता है Without internate. और internate MB कम समय मेँ ज्यादा खर्च होती है | कोई तो help करो.",
                "from" : {
                    "name" : "Amit Gangurde",
                    "id" : "1637669796485258"
                },
                "id" : "1090909130919255_1090910364252465"
            },
            {
                "created_time" : "2015-11-04T08:10:27+0000",
                "message" : "Nice phones.",
                "from" : {
                    "name" : "Nayan Chavda",
                    "id" : "1678393592373659"
                },
                "id" : "1090909130919255_1090910400919128"
            },
            {
                "created_time" : "2015-11-04T08:10:54+0000",
                "message" : "sir micromax bolt a089 mobile ki battery price kitna. #micromax mobile",
                "from" : {
                    "name" : "Arit Singha Roy",
                    "id" : "848776351903695"
                }
            }
        ]
    }
}
So, technically, I want to store only the information coming in the data field:
{
    "created_time" : "2015-11-04T08:10:54+0000",
    "message" : "sir micromax bolt a089 mobile ki battery price kitna. #micromax mobile",
    "from" : {
        "name" : "Arit Singha Roy",
        "id" : "848776351903695"
    }
}
How can I get this into my database?
You can use the Pentaho Data Integration open-source ETL tool for this. I use it to store specific fields from the JSON output for tweets.
Select the fields you want to parse from the JSON, and choose an output such as CSV, or a table output in Oracle, etc.
Hope this helps.
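Alternatively, staying in Python, you can walk each post's nested comments.data array and insert every comment as its own document. A minimal sketch, assuming the response structure shown above; the post_id field is my own addition so each comment keeps a reference to its post:

from facepy import GraphAPI
import pymongo

connection = pymongo.MongoClient("mongodb://localhost")
commen = connection.facebook.comments

access = ''  # access token, as in the question
graph = GraphAPI(access)
datas = graph.get('micromaxinfo/posts?fields=comments,created_time', page=True, retry=5)

for page in datas:
    # each page returned by facepy holds a list of posts under 'data'
    for post in page.get('data', []):
        # the comments of a post sit under post['comments']['data']
        for comment in post.get('comments', {}).get('data', []):
            comment['post_id'] = post['id']  # hypothetical extra field for traceability
            commen.insert(comment)           # one document per comment (insert_one in newer pymongo)

Note that only the first page of comments per post is returned this way; paging through post['comments']['paging'] would be needed to fetch the rest.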
Related
I want to query which comments have been made by any user about the machine learning book between '2020-03-15' and '2020-04-25', with the comments ordered from the most recent to the least recent.
Here is my document.
lib_books = db.lib_books
document_book1 = ({
    "bookid" : "99051fe9-6a9c-46c2-b949-38ef78858dd0",
    "title" : "Machine learning",
    "author" : "Tom Michael",
    "date_of_first_publication" : "2000-10-02",
    "number_of_pages" : 414,
    "publisher" : "New York : McGraw-Hill",
    "topics" : ["Machine learning", "Computer algorithms"],
    "checkout_list" : [
        {
            "time_checked_out" : "2020-03-20 09:11:22",
            "userid" : "ef1234",
            "comments" : [
                {
                    "comment1" : "I just finished it and it is worth learning!",
                    "time_commented" : "2020-04-01 10:35:13"
                },
                {
                    "comment2" : "Some cases are a little bit outdated.",
                    "time_commented" : "2020-03-25 13:19:13"
                },
                {
                    "comment3" : "Can't wait to learning it!!!",
                    "time_commented" : "2020-03-21 08:21:42"
                }
            ]
        },
        {
            "time_checked_out" : "2020-03-04 16:18:02",
            "userid" : "ab1234",
            "comments" : [
                {
                    "comment1" : "The book is a little bit difficult but worth reading.",
                    "time_commented" : "2020-03-20 12:18:02"
                },
                {
                    "comment2" : "It's hard and takes a lot of time to understand",
                    "time_commented" : "2020-03-15 11:22:42"
                },
                {
                    "comment3" : "I just start reading, the principle of model is well explained.",
                    "time_commented" : "2020-03-05 09:11:42"
                }
            ]
        }
    ]
})
I tried this code, but it returns nothing.
query_test = lib_books.find({"bookid" : "99051fe9-6a9c-46c2-b949-38ef78858dd0",
                             "checkout_list.comments.time_commented" : {"$gte" : "2020-03-20", "$lte" : "2020-04-20"}})
for x in query_test:
    print(x)
Can you try this
pipeline = [
    {'$match': {'bookid': "99051fe9-6a9c-46c2-b949-38ef78858dd0"}},  # bookid filter
    {'$unwind': '$checkout_list'},
    {'$unwind': '$checkout_list.comments'},
    {'$match': {'checkout_list.comments.time_commented': {"$gte": "2020-03-20", "$lte": "2020-04-20"}}},
    {'$project': {'_id': 0, 'bookid': 1, 'title': 1, 'comment': '$checkout_list.comments'}},
    {'$sort': {'comment.time_commented': -1}}  # sort on the projected field
]
query_test = lib_books.aggregate(pipeline)
for x in query_test:
    print(x)
I would recommend maintaining the comment field under one name, rather than keeping it as 'comment1', 'comment2', etc. If the field were always called 'comment', its value could be brought to the root itself.
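For example, each entry in a comments array would then look like this (a sketch of the suggested shape, not the original data):

{
    "comment" : "I just finished it and it is worth learning!",
    "time_commented" : "2020-04-01 10:35:13"
}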
The aggregate can then be modified as below:
pipeline = [
    {'$match': {'bookid': "99051fe9-6a9c-46c2-b949-38ef78858dd0"}},  # bookid filter
    {'$unwind': '$checkout_list'},
    {'$unwind': '$checkout_list.comments'},
    {'$match': {'checkout_list.comments.time_commented': {"$gte": "2020-03-20", "$lte": "2020-04-20"}}},
    {'$project': {'_id': 0, 'bookid': 1, 'title': 1, 'comment': '$checkout_list.comments.comment', 'time_commented': '$checkout_list.comments.time_commented'}},
    {'$sort': {'time_commented': -1}}
]
The equivalent MongoDB shell query, in case it is required:
db.books.aggregate([
    {$match: {'bookid': "99051fe9-6a9c-46c2-b949-38ef78858dd0"}}, // bookid filter
    {$unwind: '$checkout_list'},
    {$unwind: '$checkout_list.comments'},
    {$match: {'checkout_list.comments.time_commented': {"$gte": "2020-03-20", "$lte": "2020-04-20"}}},
    {$project: {_id: 0, bookid: 1, title: 1, comment: '$checkout_list.comments.comment', time_commented: '$checkout_list.comments.time_commented'}},
    {$sort: {'time_commented': -1}}
])
If there are multiple documents that you need to search, you can use an $in condition in the $match stage:
{$match:{'bookid':{$in:["99051fe9-6a9c-46c2-b949-38ef78858dd0","99051fe9-6a9c-46c2-b949-38ef78858dd1"]}}},//bookid filter
I have an item registered in DynamoDB with the following structure:
{
    "OwnerID" : "12312wqeq",
    "license" : "23423werwegdf",
    "MaintenanceList" : {
        "10-11-2018" : {
            "garage" : "lopcars",
            "city" : "NY",
            "country" : "USA",
            "location" : "1929-1927 Fulton St Brooklyn"
        }
    }
}
I need to add a new maintenance entry to the list, and I tried this:
response = table.update_item(
    Key={
        "OwnerID": "12312wqeq",
        "license": "23423werwegdf",
    },
    UpdateExpression="SET #d1 = :dt",
    ExpressionAttributeValues={
        ':dt': {
            "12-11-2019": {
                "garage": "Crazycars",
                "city": "NY",
                "country": "USA",
                "location": "120 E Suffolk Ave Central Islip"
            }
        }
    },
    ExpressionAttributeNames={
        '#d1': 'MaintenanceList'
    },
    ReturnValues="UPDATED_NEW"
)
but this overwrites the MaintenanceList attribute, and I need it to look like this after the update:
{
    "OwnerID" : "12312wqeq",
    "license" : "23423werwegdf",
    "MaintenanceList" : {
        "10-11-2018" : {
            "garage" : "lopcars",
            "city" : "NY",
            "country" : "USA",
            "location" : "1929-1927 Fulton St Brooklyn"
        },
        "12-11-2019" : {
            "garage" : "Crazycars",
            "city" : "NY",
            "country" : "USA",
            "location" : "120 E Suffolk Ave Central Islip"
        }
    }
}
The SET MaintenanceList=:dt expression indeed replaces the value of the attribute called MaintenanceList. If you want the content of this attribute to be a hash table, and to add entries to it, you need to update it using a nested attribute path, as explained in the DynamoDB documentation. For example, do something like SET #d1.#date = :dt.
However, note that keeping a hash table inside a single attribute's value is problematic - its total size is strictly limited (to 400KB) and you'll also pay for the entire item size every time you read or write a small part of it.
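A minimal boto3 sketch of that nested-path update, reusing the values from the question (the table name is hypothetical):

import boto3

table = boto3.resource('dynamodb').Table('Cars')  # hypothetical table name

response = table.update_item(
    Key={
        "OwnerID": "12312wqeq",
        "license": "23423werwegdf",
    },
    # set one key inside the MaintenanceList map instead of replacing the map
    UpdateExpression="SET #d1.#date = :dt",
    ExpressionAttributeNames={
        '#d1': 'MaintenanceList',
        '#date': '12-11-2019',
    },
    ExpressionAttributeValues={
        ':dt': {
            "garage": "Crazycars",
            "city": "NY",
            "country": "USA",
            "location": "120 E Suffolk Ave Central Islip",
        }
    },
    ReturnValues="UPDATED_NEW",
)

Note that a nested SET requires the MaintenanceList map to already exist on the item; otherwise DynamoDB rejects the update with a ValidationException about an invalid document path.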
I am using the Google Maps API with this Python code to print a route between two points.
import requests, json
#Google Maps Directions API endpoint
endpoint = 'https://maps.googleapis.com/maps/api/directions/json?'
api_key = 'AIzaSyCTPkufBttRcfSkA9zPYgivrYs9QEhdEEU'
#Asks the user to input Where they are and where they want to go.
origin = input('Where are you?: ').replace(' ','+')
destination = input('Where do you want to go?: ').replace(' ','+')
#Building the URL for the request
nav_request = 'origin={}&destination={}&key={}'.format(origin,destination,api_key)
request = endpoint + nav_request
#Sends the request and reads the response.
#response = urllib.request.urlopen(request).read()
r = requests.get('https://maps.googleapis.com/maps/api/directions/json?origin=Vigo&destination=Lugo&key=AIzaSyCTPkufBttRcfSkA9zPYgivrYs9QEhdEEU')
#Loads response as JSON
#directions = json.loads(response)
directions = r.json()
print(directions)
The problem is that my response gives me ZERO_RESULTS.
I've tried it manually in Google Chrome, getting the following result:
{
    "geocoded_waypoints" : [
        {
            "geocoder_status" : "OK",
            "place_id" : "ChIJbYcwsYDOMQ0RDAVnKL9fMB8",
            "types" : [ "locality", "political" ]
        },
        {
            "geocoder_status" : "OK",
            "place_id" : "ChIJk8GyYRRiLw0Rn9RLF60dRHs",
            "types" : [ "locality", "political" ]
        }
    ],
    "routes" : [
        {
            "bounds" : {
                "northeast" : {
                    "lat" : 43.0082848,
                    "lng" : -7.554997200000001
                },
                "southwest" : {
                    "lat" : 42.2392374,
                    "lng" : -8.720694999999999
                }
            },
            "copyrights" : "Datos de mapas ©2019 Inst. Geogr. Nacional",
            "legs" : [
                {
                    "distance" : {
                        "text" : "188 km",
                        "value" : 188311
                    },
                    "duration" : {
                        "text" : "2h 11 min",
                        "value" : 7830
                    },
                    "end_address" : "Vigo, Pontevedra, España",
                    "end_location" : {
                        "lat" : 42.2406168,
                        "lng" : -8.720694999999999
                    },
                    "start_address" : "Lugo, España",
                    "start_location" : {
                        "lat" : 43.0082848,
                        [...]
However, when I try it online I get different geocoded waypoints and, therefore, ZERO_RESULTS:
'types': ['bar', 'establishment', 'food', 'point_of_interest', 'restaurant']}], 'routes': [], 'status': 'ZERO_RESULTS'}
How can I change the geocoded_waypoint types?
As I can see from your example, you are trying to get directions between two Spanish cities. However, when you specify only the names of the cities in origin and destination, the service might resolve them to different countries due to ambiguity in the parameters. For example, when I execute your request on my server located in the USA, the destination Lugo is resolved to place ID ChIJi59iTw6wZIgRvssCUK9Ra84, which is a restaurant named "Lugo's" located at 107 S Main St, Dickson, TN 37055, USA.
See place details
https://maps.googleapis.com/maps/api/place/details/json?placeid=ChIJi59iTw6wZIgRvssCUK9Ra84&fields=formatted_address,geometry,name,type&key=YOUR_API_KEY
As the origin is located in Spain and the destination in the USA, the directions service cannot build driving directions and returns ZERO_RESULTS.
In order to resolve the ambiguity, you should provide more precise origin and destination parameters, or specify a region where you are searching for results.
If I add the region parameter to the request, I get the expected route between the Spanish cities:
https://maps.googleapis.com/maps/api/directions/json?origin=Vigo&destination=Lugo&region=ES&key=YOUR_API_KEY
You can see it in the directions calculator:
https://directionsdebug.firebaseapp.com/?origin=Vigo&destination=Lugo&region=ES
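In the Python script from the question this is one extra parameter in the request; a minimal sketch using requests (the key is a placeholder):

import requests

endpoint = 'https://maps.googleapis.com/maps/api/directions/json'
params = {
    'origin': 'Vigo',
    'destination': 'Lugo',
    'region': 'ES',         # bias ambiguous place names towards Spain
    'key': 'YOUR_API_KEY',  # placeholder
}
directions = requests.get(endpoint, params=params).json()
print(directions['status'])  # 'OK' is expected instead of 'ZERO_RESULTS'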
I hope my answer clarifies your doubt!
I am wondering what I am doing wrong when trying to print the name field in the following Python code.
import urllib.request, json

with urllib.request.urlopen("<THIS IS A URL IN THE ORIGINAL SCRIPT>") as url:
    data = json.loads(url.read().decode())
    print(data['Departure']['Product']['name'])
    print(data['Departure']['Stops']['Stop'][0]['depTime'])
And this is the API I am fetching the data from:
{
    "Departure" : [ {
        "Product" : {
            "name" : "Länstrafik - Buss 201",
            "num" : "201",
            "catCode" : "7",
            "catOutS" : "BLT",
            "catOutL" : "Länstrafik - Buss",
            "operatorCode" : "254",
            "operator" : "JLT",
            "operatorUrl" : "http://www.jlt.se"
        },
        "Stops" : {
            "Stop" : [ {
                "name" : "Gislaved Lundåkerskolan",
                "id" : "740040260",
                "extId" : "740040260",
                "routeIdx" : 12,
                "lon" : 13.530096,
                "lat" : 57.298178,
                "depTime" : "20:55:00",
                "depDate" : "2019-03-05"
            }
data["Departure"] is a list, and you are indexing into it like it's a dictionary.
You wrote the dictionary sample confusingly. Here's how I think it looks:
d = {
    "Departure" : [ {
        "Product" : {
            "name" : "Länstrafik - Buss 201",
            "num" : "201",
            "catCode" : "7",
            "catOutS" : "BLT",
            "catOutL" : "Länstrafik - Buss",
            "operatorCode" : "254",
            "operator" : "JLT",
            "operatorUrl" : "http://www.jlt.se"
        },
        "Stops" : {
            "Stop" : [ {
                "name" : "Gislaved Lundåkerskolan",
                "id" : "740040260",
                "extId" : "740040260",
                "routeIdx" : 12,
                "lon" : 13.530096,
                "lat" : 57.298178,
                "depTime" : "20:55:00",
                "depDate" : "2019-03-05"
            } ]
        }
    } ]
}
And here's how you can print depTime
print(d["Departure"][0]["Stops"]["Stop"][0]["depTime"])
The important part you missed is d["Departure"][0] because d["Departure"] is a list.
As Kyle said in the previous answer, data["Departure"] is a list, but you're trying to use it as a dictionary. There are 2 possible solutions.
Change data["Departure"]["Stops"]["Stop"] etc. to data["Departure"][0]["Stops"]["Stop"] etc.
Change the JSON file to make Departure into a dictionary, which would allow you to keep your original code. This would make the final JSON snippet look like this:
"Departure" : {
"Product" : {
"name" : "Länstrafik - Buss 201",
"num" : "201",
"catCode" : "7",
"catOutS" : "BLT",
"catOutL" : "Länstrafik - Buss",
"operatorCode" : "254",
"operator" : "JLT",
"operatorUrl" : "http://www.jlt.se"
},
"Stops" : {
"name" : "Gislaved Lundåkerskolan",
"id" : "740040260",
"extId" : "740040260",
"routeIdx" : 12,
"lon" : 13.530096,
"lat" : 57.298178,
"depTime" : "20:55:00",
"depDate" : "2019-03-05"
}
}
I am building a Python script, executed through Apache Spark, in which I generate an RDD from a JSON file stored in an S3 bucket.
I need to filter that JSON RDD according to some data in the JSON documents, thereby generating a new JSON file that consists of the filtered documents. That JSON file then needs to be uploaded to the S3 bucket.
Please suggest an appropriate way to implement this with PySpark.
Input JSON
{
    "_id" : ObjectId("55a787ee9efccaeb288b457f"),
    "data" : {
        "N° CATEGORIA" : 102.0,
        "NOMBRE CATEGORIA" : "GASEOSAS",
        "VARIABLE" : "TOP OF HEART",
        "VAR." : "TOH",
        "MARCA" : "COCA COLA ZERO",
        "MES" : "ENERO",
        "MES_N" : 1.0,
        "AÑO" : 2014.0,
        "UNIVERSO_TOTAL" : 1.0433982E7,
        "UNIVERSO_FEMENINO" : 5529024.0,
        "UNIVERSO_MASCULINO" : 4904958.0,
        "PORCENTAJE_TOTAL" : 0.0066,
        "PORCENTAJE_FEMENINO" : 0.0125,
        "PORCENTAJE_MASCULINO" : null
    },
    "app_id" : ObjectId("5376349e11bc073138c33163"),
    "category" : "excel_RAC",
    "subcategory" : "RAC",
    "created_time" : NumberLong(1437042670),
    "instance_id" : null,
    "metric_date" : NumberLong(1437042670),
    "campaign_id" : ObjectId("5386602ba102b6cd4528ed93"),
    "datasource_id" : ObjectId("559f5c8f9efccacf0a3c9875"),
    "duplicate_id" : "695a3f5f562d0a02f1820fe5d91642a5"
}
The input JSON data needs to be filtered on VARIABLE : "TOP OF HEART", thereby generating output JSON like the following (a sketch of the filtering step appears after the output sample).
Output JSON
{
    "_id" : ObjectId("55b5d19e9efcca86118b45a2"),
    "widget_type" : "rac_toh_excel",
    "campaign_id" : ObjectId("558554b29efccab00a3c987c"),
    "datasource_id" : ObjectId("55b5d18f9efcca770b3c986a"),
    "date_time" : NumberLong(1388530800),
    "data" : {
        "key" : "COCA COLA ZERO",
        "values" : {
            "x" : NumberLong(1388530800),
            "y" : 1.0433982E7,
            "data" : {
                "id" : ObjectId("553a151e5c93ffe0408b46f9"),
                "month" : 1.0,
                "year" : 2014.0,
                "total" : 1.0433982E7,
                "variable" : "TOH",
                "total_percentage" : 0.0066
            }
        }
    },
    "filter" : [ ]
}
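A minimal PySpark sketch of the filtering step, under two assumptions: the file contains plain JSON, one object per line (the ObjectId(...) and NumberLong(...) wrappers shown above are Mongo shell notation, not valid JSON), and the bucket paths are placeholders. The reshaping into the output format shown above is omitted:

import json
from pyspark import SparkContext

sc = SparkContext(appName="FilterTopOfHeart")

# read the documents from S3, one JSON object per line (assumption)
lines = sc.textFile("s3a://your-bucket/input/*.json")  # placeholder path

def parse(line):
    # tolerate malformed lines instead of failing the whole job
    try:
        return json.loads(line)
    except ValueError:
        return None

docs = lines.map(parse).filter(lambda d: d is not None)

# keep only documents whose data.VARIABLE equals "TOP OF HEART"
filtered = docs.filter(lambda d: d.get("data", {}).get("VARIABLE") == "TOP OF HEART")

# serialize back to JSON and write the result to S3
filtered.map(json.dumps).saveAsTextFile("s3a://your-bucket/output/")  # placeholder path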