I am building a Python script to be executed through Apache Spark, in which I generate an RDD from a JSON file stored in an S3 bucket.
I need to filter that JSON RDD according to some data in each JSON document, generate a new JSON file consisting of the filtered documents, and upload that file to an S3 bucket.
Please suggest an appropriate way to implement this with PySpark.
Input JSON
{
"_id" : ObjectId("55a787ee9efccaeb288b457f"),
"data" : {
"N◦ CATEGORIA" : 102.0,
"NOMBRE CATEGORIA" : "GASEOSAS",
"VARIABLE" : "TOP OF HEART",
"VAR." : "TOH",
"MARCA" : "COCA COLA ZERO",
"MES" : "ENERO",
"MES_N" : 1.0,
"AÑO" : 2014.0,
"UNIVERSO_TOTAL" : 1.0433982E7,
"UNIVERSO_FEMENINO" : 5529024.0,
"UNIVERSO_MASCULINO" : 4904958.0,
"PORCENTAJE_TOTAL" : 0.0066,
"PORCENTAJE_FEMENINO" : 0.0125,
"PORCENTAJE_MASCULINO" : null
},
"app_id" : ObjectId("5376349e11bc073138c33163"),
"category" : "excel_RAC",
"subcategory" : "RAC",
"created_time" : NumberLong(1437042670),
"instance_id" : null,
"metric_date" : NumberLong(1437042670),
"campaign_id" : ObjectId("5386602ba102b6cd4528ed93"),
"datasource_id" : ObjectId("559f5c8f9efccacf0a3c9875"),
"duplicate_id" : "695a3f5f562d0a02f1820fe5d91642a5"
}
The input JSON needs to be filtered on "VARIABLE" : "TOP OF HEART", thereby generating output JSON like the following:
Output JSON
{
"_id" : ObjectId("55b5d19e9efcca86118b45a2"),
"widget_type" : "rac_toh_excel",
"campaign_id" : ObjectId("558554b29efccab00a3c987c"),
"datasource_id" : ObjectId("55b5d18f9efcca770b3c986a"),
"date_time" : NumberLong(1388530800),
"data" : {
"key" : "COCA COLA ZERO",
"values" : {
"x" : NumberLong(1388530800),
"y" : 1.0433982E7,
"data" : {
"id" : ObjectId("553a151e5c93ffe0408b46f9"),
"month" : 1.0,
"year" : 2014.0,
"total" : 1.0433982E7,
"variable" : "TOH",
"total_percentage" : 0.0066
}
}
},
"filter" : [
]
}
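A minimal PySpark sketch of the filter step, under two stated assumptions: the S3 file contains one strict-JSON document per line (the ObjectId(...)/NumberLong(...) wrappers shown above are Mongo shell notation and would need to be exported as plain JSON first), and the bucket paths below are placeholders:

```python
import json

def keep_top_of_heart(line):
    """Predicate for rdd.filter(): keep documents whose data.VARIABLE is TOP OF HEART."""
    doc = json.loads(line)
    return doc.get("data", {}).get("VARIABLE") == "TOP OF HEART"

# In the Spark job (paths are placeholders, not from the question):
# rdd = sc.textFile("s3a://my-bucket/input.json")
# filtered = rdd.filter(keep_top_of_heart)
# filtered.saveAsTextFile("s3a://my-bucket/filtered-output")
```

Transforming each kept document into the output shape shown below would then be an additional `map` step before saving.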
How can I make a list of the values inside "urban", but only for type "gasolina"?
{ ... "fuelUse" : {
"urban" : [
{
"value" : 6.2,
"unit" : "km/l",
"type" : "alcool"
},
{
"value" : 8.9,
"unit" : "km/l",
"type" : "gasolina"
}
],
},
...."fuelUse" : {
"urban" : [
{
"value" : 7.8,
"unit" : "km/l",
"type" : "alcool"
},
{
"value" : 10.4,
"unit" : "km/l",
"type" : "gasolina"
}
],
}
}
The desired output: list = [8.9, 10.4]
I tried to iterate this way, but got KeyError: 1:
for c in cars:
    for a in c['fuelUse']['urban']:
        list.append(a[1]['value'])
Try
list.append(a['value'])
instead of
list.append(a[1]['value'])
Since a is not a list but a single object, there is no need for further indexing.
If you want the value of the element whose type is gasolina from each urban list, loop over the elements and check each one's type, rather than indexing into the objects:
for c in cars:
    for a in c['fuelUse']['urban']:
        if a['type'] == 'gasolina':
            list.append(a['value'])
I am not quite sure, as you did not provide the entire data structure, but based on your attempt it could look like this:
output = [x.get("value") for car in cars for x in car.get("fuelUse").get("urban") if x.get("type") == "gasolina"]
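Putting either answer together with hypothetical data shaped like the snippets in the question:

```python
# Hypothetical minimal data matching the structure described in the question.
cars = [
    {"fuelUse": {"urban": [
        {"value": 6.2, "unit": "km/l", "type": "alcool"},
        {"value": 8.9, "unit": "km/l", "type": "gasolina"},
    ]}},
    {"fuelUse": {"urban": [
        {"value": 7.8, "unit": "km/l", "type": "alcool"},
        {"value": 10.4, "unit": "km/l", "type": "gasolina"},
    ]}},
]

values = [x["value"] for car in cars for x in car["fuelUse"]["urban"]
          if x["type"] == "gasolina"]
print(values)  # [8.9, 10.4]
```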
I want to query which comments were made by any user about the machine learning book between '2020-03-15' and '2020-04-25', with the comments ordered from most recent to least recent.
Here is my document.
lib_books = db.lib_books
document_book1 = ({
"bookid" : "99051fe9-6a9c-46c2-b949-38ef78858dd0",
"title" : "Machine learning",
"author" : "Tom Michael",
"date_of_first_publication" : "2000-10-02",
"number_of_pages" : 414,
"publisher" : "New York : McGraw-Hill",
"topics" : ["Machine learning", "Computer algorithms"],
"checkout_list" : [
{
"time_checked_out" : "2020-03-20 09:11:22",
"userid" : "ef1234",
"comments" : [
{
"comment1" : "I just finished it and it is worth learning!",
"time_commented" : "2020-04-01 10:35:13"
},
{
"comment2" : "Some cases are a little bit outdated.",
"time_commented" : "2020-03-25 13:19:13"
},
{
"comment3" : "Can't wait to learning it!!!",
"time_commented" : "2020-03-21 08:21:42"
}]
},
{
"time_checked_out" : "2020-03-04 16:18:02",
"userid" : "ab1234",
"comments" : [
{
"comment1" : "The book is a little bit difficult but worth reading.",
"time_commented" : "2020-03-20 12:18:02"
},
{
"comment2" : "It's hard and takes a lot of time to understand",
"time_commented" : "2020-03-15 11:22:42"
},
{
"comment3" : "I just start reading, the principle of model is well explained.",
"time_commented" : "2020-03-05 09:11:42"
}]
}]
})
I tried this code, but it returns nothing.
query_test = lib_books.find({"bookid": "99051fe9-6a9c-46c2-b949-38ef78858dd0",
                             "checkout_list.comments.time_commented": {"$gte": "2020-03-20", "$lte": "2020-04-20"}})
for x in query_test:
    print(x)
Try this:
pipeline = [
    {'$match': {'bookid': "99051fe9-6a9c-46c2-b949-38ef78858dd0"}},  # bookid filter
    {'$unwind': '$checkout_list'},
    {'$unwind': '$checkout_list.comments'},
    {'$match': {'checkout_list.comments.time_commented': {"$gte": "2020-03-20", "$lte": "2020-04-20"}}},
    {'$project': {'_id': 0, 'bookid': 1, 'title': 1, 'comment': '$checkout_list.comments'}},
    {'$sort': {'comment.time_commented': -1}}]  # sort on the projected field; checkout_list no longer exists after $project
query_test = lib_books.aggregate(pipeline)
for x in query_test:
    print(x)
I would recommend keeping the comment field under one name, rather than 'comment1', 'comment2', etc. If the field were always named 'comment', its value could be promoted to the root of the projected document.
Aggregate can be modified as below
pipeline = [
    {'$match': {'bookid': "99051fe9-6a9c-46c2-b949-38ef78858dd0"}},  # bookid filter
    {'$unwind': '$checkout_list'},
    {'$unwind': '$checkout_list.comments'},
    {'$match': {'checkout_list.comments.time_commented': {"$gte": "2020-03-20", "$lte": "2020-04-20"}}},
    {'$project': {'_id': 0, 'bookid': 1, 'title': 1, 'comment': '$checkout_list.comments.comment', 'time_commented': '$checkout_list.comments.time_commented'}},
    {'$sort': {'time_commented': -1}}]
The equivalent MongoDB shell query, in case it is needed:
db.books.aggregate([
    {$match: {'bookid': "99051fe9-6a9c-46c2-b949-38ef78858dd0"}},  // bookid filter
    {$unwind: '$checkout_list'},
    {$unwind: '$checkout_list.comments'},
    {$match: {'checkout_list.comments.time_commented': {"$gte": "2020-03-20", "$lte": "2020-04-20"}}},
    {$project: {_id: 0, bookid: 1, title: 1, comment: '$checkout_list.comments.comment', time_commented: '$checkout_list.comments.time_commented'}},
    {$sort: {'time_commented': -1}}
])
If there are multiple documents you need to search, you can use an $in condition in the match stage:
{$match:{'bookid':{$in:["99051fe9-6a9c-46c2-b949-38ef78858dd0","99051fe9-6a9c-46c2-b949-38ef78858dd1"]}}},//bookid filter
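Note that the range comparison on time_commented works only because the timestamps are zero-padded ISO-style strings, which sort lexicographically in date order. A plain-Python sanity check of the same filter-and-sort logic (the book dict below is a hypothetical stand-in for one document, with only the fields the filter touches):

```python
def comments_between(book, start, end):
    """Collect all embedded comments in the date range, newest first."""
    out = [c for checkout in book["checkout_list"] for c in checkout["comments"]]
    out = [c for c in out if start <= c["time_commented"] <= end]
    return sorted(out, key=lambda c: c["time_commented"], reverse=True)

book = {"checkout_list": [
    {"comments": [{"time_commented": "2020-04-01 10:35:13"},
                  {"time_commented": "2020-03-25 13:19:13"},
                  {"time_commented": "2020-03-21 08:21:42"}]},
    {"comments": [{"time_commented": "2020-03-20 12:18:02"},
                  {"time_commented": "2020-03-15 11:22:42"},
                  {"time_commented": "2020-03-05 09:11:42"}]}]}

result = comments_between(book, "2020-03-20", "2020-04-20")
print([c["time_commented"] for c in result])
# ['2020-04-01 10:35:13', '2020-03-25 13:19:13', '2020-03-21 08:21:42', '2020-03-20 12:18:02']
```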
I am wondering what I am doing wrong when trying to print the name field from the data in the following Python code.
import urllib.request, json
with urllib.request.urlopen("<THIS IS A URL IN THE ORIGINAL SCRIPT>") as url:
    data = json.loads(url.read().decode())
    print(data['Departure']['Product']['name'])
    print(data['Departure']['Stops']['Stop'][0]['depTime'])
And this is the API response I am fetching the data from:
{
"Departure" : [ {
"Product" : {
"name" : "Länstrafik - Buss 201",
"num" : "201",
"catCode" : "7",
"catOutS" : "BLT",
"catOutL" : "Länstrafik - Buss",
"operatorCode" : "254",
"operator" : "JLT",
"operatorUrl" : "http://www.jlt.se"
},
"Stops" : {
"Stop" : [ {
"name" : "Gislaved Lundåkerskolan",
"id" : "740040260",
"extId" : "740040260",
"routeIdx" : 12,
"lon" : 13.530096,
"lat" : 57.298178,
"depTime" : "20:55:00",
"depDate" : "2019-03-05"
}
data["Departure"] is a list, and you are indexing into it like it's a dictionary.
You wrote the dictionary sample confusingly. Here's how I think it looks:
d = {
"Departure" : [ {
"Product" : {
"name" : "Länstrafik - Buss 201",
"num" : "201",
"catCode" : "7",
"catOutS" : "BLT",
"catOutL" : "Länstrafik - Buss",
"operatorCode" : "254",
"operator" : "JLT",
"operatorUrl" : "http://www.jlt.se"
},
"Stops" : {
"Stop" : [ {
"name" : "Gislaved Lundåkerskolan",
"id" : "740040260",
"extId" : "740040260",
"routeIdx" : 12,
"lon" : 13.530096,
"lat" : 57.298178,
"depTime" : "20:55:00",
"depDate" : "2019-03-05"
}]}}]}
And here's how you can print depTime
print(d["Departure"][0]["Stops"]["Stop"][0]["depTime"])
The important part you missed is d["Departure"][0] because d["Departure"] is a list.
As Kyle said in the previous answer, data["Departure"] is a list, but you're trying to use it as a dictionary. There are two possible solutions.
Change data["Departure"]["Stops"]["Stop"] etc. to data["Departure"][0]["Stops"]["Stop"] etc.
Change the JSON file to make departure into a dictionary, which would allow you to keep your original code. This would make the final JSON snippet look like this:
"Departure" : {
"Product" : {
"name" : "Länstrafik - Buss 201",
"num" : "201",
"catCode" : "7",
"catOutS" : "BLT",
"catOutL" : "Länstrafik - Buss",
"operatorCode" : "254",
"operator" : "JLT",
"operatorUrl" : "http://www.jlt.se"
},
"Stops" : {
"Stop" : [ {
"name" : "Gislaved Lundåkerskolan",
"id" : "740040260",
"extId" : "740040260",
"routeIdx" : 12,
"lon" : 13.530096,
"lat" : 57.298178,
"depTime" : "20:55:00",
"depDate" : "2019-03-05"
} ]
}
}
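With the first option, a minimal stand-in for the response (field values abbreviated from the sample above) shows the indexing:

```python
# Minimal stand-in for the API response: Departure is a list, so index it first.
data = {"Departure": [{
    "Product": {"name": "Länstrafik - Buss 201"},
    "Stops": {"Stop": [{"depTime": "20:55:00", "depDate": "2019-03-05"}]},
}]}

print(data["Departure"][0]["Product"]["name"])               # Länstrafik - Buss 201
print(data["Departure"][0]["Stops"]["Stop"][0]["depTime"])   # 20:55:00
```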
Sample documents in the collection are as follows:
[
{ "_id" : ObjectId("57690ce4a89aa8aa92ed1896"), "total_enters" : 308974, "Segment" : "7", "Chain" : "11625", "Geography" : "303"},
{ "_id" : ObjectId("57690ce4a89aa8aa92ed1897"), "total_enters" : 311076, "Segment" : "7", "Chain" : "4624", "Geography" : "303"}
]
I have a collection in the above format.
The following query is returning an error:
InvalidDocument: Cannot encode object: set(['$total_enters'])
cursor_v2 = db.Chain_share.aggregate([{
    "$group": {
        "_id": {"Geography": "$Geography", "Segment": "$Segment"},
        "enters": {"$sum": "$total_enters"},
        "max_chain": {"$max": {"$total_enters"}}
    }
}])
The field total_enters is itself an aggregate field, obtained as the result of a $sum operation. I cannot figure out why it cannot be summed up again. What is causing the encoding error?
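For reference on what the error message means: in Python, braces around a bare value (no key) create a set literal, not a dict, and PyMongo cannot encode sets to BSON, which matches the "Cannot encode object: set([...])" message. A minimal check of the two shapes:

```python
expr = {"$total_enters"}        # braces with no key -> a Python set
print(type(expr).__name__)      # set

max_stage = {"$max": "$total_enters"}   # key-value pair -> a dict; the operator
print(type(max_stage["$max"]).__name__) # str: $max takes the field path directly
```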
I am using this approach to get the comments on page data. It is working fine, but I need to dump the data into MongoDB. With this approach the data is inserted, but as a single document. I want every comment to be stored as a separate document, along with the information I am getting from the API.
from facepy import GraphAPI
import json
import pymongo

connection = pymongo.MongoClient("mongodb://localhost")
facebook = connection.facebook
commen = facebook.comments
access = ''
graph = GraphAPI(access)
page_id = 'micromaxinfo'
datas = graph.get(page_id + '/posts?fields=comments,created_time', page=True, retry=5)
posts = []
for data in datas:
    print(data)
    commen.insert(data)
    break
Output Stored in MongoDB:
{
"created_time" : "2015-11-04T08:04:14+0000",
"id" : "120735417936636_1090909150919253",
"comments" : {
"paging" : {
"cursors" : {
"after" : "WTI5dGJXVnVkRjlqZFhKemIzSTZNVEE1TVRReE5ESTVOelV6TlRRd05Ub3hORFEyTnpFNU5UTTU=",
"before" : "WTI5dGJXVnVkRjlqZFhKemIzSTZNVEE1TURrd09UVTRNRGt4T1RJeE1Eb3hORFEyTmpJME16Z3g="
}
},
"data" : [
{
"created_time" : "2015-11-04T08:06:21+0000",
"message" : "my favorite mobiles on canvas silver",
"from" : {
"name" : "Velchamy Alagar",
"id" : "828304797279948"
},
"id" : "1090909130919255_1090909580919210"
},
{
"created_time" : "2015-11-04T08:10:13+0000",
"message" : "Micromax mob. मैने कुछ दिन पहले Micromax Bolt D321 mob. खरिद लिया | Bt मेरा मोबा. बहुत गरम होता है Without internate. और internate MB कम समय मेँ ज्यादा खर्च होती है | कोई तो help करो.",
"from" : {
"name" : "Amit Gangurde",
"id" : "1637669796485258"
},
"id" : "1090909130919255_1090910364252465"
},
{
"created_time" : "2015-11-04T08:10:27+0000",
"message" : "Nice phones.",
"from" : {
"name" : "Nayan Chavda",
"id" : "1678393592373659"
},
"id" : "1090909130919255_1090910400919128"
},
{
"created_time" : "2015-11-04T08:10:54+0000",
"message" : "sir micromax bolt a089 mobile ki battery price kitna. #micromax mobile",
"from" : {
"name" : "Arit Singha Roy",
"id" : "848776351903695"
},
So technically I want to store only the information coming in the data field:
{
"created_time" : "2015-11-04T08:10:54+0000",
"message" : "sir micromax bolt a089 mobile ki battery price kitna. #micromax mobile",
"from" : {
"name" : "Arit Singha Roy",
"id" : "848776351903695"
}
How to get this into my database?
You can use the Pentaho Data Integration open-source ETL tool for this. I use it to store specific fields from the JSON output of tweets.
Select the fields you want to parse from the JSON, and choose an output such as CSV or a table output in Oracle, etc.
Hope this helps.
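Alternatively, staying in Python: each element of the comments.data array from the Graph API payload can be inserted as its own MongoDB document. A sketch under the question's setup (extract_comments is a helper written here, not part of facepy; the insert loop needs a live MongoDB, so it is shown commented out):

```python
def extract_comments(post):
    """Return the individual comment dicts embedded in one post payload."""
    return post.get("comments", {}).get("data", [])

# Using the commen collection and datas iterator from the question:
# for post in datas:
#     for comment in extract_comments(post):
#         comment["post_id"] = post["id"]   # keep a back-reference to the post
#         commen.insert_one(comment)        # one document per comment
```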