I want to count the number of elements in an array using Python and PyMongo. Here is the data:
{
    "_id": 5,
    "type": "Student",
    "Applicates": [
        {
            "appId": 100,
            "School": "dfgdfgd",
            "Name": "tony",
            "URL": "www.url.com",
            "Time": "5/5/5-6/6/6",
            "Research": "dfgdfg",
            "Budge": 5000,
            "citizenship": "us",
            "Major": "csc",
            "preAwards": "None",
            "Advisor": "dfgdfg",
            "Evaluators": [
                {
                    "abstractScore": 10,
                    "goalsObjectivesScore": 20,
                    "evalNum": 1
                },
                {
                    "abstractScore": 30,
                    "goalsObjectivesScore": 40,
                    "evalNum": 2
                },
                {
                    "abstractScore": 50,
                    "goalsObjectivesScore": 60,
                    "evalNum": 3
                }
            ]
        },
        {
            "appId": 101,
            "School": "dvdu",
            "Name": "jessy",
            "URL": "www.url.com",
            "Time": "4/4/4-6/6/6",
            "Research": "dfgdfg",
            "Budge": 7500,
            "citizenship": "us",
            "Major": "dfgdfg",
            "preAwards": "dfgfd",
            "Advisor": "dfgdfg",
            "Evaluators": [
                {
                    "abstractScore": 70,
                    "goalsObjectivesScore": 80,
                    "evalNum": 1
                },
                {
                    "abstractScore": 90,
                    "goalsObjectivesScore": 100,
                    "evalNum": 2
                }
            ]
        }
    ]
}
So I want to get the size of the Evaluators array: {"appId": 100} would give 3 and {"appId": 101} would give 2. I have been playing around with $size but can't seem to get it to work.
Queries return documents. No query will return the size of the Evaluators array in the array element with "appId": 100. But the following (awkwardly formatted) expression will do what you want:
len(coll.find_one(
    { "Applicates.appId" : 100 },
    { "Applicates.$.Evaluators" : 1 }
)["Applicates"][0]["Evaluators"])
where coll is the Collection object.
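For reference, a sketch of obtaining that Collection object (the database and collection names here are hypothetical):

from pymongo import MongoClient

# Hypothetical database/collection names; substitute your own.
coll = MongoClient()["mydb"]["applications"]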
With the syntax { $size: <expression> } you can count the number of items in an array. See the MongoDB documentation for $size for more.
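For example, a minimal aggregation sketch (assuming a PyMongo Collection named coll and the document shown in the question) that projects the size of Evaluators for a given appId:

# Unwind the Applicates array, keep the entry with the requested appId,
# and project the $size of its Evaluators array.
pipeline = [
    {"$unwind": "$Applicates"},
    {"$match": {"Applicates.appId": 100}},
    {"$project": {"numEvaluators": {"$size": "$Applicates.Evaluators"}}},
]
for doc in coll.aggregate(pipeline):
    print(doc["numEvaluators"])  # 3 here; matching appId 101 would print 2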
One approach would be to loop through the array and, on each iteration, use len() on that dict's Evaluators value.
for obj in Applicates:
    count = len(obj['Evaluators'])
I am trying to link several Altair charts that share aspects of the same data. I can do this by merging all the data into one data frame, but because of the nature of the data, the merged data frame is much larger than two separate data frames (one per chart) would be. This is because the columns unique to each chart have many repeated rows for each entry in the shared column.
Would using transform_lookup save space over just using the merged data frame, or does transform_lookup end up doing the whole merge internally?
No, the entire dataset is still included in the Vega spec when you use transform_lookup. You can see this by printing the JSON spec of the charts you create. With the example from the docs:
import altair as alt
import pandas as pd
from vega_datasets import data

people = data.lookup_people().head(3)
people

     name  age  height
0    Alan   25     180
1  George   32     174
2    Fred   39     182

groups = data.lookup_groups().head(3)
groups

   group  person
0      1    Alan
1      1  George
2      1    Fred
With pandas merge:
merged = pd.merge(groups, people, how='left',
                  left_on='person', right_on='name')

print(alt.Chart(merged).mark_bar().encode(
    x='mean(age):Q',
    y='group:O'
).to_json())
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.8.1.json",
  "config": {
    "view": {
      "continuousHeight": 300,
      "continuousWidth": 400
    }
  },
  "data": {
    "name": "data-b41b97ffc89b39c92e168871d447e720"
  },
  "datasets": {
    "data-b41b97ffc89b39c92e168871d447e720": [
      {
        "age": 25,
        "group": 1,
        "height": 180,
        "name": "Alan",
        "person": "Alan"
      },
      {
        "age": 32,
        "group": 1,
        "height": 174,
        "name": "George",
        "person": "George"
      },
      {
        "age": 39,
        "group": 1,
        "height": 182,
        "name": "Fred",
        "person": "Fred"
      }
    ]
  },
  "encoding": {
    "x": {
      "aggregate": "mean",
      "field": "age",
      "type": "quantitative"
    },
    "y": {
      "field": "group",
      "type": "ordinal"
    }
  },
  "mark": "bar"
}
With transform_lookup, all the data is still there, but as two separate datasets (so technically it takes a little more space because of the additional braces and the transform entry):
print(alt.Chart(groups).mark_bar().encode(
    x='mean(age):Q',
    y='group:O'
).transform_lookup(
    lookup='person',
    from_=alt.LookupData(data=people, key='name',
                         fields=['age'])
).to_json())
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.8.1.json",
  "config": {
    "view": {
      "continuousHeight": 300,
      "continuousWidth": 400
    }
  },
  "data": {
    "name": "data-5fe242a79352d1fe243b588af570c9c6"
  },
  "datasets": {
    "data-2b374d1509415e1d327c3a7521f8117c": [
      {
        "age": 25,
        "height": 180,
        "name": "Alan"
      },
      {
        "age": 32,
        "height": 174,
        "name": "George"
      },
      {
        "age": 39,
        "height": 182,
        "name": "Fred"
      }
    ],
    "data-5fe242a79352d1fe243b588af570c9c6": [
      {
        "group": 1,
        "person": "Alan"
      },
      {
        "group": 1,
        "person": "George"
      },
      {
        "group": 1,
        "person": "Fred"
      }
    ]
  },
  "encoding": {
    "x": {
      "aggregate": "mean",
      "field": "age",
      "type": "quantitative"
    },
    "y": {
      "field": "group",
      "type": "ordinal"
    }
  },
  "mark": "bar",
  "transform": [
    {
      "from": {
        "data": {
          "name": "data-2b374d1509415e1d327c3a7521f8117c"
        },
        "fields": [
          "age",
          "height"
        ],
        "key": "name"
      },
      "lookup": "person"
    }
  ]
}
Where transform_lookup can save space is when you use it with the URLs of two datasets:
people = data.lookup_people.url
groups = data.lookup_groups.url

print(alt.Chart(groups).mark_bar().encode(
    x='mean(age):Q',
    y='group:O'
).transform_lookup(
    lookup='person',
    from_=alt.LookupData(data=people, key='name',
                         fields=['age'])
).to_json())
{
  "$schema": "https://vega.github.io/schema/vega-lite/v4.8.1.json",
  "config": {
    "view": {
      "continuousHeight": 300,
      "continuousWidth": 400
    }
  },
  "data": {
    "url": "https://vega.github.io/vega-datasets/data/lookup_groups.csv"
  },
  "encoding": {
    "x": {
      "aggregate": "mean",
      "field": "age",
      "type": "quantitative"
    },
    "y": {
      "field": "group",
      "type": "ordinal"
    }
  },
  "mark": "bar",
  "transform": [
    {
      "from": {
        "data": {
          "url": "https://vega.github.io/vega-datasets/data/lookup_people.csv"
        },
        "fields": [
          "age",
          "height"
        ],
        "key": "name"
      },
      "lookup": "person"
    }
  ]
}
I have an input JSON file that looks roughly like this:
[
  {
    "identifier": "116S5RJ63",
    "containers": [
      {
        "contains": "soap",
        "height": {
          "unit": "FT",
          "value": 12.07123829231181
        },
        "length": {
          "unit": "FT",
          "value": 12.07123829231181
        },
        "quantity": 1,
        "weight": {
          "unit": "volumeUnits",
          "value": 10000
        },
        "width": {
          "unit": "FT",
          "value": 12.07123829231181
        }
      }
    ]
  },
  {...}
]
I read it in using
input_json = pd.read_json(input_json_file)
I then process the input_json a bit: nothing dramatic, just changing the contents of some fields. Next I try to write the JSON back out with
input_json.to_json(output_file, orient='records', date_format='iso')
but the output looks like this
[
  {
    "index": 28741,
    "identifier": "115JKLJVZ",
    "containers": [
      {
        "contains": "soap",
        "height": {
          "unit": "FT",
          "value": 12.07123829231181
        },
        "length": {
          "unit": "FT",
          "value": 12.07123829231181
        },
        "quantity": 1,
        "weight": {
          "unit": "volumeUnits",
          "value": 10000
        },
        "width": {
          "unit": "FT",
          "value": 12.07123829231181
        }
      }
    ]
  },
  {...}
]
Specifically, it now includes the field 'index', which I thought orient='records' was supposed to deal with. I'm not sure what to do next. Any suggestions?
Try input_json.reset_index(drop=True, inplace=True) before saving it to file; this should stop the old index from being added as a column. See the pandas reset_index documentation.
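Put together, a minimal sketch (the file names are placeholders):

import pandas as pd

input_json = pd.read_json("input.json")          # placeholder input path
# ... change the contents of some fields here ...
input_json.reset_index(drop=True, inplace=True)  # drop the old index
input_json.to_json("output.json", orient='records', date_format='iso')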
I just downloaded some JSON from Spotify and took a look at pd.json_normalize().
But if I normalise the data, I still have dictionaries within my dataframe. Setting the level (max_level) doesn't help either.
The data I want to have in my dataframe:
{
  "collaborative": false,
  "description": "",
  "external_urls": {
    "spotify": "https://open.spotify.com/playlist/5"
  },
  "followers": {
    "href": null,
    "total": 0
  },
  "href": "https://api.spotify.com/v1/playlists/5?additional_types=track",
  "id": "5",
  "images": [
    {
      "height": 640,
      "url": "https://i.scdn.co/image/a",
      "width": 640
    }
  ],
  "name": "Another",
  "owner": {
    "display_name": "user",
    "external_urls": {
      "spotify": "https://open.spotify.com/user/user"
    },
    "href": "https://api.spotify.com/v1/users/user",
    "id": "user",
    "type": "user",
    "uri": "spotify:user:user"
  },
  "primary_color": null,
  "public": true,
  "snapshot_id": "M2QxNTcyYTkMDc2",
  "tracks": {
    "href": "https://api.spotify.com/v1/playlists/100&additional_types=track",
    "items": [
      {
        "added_at": "2020-12-13T18:34:09Z",
        "added_by": {
          "external_urls": {
            "spotify": "https://open.spotify.com/user/user"
          },
          "href": "https://api.spotify.com/v1/users/user",
          "id": "user",
          "type": "user",
          "uri": "spotify:user:user"
        },
        "is_local": false,
        "primary_color": null,
        "track": {
          "album": {
            "album_type": "album",
            "artists": [
              {
                "external_urls": {
                  "spotify": "https://open.spotify.com/artist/1dfeR4Had"
                },
                "href": "https://api.spotify.com/v1/artists/1dfDbWqFHLkxsg1d",
                "id": "1dfeR4HaWDbWqFHLkxsg1d",
                "name": "Q",
                "type": "artist",
                "uri": "spotify:artist:1dfeRqFHLkxsg1d"
              }
            ],
            "available_markets": [
              "CA",
              "US"
            ],
            "external_urls": {
              "spotify": "https://open.spotify.com/album/6wPXmlLzZ5cCa"
            },
            "href": "https://api.spotify.com/v1/albums/6wPXUJ9LzZ5cCa",
            "id": "6wPXUmYJ9zZ5cCa",
            "images": [
              {
                "height": 640,
                "url": "https://i.scdn.co/image/ab676620a47",
                "width": 640
              },
              {
                "height": 300,
                "url": "https://i.scdn.co/image/ab67616d0620a47",
                "width": 300
              },
              {
                "height": 64,
                "url": "https://i.scdn.co/image/ab603e6620a47",
                "width": 64
              }
            ],
            "name": "The (Deluxe ",
            "release_date": "1920-07-17",
            "release_date_precision": "day",
            "total_tracks": 15,
            "type": "album",
            "uri": "spotify:album:6m5cCa"
          },
          "artists": [
            {
              "external_urls": {
                "spotify": "https://open.spotify.com/artist/1dg1d"
              },
              "href": "https://api.spotify.com/v1/artists/1dsg1d",
              "id": "1dfeR4HaWDbWqFHLkxsg1d",
              "name": "Q",
              "type": "artist",
              "uri": "spotify:artist:1dxsg1d"
            }
          ],
          "available_markets": [
            "CA",
            "US"
          ],
          "disc_number": 1,
          "duration_ms": 21453,
          "episode": false,
          "explicit": false,
          "external_ids": {
            "isrc": "GBU6015"
          },
          "external_urls": {
            "spotify": "https://open.spotify.com/track/5716J"
          },
          "href": "https://api.spotify.com/v1/tracks/5716J",
          "id": "5716J",
          "is_local": false,
          "name": "Another",
          "popularity": 73,
          "preview_url": null,
          "track": true,
          "track_number": 3,
          "type": "track",
          "uri": "spotify:track:516J"
        },
        "video_thumbnail": {
          "url": null
        }
      }
    ],
    "limit": 100,
    "next": null,
    "offset": 0,
    "previous": null,
    "total": 1
  },
  "type": "playlist",
  "uri": "spotify:playlist:fek"
}
So what are best practices to read nested data like this into one dataframe in pandas?
I'd be glad for any advice.
EDIT:
So basically I want all keys as columns in my dataframe. But with normalise it stops at "tracks.items", and if I normalise that again I run into the same problem recursively.
It depends on the information you are looking for. Take a look at pandas.read_json() to see if that can work. You can also select data directly, like this:
json_output = {"collaborative": 'false',
               "description": "",
               "external_urls": {"spotify": "https://open.spotify.com/playlist/5"}}

df['collaborative'] = json_output['collaborative']  # set your df's value to the value from the returned JSON
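As a hedged sketch of one common pattern for data shaped like this: call pd.json_normalize once for the playlist-level keys and once for the nested track items, using record_path and meta (the variable playlist below is assumed to hold the dict shown above):

import pandas as pd

# playlist = <the playlist dict shown in the question>
playlist_df = pd.json_normalize(playlist)  # playlist-level keys become columns

# Flatten the nested track items into their own dataframe,
# carrying the playlist id along as metadata.
tracks_df = pd.json_normalize(
    playlist,
    record_path=["tracks", "items"],
    meta=["id"],
    meta_prefix="playlist.",
)

Keeping the playlist-level and item-level tables separate avoids the recursive-flattening problem; you can join them on playlist.id if you need one frame.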
I am trying to return just the array elements that meet my criteria. Here is what I have:
{
  "_id": 1,
  "awardAmount": 20000,
  "url": "www.url.com",
  "numAwards": 2,
  "award": "Faculty Research Grant",
  "Type": "faculty",
  "Applicants": [
    {
      "preAwards": "NO1",
      "Name": "Omar1",
      "School": "SCSU1",
      "citizenship": "YES1",
      "budget": 1,
      "Advisor": "Dr. DaPonte1",
      "Major": "CSC1",
      "appId": 100,
      "Research": "Test data entry1",
      "Time": "12 months1",
      "URL": "www.url.com",
      "Evaluators": [
        {
          "abstractScore": 11,
          "evalNum": 1,
          "goalsObjectivesScore": 11
        },
        {
          "abstractScore": 22,
          "evalNum": 2,
          "goalsObjectivesScore": 22
        }
      ]
    },
    {
      "preAwards": "NO2",
      "citizenship": "YES2",
      "Major": "CSC2",
      "Time": "12 months2",
      "budget": 2,
      "URL": "www.2.com",
      "appId": 200,
      "Advisor": "Dr. DaPonte2",
      "Name": "Omar2",
      "Research": "Test data entry2",
      "School": "SCSU2",
      "url": "www.2.com"
    },
    {
      "preAwards": "NO3",
      "citizenship": "YES3",
      "Major": "CSC3",
      "Time": "12 months3",
      "budget": 3,
      "URL": "www.3.com",
      "appId": 300,
      "Advisor": "Dr. DaPonte3",
      "Name": "Omar3",
      "Research": "Test data entry3",
      "School": "SCSU3",
      "url": "www.3.com",
      "Evaluators": [
        {
          "abstractScore": 454,
          "evalNum": 1,
          "goalsObjectivesScore": 4546
        }
      ]
    }
  ]
}
I want to get back just the applicants that don't have an Evaluators field:
{
  "_id": 1,
  "awardAmount": 20000,
  "url": "www.url.com",
  "numAwards": 2,
  "award": "Faculty Research Grant",
  "Type": "faculty",
  "Applicants": [
    {
      "preAwards": "NO2",
      "citizenship": "YES2",
      "Major": "CSC2",
      "Time": "12 months2",
      "budget": 2,
      "URL": "www.2.com",
      "appId": 200,
      "Advisor": "Dr. DaPonte2",
      "Name": "Omar2",
      "Research": "Test data entry2",
      "School": "SCSU2",
      "url": "www.2.com"
    }
  ]
}
This is just an example of one document. I want all the Applicants with no Evaluators field, across all documents.
Using aggregation with PyMongo:
col.aggregate([
    {"$unwind": "$Applicants"},
    {"$match": {"Applicants.Evaluators": {"$exists": False}}}
])
Output
{'ok': 1.0,
 'result': [{'Applicants': {'Advisor': 'Dr. DaPonte2',
                            'Major': 'CSC2',
                            'Name': 'Omar2',
                            'Research': 'Test data entry2',
                            'School': 'SCSU2',
                            'Time': '12 months2',
                            'URL': 'www.2.com',
                            'appId': 200,
                            'budget': 2,
                            'citizenship': 'YES2',
                            'preAwards': 'NO2',
                            'url': 'www.2.com'},
             'Type': 'faculty',
             '_id': 1,
             'award': 'Faculty Research Grant',
             'awardAmount': 20000,
             'numAwards': 2,
             'url': 'www.url.com'}]}
In the mongo shell you can do this:
db.test.find(
  {
    "Applicants": { $elemMatch: { "Evaluators": { $exists: 0 } } }
  },
  {
    "_id": 1,
    "awardAmount": 1,
    "url": 1,
    "numAwards": 1,
    "award": 1,
    "Type": 1,
    "Applicants.$": 1
  }
);
One problem is that the above query returns only the first matching applicant with no Evaluators. The complete solution is achieved via aggregation:
db.test.aggregate([
  { $match: { Applicants: { $elemMatch: { "Evaluators": { $exists: 0 } } } } },
  { $unwind: "$Applicants" },
  { $match: { "Applicants.Evaluators": { $exists: 0 } } },
  {
    $group: {
      _id: "$_id",
      Applicants: { $push: "$Applicants" },
      awardAmount: { $first: "$awardAmount" },
      url: { $first: "$url" },
      numAwards: { $first: "$numAwards" },
      award: { $first: "$award" },
      Type: { $first: "$Type" }
    }
  }
])
If I understand your question correctly, I would suggest using the aggregation pipeline to $unwind the documents on your 'Applicants' field. You can then filter the resulting documents using $match to remove the documents where 'Evaluators' exists, then $group them back together using $first and $push; see the sketch below. Hope this is of some help.
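A PyMongo sketch of that pipeline (mirroring the shell aggregation above; col is assumed to be the Collection object):

# Unwind Applicants, keep only entries without an Evaluators field,
# then group the survivors back into one document per _id.
pipeline = [
    {"$unwind": "$Applicants"},
    {"$match": {"Applicants.Evaluators": {"$exists": False}}},
    {"$group": {
        "_id": "$_id",
        "Applicants": {"$push": "$Applicants"},
        "awardAmount": {"$first": "$awardAmount"},
        "url": {"$first": "$url"},
        "numAwards": {"$first": "$numAwards"},
        "award": {"$first": "$award"},
        "Type": {"$first": "$Type"},
    }},
]
for doc in col.aggregate(pipeline):
    print(doc)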
I'm trying to edit some fields in a document
{
  "_id": 5,
  "Applicates": [
    {
      "School": "UCONN",
      "Name": "Mike",
      "Research": "cloud computing",
      "Budge": 5000,
      "appId": 100,
      "Time": "5/5/5-6/6/6",
      "citizenship": "us",
      "Evaluators": [
        {
          "abstractScore": null,
          "goalsObjectivesScore": null,
          "evalNum": 1
        },
        {
          "abstractScore": null,
          "goalsObjectivesScore": null,
          "evalNum": 2
        },
        {
          "abstractScore": null,
          "goalsObjectivesScore": null,
          "evalNum": 3
        }
      ],
      "Major": "csc",
      "preAwards": "none",
      "Advisor": "Dr. pie"
    },
    {
      "School": "psu",
      "Name": "Tom",
      "Research": "topology",
      "Budge": 7500,
      "appId": 101,
      "Time": "1/1/1-2/2/2",
      "citizenship": "us",
      "Evaluators": [
        {
          "abstractScore": null,
          "goalsObjectivesScore": null,
          "evalNum": 1
        },
        {
          "abstractScore": null,
          "goalsObjectivesScore": null,
          "evalNum": 2
        },
        {
          "abstractScore": null,
          "goalsObjectivesScore": null,
          "evalNum": 3
        }
      ],
      "Major": "MAT",
      "preAwards": "none",
      "Advisor": "Dr. cool"
    }
  ]
}
I need to update all the null values. I have been trying to do it in Python with $set but had no luck. Here is what I was trying:
posts.update({"_id" : 5,"Applicates.Name":"Mike","Application.Evaluators.evalNum":"1"},{"$set":{"Applicates.Evaluators.abstractScore" :10}})
So I'm asking: how do I update each null field separately? What I wanted my code above to do is update the first abstractScore in Evaluators for {"Applicates.Name": "Mike"}. I also want to update the other two abstractScore values for {"Applicates.Name": "Mike"}, and the three for {"Applicates.Name": "Tom"}, all separately. Of course I want goalsObjectivesScore updated too, but I'm trying to do one step at a time.
I have looked around quite a lot and can't seem to find a solid answer. Any help would be appreciated.
I got close enough with this line of code. It's not exactly what I wanted, but I made it work for what I needed.
posts.update({"_id":5,"Applicates.appId":100},{"$set":{"Applicates.$.Evaluators.0.abstractScore": 10}})