Extracting Multiple Fields from an Array of Dictionaries - python

I have an array of dictionaries that looks like this:
[{u'description': None, u'url': u'https://epi.testsite.net/index.php?/suites/view/196', u'is_completed': False, u'is_baseline': False, u'completed_on': None, u'is_master': False, u'project_id': 13, u'id': 196, u'name': u'Very Basic'}, {u'description': None, u'url': u'https://epi.testsite.net/index.php?/suites/view/200', u'is_completed': False, u'is_baseline': False, u'completed_on': None, u'is_master': False, u'project_id': 13, u'id': 200, u'name': u'Stress Testing'}]
and some Python code written to extract the 'id' field. The code is as follows:
suites_list = client.send_get('get_suites/' + pid)
suites_list_ids = [item['id'] for item in suites_list]
return suites_list_ids
suites_list generates the data above; suites_list_ids generates a tidy output as follows:
[196, 200]
I would like to pull a second field, 'name', and have that included in the output. The desired result looks like this:
[ {196,'Very Basic'}, {200, 'Stress Testing'} ]
I have been burning many cycles on this one and probably am overlooking something simple. Appreciate any advice.
Dan.

You can do something like this:
suites_list_vals = [(item['id'], item['name']) for item in suites_list]
output:
[(196, 'Very Basic'), (200, 'Stress Testing')]
That is a list of tuples. To iterate over it you can do something like this:
for val in suites_list_vals:
    print(val[0], ':', val[1])
output:
196 : Very Basic
200 : Stress Testing
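If you want to look names up by id later, a dict comprehension may serve better than a list of tuples. A minimal sketch, with sample data standing in for the client.send_get() response:

```python
# Sample data standing in for the API response in the question.
suites_list = [
    {'id': 196, 'name': 'Very Basic'},
    {'id': 200, 'name': 'Stress Testing'},
]
# Map each suite id to its name.
suites_by_id = {item['id']: item['name'] for item in suites_list}
print(suites_by_id)  # {196: 'Very Basic', 200: 'Stress Testing'}
```

Then `suites_by_id[196]` gives you `'Very Basic'` directly, without scanning a list.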

Related

Scraping a list with scrapy and structure it

I'm trying to scrape every title and score from the page https://myanimelist.net/animelist/MoonlessMidnite?status=7 and return the data in this form:
{"user" : moonlessmidnite, "anime" : A, "score" : x
"user" : moonlessmidnite, "anime" : B, "score" : x
"user" : moonlessmidnite, "anime" : C, "score" : x }
...etc.
I managed to get table
table = response.xpath('.//tr[@class = "list-table-data"]')
score = table.xpath('.//td[@class = "data score"]//a/text()').extract()
title = table.xpath('.//td//a[@class = "link sort"]').extract()
but when I try to scrape the title or score I get some weird output like:
['\n ', '\n ', '${ item.anime_title }']
Look at the raw HTML of the website:
You see that it indeed contains ${ item.anime_title }.
That indicates that the content is generated via Javascript.
There's no easy solution for that, you'll have to look at the XHR requests that are being done and see if you can get something meaningful.
If you look closely at the HTML, you will see that the data is contained in a big JSON string in the table's data-items attribute.
Try this in the scrapy shell:
fetch('https://myanimelist.net/animelist/MoonlessMidnite?status=7')
import json
json.loads(response.xpath('//table[@class="list-table"]/@data-items').extract_first())
This outputs something like this:
{'status': 2,
'score': 0,
'tags': '',
'is_rewatching': 0,
'num_watched_episodes': 1,
'anime_title': 'Hidan no Aria Special',
'anime_num_episodes': 1,
'anime_airing_status': 2,
'anime_id': 10604,
'anime_studios': None,
'anime_licensors': None,
'anime_season': None,
'has_episode_video': False,
'has_promotion_video': True,
'has_video': True,
'video_url': '/anime/10604/Hidan_no_Aria_Special/video',
'anime_url': '/anime/10604/Hidan_no_Aria_Special',
'anime_image_path': 'https://cdn.myanimelist.net/r/96x136/images/anime/2/29138.jpg?s=90cb8381c58c92d39862ac700c43f7b5',
'is_added_to_list': False,
'anime_media_type_string': 'Special',
'anime_mpaa_rating_string': 'PG-13',
'start_date_string': None,
'finish_date_string': None,
'anime_start_date_string': '12-21-11',
'anime_end_date_string': '12-21-11',
'days_string': None,
'storage_string': '',
'priority_string': 'Low'},
{'status': 6,
'score': 0,
'tags': '',
'is_rewatching': 0,
'num_watched_episodes': 0,
'anime_title': '.hack//Roots',
'anime_num_episodes': 26,
'anime_airing_status': 2,
'anime_id': 873,
'anime_studios': None,
'anime_licensors': None,
'anime_season': None,
'has_episode_video': False,
'has_promotion_video': True,
'has_video': True,
'video_url': '/anime/873/hack__Roots/video',
'anime_url': '/anime/873/hack__Roots',
'anime_image_path': 'https://cdn.myanimelist.net/r/96x136/images/anime/3/13050.jpg?s=db9ff70bf19742172f1d0140c95c4a65',
'is_added_to_list': False,
'anime_media_type_string': 'TV',
'anime_mpaa_rating_string': 'PG-13',
'start_date_string': None,
'finish_date_string': None,
'anime_start_date_string': '04-06-06',
'anime_end_date_string': '09-28-06',
'days_string': None,
'storage_string': '',
'priority_string': 'Low'}
You then just have to use this dict to get the info that you need.
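Once the data-items JSON is parsed, building the {user, anime, score} rows the question asks for is plain dict access. A minimal sketch, with two toy records standing in for the real payload:

```python
import json

# Toy stand-in for the JSON string found in the table's data-items attribute.
raw = ('[{"anime_title": "Hidan no Aria Special", "score": 0},'
       ' {"anime_title": ".hack//Roots", "score": 7}]')
items = json.loads(raw)

# Build one row per anime in the shape the question describes.
rows = [
    {"user": "MoonlessMidnite", "anime": it["anime_title"], "score": it["score"]}
    for it in items
]
print(rows)
```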

Python - JSON results with nested for loops

Nested for loops alongside JSON results can be mind-bending.
With the code below I would like to fetch track names which belong to a specific playlist.
I create a series of empty lists:
#for 'search' endpoint results
owner_ids = []
playlist_query_names = []
playlist_query_ids = []
#for 'playlist' endpoint results
playlist_names = []
track_names = []
track_ids = []
then a function to perform a query:
def query_playlists(query):
    # query 'search' endpoint (desired playlists),
    # with no track_ids to be found
    query_results = sp.search(q=query, type='playlist')
    query_items = query_results['playlists']['items']
    for item in query_items:
        playlist_query_names.append(item['name'])
        playlist_query_ids.append(item['id'])
        owner = item['owner']
        owner_ids.append(owner['id'])
however, one must access another endpoint in order to retrieve track names:
    # lookup 'user_playlists' endpoint (all playlists),
    # where track ids are to be found
    playlist_results = [sp.user_playlists(owner) for owner in owner_ids]
    playlist_items = playlist_results[0]['items']
    for item in playlist_items:
        playlist_name = item['name']
        playlist_names.append(playlist_name)
        username = item['owner']['id']
here is the call to the other endpoint:
        results = sp.user_playlist(username, item['id'], fields="tracks,next")
        # intersect both endpoints
        if playlist_name in playlist_query_names:
            tracks = results['tracks']
            # find desired tracks
            for item in tracks['items']:
                track = item['track']
                track_id = item['track']['uri']
                # add tracks
                track_names.append(track['name'])
                track_ids.append(track_id)
    return track_ids, track_names
Somehow I'm not able to narrow down the track-name results to the playlists obtained via the search endpoint.
Could some rigorous mind please point out where my mistake is?
EDIT
JSON results from the playlists endpoint:
[{u'items': [{u'is_local': False, u'track': {u'album': {u'album_type': u'album', u'name': u'At Last!', u'external_urls': {u'spotify': u'https://open.spotify.com/album/2pBhXw3Hi1hBf8FpAtE101'}, u'uri': u'spotify:album:2pBhXw3Hi1hBf8FpAtE101', u'href': u'https://api.spotify.com/v1/albums/2pBhXw3Hi1hBf8FpAtE101', u'images': [{u'url': u'https://i.scdn.co/image/6387bb37eb021db9f3c9da7173fd093f5ded2429', u'width': 640, u'height': 637}, {u'url': u'https://i.scdn.co/image/55d6dc87cf5f29485c251cf672a0896bd87cc2b9', u'width': 300, u'height': 299}, {u'url': u'https://i.scdn.co/image/8bcfde39549a94c46e9cf51e653572a71aaf1f0d', u'width': 64, u'height': 64}], u'type': u'album', u'id': u'2pBhXw3Hi1hBf8FpAtE101', u'available_markets': [u'AD'..., u'UY']}, u'name': u'At Last - Single Version', u'uri': u'spotify:track:0CmIALzGn4vHIHJG4n3Q4z', u'external_urls': {u'spotify': u'https://open.spotify.com/track/0CmIALzGn4vHIHJG4n3Q4z'}, u'popularity': 65, u'explicit': False, u'preview_url': u'https://p.scdn.co/mp3-preview/a8fc45fd24d3f6aaaa33262cba0d5c91b37d56fd', u'track_number': 7, u'disc_number': 1, u'href': u'https://api.spotify.com/v1/tracks/0CmIALzGn4vHIHJG4n3Q4z', u'artists': [{u'name': u'Etta James', u'external_urls': {u'spotify': u'https://open.spotify.com/artist/0iOVhN3tnSvgDbcg25JoJb'}, u'uri': u'spotify:artist:0iOVhN3tnSvgDbcg25JoJb', u'href': u'https://api.spotify.com/v1/artists/0iOVhN3tnSvgDbcg25JoJb', u'type': u'artist', u'id': u'0iOVhN3tnSvgDbcg25JoJb'}], u'duration_ms': 182400, u'external_ids': {u'isrc': u'USMC16046323'}, u'type': u'track', u'id': u'0CmIALzGn4vHIHJG4n3Q4z', u'available_markets': [u'AD',...u'UY']}, u'added_by': None, u'added_at': u'2012-01-06T10:47:09Z'}, {u'is_local': False...]
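One likely culprit is the intersection step: matching on playlist name, as the code above does, fails as soon as names are not unique or do not round-trip exactly between the two endpoints. A toy sketch of that step keyed on playlist id instead (all names and ids below are made up for illustration):

```python
# Toy stand-ins: results from the 'search' endpoint and from the
# 'user_playlists' endpoint (hypothetical ids and names).
search_hits = [{"id": "pl1", "name": "Focus"}, {"id": "pl2", "name": "Gym"}]
all_playlists = [
    {"id": "pl1", "name": "Focus"},
    {"id": "pl3", "name": "Sleep"},
]

# Intersect on id, not name: ids are unique, names need not be.
wanted_ids = {p["id"] for p in search_hits}
matched = [p for p in all_playlists if p["id"] in wanted_ids]
print(matched)  # [{'id': 'pl1', 'name': 'Focus'}]
```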

Storing Python dictionary data into a csv

I have a list of dicts that stores Facebook status data (Graph API):
>>> len(test_statuses)
3
>>> test_statuses
[{u'comments': {u'data': [{u'created_time': u'2016-01-27T10:47:30+0000',
u'from': {u'id': u'1755814687982070', u'name': u'Fadi Cool Panther'},
u'id': u'447173898813933_447182555479734',
u'message': u'Sidra Abrar'}],
u'paging': {u'cursors': {u'after': u'WTI5dGJXVnVkRjlqZFhKemIzSTZORFEzTVRneU5UVTFORGM1TnpNME9qRTBOVE00T1RFMk5UQT0=',
u'before': u'WTI5dGJXVnVkRjlqZFhKemIzSTZORFEzTVRneU5UVTFORGM1TnpNME9qRTBOVE00T1RFMk5UQT0='}},
u'summary': {u'can_comment': False,
u'order': u'ranked',
u'total_count': 1}},
u'created_time': u'2016-01-27T10:16:56+0000',
u'id': u'5842136044_10153381090881045',
u'likes': {u'data': [{u'id': u'729038357232696'},
{u'id': u'547422955417520'},
{u'id': u'422351987958296'},
{u'id': u'536057309903473'},
{u'id': u'206846772999449'},
{u'id': u'1671329739783719'},
{u'id': u'991398107599340'},
{u'id': u'208751836138231'},
{u'id': u'491047841097510'},
{u'id': u'664580270350825'}],
u'paging': {u'cursors': {u'after': u'NjY0NTgwMjcwMzUwODI1',
u'before': u'NzI5MDM4MzU3MjMyNjk2'},
u'next': u'https://graph.facebook.com/v2.5/5842136044_10153381090881045/likes?limit=10&summary=true&access_token=521971961312518|121ca7ef750debf4c51d1388cf25ead4&after=NjY0NTgwMjcwMzUwODI1'},
u'summary': {u'can_like': False, u'has_liked': False, u'total_count': 13}},
u'link': u'https://www.facebook.com/ukbhangrasongs/videos/447173898813933/',
u'message': u'Track : Ik Waar ( Official Music Video )\nSinger : Falak shabir ft DJ Shadow\nMusic by Dj Shadow\nFor more : UK Bhangra Songs',
u'shares': {u'count': 7},
u'type': u'video'},
{u'comments': {u'data': [],
u'summary': {u'can_comment': False,
u'order': u'chronological',
u'total_count': 0}},
u'created_time': u'2016-01-27T06:15:40+0000',
u'id': u'5842136044_10153380831261045',
u'likes': {u'data': [],
u'summary': {u'can_like': False, u'has_liked': False, u'total_count': 0}},
u'message': u'I want to work with you. tracks for flicks',
u'type': u'status'}]
I need to extract each status text and the text of each comment under the status, which I can do by appending them to separate lists e.g.,:
status_text = []
comment_text = []
for s in test_statuses:
    try:
        status_text.append(s['message'])
        for c in s['comments']['data']:
            comment_text.append(c['message'])
    except:
        continue
This gives me two lists of separate lengths len(status_text) = 2, len(comment_text) = 49.
Unfortunately that's a horrible way of dealing with the data, since I cannot keep track of which comment belongs to which status. Ideally I would like to store this as a tree structure and export it into a csv file, but I can't figure out how to do it.
Probable csv data structure:
Text is_comment
status1 0
status2 0
statusN 0
comment1 status1
comment2 status1
commentN statusN
Why do you need this to be in a CSV? It is already structured and ready to be persisted as JSON.
If you really need the tabular approach offered by CSV, then you have to either denormalize it or use more than one CSV table with references from one to another (and again, the best approach would be to put the data in an SQL database, which takes care of the relationships for you).
That said, the way to denormalize is simply to save the same status text to each row where a comment is - that is: record your CSV row in the innermost loop of your approach:
import csv
status_text = []
comment_text = []
writer = csv.writer(open("mycsv.csv", "wt"))
for s in test_statuses:
    status_text.append(s['message'])
    for c in s['comments']['data']:
        comment_text.append(c['message'])
        writer.writerow((s['message'], c['message']))
Note that you'd probably be better off writing the status id to each row, and creating a second table with the status message, where the id is the key (and putting it in a database instead of various CSV files). And then, again, you are probably better off simply keeping the JSON. If you need search capabilities, use a JSON-capable database such as MongoDB or PostgreSQL.
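A sketch of that denormalized, id-keyed layout: one row per comment, repeating the status id so each comment stays tied to its status (toy statuses stand in for the Graph API data, and the CSV is written to an in-memory buffer for illustration):

```python
import csv
import io

# Toy stand-ins for the Graph API status dicts in the question.
statuses = [
    {"id": "s1", "message": "status1",
     "comments": {"data": [{"message": "comment1"}, {"message": "comment2"}]}},
    {"id": "s2", "message": "status2", "comments": {"data": []}},
]

buf = io.StringIO()  # swap in open("statuses.csv", "w", newline="") for a file
writer = csv.writer(buf)
writer.writerow(("status_id", "status", "comment"))
for s in statuses:
    # One row per comment; the status id column links it back to its status.
    for c in s["comments"]["data"]:
        writer.writerow((s["id"], s["message"], c["message"]))
print(buf.getvalue())
```

Statuses without comments (like "s2" above) simply produce no rows; if you need them listed too, write a row with an empty comment column before the inner loop.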

How do I merge and sort two json lists using key value

I can get a JSON list of groups and devices from an API, but the key values don't allow me to do a merge without manipulating the returned lists. Unfortunately, the group info and devices info have to be retrieved using separate http requests.
The code for getting the group info looks like this:
#Python Code
import requests
import simplejson as json
import datetime
import pprintpp
print datetime.datetime.now().time()
url = 'https://www.somecompany.com/api/v2/groups/?fields=id,name'
s = requests.Session()
## Ver2 API Authenticaion ##
headers = {
    'X-ABC-API-ID': 'nnnn-nnnn-nnnn-nnnn-nnnn',
    'X-ABC-API-KEY': 'nnnnnnnn',
    'X-DE-API-ID': 'nnnnnnnn',
    'X-DE-API-KEY': 'nnnnnnnn'
}
r = json.loads(s.get((url), headers=headers).text)
print "Working...Groups extracted"
groups = r["data"]
print "*** Ver2 API Groups Information ***"
pprintpp.pprint (groups)
The printed output of groups looks like this:
#Groups
[
{u'id': u'0001', u'name': u'GroupA'},
{u'id': u'0002', u'name': u'GroupB'},
]
The code for getting the devices info looks like this:
url = 'https://www.somecompany.com/api/v2/devicess/?limit=500&fields=description,group,id,name'
r = json.loads(s.get((url), headers=headers).text)
print "Working...Devices extracted"
devices = r["data"]
print "*** Ver2 API Devices Information ***"
pprintpp.pprint (devices)
The devices output looks like this:
#Devices
[
{
u'description': u'GroupB 100 (City State)',
u'group': u'https://www.somecompany.com/api/v2/groups/0002/',
u'id': u'90001',
u'name': u'ABC550-3e9',
},
{
u'description': u'GroupA 101 (City State)',
u'group': u'https://www.somecompany.com/api/v2/groups/0001/',
u'id': u'90002',
u'name': u'ABC500-3e8',
}
]
What I would like to do is to be able to merge and sort the two JSON lists into an output that looks like this:
#Desired Output
#Seperated List of GroupA & GroupB Devices
[
{u'id': u'0001', u'name': u'GroupA'},
{
u'description': u'GroupA 101 (City State)',
u'group': u'https://www.somecompany.com/api/v2/groups/0001/',
u'id': u'90002',
u'name': u'ABC500-3e8',
},
{u'id': u'0002', u'name': u'GroupB'},
{
u'description': u'GroupB 100 (City State)',
u'group': u'https://www.somecompany.com/api/v2/groups/0002/',
u'id': u'90001',
u'name': u'ABC550-3e9',
}
]
A couple of problems I am having are that the key names in the groups and devices output are not unique. The key named 'id' in groups is actually the same value as the last 4 digits of the key named 'group' in devices, and is the value I wish to use for the sort. Also, 'id' and 'name' in groups are different from 'id' and 'name' in devices. My extremely limited skill with Python is making this quite the challenge. Any help pointing me in the correct direction for a solution will be greatly appreciated.
This program produces your desired output:
import pprintpp
groups = [
{u'id': u'0001', u'name': u'GroupA'},
{u'id': u'0002', u'name': u'GroupB'},
]
devices = [
{
u'description': u'GroupB 100 (City State)',
u'group': u'https://www.somecompany.com/api/v2/groups/0002/',
u'id': u'90001',
u'name': u'ABC550-3e9',
},
{
u'description': u'GroupA 101 (City State)',
u'group': u'https://www.somecompany.com/api/v2/groups/0001/',
u'id': u'90002',
u'name': u'ABC500-3e8',
}
]
desired = sorted(
    groups + devices,
    key=lambda x: x.get('group', x.get('id') + '/')[-5:-1])
pprintpp.pprint(desired)
Or, if that lambda does not seem self-documenting:
def key(x):
    '''Sort on either the last few digits of x['group'], if that exists,
    or the entirety of x['id'], if x['group'] does not exist.
    '''
    if 'group' in x:
        return x['group'][-5:-1]
    return x['id']

desired = sorted(groups + devices, key=key)
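A variation on the same idea: take the group id from the trailing URL segment rather than a fixed [-5:-1] slice, so ids that are not exactly four characters long would also sort correctly. A sketch with abbreviated records (groups listed before devices, so the stable sort keeps each group header ahead of its devices):

```python
def group_key(record):
    # Devices carry a 'group' URL; use its last path segment as the key.
    url = record.get('group')
    if url is not None:
        return url.rstrip('/').rsplit('/', 1)[-1]
    # Groups have no 'group' URL; their own id is the key.
    return record['id']

# Abbreviated stand-ins for the records in the question.
records = [
    {'id': '0001', 'name': 'GroupA'},
    {'id': '0002', 'name': 'GroupB'},
    {'group': 'https://www.somecompany.com/api/v2/groups/0002/', 'id': '90001'},
    {'group': 'https://www.somecompany.com/api/v2/groups/0001/', 'id': '90002'},
]
merged = sorted(records, key=group_key)
print([r['id'] for r in merged])  # ['0001', '90002', '0002', '90001']
```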

Can't get dictionary value from key in unicode dictionary

The following code stores a unicode dictionary in the variable webproperties_list:
webproperties_list = service.management().webproperties().list(
accountId='~all').execute()
profile_id = webproperties_list.get(u'defaultProfileId')
print profile_id
For some reason the .get() on the key u'defaultProfileId' is giving me None, even though I know it is in the response. I also tried the get without the u and I still get None:
profile_id = webproperties_list.get('defaultProfileId')
Do I need to do something to the dict before I get the value from the key, or am I doing something else wrong entirely?
UPDATE:
Here is the response I get:
{u'username': u'removed', u'kind': u'analytics#webproperties', u'items': [{u'profileCount': 1, u'kind': u'analytics#webproperty', u'name': u'removed', u'level': u'STANDARD', u'defaultProfileId': u'removed'.....
I need to retrieve the value of u'defaultProfileId'
Not really sure how to get a value from a key that is in a dict within a list within a dict...
To figure out how to access it, it sometimes helps to go step by step:
>>> d
{u'username': u'removed', u'items': [{u'profileCount': 1, u'defaultProfileId': u'removed', u'kind': u'analytics#webproperty', u'name': u'removed', u'level': u'STANDARD'}], u'kind': u'analytics#webproperties'}
>>> d['items']
[{u'profileCount': 1, u'defaultProfileId': u'removed', u'kind': u'analytics#webproperty', u'name': u'removed', u'level': u'STANDARD'}]
>>> d['items'][0]
{u'profileCount': 1, u'defaultProfileId': u'removed', u'kind': u'analytics#webproperty', u'name': u'removed', u'level': u'STANDARD'}
>>> d['items'][0]['defaultProfileId']
u'removed'
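If the response might not always contain items, a guarded version of the same lookup avoids an IndexError or KeyError. A sketch, reusing the question's 'removed' placeholders:

```python
# Abbreviated stand-in for the API response shown above.
response = {'username': 'removed',
            'items': [{'profileCount': 1, 'defaultProfileId': 'removed'}]}

# .get() returns None (here coerced to []) instead of raising when a key
# is absent, so missing 'items' or an empty list is handled gracefully.
items = response.get('items') or []
profile_id = items[0].get('defaultProfileId') if items else None
print(profile_id)  # removed
```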
