I am currently writing a scraper that reads from an API that contains a JSON. By doing response.json() it would return a dict where we could easily use the e.g response["object"]to get the value we want as I assume that converts it to a dict. The current mock data looks like this:
data = {
'id': 336461,
'thumbnail': '/images/product/123456?trim&h=80',
'variants': None,
'name': 'Testing',
'data': {
'Videoutgång': {
'Typ av gränssnitt': {
'name': 'Typ av gränssnitt',
'value': 'PCI Test'
}
}
},
'stock': {
'web': 0,
'supplier': None,
'displayCap': '50',
'1': 0,
'orders': {
'CL': {
'ordered': -10,
'status': 1
}
}
}
}
What I am looking after is that the API sometimes does contain "orders -> CL" but sometime doesn't . That means that both happy path and unhappy path is what I am looking for which is the fastest way to get a data from a dict.
I have currently done something like this:
data = {
'id': 336461,
'thumbnail': '/images/product/123456?trim&h=80',
'variants': None,
'name': 'Testing',
'data': {
'Videoutgång': {
'Typ av gränssnitt': {
'name': 'Typ av gränssnitt',
'value': 'PCI Test'
}
}
},
'stock': {
'web': 0,
'supplier': None,
'displayCap': '50',
'1': 0,
'orders': {
'CL': {
'ordered': -10,
'status': 1
}
}
}
}
if (
"stock" in data
and "orders" in data["stock"]
and "CL" in data["stock"]["orders"]
and "status" in data["stock"]["orders"]["CL"]
and data["stock"]["orders"]["CL"]["status"]
):
print(f'{data["stock"]["orders"]["CL"]["status"]}: {data["stock"]["orders"]["CL"]["ordered"]}')
1: -10
However my question is that I would like to know which is the fastest way to get the data from a dict if it is in the dict?
Lookups are faster in dictionaries because Python implements them using hash tables.
If we explain the difference by Big O concepts, dictionaries have constant time complexity, O(1). This is another approach using .get() method as well:
data = {
'id': 336461,
'thumbnail': '/images/product/123456?trim&h=80',
'variants': None,
'name': 'Testing',
'data': {
'Videoutgång': {
'Typ av gränssnitt': {
'name': 'Typ av gränssnitt',
'value': 'PCI Test'
}
}
},
'stock': {
'web': 0,
'supplier': None,
'displayCap': '50',
'1': 0,
'orders': {
'CL': {
'ordered': -10,
'status': 1
}
}
}
}
if (data.get('stock', {}).get('orders', {}).get('CL')):
print(f'{data["stock"]["orders"]["CL"]["status"]}: {data["stock"]["orders"]["CL"]["ordered"]}')
Here is a nice writeup on lookups in Python with list and dictionary as example.
I got your point. For this question, since your stock has just 4 values it is hard to say if .get() method will work faster than using a loop or not. If your dictionary would have more items then certainly .get() would have worked much faster but since there are few keys, using loop will not make much difference.
I need to get the 'ids' of this json response,the thing is that, there are many dictionaries with a list of dictionaries inside,how can I do this??(PS:len(items) is 20,so I need to get the 20 ids in the form of a dictionary.
{'playlists': {'href': 'https://api.spotify.com/v1/search?query=rewind-The%25&type=playlist&offset=0&limit=20',
'items': [{'collaborative': False,
'description': 'Remember what you listened to in 2010? Rewind and rediscover your favorites.',
'external_urls': {'spotify': 'https://open.spotify.com/playlist/37i9dQZF1DXc6IFF23C9jj'},
'href': 'https://api.spotify.com/v1/playlists/37i9dQZF1DXc6IFF23C9jj',
'id': '37i9dQZF1DXc6IFF23C9jj',
'images': [{'height': None,
'url': 'https://i.scdn.co/image/ab67706f0000000327ba1078080355421d1a49e2',
'width': None}],
'name': 'Rewind - The Sound of 2010',
'owner': {'display_name': 'Spotify',
'external_urls': {'spotify': 'https://open.spotify.com/user/spotify'},
'href': 'https://api.spotify.com/v1/users/spotify',
'id': 'spotify',
'type': 'user',
'uri': 'spotify:user:spotify'},
'primary_color': None,
'public': None,
'snapshot_id': 'MTU5NTUzMTE1OSwwMDAwMDAwMGQ0MWQ4Y2Q5OGYwMGIyMDRlOTgwMDk5OGVjZjg0Mjdl',
'tracks': {'href': 'https://api.spotify.com/v1/playlists/37i9dQZF1DXc6IFF23C9jj/tracks',
'total': 100},
'type': 'playlist',
'uri': 'spotify:playlist:37i9dQZF1DXc6IFF23C9jj'},
Im trying to get it through this:
dict={'id':''}
for playlists in playlist_data['playlists']:
for items in playlists['items']:
for item in items:
for dic in range(len(item)):
for id in dic['id']:
dict.update('id')
print(dict)
I get this error:
TypeError: string indices must be integers ```
Try something like this:
ids = [item["id"] for item in json_data["playlists"]["items"]]
This is called a list comprehension.
You want to iterate over all of the "items" within the "playlists" key.
You can access that list of items:
json_data["playlists"]["items"]
Then you iterate over each item within items:
for item in json_data["playlists"]["items"]
Then you access the "id" of each item:
item["id"]
You can index an object using the keys of object. I can see there are two places where id is present in an object. To retrieve those two ids and store them in a dictionary format, you can use the following approach -
_json = {
'playlists': {
'href': 'https://api.spotify.com/v1/search?query=rewind-The%25&type=playlist&offset=0&limit=20',
'items': [{
'collaborative': False,
'description': 'Remember what you listened to in 2010? Rewind and rediscover your favorites.',
'external_urls': {
'spotify': 'https://open.spotify.com/playlist/37i9dQZF1DXc6IFF23C9jj'
},
'href': 'https://api.spotify.com/v1/playlists/37i9dQZF1DXc6IFF23C9jj',
'id': '37i9dQZF1DXc6IFF23C9jj',
'images': [{
'height': None,
'url': 'https://i.scdn.co/image/ab67706f0000000327ba1078080355421d1a49e2',
'width': None
}],
'name': 'Rewind - The Sound of 2010',
'owner': {
'display_name': 'Spotify',
'external_urls': {
'spotify': 'https://open.spotify.com/user/spotify'
},
'href': 'https://api.spotify.com/v1/users/spotify',
'id': 'spotify',
'type': 'user',
'uri': 'spotify:user:spotify'
},
'primary_color': None,
'public': None,
'snapshot_id': 'MTU5NTUzMTE1OSwwMDAwMDAwMGQ0MWQ4Y2Q5OGYwMGIyMDRlOTgwMDk5OGVjZjg0Mjdl',
'tracks': {
'href': 'https://api.spotify.com/v1/playlists/37i9dQZF1DXc6IFF23C9jj/tracks',
'total': 100
},
'type': 'playlist',
'uri': 'spotify:playlist:37i9dQZF1DXc6IFF23C9jj'
}, ]
}
}
res_dict = {'id':[items['id'], items['owner']['id']] for items in _json['playlists']['items']}
print(res_dict)
OUTPUT :
{'id': ['37i9dQZF1DXc6IFF23C9jj', 'spotify']}
If you don't need the second id that's present in the json object, you can just remove it from above res_dict and modify it as -
res_dict = {'id':items['id'] for items in _json['playlists']['items']}
This will only fetch the id present in the items array as key of any element and not any further nested ids (like items[i]->owner->id won't be in the final res as it was in the fist case ).
This is my first question :)
I loop over a nested dictionary to print specific values. I am using the following code.
for i in lizzo_top_tracks['tracks']:
print('Track Name: ' + i['name'])
It works for string variables, but does not work for other variables. For example, when I use the following code for the date variable:
for i in lizzo_top_tracks['tracks']:
print('Album Release Date: ' + i['release_date'])
I receive a message like this KeyError: 'release_date'
What should I do?
Here is a sample of my nested dictionary:
{'tracks': [{'album': {'album_type': 'album',
'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/56oDRnqbIiwx4mymNEv7dS'},
'href': 'https://api.spotify.com/v1/artists/56oDRnqbIiwx4mymNEv7dS',
'id': '56oDRnqbIiwx4mymNEv7dS',
'name': 'Lizzo',
'type': 'artist',
'uri': 'spotify:artist:56oDRnqbIiwx4mymNEv7dS'}],
'external_urls': {'spotify': 'https://open.spotify.com/album/74gSdSHe71q7urGWMMn3qB'},
'href': 'https://api.spotify.com/v1/albums/74gSdSHe71q7urGWMMn3qB',
'id': '74gSdSHe71q7urGWMMn3qB',
'images': [{'height': 640,
'width': 640}],
'name': 'Cuz I Love You (Deluxe)',
'release_date': '2019-05-03',
'release_date_precision': 'day',
'total_tracks': 14,
'type': 'album',
'uri': 'spotify:album:74gSdSHe71q7urGWMMn3qB'}]}
The code you posted isn't syntactically correct; running it through a Python interpreter gives a syntax error on the last line. It looks like you lost a curly brace somewhere toward the end. :)
I went through it and fixed up the white space to make the structure easier to see; the way you had it formatted made it hard to see which keys were at which level of nesting, but with consistent indentation it becomes much clearer:
lizzo_top_tracks = {
'tracks': [{
'album': {
'album_type': 'album',
'artists': [{
'external_urls': {
'spotify': 'https://open.spotify.com/artist/56oDRnqbIiwx4mymNEv7dS'
},
'href': 'https://api.spotify.com/v1/artists/56oDRnqbIiwx4mymNEv7dS',
'id': '56oDRnqbIiwx4mymNEv7dS',
'name': 'Lizzo',
'type': 'artist',
'uri': 'spotify:artist:56oDRnqbIiwx4mymNEv7dS'
}],
'external_urls': {
'spotify': 'https://open.spotify.com/album/74gSdSHe71q7urGWMMn3qB'
},
'href': 'https://api.spotify.com/v1/albums/74gSdSHe71q7urGWMMn3qB',
'id': '74gSdSHe71q7urGWMMn3qB',
'images': [{'height': 640, 'width': 640}],
'name': 'Cuz I Love You (Deluxe)',
'release_date': '2019-05-03',
'release_date_precision': 'day',
'total_tracks': 14,
'type': 'album',
'uri': 'spotify:album:74gSdSHe71q7urGWMMn3qB'
}
}]
}
So the first (and only) value you get for i in lizzo_top_tracks['tracks'] is going to be this dictionary:
i = {
'album': {
'album_type': 'album',
'artists': [{
'external_urls': {
'spotify': 'https://open.spotify.com/artist/56oDRnqbIiwx4mymNEv7dS'
},
'href': 'https://api.spotify.com/v1/artists/56oDRnqbIiwx4mymNEv7dS',
'id': '56oDRnqbIiwx4mymNEv7dS',
'name': 'Lizzo',
'type': 'artist',
'uri': 'spotify:artist:56oDRnqbIiwx4mymNEv7dS'
}],
'external_urls': {
'spotify': 'https://open.spotify.com/album/74gSdSHe71q7urGWMMn3qB'
},
'href': 'https://api.spotify.com/v1/albums/74gSdSHe71q7urGWMMn3qB',
'id': '74gSdSHe71q7urGWMMn3qB',
'images': [{'height': 640, 'width': 640}],
'name': 'Cuz I Love You (Deluxe)',
'release_date': '2019-05-03',
'release_date_precision': 'day',
'total_tracks': 14,
'type': 'album',
'uri': 'spotify:album:74gSdSHe71q7urGWMMn3qB'
}
}
The only key in this dictionary is 'album', the value of which is another dictionary that contains all the other information. If you want to print, say, the album release date and a list of the artists' names, you'd do:
for track in lizzo_top_tracks['tracks']:
print('Album Release Date: ' + track['album']['release_date'])
print('Artists: ' + str([artist['name'] for artist in track['album']['artists']]))
If these are dictionaries that you're building yourself, you might want to remove some of the nesting layers where there's only a single key, since they just make it harder to navigate the structure without giving you any additional information. For example:
lizzo_top_albums = [{
'album_type': 'album',
'artists': [{
'external_urls': {
'spotify': 'https://open.spotify.com/artist/56oDRnqbIiwx4mymNEv7dS'
},
'href': 'https://api.spotify.com/v1/artists/56oDRnqbIiwx4mymNEv7dS',
'id': '56oDRnqbIiwx4mymNEv7dS',
'name': 'Lizzo',
'type': 'artist',
'uri': 'spotify:artist:56oDRnqbIiwx4mymNEv7dS'
}],
'external_urls': {
'spotify': 'https://open.spotify.com/album/74gSdSHe71q7urGWMMn3qB'
},
'href': 'https://api.spotify.com/v1/albums/74gSdSHe71q7urGWMMn3qB',
'id': '74gSdSHe71q7urGWMMn3qB',
'images': [{'height': 640, 'width': 640}],
'name': 'Cuz I Love You (Deluxe)',
'release_date': '2019-05-03',
'release_date_precision': 'day',
'total_tracks': 14,
'type': 'album',
'uri': 'spotify:album:74gSdSHe71q7urGWMMn3qB'
}]
This structure allows you to write the query the way you were originally trying to do it:
for album in lizzo_top_albums:
print('Album Release Date: ' + album['release_date'])
print('Artists: ' + str([artist['name'] for artist in album['artists']]))
Much simpler, right? :)
I've been struggling for about 3 weeks on this simple issue. I can't understand why and I would give anything to solve it lol.
I am trying to read values from the data structure below. The docs say it's a dictionary with keys containing lists of results of that type.
Example: I get the master query reply using an eval function. I lookup the key "song_hits" to get that structure. Then I lookup the key 'track' and parse it. The problem is getting to the 'track' part.
When I do it from how Perl docs tell me to, I get Can't locate object method "FIRSTKEY" via package "Inline::Python::Object::Data".
So I'm wondering if there's a way to read the value using the eval function to bypass ObjectData's hash key limitation, another way to read it given I know exact keys, or if I'm just doing this entirely wrong.
{
'album_hits': [
{
'album':
{
'albumArtRef': 'http://lh5.ggpht.com/DVIg4GiD6msHfgPs_Vu_2eRxCyAoz0fF...',
'albumArtist': 'J.Cole',
'albumId': 'Bfp2tuhynyqppnp6zennhmf6w3y',
'artist': 'J.Cole',
'artistId': ['Ajgnxme45wcqqv44vykrleifpji'],
'description_attribution':
{
'kind': 'sj#attribution',
'license_title': 'Creative Commons Attribution CC-BY',
'license_url': 'http://creativecommons.org/licenses/by/4.0/legalcode',
'source_title': 'Freebase',
'source_url': ''
},
'explicitType': '1',
'kind': 'sj#album',
'name': 'Work Out',
'year': 2011
},
'type': '3'
}],
'artist_hits': [
{
'artist':
{
'artistArtRef': 'http://lh3.googleusercontent.com/MJe-cDw9uQ-pUagoLlm...',
'artistArtRefs': [
{
'aspectRatio': '2',
'autogen': False,
'kind': 'sj#imageRef',
'url': 'http://lh3.googleusercontent.com/MJe-cDw9uQ-pUagoLlmKX3x_K...'
}],
'artistId': 'Ajgnxme45wcqqv44vykrleifpji',
'artist_bio_attribution':
{
'kind': 'sj#attribution',
'source_title': 'David Jeffries, Rovi'
},
'kind': 'sj#artist',
'name': 'J. Cole'
},
'type': '2'
}],
'playlist_hits': [
{
'playlist':
{
'albumArtRef': [
{
'url': 'http://lh3.googleusercontent.com/KJsAhrg8Jk_5A4xYLA68LFC...'
}],
'description': 'Workout Plan ',
'kind': 'sj#playlist',
'name': 'Workout',
'ownerName': 'Ida Sarver',
'shareToken': 'AMaBXyktyF6Yy_G-8wQy8Rru0tkueIbIFblt2h0BpkvTzHDz-fFj6P...',
'type': 'SHARED'
},
'type': '4'
}],
'situation_hits': [
{
'situation':
{
'description': 'Level up and enter beast mode with some loud, aggressive music.',
'id': 'Nrklpcyfewwrmodvtds5qlfp5ve',
'imageUrl': 'http://lh3.googleusercontent.com/Cd8WRMaG_pDwjTC_dSPIIuf...',
'title': 'Entering Beast Mode',
'wideImageUrl': 'http://lh3.googleusercontent.com/8A9S-nTb5pfJLcpS8P...'
},
'type': '7'
}],
'song_hits': [
{
'track':
{
'album': 'Work Out',
'albumArtRef': [
{
'aspectRatio': '1',
'autogen': False,
'kind': 'sj#imageRef',
'url': 'http://lh5.ggpht.com/DVIg4GiD6msHfgPs_Vu_2eRxCyAoz0fFdxj5w...'
}],
'albumArtist': 'J.Cole',
'albumAvailableForPurchase': True,
'albumId': 'Bfp2tuhynyqppnp6zennhmf6w3y',
'artist': 'J Cole',
'artistId': ['Ajgnxme45wcqqv44vykrleifpji', 'Ampniqsqcwxk7btbgh5ycujij5i'],
'composer': '',
'discNumber': 1,
'durationMillis': '234000',
'estimatedSize': '9368582',
'explicitType': '1',
'genre': 'Pop',
'kind': 'sj#track',
'nid': 'Tq3nsmzeumhilpegkimjcnbr6aq',
'primaryVideo':
{
'id': '6PN78PS_QsM',
'kind': 'sj#video',
'thumbnails': [
{
'height': 180,
'url': 'https://i.ytimg.com/vi/6PN78PS_QsM/mqdefault.jpg',
'width': 320
}]
},
'storeId': 'Tq3nsmzeumhilpegkimjcnbr6aq',
'title': 'Work Out',
'trackAvailableForPurchase': True,
'trackAvailableForSubscription': True,
'trackNumber': 1,
'trackType': '7',
'year': 2011
},
'type': '1'
}],
'station_hits': [
{
'station':
{
'compositeArtRefs': [
{
'aspectRatio': '1',
'kind': 'sj#imageRef',
'url': 'http://lh3.googleusercontent.com/3aD9mFppy6PwjADnjwv_w...'
}],
'contentTypes': ['1'],
'description': 'These riff-tastic metal tracks are perfect for getting the blood pumping.',
'imageUrls': [
{
'aspectRatio': '1',
'autogen': False,
'kind': 'sj#imageRef',
'url': 'http://lh5.ggpht.com/YNGkFdrtk43e8H941fuAHjflrNZ1CJUeqdoys...'
}],
'kind': 'sj#radioStation',
'name': 'Heavy Metal Workout',
'seed':
{
'curatedStationId': 'Lcwg73w3bd64hsrgarnorif52r',
'kind': 'sj#radioSeed',
'seedType': '9'
},
'skipEventHistory': [],
'stationSeeds': [
{
'curatedStationId': 'Lcwg73w3bd64hsrgarnorif52r',
'kind': 'sj#radioSeed',
'seedType': '9'
}]
},
'type': '6'
}],
'video_hits': [
{
'score': 629.6226806640625,
'type': '8',
'youtube_video':
{
'id': '6PN78PS_QsM',
'kind': 'sj#video',
'thumbnails': [
{
'height': 180,
'url': 'https://i.ytimg.com/vi/6PN78PS_QsM/mqdefault.jpg',
'width': 320
}],
'title': 'J. Cole - Work Out'
}
}]
}
Cleaned, but broken code with 3 weeks of different attempts: (I have tried for, foreach, while, but the furthest it would read would be either the entire unicode array, error, or an empty string)
sub search {
my $query = shift;
my $uri = 'googlemusic:search:' . $query;
if (my $result = $cache->get($uri)) {
return $result;
}
my $googleResult;
my $result = {
tracks => [],
albums => [],
artists => [],
};
eval {
$googleResult = $googleapi->search($query, $prefs->get('max_search_items'));
};
if ($#) {
$log->error("Not able to search All Access for \"$query\": $#");
return;
}
#gives not an ARRAY refernce error
for my $hit (#{$googleResult->{song_hits}}) {
push #{$result->{tracks}}, to_slim_track($hit->{track});
}
#works, but gives an error on the next line, 'newlist' object has no attribute 'album'
for my $hit ({$googleResult->{album_hits}}) {
push #{$result->{albums}}, album_to_slim_album($hit->{album});
}
#Perl and others recommended way, but gives Can't locate object method "FIRSTKEY" via package "Inline::Python::Object::Data"
for my $hit (%{$googleResult->{artist_hits}}) {
push #{$result->{artists}}, artist_to_slim_artist($hit->{artist});
}
# Add to the cache
$cache->set($uri, $result, $CACHE_TIME);
return $result;
}
I have tried reading up, but have gotten so many errors including:
'key' does not exist
Can't use string ("track") as a HASH ref while strict refs in use
Type of argument to keys on reference must be unblessed hashref or arrayref
My Full Test File: http://pastebin.com/DMnDc56i
GoogleApi PM (Python GAPI Hook): https://raw.githubusercontent.com/hechtus/squeezebox-googlemusic/master/GoogleMusic/GoogleAPI.pm
Edit: Info, There were a couple of people who wanted unmaintained old code fixed, so I offered to help and got everything working besides this part.
Old Code Git: https://github.com/hechtus/squeezebox-googlemusic
Google Api Python I use: https://github.com/simon-weber/gmusicapi
I take it that the data structure shown is in $googleResult. This is 'almost' JSON and you can process it as such using modules, after a simple cleanup. I will use JSON::XS. The code below takes off after $googleResult has been acquired. (In tests I actually copied data shown in the question into a file and read it in.) I first replace ' by " and lower-case True and False, to get a valid JSON format which the module can decode.
# Other code from the question ...
use JSON::XS;
# For tests I loaded shown data into $googleResult (did not run this eval)
eval {
$googleResult = $googleapi->search($query, $prefs->get('max_search_items'));
};
if ($#) {
$log->error("Not able to search All Access for \"$query\": $#");
return;
}
# The structure shown in the question needs a cleanup
# But this may be a road to madness, if there is more
$googleResult =~ s/'/"/g; # ' turn off wrong editor coloring
$googleResult =~ s/False/false/g;
$googleResult =~ s/True/true/g;
my $coder = JSON::XS->new;
# There are many options for how to set it up. Example:
# JSON::XS->new->ascii->pretty->allow_nonref;
my $data = $coder->decode($googleResult);
# Now this is a normal Perl data structure that we can work with.
# Look at what's under 'album_hits' for example
my $ralbhits = $data->{'album_hits'};
print Dumper($ralbhits);
# We get: VAR1 = [ { 'album' => { albumID => ... } } ]
# Array reference, with nested hash references as the sole element
# Extract the 'artist'
my $artist = $ralbhits->[0]->{'album'}->{'artist'};
print "$artist\n";
This prints J. Cole (after the dump which I omit here). You can for convenience first extract a part of the structure and then query it far more simply. For example
# Get the hashref for album
my $ralbum = $ralbhits->[0]->{'album'};
my $artist = $ralbum->{'artist'};
Now once the data is unpacked you can retrieve what you need, based on what artist_to_slim_artist() needs and does. This is a normal data structure to work with.
Modules for JSON parsing return Perl data structures, see Mapping in JSON::XS. Generally they will be nested, except in very simple cases. For how to work with them see perldsc, a cookbook on complex data structures.
The JSON object given in this example, while invalid, needed very little correction. However, it may get far more complicated. For example, there is a far larger document (~100kB) linked to in a comment, with these problems.
Name-value pairs are enclosed in ' instead of " and the values themselves contain ' (like isn't and other contractions), complicating the matching of ' pairs.
Invalid u' sequences at the beginning of names and values (u need be removed). This can be rolled together with the above, as they come together. There was also u".
Text may contain all kinds of escapes, for example some encoding of accents, which are not valid JSON. (One in that document.) This can be found and fixed (escaped for example).
It took a few minutes to come up with a few regex that corrected the document at the link, at close to 100kB in size, so that it parses cleanly with the above code. But the problem is that it is hard to tell what other trouble may be in the next document. Still, since this may be of interest here is the regex.
Instead of being enclosed in a pair of ", the names and values are in between ', and the leading one also has an extra char, u'. What makes it easier is that the closing ' must be followed by either of , : ] } and I use positive lookahead to assert that. Finally, there are some u" opening quotes and u is removed, first.
$googleResult =~ s/False/false/g;
$googleResult =~ s/True/true/g;
$googleResult =~ s/u"/"/g;
# There are also escaped characters in text, escape that backslash
$googleResult =~ s|(\\)|$1$1|g;
# Correct delimiters from u'...' to "...", see text below
$googleResult =~ s/u'(.*?)' (?= []:},] )/"$1"/gx;
# We are good now, decode it
my $data = $coder->decode($googleResult);
my $alb = $data->[0]{track}{album};
print "$alb\n";
This prints These Things Happen (correctly). Above we capture between u' and the first ' that is followed by either of ]:,} (for which a character class [...] is used). Then u'' is replaced by "". After this decode($googleResult); works and we get the Perl data structure to query.
There are various modules that allow a 'relaxed' approach and will accept many such irregularities. However, by using them we agree to use an invalid JSON, which is meant to be a simple and clear data format, and I wouldn't advise to go down that road. Note that the nearly full specification of the format fits nicely in one clear and genereously illustrated page at the above link. Also see JSON Example, for a handful of examples.
I think that the best bet is to try to clean it up. Run a decoder like in the code above and see the error message. It will pinpoint the problem exactly. Then add a regex to correct that particular violation of format. Then go again. If the various documents you may work with carry more or less the same set of problems (like the ones above for example) it may well work. Or it may turn out that it is too much trouble, if new violations keep coming up, in which case you may need a different approach.
Finally, I don't know how you arrived at this format from the original Python-object problem. Could it be that the format got broken somewhere in translation? I don't see how that would be the case. Is it actually not meant to be JSON? However, it is too close to it for that.
Is it possible to ask for valid JSON to be provided?
OK, this isn't really an answer, but out of the goodness of my heart I cleaned up your data for you. Here is a real Python dict. I don't know if some of the numerical string values should be ints or not, so I didn't mess with them. It'll be up to you to figure out what to do with the truncated URLs.
Another way to go about this would be to change True to true, False to false, and parse the dict as JSON.
{
'album_hits': [
{
'album':
{
'albumArtRef': 'http://lh5.ggpht.com/DVIg4GiD6msHfgPs_Vu_2eRxCyAoz0fF...',
'albumArtist': 'J.Cole',
'albumId': 'Bfp2tuhynyqppnp6zennhmf6w3y',
'artist': 'J.Cole',
'artistId': ['Ajgnxme45wcqqv44vykrleifpji'],
'description_attribution':
{
'kind': 'sj#attribution',
'license_title': 'Creative Commons Attribution CC-BY',
'license_url': 'http://creativecommons.org/licenses/by/4.0/legalcode',
'source_title': 'Freebase',
'source_url': ''
},
'explicitType': '1',
'kind': 'sj#album',
'name': 'Work Out',
'year': 2011
},
'type': '3'
}],
'artist_hits': [
{
'artist':
{
'artistArtRef': 'http://lh3.googleusercontent.com/MJe-cDw9uQ-pUagoLlm...',
'artistArtRefs': [
{
'aspectRatio': '2',
'autogen': False,
'kind': 'sj#imageRef',
'url': 'http://lh3.googleusercontent.com/MJe-cDw9uQ-pUagoLlmKX3x_K...'
}],
'artistId': 'Ajgnxme45wcqqv44vykrleifpji',
'artist_bio_attribution':
{
'kind': 'sj#attribution',
'source_title': 'David Jeffries, Rovi'
},
'kind': 'sj#artist',
'name': 'J. Cole'
},
'type': '2'
}],
'playlist_hits': [
{
'playlist':
{
'albumArtRef': [
{
'url': 'http://lh3.googleusercontent.com/KJsAhrg8Jk_5A4xYLA68LFC...'
}],
'description': 'Workout Plan ',
'kind': 'sj#playlist',
'name': 'Workout',
'ownerName': 'Ida Sarver',
'shareToken': 'AMaBXyktyF6Yy_G-8wQy8Rru0tkueIbIFblt2h0BpkvTzHDz-fFj6P...',
'type': 'SHARED'
},
'type': '4'
}],
'situation_hits': [
{
'situation':
{
'description': 'Level up and enter beast mode with some loud, aggressive music.',
'id': 'Nrklpcyfewwrmodvtds5qlfp5ve',
'imageUrl': 'http://lh3.googleusercontent.com/Cd8WRMaG_pDwjTC_dSPIIuf...',
'title': 'Entering Beast Mode',
'wideImageUrl': 'http://lh3.googleusercontent.com/8A9S-nTb5pfJLcpS8P...'
},
'type': '7'
}],
'song_hits': [
{
'track':
{
'album': 'Work Out',
'albumArtRef': [
{
'aspectRatio': '1',
'autogen': False,
'kind': 'sj#imageRef',
'url': 'http://lh5.ggpht.com/DVIg4GiD6msHfgPs_Vu_2eRxCyAoz0fFdxj5w...'
}],
'albumArtist': 'J.Cole',
'albumAvailableForPurchase': True,
'albumId': 'Bfp2tuhynyqppnp6zennhmf6w3y',
'artist': 'J Cole',
'artistId': ['Ajgnxme45wcqqv44vykrleifpji', 'Ampniqsqcwxk7btbgh5ycujij5i'],
'composer': '',
'discNumber': 1,
'durationMillis': '234000',
'estimatedSize': '9368582',
'explicitType': '1',
'genre': 'Pop',
'kind': 'sj#track',
'nid': 'Tq3nsmzeumhilpegkimjcnbr6aq',
'primaryVideo':
{
'id': '6PN78PS_QsM',
'kind': 'sj#video',
'thumbnails': [
{
'height': 180,
'url': 'https://i.ytimg.com/vi/6PN78PS_QsM/mqdefault.jpg',
'width': 320
}]
},
'storeId': 'Tq3nsmzeumhilpegkimjcnbr6aq',
'title': 'Work Out',
'trackAvailableForPurchase': True,
'trackAvailableForSubscription': True,
'trackNumber': 1,
'trackType': '7',
'year': 2011
},
'type': '1'
}],
'station_hits': [
{
'station':
{
'compositeArtRefs': [
{
'aspectRatio': '1',
'kind': 'sj#imageRef',
'url': 'http://lh3.googleusercontent.com/3aD9mFppy6PwjADnjwv_w...'
}],
'contentTypes': ['1'],
'description': 'These riff-tastic metal tracks are perfect for getting the blood pumping.',
'imageUrls': [
{
'aspectRatio': '1',
'autogen': False,
'kind': 'sj#imageRef',
'url': 'http://lh5.ggpht.com/YNGkFdrtk43e8H941fuAHjflrNZ1CJUeqdoys...'
}],
'kind': 'sj#radioStation',
'name': 'Heavy Metal Workout',
'seed':
{
'curatedStationId': 'Lcwg73w3bd64hsrgarnorif52r',
'kind': 'sj#radioSeed',
'seedType': '9'
},
'skipEventHistory': [],
'stationSeeds': [
{
'curatedStationId': 'Lcwg73w3bd64hsrgarnorif52r',
'kind': 'sj#radioSeed',
'seedType': '9'
}]
},
'type': '6'
}],
'video_hits': [
{
'score': 629.6226806640625,
'type': '8',
'youtube_video':
{
'id': '6PN78PS_QsM',
'kind': 'sj#video',
'thumbnails': [
{
'height': 180,
'url': 'https://i.ytimg.com/vi/6PN78PS_QsM/mqdefault.jpg',
'width': 320
}],
'title': 'J. Cole - Work Out'
}
}]
}
I've worked out a solution using a list comprehension like so:
use Inline::Python qw(py_eval);
my $song_hits = py_eval("[x for x in $googleResult->{song_hits}]", 0);
for my $hit (#$song_hits) {
push #{$result->{tracks}}, to_slim_track($hit->{track});
}
The commit is at:
https://github.com/squeezebox-googlemusic/squeezebox-googlemusic/commit/e6fa62d9da3bc7295023283ef5d25698737e5772