Python: nested dictionary with string list and subdictionaries - python

I am interested in the comments (i.e. the text) posted on certain YouTube channels. I scraped the data with the Google YouTube Data API. The data comes in a complex nested structure (see picture below) that I am trying to disentangle for a research project.
The comments are stored in the fields Text Display and Text Original, which belong to the dictionary Snippet, which in turn is part of the dictionary Top Level Comment. Top Level Comment is part of a string list that in turn is part of the dictionary items.
I think I need to subset the dictionary Top Level Comment, as all the comments and related information I need (see picture below) are stored in nested dictionaries there. I don't think I can access the dictionary Top Level Comment directly, as it is part of the list Snippet. So I first tried to subset the list Snippet. This is where I am stuck.
Here is my code so far:
from googleapiclient.discovery import build
api_key = '_______________________________'
youtube = build('youtube', 'v3', developerKey = api_key)
#find channel ID https://commentpicker.com/youtube-channel-id.php
request = youtube.commentThreads().list(
part = 'snippet',
allThreadsRelatedToChannelId = 'UC_zxivooFdvF4uuBosUnJxQ'
)
response3 = request.execute()
##Code to explore data structure and format is excluded
#subset dictionary according to keys we want
includedKeys = ['items']
dataDic = {k:v for k, v in response3.items() if k in includedKeys}
In the code below I tried, unsuccessfully, to subset the list Snippet in different ways or to convert it.
dataDic2 = {x['snippet'] for x in dataDic} #Link no 1
#TypeError: string indices must be integers
dataDic2 = [{'snippet': d['snippet']} for d in dataDic] #Link no 2
#TypeError: string indices must be integers
dataDic2 = [topLevelComment['snippet'] for topLevelComment in dataDic['topLevelComment']['snippet']] #Link no 3
#KeyError: 'topLevelComment'
import ast
result = ast.literal_eval('[snippet]')
assert type(result) is list #Link no 4 and 5
#ValueError: malformed node or string: <_ast.Name object at 0x0000010F6D7B9A08>
Link no 1
Link no 2
Link no 3
Link no 4
Link no 5
This link says that ast.literal_eval does not work with lists and dictionaries?
So finally - how to retrieve the data?
I need all fields circled in red in the picture showing the data structure.
EDIT: sample data

see below
data = {'kind': 'youtube#commentThreadListResponse', 'etag': '_yOZ67ear9btS5RarXfH3Xir6A8',
'nextPageToken': 'QURTSl9pME5DS2FQZm5yRzZ5b0ZGZUJGeENkMGh2UWxzVjNueEdUVmtmbVVqYksxSmN4QnpBdDFFWkpCREl6REZVQmlHZS1makpfZXFkQzFNbEpwbDFpb0dNWm95Z2E1TE03NE5GWEg0ajE5UWt0bnlpYS1PczlFVWZ1a1hqbTJLREVRempJaVpaRTYtcnpFeUM2ZU5Va1hUSHR5cVJFTEJ2akdtOHFkTWhGdmdmWUZsMUMwUHg0eTZNVzFBZVdsd1A0YXBqaWhnNGVNMXc=',
'pageInfo': {'totalResults': 14, 'resultsPerPage': 20}, 'items': [
{'kind': 'youtube#commentThread', 'etag': 'knxvgtYnhlPIpkevoCXSTZamb40', 'id': 'Ugwmdd9KdDm4Hm7MxlJ4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'tUXWw6WvgkI',
'topLevelComment': {'kind': 'youtube#comment', 'etag': '4m76jMeR8qFmfrk42kfKeA5Iv_Y',
'id': 'Ugwmdd9KdDm4Hm7MxlJ4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'tUXWw6WvgkI',
'textDisplay': 'Tipp 1: Zusatzszüge – machs wie Fredy. (Hinweis: dieses Video wurde vor der Corona-Pandemie erstellt)',
'textOriginal': 'Tipp 1: Zusatzszüge – machs wie Fredy. (Hinweis: dieses Video wurde vor der Corona-Pandemie erstellt)',
'authorDisplayName': 'Zuschauerquaeler',
'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLQCWIoN-3MmDfxflS5ipDVvatDw8TpbD43mn2kb=s48-c-k-c0x00ffffff-no-rj',
'authorChannelUrl': 'http://www.youtube.com/channel/UCECxysNsTQLhrelU2KikMjQ',
'authorChannelId': {'value': 'UCECxysNsTQLhrelU2KikMjQ'},
'canRate': True, 'viewerRating': 'none', 'likeCount': 1,
'publishedAt': '2021-09-15T07:29:00Z',
'updatedAt': '2021-09-15T07:29:00Z'}}, 'canReply': True,
'totalReplyCount': 0, 'isPublic': True}},
{'kind': 'youtube#commentThread', 'etag': 'tq7mSQltdzKz0sthUiAIPYrQgJg', 'id': 'Ugy2jzL0838zj9HyHu94AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'M98TRem03Lg',
'topLevelComment': {'kind': 'youtube#comment', 'etag': '8BDnS6DXuaN8VdFzHsj7dc1YPZc',
'id': 'Ugy2jzL0838zj9HyHu94AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'M98TRem03Lg',
'textDisplay': 'Ich sehe das Kulturland schon schmelzen und verschwinden...',
'textOriginal': 'Ich sehe das Kulturland schon schmelzen und verschwinden...',
'authorDisplayName': 'Janik Von Niederhäusern',
'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLSk69KdiWMSYw0sYQSBdjEHagXJTD9tWlHdsw=s48-c-k-c0x00ffffff-no-rj',
'authorChannelUrl': 'http://www.youtube.com/channel/UCt87CYDxeIbDRRJLVT0VrdQ',
'authorChannelId': {'value': 'UCt87CYDxeIbDRRJLVT0VrdQ'},
'canRate': True, 'viewerRating': 'none', 'likeCount': 0,
'publishedAt': '2021-09-14T18:08:55Z',
'updatedAt': '2021-09-14T18:08:55Z'}}, 'canReply': True,
'totalReplyCount': 1, 'isPublic': True}},
{'kind': 'youtube#commentThread', 'etag': 'h_gpfnmUju60NWNxlFEwxjkIPQU', 'id': 'Ugx5GfaJTwt5cnuQ3Bh4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'M98TRem03Lg',
'topLevelComment': {'kind': 'youtube#comment', 'etag': 'fMmN1zDH7PVIWbw3L0n5Mt0dtqk',
'id': 'Ugx5GfaJTwt5cnuQ3Bh4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'M98TRem03Lg',
'textDisplay': 'Guete initiativ! Mega fan vo dere projekt!',
'textOriginal': 'Guete initiativ! Mega fan vo dere projekt!',
'authorDisplayName': 'Nionity',
'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTM-Tj3pWLuyhuH7ivlUwxs4YtQn6gez-BMCLdLzQ=s48-c-k-c0x00ffffff-no-rj',
'authorChannelUrl': 'http://www.youtube.com/channel/UCbUj9ZwI0YOkElVEfpAnBVQ',
'authorChannelId': {'value': 'UCbUj9ZwI0YOkElVEfpAnBVQ'},
'canRate': True, 'viewerRating': 'none', 'likeCount': 0,
'publishedAt': '2021-09-14T07:18:31Z',
'updatedAt': '2021-09-14T07:18:31Z'}}, 'canReply': True,
'totalReplyCount': 0, 'isPublic': True}},
{'kind': 'youtube#commentThread', 'etag': 'LOajqt43iY4A2N4V0yiLBRZwaig', 'id': 'Ugxez_tcF7ts7VaAL7t4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'zYnbgDyWM9o',
'topLevelComment': {'kind': 'youtube#comment', 'etag': 'DvNHOkNftBCLBqV1Ajam8mzMFYg',
'id': 'Ugxez_tcF7ts7VaAL7t4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'zYnbgDyWM9o',
'textDisplay': 'Très mauvaise voix off, à un moment il y se reprend même dans le texte 😐',
'textOriginal': 'Très mauvaise voix off, à un moment il y se reprend même dans le texte 😐',
'authorDisplayName': 'Patrick__EPfan',
'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTOmUsxVCimwNSQBVPxNUXfFbUNuYnN7VzVEeBUJA=s48-c-k-c0x00ffffff-no-rj',
'authorChannelUrl': 'http://www.youtube.com/channel/UC8DxMAk8T9Gv8RW0f2n0Q2w',
'authorChannelId': {'value': 'UC8DxMAk8T9Gv8RW0f2n0Q2w'},
'canRate': True, 'viewerRating': 'none', 'likeCount': 0,
'publishedAt': '2021-09-12T12:12:58Z',
'updatedAt': '2021-09-12T12:12:58Z'}}, 'canReply': True,
'totalReplyCount': 0, 'isPublic': True}},
{'kind': 'youtube#commentThread', 'etag': 'MGsQS-TUcYHnuyjyN932wpVIM_A', 'id': 'UgxYTxqSwAsyGyOHzU94AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': '4nU0MgKft6c',
'topLevelComment': {'kind': 'youtube#comment', 'etag': 'iRkZfQGVCGFZ13s8D3xrVZQw83A',
'id': 'UgxYTxqSwAsyGyOHzU94AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': '4nU0MgKft6c',
'textDisplay': 'Shiey be like', 'textOriginal': 'Shiey be like',
'authorDisplayName': 'Canopener Guy',
'authorProfileImageUrl': 'https://yt3.ggpht.com/2XG9uyYmOfkeubUNFQR0cgj7xCimKLsg6_r-3E1PTPVLixXjcxeFosF1HoytvHibGJrxQXal=s48-c-k-c0x00ffffff-no-rj',
'authorChannelUrl': 'http://www.youtube.com/channel/UCk8pieRaYyzsnU32Gp85DvA',
'authorChannelId': {'value': 'UCk8pieRaYyzsnU32Gp85DvA'},
'canRate': True, 'viewerRating': 'none', 'likeCount': 0,
'publishedAt': '2021-09-02T23:23:35Z',
'updatedAt': '2021-09-02T23:23:35Z'}}, 'canReply': True,
'totalReplyCount': 0, 'isPublic': True}},
{'kind': 'youtube#commentThread', 'etag': 'bcPCCsMbvquhAKLiEqIR4a20HnA', 'id': 'Ugw8FWvl7Hbf1RvJWhV4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'oxSLp_1WtcM',
'topLevelComment': {'kind': 'youtube#comment', 'etag': 'rTl4oSjvH14OF4xQ1mnM_amfZag',
'id': 'Ugw8FWvl7Hbf1RvJWhV4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'oxSLp_1WtcM',
'textDisplay': 'Vivement un Lyria en Belgique !!!!',
'textOriginal': 'Vivement un Lyria en Belgique !!!!',
'authorDisplayName': 'Kayuchi Fujimoto',
'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLQ9YSDYj2tQFvjKjt9F_CH9AR2dcWrr84jA70am=s48-c-k-c0x00ffffff-no-rj',
'authorChannelUrl': 'http://www.youtube.com/channel/UCe5ctUAG-Z7cU_hpc-CbauQ',
'authorChannelId': {'value': 'UCe5ctUAG-Z7cU_hpc-CbauQ'},
'canRate': True, 'viewerRating': 'none', 'likeCount': 0,
'publishedAt': '2021-09-02T21:39:26Z',
'updatedAt': '2021-09-02T21:39:26Z'}}, 'canReply': True,
'totalReplyCount': 0, 'isPublic': True}},
{'kind': 'youtube#commentThread', 'etag': 'qbrUI9Z2YkM3LtYOqFogVRwcZWE', 'id': 'UgwomjMWUx5CHjlU_ox4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': '8vCvSmAIv1s',
'topLevelComment': {'kind': 'youtube#comment', 'etag': 'gYjvyBgNsZUB_FYUDK20LCVU-Qk',
'id': 'UgwomjMWUx5CHjlU_ox4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': '8vCvSmAIv1s',
'textDisplay': 'Build a high speed railway line into the moon I dare you with 20 million francs',
'textOriginal': 'Build a high speed railway line into the moon I dare you with 20 million francs',
'authorDisplayName': 'Simulated Trainspotter',
'authorProfileImageUrl': 'https://yt3.ggpht.com/3P-cR_3ORURRZH5RYImCeFv0yeC64SHtpS3otsCiGn4AuBXG-tQVrqnG32vJm4bfwxRt3MwCDzw=s48-c-k-c0x00ffffff-no-rj',
'authorChannelUrl': 'http://www.youtube.com/channel/UCF4ganYY8qP9q8YwXpDn2tQ',
'authorChannelId': {'value': 'UCF4ganYY8qP9q8YwXpDn2tQ'},
'canRate': True, 'viewerRating': 'none', 'likeCount': 0,
'publishedAt': '2021-09-02T08:36:45Z',
'updatedAt': '2021-09-02T08:36:45Z'}}, 'canReply': True,
'totalReplyCount': 0, 'isPublic': True}},
{'kind': 'youtube#commentThread', 'etag': '5KVenAu6Nn6RdnpKTpPj49KuYRY', 'id': 'UgyXleqDMoHFnid0OpV4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': '7earPWDJbhA',
'topLevelComment': {'kind': 'youtube#comment', 'etag': 'C3AxUnPxhDZuIYAKsjqeIZxmyQI',
'id': 'UgyXleqDMoHFnid0OpV4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': '7earPWDJbhA',
'textDisplay': 'Sehr schön', 'textOriginal': 'Sehr schön',
'authorDisplayName': 'Pranave4 Roblox',
'authorProfileImageUrl': 'https://yt3.ggpht.com/V_qXZAr4xsbi2GEFJ2t8NhwDYWGEeiBhFCgVYcgs1TwmaS1e6gCwktKZpdNPJszs3Zwu71ZZ2w=s48-c-k-c0x00ffffff-no-rj',
'authorChannelUrl': 'http://www.youtube.com/channel/UCKoDZxOJY6e90jeujtkC_4A',
'authorChannelId': {'value': 'UCKoDZxOJY6e90jeujtkC_4A'},
'canRate': True, 'viewerRating': 'none', 'likeCount': 2,
'publishedAt': '2021-08-27T16:06:59Z',
'updatedAt': '2021-08-27T16:06:59Z'}}, 'canReply': True,
'totalReplyCount': 0, 'isPublic': True}},
{'kind': 'youtube#commentThread', 'etag': 'mH33Uu3Bm3zkVGLZDiOaOg2idSM', 'id': 'UgxQRQaVxnzeFQRTPTp4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'AXMw3vtsswY',
'topLevelComment': {'kind': 'youtube#comment', 'etag': 'Sht8Gm_LShDQ9cKfIl1nH53FgsI',
'id': 'UgxQRQaVxnzeFQRTPTp4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'AXMw3vtsswY',
'textDisplay': 'wie kann mann feuerwehr mann bei SBB werden',
'textOriginal': 'wie kann mann feuerwehr mann bei SBB werden',
'authorDisplayName': 'Florian Ruhland',
'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLQNfiz21ybCpfDmaXKefJtuy1UDHwFenhsL0R14Kg=s48-c-k-c0x00ffffff-no-rj',
'authorChannelUrl': 'http://www.youtube.com/channel/UCS7LfiWU_ebI-E3ny8Yb6PA',
'authorChannelId': {'value': 'UCS7LfiWU_ebI-E3ny8Yb6PA'},
'canRate': True, 'viewerRating': 'none', 'likeCount': 0,
'publishedAt': '2021-08-21T11:00:05Z',
'updatedAt': '2021-08-21T11:00:05Z'}}, 'canReply': True,
'totalReplyCount': 1, 'isPublic': True}},
{'kind': 'youtube#commentThread', 'etag': 'oM57z1ZCosWjFXPDl1VMIQIFpJ8', 'id': 'UgzzHV3cayZFI7MpziB4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'DmBo0MMxDb0',
'topLevelComment': {'kind': 'youtube#comment', 'etag': '-ecKB_iUT-BOVOeNfX7qoAr0poI',
'id': 'UgzzHV3cayZFI7MpziB4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'DmBo0MMxDb0',
'textDisplay': 'I am only 15, but i have a very very big passion for these trains, i can’t wait to drive around Switzerland and help people arrive at their destinations<br>I also learned about signals in Switzerland as short documentaries on how these trains work.<br>I hope nothing major will change in 5 years:) i really dreaming of becoming an engine driver',
'textOriginal': 'I am only 15, but i have a very very big passion for these trains, i can’t wait to drive around Switzerland and help people arrive at their destinations\nI also learned about signals in Switzerland as short documentaries on how these trains work.\nI hope nothing major will change in 5 years:) i really dreaming of becoming an engine driver',
'authorDisplayName': 'Fred Dev',
'authorProfileImageUrl': 'https://yt3.ggpht.com/JEaQIjszQdpIDgsrIKEtIX6KaeryO48U4IcbSl45oFIKrDNoCxwhmWh3fC6exW5X1pL15Hiw4w=s48-c-k-c0x00ffffff-no-rj',
'authorChannelUrl': 'http://www.youtube.com/channel/UCJKarhI8HsHHix0-HckXwVg',
'authorChannelId': {'value': 'UCJKarhI8HsHHix0-HckXwVg'},
'canRate': True, 'viewerRating': 'none', 'likeCount': 1,
'publishedAt': '2021-08-19T22:32:58Z',
'updatedAt': '2021-08-19T22:32:58Z'}}, 'canReply': True,
'totalReplyCount': 1, 'isPublic': True}},
{'kind': 'youtube#commentThread', 'etag': 'Xu5rUasdLD7ZFsRPWPrL2JUJCWg', 'id': 'UgwBkkcOhrjuzFjE6Y54AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'ES0AnIBNJfQ',
'topLevelComment': {'kind': 'youtube#comment', 'etag': '1ps-PTcq7S2TzbY7s4OuafI4-Fg',
'id': 'UgwBkkcOhrjuzFjE6Y54AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'ES0AnIBNJfQ',
'textDisplay': 'wie heisst der sprecher dieser werbung? so eine wunderbare stimme!<br>die musik ist auch toll, wie heisst das stück?',
'textOriginal': 'wie heisst der sprecher dieser werbung? so eine wunderbare stimme!\ndie musik ist auch toll, wie heisst das stück?',
'authorDisplayName': 'cloudwalker',
'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLQxGBcardOjutARwZxXcfbUSH3f66gqTzq3EA=s48-c-k-c0x00ffffff-no-rj',
'authorChannelUrl': 'http://www.youtube.com/channel/UC3VmTS8W5GKZf0PeIb8l2Jw',
'authorChannelId': {'value': 'UC3VmTS8W5GKZf0PeIb8l2Jw'},
'canRate': True, 'viewerRating': 'none', 'likeCount': 1,
'publishedAt': '2021-08-18T00:50:32Z',
'updatedAt': '2021-08-18T00:50:32Z'}}, 'canReply': True,
'totalReplyCount': 2, 'isPublic': True}},
{'kind': 'youtube#commentThread', 'etag': '_hlBnClge81P8_RqsXR7q4_BIes', 'id': 'Ugzvldq2VB0lBIzoGVR4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'AXMw3vtsswY',
'topLevelComment': {'kind': 'youtube#comment', 'etag': 'QZFjHr5bIQC72OicksbfJ3Py-Hk',
'id': 'Ugzvldq2VB0lBIzoGVR4AaABAg',
'snippet': {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'AXMw3vtsswY',
'textDisplay': 'Ihr seid spitze! Danke, dass es euch gibt 👏',
'textOriginal': 'Ihr seid spitze! Danke, dass es euch gibt 👏',
'authorDisplayName': 'Cris Tiano',
'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLT_ZmzCfLD22VLmHv-zIOnNiBGZHoYBhgcsgQ=s48-c-k-c0x00ffffff-no-rj',
'authorChannelUrl': 'http://www.youtube.com/channel/UCU3xXx609PrAf6AwLjs5oSw',
'authorChannelId': {'value': 'UCU3xXx609PrAf6AwLjs5oSw'},
'canRate': True, 'viewerRating': 'none', 'likeCount': 0,
'publishedAt': '2021-08-16T15:53:30Z',
'updatedAt': '2021-08-16T15:53:30Z'}}, 'canReply': True,
'totalReplyCount': 0, 'isPublic': True}},
]}
comments = []
for item in data['items']:
    entry = {}
    # the comment itself lives two levels below the thread's own snippet
    snippet = item['snippet']['topLevelComment']['snippet']
    for field in ['channelId', 'videoId']:
        entry[field] = snippet[field]
    for field in ['textOriginal', 'textDisplay', 'canRate', 'likeCount', 'updatedAt', 'viewerRating', 'publishedAt']:
        entry[field] = snippet[field]
    # these three fields belong to the thread, not to the comment
    entry['canReply'] = item['snippet']['canReply']
    entry['isPublic'] = item['snippet']['isPublic']
    entry['totalReplyCount'] = item['snippet']['totalReplyCount']
    comments.append(entry)
for idx, comment in enumerate(comments, 1):
    print(f'{idx}) {comment}')
output
1) {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'tUXWw6WvgkI', 'textOriginal': 'Tipp 1: Zusatzszüge – machs wie Fredy. (Hinweis: dieses Video wurde vor der Corona-Pandemie erstellt)', 'textDisplay': 'Tipp 1: Zusatzszüge – machs wie Fredy. (Hinweis: dieses Video wurde vor der Corona-Pandemie erstellt)', 'canRate': True, 'likeCount': 1, 'updatedAt': '2021-09-15T07:29:00Z', 'viewerRating': 'none', 'publishedAt': '2021-09-15T07:29:00Z', 'canReply': True, 'isPublic': True, 'totalReplyCount': 0}
2) {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'M98TRem03Lg', 'textOriginal': 'Ich sehe das Kulturland schon schmelzen und verschwinden...', 'textDisplay': 'Ich sehe das Kulturland schon schmelzen und verschwinden...', 'canRate': True, 'likeCount': 0, 'updatedAt': '2021-09-14T18:08:55Z', 'viewerRating': 'none', 'publishedAt': '2021-09-14T18:08:55Z', 'canReply': True, 'isPublic': True, 'totalReplyCount': 1}
3) {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'M98TRem03Lg', 'textOriginal': 'Guete initiativ! Mega fan vo dere projekt!', 'textDisplay': 'Guete initiativ! Mega fan vo dere projekt!', 'canRate': True, 'likeCount': 0, 'updatedAt': '2021-09-14T07:18:31Z', 'viewerRating': 'none', 'publishedAt': '2021-09-14T07:18:31Z', 'canReply': True, 'isPublic': True, 'totalReplyCount': 0}
4) {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'zYnbgDyWM9o', 'textOriginal': 'Très mauvaise voix off, à un moment il y se reprend même dans le texte 😐', 'textDisplay': 'Très mauvaise voix off, à un moment il y se reprend même dans le texte 😐', 'canRate': True, 'likeCount': 0, 'updatedAt': '2021-09-12T12:12:58Z', 'viewerRating': 'none', 'publishedAt': '2021-09-12T12:12:58Z', 'canReply': True, 'isPublic': True, 'totalReplyCount': 0}
5) {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': '4nU0MgKft6c', 'textOriginal': 'Shiey be like', 'textDisplay': 'Shiey be like', 'canRate': True, 'likeCount': 0, 'updatedAt': '2021-09-02T23:23:35Z', 'viewerRating': 'none', 'publishedAt': '2021-09-02T23:23:35Z', 'canReply': True, 'isPublic': True, 'totalReplyCount': 0}
6) {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'oxSLp_1WtcM', 'textOriginal': 'Vivement un Lyria en Belgique !!!!', 'textDisplay': 'Vivement un Lyria en Belgique !!!!', 'canRate': True, 'likeCount': 0, 'updatedAt': '2021-09-02T21:39:26Z', 'viewerRating': 'none', 'publishedAt': '2021-09-02T21:39:26Z', 'canReply': True, 'isPublic': True, 'totalReplyCount': 0}
7) {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': '8vCvSmAIv1s', 'textOriginal': 'Build a high speed railway line into the moon I dare you with 20 million francs', 'textDisplay': 'Build a high speed railway line into the moon I dare you with 20 million francs', 'canRate': True, 'likeCount': 0, 'updatedAt': '2021-09-02T08:36:45Z', 'viewerRating': 'none', 'publishedAt': '2021-09-02T08:36:45Z', 'canReply': True, 'isPublic': True, 'totalReplyCount': 0}
8) {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': '7earPWDJbhA', 'textOriginal': 'Sehr schön', 'textDisplay': 'Sehr schön', 'canRate': True, 'likeCount': 2, 'updatedAt': '2021-08-27T16:06:59Z', 'viewerRating': 'none', 'publishedAt': '2021-08-27T16:06:59Z', 'canReply': True, 'isPublic': True, 'totalReplyCount': 0}
9) {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'AXMw3vtsswY', 'textOriginal': 'wie kann mann feuerwehr mann bei SBB werden', 'textDisplay': 'wie kann mann feuerwehr mann bei SBB werden', 'canRate': True, 'likeCount': 0, 'updatedAt': '2021-08-21T11:00:05Z', 'viewerRating': 'none', 'publishedAt': '2021-08-21T11:00:05Z', 'canReply': True, 'isPublic': True, 'totalReplyCount': 1}
10) {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'DmBo0MMxDb0', 'textOriginal': 'I am only 15, but i have a very very big passion for these trains, i can’t wait to drive around Switzerland and help people arrive at their destinations\nI also learned about signals in Switzerland as short documentaries on how these trains work.\nI hope nothing major will change in 5 years:) i really dreaming of becoming an engine driver', 'textDisplay': 'I am only 15, but i have a very very big passion for these trains, i can’t wait to drive around Switzerland and help people arrive at their destinations<br>I also learned about signals in Switzerland as short documentaries on how these trains work.<br>I hope nothing major will change in 5 years:) i really dreaming of becoming an engine driver', 'canRate': True, 'likeCount': 1, 'updatedAt': '2021-08-19T22:32:58Z', 'viewerRating': 'none', 'publishedAt': '2021-08-19T22:32:58Z', 'canReply': True, 'isPublic': True, 'totalReplyCount': 1}
11) {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'ES0AnIBNJfQ', 'textOriginal': 'wie heisst der sprecher dieser werbung? so eine wunderbare stimme!\ndie musik ist auch toll, wie heisst das stück?', 'textDisplay': 'wie heisst der sprecher dieser werbung? so eine wunderbare stimme!<br>die musik ist auch toll, wie heisst das stück?', 'canRate': True, 'likeCount': 1, 'updatedAt': '2021-08-18T00:50:32Z', 'viewerRating': 'none', 'publishedAt': '2021-08-18T00:50:32Z', 'canReply': True, 'isPublic': True, 'totalReplyCount': 2}
12) {'channelId': 'UC_zxivooFdvF4uuBosUnJxQ', 'videoId': 'AXMw3vtsswY', 'textOriginal': 'Ihr seid spitze! Danke, dass es euch gibt 👏', 'textDisplay': 'Ihr seid spitze! Danke, dass es euch gibt 👏', 'canRate': True, 'likeCount': 0, 'updatedAt': '2021-08-16T15:53:30Z', 'viewerRating': 'none', 'publishedAt': '2021-08-16T15:53:30Z', 'canReply': True, 'isPublic': True, 'totalReplyCount': 0}
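As for why the attempts in the question failed: iterating over a dictionary yields its keys, which are strings, so indexing them with 'snippet' raises the TypeError. The comment threads are the elements of the list stored under 'items'. The sketch below shows this on a trimmed, made-up stand-in for the response (the variable name response and its contents are assumptions for illustration, not the real API output):

```python
# Hypothetical, heavily trimmed stand-in for the API response.
response = {
    'kind': 'youtube#commentThreadListResponse',
    'items': [
        {'snippet': {'canReply': True,
                     'totalReplyCount': 0,
                     'topLevelComment': {'snippet': {'textOriginal': 'Sehr schön',
                                                     'likeCount': 2}}}},
    ],
}

# Iterating over a dict yields its keys (strings), not its values,
# so x['snippet'] inside such a loop indexes a string and raises TypeError.
keys = [x for x in response]

# The threads are the elements of the list under 'items'; each one is a dict
# that can be walked key by key down to the comment text.
texts = [item['snippet']['topLevelComment']['snippet']['textOriginal']
         for item in response['items']]

print(keys)
print(texts)
```

The same applies to dataDic in the question: it has the single key 'items', so the loop variable is the string 'items' rather than a comment thread.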

Related

How to find specific HTML element after CSS selector with BeautifulSoup?

I am trying to retrieve the last img src of a web page while web scraping with BeautifulSoup. So far I have been trying to use a selector, but it is impossible for me to find anything after the ::before selector.
The basic code is:
import requests
from bs4 import BeautifulSoup
s = requests.session()
r = s.get("https://www.immobiliare.it/vendita-case/milano/forlanini/?criterio=dataModifica&ordine=desc")
soup = BeautifulSoup(r.content, "lxml")
for property in soup.find_all("li", {"class": "nd-list__item in-realEstateResults__item"}):
The HTML code of the page has the following structure:
Each li class="nd-list__item in-realEstateResults__item" is a property I want to extract the img src from.
Bear in mind that the first image has simpler HTML code; I cannot get the src from the rest of them.
EDIT
I have to correct my initial statement:
The use of the rather sluggish selenium is not absolutely necessary and it is also possible to implement it using requests and beautifulsoup.
On closer inspection, it turned out that all the information can be found in a <script>. Its content can be extracted and used as JSON, and the urls of the maps have to be assembled based on the location information.
Example
import requests, json, time
from bs4 import BeautifulSoup
data = []
url = f'https://www.immobiliare.it/vendita-case/milano/forlanini/?criterio=dataModifica&ordine=desc'
while True:
    jsonData = json.loads(
        BeautifulSoup(
            requests.get(url).text
        ).select_one('#__NEXT_DATA__').text
    )['props']['pageProps']['dehydratedState']['queries'][0]['state']['data']['pages'][0]
    for e in jsonData['results']:
        l = e['realEstate']['properties'][0]['location']
        data.append({
            'id':e['realEstate']['id'],
            'map':f"https://maps.im-cdn.it/static?zoom=15&size=360x270&language=it&style=feature%3Aroad%7Celement%3Alabels%7Cvisibility%3Aoff&sensor=false&markers=icon%3Ahttps%3A%2F%2Fs1.immobiliare.it%2F_next%2Fstatic%2Fmedia%2Fmap-marker.27fc2b6f.png%7C{l['latitude']}%2C{l['longitude']}&center={l['latitude']}%2C{l['longitude']}"
        })
    print(f"scraping page: {jsonData['currentPage']}")
    if jsonData['maxPages'] != jsonData['currentPage']:
        url = f"https://www.immobiliare.it/vendita-case/milano/forlanini/?criterio=dataModifica&ordine=desc&pag={jsonData['currentPage']+1}"
    else:
        break
    time.sleep(1)
data
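To see the extraction step in isolation without hitting the site, here is a minimal offline sketch. It swaps BeautifulSoup for the standard library's html.parser so it runs without extra installs, and it works on a made-up miniature page: SAMPLE_HTML, ScriptJsonExtractor, extract_json and the shortened JSON layout are all assumptions for illustration, not the real page structure.

```python
import json
from html.parser import HTMLParser

# Hypothetical miniature page; the real page embeds a much larger JSON
# document with a deeper path to the results.
SAMPLE_HTML = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"results": [
  {"realEstate": {"id": 1, "properties": [{"location": {"latitude": 45.46, "longitude": 9.24}}]}}
]}}}
</script>
</body></html>
"""


class ScriptJsonExtractor(HTMLParser):
    """Collects the text of the <script> element that has the given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('id') == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_json(html, script_id='__NEXT_DATA__'):
    # Feed the document, then decode whatever text the target script held.
    parser = ScriptJsonExtractor(script_id)
    parser.feed(html)
    parser.close()
    return json.loads(''.join(parser.chunks))


payload = extract_json(SAMPLE_HTML)
locations = [
    (r['realEstate']['id'],
     r['realEstate']['properties'][0]['location']['latitude'],
     r['realEstate']['properties'][0]['location']['longitude'])
    for r in payload['props']['pageProps']['results']
]
print(locations)
```

Once the JSON is loaded, everything else is ordinary dictionary and list indexing, as in the loop above.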
Older Answer
As mentioned in the comments, the content is rendered dynamically, so you will not get the expected result with the combination you used: requests does not render JavaScript the way a browser does, and BeautifulSoup won't find the expected elements because they are not there.
Just to clarify ::before is a pseudo-element:
In CSS, ::before creates a pseudo-element that is the first child of the selected element. It is often used to add cosmetic content to an element with the content property. It is inline by default.
You could stick with requests if you use an API; some of the information comes from:
s.get('https://www.immobiliare.it/api-next/agencies/local-expert/?city-id=8042&province-id=MI&macrozone-id[0]=10294&limit=25&output=json').json()
-> {'agencies': [{'id': 9681, 'displayName': 'Fonte Immobiliare Città Studi 2', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/934856533.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/934856531.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/9681/fonte-citta-studi--milano/', 'address': 'Via Giovanni Briosi 10 20133 - Milano', 'bannerImage': 'https://pic.im-cdn.it/image/934857363/xs-c.jpg', 'externalId': None, 'timeContract': 11, 'paid': True}, {'id': 83565, 'displayName': 'Cfc Immobiliare ', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/244109821.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/244109817.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/83565/cfc-milano/', 'address': 'Via Carnia 7 20132 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.492900,9.236340&zoom=15&size=400x230&markers=45.492900,9.236340', 'externalId': None, 'timeContract': 10, 'paid': True}, {'id': 208668, 'displayName': 'YOUR HOME - Real Estate', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/1127494478.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/1127494476.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/208668/your-home-milano/', 'address': 'Bastioni Porta Nuova 21 20121 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.480100,9.188150&zoom=15&size=400x230&markers=45.480100,9.188150', 'externalId': None, 'timeContract': 7, 'paid': True}, {'id': 231505, 'displayName': 'Homepanda', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/693409659.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/693409657.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/231505/homepanda/', 'address': 'Via Gian Giacomo Mora 20 20123 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.458900,9.179330&zoom=15&size=400x230&markers=45.458900,9.179330', 'externalId': None, 'timeContract': 4, 'paid': True}, {'id': 118081, 
'displayName': 'CONSULOVEST CORBETTA Via Meroni 2 - MILANO V.le San Gimignano 8', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/1162882814.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/1162882812.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/118081/consulovest-corbetta/', 'address': 'Via Meroni 2 20011 - Corbetta', 'bannerImage': 'https://pic.im-cdn.it/image/1162882818/xs-c.jpg', 'externalId': None, 'timeContract': None, 'paid': False}, {'id': 5272, 'displayName': 'Arena Immobiliare S.R.L.', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/936162495.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/936162493.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/5272/arena-milano/', 'address': 'Via Marco Bruto 9 20138 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.459800,9.238870&zoom=15&size=400x230&markers=45.459800,9.238870', 'externalId': None, 'timeContract': 21, 'paid': False}, {'id': 32741, 'displayName': 'Studio emme3 ', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/196647202.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/196647201.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/32741/studio-emme-milano/', 'address': 'Via Pompeo Neri 2 20146 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.456800,9.143770&zoom=15&size=400x230&markers=45.456800,9.143770', 'externalId': None, 'timeContract': 4, 'paid': False}, {'id': 242120, 'displayName': 'Levia SRL', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/843934046.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/843934044.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/242120/levia-milano/', 'address': 'Viale Ungheria 20 20138 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.445700,9.246040&zoom=15&size=400x230&markers=45.445700,9.246040', 'externalId': None, 'timeContract': 3, 'paid': False}, {'id': 396994, 
'displayName': 'Affiliato Tecnorete: STUDIO IMMOBILIARE CORSICA SRL', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/1247888668.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/1247888664.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/396994/tecnorete-milano-viale-ungheria/', 'address': 'Viale Ungheria 24 20135 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.445500,9.246760&zoom=15&size=400x230&markers=45.445500,9.246760', 'externalId': None, 'timeContract': 0, 'paid': False}, {'id': 140950, 'displayName': 'Abitare Agency Srl', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/1135165888.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/1135165886.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/140950/abitare-agency/', 'address': 'Via Voghera 7 20144 - Milano', 'bannerImage': 'https://pic.im-cdn.it/image/1135165932/xs-c.jpg', 'externalId': None, 'timeContract': 10, 'paid': False}, {'id': 94305, 'displayName': 'Affiliato Tecnocasa: IMMOBILIARE MARGOT SRLU', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/1135591154.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/1135591152.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/94305/tecnocasa-milano-via-mecenate/', 'address': 'Via Mecenate 4 20138 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.457400,9.242440&zoom=15&size=400x230&markers=45.457400,9.242440', 'externalId': None, 'timeContract': 8, 'paid': False}, {'id': 241224, 'displayName': 'INVIMIT SGR SpA', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/829818360.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/829818358.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/241224/invimit-roma/', 'address': 'Via di Santa Maria in Via 12 00187 - Roma', 'bannerImage': 'https://pic.im-cdn.it/image/825106468/xs-c.jpg', 'externalId': None, 'timeContract': 3, 'paid': False}, {'id': 209778, 'displayName': 
'STUDIO6ERRE - Sede Milano', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/937464013.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/937464011.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/209778/studioerre-milano/', 'address': 'Viale Abruzzi 80 20131 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.483700,9.217150&zoom=15&size=400x230&markers=45.483700,9.217150', 'externalId': None, 'timeContract': 7, 'paid': False}, {'id': 166328, 'displayName': 'HB ADVISORY', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/311765482.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/311765478.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/166328/hb-advisory/', 'address': 'Corso Buenos Aires 60 20124 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.482200,9.212750&zoom=15&size=400x230&markers=45.482200,9.212750', 'externalId': None, 'timeContract': 9, 'paid': False}, {'id': 41477, 'displayName': 'StudioZimer', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/1143272706.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/1143272704.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/41477/studiozimer/', 'address': 'CORSO LODI 111 20135 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.441500,9.221210&zoom=15&size=400x230&markers=45.441500,9.221210', 'externalId': None, 'timeContract': 4, 'paid': False}, {'id': 386016, 'displayName': 'STUDIO ASTE MC', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/1250133574.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/1250133572.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/386016/studio-aste-mc-sesto-san-giovanni/', 'address': 'Via Carlo Cattaneo 49 20099 - Sesto San Giovanni', 'bannerImage': 'https://pic.im-cdn.it/image/1147591370/xs-c.jpg', 'externalId': None, 'timeContract': None, 'paid': False}, {'id': 42941, 'displayName': 'OBIETTIVOCASA', 
'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/155958486.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/155958482.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/42941/obiettivocasa-milano-via-pordenone/', 'address': 'via pordenone 13 20132 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.490300,9.234840&zoom=15&size=400x230&markers=45.490300,9.234840', 'externalId': None, 'timeContract': 10, 'paid': False}, {'id': 392582, 'displayName': 'AsteGlobal', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/1227120896.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/1227120894.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/392582/asteblobal/', 'address': 'via Reali 13 20037 - Paderno Dugnano', 'bannerImage': 'https://pic.im-cdn.it/image/1227121058/xs-c.jpg', 'externalId': None, 'timeContract': None, 'paid': False}, {'id': 203747, 'displayName': 'Le case di Patty', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/811914140.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/811914138.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/203747/le-case-di-patty-milano/', 'address': 'Via Montebello 14 20121 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.475200,9.189070&zoom=15&size=400x230&markers=45.475200,9.189070', 'externalId': None, 'timeContract': 7, 'paid': False}, {'id': 35498, 'displayName': "Expo' Servizi Immobiliari", 'imageUrls': [], 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/35498/expo/', 'address': 'Viale Premuda 21 20129 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.466000,9.207020&zoom=15&size=400x230&markers=45.466000,9.207020', 'externalId': None, 'timeContract': 5, 'paid': False}, {'id': 228450, 'displayName': 'Aste Milano Immobiliare', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/1061930169.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/1061930167.jpg'}, 
'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/228450/aste-rozzano/', 'address': 'Via Innocenzo Isimbardi 29 20141 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.435200,9.180890&zoom=15&size=400x230&markers=45.435200,9.180890', 'externalId': None, 'timeContract': 5, 'paid': False}, {'id': 5350, 'displayName': 'IMI immobiliare Milano - Partner Navigli', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/424092179.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/424092177.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/5350/imi-milano-navigli/', 'address': 'Via Conchetta 2 20136 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.446800,9.179270&zoom=15&size=400x230&markers=45.446800,9.179270', 'externalId': None, 'timeContract': 12, 'paid': False}, {'id': 237934, 'displayName': 'ASTA4YOU', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/1112140922.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/1112140920.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/237934/astayou/', 'address': 'Via Domenico Cimarosa 26 20144 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.464200,9.157880&zoom=15&size=400x230&markers=45.464200,9.157880', 'externalId': None, 'timeContract': 3, 'paid': False}, {'id': 28201, 'displayName': 'Meta Immobiliare - Massimo Valore Certificato', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/962490064.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/962490062.jpg'}, 'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/28201/meta-san-donato/', 'address': 'Via Alfonsine 34 20097 - San Donato Milanese', 'bannerImage': 'https://pic.im-cdn.it/image/854995656/xs-c.jpg', 'externalId': None, 'timeContract': 9, 'paid': False}, {'id': 3986, 'displayName': 'TREC s.a.s', 'imageUrls': {'large': 'https://pic.im-cdn.it/imagenoresize/1040856624.jpg', 'small': 'https://pic.im-cdn.it/imagenoresize/1040856622.jpg'}, 
'agencyUrl': 'https://www.immobiliare.it/agenzie-immobiliari/3986/tre-c/', 'address': 'Via Negroli 49 20133 - Milano', 'bannerImage': 'https://maps.im-cdn.it/static?center=45.467200,9.232320&zoom=15&size=400x230&markers=45.467200,9.232320', 'externalId': None, 'timeContract': 1, 'paid': False}], 'searchAgencyUrl': 'http://www.immobiliare.it/agenzie-immobiliari/milano/?idMZona[]=10294'}
or with selenium to mimic a browser and work on the rendered driver.page_source.
Example
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://www.immobiliare.it/vendita-case/milano/forlanini/?criterio=dataModifica&ordine=desc'
driver.get(url)
soup = BeautifulSoup(driver.page_source)
data = []
for e in soup.select('li.in-realEstateResults__item'):
    data.append({
        'title': e.a.get('title'),
        'imgUrls': [i.get('src') for i in e.select('.nd-list__item img')],
        'imgMapInfo': e.select_one('[alt="mappa"]').get('src') if e.select_one('[alt="mappa"]') else None
    })
data
Output
[{'title': 'Bilocale buono stato, primo piano, Viale Ungheria - Mecenate, Milano', 'imgUrls': ['https://pwm.im-cdn.it/image/1261450576/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261450580/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261702222/xxs-c.jpg', 'https://maps.im-cdn.it/static?zoom=15&size=360x270&language=it&style=feature%3Aroad%7Celement%3Alabels%7Cvisibility%3Aoff&sensor=false&markers=icon%3Ahttps%3A%2F%2Fs1.immobiliare.it%2F_next%2Fstatic%2Fmedia%2Fmap-marker.27fc2b6f.png%7C45.4565%2C9.2427&center=45.4565%2C9.2427', 'https://pic.im-cdn.it/imagenoresize/875151762.jpg'], 'imgMapInfo': 'https://maps.im-cdn.it/static?zoom=15&size=360x270&language=it&style=feature%3Aroad%7Celement%3Alabels%7Cvisibility%3Aoff&sensor=false&markers=icon%3Ahttps%3A%2F%2Fs1.immobiliare.it%2F_next%2Fstatic%2Fmedia%2Fmap-marker.27fc2b6f.png%7C45.4565%2C9.2427&center=45.4565%2C9.2427'}, {'title': 'Bilocale via Romualdo Bonfadini 82, Viale Ungheria - Mecenate, Milano', 'imgUrls': ['https://pwm.im-cdn.it/image/1261689706/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689762/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689770/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689736/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689806/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689780/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689794/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689744/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689718/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689728/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689628/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689636/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689752/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689674/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689694/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689680/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689690/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689670/xxs-c.jpg', 'https://pwm.im-cdn.it/image/1261689652/xxs-c.jpg', 
'https://pwm.im-cdn.it/image/1261689816/xxs-c.jpg', 'https://maps.im-cdn.it/static?zoom=15&size=360x270&language=it&style=feature%3Aroad%7Celement%3Alabels%7Cvisibility%3Aoff&sensor=false&markers=icon%3Ahttps%3A%2F%2Fs1.immobiliare.it%2F_next%2Fstatic%2Fmedia%2Fmap-marker.27fc2b6f.png%7C45.4442%2C9.2417&center=45.4442%2C9.2417', 'https://pic.im-cdn.it/imagenoresize/949757836.jpg'], 'imgMapInfo': 'https://maps.im-cdn.it/static?zoom=15&size=360x270&language=it&style=feature%3Aroad%7Celement%3Alabels%7Cvisibility%3Aoff&sensor=false&markers=icon%3Ahttps%3A%2F%2Fs1.immobiliare.it%2F_next%2Fstatic%2Fmedia%2Fmap-marker.27fc2b6f.png%7C45.4442%2C9.2417&center=45.4442%2C9.2417'}, {'title': 'Appartamento via Oreste Salomone, Viale Ungheria - Mecenate, Milano', 'imgUrls': ['https://pwm.im-cdn.it/image/1256189648/xxs-c.jpg', 'https://pic.im-cdn.it/imagenoresize/994952108.jpg'], 'imgMapInfo': None},...]
from bs4 import BeautifulSoup
html = """
<div class="in-mediaContent">
<div class="nd-figure in-photo in-Card__photo--big">
<div class="nd-figure__image nd-ratio">
::before
<div class="nd-slideshow nd-slideshow--small">
<div class="nd-slideshow__content>
</div>
<div class="nd-slideshow__item
</div>
<div class="nd-slideshow__item
</div>
<div class="nd-slideshow__desired_item
<img src =”desired link”>
</div>
</div>
</div>
</div>
</div>"""
soup = BeautifulSoup(html, 'html.parser')
r = soup.select('div[class*="nd-slideshow"]')
print(r)
The result is the HTML that follows ::before:
[<div class="nd-slideshow nd-slideshow--small">
<div <="" class="nd-slideshow__content> </div> <div class=" div="" nd-slideshow__item="">
<div <img="" class="nd-slideshow__item </div> <div class=" link”="" nd-slideshow__desired_item="" src="”desired">
</div>
</div>
</div>, <div <="" class="nd-slideshow__content> </div> <div class=" div="" nd-slideshow__item="">
<div <img="" class="nd-slideshow__item </div> <div class=" link”="" nd-slideshow__desired_item="" src="”desired">
</div>
</div>, <div <img="" class="nd-slideshow__item </div> <div class=" link”="" nd-slideshow__desired_item="" src="”desired">
</div>]
Based on your screenshot, I searched for the div element that has the "nd-slideshow__item in-realEstateListCard__mapInfo" class, and from there I could get the image inside that div element.
With this idea, I've modified your code as follows:
import requests
from bs4 import BeautifulSoup
url = "https://www.immobiliare.it/vendita-case/milano/forlanini/?criterio=dataModifica&ordine=desc"
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
# The image you want is inside an img HTML element, which is contained inside a "div" element:
div_element = soup.find_all("div", class_="nd-slideshow__item in-realEstateListCard__mapInfo")
# Print the "src" value of the img HTML element found in the first matching div
print(div_element[0].find("img")["src"])
And this is the result I got:
https://maps.im-cdn.it/static?zoom=15&size=360x270&language=it&style=feature%3Aroad%7Celement%3Alabels%7Cvisibility%3Aoff&sensor=false&markers=icon%3Ahttps%3A%2F%2Fs1.immobiliare.it%2F_next%2Fstatic%2Fmedia%2Fmap-marker.27fc2b6f.png%7C45.4565%2C9.2427&center=45.4565%2C9.2427
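The lookup above raises an IndexError when no matching div exists on the page. As a minimal sketch of a more defensive variant, the snippet below uses a single CSS selector with select_one and returns None when nothing matches. The markup in SAMPLE_HTML, the example.com URLs, and the helper name map_image_src are assumptions made up for illustration; only the two class names come from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for one listing card.
SAMPLE_HTML = """
<li class="in-realEstateResults__item">
  <div class="nd-slideshow__item"><img src="https://example.com/photo.jpg"></div>
  <div class="nd-slideshow__item in-realEstateListCard__mapInfo">
    <img alt="mappa" src="https://example.com/map.png">
  </div>
</li>
"""


def map_image_src(html):
    """Return the src of the map image, or None when the card has no map."""
    soup = BeautifulSoup(html, "html.parser")
    # A compound class selector matches a div carrying both classes,
    # regardless of their order in the class attribute.
    img = soup.select_one(
        "div.nd-slideshow__item.in-realEstateListCard__mapInfo img"
    )
    return img.get("src") if img else None


print(map_image_src(SAMPLE_HTML))   # the map image URL
print(map_image_src("<li></li>"))   # None, no map div present
```

Unlike passing the full class string to class_, the compound selector does not depend on the exact order or spacing of the class names in the markup.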

reading Tweepy data value from tweepy.models.Status object in python is not working

I am trying to get information on retweeters for a specific tweet using Tweepy and fetch the in_reply_to_status_id from the returned Tweepy response.
Here is the code
retweets_list = api.get_retweets(id=tweetid)
for retweet in retweets_list:
    retweet_json = json.dumps(retweet._json, indent=2)
    retweet_json = json.loads(retweet_json)
    print(retweet_json)
The code above produces the response below:
{'created_at': 'Sat Jun 18 06:38:49 +0000 2022', 'id': 1538048568782688256, 'id_str': '1538048568782688256', 'text': 'RT #gyfboxAI: #isle_mcelroy Some mentioned items in thread \n\n#AllisonPDavis The Governesses => httpsurl The Ob…', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'gyfboxAI', 'name': 'Gyfbox', 'id': 1521109812032978946, 'id_str': '1521109812032978946', 'indices': [3, 12]}, {'screen_name': 'isle_mcelroy', 'name': 'Isle McElroy', 'id': 868462820, 'id_str': '868462820', 'indices': [14, 27]}, {'screen_name': 'AllisonPDavis', 'name': 'Allison P Davis', 'id': 15088579, 'id_str': '15088579', 'indices': [61, 75]}, {'screen_name': 'kvargs93', 'name': 'Katherine Varga', 'id': 885284552897429504, 'id_str': '885284552897429504', 'indices': [125, 134]}], 'urls': [{'url': 'httpsurl', 'expanded_url': 'httpsurlamzn.to/3MUM0mI', 'display_url': 'amzn.to/3MUM0mI', 'indices': [100, 123]}]}, 'source': 'Twitter for iPhone', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1003173584, 'id_str': '1003173584', 'name': 'Elaine Showalter', 'screen_name': 'ecshowalter', 'location': 'Washington, D.C./London', 'description': 'Professor Emerita Princeton U; Anglophile, feminist, theatre fanatic, “The Civil Wars of Julia Ward Howe.” watercolor by Vanessa Bell, “The Queen’s Tea Party”', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 8142, 'friends_count': 1049, 'listed_count': 104, 'created_at': 'Tue Dec 11 03:08:17 +0000 2012', 'favourites_count': 24912, 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'verified': False, 'statuses_count': 26489, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': 'C0DEED', 'profile_background_image_url': 
'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'httpsurlabs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/968862619699425281/CKzdSRf6_normal.jpg', 'profile_image_url_https': 'httpsurlpbs.twimg.com/profile_images/968862619699425281/CKzdSRf6_normal.jpg', 'profile_banner_url': 'httpsurlpbs.twimg.com/profile_banners/1003173584/1569562029', 'profile_link_color': '0084B4', 'profile_sidebar_border_color': 'FFFFFF', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'has_extended_profile': False, 'default_profile': False, 'default_profile_image': False, 'following': False, 'follow_request_sent': False, 'notifications': False, 'translator_type': 'none', 'withheld_in_countries': []}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'retweeted_status': {'created_at': 'Fri Jun 17 17:55:18 +0000 2022', 'id': 1537856423740198913, 'id_str': '1537856423740198913', 'text': '#isle_mcelroy Some mentioned items in thread \n\n#AllisonPDavis The Governesses => httpsurl… httpsurl', 'truncated': True, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'isle_mcelroy', 'name': 'Isle McElroy', 'id': 868462820, 'id_str': '868462820', 'indices': [0, 13]}, {'screen_name': 'AllisonPDavis', 'name': 'Allison P Davis', 'id': 15088579, 'id_str': '15088579', 'indices': [47, 61]}], 'urls': [{'url': 'httpsurl', 'expanded_url': 'httpsurlamzn.to/3MUM0mI', 'display_url': 'amzn.to/3MUM0mI', 'indices': [86, 109]}, {'url': 'httpsurl’, 'expanded_url': 'httpsurltwitter.com/i/web/status/1537856423740198913', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [111, 134]}]}, 'source': 'gyfbox', 'in_reply_to_status_id': 1537835837542604801, 'in_reply_to_status_id_str': '1537835837542604801', 'in_reply_to_user_id': 868462820, 'in_reply_to_user_id_str': '868462820', 
'in_reply_to_screen_name': 'isle_mcelroy', 'user': {'id': 1521109812032978946, 'id_str': '1521109812032978946', 'name': 'Gyfbox', 'screen_name': 'gyfboxAI', 'location': '', 'description': 'Tag "#GyfboxAI find item" \n\n#GyfboxAI will reply with link for items mentioned in the thread\n\nCOMING SOON !', 'url': 'httpsurlt.co/u7fGrxh24Y', 'entities': {'url': {'urls': [{'url': 'httpsurlt.co/u7fGrxh24Y', 'expanded_url': 'httpsurlwww.gyfbox.com', 'display_url': 'gyfbox.com', 'indices': [0, 23]}]}, 'description': {'urls': []}}, 'protected': False, 'followers_count': 1, 'friends_count': 6, 'listed_count': 0, 'created_at': 'Mon May 02 12:50:32 +0000 2022', 'favourites_count': 1, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': False, 'statuses_count': 49, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': None, 'profile_background_image_url_https': None, 'profile_background_tile': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1521109885827661824/iTrlR67U_normal.png', 'profile_image_url_https': 'httpsurlpbs.twimg.com/profile_images/1521109885827661824/iTrlR67U_normal.png', 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'has_extended_profile': True, 'default_profile': True, 'default_profile_image': False, 'following': False, 'follow_request_sent': False, 'notifications': False, 'translator_type': 'none', 'withheld_in_countries': []}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'retweet_count': 1, 'favorite_count': 0, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'lang': 'en'}, 'is_quote_status': False, 'retweet_count': 1, 'favorite_count': 0, 'favorited': False, 'retweeted': False, 'possibly_sensitive': 
False, 'lang': 'en'}
Multiple attempts to extract the in_reply_to_status_id always return None.
Sample attempts that returned None:
retweet_json['in_reply_to_status_id']
retweet.in_reply_to_status_id
The data returned above shows
'in_reply_to_status_id': 1537835837542604801,
so I should be getting 1537835837542604801 for in_reply_to_status_id.
What am I doing wrong, and how can I obtain the in_reply_to_status_id?
According to your JSON structure, the top-level in_reply_to_status_id is None. The ID you are looking for belongs to the original tweet, which is nested under retweeted_status, so
retweet_json['retweeted_status']['in_reply_to_status_id']
should give
1537835837542604801
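A minimal sketch of that lookup, using a trimmed-down stand-in for retweet._json that keeps only the keys needed here. The helper name original_reply_id is hypothetical, and the .get() fallbacks are an added precaution for statuses that carry no retweeted_status key.

```python
# Trimmed-down stand-in for retweet._json; values copied from the response above.
retweet_json = {
    "id": 1538048568782688256,
    "in_reply_to_status_id": None,          # the retweet itself is not a reply
    "retweeted_status": {
        "id": 1537856423740198913,
        "in_reply_to_status_id": 1537835837542604801,  # the original tweet is
    },
}


def original_reply_id(status):
    """Return in_reply_to_status_id of the retweeted tweet, or None."""
    original = status.get("retweeted_status") or {}
    return original.get("in_reply_to_status_id")


print(retweet_json["in_reply_to_status_id"])   # None
print(original_reply_id(retweet_json))         # 1537835837542604801
print(original_reply_id({"id": 1}))            # None (not a retweet)
```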

Convert nested dictionary, list, and dictionary into a pandas data frame in python

So, I am trying to work with a rest API, and it is giving me the following data:
{'sports': [{'id': '20',
'uid': 's:20',
'name': 'Football',
'slug': 'football',
'leagues': [{'id': '28',
'uid': 's:20~l:28',
'name': 'National Football League',
'abbreviation': 'NFL',
'shortName': 'NFL',
'slug': 'nfl',
'teams': [{'team': {'id': '22',
'uid': 's:20~l:28~t:22',
'slug': 'arizona-cardinals',
'location': 'Arizona',
'name': 'Cardinals',
'nickname': 'Cardinals',
'abbreviation': 'ARI',
'displayName': 'Arizona Cardinals',
'shortDisplayName': 'Cardinals',
'color': 'A40227',
'alternateColor': '000000',
'isActive': True,
'isAllStar': False,
'logos': [{'href': 'https://a.espncdn.com/i/teamlogos/nfl/500/ari.png',
'width': 500,
'height': 500,
'alt': '',
'rel': ['full', 'default'],
'lastUpdated': '2018-06-05T12:11Z'},
{'href': 'https://a.espncdn.com/i/teamlogos/nfl/500-dark/ari.png',
'width': 500,
'height': 500,
'alt': '',
'rel': ['full', 'dark'],
'lastUpdated': '2018-06-05T12:11Z'},
{'href': 'https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/ari.png',
'width': 500,
'height': 500,
'alt': '',
'rel': ['full', 'scoreboard'],
'lastUpdated': '2018-06-05T12:11Z'},
...
I'm just interested in the teams data. However, no matter how I try to slice it, I'm having trouble extracting the desired information into a dataframe properly.
Here is my code:
url = 'http://site.api.espn.com/apis/site/v2/sports/football/nfl/teams'
r = requests.get(url)
teams_json = r.json()
nfl = []
for teams in teams_json.items():
    for x in teams:
        for row in x:
            print(row['teams'])
I keep getting errors.
Any assistance is greatly appreciated.
I'd suggest looking into how to navigate lists and dictionaries in Python (that's all a parsed JSON document is). It's just a matter of knowing the path to the data you want, or how to iterate through each level.
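As a rough illustration of following that path, the sketch below walks a heavily trimmed stand-in for the response: sports is a list, each sport has a leagues list, and each league has a teams list whose entries wrap the data under a team key. The sample payload keeps only a few fields, and the helper name iter_teams is hypothetical.

```python
# Heavily trimmed stand-in for the API response shown in the question.
teams_json = {
    "sports": [
        {
            "name": "Football",
            "leagues": [
                {
                    "abbreviation": "NFL",
                    "teams": [
                        {"team": {"id": "22", "displayName": "Arizona Cardinals"}},
                        {"team": {"id": "1", "displayName": "Atlanta Falcons"}},
                    ],
                }
            ],
        }
    ]
}


def iter_teams(payload):
    """Yield each inner team dict, one list level at a time."""
    for sport in payload["sports"]:          # list of sport dicts
        for league in sport["leagues"]:      # list of league dicts
            for entry in league["teams"]:    # list of {"team": {...}} wrappers
                yield entry["team"]


names = [team["displayName"] for team in iter_teams(teams_json)]
print(names)  # ['Arizona Cardinals', 'Atlanta Falcons']
```

The loop in the question fails because iterating over teams_json.items() yields (key, value) tuples, so the inner loops end up indexing into the string 'sports' instead of the nested dictionaries.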
To get the data into a dataframe, pandas has a handy json_normalize() function. I'm not sure exactly what data you want, as the data under the root teams key is itself nested, so depending on what you are after, you may need to do a little extra work to extract it. But this gives the general dataframe for teams.
import requests
import pandas as pd
url = 'http://site.api.espn.com/apis/site/v2/sports/football/nfl/teams'
jsonData = requests.get(url).json()
teams_json = jsonData['sports'][0]['leagues'][0]['teams']
df = pd.json_normalize(teams_json)
Output:
print(df.head().to_string())
team.id team.uid team.slug team.location team.name team.nickname team.abbreviation team.displayName team.shortDisplayName team.color team.alternateColor team.isActive team.isAllStar team.logos team.record.items team.links
0 22 s:20~l:28~t:22 arizona-cardinals Arizona Cardinals Cardinals ARI Arizona Cardinals Cardinals A40227 000000 True False [{'href': 'https://a.espncdn.com/i/teamlogos/nfl/500/ari.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'default'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500-dark/ari.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'dark'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/ari.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'scoreboard'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500-dark/scoreboard/ari.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'scoreboard', 'dark'], 'lastUpdated': '2018-06-05T12:11Z'}] [{'summary': '11-6', 'stats': [{'name': 'playoffSeed', 'value': 5.0}, {'name': 'wins', 'value': 11.0}, {'name': 'losses', 'value': 6.0}, {'name': 'winPercent', 'value': 0.6470588445663452}, {'name': 'gamesBehind', 'value': 0.0}, {'name': 'ties', 'value': 0.0}, {'name': 'OTWins', 'value': 0.0}, {'name': 'OTLosses', 'value': 0.0}, {'name': 'gamesPlayed', 'value': 17.0}, {'name': 'pointsFor', 'value': 449.0}, {'name': 'pointsAgainst', 'value': 366.0}, {'name': 'avgPointsFor', 'value': 26.41176414489746}, {'name': 'avgPointsAgainst', 'value': 21.52941131591797}, {'name': 'points', 'value': 2.5}, {'name': 'differential', 'value': 83.0}, {'name': 'streak', 'value': -1.0}, {'name': 'clincher', 'value': 0.0}, {'name': 'divisionWinPercent', 'value': 0.6666666865348816}, {'name': 'leagueWinPercent', 'value': 0.5833333134651184}, {'name': 'divisionRecord', 'value': 0.0}, {'name': 'divisionWins', 'value': 4.0}, {'name': 'divisionTies', 'value': 0.0}, {'name': 'divisionLosses', 'value': 2.0}]}] [{'language': 'en-US', 'rel': ['clubhouse', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/_/name/ari/arizona-cardinals', 'text': 'Clubhouse', 'shortText': 
'Clubhouse', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['roster', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/team/roster/_/name/ari/arizona-cardinals', 'text': 'Roster', 'shortText': 'Roster', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['stats', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/team/stats/_/name/ari/arizona-cardinals', 'text': 'Statistics', 'shortText': 'Statistics', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['schedule', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/schedule/_/name/ari', 'text': 'Schedule', 'shortText': 'Schedule', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['photos', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/photos/_/name/ari', 'text': 'photos', 'shortText': 'photos', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['scores', 'sportscenter', 'app', 'team'], 'href': 'sportscenter://x-callback-url/showClubhouse?uid=s:20~l:28~t:22&section=scores', 'text': 'Scores', 'shortText': 'Scores', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['draftpicks', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/draft/teams/_/name/ari/arizona-cardinals', 'text': 'Draft Picks', 'shortText': 'Draft Picks', 'isExternal': False, 'isPremium': True}, {'language': 'en-US', 'rel': ['transactions', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/transactions/_/name/ari', 'text': 'Transactions', 'shortText': 'Transactions', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['injuries', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/injuries/_/name/ari', 'text': 'Injuries', 'shortText': 'Injuries', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['depthchart', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/depth/_/name/ari', 'text': 'Depth Chart', 'shortText': 'Depth Chart', 
'isExternal': False, 'isPremium': False}, {'language': 'en', 'rel': ['tickets', 'desktop', 'team'], 'href': 'https://www.vividseats.com/nfl-football/arizona-cardinals-tickets.html?wsUser=717', 'text': 'Tickets', 'isExternal': True, 'isPremium': False}]
1 1 s:20~l:28~t:1 atlanta-falcons Atlanta Falcons Falcons ATL Atlanta Falcons Falcons 000000 000000 True False [{'href': 'https://a.espncdn.com/i/teamlogos/nfl/500/atl.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'default'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500-dark/atl.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'dark'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/atl.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'scoreboard'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500-dark/scoreboard/atl.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'scoreboard', 'dark'], 'lastUpdated': '2018-06-05T12:11Z'}] [{'summary': '7-10', 'stats': [{'name': 'playoffSeed', 'value': 12.0}, {'name': 'wins', 'value': 7.0}, {'name': 'losses', 'value': 10.0}, {'name': 'winPercent', 'value': 0.4117647111415863}, {'name': 'gamesBehind', 'value': 0.0}, {'name': 'ties', 'value': 0.0}, {'name': 'OTWins', 'value': 0.0}, {'name': 'OTLosses', 'value': 0.0}, {'name': 'gamesPlayed', 'value': 17.0}, {'name': 'pointsFor', 'value': 313.0}, {'name': 'pointsAgainst', 'value': 459.0}, {'name': 'avgPointsFor', 'value': 18.41176414489746}, {'name': 'avgPointsAgainst', 'value': 27.0}, {'name': 'points', 'value': -1.5}, {'name': 'differential', 'value': -146.0}, {'name': 'streak', 'value': -2.0}, {'name': 'clincher', 'value': 0.0}, {'name': 'divisionWinPercent', 'value': 0.3333333432674408}, {'name': 'leagueWinPercent', 'value': 0.3333333432674408}, {'name': 'divisionRecord', 'value': 0.0}, {'name': 'divisionWins', 'value': 2.0}, {'name': 'divisionTies', 'value': 0.0}, {'name': 'divisionLosses', 'value': 4.0}]}] [{'language': 'en-US', 'rel': ['clubhouse', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/_/name/atl/atlanta-falcons', 'text': 'Clubhouse', 'shortText': 'Clubhouse', 
'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['roster', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/team/roster/_/name/atl/atlanta-falcons', 'text': 'Roster', 'shortText': 'Roster', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['stats', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/team/stats/_/name/atl/atlanta-falcons', 'text': 'Statistics', 'shortText': 'Statistics', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['schedule', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/schedule/_/name/atl', 'text': 'Schedule', 'shortText': 'Schedule', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['photos', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/photos/_/name/atl', 'text': 'photos', 'shortText': 'photos', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['scores', 'sportscenter', 'app', 'team'], 'href': 'sportscenter://x-callback-url/showClubhouse?uid=s:20~l:28~t:1&section=scores', 'text': 'Scores', 'shortText': 'Scores', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['draftpicks', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/draft/teams/_/name/atl/atlanta-falcons', 'text': 'Draft Picks', 'shortText': 'Draft Picks', 'isExternal': False, 'isPremium': True}, {'language': 'en-US', 'rel': ['transactions', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/transactions/_/name/atl', 'text': 'Transactions', 'shortText': 'Transactions', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['injuries', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/injuries/_/name/atl', 'text': 'Injuries', 'shortText': 'Injuries', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['depthchart', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/depth/_/name/atl', 'text': 'Depth Chart', 'shortText': 'Depth Chart', 'isExternal': False, 
'isPremium': False}, {'language': 'en', 'rel': ['tickets', 'desktop', 'team'], 'href': 'https://www.vividseats.com/nfl-football/atlanta-falcons-tickets.html?wsUser=717', 'text': 'Tickets', 'isExternal': True, 'isPremium': False}]
2 2 s:20~l:28~t:2 buffalo-bills Buffalo Bills Bills BUF Buffalo Bills Bills 04407F c60c30 True False [{'href': 'https://a.espncdn.com/i/teamlogos/nfl/500/buf.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'default'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500-dark/buf.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'dark'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/buf.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'scoreboard'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500-dark/scoreboard/buf.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'scoreboard', 'dark'], 'lastUpdated': '2018-06-05T12:11Z'}] [{'summary': '11-6', 'stats': [{'name': 'playoffSeed', 'value': 3.0}, {'name': 'wins', 'value': 11.0}, {'name': 'losses', 'value': 6.0}, {'name': 'winPercent', 'value': 0.6470588445663452}, {'name': 'gamesBehind', 'value': 0.0}, {'name': 'ties', 'value': 0.0}, {'name': 'OTWins', 'value': 0.0}, {'name': 'OTLosses', 'value': 1.0}, {'name': 'gamesPlayed', 'value': 17.0}, {'name': 'pointsFor', 'value': 483.0}, {'name': 'pointsAgainst', 'value': 289.0}, {'name': 'avgPointsFor', 'value': 28.41176414489746}, {'name': 'avgPointsAgainst', 'value': 17.0}, {'name': 'points', 'value': 2.5}, {'name': 'differential', 'value': 194.0}, {'name': 'streak', 'value': 4.0}, {'name': 'clincher', 'value': 0.0}, {'name': 'divisionWinPercent', 'value': 0.8333333134651184}, {'name': 'leagueWinPercent', 'value': 0.5833333134651184}, {'name': 'divisionRecord', 'value': 0.0}, {'name': 'divisionWins', 'value': 5.0}, {'name': 'divisionTies', 'value': 0.0}, {'name': 'divisionLosses', 'value': 1.0}]}] [{'language': 'en-US', 'rel': ['clubhouse', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/_/name/buf/buffalo-bills', 'text': 'Clubhouse', 'shortText': 'Clubhouse', 'isExternal': False, 
'isPremium': False}, {'language': 'en-US', 'rel': ['roster', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/team/roster/_/name/buf/buffalo-bills', 'text': 'Roster', 'shortText': 'Roster', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['stats', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/team/stats/_/name/buf/buffalo-bills', 'text': 'Statistics', 'shortText': 'Statistics', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['schedule', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/schedule/_/name/buf', 'text': 'Schedule', 'shortText': 'Schedule', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['photos', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/photos/_/name/buf', 'text': 'photos', 'shortText': 'photos', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['scores', 'sportscenter', 'app', 'team'], 'href': 'sportscenter://x-callback-url/showClubhouse?uid=s:20~l:28~t:2&section=scores', 'text': 'Scores', 'shortText': 'Scores', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['draftpicks', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/draft/teams/_/name/buf/buffalo-bills', 'text': 'Draft Picks', 'shortText': 'Draft Picks', 'isExternal': False, 'isPremium': True}, {'language': 'en-US', 'rel': ['transactions', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/transactions/_/name/buf', 'text': 'Transactions', 'shortText': 'Transactions', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['injuries', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/injuries/_/name/buf', 'text': 'Injuries', 'shortText': 'Injuries', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['depthchart', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/depth/_/name/buf', 'text': 'Depth Chart', 'shortText': 'Depth Chart', 'isExternal': False, 'isPremium': False}, 
{'language': 'en', 'rel': ['tickets', 'desktop', 'team'], 'href': 'https://www.vividseats.com/nfl-football/buffalo-bills-tickets.html?wsUser=717', 'text': 'Tickets', 'isExternal': True, 'isPremium': False}]
3 3 s:20~l:28~t:3 chicago-bears Chicago Bears Bears CHI Chicago Bears Bears 152644 0b162a True False [{'href': 'https://a.espncdn.com/i/teamlogos/nfl/500/chi.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'default'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500-dark/chi.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'dark'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/chi.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'scoreboard'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500-dark/scoreboard/chi.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'scoreboard', 'dark'], 'lastUpdated': '2018-06-05T12:11Z'}] [{'summary': '6-11', 'stats': [{'name': 'playoffSeed', 'value': 13.0}, {'name': 'wins', 'value': 6.0}, {'name': 'losses', 'value': 11.0}, {'name': 'winPercent', 'value': 0.3529411852359772}, {'name': 'gamesBehind', 'value': 0.0}, {'name': 'ties', 'value': 0.0}, {'name': 'OTWins', 'value': 0.0}, {'name': 'OTLosses', 'value': 0.0}, {'name': 'gamesPlayed', 'value': 17.0}, {'name': 'pointsFor', 'value': 311.0}, {'name': 'pointsAgainst', 'value': 407.0}, {'name': 'avgPointsFor', 'value': 18.294116973876953}, {'name': 'avgPointsAgainst', 'value': 23.941177368164062}, {'name': 'points', 'value': -2.5}, {'name': 'differential', 'value': -96.0}, {'name': 'streak', 'value': -1.0}, {'name': 'clincher', 'value': 0.0}, {'name': 'divisionWinPercent', 'value': 0.3333333432674408}, {'name': 'leagueWinPercent', 'value': 0.3333333432674408}, {'name': 'divisionRecord', 'value': 0.0}, {'name': 'divisionWins', 'value': 2.0}, {'name': 'divisionTies', 'value': 0.0}, {'name': 'divisionLosses', 'value': 4.0}]}] [{'language': 'en-US', 'rel': ['clubhouse', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/_/name/chi/chicago-bears', 'text': 'Clubhouse', 'shortText': 'Clubhouse', 
'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['roster', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/team/roster/_/name/chi/chicago-bears', 'text': 'Roster', 'shortText': 'Roster', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['stats', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/team/stats/_/name/chi/chicago-bears', 'text': 'Statistics', 'shortText': 'Statistics', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['schedule', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/schedule/_/name/chi', 'text': 'Schedule', 'shortText': 'Schedule', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['photos', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/photos/_/name/chi', 'text': 'photos', 'shortText': 'photos', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['scores', 'sportscenter', 'app', 'team'], 'href': 'sportscenter://x-callback-url/showClubhouse?uid=s:20~l:28~t:3&section=scores', 'text': 'Scores', 'shortText': 'Scores', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['draftpicks', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/draft/teams/_/name/chi/chicago-bears', 'text': 'Draft Picks', 'shortText': 'Draft Picks', 'isExternal': False, 'isPremium': True}, {'language': 'en-US', 'rel': ['transactions', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/transactions/_/name/chi', 'text': 'Transactions', 'shortText': 'Transactions', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['injuries', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/injuries/_/name/chi', 'text': 'Injuries', 'shortText': 'Injuries', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['depthchart', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/depth/_/name/chi', 'text': 'Depth Chart', 'shortText': 'Depth Chart', 'isExternal': False, 
'isPremium': False}, {'language': 'en', 'rel': ['tickets', 'desktop', 'team'], 'href': 'https://www.vividseats.com/nfl-football/chicago-bears-tickets.html?wsUser=717', 'text': 'Tickets', 'isExternal': True, 'isPremium': False}]
4 4 s:20~l:28~t:4 cincinnati-bengals Cincinnati Bengals Bengals CIN Cincinnati Bengals Bengals FF2700 000000 True False [{'href': 'https://a.espncdn.com/i/teamlogos/nfl/500/cin.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'default'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500-dark/cin.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'dark'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500/scoreboard/cin.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'scoreboard'], 'lastUpdated': '2018-06-05T12:11Z'}, {'href': 'https://a.espncdn.com/i/teamlogos/nfl/500-dark/scoreboard/cin.png', 'width': 500, 'height': 500, 'alt': '', 'rel': ['full', 'scoreboard', 'dark'], 'lastUpdated': '2018-06-05T12:11Z'}] [{'summary': '10-7', 'stats': [{'name': 'playoffSeed', 'value': 4.0}, {'name': 'wins', 'value': 10.0}, {'name': 'losses', 'value': 7.0}, {'name': 'winPercent', 'value': 0.5882353186607361}, {'name': 'gamesBehind', 'value': 0.0}, {'name': 'ties', 'value': 0.0}, {'name': 'OTWins', 'value': 1.0}, {'name': 'OTLosses', 'value': 2.0}, {'name': 'gamesPlayed', 'value': 17.0}, {'name': 'pointsFor', 'value': 460.0}, {'name': 'pointsAgainst', 'value': 376.0}, {'name': 'avgPointsFor', 'value': 27.058822631835938}, {'name': 'avgPointsAgainst', 'value': 22.117647171020508}, {'name': 'points', 'value': 1.5}, {'name': 'differential', 'value': 84.0}, {'name': 'streak', 'value': -1.0}, {'name': 'clincher', 'value': 0.0}, {'name': 'divisionWinPercent', 'value': 0.6666666865348816}, {'name': 'leagueWinPercent', 'value': 0.6666666865348816}, {'name': 'divisionRecord', 'value': 0.0}, {'name': 'divisionWins', 'value': 4.0}, {'name': 'divisionTies', 'value': 0.0}, {'name': 'divisionLosses', 'value': 2.0}]}] [{'language': 'en-US', 'rel': ['clubhouse', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/_/name/cin/cincinnati-bengals', 'text': 'Clubhouse', 'shortText': 
'Clubhouse', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['roster', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/team/roster/_/name/cin/cincinnati-bengals', 'text': 'Roster', 'shortText': 'Roster', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['stats', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/team/stats/_/name/cin/cincinnati-bengals', 'text': 'Statistics', 'shortText': 'Statistics', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['schedule', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/schedule/_/name/cin', 'text': 'Schedule', 'shortText': 'Schedule', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['photos', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/photos/_/name/cin', 'text': 'photos', 'shortText': 'photos', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['scores', 'sportscenter', 'app', 'team'], 'href': 'sportscenter://x-callback-url/showClubhouse?uid=s:20~l:28~t:4&section=scores', 'text': 'Scores', 'shortText': 'Scores', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['draftpicks', 'desktop', 'team'], 'href': 'http://www.espn.com/nfl/draft/teams/_/name/cin/cincinnati-bengals', 'text': 'Draft Picks', 'shortText': 'Draft Picks', 'isExternal': False, 'isPremium': True}, {'language': 'en-US', 'rel': ['transactions', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/transactions/_/name/cin', 'text': 'Transactions', 'shortText': 'Transactions', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['injuries', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/injuries/_/name/cin', 'text': 'Injuries', 'shortText': 'Injuries', 'isExternal': False, 'isPremium': False}, {'language': 'en-US', 'rel': ['depthchart', 'desktop', 'team'], 'href': 'https://www.espn.com/nfl/team/depth/_/name/cin', 'text': 'Depth Chart', 'shortText': 'Depth Chart', 
'isExternal': False, 'isPremium': False}, {'language': 'en', 'rel': ['tickets', 'desktop', 'team'], 'href': 'https://www.vividseats.com/nfl-football/cincinnati-bengals-tickets.html?wsUser=717', 'text': 'Tickets', 'isExternal': True, 'isPremium': False}]
...
[25 rows x 16 columns]
.items() returns the keys as well as the values, so each teams in your loop is a (key, value) pair: its first element is the string "sports" and its second element is what you are looking for, the list. That is what causes the error.
Edit: You want for key, teams in teams_json.items(), not for teams in teams_json.items().
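A minimal sketch of the difference, using a made-up dictionary (the sample contents are an assumption; the real response is much larger):

```python
# Hypothetical, heavily trimmed stand-in for the parsed response.
teams_json = {"sports": [{"name": "Football", "leagues": []}]}

# Iterating over .items() yields (key, value) tuples, not bare values.
for pair in teams_json.items():
    print(type(pair).__name__, pair[0])

# Unpacking the tuple gives the key and the list separately.
for key, teams in teams_json.items():
    print(key, type(teams).__name__)

# If only the values are needed, .values() skips the keys altogether.
values = list(teams_json.values())
```

Whether unpacking or .values() reads better depends on whether the key is needed later in the loop.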

Python Beautifulsoup retrieving json

I'm trying to retrieve the 'inStockQty' json key/value pair using beautifulsoup but am having trouble.
Here's my code so far:
import requests
from bs4 import BeautifulSoup
url = "https://direct.asda.com/george/men/shoes/black-leather-lace-up-oxford-shoes/GEM830406,default,pd.html?cgid=D2M1G10C13"
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14'
headers = {'User-Agent': user_agent,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "html5lib")
script = soup.select_one('script:contains("window.priceAvailabilityJSON")')
How do I then find 'inStockQty'? I thought about parsing all the JSON, but I don't know how to strip out all the surrounding HTML.
Many Thanks
Try this:
import json
import requests
from bs4 import BeautifulSoup
url = "https://direct.asda.com/george/men/shoes/black-leather-lace-up-oxford-shoes/GEM830406,default,pd.html?cgid=D2M1G10C13"
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14'
headers = {'User-Agent': user_agent,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "html5lib")
script = soup.find(id='main-content').find('script').string
data = script.split('window.priceAvailabilityJSON = ')[1].split(';\nlet product')[0]
json_data = json.loads(data)
# Output
for product in json_data['productAvailability'].values():
    print(product['availability']['inStockQty'])
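Splitting on ';\nlet product' depends on the exact text that follows the JSON. A hedged alternative is to let json.JSONDecoder.raw_decode read a single JSON value starting right after the marker and ignore whatever comes next. The sample script text and the helper name extract_assigned_json below are invented for illustration; only the marker and key names come from the question.

```python
import json

def extract_assigned_json(script_text, marker):
    """Return the JSON value assigned right after `marker` in a script body."""
    start = script_text.index(marker) + len(marker)
    # Skip whitespace between the '=' and the opening brace.
    while script_text[start].isspace():
        start += 1
    # raw_decode parses one JSON document and returns (value, end_index),
    # so the trailing JavaScript does not need to be stripped first.
    value, _end = json.JSONDecoder().raw_decode(script_text, start)
    return value

# Made-up script content, shaped like the page's inline script.
sample = (
    'window.priceAvailabilityJSON = {"productAvailability": '
    '{"G1": {"availability": {"inStockQty": 6}}}};\nlet product = 1;'
)
data = extract_assigned_json(sample, "window.priceAvailabilityJSON =")
quantities = {sku: item["availability"]["inStockQty"]
              for sku, item in data["productAvailability"].items()}
print(quantities)
```

With the real page, script_text would be the string of the script tag found by BeautifulSoup.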
Try Selenium for that job
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
URL = 'https://direct.asda.com/george/men/shoes/black-leather-lace-up-oxford-shoes/GEM830406,default,pd.html?cgid=D2M1G10C13'
driver.get(URL)
driver.implicitly_wait(5) # wait until content is loaded
Read the JavaScript variable and you can access its content:
jsonData = driver.execute_script('return priceAvailabilityJSON')
print(jsonData.get('productAvailability'))
driver.close()
Output
{'G006386138': {'availability': {'backorderable': False, 'inStockQty': 6, 'instock': True, 'isBackorder': False, 'level': 'instock'}, 'badgesInformation': {'backorderInformation': {'backorderMessage': '', 'backorderableMessage': '', 'displayBackorderMessage': False}, 'displayLowStockBadge': False}, 'price': {'available': True, 'list': {'currency': 'EUR', 'decimalPrice': '27.0', 'formatted': '€ 27.00', 'value': 27}, 'vat': 16}}, 'G006386139': {'availability': {'backorderable': False, 'inStockQty': 2, 'instock': True, 'isBackorder': False, 'level': 'instock'}, 'badgesInformation': {'backorderInformation': {'backorderMessage': '', 'backorderableMessage': '', 'displayBackorderMessage': False}, 'displayLowStockBadge': False}, 'price': {'available': True, 'list': {'currency': 'EUR', 'decimalPrice': '27.0', 'formatted': '€ 27.00', 'value': 27}, 'vat': 16}}, 'G006386140': {'availability': {'backorderable': False, 'inStockQty': 9, 'instock': True, 'isBackorder': False, 'level': 'instock'}, 'badgesInformation': {'backorderInformation': {'backorderMessage': '', 'backorderableMessage': '', 'displayBackorderMessage': False}, 'displayLowStockBadge': False}, 'price': {'available': True, 'list': {'currency': 'EUR', 'decimalPrice': '27.0', 'formatted': '€ 27.00', 'value': 27}, 'vat': 16}}, 'G006386141': {'availability': {'backorderable': False, 'inStockQty': 5, 'instock': True, 'isBackorder': False, 'level': 'instock'}, 'badgesInformation': {'backorderInformation': {'backorderMessage': '', 'backorderableMessage': '', 'displayBackorderMessage': False}, 'displayLowStockBadge': False}, 'price': {'available': True, 'list': {'currency': 'EUR', 'decimalPrice': '27.0', 'formatted': '€ 27.00', 'value': 27}, 'vat': 16}}, 'G006386142': {'availability': {'backorderable': False, 'inStockQty': 2, 'instock': True, 'isBackorder': False, 'level': 'instock'}, 'badgesInformation': {'backorderInformation': {'backorderMessage': '', 'backorderableMessage': '', 'displayBackorderMessage': False}, 
'displayLowStockBadge': False}, 'price': {'available': True, 'list': {'currency': 'EUR', 'decimalPrice': '27.0', 'formatted': '€ 27.00', 'value': 27}, 'vat': 16}}, 'G006386143': {'availability': {'backorderable': False, 'inStockQty': 28, 'instock': True, 'isBackorder': False, 'level': 'instock'}, 'badgesInformation': {'backorderInformation': {'backorderMessage': '', 'backorderableMessage': '', 'displayBackorderMessage': False}, 'displayLowStockBadge': False}, 'price': {'available': True, 'list': {'currency': 'EUR', 'decimalPrice': '27.0', 'formatted': '€ 27.00', 'value': 27}, 'vat': 16}}, 'G006386144': {'availability': {'backorderable': False, 'inStockQty': 7, 'instock': True, 'isBackorder': False, 'level': 'instock'}, 'badgesInformation': {'backorderInformation': {'backorderMessage': '', 'backorderableMessage': '', 'displayBackorderMessage': False}, 'displayLowStockBadge': False}, 'price': {'available': True, 'list': {'currency': 'EUR', 'decimalPrice': '27.0', 'formatted': '€ 27.00', 'value': 27}, 'vat': 16}}}

Dataframe not showing twitter sources from Android

I am trying to do some analysis on a Twitter account, but I am having trouble showing sources from Android. I merged two JSON files, and I think I merged them correctly, but in case I got that wrong, here is the code I used.
old_tweets = load_tweets("real_tweets/real_old_tweets.json")
print(len(old_tweets))
for aLis1 in old_tweets:
    if aLis1 not in tweets:
        tweets.append(aLis1)
load_tweets is a custom function that simply opens and loads a JSON file from a given path:
import json
def load_tweets(path):
    with open(path, "rb") as f:
        return json.load(f)
After merging the two JSON files of tweets, I ran this code to create the data frame and clean it up so that it only contains the information I want.
df_tweets1 = pd.DataFrame(tweets)
df_tweets2 = df_tweets1[['id','created_at','source','full_text','retweet_count']]
df_tweets = df_tweets2.drop_duplicates('id', keep=False)
df_tweets.set_index('id', inplace=True)
df_tweets = df_tweets.rename(columns={"created_at": "time", "full_text": "text"})
df_tweets["time"] = pd.to_datetime(df_tweets["time"])
The problem is that when I call df_tweets["source"].unique(), I don't see any tweets coming from Android:
array(['Twitter for iPhone',
'Twitter for iPad',
'Twitter Media Studio',
'Media Studio',
'Twitter Web Client'],
dtype=object)
Did I do something wrong when merging the two sets of Twitter data? Or did I do something wrong when trying to create the data frame?
EDIT: Here is a sample from real_old_tweets.json to give a sense of the format. I am only going to post one tweet because a single tweet contains a lot of information.
[{'created_at': 'Tue Oct 16 16:22:11 +0000 2018',
'id': 1052233253040640001,
'id_str': '1052233253040640001',
'full_text': 'REGISTER TO https://url/0pWiwCHGbh! #MAGA🇺🇸 https://url/ACTMe53TZU',
'truncated': False,
'display_text_range': [0, 44],
'entities': {'hashtags': [{'text': 'MAGA', 'indices': [37, 42]}],
'symbols': [],
'user_mentions': [],
'urls': [{'url': 'url/0pWiwCHGbh',
'expanded_url': 'linkVote.GOP',
'display_url': 'Vote.GOP',
'indices': [12, 35]},
{'url': 'url/ACTMe53TZU',
'expanded_url': 'linktwitter.com/erictrump/status/1052174007708147714',
'display_url': 'twitter.com/erictrump/stat…',
'indices': [45, 68]}]},
'source': 'Twitter for iPhone',
'in_reply_to_status_id': None,
'in_reply_to_status_id_str': None,
'in_reply_to_user_id': None,
'in_reply_to_user_id_str': None,
'in_reply_to_screen_name': None,
'user': {'id': 25073877,
'id_str': '25073877',
'name': 'Donald J. Trump',
'screen_name': 'realDonaldTrump',
'location': 'Washington, DC',
'description': '45th President of the United States of America🇺🇸',
'url': 'url/OMxB0x7xC5',
'entities': {'url': {'urls': [{'url': 'url/OMxB0x7xC5',
'expanded_url': 'linkwww.Instagram.com/realDonaldTrump',
'display_url': 'Instagram.com/realDonaldTrump',
'indices': [0, 23]}]},
'description': {'urls': []}},
'protected': False,
'followers_count': 55165024,
'friends_count': 47,
'listed_count': 94709,
'created_at': 'Wed Mar 18 13:46:38 +0000 2009',
'favourites_count': 25,
'utc_offset': None,
'time_zone': None,
'geo_enabled': True,
'verified': True,
'statuses_count': 39296,
'lang': 'en',
'contributors_enabled': False,
'is_translator': False,
'is_translation_enabled': True,
'profile_background_color': '6D5C18',
'profile_background_image_url': 'linkabs.twimg.com/images/themes/theme1/bg.png',
'profile_background_image_url_https': 'linkabs.twimg.com/images/themes/theme1/bg.png',
'profile_background_tile': True,
'profile_image_url': 'linkpbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg',
'profile_image_url_https': 'linkpbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg',
'profile_banner_url': 'linkpbs.twimg.com/profile_banners/25073877/1539493274',
'profile_link_color': '1B95E0',
'profile_sidebar_border_color': 'BDDCAD',
'profile_sidebar_fill_color': 'C5CEC0',
'profile_text_color': '333333',
'profile_use_background_image': True,
'has_extended_profile': False,
'default_profile': False,
'default_profile_image': False,
'following': False,
'follow_request_sent': False,
'notifications': False,
'translator_type': 'regular'},
'geo': None,
'coordinates': None,
'place': None,
'contributors': None,
'is_quote_status': True,
'quoted_status_id': 1052174007708147714,
'quoted_status_id_str': '1052174007708147714',
'quoted_status_permalink': {'url': 'url/ACTMe53TZU',
'expanded': 'linktwitter.com/erictrump/status/1052174007708147714',
'display': 'twitter.com/erictrump/stat…'},
'quoted_status': {'created_at': 'Tue Oct 16 12:26:46 +0000 2018',
'id': 1052174007708147714,
'id_str': '1052174007708147714',
'full_text': 'Friends: Quick reminder that today is that last day to register to vote in Oregon, Kansas, Louisiana, West Virginia, New Jersey and Maryland. It is very quick and easy - simply go to url/GE5BO5ONN1! Let’s #MakeAmericaGreatAgain 🇺🇸🇺🇸🇺🇸',
'truncated': False,
'display_text_range': [0, 243],
'entities': {'hashtags': [{'text': 'MakeAmericaGreatAgain',
'indices': [214, 236]}],
'symbols': [],
'user_mentions': [],
'urls': [{'url': 'url/GE5BO5ONN1',
'expanded_url': 'linkwww.Vote.GOP',
'display_url': 'Vote.GOP',
'indices': [183, 206]}]},
'source': 'Twitter for iPhone',
'in_reply_to_status_id': None,
'in_reply_to_status_id_str': None,
'in_reply_to_user_id': None,
'in_reply_to_user_id_str': None,
'in_reply_to_screen_name': None,
'user': {'id': 39349894,
'id_str': '39349894',
'name': 'Eric Trump',
'screen_name': 'EricTrump',
'location': '',
'description': "Executive Vice President of The #Trump Organization. Husband to #LaraLeaTrump. Large advocate of #StJude Children's Research Hospital. #MakeAmericaGreatAgain",
'url': 'url/uwwNiWyamR',
'entities': {'url': {'urls': [{'url': 'url/uwwNiWyamR',
'expanded_url': 'linkwww.Trump.com',
'display_url': 'Trump.com',
'indices': [0, 23]}]},
'description': {'urls': []}},
'protected': False,
'followers_count': 2191617,
'friends_count': 715,
'listed_count': 5736,
'created_at': 'Mon May 11 21:42:30 +0000 2009',
'favourites_count': 8638,
'utc_offset': None,
'time_zone': None,
'geo_enabled': True,
'verified': True,
'statuses_count': 5601,
'lang': 'en',
'contributors_enabled': False,
'is_translator': False,
'is_translation_enabled': False,
'profile_background_color': '000000',
'profile_background_image_url': 'linkabs.twimg.com/images/themes/theme1/bg.png',
'profile_background_image_url_https': 'linkabs.twimg.com/images/themes/theme1/bg.png',
'profile_background_tile': True,
'profile_image_url': 'linkpbs.twimg.com/profile_images/974045997268529152/R0CuVYHM_normal.jpg',
'profile_image_url_https': 'linkpbs.twimg.com/profile_images/974045997268529152/R0CuVYHM_normal.jpg',
'profile_banner_url': 'linkpbs.twimg.com/profile_banners/39349894/1516709628',
'profile_link_color': '116AB8',
'profile_sidebar_border_color': '000000',
'profile_sidebar_fill_color': '616161',
'profile_text_color': '000000',
'profile_use_background_image': True,
'has_extended_profile': False,
'default_profile': False,
'default_profile_image': False,
'following': False,
'follow_request_sent': False,
'notifications': False,
'translator_type': 'none'},
'geo': None,
'coordinates': None,
'place': None,
'contributors': None,
'is_quote_status': False,
'retweet_count': 1945,
'favorite_count': 3828,
'favorited': False,
'retweeted': False,
'possibly_sensitive': False,
'lang': 'en'},
'retweet_count': 5415,
'favorite_count': 16565,
'favorited': False,
'retweeted': False,
'possibly_sensitive': False,
'lang': 'en'},
I am assuming that your data does contain "Android" sources; I don't have a clear idea of what your data looks like or how "id" relates to "source". That said, there is a bug in the way you prepare your data: you are dropping all the duplicates.
For example:
>>> import pandas as pd
>>> df = pd.DataFrame(data={'col1':[1,2,2],'col2':[3,4,3],'col3':[1,4,1]})
>>> df
   col1  col2  col3
0     1     3     1
1     2     4     4
2     2     3     1
>>> df.drop_duplicates('col1',keep=False)
   col1  col2  col3
0     1     3     1
As the output above shows, keep=False drops every row whose col1 value is duplicated, including the first occurrence, so no row with col1 equal to 2 survives.
>>> df.drop_duplicates('col1',keep='first')
   col1  col2  col3
0     1     3     1
1     2     4     4
Instead, use keep='first' or keep='last' and see if that improves things. It would also help to see more of the data to figure out where it goes wrong.
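Separately from drop_duplicates, the merge itself could de-duplicate on id so that no later clean-up is needed. The sketch below assumes the goal is to keep one copy of each tweet id; merge_by_id and the sample tweets are hypothetical and not part of the question's code.

```python
def merge_by_id(*tweet_lists):
    """Combine tweet lists, keeping the first tweet seen for each id."""
    merged = {}
    for tweets in tweet_lists:
        for tweet in tweets:
            # setdefault only stores the tweet if this id is new.
            merged.setdefault(tweet["id"], tweet)
    return list(merged.values())

# Made-up data: id 2 appears in both files with a different retweet_count,
# so a plain `not in` check on whole dicts would keep both copies.
new_tweets = [{"id": 1, "source": "Twitter for iPhone", "retweet_count": 5},
              {"id": 2, "source": "Twitter for Android", "retweet_count": 7}]
old_tweets = [{"id": 2, "source": "Twitter for Android", "retweet_count": 9},
              {"id": 3, "source": "Twitter Web Client", "retweet_count": 1}]

tweets = merge_by_id(new_tweets, old_tweets)
sources = sorted({t["source"] for t in tweets})
print(len(tweets), sources)
```

In this made-up case, merging with `not in` and then calling drop_duplicates('id', keep=False) would remove both copies of id 2, and its Android source along with them.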
EDIT
After some time, I took your JSON object and saved it to a "me.json" file in the format:
[{},{}]
where the first object's source is iPhone and the second object's source is Android. I used your code to load the data:
Python 2.7.15rc1 (default, Nov 12 2018, 14:31:15)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import json
>>> with open('me.json','rb') as file:
... json_list = json.load(file)
...
>>> len(json_list)
2
>>> df = pd.DataFrame(json_list)
>>> df1 = df[['id','source']]
>>> df1['source'].value_counts()
Twitter for Android    1
Twitter for iPhone     1
Name: source, dtype: int64
In the above output, "Twitter for Android" shows up, so the loading code works. My conclusion is that your data may contain no "Android" value at all in the top-level df['source'] column.
Look at the data carefully: a JSON object can contain two "source" keys, one at the top level and one inside "quoted_status". You may have seen "Android" only in the nested one.
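One way to check this is to count the two kinds of "source" separately. This is only a sketch: count_sources and the sample tweets are made up, and the real list would come from the merged JSON files.

```python
from collections import Counter

def count_sources(tweets):
    """Count 'source' values for tweets and for the tweets they quote."""
    top_level = Counter(t.get("source") for t in tweets)
    quoted = Counter(
        t["quoted_status"].get("source")
        for t in tweets
        if "quoted_status" in t
    )
    return top_level, quoted

# Made-up tweets: Android appears only inside a quoted_status.
sample = [
    {"id": 1, "source": "Twitter for iPhone",
     "quoted_status": {"id": 10, "source": "Twitter for Android"}},
    {"id": 2, "source": "Twitter for iPhone"},
]
top_level, quoted = count_sources(sample)
print(top_level)
print(quoted)
```

If "Twitter for Android" shows up only in the second counter, the data frame's source column is behaving as expected.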
