I need help extracting information from a nested dictionary inside a list. Here's the code to get the data:
import requests
import json
import time
all_urls = []
for x in range(5000,5010):
    url = f'https://api.jikan.moe/v4/anime/{x}/full'
    all_urls.append(url)
all_responses = []
for page_url in all_urls:
    response = requests.get(page_url)
    all_responses.append(response)
    time.sleep(1)
print(all_responses)
data = []
for i in all_responses:
    json_data = json.loads(i.text)
    data.append(json_data)
print(data)
The sample of the extracted data is as follows:
[{'status': 404,
'type': 'BadResponseException',
'message': 'Resource does not exist',
'error': '404 on https://myanimelist.net/anime/5000/'},
{'status': 404,
'type': 'BadResponseException',
'message': 'Resource does not exist',
'error': '404 on https://myanimelist.net/anime/5001/'},
{'data': {'mal_id': 5002,
'url': 'https://myanimelist.net/anime/5002/Bari_Bari_Densetsu',
'images': {'jpg': {'image_url': 'https://cdn.myanimelist.net/images/anime/4/58873.jpg',
'small_image_url': 'https://cdn.myanimelist.net/images/anime/4/58873t.jpg',
'large_image_url': 'https://cdn.myanimelist.net/images/anime/4/58873l.jpg'},
'webp': {'image_url': 'https://cdn.myanimelist.net/images/anime/4/58873.webp',
'small_image_url': 'https://cdn.myanimelist.net/images/anime/4/58873t.webp',
'large_image_url': 'https://cdn.myanimelist.net/images/anime/4/58873l.webp'}},
'trailer': {'youtube_id': None,
'url': None,
'embed_url': None,
'images': {'image_url': None,
'small_image_url': None,
'medium_image_url': None,
'large_image_url': None,
'maximum_image_url': None}},
'title': 'Bari Bari Densetsu',
'title_english': None,
'title_japanese': 'バリバリ伝説',
'title_synonyms': ['Baribari Densetsu',
......
I need to extract the title from the list of data. Any help is appreciated! Also, any recommendation on a better/simpler/cleaner code to extract the json data from an API is also greatly appreciated!
Firstly, no need to create multiple lists. You can do everything in one loop:
import requests
import json
data = []
for x in range(5000,5010):
    page_url = f'https://api.jikan.moe/v4/anime/{x}/full'
    response = requests.get(page_url)
    json_data = json.loads(response.text)
    data.append(json_data)
print(data)
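Second, to address your problem, note that each successful response nests the anime under the 'data' key, while the 404 responses have no 'data' key at all. You have two options. You can use dict.get:
for dic in data:
    title = dic.get('data', {}).get('title', 'no title')
Or use the try/except pattern:
for dic in data:
    try:
        title = dic['data']['title']
    except KeyError:
        # deal with case where dict has no title
        pass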
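As a rough sketch of how the lookup could be wrapped up, the helper below collects the titles from a list of decoded responses and skips the 404 entries. The function name extract_titles and the shortened sample list are made up for illustration; only the 'data' and 'title' keys come from the sample output in the question.

```python
def extract_titles(responses):
    """Return the title of every decoded response that has one."""
    titles = []
    for item in responses:
        # Error responses (e.g. 404) have no 'data' key, so skip them.
        anime = item.get('data')
        if anime is None:
            continue
        title = anime.get('title')
        if title is not None:
            titles.append(title)
    return titles


# Shortened stand-in for the list printed in the question.
sample = [
    {'status': 404, 'type': 'BadResponseException',
     'message': 'Resource does not exist'},
    {'data': {'mal_id': 5002, 'title': 'Bari Bari Densetsu',
              'title_english': None}},
]

print(extract_titles(sample))
```

With the real data list from the loop above you would call extract_titles(data) instead of using the sample.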
I'm a Python beginner. I would like to ask for help retrieving data from a response. Here's my script:
import pandas as pd
import time
import requests
import json
response = requests.get(url, headers=headers, auth=auth)
data = response.json()
Here's a part of json response:
{'result': [{'display': '',
'closure_code': '',
'service_offer': 'Integration Platforms',
'updated_on': '2022-04-23 09:05:53',
'urgency': '2',
'business_service': 'Operations',
'updated_by': 'serviceaccount45',
'description': 'ALERT returned 400 but expected 200',
'sys_created_on': '2022-04-23 09:05:53',
'sys_created_by': 'serviceaccount45',
'subcategory': 'Integration',
'contact_type': 'Email',
'problem_type': 'Design: Availability',
'caller_id': '',
'action': 'create',
'company': 'aaaa',
'priority': '3',
'status': '1',
'opened': 'smith.j',
'assigned_to': 'doe.j',
'number': '123456',
'group': 'blabla',
'impact': '2',
'category': 'Business Application & Databases',
'caused_by_change': '',
'location': 'All Locations',
'configuration_item': 'Monitor',
},
I would like to extract the data only for one group = 'blabla'. Then I would like to extract fields such as:
number = data['number']
group = data['group']
service_offer = data['service_offer']
updated = data['updated_on']
urgency = data['urgency']
username = data['created_by']
short_desc = data['description']
How should this be done?
I know that to check the first value I should use:
service_offer = data['result'][0]['service_offer']
I've tried to create a dictionary, but, I'm getting an error:
data_result = response.json()['result']
payload ={
number = data_result['number']
group = data_result['group']
service_offer = data_result['service_offer']
updated = data_result['updated_on']
urgency = data_result['urgency']
username = data_result['created_by']
short_desc = data_result['description']
}
TypeError: list indices must be integers or slices, not str
So I've started to write something like the code below, but I'm stuck:
get_data = []
if len(data) > 0:
    for item in range(len(data)):
        get_data.append(data[item])
May I ask for help?
If data is your decoded json response from the question then you can do:
# find group `blabla` in result:
g = next(d for d in data["result"] if d["group"] == "blabla")
# get data from the `blabla` group:
number = g["number"]
group = g["group"]
service_offer = g["service_offer"]
updated = g["updated_on"]
urgency = g["urgency"]
username = g["sys_created_by"]
short_desc = g["description"]
print(number, group, service_offer, updated, urgency, username, short_desc)
Prints:
123456 blabla Integration Platforms 2022-04-23 09:05:53 2 serviceaccount45 ALERT returned 400 but expected 200
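Note that next(...) without a default raises StopIteration when no record matches, and it only returns the first match. If several records can belong to the group, or none at all, something along these lines may be more forgiving. This is only a sketch: the helper name records_for_group and the second sample record are made up, while the field names are taken from the response in the question.

```python
FIELDS = ("number", "group", "service_offer", "updated_on",
          "urgency", "sys_created_by", "description")


def records_for_group(payload, group_name, fields=FIELDS):
    """Return one dict per record of the given group, keeping only `fields`."""
    rows = []
    for record in payload.get("result", []):
        if record.get("group") != group_name:
            continue
        # Missing fields become empty strings instead of raising KeyError.
        rows.append({field: record.get(field, "") for field in fields})
    return rows


# Shortened stand-in for the decoded response; the second record is invented.
sample = {"result": [
    {"number": "123456", "group": "blabla",
     "service_offer": "Integration Platforms",
     "updated_on": "2022-04-23 09:05:53", "urgency": "2",
     "sys_created_by": "serviceaccount45",
     "description": "ALERT returned 400 but expected 200"},
    {"number": "999999", "group": "other", "urgency": "3"},
]}

for row in records_for_group(sample, "blabla"):
    print(row["number"], row["group"], row["urgency"])
```

Since each row is a plain dict, the resulting list can be passed to pd.DataFrame(rows) if you want a table.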
I have been desperately seeking a solution to crawl all comments and corresponding replies for my research. I am having a very hard time creating a data frame that includes the comment data in the correct, corresponding order.
I am gonna share my code here so you professionals can take a look and give me some insights.
def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()
    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    data = {'Reply ID': [rauthor], 'Reply Time': [rtime], 'Reply Comments': [rtext], 'Reply Likes': [rlike]}
                    print(rauthor)
                    print(rtext)
            data = {'Comment':[comment],'Date':[comment2],'ID':[comment3], 'Likes':[comment4]}
            result = pd.DataFrame(data)
            result.to_csv('youtube.csv', mode='a',header=False)
            print(comment)
            print(comment2)
            print(comment3)
            print(comment4)
            print('==============================')
            comments.append(comment)
        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break
    return comments
When I do this, my crawler collects comments but doesn't collect some of the replies that are under certain comments.
How can I make it collect comments and their corresponding replies and put them in a single data frame?
Update
So, somehow I managed to pull the information I wanted at the output section of Jupyter Notebook. All I have to do now is to append the result at the data frame.
Here is my updated code:
def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()
    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    print(rtext)
                    print(rtime)
                    print(rauthor)
                    print('Likes: ', rlike)
            print(comment)
            print(comment2)
            print(comment3)
            print("Likes: ", comment4)
            print('==============================')
            comments.append(comment)
        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break
    return comments
The result is:
As you can see, the comments grouped under ======== lines are the comment and corresponding replies underneath.
What would be a good way to append the result into the data frame?
According to the official doc, the property replies.comments[] of CommentThreads resource has the following specification:
replies.comments[] (list)
A list of one or more replies to the top-level comment. Each item in the list is a comment resource.
The list contains a limited number of replies, and unless the number of items in the list equals the value of the snippet.totalReplyCount property, the list of replies is only a subset of the total number of replies available for the top-level comment. To retrieve all of the replies for the top-level comment, you need to call the Comments.list method and use the parentId request parameter to identify the comment for which you want to retrieve replies.
Consequently, if wanting to obtain all reply entries associated to a given top-level comment, you will have to use the Comments.list API endpoint queried appropriately.
I recommend you to read my answer to a very much related question; there are three sections:
Top-Level Comments and Associated Replies,
The property nextPageToken and the parameter pageToken, and
API Limitations Imposed by Design.
From the get go, you'll have to acknowledge that the API (as currently implemented) does not allow to obtain all top-level comments associated to a given video when the number of those comments exceeds a certain (unspecified) upper bound.
For a Python implementation, I would suggest structuring the code as follows:
def get_video_comments(service, video_id):
    request = service.commentThreads().list(
        videoId = video_id,
        part = 'id,snippet,replies',
        maxResults = 100
    )
    comments = []
    while request:
        response = request.execute()
        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
                replies['comments'] = get_comment_replies(
                    service, comment['id'])
            # 'comment' is a 'CommentThreads Resource' that has its
            # 'replies.comments' an array of 'Comments Resource'
            # Do fill in the 'comments' data structure
            # to be provided by this function:
            ...
        request = service.commentThreads().list_next(
            request, response)
    return comments
def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 100
    )
    replies = []
    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)
    return replies
Note that the ellipsis dots above -- ... -- would have to be replaced with actual code that fills in the array of structures to be returned by get_video_comments to its caller.
The simplest way (useful for quick testing) would be to have ... replaced with comments.append(comment) and then the caller of get_video_comments to simply pretty print (using json.dump) the object obtained from that function.
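To get from that list of comment threads to something a data frame can hold, one possible approach is to flatten every thread into rows, with each top-level comment immediately followed by its replies. The sketch below is only an illustration: the names flatten_threads and _row, the column names, and the shortened sample threads (including their ids) are invented; the dictionary keys follow the CommentThreads and Comments resources used above.

```python
def _row(comment, parent_id):
    """Build one flat row from a comment resource."""
    snippet = comment['snippet']
    return {
        'comment_id': comment['id'],
        'parent_id': parent_id,  # None for top-level comments
        'author': snippet['authorDisplayName'],
        'text': snippet['textDisplay'],
        'published_at': snippet['publishedAt'],
        'likes': snippet['likeCount'],
    }


def flatten_threads(threads):
    """Flatten comment threads so each comment is followed by its replies."""
    rows = []
    for thread in threads:
        top = thread['snippet']['topLevelComment']
        rows.append(_row(top, parent_id=None))
        # 'replies' is absent when a thread has no replies.
        for reply in thread.get('replies', {}).get('comments', []):
            rows.append(_row(reply, parent_id=top['id']))
    return rows


def _comment(cid, author, text, when, likes):
    # Helper that builds a shortened, made-up comment resource for the demo.
    return {'id': cid, 'snippet': {'authorDisplayName': author,
                                   'textDisplay': text,
                                   'publishedAt': when, 'likeCount': likes}}


sample_threads = [
    {'id': 'A',
     'snippet': {'topLevelComment': _comment('A', 'Ann', 'First', '2019-05-15T05:10:49Z', 9),
                 'totalReplyCount': 1},
     'replies': {'comments': [_comment('A.1', 'Bob', 'A reply', '2020-09-15T04:06:50Z', 2)]}},
    {'id': 'B',
     'snippet': {'topLevelComment': _comment('B', 'Cy', 'Second', '2019-05-22T12:38:34Z', 8),
                 'totalReplyCount': 0}},
]

for row in flatten_threads(sample_threads):
    print(row['parent_id'], row['author'], row['text'])
```

Because the rows are plain dicts with identical keys, pd.DataFrame(flatten_threads(comments)) should give one table in which the parent_id column ties every reply to its top-level comment.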
Based on stvar's answer and the original publication here, I built this code:
import os
import pickle
import csv
import json
import google.oauth2.credentials
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
CLIENT_SECRETS_FILE = "client_secret.json" # for more information to create your credentials json please visit https://python.gotrained.com/youtube-api-extracting-comments/
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'
def get_authenticated_service():
    credentials = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            credentials = pickle.load(token)
    # Check if the credentials are invalid or do not exist
    if not credentials or not credentials.valid:
        # Check if the credentials have expired
        if credentials and credentials.expired and credentials.refresh_token:
            credentials.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                CLIENT_SECRETS_FILE, SCOPES)
            credentials = flow.run_console()
        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(credentials, token)
    return build(API_SERVICE_NAME, API_VERSION, credentials = credentials)
def get_video_comments(service, **kwargs):
    request = service.commentThreads().list(**kwargs)
    comments = []
    while request:
        response = request.execute()
        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
                replies['comments'] = get_comment_replies(
                    service, comment['id'])
            # 'comment' is a 'CommentThreads Resource' that has its
            # 'replies.comments' an array of 'Comments Resource'
            # Do fill in the 'comments' data structure
            # to be provided by this function:
            comments.append(comment)
        request = service.commentThreads().list_next(
            request, response)
    return comments
def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 100  # the API accepts values from 1 to 100 per page
    )
    replies = []
    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)
    return replies
if __name__ == '__main__':
    # When running locally, disable OAuthlib's HTTPs verification. When
    # running in production *do not* leave this option enabled.
    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
    service = get_authenticated_service()
    videoId = input('Enter Video id : ') # video id here (the video id of https://www.youtube.com/watch?v=vedLpKXzZqE -> is vedLpKXzZqE)
    # maxResults accepts values from 1 to 100 per page; paging fetches the rest
    comments = get_video_comments(service, videoId=videoId, part='id,snippet,replies', maxResults = 100)
    with open('youtube_comments', 'w', encoding='UTF8') as f:
        writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for row in comments:
            # write each comment thread (a dict) as a single-field row
            writer.writerow([row])
It writes a file called youtube_comments with this format:
"{'kind': 'youtube#commentThread', 'etag': 'gvhv4hkH0H2OqQAHQKxzfA-K_tA', 'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'topLevelComment': {'kind': 'youtube#comment', 'etag': 'qpuKZcuD4FKf6BHgRlMunersEeU', 'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'This is a comment', 'textOriginal': 'This is a comment', 'authorDisplayName': 'Gabriell Magana', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLRGBvo2ZncDP1xGjlX6anfUufNYi9b3w9kYZFDl=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCKAa4FYftXsN7VKaPSlCivg', 'authorChannelId': {'value': 'UCKAa4FYftXsN7VKaPSlCivg'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 8, 'publishedAt': '2019-05-22T12:38:34Z', 'updatedAt': '2019-05-22T12:38:34Z'}}, 'canReply': True, 'totalReplyCount': 0, 'isPublic': True}}"
"{'kind': 'youtube#commentThread', 'etag': 'DsgDziMk7mB7xN4OoX7cmqlbDYE', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'topLevelComment': {'kind': 'youtube#comment', 'etag': 'NYjvYM9W_umBafAfQkdg1P9apgg', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'This is another comment', 'textOriginal': 'This is another comment', 'authorDisplayName': 'Mary Montes', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTg1b1yw8BX8Af0PoTR_t5OOwP9Cfl9_qL-o1iikw=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UC_GP_8HxDPsqJjJ3Fju_UeA', 'authorChannelId': {'value': 'UC_GP_8HxDPsqJjJ3Fju_UeA'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 9, 'publishedAt': '2019-05-15T05:10:49Z', 'updatedAt': '2019-05-15T05:10:49Z'}}, 'canReply': True, 'totalReplyCount': 3, 'isPublic': True}, 'replies': {'comments': [{'kind': 'youtube#comment', 'etag': 'Tu41ENCZYNJ2KBpYeYz4qgre0H8', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF79DbfJ9zMKxM', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'this is first reply', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'JULIO EMPRESARIO', 'authorProfileImageUrl': 'https://yt3.ggpht.com/eYP4MBcZ4bON_pHtdbtVsyWnsKbpNKye2wTPhgkffkMYk3ZbN0FL6Aa1o22YlFjn2RVUAkSQYw=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCrpB9oZZZfmBv1aQsxrk66w', 'authorChannelId': {'value': 'UCrpB9oZZZfmBv1aQsxrk66w'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2020-09-15T04:06:50Z', 'updatedAt': '2020-09-15T04:06:50Z'}}, {'kind': 'youtube#comment', 'etag': 'OrpbnJddwzlzwGArCgtuuBsYr94', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF795E1w8RV1DJ', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'the second replay', 'textOriginal': 'the second replay', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'Anatolio27 Diaz', 'authorProfileImageUrl': 
'https://yt3.ggpht.com/ytc/AKedOLR1hOySIxEkvRCySExHjo3T6zGBNkvuKpPkqA=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UC04N8BM5aUwDJf-PNFxKI-g', 'authorChannelId': {'value': 'UC04N8BM5aUwDJf-PNFxKI-g'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2020-02-19T18:21:06Z', 'updatedAt': '2020-02-19T18:21:06Z'}}, {'kind': 'youtube#comment', 'etag': 'sPmIwerh3DTZshLiDVwOXn_fJx0', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF78wwH6Aabh4y', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'A third reply', 'textOriginal': 'A third reply', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'Voy detrás de mi pasión', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTgzZ3ZFvkmmAlMzA77ApM-2uGFfvOBnzxegYEX=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCvv6QMokO7KcJCDpK6qZg3Q', 'authorChannelId': {'value': 'UCvv6QMokO7KcJCDpK6qZg3Q'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2019-07-03T18:45:34Z', 'updatedAt': '2019-07-03T18:45:34Z'}}]}}"
A second step is then needed to extract the required information. For this I use a set of shell tools such as cut, awk and sed:
cut -d ":" -f 10- youtube_comments | sed -e "s/', '/\n/g" -e "s/'//g" | awk '/replies/{print "------------------------****---------::: Replies: "$6" :::---------******--------------------------------"}!/replies/{print}' |sed '/^textOriginal:/,/^authorDisplayName:/{/^authorDisplayName/!d}' |sed '/^authorProfileImageUrl:\|^authorChannelUrl:\|^authorChannelId:\|^etag:\|^updatedAt:\|^parentId:\|^id:/d' |sed 's/<[^>]*>//g' | sed 's/{textDisplay/{\ntextDisplay/' |sed '/^snippet:/d' | awk -F":" '(NF==1){print "========================================COMMENT==========================================="}(NF>1){a=0; print $0}' | sed 's/textDisplay: //g' | sed 's/authorDisplayName/User/g' | sed 's/T[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}Z//g' | sed 's/likeCount: /Likes:/g' | sed 's/publishedAt: //g' > output_file
The final result is a file called output_file with this format:
========================================COMMENT===========================================
This is a comment
User: Robert Everest
Likes:8, 2019-05-22
========================================COMMENT===========================================
This is another comment
User: Anna Davis
Likes:9, 2019-05-15
------------------------****---------::: Replies: 3, :::---------******--------------------------------
this is first reply
User: John Doe
Likes:2, 2020-09-15
the second replay
User: Caraqueno
Likes:2, 2020-02-19
A third reply
User: Rebeca
Likes:2, 2019-07-03
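If you would rather stay in Python than post-process the file with shell tools, a similar layout can probably be produced straight from the list returned by get_video_comments. The following is only a sketch under that assumption: the function names format_comment and format_thread and the separator strings are made up, the sample thread is shortened from the file shown above, and unlike the sed pipeline it does not strip HTML tags from the text.

```python
COMMENT_SEPARATOR = '=' * 40 + 'COMMENT' + '=' * 40


def format_comment(comment):
    """Render one comment resource as three lines of text."""
    snippet = comment['snippet']
    date = snippet['publishedAt'][:10]  # keep YYYY-MM-DD, drop the time part
    return (f"{snippet['textDisplay']}\n"
            f"User: {snippet['authorDisplayName']}\n"
            f"Likes:{snippet['likeCount']}, {date}")


def format_thread(thread):
    """Render a comment thread: top-level comment first, then its replies."""
    lines = [COMMENT_SEPARATOR,
             format_comment(thread['snippet']['topLevelComment'])]
    replies = thread.get('replies', {}).get('comments', [])
    if replies:
        lines.append(f"----- Replies: {len(replies)} -----")
        lines.extend(format_comment(reply) for reply in replies)
    return '\n'.join(lines)


# Shortened version of one thread from the youtube_comments file.
sample_thread = {
    'snippet': {
        'topLevelComment': {'snippet': {
            'textDisplay': 'This is another comment',
            'authorDisplayName': 'Mary Montes',
            'likeCount': 9,
            'publishedAt': '2019-05-15T05:10:49Z'}},
        'totalReplyCount': 1},
    'replies': {'comments': [{'snippet': {
        'textDisplay': 'this is first reply',
        'authorDisplayName': 'JULIO EMPRESARIO',
        'likeCount': 2,
        'publishedAt': '2020-09-15T04:06:50Z'}}]},
}

print(format_thread(sample_thread))
```

Writing '\n'.join(format_thread(t) for t in comments) to a file would then replace both the CSV step and the shell pipeline.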
The Python script needs the file token.pickle to work. It is generated the first time the script runs; when it expires, it has to be deleted and generated again.
I had an issue similar to the OP's and managed to solve it, but someone in the community closed my question after I solved it, so I can't post there. I'm posting it here for the record.
The YouTube API doesn't allow users to grab nested replies to comments. What it does allow is getting all the comments and the replies to those comments; the deeper levels of the chain Video --> Comments --> Comment Replies --> Reply To Reply and so on are not available. Knowing this limitation, we can write code to get all the top-level comments, and then go into those comments to get the first-level replies.
Modules
import os
import googleapiclient.discovery #required for using googleapi
import pandas as pd #require for data munging. We use pd.json_normalize to create the tables
import numpy as np #just good to have
import json # the requests are returned as json objects.
from datetime import datetime #good to have for date modification
Get All Comments Function
For a given vidId, this function will get the first 100 comments and place them into a df. It then uses a while loop to check whether the API response contains nextPageToken. While it does, the loop keeps running until either all the comments are pulled or you run out of credits, whichever happens first.
def vidcomments(vidId):
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"
    api_service_name = "youtube"
    api_version = "v3"
    DEVELOPER_KEY = "yourapikey" #<--- insert API key here
    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey = DEVELOPER_KEY)
    request = youtube.commentThreads().list(
        part="snippet, replies",
        order="time",
        maxResults=100,
        textFormat="plainText",
        videoId=vidId
    )
    response = request.execute()
    full = pd.json_normalize(response, record_path=['items'])
    while response:
        if 'nextPageToken' in response:
            response = youtube.commentThreads().list(
                part="snippet",
                maxResults=100,
                textFormat='plainText',
                order='time',
                videoId=vidId,
                pageToken=response['nextPageToken']
            ).execute()
            df2 = pd.json_normalize(response, record_path=['items'])
            # DataFrame.append was removed in pandas 2.0; use pd.concat instead
            full = pd.concat([full, df2], sort=False)
        else:
            break
    return full
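The nextPageToken loop itself does not depend on pandas or on the YouTube client, so it can be sketched and tested on its own. In the snippet below, fetch_all, fake_fetch and FAKE_PAGES are hypothetical stand-ins: fake_fetch plays the role of a call such as youtube.commentThreads().list(..., pageToken=token).execute(), and the page contents are invented.

```python
def fetch_all(fetch_page):
    """Collect the items of every page until no nextPageToken is returned."""
    items = []
    token = None  # the first request is made without a page token
    while True:
        response = fetch_page(token)
        items.extend(response.get('items', []))
        token = response.get('nextPageToken')
        if token is None:
            break
    return items


# Invented three-page result set keyed by the page token that requests it.
FAKE_PAGES = {
    None: {'items': ['c1', 'c2'], 'nextPageToken': 'p2'},
    'p2': {'items': ['c3'], 'nextPageToken': 'p3'},
    'p3': {'items': ['c4']},
}


def fake_fetch(page_token):
    # Stand-in for a real API call; it just looks the page up.
    return FAKE_PAGES[page_token]


print(fetch_all(fake_fetch))
```

Collecting all items first and normalizing them once at the end also avoids growing a DataFrame inside the loop.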
Get All Replies To Comments Function
For a particular parentId, get all the first-level replies. Like the vidcomments() function noted above, it will run until all replies to all comments are pulled or you run out of credits, whichever happens first.
def repliesto(parentId):
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"
    api_service_name = "youtube"
    api_version = "v3"
    DEVELOPER_KEY = DevKey #your dev key
    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey = DEVELOPER_KEY)
    request = youtube.comments().list(
        part="snippet",
        maxResults=100,
        parentId=parentId,
        textFormat="plainText"
    )
    response = request.execute()
    replies = pd.json_normalize(response, record_path=['items'])
    while response:
        if 'nextPageToken' in response:
            response = youtube.comments().list(
                part="snippet",
                maxResults=100,
                parentId=parentId,
                textFormat="plainText",
                pageToken=response['nextPageToken']
            ).execute()
            df2 = pd.json_normalize(response, record_path=['items'])
            replies = pd.concat([replies, df2], sort=False)
        else:
            break
    return replies
Putting It Together
First, run the vidcomments function to get all the comments information. Then use the code below to get all the reply information using a for loop to pull in each topLevelComment.id into a list, then use the list and another for loop to build the replies dataframe. This will create two separate Dataframes, one for Comments and another for Replies. After creating both of these Dataframes you can then join them in a way that makes sense for your purpose, either concat/union or a join/merge.
replyto = []
for reply in full[(full['snippet.totalReplyCount']>0)]['snippet.topLevelComment.id']:
    replyto.append(reply)
# create an empty DF to store all the replies
# use a for loop to place each item in our replyto list into the function defined above
replies = pd.DataFrame()
for reply in replyto:
    df = repliesto(reply)
    replies = pd.concat([replies, df], ignore_index=True)
I have been desperately seeking a solution to crawl all comments and corresponding replies for my research. Am having a very hard time creating a data frame that includes comment data in correct and corresponding orders.
I am gonna share my code here so you professionals can take a look and give me some insights.
def get_video_comments(service, **kwargs):
comments = []
results = service.commentThreads().list(**kwargs).execute()
while results:
for item in results['items']:
comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
if 'replies' in item.keys():
for reply in item['replies']['comments']:
rauthor = reply['snippet']['authorDisplayName']
rtext = reply['snippet']['textDisplay']
rtime = reply['snippet']['publishedAt']
rlike = reply['snippet']['likeCount']
data = {'Reply ID': [rauthor], 'Reply Time': [rtime], 'Reply Comments': [rtext], 'Reply Likes': [rlike]}
print(rauthor)
print(rtext)
data = {'Comment':[comment],'Date':[comment2],'ID':[comment3], 'Likes':[comment4]}
result = pd.DataFrame(data)
result.to_csv('youtube.csv', mode='a',header=False)
print(comment)
print(comment2)
print(comment3)
print(comment4)
print('==============================')
comments.append(comment)
# Check if another page exists
if 'nextPageToken' in results:
kwargs['pageToken'] = results['nextPageToken']
results = service.commentThreads().list(**kwargs).execute()
else:
break
return comments
When I do this, my crawler collects comments but doesn't collect some of the replies that are under certain comments.
How can I make it collect comments and their corresponding replies and put them in a single data frame?
Update
So, somehow I managed to pull the information I wanted at the output section of Jupyter Notebook. All I have to do now is to append the result at the data frame.
Here is my updated code:
def get_video_comments(service, **kwargs):
comments = []
results = service.commentThreads().list(**kwargs).execute()
while results:
for item in results['items']:
comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
if 'replies' in item.keys():
for reply in item['replies']['comments']:
rauthor = reply['snippet']['authorDisplayName']
rtext = reply['snippet']['textDisplay']
rtime = reply['snippet']['publishedAt']
rlike = reply['snippet']['likeCount']
print(rtext)
print(rtime)
print(rauthor)
print('Likes: ', rlike)
print(comment)
print(comment2)
print(comment3)
print("Likes: ", comment4)
print('==============================')
comments.append(comment)
# Check if another page exists
if 'nextPageToken' in results:
kwargs['pageToken'] = results['nextPageToken']
results = service.commentThreads().list(**kwargs).execute()
else:
break
return comments
The result is:
As you can see, the comments grouped under ======== lines are the comment and corresponding replies underneath.
What would be a good way to append the result into the data frame?
According to the official doc, the property replies.comments[] of CommentThreads resource has the following specification:
replies.comments[] (list)
A list of one or more replies to the top-level comment. Each item in the list is a comment resource.
The list contains a limited number of replies, and unless the number of items in the list equals the value of the snippet.totalReplyCount property, the list of replies is only a subset of the total number of replies available for the top-level comment. To retrieve all of the replies for the top-level comment, you need to call the Comments.list method and use the parentId request parameter to identify the comment for which you want to retrieve replies.
Consequently, if wanting to obtain all reply entries associated to a given top-level comment, you will have to use the Comments.list API endpoint queried appropriately.
I recommend you to read my answer to a very much related question; there are three sections:
Top-Level Comments and Associated Replies,
The property nextPageToken and the parameter pageToken, and
API Limitations Imposed by Design.
From the get go, you'll have to acknowledge that the API (as currently implemented) does not allow to obtain all top-level comments associated to a given video when the number of those comments exceeds a certain (unspecified) upper bound.
For what concerns a Python implementation, I would suggest that you do structure the code as follows:
def get_video_comments(service, video_id):
request = service.commentThreads().list(
videoId = video_id,
part = 'id,snippet,replies',
maxResults = 100
)
comments = []
while request:
response = request.execute()
for comment in response['items']:
reply_count = comment['snippet'] \
['totalReplyCount']
replies = comment.get('replies')
if replies is not None and \
reply_count != len(replies['comments']):
replies['comments'] = get_comment_replies(
service, comment['id'])
# 'comment' is a 'CommentThreads Resource' that has it's
# 'replies.comments' an array of 'Comments Resource'
# Do fill in the 'comments' data structure
# to be provided by this function:
...
request = service.commentThreads().list_next(
request, response)
return comments
def get_comment_replies(service, comment_id):
request = service.comments().list(
parentId = comment_id,
part = 'id,snippet',
maxResults = 100
)
replies = []
while request:
response = request.execute()
replies.extend(response['items'])
request = service.comments().list_next(
request, response)
return replies
Note that the ellipsis dots above -- ... -- would have to be replaced with actual code that fills in the array of structures to be returned by get_video_comments to its caller.
The simplest way (useful for quick testing) would be to have ... replaced with comments.append(comment) and then the caller of get_video_comments to simply pretty print (using json.dump) the object obtained from that function.
Based on stvar' answer and the original publication here I built this code:
import os
import pickle
import csv
import json
import google.oauth2.credentials
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
CLIENT_SECRETS_FILE = "client_secret.json" # for more information to create your credentials json please visit https://python.gotrained.com/youtube-api-extracting-comments/
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'
def get_authenticated_service():
credentials = None
if os.path.exists('token.pickle'):
with open('token.pickle', 'rb') as token:
credentials = pickle.load(token)
# Check if the credentials are invalid or do not exist
if not credentials or not credentials.valid:
# Check if the credentials have expired
if credentials and credentials.expired and credentials.refresh_token:
credentials.refresh(Request())
else:
flow = InstalledAppFlow.from_client_secrets_file(
CLIENT_SECRETS_FILE, SCOPES)
credentials = flow.run_console()
# Save the credentials for the next run
with open('token.pickle', 'wb') as token:
pickle.dump(credentials, token)
return build(API_SERVICE_NAME, API_VERSION, credentials = credentials)
def get_video_comments(service, **kwargs):
    request = service.commentThreads().list(**kwargs)
    comments = []
    while request:
        response = request.execute()
        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
                replies['comments'] = get_comment_replies(
                    service, comment['id'])
            # 'comment' is a 'CommentThreads Resource' that has its
            # 'replies.comments' an array of 'Comments Resource'
            # Do fill in the 'comments' data structure
            # to be provided by this function:
            comments.append(comment)
        request = service.commentThreads().list_next(
            request, response)
    return comments
def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 100  # the API accepts at most 100 results per page
    )
    replies = []
    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)
    return replies
if __name__ == '__main__':
    # When running locally, disable OAuthlib's HTTPs verification. When
    # running in production *do not* leave this option enabled.
    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
    service = get_authenticated_service()
    videoId = input('Enter Video id : ') # video id here (the video id of https://www.youtube.com/watch?v=vedLpKXzZqE -> is vedLpKXzZqE)
    # maxResults is the page size (at most 100); paging retrieves the rest
    comments = get_video_comments(service, videoId=videoId, part='id,snippet,replies', maxResults = 100)
    with open('youtube_comments', 'w', encoding='UTF8') as f:
        writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for row in comments:
            # each thread is a dict; wrap it in a list so it is written as a single cell
            writer.writerow([row])
It writes a file called youtube_comments with this format:
"{'kind': 'youtube#commentThread', 'etag': 'gvhv4hkH0H2OqQAHQKxzfA-K_tA', 'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'topLevelComment': {'kind': 'youtube#comment', 'etag': 'qpuKZcuD4FKf6BHgRlMunersEeU', 'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'This is a comment', 'textOriginal': 'This is a comment', 'authorDisplayName': 'Gabriell Magana', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLRGBvo2ZncDP1xGjlX6anfUufNYi9b3w9kYZFDl=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCKAa4FYftXsN7VKaPSlCivg', 'authorChannelId': {'value': 'UCKAa4FYftXsN7VKaPSlCivg'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 8, 'publishedAt': '2019-05-22T12:38:34Z', 'updatedAt': '2019-05-22T12:38:34Z'}}, 'canReply': True, 'totalReplyCount': 0, 'isPublic': True}}"
"{'kind': 'youtube#commentThread', 'etag': 'DsgDziMk7mB7xN4OoX7cmqlbDYE', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'topLevelComment': {'kind': 'youtube#comment', 'etag': 'NYjvYM9W_umBafAfQkdg1P9apgg', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'This is another comment', 'textOriginal': 'This is another comment', 'authorDisplayName': 'Mary Montes', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTg1b1yw8BX8Af0PoTR_t5OOwP9Cfl9_qL-o1iikw=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UC_GP_8HxDPsqJjJ3Fju_UeA', 'authorChannelId': {'value': 'UC_GP_8HxDPsqJjJ3Fju_UeA'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 9, 'publishedAt': '2019-05-15T05:10:49Z', 'updatedAt': '2019-05-15T05:10:49Z'}}, 'canReply': True, 'totalReplyCount': 3, 'isPublic': True}, 'replies': {'comments': [{'kind': 'youtube#comment', 'etag': 'Tu41ENCZYNJ2KBpYeYz4qgre0H8', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF79DbfJ9zMKxM', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'this is first reply', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'JULIO EMPRESARIO', 'authorProfileImageUrl': 'https://yt3.ggpht.com/eYP4MBcZ4bON_pHtdbtVsyWnsKbpNKye2wTPhgkffkMYk3ZbN0FL6Aa1o22YlFjn2RVUAkSQYw=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCrpB9oZZZfmBv1aQsxrk66w', 'authorChannelId': {'value': 'UCrpB9oZZZfmBv1aQsxrk66w'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2020-09-15T04:06:50Z', 'updatedAt': '2020-09-15T04:06:50Z'}}, {'kind': 'youtube#comment', 'etag': 'OrpbnJddwzlzwGArCgtuuBsYr94', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF795E1w8RV1DJ', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'the second replay', 'textOriginal': 'the second replay', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'Anatolio27 Diaz', 'authorProfileImageUrl': 
'https://yt3.ggpht.com/ytc/AKedOLR1hOySIxEkvRCySExHjo3T6zGBNkvuKpPkqA=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UC04N8BM5aUwDJf-PNFxKI-g', 'authorChannelId': {'value': 'UC04N8BM5aUwDJf-PNFxKI-g'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2020-02-19T18:21:06Z', 'updatedAt': '2020-02-19T18:21:06Z'}}, {'kind': 'youtube#comment', 'etag': 'sPmIwerh3DTZshLiDVwOXn_fJx0', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF78wwH6Aabh4y', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'A third reply', 'textOriginal': 'A third reply', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'Voy detrás de mi pasión', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTgzZ3ZFvkmmAlMzA77ApM-2uGFfvOBnzxegYEX=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCvv6QMokO7KcJCDpK6qZg3Q', 'authorChannelId': {'value': 'UCvv6QMokO7KcJCDpK6qZg3Q'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2019-07-03T18:45:34Z', 'updatedAt': '2019-07-03T18:45:34Z'}}]}}"
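Each row of that file is a single CSV cell holding the repr of a Python dict, so it can be read back into dicts rather than parsed as text. The sketch below assumes that one-cell-per-row layout; read_threads is a made-up helper name, and the round trip uses an in-memory buffer with an invented thread instead of the real file.

```python
import ast
import csv
import io

def read_threads(lines):
    """Yield one dict per CSV row written by the script above."""
    for row in csv.reader(lines):
        if row:  # skip blank lines
            # Each row has a single cell holding the repr() of a dict
            yield ast.literal_eval(row[0])

# Round trip on a made-up thread, using an in-memory file
thread = {"id": "abc", "snippet": {"canReply": True, "totalReplyCount": 0}}
buffer = io.StringIO()
csv.writer(buffer).writerow([thread])
buffer.seek(0)
print(list(read_threads(buffer)))
```

With the real file, the buffer would be replaced by open('youtube_comments', encoding='UTF8', newline='').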
A second step is then needed to extract the required information. For this I used a set of shell tools such as cut, awk and sed:
cut -d ":" -f 10- youtube_comments | sed -e "s/', '/\n/g" -e "s/'//g" | awk '/replies/{print "------------------------****---------::: Replies: "$6" :::---------******--------------------------------"}!/replies/{print}' |sed '/^textOriginal:/,/^authorDisplayName:/{/^authorDisplayName/!d}' |sed '/^authorProfileImageUrl:\|^authorChannelUrl:\|^authorChannelId:\|^etag:\|^updatedAt:\|^parentId:\|^id:/d' |sed 's/<[^>]*>//g' | sed 's/{textDisplay/{\ntextDisplay/' |sed '/^snippet:/d' | awk -F":" '(NF==1){print "========================================COMMENT==========================================="}(NF>1){a=0; print $0}' | sed 's/textDisplay: //g' | sed 's/authorDisplayName/User/g' | sed 's/T[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}Z//g' | sed 's/likeCount: /Likes:/g' | sed 's/publishedAt: //g' > output_file
The final result is a file called output_file with this format:
========================================COMMENT===========================================
This is a comment
User: Robert Everest
Likes:8, 2019-05-22
========================================COMMENT===========================================
This is another comment
User: Anna Davis
Likes:9, 2019-05-15
------------------------****---------::: Replies: 3, :::---------******--------------------------------
this is first reply
User: John Doe
Likes:2, 2020-09-15
the second replay
User: Caraqueno
Likes:2, 2020-02-19
A third reply
User: Rebeca
Likes:2, 2019-07-03
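A similar report could probably be produced in Python directly from the comment thread dicts, without the shell pipeline. The sketch below is only an approximation under stated assumptions: format_comment, format_thread and the separator strings are invented here, the sample thread is made up with the same keys as the data above, and HTML tags in textDisplay are not stripped as the sed step does.

```python
COMMENT_HEADER = "=" * 20 + " COMMENT " + "=" * 20

def format_comment(snippet):
    """Render one comment snippet as text, author, likes and date."""
    date = snippet["publishedAt"].split("T")[0]  # keep only YYYY-MM-DD
    return "\n".join([
        snippet["textDisplay"],
        "User: " + snippet["authorDisplayName"],
        "Likes:{}, {}".format(snippet["likeCount"], date),
    ])

def format_thread(thread):
    """Render a commentThread dict, followed by its replies if any."""
    top = thread["snippet"]["topLevelComment"]["snippet"]
    parts = [COMMENT_HEADER, format_comment(top)]
    replies = thread.get("replies", {}).get("comments", [])
    if replies:
        parts.append("----- Replies: {} -----".format(len(replies)))
        parts.extend(format_comment(r["snippet"]) for r in replies)
    return "\n".join(parts)

# Made-up thread using the same keys as the API output shown earlier
thread = {
    "snippet": {
        "totalReplyCount": 1,
        "topLevelComment": {"snippet": {
            "textDisplay": "This is another comment",
            "authorDisplayName": "Anna Davis",
            "likeCount": 9,
            "publishedAt": "2019-05-15T05:10:49Z",
        }},
    },
    "replies": {"comments": [{"snippet": {
        "textDisplay": "this is first reply",
        "authorDisplayName": "John Doe",
        "likeCount": 2,
        "publishedAt": "2020-09-15T04:06:50Z",
    }}]},
}
print(format_thread(thread))
```

Working on the dicts avoids depending on the exact key order of the printed repr, which the cut and sed steps rely on.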
The Python script needs the file token.pickle to work. It is generated the first time the script runs; when the stored credentials expire, the file has to be deleted and generated again.
I had an issue similar to the OP's and managed to solve it, but someone in the community closed my question after I solved it, so I can't post the answer there. I'm posting it here for the record.
The YouTube API doesn't allow users to grab nested replies to comments as a nested structure. What it does allow is fetching all the top-level comments and the replies to each of them; the deeper levels of the chain Video --> Comments --> Comment Replies --> Reply To Reply et al. are not exposed separately. Knowing this limitation, we can write code to get all the top-level comments and then query each of those comments for its first-level replies.
Modules
import os
import googleapiclient.discovery #required for using googleapi
import pandas as pd #require for data munging. We use pd.json_normalize to create the tables
import numpy as np #just good to have
import json # the requests are returned as json objects.
from datetime import datetime #good to have for date modification
Get All Comments Function
For a given vidId, this function gets the first 100 comments and places them into a DataFrame. It then uses a while loop to check whether the API response contains nextPageToken. While it does, the loop keeps requesting the next page until either all the comments are pulled or you run out of credits, whichever happens first.
def vidcomments(vidId):
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"
    api_service_name = "youtube"
    api_version = "v3"
    DEVELOPER_KEY = "yourapikey" #<--- insert API key here
    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey = DEVELOPER_KEY)
    request = youtube.commentThreads().list(
        part="snippet, replies",
        order="time",
        maxResults=100,
        textFormat="plainText",
        videoId=vidId
    )
    response = request.execute()
    full = pd.json_normalize(response, record_path=['items'])
    while response:
        if 'nextPageToken' in response:
            response = youtube.commentThreads().list(
                part="snippet",
                maxResults=100,
                textFormat='plainText',
                order='time',
                videoId=vidId,
                pageToken=response['nextPageToken']
            ).execute()
            df2 = pd.json_normalize(response, record_path=['items'])
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            full = pd.concat([full, df2], sort=False)
        else:
            break
    return full
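The nextPageToken loop used here is a general pattern, and it can be sketched independently of the YouTube client. In the sketch below, fetch_all and fetch_page are hypothetical names, and PAGES is a fake three-page API used only to exercise the loop.

```python
def fetch_all(fetch_page):
    """Collect items from every page of a token-paginated API.

    fetch_page is a hypothetical callable that takes a page token
    (None for the first page) and returns a dict with 'items' and,
    when more pages exist, 'nextPageToken'.
    """
    items = []
    token = None
    while True:
        response = fetch_page(token)
        items.extend(response.get("items", []))
        token = response.get("nextPageToken")
        if token is None:
            break  # no further pages
    return items

# Fake three-page API used only to exercise the loop
PAGES = {
    None: {"items": [1, 2], "nextPageToken": "p2"},
    "p2": {"items": [3, 4], "nextPageToken": "p3"},
    "p3": {"items": [5]},
}
print(fetch_all(PAGES.__getitem__))  # [1, 2, 3, 4, 5]
```

With the real client, fetch_page would wrap the commentThreads().list(...).execute() call and add pageToken only when a token is present.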
Get All Replies To Comments Function
For a particular parentId, get all the first-level replies. Like the vidcomments() function noted above, it will run until all replies to all comments are pulled or you run out of credits, whichever happens first.
def repliesto(parentId):
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"
    api_service_name = "youtube"
    api_version = "v3"
    DEVELOPER_KEY = "yourapikey" #<--- insert API key here (same key as in vidcomments)
    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey = DEVELOPER_KEY)
    request = youtube.comments().list(
        part="snippet",
        maxResults=100,
        parentId=parentId,
        textFormat="plainText"
    )
    response = request.execute()
    replies = pd.json_normalize(response, record_path=['items'])
    while response:
        if 'nextPageToken' in response:
            response = youtube.comments().list(
                part="snippet",
                maxResults=100,
                parentId=parentId,
                textFormat="plainText",
                pageToken=response['nextPageToken']
            ).execute()
            df2 = pd.json_normalize(response, record_path=['items'])
            replies = pd.concat([replies, df2], sort=False)
        else:
            break
    return replies
Putting It Together
First, run the vidcomments function to get all the comments information. Then use the code below to get all the reply information using a for loop to pull in each topLevelComment.id into a list, then use the list and another for loop to build the replies dataframe. This will create two separate Dataframes, one for Comments and another for Replies. After creating both of these Dataframes you can then join them in a way that makes sense for your purpose, either concat/union or a join/merge.
# 'full' is the DataFrame returned by vidcomments(vidId)
replyto = []
for reply in full[(full['snippet.totalReplyCount']>0)]['snippet.topLevelComment.id']:
    replyto.append(reply)
# create an empty DF to store all the replies
# use a for loop to place each item in our replyto list into the function defined above
replies = pd.DataFrame()
for reply in replyto:
    df = repliesto(reply)
    replies = pd.concat([replies, df], ignore_index=True)
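One possible way to join the two DataFrames is a left merge of comments onto replies, matching the comment id against the reply's parent id. This is a sketch, not the only sensible join: the two small frames are made up, and the text column names are assumed to be what pd.json_normalize produces for these responses.

```python
import pandas as pd

# Made-up frames; column names assumed to match the json_normalize output
comments = pd.DataFrame({
    "snippet.topLevelComment.id": ["c1", "c2"],
    "snippet.topLevelComment.snippet.textOriginal": ["first", "second"],
})
replies = pd.DataFrame({
    "snippet.parentId": ["c2", "c2"],
    "snippet.textOriginal": ["reply a", "reply b"],
})

# Left join: every comment is kept, replies are attached where they exist
joined = comments.merge(
    replies,
    how="left",
    left_on="snippet.topLevelComment.id",
    right_on="snippet.parentId",
)
print(joined)
```

A comment with several replies appears once per reply, and a comment without replies appears once with empty reply columns.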
I used the Bing API in Python for spell correction. Although I get the correct JSON with suggestions, it doesn't replace the original string. I tried data.replace, but it doesn't work. Is there any other simple method to replace the original string with the suggested words?
import httplib,urllib,base64
headers = {
# Request headers
'Ocp-Apim-Subscription-Key': '7fdf55a1a7e42d0a7890bab142343f8'
}
params = urllib.urlencode({
# Request parameters
'text': 'Lectures were really good. There were lot of people who came their without any Java knowledge and yet you were very suppor.',
'mode': 'proof',
'preContextText': '{string}',
'postContextText': '{string}',
'mkt': '{string}',
})
try:
    conn = httplib.HTTPSConnection('api.cognitive.microsoft.com')
    conn.request("GET", "/bing/v5.0/spellcheck/?%s" % params, "{body}", headers)
    response = conn.getresponse()
    data = response.read()
    print(data)
    conn.close()
except Exception as e:
    # not every exception carries errno/strerror, so print the exception itself
    print("Request failed: {0}".format(e))
output (pretty printed):
{'_type': 'SpellCheck',
'flaggedTokens': [{'offset': 61,
'suggestions': [{'score': 0.854956767552189,
'suggestion': 'there'}],
'token': 'their',
'type': 'UnknownToken'},
{'offset': 116,
'suggestions': [{'score': 0.871971469417366,
'suggestion': 'support'}],
'token': 'suppor',
'type': 'UnknownToken'}]}
You need to do the replacements yourself in your text.
You can iterate the 'flaggedTokens', get the offset of each token, find the best suggestion and replace the token by the suggestion:
import operator
text = 'Lectures were really good. There were lot of people who came their without any Java knowledge and yet you were very suppor.'
data = {'_type': 'SpellCheck',
'flaggedTokens': [{'offset': 61,
'suggestions': [{'score': 0.854956767552189,
'suggestion': 'there'}],
'token': 'their',
'type': 'UnknownToken'},
{'offset': 116,
'suggestions': [{'score': 0.871971469417366,
'suggestion': 'support'}],
'token': 'suppor',
'type': 'UnknownToken'}]}
shifting = 0
correct = text
for ft in data['flaggedTokens']:
    offset = ft['offset']
    suggestions = ft['suggestions']
    token = ft['token']
    # find the best suggestion
    suggestions.sort(key=operator.itemgetter('score'), reverse=True)
    substitute = suggestions[0]['suggestion']
    # replace the token by the suggestion
    before = correct[:offset + shifting]
    after = correct[offset + shifting + len(token):]
    correct = before + substitute + after
    # later offsets move by the difference in length
    shifting += len(substitute) - len(token)
print(correct)
You get: “Lectures were really good. There were lot of people who came there without any Java knowledge and yet you were very support.”
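A variation on the same idea, sketched here as an alternative rather than as part of the answer above, is to apply the replacements from the end of the string backwards. Earlier offsets then stay valid and no shift counter is needed. The function name apply_suggestions is made up for illustration.

```python
def apply_suggestions(text, flagged_tokens):
    """Return text with each flagged token replaced by its top suggestion."""
    # Work from the end of the string so earlier offsets stay valid
    for ft in sorted(flagged_tokens, key=lambda t: t["offset"], reverse=True):
        if not ft["suggestions"]:
            continue  # nothing to substitute
        best = max(ft["suggestions"], key=lambda s: s["score"])
        start = ft["offset"]
        end = start + len(ft["token"])
        text = text[:start] + best["suggestion"] + text[end:]
    return text

text = ('Lectures were really good. There were lot of people who came their '
        'without any Java knowledge and yet you were very suppor.')
flagged = [
    {"offset": 61, "token": "their",
     "suggestions": [{"score": 0.854956767552189, "suggestion": "there"}]},
    {"offset": 116, "token": "suppor",
     "suggestions": [{"score": 0.871971469417366, "suggestion": "support"}]},
]
print(apply_suggestions(text, flagged))
```

Since the body read from the connection is raw JSON text, it would first need json.loads to turn it into the dict whose 'flaggedTokens' list is passed in.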