YouTube Data API to crawl all comments and replies - python

I have been desperately seeking a solution to crawl all comments and corresponding replies for my research. I am having a very hard time creating a data frame that includes the comment data in the correct, corresponding order.
I am going to share my code here so you professionals can take a look and give me some insights.
def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    data = {'Reply ID': [rauthor], 'Reply Time': [rtime], 'Reply Comments': [rtext], 'Reply Likes': [rlike]}
                    print(rauthor)
                    print(rtext)
            data = {'Comment': [comment], 'Date': [comment2], 'ID': [comment3], 'Likes': [comment4]}
            result = pd.DataFrame(data)
            result.to_csv('youtube.csv', mode='a', header=False)
            print(comment)
            print(comment2)
            print(comment3)
            print(comment4)
            print('==============================')
            comments.append(comment)

        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments
When I do this, my crawler collects comments but doesn't collect some of the replies that are under certain comments.
How can I make it collect comments and their corresponding replies and put them in a single data frame?
Update
So, somehow I managed to pull the information I wanted in the output section of Jupyter Notebook. All I have to do now is append the results to the data frame.
Here is my updated code:
def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    print(rtext)
                    print(rtime)
                    print(rauthor)
                    print('Likes: ', rlike)
            print(comment)
            print(comment2)
            print(comment3)
            print("Likes: ", comment4)
            print('==============================')
            comments.append(comment)

        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments
The result is:
As you can see, each group between the ====== lines is a top-level comment with its corresponding replies underneath.
What would be a good way to append the result into the data frame?

According to the official docs, the property replies.comments[] of the CommentThreads resource has the following specification:
replies.comments[] (list)
A list of one or more replies to the top-level comment. Each item in the list is a comment resource.
The list contains a limited number of replies, and unless the number of items in the list equals the value of the snippet.totalReplyCount property, the list of replies is only a subset of the total number of replies available for the top-level comment. To retrieve all of the replies for the top-level comment, you need to call the Comments.list method and use the parentId request parameter to identify the comment for which you want to retrieve replies.
Consequently, if you want to obtain all reply entries associated with a given top-level comment, you will have to query the Comments.list API endpoint appropriately.
I recommend you read my answer to a closely related question; it has three sections:
Top-Level Comments and Associated Replies,
The property nextPageToken and the parameter pageToken, and
API Limitations Imposed by Design.
From the get-go, you'll have to accept that the API (as currently implemented) does not allow you to obtain all top-level comments associated with a given video when the number of those comments exceeds a certain (unspecified) upper bound.
As far as a Python implementation is concerned, I would suggest structuring the code as follows:
def get_video_comments(service, video_id):
    request = service.commentThreads().list(
        videoId = video_id,
        part = 'id,snippet,replies',
        maxResults = 100
    )
    comments = []

    while request:
        response = request.execute()

        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
                replies['comments'] = get_comment_replies(
                    service, comment['id'])

            # 'comment' is a CommentThreads resource whose
            # 'replies.comments' is now an array of Comment resources.
            # Do fill in the 'comments' data structure
            # to be provided by this function:
            ...

        request = service.commentThreads().list_next(
            request, response)

    return comments

def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 100
    )
    replies = []

    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)

    return replies
Note that the ellipsis above -- ... -- would have to be replaced with actual code that fills in the array of structures returned by get_video_comments to its caller.
The simplest way (useful for quick testing) would be to replace ... with comments.append(comment) and then have the caller of get_video_comments simply pretty-print (using json.dump) the object obtained from that function. A minimal caller sketch along those lines follows.
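For illustration, here is a minimal caller sketch along those lines; the video ID, the already-built service object, and the output file name are placeholders for this example, not something given in the original answer.

import json

# Assumes 'service' was built elsewhere (e.g. with googleapiclient.discovery.build)
# and that the '...' in get_video_comments was replaced with comments.append(comment).
video_id = 'vedLpKXzZqE'  # placeholder video ID
comments = get_video_comments(service, video_id)

# Pretty-print the raw CommentThreads resources for quick inspection.
with open('comments.json', 'w', encoding='utf-8') as f:
    json.dump(comments, f, indent=4, ensure_ascii=False)
print(f'Fetched {len(comments)} top-level comment threads')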

Based on stvar's answer and the original publication here, I built this code:
import os
import pickle
import csv
import json
import google.oauth2.credentials
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

CLIENT_SECRETS_FILE = "client_secret.json"  # for more information on creating your credentials JSON, please visit https://python.gotrained.com/youtube-api-extracting-comments/
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'

def get_authenticated_service():
    credentials = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            credentials = pickle.load(token)

    # Check if the credentials are invalid or do not exist
    if not credentials or not credentials.valid:
        # Check if the credentials have expired
        if credentials and credentials.expired and credentials.refresh_token:
            credentials.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                CLIENT_SECRETS_FILE, SCOPES)
            credentials = flow.run_console()

        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(credentials, token)

    return build(API_SERVICE_NAME, API_VERSION, credentials = credentials)

def get_video_comments(service, **kwargs):
    request = service.commentThreads().list(**kwargs)
    comments = []

    while request:
        response = request.execute()

        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
                replies['comments'] = get_comment_replies(
                    service, comment['id'])

            # 'comment' is a CommentThreads resource whose
            # 'replies.comments' is now an array of Comment resources.
            # Do fill in the 'comments' data structure
            # to be provided by this function:
            comments.append(comment)

        request = service.commentThreads().list_next(
            request, response)

    return comments

def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 1000
    )
    replies = []

    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)

    return replies

if __name__ == '__main__':
    # When running locally, disable OAuthlib's HTTPs verification. When
    # running in production *do not* leave this option enabled.
    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
    service = get_authenticated_service()
    videoId = input('Enter Video id : ')  # video id here (the video id of https://www.youtube.com/watch?v=vedLpKXzZqE -> is vedLpKXzZqE)
    comments = get_video_comments(service, videoId=videoId, part='id,snippet,replies', maxResults = 1000)

    with open('youtube_comments', 'w', encoding='UTF8') as f:
        writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for row in comments:
            # convert the tuple to a list and write to the output file
            writer.writerow([row])
It returns a file called youtube_comments with this format:
"{'kind': 'youtube#commentThread', 'etag': 'gvhv4hkH0H2OqQAHQKxzfA-K_tA', 'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'topLevelComment': {'kind': 'youtube#comment', 'etag': 'qpuKZcuD4FKf6BHgRlMunersEeU', 'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'This is a comment', 'textOriginal': 'This is a comment', 'authorDisplayName': 'Gabriell Magana', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLRGBvo2ZncDP1xGjlX6anfUufNYi9b3w9kYZFDl=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCKAa4FYftXsN7VKaPSlCivg', 'authorChannelId': {'value': 'UCKAa4FYftXsN7VKaPSlCivg'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 8, 'publishedAt': '2019-05-22T12:38:34Z', 'updatedAt': '2019-05-22T12:38:34Z'}}, 'canReply': True, 'totalReplyCount': 0, 'isPublic': True}}"
"{'kind': 'youtube#commentThread', 'etag': 'DsgDziMk7mB7xN4OoX7cmqlbDYE', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'topLevelComment': {'kind': 'youtube#comment', 'etag': 'NYjvYM9W_umBafAfQkdg1P9apgg', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'This is another comment', 'textOriginal': 'This is another comment', 'authorDisplayName': 'Mary Montes', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTg1b1yw8BX8Af0PoTR_t5OOwP9Cfl9_qL-o1iikw=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UC_GP_8HxDPsqJjJ3Fju_UeA', 'authorChannelId': {'value': 'UC_GP_8HxDPsqJjJ3Fju_UeA'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 9, 'publishedAt': '2019-05-15T05:10:49Z', 'updatedAt': '2019-05-15T05:10:49Z'}}, 'canReply': True, 'totalReplyCount': 3, 'isPublic': True}, 'replies': {'comments': [{'kind': 'youtube#comment', 'etag': 'Tu41ENCZYNJ2KBpYeYz4qgre0H8', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF79DbfJ9zMKxM', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'this is first reply', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'JULIO EMPRESARIO', 'authorProfileImageUrl': 'https://yt3.ggpht.com/eYP4MBcZ4bON_pHtdbtVsyWnsKbpNKye2wTPhgkffkMYk3ZbN0FL6Aa1o22YlFjn2RVUAkSQYw=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCrpB9oZZZfmBv1aQsxrk66w', 'authorChannelId': {'value': 'UCrpB9oZZZfmBv1aQsxrk66w'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2020-09-15T04:06:50Z', 'updatedAt': '2020-09-15T04:06:50Z'}}, {'kind': 'youtube#comment', 'etag': 'OrpbnJddwzlzwGArCgtuuBsYr94', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF795E1w8RV1DJ', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'the second replay', 'textOriginal': 'the second replay', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'Anatolio27 Diaz', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLR1hOySIxEkvRCySExHjo3T6zGBNkvuKpPkqA=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UC04N8BM5aUwDJf-PNFxKI-g', 'authorChannelId': {'value': 'UC04N8BM5aUwDJf-PNFxKI-g'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2020-02-19T18:21:06Z', 'updatedAt': '2020-02-19T18:21:06Z'}}, {'kind': 'youtube#comment', 'etag': 'sPmIwerh3DTZshLiDVwOXn_fJx0', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF78wwH6Aabh4y', 'snippet': {'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'A third reply', 'textOriginal': 'A third reply', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'Voy detrás de mi pasión', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTgzZ3ZFvkmmAlMzA77ApM-2uGFfvOBnzxegYEX=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCvv6QMokO7KcJCDpK6qZg3Q', 'authorChannelId': {'value': 'UCvv6QMokO7KcJCDpK6qZg3Q'}, 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2019-07-03T18:45:34Z', 'updatedAt': '2019-07-03T18:45:34Z'}}]}}"
Now a second step is necessary in order to extract the required information. For this I use a set of shell tools such as cut, awk and sed:
cut -d ":" -f 10- youtube_comments | sed -e "s/', '/\n/g" -e "s/'//g" | awk '/replies/{print "------------------------****---------::: Replies: "$6" :::---------******--------------------------------"}!/replies/{print}' |sed '/^textOriginal:/,/^authorDisplayName:/{/^authorDisplayName/!d}' |sed '/^authorProfileImageUrl:\|^authorChannelUrl:\|^authorChannelId:\|^etag:\|^updatedAt:\|^parentId:\|^id:/d' |sed 's/<[^>]*>//g' | sed 's/{textDisplay/{\ntextDisplay/' |sed '/^snippet:/d' | awk -F":" '(NF==1){print "========================================COMMENT==========================================="}(NF>1){a=0; print $0}' | sed 's/textDisplay: //g' | sed 's/authorDisplayName/User/g' | sed 's/T[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}Z//g' | sed 's/likeCount: /Likes:/g' | sed 's/publishedAt: //g' > output_file
The final result is a file called output_file with this format:
========================================COMMENT===========================================
This is a comment
User: Robert Everest
Likes:8, 2019-05-22
========================================COMMENT===========================================
This is another comment
User: Anna Davis
Likes:9, 2019-05-15
------------------------****---------::: Replies: 3, :::---------******--------------------------------
this is first reply
User: John Doe
Likes:2, 2020-09-15
the second replay
User: Caraqueno
Likes:2, 2020-02-19
A third reply
User: Rebeca
Likes:2, 2019-07-03
The Python script requires the file token.pickle to work; it is generated the first time the script runs, and when it expires it has to be deleted and generated again.
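As a convenience, here is a minimal, hedged sketch of how that reset could be automated instead of deleting token.pickle by hand; it only reuses the pieces already shown in get_authenticated_service above and makes no assumptions beyond them.

import os
import pickle
from google.auth.transport.requests import Request

def load_or_reset_credentials(path='token.pickle'):
    # Returns stored credentials if they are still usable, refreshing them
    # when possible; otherwise removes the stale pickle so the next call to
    # get_authenticated_service() runs the OAuth console flow again.
    if not os.path.exists(path):
        return None
    with open(path, 'rb') as token:
        credentials = pickle.load(token)
    if credentials and credentials.valid:
        return credentials
    if credentials and credentials.expired and credentials.refresh_token:
        credentials.refresh(Request())
        return credentials
    os.remove(path)  # stale or unusable token: force re-authorization on the next run
    return None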

I had a similar issue to the OP's and managed to solve it, but someone in the community closed my question after I solved it, so I can't post there. I'm posting it here for posterity.
The YouTube API doesn't let you grab replies nested below other replies (Video --> Comments --> Comment Replies --> Reply To Reply, and so on). What it does allow is getting all the top-level comments and the replies to those comments. Knowing this limitation, we can write code to get all the top-level comments, and then go into those comments to get the first-level replies.
Modules
import os
import googleapiclient.discovery  # required for using the Google API client
import pandas as pd  # required for data munging; we use pd.json_normalize to create the tables
import numpy as np  # just good to have
import json  # the requests are returned as JSON objects
from datetime import datetime  # good to have for date modification
Get All Comments Function
For a given vidId, this function will get the first 100 comments and place them into a DataFrame. It then uses a while loop to check whether the API response contains nextPageToken. While it does, it will keep running to get all the comments, until either all the comments are pulled or you run out of credits, whichever happens first.
def vidcomments(vidId):
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"
    api_service_name = "youtube"
    api_version = "v3"
    DEVELOPER_KEY = "yourapikey"  # <--- insert API key here

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey=DEVELOPER_KEY)

    request = youtube.commentThreads().list(
        part="snippet, replies",
        order="time",
        maxResults=100,
        textFormat="plainText",
        videoId=vidId
    )
    response = request.execute()
    full = pd.json_normalize(response, record_path=['items'])

    while response:
        if 'nextPageToken' in response:
            response = youtube.commentThreads().list(
                part="snippet",
                maxResults=100,
                textFormat='plainText',
                order='time',
                videoId=vidId,
                pageToken=response['nextPageToken']
            ).execute()
            df2 = pd.json_normalize(response, record_path=['items'])
            # DataFrame.append has been removed in recent pandas versions; use concat instead
            full = pd.concat([full, df2], ignore_index=True)
        else:
            break

    return full
Get All Replies To Comments Function
For a particular parentId, get all the first-level replies. Like the vidcomments() function above, it will run until all the replies to that comment are pulled or you run out of credits, whichever happens first.
def repliesto(parentId):
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"
    api_service_name = "youtube"
    api_version = "v3"
    DEVELOPER_KEY = DevKey  # your dev key

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey=DEVELOPER_KEY)

    request = youtube.comments().list(
        part="snippet",
        maxResults=100,
        parentId=parentId,
        textFormat="plainText"
    )
    response = request.execute()
    replies = pd.json_normalize(response, record_path=['items'])

    while response:
        if 'nextPageToken' in response:
            response = youtube.comments().list(
                part="snippet",
                maxResults=100,
                parentId=parentId,
                textFormat="plainText",
                pageToken=response['nextPageToken']
            ).execute()
            df2 = pd.json_normalize(response, record_path=['items'])
            replies = pd.concat([replies, df2], sort=False)
        else:
            break

    return replies
Putting It Together
First, run the vidcomments function to get all the comment information. Then use the code below to get all the reply information: a for loop pulls each topLevelComment.id into a list, and another for loop feeds that list into the repliesto function to build the replies DataFrame. This creates two separate DataFrames, one for comments and another for replies. After creating both of these DataFrames you can join them in whatever way makes sense for your purpose, either a concat/union or a join/merge; a hedged merge sketch follows the code below.
replyto = []
for reply in full[full['snippet.totalReplyCount'] > 0]['snippet.topLevelComment.id']:
    replyto.append(reply)

# create an empty DF to store all the replies
# use a for loop to place each item in our replyto list into the function defined above
replies = pd.DataFrame()
for reply in replyto:
    df = repliesto(reply)
    replies = pd.concat([replies, df], ignore_index=True)
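For completeness, here is one possible way to combine the two DataFrames, as mentioned above. It is only a sketch and assumes the flattened column names that pd.json_normalize produces for these responses ('snippet.topLevelComment.id' on the comments side and 'snippet.parentId' on the replies side).

# Join each reply onto its parent top-level comment (a left join keeps
# comments that have no replies). The column names are the json_normalize
# defaults assumed above, not something guaranteed by this answer.
combined = full.merge(
    replies,
    how='left',
    left_on='snippet.topLevelComment.id',
    right_on='snippet.parentId',
    suffixes=('_comment', '_reply')
)
print(combined.shape)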

Related

GMail API Get Full Body Text [duplicate]

This question already has an answer here:
How can I get the body of a gmail email with an attatchment gmail python API
(1 answer)
Closed 5 months ago.
I am trying to get the full body text from an email, but I keep running into various issues. Below is my code:
results = service.users().messages().list(userId='me', labelIds=['Label_538763522493983273'], q="is:unread").execute()
messages = results.get('messages', [])

if not messages:
    print("No new messages")
else:
    for message in messages:
        msg = service.users().messages().get(userId='me', id=message['id']).execute()
        payload = msg['payload']
        email_data = payload['headers']
        parts = payload.get('parts')[0]
        print(parts)
        for part in parts:
            data = parts['body']['data']
            data = data.replace("-", "+").replace("_", "/")
            decoded_data = base64.b64decode(data)
            print(decoded_data)

            # Now, the data obtained is in lxml. So, we will parse
            # it with BeautifulSoup library
            soup = BeautifulSoup(decoded_data, "lxml")
            body = soup.body()
            print(body)
The issue, I believe, is that the parts variable prints this to the console:
{'partId': '0', 'mimeType': 'multipart/related', 'filename': '', 'headers': [{'name': 'Content-Type', 'value': 'multipart/related; boundary="----=_Part_487486_335313815.1619387911380"'}], 'body': {'size': 0}, 'parts': [{'partId': '0.0', 'mimeType': 'text/html', 'filename': '', 'headers': [{'name': 'content-type', 'value': 'text/html; charset=UTF-8'}, {'name': 'Content-Transfer-Encoding', 'value': '7bit'}], 'body': {'size': 3697, 'data': 'PCFkb2N0eXBlIGh0bWw-PGh0bWwgeG1sbnM6bz0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6b2ZmaWNlIiB4bWxuczp2PSJ1cm46c2NoZW1hcy1taWNyb3NvZnQtY29tOnZtbCI-DQo8aGVhZD4NCjxNRVRBIGh0dHAtZXF1aXY9IkNvbnRlbnQtVHlwZSIgY29udGVudD0idGV4dC9odG1sOyBjaGFyc2V0PVVURi04Ij4NCjx0aXRsZT48L3RpdGxlPg0KPHN0eWxlIHR5cGU9InRleHQvY3NzIj5ib2R5LCB0YWJsZSB7DQogIGZvbnQtZmFtaWx5OiBWZXJkYW5hLCBBcmlhbCwgc2Fucy1zZXJpZjsNCiAgZm9udC1zaXplOiAxMnB4OyB3aWR0aDoxMDAlOw0KfQ0KZGl2IHsNCiAgcGFkZGluZy10b3A6NXB4Ow0KICBwYWRkaW5nLWJvdHRvbTo1cHg7DQp9DQppbWcgDQp7DQogIGJvcmRlcjowcHg7DQp9PC9zdHlsZT4NCjwvaGVhZD4NCjxib2R5PjxkaXY-DQpDbGljayB0aGlzIGxpbmsgdG8gY29uZmlybSB5b3VyIGVtYWlsIGFkZHJlc3MgYW5kIGNvbXBsZXRlIHNldHVwIGZvciB5b3VyIGNhbmRpZGF0ZSBhY2NvdW50PGJyPmh0dHBzOi8vcG5jLndkNS5teXdvcmtkYXlqb2JzLmNvbS9FeHRlcm5hbC9hY3RpdmF0ZS94d3pzNzY4Ymp4ZnQ0M2F1aWd1cTl2dWRjY3dqNWxnb3ZjdHU1cjFmY2k5dGU4ZXdyZGZ5bjM5d3BuejA2ZXlhNGp4MjRoMDQ5djZwcGJ1enBhdHVxNjRnY2p1MDQwdGh2dTQvP3JlZGlyZWN0PSUyRkV4dGVybmFsJTJGam9iJTJGUEEtLS1QaXR0c2J1cmdoLTE1MjIyJTJGU2VjdXJpdHktQW5hbHlzdC0tLUVtcGxveWVlLUludmVzdGlnYXRpb25zLS1fUjA1NTc0NCUyRmFwcGx5PGJyPlRoZSBsaW5rIHdpbGwgZXhwaXJlIGFmdGVyIDI0IGhvdXJzLg0KPC9kaXY-PGRpdj4NCjxici8-DQo8L2Rpdj48ZGl2Pg0KPGltZyBzcmM9ImNpZDplblF5V3RITFNlIi8-DQo8L2Rpdj4NCgkgICAgIDxkaXY-DQoJICAgIDwhLS1baWYgbXNvIHwgSUVdPg0KCSAgICAgIDx0YWJsZSBib3JkZXI9IjAiIGNlbGxwYWRkaW5nPSIwIiBjZWxsc3BhY2luZz0iMCIgd2lkdGg9IjYwMCIgYWxpZ249ImNlbnRlciIgc3R5bGU9IndpZHRoOjYwMHB4OyI-DQoJICAgICAgICA8dHI-DQoJICAgICAgICAgIDx0ZCBzdHlsZT0ibGluZS1oZWlnaHQ6MHB4O2ZvbnQtc2l6ZTowcHg7bXNvLWxpbmUtaGVpZ2h0LXJ1bGU6ZXhhY3RseTsiPg0KCSAgICAgIDwhW2VuZGlmXS0tPg0KCSAgICA8ZGl2IHN0eWxlPSJtYXJnaW46MCBhdXRvO21heC13aWR0aDo2MDBweDsiPg0KCSAgICAgIDx0YWJsZSBjZWxscGFkZGluZz0iMCIgY2VsbHNwYWNpbmc9IjAiIHN0eWxlPSJmb250LXNpemU6MHB4O3dpZHRoOjEwMCU7IiBhbGlnbj0iY2VudGVyIiBib3JkZXI9IjAiPg0KCSAgICAgICAgPHRib2R5Pg0KCSAgICAgICAgICA8dHI-DQoJICAgICAgICAgICAgPHRkIHN0eWxlPSJ0ZXh0LWFsaWduOmNlbnRlcjt2ZXJ0aWNhbC1hbGlnbjp0b3A7Zm9udC1zaXplOjBweDtwYWRkaW5nOjIwcHggMHB4OyI-PC90ZD4NCgkgICAgICAgICAgPC90cj4NCgkgICAgICAgIDwvdGJvZHk-DQoJICAgICAgPC90YWJsZT4NCgkgICAgPC9kaXY-DQoJICAgIDwhLS1baWYgbXNvIHwgSUVdPg0KCSAgICAgIDwvdGQ-PC90cj48L3RhYmxlPg0KCSAgICAgIDwhW2VuZGlmXS0tPg0KCSAgICA8IS0tW2lmIG1zbyB8IElFXT4NCgkgICAgICA8dGFibGUgYm9yZGVyPSIwIiBjZWxscGFkZGluZz0iMCIgY2VsbHNwYWNpbmc9IjAiIHdpZHRoPSI2MDAiIGFsaWduPSJjZW50ZXIiIHN0eWxlPSJ3aWR0aDo2MDBweDsiPg0KCSAgICAgICAgPHRyPg0KCSAgICAgICAgICA8dGQgc3R5bGU9ImxpbmUtaGVpZ2h0OjBweDtmb250LXNpemU6MHB4O21zby1saW5lLWhlaWdodC1ydWxlOmV4YWN0bHk7Ij4NCgkgICAgICA8IVtlbmRpZl0tLT4NCgkgICAgPGRpdiBzdHlsZT0ibWFyZ2luOjAgYXV0bzttYXgtd2lkdGg6NjAwcHg7Ij4NCgkgICAgICA8dGFibGUgY2VsbHBhZGRpbmc9IjAiIGNlbGxzcGFjaW5nPSIwIiBzdHlsZT0iZm9udC1zaXplOjBweDt3aWR0aDoxMDAlOyIgYWxpZ249ImNlbnRlciIgYm9yZGVyPSIwIj4NCgkgICAgICAgIDx0Ym9keT4NCgkgICAgICAgICAgPHRyPg0KCSAgICAgICAgICAgIDx0ZCBzdHlsZT0idGV4dC1hbGlnbjpjZW50ZXI7dmVydGljYWwtYWxpZ246dG9wO2ZvbnQtc2l6ZTowcHg7cGFkZGluZzoyMHB4IDBweDsiPjwvdGQ-DQoJICAgICAgICAgIDwvdHI-DQoJICAgICAgICA8L3Rib2R5Pg0KCSAgICAgIDwvdGFibGU-DQoJICAgIDwvZGl2Pg0KCSAgICA8IS0tW2lmIG1zbyB8IElFXT4NCgkgICAgICA8L3RkPjwvdHI-PC90YWJsZT4NCgkgICAgICA8IVtlbmRpZl0tLT4NCiAgICAgIDwhLS1baWYgbXNvIHwgSUVdPg0KCSAgICAgIDx0YWJsZ
SBib3JkZXI9IjAiIGNlbGxwYWRkaW5nPSIwIiBjZWxsc3BhY2luZz0iMCIgd2lkdGg9IjYwMCIgYWxpZ249ImNlbnRlciIgc3R5bGU9IndpZHRoOjYwMHB4OyI-DQoJICAgICAgICA8dHI-DQoJICAgICAgICAgIDx0ZCBzdHlsZT0ibGluZS1oZWlnaHQ6MHB4O2ZvbnQtc2l6ZTowcHg7bXNvLWxpbmUtaGVpZ2h0LXJ1bGU6ZXhhY3RseTsiPg0KCSAgICAgIDwhW2VuZGlmXS0tPjxkaXYgeG1sbnM6d2Q9InVybjpjb20ud29ya2RheS9ic3ZjIiBzdHlsZT0ibWFyZ2luOjAgYXV0bzttYXgtd2lkdGg6NjAwcHg7Ij4NCjx0YWJsZSBib3JkZXI9IjAiIGFsaWduPSJjZW50ZXIiIHN0eWxlPSJmb250LXNpemU6MHB4O3dpZHRoOjEwMCU7IiBjZWxsc3BhY2luZz0iMCIgY2VsbHBhZGRpbmc9IjAiPg0KPHRib2R5Pg0KPHRyPg0KPHRkIHN0eWxlPSJ0ZXh0LWFsaWduOmNlbnRlcjt2ZXJ0aWNhbC1hbGlnbjp0b3A7Zm9udC1zaXplOjBweDtwYWRkaW5nOjIwcHggMHB4OyI-DQo8IS0tW2lmIG1zbyB8IElFXT4NCgkgPHRhYmxlIGJvcmRlcj0iMCIgY2VsbHBhZGRpbmc9IjAiIGNlbGxzcGFjaW5nPSIwIj48dHI-PHRkIHN0eWxlPSJ2ZXJ0aWNhbC1hbGlnbjp0b3A7d2lkdGg6NjAwcHg7Ij4NCgkgICAgICA8IVtlbmRpZl0tLT4NCjxkaXYgc3R5bGU9InZlcnRpY2FsLWFsaWduOnRvcDtkaXNwbGF5OmlubGluZS1ibG9jaztmb250LXNpemU6MTNweDt0ZXh0LWFsaWduOmxlZnQ7d2lkdGg6MTAwJTsiIGNsYXNzPSJtai1jb2x1bW4tcGVyLTEwMCIgYXJpYS1sYWJlbGxlZGJ5PSJtai1jb2x1bW4tcGVyLTEwMCI-DQo8dGFibGUgYm9yZGVyPSIwIiB3aWR0aD0iMTAwJSIgY2VsbHNwYWNpbmc9IjAiIGNlbGxwYWRkaW5nPSIwIj4NCjx0Ym9keT4NCjx0cj4NCjx0ZCBhbGlnbj0iY2VudGVyIiBzdHlsZT0id29yZC1icmVhazpicmVhay13b3JkO2ZvbnQtc2l6ZTowcHg7cGFkZGluZy1ib3R0b206MHB4OyI-DQo8ZGl2IHN0eWxlPSJjdXJzb3I6YXV0bztjb2xvcjojOThhMGE2O2ZvbnQtZmFtaWx5OlJvYm90bztmb250LXNpemU6MTJweDtmb250LXdlaWdodDo0MDA7bGluZS1oZWlnaHQ6MjJweDsiPlRoaXMgZW1haWwgd2FzIGludGVuZGVkIGZvciB3aGl0ZS5sbmF0aGFuQGdtYWlsLmNvbTwvZGl2Pg0KPC90ZD4NCjwvdHI-DQo8L3Rib2R5Pg0KPC90YWJsZT4NCjwvZGl2Pg0KPCEtLVtpZiBtc28gfCBJRV0-DQoJICAgICAgPC90ZD48L3RyPjwvdGFibGU-DQoJICAgICAgPCFbZW5kaWZdLS0-DQo8L3RkPg0KPC90cj4NCjwvdGJvZHk-DQo8L3RhYmxlPg0KPC9kaXY-DQo8IS0tW2lmIG1zbyB8IElFXT4NCgkgICAgICA8L3RkPjwvdHI-PC90YWJsZT4NCgkgICAgICA8IVtlbmRpZl0tLT4NCjwvYm9keT4NCjwvaHRtbD4NCg=='}}, {'partId': '0.1', 'mimeType': 'application/octet-stream', 'filename': 'logo.gif', 'headers': [{'name': 'content-type', 'value': 'application/octet-stream'}, {'name': 'Content-Transfer-Encoding', 'value': 'base64'}, {'name': 'content-disposition', 'value': 'inline; filename=logo.gif'}, {'name': 'content-id', 'value': '<enQyWtHLSe>'}, {'name': 'content-description', 'value': 'logo.gif'}], 'body': {'attachmentId': 'ANGjdJ_nOphvv6Vmz844nOkhWVxB_lgTbbG1fERDE3DD5Cn2dLXlnvqA3DAxEUHtDC2DdSjPLg1v9XgEdUPMM3jfu7FDPmDAfDYM7wlmtKIQ9MaSAj5lMyzKCXGUCwQJGX-u6qOz37ghBmF9ojr1WV_8pq0UWcVYTMajK5XX4N8iwrm9wbTmoDtU9tli-MDNPabIJEpB8I9ppCB552bAuJ9BVMYTqtE3Drx_xy_YyIYLsPZMgMk97QjgawaLZwdFWTzHwrA_njT3OFZ7_hp4-REVp4-ExcN0v-dO4qBjAn8W9eZ2eRCCXvj7x_mssQAMn6K026C4qvL4-D5qiYDNsHY55H4-HR0IyMVJa1UWQGmur6ZbrXDCyvQ8rTHwCrXjZGmXIPFNWjGFF4PPnkcf', 'size': 2754}}]}
It looks as if there is another parts inside of the parts I am calling, as well as body outside of that second parts. I have tried multiple solutions on here including checking the mimeType, but to no avail. If someone could give some insight on this I would appreciate it.
I resolved the issue. I don't take credit for this code, and more can be read up here: How can I get the body of a gmail email with an attatchment gmail python API
I had to change the body_html variable to body_message since it ended up being an html mimeType.
payload = msg['payload']
email_data = payload['headers']
parts = payload.get('parts')

for part in parts:
    body = part.get("body")
    data = body.get("data")
    mimeType = part.get("mimeType")
    # with attachment
    if mimeType == 'multipart/related':
        subparts = part.get('parts')
        for p in subparts:
            body = p.get("body")
            data = body.get("data")
            mimeType = p.get("mimeType")
            if mimeType == 'text/plain':
                body_message = base64.urlsafe_b64decode(data)
            elif mimeType == 'text/html':
                body_message = base64.urlsafe_b64decode(data)
    # without attachment
    elif mimeType == 'text/plain':
        body_message = base64.urlsafe_b64decode(data)
    elif mimeType == 'text/html':
        body_message = base64.urlsafe_b64decode(data)

final_result = str(body_message, 'utf-8')
url = extractor.find_urls(final_result)
print(url[0])
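Note that the extractor object used on the last line is never defined in the snippet above. A hedged guess, assuming the urlextract package (whose find_urls method matches this usage), is:

from urlextract import URLExtract

extractor = URLExtract()
# find_urls returns a list of URL strings found in the given text
urls = extractor.find_urls("Click https://example.com to confirm")
print(urls)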


How do I make an API call and authenticate it with a given API key using Python?

This is my code to extract player data from an endpoint containing basketball data for a Data Science project. NOTE: I changed the name of the actual API key I was given since it's a subscription, and I changed the username/password for privacy purposes. Using the correct credentials, I don't receive a syntax error, but the status code always returns 401. Since it wasn't accepting the API key, I added my account username, password, and the HTTP authentication header as well, but the status code still returns 401.
In case this is relevant, this is the website's recommendation in the developer portal: "The API key can be passed either as a query parameter or using the following HTTP request header: Ocp-Apim-Subscription-Key: {key}"
Please let me know what changes I can make to my code. Any help is appreciated.
PS: My code got fragmented while posting this, but it is all in one function.
def getData():
    user_name = "name#gmail.com"
    api_endpoint = "https://api.sportsdata.io/v3/nba/stats/json/PlayerGameStatsByDate/2020-FEB7"
    api_key = "a45;lkf"
    password = "ksaljd"
    header = "Ocp-Apim-Subscription-Key"

    PARAMS = {'user': user_name, 'pass': password, 'header': header, 'key': api_key}
    response = requests.get(url=api_endpoint, data=PARAMS)
    print(response.status_code)

    file = open("Data.csv", "w")
    file.write(response.text)
    file.close()
def _get_auth_headers() -> dict:
    return {
        'Content-Type': 'application/json',
        'Ocp-Apim-Subscription-Key': "`Insert key here`"
    }

api_endpoint = "https://api.sportsdata.io/v3/nba/stats/json/PlayerGameStatsByDate/2020-FEB7"

PARAMS = {
    # Your params here
}

response = requests.get(
    api_endpoint,
    headers=_get_auth_headers(),
    params=PARAMS
)
Instead of just a string, you need to pass a dict in the headers parameter; an auth parameter also exists, so you can use it as follows:
def getData():
    [...]
    header = {
        "Ocp-Apim-Subscription-Key": api_key
    }
    [...]
    response = requests.get(url=api_endpoint, data=PARAMS, headers=header, auth=(user_name, password))
According to the API documentation, you don't need to provide an email and password. You only need to add your API key to the header:
import requests
r = requests.get(url='https://api.sportsdata.io/v3/nba/stats/json/PlayerGameStatsByDate/2020-FEB7', headers={'Ocp-Apim-Subscription-Key': 'API_KEY'})
print(r.json())
Output:
[{
'StatID': 768904,
'TeamID': 25,
'PlayerID': 20000788,
'SeasonType': 1,
'Season': 2020,
'Name': 'Tim Hardaway Jr.',
'Team': 'DAL',
'Position': 'SF',
'Started': 1,
'FanDuelSalary': 7183,
'DraftKingsSalary': 7623,
'FantasyDataSalary': 7623,
...

Accessing nested data in a supposed dict

Alright, I'm stumped. I have googled everything I can think of, from nested dicts, to dicts inside lists inside dicts, to JSON referencing, and have no idea how to get to this data.
I have this AWS Lambda handler that reads Slack events, simply reverses someone's message, and then spits it back out to Slack. However, the bot can respond to itself (creating an infinite loop). I thought I had this solved, but that was for the legacy stuff. I am Python stupid, so how do I reference this data?
Data (slack_body_dict print from below):
{'token': 'NgapUeqidaGeTf4ONWkUQQiP', 'team_id': 'T7BD9RY57', 'api_app_id': 'A01LZHA7R9U', 'event': {'client_msg_id': '383aeac2-a436-4bad-8e19-7fa68facf916', 'type': 'message', 'text': 'rip', 'user': 'U7D1RQ9MM', 'ts': '1612727797.024200', 'team': 'T7BD9RY57', 'blocks': [{'type': 'rich_text', 'block_id': 'gA7K', 'elements': [{'type': 'rich_text_section', 'elements': [{'type': 'text', 'text': 'rip'}]}]}], 'channel': 'D01MK0JSNDP', 'event_ts': '1612727797.024200', 'channel_type': 'im'}, 'type': 'event_callback', 'event_id': 'Ev01MN8LJ117', 'event_time': 1612727797, 'authorizations': [{'enterprise_id': None, 'team_id': 'T7BD9RY57', 'user_id': 'U01MW6UK55W', 'is_bot': True, 'is_enterprise_install': False}], 'is_ext_shared_channel': False, 'event_context': '1-message-T7BD9RY57-D01MK0JSNDP'}
There is an 'is_bot' field under 'authorizations' that I want to check. I assume this will let the bot stop responding to itself. However, for the life of me, I cannot reference it. It seems to be nested in there.
I have tried the following:
def lambda_handler(api_event, api_context):
    print(f"Received event:\n{api_event}\nWith context:\n{api_context}")

    # Grab relevant information from the api_event
    slack_body_raw = api_event.get('body')
    slack_body_dict = json.loads(slack_body_raw)
    request_headers = api_event["headers"]

    print(f"!!!!!!!!!!!!!!!!!!!!!!!body_dict:\n{slack_body_dict}")
    print(f"#######################is_bot:\n{slack_body_dict('is_bot')}")
    print(f"#######################is_bot:\n{slack_body_dict("is_bot")}")
    print(f"#######################is_bot:\n{slack_body_dict(['is_bot']}")
    print(f"#######################is_bot:\n{slack_body_dict(["is_bot"]}")
    print(f"#######################is_bot:\n{slack_body_dict['authorizations']['is_bot']}")
As you can see I have absolutely no clue how to get to that variable to tell if it is true or false. Every 'is_bot' print reference results in an error. Can someone tell me how to reference that variable or give me something to google? Appreciate it. Code is below in case it is relevant.
import json
import os
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

def is_challenge(slack_event_body: dict) -> bool:
    """Is the event a challenge from slack? If yes return the correct response to slack

    Args:
        slack_event_body (dict): The slack event JSON

    Returns:
        returns True if it is a slack challenge event returns False otherwise
    """
    if "challenge" in slack_event_body:
        LOGGER.info(f"Challenge Data: {slack_event_body['challenge']}")
        return True
    return False

def lambda_handler(api_event, api_context):
    # Grab relevant information from the api_event
    slack_body_raw = api_event.get('body')
    slack_body_dict = json.loads(slack_body_raw)
    request_headers = api_event["headers"]

    # This is to appease the slack challenge gods
    if is_challenge(slack_body_dict):
        challenge_response_body = {
            "challenge": slack_body_dict["challenge"]
        }
        return helpers.form_response(200, challenge_response_body)

    # This parses the slack body dict to get the event JSON
    slack_event_dict = slack_body_dict["event"]

    # Build the slack client.
    slack_client = WebClient(token=os.environ['BOT_TOKEN'])

    # We need to discriminate between events generated by
    # the users, which we want to process and handle,
    # and those generated by the bot.
    if slack_body_dict['is_bot']:  # THIS IS GIVING ME THE ERROR. I WANT TO CHECK IF BOT HERE.
        logging.warning("Ignore bot event")
    else:
        # Get the text of the message the user sent to the bot,
        # and reverse it.
        text = slack_event_dict["text"]
        reversed_text = text[::-1]

        # Get the ID of the channel where the message was posted.
        channel_id = slack_event_dict["channel"]

        try:
            response = slack_client.chat_postMessage(
                channel=channel_id,
                text=reversed_text
            )
        except SlackApiError as e:
            # You will get a SlackApiError if "ok" is False
            assert e.response["error"]  # str like 'invalid_auth', 'channel_not_found'
The structure of the data is:
{
    "authorizations": [
        {
            "is_bot": true
        }
    ]
}
So you would need to first index "authorizations", then get the first item (index 0), and lastly "is_bot".
data["authorizations"][0]["is_bot"]
Alternatively, you could iterate over all the authorizations and check whether any (or all) of them are marked as a bot, like so:
any(auth["is_bot"] for auth in slack_body_dict["authorizations"])

Export Methods for SugarCRM v10 api

I am trying to make a call to the SugarCRM v10 API to get the output of a report without having to log into the web interface and click the export button. I would like to get this report as data that can be written into CSV format using Python and the requests library.
I can authenticate successfully and get a token, but whatever I try, all I get as a response from Reports is Error Method does not exist, by which they mean that you cannot use /csv at the end of the second URL in this code block.
url = "https://mydomain.sugarondemand.com/rest/v10/oauth2/token"
payload = {"grant_type":"password","username":"ursername","password":"password","client_id":"sugar", "platform":"myspecialapp"}
r = requests.post(url, data=json.dumps(payload))
response = json.loads(r.text)
token = response[u'access_token']
print 'Success! OAuth token is ' + token
#What export methods are available? ###################################
#WRONG url = "https://mydomain.sugarondemand.com/rest/v10/Reports/report_id/csv"
#Following paquino's suggestion I used Base64
url = "https://mydomain.sugarondemand.com/rest/v10/Reports/report_id/Base64"
headers = { "Content-Type" : "application/json", "OAuth-Token": token }
r = requests.get(url, headers=headers);
response = r.text.decode('base64')
print response`
My question is this: what export methods are available via an API call to v10 of the SugarCRM API?
Edit: Using Base64 in the request URL unfortunately returns an object that I don't know how to parse...
%PDF-1.7
3 0 obj
<</Type /Page
/Parent 1 0 R
/MediaBox [0 0 792.00 612.00]
/Resources 2 0 R
/Contents 4 0 R>>
endobj
4 0 obj
<</Length 37217>>
stream
8.cܬR≈`ä║dàQöWºáW╙µ
The Reports API accepts "Base64" and "Pdf".
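Building on the OP's own snippet above (and on the %PDF-1.7 header visible in the decoded output), a hedged sketch of saving the Base64 export to a PDF file might look like this; the report_id placeholder and endpoint behaviour are taken from the question, not verified against the SugarCRM docs.

import base64
import requests

# 'token' comes from the OAuth request shown in the question's earlier snippet.
url = "https://mydomain.sugarondemand.com/rest/v10/Reports/report_id/Base64"
headers = {"Content-Type": "application/json", "OAuth-Token": token}

r = requests.get(url, headers=headers)

# The decoded payload starts with %PDF-1.7, i.e. it is a PDF document,
# so write the raw bytes to disk instead of trying to parse them as text.
with open("report.pdf", "wb") as f:
    f.write(base64.b64decode(r.text))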
Python Wrapper for SugarCRM REST API v10
https://github.com/Feverup/pysugarcrm
Quickstart
pip install pysugarcrm
from pysugarcrm import SugarCRM
api = SugarCRM('https://yourdomain.sugaropencloud.e', 'youruser', 'yourpassword')
# Return info about current user
api.me
# A more complex query requesting employees
api.get('/Employees', query_params={'max_num': 2, 'offset': 2, 'fields': 'user_name,email'})
{'next_offset': 4,
'records': [{'_acl': {'fields': {}},
'_module': 'Employees',
'date_modified': '2015-09-09T13:40:32+02:00',
'email': [{'email_address': 'John.doe#domain.com',
'invalid_email': False,
'opt_out': False,
'primary_address': True,
'reply_to_address': False}],
'id': '12364218-7d79-80e0-4f6d-35ed99a8419d',
'user_name': 'john.doe'},
{'_acl': {'fields': {}},
'_module': 'Employees',
'date_modified': '2015-09-09T13:39:54+02:00',
'email': [{'email_address': 'alice#domain.com',
'invalid_email': False,
'opt_out': False,
'primary_address': True,
'reply_to_address': False}],
'id': 'a0e117c0-9e46-aebf-f71a-55ed9a2b4731',
'user_name': 'alice'}]}
# Generate a Lead
api.post('/Leads', json={'first_name': 'John', 'last_name': 'Smith', 'business_name_c': 'Test John', 'contact_email_c': 'john#smith.com'})
from pysugarcrm import sugar_api

with sugar_api('http://testserver.com/', "admin", "12345") as api:
    data = api.get('/Employees', query_params={'max_num': 2, 'offset': 2, 'fields': 'user_name,email'})
    api.post('/Leads', json={'first_name': 'John', 'last_name': 'Smith', 'business_name_c': 'Test John', 'contact_email_c': 'john#smith.com'})

# Once we exit the context manager the sugar connection is closed and the user is logged out
