KeyError for 'snippet' when using YouTube Data API RelatedToVideoID feature - python

This is my first ever question on Stack Overflow so please do tell me if anything remains unclear. :)
My issue is somewhat related to this thread. I am trying to use the YouTube Data API to sample videos for my thesis. I have done so successfully with the code below; however, when I change the criterion from a query (q) to relatedToVideoId, the unpacking section breaks for some reason.
It works outside of my loop, but not inside it (same story for the .get() suggestion from the other thread). Does anyone know why this might be and how I can solve it?
This is the (shortened) code I wrote which you can use to replicate the issue:
import numpy as np
import pandas as pd
# Allocate credentials:
from googleapiclient.discovery import build
api_key = "YOUR KEY SHOULD GO HERE"
# Session Build
youtube = build('youtube', 'v3', developerKey = api_key)
df_sample_v2 = pd.DataFrame(columns = ["Video.ID", "Title", "Channel Name"])
keywords = ['Global Warming',
'Coronavirus'
]
iter = list(range(1, 150))
rand_selec_ids = ['H6u0VBqNBQ8',
'LEZCxxKp0hM'
]
for i in iter:
    # Search Request
    request = youtube.search().list(
        part = "snippet",
        #q = keywords[4],
        relatedToVideoId = rand_selec_ids[1],
        type = "video",
        maxResults = 5000,
        videoCategoryId = 28,
        order = "relevance",
        eventType = "completed",
        videoDuration = "medium"
    )
    # Save Response
    response = request.execute()
    # Unpack Response
    rows = []
    for i in list(range(0, response['pageInfo']['resultsPerPage'])):
        rows.append([response['items'][i]['id']['videoId'],
                     response['items'][i]['snippet']['title'],  # this is the problematic line
                     response['items'][i]['snippet']['channelTitle']]
                    )
    temp = pd.DataFrame(rows, columns = ["Video.ID", "Title", "Channel Name"])
    df_sample_v2 = df_sample_v2.append(temp)
print(f'{len(df_sample_v2)} videos retrieved!')
The KeyError I get is at the second line of rows.append() where I try to access the snippet.
KeyError Traceback (most recent call last)
<ipython-input-90-c6c01139e372> in <module>
45
46 rows.append([response['items'][i]['id']['videoId'],
---> 47 response['items'][i]['snippet']['title'],
48 response['items'][i]['snippet']['channelTitle']]
49 )
KeyError: 'snippet'

Your issue stems from the fact that the property resultsPerPage should not be used as an indicator for the size of the array items.
The proper way to iterate over the items obtained from the API is as follows (this is also the general Pythonic way of doing this kind of iteration):
for item in response['items']:
    rows.append([
        item['id']['videoId'],
        item['snippet']['title'],
        item['snippet']['channelTitle']
    ])
You may well add something like the debugging code below to your code to convince yourself of the claim I made:
print(f"resultsPerPage={response['pageInfo']['resultsPerPage']}")
print(f"len(items)={len(response['items'])}")

Related

Python web scraper not iterating through a list of URL tokens

I am trying to scrape some film critic review information from Rotten Tomatoes. I've managed to access the JSON API and am able to pull the data I need from the first page of reviews using the code below. The problem is I can't get my code to iterate through a list of URL tokens I've created that should allow my scraper to move on to other pages of the data. The current code pulls data from the first page 11 times (resulting in 550 entries for the same 50 reviews). It always seems like the simplest things trip me up with this stuff. Can anyone tell me what I'm doing wrong?
import requests
import pandas as pd
baseurl = "https://www.rottentomatoes.com/napi/critics/"
endpoint = "alison-willmore/movies?"
tokens = [" ", "after=MA%3D%3D", "after=MQ%3D%3D", "after=Mg%3D%3D", "after=Mw%3D%3D", "after=NA%3D%3D",
"after=NQ%3D%3D", "after=Ng%3D%3D", "after=Nw%3D%3D", "after=OA%3D%3D", "after=OQ$3D%3D"]
def tokenize(t):
    for y in t:
        return y

def main_request(baseurl, endpoint, x):
    r = requests.get(baseurl + endpoint + f'{x}')
    return r.json()

reviews = []

def parse_json(response):
    for item in response["reviews"]:
        review_info = {
            'film_title': item["mediaTitle"],
            'release_year': item["mediaInfo"],
            'full_review_url': item["url"],
            'review_date': item["date"],
            'rt_quote': item["quote"],
            'film_info_link': item["mediaUrl"]}
        reviews.append(review_info)

for t in tokens:
    data = main_request(baseurl, endpoint, tokenize(t))
    parse_json(data)
    print(len(reviews))

d_frame = pd.DataFrame(reviews)
print(d_frame.head(), d_frame.tail())
The issue is with your main_request method. Using a print statement or a debugger is useful in this situation: it shows that your code is requesting the same URL every time, so your concatenation isn't working the way you expect. The next step is to figure out why 'a' is the only thing being added to the URL.
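For instance, a temporary variant of main_request with a debug print (hypothetical; the original answer did not show the exact snippet) makes this visible:

def main_request(baseurl, endpoint, x):
    url = baseurl + endpoint + f'{x}'
    print(url)                      # temporary: shows the URL that is actually requested
    r = requests.get(url)
    return r.json()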
def tokenize(t):
    for y in t:
        return y
This is your culprit. It iterates over the characters of the string you pass in and returns the first one, which is the letter 'a' (from after=SOMETHING). If that doesn't make sense, I would give this a read; a short standalone demonstration also follows the suggested changes below. What are you trying to accomplish here? I'm guessing you're trying to encode the URL? Either way, try this:
# Change the parse_json method, this is safer. The way you had it would crash if data was invalid
def parse_json(response):
    for item in response["reviews"]:
        review_info = {
            'film_title': item.get("mediaTitle", None),
            'release_year': item.get("mediaInfo", None),
            'full_review_url': item.get("url", None),
            'review_date': item.get("date", None),
            'rt_quote': item.get("quote", None),
            'film_info_link': item.get("mediaUrl", None)}
        reviews.append(review_info)
Change the loop to this or update your tokenize method.
for t in tokens:
    data = main_request(baseurl, endpoint, t)
    parse_json(data)
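As a standalone illustration (not part of the original answers), iterating over a Python string yields its characters, so tokenize simply returns the first character of whatever token it is given:

def tokenize(t):
    for y in t:
        return y                         # returns as soon as the first character is seen

print(tokenize("after=MA%3D%3D"))        # prints: a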
Your tokenize function is the problem. You feed it a string (e.g. "after=MA%3D%3D") and it just gives you back the first letter, 'a'. Just get rid of tokenize.
for t in tokens:
    data = main_request(baseurl, endpoint, t)
    parse_json(data)
This fixes your problem, but creates a new one for you: apparently, the keys you are getting from the JSON do not always exist. A simple fix is to refactor a bit. Using .get will either return the value or the second argument if the key does not exist.
def parse_json(response):
    for item in response["reviews"]:
        review_info = {
            'film_title'     : item.get("mediaTitle", None),
            'release_year'   : item.get("mediaInfo", None),
            'full_review_url': item.get("url", None),
            'review_date'    : item.get("date", None),
            'rt_quote'       : item.get("quote", None),
            'film_info_link' : item.get("mediaUrl", None)
        }
        reviews.append(review_info)
Your last token has a typo. after=OQ$.... should be after=OQ%....
If you are interested, I refactored your entire code. IMO, your code is unnecessarily fragmented, which is probably what led to your issues.
import requests
import pandas as pd

baseurl = "https://www.rottentomatoes.com/napi/critics/"
endpoint = "alison-willmore/movies?"
tokens = ("MA%3D%3D", "MQ%3D%3D", "Mg%3D%3D", "Mw%3D%3D", "NA%3D%3D",
          "NQ%3D%3D", "Ng%3D%3D", "Nw%3D%3D", "OA%3D%3D", "OQ%3D%3D")

#filter
def filter_json(response):
    for item in response["reviews"]:
        yield {
            'film_title'     : item.get("mediaTitle", None),
            'release_year'   : item.get("mediaInfo" , None),
            'full_review_url': item.get("url"       , None),
            'review_date'    : item.get("date"      , None),
            'rt_quote'       : item.get("quote"     , None),
            'film_info_link' : item.get("mediaUrl"  , None)
        }

#load
reviews = []
for t in tokens:
    r = requests.get(f'{baseurl}{endpoint}after={t}')
    for review in filter_json(r.json()):
        reviews.append(review)

#use
d_frame = pd.DataFrame(reviews)
print(len(reviews))
print(d_frame.head(), d_frame.tail())

Gathering API payload (Dell DataIQ)

Trying to figure out how to extract the payload from a Dell DataIQ server. From the documentation the call looks like:
def updateShareReports(api):
    request = claritynowapi.FastStatRequest()
    request.resultType = claritynowapi.FastStatRequest.ALL_PATHS
    subRequest = claritynowapi.SubRequest()
    subRequest.name = 'Shares'
    request.requests.append(subRequest)
    results = api.report(request)
    report = results.requests[0].results
    print(report)
I added the print line in there hoping to see the output, but what I get is: [<claritynowapi.PathResult object at 0x00000233D1C8B700>]
Any suggestions on what I need to do to see the output as JSON?
Thanks
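No answer is recorded here, but as a generic Python sketch (assuming PathResult instances are plain objects whose attributes live in __dict__; this is an assumption, not documented claritynowapi behaviour), you can often inspect and dump such objects like this:

import json

# 'report' is the list of PathResult objects from results.requests[0].results;
# vars(report[0]) or dir(report[0]) are quick ways to see which attributes exist
print(json.dumps(report, default=lambda o: vars(o), indent=2))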

Export tensorboard (with pytorch) data into csv with python

I have TensorBoard data and want to download all of the CSV files behind it, but I could not find anything in the official documentation. On Stack Overflow I found only this question, which is 7 years old, and it is about TensorFlow while I am using PyTorch.
This can be done manually (as the screenshot shows, there is an option for it), but I have a lot of data to process, so I wonder whether it can be done via code or if that is not possible.
With the help of this script, below is the shortest working code: it gets all of the data into a dataframe, and then you can process it further.
import traceback
import pandas as pd
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Extraction function
def tflog2pandas(path):
    runlog_data = pd.DataFrame({"metric": [], "value": [], "step": []})
    try:
        event_acc = EventAccumulator(path)
        event_acc.Reload()
        tags = event_acc.Tags()["scalars"]
        for tag in tags:
            event_list = event_acc.Scalars(tag)
            values = list(map(lambda x: x.value, event_list))
            step = list(map(lambda x: x.step, event_list))
            r = {"metric": [tag] * len(step), "value": values, "step": step}
            r = pd.DataFrame(r)
            runlog_data = pd.concat([runlog_data, r])
    # Dirty catch of DataLossError
    except Exception:
        print("Event file possibly corrupt: {}".format(path))
        traceback.print_exc()
    return runlog_data

path = "Run1"  # folder path
df = tflog2pandas(path)
#df = df[(df.metric != 'params/lr') & (df.metric != 'params/mm') & (df.metric != 'train/loss')]  # delete the mentioned rows
df.to_csv("output.csv")

Using a variable from a dictionary in a loop to attach to an API call

I'm calling a LinkedIn API with the code below and it does what I want.
However, when I use almost identical code inside a loop, it returns a type error:
File "C:\Users\pchmurzynski\OneDrive - Centiq Ltd\Documents\Python\mergedreqs.py", line 54, in <module>
auth_headers = headers(access_token)
TypeError: 'dict' object is not callable
It has a problem with this line (which again, works fine outside of the loop):
headers = headers(access_token)
I tried changing it to
headers = headers.get(access_token)
or
headers = headers[access_token]
EDIT:
I have also tried this, with the same error:
auth_headers = headers(access_token)
But it didn't help. What am I doing wrong? Why does the dictionary work fine outside of the loop, but not inside of it and what should I do to make it work?
What I am hoping to achieve is a list of share statistics, fetched for each ID from the "shids" list, which I can save as JSON. That can be done with individual requests - one link for one ID,
(f'https://api.linkedin.com/v2/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&ugcPosts=List(urn%3Ali%3AugcPost%3A{shid})
or with a single request containing a list of IDs.
(f'https://api.linkedin.com/v2/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&ugcPosts=List(urn%3Ali%3AugcPost%3A{shid},urn%3Ali%3AugcPost%3A{shid2},...,urn%3Ali%3AugcPost%3A{shidx})
Updated code, thanks to your comments:
shlink = ("https://api.linkedin.com/v2/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&shares=List(urn%3Ali%3Ashare%3A{})")
#loop through the list of share ids and make an api request for each of them
shares = []
token = auth(credentials) # Authenticate the API
headers = fheaders(token) # Make the headers to attach to the API call.
for shid in shids:
#create a request link for each sh id
r = (shlink.format(shid))
#call the api
res = requests.get(r, headers = auth_headers)
share_stats = res.json()
#append the shares list with the responce
shares.append(share_stats["elements"])
works fine outside the loop
Because in the loop, you re-define the variable. I added print statements to show it:
from liapiauth import auth, headers  # one type

for ...:
    ...
    print(type(headers))
    headers = headers(access_token)  # now set to another type
    print(type(headers))
Lesson learned - don't overwrite your imports.
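A minimal repro of that mistake, using hypothetical names rather than the LinkedIn code, shows the mechanics:

def headers(token):                      # stands in for the function imported from liapiauth
    return {"Authorization": f"Bearer {token}"}

for access_token in ["token-1", "token-2"]:
    headers = headers(access_token)      # first pass works, but 'headers' is now a dict
    # the second pass raises TypeError: 'dict' object is not callable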
Some refactors - your auth token isn't changing, so don't put it in the loop; you can use one method for all LinkedIn API queries:
from liapiauth import auth, headers
import requests

API_PREFIX = 'https://api.linkedin.com/v2'
SHARES_ENDPOINT_FMT = '/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&shares=List(urn%3Ali%3Ashare%3A{}'

def get_linkedin_response(endpoint, headers):
    return requests.get(API_PREFIX + endpoint, headers=headers)

def main(access_token=None):
    if access_token is None:
        raise ValueError('Access-Token not defined')
    auth_headers = headers(access_token)
    shares = []
    for shid in shids:
        endpoint = SHARES_ENDPOINT_FMT.format(shid)
        resp = get_linkedin_response(endpoint, auth_headers)
        if resp.status_code // 100 == 2:
            share_stats = resp.json()
            shares.append(share_stats[1])
    # TODO: extract your data here
    idlist = [el["id"] for el in shares_list["elements"]]

if __name__ == '__main__':
    credentials = 'credentials.json'
    main(auth(credentials))

How to resolve - ValueError: cannot set using a multi-index selection indexer with a different length than the value in Python

I have some sample code that I use to analyze entities and their sentiments using Google's Natural Language API. For every record in my Pandas dataframe, I want to return a list of dictionaries where each element is an entity. However, I am running into issues when trying to get it to work on the production data. Here is the sample code:
from google.cloud import language_v1  # version 2.0.0
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/json'
import pandas as pd

# establish client connection
client = language_v1.LanguageServiceClient()

# helper function
def custom_analyze_entity(text_content):
    global client
    #print("Accepted Input::" + text_content)
    document = language_v1.Document(content=text_content, type_=language_v1.Document.Type.PLAIN_TEXT, language = 'en')
    response = client.analyze_entity_sentiment(request = {'document': document})
    # a document can have many entities
    # create a list of dictionaries, every element in the list is a dictionary that represents an entity
    # the dictionary is nested
    l = []
    #print("Entity response:" + str(response.entities))
    for entity in response.entities:
        #print('=' * 20)
        temp_dict = {}
        temp_meta_dict = {}
        temp_mentions = {}
        temp_dict['name'] = entity.name
        temp_dict['type'] = language_v1.Entity.Type(entity.type_).name
        temp_dict['salience'] = str(entity.salience)
        sentiment = entity.sentiment
        temp_dict['sentiment_score'] = str(sentiment.score)
        temp_dict['sentiment_magnitude'] = str(sentiment.magnitude)
        for metadata_name, metadata_value in entity.metadata.items():
            temp_meta_dict['metadata_name'] = metadata_name
            temp_meta_dict['metadata_value'] = metadata_value
        temp_dict['metadata'] = temp_meta_dict
        for mention in entity.mentions:
            temp_mentions['mention_text'] = str(mention.text.content)
            temp_mentions['mention_type'] = str(language_v1.EntityMention.Type(mention.type_).name)
        temp_dict['mentions'] = temp_mentions
        #print(u"Appended Entity::: {}".format(temp_dict))
        l.append(temp_dict)
    return l
I have tested it on sample data and it works fine
# works on sample data
data = ['Grapes are good. Bananas are bad.', 'the weather is not good today', 'Michelangelo Caravaggio, Italian painter, is known for many arts',
        'look i cannot articulate how i feel today but its amazing to be back on the field with runs under my belt.']
input_df = pd.DataFrame(data=data, columns = ['freeform_text'])

for i in range(len(input_df)):
    op = custom_analyze_entity(input_df.loc[i,'freeform_text'])
    input_df.loc[i, 'entity_object'] = op
But when I try to run it on the production data using the code below, it fails with a multi-index error. I am not able to reproduce the error using the sample pandas dataframe.
for i in range(len(input_df)):
    op = custom_analyze_entity(input_df.loc[i,'freeform_text'])
    input_df.loc[i, 'entity_object'] = op
...
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/opt/conda/default/lib/python3.6/site-packages/pandas/core/indexing.py", line 670, in __setitem__
    iloc._setitem_with_indexer(indexer, value)
  File "/opt/conda/default/lib/python3.6/site-packages/pandas/core/indexing.py", line 1667, in _setitem_with_indexer
    "cannot set using a multi-index "
ValueError: cannot set using a multi-index selection indexer with a different length than the value
Try doing this:
input_df.loc[0, 'entity_object'] = ""
for i in range(len(input_df)):
    op = custom_analyze_entity(input_df.loc[i,'freeform_text'])
    input_df.loc[i, 'entity_object'] = op
Or for your specific case, you don't need to use the loc function.
input_df["entity_object"] = ""
for i in range(len(input_df)):
    op = custom_analyze_entity(input_df.loc[i,'freeform_text'])
    input_df["entity_object"][i] = op
