I am trying to scrape some film critic review information from Rotten Tomatoes. I've managed to access the JSON API files and am able to pull the data I need from the first page of reviews using the code below. The problem is that I can't get my code to iterate through a list of URL tokens I've created that should allow my scraper to move on to other pages of the data. The current code pulls data from the first page 11 times (resulting in 550 entries for the same 50 reviews). It always seems like the simplest things trip me up with this stuff. Can anyone tell me what I'm doing wrong?
import requests
import pandas as pd
baseurl = "https://www.rottentomatoes.com/napi/critics/"
endpoint = "alison-willmore/movies?"
tokens = [" ", "after=MA%3D%3D", "after=MQ%3D%3D", "after=Mg%3D%3D", "after=Mw%3D%3D", "after=NA%3D%3D",
"after=NQ%3D%3D", "after=Ng%3D%3D", "after=Nw%3D%3D", "after=OA%3D%3D", "after=OQ$3D%3D"]
def tokenize(t):
    for y in t:
        return y
def main_request(baseurl, endpoint, x):
    r = requests.get(baseurl + endpoint + f'{x}')
    return r.json()
reviews = []
def parse_json(response):
    for item in response["reviews"]:
        review_info = {
            'film_title': item["mediaTitle"],
            'release_year': item["mediaInfo"],
            'full_review_url': item["url"],
            'review_date': item["date"],
            'rt_quote': item["quote"],
            'film_info_link': item["mediaUrl"]}
        reviews.append(review_info)
for t in tokens:
    data = main_request(baseurl, endpoint, tokenize(t))
    parse_json(data)
print(len(reviews))
d_frame = pd.DataFrame(reviews)
print(d_frame.head(), d_frame.tail())
The issue is with your main_request method. Using a print statement or a debugger is useful in this situation: it shows that your code is requesting the same URL every time, so the concatenation isn't doing what you expect. The next step is to figure out why the letter a is the only thing that ends up appended to the URL.
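For example, a minimal sketch of that kind of debug print (the url variable is only introduced here for illustration):
import requests

def main_request(baseurl, endpoint, x):
    url = baseurl + endpoint + f'{x}'
    print(url)  # every call after the first prints ".../alison-willmore/movies?a"
    r = requests.get(url)
    return r.json()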
def tokenize(t):
    for y in t:
        return y
This tokenize function is your culprit. It loops over the characters of the token string and returns on the very first iteration, so all you ever get back is the first character, which is the letter a (from after=SOMETHING). If that doesn't make sense, it's worth reading up on how return behaves inside a for loop. What are you trying to accomplish here? I'm guessing you're trying to encode the URL? Either way, try this:
# Change the parse_json method, this is safer. The way you had it would crash if data was invalid
def parse_json(response):
    for item in response["reviews"]:
        review_info = {
            'film_title': item.get("mediaTitle", None),
            'release_year': item.get("mediaInfo", None),
            'full_review_url': item.get("url", None),
            'review_date': item.get("date", None),
            'rt_quote': item.get("quote", None),
            'film_info_link': item.get("mediaUrl", None)}
        reviews.append(review_info)
Change the loop to this or update your tokenize method.
for t in tokens:
    data = main_request(baseurl, endpoint, t)
    parse_json(data)
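As an aside, if the intent behind tokenize was URL-encoding, the standard library already handles that; a small optional sketch (the plain cursor value "MA==" is my assumption about the un-encoded form):
from urllib.parse import quote

# "=" percent-encodes to "%3D", so "MA==" becomes "MA%3D%3D"
print("after=" + quote("MA==", safe=""))  # after=MA%3D%3D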
Your tokenize function is the problem. You feed it a token (e.g. "after=MA%3D%3D") and it just gives you back the first letter, 'a'. Just get rid of tokenize.
for t in tokens:
    data = main_request(baseurl, endpoint, t)
    parse_json(data)
This fixes your problem, but it creates a new one for you: apparently the keys you are getting from the JSON do not always exist. A simple fix is to refactor a bit. Using .get will either return the value or the second argument if the key does not exist.
def parse_json(response):
    for item in response["reviews"]:
        review_info = {
            'film_title'     : item.get("mediaTitle", None),
            'release_year'   : item.get("mediaInfo", None),
            'full_review_url': item.get("url", None),
            'review_date'    : item.get("date", None),
            'rt_quote'       : item.get("quote", None),
            'film_info_link' : item.get("mediaUrl", None)
        }
        reviews.append(review_info)
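For clarity, a toy example of that behaviour (the dict contents here are made up):
item = {"mediaTitle": "Some Film"}
print(item.get("mediaTitle", None))  # Some Film
print(item.get("mediaInfo", None))   # None -- a plain item["mediaInfo"] would raise KeyError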
Your last token has a typo. after=OQ$.... should be after=OQ%....
If you are interested, I refactored your entire code. IMO, your code is unnecessarily fragmented, which is probably what led to your issues.
import requests
import pandas as pd
baseurl = "https://www.rottentomatoes.com/napi/critics/"
endpoint = "alison-willmore/movies?"
tokens = ("MA%3D%3D", "MQ%3D%3D", "Mg%3D%3D", "Mw%3D%3D", "NA%3D%3D",
"NQ%3D%3D", "Ng%3D%3D", "Nw%3D%3D", "OA%3D%3D", "OQ%3D%3D")
#filter
def filter_json(response):
    for item in response["reviews"]:
        yield {
            'film_title'     : item.get("mediaTitle", None),
            'release_year'   : item.get("mediaInfo" , None),
            'full_review_url': item.get("url"       , None),
            'review_date'    : item.get("date"      , None),
            'rt_quote'       : item.get("quote"     , None),
            'film_info_link' : item.get("mediaUrl"  , None)
        }
#load
reviews = []
for t in tokens:
    r = requests.get(f'{baseurl}{endpoint}after={t}')
    for review in filter_json(r.json()):
        reviews.append(review)
#use
d_frame = pd.DataFrame(reviews)
print(len(reviews))
print(d_frame.head(), d_frame.tail())
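One further optional tweak, shown only as a sketch: requests can build and percent-encode the query string itself via params, so the raw cursor values can be passed directly (note the endpoint path here drops the trailing "?"):
import requests

baseurl = "https://www.rottentomatoes.com/napi/critics/"
endpoint = "alison-willmore/movies"    # no trailing "?" when using params
cursors = ("MA==", "MQ==", "Mg==")     # requests percent-encodes "=" as %3D for you

for cursor in cursors:
    r = requests.get(baseurl + endpoint, params={"after": cursor})
    print(r.url)  # e.g. .../alison-willmore/movies?after=MA%3D%3D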
Trying to figure out how to extract the payload from a Dell DataIQ server. From the documentation, the call looks like this:
def updateShareReports(api):
    request = claritynowapi.FastStatRequest()
    request.resultType = claritynowapi.FastStatRequest.ALL_PATHS
    subRequest = claritynowapi.SubRequest()
    subRequest.name = 'Shares'
    request.requests.append(subRequest)
    results = api.report(request)
    report = results.requests[0].results
    print(report)
I added the print line in there hoping to see the output, but what I get is: [<claritynowapi.PathResult object at 0x00000233D1C8B700>]
Any suggestions on what I need to do to see the output as JSON?
Thanks
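As a generic starting point, the object's public attributes can be listed and dumped with Python built-ins alone; this sketch is not claritynowapi-specific, and it assumes the PathResult fields are plain, roughly JSON-serialisable attributes:
import json

# Collect the object's public, non-callable attributes and dump them as JSON;
# default=str covers values that are not natively JSON-serialisable.
for path_result in report:
    fields = {name: getattr(path_result, name)
              for name in dir(path_result)
              if not name.startswith('_') and not callable(getattr(path_result, name))}
    print(json.dumps(fields, default=str, indent=2))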
I have TensorBoard data and want to download all of the CSV files behind it, but I could not find anything in the official documentation. On Stack Overflow I found only one related question, which is 7 years old, and it's about TensorFlow while I am using PyTorch.
This can be done manually; as the screenshot shows, there is an option for it in the UI. I wonder whether it can be done via code, or whether that is not possible, as I have a lot of data to process.
With the help of this script, below is the shortest working code: it gets all of the data into a dataframe, and then you can work with it further.
import traceback
import pandas as pd
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
# Extraction function
def tflog2pandas(path):
    runlog_data = pd.DataFrame({"metric": [], "value": [], "step": []})
    try:
        event_acc = EventAccumulator(path)
        event_acc.Reload()
        tags = event_acc.Tags()["scalars"]
        for tag in tags:
            event_list = event_acc.Scalars(tag)
            values = list(map(lambda x: x.value, event_list))
            step = list(map(lambda x: x.step, event_list))
            r = {"metric": [tag] * len(step), "value": values, "step": step}
            r = pd.DataFrame(r)
            runlog_data = pd.concat([runlog_data, r])
    # Dirty catch of DataLossError
    except Exception:
        print("Event file possibly corrupt: {}".format(path))
        traceback.print_exc()
    return runlog_data

path = "Run1"  # folder path
df = tflog2pandas(path)
#df = df[(df.metric != 'params/lr') & (df.metric != 'params/mm') & (df.metric != 'train/loss')]  # delete the mentioned rows
df.to_csv("output.csv")
I'm calling a LinkedIn API with the code below and it does what I want.
However, when I use almost identical code inside a loop, it returns a type error:
File "C:\Users\pchmurzynski\OneDrive - Centiq Ltd\Documents\Python\mergedreqs.py", line 54, in <module>
auth_headers = headers(access_token)
TypeError: 'dict' object is not callable
It has a problem with this line (which, again, works fine outside of the loop):
headers = headers(access_token)
I tried changing it to
headers = headers.get(access_token)
or
headers = headers[access_token]
EDIT:
I have also tried this, with the same error:
auth_headers = headers(access_token)
But it didn't help. What am I doing wrong? Why does the dictionary work fine outside of the loop but not inside of it, and what should I do to make it work?
What I am hoping to achieve is a list of share statistics, one set for each ID from the "shids" list, which I can then save as JSON. That can be done with individual requests - one link for one ID:
(f'https://api.linkedin.com/v2/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&ugcPosts=List(urn%3Ali%3AugcPost%3A{shid})
or with a single request that takes a list of IDs:
(f'https://api.linkedin.com/v2/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&ugcPosts=List(urn%3Ali%3AugcPost%3A{shid},urn%3Ali%3AugcPost%3A{shid2},...,urn%3Ali%3AugcPost%3A{shidx})
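For the multi-ID variant, the List(...) part can be built from the ID list; a sketch based on the URL pattern above (shids is assumed to be a list of ugcPost ID strings):
ugc_posts = ",".join("urn%3Ali%3AugcPost%3A{}".format(shid) for shid in shids)
url = ("https://api.linkedin.com/v2/organizationalEntityShareStatistics"
       "?q=organizationalEntity"
       "&organizationalEntity=urn%3Ali%3Aorganization%3A77487"
       "&ugcPosts=List({})".format(ugc_posts))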
Updated code, thanks to your comments:
shlink = ("https://api.linkedin.com/v2/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&shares=List(urn%3Ali%3Ashare%3A{})")
#loop through the list of share ids and make an api request for each of them
shares = []
token = auth(credentials) # Authenticate the API
headers = fheaders(token) # Make the headers to attach to the API call.
for shid in shids:
#create a request link for each sh id
r = (shlink.format(shid))
#call the api
res = requests.get(r, headers = auth_headers)
share_stats = res.json()
#append the shares list with the responce
shares.append(share_stats["elements"])
"works fine outside the loop"
Because in the loop you re-define the variable. I've added print statements to show it:
from liapiauth import auth, headers  # one type

for ...:
    ...
    print(type(headers))
    headers = headers(access_token)  # now set to another type
    print(type(headers))
Lesson learned - don't overwrite your imports.
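A stripped-down illustration of the same shadowing problem (the names here are made up):
def make_headers(token):
    return {"Authorization": "Bearer " + token}

make_headers = make_headers("abc")   # the name now points to a dict, not the function
make_headers("abc")                  # TypeError: 'dict' object is not callable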
Some refactors - your auth token isn't changing, so don't build it inside the loop; you can use one method for all LinkedIn API queries:
from liapiauth import auth, headers
import requests

API_PREFIX = 'https://api.linkedin.com/v2'
SHARES_ENDPOINT_FMT = '/organizationalEntityShareStatistics?q=organizationalEntity&organizationalEntity=urn%3Ali%3Aorganization%3A77487&shares=List(urn%3Ali%3Ashare%3A{})'

def get_linkedin_response(endpoint, headers):
    return requests.get(API_PREFIX + endpoint, headers=headers)

def main(access_token=None):
    if access_token is None:
        raise ValueError('Access-Token not defined')
    auth_headers = headers(access_token)
    shares = []
    for shid in shids:
        endpoint = SHARES_ENDPOINT_FMT.format(shid)
        resp = get_linkedin_response(endpoint, auth_headers)
        if resp.status_code // 100 == 2:
            share_stats = resp.json()
            shares.append(share_stats["elements"])
            # TODO: extract your data here
            idlist = [el["id"] for el in share_stats["elements"]]

if __name__ == '__main__':
    credentials = 'credentials.json'
    main(auth(credentials))
I have some sample code that I use to analyze entities and their sentiments using Google's Natural Language API. For every record in my Pandas dataframe, I want to return a list of dictionaries where each element is an entity. However, I am running into issues when trying to make it work on the production data. Here is the sample code:
from google.cloud import language_v1 # version 2.0.0
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/json'
import pandas as pd
# establish client connection
client = language_v1.LanguageServiceClient()
# helper function
def custom_analyze_entity(text_content):
    global client
    #print("Accepted Input::" + text_content)
    document = language_v1.Document(content=text_content, type_=language_v1.Document.Type.PLAIN_TEXT, language='en')
    response = client.analyze_entity_sentiment(request={'document': document})
    # a document can have many entities
    # create a list of dictionaries, every element in the list is a dictionary that represents an entity
    # the dictionary is nested
    l = []
    #print("Entity response:" + str(response.entities))
    for entity in response.entities:
        #print('=' * 20)
        temp_dict = {}
        temp_meta_dict = {}
        temp_mentions = {}
        temp_dict['name'] = entity.name
        temp_dict['type'] = language_v1.Entity.Type(entity.type_).name
        temp_dict['salience'] = str(entity.salience)
        sentiment = entity.sentiment
        temp_dict['sentiment_score'] = str(sentiment.score)
        temp_dict['sentiment_magnitude'] = str(sentiment.magnitude)
        for metadata_name, metadata_value in entity.metadata.items():
            temp_meta_dict['metadata_name'] = metadata_name
            temp_meta_dict['metadata_value'] = metadata_value
        temp_dict['metadata'] = temp_meta_dict
        for mention in entity.mentions:
            temp_mentions['mention_text'] = str(mention.text.content)
            temp_mentions['mention_type'] = str(language_v1.EntityMention.Type(mention.type_).name)
        temp_dict['mentions'] = temp_mentions
        #print(u"Appended Entity::: {}".format(temp_dict))
        l.append(temp_dict)
    return l
I have tested it on sample data and it works fine
# works on sample data
data= ['Grapes are good. Bananas are bad.', 'the weather is not good today', 'Michelangelo Caravaggio, Italian painter, is known for many arts','look i cannot articulate how i feel today but its amazing to be back on the field with runs under my belt.']
input_df = pd.DataFrame(data=data, columns = ['freeform_text'])
for i in range(len(input_df)):
    op = custom_analyze_entity(input_df.loc[i, 'freeform_text'])
    input_df.loc[i, 'entity_object'] = op
But when I try to run it on the production data using the code below, it fails with a multi-index error. I am not able to reproduce the error using the sample pandas dataframe.
for i in range(len(input_df)):
    op = custom_analyze_entity(input_df.loc[i, 'freeform_text'])
    input_df.loc[i, 'entity_object'] = op
...
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/opt/conda/default/lib/python3.6/site-packages/pandas/core/indexing.py", line 670, in __setitem__
    iloc._setitem_with_indexer(indexer, value)
  File "/opt/conda/default/lib/python3.6/site-packages/pandas/core/indexing.py", line 1667, in _setitem_with_indexer
    "cannot set using a multi-index "
ValueError: cannot set using a multi-index selection indexer with a different length than the value
Try doing this:
input_df.loc[0, 'entity_object'] = ""
for i in range(len(input_df)):
    op = custom_analyze_entity(input_df.loc[i, 'freeform_text'])
    input_df.loc[i, 'entity_object'] = op
Or for your specific case, you don't need to use the loc function.
input_df["entity_object"] = ""
for i in range(len(input_df)):
    op = custom_analyze_entity(input_df.loc[i, 'freeform_text'])
    input_df["entity_object"][i] = op