youtube data api comment paging - python

I'm struggling a little bit with the syntax for iterating through all comments on a youtube video. I'm using the python and have found little documentation on the GetYouTubeVideoCommentFeed() function.
What I'm really trying to do is search all comments of a video for an instance of a word and increase a counter (eventually the comment will be printed out). It functions for the 25 results returned, but I need to access the rest of the comments.
import gdata.youtube
import gdata.youtube.service
video_id = 'hMnk7lh9M3o'
yt_service = gdata.youtube.service.YouTubeService()
comment_feed = yt_service.GetYouTubeVideoCommentFeed(video_id=video_id)
for comment_entry in comment_feed.entry:
comment = comment_entry.content.text
if comment.find('hi') != -1:
counter = counter + 1
print "hi: "
print counter
I tried to set the start_index of GetYouTubeVideoCommentFeed() in addition to the video_id but it didn't like that.
Is there something I'm missing?
Thanks!
Steve

Here's the code snippet for the same:
# Comment feed URL
comment_feed_url = "http://gdata.youtube.com/feeds/api/videos/%s/comments"
''' Get the comment feed of a video given a video_id'''
def WriteCommentFeed(video_id, data_file):
url = comment_feed_url % video_id
comment_feed = yt_service.GetYouTubeVideoCommentFeed(uri=url)
try:
while comment_feed:
for comment_entry in comment_feed.entry:
print comment_entry.id.text
print comment_entry.author[0].name.text
print comment_entry.title.text
print comment_entry.published.text
print comment_entry.updated.text
print comment_entry.content.text
comment_feed = yt_service.Query(comment_feed.GetNextLink().href)
except:
pass

Found out how to do it. Instead of passing a video_id to the GetYouTubeVideoCommentFeed function, you can pass it a URL. You can iterate through the comments by changing the URL parameters.
There must be an API limitation though; I can only access the last 1000 comments on the video.

Related

monitoring a text site (json) using python

IM working on a program to grab variant ID from this website
https://www.deadstock.ca/collections/new-arrivals/products/nike-air-max-1-cool-grey.json
Im using the code
import json
import requests
import time
endpoint = "https://www.deadstock.ca/collections/new-arrivals/products/nike-air-max-1-cool-grey.json"
req = requests.get(endpoint)
reqJson = json.loads(req.text)
for id in reqJson['product']:
name = (id['title'])
print (name)
I dont know what to do here in order to grab the Name of the items. If you visit the link you will see that the name is under 'title'. If you could help me with this that would be awesome.
I get the error message "TypeError: string indices must be integers" so im not too sure what to do.
Your biggest problem right now is that you are adding items to the list before you're checking if they're in it, so everything is coming back as in the list.
Looking at your code right now, I think what you want to do is combine things into a single for loop.
Also as a heads up you shouldn't use a variable name like list as it is shadowing the built-in Python function list().
list = [] # You really should change this to something else
def check_endpoint():
endpoint = ""
req = requests.get(endpoint)
reqJson = json.loads(req.text)
for id in reqJson['threads']: # For each id in threads list
PID = id['product']['globalPid'] # Get current PID
if PID in list:
print('checking for new products')
else:
title = (id['product']['title'])
Image = (id['product']['imageUrl'])
ReleaseType = (id['product']['selectionEngine'])
Time = (id['product']['effectiveInStockStartSellDate'])
send(title, PID, Image, ReleaseType, Time)
print ('added to database'.format(PID))
list.append(PID) # Add PID to the list
return
def main():
while(True):
check_endpoint()
time.sleep(20)
return
if __name__ == "__main__":
main()

urllib.urlopen() on variables that are assigned a string values wont work?

Im writing some code to parse an XML file. Im just wondering if someone could explain why this isn't working. If I put link itself into urllib.urlopen(), it does not seem to make it to that url. However, when I put "http://gdata.youtube.com/feeds/api/standardfeeds/top_rated?max- results=50&time=today" inside urllib.urlopen(), it works. Does it need to be a string and not a variable or is there a way around it?
import urllib
from bs4 import BeautifulSoup
class Uel(object):
def __init__(self, link):
self.content_data = []
self.num_likes = []
self.num_dislikes = []
self.favoritecount = []
self.view_count = []
self.link = link
self.web_obj = urllib.urlopen(link)
self.file = open('youtubequery.txt', 'w+')
self.file.write(str(self.web_obj))
for i in self.web_obj:
self.file.write(i)
with open("youtubequery.txt", "r") as myfile:
self.file_2=myfile.read()
self.soup = BeautifulSoup(self.file_2)
for link in self.soup.find_all("content"):
self.content_data.append(str(link.get("src")))
for stat in self.soup.find_all("yt:statistics"):
self.favoritecount.append(str(stat.get("favoritecount")))
for views in self.soup.find_all("yt:statistics"):
self.view_count.append(str(views.get("viewcount")))
for numlikes in self.soup.find_all("yt:rating"):
self.num_likes.append(str(numlikes.get("numlikes")))
for numdislikes in self.soup.find_all("yt:rating"):
self.num_dislikes.append(str(numdislikes.get("numdislikes")))
def __str__(self):
return str(self.content_data),str(self.num_likes), str(self.num_dislikes)
link = "http://gdata.youtube.com/feeds/api/standardfeeds/top_rated?max- results=50&time=5"
data = Uel(link)
print data.__str__()
In the code you've presented, you are using this url:
http://gdata.youtube.com/feeds/api/standardfeeds/top_rated?max- results=50&time=5
a request to which produces:
Invalid value for time parameter: 5
But, in the question itself, you've mentioned the following URL:
http://gdata.youtube.com/feeds/api/standardfeeds/top_rated?max- results=50&time=today
which has time=today. The code with this URL works for me.

Same code used in multiple functions but with minor differences - how to optimize?

This is the code of a Udacity course, and I changed it a little. Now, when it runs, it asks me for a movie name and the trailer would open in a pop up in a browser (that's another part, which is not shown).
As you can see, this program has a lot of repetitive code in it, the functions extract_name, movie_poster_url and movie_trailer_url have kind of the same code. Is there a way to get rid of the same code being repeated but have the same output? If so, will it run faster?
import fresh_tomatoes
import media
import urllib
import requests
from BeautifulSoup import BeautifulSoup
name = raw_input("Enter movie name:- ")
global movie_name
def extract_html(name):
url = "website name" + name + "continuation of website name" + name + "again continuation of web site name"
response = requests.get(url)
page = str(BeautifulSoup(response.content))
return page
def extract_name(page):
start_link = page.find(' - IMDb</a></h3><div class="s"><div class="kv"')
start_url = page.find('>',start_link-140)
start_url1 = page.find('>', start_link-140)
end_url = page.find(' - IMDb</a>', start_link-140)
name_of_movie = page[start_url1+1:end_url]
return extract_char(name_of_movie)
def extract_char(name_of_movie):
name_array = []
for words in name_of_movie:
word = words.strip('</b>,')
name_array.append(word)
return ''.join(name_array)
def movie_poster_url(name_of_movie):
movie_name, seperator, tail = name_of_movie.partition(' (')
#movie_name = name_of_movie.rstrip('()0123456789 ')
page = urllib.urlopen('another web site name' + movie_name + 'continuation of website name').read()
start_link = page.find('"Poster":')
start_url = page.find('"',start_link+9)
end_url = page.find('"',start_url+1)
poster_url = page[start_url+1:end_url]
return poster_url
def movie_trailer_url(name_of_movie):
movie_name, seperator, tail = name_of_movie.partition(' (')
#movie_name = name_of_movie.rstrip('()0123456789 ')
page = urllib.urlopen('another website name' + movie_name + " trailer").read()
start_link = page.find('<div class="yt-lockup-dismissable"><div class="yt-lockup-thumbnail contains-addto"><a aria-hidden="true" href=')
start_url = page.find('"',start_link+110)
end_url = page.find('" ',start_url+1)
trailer_url1 = page[start_url+1:end_url]
trailer_url = "www.youtube.com" + trailer_url1
return trailer_url
page = extract_html(name)
movie_name = extract_name(page)
new_movie = media.Movie(movie_name, "Storyline WOW", movie_poster_url(movie_name), movie_trailer_url(movie_name))
movies = [new_movie]
fresh_tomatoes.open_movies_page(movies)
You could move the shared parts into their own function:
def find_page(url, name, find, offset):
movie_name, seperator, tail = name_of_movie.partition(' (')
page = urllib.urlopen(url.format(name)).read()
start_link = page.find(find)
start_url = page.find('"',start_link+offset)
end_url = page.find('" ',start_url+1)
return page[start_url+1:end_url]
def movie_poster_url(name_of_movie):
return find_page("another website name{} continuation of website name", name_of_movie, '"Poster":', 9)
def movie_trailer_url(name_of_movie):
trailer_url = find_page("another website name{} trailer", name_of_movie, '<div class="yt-lockup-dismissable"><div class="yt-lockup-thumbnail contains-addto"><a aria-hidden="true" href=', 110)
return "www.youtube.com" + trailer_url
It definetely wont run faster (there is extra work to do to "switch" between the functions) but the performance difference is probably negligable.
For your second question: Profiling is not a technique or method, it's "finding out what's being bad" in your code:
Profiling is a form of
dynamic program analysis that measures, for example, the space
(memory) or time complexity of a program, the usage of particular
instructions, or the frequency and duration of function calls.
(wikipedia)
So it's not something that speeds up your program, it's a word for things you do to find out what you can do to speed up your program.
Going really quickly here because I am a super newb but I can see the repetition; what I would do is to figure out the (mostly) repeating blocks of code shared by all 3 functions and then figure out where they differ; write a new function that takes the differences as the arguments. so for instance:
def extract(tarString,delim,startDiff,endDiff):
start_link = page.find(tarString)
start_url = page.find(delim,start_link+startDiff)
end_url = page.find(delim,start_url+endDiff)
url_out = page[start_url+1:end_url]
Then, in your poster, trailer, etc functions, just call this extract function with the appropriate arguments for each case. ie poster would call
poster_url=extract(tarString='"Poster:"',delim='"',startDiff=9, endDiff=1)
I can see you've got another answer already and it's very likely it's written by someone who knows more than I do, but I hope you get something out of my "philosophy of modularizing" from a newbie perspective.

Why are the videos on the most_recent standard feed so out of date?

I'm trying to grab the most recently uploaded videos. There's a standard feed for that - it's called most_recent. I don't have any problems grabbing the feed, but when I look at the entries inside, they're all half a year old, which is hardly recent.
Here's the code I'm using:
import requests
import os.path as P
import sys
from lxml import etree
import datetime
namespaces = {"a": "http://www.w3.org/2005/Atom", "yt": "http://gdata.youtube.com/schemas/2007"}
fmt = "%Y-%m-%dT%H:%M:%S.000Z"
class VideoEntry:
"""Data holder for the video."""
def __init__(self, node):
self.entry_id = node.find("./a:id", namespaces=namespaces).text
published = node.find("./a:published", namespaces=namespaces).text
self.published = datetime.datetime.strptime(published, fmt)
def __str__(self):
return "VideoEntry[id='%s']" % self.entry_id
def paginate(xml):
root = etree.fromstring(xml)
next_page = root.find("./a:link[#rel='next']", namespaces=namespaces)
if next_page == None:
next_link = None
else:
next_link = next_page.get("href")
entries = [VideoEntry(e) for e in root.xpath("/a:feed/a:entry", namespaces=namespaces)]
return entries, next_link
prefix = "https://gdata.youtube.com/feeds/api/standardfeeds/"
standard_feeds = set("top_rated top_favorites most_shared most_popular most_recent most_discussed most_responded recently_featured on_the_web most_viewed".split(" "))
feed_name = sys.argv[1]
assert feed_name in standard_feeds
feed_url = prefix + feed_name
all_video_ids = []
while feed_url is not None:
r = requests.get(feed_url)
if r.status_code != 200:
break
text = r.text.encode("utf-8")
video_ids, feed_url = paginate(text)
all_video_ids += video_ids
all_upload_times = [e.published for e in all_video_ids]
print min(all_upload_times), max(all_upload_times)
As you can see, it prints the min and max timestamps for the entire feed.
misha#misha-antec$ python get_standard_feed.py most_recent
2013-02-02 14:40:02 2013-02-02 14:54:00
misha#misha-antec$ python get_standard_feed.py top_rated
2006-04-06 21:30:53 2013-07-28 22:22:38
I've glanced through the downloaded XML and it appears to match the output. Am I doing something wrong?
Also, on an unrelated note, the feeds I'm getting are all about 100 entries (I'm paginating through them 25 at a time). Is this normal? I expected the feeds to be a bit bigger.
Regarding the "Most-Recent-Feed"-Topic: There is a ticket for this one here. Unfortunately, the YouTube-API-Teams doesn't respond or solved the problem so far.
Regarding the number of entries: That depends on the type of standardfeed, but for the most-recent-Feed it´s usually around 100.
Note: You could try using the "orderby=published" parameter to get recents videos, although I don´t know how "recent" they are.
https://gdata.youtube.com/feeds/api/videos?orderby=published&prettyprint=True
You can combine this query with the "category"-parameter or other ones (region-specific queries - like for the standard feeds - are not possible, afaik).

How to add total page count in PDF using reportlab

def analysis_report(request):
response = HttpResponse(mimetype='application/pdf')
response['Content-Disposition'] = 'attachment;filename=ANALYSIS_REPORT.pdf'
buffer = StringIO()
doc = SimpleDocTemplate(buffer)
doc.sample_no = 12345
document = []
doc.build(document, onLaterPages=header_footer)
def header_footer(canvas, doc):
canvas.saveState()
canvas.setFont("Times-Bold", 11)
canvas.setFillColor(gray)
canvas.setStrokeColor('#5B80B2')
canvas.drawCentredString(310, 800, 'HEADER ONE GOES HERE')
canvas.drawString(440, 780, 'Sample No: %s' %doc.sample_no)
canvas.setFont('Times-Roman', 5)
canvas.drawString(565, 4, "Page %d" % doc.page)
I above code i can bale to display the page number, but my question is how can i display "Page X of Y" where Y is page count and X is current page.
I followed this http://code.activestate.com/recipes/546511-page-x-of-y-with-reportlab/, but they explained using canvasmaker, where as i'm using OnlaterPages argument in build.
How can i achieve the above thing using canvasmaker or is there any solution using OnLaterPages ?
Here is the improved recipe http://code.activestate.com/recipes/576832/ which should work with images.
Another possible workaround would be to use pyPDF (or any other pdf-lib with the funcionality) to read the total number of pages after doc.build() and then rebuild the story with the gathered information by exchanging the corresponding Paragraph()'s. This approach might be more hackish, but does the trick with no subclassing.
Example:
from pyPdf import PdfFileReader
[...]
story.append(Paragraph('temp paragraph. this will be exchanged with the total page number'))
post_story = story[:] #copy the story because build consumes it
doc.build(story) #build the pdf with name temp.pdf
temp_pdf = PdfFileReader(file("temp.pdf", "rb"))
total_pages = cert_temp.getNumPages()
post_story[-1] = Paragraph('total pages: ' + str(total_pages))
doc.build(post_story)

Categories

Resources