Python YouTube Page Token Issue

I'm running into an intermittent issue when I run the code below. I'm trying to collect all the page tokens from the ajax calls made by pressing the "load more" button, if it exists. Basically, I'm trying to get all the page tokens from a YouTube channel.
Sometimes it retrieves the tokens, and other times it doesn't. My best guess is that either I made a mistake in my find_embedded_page_token function, or I need some sort of delay/sleep inserted somewhere.
Below is the full code:
import requests

def find_XSRF_token(html, key, num_chars=2):
    # Grab the quoted value that follows `key` in the page source.
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('"', pos_begin)
    return html[pos_begin:pos_end]

def find_page_token(html, key, num_chars=2):
    # Grab everything between `key` and the next '&'.
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('&', pos_begin)
    return html[pos_begin:pos_end]

def find_embedded_page_token(html, key, num_chars=2):
    # Same as find_page_token, but also cut at the first backslash,
    # since the token is embedded inside escaped HTML.
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('&', pos_begin)
    excess_str = html[pos_begin:pos_end]
    sep = '\\'
    return excess_str.split(sep, 1)[0]
sxeVid = 'https://www.youtube.com/user/sxephil/videos'
ajaxStr = 'https://www.youtube.com/browse_ajax?action_continuation=1&continuation='

s = requests.Session()
r = s.get(sxeVid)
html = r.text

session_token = find_XSRF_token(html, 'XSRF_TOKEN', 4)
page_token = find_page_token(html, ';continuation=', 0)
print(page_token)

r = s.get(ajaxStr + page_token)
ajxHtml = r.text
ajax_page_token = find_embedded_page_token(ajxHtml, ';continuation=', 0)

while page_token:
    ajxBtn = ajxHtml.find('data-uix-load-more-href=')
    if ajxBtn != -1:
        r = s.get(ajaxStr + ajax_page_token)
        ajxHtml = r.text
        ajax_page_token = find_embedded_page_token(ajxHtml, ';continuation=', 0)
        print(ajax_page_token)
    else:
        break
This is what randomly comes back when it fails: it pulls not just the token, but also the HTML after the desired cutoff.
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D"><span class="yt-uix-button-content"> <span class="load-more-loading hid">
<span class="yt-spinner">
<span class="yt-spinner-img yt-sprite" title="Loading icon"></span>
The response I'm expecting is this:
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5MZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5iZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5yZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk43Z0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk9MZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk9iZ0JBQSUzRCUzRA%253D%253D
Any help is greatly appreciated. Also, if my tags are wrong, let me know which tags to add or remove.
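For reference, a less fragile way to extract the token is to match only the characters that can legally appear in it, instead of slicing up to the next '&' or backslash. A minimal sketch, assuming the continuation token consists only of URL-safe base64 characters and percent escapes:
import re

def find_token(html, key=';continuation='):
    # Match only characters that can appear in the continuation token
    # (URL-safe base64 plus percent-encoding), so trailing HTML is never
    # captured even when the surrounding markup changes.
    m = re.search(re.escape(key) + r'([A-Za-z0-9_\-%]+)', html)
    return m.group(1) if m else None
With that, the trailing markup in the failing output above can never be captured, since '<' and '"' fall outside the character class.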

Related

Excluding 'duplicated' scraped URLs in Python app?

I've never used Python before, so excuse my lack of knowledge, but I'm trying to scrape a XenForo forum for all of its threads. So far so good, except that it's picking up multiple URLs for each page of the same thread. I've posted some data below to explain what I mean.
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-9
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-10
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/page-11
Ideally, what I want to scrape is just one of these:
forums/my-first-forum/: threads/my-gap-year-uni-story.13846/
Here is my script:
from bs4 import BeautifulSoup
import requests

def get_source(url):
    return requests.get(url).content

def is_forum_link(self):
    return self.find('special string') != -1

def fetch_all_links_with_word(url, word):
    source = get_source(url)
    soup = BeautifulSoup(source, 'lxml')
    return soup.select("a[href*=" + word + "]")

main_url = "http://example.com/forum/"
forumLinks = fetch_all_links_with_word(main_url, "forums")
forums = []
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        forums.append(link.attrs['href'])
print('Fetched ' + str(len(forums)) + ' forums')

threads = {}
for link in forums:
    threadLinks = fetch_all_links_with_word(main_url + link, "threads")
    for threadLink in threadLinks:
        print(link + ': ' + threadLink.attrs['href'])
        threads[link] = threadLink
print('Fetched ' + str(len(threads)) + ' threads')
This solution assumes that what should be removed from the URL to check for uniqueness is always going to be "/page-#...". If that is not the case, this solution will not work.
Instead of using a list to store your URLs, you can use a set, which will only add unique values. Then, before adding a URL to the set, remove the last instance of "page" and anything that comes after it if it is in the format "/page-#", where # is any number.
forums = set()
for link in forumLinks:
    if link.has_attr('href') and link.attrs['href'].find('.rss') == -1:
        url = link.attrs['href']
        position = url.rfind('/page-')
        # Only strip when "/page-" is followed by a digit, to avoid
        # mangling thread titles that happen to contain "page-".
        if position > 0 and url[position + 6:position + 7].isdigit():
            url = url[:position + 1]
        forums.add(url)
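The same normalization can also be done with a regex; a minimal sketch, assuming "page-<number>" only ever occurs as the pagination suffix:
import re

def canonical_thread_url(url):
    # Drop a trailing "page-<number>" segment (and anything after it),
    # e.g. ".../my-gap-year-uni-story.13846/page-9" -> ".../my-gap-year-uni-story.13846/"
    return re.sub(r'page-\d+.*$', '', url)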

Looping through Rocket.Chat API

Python 3.7.2
PyCharm
I'm fairly new to Python and API interaction; I'm trying to loop through the API for Rocket.Chat, specifically pulling user email addresses out.
Unlike nearly every example I can find, Rocket.Chat doesn't use any kind of construct like "next" - it uses count and offset, which I had actually thought might make this easier.
I have managed to get the first part of this working: looping over the JSON and getting the emails. What I need to do now is loop through the API endpoint's pages, which is where I've run into some issues.
I have looked at this answer, Unable to loop through paged API responses with Python, as it seemed to be pretty close to what I want, but I couldn't get it to work correctly.
The code below is what I have right now; obviously it isn't doing any looping through the API endpoint just yet, it's just looping over the returned JSON.
import os
import csv
import requests
import json

url = "https://rocketchat.internal.net"
login = "/api/v1/login"
rocketchatusers = "/api/v1/users.list"
# offset = "?count=500&offset=0"

class API:
    def userlist(self, userid, token):
        headers = {'X-Auth-Token': token, 'X-User-Id': userid}
        rocketusers = requests.get(url + rocketchatusers, headers=headers, verify=False)
        print('Status Code:' + str(rocketusers.status_code))
        print('Content Type:' + rocketusers.headers['content-type'])
        userlist = json.loads(rocketusers.text)
        x = 0
        y = 0
        emails = open('emails', 'w')
        while y == 0:
            try:
                for i in userlist:
                    print(userlist['users'][x]['emails'][0]['address'], file=emails)
                    # print(userlist['users'][x]['emails'][0]['address'])
                    x += 1
            except KeyError:
                print("This user has no email address", file=emails)
                x += 1
            except IndexError:
                print("End of List")
                emails.close()
                y += 1
What I have tried, and what I would like to do, is something along the lines of a simple for loop. Realistically, there are probably a lot of ways to do what I'm after; I just don't know them.
Something like this:
import os
import csv
import requests
import json

url = "https://rocketchat.internal.net"
login = "/api/v1/login"
rocketchatusers = "/api/v1/users.list"
offset = "?count=500&offset=" + p
p = 0

class API:
    def userlist(self, userid, token):
        headers = {'X-Auth-Token': token, 'X-User-Id': userid}
        rocketusers = requests.get(url + rocketchatusers + offset, headers=headers, verify=False)
        for r in rocketusers:
            print('Status Code:' + str(rocketusers.status_code))
            print('Content Type:' + rocketusers.headers['content-type'])
            userlist = json.loads(rocketusers.text)
            x = 0
            y = 0
            emails = open('emails', 'w')
            while y == 0:
                try:
                    for i in userlist:
                        print(userlist['users'][x]['emails'][0]['address'], file=emails)
                        # print(userlist['users'][x]['emails'][0]['address'])
                        x += 1
                except KeyError:
                    print("This user has no email address", file=emails)
                    x += 1
                except IndexError:
                    print("End of List")
                    emails.close()
                    y += 1
                    p += 500
Now, obviously this doesn't work, or I wouldn't be posting, but why it doesn't work is the issue.
The error reported is that an int can't be concatenated where a str is expected. OK, fine. When I attempt something like:
str(p = 0)
I get a TypeError. I have tried a lot of other things as well, many of them simply silly, such as p = [], p = {}, and other more radical ideas as well.
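For what it's worth, the concatenation error comes from joining an int to a str; a minimal sketch of two ways around it, reusing the url, rocketchatusers, and headers names from above:
p = 0
# Convert the offset at the moment the query string is built:
rocketusers = requests.get(url + rocketchatusers + "?count=500&offset=" + str(p),
                           headers=headers, verify=False)
# Or let requests encode the query string from a dict:
rocketusers = requests.get(url + rocketchatusers,
                           params={"count": 500, "offset": p},
                           headers=headers, verify=False)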
The URL, with all the variables concatenated, would look something like this:
https://rocketchat.internal.net/api/v1/users.list?count=500&offset=0
https://rocketchat.internal.net/api/v1/users.list?count=500&offset=500
https://rocketchat.internal.net/api/v1/users.list?count=500&offset=1000
https://rocketchat.internal.net/api/v1/users.list?count=500&offset=1500
I feel like there is something really simple that I'm missing. I'm reasonably sure the answer is in the response to the post I linked, but I couldn't get it to work.
So, after asking around, I found out that I had been on the right path to figuring this out; I had just tried it in the wrong place. Here's what I ended up with:
def userlist(self, userid, token):
    p = 0
    while p <= 7500:
        if not os.path.exists('./emails'):
            headers = {'X-Auth-Token': token, 'X-User-Id': userid}
            rocketusers = requests.get(url + rocketchatusers + offset + str(p), headers=headers, verify=False)
            print('Status Code:' + str(rocketusers.status_code))
            print('Content Type:' + rocketusers.headers['content-type'])
            print('Creating the file "emails" to use to compare against list of regulated users.')
            print(url + rocketchatusers + offset + str(p))
            userlist = json.loads(rocketusers.text)
            x = 0
            y = 0
            emails = open('emails', 'a+')
            while y == 0:
                try:
                    for i in userlist:
                        # print(userlist['users'][x]['emails'][0]['address'], file=emails)
                        print(userlist['users'][x]['ldap'], file=emails)
                        print(userlist['users'][x]['username'], file=emails)
                        x += 1
                except KeyError:
                    x += 1
                except IndexError:
                    print("End of List")
                    emails.close()
                    p += 50
                    y += 1
        else:
            headers = {'X-Auth-Token': token, 'X-User-Id': userid}
            rocketusers = requests.get(url + rocketchatusers + offset + str(p), headers=headers, verify=False)
            print('Status Code:' + str(rocketusers.status_code))
            print('Content Type:' + rocketusers.headers['content-type'])
            print('Populating file "emails" - this takes a few moments, please be patient.')
            print(url + rocketchatusers + offset + str(p))
            userlist = json.loads(rocketusers.text)
            x = 0
            z = 0
            emails = open('emails', 'a+')
            while z == 0:
                try:
                    for i in userlist:
                        # print(userlist['users'][x]['emails'][0]['address'], file=emails)
                        print(userlist['users'][x]['ldap'], file=emails)
                        print(userlist['users'][x]['username'], file=emails)
                        x += 1
                except KeyError:
                    x += 1
                except IndexError:
                    print("End of List")
                    emails.close()
                    p += 50
                    z += 1
This is still a work in progress; unfortunately, this isn't an avenue for collaboration. Later I may post this to GitHub so that others can see it.
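Since the two branches above differ only in their status messages, they could collapse into a single loop. A minimal sketch, assuming the instance has fewer than ~8,000 users and reusing the url and rocketchatusers names from above:
def userlist(self, userid, token):
    headers = {'X-Auth-Token': token, 'X-User-Id': userid}
    with open('emails', 'a+') as emails:
        for p in range(0, 8000, 500):  # step matches count=500, so pages never overlap
            resp = requests.get(url + rocketchatusers,
                                params={"count": 500, "offset": p},
                                headers=headers, verify=False)
            users = resp.json().get('users', [])
            if not users:
                break  # ran past the last page
            for user in users:
                print(user.get('ldap', ''), file=emails)
                print(user.get('username', ''), file=emails)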

Simpler way to log in to multiple sites/pages

I am not quite happy with the way I coded this. Is there a simpler, more convenient way to code this as one function and return the output of multiple pages?
def login():
    url = "http://192.168.2.45/pricelogin.php"
    r = requests.get(url, auth=('pstats', 'pStats'))
    page = r.text
    return page

def loginhighpricingerror():
    pricingerrorurl = "http://192.168.2.45/airline_error.pl"
    peu = requests.get(pricingerrorurl, auth=('pstats', 'pstats'))
    peupage = peu.text
    return peupage

def loginsuccessfullbookings():
    sucurl = "http://192.168.2.45/airlinessucbookings.php"
    suc = requests.get(sucurl, auth=('pstats', 'pstats'))
    sucpage = suc.text
    return sucpage
Use a session instead of the sessionless module-level functions:
s = requests.Session()
s.auth = ('pstats', 'pStats')

def login():
    url = "http://192.168.2.45/pricelogin.php"
    r = s.get(url)
    page = r.text
    return page

def loginhighpricingerror():
    pricingerrorurl = "http://192.168.2.45/airline_error.pl"
    peu = s.get(pricingerrorurl)
    peupage = peu.text
    return peupage

def loginsuccessfullbookings():
    sucurl = "http://192.168.2.45/airlinessucbookings.php"
    suc = s.get(sucurl)
    sucpage = suc.text
    return sucpage
Of course this should be refactored, but hopefully you can see what I mean.
I would generalize the login function, passing the URL as a parameter:
def login(url):
    try:
        r = requests.get(url, auth=('pstats', 'pStats'))
    except requests.exceptions.RequestException as e:
        print(e)
        return ''  # but maybe you want to do something else
    page = r.text
    return page
And then you can run it for each URL, accumulating the pages in a list, for example:
urls = ["http://192.168.2.45/pricelogin.php", "http://192.168.2.45/airline_error.pl", "http://192.168.2.45/airlinessucbookings.php"]
pages = []  # resulting list
for url in urls:
    pages.append(login(url))
Note: I added a check for an exception from requests.get, since this might fail when there is a connection problem.
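Combining the two answers, a session gives one place for auth and a single helper handles errors; a minimal sketch, reusing the urls list from above (the dict keyed by URL is just one way to hold the results):
import requests

s = requests.Session()
s.auth = ('pstats', 'pStats')

def fetch(url):
    try:
        r = s.get(url)
        r.raise_for_status()  # surface HTTP error statuses as exceptions too
        return r.text
    except requests.exceptions.RequestException as e:
        print(e)
        return ''

pages = {url: fetch(url) for url in urls}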

Python script works but fails after compilation (Windows)

I am working on a script to scrape a website. The problem is that it works normally when I run it with the interpreter, but after compiling it (PyInstaller or py2exe) it fails; it appears that mechanize/requests fails to keep the session alive.
I have hidden my username and password here, but I did put them in correctly in the compiled code.
import requests
from bs4 import BeautifulSoup as bs
from sys import argv
import re
import logging

url = argv[1]
payload = {"userName": "real_username", "password": "realpassword"}
session = requests.session()
resp = session.post("http://website.net/login.do", data=payload)
if "forgot" in resp.content:
    logging.error("Login failed")
    exit()
resp = session.get(url)
soup = bs(resp.content)
urlM = url[:url.find("?") + 1] + "page=(PLACEHOLDER)&" + \
       url[url.find("?") + 1:]
# Get number of pages
regex = re.compile(r"\|.*\|\sof\s(\d+)")
script = str(soup.findAll("script")[1])
epNum = int(re.findall(regex, script)[0])  # Number of EPs
pagesNum = epNum // 50
links = []
# Get list of links
# If number of EPs > 50, more than one page
if pagesNum == 0:
    links = [url]
else:
    for i in range(1, pagesNum + 2):
        url = urlM.replace("(PLACEHOLDER)", str(i))
        links.append(url)
# Loop over the links and extract info: ID, NAME, START_DATE, END_DATE
raw_info = []
for pos, link in enumerate(links):
    print "Processing page %d" % (pos + 1)
    sp = bs(session.get(link).content)
    table = sp.table.table
    raw_info.extend(table.findAll("td"))
epURL = "http://www.website.net/exchange/viewep.do?operation"\
        "=executeAction&epId="
# Final data extraction
raw_info = map(str, raw_info)
ids = [re.findall(r"\d+", i)[0] for i in raw_info[::4]]
names = [re.findall(r"<td>(.*)</td", i)[0] for i in raw_info[1::4]]
start_dates = [re.findall(r"<td>(.*)</td", i)[0] for i in raw_info[2::4]]
end_dates = [re.findall(r"<td>(.*)</td", i)[0] for i in raw_info[3::4]]
emails = []
eplinks = [epURL + str(i) for i in ids]
print names
The error happens at the epNum variable, which suggests, as I figured, that the HTML page is not the one I requested. On Linux it works both as a script and compiled; on Windows it works as a script but fails when compiled.
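One way to confirm that diagnosis is to dump the fetched HTML from both the interpreter run and the compiled build and diff them; a minimal sketch (debug_page.html is just an illustrative name):
# Write the page the compiled build actually received, so it can be
# diffed against what the interpreter version gets.
with open("debug_page.html", "w") as f:
    f.write(resp.content)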
The py2exe tutorial mentions that you need MSVCR90.dll; did you check that it's present on the PC?

Request empty result issue

I have this simple Python code, which returns the content of a URL and stores the result as a JSON text file named "file", but it keeps returning an empty result.
What am I doing wrong here? It is just simple code and I am so disappointed.
I have included all the imports needed: import facebook, import requests, and import json.
url ="https://graph.facebook.com/search?limit=5000&type=page&q=%26&access_token=xx&__after_id=139433456868"
content = requests.get(url).json()
file = open("file.txt" , 'w')
file.write(json.dumps(content, indent=1))
file.close()
Here is the result:
"data": []
Any help please?
It's working fine:
import urllib2
accesstoken="CAACEdEose0cBACF6HpTDEuVEwVnjx1sHOJFS3ZBQZBsoWqKKyFl93xwZCxKysqsQMUgOZBLjJoMurSxinn96pgbdfSYbyS9Hh3pULdED8Tv255RgnsYmnlxvR7JZCN7X25zP6fRnRK0ZCNmChfLPejaltwM2JGtPGYBQwnmAL9tQBKBmbZAkGYCEQHAbUf7k1YZD"
urllib2.urlopen("https://graph.facebook.com/search?limit=5000&type=page&q=%26&access_token="+accesstoken+"&__after_id=139433456868").read()
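For consistency with the question's code, a sketch of the requests equivalent of that call, using the same accesstoken variable:
import requests

# Same Graph API search call, issued through requests instead of urllib2.
resp = requests.get("https://graph.facebook.com/search?limit=5000&type=page&q=%26&access_token=" + accesstoken + "&__after_id=139433456868")
print(resp.text)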
I think you have not requested an access token before making the request.
How do you find the access token?
def getSecretToken(verification_code):
    token_url = ("https://graph.facebook.com/oauth/access_token?" +
                 "client_id=" + app_id +
                 "&redirect_uri=" + my_url +
                 "&client_secret=" + app_secret +
                 "&code=" + verification_code)
    response = requests.get(token_url).content
    params = {}
    result = response.split("&", 1)
    print result
    for p in result:
        (k, v) = p.split("=")
        params[k] = v
    return params['access_token']
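The same token exchange can let requests build the query string; a minimal sketch, assuming the app_id, my_url, and app_secret values from above:
# Let requests handle URL encoding of the OAuth parameters.
params = {
    "client_id": app_id,
    "redirect_uri": my_url,
    "client_secret": app_secret,
    "code": verification_code,
}
response = requests.get("https://graph.facebook.com/oauth/access_token", params=params)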
How do you get that verification code?
verification_code=""
if "code" in request.query:
verification_code = request.query["code"]
if not verification_code:
dialog_url = ( "http://www.facebook.com/dialog/oauth?" +
"client_id=" + app_id +
"&redirect_uri=" + my_url +
"&scope=publish_stream" )
return "<script>top.location.href='" + dialog_url + "'</script>"
else:
access_token = getSecretToken(verification_code)
