Python 3.7.2
PyCharm
I'm fairly new to Python, and API interaction; I'm trying to loop through the API for Rocket Chat, specifically pulling user email address's out.
Unlike nearly every example I can find, Rocket Chat doesn't use any kind of construct like "Next" - it uses count and offset, which I had actually
though might make this easier.
I have managed to get the first part of this working,
looping over the JSON and getting the emails. What I need to do, is loop through the API endpoints - which is what I have ran into some issue with.
I have looked at this answer Unable to loop through paged API responses with Python
as it seemed to be pretty close to what I want, but I couldn't get it to work correctly.
The code below, is what I have right now; obviously this isn't doing any looping through the API endpoint just yet, its just looping over the returned json.
import os
import csv
import requests
import json
url = "https://rocketchat.internal.net"
login = "/api/v1/login"
rocketchatusers = "/api/v1/users.list"
#offset = "?count=500&offset=0"
class API:
def userlist(self, userid, token):
headers = {'X-Auth-Token': token, 'X-User-Id': userid}
rocketusers = requests.get(url + rocketchatusers, headers=headers, verify=False)
print('Status Code:' + str(rocketusers.status_code))
print('Content Type:' + rocketusers.headers['content-type'])
userlist = json.loads(rocketusers.text)
x = 0
y = 0
emails = open('emails', 'w')
while y == 0:
try:
for i in userlist:
print(userlist['users'][x]['emails'][0]['address'], file=emails)
# print(userlist['users'][x]['emails'][0]['address'])
x += 1
except KeyError:
print("This user has no email address", file=emails)
x += 1
except IndexError:
print("End of List")
emails.close()
y += 1
What I have tried and what I would like to do, is something along the lines of an easy FOR loop. There are realistically probably a lot of ways to do what I'm after, I just don't know them.
Something like this:
import os
import csv
import requests
import json
url = "https://rocketchat.internal.net"
login = "/api/v1/login"
rocketchatusers = "/api/v1/users.list"
offset = "?count=500&offset="+p
p = 0
class API:
def userlist(self, userid, token):
headers = {'X-Auth-Token': token, 'X-User-Id': userid}
rocketusers = requests.get(url + rocketchatusers+offset, headers=headers, verify=False)
for r in rocketusers:
print('Status Code:' + str(rocketusers.status_code))
print('Content Type:' + rocketusers.headers['content-type'])
userlist = json.loads(rocketusers.text)
x = 0
y = 0
emails = open('emails', 'w')
while y == 0:
try:
for i in userlist:
print(userlist['users'][x]['emails'][0]['address'], file=emails)
# print(userlist['users'][x]['emails'][0]['address'])
x += 1
except KeyError:
print("This user has no email address", file=emails)
x += 1
except IndexError:
print("End of List")
emails.close()
y += 1
p += 500
Now, obviously this doesn't work, or I'd not be posting, but the why it doesn't work is the issue.
The error that get report is that I can't concatenate an INT, when a STR is expected. Ok, fine. When I attempt something like:
str(p = 0)
I get a type error. I have tried a lot of other things as well, many of them simply silly, such as p = [], p = {} and other more radical idea's as well.
The URL, if not all variables and concatenated would look something like this:
https://rocketchat.internal.net/api/v1/users.list?count=500&offset=0
https://rocketchat.internal.net/api/v1/users.list?count=500&offset=500
https://rocketchat.internal.net/api/v1/users.list?count=500&offset=1000
https://rocketchat.internal.net/api/v1/users.list?count=500&offset=1500
I feel like there is something really simple that I'm missing. I'm reasonably sure that the answer is in the response to the post I listed, but I couldn't get it to work.
So, after asking around, I found out that I had been on the right path to figuring this issue out, I had just tried in the wrong place. Here's what I ended up with:
def userlist(self, userid, token):
p = 0
while p <= 7500:
if not os.path.exists('./emails'):
headers = {'X-Auth-Token': token, 'X-User-Id': userid}
rocketusers = requests.get(url + rocketchatusers + offset + str(p), headers=headers, verify=False)
print('Status Code:' + str(rocketusers.status_code))
print('Content Type:' + rocketusers.headers['content-type'])
print('Creating the file "emails" to use to compare against list of regulated users.')
print(url + rocketchatusers + offset + str(p))
userlist = json.loads(rocketusers.text)
x = 0
y = 0
emails = open('emails', 'a+')
while y == 0:
try:
for i in userlist:
#print(userlist['users'][x]['emails'][0]['address'], file=emails)
print(userlist['users'][x]['ldap'], file=emails)
print(userlist['users'][x]['username'], file=emails)
x += 1
except KeyError:
x += 1
except IndexError:
print("End of List")
emails.close()
p += 50
y += 1
else:
headers = {'X-Auth-Token': token, 'X-User-Id': userid}
rocketusers = requests.get(url + rocketchatusers + offset + str(p), headers=headers, verify=False)
print('Status Code:' + str(rocketusers.status_code))
print('Content Type:' + rocketusers.headers['content-type'])
print('Populating file "emails" - this takes a few moments, please be patient.')
print(url + rocketchatusers + offset + str(p))
userlist = json.loads(rocketusers.text)
x = 0
z = 0
emails = open('emails', 'a+')
while z == 0:
try:
for i in userlist:
#print(userlist['users'][x]['emails'][0]['address'], file=emails)
print(userlist['users'][x]['ldap'], file=emails)
print(userlist['users'][x]['username'], file=emails)
x += 1
except KeyError:
x += 1
except IndexError:
print("End of List")
emails.close()
p += 50
z += 1
This is still a work in progress, unfortunately, this isn't an avenue for collaboration, later I may post this to GitHub so that others can see it.
Related
I'm using a medium API to get a some information but after some API calls the python script ended with this error:
IndexError: list index out of range
Here is my Python code:
def get_post_responses(posts):
#start = time.time()
count = 0
print('Retrieving the post responses...')
responses = []
for post in posts:
url = MEDIUM + '/_/api/posts/' + post + '/responses'
count = count + 1
print("number of times api called",count)
response = requests.get(url)
response_dict = clean_json_response(response)
responses += response_dict['payload']['value']
#end = time.time()
#four = end - start
#global time_cal
#time_cal.append(four)
return responses
def check_if_high_recommends(response, recommend_min):
if response['virtuals']['recommends'] >= recommend_min:
return True
def check_if_recent(response):
limit_date = datetime.now() - timedelta(days=360)
creation_epoch_time = response['createdAt'] / 1000
creation_date = datetime.fromtimestamp(creation_epoch_time)
if creation_date >= limit_date:
return True
It needs to work for more then 10000 followers for a user.
i got an ans for my question...
just i need to use try catch exception ...
response_dict = clean_json_response(response)
try:
responses += response_dict['payload']['value']
catch:
continue
I'm running into an intermittent issue when I run the code below. I'm trying to collect all the page_tokens in the ajax calls made by pressing the "load more" button if it exists. Basically, I'm trying to get all the page tokens from a YouTube Channel.
Sometimes it will retrieve the tokens, and other times it doesn't. My best guess is either I made a mistake in my "find_embedded_page_token" function or that I need some sort of delay/sleep inserted somewhere.
Below is the full code:
import requests
import pprint
import urllib.parse
import lxml
def find_XSRF_token(html, key, num_chars=2):
pos_begin = html.find(key) + len(key) + num_chars
pos_end = html.find('"', pos_begin)
return html[pos_begin: pos_end]
def find_page_token(html, key, num_chars=2):
pos_begin = html.find(key) + len(key) + num_chars
pos_end = html.find('&', pos_begin)
return html[pos_begin: pos_end]
def find_embedded_page_token(html, key, num_chars=2):
pos_begin = html.find(key) + len(key) + num_chars
pos_end = html.find('&', pos_begin)
excess_str = html[pos_begin: pos_end]
sep = '\\'
rest = excess_str.split(sep,1)[0]
return rest
sxeVid = 'https://www.youtube.com/user/sxephil/videos'
ajaxStr = 'https://www.youtube.com/browse_ajax?action_continuation=1&continuation='
s = requests.Session()
r = s.get(sxeVid)
html = r.text
session_token = find_XSRF_token(html, 'XSRF_TOKEN', 4)
page_token = find_page_token(html, ';continuation=', 0)
print(page_token)
s = requests.Session()
r = s.get(ajaxStr+page_token)
ajxHtml = r.text
ajax_page_token = find_embedded_page_token(ajxHtml, ';continuation=', 0)
while page_token:
ajxBtn = ajxHtml.find('data-uix-load-more-href=')
if ajxBtn != -1:
s = requests.Session()
r = s.get(ajaxStr+ajax_page_token)
ajxHtml = r.text
ajax_page_token = find_embedded_page_token(ajxHtml, ';continuation=', 0)
print(ajax_page_token)
else:
break
This is what's returning randomly that is unexpected. It's pulling not just the token, but also the HTML after the desired cut off.
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D"><span class="yt-uix-button-content"> <span class="load-more-loading hid">
<span class="yt-spinner">
<span class="yt-spinner-img yt-sprite" title="Loading icon"></span>
The expected response I'm expecting is this:
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5MZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5iZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5yZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk43Z0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk9MZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk9iZ0JBQSUzRCUzRA%253D%253D
Any help is greatly appreciated. Also, if my tags are wrong, let me know what tags to +/-.
I am beginner for Python,
How I can solve
AttributeError: module 'urllib' has no attribute 'Request'
As I view other post, still can't understand how solve the problem
Here the screen capture of the error
And this is the code (I refer from https://github.com/minimaxir/facebook-page-post-scraper/blob/master/get_fb_posts_fb_page.py)
import urllib.request
import json, datetime, csv, time
app_id = "xxx"
app_secret = "xxx" # DO NOT SHARE WITH ANYONE!
access_token = "xxx"
page_id = 'xxx'
def testFacebookPageData(page_id, access_token):
# construct the URL string
base = "https://graph.facebook.com/v2.4"
node = "/" + page_id +'/feed'
parameters = "/?access_token=%s" % access_token
url = base + node + parameters
# retrieve data
response = urllib.request.urlopen(url)
data = json.loads(response.read().decode('utf-8'))
print (data)
def request_until_succeed(url):
req = urllib.request.urlopen(url)
success = False
while success is False:
try:
response = urllib.urlopen(req)
if response.getcode() == 200:
success = True
except Exception as e:
print (e)
time.sleep(5)
print (url, datetime.datetime.now())
return response.read()
def getFacebookPageFeedData(page_id, access_token, num_statuses):
# construct the URL string
base = "https://graph.facebook.com"
node = "/" + page_id + "/feed"
parameters = "/?fields=message,link,created_time,type,name,id,likes.limit(1).summary(true),comments.limit(1).summary(true),shares&limit=%s&access_token=%s" % (num_statuses, access_token) # changed
url = base + node + parameters
# retrieve data
data = json.loads(request_until_succeed(url))
return data
def processFacebookPageFeedStatus(status):
# The status is now a Python dictionary, so for top-level items,
# we can simply call the key.
# Additionally, some items may not always exist,
# so must check for existence first
status_id = status['id']
status_message = '' if 'message' not in status.keys() else status['message'].encode('utf-8')
link_name = '' if 'name' not in status.keys() else status['name'].encode('utf-8')
status_type = status['type']
status_link = '' if 'link' not in status.keys() else status['link']
# Time needs special care since a) it's in UTC and
# b) it's not easy to use in statistical programs.
status_published = datetime.datetime.strptime(status['created_time'],'%Y-%m-%dT%H:%M:%S+0000')
status_published = status_published + datetime.timedelta(hours=-5) # EST
status_published = status_published.strftime('%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs
# Nested items require chaining dictionary keys.
num_likes = 0 if 'likes' not in status.keys() else status['likes']['summary']['total_count']
num_comments = 0 if 'comments' not in status.keys() else status['comments']['summary']['total_count']
num_shares = 0 if 'shares' not in status.keys() else status['shares']['count']
# return a tuple of all processed data
return (status_id, status_message, link_name, status_type, status_link,
status_published, num_likes, num_comments, num_shares)
def scrapeFacebookPageFeedStatus(page_id, access_token):
with open('%s_facebook_statuses.csv' % page_id, 'w') as file:
w = csv.writer(file)
w.writerow(["status_id", "status_message", "link_name", "status_type", "status_link",
"status_published", "num_likes", "num_comments", "num_shares"])
has_next_page = True
num_processed = 0 # keep a count on how many we've processed
scrape_starttime = datetime.datetime.now()
print (page_id, scrape_starttime)
statuses = getFacebookPageFeedData(page_id, access_token, 100)
while has_next_page:
for status in statuses['data']:
w.writerow(processFacebookPageFeedStatus(status))
# output progress occasionally to make sure code is not stalling
num_processed += 1
if num_processed % 1000 == 0:
print (num_processed, datetime.datetime.now())
# if there is no next page, we're done.
if 'paging' in statuses.keys():
statuses = json.loads(request_until_succeed(statuses['paging']['next']))
else:
has_next_page = False
print (num_processed, datetime.datetime.now() - scrape_starttime)
if __name__ == '__main__':
scrapeFacebookPageFeedStatus(page_id, access_token)
There is no urllib.Request() in Python 3 - there is urllib.request.Request().
EDIT: you have url = urllib.Request(url) in error message but I don't see this line in your code - maybe you run wrong file.
I have this simple python code, which returning the content of URL and store the result as json text file named "file", but it keeps returning empty result .
What I am doing wrong here? It is just a simple code I am so disappointed.
I have included all the imports needed import Facebook,import request,and import json.
url ="https://graph.facebook.com/search?limit=5000&type=page&q=%26&access_token=xx&__after_id=139433456868"
content = requests.get(url).json()
file = open("file.txt" , 'w')
file.write(json.dumps(content, indent=1))
file.close()
but it keeps returning empty result to me what I am missing here?
here is the result:
"data": []
any help please?
Its working fine:
import urllib2
accesstoken="CAACEdEose0cBACF6HpTDEuVEwVnjx1sHOJFS3ZBQZBsoWqKKyFl93xwZCxKysqsQMUgOZBLjJoMurSxinn96pgbdfSYbyS9Hh3pULdED8Tv255RgnsYmnlxvR7JZCN7X25zP6fRnRK0ZCNmChfLPejaltwM2JGtPGYBQwnmAL9tQBKBmbZAkGYCEQHAbUf7k1YZD"
urllib2.urlopen("https://graph.facebook.com/search?limit=5000&type=page&q=%26&access_token="+accesstoken+"&__after_id=139433456868").read()
I think you have not requested access token before making the request.
How to find access token?
def getSecretToken(verification_code):
token_url = ( "https://graph.facebook.com/oauth/access_token?" +
"client_id=" + app_id +
"&redirect_uri=" +my_url +
"&client_secret=" + app_secret +
"&code=" + verification_code )
response = requests.get(token_url).content
params = {}
result = response.split("&", 1)
print result
for p in result:
(k,v) = p.split("=")
params[k] = v
return params['access_token']
how do you get that verification code?
verification_code=""
if "code" in request.query:
verification_code = request.query["code"]
if not verification_code:
dialog_url = ( "http://www.facebook.com/dialog/oauth?" +
"client_id=" + app_id +
"&redirect_uri=" + my_url +
"&scope=publish_stream" )
return "<script>top.location.href='" + dialog_url + "'</script>"
else:
access_token = getSecretToken(verification_code)
I am new to coding in python (maybe a couple of days in) and basically learning of other people's code on stackoverflow. The code I am trying to write uses beautifulsoup to get the pid and the corresponding price for motorcycles on craigslist. I know there are many other ways of doing this but my current code looks like this:
from bs4 import BeautifulSoup
from urllib2 import urlopen
u = ""
count = 0
while (count < 9):
site = "http://sfbay.craigslist.org/mca/" + str(u)
html = urlopen(site)
soup = BeautifulSoup(html)
postings = soup('p',{"class":"row"})
f = open("pid.txt", "a")
for post in postings:
x = post.getText()
y = post['data-pid']
prices = post.findAll("span", {"class":"itempp"})
if prices == "":
w = 0
else:
z = str(prices)
z = z[:-8]
w = z[24:]
filewrite = str(count) + " " + str(y) + " " +str(w) + '\n'
print y
print w
f.write(filewrite)
count = count + 1
index = 100 * count
print "index is" + str(index)
u = "index" + str(index) + ".html"
It works fine and as I keep learning i plan to optimize it. The problem I have right now, is that entries without price are still showing up. Is there something obvious that I am missing.
thanks.
The problem is how you're comparing prices. You say:
prices = post.findAll("span", {"class":"itempp"})
In BS .findAll returns a list of elements. When you're comparing price to an empty string, it will always return false.
>>>[] == ""
False
Change if prices == "": to if prices == [] and everything should be fine.
I hope this helps.