An API I want to use limits requests to 10 items. I want to download 100 items. I am trying to write a function that makes 10 API calls, using the API's offset parameter to page through the results. I figured a loop would be the proper way to do this.
This is the code I have, but it doesn't work and I don't understand why:
import pandas as pd
import requests
api_key = 'THIS_IS_MY_KEY'
api_url = 'http://apiurl.com/doc?limit=10' # fake url
headers = {'Authorization': 'Bearer ' + api_key}
for x in range(0, 10):
    number = 0
    url = api_url + '&offset=' + str(number + 10)
    r = requests.get(url, headers=headers)
    x = pd.DataFrame(r.json())
    x = x['data'].apply(pd.Series)
return x
You are also using x both as your loop counter and as your data frame, which I think is not good practice, although your code might still work because of the way the for loop works. A better approach is to use the step parameter in the range call, as demonstrated below. It is also not clear what you expect to return: the last offset you fetched, or the data frame? Since your code reuses x in three different ways, it is impossible to determine what you intended, so I left that part as it is, although I am pretty sure it is wrong judging by the pandas API.
import pandas as pd
import requests
api_key = 'THIS_IS_MY_KEY'
api_url = 'http://apiurl.com/doc?limit=10' # fake url
headers = {'Authorization': 'Bearer ' + api_key}
for offset in range(0, 100, 10):  # yields 0, 10, 20, ..., 90 (the stop value 100 is excluded)
    url = api_url + '&offset=' + str(offset)
    r = requests.get(url, headers=headers)
    x = pd.DataFrame(r.json())
    x = x['data'].apply(pd.Series)
return x
what result do you see?
try
url = api_url + '&offset=' + str(x * 10)
The variable number never changes, since it is reset to 0 at the start of every loop iteration.
I think you mean this:
import pandas as pd
import requests
api_key = 'THIS_IS_MY_KEY'
api_url = 'http://apiurl.com/doc?limit=10' # fake url
headers = {'Authorization': 'Bearer ' + api_key}
number = 0  # initialise once, before the loop
for x in range(0, 10):
    url = api_url + '&offset=' + str(number)  # the first request uses offset 0
    r = requests.get(url, headers=headers)
    x = pd.DataFrame(r.json())
    x = x['data'].apply(pd.Series)
    number += 10
return x
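For what it's worth, here is a minimal sketch of how the ten pages could be collected into one result instead of overwriting x on every pass. It assumes the response JSON has a 'data' key holding a list of records (as the x['data'] lookup in the question suggests). The names fetch_all, fetch_page and fake_fetch_page are hypothetical, and the fake fetcher stands in for requests.get(...).json() so the example runs without network access.

```python
def fetch_all(fetch_page, total=100, page_size=10):
    """Call fetch_page(offset) for each page and combine the records."""
    rows = []
    for offset in range(0, total, page_size):  # 0, 10, ..., 90
        payload = fetch_page(offset)
        # Assumption: each payload looks like {"data": [record, record, ...]}
        rows.extend(payload["data"])
    return rows


def fake_fetch_page(offset):
    """Hypothetical stand-in for the real API call; returns 10 records."""
    return {"data": [{"id": offset + i} for i in range(10)]}


rows = fetch_all(fake_fetch_page)
print(len(rows), rows[0], rows[-1])
```

In real use, fetch_page would wrap something like requests.get(api_url + '&offset=' + str(offset), headers=headers).json(), and pd.DataFrame(rows) would turn the combined list into a single data frame.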
Related
I am writing a Python script to make GET requests. To build the URLs for these requests I have written the following code:
today = str(datetime.date.today())
start = str(datetime.date.today()- datetime.timedelta (days=30))
report = ["Shifts",
"ShiftStops",
"ShiftStopDetailsByProcessDate",
"TimeRegistrations",
"ShiftsByProcessDate",
"ShiftStopsByProcessDate",
]
for x in report:
url_data = "https://URL"+ report + "?from=" + start + "&until=" + today
data = requests.get(url_data, headers = {'Host': 'services.URL.com', 'Authorization': 'Bearer ' + acces_token})
But the error I get is:
TypeError: can only concatenate str (not "list") to str
What can I do to solve this and create 6 unique URLs?
P.S. I have replaced the real host name with the word URL in order to anonymize my post.
Where you're going wrong is in the following line:
url_data = "https://URL"+ report + "?from=" + start + "&until=" + today
Specifically, you use report which is the entire list. What you'll want to do is use x instead, i.e. the string in the list.
Also you'll want to indent the next line, so altogether it should read:
for x in report:
    url_data = "https://URL"+ x + "?from=" + start + "&until=" + today
    data = requests.get(url_data, headers = {'Host': 'services.URL.com', 'Authorization': 'Bearer ' + acces_token})
I have already found the answer. The list of URLs is created by replacing report with x when building url_data.
url_data = "https://URL"+ x + "?from=" + start + "&until=" + today
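As a small sketch of the same fix, the six URLs can also be built up front with a list comprehension and then requested one by one. The function name build_report_urls is hypothetical, and "https://URL" is the anonymized placeholder from the question, not a real host.

```python
import datetime


def build_report_urls(base, reports, start, today):
    """Return one URL per report name (each name is a plain string)."""
    return [base + name + "?from=" + start + "&until=" + today for name in reports]


today = str(datetime.date.today())
start = str(datetime.date.today() - datetime.timedelta(days=30))
report = ["Shifts",
          "ShiftStops",
          "ShiftStopDetailsByProcessDate",
          "TimeRegistrations",
          "ShiftsByProcessDate",
          "ShiftStopsByProcessDate"]

urls = build_report_urls("https://URL", report, start, today)
for url_data in urls:
    print(url_data)
```

Each url_data can then be passed to requests.get inside the loop, exactly as in the corrected code above.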
I'm trying to write a script that scrapes the text of multiple webpages with slightly differing URLs. I want to go through the pages using np.arange to generate the page number that is inserted into the URL. But there must be something wrong with the URL the script is composing: the file that stores the scraped text contains only messages like "this site does not exist anymore". The steps I have taken to get closer to the solution are detailed below. Here is my code.
from bs4 import BeautifulSoup
import numpy as np
import datetime
from time import sleep
from random import randint
datum = datetime.datetime.now()
pages = np.arange(1, 20, 1)
datum_jetzt = datum.strftime("%Y") + "-" + datum.strftime("%m") + "-" + datum.strftime("%d")
url = "https://www.shabex.ch/pub/" + datum_jetzt + "/index-"
results = requests.get(url)
file_name = "" + datum.strftime("%Y") + "-" + datum.strftime("%m") + "-" + datum.strftime("%d") + "-index.htm"
for page in pages:
page = requests.get("https://www.shabex.ch/pub/" + datum_jetzt + "/index-" + str(page) + ".htm")
soup = BeautifulSoup(results.text, "html.parser")
texte = soup.get_text()
sleep(randint(2,5))
f = open(file_name, "a")
f.write(texte)
f.close
I found that if I enter print("https://www.shabex.ch/pub/" + datum_jetzt + "/index-" + str(page) + ".htm") in the console, I get https://www.shabex.ch/pub/2020-05-18/index-<Response [200]>.htm. So it seems that page holds the response of the webserver instead of the page number I expected from np.arange.
Where have I gone wrong?
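A likely explanation, offered as a sketch rather than a confirmed diagnosis: inside the loop the name page is reassigned to the result of requests.get, so after the first pass it holds a Response object rather than a number, which matches the printed <Response [200]>. In addition, the soup is built from results.text, which is the reply to the incomplete URL ending in "/index-", and that may be where the "does not exist" text comes from. The helper name page_urls below is hypothetical; it only builds the URLs, so it runs without network access.

```python
import datetime


def page_urls(day, first=1, last=19):
    """Build the index page URLs for one day (pages first..last inclusive)."""
    base = "https://www.shabex.ch/pub/" + day.strftime("%Y-%m-%d") + "/index-"
    return [base + str(number) + ".htm" for number in range(first, last + 1)]


urls = page_urls(datetime.date(2020, 5, 18))
print(urls[0])
print(len(urls))

# In the scraping loop, keep the response in its own variable and parse it:
#     for url in urls:
#         response = requests.get(url)
#         soup = BeautifulSoup(response.text, "html.parser")
```

The range 1 to 19 mirrors np.arange(1, 20, 1) from the question. Note also that f.close without parentheses does not actually close the file; f.close() or a with open(...) block would.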
I am working on code that fetches records from an API. The API is paginated and returns a maximum of 100 records per request, so I have to loop in multiples of 100. Currently my code reads the total number of records and then loops from offset 100 to 101, 102, 103 and so on. I want it to step in hundreds (100, 200, 300) and stop as soon as the offset is greater than the total number of records. I am not sure how to do this; I have partial code which increments by 1 instead of 100 and won't stop when needed. Could anyone please help me with this issue?
import json
import requests
import pandas as pd
from pandas.io.json import json_normalize
#Token for Authorization
API_ACCESS_KEY = 'Token'
Accept='application/xml'
#Query Details that is passed in the URL
since = '2018-01-01'
until = '2018-02-01'
limit = '100'
offset = '0'
total = 'true'
def get():
    url_address = "https://mywebsite/web?offset="+str('0')
    headers = {
        'Authorization': 'token={0}'.format(API_ACCESS_KEY),
        'Accept': Accept,
    }
    querystring = {"since":since,"until":until, "limit":limit, "total":total}
    # find out total number of pages
    r = requests.get(url=url_address, headers=headers, params=querystring).json()
    total_record = int(r['total'])
    print("Total record: " +str(total_record))
    # results will be appended to this list
    all_items = []
    # loop through all offset and return JSON object
    for offset in range(0, total_record):
        url = "https://mywebsite/web?offset="+str(offset)
        response = requests.get(url=url, headers=headers, params=querystring).json()
        all_items.append(response)
        offset = offset + 100
        print(offset)
    # prettify JSON
    data = json.dumps(all_items, sort_keys=True, indent=4)
    return data
print(get())
Currently when I print the offset I see
Total Records: 345
100,
101,
102,
Expected:
Total Records: 345
100,
200,
300
Stop the loop!
One way you could do it is to change
for offset in range(0, total_record):
    url = "https://mywebsite/web?offset="+str(offset)
    response = requests.get(url=url, headers=headers, params=querystring).json()
    all_items.append(response)
    offset = offset + 100
    print(offset)
to
for offset in range(0, total_record, 100):
    url = "https://mywebsite/web?offset="+str(offset)
    response = requests.get(url=url, headers=headers, params=querystring).json()
    all_items.append(response)
    print(offset)
because reassigning offset inside the loop body has no effect on the next iteration: range supplies the next value regardless.
# loop through all offset and return JSON object
for offset in range(0,total_record,100):
    url = "https://mywebsite/web?offset="+str(offset)
    response = requests.get(url=url, headers=headers, params=querystring).json()
    all_items.append(response)
    print(offset)
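To see which offsets a stepped range produces for the 345 records mentioned in the question, here is a tiny sketch. The helper name page_offsets is hypothetical and is not part of the original code.

```python
def page_offsets(total_record, limit=100):
    """Offsets to request: start at 0, step by limit, stop before total_record."""
    return list(range(0, total_record, limit))


offsets = page_offsets(345)
print(offsets)
```

For 345 records this gives the offsets 0, 100, 200 and 300, so four requests cover everything and the loop ends on its own. The first page at offset 0 is included, which the "Expected" list in the question leaves out.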
Python 3.7.2
PyCharm
I'm fairly new to Python and to API interaction. I'm trying to loop through the API for Rocket Chat, specifically pulling user email addresses out.
Unlike nearly every example I can find, Rocket Chat doesn't use any kind of construct like "Next". It uses count and offset, which I had actually thought might make this easier.
I have managed to get the first part of this working: looping over the JSON and getting the emails. What I need to do is loop through the API endpoints, which is where I have run into some issues.
I have looked at this answer, "Unable to loop through paged API responses with Python", as it seemed to be pretty close to what I want, but I couldn't get it to work correctly.
The code below is what I have right now. Obviously it isn't looping through the API endpoint just yet; it's just looping over the returned JSON.
import os
import csv
import requests
import json
url = "https://rocketchat.internal.net"
login = "/api/v1/login"
rocketchatusers = "/api/v1/users.list"
#offset = "?count=500&offset=0"
class API:
    def userlist(self, userid, token):
        headers = {'X-Auth-Token': token, 'X-User-Id': userid}
        rocketusers = requests.get(url + rocketchatusers, headers=headers, verify=False)
        print('Status Code:' + str(rocketusers.status_code))
        print('Content Type:' + rocketusers.headers['content-type'])
        userlist = json.loads(rocketusers.text)
        x = 0
        y = 0
        emails = open('emails', 'w')
        while y == 0:
            try:
                for i in userlist:
                    print(userlist['users'][x]['emails'][0]['address'], file=emails)
                    # print(userlist['users'][x]['emails'][0]['address'])
                    x += 1
            except KeyError:
                print("This user has no email address", file=emails)
                x += 1
            except IndexError:
                print("End of List")
                emails.close()
                y += 1
What I have tried and what I would like to do, is something along the lines of an easy FOR loop. There are realistically probably a lot of ways to do what I'm after, I just don't know them.
Something like this:
import os
import csv
import requests
import json
url = "https://rocketchat.internal.net"
login = "/api/v1/login"
rocketchatusers = "/api/v1/users.list"
offset = "?count=500&offset="+p
p = 0
class API:
def userlist(self, userid, token):
headers = {'X-Auth-Token': token, 'X-User-Id': userid}
rocketusers = requests.get(url + rocketchatusers+offset, headers=headers, verify=False)
for r in rocketusers:
print('Status Code:' + str(rocketusers.status_code))
print('Content Type:' + rocketusers.headers['content-type'])
userlist = json.loads(rocketusers.text)
x = 0
y = 0
emails = open('emails', 'w')
while y == 0:
try:
for i in userlist:
print(userlist['users'][x]['emails'][0]['address'], file=emails)
# print(userlist['users'][x]['emails'][0]['address'])
x += 1
except KeyError:
print("This user has no email address", file=emails)
x += 1
except IndexError:
print("End of List")
emails.close()
y += 1
p += 500
Now, obviously this doesn't work, or I wouldn't be posting, but why it doesn't work is the issue.
The error reported is that I can't concatenate an int when a str is expected. OK, fine. When I attempt something like:
str(p = 0)
I get a TypeError. I have tried a lot of other things as well, many of them simply silly, such as p = [], p = {} and other more radical ideas.
The URLs, written out in full rather than built from variables and concatenation, would look something like this:
https://rocketchat.internal.net/api/v1/users.list?count=500&offset=0
https://rocketchat.internal.net/api/v1/users.list?count=500&offset=500
https://rocketchat.internal.net/api/v1/users.list?count=500&offset=1000
https://rocketchat.internal.net/api/v1/users.list?count=500&offset=1500
I feel like there is something really simple that I'm missing. I'm reasonably sure that the answer is in the response to the post I listed, but I couldn't get it to work.
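For reference, a minimal sketch of the concatenation itself: the integer has to be converted with str(p) at the point where the URL is built, not at the point where p is assigned. The helper name page_url is hypothetical; url and rocketchatusers are the values from the question.

```python
url = "https://rocketchat.internal.net"
rocketchatusers = "/api/v1/users.list"


def page_url(p, count=500):
    """Build the users.list URL for offset p, converting the integers to text."""
    return url + rocketchatusers + "?count=" + str(count) + "&offset=" + str(p)


urls = [page_url(p) for p in range(0, 2000, 500)]
for u in urls:
    print(u)
```

This produces the four URLs listed above, with offsets 0, 500, 1000 and 1500.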
So, after asking around, I found out that I had been on the right path to figuring this issue out, I had just tried in the wrong place. Here's what I ended up with:
def userlist(self, userid, token):
    p = 0
    while p <= 7500:
        if not os.path.exists('./emails'):
            headers = {'X-Auth-Token': token, 'X-User-Id': userid}
            rocketusers = requests.get(url + rocketchatusers + offset + str(p), headers=headers, verify=False)
            print('Status Code:' + str(rocketusers.status_code))
            print('Content Type:' + rocketusers.headers['content-type'])
            print('Creating the file "emails" to use to compare against list of regulated users.')
            print(url + rocketchatusers + offset + str(p))
            userlist = json.loads(rocketusers.text)
            x = 0
            y = 0
            emails = open('emails', 'a+')
            while y == 0:
                try:
                    for i in userlist:
                        #print(userlist['users'][x]['emails'][0]['address'], file=emails)
                        print(userlist['users'][x]['ldap'], file=emails)
                        print(userlist['users'][x]['username'], file=emails)
                        x += 1
                except KeyError:
                    x += 1
                except IndexError:
                    print("End of List")
                    emails.close()
                    p += 50
                    y += 1
        else:
            headers = {'X-Auth-Token': token, 'X-User-Id': userid}
            rocketusers = requests.get(url + rocketchatusers + offset + str(p), headers=headers, verify=False)
            print('Status Code:' + str(rocketusers.status_code))
            print('Content Type:' + rocketusers.headers['content-type'])
            print('Populating file "emails" - this takes a few moments, please be patient.')
            print(url + rocketchatusers + offset + str(p))
            userlist = json.loads(rocketusers.text)
            x = 0
            z = 0
            emails = open('emails', 'a+')
            while z == 0:
                try:
                    for i in userlist:
                        #print(userlist['users'][x]['emails'][0]['address'], file=emails)
                        print(userlist['users'][x]['ldap'], file=emails)
                        print(userlist['users'][x]['username'], file=emails)
                        x += 1
                except KeyError:
                    x += 1
                except IndexError:
                    print("End of List")
                    emails.close()
                    p += 50
                    z += 1
This is still a work in progress. Unfortunately, this isn't an avenue for collaboration; later I may post this to GitHub so that others can see it.
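As a possible simplification, and only a sketch under stated assumptions: instead of a fixed upper bound such as 7500, the loop could stop when a page comes back with fewer users than were requested. This assumes each response is a dict with a 'users' list, as in the code above. The names iter_users, first_email and fake_fetch_page are hypothetical, and the fake fetcher replaces the real requests.get call so the example runs offline.

```python
def iter_users(fetch_page, count=500):
    """Yield users page by page until a short (or empty) page is returned."""
    offset = 0
    while True:
        page = fetch_page(offset, count)
        users = page.get("users", [])
        for user in users:
            yield user
        if len(users) < count:  # last page reached
            break
        offset += count


def first_email(user):
    """Return the first email address of a user, or None if there is none."""
    emails = user.get("emails") or []
    return emails[0].get("address") if emails else None


# Hypothetical data set of 1200 users; every third user has no email.
ALL_USERS = [
    {"username": "user%d" % i,
     "emails": [] if i % 3 == 0 else [{"address": "user%d@example.com" % i}]}
    for i in range(1200)
]


def fake_fetch_page(offset, count):
    """Stand-in for requests.get(...).json() against users.list."""
    return {"users": ALL_USERS[offset:offset + count]}


users = list(iter_users(fake_fetch_page))
addresses = [first_email(u) for u in users if first_email(u)]
print(len(users), len(addresses))
```

With the real API, fetch_page would build the URL from url + rocketchatusers plus the count and offset query parameters and return the decoded JSON.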
I'm running into an intermittent issue when I run the code below. I'm trying to collect all the page_tokens in the ajax calls made by pressing the "load more" button if it exists. Basically, I'm trying to get all the page tokens from a YouTube Channel.
Sometimes it will retrieve the tokens, and other times it doesn't. My best guess is either I made a mistake in my "find_embedded_page_token" function or that I need some sort of delay/sleep inserted somewhere.
Below is the full code:
import requests
import pprint
import urllib.parse
import lxml
def find_XSRF_token(html, key, num_chars=2):
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('"', pos_begin)
    return html[pos_begin: pos_end]

def find_page_token(html, key, num_chars=2):
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('&', pos_begin)
    return html[pos_begin: pos_end]

def find_embedded_page_token(html, key, num_chars=2):
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('&', pos_begin)
    excess_str = html[pos_begin: pos_end]
    sep = '\\'
    rest = excess_str.split(sep,1)[0]
    return rest

sxeVid = 'https://www.youtube.com/user/sxephil/videos'
ajaxStr = 'https://www.youtube.com/browse_ajax?action_continuation=1&continuation='
s = requests.Session()
r = s.get(sxeVid)
html = r.text
session_token = find_XSRF_token(html, 'XSRF_TOKEN', 4)
page_token = find_page_token(html, ';continuation=', 0)
print(page_token)
s = requests.Session()
r = s.get(ajaxStr+page_token)
ajxHtml = r.text
ajax_page_token = find_embedded_page_token(ajxHtml, ';continuation=', 0)
while page_token:
    ajxBtn = ajxHtml.find('data-uix-load-more-href=')
    if ajxBtn != -1:
        s = requests.Session()
        r = s.get(ajaxStr+ajax_page_token)
        ajxHtml = r.text
        ajax_page_token = find_embedded_page_token(ajxHtml, ';continuation=', 0)
        print(ajax_page_token)
    else:
        break
This is the unexpected output that is returned at random. It pulls not just the token, but also the HTML after the desired cutoff.
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D"><span class="yt-uix-button-content"> <span class="load-more-loading hid">
<span class="yt-spinner">
<span class="yt-spinner-img yt-sprite" title="Loading icon"></span>
The response I'm expecting is this:
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5MZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5iZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5yZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk43Z0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk9MZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk9iZ0JBQSUzRCUzRA%253D%253D
Any help is greatly appreciated. Also, if my tags are wrong, let me know what tags to +/-.
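One possible angle, offered as a sketch and not a confirmed fix: the unexpected output shows the token followed directly by a double quote and more HTML, which suggests that in those responses the token is not terminated by '&' or a backslash, so the slice runs on past it. A regular expression that only accepts characters seen in the expected tokens would stop at the quote. The character set below is an assumption based on the sample tokens, and find_token and the sample HTML string are hypothetical.

```python
import re

# Assumption: tokens consist only of letters, digits, '%', '_' and '-'.
TOKEN_RE = re.compile(r";continuation=([A-Za-z0-9%_-]+)")


def find_token(html):
    """Return the first continuation token in html, or None if absent."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


# Hypothetical fragment shaped like the unexpected output above.
sample = ('data-uix-load-more-href="/browse_ajax?action_continuation=1'
          '&amp;continuation=4qmFsgJAEhhVQ2xG%253D%253D">'
          '<span class="yt-uix-button-content">')
token = find_token(sample)
print(token)
```

Because the pattern requires the leading ';', it skips the "action_continuation=1" part of the URL. Returning None when nothing matches also gives the while loop a clean way to stop.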