My code sends a GET request whose query parameters depend on a page number.
After that I have to loop over the response to collect some IDs, and also read the next page number from the same response,
then send a new GET request with that next page number and collect the IDs from the new response as well.
My code works fine, but I'm using two loops, which I don't think is the right way. I couldn't manage it with one loop. Any ideas?
def get():
    response = requests.get(url, headers=header)
    data = response.text
    data = json.loads(data)
    check_if_theres_next_page = data['pagination']['hasMorePages']
    check_for_next_page_number = data['pagination']['nextPage']
    last_page_number = data['pagination']['lastPage']
    orders = data['orders']
    list_of_ids = []
    for manufacturingOrderId in orders:
        ids = manufacturingOrderId['manufacturingOrderId']
        list_of_ids.append(ids)
    if check_for_next_page_number == 4:
        check_for_next_page_number = last_page_number
    if check_if_theres_next_page:
        url_ = url + '&page_number=' + str(check_for_next_page_number)
        response = requests.get(url_, headers=header)
        data = response.text
        data = json.loads(data)
        orders = data['orders']
        for manufacturingOrderId_ in orders:
            ids = manufacturingOrderId_['manufacturingOrderId']
            list_of_ids.append(ids)
        if "nextPage" in data['pagination']:
            check_for_next_page_number = data['pagination']['nextPage']
        else:
            check_if_theres_next_page = False
    return list_of_ids
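For reference, a minimal sketch of how the same logic could be collapsed into a single while loop, assuming the pagination fields shown above (hasMorePages, nextPage) and the same url and header variables; the get_all_ids name is hypothetical:

import requests

def get_all_ids():
    # Hypothetical single-loop version of the function above.
    list_of_ids = []
    next_url = url
    while True:
        data = requests.get(next_url, headers=header).json()
        for order in data['orders']:
            list_of_ids.append(order['manufacturingOrderId'])
        pagination = data['pagination']
        # Stop when the API says there are no more pages.
        if not pagination.get('hasMorePages') or 'nextPage' not in pagination:
            break
        next_url = url + '&page_number=' + str(pagination['nextPage'])
    return list_of_ids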
I would like to retrieve all records (total 50,000) from an API endpoint. The endpoint only returns a maximum of 1000 records per page. Here's the function to get the records.
def get_products(token, page_number):
    url = "https://example.com/manager/nexus?page={}&limit={}".format(page_number, 1000)
    header = {
        "Authorization": "Bearer {}".format(token)
    }
    response = requests.get(url, headers=header)
    product_results = response.json()
    total_list = []
    for result in product_results['Data']:
        date = result['date']
        price = result['price']
        name = result['name']
        total_list.append((date, price, name))
    columns = ['date', 'price', 'name']
    df = pd.DataFrame(total_list, columns=columns)
    results = json.dumps(total_list)
    return df, results
How can I loop through each page until the final record without hardcoding the page numbers? Currently, I'm hardcoding the page numbers as below for the first 2 pages to get 2000 records as a test.
for page_number in np.arange(1, 3):
    token = get_token()
    product_df, product_json = get_products(token, page_number)
    if page_number == 1:
        product_all = product_df
    else:
        product_all = pd.concat([product_all, product_df])
print(product_all)
Thank you.
I don't know the behavior of the endpoint, but assuming that requesting a page number greater than the last page returns an empty list, you could just check whether the result is empty.
page_number = 1
token = get_token()
product_df, product_json = get_products(token, page_number)
product_all = product_df

while product_df.size:
    page_number = page_number + 1
    token = get_token()
    product_df, product_json = get_products(token, page_number)
    product_all = pd.concat([product_all, product_df])

print(product_all)
If you are sure there are at most 1000 records per page, you could instead stop the loop as soon as a page returns fewer than 1000 records.
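A rough sketch of that variant, reusing the get_products and get_token functions from the question (collecting the frames in a list and using ignore_index is just a stylistic choice):

import pandas as pd

page_number = 1
frames = []
while True:
    token = get_token()
    product_df, product_json = get_products(token, page_number)
    frames.append(product_df)
    # A page with fewer than 1000 rows must be the last (possibly partial) page.
    if len(product_df) < 1000:
        break
    page_number = page_number + 1

product_all = pd.concat(frames, ignore_index=True)
print(product_all)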
It depends on how your backend GET endpoint returns its JSON. Since page and limit are required, you could also rewrite the endpoint to return all of the data instead of only 1000 records per page. Otherwise, with 50,000 records at 1000 per page, you can simply loop over the 50 pages:
num = int(50000 / 1000)
for i in range(1, num + 1):  # pages 1 through 50
    token = get_token()
    product_df, product_json = get_products(token, i)
    if i == 1:
        product_all = product_df
    else:
        product_all = pd.concat([product_all, product_df])
print(product_all)
I am retrieving data from an API endpoint which only allows me to retrieve a maximum of 100 data points at a time. There is a "next page" field within the response which I could use to retrieve the next 100 data points and so on (there are about 70,000 in total) by plugging the next page url back into the GET request. How can I utilize a for loop or while loop to retrieve all the data available in the endpoint by automatically plugging the "next page" URL back into the get request?
Here is the code I'm using. The problem is that when I execute the while loop I get the same response every time, because it keeps operating on the first response instance. I can't work out how to adjust this.
response = requests.get(url + '/api/named_users?limit=100', headers=headers)
users = []
resp_json = response.json()
users.append(resp_json)

while resp_json.get('next_page') != '':
    response = s.get(resp_json.get('next_page'), headers=headers)
    resp_json = response.json()
    users.append(resp_json)
To summarize: I want to take the "next page" URL in every response to get the next 100 data points and append it to a list each time until I have all the data fetched.
You can do it with a recursive function. For example, something like this:
def next_page(url, users):
    if url != '':
        response = s.get(url, headers=headers)
        resp_json = response.json()
        users.append(resp_json)
        if resp_json.get('next_page') != '':
            return next_page(resp_json.get('next_page'), users)
    return users

response = requests.get(url + '/api/named_users?limit=100', headers=headers)
users = []
resp_json = response.json()
users.append(resp_json)
users = next_page(resp_json.get('next_page'), users)
But in general, APIs return a total number of items and a number of items per request. So you can easily paginate and loop through all items.
Here is some pseudo-code :
number_of_pages = total_number_of_items // items_returned_per_request
for i in range(number_of_pages):
    response = s.get(resp_json.get('next_page'), headers=headers)
    resp_json = response.json()
    users.append(resp_json)
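If you'd rather avoid recursion, a minimal iterative sketch of the same idea, assuming the response exposes a next_page field exactly as in the question (the fetch_all_users name is mine):

import requests

def fetch_all_users(base_url, headers):
    # Follow the next_page URL from each response until it is empty or missing.
    users = []
    response = requests.get(base_url + '/api/named_users?limit=100', headers=headers)
    resp_json = response.json()
    users.append(resp_json)
    while resp_json.get('next_page'):
        response = requests.get(resp_json['next_page'], headers=headers)
        resp_json = response.json()
        users.append(resp_json)
    return users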
Summary: I want to scrape a subreddit and then turn the data into dataframes. I know how to do each step individually, but I am stuck on combining them into a function.
Here is how I do it one by one.
url = 'https://api.pushshift.io/reddit/search/submission'
params3 = {'subreddit': 'Apple', 'size': 500, 'before': 1579411194}
res3 = requests.get(url, params3)
data = res3.json()
post3 = data['data']
apdf3 = pd.DataFrame(post3)
Here is the function I came up with so far:
url = 'https://api.pushshift.io/reddit/search/submission'
def webscrape(subreddit, size):
    for i in range(1, 11):
        params = {"subreddit": subreddit, 'size': size, 'before': f'post{i}'[-1]['created_utc']}
        res = requests.get(url, params)
        f'data{i}' = res.json()
        f'post{i}' = data[f'data{i}']
        f'ap_df{i}' = pd.DataFrame(f'post{i}')
My problem is that the first request doesn't need 'before', but once the first batch of posts is retrieved I need to pass 'before' so that I get the posts earlier than the last post from the previous request. How do I reconcile this?
Many thanks!
What you are asking for is doable, but I don't think f-strings will work here. The code below attaches each dataframe to a dictionary of dataframes. Try it and see if it works:
d = {}
url = 'https://api.pushshift.io/reddit/search/submission'

def webscraper(subreddit, size):
    bef = 0
    for i in range(1, 11):
        if i == 1:
            params = {"subreddit": subreddit, 'size': size}
        else:
            params = {"subreddit": subreddit, 'size': size, 'before': bef}
        res = requests.get(url, params)
        data = res.json()
        dat = data['data']
        bef = dat[-1]['created_utc']
        df_name = subreddit + str(i)
        d[df_name] = pd.DataFrame(dat)
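A possible way to call it and combine the results afterwards; the concatenation step is my own addition, not part of the answer above, and it assumes pandas is imported as pd:

webscraper('Apple', 500)                              # fills d with keys 'Apple1' ... 'Apple10'
apple_all = pd.concat(d.values(), ignore_index=True)  # one combined dataframe, if that's what you need
print(apple_all.shape)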
I want to extract all Wikipedia titles via the API. Each response contains a continue key which is used to get the next logical batch, but after 30 requests the continue key starts to repeat, which means I am receiving the same pages.
I have tried the code below along with the Wikipedia documentation:
https://www.mediawiki.org/wiki/API:Allpages
def get_response(self, url):
    resp = requests.get(url=url)
    return resp.json()

appcontinue = []
url = 'https://en.wikipedia.org/w/api.php?action=query&list=allpages&format=json&aplimit=500'
json_resp = self.get_response(url)
next_batch = json_resp["continue"]["apcontinue"]
url += '&apcontinue=' + next_batch
appcontinue.append(next_batch)

while True:
    json_resp = self.get_response(url)
    url = url.replace(next_batch, json_resp["continue"]["apcontinue"])
    next_batch = json_resp["continue"]["apcontinue"]
    appcontinue.append(next_batch)
I am expecting to receive more than 10,000 unique continue keys, since one response contains at most 500 titles and Wikipedia has 5,673,237 articles in English.
Actual result: I made more than 600 requests and got only 30 unique continue keys.
json_resp["continue"] contains two pairs of values, one is apcontinue and the other is continue. You should add them both to your query. See https://www.mediawiki.org/wiki/API:Query#Continuing_queries for more details.
Also, I think it'll be easier to use the params parameter of requests.get instead of manually replacing the continue values. Perhaps something like this:
import requests

def get_response(url, params):
    resp = requests.get(url, params)
    return resp.json()

url = 'https://en.wikipedia.org/w/api.php?action=query&list=allpages&format=json&aplimit=500'
params = {}

while True:
    json_resp = get_response(url, params)
    params = json_resp["continue"]
    ...
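Filling in the gaps, a rough sketch of a full loop that collects the titles and stops when the API no longer returns a continue block (the exit condition follows the continuing-queries behavior described in the MediaWiki docs linked above):

import requests

url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'list': 'allpages',
    'format': 'json',
    'aplimit': 500,
}

titles = []
while True:
    json_resp = requests.get(url, params=params).json()
    titles.extend(page['title'] for page in json_resp['query']['allpages'])
    if 'continue' not in json_resp:
        break  # no continuation block means the last batch was reached
    # Merge both continuation values (apcontinue and continue) into the next query.
    params.update(json_resp['continue'])

print(len(titles))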
I'm trying to do some analytics on Instagram photos that are posted with a specified hashtag. So now I'm trying to store all the images in a temporary database that'll be used for the analysis.
I'm using Python and I have a Celery task to get all the images, but it doesn't work when I run it with a next_max_tag_id, which is probably wrong.
Does someone know how to get the correct next_max_tag_id?
This is the code I'm using:
@task()
def get_latest_photos():
    next_max_tag_id = get_option('next_max_tag_id')
    if not next_max_tag_id:
        next_max_tag_id = 0

    url = BASE + '/tags/{tag}/media/recent?client_id={cliend_id}' \
                 '&max_tag_id={max_id}'.format(**{
                     'tag': a_tag,
                     'cliend_id': getattr(settings, 'INSTAGRAM_CLIENT_ID'),
                     'max_id': next_max_tag_id
                 })

    while url:
        request = requests.get(url)
        if request.status_code != 200:
            pass  # TODO: error
        json_response = request.json()
        if json_response['meta']['code'] != 200:
            pass  # TODO: error

        # do something with json_response['data']:

        url = None
        if json_response.has_key('pagination'):
            pagination = json_response['pagination']
            if pagination.has_key('next_url'):
                url = json_response['pagination']['next_url']
            if pagination.has_key('next_max_tag_id'):
                next_max_tag_id = pagination['next_max_tag_id']

    update_option('next_max_tag_id', next_max_tag_id)
The flow is basically this:
1. get next_max_tag_id from the DB (defaults to 0)
2. while we have a valid URL, fetch the data, the next URL, and the next_max_tag_id
3. update the stored next_max_tag_id
The only thing that seems wrong to me is the next_max_tag_id, because every time I go to the API URL with the last next_max_tag_id I get the old images.
Yes. Here's how to use pagination correctly: loop through the pages by having the function call itself with the next URL. You can adapt the script below, which gets everyone you're following, and query for next_max_id in the same way.
currently_following = set([])

def parse_following(next_url=None):
    if next_url is None:
        urlUserMedia = "https://api.instagram.com/v1/users/self/follows?access_token=%s" % (auth_token)
    else:
        urlUserMedia = next_url
    values = {
        'client_id': client_id}
    try:
        data = urllib.urlencode(values)
        req = urllib2.Request(urlUserMedia, None, headers)
        response = urllib2.urlopen(req)
        result = response.read()
        dataObj = json.loads(result)
        next_url = None
        if dataObj.get('pagination') is not None:
            next_url = dataObj.get('pagination').get('next_url')
        currently_following.update(user['id'] for user in dataObj['data'])
        if next_url is not None:
            parse_following(next_url)
    except Exception as e:
        print e
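For comparison, a rough Python 3 / requests sketch of the same recursive idea (auth_token and headers are assumed to be defined as in the snippet above; the endpoint itself is the legacy Instagram API):

import requests

currently_following = set()

def parse_following(next_url=None):
    # Recursively follow pagination.next_url until the API stops returning one.
    if next_url is None:
        next_url = ("https://api.instagram.com/v1/users/self/follows"
                    "?access_token={}".format(auth_token))
    data_obj = requests.get(next_url, headers=headers).json()
    currently_following.update(user['id'] for user in data_obj['data'])
    next_url = (data_obj.get('pagination') or {}).get('next_url')
    if next_url:
        parse_following(next_url)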