I am trying to call a Python API to get a result set of about 21,500 records, with the PageSize limit (and default) at 4,000 records. I do not know the total number of pages, and there is no "next_url" or "last_page_url" link. The only thing given is the total number of results, 21205, which I can divide by the PageSize limit of 4,000 to get 5.30125 pages.
There are two possible approaches I am considering; I am just not sure how to put them into code.
First, a while loop: if the result set equals the PageSize of 4,000, loop through another page.
Second, a for loop: if the total number of pages is 5.3, round it up to 6 to get all records, and paginate with page += 1.
Lastly, I need to append all the records to a pandas DataFrame so I can export it to a SQL table.
Any help is greatly appreciated.
url = "https://api2.enquiresolutions.com/v3/?Id=XXXX&ListId=161585&PageSize=4000"
auth = { 'Ocp-Apim-Subscription-Key': 'XXX', 'Content-Type': 'application/json'}
params = {'PageNumber': page}
res = requests.get(url=url, headers=auth, params=params).json()
df = pd.DataFrame(res['result'])
total_result= df['total'][0]
total_pages = int(total_result) /4000
properties = json_normalize(df['individuals'],record_path=['properties'],meta=
['casenumber','individualid','type'])
properties['Data'] = properties.label.str.cat(properties.id,sep='_')
properties = properties.drop(['label','id'],axis=1)
pivotprop = properties.pivot(index='individualid', columns='Data', values='value')
data = pivotprop.reset_index()
data.to_sql('crm_Properties',con=engine, if_exists='append'
Are you looking for something like this? You just loop until the result size is less than 4,000 and consolidate the data in a list:
url = "https://api2.enquiresolutions.com/v3/?Id=XXXX&ListId=161585&PageSize=4000"
auth = { 'Ocp-Apim-Subscription-Key': 'XXX', 'Content-Type': 'application/json'}
page = 0
params = {'PageNumber': page}
pages_remaining = True
full_res = []
while pages_remaining:
res = requests.get(url=url, headers=auth, params=params).json()
full_res.append(res['result'])
page += 4000
params = {'PageNumber' : page}
if not len(res['result']) == 4000:
pages_remaining = False
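If you prefer the second idea from the question (rounding the page count up), a minimal sketch could look like the one below. It assumes res['result'] carries the 'total' field and the records under 'individuals', as in the question's code, and that PageNumber starts at 1; adjust to match the actual API.

import math
import requests
import pandas as pd
from pandas import json_normalize

url = "https://api2.enquiresolutions.com/v3/?Id=XXXX&ListId=161585&PageSize=4000"
auth = {'Ocp-Apim-Subscription-Key': 'XXX', 'Content-Type': 'application/json'}
page_size = 4000

# First request: read the total so the page count can be rounded up (5.30125 -> 6)
first = requests.get(url=url, headers=auth, params={'PageNumber': 1}).json()
total = int(first['result']['total'])
total_pages = math.ceil(total / page_size)

# Collect the raw records from every page in one list
individuals = list(first['result']['individuals'])
for page in range(2, total_pages + 1):
    res = requests.get(url=url, headers=auth, params={'PageNumber': page}).json()
    individuals.extend(res['result']['individuals'])

# The combined list can then go through the same json_normalize / pivot / to_sql steps
properties = json_normalize(individuals, record_path=['properties'],
                            meta=['casenumber', 'individualid', 'type'])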
My code sends a GET request using query parameters that depend on a page number.
After that I have to loop over the response to get some IDs, and also get the next page number from the same response,
then send a new GET request with that next page number and get the IDs from the new response as well.
My code works fine, but I'm using two loops, which I don't think is the right way to do it. I couldn't manage it with one loop; any ideas?
def get():
    response = requests.get(url, headers=header)
    data = json.loads(response.text)

    check_if_theres_next_page = data['pagination']['hasMorePages']
    check_for_next_page_number = data['pagination']['nextPage']
    last_page_number = data['pagination']['lastPage']
    orders = data['orders']

    list_of_ids = []
    # First loop: collect the IDs from the first response
    for manufacturingOrderId in orders:
        ids = manufacturingOrderId['manufacturingOrderId']
        list_of_ids.append(ids)

    if check_for_next_page_number == 4:
        check_for_next_page_number = last_page_number

    if check_if_theres_next_page:
        url_ = url + '&page_number=' + str(check_for_next_page_number)
        response = requests.get(url_, headers=header)
        data = json.loads(response.text)
        orders = data['orders']
        # Second loop: collect the IDs from the next page
        for manufacturingOrderId_ in orders:
            ids = manufacturingOrderId_['manufacturingOrderId']
            list_of_ids.append(ids)
        if "nextPage" in data['pagination']:
            check_for_next_page_number = data['pagination']['nextPage']
        else:
            check_if_theres_next_page = False

    return list_of_ids
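One way to collapse this into a single loop (leaving aside the special case for page 4) is a sketch along these lines, assuming hasMorePages and nextPage behave as described above and that url and header are the same globals used in the question:

def get_all_ids():
    list_of_ids = []
    next_url = url  # first request without a page_number parameter
    while True:
        data = requests.get(next_url, headers=header).json()
        for order in data['orders']:
            list_of_ids.append(order['manufacturingOrderId'])
        pagination = data['pagination']
        if not pagination.get('hasMorePages'):
            break
        next_url = url + '&page_number=' + str(pagination['nextPage'])
    return list_of_ids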
I would like to retrieve all records (total 50,000) from an API endpoint. The endpoint only returns a maximum of 1000 records per page. Here's the function to get the records.
def get_products(token, page_number):
    url = "https://example.com/manager/nexus?page={}&limit={}".format(page_number, 1000)
    header = {
        "Authorization": "Bearer {}".format(token)
    }
    response = requests.get(url, headers=header)
    product_results = response.json()

    total_list = []
    for result in product_results['Data']:
        date = result['date']
        price = result['price']
        name = result['name']
        total_list.append((date, price, name))

    columns = ['date', 'price', 'name']
    df = pd.DataFrame(total_list, columns=columns)
    results = json.dumps(total_list)
    return df, results
How can I loop through each page until the final record without hardcoding the page numbers? Currently, I'm hardcoding the page numbers as below for the first 2 pages to get 2000 records as a test.
for page_number in np.arange(1, 3):
    token = get_token()
    product_df, product_json = get_products(token, page_number)
    if page_number == 1:
        product_all = product_df
    else:
        product_all = pd.concat([product_all, product_df])

print(product_all)
Thank you.
I don't know the exact behavior of the endpoint, but assuming that a page number greater than the last page returns an empty list, you could just check whether the result is empty:
page_number = 1
token = get_token()
product_df, product_json = get_products(token, page_number)
product_all = product_df

while product_df.size:
    page_number = page_number + 1
    token = get_token()
    product_df, product_json = get_products(token, page_number)
    product_all = pd.concat([product_all, product_df])

print(product_all)
If you are sure there are 1000 records max per page, you could check if the result count is less than 1000 and stop the loop.
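For example, a minimal sketch of that variant (reusing get_products and get_token from above):

page_number = 1
token = get_token()
product_df, product_json = get_products(token, page_number)
product_all = product_df

# A page with fewer than 1000 rows must be the last one
while len(product_df) == 1000:
    page_number += 1
    token = get_token()
    product_df, product_json = get_products(token, page_number)
    product_all = pd.concat([product_all, product_df])

print(product_all)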
It depends on how your backend returns the JSON for the GET request. Since page and limit are required, you could also consider changing the backend so the JSON returns all the data instead of only 1000 records at a time. Otherwise, loop over the known number of pages:
num = int(50000 / 1000)  # 50 pages of 1000 records
for i in range(1, num + 1):
    token = get_token()
    product_df, product_json = get_products(token, i)
    if i == 1:
        product_all = product_df
    else:
        product_all = pd.concat([product_all, product_df])

print(product_all)
I am trying to iterate over the API until it pulls all the records. Any ideas/hints would be really appreciated. It returns a maximum of 5000 records per API call by default, and there are almost 30,000 rows in the account object.
As per the docs: if there are more than 5000 records to fetch, make another API call with Offset set to 5001 so that the remaining records (again a maximum of 5000) are fetched.
import requests
import json
url = 'https://xyzabc.com/account'
headers = {'content-type': 'application/json','Accesskey': '1234'}
body = {"select": [
"accountid",
"accountname",
"location"],
"offset" :0}
response = requests.post(url, data=json.dumps(body), headers=headers)
account = response.json()
Since the offset tells the API where you are, you can do it in a loop like this:
url = 'https://xyzabc.com/account'
headers = {'content-type': 'application/json', 'Accesskey': '1234'}

# Please check if you have a better way to get the total number from your API specs,
# then specify it - that may need a separate API call.
total_records = 1000000000

# Collect the results of your API calls in this list
accounts = []

# Go from 0 to total_records, 5000 records at a time
try:
    for i in range(0, total_records, 5000):
        body = {"select": ["accountid",
                           "accountname",
                           "location"],
                "offset": i}
        response = requests.post(url, data=json.dumps(body), headers=headers)
        accounts.append(response.json())
except Exception as e:
    print(f"Connection error - {e}")  # Handle it your way

for account in accounts:
    # Your logic for every account fetched
    ...
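Alternatively, if you would rather not guess total_records, you can stop as soon as a call returns fewer than 5000 records. This sketch reuses url, headers, and body from above and assumes the response body is the list of account records itself; adjust the parsing if the records are nested under another key:

accounts = []
offset = 0
while True:
    body = {"select": ["accountid", "accountname", "location"], "offset": offset}
    response = requests.post(url, data=json.dumps(body), headers=headers)
    batch = response.json()  # assumed: a list of account records
    accounts.extend(batch)
    if len(batch) < 5000:    # a short batch means there is nothing left to fetch
        break
    offset += 5000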
Summary: I want to scrape a subreddit and then turn the data into DataFrames. I know how to do each step individually, but I am stuck on wrapping it in a function.
Here is how I do it one by one.
url = 'https://api.pushshift.io/reddit/search/submission'
params3 = {'subreddit':'Apple', 'size': 500,'before':1579411194}
res3 = requests.get(url, params3)
data = res3.json()
post3 = data['data']
apdf3 = pd.DataFrame(post3)
Here is the function I came up with so far:
url = 'https://api.pushshift.io/reddit/search/submission'
def webscrape (subreddit, size,):
    for i in range(1, 11):
        params = {"subreddit":subreddit, 'size':size, 'before': f'post{i}'[-1]['created_utc']}
        res = requests.get(url, params)
        f'data{i}' = res.json()
        f'post{i}' = data[f'data{i}']
        f'ap_df{i}' = pd.DataFrame(f'post{i})
My problem is that the first request doesn't need 'before', but after the first batch of posts is retrieved I need to use 'before' to get all the posts earlier than the last post from the previous request. How do I reconcile this conflict?
Many thanks!
What you are asking for is doable, but I don't think f-strings will work here. The code below attaches each dataframe to a dictionary of dataframes. Try it and see if it works:
d = {}
url = 'https://api.pushshift.io/reddit/search/submission'

def webscraper(subreddit, size,):
    bef = 0
    for i in range(1, 11):
        if i == 1:
            params = {"subreddit": subreddit, 'size': size}
        else:
            params = {"subreddit": subreddit, 'size': size, 'before': bef}
        res = requests.get(url, params)
        data = res.json()
        dat = data['data']
        bef = dat[-1]['created_utc']
        df_name = subreddit + str(i)
        d[df_name] = pd.DataFrame(dat)
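A hypothetical call, matching the one-by-one example from the question (subreddit 'Apple', size 500):

webscraper('Apple', 500)       # fills d with up to ten batches of posts
apdf1 = d['Apple1']            # first batch as a DataFrame
all_posts = pd.concat(d.values(), ignore_index=True)  # or combine every batch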
I want to extract all Wikipedia titles via the API. Each response contains a continue key, which is used to get the next logical batch, but after 30 requests the continue key starts to repeat, which means I am receiving the same pages.
I have tried the following code and the Wikipedia documentation:
https://www.mediawiki.org/wiki/API:Allpages
def get_response(self, url):
    resp = requests.get(url=url)
    return resp.json()

appcontinue = []
url = 'https://en.wikipedia.org/w/api.php?action=query&list=allpages&format=json&aplimit=500'

json_resp = self.get_response(url)
next_batch = json_resp["continue"]["apcontinue"]
url += '&apcontinue=' + next_batch
appcontinue.append(next_batch)

while True:
    json_resp = self.get_response(url)
    url = url.replace(next_batch, json_resp["continue"]["apcontinue"])
    next_batch = json_resp["continue"]["apcontinue"]
    appcontinue.append(next_batch)
I am expecting to receive more than 10,000 unique continue keys, since one response can contain a maximum of 500 titles and Wikipedia has 5,673,237 articles in English.
Actual result: I made more than 600 requests and there are only 30 unique continue keys.
json_resp["continue"] contains two pairs of values, one is apcontinue and the other is continue. You should add them both to your query. See https://www.mediawiki.org/wiki/API:Query#Continuing_queries for more details.
Also, I think it'll be easier to use the params parameter of request.get instead of manually replacing the continue values. Perhaps something like this:
import requests

def get_response(url, params):
    resp = requests.get(url, params)
    return resp.json()

url = 'https://en.wikipedia.org/w/api.php?action=query&list=allpages&format=json&aplimit=500'
params = {}

while True:
    json_resp = get_response(url, params)
    params = json_resp["continue"]
    ...
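For instance, a minimal sketch that collects the titles and stops when the response no longer carries a continue block (the last batch has no "continue" key):

url = 'https://en.wikipedia.org/w/api.php?action=query&list=allpages&format=json&aplimit=500'
params = {}
all_titles = []

while True:
    json_resp = get_response(url, params)
    all_titles.extend(page["title"] for page in json_resp["query"]["allpages"])
    if "continue" not in json_resp:     # last batch: no more continuation values
        break
    params = json_resp["continue"]      # carries both 'apcontinue' and 'continue'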