Python - Loop through each page to get all records

I would like to retrieve all records (total 50,000) from an API endpoint. The endpoint only returns a maximum of 1000 records per page. Here's the function to get the records.
import json

import numpy as np
import pandas as pd
import requests

def get_products(token, page_number):
    url = "https://example.com/manager/nexus?page={}&limit={}".format(page_number, 1000)
    header = {
        "Authorization": "Bearer {}".format(token)
    }
    response = requests.get(url, headers=header)
    product_results = response.json()
    total_list = []
    for result in product_results['Data']:
        date = result['date']
        price = result['price']
        name = result['name']
        total_list.append((date, price, name))
    columns = ['date', 'price', 'name']
    df = pd.DataFrame(total_list, columns=columns)
    results = json.dumps(total_list)
    return df, results
How can I loop through each page until the final record without hardcoding the page numbers? Currently, I'm hardcoding the page numbers as below for the first 2 pages to get 2000 records as a test.
for page_number in np.arange(1, 3):
    token = get_token()
    product_df, product_json = get_products(token, page_number)
    if page_number == 1:
        product_all = product_df
    else:
        product_all = pd.concat([product_all, product_df])

print(product_all)
Thank you.

I don't know the exact behavior of the endpoint, but assuming that requesting a page number greater than the last page returns an empty list, you can simply check whether the result is empty:
page_number = 1
token = get_token()
product_df, product_json = get_products(token, page_number)
product_all = product_df

while product_df.size:
    page_number = page_number + 1
    token = get_token()
    product_df, product_json = get_products(token, page_number)
    product_all = pd.concat([product_all, product_df])

print(product_all)
If you are sure each page holds at most 1000 records, you could instead stop the loop as soon as a page returns fewer than 1000 records.
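A minimal sketch of that check, assuming get_token and get_products behave as in the question:

import pandas as pd

PAGE_SIZE = 1000
frames = []
page_number = 1
while True:
    token = get_token()
    product_df, product_json = get_products(token, page_number)
    frames.append(product_df)
    if len(product_df) < PAGE_SIZE:
        break  # last (partial or empty) page reached
    page_number += 1

product_all = pd.concat(frames, ignore_index=True)
print(product_all)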

It depends on how your backend's GET endpoint returns its JSON. Since page and limit are required parameters, you could also change the endpoint to return all of the data at once instead of 1000 records per page. Otherwise, loop over the known number of pages:
num = int(50000 / 1000)

for i in range(1, num + 1):  # range(1, num) would miss the last page
    token = get_token()
    product_df, product_json = get_products(token, i)
    if i == 1:
        product_all = product_df
    else:
        product_all = pd.concat([product_all, product_df])

print(product_all)
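Here the total of 50,000 happens to be an exact multiple of 1000, but if it were not, rounding the page count up (for example with math.ceil) avoids dropping the final partial page. A sketch, again assuming get_token and get_products as in the question:

import math
import pandas as pd

TOTAL_RECORDS = 50000   # known total from the question
PAGE_SIZE = 1000
num_pages = math.ceil(TOTAL_RECORDS / PAGE_SIZE)  # rounds up when not exact

frames = []
for i in range(1, num_pages + 1):
    token = get_token()
    product_df, product_json = get_products(token, i)
    frames.append(product_df)

product_all = pd.concat(frames, ignore_index=True)
print(product_all)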

Related

GET request payload in Python

My code sends a GET request using query parameters that depend on a page number.
I then loop over the response to collect some IDs, and also read the next page number from that same response.
With that next page number I send a new GET request and collect the IDs from the new response as well.
My code works fine, but I'm using two loops, which doesn't seem like the right approach. I couldn't do it with one loop; any ideas?
def get():
    response = requests.get(url, headers=header)
    data = response.text
    data = json.loads(data)
    check_if_theres_next_page = data['pagination']['hasMorePages']
    check_for_next_page_number = data['pagination']['nextPage']
    last_page_number = data['pagination']['lastPage']
    orders = data['orders']
    list_of_ids = []
    for manufacturingOrderId in orders:
        ids = manufacturingOrderId['manufacturingOrderId']
        list_of_ids.append(ids)
    if check_for_next_page_number == 4:
        check_for_next_page_number = last_page_number
    if check_if_theres_next_page:
        url_ = url + '&page_number=' + str(check_for_next_page_number)
        response = requests.get(url_, headers=header)
        data = response.text
        data = json.loads(data)
        orders = data['orders']
        for manufacturingOrderId_ in orders:
            ids = manufacturingOrderId_['manufacturingOrderId']
            list_of_ids.append(ids)
        if "nextPage" in data['pagination']:
            check_for_next_page_number = data['pagination']['nextPage']
        else:
            check_if_theres_next_page = False
    return list_of_ids
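One way this might collapse into a single loop is to keep following the API's own pagination flags until hasMorePages is false. A rough sketch, assuming url and header are defined as in the question and leaving out the special handling of page number 4:

import requests

def get_all_ids():
    # Sketch: one loop that follows the API's pagination flags.
    # Assumes `url` and `header` are defined as in the question and that every
    # response contains a 'pagination' object with 'hasMorePages'.
    list_of_ids = []
    next_url = url
    while True:
        data = requests.get(next_url, headers=header).json()
        for order in data['orders']:
            list_of_ids.append(order['manufacturingOrderId'])
        pagination = data['pagination']
        if not pagination.get('hasMorePages') or 'nextPage' not in pagination:
            break
        next_url = url + '&page_number=' + str(pagination['nextPage'])
    return list_of_ids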

How can I web scrape a paginated table while looping through URLs that hold different numbers of pages

I am trying to find a way to loop through URLs and scrape paginated tables in each of them. The issue arises when some URLs have differing page numbers (in some cases there is no table!). Can someone explain to me where I went wrong and how to fix this? (Please let me know if you require further info.)
def get_injuries(pages):
    Injuries_list = []
    for page in range(1, pages + 1):
        for player_id in range(1, 10):
            headers = {"User-Agent": "Mozilla/5.0"}
            url = 'https://www.transfermarkt.co.uk/neymar/verletzungen/spieler/' + str(player_id)
            print(url)
            html = requests.get(url, headers=headers)
            soup = bs(html.content)
            # Select first table
            if soup.select('.responsive-table > .grid-view > .items > tbody'):
                soup = soup.select('.responsive-table > .grid-view > .items > tbody')[0]
            try:
                for cells in soup.find_all(True, {"class": re.compile("^(even|odd)$")}):
                    Season = cells.find_all('td')[1].text
                    Strain = cells.find_all('td')[2].text
                    Injury_From = cells.find_all('td')[3].text
                    Injury_To = cells.find_all('td')[4].text
                    Duration_days = cells.find_all('td')[5].text
                    Games_missed = cells.find_all('td')[6].text
                    Club_affected = cells.find_all('td')[6].img['alt']
                    player = {
                        'name': cells.find_all("h1", {"itemprop": "name"}),
                        'Season': Season,
                        'Strain': Strain,
                        'Injury_from': Injury_From,
                        'Injury_To': Injury_To,
                        'Duration (days)': Duration_days,
                        'Games_Missed': Games_missed,
                        'Club_Affected': Club_affected
                    }
                    players_list.append(player)
            except IndexError:
                pass
            return Injuries_list
return Injuries_list
This should sit outside the loops, at the end of the function; as written it returns after looping only once, so only one URL is processed.
Only one for loop, over the player ids, is sufficient to get the data.
players_list = [] is never defined; create it once at the start.
You are not doing anything with Injuries_list, so the function returns an empty list.
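Putting those points together, a corrected version might look roughly like this, keeping the question's selectors and column indices; treat it as an untested sketch rather than a drop-in replacement:

import re
import requests
from bs4 import BeautifulSoup as bs

def get_injuries():
    headers = {"User-Agent": "Mozilla/5.0"}
    players_list = []  # create the list once, at the start
    for player_id in range(1, 10):  # a single loop over player ids is enough
        url = 'https://www.transfermarkt.co.uk/neymar/verletzungen/spieler/' + str(player_id)
        html = requests.get(url, headers=headers)
        soup = bs(html.content, 'html.parser')
        table = soup.select('.responsive-table > .grid-view > .items > tbody')
        if not table:
            continue  # some pages have no injury table
        for cells in table[0].find_all(True, {"class": re.compile("^(even|odd)$")}):
            tds = cells.find_all('td')
            try:
                players_list.append({
                    'Season': tds[1].text,
                    'Strain': tds[2].text,
                    'Injury_from': tds[3].text,
                    'Injury_To': tds[4].text,
                    'Duration (days)': tds[5].text,
                    'Games_Missed': tds[6].text,
                    'Club_Affected': tds[6].img['alt'],
                })
            except (IndexError, TypeError):
                pass
    return players_list  # return only after all pages are processed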

How to build a web scraping function for a subreddit?

Summary: I want to scrape a subreddit and then turn the data into DataFrames. I know how to do each request individually, but I am stuck on wrapping it in a function.
Here is how I do it one by one.
url = 'https://api.pushshift.io/reddit/search/submission'
params3 = {'subreddit':'Apple', 'size': 500,'before':1579411194}
res3 = requests.get(url, params3)
data = res3.json()
post3 = data['data']
apdf3 = pd.DataFrame(post3)
Here is the function I came up with so far:
url = 'https://api.pushshift.io/reddit/search/submission'

def webscrape(subreddit, size,):
    for i in range(1, 11):
        params = {"subreddit":subreddit, 'size':size, 'before': f'post{i}'[-1]['created_utc']}
        res = requests.get(url, params)
        f'data{i}' = res.json()
        f'post{i}' = data[f'data{i}']
        f'ap_df{i}' = pd.DataFrame(f'post{i})
My problem is that the first request doesn't need 'before', but every request after that needs 'before' set to the 'created_utc' of the last post from the previous response, so that I get all the posts earlier than it. How do I reconcile this conflict?
Many thanks!
What you are asking for is doable, but I don't think f-strings will work here. The code below attaches each dataframe to a dictionary of dataframes. Try it and see if it works:
d = {}
url = 'https://api.pushshift.io/reddit/search/submission'

def webscraper(subreddit, size,):
    bef = 0
    for i in range(1, 11):
        if i == 1:
            params = {"subreddit":subreddit, 'size':size}
        else:
            params = {"subreddit":subreddit, 'size':size, 'before': bef}
        res = requests.get(url, params)
        data = res.json()
        dat = data['data']
        bef = dat[-1]['created_utc']
        df_name = subreddit + str(i)
        d[df_name] = pd.DataFrame(dat)
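As a usage sketch, once the function has run you can stitch the stored frames together into a single DataFrame (webscraper and d as defined above):

import pandas as pd

webscraper('Apple', 500)                              # fills d with Apple1 ... Apple10
apple_all = pd.concat(d.values(), ignore_index=True)  # one combined DataFrame
print(apple_all.shape)

One thing to watch: bef = dat[-1]['created_utc'] raises an IndexError if a page ever comes back empty, so a guard may be needed for small subreddits.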

How to call and loop through a paginated API using Python

I am trying to call an API from Python to get a result set of about 21,500 records, with the PageSize limit (or default) at 4000 records. I do not know the total number of pages, and there is no "next_url" or "last_page_url" link. The only thing given is the total number of results, 21205, which divided by the PageSize limit of 4000 comes to 5.30125 pages.
There are two possible approaches I'm thinking of; I'm just not sure how to put them in code.
First, a while loop that keeps requesting another page as long as the result set equals the PageSize of 4000.
Second, a for loop where the total page count of 5.3 is rounded up to 6 to get all records, paginating with page += 1.
Lastly, I need to append all the records to a pandas DataFrame so I can export them to a SQL table.
Any help is greatly appreciated.
url = "https://api2.enquiresolutions.com/v3/?Id=XXXX&ListId=161585&PageSize=4000"
auth = { 'Ocp-Apim-Subscription-Key': 'XXX', 'Content-Type': 'application/json'}
params = {'PageNumber': page}
res = requests.get(url=url, headers=auth, params=params).json()
df = pd.DataFrame(res['result'])
total_result= df['total'][0]
total_pages = int(total_result) /4000
properties = json_normalize(df['individuals'],record_path=['properties'],meta=
['casenumber','individualid','type'])
properties['Data'] = properties.label.str.cat(properties.id,sep='_')
properties = properties.drop(['label','id'],axis=1)
pivotprop = properties.pivot(index='individualid', columns='Data', values='value')
data = pivotprop.reset_index()
data.to_sql('crm_Properties',con=engine, if_exists='append'
Are you looking for something like this? You just loop until the result size is less than 4000 and consolidate the data in a list:
url = "https://api2.enquiresolutions.com/v3/?Id=XXXX&ListId=161585&PageSize=4000"
auth = { 'Ocp-Apim-Subscription-Key': 'XXX', 'Content-Type': 'application/json'}
page = 0
params = {'PageNumber': page}
pages_remaining = True
full_res = []
while pages_remaining:
res = requests.get(url=url, headers=auth, params=params).json()
full_res.append(res['result'])
page += 4000
params = {'PageNumber' : page}
if not len(res['result']) == 4000:
pages_remaining = False
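To get everything into a single pandas DataFrame at the end, one option is to normalize each stored page the same way the question does and then concatenate. A sketch, assuming each entry of full_res has the shape of res['result'] from the question:

import pandas as pd
from pandas import json_normalize  # pandas >= 1.0

frames = []
for result in full_res:  # one entry per fetched page
    page_df = pd.DataFrame(result)
    frames.append(json_normalize(page_df['individuals'],
                                 record_path=['properties'],
                                 meta=['casenumber', 'individualid', 'type']))

all_properties = pd.concat(frames, ignore_index=True)
# continue with the pivot and to_sql steps from the question on all_properties

Also note that whether page should advance by 1 or by the 4000 page size depends on how the API interprets PageNumber; the loop above keeps the answer's increment, so it may need adjusting.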

Wikipedia All-Pages API returns the same page titles after 30 requests

I want to extract all Wikipedia titles via the API. Each response contains a continue key which is used to get the next logical batch, but after 30 requests the continue key starts to repeat, meaning I am receiving the same pages.
I have tried the following code and the Wikipedia documentation:
https://www.mediawiki.org/wiki/API:Allpages
def get_response(self, url):
    resp = requests.get(url=url)
    return resp.json()

appcontinue = []
url = 'https://en.wikipedia.org/w/api.php?action=query&list=allpages&format=json&aplimit=500'
json_resp = self.get_response(url)
next_batch = json_resp["continue"]["apcontinue"]
url += '&apcontinue=' + next_batch
appcontinue.append(next_batch)

while True:
    json_resp = self.get_response(url)
    url = url.replace(next_batch, json_resp["continue"]["apcontinue"])
    next_batch = json_resp["continue"]["apcontinue"]
    appcontinue.append(next_batch)
I am expecting to receive more than 10,000 unique continue keys, since one response can contain at most 500 titles and Wikipedia has 5,673,237 articles in English.
The actual result: I made more than 600 requests and got only 30 unique continue keys.
json_resp["continue"] contains two pairs of values, one is apcontinue and the other is continue. You should add them both to your query. See https://www.mediawiki.org/wiki/API:Query#Continuing_queries for more details.
Also, I think it'll be easier to use the params parameter of requests.get instead of manually replacing the continue values. Perhaps something like this:
import requests

def get_response(url, params):
    resp = requests.get(url, params)
    return resp.json()

url = 'https://en.wikipedia.org/w/api.php?action=query&list=allpages&format=json&aplimit=500'
params = {}

while True:
    json_resp = get_response(url, params)
    params = json_resp["continue"]
    ...
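A rough sketch of how that loop could be completed, collecting the titles and stopping once the API no longer returns a continue block (the final batch omits it); the helper name get_all_titles is just for illustration:

import requests

def get_all_titles():
    # Follow the MediaWiki continue parameters until the API stops
    # returning them, collecting page titles along the way.
    url = 'https://en.wikipedia.org/w/api.php'
    params = {'action': 'query', 'list': 'allpages',
              'format': 'json', 'aplimit': 500}
    titles = []
    while True:
        data = requests.get(url, params=params).json()
        titles.extend(page['title'] for page in data['query']['allpages'])
        if 'continue' not in data:
            break                          # last batch: no continuation block
        params.update(data['continue'])    # carries both apcontinue and continue
    return titles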
