import requests

url = 'http://www.justdial.com/autosuggest.php?'
param = {
    'cases': 'popular',
    'strtlmt': '24',
    'city': 'Mumbai',
    'table': 'b2c',
    'where': '',
    'scity': 'Mumbai',
    'casename': 'tmp,tmp1,24-24',
    'id': '2'
}
res = requests.get(url, params=param)
res = res.json()
(Note: when I first hit the base URL in a browser, the last three params don't show up in the request's query string, but it still works.)
When I hit this API it returns a JSON object containing two keys (total and results).
The results key contains a list of dictionaries (this is the main data), and the total key contains the total number of different categories available on Justdial.
In the present case total=49, and since the API returns only 24 results per call I have to hit it three times (24 + 24 + 1).
My question is: is there any way to get the complete JSON in one go? There are 49 results, so instead of hitting the API three times, can we get all the data (all 49 categories) in a single hit? I've already tried many combinations of params without success.
Generally, APIs have a count or max_results parameter -- set this in the URL and you'll get more results back.
Here's the documentation for Twitter's API count parameter: https://dev.twitter.com/docs/api/1.1/get/statuses/user_timeline
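As a rough illustration (the endpoint and the parameter name here are hypothetical; the real name -- count, max_results, limit, per_page, etc. -- depends on the API's documentation), you just pass it along with the other query params:

import requests

# 'count' is a placeholder; check the API docs for the real parameter name
params = {'q': 'pizza', 'count': 100}   # ask for up to 100 results in one response
response = requests.get('https://api.example.com/search', params=params)
results = response.json()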
The GitHub API requires you to retrieve the data in pages (up to 100 results per page), and the requests response object has a links attribute with the URL of the next page of results.
The code below iterates through all the teams in an organisation until it finds the team it's looking for:
params = {'page': 1, 'per_page': 100}
another_page = True
api = GH_API_URL + 'orgs/' + org['login'] + '/teams'
while another_page:  # the list of teams is paginated
    r = requests.get(api, params=params, auth=(username, password))
    json_response = json.loads(r.text)
    for i in json_response:
        if i['name'] == team_name:
            return i['id']
    if 'next' in r.links:  # check if there is another page of organisations
        api = r.links['next']['url']
    else:
        another_page = False
I have a list (lst) which is a list of lists. There are 19 elements in this list, and each element contains ~2500 strings.
lst
[['A', 'B', 'C', ...], ['E', 'F', 'G', ...], [...]]
I am using these strings (A, B, ...) to call an API endpoint ('q': element). However, after ~1800 strings I am getting a timeout.
I am running the following lines of code.
import requests
from requests.exceptions import Timeout

def get_val(element):
    url = 'https://www.xxxx/yyy/api/search'
    headers = {'Content-Type': 'application/json'}
    param = {'q': element, 'page': 500}
    try:
        response = requests.get(url, headers=headers, params=param, timeout=(3.05, 27))
        docs = response.json()['response']['docs']
        for result in docs:
            file.write("%s\t%s\n" % (element, result['short_form']))  # 'file' is an open file handle from the surrounding script
    except Timeout:
        print('Timeout has been raised.')

# loop through elements of list
for i in lst:
    for element in i:
        get_val(element)
How can I modify my code to avoid this timeout?
One reason for this timeout could be protection against mass requests, i.e. too many requests in a short time.
To overcome this, a short pause could be added after, for example, every 100 requests. This is a trial-and-error approach, but it could work; worst case, add a delay after every request.
import time
time.sleep(0.5)
The argument is in seconds, so 0.5 means half a second, for example.
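For instance, a minimal sketch reusing the lst and get_val from the question (the 0.5 s pause and the 100-request interval are guesses to tune):

import time

count = 0
for i in lst:
    for element in i:
        get_val(element)
        count += 1
        if count % 100 == 0:   # short pause after every 100 requests
            time.sleep(0.5)    # 0.5 s is a starting point; adjust as needed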
My goal is to scrape all of the reviews of this firm. I tried adapting @Driftr95's code:
def extract(pg):
    headers = {'user-agent': 'Mozilla/5.0'}
    url = f'https://www.glassdoor.com/Reviews/3M-Reviews-E446_P{pg}.htm?filter.iso3Language=eng'
    # f'https://www.glassdoor.com/Reviews/Google-Engineering-Reviews-EI_IE9079.0,6_DEPT1007_IP{pg}.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')  # returns the whole page as a soup object
    return soup
for j in range(1, 21, 10):
    for i in range(j+1, j+11, 1):  # 3M: 4251 reviews
        soup = extract(f'https://www.glassdoor.com/Reviews/3M-Reviews-E446_P{i}.htm?filter.iso3Language=eng')
        print(f' page {i}')
        for r in soup.select('li[id^="empReview_"]'):
            rDet = {'reviewId': r.get('id')}
            for sr in r.select(subRatSel):
                k = sr.select_one('div:first-of-type').get_text(' ').strip()
                sval = getDECstars(sr.select_one('div:nth-of-type(2)'), soup)
                rDet[f'[rating] {k}'] = sval
            for k, sel in refDict.items():
                sval = r.select_one(sel)
                if sval: sval = sval.get_text(' ').strip()
                rDet[k] = sval
            empRevs.append(rDet)
In the case where not all of the subratings are available, all four subratings turn out to be N.A.
There were some things that I didn't account for because I hadn't encountered them before, but the updated version of getDECstars shouldn't have that issue. (If you use the longer version with the argument isv=True, it's easier to debug and figure out what's missing from the code...)
I scraped 200 reviews in this case, and it turned out that only 170 were unique.
Duplicates are fairly easy to avoid by maintaining a list of reviewIds that have already been added and checking against it before adding a new review to empRevs
scrapedIds = []
# for...
#   for ###
#     soup = extract...
#     for r in ...
        if r.get('id') in scrapedIds: continue  # skip duplicate
        ## rDet = ..... ## AND REST OF INNER FOR-LOOP ##
        empRevs.append(rDet)
        scrapedIds.append(rDet['reviewId'])  # add to list of ids to check against
HTTPS tends to time out after 100 rounds...
You could try adding breaks and switching out user-agents every 50 [or 5 or 10 or...] requests, but I'm quick to resort to selenium at times like this. This is my suggested solution; just call it like this and pass a URL to start with:
## PASTE [OR DOWNLOAD&IMPORT] from https://pastebin.com/RsFHWNnt ##
startUrl = 'https://www.glassdoor.com/Reviews/3M-Reviews-E446.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng'
scrape_gdRevs(startUrl, 'empRevs_3M.csv', maxScrapes=1000, constBreak=False)
[last 3 lines of] printed output:
total reviews: 4252
total reviews scraped this run: 4252
total reviews scraped over all time: 4252
It clicks through the pages until it reaches the last page (or maxes out maxScrapes). You do have to log in at the beginning, though, so fill out login_to_gd with your username and password, or log in manually by replacing the login_to_gd(driverG) line with the input(...) line that waits for you to log in [then press ENTER in the terminal] before continuing.
I think cookies could also be used instead (with requests), but I'm not good at handling that. If you figure it out, you can use some version of linkToSoup or your extract(pg); then you'll have to comment out or remove the lines ending in ## for selenium and uncomment [or follow the instructions from] the lines that end with ## without selenium. [But please note that I've only fully tested the selenium version.]
The CSVs [like "empRevs_3M.csv" and "scrapeLogs_empRevs_3M.csv" in this example] are updated after every page scrape, so even if the program crashes [or you decide to interrupt it], it will have saved everything up to the previous scrape. Since it also tries to load from the CSVs at the beginning, you can just continue later (just set startUrl to the URL of the page you want to continue from; even if it's at page 1, duplicates will be ignored, so it's okay - it will just waste some time).
I'm currently working with an API and I'm having trouble returning all values of something. The API allows page sizes of up to 500 records at a time; by default it uses 25 records per "page", and you can move between pages. You can set the page size by adding ?page[size]={number between 1-500} to the endpoint. My issue: I store the values returned from the API in a dictionary, and when I request the maximum amount of data, it works fine if the full dataset has more than 500 records, but when there are fewer than 500 I get a KeyError because the code expects more data than is available. I don't want to guess the exact page size for each request. Is there an easier way to get all available data without having to request the exact amount for the page size? Ideally I'd just request the maximum the API can return every time.
Thanks!
Here's an example of some code from the script:
import operator
import numpy as np
import requests

session = requests.Session()  # assumed: the session is created/authenticated earlier in the script

base_api_endpoint = "{sensitive_url}?page[size]=300"
response = session.get(base_api_endpoint)
print(response)
print(" ")

d = response.json()
data = [item['attributes']['name'] for item in d['data']]

result = {}
sorted_result = {}

for i in data:
    result[i] = data.count(i)

sorted_value_index = np.argsort(result.values())
dictionary_keys = list(result.keys())
sorted_dict = {dictionary_keys[i]: sorted(
    result.values())[i] for i in range(len(dictionary_keys))}

sorted_d = dict(sorted(result.items(), key=operator.itemgetter(1), reverse=True))

for key, value in sorted_d.items():
    print(key)
For some context, this dictionary structure is used in other areas of the program to print both the key and the value, but for simplicity's sake I'm just printing the key here.
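Not a definitive fix, but a minimal sketch of the kind of defensive handling being asked about: request the documented maximum of 500, then iterate over however many records actually come back. The {sensitive_url} placeholder and the session come from the question; collections.Counter and the .get('data', []) guard are my assumptions.

import requests
from collections import Counter

session = requests.Session()  # assumed: the authenticated session used in the question

# Ask for the documented maximum page size of 500
response = session.get("{sensitive_url}?page[size]=500")
d = response.json()

# Iterate over however many records actually came back instead of assuming
# a fixed count, so a page with fewer than 500 records does not raise a KeyError
records = d.get('data', [])
names = [item['attributes']['name'] for item in records]

# Count and sort by frequency -- the same result the original loop builds by hand
for name, count in Counter(names).most_common():
    print(name)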
I am trying to extract a list of Instagram posts that have been tagged with a certain hashtag. I am using a RAPIDAPI found here. Instagram paginates the results which are returned, so I have to cycle through the pages to get all results. I am encountering a very strange bug/error where I am receiving the next page as requested, but the posts are from the previous page.
To use the analogy of a book, I can see page 1 of the book and I can request to the book to show me page 2. The book is showing me a page labeled page 2, but the contents of the page are the same as page 1.
Using the container provided by the RapidAPI website, I do not encounter this error. This leads me to believe that problem must be on my end, presumably in the while loop I have written.
If somebody could please review my while loop, or suggest anything else that would correct the problem, I would greatly appreciate it. The list index range error at the bottom is easily fixable, so I'm not concerned about it.
Other info: this particular hashtag has 694 results, and the API returns pages containing 50 results each.
import http.client
import json
import time
conn = http.client.HTTPSConnection("instagram-data1.p.rapidapi.com") #endpoint supplied by RAPIDAPI
##Begin Credential Section
headers = {
    'x-rapidapi-key': "*removed*",
    'x-rapidapi-host': "instagram-data1.p.rapidapi.com"
}
##End Credential Section
hashtag = 'givingtuesdayaus'
conn.request("GET", "/hashtag/feed?hashtag=" + hashtag, headers=headers)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8")) #Purely for debugging, can be disabled
json_dictionary = json.loads(data.decode("utf-8")) #Saving returned results into JSON format, because I find it easier to work with
i = 1 # Results need to cycle through pages, using 'i' to track the number of loops and for input in the name of the file which is saved
with open(hashtag + str(i) + '.json', 'w') as json_file:
    json.dump(json_dictionary['collector'], json_file)
#JSON_dictionary contains five fields, 'count' which is number of results for hashtag query, 'has_more' boolean indicating if there are additional pages
# 'end_cursor' string which can be added to the url to cycle to the next page, 'collector' list containing post information, and 'len'
#while loop essentially checks if the 'has_more' indicates there are additional pages, if true uses the 'end_cursor' value to cycle to the next page
while json_dictionary['has_more']:
    time.sleep(1)
    cursor = json_dictionary['end_cursor']
    conn.request("GET", "/hashtag/feed?hashtag=" + hashtag + '&end-cursor=' + cursor, headers=headers)
    res = conn.getresponse()
    data = res.read()
    json_dictionary = json.loads(data.decode("utf-8"))
    i += 1
    print(i)
    print(json_dictionary['collector'][1]['id'])
    print(cursor)  # these three print rows are only used for debugging
    with open(hashtag + str(i) + '.json', 'w') as json_file:
        json.dump(json_dictionary['collector'], json_file)
Results from the Python console (as you can see, cursor and i advance, but the post id remains the same; the saved JSON files also all contain the same posts):
> {"count":694,"has_more":true,"end_cursor":"QVFCd2pVdEN2d01rNkw3UmRKSGVUN1EyanBlYzBPMS15MkIyUG1VdHhjWlJWMDBwRmVhaEYxd0czSE0wMktFcGhfMnItak5ZOE1GTzJvd05FU0pTMWxmVg==","collector":[{"id":"2467140087692742224","shortcode":"CI9CtaaDU5Q","type":"GraphImage",.....}
> #shortened by poster 2 2464906276234990574 QVFCd2pVdEN2d01rNkw3UmRKSGVUN1EyanBlYzBPMS15MkIyUG1VdHhjWlJWMDBwRmVhaEYxd0czSE0wMktFcGhfMnItak5ZOE1GTzJvd05FU0pTMWxmVg==
> 3 2464906276234990574
> QVFDVUlROFVKVVB3SEwyR05MSzJHZ2V1UXZqSzlzTVFhWDNBM3hXNENMcThKWExwWU90RFRnRm1FNWtSRGtrbTdORFIwRlU2QWZaSVByOHZhSXFnQnJsVg==
> 4 2464906276234990574
> QVFEVFpheV9SeFZCcWlKYkc3NUZZdG00Rk5KMWJsQVBNakJlZDcyMGlTWm9rUTlIQzRoYjVtTU1uRmhJZG5TTFBSOXdhbHozVUViUjZEbVpLdjVUQlJtVQ==
> Traceback (most recent call last): File "<input>", line 33, in
> <module> IndexError: list index out of range
It looks like you're indexing past the end of a list, i.e. asking for an element beyond how many the list actually contains.
For example:
data = [1, 2, 3, 4, 5]
You must index within the list, like this:
data[4]
not like this:
data[6]
That mistake raises:
IndexError: list index out of range
The problem is probably in this block (particularly the first print):
    print(json_dictionary['collector'][1]['id'])
    print(cursor)  # these three print rows are only used for debugging
    with open(hashtag + str(i) + '.json', 'w') as json_file:
        json.dump(json_dictionary['collector'], json_file)
Apologies to everyone who has read this far; I am an idiot.
I identified the error shortly after posting:
conn.request("GET", "/hashtag/feed?hashtag=" + hashtag +'&end-cursor=' + cursor, headers=headers)
'end-cursor' should be 'end_cursor'.
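In other words, the request inside the while loop should read (only the query parameter name changes):

conn.request("GET", "/hashtag/feed?hashtag=" + hashtag + '&end_cursor=' + cursor, headers=headers)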
I'm trying to export a repo list, and it always returns information about only the first page. I can extend the number of items per page by appending ?per_page=100 to the URL, but that's not enough to get the whole list.
I need to know how I can get the list by extracting data from pages 1, 2, ..., N.
I'm using the Requests module, like this:
while i <= 2:
    r = requests.get('https://api.github.com/orgs/xxxxxxx/repos?page{0}&per_page=100'.format(i), auth=('My_user', 'My_passwd'))
    repo = r.json()
    j = 0
    while j < len(repo):
        print repo[j][u'full_name']
        j = j + 1
    i = i + 1
I use that while condition because I know there are two pages, and I tried to increase the counter that way, but it doesn't work.
import requests

url = "https://api.github.com/XXXX?simple=yes&per_page=100&page=1"
res = requests.get(url, headers={"Authorization": git_token})
repos = res.json()
while 'next' in res.links.keys():
    res = requests.get(res.links['next']['url'], headers={"Authorization": git_token})
    repos.extend(res.json())
If you aren't making a full-blown app, use a "Personal Access Token":
https://github.com/settings/tokens
From github docs:
Response:
Status: 200 OK
Link: <https://api.github.com/resource?page=2>; rel="next",
<https://api.github.com/resource?page=5>; rel="last"
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4999
You get the links to the next and the last page of that organization. Just check the headers.
With Python Requests, you can access the response headers with:
response.headers
It is a dictionary containing the response headers. If link is present, then there are more pages and it will contain related information. It is recommended to traverse using those links instead of building your own.
You can try something like this:
import requests

url = 'https://api.github.com/orgs/xxxxxxx/repos?page{0}&per_page=100'
response = requests.get(url)
link = response.headers.get('link', None)
if link is not None:
    print link
If link is not None it will be a string containing the relevant links for your resource.
From my understanding, link will be None if only a single page of data is returned, otherwise link will be present even when going beyond the last page. In this case link will contain previous and first links.
Here is some sample Python that aims simply to return the link for the next page, and returns None if there is no next page, so it could be incorporated into a loop.
link = r.headers.get('link')  # .get() returns None if the header is absent
if link is None:
    return None

# Should be a comma separated string of links
links = link.split(',')

for link in links:
    # If there is a 'next' link return the URL between the angle brackets, or None
    if 'rel="next"' in link:
        return link[link.find("<")+1:link.find(">")]
return None
Extending the answers above, here is a recursive function to deal with GitHub pagination. It iterates through all pages, concatenating the list with each recursive call, and finally returns the complete list when there are no more pages to retrieve; the optional failsafe returns the list early once there are more than 500 items.
import requests

api_get_users = 'https://api.github.com/users'

def call_api(apicall, **kwargs):
    data = kwargs.get('page', [])
    resp = requests.get(apicall)
    data += resp.json()
    # failsafe
    if len(data) > 500:
        return data
    if 'next' in resp.links.keys():
        return call_api(resp.links['next']['url'], page=data)
    return data

data = call_api(api_get_users)
First, use:
print(a.headers.get('link'))
This will give you the pagination links for the repository, similar to the output below:
<https://api.github.com/organizations/xxxx/repos?page=2&type=all>; rel="next",
<https://api.github.com/organizations/xxxx/repos?page=8&type=all>; rel="last"
From this you can see that we are currently on the first page of repos: rel="next" says the next page is 2, and rel="last" tells us the last page is 8.
Once you know the number of pages to traverse, you just need to add '=' after page in the request URL, and change the while loop to run until the last page number, not len(repo), since that will return 100 each time.
For example:
i = 1
while i <= 8:
    r = requests.get('https://api.github.com/orgs/xxxx/repos?page={0}&type=all'.format(i),
                     auth=('My_user', 'My_passwd'))
    repo = r.json()
    for j in repo:
        print(j[u'full_name'])
    i = i + 1
link = res.headers.get('link', None)
if link is not None:
    link_next = [l for l in link.split(',') if 'rel="next"' in l]
    if len(link_next) > 0:
        return int(link_next[0][link_next[0].find("page=")+5:link_next[0].find(">")])