I'm trying to export a repo list, but it always returns information about the first page only. I can raise the number of items per page by appending "?per_page=100" to the URL, but that still isn't enough to get the whole list.
I need to know how to build the full list by extracting data from pages 1, 2, ..., N.
I'm using the Requests module, like this:
while i <= 2:
    r = requests.get('https://api.github.com/orgs/xxxxxxx/repos?page{0}&per_page=100'.format(i), auth=('My_user', 'My_passwd'))
    repo = r.json()
    j = 0
    while j < len(repo):
        print repo[j][u'full_name']
        j = j + 1
    i = i + 1
I use that while condition because I know there are 2 pages, and I try to increase the page number that way, but it doesn't work.
import requests

url = "https://api.github.com/XXXX?simple=yes&per_page=100&page=1"
res = requests.get(url, headers={"Authorization": git_token})
repos = res.json()
# follow the 'next' links until there are no more pages
while 'next' in res.links.keys():
    res = requests.get(res.links['next']['url'], headers={"Authorization": git_token})
    repos.extend(res.json())
If you aren't making a full-blown app, use a "Personal Access Token":
https://github.com/settings/tokens
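A minimal sketch of what the git_token value above is assumed to look like (the token itself is a placeholder; GitHub accepts either "token <PAT>" or "Bearer <PAT>" as the header value):
# hypothetical placeholder token string
git_token = "token ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
headers = {"Authorization": git_token}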
From the GitHub docs:
Response:
Status: 200 OK
Link: <https://api.github.com/resource?page=2>; rel="next",
<https://api.github.com/resource?page=5>; rel="last"
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4999
You get the links to the next and the last page of that organization. Just check the headers.
With Python Requests, you can access the headers with:
response.headers
It is a dictionary containing the response headers. If the link header is present, there are more pages and it will contain the related URLs. It is recommended to traverse using those links instead of building your own.
You can try something like this:
import requests

url = 'https://api.github.com/orgs/xxxxxxx/repos?page{0}&per_page=100'
response = requests.get(url)
link = response.headers.get('link', None)
if link is not None:
    print link
If link is not None it will be a string containing the relevant links for your resource.
From my understanding, link will be None if only a single page of data is returned, otherwise link will be present even when going beyond the last page. In this case link will contain previous and first links.
Here is a small helper that returns the link for the next page, or None if there is no next page, so it can be incorporated in a loop (a usage sketch follows the snippet).
def get_next_page(r):
    link = r.headers.get('link')
    if link is None:
        return None
    # Should be a comma separated string of links
    links = link.split(',')
    for link in links:
        # If there is a 'next' link return the URL between the angle brackets, or None
        if 'rel="next"' in link:
            return link[link.find("<") + 1:link.find(">")]
    return None
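A sketch of incorporating it in a loop, using the repos URL and credentials from the question:
import requests

url = 'https://api.github.com/orgs/xxxxxxx/repos?per_page=100'
repos = []
while url is not None:
    r = requests.get(url, auth=('My_user', 'My_passwd'))
    repos.extend(r.json())
    url = get_next_page(r)  # None when there is no rel="next" link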
Extending on the answers above, here is a recursive function to deal with GitHub pagination. It iterates through all pages, concatenating the list with each recursive call, and finally returns the complete list when there are no more pages to retrieve, unless the optional failsafe kicks in and returns the list early once it contains more than 500 items.
import requests

api_get_users = 'https://api.github.com/users'

def call_api(apicall, **kwargs):
    data = kwargs.get('page', [])
    resp = requests.get(apicall)
    data += resp.json()
    # failsafe
    if len(data) > 500:
        return data
    if 'next' in resp.links.keys():
        return call_api(resp.links['next']['url'], page=data)
    return data
data = call_api(api_get_users)
First, use
print(a.headers.get('link'))
This will show you how many pages of repos there are, similar to the output below:
<https://api.github.com/organizations/xxxx/repos?page=2&type=all>; rel="next",
<https://api.github.com/organizations/xxxx/repos?page=8&type=all>; rel="last"
From this you can see that we are currently on the first page of repos; rel="next" says that the next page is 2, and rel="last" tells us that the last page is 8.
After you know the number of pages to traverse, you just need to use '=' for the page number in the request URL and run the loop up to the last page number, not len(repo), since that will return 100 each time.
For example:
i = 1
while i <= 8:
    r = requests.get('https://api.github.com/orgs/xxxx/repos?page={0}&type=all'.format(i),
                     auth=('My_user', 'My_passwd'))
    repo = r.json()
    for j in repo:
        print(j[u'full_name'])
    i = i + 1
# extract the number of the next page from the Link header
link = res.headers.get('link', None)
if link is not None:
    link_next = [l for l in link.split(',') if 'rel="next"' in l]
    if len(link_next) > 0:
        return int(link_next[0][link_next[0].find("page=") + 5:link_next[0].find(">")])
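For context, a sketch of how a snippet like that could drive a page loop; this version leans on requests' already-parsed res.links instead of slicing the raw header, and the helper name and credentials are placeholders:
import requests
from urllib.parse import urlparse, parse_qs

def next_page_number(res):
    # requests parses the Link header into res.links for us
    nxt = res.links.get('next')
    if nxt is None:
        return None
    # pull the page number out of the next URL's query string
    return int(parse_qs(urlparse(nxt['url']).query)['page'][0])

page = 1
while page is not None:
    res = requests.get('https://api.github.com/orgs/xxxxxxx/repos',
                       params={'page': page, 'per_page': 100},
                       auth=('My_user', 'My_passwd'))
    # ... process res.json() here ...
    page = next_page_number(res)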
My goal is to scrape all of the reviews of this firm. I tried adapting @Driftr95's code:
def extract(pg):
    headers = {'user-agent': 'Mozilla/5.0'}
    url = f'https://www.glassdoor.com/Reviews/3M-Reviews-E446_P{pg}.htm?filter.iso3Language=eng'
    # f'https://www.glassdoor.com/Reviews/Google-Engineering-Reviews-EI_IE9079.0,6_DEPT1007_IP{pg}.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')  # returns the whole page as a soup object
    return soup
for j in range(1, 21, 10):
    for i in range(j+1, j+11, 1):  # 3M: 4251 reviews
        soup = extract( f'https://www.glassdoor.com/Reviews/3M-Reviews-E446_P{i}.htm?filter.iso3Language=eng')
        print(f' page {i}')
        for r in soup.select('li[id^="empReview_"]'):
            rDet = {'reviewId': r.get('id')}
            for sr in r.select(subRatSel):
                k = sr.select_one('div:first-of-type').get_text(' ').strip()
                sval = getDECstars(sr.select_one('div:nth-of-type(2)'), soup)
                rDet[f'[rating] {k}'] = sval
            for k, sel in refDict.items():
                sval = r.select_one(sel)
                if sval: sval = sval.get_text(' ').strip()
                rDet[k] = sval
            empRevs.append(rDet)
When not all of the subratings are available, all four subratings turn out to be N.A.
There were some things that I didn't account for because I hadn't encountered them before, but the updated version of getDECstars shouldn't have that issue. (If you use the longer version with the argument isv=True, it's easier to debug and figure out what's missing from the code...)
I scraped 200 reviews in this case, and it turned out that only 170 of them were unique.
Duplicates are fairly easy to avoid by maintaining a list of reviewIds that have already been added and checking against it before adding a new review to empRevs
scrapedIds = []
# for ...
#     for ...
#         soup = extract...
#         for r in ...
              if r.get('id') in scrapedIds: continue  # skip duplicate
              ## rDet = .....  ## AND REST OF INNER FOR-LOOP ##
              empRevs.append(rDet)
              scrapedIds.append(rDet['reviewId'])  # add to list of ids to check against
HTTPS tends to time out after 100 rounds...
You could try adding breaks and switching user-agents every 50 [or 5 or 10 or...] requests, but I'm quick to resort to Selenium at times like this; this is my suggested solution - just call it like this and pass a URL to start with:
## PASTE [OR DOWNLOAD&IMPORT] from https://pastebin.com/RsFHWNnt ##
startUrl = 'https://www.glassdoor.com/Reviews/3M-Reviews-E446.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng'
scrape_gdRevs(startUrl, 'empRevs_3M.csv', maxScrapes=1000, constBreak=False)
[last 3 lines of] printed output:
total reviews: 4252
total reviews scraped this run: 4252
total reviews scraped over all time: 4252
It clicks through the pages until it reaches the last page (or maxes out maxScrapes). You do have to log in at the beginning, though, so fill out login_to_gd with your username and password, or log in manually by replacing the login_to_gd(driverG) line with the input(...) line that waits for you to log in [then press ENTER in the terminal] before continuing.
I think cookies could also be used instead (with requests), but I'm not good at handling those. If you figure it out, you can use some version of linkToSoup or your extract(pg); then you'll have to comment out or remove the lines ending in ## for selenium and uncomment [or follow the instructions in] the lines that end with ## without selenium. [Please note that I've only fully tested the selenium version.]
The CSVs [like "empRevs_3M.csv" and "scrapeLogs_empRevs_3M.csv" in this example] are updated after every page scrape, so even if the program crashes [or you decide to interrupt it], it will have saved everything up to the previous scrape. Since it also tries to load from the CSVs at the beginning, you can simply continue later (just set startUrl to the URL of the page you want to continue from - even if that's page 1, duplicates will be ignored, so it's okay; it will just waste some time).
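If you do stay with requests, here is a minimal sketch of the "breaks and switching user-agents" idea mentioned above; the user-agent strings, pause length, and rotation interval are assumptions rather than tested values:
import random
import time
import requests

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def polite_get(url, request_count):
    # take a short break every 50 requests, and rotate the user-agent each block
    if request_count and request_count % 50 == 0:
        time.sleep(random.uniform(30, 60))
    ua = USER_AGENTS[(request_count // 50) % len(USER_AGENTS)]
    return requests.get(url, headers={'user-agent': ua})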
I am trying to extract a list of Instagram posts that have been tagged with a certain hashtag. I am using a RapidAPI endpoint found here. Instagram paginates the results which are returned, so I have to cycle through the pages to get all results. I am encountering a very strange bug/error where I am receiving the next page as requested, but the posts are from the previous page.
To use the analogy of a book, I can see page 1 of the book and I can request to the book to show me page 2. The book is showing me a page labeled page 2, but the contents of the page are the same as page 1.
Using the container provided by the RapidAPI website, I do not encounter this error. This leads me to believe that problem must be on my end, presumably in the while loop I have written.
If somebody could please review my while loop, or suggest anything else that would correct the problem, I would greatly appreciate it. The list index out of range error at the bottom is easily fixable, so I'm not concerned about it.
Other info: this particular hashtag has 694 results, and the API returns 50 items per page.
import http.client
import json
import time
conn = http.client.HTTPSConnection("instagram-data1.p.rapidapi.com") #endpoint supplied by RAPIDAPI
##Begin Credential Section
headers = {
    'x-rapidapi-key': "*removed*",
    'x-rapidapi-host': "instagram-data1.p.rapidapi.com"
}
##End Credential Section
hashtag = 'givingtuesdayaus'
conn.request("GET", "/hashtag/feed?hashtag=" + hashtag, headers=headers)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8")) #Purely for debugging, can be disabled
json_dictionary = json.loads(data.decode("utf-8")) #Saving returned results into JSON format, because I find it easier to work with
i = 1 # Results need to cycle through pages, using 'i' to track the number of loops and for input in the name of the file which is saved
with open(hashtag + str(i) + '.json', 'w') as json_file:
    json.dump(json_dictionary['collector'], json_file)
#JSON_dictionary contains five fields, 'count' which is number of results for hashtag query, 'has_more' boolean indicating if there are additional pages
# 'end_cursor' string which can be added to the url to cycle to the next page, 'collector' list containing post information, and 'len'
#while loop essentially checks if the 'has_more' indicates there are additional pages, if true uses the 'end_cursor' value to cycle to the next page
while json_dictionary['has_more']:
    time.sleep(1)
    cursor = json_dictionary['end_cursor']
    conn.request("GET", "/hashtag/feed?hashtag=" + hashtag + '&end-cursor=' + cursor, headers=headers)
    res = conn.getresponse()
    data = res.read()
    json_dictionary = json.loads(data.decode("utf-8"))
    i += 1
    print(i)
    print(json_dictionary['collector'][1]['id'])
    print(cursor)  # these three print rows are only used for debugging
    with open(hashtag + str(i) + '.json', 'w') as json_file:
        json.dump(json_dictionary['collector'], json_file)
Results from the Python console (as you can see, cursor and i advance, but the post id remains the same; the saved JSON files also all contain the same posts):
> {"count":694,"has_more":true,"end_cursor":"QVFCd2pVdEN2d01rNkw3UmRKSGVUN1EyanBlYzBPMS15MkIyUG1VdHhjWlJWMDBwRmVhaEYxd0czSE0wMktFcGhfMnItak5ZOE1GTzJvd05FU0pTMWxmVg==","collector":[{"id":"2467140087692742224","shortcode":"CI9CtaaDU5Q","type":"GraphImage",.....}
> #shortened by poster 2 2464906276234990574 QVFCd2pVdEN2d01rNkw3UmRKSGVUN1EyanBlYzBPMS15MkIyUG1VdHhjWlJWMDBwRmVhaEYxd0czSE0wMktFcGhfMnItak5ZOE1GTzJvd05FU0pTMWxmVg==
> 3 2464906276234990574
> QVFDVUlROFVKVVB3SEwyR05MSzJHZ2V1UXZqSzlzTVFhWDNBM3hXNENMcThKWExwWU90RFRnRm1FNWtSRGtrbTdORFIwRlU2QWZaSVByOHZhSXFnQnJsVg==
> 4 2464906276234990574
> QVFEVFpheV9SeFZCcWlKYkc3NUZZdG00Rk5KMWJsQVBNakJlZDcyMGlTWm9rUTlIQzRoYjVtTU1uRmhJZG5TTFBSOXdhbHozVUViUjZEbVpLdjVUQlJtVQ==
> Traceback (most recent call last): File "<input>", line 33, in
> <module> IndexError: list index out of range
It looks like you are indexing into the list with an index that is larger than the list itself.
For example:
data = [1, 2, 3, 4, 5]
You must use an index that exists in the list:
data[4]
not one that is out of range:
data[6]
That mistake produces
IndexError: list index out of range
The problem is probably in this part:
print(json_dictionary['collector'][1]['id'])
print(cursor)  # these three print rows are only used for debugging
with open(hashtag + str(i) + '.json', 'w') as json_file:
    json.dump(json_dictionary['collector'], json_file)
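A simple guard avoids the crash when a page comes back with fewer items than expected; a sketch to drop into the loop:
# only print the second post's id if the page actually contains it
if len(json_dictionary['collector']) > 1:
    print(json_dictionary['collector'][1]['id'])
else:
    print('page returned', len(json_dictionary['collector']), 'items')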
Apologies to everyone who has read this far; I am an idiot.
I identified the error shortly after posting:
conn.request("GET", "/hashtag/feed?hashtag=" + hashtag +'&end-cursor=' + cursor, headers=headers)
'end-cursor' should be 'end_cursor'.
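The corrected request line:
conn.request("GET", "/hashtag/feed?hashtag=" + hashtag + '&end_cursor=' + cursor, headers=headers)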
I have a BeautifulSoup problem that hopefully you can help me out with.
Currently, I have a website with a lot of links on it. The links lead to pages that contain the data of that item that is linked. If you want to check it out, it's this one: http://ogle.astrouw.edu.pl/ogle4/ews/ews.html. What I ultimately want to accomplish is to print out the links of the data that are labeled with an 'N'. It may not be apparent at first, but if you look closely on the website, some of the data have 'N' after their Star No, and others do not. Afterwards, I use that link to download a file containing the information I need on that data. The website is very convenient because the download URLs only change a bit from data to data, so I only need to change a part of the URL, as you'll see in the code below.
I currently have accomplished the data downloading part. However, this is where you come in. Currently, I need to put in the identification number of the BLG event that I desire. (This will become apparent after you view the code below.) However, the website is consistently updating over time, and having to manually search for 'N' events takes up unnecessary time. I want the Python code to be able to do it for me. My original thoughts on the subject were that I could have BeautifulSoup search through the text for all N's, but I ran into some issues on accomplishing that. I feel like I am not familiar enough with BeautifulSoup to get done what I wish to get done. Some help would be appreciated.
The code I have currently is below. I have put in a range of BLG events that have the 'N' label as an example.
#Retrieve .gz files from URLs
from urllib.request import urlopen
import urllib.request
from bs4 import BeautifulSoup
#Access website
URL = 'http://ogle.astrouw.edu.pl/ogle4/ews/ews.html'
soup = BeautifulSoup(urlopen(URL))
#Select the desired data numbers
numbers = list(range(974,998))
x=0
for i in numbers:
    numbers[x] = str(i)
    x += 1
print(numbers)
#Get all links and put into list
allLinks = []
for link in soup.find_all('a'):
    list_links = link.get('href')
    allLinks.append(list_links)
#Remove None datatypes from link list
while None in allLinks:
    allLinks.remove(None)
#print(allLinks)
#Remove all links but links to data pages and gets rid of the '.html'
list_Bindices = [i for i, s in enumerate(allLinks) if 'b' in s]
print(list_Bindices)
bLinks = []
for x in list_Bindices:
    bLinks.append(allLinks[x])
bLinks = [s.replace('.html', '') for s in bLinks]
#print(bLinks)
#Create a list of indices for accessing those pages
list_Nindices = []
for x in numbers:
    list_Nindices.append([i for i, s in enumerate(bLinks) if x in s])
#print(type(list_Nindices))
#print(list_Nindices)
nindices_corrected = []
place = 0
while place < len(list_Nindices):
    a = list_Nindices[place]
    nindices_corrected.append(a[0])
    place = place + 1
#print(nindices_corrected)
#Get the page names (without the .html) from the indices
nLinks = []
for x in nindices_corrected:
    nLinks.append(bLinks[x])
#print(nLinks)
#Form the URLs for those pages
final_URLs = []
for x in nLinks:
    y = "ftp://ftp.astrouw.edu.pl/ogle/ogle4/ews/2017/" + x + "/phot.dat"
    final_URLs.append(y)
#print(final_URLs)
#Retrieve the data from the URLs
z = 0
for x in final_URLs:
    name = nLinks[z] + ".dat"
    #print(name)
    urllib.request.urlretrieve(x, name)
    z += 1
#hrm = urllib.request.urlretrieve("ftp://ftp.astrouw.edu.pl/ogle/ogle4/ews/2017/blg-0974.tar.gz", "practice.gz")
This piece of code has taken me quite some time to write, as I am not a professional programmer, nor an expert in BeautifulSoup or URL manipulation in any way. In fact, I use MATLAB more than Python. As such, I tend to think in terms of MATLAB, which translates into less efficient Python code. However, efficiency is not what I am searching for in this problem. I can wait the extra five minutes for my code to finish if it means that I understand what is going on and can accomplish what I need to accomplish. Thank you for any help you can offer! I realize this is a fairly multi-faceted problem.
This should do it:
from urllib.request import urlopen
import urllib.request
from bs4 import BeautifulSoup
#Access website
URL = 'http://ogle.astrouw.edu.pl/ogle4/ews/ews.html'
soup = BeautifulSoup(urlopen(URL), 'html5lib')
Here, I'm using html5lib to parse the URL content.
Next, we'll look through the table, extracting the links whose star entries have an 'N' in them:
table = soup.find('table')
links = []
for tr in table.find_all('tr', {'class': 'trow'}):
    td = tr.findChildren()
    if 'N' in td[4].text:
        links.append('http://ogle.astrouw.edu.pl/ogle4/ews/' + td[1].a['href'])
print(links)
Output:
['http://ogle.astrouw.edu.pl/ogle4/ews/blg-0974.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0975.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0976.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0977.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0978.html',
...
]
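From there you can feed those links back into the download step from the question; a sketch that reuses the FTP path pattern from your code (it assumes the 2017 season and the phot.dat filename, exactly as in the original):
# strip '.html' to get the event name, then fetch its phot.dat over FTP
for page in links:
    event = page.rsplit('/', 1)[-1].replace('.html', '')  # e.g. 'blg-0974'
    data_url = "ftp://ftp.astrouw.edu.pl/ogle/ogle4/ews/2017/" + event + "/phot.dat"
    urllib.request.urlretrieve(data_url, event + ".dat")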
I'm writing code in Python to get all the 'a' tags in a URL using BeautifulSoup, then take the link at position 3, follow that link, and repeat the process about 18 times. I included the code below, which repeats the process 4 times. When I run this code I get the same 4 links in the results, but I should get 4 different links. I think there is something wrong in my loop, specifically in the line that says y=url. I need help figuring out what the problem is.
import re
import urllib
from BeautifulSoup import *

list1 = list()
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
for i in range(4):  # repeat 4 times
    htm2 = urllib.urlopen(url).read()
    soup1 = BeautifulSoup(htm2)
    tags1 = soup1('a')
    for tag1 in tags1:
        x2 = tag1.get('href', None)
        list1.append(x2)
    y = list1[2]
    if len(x2) < 3:  # no 3rd link
        break  # exit the loop
    else:
        url = y
    print y
You're continuing to add the third link EVER FOUND to your result list. Instead you should be adding the third link OF THAT ITERATION (which is soup('a')[2]), then reassigning your url and going again.
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
result = []
for i in range(4):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    links = soup('a')
    for link in links:
        result.append(link)
    try:
        third_link = links[2]['href']
    except IndexError:  # less than three links
        break
    else:
        url = third_link
    print(url)
This is actually pretty simple in a recursive function:
def get_links(url):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    links = soup('a')
    if len(links) < 3:
        # base case
        return links
    else:
        # recurse on third link
        return links + get_links(links[2]['href'])
You can even modify that to make sure you don't recurse too deep:
def get_links(url, times=None):
    '''Returns all <a> tags from `url` and every 3rd link, up to `times` deep

    get_links("protocol://hostname.tld", times=2) -> list

    if times is None, recurse until there are fewer than 3 links to be found
    '''
    def _get_links(url, TTL):
        soup = BeautifulSoup(urllib.urlopen(url).read())
        links = soup('a')
        if (times is not None and TTL >= times) or \
                len(links) < 3:
            # base case
            return links
        else:
            return links + _get_links(links[2]['href'], TTL+1)
    return _get_links(url, 0)
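A quick usage sketch, staying with the question's Python 2 / BeautifulSoup 3 setup (the depth of 4 matches the question's loop):
import urllib
from BeautifulSoup import *

links = get_links('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html', times=4)
for link in links:
    print(link.get('href', None))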
Your current code
y = list1[2]
just prints the URL located at index 2 of list1. Since that list only gets appended to, list1[2] never changes. You should instead select a different index each time you print if you want different URLs. I'm not sure what it is specifically that you're trying to print, but y = list1[-1], for instance, would end up printing the last URL added to the list on that iteration (different each time).
import requests

url = 'http://www.justdial.com/autosuggest.php?'
param = {
    'cases': 'popular',
    'strtlmt': '24',
    'city': 'Mumbai',
    'table': 'b2c',
    'where': '',
    'scity': 'Mumbai',
    'casename': 'tmp,tmp1,24-24',
    'id': '2'
}
res = requests.get(url, params=param)
res = res.json()
Though when I first hit the base URL in a browser, the last 3 params don't show up in the request's query parameters, it still works.
When I hit this API it returns JSON containing 2 keys (total and results).
The results key contains a list of dictionaries (this is the main data), and the other key, total, contains the total number of different categories available on Justdial.
In the present case total=49, so I have to hit the API 3 times, because each call returns only 24 results (24 + 24 + 1, so we need 3 hits).
My question is: is there any way to get the complete JSON in one hit? There are 49 results, so instead of hitting the API 3 times, can we get all the data (all 49 categories) in a single hit? I've already tried many combinations of params without success.
Generally, APIs have a count or max_results parameter -- set this on the URL and you'll get more results back.
Here's the documentation for Twitter's API count parameter: https://dev.twitter.com/docs/api/1.1/get/statuses/user_timeline
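If the API doesn't expose such a parameter, the usual fallback is to page through it until you have collected total items; a generic sketch (the page_param name is hypothetical - for this particular API you would have to find out which request parameter actually advances the window):
import requests

def fetch_all(url, param, page_param, page_size):
    # generic pager: repeat the request, bumping the (assumed) paging parameter,
    # until we've collected as many items as the 'total' field reports
    all_results, page = [], 0
    while True:
        param[page_param] = str(page * page_size)  # hypothetical offset parameter
        res = requests.get(url, params=param).json()
        all_results.extend(res['results'])
        if not res['results'] or len(all_results) >= res['total']:
            break
        page += 1
    return all_results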
The GitHub API requires you to retrieve the data in pages (up to 100 results per page), and the requests response object has a links attribute containing the URL of the next page of results.
The code below iterates through all the teams in an organisation until it finds the team it's looking for:
params = {'page': 1, 'per_page': 100}
another_page = True
api = GH_API_URL + 'orgs/' + org['login'] + '/teams'
while another_page:  # the list of teams is paginated
    r = requests.get(api, params=params, auth=(username, password))
    json_response = json.loads(r.text)
    for i in json_response:
        if i['name'] == team_name:
            return i['id']
    if 'next' in r.links:  # check if there is another page of teams
        api = r.links['next']['url']
    else:
        another_page = False
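Since the snippet uses return, it is assumed to live inside a function; a sketch of one possible wrapper (the name find_team_id is a placeholder):
def find_team_id(org, team_name, username, password):
    api = GH_API_URL + 'orgs/' + org['login'] + '/teams'
    params = {'page': 1, 'per_page': 100}
    while True:  # the list of teams is paginated
        r = requests.get(api, params=params, auth=(username, password))
        for team in r.json():
            if team['name'] == team_name:
                return team['id']
        if 'next' not in r.links:  # no more pages
            return None
        api = r.links['next']['url']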