The code just prints the same email addresses again and again and doesn't go to the next page. Does anybody see the error in my code?
import requests
from bs4 import BeautifulSoup as soup

def get_emails(_links: list):
    for i in range(len(_links)):
        new_d = soup(requests.get(_links[i]).text, 'html.parser').find_all('a', {'class': 'my_modal_open'})
        if new_d:
            yield new_d[-1]['title']

start = 20
while True:
    d = soup(requests.get('http://www.schulliste.eu/type/gymnasien/?bundesland=&start=20').text, 'html.parser')
    results = [i['href'] for i in d.find_all('a')][52:-9]
    results = [link for link in results if link.startswith('http://')]
    print(list(get_emails(results)))
    next_page = d.find('div', {'class': 'paging'}, 'weiter')
    if next_page:
        d = next_page.get('href')
        start += 20
    else:
        break
When you press the "weiter" (next page) button, the URL ending changes from "...start=20" to "...start=40". It goes in steps of 20 because there are 20 results per page.
Assuming next_page returns anything, the problem is that you're trying to do the same thing in two ways at once, but neither is done properly:
1.) You're trying to point d to the next page, yet at the beginning of the loop you reassign d to the starting page again.
2.) You're incrementing start += 20 for the next page, but you never reference start anywhere else in your code.
Thus, you have two ways to tackle this:
1.) Move the d assignment outside of the loop and remove the start variable altogether:
# start=20
# You don't need start because it's not being used at all
# move the initial d assignment outside the loop
d = soup(requests.get('http://www.schulliste.eu/type/gymnasien/?bundesland=&start=20').text, 'html.parser')
while True:
    # rest of your code
    if next_page:
        d = next_page.get('href')
        # start += 20
        # Again, you don't need start any more.
    else:
        break
2.) No need to reassign d; just reference start in your URL at the beginning of the loop and remove the d assignment inside if next_page:
start = 20
while True:
    d = soup(requests.get('http://www.schulliste.eu/type/gymnasien/?bundesland=&start={page_id}'.format(page_id=start)).text, 'html.parser')
    # rest of your code
    if next_page:
        # d = next_page.get('href')
        # this d assignment is redundant as it will get reassigned in the loop. start is your key.
        start += 20
    else:
        break
The problem is with the URL you are requesting. The same URL is requested every time because you are not updating it with the start value you are calculating. Try changing the URL like this:
'http://www.schulliste.eu/type/gymnasien/?bundesland=&start={}'.format(start)
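Putting that together, a minimal sketch of the loop with the formatted URL; everything else (the slicing and the next_page check) is carried over from the question as-is:
start = 20
while True:
    url = 'http://www.schulliste.eu/type/gymnasien/?bundesland=&start={}'.format(start)
    d = soup(requests.get(url).text, 'html.parser')
    results = [i['href'] for i in d.find_all('a')][52:-9]
    results = [link for link in results if link.startswith('http://')]
    print(list(get_emails(results)))
    next_page = d.find('div', {'class': 'paging'}, 'weiter')
    if next_page:
        start += 20  # the URL is rebuilt from start on the next iteration
    else:
        break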
My goal is to scrape the entire set of reviews for this firm. I tried adapting #Driftr95's code:
def extract(pg):
    headers = {'user-agent': 'Mozilla/5.0'}
    url = f'https://www.glassdoor.com/Reviews/3M-Reviews-E446_P{pg}.htm?filter.iso3Language=eng'
    # f'https://www.glassdoor.com/Reviews/Google-Engineering-Reviews-EI_IE9079.0,6_DEPT1007_IP{pg}.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng'
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, 'html.parser')  # returns the whole page as a soup object
    return soup

for j in range(1, 21, 10):
    for i in range(j+1, j+11, 1):  # 3M: 4251 reviews
        soup = extract(f'https://www.glassdoor.com/Reviews/3M-Reviews-E446_P{i}.htm?filter.iso3Language=eng')
        print(f' page {i}')
        for r in soup.select('li[id^="empReview_"]'):
            rDet = {'reviewId': r.get('id')}
            # subRatSel, getDECstars, refDict and empRevs come from the referenced code
            for sr in r.select(subRatSel):
                k = sr.select_one('div:first-of-type').get_text(' ').strip()
                sval = getDECstars(sr.select_one('div:nth-of-type(2)'), soup)
                rDet[f'[rating] {k}'] = sval
            for k, sel in refDict.items():
                sval = r.select_one(sel)
                if sval: sval = sval.get_text(' ').strip()
                rDet[k] = sval
            empRevs.append(rDet)
In cases where not all the subratings are available, all four subratings turn out to be N.A.
There were some things that I didn't account for because I hadn't encountered them before, but the updated version of getDECstars shouldn't have that issue. (If you use the longer version with the argument isv=True, it's easier to debug and figure out what's missing from the code...)
I scraped 200 reviews in this case, and it turned out that only 170 of them were unique.
Duplicates are fairly easy to avoid by maintaining a list of reviewIds that have already been added and checking against it before adding a new review to empRevs
scrapedIds = []
# for...
#     for ###
#         soup = extract...
#         for r in ...
            if r.get('id') in scrapedIds: continue  # skip duplicate
            ## rDet = ..... ## AND REST OF INNER FOR-LOOP ##
            empRevs.append(rDet)
            scrapedIds.append(rDet['reviewId'])  # add to list of ids to check against
HTTPS tends to time out after 100 rounds...
You could try adding breaks and switching out user-agents every 50 [or 5 or 10 or...] requests, but I'm quick to resort to selenium at times like this. This is my suggested solution - just call it like this and pass a url to start with:
## PASTE [OR DOWNLOAD&IMPORT] from https://pastebin.com/RsFHWNnt ##
startUrl = 'https://www.glassdoor.com/Reviews/3M-Reviews-E446.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng'
scrape_gdRevs(startUrl, 'empRevs_3M.csv', maxScrapes=1000, constBreak=False)
[last 3 lines of] printed output:
total reviews: 4252
total reviews scraped this run: 4252
total reviews scraped over all time: 4252
It clicks through the pages until it reaches the last page (or maxes out maxScrapes). You do have to log in at the beginning, though, so fill out login_to_gd with your username and password, or log in manually by replacing the login_to_gd(driverG) line with the input(...) line that waits for you to log in [then press ENTER in the terminal] before continuing.
I think cookies can also be used instead (with requests), but I'm not good at handling that. If you figure it out, then you can use some version of linkToSoup or your extract(pg); then, you'll have to comment out or remove the lines ending in ## for selenium and uncomment [or follow instructions from] the lines that end with ## without selenium. [But please note that I've only fully tested the selenium version.]
The CSVs [like "empRevs_3M.csv" and "scrapeLogs_empRevs_3M.csv" in this example] are updated after every page-scrape, so even if the program crashes [or you decide to interrupt it], it will have saved everything up to the previous scrape. Since it also tries to load from the CSVs at the beginning, you can just continue later (just set startUrl to the url of the page you want to continue from - even if it's at page 1, remember that duplicates will be ignored, so it's okay - it'll just waste some time).
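For the requests-based route mentioned earlier (adding breaks and switching out user-agents every so many requests), a rough sketch might look like this; the user-agent strings, the 50-request interval, and the pause lengths are assumptions rather than values from the original code:
import time
from itertools import cycle

import requests

# hypothetical pool of user-agent strings to rotate through
user_agents = cycle([
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
])
headers = {'user-agent': next(user_agents)}

for n, pg in enumerate(range(1, 201), start=1):
    url = f'https://www.glassdoor.com/Reviews/3M-Reviews-E446_P{pg}.htm?filter.iso3Language=eng'
    r = requests.get(url, headers=headers)
    # ... parse r.content the same way extract() does above ...
    if n % 50 == 0:
        headers = {'user-agent': next(user_agents)}  # switch user-agent every 50 requests
        time.sleep(60)                               # and take a longer break
    else:
        time.sleep(2)                                # short pause between requests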
Instead of writing 10+ if statements, I'm trying to create one if statement using a variable. Unfortunately, I'm not familiar with how to implement string concatenation for XPath in Python. Can anyone show me how to perform string formatting for the following code segments?
I would greatly appreciate it, thanks.
if page_number == 1:
    next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
    next_link.click()
    page_number = page_number + 1
    time.sleep(30)
elif page_number == 2:
    next_link = browser.find_element_by_xpath('//*[@title="Go to page 3"]')
    next_link.click()
    page_number = page_number + 1
    time.sleep(30)
This answer is not about string concatenation, but about a simpler solution to the problem...
Instead of clicking a particular link in the pagination, you can click the "Next" button:
pages_number = 10
for _ in range(pages_number):
    driver.find_element_by_xpath("//a[@title='Go to next page']").click()
    time.sleep(30)
If you need to open a specific page, you can use the following:
required_page = 3
driver.find_element_by_link_text(str(required_page)).click()
P.S. I assumed you are talking about this site
You can use a for-loop
Ex:
for i in range(1, 10):
    next_link = browser.find_element_by_xpath('//*[@title="Go to page {0}"]'.format(str(i+1)))  # using str.format for the page number
    next_link.click()
    time.sleep(30)
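As a hedged variant using an f-string, with a try/except as an assumed stop condition for when no "Go to page N" link exists (browser is assumed to be an existing webdriver instance):
import time
from selenium.common.exceptions import NoSuchElementException

page_number = 1
while True:
    try:
        next_link = browser.find_element_by_xpath(f'//*[@title="Go to page {page_number + 1}"]')
    except NoSuchElementException:
        break  # no link to the next page, so stop paging
    next_link.click()
    page_number += 1
    time.sleep(30)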
This is what I have at the moment:
import bs4
import requests

def getXkcdComic(comicUrl):
    for i in range(0, 20):
        res = requests.get(comicUrl + str(1882 - i))
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        img = soup.select_one("div#comic > img")
        return str(img['src'])

link = getXkcdComic('https://xkcd.com/')
print(link)
It parses the HTML and gets one link, the first one. Since I know the URL numbering finishes at 1882 and the next one I want is 1881, I wrote this for loop to get the rest.
It only prints one result, as if there were no loop at all.
Strangely, if I reduce the indentation of the return statement, it returns a different URL.
I haven't quite figured out how for loops work yet.
Also, this is my first post ever here, so forgive my English and ignorance.
The first time you hit a return statement, the function is going to return, regardless of whether you're in a loop. So your for() loop is going to get to the end of the first iteration, see the return, and that's it. The other 19 iterations won't run.
The reason you get a different URL if you dedent the return is that your for() loop can now run to completion. But since you didn't save any of your previous iterations, it will return only the last one.
What it looks like you might want is to build a list of results, and return that.
def getXkcdComic(comicUrl):
    images = []  # Create an empty list for results
    for i in range(0, 20):
        res = requests.get(comicUrl + str(1882 - i))
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        img = soup.select_one("div#comic > img")
        images.append(str(img['src']))  # Save the result by adding it to the list
    return images  # Return the list
Just remember then that link in your outer scope will actually be a list of links, and handle it accordingly.
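For example, a short usage sketch of handling the returned list:
links = getXkcdComic('https://xkcd.com/')  # now a list of image URLs
for link in links:
    print(link)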
How do you expect to get multiple outputs (URLs here) from a single method call? The for loop helps you iterate over a range multiple times and get multiple results, but it is of no use while the method returns on the first iteration. You can do one of the following:
Instead of writing a loop inside the method, call the method in a loop. That way your output will be printed for each call.
Write the entire thing in the method so that you have multiple print statements.
Do the following:
def getXkcdComic(comicUrl):
    for i in range(0, 20):
        res = requests.get(comicUrl + str(1882 - i))
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        img = soup.select_one("div#comic > img")
        print(str(img['src']))

getXkcdComic('https://xkcd.com/')
Your function returns control to the caller once it encounters the return statement, here in the first iteration of the for loop.
You can yield instead of return in your function to produce image links successively and keep the for loop running:
import bs4
import requests

def getXkcdComic(comicUrl):
    for i in range(0, 20):
        ...
        yield img['src']  # <- here

# make a list of links yielded by the function
links = list(getXkcdComic('https://xkcd.com/'))
References:
Understanding Generators in Python
Python yield expression
That happens because you return inside the loop. Try this:
def getXkcdComic(comicUrl):
    links = list()  # separate list so it isn't overwritten by the response below
    for i in range(0, 20):
        res = requests.get(comicUrl + str(1882 - i))
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, 'html.parser')
        img = soup.select_one("div#comic > img")
        links.append(str(img['src']))
    return links
And you can change this:
for i in range(0, 20):
    res = requests.get(comicUrl + str(1882 - i))
to this:
for i in range(1863, 1883):
    res = requests.get(comicUrl + str(i))
The other answers are good and general, but for this specific case there's an even better way. xkcd provides a JSON API, so you can use a list comprehension:
def getXkcdComic(comicUrl):
    return [requests.get(comicUrl + str(1882 - i) + '/info.0.json').json()['img']
            for i in range(0, 20)]
This is also faster and more friendly to the xkcd servers.
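A short usage sketch, assuming the function above:
links = getXkcdComic('https://xkcd.com/')  # 20 image URLs from the JSON API
print(links[0])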
When you hit 'return' during the first loop iteration, the entire 'getXkcdComic' function exits and returns.
Something like this may work and print what the original code intended:
import bs4
import requests

def getXkcdComic(comicUrl, number):
    res = requests.get(comicUrl + str(number))
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    return str(soup.select_one("div#comic > img")['src'])

link = 'https://xkcd.com/'
for i in range(20):
    print(getXkcdComic(link, 1882 - i))
I'm writing code in Python to get all the 'a' tags from a URL using BeautifulSoup, then take the link at position 3 and follow it, repeating the process about 18 times. I included the code below, which repeats the process 4 times. When I run this code I get the same 4 links in the results, but I should get 4 different links. I think there is something wrong with my loop, specifically in the line that says url=y. I need help figuring out what the problem is.
import re
import urllib
from BeautifulSoup import *

list1 = list()
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
for i in range(4):  # repeat 4 times
    htm2 = urllib.urlopen(url).read()
    soup1 = BeautifulSoup(htm2)
    tags1 = soup1('a')
    for tag1 in tags1:
        x2 = tag1.get('href', None)
        list1.append(x2)
    y = list1[2]
    if len(x2) < 3:  # no 3rd link
        break  # exit the loop
    else:
        url = y
        print y
You're continuing to add the third link EVER FOUND to your result list. Instead you should be adding the third link OF THAT ITERATION (which is soup('a')[2]), then reassigning your url and going again.
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
result = []
for i in range(4):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    links = soup('a')
    for link in links:
        result.append(link)
    try:
        third_link = links[2]['href']
    except IndexError:  # less than three links
        break
    else:
        url = third_link
    print(url)
This is actually pretty simple in a recursive function:
def get_links(url):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    links = soup('a')
    if len(links) < 3:
        # base case
        return links
    else:
        # recurse on third link
        return links + get_links(links[2]['href'])
You can even modify that to make sure you don't recurse too deeply:
def get_links(url, times=None):
    '''Returns all <a> tags from `url` and every 3rd link, up to `times` deep

    get_links("protocol://hostname.tld", times=2) -> list

    if times is None, recurse until there are fewer than 3 links to be found
    '''
    def _get_links(url, TTL):
        soup = BeautifulSoup(urllib.urlopen(url).read())
        links = soup('a')
        if (times is not None and TTL >= times) or \
           len(links) < 3:
            # base case
            return links
        else:
            return links + _get_links(links[2]['href'], TTL + 1)
    return _get_links(url, 0)
Your current code
y = list1[2]
just takes the URL located at index 2 of list1. Since that list only gets appended to, list1[2] never changes. You should instead be selecting a different index each time if you want different URLs. I'm not sure what it is specifically that you're trying to print, but y = list1[-1], for instance, would end up printing the last URL added to the list on that iteration (different each time).
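A minimal sketch of that list1[-1] variant, keeping the rest of the question's loop; note it follows the last link found on each page, not necessarily the third one:
import urllib
from BeautifulSoup import *

list1 = list()
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
for i in range(4):
    soup1 = BeautifulSoup(urllib.urlopen(url).read())
    for tag1 in soup1('a'):
        list1.append(tag1.get('href', None))
    y = list1[-1]  # last URL appended during this iteration (changes each time)
    url = y
    print(y)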
I'm trying to export a repo list and it always returns information about the first page. I could extend the number of items per page using URL+"?per_page=100", but that's not enough to get the whole list.
I need to know how can I get the list extracting data from page 1, 2,...,N.
I'm using the Requests module, like this:
while i <= 2:
    r = requests.get('https://api.github.com/orgs/xxxxxxx/repos?page{0}&per_page=100'.format(i), auth=('My_user', 'My_passwd'))
    repo = r.json()
    j = 0
    while j < len(repo):
        print repo[j][u'full_name']
        j = j + 1
    i = i + 1
I use that while condition because I know there are 2 pages, and I try to increase the counter that way, but it doesn't work.
import requests

url = "https://api.github.com/XXXX?simple=yes&per_page=100&page=1"
res = requests.get(url, headers={"Authorization": git_token})
repos = res.json()
while 'next' in res.links.keys():
    res = requests.get(res.links['next']['url'], headers={"Authorization": git_token})
    repos.extend(res.json())
If you aren't making a full-blown app, use a "Personal Access Token":
https://github.com/settings/tokens
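For reference, the git_token value used with requests above would be the full header value; a sketch with a placeholder token (the token string itself is an assumption):
# hypothetical personal access token generated at https://github.com/settings/tokens
git_token = "token ghp_yourPersonalAccessTokenHere"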
From the GitHub docs:
Response:
Status: 200 OK
Link: <https://api.github.com/resource?page=2>; rel="next",
<https://api.github.com/resource?page=5>; rel="last"
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4999
You get the links to the next and the last page of that organization. Just check the headers.
On Python Requests, you can access your headers with:
response.headers
It is a dictionary containing the response headers. If link is present, then there are more pages and it will contain related information. It is recommended to traverse using those links instead of building your own.
You can try something like this:
import requests

url = 'https://api.github.com/orgs/xxxxxxx/repos?page{0}&per_page=100'
response = requests.get(url)
link = response.headers.get('link', None)
if link is not None:
    print link
If link is not None it will be a string containing the relevant links for your resource.
From my understanding, link will be None if only a single page of data is returned, otherwise link will be present even when going beyond the last page. In this case link will contain previous and first links.
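As a convenience, requests also parses that Link header for you as response.links; a small sketch (the org URL is a placeholder):
import requests

response = requests.get('https://api.github.com/orgs/xxxxxxx/repos?per_page=100')
# requests exposes the parsed Link header as a dict, e.g.
# {'next': {'url': '...&page=2', 'rel': 'next'}, 'last': {'url': '...&page=8', 'rel': 'last'}}
next_url = response.links.get('next', {}).get('url')
last_url = response.links.get('last', {}).get('url')
print(next_url, last_url)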
Here is some sample Python which simply returns the link for the next page, and returns None if there is no next page, so it can be incorporated into a loop.
link = r.headers.get('link')
if link is None:
    return None
# Should be a comma-separated string of links
links = link.split(',')
for link in links:
    # If there is a 'next' link return the URL between the angle brackets, or None
    if 'rel="next"' in link:
        return link[link.find("<") + 1:link.find(">")]
return None
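A sketch of incorporating that snippet into a loop, assuming it is wrapped in a hypothetical helper called get_next_page(r) and using a placeholder org URL:
# assuming the snippet above is the body of a hypothetical helper get_next_page(r)
url = 'https://api.github.com/orgs/xxxxxxx/repos?per_page=100'
while url is not None:
    r = requests.get(url, auth=('My_user', 'My_passwd'))
    for repo in r.json():
        print(repo['full_name'])
    url = get_next_page(r)  # returns None when there is no rel="next" link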
Extending on the answers above, here is a recursive function to deal with GitHub pagination. It iterates through all pages, concatenating the list with each recursive call, and finally returns the complete list when there are no more pages to retrieve, unless an optional failsafe returns the list early when there are more than 500 items.
import requests

api_get_users = 'https://api.github.com/users'

def call_api(apicall, **kwargs):
    data = kwargs.get('page', [])
    resp = requests.get(apicall)
    data += resp.json()
    # failsafe
    if len(data) > 500:
        return data
    if 'next' in resp.links.keys():
        return call_api(resp.links['next']['url'], page=data)
    return data

data = call_api(api_get_users)
First, use
print(a.headers.get('link'))
This will give you the number of pages the repository has, similar to the output below:
<https://api.github.com/organizations/xxxx/repos?page=2&type=all>; rel="next",
<https://api.github.com/organizations/xxxx/repos?page=8&type=all>; rel="last"
From this you can see that we are currently on the first page of repos, rel="next" says that the next page is 2, and rel="last" tells us that the last page is 8.
Once you know the number of pages to traverse, you just need to use '=' for the page number in the request URL and run the loop until the last page number, not len(repo), as that will return 100 each time.
For example:
i = 1
while i <= 8:
    r = requests.get('https://api.github.com/orgs/xxxx/repos?page={0}&type=all'.format(i),
                     auth=('My_user', 'My_passwd'))
    repo = r.json()
    for j in repo:
        print(j[u'full_name'])
    i = i + 1
link = res.headers.get('link', None)
if link is not None:
    link_next = [l for l in link.split(',') if 'rel="next"' in l]
    if len(link_next) > 0:
        return int(link_next[0][link_next[0].find("page=") + 5:link_next[0].find(">")])