Scraping Multiple Pages Using BeautifulSoup 3 - python

I want to scrape multiple pages for specific links. For example, I want to be able to choose which link is followed and for how many iterations. The result of the scrape from the initial input must be appended to, or replace, the user input. I have:
import urllib
from BeautifulSoup import BeautifulSoup

#url = raw_input('Enter - ')
url = 'http://www.columbia.edu/kermit/k95.html'
itr = raw_input('Enter iteration: ')
i = int(itr)
n = raw_input('Enter Number: ')
n = int(n)

html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
tags = soup('a')

print 'Link:', url

while i > 0:
    i = i - 1
    if i == 0:
        break
    for tag in tags:
        me = tag.get('href', None)
        #Just to make sure the link/content match print tag.contents[0]
    link = tags[(n - 1)]
    #print link
    links = link.get('href', None)
    print 'Link:', links
Enter - http://www.columbia.edu/~fdc/
Enter count: 4
Enter Position: 9
Link: http://www.columbia.edu/~fdc/
Link: http://www.columbia.edu/kermit/k95.html
Link: http://www.columbia.edu/kermit/k95.html (Should be k95faq.html)
Link: http://www.columbia.edu/kermit/k95.html (Should be ckfaq.html)
I'm getting the number of iterations I want, and the specific link, but I need the first URL (the user's input) to be replaced with the link stored in the variable "links" on each iteration.
An example would be for a user to input a URL like http://www.columbia.edu/~fdc/ with 4 iterations of the 9th link on the page. The first iteration would return http://www.columbia.edu/kermit/k95.html as "links". I want the second iteration to give me the 9th link on "links", which should be k95faq.html.
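A minimal sketch of that fix (in the same Python 2 / BeautifulSoup 3 style as the code above): re-download and re-parse the page on every iteration, so the n-th link is taken from the page you just followed rather than from the original page.
import urllib
from BeautifulSoup import BeautifulSoup

url = raw_input('Enter - ')
count = int(raw_input('Enter iteration: '))
n = int(raw_input('Enter Number: '))

print 'Link:', url
for _ in range(count):
    html = urllib.urlopen(url).read()   # re-fetch the current url each time
    soup = BeautifulSoup(html)
    tags = soup('a')
    url = tags[n - 1].get('href', None) # the n-th link becomes the next url
    print 'Link:', url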

Related

How to find the total number of pages on a website with BeautifulSoup?

Context: I'm working on pagination for this website: https://skoodos.com/schools-in-uttarakhand. When I inspected it, the website shows no proper page count, only a next button that appends ?page=2 to the URL. Also, searching for page-link gave me the number 20 at the end, so I assumed the total number of pages was 20; upon checking manually, I learnt that only 11 pages exist.
After much trial and error, I finally decided to just index from 0 up to 12 (12 being excluded by Python's range).
What I want to know is: how would you go about figuring out the number of pages on a website that doesn't show an actual page count, only previous and next buttons, and how can I optimize my code accordingly?
Here's my solution to pagination. Any way to optimize this other than me manually finding the number of pages?
from myWork.commons import url_parser, write


def data_fetch(url):
    school_info = []
    for page_number in range(0, 4):
        next_web_page = url + f'?page={page_number}'
        soup = url_parser(next_web_page)
        search_results = soup.find('section', {'id': 'search-results'}).find(class_='container').find(class_='row')
        # rest of the code

    for page_number in range(4, 12):
        next_web_page = url + f'?page={page_number}'
        soup = url_parser(next_web_page)
        search_results = soup.find('section', {'id': 'search-results'}).find(class_='container').find(class_='row')
        # rest of the code


def main():
    url = "https://skoodos.com/schools-in-uttarakhand"
    data_fetch(url)


if __name__ == "__main__":
    main()
Each of your pages (except the last one) will have an element like this:
<a class="page-link"
href="https://skoodos.com/schools-in-uttarakhand?page=2"
rel="next">Next »</a>
E.g. you can extract the link as follows (here for the first page):
link = soup.find('a', class_='page-link', href=True, rel='next')
print(link['href'])
https://skoodos.com/schools-in-uttarakhand?page=2
So, you could make your function recursive. E.g. use something like this:
import requests
from bs4 import BeautifulSoup


def data_fetch(url, results=list()):
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, 'lxml')
    search_results = soup.find('section', {'id': 'search-results'})\
        .find(class_='container').find(class_='row')
    results.append(search_results)
    link = soup.find('a', class_='page-link', href=True, rel='next')
    # link will be `None` for the last page (i.e. `page=11`)
    if link:
        # just adding some prints to show progress of the iteration
        if 'page' not in url:
            print('getting page: 1', end=', ')
        url = link['href']
        # subsequent page nums being retrieved
        print(f'{url.rsplit("=", maxsplit=1)[1]}', end=', ')
        # recursive call
        return data_fetch(url, results)
    else:
        # `page=11` with no next link, we're done
        print('done')
        return results


url = 'https://skoodos.com/schools-in-uttarakhand'
data = data_fetch(url)
So, a call to this function will print progress as:
getting page: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, done
And you'll end up with data containing 11 bs4.element.Tag objects, one for each page.
print(len(data))
11
print(set([type(d) for d in data]))
{<class 'bs4.element.Tag'>}
Good luck with extracting the required info; the site is very slow, and the HTML is particularly sloppy and inconsistent. (e.g. you're right to note there is a page-link elem, which suggests there are 20 pages. But its visibility is set to hidden, so apparently this is just a piece of deprecated/unused code.)
There's a bit at the top that says "Showing the 217 results as per selected criteria". You can write code to extract the number from that, then count the number of results per page and divide to get the expected number of pages (don't forget to round up).
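For example, with the figures mentioned here (217 total results and, per the code below, 20 results per page), the arithmetic works out to the 11 pages you found:
import math

print(math.ceil(217 / 20))   # 217 results / 20 per page -> 11 pages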
If you want to double check, add more code to go to the calculated last page, and:
- if there's no such page, keep decrementing the total and checking until you hit a page that exists;
- if there is such a page, but it has an active/enabled "Next" button, keep going to the next page until reaching the last (basically as you are now).
(Remember that the two cases listed above are contingencies and wouldn't be executed in an ideal scenario.)
So, just to find the number of pages, you could do something like:
import requests
from bs4 import BeautifulSoup
import math


def soupFromUrl(scrapeUrl):
    req = requests.get(scrapeUrl)
    if req.status_code == 200:
        return BeautifulSoup(req.text, 'html.parser')
    else:
        raise Exception(f'{req.reason} - failed to scrape {scrapeUrl}')


def getPageTotal(url):
    soup = soupFromUrl(url)
    # totalResults = int(soup.find('label').get_text().split('(')[-1].split(')')[0])
    totalResults = int(soup.p.strong.get_text())  # both searches should work
    perPageResults = len(soup.select('.m-show'))  # probably always 20
    print(f'{perPageResults} of {totalResults} results per page')
    if not (perPageResults > 0 and totalResults > 0):
        return 0
    lastPageNum = math.ceil(totalResults / perPageResults)

    # Contingencies - will hopefully never be needed
    lpSoup = soupFromUrl(f'{url}?page={lastPageNum}')
    if lpSoup.select('.m-show'):  # page exists
        while lpSoup.select_one('a[rel="next"]'):
            nextLink = lpSoup.select_one('a[rel="next"]')['href']
            lastPageNum = int(nextLink.split('page=')[-1])
            lpSoup = soupFromUrl(nextLink)
    else:  # page does not exist
        while not (lpSoup.select('.m-show') or lastPageNum < 1):
            lastPageNum = lastPageNum - 1
            lpSoup = soupFromUrl(f'{url}?page={lastPageNum}')
    # end Contingencies section

    return lastPageNum
However, it looks like you only want the total pages in order to start the for-loop, but it's not even necessary to use a for-loop at all - a while-loop might be better:
def data_fetch(url):
    school_info = []
    nextUrl = url
    while nextUrl:
        soup = soupFromUrl(nextUrl)
        # GET YOUR DATA FROM PAGE
        nextHL = soup.select_one('a[rel="next"]')
        nextUrl = nextHL.get('href') if nextHL else None
    # code after fetching all pages' data
Although, you could still use for-loop if you had a maximum page number in mind:
def data_fetch(url, maxPages):
    school_info = []
    for p in range(1, maxPages + 1):
        soup = soupFromUrl(f'{url}?page={p}')
        if not soup.select('.m-show'):
            break
        # GET YOUR DATA FROM PAGE
    # code after fetching all pages' data [up to max]
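For completeness, a sketch of how the two helpers above could be combined; the `.m-show` text extraction is only a placeholder assumption standing in for whatever data you actually need from each result card:
url = 'https://skoodos.com/schools-in-uttarakhand'
school_info = []
for p in range(1, getPageTotal(url) + 1):
    soup = soupFromUrl(f'{url}?page={p}')
    # placeholder extraction: collect the text of every result card on the page
    school_info.extend(card.get_text(strip=True) for card in soup.select('.m-show'))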

Trying to crawl all news links in a site (the parsed link only shows 10 results per page; I need to find ALL links)

I am trying to crawl all news links that contain a certain keyword I am looking for.
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import re

key_word = urllib.parse.quote("금리")
url = "https://search.naver.com/search.naver?where=news&query=" + key_word + "%EA%B8%88%EB%A6%AC&sm=tab_opt&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so%3Ar%2Cp%3Afrom20200413to20200414%2Ca%3Aall&mynews=0&refresh_start=0&related=0"

html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

anchor_set = soup.findAll('a')
news_link = []
for a in anchor_set:
    if str(a).find('https://news.naver.com/main/read.nhn?') != -1:
        a = a.get('href')
        news_link.append(a)
Up to this section (code above), I open the URL and retrieve all links that contain read.nhn (Naver's news platform) and append them to news_link.
This is working fine, but the problem is that the URL above only shows 10 articles per page.
count_tag = soup.find("div", {"class", "title_desc all_my"})
count_text = count_tag.find("span").get_text().split()
total_num = count_text[-1][0:-1].replace(",", "")
print(total_num)
Using the code above, I've found out there are a total of 1297 articles that I need to collect, but the original link above only shows 10 articles per page.
for val in range(int(total_num)//10+1):
    start_val = str(val*10+1)
I was told I needed to insert this into the URL to retrieve ALL the news links.
Thus, I used a while loop:
while start_val <= total_num:
    url = "https://search.naver.com/search.naver?where=news&query=" + key_word + "%EA%B8%88%EB%A6%AC&sm=tab_opt&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so%3Ar%2Cp%3Afrom20200413to20200414%2Ca%3Aall&mynews=0&refresh_start=" + start_val + "&related=0"
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    news_link = []
    anchor_set = soup.findAll('a')
    for a in anchor_set:
        if str(a).find('https://news.naver.com/main/read.nhn?') != -1:
            a = a.get('href')
            news_link.append(a)
However, when I run the program, the loop does not stop; obviously there is no else or break. How can I break this loop and successfully collect all the links?
Your current while loop doesn't stop because you never increment the value of start_val inside it. Also, since total_num and start_val are both strings, the comparison in while start_val <= total_num is a string comparison, which is wrong here: for strings, "21" > "1297" because "2" > "1". Compare them as ints.
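A quick illustration of the difference:
print("21" <= "1297")   # False - strings compare character by character ("2" > "1")
print(21 <= 1297)       # True  - integers compare numerically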
And since you're creating the sequence of vals to use, you don't need a separate upper bound check.
So far, this would give you the correct finite loop:
for val in range(int(total_num)//10+1):  # no upper bound check needed
    start_val = str(val*10+1)
    url = "https://search.naver.com/search.naver?where=news&query=" ...
    html = urllib.request.urlopen(url).read()
    ...
For the values needed for the pages/next starting item, instead of doing:
for val in range(int(total_num)//10+1):
    start_val = str(val*10+1)
You can get the actual vals from range() directly. Start at 1 and go in steps of 10 to get 1, 11, 21, ..., up to and including the total:
for val in range(1, int(total_num) + 1, 10):
    start_val = str(val)  # don't actually need this assignment
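For instance, with total_num = 1297 this produces the start values 1, 11, 21, ..., 1291 (130 requests in total):
vals = list(range(1, 1297 + 1, 10))
print(vals[:3], '...', vals[-1])   # [1, 11, 21] ... 1291
print(len(vals))                   # 130 requests of 10 results each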
Next thing: the URL for page 2 onwards is wrong. Currently, your while loop will generate the following URL for page 2:
https://search.naver.com/search.naver?where=news&query=%EA%B8%88%EB%A6%AC%EA%B8%88%EB%A6%AC&sm=tab_opt&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so%3Ar%2Cp%3Afrom20200413to20200414%2Ca%3Aall&mynews=0&refresh_start=11&related=0
But if you click on page "2" of the results, you get the URL:
https://search.naver.com/search.naver?&where=news&query=%EA%B8%88%EB%A6%AC%EA%B8%88%EB%A6%AC&sm=tab_pge&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so:r,p:from20200413to20200414,a:all&mynews=0&cluster_rank=35&start=11&refresh_start=0
The main difference is at the end: &refresh_start=11 in yours vs &start=11&refresh_start=0 actual. Since that format also works for page 1 (just checked), use that instead.
You have some extra characters in the section after the keyword: ...&query=" + key_word +"%EA%B8%88%EB%A6%AC&sm=tab_opt. That %EA%B8%88%EB%A6%AC is from your previous search keyword.
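You can confirm that by decoding it:
import urllib.parse
print(urllib.parse.unquote("%EA%B8%88%EB%A6%AC"))   # -> 금리, i.e. the keyword encoded a second time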
You can also drop several unneeded URL parameters by testing which ones are actually required.
Putting all that together:
for val in range(1, int(total_num) + 1, 10):
    start_val = str(val)
    url = ("https://search.naver.com/search.naver?&where=news&query=" +
           key_word +
           "&sm=tab_pge&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14" +
           "&docid=&nso=so:r,p:from20200413to20200414,a:all&mynews=0&cluster_rank=51" +
           "&refresh_start=0&start=" +
           start_val)
    html = urllib.request.urlopen(url).read()
    ...  # etc.
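As a side note (not part of the original answer), the query string could also be built with urllib.parse.urlencode rather than manual concatenation, which keeps the parameters readable and percent-encodes the keyword for you. A sketch with the same parameters as above:
import urllib.parse

params = {
    'where': 'news',
    'query': '금리',           # urlencode percent-encodes this for you
    'sm': 'tab_pge',
    'sort': '0',
    'photo': '0',
    'field': '0',
    'pd': '3',
    'ds': '2020.04.13',
    'de': '2020.04.14',
    'nso': 'so:r,p:from20200413to20200414,a:all',
    'mynews': '0',
    'cluster_rank': '51',
    'refresh_start': '0',
    'start': start_val,        # the start value computed in the loop above
}
url = 'https://search.naver.com/search.naver?' + urllib.parse.urlencode(params)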

How to Iterate over pages in Ebay

I am building a scraper for eBay. I am trying to figure out a way to manipulate the page-number portion of the eBay URL to go to the next page until there are no more pages (if you were on page 2, the page-number portion would look like "_pgn=2"). I noticed that if you put in any number greater than the max number of pages a listing has, the page reloads to the last page rather than giving a page-doesn't-exist error. (If a listing has 5 pages, then _pgn=5 routes to the same page as _pgn=100.)
How can I implement a way to start at page one, get the HTML soup of the page, get all the relevant data I want from the soup, then load up the next page with the new page number, and repeat the process until there are no new pages to scrape?
I tried to get the number of results a listing has by using a Selenium XPath, taking math.ceil of the quotient of the number of results and 50 (the default max listings per page), and using that as my max_page, but I get errors saying the element doesn't exist even though it does (self.driver.find_element_by_xpath('xpath').text; the result count of 243 is what I am trying to get with the XPath).
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


class EbayScraper(object):

    def __init__(self, item, buying_type):
        self.base_url = "https://www.ebay.com/sch/i.html?_nkw="
        self.driver = webdriver.Chrome(r"chromedriver.exe")
        self.item = item
        self.buying_type = buying_type + "=1"
        self.url_seperator = "&_sop=12&rt=nc&LH_"
        self.url_seperator2 = "&_pgn="
        self.page_num = "1"

    def getPageUrl(self):
        if self.buying_type == "Buy It Now=1":
            self.buying_type = "BIN=1"
        self.item = self.item.replace(" ", "+")
        url = self.base_url + self.item + self.url_seperator + self.buying_type + self.url_seperator2 + self.page_num
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

    def getInfo(self, soup):
        for listing in soup.find_all("li", {"class": "s-item"}):
            raw = listing.find_all("a", {"class": "s-item__link"})
            if raw:
                raw_price = listing.find_all("span", {"class": "s-item__price"})[0]
                raw_title = listing.find_all("h3", {"class": "s-item__title"})[0]
                raw_link = listing.find_all("a", {"class": "s-item__link"})[0]
                raw_condition = listing.find_all("span", {"class": "SECONDARY_INFO"})[0]
                condition = raw_condition.text
                price = float(raw_price.text[1:])
                title = raw_title.text
                link = raw_link['href']
                print(title)
                print(condition)
                print(price)
                if self.buying_type != "BIN=1":
                    raw_time_left = listing.find_all("span", {"class": "s-item__time-left"})[0]
                    time_left = raw_time_left.text[:-4]
                    print(time_left)
                print(link)
                print('\n')


if __name__ == '__main__':
    item = input("Item: ")
    buying_type = input("Buying Type (e.g, 'Buy It Now' or 'Auction'): ")
    instance = EbayScraper(item, buying_type)
    page = instance.getPageUrl()
    instance.getInfo(page)
If you want to iterate over all pages and gather all results, then your script needs to check whether there is a next page after it visits each page:
import requests
from bs4 import BeautifulSoup


class EbayScraper(object):

    def __init__(self, item, buying_type):
        ...
        self.currentPage = 1

    def get_url(self, page=1):
        if self.buying_type == "Buy It Now=1":
            self.buying_type = "BIN=1"
        self.item = self.item.replace(" ", "+")
        # _ipg=200 means that we expect 200 items per page
        return '{}{}{}{}{}{}&_ipg=200'.format(
            self.base_url, self.item, self.url_seperator, self.buying_type,
            self.url_seperator2, page
        )

    def page_has_next(self, soup):
        container = soup.find('ol', 'x-pagination__ol')
        currentPage = container.find('li', 'x-pagination__li--selected')
        next_sibling = currentPage.next_sibling
        if next_sibling is None:
            print(container)
        return next_sibling is not None

    def iterate_page(self):
        # this will keep looping while there are more pages, otherwise end
        while True:
            page = self.getPageUrl(self.currentPage)
            self.getInfo(page)
            if self.page_has_next(page) is False:
                break
            else:
                self.currentPage += 1

    def getPageUrl(self, pageNum):
        url = self.get_url(pageNum)
        print('page: ', url)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

    def getInfo(self, soup):
        ...


if __name__ == '__main__':
    item = input("Item: ")
    buying_type = input("Buying Type (e.g, 'Buy It Now' or 'Auction'): ")
    instance = EbayScraper(item, buying_type)
    instance.iterate_page()
The important functions here are page_has_next and iterate_page.
page_has_next - a function that checks whether the pagination of the page has another li element next to the selected page, e.g. < 1 2 3 >: if we are on page 1, it checks whether there is a 2 next to it.
iterate_page - a function that loops until there is no next page.
Also note that you don't need Selenium for this unless you need to mimic user clicks or need a browser to navigate.
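If you still want the result-count approach from the question (total results divided by 50 listings per page, rounded up), a rough requests/BeautifulSoup sketch could look like the following; the selector for the results-count heading is an assumption and must be checked against the live page's HTML:
import math
import requests
from bs4 import BeautifulSoup

def get_max_page(url, per_page=50):
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # NOTE: 'srp-controls__count-heading' is a guess - inspect the results page
    # and adjust the selector to whatever element holds e.g. "243 results"
    heading = soup.find('h1', {'class': 'srp-controls__count-heading'})
    total_results = int(heading.get_text(strip=True).split()[0].replace(',', ''))
    return math.ceil(total_results / per_page)   # e.g. ceil(243 / 50) == 5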

python, run def() multiple times

I wrote this code to get eBay prices.
It asks for a full eBay link, then it prints its price.
import bs4, requests

print('please enter full Ebay link ..')
link = str(input())

def ebayprice(url):
    res = requests.get(link)
    res.raise_for_status()
    txt = bs4.BeautifulSoup(res.text, 'html.parser')
    csselement = txt.select('#mm-saleDscPrc')
    return csselement[0].text.strip()

price = ebayprice(link)
print('price is : ' + price)
I want to improve it; I tried my best and couldn't.
I want it to take multiple links, run them one by one, and print the result each time.
It doesn't matter whether the links come from input() or from something like links = 'www1,www2,www3'.
You can split on the comma and iterate over the list using a for loop:
def ebayprice(url):
    ...  # make sure the function uses its url parameter, i.e. requests.get(url)

for single_link in link.split(','):
    price = ebayprice(single_link)
    print('price for {} is {}'.format(single_link, price))
If you want, you can ask how many links someone wants to scrape, and after that use a for loop to go through every URL:
import bs4, requests

# ask how many links will be passed
print('How many links do you want to scrape ?')
link_numb = int(input())

# get the links
print('please enter full Ebay link ..')
links = [input() for _ in range(link_numb)]

def ebayprice(link):
    res = requests.get(link)
    res.raise_for_status()
    txt = bs4.BeautifulSoup(res.text, 'html.parser')
    csselement = txt.select('#mm-saleDscPrc')
    return csselement[0].text.strip()

for link in links:
    price = ebayprice(link)
    print(price)
Example:
How many links do you want to scrape ?
2
please enter full Ebay link ..
http://example.com
http://example-just-test.com
# for this example we simply print the urls back
http://example.com
http://example-just-test.com

Following links in Python

I have to write a program that will read the HTML from this link (http://python-data.dr-chuck.net/known_by_Maira.html), extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name found.
I am supposed to find the link at position 18 (the first name is 1), follow that link and repeat this process 7 times. The answer is the last name that I retrieve.
Here is the code I found and it works just fine.
import urllib
from BeautifulSoup import *

url = raw_input("Enter URL: ")
count = int(raw_input("Enter count: "))
position = int(raw_input("Enter position: "))

names = []
while count > 0:
    print "retrieving: {0}".format(url)
    page = urllib.urlopen(url)
    soup = BeautifulSoup(page)
    tag = soup('a')
    name = tag[position-1].string
    names.append(name)
    url = tag[position-1]['href']
    count -= 1

print names[-1]
I would really appreciate it if someone could explain to me, like you would to a 10-year-old, what's going on inside the while loop. I am new to Python and would really appreciate the guidance.
while count > 0:                         # because of `count -= 1` below,
                                         # this runs the loop `count` times

    print "retrieving: {0}".format(url)  # just prints out the next web page
                                         # you are going to get

    page = urllib.urlopen(url)           # urls reference web pages (well,
                                         # many types of web content, but
                                         # we'll stick with web pages)

    soup = BeautifulSoup(page)           # web pages are frequently written
                                         # in html, which can be messy. this
                                         # package "unmessifies" it

    tag = soup('a')                      # in html you can highlight text and
                                         # reference other web pages with <a>
                                         # tags. this gets all of the <a> tags
                                         # in a list

    name = tag[position-1].string        # this gets the <a> tag at position-1
                                         # and then gets its text value

    names.append(name)                   # this puts that value in your own
                                         # list

    url = tag[position-1]['href']        # html tags can have attributes. on
                                         # an <a> tag, the href="something"
                                         # attribute references another web
                                         # page. you store it in `url` so that
                                         # it's the page you grab on the next
                                         # iteration of the loop

    count -= 1
You enter the number of urls you want to retrieve from a page, then the loop:
0) prints the url
1) opens the url
2) reads the source (see the BeautifulSoup docs)
3) gets every <a> tag
4) gets the text of the <a> tag at position-1
5) adds it to the list names
6) gets the url from that same tag, i.e. pulls the href out of <a ...></a>
7) after the loop, prints the last item of the list names
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# ignore SSL certificate errors (needed for the `ctx` used below)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

total = 0
url = input('Enter - ')
c = input('enter count-')
count = int(c)
p = input('enter position-')
pos = int(p)

while total <= count:
    html = urllib.request.urlopen(url, context=ctx).read()
    print("Retrieving", url)
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    counter = 0
    for tag in tags:
        counter = counter + 1
        if counter <= pos:
            x = tag.get('href', None)
            url = x
        else:
            break
    total = total + 1
Solution with explanations.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

url = input('Enter - ')
count = int(input('Enter count: '))
position = int(input('Enter position: '))

names = []
while count > 0:
    print('Retrieving: {}'.format(url))
    html = urllib.request.urlopen(url)  # open the url using urllib
    soup = BeautifulSoup(html, 'html.parser')  # parse the html data in a clean format

    # Retrieve all of the anchor tags
    tags = soup('a')

    # This gets the <a> tag at position-1 and then gets its text value
    name = tags[position-1].string
    names.append(name)  # add the name to our list

    url = tags[position-1]['href']  # retrieve the url for the next iteration
    count -= 1

print(names)
print('Answer: ', names[count-1])
Hope it helps.
