I'm trying to build a web crawler using BeautifulSoup and urllib. The crawler works, but it does not open all the pages on a site. It opens the first link and goes to that page, opens the first link of that page, and so on.
Here's my code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import urljoin
import json, sys
sys.setrecursionlimit(10000)
url = input('enter url ')
d = {}
d_2 = {}
l = []
url_base = url
count = 0
def f(url):
    global count
    global url_base
    if count <= 100:
        print("count: " + str(count))
        print('now looking into: '+url+'\n')
        count += 1
        l.append(url)
        html = urlopen(url).read()
        soup = BeautifulSoup(html, "html.parser")
        d[count] = soup
        tags = soup('a')
        for tag in tags:
            meow = tag.get('href', None)
            if (urljoin(url, meow) in l):
                print("Skipping this one: " + urljoin(url, meow))
            elif "mailto" in urljoin(url, meow):
                print("Skipping this one with a mailer")
            elif meow == None:
                print("skipping 'None'")
            elif meow.startswith('http') == False:
                f(urljoin(url, meow))
            else:
                f(meow)
    else:
        return
f(url)
print('\n\n\n\n\n')
print('Scraping Completed')
print('\n\n\n\n\n')
The reason you're seeing this behavior is where the code recursively calls your function. As soon as the code finds a valid link, f gets called again, preventing the rest of the for loop from running until that call returns.
What you're doing is a depth-first search, but the internet is very deep. You want to do a breadth-first search instead.
Probably the easiest way to modify your code to do that is to keep a global list of links to follow. Have the for loop append all the scraped links to the end of this list, and then, outside of the for loop, remove the first element of the list and follow that link (see the sketch below).
You may have to change your logic slightly for your max count.
If count reaches 100, no further links will be opened. Therefore I think you should decrease count by one after leaving the for loop. If you do this, count would be something like the current link depth (and 100 would be the maximum link depth).
If the variable count should refer to the number of opened links, then you might want to control the link depth in another way.
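For example, here is a minimal sketch of that breadth-first approach, assuming the same urlopen/BeautifulSoup setup as your code (the names crawl, queue, visited and max_pages are illustrative, not from the original):
from collections import deque
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=100):
    queue = deque([start_url])   # links waiting to be visited (FIFO)
    visited = set()              # links we have already opened
    while queue and len(visited) < max_pages:
        url = queue.popleft()    # take the oldest link first -> breadth-first order
        if url in visited:
            continue
        visited.add(url)
        print('now looking into: ' + url)
        soup = BeautifulSoup(urlopen(url).read(), "html.parser")
        for tag in soup.find_all('a'):
            href = tag.get('href')
            if href is None or 'mailto' in href:
                continue
            full = urljoin(url, href)
            if full not in visited:
                queue.append(full)   # enqueue instead of recursing
    return visited
Because the queue is processed first-in-first-out, pages are visited level by level, and the sys.setrecursionlimit workaround is no longer needed.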
Context: I'm working on pagination for this website: https://skoodos.com/schools-in-uttarakhand. When I inspected it, the website shows no proper page count, only a next button that adds ?page=2 to the URL. Also, searching for page-link gave me the number 20 at the end, so I assumed the total number of pages was 20; upon checking manually, I learnt that only 11 pages exist.
After many trials and errors, I finally decided to just index from 0 up to 12 (12 is excluded by Python, however).
What I want to know is: how would you go about figuring out the number of pages on a website that doesn't show the actual number of pages other than previous and next buttons, and how can I optimize my code for that?
Here's my solution to the pagination. Is there any way to optimize this other than manually finding the number of pages?
from myWork.commons import url_parser, write
def data_fetch(url):
    school_info = []
    for page_number in range(0, 4):
        next_web_page = url + f'?page={page_number}'
        soup = url_parser(next_web_page)
        search_results = soup.find('section', {'id': 'search-results'}).find(class_='container').find(class_='row')
        # rest of the code
    for page_number in range(4, 12):
        next_web_page = url + f'?page={page_number}'
        soup = url_parser(next_web_page)
        search_results = soup.find('section', {'id': 'search-results'}).find(class_='container').find(class_='row')
        # rest of the code

def main():
    url = "https://skoodos.com/schools-in-uttarakhand"
    data_fetch(url)

if __name__ == "__main__":
    main()
Each of your pages (except the last one) will have an element like this:
<a class="page-link"
href="https://skoodos.com/schools-in-uttarakhand?page=2"
rel="next">Next »</a>
E.g. you can extract the link as follows (here for the first page):
link = soup.find('a', class_='page-link', href=True, rel='next')
print(link['href'])
https://skoodos.com/schools-in-uttarakhand?page=2
So, you could make your function recursive. E.g. use something like this:
import requests
from bs4 import BeautifulSoup
def data_fetch(url, results=list()):
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, 'lxml')
    search_results = soup.find('section', {'id': 'search-results'})\
                         .find(class_='container').find(class_='row')
    results.append(search_results)
    link = soup.find('a', class_='page-link', href=True, rel='next')
    # link will be `None` for last page (i.e. `page=11`)
    if link:
        # just adding some prints to show progress of iteration
        if 'page' not in url:
            print('getting page: 1', end=', ')
        url = link['href']
        # subsequent page nums being retrieved
        print(f'{url.rsplit("=", maxsplit=1)[1]}', end=', ')
        # recursive call
        return data_fetch(url, results)
    else:
        # `page=11` with no link, we're done
        print('done')
        return results
url = 'https://skoodos.com/schools-in-uttarakhand'
data = data_fetch(url)
So, a call to this function will print progress as:
getting page: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, done
And you'll end up with data containing 11 bs4.element.Tag elements, one for each page.
print(len(data))
11
print(set([type(d) for d in data]))
{<class 'bs4.element.Tag'>}
Good luck with extracting the required info; the site is very slow, and the HTML is particularly sloppy and inconsistent. (E.g. you're right to note there is a page-link element that suggests there are 20 pages, but its visibility is set to hidden, so apparently it is just a piece of deprecated/unused code.)
There's a bit at the top that says "Showing the 217 results as per selected criteria". You can write code to extract the number from that, then count the number of results per page and divide by that to get the expected number of pages (don't forget to round up).
If you want to double check, add more code to go to the calculated last page and
if there's no such page, keep decrementing the total and checking until you hit a page that exists
if there is such a page, but it has an active/enabled "Next" button, keep going to Next page until reaching the last (basically as you are now)
(Remember that the two listed above are contingencies and wouldn't be executed in an ideal scenario.)
So, just to find the number of pages, you could do something like:
import requests
from bs4 import BeautifulSoup
import math
def soupFromUrl(scrapeUrl):
    req = requests.get(scrapeUrl)
    if req.status_code == 200:
        return BeautifulSoup(req.text, 'html.parser')
    else:
        raise Exception(f'{req.reason} - failed to scrape {scrapeUrl}')

def getPageTotal(url):
    soup = soupFromUrl(url)
    # totalResults = int(soup.find('label').get_text().split('(')[-1].split(')')[0])
    totalResults = int(soup.p.strong.get_text())  # both searches should work
    perPageResults = len(soup.select('.m-show'))  # probably always 20
    print(f'{perPageResults} of {totalResults} results per page')
    if not (perPageResults > 0 and totalResults > 0):
        return 0
    lastPageNum = math.ceil(totalResults / perPageResults)
    # Contingencies - will hopefully never be needed
    lpSoup = soupFromUrl(f'{url}?page={lastPageNum}')
    if lpSoup.select('.m-show'):  # page exists
        while lpSoup.select_one('a[rel="next"]'):
            nextLink = lpSoup.select_one('a[rel="next"]')['href']
            lastPageNum = int(nextLink.split('page=')[-1])
            lpSoup = soupFromUrl(nextLink)
    else:  # page does not exist
        while not (lpSoup.select('.m-show') or lastPageNum < 1):
            lastPageNum = lastPageNum - 1
            lpSoup = soupFromUrl(f'{url}?page={lastPageNum}')
    # end Contingencies section
    return lastPageNum
However, it looks like you only want the total pages in order to start the for-loop, but it's not even necessary to use a for-loop at all - a while-loop might be better:
def data_fetch(url):
    school_info = []
    nextUrl = url
    while nextUrl:
        soup = soupFromUrl(nextUrl)
        # GET YOUR DATA FROM PAGE
        nextHL = soup.select_one('a[rel="next"]')
        nextUrl = nextHL.get('href') if nextHL else None
    # code after fetching all pages' data
Although, you could still use a for-loop if you had a maximum page number in mind:
def data_fetch(url, maxPages):
    school_info = []
    for p in range(1, maxPages+1):
        soup = soupFromUrl(f'{url}?page={p}')
        if not soup.select('.m-show'):
            break
        # GET YOUR DATA FROM PAGE
    # code after fetching all pages' data [upto max]
I have a list of links and for each link I want to check if it contains a specific sublink and add this sublink to the initial list. I have this code:
def getAllLinks():
    i = 0
    baseUrl = 'http://www.cdep.ro/pls/legis/'
    sourcePaths = ['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0', 'legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0']
    while i < len(sourcePaths)+1:
        for path in sourcePaths:
            res = requests.get(f'{baseUrl}{path}')
            soup = BeautifulSoup(res.text)
            next_btn = soup.find(lambda e: e.name == 'td' and '1..99' in e.text)
            if next_btn:
                for a in next_btn.find_all('a', href=True):
                    linkNextPage = a['href']
                    sourcePaths.append(linkNextPage)
                i += 1
                break
            else:
                i += 1
                continue
            break
    return sourcePaths
print(getAllLinks())
The first link in the list does not contain the sublink, so it's an else case. The code handles this fine. However, the second link in the list does contain the sublink, but it gets stuck here:
for a in next_btn.find_all('a', href=True):
    linkNextPage = a['href']
    sourcePaths.append(linkNextPage)
i += 1
The third link contains the sublink, but my code never gets to look at it. At the end I get a list containing the initial links plus the second link's sublink four times.
I think I'm breaking incorrectly somewhere, but I can't figure out how to fix it.
Remove the while; it's not needed. Also change the selectors:
import requests
from bs4 import BeautifulSoup
def getAllLinks():
    baseUrl = 'http://www.cdep.ro/pls/legis/'
    sourcePaths = ['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0', 'legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0']
    for path in sourcePaths:
        res = requests.get(f'{baseUrl}{path}')
        soup = BeautifulSoup(res.text, "html.parser")
        next_btn = soup.find("p", class_="headline").find("table", {"align": "center"})
        if next_btn:
            anchor = next_btn.find_all("td")[-1].find("a")
            if anchor: sourcePaths.append(anchor["href"])
    return sourcePaths
print(getAllLinks())
Output:
['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0', 'legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0', 'legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=100', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0&nrc=100']
Your second break statement never gets executed, because the first break statement has already exited the for loop, so control never reaches the second break. Put a condition that breaks the while loop instead.
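To see why, here is a small self-contained sketch that mirrors the question's loop structure, with plain strings instead of HTTP requests (paths and "sublink" are illustrative stand-ins, not the original data):
# break only exits the innermost enclosing loop, so the trailing break
# that was meant to stop the while loop is never reached.
paths = ["no-match", "match", "no-match"]
i = 0
while i < len(paths) + 1:
    for path in paths:
        if path == "match":
            paths.append("sublink")   # list keeps growing, like sourcePaths
            i += 1
            break                     # exits the for loop only
        else:
            i += 1
            continue                  # jumps to the next path
        break                         # dead code: never reached
print(i, paths)                       # the for restarted several times, appending duplicates
Because the trailing break is dead code, the while re-runs the for loop from the start and re-appends the same sublink; either make the while condition itself end the loop, or drop the while entirely as in the answer above.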
I am trying to crawl all news links that contain a certain keyword I'm looking for.
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import re
key_word = urllib.parse.quote("금리")
url = "https://search.naver.com/search.naver?where=news&query=" + key_word +"%EA%B8%88%EB%A6%AC&sm=tab_opt&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so%3Ar%2Cp%3Afrom20200413to20200414%2Ca%3Aall&mynews=0&refresh_start=0&related=0"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
anchor_set = soup.findAll('a')
news_link = []
for a in anchor_set:
    if str(a).find('https://news.naver.com/main/read.nhn?') != -1:
        a = a.get('href')
        news_link.append(a)
Up to this section (the code above), I parse the URL and retrieve all links that contain read.nhn (Naver's news platform) and append them to news_link.
This is working fine, but the problem is that the URL used above only shows 10 articles on the page.
count_tag = soup.find("div",{"class","title_desc all_my"})
count_text=count_tag.find("span").get_text().split()
total_num=count_text[-1][0:-1].replace(",","")
print(total_num)
Using the code above, I found out there are a total of 1297 articles that I need to collect, but the original link above only shows 10 articles per page.
for val in range(int(total_num)//10+1):
    start_val = str(val*10+1)
I was told I needed to insert this start value into the URL to retrieve ALL the news links.
Thus, I used a while loop:
while start_val <= total_num:
    url = "https://search.naver.com/search.naver?where=news&query=" + key_word + "%EA%B8%88%EB%A6%AC&sm=tab_opt&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so%3Ar%2Cp%3Afrom20200413to20200414%2Ca%3Aall&mynews=0&refresh_start=" + start_val + "&related=0"
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    news_link = []
    anchor_set = soup.findAll('a')
    for a in anchor_set:
        if str(a).find('https://news.naver.com/main/read.nhn?') != -1:
            a = a.get('href')
            news_link.append(a)
However, when I run the program, the loop doesn't seem to stop. Obviously there is no else or break. How can I break this loop and successfully collect all the links?
Your current while loop doesn't stop because you never increment start_val. Also, total_num is a string (which is why you have int(total_num) in range(int(total_num)//10+1) later), so while start_val <= total_num compares strings, which is wrong - for strings, "21" > "1297" because "2" > "1". Compare them as ints.
And since you're creating the sequence of vals to use, you don't need a separate upper bound check.
So far, this would give you the correct finite loop:
for val in range(int(total_num)//10+1):  # no upper bound check needed
    start_val = str(val*10+1)
    url = "https://search.naver.com/search.naver?where=news&query=" ...
    html = urllib.request.urlopen(url).read()
    ...
For the values needed for the pages/next starting item, instead of doing:
for val in range(int(total_num)//10+1):
    start_val = str(val*10+1)
You can get the actual vals from range() directly. To start at 1 and go in steps of 10 - giving 1, 11, 21, ..., up to and including the total:
for val in range(1, int(total_num) + 1, 10):
    start_val = str(val)  # don't need this assignment actually
Next thing: the URL for page 2 onwards is wrong. Currently, your while loop will generate the following URL for page 2:
https://search.naver.com/search.naver?where=news&query=%EA%B8%88%EB%A6%AC%EA%B8%88%EB%A6%AC&sm=tab_opt&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so%3Ar%2Cp%3Afrom20200413to20200414%2Ca%3Aall&mynews=0&refresh_start=11&related=0
But if you click on page "2" of the results, you get the URL:
https://search.naver.com/search.naver?&where=news&query=%EA%B8%88%EB%A6%AC%EA%B8%88%EB%A6%AC&sm=tab_pge&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so:r,p:from20200413to20200414,a:all&mynews=0&cluster_rank=35&start=11&refresh_start=0
The main difference is at the end: &refresh_start=11 in yours vs. &start=11&refresh_start=0 in the actual URL. Since that format also works for page 1 (just checked), use that instead.
You also have some extra characters in the section after the keyword: ...&query=" + key_word +"%EA%B8%88%EB%A6%AC&sm=tab_opt. That %EA%B8%88%EB%A6%AC is left over from your previous search keyword.
You can also skip several unneeded URL parameters, by testing which ones are actually needed.
Putting all that together:
for val in range(1, int(total_num) + 1, 10):
    start_val = str(val)
    url = ("https://search.naver.com/search.naver?&where=news&query=" +
           key_word +
           "&sm=tab_pge&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14" +
           "&docid=&nso=so:r,p:from20200413to20200414,a:all&mynews=0&cluster_rank=51" +
           "&refresh_start=0&start=" +
           start_val)
    html = urllib.request.urlopen(url).read()
    ...  # etc.
I've made a web crawler which outputs a link and the text from that link for all pages at a given address. It looks like this:
import urllib
from bs4 import BeautifulSoup
import urlparse
import mechanize
url = ["http://adbnews.com/area51"]
for u in url:
    br = mechanize.Browser()
    urls = [u]
    visited = [u]
    i = 0
    while i < len(urls):
        try:
            br.open(urls[0])
            urls.pop(0)
            for link in br.links():
                levelLinks = []
                linkText = []
                newurl = urlparse.urljoin(link.base_url, link.url)
                b1 = urlparse.urlparse(newurl).hostname
                b2 = urlparse.urlparse(newurl).path
                newurl = "http://" + b1 + b2
                linkTxt = link.text
                linkText.append(linkTxt)
                levelLinks.append(newurl)
                if newurl not in visited and urlparse.urlparse(u).hostname in newurl:
                    urls.append(newurl)
                    visited.append(newurl)
                    #print newurl
                    #get Mechanize Links
                    for l, lt in zip(levelLinks, linkText):
                        print newurl, "\n", lt, "\n"
        except:
            urls.pop(0)
It gets results like this:
http://www.adbnews.com/area51/contact.html
CONTACT
http://www.adbnews.com/area51/about.html
ABOUT
http://www.adbnews.com/area51/index.html
INDEX
http://www.adbnews.com/area51/1st/
FIRST LEVEL!
http://www.adbnews.com/area51/1st/bling.html
BLING
http://www.adbnews.com/area51/1st/index.html
INDEX
http://adbnews.com/area51/2nd/
2ND LEVEL
And I want to add a counter of some kind which could limit how deep the crawler goes.
I've tried adding, for example, steps = 3 and changing while i < len(urls) to while i < steps:
but that only goes to the first level, even though the number says 3.
Any advice is welcome.
If you want to search to a certain "depth", consider using a recursive function instead of just appending a list of URLs.
def crawl(url, depth):
    if depth <= 3:
        #Scan page, grab links, title
        for link in links:
            print crawl(link, depth + 1)
        return url + "\n" + title
This allows for easier control of your recursive searching, as well as being faster and less resource heavy :)
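For instance, here is a minimal depth-limited sketch along those lines, using urllib and BeautifulSoup instead of mechanize (the names crawl, max_depth and visited are illustrative, not from your code):
from urllib.request import urlopen
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def crawl(url, depth=0, max_depth=3, visited=None):
    if visited is None:
        visited = set()            # remember pages so we don't loop forever
    if depth > max_depth or url in visited:
        return
    visited.add(url)
    try:
        html = urlopen(url).read()
    except Exception:
        return                     # skip pages that fail to load
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else ""
    print(url, "\n", title, "\n")
    for a in soup.find_all('a', href=True):
        link = urljoin(url, a['href'])
        # stay on the same host, like the original script does
        if urlparse(link).hostname == urlparse(url).hostname:
            crawl(link, depth + 1, max_depth, visited)

crawl("http://adbnews.com/area51")
Here depth increases by one for every link followed from the start page, so max_depth plays the role of the steps counter you tried to add.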
I'm scraping all the URLs of my domain with a recursive function.
But it outputs nothing, without any error.
#usr/bin/python
from bs4 import BeautifulSoup
import requests
import tldextract
def scrape(url):
    for links in url:
        main_domain = tldextract.extract(links)
        r = requests.get(links)
        data = r.text
        soup = BeautifulSoup(data)
        for href in soup.find_all('a'):
            href = href.get('href')
            if not href:
                continue
            link_domain = tldextract.extract(href)
            if link_domain.domain == main_domain.domain:
                problem.append(href)
            elif not href == '#' and link_domain.tld == '':
                new = 'http://www.' + main_domain.domain + '.' + main_domain.tld + '/' + href
                problem.append(new)
    return len(problem)
    return scrape(problem)
problem = ["http://xyzdomain.com"]
print(scrape(problem))
When I create a new list, it works, but I don't want to make a new list for every loop.
You need to structure your code so that it meets the pattern for recursion, as your current code doesn't. You also should not reuse a variable name for a different kind of object, e.g. href = href.get('href'), because the tag is no longer available once the name refers to the string instead. And your code, as it currently stands, will only ever return the len(), because that return is unconditionally reached before return scrape(problem):
def Recursive(Factorable_problem):
    if Factorable_problem is Simplest_Case:
        return AnswerToSimplestCase
    else:
        return Rule_For_Generating_From_Simpler_Case(Recursive(Simpler_Case))
for example:
def Factorial(n):
    """ Recursively Generate Factorials """
    if n < 2:
        return 1
    else:
        return n * Factorial(n-1)
Hello, I've made a non-recursive version of this that appears to get all the links on the same domain.
I tested the code below using the problem list included in the code. Once I'd solved the problems with the recursive version, the next problem was hitting the recursion depth limit, so I rewrote it to run in an iterative fashion. The code and result are below:
from bs4 import BeautifulSoup
import requests
import tldextract
def print_domain_info(d):
    print "Main Domain:{0} \nSub Domain:{1} \nSuffix:{2}".format(d.domain, d.subdomain, d.suffix)

SEARCHED_URLS = []
problem = ["http://Noelkd.neocities.org/", "http://youpi.neocities.org/"]

while problem:
    # Get a link from the stack of links
    link = problem.pop()
    # Check we haven't been to this address before
    if link in SEARCHED_URLS:
        continue
    # We don't want to come back here again after this point
    SEARCHED_URLS.append(link)
    # Try and get the website
    try:
        req = requests.get(link)
    except:
        # If its not working i don't care for it
        print "borked website found: {0}".format(link)
        continue
    # Now we get to this point worth printing something
    print "Trying to parse:{0}".format(link)
    print "Status Code:{0} Thats: {1}".format(req.status_code, "A-OK" if req.status_code == 200 else "SOMTHINGS UP")
    # Get the domain info
    dInfo = tldextract.extract(link)
    print_domain_info(dInfo)
    # I like utf-8
    data = req.text.encode("utf-8")
    print "Length Of Data Retrieved:{0}".format(len(data))  # More info
    soup = BeautifulSoup(data)  # This was here before so i left it.
    print "Found {0} link{1}".format(len(soup.find_all('a')), "s" if len(soup.find_all('a')) > 1 else "")
    FOUND_THIS_ITERATION = []  # Getting the same links over and over was boring
    found_links = [x for x in soup.find_all('a') if x.get('href') not in SEARCHED_URLS]  # Find me all the links i don't got
    for href in found_links:
        href = href.get('href')  # You wrote this seems to work well
        if not href:
            continue
        link_domain = tldextract.extract(href)
        if link_domain.domain == dInfo.domain:  # JUST FINDING STUFF ON SAME DOMAIN RIGHT?!
            if href not in FOUND_THIS_ITERATION:  # I'ma check you out next time
                print "Check out this link: {0}".format(href)
                print_domain_info(link_domain)
                FOUND_THIS_ITERATION.append(href)
                problem.append(href)
            else:  # I got you already
                print "DUPE LINK!"
        else:
            print "Not on same domain moving on"
    # Count down
    print "We have {0} more sites to search".format(len(problem))
    if problem:
        continue
    else:
        print "Its been fun"
        print "Lets see the URLS we've visited:"
        for url in SEARCHED_URLS:
            print url
This prints, after a lot of other logging, loads of neocities websites!
What's happening is that the script pops a value off the list of websites yet to visit, then gets all the links on that page which are on the same domain. If those links point to pages we haven't visited, we add them to the list of links to be visited. After that we pop the next page and do the same thing again, until there are no pages left to visit.
I think this is what you're looking for; get back to us in the comments if this doesn't work the way you want, or if anyone can improve it, please leave a comment.