Tree traversal in Python

I'm trying to write a script to find the non-responsive links of a web page in Python. While trying, I found that Python doesn't seem to support multiple child nodes. Is that true, or can we access multiple child nodes?
Below is my code snippet:
import httplib2
import requests
from bs4 import BeautifulSoup, SoupStrainer

status = {}
response = {}
output = {}

def get_url_status(url, count):
    global links
    links = []
    print(url)
    print(count)
    if count == 0:
        return output
    else:
        # if url not in output.keys():
        headers = requests.utils.default_headers()
        req = requests.get(url, headers)
        if('200' in str(req)):
            # if url not in output.keys():
            output[url] = '200';
            for link in BeautifulSoup(req.content, parse_only=SoupStrainer('a')):
                if 'href' in str(link):
                    links.append(link.get('href'))
            # removing other non-mandotary links
            for link in links[:]:
                if "mi" not in link:
                    links.remove(link)
            # removing same url
            for link in links[:]:
                if link.rstrip('/') == url:
                    links.remove(link)
            # removing duplicate links
            links = list(dict.fromkeys(links))
            if len(links) > 0:
                for urllink in links:
                    return get_url_status(urllink, count-1)

result = get_url_status('https://www.mi.com/in', 5)
print(result)
In this code it's only traversing the left-most child nodes and skipping the rest. And the output is unsatisfactory, far smaller than the real number of links on the site.
{'https://www.mi.com/in': '200', 'https://in.c.mi.com/': '200', 'https://in.c.mi.com/index.php': '200', 'https://in.c.mi.com/global/': '200', 'https://c.mi.com/index.php': '200'}
I know I'm lacking in multiple places, but I've never done something of this scale and this is my first time, so please excuse me if this is a novice question.
Note: I've used mi.com just for reference.

At a glance, there's one obvious problem.
if len(links) > 0:
    for urllink in links:
        return get_url_status(urllink, count-1)
This snippet does not actually iterate over links: the return inside the loop body means it runs the body only for the first item in links and immediately returns. There is another bug: the function returns None instead of output if it encounters a page with no links before count reaches 0. Do the following instead.
if len(links):
    for urllink in links:
        get_url_status(urllink, count-1)
return output
And if('200' in str(req)) is not the right way to check the status code. str(req) is just the response's repr (something like '&lt;Response [200]&gt;'), so you are matching a substring of that text rather than inspecting the status code itself. It should be if req.status_code == 200.
Another thing is that the function only adds responsive links to output. If you want to check for non-responsive links, don't you have to add links that do not return the 200 status code?
import requests
from bs4 import BeautifulSoup, SoupStrainer

status = {}
response = {}
output = {}

def get_url_status(url, count):
    global links
    links = []
    # if url not in output.keys():
    headers = requests.utils.default_headers()
    # pass headers as a keyword argument; passed positionally they would become query params
    req = requests.get(url, headers=headers)
    if req.status_code == 200:
        # if url not in output.keys():
        output[url] = '200'
    if count == 0:
        return output
    for link in BeautifulSoup(req.content, "html.parser", parse_only=SoupStrainer('a')):
        if 'href' in str(link):
            links.append(link.get('href'))
    # removing other non-mandatory links
    for link in links[:]:  # iterate over a copy so removing items is safe
        if "mi" not in link:
            links.remove(link)
    # removing same url
    for link in links[:]:
        if link.rstrip('/') == url:
            links.remove(link)
    # removing duplicate links
    links = list(dict.fromkeys(links))
    print(links)
    if len(links):
        for urllink in links:
            get_url_status(urllink, count-1)
    return output

result = get_url_status('https://www.mi.com/in', 1)
print(result)
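Since the stated goal is to find non-responsive links, here is a minimal sketch of that extra step (the function name and structure are my own, not part of the code above): record every status code and report the ones that are not 200. You could run it over the links that get_url_status() collects.

import requests

def find_bad_links(urls):
    """Sketch: fetch each URL once and report the ones that don't return 200."""
    statuses = {}
    for url in urls:
        try:
            resp = requests.get(url, headers=requests.utils.default_headers(), timeout=10)
            statuses[url] = resp.status_code
        except requests.RequestException:
            statuses[url] = None  # unreachable counts as non-responsive too
    return {u: code for u, code in statuses.items() if code != 200}

# e.g. feed it the links collected during the crawl
print(find_bad_links(['https://www.mi.com/in']))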

Related

How to find the total number of pages on a website with BeautifulSoup?

Context: I'm working on pagination of this website: https://skoodos.com/schools-in-uttarakhand. When I inspected it, the website shows no proper page count, only a next button which appends ?page=2 to the URL. Also, searching for page-link gave me the number 20 at the end, so I assumed the total number of pages is 20; upon checking manually, I learnt that only 11 pages exist.
After many trials and errors, I finally decided to go with just indexing from 0 until 12 (12 is excluded by Python, however).
What I want to know is: how would you go about figuring out the number of pages on a website that doesn't show the actual page count, only previous and next buttons, and how can I optimize my code for that?
Here's my solution to pagination. Any way to optimize this other than me manually finding the number of pages?
from myWork.commons import url_parser, write
def data_fetch(url):
school_info = []
for page_number in range(0, 4):
next_web_page = url + f'?page={page_number}'
soup = url_parser(next_web_page)
search_results = soup.find('section', {'id': 'search-results'}).find(class_='container').find(class_='row')
# rest of the code
for page_number in range(4, 12):
next_web_page = url + f'?page={page_number}'
soup = url_parser(next_web_page)
search_results = soup.find('section', {'id': 'search-results'}).find(class_='container').find(class_='row')
# rest of the code
def main():
url = "https://skoodos.com/schools-in-uttarakhand"
data_fetch(url)
if __name__ == "__main__":
main()
Each of your pages (except the last one) will have an element like this:
<a class="page-link"
   href="https://skoodos.com/schools-in-uttarakhand?page=2"
   rel="next">Next »</a>
E.g. you can extract the link as follows (here for the first page):
link = soup.find('a', class_='page-link', href=True, rel='next')
print(link['href'])
https://skoodos.com/schools-in-uttarakhand?page=2
So, you could make your function recursive. E.g. use something like this:
import requests
from bs4 import BeautifulSoup

def data_fetch(url, results = list()):
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, 'lxml')
    search_results = soup.find('section', {'id': 'search-results'})\
        .find(class_='container').find(class_='row')
    results.append(search_results)
    link = soup.find('a', class_='page-link', href=True, rel='next')
    # link will be `None` for last page (i.e. `page=11`)
    if link:
        # just adding some prints to show progress of iteration
        if not 'page' in url:
            print('getting page: 1', end=', ')
        url = link['href']
        # subsequent page nums being retrieved
        print(f'{url.rsplit("=", maxsplit=1)[1]}', end=', ')
        # recursive call
        return data_fetch(url, results)
    else:
        # `page=11` with no link, we're done
        print('done')
        return results

url = 'https://skoodos.com/schools-in-uttarakhand'
data = data_fetch(url)
So, a call to this function will print progress as:
getting page: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, done
And you'll end up with data containing 11 bs4.element.Tag objects, one for each page.
print(len(data))
11
print(set([type(d) for d in data]))
{<class 'bs4.element.Tag'>}
Good luck with extracting the required info; the site is very slow, and the HTML is particularly sloppy and inconsistent. (e.g. you're right to note there is a page-link elem, which suggests there are 20 pages. But its visibility is set to hidden, so apparently this is just a piece of deprecated/unused code.)
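One small caveat, in case you call data_fetch more than once in the same run: the results = list() default is created only once, when the function is defined, so a second top-level call would keep appending to the list from the first call. A sketch of the usual way around that (only the signature and the guard change, the body stays the same):

def data_fetch(url, results=None):
    # create a fresh list per top-level call instead of sharing one default list
    if results is None:
        results = []
    # ... same body as above, passing `results` on the recursive call
    return results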
There's a bit at the top that says "Showing the 217 results as per selected criteria". You can write code to extract that number, then count the number of results per page and divide by it to get the expected number of pages (don't forget to round up).
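For example, with the numbers quoted here: 217 results at 20 per page gives math.ceil(217 / 20) = 11 pages, which matches the 11 pages found manually.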
If you want to double check, add more code to go to the calculated last page, and:
- if there's no such page, keep decrementing the total and checking until you hit a page that exists
- if there is such a page, but it has an active/enabled "Next" button, keep going to the next page until reaching the last (basically as you are now)
(Remember that the two listed above are contingencies and wouldn't be executed in an ideal scenario.)
So, just to find the number of pages, you could do something like:
import requests
from bs4 import BeautifulSoup
import math

def soupFromUrl(scrapeUrl):
    req = requests.get(scrapeUrl)
    if req.status_code == 200:
        return BeautifulSoup(req.text, 'html.parser')
    else:
        raise Exception(f'{req.reason} - failed to scrape {scrapeUrl}')

def getPageTotal(url):
    soup = soupFromUrl(url)
    #totalResults = int(soup.find('label').get_text().split('(')[-1].split(')')[0])
    totalResults = int(soup.p.strong.get_text())  # both searches should work
    perPageResults = len(soup.select('.m-show'))  # probably always 20
    print(f'{perPageResults} of {totalResults} results per page')
    if not (perPageResults > 0 and totalResults > 0):
        return 0
    lastPageNum = math.ceil(totalResults / perPageResults)

    # Contingencies - will hopefully never be needed
    lpSoup = soupFromUrl(f'{url}?page={lastPageNum}')
    if lpSoup.select('.m-show'):  # page exists
        while lpSoup.select_one('a[rel="next"]'):
            nextLink = lpSoup.select_one('a[rel="next"]')['href']
            lastPageNum = int(nextLink.split('page=')[-1])
            lpSoup = soupFromUrl(nextLink)
    else:  # page does not exist
        while not (lpSoup.select('.m-show') or lastPageNum < 1):
            lastPageNum = lastPageNum - 1
            lpSoup = soupFromUrl(f'{url}?page={lastPageNum}')
    # end Contingencies section
    return lastPageNum
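For example (assuming the markup is still as described), a call like the one below should report the per-page count and return the computed last page number, which you can then use to drive your loop:

totalPages = getPageTotal('https://skoodos.com/schools-in-uttarakhand')
print(totalPages)  # with 217 results at 20 per page this should come out as 11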
However, it looks like you only want the total pages in order to start the for-loop, but it's not even necessary to use a for-loop at all - a while-loop might be better:
def data_fetch(url):
    school_info = []
    nextUrl = url
    while nextUrl:
        soup = soupFromUrl(nextUrl)
        # GET YOUR DATA FROM PAGE
        nextHL = soup.select_one('a[rel="next"]')
        nextUrl = nextHL.get('href') if nextHL else None
    # code after fetching all pages' data
Although, you could still use a for-loop if you had a maximum page number in mind:
def data_fetch(url, maxPages):
    school_info = []
    for p in range(1, maxPages + 1):
        soup = soupFromUrl(f'{url}?page={p}')
        if not soup.select('.m-show'):
            break
        # GET YOUR DATA FROM PAGE
    # code after fetching all pages' data [up to max]

Unable to scrape emails from some websites maybe due to r.html.render() not working properly

I have some website links as samples for extracting any emails available on their internal pages.
However, even though I am trying to render any JS-driven website via r.html.render() within the scrape_email(url) method, some of the websites like arken.trygge.dk, gronnebakken.dk, dagtilbud.ballerup.dk/boernehuset-bispevangen etc. do not return any email, which might be due to a rendering issue.
I have attached a sample file for convenience of running the code.
I don't want to use Selenium as there can be thousands or millions of webpages I want to extract emails from.
So far this is my code:
import os
import time
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import re
from requests_html import HTMLSession
import pandas as pd
from gtts import gTTS
import winsound

# For convenience of seeing console output in the script
pd.options.display.max_colwidth = 180

# Get the start time of script execution
startTime = time.time()

# Paste file name inside ''
input_file_name = 'sample'

input_df = pd.read_excel(input_file_name + '.xlsx', engine='openpyxl')
input_df = input_df.dropna(how='all')

internal_urls = set()
emails = set()
total_urls_visited = 0

def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

def get_internal_links(url):
    """
    Returns all URLs that is found on `url` in which it belongs to the same website
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    print("Domain name -- ", domain_name)
    try:
        soup = BeautifulSoup(requests.get(url, timeout=5).content, "html.parser")
        for a_tag in soup.findAll("a"):
            href = a_tag.attrs.get("href")
            if href == "" or href is None:
                # href empty tag
                continue
            # join the URL if it's relative (not absolute link)
            href = urljoin(url, href)
            parsed_href = urlparse(href)
            # remove URL GET parameters, URL fragments, etc.
            href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
            if not is_valid(href):
                # not a valid URL
                continue
            if href in internal_urls:
                # already in the set
                continue
            if parsed_href.netloc != domain_name:
                # if the link is not of same domain pass
                continue
            if parsed_href.path.endswith((".csv", ".xlsx", ".txt", ".pdf", ".mp3", ".png", ".jpg", ".jpeg", ".svg", ".mov", ".js", ".gif", ".mp4", ".avi", ".flv", ".wav")):
                # Overlook site images, pdf and other files rather than webpages
                continue
            print(f"Internal link: {href}")
            urls.add(href)
            internal_urls.add(href)
        return urls
    except requests.exceptions.Timeout as err:
        print("The website is not loading within 5 seconds... Continuing crawling the next one")
        pass
    except:
        print("The website is unavailable. Continuing crawling the next one")
        pass

def crawl(url, max_urls=30):
    """
    Crawls a web page and extracts all links.
    You'll find all links in `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): number of max urls to crawl, default is 30.
    """
    global total_urls_visited
    total_urls_visited += 1
    print(f"Crawling: {url}")
    links = get_internal_links(url)
    # for link in links:
    #     if total_urls_visited > max_urls:
    #         break
    #     crawl(link, max_urls=max_urls)

def scrape_email(url):
    EMAIL_REGEX = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
    # EMAIL_REGEX = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""
    try:
        # initiate an HTTP session
        session = HTMLSession()
        # get the HTTP Response
        r = session.get(url, timeout=10)
        # for JAVA-Script driven websites
        r.html.render()
        single_url_email = []
        for re_match in re.finditer(EMAIL_REGEX, r.html.raw_html.decode()):
            single_url_email.append(re_match.group().lower())
        r.session.close()
        return set(single_url_email)
    except:
        pass

def crawl_website_scrape_email(url, max_internal_url_no=20):
    crawl(url, max_urls=max_internal_url_no)
    each_url_emails = []
    global internal_urls
    global emails
    for each_url in internal_urls:
        each_url_emails.append(scrape_email(each_url))
    URL_WITH_EMAILS = {'main_url': url, 'emails': each_url_emails}
    emails = {}
    internal_urls = set()
    return URL_WITH_EMAILS

def list_check(emails_list, email_match):
    match_indexes = [i for i, s in enumerate(emails_list) if email_match in s]
    return [emails_list[index] for index in match_indexes]

URL_WITH_EMAILS_LIST = [crawl_website_scrape_email(x) for x in input_df['Website'].values]
URL_WITH_EMAILS_DF = pd.DataFrame(data=URL_WITH_EMAILS_LIST)
URL_WITH_EMAILS_DF.to_excel(f"{input_file_name}_email-output.xlsx", index=False)
How can I solve the issue of not being able to scrape emails from the above-mentioned and similar kinds of websites?
Is there also any way to detect and print a message if my GET request is refused by a bot detector or related protection?
Also, how can I make this code more robust?
Thank you in advance

Web Scraping Python BeautifulSoup get elements for each webpage in website

I am in my infancy of Python coding. What I am trying to do is build a web scraper which gets all the links from a website and then returns the elements from each page. The code I started with is from https://www.thepythoncode.com/article/extract-all-website-links-python
this works really nicely to get all the links from a website.
As I am only interested in the internal links, I have added some extra code to try and get the elements (title, h1, some other bits which I haven't added yet). The issue I am running into is that I think one of the hrefs returns an email, the code then tries to extract the elements from it, and obviously this bugs out. I have tried to avoid it picking up the email (which I also thought the is_valid function would handle) but I am obviously missing something. Any help would be really appreciated.
import re
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama

GREEN = colorama.Fore.GREEN
GRAY = colorama.Fore.LIGHTBLACK_EX
RESET = colorama.Fore.RESET
YELLOW = colorama.Fore.YELLOW

internal_urls = set()
external_urls = set()
title_urls = set()

def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

def get_all_website_links(url):
    """
    Returns all URLs that is found on `url` in which it belongs to the same website
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # is_internal_link == True:
    title_check = soup.find_all('title')
    if title_check != " " or title_check != None:
        get_title(url)
        get_heading_tags(url)
    for a_tag in soup.findAll("a"):
        # is_internal_link = False
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # href empty tag
            continue
        # join the URL if it's relative (not absolute link)
        href = urljoin(url, href)
        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
        if domain_name not in href:
            # external link
            if href not in external_urls:
                #print(f"{GRAY}[!] External link: {href}{RESET}")
                external_urls.add(href)
            continue
        print(f"{GREEN}[*] Internal link: {href}{RESET}")
        if re.search('#', href) == True:
            continue
        urls.add(href)
        internal_urls.add(href)
    return urls

# number of urls visited so far will be stored here
total_urls_visited = 0

def get_title(url): # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    #print("Title of the website is : ")
    for title in soup.find_all('title'):
        if title == "" and title == None:
            continue
        title_text = title.get_text()
        title_urls.add(title_text)
        print(title_text)
        print((len(title_text)))

def get_heading_tags(url):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    heading_tags = ['h1', 'h2', 'h3']
    i = 0
    for tags in soup.find_all(heading_tags):
        if tags == " " or tags == None:
            continue
        tags_text = tags.get_text()
        letters_in_tags = len(tags_text) - tags_text.count(" ")
        i += 1
        print(f'{tags.name} {i} -> {tags_text} -> Length ->{letters_in_tags} ')

def crawl(url, max_urls=80):
    """
    Crawls a web page and extracts all links.
    You'll find all links in `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): number of max urls to crawl, default is 30.
    """
    global total_urls_visited
    total_urls_visited += 1
    print(f"{YELLOW}[*] Crawling: {url}{RESET}")
    links = get_all_website_links(url)
    for link in links:
        if re.search('#', link) != True:
            if total_urls_visited > max_urls:
                break
            crawl(link, max_urls=max_urls)

if __name__ == "__main__":
    crawl("https://website.com/") #put website here.
    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total External links:", len(external_urls))
    print("[+] Total URLs:", len(external_urls) + len(internal_urls))
for link in links:
    if re.search('#', link) != True:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)
You are only checking whether # is present in the link to decide if it's an email, and that check isn't even correct: re.search() returns a match object or None, never True, so comparing the result with == True (or != True) doesn't do what you expect. Also note that ordinary links can contain # (fragment identifiers) as well.
Basically, emails inside <a> will be of the form:
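<a href="mailto:someone@example.com">Contact us</a>
(the address above is just a placeholder; the important part is the mailto: scheme in the href)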
So to differentiate emails from links, you can use the below check.
for link in links:
    if not link.startswith('mailto:'):
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)
This will ignore all the emails and only scrape links.
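A slightly more general variant of the same check (just a sketch, not the only way to do it) is to keep only http/https URLs, which also skips things like tel: and javascript: links:

from urllib.parse import urlparse

for link in links:
    if urlparse(link).scheme in ("http", "https"):  # skips mailto:, tel:, javascript:, etc.
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)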

trying to loop through a list of urls and scrape each page for text

I'm having an issue. It loops through the list of URLs, but it's not adding the text content of each scraped page to the presults list.
I haven't gotten to the raw text processing yet. I'll probably make a question for that once I get there, if I can't figure it out.
What is wrong here? The length of presults remains at 1 even though it seems to be looping through the list of URLs for the scrape...
Here's part of the code I'm having an issue with:
counter=0
for xa in range(0,len(qresults)):
    pageURL=qresults[xa].format()
    pageresp= requests.get(pageURL, headers=headers)
    if pageresp.status_code==200:
        print(pageURL)
        psoup=BeautifulSoup(pageresp.content, 'html.parser')
        presults=[]
        para=psoup.text
        presults.append(para)
        print(len(presults))
    else: print("Could not reach domain")
print(len(presults))
Your immediate problem is here:
presults=[]
para=psoup.text
presults.append(para)
On every for iteration, you replace your existing presults list with the empty list and add one item. On the next iteration, you again wipe out the previous result.
Your initialization must be done only once, before the loop:
presults = []
for xa in range(0,len(qresults)):
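Putting it together, a sketch of the corrected loop (assuming qresults, headers and the imports are defined earlier in your script, as in the question):

presults = []
for xa in range(0, len(qresults)):
    pageURL = qresults[xa].format()
    pageresp = requests.get(pageURL, headers=headers)
    if pageresp.status_code == 200:
        psoup = BeautifulSoup(pageresp.content, 'html.parser')
        presults.append(psoup.text)  # accumulate instead of resetting
    else:
        print("Could not reach domain")
print(len(presults))  # now one entry per successfully fetched URL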
Ok, I don't even see you looping through any URLs here, but below is a generic example of how this kind of request can be achieved.
import requests
from bs4 import BeautifulSoup

# note: base_url ends with "page=" so the current page number can be appended to it
base_url = "http://www.privredni-imenik.com/pretraga?abcd=&keyword=&cities_id=0&category_id=0&sub_category_id=0&page="
current_page = 1

while current_page < 200:
    print(current_page)
    url = base_url + str(current_page)
    r = requests.get(url)
    zute_soup = BeautifulSoup(r.text, 'html.parser')
    firme = zute_soup.findAll('div', {'class': 'jobs-item'})
    for title in firme:
        title1 = title.findAll('h6')[0].text
        print(title1)
        adresa = title.findAll('div', {'class': 'description'})[0].text
        print(adresa)
        kontakt = title.findAll('div', {'class': 'description'})[1].text
        print(kontakt)
        print('\n')
        page_line = "{title1}\n{adresa}\n{kontakt}".format(
            title1=title1,
            adresa=adresa,
            kontakt=kontakt
        )
    current_page += 1

Recursive function gives no output

I'm scraping all the URLs of my domain with a recursive function.
But it outputs nothing, without any error.
#usr/bin/python
from bs4 import BeautifulSoup
import requests
import tldextract

def scrape(url):
    for links in url:
        main_domain = tldextract.extract(links)
        r = requests.get(links)
        data = r.text
        soup = BeautifulSoup(data)
        for href in soup.find_all('a'):
            href = href.get('href')
            if not href:
                continue
            link_domain = tldextract.extract(href)
            if link_domain.domain == main_domain.domain:
                problem.append(href)
            elif not href == '#' and link_domain.tld == '':
                new = 'http://www.' + main_domain.domain + '.' + main_domain.tld + '/' + href
                problem.append(new)
    return len(problem)
    return scrape(problem)

problem = ["http://xyzdomain.com"]
print(scrape(problem))
When I create a new list, it works, but I don't want to make a list every time for every loop.
You need to structure your code so that it meets the pattern for recursion, which your current code doesn't. You should also avoid reusing names, e.g. href = href.get('href') rebinds href from a tag to a string (and shadowing library names is even worse, since the library then stops being reachable under that name). And your code as it currently stands will only ever return the len(), because that return is unconditionally reached before return scrape(problem):
def Recursive(Factorable_problem):
    if Factorable_problem is Simplest_Case:
        return AnswerToSimplestCase
    else:
        return Rule_For_Generating_From_Simpler_Case(Recursive(Simpler_Case))
for example:
def Factorial(n):
    """ Recursively Generate Factorials """
    if n < 2:
        return 1
    else:
        return n * Factorial(n-1)
Hello, I've made a non-recursive version of this that appears to get all the links on the same domain.
I tested the code below using the problem list included in the code. After I'd solved the problems with the recursive version, the next problem was hitting the recursion depth limit, so I rewrote it to run in an iterative fashion; the code and result are below:
from bs4 import BeautifulSoup
import requests
import tldextract

def print_domain_info(d):
    print "Main Domain:{0} \nSub Domain:{1} \nSuffix:{2}".format(d.domain,d.subdomain,d.suffix)

SEARCHED_URLS = []
problem = [ "http://Noelkd.neocities.org/", "http://youpi.neocities.org/"]

while problem:
    # Get a link from the stack of links
    link = problem.pop()
    # Check we haven't been to this address before
    if link in SEARCHED_URLS:
        continue
    # We don't want to come back here again after this point
    SEARCHED_URLS.append(link)
    # Try and get the website
    try:
        req = requests.get(link)
    except:
        # If its not working i don't care for it
        print "borked website found: {0}".format(link)
        continue
    # Now we get to this point worth printing something
    print "Trying to parse:{0}".format(link)
    print "Status Code:{0} Thats: {1}".format(req.status_code, "A-OK" if req.status_code == 200 else "SOMTHINGS UP" )
    # Get the domain info
    dInfo = tldextract.extract(link)
    print_domain_info(dInfo)
    # I like utf-8
    data = req.text.encode("utf-8")
    print "Lenght Of Data Retrived:{0}".format(len(data)) # More info
    soup = BeautifulSoup(data) # This was here before so i left it.
    print "Found {0} link{1}".format(len(soup.find_all('a')),"s" if len(soup.find_all('a')) > 1 else "")
    FOUND_THIS_ITERATION = [] # Getting the same links over and over was boring
    found_links = [x for x in soup.find_all('a') if x.get('href') not in SEARCHED_URLS] # Find me all the links i don't got
    for href in found_links:
        href = href.get('href') # You wrote this seems to work well
        if not href:
            continue
        link_domain = tldextract.extract(href)
        if link_domain.domain == dInfo.domain: # JUST FINDING STUFF ON SAME DOMAIN RIGHT?!
            if href not in FOUND_THIS_ITERATION: # I'ma check you out next time
                print "Check out this link: {0}".format(href)
                print_domain_info(link_domain)
                FOUND_THIS_ITERATION.append(href)
                problem.append(href)
            else: # I got you already
                print "DUPE LINK!"
        else:
            print "Not on same domain moving on"
    # Count down
    print "We have {0} more sites to search".format(len(problem))
    if problem:
        continue
    else:
        print "Its been fun"
        print "Lets see the URLS we've visited:"
        for url in SEARCHED_URLS:
            print url
Which prints, after a lot of other logging, loads of neocities websites!
What's happening is that the script pops a value off the list of websites yet to visit, then gets all the links on that page which are on the same domain. If those links lead to pages we haven't visited, we add them to the list of links to be visited. After that we pop the next page and do the same thing again until there are no pages left to visit.
I think this is what you're looking for; get back to us in the comments if this doesn't work the way you want, or if anyone can improve it, please leave a comment.
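Since the code above is written for Python 2, here is a compact Python 3 sketch of the same stack-based idea, reusing one of the start URLs from the answer; it only queues links whose registered domain matches the page being visited:

from bs4 import BeautifulSoup
import requests
import tldextract

searched = set()
stack = ["http://Noelkd.neocities.org/"]  # same start URL as in the answer above

while stack:
    link = stack.pop()
    if link in searched:
        continue
    searched.add(link)
    try:
        req = requests.get(link, timeout=10)
    except requests.RequestException:
        continue  # skip sites that don't respond
    page_domain = tldextract.extract(link).domain
    soup = BeautifulSoup(req.text, "html.parser")
    for a in soup.find_all("a"):
        href = a.get("href")
        # keep only same-domain links that we haven't already visited
        if href and href not in searched and tldextract.extract(href).domain == page_domain:
            stack.append(href)

print("\n".join(sorted(searched)))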
