I've written a function in Python that gets all the links on a page.
Then I run that function on each of the links the first call returned.
My question is, if I were to keep on doing this using CNN as my starting point, how would I know when I had crawled all (or most) of CNN's webpages?
Here's the code for the crawler.
base_url = "http://www.cnn.com"
title = "cnn"
my_file = open(title+".txt","w")
def crawl(site):
seed_url = site
br = Browser()
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.open(seed_url)
link_bank = []
for link in br.links():
if link.url[0:4] == "http":
link_bank.append(link.url)
if link.url[0] == "/":
url = link.url
if url.find(".com") == -1:
if url.find(".org") == -1:
link_bank.append(base_url+link.url)
else:
link_bank.append(link.url)
else:
link_bank.append(link.url)
if link.url[0] == "#":
link_bank.append(base_url+link.url)
link_bank = list(set(link_bank))
for link in link_bank:
my_file.write(link+"\n")
return link_bank
my_file.close()
I did not look into your code specifically, but you should look up how to implement a breadth-first search, and additionally store already-visited URLs in a set. If you find a new URL on the currently visited page, append it to the list of URLs to visit if it isn't in the set already.
You might need to ignore the query string (everything after the question mark in a URL).
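A rough sketch of what that could look like, reusing the crawl() function from the question (crawl_site and max_pages are made-up names; the crawl is "done" when the queue of URLs still to visit runs empty or you hit your limit):

from collections import deque

# Minimal breadth-first crawl sketch, assuming crawl(site) returns the list of
# links found on that page; max_pages is just a made-up safety limit.
def crawl_site(seed, max_pages=10000):
    visited = set()
    to_visit = deque([seed])
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        url = url.split('#')[0].split('?')[0]   # drop the fragment and the query string
        if url in visited:
            continue
        visited.add(url)
        for link in crawl(url):                 # your existing function
            if link not in visited:
                to_visit.append(link)
    return visited  # when the queue empties, you have seen every page reachable from the seed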
The first thing that comes to mind is to keep a set of visited links. Each time you request a link, add it to the set; before requesting a link, check whether it is already in the set.
Another point is that you are actually reinventing the wheel here: the Scrapy web-scraping framework has a link-extraction mechanism built in, and it's worth using.
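For illustration, a minimal CrawlSpider sketch (the class name, callback, and yielded fields are made-up; allowed_domains keeps the crawl on cnn.com and Scrapy de-duplicates requests for you):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class CnnSpider(CrawlSpider):
    name = "cnn"
    allowed_domains = ["cnn.com"]
    start_urls = ["http://www.cnn.com"]

    # Follow every on-domain link and call parse_page on each response.
    rules = (
        Rule(LinkExtractor(allow_domains=["cnn.com"]), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        yield {"url": response.url}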
Hope that helps.
I'm new to web2py and websites in general.
I would like to upload XML files with different numbers of questions in them.
I'm using bs4 to parse the uploaded file, and then I want to do different things: if there is only one question in the XML file I would like to go to one page, and if there are more questions I want to go to another.
So this is my code:
def import_file():
    form = SQLFORM.factory(Field('file', 'upload', requires=IS_NOT_EMPTY(),
                                 uploadfolder=os.path.join(request.folder, 'uploads'),
                                 label='file:'))
    if form.process().accepted:
        soup = BeautifulSoup('file', 'html.parser')
        questions = soup.find_all(lambda tag: tag.name == "question" and tag["type"] != "category")
        # now I want to check the length of the list to redirect to the different URLs,
        # but it doesn't work: len(questions) is 0.
        if len(questions) == 1:
            redirect(URL('import_questions'))
        elif len(questions) > 1:
            redirect(URL('checkboxes'))
    return dict(form=form, message=T("Please upload the file"))
Does anybody know what I can do to check the length of the list after uploading the XML file?
BeautifulSoup expects a string or a file-like object, but you're passing the literal string 'file'. Instead, you should use:
with open(form.vars.file) as f:
    soup = BeautifulSoup(f, 'html.parser')
However this is not a web2py-specific problem.
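That said, if open(form.vars.file) can't find the file, note that web2py stores the upload under the uploads folder with a new name, and form.vars.file holds that stored filename. A sketch, reusing the uploadfolder set in the form above:

import os

# form.vars.file is the stored (renamed) filename that web2py wrote to the upload folder.
path = os.path.join(request.folder, 'uploads', form.vars.file)
with open(path) as f:
    soup = BeautifulSoup(f, 'html.parser')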
Hope this helps.
I have written code that fetches the HTML of any given site, extracts all the links from it, and saves them in a list. My goal is to change all the relative links in the HTML file to absolute links.
Here are the links:
src="../styles/scripts/jquery-1.9.1.min.js"
href="/PhoneBook.ico"
href="../css_responsive/fontsss.css"
src="http://www.google.com/adsense/search/ads.js"
L.src = '//www.google.com/adsense/search/async-ads.js'
href="../../"
src='../../images/plus.png'
vrUrl ="search.aspx?searchtype=cat"
These are a few links that I copied from the HTML file to keep the question simple and less error-prone.
Following are the different URLs used in html file:
http://yourdomain.com/images/example.png
//yourdomain.com/images/example.png
/images/example.png
images/example.png
../images/example.png
../../images/example.png
Python code:
linkList = re.findall(re.compile(u'(?<=href=").*?(?=")|(?<=href=\').*?(?=\')|(?<=src=").*?(?=")|(?<=src=\').*?(?=\')|(?<=action=").*?(?=")|(?<=vrUrl =").*?(?=")|(?<=\')//.*?(?=\')'), str(html))

newLinks = []
for link1 in linkList:
    if link1.startswith("//"):
        newLinks.append(link1)
    elif link1.startswith("../"):
        newLinks.append(link1)
    elif link1.startswith("../../"):
        newLinks.append(link1)
    elif link1.startswith("http"):
        newLinks.append(link1)
    elif link1.startswith("/"):
        newLinks.append(link1)
    else:
        newLinks.append(link1)
At this point, the second condition ("../") also matches all URLs that start with "../../", which is not what I want. The same happens with "/": it also picks up URLs starting with "//". I also tried the start and end parameters of startswith, but that did not solve the issue.
How about using the str.count method:
>>> src="../styles/scripts/jquery-1.9.1.min.js"
>>> src2='../../images/plus.png'
>>> src.count('../')
1
>>> src2.count('../')
2
This works because "../" only appears at the beginning of these URLs.
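A different route worth mentioning: the standard library's urljoin resolves every form listed in the question, so the prefix checks can be avoided entirely. A sketch (page_url is an assumption; it should be the URL the HTML was actually fetched from):

from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

# Hypothetical page URL; use the URL the HTML actually came from.
page_url = "http://yourdomain.com/section/page.aspx"

absoluteLinks = [urljoin(page_url, link1) for link1 in linkList]
# e.g. urljoin(page_url, "../images/example.png") -> "http://yourdomain.com/images/example.png"
#      urljoin(page_url, "//www.google.com/ads.js") -> "http://www.google.com/ads.js"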
I've been trying to implement a parse function.
Essentially I figured out through the scrapy shell that
response.xpath('//*[@id="PagerAfter"]/a[last()]/@href').extract()[0]
gives me the URL pointing to the next page. So I tried following the instructions with next_page. I looked around Stack Overflow and it seems that everyone uses Rule(LinkExtractor..., which I don't believe I need to use. I'm pretty sure I'm doing it completely wrong, though. I originally had a for loop that added every link I wanted to visit to start_urls, because I knew they were all of the form *p1.html, *p2.html, etc., but I want to make this smarter.
def parse(self, response):
    items = []
    for sel in response.xpath('//div[@class="Message"]'):
        itemx = mydata()
        itemx['information'] = sel.extract()
        items.append(itemx)
        with open('log.txt', 'a') as f:
            f.write('\ninformation: ' + itemx.get('information')

    # URL of next page: response.xpath('//*[@id="PagerAfter"]/a[last()]/@href').extract()[0]
    next_page = (response.xpath('//*[@id="PagerAfter"]/a[last()]/@href'))
    if (response.url != response.xpath('//*[@id="PagerAfter"]/a[last()]/@href')):
        if next_page:
            yield Request(response.xpath('//*[@id="PagerAfter"]/a[last()]/@href')[0], self.parse)
    return items
but it does not work; I get a
next_page = (response.xpath('//*[@id="PagerAfter"]/a[last()]/@href'))
    ^
SyntaxError: invalid syntax
error. Additionally, I know that the yield Request part is wrong. I want to call parse recursively and add each page's scrape to the items list.
Thank you!
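For what it's worth, the missing closing parenthesis on the f.write(...) line above is likely what triggers the SyntaxError reported on the following line. A rough sketch of the intended flow (assuming scrapy.Request and the XPaths from the question; yielding items instead of returning a list is the usual Scrapy pattern):

import scrapy

def parse(self, response):
    for sel in response.xpath('//div[@class="Message"]'):
        itemx = mydata()                      # item class from the question
        itemx['information'] = sel.extract()
        yield itemx

    # Follow the "next page" link, if any, and parse it with this same method.
    next_page = response.xpath('//*[@id="PagerAfter"]/a[last()]/@href').extract()
    if next_page and next_page[0] != response.url:
        yield scrapy.Request(response.urljoin(next_page[0]), callback=self.parse)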
My code should take a link passed on the command line, get the HTML for the page at that link, search that HTML for links, and then repeat these steps for the links it finds. I hope that is clear.
It should print out any links that cause errors.
Some more needed info:
The max visits it can do is 100.
If a website has an error, a None value is returned.
Python3 is what I am using
eg:
s = readwebpage(url)... # This line of code gets the HTML code for the link(url) passed in its argument.... if the link has an error, s = None.
The HTML for that site has links ending in p2.html, p3.html, p4.html, and p5.html. My code reads all of these, but it does not visit each of them individually to search for more links. If it did, it would find a link ending in p10.html and then report that the link ending with p10.html has errors. Obviously it doesn't do that at the moment, and that's what's giving me a hard time.
My code:
url = args.url[0]
url_list = [url]
checkedURLs = []
AmountVisited = 0

while (url_list and AmountVisited < maxhits):
    url = url_list.pop()
    s = readwebpage(url)
    print("testing url: http", url)  # Print the url being tested; this line is here only for testing
    AmountVisited = AmountVisited + 1
    if s == None:
        print("* bad reference to http", url)
    else:
        urls_list = re.findall(r'href="http([\s:]?[^\'" >]+)', s)  # Creates a list of all links in the HTML starting with http or https
        while urls_list:
            insert = urls_list.pop()
            while (insert in checkedURLs and urls_list):
                insert = urls_list.pop()
            url_list.append(insert)
            checkedURLs = insert
Please help :)
Here is the code you wanted. However, please stop using regexes to parse HTML; BeautifulSoup is the way to go for that.
import re
from urllib import urlopen

def readwebpage(url):
    print "testing ", url
    return urlopen(url).read()

url = 'http://xrisk.esy.es'  # put starting url here
yet_to_visit = [url]
visited_urls = []
AmountVisited = 0
maxhits = 10

while (yet_to_visit and AmountVisited < maxhits):
    print yet_to_visit
    current = yet_to_visit.pop()
    AmountVisited = AmountVisited + 1
    html = readwebpage(current)
    if html == None:
        print "* bad reference to http", current
    else:
        r = re.compile('(?<=href=").*?(?=")')
        links = re.findall(r, html)  # Creates a list of all links in the HTML code
        for u in links:
            if u in visited_urls:
                continue
            elif u.find('http') != -1:
                yet_to_visit.append(u)
        print links
    visited_urls.append(current)
It's not Python, but since you mentioned you aren't strictly tied to regex, you might find wget useful for this.
wget --spider -o C:\wget.log -e robots=off -w 1 -r -l 10 http://www.stackoverflow.com
Broken down:
--spider: When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
-o C:\wget.log: Log all messages to C:\wget.log.
-e robots=off: Ignore robots.txt
-w 1: set a wait time of 1 second
-r: set recursive search on
-l 10: sets the recursion depth to 10, meaning wget will only go as deep as 10 levels; this may need to change depending on your max requests
http://www.stackoverflow.com: the URL you want to start with
Once complete, you can review the wget.log entries to determine which links had errors by searching for something like HTTP status codes 404, etc.
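If the log review itself should happen in Python, a small sketch like this could pull out the failing lines (the log path matches the command above; the exact wording of the messages depends on your wget version):

# Scan the wget log for lines that look like failures.
with open(r"C:\wget.log", errors="ignore") as log:
    for line in log:
        if "404" in line or "broken link" in line.lower():
            print(line.strip())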
I suspect your regex is part of your problem. Right now, you have http outside your capture group, and [\s:] matches "some sort of whitespace (ie \s) or :"
I'd change the regex to: urls_list = re.findall(r'href="(.*)"',s). Also known as "match anything in quotes, after href=". If you absolutely need to ensure the http[s]://, use r'href="(https?://.*)"' (s? => one or zero s)
EDIT: And here is an actually working regex, using a non-greedy match: r'href=(?P<q>[\'"])(https?://.*?)(?P=q)'
(Also, uh, while it's not technically necessary in your case because re caches, I think it's good practice to get into the habit of using re.compile.)
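For completeness, a small sketch of that pattern in use (s is the page HTML returned by your readwebpage, as in the question):

import re

# Pre-compiled, non-greedy pattern; group 2 is the URL between the quotes.
href_re = re.compile(r'href=(?P<q>[\'"])(https?://.*?)(?P=q)')

urls_list = [m.group(2) for m in href_re.finditer(s)]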
I think it's awfully nice that all of your URLs are full URLs. Do you have to deal with relative URLs at all?
The same thing was asked 2.5 years ago in Downloading a web page and all of its resource files in Python, but it didn't lead to an answer, and the 'please see related topic' reply isn't really asking the same thing.
I want to download everything on a page to make it possible to view it just from the files.
The command
wget --page-requisites --domains=DOMAIN --no-parent --html-extension --convert-links --restrict-file-names=windows
does exactly what I need. However, we want to be able to tie it in with other code that must be portable, so it needs to be in Python.
I've been looking at Beautiful Soup, Scrapy, and various spiders posted around the place, but they all seem to deal with getting data/links in clever but specific ways. Using them for what I want seems like a lot of work to find all of the resources, when I'm sure there must be an easy way.
Thanks very much.
You should be using an appropriate tool for the job at hand.
If you want to spider a site and save the pages to disk, Python probably isn't the best choice for that. Open source projects get features when someone needs that feature, and because wget does its job so well, nobody has bothered to try to write a python library to replace it.
Considering wget runs on pretty much any platform that has a Python interpreter, is there a reason you can't use wget?
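If the rest of the tooling has to be Python, one low-effort option is simply to drive wget from Python. A sketch, assuming wget is installed and on the PATH (the URL and domain below are placeholders):

import subprocess

def mirror(url, domain):
    # Same flags as the wget command quoted in the question.
    subprocess.run([
        "wget", "--page-requisites", "--domains=" + domain, "--no-parent",
        "--html-extension", "--convert-links", "--restrict-file-names=windows",
        url,
    ], check=True)

mirror("http://example.com/some/page.html", "example.com")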
My colleague wrote this code, much of it pieced together from other sources, I believe. It might have some quirks specific to our system, but it should help anyone wanting to do the same.
"""
Downloads all links from a specified location and saves to machine.
Downloaded links will only be of a lower level then links specified.
To use: python downloader.py link
"""
import sys, re, os, urllib2, urllib, urlparse

tocrawl = set([sys.argv[1]])
# linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?')
linkregex = re.compile('href=[\'|"](.*?)[\'"].*?')
linksrc = re.compile('src=[\'|"](.*?)[\'"].*?')

def main():
    link_list = []  ## create a list of all found links so there are no duplicates
    restrict = sys.argv[1]  ## used to restrict found links to only have lower level
    link_list.append(restrict)
    parent_folder = restrict.rfind('/', 0, len(restrict)-1)
    ## a.com/b/c/d/ make /d/ as parent folder
    while 1:
        try:
            crawling = tocrawl.pop()
            #print crawling
        except KeyError:
            break
        url = urlparse.urlparse(crawling)  ## splits url into sections
        try:
            response = urllib2.urlopen(crawling)  ## try to open the url
        except:
            continue
        msg = response.read()  ## save source of url
        links = linkregex.findall(msg)  ## search for all href in source
        links = links + linksrc.findall(msg)  ## search for all src in source
        for link in (links.pop(0) for _ in xrange(len(links))):
            if link.startswith('/'):
                ## if /xxx  a.com/b/c/ -> a.com/b/c/xxx
                link = 'http://' + url[1] + link
            elif ~link.find('#'):
                continue
            elif link.startswith('../'):
                if link.find('../../'):  ## only use links that are max 1 level above reference
                    ## if ../xxx.html  a.com/b/c/d.html -> a.com/b/xxx.html
                    parent_pos = url[2].rfind('/')
                    parent_pos = url[2].rfind('/', 0, parent_pos-2) + 1
                    parent_url = url[2][:parent_pos]
                    new_link = link.find('/') + 1
                    link = link[new_link:]
                    link = 'http://' + url[1] + parent_url + link
                else:
                    continue
            elif not link.startswith('http'):
                if url[2].find('.html'):
                    ## if xxx.html  a.com/b/c/d.html -> a.com/b/c/xxx.html
                    a = url[2].rfind('/') + 1
                    parent = url[2][:a]
                    link = 'http://' + url[1] + parent + link
                else:
                    ## if xxx.html  a.com/b/c/ -> a.com/b/c/xxx.html
                    link = 'http://' + url[1] + url[2] + link
            if link not in link_list:
                link_list.append(link)  ## add link to list of already found links
                if (~link.find(restrict)):
                    ## only grab links which are below input site
                    print link  ## print downloaded link
                    tocrawl.add(link)  ## add link to pending view links
                    file_name = link[parent_folder+1:]  ## folder structure for files to be saved
                    filename = file_name.rfind('/')
                    folder = file_name[:filename]  ## creates folder names
                    folder = os.path.abspath(folder)  ## creates folder path
                    if not os.path.exists(folder):
                        os.makedirs(folder)  ## make folder if it does not exist
                    try:
                        urllib.urlretrieve(link, file_name)  ## download the link
                    except:
                        print "could not download %s" % link
                else:
                    continue

if __name__ == "__main__":
    main()
Thanks for the replies.