I'm building a script to scan a website, capture its URLs, and test whether each one is working or not. The problem is that the script only looks at the URLs on the website's home page and leaves the others aside. How do I capture all pages linked from the site?
My code is below:
import urllib
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
page = urllib.request.urlopen("http://www.google.com/")
soup = BeautifulSoup(page.read(), features='lxml')
links = soup.findAll("a", attrs={'href': re.compile('^(http://)')})
for link in links:
    result = (link["href"])
    req = Request(result)
    try:
        response = urlopen(req)
        pass
    except HTTPError as e:
        if e.code != 200:
            # Stop, Error!
            with open("Document_ERROR.txt", 'a') as archive:
                archive.write(result)
                archive.write('\n')
                archive.write('{} \n'.format(e.reason))
                archive.write('{}'.format(e.code))
                archive.close()
        else:
            # Enjoy!
            with open("Document_OK.txt", 'a') as archive:
                archive.write(result)
                archive.write('\n')
                archive.close()
The main reason this doesn't work is that you put both the OK and ERROR-writes inside the except-block.
This means that only urls that actually raise an exception will be stored.
In general, my advice would be to spray some print statements into the different stages of the script - or use an IDE that allows you to step through the code at runtime, line by line. That makes stuff like this much easier to debug.
PyCharm is free and allows you to do so. Give that a try.
So - I haven't worked with urllib but use requests a lot (python -m pip install requests). A quick refactor using that would look something like below:
import requests
from bs4 import BeautifulSoup
import re
url = "http://www.google.com"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "lxml")
links = soup.find_all("a", attrs={'href': re.compile('^(http://)')})
for link in links:
    href = link["href"]
    print("Testing for URL {}".format(href))
    try:
        # since you only want to check the status code, there is no need to pull
        # the entire html of the site - use HEAD instead of GET
        r = requests.head(href)
        status = r.status_code
        # 404 etc. will not raise an exception here
        error = None
    except Exception as e:
        # these exceptions will not have a status_code
        status = None
        error = e

    # store the finding in your files
    if status is None or status != 200:
        print("URL is broken. Writing to ERROR_Doc")
        # do your storing here of href, status and error
    else:
        print("URL is live. Writing to OK_Doc")
        # do your storing here
Hope this makes sense.
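For what it's worth, one way to fill in the "do your storing here" placeholders could be a small helper like this (just a sketch, reusing the Document_OK.txt / Document_ERROR.txt file names from the original script):

def store_result(href, status, error):
    """Append the outcome for one URL to the OK or ERROR file."""
    if status == 200:
        with open("Document_OK.txt", "a") as archive:
            archive.write("{}\n".format(href))
    else:
        with open("Document_ERROR.txt", "a") as archive:
            archive.write("{} {} {}\n".format(href, status, error))

You would then call store_result(href, status, error) in both branches of the if/else above.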
Related
Currently I have a script that can only download the HTML of a given page.
Now I want to download all the files of the web page including HTML, CSS, JS and image files (same as we get with a ctrl-s of any website).
My current code is:
import urllib
url = "https://en.wikipedia.org/wiki/Python_%28programming_language%29"
urllib.urlretrieve(url, "t3.html")
I have looked at many questions, but they all only download the HTML.
The following implementation enables you to get the sub-HTML pages. It can be developed further to get the other files you need. I set the depth variable so you can choose the maximum number of sub-pages you want to parse.
import urllib2
from BeautifulSoup import *
from urlparse import urljoin
def crawl(pages, depth=None):
    indexed_url = []  # a list for the main and sub-HTML websites in the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except:
                    print "Could not open %s" % page
                    continue
                soup = BeautifulSoup(c.read())
                links = soup('a')  # finding all the sub_links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url

pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=2)
print urls
Python 3 version, 2019. May this save somebody some time:
#!/usr/bin/env python
import urllib.request as urllib2
from bs4 import *
from urllib.parse import urljoin
def crawl(pages, depth=None):
    indexed_url = []  # a list for the main and sub-HTML websites in the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except:
                    print("Could not open %s" % page)
                    continue
                soup = BeautifulSoup(c.read())
                links = soup('a')  # finding all the sub_links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url

pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=1)
print(urls)
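If you also want to save each discovered page to disk (which is what the question ultimately asks for), a small Python 3 extension could look like the sketch below. The save_pages helper is hypothetical, not part of the answer above, and it only stores the raw HTML:

import os
import urllib.request
from urllib.parse import urlsplit

def save_pages(urls, folder="downloaded_pages"):
    """Download every URL returned by crawl() and write the raw HTML to disk."""
    os.makedirs(folder, exist_ok=True)
    for url in urls:
        # build a crude but filesystem-safe filename from the URL path
        name = urlsplit(url).path.strip("/").replace("/", "_") or "index"
        filepath = os.path.join(folder, name + ".html")
        try:
            with urllib.request.urlopen(url) as resp:
                data = resp.read()
        except Exception as exc:
            print("Could not download %s: %s" % (url, exc))
            continue
        with open(filepath, "wb") as fh:
            fh.write(data)

save_pages(urls)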
You can easily do that with the Python library pywebcopy.
For the current version (5.0.1):
from pywebcopy import save_webpage
url = 'http://some-site.com/some-page.html'
download_folder = '/path/to/downloads/'
kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}
save_webpage(url, download_folder, **kwargs)
You will have the HTML, CSS, and JS all in your download_folder, working just like the original site.
Using Python 3+, Requests, and other standard libraries.
The function savePage receives a requests.Response and the pagefilename to save it under.
It saves pagefilename.html in the current folder.
It downloads JavaScript, CSS, and images based on the script, link, and img tags, and saves them in a folder named pagefilename_files.
Any exceptions are printed on sys.stderr, and it returns a BeautifulSoup object.
The requests session must be a global variable unless someone writes cleaner code here for us.
You can adapt it to your needs.
import os, sys
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
def soupfindAllnSave(pagefolder, url, soup, tag2find='img', inner='src'):
    if not os.path.exists(pagefolder):  # create only once
        os.mkdir(pagefolder)
    for res in soup.findAll(tag2find):  # images, css, etc.
        try:
            filename = os.path.basename(res[inner])
            fileurl = urljoin(url, res.get(inner))
            # rename to saved file path
            # res[inner] # may or may not exist
            filepath = os.path.join(pagefolder, filename)
            res[inner] = os.path.join(os.path.basename(pagefolder), filename)
            if not os.path.isfile(filepath):  # was not downloaded
                with open(filepath, 'wb') as file:
                    filebin = session.get(fileurl)
                    file.write(filebin.content)
        except Exception as exc:
            print(exc, file=sys.stderr)
    return soup

def savePage(response, pagefilename='page'):
    url = response.url
    soup = BeautifulSoup(response.text)
    pagefolder = pagefilename + '_files'  # page contents
    soup = soupfindAllnSave(pagefolder, url, soup, 'img', inner='src')
    soup = soupfindAllnSave(pagefolder, url, soup, 'link', inner='href')
    soup = soupfindAllnSave(pagefolder, url, soup, 'script', inner='src')
    with open(pagefilename + '.html', 'w') as file:
        file.write(soup.prettify())
    return soup
Example: saving the Google page and its contents (into the google_files folder):
session = requests.Session()
#... whatever requests config you need here
response = session.get('https://www.google.com')
savePage(response, 'google')
Try the Python library Scrapy. You can program Scrapy to recursively scan a website by downloading its pages, scanning them, and following links:
An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
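As a rough idea of what that looks like, a minimal spider could be something like this (a sketch only; the spider name, the example.com start URL, and the output format are placeholders):

import scrapy

class SiteSpider(scrapy.Spider):
    name = "site"                             # hypothetical spider name
    start_urls = ["http://www.example.com/"]  # replace with the site you want to scan

    def parse(self, response):
        # record something about the current page, e.g. its URL and status
        yield {"url": response.url, "status": response.status}
        # follow every link on the page and parse it with the same callback
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Running it with scrapy runspider site_spider.py -o pages.json crawls the site and writes one record per page; Scrapy's built-in duplicate filter keeps it from revisiting the same URL.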
I was following this tutorial and the code worked perfectly.
Now, after doing some other projects, I went back and wanted to re-run the same code. Suddenly I was getting an error message that forced me to add features="html.parser" to the soup variable.
So I did, but now when I run the code, literally nothing happens. Why is that, and what am I doing wrong?
I checked whether I might have uninstalled the beautifulsoup4 module, but no, it is still there. I re-typed the whole code from scratch, but nothing seems to work.
import requests
from bs4 import BeautifulSoup
def spider():
    url = "https://www.amazon.de/s?k=laptop+triton&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss"
    source = requests.get(url)
    plain_text = source.text
    soup = BeautifulSoup(plain_text, features="html.parser")
    for mylink in soup.findAll('img', {'class': 's-image'}):
        mysrc = mylink.get('src')
        print(mysrc)

spider()
Ideally I'd want the crawler to print about 10-20 lines of src = "..." of the amazon page in question. This code worked a couple hours ago...
The solution is to add headers={'User-Agent':'Mozilla/5.0'} to requests.get() (without it, Amazon doesn't send the correct page):
import requests
from bs4 import BeautifulSoup
def spider():
    url = "https://www.amazon.de/s?k=laptop+triton&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss"
    source = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    plain_text = source.text
    soup = BeautifulSoup(plain_text, features="html.parser")
    for mylink in soup.findAll('img', {'class': 's-image'}):
        mysrc = mylink.get('src')
        print(mysrc)

spider()
Prints:
https://m.media-amazon.com/images/I/71YPEDap2lL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81kXH-OA6tL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81kXH-OA6tL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81fyVgZuQxL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81kXH-OA6tL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81kXH-OA6tL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71VmlANJMOL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71rAT5E7DfL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71cEKKNfb3L._AC_UL436_.jpg
https://m.media-amazon.com/images/I/61aWXuLIEBL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71B7NyjuU9L._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81s822PQUcL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71fBKuAiQzL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71hXTUR-oRL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81-Lf6jX-OL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/81B85jUARqL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/8140E7+uhZL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/8140E7+uhZL._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71ROCddvJ2L._AC_UL436_.jpg
https://m.media-amazon.com/images/I/71ROCddvJ2L._AC_UL436_.jpg
https://m.media-amazon.com/images/I/41bB8HuoBYL._AC_UL436_.jpg
I have been playing with the cfscrape module, which allows you to bypass the Cloudflare captcha protection on sites. I have accessed the page's contents but can't seem to get my code to work; instead, the whole HTML is printed. I'm only trying to find keywords within <span class="availability">.
import urllib2
import cfscrape
from bs4 import BeautifulSoup
import requests
from lxml import etree
import smtplib
import urllib2, sys
scraper = cfscrape.CloudflareScraper()
url = "http://www.sneakersnstuff.com/en/product/25698/adidas-stan-smith-gtx"
req = scraper.get(url).content
try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print("hi")
    content = e.fp.read()

soup = BeautifulSoup(content, "lxml")
result = soup.find_all("span", {"class": "availability"})
I have omitted some irrelevant parts of code
try:
    page = urllib2.urlopen(req)
    content = page.read()
except urllib2.HTTPError, e:
    print("hi")
You should read the object returned by urlopen, which contains the HTML code,
and you should assign the content variable before the except.
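Note that scraper.get(url).content already is the HTML of the page, so another option is to skip urllib2 entirely and hand that content straight to BeautifulSoup. A minimal sketch of that idea:

import cfscrape
from bs4 import BeautifulSoup

scraper = cfscrape.CloudflareScraper()
url = "http://www.sneakersnstuff.com/en/product/25698/adidas-stan-smith-gtx"
html = scraper.get(url).content  # the page HTML, Cloudflare challenge already handled

soup = BeautifulSoup(html, "lxml")
for span in soup.find_all("span", {"class": "availability"}):
    print(span.get_text(strip=True))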
I want to download all the .xls or .xlsx or .csv from this website into a specified folder.
https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009
I have looked into mechanize, Beautiful Soup, urllib2, etc. Mechanize does not work in Python 3, and urllib2 also had problems with Python 3; I looked for a workaround but couldn't find one. So I am currently trying to make it work using Beautiful Soup.
I found some example code and attempted to modify it to suit my problem, as follows:
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html)
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    try:
        urlretrieve(href, filename)
    except:
        print('failed to download')
However, when run, this code does not extract the files from the target page, nor does it output any failure message (e.g. 'failed to download').
How can I use BeautifulSoup to select the Excel files from the page?
How can I download these files to a local file using Python?
The issues with your script as it stands are:
The url has a trailing / which gives an invalid page when requested, not listing the files you want to download.
The CSS selector in soup.select(...) is selecting div with the attribute webpartid which does not exist anywhere in that linked document.
You are joining the URL and quoting it, even though the links are given in the page as absolute URLs and they do not need quoting.
The try:...except: block is stopping you from seeing the errors generated when trying to download the file. Using an except block without a specific exception is bad practice and should be avoided.
A modified version of your code that will get the correct files and attempt to download them is as follows:
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin
# Remove the trailing / you had, as that gives a 404 page
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

# Select all A elements with href attributes containing URLs starting with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')

    # Make sure it has one of the correct extensions
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue

    filename = href.rsplit('/', 1)[-1]
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
However, if you run this you'll notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is thrown, even though the file is downloadable in the browser.
At first I thought this was a referral check (to prevent hotlinking), however if you watch the request in your browser (e.g. Chrome Developer Tools) you'll notice that
the initial http:// request is blocked there also, and then Chrome attempts an https:// request for the same file.
In other words, the request must go via HTTPS to work (despite what the URLs in the page say). To fix this you will need to rewrite http: to https: before using the URL for the request. The following code will correctly modify the URLs and download the files. I've also added a variable to specify the output folder, which is added to the filename using os.path.join:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve
URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = '' # path to output folder, '.' or '' uses current folder
u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])

    # We need a https:// URL for this site
    href = href.replace('http://', 'https://')

    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
I found this to be a good working example, using the BeautifulSoup4, requests, and wget modules for Python 2.7:
import requests
import wget
import os
from bs4 import BeautifulSoup, SoupStrainer
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
file_types = ['.xls', '.xlsx', '.csv']
for file_type in file_types:
    response = requests.get(url)
    for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            if file_type in link['href']:
                full_path = url + link['href']
                wget.download(full_path)
I tried the above code and it is still giving me urllib.error.HTTPError: HTTP Error 403: Forbidden.
I also tried adding user agents; my modified code:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import Request,urlopen, urlretrieve
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
URL = Request('https://www.rbi.org.in/scripts/bs_viewcontent.aspx?Id=2009', headers=headers)
#URL = 'https://www.rbi.org.in/scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = 'E:\python\out' # path to output folder, '.' or '' uses current folder
u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need a https:// URL for this site
    href = href.replace('http://', 'https://')
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
This worked best for me ... using python3
import os
import urllib
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve
from urllib.error import HTTPError
URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = '' # path to output folder, '.' or '' uses current folder
u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need a https:// URL for this site
    href = href.replace('http://', 'https://')
    try:
        print("Downloading %s to %s..." % (href, filename))
        urlretrieve(href, filename)
        print("Done.")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            continue
I am only interested in Python 3; this is a lab for school and we aren't using Python 2.
Tools
Python 3 and
Ubuntu
I want to first be able to download webpages of my choosing, e.g. www.example.com/index.html,
and save the index.html or whatever page I want.
Then do the equivalent of the following bash pipeline:
grep Href | cut -d"/" -f3 | sort -u
But do this in Python, not using grep, wget, cut, etc., but instead only using Python 3 commands.
Also, without using any Python scrapers such as Scrapy, and with no legacy Python commands, no urllib2.
So I was thinking, to start with:
import urllib.request
from urllib.error import HTTPError, URLError

try:
    o = urllib.request.urlopen(www.example.com/index.html)  # should I use http:// ?
    local_file = open(file_name, "w" + file_mode)
    # Then write to my local file
    local_file.write(o.read())
    local_file.close()
except HTTPError as e:
    print("HTTP Error:", e.code, url)
except URLError as e:
    print("URL Error:", e.reason, url)
But I still need to filter out the hrefs from the file and delete all the other stuff. How do I do that, and is the above code OK?
I thought urllib.request would be better than urlretrieve because it is faster, but if you think there is not much difference, maybe it's better to use urlretrieve?
There is a Python package called BeautifulSoup which does what you need.
import urllib2
from bs4 import BeautifulSoup
html_doc = urllib2.urlopen('http://www.google.com')
soup = BeautifulSoup(html_doc)
for link in soup.find_all('a'):
    print(link.get('href'))
Hope that helps.
My try. Kept it as simple as I could (thus, not very efficient for an actual crawler)
from urllib import request

# Fetch the page from the url we want and decode it so it can be searched as a string
page = request.urlopen('http://www.google.com').read().decode('utf-8')
# print(page)

def get_link(page):
    """ Get the first link from the page provided """
    tag = 'href="'
    tag_end = '"'
    tag_start = page.find(tag)
    if not tag_start == -1:
        tag_start += len(tag)
    else:
        return None, page
    tag_end = page.find(tag_end, tag_start)
    link = page[tag_start:tag_end]
    page = page[tag_end:]
    return link, page

def get_all_links(page):
    """ Get all the links from the page provided by calling get_link """
    all_links = []
    while True:
        link, page = get_link(page)
        if link:
            all_links.append(link)
        else:
            break
    return all_links

print(get_all_links(page))
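To mirror the cut -d"/" -f3 | sort -u part of the pipeline with the standard library only, you could feed the links from get_all_links through urllib.parse.urlsplit and a set (a sketch building on the code above):

from urllib.parse import urlsplit

def unique_hosts(links):
    """Emulate cut -d"/" -f3 | sort -u: unique, sorted hostnames from absolute links."""
    hosts = {urlsplit(link).netloc for link in links if link.startswith("http")}
    return sorted(hosts)

print(unique_hosts(get_all_links(page)))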
Here you can read a post where everything you need is explained:
- Download a web page
- Using BeautifulSoup
- Extract something you need
Extract HTML code from URL in Python and C# - http://www.manejandodatos.es/2014/1/extract-html-code-url-python-c