Download .xls files from a webpage using Python and BeautifulSoup
I want to download all the .xls, .xlsx and .csv files from this website into a specified folder.
https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009
I have looked into mechanize, Beautiful Soup, urllib2 etc. Mechanize does not work in Python 3, and urllib2 also had problems with Python 3. I looked for a workaround but couldn't find one, so I am currently trying to make it work using Beautiful Soup.
I found some example code and attempted to modify it to suit my problem, as follows:
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html)
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    try:
        urlretrieve(href, filename)
    except:
        print('failed to download')
However, when run, this code neither downloads the files from the target page nor prints any failure message (e.g. 'failed to download').
How can I use BeautifulSoup to select the Excel files from the page?
How can I download these files to a local file using Python?
The issues with your script as it stands are:
The URL has a trailing / which gives an invalid page when requested, one that does not list the files you want to download.
The CSS selector in soup.select(...) selects div elements with the attribute webpartid, which does not exist anywhere in the linked document.
You are joining the URL and quoting it, even though the links are given in the page as absolute URLs and do not need quoting.
The try:...except: block stops you from seeing the errors raised when trying to download the file. Using an except block without a specific exception is bad practice and should be avoided.
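The point about joining and quoting can be checked without touching the network. The sketch below uses a made-up absolute link (http://example.com/files/data.xls is an assumption, not a link taken from the page) to show that quote percent-encodes the colon after the scheme, so urljoin then treats the result as a relative path, whereas an absolute URL passed to urljoin directly comes back unchanged:

```python
from urllib.parse import quote, urljoin

base = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
# Hypothetical absolute link, standing in for the ones on the page
link = 'http://example.com/files/data.xls'

# urljoin returns an absolute URL untouched
unchanged = urljoin(base, link)

# quote() encodes ':' (only '/' is safe by default), destroying the scheme
quoted = quote(link)

# With no scheme left, urljoin resolves the quoted string relative to base
mangled = urljoin(base, quoted)

print(unchanged)
print(quoted)
print(mangled)
```

The last URL points somewhere under /Scripts/ on the original host, which is not where the file lives.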
A modified version of your code that will get the correct files and attempt to download them is as follows:
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

# Remove the trailing / you had, as that gives a 404 page
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
# Select all A elements with href attributes containing URLs starting with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    # Make sure it has one of the correct extensions
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue
    filename = href.rsplit('/', 1)[-1]
    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")
However, if you run this you'll notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is thrown, even though the file is downloadable in the browser.
At first I thought this was a referrer check (to prevent hotlinking), however if you watch the request in your browser (e.g. Chrome Developer Tools) you'll notice that the initial http:// request is blocked there also, and then Chrome attempts an https:// request for the same file.
In other words, the request must go via HTTPS to work (despite what the URLs in the page say). To fix this you will need to rewrite the http: to https: before using the URL for the request. The following code will correctly modify the URLs and download the files. I've also added a variable to specify the output folder, which is added to the filename using os.path.join:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need a https:// URL for this site
    href = href.replace('http://','https://')
    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")
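One caveat with OUTPUT_DIR: urlretrieve writes to the path it is given and does not create missing folders, so pointing it at a folder that does not exist raises an error. A minimal sketch of a helper that prepares the target path, assuming it is acceptable to create the folder on demand (local_path is a hypothetical name, not part of the answer above):

```python
import os
import tempfile

def local_path(output_dir, href):
    """Return the file path inside output_dir for the file named in href.

    Creates output_dir if needed. An empty output_dir means the
    current folder, matching the script above.
    """
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
    return os.path.join(output_dir, href.rsplit('/', 1)[-1])

# Demonstration in a throwaway folder with a made-up link
demo_dir = os.path.join(tempfile.mkdtemp(), 'out')
demo_path = local_path(demo_dir, 'https://example.com/docs/table1.xls')
print(demo_path)
```

In the loop, filename = local_path(OUTPUT_DIR, href) could then stand in for the os.path.join line.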
I found this to be a good working example, using the BeautifulSoup4, requests, and wget modules for Python 2.7:
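import requests
import wget
from bs4 import BeautifulSoup, SoupStrainer

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
file_types = ('.xls', '.xlsx', '.csv')

# Fetch the page once and parse only the <a> tags
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a'))
for link in soup.find_all('a', href=True):
    if link['href'].lower().endswith(file_types):
        # urljoin leaves absolute links untouched and resolves relative ones
        full_path = urljoin(url, link['href'])
        wget.download(full_path)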
I tried the above code, but it still gives me urllib.error.HTTPError: HTTP Error 403: Forbidden.
I also tried adding a user agent. My modified code:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import Request, urlopen, urlretrieve

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
URL = Request('https://www.rbi.org.in/scripts/bs_viewcontent.aspx?Id=2009', headers=headers)
#URL = 'https://www.rbi.org.in/scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = r'E:\python\out'  # path to output folder, '.' or '' uses current folder (raw string keeps the backslashes literal)

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need a https:// URL for this site
    href = href.replace('http://','https://')
    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")
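One thing to note about this script: the User-Agent header is attached only to the Request for the listing page. urlretrieve(href, filename) is called with a plain URL string, so the file downloads still go out with urllib's default headers. Below is a sketch of a download helper that sends the same headers with every request. It assumes the 403 is caused by the missing header, which has not been verified against this site; fetch_to_file and build_request are hypothetical names:

```python
import shutil
from urllib.request import Request, urlopen

# Any browser-like value; which value the site accepts is an assumption
HEADERS = {'User-Agent': 'Mozilla/5.0'}

def build_request(url, headers=HEADERS):
    """Wrap url in a Request carrying the custom headers."""
    return Request(url, headers=headers)

def fetch_to_file(url, filename, headers=HEADERS):
    """Download url to filename, sending headers with the request."""
    with urlopen(build_request(url, headers)) as response, \
            open(filename, 'wb') as out:
        # Stream the body to disk instead of reading it all into memory
        shutil.copyfileobj(response, out)
```

In the loop, fetch_to_file(href, filename) would take the place of urlretrieve(href, filename).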
This worked best for me ... using python3
import os
import urllib
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve
from urllib.error import HTTPError

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need a https:// URL for this site
    href = href.replace('http://','https://')
    try:
        print("Downloading %s to %s..." % (href, filename) )
        urlretrieve(href, filename)
        print("Done.")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            continue
Related
How to download an HTML file completely? [duplicate]
Currently I have a script that can only download the HTML of a given page. Now I want to download all the files of the web page including HTML, CSS, JS and image files (same as we get with a ctrl-s of any website). My current code is:
import urllib
url = "https://en.wikipedia.org/wiki/Python_%28programming_language%29"
urllib.urlretrieve(url, "t3.html")
I visited many questions but they are all only downloading the HTML.
The following implementation enables you to get the sub-HTML websites. It can be developed further in order to get the other files you need. I added the depth variable so you can set the maximum number of sub-website levels you want to parse.
import urllib2
from BeautifulSoup import *
from urlparse import urljoin

def crawl(pages, depth=None):
    indexed_url = [] # a list for the main and sub-HTML websites in the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except:
                    print "Could not open %s" % page
                    continue
                soup = BeautifulSoup(c.read())
                links = soup('a') #finding all the sub_links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url

pagelist=["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=2)
print urls
Python 3 version, 2019. May this save somebody some time:
#!/usr/bin/env python
import urllib.request as urllib2
from bs4 import *
from urllib.parse import urljoin

def crawl(pages, depth=None):
    indexed_url = [] # a list for the main and sub-HTML websites in the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except:
                    print( "Could not open %s" % page)
                    continue
                soup = BeautifulSoup(c.read())
                links = soup('a') #finding all the sub_links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url

pagelist=["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=1)
print( urls )
You can easily do that with the simple Python library pywebcopy. For the current version, 5.0.1:
from pywebcopy import save_webpage

url = 'http://some-site.com/some-page.html'
download_folder = '/path/to/downloads/'
kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}
save_webpage(url, download_folder, **kwargs)
You will have the HTML, CSS and JS all in your download_folder, working like the original site.
Using Python 3+, Requests and other standard libraries.
The function savePage receives a requests.Response and the pagefilename where to save it. It saves pagefilename.html in the current folder. It downloads JavaScript, CSS and images based on the tags script, link and img, and saves them in a folder pagefilename_files. Any exceptions are printed on sys.stderr. It returns a BeautifulSoup object.
The Requests session must be a global variable, unless someone writes cleaner code here for us. You can adapt it to your needs.
import os, sys
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def soupfindAllnSave(pagefolder, url, soup, tag2find='img', inner='src'):
    if not os.path.exists(pagefolder): # create only once
        os.mkdir(pagefolder)
    for res in soup.findAll(tag2find):   # images, css, etc..
        try:
            filename = os.path.basename(res[inner])
            fileurl = urljoin(url, res.get(inner))
            # rename to saved file path
            # res[inner] # may or may not exist
            filepath = os.path.join(pagefolder, filename)
            res[inner] = os.path.join(os.path.basename(pagefolder), filename)
            if not os.path.isfile(filepath): # was not downloaded
                with open(filepath, 'wb') as file:
                    filebin = session.get(fileurl)
                    file.write(filebin.content)
        except Exception as exc:
            print(exc, file=sys.stderr)
    return soup

def savePage(response, pagefilename='page'):
    url = response.url
    soup = BeautifulSoup(response.text)
    pagefolder = pagefilename+'_files' # page contents
    soup = soupfindAllnSave(pagefolder, url, soup, 'img', inner='src')
    soup = soupfindAllnSave(pagefolder, url, soup, 'link', inner='href')
    soup = soupfindAllnSave(pagefolder, url, soup, 'script', inner='src')
    with open(pagefilename+'.html', 'w') as file:
        file.write(soup.prettify())
    return soup
Example saving the Google page and its contents (google_files folder):
session = requests.Session()
#... whatever requests config you need here
response = session.get('https://www.google.com')
savePage(response, 'google')
Try the Python library Scrapy. You can program Scrapy to recursively scan a website by downloading its pages, scanning, following links: An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
My script does not search all links, what to do?
I'm building a script to scan a website and capture URLs and test whether it's working or not. The problem is that the script is looking for just the URLs of the website's home page and leaving others aside. How do I capture all pages linked to the site? My code is below:
import urllib
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

page = urllib.request.urlopen("http://www.google.com/")
soup = BeautifulSoup(page.read(), features='lxml')
links = soup.findAll("a", attrs={'href': re.compile('^(http://)')})
for link in links:
    result = (link["href"])
    req = Request(result)
    try:
        response = urlopen(req)
        pass
    except HTTPError as e:
        if e.code != 200:
            # Stop, Error!
            with open("Document_ERROR.txt", 'a') as archive:
                archive.write(result)
                archive.write('\n')
                archive.write('{} \n'.format(e.reason))
                archive.write('{}'.format(e.code))
                archive.close()
        else:
            # Enjoy!
            with open("Document_OK.txt", 'a') as archive:
                archive.write(result)
                archive.write('\n')
                archive.close()
The main reason this doesn't work is that you put both the OK and ERROR writes inside the except block. This means that only URLs that actually raise an exception will be stored.
In general my advice would be to sprinkle some print statements into the different stages of the script, or use an IDE that allows you to step through the code at runtime, line by line. That makes stuff like this so much easier to debug. PyCharm is free and allows you to do so. Give that a try.
So, I haven't worked with urllib but use requests a lot (python -m pip install requests). A quick refactor using that would look something like below:
import requests
from bs4 import BeautifulSoup
import re

url = "http://www.google.com"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "lxml")
links = soup.find_all("a", attrs={'href': re.compile('^(http://)')})
for link in links:
    href = link["href"]
    print("Testing for URL {}".format(href))
    try:
        # since you only want to scan for status code, no need to pull the entire html of the site - use HEAD instead of GET
        r = requests.head(href)
        status = r.status_code
        # 404 etc will not yield an error
        error = None
    except Exception as e:
        # these exceptions will not have a status_code
        status = None
        error = e
    # store the finding in your files
    if status is None or status != 200:
        print("URL is broken. Writing to ERROR_Doc")
        # do your storing here of href, status and error
    else:
        print("URL is live. Writing to OK_Doc")
        # do your storing here
Hope this makes sense.
AttributeError: 'NoneType' object has no attribute 'group' with BeautifulSoup4
Hello community, I have a problem and I don't know how to solve it. I wrote a script to crawl webpages for images with BeautifulSoup4, but I get the error (AttributeError: 'NoneType' object has no attribute 'group'):
import re
import requests
from bs4 import BeautifulSoup

site = 'https://www.fotocommunity.de/natur/wolken/3144?sort=new'
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img', {"src": True})
urls = [img["src"] for img in img_tags]
for url in urls:
    filename = re.search(r'([\w_-]+[.](jpg|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative
            # if it is provide the base url which also happens
            # to be the site variable atm.
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
Your regex is wrong. Use Python's built-in urllib to do the heavy lifting instead of writing regexes if you're not familiar with them. Use something like this (untested):
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlsplit  # import this additional library
from os.path import basename  # import this additional library

site = 'https://www.fotocommunity.de/natur/wolken/3144?sort=new'
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
images_div = soup.find(id=re.compile(r"fcx-gallery-\w+"))  # focus on the div containing the images
if images_div:  # test if the gallery div was found
    img_tags = images_div.find_all('img', {"data-src": True})  # get all the images in that div
    urls = [img["data-src"] for img in img_tags]  # grab sources from data-src
    for url in urls:
        filename = basename(urlsplit(url).path)  # use this instead of a regex
        with open(filename, 'wb') as f:  # filename is now a string
            if 'http' not in url:
                # sometimes an image source can be relative
                # if it is provide the base url which also happens
                # to be the site variable atm.
                url = '{}{}'.format(site, url)
            response = requests.get(url)
            f.write(response.content)
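To see why the script in the question raised the error: re.search returns None when the pattern does not match, which happens for any src that does not end in .jpg or .png (for example, one followed by a query string), and None has no group method. The sketch below contrasts the regex with the urlsplit approach on made-up URLs (both example.com links are assumptions for illustration, not taken from the site):

```python
import re
from os.path import basename
from urllib.parse import urlsplit

# The pattern from the question
pattern = re.compile(r'([\w_-]+[.](jpg|png))$')

plain = 'https://example.com/img/cloud_01.jpg'
with_query = 'https://example.com/img/cloud_02.jpg?v=3'

# The regex only matches when the URL ends with the extension
match_plain = pattern.search(plain)
match_query = pattern.search(with_query)  # None: '?v=3' follows the extension

# urlsplit separates the path from the query, so basename sees only the file
name_plain = basename(urlsplit(plain).path)
name_query = basename(urlsplit(with_query).path)

print(match_plain.group(1), match_query, name_plain, name_query)
```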
Downloading an .xls file results in a "urllib.error.HTTPError: HTTP Error 404: Not Found" error
I'm trying to use BeautifulSoup to scrape .xls tables which are available for download from Xcel Energy's website (https://www.xcelenergy.com/working_with_us/municipalities/community_energy_reports). This function gets the URL links of the tables and attempts to download them:
url = 'https://www.xcelenergy.com/working_with_us/municipalities/community_energy_reports'
dir = 'C:/Users/aobrien/PycharmProjects/xceldatascraper/'

def scraper(page):
    from bs4 import BeautifulSoup as bs
    import urllib.request
    import requests
    import os
    import re

    tld = r'https://www.xcelenergy.com'
    pageobj = requests.get(page, verify=False)
    sp = bs(pageobj.content, 'html.parser')
    xlst, fnms = [], []
    links = [a['href'] for a in sp.find_all('a', attrs={'href': re.compile("/staticfiles/")})]
    for idx, a in enumerate(links):
        if a.endswith('.xls'):
            furl = tld + str(a)
            xlst.append(furl)
            fnms.append(a.split('/')[4])
    naur = zip(fnms, xlst)
    if not os.path.exists(dir + 'tables'):
        os.makedirs(dir + 'tables')
    for name, url in naur:
        print(url)
        res = urllib.request.urlopen(url)
        xls = open(dir + 'tables/' + name, 'wb')
        xls.write(res.read())
        xls.close()

scraper(url)
The script fails when urllib.request.urlopen(url) attempts to access the file, returning "urllib.error.HTTPError: HTTP Error 404: Not Found". The print(url) statement prints the URL that I had the script construct (https://www.xcelenergy.com/staticfiles/xe-responsive/Working With Us/MI-City-Forest-Lake-2016.xls), and manually pasting that URL into a browser downloads the file just fine. What am I missing?
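A likely cause, going by the printed URL: the path contains literal spaces (Working With Us). A browser percent-encodes them before sending the request, but urllib.request.urlopen does not. Below is a sketch of a helper that encodes only the path component, assuming the server expects %20 for each space (encode_path is a hypothetical name):

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_path(url):
    """Percent-encode unsafe characters in the path part of url.

    '%' is kept as safe so an already encoded URL is not encoded twice.
    """
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=quote(parts.path, safe='/%')))

raw = ('https://www.xcelenergy.com/staticfiles/xe-responsive/'
       'Working With Us/MI-City-Forest-Lake-2016.xls')
print(encode_path(raw))
```

In the function above, urllib.request.urlopen(encode_path(url)) would then request the encoded form.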
retrieve specific links from web page using python and BeautifulSoup
I have been trying to retrieve an href link from a page and use it as a variable for the next href link. But I am stuck at a point where I have multiple href links with different file extensions (like zip, md5 etc.) and only need the file with the zip extension. Here is the code I am trying to implement:
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://example.com')
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        if '/abc' in link['href']:
            basename = link['href'].split("/")[11]
            print basename
status, response = http.request('http://example.com/%basename',basename)
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        if '/abc' in link['href']:
            basename = link['href'].split("/")[11]
            print basename
Try this:
import os
# YOUR CODE here
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        if '/abc' in link['href']:
            basename = link['href'].split("/")[11]
            # check file extension (splitext keeps the leading dot)
            filename, file_extension = os.path.splitext(basename)
            print basename, file_extension
            # skip everything that is not a zip file
            if file_extension.lower() != '.zip':
                continue
            # YOUR LAST CODE
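A detail worth knowing here is that os.path.splitext returns the extension with its leading dot ('.zip', not 'zip'), which matters for the comparison. A small sketch with made-up file names (Python 3 syntax, unlike the Python 2 code above; is_zip is a hypothetical helper):

```python
import os

# Made-up names covering the cases in the question (zip, md5, no extension)
names = ['build-1.2.zip', 'build-1.2.zip.md5', 'README', 'Build.ZIP']

def is_zip(name):
    """True if name ends in a .zip extension, ignoring case."""
    return os.path.splitext(name)[1].lower() == '.zip'

zip_names = [n for n in names if is_zip(n)]
print(zip_names)
```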