Using a webpage (with its necessary files) offline in Python

Could you help me with the script below, please?
I am trying to store a webpage, together with its files, offline so that I can open it without an internet connection using Python. I can copy the source code of the target page into a file and save it as "example.html" on my local PC, but my code is not saving the supporting files: whenever I open the saved HTML file, it loads the necessary files from the internet, and it takes too much time to load (very slow).
This is part of the original code:
import urllib2, codecs

def download(site):
    response = urllib2.urlopen(site)
    html_input = response.read()
    return html_input

def derealitivise(site, html_input):
    active_html = html_input.replace('<head>', '<head> <base href='+site+'>')
    return active_html

def main(site, output_filename):
    #try:
    site_url = site.encode("utf-8")
    html_input = download(site_url)
    #active_html = derealitivise(site_url, active_html)
    header = "<head> <base href="+site_url+">"
    active_html = html_input.replace('<head>', header)
    #output_file = open (output_filename, "w")
    #output_file = codecs.open(output_filename, "wb", "utf-8-sig")
    #output_file.write(active_html.encode("utf-8"))
    #output_file.close()
    with open(output_filename, 'w') as fid:
        fid.write(active_html)
    return "OK"

Related

python download folder of text files

The goal is to download GTFS data through Python web scraping, starting with https://transitfeeds.com/p/agence-metropolitaine-de-transport/129/latest/download
Currently, I'm using requests like so:
import requests

def download(url):
    fpath = "prov/city/GTFS"
    r = requests.get(url)
    if r.ok:
        print("Saving file.")
        open(fpath, "wb").write(r.content)
    else:
        print("Download failed.")
Unfortunately, the result of requests.content for the above URL renders as the following:
You can see the files of interest within the output (e.g. stops.txt) but how might I access them to read/write?
I fear you're trying to read a zip file with a text editor; perhaps you should try using the zipfile module.
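For instance (a sketch of that suggestion, not code from the original answers), the archive can even be read in memory without saving it to disk first:

import io
import zipfile

import requests

url = "https://transitfeeds.com/p/agence-metropolitaine-de-transport/129/latest/download"
r = requests.get(url)
# Wrap the raw bytes so zipfile can treat them as a seekable file.
with zipfile.ZipFile(io.BytesIO(r.content)) as archive:
    print(archive.namelist())          # e.g. shows stops.txt among the members
    stops = archive.read("stops.txt")  # bytes of a single member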
The following worked:
import requests

headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; defined elsewhere in the original post

def download(url):
    fpath = "path/to/output/"
    f = requests.get(url, stream=True, headers=headers)
    if f.ok:
        print("Saving to {}".format(fpath))
        g = open(fpath + 'output.zip', 'wb')
        g.write(f.content)
        g.close()
    else:
        print("Download failed with error code: ", f.status_code)
You need to write the response out as a zip file first.
import requests
url = "https://transitfeeds.com/p/agence-metropolitaine-de-transport/129/latest/download"
fname = "gtfs.zip"
r = requests.get(url)
open(fname, "wb").write(r.content)
Now fname exists and has several text files inside. If you want to programmatically extract the zip and then read the content of a file, for example stops.txt, you can either extract that single file or simply extractall.
import zipfile
# this will extract only a single file, and
# raise a KeyError if the file is missing from the archive
zipfile.ZipFile(fname).extract("stops.txt")
# this will extract all the files found from the archive,
# overwriting files in the process
zipfile.ZipFile(fname).extractall()
Now you just need to work with your file(s).
thefile = "stops.txt"
# just plain text
text = open(thefile).read()
# csv file
import csv
reader = csv.reader(open(thefile))
for row in reader:
...

Save file received from POST request in Python

I'm trying to implement an upload feature to the basic http.server Python module.
So far, I've created a new class named SimpleHTTPRequestHandlerWithUpload which inherits from SimpleHTTPRequestHandler and added an upload section to list_directory(). The next step would be creating a do_POST() method, which handles the request and saves the file inside the current working directory. However, I have no idea how to do this. I looked at UniIsland's code on GitHub but I can't understand what he did and the code is very old. I also read this question and tried to implement it in my code.
It kind of works, but the saved file is "littered" with the request headers. This does not pose a big problem for txt files, but it corrupts files of every other type.
I'd like to know how to remove the headers, save the uploaded file inside the current working directory with its original name and check if the upload was successful or not.
This is my code:
__version__ = '0.1'

import http.server
import html
import io
import os
import socket  # For gethostbyaddr()
import sys
import urllib.parse
import contextlib
from http import HTTPStatus


class SimpleHTTPRequestHandlerWithUpload(http.server.SimpleHTTPRequestHandler):
    server_version = 'SimpleHTTPWithUpload/' + __version__

    def do_POST(self):
        """Serve a POST request."""
        data = self.rfile.read(int(self.headers['content-length']))
        with open('file.txt', 'wb') as file:
            file.write(data)

        r = []
        enc = sys.getfilesystemencoding()
        r.append('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">')
        r.append("<html>\n<title>Upload Result Page</title>\n")
        r.append("<body>\n<h2>Upload Result Page</h2>\n")
        r.append("</body>\n</html>")
        encoded = '\n'.join(r).encode(enc, 'surrogateescape')
        f = io.BytesIO()
        f.write(encoded)
        f.seek(0)
        self.send_response(HTTPStatus.OK)
        self.send_header("Content-type", "text/html")
        self.send_header("Content-Length", str(len(encoded)))
        self.end_headers()
        if f:
            self.copyfile(f, self.wfile)
            f.close()

    def list_directory(self, path):
        """Helper to produce a directory listing (absent index.html).

        Return value is either a file object, or None (indicating an
        error). In either case, the headers are sent, making the
        interface the same as for send_head().
        """
        try:
            list = os.listdir(path)
        except OSError:
            self.send_error(
                HTTPStatus.NOT_FOUND,
                'No permission to list directory')
            return None
        list.sort(key=lambda a: a.lower())
        r = []
        try:
            displaypath = urllib.parse.unquote(self.path,
                                               errors='surrogatepass')
        except UnicodeDecodeError:
            displaypath = urllib.parse.unquote(path)
        displaypath = html.escape(displaypath, quote=False)
        enc = sys.getfilesystemencoding()
        title = 'Directory listing for %s' % displaypath
        r.append('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
                 '"http://www.w3.org/TR/html4/strict.dtd">')
        r.append('<html>\n<head>')
        r.append('<meta http-equiv="Content-Type" '
                 'content="text/html; charset=%s">' % enc)
        r.append('<title>%s</title>\n</head>' % title)
        r.append('<body>\n<h1>%s</h1>' % title)
        r.append('<hr>\n<ul>')
        for name in list:
            fullname = os.path.join(path, name)
            displayname = linkname = name
            # Append / for directories or # for symbolic links
            if os.path.isdir(fullname):
                displayname = name + '/'
                linkname = name + '/'
            if os.path.islink(fullname):
                displayname = name + '#'
                # Note: a link to a directory displays with # and links with /
            r.append('<li><a href="%s">%s</a></li>'
                     % (urllib.parse.quote(linkname, errors='surrogatepass'),
                        html.escape(displayname, quote=False)))
        r.append('</ul>\n<hr>\n')
        r.append('<form id="upload" enctype="multipart/form-data" method="post" action="#">\n'
                 '<input id="fileupload" name="file" type="file" />\n'
                 '<input type="submit" value="Submit" id="submit" />\n'
                 '</form>')
        r.append('\n<hr>\n</body>\n</html>\n')
        encoded = '\n'.join(r).encode(enc, 'surrogateescape')
        f = io.BytesIO()
        f.write(encoded)
        f.seek(0)
        self.send_response(HTTPStatus.OK)
        self.send_header('Content-type', 'text/html; charset=%s' % enc)
        self.send_header('Content-Length', str(len(encoded)))
        self.end_headers()
        return f


if __name__ == '__main__':
    class DualStackServer(http.server.ThreadingHTTPServer):
        def server_bind(self):
            # suppress exception when protocol is IPv4
            with contextlib.suppress(Exception):
                self.socket.setsockopt(
                    socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
            return super().server_bind()

    http.server.test(
        HandlerClass=SimpleHTTPRequestHandlerWithUpload,
        ServerClass=DualStackServer
    )
If you want to test it, just run the script on your machine, open a web browser on a different machine and type in the address bar <IP_ADDRESS_1>:8000 where IP_ADDRESS_1 is the IP of the machine you're running the code on.
Please, tell me if there's something wrong with it other than the do_POST() method. I'm a new Python programmer and I'm trying to improve my software design skills in general. Thank you!
EDIT: I figured out how to remove the headers and save the file with its original name. However, the script hangs on data = self.rfile.readlines() until I close the browser tab, and only then finishes normally. I don't know what to do: it seems I would have to send some sort of EOF to tell readlines() that I'm finished sending the file, but I have no clue how. I also can't figure out how to check whether the upload succeeded or not. Any help is appreciated!
Updated do_POST() method:
import re  # needed to pull the filename out of the multipart headers

def do_POST(self):
    """Serve a POST request."""
    data = self.rfile.readlines()
    filename = re.findall(r'Content-Disposition.*name="file"; filename="(.*)"', str(data[1]))
    if len(filename) == 1:
        filename = ''.join(filename)
    else:
        return
    data = data[4:-2]
    data = b''.join(data)
    with open(filename, 'wb') as file:
        file.write(data)

    r = []
    enc = sys.getfilesystemencoding()
    r.append('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
             '"http://www.w3.org/TR/html4/strict.dtd">')
    r.append('<html>\n<title>Upload result page</title>\n')
    r.append('<body>\n<h2>Upload result page</h2>\n')
    r.append('</body>\n</html>')
    encoded = '\n'.join(r).encode(enc, 'surrogateescape')
    f = io.BytesIO()
    f.write(encoded)
    f.seek(0)
    self.send_response(HTTPStatus.OK)
    self.send_header('Content-type', 'text/html')
    self.send_header('Content-Length', str(len(encoded)))
    self.end_headers()
    if f:
        self.copyfile(f, self.wfile)
        f.close()
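For what it's worth, a likely fix for the hang described in the edit (a sketch of mine, not the poster's final GitHub code): rfile.readlines() blocks because the browser keeps the connection open after sending the body, so no EOF ever arrives; reading exactly Content-Length bytes returns as soon as the whole body is in:

def do_POST(self):
    """Serve a POST request without waiting for the client to disconnect."""
    length = int(self.headers['Content-Length'])
    body = self.rfile.read(length)  # returns once `length` bytes have arrived
    lines = body.split(b'\r\n')     # same layout that readlines() produced
    match = re.search(rb'name="file"; filename="(.*)"', lines[1])
    if not match:
        self.send_error(HTTPStatus.BAD_REQUEST, 'Upload field not found')
        return
    filename = match.group(1).decode()
    with open(filename, 'wb') as file:
        # drop the multipart header lines and the trailing boundary
        file.write(b'\r\n'.join(lines[4:-2]))
    self.send_response(HTTPStatus.NO_CONTENT)  # success; no page body needed
    self.end_headers()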
I managed to solve all of my problems. I posted my code on GitHub, for anyone interested.

Downloading multiple files with requests in Python

Currently I'm facing the following problem:
I have 3 download links in a list, but only the last file in the list is downloaded completely.
The others end up with a file size of one kilobyte.
Code:
from requests import get

def download(url, filename):
    with open(filename, "wb") as file:
        response = get(url, stream=True)
        file.write(response.content)

for link in f:
    url = link
    split_url = url.split("/")
    filename = split_url[-1]
    filename = filename.replace("\n", "")
    download(url, filename)
The result looks like this (screenshot omitted): only the last file has its full size.
How do I make sure that all files are downloaded correctly?
All links are direct download links.
Thanks in advance!
EDIT:
I discovered it only happens when I read the links from the .txt file.
If I create the list in Python like this:
links = ["http://ipv4.download.thinkbroadband.com/20MB.zip",
         "http://ipv4.download.thinkbroadband.com/10MB.zip",
         "http://ipv4.download.thinkbroadband.com/5MB.zip"]
... the problem doesn't appear.
Reproducible example:
from requests import get

def download(url, filename):
    with open(filename, "wb") as file:
        response = get(url, stream=True)
        file.write(response.content)

f = open('links.txt', 'r')
for link in f:
    url = link
    split_url = url.split("/")
    filename = split_url[-1]
    filename = filename.replace("\n", "")
    download(url, filename)
content of links.txt:
http://ipv4.download.thinkbroadband.com/20MB.zip
http://ipv4.download.thinkbroadband.com/10MB.zip
http://ipv4.download.thinkbroadband.com/5MB.zip
url = url.replace("\n", "")
solved it! Each line read from links.txt keeps its trailing newline, so the first two requests went out with a stray \n at the end of the URL, and what came back (presumably an error page) was only about one kilobyte.
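Equivalently (a small sketch, not part of the original answer), stripping each line as it is read handles both the URL and the filename, and iter_content keeps memory use low on the larger files:

from requests import get

def download(url, filename):
    with get(url, stream=True) as response, open(filename, "wb") as file:
        # write the body in chunks instead of holding it all in memory
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)

with open("links.txt") as f:
    for line in f:
        url = line.strip()  # removes the trailing newline in one step
        if url:
            download(url, url.split("/")[-1])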

Using Python to download table data without a .csv file address provided

My purpose is to download the data from this website:
http://transoutage.spp.org/
When opening this website, at the bottom of the page there is a description of how to auto-download the data. For example:
http://transoutage.spp.org/report.aspx?download=true&actualendgreaterthan=3/1/2018&includenulls=true
The code I wrote is this:
import requests
ul_begin = 'http://transoutage.spp.org/report.aspx?download=true'
timeset = '3/1/2018' #define the time, m/d/yyyy
fn = ['&actualendgreaterthan='] + [timeset] + ['&includenulls=true']
fn = ''.join(fn)
ul = ul_begin+fn
r = requests.get(ul, verify=False)
If you enter the web address,
http://transoutage.spp.org/report.aspx?download=true&actualendgreaterthan=3/1/2018&includenulls=true,
into Chrome, it will auto-download the data as a .csv file. I do not know how to continue my code.
Please help me!
You need to write the response you receive to a file:
r = requests.get(ul, verify=False)
if 200 <= r.status_code < 300:
    # the request has succeeded
    file_path = '<path_where_file_has_to_be_downloaded>'
    with open(file_path, 'wb') as f:  # binary mode, since r.content is bytes
        f.write(r.content)
This will work properly if the csv file is small in size, but for large files you need to use the stream parameter to download it in chunks: http://masnun.com/2016/09/18/python-using-the-requests-module-to-download-large-files-efficiently.html
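Something along these lines (a sketch of that streamed variant, using only documented requests features):

import requests

def download_csv(url, file_path):
    with requests.get(url, stream=True, verify=False) as r:
        r.raise_for_status()  # stop early on a failed request
        with open(file_path, 'wb') as f:
            # iter_content avoids loading the whole response into memory
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)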

Python: downloading xml files in batch returns a damaged zip file

Drawing inspiration from this post, I am trying to download a bunch of xml files in batch from a website:
import urllib2

url = 'http://ratings.food.gov.uk/open-data/'
f = urllib2.urlopen(url)
data = f.read()
with open("C:\Users\MyName\Desktop\data.zip", "wb") as code:
    code.write(data)
The zip file is created within seconds, but as I attempt to access it, an error window comes up:
Windows cannot open the folder.
The Compressed (zipped) Folder "C:\Users\MyName\Desktop\data.zip" is invalid.
What am I doing wrong here?
You are not downloading a zip archive at all; that URL returns an ordinary HTML page. You have to collect the individual file links from the page and write them into a zip yourself:
import urllib2
from bs4 import BeautifulSoup
import zipfile

url = 'http://ratings.food.gov.uk/open-data/'
fileurls = []
f = urllib2.urlopen(url)
mainpage = f.read()

soup = BeautifulSoup(mainpage, 'html.parser')
tablewrapper = soup.find(id='openDataStatic')
for table in tablewrapper.find_all('table'):
    for link in table.find_all('a'):
        fileurls.append(link['href'])

with zipfile.ZipFile("data.zip", "w") as code:
    for url in fileurls:
        print('Downloading: %s' % url)
        f = urllib2.urlopen(url)
        data = f.read()
        xmlfilename = url.rsplit('/', 1)[-1]
        code.writestr(xmlfilename, data)
You are doing nothing to encode this as a zip file. If you instead open it in a plain text editor such as Notepad, it should show you the raw markup you downloaded.
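One quick way to confirm this (my note, not from the original answers): real zip archives start with the two-byte signature PK, so peeking at the first bytes of the download shows what you actually received:

import urllib2

data = urllib2.urlopen('http://ratings.food.gov.uk/open-data/').read()
print(data[:2] == 'PK')  # False: this is not a zip archive
print(data[:100])        # shows the start of the page markup instead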
