Downloading and zipping files from amazon - python

I'm currently storing all my photos on Amazon S3 and using Django for my website. I want to have a button that users can click to have all their photos zipped and returned to them.
I'm currently using boto to interface with Amazon and found that I can go through the entire bucket list / use get_key to look for specific files and download them.
After this I would need to temporarily store them, then zip them and return them.
What is the best way to go about doing this?
Thanks

You can take a look at this question or at this snippet to download the file:
# This is not a full working example, just a starting point
# for downloading images in different formats.
# path_to_image and filename are placeholders you need to provide.
import subprocess
import Image  # PIL
from django.http import HttpResponse

def image_as_png_pdf(request):
    output_format = request.GET.get('format')
    im = Image.open(path_to_image)  # any Image object should work
    if output_format == 'png':
        response = HttpResponse(mimetype='image/png')
        response['Content-Disposition'] = 'attachment; filename=%s.png' % filename
        im.save(response, 'png')  # will call response.write()
    else:
        # Temporary disk space, server process needs write access
        tmp_path = '/tmp/'
        # Full path to ImageMagick convert binary
        convert_bin = '/usr/bin/convert'
        im.save(tmp_path + filename + '.png', 'png')
        response = HttpResponse(mimetype='application/pdf')
        response['Content-Disposition'] = 'attachment; filename=%s.pdf' % filename
        ret = subprocess.Popen([convert_bin,
                                "%s%s.png" % (tmp_path, filename), "pdf:-"],
                               stdout=subprocess.PIPE)
        response.write(ret.stdout.read())
    return response
To create a zip, follow the link I gave you; you can also use zipimport as shown here (examples are at the bottom of the page; follow the documentation for newer versions). There is also a sketch for your S3 case just below.
You might also be interested in this, although it was made for Django 1.2 and might not work on 1.3.
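For the original question (zipping a user's S3 photos and returning them from a Django view), a minimal sketch could look like the following. This is an assumption of mine rather than part of the answers above: the bucket name, key prefix, and view name are made up, it holds everything in memory, and it uses the newer content_type argument of HttpResponse.
import io
import zipfile

import boto
from django.http import HttpResponse

def download_user_photos(request):
    # Hypothetical sketch: assumes boto credentials are configured and that the
    # user's photos fit comfortably in memory.
    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-photo-bucket')   # made-up bucket name

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
        for key in bucket.list(prefix='user-123/'):   # made-up key prefix
            # get_contents_as_string() pulls the whole object into memory
            zf.writestr(key.name.split('/')[-1], key.get_contents_as_string())

    response = HttpResponse(buf.getvalue(), content_type='application/zip')
    response['Content-Disposition'] = 'attachment; filename=photos.zip'
    return response
For many large photos, a streaming approach such as the ones further down the page would be preferable.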

Using python-zipstream as patched with this pull request you can do something like this:
import boto
import io
import zipstream
import sys


def iterable_to_stream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module). For efficiency, the stream is buffered.

    From: https://stackoverflow.com/a/20260030/729491
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None

        def readable(self):
            return True

        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0  # indicate EOF

    return io.BufferedReader(IterStream(), buffer_size=buffer_size)


def iterate_key():
    b = boto.connect_s3().get_bucket('lastage')
    key = b.get_key('README.markdown')
    for chunk in key:
        yield chunk


with open('/tmp/foo.zip', 'wb') as f:
    z = zipstream.ZipFile(mode='w')
    z.write(iterable_to_stream(iterate_key()), arcname='foo1')
    z.write(iterable_to_stream(iterate_key()), arcname='foo2')
    z.write(iterable_to_stream(iterate_key()), arcname='foo3')
    for chunk in z:
        print "CHUNK", len(chunk)
        f.write(chunk)
Basically we iterate over the key contents using boto, convert this iterator to a stream using the iterable_to_stream method from this answer and then have python-zipstream create a zip file on-the-fly.

gz to zip conversion with S3 and Python [duplicate]

Is there a Python library that allows manipulation of zip archives in memory, without having to use actual disk files?
The ZipFile library does not allow you to update the archive. The only way seems to be to extract it to a directory, make your changes, and create a new zip from that directory. I want to modify zip archives without disk access, because I'll be downloading them, making changes, and uploading them again, so I have no reason to store them.
Something similar to Java's ZipInputStream/ZipOutputStream would do the trick, although any interface at all that avoids disk access would be fine.
According to the Python docs:
class zipfile.ZipFile(file[, mode[, compression[, allowZip64]]])
Open a ZIP file, where file can be either a path to a file (a string) or a file-like object.
So, to open the file in memory, just create a file-like object (perhaps using BytesIO).
import io
import zipfile

file_like_object = io.BytesIO(my_zip_data)
zipfile_ob = zipfile.ZipFile(file_like_object)
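From there the archive behaves like any other ZipFile; for instance (a small illustration of mine, assuming my_zip_data holds valid ZIP bytes):
print(zipfile_ob.namelist())                              # list the entries
first_entry = zipfile_ob.read(zipfile_ob.namelist()[0])   # read one entry into memory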
PYTHON 3
import io
import zipfile
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "a",
                     zipfile.ZIP_DEFLATED, False) as zip_file:
    for file_name, data in [('1.txt', io.BytesIO(b'111')),
                            ('2.txt', io.BytesIO(b'222'))]:
        zip_file.writestr(file_name, data.getvalue())

with open('C:/1.zip', 'wb') as f:
    f.write(zip_buffer.getvalue())
From the article In-Memory Zip in Python:
Below is a post of mine from May of 2008 on zipping in memory with Python, re-posted since Posterous is shutting down.
I recently noticed that there is a for-pay component available to zip files in-memory with Python. Considering this is something that should be free, I threw together the following code. It has only gone through very basic testing, so if anyone finds any errors, let me know and I’ll update this.
import zipfile
import StringIO


class InMemoryZip(object):
    def __init__(self):
        # Create the in-memory file-like object
        self.in_memory_zip = StringIO.StringIO()

    def append(self, filename_in_zip, file_contents):
        '''Appends a file with name filename_in_zip and contents of
        file_contents to the in-memory zip.'''
        # Get a handle to the in-memory zip in append mode
        zf = zipfile.ZipFile(self.in_memory_zip, "a", zipfile.ZIP_DEFLATED, False)

        # Write the file to the in-memory zip
        zf.writestr(filename_in_zip, file_contents)

        # Mark the files as having been created on Windows so that
        # Unix permissions are not inferred as 0000
        for zfile in zf.filelist:
            zfile.create_system = 0

        return self

    def read(self):
        '''Returns a string with the contents of the in-memory zip.'''
        self.in_memory_zip.seek(0)
        return self.in_memory_zip.read()

    def writetofile(self, filename):
        '''Writes the in-memory zip to a file.'''
        f = file(filename, "w")
        f.write(self.read())
        f.close()


if __name__ == "__main__":
    # Run a test
    imz = InMemoryZip()
    imz.append("test.txt", "Another test").append("test2.txt", "Still another")
    imz.writetofile("test.zip")
The example Ethier provided has several problems, some of them major (a small sketch addressing these follows the list):
- It doesn't work for real data on Windows. A ZIP file is binary and its data should always be written to a file opened with 'wb'.
- The ZIP file is appended to for each file, which is inefficient. It can just be opened once and kept as an InMemoryZip attribute.
- The documentation states that ZIP files should be closed explicitly; this is not done in the append function (it probably works for the example because zf goes out of scope and that closes the ZIP file).
- The create_system flag is set for all the files in the zipfile every time a file is appended, instead of just once per file.
- On Python < 3, cStringIO is much more efficient than StringIO.
- It doesn't work on Python 3 (the original article was from before the 3.0 release, but by the time the code was posted, 3.1 had been out for a long time).
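For illustration only, here is a minimal Python 3 sketch of mine that addresses those points (not the ruamel implementation mentioned next): binary data in a BytesIO, a single ZipFile kept open as an attribute, an explicit close, and create_system set once per entry.
import io
import zipfile

class SimpleInMemoryZip:
    """Hypothetical minimal fix-up of the class above, for Python 3."""

    def __init__(self):
        self._buffer = io.BytesIO()
        # Keep one ZipFile open instead of reopening it on every append
        self._zf = zipfile.ZipFile(self._buffer, "a", zipfile.ZIP_DEFLATED, False)

    def append(self, filename_in_zip, file_contents):
        self._zf.writestr(filename_in_zip, file_contents)
        # Mark only the entry just added as created on Windows
        self._zf.filelist[-1].create_system = 0
        return self

    def read(self):
        # Close explicitly so the central directory is written out;
        # no further appends are possible afterwards.
        self._zf.close()
        return self._buffer.getvalue()

    def writetofile(self, filename):
        with open(filename, "wb") as f:  # binary mode, also on Windows
            f.write(self.read())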
An updated version is available if you install ruamel.std.zipfile (of which I am the author). After
pip install ruamel.std.zipfile
or including the code for the class from here, you can do:
import ruamel.std.zipfile as zipfile

# Run a test
imz = zipfile.InMemoryZipFile()
imz.append("test.txt", "Another test").append("test2.txt", "Still another")
imz.writetofile("test.zip")
You can alternatively write the contents using imz.data to any place you need.
You can also use the with statement, and if you provide a filename, the contents of the ZIP will be written on leaving that context:
with zipfile.InMemoryZipFile('test.zip') as imz:
    imz.append("test.txt", "Another test").append("test2.txt", "Still another")
Because of the delayed writing to disc, you can actually read from an old test.zip within that context.
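As an illustration of that delayed write (a hypothetical sketch of mine, assuming a test.zip already exists on disk):
with zipfile.InMemoryZipFile('test.zip') as imz:
    imz.append("new.txt", "new contents")
    # The previous test.zip has not been overwritten yet, so it can still be read here
    with open('test.zip', 'rb') as f:
        previous_archive_bytes = f.read()
# Only on leaving the context is the new archive written to test.zip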
I am using Flask to create an in-memory zipfile and return it as a download. It builds on Vladimir's example above. The seek(0) took a while to figure out.
import io
import zipfile

from flask import send_file

def download_zip():  # a Flask view function
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
        for file_name, data in [('1.txt', io.BytesIO(b'111')),
                                ('2.txt', io.BytesIO(b'222'))]:
            zip_file.writestr(file_name, data.getvalue())
    zip_buffer.seek(0)
    return send_file(zip_buffer, attachment_filename='filename.zip', as_attachment=True)
I want to modify zip archives without disk access, because I'll be downloading them, making changes, and uploading them again, so I have no reason to store them
This is possible using the two libraries https://github.com/uktrade/stream-unzip and https://github.com/uktrade/stream-zip (full disclosure: written by me). And depending on the changes, you might not even have to store the entire zip in memory at once.
Say you just want to download, unzip, zip, and re-upload. Slightly pointless, but you could slot in some changes to the unzipped content:
from datetime import datetime
import httpx
from stream_unzip import stream_unzip
from stream_zip import stream_zip, ZIP_64

def get_source_bytes_iter(url):
    with httpx.stream('GET', url) as r:
        yield from r.iter_bytes()

def get_target_files(files):
    # stream-unzip doesn't expose perms or modified_at, but stream-zip requires them
    modified_at = datetime.now()
    perms = 0o600
    for name, _, chunks in files:
        # Could change name, manipulate chunks, skip a file, or yield a new file
        yield name.decode(), modified_at, perms, ZIP_64, chunks

source_url = 'https://source.test/file.zip'
target_url = 'https://target.test/file.zip'

source_bytes_iter = get_source_bytes_iter(source_url)
source_files = stream_unzip(source_bytes_iter)
target_files = get_target_files(source_files)
target_bytes_iter = stream_zip(target_files)

httpx.put(target_url, data=target_bytes_iter)
Helper to create an in-memory zip file with multiple files based on data like {'1.txt': 'string', '2.txt': b'bytes'}:
import io, zipfile

def prepare_zip_file_content(file_name_content: dict) -> bytes:
    """Returns zip bytes ready to be saved with
    with open('C:/1.zip', 'wb') as f: f.write(bytes)

    file_name_content: dict like {'1.txt': 'string', '2.txt': b'bytes'}
    """
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "a", zipfile.ZIP_DEFLATED, False) as zip_file:
        for file_name, file_data in file_name_content.items():
            zip_file.writestr(file_name, file_data)
    zip_buffer.seek(0)
    return zip_buffer.getvalue()
You can use the library libarchive in Python through ctypes - it offers ways of manipulating ZIP data in memory, with a focus on streaming (at least historically).
Say we want to uncompress ZIP files on the fly while downloading from an HTTP server. The below code
from contextlib import contextmanager
from ctypes import CFUNCTYPE, POINTER, create_string_buffer, cdll, byref, c_ssize_t, c_char_p, c_int, c_void_p, c_char
from ctypes.util import find_library

import httpx

def get_zipped_chunks(url, chunk_size=6553):
    with httpx.stream('GET', url) as r:
        yield from r.iter_bytes()

def stream_unzip(zipped_chunks, chunk_size=65536):
    # Library
    libarchive = cdll.LoadLibrary(find_library('archive'))

    # Callback types
    open_callback_type = CFUNCTYPE(c_int, c_void_p, c_void_p)
    read_callback_type = CFUNCTYPE(c_ssize_t, c_void_p, c_void_p, POINTER(POINTER(c_char)))
    close_callback_type = CFUNCTYPE(c_int, c_void_p, c_void_p)

    # Function types
    libarchive.archive_read_new.restype = c_void_p
    libarchive.archive_read_open.argtypes = [c_void_p, c_void_p, open_callback_type, read_callback_type, close_callback_type]
    libarchive.archive_read_finish.argtypes = [c_void_p]
    libarchive.archive_entry_new.restype = c_void_p
    libarchive.archive_read_next_header.argtypes = [c_void_p, c_void_p]
    libarchive.archive_read_support_compression_all.argtypes = [c_void_p]
    libarchive.archive_read_support_format_all.argtypes = [c_void_p]
    libarchive.archive_entry_pathname.argtypes = [c_void_p]
    libarchive.archive_entry_pathname.restype = c_char_p
    libarchive.archive_read_data.argtypes = [c_void_p, POINTER(c_char), c_ssize_t]
    libarchive.archive_read_data.restype = c_ssize_t
    libarchive.archive_error_string.argtypes = [c_void_p]
    libarchive.archive_error_string.restype = c_char_p

    ARCHIVE_EOF = 1
    ARCHIVE_OK = 0

    it = iter(zipped_chunks)
    compressed_bytes = None  # Make sure not garbage collected

    @contextmanager
    def get_archive():
        archive = libarchive.archive_read_new()
        if not archive:
            raise Exception('Unable to allocate archive')
        try:
            yield archive
        finally:
            libarchive.archive_read_finish(archive)

    def read_callback(archive, client_data, buffer):
        nonlocal compressed_bytes
        try:
            compressed_bytes = create_string_buffer(next(it))
        except StopIteration:
            return 0
        else:
            buffer[0] = compressed_bytes
            return len(compressed_bytes) - 1

    def uncompressed_chunks(archive):
        uncompressed_bytes = create_string_buffer(chunk_size)
        while (num := libarchive.archive_read_data(archive, uncompressed_bytes, len(uncompressed_bytes))) > 0:
            yield uncompressed_bytes.value[:num]
        if num < 0:
            raise Exception(libarchive.archive_error_string(archive))

    with get_archive() as archive:
        libarchive.archive_read_support_compression_all(archive)
        libarchive.archive_read_support_format_all(archive)
        libarchive.archive_read_open(
            archive, 0,
            open_callback_type(0), read_callback_type(read_callback), close_callback_type(0),
        )
        entry = c_void_p(libarchive.archive_entry_new())
        if not entry:
            raise Exception('Unable to allocate entry')

        while (status := libarchive.archive_read_next_header(archive, byref(entry))) == ARCHIVE_OK:
            yield (libarchive.archive_entry_pathname(entry), uncompressed_chunks(archive))

        if status != ARCHIVE_EOF:
            raise Exception(libarchive.archive_error_string(archive))
can be used as follows to do that
zipped_chunks = get_zipped_chunks('https://domain.test/file.zip')
files = stream_unzip(zipped_chunks)

for name, uncompressed_chunks in files:
    print(name)
    for uncompressed_chunk in uncompressed_chunks:
        print(uncompressed_chunk)
In fact since libarchive supports multiple archive formats, and nothing above is particularly ZIP-specific, it may well work with other formats.

How can I get Helm's binary from their GitHub repo?

I'm trying to download Helm's latest release using a script. I want to download the binary and copy it to a file. I tried looking at the documentation, but it's very confusing to read and I don't understand this. I have found a way to download specific files, but nothing regarding the binary. So far, I have:
from github import Github

def get_helm(filename):
    f = open(filename, 'w')  # The file I want to copy the binary to
    g = Github()
    r = g.get_repo("helm/helm")
    # Get binary and use f.write() to transfer it to the file
    f.close()
    return filename
I am also well aware of the limits of queries that I can do since there are no credentials.
For Helm in particular, you're not going to have a good time since they apparently don't publish their release files via GitHub, only the checksum metadata.
See https://github.com/helm/helm/releases/tag/v3.6.0 ...
Otherwise, this would be as simple as (see the sketch after this list):
- get the JSON data from https://api.github.com/repos/{repo}/releases
- get the first release in the list (it's the newest)
- look through the assets of that release to find the file you need (e.g. for your architecture)
- download it using your favorite HTTP client (e.g. the one you used to get the JSON data in the first step)
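As a sketch of those four steps for a project that does publish its binaries as GitHub release assets (my own illustration; the repository name and architecture string below are placeholders, not anything Helm-specific):
import requests

def download_latest_release_asset(repo, architecture, dest_filename):
    # 1. Get the release list for the repository (newest first)
    releases_resp = requests.get(f"https://api.github.com/repos/{repo}/releases")
    releases_resp.raise_for_status()

    # 2. Take the first (newest) release
    newest_release = releases_resp.json()[0]

    # 3. Find an asset whose name mentions the desired architecture
    for asset in newest_release.get("assets", []):
        if architecture in asset["name"]:
            # 4. Download it with the same HTTP client
            with requests.get(asset["browser_download_url"], stream=True) as binary_resp:
                binary_resp.raise_for_status()
                with open(dest_filename, "wb") as f:
                    for chunk in binary_resp.iter_content(chunk_size=524288):
                        f.write(chunk)
            return dest_filename

    raise ValueError("No matching asset found")

# Hypothetical usage for a repository that ships per-architecture tarballs as release assets
download_latest_release_asset("some-org/some-tool", "linux-amd64", "some-tool.tar.gz")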
Nevertheless, here's a script that works for Helm's additional hoops-to-jump-through:
import requests

def download_binary_with_progress(source_url, dest_filename):
    binary_resp = requests.get(source_url, stream=True)
    binary_resp.raise_for_status()

    with open(dest_filename, "wb") as f:
        for chunk in binary_resp.iter_content(chunk_size=524288):
            f.write(chunk)
            print(f.tell(), "bytes written")

    return dest_filename

def download_newest_helm(desired_architecture):
    releases_resp = requests.get(
        f"https://api.github.com/repos/helm/helm/releases"
    )
    releases_resp.raise_for_status()
    releases_data = releases_resp.json()
    newest_release = releases_data[0]

    for asset in newest_release.get("assets", []):
        name = asset["name"]
        # For a project using regular releases, this would be simplified to
        # checking for the desired architecture and doing
        #   download_binary_with_progress(asset["browser_download_url"], name)
        if desired_architecture in name and name.endswith(".tar.gz.asc"):
            tarball_filename = name.replace(".tar.gz.asc", ".tar.gz")
            tarball_url = f"https://get.helm.sh/{tarball_filename}"
            return download_binary_with_progress(
                source_url=tarball_url, dest_filename=tarball_filename
            )

    raise ValueError("No matching release found")

download_newest_helm("darwin-arm64")

Fully streaming XML parser

I'm trying to consume the Exchange GetAttachment webservice using requests, lxml and base64io. This service returns a base64-encoded file in a SOAP XML HTTP response. The file content is contained in a single line in a single XML element. GetAttachment is just an example, but the problem is more general.
I would like to stream the decoded file contents directly to disk without storing the entire contents of the attachment in-memory at any point, since an attachment could be several 100 MB.
I have tried something like this:
r = requests.post('https://example.com/EWS/Exchange.asmx', data=..., stream=True)
with open('foo.txt', 'wb') as f:
    for action, elem in lxml.etree.iterparse(GzipFile(fileobj=r.raw)):
        if elem.tag == 't:Content':
            b64_encoder = Base64IO(BytesIO(elem.text))
            f.write(b64_encoder.read())
but lxml still stores a copy of the attachment as elem.text. Is there any way I can create a fully streaming XML parser that also streams the content of an element directly from the input stream?
Don't use iterparse in this case. The iterparse() method can only issue element start and end events, so any text in an element is given to you when the closing XML tag has been found.
Instead, use a SAX parser interface. This is a general standard for XML parsing libraries to pass on parsed data to a content handler. The ContentHandler.characters() callback is passed character data in chunks (assuming that the implementing XML library actually makes use of this possibility). This is a lower-level API than the ElementTree API, and the Python standard library already bundles the Expat parser to drive it.
So the flow then becomes:
- Wrap the incoming request stream in a GzipFile for easy decompression. Or, better still, set response.raw.decode_content = True and leave decompression to the requests library based on the content-encoding the server has set.
- Pass the GzipFile instance or raw stream to the .parse() method of a parser created with xml.sax.make_parser(). The parser then proceeds to read from the stream in chunks. By using make_parser() you can first enable features such as namespace handling (which ensures your code doesn't break if Exchange decides to alter the short prefixes used for each namespace).
- The content handler characters() method is called with chunks of XML data; check for the correct element start event, so you know when to expect base64 data. You can decode that base64 data in chunks of (a multiple of) 4 characters at a time and write it to a file. I'd not use base64io here; just do your own chunking.
A simple content handler could be:
from xml.sax import handler
from base64 import b64decode

class AttachmentContentHandler(handler.ContentHandler):
    types_ns = 'http://schemas.microsoft.com/exchange/services/2006/types'

    def __init__(self, filename):
        self.filename = filename

    def startDocument(self):
        self._buffer = None
        self._file = None

    def startElementNS(self, name, *args):
        if name == (self.types_ns, 'Content'):
            # we can expect base64 data next
            self._file = open(self.filename, 'wb')
            self._buffer = []

    def endElementNS(self, name, *args):
        if name == (self.types_ns, 'Content'):
            # all attachment data received, close the file
            try:
                if self._buffer:
                    raise ValueError("Incomplete Base64 data")
            finally:
                self._file.close()
                self._file = self._buffer = None

    def characters(self, data):
        if self._buffer is None:
            return
        self._buffer.append(data)
        self._decode_buffer()

    def _decode_buffer(self):
        remainder = ''
        for data in self._buffer:
            available = len(remainder) + len(data)
            overflow = available % 4
            if remainder:
                data = (remainder + data)
                remainder = ''
            if overflow:
                remainder, data = data[-overflow:], data[:-overflow]
            if data:
                self._file.write(b64decode(data))
        self._buffer = [remainder] if remainder else []
and you'd use it like this:
import requests
from xml.sax import make_parser, handler

parser = make_parser()
parser.setFeature(handler.feature_namespaces, True)
parser.setContentHandler(AttachmentContentHandler('foo.txt'))

r = requests.post('https://example.com/EWS/Exchange.asmx', data=..., stream=True)
r.raw.decode_content = True  # if content-encoding is used, decompress as we read
parser.parse(r.raw)
This will parse the input XML in chunks of up to 64KB (the default IncrementalParser buffer size), so attachment data is decoded in at most 48KB blocks of raw data.
I'd probably extend the content handler to take a target directory and then look for <t:Name> elements to extract the filename, then use that to extract the data to the correct filename for each attachment found. You'd also want to verify that you are actually dealing with a GetAttachmentResponse document, and handle error responses.
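For example, a hypothetical extension along those lines (an untested sketch of mine, reusing the base64 chunk-decoding idea from the handler above and assuming each <t:Name> element precedes its <t:Content> element):
import os
from base64 import b64decode
from xml.sax import handler

class AttachmentDirectoryContentHandler(handler.ContentHandler):
    types_ns = 'http://schemas.microsoft.com/exchange/services/2006/types'

    def __init__(self, target_dir):
        self.target_dir = target_dir

    def startDocument(self):
        self._name_parts = None
        self._current_name = None
        self._buffer = None
        self._file = None

    def startElementNS(self, name, *args):
        if name == (self.types_ns, 'Name'):
            self._name_parts = []
        elif name == (self.types_ns, 'Content'):
            filename = os.path.basename(self._current_name or 'attachment.bin')
            self._file = open(os.path.join(self.target_dir, filename), 'wb')
            self._buffer = []

    def endElementNS(self, name, *args):
        if name == (self.types_ns, 'Name'):
            self._current_name = ''.join(self._name_parts)
            self._name_parts = None
        elif name == (self.types_ns, 'Content'):
            self._file.close()
            self._file = self._buffer = None

    def characters(self, data):
        if self._name_parts is not None:
            self._name_parts.append(data)
        elif self._buffer is not None:
            self._buffer.append(data)
            self._decode_buffer()

    def _decode_buffer(self):
        # Same idea as above: only decode complete groups of 4 base64 characters
        remainder = ''
        for data in self._buffer:
            if remainder:
                data = remainder + data
                remainder = ''
            overflow = len(data) % 4
            if overflow:
                remainder, data = data[-overflow:], data[:-overflow]
            if data:
                self._file.write(b64decode(data))
        self._buffer = [remainder] if remainder else []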

Why are my pictures corrupted after downloading and writing them in python?

Preface
This is my first post on stackoverflow so I apologize if I mess up somewhere. I searched the internet and stackoverflow heavily for a solution to my issues but I couldn't find anything.
Situation
What I am working on is creating a digital photo frame with my raspberry pi that will also automatically download pictures from my wife's facebook page. Luckily I found someone who was working on something similar:
https://github.com/samuelclay/Raspberry-Pi-Photo-Frame
One month ago this gentleman added the download_facebook.py script. This is what I needed! So a few days ago I started working on this script to get it working in my windows environment first (before I throw it on the pi). Unfortunately there is no documentation specific to that script and I am lacking in python experience.
Based on the from urllib import urlopen statement, I can assume that this script was written for Python 2.x. This is because in Python 3.x it is now from urllib import request.
So I installed Python 2.7.9 interpreter and I've had fewer issues than when I was attempting to work with Python 3.4.3 interpreter.
Problem
I've gotten the script to download pictures from the facebook account; however, the pictures are corrupted.
Here are pictures of the problem: http://imgur.com/a/3u7cG
Now, I originally was using Python 3.4.3 and had issues with my method urlrequest(url) (see code at bottom of post) and how it was working with the image data. I tried decoding with different formats such as utf-8 and utf-16 but according to the content headers, it shows utf-8 format (I think).
Conclusion
I'm not quite sure if the problem is with downloading the image or with writing the image to the file.
If anyone can help me with this I'd be forever grateful! Also let me know what I can do to improve my posts in the future.
Thanks in advance.
Code
from urllib import urlopen
from json import loads
from sys import argv
import dateutil.parser as dateparser
import logging

# Plug in your username and access_token (the token can be obtained and
# modified via the Explorer's Get Access Token button):
# https://graph.facebook.com/USER_NAME/photos?type=uploaded&fields=source&access_token=ACCESS_TOKEN_HERE
FACEBOOK_USER_ID = "**USER ID REMOVED"
FACEBOOK_ACCESS_TOKEN = "** TOKEN REMOVED - GET YOUR OWN **"


def get_logger(label='lvm_cli', level='INFO'):
    """
    Return a generic logger.
    """
    format = '%(asctime)s - %(levelname)s - %(message)s'
    logging.basicConfig(format=format)
    logger = logging.getLogger(label)
    logger.setLevel(getattr(logging, level))
    return logger


def urlrequest(url):
    """
    Make a url request
    """
    req = urlopen(url)
    data = req.read()
    return data


def get_json(url):
    """
    Make a url request and return as a JSON object
    """
    res = urlrequest(url)
    data = loads(res)
    return data


def get_next(data):
    """
    Get next element from facebook JSON response,
    or return None if no next present.
    """
    try:
        return data['paging']['next']
    except KeyError:
        return None


def get_images(data):
    """
    Get all images from facebook JSON response,
    or return None if no data present.
    """
    try:
        return data['data']
    except KeyError:
        return []


def get_all_images(url):
    """
    Get all images using recursion.
    """
    data = get_json(url)
    images = get_images(data)
    next = get_next(data)
    if not next:
        return images
    else:
        return images + get_all_images(next)


def get_url(userid, access_token):
    """
    Generates a useable facebook graph API url
    """
    root = 'https://graph.facebook.com/'
    endpoint = '%s/photos?type=uploaded&fields=source,updated_time&access_token=%s' % \
        (userid, access_token)
    return '%s%s' % (root, endpoint)


def download_file(url, filename):
    """
    Write image to a file.
    """
    data = urlrequest(url)
    path = 'C:/photos/%s' % filename
    f = open(path, 'w')
    f.write(data)
    f.close()


def create_time_stamp(timestring):
    """
    Creates a pretty string from time
    """
    date = dateparser.parse(timestring)
    return date.strftime('%Y-%m-%d-%H-%M-%S')


def download(userid, access_token):
    """
    Download all images to current directory.
    """
    logger = get_logger()
    url = get_url(userid, access_token)
    logger.info('Requesting image direct link, please wait..')
    images = get_all_images(url)
    for image in images:
        logger.info('Downloading %s' % image['source'])
        filename = '%s.jpg' % create_time_stamp(image['created_time'])
        download_file(image['source'], filename)


if __name__ == '__main__':
    download(FACEBOOK_USER_ID, FACEBOOK_ACCESS_TOKEN)
Answering the question of why @Alastair's solution from the comments worked:
f = open(path, 'wb')
From https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files:
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it'll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn't hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
(I was on a Mac, which explains why the problem wasn't reproduced for me.)
Alastair McCormack posted something that worked! He said: "Try setting binary mode when you open the file for writing: f = open(path, 'wb')"
It is now downloading the images correctly. Does anyone know why this worked?

Downloading contents of several html pages using python

I'm new to Python and was trying to figure out how to code a script that will download the contents of HTML pages. I was thinking of doing something like:
Y = 0
X = "example.com/example/" + Y
While Y != 500:
    (code to download file), Y++
    if Y == 500:
        break
So (Y) is the file name, and I need to download files from example.com/example/1 all the way to file number 500, regardless of the file type.
Read this official docs page:
This module provides a high-level interface for fetching data across the World Wide Web.
In particular, the urlopen() function is similar to the built-in function open(), but accepts Universal Resource Locators (URLs) instead of filenames.
Some restrictions apply — it can only open URLs for reading, and no seek operations are available.
So you have code like this:
import urllib
content = urllib.urlopen("http://www.google.com").read()
#urllib.request.urlopen(...).read() in python 3
The following code should meet your needs. It will download 500 pages and save them to disk.
import urllib2

def grab_html(url):
    response = urllib2.urlopen(url)
    mimetype = response.info().getheader('Content-Type')
    return response.read(), mimetype

for i in range(500):
    filename = str(i)  # Use digit as filename
    url = "http://example.com/example/{0}".format(filename)
    contents, _ = grab_html(url)
    with open(filename, "w") as fp:
        fp.write(contents)
Notes:
If you need parallel fetching, here is a great example https://docs.python.org/3/library/concurrent.futures.html
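For example, a minimal sketch of mine using ThreadPoolExecutor (assuming Python 3 and the example.com/example/N URL pattern from the question):
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

def fetch(i):
    url = "http://example.com/example/{0}".format(i)
    with urlopen(url) as response:
        return i, response.read()

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(fetch, i) for i in range(1, 501)]
    for future in as_completed(futures):
        i, contents = future.result()
        with open(str(i), "wb") as fp:
            fp.write(contents)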
