Mocking download of a file using Python requests and responses - python

I have some python code which successfully downloads an image from a URL, using requests, and saves it into /tmp/. I want to test this does what it should. I'm using responses to test fetching of JSON files, but I'm not sure how to mock the behaviour of fetching a file.
I assume it'd be similar to mocking a standard response, like the below, but I think I'm blanking on how to set the body to be a file...
@responses.activate
def test_download():
    responses.add(responses.GET, 'http://example.org/images/my_image.jpg',
                  body='', status=200,
                  content_type='image/jpeg')
    #...
UPDATE: Following Ashafix's comment, I'm trying this (python 3):
from io import BytesIO

@responses.activate
def test_download():
    with open('tests/images/tester.jpg', 'rb') as img1:
        imgIO = BytesIO(img1.read())
    responses.add(responses.GET, 'http://example.org/images/my_image.jpg',
                  body=imgIO, status=200,
                  content_type='image/jpeg')
    imgIO.seek(0)
    #...
But when, subsequently, the code I'm testing attempts to do the request I get:
a bytes-like object is required, not '_io.BytesIO'
Feels like it's almost right, but I'm stumped.
UPDATE 2: Trying to follow Steve Jessop's suggestion:
@responses.activate
def test_download():
    with open('tests/images/tester.jpg', 'rb') as img1:
        responses.add(responses.GET, 'http://example.org/images/my_image.jpg',
                      body=img1.read(), status=200,
                      content_type='image/jpeg')
        #...
But this time the code being tested raises this:
I/O operation on closed file.
Surely the image should still be open inside the with block?
UPDATE 3: The code I'm testing is something like this:
r = requests.get(url, stream=True)
if r.status_code == 200:
    with open('/tmp/temp.jpg', 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
It seems to be that the final shutil line is generating the "I/O operation on closed file." error. I don't understand this enough - the streaming of the file - to know how best to mock this behaviour, to test the downloaded file is saved to /tmp/.

You might need to pass stream=True to the responses.add call. Something like:
@responses.activate
def test_download():
    with open("tests/images/tester.jpg", "rb") as img1:
        responses.add(
            responses.GET,
            "http://example.org/images/my_image.jpg",
            body=img1.read(),
            status=200,
            content_type="image/jpeg",
            stream=True,
        )

First, to summarise my now overly long question... I'm testing some code that's something like:
def download_file(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        filename = os.path.basename(url)
        with open('/tmp/%s' % filename, 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)
        return filename
It downloads an image and, streaming it, saves it to /tmp/. I wanted to mock the request so I can test other things.
@responses.activate
def test_downloads_file(self):
    url = 'http://example.org/test.jpg'
    with open('tests/images/tester.jpg', 'rb') as img:
        responses.add(responses.GET, url,
                      body=img.read(), status=200,
                      content_type='image/jpg',
                      adding_headers={'Transfer-Encoding': 'chunked'})
        filename = download_file(url)
        # assert things here.
Once I had worked out the way to use open() for this, I was still getting "I/O operation on closed file." from shutil.copyfileobj(). The thing that's stopped this is to add in the Transfer-Encoding header, which is present in the headers when I make the real request.
Any suggestions for other, better solutions very welcome!
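One alternative that avoids the streaming details of responses altogether is to hand the function a fake response whose raw attribute is an in-memory stream. This is only a minimal sketch, not the approach described above: it assumes a hypothetical refactor in which download_file receives the get callable and the target directory as parameters (http_get and dest_dir are made-up names), so the test needs nothing beyond the standard library.

```python
import io
import os
import shutil
import tempfile
from unittest import mock


def download_file(url, http_get, dest_dir):
    """Stream `url` into dest_dir/<basename>; return the filename, or None.

    `http_get` and `dest_dir` are hypothetical parameters added for
    testability; in production `http_get` would be `requests.get`.
    """
    r = http_get(url, stream=True)
    if r.status_code != 200:
        return None
    filename = os.path.basename(url)
    with open(os.path.join(dest_dir, filename), 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
    return filename


def make_fake_get(payload, status_code=200):
    """Build a stand-in for requests.get that serves `payload` from memory."""
    response = mock.Mock()
    response.status_code = status_code
    # An in-memory stream supports read(), which is all copyfileobj needs.
    response.raw = io.BytesIO(payload)
    return mock.Mock(return_value=response)


def run_download_test():
    """Download into a temporary directory and compare the saved bytes."""
    payload = b'\xff\xd8\xff\xe0 not a real JPEG, just some bytes'
    fake_get = make_fake_get(payload)
    with tempfile.TemporaryDirectory() as dest_dir:
        filename = download_file('http://example.org/test.jpg', fake_get, dest_dir)
        with open(os.path.join(dest_dir, filename), 'rb') as saved:
            return filename, saved.read() == payload
```

The trade-off is that this no longer exercises requests itself, so it only checks the save-to-disk logic.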

Related

How to Upload File using FastAPI?

I am using FastAPI to upload a file according to the official documentation, as shown below:
@app.post("/create_file")
async def create_file(file: UploadFile = File(...)):
    file2store = await file.read()
    # some code to store the BytesIO(file2store) to the other database
When I send a request using Python requests library, as shown below:
f = open(".../file.txt", 'rb')
files = {"file": (f.name, f, "multipart/form-data")}
requests.post(url="SERVER_URL/create_file", files=files)
the file2store variable is always empty. Sometimes (rarely seen), it can get the file bytes, but almost all the time it is empty, so I can't restore the file on the other database.
I also tried the bytes rather than UploadFile, but I get the same results. Is there something wrong with my code, or is the way I use FastAPI to upload a file wrong?
First, as per FastAPI documentation, you need to install python-multipart—if you haven't already—as uploaded files are sent as "form data". For instance:
pip install python-multipart
The below examples use the .file attribute of the UploadFile object to get the actual Python file (i.e., SpooledTemporaryFile), which allows you to call SpooledTemporaryFile's methods, such as .read() and .close(), without having to await them. It is important, however, to define your endpoint with def in this case—otherwise, such operations would block the server until they are completed, if the endpoint was defined with async def. In FastAPI, a normal def endpoint is run in an external threadpool that is then awaited, instead of being called directly (as it would block the server).
The SpooledTemporaryFile used by FastAPI/Starlette has the max_size attribute set to 1 MB, meaning that the data are spooled in memory until the file size exceeds 1 MB, at which point the data are written to a temporary file on disk, under the OS's temp directory. Hence, if you uploaded a file larger than 1 MB, it wouldn't be stored in memory, and calling file.file.read() would actually read the data from disk into memory. Thus, if the file is too large to fit into your server's RAM, you should rather read the file in chunks and process one chunk at a time, as described in "Read the File in chunks" section below. You may also have a look at this answer, which demonstrates another approach to upload a large file in chunks, using Starlette's .stream() method and streaming-form-data package that allows parsing streaming multipart/form-data chunks, which results in considerably minimising the time required to upload files.
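The spooling behaviour can be seen with the standard library alone. The sketch below uses a 1 KB threshold instead of 1 MB so that the rollover is easy to trigger. Note that _rolled is a private CPython attribute, read here purely for illustration, and that spool_states is a made-up helper name.

```python
import tempfile


def spool_states(max_size=1024):
    """Return whether the spool has rolled over to disk after each write."""
    states = []
    with tempfile.SpooledTemporaryFile(max_size=max_size) as spool:
        spool.write(b'a' * max_size)   # exactly max_size: still held in memory
        states.append(spool._rolled)   # private attribute, illustration only
        spool.write(b'b')              # one byte over the limit: moved to disk
        states.append(spool._rolled)
        spool.seek(0)
        total = len(spool.read())      # read() works the same in both states
    return states, total
```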
If you have to define your endpoint with async def (as you might need to await some other coroutines inside your route), then you should rather use asynchronous reading and writing of the contents, as demonstrated in this answer. Moreover, if you need to send additional data (such as JSON data) together with uploading the file(s), please have a look at this answer. I would also suggest you have a look at this answer, which explains the difference between def and async def endpoints.
Upload Single File
app.py
from fastapi import File, UploadFile

@app.post("/upload")
def upload(file: UploadFile = File(...)):
    try:
        contents = file.file.read()
        with open(file.filename, 'wb') as f:
            f.write(contents)
    except Exception:
        return {"message": "There was an error uploading the file"}
    finally:
        file.file.close()

    return {"message": f"Successfully uploaded {file.filename}"}
Read the File in chunks
As described earlier and in this answer, if the file is too big to fit into memory—for instance, if you have 8GB of RAM, you can't load a 50GB file (not to mention that the available RAM will always be less than the total amount installed on your machine, as other applications will be using some of the RAM)—you should rather load the file into memory in chunks and process the data one chunk at a time. This method, however, may take longer to complete, depending on the chunk size you choose—in the example below, the chunk size is 1024 * 1024 bytes (i.e., 1MB). You can adjust the chunk size as desired.
from fastapi import File, UploadFile

@app.post("/upload")
def upload(file: UploadFile = File(...)):
    try:
        with open(file.filename, 'wb') as f:
            while contents := file.file.read(1024 * 1024):
                f.write(contents)
    except Exception:
        return {"message": "There was an error uploading the file"}
    finally:
        file.file.close()

    return {"message": f"Successfully uploaded {file.filename}"}
Another option would be to use shutil.copyfileobj(), which is used to copy the contents of a file-like object to another file-like object (have a look at this answer too). By default, the data is read in chunks with the default buffer (chunk) size being 1MB (i.e., 1024 * 1024 bytes) for Windows and 64KB for other platforms, as shown in the source code here. You can specify the buffer size by passing the optional length parameter. Note: If negative length value is passed, the entire contents of the file will be read instead—see f.read() as well, which .copyfileobj() uses under the hood (as can be seen in the source code here).
from fastapi import File, UploadFile
import shutil

@app.post("/upload")
def upload(file: UploadFile = File(...)):
    try:
        with open(file.filename, 'wb') as f:
            shutil.copyfileobj(file.file, f)
    except Exception:
        return {"message": "There was an error uploading the file"}
    finally:
        file.file.close()

    return {"message": f"Successfully uploaded {file.filename}"}
test.py
import requests
url = 'http://127.0.0.1:8000/upload'
file = {'file': open('images/1.png', 'rb')}
resp = requests.post(url=url, files=file)
print(resp.json())
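The chunked copy described above can be observed outside FastAPI. The following is a small sketch using in-memory streams; RecordingReader and copy_in_chunks are hypothetical names introduced only to show how the length argument of shutil.copyfileobj() controls the size of each read.

```python
import io
import shutil


class RecordingReader:
    """Wrap a binary stream and record the size of every read() result."""

    def __init__(self, stream):
        self._stream = stream
        self.read_sizes = []

    def read(self, size=-1):
        data = self._stream.read(size)
        self.read_sizes.append(len(data))
        return data


def copy_in_chunks(payload, chunk_size):
    """Copy `payload` through shutil.copyfileobj using `chunk_size`-byte reads."""
    source = RecordingReader(io.BytesIO(payload))
    destination = io.BytesIO()
    shutil.copyfileobj(source, destination, length=chunk_size)
    return destination.getvalue(), source.read_sizes
```

Copying 2500 bytes with a chunk size of 1024 should result in reads of 1024, 1024 and 452 bytes, followed by one empty read that ends the loop, so only one chunk is held in memory at a time.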
Upload Multiple (List of) Files
app.py
from fastapi import File, UploadFile
from typing import List

@app.post("/upload")
def upload(files: List[UploadFile] = File(...)):
    for file in files:
        try:
            contents = file.file.read()
            with open(file.filename, 'wb') as f:
                f.write(contents)
        except Exception:
            return {"message": "There was an error uploading the file(s)"}
        finally:
            file.file.close()

    return {"message": f"Successfully uploaded {[file.filename for file in files]}"}
Read the Files in chunks
As described earlier in this answer, if you expect some rather large file(s) and don't have enough RAM to accommodate all the data from the beginning to the end, you should rather load the file into memory in chunks, thus processing the data one chunk at a time (Note: adjust the chunk size as desired, below that is 1024 * 1024 bytes).
from fastapi import File, UploadFile
from typing import List

@app.post("/upload")
def upload(files: List[UploadFile] = File(...)):
    for file in files:
        try:
            with open(file.filename, 'wb') as f:
                while contents := file.file.read(1024 * 1024):
                    f.write(contents)
        except Exception:
            return {"message": "There was an error uploading the file(s)"}
        finally:
            file.file.close()

    return {"message": f"Successfully uploaded {[file.filename for file in files]}"}
or, using shutil.copyfileobj():
from fastapi import File, UploadFile
from typing import List
import shutil

@app.post("/upload")
def upload(files: List[UploadFile] = File(...)):
    for file in files:
        try:
            with open(file.filename, 'wb') as f:
                shutil.copyfileobj(file.file, f)
        except Exception:
            return {"message": "There was an error uploading the file(s)"}
        finally:
            file.file.close()

    return {"message": f"Successfully uploaded {[file.filename for file in files]}"}
test.py
import requests
url = 'http://127.0.0.1:8000/upload'
files = [('files', open('images/1.png', 'rb')), ('files', open('images/2.png', 'rb'))]
resp = requests.post(url=url, files=files)
print(resp.json())
@app.post("/create_file/")
async def image(image: UploadFile = File(...)):
    print(image.file)
    # print('../'+os.path.isdir(os.getcwd()+"images"),"*************")
    try:
        os.mkdir("images")
        print(os.getcwd())
    except Exception as e:
        print(e)
    file_name = os.getcwd()+"/images/"+image.filename.replace(" ", "-")
    with open(file_name,'wb+') as f:
        f.write(image.file.read())
        f.close()
    file = jsonable_encoder({"imagePath":file_name})
    new_image = await add_image(file)
    return {"filename": new_image}

Python memory issue uploading multiple files to API

I'm running a script to upload 20k+ XML files to an API. About 18k in, I get a memory error. I was looking into it and found the memory is just continually climbing until it reaches the limit and errors out (seemingly on the post call). Anyone know why this is happening or a fix? Thanks. I have tried the streaming uploads found here. The empty strings are due to sensitive data.
def upload(self, oauth_token, full_file_path):
    file_name = os.path.basename(full_file_path)
    upload_endpoint = {'':''}
    params = {'': '','': ''}
    headers = {'': '', '': ''}
    handler = None
    try:
        handler = open(full_file_path, 'rb')
        response = requests.post(url=upload_endpoint[''], params=params, data=handler, headers=headers, auth=oauth_token, verify=False, allow_redirects=False, timeout=600)
        status_code = response.status_code
        # status checking
        return status_code
    finally:
        if handler:
            handler.close()

def push_data(self):
    oauth_token = self.get_oauth_token()
    files = os.listdir(f_dir)
    for file in files:
        status = self.upload(oauth_token, file_to_upload)
What version of Python are you using? It looks like there is a bug in Python 3.4 causing memory leaks related to network requests. See here for a similar issue: https://github.com/psf/requests/issues/5215
It may help to update Python.

Urllib2 Python - Reconnecting and Splitting Response

I am moving to Python from another language and I am not sure how to properly tackle this. Using the urllib2 library it is quite easy to set up a proxy and get data from a site:
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
The problem I have is that the text file that is retrieved is very large (hundreds of MB) and the connection is often problematic. The code also needs to catch connection, server and transfer errors (it will be part of a small, extensively used pipeline).
Could anyone suggest how to modify the code above to make sure the code automatically reconnects n times (for example 100 times) and perhaps split the response into chunks so the data will be downloaded faster and more reliably?
I have already split the requests as much as I could so now have to make sure that the retrieve code is as good as it can be. Solutions based on core python libraries are ideal.
Perhaps the library is already doing the above in which case is there any way to improve downloading large files? I am using UNIX and need to deal with a proxy.
Thanks for your help.
I'm putting up an example of how you might want to do this with the python-requests library. The script below checks whether the destination file already exists. If it does, it is assumed to be a partially downloaded file, and the script tries to resume the download. If the server claims support for HTTP partial requests (i.e. the response to a HEAD request contains an Accept-Ranges header), the script resumes based on the size of the partially downloaded file; otherwise it does a regular download and discards the parts that are already downloaded. It should be fairly straightforward to convert this to use just urllib2 if you don't want to use python-requests; it will probably just be much more verbose.
Note that resuming downloads may corrupt the file if the file on the server is modified between the initial download and the resume. This can be detected if the server supports a strong HTTP ETag header, so the downloader can check whether it's resuming the same file.
I make no claim that it is bug-free.
You should probably add checksum logic around this script to detect download errors and retry from scratch if the checksum doesn't match.
import logging
import os
import re
import requests

CHUNK_SIZE = 5*1024 # 5KB
logging.basicConfig(level=logging.INFO)

def stream_download(input_iterator, output_stream):
    for chunk in input_iterator:
        output_stream.write(chunk)

def skip(input_iterator, output_stream, bytes_to_skip):
    total_read = 0
    while total_read <= bytes_to_skip:
        chunk = next(input_iterator)
        total_read += len(chunk)
    # write only the tail of the last chunk that lies past the skipped bytes
    output_stream.write(chunk[bytes_to_skip - total_read:])
    assert total_read == output_stream.tell()
    return input_iterator

def resume_with_range(url, output_stream):
    dest_size = output_stream.tell()
    headers = {'Range': 'bytes=%s-' % dest_size}
    resp = requests.get(url, stream=True, headers=headers)
    input_iterator = resp.iter_content(CHUNK_SIZE)
    if resp.status_code != requests.codes.partial_content:
        logging.warning('server does not agree to do partial request, skipping instead')
        input_iterator = skip(input_iterator, output_stream, output_stream.tell())
        return input_iterator
    rng_unit, rng_start, rng_end, rng_size = re.match(r'(\w+) (\d+)-(\d+)/(\d+|\*)', resp.headers['Content-Range']).groups()
    rng_start, rng_end = int(rng_start), int(rng_end)
    rng_size = None if rng_size == '*' else int(rng_size)  # '*' means the total size is unknown
    assert rng_start <= dest_size
    if rng_start != dest_size:
        logging.warning('server returned different Range than requested')
        output_stream.seek(rng_start)
    return input_iterator

def download(url, dest):
    ''' Download `url` to `dest`, resuming if `dest` already exists
        If `dest` already exists it is assumed to be a partially
        downloaded file for the url.
    '''
    output_stream = open(dest, 'ab+')
    output_stream.seek(0, os.SEEK_END)
    dest_size = output_stream.tell()
    if dest_size == 0:
        logging.info('STARTING download from %s to %s', url, dest)
        resp = requests.get(url, stream=True)
        input_iterator = resp.iter_content(CHUNK_SIZE)
        stream_download(input_iterator, output_stream)
        logging.info('FINISHED download from %s to %s', url, dest)
        return
    remote_headers = requests.head(url).headers
    remote_size = int(remote_headers['Content-Length'])
    if dest_size < remote_size:
        logging.info('RESUMING download from %s to %s', url, dest)
        support_range = 'bytes' in [s.strip() for s in remote_headers['Accept-Ranges'].split(',')]
        if support_range:
            logging.debug('server supports Range request')
            logging.debug('downloading "Range: bytes=%s-"', dest_size)
            input_iterator = resume_with_range(url, output_stream)
        else:
            logging.debug('skipping %s bytes', dest_size)
            resp = requests.get(url, stream=True)
            input_iterator = resp.iter_content(CHUNK_SIZE)
            input_iterator = skip(input_iterator, output_stream, bytes_to_skip=dest_size)
        stream_download(input_iterator, output_stream)
        logging.info('FINISHED download from %s to %s', url, dest)
        return
    logging.debug('NOTHING TO DO')
    return

def main():
    TEST_URL = 'http://mirror.internode.on.net/pub/test/1meg.test'
    DEST = TEST_URL.split('/')[-1]
    download(TEST_URL, DEST)

main()
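The checksum step suggested above could look roughly like the sketch below. It assumes the server publishes a SHA-256 digest for the file somewhere you trust, which the original answer does not specify. The names file_sha256 and verify_download are made up for illustration.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read until it returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file's digest matches the published one."""
    return file_sha256(path) == expected_sha256.strip().lower()
```

If verify_download returns False after a resumed download, deleting the partial file and downloading again from scratch is the safe fallback.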
You can try something like this. It reads the file line by line and appends it to a file. It also checks to make sure that you don't go over the same line. I'll write another script that does it by chunks as well.
import urllib2

file_checker = None
print("Please Wait...")
while True:
    try:
        req = urllib2.Request('http://www.voidspace.org.uk')
        response = urllib2.urlopen(req, timeout=20)
        print("Connected")
        with open("outfile.html", 'w+') as out_data:
            for data in response.readlines():
                file_checker = open("outfile.html")
                if data not in file_checker.readlines():
                    out_data.write(str(data))
        break
    except urllib2.URLError:
        print("Connection Error!")
        print("Connecting again...please wait")
file_checker.close()
print("done")
Here's how to read the data in chunks instead of by lines
import urllib2

CHUNK = 16 * 1024
file_checker = None
print("Please Wait...")
while True:
    try:
        req = urllib2.Request('http://www.voidspace.org.uk')
        response = urllib2.urlopen(req, timeout=1)
        print("Connected")
        with open("outdata", 'wb+') as out_data:
            while True:
                chunk = response.read(CHUNK)
                file_checker = open("outdata")
                if chunk and chunk not in file_checker.readlines():
                    out_data.write(chunk)
                else:
                    break
        break
    except urllib2.URLError:
        print("Connection Error!")
        print("Connecting again...please wait")
file_checker.close()
print("done")
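The loops above retry forever, whereas the question asks for a bounded number of reconnects (for example 100). A minimal sketch of such a wrapper follows, written for Python 3. The names retry, operation, attempts, delay and exceptions are all illustrative. Catching OSError is an assumption that fits both urllib's URLError and the exceptions raised by requests, since both derive from it.

```python
import time


def retry(operation, attempts=100, delay=1.0, exceptions=(OSError,)):
    """Call `operation()` until it succeeds or `attempts` tries are used up.

    The last exception is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except exceptions:
            if attempt == attempts:
                raise          # out of tries: let the caller see the error
            time.sleep(delay)  # brief pause before reconnecting
```

Combined with a resumable download function such as the download(url, dest) shown in the earlier answer, a call like retry(lambda: download(url, dest), attempts=100) would pick up from the partial file on each new attempt.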

Fetch a file from a local url with Python requests?

I am using Python's requests library in one method of my application. The body of the method looks like this:
def handle_remote_file(url, **kwargs):
    response = requests.get(url, ...)
    buff = StringIO.StringIO()
    buff.write(response.content)
    ...
    return True
I'd like to write some unit tests for that method, however, what I want to do is to pass a fake local url such as:
class RemoteTest(TestCase):
    def setUp(self):
        self.url = 'file:///tmp/dummy.txt'

    def test_handle_remote_file(self):
        self.assertTrue(handle_remote_file(self.url))
When I call requests.get with a local url, I got the KeyError exception below:
requests.get('file:///tmp/dummy.txt')
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/packages/urllib3/poolmanager.pyc in connection_from_host(self, host, port, scheme)
76
77 # Make a fresh ConnectionPool of the desired type
78 pool_cls = pool_classes_by_scheme[scheme]
79 pool = pool_cls(host, port, **self.connection_pool_kw)
80
KeyError: 'file'
The question is how can I pass a local url to requests.get?
PS: I made up the above example. It possibly contains many errors.
As @WooParadog explained, the requests library doesn't know how to handle local files. However, the current version allows you to define transport adapters.
Therefore you can simply define your own adapter which will be able to handle local files, e.g.:
import os

import requests
from requests_testadapter import Resp

class LocalFileAdapter(requests.adapters.HTTPAdapter):
    def build_response_from_file(self, request):
        file_path = request.url[7:]  # strip the leading 'file://'
        with open(file_path, 'rb') as file:
            buff = bytearray(os.path.getsize(file_path))
            file.readinto(buff)
            resp = Resp(buff)
            r = self.build_response(request, resp)
            return r

    def send(self, request, stream=False, timeout=None,
             verify=True, cert=None, proxies=None):
        return self.build_response_from_file(request)

requests_session = requests.session()
requests_session.mount('file://', LocalFileAdapter())
requests_session.get('file://<some_local_path>')
I'm using requests-testadapter module in the above example.
Here's a transport adapter I wrote which is more featureful than b1r3k's and has no additional dependencies beyond Requests itself. I haven't tested it exhaustively yet, but what I have tried seems to be bug-free.
import requests
import os, sys

if sys.version_info.major < 3:
    from urllib import url2pathname
else:
    from urllib.request import url2pathname

class LocalFileAdapter(requests.adapters.BaseAdapter):
    """Protocol Adapter to allow Requests to GET file:// URLs

    @todo: Properly handle non-empty hostname portions.
    """

    @staticmethod
    def _chkpath(method, path):
        """Return an HTTP status for the given filesystem path."""
        if method.lower() in ('put', 'delete'):
            return 501, "Not Implemented"  # TODO
        elif method.lower() not in ('get', 'head'):
            return 405, "Method Not Allowed"
        elif os.path.isdir(path):
            return 400, "Path Not A File"
        elif not os.path.isfile(path):
            return 404, "File Not Found"
        elif not os.access(path, os.R_OK):
            return 403, "Access Denied"
        else:
            return 200, "OK"

    def send(self, req, **kwargs):  # pylint: disable=unused-argument
        """Return the file specified by the given request

        @type req: C{PreparedRequest}
        @todo: Should I bother filling `response.headers` and processing
               If-Modified-Since and friends using `os.stat`?
        """
        path = os.path.normcase(os.path.normpath(url2pathname(req.path_url)))
        response = requests.Response()

        response.status_code, response.reason = self._chkpath(req.method, path)
        if response.status_code == 200 and req.method.lower() != 'head':
            try:
                response.raw = open(path, 'rb')
            except (OSError, IOError) as err:
                response.status_code = 500
                response.reason = str(err)

        if isinstance(req.url, bytes):
            response.url = req.url.decode('utf-8')
        else:
            response.url = req.url

        response.request = req
        response.connection = self

        return response

    def close(self):
        pass
(Despite the name, it was completely written before I thought to check Google, so it has nothing to do with b1r3k's.) As with the other answer, follow this with:
requests_session = requests.session()
requests_session.mount('file://', LocalFileAdapter())
r = requests_session.get('file:///path/to/your/file')
The easiest way seems to be using requests-file.
https://github.com/dashea/requests-file (available through PyPI too)
"Requests-File is a transport adapter for use with the Requests Python library to allow local filesystem access via file:// URLs."
This in combination with requests-html is pure magic :)
packages/urllib3/poolmanager.py pretty much explains it. Requests doesn't support local URLs.
pool_classes_by_scheme = {
    'http': HTTPConnectionPool,
    'https': HTTPSConnectionPool,
}
In a recent project, I've had the same issue. Since requests doesn't support the "file" scheme, I'll patch our code to load the content locally. First, I define a function to replace requests.get:
def local_get(self, url):
    "Fetch a stream from local files."
    p_url = six.moves.urllib.parse.urlparse(url)
    if p_url.scheme != 'file':
        raise ValueError("Expected file scheme")
    filename = six.moves.urllib.request.url2pathname(p_url.path)
    return open(filename, 'rb')
Then, somewhere in test setup or decorating the test function, I use mock.patch to patch the get function on requests:
@mock.patch('requests.get', local_get)
def test_handle_remote_file(self):
    ...
This technique is somewhat brittle -- it doesn't help if the underlying code calls requests.request or constructs a Session and calls that. There may be a way to patch requests at a lower level to support file: URLs, but in my initial investigation, there didn't seem to be an obvious hook point, so I went with this simpler approach.
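A variation on the same idea is sketched below for Python 3 without six. It is an assumption-laden illustration rather than the code from the project mentioned above: FakeResponse is a hypothetical stand-in exposing only content and status_code, and local_get takes no self argument so it can be patched over a module-level function such as requests.get.

```python
import pathlib
from urllib.parse import urlparse
from urllib.request import url2pathname


class FakeResponse:
    """Tiny stand-in for requests.Response; only what the test needs."""

    def __init__(self, content, status_code=200):
        self.content = content
        self.status_code = status_code


def local_get(url, **kwargs):
    """Replacement for requests.get that serves file:// URLs from disk.

    Extra keyword arguments (timeout, headers, ...) are accepted and ignored.
    """
    parsed = urlparse(url)
    if parsed.scheme != 'file':
        raise ValueError('Expected file scheme, got %r' % parsed.scheme)
    path = pathlib.Path(url2pathname(parsed.path))
    return FakeResponse(path.read_bytes())
```

Because the returned object has a content attribute, code such as handle_remote_file in the question can use it unchanged once requests.get is patched with mock.patch as shown above.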
To load a file from a local URL, e.g. an image file you can do this:
import urllib.request
from PIL import Image

Image.open(urllib.request.urlopen('file:///path/to/your/file.png'))
I think a simple solution for this would be to create a temporary HTTP server with Python and use it.
Put all your files in a temporary folder, e.g. tempFolder.
Go to that directory and start a temporary HTTP server from the terminal/cmd, as appropriate for your OS, using the command python -m http.server 8000 (8000 is the port number).
This gives you an HTTP server that you can access at http://127.0.0.1:8000/
Open your desired file in the browser and copy the link to use as your URL.
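For automated tests, the same temporary server can be started from within the test process instead of a terminal. The sketch below uses only the standard library (Python 3.7 or later). The names serve_directory, fetch and QuietHandler are made up for illustration, and port 0 asks the OS for any free port rather than the fixed 8000 used above.

```python
import functools
import http.server
import threading
import urllib.request


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep test output clean


def serve_directory(directory):
    """Serve `directory` on 127.0.0.1 at a free port; return (server, base_url)."""
    handler = functools.partial(QuietHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(('127.0.0.1', 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, 'http://127.0.0.1:%d/' % server.server_address[1]


def fetch(url):
    """GET `url` directly, bypassing any proxy configured in the environment."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.read()
```

In a real test, the code under test would call requests.get on base_url plus the file name. Call server.shutdown() and server.server_close() during teardown.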

Gunzipping Contents of a URL - Python

I'm back. :) Again trying to get the gzipped contents of a URL and gunzip them. This time in Python. The #SERVER section of code is the script I'm using to generate the gzipped data. The data is known good, as it works with Java. The #CLIENT section of code is the bit of code I'm using client-side to try and read that data (for eventual JSON parsing). However, somewhere in this transfer, the gzip module forgets how to read the data it created.
#SERVER
outbuf = StringIO.StringIO()
outfile = gzip.GzipFile(fileobj = outbuf, mode = 'wb')
outfile.write(data)
outfile.close()
print "Content-Encoding: gzip\n"
print outbuf.getvalue()
#CLIENT
urlReq = urllib2.Request(url)
urlReq.add_header('Accept-Encoding', '*')
urlConn = urllib2.build_opener().open(urlReq)
urlConnObj = StringIO.StringIO(urlConn.read())
gzin = gzip.GzipFile(fileobj = urlConnObj)
return gzin.read() #IOError: Not a gzipped file.
Other Notes:
outbuf.getvalue() is the same as urlConnObj.getvalue() is the same as urlConn.read()
This StackOverflow question seemed to help me out.
Apparently, it was just wise to by-pass the gzip module entirely, opting for zlib instead. Also, changing "*" to "gzip" in the "Accept-Encoding" header may've helped.
#CLIENT
urlReq = urllib2.Request(url)
urlReq.add_header('Accept-Encoding', 'gzip')
urlConn = urllib2.urlopen(urlReq)
return zlib.decompress(urlConn.read(), 16+zlib.MAX_WBITS)
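The zlib call above can be checked in isolation. This is a short Python 3 sketch with made-up helper names: adding 16 to zlib.MAX_WBITS tells zlib to expect a gzip header and trailer, while adding 32 lets it auto-detect either a gzip or a zlib header.

```python
import gzip
import zlib


def gunzip_bytes(data):
    """Decompress gzip-framed bytes; 16 + MAX_WBITS selects the gzip format."""
    return zlib.decompress(data, 16 + zlib.MAX_WBITS)


def gunzip_or_zlib_bytes(data):
    """As above, but 32 + MAX_WBITS auto-detects a gzip or zlib header."""
    return zlib.decompress(data, 32 + zlib.MAX_WBITS)
```

The auto-detecting variant may be the more forgiving choice when it is unclear which of the two formats the server actually sends.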
