Python memory issue uploading multiple files to API

I'm running a script to upload 20k+ XML files to an API. About 18k in, I get a memory error. Looking into it, I found that memory climbs continually until it hits the limit and errors out (seemingly on the post call). Does anyone know why this is happening, or how to fix it? Thanks. I have tried the streaming uploads found here. The empty strings are placeholders for sensitive data.
def upload(self, oauth_token, full_file_path):
    file_name = os.path.basename(full_file_path)
    upload_endpoint = {'': ''}
    params = {'': '', '': ''}
    headers = {'': '', '': ''}
    handler = None
    try:
        handler = open(full_file_path, 'rb')
        response = requests.post(url=upload_endpoint[''], params=params, data=handler,
                                 headers=headers, auth=oauth_token, verify=False,
                                 allow_redirects=False, timeout=600)
        status_code = response.status_code
        # status checking
        return status_code
    finally:
        if handler:
            handler.close()

def push_data(self):
    oauth_token = self.get_oauth_token()
    files = os.listdir(f_dir)
    for file in files:
        status = self.upload(oauth_token, file_to_upload)

What version of Python are you using? It looks like there is a bug in Python 3.4 causing memory leaks related to network requests. See here for a similar issue: https://github.com/psf/requests/issues/5215
It may help to update Python.
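If updating doesn't help, it may also be worth reusing a single requests.Session across all uploads and closing each response explicitly, so per-request connections and buffers are released immediately rather than left to the garbage collector. A minimal sketch under that assumption; the shared self.session attribute is hypothetical, and the empty strings mirror the redactions in the question:

import requests

# Assumes the uploader class creates one shared session, e.g. in __init__:
#     self.session = requests.Session()
def upload(self, oauth_token, full_file_path):
    params = {'': '', '': ''}
    headers = {'': '', '': ''}
    with open(full_file_path, 'rb') as handler:
        # Passing the open file object streams the body instead of reading it into memory.
        response = self.session.post(url='', params=params, data=handler,
                                     headers=headers, auth=oauth_token, verify=False,
                                     allow_redirects=False, timeout=600)
    status_code = response.status_code
    # Release the underlying connection back to the pool instead of waiting for GC.
    response.close()
    return status_code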

Related

Memory leak from a docker container with a Flask application

I am having a strange problem when I start multiple Docker containers with Flask applications. The containers with the apps are used for simulation purposes and are not deployed to production; I simply needed a way for the Docker containers to communicate with each other, and GET/POST API calls seemed like a good solution. However, this is where my problem occurred: when I start the containers and the Flask application starts, the memory usage (which I am observing with htop) starts to increase. Just by starting the Flask server, the container size increases by 200 MB. I can honestly live with that; the problem, however, is that after every API call the memory usage keeps increasing. Here is a small snippet of one of the functions:
@app.route('/execute/step=<int:step>', methods=['GET'])
def execute(step):
    url = f'http://my_url:5000/some/api/call/step={step}'
    response = requests.get(url)
    data = eval(response.text)
    if data:
        # unimportant calculations
        if demand <= supply:
            for b in people_b:
                buyer_id = b['id']
                post_data = {some_data}
                url = f'http://my_url:5000/set_data/id={buyer_id}'
                requests.post(url, data=post_data)
            for s in people_s[:-1]:
                seller_id = s['id']
                post_data = {some_data}
                url = f'http://my_url:5000/set_data/id={seller_id}'
                requests.post(url, data=post_data)
            # unimportant steps
            seller_id = local_ids[-1]['id']
            post_data = {some_data}
            url = f'http://my_url:5000/set_data/id={seller_id}'
            requests.post(url, data=post_data)
            return 'Success\n'
        else:
            for s in people_s:
                seller_id = s['id']
                post_data = {some_data}
                url = f'http://my_url:5000/set_data/id={seller_id}'
                requests.post(url, data=post_data)
            for b in people_b[:-1]:
                # unimportant steps
                buyer_id = b['id']
                post_data = {some_data}
                url = f'http://my_url:5000/set_data/id={buyer_id}'
                requests.post(url, data=post_data)
            # unimportant steps
            buyer_id = people_b[-1]['id']
            post_data = {some_data}
            url = f'http://my_url:5000/set_data/id={buyer_id}'
            requests.post(url, data=post_data)
            return 'Success\n'
    else:
        return 'No success\n'
Above is one of the methods; I have deleted some unimportant computation steps, but what I wanted to show is that there are nested API calls as well. I tried calling gc.collect() before every return in the functions, but with no success.
Is this behavior expected when performing so many API calls, or is there a problem with the implementation or with the Docker/Flask usage?
The problem with the memory leak was entirely due to using eval on response.text. After switching to the JSON output, the memory leaks were gone.
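For reference, the change amounts to letting requests parse the body instead of evaluating the response text. A minimal sketch of the pattern, with a hypothetical fetch_step helper standing in for the top of execute():

import requests

def fetch_step(step):
    url = f'http://my_url:5000/some/api/call/step={step}'
    response = requests.get(url)
    # response.json() parses the body with the json module; no eval of untrusted text.
    return response.json()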

How to read a stream with Python 3? (not with requests module)

I'm building an HTTP client that reads a stream from a server. Right now I am using the requests module, but I am having trouble with response.iter_lines(): every few iterations I lose data.
Python Ver. 3.7
requests Ver. 2.21.0
I tried different methods, including the use of generators (which for some reason raise StopIteration after a very small number of iterations). I tried setting chunk_size=None to prevent losing data, but the problem still occurs.
response = requests.get(url, headers=headers, stream=True, timeout=60 * 10)
gen = response.iter_lines(chunk_size=None)
try:
    for line in gen:
        json_data = json.loads(line)
        yield json_data
except StopIteration:
    return
def http_parser():
    json_list = []
    response = requests.get(url, headers=headers, stream=True, timeout=60 * 10)
    for line in response.iter_lines():
        json_data = json.loads(line)
        json_list.append(json_data)
    return json_list
Both functions cause loss of data.
The requests documentation warns that iter_lines() may cause loss of data.
Does anyone have a recommendation for another module with a similar ability that won't cause any loss of data?
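One workaround that stays within requests is to buffer raw chunks yourself and split on newlines, rather than relying on iter_lines(). A minimal sketch, assuming the server sends newline-delimited JSON; the stream_json_lines name is illustrative and url/headers are the same as above:

import json
import requests

def stream_json_lines(url, headers):
    response = requests.get(url, headers=headers, stream=True, timeout=60 * 10)
    buffer = b''
    for chunk in response.iter_content(chunk_size=8192):
        buffer += chunk
        # Yield every complete line; keep any trailing partial line in the buffer.
        while b'\n' in buffer:
            line, buffer = buffer.split(b'\n', 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        yield json.loads(buffer)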

Robot Framework: send binary data in POST request body with

I have a problem getting my test running using Robot Framework and robotframework-requests. I need to send a POST request with binary data in the body. I looked at this question already, but it's not really answered. Here's what my test case looks like:
Upload ${filename} file
    Create Session    mysession    http://${ADDRESS}
    ${data} =    Get Binary File    ${filename}
    &{headers} =    Create Dictionary    Content-Type=application/octet-stream    Accept=application/octet-stream
    ${resp} =    Post Request    mysession    ${CGIPath}    data=${data}    headers=&{headers}
    [Return]    ${resp.status_code}    ${resp.text}
The problem is that my binary data is about 250 MB. When the data is read with Get Binary File, I see that memory consumption goes up to 2.x GB. A few seconds later, when the Post Request is triggered, my test is killed by OOM. I already looked at the files parameter, but it seems to use a multipart-encoded upload, which is not what I need.
My other thought was to pass an open file handle directly to the underlying requests library, but I guess that would require modifying robotframework-requests. Another idea is to fall back to curl for this test only.
Am I missing something in my test? What is the better way to address this?
I proceeded with the idea of modifying robotframework-requests and added this method:
def post_request_binary(
        self,
        alias,
        uri,
        path=None,
        params=None,
        headers=None,
        allow_redirects=None,
        timeout=None):
    session = self._cache.switch(alias)
    redir = True if allow_redirects is None else allow_redirects
    self._capture_output()
    method_name = "post"
    method = getattr(session, method_name)
    with open(path, 'rb') as f:
        resp = method(self._get_url(session, uri),
                      data=f,
                      params=self._utf8_urlencode(params),
                      headers=headers,
                      allow_redirects=redir,
                      timeout=self._get_timeout(timeout),
                      cookies=self.cookies,
                      verify=self.verify)
    self._print_debug()
    # Store the last session object
    session.last_resp = resp
    self.builtin.log(method_name + ' response: ' + resp.text, 'DEBUG')
    return resp
I guess I can improve it a bit and create a pull request.

Mocking download of a file using Python requests and responses

I have some Python code which successfully downloads an image from a URL, using requests, and saves it into /tmp/. I want to test that it does what it should. I'm using responses to test fetching of JSON files, but I'm not sure how to mock the behaviour of fetching a file.
I assume it'd be similar to mocking a standard response, like the below, but I think I'm blanking on how to set the body to be a file...
@responses.activate
def test_download():
    responses.add(responses.GET, 'http://example.org/images/my_image.jpg',
                  body='', status=200,
                  content_type='image/jpeg')
    # ...
UPDATE: Following Ashafix's comment, I'm trying this (python 3):
from io import BytesIO

@responses.activate
def test_download():
    with open('tests/images/tester.jpg', 'rb') as img1:
        imgIO = BytesIO(img1.read())
        responses.add(responses.GET, 'http://example.org/images/my_image.jpg',
                      body=imgIO, status=200,
                      content_type='image/jpeg')
        imgIO.seek(0)
    # ...
But when, subsequently, the code I'm testing attempts to do the request I get:
a bytes-like object is required, not '_io.BytesIO'
Feels like it's almost right, but I'm stumped.
UPDATE 2: Trying to follow Steve Jessop's suggestion:
@responses.activate
def test_download():
    with open('tests/images/tester.jpg', 'rb') as img1:
        responses.add(responses.GET, 'http://example.org/images/my_image.jpg',
                      body=img1.read(), status=200,
                      content_type='image/jpeg')
    # ...
But this time the code being tested raises this:
I/O operation on closed file.
Surely the image should still be open inside the with block?
UPDATE 3: The code I'm testing is something like this:
r = requests.get(url, stream=True)
if r.status_code == 200:
    with open('/tmp/temp.jpg', 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
It seems that the final shutil line is generating the "I/O operation on closed file." error. I don't understand the streaming of the file well enough to know how best to mock this behaviour and test that the downloaded file is saved to /tmp/.
You might need to pass stream=True to the responses.add call. Something like:
@responses.activate
def test_download():
    with open("tests/images/tester.jpg", "rb") as img1:
        responses.add(
            responses.GET,
            "http://example.org/images/my_image.jpg",
            body=img1.read(),
            status=200,
            content_type="image/jpeg",
            stream=True,
        )
First, to summarise my now overly long question... I'm testing some code that's something like:
def download_file(url):
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        filename = os.path.basename(url)
        with open('/tmp/%s' % filename, 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)
        return filename
It downloads an image, streaming it, and saves it to /tmp/. I wanted to mock the request so I can test other things.
@responses.activate
def test_downloads_file(self):
    url = 'http://example.org/test.jpg'
    with open('tests/images/tester.jpg', 'rb') as img:
        responses.add(responses.GET, url,
                      body=img.read(), status=200,
                      content_type='image/jpg',
                      adding_headers={'Transfer-Encoding': 'chunked'})
        filename = download_file(url)
        # assert things here.
Once I had worked out the way to use open() for this, I was still getting "I/O operation on closed file." from shutil.copyfileobj(). What stopped this was adding the Transfer-Encoding header, which is present in the headers when I make the real request.
Any suggestions for other, better solutions are very welcome!
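For completeness, the assertions could compare the saved file against the fixture byte for byte. A minimal sketch with a hypothetical helper name; the fixture path matches the one used above:

import filecmp

def assert_downloaded_matches_fixture(filename, fixture_path='tests/images/tester.jpg'):
    # Byte-for-byte comparison of the saved download against the fixture image.
    assert filecmp.cmp('/tmp/%s' % filename, fixture_path, shallow=False)

Called right after download_file(url) in the test above, this would also catch truncated or corrupted writes.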

Gunzipping Contents of a URL - Python

I'm back. :) Again trying to get the gzipped contents of a URL and gunzip them, this time in Python. The #SERVER section of code is the script I'm using to generate the gzipped data. The data is known good, as it works with Java. The #CLIENT section is the bit of code I'm using client-side to try to read that data (for eventual JSON parsing). However, somewhere in this transfer, the gzip module forgets how to read the data it created.
#SERVER
outbuf = StringIO.StringIO()
outfile = gzip.GzipFile(fileobj = outbuf, mode = 'wb')
outfile.write(data)
outfile.close()
print "Content-Encoding: gzip\n"
print outbuf.getvalue()
#CLIENT
urlReq = urllib2.Request(url)
urlReq.add_header('Accept-Encoding', '*')
urlConn = urllib2.build_opener().open(urlReq)
urlConnObj = StringIO.StringIO(urlConn.read())
gzin = gzip.GzipFile(fileobj = urlConnObj)
return gzin.read() #IOError: Not a gzipped file.
Other Notes:
outbuf.getvalue() is the same as urlConnObj.getvalue() is the same as urlConn.read()
This StackOverflow question seemed to help me out.
Apparently, it was just wise to bypass the gzip module entirely, opting for zlib instead. Also, changing "*" to "gzip" in the "Accept-Encoding" header may have helped.
#CLIENT
urlReq = urllib2.Request(url)
urlReq.add_header('Accept-Encoding', 'gzip')
urlConn = urllib2.urlopen(urlReq)
return zlib.decompress(urlConn.read(), 16+zlib.MAX_WBITS)
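The 16 + zlib.MAX_WBITS argument tells zlib to expect a gzip header. On Python 3, a minimal sketch of the same idea using urllib.request and gzip.decompress, which handles the gzip header directly; the url variable and the JSON parsing are assumed from the question:

import gzip
import json
import urllib.request

urlReq = urllib.request.Request(url, headers={'Accept-Encoding': 'gzip'})
with urllib.request.urlopen(urlReq) as urlConn:
    raw = urlConn.read()

# gzip.decompress expects a complete gzip stream, header included.
data = json.loads(gzip.decompress(raw).decode('utf-8'))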
