Memory leak from a Docker container with a Flask application - Python

I am having a strange problem when I start multiple Docker containers with Flask applications. The containers are used for simulation purposes and are not deployed to production; I simply needed a way for the containers to communicate with each other, and GET/POST API calls seemed like a good solution. However, this is where my problem occurs: when I start the containers and the Flask application starts, the memory usage (which I am observing with htop) starts to increase. Just by starting the Flask server, the container's memory usage grows by 200 MB. I can honestly live with that; the real problem is that after every API call, the memory usage keeps increasing. Here is a small snippet of one of the functions:
@app.route('/execute/step=<int:step>', methods=['GET'])
def execute(step):
    url = f'http://my_url:5000/some/api/call/step={step}'
    response = requests.get(url)
    data = eval(response.text)
    if data:
        # unimportant calculations
        if demand <= supply:
            for b in people_b:
                buyer_id = b['id']
                post_data = {some_data}
                url = f'http://my_url:5000/set_data/id={buyer_id}'
                requests.post(url, data=post_data)
            for s in people_s[:-1]:
                seller_id = s['id']
                post_data = {some_data}
                url = f'http://my_url:5000/set_data/id={seller_id}'
                requests.post(url, data=post_data)
            # unimportant steps
            seller_id = local_ids[-1]['id']
            post_data = {some_data}
            url = f'http://my_url:5000/set_data/id={seller_id}'
            requests.post(url, data=post_data)
            return 'Success\n'
        else:
            for s in people_s:
                seller_id = s['id']
                post_data = {some_data}
                url = f'http://my_url:5000/set_data/id={seller_id}'
                requests.post(url, data=post_data)
            for b in people_b[:-1]:
                # unimportant steps
                buyer_id = b['id']
                post_data = {some_data}
                url = f'http://my_url:5000/set_data/id={buyer_id}'
                requests.post(url, data=post_data)
            # unimportant steps
            buyer_id = people_b[-1]['id']
            post_data = {some_data}
            url = f'http://my_url:5000/set_data/id={buyer_id}'
            requests.post(url, data=post_data)
            return 'Success\n'
    else:
        return 'No success\n'
Above is one of the methods. I have deleted some unimportant computation steps, but what I wanted to show is that there are nested API calls as well. I tried calling gc.collect() before every return in the functions; however, this brought no success.
Is this behavior expected when performing so many API calls, or is there a problem with the implementation or with the Docker/Flask usage?

The memory leak turned out to be caused entirely by using eval on response.text. After switching to the JSON output, the leaks were gone.
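For reference, a minimal sketch of that change inside the route above (my_url and the surrounding names are the question's placeholders):
response = requests.get(url)
# before: data = eval(response.text)  # eval-ing the response body on every call
data = response.json()                # parse the JSON body instead of using eval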

Related

Multiple Unique Autonomous Sessions with Python Requests

I have searched through all the requests docs I can find, including requests-futures, and I can't find anything addressing this question. Wondering if I missed something or if anyone here can help answer:
Is it possible to create and manage multiple unique/autonomous sessions with requests.Session()?
In more detail, I'm asking if there is a way to create two sessions I could use "get" with separately (maybe via multiprocessing) that would retain their own unique set of headers, proxies, and server-assigned sticky data.
If I want Session_A to hit someimaginarysite.com/page1.html using specific headers and proxies, get cookied and everything else, and then hit someimaginarysite.com/page2.html with the same everything, I can just create one session object and do it.
But what if I want a Session_B, at the same time, to start on page3.html and then hit page4.html with totally different headers/proxies than Session_A, and be assigned its own cookies and whatnot? I want to crawl multiple pages consecutively with the same session, not just make a single request followed by another request from a blank (new) session.
Is this as simple as saying:
import requests
Session_A = requests.Session()
Session_B = requests.Session()
headers_a = {A headers here}
proxies_a = {A proxies here}
headers_b = {B headers here}
proxies_b = {B proxies here}
response_a = Session_A.get('https://someimaginarysite.com/page1.html', headers=headers_a, proxies=proxies_a)
response_a = Session_A.get('https://someimaginarysite.com/page2.html', headers=headers_a, proxies=proxies_a)
# --- and on a separate thread/processor ---
response_b = Session_B.get('https://someimaginarysite.com/page3.html', headers=headers_b, proxies=proxies_b)
response_b = Session_B.get('https://someimaginarysite.com/page4.html', headers=headers_b, proxies=proxies_b)
Or will the above just create one session accessible by two names, so the server will see the same cookies and session appearing with two IPs and two sets of headers... which would seem more than a little odd.
Greatly appreciate any help with this, I've exhausted my research abilities.
There is probably a better way to do this, but without more information about the pagination and exactly what you want, it is a bit hard to know exactly what you need. The following will create two threads, each making sequential calls while keeping the same headers and proxies. Again, there may be a better approach, but with the limited information it's a bit murky.
import requests
import concurrent.futures

def session_get_same(urls, headers, proxies):
    lst = []
    with requests.Session() as s:
        for url in urls:
            lst.append(s.get(url, headers=headers, proxies=proxies))
    return lst

def main():
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(
                session_get_same,
                urls=[
                    'https://someimaginarysite.com/page1.html',
                    'https://someimaginarysite.com/page2.html'
                ],
                headers={'user-agent': 'curl/7.61.1'},
                proxies={'https': 'https://10.10.1.11:1080'}
            ),
            executor.submit(
                session_get_same,
                urls=[
                    'https://someimaginarysite.com/page3.html',
                    'https://someimaginarysite.com/page4.html'
                ],
                headers={'user-agent': 'curl/7.61.2'},
                proxies={'https': 'https://10.10.1.11:1080'}
            ),
        ]
        flst = []
        for future in concurrent.futures.as_completed(futures):
            flst.append(future.result())
        return flst
Not sure if this is better than the first function; mileage may vary.
def session_get_same(urls, headers, proxies):
    lst = []
    with requests.Session() as s:
        s.headers.update(headers)
        s.proxies.update(proxies)
        for url in urls:
            lst.append(s.get(url))
    return lst
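A possible way to run either variant (purely illustrative):
if __name__ == '__main__':
    results = main()
    # one entry per session/thread, each holding that session's list of responses
    for session_responses in results:
        for resp in session_responses:
            print(resp.status_code, resp.url)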

Python memory issue uploading multiple files to API

I'm running a script to upload 20k+ XML files to an API. About 18k in, I get a memory error. I looked into it and found that the memory just keeps climbing until it reaches the limit and errors out (seemingly on the post call). Does anyone know why this is happening, or know of a fix? Thanks. I have tried the streaming uploads described here. The empty strings are due to sensitive data.
def upload(self, oauth_token, full_file_path):
    file_name = os.path.basename(full_file_path)
    upload_endpoint = {'': ''}
    params = {'': '', '': ''}
    headers = {'': '', '': ''}
    handler = None
    try:
        handler = open(full_file_path, 'rb')
        response = requests.post(url=upload_endpoint[''], params=params, data=handler,
                                 headers=headers, auth=oauth_token, verify=False,
                                 allow_redirects=False, timeout=600)
        status_code = response.status_code
        # status checking
        return status_code
    finally:
        if handler:
            handler.close()

def push_data(self):
    oauth_token = self.get_oauth_token()
    files = os.listdir(f_dir)
    for file in files:
        status = self.upload(oauth_token, file_to_upload)
What version of Python are you using? It looks like there is a bug in Python 3.4 causing memory leaks related to network requests. See here for a similar issue: https://github.com/psf/requests/issues/5215
It may help to update Python.
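If updating the interpreter is not possible, a minimal sketch of one common mitigation (this is an assumption, not part of the original answer): reuse a single requests.Session for all uploads and release each response explicitly, so connections and buffered bodies do not accumulate.
import os
import requests

def push_data(self):
    oauth_token = self.get_oauth_token()
    upload_endpoint = {'': ''}            # placeholders as in the question
    params = {'': '', '': ''}
    headers = {'': '', '': ''}
    with requests.Session() as session:   # one connection pool for every upload
        for file_name in os.listdir(f_dir):
            full_file_path = os.path.join(f_dir, file_name)
            with open(full_file_path, 'rb') as handler:
                response = session.post(url=upload_endpoint[''], params=params,
                                         data=handler, headers=headers,
                                         auth=oauth_token, verify=False,
                                         allow_redirects=False, timeout=600)
            response.close()              # release the body/connection promptly
            # status checking as before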

Obtain all Woocommerce Orders via Python API

I'm looking to export all orders from the WooCommerce API via a Python script.
I've followed the authentication process and have been using the method for obtaining orders described here. My code looks like the following:
wcapi = API(
    url="url",
    consumer_key=consumerkey,
    consumer_secret=consumersecret
)
r = wcapi.get('orders')
r = r.json()
r = r['orders']
print(len(r))  # output: 8
This outputs the most recent 8 orders, but I would like to access all of them. There are over 200 orders placed via woocommerce right now. How do I access all of the orders?
Please tell me there is something simple I am missing.
My ultimate goal is to pull these orders automatically, transform them, and then upload to a visualization tool. All input is appreciated.
First: Initialize your API (as you did).
wcapi = API(
    url=eshop.url,
    consumer_key=eshop.consumer_key,
    consumer_secret=eshop.consumer_secret,
    wp_api=True,
    version="wc/v2",
    query_string_auth=True,
    verify_ssl=True,
    timeout=10
)
Second: Fetch the orders with your request (as you did).
r = wcapi.get("orders")
Third: Fetch the total number of pages.
total_pages = int(r.headers['X-WP-TotalPages'])
Fourth: For every page, fetch the JSON and access the data through the API.
for i in range(1, total_pages + 1):
    r = wcapi.get("orders?page=" + str(i)).json()
    ...
The relevant parameters, found in the corresponding documentation, are page and per_page. The per_page parameter defines how many orders should be retrieved per request, and the page parameter defines the current page of the order collection.
For example, the request sent by wcapi.get('orders?per_page=5&page=2') will return the second page of five orders, i.e. orders 6 to 10.
However, since the default of per_page is 10, it is not clear why you get only 8 orders.
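For example, a corrected request for the second page of five orders could look like either of the following (the params form matches the decorator example further down):
r = wcapi.get("orders?per_page=5&page=2")
# or
r = wcapi.get("orders", params={"per_page": 5, "page": 2})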
I encountered the same problem with paginated response for products.
I built on the same approach described by @gtopal, whereby the X-WP-TotalPages header returned by WooCommerce is used to iterate through each page of results.
I knew that I would probably encounter the same issue for other WooCommerce API requests (such as orders), and I didn't want to have to confuse my code by repeatedly performing a loop when I requested a paginated set of results.
To avoid this I used a decorator to abstract the pagination logic, so that get_all_wc_orders can focus just on the request.
I hope the decorator below might be useful to someone else (gist)
import logging

from woocommerce import API

log = logging.getLogger(__name__)  # the decorator below logs its progress

WC_MAX_API_RESULT_COUNT = 100

wcapi = API(
    url=url,
    consumer_key=key,
    consumer_secret=secret,
    version="wc/v3",
    timeout=300,
)
def wcapi_aggregate_paginated_response(func):
    """
    Decorator that repeatedly calls the decorated function to get
    all pages of a WooCommerce API response.
    Combines the response data into a single list.
    The decorated function must accept parameters:
    - wcapi object
    - page number
    """
    def wrapper(wcapi, page=0, *args, **kwargs):
        items = []
        page = 0
        num_pages = WC_MAX_API_RESULT_COUNT
        while page < num_pages:
            page += 1
            log.debug(f"{page=}")
            response = func(wcapi, page=page, *args, **kwargs)
            items.extend(response.json())
            num_pages = int(response.headers["X-WP-TotalPages"])
            num_products = int(response.headers["X-WP-Total"])
            log.debug(f"{num_products=}, {len(items)=}")
        return items

    return wrapper
@wcapi_aggregate_paginated_response
def get_all_wc_orders(wcapi, page=1):
    """
    Query the WooCommerce REST API for all orders.
    """
    response = wcapi.get(
        "orders",
        params={
            "per_page": WC_MAX_API_RESULT_COUNT,
            "page": page,
        },
    )
    response.raise_for_status()
    return response
orders = get_all_wc_orders(wcapi)

How to send asynchronous request using flask to an endpoint with small timeout session?

I am new to backend development with Flask and I am stuck on a confusing problem. I am trying to send data to an endpoint whose timeout is 3000 ms. My code for the server is as follows.
from flask import Flask, request
from gitStat import getGitStat
import requests

app = Flask(__name__)

@app.route('/', methods=['POST', 'GET'])
def handle_data():
    params = request.args["text"].split(" ")
    user_repo_path = "https://api.github.com/users/{}/repos".format(params[0])
    first_response = requests.get(
        user_repo_path, auth=(
            'Your Github Username', 'Your Github Password'))
    repo_commits_path = "https://api.github.com/repos/{}/{}/commits".format(
        params[0], params[1])
    second_response = requests.get(
        repo_commits_path, auth=(
            'Your Github Username', 'Your Github Password'))
    if (first_response.status_code == 200 and params[2] < params[3]
            and second_response.status_code == 200):
        values = getGitStat(params[0], params[1], params[2], params[3])
        response_url = request.args["response_url"]
        payload = {
            "response_type": "in_channel",
            "text": "Github Repo Commits Status",
            "attachments": [
                {
                    "text": values
                }
            ]
        }
        headers = {'Content-Type': 'application/json',
                   'User-Agent': 'Mozilla/5.0 (Compatible MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)'}
        response = requests.post(response_url, json=payload, headers=headers)
    else:
        return "Please enter correct details. Check if the username or repo name exists, and/or that the starting date < end date. Also, the date format should be MM-DD"
My server code takes arguments from the request it receives, and from that request's JSON object it extracts the parameters for the code. It then executes the getGitStat function and sends the JSON payload defined in the server code to the endpoint it received the request from.
My problem is that I need to send a text confirmation to the endpoint saying that I have received the request and that the data will be coming soon. The problem here is that the getGitStat function takes more than a minute to fetch and parse data from the GitHub API.
I searched the internet and found that I need to make this call asynchronous, and that I can do that using queues. I tried to understand the approach using RQ and RabbitMQ, but I neither understood them nor was able to convert my code to an asynchronous form. Can somebody give me pointers or any idea on how I can achieve this?
Thank you.
------------Update------------
Threading was able to solve this problem: create another thread and call the function in that thread.
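A minimal sketch of that approach (names are illustrative; the slow getGitStat work is moved to a background thread so the request can return within the 3000 ms window):
import threading
from flask import Flask, request

app = Flask(__name__)

def long_running_job(params):
    # the slow work (e.g. getGitStat + posting the payload) goes here
    ...

@app.route('/', methods=['POST', 'GET'])
def handle_data():
    params = request.args["text"].split(" ")
    threading.Thread(target=long_running_job, args=(params,), daemon=True).start()
    # respond immediately; the job keeps running in the background
    return "Request received, data will be coming soon."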
If you are trying to run an async task inside a request, you have to decide whether or not you want its result/progress:
1. You don't care about the result of the task or whether there were any errors while processing it: you can just process it in a thread and forget about the result.
2. You just want to know whether the task succeeded or failed: you can store the state of the task in a database and query it when needed.
3. You want the progress of the task (like 20% done ... 40% done): you have to use something more sophisticated like Celery or RabbitMQ.
For you I think option 2 fits better. You can create a simple table, GitTasks.
GitTasks
------------------------
Id(PK) | Status | Result
------------------------
1 |Processing| -
2 | Done | <your result>
3 | Error | <error details>
You have to create a simple threaded object in Python to do the processing.
import threading

class AsyncGitTask(threading.Thread):
    def __init__(self, task_id, params):
        super().__init__()
        self.task_id = task_id
        self.params = params

    def run(self):
        ## Do processing
        ## store the result in the table for id = self.task_id
        pass
You have to create another endpoint to query the status of your task.
@app.route('/TaskStatus/<int:task_id>')
def task_status(task_id):
    ## query the GitTasks table and return data accordingly
    ...
Now that we have built all the components, we have to put them together in your main request.
from flask import url_for

@app.route('/', methods=['POST', 'GET'])
def handle_data():
    .....
    ## create a new row in the GitTasks table, and use its PK (id) as task_id
    task_id = create_new_task_row()
    async_task = AsyncGitTask(task_id=task_id, params=params)
    async_task.start()
    task_status_url = url_for('task_status', task_id=task_id)
    ## From this request you can return text saying
    ## "Your task is being processed. To see the progress,
    ## go to <task_status_url>"
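create_new_task_row is not defined in the answer; a minimal hypothetical sketch, here using sqlite3 (the database file name is an assumption), might look like this:
import sqlite3

DB_PATH = "git_tasks.db"  # hypothetical database file

def create_new_task_row():
    """Insert a new GitTasks row with status 'Processing' and return its id."""
    conn = sqlite3.connect(DB_PATH)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS GitTasks "
            "(Id INTEGER PRIMARY KEY AUTOINCREMENT, Status TEXT, Result TEXT)"
        )
        cur = conn.execute(
            "INSERT INTO GitTasks (Status, Result) VALUES (?, ?)",
            ("Processing", None),
        )
        conn.commit()
        return cur.lastrowid
    finally:
        conn.close()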

GAE - How can I combine the results of several asynchronous url fetches?

I have a Google App Engine (Python) application where I need to perform 4 to 5 URL fetches and then combine the data before printing it out to the response.
I can do this without any problems using a synchronous workflow, but since the URLs I am fetching are not related to or dependent on each other, doing this asynchronously would be ideal (and quicker).
I have read and re-read the documentation here, but I just can't figure out how to read the contents of each URL. I've also searched the web for a small example (which is really what I need). I have seen this SO question, but again, it doesn't mention anything about reading the contents of the individual asynchronous URL fetches.
Does anyone have any simple examples of how to perform 4 or 5 asynchronous URL fetches with App Engine and then combine the results before printing them to the response?
Here is what I have so far:
rpcs = []
for album in result_object['data']:
    total_facebook_photo_count = total_facebook_photo_count + album['count']
    facebook_albumid_array.append(album['id'])

    # Get the photos in the photo album
    facebook_photos_url = 'https://graph.facebook.com/%s/photos?access_token=%s&limit=1000' % (album['id'], access_token)
    rpc = urlfetch.create_rpc()
    urlfetch.make_fetch_call(rpc, facebook_photos_url)
    rpcs.append(rpc)

for rpc in rpcs:
    result = rpc.get_result()
    self.response.out.write(result.content)
However, it still looks like the line result = rpc.get_result() forces it to wait for the first request to finish, then the second, then the third, and so forth. Is there a way to simply put the results into variables as they are received?
Thanks!
In the example, text = result.content is where you get the content (body).
To do URL fetches in parallel, you can set them up, add them to a list, and check the results afterwards. Expanding on the example already mentioned, it could look something like this:
from google.appengine.api import urlfetch

futures = []
for url in urls:
    rpc = urlfetch.create_rpc()
    urlfetch.make_fetch_call(rpc, url)
    futures.append(rpc)

contents = []
for rpc in futures:
    try:
        result = rpc.get_result()
        if result.status_code == 200:
            contents.append(result.content)
        # ...
    except urlfetch.DownloadError:
        # Request timed out or failed.
        # ...
        pass

concatenated_result = '\n'.join(contents)
In this example, we collect the bodies of all the requests that returned status code 200 and concatenate them with line breaks in between.
Or with ndb, my personal preference for anything async on GAE, something like:
from google.appengine.ext import ndb

@ndb.tasklet
def get_urls(urls):
    ctx = ndb.get_context()
    result = yield map(ctx.urlfetch, urls)
    contents = [r.content for r in result if r.status_code == 200]
    raise ndb.Return('\n'.join(contents))
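Since get_urls is a tasklet, calling it returns a future; yielding the list of urlfetch futures inside it waits for all of them in parallel. Outside of other tasklets, one way to drive it to completion (standard ndb usage, not shown in the original) is:
contents = get_urls(urls).get_result()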
I use this code (implemented before I learned about ndb tasklets):
from google.appengine.api.apiproxy_stub_map import UserRPC

while rpcs:
    rpc = UserRPC.wait_any(rpcs)   # returns whichever RPC finishes first
    result = rpc.get_result()
    # process result here
    rpcs.remove(rpc)
