Suppressing multi-threading in used libraries? - python

EDIT:
I ended up using a workaround to get the behaviour I wanted.
Disabling threading in the SSHTunnel as suggested in the accepted answer helped me pin down the problem.
I have a Python project that does a few things, mostly ETL.
It works fine when I run it locally, and works fine when I put it into a Docker container and run that locally, but it deadlocks about 80% of the way through when I run that same container in the cloud.
When I manually kill the process I get the error linked below, suggesting it is a threading issue. I'm not explicitly using threading anywhere in my code (and am no expert on the subject) and assume it's one of the libraries I'm using employing threading internally.
The idea I had to resolve this problem is to somehow suppress all threading that is happening in the function calls of the libraries I use.
Is there a catch-all way to do that in Python?
Steps of the program include moving PostgreSQL data into Google BigQuery, then fetching data from BigQuery (including the new data), creating an Excel report out of that data and emailing it out.
Pandas' data frames are used for the internal representation and easy upload to GBQ using the to_gbq method.
sqlalchemy and sshtunnel are used to extract data from the PostgreSQL database.
Openpyxl is used for the Excel editing.
The whole thing takes less than a minute to run locally (either in- or outside of a docker container) and manually calling each of the steps separately on the server also works fine.
(The referenced cloud deployment is on a Google Cloud VM instance)

I can't think of any way to globally disable threading; at least not without breaking every piece of code that would use it.
Judging by the traceback, I assume you are using SSHTunnelForwarder from the sshtunnel package. This class takes a boolean argument threaded with True as a default value.
Instantiating SSHTunnelForwarder with threaded=False will disable the use of the _ThreadingForwardServer in favor of the _ForwardServer. This forward server is not using the socketserver.ThreadingMixIn, which is where your block seems to be surfacing. So, that should fix your problem.
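As a minimal sketch of that suggestion, the tunnel could be set up roughly as below. The host name, user name, key path and database address are hypothetical placeholders, and the helper functions tunnel_options and open_tunnel are made up for illustration; only SSHTunnelForwarder and its threaded argument come from the sshtunnel package.

```python
def tunnel_options(ssh_host, ssh_user, ssh_key_path,
                   db_host="127.0.0.1", db_port=5432):
    """Build keyword arguments for SSHTunnelForwarder with threading disabled.

    All values passed in are placeholders chosen by the caller (assumptions).
    """
    return {
        "ssh_address_or_host": (ssh_host, 22),
        "ssh_username": ssh_user,
        "ssh_pkey": ssh_key_path,
        "remote_bind_address": (db_host, db_port),
        # The point of the sketch: do not use the threaded forward server.
        "threaded": False,
    }


def open_tunnel(options):
    """Create the forwarder; sshtunnel is imported lazily, only when needed."""
    from sshtunnel import SSHTunnelForwarder
    return SSHTunnelForwarder(**options)


# Hypothetical connection details, for illustration only.
options = tunnel_options("bastion.example.com", "etl", "/path/to/private_key")

# Typical use (needs a reachable SSH host, so it is left commented out):
# with open_tunnel(options) as tunnel:
#     port = tunnel.local_bind_port  # point the sqlalchemy URL at this port
```

One trade-off to keep in mind: without the threading mix-in the forward server handles requests one after another, so code that opens several database connections through the tunnel at once may behave differently.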
However, I'd be curious to know why your project blocks in the cloud context. Judging by the output in your screenshot, the whole thing seems to be almost complete and just hangs when shutting down the tunnel forwarder. The maintainers of the sshtunnel package surely made the use of threading a default for a reason. I'd want to stick to that default if in any way possible, but that's just me :)
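To find out which library is actually starting threads, one option is to list the live threads just before the point where the program hangs. This is only a diagnostic sketch using the standard threading module; the helper name live_worker_threads and the demo thread are made up for illustration.

```python
import threading


def live_worker_threads():
    """Return (name, daemon) for every live thread except the main thread."""
    main = threading.main_thread()
    return [(t.name, t.daemon) for t in threading.enumerate() if t is not main]


# Demo: start a thread that waits on an event, then take a snapshot.
stop = threading.Event()
helper = threading.Thread(target=stop.wait, name="demo-worker", daemon=True)
helper.start()

snapshot = live_worker_threads()

# Let the demo thread finish so nothing is left running.
stop.set()
helper.join()
```

In the real program, calling live_worker_threads() right before closing the tunnel would show whether any forwarder threads are still alive at that point.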

Related

Cloud Run: how do I set check_same_thread=False?

My project which runs in Cloud Run of Google Cloud Platform (GCP) has generated errors: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 68387105408768 and this is thread id 68386614675200. for hours before it went back to normal by itself.
Our code is written in Python with flask & no SQLite is involved. Saw suggestions to set check_same_thread to False. May I know where I can set this in Cloud Run or GCP? Thanks.
That setting has nothing to do with your runtime environment, but is set during the connection initialization with sqlite (https://docs.python.org/3/library/sqlite3.html#module-functions), so if you claim that you aren't creating an sqlite connection that won't help you much.
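For reference, this is roughly what setting the flag looks like at connection time. It is a minimal sketch with an in-memory database and a made-up table, not code from the project in question.

```python
import sqlite3
import threading

# check_same_thread is an argument of sqlite3.connect, not a platform setting.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE t (x INTEGER)")


def insert_from_worker():
    # With the default check_same_thread=True this call would raise
    # sqlite3.ProgrammingError, because conn was created in another thread.
    conn.execute("INSERT INTO t VALUES (1)")
    conn.commit()


worker = threading.Thread(target=insert_from_worker)
worker.start()
worker.join()

rows = conn.execute("SELECT x FROM t").fetchall()
conn.close()
```

Disabling the check only removes the guard; the application is then responsible for making sure two threads do not use the connection at the same time.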
That being said, I find it hard to believe that you are getting that error without using sqlite. More likely is that you are using sqlite via some dependency.
Since sqlite3 is part of Python's standard library, however, it might not be trivial to figure out which dependency uses it.

How to limit python script so that it can't access local resources?

I am working on a project that allows users to upload a python script to an API and run it on a schedule. Currently, I'm trying to figure out a way to limit the functionality of the script so that it cannot access local files, mess with the flask server running the API, etc. Do you have any ideas on how I can achieve this? Is there any way to make it so only specific libraries are available for importing?
Running other people's scripts on your server is a serious security issue. If you are trying to expose a Python interpreter in your web application, you can try something like judge0 (available on GitHub). It is free if you deploy it yourself, and it will run scripts safely inside containers.
The simplest way is to ensure the user running the script is not root, but a user specifically designed for this task (e.g. part of a group that can only read and not write or execute). This means at minimum you should ensure all files have the appropriate mode. Then you can just use a pipe or something to run the script.
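A rough sketch of the "separate user plus a pipe" idea with the standard subprocess module is shown below. The function name run_untrusted and the default timeout are made up for illustration. The user argument needs Python 3.9+ on a POSIX system, the parent process must be allowed to switch to that account, and the script file must be readable by it. On its own this is not a complete sandbox; file permissions for that account still have to be set up as described above.

```python
import os
import subprocess
import sys


def run_untrusted(script_path, timeout=5, user=None):
    """Run a script in a child interpreter and capture its output via pipes.

    user: optional account name or uid to run the child as (assumption:
    Python 3.9+, POSIX, and a parent with permission to switch users).
    Raises subprocess.TimeoutExpired if the script runs too long.
    """
    extra = {"user": user} if user is not None else {}
    return subprocess.run(
        # -I is isolated mode: ignores PYTHON* environment variables and
        # the user site-packages directory.
        [sys.executable, "-I", os.path.abspath(script_path)],
        stdin=subprocess.DEVNULL,   # the script gets no input from the server
        capture_output=True,        # stdout and stderr come back through pipes
        text=True,
        timeout=timeout,
        **extra,
    )
```

The returned object carries returncode, stdout and stderr, which the API can store or pass back to the user who uploaded the script.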
Alternatively, you could use a runtime that’s not “local”, like a VM or a compute service (AWS Lambda, etc.). The latter would be simplest, and there are lots of vendors who offer compute services with a programmatic API.

Slow page loading on apache when using Flask

The Issue
I am using my laptop with Apache as a server for a local project involving TensorFlow and Python. An API written in Flask services GET and POST requests coming from an app and possibly another user on the local network. The problem is that the initial page keeps loading when I import tensorflow, or the object detection folder within the research folder of the TensorFlow GitHub repository, and it never seems to finish, effectively getting stuck. I suspect the issue has to do with the packages being large in size, but I didn't have any issue with that when running the application on the development server provided with Flask.
Are there any pointers that I should look for when trying to solve this issue? I checked the memory usage, and it doesn't seem to be rising substantially, as well as the CPU usage.
Debugging process
I am able to print basic hello world to the root page quite quickly, but I isolated the issue to the point when the importing takes place where it gets stuck.
The only thing I can think of is to limit the number of threads that are launched, but when I limited the number of threads per child to 5 and number of connections to 5 in the httpd-mpm.conf file, it didn't help.
The error/access logs don't provide much insight to the matter.
A few notes:
Thus far, I used Flask's development server with multi-threading enabled to serve those requests, but I found it to be prone to crashing after 5 minutes of continuous run, so I am now trying to use Apache using the wsgi interface in order to use Python scripts.
I should also note that I am not servicing html files, just basic GET and POST requests. I am just viewing them using the browser.
If it helps, I also don't use virtual environments.
I am using Windows 10, Apache 2.4 and mod_wsgi 4.5.24
The tensorflow module is a C extension module, and it may not be implemented in a way that works properly in Python sub-interpreters. To work around this, force your application to run in the main Python interpreter context by setting WSGIApplicationGroup %{GLOBAL} in the mod_wsgi configuration. Details in:
http://modwsgi.readthedocs.io/en/develop/user-guides/application-issues.html#python-simplified-gil-state-api

Forbid Python from writing anything to disk

Are there any command-line options or configurations that forbid Python from writing to disk?
I know I can hack open but it doesn't sound very safe.
I've hosted some Python tutorials I wrote myself on my website for friends who want to learn Python, and I want them to have access to a Python console so they can try as they learn. This is done by creating a Python subprocess from the http server.
However, I do not want them to accidentally or intentionally damage my server, so I need to forbid the Python process from writing anything to disk.
Also I'm running the server on Ubuntu Linux so doing it Python-wise or system-wise are both OK.
I doubt there's a way to do this in the interpreter itself: there are way too many things to patch (open, subprocess, os.system, file, and probably others). I'd suggest looking into a way of containerizing the python runtime via something like Docker. The containerization gives some guarantees restricting access, though not as much as virtualization. See here for more discussion about the security implications.
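To illustrate why patching open alone is not enough, the sketch below disables the built-in open and then writes a file anyway through os.open. The file name and the replacement function blocked_open are made up for the demonstration.

```python
import builtins
import os
import tempfile


def blocked_open(*args, **kwargs):
    raise PermissionError("open() is disabled")


original_open = builtins.open
builtins.open = blocked_open           # the naive "hack open" approach
try:
    path = os.path.join(tempfile.mkdtemp(), "bypass.txt")

    # The patched built-in refuses to open the file...
    try:
        builtins.open(path, "w")
        builtin_blocked = False
    except PermissionError:
        builtin_blocked = True

    # ...but the lower-level os functions are untouched and still write.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT)
    os.write(fd, b"still written")
    os.close(fd)
    bypass_worked = os.path.getsize(path) == len(b"still written")
finally:
    builtins.open = original_open      # always restore the real open
```

The same applies to subprocess, os.system and other routes, which is why restricting the process at the operating-system or container level is the more reliable approach.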
Running a jupyter/ipython notebook in the docker container would probably be the easiest way to expose a web-frontend. jupyter provides a collection of docker containers for this purpose: see https://github.com/jupyter/tmpnb and https://github.com/jupyter/docker-stacks

Tracing Memory Leaks in Python using Dowser

I am running some tests nightly on a VM with a CentOS operating system. Recently the tests have been taking up all the available memory and nearly all the swap on the machine. I assigned the VM twice as much memory and it's still happening, which results in the VM's physical host machine dying. These tests previously ran without needing even half as much memory, so I need some form of Python memory analyzer to investigate what is going on.
I've looked at Pysizer and Heapy -- but after research Dowser seems to be the one I'm after as it requires zero changes to code.
So far, from the documentation and googling, I've got this code in its own class:
import cherrypy
import dowser
class MemoryAnalyzer:
    def memoryCheck(self):
        cherrypy.config.update({'server.socket_port': 8080})
        cherrypy.tree.mount(dowser.Root())
        cherrypy.engine.start()
I was hoping this would bring up the web interface shown in the documentation to track all instances of Python running on the host, but it doesn't work. I was also confused by this line in the documentation:
'python dowser __init__.py'.
Is it possible to just run this? I get the error :
/usr/bin/python: can't find '__main__.py' in 'dowser'
Can dowser run independently of my test suite on the VM? Or will I have to add the code above to the main class that runs my tests in order to trace instances of Python?
Dowser is meant to be run as part of your application. Therefore, wherever you initialize the application, add the lines
import dowser
cherrypy.tree.mount(dowser.Root(), '/dowser')
Then you can browse to http://localhost:8080/dowser to view the dowser interface.
Note that the invocation you quoted from the documentation is for testing dowser. The correct invocation for that is python dowser/__init__.py.
Managed to get dowser to work using this blog post, http://www.aminus.org/blogs/index.php/2008/06/11/tracking-memory-leaks-with-dowser?blog=2, and by changing the port to 8088 instead of 8080 (8080 wasn't in use on the machine, but it still didn't work with it!).
