Spleeter on Google Cloud Functions - python

Im trying to run spleeter libĀ on Google Cloud Functions.
I have a problem. The separate_to_file function is creating pretrained_models files while executing in root folder.
The only directory I can write something is /tmp.
Is there any way to change directory of pretrained models?

You can set the MODEL_DIRECTORY environment variable to the path of the directory you want models to be written into before to run Spleeter. Please note that most models are quite heavy and may require lot of storage.

I was doing the same thing. Fetching the models can be solved by pointing to the /tmp directory, as already pointed out.
Unfortunately, that wasn't the end of it. Spleeter depends on linux binaries that weren't included in serverless enironments. I solved this by deploying a docker image instead of a plain script.
Then there was the issue that spleeter consumes a lot of memory, especially for 4-way and 5-way splits. Google Cloud doesn't offer enough RAM. AWS Lambdas offer 10GB RAM, which is enough to split a regular, radio-friendly song.

Related

How to circumvent AWS package and ephemeral limits for large packages + large models

We have a production scenario with users invoking expensive NLP functions running for short periods of time (say 30s). Because of the high load and intermittent usage, we're looking into Lambda function deployment. However - our packages are big.
I'm trying to fit AllenNLP in a lambda function, which in turn depends on pytorch, scipy, spacy and numpy and a few other libs.
What I've tried
Following recommendations made here and the example here, tests and additional files are removed. I also use a non-cuda version of Pytorch which gets its' size down. I can package an AllenNLP deployment down to about 512mb. Currently, this is still too big for AWS Lambda.
Possible fixes?
I'm wondering if anyone of has experience with one of the following potential pathways:
Cutting PyTorch out of AllenNLP. Without Pytorch, we're in reach of getting it to 250mb. We only need to load archived models in production, but that does seem to use some of the PyTorch infrastructure. Maybe there are alternatives?
Invoking PyTorch in (a fork of) AllenNLP as a second lambda function.
Using S3 to deliver some of the dependencies: SIMlinking some of the larger .so files and serving them from an S3 bucket might help. This does create an additional problem: the Semnatic Role Labelling we're using from AllenNLP also requires some language models of around 500mb, for which the ephemeral storage could be used - but maybe these can be streamed directly into RAM from S3?
Maybe i'm missing an easy solution. Any direction or experiences would be much appreciated!
You could deploy your models to SageMaker inside of AWS, and run Lambda -> Sagemaker to avoid having to load up very large functions inside of a Lambda.
Architecture explained here - https://aws.amazon.com/blogs/machine-learning/call-an-amazon-sagemaker-model-endpoint-using-amazon-api-gateway-and-aws-lambda/

securely executing untrusted python code on server

I have a server and a front end, I would like to get python code from the user in the front end and execute it securely on the backend.
I read this article explaining the problematic aspect of each approach - eg pypl sandbox, docker etc.
I can define the input the code needs, and the output, and can put all of them in a directory, what's important to me is that this code should not be able to hurt my filesystem, and create a disk, memory, cpu overflow (I need to be able to set timeouts, and prevent it from accesing files that are not devoted to it)
What is the best practice here? Docker? Kubernetes? Is there a python module for that?
Thanks.
You can have a python docker container with the volume mount. Volume mount will be the directory in local system for code which will be available in your docker container also. By doing this you have isolated user supplied code only to the container when it runs.
Scan your python container with CIS benchmark for better security

Loading Python libraries via http

I have several small Python libraries that I wrote with stuff that I find myself wanting over and over again. I think most programmers have something similar. I want to use these libraries from a variety of different machines so I've started keeping this stuff in my DropBox. However, I'd like to be able to use my code on machines on which I can't install DropBox or other cloud storage applications, even in portable form. I can just download the files every time one of them changes (DropBox can provide me a URL for each file in my Public folder), which is only a moderate nuisance. But--and I admit this is a longshot--is there a solution out there that will let me tell Python to load a library from my DropBox via http?
BTW, I'd like to add the whole remove folder to my sys.path, but getting a URL for a folder is complicated, so I'm going to try to walk before I run by starting with individual files.
Yes, it's possible. I think you want the combination of two previous questions:
How to download a file in python over HTTP
How to dynamically load a library in python
So your task basically breaks down into writing a little bit of glue code: download the URL via the first bullet, write it to a local file, and then import that file using the second bullet.
So that's how you'd do that.
BUT - please keep in mind that dynamically downloading and executing code has many potential security downfalls. Will you be doing this over a secure connection? Who else has the ability to manipulate that URL? There are a bunch of security issues inherent in downloading and executing code on the fly. I would ask you to consider going about your solution in a different way, but I'm giving you the answer you're asking for.
As a simple security check, you can establish a known-good hash for your file, and then refuse to import any file other than one that's on the list of known-good hashes. This makes it a pain to update your modules, but gives you a little bit of extra safety.
Don't use DropBox as a Revision control
Pick a real solution like Git
Setup access to the Git repository on one of your servers
Clone the repository to your worker machines and checkout master
Create a develop branch where you put every change you make
Test the changes and when you consider any of them stable, merge it to master
On your worker machines set up a cron job which periodically pulls from master branch of repository (and possibly restarts some Python processes as importing the same module again won't make Python interpreter aware of changes since imported modules are cached)
Enjoy your automatically updated workers :)
Don't feel shame - it happens that even experienced software developers come up with XY problem

Protect Python script with Google Drive (perhaps Google App Engine)

I have trouble with understanding if it's possible to protect a python script from stealing it (I know this is not in fact possible but at least let's protect it as much as possible) by uploading the program made in python to Google Drive. I did reasearch such as How do I protect my python code? or Store Python scripts & run them online? and many more or less relevant links. But none of them is really answering me.
Let's say I have a python project made a Windows Executable (.exe) with GUI2Exe. It has GUI which loads images from specific folders etc.
I upload all of that to Google Drive.
Somehow run that program from Google Drive and if you can make the user not to realise it that it is from Google Drive than it is even better (log in interface etc)
I would like to know if one of the next solutions are possible or there is another way:
run the exe directly from Google Drive (oh, naivity, or perhaps there is something there I don't see)
run the exe from Google Drive with the help of another python code which can be on your computer but does nothing more than log in to Google and Run the exe from the right folder or download it to temporary and run it from there automatically.
using Google Drive as Windows Service so I guess you can use it as a simple partition of your computer and run the program from there Here is a better description
perhaps avoiding Google as it is and use some kind of encrypting (though I understood it's not really for python)
Becuase my python program has already more then 20000 lines of code and uses about 20 python libraries and needs to load formats as jpg, png, csv (spreadsheets) (etc.) I don't think that Google App Engine is enough. Though I'm quite a noob here so I can be very very wrong.
I hope I made myself clear and please just give me a lead if you can. The right way to go about this and I do my homework. I would really appreciate it.
Consider renting an EC2 Windows instance. Set it up with all the libraries you'll require.

faking a filesystem / virtual filesystem

I have a web service to which users upload python scripts that are run on a server. Those scripts process files that are on the server and I want them to be able to see only a certain hierarchy of the server's filesystem (best: a temporary folder on which I copy the files I want processed and the scripts).
The server will ultimately be a linux based one but if a solution is also possible on Windows it would be nice to know how.
What I though of is creating a user with restricted access to folders of the FS - ultimately only the folder containing the scripts and files - and launch the python interpreter using this user.
Can someone give me a better alternative? as relying only on this makes me feel insecure, I would like a real sandboxing or virtual FS feature where I could run safely untrusted code.
Either a chroot jail or a higher-order security mechanism such as SELinux can be used to restrict access to specific resources.
You are probably best to use a virtual machine like VirtualBox or VMware (perhaps even creating one per user/session).
That will allow you some control over other resources such as memory and network as well as disk
The only python that I know of that has such features built in is the one on Google App Engine. That may be a workable alternative for you too.
This is inherently insecure software. By letting users upload scripts you are introducing a remote code execution vulnerability. You have more to worry about than just modifying files, whats stopping the python script from accessing the network or other resources?
To solve this problem you need to use a sandbox. To better harden the system you can use a layered security approach.
The first layer, and the most important layer is a python sandbox. User supplied scripts will be executed within a python sandbox. This will give you the fine grained limitations that you need. Then, the entire python app should run within its own dedicated chroot. I highly recommend using the grsecurity kernel modules which improve the strength of any chroot. For instance a grsecuirty chroot cannot be broken unless the attacker can rip a hole into kernel land which is very difficult to do these days. Make sure your kernel is up to date.
The end result is that you are trying to limit the resources that an attacker's script has. Layers are a proven approach to security, as long as the layers are different enough such that the same attack won't break both of them. You want to isolate the script form the rest of the system as much as possible. Any resources that are shared are also paths for an attacker.

Categories

Resources