How to bundle Python for AWS Lambda

I have a project I'd like to run on AWS Lambda, but it exceeds the 50MB zipped limit. Right now it is 128MB zipped, and the project folder with the virtual environment sits at 623MB. The top users of space are:
scipy (~187MB)
pandas (~108MB)
numpy (~74.4MB)
lambda_packages (~71.4MB)
Without the virtualenv the project is <2MB. The requirements.txt is:
click==6.7
cycler==0.10.0
ecdsa==0.13
Flask==0.12.2
Flask-Cors==3.0.3
future==0.16.0
itsdangerous==0.24
Jinja2==2.10
MarkupSafe==1.0
matplotlib==2.1.2
mpmath==1.0.0
numericalunits==1.19
numpy==1.14.0
pandas==0.22.0
pycryptodome==3.4.7
pyparsing==2.2.0
python-dateutil==2.6.1
python-dotenv==0.7.1
python-jose==2.0.2
pytz==2017.3
scipy==1.0.0
six==1.11.0
sympy==1.1.1
Werkzeug==0.14.1
xlrd==1.1.0
I deploy using Zappa, so my understanding of the whole infrastructure is limited. My understanding is that a few of the libraries (e.g. numpy) do not get uploaded at all; instead, Amazon's version, which is already available in that environment, gets used.
I propose the following workflow (without using S3 buckets for slim_handler); a rough sketch of these cleanup steps follows the two lists:
delete all the files that match "test_*.py" in all packages
manually tree-shake scipy (I only use scipy.optimize.minimize) by deleting most of it and re-running my tests
minify all the code and obfuscate using pyminifier
zappa deploy
Or:
run compileall to get .pyc files
delete all *.py files and let zappa upload .pyc files instead
zappa deploy
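A rough sketch of those cleanup steps (the package path is illustrative, and whether Lambda is happy with a .pyc-only bundle is exactly what I'd be testing):
import compileall
import pathlib

root = pathlib.Path("package")

# 1. drop test modules
for test_file in root.rglob("test_*.py"):
    test_file.unlink()

# 2. byte-compile everything in place (legacy layout: foo.pyc next to foo.py)
compileall.compile_dir(str(root), legacy=True, quiet=1)

# 3. remove the sources, keeping only the .pyc files
for src in root.rglob("*.py"):
    src.unlink()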
I've had issues with slim_handler: true: either my connection drops and the upload fails, or some other error occurs and at ~25% of the upload to S3 I get "Could not connect to the endpoint URL". For the purposes of this question, I'd like to get the dependencies down to manageable levels.
Nevertheless, over half a gig of dependencies with the main app being less than 2MB has to be some sort of record.
My questions are:
What is the unzipped limit for AWS? Is it 250MB or 500MB?
Am I on the right track with the above method for reducing package sizes?
Is it possible to go a step further and use .pyz files?
Are there any standard utilities out there that help with the above?
Is there no tree-shaking library for Python?

The limit in AWS for unpacked code is 250MB (as seen here: https://hackernoon.com/exploring-the-aws-lambda-deployment-limits-9a8384b0bec3).
I would suggest going with the second method and compiling everything.
I think you should also consider using the Serverless Framework. It does not force you to create a virtualenv, which is very heavy.
I've seen that all your packages can be compressed down to about 83MB (just the packages).
My workaround would be:
use the Serverless Framework (consider moving from Flask directly to API Gateway)
install your packages locally in the same folder using:
pip install -r requirements.txt -t .
try your method of compiling to .pyc files, and remove the .py sources.
Deploy:
sls deploy
Hope it helps.

Related

aws_cdk python error: Unzipped size must be smaller than 262144000 bytes

I use CDK to deploy a lambda function that uses several python modules.
But I got the following error at the deployment.
Unzipped size must be smaller than 262144000 bytes (Service: AWSLambdaInternal; Status Code: 400; Error Code: InvalidParameterValueException;
I have searched the following other questions related to this issue:
question1
question2
But they focus on serverless.yaml and don't solve my problem.
Is there any way around this problem?
Here is my app.py for CDK.
from aws_cdk import (
    aws_events as events,
    aws_lambda as lam,
    core,
)

class MyStack(core.Stack):
    def __init__(self, app: core.App, id: str) -> None:
        super().__init__(app, id)

        layer = lam.LayerVersion(
            self, "MyLayer",
            code=lam.AssetCode.from_asset('./lib'),
        )
        makeQFn = lam.Function(
            self, "Singleton",
            function_name='makeQ',
            code=lam.AssetCode.from_asset('./code'),
            handler="makeQ.main",
            timeout=core.Duration.seconds(300),
            layers=[layer],
            runtime=lam.Runtime.PYTHON_3_7,
        )

app = core.App()
MyStack(app, "MS")
app.synth()
In the ./lib directory, I put the Python modules like this:
python -m pip install numpy -t lib/python
Edit: better method!
There is now an experimental package, aws_lambda_python_alpha, which you can use to automatically bundle packages listed in a requirements.txt, but I'm unfortunately still running into the same size issue. I'm thinking now of trying layers.
For anyone curious, here's a sample of bundling using aws_lambda_python_alpha:
from aws_cdk import aws_lambda as _lambda  # needed for the Runtime reference below
from aws_cdk import aws_lambda_python_alpha as _lambda_python

self.prediction_lambda = _lambda_python.PythonFunction(
    scope=self,
    id="PredictionLambda",
    # entry points to the directory
    entry="lambda_funcs/APILambda",
    # index is the file name
    index="API_lambda.py",
    # handler is the function entry point name in the lambda.py file
    handler="handler",
    runtime=_lambda.Runtime.PYTHON_3_9,
    # name of function on AWS
    function_name="ExampleAPILambda",
)
Original:
I would suggest checking out the aws-cdk-lambda-asset project, which will help bundle internal project dependencies stored in a requirements.txt file. It works by installing the dependencies specified in the requirements file into a local folder, then bundling them into a zip file, which is then used for the CDK deployment.
For non-Linux environments like Windows/Mac, it will install the dependencies in a Docker image, so first ensure that you have Docker installed and running on your system.
Note that the above code seems to use poetry, which is a dependency management tool. I have no idea what poetry does or why it is used in lieu of a setup.py file. Therefore, I've created a slight modification in a gist here, in case this is of interest; this will install all local dependencies using a regular pip install command instead of the poetry tool, which I'm not too familiar with.
Thanks a lot.
In my case, the issue was solved just by removing all __pycache__ directories in the local modules before deployment.
I hope the situation improves so that we only have to upload requirements.txt instead of preparing all the modules locally.
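For reference, the cleanup step is just something like this (the ./lib path is illustrative; it is equivalent to find lib -type d -name __pycache__ -exec rm -rf {} +):
import pathlib
import shutil

# Remove every __pycache__ directory under the module/layer folder before cdk deploy
for cache_dir in pathlib.Path("lib").rglob("__pycache__"):
    shutil.rmtree(cache_dir, ignore_errors=True)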
Nope, there is no way around that limit in a single setup.
What you have to do instead is install your dependencies into multiple zips that become multiple layers.
Basically, install several dependencies into a python folder, then zip that folder into something like integration_layer. Clear the python folder, install the next set, and name it something else, like data_manipulation.
Then you have two layers in CDK (created with aws_lambda.LayerVersion) and you add those layers to each lambda. You'll have to break the layers up so that each one is small enough.
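A sketch of what that could look like inside the stack from the question above (layer names and asset paths are illustrative):
integration_layer = lam.LayerVersion(
    self, "IntegrationLayer",
    code=lam.AssetCode.from_asset("./layers/integration_layer"),
)
data_layer = lam.LayerVersion(
    self, "DataManipulationLayer",
    code=lam.AssetCode.from_asset("./layers/data_manipulation"),
)
makeQFn = lam.Function(
    self, "Singleton",
    function_name="makeQ",
    code=lam.AssetCode.from_asset("./code"),
    handler="makeQ.main",
    runtime=lam.Runtime.PYTHON_3_7,
    layers=[integration_layer, data_layer],  # each layer must stay under the size limits
)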
You can use a makefile to generate the layers automatically, and then wrap the makefile, cdk deploy, and some cleanup in a bash script to tie it all together.
Note: you are still limited on space with layers, and limited to 5 layers per function. If your dependencies outgrow that, look into Elastic File System. You can install dependencies there and attach that single EFS to any lambda, referencing them through the PYTHONPATH manipulation built into the EFS-Lambda connection.
Basic makefile idea (not exact for CDK, but still close)
(and if you never have, a good tutorial on how to use a makefile)
CDK documentation for Layers (use from AssetCode for the location)
AWS dev blog on EFS for Lambdas

Reducing Python zip size to use with AWS Lambda

I'm following this blog post to create a runtime environment using Docker for use with AWS Lambda. I'm creating a layer for using with Python 3.8:
docker run -v "$PWD":/var/task "lambci/lambda:build-python3.8" /bin/sh -c "pip install -r requirements.txt -t python/lib/python3.8/site-packages/; exit"
And then archiving the layer as zip: zip -9 -r mylayer.zip python
All standard so far. The problem arises in the .zip size, which is > 250mb and so creates the following error in Lambda: Failed to create layer version: Unzipped size must be smaller than 262144000 bytes.
Here's my requirements.txt:
s3fs
scrapy
pandas
requests
I'm including s3fs since I get the following error when trying to save a parquet file to an S3 bucket using pandas: [ERROR] ImportError: Install s3fs to access S3. The problem is that including s3fs massively increases the layer size. Without s3fs the layer is < 200MB unzipped.
My most direct question would be: how can I reduce the layer size to < 250MB while still using Docker and keeping s3fs in my requirements.txt? I can't explain the 50MB+ difference, especially since s3fs is < 100KB on PyPI.
Finally, for those questioning my use of Lambda with Scrapy: my scraper is trivial, and spinning up an EC2 instance would be overkill.
The key idea behind shrinking your layers is to identify what pip installs and what you can get rid of, usually manually.
In your case, since you are only slightly above the limit, I would get rid of pandas/tests. So before you create your zip layer, you can run the following in the layer's folder (mylayer from your past question):
rm -rvf python/lib/python3.8/site-packages/pandas/tests
This should trim your layer below the 262MB limit after unpacking. In my test it is now 244MB.
Alternatively, you can go over the python folder manually and remove any other tests, documentation, examples, etc. that are not needed.
I can't explain the 50MB+ difference, especially since s3fs is < 100KB on PyPI.
That's simple enough to explain. As expected, s3fs has internal dependencies on AWS libraries (botocore in this case). The good news is that boto3 is already included in AWS Lambda (see this link for which libraries are available in Lambda), so you can exclude botocore from your zipped dependencies and save up to ~50MB in total size.
See the above link for more info. These are the libraries you can safely remove from your zipped artifact and still be able to run the code on an AWS Lambda function running Python 3.8:
boto3
botocore
docutils
jmespath
pip
python-dateutil (generates the dateutil package)
s3transfer
setuptools
six (generates six.py)
urllib3 (if needed, bundled dependencies like chardet could also be removed)
You can also use a small script (bash or otherwise; see the sketch after this list) to recursively get rid of the following junk directories that you don't need:
__pycache__
*.dist-info (example: certifi-2021.5.30.dist-info)
tests - only possibly, and I can't confirm this. If you do choose to recursively remove all tests folders, first check whether anything breaks on Lambda, since in rare cases such a package could be imported in code.
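A sketch of that pruning (here in Python rather than bash; the site-packages path matches the layer layout above, and the name lists come from this answer, so adjust them to your case):
import pathlib
import shutil

site = pathlib.Path("python/lib/python3.8/site-packages")

# Libraries already provided by the Lambda Python 3.8 runtime (from the list above)
for lib in ["boto3", "botocore", "docutils", "jmespath", "pip",
            "dateutil", "s3transfer", "setuptools"]:
    shutil.rmtree(site / lib, ignore_errors=True)

# Junk directories: __pycache__, *.dist-info and (carefully) tests
for pattern in ["__pycache__", "*.dist-info", "tests"]:
    for path in site.rglob(pattern):
        if path.is_dir():
            shutil.rmtree(path, ignore_errors=True)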
Do all this and you should easily save around ~60MB in zipped artifact size.

Build and use local package for AWS Lambda using serverless framework

I am trying to package a local Python package¹ and use it within an AWS Lambda function deployed via the Serverless Framework. I already use the serverless-python-requirements plugin to add pip dependencies to the deployed package.
How can I proceed ?
Shall I create a package and zip it? Or generate a .whl file and use pip? And then, how do I deploy it?
¹: I cannot just add it to the "normal" codebase, as I want to share it with other components (Glue jobs, for example).
Here's the solution:
1.
Build a .whl file corresponding to the package using
python setup.py bdist_wheel
within a parent directory.
2.
Add the relative path to this .whl file to the pip requirements file you use (requirements.txt, for instance):
req0==1.0.9
req1==5.5.0
../<relative path to local package>/dist/<package name>-<version>-<details>.whl # generated .whl file's name
3.
serverless-python-requirements will automagically pack this dependency within the deployed archive when doing sls deploy. How cool is that, huh!

How to fix "module 'pg8000' has no attribute 'connect'" error in AWS Glue job

I'm trying to set up a daily AWS Glue job that loads data into a RDS PostgreSQL DB. But I need to truncate my tables before loading data into them, since those jobs work on the whole dataset.
To do this, I'm implementing the solution given here: https://stackoverflow.com/a/50984173/11952393.
It uses the pure Python library pg8000. I followed the guidelines in this SO answer: downloading the library tar, unpacking it, adding the empty __init__.py, zipping the whole thing, uploading the zip file to S3, and adding the S3 URL as a Python library in the AWS Glue job config.
When I run the job, the pg8000 module seems to be imported correctly. But then I get the following error:
AttributeError: module 'pg8000' has no attribute 'connect'
I am most certainly doing something wrong... But can't find what. Any constructive feedback is welcome!
Here is what made it work for me.
Do a pip install of the pg8000 package in a separate location
pip install -t /tmp/ pg8000
You would see 2 directories in the /tmp directory
pg8000
scramp
Zip the above 2 directories separately
cd /tmp/
zip -r pg8000.zip pg8000/
zip -r scramp.zip scramp/
Upload these 2 zip files in an S3 location
While creating the job or the Dev Endpoint mention these 2 zip files in the Python Library Path field
s3://<bucket>/<prefix>/pg8000.zip,s3://<bucket>/<prefix>/scramp.zip
Add
install_requires=['pg8000==1.12.5']
to the setup.py file that generates the .egg file.
You should then be able to access the library.
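For reference, a minimal setup.py sketch along those lines (the package name is illustrative); running python setup.py bdist_egg against it produces the .egg to upload:
from setuptools import setup, find_packages

setup(
    name="my_glue_helpers",  # illustrative name
    version="0.1.0",
    packages=find_packages(),
    install_requires=["pg8000==1.12.5"],
)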

Read Parquet file stored in S3 with AWS Lambda (Python 3)

I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:
https://github.com/lambci/docker-lambda as a container to mock the Amazon environment, because of the native libraries that need to be installed (numpy amongst others).
This procedure to generate a zip file: http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example-deployment-pkg.html#with-s3-example-deployment-pkg-python
Add a test python function to the zip, send it to S3, update the lambda and test it
It seems that there are two possible approaches, which both work locally to the docker container:
fastparquet with s3fs: Unfortunately the unzipped size of the package is bigger than 256MB and therefore I can't update the Lambda code with it.
pyarrow with s3fs: I followed https://github.com/apache/arrow/pull/916 and when executed with the lambda function I get either:
If I prefix the URI with S3 or S3N (as in the code example): In the Lambda environment OSError: Passed non-file path: s3://mybucket/path/to/myfile in pyarrow/parquet.py, line 848. Locally I get IndexError: list index out of range in pyarrow/parquet.py, line 714
If I don't prefix the URI with S3 or S3N: It works locally (I can read the parquet data). In the Lambda environment, I get the same OSError: Passed non-file path: s3://mybucket/path/to/myfile in pyarrow/parquet.py, line 848.
My questions are :
why do I get a different result in my docker container than I do in the Lambda environment?
what is the proper way to give the URI?
is there an accepted way to read Parquet files in S3 through AWS Lambda?
Thanks!
AWS has a project (AWS Data Wrangler) that allows this, with full Lambda Layers support.
In the docs there is a step-by-step guide for doing it.
Code example:
import awswrangler as wr

# Write
wr.s3.to_parquet(
    dataframe=df,
    path="s3://...",
    dataset=True,
    database="my_database",  # Optional, only if you want it available on Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"],
)

# Read
df = wr.s3.read_parquet(path="s3://...")
Reference
I was able to accomplish writing parquet files into S3 using fastparquet. It's a little tricky, but my breakthrough came when I realized that to put together all the dependencies, I had to use the exact same Linux that Lambda uses.
Here's how I did it:
1. Spin up an EC2 instance using the Amazon Linux image that is used with Lambda
Source:
https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html
Linux image:
https://console.aws.amazon.com/ec2/v2/home#Images:visibility=public-images;search=amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2
Note: you might need to install many packages and change the Python version to 3.6, as this Linux image is not meant for development. Here's how I looked for packages:
sudo yum list | grep python3
I installed:
python36.x86_64
python36-devel.x86_64
python36-libs.x86_64
python36-pip.noarch
python36-setuptools.noarch
python36-tools.x86_64
2. Used the instructions from here to build a zip file with all of the dependencies that my script would use, dumping them all in a folder and then zipping them with these commands:
mkdir parquet
cd parquet
pip install -t . fastparquet
pip install -t . (any other dependencies)
copy my Python file into this folder
zip it and upload it to Lambda
Note: there are some constraints I had to work around: Lambda doesn't let you upload a zip larger than 50MB or unzipped content larger than ~260MB. If anyone knows a better way to get dependencies into Lambda, please do share.
Source:
Write parquet from AWS Kinesis firehose to AWS S3
This was an environment issue (Lambda in VPC not getting access to the bucket). Pyarrow is now working.
Hopefully the question itself will give a good-enough overview on how to make all that work.
One can also achieve this through the AWS SAM CLI and Docker (the Docker requirement is explained later).
1. Create a directory and initialize sam
mkdir some_module_layer
cd some_module_layer
sam init
Typing the last command prompts a series of questions. One could choose the following answers (I'm assuming Python 3.7, but other options are possible):
1 - AWS Quick Start Templates
8 - Python 3.7
Project name [sam-app]: some_module_layer
1 - Hello World Example
2. Modify requirements.txt file
cd some_module_layer
vim hello_world/requirements.txt
This will open the requirements.txt file in vim; on Windows you could instead type code hello_world/requirements.txt to edit the file in Visual Studio Code.
3. Add pyarrow to requirements.txt
Alongside pyarrow, it also works to include pandas and s3fs. Including pandas ensures that it recognizes pyarrow as an engine for reading parquet files (see the handler sketch after these steps).
pandas
pyarrow
s3fs
4. Build with a container
Docker is required to use the option --use-container when running the sam build command. If it's the first time, it will pull the lambci/lambda:build-python3.7 Docker image.
sam build --use-container
rm .aws-sam/build/HelloWorldFunction/app.py
rm .aws-sam/build/HelloWorldFunction/__init__.py
rm .aws-sam/build/HelloWorldFunction/requirements.txt
notice that we're keeping only the python libraries.
5. Zip files
cp -r .aws-sam/build/HelloWorldFunction/ python/
zip -r some_module_layer.zip python/
On Windows, it would work to run Compress-Archive python/ some_module_layer.zip.
6. Upload zip file to AWS
The following link is useful for this.
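Once the layer is attached to a function, a minimal handler along these lines should be able to use it (the bucket and key are illustrative; s3fs is what lets pandas resolve the s3:// path):
import pandas as pd

def lambda_handler(event, context):
    # pandas picks up pyarrow as the parquet engine from the layer
    df = pd.read_parquet("s3://my-bucket/path/to/file.parquet", engine="pyarrow")
    return {"rows": len(df)}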
