Introduction
I'm new to AWS and I'm trying to run Jellyfish on an AWS Lambda function with a Python container image built from an AWS base image.
Context
I want to achieve this by installing the software from source in my Python container image (built from an AWS base image) and then uploading the image to Amazon ECR so it can later be used by a Lambda function.
My AWS architecture is:
biome-test-bucket-out (AWS S3 bucket with trigger) -> Lambda function -> biome-test-bucket-jf (AWS S3 bucket)
First, to install it from source I downloaded the latest release .tar.gz file locally, then uncompressed it, and copied the contents into the container.
My Dockerfile looks like this:
FROM public.ecr.aws/lambda/python:3.9
# Copy contents from latest release
WORKDIR /jellyfish-2.3.0
COPY ./jellyfish-2.3.0/ .
WORKDIR /
# Installing Jellyfish dependencies
RUN yum update -y
RUN yum install -y gcc-c++
RUN yum install -y make
# Jellyfish installation (in /bin)
RUN jellyfish-2.3.0/configure --prefix=/bin
RUN make -j 4
RUN make install
RUN chmod -R 777 /bin
# Copy function code
COPY app.py ${LAMBDA_TASK_ROOT}
CMD [ "app.lambda_handler" ]
The installation prefix is /bin because it's in $PATH, so that way I can just run the "jellyfish ..." command directly.
My app.py looks like this:
import subprocess
import boto3
import logging
from pathlib import Path, PurePosixPath
s3 = boto3.client('s3')
print('Loading function')
def lambda_handler(event, context):
    # Declare buckets & get name of file
    bucket_in = "biome-test-bucket-out"
    bucket_out = "biome-test-bucket-jf"
    key = event["Records"][0]["s3"]["object"]["key"]

    # Paths where files will be stored
    input_file_path = f"/tmp/{key}"
    file = Path(key).stem
    output_file_path = f"/tmp/{file}.jf"

    # Download file
    with open(input_file_path, 'wb') as f:
        s3.download_fileobj(bucket_in, key, f)

    # Run jellyfish with downloaded file
    command = f"/bin/jellyfish count -C -s 100M -m 20 -t 1 {input_file_path} -o {output_file_path}"
    logging.info(subprocess.check_output(command, shell=True))

    # Upload file to bucket
    try:
        with open(output_file_path, 'rb') as f:
            p = PurePosixPath(output_file_path).name
            s3.upload_fileobj(f, bucket_out, p)
    except Exception as e:
        logging.error(e)
        return False
    return 0
Problem
If I build and run the image locally, everything works fine, but once the image runs in Lambda I get this error:
/usr/bin/ld: cannot open output file /bin/.libs/12-lt-jellyfish: Read-only file system
That file isn't there after the installation, so I'm guessing that Jellyfish creates a new file in /bin/.libs/ when it runs, and that location is read-only at runtime. I'm not sure how to tackle this, any ideas?
Thank you.
Related
I typically install packages in EMR through Spark's install_pypi_package method. This limits where I can install packages from. How can I install a package from a specific GitHub branch? Is there a way I can do this through the install_pypi_package method?
If you have access to the cluster creation step, you can install the package from GitHub using pip in a bootstrap action. (install_pypi_package is only needed when the cluster is already running; at that point packages might not resolve on all nodes.)
Installing before the cluster is running:
A simple example of a bootstrap file (e.g. download.sh) that installs from GitHub using pip (append @<branch-name> to the URL to install from a specific branch):
#!/bin/bash
sudo pip install git+<your-repo>.git@<branch-name>
Then you can reference this script as a bootstrap action when creating the cluster:
aws emr create-cluster --name "Test cluster" \
    --bootstrap-actions Path="s3://elasticmapreduce/bootstrap-actions/download.sh"
Or you can use pip3 in the bootstrap script:
sudo pip3 install git+<your-repo>.git@<branch-name>
Or just clone the repo and build it locally on EMR with its setup.py file:
#!/bin/bash
git clone <your-repo>.git
cd <your-repo>
sudo python setup.py install
After the cluster is running (complex and not recommended)
If you still want to install or build a custom package when the cluster is already running, AWS has some explanation here that uses AWS-RunShellScript to install the package on all core nodes. It says:
(I) Install the package on the master node (e.g. pip install on the running cluster via a shell or a Jupyter notebook on top of it).
(II) Run the following script locally on EMR, passing the cluster ID and the bootstrap script path (e.g. download.sh above) as arguments.
import argparse
import time
import boto3

def install_libraries_on_core_nodes(
        cluster_id, script_path, emr_client, ssm_client):
    """
    Copies and runs a shell script on the core nodes in the cluster.

    :param cluster_id: The ID of the cluster.
    :param script_path: The path to the script, typically an Amazon S3 object URL.
    :param emr_client: The Boto3 Amazon EMR client.
    :param ssm_client: The Boto3 AWS Systems Manager client.
    """
    core_nodes = emr_client.list_instances(
        ClusterId=cluster_id, InstanceGroupTypes=['CORE'])['Instances']
    core_instance_ids = [node['Ec2InstanceId'] for node in core_nodes]
    print(f"Found core instances: {core_instance_ids}.")

    commands = [
        # Copy the shell script from Amazon S3 to each node instance.
        f"aws s3 cp {script_path} /home/hadoop",
        # Run the shell script to install libraries on each node instance.
        "bash /home/hadoop/install_libraries.sh"]
    for command in commands:
        print(f"Sending '{command}' to core instances...")
        command_id = ssm_client.send_command(
            InstanceIds=core_instance_ids,
            DocumentName='AWS-RunShellScript',
            Parameters={"commands": [command]},
            TimeoutSeconds=3600)['Command']['CommandId']
        while True:
            # Verify the previous step succeeded before running the next step.
            cmd_result = ssm_client.list_commands(
                CommandId=command_id)['Commands'][0]
            if cmd_result['StatusDetails'] == 'Success':
                print("Command succeeded.")
                break
            elif cmd_result['StatusDetails'] in ['Pending', 'InProgress']:
                print(f"Command status is {cmd_result['StatusDetails']}, waiting...")
                time.sleep(10)
            else:
                print(f"Command status is {cmd_result['StatusDetails']}, quitting.")
                raise RuntimeError(
                    f"Command {command} failed to run. "
                    f"Details: {cmd_result['StatusDetails']}")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('cluster_id', help="The ID of the cluster.")
    parser.add_argument('script_path', help="The path to the script in Amazon S3.")
    args = parser.parse_args()
    emr_client = boto3.client('emr')
    ssm_client = boto3.client('ssm')
    install_libraries_on_core_nodes(
        args.cluster_id, args.script_path, emr_client, ssm_client)

if __name__ == '__main__':
    main()
I am trying to deploy a function that converts strings into vectors to AWS Lambda:
from functools import lru_cache

def _encode(text: str):
    [<main functionality>]

@lru_cache
def encode(text: str):
    return _encode(text)

def handler(event, context):
    return encode(event["text"])
This function works as expected when I call it in the Python shell:
import app
app.handler({"text": "doc"}, None)
<expected result>
The encode() function is actually complex and requires external dependencies (>1 GB), which is why I am going for the Docker image approach described in the documentation. This is my Dockerfile:
FROM amazon/aws-lambda-python:3.9
# Install requirements
RUN python -m pip install --upgrade pip
COPY requirements.txt ${LAMBDA_TASK_ROOT}
RUN python -m pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"
# Copy app
COPY app.py ${LAMBDA_TASK_ROOT}
CMD [ "app.handler" ]
Build the Docker image and run it:
$ docker build -t my-image:latest .
[...]
Successfully built xxxx
Successfully tagged my-image:latest
$ docker run -p 9000:8080 my-image:latest
time="2021-10-07T10:12:13.165" level=info msg="exec '/var/runtime/bootstrap' (cwd=/var/task, handler=)"
So I test the function locally with curl, following the testing documentation, and it succeeds.
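For reference, here is a rough sketch of that local test using Python instead of curl; it assumes the container is running with -p 9000:8080 as above and posts the test event to the Runtime Interface Emulator path documented for the AWS base images:

# Sketch: invoke the locally running container through the Lambda
# Runtime Interface Emulator bundled with the AWS base images.
import json
import urllib.request

# Port 9000 on localhost is mapped to the emulator inside the container.
url = "http://localhost:9000/2015-03-31/functions/function/invocations"
payload = json.dumps({"text": "doc"}).encode("utf-8")

request = urllib.request.Request(url, data=payload, method="POST")
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))  # the handler's return value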
I've pushed the image to Amazon ECR and created a Lambda function. I've created a test event in the AWS Lambda console:
{
"text": "doc"
}
When I run the test in the AWS Lambda console, it times out, even after increasing the timeout to 60 s. Locally, the function takes less than one second to execute.
On AWS Lambda, only the timeout is logged, and I don't see how to get to the root cause. How can I debug this? How can I get more useful logs?
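As a hedged illustration of pulling whatever the function did manage to log, here is a boto3 sketch that reads the function's CloudWatch log group; the function name my-function is a placeholder and the log group name assumes the default /aws/lambda/<name> convention:

# Sketch: read recent CloudWatch log events for the function with boto3.
# "my-function" is a placeholder; adjust region/credentials as needed.
import boto3

logs = boto3.client("logs")
log_group = "/aws/lambda/my-function"  # default Lambda log group naming

response = logs.filter_log_events(logGroupName=log_group, limit=50)
for event in response["events"]:
    print(event["message"], end="")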
I have local Python code which GPG-encrypts a file. I need to convert this to an AWS Lambda function, triggered once a file has been added to an S3 bucket.
My local code
import os
import os.path
import time
import sys
import gnupg

gpg = gnupg.GPG(gnupghome='/home/ec2-user/.gnupg')
path = '/home/ec2-user/2021/05/28/'
ptfile = sys.argv[1]

with open(path + ptfile, 'rb') as f:
    status = gpg.encrypt_file(f, recipients=['user@email.com'], output=path + ptfile + ".gpg")

print(status.ok)
print(status.stderr)
This works great when I execute this file as python3 encrypt.py file.csv and the result is file.csv.gpg
I'm trying to move this to AWS Lambda, invoked when a file.csv is uploaded to S3.
import json
import urllib.parse
import boto3
import gnupg
import os
import os.path
import time
s3 = boto3.client('s3')
def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        gpg = gnupg.GPG(gnupghome='/.gnupg')
        ind = key.rfind('/')
        ptfile = key[ind + 1:]
        with open(ptfile, 'rb') as f:
            status = gpg.encrypt_file(f, recipients=['email@company.com'], output=ptfile + ".gpg")
        print(status.ok)
        print(status.stderr)
My AWS Lambda code zip created a folder structure in AWS
The error I see at runtime is:
[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function': No module named 'gnupg'
Traceback (most recent call last):
You can create a gpg binary suitable for use by python-gnupg on AWS Lambda from the GnuPG 1.4 source. You will need
GCC and associated tools (sudo yum install -y gcc make glibc-static on Amazon Linux 2)
pip
zip
After downloading the GnuPG source package and verifying its signature, build the binary with
$ tar xjf gnupg-1.4.23.tar.bz2
$ cd gnupg-1.4.23
$ ./configure
$ make CFLAGS='-static'
$ cp g10/gpg /path/to/your/lambda/
You will also need the gnupg.py module from python-gnupg, which you can fetch using pip:
$ cd /path/to/your/lambda/
$ pip install -t . python-gnupg
Your Lambda’s source structure will now look something like this:
.
├── gnupg.py
├── gpg
└── lambda_function.py
Update your function to pass the location of the gpg binary to the python-gnupg constructor:
gpg = gnupg.GPG(gnupghome='/.gnupg', gpgbinary='./gpg')
Use zip to package the Lambda function:
$ chmod o+r gnupg.py lambda_function.py
$ chmod o+rx gpg
$ zip lambda_function.zip gnupg.py gpg lambda_function.py
Since there are system dependencies required to use gpg within Python (i.e. GnuPG itself), you will need to build your Lambda code using the container runtime environment: https://docs.aws.amazon.com/lambda/latest/dg/lambda-images.html
Using Docker will allow you to install the underlying system dependencies, as well as import your keys.
The Dockerfile will look something like this:
FROM public.ecr.aws/lambda/python:3.8
# The base image is Amazon Linux, so use yum (not apt-get) to install GnuPG
RUN yum install -y gnupg2
# Copy handler file, requirements and keys
COPY app.py requirements.txt <path-to-keys> ./
# Add keys to gpg
RUN gpg --import <path-to-private-key>
RUN gpg --import <path-to-public-key>
# Install Python dependencies
RUN pip3 install -r requirements.txt
CMD ["app.lambda_handler"]
app.py would be your Lambda code. Feel free to copy any other necessary files besides the main Lambda handler.
Once the container image is built and uploaded, the Lambda function can use the image (including all of its dependencies). The Lambda code will run within the containerized environment, which contains both GnuPG and your imported keys.
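Purely as an illustration of what the handler inside such an image might look like (adapted from the question's code, not this answer's exact implementation; the bucket name, recipient address, and availability of the build-time keyring to the runtime user are all assumptions):

# Sketch of a handler for this container setup (placeholder names throughout).
import urllib.parse
import boto3
import gnupg

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    filename = key.rsplit('/', 1)[-1]

    # /tmp is the only writable path inside the Lambda environment.
    local_path = f"/tmp/{filename}"
    encrypted_path = local_path + ".gpg"
    s3.download_file(bucket, key, local_path)

    gpg = gnupg.GPG()  # assumes the keys imported at build time are readable
    with open(local_path, 'rb') as f:
        status = gpg.encrypt_file(f, recipients=['user@example.com'],
                                  output=encrypted_path)
    if not status.ok:
        raise RuntimeError(status.stderr)

    # Upload the encrypted file to a (placeholder) output bucket.
    s3.upload_file(encrypted_path, 'my-output-bucket', filename + ".gpg")
    return {"ok": True, "key": filename + ".gpg"}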
Resources:
https://docs.aws.amazon.com/lambda/latest/dg/python-image.html
https://docs.aws.amazon.com/lambda/latest/dg/lambda-images.html
https://medium.com/@julianespinel/how-to-use-python-gnupg-to-decrypt-a-file-into-a-docker-container-8c4fb05a0593
The best way to do this is to add a Lambda layer to your Python Lambda.
You need to make a virtual environment in which you pip install gnupg, and then put all the installed Python packages in a zip file, which you upload to AWS as a Lambda layer. This Lambda layer can then be used in all Lambdas where you need gnupg. To create the Lambda layer you basically do:
python3.9 -m venv my_venv
./my_venv/bin/pip3.9 install gnupg
cp -r ./my_venv/lib/python3.9/site-packages/ python
zip -r lambda_layer.zip python
The Python version above has to match the Python runtime of your Lambda function.
If you don't want to use layers, you can additionally add your keyring and handler to the same zip:
zip -r lambda_layer.zip ./.gnupg
zip lambda_layer.zip lambda_function.py
And you get a zip file that you can use as a Lambda deployment package.
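A small boto3 sketch (not part of the original answer) of uploading that zip as the function's deployment package; the function name is a placeholder:

# Sketch: push the zip as the function's deployment package (placeholder name).
import boto3

lambda_client = boto3.client("lambda")

with open("lambda_layer.zip", "rb") as f:
    lambda_client.update_function_code(
        FunctionName="my-function",  # hypothetical function name
        ZipFile=f.read(),
    )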
gpg is now already installed in public.ecr.aws/lambda/python:3.8.
However, despite that, it does not seem to be available from Lambda, so you still need to get the gpg executable into the Lambda environment.
I did it using a Docker image.
My Dockerfile is just:
FROM public.ecr.aws/lambda/python:3.8
COPY .venv/lib/python3.8/site-packages/ ./
COPY test_gpg.py .
CMD ["test_gpg.lambda_handler"]
.venv is the directory with my Python virtualenv containing the Python packages I need.
The python-gnupg package requires a working installation of the gpg executable, as mentioned in the Deployment Requirements section of its official docs; I have yet to find a way to access a gpg executable from Lambda.
I ended up using a Docker image, as the library is already available in one of the Amazon Linux or Lambda Python base images provided by AWS, which you can find here: https://gallery.ecr.aws/lambda/python (the image tags tab lists all the Python versions, so pick the one matching your requirements).
You will need to create the following 3 files in your dev environment:
Dockerfile
requirements.txt
lambda script
The requirements.txt contains python-gnupg for the import, plus all the other libraries you need:
boto3==1.15.11 # via -r requirements.in
urllib3==1.25.10 # via botocore
python-gnupg==0.5.0 # required for gpg encryption
This is the Dockerfile:
# Python 3.9 lambda base image
FROM public.ecr.aws/lambda/python:3.9
# Install pip so we can install the requirements
RUN yum install python-pip -y
# Copy requirements.txt file locally
COPY requirements.txt ./
# Install dependencies into current directory
RUN python3.9 -m pip install -r requirements.txt
# Copy lambda file locally
COPY s3_to_sftp_batch.py .
# Define handler file name
CMD ["s3_to_sftp_batch.on_trigger_event_test"]
Then inside your lambda code add:
# define library path to point system libraries
os.environ["LD_LIBRARY_PATH"] = "/usr/bin/"
# create instance of GPG class and specify path that contains gpg binary
gpg = gnupg.GPG(gnupghome='/tmp', gpgbinary='/usr/bin/gpg')
Save these files, then go to Amazon ECR and create a private repository. Open the repository, click View push commands in the top right corner, and run those commands to push your image to AWS.
Finally, create your Lambda function using the container image.
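For illustration only, a boto3 sketch of that last step; the function name, role ARN, and image URI below are placeholders:

# Sketch: create a Lambda function from a container image in ECR.
# All names/ARNs below are placeholders.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_function(
    FunctionName="s3-to-sftp-batch",  # placeholder
    Role="arn:aws:iam::123456789012:role/my-lambda-role",  # placeholder
    PackageType="Image",
    Code={"ImageUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-repo:latest"},
    Timeout=120,
    MemorySize=512,
)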
I want to run my Python program on IBM Cloud Functions; because of dependencies, this needs to be done in an OpenWhisk Docker image. I've changed my code so that it accepts a JSON argument:
json_input = json.loads(sys.argv[1])
INSTANCE_NAME = json_input['INSTANCE_NAME']
I can run it from the terminal:
python main/main.py '{"INSTANCE_NAME": "example"}'
I've added this Python program to OpenWhisk with this Dockerfile:
# Dockerfile for example whisk docker action
FROM openwhisk/dockerskeleton
ENV FLASK_PROXY_PORT 8080
### Add source file(s)
ADD requirements.txt /action/requirements.txt
RUN cd /action; pip install -r requirements.txt
# Move the source files to /action
ADD ./main /action
# Rename our executable Python action
ADD /main/main.py /action/exec
CMD ["/bin/bash", "-c", "cd actionProxy && python -u actionproxy.py"]
But now if I run it using the IBM Cloud CLI, I just get my JSON back:
ibmcloud fn action invoke --result e2t-whisk --param-file ../env_var.json
# {"INSTANCE_NAME": "example"}
And if I run it from the IBM Cloud Functions website with the same JSON input, I get an error as if the parameter isn't even there:
stderr: INSTANCE_NAME = json_input['INSTANCE_NAME']",
stderr: KeyError: 'INSTANCE_NAME'"
What can be wrong that the code runs when directly invoked, but not from the OpenWhisk container?
I wanted to import the jsonschema library in my AWS Lambda function in order to perform request validation. Instead of bundling the dependency with my app, I am looking to do this via Lambda layers. I zipped all the dependencies under venv/lib/python3.6/site-packages/. I uploaded this as a Lambda layer and added it to my Lambda using the publish-layer-version and aws lambda update-function-configuration commands respectively. The zip file is named "lambda-dep.zip" and all the files are under it. However, when I try to import jsonschema in my lambda_function, I see the error below:
from jsonschema import validate
{
"errorMessage": "Unable to import module 'lambda_api': No module named 'jsonschema'",
"errorType": "Runtime.ImportModuleError"
}
Am I missing any steps, or is there a different mechanism to import anything within Lambda layers?
You want to make sure your .zip follows this folder structure when unzipped:
python/lib/python3.6/site-packages/{LibrariesGoHere}
Upload that zip, make sure the layer is added to the Lambda function, and you should be good to go.
This is the structure that has worked for me.
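A rough boto3 sketch of publishing a zip built with that structure as a layer (the layer name and zip path are placeholders; the next answer shows an equivalent approach with the AWS CLI):

# Sketch: publish a layer zip (with the python/... structure above) via boto3.
import boto3

lambda_client = boto3.client("lambda")

with open("lambda-dep.zip", "rb") as f:  # zip built with the structure above
    response = lambda_client.publish_layer_version(
        LayerName="jsonschema-deps",  # placeholder layer name
        Content={"ZipFile": f.read()},
        CompatibleRuntimes=["python3.6"],
    )

print(response["LayerVersionArn"])  # attach this ARN to the function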
Here is the script that I use to upload a layer:
#!/usr/bin/env bash
LAYER_NAME=$1 # input layer name, retrieved as arg
ZIP_ARTIFACT=${LAYER_NAME}.zip
LAYER_BUILD_DIR="python"
# note: put the libraries in a folder supported by the runtime, which means it should be "python"
rm -rf ${LAYER_BUILD_DIR} && mkdir -p ${LAYER_BUILD_DIR}
docker run --rm -v `pwd`:/var/task:z lambci/lambda:build-python3.6 python3.6 -m pip --isolated install -t ${LAYER_BUILD_DIR} -r requirements.txt
zip -r ${ZIP_ARTIFACT} .
echo "Publishing layer to AWS..."
aws lambda publish-layer-version --layer-name ${LAYER_NAME} --zip-file fileb://${ZIP_ARTIFACT} --compatible-runtimes python3.6
# clean up
rm -rf ${LAYER_BUILD_DIR}
rm -r ${ZIP_ARTIFACT}
I added the content above to a file called build_layer.sh, then I call it as bash build_layer.sh my_layer. The script requires a requirements.txt in the same folder, and it uses Docker to get the same runtime used by Python 3.6 Lambdas.
The arg of the script is the layer name.
After uploading a layer to AWS, be sure that the right layer's version is referenced inside your Lambda.
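As an illustrative sketch (placeholder names), referencing a specific layer version from a function with boto3 looks roughly like this:

# Sketch: reference a specific layer version from the function (placeholders).
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_function_configuration(
    FunctionName="my-function",  # placeholder
    Layers=[
        # The full ARN includes the version number at the end.
        "arn:aws:lambda:us-east-1:123456789012:layer:my_layer:1",
    ],
)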
Update from previous answers: per the AWS documentation, the requirements now simply need to be placed in a /python directory, without the rest of the directory structure.
https://aws.amazon.com/premiumsupport/knowledge-center/lambda-import-module-error-python/
Be sure your unzipped directory structure has libraries within a /python directory.
There is an easier method. Just create a python folder and install the packages into it using the -t (target) option. Note the "." in the zip command below: it refers to the current directory.
mkdir lambda_function
cd lambda_function
mkdir python
cd python
pip install yourPackages -t ./
cd ..
zip -r /tmp/lambda_layer.zip .
The zip file is now your lambda layer.
The step-by-step instructions, including video instructions, can be found here:
https://geektopia.tech/post.php?blogpost=Create_Lambda_Layer_Python