EMR install Python package from Git branch - python

I typically install packages in EMR through Spark's install_pypi_package method. This limits where I can install packages from. How can I install a package from a specific GitHub branch? Is there a way I can do this through the install_pypi_package method?

If you have access to cluster creation step, you can install the package using pip from github at bootstrap. (install_pypi_package is needed because the cluster is already running at that time and packages might not resolve on all nodes)
Installing prior Cluster is running:
A simple example (e.g with download.sh bootstrap file) of bootstrap and installing from github using pip is
#!/bin/bash
sudo pip install <you-repo>.git
then you can use this bash at bootstrap as
aws emr create-cluster --name "Test cluster" --bootstrap-actions
Path="s3://elasticmapreduce/bootstrap-actions/download.sh"
or you can use pip3 in bootstrap
sudo pip3 install <you-repo>.git
or just clone it and build it locally on EMR with setup.py file
#!/bin/bash
git clone <your-repo>.git
sudo python setup.py install
After Cluster is running (Complex and not recommended)
If you still want to install or build a custom package when the cluster is already running, AWS has some explanation here that uses AWS-RunShellScript to install package on all core nodes. It says
(I) Install the package to Master node, (doing pip install on running cluster via shell or a jupyter notebook on top of it)
(II) Running following script locally on EMR, for which you pass cluster-id and boostrap script path(for e.g download.sh above) as arguments.
import argparse
import time
import boto3
def install_libraries_on_core_nodes(
cluster_id, script_path, emr_client, ssm_client):
"""
Copies and runs a shell script on the core nodes in the cluster.
:param cluster_id: The ID of the cluster.
:param script_path: The path to the script, typically an Amazon S3 object URL.
:param emr_client: The Boto3 Amazon EMR client.
:param ssm_client: The Boto3 AWS Systems Manager client.
"""
core_nodes = emr_client.list_instances(
ClusterId=cluster_id, InstanceGroupTypes=['CORE'])['Instances']
core_instance_ids = [node['Ec2InstanceId'] for node in core_nodes]
print(f"Found core instances: {core_instance_ids}.")
commands = [
# Copy the shell script from Amazon S3 to each node instance.
f"aws s3 cp {script_path} /home/hadoop",
# Run the shell script to install libraries on each node instance.
"bash /home/hadoop/install_libraries.sh"]
for command in commands:
print(f"Sending '{command}' to core instances...")
command_id = ssm_client.send_command(
InstanceIds=core_instance_ids,
DocumentName='AWS-RunShellScript',
Parameters={"commands": [command]},
TimeoutSeconds=3600)['Command']['CommandId']
while True:
# Verify the previous step succeeded before running the next step.
cmd_result = ssm_client.list_commands(
CommandId=command_id)['Commands'][0]
if cmd_result['StatusDetails'] == 'Success':
print(f"Command succeeded.")
break
elif cmd_result['StatusDetails'] in ['Pending', 'InProgress']:
print(f"Command status is {cmd_result['StatusDetails']}, waiting...")
time.sleep(10)
else:
print(f"Command status is {cmd_result['StatusDetails']}, quitting.")
raise RuntimeError(
f"Command {command} failed to run. "
f"Details: {cmd_result['StatusDetails']}")
def main():
parser = argparse.ArgumentParser()
parser.add_argument('cluster_id', help="The ID of the cluster.")
parser.add_argument('script_path', help="The path to the script in Amazon S3.")
args = parser.parse_args()
emr_client = boto3.client('emr')
ssm_client = boto3.client('ssm')
install_libraries_on_core_nodes(
args.cluster_id, args.script_path, emr_client, ssm_client)
if __name__ == '__main__':
main()

Related

Running third-party command line software on AWS Lambda container image

Introduction
I'm new to AWS and I'm trying to run Jellyfish on an AWS Lambda function with a python container image from an AWS base image.
Context
I want to achieve this by installing the software from source in my python container image (from AWS base image) and then uploading it to AWS ECR to be later used by a Lambda function.
My AWS architecture is:
biome-test-bucket-out (AWS S3 bucket with trigger) -> Lambda function -> biome-test-bucket-jf (AWS S3 bucket)
First, to install it from source I downloaded the latest release .tar.gz file locally, then uncompressed it, and copied the contents in the container.
My Dockerfile looks like this:
FROM public.ecr.aws/lambda/python:3.9
# Copy contents from latest release
WORKDIR /jellyfish-2.3.0
COPY ./jellyfish-2.3.0/ .
WORKDIR /
# Installing Jellyfish dependencies
RUN yum update -y
RUN yum install -y gcc-c++
RUN yum install -y make
# Jellyfish installation (in /bin)
RUN jellyfish-2.3.0/configure --prefix=/bin
RUN make -j 4
RUN make install
RUN chmod -R 777 /bin
# Copy function code
COPY app.py ${LAMBDA_TASK_ROOT}
CMD [ "app.lambda_handler" ]
The installation folder is /bin because it's in $PATH so that way I can just run "jellyfish..." command.
My app.py looks like this:
import subprocess
import boto3
import logging
from pathlib import Path, PurePosixPath
s3 = boto3.client('s3')
print('Loading function')
def lambda_handler(event, context):
# Declare buckets & get name of file
bucket_in = "biome-test-bucket-out"
bucket_out = "biome-test-bucket-jf"
key = event["Records"][0]["s3"]["object"]["key"]
# Paths where files will be stored
input_file_path = f"/tmp/{key}"
file = Path(key).stem
output_file_path = f"/tmp/{file}.jf"
# Download file
with open(input_file_path, 'wb') as f:
s3.download_fileobj(bucket_in, key, f)
# Run jellyfish with downloaded file
command = f"/bin/jellyfish count -C -s 100M -m 20 -t 1 {input_file_path} -o {output_file_path}"
logging.info(subprocess.check_output(command, shell=True))
# Upload file to bucket
try:
with open(output_file_path, 'rb') as f:
p = PurePosixPath(output_file_path).name
s3.upload_fileobj(f, bucket_out, p)
except Exception as e:
logging.error(e)
return False
return 0
Problem
If I build and run the image locally, everything works fine but once the image runs in Lambda I get this error:
/usr/bin/ld: cannot open output file /bin/.libs/12-lt-jellyfish: Read-only file system
That file isn't there after the installation so I'm guessing that Jellyfish creates a new file in /bin/.libs/ when it's running and that file as only read permissions. I'm not sure how to tackle this, any ideas?
Thank you.

How to use gnupg in an AWS Lambda function

I have a local python code which GPG encrypts a file. I need to convert this to AWS Lambda, once a file has been added to AWS S3 which triggers this lambda.
My local code
import os
import os.path
import time
import sys
gpg = gnupg.GPG(gnupghome='/home/ec2-user/.gnupg')
path = '/home/ec2-user/2021/05/28/'
ptfile = sys.argv[1]
with open(path + ptfile, 'rb')as f:
status = gpg.encrypt_file(f, recipients=['user#email.com'], output=path + ptfile + ".gpg")
print(status.ok)
print(status.stderr)
This works great when I execute this file as python3 encrypt.py file.csv and the result is file.csv.gpg
I'm trying to move this to AWS Lambda and invoked when a file.csv is uploaded to S3.
import json
import urllib.parse
import boto3
import gnupg
import os
import os.path
import time
s3 = boto3.client('s3')
def lambda_handler(event, context):
bucket = event['Records'][0]['s3']['bucket']['name']
key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
try:
gpg = gnupg.GPG(gnupghome='/.gnupg')
ind = key.rfind('/')
ptfile = key[ind + 1:]
with open(ptfile, 'rb')as f:
status = gpg.encrypt_file(f, recipients=['email#company.com'], output= ptfile + ".gpg")
print(status.ok)
print(status.stderr)
My AWS Lambda code zip created a folder structure in AWS
The error I see at runtime is [ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function': No module named 'gnupg'
Traceback (most recent call last):
You can create a gpg binary suitable for use by python-gnupg on AWS Lambda from the GnuPG 1.4 source. You will need
GCC and associated tools (sudo yum install -y gcc make glibc-static on Amazon Linux 2)
pip
zip
After downloading the GnuPG source package and verifying its signature, build the binary with
$ tar xjf gnupg-1.4.23.tar.bz2
$ cd gnupg-1.4.23
$ ./configure
$ make CFLAGS='-static'
$ cp g10/gpg /path/to/your/lambda/
You will also need the gnupg.py module from python-gnupg, which you can fetch using pip:
$ cd /path/to/your/lambda/
$ pip install -t . python-gnupg
Your Lambda’s source structure will now look something like this:
.
├── gnupg.py
├── gpg
└── lambda_function.py
Update your function to pass the location of the gpg binary to the python-gnupg constructor:
gpg = gnupg.GPG(gnupghome='/.gnupg', gpgbinary='./gpg')
Use zip to package the Lambda function:
$ chmod o+r gnupg.py lambda_function.py
$ chmod o+rx gpg
$ zip lambda_function.zip gnupg.py gpg lambda_function.py
Since there are some system dependencies required to use gpg within python i.e gnupg itself, you will need to build your lambda code using the container runtime environment: https://docs.aws.amazon.com/lambda/latest/dg/lambda-images.html
Using docker will allow you to install underlying system dependencies, as well as import your keys.
Dockerfile will look something like this:
FROM public.ecr.aws/lambda/python:3.8
RUN apt-get update && apt-get install gnupg
# copy handler file
COPY app.py <path-to-keys> ./
# Add keys to gpg
RUN gpg --import <path-to-private-key>
RUN gpg --import <path-to-public-key>
# Install dependencies and open port
RUN pip3 install -r requirements.txt
CMD ["app.lambda_handler"]
app.py would be your lambda code. Feel free to copy any necessary files besides the main lambda handler.
Once the container image is built and uploaded. The lambda can now use the image (including all of its dependencies). The lambda code will run within the containerized environment which contains both gnupg and your imported keys.
Resources:
https://docs.aws.amazon.com/lambda/latest/dg/python-image.html
https://docs.aws.amazon.com/lambda/latest/dg/lambda-images.html
https://medium.com/#julianespinel/how-to-use-python-gnupg-to-decrypt-a-file-into-a-docker-container-8c4fb05a0593
The best way to do this is to add a lambda layer to your python lambda.
You need to make a virtual environment in which you pip install gnupg and then put all the installed python packages in a zip file, which you upload to aws as a lambda layer. This lambda layer can then be used in all lambdas where you need gnupg. To create the lamba layer you basically do:
python3.9 -m venv my_venv
./my_venv/bin/pip3.9 install gnupg
cp -r ./my_venv/lib/python3.9/site-packages/ python
zip -r lambda_layer.zip python
Where the python version above has to match that of the python function in your lambda.
If you don't want to use layers you can additionally do:
zip -r lambda_layer.zip ./.gnupg
zip lambda_layer.zip lambda_funtion.py
And you get a zip file that you can use as a lambda deployment package
gpg is now already installed in
public.ecr.aws/lambda/python:3.8.
However despite that does not seem to be available from Lambda.
So you still need to get the gpg executable into the Lambda environment.
I did it using a docker image.
My Dockerfile is just:
FROM public.ecr.aws/lambda/python:3.8
COPY .venv/lib/python3.8/site-packages/ ./
COPY test_gpg.py .
CMD ["test_gpg.lambda_handler"]
.venv is the directory with my python virtualenv containing
the python packages I need.
The python-gnupg package requires you to have a working installation of the gpg executable, as mentioned in their official docs' Deployment Requirements; I am yet to find a way to access a gpg executable from lambda.
I ended up using a Dockerfile image as the library is already available in one of the Amazon Linux or Lambda Python base images provided by AWS that you can find here https://gallery.ecr.aws/lambda/python (in the tag images tab you will find all the Python versions needed based on your requirements).
You will need to create the following 3 files in your dev environment:
Dockerfile
requirements.txt
lambda script
The requirements.txt contains the python-gnupg for the import and all the other libraries based on your requirements:
boto3==1.15.11 # via -r requirements.in
urllib3==1.25.10 # via botocore
python-gnupg==0.5.0 # required for gpg encryption
This is the Dockerfile:
# Python 3.9 lambda base image
FROM public.ecr.aws/lambda/python:3.9
# Install pip-tools so we can manage requirements
RUN yum install python-pip -y
# Copy requirements.txt file locally
COPY requirements.txt ./
# Install dependencies into current directory
RUN python3.9 -m pip install -r requirements.txt
# Copy lambda file locally
COPY s3_to_sftp_batch.py .
# Define handler file name
CMD ["s3_to_sftp_batch.on_trigger_event_test"]
Then inside your lambda code add:
# define library path to point system libraries
os.environ["LD_LIBRARY_PATH"] = "/usr/bin/"
# create instance of GPG class and specify path that contains gpg binary
gpg = gnupg.GPG(gnupghome='/tmp', gpgbinary='/usr/bin/gpg')
Save these files then go to AWS ECR and create a Private repo then go to the repo and in the top right corner go to View push commands and run them to push your image in AWS.
Finally create your Lambda function using the container image.

Azure Databricks cluster init script - install python wheel

I have a python script that mounts a storage account in databricks and then installs a wheel from the storage account. I am trying to run it as a cluster init script but it keeps failing. My script is of the form:
#/databricks/python/bin/python
mount_point = "/mnt/...."
configs = {....}
source = "...."
if not any(mount.mountPoint == mount_point for mount in dbutils.fs.mounts()):
dbutils.fs.mount(source = source, mount_point = mount_point, extra_configs = configs)
dbutils.library.install("dbfs:/mnt/.....")
dbutils.library.restartPython()
It works when I run it in directly in a notebook but if I save to a file called dbfs:/databricks/init_scripts/datalakes/init.py and use it as cluster init script, the cluster fails to start and the error message says that the init script has a non-zero exit status. I've checked the logs and it appears that it is running as bash instead of python:
bash: line 1: mount_point: command not found
I have tried running the python script from a bash script called init.bash containing this one line:
/databricks/python/bin/python "dbfs:/databricks/init_scripts/datalakes/init.py"
Then the cluster using init.bash fails to start, with the logs saying it can't find the python file:
/databricks/python/bin/python: can't open file 'dbfs:/databricks/init_scripts/datalakes/init.py': [Errno 2] No such file or directory
Can anyone tell me how I could get this working please?
Related question: Azure Databricks cluster init script - Install wheel from mounted storage
The solution I went with was to run a notebook which mounts the storage and creates a bash init script that just installs the wheel. Something like this:
mount_point = "/mnt/...."
configs = {....}
source = "...."
if not any(mount.mountPoint == mount_point for mount in dbutils.fs.mounts()):
dbutils.fs.mount(source = source, mount_point = mount_point, extra_configs = configs)
dbutils.fs.put("dbfs:/databricks/init_scripts/datalakes/init.bash","""
/databricks/python/bin/pip install "../../../dbfs/mnt/package-source/parser-3.0-py3-none-any.whl"""", True)"

pysftp library not working in AWS lambda layer

I want to upload files to EC2 instance using pysftp library (Python script). So I have created small Python script which is using below line to connect
pysftp.Connection(
host=Constants.MY_HOST_NAME,
username=Constants.MY_EC2_INSTANCE_USERNAME,
private_key="./mypemfilelocation.pem",
)
some code here .....
pysftp.put(file_to_be_upload, ec2_remote_file_path)
This script will upload files from my local Windows machine to EC2 instance using .pem file and it works correctly.
Now I want to do this action using AWS lambda with API Gateway functionality.
So I have uploaded Python script to AWS lambda. Now I am not sure how to use pysftp library in AWS lambda, so I found solution that add pysftp library Layer in AWS lambda Layer. I did it with
pip3 install pysftp -t ./library_folder
And I make zip of above folder and added in AWS lambda Layer.
But still I got so many errors like one by one :-
No module named 'pysftp'
No module named 'paramiko'
Undefined Symbol: PyInt_FromLong
cannot import name '_bcrypt' from partially initialized module 'bcrypt' (most likely due to a circular import)
cffi module not found
I just fade up of above errors I didn't find the proper solution. How can I can use pysftp library in my AWS lambda seamlessly?
I build pysftp layer and tested it on my lambda with python 3.8. Just to see import and basic print:
import json
import pysftp
def lambda_handler(event, context):
# TODO implement
print(dir(pysftp))
return {
'statusCode': 200,
'body': json.dumps('Hello from Lambda!')
}
I used the following docker tool to build the pysftp layer:
https://github.com/lambci/docker-lambda
So what I did for pysftp was:
# create pysftp fresh python 3.8 environment
python -m venv pysftp
# activate it
source pysftp/bin/activate
cd pysftp
# install pysftp in the environemnt
pip3 install pysftp
# generate requirements.txt
pip freeze > requirements.txt
# use docker to construct the layer
docker run --rm -v `pwd`:/var/task:z lambci/lambda:build-python3.8 python3.8 -m pip --isolated install -t ./mylayer -r requirements.txt
zip -r pysftp-layer.zip .
And the rest is uploading the zip into s3, creating new layer in AWS console, setting Compatible runtime to python 3.8 and using it in my test lambda function.
You can also check here how to use this docker tool (the docker command I used is based on what is in that link).
Hope this helps

How to install external modules in a Python Lambda Function created by AWS CDK?

I'm using the Python AWS CDK in Cloud9 and I'm deploying a simple Lambda function that is supposed to send an API request to Atlassian's API when an Object is uploaded to an S3 Bucket (also created by the CDK). Here is my code for CDK Stack:
from aws_cdk import core
from aws_cdk import aws_s3
from aws_cdk import aws_lambda
from aws_cdk.aws_lambda_event_sources import S3EventSource
class JiraPythonStack(core.Stack):
def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:
super().__init__(scope, id, **kwargs)
# The code that defines your stack goes here
jira_bucket = aws_s3.Bucket(self,
"JiraBucket",
encryption=aws_s3.BucketEncryption.KMS)
event_lambda = aws_lambda.Function(
self,
"JiraFileLambda",
code=aws_lambda.Code.asset("lambda"),
handler='JiraFileLambda.handler',
runtime=aws_lambda.Runtime.PYTHON_3_6,
function_name="JiraPythonFromCDK")
event_lambda.add_event_source(
S3EventSource(jira_bucket,
events=[aws_s3.EventType.OBJECT_CREATED]))
The lambda function code uses the requests module which I've imported. However, when I check the CloudWatch Logs, and test the lambda function - I get:
Unable to import module 'JiraFileLambda': No module named 'requests'
My Question is: How do I install the requests module via the Python CDK?
I've already looked around online and found this. But it seems to directly modify the lambda function, which would result in a Stack Drift (which I've been told is BAD for IaaS). I've also looked at the AWS CDK Docs too but didn't find any mention of external modules/libraries (I'm doing a thorough check for it now) Does anybody know how I can work around this?
Edit: It would appear I'm not the only one looking for this.
Here's another GitHub issue that's been raised.
It is not even necessary to use the experimental PythonLambda functionality in CDK - there is support built into CDK to build the dependencies into a simple Lambda package (not a docker image). It uses docker to do the build, but the final result is still a simple zip of files. The documentation shows it here: https://docs.aws.amazon.com/cdk/api/latest/docs/aws-lambda-readme.html#bundling-asset-code ; the gist is:
new Function(this, 'Function', {
code: Code.fromAsset(path.join(__dirname, 'my-python-handler'), {
bundling: {
image: Runtime.PYTHON_3_9.bundlingImage,
command: [
'bash', '-c',
'pip install -r requirements.txt -t /asset-output && cp -au . /asset-output'
],
},
}),
runtime: Runtime.PYTHON_3_9,
handler: 'index.handler',
});
I have used this exact configuration in my CDK deployment and it works well.
And for Python, it is simply
aws_lambda.Function(
self,
"Function",
runtime=aws_lambda.Runtime.PYTHON_3_9,
handler="index.handler",
code=aws_lambda.Code.from_asset(
"function_source_dir",
bundling=core.BundlingOptions(
image=aws_lambda.Runtime.PYTHON_3_9.bundling_image,
command=[
"bash", "-c",
"pip install --no-cache -r requirements.txt -t /asset-output && cp -au . /asset-output"
],
),
),
)
UPDATE:
It now appears as though there is a new type of (experimental) Lambda Function in the CDK known as the PythonFunction. The Python docs for it are here. And this includes support for adding a requirements.txt file which uses a docker container to add them to your function. See more details on that here. Specifically:
If requirements.txt or Pipfile exists at the entry path, the construct will handle installing all required modules in a Lambda compatible Docker container according to the runtime.
Original Answer:
So this is the awesome bit of code my manager wrote that we now use:
def create_dependencies_layer(self, project_name, function_name: str) -> aws_lambda.LayerVersion:
requirements_file = "lambda_dependencies/" + function_name + ".txt"
output_dir = ".lambda_dependencies/" + function_name
# Install requirements for layer in the output_dir
if not os.environ.get("SKIP_PIP"):
# Note: Pip will create the output dir if it does not exist
subprocess.check_call(
f"pip install -r {requirements_file} -t {output_dir}/python".split()
)
return aws_lambda.LayerVersion(
self,
project_name + "-" + function_name + "-dependencies",
code=aws_lambda.Code.from_asset(output_dir)
)
It's actually part of the Stack class as a method (not inside the init). The way we have it set up here is that we have a folder called lambda_dependencies which contains a text file for every lambda function we are deploying which just has a list of dependencies, like a requirements.txt.
And to utilise this code, we include in the lambda function definition like this:
get_data_lambda = aws_lambda.Function(
self,
.....
layers=[self.create_dependencies_layer(PROJECT_NAME, GET_DATA_LAMBDA_NAME)]
)
You should install the dependencies of your lambda locally before deploying the lambda via CDK. CDK does not have idea how to install the dependencies and which libraries should be installed.
In you case, you should install the dependency requests and other libraries before executing cdk deploy.
For example,
pip install requests --target ./asset/package
There is an example for reference.
Wanted to share 2 template repos I made for this (heavily inspired by some of the above):
https://github.com/iguanaus/cdk-ecs-python-with-requirements- - demo of ecs service of basic python function
https://github.com/iguanaus/cdk-lambda-python-with-requirements - demo of lambda python job with requirements.
Hope they are helpful for folks :)
Lastly; if you want to see a long thread on this subject, see here: https://github.com/aws/aws-cdk/issues/3660
I ran into this issue as well. I used a solution like #Kane and #Jamie suggest just fine when I was working on my ubuntu machine. However, I ran into issue when working on MacOS. Apparently some (all?) python packages don't work on lambda (linux env) if they are pip installeded on a different os (see stackoverflow post)
My solution was to run the pip install inside a docker container. This allowed me to cdk deploy from my macbook and not run into issues with my python packages in lambda.
suppose you have a dir lambda_layers/python in your cdk project that will house your python packages for the lambda layer.
current_path = str(pathlib.Path(__file__).parent.absolute())
pip_install_command = ("docker run --rm --entrypoint /bin/bash -v "
+ current_path
+ "/lambda_layers:/lambda_layers python:3.8 -c "
+ "'pip3 install Pillow==8.1.0 -t /lambda_layers/python'")
subprocess.run(pip_install_command, shell=True)
lambda_layer = aws_lambda.LayerVersion(
self,
"PIL-layer",
compatible_runtimes=[aws_lambda.Runtime.PYTHON_3_8],
code=aws_lambda.Code.asset("lambda_layers"))
As an alternative to my other answer, here's a slightly different approach that also works with docker-in-docker (the bundling-options approach doesn't).
Set up the Lambda function like
lambda_fn = aws_lambda.Function(
self,
"Function",
runtime=lambdas.Runtime.PYTHON_3_9,
code=lambdas.Code.from_docker_build(
"function_source_dir",
),
handler="index.lambda_handler",
)
and in function_source_dir/ have these files:
index.py (to match the above code - you can name this whatever you like)
requirements.txt
Dockerfile
Set up your Dockerfile like
# Note that this dockerfile is only used to build the lambda asset - the
# lambda still just runs with a zip source, not a docker image.
# See the docstring for aws_lambda.Code.from_docker_build
FROM public.ecr.aws/lambda/python:3.9.2022.04.27.10-x86_64
COPY index.py /asset/
COPY requirements.txt /tmp/
RUN pip3 install -r /tmp/requirements.txt -t /asset
and the synth step will build your asset in docker (using the above dockerfile) then pull the built Lambda source from the /asset/ directory in the image.
I haven't looked into too much detail about why the BundlingOptions approach fails to build when running inside a docker container, but this one does work (as long as docker is run with -v /var/run/docker.sock:/var/run/docker.sock to enable docker-in-docker). As always, be sure to consider your security posture when doing this.

Categories

Resources