I am trying to use great-expectations, i.e., run expectation suites within an AWS Lambda function.
When I install the packages from requirements.txt, I get an error related to JupyterLab:
aws-sam\\build\\ValidationFunction\\.\\jupyterlab_widgets-1.1.0.data\\data\\share\\jupyter\\labextension
s\\#jupyter-widgets\\jupyterlab-manager\\schemas\\#jupyter-widgets\\jupyterlab-manager\\package.json.orig'
I am using SAM CLI, version 1.42.0 and am trying to build the function inside a container.
Python version 3.9.
Requirements.txt:
botocore
boto3
awslambdaric
awswrangler
pandas_profiling
importlib-metadata==2.0
great-expectations==0.13.19
s3fs==2021.6.0
python-dateutil==2.8.1
aiobotocore==1.3.0
requests==2.25.1
decorator==4.4.2
pyarrow==2
I read several posts on the internet about using Lambda functions to run Great Expectations; however, none of them report any issues.
Specifically, the question is: does anyone have a solution for running Python code on Lambda when the dependencies are a large set of Python packages?
Can you show a bit more of your code and the full error stack? I would start as simple as possible, get basic validation working, and then add dependencies back in until you find the culprit.
Add a simple Lambda with the minimum dependencies, maybe pandas and Great Expectations, and then validate one rule, as in:
from great_expectations.core import ExpectationSuite, ExpectationConfiguration

custom_expectation_suite = ExpectationSuite(expectation_suite_name="deliverable_rules.custom")
custom_expectation_suite.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "first_name"},
        meta={"reason": "first name should not be null"},
    )
)
validation_result = data_frame_to_validate.validate(custom_expectation_suite, run_id=run_id)
Related
I am trying to deploy an azure function written using python to an azure function app. The function is using pyzbar library. The pyzbar library documentation says that in a Linux environment, the below command needs to be executed so that the pyzbar can work.
sudo apt-get install libzbar0
How can I execute this command on the consumption plan? Please note that I can get this to work if I deploy the function with a container approach using a premium or a dedicated plan, but I want to get this to work using the consumption plan.
Any help is highly appreciated.
I have a workaround where, every time you trigger your function, it runs a script that installs the required packages from the command line.
This can be achieved using the subprocess module.
code :
subprocess.run(["apt-get", "install", "-y", "libzbar0"])
For example, in the following code I am installing pandas using pip and returning its version.
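A minimal sketch of that idea (the entry-point name and trigger signature are placeholders, not the actual Azure Functions API):

```python
import importlib
import subprocess
import sys


def ensure_package(name):
    # Import the package if it is already available; otherwise install it
    # at runtime with this interpreter's own pip, then import it.
    try:
        return importlib.import_module(name)
    except ImportError:
        subprocess.run([sys.executable, "-m", "pip", "install", name], check=True)
        return importlib.import_module(name)


def main(req=None):
    # Hypothetical function entry point: install pandas on first use
    # and return its version string.
    pd = ensure_package("pandas")
    return pd.__version__
```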
But this will increase your execution time: even if the packages are already installed, the installation commands will run every time you trigger the function.
I just redeployed my AWS Lambda and unexpectedly hit an issue:
[ERROR]
Runtime.ImportModuleError: Unable to import module 'app': cannot import name 'operatorPrecedence' from 'pyparsing'
(/opt/python/pyparsing/__init__.py)
(/var/task/pyparsing/__init__.py)
How do I solve this?
I saw a release note that this feature was discontinued in version 3 of pyparsing:
operatorPrecedence synonym for infixNotation - convert to calling
infix_notation
So you may be including a different version of pyparsing than you were using previously, one that no longer includes operatorPrecedence, so double-check which version of pyparsing you are packaging.
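A quick way to check is to log the version your deployment actually resolves, along with whether the old name still exists:

```python
import pyparsing

# Per the release notes quoted above, pyparsing 3.x dropped the
# operatorPrecedence synonym in favor of infix_notation, so on a 3.x
# install the hasattr check is typically False.
print(pyparsing.__version__)
print(hasattr(pyparsing, "operatorPrecedence"))
```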
This Knowledge Center article outlines the most common issues:
You typically receive this error when your Lambda environment can't
find the specified library in the Python code. This is because Lambda
isn't prepackaged with all Python libraries.
To resolve this error, create a deployment package or Lambda layer
that includes the libraries that you want to use in your Python code
for Lambda.
Here are a couple of other examples that don't use layers, if you want to include the dependency directly in the code package instead:
https://alexharv074.github.io/2018/08/18/creating-a-zip-file-for-an-aws-lambda-python-function.html
https://dev.to/razcodes/how-to-create-a-lambda-using-python-with-dependencies-4846
Basically, I fixed this issue with the following steps:
create a virtualenv
manually add the needed packages/libraries
prepare requirements.txt with python3 -m pip freeze > requirements.txt
deploy your code
Note:
After doing the above steps, if you are still facing the same issue, pin pyparsing==2.4.7 instead of pyparsing==3.0.6 or any newer version.
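The steps above can be sketched as shell commands (the package pin is an example; adjust for your function's actual dependencies):

```shell
# 1. create and activate a fresh virtualenv
python3 -m venv venv
. venv/bin/activate
# 2. manually add the packages your handler needs, e.g. pinning
#    pyparsing below 3.0 so operatorPrecedence is still available:
#      pip install 'pyparsing==2.4.7'
# 3. freeze the exact versions for the Lambda build
python3 -m pip freeze > requirements.txt
# 4. deploy with your usual pipeline (e.g. sam build && sam deploy)
```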
Getting the below error while running the Lambda code. I am using the library flatten_json:
from flatten_json import flatten
I tried to look for a Lambda layer but did not find any online. Please let me know if anyone has used this before, or suggest an alternative.
The flatten_json library is missing from your deployment package.
Use pip install flatten_json to get it.
There are four steps you need to do:
Download the dependency.
Package it in a ZIP file.
Create a new layer in AWS.
Associate the layer with your Lambda.
My answer will focus on 1. and 2., as they are what matters most for your problem. Unfortunately, packaging Python dependencies can be a bit more complicated than for other runtimes.
The main issue is that some dependencies use C code under the hood, especially performance critical libraries, for example for Machine Learning etc.
C code needs to be compiled, and if you run pip install on your machine the code will be compiled for your computer. AWS Lambda uses a Linux kernel and the amd64 architecture. So if you are running pip install on a Linux machine with an AMD or Intel processor, you can indeed just use pip install. But if you use macOS or Windows, your best bet is Docker.
Without Docker
pip install --target python flatten_json
zip -r layer.zip python
With Docker
The lambci project provides great Docker containers for building and running Lambdas. In the following example I am using their build-python3.8 image.
docker run --rm -v $(pwd):/var/task lambci/lambda:build-python3.8 pip install --target python flatten_json
zip -r layer.zip python
Be aware that $(pwd) is meant to be your current directory. On macOS and WSL this should work, but if it does not, you can just replace it with the absolute path to your current directory.
Explanation
Those commands will install the dependency into a target folder called python. The name is important, because it is one of two folders of a layer where Lambda looks for dependencies.
The python folder is then archived recursively (-r) into a file called layer.zip.
Your next step is to create a new layer in AWS and associate your function with that layer.
There are two options to choose from
Option 1) You can use a deployment package to deploy your function code to Lambda.
The deployment package (e.g. a zip file) will contain your function's code and any dependencies needed to run it.
Hence, you can package flatten_json together with your code for the Lambda.
Check the Creating a function with runtime dependencies page in the AWS documentation; it explains the use case of packaging the requests library. In your scenario, the library would be flatten_json.
Option 2) Create a layer that has the library dependencies you need, in your case just flatten_json. And then attach that layer to your Lambda.
Check the creating and sharing Lambda layers guide provided by AWS.
How to decide between 1) and 2)?
Use Option 1) when you need the dependencies in just that one Lambda; there is no need for the extra step of creating a layer.
Layers are useful if you have some common code that you want to share across different Lambdas. So if you need the library accessible in other Lambdas as well, it's good to have a layer [Option 2)] that can be attached to them.
You can do this in a Lambda if you don't want to create the layer. Keep in mind it will run slower, since it has to install the library on every run:
import sys
import subprocess

# /tmp is the only writable path inside the Lambda sandbox
subprocess.call(
    [sys.executable, "-m", "pip", "install", "flatten_json", "-t", "/tmp/", "--no-cache-dir"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
sys.path.insert(1, "/tmp/")
import flatten_json
I am trying to get Gensim on AWS lambda but after trying all the file reduction techniques (https://github.com/robertpeteuil/build-lambda-layer-python) to try to create layers it still does not fit. So I decided to try to load the packages during runtime of the lambda function as our function is not under a heavy time constraint.
So I first looked at uploading a venv to S3 and then downloading and activating it from a script, following (Activate a virtualenv with a Python script), using the 2nd block of the top rated answer. However, it turned out that the linked script was for Python 2, so I looked up the Python 3 version (making sure to copy an activate_this.py from a virtualenv into the normal venv bin, since the standard venv package doesn't include one).
activator = "/Volumes/SD.Card/Machine_Learning/lambda/bin/activate_this.py"
with open(activator) as f:
    exec(f.read(), {'__file__': activator})
import numpy
After running this script against the target venv containing numpy, I am still getting a module-not-found error. I cannot find a good resource on how to do this properly. So I guess my question is: what is the best way to load packages during Lambda runtime, and how does one carry that out?
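One pattern that avoids virtualenv activation entirely is to ship the site-packages as a zip, unpack it into /tmp at cold start, and extend sys.path. A minimal sketch (the function name is a placeholder; in practice the zip would first be downloaded from S3 with boto3, and the packages inside must be built for the Lambda architecture):

```python
import os
import sys
import zipfile


def load_deps_from_zip(zip_path, target="/tmp/deps"):
    # Unpack a pre-built site-packages archive and make it importable.
    # /tmp is the only writable location inside a Lambda sandbox.
    if target not in sys.path:
        os.makedirs(target, exist_ok=True)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(target)
        sys.path.insert(1, target)

# after calling this, a plain `import gensim` (or whatever is packaged
# in the zip) resolves from /tmp/deps
```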
I am trying to import a python deployment package in aws lambda. The python code uses numpy. I followed the deployment package instructions for virtual env but it still gave Missing required dependencies ['numpy']. I followed the instruction given on stack overflow (skipped step 4 for shared libraries, could not find any shared libraries) but no luck. Any suggestions to make it work?
The easiest way is to use AWS Cloud9, there is no need to start EC2 instances and prepare deployment packages.
Step 1: start up Cloud9 IDE
Go to the AWS console and select the Cloud9 service.
Create environment
enter Name
Environment settings (consider using the t2.small instance type; with the default I sometimes had problems restarting the environment)
Review
click Create environment
Step 2: create Lambda function
Create Lambda Function (bottom of screen)
enter Function name and Application name
Select runtime (Python 3.6) and blueprint (empty-python)
Function trigger (none)
Create serverless application (keep defaults)
Finish
wait a couple of seconds for the IDE to open
Step 3: install Numpy
at the bottom of the screen there is a command prompt
go to the Application folder, for me this is
cd App
install the package (we have to use pip from the virtual environment, because the default pip points to /usr/bin/pip, which is Python 2.7)
venv/bin/pip install numpy -t .
Step4: test installation
you can test the installation by editing the lambda_function.py file:
import numpy as np

def lambda_handler(event, context):
    return np.sin(1.0)
save the changes and click the green Run button on the top of the screen
a new tab should appear, now click the green Run button inside the tab
after the Lambda executes I get:
Response
0.8414709848078965
Step 5: deploy Lambda function
on the right hand side of the screen select your Lambda function
click the upwards pointing arrow to deploy
go to the AWS Lambda service tab
the Lambda function should be visible, the name has the format
cloud9-ApplicationName-FunctionName-RandomString
Using NumPy is a real pain.
NumPy needs to be compiled on the same OS that it runs on. This means that you need to install/compile NumPy on an Amazon Linux AMI image in order for it to run properly in Lambda.
The easiest way to do this is to start a small EC2 instance and install it there. Then copy the compiled files (from /usr/lib/python/site-packages/numpy); these are the files you need to include in your Lambda package.
I believe you can also use the serverless tool to achieve this.
NumPy must be compiled on the platform it will run on. The easiest way to do this is to use Docker, which has a Lambda container: compile NumPy locally in Docker with the Lambda container, then push to AWS Lambda.
The serverless framework handles all this for you if you want an easy solution. See https://stackoverflow.com/a/50027031/1085343
I was having this same issue and pip install wasn't working for me. Eventually, I found some obscure posts about this issue here and here. They suggested going to PyPI, downloading the .whl file (select your Python version and manylinux1_x86_64), uploading, and unzipping. This is what ultimately worked for me. If nothing else is working for you, I would suggest trying this.
For Python 3.6, you can use a Docker image as someone already suggested. Full instructions for that here: https://medium.com/i-like-big-data-and-i-cannot-lie/how-to-create-an-aws-lambda-python-3-6-deployment-package-using-docker-d0e847207dd6
The key piece is:
docker run -it dacut/amazon-linux-python-3.6