How to use tabula in AWS Lambda to read PDF table - python

Hello, I get the following error while trying to use tabula to read a table in a PDF.
I was aware of some of the difficulties (here) of using this package with AWS Lambda, so I tried to zip the tabula package on an EC2 instance (Ubuntu 20.04) and then add it as a layer to the function.
Many thanks in advance!
{ "errorMessage": "`java` command is not found from this Python process.Please ensure Java is installed and PATH is set for `java`", "errorType": "JavaNotFoundError", "stackTrace": [ " File \"/var/task/lambda_function.py\", line 39, in lambda_handler\n df = tabula.read_pdf(BytesIO(fs), pages=\"all\", area = [box],\n", " File \"/opt/python/lib/python3.8/site-packages/tabula/io.py\", line 420, in read_pdf\n output = _run(java_options, tabula_options, path, encoding)\n", " File \"/opt/python/lib/python3.8/site-packages/tabula/io.py\", line 98, in _run\n raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)\n" ] }
Code
import boto3
import tabula
from io import BytesIO
def lambda_handler(event, context):
    client = boto3.client('s3')
    s3 = boto3.resource('s3')
    # Get most recent file name
    response = client.list_objects_v2(Bucket='S3bucket')
    all = response['Contents']
    latest = max(all, key=lambda x: x['LastModified'])
    latest_key = latest['Key']
    # Get file
    obj = s3.Object('S3bucket', latest_key)
    fs = obj.get()['Body'].read()
    # Read PDF
    box = [3.99, .22, 8.3, 7.86]
    fc = 72
    for i in range(0, len(box)):
        box[i] *= fc
    df = tabula.read_pdf(BytesIO(fs), pages="all", area=[box], output_format="dataframe", lattice=True)

Here is the Dockerfile that ultimately worked and allowed me to run tabula in my Lambda function:
ARG FUNCTION_DIR="/var/task/"
COPY ./ ${FUNCTION_DIR}
# Install OpenJDK
RUN yum install -y java-1.8.0-openjdk
# Setup Python environment
# Install PYTHON requirements
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
# Copy function code to container
COPY app.py ./
CMD [ "app.handler" ]
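The CMD line above expects an app.py that exposes a handler function (the FROM line is not shown, but the yum call and the CMD format suggest one of the AWS-provided Lambda Python base images). As a minimal sketch, assuming the PDF has already been fetched to /tmp, app.py could look like this:
# app.py - minimal handler sketch for the container image above
# (assumes the PDF was already downloaded to /tmp/input.pdf, e.g. from S3
#  as in the original lambda_function.py)
import tabula

def handler(event, context):
    # read_pdf returns a list of DataFrames by default
    dfs = tabula.read_pdf("/tmp/input.pdf", pages="all", lattice=True)
    return {"statusCode": 200, "tables_found": len(dfs)}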

Tabula's Python package is just a wrapper around Java code. Here's a reference to the package.
Java 8+ must be installed for this to work. Your best bet is to build a Docker container image in which your script works and deploy that image as a Lambda function.
AWS has a good walkthrough that might help.

Related

Using a custom docker with Azure ML

I'm following the guidelines (https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-environments) to use a custom docker file on Azure. My script to create the environment looks like this:
from azureml.core.environment import Environment
myenv = Environment(name = "myenv")
myenv.docker.enabled = True
dockerfile = r"""
FROM mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04
RUN apt-get update && apt-get install -y libgl1-mesa-glx
RUN echo "Hello from custom container!"
"""
myenv.docker.base_image = None
myenv.docker.base_dockerfile = dockerfile
Upon execution, this is totally ignored and libgl1 is not installed. Any ideas why?
EDIT: Here's the rest of my code:
est = Estimator(
    source_directory = '.',
    script_params = script_params,
    use_gpu = True,
    compute_target = 'gpu-cluster-1',
    pip_packages = ['scipy==1.1.0', 'torch==1.5.1'],
    entry_script = 'AzureEntry.py',
)
run = exp.submit(config = est)
run.wait_for_completion(show_output=True)
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-environments
I have no issues installing the lib. First, please dump your dockerfile content into a file; it's easier to maintain and read ;)
e = Environment("custom")
e.docker.base_dockerfile = "path/to/your/dockerfile"
This will load the content of the file into the string property.
e.register(ws).build(ws).wait_for_completion()
Step 2/16 of the image build will run your apt update and libgl1 install.
Note that this should work with SDK >= 1.7.
This should work:
from azureml.core import Workspace
from azureml.core.environment import Environment
from azureml.train.estimator import Estimator
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core import Experiment
ws = Workspace (...)
exp = Experiment(ws, 'test-so-exp')
myenv = Environment(name = "myenv")
myenv.docker.enabled = True
dockerfile = r"""
FROM mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04
RUN apt-get update && apt-get install -y libgl1-mesa-glx
RUN echo "Hello from custom container!"
"""
myenv.docker.base_image = None
myenv.docker.base_dockerfile = dockerfile
## You need to put your pip packages in the Environment definition instead...
## see below for some changes too
myenv.python.conda_dependencies = CondaDependencies.create(pip_packages = ['scipy==1.1.0', 'torch==1.5.1'])
Finally, you can build your estimator a bit differently:
est = Estimator(
    source_directory = '.',
    # script_params = script_params,
    # use_gpu = True,
    compute_target = 'gpu-cluster-1',
    # pip_packages = ['scipy==1.1.0', 'torch==1.5.1'],
    entry_script = 'AzureEntry.py',
    environment_definition = myenv
)
And submit it:
run = exp.submit(config = est)
run.wait_for_completion(show_output=True)
Let us know if that works.
Totally understandable why you're struggling -- others have also expressed a need for more information.
Perhaps base_dockerfile needs to be a text file (with the contents inside) and not a string? I'll ask the environments PM to learn more specifically how this works.
Another option would be to leverage Azure Container Instances (ACI). An ACI is created automatically when spinning up an Azure ML workspace. See this GitHub issue for more info on that.
For more information about using Docker in environments, see the "Enable Docker" section of the article: https://learn.microsoft.com/azure/machine-learning/how-to-use-environments#enable-docker
The following example shows how to load docker steps as a string.
from azureml.core import Environment
myenv = Environment(name="myenv")
# Creates the environment inside a Docker container.
myenv.docker.enabled = True
# Specify docker steps as a string.
dockerfile = r'''
FROM mcr.microsoft.com/azureml/intelmpi2018.3-ubuntu16.04
RUN echo "Hello from custom container!"
'''
# Alternatively, load from a file.
#with open("dockerfiles/Dockerfile", "r") as f:
# dockerfile=f.read()
myenv.docker.base_dockerfile = dockerfile
I think it's that you're using an estimator. Estimators create their own environment, unless you set the environment_definition parameter, which I don't see in your snippet. I'm looking at https://learn.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.estimator.estimator?view=azure-ml-py.
Haven't tried it, but I think you can fix this by changing your code to:
est = Estimator(
    source_directory = '.',
    script_params = script_params,
    use_gpu = True,
    compute_target = 'gpu-cluster-1',
    pip_packages = ['scipy==1.1.0', 'torch==1.5.1'],
    entry_script = 'AzureEntry.py',
    environment_definition = myenv
)
run = exp.submit(config = est)
run.wait_for_completion(show_output=True)
You might also have to move the use_gpu setting into the environment definition, as the SDK page I linked above says that the environment will take precedence over this and a couple of other estimator parameters.
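If you go that route, a minimal sketch of moving the GPU setup onto the Environment is to pick a GPU base image instead of passing use_gpu to the estimator; the image name below is an assumption, so use whichever CUDA-enabled Azure ML base image matches your cluster:
from azureml.core.environment import Environment

myenv = Environment(name = "myenv")
myenv.docker.enabled = True
# Assumed CUDA-enabled base image; substitute the GPU image that fits your setup
myenv.docker.base_image = "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04"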

Pyrebase4 storage No value for argument 'filename'

I have been struggling with installing Pyrebase on my second PC, which has the same Python version (3.8.2) as my main PC. My main PC has this Pyrebase script working properly:
from pyrebase import pyrebase
import os
import time
project_root = os.path.dirname(os.path.dirname(__file__))
keys_path = os.path.join(project_root, 'keys')
hylKeyPath = os.path.join(keys_path, 'ServiceAccountKey.json')
firebase = pyrebase.initialize_app({
    "apiKey": "aksdjalksjdlkajsdlkjalkdja",
    "authDomain": "lkjsakdjlkjsad.firebaseapp.com",
    "databaseURL": "https://asdlkasjldkjaslkd.firebaseio.com",
    "storageBucket": "asdasdadjaslkjhd.appspot.com",
    "serviceAccount": hylKeyPath
})
storage = firebase.storage()
def sleepCountDown(t):
    while t > 0:
        print(f"sleeping for {t} seconds...")
        t -= 1
        time.sleep(1)

while True:
    print('fetching data')
    files = storage.child('/').list_files()
    for file in files:
        if 'records/' in file.name:
            # get the file url path
            # print(storage.child(file.name).get_url(None))
            # downloads file
            storage.child(file.name).download(os.path.basename(file.name))
            # deletes file? kinda deletes the entire folder
            storage.delete(file.name)
    sleepCountDown(10)
But somehow I am not able to install Pyrebase on my second PC, so I had to install Pyrebase4 instead.
It seems this one has some bugs: it keeps highlighting storage.child(file.name).download(os.path.basename(file.name)) and saying storage: No value for argument 'filename' in method call.
Then, when I run it, it says something like "Crypto has no method ...".
Does anyone know what is going on?
I just noticed the syntax is different in Pyrebase4: it's .download(path, filename). The documentation was outdated.
Instead of:
storage.child(file.name).download(os.path.basename(file.name))
Use:
storage.child("filename").download(filename="filename" , path="E:/ddf/")
Where E:/ddf/ is the path where you want to save your file.
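Applied to the loop from the question, the call would then look roughly like this (the local downloads/ directory is just an example target):
for file in files:
    if 'records/' in file.name:
        # Pyrebase4 signature: download(path, filename)
        storage.child(file.name).download(
            path="downloads/",
            filename=os.path.basename(file.name),
        )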

How to deploy and attach a layer to aws lambda function using aws CDK and Python

How do I deploy and attach a layer to an AWS Lambda function using the AWS CDK?
I need simple CDK code that deploys a layer and attaches it to an AWS Lambda function.
The following AWS CDK Python code deploys a layer and attaches it to an AWS Lambda function.
by yl.
Project Directory Structure
--+
  +-app.py
  +-cdk_layers_deploy.py
  +--/functions+
               +-testLambda.py
  +--/layers+
            +-custom_func.py
app.py file
#!/usr/bin/env python3
import sys
from aws_cdk import (core)
from cdk_layers_deploy import CdkLayersStack
app = core.App()
CdkLayersStack(app, "cdk-layers")
app.synth()
cdk_layers_deploy.py file
from aws_cdk import (
    aws_lambda as _lambda,
    core,
    aws_iam)
from aws_cdk.aws_iam import PolicyStatement
from aws_cdk.aws_lambda import LayerVersion, AssetCode

class CdkLayersStack(core.Stack):

    def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # 1) deploy lambda functions
        testLambda: _lambda.Function = CdkLayersStack.cdkResourcesCreate(self)

        # 2) attach policy to function
        projectPolicy = CdkLayersStack.createPolicy(self, testLambda)

    # -----------------------------------------------------------------------------------
    @staticmethod
    def createPolicy(this, testLambda: _lambda.Function) -> PolicyStatement:
        projectPolicy: PolicyStatement = PolicyStatement(
            effect=aws_iam.Effect.ALLOW,
            # resources=["*"],
            resources=[testLambda.function_arn],
            actions=["dynamodb:Query",
                     "dynamodb:Scan",
                     "dynamodb:GetItem",
                     "dynamodb:PutItem",
                     "dynamodb:UpdateItem",
                     "states:StartExecution",
                     "states:SendTaskSuccess",
                     "states:SendTaskFailure",
                     "cognito-idp:ListUsers",
                     "ses:SendEmail"
                     ]
        )
        return projectPolicy

    # -----------------------------------------------------------------------------------
    @staticmethod
    def cdkResourcesCreate(self) -> _lambda.Function:
        lambdaFunction: _lambda.Function = _lambda.Function(
            self, 'testLambda',
            function_name='testLambda',
            handler='testLambda.lambda_handler',
            runtime=_lambda.Runtime.PYTHON_3_7,
            code=_lambda.Code.asset('functions'),
        )

        ac = AssetCode("layers")
        layer = LayerVersion(self, "l1", code=ac, description="test-layer", layer_version_name='Test-layer-version-name')
        lambdaFunction.add_layers(layer)

        return lambdaFunction
    # -----------------------------------------------------------------------------------
# -----------------------------------------------------------------------------------
testLambda.py
# -------------------------------------------------
# testLambda
# -------------------------------------------------
import custom_func as cf  # this import shows an error in PyCharm -> it resolves on AWS once the layer is attached

def lambda_handler(event, context):
    print(f"EVENT:{event}")
    ret = cf.cust_fun()
    return {
        'statusCode': 200,
        'body': ret
    }
custom_func.py - the layer function
import datetime
def cust_fun():
    print("Hello from the deep layers!!")
    date_time = datetime.datetime.now().isoformat()
    print("dateTime:[%s]\n" % (date_time))
    return 1
You can do it by creating and deploying your layer first, then importing it from AWS and passing it as the layers argument for the Lambda in your stack definition.
To do so, we found that the easiest solution is to create a /layer folder in the root of your CDK project and create a bash file to deploy the layer (you need to cd into the /layer folder to run it).
LAYER_NAME="<layer_name>"
echo <your_package>==<package.version.0> >> requirements.txt # See PyPi for the exact version
docker run -v "$PWD":/var/task "lambci/lambda:build-python3.6" /bin/sh -c "pip install -r requirements.txt -t python/lib/python3.6/site-packages/; exit"
zip -r $LAYER_NAME.zip python > /dev/null
aws lambda publish-layer-version \
--layer-name $LAYER_NAME \
--zip-file fileb://$LAYER_NAME.zip \
--compatible-runtimes "python3.6" \
--region <your_region>
rm requirements.txt
rm -rf python
rm $LAYER_NAME.zip
You then need to look up the ARN of the layer in your AWS console (Lambda > Layers) and define your layer in your _stack.py with:
layer = aws_lambda.LayerVersion.from_layer_version_arn(
    self,
    '<layer_name>',
    <layer_ARN>  # arn:aws:lambda:<your_region>:<your_account_id>:layer:<layer_name>:<layer_version>
)
Then you can pass it into your lambda:
# note: "lambda" is a reserved word in Python, so use a different variable name
lambda_fn = aws_lambda.Function(
    scope=self,
    id='<lambda_id>',
    handler='function.handler',
    runtime=aws_lambda.Runtime.PYTHON_3_6,
    code=aws_lambda.Code.asset(path='./lambda'),
    environment={
    },
    layers=[
        layer
    ]
)
You can use it as follows,
# assuming: from aws_cdk import aws_lambda as lb, core
mysql_lib = lb.LayerVersion(self, 'mysql',
    code = lb.AssetCode('lambda/layers/'),
    compatible_runtimes = [lb.Runtime.PYTHON_3_6],
)
my_lambda = lb.Function(
    self, 'core-lambda-function',
    runtime=lb.Runtime.PYTHON_3_6,
    code=lb.InlineCode(handler_code),
    function_name="lambda_function_name",
    handler="index.handler",
    layers = [mysql_lib],
    timeout=core.Duration.seconds(300),
)
I set this up to use the site-packages from my virtual environment, so it not only includes my dependencies but also their dependencies. In order to do this, set up your venv like this when creating your environment:
mkdir .venv
python3 -m venv .venv/python
Then, in your stack, reference your virtual environment as AssetCode(".venv"). Because this includes the Python version in the path as part of site-packages, you have to limit compatibility based on your version. The easiest way to support all Python versions would be to use a different structure, as defined at https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html
Example:
from aws_cdk import core
from aws_cdk.aws_lambda import AssetCode, LayerVersion, Runtime
class lambda_layer_stack(core.Stack):
    def __init__(self, scope: core.Construct, construct_id: str, version: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        ac = AssetCode(path=".venv")
        au_layer = LayerVersion(
            self,
            id=construct_id,
            code=ac,
            layer_version_name=construct_id,
            description=version,
            compatible_runtimes=[Runtime.PYTHON_3_8],
        )
That will not work. You need to put custom_func.py at layers/python/lib/python3.7/site-packages/custom_func.py instead to make it work (https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html#configuration-layers-path).
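In terms of the project tree shown above, the layers folder would then look something like this (everything else stays unchanged):
+--/layers+
          +--/python+
                    +--/lib+
                           +--/python3.7+
                                        +--/site-packages+
                                                         +-custom_func.py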

Create and include new file into python package

I want to set the version of my package based on the git describe command. For this, I created a setup.py with a function get_version(). This function retrieves the version from a .version file if it exists, otherwise it computes a new package version and writes it to a new .version file. However, when I call python setup.py sdist, .version is not copied inside the .tar archive. This causes an error when I try to install the package from the PyPi repo. How do I properly include my .version file "on the fly" in the package?
setup.py:
import pathlib
from subprocess import check_output
from setuptools import find_packages, setup
_VERSION_FILE = pathlib.Path(".version") # Add it to .gitignore!
_GIT_COMMAND = "git describe --tags --long --dirty"
_VERSION_FORMAT = "{tag}.dev{commit_count}+{commit_hash}"
def get_version() -> str:
    """ Return version from git, write commit to file
    """
    if _VERSION_FILE.is_file():
        with _VERSION_FILE.open() as f:
            return f.readline().strip()

    output = check_output(_GIT_COMMAND.split()).decode("utf-8").strip().split("-")
    tag, count, commit = output[:3]
    dirty = len(output) == 4
    if count == "0" and not dirty:
        return tag

    version = _VERSION_FORMAT.format(tag=tag, commit_count=count, commit_hash=commit)
    with _VERSION_FILE.open("w") as f:
        print(version, file=f, end="")
    return version

_version = get_version()

setup(
    name="mypackage",
    package_data={
        "": [str(_VERSION_FILE)]
    },
    version=_version,
    packages=find_packages(exclude=["tests"]),
)
If you include a file called MANIFEST.in in the same directory as setup.py with include .version inside, this should get the file picked up.
It was a mistake in my setup.py. I forgot to also dump the version to the file in the if count == "0" and not dirty: branch. Now it works with MANIFEST.in.
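For reference, a sketch of that fix inside get_version() (matching the code above) is to write the tag out before returning it:
if count == "0" and not dirty:
    # also dump the version when the commit sits exactly on a clean tag
    with _VERSION_FILE.open("w") as f:
        print(tag, file=f, end="")
    return tag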

How can I push AWS CodeCommit to S3 using Lambda?

Python is my preferred language but any supported by Lambda will do.
-- All AWS Architecture --
I have Prod, Beta, and Gamma branches and corresponding folders in S3. I am looking for a method to have Lambda respond to a CodeCommit trigger and, based on the branch that triggered it, clone the repo and place the files in the appropriate S3 folder.
S3://Example-Folder/Application/Prod
S3://Example-Folder/Application/Beta
S3://Example-Folder/Application/Gamma
I tried to utilize GitPython, but it does not work because Lambda does not have Git installed on the base Lambda AMI, and GitPython depends on it.
I also looked through the Boto3 docs, and there are only custodial tasks available; it is not able to return the project files.
Thank you for the help!
The latest version of the boto3 CodeCommit client includes the methods get_differences and get_blob.
You can get all of the content of a CodeCommit repository using these two methods (at least, if you are not interested in retaining the .git history).
The script below takes all the content of the master branch and adds it to a tar file. Afterwards you could upload it to S3 as you please.
You can run this as a lambda function, which can be invoked when you push to codecommit.
This works with the current lambda python 3.6 environment.
botocore==1.5.89
boto3==1.4.4
import boto3
import pathlib
import tarfile
import io
import sys
def get_differences(repository_name, branch="master"):
    response = codecommit.get_differences(
        repositoryName=repository_name,
        afterCommitSpecifier=branch,
    )
    differences = response.get("differences", [])
    # follow the pagination token until all differences have been collected
    while "nextToken" in response:
        response = codecommit.get_differences(
            repositoryName=repository_name,
            afterCommitSpecifier=branch,
            nextToken=response["nextToken"]
        )
        differences += response.get("differences", [])
    return differences
if __name__ == "__main__":
    repository_name = sys.argv[1]

    codecommit = boto3.client("codecommit")
    repository_path = pathlib.Path(repository_name)

    buf = io.BytesIO()
    with tarfile.open(None, mode="w:gz", fileobj=buf) as tar:
        for difference in get_differences(repository_name):
            blobid = difference["afterBlob"]["blobId"]
            path = difference["afterBlob"]["path"]
            mode = difference["afterBlob"]["mode"]  # noqa

            blob = codecommit.get_blob(
                repositoryName=repository_name, blobId=blobid)

            tarinfo = tarfile.TarInfo(str(repository_path / path))
            tarinfo.size = len(blob["content"])
            tar.addfile(tarinfo, io.BytesIO(blob["content"]))

    tarobject = buf.getvalue()
    # save to s3
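A possible way to finish the # save to s3 step, reusing the bucket layout from the question (the exact bucket name and key prefix here are assumptions):
s3 = boto3.client("s3")
s3.put_object(
    Bucket="Example-Folder",
    Key="Application/Prod/{}.tar.gz".format(repository_name),
    Body=tarobject,
)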
Looks like LambCI does exactly what you want.
Unfortunately, CodeCommit currently doesn't have an API to upload the repository to an S3 bucket. However, if you are open to trying out CodePipeline, you can configure AWS CodePipeline to use a branch in an AWS CodeCommit repository as the source stage for your code. In this way, when you make changes to your selected tracking branch in CodePipeline, an archive of the repository at the tip of that branch will be delivered to your CodePipeline bucket. For more information about CodePipeline, please refer to the following link:
http://docs.aws.amazon.com/codepipeline/latest/userguide/tutorials-simple-codecommit.html
