AWS Glue: passing additional Python modules to the job - ModuleNotFoundError

I'm trying to run a Glue job (version 4) to perform simple batch data processing. I'm using additional Python libraries that the Glue environment doesn't provide: translate and langdetect. Additionally, although the Glue environment does provide the 'nltk' package, when I try to import it I keep getting errors that its dependencies are not found (e.g. regex._regex, _sqlite3).
I tried a few solutions to achieve my goal:
- using --extra-py-files, where I specified the path to an S3 bucket where I uploaded either:
  - a .zip file containing the translate and langdetect Python packages
  - just a directory with the already unzipped packages
  - the packages themselves in .whl format (along with their dependencies)
- using --additional-python-modules, where I either:
  - specified the path to an S3 bucket where I uploaded the packages themselves in .whl format (along with their dependencies)
  - or just named the packages to be installed inside the Glue environment via pip3
- using Docker
Additionally, I followed a few useful sources to overcome the issue of ModuleNotFoundError:
a) https://aws.amazon.com/premiumsupport/knowledge-center/glue-import-error-no-module-named/.
b) https://aws.amazon.com/premiumsupport/knowledge-center/glue-version2-external-python-libraries/
c) https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
I also tried both Glue versions 4 and 3, but had no luck. It seems like a bug. The Glue role has been granted all permissions needed to read the S3 bucket. The script's Python version matches that of the libraries I'm trying to install: Python 3. To give you more clues, I manage the Glue resources via Terraform.
What did I do wrong?
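For reference, both job parameters mentioned above take a single comma-separated string. The following is a minimal sketch of assembling them into a job's default arguments (the mapping passed as `default_arguments` in Terraform or `DefaultArguments` in boto3). The helper name is hypothetical and not part of any AWS library.

```python
def build_glue_default_arguments(pip_modules=(), extra_py_files=()):
    """Build the default-arguments mapping for a Glue job.

    pip_modules    -- package names/specifiers or S3 URIs of .whl files
    extra_py_files -- S3 URIs of .py or .zip files holding Python code
    """
    arguments = {}
    if pip_modules:
        # Glue expects one comma-separated string for this parameter.
        arguments["--additional-python-modules"] = ",".join(pip_modules)
    if extra_py_files:
        arguments["--extra-py-files"] = ",".join(extra_py_files)
    return arguments


# Illustration only: the two packages named in the question.
args = build_glue_default_arguments(pip_modules=["translate", "langdetect"])
print(args)
```

Keeping the values in a list and joining them in one place makes it harder to end up with a stray space or a missing comma in the final string.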

Related

No module named 'numpy.core._multiarray_umath' when using AWS Lambda

I just uploaded a .zip file with all the needed packages to AWS Lambda. Everything ran fine on my Mac in a virtual environment with Python 3.8. The AWS Lambda function also uses Python 3.8. But when I run it in AWS Lambda I get this error:
No module named 'numpy.core._multiarray_umath'
I have changed the numpy version from 1.20.2 to others such as 1.19.1 and 1.18.5, but the problem persists.
I am also using spacy 3.0.6 and fastapi 0.63.0.

When I encountered the same issue, these steps worked for me:
1- Download the required packages (you may need different versions):
- pandas-1.4.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- python_dateutil-2.8.2-py2.py3-none-any.whl
- pytz-2022.1-py2.py3-none-any.whl
- numpy-1.21.5-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
- If you need others ...
2- Create a project folder and unzip the whl files into it.
3- Remove the *dist-info folders.
4- Add your source code to the folder (lambda_function.py).
5- Zip the folder and upload it to Lambda as a source code zip file.
These links may also help:
https://korniichuk.medium.com/lambda-with-pandas-fd81aa2ff25e
https://github.com/numpy/numpy/issues/13465#issuecomment-545378314
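Steps 2 to 5 above can be sketched in Python using only the standard library. This is an illustrative sketch, not the answerer's own script: the function name and arguments are hypothetical, and it assumes the wheels were already downloaded for the Lambda platform (Linux), as in step 1.

```python
import shutil
import zipfile
from pathlib import Path


def build_lambda_package(wheels, source_file, build_dir, out_zip):
    """Unpack wheels, drop *.dist-info, add the handler, and zip the result.

    out_zip should live outside build_dir so it is not zipped into itself.
    """
    build_dir = Path(build_dir)
    build_dir.mkdir(parents=True, exist_ok=True)

    # Step 2: a .whl file is a zip archive, so zipfile can unpack it.
    for wheel in wheels:
        with zipfile.ZipFile(wheel) as archive:
            archive.extractall(build_dir)

    # Step 3: remove the *.dist-info metadata folders.
    for info_dir in build_dir.glob("*.dist-info"):
        shutil.rmtree(info_dir)

    # Step 4: copy the handler file to the root of the folder.
    shutil.copy(source_file, build_dir / Path(source_file).name)

    # Step 5: zip the folder contents (entries are relative to build_dir).
    out_zip = Path(out_zip)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(build_dir.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(build_dir).as_posix())
    return out_zip
```

A call might look like `build_lambda_package(["numpy-....whl", "pandas-....whl"], "lambda_function.py", "build", "package.zip")`, with the wheel names replaced by the files you actually downloaded.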

How to fix "module 'pg8000' has no attribute 'connect'" error in AWS Glue job

I'm trying to set up a daily AWS Glue job that loads data into an RDS PostgreSQL DB. But I need to truncate my tables before loading data into them, since those jobs work on the whole dataset.
To do this, I'm implementing the solution given here: https://stackoverflow.com/a/50984173/11952393.
It uses the pure Python library pg8000. I followed the guidelines in this SO answer: downloading the library tar, unpacking it, adding the empty __init__.py, zipping the whole thing, uploading the zip file to S3 and adding the S3 URL as a Python library in the AWS Glue job config.
When I run the job, the pg8000 module seems to be imported correctly. But then I get the following error:
AttributeError: module 'pg8000' has no attribute 'connect'
I am most certainly doing something wrong... But can't find what. Any constructive feedback is welcome!

Here is what made it work for me.
Do a pip install of the pg8000 package in a separate location
pip install -t /tmp/ pg8000
You would see 2 directories in the /tmp directory
pg8000
scramp
Zip the above 2 directories separately
cd /tmp/
zip -r pg8000.zip pg8000/
zip -r scramp.zip scramp/
Upload these 2 zip files to an S3 location
While creating the job or the Dev Endpoint, list these 2 zip files in the Python Library Path field
s3://<bucket>/<prefix>/pg8000.zip,s3://<bucket>/<prefix>/scramp.zip
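The zipping step and the comma-separated library path can also be sketched in Python. This is a hedged illustration: both helper names are hypothetical, and the upload to S3 is left out. `shutil.make_archive` with `base_dir` keeps the package directory as the top-level entry, like `zip -r pg8000.zip pg8000/`.

```python
import shutil
from pathlib import Path


def zip_packages(install_dir, package_names, out_dir):
    """Zip each package directory separately, keeping the directory
    itself as the top-level entry inside its archive."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for name in package_names:
        archive = shutil.make_archive(
            str(out_dir / name), "zip", root_dir=str(install_dir), base_dir=name
        )
        archives.append(Path(archive))
    return archives


def python_library_path(bucket, prefix, archives):
    """Build the comma-separated value for the Python Library Path field."""
    return ",".join(f"s3://{bucket}/{prefix}/{a.name}" for a in archives)
```

For the example above, `zip_packages("/tmp", ["pg8000", "scramp"], "/tmp/out")` followed by `python_library_path(...)` with your own bucket and prefix would produce a string in the same shape as the one shown.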

Add
install_requires = ['pg8000==1.12.5']
to the setup.py file that generates the .egg file.
You should then be able to access the library.

Import Python module into AWS Lambda

I have followed all the steps in the documentation:
https://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html
create a directory.
Save all of your Python source files (the .py files) at the root level of this directory.
Install any libraries using pip at the root level of the directory.
Zip the content of the project-dir directory.
But after I uploaded the zip file to the Lambda function, I got this error message when testing the script.
my code:
import psycopg2
#my code...
the error:
Unable to import module 'myfilemane': No module named 'psycopg2._psycopg'
I don't know where the suffix '_psycopg' comes from...
Any help regarding this?

You are using native libraries with Lambda. We had a similar problem, and here is how we solved it.
Spin up a machine with the AWS-supported AMI that your Lambda actually runs on.
https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html
As of this writing, it is:
AMI name: amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2
Full documentation on installing native modules for your Python Lambda:
https://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html
Install the modules required for your Lambda,
pip install module-name -t /path/to/project-dir
and prepare your package for upload, along with the native modules, in the Lambda AMI environment.
Hope this helps.

I believe this happens because psycopg2 needs to be built and compiled with statically linked libraries for Linux. Please see Using psycopg2 with Lambda to Update Redshift (Python) for more details on this issue. Another reference covers the problems of compiling psycopg2 on OSX.
There are a few solutions, but basically it comes down to installing the library on a Linux machine and using that build as the psycopg2 library in your upload package.
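The `_psycopg` in the error message is the compiled C extension that ships inside the psycopg2 package, which is why a copy installed on macOS or Windows cannot be imported on Lambda. A rough way to spot this before uploading is to scan the package folder for compiled modules and look at the platform tag in their file names. The sketch below is a heuristic with hypothetical helper names, not an official check.

```python
from pathlib import Path

# File suffixes of compiled extension modules (.so on Linux/macOS, .pyd on Windows).
NATIVE_SUFFIXES = (".so", ".pyd")


def find_native_modules(package_dir):
    """Return relative paths of compiled extension modules under package_dir."""
    root = Path(package_dir)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix in NATIVE_SUFFIXES
    )


def built_for_other_platform(filename, expected="linux"):
    """Heuristic: True if the platform tag embedded in the file name
    (e.g. '_psycopg.cpython-38-darwin.so') does not mention `expected`.
    File names without a tag (e.g. '_psycopg.so') return False (unknown)."""
    parts = filename.rsplit(".", 2)
    return len(parts) == 3 and expected not in parts[1]
```

If `find_native_modules("project-dir")` returns anything and `built_for_other_platform` is true for it, the package was most likely built on the wrong machine and needs to be reinstalled on Linux.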

Read Parquet file stored in S3 with AWS Lambda (Python 3)

I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:
https://github.com/lambci/docker-lambda as a container to mock the Amazon environment, because of the native libraries that need to be installed (numpy amongst others).
This procedure to generate a zip file: http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example-deployment-pkg.html#with-s3-example-deployment-pkg-python
Add a test python function to the zip, send it to S3, update the lambda and test it
It seems that there are two possible approaches, both of which work locally in the Docker container:
fastparquet with s3fs: Unfortunately the unzipped size of the package is bigger than 256MB and therefore I can't update the Lambda code with it.
pyarrow with s3fs: I followed https://github.com/apache/arrow/pull/916 and when executed with the lambda function I get either:
If I prefix the URI with S3 or S3N (as in the code example): In the Lambda environment OSError: Passed non-file path: s3://mybucket/path/to/myfile in pyarrow/parquet.py, line 848. Locally I get IndexError: list index out of range in pyarrow/parquet.py, line 714
If I don't prefix the URI with S3 or S3N: It works locally (I can read the parquet data). In the Lambda environment, I get the same OSError: Passed non-file path: s3://mybucket/path/to/myfile in pyarrow/parquet.py, line 848.
My questions are :
why do I get a different result in my docker container than I do in the Lambda environment?
what is the proper way to give the URI?
is there an accepted way to read Parquet files in S3 through AWS Lambda?
Thanks!

AWS has a project (AWS Data Wrangler) that supports this, with full Lambda Layers support.
The docs include a step-by-step guide.
Code example:
import awswrangler as wr
# Write
wr.s3.to_parquet(
    df=df,  # the pandas DataFrame to write
    path="s3://...",
    dataset=True,
    database="my_database",  # Optional, only if you want it available on Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])
# READ
df = wr.s3.read_parquet(path="s3://...")
Reference

I was able to accomplish writing parquet files into S3 using fastparquet. It's a little tricky but my breakthrough came when I realized that to put together all the dependencies, I had to use the same exact Linux that Lambda is using.
Here's how I did it:
1. Spin up an EC2 instance using the Amazon Linux image that is used with Lambda
Source:
https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html
Linux image:
https://console.aws.amazon.com/ec2/v2/home#Images:visibility=public-images;search=amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2
Note: you might need to install many packages and change python version to 3.6 as this Linux is not meant for development. Here's how I looked for packages:
sudo yum list | grep python3
I installed:
python36.x86_64
python36-devel.x86_64
python36-libs.x86_64
python36-pip.noarch
python36-setuptools.noarch
python36-tools.x86_64
2. I used the instructions from here to build a zip file with all of the dependencies my script would use, by dumping them all in a folder and then zipping it with these commands:
mkdir parquet
cd parquet
pip install -t . fastparquet
pip install -t . (any other dependencies)
copy my python file in this folder
zip and upload into Lambda
Note: there are some constraints I had to work around: Lambda doesn't let you upload a zip larger than 50M, or one that is larger than 260M unzipped. If anyone knows a better way to get dependencies into Lambda, please do share.
Source:
Write parquet from AWS Kinesis firehose to AWS S3
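Since the size constraints are easy to hit with packages like fastparquet, it can help to measure a package before uploading it. The sketch below uses only the standard library. The default limits are simply the figures quoted in this answer, passed as parameters because the actual quotas may differ, and the function names are hypothetical.

```python
import zipfile
from pathlib import Path

MB = 1024 * 1024


def package_sizes(zip_path):
    """Return (zipped_bytes, unzipped_bytes) for a deployment package."""
    zipped = Path(zip_path).stat().st_size
    with zipfile.ZipFile(zip_path) as archive:
        # file_size is the uncompressed size of each archive member.
        unzipped = sum(info.file_size for info in archive.infolist())
    return zipped, unzipped


def fits_limits(zip_path, max_zipped_mb=50, max_unzipped_mb=260):
    """Check a package against size limits (defaults: figures quoted above)."""
    zipped, unzipped = package_sizes(zip_path)
    return zipped <= max_zipped_mb * MB and unzipped <= max_unzipped_mb * MB
```

Running `package_sizes` after each `pip install -t .` shows which dependency pushes the package over the limit.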

This was an environment issue (Lambda in VPC not getting access to the bucket). Pyarrow is now working.
Hopefully the question itself will give a good-enough overview on how to make all that work.

One can also achieve this through the AWS SAM CLI and Docker (the Docker requirement is explained in step 4).
1. Create a directory and initialize sam
mkdir some_module_layer
cd some_module_layer
sam init
Typing the last command prompts a series of questions. One could choose the following answers (I'm assuming Python 3.7, but other options are possible).
1 - AWS Quick Start Templates
8 - Python 3.7
Project name [sam-app]: some_module_layer
1 - Hello World Example
2. Modify requirements.txt file
cd some_module_layer
vim hello_world/requirements.txt
This opens the requirements.txt file in vim. On Windows you could instead type code hello_world/requirements.txt to edit the file in Visual Studio Code.
3. Add pyarrow to requirements.txt
Alongside pyarrow, it also works to include pandas and s3fs. Including pandas here ensures that pyarrow is recognized as an engine for reading parquet files.
pandas
pyarrow
s3fs
4. Build with a container
Docker is required to use the option --use-container when running the sam build command. If it's the first time, it will pull the lambci/lambda:build-python3.7 Docker image.
sam build --use-container
rm .aws-sam/build/HelloWorldFunction/app.py
rm .aws-sam/build/HelloWorldFunction/__init__.py
rm .aws-sam/build/HelloWorldFunction/requirements.txt
Notice that we're keeping only the Python libraries.
5. Zip files
cp -r .aws-sam/build/HelloWorldFunction/ python/
zip -r some_module_layer.zip python/
On Windows, it would work to run Compress-Archive python/ some_module_layer.zip.
6. Upload zip file to AWS
The following link is useful for this.
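Steps 4 and 5 (removing the sample-app files and zipping everything under a top-level python/ folder) can also be done in one pass with Python's standard library, which avoids the shell differences between Linux and Windows. This is a sketch under the assumptions of this answer: the function name is hypothetical, and the list of files to skip mirrors the three rm commands above.

```python
import zipfile
from pathlib import Path


def build_layer_zip(build_dir, out_zip,
                    skip=("app.py", "__init__.py", "requirements.txt")):
    """Zip build_dir so every entry sits under a top-level 'python/' folder,
    skipping the sample-app files that sit directly in build_dir."""
    build_dir = Path(build_dir)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(build_dir.rglob("*")):
            if not path.is_file():
                continue
            relative = path.relative_to(build_dir)
            # Only skip files at the root, not library files with the same name.
            if len(relative.parts) == 1 and relative.name in skip:
                continue
            archive.write(path, (Path("python") / relative).as_posix())
    return Path(out_zip)
```

For the project above, the call would be `build_layer_zip(".aws-sam/build/HelloWorldFunction", "some_module_layer.zip")`.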

Permission error in HDFS when using pure python external library in AWS Glue

I tried to run a customized Python script that imports an external pure python library (psycopg2) on AWS Glue but failed. I checked the CloudWatch log and found out the reason for the failure is that:
Spark failed the permission check on several folders in HDFS, one of them contains the external python library I uploaded to S3 (s3://path/to/psycopg2) which requires -x permission:
org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=READ_EXECUTE, inode="/user/root/.sparkStaging/application_1507598924170_0002/psycopg2":root:hadoop:drw-r--r--
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getListingInt(FSDirStatAndListingOp.java:76)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:4486)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:999)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:634)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
I made sure that the library contains only .py files, as instructed in the AWS documentation.
Does anyone know what went wrong?
Many thanks!

You have a directory that doesn't have execute permission. In a Unix-based O/S directories must have the execute bit set (for at least the user) to be usable.
Run something like
sudo chmod +x /user/root/.sparkStaging/application_1507598924170_0002/psycopg2
and try it again.
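The mode string at the end of the exception (drw-r--r--) is what shows the missing execute bit. As a small illustration, the hypothetical helper below reads an ls-style mode string and reports whether a given class of user has execute permission. It only interprets the string and does not touch HDFS.

```python
def has_execute_bit(mode_string, who="user"):
    """Check the execute bit in an ls-style mode string such as 'drw-r--r--'."""
    if len(mode_string) != 10:
        raise ValueError(f"expected 10 characters, got {mode_string!r}")
    # Positions of the execute flag for each class in 'drwxrwxrwx'.
    index = {"user": 3, "group": 6, "other": 9}[who]
    # 'x' is plain execute; 's' and 't' mean execute plus setuid/setgid/sticky.
    return mode_string[index] in ("x", "s", "t")


print(has_execute_bit("drw-r--r--"))  # mode from the stack trace above
print(has_execute_bit("drwxr-xr-x"))  # a directory that can be listed
```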

Glue only supports pure-Python libraries, i.e. libraries without any native library bindings.

The package psycopg2 is not pure Python, so it will not work with Glue. From the setup.py:
If you prefer to avoid building psycopg2 from source, please install
the PyPI 'psycopg2-binary' package instead.
From the AWS Glue documentation:
You can use Python extension modules and libraries with your AWS Glue ETL scripts as long as they are written in pure Python. C libraries such as pandas are not supported at the present time, nor are extensions written in other languages.
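One practical way to tell whether a package is pure Python, assuming it is distributed as a wheel, is to look at the tags at the end of the wheel file name: pure-Python wheels end in `none-any.whl`, while compiled ones carry an ABI tag and a platform tag. The helper below is a minimal sketch with a hypothetical name. It only inspects the file name and does not open the wheel.

```python
def is_pure_python_wheel(filename):
    """Return True if a wheel file name carries the 'none-any' ABI/platform tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    # Wheel names end with {python tag}-{abi tag}-{platform tag}.whl
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError(f"unexpected wheel file name: {filename!r}")
    abi_tag, platform_tag = parts[-2], parts[-1]
    return abi_tag == "none" and platform_tag == "any"


print(is_pure_python_wheel("pytz-2022.1-py2.py3-none-any.whl"))
print(is_pure_python_wheel(
    "pandas-1.4.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
))
```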
