AWS Glue locally - No module named 'awsglue' - python

I installed all the prerequisites according to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html#develop-local-python and I am still getting the No module named 'awsglue' error.
AWS Glue version 3.0
Apache Maven from the following location: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz
Spark distribution for AWS Glue version 3.0: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
SPARK_HOME is set up
Ran glue-setup.sh from \\wsl$\Ubuntu-20.04\home\my_user\aws_ds\glue_libs\aws-glue-libs\bin
When I run spark-shell or pyspark, both work fine.
Please help me debug this, as I don't know where else to start.
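One place to start debugging (a minimal sketch, assuming the script is launched with plain python rather than the Glue wrappers) is to check whether anything Glue-related is on sys.path before the import fails:

import sys
# expect an entry pointing at PyGlue.zip or the aws-glue-libs checkout here
print([p for p in sys.path if "glue" in p.lower()])
import awsglue  # raises ModuleNotFoundError if no entry above points at the library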

Working solution:
Make sure your Glue script is run from the aws-glue-libs folder
Sync the jar files between jarsv1 in aws-glue-libs and jars in your_spark_folder (the Guava jar may be present in two versions; keep the latest one)
Installation steps to consider
Get Spark on WSL2: https://phoenixnap.com/kb/install-spark-on-ubuntu
Remember to run glue-setup.sh from aws-glue-libs\bin as the last step of setting up Glue locally
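Once the steps above are done, a minimal script like this (a sketch; launch it with the gluepyspark or gluesparksubmit wrapper from aws-glue-libs/bin so PyGlue.zip ends up on the path) should import awsglue cleanly:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
print(spark.version)  # if this prints a Spark version, awsglue resolved correctly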

Related

ModuleNotFoundError: No module named 'pyathena' when running AWS Glue Job

Despite setting the parameter for my Python AWS Glue Job like this:
--additional-python-modules pyathena
I still get the following error when I try to run the job:
ModuleNotFoundError: No module named 'pyathena'
I have also tried the following parameters:
--additional-python-modules pyathena
--pip-install pyathena
--pip-install pyathena==2.23.0
In case someone finds themselves with a similar issue, this is what worked for me.
I solved this issue by removing the '--additional-python-modules pyathena' argument and upgrading the AWS Glue Job Python version to 3.9 (where pyathena is automatically installed as part of the default analytics package).
As I understand it, the --additional-python-modules argument does not work with earlier versions of AWS Glue Python.
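For anyone who does need to pass the module explicitly, this is roughly how the argument is supplied when the job is created via boto3 (the job name, role, and script location below are placeholders):

import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="my-pyathena-job",  # placeholder job name
    Role="arn:aws:iam::123456789012:role/my-glue-role",  # placeholder role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/job.py",  # placeholder script path
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    DefaultArguments={"--additional-python-modules": "pyathena"},
)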

AWS Glue: passing additional Python modules to the job - ModuleNotFoundError

I'm trying to run a Glue job (version 4) to perform simple data batch processing. I'm using additional Python libraries that the Glue environment doesn't provide - translate and langdetect. Additionally, even though the Glue env does provide the 'nltk' package, when I try to import it I keep receiving errors that its dependencies are not found (e.g. regex._regex, _sqlite3).
I tried a few solutions to achieve my goal:
using --extra-py-files, where I specified the path to an S3 bucket to which I uploaded either:
a .zip file containing the translate and langdetect Python packages
just a directory with the already unzipped packages
the packages themselves in .whl format (along with their dependencies)
using --additional-python-modules, where I specified the path to an S3 bucket to which I uploaded:
the packages themselves in .whl format (along with their dependencies)
or just named which packages have to be installed inside the Glue env via pip3
using Docker
Additionally, I followed a few useful sources to overcome the issue of ModuleNotFoundError:
a) https://aws.amazon.com/premiumsupport/knowledge-center/glue-import-error-no-module-named/.
b) https://aws.amazon.com/premiumsupport/knowledge-center/glue-version2-external-python-libraries/
c) https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
Also, I tried both Glue versions 4 and 3 but had no luck. It seems like a bug. All permissions to read the S3 bucket are granted to the Glue role. The Python version of the script is the same as that of the libraries I'm trying to install - Python 3. To give you more clues, I manage the Glue resources via Terraform.
What did I do wrong?
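For reference, a sketch of the two shapes the --additional-python-modules value can take in the job's default arguments (the bucket and wheel file names here are placeholders); the same key/value pairs go into Terraform's default_arguments map:

default_arguments = {
    # option 1: install the packages from PyPI inside the Glue environment
    "--additional-python-modules": "translate,langdetect",
    # option 2: point at wheel files uploaded to S3 instead (replace option 1 with this)
    # "--additional-python-modules": "s3://my-bucket/wheels/translate-x.y.z-py3-none-any.whl,s3://my-bucket/wheels/langdetect-x.y.z-py3-none-any.whl",
}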

PySpark 2.4 is not starting

I have been using PySpark 2.4 for some time. Spark is installed on my system at /usr/local/spark. Suddenly I see the following error when I type pyspark on the command line:
Failed to find Spark jars directory (/assembly/target/scala-2.12/jars).
You need to build Spark with the target "package" before running this program.
System OS = CentOS 7
I have Python 2.7 and Python 3.6 installed. When I installed Spark, I set Python 3.6 as the default.
There are a couple of cron jobs that run every day (and have been running for quite some time) that use pyspark. I am afraid an error like the one above may prevent those crons from running.
Please shed some light on this.
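That particular message usually means the pyspark launcher cannot find a jars directory under SPARK_HOME (it falls back to the source-build path shown in the error), so a quick sanity check might be (a sketch, assuming the /usr/local/spark install above):

import os

spark_home = os.environ.get("SPARK_HOME", "")
print("SPARK_HOME =", repr(spark_home))  # should be /usr/local/spark, not empty
print("jars dir exists:", os.path.isdir(os.path.join(spark_home, "jars")))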

No module named 'resource' installing Apache Spark on Windows

I am trying to install Apache Spark to run locally on my Windows machine. I have followed all the instructions here: https://medium.com/@loldja/installing-apache-spark-pyspark-the-missing-quick-start-guide-for-windows-ad81702ba62d.
After this installation I am able to successfully start pyspark, and execute a command such as
textFile = sc.textFile("README.md")
When I then execute a command that operates on textFile such as
textFile.first()
Spark gives me the error 'worker failed to connect back', and I can see an exception in the console coming from worker.py saying 'ModuleNotFoundError: No module named resource'. Looking at the source file, I can see that this Python file does indeed try to import the resource module, but this module is not available on Windows systems. I understand that you can install Spark on Windows, so how do I get around this?
I struggled the whole morning with the same problem. Your best bet is to downgrade to Spark 2.3.2.
The fix can be found at https://github.com/apache/spark/pull/23055.
The resource module is only for Unix/Linux systems and is not applicable in a Windows environment. This fix is not yet included in the latest release, but you can modify the worker.py in your installation as shown in the pull request. The changes to that file can be found at https://github.com/apache/spark/pull/23055/files.
You will have to re-zip the pyspark directory and move it to the lib folder in your pyspark installation directory (where you extracted the pre-compiled pyspark according to the tutorial you mentioned).
Adding to all those valuable answers: for Windows users, make sure you have copied the correct version of the winutils.exe file (for your specific version of Hadoop) to the spark/bin folder.
Say, if you have Hadoop 2.7.1, then you should copy the winutils.exe file from the Hadoop 2.7.1/bin folder.
The link for that is here: https://github.com/steveloughran/winutils
I edited the worker.py file and removed all resource-related lines - specifically the '# set up memory limits' block and the 'import resource' statement. The error disappeared.
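For anyone patching by hand, a simplified sketch of the idea behind that pull request (not the exact upstream diff) is to guard the import and skip the memory-limit setup on platforms without the module:

# guard the Unix-only module so worker.py still loads on Windows
try:
    import resource
    has_resource_module = True
except ImportError:
    has_resource_module = False

if has_resource_module:
    # the original "set up memory limits" block stays inside this guard
    soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_AS)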

pymssql package does not work with lambda in aws

How do we create a pymssql package for Lambda? I tried creating it using:
pip install pymssql -t .
When I run my Lambda function, it complains:
Unable to import module 'lambda_function': No module named lambda_function
I followed the steps at this link:
http://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html
I have a Windows machine.
Glad that it worked for you - could you please share your working process? I too tried different trial-and-error steps and ended up with the following one, which works fine in AWS Lambda. I am using the pymssql package only.
1) Did 'pip install pymssql' on an Amazon EC2 instance, as under the hood Amazon uses Linux AMIs to run its Lambda functions.
2) Copied the generated .so files and packaged them inside the Lambda deployment package. Hope this helps others who are searching for the solution as well.
Hope this will help you further; can you please share what you did to connect to the MSSQL server using AWS Lambda?
Below is the folder structure of my lambda deployment package
Windows "pymssql" file not supported in AWS lambda since lambda running on Amazon Linux 2.
So we can get Linux-supported wheel files from the following official "pymssql" website download link. You look for "manylinux_2_24_x86_64" version file with the required python version. At the time of writing python 3.9 is the latest version. So download file will be "pymssql-2.2.2-cp39-cp39-manylinux_2_24_x86_64.whl".
Once download the wheel file execute the below command
pip install {path of downloaded wheel file} -t {target folder to store this package}
Example
pip install pymssql-2.2.2-cp39-cp39-manylinux_2_24_x86_64.whl -t /package
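Once the wheel is unpacked into the deployment package, a minimal handler to verify the import and connection might look like this (the host, credentials, and database below are placeholders):

import pymssql

def lambda_handler(event, context):
    # placeholder connection details - in practice read them from environment variables or Secrets Manager
    conn = pymssql.connect(
        server="my-db-host.example.com",
        user="my_user",
        password="my_password",
        database="my_database",
    )
    cursor = conn.cursor()
    cursor.execute("SELECT @@VERSION")
    row = cursor.fetchone()
    conn.close()
    return {"sql_server_version": row[0]}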
Finally I could do it. It didn't work with Windows packages, so I used Ubuntu to package the freetds .so file and it worked.
Along with pymssql, try to import Cython.
Just a quick update on recent changes.
There seems to be some issue with Python 3.9 AWS Lambda; it throws the below error with pymssql==2.2.3:
{
  "errorMessage": "Unable to import module 'index': No module named 'pymssql._pymssql'",
  "errorType": "Runtime.ImportModuleError",
  "stackTrace": []
}
But if I change the Python version to 3.8, the error vanishes.
