ModuleNotFoundError: No module named 'pyathena' when running AWS Glue Job - python

Despite setting the parameter for my Python AWS Glue Job like this:
--additional-python-modules pyathena
I still get the following error when I try to run the job:
ModuleNotFoundError: No module named 'pyathena'
I have also tried the following parameters:
--additional-python-modules pyathena
--pip-install pyathena
--pip-install pyathena==2.23.0

In case someone finds themselves with a similar issue, this is what worked for me.
I solved this issue by removing the '--additional-python-modules pyathena' argument and upgrading the AWS Glue Job Python version to 3.9 (where pyathena is automatically installed as part of the default analytics package).
As I understand it, the --additional-python-modules argument does not work with earlier versions of AWS Glue Python.
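For reference, here is a minimal sketch (not from the original question or answer) of where the --additional-python-modules flag goes when the job is created with boto3; the job name, role ARN and script location below are placeholders:

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="my-pyathena-job",  # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder role ARN
    Command={
        "Name": "glueetl",  # Spark ETL job
        "ScriptLocation": "s3://my-bucket/scripts/job.py",  # placeholder path
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    DefaultArguments={
        # extra pip-installable modules for the job's Python environment
        "--additional-python-modules": "pyathena",
    },
)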

Related

AWS Glue locally - No module named 'awsglue'

I installed each prerequisites according to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html#develop-local-python and still getting No module named 'awsglue' error.
AWS Glue version 3.0,
Apache Maven from the following location: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz
AWS Glue version 3.0: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
SPARK_HOME is setup
ran glue-setup.sh from \\wsl$\Ubuntu-20.04\home\my_user\aws_ds\glue_libs\aws-glue-libs\bin
when I run spark-shell or pyspark, both are working fine
Please help with debugging this, as I don't know where else to start.
Working solution:
Make sure your Glue script is run from the aws-glue-libs folder
Sync the jar files between jarsv1 in aws-glue-libs and jars in your_spark_folder (the guava jar may have two versions; keep the latest one)
Installation steps to consider
Get Spark on WSL2: https://phoenixnap.com/kb/install-spark-on-ubuntu
Remember to run glue-setup.sh from aws-glue-libs/bin as the last step of setting up Glue locally
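To confirm the local setup works, a small smoke test like the one below (my own sketch, run with the gluepyspark or gluesparksubmit wrapper from the aws-glue-libs folder) should import awsglue without errors:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# if the jars are in sync, these imports and the count below succeed
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
print(spark.range(5).count())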

No module named 'azure.storage.blob.sharedaccesssignature'

I've been struggling with the Azure library for a while now. I want to use Azure in python. I have azure.storage.blob installed (via pip install azure.storage.blob). I have upgraded it (also tried uninstalling and reinstalling) to version 12.8.1. I'm currently running python 3.7.6.
In spite of all this, I keep getting
from azure.storage.blob.sharedaccesssignature import BlobSharedAccessSignature
ModuleNotFoundError: No module named 'azure.storage.blob.sharedaccesssignature'
I see the module here: https://azure-storage.readthedocs.io/ref/azure.storage.blob.sharedaccesssignature.html
So I'm not sure why it's not recognized. Any ideas?
The reason you're getting this error is that BlobSharedAccessSignature is part of the old version of the SDK (azure-storage), whereas you're working with the newer version of the SDK (azure.storage.blob version 12.8.1).
To generate a shared access signature on a blob, you will need to use the generate_blob_sas() function in the new SDK.
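A short sketch with the v12 SDK (the account, container, blob name and key below are placeholders):

from datetime import datetime, timedelta
from azure.storage.blob import generate_blob_sas, BlobSasPermissions

# generate a read-only SAS token for a single blob, valid for one hour
sas_token = generate_blob_sas(
    account_name="mystorageaccount",  # placeholder
    container_name="mycontainer",  # placeholder
    blob_name="myblob.txt",  # placeholder
    account_key="<storage-account-key>",  # placeholder
    permission=BlobSasPermissions(read=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)

blob_url = "https://mystorageaccount.blob.core.windows.net/mycontainer/myblob.txt?" + sas_token
print(blob_url)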

Monitoring python shell glue jobs in AWS

In the AWS documentation, they specify how to activate monitoring for Spark jobs (https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-glue-job-cloudwatch-metrics.html), but not python shell jobs.
Using the code as is gives me this error: ModuleNotFoundError: No module named 'pyspark'
Worse, after commenting out from pyspark.context import SparkContext, I then get ModuleNotFoundError: No module named 'awsglue.context'. It seems the python shell jobs don't have access to glue context? Has anyone solved for this?
The Python shell jobs run in a purely Python-based environment and do not have access to pyspark (EMR in the backend). You will not be able to get access to the context attribute here. That is purely a Spark concept, and Glue is essentially a wrapper around pyspark.
I am getting into Glue Python shell jobs more and resolving some dependencies in code files that are shared between my Spark jobs and Python shell jobs. I was able to resolve the pyspark dependency by including pyspark==2.4.7 in requirements.txt when building my .egg/.whl file; that version because another library required it.
You still cannot use the pyspark context, as mentioned above by Emerson, because this is the Python runtime, not the Spark runtime.
So when building the distribution with setuptools, you can have a requirements.txt that looks like this (below), and when the shell is set up, it will install these dependencies:
elasticsearch
aws_requests_auth
pg8000
pyspark==2.4.7
awsglue-local
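For completeness, a minimal setup.py sketch (the package name is hypothetical) that pins the same dependencies as install_requires when building the .egg/.whl the shell job installs on startup:

from setuptools import setup, find_packages

setup(
    name="shared_glue_utils",  # hypothetical package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        "elasticsearch",
        "aws_requests_auth",
        "pg8000",
        "pyspark==2.4.7",
        "awsglue-local",
    ],
)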

boto3 module not found

My Pyspark EC2 instance got terminated (my fault) and I am trying to build another one. I have Spark running and now am trying to run a simple Pyspark script to access an S3 bucket. The machine has Python 3 installed and I installed boto3, but I get a compilation error for the line below.
from boto3.session import Session
No module named 'boto3'.
Also, I get a logger error saying invalid syntax when I do the following
print rddtest.take(10)
Not sure what I am missing. Thanks in advance.
You might have multiple Python installations on your machine. pip might be installing boto3 for Python 3.6 while you are currently executing Python 3.7.
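A quick way to check for that mismatch (my own sketch, not from the answer) is to install boto3 against the exact interpreter you are running:

import subprocess
import sys

print(sys.executable)  # the interpreter actually executing this script
print(sys.version)

# install boto3 into that same interpreter
subprocess.check_call([sys.executable, "-m", "pip", "install", "boto3"])

from boto3.session import Session  # should now import cleanly
print(Session().get_available_regions("s3")[:3])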

GLIBCXX_3.4.20 not found error in AWS Lambda

I'd like to use the Python package (Konlpy) in AWS Lambda. However, the following error occurs:
"Unable to import module 'lambda_function': /usr/lib64/libstdc++.so.6: version 'GLIBCXX_3.4.20' Not found (required by /var/task/_jpype.so)"
How can I fix it?
AWS Lambda is deployed in a clean environment. It does not have the Konlpy package, so it has no idea what you are trying to import.
To achieve this, you can create a CloudFormation stack. This can bundle up your code and allow you to deploy dependencies for your Lambda.
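If you build the deployment package yourself instead, a rough sketch (not the answer's CloudFormation approach; it assumes an Amazon Linux-compatible build environment so native libraries such as libstdc++ match what Lambda provides) looks like this:

import pathlib
import shutil
import subprocess
import sys

build_dir = pathlib.Path("build/lambda_package")
build_dir.mkdir(parents=True, exist_ok=True)

# vendor the dependency into the package directory
subprocess.check_call([sys.executable, "-m", "pip", "install", "konlpy", "--target", str(build_dir)])

# copy the handler in and zip everything up for upload to Lambda
shutil.copy("lambda_function.py", build_dir)
shutil.make_archive("lambda_package", "zip", root_dir=build_dir)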
