The AWS documentation describes how to enable monitoring for Spark jobs (https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-glue-job-cloudwatch-metrics.html), but not for Python shell jobs.
Using the code as-is gives me this error: ModuleNotFoundError: No module named 'pyspark'.
Worse, after commenting out from pyspark.context import SparkContext, I then get ModuleNotFoundError: No module named 'awsglue.context'. It seems Python shell jobs don't have access to the Glue context? Has anyone solved this?
Python shell jobs run in a purely Python-based environment and do not have access to PySpark (the EMR backend). You will not be able to access the context attribute here; that is purely a Spark concept, and Glue is essentially a wrapper around PySpark.
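For reference, a minimal Python shell job script is just plain Python; the sketch below assumes nothing beyond boto3 being pre-installed in the shell job environment, and uses no Spark or Glue context:

# Minimal AWS Glue Python shell job sketch: no SparkContext or GlueContext,
# just plain Python plus the libraries bundled with the job (boto3 is pre-installed).
import sys

import boto3

glue = boto3.client("glue")   # call AWS services directly instead of going through a Spark context
print("arguments passed to the job:", sys.argv)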
I have been getting into Glue Python shell jobs more and have been resolving dependencies in some code files that are shared between my Spark jobs and Python shell jobs. I was able to resolve the pyspark dependency by listing pyspark==2.4.7 in the requirements.txt used to build my .egg/.whl file (that version because another library required it).
You still cannot use the PySpark context, as Emerson mentioned above, because this is the Python runtime, not the Spark runtime.
So when building the distribution with setuptools, you can have a requirements.txt that looks like the list below, and when the shell is set up it will install these dependencies (a minimal setup.py sketch follows the list):
elasticsearch
aws_requests_auth
pg8000
pyspark==2.4.7
awsglue-local
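For completeness, here is a minimal setup.py sketch that consumes such a requirements.txt when building the .egg/.whl; the package name and layout are hypothetical, and the build command would be python setup.py bdist_egg (or bdist_wheel):

# Minimal setup.py sketch (hypothetical package name and layout); it reads the
# requirements.txt above so the shell job installs the same dependencies.
from setuptools import find_packages, setup

with open("requirements.txt") as f:
    requirements = [line.strip() for line in f if line.strip()]

setup(
    name="shared_glue_code",   # hypothetical
    version="0.1.0",
    packages=find_packages(),
    install_requires=requirements,
)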
My PySpark EC2 instance got terminated (my fault) and I am trying to build another one. I have Spark running and am now trying to run a simple PySpark script that accesses an S3 bucket. The machine has Python 3 installed and I installed boto3, but I get an error for the line below:
from boto3.session import Session
No module named 'boto3'.
Also, I get an invalid syntax error when I do the following:
print rddtest.take(10)
Not sure what I am missing. Thanks in advance.
You might have multiple Python installations on your machine; pip might be installing boto3 for Python 3.6 while you are executing the script with Python 3.7.
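A way to sidestep that mismatch is to install boto3 through the same interpreter that runs the script; a minimal sketch follows, which also shows the Python 3 form of the print call, since print rddtest.take(10) is Python 2 syntax and is what raises the invalid syntax error:

# Install boto3 for the exact interpreter that runs the script:
#   python3 -m pip install boto3
from boto3.session import Session

session = Session()                   # picks up credentials from the environment or instance role
s3 = session.client("s3")
print(s3.list_buckets()["Buckets"])   # Python 3: print is a function, e.g. print(rddtest.take(10))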
This is for a PySpark / Databricks project:
I've written a Scala JAR library and exposed its functions as UDFs via a simple Python wrapper; everything works as it should in my PySpark notebooks. However, when I try to use any of the functions imported from the JAR in an sc.parallelize(..).foreach(..) environment, execution keeps dying with the following error:
TypeError: 'JavaPackage' object is not callable
at this line in the wrapper:
jc = get_spark()._jvm.com.company.package.class.get_udf(function.__name__)
My suspicion is that the JAR library is not available in the parallelized context, since if I replace the library path with some gibberish, the error remains exactly the same.
I haven't been able to find the necessary clues in the Spark docs so far. Using an sc.addFile("dbfs:/FileStore/path-to-library.jar") didn't help.
You could try adding the JAR via the PYSPARK_SUBMIT_ARGS environment variable (before Spark 2.3 this was also doable with SPARK_CLASSPATH).
For example:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars <path/to/jar> pyspark-shell'
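Note that PYSPARK_SUBMIT_ARGS is only read when the JVM starts, so it has to be set before the first SparkSession/SparkContext is created. A minimal sketch of the ordering, assuming you control session creation (on Databricks, attaching the JAR to the cluster as a library is the usual equivalent):

import os

from pyspark.sql import SparkSession

# Must run before any SparkSession/SparkContext exists in this process.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /dbfs/FileStore/path-to-library.jar pyspark-shell'

spark = SparkSession.builder.appName("jar-udf-check").getOrCreate()
print(spark.sparkContext.getConf().get("spark.jars", ""))   # sanity-check that the JAR made it into the conf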
How can I ship C compiled modules (for example, python-Levenshtein) to each node in a Spark cluster?
I know that I can ship Python files in Spark using a standalone Python script (example code below):
from pyspark import SparkContext
sc = SparkContext("local", "App Name", pyFiles=['MyFile.py', 'MyOtherFile.py'])
But in situations where there is no '.py' file, how do I ship the module?
If you can package your module into a .egg or .zip file, you should be able to list it in pyFiles when constructing your SparkContext (or you can add it later through sc.addPyFile).
For Python libraries that use setuptools, you can run python setup.py bdist_egg to build an egg distribution.
Another option is to install the library cluster-wide, either by using pip/easy_install on each machine or by sharing a Python installation over a cluster-wide filesystem (like NFS).
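A short sketch of the egg route described above; the egg file name is hypothetical, and for a C extension such as python-Levenshtein the egg has to be built for the same platform and Python version as the worker nodes:

from pyspark import SparkContext

sc = SparkContext("local", "App Name")
# Built beforehand with: python setup.py bdist_egg
sc.addPyFile("dist/python_Levenshtein-0.12.0-py3.7-linux-x86_64.egg")   # hypothetical file name

def distance(pair):
    # The import happens on the executor, after the egg has been shipped there.
    import Levenshtein
    return Levenshtein.distance(*pair)

print(sc.parallelize([("kitten", "sitting")]).map(distance).collect())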
There are two main options here:
If it's a single file or a .zip/.egg, pass it to SparkContext.addPyFile.
Add a pip install step to the bootstrap code for the cluster's machines.
Some cloud platforms (Databricks in this case) have a UI that makes this easier.
People also suggest using the Python shell to test whether the module is present on the cluster (see the sketch below).
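One way to run that check on the executors rather than just the driver is to attempt the import inside a small parallelized job; a sketch, using the python-Levenshtein module from the question:

from pyspark import SparkContext

sc = SparkContext("local", "module-presence-check")

def where_is_module(_):
    # Runs on the executor: report where the module was imported from, or why it failed.
    try:
        import Levenshtein
        return Levenshtein.__file__
    except ImportError as exc:
        return "missing: %s" % exc

print(sc.parallelize(range(4), 4).map(where_is_module).collect())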
I tried to run a customized Python script that imports an external pure-Python library (psycopg2) on AWS Glue, but it failed. I checked the CloudWatch log and found the reason for the failure:
Spark failed the permission check on several folders in HDFS; one of them contains the external Python library I uploaded to S3 (s3://path/to/psycopg2) and requires the -x permission:
org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=READ_EXECUTE, inode="/user/root/.sparkStaging/application_1507598924170_0002/psycopg2":root:hadoop:drw-r--r--
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getListingInt(FSDirStatAndListingOp.java:76)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:4486)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:999)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:634)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
I made sure that the library contains only .py files, as instructed in the AWS documentation.
Does anyone know what went wrong?
Many thanks!
You have a directory that doesn't have execute permission. On a Unix-based OS, directories must have the execute bit set (at least for the user) to be usable.
Run something like
hdfs dfs -chmod u+x /user/root/.sparkStaging/application_1507598924170_0002/psycopg2
(the path is in HDFS, so a plain chmod on the local filesystem will not reach it) and try it again.
Glue only supports pure-Python libraries, i.e. libraries without any native library bindings.
The package psycopg2 is not pure Python, so it will not work with Glue. From the setup.py:
If you prefer to avoid building psycopg2 from source, please install the PyPI 'psycopg2-binary' package instead.
From the AWS Glue documentation:
You can use Python extension modules and libraries with your AWS Glue ETL scripts as long as they are written in pure Python. C libraries such as pandas are not supported at the present time, nor are extensions written in other languages.
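If you want to check for yourself whether a package is pure Python before uploading it to S3 for Glue, one rough heuristic is to look for compiled extension files inside the installed package; a sketch, assuming psycopg2 is installed in a local environment:

import pathlib

import psycopg2   # assumption: psycopg2 is installed locally just for this inspection

pkg_dir = pathlib.Path(psycopg2.__file__).parent
compiled = list(pkg_dir.rglob("*.so")) + list(pkg_dir.rglob("*.pyd"))
print("pure Python" if not compiled else "compiled extensions found: %s" % compiled)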
I'm trying to make use of an external library in my Python mapper script in an AWS Elastic MapReduce job.
However, my script doesn't seem to be able to find the modules in the cache. I archived the files into a tarball called helper_classes.tar and uploaded the tarball to an Amazon S3 bucket. When creating my MapReduce job on the console, I specified the argument as:
cacheArchive s3://folder1/folder2/helper_classes.tar#helper_classes
At the beginning of my Python mapper script, I included the following code to import the library:
import sys
sys.path.append('./helper_classes')
import geoip.database
When I run the MapReduce job, it fails with ImportError: No module named geoip.database (geoip is a folder at the top level of helper_classes.tar, and database is the module I'm trying to import).
Any ideas what I could be doing wrong?
This might be late for the topic.
The reason is that the geoip.database module is not installed on all the Hadoop nodes.
You can either avoid uncommon imports in your map/reduce code, or install the needed modules on all Hadoop nodes.
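If you want to keep the cacheArchive approach from the question instead, a small debugging block at the top of the mapper can show what actually landed in the task's working directory (it writes to stderr, which ends up in the task attempt logs); a sketch using the helper_classes symlink name from the question:

import os
import sys

# The cacheArchive argument unpacks the tarball next to the task and exposes it
# through the ./helper_classes symlink; dump its contents to the task logs.
sys.stderr.write("cwd contents: %s\n" % os.listdir("."))
if os.path.isdir("./helper_classes"):
    sys.stderr.write("helper_classes contents: %s\n" % os.listdir("./helper_classes"))

sys.path.append("./helper_classes")
import geoip.database   # the import the job was failing on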