Issue with using files in distributed cache in Elastic MapReduce - python

I'm trying to make use of an external library in my Python mapper script in an AWS Elastic MapReduce job.
However, my script doesn't seem to be able to find the modules in the cache. I archived the files into a tarball called helper_classes.tar and uploaded the tarball to an Amazon S3 bucket. When creating my MapReduce job on the console, I specified the argument as:
cacheArchive s3://folder1/folder2/helper_classes.tar#helper_classes
At the beginning of my Python mapper script, I included the following code to import the library:
import sys
sys.path.append('./helper_classes')
import geoip.database
When I run the MapReduce job, it fails with an ImportError: No module named geoip.database. (geoip is a folder in the top level of helper_classes.tar and database is the module I'm trying to import.)
Any ideas what I could be doing wrong?

This might be late for the topic.
The reason is that the geoip.database module is not installed on all of the Hadoop nodes.
You can either avoid uncommon imports in your map/reduce code,
or install the needed modules on all of the Hadoop nodes.
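If you want to stick with the cacheArchive approach, a quick way to see where the archive actually ended up is to list the unpacked directory from inside the mapper and add the right level to sys.path. A minimal sketch, assuming the '#helper_classes' fragment from the question; the nested path is only a guess about how the tarball was built:
import os
import sys

# Directory Hadoop creates from the '#helper_classes' fragment of cacheArchive.
archive_dir = os.path.join(os.getcwd(), 'helper_classes')

# Dump the unpacked layout to stderr so it shows up in the task attempt logs.
for root, dirs, files in os.walk(archive_dir):
    sys.stderr.write('%s -> dirs=%s files=%s\n' % (root, dirs, files))

# Add the archive root, and also a nested helper_classes folder in case the
# tarball was built with a top-level directory (adjust to what the log shows).
sys.path.append(archive_dir)
nested = os.path.join(archive_dir, 'helper_classes')
if os.path.isdir(nested):
    sys.path.append(nested)

import geoip.database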

Related

AWS Glue: passing additional Python modules to the job - ModuleNotFoundError

I'm trying to run a Glue job (version 4) to perform simple batch data processing. I'm using additional Python libraries that the Glue environment doesn't provide: translate and langdetect. Additionally, even though the Glue environment does provide the nltk package, when I try to import it I keep receiving errors that its dependencies are not found (e.g. regex._regex, _sqlite3).
I tried a few solutions to achieve my goal:
using --extra-py-files, where I specified the path to an S3 bucket to which I uploaded either:
a .zip file containing the translate and langdetect packages
just a directory with the already unzipped packages
the packages themselves in .whl format (along with their dependencies)
using --additional-python-modules, where I specified the path to an S3 bucket to which I uploaded:
the packages themselves in .whl format (along with their dependencies)
or just pinpointed which packages had to be installed inside the Glue env via pip3
using Docker
Additionally, I followed a few useful sources to overcome the issue of ModuleNotFoundError:
a) https://aws.amazon.com/premiumsupport/knowledge-center/glue-import-error-no-module-named/.
b) https://aws.amazon.com/premiumsupport/knowledge-center/glue-version2-external-python-libraries/
c) https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
Also, I tried playing with Glue versions 4 and 3 but haven't had any luck; it seems like a bug. All permissions to read the S3 bucket are granted to the Glue role. The Python version of the script is the same as that of the libraries I'm trying to install - Python 3. To give you more clues, I manage the Glue resources via Terraform.
What did I do wrong?
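For reference, a minimal boto3 sketch of how --additional-python-modules is typically passed as a plain job default argument, with either pip-installable names or s3:// wheel paths (the job name, role, script location and version pins below are placeholders, not the asker's actual setup):
import boto3

glue = boto3.client('glue')

glue.create_job(
    Name='batch-processing-job',                       # placeholder
    Role='arn:aws:iam::123456789012:role/my-glue-role', # placeholder
    GlueVersion='4.0',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/job.py',  # placeholder
        'PythonVersion': '3',
    },
    DefaultArguments={
        # Comma-separated list of pip names and/or s3:// .whl paths.
        '--additional-python-modules': 'translate==3.6.1,langdetect==1.0.9',
    },
)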

missing_dependencies error using Pandas in Azure Web Job

I need to run a long-running job via an Azure WebJob in Python.
I am facing the error below when trying to import pandas.
File "D:\local\Temp\jobs\triggered\demo2\eveazbwc.iyd\pandas_init_.py", line 13
missing_dependencies.append(f"{dependency}: {e}")
The web app (under which I will run the WebJob) also has Python code using pandas, and it does not throw any error.
I have tried uploading the pandas and numpy folders inside the zip file (creating a venv, installing the packages and zipping the Lib/site-packages content, for both 32-bit and 64-bit Python), as well as appending 'D:/home/site/wwwroot/my_app_name/env/Lib/site-packages' to sys.path.
I am not facing such issues when importing standard Python modules or additional packages like requests.
The error is also thrown when trying to import numpy.
So I am assuming some kind of version mismatch is happening somewhere.
Any pointers to solve this will be really useful.
I have been using Python 3.x; not sure if I should try Python 2.x (virtual env, install packages and zip the content of Lib/site-packages).
Regards
Kunal
The key to solving the problem is to add the Python 3.6.4 x64 extension on the portal. (The traceback above points at an f-string in pandas/__init__.py, which only parses on Python 3.6+, so the WebJob was presumably being run with an older default Python interpreter.)
Steps:
Add the extension on the portal.
Create a requirements.txt file.
pandas==1.1.4
numpy==1.19.3
Create run.cmd (a minimal sketch is shown after these steps).
Zip these files into a single zip.
Upload the zip for the webjob.
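A minimal run.cmd sketch, assuming the Python 3.6.4 x64 site extension installs to D:\home\python364x64 and the entry script is named run.py (both paths are assumptions, adjust to your setup):
D:\home\python364x64\python.exe -m pip install -r requirements.txt
D:\home\python364x64\python.exe run.py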
For more details, please read this post.
Webjobs Running Error (3587fd: ERR ) from zipfile

Monitoring python shell glue jobs in AWS

In the AWS documentation, they specify how to activate monitoring for Spark jobs (https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-glue-job-cloudwatch-metrics.html), but not python shell jobs.
Using the code as is gives me this error: ModuleNotFoundError: No module named 'pyspark'
Worse, after commenting out from pyspark.context import SparkContext, I then get ModuleNotFoundError: No module named 'awsglue.context'. It seems the Python shell jobs don't have access to the Glue context? Has anyone solved this?
Python shell jobs run in a purely Python-based environment and do not have access to PySpark (or the EMR cluster in the backend). You will not be able to access the context attribute there; that is purely a Spark concept, and Glue ETL is essentially a wrapper around PySpark.
I have been getting into Glue Python shell jobs more, resolving some dependencies in code files that are shared between my Spark jobs and Python shell jobs. I was able to resolve the pyspark dependency by pinning pyspark==2.4.7 in the requirements.txt used when building my .egg/.whl file (that particular version because another library required it).
You still cannot use the PySpark context, as mentioned above by Emerson, because this is the Python runtime, not the Spark runtime.
So when building the distribution with setuptools, you can have a requirements.txt that looks like this (below), and when the shell job environment is set up, it will install these dependencies (a setuptools sketch follows the list):
elasticsearch
aws_requests_auth
pg8000
pyspark==2.4.7
awsglue-local
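A minimal setuptools sketch along those lines; the package name and layout are placeholders, and it simply reads the pins above from requirements.txt:
# setup.py - hypothetical layout for code shared between Spark and shell jobs.
from setuptools import setup, find_packages

# Read the pinned dependencies (the list above) from requirements.txt.
with open('requirements.txt') as f:
    requirements = [line.strip() for line in f if line.strip() and not line.startswith('#')]

setup(
    name='shared_code',      # placeholder package name
    version='0.1.0',
    packages=find_packages(),
    install_requires=requirements,
)
Building it with python setup.py bdist_wheel (or bdist_egg) produces the artifact that can be attached to the shell job, and the pinned dependencies are installed when the job environment is set up.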

Permission error in HDFS when using pure python external library in AWS Glue

I tried to run a customized Python script that imports an external pure-Python library (psycopg2) on AWS Glue, but it failed. I checked the CloudWatch log and found that the reason for the failure is that:
Spark failed a permission check on several folders in HDFS; one of them contains the external Python library I uploaded to S3 (s3://path/to/psycopg2), which requires the execute (x) permission:
org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=READ_EXECUTE, inode="/user/root/.sparkStaging/application_1507598924170_0002/psycopg2":root:hadoop:drw-r--r--
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getListingInt(FSDirStatAndListingOp.java:76)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:4486)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:999)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:634)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
I made sure that the library contains only .py files, as instructed in the AWS documentation.
Does anyone know what went wrong?
Many thanks!
You have a directory that doesn't have execute permission. In a Unix-based OS, directories must have the execute bit set (at least for the user) to be usable.
Run something like
sudo chmod +x /user/root/.sparkStaging/application_1507598924170_0002/psycopg2
and try it again.
Glue only supports pure-Python libraries, i.e. libraries without any native library bindings.
The package psycopg2 is not pure Python, so it will not work with Glue. From its setup.py:
If you prefer to avoid building psycopg2 from source, please install the PyPI 'psycopg2-binary' package instead.
From the AWS Glue documentation:
You can use Python extension modules and libraries with your AWS Glue ETL scripts as long as they are written in pure Python. C libraries such as pandas are not supported at the present time, nor are extensions written in other languages.

Azure ML Python with Script Bundle cannot import module

In Azure ML, I'm trying to execute a Python module that needs to import the module pyxdameraulevenshtein (https://pypi.python.org/pypi/pyxDamerauLevenshtein).
I followed the usual way, which is to create a zip file and then import it; however, for this specific module, it never seems to be able to find it. The error message is the usual:
ImportError: No module named 'pyxdameraulevenshtein'
Has anyone included this pyxdameraulevenshtein module in Azure ML with success ?
(I took the package from https://pypi.python.org/pypi/pyxDamerauLevenshtein.)
Thanks for any help you can provide,
PH
I viewed the pyxdameraulevenshtein module page; there are two packages you can download: a wheel file for macOS and a source code tarball. I don't think you can use either directly on Azure ML, because the macOS one is just a shared library (.so) built for Darwin, which is not compatible with Azure ML, and the other needs to be compiled first.
So my suggestion is as below for using pyxdameraulevenshtein.
First, compile the source code of pyxdameraulevenshtein to a DLL file on Windows; please refer to the documentation for Python 2/3, or search for how to do this.
Write a Python script that uses the DLL you compiled to implement your needs; please refer to the SO thread How can I use a DLL file from Python? for how to use a DLL from Python, and refer to the official Azure tutorial for writing your Python script. (A generic ctypes sketch follows these steps.)
Package your Python script and DLL file as a zip file, then upload the zip file to use it in the Execute Python Script module of Azure ML.
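As a generic illustration of loading a compiled DLL from Python with ctypes; the DLL name, the exported function and its signature are hypothetical, so adjust them to what your build actually exports:
import ctypes

# Hypothetical DLL produced in the first step; function name is a placeholder.
lib = ctypes.CDLL('./pyxdameraulevenshtein.dll')
lib.damerau_levenshtein_distance.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
lib.damerau_levenshtein_distance.restype = ctypes.c_uint
print(lib.damerau_levenshtein_distance(b'smith', b'smtih'))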
Hope it helps.
Adding the path to pyxdameraulevenshtein to sys.path should alleviate this issue. The interpreter only searches the paths on sys.path and doesn't know where else to look beyond the default packages. If the pyxdameraulevenshtein package sits next to your Python script in the ZIP file, the snippet below should do the trick. Because you are running this within Azure ML and can't be sure of the exact location of your script each time it runs, building the path from the current working directory accounts for that.
import os
import sys
# The Script Bundle zip is unpacked into the current working directory, so
# build the path from os.getcwd() and add it to the module search path.
sys.path.append(os.path.join(os.getcwd(), 'pyxdameraulevenshtein'))
import pyxdameraulevenshtein
