I am trying to submit a job to an EMR cluster via Livy. My Python script (the job being submitted) imports a few packages, all of which I have installed on the master node of EMR. The main script resides on S3 and is referenced by the script that submits the job to Livy from EC2. Every time I try to run the job from a remote machine (EC2), it dies with import errors (no module named [mod name]).
I have been stuck on it for more than a week and unable to find a possible solution. Any help would be highly appreciated.
Thanks.
Are the packages you are trying to import custom packages? If so, how did you package them? Did you create a wheel or zip file and pass it as --py-files in your spark-submit via Livy?
Possible problem.
You installed the packages only on the master node. You will need to log into your worker nodes and install the packages there too. Alternatively, install the packages with bootstrap actions when you provision the EMR cluster.
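To confirm which nodes are actually missing a package, a small check like the one below can be run on each node with the same interpreter the job uses. This is only a minimal sketch: the function name `find_missing_modules` and the module names in the example list are placeholders, not anything from the question.

```python
import importlib.util
import sys


def find_missing_modules(module_names):
    """Return the top-level module names this interpreter cannot import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not on sys.path.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical module list; replace with the imports your job needs.
    wanted = ["json", "numpy"]
    print("interpreter:", sys.executable)
    print("missing:", find_missing_modules(wanted))
```

If the list is empty on the master node but not on a worker, the packages were installed on the master only.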
You should be able to add libraries via the --py-files option, but it's safer to just download the wheel files and use them rather than zipping anything yourself.
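When the job goes through Livy rather than spark-submit directly, the equivalent of --py-files is the `pyFiles` field in the body of the POST /batches request. A rough sketch of building and sending that body follows. The S3 paths and the helper names `build_livy_batch` and `submit_batch` are made up for illustration, and the Livy URL is whatever your cluster exposes.

```python
import json
import urllib.request


def build_livy_batch(main_script, py_files=(), args=()):
    """Build the JSON body for Livy's POST /batches endpoint."""
    payload = {"file": main_script}
    if py_files:
        # Livy's "pyFiles" plays the role of spark-submit's --py-files.
        payload["pyFiles"] = list(py_files)
    if args:
        payload["args"] = [str(a) for a in args]
    return payload


def submit_batch(livy_url, payload):
    """POST the payload to Livy and return the decoded JSON response."""
    request = urllib.request.Request(
        livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical S3 paths, for illustration only.
example = build_livy_batch(
    "s3://my-bucket/jobs/main.py",
    py_files=["s3://my-bucket/deps/deps.zip"],
)
print(json.dumps(example, indent=2))
```

Note that files shipped this way only help for pure-Python code. Packages with compiled extensions generally still have to be installed on every node.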
I am building a CI/CD azure pipeline to build and publish an azure function from a DevOps repo to Azure. The function in question uses a custom SDK stored as a Python package Artifact in an organisation scoped feed.
If I use a pip authenticate task to be able to access the SDK, the task passes but the pipeline then crashes when installing the requirements.txt. Strangely, before we get to the SDK, there is an error installing the azure-functions package. If I remove the SDK requirement and the pip authenticate task this error does not occur however. So something about the authenticate task means the agent cannot access azure-functions.
Additionally, if I swap the order of 'azure-functions' and 'CustomSDK' in the requirements.txt, the agent is still unable to install the SDK artifact, so something must be wrong with the authentication task:
steps:
- task: PipAuthenticate@1
  displayName: 'Pip Authenticate'
  inputs:
    artifactFeeds: <organisation-scoped-feed>
    pythonDownloadServiceConnections: <service-connection-to-SDK-URL>
Why can I not download these packages?
This was due to confusion around the extra index url. In order to access both PyPI and the artifact feed, the following settings need to be set:
- task: PipAuthenticate@1
  displayName: 'Pip Authenticate'
  inputs:
    pythonDownloadServiceConnections: <service-connection-to-SDK-Feed>
    onlyAddExtraIndex: true
This way pip keeps PyPI as its primary index, and the artifact feed is only added as an extra index URL.
Try running the function while the __init__.py file is active on the screen.
If you're just trying out the Quickstart, you shouldn't need to change anything in the function.json file. When you start debugging, make sure you're looking at the __init__.py file.
When you run the trigger, make sure you're on the __init__.py file. Otherwise, VS Code will try to run the current active window's file.
I'm trying to set up an MWAA Airflow 2.0 environment that integrates S3 and GCP's Pub/Sub. While we have no problems with the environment being initialized, we're having trouble installing some dependencies and importing Python packages -- specifically apache-airflow-providers-google==2.2.0.
We've followed all of the instructions based on the official MWAA Python documentation. We already included the constraints file as prescribed by AWS, activated all Airflow logging configs, and tested the requirements.txt file using the MWAA local runner. The result when updating our MWAA environment's requirements would always be like this
When testing using the MWAA local runner, we observed that using the requirements.txt file with the constraints still takes forever to resolve. Installation takes more than 10-30 minutes which is no good.
As an experiment, we tried using a version of the requirements.txt file that omits the constraints and pinned versioning. Doing so installs the packages successfully and we don't receive import errors anymore on both MWAA local runner and our MWAA environment itself. However, all of our dags will fail to run no matter what. Airflow logs are also inaccessible whenever we do this.
The team and I have been trying to get MWAA environments up and running for our different applications and ETL pipelines but we just can't seem to get things to work smoothly. Any help would be appreciated!
I'm having the same problems, and in the end we had to refactor a lot of things to remove the dependency. It looks like a problem with the pip resolver and apache-airflow-providers-google; see the official page:
https://pypi.org/project/apache-airflow-providers-google/2.0.0rc1/
In the WORST case, you may need to run Airflow directly on EC2 from a Docker image and abandon MWAA :(
I've been through similar issues but with different packages. There are certain things you need to take into consideration when using MWAA. I didn't have any issue testing the packages on the local runner then on MWAA using a public VPC, I only had issues when using a private VPC as the web server doesn't have an internet connection, so the method to get the packages to MWAA is different.
Things to take into consideration:
- The version of the packages: test on the local runner first if you can.
- Enable the logs: the scheduler and web server logs can show you issues, but they may not. The reason is that Fargate, which serves the images, will try to roll back to a working state rather than leave MWAA in a non-working state. So you might not see what the error actually is; in certain scenarios it may even look like there were no errors.
- Check dependencies: you may need to download a package with pip download <package>==<version>. You can then inspect the contents of the .whl file and see whether it declares any dependencies. There may be extra notes that point you in the right direction. In one case, the Slack package wouldn't work until I also added the http package, even though Airflow includes this package.
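For the "check dependencies" point, the declared dependencies of a downloaded wheel can be read without installing it, since a .whl file is a zip archive with a METADATA file inside its .dist-info folder. A minimal sketch, assuming a wheel already fetched with pip download (the function name `wheel_requirements` is my own):

```python
import zipfile


def wheel_requirements(wheel_path):
    """Return the Requires-Dist entries declared in a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as wheel:
        # A wheel stores its metadata in <name>-<version>.dist-info/METADATA.
        metadata_names = [
            name for name in wheel.namelist()
            if name.endswith(".dist-info/METADATA")
        ]
        if not metadata_names:
            raise ValueError(f"no METADATA file found in {wheel_path}")
        text = wheel.read(metadata_names[0]).decode("utf-8")

    requirements = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the headers; the rest is the long description.
            break
        if line.startswith("Requires-Dist:"):
            requirements.append(line.split(":", 1)[1].strip())
    return requirements
```

Calling `wheel_requirements("some_package.whl")` on the downloaded file lists what the package expects to find, which can then be compared against the constraints file.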
So, yes it's serverless, and you may have an easy time installing/setting MWAA up, but be prepared to do a little investigation if it doesn't work. I did contact AWS support, but managed to solve it myself in the end. Other than trying the obvious things, only those that use MWAA frequently and have faced varying scenarios will be of any assistance.
I have written a Python job that uses SQLAlchemy to query a SQL Server database. However, when using external libraries with AWS Glue, you are required to wrap these libraries in an egg file. This causes an issue with the SQLAlchemy package because it uses the pyodbc package, which, to my understanding, cannot be wrapped in an egg because it has other dependencies.
I have tried to find a way of connecting to a SQL Server database from within a Python Glue job, but so far the closest advice I've been able to find suggests I write a Spark job instead, which isn't appropriate.
Does anyone have experience with connecting to SQL Server within a Python 3 Glue Job? If so can I have an example snippet of code + packages used?
Yes, I actually managed to do something similar by bundling dependencies including transitive dependencies.
Follow the below steps:
1 - Create a script which zips all of the code and dependencies into a zip file and upload to S3:
python3 -m pip install -r requirements.txt --target custom_directory
python3 -m zipapp custom_directory/
mv custom_directory.pyz custom_directory.zip
Upload this zip instead of egg or wheel.
2 - Create a driver program which executes your python source program which we just zipped in step 1.
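import sys

# The first argument is the path to the zip built in step 1.
if len(sys.argv) == 1:
    raise SyntaxError("Please provide a module to load.")

# Make the zipped code importable before importing from it.
sys.path.append(sys.argv[1])
from your_module import your_function
sys.exit(your_function())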
3 - You can then submit your job using:
spark-submit --py-files custom_directory.zip your_program.py
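The reason the zip approach in the steps above works is that Python can import pure-Python modules straight from a zip archive placed on sys.path. The sketch below shows that mechanism in isolation. The helper `zip_directory` and the module name `demo_zip_module` are invented for the demonstration and stand in for your own code.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def zip_directory(source_dir, zip_path):
    """Write every file under source_dir into zip_path, relative to source_dir."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for filename in files:
                full_path = os.path.join(root, filename)
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return zip_path


# Build a throwaway source directory with one module (hypothetical names).
work_dir = tempfile.mkdtemp()
source_dir = os.path.join(work_dir, "custom_directory")
os.makedirs(source_dir)
with open(os.path.join(source_dir, "demo_zip_module.py"), "w") as handle:
    handle.write("def your_function():\n    return 0\n")

bundle = zip_directory(source_dir, os.path.join(work_dir, "custom_directory.zip"))

# Putting the zip on sys.path is enough for pure-Python modules to import.
sys.path.insert(0, bundle)
module = importlib.import_module("demo_zip_module")
print(module.your_function())
```

One caveat worth keeping in mind: compiled extension modules cannot be imported from a zip this way, so a package like pyodbc may still need a different route even when its pure-Python dependants bundle fine.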
See:
How can you bundle all your python code into a single zip file?
I can't seem to get --py-files on Spark to work
Some background: I have set up Airflow on Kubernetes (on AWS). I am able to run DAGs that query a database, send emails or do anything that doesn't require a package that isn't already a part of Airflow. For example, if I try to run a DAG that uses the Facebook-business SDK the DAG will obviously break because the dependency isn't available. I've tried a couple different ways of trying to get this dependency, along with others, installed but haven't been successful.
I have tried to install python packages by modifying my scheduler and webserver deployments to do a pip install of my dependencies as part of an initContainer. When I do this, the DAG remains broken as it is unable to find the needed packages. When I open a shell to my pod I can see that the dependencies have not been installed (I check using pip list). I have also verified that there aren't other python/pip versions installed.
I have also tried to install the dependencies by running a pip install when I open a shell to my pod. This way is successful in installing the dependency in the correct place and also makes it available. However, instead of the webserver UI showing that my DAG is broken, I get the "this dag isn't available in the webserver dagbag object" message.
I would expect that running pip install as part of my initContainer or container would make these dependencies available in my pod. However, this isn't the case. It's as if pip install runs without any issues, but by the time my pods are fully set up the Python packages are nowhere to be found.
I forgot to say that I have found a way to make it work, but it feels somewhat hacky and like there should be a better way
- If I open a shell to my webserver container and install the needed dependencies and then open a shell to my scheduler and do the same thing, the dependencies are found and the DAG works.
The init container is a separate docker instance. Unless you rig up some sort of shared storage for your python libraries (which is quite dubious) any pip installs in the init container won't impact the running container of the pod.
I see two options:
1) Modify the docker image that you're using to include the packages you need
2) Prepend a pip install to the command being run in the pod. It's not uncommon to string together a few commands with && between them, in order to execute a sequence of operations in a starting pod.
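For option 2, the container command ends up as a single shell invocation with the install chained in front of the real start command. A small sketch of assembling that command list is below. The helper name `with_pip_install`, the package name, and the start command are illustrative assumptions; the resulting list is what would go into the container's command field.

```python
import shlex


def with_pip_install(packages, main_command):
    """Return a container command that installs packages, then runs main_command."""
    install = "pip install " + " ".join(shlex.quote(p) for p in packages)
    # "&&" only starts the main command if pip install succeeded.
    return ["/bin/sh", "-c", f"{install} && {main_command}"]


# Hypothetical package and start command, for illustration only.
command = with_pip_install(["facebook-business"], "airflow scheduler")
print(command)
```

The trade-off is that every pod start pays the install time and depends on the package index being reachable, which is why baking the packages into the image is usually preferred.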
I would recommend updating your Airflow Docker image to include the libraries you need.
If you plan to use lots of different libraries for specific DAGs, then it may be worth creating multiple Docker images and referencing them at the task level.
MyOperator(...,
           executor_config={
               "KubernetesExecutor":
                   {"image": "myCustomDockerImage"}
           }
)
Reference: baseoperator.py
So I was looking into scheduling a python script on a daily basis and, rather than using Task Scheduler on my own machine, I was wondering if it is possible to do so using an Azure cloud account.
For your needs, I suggest you use WebJobs in the Web Apps service.
There are two types of Azure WebJobs to choose from:
Continuous and Triggered.
For your needs, Triggered should be used.
You can refer to the documentation for more details, including how to run tasks in WebJobs.
You can follow the steps below to create your WebJob.
Step 1:
Use the virtualenv component to create an independent Python runtime environment on your system. If you don't have it, install it first with the command pip install virtualenv.
If the installation succeeded, you will see it in your python/Scripts folder.
Step 2: Run the command to create the independent Python runtime environment.
Step 3: Go into the Scripts folder of the created directory and activate the environment (this step is important, don't skip it).
Keep this command window open and use pip install <your libraryname> in it to download external libraries.
Step 4: Compress Webjob.py (your own business code) into a single zip file together with the library packages it relies on from the Libs/site-packages folder.
Step 5:
Create a WebJob in the Web App service and upload the zip file. You can then execute your WebJob and check the log.
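The zipping in step 4 can also be scripted rather than done by hand. The following is just one possible sketch: the function name `build_webjob_zip` and the choice to place the libraries in a folder called site-packages next to the script are my assumptions, so adjust the layout to whatever your script expects.

```python
import os
import shutil
import tempfile


def build_webjob_zip(script_path, site_packages_dir, output_base):
    """Stage the script next to its libraries and zip the staging folder."""
    staging = tempfile.mkdtemp()
    shutil.copy(script_path, staging)
    # Copy the installed libraries so they sit beside the script in the archive.
    shutil.copytree(site_packages_dir, os.path.join(staging, "site-packages"))
    # make_archive appends ".zip" to output_base and returns the full path.
    return shutil.make_archive(output_base, "zip", root_dir=staging)
```

With this layout, the script itself would need to add the bundled folder to sys.path before importing from it.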
You could also refer to these SO threads:
1. Options for running Python scripts in Azure
2. Python libraries on Web Job
BTW, you need to create an Azure web app first, because WebJobs run inside an Azure web app.
Hope it helps you.