So I'm trying to get my Intellij to see Apache Airflow that I downloaded. The steps I've taken so far:
I've downloaded the most recent Apache Airflow setup and saved the apache airflow 2.2.3 onto my desktop. I'm trying to get it to work with my Intellij, I've tried adding the Apache Airflow folder into the Library and Modules, both have come back with errors stating it's not being utilized. I've tried looking up documentation on it within Airflow but I'm not able to find any documentation on how to implement in your own IDE to write Python scripts for DAGs and other items?
How would I go about doing this as I'm at a complete loss of how to get Intellij to register that Apache Airflow is a Library to utilize for Python code so I can write DAG files correctly within the IDE itself.
Any help would be much appreciated as I've been stuck on this aspect for the past couple of days searching for any kind of documentation to make this work.
Airflow is both application and library. In your case you are not trying to run the application but only looking to write DAGs so you need it just as a library.
You should just open a virtual environment (preferably) and run:
pip install apache-airflow
Then you can write DAGs using the library and Intellij will let you know if you are using wrong imports or deprecated objects.
When your DAG file is ready deploy it to the DAG folder on the machine where Airflow is running.
Related
I would like to create a self-contained, .exe file that launches a JupyterLab server as an IDE on a physical server which doesn't have Python installed itself.
The idea is to deploy it as part of an ETL workflow tool, so that it can be used to view notebooks that will contain the ETL steps in a relatively easily digestible format (the notebooks will be used as pipelines via papermill and scrapbook - not really relevant here).
While I can use Pyinstaller to bundle JupyterLab as a package, there isn't a way to launch it on the Pythonless server (that I can see), and I can't figure out a way to do it using Python code alone.
Is it possible to package JupyterLab this way so that I can run the .exe on the server and then connect to 127.0.0.1:8888 on the server to view a notebook?
I have tried using the link below as a starting point, but I think I'm missing something as no server seems to start using this code alone, and I'm not sure how I would execute this via a tornado server etc.:
https://gist.github.com/bollwyvl/bd56b58ba0a078534272043327c52bd1
I would really appreciate any ideas, help, or somebody to tell my why this idea is impossible madness!
Thanks!
Phil.
P.S. I should add that Docker isn't an option here :( I've done this before using Docker and it's extremely easy.
I'm trying to set up an MWAA Airflow 2.0 environment that integrates S3 and GCP's Pub/Sub. While we have no problems with the environment being initialized, we're having trouble installing some dependencies and importing Python packages -- specifically apache-airflow-providers-google==2.2.0.
We've followed all of the instructions based on the official MWAA Python documentation. We already included the constraints file as prescribed by AWS, activated all Airflow logging configs, and tested the requirements.txt file using the MWAA local runner. The result when updating our MWAA environment's requirements would always be like this
When testing using the MWAA local runner, we observed that using the requirements.txt file with the constraints still takes forever to resolve. Installation takes more than 10-30 minutes which is no good.
As an experiment, we tried using a version of the requirements.txt file that omits the constraints and pinned versioning. Doing so installs the packages successfully and we don't receive import errors anymore on both MWAA local runner and our MWAA environment itself. However, all of our dags will fail to run no matter what. Airflow logs are also inaccessible whenever we do this.
The team and I have been trying to get MWAA environments up and running for our different applications and ETL pipelines but we just can't seem to get things to work smoothly. Any help would be appreciated!
I'm having the same problems and in the end we had to refactor a lot of things to remove the dependence. It looks like is a problem with PIP resolver and apache-airflow-providers-google if you look the official page:
https://pypi.org/project/apache-airflow-providers-google/2.0.0rc1/
In the WORST case, you may need to use Airflow direct on EC2 from docker image and abandon MWAA :(
I've been through similar issues but with different packages. There are certain things you need to take into consideration when using MWAA. I didn't have any issue testing the packages on the local runner then on MWAA using a public VPC, I only had issues when using a private VPC as the web server doesn't have an internet connection, so the method to get the packages to MWAA is different.
Things to take into consideration:
The version of the packages; test on the local runner if you can first
Enable the logs; The scheduler and web server logs can show you issues, but also they may not. The reason for this is Fargate serving the images, will try to roll back to a working state rather than have MWAA be in a non-working state. So, you might not see what the error actually is, it may even look like there were no errors in certain scenarios.
Check dependencies; You may need to download a package with pip download <package>==version. There you can inspect the contents of the .whl file and see if there are any dependencies. You may have extra notes that can point you in the right direction. In one case, using the Slack package wouldn't work until I also added the http package, even though Airflow includes this package.
So, yes it's serverless, and you may have an easy time installing/setting MWAA up, but be prepared to do a little investigation if it doesn't work. I did contact AWS support, but managed to solve it myself in the end. Other than trying the obvious things, only those that use MWAA frequently and have faced varying scenarios will be of any assistance.
I am trying to wrap my head around how DAGs should actually be executed in docker compose environment, when they are dependent on other service (separate python venv) defined in the compose file.
I have setup airflow via docker compose as mentioned in official documentation. Also, I have added a Django service, which has its own dependencies.
Now, i would like to have a DAG, that executes python script using that Django's service python environment (It also uses Django's models. Not sure if that's relevant).
The only way I see it working is with DockerOperator as described here. I managed to setup and execute the test DAG mentioned there, however when I try to run the real task, it fails due to networking issues. Iam quite confident i can solve that issue, but setting everything this way just seems like way too much hassle.
So, in the end I guess Iam wondering what the ideal architecture should when using Airflow via compose? Should the base airflow image be extended with my Django service (creating one hell of a big image) or is there a better way?
You can use PythonVirtualEnvOperator (https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html#pythonvirtualenvoperator) - but it will recreate django virtualenv every time task is run so not ideal
Another option will be to run DockerOperator or KubernetesPodOperator (if you use Kubernetes) and have separate image with Django installed (or even base Django image).
Adding Django to Airflow is probably not the best idea - Airflow has ~500 dependencies when installed with all providers, so chance is that you will have some difficults-to-resolve conflicts.
Also one of the things we consider for Airlfow 2.2 and beyond is to make better way of handling caching, which could help with building cacheable virtualenv created once and shared between workers/pods (but this is just in discussion phase)
You can check out tomorrow's session on Airflow Summit where we discuss what's coming (and generally Airflow Summit is cool):
https://airflowsummit.org/sessions/2021/looking-ahead-what-comes-after-airflow-2/
I'm building a website with React and Firebase that utilizes an algorithm I wrote in python. The database and authentication for the project are both handled by Firebase, so I would like to keep the cloud functions in that same ecosystem if possible.
Right now, I'm using the python-shell npm package to send and receive data from NodeJS to my python script.
I have local unit testing set up so I can test the https.onCall functions locally without needing to deploy and test from the client.
When I am testing locally, everything works perfectly.
However, when I push the functions to the cloud and trigger the function from the client, the logs in the Firebase console show that the python script is missing dependencies.
What is the best way to ensure that the script has all the dependencies available to it up on the server?
I have tried:
-Copying the actual dependency folders from my library/.../site-packages and putting them in the same directory under the /functions folder with the python script. This almost works. I just run into an issue with numpy: "No module named 'numpy.core._multiarray_umath'" is printed to the logs in Firebase.
I apologize if this is an obvious answer. I'm new to Python, and the solutions I've found online seem way to elaborate or involve hosting the python code in another ecosystem (like AWS or Heroku). I am especially hesitant to go to all that work because it runs fine locally. If I can just find a way to send the dependencies up with the script I'm good to go.
Please let me know if you need any more information.
the logs in the Firebase console show that the python script is missing dependencies.
That's because the nodejs runtime targeted by the Firebase CLI doesn't have everything you need to run python programs.
If you need to run a function that's primarily written in python, you should not use the Firebase CLI and instead uses the Google Cloud tools to target the python runtime, which should do everything you want. Yes, it might be extra work for you to learn new tools, and you will not be able to use the Firebase CLI, but it will be the right way to run python in Cloud Functions.
I've got a little question about dependency management for packages used in python operators
We are using airflow in a industralized mode to run scheduled python jobs. it works well but we are facing issues to deal with different python lib needed for each DAG.
Do you have any idea on how to let developers install their own dependencies for their jobs without being admin and being sure that these dependencies don't collide with other jobs ?
Would you recommend having a bash task that loads a virtual env at the beginning of the job ? Any official recommandation to do it ?
Thanks !
Romain.
In general I see two possible solutions for your problem:
Airflow has a PythonVirtualEnvOperator which allows a task to run in a virtualenv which gets created and destroyed automatically. You can pass a python_version and a list of requirements to the task to build the virtual env.
Set up a docker registry and use a DockerOperator rather than a PythonOperator. This would allow teams to set up their own Docker images with specific requirements. This is how I think Heineken set up their airflow jobs as presented in their Airflow Meetup. I'm trying to see whether they posted their slides online but I can't seem to find them.