Currently I build our Google Dataflow Python environment using the official example setup.py: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py
The problems with this approach are:
OS compatibility issues: I am developing on a Mac while Dataflow instances are based on Ubuntu. Using setup.py is quite painful here, as it doesn't seem like the right tool to encapsulate the environment.
The DataflowRunner takes about 20-25 minutes to identify problems with the environment.
I think a good solution to these problems would be a Docker image that mirrors the Dataflow environment, with the DirectRunner running inside that image.
It seems to me that templates (https://cloud.google.com/dataflow/docs/guides/templates/overview) could help with executing from different environments, though I don't think they provide enough insight into the build process.
I am not sure where to find a Docker image I could use for this, or whether there are better ways to reproducibly build Dataflow Python environments.
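For reference, this is roughly how the pipeline gets pointed at setup.py today (a trimmed-down sketch; the project, region, bucket and the toy transform are placeholders rather than my real job):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner="DataflowRunner",              # "DirectRunner" for local testing
    project="my-gcp-project",             # placeholder
    region="us-central1",                 # placeholder
    temp_location="gs://my-bucket/tmp",   # placeholder
)
# setup.py is shipped to the workers and built there, which is where the
# Mac-vs-Ubuntu differences and the slow feedback loop show up.
options.view_as(SetupOptions).setup_file = "./setup.py"

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create([1, 2, 3])
     | beam.Map(lambda x: x * x)
     | beam.Map(print))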
I would like to create a self-contained, .exe file that launches a JupyterLab server as an IDE on a physical server which doesn't have Python installed itself.
The idea is to deploy it as part of an ETL workflow tool, so that it can be used to view notebooks that will contain the ETL steps in a relatively easily digestible format (the notebooks will be used as pipelines via papermill and scrapbook - not really relevant here).
While I can use Pyinstaller to bundle JupyterLab as a package, there isn't a way to launch it on the Pythonless server (that I can see), and I can't figure out a way to do it using Python code alone.
Is it possible to package JupyterLab this way so that I can run the .exe on the server and then connect to 127.0.0.1:8888 on the server to view a notebook?
I have tried using the link below as a starting point, but I think I'm missing something: no server seems to start using this code alone, and I'm not sure how I would run it via a Tornado server, etc.:
https://gist.github.com/bollwyvl/bd56b58ba0a078534272043327c52bd1
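In case it helps, the closest I've got to a plain-Python entry point (which I'd then hand to PyInstaller) is roughly the sketch below; I'm assuming jupyterlab's LabApp can be launched this way, and the IP/port values are only illustrative:

# launch_lab.py - candidate PyInstaller entry point (a sketch, not verified on a Pythonless box)
import sys
from jupyterlab.labapp import LabApp

if __name__ == "__main__":
    # Emulate the "jupyter lab" command line; values are illustrative.
    sys.argv = ["jupyter-lab", "--ip=127.0.0.1", "--port=8888", "--no-browser"]
    sys.exit(LabApp.launch_instance())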
I would really appreciate any ideas, help, or somebody to tell me why this idea is impossible madness!
Thanks!
Phil.
P.S. I should add that Docker isn't an option here :( I've done this before using Docker and it's extremely easy.
I'm trying to set up an MWAA Airflow 2.0 environment that integrates S3 and GCP's Pub/Sub. While we have no problems with the environment being initialized, we're having trouble installing some dependencies and importing Python packages -- specifically apache-airflow-providers-google==2.2.0.
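For context, the kind of DAG we need the provider for looks roughly like this (a simplified sketch rather than our actual DAG; the project, topic and connection id are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.pubsub import PubSubPublishMessageOperator

with DAG(
    dag_id="pubsub_smoke_test",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    publish = PubSubPublishMessageOperator(
        task_id="publish",
        project_id="my-gcp-project",   # placeholder
        topic="my-topic",              # placeholder
        messages=[{"data": b"hello from MWAA"}],
        gcp_conn_id="google_cloud_default",
    )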
We've followed all of the instructions based on the official MWAA Python documentation. We already included the constraints file as prescribed by AWS, activated all Airflow logging configs, and tested the requirements.txt file using the MWAA local runner. The result when updating our MWAA environment's requirements would always be like this
When testing with the MWAA local runner, we observed that using the requirements.txt file with the constraints still takes forever to resolve. Installation takes anywhere from 10 to 30 minutes or more, which is no good.
As an experiment, we tried using a version of the requirements.txt file that omits the constraints and pinned versioning. Doing so installs the packages successfully, and we no longer receive import errors on either the MWAA local runner or our MWAA environment itself. However, all of our DAGs fail to run no matter what, and Airflow logs are inaccessible whenever we do this.
The team and I have been trying to get MWAA environments up and running for our different applications and ETL pipelines but we just can't seem to get things to work smoothly. Any help would be appreciated!
I was having the same problems, and in the end we had to refactor a lot of things to remove the dependency. It looks like it's a problem with the pip resolver and apache-airflow-providers-google; see the official page:
https://pypi.org/project/apache-airflow-providers-google/2.0.0rc1/
In the worst case, you may need to run Airflow directly on EC2 from a Docker image and abandon MWAA :(
I've been through similar issues, but with different packages. There are certain things you need to take into consideration when using MWAA. I didn't have any issues testing the packages on the local runner and then on MWAA with a public VPC; I only had issues with a private VPC, because the web server doesn't have an internet connection, so the method for getting the packages to MWAA is different.
Things to take into consideration:
The version of the packages; test on the local runner first if you can.
Enable the logs; the scheduler and web server logs can show you issues, but they also may not. The reason is that Fargate, which serves the images, will try to roll back to a working state rather than leave MWAA in a non-working state. So you might not see what the error actually is; in certain scenarios it may even look like there were no errors.
Check dependencies; you may need to download a package with pip download <package>==version. Then you can inspect the contents of the .whl file and see whether there are any dependencies, and you may find extra notes that point you in the right direction. In one case, the Slack package wouldn't work until I also added the http package, even though Airflow includes that package.
So yes, it's serverless, and you may have an easy time installing and setting up MWAA, but be prepared to do a little investigation if it doesn't work. I did contact AWS support, but managed to solve it myself in the end. Beyond trying the obvious things, only those who use MWAA frequently and have faced varying scenarios will be of much assistance.
I have a Python project and I want to deploy it on an AWS EC2 instance. My project depends on other Python libraries and uses programs installed on my machine. What are the alternatives for deploying my project on an AWS EC2 instance?
Further details: my project consists of a Celery periodic task that uses ffmpeg and Blender to create short videos.
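For context, the task looks roughly like this (a trimmed-down sketch; the broker URL, schedule, and ffmpeg arguments are placeholders):

import subprocess

from celery import Celery
from celery.schedules import crontab

app = Celery("video_tasks", broker="redis://localhost:6379/0")  # placeholder broker

@app.task
def render_short_video(source="input.mp4", target="output.mp4"):
    # Shell out to ffmpeg; Blender is invoked the same way in the real task.
    subprocess.run(["ffmpeg", "-y", "-i", source, "-t", "15", target], check=True)

app.conf.beat_schedule = {
    "render-every-hour": {
        # Task name assumes this module is named video_tasks.py.
        "task": "video_tasks.render_short_video",
        "schedule": crontab(minute=0),  # top of every hour
    },
}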
I have checked Elastic Beanstalk, but it seems to be tailored for web apps. I don't know if containerizing my project via Docker is a good idea...
The manual and cheapest way to do it would be:
1- Launch a spot instance
2- git clone the project
3- Install the libraries via pip
4- Install all dependent programs
5- Launch periodic task
I am looking for a more automatic way to do it.
Thanks.
Beanstalk is certainly an option. You don't necessarily have to use it for web apps and you can configure all of the dependencies needed via .ebextensions.
Containerization is usually my go to strategy now. If you get it working within Docker locally then you have several deployment options and the whole thing gets much easier since you don't have to worry about setting up all the dependencies within the AWS instance.
Once you have it running in Docker you could use Beanstalk, ECS or CodeDeploy.
I have a Python script that consumes an Azure queue, and I would like to scale this easily inside Azure infrastructure. I'm looking for the easiest solution possible to
run the Python script in an environment that is as managed as possible
have a centralized way to see the scripts running and their output, and easily scale the number of scripts running through a GUI or something very easy to use
I'm looking at Docker at the moment, but it seems very complicated for the extremely simple task I'm trying to achieve. What approaches are known to do this? An added bonus would be if I could scale with the number of items on the queue, but it would be fine if we could just manually control the amount of parallelism.
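For concreteness, the script is essentially a polling loop along these lines (a rough sketch using the azure-storage-queue package; the connection string and queue name are placeholders):

import time

from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(
    conn_str="<storage-connection-string>",  # placeholder
    queue_name="work-items",                 # placeholder
)

while True:
    for msg in queue.receive_messages(messages_per_page=16):
        print("processing", msg.content)   # stand-in for the real work
        queue.delete_message(msg)          # acknowledge once handled
    time.sleep(5)  # back off when the queue is empty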
You should have a look at Azure Web Apps, which also supports Python.
This would be a managed and scalable environment that also supports background tasks (WebJobs) with centralized logging.
Azure Web Apps also offer a free plan for development and testing.
In my experience, CoreOS on Azure can satisfy your needs. You can refer to the doc https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-coreos-how-to/ to learn how to get started.
CoreOS is a Linux distribution built for running Docker containers, and you can access it remotely via an SSH client like PuTTY. To get going with Docker, you can search for a Docker tutorial on Bing and quickly pick up enough basic usage to run Python scripts.
Sounds to me like you are describing something like a micro-services architecture. From that perspective, Docker is a great choice. I recommend you consider using an orchestration framework such as Apache Mesos or Docker Swarm which will allow you to run your containers on a cluster of VMs with the ability to easily scale, deploy new versions, rollback and implement load balancing. The schedulers Mesos supports (Marathon and Chronos) also have a Web UI. I believe you can also implement some kind of triggered scaling like you describe but that will probably not be off the shelf.
This does seem like a bit of a learning curve, but I think it is worth it, especially once you start considering the complexities of deploying new versions (with possible rollbacks), monitoring failures, and even integrating things like Jenkins and continuous delivery.
For Azure, an easy way to deploy and configure a Mesos or Swarm cluster is by using Azure Container Service (ACS) which does all the hard work of configuring the cluster for you. Find additional info here: https://azure.microsoft.com/en-us/documentation/articles/container-service-intro/
I'm finding Hadoop on Windows somewhat frustrating: I want to know if there are any serious alternatives to Hadoop for Win32 users. The features I most value are:
Ease of initial setup & deployment on a smallish network (I'd be astonished if we ever got more than 20 worker-PCs assigned to this project)
Ease of management - the ideal framework should have web/GUI based administration system so that I do not have to write one myself.
Something popular & stable. Bonuses depend on us getting this project delivered in time.
BACKGROUND:
The company I work for wants to build a new grid system to run some financial calculations.
The first framework I have been evaluating is Hadoop. This seemed to do exactly what was intended except that it's very UNIX oriented. I was able to get all of the tutorials up & running on an Ubuntu VirtualBox. Unfortunately nothing seems to run easily on Win32.
Yes... Win32: Our company has a policy that everything has to run on Windows. None of the server admins (or anybody outside of select few developers) know anything about Linux. I'd probably get in trouble if they found my virtual Ubuntu environment! The sad fact is that our grid needs to be hosted on Win32 (since all the test PCs run Windows XP 32bit), with an option to upgrade to Win64 at sometime in the future.
To complicate matters, 95% of what we want to run are Python scripts with C++ Windows 32-bit DLL add-ons. Our calculation library is overwhelmingly written in Python, and it will not run on anything other than Windows... I do not really have a choice.
For Python there is:
disco
bigtempo
celery - not really a map-reduce framework, but it's a good start if you want something very customized
And you can find a bunch of Hadoop clients/integrations on PyPI.
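To make the celery item above concrete, here is a hedged sketch of a map/reduce-style flow using a Celery chord; the broker/backend URLs and the map/reduce bodies are placeholders:

from celery import Celery, chord

app = Celery(
    "grid",
    broker="redis://localhost:6379/0",   # placeholder broker
    backend="redis://localhost:6379/1",  # chords need a result backend
)

@app.task
def calc(x):
    return x * x  # placeholder "map" step (your Python + C++ DLL code would go here)

@app.task
def combine(results):
    return sum(results)  # placeholder "reduce" step

# Dispatch 100 map tasks across the workers; combine() runs once they all finish.
job = chord(calc.s(i) for i in range(100))(combine.s())
# job.get() blocks until the reduced result is available.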
You could try MPI. It is a standard for message-passing concurrent applications. We run it on our Linux cluster, but it is cross-platform. The most popular implementation is MPICH2, written in C. There are Python bindings for MPI through the mpi4py library.
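A minimal mpi4py sketch of that pattern (launch with something like mpiexec -n 4 python tasks.py; the per-task work is just a placeholder):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each process takes its own slice of the work.
tasks = list(range(100))
my_results = [t * t for t in tasks[rank::size]]  # placeholder calculation

# Collect the partial results on rank 0.
results = comm.gather(my_results, root=0)
if rank == 0:
    flat = [r for chunk in results for r in chunk]
    print(len(flat), "results collected")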
IPython has some parallel computing features that are simple and work on Windows. It may be enough for your needs. Here's a good place to start:
http://showmedo.com/videotutorials/video?name=7200100&fromSeriesID=720
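A small sketch with the ipyparallel package (the packaged successor to IPython's parallel features), assuming a local cluster started with ipcluster start -n 4; the calculation is a placeholder:

import ipyparallel as ipp

rc = ipp.Client()                 # connect to the running cluster
view = rc.load_balanced_view()    # spread tasks over whichever engines are free

def price(task_id):
    return task_id * 1.05  # placeholder for a real financial calculation

results = view.map_sync(price, range(20))
print(results[:5])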
I've compiled a list of available MapReduce/Hadoop offerings in the cloud (hosted services, PaaS-level); this might be of help as well.
Many distributed computing frameworks can be used for many-task computing. If you don't need the MapReduce paradigm, but rather the ability to distribute the tasks of a job across separate computers with communication and resource management, then you could take a look at other platforms in this area, such as Condor or even BOINC; both run on Windows.
You could also run Hadoop on Linux virtual machines.