I have implemented a recommendation engine using Python 2.7 on Google Dataproc/Spark, and need to store the output as records in Datastore for subsequent use by App Engine APIs. However, there doesn't seem to be a way to do this directly.
There is no Python Datastore connector for Dataproc as far as I can see. The Python Dataflow SDK doesn't support writing to Datastore (although the Java one does). MapReduce doesn't have an output writer for Datastore.
That doesn't appear to leave many options. At the moment I think I will have to write the records to Google Cloud Storage and have a separate task running in App Engine to harvest them and store them in Datastore. That is not ideal, as coordinating the two processes has its own difficulties.
Is there a better way to get the data from Dataproc into Datastore?
I succeeded in saving Datastore records from Dataproc. This involved installing additional components on the master VM (via SSH from the console).
The App Engine SDK is installed and initialised using
sudo apt-get install google-cloud-sdk-app-engine-python
sudo gcloud init
This places a new google directory under /usr/lib/google-cloud-sdk/platform/google_appengine/.
The datastore library is then installed via
sudo apt-get install python-dev
sudo apt-get install python-pip
sudo pip install -t /usr/lib/google-cloud-sdk/platform/google_appengine/ google-cloud-datastore
For reasons I have yet to understand, this actually installed one level deeper, i.e. in /usr/lib/google-cloud-sdk/platform/google_appengine/google/google, so for my purposes it was necessary to manually move the components up one level in the path.
To enable the interpreter to find this code I had to add /usr/lib/google-cloud-sdk/platform/google_appengine/ to the Python path. The usual Bash tricks weren't persisting, so I ended up doing this at the start of my recommendation engine.
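Concretely, something along these lines at the top of the script (the path is the one created by the SDK install above):

import sys

# Make the App Engine SDK and the datastore client visible to the interpreter.
sys.path.insert(0, '/usr/lib/google-cloud-sdk/platform/google_appengine/')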
Because of the large amount of data to be stored, I also spent a lot of time attempting to save it via MapReduce. Ultimately I came to the conclusion that too many of the required services were missing on Dataproc. Instead I am using a multiprocessing pool, which is achieving acceptable performance.
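A minimal sketch of the pattern (the entity kind, record fields, batch size, and pool size are placeholders, not my exact code):

from multiprocessing import Pool
from google.cloud import datastore

def save_batch(records):
    # Each worker process creates its own Datastore client.
    client = datastore.Client()
    entities = []
    for record in records:
        key = client.key('Recommendation', record['id'])   # hypothetical kind
        entity = datastore.Entity(key=key)
        entity.update(record)
        entities.append(entity)
    client.put_multi(entities)   # stays under the 500-entity limit per call

def chunks(items, size=400):
    # Split the output into batches small enough for a single commit.
    for i in range(0, len(items), size):
        yield items[i:i + size]

if __name__ == '__main__':
    records = [{'id': 1, 'score': 0.9}, {'id': 2, 'score': 0.4}]   # placeholder output
    pool = Pool(processes=8)
    pool.map(save_batch, list(chunks(records)))
    pool.close()
    pool.join()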
In the past, the Cloud Dataproc team maintained a Datastore Connector for Hadoop but it was deprecated for a number of reasons. At present, there are no formal plans to restart development of it.
The page mentioned above has a few options and your approach is one of the solutions mentioned. At this point, I think your setup is probably one of the easiest ones if you're committed to Cloud Datastore.
I'm trying to set up an MWAA Airflow 2.0 environment that integrates S3 and GCP's Pub/Sub. While we have no problems with the environment being initialized, we're having trouble installing some dependencies and importing Python packages -- specifically apache-airflow-providers-google==2.2.0.
We've followed all of the instructions based on the official MWAA Python documentation. We already included the constraints file as prescribed by AWS, activated all Airflow logging configs, and tested the requirements.txt file using the MWAA local runner. The result when updating our MWAA environment's requirements would always be like this
When testing using the MWAA local runner, we observed that using the requirements.txt file with the constraints still takes forever to resolve. Installation takes 10-30 minutes or more, which is no good.
As an experiment, we tried using a version of the requirements.txt file that omits the constraints and pinned versioning. Doing so installs the packages successfully and we don't receive import errors anymore on either the MWAA local runner or our MWAA environment itself. However, all of our DAGs fail to run no matter what. Airflow logs are also inaccessible whenever we do this.
The team and I have been trying to get MWAA environments up and running for our different applications and ETL pipelines but we just can't seem to get things to work smoothly. Any help would be appreciated!
I had the same problems, and in the end we had to refactor a lot of things to remove the dependency. It looks like it is a problem with the pip resolver and apache-airflow-providers-google; have a look at the official page:
https://pypi.org/project/apache-airflow-providers-google/2.0.0rc1/
In the worst case, you may need to run Airflow directly on EC2 from a Docker image and abandon MWAA :(
I've been through similar issues but with different packages. There are certain things you need to take into consideration when using MWAA. I didn't have any issue testing the packages on the local runner then on MWAA using a public VPC, I only had issues when using a private VPC as the web server doesn't have an internet connection, so the method to get the packages to MWAA is different.
Things to take into consideration:
Check the versions of the packages; test on the local runner first if you can.
Enable the logs; the scheduler and web server logs can show you issues, but they also may not. The reason is that Fargate, which serves the images, will try to roll back to a working state rather than leave MWAA in a non-working state. So you might not see what the error actually is; in certain scenarios it may even look like there were no errors.
Check dependencies; you may need to download a package with pip download <package>==<version>, then inspect the contents of the .whl file to see what its dependencies are. There may be extra notes that point you in the right direction. In one case, the Slack package wouldn't work until I also added the http package, even though Airflow includes that package.
So, yes, it's serverless, and you may have an easy time installing and setting MWAA up, but be prepared to do a little investigation if it doesn't work. I did contact AWS support, but managed to solve it myself in the end. Other than trying the obvious things, only those who use MWAA frequently and have faced varied scenarios will be of much assistance.
There's a similar question here, but it's from 2018 and the solution requires changing the base image for the workers. Another suggestion is to SSH into each node and apt-get install there. This doesn't seem useful, because when autoscaling spawns new nodes you'd need to do it again and again.
Anyway, is there a reasonable way to upgrade the base gcloud in late 2020?
Because task instances run in a shared execution environment, it is generally not recommended to use the gcloud CLI within Composer Airflow tasks, when possible, to avoid state or version conflicts. For example, if you have multiple users using the same Cloud Composer environment, and either of them changes the active credentials used by gcloud, then they can unknowingly break the other's workflows.
Instead, consider using the Cloud SDK Python libraries to do what you need to do programmatically, or use the airflow.providers.google.cloud operators, which may already have what you need.
If you really need to use the gcloud CLI and don't share the environment, then you can use a BashOperator with an install/upgrade script as a prerequisite for any tasks that need to use the CLI. Alternatively, you can build a custom Docker image with gcloud installed and use GKEPodOperator or KubernetesPodOperator to run the CLI command in a Kubernetes pod. That would be slower, but more reliable than verifying dependencies each time.
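A rough sketch of the BashOperator approach, assuming Airflow 1.10-style imports; the DAG id, SDK version, and the gcloud command are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Download a fresh Cloud SDK into the task's own scratch directory and call
# that copy of gcloud, leaving the environment-wide installation untouched.
# Both steps run in a single task so the download stays on the same worker.
GCLOUD_CMD = """
set -e
WORKDIR=$(mktemp -d)
curl -sSL https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-319.0.0-linux-x86_64.tar.gz | tar -xz -C "$WORKDIR"
"$WORKDIR/google-cloud-sdk/bin/gcloud" compute instances list
"""

with DAG(dag_id='gcloud_cli_example',
         start_date=datetime(2020, 11, 1),
         schedule_interval=None) as dag:

    run_with_fresh_gcloud = BashOperator(
        task_id='run_with_fresh_gcloud',
        bash_command=GCLOUD_CMD,
    )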
I'm using the command func azure functionapp publish to publish my Python function app to Azure. As best I can tell, the command only packages up the source code and transfers it to Azure; then, on a remote machine in Azure, the function app is "built" and deployed. The build phase includes the collection of dependencies from PyPI. Is there a way to override where it looks for these dependencies? I'd like to point it to my own PyPI server, or alternatively, provide the wheels locally in my source tree and have it use those. I have a few questions/problems:
Are my assumptions correct?
Assuming they are, is this possible, and how?
I've tried a few things: I read some docs, looked at the various --help options in the CLI tool, and set up a pip.conf file that I verified works for local pip usage. I then "broke" it on purpose and checked whether the publish would fail. It did not, which leads me to believe either that it ignores pip.conf or that the build (and collection of dependencies) happens on the remote end. I'm at a loss, and any tips, pointers, or answers are appreciated!
You can add an additional pip source pointing to your own PyPI server. Check https://learn.microsoft.com/en-us/azure/azure-functions/functions-reference-python#remote-build-with-extra-index-url
Remote build with extra index URL:
When your packages are available from an accessible custom package index, use a remote build. Before publishing, make sure to create an app setting named PIP_EXTRA_INDEX_URL. The value for this setting is the URL of your custom package index. Using this setting tells the remote build to run pip install using the --extra-index-url option. To learn more, see the Python pip install documentation.
You can also use basic authentication credentials with your extra package index URLs. To learn more, see Basic authentication credentials in Python documentation.
And referencing local packages is also possible. Check https://learn.microsoft.com/en-us/azure/azure-functions/functions-reference-python#install-local-packages
I hope both of your questions are answered now.
I'm trying to deploy an API to Google Cloud on Google App Engine, using Python 3 with the standard environment, and I want to use the defer function to put functions into Google Cloud Tasks, as seen here:
https://cloud.google.com/appengine/docs/standard/python/taskqueue/push/creating-tasks#using_the_instead_of_a_worker_service
I tried putting google.appengine in the requirements.txt file, where the Python libraries are usually listed for pip install, by adding a line with google-appengine, but it fails on deploy with the following error message:
ModuleNotFoundError: No module named 'ez_setup'
I've added ez_setup to the list of requirements, before appengine, and it still results in the same error.
I've also tried deploying it without importing google.appengine, thinking it might come already installed, but then I get the expected error saying No module named 'google.appengine' on its import.
Is there something I'm missing on installation/import of this lib? Or is the library deprecated, and some new library is used to defer?
I've also tried to install the library on my local computer, and did not manage to install it here either.
As specified in the public documentation you shared:
The feature for creating tasks and placing them in push queues is not available for the Python 3.7 runtime. That is the reason you are not able to deploy it.
If you try it on Python 2.7 it should work without issues.
As mentioned in #OqueliAMartinez's answer the feature you're seeking (the Task Queue API) isn't available in the python37/2nd generation standard environment. The documentation page you referenced is applicable only to the python27/1st generation standard environment.
For the other runtimes/environments, including python37, you need to use the Cloud Tasks API. Which, unfortunately, doesn't support (at least not yet) deferred tasks. From Features in Task Queues not yet available via Cloud Tasks API:
Deferred/delayed tasks:
In some cases where you might need a series of diverse small tasks handled asynchronously but you don't want to go through the work of setting up individual distinct handlers, the App Engine SDK allows you to use runtime specific libraries to create simple functions to manage these tasks. This feature is not available in Cloud Tasks. Note, though, that normal tasks can be scheduled in the future using Cloud Tasks.
The only way to defer functions in this case is to invoke them in the task queue handler on the worker side (and schedule the tasks in the future as needed to implement the deferral).
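A minimal sketch of scheduling a task in the future with the Cloud Tasks client library; the project, location, queue name, handler URI, and payload are placeholders, and the exact call signature can differ slightly between client library versions:

import json
from datetime import datetime, timedelta

from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path('my-project', 'us-central1', 'my-queue')  # placeholders

# Ask Cloud Tasks to run the worker handler roughly 60 seconds from now.
schedule_time = timestamp_pb2.Timestamp()
schedule_time.FromDatetime(datetime.utcnow() + timedelta(seconds=60))

task = {
    'app_engine_http_request': {
        'relative_uri': '/tasks/do-work',                 # your worker endpoint
        'body': json.dumps({'item_id': 123}).encode(),    # payload read by the handler
    },
    'schedule_time': schedule_time,
}

client.create_task(parent=parent, task=task)

The worker endpoint then performs whatever work would otherwise have been deferred.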
Somewhat related: Cloud Tasks API for python2.7 google app engine
I have a directory of python programs, classes and packages that I currently distribute to 5 servers. It seems I'm continually going to be adding more servers and right now I'm just doing a basic rsync over from my local box to the servers.
What would a better approach be for distributing code across n servers?
thanks
I use Mercurial with Fabric to deploy all the source code. Fabric is written in Python, so it'll be easy for you to get started. Updating the production service is as simple as fab production deploy, which ends up doing something like this (a stripped-down fabfile sketch follows the list):
Shut down all the services and put an "Upgrade in Progress" page.
Update the source code directory.
Run all migrations.
Start up all services.
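A minimal sketch of such a fabfile, assuming Fabric 1.x; the host names, paths, and service commands are placeholders:

# fabfile.py -- run with: fab production deploy
from fabric.api import cd, env, run, task

@task
def production():
    # Placeholder hosts; point these at your real servers.
    env.hosts = ['app1.example.com', 'app2.example.com']

@task
def deploy():
    with cd('/srv/myapp'):                       # placeholder deploy path
        run('supervisorctl stop myapp')          # stop services / show maintenance page
        run('hg pull && hg update default')      # update the source code directory
        run('python manage.py migrate')          # run migrations (if you have any)
        run('supervisorctl start myapp')         # start services again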
It's pretty awesome seeing this all happen automatically.
First, make sure to keep all code under revision control (if you're not already doing that), so that you can check out new versions of the code from a repository instead of having to copy it to the servers from your workstation.
With revision control in place you can use a tool such as Capistrano to automatically check out the code on each server without having to log in to each machine and do a manual checkout.
With such a setup, deploying a new version to all servers can be as simple as running
$ cap deploy
from your local machine.
While I also use version control to do this, another approach you might consider is to package up the source using whatever package management your host systems use (for example RPMs or dpkgs) and set up the systems to use a custom repository. Then an "apt-get upgrade" or "yum update" will update the software on the systems, and you could use something like "mussh" to run the stop/update/start commands on all the hosts.
Ideally, you'd push it to a "testing" repository first, have your staging systems install it, and once the testing of that was signed off on you could move it to the production repository.
It's very similar to the recommendations of using fabric or version control in general, just another alternative which may suit some people better.
The downside to using packages is that you're probably using version control anyway, and you do have to manage version numbers of these packages. I do this using revision tags within my version control, so I could just as easily do an "svn update" or similar on the destination systems.
In either case, you may need to consider the migration from one version to the next. If a user loads a page that contains references to other elements, and then you do the update and those elements go away, what do you do? You may wish to handle this either within your deployment scripting or within your code: first push out a version that has the new page but keeps the old referenced elements, deploy that, and then remove the old elements and deploy again later.
In this way users won't see broken elements within the page.