Pass %%local variables from dotenv run in jupyter to Azure HDInsight pyspark cluster - python

Intro
This link details how to install Jupyter locally and work against an Azure HDInsight cluster. This works well for getting things set up.
However:
Not all Python packages that we have available locally are available on the cluster.
Some local processing may need to be done before 'submitting' a cell to the cluster.
I'm aware that Python packages that are not installed can be installed via script actions and %%configure; however, given our use of dotenv locally, these don't seem to be viable solutions.
Problem
Source control is with git; the git repos are local on dev machines.
We store configuration/sensitive environment variables in .env files locally (they are not checked into git).
The dotenv package is used to read the sensitive variables and set them locally for execution.
Blob storage account names and keys are examples of these variables.
How do we pass these locally set variables to a pyspark cell?
Local cell example, followed by a pyspark cell:
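A minimal sketch of what those cells could look like with sparkmagic, assuming a sparkmagic version that provides the %%send_to_spark magic and using STORAGE_ACCOUNT_KEY and the storage account name as placeholder names:

%%local
# runs on the dev machine: load the .env file and read the secret
from dotenv import load_dotenv
import os
load_dotenv()
storage_account_key = os.environ["STORAGE_ACCOUNT_KEY"]   # placeholder variable name

%%send_to_spark -i storage_account_key -t str -n storage_account_key
# copies the local string into the remote Spark session under the same name

# pyspark cell (no magic needed in the PySpark kernel; it runs on the cluster)
spark.conf.set(
    "fs.azure.account.key.<storage-account>.blob.core.windows.net",   # placeholder account
    storage_account_key)

If %%send_to_spark is not available in the installed sparkmagic, a fallback is to string-format the values into the cell that gets submitted to the cluster, with the obvious downside that the secret then appears in the submitted code.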

Related

Set environment variable for FastAPI on Cloud Run

I'm trying to set environment variables using Docker (for FastAPI), but Cloud Run doesn't want to see them. I've tried many solutions; what would be a good way? I should mention that I use the Docker image in Cloud Run.
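One common pattern (a sketch, not necessarily the asker's setup, with service, image, and variable names as placeholders) is to avoid baking secrets into the image and instead set them at deploy time, then read them in FastAPI from the environment:

# Set at deploy time, e.g.:
#   gcloud run deploy my-service --image gcr.io/my-project/my-image --set-env-vars API_KEY=abc123
import os
from fastapi import FastAPI

app = FastAPI()

@app.get("/config")
def show_config():
    # Cloud Run injects --set-env-vars values into the container's environment at runtime
    return {"api_key_present": "API_KEY" in os.environ}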

Google Cloud access through JupyterLab bash

1. Problem:
Unable to run basic !gcloud ... commands in JupyterLab (.ipynb file). They run fine in the terminal (without the '!'), as well as in Jupyter Notebook.
2. Attempted:
Same user, same Anaconda environment in all three instances.
Tried installing Google Cloud again anyway in JupyterLab, and it reports that the Google Cloud files already exist.
3. Context:
Using this file for a couple of commands to scoop up and delete temp/test tables from a BigQuery dataset intermittently, based on keyword.
Ideally, I could keep bash and Python files solely within JupyterLab, as Notebook will eventually be phased out.
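Since the same commands work in the terminal, one plausible cause (an assumption, not a confirmed diagnosis) is that the JupyterLab server process was started with a PATH that does not include the Cloud SDK. A quick check from a notebook cell:

import os, shutil

print(os.environ.get("PATH"))      # compare with `echo $PATH` in the working terminal
print(shutil.which("gcloud"))      # None means gcloud is not on this kernel's PATH

# Possible workarounds: call gcloud by absolute path (path below is a placeholder), or
# prepend the SDK's bin directory before using `!gcloud ...`:
# os.environ["PATH"] = "/home/<user>/google-cloud-sdk/bin:" + os.environ["PATH"]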

PySpark execute job in standalone mode but with user defined modules?

I have installed Spark on some machines to use them in standalone cluster mode, so I now have several machines that each have the same Spark build (Spark 2.4.0 built on Hadoop 2.7+).
I want to use this cluster for parallel data analysis, and my language of choice is Python, so I'm using PySpark rather than Spark. I have created some modules for the operations that process the data and give it the form that I want.
However, I don't want to manually copy all these modules to every machine, so I would like to know what options there are in PySpark for passing the dependencies, so that I can be sure the modules are present on every executor.
I have thought of virtual environments that would be activated with the modules installed, but I don't know how to do that in Spark standalone mode; the YARN manager seems to have this option, but I won't install YARN.
PS: Some modules use data files like .txt and some dynamic libraries like .dll and .so, and I want those to be passed to the executors too.
A good solution for distributing Spark and your modules is to use Docker Swarm (I hope you have experience with Docker).
Take a look at this repository; it was very useful for me at the time: https://github.com/big-data-europe/docker-spark
It is a good base for distributing Spark. On top of that you can build your own modules: you create your own Docker images, push them to your Docker Hub, and then easily distribute them across your cluster with Docker Swarm.
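As an alternative to the Docker route, PySpark itself can ship dependencies to the executors in standalone mode via addPyFile/addFile (or the equivalent --py-files/--files spark-submit flags). A minimal sketch, where the master URL, modules.zip, data.txt, libfoo.so, and mymodule are placeholders for your own setup (native .dll/.so libraries must also be loadable on the worker OS):

# submit-time equivalent:
#   spark-submit --master spark://<master>:7077 --py-files modules.zip --files data.txt,libfoo.so job.py
from pyspark import SparkContext, SparkFiles

sc = SparkContext(master="spark://<master>:7077", appName="deps-demo")  # placeholder master URL

sc.addPyFile("modules.zip")   # zipped package of your modules, importable on every executor
sc.addFile("data.txt")        # plain data file, fetched into each executor's work directory
sc.addFile("libfoo.so")       # shared libraries are distributed the same way

def use_deps(x):
    import mymodule                          # importable because of addPyFile (placeholder name)
    path = SparkFiles.get("data.txt")        # local path of the distributed file on the executor
    return mymodule.process(x, path)         # placeholder function

print(sc.parallelize(range(4)).map(use_deps).collect())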

Simple Google Cloud deployment: Copy Python files from Google Cloud repository to app engine

I'm implementing continuous integration and continuous delivery for a large enterprise data warehouse project.
All the code resides in a Google Cloud repository, and I'm able to set up a Google Cloud Build trigger so that every time code of a specific file type (Python scripts) is pushed to the master branch, a Google Cloud Build starts.
The Python scripts don't make up an app. They contain an ODBC connection string and a script to extract data from a source and store it as a CSV file. The Python scripts are to be executed on a Google Compute Engine VM instance with Airflow installed.
So the deployment of the Python scripts is as simple as can be: the .py files only need to be copied from the Google Cloud repository folder to a specific folder on the Google VM instance. There is not really a traditional build to run, as all the Python files are separate from each other and not part of an application.
I thought this would be really easy, but I have now spent several days trying to figure this out with no luck.
Google Cloud Platform provides several Cloud Builders, but as far as I can see none of them can do this simple task. Using gcloud also does not work: it can copy files, but only from the local PC to the VM, not from the source repository to the VM.
What I'm looking for is a YAML or JSON build config file to copy those Python files from source repository to Google Compute Engine VM Instance.
Hoping for some help here.
The files/folders in the Google Cloud repository aren't directly accessible (it's like a bare git repository); you need to first clone the repo and then copy the desired files/folders from the cloned repo to their destinations.
It might be possible to use a standard Fetching dependencies build step to clone the repo, but I'm not 100% certain of it in your case, since you're not actually doing a build:
steps:
- name: gcr.io/cloud-builders/git
  args: ['clone', 'https://github.com/GoogleCloudPlatform/cloud-builders']
If not you may need one (or more) custom build steps. From Creating Custom Build Steps:
A custom build step is a container image that the Cloud Build worker
VM pulls and runs with your source volume-mounted to /workspace.
Your custom build step can execute any script or binary inside the
container; as such, it can do anything a container can do.
Custom build steps are useful for:
Downloading source code or packages from external locations
...
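For completeness, a sketch of what such a copy step could look like, assuming the build trigger already places the repository contents in /workspace and that the Cloud Build service account has the IAM/SSH access needed to reach the instance; the zone, instance name, and paths below are placeholders:

steps:
- name: gcr.io/cloud-builders/gcloud
  args: ['compute', 'scp', '--recurse', '--zone=europe-west1-b',
         '/workspace/scripts', 'airflow-vm:/home/airflow/scripts']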

Prevent user from importing os module in jupyter notebook

I am trying to set up a Jupyter notebook server so that a few members can have access and run analyses on it. But there are several API credentials stored as environment variables that I don't want users to have access to. Basically, I want to prevent users from importing the os module in the notebook, since os.environ lists all environment variables on the server. What would be a proper way to do this?
You could try running the Jupyter notebook server as a Docker container. That way your host environment variables will be isolated from the container, which only sees the variables you explicitly pass to it. IPython has an available Docker image, so you need to install Docker if this approach works for you.
Installing Ipython Docker Image
If you need to pass environment variables for the Docker container refer to this question: Passing env variables to docker
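A minimal sketch of that approach (the image name and variable are placeholders; jupyter/base-notebook is one commonly used notebook image): run the server in a container and pass through only the variables you are happy for users to see, e.g.

docker run -d -p 8888:8888 -e SAFE_VAR=not_a_secret jupyter/base-notebook

Inside that container, os.environ only shows the container's own environment, so the host's API credentials never appear even though users can still import os.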
