I'm facing problems running a streaming pipeline on DataflowRunner after separating the main pipeline code and the custom transforms code into multiple files, as described here: Multiple File Dependencies. No element (Pub/Sub message) is read into the pipeline, and none of the tabs in the (new) Dataflow UI (JOB LOGS, WORKER LOGS, JOB ERROR REPORTING) report any errors. Job ID: 2020-04-06_15_23_52-4004061030939218807 if someone wants to have a look...
Pipeline minimal code (BEFORE):
pipeline.py
row = p | "read_sub" >> pubsub.ReadFromPubSub(subscription=SUB, with_attributes=True) \
        | "add_timestamps" >> beam.Map(add_timestamps)
where add_timestamps is my custom transform:
def add_timestamps(e):
    payload = e.data.decode()
    return {"message": payload}
All works fine when add_timestamps and the pipeline code are in the same file, pipeline.py.
AFTER I restructured the files as follows:
root_dir/
    pipeline.py
    setup.py
    my_transforms/
        __init__.py
        transforms.py
where setup.py is:
import setuptools

setuptools.setup(
    name='my-custom-transforms-package',
    version='1.0',
    install_requires=["datetime"],
    packages=['my_transforms']  # setuptools.find_packages(),
)
All the add_timestamps transform code was moved to transforms.py (under the my_transforms package directory).
In my pipeline.py I now import and use the transform as follows:
from my_transforms.transforms import add_timestamps
row = p | "read_sub" >> pubsub.ReadFromPubSub(subscription=SUB, with_attributes=True) \
        | "add_timestamps" >> beam.Map(add_timestamps)
While launching the pipeline I do set the flag --setup_file=./setup.py.
However, not a single element is read into the pipeline (as you can see, the Data watermark is still stuck and Elements added (Approximate) does not report anything).
I have tested the Multiple File Dependencies option in Dataflow and it works fine for me; I reproduced the example from Medium.
Your directory structure is correct. Have you added the necessary imports in the transforms.py file?
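For example, transforms.py must carry its own imports, since it is the module that gets shipped to the workers. A hypothetical sketch (the decode/return shape comes from the question; the timestamp field and the datetime import are assumptions that would explain the function name and the datetime entry in setup.py):

# my_transforms/transforms.py
from datetime import datetime, timezone  # assumed dependency

def add_timestamps(e):
    payload = e.data.decode()
    return {
        "message": payload,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed field
    }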
I would recommend making some changes in setup.py:
import setuptools

REQUIRED_PACKAGES = [
    'datetime',
]

PACKAGE_NAME = 'my_transforms'
PACKAGE_VERSION = '0.0.1'

setuptools.setup(
    name=PACKAGE_NAME,
    version=PACKAGE_VERSION,
    description='My transforms package',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages()
)
When running your pipeline, keep an eye on setting the following fields in PipelineOptions: job_name, project, runner, staging_location, temp_location. You must specify at least one of temp_location or staging_location to run your pipeline on Google Cloud. If you use the Apache Beam SDK for Python 2.15.0 or later, you must also specify region. Remember to specify the full path to setup.py.
It will look similar to this command:
python3 pipeline.py \
  --job_name <JOB_NAME> \
  --project <PROJECT_NAME> \
  --runner DataflowRunner \
  --region <REGION> \
  --temp_location gs://<BUCKET_NAME>/temp \
  --setup_file /<FULL_PATH>/setup.py
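If you prefer to set these programmatically instead of on the command line, here is a minimal sketch using PipelineOptions (the placeholders and the streaming flag are assumptions based on the question, not a definitive setup):

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions)

# Placeholder values; replace with your own project, bucket and region.
options = PipelineOptions(streaming=True)
options.view_as(GoogleCloudOptions).project = '<PROJECT_NAME>'
options.view_as(GoogleCloudOptions).job_name = '<JOB_NAME>'
options.view_as(GoogleCloudOptions).region = '<REGION>'
options.view_as(GoogleCloudOptions).temp_location = 'gs://<BUCKET_NAME>/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'
options.view_as(SetupOptions).setup_file = '/<FULL_PATH>/setup.py'

with beam.Pipeline(options=options) as p:
    ...  # build the pipeline as in the question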
I hope it helps.
I found the root cause... I was setting the flag --no_use_public_ips and had install_requires=["datetime"] in setup.py.
Of course, without an external IP the workers were unable to reach the Python package index to install datetime. The problem was solved by not setting the flag --no_use_public_ips (I'll look later at how to disable external IPs for workers and still run successfully). It would have been good if at least some error message were displayed in the Job/Worker logs! I spent 2-3 days troubleshooting :)
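As a side note, datetime is part of the Python standard library, so if the transforms only need the standard datetime module, one way to keep workers without external IPs happy would be to drop it from install_requires entirely. A minimal sketch of setup.py under that assumption (nothing left to download from PyPI):

import setuptools

setuptools.setup(
    name='my-custom-transforms-package',
    version='1.0',
    # The standard-library datetime module needs no pip install, so there is
    # nothing for workers to fetch from PyPI, even without external IPs.
    install_requires=[],
    packages=setuptools.find_packages(),
)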
Related
I have a fairly standard Python test (full sources are here):
from absl.testing import absltest
[...]
class BellTest(absltest.TestCase):
def test_bell(self):
[...]
and the corresponding entry in the BUILD file, where the dependencies are other targets in the same BUILD file:
py_test(
    name = "bell_test",
    size = "small",
    srcs = ["bell_test.py"],
    python_version = "PY3",
    srcs_version = "PY3",
    deps = [
        ":tensor",
        ":state",
        ":ops",
        ":bell",
    ],
)
Without problems I can 'run' this via
bazel run bell_test
[...] comes out Ok.
However, I cannot 'test' it
bazel test bell_test
[...] FAILED
The log file tells me that it cannot find the dependency on absl.testing. This is puzzling, given that it works with 'run'. It also works on Linux and macOS without problems.
I have tried all kinds of ways to add a dependency on absl/testing, but to no avail. Pointers would be greatly appreciated.
Side note: It would be great if bazel would print the path to the log file with Windows backslashes!
I've been trying to run this Apache Beam script. The script runs nightly through an Airflow DAG and works perfectly fine that way, so I'm (reasonably) confident that the script is correct. I think the relevant part of my Beam script is summarized by this:
def configure_pipeline(p, opt):
    """Specify PCollection and transformations in pipeline."""
    read_input_source = beam.io.ReadFromText(opt.input_path,
                                             skip_header_lines=1,
                                             strip_trailing_newlines=True)
    _ = (p
         | 'Read input' >> read_input_source
         | 'Get prediction' >> beam.ParDo(PredictionFromFeaturesDoFn())
         | 'Save to disk' >> beam.io.WriteToText(
             opt.output_path, file_name_suffix='.ndjson.gz'))
And to execute the script, I run this:
python beam_process.py \
--project=\my-project \
--region=us-central1 \
--runner=DataflowRunner \
--temp_location=gs://staging/location \
--job_name=beam-process-test \
--max_num_workers=5 \
--runner=DataflowRunner \
--input_path="gs://path/to/file/input-000000000000.jsonl.gz" \
--output_path="gs://path/to/output"
The job in Dataflow runs with no errors; however, the output file is completely empty apart from the file name. Running this with the direct runner and local directories, the process runs as expected and I get fully workable outputs. I've tried different inputs as well as different Cloud Storage buckets. The only thing I can think of is a permissions problem that I'm unaware of. I can post the Dataflow job details (or at least what I'm able to see of them) if needed.
EDIT
For the few who end up seeing this, I ended up fixing it but the reason is still unknown to me. By adding quotes around the entire input field:
--runner=DataflowRunner \
'--input_path="gs://path/to/file/input-000000000000.jsonl.gz" \'
--output_path="gs://path/to/output"
allows Dataflow to read the input stream.
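For reference, a minimal sketch of how opt.input_path might be wired up as a custom pipeline option (the class and option names here are assumptions, not taken from the real beam_process.py); printing the parsed value makes stray shell quoting visible immediately:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical options class; the real script may define its options differently.
class ProcessOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--input_path', required=True)
        parser.add_argument('--output_path', required=True)

opt = PipelineOptions().view_as(ProcessOptions)
print(repr(opt.input_path))  # e.g. '"gs://..."' (embedded quotes) vs. 'gs://...'

with beam.Pipeline(options=opt) as p:
    configure_pipeline(p, opt)  # configure_pipeline as shown in the question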
We are using Google Cloud Composer (a managed Airflow service) with Airflow v1.10 and Python 3.6.8.
To deploy our DAGs, we are using the Packaged DAGs method (https://airflow.apache.org/concepts.html?highlight=zip#packaged-dags).
All is well when the zip file is created from the command line like
zip -r dag_under_test.zip test_dag.py
but when I try to do this from a pytest fixture, so I can load the DagBag and test the integrity of my DAG, Airflow doesn't recognise the zip file at all. Here is the code of my pytest fixture:
@pytest.fixture
def setup(config):
    os.system("zip -r dag_under_test.zip test_zip.py")

def test_import_dags(setup):
    dagbag = DagBag(include_examples=False)
    noOfDags = len(dagbag.dags)
    dagbag.process_file("dag_under_test.zip")
    assert len(dagbag.dags) == noOfDags + 1, 'DAG import failures. Errors: {}'.format(dagbag.import_errors)
I copied this zip file to the DAGs folder, but Airflow isn't recognising it at all, and there are no error messages.
However, a zip file built with the same command from the command line is loaded by Airflow! It seems like I am missing something obvious here, but I can't figure out what.
In this case, it looks like there is a mismatch between the working directory of os.system and where the DagBag loader is looking. If you inspect the code of airflow/dagbag.py, the path accepted by process_file is passed to os.path.isfile:
def process_file(self, filepath, only_if_updated=True, safe_mode=True):
    if filepath is None or not os.path.isfile(filepath):
        ...
That means within your test, you can probably do some testing to make sure all of these match:
# Make sure this works
os.path.isfile(filepath)
# Make sure these are equal
os.system('pwd')
os.getcwd()
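As a sketch, building the archive with absolute paths anchored to the test file removes the dependency on pytest's working directory entirely (the file names come from the question; using zipfile instead of os.system is an assumption):

import os
import zipfile
import pytest
from airflow.models import DagBag

HERE = os.path.dirname(os.path.abspath(__file__))
ZIP_PATH = os.path.join(HERE, "dag_under_test.zip")

@pytest.fixture
def packaged_dag():
    # Build the archive next to this test file, with the DAG at the zip root.
    with zipfile.ZipFile(ZIP_PATH, "w") as zf:
        zf.write(os.path.join(HERE, "test_zip.py"), arcname="test_zip.py")
    yield ZIP_PATH
    os.remove(ZIP_PATH)

def test_import_dags(packaged_dag):
    dagbag = DagBag(include_examples=False)
    before = len(dagbag.dags)
    dagbag.process_file(packaged_dag)
    assert len(dagbag.dags) == before + 1, dagbag.import_errors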
So it turned out that where I create the zip file from matters. In this case I was creating the zip file from the test folder while archiving files that live in the src folder. Although the final zip file looks perfect to the naked eye, Airflow rejects it.
I tried adding '-j' to the zip command (to junk the directory names) and my test started working.
zip -r -j dag_under_test_metrics.zip ../src/metricsDAG.py
I had another, bigger problem: testing the same scenario when there is a full folder structure in my DAG project, i.e. a DAG file at the top level which references lots of Python modules within the project. I couldn't get this working with the trick above, but came up with a workaround. I created a small shell script which does the zip part, like this:
SCRIPT_PATH=${0%/*/*}
cd $SCRIPT_PATH
zip -r -q test/dag_under_test.zip DagRunner.py
zip -r -q test/dag_under_test.zip tasks dag common resources
This shell script changes the current directory to the project home and archives from there. I invoke this script from the pytest fixture like this:
@pytest.fixture
def setup():
    os.system('rm {}'.format(DAG_UNDER_TEST))
    os.system('sh {}'.format(PACKAGE_SCRIPT))
    yield
    print("-------- clean up -----------")
    os.system('rm {}'.format(DAG_UNDER_TEST))
This works perfectly with my integration test.
def test_conversionDAG(setup):
    configuration.load_test_config()
    dagbag = DagBag(include_examples=False)
    noOfDags = len(dagbag.dags)
    dagbag.process_file(DAG_UNDER_TEST)
    assert len(dagbag.dags) == noOfDags + 1, 'DAG import failures. Errors: {}'.format(dagbag.import_errors)
    assert dagbag.get_dag("name of the dag")
The following is my PySpark startup snippet, which is pretty reliable (I've been using it a long time). Today I added the two Maven coordinates shown in the spark.jars.packages option (effectively "plugging in" Kafka support). Normally that triggers dependency downloads (performed by Spark automatically):
import sys, os, multiprocessing
from pyspark.sql import DataFrame, DataFrameStatFunctions, DataFrameNaFunctions
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as sFn
from pyspark.sql.types import *
from pyspark.sql.types import Row
# ------------------------------------------
# Note: Row() in .../pyspark/sql/types.py
# isn't included in '__all__' list(), so
# we must import it by name here.
# ------------------------------------------
num_cpus = multiprocessing.cpu_count() # Number of CPUs for SPARK Local mode.
os.environ.pop('SPARK_MASTER_HOST', None) # Since we're using pip/pySpark these three ENVs
os.environ.pop('SPARK_MASTER_PORT', None) # aren't needed; and we ensure pySpark doesn't
os.environ.pop('SPARK_HOME', None) # get confused by them, should they be set.
os.environ.pop('PYTHONSTARTUP', None) # Just in case pySpark 2.x attempts to read this.
os.environ['PYSPARK_PYTHON'] = sys.executable # Make SPARK Workers use same Python as Master.
os.environ['JAVA_HOME'] = '/usr/lib/jvm/jre' # Oracle JAVA for our pip/python3/pySpark 2.4 (CDH's JRE won't work).
JARS_IVY_REPO = '/home/jdoe/SPARK.JARS.REPO.d/'
# ======================================================================
# Maven Coordinates for JARs (and their dependencies) needed to plug
# extra functionality into Spark 2.x (e.g. Kafka SQL and Streaming)
# A one-time internet connection is necessary for Spark to automatically
# download JARs specified by the coordinates (and dependencies).
# ======================================================================
spark_jars_packages = ','.join(['org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.0',
'org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0',])
# ======================================================================
spark_conf = SparkConf()
spark_conf.setAll([('spark.master', 'local[{}]'.format(num_cpus)),
('spark.app.name', 'myApp'),
('spark.submit.deployMode', 'client'),
('spark.ui.showConsoleProgress', 'true'),
('spark.eventLog.enabled', 'false'),
('spark.logConf', 'false'),
('spark.jars.repositories', 'file:/' + JARS_IVY_REPO),
('spark.jars.ivy', JARS_IVY_REPO),
('spark.jars.packages', spark_jars_packages), ])
spark_sesn = SparkSession.builder.config(conf = spark_conf).getOrCreate()
spark_ctxt = spark_sesn.sparkContext
spark_reader = spark_sesn.read
spark_streamReader = spark_sesn.readStream
spark_ctxt.setLogLevel("WARN")
However, the packages aren't being downloaded and/or loaded when I run the snippet (e.g. ./python -i init_spark.py), as they should be.
This mechanism used to work, but then stopped. What am I missing?
Thank you in advance!
This is the kind of post where the QUESTION will be worth more than the ANSWER, because the code above works but isn't anywhere to be found in Spark 2.x documentation or examples.
The above is how I've programmatically added functionality to Spark 2.x by way of Maven Coordinates. I had this working but then it stopped working. Why?
When I ran the above code in a Jupyter notebook, the notebook had, behind the scenes, already run that identical code snippet by way of my PYTHONSTARTUP script. That PYTHONSTARTUP script has the same code as the above, but intentionally omits the Maven coordinates.
Here, then, is how this subtle problem emerges:
spark_sesn = SparkSession.builder.config(conf = spark_conf).getOrCreate()
Because a Spark Session already existed, the above statement simply reused that existing session (.getOrCreate()), which did not have the jars/libraries loaded (again, because my PYTHONSTARTUP script intentionally omits them). This is why it is a good idea to put print statements in PYTHONSTARTUP scripts (which are otherwise silent).
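A quick way to catch this is to inspect the configuration of whatever session getOrCreate() hands back, and to stop a stale session before rebuilding it. A minimal sketch (not part of the original startup script; spark_conf is the SparkConf built above):

from pyspark.sql import SparkSession

spark_sesn = SparkSession.builder.config(conf=spark_conf).getOrCreate()

# If an existing session was reused, the Kafka coordinates will be missing here.
print(spark_sesn.sparkContext.getConf().get('spark.jars.packages', '<not set>'))

# To force a fresh session with the new conf, stop the stale one first:
# spark_sesn.stop()
# spark_sesn = SparkSession.builder.config(conf=spark_conf).getOrCreate()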
In the end, I simply forgot to do this: $ unset PYTHONSTARTUP before starting the JupyterLab / Notebook daemon.
I hope the Question helps others because that's how to programmatically add functionality to Spark 2.x (in this case Kafka). Note that you'll need an internet connection for the one-time download of the specified jars and recursive dependencies from Maven Central.
I am running a Python script on several Linux nodes (after the creation of a pool) using Azure Batch. Each node runs Ubuntu 14.04.5-LTS.
In the script, I upload several files to each node and then run several tasks on each of these nodes. But I get a "Permission Denied" error when I try to execute the first task. The task is actually an unzip of a few files (FYI, the upload of these zip files went well).
This script was running fine until recent weeks. I suspect an update of the Ubuntu version, but maybe it's something else.
Here is the error I get:
error: cannot open zipfile [ /mnt/batch/tasks/shared/01-AXAIS_HPC.zip ]
Permission denied
unzip: cannot find or open /mnt/batch/tasks/shared/01-AXAIS_HPC.zip,
Here is the main part of the code:
credentials = batchauth.SharedKeyCredentials(_BATCH_ACCOUNT_NAME, _BATCH_ACCOUNT_KEY)
batch_client = batch.BatchServiceClient(
    credentials,
    base_url=_BATCH_ACCOUNT_URL)

create_pool(batch_client,
            _POOL_ID,
            application_files,
            _NODE_OS_DISTRO,
            _NODE_OS_VERSION)

helpers.create_job(batch_client, _JOB_ID, _POOL_ID)

add_tasks(batch_client,
          _JOB_ID,
          input_files,
          output_container_name,
          output_container_sas_token)
with add_tasks:
def add_tasks(batch_service_client, job_id, input_files,
              output_container_name, output_container_sas_token):
    print('Adding {} tasks to job [{}]...'.format(len(input_files), job_id))
    tasks = list()
    for idx, input_file in enumerate(input_files):
        command = ['unzip -q $AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC.zip -d $AZ_BATCH_NODE_SHARED_DIR',
                   'chmod a+x $AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC/00-EXE/linux/*',
                   'PATH=$PATH:$AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC/00-EXE/linux',
                   'unzip -q $AZ_BATCH_TASK_WORKING_DIR/'
                   '{} -d $AZ_BATCH_TASK_WORKING_DIR/{}'.format(input_file.file_path, idx + 1),
                   'Rscript $AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC/03-MAIN.R $AZ_BATCH_TASK_WORKING_DIR $AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC $AZ_BATCH_TASK_WORKING_DIR/'
                   '{} {}'.format(idx + 1, idx + 1),
                   'python $AZ_BATCH_NODE_SHARED_DIR/01-IMPORT_FILES.py '
                   '--storageaccount {} --storagecontainer {} --sastoken "{}"'.format(
                       _STORAGE_ACCOUNT_NAME,
                       output_container_name,
                       output_container_sas_token)]
        tasks.append(batchmodels.TaskAddParameter(
            'Task{}'.format(idx),
            helpers.wrap_commands_in_shell('linux', command),
            resource_files=[input_file]
        ))

    Split = lambda tasks, n=100: [tasks[i:i + n] for i in range(0, len(tasks), n)]
    SPtasks = Split(tasks)
    for i in range(len(SPtasks)):
        batch_service_client.task.add_collection(job_id, SPtasks[i])
Do you have any insights to help me on this issue? Thank you very much.
Robin
Looking at the error, i.e.
error: cannot open zipfile [ /mnt/batch/tasks/shared/01-AXAIS_HPC.zip ]
Permission denied unzip: cannot find or open /mnt/batch/tasks/shared/01-AXAIS_HPC.zip,
this suggests that the file is not present at the shared directory location, or that it does not have the correct permissions. The former is more likely.
Is there any particular reason you are using the shared directory approach? Also, how are you uploading the file? (i.e. check that the use of async and await is done correctly, and that no greedy process runs your task before the shared_dir content is available on the node.)
Side note: you own the node, so you can RDP / SSH into it and check that the shared_dir contents are actually present.
A few things to ask: how are you uploading these zip files?
Also, if I may ask, what is the design / user scenario here and how exactly do you intend to use this?
Recommendation:
There are a few other ways to use zip files on an Azure Batch node, e.g. via a resource file or via an application package. (The application package route might suit *.zip files better.) I have added a few documents and places below where you can look at sample implementations and guidance for this.
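For example, a resource file can deliver the zip straight into each task's working directory. A minimal sketch (the SAS URL is a placeholder, and the parameter names are assumptions that depend on your azure-batch SDK version; older SDKs use blob_source instead of http_url):

import azure.batch.models as batchmodels

# Placeholder SAS URL to the zip in blob storage.
zip_sas_url = 'https://<STORAGE_ACCOUNT>.blob.core.windows.net/<CONTAINER>/01-AXAIS_HPC.zip?<SAS_TOKEN>'

zip_resource = batchmodels.ResourceFile(
    http_url=zip_sas_url,          # older SDKs call this parameter blob_source
    file_path='01-AXAIS_HPC.zip')  # downloaded into $AZ_BATCH_TASK_WORKING_DIR

task = batchmodels.TaskAddParameter(
    id='Task0',
    command_line='/bin/bash -c "unzip -q 01-AXAIS_HPC.zip -d ."',
    resource_files=[zip_resource])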
I think a good place to start is the material and samples below; I hope they help. :)
Also, I would recommend recreating your pool if it is old, which will ensure the nodes are running the latest version.
Azure Batch learning path
Azure Batch API basics
Samples & demo link or look here
Detailed walkthrough depending on what you are using, i.e. CloudServiceConfiguration or VirtualMachineConfiguration link.