I've written a streaming Google Dataflow pipeline in Python using the Beam SDK. There's documentation about how I run this locally and set the --runner flag to run it on Dataflow.
I'm now trying to automate the deployment of this from a CI pipeline (Bitbucket Pipelines, but that's not really relevant). There is documentation on how to 'run' a pipeline, but not really how to 'deploy' it. The commands I've tested with look like:
python -m dataflow --runner "DataflowRunner" \
--jobName "<jobName>" \
--topic "<pub-sub-topic>" \
--project "<project>" \
--dataset "<dataset>" \
--worker_machine_type "n1-standard-2" \
--temp_location "gs://<bucket-name>/tmp/"
This will run the job, but because it's streaming it will never return. It also internally manages the packaging and pushing to a bucket. I know the job keeps running if I kill that process, but setting that up on a CI server in a way where I can detect whether the deployment actually succeeded, rather than just killing the process after some timeout, is difficult.
This seems ridiculous and like I'm missing something obvious, but how do I package and run this module on Dataflow in a way that lets me reliably know it deployed from a CI pipeline?
So yes, it was something dumb.
Basically when you use the
with beam.Pipeline(options=options) as p:
syntax, under the hood it's calling wait_until_finish(). So the wait was being invoked without my realizing it, causing the script to hang around forever. Refactoring to remove the context manager fixes the problem.
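For illustration, a minimal sketch of the refactored submission path, assuming streaming pipeline options; the pipeline construction itself is just a placeholder:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder options; in practice these come from the command-line flags.
options = PipelineOptions(streaming=True)

# No "with" block: build the pipeline, then call run() without wait_until_finish(),
# so the script returns as soon as the job has been submitted to Dataflow.
pipeline = beam.Pipeline(options=options)
# ... apply the pipeline transforms here ...
pipeline.run()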
To expand on jamielennox's answer.
When running with the DirectRunner in your local development environment, you want to see the pipeline running indefinitely, perhaps only cancelling it manually with Ctrl-C after a while.
When deploying the pipeline to run on GCP's Dataflow, you want your script to deploy the job and end.
runner_name = pipeline_options.get_all_options().get('runner')
if runner_name == 'DirectRunner':
    # Local development: the context manager blocks until the pipeline finishes
    # (or is cancelled with Ctrl-C).
    with beam.Pipeline(options=pipeline_options) as pipeline:
        _my_setup_pipeline(config, pipeline, subscription_full_name)
elif runner_name == 'DataflowRunner':
    # Deployment: submit the job to Dataflow and return immediately, without waiting.
    pipeline = beam.Pipeline(options=pipeline_options)
    _my_setup_pipeline(config, pipeline, subscription_full_name)
    pipeline.run()
else:
    raise Exception(f'Unknown runner: {runner_name}')
Related
I have a PySpark application that I am currently running in local mode. Going forward this needs to be deployed in production and the PySpark job needs to be triggered from the client's end.
What changes do I need to make so that it can be triggered from the client's end?
I am using Spark 3.2 and Python 3.6.
Currently I am executing this job by firing a spark-submit command from the same server where Spark is installed.
spark-submit --jars /app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar spark_job.py
1. First, I think I need to specify the jars in my SparkSession:
spark = SparkSession.builder.master("local[*]").appName('test1') \
    .config("spark.jars", "/app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar") \
    .getOrCreate()
This didn't work and errors out with "java.lang.ClassNotFoundException".
I think it is not able to find the jars, though it works fine when I pass them from spark-submit.
What is the right way of passing jars to a SparkSession?
2. I don't have a Spark cluster. It is more like an R&D project which processes a small dataset, so I don't think I need a cluster yet. How shall I run spark-submit from a remote server at the client's end?
Shall I change the master from local to the IP of the server where Spark is installed?
3. How can the client trigger this job? Shall I write a Python script that triggers this
spark-submit via a subprocess and give it to the client, so that they can execute this Python file at a specific time from their workflow manager tool?
import subprocess

spark_submit_str = 'spark-submit --master "server-ip" spark_job.py'  # "server-ip" is a placeholder
process = subprocess.Popen(spark_submit_str, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                           universal_newlines=True, shell=True)
stdout, stderr = process.communicate()
if process.returncode != 0:
    print(stderr)
print(stdout)
I'm using Google Cloud Run. I run a container with a simple Flask + gunicorn app that starts a heavy computation.
Sometimes it fails
Application exec likely failed
terminated: Application failed to start: not available
I'm 100% confident it's not related to Google Cloud Run timeouts or Flask + gunicorn timeouts.
I've added hooks for gunicorn: worker_exit, worker_abort, worker_int, on_exit. The mentioned hooks are not invoked.
Exactly the same operation works well locally; I can reproduce it only on Cloud Run.
It seems like something crashes on Cloud Run and just kills my Python process completely.
Is there any chance to debug it?
Maybe I can somehow stream tail -f /var/log/{messages,kernel,dmesg,syslog} in parallel to the logs?
The idea is to understand what kills the app.
UPD:
I've managed to get a bit more logs:
Default
[INFO] Handling signal: term
Caught SIGTERM signal.Caught SIGTERM signal.
What is the right way to find what (and why) sends SIGTERM to my python process?
I would suggest setting up Cloud Logging with your Cloud Run instance. You can easily do so by following this documentation which shows how to attach Cloud Logging to the Python root logger. This will allow you to have more control over the logs that appear for your Cloud Run application.
Setting Up Cloud Logging for Python
Also, setting up Cloud Logging should allow Cloud Run to automatically pick up any logs under the /var/log directory as well as any syslogs (/dev/log).
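For reference, a minimal sketch of that setup, assuming the google-cloud-logging client library is installed in the container image; the log message is just an example:
import logging

import google.cloud.logging

# Creates a Cloud Logging client and attaches its handler to the Python root logger.
client = google.cloud.logging.Client()
client.setup_logging()

# Standard logging calls are now shipped to Cloud Logging.
logging.info("Heavy computation started")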
Hope this helps! Let me know if you need further assistance.
I am trying to deploy a Flask web app using GitLab CI.
In my script I launch the following command:
- if [[ "$STATUS" == "NOTRUN" ]] ; then eval "nohup flask run &" ; fi
The problem is that the web app deploys, but my GitLab CI job times out after 1 hour because it thinks the command is still running.
What do I have to add for it to succeed and not fail ?
Thank you very much
Unfortunately it won't be so easy. There is a similar issue on GitLab.
A process started by the Runner, even if you add nohup and & at the end, is marked with the job's process group ID. When the job is finished, the Runner sends a kill signal to the whole process group, so any process started directly from a CI job will be terminated when the job ends. With a service manager you're not starting the process in the context of the Runner's job; you're only notifying the manager to start a process using a prepared configuration.
The only solution I know of is to create a .service unit and run it with systemctl.
So now that Kubernetes is integrated directly with Spark in 2.3, my spark-submit from the console executes correctly against a Kubernetes master without any Spark master pods running; Spark handles all the k8s details:
spark-submit \
--deploy-mode cluster \
--class com.app.myApp \
--master k8s://https://myCluster.com \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.app.name=myApp \
--conf spark.executor.instances=10 \
--conf spark.kubernetes.container.image=myImage \
local:///myJar.jar
What I am trying to do is a spark-submit via AWS Lambda to my k8s cluster. Previously I used the command via the Spark master REST API directly (without Kubernetes):
request = requests.Request(
    'POST',
    "http://<master-ip>:6066/v1/submissions/create",
    data=json.dumps(parameters))
prepared = request.prepare()
session = requests.Session()
response = session.send(prepared)
And it worked. Now I want to integrate Kubernetes and do it similarly, where I submit an API request to my Kubernetes cluster from Python and have Spark handle all the k8s details, ideally something like:
request = requests.Request(
    'POST',
    "k8s://https://myK8scluster.com:443",
    data=json.dumps(parameters))
Is it possible in the Spark 2.3/Kubernetes integration?
I'm afraid that is impossible for Spark 2.3, if you are using native Kubernetes support.
Based on the description in the deployment instructions, the submission process contains several steps:
Spark creates a Spark driver running within a Kubernetes pod.
The driver creates executors which are also running within Kubernetes pods and connects to them, and executes application code.
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
So, in fact, you have nowhere to submit a job to until you start the submission process, which launches the first Spark pod (the driver) for you. And after the application completes, everything is terminated.
Because running a fat container on AWS Lambda is not the best solution, and also because there is no way to run arbitrary commands in the container itself (it is possible, but only with a hack; here is a blueprint about executing Bash inside an AWS Lambda), the simplest way is to write a small custom service that runs on a machine outside of AWS Lambda and provides a REST interface between your application and the spark-submit utility. I don't see any other way to do it without pain.
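For illustration, a minimal sketch of such a service, assuming Flask on a machine where spark-submit is on the PATH and can reach the cluster; the endpoint name and the expected JSON fields are made up for the example:
import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/submit", methods=["POST"])
def submit():
    params = request.get_json()
    # Build the spark-submit command from the posted parameters (illustrative field names).
    cmd = [
        "spark-submit",
        "--deploy-mode", "cluster",
        "--master", params["master"],        # e.g. k8s://https://myCluster.com
        "--class", params["main_class"],
        "--conf", "spark.app.name=" + params["app_name"],
        params["application"],               # e.g. local:///myJar.jar
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    status = 200 if result.returncode == 0 else 500
    return jsonify({
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }), status

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
Your AWS Lambda function would then POST its parameters to this endpoint instead of calling spark-submit itself.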
I'm using Kafka and Spark Streaming for a project programmed in Python. I want to send data from a Kafka producer to my streaming program. It works smoothly when I execute the following command with the dependencies specified:
./spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0 ./kafkastreaming.py
Is there any way I can specify the dependencies and run the streaming code directly (i.e. without using spark-submit, or using spark-submit but without specifying the dependencies)?
I tried specifying the dependencies in spark-defaults.conf in the conf dir of Spark.
The specified dependencies were:
1. org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0
2. org.apache.spark:spark-streaming-kafka-0-8-assembly:2.1.1
NOTE - I followed the Spark Streaming guide's netcat example from
https://spark.apache.org/docs/latest/streaming-programming-guide.html
and it worked without using the spark-submit command, hence I want to know if I can do the same with Kafka and Spark Streaming.
Put your additional dependencies into the "jars" folder of your Spark distribution, then stop and start Spark again. This way, dependencies will be resolved at runtime without adding any additional options on your command line.
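For illustration, a minimal sketch of running the streaming job as a plain Python script (python kafkastreaming.py) once the Kafka connector jars sit in Spark's jars folder; the topic name and broker address are placeholders:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafkastreaming")
ssc = StreamingContext(sc, 5)  # 5-second batches

# createDirectStream comes from the spark-streaming-kafka-0-8 connector,
# which is why its jar has to be on Spark's classpath.
stream = KafkaUtils.createDirectStream(
    ssc, ["my-topic"], {"metadata.broker.list": "localhost:9092"})

stream.map(lambda record: record[1]).pprint()

ssc.start()
ssc.awaitTermination()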