Can I use Google Dataflow with native Python?

I'm trying to build a Python ETL pipeline on Google Cloud, and Google Cloud Dataflow seemed like a good option. When I explored the documentation and the developer guides, I saw that Apache Beam is always attached to Dataflow, since Dataflow is based on it.
I may run into issues processing my dataframes in Apache Beam.
My questions are:
If I want to build my ETL script in native Python, is that possible with Dataflow, or is it necessary to use Apache Beam for my ETL?
Was Dataflow built just for the purpose of running Apache Beam? Is there any serverless Google Cloud tool for building a Python ETL? (Cloud Functions has a 9-minute execution limit, which may cause issues for my pipeline; I want to avoid hitting an execution limit.)
My pipeline reads data from BigQuery, processes it, and saves it back to a BigQuery table. I may use some external APIs inside my script.

Concerning your first question, it looks like Dataflow was primarily written to be used along with the Apache Beam SDK, as can be checked in the official Google Cloud documentation on Dataflow. So it is likely a requirement to use Apache Beam for your ETL.
Regarding your second question, this tutorial gives you guidance on how to build your own ETL pipeline with Python and Google Cloud Functions, which are serverless. Could you please confirm if this link has helped you?

Regarding your first question, Dataflow needs to use Apache Beam. In fact, before Apache Beam there was something called the Dataflow SDK, which was Google proprietary and was later open sourced as Apache Beam.
The Python Beam SDK is rather easy once you put a bit of effort into it, and the main processing operations you'd need are very close to native Python.
If your end goal is to read, process, and write to BQ, I'd say Beam + Dataflow is a good match.
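To give a feel for how close it stays to plain Python, here is a minimal sketch of a BigQuery-to-BigQuery pipeline (the project, table names, query, and the processing function are placeholders, not anything from your setup):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def enrich(row):
    # Plain Python: reshape the dict, call external APIs, etc.
    row['processed'] = True
    return row

options = PipelineOptions(
    runner='DataflowRunner',             # use 'DirectRunner' to test locally
    project='my-project',                # placeholder project id
    region='us-central1',
    temp_location='gs://my-bucket/tmp',  # placeholder bucket
)

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromBigQuery(
           query='SELECT * FROM `my-project.dataset.source`',
           use_standard_sql=True)
     | 'Process' >> beam.Map(enrich)
     | 'Write' >> beam.io.WriteToBigQuery(
           'my-project:dataset.target',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
Note that ReadFromBigQuery is available in recent SDK versions; older releases expose the same idea through beam.io.BigQuerySource.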

Related

Where is the Python version of the Dataflow training course provided by Google GCP?

I am learning GCP Beam/Pub/Sub/BigQuery now. There is a Java version of the example from Google:
https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/streaming/process/sandiego/src/main/java/com/google/cloud/training/dataanalyst/sandiego
I wonder if there is a Python equivalent anywhere?
Can anyone share your experience of running a Pub/Sub pipeline to load data into a BigQuery table?
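I don't know of a Python port of that exact repo, but for reference, a minimal streaming Pub/Sub-to-BigQuery pipeline in the Python SDK looks roughly like this (the topic, table, schema, and JSON message format are illustrative assumptions):
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # Pub/Sub sources require streaming mode

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadPubSub' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic')
     | 'Parse' >> beam.Map(json.loads)  # assumes JSON-encoded messages
     | 'WriteBQ' >> beam.io.WriteToBigQuery(
           'my-project:dataset.events',
           schema='timestamp:TIMESTAMP,value:FLOAT',  # illustrative schema
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))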

Expected ETA for Pipeline I/O and runtime parameters in an Apache Beam GCP Dataflow pipeline using Python?

I just wanted to know whether more pipeline I/O connectors and runtime parameters are available with the new version (3.x) of Python. If I am correct, Apache Beam currently provides only file-based IOs (textio, avroio, tfrecordio) when using Python, whereas with Java more options are available, such as file-based IOs, BigQueryIO, BigtableIO, PubSubIO and SpannerIO.
For my requirement I want to use BigQueryIO in a GCP Dataflow pipeline using Python 3.x, but currently it is not available. Does anyone have an update on when it will be available in Apache Beam?
The BigTable connector for Python 3 has been under development for some time now. Currently there is no ETA, but you can follow the relevant pull request in the official Apache Beam repository for further updates.
BigQueryIO has been available for quite some time in the Apache Beam Python SDK.
There is also a Pub/Sub IO available as well as BigTable (write). SpannerIO is being worked on as we speak.
This page has more detail: https://beam.apache.org/documentation/io/built-in/
UPDATE:
In line with the additional details given by the OP, it turns out that using value providers in the BigQuery query string was indeed not supported.
This has been remedied in the following PR: https://github.com/apache/beam/pull/11040 and will most likely be part of the 2.21.0 release.
UPDATE 2:
This new feature has been added in the 2.20.0 release of Apache Beam
https://beam.apache.org/blog/2020/04/15/beam-2.20.0.html
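For reference, wiring a runtime query parameter into the pipeline would look roughly like this once your SDK version supports it (the option name and tables are placeholders, and the exact BigQuery source transform differs between releases):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class TemplateOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # The value is only resolved at runtime, so it can differ per template launch.
        parser.add_value_provider_argument(
            '--input_query', type=str,
            help='BigQuery query to run at template execution time')

options = PipelineOptions()
query_option = options.view_as(TemplateOptions).input_query

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromBigQuery(query=query_option, use_standard_sql=True)
     | 'Write' >> beam.io.WriteToBigQuery('my-project:dataset.output'))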
Hope it solves your problem!

Python: How to Connect to Snowflake Using Apache Beam?

I see that there's a built-in I/O connector for BigQuery, but a lot of our data is stored in Snowflake. Is there a workaround for connecting to Snowflake? The only thing I can think of is to use SQLAlchemy to run the query and dump the output to Cloud Storage buckets, and then have Apache Beam read the input data from the files stored in the bucket.
Snowflake Python and Java connectors were added to Beam recently.
Right now (version 2.24) it supports only the ReadFromSnowflake operation, in apache_beam.io.external.snowflake.
In the 2.25 release, WriteToSnowflake will also be available, in the apache_beam.io.snowflake module. You can still use the old path, but it will be considered deprecated in that version.
Right now it runs only on the Flink runner, but there is an effort to make it available for other runners as well.
Also, it's a cross-language transform, so some additional setup may be needed; it's quite well documented in the pydoc here (I'm pasting it below):
https://github.com/apache/beam/blob/release-2.24.0/sdks/python/apache_beam/io/external/snowflake.py
Snowflake transforms tested against Flink portable runner.
**Setup**
Transforms provided in this module are cross-language transforms
implemented in the Beam Java SDK. During the pipeline construction, Python SDK
will connect to a Java expansion service to expand these transforms.
To facilitate this, a small amount of setup is needed before using these
transforms in a Beam Python pipeline.
There are several ways to setup cross-language Snowflake transforms.
* Option 1: use the default expansion service
* Option 2: specify a custom expansion service
See below for details regarding each of these options.
*Option 1: Use the default expansion service*
This is the recommended and easiest setup option for using Python Snowflake
transforms. This option requires the following prerequisites
before running the Beam pipeline.
* Install Java runtime in the computer from where the pipeline is constructed
and make sure that 'java' command is available.
In this option, the Python SDK will either download (for released Beam versions) or
build (when running from a Beam Git clone) an expansion service jar and use
that to expand transforms. Currently, the Snowflake transforms use the
'beam-sdks-java-io-expansion-service' jar for this purpose.
*Option 2: specify a custom expansion service*
In this option, you start up your own expansion service and provide that as
a parameter when using the transforms provided in this module.
This option requires the following prerequisites before running the Beam
pipeline.
* Start up your own expansion service.
* Update your pipeline to provide the expansion service address when
initiating Snowflake transforms provided in this module.
Flink Users can use the built-in Expansion Service of the Flink Runner's
Job Server. If you start Flink's Job Server, the expansion service will be
started on port 8097. For a different address, please set the
expansion_service parameter.
**More information**
For more information regarding cross-language transforms see:
- https://beam.apache.org/roadmap/portability/
For more information specific to Flink runner see:
- https://beam.apache.org/documentation/runners/flink/
Snowflake (like most of the portable IOs) has its own Java expansion service, which should be downloaded automatically when you don't specify a custom one. I don't think it should be needed, but I'm mentioning it just to be on the safe side: you can download the jar, start it with java -jar <PATH_TO_JAR> <PORT>, and then pass it to snowflake.ReadFromSnowflake as expansion_service='localhost:<PORT>'. Link to the 2.24 version: https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-io-snowflake-expansion-service/2.24.0
Notice that it's still experimental, though, so feel free to report issues on the Beam Jira.
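Putting the above together, a read would look roughly like this (connection details, bucket, integration name, and the csv_mapper are placeholders based on the 2.24 pydoc, so verify the parameter names against your SDK version):
import apache_beam as beam
from apache_beam.io.external.snowflake import ReadFromSnowflake  # apache_beam.io.snowflake from 2.25 on

def csv_mapper(fields):
    # Snowflake rows arrive as lists of strings; map them to whatever shape you need.
    return {'id': fields[0], 'name': fields[1]}

with beam.Pipeline() as p:
    rows = (p
        | 'ReadSnowflake' >> ReadFromSnowflake(
            server_name='my-account.snowflakecomputing.com',
            username='user',
            password='password',
            schema='PUBLIC',
            database='MY_DB',
            staging_bucket_name='my-staging-bucket',
            storage_integration_name='my_integration',
            table='MY_TABLE',
            csv_mapper=csv_mapper,
            expansion_service='localhost:8097'))  # omit to use the default expansion service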
Google Cloud Support here!
There's no direct connector from Snowflake to Cloud Dataflow, but one workaround would be what you've mentioned. First dump the output to Cloud Storage, and then connect Cloud Storage to Cloud Dataflow.
I hope that helps.
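If you go with that workaround, the Beam side is just a file-based read from the bucket, roughly like this (the bucket path and CSV parsing are placeholder assumptions about how you dump the Snowflake output):
import csv
import io
import apache_beam as beam

def parse_csv_line(line):
    # Assumes the Snowflake query results were dumped as CSV files in the bucket.
    return next(csv.reader(io.StringIO(line)))

with beam.Pipeline() as p:
    rows = (p
        | 'ReadExport' >> beam.io.ReadFromText('gs://my-bucket/snowflake-export/*.csv')
        | 'Parse' >> beam.Map(parse_csv_line))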
For future folks looking for a tutorial on how to get started with Snowflake and Apache Beam, I can recommend the tutorial below, which was made by the creators of the connector.
https://www.polidea.com/blog/snowflake-and-apache-beam-on-google-dataflow/

How to call a Dataflow job written in Go from Cloud Functions in GCP

My goal is to create a mechanism so that when a new file is uploaded to Cloud Storage, it triggers a Cloud Function. Eventually, this Cloud Function will trigger a Cloud Dataflow job.
I have a restriction that the Cloud Dataflow job should be written in Go, and the Cloud Function should be written in Python.
The problem I have been facing right now is that I cannot call a Cloud Dataflow job from a Cloud Function.
The problem with a Cloud Dataflow job written in Go is that there is no template-location variable defined in the Apache Beam Go SDK, which is why I cannot create Dataflow templates. And since there are no Dataflow templates, the only way I can call a Cloud Dataflow job from a Cloud Function is to write a Python job that calls a bash script which runs the Dataflow job.
The bash script looks like this:
go run wordcount.go \
--runner dataflow \
--input gs://dataflow-samples/shakespeare/kinglear.txt \
--output gs://${BUCKET?}/counts \
--project ${PROJECT?} \
--temp_location gs://${BUCKET?}/tmp/ \
--staging_location gs://${BUCKET?}/binaries/ \
--worker_harness_container_image=apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515
But the above mechanism cannot create a new Dataflow job, and it seems cumbersome.
Is there a better way to achieve my goal? And what am I doing wrong with the above mechanism?
the Cloud Function should be written in Python
The Cloud Dataflow Client SDK can only create Dataflow jobs from templates. Therefore this requirement cannot be achieved unless you create your own template.
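That said, if you do create a template at some point, a Python Cloud Function can launch it through the Dataflow REST API via google-api-python-client; a rough sketch (project, bucket, and template path are placeholders):
from googleapiclient.discovery import build

def trigger_dataflow(event, context):
    # Background Cloud Function triggered by a Cloud Storage 'finalize' event.
    service = build('dataflow', 'v1b3', cache_discovery=False)
    request = service.projects().templates().launch(
        projectId='my-project',                          # placeholder project id
        gcsPath='gs://my-bucket/templates/my-template',  # placeholder template path
        body={
            'jobName': 'triggered-by-' + event['name'],
            'parameters': {
                'input': 'gs://{}/{}'.format(event['bucket'], event['name']),
            },
        })
    return request.execute()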
I have a restriction that the Cloud Dataflow job should be written in Go
Since your Python objective cannot be achieved, your other option is to run your Go program in Cloud Functions. Cloud Functions for Go is in alpha. However, I know of no method to execute an Apache Beam (Dataflow) program in Cloud Functions. Keep in mind that an Apache Beam program begins execution locally and connects itself to a cluster running somewhere else (Dataflow, Spark, etc.) unless you select runner=DirectRunner.
You have chosen the least mature language to use Apache Beam. The order of maturity and features is Java (excellent), Python (good and getting better everyday), Go (not ready yet for primetime).
If you want to run Apache Beam programs written in Go on Cloud Dataflow, then you will need to use a platform such as your local system, Google Compute Engine or Google App Engine Flex. I do not know if App Engine Standard can run Apache Beam in Go.
I found out that the Apache Beam Go SDK supports a worker_binary parameter, which is similar to template-location for Java Dataflow jobs. Using this option, I was able to kick off a Go Dataflow job from my Python Cloud Function.

Using Python to send Twitter data directly to Google Cloud storage

How might one send data from Twitter directly to Google Cloud storage? I would like to skip the step of first downloading it to my local machine and then uploading it to the cloud. It would run once. I'm not looking for full code, but any pointers or tutorials that someone might have learned from would help. I'm using Python to interact with google-cloud and storage.
Any help would be appreciated.
Here's a blog post which describes the following architecture (a rough sketch of the loading step follows the list):
Run a Python script on Compute Engine
Move your data to BigQuery for storage
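A minimal sketch of that loading step, streaming rows straight into BigQuery so nothing is written to your local machine (the tweepy client and the table/schema are illustrative assumptions):
from google.cloud import bigquery
import tweepy  # illustrative; any Twitter client library works

client = bigquery.Client()
table_id = 'my-project.social.tweets'  # placeholder table (needs a matching schema)

def store_tweets(tweets):
    rows = [{'id': t.id_str, 'text': t.text, 'created_at': t.created_at.isoformat()}
            for t in tweets]
    # Streaming insert: the data goes straight from the API response into BigQuery.
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(errors)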
Here's another one that describes a somewhat more complex architecture, including the ability to analyze tweets:
Use Google Cloud Dataflow templates
Launch Dataflow pipelines from a Google App Engine (GAE) app in order to support MapReduce jobs
