Asking the community to summarize the limitations of the Python SDK for Google Dataflow templates:
The Python SDK has limitations on sources: we don't have connectors for BigQuery, Bigtable, and Pub/Sub sources that can take runtime parameters.
We have support for runtime parameters, but it's only for simple parameter substitution.
NestedValueProvider is not supported (it allows us to compute a value from another ValueProvider object).
Please correct me if I am wrong. Let me know if I missed anything.
According to the Apache Beam Python SDK documentation, the BigQuery read connector supports ValueProvider objects, so it should be possible to use runtime parameters on BigQuery sources.
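As a minimal sketch (assuming a recent Beam release where ReadFromBigQuery accepts ValueProvider arguments; the option name --input_table is just illustrative), a templated read could look like this:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class TemplateOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # Becomes a runtime parameter when the pipeline is staged as a template.
            parser.add_value_provider_argument(
                '--input_table', type=str,
                help='BigQuery table to read, as PROJECT:DATASET.TABLE')

    options = PipelineOptions()
    template_options = options.view_as(TemplateOptions)

    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadFromBQ' >> beam.io.ReadFromBigQuery(table=template_options.input_table)
         | 'Print' >> beam.Map(print))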
The Bigtable connector does not yet provide read/source support; at this moment it's only possible to use Bigtable as a write output, and ValueProvider arguments are not supported yet.
The Pub/Sub connector supports both sources and sinks, but only in streaming pipelines. As with the Bigtable connector, ValueProvider arguments are not yet supported.
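For illustration, a minimal streaming read from Pub/Sub could look like the sketch below (the topic name is a placeholder); note that streaming mode must be enabled:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    # The Pub/Sub connector only works in streaming pipelines.
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadMessages' >> beam.io.ReadFromPubSub(
               topic='projects/my-project/topics/my-topic')
         | 'Decode' >> beam.Map(lambda data: data.decode('utf-8'))
         | 'Print' >> beam.Map(print))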
Regarding NestedValueProvider, yes, as mentioned in the Dataflow templates documentation, the Apache Beam SDK for Python does not support NestedValueProvider.
You can always check the Apache Beam release notes to stay up to date on the latest features, or follow related feature requests on Jira. For example, there is already an open request for a DynamicDestinations implementation for BigtableIO, although it targets the Java SDK.
I'm trying to build a Python ETL pipeline in Google Cloud, and Cloud Dataflow seemed like a good option. When I explored the documentation and the developer guides, I saw that Apache Beam is always mentioned alongside Dataflow, since Dataflow is based on it.
I may run into issues processing my dataframes in Apache Beam.
My questions are:
If I want to build my ETL script in native Python, is that possible with Dataflow? Or is it necessary to use Apache Beam for my ETL?
Was Dataflow built just for the purpose of running Apache Beam? Is there any serverless Google Cloud tool for building Python ETL? (Cloud Functions has a 9-minute execution limit, which may cause issues for my pipeline; I want to avoid hitting an execution limit.)
My pipeline aims to read data from BigQuery, process it, and save it back to a BigQuery table. I may use some external APIs inside my script.
Concerning your first question, it looks like Dataflow was primarily written to be used with the Apache Beam SDK, as can be checked in the official Google Cloud documentation on Dataflow. So it may actually be a requirement to use Apache Beam for your ETL.
Regarding your second question, this tutorial gives you guidance on how to build your own ETL pipeline with Python and Google Cloud Functions, which are actually serverless. Could you please confirm whether this link has helped you?
Regarding your first question, Dataflow needs to use Apache Beam. In fact, before Apache Beam there was something called the Dataflow SDK, which was Google-proprietary and was later open sourced as Apache Beam.
The Python Beam SDK is rather easy once you put a bit of effort into it, and the main processing operations you'd need are very close to native Python.
If your end goal is to read from, process, and write to BigQuery, I'd say Beam + Dataflow is a good match.
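As a rough sketch of how close this stays to plain Python (project, dataset, and table names below are placeholders, and the processing step is just an example):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def enrich(row):
        # Ordinary Python logic; this is also where a call to an external API could go.
        row['name_upper'] = row['name'].upper()
        return row

    options = PipelineOptions(
        runner='DataflowRunner', project='my-project', region='us-central1',
        temp_location='gs://my-bucket/tmp')

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromBigQuery(
               query='SELECT id, name FROM `my-project.my_dataset.source`',
               use_standard_sql=True)
         | 'Process' >> beam.Map(enrich)
         | 'Write' >> beam.io.WriteToBigQuery(
               'my-project:my_dataset.target',
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))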
I just wanted to know whether we have more pipeline I/Os and runtime parameters available with the new version (3.x) of Python. If I am correct, Apache Beam currently provides only file-based IOs (textio, avroio, tfrecordio) when using Python, while with Java we have more options available, such as file-based IOs, BigQueryIO, BigtableIO, PubSubIO, and SpannerIO.
For my requirement I want to use BigQueryIO in a GCP Dataflow pipeline using Python 3.x, but currently it is not available. Does anyone have an update on when it will be available in Apache Beam?
The Bigtable connector for Python 3 has been under development for some time now. Currently there is no ETA, but you can follow the relevant pull request in the official Apache Beam repository for further updates.
BigQueryIO has been available for quite some time in the Apache Beam Python SDK.
There is also a Pub/Sub IO available, as well as Bigtable (write). SpannerIO is being worked on as we speak.
This page has more detail: https://beam.apache.org/documentation/io/built-in/
UPDATE:
After the OP gave more details, it turned out that using value providers in the BigQuery query string was indeed not supported.
This has been remedied in the following PR: https://github.com/apache/beam/pull/11040 and will most likely be part of the 2.21.0 release.
UPDATE 2:
This new feature has been added in the 2.20.0 release of Apache Beam
https://beam.apache.org/blog/2020/04/15/beam-2.20.0.html
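For illustration, with a release that includes this change, the query itself can be passed as a runtime parameter; a minimal sketch (the option name --input_query is made up here):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class QueryOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # Resolved at template execution time rather than at graph construction.
            parser.add_value_provider_argument(
                '--input_query', type=str, help='SQL to run against BigQuery')

    options = PipelineOptions()
    query_options = options.view_as(QueryOptions)

    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadByQuery' >> beam.io.ReadFromBigQuery(
               query=query_options.input_query, use_standard_sql=True)
         | 'Print' >> beam.Map(print))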
Hope it solves your problem!
I see that there's a built-in I/O connector for BigQuery, but a lot of our data is stored in Snowflake. Is there a workaround for connecting to Snowflake? The only thing I can think of is to use SQLAlchemy to run the query and dump the output to Cloud Storage buckets, and then have Apache Beam read its input data from the files stored in the bucket.
Snowflake Python and Java connectors were added to Beam recently.
Right now (version 2.24) only the ReadFromSnowflake operation is supported, in apache_beam.io.external.snowflake.
In the 2.25 release, WriteToSnowflake will also be available, in the apache_beam.io.snowflake module. You can still use the old path, but it will be considered deprecated in that version.
Right now it runs only on the Flink runner, but there is an effort to make it available for other runners as well.
Also, it's a cross-language transform, so some additional setup may be needed. It's quite well documented in the pydoc here (I'm pasting it below):
https://github.com/apache/beam/blob/release-2.24.0/sdks/python/apache_beam/io/external/snowflake.py
Snowflake transforms tested against Flink portable runner.
**Setup**
Transforms provided in this module are cross-language transforms
implemented in the Beam Java SDK. During the pipeline construction, Python SDK
will connect to a Java expansion service to expand these transforms.
To facilitate this, a small amount of setup is needed before using these
transforms in a Beam Python pipeline.
There are several ways to setup cross-language Snowflake transforms.
* Option 1: use the default expansion service
* Option 2: specify a custom expansion service
See below for details regarding each of these options.
*Option 1: Use the default expansion service*
This is the recommended and easiest setup option for using Python Snowflake
transforms. This option requires the following prerequisites
before running the Beam pipeline.
* Install Java runtime in the computer from where the pipeline is constructed
and make sure that 'java' command is available.
In this option, the Python SDK will either download (for released Beam versions) or
build (when running from a Beam Git clone) an expansion service jar and use
that to expand transforms. Currently the Snowflake transforms use the
'beam-sdks-java-io-expansion-service' jar for this purpose.
*Option 2: specify a custom expansion service*
In this option, you startup your own expansion service and provide that as
a parameter when using the transforms provided in this module.
This option requires the following prerequisites before running the Beam
pipeline.
* Startup your own expansion service.
* Update your pipeline to provide the expansion service address when
initiating Snowflake transforms provided in this module.
Flink Users can use the built-in Expansion Service of the Flink Runner's
Job Server. If you start Flink's Job Server, the expansion service will be
started on port 8097. For a different address, please set the
expansion_service parameter.
**More information**
For more information regarding cross-language transforms see:
- https://beam.apache.org/roadmap/portability/
For more information specific to Flink runner see:
- https://beam.apache.org/documentation/runners/flink/
Snowflake (like most of the portable IOs) has its own Java expansion service, which should be downloaded automatically when you don't specify a custom one. I don't think that should be needed, but I'm mentioning it just to be on the safe side. You can download the jar and start it with java -jar <PATH_TO_JAR> <PORT>, then pass it to snowflake.ReadFromSnowflake as expansion_service='localhost:<PORT>'. Link to the 2.24 version: https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-io-snowflake-expansion-service/2.24.0
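For reference, here is a rough usage sketch based on the pydoc above; all account details, bucket, and table names are placeholders, and the parameter names follow that pydoc:

    import apache_beam as beam
    from apache_beam.io.external import snowflake
    from apache_beam.options.pipeline_options import PipelineOptions

    def csv_mapper(strings_array):
        # Maps one staged CSV record (a list of strings) to the type you want downstream.
        return {'id': int(strings_array[0]), 'name': strings_array[1]}

    options = PipelineOptions(runner='FlinkRunner')  # the supported runner as of 2.24

    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadSnowflake' >> snowflake.ReadFromSnowflake(
               server_name='myaccount.snowflakecomputing.com',
               schema='PUBLIC',
               database='MY_DB',
               staging_bucket_name='gs://my-staging-bucket',
               storage_integration_name='my_integration',
               username='user',
               password='secret',
               table='MY_TABLE',
               csv_mapper=csv_mapper,
               # expansion_service='localhost:8097',  # only if you run your own
           )
         | 'Print' >> beam.Map(print))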
Notice that it's still experimental though and feel free to report issues on Beam Jira.
Google Cloud Support here!
There's no direct connector from Snowflake to Cloud Dataflow, but one workaround would be what you've mentioned. First dump the output to Cloud Storage, and then connect Cloud Storage to Cloud Dataflow.
I hope that helps.
For future folks looking for a tutorial on how to start with Snowflake and Apache Beam, I can recommend the tutorial below, which was made by the creators of the connector.
https://www.polidea.com/blog/snowflake-and-apache-beam-on-google-dataflow/
I am looking at using the Google Cloud Python SDKs to manage my resources, but I am not able to find a Compute module in the Python SDK.
Python Doc here: https://googlecloudplatform.github.io/google-cloud-python/latest/
However, a Compute module is available in the Node.js SDK.
Node.js Doc here: https://googlecloudplatform.github.io/google-cloud-node/#/docs/google-cloud/0.56.0/compute
Can someone confirm whether this module (Compute) is available in Python?
If not, is it being planned, and when can I expect it?
The google-cloud-python project you link to contains hand-crafted, Pythonic libraries for the GCP APIs. There is not yet one for Compute Engine.
Instead, you will have to use the auto-generated Python client library. See https://cloud.google.com/compute/docs/tutorials/python-guide for an example.
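For example, a minimal sketch with the generated client (the google-api-python-client package); the project and zone are placeholders, and credentials come from Application Default Credentials:

    from googleapiclient import discovery

    # Builds a client for the Compute Engine v1 API from its discovery document.
    compute = discovery.build('compute', 'v1')

    result = compute.instances().list(
        project='my-project', zone='us-central1-a').execute()
    for instance in result.get('items', []):
        print(instance['name'], instance['status'])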
I searched for a Python API to interact with Google BigQuery and found two packages that provide similar APIs: the Google BigQuery client (part of the Google API Client package, googleapiclient) and the Gcloud package, gcloud.
Here is the documentation about using these two APIs for BigQuery:
Google API Client: googleapiclient
https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/python/latest/index.html
https://cloud.google.com/bigquery/docs/reference/v2/
Google Cloud package: gcloud
http://googlecloudplatform.github.io/gcloud-python/stable/bigquery-usage.html
Both packages are from Google and provide similar functionality for interacting with BigQuery. I have the following confusions:
It seems both packages include a wide range of Google Cloud Platform functionality. In my view, gcloud provides a command-line tool and local environment setup. Generally, what are the differences between these two packages?
In terms of Python modules, what are the differences in their usage?
Is there any relation between these two packages?
Which is more suitable for accessing BigQuery?
What kinds of jobs are they suitable for?
The googleapiclient client is generated directly from the raw API definition (the definition is a JSON file, hosted here).
Because it is automatically generated, it is not what any sane Python programmer would do if they were trying to write a Python client for BigQuery. That said, it is the lowest-level representation of the API.
The gcloud client, on the other hand, was what a group of more-or-less sane folks at Google came up with when they tried to figure out what a client should look like for BigQuery. It is really quite nice, and lets you focus on what's important rather than converting results from the strange f/v format used in the BigQuery API into something useful.
Additionally, the documentation for the gcloud API was written by a doc writer. The documentation for the googleapiclient was, like the code, automatically generated from a definition of the API.
My advice, having used both (and having, mostly unsuccessfully, helped design the BigQuery API to try to make the generated client behave reasonably), is to use the gcloud client. It will handle a bunch of low-level details for you and generally make your life easier.
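To illustrate the difference, here is a minimal sketch with the Pythonic client (today distributed as google-cloud-bigquery), querying a public dataset:

    from google.cloud import bigquery

    client = bigquery.Client()  # picks up Application Default Credentials

    query = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        WHERE state = 'TX'
        GROUP BY name
        ORDER BY total DESC
        LIMIT 5
    """
    # Rows come back as friendly Row objects, not the raw f/v structure.
    for row in client.query(query).result():
        print(row.name, row.total)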