I am learning how to pull data from a GraphQL API and load it into a BigQuery table on a daily basis. I am new to GCP and trying to understand the setup required to establish a secure data pipeline. To automate the regular data extraction and loading, I am following the steps below:
First, I am creating a Cloud Function that uses the BigQuery Python client library with pandas and pyarrow. I am loading the data into BigQuery using the method shown in: Using BigQuery with Pandas — google-cloud-bigquery documentation (googleapis.dev).
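The core of my function looks roughly like this (the table ID and column names here are just placeholders for my real ones):

```python
import pandas as pd

def load_to_bigquery(df: pd.DataFrame, table_id: str) -> int:
    """Load a DataFrame into BigQuery and return the table's row count."""
    # Imported inside the function so the rest of the module can be
    # exercised without GCP credentials.
    from google.cloud import bigquery

    client = bigquery.Client()
    job = client.load_table_from_dataframe(df, table_id)  # uses pyarrow under the hood
    job.result()  # wait for the load job to finish
    return client.get_table(table_id).num_rows

# The DataFrame is built from the GraphQL response, e.g.:
df = pd.DataFrame({"user_id": [1, 2], "event_date": ["2023-01-01", "2023-01-01"]})
# load_to_bigquery(df, "my-project.my_dataset.daily_events")  # placeholder table id
```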
As the trigger type, I have chosen Cloud Pub/Sub. Can I please know if that is a good choice (secure and efficient) for data extraction, or should I go with HTTP (which requires authentication) or any other trigger type for my use case?
After that, among the settings, I am configuring only the Runtime (are there any other settings that I need to configure?).
Once the above Cloud Function is set up, I am creating a Cloud Scheduler job to call it once every day at midnight. Under ‘Configure the execution’ I am selecting Cloud Pub/Sub as the target type and selecting the topic.
I do not understand the need for the ‘Message body’ field after selecting the Cloud Pub/Sub topic while setting up Cloud Scheduler for this data extraction use case; however, it is a required field in the settings. I am using a generic message (something like ‘hello world’). Could anyone please correct me if it has any significance for my use case, and tell me how best to set it?
If anyone could please review this method of extracting and loading data into BQ and let me know whether it is an efficient and secure pipeline, that would be very helpful.
Thank you so much!
First of all slow down a little :D.
You are mixing up two functionalities.
A Cloud Function can be triggered either via an HTTP request or via Pub/Sub.
When you use Cloud Scheduler with a Pub/Sub topic, the body field allows you to enter any custom data you want to add. Cloud Scheduler sends this to Pub/Sub, and when the Cloud Function is triggered via Pub/Sub, it receives the message set by Cloud Scheduler. You can use this to trigger different modules of your code based on the input you get. Again, it's use-case specific.
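For example, instead of a throwaway "hello world", the function can read the Scheduler's message body and branch on it. A minimal sketch for a 1st-gen Pub/Sub-triggered function (the `"job"` key is just an illustrative convention, not anything built in):

```python
import base64
import json

def decode_message(event: dict) -> dict:
    """Decode the base64-encoded Pub/Sub payload that Cloud Scheduler published."""
    payload = base64.b64decode(event["data"]).decode("utf-8")
    return json.loads(payload)

def main(event, context):
    """Entry point for a Pub/Sub-triggered Cloud Function."""
    message = decode_message(event)
    # Branch on the message body set in Cloud Scheduler:
    if message.get("job") == "daily_extract":
        pass  # run the GraphQL extraction and BigQuery load here
```

With this, the Cloud Scheduler message body would be set to something like `{"job": "daily_extract"}`.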
In your case, either technique will work. HTTP is easy because you just have to set up the Cloud Function with an appropriate service account and hardware configuration; once deployed, use the trigger URL to set up Cloud Scheduler. With Pub/Sub, there is an additional component in between.
Please read the Cloud Functions documentation properly. It contains all the details on when to use which trigger.
Hope this answers.
Related
I am new to Google Cloud and to Dataflow. What I want to do is create a tool which detects and recovers errors in (large) CSV files. However, at design time, not every error that should be dealt with is already known. Therefore, I need an easy way to add new functions that each handle a specific error.
The tool should be something like a framework for automatically creating a dataflow template based on a user selection. I already thought of a workflow which could work, but as mentioned before I am totally new to this, so please feel free to suggest a better solution:
The user selects which error correction methods should be used in the frontend
A yaml file is created which specifies the selected transformations
A python script parses the yaml file and uses error handling functions to build a dataflow job that executes these functions as specified in the yaml file
The dataflow job is stored as a template and run via a REST API Call for a file stored on the GCP
To achieve the extensibility, new functions which implement the error corrections should easily be added. What I thought of was:
A developer writes the required function and uploads it to a specified folder
The new function is manually added to the frontend/or to a database etc. and can be selected to check for/deal with an error
The user can now select the newly added error handling function and the dataflow template that is being created uses this function without the need of editing the code that builds the dataflow template
However, my problem is that I am not sure whether this is possible, or whether it is a "good" solution to the problem. Furthermore, I don't know how to write a Python script that uses functions which are not known at design time.
(I thought of using something like a strategy pattern, but as far as I know you still need to have the functions implemented at design time already, even though the decision which function to use is being made during run time)
Any help would be greatly appreciated!
What you can use in your architecture is Cloud Functions together with Cloud Composer (a hosted solution for Airflow). Apache Airflow is designed to run DAGs on a regular schedule, but you can also trigger DAGs in response to events, such as a change in a Cloud Storage bucket (where the CSV files can be stored). You can configure this with your frontend so that every time a new file arrives in the bucket, the DAG containing the step-by-step process is triggered.
Please have a look at the official documentation, which describes the process of launching Dataflow pipelines with Cloud Composer using DataflowTemplateOperator.
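As for the "functions not known at design time" part of the question: you don't need the strategy pattern baked in at design time if new corrections register themselves when their module is imported. A minimal sketch of that registry idea (names and the row-as-string simplification are mine; in a real Beam pipeline each composed step would sit inside a ParDo):

```python
from typing import Callable, Dict, List

# Registry mapping an error-correction name to its handler function.
CORRECTIONS: Dict[str, Callable[[str], str]] = {}

def correction(name: str):
    """Decorator that registers a row-level correction under a name."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        CORRECTIONS[name] = func
        return func
    return register

# A developer drops a new module into the plugins folder containing
# functions like these; importing the module (e.g. via importlib over the
# folder contents) is enough to make the correction selectable.
@correction("strip_whitespace")
def strip_whitespace(row: str) -> str:
    return row.strip()

@correction("fix_delimiter")
def fix_delimiter(row: str) -> str:
    return row.replace(";", ",")

def build_pipeline(selected: List[str]) -> Callable[[str], str]:
    """Compose the corrections named in the (YAML-derived) selection."""
    steps = [CORRECTIONS[name] for name in selected]
    def run(row: str) -> str:
        for step in steps:
            row = step(row)
        return row
    return run

# The YAML file from the frontend reduces to an ordered list of names:
pipeline = build_pipeline(["strip_whitespace", "fix_delimiter"])
```

The script that builds the Dataflow job never has to be edited when a correction is added; it only looks names up in the registry.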
I want to show Google Analytics and Google Search Console data directly in Superset through their APIs:
Make direct queries to Google Analytics API in JSON (instead of storing the results into my database then showing them into Superset) and show the result in Superset
Make direct queries to Google Search Console API in JSON and show the result in Superset
Make direct queries to other amazing JSON APIs and show the result in Superset
How can I do so?
I couldn't find a Google Analytics datasource. I couldn't find a Google Search Console datasource either.
I can't find a way to display in Superset data retrieved from an API, only data stored in a database. I must be missing something, but I can't find anything in the docs related to authenticating & querying external APIs.
Superset can’t query external data APIs directly. Superset has to work with a supported database or data engine (https://superset.incubator.apache.org/installation.html#database-dependencies). This means that you need to find a way to fetch data out of the API and store it in a supported database / data engine. Some options:
Build a little Python pipeline that will query the data API, flatten the data to something tabular / relational, and upload that data to a supported data source - https://superset.incubator.apache.org/installation.html#database-dependencies - and set up Superset so it can talk to that database / data engine.
For more robust solutions, you may want to work with a devops / infrastructure team to stand up a workflow scheduler like Apache Airflow (https://airflow.apache.org/) to regularly ping this API and store the results in a database of some kind that Superset can talk to.
If you want to regularly query data from a popular 3rd party API, I also recommend checking out Meltano and learning more about Singer taps. These will handle some of the heavy lifting of fetching data from an API regularly and storing it in a database like Postgres. The good news is that there's a Singer tap for Google Analytics - https://github.com/singer-io/tap-google-analytics
Either way, Superset is just a thin layer above your database / data engine. So there’s no way around the reality that you need to find a way to extract data out of an API and store it in a compatible data source.
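The first option above might look like this minimal sketch (the endpoint, JSON shape, and field names are made up; sqlite3 stands in for whichever supported database you actually pick):

```python
import json
import sqlite3
from urllib.request import urlopen

def flatten(records):
    """Flatten nested API JSON records into (date, pageviews) rows."""
    return [(r["date"], r["metrics"]["pageviews"]) for r in records]

def load(conn, rows):
    """Create the table if needed and append the rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS analytics (date TEXT, pageviews INTEGER)")
    conn.executemany("INSERT INTO analytics VALUES (?, ?)", rows)
    conn.commit()

def run(url, conn):
    """Fetch -> flatten -> load; schedule this with cron or Airflow."""
    with urlopen(url) as resp:  # hypothetical JSON API endpoint
        records = json.load(resp)
    load(conn, flatten(records))

# Superset then connects to this database and charts the `analytics` table.
```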
There is a project named shillelagh by one of Superset's contributors. It provides a SQL interface to REST APIs; the same package is used in Apache Superset to connect with gsheets.
New adapters are relatively easy to implement. There's a step-by-step tutorial that explains how to create a new adapter to an API or filetype in shillelagh.
Under the hood, shillelagh uses SQLite virtual tables via the SQLite wrapper APSW.
Redash is an alternative to Superset for that task, but it doesn't have the same features. Here is a compared list of integrations for both tools: https://discuss.redash.io/t/a-comparison-of-redash-and-superset/1503
A quick alternative is paying for a third party service like: https://www.stitchdata.com/integrations/google-analytics/superset/
There is no such connector available by default.
A recommended solution would be storing your Google Analytics and Search Console data in a database; you could write a script that pulls the data every 4 hours, or at whichever interval works for you.
Also, you shouldn't store all the data, only the dimensions/metrics you wish to see in your reports.
I would like to do some machine learning tasks on data as it comes in through stream analytics from event hub. However, much of my data processing pipeline and prediction service is in python. Is there a way to send time chunked data into the python script for processing?
The Azure ML studio function does not suit my need because it appears to work on single rows of data, and the aggregation functions available in Stream Analytics don't seem to work for this data.
With the recently rolled out integration with Azure functions, you might be able to do that. Try that route.
This link describes creating an Azure Function. You will have to create an HTTP trigger and choose Python as the language. There are also templates for several languages.
This question also has additional details about function.
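Whichever trigger you pick, the time-chunking itself can live in plain Python inside the function. A sketch that groups incoming events into fixed windows before handing each chunk to the model (the `"timestamp"` field name and window size are assumptions):

```python
from collections import defaultdict
from datetime import datetime

def window_events(events, window_seconds=60):
    """Group events (each carrying an ISO-8601 'timestamp') into fixed time windows."""
    windows = defaultdict(list)
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        bucket = int(ts.timestamp()) // window_seconds  # index of the window
        windows[bucket].append(event)
    # Return chunks in chronological order.
    return [windows[key] for key in sorted(windows)]

# Each returned chunk can then go to the prediction service, e.g.:
# for chunk in window_events(batch):
#     model.predict(chunk)
```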
In my experience, you could put your data into Azure Storage, then configure the Import Data module in Azure ML and connect Execute Python Script as input data.
Or you could use the Azure Storage Python SDK to query the data in your Execute Python Script directly.
However, the two methods mentioned above can only process part of the data at a time, so they should be used only at the experimental stage.
If you need to process data continuously, I suggest you use the web service component.
You could put the logic for querying data and processing results into a web service. Please refer to this official tutorial to deploy your web service.
Hope it helps you.
I have a Google Analytics (GA) account which tracks the user activity of an app. I got BigQuery set up so that I can access the raw GA data. Data is coming in from GA to BigQuery on a daily basis.
I have a python app which queries the BigQuery API programmatically. This app is giving me the required response, depending on what I am querying for.
My next step is to get this data from BigQuery and dump it into a Hadoop cluster, where I would ideally like to create a Hive table from it. I would like to build something like an ETL process around the Python app. For example, on a daily basis, the ETL process runs the Python app and also exports the data to the cluster.
Eventually, this ETL process should be put on Jenkins and should be able to run on production systems.
What architecture/design/general factors would I need to consider while planning for this ETL process?
Any suggestions on how I should go about this? I am interested in doing this in the most simple and viable way.
Thanks in advance.
The easiest way to go from BigQuery to Hadoop is to use the official Google BigQuery Connector for Hadoop
https://cloud.google.com/hadoop/bigquery-connector
This connector defines a BigQueryInputFormat class, which:
Uses a query to select the appropriate BigQuery objects.
Splits the results of the query evenly among the Hadoop nodes.
Parses the splits into Java objects to pass to the mapper. The Hadoop Mapper class receives a JsonObject representation of each selected BigQuery object.
(It uses Google Cloud Storage as an intermediary between BigQuery's data and the splits that Hadoop consumes.)
Check out Oozie. It seems to fit your requirements: it has a workflow engine, scheduling support, and support for shell scripts and Hive.
In terms of installation and deployment, it's usually part of a Hadoop distribution, but it can be installed separately. It depends on a database as a persistence layer, which may require some extra effort.
It has a web UI and a REST API, so managing and monitoring jobs can be automated if desired.
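A workflow for your case would look roughly like the following sketch: a shell action running the Python app that pulls from BigQuery, followed by a Hive action loading the exported files. All names, paths, and property values here are placeholders, not a drop-in file:

```xml
<workflow-app name="bq-to-hive-etl" xmlns="uri:oozie:workflow:0.5">
    <start to="extract"/>
    <!-- Run the Python app that queries BigQuery and writes files for the cluster -->
    <action name="extract">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>run_extract.sh</exec>
            <file>scripts/run_extract.sh</file>
        </shell>
        <ok to="load"/>
        <error to="fail"/>
    </action>
    <!-- Load the exported files into the Hive table -->
    <action name="load">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>scripts/load_daily.hql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail"><message>ETL failed</message></kill>
    <end name="end"/>
</workflow-app>
```

An Oozie coordinator on a daily cron-style frequency would then schedule this workflow, which matches your once-a-day requirement.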
I am looking to back up user-generated data (user profiles, that may change from time to time) from my AppEngine python application into Google Cloud Storage. I could easily periodically brute-force back up all of the user-generated data, but it probably makes more sense to only update data that has changed (only writing it to the cloud storage if the user has changed their data). Later, in the case that data needs to be restored, I would like to take advantage of the object-versioning functionality of the Cloud Storage service to determine which objects need to be restored.
I am trying to understand exactly how Google Cloud Storage interacts with App Engine, based on the information regarding cloudstorage.open() found at https://developers.google.com/appengine/docs/python/googlecloudstorageclient/functions. However, there is no indication of how this service interacts with versioned objects stored in the cloud (object versioning is documented here: https://developers.google.com/storage/docs/object-versioning).
So, my question is: how can an application running on App Engine access specific versions of objects that are stored in Google Cloud Storage?
If there is a better way of doing this, I would be interested in hearing about it as well.
The App Engine GCS Client Library doesn't support versioning at this time. If you enable versioning on a bucket through other channels, the GCS Client Library will keep working fine, but in order to access or delete older generations of objects, you'll need to use either the XML API or the JSON API (as opposed to the App Engine-specific API). There is a Python client for the JSON API that works fine from within App Engine, but you'll lose a few of App Engine's niceties by using it. See https://developers.google.com/appengine/docs/python/googlecloudstorageclient/#gcs_rest_api for more details.
Here's a bit of info on how to use versioning from the XML and JSON APIs: https://developers.google.com/storage/docs/generations-preconditions
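From Python, fetching a specific generation through the JSON API boils down to one extra `generation` query parameter on the object download URL. A sketch using only urllib (the bucket/object names are placeholders, and obtaining the OAuth2 access token is left out):

```python
import urllib.parse
import urllib.request

def generation_url(bucket: str, obj: str, generation: int) -> str:
    """Build the JSON API media-download URL for one generation of an object."""
    return (
        "https://www.googleapis.com/storage/v1/b/%s/o/%s?alt=media&generation=%d"
        % (
            urllib.parse.quote(bucket, safe=""),
            urllib.parse.quote(obj, safe=""),  # object names must be fully escaped
            generation,
        )
    )

def fetch_generation(bucket: str, obj: str, generation: int, token: str) -> bytes:
    """Download one generation of an object; `token` is an OAuth2 access token."""
    req = urllib.request.Request(
        generation_url(bucket, obj, generation),
        headers={"Authorization": "Bearer " + token},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

For a restore, you would list the object's generations (`versions=true` on the objects list endpoint) and rewrite the chosen generation back over the live object.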