I have a Google Analytics (GA) account which tracks the user activity of an app. I got BigQuery set up so that I can access the raw GA data. Data is coming in from GA to BigQuery on a daily basis.
I have a Python app which queries the BigQuery API programmatically. This app gives me the required response, depending on what I query for.
My next step is to get this data from BigQuery and dump it into a Hadoop cluster, ideally creating a Hive table from the data. I would like to build something like an ETL process around the Python app: for example, on a daily basis, I run the ETL process, which runs the Python app and also exports the data to the cluster.
Eventually, this ETL process should be put on Jenkins and should be able to run on production systems.
What architecture/design/general factors would I need to consider while planning for this ETL process?
Any suggestions on how I should go about this? I am interested in doing this in the most simple and viable way.
Thanks in advance.
The easiest way to go from BigQuery to Hadoop is to use the official Google BigQuery Connector for Hadoop:
https://cloud.google.com/hadoop/bigquery-connector
This connector defines a BigQueryInputFormat class, which:
- uses a query to select the appropriate BigQuery objects,
- splits the results of the query evenly among the Hadoop nodes, and
- parses the splits into Java objects to pass to the mapper. The Hadoop Mapper class receives a JsonObject representation of each selected BigQuery object.
(It uses Google Cloud Storage as an intermediary between BigQuery's data and the splits that Hadoop consumes.)
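The steps above can be sketched in PySpark. The mapred.bq.* keys are the connector's configuration; the project, bucket, dataset, and table names below are placeholders for your own GA export, and the cluster is assumed to have the connector jar on its classpath:

```python
# Sketch of reading a BigQuery table through the Hadoop connector from PySpark.
# All identifiers (project, bucket, dataset, table) are placeholders.

def bigquery_connector_conf(project_id, bucket, dataset, table):
    """Build the Hadoop configuration the BigQuery connector expects."""
    return {
        "mapred.bq.project.id": project_id,
        "mapred.bq.gcs.bucket": bucket,  # GCS staging bucket used as intermediary
        "mapred.bq.temp.gcs.path": "gs://{}/tmp/bq".format(bucket),
        "mapred.bq.input.project.id": project_id,
        "mapred.bq.input.dataset.id": dataset,
        "mapred.bq.input.table.id": table,
    }

def read_bigquery_table(sc, conf):
    """Return an RDD of (key, JSON string) pairs from the BigQuery table.

    `sc` is a live SparkContext on a cluster with the bigquery-connector jar.
    """
    return sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    )
```

From here, the RDD can be written out to HDFS and exposed as a Hive external table as part of the daily ETL run.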
Check out Oozie. It seems to fit your requirements: it has a workflow engine, scheduling support, and support for shell scripts and Hive.
In terms of installation and deployment, it's usually part of a Hadoop distribution, but it can be installed separately. It depends on a database as its persistence layer, which may require some extra effort.
It has a web UI and a REST API, so managing and monitoring jobs can be automated if desired.
Related
I am learning how to pull data from a GraphQL API and load it into a BigQuery table on a daily basis. I am new to GCP and trying to understand the setup required to establish a secure data pipeline. To automate the process of regular data extraction and loading, I am following the steps below.
First, I am creating a Cloud Function using the BigQuery Python client library with pandas and pyarrow. I am loading the data into BigQuery using the method shown in this - Using BigQuery with Pandas — google-cloud-bigquery documentation (googleapis.dev).
As the trigger type, I have chosen Cloud Pub/Sub. Can I please know whether that is a good choice (secure and efficient) for data extraction, or should I go with HTTP, which requires authentication, or with any other trigger type for my use case?
After that, among the settings, I am configuring only the runtime (are there any other settings I need to configure?).
Once the above Cloud Function is set up, I am creating a Cloud Scheduler job to call it once every day at midnight. Under ‘Configure the execution’ I am selecting Cloud Pub/Sub as the target type and selecting the topic.
I do not understand the need for the ‘Message body’ after selecting the Cloud Pub/Sub topic when setting up Cloud Scheduler for a data-extraction use case; however, it is a required field in the settings. I am using a generic message (something like ‘hello world’). Could anyone please tell me whether it has any significance for my use case, and how best to set it?
If any of you could please review this method of extracting and loading data into BQ and let me know whether it is an efficient and secure pipeline, that would be very helpful.
Thank you so much!
First of all, slow down a little :D.
You are mixing up two functionalities.
A Cloud Function can be triggered either via an HTTP request or via Pub/Sub.
When you use Cloud Scheduler with a Pub/Sub topic, the body field allows you to enter any custom data you want to send. Cloud Scheduler sends it to Pub/Sub, and when the Cloud Function is triggered via Pub/Sub, it receives the message set by Cloud Scheduler. You can use this to trigger different modules of your code based on the input you get. Again, it's use-case specific.
In your case, either technique will work. HTTP is easier because you just have to set up the Cloud Function with an appropriate service account and hardware configuration; once deployed, you use the trigger URL to set up Cloud Scheduler. With Pub/Sub there is an additional component in between.
Please read the Cloud Functions documentation carefully. It contains all the details on when to use which trigger.
Hope this answers your question.
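A minimal sketch of a Pub/Sub-triggered function that uses the Scheduler message body to decide what to run; the `job` field and the job names are hypothetical, and the actual extraction code is left as a comment:

```python
import base64
import json

def handle(event, context):
    """Entry point for a Pub/Sub-triggered Cloud Function.

    event["data"] is the base64-encoded message body set in Cloud Scheduler.
    Here it carries a small JSON payload (an assumption, not a requirement)
    that picks which extraction module to run.
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    job = payload.get("job", "default")
    if job == "graphql_to_bq":
        # call your GraphQL-extraction / BigQuery-load code here
        return "ran GraphQL extraction"
    return "nothing to do for job: " + job
```

With this pattern the Scheduler message body stops being a throwaway "hello world" and becomes a routing input, which is the use the answer above alludes to.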
The problem is that every day our people fill in some data in a Google spreadsheet, and with a certain frequency (e.g., once a day) I need to send these tables to ClickHouse
(it is located on our AWS servers).
It doesn't matter whether ClickHouse receives only the new data from the tables or all of the tables every time.
Please tell me a working method for doing this.
My toolkit is Python; in theory I can work with SQLAlchemy and an Airflow DAG,
but for developing the DAG in Airflow I have not yet found a guide on how to write a Python script that transfers data from a Google spreadsheet.
The second option is the OWOX extension for Google Sheets, but that requires working with Google BigQuery, which would breed a zoo of tools, and I would not like to pay for BQ yet.
Do you have any ideas on how to use scripts to upload tables from Google Sheets to ClickHouse?
I found the Python library pygsheets; it makes accessing spreadsheets through the API easier than doing it directly.
Official pygsheets docs: https://pygsheets.readthedocs.io/en/stable/
In addition, I found two more libraries, gspread and oauth2client, which can also be used to work with the API from Python.
Step-by-step guide: https://towardsdatascience.com/accessing-google-spreadsheet-data-using-python-90a5bc214fd2
Official documentation for gspread: https://gspread.readthedocs.io/en/latest/
Then I can build a DAG in Airflow and manage the ETL process.
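A rough sketch of what such a DAG task could call, assuming gspread for the Sheets API and clickhouse-driver for ClickHouse; the service-account file, host, sheet, and table names are placeholders:

```python
def sheet_to_rows(values):
    """Turn get_all_values() output (header row + data rows) into a header
    list and a list of tuples ready for a bulk INSERT."""
    header, *data = values
    return header, [tuple(row) for row in data]

def sync_sheet_to_clickhouse(sheet_name, table):
    # Third-party imports are deferred; both libraries are assumptions
    # about the toolkit (gspread for Sheets, clickhouse-driver for ClickHouse).
    import gspread
    from clickhouse_driver import Client

    gc = gspread.service_account(filename="service_account.json")
    worksheet = gc.open(sheet_name).sheet1
    header, rows = sheet_to_rows(worksheet.get_all_values())

    client = Client(host="your-clickhouse-host")
    # Simplest approach, since full reloads are acceptable here:
    # replace the whole table on every run.
    client.execute("TRUNCATE TABLE {}".format(table))
    client.execute(
        "INSERT INTO {} ({}) VALUES".format(table, ", ".join(header)), rows
    )
```

Truncate-and-reload keeps the task idempotent, which matters when Airflow retries a failed run.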
I would like to show Google Analytics and Google Search Console data directly in Superset through their APIs. Specifically, I want to:
Make direct queries to the Google Analytics API (JSON), instead of storing the results in my database and then showing them in Superset, and show the results in Superset
Make direct queries to the Google Search Console API (JSON) and show the results in Superset
Make direct queries to other amazing JSON APIs and show the results in Superset
How can I do so?
I couldn't find a Google Analytics datasource. I couldn't find a Google Search Console datasource either.
I can't find a way to display data retrieved from an API in Superset, only data stored in a database. I must be missing something, but I can't find anything in the docs about authenticating against and querying external APIs.
Superset can’t query external data APIs directly. Superset has to work with a supported database or data engine (https://superset.incubator.apache.org/installation.html#database-dependencies). This means that you need to find a way to fetch data out of the API and store it in a supported database / data engine. Some options:
Build a little Python pipeline that will query the data API, flatten the data to something tabular / relational, and upload that data to a supported data source - https://superset.incubator.apache.org/installation.html#database-dependencies - and set up Superset so it can talk to that database / data engine.
For more robust solutions, you may want to work with a devops / infrastructure team to stand up a workflow scheduler like Apache Airflow (https://airflow.apache.org/) to regularly poll this API and store the results in a database of some kind that Superset can talk to.
If you want to regularly query data from a popular 3rd party API, I also recommend checking out Meltano and learning more about Singer taps. These will handle some of the heavy lifting of fetching data from an API regularly and storing it in a database like Postgres. The good news is that there's a Singer tap for Google Analytics - https://github.com/singer-io/tap-google-analytics
Either way, Superset is just a thin layer above your database / data engine. So there’s no way around the reality that you need to find a way to extract data out of an API and store it in a compatible data source.
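The "flatten the data to something tabular" step above can be sketched for the GA Reporting API v4 response shape; error handling and the API call itself are omitted, and only the first date range per row is kept (an assumption of this sketch):

```python
def flatten_ga_report(report):
    """Flatten one report from a GA Reporting API v4 response into a
    (header, rows) pair ready for a bulk INSERT into a SQL table."""
    dims = report["columnHeader"].get("dimensions", [])
    metrics = [
        m["name"]
        for m in report["columnHeader"]["metricHeader"]["metricHeaderEntries"]
    ]
    header = dims + metrics
    rows = []
    for row in report["data"].get("rows", []):
        # metrics[0] is the first (often only) date range in the request
        rows.append(tuple(row.get("dimensions", []) + row["metrics"][0]["values"]))
    return header, rows
```

The resulting rows can go straight into Postgres (or any other engine Superset supports) with executemany-style inserts.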
There is a project named shillelagh by one of Superset's contributors. It gives a SQL interface to REST APIs. The same package is used in Apache Superset to connect with Google Sheets.
New adapters are relatively easy to implement. There's a step-by-step tutorial that explains how to create a new adapter for an API or file type in shillelagh.
Under the hood, shillelagh uses SQLite virtual tables via the SQLite wrapper APSW.
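A minimal sketch of querying a sheet through shillelagh's SQL interface; the sheet URL is a placeholder, and the import is deferred so the query-building helper can be read (and tested) without the package installed:

```python
def build_query(sheet_url, limit=10):
    """shillelagh lets you reference the resource URL as a quoted table name."""
    return 'SELECT * FROM "{}" LIMIT {}'.format(sheet_url, limit)

def query_gsheet(sheet_url):
    # shillelagh (with its gsheets adapter) is a third-party dependency.
    from shillelagh.backends.apsw.db import connect

    connection = connect(":memory:")
    cursor = connection.cursor()
    cursor.execute(build_query(sheet_url))
    return cursor.fetchall()
```

The same quoted-URL pattern is what Superset issues behind the scenes when you register a Google Sheet as a database.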
Redash is an alternative to Superset for that task, but it doesn't have the same features. Here is a compared list of integrations for both tools: https://discuss.redash.io/t/a-comparison-of-redash-and-superset/1503
A quick alternative is paying for a third party service like: https://www.stitchdata.com/integrations/google-analytics/superset/
There is no such connector available by default.
A recommended solution would be to store your Google Analytics and Search Console data in a database; you could write a script that pulls data every 4 hours, or at whichever interval works for you.
Also, you shouldn't store all of the data, only the dimensions/metrics you wish to see in your reports.
Is there an easy way or example to load Google Cloud Storage data into bigtable?
I have lots of JSON files generated by PySpark and I wish to load the data into Bigtable.
But I cannot find an easy way to do that!
I have tried the Python code from google-cloud-python and it worked fine, but it just reads data line by line into Bigtable, which seemed strange to me.
Any help would be greatly appreciated.
There is no simple tool to load data into Cloud Bigtable. Here are some options:
Import the files using Dataflow. This requires Java development and learning the Dataflow programming model.
Use Python (possibly with PySpark) to read those JSON files, and write to Cloud Bigtable using the mutate_rows method, which writes to Bigtable in bulk.
FYI, I work on the Cloud Bigtable team. I'm a Java developer, so I would opt for #1. Our team has been working to improve the Python experience, and the extended team recently added some reliability improvements to make sure that mutate_rows is resilient for large jobs. We do not yet have any good examples of integrating with PySpark or Apache Beam's Python SDK, but they are on our radar.
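A rough sketch of option #2 with the google-cloud-bigtable client; the column family "cf1" and the "id" row-key field are assumptions about your schema:

```python
def chunked(items, size):
    """Yield successive batches so each mutate_rows call stays a manageable size."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def load_records(table, records, batch_size=500):
    """Bulk-write JSON records (dicts) to a Cloud Bigtable table.

    `table` is a google.cloud.bigtable Table instance; each record is
    assumed to carry a unique string "id" used as the row key.
    """
    for batch in chunked(records, batch_size):
        rows = []
        for rec in batch:
            row = table.direct_row(rec["id"].encode("utf-8"))
            for key, value in rec.items():
                row.set_cell("cf1", key.encode("utf-8"), str(value).encode("utf-8"))
            rows.append(row)
        table.mutate_rows(rows)  # one bulk RPC per batch instead of one per line
```

Batching is the difference from the line-by-line behavior the question describes: each RPC carries hundreds of mutations instead of one.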
I would like to do some machine learning tasks on data as it comes in through Stream Analytics from Event Hub. However, much of my data-processing pipeline and prediction service is in Python. Is there a way to send time-chunked data into a Python script for processing?
The Azure ML Studio function does not suit my need because it appears to work on single rows of data, and the aggregation functions available in Stream Analytics don't seem to work for this data.
With the recently rolled-out integration with Azure Functions, you might be able to do that. Try that route.
This link describes creating an Azure Function. You will have to create an HTTP trigger and choose Python as the language. There are also templates for several languages.
This question also has additional details about Functions.
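A sketch of what the HTTP-triggered function body might look like, assuming Stream Analytics posts each time chunk as a JSON list of events with a numeric "value" field (that field name is an assumption of this sketch):

```python
import json

def summarize_window(events):
    """Aggregate a time-chunked batch of events - the multi-row processing
    the question says single-row Azure ML functions can't express."""
    values = [e["value"] for e in events]
    return {
        "count": len(values),
        "mean": sum(values) / len(values) if values else None,
    }

def main(req):
    """Body of an HTTP-triggered Azure Function.

    `req` is an azure.functions.HttpRequest; the import is deferred
    because the SDK is only present in the Functions runtime.
    """
    import azure.functions as func

    events = req.get_json()  # the whole chunk arrives as one JSON array
    return func.HttpResponse(
        json.dumps(summarize_window(events)), mimetype="application/json"
    )
```

The aggregation is kept in its own function so it can be unit-tested (or swapped for your prediction service) without the Functions runtime.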
In my experience, you could put your data into Azure Storage, then configure the Import Data module in Azure ML and connect an Execute Python Script module to it as input data.
Or you could use the Azure Storage Python SDK to query the data directly in your Execute Python Script.
However, the two methods mentioned above can only process part of the data at a time, so they should be used only at the experimental stage.
If you need to process data continuously, I suggest you use the web service component.
You could put the logic for querying data and processing results into the web service. Please refer to this official tutorial to deploy your web service.
Hope it helps.