The problem: every day our people fill in some data in a Google spreadsheet, and with a certain frequency (e.g. once a day) I need to send these tables to ClickHouse
(it is hosted on our AWS servers).
It doesn't matter whether ClickHouse receives only the new rows from the tables or the full tables every time.
Please suggest a working method for doing this.
My toolkit is Python; in theory I can also work with SQLAlchemy and an Airflow DAG,
but I haven't yet found a guide for developing an Airflow DAG that shows how to write a Python script to transfer data from a Google spreadsheet.
The second option is the OWOX extension for Google Sheets, but it requires working with Google BigQuery, which would breed a zoo of tools, and I'd rather not pay for BQ yet.
Do you have any ideas on how to upload tables from Google Sheets to ClickHouse with scripts?
I found the Python library pygsheets; with it, accessing spreadsheets through the API is easier than doing it directly.
Official pygsheets docs: https://pygsheets.readthedocs.io/en/stable/
In addition, I found the libraries gspread and oauth2client, which can also be used to work with the API from Python.
A step-by-step guide: https://towardsdatascience.com/accessing-google-spreadsheet-data-using-python-90a5bc214fd2
Official documentation for gspread: https://gspread.readthedocs.io/en/latest/
Then I can create a DAG in Airflow and manage the ETL process.
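A minimal sketch of what the Airflow task could call, using gspread and clickhouse-driver. All the names here (the credentials file, the sheet "daily_data", the table default.daily_data and its two columns) are assumptions for illustration; since the question says full reloads are acceptable, the sketch simply truncates and reloads the whole table.

```python
# Sketch: load a Google Sheet into ClickHouse. Sheet name, table name,
# columns, and host are hypothetical placeholders.

def rows_to_records(values):
    """Drop the header row, skip blank rows, and cast the amount column.

    `values` is the list-of-lists of strings that gspread's
    get_all_values() returns.
    """
    records = []
    for row in values[1:]:  # values[0] is the header row
        if not any(cell.strip() for cell in row):
            continue  # ignore fully blank rows
        name, amount = row[0], float(row[1])
        records.append((name, amount))
    return records


def load_sheet_to_clickhouse():
    # Third-party imports kept local so the pure helper above can be
    # imported (e.g. for tests) without these packages installed.
    import gspread
    from clickhouse_driver import Client

    gc = gspread.service_account(filename="google_secret.json")
    values = gc.open("daily_data").sheet1.get_all_values()

    client = Client(host="my-clickhouse.example.com")
    # Full reload each run, as allowed by the question.
    client.execute("TRUNCATE TABLE default.daily_data")
    client.execute(
        "INSERT INTO default.daily_data (name, amount) VALUES",
        rows_to_records(values),
    )

# Call load_sheet_to_clickhouse() from a PythonOperator task in the DAG.
```

In an Airflow DAG this would be the callable of a single PythonOperator scheduled daily; incremental loads would only need a different INSERT strategy inside the same function.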
Related
I am learning how to pull data from GraphQL API and load it to BigQuery table on daily basis. I am new to GCP and trying to understand the set-up required to establish a secure data-pipeline. To automate the process of regular data extraction and loading, I am following the steps below,
I am first creating a Cloud Function using BigQuery Python client library with pandas and pyarrow. I am loading the data to BigQuery using the method shown in this - Using BigQuery with Pandas — google-cloud-bigquery documentation (googleapis.dev).
As the trigger type, I have chosen Cloud Pub/Sub. Can I please know if that is a good choice (secure and efficient) for data extraction, or should I go with HTTP, which requires authentication, or with any other trigger type for my use case?
After that, among the settings, I am configuring only the runtime (are there any other settings that I need to configure?).
Once, the above Cloud Function is set-up, I am creating a Cloud Scheduler to call the Cloud Function created above once everyday at midnight. Under ‘Configure the execution’ I am selecting Target type as Cloud Pub/Sub and selecting the topic.
I do not understand the need for the ‘Message body’ field after selecting the Cloud Pub/Sub topic when setting up Cloud Scheduler for a data-extraction use case; however, it is a required field in the settings. I am using a generic message (something like ‘hello world’). Could anyone please correct me if it has any significance for my use case, and how best to set it?
If anyone of you could please review this method to extract and load data to BQ and please let me know if it is an efficient and secure pipeline, that will be very helpful.
Thank you so much!
First of all slow down a little :D.
You are mixing up two functionalities.
Cloud function can be triggered either via HTTP request or via Pubsub.
When you use Cloud Scheduler with a Pub/Sub topic, the body field there lets you enter custom data you want to send. Cloud Scheduler publishes it to Pub/Sub, and when the Cloud Function is triggered via Pub/Sub it receives the message set by Cloud Scheduler. You can use this to trigger different modules of your code based on the input you get. Again, it's use-case specific.
In your case, either technique will work. HTTP is easy because you just have to set up the Cloud Function with an appropriate service account and hardware configuration; once deployed, use the trigger URL to set up Cloud Scheduler. With Pub/Sub, there is an additional component in between.
Please read the Cloud Functions documentation properly; it contains all the details on when to use which trigger.
Hope this answers your question.
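To make the "Message body" point concrete, here is a sketch of a Pub/Sub-triggered Cloud Function (1st gen background-function signature). The function name and the "daily" message value are hypothetical; the part that is standard is that the scheduler's message body arrives base64-encoded in `event["data"]`.

```python
# Sketch of a Pub/Sub-triggered Cloud Function showing how the Cloud
# Scheduler "Message body" reaches your code. Names are illustrative.
import base64


def extract_and_load(event, context):
    """Background Cloud Function entry point for a Pub/Sub trigger.

    event["data"] holds the base64-encoded message body that Cloud
    Scheduler published -- i.e. whatever you typed into "Message body".
    """
    body = base64.b64decode(event.get("data", b"")).decode("utf-8")
    # Use the body to decide which part of the pipeline to run.
    if body == "daily":
        return "running daily extract"  # placeholder for the GraphQL -> BQ load
    return f"ignoring message: {body!r}"
```

So instead of ‘hello world’ you could publish e.g. "daily" and have the function branch on it; if you only ever run one job, the content genuinely doesn't matter.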
I am currently trying to help a small business transition from using Google Sheets as a database to something more robust and scalable, preferably staying within Google services. I've looked into Google Cloud Storage and BigQuery; however, there are employees who need to manually enter new data, so anything in GCP won't be user-friendly for non-technical people. I was thinking of having employees keep updating the Google Sheets manually and writing a Python program to automatically update GCS or BigQuery, but the issue is that Google Sheets is extremely slow and cannot handle the amount of data that's currently stored in there.
Has anyone faced a similar issue and have any ideas/suggestions? Thank you so much in advance :)
What you might be able to do is save the Google Sheet as a .csv file and then import it into BigQuery. From there, maybe the employees can use simple commands to insert data. Please note that this question is very opinion-based and anyone can suggest various ways to achieve what you want.
Create a web app hosted on App Engine for the front end and data entry, and connect it to Cloud SQL or BQ as a backend. You can create a UI in your web app where employees can access data from Cloud SQL/BQ if needed. Alternatively, you can use Google Forms for data entry and connect it to Cloud SQL.
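The CSV-to-BigQuery route suggested above can be sketched in Python with the official client library. The dataset/table ID and file path are hypothetical, and the load uses schema autodetection with a full-table overwrite as one possible (not the only) configuration.

```python
# Sketch: serialize rows to CSV, then load the file into BigQuery.
# "my_dataset.my_table" and the CSV path are made-up placeholders.
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (e.g. fetched from the Sheets API) to CSV text."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


def load_csv_to_bigquery(csv_path):
    # Third-party import kept local so rows_to_csv stays usable on its own.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,                 # first row is the header
        autodetect=True,                     # let BigQuery infer the schema
        write_disposition="WRITE_TRUNCATE",  # replace the table each run
    )
    with open(csv_path, "rb") as fh:
        job = client.load_table_from_file(
            fh, "my_dataset.my_table", job_config=job_config
        )
    job.result()  # block until the load job finishes
```

A cron job or Cloud Scheduler trigger could run this on a schedule, keeping the spreadsheet as the entry UI while BigQuery holds the full dataset.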
I am trying to read a Google Sheet from Google App Engine using the gspread-pandas package. On a local machine, we usually store google_secret.json in a specific path. For App Engine, I saved the file at /root/.config/gspread-pandas/google_secret.json, but I am still getting the error below:
Please download json from https://console.developers.google.com/apis/credentials and save as /root/.config/gspread_pandas/google_secret.json
To add to this, I have created the credentials and am now trying to pass the dict to the Spread class of gspread-pandas. But since the authorization code needs to be stored the first time App Engine accesses the sheet, the failure is still happening on Google App Engine.
Thank you in advance
In order to achieve your technical purpose of doing operations on Google Sheets from App Engine, I would recommend using the Google Sheets API. This API lets you read and write data, format text and numbers, create charts, and many other features.
There is a quickstart for Python for this API. If you still encounter compatibility issues, or you have a hard time getting/storing the credentials in a persistent way, you can always opt for App Engine Flexible, which offers you more freedom.
So I used a service account key and read it from GCS. This solved my problem.
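A sketch of that fix: read the service-account key from a GCS bucket and build credentials in memory, instead of relying on a google_secret.json file on App Engine's filesystem. The bucket, blob, and sheet names are hypothetical, and passing `creds` to `Spread` assumes a gspread-pandas version that supports credentials objects directly, so check against your installed version.

```python
# Sketch: service-account key stored in GCS -> in-memory credentials ->
# gspread-pandas Spread. Bucket/blob/sheet names are placeholders.
import json


def parse_service_account_key(raw_json):
    """Parse the downloaded key file and sanity-check its type."""
    info = json.loads(raw_json)
    if info.get("type") != "service_account":
        raise ValueError("expected a service-account key file")
    return info


def open_sheet_from_app_engine():
    from google.cloud import storage
    from google.oauth2.service_account import Credentials
    from gspread_pandas import Spread

    raw = (
        storage.Client()
        .bucket("my-config-bucket")
        .blob("google_secret.json")
        .download_as_text()
    )
    creds = Credentials.from_service_account_info(
        parse_service_account_key(raw),
        scopes=[
            "https://www.googleapis.com/auth/spreadsheets",
            "https://www.googleapis.com/auth/drive",
        ],
    )
    return Spread("my-sheet", creds=creds)
```

Because a service account authenticates non-interactively, no authorization code needs to be stored on first access, which is what was failing on App Engine.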
I would like to show Google Analytics and Google Search Console data directly in Superset through their APIs. Specifically, I want to:
Make direct queries to Google Analytics API in JSON (instead of storing the results into my database then showing them into Superset) and show the result in Superset
Make direct queries to Google Search Console API in JSON and show the result in Superset
Make direct queries to other amazing JSON APIs and show the result in Superset
How can I do so?
I couldn't find a Google Analytics datasource. I couldn't find a Google Search Console datasource either.
I can't find a way to display in Superset data retrieved from an API, only data stored in a database. I must be missing something, but I can't find anything in the docs related to authenticating & querying external APIs.
Superset can't query external data APIs directly. Superset has to work with a supported database or data engine (https://superset.incubator.apache.org/installation.html#database-dependencies). This means you need to find a way to fetch data out of the API and store it in a supported database / data engine. Some options:
Build a little Python pipeline that will query the data API, flatten the data to something tabular / relational, and upload that data to a supported data source - https://superset.incubator.apache.org/installation.html#database-dependencies - and set up Superset so it can talk to that database / data engine.
For more robust solutions, you may want to work with a devops / infrastructure person to stand up a workflow scheduler like Apache Airflow (https://airflow.apache.org/) to regularly ping this API and store the results in a database of some kind that Superset can talk to.
If you want to regularly query data from a popular 3rd party API, I also recommend checking out Meltano and learning more about Singer taps. These will handle some of the heavy lifting of fetching data from an API regularly and storing it in a database like Postgres. The good news is that there's a Singer tap for Google Analytics - https://github.com/singer-io/tap-google-analytics
Either way, Superset is just a thin layer above your database / data engine. So there’s no way around the reality that you need to find a way to extract data out of an API and store it in a compatible data source.
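The first option (a little Python pipeline that flattens the API response and stores it in a supported database) can be sketched like this. The GA-style JSON shape and table name are made up for illustration, and SQLite stands in for whichever Superset-supported database you would actually point at.

```python
# Sketch: flatten an API response to tabular rows and store them in a
# database Superset can query. The payload shape is hypothetical.
import sqlite3


def flatten_report(report):
    """Flatten a nested API response into (date, pageviews) rows."""
    return [
        (row["dimensions"][0], int(row["metrics"][0]))
        for row in report["rows"]
    ]


def store_rows(conn, rows):
    """Create the target table if needed and insert the flattened rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS analytics (date TEXT, pageviews INTEGER)"
    )
    conn.executemany("INSERT INTO analytics VALUES (?, ?)", rows)
    conn.commit()


# Example with a fake API payload standing in for the real API call:
report = {"rows": [
    {"dimensions": ["2023-01-01"], "metrics": ["120"]},
    {"dimensions": ["2023-01-02"], "metrics": ["95"]},
]}
conn = sqlite3.connect(":memory:")  # a file path or Postgres in practice
store_rows(conn, flatten_report(report))
```

An Airflow DAG (or even cron) would run the fetch-flatten-store cycle on a schedule; Superset then just sees an ordinary table.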
There is this project named shillelagh by one of Superset's contributors. It gives a SQL interface to REST APIs. The same package is used in Apache Superset to connect with gsheets.
New adapters are relatively easy to implement. There's a step-by-step tutorial that explains how to create a new adapter to an API or file type in shillelagh.
Under the hood, shillelagh uses SQLite virtual tables via APSW, a Python wrapper around SQLite.
Redash is an alternative to Superset for that task, but it doesn't have the same features. Here is a compared list of integrations for both tools: https://discuss.redash.io/t/a-comparison-of-redash-and-superset/1503
A quick alternative is paying for a third party service like: https://www.stitchdata.com/integrations/google-analytics/superset/
There is no such connector available by default.
The recommended solution would be to store your Google Analytics and Search Console data in a database; you could write a script that pulls data every 4 hours, or at whichever interval works for you.
Also, you shouldn't store all the data, only the dimensions/metrics you wish to see in your reports.
I have a Google Analytics (GA) account which tracks the user activity of an app. I got BigQuery set up so that I can access the raw GA data. Data is coming in from GA to BigQuery on a daily basis.
I have a python app which queries the BigQuery API programmatically. This app is giving me the required response, depending on what I am querying for.
My next step is to get this data from BigQuery and dump it into a Hadoop cluster. I would like to ideally create a hive table using the data. I would like to build something like an ETL process around the python app. For example, on a daily basis, I run the etl process which runs the python app and also exports the data to the cluster.
Eventually, this ETL process should be put on Jenkins and should be able to run on production systems.
What architecture/design/general factors would I need to consider while planning for this ETL process?
Any suggestions on how I should go about this? I am interested in doing this in the most simple and viable way.
Thanks in advance.
The easiest way to go from BigQuery to Hadoop is to use the official Google BigQuery Connector for Hadoop:
https://cloud.google.com/hadoop/bigquery-connector
This connector defines a BigQueryInputFormat class, which:
- runs a query to select the appropriate BigQuery objects;
- splits the results of the query evenly among the Hadoop nodes;
- parses the splits into Java objects to pass to the mapper (the Hadoop Mapper class receives a JsonObject representation of each selected BigQuery object).
(It uses Google Cloud Storage as an intermediary between BigQuery's data and the splits that Hadoop consumes.)
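If you'd rather drive this from the Python app mentioned in the question instead of a Hadoop-side connector, one hedged alternative is to export the BigQuery table to GCS and mount the export in Hive as an external table. All the table, bucket, and column names below are hypothetical, and pointing Hive at a gs:// location assumes the GCS connector is installed on the cluster.

```python
# Sketch: export a BigQuery table to GCS as Avro, then generate the Hive
# DDL for an external table over the export. Names are placeholders.

def hive_external_table_ddl(table, columns, location):
    """Build a CREATE EXTERNAL TABLE statement from (name, hive_type) pairs."""
    cols = ",\n  ".join(f"{name} {htype}" for name, htype in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"STORED AS AVRO\nLOCATION '{location}'"
    )


def export_table_to_gcs():
    # Third-party import kept local so the DDL helper stays standalone.
    from google.cloud import bigquery

    client = bigquery.Client()
    job = client.extract_table(
        "my-project.analytics.ga_sessions",        # source table
        "gs://my-bucket/ga_sessions/part-*.avro",  # sharded export target
        job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
    )
    job.result()  # wait for the export to finish
```

A daily Jenkins job could run the export and then submit the generated DDL (or a simple `MSCK REPAIR` / refresh) to Hive, which keeps the ETL to two small, observable steps.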
Check out Oozie. It seems to fit your requirements: it has a workflow engine, scheduling support, and shell-script and Hive support.
In terms of installation and deployment, it's usually part of a Hadoop distribution, but it can be installed separately. It depends on a database as a persistence layer, which may require some extra effort.
It has a web UI and a REST API, so managing and monitoring jobs can be automated if desired.
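As a rough illustration, an Oozie workflow wrapping the Python ETL step in a shell action might look like the fragment below. The workflow name, script name, and action schema versions are illustrative assumptions; check them against your Oozie version's documentation before use.

```xml
<!-- Minimal Oozie workflow sketch: one shell action running the ETL script -->
<workflow-app name="bq-to-hive" xmlns="uri:oozie:workflow:0.5">
  <start to="run-etl"/>
  <action name="run-etl">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>run_etl.sh</exec>
      <file>run_etl.sh</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>ETL failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

An Oozie coordinator (or, as mentioned, Jenkins) would then trigger this workflow on the daily schedule.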