How can I load Cloud Storage data into BigQuery using Python?

I have some datasets (27 CSV files, separated by semicolons, summing 150+GB) that get uploaded every week to my Cloud Storage bucket.
Currently, I use the BigQuery console to organize that data manually, declaring the variables and changing the filenames 27 times. The first file replaces the entire previous database, then the other 26 get appended to it. The filenames are always the same.
How can I do it using Python?

Please check out the Cloud Functions functionality. It allows you to use Python. After the function is deployed, cron jobs can be created. Here is a related question:
Run a python script on schedule on Google App Engine
Also, here is an article which describes how to load data from Cloud Storage: Loading CSV data from Cloud Storage
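For example, a minimal sketch with the google-cloud-bigquery client library could look like the snippet below. The project, dataset, table, bucket and file names are placeholders; the first file truncates the table and the remaining 26 are appended, as described in the question.
from google.cloud import bigquery

client = bigquery.Client(project="your-project")
table_id = "your-project.your_dataset.your_table"

# Placeholder names for the 27 weekly files; the real names never change.
filenames = [f"file_{i}.csv" for i in range(1, 28)]

for i, name in enumerate(filenames):
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        field_delimiter=";",   # the files are semicolon-separated
        skip_leading_rows=1,   # adjust if the files have no header row
        autodetect=True,
        # The first file replaces the table, the remaining 26 are appended.
        write_disposition=(
            bigquery.WriteDisposition.WRITE_TRUNCATE if i == 0
            else bigquery.WriteDisposition.WRITE_APPEND
        ),
    )
    load_job = client.load_table_from_uri(
        f"gs://your-bucket/{name}", table_id, job_config=job_config
    )
    load_job.result()  # wait for each load to finish before starting the next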

Related

How to Trigger an On-Demand Scheduled Query Using a Cloud Function?

What I basically want to happen is for my on-demand scheduled query to run when a new file lands in my Google Cloud Storage bucket. This query will load the CSV file into a temporary table, perform some transformation/cleaning and then append it to a table.
Just to try and get the first part running, my on-demand scheduled query looks like this. The idea is that it will pick up the CSV file from the bucket and dump it into a table.
LOAD DATA INTO `spreadsheep-20220603.Case_Studies.loading_test`
FROM FILES (
  format = 'CSV',
  uris = ['gs://triggered_upload/*.csv']
);
I was in the process of setting up a Google Cloud Function that triggers when a file lands in the storage bucket, that seems to be fine but I haven't had luck working out how that function will trigger the scheduled query.
Any idea what bit of python code is needed in the function to trigger the query?
It seems to me that it's not really a scheduled query you want at all. You don't want one to run at regular intervals, you want to run a query in response to a certain event.
Now, you've rigged up a cloud function to execute some code whenever a new file is added to a bucket. What this cloud function needs is the BigQuery python client library. Here's an example of how it's used.
All that remains is to wrap this code in an appropriate function and specify dependencies and permissions using the cloud functions python framework. Here is a guide on how to do that.
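As a rough sketch, and assuming a first-generation, storage-triggered function running the same LOAD DATA statement from the question (with google-cloud-bigquery listed in requirements.txt), the function could look something like this:
from google.cloud import bigquery

def load_new_file(event, context):
    """Triggered by a finalize event on the upload bucket."""
    client = bigquery.Client()
    query = f"""
        LOAD DATA INTO `spreadsheep-20220603.Case_Studies.loading_test`
        FROM FILES (
          format = 'CSV',
          uris = ['gs://{event['bucket']}/{event['name']}']
        );
    """
    client.query(query).result()  # run the statement and wait for it to finish
    print(f"Loaded {event['name']} into Case_Studies.loading_test")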

Google Cloud Storage JSONs to Pandas Dataframe to Warehouse

I am a newbie in ETL. I just managed to extract a lot of information in the form of JSONs to GCS. Each JSON file includes identical key-value pairs, and now I would like to transform them into dataframes on the basis of certain key values.
The next step would be loading this into a data warehouse like Clickhouse, I guess? I was not able to find any tutorials on this process.
TLDR 1) Is there a way to transform JSON data on GCS in Python without downloading the whole data?
TLDR 2) How can I set this up to run periodically or in real time?
TLDR 3) How can I go about loading the data into a warehouse?
If these are too much, I would love it if you could point me to resources on this. Appreciate the help.
There are some ways to do this.
You can add files to storage, then a Cloud Function is activated every time a new file is added (https://cloud.google.com/functions/docs/calling/storage) and calls an endpoint in Cloud Run (container service - https://cloud.google.com/run/docs/building/containers) running a Python application to transform these JSONs into a dataframe. Note that the container image will be stored in Container Registry. Then the Python application running on Cloud Run will save the rows incrementally to BigQuery (warehouse). After that you can have analytics with Looker Studio.
If you need to scale the solution to millions/billions of rows, you can add files to storage, a Cloud Function is activated and calls Dataproc, a service where you can run Python, Anaconda, etc. (How to call google dataproc job from google cloud function). Then this Dataproc cluster will structure the JSONs as a dataframe and save it to the warehouse (BigQuery).
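To illustrate the first option, a minimal sketch of the transform-and-load step might look like this. Bucket, key and table names are placeholders, and it assumes one JSON object per file.
import json
import pandas as pd
from google.cloud import storage, bigquery

def process_file(bucket_name: str, blob_name: str) -> None:
    # Read the JSON object straight into memory, no local file needed.
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    record = json.loads(blob.download_as_text())

    # Keep only the key-value pairs of interest (placeholder key names).
    keys = ["id", "created_at", "value"]
    df = pd.DataFrame([{k: record.get(k) for k in keys}])

    # Append the rows to the warehouse table (BigQuery here).
    bq = bigquery.Client()
    job = bq.load_table_from_dataframe(df, "your-project.your_dataset.your_table")
    job.result()  # wait for the load job to complete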

Interacting with files in Google Cloud

I've got a GCP pipeline which contains several containers on Cloud Run, each of which requires a given type of input (XML, PDF, etc.). A couple of Cloud Functions written in Python help orchestrate the workflow.
These inputs are stored in Cloud Storage, and I can retrieve them as binary. However, given that the XML or PDF object is in a blob representation, the blob cannot directly serve as input to my containers.
I could convert the blob to XML or PDF, store the file in the Cloud Function's temp directory and then pass it to the relevant container. However, I'd like to avoid saving the file and pass the blob to the container with the minimum number of intermediary steps.
What would be the most efficient way?
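For what it's worth, one possible in-memory approach is to download the blob as bytes and pass them in the request body to the Cloud Run service, so nothing is written to disk. A rough sketch, where the bucket, object and service URL are placeholders:
import requests
from google.cloud import storage

def forward_object(bucket_name: str, blob_name: str, service_url: str) -> None:
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    data = blob.download_as_bytes()  # stays in memory as bytes

    # Pass the raw bytes to the container; it can parse the XML/PDF from the body.
    response = requests.post(
        service_url,
        data=data,
        headers={"Content-Type": blob.content_type or "application/octet-stream"},
    )
    response.raise_for_status()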

Is there any way to replicate real-time streaming from Azure Blob Storage to Azure MySQL?

We can basically use Databricks as an intermediate, but I'm stuck on the Python script to replicate data from Blob Storage to Azure MySQL every 30 seconds. We are using CSV files here, and the script needs to store the CSVs with current timestamps.
There is no ready-made streaming option for MySQL in Spark/Databricks, as it is not a streaming source/sink technology.
In Databricks you can use the writeStream .foreach() or .foreachBatch() option. That way a temporary DataFrame is created for each micro-batch, which you can save wherever you choose (i.e. write it to MySQL).
Personally, I would go for the simple solution. In Azure Data Factory it is enough to create two datasets (you can even do without them) - one for MySQL, one for Blob Storage - and use a pipeline with a Copy activity to transfer the data.
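If you do go the Databricks route, a hedged sketch of the foreachBatch approach could look like this. Paths, credentials, schema and table names are all placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:mysql://your-server.mysql.database.azure.com:3306/your_db"
jdbc_props = {"user": "your_user", "password": "your_password",
              "driver": "com.mysql.cj.jdbc.Driver"}

def write_to_mysql(batch_df, batch_id):
    # Each micro-batch arrives as a normal DataFrame and can be written
    # with the batch JDBC writer.
    batch_df.write.jdbc(jdbc_url, table="your_table", mode="append",
                        properties=jdbc_props)

stream = (spark.readStream
          .format("csv")
          .option("header", "true")
          .schema("id INT, value STRING, ts TIMESTAMP")  # placeholder schema
          .load("wasbs://container@account.blob.core.windows.net/input/"))

(stream.writeStream
       .foreachBatch(write_to_mysql)
       .trigger(processingTime="30 seconds")
       .option("checkpointLocation", "/mnt/checkpoints/csv_to_mysql")
       .start())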

How to upload a small JSON object in memory to Google BigQuery in Python?

Is there any way to take a small data JSON object from memory and upload it to Google BigQuery without using the file system?
We have working code that uploads files to BigQuery. This code no longer works for us because our new script runs on Google App Engine which doesn't have write access to the file system.
We notice that our BigQuery upload configuration has a "MediaFileUpload" option with a data_path. We have been unable to find if there is some way to change the data_path to something in memory - or if there is another method apart from MediaFileUpload that could upload data from memory to BigQuery.
media_body=MediaFileUpload(
    data_path,
    mimetype='application/octet-stream'))
job = insert_request.execute()
We have seen solutions which recommend uploading data to Google Cloud Storage then referencing that file to upload to BigQuery for when datasets are large. Our dataset is small so this step seems like an unnecessary use of bandwidth, processing and time.
As a result, we are wondering if there is a way to take our small JSON object and upload it directly to BigQuery using Python?
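One way that avoids the file system entirely is the newer google-cloud-bigquery client (a different library from the apiclient/MediaFileUpload code above), which can load an in-memory list of JSON records directly. A minimal sketch, with a placeholder table name:
from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table"

# The small JSON object(s) already held in memory.
rows = [{"name": "example", "value": 42}]

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
job = client.load_table_from_json(rows, table_id, job_config=job_config)
job.result()  # raises if the load fails
If you prefer to stay on the older API client, googleapiclient.http also provides MediaIoBaseUpload, which accepts an in-memory file-like object (e.g. io.BytesIO) and should work as an in-memory replacement for MediaFileUpload.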
