What I basically want is for my on-demand scheduled query to run when a new file lands in my Google Cloud Storage bucket. The query will load the CSV file into a temporary table, perform some transformation/cleaning, and then append the result to a table.
Just to try and get the first part running, my on-demand scheduled query looks like this. The idea is that it will pick up the CSV file from the bucket and dump it into a table.
LOAD DATA INTO `spreadsheep-20220603.Case_Studies.loading_test`
FROM FILES
(
  format = 'CSV',
  uris = ['gs://triggered_upload/*.csv']
);
I was in the process of setting up a Google Cloud Function that triggers when a file lands in the storage bucket; that part seems fine, but I haven't had any luck working out how the function should trigger the scheduled query.
Any idea what bit of Python code is needed in the function to trigger the query?
It seems to me that it's not really a scheduled query you want at all. You don't want a query that runs at regular intervals; you want to run a query in response to a certain event.
Now, you've rigged up a Cloud Function to execute some code whenever a new file is added to a bucket. What this Cloud Function needs is the BigQuery Python client library. Here's an example of how it's used.
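As a minimal sketch, assuming the google-cloud-bigquery package is listed as a dependency and the function's service account can run BigQuery jobs, the client simply submits the LOAD DATA statement from the question as a query job:

from google.cloud import bigquery

client = bigquery.Client()

load_job = client.query(
    """
    LOAD DATA INTO `spreadsheep-20220603.Case_Studies.loading_test`
    FROM FILES (
      format = 'CSV',
      uris = ['gs://triggered_upload/*.csv']
    );
    """
)
load_job.result()  # wait for the statement to finish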
All that remains is to wrap this code in an appropriate function and specify the dependencies and permissions using the Cloud Functions Python framework. Here is a guide on how to do that.
Related
We can basically use Databricks as an intermediate, but I'm stuck on the Python script to replicate data from Blob Storage to Azure MySQL every 30 seconds; we are using CSV files here. The script also needs to store the CSVs with the current timestamp.
There is no ready-made streaming option for MySQL in Spark/Databricks, as MySQL is not a streaming source/sink technology.
In Databricks you can use the writeStream .foreach() or .foreachBatch() option. Each micro-batch arrives as a temporary DataFrame which you can then save wherever you choose (so, write it to MySQL), as in the sketch below.
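Here is a minimal PySpark sketch of that foreachBatch approach, assuming a CSV folder on blob storage as the streaming source; the storage path, schema, and JDBC connection details for the Azure MySQL instance are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stream new CSV files as they land in the blob storage folder.
stream_df = (
    spark.readStream
    .format("csv")
    .option("header", "true")
    .schema("id INT, value STRING")  # replace with your CSV schema
    .load("wasbs://container@account.blob.core.windows.net/input/")
)

def write_to_mysql(batch_df, batch_id):
    # Each micro-batch is an ordinary DataFrame, so it can be written
    # with the JDBC sink even though MySQL is not a streaming sink.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:mysql://<server>.mysql.database.azure.com:3306/<db>")
        .option("dbtable", "target_table")
        .option("user", "<user>")
        .option("password", "<password>")
        .mode("append")
        .save())

(stream_df.writeStream
    .foreachBatch(write_to_mysql)
    .trigger(processingTime="30 seconds")  # matches the 30-second requirement
    .start())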
Personally, I would go for the simple solution: in Azure Data Factory it is enough to create two datasets (it can even work without them), one for MySQL and one for Blob Storage, and use a pipeline with a Copy activity to transfer the data.
I have some datasets (27 CSV files, separated by semicolons, summing 150+GB) that get uploaded every week to my Cloud Storage bucket.
Currently, I use the BigQuery console to organize that data manually, declaring the variables and changing the filenames 27 times. The first file replaces the entire previous database, then the other 26 get appended to it. The filenames are always the same.
How can I do it using Python?
Please check out Cloud Functions. It lets you use Python, and after the function is deployed, cron jobs can be created. Here is a related question:
Run a python script on schedule on Google App Engine
Also, here is an article which describes how to load data from Cloud Storage: Loading CSV data from Cloud Storage.
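As a rough sketch of the weekly load in Python with the BigQuery client library (the bucket, file names, dataset, and table below are placeholders; the semicolon delimiter matches the question, and the first file truncates the table while the rest append):

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"
# Replace with your 27 fixed file names.
files = [f"gs://my-bucket/file_{i:02d}.csv" for i in range(1, 28)]

for i, uri in enumerate(files):
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        field_delimiter=";",  # the files are semicolon separated
        skip_leading_rows=1,
        autodetect=True,
        # The first file replaces the table, the remaining 26 are appended.
        write_disposition=(
            bigquery.WriteDisposition.WRITE_TRUNCATE if i == 0
            else bigquery.WriteDisposition.WRITE_APPEND
        ),
    )
    client.load_table_from_uri(uri, table_id, job_config=job_config).result()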
I've stored some records in Datastore. Now I want to manipulate them to optimize Big Data analytics.
How can I write a Python cloud routine to make some transformations on my data? Can I trigger it with Datastore events?
Thanks a lot.
I have coded a little bit myself. You can find my code on GitHub.
What it does:
It is an HTTP Cloud Function.
It establishes a connection to Google Cloud Datastore through the client.
It updates the value of a specific entry in the entity, using the ID number and the entry's column name.
How to use it:
Test this Cloud Function to get an idea of how it works, then adapt it to your needs. I have tested this myself and it is working.
Create an HTTP-triggered Python Cloud Function.
Set the name to updateDatastore.
Copy and paste the code from GitHub.
Add the line google-cloud-datastore to the requirements.txt file.
In the main code, assign your entity's kind value to ENTITY_KIND.
In the main code, assign your entity's key value to ENTITY_KEY.
When you open the HTTP trigger URL, the Cloud Function executes and the current time is written to the column.
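For reference, here is a minimal sketch of that kind of function, assuming an HTTP trigger; ENTITY_KIND, ENTITY_KEY, and the property name are placeholders to replace with your own values:

import datetime

from google.cloud import datastore

ENTITY_KIND = "MyKind"        # your entity's kind
ENTITY_KEY = 1234567890       # your entity's numeric ID
PROPERTY_NAME = "updated_at"  # the column to overwrite

def updateDatastore(request):
    client = datastore.Client()
    key = client.key(ENTITY_KIND, ENTITY_KEY)
    entity = client.get(key)
    if entity is None:
        return "Entity not found", 404
    # Write the current time into the chosen column and save the entity back.
    entity[PROPERTY_NAME] = datetime.datetime.utcnow().isoformat()
    client.put(entity)
    return "Updated", 200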
I am using Cloud Composer to orchestrate ETL for files arriving in GCS and going to BigQuery. I have a Cloud Function that triggers the DAG when a file arrives and passes the file name/location to the DAG. In my DAG I have 2 tasks:
1) Use the DataflowPythonOperator to run a Dataflow job that reads the data from text files in GCS, transforms it, and loads it into BQ, and 2) move the file to a failure/success bucket depending on whether the job fails or succeeds.
Each file has a file ID, which is a column in the BigQuery table. Sometimes a file will be edited once or twice (it's not a streaming thing where it happens often), and I want to be able to delete the existing records for that file first.
I looked into the other Airflow operators, but wanted to have 2 tasks in my DAG before I run the Dataflow job:
Get the file ID based on the file name (right now I have a BigQuery table mapping file name -> file ID, but I could also just bring in a JSON file that serves as the map if that's easier).
If the file ID is already present in the BigQuery table (the table that holds the transformed output of the Dataflow job), delete those records, then run the Dataflow job so I have the most up-to-date information. I know one option is to just add a timestamp and only use the most recent records, but because there could be 1 million records per file, and I am not deleting 100 files a day (maybe 1-2 tops), that seems like it could be messy and confusing.
After the Dataflow job, ideally before moving the file to the success/failure folder, I would like to append to some "records" table saying that this game was inputted at this time. This will be my way to see all the inserts that occurred.
I have tried to look for different ways to do this. I am new to Cloud Composer, so even after 10+ hours of research I don't have a clear-cut idea of how this would work, or else I would post code for input.
Thanks, I am very thankful for everyone's help and apologize if this is not as clear as you would like. The documentation on Airflow is very robust, but given that Cloud Composer and BigQuery are relatively new, it is difficult to learn how to do some GCP-specific tasks thoroughly.
It sounds a bit complicated. Thankfully, there are operators for pretty much every GCP service. Another thing to figure out is when to trigger the DAG to execute. Have you figured that out? You'd want a Google Cloud Function to run every time a new file comes into that GCS bucket.
Triggering your DAG
To trigger the DAG, you'll want to call it from a Google Cloud Function that relies on Object Finalize or Metadata Update triggers.
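As a hedged sketch (Composer 1 era), the function can call the Airflow web server's experimental REST API through IAP; the IAP client ID, web server URL, and DAG ID below are placeholders you look up for your environment:

import requests
from google.auth.transport.requests import Request
from google.oauth2 import id_token

IAP_CLIENT_ID = "<iap-client-id>.apps.googleusercontent.com"
WEBSERVER_URL = "https://<webserver-id>.appspot.com"
DAG_ID = "gcs_to_bq_dag"

def trigger_dag(data, context):
    """Background Cloud Function fired by an Object Finalize event."""
    token = id_token.fetch_id_token(Request(), IAP_CLIENT_ID)
    resp = requests.post(
        "{}/api/experimental/dags/{}/dag_runs".format(WEBSERVER_URL, DAG_ID),
        headers={"Authorization": "Bearer {}".format(token)},
        # Pass the file name through to the DAG run configuration.
        json={"conf": {"file_name": data["name"]}},
    )
    resp.raise_for_status()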
Loading data to BigQuery
If your file is already in GCS, and in JSON or CSV format, then using a Dataflow job is overkill. You can use the GoogleCloudStorageToBigQueryOperator to load the file to BQ.
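A minimal sketch of that load task, assuming the Airflow 1.x contrib import path that Cloud Composer used at the time, an existing dag object, and placeholder bucket/table names; the file name is read from the DAG run configuration passed in by the Cloud Function:

from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

load_to_bq = GoogleCloudStorageToBigQueryOperator(
    task_id="gcs_to_bq",
    bucket="my-ingest-bucket",
    source_objects=["{{ dag_run.conf['file_name'] }}"],
    destination_project_dataset_table="my-project.my_dataset.my_table",
    source_format="CSV",
    skip_leading_rows=1,
    write_disposition="WRITE_APPEND",
    dag=dag,
)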
Keeping track of the file ID
The easiest way to compute the file ID is probably with a Bash or Python operator in Airflow. Are you able to derive it directly from the file name?
If so, then you can have a Python operator that is upstream of a GoogleCloudStorageObjectSensor to check whether the file is in the successful directory.
If it is, then you can use the BigQueryOperator to run a delete query on BQ.
After that, you run the GoogleCloudStorageToBigQueryOperator.
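A minimal sketch of that ordering (the sensor step is omitted for brevity), assuming the file ID can be derived from the file name in a Python callable, an existing dag object, and the load_to_bq task from the sketch above; the table name and mapping logic are placeholders:

from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

def derive_file_id(**context):
    file_name = context["dag_run"].conf["file_name"]
    # Placeholder logic: look the ID up in your mapping table or JSON map.
    return file_name.split("_")[0]

get_file_id = PythonOperator(
    task_id="get_file_id",
    python_callable=derive_file_id,
    provide_context=True,
    dag=dag,
)

delete_existing = BigQueryOperator(
    task_id="delete_existing_records",
    sql=(
        "DELETE FROM `my-project.my_dataset.my_table` "
        "WHERE file_id = '{{ ti.xcom_pull(task_ids='get_file_id') }}'"
    ),
    use_legacy_sql=False,
    dag=dag,
)

get_file_id >> delete_existing >> load_to_bq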
Moving files around
If you are moving files between GCS locations, then the GoogleCloudStorageToGoogleCloudStorageOperator should do the trick. If your BQ load operator fails, move the file to the failed-files location, and if it succeeds, move it to the successful-jobs location.
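A minimal sketch of the success/failure moves, again assuming the Airflow 1.x contrib import path, an existing dag object, and the load_to_bq task from the earlier sketch; the trigger rule makes the failure branch run only when the load fails:

from airflow.contrib.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator
from airflow.utils.trigger_rule import TriggerRule

move_to_success = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id="move_to_success",
    source_bucket="my-ingest-bucket",
    source_object="{{ dag_run.conf['file_name'] }}",
    destination_bucket="my-success-bucket",
    move_object=True,
    dag=dag,
)

move_to_failure = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id="move_to_failure",
    source_bucket="my-ingest-bucket",
    source_object="{{ dag_run.conf['file_name'] }}",
    destination_bucket="my-failure-bucket",
    move_object=True,
    trigger_rule=TriggerRule.ONE_FAILED,
    dag=dag,
)

load_to_bq >> [move_to_success, move_to_failure]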
Logging task logs
Perhaps all you need to keep track of inserts is to log task information to GCS. Check out how to log task information to GCS.
Does that help?
I'm pretty new to Google App Engine.
What I need to do is upload a pretty large CSV to CloudSQL.
I've got an HTML page with a file upload module which, when uploaded, sends the file to the Blobstore.
After that I open the CSV with the Blob reader and execute each line against CloudSQL using cursor.execute("insert into table values"). The problem here is that I can only execute the HTTP request for a minute, and not all the data gets inserted in that short a time. It also keeps the screen in a loading state throughout, which I would like to avoid by making the code run in the backend, if that's possible.
I also tried going the "LOAD DATA LOCAL INFILE" way.
"LOAD DATA LOCAL INFILE" works from my local machine when I'm connected to CloudSQL via the terminal, and it's pretty quick.
How would I go about using this within App Engine?
Or is there a better way to import a large CSV into CloudSQL through the Blobstore or Google Cloud Storage directly after uploading the CSV from the HTML?
Also, is it possible to use Task Queues with the Blobstore and then insert the data into CloudSQL on the backend?
I have used a similar approach for Datastore rather than CloudSQL, but the same approach can be applied to your scenario.
Set up a non-default module (previously a backend, deprecated now) of your application.
Send an HTTP request which will trigger the module endpoint through a task queue (to avoid the 60-second deadline).
Use MapReduce with the CSV as input and do the operation on each line of the CSV within the map function (to avoid memory errors and to resume the pipeline from where it left off in case of any errors during the operation).
EDIT: Elaborating on MapReduce as per the OP's request, and also eliminating the use of the task queue.
Read the mapreduce basics from the docs found here
Download the dependency folders for mapreduce to work (simplejson, graphy, mapreduce)
Download this file to your project folder and save as "custom_input_reader.py"
Now copy the code below to your main_app.py file.
main_app.py
from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from custom_input_reader import GoogleStorageLineInputReader
def testMapperFunc(row):
    # do process with csv row
    return

class TestGCSReaderPipeline(base_handler.PipelineBase):
    def run(self):
        yield mapreduce_pipeline.MapPipeline(
            "gcs_csv_reader_job",
            "main_app.testMapperFunc",
            "custom_input_reader.GoogleStorageLineInputReader",
            params={
                "input_reader": {
                    "file_paths": ['/' + bucketname + '/' + filename]
                }
            })
Create an HTTP handler which will initiate the map job:
main_app.py
import webapp2

class BeginUpload(webapp2.RequestHandler):
    def post(self):
        # do whatever you want before starting the job
        upload_task = TestGCSReaderPipeline()
        upload_task.start()
        # do whatever you want after starting the job
If you want to pass any parameters, add them to the "run" method and provide the values when creating the pipeline object.
You can try importing the CSV data via the Cloud console:
https://cloud.google.com/sql/docs/import-export?hl=en#import-csv