Create dataflow job dynamically - python

I am new to Google Cloud and to Dataflow. What I want to do is create a tool which detects and corrects errors in (large) CSV files. However, not every error that should be dealt with is already known at design time. Therefore, I need an easy way to add new functions that handle a specific error.
The tool should be something like a framework for automatically creating a dataflow template based on a user selection. I already thought of a workflow which could work, but as mentioned before I am totally new to this, so please feel free to suggest a better solution:
The user selects which error correction methods should be used in the frontend
A yaml file is created which specifies the selected transformations
A python script parses the yaml file and uses error handling functions to build a dataflow job that executes these functions as specified in the yaml file
The dataflow job is stored as a template and run via a REST API call for a file stored on GCP
To achieve this extensibility, new functions which implement the error corrections should be easy to add. What I thought of was:
A developer writes the required function and uploads it to a specified folder
The new function is manually added to the frontend (or to a database, etc.) and can be selected to check for/deal with an error
The user can now select the newly added error handling function, and the dataflow template that is created uses this function without any need to edit the code that builds the template
However, my problem is that I am not sure whether this is possible or a "good" solution for this problem. Furthermore, I don't know how to create a Python script that uses functions which are not known at design time.
(I thought of using something like a strategy pattern, but as far as I know the functions still need to be implemented at design time, even though the decision about which function to use is made at run time.)
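To make the idea more concrete, here is a rough sketch of what I imagine the YAML-driven part could look like, if the handler functions can be resolved at run time (the package, module and file names below are just made up):

    # Rough sketch only: assumes each selected handler lives in a (hypothetical)
    # package "error_handlers" as a module exposing an apply(row) function.
    import importlib

    import apache_beam as beam
    import yaml


    def load_handlers(config_path):
        """Resolve the functions listed in the YAML file at run time."""
        with open(config_path) as f:
            config = yaml.safe_load(f)
        handlers = []
        for name in config["transformations"]:              # e.g. "fix_missing_values"
            module = importlib.import_module(f"error_handlers.{name}")
            handlers.append(module.apply)
        return handlers


    def run(config_path, input_path, output_path):
        handlers = load_handlers(config_path)
        with beam.Pipeline() as p:
            rows = p | "Read" >> beam.io.ReadFromText(input_path)
            for i, handler in enumerate(handlers):
                rows = rows | f"Step {i}" >> beam.Map(handler)
            rows | "Write" >> beam.io.WriteToText(output_path)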
Any help would be greatly appreciated!

What you can use in your architecture is Cloud Functions together with Cloud Composer (a hosted solution for Airflow). Apache Airflow is designed to run DAGs on a regular schedule, but you can also trigger DAGs in response to events, such as a change in a Cloud Storage bucket (where the CSV files can be stored). You can configure this with your frontend, and every time a new file arrives in the bucket, the DAG containing the step-by-step process is triggered.
Please have a look at the official documentation, which describes the process of launching Dataflow pipelines with Cloud Composer using DataflowTemplateOperator.
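As a rough illustration (not taken from the documentation; the project, bucket and template paths are placeholders), a Composer DAG using DataflowTemplateOperator might look something like this:

    # Sketch only: project, bucket and template paths are placeholders; the DAG has
    # no schedule because it is meant to be triggered externally (e.g. by a Cloud
    # Function reacting to a new file in the bucket). On newer Airflow/Composer
    # versions the operator's import path differs.
    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

    with DAG(
        "csv_error_correction",
        schedule_interval=None,
        start_date=datetime(2021, 1, 1),
    ) as dag:
        run_template = DataflowTemplateOperator(
            task_id="run_csv_cleaning_template",
            template="gs://my-bucket/templates/csv_cleaning",      # template built from the YAML selection
            parameters={"inputFile": "gs://my-bucket/input/data.csv"},
            dataflow_default_options={
                "project": "my-project",
                "tempLocation": "gs://my-bucket/tmp",
            },
        )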

Related

Automating data extraction and loading to BigQuery

I am learning how to pull data from a GraphQL API and load it into a BigQuery table on a daily basis. I am new to GCP and trying to understand the set-up required to establish a secure data pipeline. To automate the process of regular data extraction and loading, I am following the steps below.
I am first creating a Cloud Function using the BigQuery Python client library with pandas and pyarrow. I am loading the data to BigQuery using the method shown in Using BigQuery with Pandas — google-cloud-bigquery documentation (googleapis.dev).
As the Trigger type, I have chosen Cloud Pub/Sub. Can I please know if that is a good choice (secure and efficient) for data extraction, or should I go with HTTP (which requires authentication) or any other Trigger type for my use case?
After that, among the settings, I am configuring only the Runtime (are there any other settings I need to configure?)
Once the above Cloud Function is set up, I am creating a Cloud Scheduler job to call the Cloud Function created above once every day at midnight. Under ‘Configure the execution’ I am selecting Target type as Cloud Pub/Sub and selecting the topic.
I do not understand the need for ‘Message body’ after selecting the Cloud Pub/Sub topic when setting up Cloud Scheduler for this data extraction use case; however, it is a required field in the settings. I am using a generic message (something like ‘hello world’). Could anyone please tell me whether it has any significance for my use case and how best to set it?
If any of you could please review this method of extracting and loading data into BQ and let me know whether it is an efficient and secure pipeline, that would be very helpful.
Thank you so much!
First of all, slow down a little :D.
You are mixing up two functionalities.
A Cloud Function can be triggered either via an HTTP request or via Pub/Sub.
When you use Cloud Scheduler with a Pub/Sub topic, the body field allows you to enter any custom data you want to add. This data is sent to Pub/Sub by Cloud Scheduler, and when the Cloud Function is triggered via Pub/Sub, it receives the message set by Cloud Scheduler. You can use this to trigger different modules of your code based on the input you get. Again, it's use-case specific.
In your case, either technique will work. HTTP is easy because you just have to set up the Cloud Function with an appropriate service account and hardware configuration; once deployed, use the trigger URL to set up Cloud Scheduler. With Pub/Sub, there is an additional component in between.
Please read the Cloud Functions documentation carefully. It contains all the details on when to use which trigger.
Hope this answers your question.
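For illustration, a minimal sketch of a Pub/Sub-triggered Cloud Function that loads a DataFrame into BigQuery could look like the following; the GraphQL extraction, the table id and the message handling are placeholders:

    # Sketch only: "fetch_graphql_data" and the table id are placeholders; the
    # function is deployed with a Pub/Sub trigger, so it receives (event, context).
    import base64

    import pandas as pd
    from google.cloud import bigquery


    def fetch_graphql_data():
        """Placeholder for the actual GraphQL extraction logic."""
        return pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})


    def main(event, context):
        # The Cloud Scheduler message body arrives here; it can be ignored
        # or used to decide what to extract.
        message = base64.b64decode(event.get("data", b"")).decode("utf-8")
        print(f"Triggered with message: {message}")

        df = fetch_graphql_data()
        client = bigquery.Client()
        job = client.load_table_from_dataframe(df, "my_dataset.my_table")  # placeholder table
        job.result()  # wait for the load job to finish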

Generate clients from Swagger definition programmatically

I have a Swagger JSON definition file. I would like to generate a Python client binding for it. I would like to generate it as part of my development cycle, i.e., every time the definition changes I can run a script locally to regenerate my client.
I am aware of the Swagger Editor and the Swagger online generators, but they do not satisfy my needs:
both methods rely on an external service I need to call
as a result I get a .zip file I need to unzip, which makes the whole process more complex.
I remember times when I could generate a Java client binding for a SOAP service by running Apache CXF locally. Is there something like that for Swagger? I've seen there's an option to clone the whole swagger-codegen project, for all possible programming languages, but that seems like overkill. Is there another option?
How are companies handling issues like mine?
I've used https://github.com/OpenAPITools/openapi-generator
You don't need to clone the whole repository, just download the JAR file and use it to generate the client/server.
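Since you want to regenerate the client from a local script, a minimal sketch (the JAR location, spec file and output directory are assumptions) could be:

    # Sketch only: assumes openapi-generator-cli.jar has been downloaded locally
    # and Java is on the PATH; paths are placeholders.
    import subprocess

    subprocess.run(
        [
            "java", "-jar", "openapi-generator-cli.jar",
            "generate",
            "-i", "swagger.json",        # your Swagger/OpenAPI definition
            "-g", "python",              # generator for a Python client
            "-o", "generated-client",    # output directory
        ],
        check=True,
    )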

How to set up a push notification Listener/receiver in python to receive File Changes

I have been trying to use Google Drive's REST API to receive file changes as push notifications, but I have no idea where to start. As I am new to programming altogether, I am unable to find any solutions.
I am using Python to develop my code, and the script I am writing is meant to monitor changes in a given spreadsheet and run some operations on the modified spreadsheet data.
Considering I was able to set up the Sheets and Drive (readonly) APIs properly, I am confident that, given some direction, I would be able to set up this notification receiver/listener as well.
Here is Google Drive API Feature Page.
Just follow the guide in Detect Changes
For Google Drive apps that need to keep track of changes to files, the Changes collection provides an efficient way to detect changes to all files, including those that have been shared with a user. The collection works by providing the current state of each file, if and only if the file has changed since a given point in time.
Retrieving changes requires a pageToken to indicate a point in time to fetch changes from.
There's a GitHub code demo that you can test and base your project on.
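If you want a concrete starting point, here is a minimal polling sketch against the Changes collection. It assumes you already have an authorized Drive v3 service object from the setup you did; true push notifications via changes.watch additionally require a publicly reachable HTTPS endpoint, so polling is the simpler thing to try first.

    # Sketch only: assumes "service" is an authorized Drive v3 client created with
    # googleapiclient.discovery.build("drive", "v3", credentials=creds).
    import time


    def watch_for_changes(service, poll_seconds=30):
        # Get a token that marks "now"; only changes after this point are returned.
        start = service.changes().getStartPageToken().execute()
        page_token = start["startPageToken"]

        while True:
            token = page_token
            while token is not None:
                result = service.changes().list(pageToken=token, spaces="drive").execute()
                for change in result.get("changes", []):
                    print(f"File changed: {change.get('fileId')}")
                    # ... run your spreadsheet processing here ...
                if "newStartPageToken" in result:
                    page_token = result["newStartPageToken"]   # remember for the next poll
                token = result.get("nextPageToken")
            time.sleep(poll_seconds)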

How to set up AWS pipeline for data project with multiple users

I am in the process of moving an internal company tool written entirely in python to the AWS ecosystem, but was having issues figuring out the proper way to set up my data so that it stays organized. This tool is used by people throughout the company, with each person running the tool on their own datasets (that vary in size from a few megabytes to a few gigabytes in size). Currently, users clone the code to their local machines then run the tool on their data locally; we are now trying to move this usage to the cloud.
For a single person, it is simple enough to have them upload their data to s3, then point the python code to that data to run the tool, but I'm worried that as more people start using the tool, the s3 storage will become cluttered/disorganized.
Additionally, each person might make slight changes to the python tool in order to do custom work on their data. Our code is hosted in a bitbucket server, and users will be forking the repo for their custom work.
My questions are:
Are S3 and EC2 the only AWS tools needed to support a project like this?
What is the proper way for users to upload their data, run the code, then download their results so that the data stays organized in S3?
What are the best practices for using EC2 in a situation like this? Do people usually spin up a new EC2 for each job or is scheduling multiple jobs on a single EC2 more efficient?
Is there a way to automate the data uploading process that will allow users to easily run the code on their data without needing to know how to code?
If anyone has any input as to how to set up this project, or has links to any relevant guides/documents, it would be greatly appreciated. Thanks!
You can do something like this.
a) A boto3 script to upload data to a specified S3 bucket, with perhaps a timestamp appended to the key.
b) Configure the S3 bucket to send a notification to SQS when a new object arrives.
c) Keep 2-3 EC2 machines running that actively listen to SQS.
d) When a new item arrives, a worker gets the key from SQS and processes it, deleting the message from SQS after successful completion.
e) Put the processed data somewhere, delete the key from the bucket, and notify the user by email.
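A stripped-down sketch of the worker loop described in steps (c) to (e), using boto3 long polling (the queue URL, region and processing step are placeholders), might look like this:

    # Sketch only: queue URL, region and the processing function are placeholders;
    # the standard S3 event notification message format is assumed.
    import json

    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")        # placeholder region
    s3 = boto3.client("s3", region_name="us-east-1")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder


    def process(local_path):
        """Placeholder for the actual Python tool."""
        ...


    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            record = body["Records"][0]["s3"]
            bucket, key = record["bucket"]["name"], record["object"]["key"]

            s3.download_file(bucket, key, "/tmp/input-data")
            process("/tmp/input-data")

            # Only remove the message once processing succeeded.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])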
For users doing custom work, they can create a new branch and include its name in the uploaded data; the EC2 worker reads it from there and checks out the required branch. After the job, the branch can be deleted. This can be a single line containing the branch name, and it involves a one-time set-up. You should probably use a process manager on EC2 that would restart the process if it crashes.

Running complex calculations (using python/pandas) in a Django server

I have developed a RESTful API using the Django-rest-framework in python. I developed the required models, serialised them, set up token authentication and all the other due diligence that goes along with it.
I also built a front-end using Angular, hosted on a different domain. I set up the CORS configuration so I can access the API as required. Everything seems to be working fine.
Here is the problem. The web app I am building is a financial application that should allow the user to run some complex calculations on the server and send the results to the front-end app so they can be rendered into charts and other formats. I do not know how or where to put these calculations.
I chose Django for the back-end as I expected that Python would help me run such calculations wherever required. Basically, when I call a particular API endpoint on the server, I want to be able to retrieve data from my database, from multiple tables if required, use the data to run some calculations in Python (or a Python library such as pandas or numpy), and serve the results of the calculations as the response to the API call.
If this is a daunting task, I at least want to be able to use the API to retrieve data from the tables to the front-end, process the data a little using JS, and send this processed data to a Python function located on the server; that function would run the necessary complex calculations and respond with results which would be rendered into charts or other formats.
Can anyone point me in a direction to move from here? I looked for resources online but I think I am unable to find the correct keywords to search for them. I just want some skeleton code to integrate into my current backend with which I can call the Python scripts that I write to run these calculations.
Thanks in advance.
I assume your question is about "how do I do these calculations in the restful framework for django?", but I think in this case you need to move away from that idea.
You did everything correctly but RESTful APIs serve resources -- basically your model.
A computation however is nothing like that. As I see it, you have two ways of achieving what you want:
1) Write a model that represents the results of a computation and is served using the RESTful framework, thus making your computation a resource (this can work nicely if you store the results in your database as a form of caching)
2) Add a route/endpoint to your API that is meant to serve the results of that computation.
Path 1: Computation as Resource
Create a model that handles the computation upon instantiation.
You could even set up an inheritance structure for computations and implement an interface for your computation models.
This way, when the resource is requested and the restful framework wants to serve this resource, the computational result will be served.
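A minimal sketch of such a computation model (field names and the pandas aggregation are made up, and the computation here runs on first save rather than literally on instantiation) could be:

    # Sketch only (assumes Django 3.1+ for models.JSONField); field names and the
    # pandas aggregation are placeholders. The result is computed once, stored,
    # and then served like any other resource by a ModelSerializer/ViewSet.
    import pandas as pd
    from django.db import models


    class Computation(models.Model):
        created_at = models.DateTimeField(auto_now_add=True)
        parameters = models.JSONField()                       # e.g. {"rows": [{"amount": 10}, ...]}
        result = models.JSONField(null=True, blank=True)

        def save(self, *args, **kwargs):
            if self.result is None:
                self.result = self.compute()                  # cache the result in the database
            super().save(*args, **kwargs)

        def compute(self):
            # Placeholder: in practice this would pull rows from other tables
            # into pandas and run the real financial calculations.
            df = pd.DataFrame(self.parameters["rows"])
            return {"total": float(df["amount"].sum())}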
Path 2: Custom Endpoint
Add a route for your computation endpoints like /myapi/v1/taxes/compute.
In the underlying controller of this endpoint, you will load up the models you need for your computation, perform the computation, and serve the result however you like (probably a JSON response).
You can still implement computations with the above-mentioned inheritance structure; that way, you can instantiate the Computation object based on a parameter (in the above case, taxes).
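A minimal sketch of such an endpoint with Django REST framework (the model, field names and computation are placeholders) could look like this:

    # Sketch only: "Transaction" and its fields are hypothetical; the point is
    # that the endpoint queries the database, runs pandas on the result, and
    # returns JSON that the Angular frontend can chart.
    import pandas as pd
    from rest_framework.decorators import api_view
    from rest_framework.response import Response

    from myapp.models import Transaction     # hypothetical model


    @api_view(["GET"])
    def compute_taxes(request):
        # Pull the rows you need (possibly from several tables) into a DataFrame.
        df = pd.DataFrame.from_records(Transaction.objects.values("category", "amount"))

        # Run the heavy calculation server-side with pandas/numpy.
        totals = df.groupby("category")["amount"].sum()

        return Response({"totals": totals.to_dict()})

You would then register the route in urls.py, e.g. with path("myapi/v1/taxes/compute", compute_taxes).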
Does this give you an idea?
