Running large pipelines on GCP

I want to move a one-off pipeline that I currently run locally to the cloud so that it can scale.
The script reads PDFs from a large (30 TB), static S3 bucket.
I pass these PDFs through a ThreadPool to a Docker container, which returns an output for each one.
I save the output to a file.
I can only test it locally on a small fraction of the dataset. The whole pipeline would take a couple of days to run on a MacBook Pro.
I've been trying to replicate this on GCP, which I am still discovering.
Cloud Functions doesn't work well because of its maximum timeout.
A full Cloud Composer architecture seems like overkill for a very straightforward pipeline that doesn't require Airflow.
I'd like to avoid rewriting this in Apache Beam format for Dataflow.
What is the best way to run such a Python data processing pipeline with a container on GCP?

Thanks to the useful comments on the original post, I explored other alternatives on GCP.
Using a VM on Compute Engine worked perfectly. The overhead was much less than I expected, and the setup went smoothly.
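
For a run like the one described in the question, a single VM can drive the whole job with a thread pool. The sketch below is only an illustration under assumptions: `process_one` is a hypothetical stand-in for whatever call sends one PDF to the container, and the JSON Lines output file and the skip-already-done logic are choices made here, not part of the original script.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def load_done_keys(out_path):
    """Return the set of keys already present in the output file."""
    path = Path(out_path)
    if not path.exists():
        return set()
    with path.open() as fh:
        return {json.loads(line)["key"] for line in fh if line.strip()}


def run_pipeline(keys, process_one, out_path, max_workers=8):
    """Process each key in a thread pool and append results as JSON Lines.

    Keys already recorded in out_path are skipped, so the job can be
    restarted after an interruption without redoing finished work.
    Returns (number_succeeded, list_of_(key, error)_pairs).
    """
    done = load_done_keys(out_path)
    todo = [k for k in keys if k not in done]
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool, \
            open(out_path, "a") as out:
        # process_one is hypothetical: e.g. download one PDF and call the container.
        futures = {pool.submit(process_one, k): k for k in todo}
        for fut in as_completed(futures):
            key = futures[fut]
            try:
                result = fut.result()
            except Exception as exc:
                # Record the failure and keep going; failed keys can be retried later.
                failed.append((key, repr(exc)))
                continue
            # Only the main thread writes, so no lock is needed on the file.
            out.write(json.dumps({"key": key, "result": result}) + "\n")
            out.flush()
    return len(todo) - len(failed), failed
```

Because finished keys are skipped on restart, the script can simply be relaunched if the VM is stopped partway through. For a very long key list, submitting work in batches rather than all at once would keep memory use bounded.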

Related

Can I use google DataFlow with native python?

I'm trying to build a Python ETL pipeline in Google Cloud, and Google Cloud Dataflow seemed a good option. When I explored the documentation and the developer guides, I saw that Apache Beam is always attached to Dataflow, as Dataflow is based on it.
I may run into issues processing my dataframes in Apache Beam.
My questions are:
If I want to build my ETL script in native Python with Dataflow, is that possible? Or is it necessary to use Apache Beam for my ETL?
If Dataflow was built just for the purpose of running Apache Beam, is there any serverless Google Cloud tool for building Python ETL? (Google Cloud Functions has a 9-minute execution limit, which may cause issues for my pipeline; I want to avoid any execution limit.)
My pipeline aims to read data from BigQuery, process it, and save it back to a BigQuery table. I may use some external APIs inside my script.

Concerning your first question, it looks like Dataflow was primarily written to be used with the Apache Beam SDK, as can be checked in the official Google Cloud documentation on Dataflow. So it is likely that using Apache Beam is actually a requirement for your ETL.
Regarding your second question, this tutorial gives guidance on how to build your own ETL pipeline with Python and Google Cloud Functions, which are serverless. Could you please confirm whether this link has helped you?

Regarding your first question, Dataflow needs to use Apache Beam. In fact, before Apache Beam there was something called the Dataflow SDK, which was proprietary to Google and was later open sourced as Apache Beam.
The Python Beam SDK is rather easy once you put a bit of effort into it, and the main processing operations you'd need are very close to native Python.
If your end goal is to read, process and write to BQ, I'd say Beam + Dataflow is a good match.
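
To illustrate how close a Beam pipeline stays to plain Python, here is a minimal sketch of a BigQuery-to-BigQuery job. The query, table names, schema, and the `clean_row` logic are all hypothetical placeholders. The Beam calls used (`ReadFromBigQuery`, `Map`, `WriteToBigQuery`) come from the Apache Beam Python SDK, which is assumed to be installed with its GCP extras (`apache-beam[gcp]`).

```python
def clean_row(row):
    """Plain-Python transform applied to one BigQuery row (a dict).

    Hypothetical logic: normalise a name and derive a total.
    """
    return {
        "name": row["name"].strip().lower(),
        "total": row["price"] * row["quantity"],
    }


def run(argv=None):
    # Imported here so clean_row can be unit-tested without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Runner, project, region and temp_location are taken from argv,
    # e.g. --runner=DataflowRunner --project=... --temp_location=gs://...
    options = PipelineOptions(argv)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromBigQuery(
                # Placeholder query; replace with the real source table.
                query="SELECT name, price, quantity "
                      "FROM `my_project.my_dataset.orders`",
                use_standard_sql=True,
            )
            | "Clean" >> beam.Map(clean_row)
            | "Write" >> beam.io.WriteToBigQuery(
                "my_project:my_dataset.orders_clean",  # placeholder table
                schema="name:STRING,total:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )
```

Calling `run()` with the default local runner is a reasonable way to try the pipeline on a small query before passing `--runner=DataflowRunner`. The per-row logic lives in an ordinary function, so it can be tested without any Beam dependency.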

Where to host pub sub publisher on GCP?

I'm looking to create a publisher that streams tweets containing a certain hashtag and sends them to a Pub/Sub topic.
The tweets will then be ingested with Cloud Dataflow and loaded into a BigQuery database.
In the following article they do something similar, where the publisher is hosted in a Docker image on a Google Compute Engine instance.
Can anyone recommend alternative Google Cloud resources that could host the publisher code more simply and avoid the need to create a Dockerfile, etc.?
The publisher would need to run constantly. Would Cloud Run, for example, be a suitable alternative?

There are some workarounds I can think of:
A quick way to avoid a container architecture is to keep the on_data method inside a loop, for example with something like while(true), or to start a Stream as explained in Create your Python script, and run the code on a Compute Engine instance in the background with nohup python -u myscript.py. Alternatively, follow the steps described in Script on GCE to capture tweets, which uses tweepy.Stream to start the streaming.
You might want to reconsider the Dockerfile option, since its configuration may not be that difficult. See Tweets & pipelines, where a script reads the data and publishes to Pub/Sub; the Dockerfile is only 9 lines long and the app is deployed on App Engine using Cloud Build. Another implementation with a Dockerfile is twitter-for-bigquery, which involves more specific steps and more configuration, in case it helps.
Cloud Functions is another option. In the guide Serverless Twitter with Google Cloud, you can check the Design section to see whether it fits your use case.
Airflow with Twitter Scraper could work for you, since Cloud Composer is a managed service for Airflow and you can create an Airflow environment quickly. It uses the Twint library; check the Technical section in the link for more details.
Stream Twitter Data into BigQuery with Cloud Dataprep is a workaround that sets aside complex configuration. In this case the job won't run constantly, but it can be scheduled to run within minutes.
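
Whichever host is chosen, the publisher itself is a small loop. The sketch below is an illustration under assumptions: the tweet source is any iterable of dicts with a `text` field (standing in for the streaming client), `publish` is an injected callable, and `make_pubsub_publish` shows one way to build that callable with the `google-cloud-pubsub` client library. The project and topic IDs passed to it are placeholders.

```python
import json


def matches_hashtag(tweet, hashtag):
    """True if any word of the tweet text equals the hashtag (case-insensitive)."""
    tag = "#" + hashtag.lstrip("#").lower()
    words = tweet.get("text", "").split()
    # Strip trailing punctuation so "#python!" still matches "#python".
    return any(w.strip(".,!?:;").lower() == tag for w in words)


def encode_tweet(tweet):
    """Serialise a tweet dict to UTF-8 JSON bytes (Pub/Sub payloads are bytes)."""
    return json.dumps(tweet, ensure_ascii=False).encode("utf-8")


def run_publisher(tweets, publish, hashtag):
    """Publish every matching tweet; returns the number of tweets published."""
    count = 0
    for tweet in tweets:
        if matches_hashtag(tweet, hashtag):
            publish(encode_tweet(tweet))
            count += 1
    return count


def make_pubsub_publish(project_id, topic_id):
    """Build a publish callable backed by the google-cloud-pubsub client."""
    from google.cloud import pubsub_v1  # requires google-cloud-pubsub

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)

    def publish(data):
        # publish() returns a future; result() blocks until Pub/Sub accepts it.
        publisher.publish(topic_path, data).result()

    return publish
```

Keeping `publish` injectable means the filtering loop can be exercised locally with `list.append` before it is pointed at a real topic, for example `run_publisher(stream, make_pubsub_publish("my-project", "tweets"), "python")`.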

What options are there to set up automatic reporting processes for Pandas on AWS?

I'm currently using Elastic Beanstalk and APScheduler to run Pandas reports automatically every day. The data set is getting larger, and I've already increased the memory size 3x.
Elastic Beanstalk runs a Dash dashboard application and runs the automated Pandas reports once every night.
I've tried setting up AWS Lambda to run the Pandas reports, but I couldn't figure out how to use it.
I'm looking for the most cost-effective way to run my reports without having to increase memory on Beanstalk. When I run them locally they take 1 GB, but on Beanstalk they use more than 16 GB.
I'm curious whether someone has a better option or process for running Pandas reports automatically.

Create an .exe using PyInstaller.
Schedule the .exe with Task Scheduler on your computer.
At least in my case, this is cheaper than scaling AWS Beanstalk resources, which use more resources to run Pandas than my local computer does.

Azure Databricks Python Job

I have a requirement to parse a lot of small unstructured files in near real time inside Azure and load the parsed data into a SQL database. I chose Python (because I don't think a Spark cluster or big data tooling would suit, considering the volume of source files and their size), and the parsing logic has already been written. I am looking to schedule this Python script in one of the following ways using Azure PaaS:
1. Azure Data Factory
2. Azure Databricks
3. Both 1+2
May I ask what the implications are of running a Python notebook activity from Azure Data Factory pointing to Azure Databricks? Would I be able to fully leverage the potential of the cluster (driver and workers)?
Also, please let me know whether you think the script has to be converted to PySpark to meet my use case in Azure Databricks. My only hesitation is that the files are only kilobytes in size and they are unstructured.

If the script is pure Python, it will run only on the driver node of the Databricks cluster, making it very expensive (and slow, due to cluster startup times).
You could rewrite it in PySpark, but if the data volumes are as low as you say, this is still expensive and slow. The smallest cluster will consume two VMs, each with 4 cores.
I would look at using Azure Functions instead. Python is now an option: https://learn.microsoft.com/en-us/azure/python/tutorial-vs-code-serverless-python-01
Azure Functions also has great integration with Azure Data Factory, so your workflow would still work.

Build a docker image for google bigquery

I have a huge amount of data (hundreds of gigabytes) on Google BigQuery, and for ease of use (many post-query treatments) I'm working with the bigquery Python package. The problem is that I have to rerun all my queries whenever I shut my laptop down, which is very expensive as my dataset is about one terabyte. I thought of Google Compute Engine, but this is a poor solution as I would still be paying for my machines if I don't stop them. My last solution is to mount a Docker image on our own sandbox; this is cheaper and can do exactly what I'm looking for. So I would like to know whether anyone has ever mounted a Docker image for BigQuery. Thanks for helping!

We mount all of our Python/BigQuery projects into Docker containers and push them to Google Container Registry (GCR).
Automated scheduling, dependency graphing, and logging can be handled with Google Cloud Composer (Airflow). It's pretty simple to get set up, and Airflow has a Kubernetes Pod Operator that allows you to specify a Python file to run in your Docker image on GCR. You can use this workflow to make sure all of your queries and Python scripts run on GCP without having to worry about Google Compute Engine or any DevOps-type tasks.
https://cloud.google.com/composer/docs/how-to/using/using-kubernetes-pod-operator
https://cloud.google.com/composer/
