I have a huge amount of data (hundreds of gigabytes) on Google BigQuery, and for ease of use (many post-query treatments) I'm working with the bigquery Python package. The problem is that I have to re-run all my queries whenever I shut my laptop down, which is very expensive as my dataset is about one terabyte. I thought of Google Compute Engine, but this is a poor solution, as I would still be paying for my machines if I don't stop them. My last solution is to mount a Docker image on our own sandbox, which is cheaper and can do exactly what I'm looking for. So I would like to know if someone has ever mounted a Docker image for BigQuery? Thanks for helping!
We mount all of our Python/BigQuery projects into Docker containers and push them to Google Container Registry (GCR).
Automated scheduling, dependency graphing, and logging can be handled with Google Cloud Composer (Airflow). It's pretty simple to set up, and Airflow has a KubernetesPodOperator that allows you to specify a Python file to run in your Docker image on GCR. You can use this workflow to make sure all of your queries and Python scripts are run on GCP without having to worry about Google Compute Engine or any DevOps-type concerns.
https://cloud.google.com/composer/docs/how-to/using/using-kubernetes-pod-operator
https://cloud.google.com/composer/
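A minimal sketch of such a DAG (the DAG id, image name, schedule, namespace, and script name below are placeholders, not taken from the answer above):

```python
# Sketch: a Composer (Airflow) DAG that runs a script from a GCR image
# via the KubernetesPodOperator.
from datetime import datetime

from airflow import DAG
# On Composer 1 / older Airflow the import path is
# airflow.contrib.operators.kubernetes_pod_operator instead.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="bigquery_batch_jobs",          # placeholder
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_queries = KubernetesPodOperator(
        task_id="run_queries",
        name="run-queries",
        namespace="default",
        image="gcr.io/my-project/bq-jobs:latest",  # your image on GCR
        cmds=["python", "run_queries.py"],         # script baked into the image
    )
```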
Is it practically possible to simulate AWS environment locally using Moto and Python?
I want to write an AWS Glue job that will fetch records from my local database and upload them to an S3 bucket for a data quality check, and later trigger a Lambda function for a cron-job run, using the Moto library with the moto.lock_glue decorator. Any suggestion or document would be highly appreciated, as I don't see much of a clue on this. Thank you in advance.
AFAIK, moto is meant to patch boto modules for testing.
I have experience working with LocalStack, a Docker container you can run locally that acts as a live service emulator for most AWS services (some are only available to paying users).
https://docs.localstack.cloud/getting-started/
You can see here which services are supported by the free version.
https://docs.localstack.cloud/user-guide/aws/feature-coverage/
To use it, you need to change the endpoint URL to point to the local service running in Docker.
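For example, a minimal sketch with boto3 (the bucket and file names are placeholders; 4566 is LocalStack's default edge port):

```python
import boto3

# Point the client at LocalStack instead of real AWS.
# Credentials can be dummy values; LocalStack does not validate them.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",
    aws_access_key_id="test",
    aws_secret_access_key="test",
    region_name="us-east-1",
)

s3.create_bucket(Bucket="data-quality-check")                       # placeholder bucket
s3.upload_file("records.csv", "data-quality-check", "records.csv")  # placeholder file
```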
As it's a Docker container, you can incorporate it into remote tests as well, e.g. if you're using Kubernetes or a similar orchestrator.
I want to scale a one-off pipeline I have locally to the cloud.
The script takes data from a large (30TB), static S3 bucket made up of PDFs
I pass these PDFs in a ThreadPool to a Docker container, which gives me an output
I save the output to a file.
I can only test it locally on a small fraction of this dataset. The whole pipeline would take a couple of days to run on a MacBook Pro.
I've been trying to replicate this on GCP - which I am still discovering.
Using Cloud Functions doesn't work well because of its maximum timeout.
A full Cloud Composer architecture seems a bit of an overkill for a very straightforward pipeline that doesn't require Airflow.
I'd like to avoid coding this in Apache Beam format for Dataflow.
What is the best way to run such a Python data-processing pipeline with a container on GCP?
Thanks to the useful comments in the original post, I explored other alternatives on GCP.
Using a VM on Compute Engine worked perfectly. The overhead is much less than I expected, and the setup went smoothly.
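A minimal sketch of this kind of pipeline, which can run unchanged on the VM (the bucket name, container image, and worker count below are placeholders, not from the post):

```python
# Sketch: download PDFs from S3, hand each one to a local Docker container,
# collect the container's stdout into a single output file.
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "my-static-pdf-bucket"             # placeholder
IMAGE = "gcr.io/my-project/pdf-processor"   # placeholder

s3 = boto3.client("s3")

def process(key: str) -> str:
    local_path = os.path.join("/tmp", os.path.basename(key))
    s3.download_file(BUCKET, key, local_path)
    # Run the processing container on the downloaded PDF and capture output.
    result = subprocess.run(
        ["docker", "run", "--rm", "-v", "/tmp:/tmp", IMAGE, local_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# List the PDF keys (a real 30 TB bucket would need pagination) and fan out.
keys = [obj["Key"] for obj in s3.list_objects_v2(Bucket=BUCKET)["Contents"]]
with ThreadPoolExecutor(max_workers=8) as pool, open("output.txt", "w") as out:
    for chunk in pool.map(process, keys):
        out.write(chunk)
```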
I'm looking to create a publisher that streams and sends tweets containing a certain hashtag to a Pub/Sub topic.
The tweets will then be ingested with Cloud Dataflow and loaded into a BigQuery database.
In the following article they do something similar, where the publisher is hosted in a Docker image on a Google Compute Engine instance.
Can anyone recommend alternative Google Cloud resources that could host the publisher code more simply, avoiding the need to create a Dockerfile, etc.?
The publisher would need to run constantly. Would Cloud Run, for example, be a suitable alternative?
There are some workarounds I can think of:
A quick way to avoid a container architecture is to have the on_data method inside a loop, for example by using something like while(True), or to start a Stream as explained in Create your Python script, and run the code on a Compute Engine instance in the background with nohup python -u myscript.py. Or follow the steps described in Script on GCE to capture tweets, which uses tweepy.Stream to start the streaming (see the sketch after this list).
You might want to reconsider the Dockerfile option, since its configuration may not be that difficult. See Tweets & pipelines, where there is a script that reads the data and publishes to Pub/Sub; only 9 lines are used for the Dockerfile, and it is deployed on App Engine using Cloud Build. Another implementation with a Dockerfile that requires more steps is twitter-for-bigquery, in case it helps; there you will see more specific steps and more configuration.
Cloud Functions is another option; in the guide Serverless Twitter with Google Cloud you can check the Design section to see whether it fits your use case.
Airflow with Twitter Scraper could work for you, since Cloud Composer is a managed service for Airflow and you can create an Airflow environment quickly. It uses the Twint library; check the Technical section in the link for more details.
Stream Twitter Data into BigQuery with Cloud Dataprep is a workaround that avoids complex configuration. In this case the job won't run constantly, but it can be scheduled to run every few minutes.
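For reference, a minimal sketch of the publisher part (project, topic, hashtag, and credentials are placeholders; it assumes the pre-v4 tweepy Stream/StreamListener API referenced above):

```python
# Sketch: a tweepy listener that forwards each raw tweet to a Pub/Sub topic.
import tweepy
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "tweets")  # placeholders

class PubSubListener(tweepy.StreamListener):
    def on_data(self, raw_data):
        # Pub/Sub messages are bytes; raw_data is the tweet JSON string.
        publisher.publish(topic_path, data=raw_data.encode("utf-8"))
        return True

    def on_error(self, status_code):
        # Returning False disconnects the stream on errors such as 420.
        return False

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")   # placeholders
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")    # placeholders
stream = tweepy.Stream(auth=auth, listener=PubSubListener())
stream.filter(track=["#myhashtag"])  # blocks and runs until interrupted
```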
I am a newbie who wants to deploy his Flask app using Google Cloud Functions. When I search online, people tell me to deploy it as a Flask app. I want to ask if there is any difference between those two.
A cloud instance or deploying a Flask app on Google Cloud vs. a serverless Cloud Function
As described by John and Kolban, Cloud Functions is a single-purpose endpoint. You want to perform one thing, you deploy one function.
However, if you want to have many related things, like a microservice, you will have to deploy several endpoints that allow you to perform CRUD operations on the same data object. You should prefer to deploy several endpoints (CRUD) and to have the capability to easily reuse class and object definitions and business logic. For this, a Flask web server is what I recommend (and what I prefer; I wrote an article on this).
Packaging it in Cloud Run is the best option for having a serverless platform and a pay-per-use pricing model (and automatic scaling, and so on).
There is an additional great thing: the Cloud Functions request object is based on the Flask request object. By the way, and this is also something I present in my article, it's easy to switch from one platform to the other. You only have to choose according to your requirements and your skills. I also wrote another article on this.
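A minimal sketch of that similarity (the route, function names, and entry point are placeholders):

```python
# The same handler logic as a Flask route and as an HTTP Cloud Function.
# In Cloud Functions, GCP hands your function a flask.Request object.
from flask import Flask, request

app = Flask(__name__)

@app.route("/hello")
def hello_route():
    name = request.args.get("name", "world")
    return f"Hello {name}"

def hello_http(request):
    # Deployed with entry point "hello_http" and an HTTP trigger.
    name = request.args.get("name", "world")
    return f"Hello {name}"
```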
If you deploy your Flask app as an application in a Compute Engine VM instance, you are basically configuring a computer and application to run your code. The notion of Cloud Functions relieves you from the chore and toil of having to create and manage the environment in which your program runs. A marketing mantra is "You bring the code, we bring the environment". When using Cloud Functions, all you need to do is code your application logic. The maintenance of the server, scaling up as load increases, making sure the server is available, and much more is taken care of for you. When you run your code in your own VM instance, it is your responsibility to manage the whole environment.
References:
HTTP Functions
Deploying a Python serverless function in minutes with GCP
I'm new to IBM Cloud and cloud platforms in general, and I wanted to run my Flask app on IBM Cloud. I just started with Getting started with Python, but I'm very confused about how it will work.
Does Cloud Foundry work the same way as containers do?
How does the platform handle the dependencies so that Flask can use them in both deployment options?
Your question is (almost) too broad. I can give you some basic answers, but everything else should go into separate questions when you run into specific problems. You are referring to the Getting Started with Python and Cloud Foundry guide on IBM Cloud (that is the IBM Cloud docs, not the GitHub repo).
When working with Cloud Foundry (CF), the CF environment and buildpack takes care of the dependencies. For Python, they are specified in the file requirements.txt and there is the file manifest.yml to configure the app, its name, memory usage, domain and more. When you push the app (either cf push or ibmcloud cf push) the two files are taken into account and everything else is done automatically. That's the appeal of Cloud Foundry.
With containers, you would write a Dockerfile, then build the container image, push it to a container registry, and deploy the container to Kubernetes. When you build the container, your script would need to take care of resolving the dependencies (based on requirements.txt) and include the necessary modules in the image.
I recommend reading the Deploy an Application Cloud Foundry doc as a starter to give some more background. There is also a simple IBM Cloud solution tutorial that walks you through the steps of deploying a Flask app with a Db2 database. That same site with IBM Cloud solution tutorials also has an overview of tutorials by deployment option (Cloud Foundry, Kubernetes, Cloud Functions, etc.).