I have a requirement to parse a lot of small unstructured files in near real time inside Azure and load the parsed data into a SQL database. I chose Python (because I don't think a Spark cluster or big data tooling would suit the volume and size of the source files) and the parsing logic has already been written. I am looking at different ways to schedule this Python script using Azure PaaS:
1. Azure Data Factory
2. Azure Databricks
3. Both 1 + 2
May I ask what the implications are of running a Python notebook activity from Azure Data Factory that points to Azure Databricks? Would I be able to fully leverage the potential of the cluster (driver and workers)?
Also, do you think the script has to be converted to PySpark to run in Azure Databricks for my use case? My only hesitation is that the files are only kilobytes in size and they are unstructured.
If the script is pure Python then it will only run on the driver node of the Databricks cluster, making it very expensive (and slow, due to cluster start-up times).
You could rewrite it in PySpark, but if the data volumes are as low as you say then this is still expensive and slow: the smallest cluster will consume two VMs, each with four cores.
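For completeness, a PySpark rewrite would look roughly like the sketch below, where `parse_file` stands in for your existing parsing logic and the storage path and JDBC connection details are placeholders:

```python
# Sketch: distributing an existing pure-Python parser across Databricks workers.
# `parse_file`, the input path, and the JDBC details are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def parse_file(path, content):
    # your existing parsing logic would go here; return a list of rows
    return [(path, len(content))]

# binaryFiles yields (path, bytes) pairs, one per small file,
# and spreads them across the worker nodes
files = spark.sparkContext.binaryFiles("wasbs://container@account.blob.core.windows.net/input/*")
rows = files.flatMap(lambda pair: parse_file(pair[0], pair[1]))

df = rows.toDF(["source_path", "value"])
(df.write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://<server>.database.windows.net;database=<db>")
   .option("dbtable", "dbo.parsed_files")
   .option("user", "<user>")
   .option("password", "<password>")
   .mode("append")
   .save())
```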
I would look at using Azure Functions instead. Python is now an option: https://learn.microsoft.com/en-us/azure/python/tutorial-vs-code-serverless-python-01
Azure Functions also has great integration with Azure Data Factory, so your workflow would still work.
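A blob-triggered Python function for this could be sketched roughly as follows; the parsing logic, the binding name, the connection-string app setting, and the target table are all placeholders:

```python
# __init__.py of a blob-triggered Azure Function (a sketch).
# The binding name `inputblob` must match function.json; the parsing logic,
# SQL_CONNECTION_STRING app setting, and target table are placeholders.
import logging
import os

import azure.functions as func
import pyodbc


def parse(content: bytes) -> list:
    # your existing parsing logic would be called here; return rows to insert
    return [("example_key", content.decode("utf-8", errors="ignore")[:100])]


def main(inputblob: func.InputStream):
    logging.info("Processing blob %s (%s bytes)", inputblob.name, inputblob.length)
    rows = parse(inputblob.read())

    conn = pyodbc.connect(os.environ["SQL_CONNECTION_STRING"])
    with conn:  # commits on success
        cursor = conn.cursor()
        cursor.executemany(
            "INSERT INTO dbo.parsed_files (file_key, payload) VALUES (?, ?)",
            rows,
        )
```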
Is it possible to run an arbitrary Python script written in PyCharm on my Azure Databricks cluster?
Databricks offers databricks-connect, but it turned out to be useful only for Spark jobs.
More specifically, I'd like to use networkx to analyse graphs so large that my local machine is unable to work with them.
I'm not sure if it's possible at all...
Thanks in advance!
I want to scale a one-off pipeline I have locally to the cloud.
- The script takes data from a large (30 TB), static S3 bucket made up of PDFs.
- I pass these PDFs through a ThreadPool to a Docker container, which gives me an output.
- I save the output to a file.
I can only test it locally on a small fraction of this dataset. The whole pipeline would take a couple of days to run on a MacBook Pro.
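Roughly, the local version has this shape (a simplified sketch; the bucket, container image, and output file names are placeholders):

```python
# Simplified sketch of the local pipeline: list PDFs from S3, push each one
# through a Docker container, and append the results to an output file.
# Bucket, image, and file names are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "my-pdf-bucket"
s3 = boto3.client("s3")


def process_pdf(key: str) -> str:
    local_path = "/tmp/" + key.replace("/", "_")
    s3.download_file(BUCKET, key, local_path)
    # the container does the actual extraction and writes its result to stdout
    result = subprocess.run(
        ["docker", "run", "--rm", "-v", "/tmp:/tmp", "my-extractor-image", local_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


keys = [obj["Key"]
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET)
        for obj in page.get("Contents", [])]

with ThreadPoolExecutor(max_workers=8) as pool, open("output.jsonl", "w") as out:
    for chunk in pool.map(process_pdf, keys):
        out.write(chunk + "\n")
```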
I've been trying to replicate this on GCP, which I am still discovering.
Using Cloud Functions doesn't work well because of its maximum timeout.
A full Cloud Composer architecture seems like overkill for a very straightforward pipeline that doesn't require Airflow.
I'd like to avoid coding this in Apache Beam format for Dataflow.
What is the best way to run such a Python data processing pipeline with a container on GCP?
Thanks to the useful comments in the original post, I explored other alternatives on GCP.
Using a VM on Compute Engine worked perfectly. The overhead and setup were much less than I expected; everything went smoothly.
I'm trying to build a Python ETL pipeline in Google Cloud, and Cloud Dataflow seemed like a good option. When I explored the documentation and the developer guides, I saw that Apache Beam is always attached to Dataflow, as Dataflow is based on it.
I may run into issues processing my dataframes in Apache Beam.
My questions are:
1. If I want to build my ETL script in native Python, is that possible with Dataflow, or is it necessary to use Apache Beam for my ETL?
2. Was Dataflow built just for the purpose of running Apache Beam? Is there any serverless Google Cloud tool for building a Python ETL? (Cloud Functions has a 9-minute execution limit, which may cause issues for my pipeline; I want to avoid any execution limit.)
My pipeline aims to read data from BigQuery, process it, and save it back to a BigQuery table. I may use some external APIs inside my script.
Concerning your first question, it looks like Dataflow was primarily written to be used along with the Apache Beam SDK, as can be checked in the official Google Cloud documentation on Dataflow. So it is likely that using Apache Beam is actually a requirement for your ETL.
Regarding your second question, this tutorial gives guidance on how to build your own ETL pipeline with Python and Google Cloud Functions, which are serverless. Could you confirm whether this link helps you?
Regarding your first question, Dataflow needs to use Apache Beam. In fact, before Apache Beam there was something called the Dataflow SDK, which was Google-proprietary and was later open-sourced as Apache Beam.
The Python Beam SDK is rather easy once you put a bit of effort into it, and the main processing operations you'd need are very close to native Python.
If your end goal is to read from, process, and write to BigQuery, I'd say Beam + Dataflow is a good match.
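To give an idea of the shape, a minimal Beam pipeline that reads from BigQuery, transforms each row in plain Python, and writes back could look like the sketch below (the project, dataset, schema, and the transform itself are placeholders):

```python
# Minimal sketch of a Beam pipeline: BigQuery -> per-row transform -> BigQuery.
# Project, bucket, table, and schema names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def enrich(row):
    # per-row processing in plain Python; calls to external APIs could go here
    row["name_upper"] = row["name"].upper()
    return row


options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromBigQuery(
           query="SELECT id, name FROM `my-project.my_dataset.source`",
           use_standard_sql=True)
     | "Transform" >> beam.Map(enrich)
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.target",
           schema="id:INTEGER,name:STRING,name_upper:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```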
We have just signed up with Azure and were wondering how to schedule and run Python scripts that extract data from various sources such as APIs, web-scraping scripts, etc. What is the best tool on Azure that can run and schedule those scripts as well as save the output to a target destination?
The output of the scripts will be saved to a data lake and/or an Azure SQL database.
Thank you.
There are several services in Azure that can do this task.
I suggest you make use of Azure WebJobs (it supports Python as well as running on a schedule).
The rough guidelines are as below:
1. Develop your Python script locally and make sure it works locally (e.g. extracts data from other sources and saves to the Azure database); see the sketch after these steps.
2. In the Azure portal, create a scheduled WebJob. During creation, you need to upload the .py file (zip all the files into a .zip file); for "Type", select "Triggered"; in the "Triggers" dropdown, select "Scheduled"; then specify when the .py file should run by using a CRON expression.
3. It's done.
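For step 1, the script could have roughly this shape (a sketch; the API URL, connection string, and target table are placeholders, and `requests`/`pyodbc` would need to be bundled with the zip or otherwise available to the WebJob):

```python
# run.py - sketch of a scheduled WebJob: pull data from an API and load it
# into Azure SQL Database. URL, connection string, and table are placeholders.
# The schedule itself is set in the portal (or a settings.job file) with a
# CRON expression, for example "0 0 6 * * *" for 06:00 every day.
import os

import pyodbc
import requests


def extract():
    response = requests.get("https://api.example.com/items", timeout=30)
    response.raise_for_status()
    return response.json()


def load(rows):
    conn = pyodbc.connect(os.environ["SQL_CONNECTION_STRING"])
    with conn:  # commits on success
        cursor = conn.cursor()
        cursor.executemany(
            "INSERT INTO dbo.items (id, name) VALUES (?, ?)",
            [(r["id"], r["name"]) for r in rows],
        )


if __name__ == "__main__":
    load(extract())
```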
You can also consider other Azure services such as an Azure Function with a timer trigger, but the WebJob is much easier.
Hope it helps; please let me know if you still have issues with this.
We are using a Python Azure Function for ETL based on several files that are dropped off in blob storage every day at the same time. Our current workflow is using several Python functions to pick up those files (using the azure-storage-blob Python library), transform them, load them to our Azure SQL database, and then archive all the files to cold storage. Right now we are relying on timer triggers because some of the functions depend on other functions to be complete before manipulating and archiving the source files.
It seems like Azure Durable Functions would be a better workflow for this, since we could orchestrate the functions better and decide when the next ETL process should run or when the files should be archived. The problem is that Azure Durable Functions is not yet supported in Python. Is it possible to use a C# Durable Function to orchestrate a Python Azure Function? Or does Microsoft recommend using the preview Python Durable Functions (https://github.com/Azure/azure-functions-durable-python)? The documentation says that Python Durable Functions should have consumption plan support by May 2020.
Also, have you considered using Logic Apps as the orchestrator for your functions? Seems like that would be a more rapid way to implement orchestration.
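If you do go ahead with the preview Python Durable Functions instead, the orchestrator for this kind of fan-out ETL could be sketched as follows (the activity names, such as `transform_file` and `archive_files`, are placeholders for your existing functions):

```python
# Sketch of a Durable Functions orchestrator (Python preview): transform all
# dropped files in parallel, load the results to SQL, then archive the sources.
# Activity names are placeholders for your existing functions.
import azure.durable_functions as df


def orchestrator_function(context: df.DurableOrchestrationContext):
    file_names = yield context.call_activity("list_new_blobs", None)

    # fan out: transform every file in parallel
    transform_tasks = [context.call_activity("transform_file", name) for name in file_names]
    transformed = yield context.task_all(transform_tasks)

    # fan in: load everything into Azure SQL, then archive the originals
    yield context.call_activity("load_to_sql", transformed)
    yield context.call_activity("archive_files", file_names)

    return "done"


main = df.Orchestrator.create(orchestrator_function)
```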