We have created an ETL in GCP which reads data from MySQL and migrates it to BigQuery. To read data from MySQL, we use the beam-nuggets library, which is passed as an extra package ('--extra_package=beam-nuggets-0.17.1.tar.gz') to the Dataflow job. A Cloud Function is used to create the Dataflow job. The code was working fine: the Dataflow job got created and the data migration succeeded.
After the latest version of SQLAlchemy (1.4) was released, we were unable to deploy the Cloud Function. The deployment failed with the exception mentioned below.
To fix this, we pinned the previous version of SQLAlchemy (1.3.23) in the requirements.txt file of the Cloud Function. This resolved the issue and the Cloud Function deployed successfully. But when we triggered the Dataflow job from the Cloud Function, we got the same error as mentioned above.
The issue occurs because the beam-nuggets library internally pulls in SQLAlchemy at runtime, and the job fails with the same error. Is it possible to force beam-nuggets to pick a specific version of SQLAlchemy?
Try passing a specific version of sqlalchemy via the extra_package flag as well.
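A minimal sketch of what that could look like, assuming you download a pinned SQLAlchemy source distribution and stage it alongside the beam-nuggets tarball (the file names, project, bucket and region below are placeholders):

```python
# Run once, outside the Cloud Function, to fetch a pinned sdist:
#   pip download sqlalchemy==1.3.23 --no-deps --no-binary :all: -d .

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",                 # placeholder
    temp_location="gs://my-bucket/temp",  # placeholder
)
# Both tarballs are shipped to the workers; the pinned SQLAlchemy sdist is what
# beam-nuggets should resolve at runtime instead of the latest 1.4 release.
options.view_as(SetupOptions).extra_packages = [
    "beam-nuggets-0.17.1.tar.gz",
    "SQLAlchemy-1.3.23.tar.gz",
]
```

The same effect should be achievable by repeating the --extra_package flag, once per tarball, if you pass options on the command line instead.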
Related
My project, which runs on Cloud Run in Google Cloud Platform (GCP), generated the error SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 68387105408768 and this is thread id 68386614675200. for hours before it went back to normal by itself.
Our code is written in Python with Flask, and no SQLite is involved. I saw suggestions to set check_same_thread to False. May I know where I can set this in Cloud Run or GCP? Thanks.
That setting has nothing to do with your runtime environment; it is set during connection initialization with sqlite (https://docs.python.org/3/library/sqlite3.html#module-functions), so if, as you say, you aren't creating an SQLite connection yourself, it won't help you much.
That being said, I find it hard to believe that you are getting that error without using SQLite. More likely, you are using SQLite via some dependency.
Since sqlite3 is part of Python's standard library, however, it may not be trivial to figure out which dependency uses it.
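For context, check_same_thread is an argument to sqlite3.connect, not a platform or Cloud Run setting; a minimal illustration (the database file name is just an example):

```python
import sqlite3

# check_same_thread is passed when the connection is created, not configured
# anywhere in Cloud Run. With False, the connection may be shared across
# threads, but you become responsible for serializing access (e.g. a lock).
conn = sqlite3.connect("example.db", check_same_thread=False)
```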
I am using Google's Python SDK (v2.6.0) to create jobs on Google Cloud Scheduler with an HTTP target. Jobs are getting created; however, I am facing a couple of other issues:
The HTTP target doesn't receive the X-Cloudscheduler-Scheduletime header, which I use to check the scheduled execution time. If I create the job manually in the Cloud Console with the same HTTP target, that header is present in the request.
Jobs created using the Python SDK cannot be updated. This is because the time_zone property is not set properly for jobs created via the SDK. In the console, these jobs don't show any timezone, and if I try to update any other property, the update is rejected with the message that 'time_zone' can't be left blank. That shouldn't be possible, because Scheduler jobs cannot be created without a timezone, so this looks like an issue with the SDK.
I have seen another post where someone using the Node.js SDK faced the same problem. Their solution was to downgrade the SDK version, but that doesn't seem to work with the Python counterpart.
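For reference, the job creation looks roughly like the sketch below (project, location, schedule and target URL are placeholders, not the real values):

```python
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = "projects/my-project/locations/us-central1"  # placeholder

job = scheduler_v1.Job(
    name=f"{parent}/jobs/my-job",              # placeholder
    schedule="0 9 1 * *",                      # placeholder cron expression
    time_zone="Europe/Paris",                  # set explicitly, yet the console shows no timezone
    http_target=scheduler_v1.HttpTarget(
        uri="https://example.com/handler",     # placeholder HTTP target
        http_method=scheduler_v1.HttpMethod.POST,
    ),
)

created = client.create_job(parent=parent, job=job)
print(created.name, created.time_zone)
```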
I'm trying to build a Python ETL pipeline in Google Cloud, and Google Cloud Dataflow seemed like a good option. When I explored the documentation and the developer guides, I saw that Apache Beam is always mentioned alongside Dataflow, since Dataflow is based on it.
I may run into issues processing my dataframes in Apache Beam.
My questions are:
If I want to build my ETL script in native Python, is that possible with Dataflow? Or is it necessary to use Apache Beam for my ETL?
Was Dataflow built just for the purpose of running Apache Beam? Is there any serverless Google Cloud tool for building a Python ETL? (Cloud Functions has a 9-minute execution limit, which may cause issues for my pipeline; I want to avoid any execution limit.)
My pipeline reads data from BigQuery, processes it, and saves it back to a BigQuery table. I may use some external APIs inside my script.
Concerning your first question, it looks like Dataflow was primarily written for use along with the Apache Beam SDK, as can be checked in the official Google Cloud documentation on Dataflow. So it is quite possible that using Apache Beam is actually a requirement for your ETL.
Regarding your second question, this tutorial gives you guidance on how to build your own ETL pipeline with Python and Google Cloud Functions, which are actually serverless. Could you please confirm if this link has helped you?
Regarding your first question, Dataflow needs to use Apache Beam. In fact, before Apache Beam there was something called the Dataflow SDK, which was Google proprietary and was later open sourced as Apache Beam.
The Python Beam SDK is rather easy once you put a bit of effort into it, and the main processing operations you'd need are very close to native Python.
If your end goal is to read, process and write to BQ, I'd say Beam + Dataflow is a good match.
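To give a sense of what that involves, a minimal Beam sketch of the BigQuery-in, BigQuery-out shape might look like this (query, table names and the processing step are placeholders, and the target table is assumed to already exist):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def process(row):
    # Placeholder transform; external API calls could happen here.
    row["processed"] = True
    return row

# Add project/region/temp_location options as needed for Dataflow.
options = PipelineOptions(runner="DataflowRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromBigQuery(
            query="SELECT * FROM `my_dataset.source_table`",   # placeholder
            use_standard_sql=True)
        | "Process" >> beam.Map(process)
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.target_table",              # placeholder
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )
```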
I have an existing DevOps pipeline that trains ML models. To guarantee the robustness of the models, they need to be retrained periodically. For this I decided to create an Azure Function that will be executed each month. It will collect the new data, apply the data pre-processing and finally trigger the Azure DevOps training pipeline. All of this must be done in Python.
From my research, I understood that this can be done using an Azure DevOps REST API request.
I found this Python Git repo https://github.com/microsoft/azure-devops-python-api, which provides an API to communicate with Azure DevOps. I executed the code provided by this package, which displays the list of my DevOps projects, but I can't find how to trigger a pipeline.
Assuming that my organisation is named ORGA1, the project is named PROJ1 and the pipeline that I want to execute is named PIPELINE1, how can I launch it using an Azure Function or even a simple Python script?
PS: I am using a Python 3.9 timer-triggered Azure Function.
Thank you in advance for your help.
EDIT
I tried to use a Logic App to do this, as #mohammed described in the comment, and I think this is a good solution. Here is the workflow that I created:
I launch the Logic App every X hours; this triggers Azure DevOps, and as soon as training ends successfully it sends me an email.
One problem remains: I am creating a new release rather than triggering a specific pipeline each time. Navigating through the different actions under the DevOps connector, I cannot find anything related to launching a DevOps pipeline. Does anyone have an idea how to do it?
You can use a Logic App with a timer to trigger your DevOps pipeline instead of an Azure Function, as it has all the built-in connectors required to interface with Azure DevOps. See: https://www.serverlessnotes.com/docs/keep-your-team-connected-using-azure-devops-and-azure-logic-apps
Thanks to the tips provided by #Mohammed I found a solution. Logic App provides what I am looking for: under the list of DevOps connectors it offers, there is an action named Queue a new build, and this is exactly what I need. This is my first experimental architecture, and I will update it later by adding an Azure Function before calling the DevOps pipeline.
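For anyone who still wants the pure Python route asked about above, something along these lines with the azure-devops package should queue a run of PIPELINE1; the personal access token and the numeric build definition id are assumptions you would need to fill in:

```python
from azure.devops.connection import Connection
from msrest.authentication import BasicAuthentication

# Assumption: a PAT with Build (read & execute) scope.
credentials = BasicAuthentication("", "MY_PERSONAL_ACCESS_TOKEN")
connection = Connection(base_url="https://dev.azure.com/ORGA1", creds=credentials)

build_client = connection.clients.get_build_client()

# Assumption: the numeric id of the PIPELINE1 definition (visible in its URL).
definition_id = 12

queued = build_client.queue_build(
    build={"definition": {"id": definition_id}},
    project="PROJ1",
)
print(queued.id, queued.status)
```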
You may try using Azure Durable Functions; you can more or less replicate what a Logic App does while still using Azure Functions. See the documentation here.
I'm trying to deploy an API to Google Cloud on Google App Engine, using Python 3 with the standard environment, and I want to use the defer function to push deferred work to Google Cloud Tasks, as seen here:
https://cloud.google.com/appengine/docs/standard/python/taskqueue/push/creating-tasks#using_the_instead_of_a_worker_service
I tried adding google-appengine to the requirements.txt file, where the Python libraries to pip install are usually listed, but the deploy fails with the following error message:
ModuleNotFoundError: No module named 'ez_setup'
I've added ez_setup to the list of requirements, before appengine, and it still results in the same error.
I've also tried deploying without importing google.appengine, thinking it might come preinstalled, but then I get the expected error saying No module named 'google.appengine' on its import.
Is there something I'm missing in the installation/import of this library? Or is the library deprecated, and is some new library used to defer?
I've also tried to install the library on my local computer and did not manage to install it there either.
As specified in the public documentation you have just shared, the feature for creating tasks and placing them in push queues is not available for the Python 3.7 runtime. That is the reason you are not able to deploy it.
If you try it on Python 2.7, it should work without issues.
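For illustration, on the python27/1st generation runtime the deferred call looks roughly like this (the task function and its arguments are placeholders):

```python
# Python 2.7 standard environment only: the bundled deferred library.
from google.appengine.ext import deferred

def expensive_task(user_id):
    # Placeholder for work that should run outside the request.
    pass

# Somewhere in a request handler:
deferred.defer(expensive_task, 42, _countdown=60)  # run roughly 60s from now
```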
As mentioned in #OqueliAMartinez's answer, the feature you're seeking (the Task Queue API) isn't available in the python37/2nd generation standard environment. The documentation page you referenced applies only to the python27/1st generation standard environment.
For the other runtimes/environments, including python37, you need to use the Cloud Tasks API, which, unfortunately, doesn't support deferred tasks (at least not yet). From Features in Task Queues not yet available via Cloud Tasks API:
Deferred/delayed tasks: In some cases where you might need a series of diverse small tasks handled asynchronously but you don't want to go through the work of setting up individual distinct handlers, the App Engine SDK allows you to use runtime specific libraries to create simple functions to manage these tasks. This feature is not available in Cloud Tasks. Note, though, that normal tasks can be scheduled in the future using Cloud Tasks.
The only way to defer functions in this case is to invoke them in the task queue handler on the worker side (and schedule the tasks in the future as needed to implement the deferral).
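A rough sketch of that pattern with the Cloud Tasks client, where the queue, handler URL and payload are assumptions and the worker-side handler decides which function to run:

```python
import datetime

from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "my-queue")  # placeholders

# Schedule the task ~5 minutes in the future to emulate the deferral.
schedule_time = timestamp_pb2.Timestamp()
schedule_time.FromDatetime(datetime.datetime.utcnow() + datetime.timedelta(minutes=5))

task = {
    "app_engine_http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "relative_uri": "/tasks/worker",         # your worker-side handler
        "body": b'{"func": "do_something"}',     # the handler maps this to a function
    },
    "schedule_time": schedule_time,
}

response = client.create_task(parent=parent, task=task)
print(response.name)
```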
Somewhat related: Cloud Tasks API for python2.7 google app engine