We have just signed up with Azure and were wondering how to schedule and run Python scripts that extract data from various sources like APIs, web scrape scripts, etc. What is the best tool on Azure that can run and schedule those scripts as well as save to target destination.
The output of the scripts will be saved to either data lakes and/or azure sql database.
Thank you.
There're several services in azure can do this task.
I suggest you can take use of azure webjobs(it supports python as well as support running as per schedule).
The rough guidelines are as below:
1.Develop your python scripts locally, make sure it can work locally(like extract data from other sources, save to azure database).
2.In azure portal, Create a scheduled WebJob. During creation, you need to upload the .py file(zip all the files into a .zip file); For "Type", please select "Triggered"; in the Triggers dropdown, select "Scheduled"; then specify at which time to run the .py file by using CRON Expression.
3.It's done.
You can also consider other azure services like azure function with time trigger. But the webjob is much more easier.
Hope it helps, and also please let me know if you still have more issues about that.
Related
I am working on a project that allows users to upload a python script to an API and run it on a schedule. Currently, I'm trying to figure out a way to limit the functionality of the script so that it cannot access local files, mess with the flask server running the API, etc. Do you have any ideas on how I can achieve this? Is there anyway to make it so only specific libraries are available for importing?
Running other scripts on your server is serious security issue. If you are trying to deploy Python interpreter on your web application, you can try with something like judge0 - GitHub. It is free if you deploy it yourself and it will run scripts safely inside containers.
The simplest way is to ensure the user running the script is not root, but a user specifically designed for this task (e.g. part of a group that can only read and not write or execute). This means at minimum you should ensure all files have the appropriate mode. Then you can just use a pipe or something to run the script.
Alternatively, you could use a runtime that’s not “local”, like a VM or compute service (AWS lambda, etc). The latter would be simplest, and there’s lots of vendors who offer compute service with programmatic api.
To start I managed to successfully run pywin32 locally where it opened the Excel workbooks and refreshed the SQL Query then saved and close them.
I had to download those workbooks locally from Sharepoint and have them sync to apply the changes using one drive.
My Question is would this be possible to do within Sharepoint itself ? Have a python script scheduled on a server and have the process occur there in the backend through a command.
I use this program called Alteryx where I can have batch files execute scripts and maybe I could use an API of some sort to accomplish this on a scheduled basis since thats the only server I have access to.
I have tried looking on this site and other sources but I can't find a post where it would reference this specifically.
I use Jupyter Notebooks to write my scripts and Alteryx to build a workflow with those scripts but I can use other IDEs if I need to.
I am developing a python script that downloads some excel files from a web service. These two files are combined with another one stored in my computer locally to produce the final file. This final file is loaded to some database and PowerBI dashboard to finally visualize data.
My question is: How can I schedule this to run it daily if my computer is turned off? As I said, two files are web scraped (so no problem to schedule) but one file is stored locally.
One solution that comes to my mind: Store the local file in Google Drive/OneDrive and download it with the API so my script is not dependent of my computer. But if this was the case, how can I schedule that? What service would you use? Heroku,...?
I am not entirely sure about your context, but I think you could look into using AWS Lambda for this. It is reasonably easy to set it up and also create a schedule for running code.
It is even easier to achieve this using the serverless framework. This link shows an example built with Python that will run on a schedule.
I am running the schedule package for exactly something like that.
It’s easy to setup and works very well.
I'm looking to create a publisher that streams and sends tweets containing a certain hashtag to a pub/sub topic.
The tweets will then be ingested with cloud dataflow and then loaded into a Big Query database.
In the following article they do something similar where the publisher is hosted on a docker image on a Google Compute Engine instance.
Can anyone recommend alternative Google Cloud resources that could host the publisher code more simply, that avoids the need to create a docker file etc?
The publisher would need to run constantly. Would cloud run for e.g. be a suitable alternative?
There are some workarounds I can think of:
A quick way to avoid containers architecture is having the on_data method inside a loop, for example, by using something like while(true) or start a Stream like explained in Create your Python script and run the code in a Compute Engine in the background with nohup python -u myscript.py. Or follow the steps described in Script on GCE to capture tweets that uses tweepy.Stream to start the streaming.
You might want to reconsider the Dockerfile option since its configuration could be not so difficult, see Tweets & pipelines where there is a script that read the data and publish to PubSub, you will see that 9 lines are used for the Docker file and it is deployed in App Engine using Cloud Build. Another implementation with a Docker file that requires more steps is twitter-for-bigquery, in case it helps, you will see that there are more specific steps and more configurations.
Cloud Functions is also another option, in this guide Serverless Twitter with Google Cloud you can check the Design section to know if it fits your use case.
Airflow with Twitter Scraper could work for you since Cloud Composer is a managed service for Airflow and you can create an Airflow environment quickly. It uses the Twint library, check the Technical section in the link for more details.
Stream Twitter Data into BigQuery with Cloud Dataprep is a workaround that put aside complex configurations. In this case the job won't run constantly but can be scheduled to run within minutes.
I have a python script that is intended to run on my local machine every night. It's goal is to pull data from a third party server, do some processing on it, and execute bulk upload to GAE datastore.
My issue though is hot to run bulk upload from a python script. All examples I have seen (including Google's documentation) use command line "appcfg.py upload_data ..." and as far as I can see appcfg.py and bulkloader.py do not expose any API that is guaranteed not to change.
My two options as I see them now is to either execute "appcfg.py upload_data ..." command from my python script, which seems a roundabout way of doing things. Or to directly call appcfg.py's internal methods, which means I have to recode tings in case they change.
Appengine can run cron jobs. All you need is to write is a single script which pulls the data from third party server and upload it to appengine engine, Appenigne will do the rest for you. Appengine cron this has everything you need to know about running a cron job in appengine
This answer is now outdated. Please see the below link for my latest answer for bulk upload data to app engine.
How to upload data in bulk to the appengine datastore? Older methods do not work