Airflow/GitHub integration issues when trying to clone a remote repo - python

So I'm fairly new to Airflow and have only really been using GitHub as a basic push/pull tool rather than getting under the hood and using it for anything more complex.
That said, now I'd like to do something more complex with Airflow/GitHub.
My organisation uses Google Cloud for pretty much everything, and I currently use Magnus to trigger my scheduled queries.
For many reasons I'm aiming to move these tasks over to Airflow. What I'm actually trying to do is host my source code in GitHub and use GitPython to find the .sql files that Airflow then uses to trigger my refresh.
I'm having trouble understanding how I can 'host' my GitHub repo in an Airflow instance and then isolate a file to push to a DAG task.
So, problem 1: each time I try to connect to my remote repo, I receive a Windows error
Cmd('git') not found due to: FileNotFoundError('[WinError 2] The system cannot find the file specified')
cmdline: git pull Remote_server_Address.git
I'm trying various commands but not really finding the documentation useful.
As I'm aiming to host the repo in Airflow (preferably within just a Python instance), I'm hoping I don't need to provide a local path, but even when I do provide one I still get the same error.
All help appreciated and apologies if it's vague.
Any other integration suggestions would also be welcome.
Thanks

It is a little hard to understand the setup you describe.
For example:
isolate a file to push to a DAG task
Does this mean you want a task to read a specific file when you run an instance of it?
If that is the case, you probably want to pass the file location (likely hosted in GCS) to the DAG. This explains how.
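A minimal sketch of that pattern, assuming Airflow 2.x (the DAG id, task id, and the "sql_path" conf key are all illustrative names):
# Read a file path passed in via dag_run.conf, e.g. triggered with:
#   airflow dags trigger my_sql_refresh --conf '{"sql_path": "gs://my-bucket/query.sql"}'
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_sql_file(**context):
    sql_path = context["dag_run"].conf.get("sql_path")
    print(f"would run the query stored at {sql_path}")

with DAG(
    dag_id="my_sql_refresh",           # illustrative dag id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,            # only runs when triggered with a conf payload
    catchup=False,
) as dag:
    PythonOperator(task_id="run_sql", python_callable=run_sql_file)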
However, a more common pattern is for something like a daily job to automatically select the file or run a query based on the date.
You could also set up a sensor that will trigger a DAG when a file is added to a specific GCS folder, using the GCS sensor.
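A rough sketch of the sensor approach, assuming the apache-airflow-providers-google package is installed (the bucket and object names are placeholders):
# Wait for an object to appear in GCS before running downstream tasks
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

with DAG(
    dag_id="wait_for_sql_file",        # illustrative dag id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_file = GCSObjectExistenceSensor(
        task_id="wait_for_file",
        bucket="my-bucket",                    # placeholder bucket
        object="incoming/{{ ds }}/query.sql",  # placeholder path, templated with the run date
    )
    # chain the task(s) that actually run the query after wait_for_file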

Related

How to sync Python files from the PyCharm IDE to GitHub automatically?

I'm currently using the PyCharm IDE to learn Python. I don't know how to sync my files to GitHub automatically. To be precise, I want my code to sync to my GitHub repo automatically as I type, so that the file exists in GitHub and I edit it through my IDE.
Is there any solution for this to happen?
Regards,
Kausik
That is not how git (or GitHub) works. Version control systems are designed to capture milestones in your project. I think you're confusing git with cloud file storage services (e.g., Dropbox or Google Drive). If you need something that syncs your files with each "save" you make to a file, then services like Dropbox are what you're looking for.
However, version control systems (e.g., git) are much better suited for code management if you adjust your workflow to follow how they were intended to be used. In PyCharm, after each milestone (e.g., a bug fix or a new feature implementation) you would do the following:
Stage changed files by checking them.
Commit the changes by adding a commit message.
Push changes to the remote repository.
All of these can be found in the Commit window in PyCharm (View >> Tool Windows >> Commit), and all three steps can be done in one click.
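For reference, the same three steps can be scripted outside the IDE; here is a minimal sketch using GitPython (the repository path, file name, and commit message are placeholders):
# Stage -> commit -> push, the same cycle PyCharm's Commit window performs
from git import Repo

repo = Repo("/path/to/your/project")              # placeholder path to the local checkout
repo.index.add(["my_module.py"])                  # 1. stage the changed file
repo.index.commit("Fix: handle empty input")      # 2. commit with a message
repo.remote("origin").push()                      # 3. push to the remote (e.g. GitHub)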
One last thing. If your goal is to collaborate with someone else in real-time, then PyCharm has a new feature called "Code with me" (Tools >> Code With Me ...). I don't know if it is available for free but the idea is that you would invite friends and change the code base together in real-time. And eventually, you would push the changes to the remote repository.

How can I schedule a Python script in the cloud?

I am developing a Python script that downloads some Excel files from a web service. These two files are combined with another one stored locally on my computer to produce the final file. This final file is loaded into a database and a Power BI dashboard to visualize the data.
My question is: how can I schedule this to run daily if my computer is turned off? As I said, two files are web-scraped (so no problem to schedule), but one file is stored locally.
One solution that comes to mind: store the local file in Google Drive/OneDrive and download it with the API so my script does not depend on my computer. But if this was the case, how could I schedule that? What service would you use? Heroku,...?
I am not entirely sure about your context, but I think you could look into using AWS Lambda for this. It is reasonably easy to set up, and it also lets you create a schedule for running code.
It is even easier to achieve this using the Serverless Framework. This link shows an example built with Python that will run on a schedule.
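For what it's worth, the Python side of a scheduled Lambda is just an ordinary handler function; the daily schedule itself lives in the Lambda configuration (for example an EventBridge rule or a serverless.yml schedule event), not in the code. A sketch, with a placeholder body:
# Lambda entry point for the daily refresh; the schedule is configured
# on the Lambda/serverless side, not here.
def handler(event, context):
    # placeholder body: download the Excel files, combine with the
    # reference file, and load the result into the database
    print("running daily refresh")
    return {"status": "ok"}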
I am running the schedule package for exactly this kind of thing.
It's easy to set up and works very well.
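For example, a daily run with the schedule package looks roughly like this (the run time and the job body are placeholders):
# Run a placeholder job once a day with the `schedule` package
import time
import schedule

def refresh():
    # placeholder body: download, combine, load
    print("running daily refresh")

schedule.every().day.at("06:00").do(refresh)   # run time is a placeholder

while True:
    schedule.run_pending()
    time.sleep(60)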

How do I add a dynamic text file to a very simple Heroku app?

I have a very simple Heroku app that basically runs one Python script every ten minutes or so, 24/7. The script uses a text file to store a really simple queue of information (no sensitive info) that gets updated every time it runs.
I have Heroku set to deploy the app via GitHub, but it seems like way too much work to make it programmatically commit, push, and redeploy the entire thing just to update the queue in the text file. How can I attach this file to Heroku in a way that lets it be updated easily? I've been playing around with the free database add-ons, but those also seem overcomplicated (in the sense that I've got no clue how to use them).
I'm also totally open to accusations that I'm making mountains out of molehills when I could easily be using some other easier platform to freely run this script 24/7 with the queue file.
At this point I'm sure that nobody cares, but this answer is for you, future troubleshooter.
It turns out that the Heroku script works fine with a txt file queue. Once the queue is included in the Heroku deployment, the script will pull from the queue and even update it, giving the correct behavior. All you have to do is put the queue file in with the GitHub repo and open/change it from your Python files as you normally would.
The confusing thing is that this does not change the files in GitHub. It leaves the queue in the GitHub repo copy as the same text file it was when it was originally pushed, which means pulling and pushing the repo is a little confusing because the stored queue gets out of date very fast.
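In other words, something as simple as the following works inside the deployed app ("queue.txt" stands in for whatever file name was committed with the repo; the updated file lives only in the deployed copy, not back in GitHub, as described above):
# Read the queue file deployed with the repo, take one item, write the rest back
with open("queue.txt") as f:
    queue = [line.strip() for line in f if line.strip()]

item = queue.pop(0) if queue else None   # next item to work on
# ... process `item` here ...

with open("queue.txt", "w") as f:
    f.write("\n".join(queue))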
Thanks for the question, me, I'm happy to help.

Deploying a Python cloud service/WebJob to Azure via Git

I'm trying to find a way to bind my cloud service or WebJob to Git.
I've tried following this guide.
Everything worked well - the files were uploaded and a build job was initiated on the server, BUT I keep getting the following error:
C:\a\src\AzureCloudService1\Crawler\Crawler.pyproj (48, 0) The imported project "C:\Program Files (x86)\MSBuild\Microsoft\VisualStudio\v12.0\Python Tools\Microsoft.PythonTools.Worker.targets" was not found. Confirm that the path in the <Import> declaration is correct, and that the file exists on disk.
I've searched for this problem and, based on suggestions I found, I uploaded the missing files and changed the location that points to them.
The previously missing files were then read successfully, but they try to use other files as well, which can't be found for the same reason.
In short, I get a chain of "not found" files.
I'm out of ideas and will appreciate your help.
Your issue has little to do with the Azure Cloud Service deployment itself; it's a limitation of the VSO build process. VSO uses MSBuild to check the code and build the project, and the Python tools dependencies are missing on the VSO server. Here is the same issue as yours, explained by a VSO engineer. I'd like to quote a paragraph of that communication:
VSO Build preview is going to be able to better support non-.NET projects, and explicit support for Python projects will come eventually but is already available through command-line options. Getting Cloud Service projects changed to work better is more difficult (I don't even have a good contact for that team right now). Our own team also has conflicted priorities, and right now we've got everyone focusing on fixing crashes and issues that affect most of our users - working around Cloud Service's lack of extensibility is one of the (many) things that gets pushed down the list.
Currently, we can publish the Cloud Service to Azure directly from Visual Studio as a workaround.
For more details, please read Python web and worker roles with Python Tools 2.2 for Visual Studio.

Loading Python libraries via HTTP

I have several small Python libraries that I wrote with stuff that I find myself wanting over and over again. I think most programmers have something similar. I want to use these libraries from a variety of different machines so I've started keeping this stuff in my DropBox. However, I'd like to be able to use my code on machines on which I can't install DropBox or other cloud storage applications, even in portable form. I can just download the files every time one of them changes (DropBox can provide me a URL for each file in my Public folder), which is only a moderate nuisance. But--and I admit this is a longshot--is there a solution out there that will let me tell Python to load a library from my DropBox via http?
BTW, I'd like to add the whole remote folder to my sys.path, but getting a URL for a folder is complicated, so I'm going to try to walk before I run by starting with individual files.
Yes, it's possible. I think you want the combination of two previous questions:
How to download a file in python over HTTP
How to dynamically load a library in python
So your task basically breaks down into writing a little bit of glue code: download the URL via the first bullet, write it to a local file, and then import that file using the second bullet.
So that's how you'd do that.
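A minimal sketch of that glue code (the URL and file name are placeholders):
# Fetch the module over HTTP, write it locally, then import it dynamically
import importlib.util
import urllib.request

URL = "https://dl.dropboxusercontent.com/s/XXXX/mylib.py"   # placeholder public link

data = urllib.request.urlopen(URL).read()
with open("mylib.py", "wb") as f:
    f.write(data)

spec = importlib.util.spec_from_file_location("mylib", "mylib.py")
mylib = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mylib)
# mylib is now a normal module object: mylib.some_function(...)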
BUT - please keep in mind that dynamically downloading and executing code has many potential security pitfalls. Will you be doing this over a secure connection? Who else has the ability to manipulate that URL? There are a bunch of security issues inherent in downloading and executing code on the fly. I would ask you to consider going about your solution in a different way, but I'm giving you the answer you're asking for.
As a simple security check, you can establish a known-good hash for your file, and then refuse to import any file other than one that's on the list of known-good hashes. This makes it a pain to update your modules, but gives you a little bit of extra safety.
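The hash check slots in between the download and the import; for example (the digest below is a placeholder):
# Refuse to import anything whose SHA-256 digest isn't on the known-good list
import hashlib

KNOWN_GOOD = {"0123abcd..."}   # placeholder digest(s) of files you trust

def check_known_good(data: bytes) -> None:
    if hashlib.sha256(data).hexdigest() not in KNOWN_GOOD:
        raise RuntimeError("downloaded module is not on the known-good list")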
Don't use DropBox as a revision control system
Pick a real solution like Git
Set up access to the Git repository on one of your servers
Clone the repository to your worker machines and check out master
Create a develop branch where you put every change you make
Test the changes and when you consider any of them stable, merge it to master
On your worker machines, set up a cron job that periodically pulls from the master branch of the repository (and possibly restarts some long-running Python processes, since re-importing a module won't make the interpreter pick up changes: imported modules are cached); a sketch of such a pull script follows this list
Enjoy your automatically updated workers :)
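The periodic pull itself can be a few lines of Python called from cron; a sketch using GitPython (the paths and schedule are placeholders):
# Update script a crontab entry would call, e.g.:
#   */15 * * * * /usr/bin/python3 /opt/worker/update_repo.py
from git import Repo

repo = Repo("/opt/worker/my-libs")     # placeholder path to the clone on the worker
repo.git.checkout("master")
repo.remote("origin").pull()
# restart any long-running Python processes here so they re-import the updated modules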
Don't feel ashamed; it happens that even experienced software developers end up with an XY problem.
