I have 3 scripts: the 1st and the 3rd are written in R, and the 2nd in Python.
The output of the 1st script is the input of the 2nd script, and the 2nd script's output is the input of the 3rd one.
The inputs and outputs are search keywords or phrases.
For example, the output of the 1st script is Hello, then the 2nd turns the word to olleH, and the 3rd one converts the letters to uppercase: OLLEH.
My question is: how can I connect those scripts and let them run automatically, without my intervention, on AWS? What would the commands be? How can the output of the 1st script be saved and then serve as the input of the 2nd one, and so on?
I would start with an sh script (or a .bat file on a Windows machine), then use the output of each script as the input for the next. So something like:
var1=$(Rscript script1.R)
var2=$(python script2.py "$var1")
var3=$(Rscript script3.R "$var2")
echo "$var3"
Of course, you need to change your scripts so that they read the inputs passed to them and print their results.
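For example, a minimal sketch of what script2.py could look like so that it reads the keyword from its first command-line argument and prints the reversed result (the reversal is just the example from the question):
# script2.py -- sketch: read the keyword from argv and print it reversed
import sys

if __name__ == "__main__":
    word = sys.argv[1] if len(sys.argv) > 1 else ""
    print(word[::-1])  # e.g. "Hello" -> "olleH"
The R scripts would do the analogous thing with commandArgs(trailingOnly = TRUE) for input and cat() for output.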
I have never used AWS, so I'm unfamiliar with that, but this sounds like a problem a workflow management system would solve. Take a look at Snakemake or Nextflow. With these tools you can (after you get used to them) do exactly what you describe: run scripts/tools that depend on each other sequentially, and also in parallel.
You can use AWS Step Functions to achieve your goal. For the Python parts you can use AWS Lambda tasks, for the R parts AWS ECS tasks, and orchestrate the data flow between them accordingly.
https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html
For commands, I wouldn't count on receiving a comprehensive response - workflows are complex and very individual in each case, but I would recommend defining them via some sort of IaC solution like CloudFormation or AWS CDK and keeping them under git.
https://docs.aws.amazon.com/cdk/api/latest/docs/aws-stepfunctions-readme.html
When you complete this tutorial https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-getting-started-hello-world.html, you download the AWS SAM CLI and run the commands in order to create a simple AWS hello-world application. Running the program triggers what AWS calls a Lambda function, and at the end of the tutorial you can open http://127.0.0.1:3000/hello in your browser; if you see a response with curly braces and the words 'hello-world', it was successful.
Running the AWS SAM commands generates a lot of boilerplate code, which is a bit confusing. This can all be seen inside a code editor. One of the generated files is called event.json, which is of course a JSON object, but why is it there? What does it represent in relation to this program? I am trying to understand what this AWS SAM application is ultimately doing and what the generated files mean and represent here.
Can someone simply break down what AWS SAM is doing and the meaning behind the boilerplate code it generates?
Thank you
event.json contains the input your Lambda function will get, in JSON format. Regardless of how a Lambda is triggered, its handler always has two fixed parameters: event and context. context contains additional information about the trigger, such as the source, while event contains any input parameters that your Lambda needs to run.
You can test this out yourself by editing event.json and supplying your own values. If you open the Lambda code file you will see this event object being used in the lambda_handler.
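For illustration, a minimal sketch of a handler that echoes a field from the event; the field name myparam is an assumption, not something the tutorial defines:
import json

def lambda_handler(event, context):
    # 'event' holds the parsed contents of event.json when invoked locally
    name = event.get("myparam", "world")  # hypothetical input field
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "hello " + name}),
    }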
Other boilerplate stuff is your template, where you can define the configuration of your Lambdas as well as any other services you might use, such as layers, a database, or an API gateway.
You also get a requirements.txt file, which contains the names of any third-party libraries that your function requires. These will be packaged along with the code.
Ninad's answer is spot on. I just want to add a practical application of how these JSON files are used. One way event.json is used is when you invoke your Lambdas using the command sam local invoke. When you invoke a Lambda locally, you pass the event.json (or whatever you decide to call the file; you will likely have multiple) as a parameter. As Ninad mentioned, the event file has everything your Lambda needs to run in terms of input. When the Lambdas are hooked up to other services and running live, these inputs would be fed to your Lambda by that service.
I am trying to create a Python program that deobfuscates PowerShell malware that uses IEX. My Python program hooks the IEX function so that, instead of running the decoded string, it prints the string.
Now my problem is that I have several .ps1 scripts (for example 1.ps1, 2.ps1, etc.) and I want to run all of them in the same session, so that the local variables created by the 1.ps1 script can be used by the 2.ps1 script...
I have tried many ways. First I tried subprocess, but it creates a new session every time I enter a command (the path of the .ps1 file). Then I found this project on GitHub:
https://gist.github.com/MarkBaggett/a7c10195b2626c78009bf73bcdb6db20
It is really awesome and did work, but it still seems that when I run the command ./1.ps1 it does not store the local variables in the session (maybe it opens a new one when running a script).
I also tried "Get-Content 1.ps1 | iex", but that crashes because I have functions in there, for example:
function Invoke-Expression()
{
param(
[Parameter( `
Mandatory=$True, `
Valuefrompipeline = $True)]
[String]$Command
)
Write-Host $Command
}
taken from the PSDecode project:
https://github.com/R3MRUM/PSDecode/blob/master/PSDecode.psm1#L28
Anyway, any ideas on how I can do this? I have those scripts on my desktop but no idea how to run them in the same session so they share the same local variables...
Two things that I did try, though they really suck:
1. Convert all the scripts into one script and run that, but the next time I use this program I might have 100 scripts or more, and I don't really want to do this.
2. Save the local variables from each script and load them into the next. I want to keep that as a worst-case option, and I still haven't got it working anyway.
Thank you so much for helping me, and sorry for my grammar; English is not my mother language, as you can see :)
Maybe you're looking for dot sourcing:
Runs a script in the current scope so that any functions, aliases, and variables that the script creates are added to the current scope.
PowerShell
. c:\scripts\sample.ps1
If so, dot-source your ps1 files and call the functions inside them.
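Since you are driving this from Python, one possible sketch (using the 1.ps1/2.ps1 names from the question) is to dot-source all of the scripts inside a single powershell process so that they share one scope:
# sketch: run several .ps1 files in one PowerShell session by dot-sourcing them
import subprocess

scripts = ["./1.ps1", "./2.ps1"]  # the example scripts from the question
# dot-sourcing (". file.ps1") keeps each script's variables and functions in the shared scope
command = "; ".join(". " + s for s in scripts)
subprocess.run(["powershell", "-NoProfile", "-Command", command])
If your IEX-hooking module has to be loaded first, dot-source or import it in the same command string before the scripts.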
Hope that helps.
I am writing a watchman command with watchman-make and I'm at a loss when trying to access exactly what was changed in the directory. I want to run my upload.py script, and inside the script I would like to access the filenames of newly created files in /var/spool/cups-pdf/ANONYMOUS.
so far I have
$ watchman-make -p '/var/spool/cups-pdf/ANONYMOUS' --run 'python /home/pi/upload.py'
I'd like to add another argument to python upload.py so that I have the exact filepath of the newly created file and can send it over to my database from upload.py.
I've been looking at the watchman docs, and the closest thing I can think of using is a trigger object. Please help!
Solution with watchman-wait:
Assuming a project layout like this:
/posts/_SUBDIR_WITH_POST_NAME_/index.md
/Scripts/convert.sh
And a shell script like this:
#!/bin/bash
# File: convert.sh
SrcDirPath=$(cd "$(dirname "$0")/../"; pwd)
cd "$SrcDirPath"
echo "Converting: $SrcDirPath/$1"
Then we can launch watchman-wait like this:
watchman-wait . --max-events 0 -p 'posts/**/*.md' | while read line; do ./Scripts/convert.sh "$line"; done
When we change the file /posts/_SUBDIR_WITH_POST_NAME_/index.md, the output will look like this:
...
Converting: /Users/.../Angular/dartweb_quickstart/posts/swift-on-android-building-toolchain/index.md
Converting: /Users/.../Angular/dartweb_quickstart/posts/swift-on-android-building-toolchain/index.md
...
watchman-make is intended to be used together with tools that will perform a follow-up query of their own to discover what they want to do as a next step. For example, running the make tool will cause make to stat the various deps to bring things up to date.
That means that your upload.py script needs to know how to do this for itself if you want to use it with watchman.
You have a couple of options, depending on how sophisticated you want things to be:
Use pywatchman to issue an ad-hoc query
If you want to be able to run upload.py whenever you want and have it figure out the right thing (just like make would do), then you can have it ask watchman directly. You can have upload.py use pywatchman (the python watchman client) to do this. pywatchman will get installed if the watchman configure script thinks you have a working python installation. You can also pip install pywatchman. Once you have it available and on your PYTHONPATH:
import os
import pywatchman

client = pywatchman.client()
# make sure watchman is watching the current directory's project root
client.query('watch-project', os.getcwd())
# ask for the files that changed since the last query made with this cursor
result = client.query('query', os.getcwd(), {
    "since": "n:pi_upload",
    "fields": ["name"]})
print(result["files"])
This snippet uses the since generator with a named cursor to discover the list of files that changed since the last query was issued using that same named cursor. Watchman will remember the associated clock value for you, so you don't need to complicate your script with state tracking. We're using the name pi_upload for the cursor; the name needs to be unique among the watchman clients that might use named cursors, so naming it after your tool is a good idea to avoid potential conflict.
This is probably the most direct way to extract the information you need without requiring that you make more invasive changes to your upload script.
Use pywatchman to initiate a long running subscription
This approach will transform your upload.py script so that it knows how to directly subscribe to watchman, so instead of using watchman-make you'd just directly run upload.py and it would keep running and performing the uploads. This is a bit more invasive and is a bit too much code to try and paste in here. If you're interested in this approach then I'd suggest that you take the code behind watchman-wait as a starting point. You can find it here:
https://github.com/facebook/watchman/blob/master/python/bin/watchman-wait
The key piece of this that you might want to modify is this line:
https://github.com/facebook/watchman/blob/master/python/bin/watchman-wait#L169
which is where it receives the list of files.
Why not triggers?
You could use triggers for this, but we're steering folks away from triggers because they are hard to manage. A trigger runs in the background and its output goes to the watchman log file. It can be difficult to tell whether it is running, or to stop it running.
Speaking of unix, what about watchman-wait?
We also have a command that emits the list of changed files as they change. You could potentially stream the output from watchman-wait into your upload.py; its interface is closer to the unix model and lets your script read the list of changed files on stdin. This would give it some similarities with the subscription approach, but without directly using the pywatchman client.
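For illustration, a minimal sketch of an upload.py that consumes such a stream (the upload() helper and the invocation shown in the comment are assumptions, not part of watchman):
# sketch: read changed file paths from stdin, one per line, e.g.
#   watchman-wait /var/spool/cups-pdf/ANONYMOUS --max-events 0 | python upload.py
import sys

def upload(path):
    # placeholder for the real "send to the database" logic
    print("uploading", path)

for line in sys.stdin:
    path = line.strip()
    if path:
        upload(path)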
I have been wrestling with this problem for the past week, and I fear my solution is not conventional according to the SaltStack documentation. We have about 20 minions running on various servers throughout the country and need to be able not only to monitor them, but also to issue commands and MySQL queries from time to time. This is very easy to do from the CLI via something like:
salt '[minion name here]' cmd.run "tail -4 /usr/local/bin/file.txt"
That effectively returns the last four lines of file.txt on the server running that minion. However, what we want to do next is have a script that periodically pulls this file down and caches it on the salt master. Since SaltStack is written in Python, it was a no-brainer to use the same language for our daemons/cron jobs. However, the problem we are running into is that we would very much like a way of interfacing with SaltStack without having to resort to spawning a process from within our Python script. Currently we have the following line of code that does almost the same thing:
subprocess.Popen(['salt', minion, 'cmd.run', 'tail -4 /usr/local/bin/file.txt', '--out', 'json'], stdout=subprocess.PIPE)
After reading the documentation it has become apparent that SaltStack provides a way to do this. The issue we're having is that we cannot figure out the code needed to actually run such a command without using the subprocess module. Furthermore, we also wish to execute remote MySQL queries on some of these minions, but we're so inexperienced (or so stupid) that we cannot decipher what the relevant code should be.
As an example, we would like to list all databases located on one of our nodes. We found the following two articles that explain how to do this, but we are confused as to what must actually be executed to get our final result.
https://docs.saltstack.com/en/2015.8/ref/clients/index.html
https://docs.saltstack.com/en/latest/ref/modules/all/salt.modules.mysql.html
From the MySQL salt module we would expect to be able to use salt.modules.mysql.db_list, but according to the documentation that function does not accept any parameters. How would we specify which minion we want to run the query on? I thought there would be some way of instantiating a new instance of salt.modules.mysql that held a reference to the minion in question, but no such functionality seems to exist. Can anyone help us with this issue?
If you execute modules from the CLI and your minion IDs start with something specific, like db-00 and db-01, you would do something like this:
salt 'db*' mysql.db_list
There are other approaches than relying on the minion ID; read more about targeting minions for further information.
From within Python you can do the same as described in the docs you linked. A slightly adjusted example:
import salt.client
local = salt.client.LocalClient()
local.cmd('db-*', 'mysql.db_list')
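For the file-tailing example from the question, the same LocalClient can replace the subprocess call (a sketch; 'db-00' is just a placeholder minion id):
import salt.client

local = salt.client.LocalClient()
# equivalent of: salt 'db-00' cmd.run "tail -4 /usr/local/bin/file.txt"
result = local.cmd('db-00', 'cmd.run', ['tail -4 /usr/local/bin/file.txt'])
print(result)  # a dict keyed by minion id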
I have a situation where I do some computation in Python, and based on the outcomes I have a list of target files that are candidates to be passed to a 2nd program.
For example, I have 50,000 files which contain ~2000 items each. I want to filter for certain items and call a command-line program to do some calculation on some of them.
This Program #2 can be used via the shell command line, but it also requires a lengthy set of arguments. For performance reasons I would have to run Program #2 on a cluster.
Right now, I am running Program #2 via
subprocess.call("...", shell=True)
But I'd like to run it via qsub in the future.
I don't have much experience with how exactly this could be done in a reasonably efficient manner.
Would it make sense to write temporary qsub job scripts and run them via subprocess directly from the Python script? Is there a better, maybe more pythonic, solution?
Any ideas and suggestions are very welcome!
It makes perfect sense, although I would go for another solution.
As far as I understand, you have programme #1 that determines which of your 50,000 files needs to be computed by programme #2.
Both programme #1 and #2 are written in Python. Excellent choice.
Incidentally, I have a Python module that might come in handy: https://gist.github.com/stefanedwards/8841307
If you are running the same qsub system as I am (no idea what ours is called), you cannot pass command-line arguments to the submitted scripts. Instead, any options are submitted via the -v option, which puts them into environment variables, e.g.:
[me@local ~] $ python isprime.py 1
1: True
[me@local ~] $ head -n 5 isprime.py
#!/usr/bin/python
### This is a python script ...
import os
os.chdir(os.environ.get('PBS_O_WORKDIR','.'))
[me@local ~] $ qsub -v isprime='1 2 3' isprime.py
123456.cluster.control.com
[me@local ~]
Here, isprime.py could handle command-line arguments using argparse. Then you just need to check whether the script is running as a submitted job, and if so retrieve said arguments from the environment variables (os.environ).
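A minimal sketch of what that could look like inside isprime.py; the primality test and argument names are just illustrations of the idea:
# sketch: accept input either as command-line arguments or via `qsub -v isprime='1 2 3'`
import argparse
import os

# behave like a PBS job: run from the submission directory if it is set
os.chdir(os.environ.get('PBS_O_WORKDIR', '.'))

parser = argparse.ArgumentParser()
parser.add_argument("numbers", nargs="*", type=int)
args = parser.parse_args()

# when submitted with `qsub -v isprime='1 2 3'` there are no argv arguments,
# so fall back to the environment variable that -v sets
numbers = args.numbers or [int(n) for n in os.environ.get("isprime", "").split()]

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

for n in numbers:
    print("{}: {}".format(n, is_prime(n)))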
When programme #2 is modified to run on the cluster, programme #1 can submit jobs by using subprocess.call(['qsub', '-v', 'options=...', 'programme2.py'], shell=False)
Another approach would be to queue all the files in a database (say, an SQLite database). Then you could have programme #1 check all non-processed entries in the database, determine the outcome (run, not run, run with special options).
You now have the opportunity to run programme #2 in parallel on the cluster, which simply checks for the database for files to analyse.
Edit: When Programme #2 is an executable
Instead of a python script, we use a bash script that takes environment variables and puts them on a command line for the programme:
#!/bin/bash
# wrapper script: turn the environment variables set by qsub -v into command-line flags
cd "${PBS_O_WORKDIR:-.}"   # run from the submission directory if PBS provides it
# put options into context/flags etc.
if [ -n "$option1" ]; then _opt1="--opt1 $option1"; fi
# we can even define our own defaults
_opt2='--no-verbose'
if [ -n "$opt2" ]; then _opt2="-o $opt2"; fi
/path/to/exe $_opt1 $_opt2
If you are going for the database solution, then have a Python script that checks the database for unprocessed files, marks a file as being processed (do these two in a single transaction), gets the options, calls the executable with subprocess, and, when done, marks the file as done and checks for a new file, etc.
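A minimal sketch of that loop, assuming a hypothetical files table with path and status columns; the executable path and its flag are placeholders:
# sketch of the SQLite-backed work queue; the table layout and flags are assumptions
import sqlite3
import subprocess

conn = sqlite3.connect("queue.db")
with conn:
    conn.execute("CREATE TABLE IF NOT EXISTS files "
                 "(path TEXT PRIMARY KEY, status TEXT DEFAULT 'pending')")

while True:
    with conn:  # claim the next pending file in a single transaction
        row = conn.execute(
            "SELECT path FROM files WHERE status = 'pending' LIMIT 1").fetchone()
        if row is None:
            break
        path = row[0]
        conn.execute("UPDATE files SET status = 'running' WHERE path = ?", (path,))
    # hypothetical invocation of programme #2
    subprocess.call(["/path/to/exe", "--input", path])
    with conn:
        conn.execute("UPDATE files SET status = 'done' WHERE path = ?", (path,))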
You have obviously built yourself a string cmd containing a command that you could enter in a shell to run the 2nd program. You are currently using subprocess.call(cmd, shell=True) to execute the 2nd program from a Python script (it is then executed in a process on the same machine as the calling script).
I understand that you are asking how to submit a job to a cluster so that this 2nd program is run on the cluster instead of on the calling machine. Well, this is pretty easy, and the method is independent of Python, so there is no 'pythonic' solution, just an obvious one :-) : replace your current cmd with a command that defers the heavy work to the cluster.
First of all, dig into the documentation of your cluster's qsub command (the underlying batch system might be SGE or LSF, or whatever; you need to get the corresponding docs) and try to find the shell command line that properly submits an example job of yours to the cluster. It might look as simple as qsub ...args... cmd, where cmd here is the content of the original cmd string. I assume that you now have the entire qsub command needed; let's call it qsubcmd (you have to come up with that on your own, we can't help there). Now all you need to do in your original Python script is call
subprocess.call(qsubcmd, shell=True)
instead of
subprocess.call(cmd, shell=True)
Note that qsub typically only works on very few machines, usually known as your cluster 'head node(s)'. This means that the Python script that submits these jobs has to run on one of those machines (if that is not possible, you need to add an ssh login procedure to the submission process, which we won't discuss here).
Please also note that, if you have the time, you should look into the implications of shell=True in your subprocess usage. If you can circumvent shell=True, this will be the more secure solution. This might, however, not be an issue in your environment.
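A sketch of what circumventing shell=True could look like; the -b y flag is Grid Engine syntax for submitting a binary command directly and is only an assumption about your batch system (PBS-style systems usually want a small job script instead), and the program path and flags are placeholders:
import subprocess

# assumes an SGE-style qsub where `-b y` submits a binary command directly;
# replace the path and flags with your real Program #2 invocation
qsubcmd = ["qsub", "-b", "y", "/path/to/program2", "--some-flag", "value"]
subprocess.call(qsubcmd)  # passing a list avoids shell=True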