I have around 20 scripts; each produces one output file, which is fed as input to the next script. I now want to give the user an option to restart the batch from any point in the chain.
My friend suggested using make or ant, with a target defined for each Python script. I would like to hear your suggestions.
Thank you
Make works like this:
target: dependencies
	commands
Based on your scripts, you might try this type of Makefile:
output20: output19
	script20   # reads output19 and produces the final output
output19: output18
	script19   # reads output18 and produces output19
.. etc ..
output2: output1
	script2    # reads output1 and produces output2
output1:
	script1    # produces output1
That way, each script won't run until the output from the previous step exists (note that the indented recipe lines in a Makefile must start with a tab). Running make output20 will walk down the entire chain and start at script1 if none of the outputs exist; or, if output15 already exists, it will start by running script16.
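If you would rather stay in pure Python instead of make, the same skip-what-already-exists idea can be sketched roughly as below; the script and output names are hypothetical placeholders for your real ones:

import os
import subprocess

# Hypothetical (script, output file) pairs, in pipeline order.
steps = [
    ("script1.py", "output1"),
    ("script2.py", "output2"),
    # ... up to ("script20.py", "output20") ...
]

for script, output in steps:
    if os.path.exists(output):
        # This step's output already exists from an earlier run, so skip it.
        print(f"skipping {script}: {output} already exists")
        continue
    # Run the step; check=True stops the chain if a script fails.
    subprocess.run(["python", script], check=True)

Unlike make, this only checks that an output exists, not whether it is newer than its input, but it gives the same "restart from wherever the chain stopped" behaviour.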
I have a Python project which takes a .csv file and some parameters as input and then gives back some results.
Every time I want to try a newly coded feature and get the results, I have to run the program with a different .csv name and different parameters. Changing these arguments by hand every time takes a long time, because I have a lot of different input files.
Is there a way to write a program in Python which can do this for me?
For example a program that does:
- run "project.py" n times
- first time with "aaa.csv" file as input and parm1=7, then put results in "a_res.csv"
- second time with "bbb.csv" file as input and parm1=4, then put results in "b_res.csv"
- third time with "ccc.csv" file as input and parm1=2, then put results in "c_res.csv"
- fourth time with "ddd.csv" file as input and parm1=6, then put results in "d_res.csv"
- ...
Thanks!
Yes: make a list of the configurations you want, then execute your function in a loop that iterates over those configurations.
configurations = [
    ["aaa.csv", 7, "a_res.csv"],
    ["bbb.csv", 4, "b_res.csv"],
    ["ccc.csv", 2, "c_res.csv"],
    ["ddd.csv", 6, "d_res.csv"]]

for c in configurations:
    # assuming your python function accepts 3 parameters:
    # input_file, param1, output_file
    your_python_function(c[0], c[1], c[2])
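If project.py can only be run as a standalone script rather than imported as a function, a similar loop can shell out to it instead. The command-line arguments below are hypothetical; they would need to match however project.py actually reads its input file, parameter, and output file:

import subprocess

configurations = [
    ("aaa.csv", 7, "a_res.csv"),
    ("bbb.csv", 4, "b_res.csv"),
    ("ccc.csv", 2, "c_res.csv"),
    ("ddd.csv", 6, "d_res.csv"),
]

for input_file, parm1, output_file in configurations:
    # Hypothetical argument order; adapt to project.py's real interface.
    subprocess.run(
        ["python", "project.py", input_file, str(parm1), output_file],
        check=True,
    )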
I have an application made of many individual scripts. The output of each one is the input of the next. Each script reads data at the beginning and saves the modified data as its last action. In short:
script1.py: reads mariadb data into a df -> does stuff -> saves raw data to mysql.sql (sqlite3 format)
script2.py: reads the sqlite3 file -> does stuff -> saves raw data to data.txt (tab-separated values)
program3.exe: reads data.txt -> does stuff -> writes another.txt (tab-separated values)
script4.py: reads another.txt -> does stuff -> creates data4.csv
script5.py: reads data4.csv -> does stuff -> inserts mariadb entries
What I am searching and asking for is: is there any design pattern (or other mechanism) for building a data provider for a situation like this? The "data provider" should be an abstraction layer which:
has different data source types (such as a mariadb connection, csv files, txt files, and others) predefined, with an easy way to extend that list;
reads data from the specified source and delivers it to the given script/program (e.g. by executing the script with a parameter);
validates whether the output of each application part (each script/program) is valid, or takes over the task of generating that data.
In general "Data provider" would run script1.py with some parameter (dataframe?) in some sandbox, take over data before it is saved and prepare data for script2.py proper execution. OR it just could run script1.py with some parameter, wait for execution, check if output is valid, convert (if necessary) that output to another format and run script2.py with well-prepared data.
I have access to the Python script sources (script1.py ... script5.py) and I can modify them. I cannot modify the source code of program3.exe, but it is always one part of the whole process. What is the best way (or just a way) to design such a layer?
Since you include a .exe file, I'll assume you are using Windows. You can write a batch file or a PowerShell script; on Linux, the equivalent would be a bash script.
If your sources and destinations are hard-coded, then the batch file is going to be something like:
script1.py
REM assume output file is named mysql.sql
script2.py
REM assume output file is data.txt and has tab separated values
program3.exe
REM assume output file is another.txt and has tab separated values
script4.py
REM creates data4.csv
script5.py
The REM is short for REMARK in a batch file and allows for commenting.
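If you also want the validation layer the question asks about, rather than a plain batch file, one possible sketch in Python is a small runner that executes each step and checks its expected output before the next step starts. The file names come from the question; the validation rule is only a placeholder:

import os
import subprocess

# (command, expected output file) pairs, in pipeline order.
steps = [
    (["python", "script1.py"], "mysql.sql"),
    (["python", "script2.py"], "data.txt"),
    (["program3.exe"], "another.txt"),
    (["python", "script4.py"], "data4.csv"),
    (["python", "script5.py"], None),  # last step writes to mariadb, no file
]

def output_is_valid(path):
    # Placeholder check: the file exists and is non-empty.
    # Replace with real per-format checks (schema, row count, etc.).
    return os.path.exists(path) and os.path.getsize(path) > 0

for command, expected_output in steps:
    subprocess.run(command, check=True)  # stop the chain if a step fails
    if expected_output and not output_is_valid(expected_output):
        raise RuntimeError(f"{' '.join(command)} did not produce a valid {expected_output}")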
Basically I need to run three files simultaneously and independently. Each file starts by asking for user input and then enters an infinite while loop.
I have found some questions similar to mine but the solutions do not quite fit my needs. I am still a beginner.
I have already tried:
python device1.py &
python device2.py &
python device3.py
I also tried doing it all in one file, but the file is rather large and complicated, and I have not succeeded so far.
# some code that creates a csv

# input
device = input("input which device you want to connect to")

def function():
    # write to csv file from data
    while True:
        # get live data from device
        # csv function
        ...

function()
I expect to enter 3 inputs for my 3 scripts, have them run their loops, then end the processes and be left with 3 csv files.
Have you tried setting the input in your command?
echo inputForDevice1 | python device1.py &
echo inputForDevice2 | python device2.py &
echo inputForDevice3 | python device3.py &
Also remember to detach from the last python call (python device3.py &), otherwise you'll be stuck in the infinite loop.
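If you would rather start everything from a single Python script instead of the shell, a minimal sketch using the standard subprocess module could look like this; the device script names and inputs are the ones from the question:

import subprocess

# Each script is paired with the answer it expects from input().
jobs = [
    ("device1.py", "inputForDevice1"),
    ("device2.py", "inputForDevice2"),
    ("device3.py", "inputForDevice3"),
]

processes = []
for script, user_input in jobs:
    # Launch the script and feed its input() prompt through stdin.
    p = subprocess.Popen(["python", script], stdin=subprocess.PIPE, text=True)
    p.stdin.write(user_input + "\n")
    p.stdin.flush()
    processes.append(p)

# All three scripts now run in parallel; stop them with Ctrl+C or by killing the processes.
for p in processes:
    p.wait()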
I'm trying to use cProfile from: https://docs.python.org/2/library/profile.html#module-cProfile
I can get the data to print, but I want to be able to manipulate and sort it so that I get just the info I want. To get the data to print I use:
b = cProfile.run("function_name")
But after that runs and prints, b is None and I cannot figure out where the printed data went so that I can manipulate it. Of course, I can see the data, but in order to analyze it I need some sort of output in my IDE. I've tried pstats, but I get error messages. It seems that to use pstats I have to save some sort of file, but I cannot figure out how to run the program and save its profile to a file.
UPDATE:
I almost have a solution
cProfile.run('re.compile("foo|bar")', 'restats')
There is a second argument that saves the stats to a file, here named 'restats'. Now I should be able to open and read it.
SOLVED:
cProfile.run("get_result()", 'data_stats')
p = pstats.Stats('data_stats')
p.strip_dirs().sort_stats(-1).print_stats()
p.sort_stats('name')
cProfile.run("get_result()", 'data_stats')
p = pstats.Stats('data_stats')
p.strip_dirs().sort_stats(-1).print_stats()
p.sort_stats('name')
In addition to the first argument, which runs the code, the second argument saves the profiling output to a file. The next line then opens that file with pstats. Once it is loaded, you should be able to see the values of p in your IDE and use normal Python operations to manipulate them.
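For what it's worth, a common variation of the same pattern is to sort by cumulative time and limit the printout to the most expensive calls; get_result() below is just the placeholder from the question:

import cProfile
import pstats

cProfile.run("get_result()", "data_stats")  # profile the call and save the stats to a file

p = pstats.Stats("data_stats")
# Sort by cumulative time and print only the 10 most expensive entries.
p.strip_dirs().sort_stats("cumulative").print_stats(10)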
I have 100GB of JSON files whose each row looks like this:
{"field1":100, "field2":200, "field3":[{"in1":20, "in2":"abc"},{"in1":30, "in2":"xyz"}]}
(It's actually a lot more complicated, but this will do as a small demo.)
I want to process it to something whose each row looks like this:
{"field1":100, "field2":200, "abc":20, "xyz":30}
Being extremely new to Hadoop, I just want to know if I'm on the right path:
Referring to this:
http://www.glennklockwood.com/di/hadoop-streaming.php
For conventional applications I'd create a mapper and a reducer in Python and execute them using something like:
hadoop \
jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
-mapper "python $PWD/mapper.py" \
-reducer "python $PWD/reducer.py" \
-input "wordcount/mobydick.txt" \
-output "wordcount/output"
Now let me know if I'm on the right track:
Since I just need to parse a lot of files into another form, I suppose I don't need any reduce step. I can simply write a mapper which:
Takes input from stdin
Reads stdin line by line
Transforms each line according to my specifications
Outputs to stdout
Then I can run hadoop with simply a mapper and 0 reducers.
Does this approach seem correct? Will I actually be using the cluster properly, or would this be as bad as running the Python script on a single host?
You are correct: in this case you don't need a reducer, since the output of your mapper is directly what you want, so you should set the number of reducers to 0. When you tell Hadoop the input path where your JSON data is, it will automatically feed each mapper a chunk of the JSON lines; your mapper processes them and emits the results (in streaming, simply by writing to stdout), and Hadoop stores them in the output path. The approach is correct and this task is 100% parallelizable, so if you have more than one machine in your cluster and your configuration is correct, it should take full advantage of the cluster and run much faster than on a single host.
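For illustration, a minimal map-only script for the transformation described above might look like this; it assumes every input line is a standalone JSON object shaped like the question's example:

#!/usr/bin/env python
# mapper.py - sketch of the map-only transformation (no reducer needed)
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    out = {"field1": record["field1"], "field2": record["field2"]}
    # Flatten field3: each {"in1": value, "in2": name} entry becomes name: value.
    for item in record.get("field3", []):
        out[item["in2"]] = item["in1"]
    sys.stdout.write(json.dumps(out) + "\n")

With Hadoop streaming, the reduce phase is typically disabled by passing -numReduceTasks 0 (or -D mapreduce.job.reduces=0) and omitting the -reducer option.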