Python -- "Batch Processing" of multiple existing scripts - python

I have written three simple scripts (which I will not post here, as they are part of my dissertation research) that are all in working order.
What I would like to do now is write a "batch-processing" script for them. I have many (potentially tens of thousands of) data files on which I want these scripts to act.
My questions about this process are as follows:
What is the most efficient way to go about this sort of thing?
I am relatively new to programming. Is there a simple way to do this, or is this a very complex endeavor?
Before anyone downvotes this question as "unresearched" or whatever negative connotation comes to mind, PLEASE just offer help. I have spent days reading documentation and following leads from Google searches, and it would be most appreciated if a human being could offer some input.

If you just need to have the scripts run, probably a shell script would be the easiest thing.
If you want to stay in Python, the best way would be to give each script a main() (or similarly named) function, make each script importable, and have the batch script import each subscript and run its main.
If staying in Python:
- your three scripts must have the .py ending to be importable
- they should either be in Python's search path, or the batch control script can set the path
- they should each have a main function (or whatever name you choose) that will activate that script
For example:
batch_script
import sys
sys.path.insert(0, '/location/of/subscripts')
import first_script
import second_script
import third_script
first_script.main('/location/of/files')
second_script.main('/location/of/files')
third_script.main('/location/of/files')
example sub_script
import os
import sys
import some_other_stuff

SOMETHING_IMPORTANT = 'a value'

def do_frobber(a_file):
    ...

def main(path_to_files):
    all_files = os.listdir(path_to_files)
    for file in all_files:
        do_frobber(os.path.join(path_to_files, file))

if __name__ == '__main__':
    main(sys.argv[1])
This way, your subscript can be run on its own, or called from the main script.
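If the three steps should run once per data file rather than once per directory, the batch script can do the looping itself. A minimal sketch, assuming each script's main() accepts a single file path (your scripts may be organised differently):
import os
import sys

sys.path.insert(0, '/location/of/subscripts')
import first_script
import second_script
import third_script

data_dir = '/location/of/files'
for name in sorted(os.listdir(data_dir)):
    path = os.path.join(data_dir, name)
    first_script.main(path)   # assumption: each main() takes one file path
    second_script.main(path)
    third_script.main(path)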

You can write a batch script in Python using os.walk() to generate a list of the files and then process them one by one with your existing Python programs.
import os, re

for root, dirs, files in os.walk('/path/to/files'):
    for f in files:
        if re.match(r'.*\.dat$', f):
            run_existing_script1(os.path.join(root, f))  # however you invoke your first script
            run_existing_script2(os.path.join(root, f))
If there are other files in the directory you might want to add a regex to ensure you only process the files you're interested in.
EDIT - added regular expression to ensure only files ending ".dat" are processed.
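If you would rather leave your three scripts completely untouched and keep running them as stand-alone programs, the same walk can hand each matching file to them as separate processes. A sketch, assuming each script takes the data file path as its first command-line argument (the script names here are placeholders):
import os
import re
import subprocess
import sys

for root, dirs, files in os.walk('/path/to/files'):
    for f in files:
        if re.match(r'.*\.dat$', f):
            path = os.path.join(root, f)
            # run each existing script on this file, one after the other
            subprocess.check_call([sys.executable, 'first_script.py', path])
            subprocess.check_call([sys.executable, 'second_script.py', path])
This launches a fresh interpreter per script per file, which is slower than importing but requires no changes to the existing code.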

Related

Where are the Mindstorms module files stored

I would like to find out where the source code for the mindstorms module is for the Mindstorms Robot Inventor.
At the start of each file there is a header of
from Mindstorms import ...
Etc..
That is what I want to find.
I have tried multiple Python methods to find the file path, but they all return ./projects/8472.py
Thanks,
henos
You've likely been using __file__ and similar, correct? Yes, that will give you the currently running file, but since you can't edit the code for the actual Mindstorms code, it's not helpful. Instead, you want to inspect the directory itself, like you would if your regular Python code were accessing a data file elsewhere.
Most of the internal systems are .mpy files (see http://docs.micropython.org/en/v1.12/reference/mpyfiles.html), so directly reading the code from the device is less than optimal. Additionally, this means that many "standard library" packages are missing or incomplete; you can't import pathlib, and while you can import os, you can't use os.walk(). Those restrictions make any sort of directory traversal a little more frustrating, but not impossible.
For instance, the file runtime/extensions.music.mpy looks like the following (note: not copy-pasted since the application doesn't let you):
M☐☐☐ ☐☐☐
☐6runtime/extensions/music.py ☐☐"AbstractExtensions*☐☐$abstract_extension☐☐☐☐YT2 ☐☐MusicExtension☐☐4☐☐☐Qc ☐|
☐☐ ☐ ☐☐ ☐☐☐☐ ☐2 ☐☐play_drum2☐☐☐play_noteQc ☐d:
☐☐ *☐☐_call_sync#☐,☐+/-☐☐drumb6☐YQc☐ ☐☐drum_nos☐musicExtension.playDrum☐☐E☐
☐ *☐ #☐,☐+/-☐☐instrumentb*☐☐noteb*☐☐durationb6☐YQc☐ ☐☐☐☐s☐musicExtension.playNote
Sure, you can kind of see what's going on, but it isn't that helpful.
You'll want to use combinations of os.listdir and print here, since the MicroPython implementation doesn't give access to os.walk. Example code to get you started:
import os
import sys
print(os.uname()) # note: doesn't reflect actual OS version, see https://stackoverflow.com/questions/64449448/how-to-import-from-custom-python-modules-on-new-lego-mindstorms-robot-inventor#comment115866177_64508469
print(os.listdir(".")) # ['util', 'projects', 'runtime', ...]
print(os.listdir("runtime/extenstions")) # ['__init__.mpy', 'abstract_extension.mpy', ...]
with open("runtime/extensions/music/mpy", "r") as f:
for line in f:
print(line)
sys.exit()
Again, the lack of copy-paste from the console is rough, so even when you do get to the "show contents of the file on screen" part, it's not that helpful.
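If you want to explore more than one level deep, you can approximate os.walk yourself with recursion. A rough sketch, assuming that calling os.listdir on a plain file raises OSError (which is how you can tell files and directories apart without os.path):
import os

def walk(path):
    # minimal os.walk substitute for MicroPython: recurse with os.listdir
    for name in os.listdir(path):
        full = path + "/" + name
        try:
            os.listdir(full)   # succeeds -> it's a directory, recurse
            walk(full)
        except OSError:
            print(full)        # listing failed -> treat it as a file

walk(".")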
One cool thing to note though is that if you load up a scratch program, you can read the code in regular .py. It's about as intelligible as you'd expect, since it uses very low-level calls and abstractions, not the hub.light_matrix.show_image("CLOCK6") that you'd normally write.

How should I organize my scripts which are mostly the same?

So I'm new to Python and I need some help on how to improve my life. I learned Python for work and need to cut my workload a little. I have three different scripts, and I run around 5 copies of each at the same time, all the time; they read XML data, add in information, and so on. However, when I make a change to a script I have to change the 5 other copies too, which is annoying after a while. I can't just run the same script 5 times, because each copy needs some different parameters, which I store as variables at the start of every script (different file paths, ...).
But I'm sure there's a much better way out there?
A very small example:
script1.py
xml.open('c:\file1.xls')
while True:
    do script...
script2.py
xml.open('c:\file2.xls')
while True:
    do exactly the same script...
etc...
You'll want to learn about Python functions and modules.
A function is the solution to your problem: it bundles some functionality and allows you to call it to run it, with only minor differences passed as a parameter:
def do_something_with_my_sheet(name):
    xml.open(name)
    while True:
        do script...
Elsewhere in your script, you can just call the function:
do_something_with_my_sheet('c:\file1.xls')
Now, if you want to use the same function from multiple other scripts, you can put the function in a module and import it from both scripts. For example:
This is my_module.py:
def do_something_with_my_sheet(name):
    xml.open(name)
    while True:
        do script...
This is script1.py:
import my_module
my_module.do_something_with_my_sheet('c:\file1.xls')
And this could be script2.py (showing a different style of import):
from my_module import do_something_with_my_sheet
do_something_with_my_sheet('c:\file2.xls')
Note that the examples above assume you have everything sitting in a single folder, all the scripts in one place. You can separate stuff for easier reuse by putting your module in a package, but that's beyond the scope of this answer - look into it if you're curious.
You only need one script, that takes the name of the file as an argument:
import sys

xml.open(sys.argv[1])
while True:
    do script...
Then run the script. Other variables can be passed as additional arguments, accessed via sys.argv[2], etc.
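For example, a hypothetical second parameter (note that everything in sys.argv arrives as a string, so convert it yourself):
import sys

file_name = sys.argv[1]
threshold = int(sys.argv[2])   # made-up extra parameter, passed after the file name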
If there are many such parameters, it may be easier to save them in a configuration file, then pass the name of the configuration file as the single argument. Your script would then parse the file for all the information it needs.
For example, you might have a JSON file with contents like
{
    "filename": "c:\\file1.xls",
    "some_param": 6,
    "some_other_param": true
}
and your script would look like
import json
import sys

with open(sys.argv[1]) as f:
    config = json.load(f)

xml.open(config['filename'])
while True:
    do stuff using config['some_param'] and config['some_other_param']
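You would then keep one small JSON file per job and run the same script with a different configuration file each time, for example python my_script.py job1.json (the file names here are just placeholders).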

Making Python guess a file Name

I have the following function:
unpack_binaryfunction('third-party/jdk-6u29-linux-i586.bin' , ('/home/user/%s/third-party' % installdir), 'jdk1.6.0_29')
This uses os.sys to execute a Java deployment. The line, combined with the function (which is unimportant; it just calls some Linux commands), works perfectly.
However, this only works if that specific version of the JDK is in the 'third-party' folder.
Therefore I need code that will look at the files in the 'third-party' folder, find the one that starts with 'jdk', and fill out the rest of the filename itself.
I am absolutely stuck. Are there any functions or libraries that can help with file searching etc.?
To clarify: I need the code not to hard-code the entire jdk-6u29-linux-i586.bin name, but to use whatever jdk-xxxx... file happens to be in the third-party folder.
This can easily be done using the glob module, and then a bit of string parsing to extract the version.
import glob
import os.path
for path in glob.glob('third-party/jdk-*'):
    parent, name = os.path.split(path)  # "third-party", "jdk-6u29-linux-i586.bin"
    version, update = name.split('-')[1].split('u')  # ("6", "29")
    unpack_binaryfunction(path, ('/home/user/%s/third-party' % installdir), 'jdk1.{}.0_{}'.format(version, update))
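If more than one jdk-* archive could end up in that folder, you may want to pick a single one deliberately instead of unpacking every match. A small sketch; sorting lexicographically is only a rough stand-in for "newest version":
import glob
import os.path

candidates = sorted(glob.glob('third-party/jdk-*'))
if candidates:
    path = candidates[-1]            # crude heuristic: last in sorted order
    name = os.path.basename(path)    # e.g. "jdk-6u29-linux-i586.bin"
    version, update = name.split('-')[1].split('u')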

Python: Using multiprocessing when the function called is dependent on a file

I have a very old Fortran file, which is too complex for me to convert into Python. So I have to compile the file and run it via Python.
For the fortran file to work it requires 3 input values on 3 lines from the mobcal.run file. They are as follows:
line 1 - Name of file to run
line 2 - Name of output file
line 3 - random seed number
I change the input values per worker in the run() function.
When I run my script (see below), only 2 output files were created, even though all 32 processes were running, which I found out via the top command.
I'm guessing the issue is that there was not enough time to change the mobcal.run file for each worker.
The only solution I have come up with so far is to put a time.sleep(random.randint(1,100)) at the beginning of the run() function. But I don't find this solution very elegant, and it may not always work, as two workers may get the same random.randint. Is there a more Pythonic way to solve this?
def run(mfj_file):
    import shutil
    import random
    import subprocess
    #shutil.copy('./mfj_files/%s' % mfj_file, './')
    print 'Calculating cross sections for: %s' % mfj_file[:-4]
    with open('mobcal.run', 'w') as outf:
        outf.write(mfj_file+'\n'+mfj_file[:-4]+'.out\n'+str(random.randint(5000000,6000000)))
    ccs = subprocess.Popen(['./a.out'])
    ccs.wait()
    shutil.move('./'+mfj_file[:-4]+'.out', './results/%s.out' % mfj_file[:-4])

def mobcal_multi_cpu():
    from multiprocessing import Pool
    import os
    import shutil
    mfj_list = os.listdir('./mfj_files/')
    for f in mfj_list:
        shutil.copy('./mfj_files/'+f, './')
    if __name__ == '__main__':
        pool = Pool(processes=32)
        pool.map(run,mfj_list)

mobcal_multi_cpu()
I assume your a.out looks in the current working directory for its mobcal.run. If you run each instance in its own directory, then each process can have its own mobcal.run without clobbering the others. This isn't necessarily the most Pythonic way, but it's the most Unixy.
import tempfile
import os
def run(mjf_file):
    dir = tempfile.mkdtemp(dir=".")
    os.chdir(dir)
    # rest of function here
    # create mobcal.run in current directory
    # while changing references to other files from "./" to "../"
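A closely related variant, under the same assumption that a.out reads mobcal.run from its working directory: write mobcal.run inside the per-worker directory and point subprocess at it with the cwd= argument, so the worker never has to chdir at all. A sketch (paths follow the layout in the question):
import os
import random
import shutil
import subprocess
import tempfile

def run(mfj_file):
    workdir = tempfile.mkdtemp(dir=".")    # private directory for this worker
    shutil.copy(mfj_file, workdir)         # the data file copied earlier
    with open(os.path.join(workdir, 'mobcal.run'), 'w') as outf:
        outf.write(mfj_file + '\n' + mfj_file[:-4] + '.out\n'
                   + str(random.randint(5000000, 6000000)))
    subprocess.call([os.path.abspath('./a.out')], cwd=workdir)
    shutil.move(os.path.join(workdir, mfj_file[:-4] + '.out'),
                './results/%s.out' % mfj_file[:-4])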
Create several directories, with one mobcal.run each, and run your Fortran program in them instead.
If you need a sleep() in multiprocessing you are doing it wrong.

How to concatenate multiple Python source files into a single file?

(Assume that: application start-up time is absolutely critical; my application is started a lot; my application runs in an environment in which importing is slower than usual; many files need to be imported; and compilation to .pyc files is not available.)
I would like to concatenate all the Python source files that define a collection of modules into a single new Python source file.
I would like the result of importing the new file to be as if I imported one of the original files (which would then import some more of the original files, and so on).
Is this possible?
Here is a rough, manual simulation of what a tool might produce when fed the source files for modules 'bar' and 'baz'. You would run such a tool prior to deploying the code.
__file__ = 'foo.py'

def _module(_name):
    import sys
    import types
    mod = types.ModuleType(_name)
    mod.__file__ = __file__
    sys.modules[_name] = mod
    return mod

def _bar_module():
    def hello():
        print 'Hello World! BAR'
    mod = _module('foo.bar')
    mod.hello = hello
    return mod
bar = _bar_module()
del _bar_module

def _baz_module():
    def hello():
        print 'Hello World! BAZ'
    mod = _module('foo.bar.baz')
    mod.hello = hello
    return mod
baz = _baz_module()
del _baz_module
And now you can:
from foo.bar import hello
hello()
This code doesn't take account of things like import statements and dependencies. Is there any existing code that will assemble source files using this, or some other technique?
This is a very similar idea to the tools used to assemble and optimise JavaScript files before sending them to the browser, where the latency of multiple HTTP requests hurts performance. In this Python case, it's the latency of importing hundreds of Python source files at startup which hurts.
If this is on Google App Engine, as the tags indicate, make sure you are using this idiom:
def main():
    #do stuff

if __name__ == '__main__':
    main()
This is because GAE doesn't restart your app for every request unless the .py has changed; it just runs main() again.
This trick lets you write CGI-style apps without the startup performance hit:
App Caching
If a handler script provides a main() routine, the runtime environment also caches the script. Otherwise, the handler script is loaded for every request.
I think that due to the precompilation of Python files and some system caching, the speed up that you'll eventually get won't be measurable.
Doing this is unlikely to yield any performance benefits. You're still importing the same amount of Python code, just in fewer modules - and you're sacrificing all modularity for it.
A better approach would be to modify your code and/or libraries to only import things when needed, so that a minimum of required code is loaded for each request.
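For example, a deferred import keeps a dependency out of the start-up path entirely (json stands in here for whichever module is slow to load):
def handle_request(request):
    # imported only the first time this handler actually runs;
    # later calls find it in sys.modules, so the cost is paid once
    import json
    return json.dumps(request)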
Without getting into whether this technique would actually speed things up in your environment, let's say you are right; here is what I would have done.
I would make a list of all my modules, e.g.
my_files = ['foo', 'bar', 'baz']
I would then use os.path utilities to read all the lines in all the files under the source directory and write them all into a new file, filtering out all import foo|bar|baz lines, since all the code is now within a single file.
Of course, at the end I would add the main() from __init__.py (if there is one) at the tail of the file.
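A minimal sketch of that approach; the source directory name and the output file name are assumptions, and from foo import ... lines would need similar filtering:
import os

my_files = ['foo', 'bar', 'baz']
skip = tuple('import %s' % m for m in my_files)

with open('combined.py', 'w') as out:
    for name in my_files:
        with open(os.path.join('src', name + '.py')) as src:  # 'src' is a made-up source directory
            for line in src:
                if line.strip().startswith(skip):
                    continue  # drop imports of the modules being merged
                out.write(line)
        out.write('\n')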
