Problems with running a python script over many files

Problems with running a python script over many files - python

I am on a Linux (Ubuntu 11.10) machine; bourne again shell.
I have to process a directory full of files with a python script. My colleague wrote the python script and I have successfully used it before on one file at a time. It takes two arguments: a path to the file to be processed enclosed in quotes and a secondary argument called -min which requires an integer. Also, the script writes to standard out.
From my experience of shell scripting and following others on this forum, I used the following method to iterate over the directory of files:
for f in path/to/data_directory/*; do
path/to/pythonscript.py $f -min 1 > path/to/out_directory/$f;
done
I get the desired file names in the out_directory. The content of each is something only the python script can write. That is, the above for loop successfully passes the files to the script. However, the nature of the content of each file is completely wrong (as in the computation the script does was wrong). When I run the python script on one of the files in the data_directory, the output file has the correct content (the computation performed by the script is correct).
The thing that makes it more complex is that the same shell method (the for loop) works perfectly in the Mac OS X my colleague has.
Where is the issue? Am I missing something very fundamental about Linux shells? Maybe it's a syntax error?
Any help will be appreciated.
Update: I just ran the for loop again but instead of pointing it to the data_directory of files, I pointed it to a file within the data_directory. I had the same problem - the python script did not compute the correct result.

The only problem I see is that filenames may contain white-space - so you should quote filenames:
for f in path/to/data_directory/*; do
path/to/pythonscript.py "$f" -min 1 > "path/to/out_directory/$f"
done

Well I don't know if this helps but.
path/to/pythonscript.py $f -min > path/to/out_director/$f
Substitutes out to
path/to/pythongscript.py path/to/data_directory/myfile -min 1 > path/out_directory/path/to/data_directory/myfile
script should be
cd path/to/data_directory
for f in *; do
path/to/pythonscript.py $f -min 1 > path/to/out_directory/$f
done
What version of bash are you running?
what do you get if you run this script?
cd path/to/data_directory
for f in *; do
echo $f > /tmp/$f
done
of course that should give you a bunch of files containing their own file names.

Related

Linux bash executables behave differently depending on whether I am activating from the command line or with os.system('')

I've been attempting to execute a certain CLI from within python and store the output for later use within the same script. I suspect this question has a simple answer, but if one wishes to go through the entire pipeline, here is the tool in question.
wget http://rna.urmc.rochester.edu/Releases/current/RNAstructureForLinux.tgz
tar xvf as usual, go inside the resulting directory and execute 'make all', the executables I use in the bash script are within the 'exe' directory.
I attempted to execute the commands with os.system(), but with little luck. The CLI I am using; however, seems to be running. The function which I have set to execute the os.system() commands contains the following block.
txt = open('home/spectre/tools/RNAstructure/exe/RNAStructure_nucleic_acid.txt',"w")
txt.write('AAGGCTGTCCAGGCGCAATGTGGTGGCTGCTTCTCTGGGGAGTCCTCCAGGCTTGCCCAACCCGGGGCTCCGTCCTCTTGGCCCAAGAGCTACCCCAGCAGCTGACATCCCCCGGGTACCCAGAGCCGTATGGCAAAGGCCAAGAGAGCAGCACGGACATCAAGGCTCCAGAGGGCTTTGCTGTGAGGCTCGTCTTCCAGGACTTCGACCTGGAGCCGTCCCAGGACTGTGCAGGGGACTCTGTCACAGTGAGCTGGGGATGGGGGGGGTCCCGCCAGGACTGTGGCCAGGGAGATTCCCGGGGTTGTGGGAAGTGGCGGTGCCCTGAATCCCCCATCTGGAGGAGGGATGAAT')
os.system(' cd ~/tools/RNAstructure/exe ; ./python_RNA_structure.sh')
nucleotides, structure, MFE =
RNAStructure_from_file('home/spectre/tools/RNAstructure/exe/RNAStructure_bracket_output.txt')
The executable *.sh file contains this.
#!/bin/bash
cd ~/tools/RNAstructure/exe
./Fold RNAStructure_nucleic_acid.txt RNAStructure_nucleic_acid_output.txt
./ct2dot RNAStructure_nucleic_acid_output.txt -1 RNAStructure_bracket_output.txt
If I execute the bash script from the command line the output should look a little like this
Initializing nucleic acids...
Using auto-detected DATAPATH: "../data_tables" (set DATAPATH to avoid this warning).
done.
98% \[==================================================\] \\ done.
Writing output ct file...done.
Single strand folding complete.
Converting CT file...
Using auto-detected DATAPATH: "../data_tables" (set DATAPATH to avoid this warning).
CT file conversion complete.
If I execute the bash script form the python file.
Initializing nucleic acids...
Using auto-detected DATAPATH: "../data_tables" (set DATAPATH to avoid this warning).
Error reading sequence. The file did not contain any nucleotides.
Single strand folding complete with errors.
Converting CT file...
Using auto-detected DATAPATH: "../data_tables" (set DATAPATH to avoid this warning).
CT file conversion complete.
It looks an awful lot like my CLI can find the files it needs inside the terminal, but not outside of it. I haven't experimented with any parameters like trying absolute paths, but I understood by using os.system() I could execute a bash script, but it is not clear to me why this is changing how that script behaves.
What I've done to resolve the problem:
reopening the file seems to resolve the problem, but I am still working out why.

The problem seems to resolve when I reopen the file within the python script like so:
txt = open('home/spectre/tools/RNAstructure/exe/RNAStructure_nucleic_acid.txt',"w")
txt.write('AAGGCTGTCCAGGCGCAATGTGGTGGCTGCTTCTCTGGGGAGTCCTCCAGGCTTGCCCAACCCGGGGCTCCGTCCTCTTGGCCCAAGAGCTACCCCAGCAGCTGACATCCCCCGGGTACCCAGAGCCGTATGGCAAAGGCCAAGAGAGCAGCACGGACATCAAGGCTCCAGAGGGCTTTGCTGTGAGGCTCGTCTTCCAGGACTTCGACCTGGAGCCGTCCCAGGACTGTGCAGGGGACTCTGTCACAGTGAGCTGGGGATGGGGGGGGTCCCGCCAGGACTGTGGCCAGGGAGATTCCCGGGGTTGTGGGAAGTGGCGGTGCCCTGAATCCCCCATCTGGAGGAGGGATGAAT')
txt = open('home/spectre/tools/RNAstructure/exe/RNAStructure_nucleic_acid.txt')
os.system(' cd ~/tools/RNAstructure/exe ; ./python_RNA_structure.sh')
nucleotides, structure, MFE =
RNAStructure_from_file('home/spectre/tools/RNAstructure/exe/RNAStructure_bracket_output.txt')
I am not sure why this resolves the problem, I found this solution serendipitously. I'll update the answer when I figure out why, unless someone wants to beat me to it. It's magic to me for now.
It seems that after opening the file, RNAStructure_nucleic_acid.txt, and assigning it to the txt variable for writing, I need to reopen it after writing is complete. Otherwise the file is blank when I try printing it's output within the program, but after the program finishes executing, the file contains the correct text.

Execute Knime with multiple arguments in Python (subprocess.run)

Hello all,
i'm looking for a way to execute a KNIME workflow in Python in batch mode (without opening the GUI of KNIME, https://www.knime.com/faq#q12)
After hours of trying I am asking you whether you can help me in this case:
When I run the python file, it opens the Knime exe, after some seconds the knime GUI is also opened. Unfortunately, the exe is not excuting the workflow (for testing the workflow should read an csv file and save it in another file destination)
This is actual code in python 3.7:
import subprocess
subprocess.run(["C:/Program Files/KNIME/knime.exe","-consoleLog","-nosplash","-noexit","-nosave","-reset","-application org.knime.product.KNIME_BATCH_APPLICATION","-workflowDir= C:/Users/jssch/knime-workspace/testexecute"]
When i paste the following code in command line the code is working and is executed correctly (it just hands over the arguments and does not open the knime GUI):
C:\Program Files\KNIME\knime.exe" -consoleLog -noexit -nosplash -nosave -reset -application org.knime.product.KNIME_BATCH_APPLICATION -workflowDir="C:\Users\jssch\knime-workspace\testexecute"
Thanks for your help in advance!

I think you made a mistake with the -application part, they should be in different Strings. Also the -workflowDir= C:/... seems to have an extra space too.
The problematic part:
"-application org.knime.product.KNIME_BATCH_APPLICATION"
it should be:
"-application", "org.knime.product.KNIME_BATCH_APPLICATION"
Probably you do not want the -noexit argument either.
All together:
import subprocess
subprocess.run(["C:/Program Files/KNIME/knime.exe", "-consoleLog", "-nosplash", "-nosave", "-reset", "-application", "org.knime.product.KNIME_BATCH_APPLICATION", "-workflowDir=C:/Users/jssch/knime-workspace/testexecute"]
(I usually prefer the paths without spaces, strange characters, I would use a KNIME installation from a different path, though this is fine too.)

Python, searching own codes wrote

There are scripts wrote and all of them may of different topics. Some about documents handling, some about text extraction, some about automation.
Sometimes I forgot a usage, for example how to create a .xls file, so I want to search if in the scripts there is a line about how to do it.
What I am doing is to convert all the .py files into .txt, and combine all txt files together. Then I use Word to open this aggregated .txt file and find.
What’s the better way to search specific lines in own written codes?
**converting .py to .txt:
folder = "C:\\Python27\\"
for a in os.listdir(folder):
root, ext = os.path.splitext(a)
if ext == ".py":
os.rename(folder + a, folder + root + ".txt")
**putting all .txt together:
base_folder = "C:\\TXTs\\"
all_files = []
for each in os.listdir(base_folder):
if each.endswith('.txt'):
kk = os.path.join(base_folder, each)
all_files.append(kk)
with open(base_folder + " final.txt", 'w') as outfile:
for fname in all_files:
with open(fname) as infile:
for line in infile:
outfile.write(line)

Keep all of your code in a single directory tree, e.g. code.
If your operating system doesn't have a decent search tool like grep, install "the silver searcher". (You will also need a decent terminal emulator.)
For example (I'm using FreeBSD here), I tend to keep all my source code under src/. If I want to know in which scripts I use xls files, I type:
ag xls src/
Which retuns:
src/progs/uren/project-uren.py
24: print("Usage: {} urenbriefjesNNNN.xlsm project".format(binary),
This tells me to look at line 24 of the file src/progs/uren/project-uren.py.
If I e.g. search for Counter:
ag Counter src/
I get multiple hits:
src/scripts/public/csv2tbl.py
13:from collections import Counter
45: letters, sep = Counter(), Counter()
src/scripts/public/dvd2webm.py
15:from collections import Counter
94: rv = Counter(re.findall('crop=(\d+:\d+:\d+:\d+)', proc.stderr))
src/scripts/public/scripts-tests.py
14:from collections import Counter
26: v = Counter(rndcaps(100000)).values()

You can install rStudio, an open source IDE for the r language. In the Edit menu there is a Find in File... feature you can use just like find and replace in a word document. It will go through files in the directory you point it to...I have not yet had problems searching scripts as they are, untransformed to txt. It will search for terms or regex expressions....it is pretty useful!
As is R!

cat *.py | grep xls if you're on Linux.
Otherwise it may be helpful to keep some sort of README file with your python scripts. I, personally, prefer Markdown:
## Scripts
They do stuff
### Script A
Does stuff A, call `script_a.py -h` for more info
### Script B
Does stuff B, call `script_b.py -h` for more info
It compiles to this:
Scripts
They do stuff
Script A
Does stuff A, call script_a.py -h for more info
Script B
Does stuff B, call script_b.py -h for more info
It takes basically no time to write and Markdown can be easily used on sites such as SO, Github, Reddit and others. This very answer, in fact, is written in Markdown. But if you can't be bothered with Markdown, a simple README.txt is still much better than nothing.

The technical term for what you're trying to do is a "full text file search". Googling this together with your operating system name will give you many methods. Here is one for Windows: https://www.howtogeek.com/99406/how-to-search-for-text-inside-of-any-file-using-windows-search/.
If you're on MacOS I recommend looking into BASH command line syntax to do a bit more complex automation tasks (although what you need is also perfectly covered in Spotlight search). On Windows 10 you could check out the new Linux Subsystem that gives you the same syntax [1]. Composing small commands together using pipes and xargs in command line is a very powerful automation tool. For what you're asking I still think a full text search is the most straightforward solution, but since you're already into programming I thought I bring this up.
To demonstrate, the task you describe would be something like
find . -name "*.py" | xargs -I {} grep -H "xls" {}
This would search your working directory (and all subdirectories) for python files (using . as its first argument to find, which refers to the directory you're currently in, shown by pwd), and then search each of those python files for the string "xls". xargs takes all lines from standard input (which the pipe | gets from the last command) and converts them into command line parameters. grep -H searches files for the specified string and prints the occurrences together with the file name.
[1] I'm assuming you're not on Linux already since you like to use MS Office.

Using cat command in Python for printing

In the Linux kernel, I can send a file to the printer using the following command
cat file.txt > /dev/usb/lp0
From what I understand, this redirects the contents in file.txt into the printing location. I tried using the following command
>>os.system('cat file.txt > /dev/usb/lp0')
I thought this command would achieve the same thing, but it gave me a "Permission Denied" error. In the command line, I would run the following command prior to concatenating.
sudo chown root:lpadmin /dev/usb/lp0
Is there a better way to do this?

While there's no reason your code shouldn't work, this probably isn't the way you want to do this. If you just want to run shell commands, bash is much better than python. On the other hand, if you want to use Python, there are better ways to copy files than shell redirection.
The simplest way to copy one file to another is to use shutil:
shutil.copyfile('file.txt', '/dev/usb/lp0')
(Of course if you have permissions problems that prevent redirect from working, you'll have the same permissions problems with copying.)
You want a program that reads input from the keyboard, and when it gets a certain input, it prints a certain file. That's easy:
import shutil
while True:
line = raw_input() # or just input() if you're on Python 3.x
if line == 'certain input':
shutil.copyfile('file.txt', '/dev/usb/lp0')
Obviously a real program will be a bit more complex—it'll do different things with different commands, and maybe take arguments that tell it which file to print, and so on. If you want to go that way, the cmd module is a great help.

Remember, in UNIX - everything is a file. Even devices.
So, you can just use basic (or anything else, e.g. shutil.copyfile) files methods (http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files).
In your case code may (just a way) be like that:
# Read file.txt
with open('file.txt', 'r') as content_file:
content = content_file.read()
with open('/dev/usb/lp0', 'w') as target_device:
target_device.write(content)
P. S. Please, don't use system() call (or similar) to solve your issue.

under windows OS there is no cat command you should usetype instead of cat under windows
(**if you want to run cat command under windows please look at: https://stackoverflow.com/a/71998867/2723298 )
import os
os.system('type a.txt > copy.txt')
..or if your OS is linux and cat command didn't work anyway here are other methods to copy file..
with grep:
import os
os.system('grep "" a.txt > b.txt')
*' ' are important!
copy file with sed:
os.system('sed "" a.txt > sed.txt')
copy file with awk:
os.system('awk "{print $0}" a.txt > awk.txt')

Python Script to Change Folder Names

I'm on OS X and I'm fed up with our labeling system where I work. The labels are mm/dd/yy and I think that they should be yy/mm/dd. Is there a way to write a script to do this? I understand a bit of Python with lists and how to change the position of characters.
Any suggestions or tips?
What I have now:
083011-HalloweenBand
090311-ViolaClassRecital
090411-JazzBand
What I want:
110830-HalloweenBand
110903-ViolaClassRecital
110904-JazzBand
Thanks

Assuming the script is in the same directory as the files you want to rename, and you already have the list of files that you want to rename, you can do this:
for file in rename_list:
os.rename(file, file[4:6] + file[:2] + file[2:4] + file[6:])

There is a Q&A with information on traversing directories with Python that you could modify to do this. The key method is walk(), but you'll need to add the appropriate calls to rename().
As a beginner it is probably best to start by traversing the directories and writing out the new directory names before attempting to change the directory names. You should also make a backup and notify anyone who might care about this change before doing it.

i know you asked for python, but I would do it from the shell. this is a simple one liner.
ls | awk '{print "mv " $0 FS substr($1,5,2) substr($1,1,4) substr($1,7) }' | bash
I do not use osx but I think it is a bash shell. you may need to rename bash to sh, or awk to gawk.
but what that line is doing is piping the directory listing to awk which is printing "mv" $0 (the line) and a space (FS = field separator, which defaults to space) then two substrings.
substr(s,c,n). This returns the substring from string s starting from character position c up to a maximum length of n characters. If n is not supplied, the rest of the string from c is returned.
lastly this is piped to the shell. allowing it to be executed. This works without problems on ubuntu and variations of this command I use quite a bit. a version of awk (awk,nawk,gawk) should be isntalled on osx which I believe uses bash

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.