Searching my own Python code for specific lines - python

I have written many scripts, on all sorts of topics: some about document handling, some about text extraction, some about automation.
Sometimes I forget how I did something, for example how to create an .xls file, so I want to search my scripts for a line that shows how to do it.
What I currently do is convert all the .py files to .txt, combine all the .txt files into one, and then open that aggregated .txt file in Word and use Find.
What is a better way to search for specific lines in my own code?
Converting .py to .txt:
import os

folder = "C:\\Python27\\"
for a in os.listdir(folder):
    root, ext = os.path.splitext(a)
    if ext == ".py":
        os.rename(folder + a, folder + root + ".txt")
Putting all the .txt files together:
base_folder = "C:\\TXTs\\"
all_files = []
for each in os.listdir(base_folder):
    if each.endswith('.txt'):
        kk = os.path.join(base_folder, each)
        all_files.append(kk)

with open(base_folder + " final.txt", 'w') as outfile:
    for fname in all_files:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

Keep all of your code in a single directory tree, e.g. code.
If your operating system doesn't have a decent search tool like grep, install "the silver searcher". (You will also need a decent terminal emulator.)
For example (I'm using FreeBSD here), I tend to keep all my source code under src/. If I want to know in which scripts I use xls files, I type:
ag xls src/
which returns:
src/progs/uren/project-uren.py
24: print("Usage: {} urenbriefjesNNNN.xlsm project".format(binary),
This tells me to look at line 24 of the file src/progs/uren/project-uren.py.
If I e.g. search for Counter:
ag Counter src/
I get multiple hits:
src/scripts/public/csv2tbl.py
13:from collections import Counter
45: letters, sep = Counter(), Counter()
src/scripts/public/dvd2webm.py
15:from collections import Counter
94: rv = Counter(re.findall('crop=(\d+:\d+:\d+:\d+)', proc.stderr))
src/scripts/public/scripts-tests.py
14:from collections import Counter
26: v = Counter(rndcaps(100000)).values()
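If you'd rather stay in Python than install a separate search tool, a minimal sketch of the same kind of search is below (the src/ directory and the "xls" search term are only placeholders, and Python 3 is assumed):
import os

def search_code(root_dir, needle):
    """Print every line of every .py file under root_dir that contains needle."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as f:
                for lineno, line in enumerate(f, 1):
                    if needle in line:
                        print("{}:{}: {}".format(path, lineno, line.rstrip()))

search_code("src/", "xls")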

You can install RStudio, an open-source IDE for the R language. In the Edit menu there is a Find in Files... feature you can use just like Find and Replace in a Word document. It goes through the files in the directory you point it to. I have not had any problems searching scripts as they are, without converting them to .txt. It will search for plain terms or regular expressions. It is pretty useful!
As is R!

cat *.py | grep xls if you're on Linux.
Otherwise it may be helpful to keep some sort of README file with your python scripts. I, personally, prefer Markdown:
## Scripts
They do stuff
### Script A
Does stuff A, call `script_a.py -h` for more info
### Script B
Does stuff B, call `script_b.py -h` for more info
It renders to this:
Scripts
They do stuff
Script A
Does stuff A, call script_a.py -h for more info
Script B
Does stuff B, call script_b.py -h for more info
It takes basically no time to write, and Markdown can easily be used on sites such as SO, GitHub, Reddit and others. This very answer, in fact, is written in Markdown. But if you can't be bothered with Markdown, a simple README.txt is still much better than nothing.

The technical term for what you're trying to do is a "full text file search". Googling this together with your operating system name will give you many methods. Here is one for Windows: https://www.howtogeek.com/99406/how-to-search-for-text-inside-of-any-file-using-windows-search/.
If you're on macOS, I recommend looking into Bash command-line syntax for somewhat more complex automation tasks (although what you need is also perfectly covered by Spotlight search). On Windows 10 you could check out the new Linux Subsystem, which gives you the same syntax [1]. Composing small commands together using pipes and xargs on the command line is a very powerful automation tool. For what you're asking I still think a full text search is the most straightforward solution, but since you're already into programming I thought I'd bring this up.
To demonstrate, the task you describe would be something like
find . -name "*.py" | xargs -I {} grep -H "xls" {}
This searches your working directory (and all subdirectories) for Python files (the . passed as find's first argument refers to the directory you're currently in, as shown by pwd), and then searches each of those files for the string "xls". xargs takes all the lines from standard input (which the pipe | feeds from the previous command) and turns them into command-line arguments. grep -H searches files for the specified string and prints each match together with its file name.
[1] I'm assuming you're not on Linux already since you like to use MS Office.

Related

How to concatenate sequences in the same multiFASTA files and then print the result to a new FASTA file?

I have a folder with over 50 FASTA files each with anywhere from 2-8 FASTA sequences within them, here's an example:
testFOR.id_AH004930.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGA
>AH004930|2:237-401_Miopithecus_talapoin
GGGT
>AH004930|2:502-580_Miopithecus_talapoin
CTTTGCT
>AH004930|2:681-747_Miopithecus_talapoin
GGTG
testFOR.id_M95099.fasta
>M95099|1:1-90_Homo_sapien
TCTTTGC
>M95099|1:100-243_Homo_sapien
ATGGTCTTTGAA
They're all grouped based on their ID number (in this case AH004930 and M95099), which I've managed to extract from the original raw multiFASTA file using the very handy seqkit code found HERE.
What I am aiming to do is:
Use cat to put these sequences together within the file like this:
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTG
>M95099|1:1-90_Homo_sapien
TCTTTGCATGGTCTTTGAA
(I'm not fussed about the nucleotide position, I'm fussed about the ID and species name!)
Print this result out into a new FASTA file.
Ideally I'd really like to have all of these 50 files condensed into 1 FASTA that I can then go ahead and filter/align:
GENE_L.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTG
>M95099|1:1-90_Homo_sapien
TCTTTGCATGGTCTTTGAA
....
So far I have found a way to achieve what I want, but only one file at a time, using this code (for which I've sadly lost the link to give credit): cat myfile.fasta | sed -e '1!{/^>.*/d;}' | sed ':a;N;$!ba;s/\n//2g' > output.fasta. A lot of these file names are very similar, so if I did this manually it's inevitable that I'd miss some, and it would be far too slow.
I have tried to put this into a loop and it's almost there! But what it does is cat every FASTA file into a new one while keeping only the first header, leaving me with one massive stitched-together sequence:
for FILE in *; do cat *.fasta| sed -e '1!{/^>.*/d;}'| sed ':a;N;$!ba;s/\n//2g' > output.fasta; done
output.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTGTCTTTGCATGGTCTTTGAAGGTCTTTGAAATGAGTGGT...
I wondered if making a loop similar to the one HERE would be any good but I am really unsure how to get it to print each header once it opens a new file.
How can I cat these sequences, print them into a new file and still keep these headers?
I would really appreciate any advice on where I've gone wrong in the loop, and any solutions suitable for a zsh shell! I'm open to any Python or Linux solution. Thank you kindly in advance.
This might work for you (GNU sed):
sed -s '1h;/>/d;H;$!d;x;s/\n//2g' file1 file2 file3 ...
Set -s to treat each file separately.
Copy the first line.
Delete any other lines containing >.
Append all other lines to the first.
Delete these lines except for the last.
At the end of the file, swap to the copies and remove all newlines except the first.
Repeat for all files.
Alternative for non-GNU seds:
for file in *.fasta; do sed '1h;/>/d;H;$!d;x;s/\n//2g' "$file"; done
N.B. MacOS sed may need the commands to be put into a script and invoked using the -f option, or split into several pieces using the -e option (dropping the ; separators); your mileage may vary.
Or perhaps:
for file in file?; do sed $'1h;/>/d;H;$!d;x;s/\\n/#/;s/\\n//g;s/#/\\n/' "$file"; done
I'm not sure I understand your issue exactly, but if you simply want to concatenate the contents of many files into a single file, I believe the (Python) code below should work:
import os

input_folder = 'path/to/your/folder/with/fasta/files'
output_file = 'output.fasta'

with open(output_file, 'w') as outfile:
    for file_name in os.listdir(input_folder):
        if not file_name.endswith('.fasta'):  # skip anything that is not a FASTA file
            continue
        file_path = os.path.join(input_folder, file_name)
        with open(file_path, 'r') as inpfile:
            outfile.write(inpfile.read())
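If the goal is not just plain concatenation but also collapsing each file to a single record (keep only the first header and join all the sequence lines), a hedged Python sketch along those lines might look like the following; the folder and output names are placeholders taken from the question:
import os

input_folder = 'path/to/your/folder/with/fasta/files'
output_file = 'GENE_L.fasta'

with open(output_file, 'w') as outfile:
    for file_name in sorted(os.listdir(input_folder)):
        if not file_name.endswith('.fasta'):
            continue
        header = None
        sequence_parts = []
        with open(os.path.join(input_folder, file_name)) as inpfile:
            for line in inpfile:
                line = line.strip()
                if line.startswith('>'):
                    if header is None:  # keep only the first header of each file
                        header = line
                elif line:
                    sequence_parts.append(line)
        if header is not None:
            outfile.write(header + '\n')
            outfile.write(''.join(sequence_parts) + '\n')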

Import multiple files output by a bash script into Python lists

I have a bash script that connects to multiple compute nodes and pulls data from each one depending on some arguments entered after the bash script is called. For simplicity's sake, I'm essentially doing this:
for h in node{0..7}; do ssh $h 'fold -w 80 /program.headers | grep "RA" | head -600 | tr -d "RA =" > '$h'filename'; done
I'm trying to take the 8 files that come out of this (each has 600 pieces of information) and save each one as a list in Python. I then need to manipulate them in Python (split and convert to float) to be able to plot the data with Matplotlib.
For a bash script that outputs only one file, I can easily assign the result of check_output to a variable and then manipulate it from there:
test = subprocess.check_output("./bashscript")
test_list = test.split()
test = [float(a) for a in test_list]
I am also able to read a saved file from my bash script by using:
test = subprocess.check_output(['cat', '/path/filename'])
test_list = test.split()
test = [float(a) for a in test_list]
The problem is that I'm working with over 80 files once I have everything I need. Is there some way in Python to say, "for every file made, store its contents as a list"?
Instead of capturing data with subprocess you can use os.popen() to execute scripts. The benefit is that you can read the output of a command or script the same way you read a file, so you can use read(), readlines() or readline() as needed; readlines() gives you a list. Using it, you can execute the script and capture its output like this:
import os
output = os.popen("./bashscript").readlines()  # output now holds the output of bashscript, with each line as a separate list item
Check this for more info on how to use os.popen(). Check this to learn the difference between read(), readlines(), readline() and xreadlines().
Define a simple interface between your bash script and your Python script.
It looks like the simple interface used to be a printout of the file's contents, but that solution did not scale to multiple files. Instead, I recommend the interface be to print out the names of the files created. It would look something like this:
import subprocess

filenames = subprocess.check_output("./bashscript").split()
for filename in filenames:
    with open(filename) as file_obj:
        file_data = [float(a) for a in file_obj.readlines()]
It looks like you are unfamiliar with Python but familiar with bash. As a result you are programming hobbled on bash crutches; instead, you should embrace Python and use it throughout your application. You probably do not need the bash script at all.
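If the bash script keeps writing one file per node, a small glob-based sketch would collect every such file into a list of float lists (the node*filename naming pattern is assumed from the loop in the question):
import glob

node_data = []
for path in sorted(glob.glob('node*filename')):  # naming pattern assumed from the question
    with open(path) as f:
        node_data.append([float(value) for value in f.read().split()])

# node_data[0] now holds the values read from node0's file, and so on.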

Find files with same name but different content

I need to find files with the same name but different content in a linux folder structure with a lot of files.
Something like this does the job partially; how do I filter it down to just the files whose content differs?
#!/bin/sh
dirname=/path/to/directory
find $dirname -type f | sed 's_.*/__' | sort | uniq -d |
while read fileName
do
    find $dirname -type f | grep "$fileName"
done
(How to find duplicate filenames (recursively) in a given directory? BASH)
Thanks so much !
The first question is, how can you determine whether two files have the same content?
One obvious possibility is to read (or mmap) both files and compare them a block at a time. On some platforms, a stat is a lot faster than a read, so you may want to compare sizes first. And there are other optimizations that might be useful, depending on what you're actually doing (e.g., if you're going to run this thousands of times, and most of the files are the same every time, you could hash them and cache the hashes, and only check the actual files when the hashes match). But I doubt you're too worried about that kind of performance tweak if your existing code is acceptable (since it searches the whole tree once for every file in the tree), so let's just do the simplest thing.
Here's one way to do it in Python:
#!/usr/bin/env python3
import sys

def readfile(path):
    with open(path, 'rb') as f:
        return f.read()

contents = [readfile(fname) for fname in sys.argv[1:]]
sys.exit(all(content == contents[0] for content in contents[1:]))
This will exit with code 1 if all files are identical, code 0 if any pair of files are different. So, save this as allequal.py, make it executable, and your bash code can just run allequal.py on the results of that grep, and use the exit value (e.g., via $?) to decide whether to print those results for you.
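For reference, the size-first optimisation mentioned above could be sketched roughly like this (purely illustrative and not part of the original answer; the function names are made up):
import hashlib
import os

def file_hash(path, chunk_size=1 << 20):
    # Hash the file in chunks so large files do not have to fit in memory.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def same_content(path_a, path_b):
    # Cheap check first: files of different sizes cannot have the same content.
    if os.stat(path_a).st_size != os.stat(path_b).st_size:
        return False
    return file_hash(path_a) == file_hash(path_b)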
I am facing the same problem as described in the question. In a large directory tree, some files have the same name and either same content or different content. The ones where the content differs need human attention to decide how to fix the situation in each case. I need to create a list of these files to guide the person doing this.
The code in the question and the code in abernet's response are both helpful. Here is how one would combine both: store the Python code from abernet's response in some file, e.g. /usr/local/bin/do_these_files_have_different_content:
sudo tee /usr/local/bin/do_these_files_have_different_content <<EOF
#!/usr/bin/env python3
import sys

def readfile(path):
    with open(path, 'rb') as f:
        return f.read()

contents = [readfile(fname) for fname in sys.argv[1:]]
sys.exit(all(content == contents[0] for content in contents[1:]))
EOF
sudo chmod a+x /usr/local/bin/do_these_files_have_different_content
Then extend the bash code from Illusionist's question to call this program when needed and react to its outcome:
#!/bin/sh
dirname=$1
find $dirname -type f | sed 's_.*/__' | sort | uniq -d |
while read fileName
do
    if do_these_files_have_different_content $(find $dirname -type f | grep "$fileName")
    then
        find $dirname -type f | grep "$fileName"
        echo
    fi
done
This will write to stdout the paths of all files with same name but different content. Groups of files with same name but different content are separated by empty lines. I store the shell script in /usr/local/bin/find_files_with_same_name_but_different_content and invoke it as
find_files_with_same_name_but_different_content /path/to/my/storage/directory
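For completeness, a hedged all-Python version of the same search is sketched below: it walks the tree once, groups paths by basename, and prints the groups whose contents differ (the directory argument and output format mirror the shell script above):
#!/usr/bin/env python3
import os
import sys
from collections import defaultdict

def readfile(path):
    with open(path, 'rb') as f:
        return f.read()

root = sys.argv[1]
by_name = defaultdict(list)
for dirpath, _dirs, files in os.walk(root):
    for name in files:
        by_name[name].append(os.path.join(dirpath, name))

for name, paths in sorted(by_name.items()):
    if len(paths) < 2:
        continue
    contents = [readfile(p) for p in paths]
    if any(c != contents[0] for c in contents[1:]):  # same name, different content
        print('\n'.join(paths))
        print()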

File names have a `hidden' m character prepended

I have a simple Python script which produces some data in a neutron star mode. I use it to automate file names so I don't later forget the inputs. The script successfully saves the file as
some_parameters.txt
but when I then list the files in terminal I see
msome_parameters.txt
The file name without the "m" is still valid and trying to call the file with the m returns
$ ls m*
No such file or directory
So I think the "m" has some special meaning, for which numerous Google searches have not yielded answers. While I can carry on without worrying, I would like to know the cause. Here is how I create the file in Python:
import os

# chi, epsI etc. are all floats. Make a string for the file name
file_name = "chi_%s_epsI_%s_epsA_%s_omega0_%s_eta_%s.txt" % (chi, epsI, epsA, omega0, eta)
# a.out is the compiled C file which outputs data
os.system("./a.out > %s" % (file_name))
Any advice would be much appreciated; usually I can find the answer already posted on Stack Overflow, but this time I'm really confused.
You have a file with some special characters in the name which is confusing the terminal output. What happens if you do ls -l or (if possible) use a graphical file manager - basically, find a different way of listing the files so you can see what's going on. Another possibility would be to do ls > some_other_filename and then look at the file with a hex editor.
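One quick way to see exactly which characters ended up in the name is to list the directory from Python and print the repr() of every entry (a small diagnostic sketch, not part of the original answer):
import os

for name in os.listdir('.'):
    # repr() makes invisible or non-printing characters visible,
    # e.g. a stray '\r' or '\x08' hiding in the file name.
    print(repr(name))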

Problems with running a python script over many files

I am on a Linux (Ubuntu 11.10) machine, using the Bourne-again shell (bash).
I have to process a directory full of files with a Python script. My colleague wrote the Python script and I have successfully used it before on one file at a time. It takes two arguments: a path to the file to be processed, enclosed in quotes, and a secondary argument called -min which requires an integer. Also, the script writes to standard out.
From my experience of shell scripting and following others on this forum, I used the following method to iterate over the directory of files:
for f in path/to/data_directory/*; do
    path/to/pythonscript.py $f -min 1 > path/to/out_directory/$f;
done
I get the desired file names in the out_directory, and the content of each is something only the Python script could have written. That is, the above for loop successfully passes the files to the script. However, the content of each file is completely wrong (as in, the computation the script performed was wrong). When I run the Python script on one of the files in the data_directory directly, the output file has the correct content (the computation performed by the script is correct).
What makes it more complex is that the same shell method (the for loop) works perfectly on the Mac OS X machine my colleague has.
Where is the issue? Am I missing something very fundamental about Linux shells? Maybe it's a syntax error?
Any help will be appreciated.
Update: I just ran the for loop again, but instead of pointing it at the data_directory of files, I pointed it at a single file within the data_directory. I had the same problem: the Python script did not compute the correct result.
The only problem I see is that filenames may contain white-space - so you should quote filenames:
for f in path/to/data_directory/*; do
    path/to/pythonscript.py "$f" -min 1 > "path/to/out_directory/$f"
done
Well, I don't know if this helps, but:
path/to/pythonscript.py $f -min 1 > path/to/out_directory/$f
expands to
path/to/pythonscript.py path/to/data_directory/myfile -min 1 > path/to/out_directory/path/to/data_directory/myfile
The script should be:
cd path/to/data_directory
for f in *; do
    path/to/pythonscript.py $f -min 1 > path/to/out_directory/$f
done
What version of bash are you running?
What do you get if you run this script?
cd path/to/data_directory
for f in *; do
    echo $f > /tmp/$f
done
Of course, that should give you a bunch of files, each containing its own file name.
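If the shell quoting keeps getting in the way, a hedged Python driver for the same loop could be used instead; the script path, the -min flag, and the directory names below are taken from the question:
import os
import subprocess

data_dir = 'path/to/data_directory'
out_dir = 'path/to/out_directory'

for name in os.listdir(data_dir):
    in_path = os.path.join(data_dir, name)
    out_path = os.path.join(out_dir, name)
    with open(out_path, 'w') as out_file:
        # pythonscript.py writes its result to standard out, so capture it per file.
        subprocess.check_call(['path/to/pythonscript.py', in_path, '-min', '1'],
                              stdout=out_file)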
