Find files with same name but different content - python

I need to find files with the same name but different content in a linux folder structure with a lot of files.
Something like this does the job partially; how do I eliminate the groups whose content is actually identical?
#!/bin/sh
dirname=/path/to/directory
find $dirname -type f | sed 's_.*/__' | sort | uniq -d |
while read fileName
do
    find $dirname -type f | grep "$fileName"
done
(How to find duplicate filenames (recursively) in a given directory? BASH)
Thanks so much!

The first question is, how can you determine whether two files have the same content?
One obvious possibility is to read (or mmap) both files and compare them a block at a time. On some platforms, a stat is a lot faster than a read, so you may want to compare sizes first. And there are other optimizations that might be useful, depending on what you're actually doing (e.g., if you're going to run this thousands of times, and most of the files are the same every time, you could hash them and cache the hashes, and only check the actual files when the hashes match). But I doubt you're too worried about that kind of performance tweak if your existing code is acceptable (since it searches the whole tree once for every file in the tree), so let's just do the simplest thing.
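For what it's worth, here is a minimal sketch of that size-check-then-compare idea for two files; the function name and chunk size are just illustrative, not part of the solution below:

import os

def files_match(path_a, path_b, chunk_size=64 * 1024):
    # Cheap size comparison first; only read the bytes when the sizes agree.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            block_a = fa.read(chunk_size)
            block_b = fb.read(chunk_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files exhausted at the same point
                return True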
Here's one way to do it in Python:
#!/usr/bin/env python3
import sys
def readfile(path):
    with open(path, 'rb') as f:
        return f.read()

contents = [readfile(fname) for fname in sys.argv[1:]]
sys.exit(all(content == contents[0] for content in contents[1:]))
This will exit with code 1 if all files are identical, code 0 if any pair of files are different. So, save this as allequal.py, make it executable, and your bash code can just run allequal.py on the results of that grep, and use the exit value (e.g., via $?) to decide whether to print those results for you.

I am facing the same problem as described in the question. In a large directory tree, some files have the same name and either same content or different content. The ones where the content differs need human attention to decide how to fix the situation in each case. I need to create a list of these files to guide the person doing this.
The code in the question and the code in abernet's response are both helpful. Here is how one would combine the two: store the Python code from abernet's response in a file, e.g. /usr/local/bin/do_these_files_have_different_content:
sudo tee /usr/local/bin/do_these_files_have_different_content <<EOF
#!/usr/bin/env python3
import sys
def readfile(path):
    with open(path, 'rb') as f:
        return f.read()

contents = [readfile(fname) for fname in sys.argv[1:]]
sys.exit(all(content == contents[0] for content in contents[1:]))
EOF
sudo chmod a+x /usr/local/bin/do_these_files_have_different_content
Then extend the bash code from Illusionist's question to call this program when needed, and react to its outcome:
#!/bin/sh
dirname=$1
find $dirname -type f | sed 's_.*/__' | sort | uniq -d |
while read fileName
do
    if do_these_files_have_different_content $(find $dirname -type f | grep "$fileName")
    then
        find $dirname -type f | grep "$fileName"
        echo
    fi
done
This will write to stdout the paths of all files with the same name but different content, with each such group separated by an empty line. I store the shell script in /usr/local/bin/find_files_with_same_name_but_different_content and invoke it as
find_files_with_same_name_but_different_content /path/to/my/storage/directory
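If you would rather do the whole job in Python instead of combining shell and Python, a rough single-script sketch of the same logic might look like the following; the names and structure here are my own, not taken from either answer:

#!/usr/bin/env python3
import os
import sys
from collections import defaultdict

def readfile(path):
    with open(path, 'rb') as f:
        return f.read()

# Group every file under the given directory by its basename.
by_name = defaultdict(list)
for dirpath, _dirnames, filenames in os.walk(sys.argv[1]):
    for name in filenames:
        by_name[name].append(os.path.join(dirpath, name))

# Print each group of same-named files whose contents are not all identical,
# separated by empty lines, mirroring the shell script's output.
for name, paths in sorted(by_name.items()):
    if len(paths) < 2:
        continue
    contents = [readfile(p) for p in paths]
    if any(c != contents[0] for c in contents[1:]):
        print('\n'.join(paths))
        print()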

Related

How to concatenate sequences in the same multiFASTA files and then print result to a new FASTA file?

I have a folder with over 50 FASTA files, each containing anywhere from 2 to 8 FASTA sequences. Here's an example:
testFOR.id_AH004930.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGA
>AH004930|2:237-401_Miopithecus_talapoin
GGGT
>AH004930|2:502-580_Miopithecus_talapoin
CTTTGCT
>AH004930|2:681-747_Miopithecus_talapoin
GGTG
testFOR.id_M95099.fasta
>M95099|1:1-90_Homo_sapien
TCTTTGC
>M95099|1:100-243_Homo_sapien
ATGGTCTTTGAA
They're all grouped based on their ID number (in this case AH004930 and M95099), which I've managed to extract from the original raw multiFASTA file using the very handy seqkit code found HERE.
What I am aiming to do is:
Use cat to put these sequences together within the file like this:
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTG
>M95099|1:1-90_Homo_sapien
TCTTTGCATGGTCTTTGAA
(I'm not fussed about the nucleotide position, I'm fussed about the ID and species name!)
Print this result out into a new FASTA file.
Ideally I'd really like to have all of these 50 files condensed into 1 FASTA that I can then go ahead and filter/align:
GENE_L.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTG
>M95099|1:1-90_Homo_sapien
TCTTTGCATGGTCTTTGAA
....
So far I have found a way to achieve what I want, but only one file at a time (using this code: cat myfile.fasta | sed -e '1!{/^>.*/d;}' | sed ':a;N;$!ba;s/\n//2g' > output.fasta, for which I've sadly lost the link to give credit). But a lot of these file names are very similar, so it's inevitable that if I did it manually I'd miss some, or it would be way too slow.
I have tried to put this into a loop and it's kind of there! But what it does is cat every FASTA file, put the result into a new one and keep only the first header, leaving me with one massive stitched-together sequence:
for FILE in *; do cat *.fasta| sed -e '1!{/^>.*/d;}'| sed ':a;N;$!ba;s/\n//2g' > output.fasta; done
output.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTGTCTTTGCATGGTCTTTGAAGGTCTTTGAAATGAGTGGT...
I wondered if making a loop similar to the one HERE would be any good but I am really unsure how to get it to print each header once it opens a new file.
How can I cat these sequences, print them into a new file and still keep these headers?
I would really appreciate any advice on where I've gone wrong in the loop and any solutions suitable for a zsh shell! I'm open to any python or linux solution. Thank you kindly in advance
This might work for you (GNU sed):
sed -s '1h;/>/d;H;$!d;x;s/\n//2g' file1 file2 file3 ...
Set -s to treat each file separately.
Copy the first line.
Delete any other lines containing >.
Append all other lines to the first.
Delete these lines except for the last.
At the end of the file, swap in the copy and remove all newlines except the first.
Repeat for all files.
Alternative for non-GNU seds:
for file in *.fasta; do sed '1h;/>/d;H;$!d;x;s/\n//2g' "$file"; done
N.B. MacOS sed may need to be put into a script and invoked using the -f option or split into several pieces using the -e option (less the ; commands), your luck may vary.
Or perhaps:
for file in file?; do sed $'1h;/>/d;H;$!d;x;s/\\n/#/;s/\\n//g;s/#/\\n/' "$file"; done
Not sure I understand exactly your issue, but if you simply want to concatenate contents from many files to a single file I believe the (Python) code below should work:
import os

input_folder = 'path/to/your/folder/with/fasta/files'
output_file = 'output.fasta'

with open(output_file, 'w') as outfile:
    for file_name in os.listdir(input_folder):
        if not file_name.endswith('.fasta'):  # skip anything that isn't a FASTA file
            continue
        file_path = os.path.join(input_folder, file_name)
        with open(file_path, 'r') as inpfile:
            outfile.write(inpfile.read())
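If you want one record per input file (keeping only the first header and joining the sequence lines, as in the desired GENE_L.fasta above), a variation along these lines should work; the folder and output names are placeholders:

import os

input_folder = 'path/to/your/folder/with/fasta/files'
output_file = 'GENE_L.fasta'

with open(output_file, 'w') as outfile:
    for file_name in sorted(os.listdir(input_folder)):
        if not file_name.endswith('.fasta'):
            continue
        header = None
        sequence_parts = []
        with open(os.path.join(input_folder, file_name)) as inpfile:
            for line in inpfile:
                line = line.strip()
                if line.startswith('>'):
                    if header is None:  # keep only the first header of each file
                        header = line
                else:
                    sequence_parts.append(line)
        if header is not None:
            outfile.write(header + '\n' + ''.join(sequence_parts) + '\n')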

Running .py file with an argument in .bat

The problem: I want to iterate over a folder in search of a certain file type, execute each matching file with a program (passing name.ext as the argument), and then run a Python script that changes the output name of the first program.
I know there is probably a better way to do the above, but the way I thought of was this:
[BAT]
for /R "C:\..\folder" %%a IN (*.extension) do (
    SET name=%%a
    "C:\...\first_program.exe" "%%a"
    "C:\...\script.py" "%name%"
)
[PY]
import io
import sys

def rename(i):
    name = i
    with open('my_file.txt', 'r') as file:
        data = file.readlines()
    data[40] = '"C:\\\\Users\\\\UserName\\\\Desktop\\\\folder\\\\folder\\\\' + name + '"\n'
    with open('my_file.txt', 'w') as file:
        file.writelines(data)

if __name__ == "__main__":
    rename(sys.argv[1])
Expected result: I want the Python file to change the name, but after entering it once in the console it seems to stick with the script; the BAT does not change it, and that bothers me.
PS. If there is a better way, I'll be glad to get to know it.
This is the Linux bash version; I am sure you can change the loop etc. to make it work as a batch file. Instead of your *.exe, I use cat as a generic input/output example.
#! /bin/sh
for f in *.txt
do
    suffix=".txt"
    name=${f%$suffix}
    cat $f > tmp.dat
    awk -v myName=$f '{if(NR==5) print $0 myName; else print $0 }' tmp.dat > $name.dat
done
This produces "unique" output *.dat files named after the input *.txt files. The files are treated by cat (standing in for your *.exe) and the output is put into a temporary file. Finally, awk handles that file, changing line 5 here, with the output placed in the unique file mentioned above.
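For completeness, a rough Python equivalent of that loop (still treating cat as a stand-in for your *.exe, and appending the file name to line 5 just as the awk command does) could look like this:

import glob
import os

for txt_path in glob.glob('*.txt'):
    name = os.path.splitext(txt_path)[0]
    with open(txt_path) as infile:
        lines = infile.readlines()
    # Append the input file name to line 5, as the awk NR==5 branch does.
    if len(lines) >= 5:
        lines[4] = lines[4].rstrip('\n') + txt_path + '\n'
    with open(name + '.dat', 'w') as outfile:
        outfile.writelines(lines)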

How can I run this shell script inside python?

I want to run a bash script from a python program. The script has a command like this:
find . -type d -exec bash -c 'cd "$0" && gunzip -c *.gz | cut -f 3 >> ../mydoc.txt' {} \;
Normally I would run a subprocess call like:
subprocess.call('ls | wc -l', shell=True)
But that's not possible here because of the nested quotation marks. Any suggestions?
Thanks!
While the question is answered already, I'll still jump in, because I assume that you only want to execute that bash script because you do not have functionally equivalent Python code (which is basically less than 40 lines, see below).
Why do this instead of the bash script?
Your script is now able to run on any OS that has a Python interpreter
The functionality is a lot easier to read and understand
If you need anything special, it is always easier to adapt your own code
More Pythonic :-)
Please bear in mind that this, like your bash script, has no error checking of any kind, and that the output file is a global variable, but that can be changed easily.
import gzip
import os

# create our output file
outfile = open('/tmp/output.txt', mode='w', encoding='utf-8')

def process_line(line):
    """
    get the third column (delimiter is tab char) and write to output file
    """
    columns = line.split('\t')
    if len(columns) >= 3:
        outfile.write(columns[2] + '\n')  # third tab-separated column is index 2

def process_zipfile(filename):
    """
    read zip file content (we assume text) and split into lines for processing
    """
    print('Reading {0} ...'.format(filename))
    with gzip.open(filename, mode='rb') as f:
        lines = f.read().decode('utf-8').split('\n')
        for line in lines:
            process_line(line.strip())

def process_directory(dirtuple):
    """
    loop thru the list of files in that directory and process any .gz file
    """
    print('Processing {0} ...'.format(dirtuple[0]))
    for filename in dirtuple[2]:
        if filename.endswith('.gz'):
            process_zipfile(os.path.join(dirtuple[0], filename))

# walk the directory tree from current directory downward
for dirtuple in os.walk('.'):
    process_directory(dirtuple)

outfile.close()
Escape the ' marks with a \.
i.e. For every: ', replace with: \'
Triple quotes or triple double quotes ('''some string''' or """some other string""") are handy as well. See here (yeah, it's Python 3 documentation, but it all works 100% in Python 2).
mystring = """how many 'cakes' can you "deliver"?"""
print(mystring)
how many 'cakes' can you "deliver"?
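Applied to the find command from the question, that could look something like this (just a sketch; the raw triple-quoted string leaves both kinds of quotes and the backslash untouched):

import subprocess

# shell=True runs the command string through the shell exactly as written.
cmd = r"""find . -type d -exec bash -c 'cd "$0" && gunzip -c *.gz | cut -f 3 >> ../mydoc.txt' {} \;"""
subprocess.call(cmd, shell=True)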

Python, searching my own scripts

I have written scripts on different topics: some about document handling, some about text extraction, some about automation.
Sometimes I forget a usage, for example how to create a .xls file, so I want to search whether there is a line in the scripts that shows how to do it.
What I am doing is converting all the .py files into .txt, combining all the .txt files together, and then using Word to open this aggregated .txt file and search it.
What's a better way to search for specific lines in my own code?
Converting .py to .txt:
import os

folder = "C:\\Python27\\"
for a in os.listdir(folder):
    root, ext = os.path.splitext(a)
    if ext == ".py":
        os.rename(folder + a, folder + root + ".txt")
Putting all the .txt files together:
import os

base_folder = "C:\\TXTs\\"
all_files = []
for each in os.listdir(base_folder):
    if each.endswith('.txt'):
        kk = os.path.join(base_folder, each)
        all_files.append(kk)

with open(base_folder + " final.txt", 'w') as outfile:
    for fname in all_files:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)
Keep all of your code in a single directory tree, e.g. code.
If your operating system doesn't have a decent search tool like grep, install "the silver searcher". (You will also need a decent terminal emulator.)
For example (I'm using FreeBSD here), I tend to keep all my source code under src/. If I want to know in which scripts I use xls files, I type:
ag xls src/
Which returns:
src/progs/uren/project-uren.py
24: print("Usage: {} urenbriefjesNNNN.xlsm project".format(binary),
This tells me to look at line 24 of the file src/progs/uren/project-uren.py.
If I e.g. search for Counter:
ag Counter src/
I get multiple hits:
src/scripts/public/csv2tbl.py
13:from collections import Counter
45: letters, sep = Counter(), Counter()
src/scripts/public/dvd2webm.py
15:from collections import Counter
94: rv = Counter(re.findall('crop=(\d+:\d+:\d+:\d+)', proc.stderr))
src/scripts/public/scripts-tests.py
14:from collections import Counter
26: v = Counter(rndcaps(100000)).values()
You can install RStudio, an open source IDE for the R language. In the Edit menu there is a Find in Files... feature you can use just like find and replace in a word document. It will go through files in the directory you point it to. I have not yet had problems searching scripts as they are, untransformed to .txt. It will search for terms or regex expressions; it is pretty useful!
As is R!
cat *.py | grep xls if you're on Linux.
Otherwise it may be helpful to keep some sort of README file with your python scripts. I, personally, prefer Markdown:
## Scripts
They do stuff
### Script A
Does stuff A, call `script_a.py -h` for more info
### Script B
Does stuff B, call `script_b.py -h` for more info
It compiles to this:
Scripts
They do stuff
Script A
Does stuff A, call script_a.py -h for more info
Script B
Does stuff B, call script_b.py -h for more info
It takes basically no time to write and Markdown can be easily used on sites such as SO, Github, Reddit and others. This very answer, in fact, is written in Markdown. But if you can't be bothered with Markdown, a simple README.txt is still much better than nothing.
The technical term for what you're trying to do is a "full text file search". Googling this together with your operating system name will give you many methods. Here is one for Windows: https://www.howtogeek.com/99406/how-to-search-for-text-inside-of-any-file-using-windows-search/.
If you're on MacOS I recommend looking into BASH command line syntax to do a bit more complex automation tasks (although what you need is also perfectly covered in Spotlight search). On Windows 10 you could check out the new Linux Subsystem that gives you the same syntax [1]. Composing small commands together using pipes and xargs in command line is a very powerful automation tool. For what you're asking I still think a full text search is the most straightforward solution, but since you're already into programming I thought I bring this up.
To demonstrate, the task you describe would be something like
find . -name "*.py" | xargs -I {} grep -H "xls" {}
This would search your working directory (and all subdirectories) for python files (using . as its first argument to find, which refers to the directory you're currently in, shown by pwd), and then search each of those python files for the string "xls". xargs takes all lines from standard input (which the pipe | gets from the last command) and converts them into command line parameters. grep -H searches files for the specified string and prints the occurrences together with the file name.
[1] I'm assuming you're not on Linux already since you like to use MS Office.
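And since the scripts in question are Python anyway, here is a small pure-Python stand-in for the grep/ag commands above, in case installing a search tool is not an option; the function name, paths and search term are only illustrative:

import os

def search_scripts(root, term):
    # Walk the source tree and print every matching line with file name and line number.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith('.py'):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors='replace') as f:
                for lineno, line in enumerate(f, start=1):
                    if term in line:
                        print('{}:{}: {}'.format(path, lineno, line.rstrip()))

search_scripts('src/', 'xls')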

Copy the nth column of all the files in a directory into a single file

I've a directory containing many .csv files. How can I extract the nth column of every file into a new file column-wise?
For example:
File A:
111,222,333
111,222,333
File B:
AAA,BBB,CCC
AAA,BBB,CCC
File C:
123,456,789
456,342,122
and so on...
If n = 2, I want my resultant file to be:
222,BBB,456,...
222,BBB,342,...
where ... represents that there will be as many columns as the number of files in the directory.
My try so far:
#!/bin/bash
for i in `find ./ -iname "*.csv"`
do
    awk -F, '{ print $2}' < $i >> result.csv  ## This would append row-wise, not column-wise.
done
UPDATE:
I'm not trying to just join two files. There are 100s of files in a particular directory, and I want to copy the nth column of every file into a single file. I gave two files as an example to show how I want the data to look if there were only two files.
As pointed out in the comments, joining two files is trivial but joining multiple files may be not that easy which is the whole point of my question. Would python help to do this job?
Building on triplee's solution, here's a generic version which uses eval:
eval paste -d, $(printf "<(cut -d, -f2 %s) " *.csv)
I'm not too fond of eval (always be careful when using it), but it has its uses.
Hmm. My first thought is to have both an outer and inner loop. The outer loop would be a counter on line number. The inner loop would go through the csv files. You'd need to use head/tail in the inner loop to get the correct line number so you could grab the right field.
An alternative is to use the one loop you have now but write each line to a separate file and then merge them.
Neither of these seems ideal. Quite honestly, I'd do this in Perl so you could use an actual in-memory data structure and avoid the need for complex logic.
This one-liner should work (for two files):
awk -F, -v OFS="," 'NR==FNR{a[NR]=$2;next}{print a[FNR],$2}' file1 file2
Assuming Bash process substitutions are acceptable (i.e. you don't require the solution to be portable to systems where Bash is not available);
paste -d, <(cut -d, -f2 file1) <(cut -d, -f2 file2) <(cut -d, -f2 file3) # etc
A POSIX solution requires temporary files instead.
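To answer the "would Python help" part: yes, a short script along these lines handles any number of files; the file glob, the column number and the output name below are assumptions you would adjust:

import csv
import glob

n = 2  # 1-based column number to extract
files = sorted(glob.glob('*.csv'))

# Collect column n from every file, then write the columns side by side.
columns = []
for path in files:
    with open(path, newline='') as f:
        columns.append([row[n - 1] for row in csv.reader(f)])

with open('result.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for row in zip(*columns):  # stops at the shortest file
        writer.writerow(row)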
