bash: copy files with the same pattern - python

I want to copy files scattered in separate directories into a single directory.
find . -name "*.off" > offFile
while read line; do
    cp "${line}" offModels    # offModels is the destination directory
done < offFile
The file offFile has 1831 lines, but after cd offModels, ls | wc -l gives 1827. I think four files ending with ".off" were not copied.
At first I thought that, because I use double quotes in the shell script, files whose names contain a dollar sign, backtick or backslash might be missed. Then I found one file named $.... But how do I find the other three? After cd offModels and ls > ../File, I wrote a Python script like this:
fname1="offFile" #records files scattered
with open(fname1) as f1:
contents1=f1.readlines()
fname2="File"
with open(fname2) as f2:
contents2=f2.readlines()
visited=[0]*len(contents1)
for substr in contents2:
substr="/"+substr
for i, string in enumerate(contents1):
if string.find(substr)>=0:
visited[i]=1
break
for i,j in enumerate(visited):
if j==0:
print contents1[i]
The output gives four lines, but they are wrong: I can find all four of those files in the destination directory.
Edit
As the comments and answers point out, there are four files whose names duplicate those of four others. One thing that interests me now is that, with the bash script I used, the file named $CROSS.off was copied. That really surprised me.

Looks like you have files with the same filenames, and cp just overwrites them.
You can use the --backup=numbered option for cp; here is a one-liner:
find -name '*.off' -exec cp --backup=numbered '{}' '/full/path/to/offModels' ';'
The -exec option allows you to execute a command on every file matched; you should use {} to get the file's name and end the command with ; (usually written as \; or ';', because bash treats semicolons as command separators).
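If you want to confirm which base names collide before copying (and therefore which files would overwrite each other), something like this should list them; it assumes GNU find for the -printf action:
find . -name '*.off' -printf '%f\n' | sort | uniq -d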

Related

Remove Files Based On Partial Match In Unix

Let's say I have a file named duplicates.txt which looks like this:
ID-32532
ID-78313
ID-89315
I also have a directory Fastq of files with the following names:
ID-18389_Feb92003_R1.fastq
ID-18389_Feb92003_R2.fastq
ID-32532_Feb142003_R1.fastq
ID-32532_Feb142003_R2.fastq
ID-48247_Mar202004_R1.fastq
ID-48247_Mar202004_R2.fastq
I want to enter a command that will search duplicates.txt and find any file whose name is a partial match in the Fastq directory and remove the file. Based on the provided example this would remove the files named ID-32532_Feb142003_{R1/R2}.fastq.
What Unix command should I use? Or, if need be, I could write a script in Python.
Here's a little bash function to do it:
lrmduplicates(){
    while read -r dupe;
    do
        echo removing "$dupe"
        # fine tune with ls first...
        # ls Fastq/$dupe*
        rm Fastq/$dupe*
        # dupes file: don't forget a line feed after the 3rd pattern,
        # i.e. end on an empty line.
    done < duplicates.txt
}
For extra bonus, suppress the error when there is no match. Not sure how to do that myself; rm -f or rm 2>/dev/null didn't do it (zsh on macOS, where the "no matches found" error comes from the shell's failed glob expansion rather than from rm).
In Unix, just replace the variable characters with '?' (single-character wildcard) or '.*' (regex):
duplicates.txt:
remove ID-?????
Fastq:
remove ID-?????_????????_??.fastq
remove ID-.*fastq
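If you would rather avoid the unmatched-glob issue from the previous answer altogether, here is a rough sketch using find instead of a shell glob; find simply matches nothing and stays quiet when there is no match. Note that -maxdepth and -delete are GNU/BSD extensions rather than strict POSIX, and the Fastq path and the ${id}* pattern are carried over from the example above:
while read -r id; do
    [ -n "$id" ] || continue                                # skip blank lines
    find Fastq -maxdepth 1 -type f -name "${id}*" -print -delete
done < duplicates.txt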

Bash/Python to gather several paths to one and replacing the file name by '*' character

I'm writing a bash script to generate a .spec file (for building an RPM) automatically. I read all the files in the directory (which I hope to turn into an RPM package) and write the paths of all the files that need to be installed into the .spec file, and I realize that I need to shorten them. An example:
/tmp/a/1.jpg
/tmp/a/2.conf
/tmp/a/b/srf.cfg
/tmp/a/b/ssp.cfg
/tmp/a/conf_web_16.2/c/.htaccess
/tmp/a/conf_web_16.2/c/.htaccess.WebProv
/tmp/a/conf_web_16.2/c/.htprofiles
=> What I want to get:
/tmp/a/*.jpg
/tmp/a/*.conf
/tmp/a/b/*.cfg
/tmp/a/conf_web_16.2/c/*
/tmp/a/conf_web_16.2/c/*.WebProv
Please give me some advice about my problem. I'd appreciate ideas in bash, Python or C. Thank you in advance.
To convert any file name which contains a dot at a position other than the first character into a wildcard covering the part up to just before the dot, and any remaining file names to just a wildcard:
sed -e 's%/[^/][^/]*\(\.[^./]*\)$%/*\1%' -e t -e 's%/[^/]*$%/*%'
The behavior of sed is to read its input one line at a time, and execute the script of commands on each in turn. The s%foo%bar% substitution command replaces a regex match with a string, and the t command causes the script to skip further substitutions if one was already performed on the current line. (I'm simplifying somewhat.) The first regex matches file names which contain a dot in a position other than the first, and captures the match from the dot through the end in a back reference which is used in the substitution as well (that's the \1). The second is applied to any remaining file names, because of the t command in between.
The result will probably need to be piped to sort -u to remove any duplicates.
If you don't have a list of the file names, you can use find to pipe in a listing.
find . -type f | sed ... | sort -u
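For illustration, feeding the question's sample paths through the same sed script reproduces the five wildcard patterns shown above (sort -u also reorders them):
printf '%s\n' /tmp/a/1.jpg /tmp/a/2.conf /tmp/a/b/srf.cfg /tmp/a/b/ssp.cfg \
    /tmp/a/conf_web_16.2/c/.htaccess /tmp/a/conf_web_16.2/c/.htaccess.WebProv \
    /tmp/a/conf_web_16.2/c/.htprofiles |
    sed -e 's%/[^/][^/]*\(\.[^./]*\)$%/*\1%' -e t -e 's%/[^/]*$%/*%' | sort -u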

Renaming files recursively to make the filename a concatenation of their path

Sorry if the title is unclear. An example folder structure to help understand:
/images/icons/654/323/64/64/icon.png
/images/icons/837/283/64/64/icon.png
to be renamed to
/images-icons-654-323-64-64-icon.png
/images-icons-837-283-64-64-icon.png
I'm not great at bash so all I have to start with is:
find . -name "*.png"
which will find all of the files, which I'm then hoping to use with -exec rename, or whatever works. Also open to using any language to get the job done!
Solution in bash:
for f in $(find images_dir -type f); do
    mv -v "$f" "${f//\//-}"
done
This finds all files in the images_dir directory, replaces any / in their path with - thanks to the parameter expansion, and moves the file to the new path.
For example, the file images_dir/icons/654/321/b.png will be moved to images_dir-icons-654-321-b.png.
Note that if you execute find ., you will encounter an issue as find outputs filenames starting with ./, which means your files will be renamed to something like .-<filename>.
As @gniourf_gniourf notes in the comments, this will fail if your file names include spaces or newlines.
Whitespace-proof:
find images_dir -type f -exec bash -c 'for i; do mv -v "$i" "${i//\//-}"; done' _ {} +
(The _ fills $0 of the inline bash, so the file names found become its positional parameters, which the bare for i iterates over.)
In python you could do it like so:
import fnmatch
import os

def find(base_dir, some_string):
    matches = []
    for root, dirnames, filenames in os.walk(base_dir):
        for filename in fnmatch.filter(filenames, some_string):
            matches.append(os.path.join(root, filename))
    return matches

files = find('.', '*.png')
new_files = ['-'.join(ele.split('/')[1:]) for ele in files]
for idx, ele in enumerate(files):
    os.rename(ele, new_files[idx])
And to give proper credit, the find function I took from this answer.
This should do it for you:
for file in `find image -iname "*.png"`
do
    newfile=`echo "$file" | sed 's=/=-=g'`
    mv -v "$file" "$newfile"
done
The backtick find command will expand to a list of files ending with ".png"; the -iname performs the search in a case-independent manner.
The sed command will replace all slashes with dashes, resulting in the new target name.
The mv does the heavy lifting. The -v is optional and will cause a verbose move.
To debug, you can put an echo statement in front of the mv.

Find files with same name but different content

I need to find files with the same name but different content in a Linux folder structure with a lot of files.
Something like this does the job partially; how do I narrow it down to the files whose content actually differs?
#!/bin/sh
dirname=/path/to/directory
find $dirname -type f | sed 's_.*/__' | sort | uniq -d |
while read fileName
do
    find $dirname -type f | grep "$fileName"
done
(How to find duplicate filenames (recursively) in a given directory? BASH)
Thanks so much !
The first question is, how can you determine whether two files have the same content?
One obvious possibility is to read (or mmap) both files and compare them a block at a time. On some platforms, a stat is a lot faster than a read, so you may want to first compare sizes. And there are other optimizations that might be useful, depending on what you're actually doing (e.g., if you're going to run this thousands of times, and most of the files are the same every time, you could hash them and cache the hashes, and only check the actual files when the hashes match). But I doubt you're too worried about that kind of performance tweak if your existing code is acceptable (since it searches the whole tree once for every file in the tree), so let's just do the simplest thing.
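As a concrete illustration of the compare-sizes-first idea, here is a minimal shell sketch for two files; stat -c %s is GNU stat (BSD/macOS would need stat -f %z instead), and the function name is just a placeholder:
same_size_and_content() {
    # cheap size comparison first; byte-by-byte cmp only if the sizes match
    [ "$(stat -c %s "$1")" = "$(stat -c %s "$2")" ] && cmp -s "$1" "$2"
}
# usage: same_size_and_content file1 file2 && echo "identical"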
Here's one way to do it in Python:
#!/usr/bin/env python3
import sys

def readfile(path):
    with open(path, 'rb') as f:
        return f.read()

contents = [readfile(fname) for fname in sys.argv[1:]]
sys.exit(all(content == contents[0] for content in contents[1:]))
This will exit with code 1 if all files are identical, code 0 if any pair of files are different. So, save this as allequal.py, make it executable, and your bash code can just run allequal.py on the results of that grep, and use the exit value (e.g., via $?) to decide whether to print those results for you.
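For example, because of that inverted sense, a bash check would read like this (the .off names are placeholders, and allequal.py is assumed to be executable in the current directory):
if ./allequal.py a.off b.off c.off; then
    echo "at least one file differs"
else
    echo "all files are identical"
fi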
I am facing the same problem as described in the question. In a large directory tree, some files have the same name and either same content or different content. The ones where the content differs need human attention to decide how to fix the situation in each case. I need to create a list of these files to guide the person doing this.
The code in the question and the code in abernet's response are both helpful. Here is how one would combine them: store the Python code from abernet's response in some file, e.g. /usr/local/bin/do_these_files_have_different_content:
sudo tee /usr/local/bin/do_these_files_have_different_content <<EOF
#!/usr/bin/env python3
import sys

def readfile(path):
    with open(path, 'rb') as f:
        return f.read()

contents = [readfile(fname) for fname in sys.argv[1:]]
sys.exit(all(content == contents[0] for content in contents[1:]))
EOF
sudo chmod a+x /usr/local/bin/do_these_files_have_different_content
Then extend the bash code from Illusionist's question to call this program when needed, and react to its outcome:
#!/bin/sh
dirname=$1
find $dirname -type f | sed 's_.*/__' | sort | uniq -d |
while read fileName
do
    if do_these_files_have_different_content $(find $dirname -type f | grep "$fileName")
    then
        find $dirname -type f | grep "$fileName"
        echo
    fi
done
This will write to stdout the paths of all files with the same name but different content. Groups of files with the same name but different content are separated by empty lines. I store the shell script in /usr/local/bin/find_files_with_same_name_but_different_content and invoke it as
find_files_with_same_name_but_different_content /path/to/my/storage/directory

Copy the nth column of all the files in a directory into a single file

I've a directory containing many .csv files. How can I extract the nth column of every file into a new file column-wise?
For example:
File A:
111,222,333
111,222,333
File B:
AAA,BBB,CCC
AAA,BBB,CCC
File C:
123,456,789
456,342,122
and so on...
If n = 2, I want my resultant file to be:
222,BBB,456,...
222,BBB,342,...
where ... represents that there will be as many columns as the number of files in the directory.
My try so far:
#!/bin/bash
for i in `find ./ -iname "*.csv"`
do
    awk -F, '{ print $2 }' < $i >> result.csv    ## This would append row-wise, not column-wise.
done
UPDATE:
I'm not trying to just join two files. There are 100s of files in a particular directory, and I want to copy the nth column of all the files into a single file. I gave two files as an example to show how I want the data to look if there were only two files.
As pointed out in the comments, joining two files is trivial, but joining multiple files may not be that easy, which is the whole point of my question. Would Python help to do this job?
Building on triplee's solution, here's a generic version which uses eval:
eval paste -d, $(printf "<(cut -d, -f2 %s) " *.csv)
I'm not too fond of eval (always be careful when using it), but it has its uses.
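To see what eval actually ends up running, suppose the directory contains three hypothetical files a.csv, b.csv and c.csv:
printf "<(cut -d, -f2 %s) " a.csv b.csv c.csv
# prints: <(cut -d, -f2 a.csv) <(cut -d, -f2 b.csv) <(cut -d, -f2 c.csv)
# so the eval line is equivalent to:
# paste -d, <(cut -d, -f2 a.csv) <(cut -d, -f2 b.csv) <(cut -d, -f2 c.csv)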
Hmm. My first thought is to have both an outer and inner loop. The outer loop would be a counter on line number. The inner loop would go through the csv files. You'd need to use head/tail in the inner loop to get the correct line number so you could grab the right field.
An alternative is to use the one loop you have now but write each line to a separate file and then merge them.
Neither of these seems ideal. Quite honestly, I'd do this in Perl so you could use an actual in-memory data structure and avoid the need for complex logic.
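For what it's worth, a rough bash sketch of that outer/inner loop idea with n = 2 might look like the following; it is slow (every file is reread for every line), it assumes all files have the same number of lines, and result.csv is just the output name from the question:
set -- *.csv                                  # the input files
total=$(wc -l < "$1")                         # assume every file has this many lines
for ((i = 1; i <= total; i++)); do
    row=
    for f in "$@"; do
        field=$(head -n "$i" "$f" | tail -n 1 | cut -d, -f2)
        row=${row:+$row,}$field               # build one comma-separated output row
    done
    printf '%s\n' "$row"
done > result.csv                             # move this out of the way before re-running, or *.csv will pick it up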
This one-liner should work for two files:
awk -F, -v OFS="," 'NR==FNR{a[NR]=$2;next}{print a[FNR],$2}' file1 file2
Assuming Bash process substitutions are acceptable (i.e. you don't require the solution to be portable to systems where Bash is not available);
paste -d, <(cut -d, -f2 file1) <(cut -d, -f2 file2) <(cut -d, -f2 file3) # etc
A POSIX solution requires temporary files instead.
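A minimal sketch of that temporary-file variant might look like this; mktemp -d is not in strict POSIX but is widely available, and column 2 plus the result.csv name are assumptions carried over from the question:
tmpdir=$(mktemp -d) || exit 1
for f in *.csv; do
    cut -d, -f2 "$f" > "$tmpdir/$f"           # one single-column file per input
done
paste -d, "$tmpdir"/*.csv > result.csv
rm -rf "$tmpdir"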
