Remove Files Based On Partial Match In Unix - python

Let's say I have a file named duplicates.txt which contains the following:
ID-32532
ID-78313
ID-89315
I also have a directory Fastq of files with the following names:
ID-18389_Feb92003_R1.fastq
ID-18389_Feb92003_R2.fastq
ID-32532_Feb142003_R1.fastq
ID-32532_Feb142003_R2.fastq
ID-48247_Mar202004_R1.fastq
ID-48247_Mar202004_R2.fastq
I want a command that searches duplicates.txt and removes any file in the Fastq directory whose name partially matches one of the listed IDs. Based on the example above, this would remove the files ID-32532_Feb142003_R1.fastq and ID-32532_Feb142003_R2.fastq.
What Unix command should I use, or, if need be, how could I write a script in Python?

Here's a little bash function to do it:
lrmduplicates() {
    while read -r dupe; do
        echo removing "$dupe"
        # fine-tune with ls first...
        # ls Fastq/"$dupe"*
        rm Fastq/"$dupe"*
        # dupes file: don't forget a line feed after the 3rd pattern,
        # i.e. end on an empty line.
    done < duplicates.txt
}
For extra bonus, suppress the error when there is no match. I'm not sure how to do that myself: rm -f and rm 2>/dev/null didn't do it (zsh on macOS). The "no matches found" error is raised by zsh's glob expansion before rm even runs, which is why redirecting rm's stderr doesn't help.
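Since the asker is open to Python, here is a minimal sketch of the same loop using the standard glob module, with the file and directory names taken from the question. It also sidesteps the no-match problem, because glob.glob() simply returns an empty list when nothing matches.
import glob
import os
# Remove every file in Fastq/ whose name starts with an ID
# listed in duplicates.txt (one ID per line).
with open('duplicates.txt') as f:
    for line in f:
        dupe = line.strip()
        if not dupe:
            continue  # ignore blank lines
        for path in glob.glob(os.path.join('Fastq', dupe + '*')):
            print('removing', path)
            os.remove(path)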

In Unix, just replace the variable characters with ? (shell glob, one per character) or .* (regular expression syntax):
duplicates.txt
remove ID-?????
Fastq
remove ID-?????_????????_??.fastq
remove ID-.*fastq
Note that the date field varies in length (Feb92003 vs. Feb142003), so a * is safer there than a fixed run of ?.
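To preview what such a pattern matches before deleting anything, Python's fnmatch module implements the same ?/* glob rules; a small sketch against the sample file names:
import fnmatch
files = [
    'ID-18389_Feb92003_R1.fastq',
    'ID-32532_Feb142003_R1.fastq',
    'ID-48247_Mar202004_R1.fastq',
]
# '?' matches exactly one character, '*' matches any run of characters.
for name in files:
    print(name, fnmatch.fnmatch(name, 'ID-?????_*_R?.fastq'))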

Related

How can I concatenate multiple text or xml files but omit specific lines from each file?

I have a number of xml files (which can be considered as text files in this situation) that I wish to concatenate. Normally I think I could do something like this from a Linux command prompt or bash script:
cat somefile.xml someotherfile.xml adifferentfile.xml > out.txt
Except that in this case, I need to copy the first file in its entirety EXCEPT for the very last line, but in all subsequent files omit exactly the first four lines and the very last line (technically, I do need the last line from the last file but it is always the same, so I can easily add it with a separate statement).
In all these files the first four lines and the last line are always the same, but the contents in between varies. The names of the xml files can be hardcoded into the script or read from a separate data file, and the number of them may vary from time to time but always will number somewhere around 10-12.
I'm wondering what would be the easiest and most understandable way to do this. I think I would prefer either a bash script or maybe a python script, though I generally understand bash scripts a little better. What I can't get my head around is how to trim off just those first four lines (on all but the first file) and the last line of every file. My suspicion is there's some Linux command that can do this, but I have no idea what it would be. Any suggestions?
sed '$d' firstfile > out.txt
sed --separate '1,4d; $d' file2 file3 file4 >> out.txt
sed '1,4d' lastfile >> out.txt
It's important to use the --separate (or shorter -s) option so that the range statements 1,4 and $ apply to each file individually.
From GNU sed manual:
-s, --separate
By default, sed will consider the files specified on the command line as a single continuous long stream. This GNU sed extension allows the user to consider them as separate files.
Do it in two steps:
use the head command (to get the lines you want)
use cat to combine the pieces
You could use temp files or bash trickery; a Python sketch of the same idea follows below.
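Since the asker would also accept Python, here is a minimal sketch of the trimming logic, assuming the file names are hardcoded as in the cat example above:
# Keep the first file minus its last line; for every later file,
# also drop the first four lines. The shared last line can be
# appended separately afterwards.
files = ['somefile.xml', 'someotherfile.xml', 'adifferentfile.xml']
with open('out.txt', 'w') as out:
    for index, name in enumerate(files):
        with open(name) as f:
            lines = f.readlines()
        lines = lines[:-1]     # drop the last line of every file
        if index > 0:
            lines = lines[4:]  # also drop the first four lines of later files
        out.writelines(lines)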

Bash/Python to gather several paths into one, replacing the file name with a '*' character

I'm writing a bash script to generate a .spec file (for building an RPM) automatically. I read all the files in the directory (which I want to turn into an RPM package) and write the paths of the files to be installed into the .spec file, and I realize I need to shorten them. An example:
/tmp/a/1.jpg
/tmp/a/2.conf
/tmp/a/b/srf.cfg
/tmp/a/b/ssp.cfg
/tmp/a/conf_web_16.2/c/.htaccess
/tmp/a/conf_web_16.2/c/.htaccess.WebProv
/tmp/a/conf_web_16.2/c/.htprofiles
=> What I want to get:
/tmp/a/*.jpg
/tmp/a/*.conf
/tmp/a/b/*.cfg
/tmp/a/conf_web_16.2/c/*
/tmp/a/conf_web_16.2/c/*.WebProv
Please give me some advice about my problem; I'd welcome ideas in bash, Python, or C. Thank you in advance.
To convert any file name that contains a dot at a position other than the first character into a wildcard covering everything up to just before the dot, and any remaining file names into a bare wildcard:
sed -e 's%/[^/][^/]*\(\.[^./]*\)$%/*\1%' -e t -e 's%/[^/]*$%/*%'
The behavior of sed is to read its input one line at a time, and execute the script of commands on each in turn. The s%foo%bar% substitution command replaces a regex match with a string, and the t command causes the script to skip further substitutions if one was already performed on the current line. (I'm simplifying somewhat.) The first regex matches file names which contain a dot in a position other than the first, and captures the match from the dot through the end in a back reference which is used in the substitution as well (that's the \1). The second is applied to any remaining file names, because of the t command in between.
The result will probably need to be piped to sort -u to remove any duplicates.
If you don't have a list of the file names, you can use find to pipe in a listing.
find . -type f | sed ... | sort -u
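If you would rather stay in Python, the same two-step substitution can be mirrored with re.sub; a sketch using the sample paths, with the regexes translated from the sed expressions above:
import re
paths = [
    '/tmp/a/1.jpg',
    '/tmp/a/b/srf.cfg',
    '/tmp/a/conf_web_16.2/c/.htaccess',
    '/tmp/a/conf_web_16.2/c/.htaccess.WebProv',
]
patterns = set()
for p in paths:
    # First try: the basename has a dot past its first character.
    q = re.sub(r'/[^/]+(\.[^./]*)$', r'/*\1', p)
    if q == p:
        # Fallback: replace the whole basename with '*'.
        q = re.sub(r'/[^/]*$', '/*', p)
    patterns.add(q)
for pattern in sorted(patterns):  # sorted and de-duplicated, like sort -u
    print(pattern)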

Modify string in bash to contain new line character?

I am using a bash script to call the Google API's upload_video.py (https://developers.google.com/youtube/v3/guides/uploading_a_video).
I have an mp4 called output.mp4 which I would like to upload.
The problem is I cannot get my array to work how I would like.
The newline character is "required" because my arguments to the Python script contain spaces.
Here is a simplified version of my bash script:
# Operator may change these
hold=100
location="Foo, Montana "
declare -a file_array=("unique_ID_0" "unique_ID_1")
upload_file=upload_file.txt
upload_movie=output.mp4
# Hit enter at end b/c \n not recognized
upload_title=$location' - '${file_array[0]}' - Hold '$hold' Sweeps
'
upload_description='The spectrum recording was made in at '$location'.
'
# Overwrite with 1st call > else append >>
echo "$upload_title" > $upload_file
echo "$upload_description" >> $upload_file
# Load each line of text file into array
IFS=$'\n'
cmd_google=$(<$upload_file)
unset IFS
nn=1
for i in "${cmd_google[@]}"
do
    echo "$i"
    # Delete last character: \n
    #i=${i[-nn]%?}
    #i=${i: : -nn}
    #i=${i::${#i}-nn}
    i=${i%?}
    #i=${i#"\n"}
    #i=${i%"\n"}
    echo "$i"
done
python upload_video.py --file=$upload_movie --title="${cmd_google[0]}" --description="${cmd_google[1]}"
At first I attempted to remove the newline character, but it appears that the enter or \n is not working how I would like: each line is not treated as separate. It writes the title and description as one line.
How do I modify my bash script to recognize a newline character?
This is much simpler than you are making it.
# Operator may change these
hold=100
location="Foo, Montana"
declare -a file_array=("unique_ID_0" "unique_ID_1")
upload_file=upload_file.txt
upload_movie=output.mp4
upload_title="$location - ${file_array[0]} - Hold $hold Sweeps"
upload_description="The spectrum recording was made in at $location."
cat <<EOF > "$upload_file"
$upload_title
$upload_description
EOF
# ...
readarray -t cmd_google < "$upload_file"
python upload_video.py --file="$upload_movie" --title="${cmd_google[0]}" --description="${cmd_google[1]}"
I suspect the readarray command is all you are really looking for, since much of the above code is simply creating a file that I assume you are receiving already created.
I figured it out with help from chepner's answer. My question hid the fact that I wanted to write newline characters into the video's description.
Instead of adding a newline character in the bash script, it is much easier to keep a text file that contains the correctly formatted description and read it in, then concatenate it with the run-time-specific variables.
In my case the correctly formatted text, including the newline characters, lives in a file called description.txt.
Here is my final version of the script:
# Operator may change these
hold=100
location="Foo, Montana"
declare -a file_array=("unique_ID_0" "unique_ID_1")
upload_title="$location - ${file_array[0]} - Hold $hold Sweeps"
upload_description="The spectrum recording was made in at $location. "
# Read in script which contains newline
temp=$(<description.txt)
# Concatenate them
upload_description="$upload_description$temp"
upload_movie=output.mp4
python upload_video.py --file="$upload_movie" --title="$upload_title" --description="$upload_description"

bash: copy files with the same pattern

I want to copy files scattered in separate directories into a single directory.
find . -name "*.off" > offFile
while read line; do
    cp "${line}" offModels  # offModels is the destination directory
done < offFile
The file offFile has 1831 lines, but after cd offModels, ls | wc -l gives 1827, so I think four files ending in ".off" were not copied.
At first I thought that because I use double quotes in the shell script, files whose names contain a dollar sign, backtick, or backslash might be missed. Then I found one file named $.... But how do I find the other three? After cd offModels and ls > ../File, I wrote a Python script like this:
fname1 = "offFile"  # records the scattered files
with open(fname1) as f1:
    contents1 = f1.readlines()
fname2 = "File"
with open(fname2) as f2:
    contents2 = f2.readlines()
visited = [0] * len(contents1)
for substr in contents2:
    substr = "/" + substr
    for i, string in enumerate(contents1):
        if string.find(substr) >= 0:
            visited[i] = 1
            break
for i, j in enumerate(visited):
    if j == 0:
        print contents1[i]
The output gives four lines, but they are wrong: I can find all four of those files in the destination directory.
Edit
As the comments and answers point out, there are four files whose names duplicate those of four others. One thing that interests me now is that, with the bash script I used, the file named $CROSS.off was copied. That really surprised me.
Looks like you have files with the same filenames, and cp just overwrites them.
You can use the --backup=numbered option for cp; here is a one-liner:
find -name '*.off' -exec cp --backup=numbered '{}' '/full/path/to/offModels' ';'
The -exec option allows you to execute a command on every file matched; you should use {} to get the file's name and end the command with ; (usually written as \; or ';', because bash treats semicolons as command separators).
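To see which basenames collide before copying (and explain the 1831 vs. 1827 discrepancy), a quick Python check might look like this; a sketch, assuming it is run from the same top-level directory:
import os
from collections import Counter
# Count how often each basename occurs among the .off files.
counts = Counter()
for root, dirs, files in os.walk('.'):
    for name in files:
        if name.endswith('.off'):
            counts[name] += 1
# Any basename seen more than once gets overwritten by a plain cp.
for name, n in counts.items():
    if n > 1:
        print(name, n)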

How to use awk if statement and for loop in subprocess.call

Trying to print the filenames of files that don't have 12 columns.
This works at the command line:
for i in *dim*; do awk -F',' '{if (NR==1 && NF!=12)print FILENAME}' $i; done;
When I try to embed this in subprocess.call in a python script, it doesn't work:
subprocess.call("""for %i in (*dim*.csv) do (awk -F, '{if ("NR==1 && NF!=12"^) {print FILENAME}}' %i)""", shell=True)
The first error I received was "Print is unexpected at this time", so I googled and added ^ within the parentheses. The next error was "unexpected newline or end of string", so I googled again and added the quotes around NR==1 && NF!=12. With the current code it prints many lines in each file, so I suspect something is wrong with the if statement. I've used awk and for loops in this style in subprocess.call before, but not combined and with an if statement.
Multiple input files in AWK
In the string you are passing to subprocess.call(), your if statement is evaluating a string (probably not the comparison you want). It might be easier to just simplify the shell command by doing everything in AWK. You are executing AWK for every $i in the shell's for loop. Since you can give multiple input files to AWK, there is really no need for this loop.
You might want to scan through the entire files until you find any line that has other than 12 fields, and not only check the first line (NR==1). In this case, the condition would be only NF!=12.
If you want to check only the first line of each file, then NR==1 becomes FNR==1 when using multiple files. NR is the "number of records" (across all input files) and FNR is "file number of records" for the current input file only. These are special built-in variables in AWK.
Also, the syntax of AWK allows for the blocks to be executed only if the line matches some condition. Giving no condition (as you did) runs the block for every line. For example, to scan through all files given to AWK and print the name of a file with other than 12 fields on the first line, try:
awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv
I have added the .csv to your wildcard *dim* as you had in the Python version. The -F, of course changes the field separator to a comma from the default whitespace. On the first line of each file, AWK checks whether the number of fields NF is 12; if it's not, it executes the block, which prints the FILENAME of the file AWK is currently processing and then skips to the beginning of the next file with nextfile.
Try running this AWK version with your subprocess module in Python:
subprocess.call("""awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv""", shell=True)
The triple quotes make it a literal string. The output of AWK goes to stdout, and I'm assuming you know how to use this in Python with the subprocess module.
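For instance, to capture that output in Python instead of letting it print to the terminal, something like the following should work; a sketch using subprocess.check_output with the same command string:
import subprocess
# Run the AWK one-liner and collect the file names it prints.
out = subprocess.check_output(
    "awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv",
    shell=True)
bad_files = out.decode().splitlines()
print(bad_files)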
Using only Python
Don't forget that Python is itself an expressive and powerful language. If you are already using Python, it may be simpler, easier, and more portable to use only Python instead of a mixture of Python, bash, and AWK.
You can find the names of files (selected from *dim*.csv) with the first line of each file having other than 12 comma-separated fields with:
import glob
files_found = []
for filename in glob.glob('*dim*.csv'):
    with open(filename, 'r') as f:
        firstline = f.readline()
    # no explicit f.close() needed: the 'with' block closes the file
    if len(firstline.split(',')) != 12:
        files_found.append(filename)
print(files_found)
The glob module gives the listing of files matching the wildcard pattern *dim*.csv. The first line of each of these files is read and split into fields separated by commas. If the number of these fields is not 12, it is added to the list files_found.
