Script to rearrange files into correct folders - Python

I have a CSV file with three columns: the first column has two distinct entries, bad or good; the distinct entries in column 2 are learn, query and test; and the third column contains file paths indicating where to find each file.
bad test vff/v1/room_10-to-room_19_CDFFN5D5_x_0000
bad test vff/v1/room_10-to-room_19_BVFGFGN5D5_x_0023
bad learn vff2/v3/room_01-to-room_02_ERTY8LOK_x_00039
bad learn vff/v3/room_01-to-room_02_TRT8LOK_x_00210
bad query vff/v3/room_16-to-room_08_56TCS95_y_00020
bad query vff2/v3/room_16-to-room_08_856C6S95_y_00201
good test person/room_44/exit_call_room__5818
good test person/room_34/cleaning_pan__812
good learn person/room_43/walking_in_cafe_edited__717
good learn person/room_54/enterit_call_room__387
good query person/room_65/talki_speech_cry__1080
good query person/room_75/walking_against_wall__835
Using this CSV, I want to create three folders based on column 2. So basically, use column 2 to create three folders, namely test, learn and query. Within each of these 3 folders, I want to create two folders, i.e. bad and good, based on column 1. Then I want to be able to pull the data using column 3 and place the respective files in these defined folders. Is there a Python or command line script that can do this?

Assuming this CSV file is named file.csv:
#!/bin/bash
FILE="file.csv"

# Create directory structure
for C2 in `cat ${FILE} | cut -f 2 -d ',' | sort -u`
do
    for C1 in `cat ${FILE} | cut -f 1 -d ',' | sort -u`
    do
        mkdir -p "${C2}/${C1}"
    done
done

# Move files
while IFS= read -r line
do
    file="$(echo $line | cut -f 3 -d ',' | tr -d ' ')"
    dir="$(echo $line | cut -f 2 -d ',' | tr -d ' ')"
    dir+="/$(echo $line | cut -f 1 -d ',')"
    mv "${file}" "${dir}"
done < "${FILE}"
Some things that are happening in this bash script:
cut: this command is very useful for selecting the n'th field from a delimiter-separated list. In this instance we are working with a CSV, so you will see cut -d ',' to specify a comma as the delimiter.
Creating the directory structure: column 2 is the parent directory and column 1 is the child directory, thus the cut -f 2 list is the outer for loop and cut -f 1 is the inner for loop.
sort -u removes repeated occurrences of a string. This allows us to iterate through all the distinct entries for a given column.
Moving the files: every line in file.csv contains a file that needs to be moved, hence the iteration through each line in the file. The directory we created earlier is then reconstructed from columns 2 and 1, and the file is moved to its new home.
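Since the question also asks for a Python option, here is a minimal sketch of the same idea using only the standard library; it assumes the file really is comma-delimited (as the shell script above does) and that the paths in column 3 are valid relative to the current directory:
#!/usr/bin/env python3
import csv
import os
import shutil

with open("file.csv", newline="") as fh:
    for row in csv.reader(fh):
        label, split, path = (field.strip() for field in row[:3])  # bad/good, test/learn/query, file path
        dest = os.path.join(split, label)   # e.g. test/bad
        os.makedirs(dest, exist_ok=True)    # create the folder pair if missing
        shutil.move(path, dest)             # move the file into it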

Related

Extracting all JavaScript filenames from a log file using bash script

I have 4 differently named log files, all with .txt extensions. I need to write a bash script that extracts JavaScript file names from any of these log files regardless of their names. The output of the script should not include the path, and has to be unique and sorted.
After some research I came up with this:
cat logfile1.txt | grep '[^.(]*\.js' | awk -F " " '{print $7}' | sort | uniq -c| sort -nr
This code only does half the job:
PRO: It does extract any JS, sorts it, and gives unique results.
CON: I need this in a file.sh, not a command line as it is now. Also, I'm getting the entire path to the JS file; I only need the file name, e.g. jquery.js.
I tried adding grep -v "*/name-of-path-before-JS" to stop the result from giving me the full path, but that isn't working.
I found someone who made something kind of similar using Python:
source
filenames = set()
with open(r"/home/filelog.txt") as f:
    for line in f:
        end = line.rfind(".js") + 3  # 3 = len(".js")
        start = line.rfind("/", 0, end) + 1  # 1 = len("/")
        filename = line[start:end]
        if filename.endswith(".js"):
            filenames.add(filename)
for filename in sorted(filenames, key=str.lower):
    print(filename)
Although it is missing the sort and uniq options when giving the output, it does give the results by only printing filename.js and not the whole path, unlike the command line I made. Also, I want to pass the path to the log .txt file while running the script, not have it hard-coded as in the Python script above.
Example:
$./LogReaderScript.sh File-log.txt
Would you please try the shell script LogReaderScript.sh:
#!/bin/bash
if [[ $# -eq 0 ]]; then         # if no filenames are given
    echo "usage: $0 logfile .." # then show the usage and abort
    exit 1
fi
grep -hoE "[^/]+\.js" "$@" | sort | uniq -c | sort -nr
By setting the file as executable with chmod +x LogReaderScript.sh,
you can invoke:
./LogReaderScript.sh File-log.txt
If you want to process multiple files at a time, you can also say something
like:
./LogReaderScript.sh *.txt
The -o option tells grep to print only the matched substrings, instead of printing the whole matched line.
The -E option specifies extended regex as the pattern syntax.
The -h option suppresses the prefixed filenames in the output if multiple files are given.
The pattern (regex) [^/]+\.js matches a sequence of any characters other than a slash, followed by the extension .js. It will match the target filenames.
"$@" is expanded to the filename(s) passed as arguments to the script.
There's really no need to have a script, as you can do the job with a one-liner. Since you've mentioned you have multiple log files to parse, I'm assuming this is a task you're doing on a regular basis.
In this case just define an alias in your .bashrc file with this oneliner:
cat $1 | awk '{print $7}' | grep '.js' | awk -F\/ '{print $NF}' | sort | uniq
Let's say you've created the alias parser then you'd just have to invoke parser /path/to/logfile.log
With the example logfile you've provided above, the output is:
➜ ~ cat logfile.txt | awk '{print $7}' | grep '.js' | awk -F\/ '{print $NF}' | sort | uniq
jquery.js
jquery.jshowoff.min.js
jshowoff.css
Explanation:
cat is used to read the file and pipe its content into...
awk, which extracts the 7th space-separated field from the file; since those are Apache access logs and you're searching for the requested file, the seventh field is what you need
grep extracts only the JavaScript files, i.e. those ending with the .js extension
awk is used again to print only the file name; we're defining a custom field separator this time with the -F flag, and executing the print command with the $NF argument, which instructs awk to print only the last field
sort and uniq are self-explanatory: we're sorting the output, then printing only the first occurrence of each repeated value.
jquery.jshowoff.min.js looked like bogus to me and I suspected I did something wrong with my commands, but it's an actual line (280) in your logfile:
75.75.112.64 - - [21/Apr/2013:17:32:23 -0700] "GET /include/jquery.jshowoff.min.js HTTP/1.1" 200 2553 "http://random-site.com/" "Mozilla/5.0 (iPod; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A403 Safari/8536.25" "random-site.com"
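For completeness, the Python approach mentioned in the question can be adapted to take the log file path(s) as command-line arguments and to emit sorted, unique counts; this is only a sketch that reuses the question's basename-slicing logic, not a drop-in replacement for the shell versions above:
#!/usr/bin/env python3
import sys
from collections import Counter

counts = Counter()
for logfile in sys.argv[1:]:                # log paths passed on the command line
    with open(logfile) as f:
        for line in f:
            end = line.rfind(".js") + 3     # 3 = len(".js")
            start = line.rfind("/", 0, end) + 1
            name = line[start:end]
            if name.endswith(".js"):
                counts[name] += 1           # count occurrences per basename
for name, n in counts.most_common():        # most frequent first, like sort -nr
    print(n, name)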

Bash iterate through every file but start from 2nd file and get names of 1st and 2nd files

I have files all named following the convention:
xxx_yyy_zzz_ooo_date_ppp.tif
I have a Python function that needs 3 inputs: the dates of two consecutive files in my folder, and an output name generated from those two dates.
I created a loop that:
goes through every file in the folder
grabs the date of the file and assigns it to a variable ("file2", 5th field in the file name)
runs a Python function that takes as inputs: date of file 1, date of file 2, output name
How could I make my loop start at the 2nd file in my folder and grab the name of the previous file to assign it to a variable "file1" (so far it only grabs the date of one file at a time)?
#!/bin/bash
output_path=path # Folder in which my output will be saved
for file2 in *; do
    f1=$( "$file1" | awk -F'[_.]' '{print $5}' ) # File before the one over which the loop is running
    f2=$( "$file2" | awk -F'[_.]' '{print $5}' ) # File 2 over which the loop is running
    outfile=$output_path+$f1+$f2
    function_python -$f1 -$f2 -$outfile
done
You could make it work like this:
#!/bin/bash
output_path="<path>"
readarray -t files < <(find . -maxdepth 1 -type f | sort) # replaces '*'
for ((i=1; i < ${#files[@]}; i++)); do
    f1=$( echo "${files[i-1]}" | awk -F'[_.]' '{print $5}' ) # previous file
    f2=$( echo "${files[i]}" | awk -F'[_.]' '{print $5}' )   # current file
    outfile="${output_path}/${f1}${f2}"
    function_python -"$f1" -"$f2" -"$outfile"
done
Not exactly sure about the call to function_python though, I have never seen that tool before (can't ask since I can't comment yet).
Read the files into an array and then iterate from index 1 instead of over the whole array.
#!/bin/bash
set -euo pipefail
declare -r output_path='/some/path/'
declare -a files fsegments
for file in *; do files+=("$file"); done
declare -ar files # optional
declare -r file1="${files[0]}"
IFS=_. read -ra fsegments <<< "$file1"
declare -r f1="${fsegments[4]}"
for file2 in "${files[@]:1}"; do # from 1
    IFS=_. read -ra fsegments <<< "$file2"
    f2="${fsegments[4]}"
    outfile="${output_path}${f1}${f2}"
    function_python -"$f1" -"$f2" -"$outfile" # weird format!
done
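Since the downstream step is a Python function anyway, the consecutive-pairing logic the question describes can also be sketched directly in Python; the glob pattern, the field position and the output naming below are assumptions, and function_python is the asker's own tool:
#!/usr/bin/env python3
import glob

output_path = "/some/path/"                      # assumed output folder
files = sorted(glob.glob("*.tif"))               # same ordering as the shell glob

for previous, current in zip(files, files[1:]):  # start at the 2nd file, keep the previous one
    f1 = previous.replace(".", "_").split("_")[4]  # 5th field of the previous name (the date)
    f2 = current.replace(".", "_").split("_")[4]   # 5th field of the current name
    outfile = output_path + f1 + f2
    print(f1, f2, outfile)                       # here you would invoke function_python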

How do I write a for-loop so a program reiterates itself for a set of 94 DNA samples?

I have written some code in a bash shell (so I can submit it to my university's supercomputer) to edit out contaminant sequences from a batch of DNA extracts I have. Essentially what this code does is take the sequences from the negative extraction blank I did (A1-BLANK) and subtract them from all of the other samples.
I have figured out how to get this to work with individual samples, but I'm attempting to write a for loop so that the little chunks of code will reiterate themselves for each sample, the outcome being a .sam file with a unique name for each sample in which both the forward and reverse reads are merged and edited for contamination. I have checked Stack Overflow extensively for help with this specific problem, but haven't been able to apply related answered questions to my code.
Here's an example of part of what I'm trying to do for an individual sample, named F10-61C-3-V4_S78_L001_R1_001.fastq:
bowtie2 -q --end-to-end --very-sensitive \ ##bowtie2 is a program that examines sequence similarity compared to a standard
-N 0 -L 31 --time --reorder \
-x A1-BlankIndex \ ##This line compares the sample to the negative extraction blank
-1 /file directory/F10-61C-3-V4_S78_L001_R1_001.fastq
-2 /file directory/F10-61C-3-V4_S78_L001_R2_001.fastq \ ##These two lines above merge the forward and reverse reads of the DNA sequences within the individual files into one file
-S 61C-3.sam ##This line renames the merged and edited file and transforms it into a .sam file
Here's what I've got so far for this little step of the process:
for file in /file directory/*.fastq
do
bowtie2 -q --end-to-end --very-sensitive \
-N 0 -L 31 --time --reorder \
-x A1-BlankIndex \
-1 /file directory/*.fastq
-2 /file directory/*.fastq \
-S *.sam
done
In my resulting slurm file, the error I'm getting right now has to do with the -S command. I'm not sure how to give each merged and edited sample a unique name for the .sam file. I'm new to writing for loops in python (my only experience is in R) and I'm sure it's a simple fix, but I haven't been able to find any specific answers to this question.
Here's a first try. Note I assume the entire fragment between do and done is one command, and therefore needs continuation markers (\).
Also note that in my example "$file" occurs twice. I feel a bit uneasy about this, but you seem to explicitly need this in your described example.
And finally note I am giving the sam file just a numeric name, because I don't really know what you would like that name to be.
I hope this provides enough information to get you started.
#!/bin/bash
i=0
for file in /file/directory/*.fastq
do
    bowtie2 -q --end-to-end --very-sensitive \
        -N 0 -L 31 --time --reorder \
        -x A1-BlankIndex \
        -1 "$file" \
        -2 "$file" \
        -S "$i".sam
    i=$((i+1))
done
This may work like your example, but automatically selects the output file name reference with a RegEx:
#!/usr/bin/env bash
input_samples='/input_samples_directory'
output_samples='/output_merged_samples_directory'

while IFS= read -r -d '' R1_fastq; do
    # Deduce R2 sample from R1 sample file name
    R2_fastq="${R1_fastq/_R1_/_R2_}"
    # RegEx match capture group in () for the output sample reference
    [[ $R1_fastq =~ [^-]+-([[:digit:]]+[[:alpha:]]-[[:digit:]]).* ]]
    # Construct the output sample file path with the reference captured
    # by the RegEx above
    sam="$output_samples/${BASH_REMATCH[1]}.sam"
    # Perform the merging
    bowtie2 -q --end-to-end --very-sensitive \
        -N 0 -L 31 --time --reorder \
        -x A1-BlankIndex \
        -1 "$R1_fastq" \
        -2 "$R2_fastq" \
        -S "$sam"
done < <(find "$input_samples" -maxdepth 1 -type f -name '*_R1_*.fastq' -print0)
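If a pure-Python driver is more comfortable than bash, the same pairing and naming logic can be sketched with subprocess; the directory paths and the sample-reference regex below are assumptions modelled on the script above:
#!/usr/bin/env python3
import glob
import os
import re
import subprocess

input_samples = "/input_samples_directory"           # assumed input folder
output_samples = "/output_merged_samples_directory"  # assumed output folder

for r1 in sorted(glob.glob(os.path.join(input_samples, "*_R1_*.fastq"))):
    r2 = r1.replace("_R1_", "_R2_")                  # matching reverse-read file
    # Pull the short sample reference (e.g. 61C-3) out of the file name.
    match = re.search(r"[^-]+-(\d+[A-Za-z]-\d)", os.path.basename(r1))
    sam = os.path.join(output_samples, match.group(1) + ".sam")
    subprocess.run([
        "bowtie2", "-q", "--end-to-end", "--very-sensitive",
        "-N", "0", "-L", "31", "--time", "--reorder",
        "-x", "A1-BlankIndex",
        "-1", r1, "-2", r2, "-S", sam,
    ], check=True)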

find and delete duplicate content in several files

I have many files (ACLs) containing IPs, MACs, hostnames and other data.
Important: The problem is about "duplicate content" in files, not "duplicate lines".
Example (only one file, but I have several ACLs):
192.168.1.20;08:00:00:00:00:01;peter
192.168.1.21;08:00:00:00:00:01;android
192.168.1.21;08:00:00:00:00:02;john
192.168.1.22;08:00:00:00:00:03;julia
192.168.1.23;08:00:00:00:00:04;android
These are the lines with duplicate content, and this is what I want a command to tell me:
192.168.1.20;08:00:00:00:00:01;peter
192.168.1.21;08:00:00:00:00:01;android
192.168.1.21;08:00:00:00:00:02;john
192.168.1.23;08:00:00:00:00:04;android
The duplicate content in the lines above is 08:00:00:00:00:01, 192.168.1.21 and android.
Command I use to find duplicates in the acls folder (doesn't work):
cat /home/user/files/* | sort | uniq -c | head -20
I've tried with this Python script, but the results are not as expected.
First (at least) I want to detect the lines with duplicate content, and (if possible) delete those lines.
Thanks
Considering your comments about what you consider as duplicate, this should be close:
$ a=$(cut -d';' -f1 c.txt |sort |uniq -d)
$ b=$(cut -d';' -f2 c.txt |sort |uniq -d)
$ c=$(cut -d';' -f3 c.txt |sort |uniq -d)
$ echo "$a:$b:$c"
192.168.1.21:08:00:00:00:00:01:android
But in reality we are talking about three different situations.
Variable a contains only the duplicate IPs, ignoring the rest of the fields.
Variable b contains only the duplicate MACs, ignoring the rest of the fields.
Variable c contains only the duplicate host names, ignoring the rest of the fields.
I don't see much meaning in this information on its own.
The only explanation is that you can use a grep later like this:
$ grep -v -e "$a" -e "$b" -e "$c" c.txt
192.168.1.22;08:00:00:00:00:03;julia;222222
This gets the lines from your original file that have a completely unique IP, a completely unique MAC and a completely unique host name, i.e. values that do not appear in any other line.
Is this what you want to achieve?
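If the goal is to actually delete every line that shares an IP, MAC or host name with another line across all the ACL files, a two-pass sketch along the same idea could look like this (the glob path is an assumption; it prints the surviving lines rather than rewriting the files):
#!/usr/bin/env python3
import glob
from collections import Counter

files = glob.glob("/home/user/files/*")        # assumed location of the ACLs

# First pass: count how often each field value occurs anywhere.
counts = [Counter(), Counter(), Counter()]     # one counter per column
for path in files:
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split(";")
            for i, value in enumerate(fields[:3]):
                counts[i][value] += 1

# Second pass: keep only lines whose IP, MAC and host name are all unique.
for path in files:
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split(";")
            if fields[0] and all(counts[i][v] == 1 for i, v in enumerate(fields[:3])):
                print(line.rstrip("\n"))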

Better way in Python

Is there a better / simpler way to accomplish this in Python?
I have a bash script that calculates CPS (calls per second). It runs fine on small files but poorly on large ones. It basically takes the file that we are calculating the CPS for and extracts field 7 which is the INVITING time, sorts, and only gets the unique values. This is all put in a tmp.file. The script then cats the original file and greps for each of the values in the tmp.file, counts them, and outputs the time and count to a final file.
#!/bin/bash
cat $1 | cut -d "," -f 7 | sort | uniq > /tmp/uniq.time.txt;
list="/tmp/uniq.time.txt";
while read time
do
    VALUE1=`cat $1 | grep "$time" | wc -l`;
    echo $VALUE1 >> /tmp/cps.tmp;
done < $list;
rm /tmp/cps.tmp;
I think what you're trying to do is simply:
cat $1 |cut -d "," -f 7 | sort | uniq -c
note: if you want to swap the order of the fields:
| awk -F " *" '{print $3, $2}'
This can certainly be done more easily and efficiently in Python:
import sys
from itertools import groupby

with open(sys.argv[1]) as f:
    times = [line.split(",")[6] for line in f]
times.sort()
for time, occurrences in groupby(times):
    print time, len(list(occurrences))
The problem with your approach is that you have to search the whole file for each unique time. You could write this more efficiently even in bash, but I think it's more convenient to do this in Python.
Reading CSV files:
http://docs.python.org/library/csv.html
Uniquifying:
set(nonUniqueItems)
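Putting those two pointers together, a minimal sketch (Python 3 syntax) that uses the csv module and collections.Counter, assuming field 7 is still the INVITING time as in the original script:
#!/usr/bin/env python3
import csv
import sys
from collections import Counter

with open(sys.argv[1], newline="") as f:
    counts = Counter(row[6] for row in csv.reader(f))  # field 7 holds the time

for time in sorted(counts):
    print(time, counts[time])                          # time followed by its call count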
