Better way in Python

Is there a better / simpler way to accomplish this in Python?
I have a bash script that calculates CPS (calls per second). It runs fine on small files but poorly on large ones. It basically takes the file that we are calculating the CPS for and extracts field 7 which is the INVITING time, sorts, and only gets the unique values. This is all put in a tmp.file. The script then cats the original file and greps for each of the values in the tmp.file, counts them, and outputs the time and count to a final file.
#!/bin/bash
cat $1 |cut -d "," -f 7 | sort |uniq > /tmp/uniq.time.txt;
list="/tmp/uniq.time.txt";
while read time
do
VALUE1=`cat $1 |grep "$time" |wc -l`;
echo $VALUE1 >> /tmp/cps.tmp;
done < $list;
rm /tmp/cps.tmp;

I think what you're trying to do is simply:
cat $1 |cut -d "," -f 7 | sort | uniq -c
note: if you want to swap the order of the fields:
| awk -F " *" '{print $3, $2}'

This can certainly be done easier and more efficiently in Python:
import sys
from itertools import groupby

with open(sys.argv[1]) as f:
    times = [line.split(",")[6] for line in f]
times.sort()
for time, occurrences in groupby(times):
    print(time, len(list(occurrences)))
The problem with your approach is that you have to search the whole file for each unique time. You could write this more efficiently even in bash, but I think it's more convenient to do it in Python.
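For what it's worth, you can also avoid the sort entirely with collections.Counter; a minimal sketch under the same assumptions as above (the filename is the first argument and the INVITING time is the 7th comma-separated field):
import sys
from collections import Counter

# Count each INVITING time (7th comma-separated field) in a single pass
with open(sys.argv[1]) as f:
    counts = Counter(line.split(",")[6] for line in f)

for time in sorted(counts):
    print(time, counts[time])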

Reading CSV files:
http://docs.python.org/library/csv.html
Uniquifying:
set(nonUniqueItems)
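Putting those two pieces together, a minimal sketch (assuming the same comma-separated input with the time in the 7th column):
import csv
import sys

# Collect the unique values of the 7th column using the csv module and a set
unique_times = set()
with open(sys.argv[1], newline="") as f:
    for row in csv.reader(f):
        unique_times.add(row[6])

for time in sorted(unique_times):
    print(time)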

Related

Extracting all JavaScript filenames from a log file using bash script

I have 4 differently named log files, all with txt extensions. I need to write a bash script that extracts JavaScript file names from any of these log files regardless of their names. The output of the script should not include the path, and the names have to be unique and sorted.
After some research I came up with this:
cat logfile1.txt | grep '[^.(]*\.js' | awk -F " " '{print $7}' | sort | uniq -c| sort -nr
This code does only half the job:
PRO: It does extract any JS, sorts it, and gives unique results.
CON: I need this in a file.sh, not a one-liner as it is now. Also, I'm getting the entire path to the JS file; I only need the file name, e.g. jquery.js.
I tried adding grep -v "*/name-of-path-before-JS" to block the result from giving me the full path but that isn't working.
I found someone who made something kind of similar using Python:
source
filenames = set()

with open(r"/home/filelog.txt") as f:
    for line in f:
        end = line.rfind(".js") + 3           # 3 = len(".js")
        start = line.rfind("/", 0, end) + 1   # 1 = len("/")
        filename = line[start:end]
        if filename.endswith(".js"):
            filenames.add(filename)

for filename in sorted(filenames, key=str.lower):
    print(filename)
Although it is missing the sort and uniq options, it does give the results by printing out only filename.js and not the whole path as in the command line I made. Also, I want to pass the path to the log file while running the script, rather than hard-coding it as in the Python script above.
Example:
$./LogReaderScript.sh File-log.txt
Would you please try the shell script LogReaderScript.sh:
#!/bin/bash
if [[ $# -eq 0 ]]; then            # if no filenames are given
    echo "usage: $0 logfile .."    # then show the usage and abort
    exit 1
fi
grep -hoE "[^/]+\.js" "$@" | sort | uniq -c | sort -nr
By setting the file as executable with chmod +x LogReaderScript.sh,
you can invoke:
./LogReaderScript.sh File-log.txt
If you want to process multiple files at a time, you can also say something
like:
./LogReaderScript.sh *.txt
The -o option tells grep to print only the matched substrings, instead of printing the whole matched line.
The -E option specifies that the pattern is an extended regex.
The -h option suppresses the filename prefixes in the output when multiple files are given.
The pattern (regex) [^/]+\.js matches a sequence of characters other than a slash, followed by the extension .js. It will match the target filenames.
"$@" is expanded to the filename(s) passed as arguments to the script.
There's really no need to have a script, as you can do the job with a one-liner. Since you've mentioned you have multiple log files to parse, I'm assuming this is a task you're doing on a regular basis.
In this case, just define an alias in your .bashrc file with this one-liner:
cat $1 | awk '{print $7}' | grep '.js' | awk -F\/ '{print $NF}' | sort | uniq
Let's say you've named the alias parser; then you'd just have to invoke parser /path/to/logfile.log
With the example logfile you've provided above, the output is:
➜ ~ cat logfile.txt | awk '{print $7}' | grep '.js' | awk -F\/ '{print $NF}' | sort | uniq
jquery.js
jquery.jshowoff.min.js
jshowoff.css
Explanation:
cat is used to read the file and pipe its content into..
awk, which extracts the 7th space-separated field from the file; since those are Apache access logs and you're searching for the requested file, the seventh field is what you need
grep extracts only the JavaScript files, i.e. those ending with the .js extension
awk is used again to print only the file name; this time we define a custom field separator with the -F flag and print the $NF argument, which instructs awk to print only the last field
sort and uniq are self-explanatory: we're sorting the output, then printing only the first occurrence of each repeated value.
jquery.jshowoff.min.js looked bogus to me and I suspected I had done something wrong with my commands, but it's an actual line (280) in your logfile:
75.75.112.64 - - [21/Apr/2013:17:32:23 -0700] "GET /include/jquery.jshowoff.min.js HTTP/1.1" 200 2553 "http://random-site.com/" "Mozilla/5.0 (iPod; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A403 Safari/8536.25" "random-site.com"

Script to rearrange files into correct folders

I have a csv file with three columns: the first column has two distinct entries, bad or good; the distinct entries in column 2 are learn, query and test; and the third column contains file path names which indicate where to find the files.
bad test vff/v1/room_10-to-room_19_CDFFN5D5_x_0000
bad test vff/v1/room_10-to-room_19_BVFGFGN5D5_x_0023
bad learn vff2/v3/room_01-to-room_02_ERTY8LOK_x_00039
bad learn vff/v3/room_01-to-room_02_TRT8LOK_x_00210
bad query vff/v3/room_16-to-room_08_56TCS95_y_00020
bad query vff2/v3/room_16-to-room_08_856C6S95_y_00201
good test person/room_44/exit_call_room__5818
good test person/room_34/cleaning_pan__812
good learn person/room_43/walking_in_cafe_edited__717
good learn person/room_54/enterit_call_room__387
good query person/room_65/talki_speech_cry__1080
good query person/room_75/walking_against_wall__835
Using this csv, I wanted to create three folders based on column 2. So basically, use column 2 to create three folders, namely test, learn and query. Within each of these 3 folders, I want to create two folders, i.e. bad and good, based on column 1. Then be able to pull the data using column 3 and place the respective files in these defined folders. Is there a Python or command line script that can do this?
Assuming this csv file is named file.csv
#!/bin/bash
FILE="file.csv"

# Create directory structure
for C2 in `cat ${FILE} | cut -f 2 -d ',' | sort -u`
do
    for C1 in `cat ${FILE} | cut -f 1 -d ',' | sort -u`
    do
        mkdir -p "${C2}/${C1}"
    done
done

# Move files
while IFS= read -r line
do
    file="$(echo $line | cut -f 3 -d ',' | tr -d ' ')"
    dir="$(echo $line | cut -f 2 -d ',' | tr -d ' ')"
    dir+="/$(echo $line | cut -f 1 -d ',')"
    mv "${file}" "${dir}"
done < "${FILE}"
Some things that are happening in this bash script:
cut: this command is very useful for selecting the nth item from a delimiter-separated list. In this instance we are working with a csv, so you will see cut -d ',' to specify a comma as the delimiter.
Creating the directory structure: column 2 is the parent directory and column 1 is the child directory, thus the cut -f 2 list is the outer for loop and cut -f 1 is the inner for loop.
sort -u removes repeated occurrences of a string. This allows us to iterate through all the different entries for a given column.
Moving the files: every line in file.csv contains a file that needs to be moved, thus the iteration through each line in the file. Then the directory we created earlier is extracted from columns 2 and 1, and the file is moved to its new home.
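Since the question also asks about Python, here is a minimal sketch of the same idea with the csv and shutil modules (it assumes the file really is comma-separated with exactly three columns, as described):
import csv
import os
import shutil

# Column 2 (subset) is the parent directory, column 1 (label) the child
# directory, and column 3 (path) the file to move into that directory.
with open("file.csv", newline="") as f:
    for label, subset, path in csv.reader(f):
        label, subset, path = label.strip(), subset.strip(), path.strip()
        dest = os.path.join(subset, label)
        os.makedirs(dest, exist_ok=True)
        shutil.move(path, dest)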

time difference between counting lines in a file from python/unix

I'm using 'wc -l' on a file with 50 columns and 3000 records to count the lines, calling it from the Python code itself as below:
cmd='wc -l /path of file'
status,output=command.getstatusoutput(cmd)
and again I tried the one below in Python:
row_count=sum(1 for line in(file path))
I timed both of the commands and wc -l is faster; I just don't know why. Could you let me know the reasons behind this?
Example timings:
wc -l : 0.005s
python : 0.54s
Try this one:
with open("inp.txt", "r") as inpt:
print(len(inpt.readlines()))
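Note that readlines() loads the whole file into memory; for larger files a generator keeps memory flat (a sketch using the same hypothetical inp.txt name):
# Count lines without holding the whole file in memory
with open("inp.txt") as f:
    line_count = sum(1 for _ in f)
print(line_count)
Even so, wc -l will usually still win on raw speed: it is a small C utility that does nothing but count newline bytes, while Python has interpreter startup and per-line object overhead.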

How to split files according to a field and edit content

I am not sure if I can do this using unix commands or I need a more complicated code, like python.
I have a big input file with 3 columns - id, different sequences (second column) grouped in different groups (3rd column).
Seq1 MVRWNARGQPVKEASQVFVSYIGVINCREVPISMEN Group1
Seq2 PSLFIAGWLFVSTGLRPNEYFTESRQGIPLITDRFDSLEQLDEFSRSF Group1
Seq3 HQAPAPAPTVISPPAPPTDTTLNLNGAPSNHLQGGNIWTTIGFAITVFLAVTGYSF Group20
I would like to:
split this file according to the group id, creating a separate file for each group; edit the info in each file, adding a ">" sign at the beginning of the id; and then put the sequence on a new row.
Group1.txt file
>Seq1
MVRWNARGQPVKEASQVFVSYIGVINCREVPISMEN
>Seq2
PSLFIAGWLFVSTGLRPNEYFTESRQGIPLITDRFDSLEQLDEFSRSF
Group20.txt file
>Seq3
HQAPAPAPTVISPPAPPTDTTLNLNGAPSNHLQGGNIWTTIGFAITVFLAVTGYSF
How can I do that?
AWK will do the trick:
awk '{ print ">"$1 "\n" $2 >> $3".txt"}' input.txt
This shell script should do the trick:
#!/usr/bin/env bash
filename="data.txt"
while read line; do
id=$(echo "${line}" | awk '{print $1}')
sequence=$(echo "${line}" | awk '{print $2}')
group=$(echo "${line}" | awk '{print $3}')
printf ">${id}\n${sequence}\n" >> "${group}.txt"
done < "${filename}"
where data.txt is the name of the file containing the original data.
Importantly, the Group-files should not exist prior to running the script.
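If you would rather stay in Python, a minimal sketch of the same split (assuming a whitespace-separated, three-column input named data.txt as in the shell script):
# Append each record to <Group>.txt as ">id" followed by the sequence
with open("data.txt") as f:
    for line in f:
        seq_id, sequence, group = line.split()
        with open(group + ".txt", "a") as out:
            out.write(">" + seq_id + "\n" + sequence + "\n")
As with the shell version, the output files are appended to, so they should not exist before the script is run.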

How to split csv files into multiple files using the delimiter?

I have a tab delimited file as such:
this is a sentence. abb
what is this foo bar. bev
hello foo bar blah black sheep. abb
I could use cut -f1 and cut -f2 in unix terminal to split into two files:
this is a sentence.
what is this foo bar.
hello foo bar blah black sheep.
and:
abb
bev
abb
But is it possible to do the same in Python? Would it be faster?
I've been doing it as such:
[i.split('\t')[0] for i in open('in.txt', 'r')]
But is it possible to do the same in python?
yes you can:
l1, l2 = [[],[]]
with open('in.txt', 'r') as f:
    for i in f:
        # will loudly fail if more than two columns on a line
        left, right = i.split('\t')
        l1.append(left)
        l2.append(right)
print("\n".join(l1))
print("\n".join(l2))
would it be faster?
It's not likely: cut is a C program optimized for exactly that kind of processing, while Python is a general purpose language with great flexibility, but it is not necessarily fast.
Though, the one advantage you may get with an algorithm like the one I wrote is that you read the file only once, whereas with cut you're reading it twice. That could make the difference.
Though we'd need to run some benchmarking to be 100% sure.
Here's a small benchmark, on my laptop, for what it's worth:
>>> timeit.timeit(stmt=lambda: t("file_of_606251_lines"), number=1)
1.393364901014138
vs
% time cut -d' ' -f1 file_of_606251_lines > /dev/null
cut -d' ' -f1 file_of_606251_lines > /dev/null 0.74s user 0.02s system 98% cpu 0.775 total
% time cut -d' ' -f2 file_of_606251_lines > /dev/null
cut -d' ' -f2 file_of_606251_lines > /dev/null 1.18s user 0.02s system 99% cpu 1.215 total
which is 1.990 seconds.
So the python version is indeed faster, as expected ;-)
