Extracting all JavaScript filenames from a log file using bash script - python

I have 4 differently named log files, all with .txt extensions. I need to write a bash script that extracts JavaScript file names from any of these log files regardless of their names. The output of the script must not include the path, must contain only unique names, and must be sorted.
After some research I came up with this:
cat logfile1.txt | grep '[^.(]*\.js' | awk -F " " '{print $7}' | sort | uniq -c | sort -nr
This code does only half the job.
PRO: It does extract any JS file, sorts the results, and makes them unique.
CON: I need this in a file.sh, not a command line as it is now. Also, I'm getting the entire path to the JS file; I only need the file name, e.g. jquery.js.
I tried adding grep -v "*/name-of-path-before-JS" to stop the result from including the full path, but that isn't working.
I found someone who made something kind of similar using Python:
filenames = set()
with open(r"/home/filelog.txt") as f:
    for line in f:
        end = line.rfind(".js") + 3           # 3 = len(".js")
        start = line.rfind("/", 0, end) + 1   # 1 = len("/")
        filename = line[start:end]
        if filename.endswith(".js"):
            filenames.add(filename)

for filename in sorted(filenames, key=str.lower):
    print(filename)
Although it is missing the sort and uniq behaviour, it does output only filename.js rather than the whole path, unlike the command line I made. Also, I want to pass the path to the log .txt file while running the script, rather than hard-coding it as in the Python script above.
Example:
$./LogReaderScript.sh File-log.txt

Would you please try the shell script LogReaderScript.sh:
#!/bin/bash
if [[ $# -eq 0 ]]; then            # if no filenames are given
    echo "usage: $0 logfile .."    # then show the usage and abort
    exit 1
fi
grep -hoE "[^/]+\.js" "$@" | sort | uniq -c | sort -nr
By setting the file as executable with chmod +x LogReaderScript.sh,
you can invoke:
./LogReaderScript.sh File-log.txt
If you want to process multiple files at a time, you can also say something
like:
./LogReaderScript.sh *.txt
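As a quick sanity check, here is what the script prints for a hypothetical access-log line (the file name and log content below are made up for illustration):
printf '%s\n' '10.0.0.1 - - [21/Apr/2013:17:32:23 -0700] "GET /include/jquery.js HTTP/1.1" 200 2553' > demo-log.txt
./LogReaderScript.sh demo-log.txt
      1 jquery.js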
The -o option tells grep to print only the matched substrings, instead of printing the whole matched line.
The -E option specifies an extended regex as the pattern.
The -h option suppresses the prefixed filenames in the output when multiple files are given.
The pattern (regex) [^/]+\.js matches a sequence of characters other than a slash, followed by the extension .js. It will match the target filenames.
"$@" expands to the filename(s) passed as arguments to the script.

There's really no need to have a script, as you can do the job with a one-liner. Since you've mentioned you have multiple log files to parse, I'm assuming this is a task you're doing on a regular basis.
In this case just define an alias in your .bashrc file with this one-liner:
cat $1 | awk '{print $7}' | grep '.js' | awk -F\/ '{print $NF}' | sort | uniq
Let's say you've created the alias parser; then you'd just have to invoke parser /path/to/logfile.log
With the example logfile you've provided above, the output is:
➜ ~ cat logfile.txt | awk '{print $7}' | grep '.js' | awk -F\/ '{print $NF}' | sort | uniq
jquery.js
jquery.jshowoff.min.js
jshowoff.css
Explanation:
cat is used to read the file and then pipe the content into..
awk, which extracts the 7th space-separated field from the file; since those are Apache access logs and you're searching for the requested file, the seventh field is what you need
grep extracts only the JavaScript files, i.e. those ending with the .js extension
awk is used again to print only the file name; we're defining a custom field separator this time with the -F flag, and executing the print command with the $NF argument, which instructs awk to print only the last field
sort and uniq are self-explanatory; we're sorting the output then printing only the first occurrence of each repeated value.
jquery.jshowoff.min.js looked like bogus to me and I suspected I'd done something wrong with my commands, but it's an actual line (280) in your logfile:
75.75.112.64 - - [21/Apr/2013:17:32:23 -0700] "GET /include/jquery.jshowoff.min.js HTTP/1.1" 200 2553 "http://random-site.com/" "Mozilla/5.0 (iPod; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A403 Safari/8536.25" "random-site.com"
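One caveat worth noting: a plain bash alias can't use positional parameters like $1, so to get the exact invocation shown above you'd want a small function in your .bashrc instead. A minimal sketch (the name parser is just the example name used above):
# Put this in ~/.bashrc; usage: parser /path/to/logfile.log
parser() {
    awk '{print $7}' "$1" | grep '\.js' | awk -F/ '{print $NF}' | sort | uniq
}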

Related

Bash iterate through every file but start from 2nd file and get names of 1st and 2nd files

I have files all named following the convention:
xxx_yyy_zzz_ooo_date_ppp.tif
I have a Python function that needs 3 inputs: the dates of two consecutive files in my folder, and an output name generated from those two dates.
I created a loop that:
goes through every file in the folder
grabs the date of the file and assigns it to a variable ("file2", 5th place in the file name)
runs a python function that takes as inputs: date file 1, date file 2, output name
How could I make my loop start at the 2nd file in my folder, and grab the name of the previous file to assign it to a variable "file1" (so far it only grabs the date of 1 file at a time)?
#!/bin/bash
output_path=path  # Folder in which my output will be saved
for file2 in *; do
    f1=$( echo "$file1" | awk -F'[_.]' '{print $5}' )  # File before the one over which the loop is running
    f2=$( echo "$file2" | awk -F'[_.]' '{print $5}' )  # File 2 over which the loop is running
    outfile=$output_path+$f1+$f2
    function_python -$f1 -$f2 -$outfile
done
You could make it work like this:
#!/bin/bash
output_path="<path>"
readarray -t files < <(find . -maxdepth 1 -type f | sed 's|^\./||' | sort)  # replaces '*'; strip the leading ./ so the awk field numbers still line up
for ((i=1; i < ${#files[@]}; i++)); do
    f1=$( echo "${files[i-1]}" | awk -F'[_.]' '{print $5}' )  # previous file
    f2=$( echo "${files[i]}" | awk -F'[_.]' '{print $5}' )    # current file
    outfile="${output_path}/${f1}${f2}"
    function_python -"$f1" -"$f2" -"$outfile"
done
Not exactly sure about the call to function_python though, I have never seen that tool before (can't ask since I can't comment yet).
Read the files into an array and then iterate from index 1 instead of over the whole array.
#!/bin/bash
set -euo pipefail
declare -r output_path='/some/path/'
declare -a files fsegments
for file in *; do files+=("$file"); done
declare -ar files # optional
declare -r file1="${files[0]}"
IFS=_. read -ra fsegments <<< "$file1"
declare -r f1="${fsegments[4]}"
for file2 in "${files[@]:1}"; do # from 1
    IFS=_. read -ra fsegments <<< "$file2"
    f2="${fsegments[4]}"
    outfile="${output_path}${f1}${f2}"
    function_python -"$f1" -"$f2" -"$outfile" # weird format!
done
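For reference, a quick illustration of how the IFS=_. split picks out the 5th field, using a made-up filename that follows the question's naming convention:
IFS=_. read -ra fsegments <<< "xxx_yyy_zzz_ooo_20130421_ppp.tif"
echo "${fsegments[4]}"   # prints: 20130421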

"command not found" using line in argument to os.system using python

I am new to Python and working on an xyz project where I take the report dated day-1, fetch the data, and redirect it into another file on a Linux machine.
Here is my code:
#!/usr/bin/python
import os
cur_date = os.popen("date -d '-1 day' '+%Y%m%d'").read()
print (cur_date)
os.system('zgrep "919535144580" /var/tmp/comp?/emse_revres_rdc.log.%s* | grep -v "RI" | cut -d "|" -f 9,10,23,24,26 | sort | uniq -c | sort -nr >> /var/tmp/Andy/test.txt'%cur_date)
It prints the error below:
20180731
gzip: /var/tmp/comp?/emse_revres_rdc.log.20180731.gz: No such file or directory
sh: line 1: 1: command not found
But when I execute the same command in the shell it runs absolutely fine, and if I manually fill in the date and run the above, it also runs successfully.
Please provide your suggestions.
The * has nothing to do with the problem; the string you're substituting with %s ends with a newline, and that newline is what breaks your code.
When you use os.popen('...').read(), you get the entire output of ... -- including the trailing newline, which shell command substitutions implicitly trim.
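You can see the difference from the shell itself: the raw output of date ends with a newline byte (0a), while command substitution trims it. A small illustration, assuming GNU date and xxd are available:
date -d '-1 day' '+%Y%m%d' | xxd | tail -n 1                      # last byte is 0a (newline)
printf '%s' "$(date -d '-1 day' '+%Y%m%d')" | xxd | tail -n 1     # no trailing 0a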
The best answer would be to rewrite your logic in Python, but the easy answer here is to do the command substitution in the shell itself, which also avoids trying to pass values into a script via string substitution (a fast route to shell-injection security bugs):
shell_script = r'''
cur_date=$(date -d '-1 day' '+%Y%m%d')
zgrep "919535144580" /var/tmp/comp?/emse_revres_rdc.log."$cur_date"* \
| grep -v "RI" \
| cut -d "|" -f 9,10,23,24,26 \
| sort \
| uniq -c \
| sort -nr \
>> /var/tmp/Andy/test.txt
'''
os.system(shell_script)
That said, if you're just going for the shortest change possible, put the following before your original code's os.system() call:
cur_date = cur_date.rstrip('\n')

Bash use of temporary file pipes disallows redirection to dependent files? [duplicate]

Basically I want to take as input text from a file, remove a line from that file, and send the output back to the same file. Something along these lines if that makes it any clearer.
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name > file_name
however, when I do this I end up with a blank file.
Any thoughts?
Use sponge for this kind of task. It's part of moreutils.
Try this command:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | sponge file_name
You cannot do that because bash processes the redirections first, then executes the command. So by the time grep looks at file_name, it is already empty. You can use a temporary file though.
#!/bin/sh
tmpfile=$(mktemp)
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name > "${tmpfile}"
cat "${tmpfile}" > file_name
rm -f "${tmpfile}"
The script uses mktemp to create the temporary file; note that mktemp is not POSIX.
Use sed instead:
sed -i '/seg[0-9]\{1,\}\.[0-9]\{1\}/d' file_name
Try this simple one:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | tee file_name
Your file will not be blank this time :) and your output is also printed to your terminal.
You can't use a redirection operator (> or >>) to the same file, because the redirection has higher precedence and it will create/truncate the file before the command is even invoked. To avoid that, you should use an appropriate tool such as tee, sponge, sed -i or any other tool which can write results to the file itself (e.g. sort file -o file).
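For example, sort's own -o option is safe where plain redirection is not. A small illustration, using a throwaway file name:
printf 'b\na\n' > demo.txt
sort demo.txt > demo.txt     # demo.txt is truncated before sort reads it: now empty
printf 'b\na\n' > demo.txt
sort demo.txt -o demo.txt    # sort reads the input before writing: demo.txt contains "a" then "b"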
Basically redirecting input to the same original file doesn't make sense and you should use appropriate in-place editors for that, for example Ex editor (part of Vim):
ex '+g/seg[0-9]\{1,\}\.[0-9]\{1\}/d' -scwq file_name
where:
'+cmd'/-c - run any Ex/Vim command
g/pattern/d - remove lines matching a pattern using global (help :g)
-s - silent mode (man ex)
-c wq - execute :write and :quit commands
You may use sed to achieve the same (as already shown in other answers); however, in-place editing (-i) is a non-standard FreeBSD extension (it may work differently between Unix/Linux), and sed is basically a stream editor, not a file editor. See: Does Ex mode have any practical use?
One-liner alternative: set the content of the file as a variable:
VAR=`cat file_name`; echo "$VAR"|grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' > file_name
Since this question is the top result in search engines, here's a one-liner based on https://serverfault.com/a/547331 that uses a subshell instead of sponge (which often isn't part of a vanilla install like OS X):
echo "$(grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name)" > file_name
The general case is:
echo "$(cat file_name)" > file_name
Edit, the above solution has some caveats:
printf '%s' <string> should be used instead of echo <string> so that files containing -n don't cause undesired behavior.
Command substitution strips trailing newlines (this is a bug/feature of shells like bash) so we should append a postfix character like x to the output and remove it on the outside via parameter expansion of a temporary variable like ${v%x}.
Using a temporary variable $v stomps the value of any existing variable $v in the current shell environment, so we should nest the entire expression in parentheses to preserve the previous value.
Another bug/feature of shells like bash is that command substitution strips unprintable characters like null from the output. I verified this by calling dd if=/dev/zero bs=1 count=1 >> file_name and viewing it in hex with cat file_name | xxd -p. But echo $(cat file_name) | xxd -p is stripped. So this answer should not be used on binary files or anything using unprintable characters, as Lynch pointed out.
The general solution (albeit slightly slower, more memory intensive and still stripping unprintable characters) is:
(v=$(cat file_name; printf x); printf '%s' "${v%x}" > file_name)
Test from https://askubuntu.com/a/752451:
printf "hello\nworld\n" > file_uniquely_named.txt && for ((i=0; i<1000; i++)); do (v=$(cat file_uniquely_named.txt; printf x); printf '%s' "${v%x}" > file_uniquely_named.txt); done; cat file_uniquely_named.txt; rm file_uniquely_named.txt
Should print:
hello
world
Whereas calling cat file_uniquely_named.txt > file_uniquely_named.txt in the current shell:
printf "hello\nworld\n" > file_uniquely_named.txt && for ((i=0; i<1000; i++)); do cat file_uniquely_named.txt > file_uniquely_named.txt; done; cat file_uniquely_named.txt; rm file_uniquely_named.txt
Prints an empty string.
I haven't tested this on large files (probably over 2 or 4 GB).
I have borrowed this answer from Hart Simha and kos.
This is very much possible, you just have to make sure that by the time you write the output, you're writing it to a different file. This can be done by removing the file after opening a file descriptor to it, but before writing to it:
exec 3<file ; rm file; COMMAND <&3 >file ; exec 3>&-
Or line by line, to understand it better :
exec 3<file # open a file descriptor reading 'file'
rm file # remove file (but fd3 will still point to the removed file)
COMMAND <&3 >file # run command, with the removed file as input
exec 3>&- # close the file descriptor
It's still a risky thing to do, because if COMMAND fails to run properly, you'll lose the file contents. That can be mitigated by restoring the file if COMMAND returns a non-zero exit code :
exec 3<file ; rm file; COMMAND <&3 >file || cat <&3 >file ; exec 3>&-
We can also define a shell function to make it easier to use :
# Usage: replace FILE COMMAND
replace() { exec 3<"$1"; rm "$1"; "${@:2}" <&3 >"$1" || cat <&3 >"$1"; exec 3>&-; }
Example :
$ echo aaa > test
$ replace test tr a b
$ cat test
bbb
Also, note that this will keep a full copy of the original file (until the third file descriptor is closed). If you're using Linux, and the file you're processing on is too big to fit twice on the disk, you can check out this script that will pipe the file to the specified command block-by-block while unallocating the already processed blocks. As always, read the warnings in the usage page.
The following will accomplish the same thing that sponge does, without requiring moreutils:
shuf --output=file --random-source=/dev/zero
The --random-source=/dev/zero part tricks shuf into doing its thing without doing any shuffling at all, so it will buffer your input without altering it.
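Applied to the original task, that would look something like the following (a sketch that relies on shuf reading its whole input before opening the --output file, as the buffering note above implies):
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | shuf --output=file_name --random-source=/dev/zero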
However, it is true that using a temporary file is best, for performance reasons. So, here is a function that I have written that will do that for you in a generalized way:
# Pipes a file into a command, and pipes the output of that command
# back into the same file, ensuring that the file is not truncated.
# Parameters:
# $1: the file.
# $2: the command. (With $3... being its arguments.)
# See https://stackoverflow.com/a/55655338/773113
siphon()
{
    local tmp file rc=0
    [ "$#" -ge 2 ] || { echo "Usage: siphon filename [command...]" >&2; return 1; }
    file="$1"; shift
    tmp=$(mktemp -- "$file.XXXXXX") || return
    "$@" <"$file" >"$tmp" || rc=$?
    mv -- "$tmp" "$file" || rc=$(( rc | $? ))
    return "$rc"
}
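With siphon defined, the original task could be run, for example, as:
siphon file_name grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}'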
There's also ed (as an alternative to sed -i):
# cf. http://wiki.bash-hackers.org/howto/edit-ed
printf '%s\n' H 'g/seg[0-9]\{1,\}\.[0-9]\{1\}/d' wq | ed -s file_name
You can use slurp with POSIX Awk:
!/seg[0-9]\{1,\}\.[0-9]\{1\}/ {
    q = q ? q RS $0 : $0
}
END {
    print q > ARGV[1]
}
Example
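Assuming the program is passed in-line, it can be run directly against file_name (a sketch; the END block only opens ARGV[1] for writing after the whole input has been read):
awk '!/seg[0-9]\{1,\}\.[0-9]\{1\}/ { q = q ? q RS $0 : $0 }
     END { print q > ARGV[1] }' file_name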
This does the trick pretty nicely in most of the cases I faced:
cat <<< "$(do_stuff_with f)" > f
Note that while $(…) strips trailing newlines, <<< ensures a final newline, so generally the result is magically satisfying.
(Look for “Here Strings” in man bash if you want to learn more.)
Full example:
#! /usr/bin/env bash
get_new_content() {
    sed 's/Initial/Final/g' "${1:?}"
}
echo 'Initial content.' > f
cat f
cat <<< "$(get_new_content f)" > f
cat f
This does not truncate the file and yields:
Initial content.
Final content.
Note that I used a function here for the sake of clarity and extensibility, but that’s not a requirement.
A common use case is JSON editing:
echo '{ "a": 12 }' > f
cat f
cat <<< "$(jq '.a = 24' f)" > f
cat f
This yields:
{ "a": 12 }
{
"a": 24
}
Try this
echo -e "AAA\nBBB\nCCC" > testfile
cat testfile
AAA
BBB
CCC
echo "$(grep -v 'AAA' testfile)" > testfile
cat testfile
BBB
CCC
I usually use the tee program to do this:
grep -v 'seg[0-9]\{1,\}\.[0-9]\{1\}' file_name | tee file_name
This way you don't have to manage a temporary file yourself.

How to split files according to a field and edit content

I am not sure if I can do this using Unix commands or whether I need more complicated code, like Python.
I have a big input file with 3 columns: an id, different sequences (second column), grouped into different groups (3rd column).
Seq1 MVRWNARGQPVKEASQVFVSYIGVINCREVPISMEN Group1
Seq2 PSLFIAGWLFVSTGLRPNEYFTESRQGIPLITDRFDSLEQLDEFSRSF Group1
Seq3 HQAPAPAPTVISPPAPPTDTTLNLNGAPSNHLQGGNIWTTIGFAITVFLAVTGYSF Group20
I would like to:
split this file according to the group id and create separate files for each group; edit the info in each file, adding a ">" sign at the beginning of the id; and then put the sequence on a new row
Group1.txt file
>Seq1
MVRWNARGQPVKEASQVFVSYIGVINCREVPISMEN
>Seq2
PSLFIAGWLFVSTGLRPNEYFTESRQGIPLITDRFDSLEQLDEFSRSF
Group20.txt file
>Seq3
HQAPAPAPTVISPPAPPTDTTLNLNGAPSNHLQGGNIWTTIGFAITVFLAVTGYSF
How can I do that?
AWK will do the trick:
awk '{ print ">" $1 "\n" $2 >> ($3 ".txt") }' input.txt
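If the input can contain many distinct groups, some awk implementations limit the number of simultaneously open files; a defensive variant of the same one-liner (just a sketch) closes each output file after writing:
awk '{ out = $3 ".txt"; print ">" $1 "\n" $2 >> out; close(out) }' input.txt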
This shell script should do the trick:
#!/usr/bin/env bash
filename="data.txt"
while read -r line; do
    id=$(echo "${line}" | awk '{print $1}')
    sequence=$(echo "${line}" | awk '{print $2}')
    group=$(echo "${line}" | awk '{print $3}')
    printf ">${id}\n${sequence}\n" >> "${group}.txt"
done < "${filename}"
where data.txt is the name of the file containing the original data.
Importantly, the Group-files should not exist prior to running the script.

Escaping escape sequence in Python

I am kind of new to Python. The goal is to execute a shell command using subprocess and to parse and retrieve the printed output from the shell. The execution errors out as shown in the sample output below. Also shown below is the sample code snippet.
Code snippet:
testStr = "cat tst.txt | grep Location | sed -e '/.*Location: //g' "
print "testStr = "+testStr
testStrOut = subprocess.Popen([testStr],shell=True,stdout=subprocess.PIPE).communicate()[0]
Output:
testStr = cat tst.txt | grep Location | sed -e '/.*Location: //g'
cat: tst.txt: No such file or directory
sed: -e expression #1, char 15: unknown command: `/'
Is there a workaround or a function that could be used?
Appreciate your help
Thanks
I suppose your main problem is not Python related. To be more precise, there are 3 errors:
You forgot to import subprocess.
It should be sed -e 's/.*Location: //g'. You wrote ///g instead of s///g (see the quick check after this list).
tst.txt does not exist.
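Once points 2 and 3 are fixed, the pipeline itself works. A quick way to verify it from the shell, with made-up file contents:
printf 'Location: /usr/local/data\nOther: ignored\n' > tst.txt
grep Location tst.txt | sed -e 's/.*Location: //g'
# prints: /usr/local/data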
You should be passing testStr directly as the first argument, rather than enclosing it in a list. See subprocess.Popen, the paragraph that starts "On Unix, with shell=True: ...".
