find and delete duplicate content in several files - python

I have many files (ACLs) containing IPs, MACs, hostnames and other data.
Important: the problem is about "duplicate content" in the files, not "duplicate lines".
Example (only one file shown, but I have several ACLs):
192.168.1.20;08:00:00:00:00:01;peter
192.168.1.21;08:00:00:00:00:01;android
192.168.1.21;08:00:00:00:00:02;john
192.168.1.22;08:00:00:00:00:03;julia
192.168.1.23;08:00:00:00:00:04;android
These are lines with duplicate content, and that is what I want to find. I need a command that gives me this result:
192.168.1.20;08:00:00:00:00:01;peter
192.168.1.21;08:00:00:00:00:01;android
192.168.1.21;08:00:00:00:00:02;john
192.168.1.23;08:00:00:00:00:04;android
The duplicate content in the lines above is 08:00:00:00:00:01, 192.168.1.21 and android.
The command I use to find duplicates in the acls folder (it doesn't work):
cat /home/user/files/* | sort | uniq -c | head -20
I've also tried a Python script, but the results are not as expected.
First (at least) I want to detect the lines with duplicate content, and (if possible) delete them.
Thanks

Considering your comments about what you consider a duplicate, this should be close:
$ a=$(cut -d';' -f1 c.txt |sort |uniq -d)
$ b=$(cut -d';' -f2 c.txt |sort |uniq -d)
$ c=$(cut -d';' -f3 c.txt |sort |uniq -d)
$ echo "$a:$b:$c"
192.168.1.21:08:00:00:00:00:01:android
But in reality we are talking about three different situations.
Variable a contains only the duplicate IP, ignoring the rest of the fields.
Variable b contains only the duplicate MAC, ignoring the rest of the fields.
Variable c contains only the duplicate host name, ignoring the rest of the fields.
I don't see much meaning in this combined, confusing output.
The only explanation is that you can use grep later like this:
$ grep -v -e "$a" -e "$b" -e "$c" c.txt
192.168.1.22;08:00:00:00:00:03;julia
This gives the lines from your original file whose IP, MAC and host name are each completely unique, i.e. each value appears only once in the whole file.
Is this what you want to achieve?
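Since the question asks for Python, here is a minimal sketch of the same idea: count how often every IP, MAC and hostname occurs across all the ACL files, then flag every line that shares a field value with another line. The folder path is taken from the cat command above; adjust it to your setup.
#!/usr/bin/env python3
# Sketch: flag lines whose IP, MAC or hostname occurs more than once
# across all ACL files (three ;-separated fields assumed per line).
import glob
from collections import Counter

field_counts = [Counter(), Counter(), Counter()]   # one counter per column
lines = []                                          # (path, raw line, fields)
for path in glob.glob("/home/user/files/*"):        # same folder as in the question
    with open(path) as f:
        for raw in f:
            fields = raw.strip().split(";")
            if len(fields) < 3:
                continue                            # skip malformed lines
            lines.append((path, raw.rstrip("\n"), fields[:3]))
            for counter, value in zip(field_counts, fields[:3]):
                counter[value] += 1

for path, raw, fields in lines:
    if any(field_counts[i][fields[i]] > 1 for i in range(3)):
        print(f"duplicate content: {path}: {raw}")
    else:
        print(f"unique:            {path}: {raw}")
Writing only the unique lines back out instead of printing both groups would cover the "delete lines with duplicate content" part.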

Related

Extracting all JavaScript filenames from a log file using bash script

I have 4 differently named log files, all with txt extensions. I need to write a bash script that extracts JavaScript file names from any of these log files regardless of their names. The output of the script should not include the path, and has to be unique and sorted.
After some research I came up with this:
cat logfile1.txt | grep '[^.(]*\.js' | awk -F " " '{print $7}' | sort | uniq -c| sort -nr
This code only does half the job:
PRO: It does extract any JS, sorts it, and gives unique results.
CON: I need this in a file.sh, not a command line as it is now. Also, I'm getting the entire path to the JS file; I only need the file name, e.g. jquery.js.
I tried adding grep -v "*/name-of-path-before-JS" to stop the result from showing the full path, but that isn't working.
I found someone who made something kind of similar using Python:
source
filenames = set()
with open(r"/home/filelog.txt") as f:
    for line in f:
        end = line.rfind(".js") + 3          # 3 = len(".js")
        start = line.rfind("/", 0, end) + 1  # 1 = len("/")
        filename = line[start:end]
        if filename.endswith(".js"):
            filenames.add(filename)
for filename in sorted(filenames, key=str.lower):
    print(filename)
Although it is missing the sort and uniq options, it does give the results with only filename.js and not the whole path like my command line does. Also, I want to pass the path to the log file while running the script rather than hard-coding it as in the Python script above.
Example;
$./LogReaderScript.sh File-log.txt
Would you please try the shell script LogReaderScript.sh:
#!/bin/bash
if [[ $# -eq 0 ]]; then             # if no filenames are given
    echo "usage: $0 logfile .."     # then show the usage and abort
    exit 1
fi
grep -hoE "[^/]+\.js" "$@" | sort | uniq -c | sort -nr
By setting the file as executable with chmod +x LogReaderScript.sh,
you can invoke:
./LogReaderScript.sh File-log.txt
If you want to process multiple files at a time, you can also say something
like:
./LogReaderScript.sh *.txt
The -o option tells grep to print only the matched substrings,
instead of printing the whole matching line.
The -E option specifies extended regex as the pattern.
The -h option suppresses the prefixed filenames in the output when multiple
files are given.
The pattern (regex) [^/]+\.js matches a sequence of characters
other than a slash, followed by the extension .js. It will match
the target filenames.
"$@" is expanded to the filename(s) passed as arguments to the script.
There's really no need to have a script, as you can do the job with a one-liner; since you've mentioned you have multiple log files to parse, I'm assuming this is a task you're doing on a regular basis.
In this case just define a small function in your .bashrc wrapping this one-liner (an alias won't expand the $1 argument):
cat $1 | awk '{print $7}' | grep '.js' | awk -F\/ '{print $NF}' | sort | uniq
Let's say you've named it parser; then you'd just have to invoke parser /path/to/logfile.log.
With the example logfile you've provided above, the output is:
➜ ~ cat logfile.txt | awk '{print $7}' | grep '.js' | awk -F\/ '{print $NF}' | sort | uniq
jquery.js
jquery.jshowoff.min.js
jshowoff.css
Explanation:
cat reads the file and pipes its content into..
awk, which extracts the 7th space-separated field from each line; since these are Apache access logs and you're searching for the requested file, the seventh field is what you need
grep extracts only the JavaScript files, i.e. those ending with the .js extension
awk is used again to print only the file name; this time we define a custom field separator with the -F flag and print $NF, which instructs awk to print only the last field
sort and uniq are self-explanatory: we sort the output, then print only the first occurrence of each repeated value
jquery.jshowoff.min.js looked bogus to me and I suspected I had done something wrong with my commands, but it's an actual line (280) in your logfile:
75.75.112.64 - - [21/Apr/2013:17:32:23 -0700] "GET /include/jquery.jshowoff.min.js HTTP/1.1" 200 2553 "http://random-site.com/" "Mozilla/5.0 (iPod; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A403 Safari/8536.25" "random-site.com"
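If you'd rather have this in Python, a rough sketch equivalent to the pipeline could look like the following; like the one-liner above, it assumes the requested path sits in the 7th whitespace-separated field of an Apache access log.
#!/usr/bin/env python3
# Sketch: extract unique, sorted .js file names from one or more access logs.
import os
import sys

filenames = set()
for logfile in sys.argv[1:]:                     # pass the log file(s) as arguments
    with open(logfile, errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 7:
                continue
            name = os.path.basename(parts[6])    # 7th field: the requested path
            if name.endswith(".js"):
                filenames.add(name)

for name in sorted(filenames, key=str.lower):    # unique, case-insensitive sort
    print(name)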

arranging text files side by side using python

I have 3000 text files in a directory, and each .txt file contains a single column of data. I want to arrange them side by side to make an m×n matrix file.
For example: paste 1.txt 2.txt 3.txt 4.txt .............3000.txt in Linux.
For this I tried
printf "%s\n" *.txt | sort -n | xargs -d '\n' paste
However, it gives the error paste: filename.txt: Too many open files.
Please suggest a better solution for the same using Python.
Based on the inputs received, please follow the steps below.
# Raise the ulimit to increase the number of files that can be open at a time
$ ulimit -n 4096
# Remove blank lines from all the files
$ sed -i '/^[[:space:]]*$/d' *.txt
# Join all files side by side to form a matrix view
$ paste $(ls -v *.txt) > matrix.txt
# Fill the blank values in the matrix view with 0's using awk inplace
$ awk -i inplace 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = 0 }; 1' matrix.txt
You don't need Python for this. If you first increase the number of open files a process can have using ulimit, it becomes easy to get the columns in the right order in bash, zsh, or ksh93, using paste and brace expansion to generate the filenames in the desired order instead of having to sort the results of filename expansion:
% ulimit -n 4096
% paste {1..3000}.txt > matrix.txt
(I tested this in all three shells I mentioned on a Linux box, and it works with all of them with no errors about the command line being too long or anything else.)
You could also arrange to have the original files use a different naming scheme that sorts naturally, like 0001.txt, 0002.txt, ..., 3000.txt and then just paste [0-9]*.txt > matrix.txt.
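Since the question did ask for a Python approach, here is a minimal sketch that sidesteps the open-files limit entirely by reading the files one at a time. It assumes the files are named 1.txt through 3000.txt, each holding one column, and uses tabs as the separator like paste does.
#!/usr/bin/env python3
# Sketch: read each single-column file in turn and write them side by side.
columns = []
for i in range(1, 3001):                     # 1.txt .. 3000.txt, opened one at a time
    with open(f"{i}.txt") as f:
        columns.append([line.strip() for line in f if line.strip()])

rows = max(len(col) for col in columns)
with open("matrix.txt", "w") as out:
    for r in range(rows):
        # pad shorter columns with 0, like the awk fill step above
        out.write("\t".join(col[r] if r < len(col) else "0" for col in columns))
        out.write("\n")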

Script to rearrange files into correct folders

I have a csv file with three columns: the first column has two distinct entries, bad or good; the distinct entries in column 2 are learn, query and test; and the third column contains file paths indicating where to find each file.
bad test vff/v1/room_10-to-room_19_CDFFN5D5_x_0000
bad test vff/v1/room_10-to-room_19_BVFGFGN5D5_x_0023
bad learn vff2/v3/room_01-to-room_02_ERTY8LOK_x_00039
bad learn vff/v3/room_01-to-room_02_TRT8LOK_x_00210
bad query vff/v3/room_16-to-room_08_56TCS95_y_00020
bad query vff2/v3/room_16-to-room_08_856C6S95_y_00201
good test person/room_44/exit_call_room__5818
good test person/room_34/cleaning_pan__812
good learn person/room_43/walking_in_cafe_edited__717
good learn person/room_54/enterit_call_room__387
good query person/room_65/talki_speech_cry__1080
good query person/room_75/walking_against_wall__835
Using this csv, I want to create three folders based on column 2, namely test, learn and query. Within each of these 3 folders, I want to create two folders, bad and good, based on column 1. Then I want to pull the data using column 3 and place the respective files in these defined folders. Is there a Python or command-line script that can do this?
Assuming this csv file is named file.csv:
#!/bin/bash
FILE="file.csv"
# Create direcory structure
for C2 in `cat ${FILE} | cut -f 2 -d ',' | sort -u`
do
    for C1 in `cat ${FILE} | cut -f 1 -d ',' | sort -u`
    do
        mkdir -p "${C2}/${C1}"
    done
done
# Move files
while IFS= read -r line
do
    file="$(echo $line | cut -f 3 -d ',' | tr -d ' ')"
    dir="$(echo $line | cut -f 2 -d ',' | tr -d ' ')"
    dir+="/$(echo $line | cut -f 1 -d ',')"
    mv "${file}" "${dir}"
done < "${FILE}"
Some things that are happening in this bash script:
cut: this command is very useful for selecting the n-th item from a delimiter-separated list. In this instance we are working with a csv, so you will see cut -d ',' to specify a comma as the delimiter.
Creating the directory structure: column 2 is the parent directory and column 1 is the child directory, so the cut -f 2 list drives the outer for loop and cut -f 1 the inner for loop.
sort -u removes repeated occurrences of a string. This allows us to iterate through all the different entries for a given column.
Moving the files: every line in file.csv names a file that needs to be moved, hence the iteration through each line of the file. The target directory we created earlier is built from columns 2 and 1, and the file is moved to its new home.
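If a Python version is preferred, here is a minimal sketch of the same idea; it assumes file.csv really is comma-separated with the columns quality (bad/good), category (test/learn/query) and path, in that order.
#!/usr/bin/env python3
# Sketch: create category/quality folders and move each file listed in the csv.
import csv
import os
import shutil

with open("file.csv", newline="") as f:
    for row in csv.reader(f):
        quality, category, path = (field.strip() for field in row[:3])
        dest = os.path.join(category, quality)   # e.g. test/bad
        os.makedirs(dest, exist_ok=True)
        shutil.move(path, dest)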

How to efficiently find small typos in source code files?

I would like to recursively search a large code base (mostly Python, HTML and JavaScript) for typos in comments, strings and also variable/method/class names. Strong preference for something that runs in a terminal.
The problem is that spell checkers like aspell or scspell find almost only false positives (e.g. programming terms, camel-cased terms), while I would be happy if it could help me primarily find simple typos like scrambled or missing letters, e.g. maintenane vs. maintenance, resticted vs. restricted, dpeloyment vs. deployment.
What I was playing with so far is:
for f in **/*.py ; do echo $f ; aspell list < $f | uniq -c ; done
but it will find anything like: assertEqual, MyTestCase, lifecycle
This solution of my own focuses on Python files, but in the end it also found typos in HTML and JS. It still needed manual sorting out of false positives, but that only took a few minutes of work, and it identified about 150 typos in comments that could then also be found in the non-comment parts.
Save this as an executable file, e.g. extractcomments:
#!/usr/bin/env python3
import argparse
import io
import tokenize
if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('filename')
    args = parser.parse_args()
    with io.open(args.filename, "r", encoding="utf-8") as sourcefile:
        for t in tokenize.generate_tokens(sourcefile.readline):
            if t.type == tokenize.COMMENT:
                print(t.string.lstrip("#").strip())
Collect all comments for further processing:
for f in **/*.py ; do ~/extractcomments $f >> ~/comments.txt ; done
Then run the collected comments through aspell with one or more dictionaries, collect everything it identifies as a typo and count the occurrences:
aspell <~/comments.txt --lang=en list|aspell --lang=de list | sort | uniq -c | sort -n > ~/typos.txt
Produces something like:
10 availabe
8 assignement
7 hardwird
Take the list without the leading numbers, clean out the false positives, copy it to a 2nd file correct.txt and run aspell on it to get the desired replacement for each typo: aspell -c correct.txt
Now paste the two files to get a typo;correction format with paste -d";" typos.txt correct.txt > known_typos.csv
Now we want to recursively replace those in our codebase:
#!/bin/bash
root_dir=$(git rev-parse --show-toplevel)
while IFS=";" read -r typo fix ; do
    git grep -l -z -w "${typo}" -- "*.py" "*.html" | xargs -r --null sed -i "s/\b${typo}\b/${fix}/g"
done < $root_dir/known_typos.csv
My bash skills are poor, so there is certainly room for improvement.
Update: I could find more typos in method names by running this:
grep -r def --include \*.py . | cut -d ":" -f 2- |tr "_" " " | aspell --lang=en list | sort -u
Update 2: I managed to also fix typos that sit inside underscored names or strings without word boundaries as such, e.g. i_am_a_typpo3:
#!/bin/bash
root_dir=$(git rev-parse --show-toplevel)
while IFS=";" read -r typo fix ; do
    echo ${typo}
    find $root_dir \( -name '*.py' -or -name '*.html' \) -print0 | xargs -0 perl -pi -e "s/(?<![a-zA-Z])${typo}(?![a-zA-Z])/${fix}/g"
done < $root_dir/known_typos.csv
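To also run aspell over identifiers and string literals (not just comments), the tokenize approach above can be extended. This is a rough sketch, and the word-splitting regex for camelCase/snake_case is my own assumption, not part of the original solution; pipe its output through aspell list, sort and uniq -c as before.
#!/usr/bin/env python3
# Sketch: print the words hidden in names, strings and comments, one per line,
# so the output can be fed to aspell like the comments were.
import io
import re
import sys
import tokenize

WORD = re.compile(r"[A-Z]?[a-z]+")   # splits snake_case and camelCase into words

with io.open(sys.argv[1], "r", encoding="utf-8") as sourcefile:
    for t in tokenize.generate_tokens(sourcefile.readline):
        if t.type in (tokenize.NAME, tokenize.STRING, tokenize.COMMENT):
            for word in WORD.findall(t.string):
                print(word)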
If you're using TypeScript you could use the gulp plugin I created for spell checking:
https://www.npmjs.com/package/gulp-ts-spellcheck
If you are developing in JavaScript or TypeScript then you can use this spell-check plugin for ESLint:
https://www.npmjs.com/package/eslint-plugin-spellcheck
I found it to be very useful.
Another option is scspell:
https://github.com/myint/scspell
It is language-agnostic and claims to "usually catch many errors without an annoying false positive rate."

bash: remove all files except last version in file name

I have 10K+ files like the ones below. The file system date (export time) is the same for all files.
YYY101R1.corp.company.org-RUNNINGCONFIG-2015-07-10-23-10-15.config
YYY101R1.corp.company.org-RUNNINGCONFIG-2015-07-11-22-11-10.config
YYY101R1.corp.company.org-RUNNINGCONFIG-2015-10-01-10-05-08.config
LLL101S1.corp.company.org-RUNNINGCONFIG-2015-08-10-23-10-15.config
LLL101S1.corp.company.org-RUNNINGCONFIG-2015-09-11-20-11-10.config
LLL101S1.corp.company.org-RUNNINGCONFIG-2015-10-02-19-05-07.config
How can I delete all files except the last version (latest date, taken from the file name) and rename the remaining file to
YYY101R1.corp.company.org.config
LLL101S1.corp.company.org.config
Thank you.
The UNIX shell command
ls -t YYY101R1.corp.company.org*
will list files in order of age, newest first. Grab the first line as "latest" and make a symbolic ("soft") link to it:
ln -s $latest YYY101R1.corp.company.org.config
Repeat for each file group.
Does that get you going? If not, please post your code and explanation of the specific problem. See https://stackoverflow.com/help/mcve
I got something similar: first get a list of all files sorted ascending by modification time, then count them, display all but the last 2, and pass that list to the command that removes the files.
ls -tr | wc -l
ls -tr | head -number_of_files_minus_2 | xargs rm
Was it helpful?
FLast=`ls -tr YYY101R1.corp.company.org* | tail -n 1`
mv ${FLast} YYY101R1.corp.company.org.config
rm -f YYY101R1.corp.company.org-RUNNINGCONFIG-*
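A Python sketch of the same task, keeping only the newest config per host based on the timestamp embedded in the file name (plain string comparison works here because the timestamps are zero-padded):
#!/usr/bin/env python3
# Sketch: for every host, keep the newest *.config, delete the rest,
# and rename the survivor to host.config.
import glob
import os
import re

PATTERN = re.compile(r"^(?P<host>.+)-RUNNINGCONFIG-(?P<stamp>[\d-]+)\.config$")

latest = {}                                        # host -> (stamp, filename)
for name in glob.glob("*-RUNNINGCONFIG-*.config"):
    m = PATTERN.match(name)
    if m and (m["host"] not in latest or m["stamp"] > latest[m["host"]][0]):
        latest[m["host"]] = (m["stamp"], name)

for name in glob.glob("*-RUNNINGCONFIG-*.config"):
    m = PATTERN.match(name)
    if not m:
        continue
    if name == latest[m["host"]][1]:
        os.rename(name, m["host"] + ".config")     # e.g. YYY101R1.corp.company.org.config
    else:
        os.remove(name)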
