Here is my problem.
I have n files that all contain overlapping, common text. I want to create a new file from these n files that contains only the unique lines that exist across all of them.
I am looking for a bash command or Python API that can do this for me. If there is an algorithm, I can also attempt to code it myself.
If the order of the lines is not important, you could do this:
sort -u file1 file2 ...
This will (a) sort all the lines in all the files, and then (b) remove duplicates. This will give you the lines that are unique among all the files.
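For reference, a rough Python sketch of what sort -u does across several files (the function name and file handling here are my own, not from the question):

```python
# Sketch of `sort -u file1 file2 ...` in Python: gather every line
# from every file into a set, then emit the distinct lines sorted.
def sorted_unique_lines(filenames):
    lines = set()
    for name in filenames:
        with open(name) as f:
            lines.update(line.rstrip("\n") for line in f)
    return sorted(lines)
```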
For testing common data you can use comm:
DESCRIPTION
The comm utility reads file1 and file2, which should be sorted lexically,
and produces three text columns as output: lines only in file1; lines only in
file2; and lines in both files.
Another useful tool would be merge:
DESCRIPTION
merge incorporates all changes that lead from file2 to file3 into file1.
The result ordinarily goes into file1. merge is useful for combining separate
changes to an original.
sort might mess up your order. You can try the following awk command. It hasn't been tested, so make sure you back up your files first. :)
awk ' !x[$0]++' big_merged_file
This will remove all duplicate lines from your file.
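If you prefer Python, the order-preserving de-duplication that awk '!x[$0]++' performs can be sketched like this (the helper name is mine):

```python
# Keep only the first occurrence of each line, preserving the
# original order, mirroring the awk idiom `!x[$0]++`.
def first_occurrences(lines):
    seen = set()
    out = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            out.append(line)
    return out
```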
This might work for you:
# ( seq 1 5; seq 3 7; )
1
2
3
4
5
3
4
5
6
7
# ( seq 1 5; seq 3 7; ) | sort -nu
1
2
3
4
5
6
7
# ( seq 1 5; seq 3 7; ) | sort -n | uniq -u
1
2
6
7
# ( seq 1 5; seq 3 7; ) | sort -n | uniq -d
3
4
5
You need to merge everything first, sort it, and then finally remove the duplicates:
#!/bin/bash
rm -f final            # start from an empty file
for file in test/*
do
    cat "$file" >> final
done
sort final > final2
uniq final2 final      # uniq's second argument is the output file
rm -f final2           # -r is only needed for directories
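The sort step matters because uniq only collapses adjacent duplicates. A small Python sketch of uniq's behaviour (using itertools.groupby) shows why unsorted input leaves duplicates behind:

```python
from itertools import groupby

# uniq-like behaviour: collapse runs of *adjacent* duplicate lines only.
def uniq(lines):
    return [key for key, _ in groupby(lines)]

# With unsorted input the second "a" survives because it is not
# adjacent to the first; sorting first removes it.
```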
I have a (tab-delimited) file where the first "word" on each line is the line number. However, some line numbers are missing. I want to insert new lines (with the corresponding line number) so that, throughout the file, the number printed on the line matches the actual line number. (This is for later consumption by readarray, with cut/awk used to get the text after the line number.)
I've written this logic in Python and tested that it works; however, I need to run it in an environment that doesn't have Python. The actual file is about 10M rows. Is there a way to express this logic using sed, awk, or even just plain shell/bash?
import re
import sys

linenumre = re.compile(r"^\d+")
i = 0
for line in sys.stdin:
    i = i + 1
    linenum = int(linenumre.findall(line)[0])
    while i < linenum:
        print(i)
        i = i + 1
    print(line, end='')
test file looks like:
1 foo 1
2 bar 1
4 qux 1
6 quux 1
9 2
10 fun 2
expected output like:
1 foo 1
2 bar 1
3
4 qux 1
5
6 quux 1
7
8
9 2
10 fun 2
Like this, with awk:
awk '{while(++ln!=$1){print ln}}1' input.txt
Explanation, as a multiline script:
{
    # Loop as long as the variable ln (line number)
    # is not equal to the first column, and insert the
    # missing line numbers.
    # Note: awk auto-initializes a variable to 0
    # upon its first numeric usage.
    while (++ln != $1) {
        print ln
    }
}
1 # this always evaluates to true, making awk print the input lines
I've written this logic in python and tested it works, however I need to run this in an environment that doesn't have python.
In case you want to run Python code where Python is not installed, you can freeze your code. The Hitchhiker's Guide to Python has an overview of tools that can do this. I suggest first trying PyInstaller, as it supports various operating systems and seems easy to use.
This might work for you (GNU join, seq and sed):
join -a1 -t' ' <(seq $(sed -n '$s/ .*//p' file)) file 2>/dev/null
This joins file with a sequence generated by seq, whose upper bound is the last line number found in file.
Hi experts, I want to split a large column of a text file at a particular symbol (here >) and paste the split pieces side by side, as in the example below.
I tried split -l 4 inputfile > outputfile but it does not help. I hope some expert will definitely help me.
For example, I have data as given below:
>
1
2
2
4
>
4
3
5
3
>
4
5
2
3
and I need output as below:
1 4 4
2 3 5
2 5 2
4 3 3
EDIT: As per OP's comment, the number of lines between > markers may not be regular. If that is the case, I have come up with the following, which will add NA for any missing occurrence. Written and tested with GNU awk, assuming no empty lines in your Input_file.
awk -v RS=">" -v FS="\n" '
FNR==NR{
    max=(max>NF?max:NF)
    next
}
FNR>1{
    for(i=2;i<max;i++){
        val[i]=(val[i]?val[i] OFS:"")($i?$i:"NA")
    }
}
END{
    for(i=2;i<max;i++){
        print val[i]
    }
}' Input_file Input_file
Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
/^>/{
    count=""
    next
}
{
    ++count
    val[count]=(val[count]?val[count] OFS:"")$0
}
END{
    for(i=1;i<=count;i++){
        print val[i]
    }
}' Input_file
Explanation: a detailed explanation of the above.
awk '                       ##Starting awk program from here.
/^>/{                       ##Checking condition: if a line starts with > then do the following.
    count=""                ##Nullifying count variable here.
    next                    ##next will skip all further statements from here.
}
{
    ++count                 ##Incrementing count variable by 1 here.
    val[count]=(val[count]?val[count] OFS:"")$0  ##Creating val with index count and appending the current line to it with spaces.
}
END{                        ##Starting END block for this awk program from here.
    for(i=1;i<=count;i++){  ##Starting a for loop from here.
        print val[i]        ##Printing array val with index i here.
    }
}' Input_file               ##Mentioning Input_file name here.
Please try the program below:
a=""">
1
2
2
4
>
4
3
5
3
>
4
5
2
3"""
res=[[c for c in b.split("\n") if c] for b in a.split(">") if b]
print("\n".join([" ".join([item[i] for item in res]) for i in range(len(res[0]))]))
Output
1 4 4
2 3 5
2 5 2
4 3 3
If you want to read from a file, use the program below.
It produces the same output as above.
with open("input.txt","r") as f, open("output.txt","w") as f1:
    a=f.read()
    res=[[c for c in b.split("\n") if c] for b in a.split(">") if b]
    f1.write("\n".join([" ".join([item[i] for item in res]) for i in range(len(res[0]))]))
A Python solution as you tagged Python:
columns = []  # List of columns; each column will be another list of lines
with open('example.txt', 'r') as f:
    for line in f:
        line = line.strip()  # Remove leading and trailing whitespace like "\n"
        if line == '>':
            columns.append([])  # If we find a ">", append a new column
        else:
            columns[-1].append(line)  # else append the line to the last column
with open('output.txt', 'w') as f:
    for row in zip(*columns):  # zip(*columns) transposes the matrix
        f.write(" ".join(row) + "\n")
I mostly use one-liners in shell scripting.
If I have a file with contents as below:
1
2
3
and want it to be pasted like:
1 1
2 2
3 3
how can I do it in shell scripting using a Python one-liner?
PS: I tried the following:
python -c "file = open('array.bin','r' ) ; cont=file.read ( ) ; print cont*3;file.close()"
but it printed the contents like this:
1
2
3
1
2
3
file = open('array.bin', 'r')
cont = file.readlines()
for line in cont:
    print line, line
file.close()
You could replace your print cont*3 with the following:
print '\n'.join(' '.join(ch * n) for ch in cont.strip().split())
Here n is the number of columns.
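One caveat worth noting: ' '.join(ch * n) joins the characters of the repeated string, so this trick only behaves as intended when each token is a single character. A quick check:

```python
cont = "1\n2\n3\n"
n = 3
result = '\n'.join(' '.join(ch * n) for ch in cont.strip().split())
# Each single-character token becomes "1 1 1", "2 2 2", "3 3 3".
# A multi-character token such as "12" would instead come out as
# "1 2 1 2 1 2", i.e. its characters get separated.
```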
You need to break up the lines and then reassemble:
One Liner:
python -c "file=open('array.bin','r'); cont=file.readlines(); print '\n'.join([' '.join([c.strip()]*2) for c in cont]); file.close()"
Long form:
file=open('array.bin', 'r')
cont=file.readlines()
print '\n'.join([' '.join([c.strip()]*2) for c in cont])
file.close()
With array.bin having:
1
2
3
Gives:
1 1
2 2
3 3
Unfortunately, you can't use a simple for statement for a one-liner solution (as suggested in a previous answer). As this answer explains, "as soon as you add a construct that introduces an indented block (like if), you need the line break."
Here's one possible solution that avoids this problem:
Open file and read lines into a list
Modify the list (using a list comprehension). For each item:
Remove the trailing new line character
Multiply by the number of columns
Join the modified list using the new line character as separator
Print the joined list and close the file
Detailed/long form (n = number of columns):
f = open('array.bin', 'r')
n = 5
original = list(f)
modified = [line.strip() * n for line in original]
print('\n'.join(modified))
f.close()
One-liner:
python -c "f = open('array.bin', 'r'); n = 5; print('\n'.join([line.strip()*n for line in list(f)])); f.close()"
REPEAT_COUNT=3 && cat contents.txt| python -c "print('\n'.join(w.strip() * ${REPEAT_COUNT} for w in open('/dev/stdin').readlines()))"
First test from the command prompt:
paste -d" " array.bin array.bin
EDIT:
OP wants to use a variable n to specify how many columns are needed.
There are different ways to repeat a command 10 times, such as
for i in {1..10}; do echo array.bin; done
seq 10 | xargs -I -- echo "array.bin"
source <(yes echo "array.bin" | head -n10)
yes "array.bin" | head -n10
Other ways are given by https://superuser.com/a/86353 and I will use a variation of
printf -v spaces '%*s' 10 ''; printf '%s\n' ${spaces// /ten}
My solution is
paste -d" " $(printf "%*s" $n " " | sed 's/ /array.bin /g')
I am used to using awk to retrieve a column from a file.
I need to do something similar now in Python. At the moment I use a subprocess and save the result in a variable.
Is it possible to run something similar to awk in Python without writing a lot of code? I was looking at split, but I don't understand how to parse through multiple lines.
The input I have is similar to a simple ls -la or netstat -r. I would like to get the 3rd column, as I would do with
awk '{print $3}'
Example of the source:
a b c d e
1 2 4 5 2
X Y Z S R
The shortest approach I can think of is a loop that splits each line into strings and prints string[2]. But I am not sure how to write this in the simplest and shortest way; ideally as short as running the awk command in a subprocess.
In bash, using pythonpy
rtb@bartek-laptop ~ $ cat tmp
a b c d e
1 2 4 5 2
X Y Z S R
rtb@bartek-laptop ~ $ cat tmp | py -x "x.split()[2]"
c
4
Z
Or in script
with open('tmp') as f:
    result = [line.split()[2] for line in f]
# now result contains list ['c', '4', 'Z']
I have two files. One has two columns, ref.txt. The other has three columns, file.txt.
In ref.txt,
1 2
2 3
3 5
In file.txt,
1 2 4 <---here matching
3 4 5
6 9 4
2 3 10 <---here matching
4 7 9
3 5 7 <---here matching
I would like to compare two columns for each file, then only print the lines in file.txt matching the ref.txt.
So, the output should be,
1 2 4
2 3 10
3 5 7
I thought of comparing two dictionaries, like this:
mydict = {}
mydict1 = {}

with open('ref.txt') as f1:
    for line in f1:
        key, key1 = line.split()
        sp1 = mydict[key, key1]

with open('file.txt') as f2:
    for lines in f2:
        item1, item2, value = lines.split()
        sp2 = mydict1[item1, item2]
        if sp1 == sp2:
            print value
How can I compare two files appropriately with dictionary or others?
I found some Perl and Python code that solves this when both files have the same number of columns.
In my case, one file has two columns and the other has three.
How to compare two files and only print matching values?
Here's another option:
use strict;
use warnings;

my $file = pop;
my %hash = map { chomp; $_ => 1 } <>;

push @ARGV, $file;

while (<>) {
    print if /^(\d+\s+\d+)/ and $hash{$1};
}
Usage: perl script.pl ref.txt file.txt [>outFile]
The last, optional parameter directs output to a file.
Output on your datasets:
1 2 4
2 3 10
3 5 7
Hope this helps!
grep -Ff ref.txt file.txt
is enough if the amount of whitespace between the characters is the same in both files. If it is not, you can do
awk '{print "^" $1 "[[:space:]]+" $2}' ref.txt | xargs -I {} grep -E {} file.txt
combining three of my favorite utilities: awk, grep, and xargs... This latter method also ensures that the match only occurs at the start of the line (comparing column 1 with column 1, and column 2 with column 2).
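Since the question is tagged python, here is a sketch of the same idea done with a set lookup instead of grep (the function name is mine; the filenames follow the question):

```python
# Filter file.txt down to the lines whose first two columns appear
# as a (col1, col2) pair somewhere in ref.txt.
def matching_lines(ref_path, file_path):
    with open(ref_path) as ref:
        pairs = {tuple(line.split()[:2]) for line in ref if line.strip()}
    with open(file_path) as f:
        return [line.rstrip("\n") for line in f
                if tuple(line.split()[:2]) in pairs]
```

After building the set, this makes a single pass over file.txt, so it stays fast on large inputs.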
Here's a revised and commented version that should work on your larger data set:
#read in your reference and the file
reference = open("ref.txt").read()
filetext = open("file.txt").read()
#split the reference file into a list of strings, splitting each time you encounter a new line
splitReference = reference.split("\n")
#do the same for the file
splitFile = filetext.split("\n")
#then, for each line in the reference,
for referenceLine in splitReference:
#split that line into a list of strings, splitting each time you encouter a stretch of whitespace
referenceCells = referenceLine.split()
#then, for each line in your 'file',
for fileLine in splitFile:
#split that line into a list of strings, splitting each time you encouter a stretch of whitespace
lineCells = fileLine.split()
#now, for each line in 'reference' check to see if the first value is equal to the first value of the current line in 'file'
if referenceCells[0] == lineCells[0]:
#if those are equal, then check to see if the current rows of the reference and the file both have a length of more than one
if len(referenceCells) > 1:
if len(lineCells) > 1:
#if both have a length of more than one, compare the values in their second columns. If they are equal, print the file line
if referenceCells[1] == lineCells[1]:
print fileLine
Output:
1 2 4
2 3 10
3 5 7