Create a reverse complement sequence with awk / Python

Dear stackoverflow users,
I have TAB-separated data like this:
head -4 input.tsv
seq A C change
seq T A ok
seq C C change
seq AC CCT change
And I need to create a reverse complement function in awk which does something like this:
head -4 output.tsv
seq T G change
seq T A ok
seq G G change
seq GT AGG change
So if the 4th column is flagged "change", I need to create the reverse complement of the sequence.
HINT - tr in bash does the same thing, for example. A bash one-liner for this task is:
echo "ACCGA" | rev | tr "ATGC" "TACG"
I tried something like this:
awk 'BEGIN {c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A" }{OFS="\t"}
function revcomp( i, o) {
o = ""
for(i = length; i > 0; i--)
o = o c[substr($0, i, 1)]
return(o)
}
{
if($4 == "change"){$2 = revcom(); $3 = revcom()} print $0; else print $0}' input
The biological complement means:
A => T
C => G
G => C
T => A
and the reverse complement means:
ACCATG => CATGGT
Edited: Also, just for education, could anybody share a solution for this in Python?

With a little tinkering of your attempt you can do something like below.
function revcomp(arg) {
o = ""
for(i = length(arg); i > 0; i--)
o = o c[substr(arg, i, 1)]
return(o)
}
BEGIN {c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A" ; OFS="\t"}
{
if($4 == "change") {
$2 = revcomp($2);
$3 = revcomp($3)
}
}1
The key here was to make the function revcomp take the column value as its argument and operate on it by iterating from the end. You were previously operating on the whole line $0, i.e. substr($0, i, 1), which would cause a lot of unusual lookups on the array c.
I've also taken the liberty of changing the prototype of your function revcomp to take the input string and return the reversed one, since I wasn't sure how you were intending to use it in your original attempt.
If you intend to use the above as part of a larger script, I would recommend putting the whole code as above in a script file, setting the shebang interpreter to #!/usr/bin/awk -f and running the script as awk -f script.awk input.tsv
A crude bash version implemented in awk would look like below. Note that it is not clean and not a recommended approach. See more at AllAboutGetline
As before call the function as $2 = revcomp_bash($2) and $3 = revcomp_bash($3)
function revcomp_bash(arg) {
o = ""
cmd = "printf \"%s\" " arg "| rev | tr \"ATGC\" \"TACG\""
while ( ( cmd | getline o ) > 0 ) {
}
close(cmd);
return(o)
}
Your whole code speaks GNU awk-isms, so I didn't bother converting it to a POSIX-compliant one. You could use split() with an empty delimiter instead of length(), but the POSIX specification says that "The effect of a null string as the value of fs is unspecified."
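The edit in the question also asks for a Python version; a minimal sketch, assuming Python 3 and that input.tsv really is tab-separated (the script name revcomp.py is only illustrative):
#!/usr/bin/env python3
# Sketch: reverse complement columns 2 and 3 when column 4 says "change".
import sys

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    # complement each base, then reverse the whole string
    return seq.translate(COMPLEMENT)[::-1]

with open(sys.argv[1]) as fh:
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        if fields[3] == "change":
            fields[1] = revcomp(fields[1])
            fields[2] = revcomp(fields[2])
        print("\t".join(fields))
Run it as python3 revcomp.py input.tsv > output.tsv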

Could you please try the following, written and tested with the shown samples (in GNU awk).
awk '
BEGIN{
label["A"]="T"
label["C"]="G"
label["G"]="C"
label["T"]="A"
}
function cVal(field){
delete array
num=split($field,array,"")
for(k=1;k<=num;k++){
if(array[k] in label){
val=label[array[k]] val
}
}
$field=val
val=""
}
$NF=="change"{
for(i=2;i<=(NF-1);i++){
cVal(i)
}
}
1
' Input_file | column -t
Explanation: Adding a detailed explanation of the above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this code here.
label["A"]="T" ##Creating array label with index A and value T.
label["C"]="G" ##Creating array label with index C and value G.
label["G"]="C" ##Creating array label with index G and value C.
label["T"]="A" ##Creating array label with index T and value A.
}
function cVal(field){ ##Creating function named cVal here with passing value field in it.
delete array ##Deleting array here.
num=split($field,array,"") ##Splitting current field value passed to it and creating array.
for(k=1;k<=num;k++){ ##Running a for loop from k=1 to the value of num.
if(array[k] in label){ ##Checking condition if array value with index k is present in label array then do following.
val=label[array[k]] val ##Creating val which has label value with index array with index k and keep concatenating its value to it.
}
}
$field=val ##Setting current field value to val here.
val="" ##Nullifying val here.
}
$NF=="change"{ ##Checking condition if last field is change then do following.
for(i=2;i<=(NF-1);i++){ ##Running for loop from 2nd field to 2nd last field.
cVal(i) ##Calling function with passing current field number to it.
}
}
1 ##1 will print current line here.
' Input_file | column -t ##Mentioning Input_file name here.

Kinda inefficient for this particular application since it creates the mapping array on each call to tr() and does the same loop in tr() and then again in rev(), but I figured I'd show how to write standalone tr() and rev() functions, and it'll probably be fast enough for your needs anyway:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
$4 == "change" {
for ( i=2; i<=3; i++) {
$i = rev(tr($i,"ACGT","TGCA"))
}
}
{ print }
function tr(instr,old,new, outstr,pos,map) {
for (pos=1; pos<=length(old); pos++) {
map[substr(old,pos,1)] = substr(new,pos,1)
}
for (pos=1; pos<=length(instr); pos++) {
outstr = outstr map[substr(instr,pos,1)]
}
return outstr
}
function rev(instr, outstr,pos) {
for (pos=1; pos<=length(instr); pos++) {
outstr = substr(instr,pos,1) outstr
}
return outstr
}
$ awk -f tst.awk file
seq T G change
seq T A ok
seq G G change
seq GT AGG change
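Purely for comparison (not part of the awk answer above), the same tr-then-rev split can be sketched in Python with str.maketrans and slicing:
def tr(instr, old, new):
    # character-for-character translation, like the awk tr() above
    return instr.translate(str.maketrans(old, new))

def rev(instr):
    # reverse by slicing
    return instr[::-1]

print(rev(tr("ACCATG", "ACGT", "TGCA")))   # prints CATGGT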

If you are okay with perl:
$ perl -F'\t' -lane 'if($F[3] eq "change") {
$F[1] = (reverse $F[1] =~ tr/ATGC/TACG/r);
$F[2] = (reverse $F[2] =~ tr/ATGC/TACG/r) }
print join "\t", @F' ip.txt
seq T G change
seq T A ok
seq G G change
seq GT AGG change
You can also use the following, but it is not specific to columns and will change any sequence of ATCG characters:
perl -lpe 's/\t\K[ATCG]++(?=.*\tchange$)/reverse $&=~tr|ATGC|TACG|r/ge'

Related

How to merge lines and add column values?

So I have a laaaaaaaarge file like this:
Item|Cost1|Cost2
Pizza|50|25
Sugar|100|100
Spices|100|200
Pizza|100|25
Sugar|200|100
Pizza|50|100
I want to add all Cost1s and Cost2s for a particular item and produce a merged output.
I've written Python code to do this:
item_dict = {}
for line in file:
    fields = line.split('|')
    item = fields[0]
    cost1 = fields[1]
    cost2 = fields[2]
    if item_dict.has_key(item):
        item_dict[item][0] += int(cost1)
        item_dict[item][1] += int(cost2)
    else:
        item_dict[item] = [int(cost1), int(cost2)]

for key, val in item_dict.items():
    print "%s|%s|%s" % (key, val[0], val[1])
Is there any way to do this very efficiently and quickly in awk or using any other wizardry?
Or can I make my python more elegant and faster?
Expected Output
Pizza|200|150
Sugar|300|200
Spices|100|200
Something like this...
$ awk 'BEGIN{OFS=FS="|"}
NR>1 {cost1[$1]+=$2; cost2[$1]+=$3}
END{ for (i in cost1) print i, cost1[i], cost2[i]}' file
Sugar|300|200
Spices|100|200
Pizza|200|150
Explanation
BEGIN{OFS=FS="|"} sets the (input & output) field separator to be |.
NR>1 means that we are going to do some actions for line numbers bigger than 1. This way we skip the header.
cost1 and cost2 are arrays whose indices are the first field and whose values are the sums up to that point.
END {} is something we do after reading the whole file. It consists of looping through the array and printing the values.
awk '
BEGIN { FS=OFS="|" }
NR==1 { expectedNF = NF; next }
NF != expectedNF { print "Fix your #%##&! data, idiot!"; exit 1 }
{
items[$1]
for (c=2;c<=NF;c++)
cost[$1,c] += $c
}
END {
for (i in items) {
printf "%s", i
for (c=2;c<=NF;c++)
printf "%s%s", OFS, cost[i,c]
print ""
}
}
' file
Feel free to compress it onto 1 or 2 lines as you see fit.
In practice I would have done what fedorqui did. For completeness, however, this Python script should be faster than your original:
#!/usr/bin/env python
import fileinput

item_dict = {}
for line in fileinput.input():
    if not fileinput.isfirstline():
        fields = line.strip().split('|')
        item = fields[0]
        cost1 = int(fields[1])
        cost2 = int(fields[2])
        try:
            item_dict[item][0] += cost1
            item_dict[item][1] += cost2
        except KeyError:
            item_dict[item] = [cost1, cost2]

for key, val in item_dict.items():
    print "%s|%s|%s" % (key, val[0], val[1])
Save the script to a file such as sumcols, make it executable with chmod +x sumcols, and run it like:
$ ./sumcols file
Spices|100|200
Sugar|300|200
Pizza|200|150
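On Python 3, a shorter variant of the same aggregation can be sketched with collections.defaultdict; this assumes the same '|'-separated layout with a header row:
#!/usr/bin/env python3
# Sketch: sum Cost1 and Cost2 per item, skipping the header line.
import fileinput
from collections import defaultdict

totals = defaultdict(lambda: [0, 0])
for line in fileinput.input():
    if fileinput.isfirstline():
        continue                    # skip the header
    item, cost1, cost2 = line.rstrip("\n").split("|")
    totals[item][0] += int(cost1)
    totals[item][1] += int(cost2)

for item, (c1, c2) in totals.items():
    print("%s|%s|%s" % (item, c1, c2))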

Combine lines with matching keys

I have a text file with the following structure
ID,operator,a,b,c,d,true
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12238,J1,78.6,79.0,56.2,82.1,84.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1
Each ID corresponds to a dataset which is analysed by an operator several times, i.e. J1 and J2 are the first and second attempt by operator J. The measures a, b, c and d use 4 slightly different algorithms to measure a value whose true value lies in the column true.
What I would like to do is to create 3 new text files comparing the results for J1 vs J2, S1 vs S2 and J1 vs S1. Example output for J1 vs J2:
ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3
where a1 is measurement a for J1, etc.
Another example is for S1 vs S2:
ID,operator,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1
WCBP12234,81.8,66.6,82.7,67.9,67.0,53,87.5,70.7,75.3
The IDs will not be in alphanumerical order nor will the operators be clustered for the same ID. I'm not certain how best to approach this task - using linux tools or a scripting language like perl/python.
My initial attempt using linux quickly hit a brick wall
First find all unique IDs (sorted)
awk -F, '/^WCBP/ {print $1}' file | uniq | sort -k 1.5n > unique_ids
Loop through these IDs and sort J1, J2:
foreach i (`more unique_ids`)
grep $i test.txt | egrep 'J[1-2]' | sort -t',' -k2
end
This gives me the data sorted
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,80.4
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12238,J1,78.6,79.0,56.2,82.1,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,75.3
WCBP12239,J2,66.6,72.9,79.5,76.6,75.3
I'm not sure how to rearrange this data to get the desired structure. I tried adding an additional pipe to awk in the foreach loop awk 'BEGIN {RS="\n\n"} {print $1, $3,$10,$4,$11,$5,$12,$6,$13,$7}'
Any ideas? I'm sure this can be done in a less cumbersome manner using awk, although it may be better using a proper scripting language.
You can use the Perl csv module Text::CSV to extract the fields, and then store them in a hash, where ID is the main key, the second field is the secondary key and all the fields are stored as the value. It should then be trivial to do whatever comparisons you want. If you want to retain the original order of your lines, you can use an array inside the first loop.
use strict;
use warnings;
use Text::CSV;
my %data;
my $csv = Text::CSV->new({
binary => 1, # safety precaution
eol => $/, # important when using $csv->print()
});
while ( my $row = $csv->getline(*ARGV) ) {
my ($id, $J) = @$row; # first two fields
$data{$id}{$J} = $row; # store line
}
Python Way:
import os,sys, re, itertools
info=["WCBP12236,J1,75.7,80.6,65.9,83.2,82.1",
"WCBP12236,J2,76.3,79.6,61.7,81.9,82.1",
"WCBP12236,S1,77.2,81.5,69.4,84.1,82.1",
"WCBP12236,S2,68.0,68.0,53.2,68.5,82.1",
"WCBP12234,J1,63.7,67.7,72.2,71.6,75.3",
"WCBP12234,J2,68.6,68.4,41.4,68.9,80.4",
"WCBP12234,S1,81.8,82.7,67.0,87.5,75.3",
"WCBP12234,S2,66.6,67.9,53.0,70.7,72.7",
"WCBP12238,J1,78.6,79.0,56.2,82.1,82.1",
"WCBP12239,J2,66.6,72.9,79.5,76.6,75.3",
"WCBP12239,S1,86.6,87.8,23.0,23.0,82.1",
"WCBP12239,S2,86.0,86.9,62.3,89.7,82.1",
"WCBP12239,J1,70.9,71.3,66.0,73.7,75.3",
"WCBP12238,J2,75.1,75.2,54.3,76.4,82.1",
"WCBP12238,S1,65.9,66.0,40.2,66.5,80.4",
"WCBP12238,S2,72.7,73.2,52.6,73.9,72.7" ]
def extract_data(operator_1, operator_2):
    operator_index = 1
    id_index = 0
    data = {}
    result = []
    ret = []
    for line in info:
        conv_list = line.split(",")
        if len(conv_list) > operator_index and ((operator_1.strip().upper() == conv_list[operator_index].strip().upper()) or (operator_2.strip().upper() == conv_list[operator_index].strip().upper())):
            if data.has_key(conv_list[id_index]):
                iters = [iter(conv_list[int(operator_index)+1:]), iter(data[conv_list[id_index]])]
                data[conv_list[id_index]] = list(it.next() for it in itertools.cycle(iters))
                continue
            data[conv_list[id_index]] = conv_list[int(operator_index)+1:]
    return data
ret=extract_data("j1", "s2")
print ret
O/P:
{'WCBP12239': ['70.9', '86.0', '71.3', '86.9', '66.0', '62.3', '73.7', '89.7', '75.3', '82.1'], 'WCBP12238': ['72.7', '78.6', '73.2', '79.0', '52.6', '56.2', '73.9', '82.1', '72.7', '82.1'], 'WCBP12234': ['66.6', '63.7', '67.9', '67.7', '53.0', '72.2', '70.7', '71.6', '72.7', '75.3'], 'WCBP12236': ['68.0', '75.7', '68.0', '80.6', '53.2', '65.9', '68.5', '83.2', '82.1', '82.1']}
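Not from the answer above, but the same grouping idea can be sketched more directly in Python 3 with the csv module; the file names input.csv and J1_vs_J2.csv are only illustrative:
import csv
from collections import defaultdict

# Sketch: group rows as data[id][operator] = [a, b, c, d, true]
data = defaultdict(dict)
with open("input.csv", newline="") as fh:     # illustrative file name
    reader = csv.reader(fh)
    next(reader)                              # skip the header row
    for row in reader:
        rec_id, op, *vals = row
        data[rec_id][op] = vals

# Interleave the J1 and J2 measurements per ID, as in the desired output
with open("J1_vs_J2.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for rec_id in sorted(data):
        if "J1" in data[rec_id] and "J2" in data[rec_id]:
            j1, j2 = data[rec_id]["J1"], data[rec_id]["J2"]
            pairs = [v for pair in zip(j1[:4], j2[:4]) for v in pair]
            writer.writerow([rec_id] + pairs + [j1[4]])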
I didn't use Text::CSV like TLP did. You could if you needed to, but for this example, since there were no embedded commas in the fields, I just did a simple split on ','. Also, the true fields from both operators are listed (instead of just one), as I thought the special case of the last value complicates the solution.
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw/ mesh /;
my %data;
while (<DATA>) {
chomp;
my ($id, $op, @vals) = split /,/;
$data{$id}{$op} = \@vals;
}
my @ops = ([qw/J1 J2/], [qw/S1 S2/], [qw/J1 S1/]);
for my $id (sort keys %data) {
for my $comb (@ops) {
open my $fh, ">>", "@$comb.txt" or die $!;
my $a1 = $data{$id}{ $comb->[0] };
my $a2 = $data{$id}{ $comb->[1] };
print $fh join(",", $id, mesh(@$a1, @$a2)), "\n";
close $fh or die $!;
}
}
__DATA__
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12239,J1,78.6,79.0,56.2,82.1,82.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12238,J1,70.9,71.3,66.0,73.7,84.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1
The output files produced are below
J1 J2.txt
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3,75.3
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1,82.1
WCBP12238,70.9,75.1,71.3,75.2,66.0,54.3,73.7,76.4,84.1,84.1
WCBP12239,78.6,66.6,79.0,72.9,56.2,79.5,82.1,76.6,82.1,82.1
S1 S2.txt
WCBP12234,81.8,66.6,82.7,67.9,67.0,53.0,87.5,70.7,75.3,75.3
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1,82.1
WCBP12238,65.9,72.7,66.0,73.2,40.2,52.6,66.5,73.9,84.1,84.1
WCBP12239,86.6,86.0,87.8,86.9,23.0,62.3,23.0,89.7,82.1,82.1
J1 S1.txt
WCBP12234,63.7,81.8,67.7,82.7,72.2,67.0,71.6,87.5,75.3,75.3
WCBP12236,75.7,77.2,80.6,81.5,65.9,69.4,83.2,84.1,82.1,82.1
WCBP12238,70.9,65.9,71.3,66.0,66.0,40.2,73.7,66.5,84.1,84.1
WCBP12239,78.6,86.6,79.0,87.8,56.2,23.0,82.1,23.0,82.1,82.1
Update: To get only 1 true value, the for loop could be written like this:
for my $id (sort keys %data) {
for my $comb (@ops) {
local $" = '';
open my $fh, ">>", "@$comb.txt" or die $!;
my $a1 = $data{$id}{ $comb->[0] };
my $a2 = $data{$id}{ $comb->[1] };
pop @$a2;
my @mesh = grep defined, mesh(@$a1, @$a2);
print $fh join(",", $id, @mesh), "\n";
close $fh or die $!;
}
}
Update: Added 'defined' as the test in the grep expression, as that is the proper way (instead of just testing '$_', which could possibly be 0 and be wrongly excluded from the list by grep).
Any problem that awk or sed can solve can no doubt also be solved with Python, Perl, Java, Go, C++, or C. However, it is not necessary to write a complete program in any of them.
Use awk in a one-liner
VERSION 1
For most use cases, I think VERSION 1 is good enough.
tail -n +2 file | # the call to `tail` to remove the 1st line is not necessary
sort -t, -k 1,1 |
awk -F ',+' -v OFS=, '$2==x{id=$1;a=$3;b=$4;c=$5;d=$6} id==$1 && $2==y{$3=a","$3; $4=b","$4; $5=c","$5; $6=d","$6; $2=""; $0=$0; $1=$1; print}' \
x=J1 y=S1
Just replace the values of the arguments x and y with what you like.
Please note that the values of x and y must follow alphabetical order, e.g., x=J1 y=S1 is OK, but x=S1 y=J1 doesn't work.
VERSION 2
The limitation mentioned in VERSION 1, that you have to specify x and y in alphabetical order, is removed. Now x=S1 y=J1 is OK.
tail -n +2 file | # the call to `tail` to remove the 1st line is not necessary
sort -t, -k 1,1 |
awk -F ',+' -v OFS=, 'id!=$1 && ($2==x||$2==y){z=$2==x?y:x; id=$1; a=$3;b=$4;c=$5;d=$6} id==$1 && $2==z{$3=a","$3;$4=b","$4;$5=c","$5;$6=d","$6; $2=""; $0=$0; $1=$1; print}' \
x=S1 y=J1
However, the data of J1 is still put before the data of S1, which means the column a1 in the resulting output is always the column a of J1 in the input file, and a2 in the resulting output is always the column a of S1 in the input file.
VERSION 3
The limitation mentioned in VERSION 2 is removed. Now with x=S1 y=J1, the output column a1 would be the input column a of S1, and a2 would be the a of J1.
tail -n +2 file | # the call to `tail` to remove the 1st line is not necessary
sort -t, -k 1,1 |
awk -F ',+' -v OFS=, 'id!=$1 && ($2==x||$2==y){z=$2==x?y:x; id=$1; a=$3;b=$4;c=$5;d=$6} id==$1 && $2==z{if (z==y) {$3=a","$3;$4=b","$4;$5=c","$5;$6=d","$6} else {$3=$3","a;$4=$4","b;$5=$5","c;$6=$6","d} $2=""; $0=$0; $1=$1; print}' \
x=S1 y=J1

Splitting a file into smaller files by number of fields

I'm having a hard time breaking a large (50GB) csv file into smaller parts. Each line has a few thousand fields. Some of the fields are strings in double quotes, others are integers, decimals and booleans.
I want to parse the file line by line and split by the number of fields in each row. The strings may contain several commas (such as in the line below), as well as a number of empty fields.
,,1,30,50,"Sold by father,son and daughter for $4,000" , ,,,, 12,,,20.9,0,
I tried using
perl -pe' s{("[^"]+")}{($x=$1)=~tr/,/|/;$x}ge ' file >> file2
to change the commas inside the quotes to | but that didn't work. I plan to use
awk -F"|" conditional statement appending to new k_fld_files file2
Is there an easier way to do this, please? I'm looking at Python, but I probably need a utility that will stream-process the file, line by line.
Using Python - if you just want to parse CSV including embedded delimiters, and stream out with a new delimiter, then something such as:
import csv
import sys

with open('filename.csv') as fin:
    csvout = csv.writer(sys.stdout, delimiter='|')
    for row in csv.reader(fin):
        csvout.writerow(row)
Otherwise, it's not much more difficult to make this do all kinds of stuff.
Example of outputting to files per column (untested):
cols_to_output = {}
for row in csv.reader(fin):
    for colno, col in enumerate(row):
        output_to = cols_to_output.setdefault(colno, open('column_output.{}'.format(colno), 'wb'))
        csv.writer(output_to).writerow(row)

for fileno in cols_to_output.itervalues():
    fileno.close()
Here's an awk alternative.
Assuming the quoted strings are well formatted, i.e. always have starting and terminating quotes, and no quotes within other quotes, you could do the replacement you suggested by doing a gsub on every other field replacing , with |.
With pipes
Below is an example of how this might go when grabbing columns 3 through 6, 11 and 14-15 with coreutils cut:
awk -F'"' -v OFS='' '
NF > 1 {
for(i=2; i<=NF; i+=2) {
gsub(",", "|", $i);
$i = FS $i FS; # reinsert the quotes
}
print
}'\
| cut -d , -f 3-6,11,14-15 \
| awk -F'"' -v OFS='' -e '
NF > 1 {
for(i=2; i<=NF; i+=2) {
gsub("\\|", ",", $i)
$i = FS $i FS; # reinsert the quotes
}
print
}'
Note that there is an additional post-processing step that reverts the | to ,.
Entirely in awk
Alternatively, you could do the whole thing in awk with some loss of generality with regards to range specification. Here we only grab columns 3 to 6:
extract.awk
BEGIN {
OFS = ""
start = 3
end = 6
}
{
for(i=2; i<=NF; i+=2) {
gsub(",", "|", $i)
$i = FS $i FS
}
split($0, record, ",")
for(i=start; i<=end-1; i++) {
gsub("\\|", ",", record[i])
printf("%s,", record[i])
}
gsub("\\|", ",", record[end])
printf("%s\n", record[end])
}

How to get a set of unique values from a list of repeating values

I need to parse a large log file (flat file), which contains two columns of values (column A, column B).
Values in both columns repeat. For each unique value in column A, I need to find the set of column B values.
Can this be done using Unix shell commands, or do I need to write a Perl or Python script? What are the ways this can be done?
Example:
xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4
output:
xxxA - 2,1,3
xxxB - 2
xxxC - 3
xxxD - 4
Perl 'one-liner', indented/expanded out so that everything fits in the window:
$ perl -F -lane '
$hash{ $F[0] }{ $F[1] }++;
} END {
for my $columnA ( keys %hash ) {
print $columnA, " - ", join( ",", keys %{ $hash{$columnA} } ), "\n";
}
'
Explanation will follow if I see a concerted attempt on the part of the original poster.
I would use Python dictionaries where the dictionary keys are column A values and the dictionary values are Python's built-in Set type holding column B values
def parse_the_file():
    lower = str.lower
    split = str.split
    with open('f.txt') as f:
        d = {}
        lines = f.read().split('\n')
        for A, B in [split(l) for l in lines]:
            try:
                d[lower(A)].add(B)
            except KeyError:
                d[lower(A)] = set([B])
        for a in d:
            print "%s - %s" % (a, ",".join(list(d[a])))

if __name__ == "__main__":
    parse_the_file()
The advantage of using a dictionary is that you'll have a single dictionary key per column A value. The advantage of using a set is that you'll have a unique set of column B values.
Efficiency notes:
The use of try/except is more efficient than using an if/else statement to check for the initial case.
The evaluation and assignment of the str functions outside of the loop is more efficient than simply using them inside the loop.
Depending on the proportion of new A values vs. reappearances of A values throughout the file, you may consider using a = lower(A) before the try/except statement
I used a function, as accessing local variables is more efficient in Python than accessing global variables
Some of these performance tips are from here
Testing the code above on your input example yields:
xxxd - 4
xxxa - 1,3,2
xxxb - 2
xxxc - 3
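On Python 3, the same dict-of-sets idea can be sketched with collections.defaultdict(set), which removes the try/except; f.txt is the same file name assumed above:
from collections import defaultdict

# Sketch: one set of column-B values per lower-cased column-A key.
groups = defaultdict(set)
with open('f.txt') as fh:
    for line in fh:
        if not line.strip():
            continue                # skip blank lines
        a, b = line.split()
        groups[a.lower()].add(b)

for key, values in groups.items():
    print("%s - %s" % (key, ",".join(values)))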
You can use this simple multimap:
class MultiMap(object):
    values = {}

    def __getitem__(self, index):
        return self.values[index]

    def __setitem__(self, index, value):
        if not self.values.has_key(index):
            self.values[index] = []
        self.values[index].append(value)

    def __repr__(self):
        return repr(self.values)
See it in action: http://codepad.org/xOOrlbnf
Simple Perl version:
#!/usr/bin/perl
use strict;
use warnings;
my (%v, @row);
foreach (<DATA>) {
chomp;
$_ = lc($_);
@row = split(/\s+/, $_);
push( @{ $v{$row[0]} }, $row[1]);
}
foreach (sort keys %v) {
print "$_ - ", join( ", ", @{ $v{$_} } ), "\n";
}
__DATA__
xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4
I did not focus on variable names. From the example I see they are not case sensitive.
f = """xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4"""
d = {}
for line in f.split("\n"):
key, val = line.lower().split()
try:
d[key].append(val)
except KeyError:
d[key] = [val]
print d
And in Perl:
while() {
($key, $value) = split / /, $_;
$hash{lc($key)} = 1;
push(@array, "$key$value");
}
foreach $key (sort keys %hash) {
@arr = (grep /$key/i, @array);
chomp(@arr);
$val = join (", ", @arr);
$val =~ s#$key##gi;
print "$key\t$val\n";
}
Using Perl oneliner:
perl -lane'$F[0]=~s/.../lc$&/e;exists$s{$F[0]}and$s{$F[0]}.=",$F[1]"or push@v,$F[0]and$s{$F[0]}=$F[1]}{print"$_ $s{$_}"for@v'
You can remove $F[0]=~s/.../lc$&/e; if your key is case sensitive (which is not true in your test data) or use $F[0]=lc$F[0]; or $F[0]=uc$F[0]; if you can unify your key to lower or upper case.

Summing up two columns the Unix way

# To fix the symptom
How can you sum up the following columns effectively?
Column 1
1
3
3
...
Column 2
2323
343
232
...
This should give me
Expected result
2324
346
235
...
I have the columns in two files.
# Initial situation
I sometimes use too many curly brackets, such that I have used one more { than } in my files.
I am trying to find where I have used the one unnecessary curly bracket.
I have used the following steps in getting the data
Find commands
find . * -exec grep '{' {} + > /tmp/1
find . * -exec grep '}' {} + > /tmp/2
AWK commands
awk -F: '{ print $2 }' /tmp/1 > /tmp/11
awk -F: '{ print $2 }' /tmp/2 > /tmp/22
The column are in the files /tmp/11 and /tmp/22.
I repeat a lot of similar commands in my procedure.
This suggests to me that this is not the right way.
Please suggest any way, such as Python, Perl or any Unix tool, which can decrease the number of steps.
If c1 and c2 are your files, you can do this:
$ paste c1 c2 | awk '{print $1 + $2}'
Or (without AWK):
$ paste c1 c2 | while read i j; do echo $(($i+$j)); done
Using python:
totals = [ int(i)+int(j) for i, j in zip ( open(fname1), open(fname2) ) ]
You can avoid the intermediate steps by just using a command that does the counts and the comparison at the same time:
find . -type f -exec perl -nle 'END { print $ARGV if $h{"{"} != $h{"}"} } $h{$_}++ for /([}{])/g' {}\;
This calls the Perl program once per file; the Perl program counts the number of each type of curly brace and prints the name of the file if the counts don't match.
You must be careful with the /([}{])/ section; find will think it needs to do the replacement on {} if you say /([{}])/.
WARNING: this code will have false positives and negatives if you are trying to run it against source code. Consider the following cases:
balanced, but curlies in strings:
if ($s eq '{') {
print "I saw a {\n"
}
unbalanced, but curlies in strings:
while (1) {
print "}";
You can expand the Perl command by using B::Deparse:
perl -MO=Deparse -nle 'END { print $ARGV if $h{"{"} != $h{"}"} } $h{$_}++ for /([}{])/g'
Which results in:
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
sub END {
print $ARGV if $h{'{'} != $h{'}'};
}
;
++$h{$_} foreach (/([}{])/g);
}
We can now look at each piece of the program:
BEGIN { $/ = "\n"; $\ = "\n"; }
This is caused by the -l option. It sets both the input and output record separators to "\n". This means anything read in will be broken into records based on "\n" and any print statement will have "\n" appended to it.
LINE: while (defined($_ = <ARGV>)) {
}
This is created by the -n option. It loops over every file passed in via the commandline (or STDIN if no files are passed) reading each line of those files. This also happens to set $ARGV to the last file read by <ARGV>.
chomp $_;
This removes whatever is in the $/ variable from the line that was just read ($_); it does nothing useful here. It was caused by the -l option.
sub END {
print $ARGV if $h{'{'} != $h{'}'};
}
This is an END block; this code will run at the end of the program. It prints $ARGV (the name of the file last read from, see above) if the values stored in %h associated with the keys '{' and '}' are not equal.
++$h{$_} foreach (/([}{])/g);
This needs to be broken down further:
/
( #begin capture
[}{] #match any of the '}' or '{' characters
) #end capture
/gx
Is a regex that returns a list of '{' and '}' characters that are in the string being matched. Since no string was specified the $_ variable (which holds the line last read from the file, see above) will be matched against. That list is fed into the foreach statement which then runs the statement it is in front of for each item (hence the name) in the list. It also sets $_ (as you can see $_ is a popular variable in Perl) to be the item from the list.
++$h{$_}
This line increments the value in %h that is associated with $_ (which will be either '{' or '}', see above) by one.
In Python (or Perl, Awk, &c) you can reasonably do it in a single stand-alone "pass" -- I'm not sure what you mean by "too many curly brackets", but you can surely count curly use per file. For example (unless you have to worry about multi-GB files), the 10 files using the most curly braces:
import heapq
import os
import re

curliest = dict()
for path, dirs, files in os.walk('.'):
    for afile in files:
        fn = os.path.join(path, afile)
        with open(fn) as f:
            data = f.read()
        braces = data.count('{') + data.count('}')
        curliest[fn] = braces

top10 = heapq.nlargest(10, curliest, curliest.get)
top10.sort(key=curliest.get)
for fn in top10:
    print '%6d %s' % (curliest[fn], fn)
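If only the mismatch report is needed rather than a top-10 list, the same walk-and-count idea can be sketched in a few lines (Python 3, current directory assumed):
import os

# Sketch: print every file whose '{' and '}' counts differ.
for dirpath, _dirs, files in os.walk('.'):
    for name in files:
        path = os.path.join(dirpath, name)
        try:
            with open(path, errors='replace') as f:
                text = f.read()
        except OSError:
            continue                # unreadable file, skip it
        if text.count('{') != text.count('}'):
            print(path)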
Reply to Lutz's answer
My problem was finally solved by this command
paste -d: /tmp/1 /tmp/2 | awk -F: '{ print $1 "\t" $2 - $4 }'
Your problem can be solved with just one awk command...
awk '{getline i<"file1";print i+$0}' file2
