How to merge lines and add column values? - python

So I have a laaaaaaaarge file like this:
Item|Cost1|Cost2
Pizza|50|25
Sugar|100|100
Spices|100|200
Pizza|100|25
Sugar|200|100
Pizza|50|100
I want to add all Cost1s and Cost2s for a particular item and produce a merged output.
I've written some Python code to do this:
item_dict = {}
for line in file:
    fields = line.split('|')
    item = fields[0]
    cost1 = fields[1]
    cost2 = fields[2]
    if item_dict.has_key(item):
        item_dict[item][0] += int(cost1)
        item_dict[item][1] += int(cost2)
    else:
        item_dict[item] = [int(cost1),int(cost2)]
for key, val in item_dict.items():
    print key,"|".join(val)
Is there any way to do this very efficiently and quickly in awk or with any other wizardry?
Or can I make my Python more elegant and faster?
Expected Output
Pizza|200|150
Sugar|300|200
Spices|100|200

Something like this...
$ awk 'BEGIN{OFS=FS="|"}
NR>1 {cost1[$1]+=$2; cost2[$1]+=$3}
END{ for (i in cost1) print i, cost1[i], cost2[i]}' file
Sugar|300|200
Spices|100|200
Pizza|200|150
Explanation
BEGIN{OFS=FS="|"} sets the (input & output) field separator to be |.
NR>1 means that we perform the actions only for lines after the first one. This way we skip the header.
cost1 and cost2 are arrays whose index is the first field and whose value is the running sum up to that point.
END {} is something we do after reading the whole file. It consists of looping through the arrays and printing the values.

awk '
BEGIN { FS=OFS="|" }
NR==1 { expectedNF = NF; next }
NF != expectedNF { print "Fix your #%##&! data, idiot!"; exit 1 }
{
    items[$1]
    for (c=2;c<=NF;c++)
        cost[$1,c] += $c
}
END {
    for (i in items) {
        printf "%s", i
        for (c=2;c<=NF;c++)
            printf "%s%s", OFS, cost[i,c]
        print ""
    }
}
' file
Feel free to compress it onto 1 or 2 lines as you see fit.

In practice I would have done what fedorqui did. For completeness however, this python script should be faster than your original:
#!/usr/bin/env python
import fileinput

item_dict = {}
for line in fileinput.input():
    if not fileinput.isfirstline():
        fields = line.strip().split('|')
        item = fields[0]
        cost1 = int(fields[1])
        cost2 = int(fields[2])
        try:
            item_dict[item][0] += cost1
            item_dict[item][1] += cost2
        except KeyError:
            item_dict[item] = [cost1, cost2]

for key, val in item_dict.items():
    print "%s|%s|%s" % (key, val[0], val[1])
Save the script to a file such as sumcols, make it executable with chmod +x sumcols, and run it like:
$ ./sumcols file
Spices|100|200
Sugar|300|200
Pizza|200|150
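For reference, a Python 3 version of the same idea, as a minimal sketch assuming the exact Item|Cost1|Cost2 layout shown above (one header row, integer costs) and the file name passed as the first argument:
#!/usr/bin/env python3
# Minimal sketch: sum Cost1 and Cost2 per item for an Item|Cost1|Cost2 file.
import sys
from collections import defaultdict

totals = defaultdict(lambda: [0, 0])   # item -> [sum of Cost1, sum of Cost2]

with open(sys.argv[1]) as f:
    next(f)                             # skip the header line
    for line in f:
        item, cost1, cost2 = line.rstrip("\n").split("|")
        totals[item][0] += int(cost1)
        totals[item][1] += int(cost2)

for item, (c1, c2) in totals.items():
    print(f"{item}|{c1}|{c2}")
Like the fileinput version above, it keeps only one running total per item in memory, so the size of the file is not a problem.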

Related

Python csv merge multiple files with different columns

I hope somebody can help me with this issue.
I have about 20 csv files (each file with its headers), and each of these files has hundreds of columns.
My problem is related to merging those files, because a couple of them have extra columns.
I was wondering if there is a way to merge all those files into one, adding the new columns with their related data, without corrupting the other files.
So far I have used the awk terminal command:
awk '(NR == 1) || (FNR > 1)' *.csv > file.csv
to merge them, removing the headers from all the files except the first one.
I got this from my previous question
Merge multiple csv files into one
But this does not solve the issue with the extra column.
EDIT:
Here are some of the csv files in plain text with the headers.
file 1
"#timestamp","#version","_id","_index","_type","ad.(fydibohf23spdlt)/cn","ad.</o","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","ad.totalComputers","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationUserPrivileges","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-07-27 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file2
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-07-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file3
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
file4
"#timestamp","#version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Those are 4 of the 20 files, I included all the headers but no rows because they contain sensitive data.
When I run the script on those files, I can see that it writes the timestamp value. But when I run it against the original files (with a lot of data), all it does is write the header and that's it. Please let me know if you need some more info.
Once I run the script on the original files, this is what I get back:
There are 20 rows (one for each file) but it doesn't write the content of each file. Could this be related to the sniffing of the first line? I think it checks only the first line of the files and then moves on, as in the script. So how is it that on a small file it manages to copy/merge the content as well?
Your question isn't clear; I don't know if you really want a solution in awk or python or either, and it doesn't have any sample input/output we can test with, so it's a guess, but is this what you're trying to do (using any awk in any shell on every Unix box)?
$ head file{1..2}.csv
==> file1.csv <==
1,2
a,b
c,d
==> file2.csv <==
1,2,3
x,y,z
$ cat tst.awk
BEGIN {
    FS = OFS = ","
    for (i=1; i<ARGC; i++) {
        if ( (getline < ARGV[i]) > 0 ) {
            if ( NF > maxNF ) {
                maxNF = NF
                hdr = $0
            }
        }
    }
}
NR == 1 { print hdr }
FNR > 1 { NF=maxNF; print }
$ awk -f tst.awk file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z
See http://awk.freeshell.org/AllAboutGetline for details on when/how to use getline and its associated caveats.
Alternatively with an assist from GNU head for -q:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
    if ( NF > maxNF ) {
        maxNF = NF
        hdr = $0
    }
    next
}
!doneHdr++ { print hdr }
FNR > 1 { NF=maxNF; print }
$ head -q -n 1 file{1..2}.csv | awk -f tst.awk - file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z
As already explained in your original question, you can easily extend the columns in Awk if you know how many to expect.
awk -F ',' -v cols=5 'BEGIN { OFS=FS }
FNR == 1 && NR > 1 { next }
NF<cols { for (i=NF+1; i<=cols; ++i) $i = "" }
1' *.csv >file.csv
I slightly refactored this to skip the unwanted lines with next rather than vice versa; this simplifies the rest of the script slightly. I also added the missing comma separator.
You can easily print the number of columns in each file, and just note the maximum:
awk -F , 'FNR==1 { print NF, FILENAME }' *.csv
If you don't know how many fields there are going to be in files you do not yet have, or if you need to cope with complex CSV with quoted fields, maybe switch to Python for this. It's not too hard to do the field number sniffing in Awk, but coping with quoting is tricky.
import csv
import sys

# Sniff just the first line from every file
fields = 0
for filename in sys.argv[1:]:
    with open(filename) as raw:
        for row in csv.reader(raw):
            # If the line is longer than current max, update
            if len(row) > fields:
                fields = len(row)
                titles = row
            # Break after first line, skip to next file
            break

# Now do the proper reading
writer = csv.writer(sys.stdout)
writer.writerow(titles)
for filename in sys.argv[1:]:
    with open(filename) as raw:
        for idx, row in enumerate(csv.reader(raw)):
            if idx == 0:
                continue
            row.extend([''] * (fields - len(row)))
            writer.writerow(row)
This simply assumes that the additional fields go at the end. If the files could have extra columns between other columns, or columns in different order, you need a more complex solution (though not by much; the Python CSV DictReader subclass could do most of the heavy lifting).
Demo: https://ideone.com/S998l4
If you wanted to do the same type of sniffing in Awk, you basically have to specify the names of the input files twice, or do some nontrivial processing in the BEGIN block to read all the files before starting the main script.
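To illustrate the DictReader remark above, here is a minimal sketch of a name-based merge (my own illustration, not tested on your data; it assumes the file names are passed as command-line arguments and that missing cells should simply be left empty). Because it matches columns by header name rather than by position, extra or reordered columns are handled too:
#!/usr/bin/env python3
# Sketch: merge CSV files with differing headers, padding missing columns with ''.
import csv
import sys

# First pass: collect the union of all header names, preserving first-seen order.
fieldnames = []
for filename in sys.argv[1:]:
    with open(filename, newline='') as f:
        for name in csv.DictReader(f).fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)

# Second pass: write every row; DictWriter fills absent columns with restval.
writer = csv.DictWriter(sys.stdout, fieldnames=fieldnames,
                        restval='', extrasaction='ignore')
writer.writeheader()
for filename in sys.argv[1:]:
    with open(filename, newline='') as f:
        for row in csv.DictReader(f):
            writer.writerow(row)
Run it as, for example, python3 merge_csv.py *.csv > file.csv (merge_csv.py being whatever name you give the script).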

How to remove a pipe character from a data field in a pipe-delimited file

Experts, I have a simple pipe-delimited file from a source system which has a free-flow text field, and for one of the records I see that a "|" character is coming in as part of the data. This is breaking my file unevenly and it is not getting parsed into the correct number of fields. I want to replace the "|" in the data field with a "#".
Record coming in from the source system. There are a total of 9 fields in the file.
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456
If you notice the 4th field - Free"flow|Text - this is the complete value from the source, which has a pipe in it.
I want to change it to Free"flow#Text and then read the file with a pipe delimiter.
Desired Outcome-
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
I tried a few awk/sed combinations, but didn't get the desired output.
Thanks
Since you know there are 9 fields, and the 4th is a problem: take the first 3 fields and the last 5 fields and whatever is left over is the 4th field.
You did tag shell, so here's some bash: I'm sure the python equivalent is close:
line='OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456'
IFS='|'
read -ra fields <<<"$line"
first3=( "${fields[@]:0:3}" )
last5=( "${fields[@]: -5}" )
tmp=${line#"${first3[*]}$IFS"} # remove the first 3 joined with pipe
field4=${tmp%"$IFS${last5[*]}"} # remove the last 5 joined with pipe
data=( "${first3[@]}" "$field4" "${last5[@]}" )
newline="${first3[*]}$IFS${field4//$IFS/#}$IFS${last5[*]}"
# .......^^^^^^^^^^^^....^^^^^^^^^^^^^^^^^....^^^^^^^^^^^
printf "%s\n" "$line" "$newline"
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
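Since the answer above mentions that the Python equivalent is close, here is a minimal sketch of the same first-3/last-5 idea in Python (assuming exactly 9 logical fields, with any stray "|" belonging to the 4th):
# Sketch: rebuild a 9-field record where only field 4 may contain stray '|'.
line = 'OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456'

parts = line.split('|')
first3, last5 = parts[:3], parts[-5:]
field4 = '#'.join(parts[3:-5])          # whatever is left over is field 4
newline = '|'.join(first3 + [field4] + last5)
print(newline)
# OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
If the 4th field happens to contain no extra pipe, parts[3:-5] is a single element and the join leaves it unchanged.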
With awk, it's simpler: if there are 10 fields, join fields 4 and 5, and shift the rest down one.
echo "$line" | awk '
    BEGIN { FS = OFS = "|" }
    NF == 10 {
        $4 = $4 "#" $5
        for (i=5; i<NF; i++)
            $i = $(i+1)
        NF--
    }
    1
'
OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
You tagged your question with Python so I assume a Python-based answer is acceptable.
I assume not all records in your file have the additional "|" in it, but only some records have the "|" in the free text column.
For a more realistic example, I create an input with some correct records and some erroneous records.
I use StringIO to simulate the file, in your environment read the real file with 'open'.
from io import StringIO
sample = 'OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456\nOutboundManualCall|J|LALALA HTREDFST|FreeHalalText|20191029|X|X|X|3456\nOutboundManualCall|J|LALALA HTREDFST|FrulaalText|20191029|X|X|X|3456\nOutboundManualCall|H|RTYEHLA HTREDFST|Free"flow|Text|20191029|X|X|X|3456'
infile = StringIO(sample)
outfile = StringIO()
for line in infile.readlines():
    cols = line.split("|")
    if len(cols) > 9:
        print(f"bad column {cols[3:5]}")
        line = "|".join(cols[:3]) + "|" + "#".join(cols[3:5]) + "|" + "|".join(cols[5:])
    outfile.write(line)
print("Corrected file:")
print(outfile.getvalue())
Results in:
> bad column ['Free"flow', 'Text']
> bad column ['Free"flow', 'Text']
> Corrected file:
> OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456
> OutboundManualCall|J|LALALA HTREDFST|FreeHalalText|20191029|X|X|X|3456
> OutboundManualCall|J|LALALA HTREDFST|FrulaalText|20191029|X|X|X|3456
> OutboundManualCall|H|RTYEHLA HTREDFST|Free"flow#Text|20191029|X|X|X|3456

Create reverse complement sequence based on AWK

Dear stackoverflow users,
I have TAB-separated data like this:
head -4 input.tsv
seq A C change
seq T A ok
seq C C change
seq AC CCT change
And I need to create a reverse complement function in awk which does something like this:
head -4 output.tsv
seq T G change
seq T A ok
seq G G change
seq GT AGG change
So if the 4th column is flagged "change", I need to create the reverse complement sequence.
HINT - tr in bash does the same thing, for example - a Bash one-liner for this task is:
echo "ACCGA" | rev | tr "ATGC" "TACG"
I tried something like this:
awk 'BEGIN {c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A" }{OFS="\t"}
function revcomp( i, o) {
    o = ""
    for(i = length; i > 0; i--)
        o = o c[substr($0, i, 1)]
    return(o)
}
{
    if($4 == "change"){$2 = revcom(); $3 = revcom()} print $0; else print $0}' input
The biological complement means:
A => T
C => G
G => C
T => A
and the reverse complement means:
ACCATG => CATGGT
Edit: Also, just for education, anybody can share this solution in Python as well.
With a little tinkering of your attempt you can do something like below.
function revcomp(arg) {
    o = ""
    for(i = length(arg); i > 0; i--)
        o = o c[substr(arg, i, 1)]
    return(o)
}
BEGIN {c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A" ; OFS="\t"}
{
    if($4 == "change") {
        $2 = revcomp($2);
        $3 = revcomp($3)
    }
}1
The key here was to make the function revcomp take the column value as its argument and operate on it by iterating from the end. You were previously working on the whole line $0, i.e. substr($0, i, 1), which would cause a lot of unusual lookups on the array c.
I've also taken the liberty of changing the prototype of your function revcomp to take the input string and return the reversed one, because I wasn't sure how you were intending to use it in your original attempt.
If you intend to use the above as part of a larger script, I would recommend putting the whole code as above in a script file, setting the shebang interpreter to #!/usr/bin/awk -f and running the script as awk -f script.awk input.tsv
A crude bash version implemented in awk would look like the below. Note that it is not clean and not a recommended approach. See more at AllAboutGetline.
As before, call the function as $2 = revcomp_bash($2) and $3 = revcomp_bash($3)
function revcomp_bash(arg) {
    o = ""
    cmd = "printf \"%s\" " arg "| rev | tr \"ATGC\" \"TACG\""
    while ( ( cmd | getline o ) > 0 ) {
    }
    close(cmd);
    return(o)
}
Your whole code speaks GNU awk-isms, so I didn't bother converting it to a POSIX-compliant one. You could use split() with an empty delimiter instead of length(), but the POSIX specification gladly says that "The effect of a null string as the value of fs is unspecified."
Could you please try the following, written and tested with the shown samples (in GNU awk).
awk '
BEGIN{
    label["A"]="T"
    label["C"]="G"
    label["G"]="C"
    label["T"]="A"
}
function cVal(field){
    delete array
    num=split($field,array,"")
    for(k=1;k<=num;k++){
        if(array[k] in label){
            val=label[array[k]] val
        }
    }
    $field=val
    val=""
}
$NF=="change"{
    for(i=2;i<=(NF-1);i++){
        cVal(i)
    }
}
1
' Input_file | column -t
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this code here.
    label["A"]="T" ##Creating array label with index A and value T.
    label["C"]="G" ##Creating array label with index C and value G.
    label["G"]="C" ##Creating array label with index G and value C.
    label["T"]="A" ##Creating array label with index T and value A.
}
function cVal(field){ ##Creating function named cVal here with passing value field in it.
    delete array ##Deleting array here.
    num=split($field,array,"") ##Splitting current field value passed to it and creating array.
    for(k=1;k<=num;k++){ ##Running for loop from k=1 till value of num here.
        if(array[k] in label){ ##Checking condition if array value with index k is present in label array then do following.
            val=label[array[k]] val ##Creating val which has label value with index array with index k and keep concatenating its value to it.
        }
    }
    $field=val ##Setting current field value to val here.
    val="" ##Nullifying val here.
}
$NF=="change"{ ##Checking condition if last field is change then do following.
    for(i=2;i<=(NF-1);i++){ ##Running for loop from 2nd field to 2nd last field.
        cVal(i) ##Calling function with passing current field number to it.
    }
}
1 ##1 will print current line here.
' Input_file | column -t ##Mentioning Input_file name here.
Kinda inefficient for this particular application since it creates the mapping array on each call to tr() and does the same loop in tr() and then again in rev() but figured I'd show how to write standalone tr() and rev() functions and it'll probably be fast enough for your needs anyway:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
$4 == "change" {
    for ( i=2; i<=3; i++) {
        $i = rev(tr($i,"ACGT","TGCA"))
    }
}
{ print }
function tr(instr,old,new, outstr,pos,map) {
    for (pos=1; pos<=length(old); pos++) {
        map[substr(old,pos,1)] = substr(new,pos,1)
    }
    for (pos=1; pos<=length(instr); pos++) {
        outstr = outstr map[substr(instr,pos,1)]
    }
    return outstr
}
function rev(instr, outstr,pos) {
    for (pos=1; pos<=length(instr); pos++) {
        outstr = substr(instr,pos,1) outstr
    }
    return outstr
}
$ awk -f tst.awk file
seq T G change
seq T A ok
seq G G change
seq GT AGG change
If you are okay with perl:
$ perl -F'\t' -lane 'if($F[3] eq "change") {
$F[1] = (reverse $F[1] =~ tr/ATGC/TACG/r);
$F[2] = (reverse $F[2] =~ tr/ATGC/TACG/r) }
print join "\t", @F' ip.txt
seq T G change
seq T A ok
seq G G change
seq GT AGG change
You can also use the following, but it is not specific to columns and will change any sequence of ATCG characters:
perl -lpe 's/\t\K[ATCG]++(?=.*\tchange$)/reverse $&=~tr|ATGC|TACG|r/ge'
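And since the question also asked for a Python version for education, a minimal sketch of the same column-wise reverse complement (assuming the tab-separated layout shown above, read from standard input):
#!/usr/bin/env python3
# Sketch: reverse-complement columns 2 and 3 when column 4 says "change".
import sys

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    # complement every base, then reverse the whole string
    return seq.translate(COMPLEMENT)[::-1]

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 4 and fields[3] == "change":
        fields[1] = revcomp(fields[1])
        fields[2] = revcomp(fields[2])
    print("\t".join(fields))
Run it as python3 revcomp.py < input.tsv > output.tsv; on the sample above it produces the expected output.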

I need to replace a string in one file using key-value pairs from another file

I have a single attribute file that has two columns. The string in column 1 matches the string in the files that need to be changed. The string in File 2 needs to become the string from File 1 column 2.
I'm not sure of the best way to approach this - sed? awk? There is only a single File 1 that has every key and value pair, and they are all unique. There are over 10,000 File 2s, each different but with the same format, that I would need to change from the numbers to the names. Every number in any of the File 2s will be in File 1.
File 1
1000079541 ALBlai_CCA27168
1000079542 ALBlai_CCA27169
1000082614 PHYsoj_128987
1000082623 PHYsoj_128997
1000112581 PHYcap_Phyca_508162
1000112588 PHYcap_Phyca_508166
1000112589 PHYcap_Phyca_508170
1000112592 PHYcap_Phyca_549547
1000120087 HYAara_HpaP801280
1000134210 PHYinf_PITG_01218T0
1000134213 PHYinf_PITG_01223T0
1000134221 PHYinf_PITG_01231T0
1000144497 PHYinf_PITG_13921T0
1000153541 PYTultPYU1_T002777
1000162512 PYTultPYU1_T013706
1000163504 PYTultPYU1_T014907
1000168326 PHYram_79731
1000168327 PHYram_79730
1000168332 PHYram_79725
1000168335 PHYram_79722
...
File 2
(1000079542:0.60919245567850022205,((1000162512:0.41491233674846345059,(1000153541:0.39076742568979516701,1000163504:0.52813999143574519302):0.14562273102476630537):0.28880212838980307000,(((1000144497:0.20364901110426453235,1000168327:0.22130795712572320921):0.35964649479701132906,((1000120087:0.34990382691181332042,(1000112588:0.08084123331549526725,(1000168332:0.12176200773214326811,1000134213:0.09481932223544080329):0.00945982345360765406):0.01846847662360769429):0.19758412044470402558,((1000168326:0.06182031367986642878,1000112589:0.07837371928562210377):0.03460740736793390532,(1000134210:0.13512192366876615846,(1000082623:0.13344777464787777044,1000112592:0.14943677128375676411):0.03425386814075986885):0.05235436818005634318):0.44112430521695145114):0.21763784827666701749):0.22507080810857052477,(1000112581:0.02102132893524749635,(1000134221:0.10938436290969000275,(1000082614:0.05263067805665807425,1000168335:0.07681947209386902342):0.03562545894572662769):0.02623229853693959113):0.49114147006852687527):0.23017851954961116023):0.64646763541457552549,1000079541:0.90035900920746847476):0.0;
Desired Result
(ALBlai_CCA27169:0.60919245567850022205,((PYTultPYU1_T013706:0.41491233674846345059, ...
Python:
import re

# Build a dictionary of replacements:
with open('File 1') as f:
    repl = dict(line.split() for line in f)

# Read in the file and make the replacements:
with open('File 2') as f:
    data = f.read()
data = re.sub(r'(\d+):',lambda m: repl[m.group(1)]+':',data)

# Write it back out:
with open('File 2','w') as f:
    f.write(data)
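Since there are over 10,000 File 2s, here is a minimal sketch of running that same replacement over all of them; the glob pattern file2_dir/*.txt is only a placeholder for however your File 2s are actually named:
import glob
import re

# Build the replacement dictionary once and reuse it for every file.
with open('File 1') as f:
    repl = dict(line.split() for line in f)

# 'file2_dir/*.txt' is a hypothetical pattern; adjust it to match your File 2s.
for path in glob.glob('file2_dir/*.txt'):
    with open(path) as f:
        data = f.read()
    data = re.sub(r'(\d+):', lambda m: repl[m.group(1)] + ':', data)
    with open(path, 'w') as f:
        f.write(data)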
Full running awk solution. Hope it helps.
awk -F":" 'BEGIN {
    while (getline < "file1")
    {
        split($0,dat," ");
        a[dat[1]]=dat[2];
    }
}
{
    gsub(substr($1,2,length($1)),a[substr($1,2,length($1))],$0); print
}' file2
I'll do something like that in bash:
while read -r key value
do
    echo "s/($key:/($value:/g" >> sedtmpfile
done < file1
sed -f sedtmpfile file2 > result
rm sedtmpfile

Search and sort data from several files

I have a set of 1000 text files with names in_s1.txt, in_s2.txt and so on. Each file contains millions of rows and each row has 7 columns like:
ccc245 1 4 5 5 3 -12.3
For me the most important are the values from the first and seventh columns; the pairs like ccc245, -12.3.
What I need to do is find, across all the in_sXXXX.txt files, the 10 cases with the lowest values in the seventh column, and I also need to know where each value is located, i.e. in which file. I need something like:
FILE 1st_col 7th_col
in_s540.txt ccc3456 -9000.5
in_s520.txt ccc488 -723.4
in_s12.txt ccc34 -123.5
in_s344.txt ccc56 -45.6
I was thinking about using python and bash for this purpose, but so far I have not found a practical approach. All I know how to do is:
concatenate all the in_ files into IN.TXT
search for the lowest values there using: for i in IN.TXT ; do sort -k6n $i | head -n 10; done
given the 1st_col and 7th_col values of the top-ten list, use them to filter the in_s files using grep -n VALUE in_s*, so I get the name of the file for each value
It works, but it is a bit tedious. I wonder about a faster approach using only bash or python or both, or another language better suited for this.
Thanks
In python, use the nsmallest function in the heapq module -- it's designed for exactly this kind of task.
Example (tested) for Python 2.5 and 2.6:
import heapq, glob

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield fname, items[0], float(items[6])
        f.close()

result = heapq.nsmallest(10, my_iterable(), lambda x: x[2])
print result
Update after the above answer was accepted
Looking at the source code for Python 2.6, it appears that there's a possibility that it does list(iterable) and works on that ... if so, that's not going to work with a thousand files each with millions of lines. If the first answer gives you MemoryError etc, here's an alternative which limits the size of the list to n (n == 10 in your case).
Note: 2.6 only; if you need it for 2.5 use a conditional heapreplace() as explained in the docs. Uses heappush() and heappushpop() which don't have the key arg :-( so we have to fake it.
import glob
from heapq import heappush, heappushpop
from pprint import pprint as pp

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield -float(items[6]), fname, items[0]
        f.close()

def homegrown_nlargest(n, iterable):
    """Ensures heap never has more than n entries"""
    heap = []
    for item in iterable:
        if len(heap) < n:
            heappush(heap, item)
        else:
            heappushpop(heap, item)
    return heap

result = homegrown_nlargest(10, my_iterable())
result = sorted(result, reverse=True)
result = [(fname, fld0, -negfld6) for negfld6, fname, fld0 in result]
pp(result)
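If you want the output formatted as in the question (FILE, 1st_col, 7th_col), a small follow-up sketch that works with the result list produced by either version above:
# Sketch: print the (fname, first_col, seventh_col) tuples as the requested table.
print("%-15s %-10s %s" % ("FILE", "1st_col", "7th_col"))
for fname, first_col, seventh_col in result:
    print("%-15s %-10s %s" % (fname, first_col, seventh_col))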
I would:
take the first 10 items,
sort them, and then
for every line read from the files, insert the element into that top 10
in case its value is lower than the highest one from the current top 10
(keeping the sorting for performance).
I wouldn't post the complete program here as it looks like homework.
Yes, if it wasn't ten, this would not be optimal.
Try something like this in python:
import os

min_values = []

def add_to_min(file_name, one, seven):
    # checks to see if the 7th column value is lower than the existing values
    if len(min_values) < 10 or seven < max(min_values)[0]:
        # let's remove the biggest value once we already hold 10
        min_values.sort()
        if len(min_values) >= 10:
            min_values.pop()
        # and add the new value tuple
        min_values.append((seven, file_name, one))

# loop through all the files
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f.readlines():
        columns = line.split()
        add_to_min(file_name, columns[0], float(columns[6]))

# print answers
for (seven, file_name, one) in min_values:
    print file_name, one, seven
Haven't tested it, but it should get you started.
Version 2, just runs the sort a single time (after a prod by S. Lott):
values = []

# loop through all the files and make a long list of all the rows
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f.readlines():
        columns = line.split()
        values.append((float(columns[6]), file_name, columns[0]))

# sort values (by the 7th column), print the 10 smallest
values.sort()
for (seven, file_name, one) in values[:10]:
    print file_name, one, seven
Just re-read your question: with millions of rows, you might run out of RAM...
A small improvement of your shell solution:
$ cat in.txt
in_s1.txt
in_s2.txt
...
$ cat in.txt | while read i
do
cat $i | sed -e "s/^/$i /" # add filename as first column
done |
sort -n -k8 | head -10 | cut -d" " -f1,2,8
This might be close to what you're looking for:
for file in *; do sort -k6n "$file" | head -n 10 | cut -f1,7 -d " " | sed "s/^/$file /" > "${file}.out"; done
cat *.out | sort -k3n | head -n 10 > final_result.out
If your files have millions of lines, you might want to consider using "buffering". The script below goes through those millions of lines, each time comparing field 7 with those in the buffer. If a value is smaller than those in the buffer, one of them in the buffer is replaced by the new lower value.
for file in in_*.txt
do
    awk -vt=$t 'NR<=10{
        c=c+1
        val[c]=$7
        tag[c]=$1
    }
    NR>10{
        for(o=1;o<=c;o++){
            if ( $7 <= val[o] ){
                val[o]=$7
                tag[o]=$1
                break
            }
        }
    }
    END{
        for(i=1;i<=c;i++){
            print val[i], tag[i] | "sort"
        }
    }' $file
done
