Merge two lines generated from contigs.fa - python

I have a file generated by assemblers. It looks like the following.
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAAC
CAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTAT
ACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGA
ACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTAT
TCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTG
TCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGT
CTTCC
I want to merge the lines using Python or the Linux sed command, and I want the result to look like this:
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC
That is, every sequence should be on a single line, with the node name on its own line above it.

A small pipe of tr and sed would do this:
$ tr -d '\n' < contigser.fa | sed 's/\(>[^.]\+\.[0-9]\+\)/\n\1\n/g' > newfile.fa
In python:
file = open('contigser.fa', 'r+')
lines = file.read().splitlines()
file.seek(0)
file.truncate()
for line in lines:
    if line.startswith('>'):
        # start each header on its own line
        file.write('\n' + line + '\n')
    else:
        # append sequence chunks without a newline
        file.write(line)
file.close()
Note: the python solution stores the changes back to contigser.fa.
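If you would rather not rewrite contigser.fa in place, a minimal sketch of the same idea that writes the joined records to a new file (the output name merged.fa is just an example):
with open('contigser.fa') as fin, open('merged.fa', 'w') as fout:
    first_header = True
    for line in fin:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # each header starts a new record; no leading blank line before the first
            fout.write(('' if first_header else '\n') + line + '\n')
            first_header = False
        else:
            # glue sequence chunks together on one line
            fout.write(line)
    fout.write('\n')  # terminate the last sequence line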

You can use awk to do the job:
awk < input_file '/^>/ {print ""; print; next} {printf "%s", $0} END {print ""}'
This only starts one process (awk). The only drawback: it adds an empty first line. You can avoid that by adding a state variable (the code belongs on one line; it is split here only for readability):
awk < input_file '/^>/ { if (flag) print ""; print; flag=0; next }
{ printf "%s", $0; flag=1 } END { if (flag) print "" }'
#how to store it in a new file:
awk < input_file > output_file '/^>/ { .... }'

$ awk '/^>/{printf "%s%s\n",(NR>1?ORS:""),$0; next} {printf "%s",$0} END{print ""}' file
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC

$ awk 'NR==1;ORS="";{sub(/>.*$/,"\n&\n");print (NR>1)?$0:""}END{print"\n"}' file
>NODE_1_length_211_cov_22.379147
CATTTGCTGAAGAAAAATTACGAGAAATGGAGCACAAGGCTGTTTTTGTGAATGTCAAACCAAGTGACAACTCTATAGCGTTTGTATAAGACTCTCATACTAATCCCAAGCAAACTCTATACTGACGCATGAACATGGAAGAGAAATGCTGCTCGTGTATGTATTATGGACCAGCTTGGAACACCATGTTAGGACTTTATAGATGTCTTACGATTTTTTCGACGTGATGAAGAAGTCTATTCAGCATTTGA
>NODE_2_length_85_cov_19.094118
TACTCCTGAGCACTTTGTGCTCTTAGTTCTTACTAGAACTGTTACAGCTCCACGAACTTGTCGACTCTTTGAGTCAATTTCTGTTAGTTCCTACGAACTAAGAGGCTCTCTGAGCCCAGTCTTCC

This might work for you (GNU sed):
sed '/^>/n;:a;$!N;s/\n\([^>]\)/\1/;ta;P;D' file
Following a line beginning with >, delete any newlines that precede a character other than a > symbol.

Related

what is the efficient way to change huge file in place

I have a huge file (~2,000,000 lines) and I am trying to replace a few different patterns while reading the file only once.
So I am guessing sed is not a good fit, since I have different patterns.
I tried to use awk with if/else, but the file is not changed.
#!/usr/bin/awk -f
{
    if($0 ~ /data for AAA/)
    {
        sub(/^[0-9]+$/, "bla_AAA", $2)
    }
    if($0 ~ /data for BBB/)
    {
        sub(/^[0-9]+$/, "bla_BBB", $2)
    }
}
I expect the output of
address 01000 data for AAA
....
address 02000 data for BBB
....
to be
address bla_AAA data for AAA
....
address bla_BBB data for BBB
....
I don't see any indication in your question that your file really is large: 2,000,000 lines is nothing, and each sample line in your question is small, so chances are this is all you need:
awk '
/data for AAA/ { $2 = "bla_AAA"; next }
/data for BBB/ { $2 = "bla_BBB"; next }
' file > tmp && mv tmp file
GNU awk has a -i inplace option to do the same kind of "inplace" editing that sed, perl, etc. do (i.e. with a tmp file being used internally).
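For comparison, the write-a-temp-copy-then-rename mechanism that such in-place options rely on can be sketched in Python. This is only an illustration of the mechanism, reusing the sample patterns from the question and the input name file used above, not a drop-in replacement for the awk:
import os
import tempfile

# write the edited copy next to the original, then atomically swap it in
with open('file') as fin, tempfile.NamedTemporaryFile('w', dir='.', delete=False) as tmp:
    for line in fin:
        fields = line.split()          # note: this normalises runs of whitespace
        if 'data for AAA' in line and len(fields) > 1:
            fields[1] = 'bla_AAA'
        elif 'data for BBB' in line and len(fields) > 1:
            fields[1] = 'bla_BBB'
        tmp.write(' '.join(fields) + '\n')
os.replace(tmp.name, 'file')           # temp file must be on the same filesystem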
If you really didn't have enough storage to create a copy of the input file then you could use something like this (untested!):
headLines=10000
beg=1
tmp=$(mktemp) || exit 1
while [ -s file ]; do
    head -n "$headLines" file | awk 'above script' >> "$tmp" &&
    headBytes=$(head -n "$headLines" file | wc -c) &&
    dd if=file bs="$headBytes" skip=1 conv=notrunc of=file &&
    truncate -s "-$headBytes" file
    rslt=$?
done
(( rslt == 0 )) && mv "$tmp" file
so you're never using up more storage than the size of your input file plus headLines lines (massage that number to suit). See https://stackoverflow.com/a/17331179/1745001 for info on what truncate and the 2 lines before it are doing.
Something like this (read each line, do the text manipulation, write the modified data to the output file):
with open('in.txt') as f_in:
    with open('out.txt', 'w') as f_out:
        for line in f_in:
            fields = line.strip().split(' ')
            if len(fields) > 4:
                # e.g. "address 01000 data for AAA" -> "address bla_AAA data for AAA"
                fields[1] = 'bla_{}'.format(fields[4])
            f_out.write(' '.join(fields) + '\n')

adding character at the end of each string in a file

I have a txt file that contains a single column of single words as such:
windfall
winnable
winner
winners
winning
I want to use the words in the file as regex strings for a mapping job. When finished the words should look like this:
windfall|winnable|winner|winners|winning
I need to use python or awk to open the file, place a | after each word, and write the new content to a new file with the column converted to a single horizontal line.
any suggestions?
Simplest is tr:
tr '\n' '|' < file.txt
Using Python you could do:
with open('oldfile.txt') as fin:
    with open('newfile.txt', 'w') as fout:
        fout.write('|'.join(map(str.strip, fin)))
The str.strip removes newlines and whitespace, while the join concatenates the lines with |.
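If the end goal really is one big alternation for a mapping job, a small follow-up sketch (reusing oldfile.txt from above; the escaping step is an assumption about the word list) compiles the joined words into a regex directly:
import re

with open('oldfile.txt') as fin:
    words = [w.strip() for w in fin if w.strip()]

# escape each word in case one ever contains a regex metacharacter
pattern = re.compile('|'.join(map(re.escape, words)))

print(pattern.pattern)                                  # windfall|winnable|winner|winners|winning
print(bool(pattern.search('the winner takes it all')))  # True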
Using sed:
$ cat file
windfall
winnable
winner
winners
winning
$ sed ':a;N;s/\n/|/;ba' file
windfall|winnable|winner|winners|winning
Create a loop using :a
Load the next line with N into the pattern space
Substitute the newline with a pipe
Rinse and repeat.
In awk, if you don't want the trailing |:
$ awk '{ s=s (NR>1?"|":"") $0 } END { print s }' file
windfall|winnable|winner|winners|winning
The original version with getline, which was basically an (not even the) outcome of an awk jamming session, was:
$ awk 'BEGIN {
    while(r=getline) {          # read until EOF
        s=s (p==r?"|":"") $0    # pile it to s, precede with | after the first
        p=r                     # p revious r eturn value of getline
    }
    print s                     # out with the pile
}' file
windfall|winnable|winner|winners|winning
awk -v RS= -v OFS="|" '/ /{next}$1=$1' file
windfall|winnable|winner|winners|winning
Use paste:
$ cat /tmp/so.txt
windfall
winnable
winner
winners
winning
$ paste -sd'|' /tmp/so.txt
windfall|winnable|winner|winners|winning
Assuming no blank lines in between rows and input smaller than 500 MB, it is better to keep it simple:
echo 'windfall
winnable
winner
winners
winning' |
{m,g,n}awk NF=NF RS= OFS='|'
windfall|winnable|winner|winners|winning

Filter a smaller file using another huge file

I have a huge csv file with about 10^9 lines where each line has a pair of ids such as:
IDa,IDb
IDb,IDa
IDc,IDd
Call this file1. I have another much smaller csv file with about 10^6 lines in the same format. Call this file2.
I want to simply find the lines in file2 which contain at least one ID that exists somewhere in file1.
Is there a fast way to do this? I don't mind if it is in awk, python or perl.
$ cat > file2 # make test file2
IDb,IDa
$ awk -F, 'NR==FNR{a[$1];a[$2];next} ($1 in a&&++a[$1]==1){print $1} ($2 in a&&++a[$2]==1){print $2}' file2 file1 > file3
$ cat file3 # file2 ids in file1 put to file3
IDa
IDb
$ awk -F, 'NR==FNR{a[$1];next} ($1 in a)||($2 in a){print $0}' file3 file2
IDb,IDa
I would actually use sqlite for something like that. You could create a new database in the same directory as the two files with sqlite3 test.sqlite and then do something like this:
create table file1(id1, id2);
create table file2(id1, id2);
.separator ","
.import file1.csv file1
.import file2.csv file2
WITH all_ids AS (
SELECT id1 FROM file1 UNION SELECT id2 FROM file1
)
SELECT * FROM file2 WHERE id1 IN all_ids OR id2 IN all_ids;
The advantage of using sqlite is that you can manage the memory more intelligently than a simple script that you could write in some scripting language.
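The same approach can be driven from Python's built-in sqlite3 module; a rough sketch, run against a fresh test.sqlite (the csv loading replaces the .import commands and assumes exactly two fields per row):
import csv
import sqlite3

con = sqlite3.connect('test.sqlite')
con.execute('CREATE TABLE file1 (id1, id2)')
con.execute('CREATE TABLE file2 (id1, id2)')

# load both csv files (every row is expected to have exactly two ids)
for table, path in (('file1', 'file1.csv'), ('file2', 'file2.csv')):
    with open(path, newline='') as f:
        con.executemany('INSERT INTO %s VALUES (?, ?)' % table, csv.reader(f))

query = """
SELECT * FROM file2
WHERE id1 IN (SELECT id1 FROM file1 UNION SELECT id2 FROM file1)
   OR id2 IN (SELECT id1 FROM file1 UNION SELECT id2 FROM file1)
"""
for id1, id2 in con.execute(query):
    print('{},{}'.format(id1, id2))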
Using these input files for testing:
$ cat file1
IDa,IDb
IDb,IDa
IDc,IDd
$ cat file2
IDd,IDw
IDx,IDc
IDy,IDz
If file1 can fit in memory:
$ awk -F, 'NR==FNR{a[$1];a[$2];next} ($1 in a) || ($2 in a)' file1 file2
IDd,IDw
IDx,IDc
If not but file2 can fit in memory:
$ awk -F, '
ARGIND==2 {
    if ($1 in inBothFiles) {
        inBothFiles[$1] = 1
    }
    if ($2 in inBothFiles) {
        inBothFiles[$2] = 1
    }
    next
}
ARGIND==1 {
    inBothFiles[$1] = 0
    inBothFiles[$2] = 0
    next
}
ARGIND==3 {
    if (inBothFiles[$1] || inBothFiles[$2]) {
        print
    }
}
' file2 file1 file2
IDd,IDw
IDx,IDc
The above uses GNU awk for ARGIND - with other awks just add a FNR==1{ARGIND++} block at the start.
I have the ARGIND==2 block (i.e. the part that processes the 2nd argument which in this case is the 10^9 file1) listed first for efficiency so we don't unnecessarily test ARGIND==1 for every line in the much larger file.
In perl,
use strict;
use warnings;
use autodie;
# read file2
open my $file2, '<', 'file2';
chomp( my @file2 = <$file2> );
close $file2;
# record file2 line numbers each id is found on
my %id;
for my $line_number (0..$#file2) {
    for my $id ( split /,/, $file2[$line_number] ) {
        push @{ $id{$id} }, $line_number;
    }
}
# look for those ids in file1
my @use_line;
open my $file1, '<', 'file1';
while ( my $line = <$file1> ) {
    chomp $line;
    for my $id ( split /,/, $line ) {
        if ( exists $id{$id} ) {
            @use_line[ @{ $id{$id} } ] = @{ $id{$id} };
        }
    }
}
close $file1;
# print lines whose ids were found
print "$_\n" for @file2[ grep defined, @use_line ];
Sample files:
cat f1
IDa,IDb
IDb,IDa
IDc,IDd
cat f2
IDt,IDy
IDb,IDj
Awk solution:
awk -F, 'NR==FNR {a[$1]=$1;b[$2]=$2;next} ($1 in a)||($2 in b)' f1 f2
IDb,IDj
This will store the first and second columns of f1 in arrays a and b, and then print a line of f2 if either its first or second column has been seen.
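The same three-pass idea as the ARGIND awk above can be sketched in plain Python (file names as in the question; only file2's IDs are held in memory):
# pass 1: collect the ids that occur in the small file2
wanted = {}
with open('file2') as f2:
    for line in f2:
        for id_ in line.rstrip('\n').split(','):
            wanted[id_] = False

# pass 2: stream the huge file1 once, marking ids that really occur there
with open('file1') as f1:
    for line in f1:
        for id_ in line.rstrip('\n').split(','):
            if id_ in wanted:
                wanted[id_] = True

# pass 3: re-read file2 and print the lines with at least one marked id
with open('file2') as f2:
    for line in f2:
        ids = line.rstrip('\n').split(',')
        if any(wanted.get(i, False) for i in ids):
            print(line.rstrip('\n'))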

splitting file into smaller files using by number of fields

I'm having a hard time breaking a large (50GB) csv file into smaller parts. Each line has a few thousand fields. Some of the fields are strings in double quotes, others are integers, decimals and booleans.
I want to parse the file line by line and split it by the number of fields in each row. The strings may contain several commas (as in the example below), as well as a number of empty fields.
,,1,30,50,"Sold by father,son and daughter for $4,000" , ,,,, 12,,,20.9,0,
I tried using
perl -pe' s{("[^"]+")}{($x=$1)=~tr/,/|/;$x}ge ' file >> file2
to change the commas inside the quotes to | but that didn't work. I plan to use
awk -F"|" conditional statement appending to new k_fld_files file2
Is there an easier way to do this please? I'm looking at python, but I probably need a utility that will stream process the file, line by line.
Using Python - if you just want to parse CSV including embedded delimiters, and stream out with a new delimiter, then something such as:
import csv
import sys
with open('filename.csv') as fin:
    csvout = csv.writer(sys.stdout, delimiter='|')
    for row in csv.reader(fin):
        csvout.writerow(row)
Otherwise, it's not much more difficult to make this do all kinds of stuff.
Example of outputting to files per column (untested):
cols_to_output = {}
for row in csv.reader(fin):
    for colno, col in enumerate(row):
        # one output file per column, opened on first use
        if colno not in cols_to_output:
            cols_to_output[colno] = open('column_output.{}'.format(colno), 'wb')
        csv.writer(cols_to_output[colno]).writerow([col])
for fileno in cols_to_output.itervalues():
    fileno.close()
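Since the question asks to split by the number of fields per row, a related sketch (Python 3; the part_N.csv output names are made up) that routes each row to a file named after its field count:
import csv

writers = {}   # field count -> (file handle, csv writer)

with open('filename.csv', newline='') as fin:
    for row in csv.reader(fin):
        n = len(row)
        if n not in writers:
            f = open('part_{}.csv'.format(n), 'w', newline='')
            writers[n] = (f, csv.writer(f))
        writers[n][1].writerow(row)

for f, _writer in writers.values():
    f.close()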
Here's an awk alternative.
Assuming the quoted strings are well formatted, i.e. always have starting and terminating quotes, and no quotes within other quotes, you could do the replacement you suggested by doing a gsub on every other field replacing , with |.
With pipes
Below is an example of how this might go when grabbing columns 3 through 6, 11 and 14-15 with coreutils cut:
awk -F'"' -v OFS='' '
NF > 1 {
for(i=2; i<=NF; i+=2) {
gsub(",", "|", $i);
$i = FS $i FS; # reinsert the quotes
}
print
}'\
| cut -d , -f 3-6,11,14-15 \
| awk -F'"' -v OFS='' -e '
NF > 1 {
for(i=2; i<=NF; i+=2) {
gsub("\\|", ",", $i)
$i = FS $i FS; # reinsert the quotes
}
print
}'
Note that there is an additional post-processing step that reverts the | to ,.
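A csv-module version of the same column grab avoids the quote juggling entirely; a sketch reusing the filename.csv placeholder from the csv answer above, with the 1-based column numbers from the cut example:
import csv
import sys

# 1-based column ranges, as in the cut example: 3-6, 11, 14-15
wanted = list(range(3, 7)) + [11] + list(range(14, 16))

with open('filename.csv', newline='') as fin:
    writer = csv.writer(sys.stdout)
    for row in csv.reader(fin):
        # assumes every row has at least 15 fields
        writer.writerow([row[i - 1] for i in wanted])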
Entirely in awk
Alternatively, you could do the whole thing in awk with some loss of generality with regards to range specification. Here we only grab columns 3 to 6:
extract.awk
BEGIN {
    OFS = ""
    start = 3
    end = 6
}
{
    for(i=2; i<=NF; i+=2) {
        gsub(",", "|", $i)
        $i = FS $i FS
    }
    split($0, record, ",")
    for(i=start; i<=end-1; i++) {
        gsub("\\|", ",", record[i])
        printf("%s,", record[i])
    }
    gsub("\\|", ",", record[end])
    printf("%s\n", record[end])
}

Summing up two columns the Unix way

# To fix the symptom
How can you sum up the following columns effectively?
Column 1
1
3
3
...
Column 2
2323
343
232
...
This should give me
Expected result
2324
346
235
...
I have the columns in two files.
# Initial situation
I sometimes use too many curly brackets, so that I end up with one more { than } in my files.
I am trying to find where I have used the one unnecessary curly bracket.
I have used the following steps to get the data.
Find commands
find . * -exec grep '{' {} + > /tmp/1
find . * -exec grep '}' {} + > /tmp/2
AWK commands
awk -F: '{ print $2 }' /tmp/1 > /tmp/11
awk -F: '{ print $2 }' /tmp/2 > /tmp/22
The columns are in the files /tmp/11 and /tmp/22.
I repeat a lot of similar commands in my procedure, which suggests to me that this is not the right way to do it.
Please suggest any approach, such as Python, Perl or any Unix tool, that can decrease the number of steps.
If c1 and c2 are your files, you can do this:
$ paste c1 c2 | awk '{print $1 + $2}'
Or (without AWK):
$ paste c1 c2 | while read i j; do echo $(($i+$j)); done
Using python:
totals = [ int(i)+int(j) for i, j in zip ( open(fname1), open(fname2) ) ]
You can avoid the intermediate steps by just using a command that does the counts and the comparison at the same time:
find . -type f -exec perl -nle 'END { print $ARGV if $h{"{"} != $h{"}"} } $h{$_}++ for /([}{])/g' {} \;
This calls the Perl program once per file; the Perl program counts the number of each type of curly brace and prints the name of the file if the counts don't match.
You must be careful with the /([}{])/ section: find will think it needs to do the replacement on {} if you say /([{}])/.
WARNING: this code will have false positives and negatives if you are trying to run it against source code. Consider the following cases:
balanced, but curlies in strings:
if ($s eq '{') {
    print "I saw a {\n"
}
unbalanced, but curlies in strings:
while (1) {
    print "}";
You can expand the Perl command by using B::Deparse:
perl -MO=Deparse -nle 'END { print $ARGV if $h{"{"} != $h{"}"} } $h{$_}++ for /([}{])/g'
Which results in:
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
    chomp $_;
    sub END {
        print $ARGV if $h{'{'} != $h{'}'};
    }
    ;
    ++$h{$_} foreach (/([}{])/g);
}
We can now look at each piece of the program:
BEGIN { $/ = "\n"; $\ = "\n"; }
This is caused by the -l option. It sets both the input and output record separators to "\n". This means anything read in will be broken into records based on "\n", and any print statement will have "\n" appended to it.
LINE: while (defined($_ = <ARGV>)) {
}
This is created by the -n option. It loops over every file passed in via the commandline (or STDIN if no files are passed) reading each line of those files. This also happens to set $ARGV to the last file read by <ARGV>.
chomp $_;
This removes whatever is in the $/ variable from the line that was just read ($_), it does nothing useful here. It was caused by the -l option.
sub END {
    print $ARGV if $h{'{'} != $h{'}'};
}
This is an END block; this code runs at the end of the program. It prints $ARGV (the name of the file last read from, see above) if the values stored in %h associated with the keys '{' and '}' are not equal.
++$h{$_} foreach (/([}{])/g);
This needs to be broken down further:
/
( #begin capture
[}{] #match any of the '}' or '{' characters
) #end capture
/gx
Is a regex that returns a list of '{' and '}' characters that are in the string being matched. Since no string was specified the $_ variable (which holds the line last read from the file, see above) will be matched against. That list is fed into the foreach statement which then runs the statement it is in front of for each item (hence the name) in the list. It also sets $_ (as you can see $_ is a popular variable in Perl) to be the item from the list.
++$h{$_}
This line increments the value in %h that is associated with $_ (which will be either '{' or '}', see above) by one.
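For reference, the same unbalanced-file check can be sketched in Python, walking the current directory like the find command above (this mirrors the Perl logic rather than replacing it; the errors='replace' choice is an assumption to survive non-text files):
import os

# print every file whose '{' and '}' counts differ, like the Perl one-liner
for path, dirs, files in os.walk('.'):
    for name in files:
        fn = os.path.join(path, name)
        with open(fn, errors='replace') as f:
            data = f.read()
        if data.count('{') != data.count('}'):
            print(fn)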
In Python (or Perl, Awk, &c) you can reasonably do it in a single stand-alone "pass" -- I'm not sure what you mean by "too many curly brackets", but you can surely count curly use per file. For example (unless you have to worry about multi-GB files), the 10 files using most curly braces:
import heapq
import os

curliest = dict()
for path, dirs, files in os.walk('.'):
    for afile in files:
        fn = os.path.join(path, afile)
        with open(fn) as f:
            data = f.read()
        braces = data.count('{') + data.count('}')
        curliest[fn] = braces
top10 = heapq.nlargest(10, curliest, key=curliest.get)
top10.sort(key=curliest.get)
for fn in top10:
    print '%6d %s' % (curliest[fn], fn)
Reply to Lutz's answer
My problem was finally solved by this command:
paste -d: /tmp/1 /tmp/2 | awk -F: '{ print $1 "\t" $2 - $4 }'
your problem can be solved with just 1 awk command...
awk '{getline i<"file1";print i+$0}' file2
