FASTA file header lines into column - python

I have a fasta file that contains sequence headers and their corresponding sequences as so:
>ID101_hg19
ATGGGTGTATCGTACCC
>ID102_hg19
AGCTTTAGCGGGGTACA
I want to change the header line to be another tab separated column next to the sequence. Here's the desired output:
>ID101_hg19 ATGGGTGTATCGTACCC
>ID102_hg19 AGCTTTAGCGGGGTACA
Any ideas on how to do this task?

Using Sed, you could do it like:
sed 'N;s/\n/\t/' file.txt
Using awk, you could do the following:
awk '{getline a; printf("%s\t%s", $0, a);}' file.txt

A slight correction to SMA's answer...
awk '{getline a; printf("%s\t%s\n", $0, a);}' file.txt
Adds a newline.

In general, each header line in a FASTA file can be followed by more than one line of data, so one might want to handle such cases. If the goal is to string together all the contiguous data lines, then the following would do the job:
awk '/^>/ {if (prev) {print prev;}; prev=$0 "\t"; next}
{prev=prev $0;}
END {print prev}'
If, on the other hand, the header is to be attached to just one line of data, then assuming the $'...' syntax is available, the sed command to use would be:
sed $'/^>/ {N;s/\\n/\t/;}'
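Since the question asks for Python ideas too, here is a minimal sketch along the same lines that also handles multi-line sequences; the filename file.fasta is a placeholder, not from the question:

# Minimal sketch: join each FASTA header with all of its sequence
# lines on one tab-separated line. "file.fasta" is a placeholder name.
header, chunks = None, []
with open('file.fasta') as fh:
    for line in fh:
        line = line.rstrip('\n')
        if line.startswith('>'):
            if header is not None:
                print(header + '\t' + ''.join(chunks))
            header, chunks = line, []
        else:
            chunks.append(line)
if header is not None:
    print(header + '\t' + ''.join(chunks))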

how to add header text with adjacent content in un-formatted data set, side by side with a delimiter separated value using sed/awk/python

I have a long list of unformatted data, say data.txt, where each set starts with a header and ends with a blank line, like:
TypeA/Price:20$
alexmob
moblexto
unkntom
TypeB/Price:25$
moblexto2
unkntom0
alexmob3
poptop9
tyloret
TypeC/Price:30$
rtyuoper0
kunlohpe6
mobryhox
Now, I want to add the header of each set to its content, side by side, comma separated, like:
alexmob,TypeA/Price:20$
moblexto,TypeA/Price:20$
unkntom,TypeA/Price:20$
moblexto2,TypeB/Price:25$
unkntom0,TypeB/Price:25$
alexmob3,TypeB/Price:25$
poptop9,TypeB/Price:25$
tyloret,TypeB/Price:25$
rtyuoper0,TypeC/Price:30$
kunlohpe6,TypeC/Price:30$
mobryhox,TypeC/Price:30$
so that whenever I grep for a keyword, the relevant content comes along with its header, like:
$ grep mob data.txt
alexmob,TypeA/Price:20$
moblexto,TypeA/Price:20$
moblexto2,TypeB/Price:25$
alexmob3,TypeB/Price:25$
mobryhox,TypeC/Price:30$
I am a newbie at bash scripting as well as Python and recently started learning these, so I would really appreciate any simple bash scripting (using sed/awk) or Python scripting.
Using sed
$ sed '/Type/{h;d;};/[a-z]/{G;s/\n/,/}' input_file
alexmob,TypeA/Price:20$
moblexto,TypeA/Price:20$
unkntom,TypeA/Price:20$
moblexto2,TypeB/Price:25$
unkntom0,TypeB/Price:25$
alexmob3,TypeB/Price:25$
poptop9,TypeB/Price:25$
tyloret,TypeB/Price:25$
rtyuoper0,TypeC/Price:30$
kunlohpe6,TypeC/Price:30$
mobryhox,TypeC/Price:30$
Match lines containing Type, copy them into the hold space (h), and delete them (d).
Match lines containing alphabetic characters and append (G) the contents of the hold space. Finally, substitute the embedded newline with a comma.
I would use GNU AWK for this task in the following way. Let file.txt content be
TypeA/Price:20$
alexmob
moblexto
unkntom
TypeB/Price:25$
moblexto2
unkntom0
alexmob3
poptop9
tyloret
TypeC/Price:30$
rtyuoper0
kunlohpe6
mobryhox
then
awk '/^Type/{header=$0;next}{print /./?$0 ";" header:$0}' file.txt
output
alexmob;TypeA/Price:20$
moblexto;TypeA/Price:20$
unkntom;TypeA/Price:20$
moblexto2;TypeB/Price:25$
unkntom0;TypeB/Price:25$
alexmob3;TypeB/Price:25$
poptop9;TypeB/Price:25$
tyloret;TypeB/Price:25$
rtyuoper0;TypeC/Price:30$
kunlohpe6;TypeC/Price:30$
mobryhox;TypeC/Price:30$
Explanation: if a line starts with (^) Type, set header to that line ($0) and go to the next line. For every other line, if it contains at least one character (/./), print the line ($0) concatenated with ; and the header; otherwise print the line ($0) as is.
(tested in GNU Awk 5.0.1)
Using any awk in any shell on every Unix box regardless of which characters are in your data:
$ awk -v RS= -F'\n' -v OFS=',' '{for (i=2;i<=NF;i++) print $i, $1; print ""}' file
alexmob,TypeA/Price:20$
moblexto,TypeA/Price:20$
unkntom,TypeA/Price:20$
moblexto2,TypeB/Price:25$
unkntom0,TypeB/Price:25$
alexmob3,TypeB/Price:25$
poptop9,TypeB/Price:25$
tyloret,TypeB/Price:25$
rtyuoper0,TypeC/Price:30$
kunlohpe6,TypeC/Price:30$
mobryhox,TypeC/Price:30$
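The header-tracking idea also translates directly to Python if the shell tools feel opaque; a rough sketch, assuming the input file is data.txt and, as in the sample data, every header starts with Type:

# Sketch: remember the most recent header line and print every other
# non-empty line followed by a comma and that header.
header = None
with open('data.txt') as fh:
    for line in fh:
        line = line.strip()
        if line.startswith('Type'):
            header = line
        elif line and header is not None:
            print(line + ',' + header)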

How to add new columns of zeroes to a file?

I have a file of 10000 rows, e.g.,
1.2341105289455E+03 1.1348135000000E+00
I would like to have
1.2341105289455E+03 0.0 1.1348135000000E+00 0.0
and insert columns of '0.0' in it.
I tried replacing the space with ' 0.0 '; it works, but I don't think it is the best solution. I tried with awk, but I was only able to add '0.0' at the end of the file.
I bet there is a better solution to it. Do you know how to do it? awk? python? emacs?
Use this Perl one-liner:
perl -lane 'print join "\t", $F[0], "0.0", $F[1], "0.0"; ' in_file > out_file
The Perl one-liner uses these command-line flags:
-e : tells Perl to look for code in-line, instead of in a file.
-n : loop over the input one line at a time, assigning it to $_ by default.
-l : strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : split $_ into array @F on whitespace or on the regex specified in the -F option.
SEE ALSO:
perlrun: command line switches
with awk
awk '{print $1,"0.0",$2,"0.0"}' file
If you want to modify the file in place, you can do it either with GNU awk by adding the -i inplace option, or by adding > tmp && mv tmp file to the existing command. But always run it first without replacing, to test it and confirm the output.
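For completeness, the same column insertion as a short Python sketch; in_file and out_file are placeholder names, as in the Perl answer:

# Sketch: insert a "0.0" column after each of the two existing columns.
with open('in_file') as src, open('out_file', 'w') as dst:
    for line in src:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        first, second = parts
        dst.write(first + ' 0.0 ' + second + ' 0.0\n')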

How to turn text file into python list form

I'm trying to turn a list of numbers in a text file into python list form. For example, I want to make
1
2
3
4
5
into
[1,2,3,4,5]
I found something that almost worked in another post using sed.
sed '1s/^/[/;$!s/$/,/;$s/$/]/' file
but this didn't remove the newline after every number. How can I modify this sed command to get it to do what I want? An explanation of the components of the sed command would also be appreciated. Thanks.
With GNU sed for -z to read the whole file at once:
sed -z 's/\n/,/g; s/^/[/; s/,$/]\n/' file
[1,2,3,4,5]
With any awk in any shell on any UNIX box:
$ awk '{printf "%s%s", (NR>1 ? "," : "["), $0} END{print "]"}' file
[1,2,3,4,5]
You can append all the lines into the pattern space first before performing substitutions:
sed ':a;N;$!ba;s/\n/,/g;s/^/\[/;s/$/\]/' file
This outputs:
[1,2,3,4,5]
This might work for you (GNU sed):
sed '1h;1!H;$!d;x;s/\n/,/g;s/.*/[&]/' file
Copy the first line to the hold space, append copies of subsequent lines and delete the originals. At the end of the file, swap to the hold space, replace newlines by commas, and surround the remaining string by square brackets.
If you want the list using Python, a simple implementation is
with open('./num.txt') as f:
    num = [int(line) for line in f]
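Printing the result with print(num) then shows [1, 2, 3, 4, 5]; note that Python's own list formatting puts a space after each comma.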

How to split files according to a field and edit content

I am not sure if I can do this using Unix commands or if I need more complicated code, like Python.
I have a big input file with 3 columns: an id, different sequences (second column), grouped into different groups (third column).
Seq1 MVRWNARGQPVKEASQVFVSYIGVINCREVPISMEN Group1
Seq2 PSLFIAGWLFVSTGLRPNEYFTESRQGIPLITDRFDSLEQLDEFSRSF Group1
Seq3 HQAPAPAPTVISPPAPPTDTTLNLNGAPSNHLQGGNIWTTIGFAITVFLAVTGYSF Group20
I would like:
to split this file according to the group id and create a separate file for each group; to edit the info in each file, adding a ">" sign at the beginning of the id; and then to put the sequence on a new row
Group1.txt file
>Seq1
MVRWNARGQPVKEASQVFVSYIGVINCREVPISMEN
>Seq2
PSLFIAGWLFVSTGLRPNEYFTESRQGIPLITDRFDSLEQLDEFSRSF
Group20.txt file
>Seq3
HQAPAPAPTVISPPAPPTDTTLNLNGAPSNHLQGGNIWTTIGFAITVFLAVTGYSF
How can I do that?
AWK will do the trick:
awk '{ print ">"$1 "\n" $2 >> ($3 ".txt") }' input.txt
This shell script should do the trick:
#!/usr/bin/env bash
filename="data.txt"
while IFS= read -r line; do
    id=$(echo "${line}" | awk '{print $1}')
    sequence=$(echo "${line}" | awk '{print $2}')
    group=$(echo "${line}" | awk '{print $3}')
    printf '>%s\n%s\n' "${id}" "${sequence}" >> "${group}.txt"
done < "${filename}"
where data.txt is the name of the file containing the original data.
Importantly, the Group*.txt files should not exist prior to running the script, since it appends to them.
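A Python version of the same split, for comparison; like the shell script, this sketch appends, so stale Group*.txt files should be removed first:

# Sketch: write ">id" and the sequence into a file named after the group.
# Opens in append mode, so pre-existing Group*.txt files must be removed first.
with open('data.txt') as fh:
    for row in fh:
        fields = row.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        seq_id, sequence, group = fields
        with open(group + '.txt', 'a') as out:
            out.write('>' + seq_id + '\n' + sequence + '\n')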

How to use awk if statement and for loop in subprocess.call

Trying to print the filenames of files that don't have 12 columns.
This works at the command line:
for i in *dim*; do awk -F',' '{if (NR==1 && NF!=12)print FILENAME}' $i; done;
When I try to embed this in subprocess.call in a python script, it doesn't work:
subprocess.call("""for %i in (*dim*.csv) do (awk -F, '{if ("NR==1 && NF!=12"^) {print FILENAME}}' %i)""", shell=True)
The first error I received was "Print is unexpected at this time", so I googled and added ^ within the parentheses. The next error was "unexpected newline or end of string", so I googled again and added the quotes around NR==1 && NF!=12. With the current code it's printing many lines in each file, so I suspect something is wrong with the if statement. I've used awk and for loops in this style in subprocess.call before, but not combined and with an if statement.
Multiple input files in AWK
In the string you are passing to subprocess.call(), your if statement is evaluating a string (probably not the comparison you want). It might be easier to just simplify the shell command by doing everything in AWK. You are executing AWK for every $i in the shell's for loop. Since you can give multiple input files to AWK, there is really no need for this loop.
You might want to scan through the entire files until you find any line that has other than 12 fields, and not only check the first line (NR==1). In this case, the condition would be only NF!=12.
If you want to check only the first line of each file, then NR==1 becomes FNR==1 when using multiple files. NR is the "number of records" (across all input files) and FNR is "file number of records" for the current input file only. These are special built-in variables in AWK.
Also, the syntax of AWK allows for the blocks to be executed only if the line matches some condition. Giving no condition (as you did) runs the block for every line. For example, to scan through all files given to AWK and print the name of a file with other than 12 fields on the first line, try:
awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv
I have added the .csv to your wildcard *dim*, as you had in the Python version. The -F, of course changes the field separator from the default space to a comma. On the first line of each file, AWK checks whether the number of fields NF is 12; if it is not, it executes the block, which prints the FILENAME of the file AWK is currently processing and then skips to the beginning of the next file with nextfile.
Try running this AWK version with your subprocess module in Python:
subprocess.call("""awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv""", shell=True)
The triple quotes make it a literal string. The output of AWK goes to stdout, and I'm assuming you know how to use this in Python with the subprocess module.
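For instance, if you want the matching filenames back in Python rather than just echoed to the terminal, something like this should work (capture_output requires Python 3.7+; on older versions use stdout=subprocess.PIPE instead):

import subprocess

# Run the awk command and capture its stdout instead of letting it print.
result = subprocess.run(
    """awk -F, 'FNR==1 && NF!=12{print FILENAME; nextfile}' *dim*.csv""",
    shell=True, capture_output=True, text=True,
)
bad_files = result.stdout.splitlines()
print(bad_files)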
Using only Python
Don't forget that Python is itself an expressive and powerful language. If you are already using Python, it may be simpler, easier, and more portable to use only Python instead of a mixture of Python, bash, and AWK.
You can find the names of files (selected from *dim*.csv) with the first line of each file having other than 12 comma-separated fields with:
import glob
files_found = []
for filename in glob.glob('*dim*.csv'):
    with open(filename, 'r') as f:
        firstline = f.readline()
    if len(firstline.split(',')) != 12:
        files_found.append(filename)
print(files_found)
The glob module gives the listing of files matching the wildcard pattern *dim*.csv. The first line of each of these files is read and split into fields separated by commas. If the number of these fields is not 12, the filename is added to the list files_found.
