How to modify a TSV file column with Python

I have a GFF3 file (essentially a TSV file with 9 columns) and I'm trying to make some changes in its first column, then write the modified content back to the file itself.
The GFF3 file looks like this:
## GFF3 file
## replicon1
## replicon2
replicon_1 prokka gene 0 15 . # . ID=some_gene_1;
replicon_1 prokka gene 40 61 . # . ID=some_gene_1;
replicon_2 prokka gene 8 32 . # . ID=some_gene_2;
replicon_2 prokka gene 70 98 . # . ID=some_gene_2;
I wrote a few lines of code in which I specify a symbol to replace (e.g. "_") and the symbol to insert in its place (e.g. "#"):
import re
import argparse
import pandas as pd

def myfunc() -> argparse.Namespace:
    ap = argparse.ArgumentParser()
    ap.add_argument("-f", "--file", help="path to file")
    ap.add_argument("-i", "--input_word", help="Symbol to delete")
    ap.add_argument("-o", "--output_word", help="Symbol to insert")
    return ap.parse_args()

args = myfunc()
my_file = args.file
in_char = args.input_word
out_char = args.output_word

with open(my_file, 'r+') as f:
    rawfl = f.read()
    rawfl = re.sub(in_char, out_char, rawfl)
    f.seek(0)
    f.write(rawfl)
The output is something like this:
## GFF3 file
## replicon1
## replicon2
replicon#1 prokka gene 0 15 . # . ID=some#gene#1;
replicon#1 prokka gene 40 61 . # . ID=some#gene#1;
replicon#2 prokka gene 8 32 . # . ID=some#gene#2;
replicon#2 prokka gene 70 98 . # . ID=some#gene#2;
As you can see, every "_" has been changed to "#".
I tried to modify the script using pandas in order to apply the modification only to the first column (seqid, shown below):
with open(my_file, 'r+') as f:
    genomic_dataframe = pd.read_csv(f, sep="\t", names=['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes'])
    genid = genomic_dataframe.seqid
    genid = str(genid)  # used because re.sub expects a string, not a dataframe column
    genid = re.sub(in_char, out_char, genid)
    f.seek(0)
    f.write(genid)
I do not obtain the expected result: the modified seqid column ends up written into the file alongside the original content instead of replacing it.
What I'd like to obtain is something like this:
## GFF3 file
## replicon1
## replicon2
replicon#1 prokka gene 0 15 . # . ID=some_gene_1;
replicon#1 prokka gene 40 61 . # . ID=some_gene_1;
replicon#2 prokka gene 8 32 . # . ID=some_gene_2;
replicon#2 prokka gene 70 98 . # . ID=some_gene_2;
Where the "#" symbol is present only in the first column while the "_" is maintained in the 9th column.
Do you know how to fix this? Thank you all.

If you only want to replace the first occurrence of _ with #, you can do it this way, without the need to load your file as a dataframe and without any 3rd-party lib such as pandas.
with open('my_file.gff3') as f:  # path to your file
    lines = [line.rstrip() for line in f]

new_lines = []
for line in lines:
    # Keep comment lines unchanged
    if line.startswith('#'):
        new_lines.append(line)
        continue
    # Replace only the first occurrence of "_"
    new_lines.append(line.replace('_', '#', 1))
After the loop, new_lines contains:
## GFF3 file
## replicon1
## replicon2
replicon#1 prokka gene 0 15 . # . ID=some_gene_1;
replicon#1 prokka gene 40 61 . # . ID=some_gene_1;
replicon#2 prokka gene 8 32 . # . ID=some_gene_2;
replicon#2 prokka gene 70 98 . # . ID=some_gene_2;
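To overwrite the original file with the result, a minimal sketch (the file name is a placeholder; writing to a temporary file first and renaming it would be safer):
# Write the modified lines back, replacing the original file
with open('my_file.gff3', 'w') as f:
    f.write('\n'.join(new_lines) + '\n')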

You can use re.sub with a pattern anchored at ^ (start of the string) and a lambda replacement function. For example:
import re

# change only the first column:
r = re.compile(r"^(.*?)(?=\s)")

in_char = "_"
out_char = "#"

with open("input_file.txt", "r") as f_in, open("output_file.txt", "w") as f_out:
    for line in map(str.strip, f_in):
        # skip empty lines and lines starting with ##
        if not line or line.startswith("##"):
            print(line, file=f_out)
            continue
        line = r.sub(lambda g: g.group(1).replace(in_char, out_char), line)
        print(line, file=f_out)
Creates output_file.txt:
## GFF3 file
## replicon1
## replicon2
replicon#1 prokka gene 0 15 . # . ID=some_gene_1;
replicon#1 prokka gene 40 61 . # . ID=some_gene_1;
replicon#2 prokka gene 8 32 . # . ID=some_gene_2;
replicon#2 prokka gene 70 98 . # . ID=some_gene_2;
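If you do want to stay with pandas, as in the original attempt, here is a minimal sketch of one way to do it (the file name is a placeholder, and it assumes all the ## comment lines sit at the top of the file):
import pandas as pd

cols = ['seqid', 'source', 'type', 'start', 'end',
        'score', 'strand', 'phase', 'attributes']

# Read the "##" comment lines separately so they survive the round trip
with open('my_file.gff3') as f:
    comments = []
    for line in f:
        if not line.startswith('##'):
            break
        comments.append(line)

df = pd.read_csv('my_file.gff3', sep='\t', names=cols,
                 skiprows=len(comments))

# Apply the replacement to the first column only
df['seqid'] = df['seqid'].str.replace('_', '#', regex=False)

# Write the comments back first, then the modified table
with open('my_file.gff3', 'w') as f:
    f.writelines(comments)
    df.to_csv(f, sep='\t', header=False, index=False)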

Related

.vcf data to pandas dataframe

How can I convert the following .vcf data into a pandas dataframe?
GDrive Link To .txt File
Ideally I would like it in the form:
Thus far I have only been able to get the headers:
import pandas as pd

f = open('clinvar_final.txt', "r")
for line in f.readlines():
    if line[:5] == 'CHROM':
        vcf_header = line.strip().split('\t')
f.close()

df = pd.DataFrame(columns=vcf_header)
There is no need to read line by line.
pandas has an option called comment which can be used to skip unwanted lines.
You can directly load the file into pandas by running the following line (this works here because the header line in clinvar_final.txt does not itself start with #; if it did, it would be skipped as well, as in the next question below):
In [9]: pd.read_csv('clinvar_final.txt', sep="\t", comment='#')
Out[9]:
CHROM POS ID REF ALT FILTER QUAL INFO
0 1 1014042 475283 G A . . AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;...
1 1 1014122 542074 C T . . AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926...
2 1 1014143 183381 C T . . ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:...
3 1 1014179 542075 C T . . ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:...
4 1 1014217 475278 C T . . AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;...
... ... ... ... .. .. ... ... ...
102316 3 179210507 403908 A G . . ALLELEID=393412;CLNDISDB=MedGen:C0018553,Orpha...
102317 3 179210511 526648 T C . . ALLELEID=519163;CLNDISDB=MedGen:C0018553,Orpha...
102318 3 179210515 526640 A C . . AF_EXAC=0.00002;ALLELEID=519178;CLNDISDB=MedGe...
102319 3 179210516 246681 A G . . AF_EXAC=0.00001;ALLELEID=245287;CLNDISDB=MedGe...
102320 3 179210538 259958 A T . . AF_EXAC=0.00001;ALLELEID=251013;CLNDISDB=MedGe...
GATK VariantsToTable is what you need to avoid any issues due to the flexibility of the VCF format. Then, once you have the table, import it into pandas. I would say that this is the most robust way to do this.
https://gatk.broadinstitute.org/hc/en-us/articles/360036896892-VariantsToTable
import pandas as pd

with open(filename, "r") as f:
    lines = f.readlines()

# Everything from the "#CHROM" line onwards is the table
chrom_index = [i for i, line in enumerate(lines) if line.strip().startswith("#CHROM")]
data = lines[chrom_index[0]:]

header = data[0].strip().split("\t")
informations = [d.strip().split("\t") for d in data[1:]]
vcf = pd.DataFrame(informations, columns=header)
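Every value parsed this way is a string; a small follow-up sketch for casting the numeric columns afterwards:
# Columns are strings after the manual split; cast numeric ones as needed
vcf['POS'] = vcf['POS'].astype(int)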

Pandas: read_table remove comment lines with '##' but not '#<string>'?

I have some large tab separated data sets that have long commented sections, followed by the table header, formatted like this:
##FORMAT=<ID=AMQ,Number=.,Type=Integer,Description="Average mapping quality for each allele present in the genotype">
##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Variant status relative to non-adjacent Normal, 0=wildtype,1=germline,2=somatic,3=LOH,4=unknown">
##FORMAT=<ID=SSC,Number=1,Type=Integer,Description="Somatic Score">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr1 2985885 . c G . . . GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC 0/0:0/0:202:36,166,0,0:0,202,0,0:255:225:0:36:60:60:0:. 0/1:0/1:321:29,108,37,147:0,137,184,0:228:225:228:36,36:60:60,60:2:225
chr1 3312963 . C T . . . GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC 0/1:0/1:80:36,1,43,0:0,37,0,43:80:195:80:36,31:60:60,60:1:. 0/0:0/0:143:138,5,0,0:0,143,0,0:255:195:255:36:60:60:3:57
Everything that starts with ## is a comment that needs to be stripped out, but I need to keep the header that starts with #CHROM. Is there any way to do this? The only options I am seeing for Pandas read_table allow only a single character for the comment string, and I do not see options for regular expressions.
The code I am using is this:
SS_txt_df = pd.read_table(SS_txt_file,sep='\t',comment='#')
This removes all lines that start with #, including the header I want to keep.
EDIT: For clarification, the header region starting with ## is of variable length. In bash this would simply be grep -Ev '^##'.
You can easily calculate the number of header lines that must be skipped when reading your file:
import pandas as pd

fn = '/path/to/file.csv'

# Count the leading "##" comment lines
skip_rows = 0
with open(fn, 'r') as f:
    for line in f:
        if line.startswith('##'):
            skip_rows += 1
        else:
            break

df = pd.read_table(fn, sep='\t', skiprows=skip_rows)
The first part reads only the header lines, so it should be very fast.
Use skiprows as a workaround:
SS_txt_df = pd.read_table(SS_txt_file,sep='\t',skiprows=3)
df
Out[13]:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
0 chr1 2985885 . c G . . . GT:IGT...
1 chr1 3312963 . C T . . . GT:IGT...
Then rename your first column to remove the #.
Update:
As you said, the length of your ## section varies, so the hard-coded skiprows above is not feasible; instead you can drop all rows starting with # and pass the column headers as a list, since your columns don't change:
name = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'NORMAL', 'TUMOR']
df = pd.read_table(SS_txt_file, sep='\t', comment='#', names=name)
df
Out[34]:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
0 chr1 2985885 . c G . . . GT:IGT...
1 chr1 3312963 . C T . . . GT:IGT...
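Another option (not from the answers above, just a sketch) is to pre-filter the lines yourself and hand pandas an in-memory buffer; this keeps the #CHROM header without hard-coding how many ## lines there are:
import io
import pandas as pd

with open(SS_txt_file) as f:
    # Keep every line that does not start with "##", so "#CHROM" survives
    kept = ''.join(line for line in f if not line.startswith('##'))

df = pd.read_csv(io.StringIO(kept), sep='\t')
df = df.rename(columns={'#CHROM': 'CHROM'})  # strip the leading "#"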

Name columns of file.txt

I have file.txt with five columns and 50 lines, and I want to name each column:
1 5
2 4.2
. .
. .
Here is the Python code that generates file.txt:
f = open("test_01.txt", "a")
for i in xrange(len(end_zapstart_zap)):
f.write(" {} {} {} {} {} \n".format(Leavestart_zap[i],end_zapUDP[i],end_zapstart_zap[i],Leave_join[i],UDP_join[i]))
f.close()
I would like to name the first column Leavestart_zap, the second column end_zapUDP, and so on.
Here is my file test_01.txt:
0:00:00.672511 0:00:02.615662 0:00:03.433344 0:00:00.119777 0:00:00.025394 0:00:00.002278
0:00:00.263144 0:00:03.184893 0:00:03.541187 0:00:00.090872 0:00:00.025394 0:00:00.002278
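A minimal sketch of one way to do this with pandas (the column names are taken from the write call above, assuming five whitespace-separated columns as stated):
import pandas as pd

names = ['Leavestart_zap', 'end_zapUDP', 'end_zapstart_zap',
         'Leave_join', 'UDP_join']

# test_01.txt is whitespace-separated, so split on runs of whitespace
df = pd.read_csv('test_01.txt', sep=r'\s+', names=names)
print(df.head())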

extract a column from a text file

I am trying to extract two columns (datapoint and index) from a text file, and I want both columns written to another text file. I made a small program that somewhat does what I want, but it's not working completely.
Any suggestions, please?
My program is:
f = open('infilename', 'r')
header1 = f.readline()  # skip the header line
for line in f:
    line = line.strip()
    columns = line.split()
    j = float(columns[1])
    i = columns[3]
    print(i, j)
f.close()
It also gives an error:
j=float(columns[1])
IndexError: list index out of range
Sample data:
datapoint index
66.199748 200 0.766113 0 1
66.295962 200 0.826375 1 0
66.295962 200 0.762582 1 1
66.318076 200 0.850936 2 0
66.318076 200 0.751474 2 1
66.479436 200 0.821261 3 0
66.479436 200 0.765673 3 1
66.460284 200 0.869779 4 0
66.460284 200 0.741051 4 1
66.551778 200 0.841143 5 0
66.551778 200 0.765198 5 1
66.303606 200 0.834398 6 0
. . . . .
. . . . .
. . . . .
. . . . .
69.284336 200 0.926158 19998 0
69.284336 200 0.872788 19998 1
69.403861 200 0.943316 19999 0
69.403861 200 0.884889 19999 1
The following code will allow you to do all of the file writing through Python. Redirecting through the command line like you were doing works fine; this is just self-contained instead.
f = open('in.txt', 'r')
out = open("out.txt", "w")
header1 = f.readline()  # skip the header line
for line in f:
    line = line.strip()
    columns = line.split()
    # Guard against short or empty lines before indexing
    if len(columns) > 3:
        j = float(columns[1])
        i = columns[3]
        out.write("%s %s\n" % (i, j))
f.close()
out.close()
Warning: This will always overwrite "out.txt". If you would simply like to add to the end of it if it already exists, or create a new file if it doesn't, you can change the "w" to "a" when you open out.
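A slightly more compact variant of the same idea, using with so both files are closed automatically:
# Same logic with context managers; the files close themselves
with open('in.txt') as f, open('out.txt', 'w') as out:
    next(f)  # skip the header line
    for line in f:
        columns = line.split()
        if len(columns) > 3:
            out.write("%s %s\n" % (columns[3], float(columns[1])))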

How to combine header files with data files with python?

I have separate files. One group contains only header info, like the examples shown below:
~content of "header1.txt"~
a 3
b 2
c 4
~content of "header2.txt"~
a 4
b 3
c 5
~content of "header3.txt"~
a 1
b 7
c 6
The other group contains only data, as shown below:
~content of "data1.txt"~
10 20 30 40
20 14 22 33
~content of "data2.txt"~
11 21 31 41
21 24 12 23
~content of "data3.txt"~
21 22 11 31
10 26 14 33
After combining the corresponding header and data files, the results should look like the examples below:
~content of "asc1.txt"~
a 3
b 2
c 4
10 20 30 40
20 14 22 33
~content of "asc2.txt"~
a 4
b 3
c 5
11 21 31 41
21 24 12 23
~content of "asc3.txt"~
a 1
b 7
c 6
21 22 11 31
10 26 14 33
Can anyone give me some help in writing this in python? Thanks!
If you really want it in Python, here is one way to do it:
for i in range(1, 4):
    h = open('header{0}.txt'.format(i), 'r')
    d = open('data{0}.txt'.format(i), 'r')
    a = open('asc{0}.txt'.format(i), 'w')  # 'w' so re-running does not append duplicates
    hdata = h.readlines()
    ddata = d.readlines()
    a.writelines(hdata + ddata)
    h.close()
    d.close()
    a.close()
Of course, this assumes that there are three of each file and that all follow the naming convention you used.
Try this (written in Python 3.4 IDLE). Pretty long, but it should be easier to understand:
# can start by creating a function to read contents of
# each file and return the contents as a string
def readFile(file):
    contentsStr = ''
    for line in file:
        contentsStr += line
    return contentsStr

# Read all the header files header1, header2, header3
header1 = open('header1.txt', 'r')
header2 = open('header2.txt', 'r')
header3 = open('header3.txt', 'r')

# Read all the data files data1, data2, data3
data1 = open('data1.txt', 'r')
data2 = open('data2.txt', 'r')
data3 = open('data3.txt', 'r')

# Open/create output files asc1, asc2, asc3
asc1_outFile = open('asc1.txt', 'w')
asc2_outFile = open('asc2.txt', 'w')
asc3_outFile = open('asc3.txt', 'w')

# read contents of each header file and data file into string variables
header1_contents = readFile(header1)
header2_contents = readFile(header2)
header3_contents = readFile(header3)
data1_contents = readFile(data1)
data2_contents = readFile(data2)
data3_contents = readFile(data3)

# Append the contents of each data file to its
# corresponding header file
asc1_contents = header1_contents + '\n' + data1_contents
asc2_contents = header2_contents + '\n' + data2_contents
asc3_contents = header3_contents + '\n' + data3_contents

# now write the results to the asc1.txt, asc2.txt, and
# asc3.txt output files respectively
asc1_outFile.write(asc1_contents)
asc2_outFile.write(asc2_contents)
asc3_outFile.write(asc3_contents)

# close the file streams
header1.close()
header2.close()
header3.close()
data1.close()
data2.close()
data3.close()
asc1_outFile.close()
asc2_outFile.close()
asc3_outFile.close()
# done!
By the way, ensure that the header files and data files are in the same directory as the Python script; otherwise, simply edit their paths accordingly in the code above. The output files asc1.txt, asc2.txt, and asc3.txt will be created in the same directory as your Python source file.
This works if the number of header files equals the number of data files:
# glob is imported to get file names matching the given pattern
import glob

header = []
data = []

# Traverse the files and collect their contents; sorted() keeps
# the header and data files paired in the same order
for files1 in sorted(glob.glob("directory/header*.txt")):
    a = open(files1, "r").read()
    header.append(a)
for files2 in sorted(glob.glob("directory/data*.txt")):
    a1 = open(files2, "r").read()
    data.append(a1)

# Write the content into the output files
for i in range(1, len(data) + 1):
    writer = open("directory/asc" + str(i) + ".txt", "w")
    writer.write(header[i - 1] + "\n\n" + data[i - 1])
    writer.close()
Edit: this method only works if the header files and data files are in two separate folders, each containing no other files:
# glob is imported to get file names matching the given pattern
import glob

header = []
data = []

# Traverse the files and collect their contents; sorted() keeps
# the two folders' files paired in the same order
for files1 in sorted(glob.glob("directory1/*.txt")):
    a = open(files1, "r").read()
    header.append(a)
for files2 in sorted(glob.glob("directory2/*.txt")):
    a1 = open(files2, "r").read()
    data.append(a1)

# Write the content into the output files
for i in range(1, len(data) + 1):
    writer = open("directory/asc" + str(i) + ".txt", "w")
    writer.write(header[i - 1] + "\n\n" + data[i - 1])
    writer.close()
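A more compact variant of the same pairing idea (a sketch; it assumes equal counts of header and data files and names that sort in matching order):
import glob

headers = sorted(glob.glob("directory/header*.txt"))
datas = sorted(glob.glob("directory/data*.txt"))

# Pair each header file with its data file by sorted position
for i, (hf, df) in enumerate(zip(headers, datas), start=1):
    with open(hf) as h, open(df) as d, open("directory/asc{0}.txt".format(i), "w") as out:
        out.write(h.read() + "\n" + d.read())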
