How can I convert the following .vcf data into a pandas dataframe?
GDrive Link To .txt File
Ideally I would like it in the form:
Thus far I have only been able to get the headers:
import pandas as pd

f = open('clinvar_final.txt', "r")
for line in f.readlines():
    if line[:5] == 'CHROM':
        vcf_header = line.strip().split('\t')
df = pd.DataFrame(columns=vcf_header)
There is no need to read line by line.
Pandas has an option called comment which can be used to skip unwanted lines.
You can directly load VCF files into pandas by running the following line.
In [9]: pd.read_csv('clinvar_final.txt', sep="\t", comment='#')
Out[9]:
CHROM POS ID REF ALT FILTER QUAL INFO
0 1 1014042 475283 G A . . AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;...
1 1 1014122 542074 C T . . AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926...
2 1 1014143 183381 C T . . ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:...
3 1 1014179 542075 C T . . ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:...
4 1 1014217 475278 C T . . AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;...
... ... ... ... .. .. ... ... ...
102316 3 179210507 403908 A G . . ALLELEID=393412;CLNDISDB=MedGen:C0018553,Orpha...
102317 3 179210511 526648 T C . . ALLELEID=519163;CLNDISDB=MedGen:C0018553,Orpha...
102318 3 179210515 526640 A C . . AF_EXAC=0.00002;ALLELEID=519178;CLNDISDB=MedGe...
102319 3 179210516 246681 A G . . AF_EXAC=0.00001;ALLELEID=245287;CLNDISDB=MedGe...
102320 3 179210538 259958 A T . . AF_EXAC=0.00001;ALLELEID=251013;CLNDISDB=MedGe...
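If you later need the INFO field split into separate columns, here is a hedged sketch (an assumption on my part: every INFO entry has the KEY=VALUE form, and flag-style keys without '=' are simply skipped):

import pandas as pd

df = pd.read_csv('clinvar_final.txt', sep='\t', comment='#')

# Split the semicolon-delimited INFO column into one column per key
# (entries without '=' are assumed to be flags and are skipped)
info = df['INFO'].str.split(';').apply(
    lambda parts: dict(p.split('=', 1) for p in parts if '=' in p)
)
df = df.join(pd.DataFrame(info.tolist(), index=df.index))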
GATK VariantsToTable is what you need to avoid issues caused by the flexibility of the VCF format. It converts the VCF into a plain table; once you have that, import it into pandas. I would say this is the most robust way to do it.
https://gatk.broadinstitute.org/hc/en-us/articles/360036896892-VariantsToTable
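A minimal sketch of that workflow (assuming GATK 4 is installed; the field list and file names here are illustrative, not from the question):

import pandas as pd

# First flatten the VCF with GATK in a shell, e.g.:
#   gatk VariantsToTable -V clinvar.vcf -F CHROM -F POS -F ID -F REF -F ALT -O clinvar.table
# VariantsToTable writes a tab-separated file, which pandas reads directly:
df = pd.read_csv('clinvar.table', sep='\t')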
import pandas as pd

with open(filename, "r") as f:
    lines = f.readlines()

# Find the header line, which starts with "#CHROM"
chrom_index = [i for i, line in enumerate(lines) if line.strip().startswith("#CHROM")]

# Keep the header line and everything after it
data = lines[chrom_index[0]:]
header = data[0].strip().split("\t")
informations = [d.strip().split("\t") for d in data[1:]]
vcf = pd.DataFrame(informations, columns=header)
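Note that this approach reads every column as a string. A hedged follow-up, assuming POS is the only column you need as a number:

# POS is parsed as str; convert it to a numeric dtype for sorting and filtering
vcf['POS'] = pd.to_numeric(vcf['POS'])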
I have a text file arranged in the following way:
Time x1 y1 z1 x2 y2 z2 ... x54 y54 z54
0 x1(0) etc. x2(0) . . . .
1 x1(1) . . . . . . .
. . . . . . . . . .
1e10
and instead I would like it to look like this:
Time x y z
0 x1(0) y1(0) z1(0)
0 x2(0) y2(0) z2(0)
.
.
0 x54(0) y54(0) z54(0)
1 x1(1) y1(1) z1(1)
.
.
.
1e10 x1(1e10) y1(1e10) z1(1e10)
.
.
1e10 x54(1e10) y54(1e10) z54(1e10)
I initially thought to do:
with open("file.txt", 'r') as f:
lines = f.readlines()
time = [float(line.split()[0]) for line in lines]
x1 = [float(line.split()[1]) for line in lines]
x2 = [float(line.split()[2]) for line in lines]
etc., until I have 1 + 3 × 54 lists. Then I would have to combine the lists (separately for the x, y and z lists), i.e. [x1(0), x2(0), ..., x54(0), x1(1), etc.]. This seems very inefficient, and since my file is so large I think I would run into trouble here. Does anyone have a better idea of how to do this?
Thanks
Can you use pandas?

import pandas as pd

df = pd.read_csv("file.txt", sep=r"\s+")  # the file is whitespace-separated
df = df.T  # transpose rows and columns
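Note that df.T only transposes the table. To get the long Time/x/y/z layout you asked for, here is a hedged numpy sketch (assuming one header row and exactly 54 x/y/z triples per line):

import numpy as np
import pandas as pd

data = np.loadtxt("file.txt", skiprows=1)  # skip the header row
time = np.repeat(data[:, 0], 54)           # repeat each Time once per particle
xyz = data[:, 1:].reshape(-1, 3)           # consecutive (x, y, z) triples
tidy = pd.DataFrame({"Time": time, "x": xyz[:, 0], "y": xyz[:, 1], "z": xyz[:, 2]})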
I have a GFF3 file (mainly a TSV file with 9 columns) and I'm trying to make some changes in the first column of my file in order to overwrite the modification to the file itself.
The GFF3 file looks like this:
## GFF3 file
## replicon1
## replicon2
replicon_1 prokka gene 0 15 . # . ID=some_gene_1;
replicon_1 prokka gene 40 61 . # . ID=some_gene_1;
replicon_2 prokka gene 8 32 . # . ID=some_gene_2;
replicon_2 prokka gene 70 98 . # . ID=some_gene_2;
I wrote few lines of code in which I decide a certain symbol to change (e.g. "_") and the symbol I want to replace (e.g. "#"):
import os
import re
import argparse
import pandas as pd
def myfunc() -> argparse.Namespace:
    ap = argparse.ArgumentParser()  # this line was missing from the original snippet
    ap.add_argument("-f", "--file", help="path to file")
    ap.add_argument("-i", "--input_word", help="Symbol to delete")
    ap.add_argument("-o", "--output_word", help="Symbol to insert")
    return ap.parse_args()

args = myfunc()
my_file = args.file
in_char = args.input_word
out_char = args.output_word

with open(my_file, 'r+') as f:
    rawfl = f.read()
    rawfl = re.sub(in_char, out_char, rawfl)
    f.seek(0)
    f.write(rawfl)
    # no explicit f.close() needed: the with block closes the file
The output is something like this:
## GFF3 file
## replicon1
## replicon2
replicon#1 prokka gene 0 15 . # . ID=some#gene#1;
replicon#1 prokka gene 40 61 . # . ID=some#gene#1;
replicon#2 prokka gene 8 32 . # . ID=some#gene#2;
replicon#2 prokka gene 70 98 . # . ID=some#gene#2;
As you can see, every "_" has been changed to "#".
I tried to modify the script using pandas so that the change applies only to the first column (seqid, shown below):
with open(my_file, 'r+') as f:
    genomic_dataframe = pd.read_csv(f, sep="\t", names=['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes'])
    id = genomic_dataframe.seqid
    id = str(id)  # this is used because re.sub expects strings, not a dataframe
    id = re.sub(in_char, out_char, id)
    f.seek(0)
    f.write(id)
I do not obtain the expected result: the seqid column (correctly modified) is appended to the file instead of overwriting the original column.
What I'd like to obtain is something like this:
## GFF3 file
## replicon1
## replicon2
replicon#1 prokka gene 0 15 . # . ID=some_gene_1;
replicon#1 prokka gene 40 61 . # . ID=some_gene_1;
replicon#2 prokka gene 8 32 . # . ID=some_gene_2;
replicon#2 prokka gene 70 98 . # . ID=some_gene_2;
Where the "#" symbol is present only in the first column while the "_" is maintained in the 9th column.
Do you know how to fix this? Thank you all.
If you only want to replace the first occurrence of _ with #, you can do it this way, without needing to load your file as a dataframe or use any third-party library such as pandas.
with open('f') as f:
    lines = [line.rstrip() for line in f]

new_lines = []
for line in lines:
    # Keep comment lines unchanged
    if line.startswith('#'):
        new_lines.append(line)
        continue
    # Replace only the first occurrence of '_'
    new_lines.append(line.replace('_', '#', 1))

new_lines then contains:
## GFF3 file
## replicon1
## replicon2
replicon#1 prokka gene 0 15 . # . ID=some_gene_1;
replicon#1 prokka gene 40 61 . # . ID=some_gene_1;
replicon#2 prokka gene 8 32 . # . ID=some_gene_2;
replicon#2 prokka gene 70 98 . # . ID=some_gene_2;
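To persist the result, a small sketch (assuming you want the modified lines written to a new file; 'output.gff3' is an assumed name):

# 'output.gff3' is an assumed destination file name
with open('output.gff3', 'w') as f_out:
    f_out.write('\n'.join(new_lines) + '\n')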
You can use re.sub with a pattern anchored at ^ (start of the string) together with a lambda replacement function. For example:
import re

# change only the first column:
r = re.compile(r"^(.*?)(?=\s)")

in_char = "_"
out_char = "#"

with open("input_file.txt", "r") as f_in, open("output_file.txt", "w") as f_out:
    for line in map(str.strip, f_in):
        # skip empty lines and lines starting with ##
        if not line or line.startswith("##"):
            print(line, file=f_out)
            continue
        line = r.sub(lambda g: g.group(1).replace(in_char, out_char), line)
        print(line, file=f_out)
Creates output_file.txt:
## GFF3 file
## replicon1
## replicon2
replicon#1 prokka gene 0 15 . # . ID=some_gene_1;
replicon#1 prokka gene 40 61 . # . ID=some_gene_1;
replicon#2 prokka gene 8 32 . # . ID=some_gene_2;
replicon#2 prokka gene 70 98 . # . ID=some_gene_2;
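If you still prefer the pandas route attempted in the question, a hedged sketch (assuming exactly three leading ## comment lines, which you would copy over separately; file names are illustrative):

import pandas as pd

cols = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']
df = pd.read_csv('input_file.txt', sep='\t', names=cols, skiprows=3)

# Apply the replacement to the first column only
df['seqid'] = df['seqid'].str.replace('_', '#', regex=False)
df.to_csv('output_file.txt', sep='\t', header=False, index=False)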
I have a long list of lists of the following form ---
a = [[1.2,'abc',3],[1.2,'werew',4],........,[1.4,'qew',2]]
i.e. the values in the list are of different types -- float, int, string. How do I write it into a CSV file so that my output CSV file looks like
1.2,abc,3
1.2,werew,4
.
.
.
1.4,qew,2
and it is easily achieved by this code:
You could use pandas:
In [1]: import pandas as pd
In [2]: a = [[1.2,'abc',3],[1.2,'werew',4],[1.4,'qew',2]]
In [3]: my_df = pd.DataFrame(a)
In [4]: my_df.to_csv('my_csv.csv', index=False, header=False)
But I want it in this format:
1.2 1.2 . . . 1.4
abc wer . . . qew
3 4 . . . 2
Try this. It's just a proof of concept, but you can probably wrap it into whatever you need:

a = [[1.2, 'abc', 3], [1.2, 'werew', 4], [1.4, 'qew', 2]]

# Rotate
rotated = zip(*a)

# Find the longest string of each set
max_lengths = [max(len(str(value)) for value in l) for l in a]

# Generate lines
lines = []
for values in rotated:
    items = []
    for i, value in enumerate(values):
        value_str = str(value)
        if len(value_str) < max_lengths[i]:
            # Add appropriate padding to the string
            value_str += " " * (max_lengths[i] - len(value_str))
        items.append(value_str)
    lines.append(" ".join(items))

print("\n".join(lines))
Which outputs:
1.2 1.2   1.4
abc werew qew
3   4     2
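If pandas is acceptable (as in the snippet quoted in the question), the transpose gives the same rotation in one line; a sketch (note it writes comma-separated values rather than space-padded columns):

import pandas as pd

a = [[1.2, 'abc', 3], [1.2, 'werew', 4], [1.4, 'qew', 2]]
pd.DataFrame(a).T.to_csv('my_csv.csv', index=False, header=False)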
I have some large tab separated data sets that have long commented sections, followed by the table header, formatted like this:
##FORMAT=<ID=AMQ,Number=.,Type=Integer,Description="Average mapping quality for each allele present in the genotype">
##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Variant status relative to non-adjacent Normal, 0=wildtype,1=germline,2=somatic,3=LOH,4=unknown">
##FORMAT=<ID=SSC,Number=1,Type=Integer,Description="Somatic Score">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr1 2985885 . c G . . . GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC 0/0:0/0:202:36,166,0,0:0,202,0,0:255:225:0:36:60:60:0:. 0/1:0/1:321:29,108,37,147:0,137,184,0:228:225:228:36,36:60:60,60:2:225
chr1 3312963 . C T . . . GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC 0/1:0/1:80:36,1,43,0:0,37,0,43:80:195:80:36,31:60:60,60:1:. 0/0:0/0:143:138,5,0,0:0,143,0,0:255:195:255:36:60:60:3:57
Everything that starts with ## is a comment that needs to be stripped out, but I need to keep the header that starts with #CHROM. Is there any way to do this? The only options I am seeing for Pandas read_table allow only a single character for the comment string, and I do not see options for regular expressions.
The code I am using is this:
SS_txt_df = pd.read_table(SS_txt_file,sep='\t',comment='#')
This removes all lines that start with #, including the header I want to keep.
EDIT: For clarification, the header region starting with ## is of variable length. In bash this would simply be grep -Ev '^##'.
You can easily calculate the number of header lines that must be skipped when reading your file:
import pandas as pd

fn = '/path/to/file.csv'

skip_rows = 0
with open(fn, 'r') as f:
    for line in f:
        if line.startswith('##'):
            skip_rows += 1
        else:
            break

df = pd.read_table(fn, sep='\t', skiprows=skip_rows)
The first loop reads only the leading comment lines, so it should be very fast even for large files.
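An equivalent, slightly more compact way to do the counting step with itertools (a sketch under the same assumptions):

from itertools import takewhile

import pandas as pd

fn = '/path/to/file.csv'
with open(fn, 'r') as f:
    skip_rows = sum(1 for _ in takewhile(lambda line: line.startswith('##'), f))

df = pd.read_table(fn, sep='\t', skiprows=skip_rows)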
Use skiprows as a workaround:
SS_txt_df = pd.read_table(SS_txt_file,sep='\t',skiprows=3)
df
Out[13]:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
0 chr1 2985885 . c G . . . GT:IGT...
1 chr1 3312963 . C T . . . GT:IGT...
Then rename your first column to remove the leading #.
Update:
As you said, the length of your ## section varies, so I know this is not an ideal solution, but you can drop all rows starting with # and then pass the column headers as a list, since your columns don't change:
name = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'NORMAL', 'TUMOR']
df = pd.read_table(SS_txt_file, sep='\t', comment='#', names=name)
df
Out[34]:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
0 chr1 2985885 . c G . . . GT:IGT...
1 chr1 3312963 . C T . . . GT:IGT...
I have a file.txt with five columns and 50 lines, and I want to name each column.
1 5
2 4.2
. .
. .
Here is the Python code used to generate file.txt:
f = open("test_01.txt", "a")
for i in xrange(len(end_zapstart_zap)):
f.write(" {} {} {} {} {} \n".format(Leavestart_zap[i],end_zapUDP[i],end_zapstart_zap[i],Leave_join[i],UDP_join[i]))
f.close()
I would like to name the first column Leavestart_zap, the second column end_zapUDP, and so on.
Here is my file test_01.txt:
0:00:00.672511 0:00:02.615662 0:00:03.433344 0:00:00.119777 0:00:00.025394 0:00:00.002278
0:00:00.263144 0:00:03.184893 0:00:03.541187 0:00:00.090872 0:00:00.025394 0:00:00.002278
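A hedged sketch of an answer: pass the desired names straight to pandas (assuming whitespace-separated columns; adjust the list if the file really has six columns, as the sample lines above suggest):

import pandas as pd

# Column names taken from the question; extend the list if there are more columns
names = ["Leavestart_zap", "end_zapUDP", "end_zapstart_zap", "Leave_join", "UDP_join"]
df = pd.read_csv("test_01.txt", sep=r"\s+", names=names)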