I have some large tab separated data sets that have long commented sections, followed by the table header, formatted like this:
##FORMAT=<ID=AMQ,Number=.,Type=Integer,Description="Average mapping quality for each allele present in the genotype">
##FORMAT=<ID=SS,Number=1,Type=Integer,Description="Variant status relative to non-adjacent Normal, 0=wildtype,1=germline,2=somatic,3=LOH,4=unknown">
##FORMAT=<ID=SSC,Number=1,Type=Integer,Description="Somatic Score">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr1 2985885 . c G . . . GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC 0/0:0/0:202:36,166,0,0:0,202,0,0:255:225:0:36:60:60:0:. 0/1:0/1:321:29,108,37,147:0,137,184,0:228:225:228:36,36:60:60,60:2:225
chr1 3312963 . C T . . . GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC 0/1:0/1:80:36,1,43,0:0,37,0,43:80:195:80:36,31:60:60,60:1:. 0/0:0/0:143:138,5,0,0:0,143,0,0:255:195:255:36:60:60:3:57
Everything that starts with ## is a comment that needs to be stripped out, but I need to keep the header that starts with #CHROM. Is there any way to do this? The only options I am seeing for Pandas read_table allow only a single character for the comment string, and I do not see options for regular expressions.
The code I am using is this:
SS_txt_df = pd.read_table(SS_txt_file,sep='\t',comment='#')
This removes all lines that start with #, including the header row I want to keep.
EDIT: For clarification, the header region starting with ## is of variable length. In bash this would simply be grep -Ev '^##'.
You can easily calculate the number of header lines that must be skipped when reading your file:
fn = '/path/to/file.csv'
skip_rows = 0
with open(fn, 'r') as f:
    for line in f:
        if line.startswith('##'):
            skip_rows += 1
        else:
            break

df = pd.read_table(fn, sep='\t', skiprows=skip_rows)
The first part reads only the header lines, so it should be very fast.
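Alternatively, a minimal sketch of the grep-style filter the question mentions, done in Python: drop the '##' lines yourself and hand the remainder to pandas through an in-memory buffer (reusing fn from above).

import io
import pandas as pd

with open(fn) as f:
    # keep everything except the '##' meta lines; the '#CHROM' header line survives
    kept = [line for line in f if not line.startswith('##')]

df = pd.read_table(io.StringIO(''.join(kept)), sep='\t')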
Use skiprows as a workaround:
SS_txt_df = pd.read_table(SS_txt_file,sep='\t',skiprows=3)
df
Out[13]:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
0 chr1 2985885 . c G . . . GT:IGT...
1 chr1 3312963 . C T . . . GT:IGT...
Then rename your first column to remove the #.
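For example (a quick sketch; the first column is assumed to be named '#CHROM' as in the sample data):

SS_txt_df = SS_txt_df.rename(columns={'#CHROM': 'CHROM'})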
Update:
As you said, the length of your ## section varies, so the fixed skiprows above is not feasible. Instead, you can drop all rows starting with # and pass the column headers as a list, since your columns don't change:
name = ['CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO','FORMAT','NORMAL','TUMOR']
df = pd.read_table(SS_txt_file, sep='\t', comment='#', names=name)
df
Out[34]:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
0 chr1 2985885 . c G . . . GT:IGT...
1 chr1 3312963 . C T . . . GT:IGT...
Related
How can I convert the following .vcf data into a pandas dataframe?
GDrive Link To .txt File
Ideally I would like it in the form of a dataframe with one column per VCF field.
Thus far I have only been able to get the headers:
import pandas as pd
f = open('clinvar_final.txt',"r")
for line in f.readlines():
    if line[:5] == 'CHROM':
        vcf_header = line.strip().split('\t')
df = pd.DataFrame
df.header = vcf_header
There is no need to read line by line.
Pandas has an option called comment which can be used to skip unwanted lines.
You can directly load VCF files into pandas by running the following line.
In [9]: pd.read_csv('clinvar_final.txt', sep="\t", comment='#')
Out[9]:
CHROM POS ID REF ALT FILTER QUAL INFO
0 1 1014042 475283 G A . . AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;...
1 1 1014122 542074 C T . . AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926...
2 1 1014143 183381 C T . . ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:...
3 1 1014179 542075 C T . . ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:...
4 1 1014217 475278 C T . . AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;...
... ... ... ... .. .. ... ... ...
102316 3 179210507 403908 A G . . ALLELEID=393412;CLNDISDB=MedGen:C0018553,Orpha...
102317 3 179210511 526648 T C . . ALLELEID=519163;CLNDISDB=MedGen:C0018553,Orpha...
102318 3 179210515 526640 A C . . AF_EXAC=0.00002;ALLELEID=519178;CLNDISDB=MedGe...
102319 3 179210516 246681 A G . . AF_EXAC=0.00001;ALLELEID=245287;CLNDISDB=MedGe...
102320 3 179210538 259958 A T . . AF_EXAC=0.00001;ALLELEID=251013;CLNDISDB=MedGe...
GATK VariantsToTable is what you need to avoid issues caused by the flexibility of the VCF format. Once you have the resulting table, import it into pandas. I would say this is the most robust way to do it.
https://gatk.broadinstitute.org/hc/en-us/articles/360036896892-VariantsToTable
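A rough sketch of that workflow (assuming GATK4 is on the PATH; the file names and the selected fields are examples, not from the question):

# shell: convert the VCF to a tab-separated table first, e.g.
#   gatk VariantsToTable -V clinvar_final.vcf -F CHROM -F POS -F ID -F REF -F ALT -O clinvar.table
import pandas as pd

df = pd.read_csv('clinvar.table', sep='\t')  # VariantsToTable writes a tab-separated file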
import pandas as pd

with open(filename, "r") as f:
    lines = f.readlines()

chrom_index = [i for i, line in enumerate(lines) if line.strip().startswith("#CHROM")]
data = lines[chrom_index[0]:]
header = data[0].strip().split("\t")
informations = [d.strip().split("\t") for d in data[1:]]
vcf = pd.DataFrame(informations, columns=header)
I have a DF that looks like this:
ids
-----------
cat-1,paws
dog-2,paws
bird-1,feathers,fish
cows-2,bird_3
.
.
.
I need to remove all the ids that have a - or _ in the dataframe. So the final data frame should be:
ids
-----------
paws
paws
feathers,fish
.
.
.
I've tried using lambda like this:
df['ids'] = df['ids'].apply(lambda x: x.replace('cat-1', '').replace('dog-2', '' )...)
But this is not a scalable solution and I would need to add all the ids with dashes and underscores into the above. What would be a more scalable/efficient solution?
You can use a regex pattern:
df.ids.str.replace(r'\w*[-_]\w*,?', '', regex=True)
Output:
0 paws
1 paws
2 feathers,fish
3
Name: ids, dtype: object
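A self-contained illustration with the sample data; regex=True is passed explicitly because newer pandas versions no longer treat the pattern as a regular expression by default:

import pandas as pd

df = pd.DataFrame({'ids': ['cat-1,paws', 'dog-2,paws', 'bird-1,feathers,fish', 'cows-2,bird_3']})
df['ids'] = df['ids'].str.replace(r'\w*[-_]\w*,?', '', regex=True)
print(df)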
I'm trying to write a Python script that concatenates two csv files and then drops the duplicate rows. Here is an example of the csvs I'm concatenating:
csv_1
type state city date estimate id
lux tx dal 2019/08/15 .8273452 10
sed ny ny 2019/05/12 .624356 10
cou cal la 2013/04/24 .723495 10
. . . . . .
. . . . . .
csv_2
type state city date estimate id
sed col den 2013/05/02 .7234957 232
sed mi det 2015/11/17 .4249357 232
lux nj al 2009/02/29 .627234 232
. . . . . .
. . . . . .
As of now, my code to concat these two together looks like this:
csv_1 = pd.read_csv('csv_1.csv')
csv_2 = pd.read_csv('csv_2.csv')
union_df = pd.concat([csv_1, csv_2])
union_df.drop_duplicates(subset=['type', 'state', 'city', 'date'], inplace=True, keep='first')
Is there any way I can ensure only rows with id = 232 are deleted and none with id = 10 are? Just a way to specify only rows from the second csv are removed from the concatenated csv?
Thank you
Use duplicated and boolean logic:
union_df.loc[~(union_df.duplicated(subset=['type','state','city','date'], keep='first') & (union_df['id'] == 232))]
Instead of directly dropping the duplicates with the drop_duplicates method, I would recommend using the duplicated method. It works the same way but returns a boolean vector indicating which rows are duplicated. You can then combine its output with the id column to achieve your purpose. Take a look below.
csv_1 = pd.read_csv('csv_1.csv')
csv_2 = pd.read_csv('csv_2.csv')
union_df = pd.concat([csv_1, csv_2])
union_df["dups"]= union_df.duplicated(subset=['type', 'state', 'city', 'date'],
inplace=True, keep='first')
union_df = union_df.loc[lambda d: ~((d.dups) & (d.id==232))]
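A small end-to-end sketch with toy frames standing in for the two csvs (values invented for illustration):

import pandas as pd

# toy stand-ins for csv_1 and csv_2; csv_2 repeats one row already present in csv_1
csv_1 = pd.DataFrame({'type': ['lux', 'sed'], 'state': ['tx', 'ny'], 'city': ['dal', 'ny'],
                      'date': ['2019/08/15', '2019/05/12'], 'estimate': [.8273452, .624356], 'id': [10, 10]})
csv_2 = pd.DataFrame({'type': ['lux', 'sed'], 'state': ['tx', 'col'], 'city': ['dal', 'den'],
                      'date': ['2019/08/15', '2013/05/02'], 'estimate': [.81, .7234957], 'id': [232, 232]})

union_df = pd.concat([csv_1, csv_2])
dups = union_df.duplicated(subset=['type', 'state', 'city', 'date'], keep='first')
union_df = union_df.loc[~(dups & (union_df['id'] == 232))]  # drop only the csv_2 copies
print(union_df)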
Question:
How do I apply the same Python code to multiple columns of data?
Data Format:
I am just learning Python and I have written a script to reformat my data. My current format starts with 4 descriptive columns followed by many columns of data (e.g., 1/1):
#CHROM POS REF ALT IND_1 IND_2 IND_3 IND_4
2L 6631 A G 1/1 0/0 0/0 0/0
2L 6633 T C 0/0 1/0 0/0 0/0
2L 6637 C G 1/1 0/0 0/0 0/0
I am trying to change the 0 and 1 to the values in the REF and ALT columns, respectively with the desired end format to look like:
2L 6631 A G G/G A/A A/A A/A
2L 6633 T C T/T C/T T/T T/T
2L 6637 C G G/G C/C C/C C/C
What I have so far:
I have written a script that will do this for a single column, but I have 100+ columns of data, so I was wondering if there is a way to apply this script to multiple columns instead of writing it out specifically for each one.
for line in openfile:
    ## skip header
    if line.startswith("#CHROM"):
        continue
    columns = line.rstrip().split("\t")
    CHROM = columns[0]
    POS = columns[1]
    REF = columns[2]
    ALT = columns[3]
    ALLELES1 = columns[4].replace("0",REF).replace("1",ALT).replace(".","0")
    ALLELES2 = columns[5].replace("0",REF).replace("1",ALT).replace(".","0")
    print CHROM, POS, REF, ALT, ALLELES1, ALLELES2
Here is my solution:
def read_data(filename):
    with open(filename, "r") as file_handle:
        for line in file_handle:
            # skip header
            if line.startswith("#CHROM"):
                continue
            columns = line.rstrip().split("\t")
            CHROM = columns[0]
            POS = columns[1]
            REF = columns[2]
            ALT = columns[3]
            ALLELS = [value.replace("0", REF).replace("1", ALT).replace(".", "0") for value in columns[4:]]
            print("\t".join(columns[0:4] + ALLELS))
You call it like this:
read_data("file.txt")
[value.replace("0", REF).replace("1", ALT).replace(".", "0") for value in columns[4:]]is called a "List comprehension". It looks at every value of a list and does something with it. See Documentation.
columns[4:] means, look at all my columns and get me the columns starting at index 4 until the last column.
sep="\t" in the print statement means, that all the elements you pass to the print function should be printed with a TAB in between them.
"\t".join(columns[0:4] + ALLELS) returns a single string in which all elements are joined by a TAB. See Stephen Rauch.
I would suggest implementing this using a list comprehension:
for line in f.readlines():
    ## skip header
    if line.startswith("#CHROM"):
        continue
    columns = line.rstrip().split("\t")
    REF, ALT = columns[2:4]
    modified = [c.replace("0", REF).replace("1", ALT).replace(".", "0")
                for c in columns[4:]]
    print('\t'.join(columns[0:4] + modified))
Three additions to your code:
REF, ALT = columns[2:4]
Which is a clean way to grab two elements from the list.
modified = [c.replace("0", REF).replace("1", ALT).replace(".", "0")
            for c in columns[4:]]
Which is a list comprehension to do the replace across all of the fields at once. And then
print('\t'.join(columns[0:4] + modified))
which reassembles everything at once.
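If the file is loaded with pandas anyway, here is a hedged alternative sketch (not from either answer above) that does the same substitution column by column; the file name is hypothetical and the column layout follows the question's example:

import pandas as pd

df = pd.read_table("genotypes.txt", sep="\t")  # header row starts with #CHROM
for col in df.columns[4:]:  # every genotype column after ALT
    df[col] = [g.replace("0", ref).replace("1", alt).replace(".", "0")
               for g, ref, alt in zip(df[col], df["REF"], df["ALT"])]
print(df.to_string(index=False))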
I have file.txt with five columns and 50 lines. I want to name each column:
1 5
2 4.2
. .
. .
Here is the Python code that generates file.txt:
f = open("test_01.txt", "a")
for i in xrange(len(end_zapstart_zap)):
f.write(" {} {} {} {} {} \n".format(Leavestart_zap[i],end_zapUDP[i],end_zapstart_zap[i],Leave_join[i],UDP_join[i]))
f.close()
I would like to name the first column Leavestart_zap, the second column end_zapUDP, etc.
Here is my file test_01.txt:
0:00:00.672511 0:00:02.615662 0:00:03.433344 0:00:00.119777 0:00:00.025394 0:00:00.002278
0:00:00.263144 0:00:03.184893 0:00:03.541187 0:00:00.090872 0:00:00.025394 0:00:00.002278
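A minimal sketch of one way to do that with pandas (the whitespace separator is an assumption based on the sample, and the names list matches the five fields written by the script; adjust as needed):

import pandas as pd

names = ["Leavestart_zap", "end_zapUDP", "end_zapstart_zap", "Leave_join", "UDP_join"]
df = pd.read_csv("test_01.txt", sep=r"\s+", header=None, names=names)
print(df.head())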