Comparing two tab-delimited files using Python - python

I obtained the text output file for my data sample, which reports TE insertion sites in the genome. It looks like this:
sample chr pos strand family order support comment frequency
1 1 4254339 . hAT|9 hAT R - 0,954
1 1 34804000 . Stowaway|41 Stowaway R - 1
1 1 12839440 . Tourist|15 Tourist F - 1
1 1 11521962 . Tourist|10 Tourist R - 1
1 1 28197852 . Tourist|11 Tourist F - 1
1 1 7367886 . Stowaway|36 Stowaway R - 1
1 1 13130538 . Stowaway|36 Stowaway R - 1
1 1 6177708 . hAT|4 hAT F - 1
1 1 3783728 . hAT|20 hAT F - 1
1 1 10332288 . uc|12 uc R - 1
1 1 15780052 . uc|5 uc R - 1
1 1 28309928 . uc|5 uc R - 1
1 1 31010266 . uc|33 uc R - 0,967
1 1 84155 . uc|1 uc F - 1
1 1 3815830 . uc|31 uc R - 0,879
1 1 66241 . Mutator|4 Mutator F - 1
1 1 15709187 . Mutator|4 Mutator F - 0,842
I want to compare it with the bed file representing TE sites for the reference genome. It looks like this:
chr start end family
1 12005 12348 Tourist|7
1 4254339 4254340 hAT|9
1 66241 66528 Mutator|4
1 76762 76849 Stowaway|10
1 81966 82251 Stowaway|39
1 84155 84402 uc|1
1 84714 84841 uc|28
1 13130538 13130540 Stowaway|3
I want to check whether the TE insertions found in my sample occur in the reference: for example, whether the first TE (hAT|9 at position 4254339 on chromosome 1) is found in the bed file within the range defined by column 2 (start) and column 3 (end) AND is recognized as the hAT|9 family according to column 4. I tried to do it with pandas but I'm pretty confused. Thanks for any suggestions!
EDIT:
I slightly modified the layout of the input files for easier understanding and parsing. Below is my code using pandas with two for loops (thanks @furas for the suggestion).
import pandas as pd

ref_base = pd.read_csv('ref_test.bed', sep='\t')
te_output = pd.read_csv('srr_test1.txt', sep='\t')

result = []
for te in te_output.index:
    te_pos = te_output['pos'][te]
    te_family_sample = te_output['family'][te]
    for ref in ref_base.index:
        te_family_ref = ref_base['family'][ref]
        start = ref_base['start'][ref]
        end = ref_base['end'][ref]
        if te_family_sample == te_family_ref and start <= te_pos < end:
            result.append(str(te_output["chr"][te]) + '\t' + str(te_output["pos"][te]) + '\t'
                          + te_output["family"][te] + '\t' + te_output["support"][te] + '\t'
                          + str(te_output["frequency"][te]))
print(result)

# write data to file
with open("result.txt", 'w') as resultFile:
    for r in result:
        resultFile.write(r + '\n')
Here is my expected result:
1 4254339 hAT|9 R 0,954
1 84155 uc|1 F 1
1 66241 Mutator|4 F 1
I wrote it as simply as I could, but I would like to find a more efficient solution. Any ideas?
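One more efficient, vectorized approach is to merge the two tables on the family column and then filter by the position range, instead of looping row by row. This is only a sketch: the small inline frames below stand in for the two files (column names taken from the EDIT above), and the same two merge/filter lines would apply to the real ref_base and te_output frames.

```python
import pandas as pd

# Toy frames mirroring the sample output file and the reference bed file
te_output = pd.DataFrame({
    "chr": [1, 1, 1],
    "pos": [4254339, 84155, 12345678],
    "family": ["hAT|9", "uc|1", "Tourist|15"],
    "support": ["R", "F", "F"],
    "frequency": ["0,954", "1", "1"],
})
ref_base = pd.DataFrame({
    "chr": [1, 1],
    "start": [4254339, 84155],
    "end": [4254340, 84402],
    "family": ["hAT|9", "uc|1"],
})

# Pair each sample row with every reference row of the same family,
# then keep pairs whose position falls inside [start, end)
merged = te_output.merge(ref_base, on="family", suffixes=("", "_ref"))
hits = merged[(merged["pos"] >= merged["start"]) & (merged["pos"] < merged["end"])]
result = hits[["chr", "pos", "family", "support", "frequency"]]
print(result.to_string(index=False))
```

The inner merge drops sample rows with no reference row of the same family up front, so the interval check only runs on candidate pairs rather than on every sample/reference combination.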

Related

How to separate text between specific strings and then convert it to dataframe?

This is a small example of a bigger data.
I have a text file like this one below.
Code: 44N
______________
Unit: m
Color: red
Length - Width - Height -
31 - 8 - 6 -
32 - 4 - 3 -
35 - 5 - 6 -
----------------------------------------
Code: 40N
______________
Unit: m
Color: blue
Length - Width - Height -
32 - 3 - 2 -
37 - 2 - 8 -
33 - 1 - 6 -
31 - 5 - 8 -
----------------------------------------
Code: 38N
I would like to get the lines starting from the line that begins with " Length" up to the line that begins with "----------------------------------------". I would like to do this for every time it happens and then convert each of these blocks into a dataframe, maybe adding them to a list of dataframes.
At this example, I should have two dataframes like these ones:
Length Width Height
31 8 6
32 4 3
35 5 6
Length Width Height
32 3 2
37 2 8
33 1 6
31 5 8
I already tried something, but it only saves the first block to the list, not both of them. And then I don't know how to convert them to a dataframe.
file = open('test.txt', 'r')
file_str = file.read()
well_list = []

def find_between(data, first, last):
    start = data.index(first)
    end = data.index(last, start)
    return data[start:end]

well_list.append(find_between(file_str, " Length", "----------------------------------------"))
Could anyone help me?
Hey, that shows how tricky parsing data can be. Use the .split() method of strings to do the job. Here is a way to do it.
import pandas as pd
import numpy as np

with open('test.txt', 'r') as f:
    text = f.read()

data_start = 'Length - Width - Height'
data_end = '----------------------------------------'

# split the text into sections containing the data
sections = text.split(data_start)[1:]

# column names for the dataframe (stripped of surrounding spaces)
col_names = [c.strip() for c in data_start.split('-')]
num_col = len(col_names)

for s in sections:
    # remove everything after '------'
    s = s.split(data_end)[0]
    # now keep the numbers only
    data = s.split()
    # change string to int and discard '-'
    data = [int(d) for d in data if d != '-']
    # reshape the data to (num_rows, num_cols)
    data = np.array(data).reshape((len(data) // num_col, num_col))
    df = pd.DataFrame(data, columns=col_names)
    print(df)
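Since the question asked for a list of dataframes rather than prints, here is a minimal variation of the same split logic that collects each section into a list. It is a sketch that parses an in-memory string standing in for test.txt:

```python
import pandas as pd
import numpy as np

# In-memory stand-in for the contents of test.txt
text = """Code: 44N
Unit: m
Length - Width - Height -
31 - 8 - 6 -
32 - 4 - 3 -
----------------------------------------
Code: 40N
Length - Width - Height -
32 - 3 - 2 -
37 - 2 - 8 -
----------------------------------------
"""

data_start = 'Length - Width - Height'
data_end = '----------------------------------------'
col_names = [c.strip() for c in data_start.split('-')]

dfs = []  # one DataFrame per data section
for s in text.split(data_start)[1:]:
    s = s.split(data_end)[0]                      # drop everything after the dashed line
    data = [int(d) for d in s.split() if d != '-']  # keep the numbers only
    data = np.array(data).reshape(-1, len(col_names))
    dfs.append(pd.DataFrame(data, columns=col_names))

print(len(dfs))
print(dfs[0])
```

Each element of dfs can then be processed independently, e.g. `dfs[1]['Width'].mean()`.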

pandas: multiple conditions and adding multiple columns

I have a df:
import pandas as pd
df.head(20)
id ch start end strand
0 10:100026072-100029645(+) 10 100026072 100029645 +
1 10:110931880-110932381(+) 10 110931880 110932381 +
2 10:110932431-110933096(+) 10 110932431 110933096 +
3 10:111435307-111439556(-) 10 111435307 111439556 -
4 10:115954439-115964883(-) 10 115954439 115964883 -
5 10:115986231-116018509(-) 10 115986231 116018509 -
6 10:116500106-116500762(-) 10 116500106 116500762 -
7 10:116654355-116657389(-) 10 116654355 116657389 -
8 10:117146840-117147002(-) 10 117146840 117147002 -
9 10:126533798-126533971(-) 10 126533798 126533971 -
10 10:127687390-127688824(+) 10 127687390 127688824 +
11 10:19614164-19624369(-) 10 19614164 19624369 -
12 10:42537888-42543687(+) 10 42537888 42543687 +
13 10:61927486-61931038(-) 10 61927486 61931038 -
14 10:70699779-70700206(-) 10 70699779 70700206 -
15 10:76532243-76532565(-) 10 76532243 76532565 -
16 10:79336852-79337034(-) 10 79336852 79337034 -
17 10:79342487-79343173(+) 10 79342487 79343173 +
18 10:79373277-79373447(-) 10 79373277 79373447 -
19 10:82322045-82337358(+) 10 82322045 82337358 +
df.shape
(501, 5)
>>>df.dtypes
id object
ch object
start object
end object
strand object
dtype: object
Question:
I would like to perform multiple operations based on the 'start' and 'end' columns: first create two additional columns called newstart and newend, then fill them with this desired operation:
if strand == '+':
    df['newstart'] = df['end'] - 27
    df['newend'] = df['end'] + 2
elif strand == '-':
    df['newstart'] = df['start'] - 3
    df['newend'] = df['start'] + 26
How can I do this using pandas? I found the link below but I'm not sure how to execute it. If anyone can provide pseudocode, I will build up on it.
adding multiple columns to pandas simultaneously
You can do it using np.where: two lines, but readable.
df['newstart'] = np.where(df.strand == '+', df.end - 27, df.start - 3)
df['newend'] = np.where(df.strand == '+', df.end + 2, df.start + 26)
id ch start end strand newstart newend
0 10:100026072-100029645(+) 10 100026072 100029645 + 100029618 100029647
1 10:110931880-110932381(+) 10 110931880 110932381 + 110932354 110932383
2 10:110932431-110933096(+) 10 110932431 110933096 + 110933069 110933098
3 10:111435307-111439556(-) 10 111435307 111439556 - 111435304 111435333
4 10:115954439-115964883(-) 10 115954439 115964883 - 115954436 115954465
5 10:115986231-116018509(-) 10 115986231 116018509 - 115986228 115986257
6 10:116500106-116500762(-) 10 116500106 116500762 - 116500103 116500132
7 10:116654355-116657389(-) 10 116654355 116657389 - 116654352 116654381
8 10:117146840-117147002(-) 10 117146840 117147002 - 117146837 117146866
9 10:126533798-126533971(-) 10 126533798 126533971 - 126533795 126533824
If you want to do it in pandas, df.loc is a good candidate:
df['newstart'] = df['start'] - 3
df['newend'] = df['start'] + 26
subset = df['strand'] == '+'
df.loc[subset, 'newstart'] = df.loc[subset, 'end'] - 27
df.loc[subset, 'newend'] = df.loc[subset, 'end'] + 2
I think it is a good idea to keep using pandas to process your data: it will keep your code consistent, and there is probably a better, shorter way to write the code above.
df.loc is a very useful function to perform data lookup and processing, try to fiddle with it since it is a great tool.
Enjoy
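One caveat for both answers: df.dtypes in the question shows start and end as object (strings), so the arithmetic above would raise a TypeError until those columns are converted to numbers. A minimal sketch of the conversion, using a toy frame with string-typed coordinates like the question's:

```python
import pandas as pd
import numpy as np

# Toy frame with string-typed coordinates, matching the question's dtypes
df = pd.DataFrame({
    "start": ["100026072", "111435307"],
    "end": ["100029645", "111439556"],
    "strand": ["+", "-"],
})

# Convert the coordinate columns to integers before any arithmetic
df[["start", "end"]] = df[["start", "end"]].apply(pd.to_numeric)

df["newstart"] = np.where(df.strand == "+", df.end - 27, df.start - 3)
df["newend"] = np.where(df.strand == "+", df.end + 2, df.start + 26)
print(df[["newstart", "newend"]])
```

After the conversion, both the np.where and the df.loc versions produce the numeric results shown in the answer's table.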

Improve python code in terms of speed

I have a very big file (1.5 billion lines) in the following form:
1 67108547 67109226 gene1$transcript1 0 + 1 0
1 67108547 67109226 gene1$transcript1 0 + 2 1
1 67108547 67109226 gene1$transcript1 0 + 3 3
1 67108547 67109226 gene1$transcript1 0 + 4 4
.
.
.
1 33547109 33557650 gene2$transcript1 0 + 239 2
1 33547109 33557650 gene2$transcript1 0 + 240 0
.
.
.
1 69109226 69109999 gene1$transcript1 0 + 351 1
1 69109226 69109999 gene1$transcript1 0 + 352 0
What I want to do is to reorganize/sort this file based on the identifier in column 4. The file consists of blocks. Concatenating columns 4, 1, 2, and 3 creates the unique identifier for each block. This identifier is the key of the dictionary all_exons, and the value is a numpy array containing all the values of column 8. Then I have a second dictionary, unique_identifiers, whose keys are the attributes from column 4 and whose values are sets of the corresponding block identifiers. As output I write a file in the following form:
>gene1
0
1
3
4
1
0
>gene2
2
0
I already wrote some code (see below) that does this, but my implementation is very slow. It takes around 18 hours to run.
import os
import sys
import time
from contextlib import contextmanager
import pandas as pd
import numpy as np

def parse_blocks(bedtools_file):
    unique_identifiers = {}  # Dictionary with key: gene, value: set of block identifiers
    all_exons = {}           # Dictionary containing all exons
    # Parse file and ...
    with open(bedtools_file) as fp:
        for line in fp:
            sp_line = line.strip().split("\t")
            current_id = sp_line[3].split("$")[0]
            identifier = "$".join([sp_line[3], sp_line[0], sp_line[1], sp_line[2]])
            if identifier in all_exons:
                item = float(sp_line[7])
                all_exons[identifier] = np.append(all_exons[identifier], item)
            else:
                all_exons[identifier] = np.array([sp_line[7]], float)
            if current_id in unique_identifiers:
                unique_identifiers[current_id].add(identifier)
            else:
                unique_identifiers[current_id] = set([identifier])
    return unique_identifiers, all_exons

identifiers, introns = parse_blocks(options.bed)

w = open(options.out, 'w')
for gene in sorted(list(identifiers)):
    w.write(">" + str(gene) + "\n")
    for intron in sorted(list(identifiers[gene])):
        for base in introns[intron]:
            w.write(str(base) + "\n")
w.close()
How can I impove the above code in order to run faster?
You also import pandas, therefore, I provide a pandas solution which requires basically only two lines of code.
However, I do not know how it performs on large data sets and whether that is faster than your approach (but I am pretty sure it is).
In the example below, the data you provide is stored in table.txt. I then use groupby to get all the values in your 8th column, store them in a list for the respective identifier in your column 4 (note that my indices start at 0) and convert this data structure into a dictionary which can then be printed easily.
import pandas as pd
df = pd.read_csv("table.txt", header=None, sep=r"\s+")  # or sep='\t' if the file is tab-delimited
op = dict(df.groupby(3)[7].apply(lambda x: x.tolist()))
So in this case op looks like this:
{'gene1$transcript1': [0, 1, 3, 4, 1, 0], 'gene2$transcript1': [2, 0]}
Now you could print the output like this and pipeline it in a certain file:
for k, v in op.items():
    print(k.split('$')[0])
    for val in v:
        print(val)
This gives you the desired output:
gene1
0
1
3
4
1
0
gene2
2
0
Maybe you can give it a try and let me know how it compares to your solution!?
Edit2:
In the comments you mentioned that you would like to print the genes in the correct order. You can do this as follows:
# add some fake genes to op from above
op['gene0$stuff'] = [7, 9]
op['gene4$stuff'] = [5, 9]

# print using 'sorted'
for k, v in sorted(op.items()):
    print(k.split('$')[0])
    for val in v:
        print(val)
which gives you:
gene0
7
9
gene1
0
1
3
4
1
0
gene2
2
0
gene4
5
9
EDIT1:
I am not sure whether duplicates are intended but you could easily get rid of them by doing the following:
op2 = dict(df.groupby(3)[7].apply(lambda x: set(x)))
Now op2 would look like this:
{'gene1$transcript1': {0, 1, 3, 4}, 'gene2$transcript1': {0, 2}}
You print the output as before:
for k, v in op2.items():
    print(k.split('$')[0])
    for val in v:
        print(val)
which gives you
gene1
0
1
3
4
gene2
0
2
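If you want the '>gene' headers of the original output format rather than plain prints, the same groupby can build the file contents directly. This is a sketch using an in-memory stand-in for table.txt; the gene name is recovered by splitting column 3 on '$', exactly as in the question's code:

```python
import pandas as pd
from io import StringIO

# In-memory stand-in for a few rows of the big file
raw = StringIO(
    "1\t67108547\t67109226\tgene1$transcript1\t0\t+\t1\t0\n"
    "1\t67108547\t67109226\tgene1$transcript1\t0\t+\t2\t1\n"
    "1\t33547109\t33557650\tgene2$transcript1\t0\t+\t239\t2\n"
)
df = pd.read_csv(raw, header=None, sep="\t")

lines = []
# Group by the gene part of column 3 and emit '>gene' followed by the column-7 values
for gene, vals in df.groupby(df[3].str.split("$").str[0])[7]:
    lines.append(">" + gene)
    lines.extend(str(v) for v in vals)
print("\n".join(lines))
```

Writing `"\n".join(lines)` to a file reproduces the `>gene1 / 0 / 1 / ...` layout, and groupby sorts the gene keys by default, so the genes come out in order.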
I'll try to simplify your question; my solution is like this:
First, scan over the big file. For every different current_id, open a temporary file and append the value of column 8 to that file.
After the full scan, concatenate all chunks into a result file.
Here's the code:
# -*- coding: utf-8 -*-
import os
import tempfile
import subprocess

class ChunkBoss(object):
    """Boss for file chunks"""
    def __init__(self):
        self.opened_files = {}

    def write_chunk(self, current_id, value):
        if current_id not in self.opened_files:
            self.opened_files[current_id] = open(tempfile.mktemp(), 'w')
            self.opened_files[current_id].write('>%s\n' % current_id)
        self.opened_files[current_id].write('%s\n' % value)

    def cat_result(self, filename):
        """Concatenate chunks into one big file"""
        # Sort the chunks
        chunk_file_list = []
        for current_id in sorted(self.opened_files.keys()):
            chunk_file_list.append(self.opened_files[current_id].name)
        # Flush chunks
        for chunk in self.opened_files.values():
            chunk.flush()
        # By calling the cat command
        with open(filename, 'wb') as fp:
            subprocess.call(['cat'] + chunk_file_list, stdout=fp, stderr=fp)

    def clean_up(self):
        for chunk in self.opened_files.values():
            chunk.close()
            os.unlink(chunk.name)

def main():
    boss = ChunkBoss()
    with open('bigfile.data') as fp:
        for line in fp:
            data = line.strip().split()
            current_id = data[3].split("$")[0]
            value = data[7]
            # Write value to temp chunk
            boss.write_chunk(current_id, value)
    boss.cat_result('result.txt')
    boss.clean_up()

if __name__ == '__main__':
    main()
I tested the performance of my script, with bigfile.data containing about 150k lines. It took about 0.5s to finish on my laptop. Maybe you can give it a try.
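If the 1.5-billion-line file is too big to load at once, pandas can also stream it with the chunksize argument of read_csv and accumulate the column-8 values per identifier in a dict. This is a sketch with an in-memory stand-in for the real file; with real data you would pass the file path and a much larger chunksize:

```python
import pandas as pd
from io import StringIO
from collections import defaultdict

# Stand-in for the huge file; real use: pd.read_csv(path, header=None, sep="\t", chunksize=1_000_000)
raw = StringIO(
    "1\t67108547\t67109226\tgene1$transcript1\t0\t+\t1\t0\n"
    "1\t67108547\t67109226\tgene1$transcript1\t0\t+\t2\t1\n"
    "1\t33547109\t33557650\tgene2$transcript1\t0\t+\t239\t2\n"
)

values = defaultdict(list)
# Stream the file in fixed-size chunks so memory use stays bounded
for chunk in pd.read_csv(raw, header=None, sep="\t", chunksize=2):
    for ident, vals in chunk.groupby(3)[7]:
        values[ident].extend(vals)

print(dict(values))
```

Grouping within each chunk and extending a per-identifier list gives the same mapping as the one-shot groupby, without ever holding the whole file in memory.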

extract a column from a text file

I am trying to extract two columns from a text file (datapoint and index), and I want both columns written to a text file, one per column. I made a small program that somewhat does what I want, but it's not working completely. Any suggestions, please?
My program is:
f = open('infilename', 'r')
header1 = f.readline()
for line in f:
    line = line.strip()
    columns = line.split()
    j = float(columns[1])
    i = columns[3]
    print i, j
f.close()
It is also giving an error:
j=float(columns[1])
IndexError: list index out of range
Sample data:
datapoint index
66.199748 200 0.766113 0 1
66.295962 200 0.826375 1 0
66.295962 200 0.762582 1 1
66.318076 200 0.850936 2 0
66.318076 200 0.751474 2 1
66.479436 200 0.821261 3 0
66.479436 200 0.765673 3 1
66.460284 200 0.869779 4 0
66.460284 200 0.741051 4 1
66.551778 200 0.841143 5 0
66.551778 200 0.765198 5 1
66.303606 200 0.834398 6 0
. . . . .
. . . . .
. . . . .
. . . . .
69.284336 200 0.926158 19998 0
69.284336 200 0.872788 19998 1
69.403861 200 0.943316 19999 0
69.403861 200 0.884889 19999 1
The following code will allow you to do all of the file writing through Python. Redirecting through the command line like you were doing works fine; this version is just self-contained instead.
f = open('in.txt', 'r')
out = open("out.txt", "w")
header1 = f.readline()
for line in f:
    line = line.strip()
    columns = line.split()
    # skip short/blank lines so the column lookups don't raise IndexError
    if len(columns) > 3:
        j = float(columns[1])
        i = columns[3]
        out.write("%s %s\n" % (i, j))
f.close()
out.close()
Warning: This will always overwrite "out.txt". If you would simply like to add to the end of it if it already exists, or create a new file if it doesn't, you can change the "w" to "a" when you open out.
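As an alternative sketch, pandas can extract the two columns directly with the usecols parameter of read_csv. This assumes, as in the sample data, that the header line has fewer fields than the data rows, so it is skipped and the columns are addressed by position (1 and 3, matching columns[1] and columns[3] in the question's code):

```python
import pandas as pd
from io import StringIO

# In-memory stand-in for the whitespace-separated data file
raw = StringIO(
    "datapoint index\n"
    "66.199748 200 0.766113 0 1\n"
    "66.295962 200 0.826375 1 0\n"
)

# Skip the header line and read only columns 1 and 3 by position
df = pd.read_csv(raw, sep=r"\s+", header=None, skiprows=1, usecols=[1, 3])

# Write the two columns out; with a real file, pass a path instead of the buffer
buf = StringIO()
df.to_csv(buf, sep=" ", header=False, index=False)
print(buf.getvalue())
```

usecols keeps only the wanted fields at parse time, so malformed short lines simply fail loudly instead of being silently mis-indexed.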

Getting errors when converting a text file to VCF format

I have a Python script where I am trying to convert a text file containing variant information in its rows to a variant call format (VCF) file for my downstream analysis.
I am getting almost everything correct, but when I run the code I miss the first two entries, i.e. the first two rows. The code is below; the line that does not read the entire file is highlighted. I would appreciate some expert advice.
I just started coding in Python, so I am not well versed with it.
##fileformat=VCFv4.0
##fileDate=20140901
##source=dbSNP
##dbSNP_BUILD_ID=137
##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
import sys

text = open(sys.argv[1]).readlines()
print text
print "First print"
text = filter(lambda x: x.split('\t')[31].strip() == 'KEEP', text[2:])
print text
print "################################################"
text = map(lambda x: x.split('\t')[0] + '\t' + x.split('\t')[1] + '\t.\t' + x.split('\t')[2] + '\t' + x.split('\t')[3] + '\t.\tPASS\t.\n', text)
print text
file = open(sys.argv[1].replace('.txt', '.vcf'), 'w')
file.write('##fileformat=VCFv4.0\n')
file.write('##source=dbSNP')
file.write('##dbSNP_BUILD_ID=137')
file.write('##reference=hg19\n')
file.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
for i in text:
    file.write(i)
    file.close()
INPUT:
chrM 152 T C T_S7998 N_S8980 0 DBSNP COVERED 1 1 1 282 36 0 163.60287 0.214008 0.02 11.666081 202 55 7221 1953 0 0 TT 14.748595 49 0 1786 0 KEEP
chr9 311 T C T_S7998 N_S8980 0 NOVEL COVERED 0.993882 0.999919 0.993962 299 0 0 207.697923 1 0.02 1.854431 0 56 0 1810 1 116 CC -44.649001 0 12 0 390 KEEP
chr13 440 C T T_S7998 N_S8980 0 NOVEL COVERED 1 1 1 503 7 0 4.130339 0.006696 0.02 4.124606 445 3 16048 135 0 0 CC 12.942762 40 0 1684 0 KEEP
OUTPUT desired:
##fileformat=VCFv4.0
##source=dbSNP##dbSNP_BUILD_ID=137##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
chrM 152 . T C . PASS .
chr9 311 . T C . PASS .
chr13 440 . C T . PASS .
OUTPUT obtained:
##fileformat=VCFv4.0
##source=dbSNP##dbSNP_BUILD_ID=137##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
chr13 440 . C T . PASS .
I would like to have some help regarding how this error can be rectified.
There are a couple of issues with your code.
In the filter function you pass text[2:]. I think you want to pass text to get all the rows.
In the last loop where you write to the .vcf file, you close the file inside the loop. You should first write all the values and then close the file outside the loop.
So your code will look like (I removed all the prints):
import sys

text = open(sys.argv[1]).readlines()
text = filter(lambda x: x.split('\t')[31].strip() == 'KEEP', text)  # pass text, not text[2:]
text = map(lambda x: x.split('\t')[0] + '\t' + x.split('\t')[1] + '\t.\t' + x.split('\t')[2] + '\t' + x.split('\t')[3] + '\t.\tPASS\t.\n', text)
file = open(sys.argv[1].replace('.txt', '.vcf'), 'w')
file.write('##fileformat=VCFv4.0\n')
file.write('##source=dbSNP')
file.write('##dbSNP_BUILD_ID=137')
file.write('##reference=hg19\n')
file.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
for i in text:
    file.write(i)
file.close()  # close after writing all the values, at the end
