Getting errors when converting a text file to VCF format - python

I have Python code that converts a text file containing variant information (one variant per row) into a Variant Call Format (VCF) file for my downstream analysis.
The output is mostly correct, but when I run the code the first two rows of the input go missing. The code is below; the line that does not read the entire file is marked with a comment. I would appreciate some expert advice.
I just started coding in Python, so I am not entirely well versed in it.
For reference, the VCF header format looks like this:
##fileformat=VCFv4.0
##fileDate=20140901
##source=dbSNP
##dbSNP_BUILD_ID=137
##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
import sys
text=open(sys.argv[1]).readlines()
print text
print "First print"
text=filter(lambda x:x.split('\t')[31].strip()=='KEEP',text[2:])  # <-- this line does not read the entire file
print text
print "################################################"
text=map(lambda x:x.split('\t')[0]+'\t'+x.split('\t')[1]+'\t.\t'+x.split('\t')[2]+'\t'+x.split('\t')[3]+'\t.\tPASS\t.\n',text)
print text
file=open(sys.argv[1].replace('.txt','.vcf'),'w')
file.write('##fileformat=VCFv4.0\n')
file.write('##source=dbSNP')
file.write('##dbSNP_BUILD_ID=137')
file.write('##reference=hg19\n')
file.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
for i in text:
    file.write(i)
    file.close()
INPUT:
chrM 152 T C T_S7998 N_S8980 0 DBSNP COVERED 1 1 1 282 36 0 163.60287 0.214008 0.02 11.666081 202 55 7221 1953 0 0 TT 14.748595 49 0 1786 0 KEEP
chr9 311 T C T_S7998 N_S8980 0 NOVEL COVERED 0.993882 0.999919 0.993962 299 0 0 207.697923 1 0.02 1.854431 0 56 0 1810 1 116 CC -44.649001 0 12 0 390 KEEP
chr13 440 C T T_S7998 N_S8980 0 NOVEL COVERED 1 1 1 503 7 0 4.130339 0.006696 0.02 4.124606 445 3 16048 135 0 0 CC 12.942762 40 0 1684 0 KEEP
OUTPUT desired:
##fileformat=VCFv4.0
##source=dbSNP##dbSNP_BUILD_ID=137##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
chrM 152 . T C . PASS .
chr9 311 . T C . PASS .
chr13 440 . C T . PASS .
OUTPUT obtained:
##fileformat=VCFv4.0
##source=dbSNP##dbSNP_BUILD_ID=137##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
chr13 440 . C T . PASS .
How can this error be rectified?

There are a couple of issues with your code:
In the filter function you are passing text[2:], which skips the first two rows. I think you want to pass text to get all the rows.
In the last loop, where you write to the .vcf file, you are closing the file inside the loop. You should first write all the values and then close the file outside the loop.
So your code will look like (I removed all the prints):
import sys
text=open(sys.argv[1]).readlines()
text=filter(lambda x:x.split('\t')[31].strip()=='KEEP',text) # Pass text
text=map(lambda x:x.split('\t')[0]+'\t'+x.split('\t')[1]+'\t.\t'+x.split('\t')[2]+'\t'+x.split('\t')[3]+'\t.\tPASS\t.\n',text)
file=open(sys.argv[1].replace('.txt','.vcf'),'w')
file.write('##fileformat=VCFv4.0\n')
file.write('##source=dbSNP')
file.write('##dbSNP_BUILD_ID=137')
file.write('##reference=hg19\n')
file.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
for i in text:
    file.write(i)
file.close() # close after writing all the values, at the end
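Side note: the code above is Python 2 (print statements, and filter/map returning lists). For Python 3, a rough equivalent sketch using list comprehensions, assuming the same tab-separated, 32-column input:

import sys

# Read all rows and keep only those whose 32nd column is 'KEEP'
with open(sys.argv[1]) as fh:
    rows = [line.rstrip('\n').split('\t') for line in fh]
kept = [r for r in rows if len(r) > 31 and r[31].strip() == 'KEEP']

with open(sys.argv[1].replace('.txt', '.vcf'), 'w') as out:
    out.write('##fileformat=VCFv4.0\n')
    out.write('##source=dbSNP')        # no trailing newlines here, matching the original output
    out.write('##dbSNP_BUILD_ID=137')
    out.write('##reference=hg19\n')
    out.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
    for r in kept:
        out.write('%s\t%s\t.\t%s\t%s\t.\tPASS\t.\n' % (r[0], r[1], r[2], r[3]))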


Extract values under a particular section occurring multiple times in a file

Stack Overflow is always helpful, and I need help once again.
In this file.txt I have a "Page Statistics" section, which appears 2 times in the file.
I have to extract the values under "DelAck" for SSD and FC. NOTE: there are values for both Node 0 and Node 1.
I would like to extract the values and put them in a list for Node 0 and Node 1.
Somewhat like this below:
Result:
SSD_0 = [115200, 115200] // DelAck values for Node 0
SSD_1 = [115200, 115200] // DelAck values for Node 1
FC_0 = [0, 0] // DelAck values for Node 0
FC_1 = [0, 0] // DelAck values for Node 1
Here is the file.txt with the data to be extracted. The Page Statistics section appears multiple times; I have included it 2 times here. I need to extract the values for SSD and FC as mentioned earlier, separately for Node 0 and Node 1, as shown above.
I hope I have explained my situation in enough detail.
***************************
file.txt
22:26:35 04/29/2021 ----- Current ----- ---------- Total ----------
Node Type Accesses Hits Hit% Accesses Hits Hit% LockBlk
0 Read 75630 39845 53 149728 79438 53 0
0 Write 47418 19709 42 93184 38230 41 0
1 Read 74076 38698 52 145810 75445 52 0
1 Write 42525 16099 38 84975 31751 37 0
Page Statistics
--CfcDirty-- ----CfcMax----- -DelAck--
Node FC NL SSD FC NL SSD FC NL SSD
0 250 0 14843 19200 0 115200 0 0 0
1 284 0 15618 19200 0 115200 0 0 0
Press the enter key to stop...
22:26:33 04/29/2021 ----- Current ----- ---------- Total ----------
Node Type Accesses Hits Hit% Accesses Hits Hit% LockBlk
0 Read 74098 39593 53 74098 39593 53 0
0 Write 45766 18521 40 45766 18521 40 0
1 Read 71734 36747 51 71734 36747 51 0
1 Write 42450 15652 37 42450 15652 37 0
Page Statistics
--CfcDirty-- ----CfcMax----- -DelAck--
Node FC NL SSD FC NL SSD FC NL SSD
0 258 0 13846 19200 0 115200 0 0 0
1 141 0 13356 19200 0 115200 0 0 0
Press the enter key to stop...
***************************
Then I would work on getting the average of the values in each list and printing it to the user.
The challenging part is extracting the data under Page Statistics for the SSD and FC drives.
Any help would be much appreciated.
Thanks!
Since the format of this file is quite unique, you will have to parse it by hand. You can do this with a simple program such as this:
## Function to get the values from these blank-spaced tables
def values(line):
    while '  ' in line:
        line = line.replace('  ', ' ')
    line = line[:-1]
    if line[0] == ' ': line = line[1:]
    if line[-1] == ' ': line = line[:-1]
    return line.split(' ')

## Load the file
inf = open("file.txt", "r")
lines = inf.readlines()
inf.close()

## Get the relevant tables
tables = []
for i in range(len(lines)):
    if 'Page Statistics' in lines[i]:
        tables.append(lines[i+2:i+5])

## Initiate the results dictionary
result = {"Node 0": {}, "Node 1": {}}
for c in ['CfcDirty', 'CfcMax', 'DelAck']:
    for n in result.keys():
        result[n][c] = {}
        for value in ['FC', 'NL', 'SSD']:
            result[n][c][value] = []

## Parse the tables and fill up the results
for t in tables:
    vnames = values(t[0])
    node0 = values(t[1])
    node1 = values(t[2])
    i = 0
    for c in ['CfcDirty', 'CfcMax', 'DelAck']:
        for _ in range(3):
            i += 1
            result['Node 0'][c][vnames[i]].append(node0[i])
            result['Node 1'][c][vnames[i]].append(node1[i])
The result will be a dictionary keyed by node, then statistic, then drive type. So, you can easily get any value from these tables:
>>> print(result['Node 0']['DelAck']['SSD'])
['0', '0']
>>> print(result['Node 1']['CfcMax']['SSD'])
['115200', '115200']
Now, you can compose any number of new variables that contain some values from these tables.
(BTW, I don't understand how you got the values in your example, SSD_0 = [115200, 115200] // DelAck values for Node 0: Node 0 always has the value 0 for SSD under DelAck, while 115200 is the SSD value under CfcMax.)
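If the 115200 values in your example were actually meant to come from CfcMax, here is a rough sketch (that interpretation is my assumption) of building the question's variables and their averages on top of the result dictionary above:

SSD_0 = [int(v) for v in result['Node 0']['CfcMax']['SSD']]  # [115200, 115200], assuming CfcMax was intended
SSD_1 = [int(v) for v in result['Node 1']['CfcMax']['SSD']]
FC_0 = [int(v) for v in result['Node 0']['DelAck']['FC']]    # [0, 0]
FC_1 = [int(v) for v in result['Node 1']['DelAck']['FC']]
print(sum(SSD_0) / len(SSD_0))  # average for Node 0 SSD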

Sort list of strings with data in the middle of them

So I have a list of strings which correspond to Kibana indices; the strings look like this:
λ curl '10.10.43.210:9200/_cat/indices?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open filebeat-2016.10.08 5 1 899 0 913.8kb 913.8kb
yellow open filebeat-2016.10.12 5 1 902 0 763.9kb 763.9kb
yellow open filebeat-2016.10.13 5 1 816 0 588.9kb 588.9kb
yellow open filebeat-2016.10.10 5 1 926 0 684.1kb 684.1kb
yellow open filebeat-2016.10.11 5 1 876 0 615.2kb 615.2kb
yellow open filebeat-2016.10.09 5 1 745 0 610.7kb 610.7kb
The dates come back unsorted. I want to sort these by the index name (which contains a date), filebeat-2016.10.xx; ASC or DESC is fine.
As it stands now I isolate the strings like this:
import subprocess

subp = subprocess.Popen(['curl', '-XGET', '-H', '"Content-Type: application/json"',
                         '10.10.43.210:9200/_cat/indices?v'], stdout=subprocess.PIPE)
curlstdout, curlstderr = subp.communicate()
op = str(curlstdout)
kibanaIndexList = []
kibanaIndices = op.splitlines()
for index, elem in enumerate(kibanaIndices):
    if "kibana" not in kibanaIndices[index]:
        print kibanaIndices[index] + "\n"
        kibanaIndexList.append(kibanaIndices[index])
But I can't sort them in a meaningful way.
Is this what you need?
lines = """yellow open filebeat-2016.10.08 5 1 899 0 913.8kb 913.8kb
yellow open filebeat-2016.10.12 5 1 902 0 763.9kb 763.9kb
yellow open filebeat-2016.10.13 5 1 816 0 588.9kb 588.9kb
yellow open filebeat-2016.10.10 5 1 926 0 684.1kb 684.1kb
yellow open filebeat-2016.10.11 5 1 876 0 615.2kb 615.2kb
yellow open filebeat-2016.10.09 5 1 745 0 610.7kb 610.7kb
""".splitlines()
def extract_date(line):
return line.split()[2]
lines.sort(key=extract_date)
print("\n".join(lines))
Here extract_date is a function that returns the third column (like filebeat-2016.10.12). We pass this function as the key argument to sort so that value is used as the sort key. Your date format sorts correctly as plain strings. You could use a more sophisticated extract_date function to extract only the date.
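For instance, a sketch of such a stricter key function, which parses the date portion of the index name into a real date object (assuming the filebeat-YYYY.MM.DD naming shown above):

from datetime import datetime

def extract_date(line):
    name = line.split()[2]             # e.g. "filebeat-2016.10.08"
    date_part = name.split('-', 1)[1]  # "2016.10.08"
    return datetime.strptime(date_part, '%Y.%m.%d')

lines.sort(key=extract_date)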
I copied your sample data into a text file as UTF-8 since I don't have access to the server you referenced. Using list comprehensions and string methods you can clean the data, then break it down into its component parts. Sorting is accomplished by passing a lambda function as the key argument to the built-in sorted() function:
# read text data into list one line at a time
result_lines = open('kibana_data.txt').readlines()

# remove trailing newline characters
clean_lines = [line.replace("\n", "") for line in result_lines]

# get first line of file
info = clean_lines[0]

# create list of field names
header = [val.replace(" ", "") for val in clean_lines[1].split()]

# create list of lists for data rows
data = [[val.replace(" ", "") for val in line.split()]
        for line in clean_lines[2:]]

# sort data rows by date (third item in each row, index 2)
final = sorted(data, key=lambda row: row[2], reverse=False)
More info on list comprehensions here: https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions
More info on sorting here: https://wiki.python.org/moin/HowTo/Sorting

Compiling UniProt txt file into a dict to retrieve key (ID) and values (MOD_RES)

I am not very familiar with Python and am trying to retrieve data from a UniProt text file (test1) that looks like this:
ID YSH1_YEAST Reviewed; 779 AA.
AC Q06224; D6VYS4;
DT 10-JAN-2006, integrated into UniProtKB/Swiss-Prot
DT 01-NOV-1996, sequence version 1.
.
.
.
FT METAL 184 184 Zinc 1. {ECO:0000250}.
FT METAL 184 184 Zinc 2. {ECO:0000250}.
FT METAL 430 430 Zinc 2. {ECO:0000250}.
FT MOD_RES 517 517 Phosphoserine; by ATM or ATR.
FT {ECO:0000244|PubMed:18407956}.
FT MUTAGEN 37 37 D->N: Loss of endonuclease activity.
.
.
So far I am able to retrieve the MOD_RES and AC lines separately, using these snippets:
import re

test = open('test1', 'r')
regex2 = re.compile(r'^AC\s+\w+')
for line in test:
    ac = regex2.findall(line)
    for a in ac:
        a = ''.join(a)
        print(a[5:12])

This prints:
Q06224
P16521
testfile = open('test1')
regex = re.compile(r'^FT\s+MOD_RES\s+\w+\s+\w+\s+\w.+')
for line in testfile:
    po = regex.findall(line)
    for p in po:
        p = ''.join(p)
        print(p[23:48])

which prints:
517 Phosphoserine;
2 N-acetylserine
187 N6,N6,N6-trime
196 N6,N6,N6-trime
The goal is to get each AC and its relevant modified residues (MOD_RES) into a tab-separated format. Also, if more than one MOD_RES appears in the data for a particular AC, repeat that AC, giving a table like this:
AC MOD_RES
Q06224 517 517 Phosphoserine
P04524 75 75 Phosphoserine
Q06224 57 57 Phosphoserine
Have you taken a look at Biopython?
You should be able to parse your Uniprot file like that:
from Bio import SwissProt
for record in SwissProt.parse(open('/path/to/your/uniprot_sprot.dat')):
    for feature in record.features:
        print feature
From there you should be able to print what you want to a file.
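For example, a hedged sketch of writing the AC/MOD_RES table: this assumes the older Biopython behaviour where each feature is a plain tuple (key, from, to, description, id); newer Biopython releases return SeqFeature objects instead, so check what print feature shows for your version:

from Bio import SwissProt

out = open('mod_res.tsv', 'w')
out.write('AC\tMOD_RES\n')
for record in SwissProt.parse(open('test1')):
    ac = record.accessions[0]  # primary accession, e.g. Q06224
    for feature in record.features:
        if feature[0] == 'MOD_RES':  # tuple layout assumed, see note above
            out.write('%s\t%s %s %s\n' % (ac, feature[1], feature[2], feature[3]))
out.close()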

extract a column from a text file

I am trying to extract two columns from a text file, datapoint and index, and I want both columns written to a text file. I made a small program that is somewhat doing what I want, but it's not working completely.
Any suggestions, please?
My program is:
f = open('infilename', 'r')
header1 = f.readline()
for line in f:
    line = line.strip()
    columns = line.split()
    j = float(columns[1])
    i = columns[3]
    print i, j
f.close()
It is also giving an error:
j=float(columns[1])
IndexError: list index out of range
Sample data:
datapoint index
66.199748 200 0.766113 0 1
66.295962 200 0.826375 1 0
66.295962 200 0.762582 1 1
66.318076 200 0.850936 2 0
66.318076 200 0.751474 2 1
66.479436 200 0.821261 3 0
66.479436 200 0.765673 3 1
66.460284 200 0.869779 4 0
66.460284 200 0.741051 4 1
66.551778 200 0.841143 5 0
66.551778 200 0.765198 5 1
66.303606 200 0.834398 6 0
. . . . .
. . . . .
. . . . .
. . . . .
69.284336 200 0.926158 19998 0
69.284336 200 0.872788 19998 1
69.403861 200 0.943316 19999 0
69.403861 200 0.884889 19999 1
The following code will allow you to do all of the file writing through Python. Redirecting through the command line like you were doing works fine; this will just be self-contained instead.
f = open('in.txt', 'r')
out = open("out.txt", "w")
header1 = f.readline()
for line in f:
    line = line.strip()
    columns = line.split()
    if len(columns) > 3:  # need at least 4 columns to reach columns[3]
        j = float(columns[1])
        i = columns[3]
        out.write("%s %s\n" % (i, j))
f.close()
out.close()
Warning: this will always overwrite "out.txt". If you would rather append to the end of it when it already exists (and create it when it doesn't), change the "w" to "a" when you open out.
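An alternative sketch using with blocks, which close both files automatically even if an error occurs (same assumed filenames):

with open('in.txt', 'r') as f, open('out.txt', 'w') as out:
    header1 = f.readline()  # skip the header line
    for line in f:
        columns = line.split()
        if len(columns) > 3:  # need at least 4 columns
            out.write("%s %s\n" % (columns[3], float(columns[1])))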

Need modification to regex to work in other situations

I just discovered that the structure of my file can differ, and because of this my regex only works some of the time. My regex is:
v6 = re.findall(r'(?s)----------\s*LOW VOLTAGE SUMMARY BY AREA.*?\rACTIVITY.+?',wholefile)
It currently matches the following section of the file.
---------- LOW VOLTAGE SUMMARY BY AREA ----------
BUS NAME BASKV VOLT TIME AREA ZONE
12006 [AMISTAD 69.0] 0.971 1.8700 10 NEW MEXICO 121
11223 [WHITESA213.8] 0.918 1.9900 11 EL PASO 110
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
79086 [PAGOSA 115] 0.937 2.0333 73 WAPA R.M. 791
ACTIVITY?
PDEV
ENTER OUTPUT DEVICE CODE:
0 FOR NO OUTPUT
1 FOR PROGRESS WINDOW
However that section of the file sometimes is as below
---------- LOW VOLTAGE SUMMARY BY AREA ----------
BUS NAME BASKV VOLT TIME AREA ZONE
12006 [AMISTAD 69.0] 0.742 13.2060 10 NEW MEXICO 121
11223 [WHITESA213.8] 0.916 1.8367 11 EL PASO 110
70187 [FTGARLND69.0] 0.936 19.6099 70 PSCOLORADO 710
73216 [WINDRIVR 115] 0.858 3.6100 73 WAPA R.M. 750
(VFSCAN) AT TIME = 20.0000 UP TO 100 BUSES WITH LOW FREQUENCY BELOW 59.600:
X ----- BUS ------ X FREQ X ----- BUS ------ X FREQ
12063 [ROSEBUD 13.8] 59.506
On both occasions I would like to capture just the section below:
---------- LOW VOLTAGE SUMMARY BY AREA ----------
BUS NAME BASKV VOLT TIME AREA ZONE
12006 [AMISTAD 69.0] 0.971 1.8700 10 NEW MEXICO 121
11223 [WHITESA213.8] 0.918 1.9900 11 EL PASO 110
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
79086 [PAGOSA 115] 0.937 2.0333 73 WAPA R.M. 791
How can my regex return the section above regardless of which version of the file I am looking at?
This should work; note the non-capturing group (?:...), because with a plain capturing group re.findall would return only the group's text rather than the whole match:
v6 = re.findall(r'(?s)----------\s*LOW VOLTAGE SUMMARY BY AREA.*?\r(?:ACTIVITY|\(VFSCAN\)).+?',wholefile)
I would not suggest using a regex, but doing some parsing instead. Let's assume your data is in a string called data:
lines = [line for line in data.split("\n")]

# find start of header
for index, line in enumerate(lines):
    if "LOW VOLTAGE SUMMARY BY AREA" in line:
        start_index = index
        break

# find first data entry (line starting with whitespace and then a number)
for index, line in enumerate(lines[start_index:]):
    if line.strip() and line.split()[0].isdigit():
        first_entry_index = start_index + index
        break

# find last data entry (line starting with whitespace and then a number)
for index, line in enumerate(lines[first_entry_index:]):
    # we don't do this inside the if because it's possible
    # to end the data with only entries and whitespace
    end_entry_index = first_entry_index + index
    if line.strip() and not line.split()[0].isdigit():
        break

# print all lines between header and last data entry
print("\n".join(lines[start_index:end_entry_index]))
