extract a column from a text file - python

I am trying to extract two columns (datapoint and index) from a text file, and I want both columns written out to another text file. I made a small program that does roughly what I want, but it's not working completely. Any suggestions?
My program is:
f = open('infilename', 'r')
header1 = f.readline()
for line in f:
    line = line.strip()
    columns = line.split()
    j = float(columns[1])
    i = columns[3]
    print i, j
f.close()
It is also giving an error:
    j = float(columns[1])
IndexError: list index out of range
Sample data:
datapoint index
66.199748 200 0.766113 0 1
66.295962 200 0.826375 1 0
66.295962 200 0.762582 1 1
66.318076 200 0.850936 2 0
66.318076 200 0.751474 2 1
66.479436 200 0.821261 3 0
66.479436 200 0.765673 3 1
66.460284 200 0.869779 4 0
66.460284 200 0.741051 4 1
66.551778 200 0.841143 5 0
66.551778 200 0.765198 5 1
66.303606 200 0.834398 6 0
. . . . .
. . . . .
. . . . .
. . . . .
69.284336 200 0.926158 19998 0
69.284336 200 0.872788 19998 1
69.403861 200 0.943316 19999 0
69.403861 200 0.884889 19999 1

The following code lets you do all of the file writing through Python. Redirecting the output on the command line like you were doing works fine; this is just self-contained instead. The IndexError comes from lines with too few fields (blank lines, or the rows of dots in your sample), so the loop below skips any line that doesn't have enough columns.
f = open('in.txt', 'r')
out = open("out.txt", "w")
header1 = f.readline()  # skip the header line
for line in f:
    line = line.strip()
    columns = line.split()
    if len(columns) > 3:  # need at least 4 fields to index columns[3]
        j = float(columns[1])
        i = columns[3]
        out.write("%s %s\n" % (i, j))
f.close()
out.close()
Warning: this will always overwrite "out.txt". If you would rather append to the end of the file if it already exists (or create it if it doesn't), change the "w" to "a" when you open out.
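If you prefer, the same thing can be written with with blocks so that both files are closed automatically, even if an error occurs partway through. A minimal sketch, assuming the same file names and column positions as above:

with open('in.txt') as f, open('out.txt', 'w') as out:
    next(f)  # skip the header line
    for line in f:
        columns = line.split()
        if len(columns) > 3:  # need at least 4 fields to reach columns[3]
            out.write("%s %s\n" % (columns[3], float(columns[1])))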

Related

Comparing two tab-delimited files using Python

I obtained a text output file for my data sample which reports TE insertion sites in the genome. It looks like this:
sample chr pos strand family order support comment frequency
1 1 4254339 . hAT|9 hAT R - 0,954
1 1 34804000 . Stowaway|41 Stowaway R - 1
1 1 12839440 . Tourist|15 Tourist F - 1
1 1 11521962 . Tourist|10 Tourist R - 1
1 1 28197852 . Tourist|11 Tourist F - 1
1 1 7367886 . Stowaway|36 Stowaway R - 1
1 1 13130538 . Stowaway|36 Stowaway R - 1
1 1 6177708 . hAT|4 hAT F - 1
1 1 3783728 . hAT|20 hAT F - 1
1 1 10332288 . uc|12 uc R - 1
1 1 15780052 . uc|5 uc R - 1
1 1 28309928 . uc|5 uc R - 1
1 1 31010266 . uc|33 uc R - 0,967
1 1 84155 . uc|1 uc F - 1
1 1 3815830 . uc|31 uc R - 0,879
1 1 66241 . Mutator|4 Mutator F - 1
1 1 15709187 . Mutator|4 Mutator F - 0,842
I want to compare it with a bed file representing the TE sites for the reference genome. It looks like this:
chr start end family
1 12005 12348 Tourist|7
1 4254339 4254340 hAT|9
1 66241 66528 Mutator|4
1 76762 76849 Stowaway|10
1 81966 82251 Stowaway|39
1 84155 84402 uc|1
1 84714 84841 uc|28
1 13130538 13130540 Stowaway|3
I want to check whether the TE insertions found in my sample also occur in the reference. For example, the first TE, hAT|9 at position 4254339 on chromosome 1, should be reported if that position falls within the range defined in the bed file by column 2 (start) and column 3 (end), AND the entry is recognized as the hAT|9 family according to column 4. I am trying to do it with pandas but I'm pretty confused. Thanks for any suggestions!
EDIT:
I slightly modified the input files for easier understanding and parsing. Below is my code using pandas with two for loops (thanks @furas for the suggestion).
import pandas as pd

ref_base = pd.read_csv('ref_test.bed', sep='\t')
te_output = pd.read_csv('srr_test1.txt', sep='\t')
result = []
for te in te_output.index:
    te_pos = te_output['pos'][te]
    te_family_sample = te_output['family'][te]
    for ref in ref_base.index:
        te_family_ref = ref_base['family'][ref]
        start = ref_base['start'][ref]
        end = ref_base['end'][ref]
        if te_family_sample == te_family_ref and te_pos in range(start, end):
            # print(te_output["chr"][te], te_output["pos"][te], te_output["family"][te], te_output["support"][te],
            #       te_output["frequency"][te])
            result.append(str(te_output["chr"][te]) + '\t' + str(te_output["pos"][te]) + '\t' + te_output["family"][te]
                          + '\t' + te_output["support"][te] + '\t' + str(te_output["frequency"][te]))
print(result)

resultFile = open("result.txt", 'w')
# write data to file
for r in result:
    resultFile.write(r + '\n')
resultFile.close()
Here is my expected result:
1 4254339 hAT|9 R 0,954
1 84155 uc|1 F 1
1 66241 Mutator|4 F 1
I wrote it as simply as I could, but I would like to find a more efficient solution. Any ideas?
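One way to drop the nested loops is a single merge followed by a vectorized filter. A sketch under the same assumptions as your code (same file names and column names, and matching only on family and pos, as your loops do):

import pandas as pd

ref_base = pd.read_csv('ref_test.bed', sep='\t')
te_output = pd.read_csv('srr_test1.txt', sep='\t')

# Pair every sample row with every reference row of the same family,
# then keep the pairs whose position falls inside the reference interval
# (start <= pos < end, matching your `te_pos in range(start, end)` check).
merged = te_output.merge(ref_base, on='family', suffixes=('', '_ref'))
hits = merged[(merged['pos'] >= merged['start']) & (merged['pos'] < merged['end'])]

hits[['chr', 'pos', 'family', 'support', 'frequency']].to_csv(
    'result.txt', sep='\t', header=False, index=False)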

.vcf data to pandas dataframe

How can I convert the following .vcf data into a pandas dataframe?
GDrive Link To .txt File
Ideally I would like it in the form of a dataframe with one column per VCF field.
Thus far I have only been able to get the headers:
import pandas as pd

f = open('clinvar_final.txt', "r")
for line in f.readlines():
    if line[:5] == 'CHROM':
        vcf_header = line.strip().split('\t')
df = pd.DataFrame
df.header = vcf_header
There is no need to read line by line.
Pandas has an option called comment which can be used to skip unwanted lines.
You can directly load VCF files into pandas by running the following line.
In [9]: pd.read_csv('clinvar_final.txt', sep="\t", comment='#')
Out[9]:
CHROM POS ID REF ALT FILTER QUAL INFO
0 1 1014042 475283 G A . . AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;...
1 1 1014122 542074 C T . . AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926...
2 1 1014143 183381 C T . . ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:...
3 1 1014179 542075 C T . . ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:...
4 1 1014217 475278 C T . . AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;...
... ... ... ... .. .. ... ... ...
102316 3 179210507 403908 A G . . ALLELEID=393412;CLNDISDB=MedGen:C0018553,Orpha...
102317 3 179210511 526648 T C . . ALLELEID=519163;CLNDISDB=MedGen:C0018553,Orpha...
102318 3 179210515 526640 A C . . AF_EXAC=0.00002;ALLELEID=519178;CLNDISDB=MedGe...
102319 3 179210516 246681 A G . . AF_EXAC=0.00001;ALLELEID=245287;CLNDISDB=MedGe...
102320 3 179210538 259958 A T . . AF_EXAC=0.00001;ALLELEID=251013;CLNDISDB=MedGe...
GATK VariantsToTable is what you need to avoid issues caused by the flexibility of the VCF format. Then, once you have the table, import it into pandas. I would say this is the most robust way to do it.
https://gatk.broadinstitute.org/hc/en-us/articles/360036896892-VariantsToTable
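For example (a sketch; the invocation assumes GATK4, and the field list and file names are illustrative):

# First run VariantsToTable in a shell, e.g.:
#   gatk VariantsToTable -V clinvar_final.vcf -F CHROM -F POS -F ID -F REF -F ALT -O clinvar.table
import pandas as pd

df = pd.read_csv('clinvar.table', sep='\t')  # VariantsToTable writes a tab-separated table
print(df.head())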
import pandas as pd

with open(filename, "r") as f:
    lines = f.readlines()
chrom_index = [i for i, line in enumerate(lines) if line.strip().startswith("#CHROM")]
data = lines[chrom_index[0]:]
header = data[0].strip().split("\t")
informations = [d.strip().split("\t") for d in data[1:]]
vcf = pd.DataFrame(informations, columns=header)

Text file combining script - Python - Big Data

I was wondering if anyone could help me come up with a better way of doing this. Basically, I have text files formatted like this (some have more columns, some have fewer; each column is separated by spaces):
AA BB CC DD Col1 Col2 Col3
XX XX XX Total 1234 1234 1234
Aaaa OO0 LAHB TEXT 111 41 99
Aaaa OO0 BLAH XETT 112 35 176
Aaaa OO0 BALH TXET 131 52 133
Aaaa OO0 HALB EXTT 144 32 193
These text files range in size from a few hundred KB to around 100MB for the newest and largest files. What I need to do is combine two or more files, checking for duplicate data first: if the AA, BB, CC and DD fields of a row match a row from one of the other files, I append that file's Col1, Col2, Col3 (etc.) values onto that row; if not, I fill the new columns with zeros. Then I calculate the top 100 rows based on the total of each row and output those top 100 results to a webpage.
Here is the Python code I'm using:
import operator

def combine(dataFolder, getName, sort):
    files = getName.split(",")
    longestHeader = 0
    longestHeaderFile = []
    dataHeaders = []
    dataHeaderCode = []
    fileNumber = 1
    combinedFile = {}
    # First pass: collect the data-column headers from every file and find
    # the file with the longest set of key columns.
    for fileName in files:
        lines = []
        file = dataFolder+"/tableFile/"+fileName+".txt"
        with open(file) as f:
            x = 0
            for line in f:
                lines.append(line.upper().split())
                if x == 1:  # NOTE: x is never incremented, so this never breaks
                    break
        splitLine = lines[1].index("TOTAL")+1
        dataHeaders.extend(lines[0][splitLine:])
        headerNumber = 1
        for name in lines[0][splitLine:]:
            dataHeaderCode.append(str(fileNumber)+"_"+str(headerNumber))
            headerNumber += 1
        if splitLine > longestHeader:
            longestHeader = splitLine
            longestHeaderFile = lines[0][:splitLine]
        fileNumber += 1
    # Second pass: normalize each row onto the longest key header and merge.
    for fileName in files:
        lines = []
        file = dataFolder+"/tableFile/"+fileName+".txt"
        with open(file) as f:
            for line in f:
                lines.append(line.upper().split())
        splitLine = lines[1].index("TOTAL")+1
        headers = lines[0][:splitLine]
        data = lines[0][splitLine:]
        for x in range(2, len(lines)):
            normalizedLine = {}
            lineName = ""
            total = 0
            for header in longestHeaderFile:
                try:
                    if header == "EE" or header == "DD":
                        index = splitLine-1
                    else:
                        index = headers.index(header)
                    normalizedLine[header] = lines[x][index]
                except ValueError:
                    normalizedLine[header] = "XX"
                lineName += normalizedLine[header]
            combinedFile[lineName] = normalizedLine
            for header in dataHeaders:
                headIndex = dataHeaders.index(header)
                name = dataHeaderCode[headIndex]
                try:
                    index = splitLine+data.index(header)
                    value = int(lines[x][index])
                except ValueError:
                    value = 0
                except IndexError:
                    value = 0
                try:
                    value = combinedFile[lineName][header]
                    combinedFile[lineName][name] = int(value)
                except KeyError:
                    combinedFile[lineName][name] = int(value)
                total += int(value)
            combinedFile[lineName]["TOTAL"] = total
    combined = sorted(combinedFile.values(), key=operator.itemgetter(sort), reverse=True)
    return combined
I'm pretty new to Python, so this may not be the most "Pythonic" way of doing it. It works, but it's slow (about 12 seconds for two files of about 6MB each), and when we uploaded the code to our AWS server we got a 500 error saying the headers were too large when we tried to combine larger files. Can anyone help me refine this into something quicker and better suited to a web environment? Also, just to clarify, I don't have access to the AWS server or its settings; that goes through our Lead Developer, so I have no actual clue how it's set up. I do most of my dev work on localhost and then commit to GitHub.
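Since the heavy lifting here is joining rows on a compound key and summing columns, pandas can do most of it for you. A rough sketch, assuming whitespace-delimited files that all share the same four leading key columns (the AA BB CC DD block) followed by numeric data columns; it deliberately ignores the per-file "Total" row handling and variable-width keys your code deals with:

import pandas as pd

def combine_files(paths, key_cols=4):
    frames = []
    for n, path in enumerate(paths, start=1):
        df = pd.read_csv(path, delim_whitespace=True)
        keys = list(df.columns[:key_cols])
        # Prefix the data columns with the file number so they stay distinct.
        frames.append(df.set_index(keys).add_prefix('%d_' % n))
    # Outer-join on the key columns; rows missing from a file become 0.
    combined = pd.concat(frames, axis=1, join='outer').fillna(0)
    combined['TOTAL'] = combined.sum(axis=1)
    return combined.sort_values('TOTAL', ascending=False).head(100)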

Name columns of file.txt

I have file.txt with five columns and 50 lines. I want to name each column
1 5
2 4.2
. .
. .
Here is the Python code that generates file.txt:
f = open("test_01.txt", "a")
for i in xrange(len(end_zapstart_zap)):
f.write(" {} {} {} {} {} \n".format(Leavestart_zap[i],end_zapUDP[i],end_zapstart_zap[i],Leave_join[i],UDP_join[i]))
f.close()
I would like to name the first column Leavestart_zap, the second column end_zapUDP, and so on.
Here is my file test_01.txt:
0:00:00.672511 0:00:02.615662 0:00:03.433344 0:00:00.119777 0:00:00.025394 0:00:00.002278
0:00:00.263144 0:00:03.184893 0:00:03.541187 0:00:00.090872 0:00:00.025394 0:00:00.002278
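If the goal is just named columns in test_01.txt, one simple option is to write a header line once before the data rows. A minimal sketch using your variable names (it opens the file with "w" so the header is written exactly once):

# Leavestart_zap, end_zapUDP, etc. are the lists from your script.
header = "Leavestart_zap end_zapUDP end_zapstart_zap Leave_join UDP_join"
with open("test_01.txt", "w") as f:
    f.write(header + "\n")
    for i in xrange(len(end_zapstart_zap)):
        f.write(" {} {} {} {} {} \n".format(Leavestart_zap[i], end_zapUDP[i],
                                            end_zapstart_zap[i], Leave_join[i],
                                            UDP_join[i]))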

Getting errors when converting a text file to VCF format

I have Python code that tries to convert a text file containing variant information in its rows into a variant call format (VCF) file for my downstream analysis. I get almost everything correct, but when I run the code I miss the first two entries, i.e. the first two rows. The code is below; the line that does not read the entire file is marked with a comment. I would appreciate some expert advice. I just started coding in Python, so I am not well versed in it yet.
##fileformat=VCFv4.0
##fileDate=20140901
##source=dbSNP
##dbSNP_BUILD_ID=137
##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
import sys

text = open(sys.argv[1]).readlines()
print text
print "First print"
text = filter(lambda x: x.split('\t')[31].strip() == 'KEEP', text[2:])  # <-- this line is not reading the entire file
print text
print "################################################"
text = map(lambda x: x.split('\t')[0]+'\t'+x.split('\t')[1]+'\t.\t'+x.split('\t')[2]+'\t'+x.split('\t')[3]+'\t.\tPASS\t.\n', text)
print text
file = open(sys.argv[1].replace('.txt', '.vcf'), 'w')
file.write('##fileformat=VCFv4.0\n')
file.write('##source=dbSNP')
file.write('##dbSNP_BUILD_ID=137')
file.write('##reference=hg19\n')
file.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
for i in text:
    file.write(i)
    file.close()
INPUT:
chrM 152 T C T_S7998 N_S8980 0 DBSNP COVERED 1 1 1 282 36 0 163.60287 0.214008 0.02 11.666081 202 55 7221 1953 0 0 TT 14.748595 49 0 1786 0 KEEP
chr9 311 T C T_S7998 N_S8980 0 NOVEL COVERED 0.993882 0.999919 0.993962 299 0 0 207.697923 1 0.02 1.854431 0 56 0 1810 1 116 CC -44.649001 0 12 0 390 KEEP
chr13 440 C T T_S7998 N_S8980 0 NOVEL COVERED 1 1 1 503 7 0 4.130339 0.006696 0.02 4.124606 445 3 16048 135 0 0 CC 12.942762 40 0 1684 0 KEEP
OUTPUT desired:
##fileformat=VCFv4.0
##source=dbSNP##dbSNP_BUILD_ID=137##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
chrM 152 . T C . PASS .
chr9 311 . T C . PASS .
chr13 440 . C T . PASS .
OUTPUT obtained:
##fileformat=VCFv4.0
##source=dbSNP##dbSNP_BUILD_ID=137##reference=hg19
#CHROM POS ID REF ALT QUAL FILTER INFO
chr13 440 . C T . PASS .
I would like some help on how to rectify this error.
There are a couple of issues with your code:
In the filter function you are passing text[2:]. I think you want to pass text to get all the rows.
In the last loop, where you write to the .vcf file, you are closing the file inside the loop. You should first write all the values and then close the file outside the loop.
So your code will look like this (I removed all the prints):
import sys

text = open(sys.argv[1]).readlines()
text = filter(lambda x: x.split('\t')[31].strip() == 'KEEP', text)  # pass text, not text[2:]
text = map(lambda x: x.split('\t')[0]+'\t'+x.split('\t')[1]+'\t.\t'+x.split('\t')[2]+'\t'+x.split('\t')[3]+'\t.\tPASS\t.\n', text)
file = open(sys.argv[1].replace('.txt', '.vcf'), 'w')
file.write('##fileformat=VCFv4.0\n')
file.write('##source=dbSNP')
file.write('##dbSNP_BUILD_ID=137')
file.write('##reference=hg19\n')
file.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
for i in text:
    file.write(i)
file.close()  # close after writing all the values, at the end
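As a side note, the same program in a more modern Python 3 style uses with blocks so the files are closed for you. A sketch that keeps your column-31 'KEEP' filter and output format:

import sys

with open(sys.argv[1]) as src:
    rows = [line.split('\t') for line in src]

# Keep rows flagged KEEP in column 31, guarding against short lines.
kept = [r for r in rows if len(r) > 31 and r[31].strip() == 'KEEP']

with open(sys.argv[1].replace('.txt', '.vcf'), 'w') as out:
    out.write('##fileformat=VCFv4.0\n')
    out.write('##source=dbSNP')
    out.write('##dbSNP_BUILD_ID=137')
    out.write('##reference=hg19\n')
    out.write('#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n')
    for r in kept:
        out.write('\t'.join([r[0], r[1], '.', r[2], r[3], '.', 'PASS', '.']) + '\n')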
