Python: compare column in two files - python

I'm trying to solve this text-processing task in Python, but I can't get the column comparison to work.
What I have tried:
#!/usr/bin/env python
import sys
def Main():
    print "This is your input Files %s,%s" % ( file1,file2 )
    f1 = open(file1, 'r')
    f2 = open(file2, 'r')
    for line in f1:
        column1_f1 = line.split()[:1]
        #print column1_f1
        for check in f2:
            column2_f2 = check.split()[:1]
            print column1_f1,column2_f2
            if column1_f1 == column2_f2:
                print "Match",line
            else:
                print line,check
    f1.close()
    f2.close()
if __name__ == '__main__':
    if len(sys.argv) != 3:
        print >> sys.stderr, "This Script need exact 2 argument, aborting"
        exit(1)
    else:
        ThisScript, file1, file2 = sys.argv
        Main()
I'm new to Python. Please help me learn and understand this.

I would solve it in Python 3 much as user46911 did with awk. Read the second file and save its keys in a dictionary, then check for each line of the first file whether its key exists:
import sys
codes = {}
with open(sys.argv[2], 'r') as f2:
    for line in f2:
        fields = line.split()
        codes[fields[0]] = fields[1]
with open(sys.argv[1], 'r') as f1:
    for line in f1:
        fields = line.split(None, 1)
        if fields[0] in codes:
            print('{0:4s}{1:s}'.format(codes[fields[0]], line[4:]), end='')
        else:
            print(line, end='')
Run it like:
python3 script.py file1 file2
That yields:
060090 AKRABERG FYR DN 6138 -666 101
EKVG 060100 VAGA FLOGHAVN DN 6205 -728 88
060110 TORSHAVN DN 6201 -675 55
060120 KIRKJA DN 6231 -631 55
060130 KLAKSVIK HELIPORT DN 6221 -656 75
060160 HORNS REV A DN 5550 786 21
060170 HORNS REV B DN 5558 761 10
060190 SILSTRUP DN 5691 863 0
060210 HANSTHOLM DN 5711 858 0
EKGF 060220 TYRA OEST DN 5571 480 43
EKTS 060240 THISTED LUFTHAVN DN 5706 870 8
060290 GROENLANDSHAVNEN DN 5703 1005 0
EKYT 060300 FLYVESTATION AALBORG DN 5708 985 13
060310 TYLSTRUP DN 5718 995 0
060320 STENHOEJ DN 5736 1033 56
060330 HIRTSHALS DN 5758 995 0
EKSN 060340 SINDAL FLYVEPLADS DN 5750 1021 28
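
For the narrower question of which first-column keys occur in both files, a set lookup may be enough. It also sidesteps a problem in the nested-loop attempt: the inner `for check in f2` consumes the file object on the first pass, so later lines of the first file are compared against nothing. The sketch below works on lists of lines; the function names and the two-line sample data are made up for illustration, and the layout of the second file (station number, then code) is an assumption.

```python
def first_column(lines):
    """Return the set of first whitespace-separated fields, skipping blank lines."""
    return {line.split()[0] for line in lines if line.strip()}


def split_by_match(lines_a, lines_b):
    """Partition lines_a by whether their first column also occurs in lines_b."""
    keys_b = first_column(lines_b)
    matched, unmatched = [], []
    for line in lines_a:
        if not line.strip():
            continue  # ignore empty lines
        target = matched if line.split()[0] in keys_b else unmatched
        target.append(line)
    return matched, unmatched


# Illustrative data only; real input would come from open(...).readlines().
file1_lines = ["060090 AKRABERG FYR\n", "060100 VAGA FLOGHAVN\n"]
file2_lines = ["060100 EKVG\n"]
matched, unmatched = split_by_match(file1_lines, file2_lines)
```

Building the set once keeps the comparison at one pass per file instead of one pass over the second file for every line of the first.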

Related

How to add a word after a particular line in a text?

I am trying to add a word after a particular line in my file. I am using infile and a reference and trying to create outfile. The infile and reference are the same type, but the reference contains the TER word at a particular position. I want to add the TER word into the outfile (basically create a copy of infile and add the TER from the reference).
I am trying to search by a number (resnum in the code) but there will be a problem as many consecutive lines have the same number.
Can anybody help?
from sys import argv
import argparse
script,infile,outfile, reference = argv
Ter = []
res = []
def get_Ter(reference):
    reference_1 = open(reference,"r")
    for line in reference_1:
        contents = line.split(" ")
        if contents[0] == "TER":
            resnum = line[22:27]
            resname = line[17:20]
            chain = line[21]
            Ter.append(resnum)
def find_TER(infile,outfile):
    with open(infile, "r") as infile_1:
        content = infile_1.readlines()
    with open(outfile, "w+") as outfile_1:
        outfile_1.write(content)
        if line[0:6] == "ATOM ":
            resnum_1 = line[22:27]
            res.append(resnum_1)
            if resnum_1 in res == resnum in Ter:
                outfile_1.write(line + "\nTER")
find_TER(infile,outfile)
get_Ter(reference)
Example of a file (this is the reference; the infile is the same but is missing the TER line). In the real file the columns are all nicely lined up underneath each other (unlike the formatting here):
ATOM 992 SG CYX D 452 23.296 45.745 28.572 1.00 0.00
ATOM 993 C CYX D 452 20.742 42.431 27.841 1.00 0.00
ATOM 994 O CYX D 452 20.689 41.447 28.565 1.00 0.00
ATOM 995 OXT CYX D 452 19.788 42.822 27.185 1.00 0.00
TER 995 CYS D 452
ATOM 996 N ARG D 492 27.510 26.357 34.041 1.00 0.00
ATOM 997 H1 ARG D 492 26.590 26.591 33.694 1.00 0.00
ATOM 998 H2 ARG D 492 28.138 27.135 34.182 1.00 0.00
ATOM 999 H3 ARG D 492 27.422 26.030 34.993 1.00 0.00
ATOM 1000 CA ARG D 492 28.179 25.410 33.192 1.00 0.00
Now I have this:
from sys import argv
import argparse
script,infile,outfile, reference = argv
Ter = []
res = []
def get_Ter(reference):
    reference_1 = open(reference,"r")
    for line in reference_1:
        contents = line.split(" ")
        if contents[0] == "TER":
            ternum = line[22:27]
def find_TER(infile,outfile):
    with open(infile, "r") as infile_1:
        content = infile_1.readlines()
    with open(outfile, "w+") as outfile_1:
        for line in content:
            outfile_1.write(line)
            line = line.split(" ")
            if line[0] == "ATOM":
                resnum = line[22:27]
                if ternum == resnum:
find_TER(infile,outfile)
get_Ter(reference)
The basic logic is twofold:
1. Determine when you need the TER line and generate it. (You've done this.)
2. Detect when it's time to write that line to the output.
All you really need to do for the second part is to recognize that you have a pending TER output for resnum 452 (or whatever number it is). You can do this with a simple variable: keep it at -1 until you have a valid resnum.
As you read, you check that resnum consistently. If it's positive and different from the most recent input line, then you have to print the TER line before doing anything else. Something like this:
contents = line.split()
resnum = line[22:27]
if ternum > 0 and ternum != int(resnum):
    # write out the TER line
    ternum = -1
# continue with rest of the program.
if contents[0] == "TER":
    ...
You might also need to check at end of file, in case the last resnum has a hanging TER line to print out.
Is that enough to move you along?
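
The pending-TER idea above could be sketched roughly as follows. This is a minimal illustration, not a full solution: it assumes fixed-column PDB-style lines in which the residue number sits in the slice `line[22:27]` (the slice used in the question), it writes a bare `TER` record rather than a complete one, and the function names `ter_residues` and `insert_ter` are hypothetical.

```python
def ter_residues(reference_lines):
    """Collect the residue field (slice 22:27) of every TER record."""
    return {line[22:27].strip()
            for line in reference_lines if line.startswith("TER")}


def insert_ter(lines, ter_resnums):
    """Copy lines, adding a TER record once a flagged residue is finished."""
    out = []
    pending = None  # residue number still waiting for its TER line
    for line in lines:
        resnum = line[22:27].strip() if line.startswith("ATOM") else None
        if pending is not None and resnum != pending:
            # The flagged residue has ended, so emit its TER first.
            out.append("TER\n")
            pending = None
        out.append(line)
        if resnum in ter_resnums:
            pending = resnum
    if pending is not None:
        # The input ended while a TER line was still pending.
        out.append("TER\n")
    return out
```

Because the TER line is written only when the residue number changes, consecutive ATOM lines that share a number do not produce duplicate TER records, and the final check covers a flagged residue at the end of the file.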

Using text file data, classification and make other text file in python

Using Python, I want to separate out some data from a file.
The file is a text file with no tabs, only a single space between fields.
Here is an example file:
//test.txt
Class name age room fund.
13 A 25 B101 300
12 B 21 B102 200
9 C 22 B103 200
13 D 25 B102 100
20 E 23 B105 100
13 F 25 B103 300
11 G 25 B104 100
13 H 22 B101 300
I want to keep only the lines containing specific data,
class: 13, fund: 300,
and save them to another text file.
If the code worked, the resulting text file would be:
//new_test.txt
Class name age room fund.
13 A 25 B101 300
13 F 25 B103 300
13 H 22 B101 300
thanks.
Hk
This should do.
with open('new_test.txt','w') as new_file:
    with open('test.txt') as file:
        print(file.readline(),end='',file=new_file)
        for line in file:
            arr=line.strip().split()
            if arr[0]=='13' and arr[-1]=='300':
                print(line,end='',file=new_file)
However, you should include your code when asking a question. It ensures that the purpose of this site is served.
If you want to filter your data:
def filter_data(src_file, dest_file, filters):
    data = []
    with open(src_file) as read_file:
        header = [h.lower().strip('.') for h in read_file.readline().split()]
        for line in read_file:
            values = line.split()
            row = dict(zip(header, values))
            data.append(row)
            for k, v in filters.items():
                if data and row.get(k, None) != v:
                    data.pop()
                    break
    with open(dest_file, 'w') as write_file:
        write_file.write(' '.join(header) + '\n')
        for row in data:
            write_file.write(' '.join(row.values()) + '\n')
my_filters = {
    "class": "13",
    "fund": "300"
}
filter_data(src_file='test.txt', dest_file='new_test.txt', filters=my_filters)

making all possible combination in Python and also use of google API for csv/xlsx file

I have to write a script in python that will do following actions
I have a xlsx/csv file in which there are 300 cities listed in one column
I have to make all pairs between them and also with help of google api I have to add their distance and travel time in the second column
my CSV file is looks like this:
=======
SOURCE
=======
Agra
Delhi
Jaipur
and expected output in csv/xlsx file be like this
=============================================
SOURCE | DESTINATION | DISTANCE | TIME_TRAVEL
=============================================
Agra | Delhi | 247 | 4
Agra | Jaipur | 238 | 4
Delhi | Agra | 247 | 4
Delhi | jaipur | 281 | 5
Jaipur | Agra | 238 | 4
Jaipur | Delhi | 281 | 5
and so on.. how to do this.?
NOTE: Distance and Travel Time are from google.
To make the pairs, you can use itertools.permutations, which yields every ordered pair.
The code would look like this:
import csv # imports the csv module
import sys # imports the sys module
import ast
import itertools
cities = []  # names read from the input file
source_list = []
destination_list = []
type_list = []
# Python 2: csv files are opened in binary mode
f = open(sys.argv[1], 'rb')
g = open(sys.argv[2], 'wb')
try:
    reader = csv.reader(f) # creates the reader object
    my_list = list(reader)
    for i in my_list:
        cities.append(i[0])
    # keep the pairs separate from the input names so the columns stay aligned
    a = list(itertools.permutations(cities, 2))
    for i in a:
        source_list.append(i[0])
        destination_list.append(i[1])
    mywriter=csv.writer(g)
    rows = zip(source_list,destination_list)
    mywriter.writerows(rows)
    g.close()
finally:
    f.close()
Apart from that, to get the distance and time from Google, this sample code may work as a starting point for further debugging:
import csv # imports the csv module
import sys # imports the sys module
import urllib2,json
import ast
api_google_key = ''
api_google_url = 'https://maps.googleapis.com/maps/api/distancematrix/json?origins='
source_list = []
destination_list = []
distance_list = []
duration_list = []
f = open(sys.argv[1], 'rb')
g = open(sys.argv[2], 'wb')
# opens the csv file
try:
    reader = csv.reader(f)
    my_list = list(reader) # creates the reader object
    for i in my_list:
        if i:
            s = (i[0])
            src = s.replace(" ","")
            d = (i[1])
            dest = d.replace(" ","")
            source = ''.join(e for e in src if e.isalnum())
            destination = ''.join(e for e in dest if e.isalnum())
            print 'source status = '+str(source.isalnum())
            print 'dest status = '+str(destination.isalnum())
            source_list.append(source)
            destination_list.append(destination)
            request = api_google_url+source+'&destinations='+destination+'&key='+api_google_key
            print request
            dist = json.load(urllib2.urlopen(request))
            if dist['rows']:
                if 'duration' in dist['rows'][0]['elements'][0].keys():
                    duration_dict = dist['rows'][0]['elements'][0]['duration']['text']
                    distance_dict = dist['rows'][0]['elements'][0]['distance']['text']
                else:
                    duration_dict = 0
                    distance_dict = 0
            else:
                duration_dict = 0
                distance_dict = 0
            distance_list.append(distance_dict)
            duration_list.append(duration_dict)
    mywriter=csv.writer(g)
    rows = zip(source_list,destination_list,distance_list,duration_list)
    mywriter.writerows(rows)
    g.close()
finally:
    f.close()
You can do this with itertools.product, but then you'll also get repetitions like (Agra, Agra), for which the distance is simply 0.
import itertools
cities = ["Agra","Delhi","Jaipur"]
cities2 = cities
p = itertools.product(cities, cities2)
print(list(p))
In this case you'd get
[('Agra', 'Agra'), ('Agra', 'Delhi'), ('Agra', 'Jaipur'), ('Delhi', 'Agra'), ('Delhi', 'Delhi'), ('Delhi', 'Jaipur'), ('Jaipur', 'Agra'), ('Jaipur', 'Delhi'), ('Jaipur', 'Jaipur')]
You can loop over this list and make a request to Google for each pair to get the travel time and distance.
>>> for pair in list(p):
...     print (pair)
...
('Agra', 'Agra')
('Agra', 'Delhi')
('Agra', 'Jaipur')
('Delhi', 'Agra')
('Delhi', 'Delhi')
('Delhi', 'Jaipur')
('Jaipur', 'Agra')
('Jaipur', 'Delhi')
('Jaipur', 'Jaipur')
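
If itertools.product is used anyway, the same-city pairs can be filtered out before any request is made. A small sketch, reusing the three cities from the question:

```python
from itertools import permutations, product

cities = ["Agra", "Delhi", "Jaipur"]

# Keep only the ordered pairs whose two cities differ.
pairs = [(a, b) for a, b in product(cities, repeat=2) if a != b]

# For a list without duplicate names this is the same sequence that
# itertools.permutations(cities, 2) produces.
same_as_permutations = pairs == list(permutations(cities, 2))
```

For 300 cities this leaves 300 * 299 = 89,700 ordered pairs, which is worth keeping in mind when each pair costs one API request.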
You can get all the combinations with itertools.permutations() like so:
from itertools import permutations
with open(cities_file, 'r') as f, open(newfile, 'w') as f2:
    for pair in (permutations([a.strip() for a in f.read().splitlines()], 2)):
        print pair
        response = googleapi.get(pair)
        f2.write(response+'\n')
Output of print pair
('Agra', 'Delhi')
('Agra', 'Jaipur')
('Delhi', 'Agra')
('Delhi', 'Jaipur')
('Jaipur', 'Agra')
('Jaipur', 'Delhi')
You can then hit the api from the list elements 1 by 1 and keep storing the result in the file.
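
One way to keep the pairing and the file writing separate from the API call is to pass the lookup in as a function. The sketch below is only an outline under that assumption: `write_pairs`, `fake_lookup`, and the `KNOWN` table are hypothetical names, the figures are copied from the expected output in the question, and a real script would replace `fake_lookup` with a call to the distance service.

```python
import csv
import io
from itertools import permutations


def write_pairs(cities, out_file, lookup):
    """Write one CSV row per ordered city pair.

    `lookup` is a hypothetical callable that returns
    (distance, travel_time) for a source and a destination.
    """
    writer = csv.writer(out_file)
    writer.writerow(["SOURCE", "DESTINATION", "DISTANCE", "TIME_TRAVEL"])
    for source, destination in permutations(cities, 2):
        distance, travel_time = lookup(source, destination)
        writer.writerow([source, destination, distance, travel_time])


# Stand-in figures taken from the question's expected output.
KNOWN = {
    frozenset(["Agra", "Delhi"]): (247, 4),
    frozenset(["Agra", "Jaipur"]): (238, 4),
    frozenset(["Delhi", "Jaipur"]): (281, 5),
}


def fake_lookup(source, destination):
    """Placeholder for the real API request."""
    return KNOWN[frozenset([source, destination])]


buffer = io.StringIO()  # an opened output file would be used in practice
write_pairs(["Agra", "Delhi", "Jaipur"], buffer, fake_lookup)
rows = buffer.getvalue().splitlines()
```

Writing each row as soon as its result arrives means a failure partway through still leaves the earlier rows in the output.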

Rename line by line a file

As input, I have the following lines in my file.
...
VOAUT0000001712_19774.JPG FRYW-1712-19774.jpg eab516afc1aaa10ad23edb5c15ae4ea4 1712 01
VOAUT0000001712_19775.JPG FRYW-1712-19775.jpg 2715ceba8fd5c69b4ca6952e942a1a8a 1712 02
VOAUT0000001712_19776.JPG FRYW-1712-19776.jpg b1a0c4ec6160da3511e23c617517ff6f 1712 03
VOAUT0000001713_19795.JPG FRYW-1713-19795.jpg 56cd173c6e9436b19d39de214669cc53 1713 01
VOAUT0000001713_19796.JPG FRYW-1713-19796.jpg 271aa1b9ef2ac39c502a270c82b31fa3 1713 02
VOAUT0000001713_19797.JPG FRYW-1713-19797.jpg 667732a85660bebec168bc46b884d9b7 1713 03
VOAUT0000001714_19763.JPG FRYW-1714-19763.jpg d37770d6cde5639ce5db4e6a436498a8 1714 01
VOAUT0000001714_19764.JPG FRYW-1714-19764.jpg ce891ca4d4ea59c3a312a468bb0d4408 1714 02
VOAUT0000001714_19765.JPG FRYW-1714-19765.jpg bd7fed521fe3997bf5c879d9d5ce942d 1714 03
...
I would like to modify my file line by line in order to have this
17124615_01_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1712-19774.jpg eab516afc1aaa10ad23edb5c15ae4ea4
17124615_02_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1712-19775.jpg 2715ceba8fd5c69b4ca6952e942a1a8a
17124615_03_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1712-19776.jpg b1a0c4ec6160da3511e23c617517ff6f
17134615_01_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1713-19795.jpg 56cd173c6e9436b19d39de214669cc53
17134615_02_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1713-19796.jpg 271aa1b9ef2ac39c502a270c82b31fa3
17134615_03_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1713-19797.jpg 667732a85660bebec168bc46b884d9b7
17144615_01_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1714-19763.jpg d37770d6cde5639ce5db4e6a436498a8
17144615_02_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1714-19764.jpg ce891ca4d4ea59c3a312a468bb0d4408
17144615_03_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1714-19765.jpg bd7fed521fe3997bf5c879d9d5ce942d
Here is the beginning of my code:
def renameLineByLine():
    with open('/opt/data/photos.txt') as f:
        for line in f:
            newname, file, path, checksum = line.split()
            if ..?? :
                try:
                    rename(...???)
                except OSError:
                    logger.error('Got a problem')
but I do not see how to rename each line with the new format ..?
You need to read in the file correctly: Skip empty lines and split for the correct values. There is no path where you expect one and you do not account for the two numbers at the end, which are crucial for the conversion.
Here I simply write all newly formatted lines into a new file, photos_new.txt. I hope this gets you started.
Note however, that your method name renameLineByLine as well as your try/except seem to hint that you also want to move/rename/do some work on your pictures. If that is the case, this answer will not be sufficient and you should try to elaborate a little bit more.
def renameLineByLine():
    new_lines = []
    path = '/opt/AutoPrivilege/client/photos/'
    with open('/opt/data/photos.txt', 'r') as fin, \
            open('/opt/data/photos_new.txt', 'w') as fout:
        for line in fin:
            if len(line) != 1:
                newname, file, checksum, no1, no2 = line.split()
                fout.write(" ".join([
                    "{}4615_{}_hd.jpg".format(no1, no2),
                    path + file, checksum, '\n'])
                )
            else:
                fout.write('\n')
Input:
VOAUT0000001712_19774.JPG FRYW-1712-19774.jpg eab516afc1aaa10ad23edb5c15ae4ea4 1712 01
VOAUT0000001712_19775.JPG FRYW-1712-19775.jpg 2715ceba8fd5c69b4ca6952e942a1a8a 1712 02
VOAUT0000001712_19776.JPG FRYW-1712-19776.jpg b1a0c4ec6160da3511e23c617517ff6f 1712 03
VOAUT0000001713_19795.JPG FRYW-1713-19795.jpg 56cd173c6e9436b19d39de214669cc53 1713 01
VOAUT0000001713_19796.JPG FRYW-1713-19796.jpg 271aa1b9ef2ac39c502a270c82b31fa3 1713 02
VOAUT0000001713_19797.JPG FRYW-1713-19797.jpg 667732a85660bebec168bc46b884d9b7 1713 03
VOAUT0000001714_19763.JPG FRYW-1714-19763.jpg d37770d6cde5639ce5db4e6a436498a8 1714 01
VOAUT0000001714_19764.JPG FRYW-1714-19764.jpg ce891ca4d4ea59c3a312a468bb0d4408 1714 02
VOAUT0000001714_19765.JPG FRYW-1714-19765.jpg bd7fed521fe3997bf5c879d9d5ce942d 1714 03
Output:
17124615_01_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1712-19774.jpg eab516afc1aaa10ad23edb5c15ae4ea4
17124615_02_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1712-19775.jpg 2715ceba8fd5c69b4ca6952e942a1a8a
17124615_03_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1712-19776.jpg b1a0c4ec6160da3511e23c617517ff6f
17134615_01_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1713-19795.jpg 56cd173c6e9436b19d39de214669cc53
17134615_02_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1713-19796.jpg 271aa1b9ef2ac39c502a270c82b31fa3
17134615_03_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1713-19797.jpg 667732a85660bebec168bc46b884d9b7
17144615_01_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1714-19763.jpg d37770d6cde5639ce5db4e6a436498a8
17144615_02_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1714-19764.jpg ce891ca4d4ea59c3a312a468bb0d4408
17144615_03_hd.jpg /opt/AutoPrivilege/client/photos/FRYW-1714-19765.jpg bd7fed521fe3997bf5c879d9d5ce942d
All the information you need to generate the lines is already provided. Thanks to @SebastianHöffner for pointing out the obvious.
out = open('output.txt','w')
for line in open('data.txt'):
    if len(line) != 1:
        a, b, c, d, e = line.split()
        l = d + '4615_'+ e + '_hd.jpg /opt/AutoPrivilege/client/photos/' + b + ' '+ c
        out.write(l + '\n')
    else:
        out.write('\n')
out.close()
Something like this:
with open('1.txt','r') as inF:
    with open('12.txt', 'w') as outF:
        for line in inF:
            if line not in ('\n','\r\n'):
                t = []
                s = line.split()
                t.append(s[3]+'4615_'+s[4]+'_hd.'+s[0].split('.')[1].lower())
                t.append('/opt/AutoPrivilege/client/photos/' + s[1])
                t.append(s[2]+'\n')
                outF.write(' '.join(t))
            else:
                outF.write(line)

Biopython translate() error

I have a file that looks like this:
Type Variant_class ACC_NUM dbsnp genomic_coordinates_hg18 genomic_coordinates_hg19 HGVS_cdna HGVS_protein gene disease sequence_context_hg18 sequence_context_hg19 codon_change codon_number intron_number site location location_reference_point author journal vol page year pmid entrezid sift_score sift_prediction mutpred_score
1 DM CM920001 rs1800433 null chr12:9232351:- NM_000014.4 NP_000005.2:p.C972Y A2M Chronicobstructivepulmonarydisease null CACAAAATCTTCTCCAGATGCCCTATGGCT[G/A]TGGAGAGCAGAATATGGTCCTCTTTGCTCC TGT TAT 972 null null 2 null Poller HUMGENET 88 313 1992 1370808 2 0 DAMAGING 0.594315245478036
1 DM CM004784 rs74315453 null chr22:43089410:- NM_017436.4 NP_059132.1:p.M183K A4GALT Pksynthasedeficiency(pphenotype) null TGCTCTCCGACGCCTCCAGGATCGCACTCA[T/A]GTGGAAGTTCGGCGGCATCTACCTGGACAC ATG AAG 183 null null 2 null Steffensen JBC 275 16723 2000 10747952 53947 0 DAMAGING 0.787878787878788
I want to translate the information from column 13 and 14 to their corresponding amino acids. Here is the script that I've generated:
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
InFile = open("disease_mut_splitfinal.txt", 'rU')
InFile.readline()
OriginalSeq_list = []
MutSeq_list = []
import csv
with open("disease_mut_splitfinal.txt") as f:
    reader = csv.DictReader(f, delimiter= "\t")
    for row in reader:
        OriginalSeq = row['codon_change']
        MutSeq = row['codon_number']
        region = row["genomic_coordinates_hg19"]
        gene = row["gene"]
        OriginalSeq_list.append(OriginalSeq)
        MutSeq_list.append(MutSeq)
OutputFileName = "Translated.txt"
OutputFile = open(OutputFileName, 'w')
OutputFile.write(''+region+'\t'+gene+'\n')
for i in range(0, len(OriginalSeq_list)):
    OrigSeq = OriginalSeq_list[i]
    MutSEQ = MutSeq_list[i]
    print OrigSeq
    translated_original = OrigSeq.translate()
    translated_mut= MutSEQ.translate()
    OutputFile.write("\n" + OriginalSeq_list[i]+ "\t" + str(translated_original) + "\t" +MutSeq_list[i] + "\t" + str(translated_mut)+ "\n")
However, I keep getting this error:
TypeError: translate expected at least 1 arguments, got 0
I'm kind of at a loss for what I'm doing wrong. Any suggestions?
https://www.dropbox.com/s/cd8chtacj3glb8d/disease_mut_splitfinal.txt?dl=0
(File should still be downloadable even if you don't have a dropbox)
You are calling the string method translate rather than the translate method of a Biopython Seq object, which I assume is what you want. You need to convert the string into a Seq object and then translate that. Try
from Bio import Seq
OrigSeq = Seq.Seq(OriginalSeq_list[i])
translated_original = OrigSeq.translate()
Alternatively
from Bio.Seq import Seq
OrigSeq = Seq(OriginalSeq_list[i])
translated_original = OrigSeq.translate()
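
To see where the TypeError comes from, it may help to look at what the built-in string method does. The sketch below uses only the standard library and Python 3 syntax: a plain string's translate expects a character-mapping table and substitutes characters one by one, which has nothing to do with turning codons into amino acids. The variable names are illustrative.

```python
codon = "TGT"  # a plain str, as read from the file

# Calling the built-in method without a table raises TypeError,
# which is the kind of failure reported in the question.
error = None
try:
    codon.translate()
except TypeError as exc:
    error = str(exc)

# With a table, str.translate only swaps characters, for example
# to build the complementary bases. It never produces amino acids.
table = str.maketrans("ACGT", "TGCA")
complement = codon.translate(table)
```

Wrapping the string in a Seq object, as in the snippets above, is what gives access to Biopython's codon-based translate.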
