I am trying to identify amino acid residues that are in contact in the 3D protein structure. I am new to BioPython but found this helpful website http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/protein_contact_map/
Following their lead (which I will reproduce here for completeness; note, however, that I am using a different protein):
import Bio.PDB
import numpy as np

pdb_code = "1QHW"
pdb_filename = "1qhw.pdb"

def calc_residue_dist(residue_one, residue_two):
    """Returns the C-alpha distance between two residues"""
    diff_vector = residue_one["CA"].coord - residue_two["CA"].coord
    return np.sqrt(np.sum(diff_vector * diff_vector))

def calc_dist_matrix(chain_one, chain_two):
    """Returns a matrix of C-alpha distances between two chains"""
    answer = np.zeros((len(chain_one), len(chain_two)), float)
    for row, residue_one in enumerate(chain_one):
        for col, residue_two in enumerate(chain_two):
            answer[row, col] = calc_residue_dist(residue_one, residue_two)
    return answer

structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
dist_matrix = calc_dist_matrix(model["A"], model["A"])
But when I run the above code, I get the following error message:
Traceback (most recent call last):
File "<ipython-input-26-7239fb7ebe14>", line 4, in <module>
dist_matrix = calc_dist_matrix(model["A"], model["A"])
File "<ipython-input-3-730a11883f27>", line 15, in calc_dist_matrix
answer[row, col] = calc_residue_dist(residue_one, residue_two)
File "<ipython-input-3-730a11883f27>", line 6, in calc_residue_dist
diff_vector = residue_one["CA"].coord - residue_two["CA"].coord
File "/Users/anaconda/lib/python3.6/site-packages/Bio/PDB/Entity.py", line 39, in __getitem__
return self.child_dict[id]
KeyError: 'CA'
Any suggestions on how to fix this issue?
You have heteroatoms (water, ions, etc.; anything that isn't an amino acid or nucleic acid) in your structure; remove them with:

for residue in chain:
    if residue.id[0] != ' ':
        chain.detach_child(residue.id)

This will remove them from your entire structure. You may want to modify this if you want to keep the heteroatoms for further analysis.
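One caveat with that loop: it detaches children from the chain while iterating over it, which can skip residues because the underlying child list shrinks during iteration. A safer variant of the same idea (a minimal sketch, applied to every chain in the model) iterates over a copy of the residue list:

for chain in model:
    # list(chain) takes a snapshot, so detaching does not disturb the iteration
    for residue in list(chain):
        if residue.id[0] != ' ':  # the hetero flag is blank for standard residues
            chain.detach_child(residue.id)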
I believe the problem is that some of the elements in model["A"] are not amino acids and therefore do not contain "CA".
To get around this, I wrote a new function which returns only the amino acid residues:
from Bio.PDB import *
from Bio.PDB.Polypeptide import standard_aa_names  # explicit import of the name list

def aa_residues(chain):
    # Keep only entries whose residue name is one of the 20 standard amino acids
    aa_only = []
    for i in chain:
        if i.get_resname() in standard_aa_names:
            aa_only.append(i)
    return aa_only

AA_1 = aa_residues(model["A"])
dist_matrix = calc_dist_matrix(AA_1, AA_1)
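For what it's worth, Biopython also ships a ready-made check for this; if I recall the API correctly, Bio.PDB.Polypeptide.is_aa does the same filtering, so the function above could be shortened to (a minimal sketch):

from Bio.PDB.Polypeptide import is_aa

def aa_residues(chain):
    # Keep only the residues that Biopython recognizes as amino acids
    return [residue for residue in chain if is_aa(residue)]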
So I've been testing (bear in mind I know very little about Biopython) and it looks like whatever is in your 1qhw.pdb file is quite different from the one in that example.
pdb_code = '1qhw'
structure = Bio.PDB.PDBParser().get_structure(pdb_code, pdb_filename)
model = structure[0]
Next, to see what is in it, I did:
print(list(model))
Which gave me:
[<Chain id=A>]
Exploring this, it appears the parsed structure behaves like a dict of dicts. So, using this id,
test = model['A']
gives me the next level down. This is the level being passed to your function that is causing the error. Printing it with:
print(list(test))
gave me a huge list of the data inside, including lots of residues and related info. But, crucially, some entries have no CA. Try using this to see what's inside, and modify the line:
diff_vector = residue_one["CA"].coord - residue_two["CA"].coord
to reflect what you are after, replacing CA where appropriate.
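For example, a quick way to see which entries lack a C-alpha atom (a minimal sketch, assuming model is parsed as above):

for residue in model['A']:
    # has_id tells you whether the residue contains an atom with that name
    print(residue.get_resname(), residue.has_id('CA'))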
I hope this helps; it's a little tricky to get much more specific.
Another solution to obtain the contact map for a protein chain is to use the PdbParser shipped with ConKit.
ConKit is a library specifically designed to work with predicted contacts but has the functionality to extract contacts from a PDB file:
>>> from conkit.io.PdbIO import PdbParser
>>> p = PdbParser()
>>> with open("1qhw.pdb", "r") as pdb_fhandle:
... pdb = p.read(pdb_fhandle, f_id="1QHW", atom_type="CA")
>>> print(pdb)
ContactFile(id="1QHW_0" nmaps=1)
This reads your PDB file into the pdb variable, which stores an internal ContactFile hierarchy. In this example, two residues are considered to be in contact if the participating CA atoms are within 8Å of each other.
To access the information, you can then iterate through the ContactFile and access each ContactMap, which in your case corresponds to intra-molecular contacts for chain A.
>>> for cmap in pdb:
... print(cmap)
ContactMap(id="A", ncontacts=1601)
If you had more than one chain, there would be a ContactMap for each chain, and additional ones for inter-molecular contacts between chains.
The ContactMap for chain A contains 1601 contact pairs. You can access the Contact instances in each ContactMap by either iterating or indexing. Both work fine.
>>> print(cmap[0])
Contact(id="(26, 27)" res1="S" res1_chain="A" res1_seq=26 res2="T" res2_chain="A" res2_seq=27 raw_score=0.961895)
Each level in the hierarchy has various functions with which you could manipulate contact maps. Examples can be found here.
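For instance, if you just want the residue index pairs, the attributes shown in the repr above can be collected directly (a minimal sketch, assuming cmap is the ContactMap from the loop):

# Collect (res1_seq, res2_seq) index pairs from the ContactMap
pairs = [(contact.res1_seq, contact.res2_seq) for contact in cmap]
print(pairs[:5])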
Related
I am working on a project that uses fuzzy logic on a list of names that could grow to about 100,000 unique records. In a recent screening that we conducted, the functions we use took 2.20 seconds per name on average. This means that a list of 10,000 names could take 6 hours to process, which is really too long.
Is there a way that we can speed up our process? Here's the snippet of the script that we use.
# Importing packages
import pandas as pd
import Levenshtein as lev

# Reading cleaned datasets
df_name_reference = pd.read_csv('path_to_file')
df_name_to_screen = pd.read_csv('path_to_file')

# Function used in name screening
def get_similarity_score(s1, s2):
    '''Return match percentage between 2 strings disregarding name swapping

    Parameters
    ----------
    s1 : str : name from df_name_reference (to be used within pandas apply)
    s2 : str : name from df_name_to_screen (ref_name variable)

    Returns
    -------
    float
    '''
    # Get sorted names
    s1_sort = ' '.join(sorted(s1.split(' '))).strip() if type(s1) == str else ''
    s2_sort = ' '.join(sorted(s2.split(' '))).strip() if type(s2) == str else ''

    # Get ratios and return the max value
    # THIS COULD BE THE BOTTLENECK OF OUR SCRIPT: MORE DETAILS BELOW
    return max([
        lev.ratio(s1, s2),
        lev.ratio(s1_sort, s2),
        lev.ratio(s1, s2_sort),
        lev.ratio(s1_sort, s2_sort)
    ])

# Returning file
screening_results = []
for row in range(df_name_to_screen.shape[0]):
    # Get name to screen
    ref_name = df_name_to_screen.loc[row, 'fullname']
    # Get scores
    scores = df_name_reference.fullname.apply(lev.ratio, args=(ref_name,))
    # Append results
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))
I took four scores from lev.ratio to address variations in the arrangement of names, i.e. firstname-lastname and lastname-firstname formats. I know that the fuzzywuzzy package has token_sort_ratio, but I've noticed that it just splits the name parts and sorts them alphabetically, which leads to lower scores. Plus, fuzzywuzzy is slower than Levenshtein. So I had to manually capture the similarity scores of sorted and unsorted names.
Can anyone give an approach that I could try? Thanks!
EDIT: Here's a sample dataset that you may try. This is in Google Drive.
In case you don't need scores for all entries in the reference data but just the top N, you can use difflib.get_close_matches to remove the others before calculating any scores:

import difflib

screening_results = []
for row in range(df_name_to_screen.shape[0]):
    ref_name = df_name_to_screen.loc[row, 'fullname']
    skimmed = pd.DataFrame({
        'fullname': difflib.get_close_matches(
            ref_name,
            df_name_reference.fullname,
            N_RESULTS,  # how many close matches to keep
            0           # cutoff of 0 so N_RESULTS names always come back
        )
    })
    scores = skimmed.fullname.apply(lev.ratio, args=(ref_name,))
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))
This takes about 50ms per row using the file you provided.
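If you still need the name-swap handling from your get_similarity_score, you can apply that function to the skimmed subset instead of plain lev.ratio, replacing the scores line inside the loop (a sketch reusing your own function with the same apply pattern):

# Score only the pre-filtered candidates, keeping the sorted-name logic
scores = skimmed.fullname.apply(get_similarity_score, args=(ref_name,))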
I'm trying to compare two files containing key-value pairs. Here is an example of such pairs:
ORIGINAL:
z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.304856846498851e-06, 1.304754136591639e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733914e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384797e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.0045412539447373e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664106e-06, 1.3049327764458588e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n
FILE 2:
z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.3048568464988513e-06, 1.3047541365916392e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733912e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384799e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.004541253944737e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664108e-06, 1.304932776445859e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n
When I run both files through difflib.Differ().compare(), I get the following output, which is wrong:
'- z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.304856846498851e-06, 1.304754136591639e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733914e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384797e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.0045412539447373e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664106e-06, 1.3049327764458588e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n', '+ z = [1.304924111430007e-06, 1.3049474394241187e-06, 1.3048568464988513e-06, 1.3047541365916392e-06, 1.3047515476558986e-06, 1.3047540563617633e-06, 1.2584599050733912e-06, 1.0044863475043152e-06, 1.0044888558008254e-06, 1.0044913640973358e-06, 1.0045062486228207e-06, 1.0045211585401347e-06, 1.0045236668616387e-06, 1.0045261751942757e-06, 1.0045286838384799e-06, 1.004531191133402e-06, 1.0045336999215094e-06, 1.0045362224355439e-06, 1.004538746759441e-06, 1.004541253944737e-06, 1.0045437622215247e-06, 1.0045462691234405e-06, 1.0045487673867214e-06, 1.0045512756837494e-06, 1.0045330661500387e-06, 1.0045314983896072e-06, 1.0045340066861176e-06, 1.0508595121822218e-06, 1.3048122384371079e-06, 1.3048147469973539e-06, 1.3048172706092426e-06, 1.3048251638664108e-06, 1.304932776445859e-06, 1.3050280023408756e-06, 1.30500941487857e-06]\n'
If you look closely you'll see that the line from both files is extremely similar with some minor differences. For some reason difflib doesn't recognize the similar characters and just gives the whole line as a difference.
Does anyone have a solution for this problem?
I am creating a sparse matrix file by extracting features from an input file. Each row of the input file contains one film ID, followed by some feature IDs and each feature's score.
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; then, in each pair, the value to the left of the colon is a feature ID and the value to the right is the score of that feature.
Each line represents one film, and the number of feature:score pairs varies from one film to another.
Here is how I construct my sparse matrix:
import sys
import os
import os.path
import time
import json
import numpy as np
import tables as tb  # PyTables, used below as tb
from Film import Film
import scipy
from scipy.sparse import coo_matrix, csr_matrix, rand

def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, m), filters=filters)
    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1
    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'wb'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets, and it takes too long to calculate.
How can I improve and accelerate it? Maybe using MapReduce? What is wrong in this function that makes it so slow?
IO + conversions (from str, to str, even converting the same variable to str twice, etc.) + splits + explicit loops. By the way, there is the csv Python module, which may be used to parse your input file; you can experiment with it (I suppose you use space as the delimiter). Also, I see you convert element[0] to int/str repeatedly, which is bad: you create many temporary objects. If you call this function several times, you may want to try to reuse some internal objects (an array?). You can also try to implement it in another style, with map or a list comprehension, but experiments are needed...
The general idea of Python code optimization is to avoid explicit Python byte-code execution and to prefer native/C Python functions (for anything). And definitely try to cut down on all those conversions. Also, if the input file is yours, you can format it with fixed-length fields; this lets you avoid splitting/parsing entirely (only string indexing).
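As an illustration of that advice, here is a minimal sketch, assuming the input format from the question and that the kept feature IDs can serve directly as column indices (in the original code they are mapped through index_selected_feature): it collects row/column/value triplets in plain lists and builds the sparse matrix in one bulk call instead of many single-element writes.

from scipy.sparse import coo_matrix

rows, cols, vals = [], [], []
film_ids = []
with open('input_file.txt') as input_data:
    for index_film, line in enumerate(input_data):
        tokens = line.split()
        film_ids.append(tokens[0])  # first token is the film ID
        for pair in tokens[1:]:
            feature_id, score = pair.split(':')
            rows.append(index_film)
            cols.append(int(feature_id))
            vals.append(float(score))

# One bulk construction instead of per-element assignments
matrix = coo_matrix((vals, (rows, cols))).tocsr()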
About a year back, I wrote a little program in Python that basically automates a part of my job (with quite a bit of assistance from you guys!). However, I ran into a problem. As I kept making the program better and better, I realized that Python did not want to play nice with Excel, and (without boring you with the details, suffice it to say xlutils will not copy formulas) I NEED to have more access to Excel for my intentions.
So I am starting back at square one with VB (2010 Express, if it helps). The only programming course I ever took in my life was on it, and it was pretty straightforward, so I decided I'd go back to it for this. Unfortunately, I've forgotten much of what I had learned, and we never really got this far down the rabbit hole in the first place. So, long story short, I am trying to:
1) Read data from a .csv structured as so:
41,332.568825,22.221759,-0.489714,eow
42,347.142926,-2.488763,-0.19358,eow
46,414.9969,19.932693,1.306851,r
47,450.626074,21.878299,1.841957,r
48,468.909171,21.362568,1.741944,r
49,506.227269,15.441723,1.40972,r
50,566.199838,17.656284,1.719818,r
51,359.069935,-11.773073,2.443772,l
52,396.321911,-8.711589,1.83507,l
53,423.766684,-4.238343,1.85591,l
2) Sort that data alphabetically by column 5
3) Then selecting only the ones with an "l" in column 5, sort THOSE numerically by column 2 (ascending order) AND copy them to a new file called coil.csv
4) Then selecting only the ones that have an "r" in column 5, sort those numerically by column 2 (descending order) and copy them to the SAME file coil.csv (appended after the others obviously)
After all of that hoopla I wish to get out:
51,359.069935,-11.773073,2.443772,l
52,396.321911,-8.711589,1.83507,l
53,423.766684,-4.238343,1.85591,l
50,566.199838,17.656284,1.719818,r
49,506.227269,15.441723,1.40972,r
48,468.909171,21.362568,1.741944,r
47,450.626074,21.878299,1.841957,r
46,414.9969,19.932693,1.306851,r
I realize that this may be a pretty involved question, and I certainly understand if no one wants to deal with all this bs, lol. Anyway, some full on code, snippets, ideas or even relevant links would be GREATLY appreciated. I've been, and still am googling, but it's harder than expected to find good reliable information pertaining to this.
P.S. Here is the piece of Python code that did what I am talking about (although it created two separate files for the lefts and rights, which I don't really need) - if it helps you at all.
# msgbox, fileopenbox and boolbox come from the easygui package
from easygui import msgbox, fileopenbox, boolbox
import csv

msgbox(msg="Please locate your survey file in the next window.")
mainfile = fileopenbox(title="Open survey file")
toponame = boolbox(msg="What is the name of the shots I should use for topography? Note: TOPO is used automatically", choices=("Left", "Right"))

fieldnames = ["A", "B", "C", "D", "E"]
surveyfile = open(mainfile, "r")
left_file = open("left.csv", 'wb')
right_file = open("right.csv", 'wb')
coil_file = open("coil1.csv", "wb")

reader = csv.DictReader(surveyfile, fieldnames=fieldnames, delimiter=",")
left_writer = csv.DictWriter(left_file, fieldnames + ["F"], delimiter=",")
sortedlefts = sorted(reader, key=lambda x: float(x["B"]))
surveyfile.seek(0, 0)
right_writer = csv.DictWriter(right_file, fieldnames + ["F"], delimiter=",")
sortedrights = sorted(reader, key=lambda x: float(x["B"]), reverse=True)
coil_writer = csv.DictWriter(coil_file, fieldnames, delimiter=",", extrasaction='ignore')

for row in sortedlefts:
    if row["E"] == "l" or row["E"] == "cl+l":
        row['F'] = '%s,%s' % (row['B'], row['D'])
        left_writer.writerow(row)
        coil_writer.writerow(row)

for row in sortedrights:
    if row["E"] == "r":
        row['F'] = '%s,%s' % (row['B'], row['D'])
        right_writer.writerow(row)
        coil_writer.writerow(row)
One option you have is to start with a class to hold the fields. This allows you to override the ToString method to facilitate the output. Then it's a fairly simple matter of reading each line and assigning the values to a list of the class. In your case you'll want the extra step of making two lists, sorting one descending, and combining them:
Class Fields
    Property A As Double = 0
    Property B As Double = 0
    Property C As Double = 0
    Property D As Double = 0
    Property E As String = ""

    Public Overrides Function ToString() As String
        Return Join({A.ToString, B.ToString, C.ToString, D.ToString, E}, ",")
    End Function
End Class

Function SortedFields(filename As String) As List(Of Fields)
    SortedFields = New List(Of Fields)
    Dim test As New List(Of Fields)
    Using sr As New IO.StreamReader(filename)
        Do Until sr.EndOfStream
            Dim fieldarray() As String = sr.ReadLine.Split(","c)
            If fieldarray.Length = 5 AndAlso Not fieldarray(4)(0) = "e"c Then
                If fieldarray(4) = "r" Then
                    test.Add(New Fields With {.A = Double.Parse(fieldarray(0)), .B = Double.Parse(fieldarray(1)), .C = Double.Parse(fieldarray(2)), .D = Double.Parse(fieldarray(3)), .E = fieldarray(4)})
                Else
                    SortedFields.Add(New Fields With {.A = Double.Parse(fieldarray(0)), .B = Double.Parse(fieldarray(1)), .C = Double.Parse(fieldarray(2)), .D = Double.Parse(fieldarray(3)), .E = fieldarray(4)})
                End If
            End If
        Loop
    End Using
    SortedFields = SortedFields.OrderBy(Function(x) x.B).Concat(test.OrderByDescending(Function(x) x.B)).ToList
End Function
One simple way of writing the data to a CSV file is to use the IO.File.WriteAllLines method and the ConvertAll method of the List:
IO.File.WriteAllLines("coil.csv", SortedFields("textfile1.txt").ConvertAll(New Converter(Of Fields, String)(Function(x As Fields) x.ToString)))
You'll notice how the ToString method facilitates this quite easily.
If the class will only be used for this you do have the option to make all the fields string.
I am a beginner in Python (also in programming). I have a large file containing a repeating pattern: 3 lines with numbers, then 1 empty line, and again...
if I print the file it looks like:
1.93202838
1.81608154
1.50676177
2.35787777
1.51866227
1.19643624
...
I want to take each set of three numbers as one vector, do some math operations with them, write them back to a new file, and move to the next three lines, i.e. another vector. So here is my code (it doesn't work):
import math

inF = open("data.txt", "r+")
outF = open("blabla.txt", "w")

a = []
fin = []
b = []

for line in inF:
    a.append(line)
    if line.startswith(" \n"):
        fin.append(b)
        h1 = float(fin[0])
        k2 = float(fin[1])
        l3 = float(fin[2])
        h = h1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        k = k1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        l = l1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        vector = [str(h), str(k), str(l)]
        outF.write('\n'.join(vector)
        b = a
        a = []

inF.close()
outF.close()
print "done!"
I want to get a "vector" from every 3 lines in my file and put it into the blabla.txt output file. Thanks a lot!
My 'code comment' answer:
Take care to close all parentheses, in order to match the opened ones! (This is very likely to raise a SyntaxError ;-) )
fin is created as an empty list and is never filled. Trying to access any value with fin[n] is therefore very likely to break with an IndexError;
k2 and l3 are created but never used;
k1 and l1 are used but never created, which is very likely to break with a NameError;
b is created as a copy of a, so it is a list. But then you do fin.append(b): what do you expect in this case from appending (not extending) a list?
Hope this helps!
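Since the full solution is deliberately left to you, here is just the grouping idiom (a minimal sketch, not a complete answer): collect numbers until a blank line appears, then process the completed block and reset.

vector = []
with open("data.txt") as inF:
    for line in inF:
        stripped = line.strip()
        if stripped:
            # still inside a block of numbers
            vector.append(float(stripped))
        else:
            if len(vector) == 3:
                h1, k1, l1 = vector  # do your math with one complete vector here
            vector = []
# note: if the file does not end with a blank line, the last block
# still sits in vector and needs the same treatment after the loop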
This is only in the answers section for length and formatting.
Input and output.
Control flow
I know nothing of vectors; you might want to look into the math module or NumPy.
Those links should hopefully give you all the information you need to at least get started with this problem. As yuvi said, the code won't be written for you, but you can come back when you have something that isn't working as you expected or that you don't fully understand.