Get all bond lengths from PDB - python

I want to get all bond lengths from a PDB file.
I tried Bio.PDB, but I don't understand the NeighborSearch class and its methods search() and search_all():
from Bio.PDB import *
import numpy as np

structure = PDBParser().get_structure('Kek', '1wba.pdb')
atom_list = [_ for _ in structure.get_atoms()]
kek = NeighborSearch(atom_list).search_all(2)
for atom_pair in kek:
    a = atom_pair[0]
    b = atom_pair[1]
    distance = np.linalg.norm(np.array(a.coord) - np.array(b.coord))
    print(distance)
How can I solve my task? If there's another framework that does this, I'll gladly try any approach that works!

As I understand it, you are looking for ways to calculate the distance between atoms in a PDB file. I adapted your code and this Biostars solution; hope it helps a bit.
import Bio.PDB

parser = Bio.PDB.PDBParser(QUIET=True)
structures = parser.get_structure('2rdx', '2rdx.pdb')
structure = structures[0]
atom_list = [_ for _ in structure.get_atoms()]
ns = Bio.PDB.NeighborSearch(atom_list)
_cutoff_dist = 5
for target in atom_list:
    close_atoms = ns.search(target.coord, _cutoff_dist)
    for close_atom in close_atoms:
        print(target, close_atom, target - close_atom)
    print("==========")
You can easily find the distance between two Atom objects by using the - operator.
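If what you want really is bond lengths rather than all neighbor distances, your original search_all() approach is arguably the better fit: it returns each pair exactly once, and the - operator gives the distance directly. A minimal sketch, assuming a rough 2.0 Å cutoff as a stand-in for covalent bonds (actual bond lengths vary by element):

import Bio.PDB

parser = Bio.PDB.PDBParser(QUIET=True)
structure = parser.get_structure('Kek', '1wba.pdb')
atoms = list(structure.get_atoms())

# search_all(radius) returns each atom pair within the radius exactly once
pairs = Bio.PDB.NeighborSearch(atoms).search_all(2.0)
for a, b in pairs:
    # '-' on two Atom objects gives their distance in Angstroms
    print(a.get_full_id(), b.get_full_id(), a - b)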

List of Lists of Coordinates

I am new to Python, and am struggling with a task that I assume is an extremely simple one for an experienced programmer.
I am trying to create a list of lists of coordinates for different lines. For instance:
list = [ [(x,y), (x,y), (x,y)], [Line 2 Coordinates], ....]
I have the following code:
masterlist_x = list(range(-5,6))
oneline = []
data = []
numberoflines = list(range(2))
i = 1
for i in numberoflines:
    slope = randint(-5,5)
    y_int = randint(-10,10)
    for element in masterlist_x:
        oneline.append((element,slope * element + y_int))
    data.append(oneline)
The variable that should hold the coordinates of a single line (oneline) ends up holding the points of both lines.
I know this is an issue with the outer looping mechanism, but I am not sure how to proceed.
Any and all help is much appreciated. Thank you very much!
@khuynh is right: you simply had the oneline = [] in the wrong place, so you put all the coordinates into one list.
Also, you have a couple of unnecessary things in your code:
you don't need to wrap the range() in list(); you can iterate over it directly with for
you don't need to declare i before the for loop; the loop does that itself
that i is not actually used, which is fine. The Python convention for unused variables is _
Fixed version:
from random import randint

masterlist_x = range(-5,6)
data = []
numberoflines = range(2)
for _ in numberoflines:
    oneline = []
    slope = randint(-5,5)
    y_int = randint(-10,10)
    for element in masterlist_x:
        oneline.append((element,slope * element + y_int))
    data.append(oneline)
print(data)
You can also run it online: https://repl.it/repls/GreedyRuralProduct
I suspect the whole thing could also be written with much less code, and in a simpler fashion, as a list comprehension ..
UPDATE: the inner loop is indeed very well suited to a list comprehension. Maybe the outer loop could be made into one as well, so that the whole thing becomes two nested list comprehensions, but I only got confused when I tried that. This version, though, is clear:
from random import randint

masterlist_x = range(-5,6)
data = []
numberoflines = range(2)
for _ in numberoflines:
    slope = randint(-5,5)
    y_int = randint(-10,10)
    oneline = [(element, slope * element + y_int)
               for element in masterlist_x]
    data.append(oneline)
print(data)
Again on repl.it too: https://repl.it/repls/SoupyIllustriousApplicationsoftware
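For what it's worth, the outer loop can be folded in as well by drawing the random (slope, y_int) pairs in a generator expression that the outer comprehension iterates over. A sketch with the same behavior as above, just denser:

from random import randint

masterlist_x = range(-5, 6)
data = [
    [(element, slope * element + y_int) for element in masterlist_x]
    for slope, y_int in ((randint(-5, 5), randint(-10, 10)) for _ in range(2))
]
print(data)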

How to implement a function with Python (SymPy) realizing the same as '_' and replacement rules in Wolfram Mathematica?

In Wolfram Mathematica, I can define named patterns where _ (called Blank) matches any expression and then use the match in a replacement rule.
An example:
testexpr = p1[MM]*p2[NN] + p1[XX]*p2[MM] + p1[XX]^2;
FunTest[expr_] := Expand[expr] /. {(p1[l1_]*p2[l2_]) -> FF1[l1]*FF2[l2],
                                   p1[l1_]^n_ -> 0, p2[l1_]^n_ -> 0}
FunTest[testexpr]
The result is FF1[XX] FF2[MM] + FF1[MM] FF2[NN]
However, I don't know how to use sympy in Python to do the same thing.
import sympy as sp

p1 = sp.IndexedBase("p1")
p2 = sp.IndexedBase("p2")
FF1 = sp.IndexedBase("FF1")
FF2 = sp.IndexedBase("FF2")
MM, NN, XX = sp.symbols('MM NN XX')
SSlist = [MM, NN, XX]
testexpr = p1[MM]*p2[NN] + p1[XX]*p2[MM] + p1[XX]**2

def FunTest(expr):
    expr = expr.subs([(p1[SS]*p2[SS2], FF1[SS]*FF2[SS2]) for SS in SSlist
                      for SS2 in SSlist]
                     + [(p1[SS]**2, 0) for SS in SSlist]
                     + [(p2[SS]**2, 0) for SS in SSlist],
                     simultaneous=True)
    return expr

rest = FunTest(testexpr)
print(rest)
So the result is also FF1[MM]*FF2[NN] + FF1[XX]*FF2[MM].
But I want to know if there is an easy way to make it more general, as in Wolfram Mathematica. If SSlist is a large list and there are many different variables, it will be difficult to implement with my solution.
I wonder whether there is an easy way to avoid writing a loop over the whole list (for SS in SSlist), as in Mathematica. Can someone familiar with sympy give me any hints?
Thanks a lot!
I have found one solution to my own question, and it works out as I want: instead of using subs(), I use Wild symbols and replace().
import sympy as sp

p1 = sp.IndexedBase("p1")
p2 = sp.IndexedBase("p2")
FF1 = sp.IndexedBase("FF1")
FF2 = sp.IndexedBase("FF2")
MM, NN, XX = sp.symbols('MM NN XX')
SSlist = [MM, NN, XX]
SS = sp.Wild('SS')
SS1 = sp.Wild('SS1')
testexpr = p1[MM]*p2[NN] + p1[XX]*p2[MM] + p1[XX]**2
replacements = {p1[SS]*p2[SS1]: FF1[SS]*FF2[SS1], p1[SS]**2: 0, p2[SS]**2: 0}

def replaceall(expr, repls):
    for i, j in repls.items():
        expr = expr.replace(i, j, map=False, simultaneous=True, exact=False)
    return expr

rest = replaceall(testexpr, replacements)
print(rest)
The result is exactly the same as before:
FF1[MM]*FF2[NN] + FF1[XX]*FF2[MM]
One thing I would like to know is whether this is efficient when there are many symbols, given the for loop over the rules. The two methods seem similar; the one I found recently just looks more concise.
I would also like to know whether there is a more general way to do such things, as Wolfram Mathematica does.
Any comments are welcome. Thanks!
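One small step closer to the Mathematica version: the hard-coded squares in the rules can themselves be generalized with a Wild exponent, mirroring p1[l1_]^n_ -> 0. A sketch (the exclude=[1] is an assumption to keep the exponent pattern from matching bare first-power factors):

import sympy as sp

p1, p2 = sp.IndexedBase("p1"), sp.IndexedBase("p2")
FF1, FF2 = sp.IndexedBase("FF1"), sp.IndexedBase("FF2")
MM, NN, XX = sp.symbols('MM NN XX')
SS, SS1 = sp.Wild('SS'), sp.Wild('SS1')
n = sp.Wild('n', exclude=[1])  # any exponent except 1, like Mathematica's n_

testexpr = p1[MM]*p2[NN] + p1[XX]*p2[MM] + p1[XX]**2
rules = {p1[SS]*p2[SS1]: FF1[SS]*FF2[SS1],
         p1[SS]**n: 0,
         p2[SS]**n: 0}

expr = testexpr
for pat, val in rules.items():
    expr = expr.replace(pat, val, simultaneous=True)
print(expr)  # FF1[MM]*FF2[NN] + FF1[XX]*FF2[MM]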

Can I write python code that modifies itself during execution?

I mean, take this code as an example:
target_ZCR_mean = sample_dataframe_summary['ZCR'][1]
target_ZCR_std = sample_dataframe_summary['ZCR'][2]
lower_ZCR_lim = target_ZCR_mean - target_ZCR_std
upper_ZCR_lim = target_ZCR_mean + target_ZCR_std
target_RMS_mean = sample_dataframe_summary['RMS'][1]
target_RMS_std = sample_dataframe_summary['RMS'][2]
lower_RMS_lim = target_RMS_mean - target_RMS_std
upper_RMS_lim = target_RMS_mean + target_RMS_std
target_TEMPO_mean = sample_dataframe_summary['Tempo'][1]
target_TEMPO_std = sample_dataframe_summary['Tempo'][2]
lower_TEMPO_lim = target_TEMPO_mean - target_TEMPO_std
upper_TEMPO_lim = target_TEMPO_mean + target_TEMPO_std
target_BEAT_SPACING_mean = sample_dataframe_summary['Beat Spacing'][1]
target_BEAT_SPACING_std = sample_dataframe_summary['Beat Spacing'][2]
lower_BEAT_SPACING_lim = target_BEAT_SPACING_mean - target_BEAT_SPACING_std
upper_BEAT_SPACING_lim = target_BEAT_SPACING_mean + target_BEAT_SPACING_std
Each block of four lines is very similar to the others except for a few characters.
Can I write a function, a class, or some other piece of code such that I can wrap just a template of those four lines into it and have it modify itself at runtime to do the work of the code above...?
By the way, I use python 3.6.
If you find yourself storing lots of variables like this, especially when they are related, there is almost certainly a better way of doing it. Modifying the source dynamically is never the solution. One way is to use a function to hold the repeated logic, and a namedtuple to store the resulting data:
import collections

Data = collections.namedtuple('Data', 'mean, std, lower_lim, upper_lim')

def get_data(key, sample_dataframe_summary):
    mean = sample_dataframe_summary[key][1]
    std = sample_dataframe_summary[key][2]
    lower_lim = mean - std
    upper_lim = mean + std
    return Data(mean, std, lower_lim, upper_lim)

zcr = get_data('ZCR', sample_dataframe_summary)
rms = get_data('RMS', sample_dataframe_summary)
tempo = get_data('Tempo', sample_dataframe_summary)
beat_spacing = get_data('Beat Spacing', sample_dataframe_summary)
Then you can access the data with dot notation, like zcr.mean and tempo.upper_lim.
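If the set of keys grows, you don't even need one variable per key; a dict comprehension keeps everything in one place (a small sketch, assuming the same sample_dataframe_summary and get_data as above):

keys = ['ZCR', 'RMS', 'Tempo', 'Beat Spacing']
stats = {key: get_data(key, sample_dataframe_summary) for key in keys}

print(stats['Tempo'].upper_lim)  # access works the same way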

How do I vectorize the following loop in Numpy?

"""Some simulations to predict the future portfolio value based on past distribution. x is
a numpy array that contains past returns.The interpolated_returns are the returns
generated from the cdf of the past returns to simulate future returns. The portfolio
starts with a value of 100. portfolio_value is filled up progressively as
the program goes through every loop. The value is multiplied by the returns in that
period and a dollar is removed."""
portfolio_final = []
for i in range(10000):
    portfolio_value = [100]
    rand_values = np.random.rand(600)
    interpolated_returns = np.interp(rand_values,cdf_values,x)
    interpolated_returns = np.add(interpolated_returns,1)
    for j in range(1,len(interpolated_returns)+1):
        portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
        portfolio_value[j] = portfolio_value[j]-1
    portfolio_final.append(portfolio_value[-1])
print (np.mean(portfolio_final))
I couldn't find a way to write this code using numpy. I had a look at iteration using nditer but I was unable to move ahead with that.
I guess the easiest way to figure out how you can vectorize your stuff would be to look at the equations that govern your evolution and see how your portfolio actually iterates, finding patterns that can be vectorized instead of trying to vectorize the code you already have. You would then notice that the cumprod actually appears quite often in your iterations.
Nevertheless, you can find the semi-vectorized code below. I included your code as well so that you can compare the results. I also included a simple loop version of your code which is much easier to read and to translate into mathematical equations, so if you share this code with somebody else I would definitely use the simple loop option. If you want some fancy-pants vectorizing you can use the vector version. In case you need to keep track of your single steps you can also add an array to the simple loop option and append the pv at every step.
Hope that helps.
Edit: I have not tested anything for speed. That's something you can easily do yourself with timeit.
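For reference, here is the recurrence the vector version exploits, in the notation of the simple loop below. Each step does

pv[j] = pv[j-1] * r[j] - 1,   with pv[0] = 100

Dividing both sides by the cumulative product R[j] = r[1]*r[2]*...*r[j] gives pv[j]/R[j] = pv[j-1]/R[j-1] - 1/R[j], which telescopes to

pv[n] = R[n] * (pv[0] - (1/R[1] + 1/R[2] + ... + 1/R[n]))

and that is exactly the rcp * (portfolio_values - np.cumsum(1.0/rcp)) line in the vector version.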
import numpy as np
from scipy.special import erf

# Prepare simple return model - normally distributed with mu & sigma = 0.01
x = np.linspace(-10,10,100)
cdf_values = 0.5*(1+erf((x-0.01)/(0.01*np.sqrt(2))))

# Prepare setup such that every code snippet uses the same number of steps
# and the same random numbers
nSteps = 600
nIterations = 1
rnd = np.random.rand(nSteps)

# Your code - gives the (supposedly) correct results
portfolio_final = []
for i in range(nIterations):
    portfolio_value = [100]
    rand_values = rnd
    interpolated_returns = np.interp(rand_values,cdf_values,x)
    interpolated_returns = np.add(interpolated_returns,1)
    for j in range(1,len(interpolated_returns)+1):
        portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
        portfolio_value[j] = portfolio_value[j]-1
    portfolio_final.append(portfolio_value[-1])
print (np.mean(portfolio_final))

# Using vectors
portfolio_final = []
for i in range(nIterations):
    portfolio_values = np.ones(nSteps)*100.0
    rcp = np.cumprod(np.interp(rnd,cdf_values,x) + 1)
    portfolio_values = rcp * (portfolio_values - np.cumsum(1.0/rcp))
    portfolio_final.append(portfolio_values[-1])
print (np.mean(portfolio_final))

# Simple loop
portfolio_final = []
for i in range(nIterations):
    pv = 100
    rets = np.interp(rnd,cdf_values,x) + 1
    for i in range(nSteps):
        pv = pv * rets[i] - 1
    portfolio_final.append(pv)
print (np.mean(portfolio_final))
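The same formula also vectorizes across all iterations at once by working on a 2-D array with one row per iteration; a sketch (it draws fresh random numbers per row instead of reusing the shared rnd above, and assumes x and cdf_values as defined earlier):

# Fully vectorized across iterations
rnd2 = np.random.rand(10000, nSteps)
rets2 = np.interp(rnd2, cdf_values, x) + 1.0
rcp2 = np.cumprod(rets2, axis=1)                          # R[j] per row
finals = rcp2[:, -1] * (100.0 - (1.0/rcp2).sum(axis=1))   # final pv per row
print(finals.mean())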
Forget about np.nditer. It does not improve the speed of iteration. Only use it if you intend to go on and use the C version (via cython).
I'm puzzled by that inner loop. What is it supposed to be doing that's special? Why the loop?
In tests with simulated values these 2 blocks of code produce the same thing:
interpolated_returns = np.add(interpolated_returns,1)
for j in range(1,len(interpolated_returns)+1):
    portfolio_value.append(interpolated_returns[j-1]*portfolio[j-1])
    portfolio_value[j] = portfolio_value[j]-1

interpolated_returns = (interpolated_returns+1)*portfolio - 1
portfolio_value = portfolio_value + interpolated_returns.tolist()
I am assuming that interpolated_returns and portfolio are 1d arrays of the same length.
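A quick check of that equivalence under the stated assumption (portfolio being a fixed 1d array that the loop never writes back into):

import numpy as np

rng = np.random.default_rng(0)
interpolated_returns = rng.normal(0.0, 0.01, 10)
portfolio = rng.uniform(90.0, 110.0, 10)  # fixed 1d array, per the assumption

# loop version
pv_loop = [100.0]
r = interpolated_returns + 1
for j in range(1, len(r) + 1):
    pv_loop.append(r[j-1] * portfolio[j-1])
    pv_loop[j] = pv_loop[j] - 1

# vectorized version
pv_vec = [100.0] + ((interpolated_returns + 1) * portfolio - 1).tolist()

print(np.allclose(pv_loop, pv_vec))  # True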

Optimizing Algorithm of large dataset calculations

Once again I find myself stumped with pandas, and how best to perform a 'vector operation'. My code works, but it will take a long time to iterate through everything.
What the code is trying to do is loop through shapes.csv and determine which shape_pt_sequence is a stop_id, then assign the stop_lat and stop_lon to shape_pt_lat and shape_pt_lon, while also marking the shape_pt_sequence as is_stop.
GISTS
stop_times.csv LINK
trips.csv LINK
shapes.csv LINK
Here is my code:
import pandas as pd
from haversine import *

'''
iterate through shapes and match stops along a shape_pt_sequence within
x amount of distance. for shape_pt_sequence that is closest, replace the stop
lat/lon to the shape_pt_lat/shape_pt_lon, and mark is_stop column with 1.
'''

# readability assignments for shapes.csv
shapes = pd.read_csv('csv/shapes.csv')
shapes_index = list(set(shapes['shape_id']))
shapes_index.sort(key=int)
shapes.set_index(['shape_id', 'shape_pt_sequence'], inplace=True)

# readability assignments for trips.csv
trips = pd.read_csv('csv/trips.csv')
trips_index = list(set(trips['trip_id']))
trips.set_index(['trip_id'], inplace=True)

# readability assignments for stop_times.csv
stop_times = pd.read_csv('csv/stop_times.csv')
stop_times.set_index(['trip_id','stop_sequence'], inplace=True)
print(len(stop_times.loc[1423492]))

# readability assignments for stops.csv
stops = pd.read_csv('csv/stops.csv')
stops.set_index(['stop_id'], inplace=True)

# for each trip_id
for i in trips_index:
    print('******NEW TRIP_ID******')
    print(i)
    i = i.astype(int)
    # for each stop_sequence in stop_times
    for x in range(len(stop_times.loc[i])):
        stop_lat = stop_times.loc[i,['stop_lat','stop_lon']].iloc[x,[0,1]][0]
        stop_lon = stop_times.loc[i,['stop_lat','stop_lon']].iloc[x,[0,1]][1]
        stop_coordinate = (stop_lat, stop_lon)
        print(stop_coordinate)
        # shape_id that matches trip_id
        print('**SHAPE_ID**')
        trips_shape_id = trips.loc[i,['shape_id']].iloc[0]
        trips_shape_id = int(trips_shape_id)
        print(trips_shape_id)
        smallest = 0
        for y in range(len(shapes.loc[trips_shape_id])):
            shape_lat = shapes.loc[trips_shape_id].iloc[y,[0,1]][0]
            shape_lon = shapes.loc[trips_shape_id].iloc[y,[0,1]][1]
            shape_coordinate = (shape_lat, shape_lon)
            haversined = haversine_mi(stop_coordinate, shape_coordinate)
            if smallest == 0 or haversined < smallest:
                smallest = haversined
                smallest_shape_pt_indexer = y
            else:
                pass
            print(haversined)
            print('{0:.20f}'.format(smallest))
        print('{0:.20f}'.format(smallest))
        print(smallest_shape_pt_indexer)
        # mark is_stop as 1
        shapes.iloc[smallest_shape_pt_indexer,[2]] = 1
        # replace coordinate value
        shapes.loc[trips_shape_id].iloc[y,[0,1]][0] = stop_lat
        shapes.loc[trips_shape_id].iloc[y,[0,1]][1] = stop_lon
shapes.to_csv('csv/shapes.csv', index=False)
What you could do to optimize this code is use some threads/workers instead of those for loops.
I recommend using a pool of workers, as it is very simple to use.
Instead of:

for i in trips_index:

you could use something like:

from multiprocessing import Pool

pool = Pool(processes=4)
results = pool.map(func, trips_index)
And then the function func would look like:

def func(i):
    # code here

You can simply put the whole body of the for loop inside this function. It would make the work run in 4 subprocesses in this example, giving it a nice improvement.
One thing to consider is that a collection of trips will often have the same sequence of stops and the same shape data (the only difference between trips being the timing). So it might make sense to cache the find-closest-point-on-shape operation for each (stop_id, shape_id) pair. I bet that would reduce your runtime by an order of magnitude.
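A minimal sketch of that caching idea, assuming the stops and shapes dataframes from the question (indexed by stop_id and by (shape_id, shape_pt_sequence) respectively), GTFS-style shape_pt_lat/shape_pt_lon columns, and the haversine_mi helper; closest_shape_pt is a hypothetical name:

from functools import lru_cache

@lru_cache(maxsize=None)
def closest_shape_pt(stop_id, shape_id):
    # returns the positional index of the shape point nearest this stop;
    # repeated (stop_id, shape_id) pairs cost a dict lookup, not a full scan
    stop_lat, stop_lon = stops.loc[stop_id, ['stop_lat', 'stop_lon']]
    pts = shapes.loc[shape_id]
    distances = [haversine_mi((stop_lat, stop_lon), (lat, lon))
                 for lat, lon in zip(pts['shape_pt_lat'], pts['shape_pt_lon'])]
    return min(range(len(distances)), key=distances.__getitem__)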
