Data analysis - MD analysis Python

Data analysis - MD analysis Python - python

I am seeing this error, need help on this!
warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type))
Traceback (most recent call last):
File "traj_residue-z.py", line 48, in
protein_z=protein.centroid()[2]
IndexError: index 2 is out of bounds for axis 0 with size 0

The problem was solved through a discussion in the mailing list thread https://groups.google.com/g/mdnalysis-discussion/c/J8oJ0M9Rjb4/m/kSD2jURODQAJ
In brief: The traj_residue-z.py script contained the line
protein=u.select_atoms('resid 1-%d' % (nprotein_res))
It turned out that the selection 'resid 1-%d' % (nprotein_res) would not select anything because the input GRO file started with resid 1327
1327LEU N 1 2.013 3.349 8.848 0.4933 -0.2510 0.2982
1327LEU H1 2 1.953 3.277 8.893 0.0174 0.1791 0.3637
1327LEU H2 3 1.960 3.377 8.762 0.6275 -0.5669 0.1094
...
and hence the selection of resids starting at 1 did not match anything. This produced an empty AtomGroup protein.
The subsequent centroid calculation
protein_z=protein.centroid()[2]
failed because for an empty AtomGroup, protein.centroid() returns an empty array and so trying to get the element at index 2 raises IndexError.
The solution (thanks to #IAlibay) was to
either change the selection string 'resid 1-%d' to accommodate start and stop resids, or
to just select the first nprotein_res residues protein = u.residues[:nprotein_res].atoms by slicing the ResiduesGroup.

Related

"There are no fields in dtype int64." why am i getting it?

b= np.array([[1,2,3,4,5],[2,3,4,5,6]])
b[1,1]
output:----------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-15-320f1bda41d3> in <module>()
9 """
10 # let's say we want to access the digit 5 in 2nd row.
---> 11 b[1,1]
12 # here the the 1st one is representing the row no. 1 but you may ask the question if the 5 is in the 2nd row then why did we passed the argument saying the row that wwe want to access is 1.
13 # well the answer is pretty simple:- the thing is here we are providing the index number that is assigned by the python it has nothing to do with the normal sequencing that starts from 1 rather we use python sequencing that starts from 0,1,2,3....
KeyError: 'There are no fields in dtype int64.'

trend following using meta label issue with time ubtracting `n`, use `n * obj.freq`

I am trying to implement the trend following labeling from asset managment book. I found the following code that I wanted to implement, however I am getting an error
TypeError: Addition/subtraction of integers and integer-arrays with Timestamp is no longer supported. Instead of adding/subtracting n, use n * obj.freq
!pip install yfinance
!pip install mplfinance
import yfinance as yf
import mplfinance as mpf
import numpy as np
import pandas as pd
# get the data from yfiance
df=yf.download('BTC-USD',start='2008-01-04',end='2021-06-3',interval='1d')
#code snippet 5.1
# Fit linear regression on close
# Return the t-statistic for a given parameter estimate.
def tValLinR(close):
#tValue from a linear trend
x = np.ones((close.shape[0],2))
x[:,1] = np.arange(close.shape[0])
ols = sm1.OLS(close, x).fit()
return ols.tvalues[1]
#code snippet 5.2
'''
#search for the maximum absolutet-value. To identify the trend
# - molecule - index of observations we wish to labels.
# - close - which is the time series of x_t
# - span - is the set of values of L (look forward period) that the algorithm will #try (window_size)
# The L that maximizes |tHat_B_1| (t-value) is choosen - which is the look-forward #period
# with the most significant trend. (optimization)
'''
def getBinsFromTrend(molecule, close, span):
#Derive labels from the sign of t-value of trend line
#output includes:
# - t1: End time for the identified trend
# - tVal: t-value associated with the estimated trend coefficient
#- bin: Sign of the trend (1,0,-1)
#The t-statistics for each tick has a different look-back window.
#- idx start time in look-forward window
#- dt1 stop time in look-forward window
#- df1 is the look-forward window (window-size)
#- iloc ?
out = pd.DataFrame(index=molecule, columns=['t1', 'tVal', 'bin', 'windowSize'])
hrzns = range(*span)
windowSize = span[1] - span[0]
maxWindow = span[1]-1
minWindow = span[0]
for idx in close.index:
idx += maxWindow
if idx >= len(close):
break
df_tval = pd.Series(dtype='float64')
iloc0 = close.index.get_loc(idx)
if iloc0+max(hrzns) > close.shape[0]:
continue
for hrzn in hrzns:
dt1 = close.index[iloc0-hrzn+1]
df1 = close.loc[dt1:idx]
df_tval.loc[dt1] = tValLinR(df1.values) #calculates t-statistics on period
dt1 = df_tval.replace([-np.inf, np.inf, np.nan], 0).abs().idxmax() #get largest t-statistics calculated over span period
print(df_tval.index[-1])
print(dt1)
print(abs(df_tval.values).argmax() + minWindow)
out.loc[idx, ['t1', 'tVal', 'bin', 'windowSize']] = df_tval.index[-1], df_tval[dt1], np.sign(df_tval[dt1]), abs(df_tval.values).argmax() + minWindow #prevent leakage
out['t1'] = pd.to_datetime(out['t1'])
out['bin'] = pd.to_numeric(out['bin'], downcast='signed')
#deal with massive t-Value outliers - they dont provide more confidence and they ruin the scatter plot
tValueVariance = out['tVal'].values.var()
tMax = 20
if tValueVariance < tMax:
tMax = tValueVariance
out.loc[out['tVal'] > tMax, 'tVal'] = tMax #cutoff tValues > 20
out.loc[out['tVal'] < (-1)*tMax, 'tVal'] = (-1)*tMax #cutoff tValues < -20
return out.dropna(subset=['bin'])
if __name__ == '__main__':
#snippet 5.3
idx_range_from = 3
idx_range_to = 10
df1 = getBinsFromTrend(df.index, df['Close'], [idx_range_from,idx_range_to,1]) #[3,10,1] = range(3,10) This is the issue
tValues = df1['tVal'].values #tVal
doNormalize = False
#normalise t-values to -1, 1
if doNormalize:
np.min(tValues)
minusArgs = [i for i in range(0, len(tValues)) if tValues[i] < 0]
tValues[minusArgs] = tValues[minusArgs] / (np.min(tValues)*(-1.0))
plus_one = [i for i in range(0, len(tValues)) if tValues[i] > 0]
tValues[plus_one] = tValues[plus_one] / np.max(tValues)
#+(idx_range_to-idx_range_from+1)
plt.scatter(df1.index, df0.loc[df1.index].values, c=tValues, cmap='viridis') #df1['tVal'].values, cmap='viridis')
plt.plot(df0.index, df0.values, color='gray')
plt.colorbar()
plt.show()
plt.savefig('fig5.2.png')
plt.clf()
plt.df['Close']()
plt.scatter(df1.index, df0.loc[df1.index].values, c=df1['bin'].values, cmap='vipridis')
#Test methods
ols_tvalue = tValLinR( np.array([3.0, 3.5, 4.0]) )

There are at least several issues with the code that you "found" (not just the one issue that you posted).
Before I go through some of the issues, let me say the following, which I may very well be wrong about, but based on my experience and on the way your question is worded (and the fact that you are "wanted to implement" code that you "found") it seems to me that you have minimal experience with coding and debugging.
Stackoverflow is not a place to ask others to debug your code. That said, I will try to walk you though some of the steps that I took in trying to figure out what's going on with this code, and then perhaps point you to some resources where you can learn the same skills.
Step 1:
I took the code as you posted it, and copy/pasted it into a file that I named so68871906.py; I then commented out the two lines at the top that are installing yfinance and mplfinance because I don't want to try to install them every time I run the code; rather I will install them once before running the code.
I then ran the code with the following result (similar to what you posted) ...
dino#DINO:~/code/mplfinance/examples/scratch_pad/issues$ python so68871906.py
[*********************100%***********************] 1 of 1 completed
Traceback (most recent call last):
File "so68871906.py", line 84, in <module>
df1 = getBinsFromTrend(df.index, df['Close'], [idx_range_from,idx_range_to,1]) #[3,10,1] = range(3,10) This is the issue
File "so68871906.py", line 50, in getBinsFromTrend
idx += maxWindow
File "pandas/_libs/tslibs/timestamps.pyx", line 310, in pandas._libs.tslibs.timestamps._Timestamp.__add__
TypeError: Addition/subtraction of integers and integer-arrays with Timestamp is no longer supported. Instead of adding/subtracting `n`, use `n * obj.freq`
The key to successful debugging, especially in python, is to recognize that the Traceback gives you a lot of very important information. You just have to read through it very carefully. In the above case, the Traceback tells me:
The problem is with this line of code: idx += maxWindow. This line of code is adding idx + maxWindow and reassigning the result back to idx
The error is a TypeError which tells me there is a problem with the types of the variables. Since there are two variables on that line of code (idx and maxWindow) one may guess that one or both of those variables is the wrong type or otherwise incompatible with what the code is trying to do with the variable.
Based on the error message "Addition/subtraction of integers and integer-arrays with Timestamp is no longer supported", and the fact that we are doing addition of idx and maxWindow, you can guess that one of the variables is of type integer or integer-array, while the other is of type Timestamp.
You can verify the types by adding print statements just before the error occurs. The code looks like this:
maxWindow = span[1]-1
minWindow = span[0]
for idx in close.index:
print('type(idx)=',type(idx))
print('type(maxWindow)=',type(maxWindow))
idx += maxWindow
Now the output looks like this:
dino#DINO:~/code/mplfinance/examples/scratch_pad/issues$ python so68871906.py
[*********************100%***********************] 1 of 1 completed
type(idx)= <class 'pandas._libs.tslibs.timestamps.Timestamp'>
type(maxWindow)= <class 'int'>
Traceback (most recent call last):
File "so68871906.py", line 86, in <module>
df1 = getBinsFromTrend(df.index, df['Close'], [idx_range_from,idx_range_to,1]) #[3,10,1] = range(3,10) This is the issue
File "so68871906.py", line 52, in getBinsFromTrend
idx += maxWindow
File "pandas/_libs/tslibs/timestamps.pyx", line 310, in pandas._libs.tslibs.timestamps._Timestamp.__add__
TypeError: Addition/subtraction of integers and integer-arrays with Timestamp is no longer supported. Instead of adding/subtracting `n`, use `n * obj.freq`
Notice that indeed type(maxWindow) is int and type(idx) is Timestamp
The TypeError exception message further states "Instead of adding/subtracting n, use n * obj.freq" from which one may infer that n is intended to represent the integer. It seems that the error is suggesting that we multiply the integer by some frequency before adding it to the Timestamp variable. This is not entirely clear, so I Googled "pandas add integer to Timestamp" (because clearly that is what the code is trying to do). All of the top answers suggest using pandas.to_timedelta() or pandas.Timedelta().
At this point I thought to myself: it makes sense that you can't just add an integer to a Timestamp, because what are you adding? minutes? seconds? days? weeks?
However, you can add an integer number of one of these frequencies, in fact the pandas.Timedelta() constructor takes a value argument indicates a number of days, weeks, etc.
The data from yf.download() is daily (interval='1d') which suggest that the integer should be multiplied by a pandas.Timedelta of 1 day. I cannot be certain of this, because I don't have your textbook, so I am not 100% sure what the code is trying to accomplish there, but it is a reasonable guess, so I will change idx += maxWindow to
idx += (maxWindow*pd.Timedelta('1 day'))
and see what happens:
dino#DINO:~/code/mplfinance/examples/scratch_pad/issues$ python so68871906.py
[*********************100%***********************] 1 of 1 completed
Traceback (most recent call last):
File "so68871906.py", line 84, in <module>
df1 = getBinsFromTrend(df.index, df['Close'], [idx_range_from,idx_range_to,1]) #[3,10,1] = range(3,10) This is the issue
File "so68871906.py", line 51, in getBinsFromTrend
if idx >= len(close):
TypeError: '>=' not supported between instances of 'Timestamp' and 'int'
Step 2:
The code passes the modified line of code (line 50), but now it's failing on the next line of code with a similar TypeError in that it doesn't support comparing ('>=') and integer and a Timestamp. So next I try similarly modifying line 51 to:
if idx >= len(close)*pd.Timedelta('1 day'):
The result:
dino#DINO:~/code/mplfinance/examples/scratch_pad/issues$ python so68871906.py
[*********************100%***********************] 1 of 1 completed
Traceback (most recent call last):
File "so68871906.py", line 84, in <module>
df1 = getBinsFromTrend(df.index, df['Close'], [idx_range_from,idx_range_to,1]) #[3,10,1] = range(3,10) This is the issue
File "so68871906.py", line 51, in getBinsFromTrend
if idx >= len(close)*pd.Timedelta('1 day'):
TypeError: '>=' not supported between instances of 'Timestamp' and 'Timedelta'
This doesn't work either, as you can see (can't compare Timestamp and Timedelta).
Step 3:
Looking more closely at the code, it seems the code is trying to determine if adding maxWindow to idx has moved idx past the end of the data.
Looking a couple lines higher in the code, you can see that the variable idx comes from the list of Timestamp objects in close.index, so perhaps the correct comparison would be:
if idx >= close.index[-1]:
that is, comparing idx to the last possible idx value. The result:
dino#DINO:~/code/mplfinance/examples/scratch_pad/issues$ python so68871906.py
[*********************100%***********************] 1 of 1 completed
Traceback (most recent call last):
File "so68871906.py", line 84, in <module>
df1 = getBinsFromTrend(df.index, df['Close'], [idx_range_from,idx_range_to,1]) #[3,10,1] = range(3,10) This is the issue
File "so68871906.py", line 60, in getBinsFromTrend
df_tval.loc[dt1] = tValLinR(df1.values) #calculates t-statistics on period
File "so68871906.py", line 18, in tValLinR
ols = sm1.OLS(close, x).fit()
NameError: name 'sm1' is not defined
Step 4:
Wow! Great, we got past the errors on lines 50 and 51. But now we have an error on line 18. Indicating that the name sm1 is not defined. This suggests that either you did not copy all of the code that you should have, or perhaps there is something else that needs to be imported that would define sm1 for the python interpreter.
So you see, this is the basic process of debugging your code. Again the key, at least with python, is carefully reading the Traceback. And again, Stackoverflow is not intended as a place to ask others to debug your code. A little bit of searching online for things like "learn python" and "python debugging techniques" will yield a wealth of helpful information.
I hope this is pointing you in the right direction. If for some reason I am way off base with this answer, let me know and I will delete it.

Too many indices in array error brian2-python

I am trying to compare an array value with the previous and the next one using the below code but i get the too many indices in array error, which I would like to bypass, but I dont know how.
spikes=print(M.V[0])
#iterate in list of M.V[0] with three iterators to find the spikes
for i,x in enumerate(M.V[0]):
if (i>=1):
if x[i-1]<x[i] & x[i]>x[i+1] & x[i]>25*mV:
spikes+=1
print(spikes)
and I get this error:
IndexError Traceback (most recent call last)
<ipython-input-24-76d7b392071a> in <module>
3 for i,x in enumerate(M.V[0]):
4 if (i>=1):
----> 5 if x[i-1]<x[i] & x[i]>x[i+1] & x[i]>25*mV:
6 spikes+=1
7 print(spikes)
~/anaconda3/lib/python3.6/site-packages/brian2/units/fundamentalunits.py in __getitem__(self, key)
1306 single integer or a tuple of integers) retain their unit.
1307 '''
-> 1308 return Quantity(np.ndarray.__getitem__(self, key), self.dim)
1309
1310 def __getslice__(self, start, end):
IndexError: too many indices for array
Do note that M.V[0] is an array by itself

You said that "M.V[0] is an array by itself". However, you need to say more about it. Probably M is StateMonitor object detailed in https://brian2.readthedocs.io/en/stable/user/recording.html#recording-spikes . Is this correct ?
If so, you need to give full and minimal code in order to understand your details. For instance, what is your neuron model inside NeuronGroup object? More importantly, instead of finding spike event on your own, why don't you use SpikeMonitor class which extremely ease what you are planning ?
SpikeMonitor class in Brian2 : https://brian2.readthedocs.io/en/stable/reference/brian2.monitors.spikemonitor.SpikeMonitor.html

Error while using sum() in Python SFrame

I'm new to python and I'm performing a basic EDA analysis on two similar SFrames. I have a dictionary as two of my columns and I'm trying to find out if the max values of each dictionary are the same or not. In the end I want to sum up the Value_Match column so that I can know how many values match but I'm getting a nasty error and I haven't been able to find the source. The weird thing is I have used the same methodology for both the SFrames and only one of them is giving me this error but not the other one.
I have tried calculating max_func in different ways as given here but the same error has persisted : getting-key-with-maximum-value-in-dictionary
I have checked for any possible NaN values in the column but didn't find any of them.
I have been stuck on this for a while and any help will be much appreciated. Thanks!
Code:
def max_func(d):
v=list(d.values())
k=list(d.keys())
return k[v.index(max(v))]
sf['Max_Dic_1'] = sf['Dic1'].apply(max_func)
sf['Max_Dic_2'] = sf['Dic2'].apply(max_func)
sf['Value_Match'] = sf['Max_Dic_1'] == sf['Max_Dic_2']
sf['Value_Match'].sum()
Error :
RuntimeError Traceback (most recent call last)
<ipython-input-70-f406eb8286b3> in <module>()
----> 1 x = sf['Value_Match'].sum()
2 y = sf.num_rows()
3
4 print x
5 print y
C:\Users\rakesh\Anaconda2\lib\site-
packages\graphlab\data_structures\sarray.pyc in sum(self)
2216 """
2217 with cython_context():
-> 2218 return self.__proxy__.sum()
2219
2220 def mean(self):
C:\Users\rakesh\Anaconda2\lib\site-packages\graphlab\cython\context.pyc in
__exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let
exception propagate
RuntimeError: Runtime Exception. Exception in python callback function
evaluation:
ValueError('max() arg is an empty sequence',):
Traceback (most recent call last):
File "graphlab\cython\cy_pylambda_workers.pyx", line 426, in
graphlab.cython.cy_pylambda_workers._eval_lambda
File "graphlab\cython\cy_pylambda_workers.pyx", line 169, in
graphlab.cython.cy_pylambda_workers.lambda_evaluator.eval_simple
File "<ipython-input-63-b4e3c0e28725>", line 4, in max_func
ValueError: max() arg is an empty sequence

In order to debug this problem, you have to look at the stack trace. On the last line we see:
File "<ipython-input-63-b4e3c0e28725>", line 4, in max_func
ValueError: max() arg is an empty sequence
Python thus says that you aim to calculate the maximum of a list with no elements. This is the case if the dictionary is empty. So in one of your dataframes there is probably an empty dictionary {}.
The question is what to do in case the dictionary is empty. You might decide to return a None into that case.
Nevertheless the code you write is too complicated. A simpler and more efficient algorithm would be:
def max_func(d):
if d:
return max(d,key=d.get)
else:
# or return something if there is no element in the dictionary
return None

Making a histogram in Python

I have been trying to make a histogram using the data given in survey below.
#represents the "information" (the summarized data)
ranking = [0,0,0,0,0,0,0,0,0,0,0]
survey = [1,5,3,4,1,1,1,2,1,2,1,3,4,5,1,7]
for i in range(len(survey)):
ranking[survey[i]]+=1
#create histogram
print("\nCreating a histogram from values: ")
print("%3s %5s %7s"%("Element", "Value", "Histogram"))
for i in range(len(ranking)):
print("%7d %5d %-s"%(i+1, ranking[i+1], "*" * ranking[i+1]))
Here is exactly what the shell displays when I run my code:
Creating a histogram from values:
Element Value Histogram
1 7 *******
2 2 **
3 2 **
4 2 **
5 2 **
6 0
7 1 *
8 0
9 0
10 0
Traceback (most recent call last):
File "C:\folder\file23.py", line 17, in <module>
print("%7d %5d %-s"%(i+1, ranking[i+1], "*" * ranking[i+1]))
IndexError: list index out of range
My expected output is the above thing just without the traceback.
The shell is displaying the right thing, I'm just unsure of the error message. How can I fix this?

When i reaches its highest value, len(ranking) - 1, your use of ranking[i+1] is clearly "out of range"! Use range(len(ranking) - 1 in the for loop to avoid the error.
The counting can be simplified, too:
import collections
ranking = collections.Counter(survey)
for i in range(min(ranking), max(ranking)+1):
print("%7d %5d %-s"%(i, ranking[i], "*" * ranking[i]))
Here you need min and max because Counter is mapping-like, not sequence-like. But it will still work fine (and you can use range(0, max(ranking)+1) if you prefer!-)

Your index i is incremented up to len(ranking)-1, which is the last valid index in ranking, but you're trying to access ranking[i+1], hence the IndexError.
Fix:
for i in range(len(ranking)-1):

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Data analysis - MD analysis Python - python

I am seeing this error, need help on this! warnings.warn("Failed to guess the mass for the following atom types: {}".format(atom_type)) Traceback (most recent call last): File "traj_residue-z.py", line 48, in protein_z=protein.centroid()[2] IndexError: index 2 is out of bounds for axis 0 with size 0

Related

"There are no fields in dtype int64." why am i getting it?

trend following using meta label issue with time ubtracting `n`, use `n * obj.freq`

Too many indices in array error brian2-python

Error while using sum() in Python SFrame

Making a histogram in Python

Categories

Resources