Hope you are well. I've been converting an adjacency matrix into a connectivity index, but I am having trouble extending it to the second neighbour. I will try to explain my problem and goal as best I can; it will be easiest to explain with an example.
This is an example of an adjacency matrix.
A0 A1 A2 A3 A4 A5
A0 0.0 1.0 0.0 0.0 0.0 0.0
A1 1.0 0.0 1.0 0.0 0.0 0.0
A2 0.0 1.0 0.0 1.0 0.0 0.0
A3 0.0 0.0 1.0 0.0 1.0 0.0
A4 0.0 0.0 0.0 1.0 0.0 1.0
A5 0.0 0.0 0.0 0.0 1.0 0.0
The titles (A0, A1, etc.) represent atoms. The values represent connections: a value of 1 in row A0 and column A1 means there is a bond between the two atoms and they are neighbours.
I am computing the first order (first neighbour) connectivity index (CF) with the following equation:
CF = Σ(Si * Sj)^(-0.5)
where Si is the connectivity of the atom in row i (A0 in the example above) and Sj that of the atom in column j (A1 above). Connectivity (S) is defined as the number of bonds of an atom, i.e. the sum of its row (or column).
I used the code below to apply this.
import numpy as np

def randic_index(A):
    n = A.shape[0]
    R = 0
    for i in range(n):
        for j in range(n):
            if A[i,j] != 0:
                deg_i = np.sum(A[i,:])
                deg_j = np.sum(A[:,j])
                R += 1 / np.sqrt(deg_i * deg_j)
    return 0.5*R
This is all good for the first order connectivity index, but my issue is retrieving the second order (second neighbour) index. I have tried different methods, but none seem to be working for me.
The formula for the second order connectivity index (CS) is:
CS = Σ(Si * Sj * Sk)^(-0.5), where k corresponds to a neighbour of atom j.
I need a way of selecting the neighbours of each first neighbour and iterating over all the potential second neighbours, leaving no combination uncounted.
Does anyone have advice on how to tackle this, or a modification that solves it? And, perhaps, something that would allow orders higher than the second to be calculated, though I expect I would be able to adjust that myself.
Many Thanks
If the sum for the second order index is taken over all triples (i, j, k) of connected nodes, including the ones where i=k, then the following should work:
import numpy as np
def randic_index2_path(A):
    degs = 1 / A.sum(axis=0)**0.5             # degs[i] = 1 / sqrt(S_i)
    B = A * degs**0.5 * degs[..., None]        # B[i, j] = A[i, j] * S_j**-0.25 * S_i**-0.5
    return np.tril(B @ B.T).sum()              # (B @ B.T)[i, k] sums (S_i*S_j*S_k)**-0.5 over middle atoms j
If, on the other hand, the sum is taken over triples of connected nodes (i, j, k) where i and k are different, then you can try the following:
def randic_index2_trail(A):
    degs = 1 / A.sum(axis=0)**0.5
    B = A * degs**0.5 * degs[..., None]
    return np.tril(B @ B.T, k=-1).sum()        # k=-1 drops the diagonal, i.e. the i = k walks
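As a quick sanity check (my own addition, using the 6-atom example matrix from the question), the trail version can be compared against the value computed by hand from the formula:

A = np.array([[0, 1, 0, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 0, 1, 0]], dtype=float)

# Second-neighbour triples by hand: (A0,A1,A2), (A1,A2,A3), (A2,A3,A4), (A3,A4,A5)
# CS = (1*2*2)**-0.5 + (2*2*2)**-0.5 + (2*2*2)**-0.5 + (2*2*1)**-0.5 ≈ 1.707
print(randic_index2_trail(A))   # should print approximately 1.707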
Also, the function computing the first order index included in the question can be rewritten as follows:
def randic_index(A):
    degs = 1 / A.sum(axis=0)**0.5
    return np.tril(A * degs * degs[..., None]).sum()
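Regarding orders higher than the second: a simple (non-vectorised) brute-force approach could enumerate the paths recursively. The function below is only a sketch, under the assumption that an order-n index sums (S_i1 * ... * S_i(n+1))**-0.5 over simple paths of n bonds, each path counted once:

def randic_index_n(A, order):
    degs = A.sum(axis=0)
    n = A.shape[0]
    total = 0.0
    def extend(path):
        nonlocal total
        if len(path) == order + 1:
            total += np.prod(degs[path]) ** -0.5
            return
        for nxt in range(n):
            if A[path[-1], nxt] != 0 and nxt not in path:
                extend(path + [nxt])
    for start in range(n):
        extend([start])
    return total / 2   # every path is found once from each of its two ends

For order=1 and order=2 this should agree with randic_index and randic_index2_trail above, but it will be much slower on large molecules.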
I've created distance matrices for time steps every 0.1 seconds over 60-second intervals. For each time step the matrices look like this, populated with distance values:
time = 0.1
a1 b2 c3 d4
a1 0 5.4 9.1 10.1
b2 5.4 0 5.0 3.2
c3 9.1 5.0 0 6.6
d4 10.1 3.2 6.6 0
time = 0.2
a1 b2 c3 d4
a1 0 2.4 9.1 12.1
b2 2.4 0 6.7 3.6
c3 9.1 6.7 0 9.6
d4 12.1 3.6 9.6 0
The goal is to generate an adjacency matrix, or neighbor list, at the end of each 60-second interval (examining 600 dataframes) for the neighbors that stay within a distance threshold for the entire minute (i.e. in every distance matrix examined).
For example, if the distance limit is d=10, then for this 0.2-second sample it would return the list [a1, b2, c3], since over that interval they all maintained a distance of less than 10.
I was wondering if there is a semi-efficient or clever way to do this with pandas and python.
Stack your dataframes along a third dimension, apply your threshold to get boolean values, then use numpy.logical_and.reduce to apply the "and" along that third dimension.
e.g. if dfs is a list of your dataframes, then do
import numpy as np

threshold = 10
stacked = np.stack(dfs, axis=2)
result = np.logical_and.reduce(stacked < threshold, axis=2)
You can then put result inside a dataframe with index and column names if you wish.
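For instance, a minimal sketch (assuming all dataframes share the same labels, as in the example):

import pandas as pd

# result is an n x n boolean array; give it back the node labels
result_df = pd.DataFrame(result, index=dfs[0].index, columns=dfs[0].columns)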
IIUC, since you have a symmetric matrix, you can use numpy to create boolean masks and filter the indices (or columns) with them. Since it's symmetric, it suffices to analyze either the upper triangle or the lower triangle (I chose the lower triangle). Then, among the numbers in the lower triangle, build a mask that returns False for rows containing a value of d or more.
import numpy as np

def get_neighbor_indices(df, d):
    less_than_d = np.tril(df.lt(d).to_numpy())
    upper_triangle_dummy = np.triu(np.ones(df.shape)==1)
    msk = (less_than_d | upper_triangle_dummy).all(axis=1)
    return df.index[msk].tolist()
>>> get_neighbor_indices(df1, 10)
['a1', 'b2', 'c3']
>>> get_neighbor_indices(df2, 10)
['a1', 'b2', 'c3']
I have a (large) dataframe with a list of edges in a bipartite graph, and I want to transform it into a Python sparse transition matrix.
So I have a dataframe listing edges that link nodes from part 1 (a, b, c) with nodes from part 2 (x, y, z). Edges have multiplicity: in the example, there are two edges from b to y.
start end multiplicity
a x 1
a y 1
b y 2
b z 1
c x 1
c z 1
The result I want is a sparse matrix, 3x3 in this case. I have dictionaries for parts 1 and 2, indicating which node corresponds to which row and column of the resulting transition matrix:
dic1 = {'a':0,'b':1,'c':2}
dic2 = {'x':1,'y':0,'z':2}
So I want the matrix
y x z
a 1 1 0
b 2 0 1
c 0 1 1
...but in sparse form (csr_matrix, lil_matrix, or coo_matrix). I have tried iterating over the list of edges, but it is too slow for long lists.
Also, approaches based on pivot will generate full matrices, which will be slow and memory consuming.
Is there an efficient way to obtain the sparse matrix I want?
From what I understand, you can try pivot + reindex with Index.map (I have added the two variables m and final for readability; you can merge them into one after testing):
m = df.pivot(*df).fillna(0).rename_axis(index=None,columns=None)
final = m.reindex(index=m.index[m.index.map(dic1)],columns=m.columns[m.columns.map(dic2)])
print(final)
y x z
a 1.0 1.0 0.0
b 2.0 0.0 1.0
c 0.0 1.0 1.0
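If a genuinely sparse result is needed (the question rules out dense pivots for large data), a minimal sketch using scipy's coo_matrix built straight from the edge list might look like this; the variable names row, col and mat are mine, not from the original post:

from scipy.sparse import coo_matrix

row = df['start'].map(dic1).to_numpy()
col = df['end'].map(dic2).to_numpy()
mat = coo_matrix((df['multiplicity'].to_numpy(), (row, col)),
                 shape=(len(dic1), len(dic2)))
# mat.tocsr() / mat.tolil() if another sparse format is preferred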
How to compare values to the next or previous item in a loop?
I need to summarize consecutive repetitions of occurrences in columns.
After that I need to create a "frequency table", so dfoutput should look like the picture at the bottom.
This code doesn't work because I can't compare to another item.
Maybe there is another, simple way to do this without looping?
import pandas as pd

sumrep=0
df = pd.DataFrame(data = {'1' : [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],'2' : [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})
df.index= [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] # It will be easier to assign repetitions in output df - index will be equal to number of repetitions
dfoutput = pd.DataFrame(0,index=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],columns=['1','2'])

#example for column 1
for val1 in df.columns[1]:
    if val1 == 1 and val1 == 0: #can't find the way to check the NEXT val1 (one row below) in column 1 :/
        if sumrep==0:
            dfoutput.loc[1,1]=dfoutput.loc[1,1]+1 #count only SINGLE occurrences of values and assign them to row number 1 in dfoutput
        if sumrep>0:
            dfoutput.loc[sumrep,1]=dfoutput.loc[sumrep,1]+1 #count repeated occurrences greater than 1 and assign them to the proper row in dfoutput
            sumrep=0
    elif val1 == 1 and df[val1+1]==1 :
        sumrep=sumrep+1
Desired output table for column 1 - dfoutput:
I don't understand why there isn't a simple method to move around a dataframe, like the OFFSET function in VBA/Excel :/
You can use the function defined here to perform fast run-length-encoding:
import numpy as np

def rlencode(x, dropna=False):
    """
    Run length encoding.
    Based on http://stackoverflow.com/a/32681075, which is based on the rle
    function from R.

    Parameters
    ----------
    x : 1D array_like
        Input array to encode
    dropna : bool, optional
        Drop all runs of NaNs.

    Returns
    -------
    start positions, run lengths, run values
    """
    where = np.flatnonzero
    x = np.asarray(x)
    n = len(x)
    if n == 0:
        return (np.array([], dtype=int),
                np.array([], dtype=int),
                np.array([], dtype=x.dtype))
    starts = np.r_[0, where(~np.isclose(x[1:], x[:-1], equal_nan=True)) + 1]
    lengths = np.diff(np.r_[starts, n])
    values = x[starts]
    if dropna:
        mask = ~np.isnan(values)
        starts, lengths, values = starts[mask], lengths[mask], values[mask]
    return starts, lengths, values
With this function your task becomes a lot easier:
import pandas as pd
from collections import Counter
from functools import partial

def get_frequency_of_runs(col, value=1, index=None):
    _, lengths, values = rlencode(col)
    return pd.Series(Counter(lengths[np.where(values == value)]), index=index)

df = pd.DataFrame(data={'1': [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],
                        '2': [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})

df.apply(partial(get_frequency_of_runs, index=df.index)).fillna(0)
# 1 2
# 0 0.0 0.0
# 1 1.0 2.0
# 2 2.0 1.0
# 3 0.0 0.0
# 4 1.0 1.0
# 5 0.0 0.0
# 6 0.0 0.0
# 7 0.0 0.0
# 8 0.0 0.0
# 9 0.0 0.0
# 10 0.0 0.0
# 11 0.0 0.0
# 12 0.0 0.0
# 13 0.0 0.0
# 14 0.0 0.0
I have a sparse matrix X and a target array Y (whose length is equal to the number of rows of X); imagine something like the following:
X=([1.5 0.0 0.0 71.9 0.0 0.0 0.0],
[0.0 10.0 0.0 2.0 0.0 0.0 0.0],
[0.0 0.0 0.0 0.0 0.0 0.0 11.0])
y =[4,2,-6]
What I need first is a new form of the sparse matrix where each row contains the nonzero values and their corresponding column indices in X:
Example
X1=( 0:1.5 3:71.9
1:10 3:2
6:11 )
To do so I have already asked this question (however, I still don't know how to store X1 there so that I can later concatenate it with Y). The second part of the question is to concatenate X1 and Y (the number of rows in X1 is still equal to the length of Y) and store the final result, which should be something like the following format:
data:
4 0:1.5 3:71.9
2 1:10 3:2
-6 6:11
...
What is the way to get from X and Y to this final data and store it in a text file in Python?
Concatenate like so:
data = [[a]+b for a, b in zip(Y, X1)]
# data = [[a]+b for a, b in zip(Y, [':'.join([k,v]) for k,v in X1.items()])]
and write to file:
with open(filename, 'w') as f:
    for row in data:
        f.write(' '.join(map(str, row)) + '\n')
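If X is a scipy sparse matrix, a short sketch that builds X1 and writes the file in one pass could use the CSR attributes indptr, indices and data (the filename 'data.txt' is just a placeholder):

from scipy.sparse import csr_matrix

X = csr_matrix([[1.5, 0.0, 0.0, 71.9, 0.0, 0.0, 0.0],
                [0.0, 10.0, 0.0, 2.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 11.0]])
Y = [4, 2, -6]

with open('data.txt', 'w') as f:
    for label, start, end in zip(Y, X.indptr[:-1], X.indptr[1:]):
        pairs = ('%d:%g' % (c, v) for c, v in zip(X.indices[start:end], X.data[start:end]))
        f.write('%s %s\n' % (label, ' '.join(pairs)))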
I have a dataframe with 3 million rows that contains values like these:
d a0 a1 a2
1.5 10.0 5.0 1.0
0.8 10.0 2.0 0.0
I want to fill a fourth column with a linear interpolation of (a0, a1, a2) evaluated at the value in the "d" column:
d a0 a1 a2 newcol
1.5 10.0 5.0 1.0 3.0
0.8 10.0 2.0 0.0 3.6
newcol is the weighted average of a[int(d)] and a[int(d)+1]; e.g. when d = 0.8, newcol = 0.2 * a0 + 0.8 * a1, because 0.8 is 80% of the way between 0 and 1.
I found that np.interp can be used, but I could not find a way to pass the three columns as the values argument:
df["newcol"]=np.interp(df["d"],[0,1,2], [100,200,300])
will indeed give me
d a0 a1 a2 newcol
1.5 10.0 5.0 1.0 250.0
0.8 10.0 2.0 0.0 180.0
BUT I have no way to specify that the values vector changes from row to row:
df["newcol"]=np.interp(df["d"],[0,1,2], df[["a0","a1","a2"]])
gives me the following traceback :
File "C:\Python27\lib\site-packages\numpy\lib\function_base.py", line 1271, in interp
return compiled_interp(x, xp, fp, left, right)
ValueError: object too deep for desired array
Is there any way to use a different values vector for each line? Can you think of any workaround?
Basically, I could find no way to create this new column based on the definition: what is the value, at x = column "d", of the function that is piecewise linear between the given points and whose values at these points are described in the columns "ai"?
Edit: before, I used scipy.interpolate.interp1d, which is not memory efficient; a comment helped me partially solve my problem.
Edit 2:
I tried the approach suggested by ev-br, who stated that I had to try coding the loop myself.
for i in range(len(tps)):
    columns = ["a1", "a2", "a3"]
    length = len(columns)
    x = np.maximum(0, np.minimum(df.ix[i, "d"], length - 2))
    xint = int(x)
    xfrac = x - xint
    name1 = columns[xint]
    name2 = columns[xint + 1]
    tps.ix[i, "Multiplier"] = df.ix[i, name1] + xfrac * (df.ix[i, name2] - df.ix[i, name1])
The above loop runs at about 50 iterations per second, so I guess I have a major optimisation issue. What am I doing wrong when working on a DataFrame like this?
It might come a bit too late, but I would use np.interp with pandas' apply function. Creating the DataFrame from your example:
t = pd.DataFrame([[1.5,10,5,1],[0.8,10,2,0]], columns=['d', 'a0', 'a1', 'a2'])
Then comes the apply function:
t.apply(lambda x: np.interp(x.d, [0,1,2], x['a0':]), axis=1)
which yields:
0 3.0
1 3.6
dtype: float64
This is perfectly usable on "normal" datasets. However, the size of your DataFrame might call for a better/more optimized solution. The processing time scales linearly; my machine clocks in at about 10,000 lines per second, which means about 5 minutes for 3 million...
OK, I have a second solution, which uses the numexpr module. This method is much more specific, but also much faster. I've measured the complete process to take 733 milliseconds for 1 million lines, which is not bad...
So we have the original DataFrame as before:
t = pd.DataFrame([[1.5,10,5,1],[0.8,10,2,0]], columns=['d', 'a0', 'a1', 'a2'])
We import the module and use it, but it requires separating the two cases: using 'a0' and 'a1', or 'a1' and 'a2', as the lower/upper limits for the linear interpolation. We also prepare the numbers so that both cases can be fed to the same expression (hence the -1). We do that by creating 3 arrays holding the interpolation value (originally 'd') and the two limits, chosen according to the value of "d". So we have:
import numexpr as ne
lim = np.where(t.d > 1, [t.d-1, t.a1, t.a2], [t.d, t.a0, t.a1])
Then we evaluate the simple linear interpolation expression and finally add it as a new column:
x = ne.evaluate('(1-x)*a+x*b', local_dict={'x': lim[0], 'a': lim[1], 'b': lim[2]})
t['IP'] = x
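As a quick check (my own addition), the interpolated values for the two sample rows come out as expected:

print(t['IP'].tolist())   # approximately [3.0, 3.6] -- the newcol values from the question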