I've created distance matrices for time steps at every 0.1 seconds over 60-second intervals. For each time step the matrices look like this, with distance values populating them:
time = 0.1
a1 b2 c3 d4
a1 0 5.4 9.1 10.1
b2 5.4 0 5.0 3.2
c3 9.1 5.0 0 6.6
d4 10.1 3.2 6.6 0
time = 0.2
a1 b2 c3 d4
a1 0 2.4 9.1 12.1
b2 2.4 0 6.7 3.6
c3 9.1 6.7 0 9.6
d4 12.1 3.6 9.6 0
The goal is to generate an adjacency matrix, or neighbor list, at the end of each 60-second interval (examining 600 distance matrices) for neighbors that stay within a distance threshold for the entire minute (i.e. in every distance matrix examined).
For example, if the distance limit is d=10, then for this 0.2-second sample it would return the list [a1, b2, c3], since over that 0.2-second interval they all maintained distances of less than 10 from each other.
I was wondering if there is a semi-efficient or clever way to do this with pandas and Python.
Stack your dataframes along a third dimension, apply your threshold to get boolean values, then use numpy.logical_and.reduce to apply the "and" along that third dimension.
E.g. if dfs is a list of your dataframes, then do:
import numpy as np

threshold = 10
stacked = np.stack(dfs, axis=2)  # shape (n, n, number_of_timesteps)
result = np.logical_and.reduce(stacked < threshold, axis=2)  # True where every timestep is below the threshold
You can then put result inside a dataframe with index and column names if you wish.
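For instance, a minimal sketch of that last step (assuming all dataframes in dfs share the same index and columns, as in the sample above):

import pandas as pd

adjacency = pd.DataFrame(result, index=dfs[0].index, columns=dfs[0].columns)

# e.g. turn it into a neighbor list: for each node, the others it stayed close to
neighbor_list = {node: [other for other in adjacency.columns
                        if other != node and adjacency.loc[node, other]]
                 for node in adjacency.index}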
IIUC, since you have a symmetric matrix, you can use numpy to create boolean masks and filter the index (or columns) with them. Since the matrix is symmetric, it suffices to analyze either the upper or the lower triangle (I chose the lower triangle). Then, among the numbers in the lower triangle, build a mask that returns False for the rows that contain a value greater than or equal to d.
import numpy as np

def get_neighbor_indices(df, d):
    # lower triangle of the "distance < d" mask (the upper triangle is zeroed out)
    less_than_d = np.tril(df.lt(d).to_numpy())
    # dummy True values for the upper triangle (including the diagonal)
    upper_triangle_dummy = np.triu(np.ones(df.shape) == 1)
    # keep the rows whose lower-triangle entries are all below the threshold
    msk = (less_than_d | upper_triangle_dummy).all(axis=1)
    return df.index[msk].tolist()
>>> get_neighbor_indices(df1, 10)
['a1', 'b2', 'c3']
>>> get_neighbor_indices(df2, 10)
['a1', 'b2', 'c3']
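To cover the whole 60-second interval, you can intersect the per-matrix results; a small sketch, assuming dfs holds all 600 distance DataFrames for the interval:

surviving = set(dfs[0].index)
for df in dfs:
    surviving &= set(get_neighbor_indices(df, 10))
print(sorted(surviving))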
Hope you are well. I've been converting an adjacency matrix to a connectivity index, but I am having problems getting it to take the second neighbour into account. I will try to explain my problem and goal as best as I can; it will be easiest with an example.
This is an example of an adjacency matrix.
A0 A1 A2 A3 A4 A5
A0 0.0 1.0 0.0 0.0 0.0 0.0
A1 1.0 0.0 1.0 0.0 0.0 0.0
A2 0.0 1.0 0.0 1.0 0.0 0.0
A3 0.0 0.0 1.0 0.0 1.0 0.0
A4 0.0 0.0 0.0 1.0 0.0 1.0
A5 0.0 0.0 0.0 0.0 1.0 0.0
The titles (A0, A1, etc.) each represent an atom. The values represent connections, so a value of 1 in row A0 and column A1 means there is a bond between the two atoms and they are neighbours.
I am following an equation to find the first order (first neighbour) connectivity index (CF) by the following equation:
CF = Σ(Si * Sj)^(-0.5)
where Si corresponds to the connectivity of the atom in row i (A0 in the example above) and Sj to that of the atom in column j (A1 in the example above), and the sum runs over all bonded pairs. Connectivity (S) is defined as the number of bonds, i.e. the sum of the row or column of that atom.
I used the code below to apply this.
import numpy as np

def randic_index(A):
    n = A.shape[0]
    R = 0
    for i in range(n):
        for j in range(n):
            if A[i, j] != 0:
                deg_i = np.sum(A[i, :])  # connectivity S_i
                deg_j = np.sum(A[:, j])  # connectivity S_j
                R += 1 / np.sqrt(deg_i * deg_j)
    return 0.5 * R  # each bond was counted twice, so halve the sum
This is all good for the first order connectivity index, but my problem is in retrieving the second order (second neighbour) index. I have tried different methods, but none seem to be working for me.
The formula for the second order connectivity index (CS) is:
CS = Σ(Si * Sj * Sk)^(-0.5), where k corresponds to a neighbour of the j atom.
I need a way to select the neighbours of each first neighbour and iterate over all the potential second neighbours, leaving no combination uncalculated.
Does anyone have advice on how to tackle this, or a modification that would solve it? Ideally it would also allow orders higher than second to be calculated, though I expect I could adjust that myself.
Many Thanks
If the sum for the second order index is taken over all triples (i, j, k) of connected nodes, including the ones where i=k, then the following should work:
import numpy as np

def randic_index2_path(A):
    degs = 1 / A.sum(axis=0)**0.5          # S ** -0.5 for every atom
    B = A * degs * degs[..., None]**0.5    # scale column j by S_j**-0.5 and row i by S_i**-0.25
    return np.tril(B.T @ B).sum()
If, on the other hand, the sum is taken over triples of connected nodes (i, j, k) where i and k are different, then you can try the following:
def randic_index2_trail(A):
    degs = 1 / A.sum(axis=0)**0.5
    B = A * degs * degs[..., None]**0.5
    return np.tril(B.T @ B, k=-1).sum()    # k=-1 drops the diagonal, i.e. the triples with i == k
Also, the function computing the first order index included in the question can be rewritten as follows:
def randic_index(A):
    degs = 1 / A.sum(axis=0)**0.5
    return np.tril(A * degs * degs[..., None]).sum()
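For reference, a quick usage sketch on the 6-atom chain from the question (the printed values are approximate):

# the chain A0-A1-A2-A3-A4-A5 from the question
A = np.array([[0, 1, 0, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 0, 1, 0]], dtype=float)

print(randic_index(A))         # first order, about 2.914
print(randic_index2_trail(A))  # second order over distinct i and k, about 1.707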
I have this simplified DataFrame where I want to add a new column Distance_km.
In this new column all values should be in kilometres and converted to float dtype.
import pandas as pd

d = {'Point': ['a','b','c','d'], 'Distance': ['3km', '400m','1.1km','200m']}
dist = pd.DataFrame(data=d)
dist
Point Distance
0 a 3km
1 b 400m
2 c 1.1km
3 d 200m
Point object
Distance object
dtype: object
How can I get this output?
Point Distance Distance_km
0 a 3km 3.0
1 b 400m 0.4
2 c 1.1km 1.1
3 d 200m 0.2
Point object
Distance object
Distance_km float64
dtype: object
Thanks in advance!
You could use Pandas' apply method to pass your Distance column values to a function that converts them to a standardized unit, like so.
From the documentation
Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
First, create the function that will transform the data (apply can even take in a lambda):
import re

def convert_to_km(distance):
    '''
    distance can be a string with km or m as units,
    e.g. 300km, 1.1km, 200m, 4.5m
    '''
    # split the string into value and unit, e.g. '300km' -> ('300', 'km')
    split_dist = re.match(r'([\d\.]+)?([a-zA-Z]+)', distance)
    value = split_dist.group(1)  # '300'
    unit = split_dist.group(2)   # 'km'
    if unit == 'km':
        return float(value)
    if unit == 'm':
        return round(float(value) / 1000, 2)
d = {'Point': ['a','b','c','d'], 'Distance': ['3km', '400m','1.1km','200m']}
dist=pd.DataFrame(data=d)
You can then apply this function to your Distance column:
dist['Distance_km'] = dist.apply(lambda row: convert_to_km(row['Distance']), axis=1)
dist
The output will be
Point Distance Distance_km
0 a 3km 3.0
1 b 400m 0.4
2 c 1.1km 1.1
3 d 200m 0.2
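Since only the Distance column is needed, a slightly simpler variation of the same idea is to apply the function to that Series directly:

dist['Distance_km'] = dist['Distance'].apply(convert_to_km)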
You may try the following as well:
- Check whether the second-to-last character of the string is 'k'.
- If it is, remove only the last two characters (i.e. 'km') and convert the rest to float.
- Otherwise, take all characters except the last one (i.e. drop 'm') and divide the float value by 1000.
Below is the implementation using apply on the Distance column:
dist['Distance_km'] = dist['Distance'].apply(lambda row: float(row[:-1])/1000 if not row[-2]=='k' else float(row[:-2]))
Result is:
Point Distance Distance_km
a 3km 3.0
b 400m 0.4
c 1.1km 1.1
d 200m 0.2
Try:
# An "Weight" column marking those are in "m" units
dist["Weight"] = 1e-3
dist.loc[dist["Distance"].str.contains("km"),"Weight"] = 1
# Extract the numeric part of string and convert it to float
dist["NumericPart"] = dist["Distance"].str.extract("([0-9.]+)\w+").astype(float)
# Merge the numeric parts with their units(weights) by multiplication
dist["Distance_km"] = dist["NumericPart"] * dist["Weight"]
You will get:
Point Distance Weight NumericPart Distance_km
0 a 3km 1.000 3.0 3.0
1 b 400m 0.001 400.0 0.4
2 c 1.1km 1.000 1.1 1.1
3 d 200m 0.001 200.0 0.2
BTW: Avoid using apply if you can; it will be very slow if your data is big.
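Along those lines, the whole conversion can also be written as a fully vectorized sketch (assuming the only units that occur are 'km' and 'm'):

# extract the numeric value and the unit in one pass, then map the unit to a factor
parts = dist["Distance"].str.extract(r"(?P<value>[0-9.]+)(?P<unit>km|m)")
dist["Distance_km"] = parts["value"].astype(float) * parts["unit"].map({"km": 1.0, "m": 1e-3})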
I have a matrix like this:
id |v1_m1 v2_m1 v3_m1 f_m1 v1_m2 v2_m2 v3_m2 f_m2|
1 | 0 .5 .5 4 0.1 0.3 0.6 4 |
2 | 0.3 .3 .4 8 0.2 0.4 0.4 7 |
What I want is to multiply each of the v columns with the suffix "_m1" by the f_m1 column, and all the v columns with the suffix "_m2" by the f_m2 column.
The output that I expect is something like this:
id |v1_m1 v2_m1 v3_m1 v1_m2 v2_m2 v3_m2 |
1 | 0 2 2 0.4 1.2 2.4 |
2 | 2.4 2.4 3.2 1.4 2.8 2.8 |
This is my current loop-based attempt:

for m in range(1, maxm):
    for i in range(1, maxv):
        df["v{}_m{}".format(i, m)] = df["v{}_m{}".format(i, m)] * df["f_m{}".format(m)]

for m in range(1, maxm):
    df = df.drop(columns=["f_m{}".format(m)])
You could do this with some fancy dataframe reshaping:
df.columns = pd.MultiIndex.from_arrays(list(zip(*df.columns.str.split('_'))))
df=df.stack()
df_mul = df.filter(like='v').mul(df.filter(like='f').squeeze(), axis=0)
df_mul = df_mul.unstack().sort_index(level=1, axis=1)
df_mul.columns = [f'{i}_{j}' for i, j in df_mul.columns]
df_mul
Output:
v1_m1 v2_m1 v3_m1 v1_m2 v2_m2 v3_m2
id
1 0.0 2.0 2.0 0.4 1.2 2.4
2 2.4 2.4 3.2 1.4 2.8 2.8
Details:
- Create MultiIndex column headers by splitting on '_'.
- Reshape the dataframe, stacking the m# level to rows, leaving four columns: f and the three v's.
- Using filter, select the v columns and multiply by the f series, created by selecting the single f column and using squeeze to turn the one-column dataframe into a pd.Series.
- unstack the m# level back to columns.
- Flatten the MultiIndex column header back to a single level using an f-string in a list comprehension.
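For completeness, a minimal setup sketch that reproduces the sample data (assuming id is the index, as shown in the output above):

import pandas as pd

df = pd.DataFrame(
    {'v1_m1': [0.0, 0.3], 'v2_m1': [0.5, 0.3], 'v3_m1': [0.5, 0.4], 'f_m1': [4, 8],
     'v1_m2': [0.1, 0.2], 'v2_m2': [0.3, 0.4], 'v3_m2': [0.6, 0.4], 'f_m2': [4, 7]},
    index=pd.Index([1, 2], name='id'))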
Assuming that your matrix is a pandas dataframe called df, I would like to nominate a list comprehension approach, if you enjoy them.
import itertools
import pandas as pd

items = [(i[0][0], i[0][1].multiply(i[1][1]))
         for i in itertools.product(df.items(), repeat=2)
         if (i[0][0][-2:] == i[1][0][-2:])
         and i[1][0][:1] == 'f'
         and i[0][0][:1] != 'f']

df_mul = pd.DataFrame.from_dict({i[0]: i[1] for i in items})
It should be superfast on larger versions of this problem.
Explanation:
- Creates a generator for the cross-product between the columns, as (c1, c2) tuples.
- Keeps only the pairs where the last 2 characters are the same for both c1 and c2, AND c2 starts with 'f', AND c1 doesn't start with 'f' (leaving you with the column pairs you want to operate on). Something like this: [('v1_m1', 'f_m1'), ('v2_m1', 'f_m1'), ('v1_m2', 'f_m2')]
- Multiplies the columns, attaches a column name, and saves them as items (similar structure to df.items()).
- Turns the items into a dataframe.
I have many pairs of coordinate arrays like so
a=[(1.001,3),(1.334, 4.2),...,(17.83, 3.4)]
b=[(1.002,3.0001),(1.67, 5.4),...,(17.8299, 3.4)]
c=[(1.00101,3.002),(1.3345, 4.202),...,(18.6, 12.511)]
Any coordinate in any of the pairs can be a duplicate of another coordinate in another array of pairs. The arrays are also not the same size.
The duplicates will vary slightly in their values; as an example, I would consider the first coordinate pair in a, b and c to be duplicates of each other.
I could iterate through each array and compare the values one by one using numpy.isclose, however that would be slow.
Is there an efficient way to tackle this problem, hopefully using numpy to keep computing times low?
You might want to try the round() function, which will round off the numbers in your lists to the nearest integers.
The next thing I'd suggest might be too extreme: concatenate the arrays, put them into a pandas dataframe and drop_duplicates().
This might not be the solution you want.
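A rough sketch of that suggestion (assuming rounding to 1 decimal place is an acceptable tolerance for your data):

import numpy as np
import pandas as pd

a = [(1.001, 3), (1.334, 4.2), (17.83, 3.4)]
b = [(1.002, 3.0001), (1.67, 5.4), (17.8299, 3.4)]
c = [(1.00101, 3.002), (1.3345, 4.202), (18.6, 12.511)]

# concatenate all pairs, round, and drop the (approximately) duplicated coordinates
pts = pd.DataFrame(np.concatenate([a, b, c]), columns=['x', 'y']).round(1)
unique_pts = pts.drop_duplicates()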
You might want to take a look at numpy.testing if you allow for AssertionError handling.
import numpy as np
from numpy import testing as ts

a = np.array((1.001, 3))
b = np.array((1.000101, 3.002))
ts.assert_array_almost_equal(a, b, decimal=1)  # output None
but
ts.assert_array_almost_equal(a, b, decimal=3)
results in
AssertionError:
Arrays are not almost equal to 3 decimals
Mismatch: 50%
Max absolute difference: 0.002
Max relative difference: 0.00089891
x: array([1.001, 3. ])
y: array([1. , 3.002])
There are some more interesting functions from numpy.testing. Make sure to take a look at the docs.
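If you prefer a boolean result over exception handling, np.isclose / np.allclose (which you already mentioned) perform the same kind of tolerance check; a small example with the arrays from above:

import numpy as np

a = np.array((1.001, 3))
b = np.array((1.000101, 3.002))

print(np.allclose(a, b, atol=1e-1))  # True  -- both coordinates agree to within ~0.1
print(np.allclose(a, b, atol=1e-3))  # False -- the y values differ by 0.002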
I'm using pandas to give you an intuitive result, rather than just numbers. Of course you can expand the solution to your needs.
Say you create a pd.DataFrame from each array and tag each with the array it belongs to. I am rounding the results to 2 decimal places; you may use whatever tolerance you want:
dfa = pd.DataFrame(a).round(2)
dfa['arr'] = 'a'
# build dfb and dfc the same way, tagged 'b' and 'c'
Then, by concatenating, using duplicated and sorting, you get an intuitive DataFrame that might fulfill your needs:
df = pd.concat([dfa, dfb, dfc])
df[df.duplicated(subset=[0,1], keep=False)].sort_values(by=[0,1])
yields
x y arr
0 1.00 3.0 a
0 1.00 3.0 b
0 1.00 3.0 c
1 1.33 4.2 a
1 1.33 4.2 c
2 17.83 3.4 a
2 17.83 3.4 b
The indexes are duplicated, so you can simply use reset_index() at the end and use the newly-generated column as a parameter that indicates the corresponding index on each array. I.e.:
index x y arr
0 0 1.00 3.0 a
1 0 1.00 3.0 b
2 0 1.00 3.0 c
3 1 1.33 4.2 a
4 1 1.33 4.2 c
5 2 17.83 3.4 a
6 2 17.83 3.4 b
So, for example, line 0 indicates a duplicate coordinate found at index 0 of arr a. Line 1 also indicates a dupe coordinate, found at index 0 of arr b, etc.
Now, if you just want to delete the duplicates and get one final array with only non-duplicate values, you may use drop_duplicates:
df.drop_duplicates(subset=[0,1])[[0,1]].to_numpy()
which yields
array([[ 1. , 3. ],
[ 1.33, 4.2 ],
[17.83, 3.4 ],
[ 1.67, 5.4 ],
[18.6 , 12.51]])
I have a 3-million-row dataframe that contains the following values:
d a0 a1 a2
1.5 10.0 5.0 1.0
0.8 10.0 2.0 0.0
I want to fill a fourth column with a linear interpolation of (a0, a1, a2) evaluated at the value in the "d" column:
d a0 a1 a2 newcol
1.5 10.0 5.0 1.0 3.0
0.8 10.0 2.0 0.0 3.6
newcol is the weighted average between a[int(d)] and a[int(d)+1], e.g. when d = 0.8, newcol = 0.2 * a0 + 0.8 * a1 because 0.8 is 80% of the way between 0 and 1.
I found that np.interp can be used, but there is no way for me to pass the three columns as the values vector:
df["newcol"]=np.interp(df["d"],[0,1,2], [100,200,300])
will indeed give me
d a0 a1 a2 newcol
1.5 10.0 5.0 1.0 250.0
0.8 10.0 2.0 0.0 180.0
BUT I have no way to specify that the values vector changes for each line:
df["newcol"]=np.interp(df["d"],[0,1,2], df[["a0","a1","a2"]])
gives me the following traceback :
File "C:\Python27\lib\site-packages\numpy\lib\function_base.py", line 1271, in interp
return compiled_interp(x, xp, fp, left, right)
ValueError: object too deep for desired array
Is there any way to use a different values vector for each line? Can you think of any workaround?
Basically, I could find no way to create this new column based on the definition: what is the value, at x = column "d", of the function that is piecewise linear between the given points and whose values at these points are described in the columns "ai"?
Edit: Before, I used scipy.interpolate.interp1d, which is not memory efficient; the comment helped me partially solve my problem.
Edit 2: I tried the approach from ev-br, who stated that I should try to code the loop myself:
for i in range(len(tps)):
    columns = ["a0", "a1", "a2"]
    length = len(columns)
    x = np.maximum(0, np.minimum(df.ix[i, "d"], length - 2))
    xint = int(x)
    xfrac = x - xint
    name1 = columns[xint]
    name2 = columns[xint + 1]
    tps.ix[i, "Multiplier"] = df.ix[i, name1] + xfrac * (df.ix[i, name2] - df.ix[i, name1])
The above loop processes only around 50 rows per second, so I guess I have a major optimisation issue. What part of working with a DataFrame am I doing wrong?
It might come a bit too late, but I would use np.interp with pandas' apply function. Creating the DataFrame from your example:
t = pd.DataFrame([[1.5,10,5,1],[0.8,10,2,0]], columns=['d', 'a0', 'a1', 'a2'])
Then comes the apply function:
t.apply(lambda x: np.interp(x.d, [0,1,2], x['a0':]), axis=1)
which yields:
0 3.0
1 3.6
dtype: float64
This is perfectly usable on "normal" datasets. However, the size of your DataFrame might call for a better/more optimized solution. The processing time scales linearly; my machine clocks in at about 10000 lines per second, which means roughly 5 minutes for 3 million rows...
OK, I have a second solution, which uses the numexpr module. This method is much more specific, but also much faster. I've measured the complete process to take 733 milliseconds for 1 million lines, which is not bad...
So we have the original DataFrame as before:
t = pd.DataFrame([[1.5,10,5,1],[0.8,10,2,0]], columns=['d', 'a0', 'a1', 'a2'])
We import the module and use it, but it requires that we separate the two cases where either 'a0' and 'a1' or 'a1' and 'a2' serve as the lower/upper limits for the linear interpolation. We also prepare the numbers so they can be fed to the same expression (hence the -1). We do that by creating 3 arrays holding the interpolation fraction (derived from 'd') and the limits, depending on the value of "d". So we have:
import numexpr as ne
lim = np.where(t.d > 1, [t.d-1, t.a1, t.a2], [t.d, t.a0, t.a1])
Then we evaluate the simple linear interpolation expression and finally add it as a new column like that:
x = ne.evaluate('(1-x)*a + x*b', local_dict={'x': lim[0], 'a': lim[1], 'b': lim[2]})
t['IP'] = x
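For comparison, a sketch of the same per-row interpolation done with plain numpy fancy indexing (no numexpr); it assumes the columns a0..a2 and 0 <= d <= 2, as in the example:

import numpy as np

vals = t[['a0', 'a1', 'a2']].to_numpy()
d = t['d'].to_numpy()

lo = np.clip(d.astype(int), 0, vals.shape[1] - 2)  # index of the left bracketing column
frac = d - lo                                      # fractional position between the two columns
rows = np.arange(len(t))

t['newcol'] = (1 - frac) * vals[rows, lo] + frac * vals[rows, lo + 1]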