I have a column of np.array data that I add to the last column of my pandas dataframe. However, I need the data sorted in ascending order inside that np.array. (It is not sorted in ascending order in the dataframe from which it is taken.)
dataframe structure:
GFP_spot_1_position, GFP_spot_2_position, GFP_spot_3_position, ...
0 _ 0.2, 0.4, 0.6, NaN
1 _ 0.8, 0.2, NaN, NaN
2 _ 0.7, 0.5, 0.6, 0.9
3 _ 0.5, NaN, 0.1, NaN
What I want it to look like:
gfp_spots_all
0 _ [0.2, 0.4, 0.6, nan]
1 _ [0.2, 0.8, nan, nan]
2 _ [0.5, 0.6, 0.7, 0.9]
3 _ [0.1, 0.5, nan, nan]
What it actually looks like with the code below:
gfp_spots_all
0 _ [0.2, 0.4, 0.6, NaN]
1 _ [0.8, 0.2, NaN, NaN]
2 _ [0.7, 0.5, 0.6, 0.9]
3 _ [0.5, NaN, 0.1, NaN]
Here's the code I have so far:
df = pd.read_csv('dfall.csv')
dfgfp = df.loc[:, 'GFP_spot_1_position':'GFP_spot_4_position']
df['gfp_spots_all'] = dfgfp.apply(lambda r: list(r),
axis=1).apply(np.array)
df.head()
I can't seem to sort the values in the array. Please help! Also, I'm new to Python, so I'm learning as I go. Please feel free to correct my sloppy code.
There must be a more Pythonic way to do it, but here is one way to solve this:
In [1]:
import pandas as pd
# Create the Dataframe
data = {'col1': [[9, 3], [2, 4], [7, 6], [3, 3], [8, 0], [0,4]], 'col2': [[1,3], [9,4], [4,2], [5,1], [3,7], [9,8]]}
df = pd.DataFrame(data=data)
# Loop over each row
for i in range(len(df)):
    # Loop over each column
    for k in range(len(df.columns)):
        # Each cell holds a list, so sort it in place
        df.iloc[i, k].sort()
df
Out [1]:
col1 col2
0 [3, 9] [1, 3]
1 [2, 4] [4, 9]
2 [6, 7] [2, 4]
3 [3, 3] [1, 5]
4 [0, 8] [3, 7]
5 [0, 4] [8, 9]
It seems you can, see the code below
arr = np.array([[3,5,1,7,4,2],[12,18,11,np.nan,np.nan,18]])
df = pd.DataFrame(arr)
print(df)
Output
0 1 2 3 4 5
0 3.0 5.0 1.0 7.0 4.0 2.0
1 12.0 18.0 11.0 NaN NaN 18.0
np.ndarray.sort(df.values)
print(df)
Output
0 1 2 3 4 5
0 1.0 2.0 3.0 4.0 5.0 7.0
1 11.0 12.0 18.0 18.0 NaN NaN
But it will mis-match values and columns, did you intend that?
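Whether sorting df.values in place actually changes df depends on the pandas version and on the column dtypes, since .values can return a copy or a read-only view. Here is a minimal sketch that avoids relying on that, assuming it is acceptable to build a new DataFrame (sorted_df is a name introduced here for illustration):

```python
import numpy as np
import pandas as pd

arr = np.array([[3, 5, 1, 7, 4, 2], [12, 18, 11, np.nan, np.nan, 18]])
df = pd.DataFrame(arr)

# np.sort returns a sorted copy; along axis=1 each row is sorted
# independently and NaN values are placed at the end of the row.
sorted_df = pd.DataFrame(np.sort(df.to_numpy(), axis=1),
                         index=df.index, columns=df.columns)
print(sorted_df)
```

The original df is left untouched, so the column/value mismatch is confined to sorted_df.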
As per @G. Anderson's comment, adding a sorted() call to your lambda expression will solve the issue. Actually, quite a bit of the code in your example is redundant:
dfgfp = df.loc[:, 'GFP_spot_1_position':'GFP_spot_4_position']
df['gfp_spots_all'] = dfgfp.apply(lambda r: sorted(r), axis=1)
I believe that will do what you require.
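One caveat worth checking: Python's sorted() relies on comparisons, and every comparison involving NaN is False, so rows that contain NaN may not come back in ascending order. A hedged alternative sketch uses np.sort, which places NaN at the end of each row. The small DataFrame below is a hypothetical stand-in for dfall.csv, built from the values shown in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfall.csv, using the values from the question.
df = pd.DataFrame({
    'GFP_spot_1_position': [0.2, 0.8, 0.7, 0.5],
    'GFP_spot_2_position': [0.4, 0.2, 0.5, np.nan],
    'GFP_spot_3_position': [0.6, np.nan, 0.6, 0.1],
    'GFP_spot_4_position': [np.nan, np.nan, 0.9, np.nan],
})

dfgfp = df.loc[:, 'GFP_spot_1_position':'GFP_spot_4_position']

# Sort each row in ascending order; NaN values end up last.
sorted_rows = np.sort(dfgfp.to_numpy(), axis=1)

# Store one 1-D array per row in an object column.
df['gfp_spots_all'] = pd.Series(list(sorted_rows), index=df.index)
print(df['gfp_spots_all'])
```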
# Here's what worked
df = pd.read_csv('dfall.csv')
dfgfp = df.loc[:, 'GFP_spot_1_position':'GFP_spot_4_position']
df['gfp_spots_all'] = dfgfp.apply(lambda r: list(r), axis=1).apply(np.array)
dfjust = pd.DataFrame([df.gfp_spots_all]).transpose()
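# Loop over each row
for i in range(len(dfjust)):
    # Loop over each column (only gfp_spots_all here)
    for k in range(len(dfjust.columns)):
        # Each cell holds a NumPy array; sort it in place (NaN goes last)
        dfjust.iloc[i, k].sort()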
dfjust.head()
[out:]
gfp_spots_all .
0 [3.4165, 19.63, nan, nan]
1 [6.7447, 18.044, nan, nan]
2 [5.088, 10.261, nan, nan]
3 [5.4081, 16.097, nan, nan]
4 [4.2675, nan, nan, nan]
5 rows × 1 columns
I can only use the numpy import.
I need to find, for each row of the training set, the closest row of the test set. That is, compute the distance between a training row and every row of the test array, take the smallest one, and return both the training name and the test name. The following formula is used:
dist(x, y) = √((a - a2)^2 + (b - b2)^2 + (c - c2)^2 + (d - d2)^2)
link to data used and expect first row.
This is the code I have; it works correctly for the first row of the training set. I need each row of the train array to go through the same operation as the one stored in the variable q.
Below is my input
Training
a b c d name training
5 3 1.6 0.2 G
5 3.4 1.6 0.4 G
5.5 2.4 3.7 1 R
5.8 2.7 3.9 1.2 R
7.2 3.2 6 1.8 Y
6.2 2.8 4.8 1.8 Y
testing
a2 b2 c2 d2 name true
5 3.6 1.4 0.2 E
5.4 3.9 1.7 0.4 G
6.9 3.1 4.9 1.5 R
5.5 2.3 4 1.3 R
6.4 2.7 5.3 1.9 Y
6.8 3 5.5 2.1 Y
train = np.asarray(train)
test = np.asarray(test)
print('Train shape',train.shape)
print('test shape',test.shape)
train_1 = train[:,0:(train.shape[1])-1].astype(float)
test_1 = test[:,0:(test.shape[1])-1].astype(float)
print('Train '+'\n',train_1)
print('test '+'\n',test_1)
q=min((np.sqrt(np.sum((train_1[0,:]-test_1)**2,axis=1,keepdims=True))))
I expect to get the closest distance from each training row to the entire test array. For the first training row, I would then return G,E, as those are the two rows that are closest.
You can use numpy.linalg.norm. Here is an example:
>>> import numpy as np
>>> arr = np.array([1, 2, 3, 4])
>>> np.linalg.norm(arr)
5.477225575051661
5.477225575051661 is the result of sqrt(1^2 + 2^2 + 3^2 + 4^2)
import numpy as np
train = np.array([[5, 3, 1.6, 0.2],
[5, 3.4, 1.6, 0.4],
[5.5, 2.4, 3.7, 1],
[5.8, 2.7, 3.9, 1.2],
[7.2, 3.2, 6, 1.8],
[6.2, 2.8, 4.8, 1.8]])
test = np.array([[5, 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4, 1.3],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3, 5.5, 2.1]])
# first get subtraction of each row of train to test
subtraction = train[:, None, :] - test[None, :, :]
# get distance from each train_row to test
s = np.linalg.norm(subtraction, axis=2, keepdims=True)
print(np.min(s, axis=1))
# get minimum
q = np.argmin(s, axis=1)
print("minimum indices:")
print(q)
output:
[[0.63245553]
[0.34641016]
[0.43588989]
[0.51961524]
[0.73484692]
[0.55677644]]
minimum indices:
[[0]
[0]
[3]
[3]
[5]
[4]]
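To also report the names, the indices can be mapped back to the label columns. This is a minimal sketch, assuming the labels are kept in two separate arrays (train_names and test_names are names introduced here, filled in from the tables in the question):

```python
import numpy as np

train = np.array([[5, 3, 1.6, 0.2],
                  [5, 3.4, 1.6, 0.4],
                  [5.5, 2.4, 3.7, 1],
                  [5.8, 2.7, 3.9, 1.2],
                  [7.2, 3.2, 6, 1.8],
                  [6.2, 2.8, 4.8, 1.8]])
test = np.array([[5, 3.6, 1.4, 0.2],
                 [5.4, 3.9, 1.7, 0.4],
                 [6.9, 3.1, 4.9, 1.5],
                 [5.5, 2.3, 4, 1.3],
                 [6.4, 2.7, 5.3, 1.9],
                 [6.8, 3, 5.5, 2.1]])
# Label columns from the question, kept apart from the numeric data.
train_names = np.array(['G', 'G', 'R', 'R', 'Y', 'Y'])
test_names = np.array(['E', 'G', 'R', 'R', 'Y', 'Y'])

# dists[i, j] is the Euclidean distance from train row i to test row j.
dists = np.linalg.norm(train[:, None, :] - test[None, :, :], axis=2)

# Index of the closest test row for every train row.
nearest = dists.argmin(axis=1)

# Pair each training name with the name of its closest test row.
pairs = list(zip(train_names.tolist(), test_names[nearest].tolist()))
for (train_name, test_name), d in zip(pairs, dists.min(axis=1)):
    print(train_name, test_name, round(float(d), 4))
```

Dropping keepdims=True keeps dists two-dimensional, which makes nearest a flat array that can index test_names directly.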
I have a 3D point cloud of n points stored as a NumPy array of shape (n, 3). For example, it could look something like this:
P = [[x1,y1,z1],[x2,y2,z2],[x3,y3,z3],[x4,y4,z4],[x5,y5,z5],.....[xn,yn,zn]]
I would like to be able to get the K-nearest neighbors of each point.
So, for example, the k nearest neighbors of P1 might be P2, P3, P4, P5, P6, and the KNN of P2 might be P100, P150, P2, etc.
How does one go about doing that in Python?
This can be solved neatly with scipy.spatial.distance.pdist.
First, let's create an example array that stores points in 3D space:
import numpy as np
N = 10 # The number of points
points = np.random.rand(N, 3)
print(points)
Output:
array([[ 0.23087546, 0.56051787, 0.52412935],
[ 0.42379506, 0.19105237, 0.51566572],
[ 0.21961949, 0.14250733, 0.61098618],
[ 0.18798019, 0.39126363, 0.44501143],
[ 0.24576538, 0.08229354, 0.73466956],
[ 0.26736447, 0.78367342, 0.91844028],
[ 0.76650234, 0.40901879, 0.61249828],
[ 0.68905082, 0.45289896, 0.69096152],
[ 0.8358694 , 0.61297944, 0.51879837],
[ 0.80963247, 0.1680279 , 0.87744732]])
For each point, we compute the distance to all other points:
from scipy.spatial import distance
D = distance.squareform(distance.pdist(points))
print(np.round(D, 1)) # Rounding to fit the array on screen
Output:
array([[ 0. , 0.4, 0.4, 0.2, 0.5, 0.5, 0.6, 0.5, 0.6, 0.8],
[ 0.4, 0. , 0.2, 0.3, 0.3, 0.7, 0.4, 0.4, 0.6, 0.5],
[ 0.4, 0.2, 0. , 0.3, 0.1, 0.7, 0.6, 0.6, 0.8, 0.6],
[ 0.2, 0.3, 0.3, 0. , 0.4, 0.6, 0.6, 0.6, 0.7, 0.8],
[ 0.5, 0.3, 0.1, 0.4, 0. , 0.7, 0.6, 0.6, 0.8, 0.6],
[ 0.5, 0.7, 0.7, 0.6, 0.7, 0. , 0.7, 0.6, 0.7, 0.8],
[ 0.6, 0.4, 0.6, 0.6, 0.6, 0.7, 0. , 0.1, 0.2, 0.4],
[ 0.5, 0.4, 0.6, 0.6, 0.6, 0.6, 0.1, 0. , 0.3, 0.4],
[ 0.6, 0.6, 0.8, 0.7, 0.8, 0.7, 0.2, 0.3, 0. , 0.6],
[ 0.8, 0.5, 0.6, 0.8, 0.6, 0.8, 0.4, 0.4, 0.6, 0. ]])
You read this distance matrix like this: the distance between points 1 and 5 is D[0, 4]. You can also see that the distance between each point and itself is 0; for example, D[6, 6] == 0.
We argsort each row of the distance matrix to get for each point a list of which points are closest:
closest = np.argsort(D, axis=1)
print(closest)
Output:
[[0 3 1 2 5 7 4 6 8 9]
[1 2 4 3 7 0 6 9 8 5]
[2 4 1 3 0 7 6 9 5 8]
[3 0 2 1 4 7 6 5 8 9]
[4 2 1 3 0 7 9 6 5 8]
[5 0 7 3 6 2 8 4 1 9]
[6 7 8 9 1 0 3 2 4 5]
[7 6 8 9 1 0 3 2 4 5]
[8 6 7 9 1 0 3 5 2 4]
[9 6 7 1 8 4 2 0 3 5]]
Again, we see that each point is closest to itself. So, disregarding that, we can now select the k closest points:
k = 3 # For each point, find the 3 closest points
print(closest[:, 1:k+1])
Output:
[[3 1 2]
[2 4 3]
[4 1 3]
[0 2 1]
[2 1 3]
[0 7 3]
[7 8 9]
[6 8 9]
[6 7 9]
[6 7 1]]
For example, we see that for point 4 (row index 3), the k=3 closest points are points 1, 3 and 2 (row indices 0, 2 and 1).
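If only the k nearest neighbours are needed, a full argsort of every row does more work than necessary. The sketch below is one possible NumPy-only variant, not part of the answer above: it uses np.argpartition to pick the k smallest distances and then orders just those. The seed and the name knn are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
points = rng.random((10, 3))
k = 3

# Brute-force pairwise Euclidean distances, shape (N, N).
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
np.fill_diagonal(D, np.inf)      # a point should not be its own neighbour

# Indices of the k smallest distances per row, in arbitrary order.
candidates = np.argpartition(D, k - 1, axis=1)[:, :k]

# Order those k candidates by their distance.
order = np.argsort(np.take_along_axis(D, candidates, axis=1), axis=1)
knn = np.take_along_axis(candidates, order, axis=1)
print(knn)
```

Setting the diagonal to infinity replaces the closest[:, 1:k+1] slice as the way to skip the self-match.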
@marijn-van-vliet's solution works well in most scenarios. However, it is a brute-force approach, so if the point cloud is relatively large or you have computational or time constraints, you might want to build a KD-tree for fast retrieval of a point's K nearest neighbors.
In Python, the sklearn library provides an easy-to-use implementation: sklearn.neighbors.KDTree
from sklearn.neighbors import KDTree
tree = KDTree(pcloud)
# Find the K nearest neighbors of P1, which has shape (1, 3);
# query returns the distances first, then the indices
distances, indices = tree.query(P1, k=K)
(Also see the following answer in another post for more detailed usage and output: https://stackoverflow.com/a/48127117/4406572)
Many other libraries also implement KD-tree based KNN retrieval, including Open3D (FLANN based) and scipy.
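As a rough illustration of the scipy route, here is a sketch using scipy.spatial.cKDTree. The random pcloud array and the value of k are placeholders, not data from the question:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
pcloud = rng.random((1000, 3))   # placeholder point cloud, shape (n, 3)
k = 5

tree = cKDTree(pcloud)

# Ask for k + 1 neighbours, because each point's nearest neighbour is itself.
distances, indices = tree.query(pcloud, k=k + 1)

# Drop the self-match in column 0 to keep the k true neighbours.
neighbours = indices[:, 1:]
print(neighbours[:3])
```

Querying all points at once returns arrays of shape (n, k + 1), sorted by increasing distance within each row.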
I have 2 symmetric matrices, one of them being a correlation matrix and the other one similar to a correlation matrix. Examples of these matrices are shown below:
Correlation Matrix (c):
A B C D
A 1 0.5 0.1 0.4
B 0.5 1 0.9 0.3
C 0.1 0.9 1 0.3
D 0.4 0.3 0.3 1
Other Matrix (z):
A B C D
A 3 2 2 2
B 2 3 3 2
C 2 3 3 2
D 2 2 2 3
I'm ordering the correlation matrix in descending order so I can look at the top-most correlation values, using the following code:
c = corrMatrixMin10.abs()
s = c.unstack()
so = s.sort_values(kind="quicksort")
pd.DataFrame(so[so.values!=1].sort_values(ascending=False))
My question is as follows:
When I arrange the correlation matrix c in a descending order, the correlation matrix itself loses its shape. How do I have the other matrix z in the exact same order?
For example: The intersection of columns A and B in the matrix c is 0.5. The intersection of columns A and B in the matrix z is 2. How can I still preserve this order to associate these 2 values after arranging the matrix c in a descending order?
Any help would be greatly appreciated. TIA.
The code to generate the 2 matrices is as follows:
c = pd.DataFrame([[1, 0.5, 0.1, 0.4],
                  [0.5, 1, 0.9, 0.3],
                  [0.1, 0.9, 1, 0.3],
                  [0.4, 0.3, 0.3, 1]],
                 index=list('ABCD'), columns=list('ABCD'))
z = pd.DataFrame([[3, 2, 2, 2],
                  [2, 3, 3, 2],
                  [2, 3, 3, 2],
                  [2, 2, 2, 3]],
                 index=list('ABCD'), columns=list('ABCD'))
You can use Series.reindex:
c_series = c.unstack().drop([(x, x) for x in c]).sort_values(ascending=False)
z_series = z.unstack().reindex(c_series.index)
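To see the two series side by side, they can be combined into one table. This is a minimal sketch, assuming both matrices carry the labels A to D on rows as well as columns (the drop of the diagonal pairs depends on that); paired is a name introduced here for illustration:

```python
import pandas as pd

labels = list('ABCD')
c = pd.DataFrame([[1, 0.5, 0.1, 0.4],
                  [0.5, 1, 0.9, 0.3],
                  [0.1, 0.9, 1, 0.3],
                  [0.4, 0.3, 0.3, 1]],
                 index=labels, columns=labels)
z = pd.DataFrame([[3, 2, 2, 2],
                  [2, 3, 3, 2],
                  [2, 3, 3, 2],
                  [2, 2, 2, 3]],
                 index=labels, columns=labels)

# Flatten c to (column, row) pairs, drop the diagonal, sort descending.
c_series = c.unstack().drop([(x, x) for x in c]).sort_values(ascending=False)

# Put z in exactly the same (column, row) order as the sorted c.
z_series = z.unstack().reindex(c_series.index)

# One row per pair, with the c value next to its z value.
paired = pd.concat([c_series, z_series], axis=1, keys=['c', 'z'])
print(paired)
```

Because the matrices are symmetric, every pair appears twice, once as (A, B) and once as (B, A).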
Example:
arr = np.array([[.5, .25, .19, .05, .01],[.25, .5, .19, .05, .01],[.5, .25, .19, .05, .01]])
print(arr)
[[ 0.5 0.25 0.19 0.05 0.01]
[ 0.25 0.5 0.19 0.05 0.01]
[ 0.5 0.25 0.19 0.05 0.01]]
idxs = np.argsort(arr)
print(idxs)
[[4 3 2 1 0]
[4 3 2 0 1]
[4 3 2 1 0]]
How can I use idxs to index arr? I want to do something like arr[idxs], but this does not work.
It's not the prettiest, but I think something like
>>> arr[np.arange(len(arr))[:,None], idxs]
array([[ 0.01, 0.05, 0.19, 0.25, 0.5 ],
[ 0.01, 0.05, 0.19, 0.25, 0.5 ],
[ 0.01, 0.05, 0.19, 0.25, 0.5 ]])
should work. The first term gives the x coordinates we want (using broadcasting over the last singleton axis):
>>> np.arange(len(arr))[:,None]
array([[0],
[1],
[2]])
with idxs providing the y coordinates. Note that if we had used unravel_index, the x coordinates to use would always have been 0 instead:
>>> np.unravel_index(idxs, arr.shape)[0]
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
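In NumPy 1.15 and later, np.take_along_axis performs the same row-wise lookup without building the row coordinates by hand. A short sketch with the arrays from the question (result is a name introduced here):

```python
import numpy as np

arr = np.array([[.5, .25, .19, .05, .01],
                [.25, .5, .19, .05, .01],
                [.5, .25, .19, .05, .01]])
idxs = np.argsort(arr)  # argsort works along the last axis by default

# For every row i and position j, pick arr[i, idxs[i, j]].
result = np.take_along_axis(arr, idxs, axis=1)
print(result)
```

Here the result matches np.sort(arr, axis=1), but the same call works for any index array of a compatible shape.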
How about something like this:
I changed variables to make the example more clear, but you basically need to index by two 2D arrays.
In [102]: a = np.array([[1,2,3], [4,5,6]])
In [103]: b = np.array([[0,2,1], [2,1,0]])
In [104]: temp = np.repeat(np.arange(a.shape[0]), a.shape[1]).reshape(a.shape).T
# temp is just [[0,1], [0,1], [0,1]]
# probably can be done more elegantly
In [105]: a[temp, b.T].T
Out[105]:
array([[1, 3, 2],
[6, 5, 4]])