Normalize the rows of a numpy array based on a custom function - Python

I have a numpy array and want to normalize each row based on this formula
x_norm = (x - x_min) / (x_max - x_min)
where x_min is the minimum of the row and x_max is the maximum of the row. Here is a simple example:
a = np.array([[0, 1, 2],
              [2, 4, 7],
              [6, 10, 5]])
and the desired output:
a = np.array([[0, 0.5, 1],
              [0, 0.4, 1],
              [0.2, 1, 0]])
Thank you

IIUC, you can use raw numpy operations:
x = np.array([[0, 1, 2],
              [2, 4, 7],
              [6, 10, 5]])
x_norm = ((x.T - x.min(1)) / (x.max(1) - x.min(1))).T
# OR
x_norm = (x - x.min(1)[:, None]) / (x.max(1) - x.min(1))[:, None]
Output:
array([[0. , 0.5, 1. ],
       [0. , 0.4, 1. ],
       [0.2, 1. , 0. ]])
NB: if efficiency matters, save the result of x.min(1) in a variable, as it is used twice; see the sketch below.
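For example, a minimal sketch that computes the row minima and maxima only once (the variable names are my own):
row_min = x.min(1)[:, None]   # column vector of row minima, shape (n, 1)
row_max = x.max(1)[:, None]   # column vector of row maxima, shape (n, 1)
x_norm = (x - row_min) / (row_max - row_min)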

You could use np.apply_along_axis
a = np.array([[0, 1, 2],
              [2, 4, 7],
              [6, 10, 5]])

def scaler(x):
    return (x - x.min()) / (x.max() - x.min())

np.apply_along_axis(scaler, axis=1, arr=a)
Output:
array([[0. , 0.5, 1. ],
       [0. , 0.4, 1. ],
       [0.2, 1. , 0. ]])

Related

How can I get the lengths of the edges associated with each point?

I built a Delaunay triangulation in Python.
I have 8 points (black), which generate 14 edges (gray).
How can I get the lengths of the edges associated with each point?
The matrix I want lists, for each point, the lengths of the edges connected to it, such as
[[P1, E1_length, E2_length, ...], [P2, E6_length, E7_length, ...], ...]
import numpy as np
points = np.array([[0, 0], [0, 1.1], [1, 0], [1, 1],[1.5, 0.6],[1.2, 0.5],[1.7, 0.9],[1.1, 0.1],])
from scipy.spatial import Delaunay
tri = Delaunay(points)
import matplotlib.pyplot as plt
plt.triplot(points[:, 0], points[:, 1], tri.simplices.copy(), color='0.7')
plt.plot(points[:, 0], points[:, 1], 'o', color='0.3')
plt.show()
New answer
Here's an approach which will give you a dictionary of points and edge lengths associated with each point:
simplices = points[tri.simplices]
edge_lengths = {}
for point in points:
    key = tuple(point)
    vertex_edges = edge_lengths.get(key, [])
    # mask of simplices that contain this point
    adjacency_mask = np.isin(simplices, point).all(axis=2).any(axis=1)
    for simplex in simplices[adjacency_mask]:
        # mask out the point itself within the simplex
        self_mask = np.isin(simplex, point).all(axis=1)
        for other in simplex[~self_mask]:
            dist = np.linalg.norm(point - other)
            if dist not in vertex_edges:
                vertex_edges.append(dist)
    edge_lengths[key] = vertex_edges
Output:
{(0.0, 0.0): [1.4142135623730951, 1.1, 1.3, 1.0],
(0.0, 1.1): [1.004987562112089, 1.3416407864998738, 1.4866068747318506],
(1.0, 0.0): [1.4866068747318506, 0.5385164807134504, 0.7810249675906654, 1.140175425099138, 0.14142135623730956],
(1.0, 1.0): [1.004987562112089, 1.4142135623730951, 0.5385164807134504, 0.6403124237432849, 0.7071067811865475],
(1.5, 0.6): [0.6403124237432849, 0.36055512754639896, 0.31622776601683794, 0.6403124237432848],
(1.2, 0.5): [0.5385164807134504, 1.3, 0.31622776601683794, 0.41231056256176607],
(1.7, 0.9): [0.7071067811865475, 0.36055512754639896],
(1.1, 0.1): [0.14142135623730956, 0.41231056256176607, 0.6403124237432848]}
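If you need the [[P1, E1_length, ...], ...] layout from the question rather than a dict, one hedged way is to flatten it in point order (a sketch; the point's index stands in for P1, P2, ..., and the rows are ragged, so this stays a plain list of lists rather than a rectangular array):
edge_matrix = [[i] + edge_lengths[tuple(p)] for i, p in enumerate(points)]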
Old answer before requirements changed
The Delaunay object has a simplices attribute which gives the indices of the points that make up each simplex. Using scipy.spatial.distance.pdist() and advanced indexing, you can get all the edge lengths like so:
>>> from scipy.spatial.distance import pdist
>>> edge_lengths = np.array([pdist(x) for x in points[tri.simplices]])
>>> edge_lengths
array([[1.00498756, 1.41421356, 1.1       ],
       [0.53851648, 1.3       , 1.41421356],
       [0.53851648, 1.        , 1.3       ],
       [0.64031242, 0.70710678, 0.36055513],
       [0.64031242, 0.31622777, 0.53851648],
       [0.14142136, 0.53851648, 0.41231056],
       [0.64031242, 0.41231056, 0.31622777]])
Note, however, that edge lengths are duplicated here, since every simplex shares at least one edge with another simplex.
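If you want each edge counted only once, a hedged sketch is to collect the sorted vertex-index pairs of every simplex in a set before measuring (unique_edges and lengths are my own names):
from itertools import combinations
unique_edges = {tuple(sorted(pair))
                for s in tri.simplices
                for pair in combinations(s, 2)}   # 14 undirected edges
lengths = {e: np.linalg.norm(points[e[0]] - points[e[1]])
           for e in unique_edges}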
Step-by-step
The tri.simplices attribute gives the indices in points for each vertex in each simplex in the Delaunay object:
>>> tri.simplices
array([[2, 6, 5],
       [7, 2, 5],
       [0, 7, 5],
       [2, 1, 4],
       [1, 2, 7],
       [0, 3, 7],
       [3, 1, 7]], dtype=int32)
Using advanced indexing, we can get all the points which make up the simplices:
>>> points[tri.simplices]
array([[[1. , 1. ],
        [0. , 1.1],
        [0. , 0. ]],

       [[1.2, 0.5],
        [1. , 1. ],
        [0. , 0. ]],

       [[1. , 0. ],
        [1.2, 0.5],
        [0. , 0. ]],

       [[1. , 1. ],
        [1.5, 0.6],
        [1.7, 0.9]],

       [[1.5, 0.6],
        [1. , 1. ],
        [1.2, 0.5]],

       [[1. , 0. ],
        [1.1, 0.1],
        [1.2, 0.5]],

       [[1.1, 0.1],
        [1.5, 0.6],
        [1.2, 0.5]]])
Finally, each subarray here represents a simplex and the three points which form it. Using scipy.spatial.distance.pdist(), we can get the pairwise distances between the points of each simplex by iterating over the simplices:
>>> np.array([pdist(x) for x in points[tri.simplices]])
array([[1.00498756, 1.41421356, 1.1       ],
       [0.53851648, 1.3       , 1.41421356],
       [0.53851648, 1.        , 1.3       ],
       [0.64031242, 0.70710678, 0.36055513],
       [0.64031242, 0.31622777, 0.53851648],
       [0.14142136, 0.53851648, 0.41231056],
       [0.64031242, 0.41231056, 0.31622777]])
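For reference, pdist returns the distances of a 3-point simplex in the fixed pair order (v0, v1), (v0, v2), (v1, v2), so each column above can be mapped back to a concrete vertex pair of the corresponding simplex.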

How to row-normalize a feature matrix? Broadcasting error

I have a feature matrix that I want to row-normalize.
This is what I have done, based on min-max scaling, and I am getting an error. Can anyone help me with this error?
a = np.random.randint(10, size=(4, 5))
s = a.max(axis=1) - a.min(axis=1)
np.amax(a, axis=1)
print(s)
(a - a.min(axis=1)) / (a.max(axis=1) - a.min(axis=1))
The print gives [7 6 4 5], and the last line raises:
ValueError: operands could not be broadcast together with shapes (4,5) (4,)
Try working with the transposed matrix:
b = a.T
m = (b - b.min(axis=0)) / (b.max(axis=0) - b.min(axis=0))
m = m.T
>>> a
array([[2, 3, 2, 8, 3],   # min=2 -> 0, max=8 -> 1
       [3, 3, 9, 2, 1],   # min=1 -> 0, max=9 -> 1
       [1, 9, 8, 4, 7],   # min=1 -> 0, max=9 -> 1
       [6, 8, 7, 9, 4]])  # min=4 -> 0, max=9 -> 1
>>> m
array([[0.        , 0.16666667, 0.        , 1.        , 0.16666667],
       [0.25      , 0.25      , 1.        , 0.125     , 0.        ],
       [0.        , 1.        , 0.875     , 0.375     , 0.75      ],
       [0.4       , 0.8       , 0.6       , 1.        , 0.        ]])
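The broadcasting error can also be avoided without transposing by keeping the reduced axis, so the row statistics have shape (4, 1) and broadcast against the (4, 5) array (a minimal sketch):
mn = a.min(axis=1, keepdims=True)   # shape (4, 1)
mx = a.max(axis=1, keepdims=True)   # shape (4, 1)
m = (a - mn) / (mx - mn)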
I have an alternative solution, but I am not sure if it is correct. It would be great if someone could comment on it.
def row_normalize(mf):
    row_sums = np.array(mf.sum(1))
    new_matrix = mf / row_sums[:, np.newaxis]
    return new_matrix
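A note on that alternative: dividing by row_sums rescales each row to sum to 1, which is not the same as the min-max scaling above. For example, the row [2, 3, 5] becomes [0.2, 0.3, 0.5] under row-sum normalization but [0, 1/3, 1] under min-max scaling, so use whichever matches your definition of "row-normalize".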

Efficient probability tree branching

As an example, I have an array of branches and probabilities that looks like this:
paths = np.array([
    [1, 0, 1.0],
    [2, 0, 0.4],
    [2, 1, 0.6],
    [3, 1, 1.0],
    [5, 1, 0.25],
    [5, 2, 0.5],
    [5, 4, 0.25],
    [6, 0, 0.7],
    [6, 5, 0.2],
    [6, 2, 0.1]])
The columns are upper node, lower node, probability.
Here's a visual of the nodes:
         6
       / | \
      5  0  2
    / | \  / \
   1  2  4 0  1
   | / \      |
   0 0 1      0
       |
       0
I want to be able to pick a starting node and output an array of the branches and cumulative probabilities, including all the duplicate branches. For example:
start_node = 5 should return
array([[5, 1, 0.25],
       [5, 2, 0.5],
       [5, 4, 0.25],
       [1, 0, 0.25],
       [2, 0, 0.2],
       [2, 1, 0.3],
       [1, 0, 0.3]])
Notice the [1, 0, x] branch is included twice, as it's fed by both the [5, 1, 0.25] branch (0.25 × 1.0 = 0.25) and the [2, 1, 0.3] branch (0.3 × 1.0 = 0.3).
Here's some code I got working but it's far too slow for my application (millions of branches):
def branch(start_node, paths):
    output = paths[paths[:, 0] == start_node]
    next_nodes = output
    while True:
        can_go_lower = np.isin(next_nodes[:, 1], paths[:, 0])
        if ~np.any(can_go_lower):
            break
        next_nodes_checked = next_nodes[can_go_lower]
        next_nodes = np.empty([0, 3])
        for nodes in next_nodes_checked:
            to_append = paths[paths[:, 0] == nodes[1]]
            to_append[:, 2] *= nodes[2]
            next_nodes = np.append(next_nodes, to_append, axis=0)
        output = np.append(output, next_nodes, axis=0)
    return output
The branches always go from higher to lower, therefore getting caught in cycles isn't a concern. A way to vectorize the for loop and avoid the appends would be the best optimization, I think.
Instead of storing the graph in a numpy array, let's store it in a dict keyed by the upper node:
tree = {k: paths[paths[:, 0] == k] for k in np.unique(paths[:, 0])}
Make a set of the nodes which are non-leaf:
non_leaf_nodes = set(np.unique(paths[:, 0]))
Now to find the branch and cumulative probability:
def branch(start_node, tree, non_leaf_nodes):
    curr_nodes = [[start_node, start_node, 1.0]]  # (prev_node, current_node, cumulative_probability)
    output = []
    while True:
        next_nodes = []
        for _, node, prob in curr_nodes:
            if node not in non_leaf_nodes:
                continue
            subtree = tree[node]
            to_append = subtree.copy()
            to_append[:, 2] *= prob  # accumulate the probability down the branch
            to_append = to_append.tolist()
            output += to_append
            next_nodes += to_append
        curr_nodes = next_nodes
        if len(curr_nodes) == 0:
            break
    return np.array(output)
Output:
>>> branch(5, tree, non_leaf_nodes)
array([[5.  , 1.  , 0.25],
       [5.  , 2.  , 0.5 ],
       [5.  , 4.  , 0.25],
       [1.  , 0.  , 0.25],
       [2.  , 0.  , 0.2 ],
       [2.  , 1.  , 0.3 ],
       [1.  , 0.  , 0.3 ]])
I am expecting it to work faster. Let me know.

Building NumPy array using values from another array

Consider the following code:
import numpy as np
index_info = np.matrix([[1, 1], [1, 2]])
value = np.matrix([[0.5, 0.5]])
initial = np.zeros((3, 3))
How can I produce a matrix, final, which has the structure of initial with the elements specified by value placed at the locations specified by index_info, WITHOUT a for loop? For this toy example, see below.
final = np.matrix([[0, 0, 0], [0, 0.5, 0.5], [0, 0, 0]])
With a for loop, you can easily loop through all of the indices in index_info and value and use them to populate initial and form final. But is there a way to do so with vectorization (no for loop)?
Convert index_info to a tuple and use it to assign:
>>> initial[(*index_info,)]=value
>>> initial
array([[0. , 0. , 0. ],
       [0. , 0.5, 0.5],
       [0. , 0. , 0. ]])
Please note that use of the matrix class is discouraged. Use ndarray instead.
You can do this with NumPy's array indexing:
>>> initial = np.zeros((3, 3))
>>> row = np.array([1, 1])
>>> col = np.array([1, 2])
>>> final = np.zeros_like(initial)
>>> final[row, col] = [0.5, 0.5]
>>> final
array([[0. , 0. , 0. ],
       [0. , 0.5, 0.5],
       [0. , 0. , 0. ]])
This is similar to @PaulPanzer's answer, where he is unpacking row and col from index_info all in one step. In other words:
row, col = (*index_info,)
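One caveat (an assumption about a possible use case, not something the question asks for): if index_info lists the same location more than once, plain fancy-index assignment keeps only the last value written. To accumulate duplicates instead, np.add.at performs an unbuffered in-place add; a sketch with a hypothetical repeated index:
row = np.array([1, 1])                     # same location twice
col = np.array([1, 1])
final = np.zeros((3, 3))
np.add.at(final, (row, col), [0.5, 0.5])   # final[1, 1] is now 1.0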

Addition of every two columns

I would like to calculate the sum of every two columns in a matrix (the sum of columns 0 and 1, of columns 2 and 3, and so on).
I tried nested "for" loops, but I never get the right result.
For example:
c = np.array([[0, 0, 0.25, 0.5], [0, 0.5, 0.25, 0], [0.5, 0, 0, 0]], float)
freq = np.zeros(6, float).reshape((3, 2))
# I calculate the sum between the first and second columns, and between the third and fourth columns
for i in range(0, 4, 2):
    for j in range(1, 4, 2):
        for p in range(0, 2):
            freq[:, p] = c[:, i] + c[:, j]
But the result is:
print freq
array([[ 0.75,  0.75],
       [ 0.25,  0.25],
       [ 0.  ,  0.  ]])
Normally, the correct result should be (0., 0.5, 0.5) and (0.75, 0.25, 0.), so I think the problem is in the nested "for" loops.
Does anyone know how I can calculate the sum of every two columns? I have a matrix with 400 columns.
You can simply reshape to split the last dimension into two, with the new last dimension of length 2, and then sum along it, like so -
freq = c.reshape(c.shape[0], -1, 2).sum(2).T
Reshaping only creates a view into the array, so effectively we are only paying for the summing operation here, and as such this must be efficient.
Sample run -
In [17]: c
Out[17]:
array([[ 0.  ,  0.  ,  0.25,  0.5 ],
       [ 0.  ,  0.5 ,  0.25,  0.  ],
       [ 0.5 ,  0.  ,  0.  ,  0.  ]])

In [18]: c.reshape(c.shape[0],-1,2).sum(2).T
Out[18]:
array([[ 0.  ,  0.5 ,  0.5 ],
       [ 0.75,  0.25,  0.  ]])
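The same reshape trick generalizes to summing every k consecutive columns, assuming the column count is divisible by k (a sketch; k is a hypothetical group size):
k = 2
freq = c.reshape(c.shape[0], -1, k).sum(axis=2)  # shape (rows, cols // k)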
Add the slices c[:, ::2] and c[:, 1::2]:
In [62]: c
Out[62]:
array([[ 0.  ,  0.  ,  0.25,  0.5 ],
       [ 0.  ,  0.5 ,  0.25,  0.  ],
       [ 0.5 ,  0.  ,  0.  ,  0.  ]])

In [63]: c[:, ::2] + c[:, 1::2]
Out[63]:
array([[ 0.  ,  0.75],
       [ 0.5 ,  0.25],
       [ 0.5 ,  0.  ]])
Here is one way using np.split():
In [36]: np.array(np.split(c, np.arange(2, c.shape[1], 2), axis=1)).sum(axis=-1)
Out[36]:
array([[ 0.  ,  0.5 ,  0.5 ],
       [ 0.75,  0.25,  0.  ]])
Or, as a more general way that works even for an odd number of columns:
In [87]: def vertical_adder(array):
   ....:     return np.column_stack([np.sum(arr, axis=1) for arr in np.array_split(array, np.arange(2, array.shape[1], 2), axis=1)])
   ....:

In [88]: vertical_adder(c)
Out[88]:
array([[ 0.  ,  0.75],
       [ 0.5 ,  0.25],
       [ 0.5 ,  0.  ]])
In [94]: a
Out[94]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [95]: vertical_adder(a)
Out[95]:
array([[ 1,  5,  4],
       [11, 15,  9],
       [21, 25, 14]])
