Replace specific values in a matrix using Python

I have an m x n matrix where each row is a sample and each column is a class. Each row contains the softmax probabilities of each class. I want to replace the maximum value in each row with 1 and all others with 0. How can I do this efficiently in Python?

Some made up data:
>>> a = np.random.rand(5, 5)
>>> a
array([[ 0.06922196,  0.66444783,  0.2582146 ,  0.03886282,  0.75403153],
       [ 0.74530361,  0.36357237,  0.3689877 ,  0.71927017,  0.55944165],
       [ 0.84674582,  0.2834574 ,  0.11472191,  0.29572721,  0.03846353],
       [ 0.10322931,  0.90932896,  0.03913152,  0.50660894,  0.45083403],
       [ 0.55196367,  0.92418942,  0.38171512,  0.01016748,  0.04845774]])
In one line:
>>> (a == a.max(axis=1)[:, None]).astype(int)
array([[0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0]])
A more efficient (and verbose) approach:
>>> b = np.zeros_like(a, dtype=int)
>>> b[np.arange(a.shape[0]), np.argmax(a, axis=1)] = 1
>>> b
array([[0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0]])
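One caveat worth noting: the two approaches differ when a row contains a duplicated maximum. The comparison marks every occurrence of the row maximum, while argmax marks only the first. A small sketch, using a hypothetical row with a tie:
>>> t = np.array([[0.5, 0.9, 0.9]])
>>> (t == t.max(axis=1)[:, None]).astype(int)
array([[0, 1, 1]])
>>> b = np.zeros_like(t, dtype=int)
>>> b[np.arange(t.shape[0]), np.argmax(t, axis=1)] = 1
>>> b
array([[0, 1, 0]])
With softmax probabilities exact ties are rare, but it is worth knowing which behavior you are getting.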

I think the best answer to your particular question is to use a sparse matrix.
Since most of the matrix is populated with zeroes, a sparse matrix should be the most performant way to store large numbers of these matrices at large sizes in a memory-friendly fashion. It should be superior to using numpy arrays directly, especially for matrices that are very large in both dimensions: if not in speed of computation, then in memory.
import numpy as np
import scipy.sparse  # scipy.sparse must be imported explicitly

matrix = np.matrix(np.random.randn(10, 5))
maxes = matrix.argmax(axis=1).A1
# was .A[:, 0], slightly faster, but .A1 seems more readable
n_rows = len(matrix)  # could do matrix.shape[0], but that's slower
data = np.ones(n_rows)
row = np.arange(n_rows)
sparse_matrix = scipy.sparse.coo_matrix((data, (row, maxes)),
                                        shape=matrix.shape,
                                        dtype=np.int8)
This sparse_matrix object should be very lightweight relative to a regular matrix object, which would needlessly track each and every zero in it. To materialize it as a normal matrix:
sparse_matrix.todense()
returns:
matrix([[0, 0, 0, 0, 1],
        [0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 1, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0]], dtype=int8)
This can be compared to the original matrix:
matrix([[ 1.41049496,  0.24737968, -0.70849012,  0.24794031,  1.9231408 ],
        [-0.08323096, -0.32134873,  2.14154425, -1.30430663,  0.64934781],
        [ 0.56249379,  0.07851507,  0.63024234, -0.38683508, -1.75887624],
        [-0.41063182,  0.15657594,  0.11175805,  0.37646245,  1.58261556],
        [ 1.10421356, -0.26151637,  0.64442885, -1.23544526, -0.91119517],
        [ 0.51384883,  1.5901419 ,  1.92496778, -1.23541699,  1.00231508],
        [-2.42759787, -0.23592018, -0.33534536,  0.17577329, -1.14793293],
        [-0.06051458,  1.24004714,  1.23588228, -0.11727146, -0.02627196],
        [ 1.66071534, -0.07734444,  1.40305686, -1.02098911, -1.10752638],
        [ 0.12466003, -1.60874191,  1.81127175,  2.26257234, -1.26008476]])
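If you want to sanity-check the memory claim for your own sizes, here is a rough sketch (the exact index dtypes, and hence the byte counts, can vary by platform and scipy version):
one_hot_bytes = sparse_matrix.todense().nbytes   # dense one-hot: m * n bytes at int8
sparse_bytes = (sparse_matrix.data.nbytes
                + sparse_matrix.row.nbytes
                + sparse_matrix.col.nbytes)      # roughly 9 bytes per stored entry
print(one_hot_bytes, sparse_bytes)
For a small 10 x 5 matrix the index overhead actually makes the sparse form larger; the savings only appear once the number of columns grows well past the per-row index cost.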

This approach using basic numpy and list comprehensions works, but is the least performant. I'm leaving this answer here as it may be somewhat instructive. First we create a numpy matrix:
matrix = np.matrix(np.random.randn(2,2))
matrix is, e.g.:
matrix([[-0.84558168,  0.08836042],
        [-0.01963479,  0.35331933]])
Now map each element to 1 in a new matrix if it is the row maximum, else 0:
newmatrix = np.matrix([[1 if i == row.max() else 0 for i in row]
                       for row in np.array(matrix)])
newmatrix is now:
matrix([[0, 1],
        [0, 1]])

import numpy as np

Y = np.random.rand(10, 10)
X = np.zeros((5, 5))
offset = (1, 2)

# Copy each element of X into Y, shifted by the (row, column) offset.
for index_x, row in enumerate(X):
    for index_y, e in enumerate(row):
        Y[index_x + offset[0]][index_y + offset[1]] = e
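The nested loops can be replaced by a single slice assignment, which does the same copy in one vectorized operation (using the same Y, X, and offset as above):
# Rows offset[0]:offset[0]+X.shape[0] and columns offset[1]:offset[1]+X.shape[1]
# of Y receive X in a single vectorized copy.
Y[offset[0]:offset[0] + X.shape[0], offset[1]:offset[1] + X.shape[1]] = X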


How to set a limited number of random values in a numpy matrix

How can I set a limited number of random values, constrained by amount and range, in a numpy matrix?
That is, instead of:
random_matrix = np.random.rand(5, 5)
[[0.38555213 0.96454126 0.91586422 0.92638243 0.85516641]
 [0.64717218 0.2716665  0.70945594 0.74754943 0.48870502]
 [0.23381316 0.01992578 0.86749684 0.85797792 0.19308509]
 [0.63565231 0.7056163  0.69110815 0.73506642 0.804646  ]
 [0.35512519 0.54900446 0.66311323 0.04899527 0.49349834]]
the desired result is, for example, 3 random integers in the range 1-5 placed in an otherwise null matrix:
0,0,0,4,0
0,0,0,0,0
0,1,0,0,0
0,0,0,3,0
0,0,0,0,0
Thanks in advance
If I understand the question correctly, you want to create a matrix that is zero everywhere except for 3 random indices, which will hold random values in the range 1-5.
For this I would suggest:
null_matrix = np.zeros((5, 5), dtype=np.int32)
rng = np.random.default_rng()
x = rng.choice(5, size=3, replace=False)
y = rng.choice(5, size=3, replace=False)
null_matrix[x, y] = rng.choice(np.arange(1, 6), 3)  # upper bound is exclusive, so 6 includes 5
print(null_matrix)
Output:
array([[0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [4, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 2]], dtype=int32)
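Note that drawing x and y separately with replace=False also forces the three values onto distinct rows and distinct columns, which is stricter than the example in the question (where two values share a column). If any three distinct cells are acceptable, a sketch using flat indices instead:
null_matrix = np.zeros((5, 5), dtype=np.int32)
rng = np.random.default_rng()
# Pick 3 distinct positions among all 25 cells, then scatter values 1-5 into them.
flat_idx = rng.choice(null_matrix.size, size=3, replace=False)
null_matrix.flat[flat_idx] = rng.integers(1, 6, size=3)
print(null_matrix)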

numpy: make sub-arrays based on unique columns

I have an example array that looks like this:
array([[1, 1, 0, 1],
       [0, 1, 0, 0],
       [1, 1, 1, 0],
       [0, 0, 1, 2],
       [0, 1, 3, 2],
       [1, 1, 0, 1],
       [0, 1, 0, 0]])
With this in mind, I want to reformat this array into subarrays based on the first two columns. Using "How to split a numpy array based on a column?" as a reference, I made this array into a list of arrays with:
df = pd.DataFrame(array)
df['4'] = df[0].astype(str) + df[1].astype(str)
df['4'] = df['4'].astype(int)
arr = df.to_numpy()
y = [arr[arr[:,4]==k] for k in np.unique(arr[:,4])]
where y is ...
[array([[0, 0, 1, 2, 0]]),
 array([[0, 1, 0, 0, 1],
        [0, 1, 3, 2, 1],
        [0, 1, 0, 0, 1]]),
 array([[ 1,  1,  0,  1, 11],
        [ 1,  1,  1,  0, 11],
        [ 1,  1,  0,  1, 11]])]
This works fine, but it takes far too long for y to run, and the runtime grows rapidly with the number of rows. I am playing around with hundreds of millions of rows, and y = [arr[arr[:,4]==k] for k in np.unique(arr[:,4])] is not practical from a time standpoint.
Any ideas on how to speed this up?
What about using the numpy_indexed library:
import numpy as np
import numpy_indexed as npi
a = np.array([[1, 1, 0, 1],
              [0, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 2],
              [0, 1, 3, 2],
              [1, 1, 0, 1],
              [0, 1, 0, 0]])
key = np.dot(a[:, :2], [1, 10])
y = npi.group_by(key).split_array_as_list(a)
Output
y
[array([[0, 0, 1, 2]]),
 array([[0, 1, 0, 0],
        [0, 1, 3, 2],
        [0, 1, 0, 0]]),
 array([[1, 1, 0, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 1]])]
You can easily install the library with:
> pip install numpy-indexed
Let me know if this performs better:
from collections import defaultdict
import numpy as np

outgen = defaultdict(list)
# arr: the input numpy array, :type: np.ndarray
c = map(lambda x: ((x[0], x[1]), x), arr)
for key, val in c:
    outgen[key].append(val)
# outgen: the required output, :type: list[np.ndarray]
outgen = [np.array(x) for x in outgen.values()]
You can use np.unique directly here.
unique, indexer = np.unique(arr[:, :2], axis=0, return_inverse=True)
{tuple(u): arr[indexer == i, :] for i, u in enumerate(unique)}
This is probably about as good as it gets for your desired output. However, instead of splitting it into a list of subarrays you could sort it by the unique key and then work with slices. This might be helpful if there are many unique values leading to a long list.
arr[:] = arr[np.argsort(indexer, kind='stable'), :]  # a stable sort is guaranteed to preserve the order within each group
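To make the "sort once, then work with slices" idea concrete, here is a sketch (using indexer from above, before any in-place reordering): a stable argsort groups equal keys together, and the per-key counts give the boundaries to cut at.
order = np.argsort(indexer, kind='stable')   # stable sort keeps within-group order
sorted_arr = arr[order]
_, counts = np.unique(indexer, return_counts=True)
groups = np.split(sorted_arr, np.cumsum(counts)[:-1])  # list of per-key blocks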
EDIT:
Here is a powerful solution which I have been using for a sort of 2-D factorization. It takes 8 ms for 1 million rows of single-digit integers (vs > 100 ms for np.unique).
import pandas as pd
from pandas.core.sorting import get_group_index  # internal pandas helper

columns = x[:, 0], x[:, 1]
factored = map(pd.factorize, columns)
codes, unique_values = map(list, zip(*factored))
group_index = get_group_index(codes, tuple(map(len, unique_values)), sort=False, xnull=False)
It uses the internal algorithm of DataFrame.drop_duplicates.
Note that the ordering of the keys is not the sort order of the unique tuples.
There is also a newer open source library, riptable, which emulates numpy and pandas in some ways but can be a lot more powerful. Creating the key below takes around 4 ms:
import riptable as rt
columns = [x[:, 0], x[:, 1]]
unique_values, key = rt.unique(columns, return_inverse=True)
Here, unique_values is a tuple containing two arrays, which can be zipped to get the unique tuples.
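For example, to recover the unique (col0, col1) pairs from that tuple (assuming unique_values as returned above):
# Each array in the tuple holds one column of the unique keys.
unique_pairs = list(zip(*unique_values))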

Updating numpy 2-dimensional array according to conditions across different 2-D arrays

In the code that I am writing, I have three 2D numpy arrays with the same dimensions (m x n). Each 2D array contains info about a specific trait, and each corresponding cell (with a specific row/col value) across all three 2D arrays corresponds to a specific person. The three 2D arrays are trait1, trait2, and trait3. As an example, person (0, 0) will have traits 1 and 2, but not trait 3, if trait1 and trait2 have a value of 1 at location (0, 0) but trait3 does not.
What would be an efficient method of updating a 2D array at specific locations based on the values of other 2D arrays of the same dimensions at those same locations? That is, how can I efficiently update a 2D array wherever the other 2D arrays fulfill specific conditions at the same location?
I am currently trying to update the values of trait1 and trait2 according to their current values (wherever the corresponding trait1 value == 1 and the corresponding trait2 value == 0); I am also trying to update the values of trait3 according to the current values of trait1 and trait2 (under the same conditions). However, I am having trouble doing this without nested for loops, which greatly slow down my program.
Below is my current approach, which works, but is much too slow for my purposes:
for i in range(0, m):
    for j in range(0, n):
        if trait1[i][j] == 1:
            if trait2[i][j] == 0:
                trait1[i][j] = 0
                trait2[i][j] = 1
                new_color(i, j, 1)  # updates the color of the specific person on a grid
                trait3[i][j] = 0
        elif trait1[i][j] == 0:
            if trait2[i][j] <= 0:
                trait1[i][j] = 1
                trait2[i][j] = 0
                new_color(i, j, 0)
Numpy arrays are indeed really slow if you loop over them. If you can use matrix operations / numpy functions for everything, it will go much faster.
In your case, you can first extract the indices you're interested in, and then update your matrices like this:
import numpy as np
np.random.seed(1)
# Generate some sample data
trait1, trait2, trait3 = (np.random.randint(0, 2, [4, 4]) for _ in range(3))
In [4]: trait1
Out[4]:
array([[1, 1, 0, 0],
       [1, 1, 1, 1],
       [1, 0, 0, 1],
       [0, 1, 1, 0]])
In [5]: trait2
Out[5]:
array([[0, 1, 0, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 0],
       [1, 0, 0, 0]])
In [6]: trait3
Out[6]:
array([[1, 1, 1, 1],
       [1, 0, 0, 0],
       [1, 1, 1, 1],
       [1, 1, 0, 1]])
And then:
# Compute both index sets before mutating any of the arrays.
cond1_idx = np.where((trait1 == 1) & (trait2 == 0))
cond2_idx = np.where((trait1 == 0) & (trait2 <= 0))

trait1[cond1_idx] = 0
trait2[cond1_idx] = 1
trait3[cond1_idx] = 0
for i, j in zip(*cond1_idx):
    new_color(i, j, 1)

trait1[cond2_idx] = 1
trait2[cond2_idx] = 0
for i, j in zip(*cond2_idx):
    new_color(i, j, 0)
Result:
In [2]: trait1
Out[2]:
array([[0, 1, 1, 1],
       [0, 1, 0, 0],
       [1, 1, 1, 0],
       [0, 0, 0, 1]])
In [3]: trait2
Out[3]:
array([[1, 1, 0, 0],
       [1, 1, 1, 1],
       [1, 0, 0, 1],
       [1, 1, 1, 0]])
In [4]: trait3
Out[4]:
array([[0, 1, 1, 1],
       [0, 0, 0, 0],
       [1, 1, 1, 0],
       [1, 0, 0, 1]])
I cannot really test the new_color calls, though, since I don't have that function.
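As a side note, the same updates can be written with boolean masks instead of index tuples; np.where is only needed here because new_color wants explicit (i, j) coordinates. A sketch of the mask-only variant:
mask1 = (trait1 == 1) & (trait2 == 0)
mask2 = (trait1 == 0) & (trait2 <= 0)
# As before, both masks must be computed before any array is mutated.
trait1[mask1] = 0
trait2[mask1] = 1
trait3[mask1] = 0
trait1[mask2] = 1
trait2[mask2] = 0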

Doubling the matrix in numpy

Let's say I have a matrix, in, of size m x n.
I am trying to create a matrix, out, of size 2m x 2n, such that
the out matrix contains essentially the same elements as the in matrix,
except that the values alternate with zeros.
For example:
in  = [[1, 2, 3],
       [4, 5, 6]]

out = [[1, 0, 2, 0, 3, 0],
       [0, 0, 0, 0, 0, 0],
       [4, 0, 5, 0, 6, 0],
       [0, 0, 0, 0, 0, 0]]
Is there a vectorized way to achieve this?
Use NumPy:
import numpy as np
Your data:
a = np.array([[1, 2, 3],
              [4, 5, 6]])
Create an array twice the size along both dimensions:
b = np.zeros([x * 2 for x in a.shape], dtype=a.dtype)
Assign the value of a to each second value of b, again in both dimensions:
b[::2,::2] = a
The result:
>>> b
array([[1, 0, 2, 0, 3, 0],
       [0, 0, 0, 0, 0, 0],
       [4, 0, 5, 0, 6, 0],
       [0, 0, 0, 0, 0, 0]])
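An equivalent one-liner, in case it is useful: the same interleaving falls out of a Kronecker product with a 2 x 2 block that has a single 1 in its top-left corner.
>>> np.kron(a, np.array([[1, 0], [0, 0]]))
array([[1, 0, 2, 0, 3, 0],
       [0, 0, 0, 0, 0, 0],
       [4, 0, 5, 0, 6, 0],
       [0, 0, 0, 0, 0, 0]])
The slice-assignment version avoids allocating the intermediate products, so it is typically the faster of the two.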

Slicing different rows of a numpy array differently

I'm working on a Monte Carlo radiative transfer code, which simulates firing photons through a medium and statistically modelling their random walk. It runs slowly firing one photon at a time, so I'd like to vectorize it and run perhaps 1000 photons at once.
I have divided my slab through which the photons are passing into nlayers slices between optical depth 0 and depth. Effectively, that means that I have nlayers + 2 regions (nlayers plus the region above the slab and the region below the slab). At each step, I have to keep track of which layers each photon passes through.
Let's suppose that I already know that two photons start in layer 0. One takes a step and ends up in layer 2, and the other takes a step and ends up in layer 6. This is represented by an array pastpresent that looks like this:
[[0 2]
 [0 6]]
I want to generate an array traveled_through with (nlayers + 2) columns and 2 rows, describing whether photon i passed through layer j (endpoint-inclusive). It would look something like this (with nlayers = 10):
[[1 1 1 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 1 1 1 0 0 0 0 0]]
I could do this by iterating over the photons and generating each row of traveled_through individually, but that's rather slow, and sort of defeats the point of running many photons at once, so I'd rather not do that.
I tried to define the array as follows:
traveled_through = np.zeros((2, nlayers)).astype(int)
traveled_through[:, np.min(pastpresent, axis=1) : np.max(pastpresent, axis=1) + 1] = 1
The idea was that in a given photon's row, the indices from the starting layer through and including the ending layer would be set to 1, with all others remaining 0. However, I get the following error:
traveled_through[ : , np.min(pastpresent, axis = 1) : np.max(pastpresent, axis = 1) + 1 ] = 1
IndexError: invalid slice
My best guess is that numpy does not allow different rows of an array to be indexed differently using this method. Does anyone have suggestions for how to generate traveled_through for an arbitrary number of photons and an arbitrary number of layers?
If the two photons always start at 0, you could perhaps construct your array as follows.
First setting the variables...
>>> pastpresent = np.array([[0, 2], [0, 6]])
>>> nlayers = 10
...and then constructing the array:
>>> (pastpresent[:,1][:,np.newaxis] + 1 > np.arange(nlayers+2)).astype(int)
array([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])
Or if the photons have an arbitrary starting layer:
>>> pastpresent2 = np.array([[1, 7], [3, 9]])
>>> ((pastpresent2[:,0][:,np.newaxis] < np.arange(nlayers+2)) &
...  (pastpresent2[:,1][:,np.newaxis] + 1 > np.arange(nlayers+2))).astype(int)
array([[0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]])
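Note that the strict < excludes the starting layer itself; since the question asks for an endpoint-inclusive result, you may want <= on the lower bound. A sketch, with the same pastpresent2:
>>> ((pastpresent2[:,0][:,np.newaxis] <= np.arange(nlayers+2)) &
...  (pastpresent2[:,1][:,np.newaxis] + 1 > np.arange(nlayers+2))).astype(int)
array([[0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]])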
A little trick I kind of like for this kind of thing involves the accumulate method of the logical_xor ufunc:
>>> a = np.zeros(10, dtype=int)
>>> b = [3, 7]
>>> a[b] = 1
>>> a
array([0, 0, 0, 1, 0, 0, 0, 1, 0, 0])
>>> np.logical_xor.accumulate(a, out=a)
array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])
Note that this sets to 1 the entries between the positions in b, first index inclusive, last index exclusive, so you have to handle off-by-one errors depending on what exactly you are after.
With several rows, you could make it work as:
>>> a = np.zeros((3, 10), dtype=int)
>>> b = np.array([[1, 7], [0, 4], [3, 8]])
>>> b[:, 1] += 1 # handle the off by 1 error
>>> a[np.arange(len(b))[:, None], b] = 1
>>> a
array([[0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]])
>>> np.logical_xor.accumulate(a, axis=1, out=a)
array([[0, 1, 1, 1, 1, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]])
