I know that multidimensional NumPy arrays may be indexed with other arrays, but I could not figure out how the following works:
I would like to pick the items from raster, a 3D numpy array, based on indx, a 3D index array:
import numpy as np

raster = np.random.rand(5, 10, 50)
indx = np.random.randint(0, high=50, size=(5, 10, 3))
What I want is another array with the dimensions of indx that holds the values of raster selected by the indices in indx.
To resolve your indices correctly under broadcasting, we need two arrays a and b such that raster[a[i,j,k], b[i,j,k], indx[i,j,k]] equals raster[i, j, indx[i,j,k]] for i, j, k ranging over indx's axes.
The easiest solution would be:
x,y,z = indx.shape
a,b,_ = np.ogrid[:x,:y,:z]
raster[a,b,indx]
Here np.ogrid[...] creates three arrays with shapes (x,1,1), (1,y,1) and (1,1,z). We don't need the last one, so we throw it away. When the other two are broadcast against indx, they behave exactly the way we need.
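For completeness, on NumPy 1.15+ the same lookup can be done with np.take_along_axis, which handles the index broadcasting for you; a minimal sketch, assuming that function is available:
out = np.take_along_axis(raster, indx, axis=2)  # same shape as indx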
If I understood the question correctly, for each row of indx you are trying to index into the corresponding row in raster, but the column numbers vary depending on the actual values in indx. So, with that assumption, you can use a vectorized approach based on linear indexing, like so:
M,N,R = raster.shape
linear_indx = R*np.arange(M*N)[:,None] + indx.reshape(M*N,-1)
out = raster.ravel()[linear_indx].reshape(indx.shape)
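A quick way to sanity-check this (a sketch, assuming the row-wise interpretation above) is to compare the result against an explicit loop:
check = np.empty(indx.shape)
for i in range(M):
    for j in range(N):
        check[i, j] = raster[i, j, indx[i, j]]
assert np.array_equal(out, check)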
I'm assuming that you want to pick 3 values from each of the arrays along the 3rd dimension.
You can do this with a list comprehension, thanks to advanced indexing.
Here's an example using fewer values, and integers, so the output is easier to read:
import numpy as np
raster=np.random.randint(0, high=1000, size=(2,3,10))
indices=np.random.randint(0, high=10, size=(2,3,3))
results = np.array([
    np.array([column[col_indices] for (column, col_indices) in zip(row, row_indices)])
    for (row, row_indices) in zip(raster, indices)
])
print("Raster:")
print(raster)
print("Indices:")
print(indices)
print("Results:")
print(results)
Output:
Raster:
[[[864 353 11 69 973 475 962 181 246 385]
[ 54 735 871 218 143 651 159 259 785 383]
[532 476 113 888 554 587 786 172 798 232]]
[[891 263 24 310 652 955 305 470 665 893]
[260 649 466 712 229 474 1 382 269 502]
[323 513 16 236 594 347 129 94 256 478]]]
Indices:
[[[0 1 2]
[7 5 1]
[7 8 9]]
[[4 0 2]
[6 1 4]
[3 9 2]]]
Results:
[[[864 353 11]
[259 651 735]
[172 798 232]]
[[652 891 24]
[ 1 649 229]
[236 478 16]]]
It iterates simultaneously over the corresponding 3rd-dimension arrays in raster and indices and uses advanced indexing to pick the desired values out of raster.
Here's a more verbose version that does the exact same thing:
results = []
for i in range(len(raster)):
    row = raster[i]
    row_indices = indices[i]
    row_results = []
    for j in range(len(row)):
        column = row[j]
        column_indices = row_indices[j]
        column_results = column[column_indices]
        row_results.append(column_results)
    results.append(np.array(row_results))
results = np.array(results)
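For reference, on NumPy 1.15+ the same results array should also be obtainable in a single vectorized call (a sketch, not part of the original answer):
vectorized = np.take_along_axis(raster, indices, axis=2)
assert np.array_equal(results, vectorized)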
Related
I have two dataframes, let's call them df A and df B
A =
0 1
123 798
456 845
789 932
B =
0 1
321 593
546 603
937 205
Now I would like to combine them element-wise with an expression, as in A - 1/B^2 applied to each pair of elements:
AB =
0 1
123-1/(321^2) 798-1/(593^2)
456-1/(546^2) 845-1/(603^2)
789-1/(937^2) 932-1/(205^2)
Now, I figured I could loop through each row and each column and try some sort of
A[i][j] - 1/(B[i][j]^2)
But when it gets up to a 1000x1000 matrix, that would take quite some time.
Is there any operation in pandas or NumPy that allows this sort of cross-matrix operation? Not just multiplying one matrix by the other, but doing a math operation between them.
Maybe calculate the divisor first into a new df B?
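For what it's worth, pandas (like NumPy) applies arithmetic operators element-wise to whole frames, so the expression can be written directly without loops; a minimal sketch, assuming A and B have the same shape and matching row/column labels:
import pandas as pd

A = pd.DataFrame({0: [123, 456, 789], 1: [798, 845, 932]})
B = pd.DataFrame({0: [321, 546, 937], 1: [593, 603, 205]})

AB = A - 1 / B**2   # element-wise: A[i,j] - 1/(B[i,j]**2)
print(AB)
If the labels don't line up, A.values - 1 / B.values**2 performs the same operation on the underlying NumPy arrays.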
I have an array of repeated values that are used to match datapoints to some ID.
How can I replace the IDs with counting up index values in a vectorized manner?
Consider the following minimal example:
import numpy as np
n_samples = 10
ids = np.random.randint(0,500, n_samples)
lengths = np.random.randint(1,5, n_samples)
x = np.repeat(ids, lengths)
print(x)
Output:
[129 129 129 129 173 173 173 207 207 5 430 147 143 256 256 256 256 230 230 68]
Desired solution:
indices = np.arange(n_samples)
y = np.repeat(indices, lengths)
print(y)
Output:
[0 0 0 0 1 1 1 2 2 3 4 5 6 7 7 7 7 8 8 9]
However, in the real code, I do not have access to variables like ids and lengths, but only x.
It does not matter what the values in x are; I just want an array of counting-up integers, each repeated the same number of times as in x.
I can come up with solutions using for-loops or np.unique, but both are too slow for my use case.
Has anyone an idea for a fast algorithm that takes an array like x and returns an array like y?
You can do:
y = np.r_[False, x[1:] != x[:-1]].cumsum()  # True marks the start of each new run; cumsum numbers the runs
Or with one less temporary array:
y = np.empty(len(x), int)
y[0] = 0
np.cumsum(x[1:] != x[:-1], out=y[1:])  # write the run numbers directly into y[1:]
print(y)
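A quick check on the example above (re-entering the printed x by hand) reproduces the desired y:
x = np.array([129, 129, 129, 129, 173, 173, 173, 207, 207, 5,
              430, 147, 143, 256, 256, 256, 256, 230, 230, 68])
print(np.r_[False, x[1:] != x[:-1]].cumsum())
# [0 0 0 0 1 1 1 2 2 3 4 5 6 7 7 7 7 8 8 9]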
I have a problem where data must be processed across multiple cores. Let groups be an aggregated pandas GroupBy object (see below). Each value represents the computational "cost" each group puts on a core. How can I divide groups into n bins of unequal sizes but with the same (approximate) computational cost?
import pandas as pd
import numpy as np
size = 50
rng = np.random.default_rng(2021)
df = pd.DataFrame({
    "one": np.linspace(0, 10, size, dtype=np.uint8),
    "two": np.linspace(0, 5, size, dtype=np.uint8),
    "data": rng.integers(0, 100, size)
})
groups = df.groupby(["one", "two"]).sum()
df
one two data
0 0 0 75
1 0 0 75
2 0 0 49
3 0 0 94
4 0 0 66
...
45 9 4 12
46 9 4 97
47 9 4 12
48 9 4 32
49 10 5 45
People typically split the dataset into n bins, as in the code below. However, splitting the dataset into n equal parts is undesirable because the cores receive very unbalanced workloads, e.g. 205 vs 788.
n = 4
bins = np.array_split(groups, n) # undesired
[b.sum() for b in bins] #undesired
[data 788
dtype: int64, data 558
dtype: int64, data 768
dtype: int64, data 205
dtype: int64]
A desired solution is to split the data into bins of unequal sizes but with approximately equal summed values. I.e., the spread abs(743-548) = 195 is smaller than with the previous method, abs(205-788) = 583. The difference should be as small as possible. A simple list example of how it could be achieved:
# only an example to demonstrate desired functionality
example = [[[10, 5], 45], [[2, 1], 187], [[3, 1], 249], [[6, 3], 262]], [[[9, 4], 153], [[4, 2], 248], [[1, 0], 264]], [[[8, 4], 245], [[7, 3], 326]], [[[5, 2], 189], [[0, 0], 359]]
[sum([size for (group, size) in test]) for test in example] # [743, 665, 571, 548]
Is there a more efficient method to split the dataset into bins as described above in pandas or numpy?
It is important to split/bin the GroupBy object so that the data can be accessed in a similar way to what np.array_split() returns.
I think a good approach has been found. Credits to a colleague.
The idea is to sort the group sizes (in descending order) and put groups into bins in a "backward S"-pattern. Let me illustrate with an example. Assume n = 3 (number of bins) and the following data:
groups
data
0 359
1 326
2 264
3 262
4 249
5 248
6 245
7 189
8 187
9 153
10 45
The idea is to put one group into one bin at a time, moving back and forth between the bins in a "backward S" pattern: first element in bin 0, second element in bin 1, etc. Then go backwards when reaching the last bin: fourth element in bin 2, fifth element in bin 1, etc. See below how the elements are put into bins, with the group number in parentheses; the values are the group sizes.
Bins: | 0 | 1 | 2 |
| 359 (0)| 326 (1)| 264 (2)|
| 248 (5)| 249 (4)| 262 (3)|
| 245 (6)| 189 (7)| 187 (8)|
| | 45(10)| 153 (9)|
The bins will then have approximately the same summed values and, thus, approximately the same computational "cost". The bin sizes are [852, 809, 866], for anyone interested. I have tried it on a real-world dataset and the bins are of similar sizes. It is not guaranteed that the bins will be of similar size for every dataset.
The code can be made more efficient, but this is sufficient to get the idea out:
import numpy as np
import pandas as pd

n = 3
size = 50
rng = np.random.default_rng(2021)
df = pd.DataFrame({
    "one": np.linspace(0, 10, size, dtype=np.uint8),
    "two": np.linspace(0, 5, size, dtype=np.uint8),
    "data": rng.integers(0, 100, size)
})
groups = df.groupby(["one", "two"]).sum()
groups = groups.sort_values("data", ascending=False).reset_index(drop=True)

bins = [[] for i in range(n)]
backward = False
i = 0
for group in groups.iterrows():
    bins[i].append(group)
    # move to the next bin, reversing direction at either end ("backward S")
    i = i + 1 if not backward else i - 1
    if i == n:
        backward = True
        i -= 1
    if i == -1 and backward:
        backward = False
        i += 1
[sum([size[0] for (group, size) in bin]) for bin in bins]
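If the bins need to be consumed like the chunks returned by np.array_split(), each bin can be turned back into a DataFrame; a sketch (not part of the original answer), assuming the loop above has already filled bins and none of them is empty:
chunks = [pd.DataFrame([row for (_, row) in b]) for b in bins]
[chunk["data"].sum() for chunk in chunks]  # same totals as above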
I have a .mtx file that looks like below:
0 435 1
0 544 1
1 344 1
2 410 1
2 471 1
This matrix has shape of (1000, 1000).
As you can see, the node ids start at 0. I want to change this so they start at 1 instead of 0.
In other words, I need to add 1 to all the numbers in the first and second columns that represent the node ids.
So I converted the .mtx file to a .txt file and tried to add 1 to the first and second columns, simply adding 1 to each row like below:
import numpy as np
data_path = "my_data_path"
data = np.loadtxt(data_path, delimiter=' ', dtype='int')
for i in data:
    print(data[i]+1)
and the result was:
[ 1 436 2]
[ 1 545 2]
[ 2 345 2]
[ 3 411 2]
[ 3 472 2]
Now I need to subtract 1 from the third column, but I have no idea how to implement that.
Can someone help me with that?
Or if there's an easier way to accomplish my goal, please tell me. Thank you in advance.
Why wouldn't you increment only the first two columns?
data[:, :2] += 1
You may want to have a look at indexing in NumPy.
Additionally, I don't think the loop in your code ever worked:
for i in data:
    print(data[i]+1)
You are indexing with values taken from the array itself, which is generally wrong and is certainly wrong in this case:
IndexError: index 435 is out of bounds for axis 0 with size 5
You could correct it to print the whole matrix:
print(data + 1)
Giving:
[[ 1 436 2]
[ 1 545 2]
[ 2 345 2]
[ 3 411 2]
[ 3 472 2]]
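Putting it together, a minimal sketch of the full round trip (the filenames here are hypothetical), using np.savetxt to write the shifted matrix back out:
import numpy as np

data = np.loadtxt("my_data.txt", dtype=int)  # space-separated integer triples
data[:, :2] += 1                             # shift both node-id columns
np.savetxt("my_data_shifted.txt", data, fmt="%d")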
I have a list of indices stored as a list of tuples:
index=[(0,0), (0,1), (1,0), (1,1), ...]
These indices will be used to calculate the energy of an image im (a numpy array) with the following formula:
(1-im[0,0])^2 + (1-im[0,1])^2 + ...
im here is a two dimensional numpy array. Here's an example of im:
im=Image.open('lena_noisy.png')
im=numpy.array(im)
print im
[[168 133 131 ..., 127 213 107]
[174 151 111 ..., 191 88 122]
[197 173 143 ..., 182 153 125]
...,
[ 34 15 6 ..., 111 95 104]
[ 37 15 57 ..., 121 133 134]
[ 49 39 58 ..., 115 74 107]]
How can I use the map function over the list to perform this calculation?
If you break index into two tuples, xidx and yidx, then you can use fancy indexing to access all the im values as one numpy array.
The calculation then becomes simple to express, and faster than a Python loop (or list comprehension):
import numpy as np
xidx, yidx = zip(*index)
print(((1-im[xidx, yidx])**2).sum())
import numpy as np
import scipy.misc as misc
im = misc.lena()
n = min(im.shape)
index = np.random.randint(n, size = (10000,2)).tolist()
def using_fancy_indexing(index, im):
xidx, yidx = zip(*index)
return (((1-im[xidx, yidx])**2).sum())
def using_generator_expression(index, im):
return sum(((1 - im[i[0], i[1]]) ** 2) for i in index)
Here is a comparison using timeit:
In [27]: %timeit using_generator_expression(index, im)
100 loops, best of 3: 17.9 ms per loop
In [28]: %timeit using_fancy_indexing(index, im)
100 loops, best of 3: 2.07 ms per loop
Thus, depending on the size of index, using fancy indexing could be 8x faster than using a generator expression.
Like this, using a generator expression:
sum((1-im[i][j])**2 for i, j in index)
That is, assuming that im is a two-dimensional list and index is a list of coordinates in im. Notice that in Python, a two-dimensional list is accessed like this: m[i][j] and not like this: m[i,j].
Using sum and a generator expression:
sum(((1 - im[i[0], i[1]]) ** 2) for i in index)
If index is also a numpy array, convert each row back to a tuple before indexing (im[i] with a 1-D array i would select whole rows rather than a single element):
sum(((1 - im[tuple(i)]) ** 2) for i in index)
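Alternatively, if index is an (N, 2) NumPy array, the whole computation can be vectorized with fancy indexing, along the lines of the earlier answer:
idx = np.asarray(index)
print(((1 - im[idx[:, 0], idx[:, 1]]) ** 2).sum())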