replace repeated values with counting up values in Numpy (vectorized) - python

I have an array of repeated values that are used to match datapoints to some ID.
How can I replace the IDs with counting up index values in a vectorized manner?
Consider the following minimal example:
import numpy as np
n_samples = 10
ids = np.random.randint(0,500, n_samples)
lengths = np.random.randint(1,5, n_samples)
x = np.repeat(ids, lengths)
print(x)
Output:
[129 129 129 129 173 173 173 207 207 5 430 147 143 256 256 256 256 230 230 68]
Desired solution:
indices = np.arange(n_samples)
y = np.repeat(indices, lengths)
print(y)
Output:
[0 0 0 0 1 1 1 2 2 3 4 5 6 7 7 7 7 8 8 9]
However, in the real code, I do not have access to variables like ids and lengths, but only x.
It does not matter what the values in x are; I just want an array of counting-up integers, each repeated the same number of times as the corresponding value in x.
I can come up with solutions using for-loops or np.unique, but both are too slow for my use case.
Does anyone have an idea for a fast algorithm that takes an array like x and returns an array like y?

You can do:
y = np.r_[False, x[1:] != x[:-1]].cumsum()
Or with one less temporary array:
y = np.empty(len(x), int)
y[0] = 0
np.cumsum(x[1:] != x[:-1], out=y[1:])
print(y)
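Both variants rely on the same idea: x[1:] != x[:-1] is True exactly at the positions where the value differs from its predecessor, and the cumulative sum of those change flags is the running group index. A minimal sketch to check the behaviour (the helper name count_up is only for illustration):
import numpy as np

def count_up(x):
    # 0 for the first element, +1 at every position where the value changes
    y = np.empty(len(x), int)
    y[0] = 0
    np.cumsum(x[1:] != x[:-1], out=y[1:])
    return y

x = np.array([129, 129, 129, 173, 173, 5, 430, 430])
print(count_up(x))  # [0 0 0 1 1 2 3 3]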

Related

Create matrix 100x100 each row with next ordinal number

I am trying to create a 100x100 matrix in which each row contains the next ordinal number, like below:
I created a vector from 1 to 100 and then copied it 100 times using a for loop. I received an array with the correct data, so I tried to sort the arrays using np.argsort, but it didn't work as I wanted (I don't even know why there are zeros after sorting).
Is there any way to get this matrix using other functions? I tried many approaches, but the final layout was not what I expected.
import numpy as np

max_x = 101
z = np.arange(1, 101)
print(z)
x = []
for i in range(1, max_x):
    x.append(z.copy())
print(x)
y = np.argsort(x)
y
argsort returns the indices that would sort the array, which is why you get zeros. You don't need it; what you want is to transpose the array.
Make x a NumPy array and use .T:
y = np.array(x).T
Output
[[ 1 1 1 ... 1 1 1]
[ 2 2 2 ... 2 2 2]
[ 3 3 3 ... 3 3 3]
...
[ 98 98 98 ... 98 98 98]
[ 99 99 99 ... 99 99 99]
[100 100 100 ... 100 100 100]]
You also don't need a loop to copy the array; use np.tile instead:
z = np.arange(1, 101)
x = np.tile(z, (100, 1))
y = x.T
# or one liner
y = np.tile(np.arange(1, 101), (100, 1)).T
import numpy as np
np.asarray([ (k+1)*np.ones(100) for k in range(100) ])
Or simply
np.tile(np.arange(1,101),(100,1)).T
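Broadcasting also works if you want to avoid materializing the tiled copy; a small sketch (not from the answers above). Note that np.broadcast_to returns a read-only view, so call copy() only if you need a writable matrix:
import numpy as np

# (100, 1) column vector 1..100 broadcast across 100 columns
y = np.broadcast_to(np.arange(1, 101)[:, None], (100, 100)).copy()
print(y)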

Binning pandas/numpy array in unequal sizes with approx equal computational cost

I have a problem where data must be processed across multiple cores. Let df be a Pandas DataFrameGroupBy (size()) object. Each value represents the computational "cost" each group has for the cores. How can I divide df into n bins of unequal sizes and with the same (approximately) computational cost?
import pandas as pd
import numpy as np
size = 50
rng = np.random.default_rng(2021)
df = pd.DataFrame({
    "one": np.linspace(0, 10, size, dtype=np.uint8),
    "two": np.linspace(0, 5, size, dtype=np.uint8),
    "data": rng.integers(0, 100, size)
})
groups = df.groupby(["one", "two"]).sum()
df
one two data
0 0 0 75
1 0 0 75
2 0 0 49
3 0 0 94
4 0 0 66
...
45 9 4 12
46 9 4 97
47 9 4 12
48 9 4 32
49 10 5 45
People typically split the dataset into n bins, as in the code below. However, splitting the dataset into n equal parts is undesirable because the cores receive a very unbalanced workload, e.g. 205 vs 788.
n = 4
bins = np.array_split(groups, n) # undesired
[b.sum() for b in bins] #undesired
[data 788
dtype: int64, data 558
dtype: int64, data 768
dtype: int64, data 205
dtype: int64]
A desired solution splits the data into bins of unequal sizes with approximately equal summed values. E.g. the difference abs(743-548) = 195 is smaller than with the previous method, abs(205-788) = 583. The difference should be as small as possible. A simple list example of how it should be achieved:
# only an example to demonstrate desired functionality
example = [[[10, 5], 45], [[2, 1], 187], [[3, 1], 249], [[6, 3], 262]], [[[9, 4], 153], [[4, 2], 248], [[1, 0], 264]], [[[8, 4], 245], [[7, 3], 326]], [[[5, 2], 189], [[0, 0], 359]]
[sum([size for (group, size) in test]) for test in example] # [743, 665, 571, 548]
Is there a more efficient method to split the dataset into bins as described above in pandas or numpy?
It is important to split/bin the GroupBy object, accessing the data in a similar way as returned by np.array_split().
I think a good approach has been found. Credits to a colleague.
The idea is to sort the group sizes (in descending order) and put groups into bins in a "backward S"-pattern. Let me illustrate with an example. Assume n = 3 (number of bins) and the following data:
groups
data
0 359
1 326
2 264
3 262
4 249
5 248
6 245
7 189
8 187
9 153
10 45
The idea is to place one group per bin, sweeping across the bins and reversing direction at the ends, in a "backward S" pattern. First element in bin 0, second element in bin 1, etc. Then go backwards after reaching the last bin: fourth element in bin 2, fifth element in bin 1, etc. See below how the elements are put into bins, with the group number in parentheses. The values are the group sizes.
Bins: | 0 | 1 | 2 |
| 359 (0)| 326 (1)| 264 (2)|
| 248 (5)| 249 (4)| 262 (3)|
| 245 (6)| 189 (7)| 187 (8)|
| | 45(10)| 153 (9)|
The bins will end up with approximately the same total size and, thus, approximately the same computational "cost". The bin sizes are [852, 809, 866] for anyone interested. I have tried it on a real-world dataset and the bins were of similar sizes. It is not guaranteed that the bins will be of similar size for all datasets.
The code can be made more efficient, but this is sufficient to get the idea out:
n = 3
size = 50
rng = np.random.default_rng(2021)
df = pd.DataFrame({
    "one": np.linspace(0, 10, size, dtype=np.uint8),
    "two": np.linspace(0, 5, size, dtype=np.uint8),
    "data": rng.integers(0, 100, size)
})
groups = df.groupby(["one", "two"]).sum()
groups = groups.sort_values("data", ascending=False).reset_index(drop=True)

bins = [[] for i in range(n)]
backward = False
i = 0
for group in groups.iterrows():
    bins[i].append(group)
    i = i + 1 if not backward else i - 1
    if i == n:
        backward = True
        i -= 1
    if i == -1 and backward:
        backward = False
        i += 1

[sum([size[0] for (group, size) in bin]) for bin in bins]
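The loop can also be replaced by precomputing a bin index for every sorted group, since the backward-S pattern is just the sequence 0, 1, ..., n-1, n-1, ..., 1, 0 repeated. A sketch of that idea (not part of the original answer), assuming the sorted groups DataFrame and n from the snippet above:
import numpy as np

k = len(groups)                                              # groups sorted by "data", descending
snake = np.concatenate([np.arange(n), np.arange(n)[::-1]])   # e.g. 0,1,2,2,1,0 for n = 3
bin_id = np.tile(snake, k // (2 * n) + 1)[:k]                # repeat and truncate to k groups

bins = [groups[bin_id == b] for b in range(n)]
print([b["data"].sum() for b in bins])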

Separating my data to groups and combining them

I have a dataframe with 2 variables whose values range from 0 to 1000. How can I efficiently split it into groups? I want to split x into 10 groups (0 to 100, 100 to 200, ..., 900 to 1000) and then do the same thing for y (0 to 100, 100 to 200, ..., 900 to 1000). Then I want to combine them and have every single combination. One group having 0
Example data
x y
102 602
224 340
368 756
487 305
568 310
510 911
10 50
340 519
9 282
10 150
Well, if you want to split them into 10 groups, remember that from 0 to 1000 there are 1001 numbers, so the groups would be something like 0-100, 101-200, 201-300, ..., 901-1000: you would have 9 groups of 100 elements and 1 group of 101 elements. If you form the groups the way you describe, you will have ten groups of 101 elements each, because you are repeating some numbers at the boundaries. Since I don't know how you want to distribute the boundary values, I'm going to work with 0 to 999 to get 10 groups of 100 elements each. The possible combinations of the groups are 10 x 10 = 100, but if you want to combine the elements as well it becomes a huge number: 100 x 100 = 10,000 element pairs per group combination, times 100 combinations = 1,000,000. So first you need to clarify whether you want the combinations of the groups only or of the elements as well. I don't know if you are using pandas, so I'm going to give you the logic in pure Python.
To split x and y into groups of ten with the numbers from 0 to 999 you can do something like this:
x_group_list = [[i for i in range(k * 100,(k + 1)*100)] for k in range(0,10)]
y_group_list = [[i for i in range(k * 100,(k + 1)*100)] for k in range(0,10)]
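If the data actually sits in a pandas DataFrame (the column names x and y are assumed from the example), pd.cut expresses the same binning without building the lists by hand; a sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [102, 224, 368, 487, 568, 510, 10, 340, 9, 10],
                   "y": [602, 340, 756, 305, 310, 911, 50, 519, 282, 150]})

edges = np.arange(0, 1001, 100)                         # 0, 100, ..., 1000
df["x_bin"] = pd.cut(df["x"], bins=edges, right=False)  # [0, 100), [100, 200), ...
df["y_bin"] = pd.cut(df["y"], bins=edges, right=False)

# observed=False keeps every x_bin/y_bin combination, even empty ones
for key, sub in df.groupby(["x_bin", "y_bin"], observed=False):
    print(key, len(sub))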

Indexing Multi-dimensional arrays

I know that multidimensional numpy arrays may be indexed with other arrays, but I have not figured out how the following works:
I would like to get the items from raster, a 3d numpy array, based on indx, a 3d index array:
raster=np.random.rand(5,10,50)
indx=np.random.randint(0, high=50, size=(5,10,3))
What I want is another array with dimensions of indx that holds the values of raster based on the index of indx.
What we need in order to properly resolve your indices during broadcasting are two arrays a and b so that raster[a[i,j,k],b[i,j,k],indx[i,j,k]] will be raster[i,j,indx[i,j,k]] for i,j,k in corresponding ranges for indx's axes.
The easiest solution would be:
x,y,z = indx.shape
a,b,_ = np.ogrid[:x,:y,:z]
raster[a,b,indx]
Where np.ogrid[...] creates three arrays with shapes (x,1,1), (1,y,1) and (1,1,z). We don't need the last one so we throw it away. Now when the other two are broadcast with indx they behave exactly the way we need.
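If your NumPy version provides np.take_along_axis (added in 1.15), the same lookup can be written without constructing the index grids yourself; a short sketch under that assumption:
import numpy as np

raster = np.random.rand(5, 10, 50)
indx = np.random.randint(0, high=50, size=(5, 10, 3))

# picks raster[i, j, indx[i, j, k]] for every i, j, k
out = np.take_along_axis(raster, indx, axis=2)
print(out.shape)  # (5, 10, 3)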
If I understood the question correctly, for each row of indx, you are trying to index into the corresponding row in raster, but the column numbers vary depending on the actual values in indx. So, with that assumption, you can use a vectorized approach that uses linear indexing, like so -
M,N,R = raster.shape
linear_indx = R*np.arange(M*N)[:,None] + indx.reshape(M*N,-1)
out = raster.ravel()[linear_indx].reshape(indx.shape)
I'm assuming that you want to get 3 random values from each of the 3rd dimension arrays.
You can do this via a list comprehension thanks to advanced indexing.
Here's an example using fewer values and integers so the output is easier to read:
import numpy as np
raster=np.random.randint(0, high=1000, size=(2,3,10))
indices=np.random.randint(0, high=10, size=(2,3,3))
results = np.array([ np.array([ column[col_indices] for (column, col_indices) in zip(row, row_indices) ]) for (row, row_indices) in zip(raster, indices) ])
print("Raster:")
print(raster)
print("Indices:")
print(indices)
print("Results:")
print(results)
Output:
Raster:
[[[864 353 11 69 973 475 962 181 246 385]
[ 54 735 871 218 143 651 159 259 785 383]
[532 476 113 888 554 587 786 172 798 232]]
[[891 263 24 310 652 955 305 470 665 893]
[260 649 466 712 229 474 1 382 269 502]
[323 513 16 236 594 347 129 94 256 478]]]
Indices:
[[[0 1 2]
[7 5 1]
[7 8 9]]
[[4 0 2]
[6 1 4]
[3 9 2]]]
Results:
[[[864 353 11]
[259 651 735]
[172 798 232]]
[[652 891 24]
[ 1 649 229]
[236 478 16]]]
It iterates simultaneously over the corresponding 3rd dimension arrays in raster and indices and uses advanced indexing to slice the desired indices from raster.
Here's a more verbose version that does the exact same thing:
results = []
for i in range(len(raster)):
    row = raster[i]
    row_indices = indices[i]
    row_results = []
    for j in range(len(row)):
        column = row[j]
        column_indices = row_indices[j]
        column_results = column[column_indices]
        row_results.append(column_results)
    results.append(np.array(row_results))
results = np.array(results)

Trouble with numpy.histogram2d

I'm trying to see if numpy.histogram2d will cross tabulate data in 2 arrays for me. I've never used this function before and I'm getting an error I don't know how to fix.
import numpy as np
import random
zones = np.zeros((20,30), int)
values = np.zeros((20,30), int)
for i in range(20):
    for j in range(30):
        values[i,j] = random.randint(0,10)
zones[:8,:15] = 100
zones[8:,:15] = 101
zones[:8,15:] = 102
zones[8:,15:] = 103
np.histogram2d(zones,values)
This code results in the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-53447df32000> in <module>()
----> 1 np.histogram2d(zones,values)
C:\Python27\ArcGISx6410.2\lib\site-packages\numpy\lib\twodim_base.pyc in histogram2d(x, y, bins, range, normed, weights)
613 xedges = yedges = asarray(bins, float)
614 bins = [xedges, yedges]
--> 615 hist, edges = histogramdd([x,y], bins, range, normed, weights)
616 return hist, edges[0], edges[1]
617
C:\Python27\ArcGISx6410.2\lib\site-packages\numpy\lib\function_base.pyc in histogramdd(sample, bins, range, normed, weights)
279 # Sample is a sequence of 1D arrays.
280 sample = atleast_2d(sample).T
--> 281 N, D = sample.shape
282
283 nbin = empty(D, int)
ValueError: too many values to unpack
Here is what I am trying to accomplish:
I have 2 arrays. One array comes from a geographic dataset (raster) representing Landcover classes (e.g. 1=Tree, 2=Grass, 3=Building, etc.). The other array comes from a geographic dataset (raster) representing some sort of political boundary (e.g. parcels, census blocks, towns, etc). I am trying to get a table that lists each unique political boundary area (array values represent a unique id) as rows and the total number of pixels within each boundary for each landcover class as columns.
I'm assuming values is the landcover and zones is the political boundaries. You might want to use np.bincount, which is like a special histogram where each bin has spacing and width of exactly one.
import numpy as np
zones = np.zeros((20,30), int)
zones[:8,:15] = 100
zones[8:,:15] = 101
zones[:8,15:] = 102
zones[8:,15:] = 103
values = np.random.randint(0,10,(20,30)) # no need for that loop
tab = np.array([np.bincount(values[zones==zone]) for zone in np.unique(zones)])
You can do this more simply with histogram, though, if you are careful with the bin edges:
np.histogram2d(zones.flatten(), values.flatten(), bins=[np.unique(zones).size, values.max()-values.min()+1])
The way this works is as follows. The easiest example is to look at all values regardless of zone:
np.bincount(values.ravel())
Which gives you one row with the counts for each value (0 to 10). The next step is to look at the zones.
For one zone, you'd have just one row, and it would be:
zone = 101 # the desired zone
mask = zone==zones # a mask that is True wherever your zones map matches the desired zone
np.bincount(values[mask]) # count the values where the mask is True
Now, we just want to do this for each zone in the map. You can get a list of the unique values in your zones map with
zs = np.unique(zones)
and loop through it with a list comprehension, where each item is one of the rows as above:
tab = np.array([np.bincount(values[zones==zone]) for zone in np.unique(zones)])
Then, your table looks like this:
print tab
# elements with cover =
# 0 1 2 3 4 5 6 7 8 9 # in zone:
[[16 11 10 12 13 15 11 7 13 12] # 100
[13 23 15 16 24 16 24 21 15 13] # 101
[10 12 23 13 12 11 11 5 11 12] # 102
[19 25 20 12 16 19 13 18 22 16]] # 103
Finally, you can plot this in matplotlib as so:
import matplotlib.pyplot as plt
plt.hist2d(zones.flatten(), values.flatten(), bins=[np.unique(zones).size, values.max()-values.min()+1])
histogram2d expects 1D arrays as input, and your zones and values are 2D. You could linearize them with ravel:
np.histogram2d(zones.ravel(), values.ravel())
If efficiency isn't a concern, I think this works for what you want to do
from collections import Counter
c = Counter(zip(zones.flat[:], landcover_classes.flat[:]))
c maps each key, a (zone, landcover class) tuple, to its count. You can populate an array if you like with
for (i, j), count in c.items():
    my_table[i, j] = count
That only works, of course, if i and j are sequential integers starting at zero (i.e., from 0 to Ni and 0 to Nj).
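If pandas is an option, the whole cross tabulation can also be done in one call with pd.crosstab; a sketch (not part of the original answers):
import numpy as np
import pandas as pd

zones = np.zeros((20, 30), int)
zones[:8, :15] = 100
zones[8:, :15] = 101
zones[:8, 15:] = 102
zones[8:, 15:] = 103
values = np.random.randint(0, 10, (20, 30))

# rows are zone ids, columns are landcover classes, cells are pixel counts
tab = pd.crosstab(zones.ravel(), values.ravel(), rownames=["zone"], colnames=["class"])
print(tab)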
