I used a piece of code to create a 2D binary valued array to cover all possible scenarios of an event. For the first round, I tested it with 2 members.
Here is my code:
number_of_members = 2
n = number_of_members
values = np.arange(2**n, dtype=np.uint8).reshape(-1, 1)
print('$$$ ===> ', values)
bin_array = np.unpackbits(values, axis=1)[:, -n:]
print('*** ===> ', bin_array)
And the result is this:
$$$ ===> [[0]
[1]
[2]
[3]]
*** ===> [[0 0]
[0 1]
[1 0]
[1 1]]
As you can see, it correctly provided my 2D binary array.
The problem begins when I intended to use number_of_members = 20. If I assign 20 to number_of_members python shows this as result:
$$$ ===> [[ 0]
[ 1]
[ 2]
...
[253]
[254]
[255]]
*** ===> [[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 1]
[0 0 0 ... 0 1 0]
...
[1 1 1 ... 1 0 1]
[1 1 1 ... 1 1 0]
[1 1 1 ... 1 1 1]]
The result has 8 columns, but I expected an array of 32 columns. How can I unpack a uint32 array?
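As you noted correctly, np.unpackbits operates only on uint8 arrays. The nice thing is that you can view any type as uint8. First create values with dtype=np.uint32 instead of np.uint8 (otherwise the values wrap around at 256, as your output shows), then create a uint8 view into your data like this:
view = values.view(np.uint8)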
On my machine, this is little-endian, which makes trimming easier. You can force little-endian order conditionally across all systems:
import sys
if values.dtype.byteorder == '>' or (values.dtype.byteorder == '=' and sys.byteorder == 'big'):
    view = view[:, ::-1]
Now you can unpack the bits. In fact, unpackbits has a nice feature that I personally added, the count parameter. It allows you to make your output be exactly 20 bits long instead of the full 32, without subsetting. Since the output will be mixed big-endian bits and little-endian bytes, I recommend displaying the bits in little-endian order too, and flipping the entire result:
bin_array = np.unpackbits(view, axis=1, count=20, bitorder='little')[:, ::-1]
The result is a (1<<20, 20) array with the exact values you want.
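Putting the steps above together gives the following minimal sketch. It assumes values is created as uint32 (the question used uint8) and reuses the names n, values, view and bin_array from the snippets above:

```python
import sys
import numpy as np

n = 20  # number_of_members

# uint32 holds every value below 2**20; uint8 would wrap around at 256
values = np.arange(2**n, dtype=np.uint32).reshape(-1, 1)

# Reinterpret each 4-byte integer as 4 separate bytes: shape (2**n, 4)
view = values.view(np.uint8)

# Make sure the least significant byte comes first on every platform
if values.dtype.byteorder == '>' or (values.dtype.byteorder == '=' and sys.byteorder == 'big'):
    view = view[:, ::-1]

# Unpack the n lowest bits (least significant first), then flip the columns
# so that the most significant bit ends up in column 0
bin_array = np.unpackbits(view, axis=1, count=n, bitorder='little')[:, ::-1]

print(bin_array.shape)
print(bin_array[5])
```

Row k of bin_array should then hold the 20-bit binary representation of k, most significant bit first.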
I have a 2D NumPy array exclusively filled with 1s and 0s.
a = [[0 0 0 0 1 0 0 0 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 0 0 0 0 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 0 1 1 1 1 1]
[1 1 1 1 1 1 0 0 1]
[1 1 1 1 1 1 1 1 1]]
To get the location of the 0s I used the following code:
new_array = np.transpose(np.nonzero(a==0))
As expected, I get the following result showing the location of the 0s within the array
new_array = [[0 0]
[0 1]
[0 2]
[0 3]
[0 5]
[0 6]
[0 7]
[3 4]
[3 5]
[3 6]
[3 7]
[5 3]
[6 6]
[6 7]]
Now comes my question:
Is there a way to get the location of the 0s at the start and end of a horizontal group if said group is larger than 2?
EDIT: If group were to finish at the end of a row and continue on the one below it, it would count as 2 separate groups.
My first thought was to implement a process that would delete 0s if they are located in-between 0s but I was not able to figure out how to do that.
I would like "new_array" output to be:
new_array = [[0 0]
[0 3]
[0 5]
[0 7]
[3 4]
[3 7]
[5 3]
[6 6]
[6 7]]
Thanks beforehand!!
EDIT 2:
Thank you all for your very helpful insights; I was able to solve the problem that I had.
To satisfy the curiosity, this data represents musical information. The purpose of the program I'm working on is to create a musical score based on an image (that consists exclusively of horizontal lines).
Once the image conversion to 1s and 0s is done, I needed to extract the following information from it: Onset, Pitch, and Duration. This translates into position in the "x" axis, position on the "y" axis and total length of group.
Since X and Y locations are fairly easy to get, I decided to process them separately from the "Duration" calculation (which was the main problem to solve in this post).
Thanks to your help I was able to solve the Duration problem and create a new array with all necessary information:
[[0 0 4]
[5 0 3]
[4 3 4]
[6 6 2]]
Note that the 1st column represents Onset, the 2nd column represents Pitch, and the 3rd column represents Duration.
I also noticed the comment that suggested adding an identifier to each event. Eventually I will need to implement that to differentiate between different instruments (and later send them to individual MIDI channels). However, for this first iteration of the program, which only aims to create a music score for a single instrument, it is not necessary since all events belong to a single instrument.
I have very little experience with programming, so I don't know if this was the most efficient way of achieving my goal. Any suggestions are welcome.
Thanks!
One possible solution that is easier to follow is:
b = np.diff(a, prepend=1) # prepend a column of 1s and detect
# jumps between adjacent columns (left to right)
y, x = np.where(b > 0) # find positions of the jumps 0->1 (left to right)
# shift positive jumps to the left by 1 position while filling gaps with 0:
b[y, x - 1] = 1
b[y, x] = 0
new_array = list(zip(*np.where(b)))
Another one is:
new_array = list(zip(*np.where(np.diff(a, n=2, prepend=1, append=1) > 0)))
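Both solutions are based on np.diff, which computes differences between consecutive columns (along axis=-1, the default, for 2D arrays). Note that the first solution only finds the end of a group when a 1 follows it, so it misses groups that run up to the last column; the second one pads both sides with 1s and handles that case.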
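To see why the second-difference variant works, here is a small sketch that runs it on the array from the question. The names d2 and edges are only illustrative:

```python
import numpy as np

a = np.array([[0, 0, 0, 0, 1, 0, 0, 0, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 0, 0, 0, 0, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 0, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 1, 1, 0, 0, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1]])

# Second difference along each row, with a virtual 1 on both sides:
# d2[:, j] = a[:, j-1] - 2*a[:, j] + a[:, j+1]
d2 = np.diff(a, n=2, prepend=1, append=1)

# d2 > 0 only where a[:, j] == 0 and at least one neighbour is 1,
# i.e. at the first or last zero of a horizontal run
edges = np.argwhere(d2 > 0)
print(edges)
```

For this input, edges should match the desired new_array from the question, including the single zero in row 5 and the pair in row 6.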
A flaw in the other solution is that it reports all sequences
of zeroes, regardless of their length.
Your expected output also contains such groups, composed of 1 or 2
zeroes, but in my opinion it shouldn't.
My solution is free of the above flaw.
An elegant tool to process groups of adjacent equal elements is
itertools.groupby, so start from:
import itertools
Then generate your intended result as:
res = []
for rowIdx, row in enumerate(a):
    colIdx = 0 # Start column index
    for k, grp in itertools.groupby(row):
        vals = list(grp) # Values in the group
        lgth = len(vals) # Length of the group
        colIdx2 = colIdx + lgth - 1 # End column index
        if k == 0 and lgth > 2: # Record this group
            res.append([rowIdx, colIdx])
            res.append([rowIdx, colIdx2])
        colIdx = colIdx2 + 1 # Advance column index
result = np.array(res)
The result, for your source data, is:
array([[0, 0],
[0, 3],
[0, 5],
[0, 7],
[3, 4],
[3, 7]])
As you can see, it doesn't include the shorter sequences of zeroes
in rows 5 and 6.
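The Onset / Pitch / Duration table from the question's second edit can be produced with the same np.diff idea. This is only a sketch: zero_runs and its min_len parameter are hypothetical names, not something from the original posts, and min_len=2 is chosen because it reproduces the table shown in the question.

```python
import numpy as np

a = np.array([[0, 0, 0, 0, 1, 0, 0, 0, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 0, 0, 0, 0, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 0, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 1, 1, 0, 0, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1]])

def zero_runs(a, min_len=1):
    """Return rows of [onset (x), pitch (y), duration] for each run of 0s."""
    # Pad every row with a 1 on both sides so runs touching an edge are closed;
    # the signed cast keeps the differences from wrapping for unsigned input
    padded = np.pad(a.astype(np.int8), ((0, 0), (1, 1)), constant_values=1)
    d = np.diff(padded, axis=1)
    rows, starts = np.where(d == -1)  # 1 -> 0: a run starts in this column
    _, ends = np.where(d == 1)        # 0 -> 1: the run ended one column earlier
    lengths = ends - starts
    keep = lengths >= min_len
    return np.column_stack((starts, rows, lengths))[keep]

print(zero_runs(a, min_len=2))
```

np.where scans in row-major order, so the i-th start and the i-th end always belong to the same run.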
I am trying to replace a row in a 2d numpy array.
array2d = np.arange(20).reshape(4,5)
for i in range(0, 4, 1):
    array2d[i] = array2d[i] / np.sum(array2d[i])
but I'm getting all 0s:
[[0 0 0 0 0]
[0 0 0 0 0]
[0 0 0 0 0]
[0 0 0 0 0]]
The expected result is:
[[0 0.1 0.2 0.3 0.4]
[0.14285714 0.17142857 0.2 0.22857143 0.25714286]
[0.16666667 0.18333333 0.2 0.21666667 0.23333333]
[0.17647059 0.18823529 0.2 0.21176471 0.22352941]]
The reason you are getting 0s is that the array's dtype is int, while the division returns floats in the range 0 to 1. Because you assign the results back into the rows in place, they are cast to integers (i.e. truncated to 0 in your example). To fix it, use array2d = np.arange(20, dtype=float).reshape(4,5).
But there is no need for the for-loop:
array2d = np.arange(20).reshape(4,5)
array2d = array2d / np.sum(array2d, axis=1, keepdims=True)
Note that here I didn't specify the dtype of the array to be float, but the resulting array's dtype is float because on the second line we created a new array instead of modifying the first array in-place.
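The difference between the two behaviours can be checked with a short sketch. The names ints and normalized are only illustrative:

```python
import numpy as np

# In-place assignment: the float results are cast to the int row and truncated
ints = np.arange(20).reshape(4, 5)
ints[1] = ints[1] / np.sum(ints[1])
print(ints.dtype, ints[1])

# Vectorised division builds a new float array instead
array2d = np.arange(20).reshape(4, 5)
normalized = array2d / np.sum(array2d, axis=1, keepdims=True)
print(normalized.dtype)
print(normalized.sum(axis=1))  # each row should sum to (approximately) 1
```

keepdims=True keeps the row sums as a (4, 1) column, so the division broadcasts across each row.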
https://numpy.org/doc/stable/user/basics.indexing.html#assigning-values-to-indexed-arrays
Something's odd with the data here.
If I create a scipy.sparse.csr_matrix with the data property containing only 0s and 1s, and then ask it to print the data property, sometimes there are 2s in the output (other times not).
You can see this behaviour here:
from scipy.sparse import csr_matrix
import numpy as np
from collections import OrderedDict
#Generate some fake data
#This makes an OrderedDict of 10 scipy.sparse.csr_matrix objects,
#with 3 rows and 3 columns and binary (0/1) values
od = OrderedDict()
for i in range(10):
    row = np.random.randint(3, size=3)
    col = np.random.randint(3, size=3)
    data = np.random.randint(2, size=3)
    print('data is:', data)
    sp_matrix = csr_matrix((data, (row, col)), shape=(3, 3))
    od[i] = sp_matrix
#Print the data in each scipy sparse matrix
for i in range(10):
    print('data stored in sparse matrix:', od[i].data)
It'll print something like this:
data is: [1 0 1]
data is: [0 0 1]
data is: [0 0 0]
data is: [0 0 0]
data is: [1 1 1]
data is: [0 0 0]
data is: [1 1 0]
data is: [1 0 1]
data is: [0 0 0]
data is: [0 0 1]
data stored in sparse matrix: [1 1 0]
data stored in sparse matrix: [0 0 1]
data stored in sparse matrix: [0 0]
data stored in sparse matrix: [0 0 0]
data stored in sparse matrix: [2 1]
data stored in sparse matrix: [0 0 0]
data stored in sparse matrix: [1 1 0]
data stored in sparse matrix: [1 1 0]
data stored in sparse matrix: [0 0 0]
data stored in sparse matrix: [1 0 0]
Why does the data stored in the sparse matrix not reflect the data originally put there (there were no 2s in the original data)?
I'm assuming that your way of creating the matrix:
sp_matrix = csr_matrix((data, (row, col)), shape=(3, 3))
uses coo_matrix under the hood (I had not found the relevant sources at first; see the bottom).
In this case, the docs say (for COO):
By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)
Your random-matrix routine does not check for duplicate entries.
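A deterministic sketch makes the effect visible. The index arrays below are made up for illustration, and drawing flat positions without replacement is just one possible way to avoid duplicates, not something from the question:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two entries target the same cell (0, 2), so their values are summed
row = np.array([0, 0, 1])
col = np.array([2, 2, 1])
data = np.array([1, 1, 1])
m = csr_matrix((data, (row, col)), shape=(3, 3))
print(m.data)      # contains a 2 although the input held only 1s
print(m.toarray())

# One way to avoid duplicates: draw distinct flat positions, then split
# them into row and column indices
rng = np.random.default_rng(0)
flat = rng.choice(9, size=3, replace=False)
row2, col2 = np.divmod(flat, 3)
data2 = rng.integers(0, 2, size=3)
m2 = csr_matrix((data2, (row2, col2)), shape=(3, 3))
print(m2.data)     # same values as data2, possibly reordered by row
```

With distinct (row, col) pairs, the stored data should contain exactly the input values, only reordered into CSR row order.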
Edit: OK, I think I found the code.
csr_matrix: no constructor-code -> inheritance from _cs_matrix
compressed.py: _cs_matrix
and there:
else:
    if len(arg1) == 2:
        # (data, ij) format
        from .coo import coo_matrix
        other = self.__class__(coo_matrix(arg1, shape=shape))
        self._set_self(other)
There is a 2D numpy array of about 500000 rows by 512 values each row:
[
[1,0,1,...,0,0,1], # 512 1's or 0's
[0,1,0,...,0,1,1],
...
[0,0,1,...,1,0,1], # row number 500000
]
How to sort the rows ascending as if each row is a long 512-bit integer?
[
[0,0,1,...,1,0,1],
[0,1,0,...,0,1,1],
[1,0,1,...,0,0,1],
...
]
Instead of converting to strings you can also use a void view of the data (as in @Jaime's answer) and argsort by that.
def sort_bin(b):
    b_view = np.ascontiguousarray(b).view(np.dtype((np.void, b.dtype.itemsize * b.shape[1])))
    return b[np.argsort(b_view.ravel())] #as per Divakar's suggestion
Testing
np.random.seed(0)
b = np.random.randint(0, 2, (10,5))
print(b)
print(sort_bin(b))
[[0 1 1 0 1]
[1 1 1 1 1]
[1 0 0 1 0]
...,
[1 0 1 1 0]
[0 1 0 1 1]
[1 1 1 0 1]]
[[0 0 0 0 1]
[0 1 0 1 1]
[0 1 1 0 0]
...,
[1 1 1 0 1]
[1 1 1 1 0]
[1 1 1 1 1]]
Should be much faster and less memory-intensive since b_view is just a view into b
t = np.random.randint(0,2,(2000,512))
%timeit sort_bin(t)
100 loops, best of 3: 3.09 ms per loop
%timeit np.array([[int(i) for i in r] for r in np.sort(np.apply_along_axis(lambda r: ''.join([str(c) for c in r]), 0, t))])
1 loop, best of 3: 3.29 s per loop
About 1000x faster actually
You could sort them in a stable way 512 times, starting with the right-most bit first.
Sort by last bit
Sort by second-last bit, stable (to not mess up results of previous sort)
...
...
Sort by first bit, stable
A smaller example: assume you want to sort these three 2-bit numbers by bits:
11
01
00
In the first step, you sort by the right bit, resulting in:
00
11
01
Now you sort by the first bit. In this case we have two 0s in that column. An unstable sorting algorithm would be allowed to put these equal items in any order, which could cause 01 to appear before 00, and we do not want that. So we use a stable sort for the first column, which keeps the relative order of equal items and results in the desired:
00
01
11
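A minimal NumPy sketch of this procedure, assuming the rows contain only 0s and 1s. sort_rows_stable is a hypothetical helper name; np.argsort(kind='stable') provides the stable sort:

```python
import numpy as np

def sort_rows_stable(b):
    """Sort the rows of b as binary numbers, one stable pass per bit."""
    order = np.arange(b.shape[0])
    # Start with the right-most (least significant) column
    for col in range(b.shape[1] - 1, -1, -1):
        # The stable sort keeps the order produced by the earlier passes
        order = order[np.argsort(b[order, col], kind='stable')]
    return b[order]

b = np.array([[1, 1],
              [0, 1],
              [0, 0]])
print(sort_rows_stable(b))

# np.lexsort does the same in one call; its last key is the primary one,
# so the columns are passed in reverse order
print(b[np.lexsort(b.T[::-1])])
```

For 512 columns this makes 512 passes over the data, so the single np.lexsort call is probably the more practical form of the same idea.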
Creating a string of each row and then applying np.sort()
So if we have an array to test on:
a = np.array([[1,0,0,0],[0,0,0,0],[1,1,1,1],[0,0,1,1]])
We can create strings of each row by using np.apply_along_axis:
a = np.apply_along_axis(lambda r: ''.join([str(c) for c in r]), 1, a)
Note that the axis argument must be 1 so that the function is applied to each row (axis 0 would join the columns instead). This would make a now:
array(['1000', '0000', '1111', '0011'], dtype='<U4')
and so now we can sort the strings with np.sort():
a = np.sort(a)
making a:
array(['0000', '0011', '1000', '1111'], dtype='<U4')
we can then convert back to the original format with:
a = np.array([[int(i) for i in r] for r in a])
which makes a:
array([[0, 0, 0, 0],
       [0, 0, 1, 1],
       [1, 0, 0, 0],
       [1, 1, 1, 1]])
And if you wanted to cram this all into one line:
a = np.array([[int(i) for i in r] for r in np.sort(np.apply_along_axis(lambda r: ''.join([str(c) for c in r]), 1, a))])
This is slow but does the job.
def sort_col(arr, col_num=0):
    # if we have sorted over all columns return array
    if col_num >= arr.shape[1]:
        return arr
    # sort array over given column
    arr_sorted = arr[arr[:, col_num].argsort()]
    # if the number of 1s in the given column is not equal to the total number
    # of rows neither equal to 0, split on 1 and 0, sort and then merge
    if len(arr) > np.sum(arr_sorted[:, col_num]) > 0:
        arr_sorted0s = sort_col(arr_sorted[arr_sorted[:, col_num]==0], col_num+1)
        arr_sorted1s = sort_col(arr_sorted[arr_sorted[:, col_num]==1], col_num+1)
        # swap the order of stacking if you want descending order
        return np.vstack((arr_sorted0s, arr_sorted1s))
    # if the number of 1s in the given column is equal to the total number
    # of rows or equal to 0, just go to the next iteration
    return sort_col(arr_sorted, col_num + 1)
np.random.seed(0)
a = np.random.randint(0, 2, (5, 4))
print(a)
print(sort_col(a))
# prints
[[0 1 1 0]
[1 1 1 1]
[1 1 1 0]
[0 1 0 0]
[0 0 0 1]]
[[0 0 0 1]
[0 1 0 0]
[0 1 1 0]
[1 1 1 0]
[1 1 1 1]]
Edit: Or better yet, use Daniel's solution. I didn't check for new answers before I posted my code.
I have a classified raster that I am reading into a numpy array. (n classes)
I want to use a 2D moving window (e.g. 3 by 3) to create an n-dimensional vector that stores the % cover of each class within the window. Because the raster is large, it would be useful to store this information so as not to recompute it each time. Therefore, I think the best solution is to create a 3D array to act as the vector. A new raster will be created based on these %/count values.
My idea is to:
1) create a 3d array n+1 'bands'
2) band 1 = the original classified raster. each other 'band' value = count cells of a value within the window (i.e. one band per class) ....for example:
[[2 0 1 2 1]
[2 0 2 0 0]
[0 1 1 2 1]
[0 2 2 1 1]
[0 1 2 1 1]]
[[2 2 3 2 2]
[3 3 3 2 2]
[3 3 2 2 2]
[3 3 0 0 0]
[2 2 0 0 0]]
[[0 1 1 2 1]
[1 3 3 4 2]
[1 2 3 4 3]
[2 3 5 6 5]
[1 1 3 4 4]]
[[2 3 2 2 1]
[2 3 3 3 2]
[2 4 4 3 1]
[1 3 5 3 1]
[1 3 3 2 0]]
4) read these bands into a vrt so that it only needs to be created once and can be read in for further modules.
Question: what is the most efficient 'moving window' method to 'count' within the window?
Currently - I am trying, and failing with the following code:
def lcc_binary_vrt(raster, dim, bands):
    footprint = np.zeros(shape = (dim,dim), dtype = int)+1
    g = gdal.Open(raster)
    data = gdal_array.DatasetReadAsArray(g)
    #loop through the band values
    for i in bands:
        print i
        # create a duplicate '0' array of the raster
        a_band = data*0
        # we create the binary dataset for the band
        a_band = np.where(data == i, 1, a_band)
        count_a_band_fname = raster[:-4] + '_' + str(i) + '.tif'
        # run the moving window (footprint) across the band to create a 'count'
        count_a_band = ndimage.generic_filter(a_band, np.count_nonzero(x), footprint=footprint, mode = 'constant')
        geoTiff.create(count_a_band_fname, g, data, count_a_band, gdal.GDT_Byte, np.nan)
Any suggestions very much appreciated.
Becky
I don't know anything about the spatial sciences stuff, so I'll just focus on the main question :)
what is the most efficient 'moving window' method to 'count' within the window?
A common way to do moving window statistics with Numpy is to use numpy.lib.stride_tricks.as_strided, see for example this answer. Basically, the idea is to make an array containing all the windows, without any increase in memory usage:
from numpy.lib.stride_tricks import as_strided
...
m, n = a_band.shape
newshape = (m-dim+1, n-dim+1, dim, dim)
newstrides = a_band.strides * 2 # strides is a tuple
count_a_band = as_strided(a_band, newshape, newstrides).sum(axis=(2,3))
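In newer NumPy versions (1.20 and later) the same window view can be built with numpy.lib.stride_tricks.sliding_window_view, which computes the shape and strides itself. A small sketch, using a made-up a_band for illustration:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Made-up binary band: 1 where the flat index is a multiple of 3
a_band = (np.arange(30).reshape(5, 6) % 3 == 0).astype(int)
dim = 3

# Read-only view of every dim x dim window: shape (m-dim+1, n-dim+1, dim, dim)
windows = sliding_window_view(a_band, (dim, dim))

# Summing over the two window axes gives the count per window position
count_a_band = windows.sum(axis=(2, 3))
print(count_a_band)
```

As with as_strided, only fully contained windows are produced, so the result is smaller than the input by dim-1 in each direction.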
However, for your use case this method is inefficient, because you're summing the same numbers over and over again, especially if the window size increases. A better way is to use a cumsum trick, like in this answer:
def windowed_sum_1d(ar, ws, axis=None):
    if axis is None:
        ar = ar.ravel()
    else:
        ar = np.swapaxes(ar, axis, 0)
    ans = np.cumsum(ar, axis=0)
    ans[ws:] = ans[ws:] - ans[:-ws]
    ans = ans[ws-1:]
    if axis:
        ans = np.swapaxes(ans, 0, axis)
    return ans

def windowed_sum(ar, ws):
    for axis in range(ar.ndim):
        ar = windowed_sum_1d(ar, ws, axis)
    return ar
...
count_a_band = windowed_sum(a_band, dim)
Note that in both snippets above it would be tedious to handle edge cases. Luckily, there is an easy way to include these and get the same efficiency as the second snippet:
count_a_band = ndimage.uniform_filter(a_band.astype(float), size=dim, mode='constant') * dim**2
Though very similar to what you already had, this will be much faster! The cast to float matters because uniform_filter returns an array of the input's dtype, so an integer band would have its window means truncated. The downside is that you may need to round to integers to get rid of floating point rounding errors.
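As a check, the following sketch applies the uniform_filter approach to the 5 by 5 example from the question and should reproduce the count bands listed there. count_class is a hypothetical helper name; the cast to float and the final rounding are my assumptions about how to keep the counts exact:

```python
import numpy as np
from scipy import ndimage

# The classified raster from the question (classes 0, 1 and 2)
data = np.array([[2, 0, 1, 2, 1],
                 [2, 0, 2, 0, 0],
                 [0, 1, 1, 2, 1],
                 [0, 2, 2, 1, 1],
                 [0, 1, 2, 1, 1]])
dim = 3

def count_class(data, value, dim):
    """Count cells equal to value in each dim x dim window (zero-padded edges)."""
    mask = (data == value).astype(float)  # float so the window mean is not truncated
    mean = ndimage.uniform_filter(mask, size=dim, mode='constant', cval=0.0)
    return np.rint(mean * dim**2).astype(int)  # window mean -> integer count

# One count band per class: shape (3, 5, 5)
counts = np.stack([count_class(data, v, dim) for v in (0, 1, 2)])
print(counts[0])
```

Stacking data on top of counts would then give the n+1 band array described in the question.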
As a final note, your code
# create a duplicate '0' array of the raster
a_band = data*0
# we create the binary dataset for the band
a_band = np.where(data == i, 1, a_band)
is a bit redundant: You can just use a_band = (data == i).