Numpy pad zeroes of given size - python

I have read most related questions here, but I cannot seem to figure out how to use np.pad in this case. Maybe it is not meant for this particular problem.
Let's say I have a list of Numpy arrays. Every array is the same length, e.g. 2. The list itself has to be padded to be e.g. 5 arrays and can be transformed into a numpy array as well. The padded elements should be arrays filled with zeroes. As an example
arr = [array([0, 1]), array([1, 0]), array([1, 1])]
expected_output = array([array([0, 1]), array([1, 0]), array([1, 1]), array([0, 0]), array([0, 0])])
The following seems to work, but I feel there must be a better and more efficient way. In reality this is run hundreds of thousands if not millions of times so speed is important. Perhaps with np.pad?
import numpy as np
def pad_array(l, item_size, pad_size=5):
    s = len(l)
    if s < pad_size:
        zeros = np.zeros(item_size)
        for _ in range(pad_size - s):
            # not sure if I need a `copy` of zeros here?
            l.append(zeros)
    return np.array(l)
B = [np.array([0,1]), np.array([1,0]), np.array([1,1])]
AB = pad_array(B, 2)
print(AB)

It seems like you want to pad zeros at the end of axis 0, speaking in NumPy terms. So what you need is:
output = numpy.pad(arr, ((0,2),(0,0)), 'constant')
The trick is the pad_width parameter, which you need to specify as pad_width=((0,2),(0,0)) to get your expected output. This tells pad() to insert 0 rows of padding before and 2 rows after along axis 0, and no padding on either side of axis 1. The format of pad_width is ((before_1, after_1), … (before_N, after_N)) according to the documentation.
mode='constant' tells pad() to pad with the value given by the constant_values parameter, which defaults to 0.
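As a minimal, runnable sketch built on that idea (pad_array_np is just an illustrative name, and the amount of padding is computed from the input):
import numpy as np

def pad_array_np(l, pad_size=5):
    # stack the list into a 2D array, then zero-pad along axis 0 up to pad_size rows
    arr = np.asarray(l)
    extra = max(pad_size - arr.shape[0], 0)
    return np.pad(arr, ((0, extra), (0, 0)), 'constant')

B = [np.array([0, 1]), np.array([1, 0]), np.array([1, 1])]
print(pad_array_np(B))
# [[0 1]
#  [1 0]
#  [1 1]
#  [0 0]
#  [0 0]]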

You could re-write your function like this:
import numpy as np
def pad_array(l, item_size, pad_size=5):
    if pad_size < len(l):
        return np.array(l)
    s = len(l)
    res = np.zeros((pad_size, item_size))  # create an array of shape (pad_size, item_size)
    res[:s] = l  # set the first s rows equal to the elements of l
    return res
B = [np.array([0, 1]), np.array([1, 0]), np.array([1, 1])]
AB = pad_array(B, 2)
print(AB)
Output
[[0. 1.]
 [1. 0.]
 [1. 1.]
 [0. 0.]
 [0. 0.]]
The idea is to create an array of zeroes and then fill the first rows with the values from the input list.

Related

Equivalent tensorflow expression to numpy mask

I have a numpy array named PixelData of unknown shape, and I am using the following condition to filter values in the array greater than some value x using a mask:
PixelData[PixelData>=x] = PixelData[PixelData>=x] - x
When I convert this numpy array to a tensor, I cannot perform the same masking operation. I have tried using tf.where as follows:
PixelData = tf.where(PixelData>=x, PixelData - x, PixelData)
In the official documentation, they always seem to define the mask dimensions in advance to equal the dimensions of the tensor being masked, but then they talk about the dimensions being broadcasted automatically, so I am a bit confused. Are these two functions equivalent? Are there any situations where they may produce different outputs?
Not sure what PixelData looks like, but here is a working example with both methods:
import numpy as np
import tensorflow as tf
x = 2
np_pixel_data = np.array([[3, 4, 5, 1],
                          [6, 4, 2, 5]], dtype=np.float32)
np_pixel_data[np_pixel_data >= x] = np_pixel_data[np_pixel_data >= x] - x
tf_pixel_data = tf.constant([[3, 4, 5, 1],
                             [6, 4, 2, 5]], dtype=tf.float32)
tf_pixel_data = tf.where(tf.greater_equal(tf_pixel_data, x), tf_pixel_data - x, tf_pixel_data)
print(np_pixel_data)
print(tf_pixel_data)
[[1. 2. 3. 1.]
 [4. 2. 0. 3.]]
tf.Tensor(
[[1. 2. 3. 1.]
 [4. 2. 0. 3.]], shape=(2, 4), dtype=float32)
You might have some minor rounding differences, but nothing significant.
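If you want to confirm that the two approaches agree, a quick check (assuming TensorFlow 2.x eager execution, where .numpy() is available on tensors) is:
import numpy as np
# compare the NumPy result with the TensorFlow result element-wise
print(np.allclose(np_pixel_data, tf_pixel_data.numpy()))  # True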

2D version of numpy random choice with weighting

This relates to this earlier post: Numpy random choice of tuples
I have a 2D numpy array and want to choose from it using a 2D probability array. The only way I could think of to do this was to flatten the arrays and then use integer division and the remainder to convert the result back to a 2D index.
import numpy as np
# dummy data
x=np.arange(100).reshape(10,10)
# dummy probability array
p=np.zeros([10,10])
p[4:7,1:4]=1.0/9
xy=np.random.choice(x.flatten(),1,p=p.flatten())
index=[int(xy/10),(xy%10)[0]] # convert back to index
print(index)
which gives
[5, 2]
but is there a cleaner way that avoids flattening and the modulo? i.e. I could pass a list of coordinate tuples as x, but how can I then handle the weights?
I don't think it's possible to directly specify a 2D-shaped array of probabilities, so raveling should be fine. However, to get the corresponding 2D indices from the flat index, you can use np.unravel_index:
index= np.unravel_index(xy.item(), x.shape)
# (4, 2)
For multiple indices, you can just stack the result:
xy=np.random.choice(x.flatten(),3,p=p.flatten())
indices = np.unravel_index(xy, x.shape)
# (array([4, 4, 5], dtype=int64), array([1, 2, 3], dtype=int64))
np.c_[indices]
array([[4, 1],
       [4, 2],
       [5, 3]], dtype=int64)
where np.c_ stacks along the right hand axis and gives the same result as
np.column_stack(indices)
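Putting these pieces together, a small helper along these lines (choice_2d is a hypothetical name, and it uses the newer Generator API rather than np.random.choice) keeps the raveling as an implementation detail:
import numpy as np

def choice_2d(x, p, size=1, rng=None):
    # draw `size` (row, col) index pairs from 2D array x, weighted by the 2D probability array p
    rng = np.random.default_rng() if rng is None else rng
    flat = rng.choice(x.size, size=size, p=p.ravel())        # sample flat indices
    return np.column_stack(np.unravel_index(flat, x.shape))  # convert back to 2D indices

x = np.arange(100).reshape(10, 10)
p = np.zeros((10, 10))
p[4:7, 1:4] = 1.0 / 9
print(choice_2d(x, p, size=3))  # e.g. [[4 1] [5 2] [6 3]]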
You could use numpy.random.randint to generate an index, for example:
# assumes p is a square array
ij = np.random.randint(p.shape[0], size=p.ndim) # size p.ndim = 2 generates 2 coords
# need to convert to tuple to index correctly
p[tuple(ij)]
>>> 0.0
You can also index multiple random values at once:
ij = np.random.randint(p.shape[0], size=(p.ndim, 5)) # get 5 values
p[tuple(ij)]
>>> array([0. , 0. , 0. , 0.11111111, 0. ])

MemoryError while creating cartesian product in Numpy

I have 3 numpy arrays and need to form the cartesian product between them. Dimensions of the arrays are not fixed, so they can take different values, one example could be A=(10000, 50), B=(40, 50), C=(10000,50).
Then I perform some processing (like a+b-c). Below is the function that I am using for the product.
def cartesian_2d(arrays, out=None):
    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype
    n = np.prod([x.shape[0] for x in arrays])
    if out is None:
        out = np.empty([n, len(arrays), arrays[0].shape[1]], dtype=dtype)
    m = n // arrays[0].shape[0]
    out[:, 0] = np.repeat(arrays[0], m, axis=0)
    if arrays[1:]:
        cartesian_2d(arrays[1:], out=out[0:m, 1:, :])
        for j in range(1, arrays[0].shape[0]):
            out[j * m:(j + 1) * m, 1:] = out[0:m, 1:]
    return out
a = [[0, -0.02], [1, -0.15]]
b = [[0, 0.03]]
result = cartesian_2d([a, b, a])
# array([[[ 0.  , -0.02],
#         [ 0.  ,  0.03],
#         [ 0.  , -0.02]],
#        [[ 0.  , -0.02],
#         [ 0.  ,  0.03],
#         [ 1.  , -0.15]],
#        [[ 1.  , -0.15],
#         [ 0.  ,  0.03],
#         [ 0.  , -0.02]],
#        [[ 1.  , -0.15],
#         [ 0.  ,  0.03],
#         [ 1.  , -0.15]]])
The output is the same as with itertools.product. However, I am using my custom function to take advantage of numpy vectorized operations, which is working fine compared to itertools.product in my case.
After this, I do
result[:, 0, :] + result[:, 1, :] - result[:, 2, :]
# array([[ 0.  ,  0.03],
#        [-1.  ,  0.16],
#        [ 1.  , -0.1 ],
#        [ 0.  ,  0.03]])
So this is the final expected result.
The function works as expected as long as my arrays fit in memory. But my use case requires me to work with huge amounts of data, and I get a MemoryError at the np.empty() line since it is unable to allocate the required memory.
I am working with circa 20GB data at the moment and this might increase in future.
These arrays represent vectors and will have to be stored in float, so I cannot use int. Also, they are dense arrays, so using sparse is not an option.
I will be using these arrays for further processing and ideally I would not like to store them in files at this stage. So memmap / h5py format may not help, although I am not sure of this.
If there are other ways to form this product, that would be okay too.
As I am sure there are applications with far larger datasets than this, I hope someone has encountered such issues before, and I would like to know how to handle them. Please help.
If at least your result fits in memory
The following produces your expected result without relying on an intermediate three times the size of the result. It uses broadcasting.
Please note that almost any NumPy operation is broadcastable like this, so in practice there is probably no need for an explicit cartesian product:
a, b = np.asarray(a), np.asarray(b)  # the lists from the question need to be arrays here
# shared trailing dimensions:
sh = a.shape[1:]
aba = (a[:, None, None] + b[None, :, None] - a[None, None, :]).reshape(-1, *sh)
aba
# array([[ 0.  ,  0.03],
#        [-1.  ,  0.16],
#        [ 1.  , -0.1 ],
#        [ 0.  ,  0.03]])
Addressing result rows by 'ID'
You may consider leaving out the reshape. That would allow you to address the rows in the result by combined index. If your component IDs are just 0, 1, 2, ... as in your example, this would be the same as the combined ID. For example, aba[1,0,0] would correspond to the row obtained as the second row of a + first row of b - first row of a.
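A quick check of that claim, keeping the unreshaped result (the variable names simply mirror the answer):
import numpy as np

a = np.array([[0, -0.02], [1, -0.15]])
b = np.array([[0, 0.03]])

aba3 = a[:, None, None] + b[None, :, None] - a[None, None, :]  # shape (2, 1, 2, 2)
# the row at combined index (i, j, k) equals a[i] + b[j] - a[k]
print(np.allclose(aba3[1, 0, 0], a[1] + b[0] - a[0]))  # True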
A bit of explanation
Broadcasting: When, for example, adding two arrays, their shapes do not have to be identical, only compatible, thanks to broadcasting. Broadcasting is in a sense a generalization of adding scalars to arrays:
          [[2],            [[7],    [[2],
 7    +    [3],  equiv to   [7],  +  [3],
           [4]]             [7]]     [4]]
Broadcasting:
                [[4],             [[1, 2, 3],    [[4, 4, 4],
 [[1, 2, 3]] +   [5],   equiv to   [1, 2, 3],  +  [5, 5, 5],
                 [6]]              [1, 2, 3]]     [6, 6, 6]]
For this to work, each dimension of each operand must be either 1 or equal to the corresponding dimension of every other operand (dimensions of size 1 are stretched to match). If an operand has fewer dimensions than the others, its shape is padded with ones on the left. Note that the equiv arrays shown in the illustration are not explicitly created.
If the result also does not fit
In that case I don't see how you can possibly avoid using storage, so h5py or something like that it is.
Removing the first column from each operand
This is just a matter of slicing:
a_no_id = a[:, 1:]
etc. Note that, unlike Python lists, NumPy arrays when sliced do not return a copy but a view. Therefore efficiency (memory or runtime) is not an issue here.
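A quick way to convince yourself that the slice is a view and not a copy:
import numpy as np

a = np.arange(10.0).reshape(5, 2)
a_no_id = a[:, 1:]
print(np.shares_memory(a, a_no_id))  # True: no data was copied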
An alternate solution is to create a cartesian product of indices (which is easier, as solutions for cartesian products of 1D arrays exist):
idx = cartesian_product(
    np.arange(len(a)),
    np.arange(len(b)) + len(a),
    np.arange(len(a))
)
And then use fancy indexing to create the output array:
x = np.concatenate((a, b))
result = x[idx.ravel(), :].reshape(*idx.shape, -1)
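cartesian_product itself is not defined in this answer; a minimal sketch for 1D index arrays, adapted from the usual meshgrid-based recipes, might look like this:
import numpy as np

def cartesian_product(*arrays):
    # cartesian product of 1D arrays, returned as an (N, len(arrays)) array
    grids = np.meshgrid(*arrays, indexing='ij')
    return np.stack(grids, axis=-1).reshape(-1, len(arrays))

print(cartesian_product(np.arange(2), np.arange(1) + 2, np.arange(2)))
# [[0 2 0]
#  [0 2 1]
#  [1 2 0]
#  [1 2 1]]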
Writing results efficiently on disk
First, a few thoughts on the size of the resulting data.
Size of the result data
size_in_GB = A.shape[0]**2*A.shape[1]*B.shape[0]*(size_of_datatype)/1e9
In your question you mentioned A.shape=(10000,50), B=(40,50). Using float64 your result will be approximately 1600 GB. This can be done without problems if you have enough disk space, but you have to think about what you want to do with the data next. Maybe this is only an intermediate result, and processing the data in blocks is possible.
If this is not the case, here is an example of how to handle 1600 GB of data efficiently (RAM usage will be about 200 MB). The throughput should be around 200 MB/s on realistic data.
The code calculating the results is from @PaulPanzer.
import numpy as np
import tables #register blosc
import h5py as h5
import h5py_cache as h5c
a=np.arange(500*50).reshape(500, 50)
b=np.arange(40*50).reshape(40, 50)
# The blosc filter isn't well documented; have a look at https://github.com/Blosc/hdf5-blosc
# compression_opts indices: [4] compression level 0...9, [5] shuffle, [6] compressor (I guess that's lz4)
compression_opts = (0, 0, 0, 0, 9, 1, 1)
File_Name_HDF5 = 'Test.h5'
f = h5c.File(File_Name_HDF5, 'w', chunk_cache_mem_size=1024**2*300)
dset = f.create_dataset('Data', shape=(a.shape[0]**2*b.shape[0], a.shape[1]), dtype='d',
                        chunks=(a.shape[0]*b.shape[0], 1),
                        compression=32001, compression_opts=compression_opts, shuffle=False)
# Write the data
for i in range(a.shape[0]):
    sh = a.shape[1:]
    aba = (a[i] + b[:, None] - a).reshape(-1, *sh)
    dset[i*a.shape[0]*b.shape[0]:(i+1)*a.shape[0]*b.shape[0]] = aba
f.close()
Reading the data
File_Name_HDF5 = 'Test.h5'
f = h5c.File(File_Name_HDF5, 'r', chunk_cache_mem_size=1024**2*300)
dset = f['Data']
chunks_size = 500
# Iterate over the first dimension in chunks
for i in range(0, dset.shape[0], chunks_size):
    data = dset[i:i+chunks_size, :]  # avoid excessive calls to the hdf5 library
    # Do something with the data
f.close()
f = h5c.File(File_Name_HDF5, 'r', chunk_cache_mem_size=1024**2*300)
dset = f['Data']
# Iterate over the second dimension
for i in range(dset.shape[1]):
    # fancy indexing, e.g. [:, i], would be much slower
    # use np.expand_dims, or in this case np.squeeze after the read operation from the dset,
    # if you want the same result as [:, i] (a 1-dim array)
    data = dset[:, i:i+1]
    # Do something with the data
f.close()
On this test example I get a write throughput of about 550 MB/s, a read throughput of about 500 MB/s along the first dimension and 1000 MB/s along the second, and a compression ratio of 50. Numpy memmap will only provide acceptable speed if you read or write data along the fastest-changing direction (in C, the last dimension); with the chunked data format used by HDF5 here, this isn't a problem at all. Compression is also not possible with Numpy memmap, leading to larger files and slower speed.
Please note that the compression filter and chunk shape have to be set up to your needs. This depends on how you want to read the data afterwards and on the actual data.
If you do something completely wrong, performance can be 10-100 times slower than with a proper setup (e.g. the chunk shape can be optimized for the first or the second read example).

How do I generate a numpy linspace-type numpy zeros array for initialization?

When generating a linspace array in NumPy we get an array of shape (len(array),), i.e. it has no 2nd dimension. How do I generate a similar array and initialize it using NumPy zeros? Because zeros takes a 2nd size argument, like 1, I get shape (len(array), 1) while initializing, which I wanted to avoid if possible.
E.g. np.linspace(0, 10, 5) = [0, 2.5, 5, 7.5, 10];
its array dimension is (5,).
On the other hand, a zeros array defined as np.zeros((5,1)) gives a column vector [0 0 0 0 0]^T (transposed). I wanted a flat array, not a column vector.
Is there a way?
Your first argument (5, 1) explicitly defines the shape of the array as 5x1, i.e. 2D. Just pass (5,), or more explicitly as follows:
import numpy as np
z = np.zeros(shape=(5,), dtype=float)
print(z)
print(z.shape)
output is:
[ 0.  0.  0.  0.  0.]
(5,)
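An alternative, if you literally want a zero array shaped like an existing linspace result, is np.zeros_like, which copies both shape and dtype:
import numpy as np

x = np.linspace(0, 10, 5)
z = np.zeros_like(x)   # same shape (5,) and dtype as x
print(z, z.shape)      # [0. 0. 0. 0. 0.] (5,)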

initialize a numpy array

Is there a way to initialize a numpy array of a given shape and add to it? I will explain what I need with a list example. If I want to create a list of objects generated in a loop, I can do:
a = []
for i in range(5):
    a.append(i)
I want to do something similar with a numpy array. I know about vstack, concatenate etc. However, it seems these require two numpy arrays as inputs. What I need is:
big_array # Initially empty. This is where I don't know what to specify
for i in range(5):
    array i of shape = (2,4) created.
    add to big_array
The big_array should have a shape (10,4). How to do this?
EDIT:
I want to add the following clarification. I am aware that I can define big_array = numpy.zeros((10,4)) and then fill it up. However, this requires specifying the size of big_array in advance. I know the size in this case, but what if I do not? When we use the .append function for extending the list in python, we don't need to know its final size in advance. I am wondering if something similar exists for creating a bigger array from smaller arrays, starting with an empty array.
numpy.zeros
Return a new array of given shape and type, filled with zeros.
or
numpy.ones
Return a new array of given shape and type, filled with ones.
or
numpy.empty
Return a new array of given shape and type, without initializing entries.
However, the mentality in which we construct an array by appending elements to a list is not much used in numpy, because it's less efficient (numpy datatypes are much closer to the underlying C arrays). Instead, you should preallocate the array to the size that you need it to be, and then fill in the rows. You can use numpy.append if you must, though.
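A minimal sketch of that preallocate-and-fill pattern, using the shapes from the question:
import numpy as np

big_array = np.zeros((10, 4))           # preallocate the final shape
for i in range(5):
    chunk = np.full((2, 4), i)          # stand-in for whatever you compute per iteration
    big_array[2 * i:2 * (i + 1)] = chunk
print(big_array.shape)                  # (10, 4)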
The way I usually do this is by creating a regular list, appending my stuff to it, and finally transforming the list into a numpy array, as follows:
import numpy as np
big_array = []  # empty regular list
for i in range(5):
    arr = i * np.ones((2, 4))  # for instance
    big_array.append(arr)
big_np_array = np.array(big_array)  # transformed to a numpy array
Of course your final object takes twice the space in memory at the creation step, but appending to a Python list is very fast, and so is creation with np.array().
Introduced in numpy 1.8:
numpy.full
Return a new array of given shape and type, filled with fill_value.
Examples:
>>> import numpy as np
>>> np.full((2, 2), np.inf)
array([[ inf,  inf],
       [ inf,  inf]])
>>> np.full((2, 2), 10)
array([[10, 10],
       [10, 10]])
The array analogue of the Python code
a = []
for i in range(5):
    a.append(i)
is:
import numpy as np
a = np.empty((0))
for i in range(5):
    a = np.append(a, i)
You do want to avoid explicit loops as much as possible when doing array computing, as that reduces the speed gain from that form of computing. There are multiple ways to initialize a numpy array. If you want it filled with zeros, do as katrielalex said:
big_array = numpy.zeros((10,4))
EDIT: What sort of sequence is it you're making? You should check out the different numpy functions that create arrays, like numpy.linspace(start, stop, size) (equally spaced numbers), or numpy.arange(start, stop, inc). Where possible, these functions will make arrays substantially faster than doing the same work in explicit loops.
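For instance:
import numpy as np

print(np.linspace(0, 1, 5))   # [0.   0.25 0.5  0.75 1.  ]
print(np.arange(0, 10, 2))    # [0 2 4 6 8]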
To initialize a numpy array with a specific matrix:
import numpy as np
mat = np.array([[1, 1, 0, 0, 0],
                [0, 1, 0, 0, 1],
                [1, 0, 0, 1, 1],
                [0, 0, 0, 0, 0],
                [1, 0, 1, 0, 1]])
print(mat.shape)
print(mat)
output:
(5, 5)
[[1 1 0 0 0]
 [0 1 0 0 1]
 [1 0 0 1 1]
 [0 0 0 0 0]
 [1 0 1 0 1]]
For your first array example use,
a = numpy.arange(5)
To initialize big_array, use
big_array = numpy.zeros((10,4))
This assumes you want to initialize with zeros, which is pretty typical, but there are many other ways to initialize an array in numpy.
Edit:
If you don't know the size of big_array in advance, it's generally best to first build a Python list using append, and when you have everything collected in the list, convert this list to a numpy array using numpy.array(mylist). The reason for this is that lists are meant to grow very efficiently and quickly, whereas numpy.concatenate would be very inefficient since numpy arrays don't change size easily. But once everything is collected in a list, and you know the final array size, a numpy array can be efficiently constructed.
numpy.fromiter() is what you are looking for:
big_array = numpy.fromiter(range(5), dtype="int")
It also works with generator expressions, e.g.:
big_array = numpy.fromiter((i*(i+1)/2 for i in range(5)), dtype="int")
If you know the length of the array in advance, you can specify it with an optional 'count' argument.
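For example, a small sketch showing the count argument with a generator expression:
import numpy as np

# count lets fromiter preallocate the result instead of growing it while reading
big_array = np.fromiter((i * (i + 1) // 2 for i in range(5)), dtype=int, count=5)
print(big_array)  # [ 0  1  3  6 10]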
I realize that this is a bit late, but I did not notice any of the other answers mentioning indexing into the empty array:
big_array = numpy.empty((10, 4))
for i in range(5):
    array_i = numpy.random.random((2, 4))
    big_array[2 * i:2 * (i + 1), :] = array_i
This way, you preallocate the entire result array with numpy.empty and fill in the rows as you go using indexed assignment.
It is perfectly safe to preallocate with empty instead of zeros in the example you gave since you are guaranteeing that the entire array will be filled with the chunks you generate.
I'd suggest defining the shape first, then iterating over it to insert values:
big_array = np.zeros(shape=(6, 2))
for it in range(6):
    big_array[it] = (it, it)  # For example
>>> big_array
array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.],
       [ 5.,  5.]])
Whenever you are in the following situation:
a = []
for i in range(5):
    a.append(i)
and you want something similar in numpy, several previous answers have pointed out ways to do it, but as @katrielalex pointed out these methods are not efficient. The efficient way to do this is to build a long flat list and then reshape it the way you want once everything has been collected. For example, let's say I am reading some lines from a file, each row has a list of numbers, and I want to build a numpy array of shape (number of lines read, length of vector in each row). Here is how I would do it more efficiently:
long_list = []
counter = 0
with open('filename', 'r') as f:
    for row in f:
        row_list = row.split()
        long_list.extend(row_list)
        counter += 1
# now we have a long list and we are ready to reshape
result = np.array(long_list, dtype=float).reshape(counter, len(row_list))  # desired numpy array
Maybe something like this will fit your needs..
import numpy as np
N = 5
res = []
for i in range(N):
    res.append(np.cumsum(np.ones(shape=(2,4))))
res = np.array(res).reshape((10, 4))
print(res)
Which produces the following output
[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]]
If you want to add your items to a multi-dimensional array, here is one solution.
import numpy as np
big_array = np.ndarray(shape=(0, 2, 4))  # empty along the first axis; each item has shape (2, 4)
for i in range(5):
    item = i * np.ones((2, 4))  # for instance
    big_array = np.concatenate((big_array, item[np.newaxis]))  # append the item along the first axis
See the official NumPy documentation for reference.
# https://thispointer.com/create-an-empty-2d-numpy-array-matrix-and-append-rows-or-columns-in-python/
# Create an empty Numpy array with 4 columns or 0 rows
empty_array = np.empty((0, 4), int)
# Append a row to the 2D numpy array
empty_array = np.append(empty_array, np.array([[11, 21, 31, 41]]), axis=0)
# Append 2nd rows to the 2D Numpy array
empty_array = np.append(empty_array, np.array([[15, 25, 35, 45]]), axis=0)
print('2D Numpy array:')
print(empty_array)
Note that each appended np.array is 2-dimensional.
