I'm looking to see if there is a way to feed sequence data as Numpy arrays to a text LSTM model defined in CNTK. Each instance in my dataset is a sequence of integers mapping back to words, and the length of each sequence differs. It seems one can convert raw text data to the CTF format and feed it to a model by creating a reader function that generates mini-batches, as in this example. However, I'm wondering if there is a way to feed Numpy arrays to this same model.
Further down in this example, there is a discussion of feeding sequences with Numpy, which I was hoping would solve my problem. However, the example deals with sequences of images instead of variable-length sequences of words. In the case of the example, we'll end up with a tensor of n elements that are each 3 x 32 x 32, and we can set up an input variable expecting these dimensions. However, in the case of sequences of words where each sequence has a different length, this example breaks down.
Any help on interop between CNTK and Numpy for text-based LSTMs / RNNs would be greatly appreciated.
You are probably looking for:
x = cntk.sequence.input_variable(shape=())
Here is a small sample program that demonstrates how it works with variable sequence lengths:
import numpy as np
import cntk
# define the model
x = cntk.sequence.input_variable(shape=())
z = cntk.sequence.last(x)
# define the data
a = [[1,2,3], [4,5], [6,7,8,9], [0]]
b = [np.array(i, dtype=np.float32) for i in a]
# evaluate
res = z.eval({x: b})
print(res)
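For the text use case in the question, the same input style extends to sequences of word IDs. The sketch below is untested and the vocabulary, embedding, and hidden sizes are made-up values; it feeds variable-length lists of word IDs as sparse one-hot Values into an embedding followed by an LSTM:
import cntk
# illustrative sizes only
vocab_size, embed_dim, hidden_dim = 100, 8, 16
# each sequence element is a one-hot word vector, fed sparsely
x = cntk.sequence.input_variable(shape=vocab_size, is_sparse=True)
e = cntk.layers.Embedding(embed_dim)(x)
h = cntk.layers.Recurrence(cntk.layers.LSTM(hidden_dim))(e)
z = cntk.sequence.last(h)
# variable-length word-ID sequences, converted to sparse one-hot Values
seqs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
data = cntk.Value.one_hot(seqs, vocab_size)
print(z.eval({x: data}))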
So the task is to optimise a neural network with a PSO. The PSO needs a one-dimensional list of all the weights and biases, like so: [0.1 0.244 ... 0.214]. The NN needs an array of arrays with different dimensions, like so: [[x,y], [m,n], ...(all the hidden layer matrices)..., [p,q]], where x and y are the dimensions of the input layer, followed by all the hidden layers, and finally p and q, the dimensions of the output layer.
I can easily flatten the array to pass it to the PSO, but I need a method that takes the modified array and reshapes it back into the same array of arrays with the same dimensions as the starting one from the NN.
The dimensions depend on the number of neurons in each layer, and we have that information from the start.
I have tried keeping track of a shapes array and creating an indices array to know where to stop, but it doesn't seem to work. I am trying something with slicing now, but no cigar yet. Modifying the NN is also possible, but how would I make it take a predefined list of weights? There might be a very nice and efficient way to do this that I just haven't thought of yet... Any suggestions?
Example:
a = np.array([1,2,3])
b = np.array([7,8,9,10])
c = np.array([12,13,14,15,16])
b = b.reshape(2,2)  # reshape returns a new array, so reassign it
arr = []
arr.append(a)
arr.append(b)
arr.append(c)
This is a very simple example of what the list of weights looks like as the NN works with it - a list of multi-dimensional arrays. arr can be converted into a numpy array of objects if necessary with np.asarray(arr).
Flattening is easy; here is how I do it (there might be a better way that doesn't need a loop - if you know one, I'd be thankful if you shared).
Flattening:
new_arr = np.array([])
for i in range(len(arr)):
    new_arr = np.append(new_arr, arr[i].flatten())
My question is how to take new_arr and put it back together to look like arr and is there a beautiful and fast way to do it.
You can save the shape in a variable (it's just a tuple). Try something like:
...
old_shape = arr.shape
# ... do flattening here
new_arr.reshape(old_shape)
For a list of arrays with different shapes, record each array's shape while flattening and use those shapes to rebuild the list afterwards:
new_arr = np.array([])
shapes = []
for i in range(len(arr)):
    new_arr = np.append(new_arr, arr[i].flatten())
    shapes.append(arr[i].shape)

# do whatever (e.g. let the PSO modify new_arr)

restoredArray = []
offset = 0
for i in range(len(shapes)):
    s = shapes[i]
    n = np.prod(s)
    restoredArray.append(new_arr[offset:(offset + n)].reshape(s))
    offset += n
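If it helps, the same idea can be packaged into two small helper functions (the names flatten_weights / restore_weights are just illustrative); np.concatenate also avoids growing the vector with repeated np.append calls:
import numpy as np

def flatten_weights(arr):
    # flatten a list of differently shaped arrays into one 1-D vector,
    # remembering each shape so the list can be rebuilt later
    shapes = [w.shape for w in arr]
    flat = np.concatenate([w.ravel() for w in arr])
    return flat, shapes

def restore_weights(flat, shapes):
    # split the 1-D vector back into arrays with the recorded shapes
    restored, offset = [], 0
    for s in shapes:
        n = int(np.prod(s))
        restored.append(flat[offset:offset + n].reshape(s))
        offset += n
    return restored

# round trip with arrays shaped like the ones in the question
arr = [np.arange(3.0), np.arange(4.0).reshape(2, 2), np.arange(5.0)]
flat, shapes = flatten_weights(arr)
# ... the PSO would modify `flat` here ...
for before, after in zip(arr, restore_weights(flat, shapes)):
    assert np.array_equal(before, after)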
I'm trying to build a neural network with Chainer that takes a 4-dimensional numpy array as an input.
I know from this publication that it is feasible. However, I don't see how to build it anywhere in the datasets documentation.
Does anyone know how to build it?
You can use any N-dimensional input as long as the input and output data have the same length:
import numpy as np
from chainer.datasets import split_dataset_random, TupleDataset

X = np.array([
    [[.04, .46], [.18, .26]],
    [[.32, .28], [.21, .12]]
], dtype=np.float32)
Y = np.array([.4, .5], dtype=np.float32)  # labels for the X instances, in the same order

train, test = split_dataset_random(TupleDataset(X, Y), int(X.shape[0] * .7))
In earlier versions it was required to flatten the arrays into input vectors, but now you can use any N-dimensional numeric array input.
Also, you can use numpy.reshape to change the dimensions of the input.
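For example, a flat vector can be reshaped into a 4-dimensional instance before it is wrapped in a TupleDataset (the sizes here are just illustrative):
import numpy as np

v = np.arange(16, dtype=np.float32)   # flat vector of 16 values
x = v.reshape(2, 2, 2, 2)             # one 4-dimensional input instance
print(x.shape)                        # (2, 2, 2, 2)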
How can I build an xarray from an iterator of row vectors?
The resulting array may be larger than memory and will be backed by a dask array.
The row vectors also come with unique labels which need to become the row index of the resulting xarray.
In the docs I only see a constructor that takes an in-memory numpy array to begin with.
An example use case would be to store a word embedding model as an xarray with words as row labels. These models usually provide an iterator that produces (string, vector) pairs over all words in the vocabulary. Most models have vectors with hundreds of dimensions, and there are usually ~10^6 words in the vocabulary. I would like to stack the vectors into a matrix in order to perform linear algebra operations and also be able to look up rows by the word string.
I would expect to be able to write something like:
import numpy as np
import xarray as xr
vectors = (('V'+str(i), np.random.randn(10000)) for i in range(10**9))
xray = xarray_from_iter(vectors)
xray.to_parquet('big_xarray.parquet')
row1234567 = xray['V1234567']
Does xarray provide something like xarray_from_iter?
If not how do I write it?
xarray_from_iter should work something like numpy.fromiter, except that it should also label the rows as it goes. It would also need to delay the computation until the data is dumped, since the whole issue is that the array is larger than memory.
TL;DR: xarray does not have a from-iterator constructor. You'll have to build your dask arrays yourself.
Also, xarray does not have a to_parquet method so that is not an operation you can do (at the moment).
Here is an example of how you might construct a dask array (and xarray.DataArray) for your use case:
import dask.array
import xarray as xr
import numpy as np
num = 10
names = []
arrays = []
for i in range(num):
    names.append('V' + str(i))
    arrays.append(dask.array.random.random(10000, chunks=(1000,)))

# stack the lazy row vectors into one (num, 10000) dask array
data = dask.array.stack(arrays, axis=0)

da = xr.DataArray(data, dims=('model', 'sample'), coords={'model': names})
print(da)
Yielding:
<xarray.DataArray 'stack-ff07239b7ea24834ba59f2d05b7f41e2' (model: 10,
sample: 10000)>
dask.array<shape=(10, 10000), dtype=float64, chunksize=(1, 1000)>
Coordinates:
* model (model) <U2 'V0' 'V1' 'V2' 'V3' 'V4' 'V5' 'V6' 'V7' 'V8' 'V9'
Dimensions without coordinates: sample
This is not likely to be efficient, especially when the length of the iterator gets large (like in your example). It may be worth proposing such a constructor on the dask github issues page.
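For the lookup-by-word part of the use case: because 'model' is a labelled coordinate, a sketch of selecting a single row by its string label (only that slice gets computed) looks like:
row = da.sel(model='V3')   # lazy, label-based row selection
print(row.compute())       # materialises just this 10000-element row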
I am new to Python. I have to implement k-fold cross-validation in Python. I am able to split the given data into k equal-sized arrays, but I am not able to concatenate the k-1 arrays that will form the training data set. I know about concatenate() in numpy, but since k is determined on the fly I am not sure how to use it in this scenario. I would appreciate any info in this regard. Thanks in advance.
Check out numpy.vstack. This stacks an iterable of arrays on top of each other (assuming the column dimensions match); hstack stacks them side by side instead.
import numpy as np
k = 10
all_data = [np.random.random((10,5)) for i in range(k)]
train = all_data[:k-1] #list of 9 (10,5) arrays
test = all_data[k-1] #one (10,5) array
train = np.vstack(train) #stacks them on top of each other
print(train.shape) # one (90, 5) array
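Since k is only known at run time, a sketch of the full cross-validation loop simply leaves each chunk out in turn and vstacks the rest:
for i in range(k):
    test = all_data[i]                                   # held-out fold
    train = np.vstack(all_data[:i] + all_data[i + 1:])   # remaining k-1 folds, shape (90, 5)
    # ... fit the model on `train`, evaluate it on `test` ...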
I used the MNIST dataset for training a neural network, where the training data is returned as a tuple with two entries. The first entry contains the actual training images. This is a numpy ndarray with 50,000 entries. Each entry is, in turn, a numpy ndarray with 784 values, representing the 28 * 28 = 784 pixels in a single MNIST image.
I would like to create a new training set; however, I do not know how to create an ndarray from other ndarrays. For instance, if I have the following two ndarrays:
a = np.ndarray((3,1), buffer=np.array([0.9,1.0,1.0]), dtype=float)
b = np.ndarray((3,1), buffer=np.array([0.8,1.0,1.0]), dtype=float)
how to make a third one containing these two?
I tried the following but it creates only one entry.
c = np.ndarray((1,6,1), buffer=np.array(([a],[b])), dtype=float)
I would need it to be two entries.
Thanks, in the meantime I figured out it is simply:
c = np.array((a, b))
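A quick check of what that produces, assuming a and b are the (3, 1) arrays from above - the two inputs become the first axis of the result:
c = np.array((a, b))
print(c.shape)   # (2, 3, 1): two entries, each a 3 x 1 column vector
print(c[0])      # the first entry, equal to a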