New at Python and NumPy, trying to create 263-dimensional arrays.
I need that many dimensions for a machine learning model.
Of course, one way is to use numpy.zeros or numpy.ones and write code like this:
x = np.zeros((1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1))  # and more 1, 1, 1, 1
Is there an easier way to create arrays with many dimensions?
You don't need 263 dimensions. If every dimension had only size 2, you'd still have 2 ** 263 elements, which is:
14821387422376473014217086081112052205218558037201992197050570753012880593911808
You wouldn't be able to do anything with such an array: you couldn't even initialize it on Google's servers.
You either need an array with 263 values:
>>> np.zeros(263)
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        ...
        0., 0., 0.])
or a matrix with 263 vectors of M elements (let's say 3):
>>> np.zeros((263, 3))
array([[ 0., 0., 0.],
       [ 0., 0., 0.],
       [ 0., 0., 0.],
       [ 0., 0., 0.],
       ...
       [ 0., 0., 0.],
       [ 0., 0., 0.],
       [ 0., 0., 0.],
       [ 0., 0., 0.],
       [ 0., 0., 0.],
       [ 0., 0., 0.]])
There are many advanced research centers that are perfectly happy with vanilla NumPy. Having to use fewer than 32 dimensions doesn't seem to bother them much, whether for quantum mechanics or machine learning.
Let's start with the numpy documentation; help(np.zeros) gives:
zeros(shape, dtype=float, order='C')
Return a new array of given shape and type, filled with zeros.
Parameters
----------
shape : int or sequence of ints
Shape of the new array, e.g., ``(2, 3)`` or ``2``.
...
Returns
-------
out : ndarray
Array of zeros with the given shape, dtype, and order.
...
The shape argument is just a list of the sizes of each dimension (but you probably knew that). There are lots of ways to easily create such a list in Python; one quick way is
np.zeros(np.ones(263, dtype=int))
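For what it's worth, plain Python builds the same shape sequence without a numpy call (a small sketch):
import numpy as np

# The same 263-long shape, built in plain Python:
shape = (1,) * 263   # a tuple of 263 ones
shape = [1] * 263    # or a list
# np.zeros(shape)    # would still raise here; see the note on dimensions below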
But, as others have mentioned, numpy has a somewhat arbitrary limitation of 32 dimensions. In my experience, you can get similar and more flexible behavior by keeping an index array showing which "dimension" each row belongs to.
Most likely, for ML applications you don't actually want this:
shape = np.random.randint(1, 10, (263,))
arr = np.zeros(shape)  # causes a ValueError anyway
You actually want something sparse:
for i, value in enumerate(nonzero_values):
    arr[tuple(idx[i])] = value  # tuple() makes the row act as a single N-dimensional index
idx in this case is a (num_samples, 263) array and nonzero_values is a (num_samples,) array.
ML algorithms usually work directly on these idx and value arrays (usually called X and Y), since the actual dense arrays would be enormous otherwise.
Sometimes you need a "one-hot" array of your dimensions, which makes idx.shape == (num_samples, shape.sum()), with idx containing only 0 or 1 values. But that's still smaller than any sort of high-dimensional array.
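A rough sketch of that index/value layout, with all names (idx, nonzero_values, shape) carried over from above for illustration rather than taken from any library:
import numpy as np

num_samples = 4
shape = np.random.randint(1, 10, (263,))

# One coordinate per "dimension" for each sample, plus one value per sample:
idx = np.stack([np.random.randint(0, s, num_samples) for s in shape], axis=1)
nonzero_values = np.random.rand(num_samples)

print(idx.shape)             # (4, 263)
print(nonzero_values.shape)  # (4,)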
There is a new package called DimPy which can create multi-dimensional arrays in Python very easily. To install it, use
pip install dimpy
Usage example:
from dimpy import *
a = dim(4, 5, 6)  # a 3-dimensional array of 4x5x6 elements; use any number of dimensions inside the parentheses, separated by commas
print(a)
By default, every element will be zero. To change it, use dfv(a, 'New value').
To convert it to a numpy-style array, use
a=npary(a)
See more details here: https://www.respt.in/p/python-package-dimpy.html?m=1
I want to pass a list of strings instead of a single string input to my fine-tuned BERT question classification model.
This is my code, which accepts a single string input:
questionclassification_model = tf.keras.models.load_model('/content/drive/MyDrive/questionclassification_model')
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
def prepare_data(input_text):
    token = tokenizer.encode_plus(
        input_text,
        max_length=256,
        truncation=True,
        padding='max_length',
        add_special_tokens=True,
        return_tensors='tf'
    )
    return {
        'input_ids': tf.cast(token['input_ids'], tf.float64),
        'attention_mask': tf.cast(token['attention_mask'], tf.float64)
    }
def make_prediction(model, processed_data, classes=['Easy', 'Medium', 'Hard']):
    probs = model.predict(processed_data)[0]
    return classes[np.argmax(probs)], probs
I don't want to use a for loop over the list, as that takes more execution time.
When I tried to pass a list as input to the tokenizer, it returned the same output for every input:
input_text = ["What is gandhi commonly considered to be?,Father of the nation in india","What is the long-term warming of the planets overall temperature called?, Global Warming"]
processed_data = prepare_data(input_text)
{'input_ids': <tf.Tensor: shape=(1, 256), dtype=float64, numpy=
array([[101., 100., 100., 102.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
        ...
          0.,   0.,   0.]])>, 'attention_mask': <tf.Tensor: shape=(1, 256), dtype=float64, numpy=
array([[1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        ...
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])>}
and those are not the right tokens for the input text.
Thanks in advance...
Different methods for one sentence vs batches
There are different methods for encoding a single sentence versus encoding a batch of sentences.
According to the documentation (https://huggingface.co/docs/transformers/v4.21.1/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode_plus), the encode_plus method expects its first parameter to be "This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method)."
(emphasis mine) - so if you pass a list of strings to this particular method, they are interpreted as a list of tokens, not sentences, and obviously all those very long "tokens" like "What is gandhi commonly considered to be?,Father of the nation in india" do not match anything in the vocabulary, so they get mapped to the out-of-vocabulary id (100, which is exactly what fills your input_ids above).
If you want to encode a batch of sentences, you need to pass your list of strings to the batch_encode_plus method (https://huggingface.co/docs/transformers/v4.21.1/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_encode_plus).
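For instance, a minimal rework of the question's prepare_data along those lines (same settings, just the batch method):
inputs = tokenizer.batch_encode_plus(
    input_text,          # a list of strings, one per sentence
    max_length=256,
    truncation=True,
    padding='max_length',
    add_special_tokens=True,
    return_tensors='tf'
)
# inputs['input_ids'] now has shape (len(input_text), 256)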
This is already supported by Hugging Face by default: both the tokenizer and the model accept a list. See the tokenizer's documentation here: https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__
samples = ["some text1", "some_text2"]
inputs = tokenizer(samples)
predictions = questionclassification_model(inputs)
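One caveat worth adding: when the samples differ in length and the model expects TF tensors, you'll likely also want padding and return_tensors='tf', e.g.:
inputs = tokenizer(samples, padding=True, truncation=True, return_tensors='tf')
predictions = questionclassification_model(inputs)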
I have a dataset of many strings, and I want to convert them to integers for my Keras model to use. When I try to use the sklearn OrdinalEncoder (and I have also tried sklearn one-hot encoding), only zeroes show up for all the categories.
Here's some code:
from sklearn.preprocessing import OrdinalEncoder

X = dataset.astype(str)
oe = OrdinalEncoder()
oe.fit(X)
x = oe.transform(X)
and x[10] just returns:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
I want to change a list to a tensor with tf.convert_to_tensor. The data is the following:
data = [
    array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           1., 0., 0.]),
    array([0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]),
    array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
           0., 0., 0.]),
    array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.])
]
It didn't work; the system says:
ValueError: Can't convert non-rectangular Python sequence to Tensor.
How can I solve this problem?
I'm not sure whether they exist in TensorFlow 1, but TensorFlow 2.0 supports RaggedTensors, which the documentation describes as "... the TensorFlow equivalent of nested variable-length lists."
I think it would be trivial to convert your data to RaggedTensors. It might even be as easy as:
data_tensor = tf.ragged.constant(data)
Example:
>>> a = tf.ragged.constant([[1],[2,3]])
>>> a
<tf.RaggedTensor [[1], [2, 3]]>
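If a downstream op later insists on a rectangular tensor, a RaggedTensor can pad itself out with to_tensor() (a follow-up note, not part of the original answer):
>>> tf.ragged.constant([[1], [2, 3]]).to_tensor()
<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[1, 0],
       [2, 3]], dtype=int32)>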
You can't. As the error message says, TensorFlow tensors cannot have different sizes along one dimension. Try using a list of TensorFlow tensors instead, or the Dataset API.
I have 2 arrays:
np.array(y_pred_list).shape
# returns (5, 47151, 10)
np.array(y_val_lst).shape
# returns (5, 47151, 10)
np.array(y_pred_list)[:, 2, :]
# returns
array([[ 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
np.array(y_val_lst)[:, 2, :]
# returns
array([[ 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)
I would like to go through all 47151 examples and calculate the "accuracy": that is, the number of entries in y_pred_list that match y_val_lst, divided by 47151. What's the comparison function for this?
You can find a lot of useful classification scores in sklearn.metrics, particularly accuracy_score(). See the doc; you would use it as:
from sklearn.metrics import accuracy_score

acc = accuracy_score(np.array(y_val_lst)[:, 2, :],
                     np.array(y_pred_list)[:, 2, :])
Sounds like you want something like this (with both inputs converted to numpy arrays first):
accuracy = (np.asarray(y_pred_list) == np.asarray(y_val_lst)).all(axis=(0, 2)).mean()
...though since your arrays are clearly floating-point arrays, you might want to allow for numerical-precision errors rather than insisting on exact equality:
accuracy = (np.abs(np.asarray(y_pred_list) - np.asarray(y_val_lst)) < tolerance).all(axis=(0, 2)).mean()
(where, for example, tolerance = 1e-10)
The .all(axis=(0,2)) call records cases in which everything in its input is True (i.e. everything matches) when working along the dimension 0 (i.e. the one that has extent 5) and dimension 2 (the one that has extent 10). It outputs a one-dimensional array of length 47151. The .mean() call then gives you the proportion of matches in that sequence, which is my best guess as to what you mean by "over 47151".
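As a side note, np.isclose expresses the same idea with relative and absolute tolerances built in:
accuracy = np.isclose(np.asarray(y_pred_list), np.asarray(y_val_lst)).all(axis=(0, 2)).mean()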
I have 2 numpy arrays: the bigger one is a 10 x 10 array and the smaller one is a 2 x 2 array.
I would like to substitute the values in the bigger array with those from the smaller array at a user-specified location, e.g., replace 4 values around the center point of the 10 x 10 array with the 2 x 2 array.
Right now, I am doing this with a nested for loop, figuring out which pixels in the bigger array overlap those of the smaller array. Is there a more Pythonic way to do it?
In [1]: import numpy as np
In [2]: a = np.zeros(100).reshape(10,10)
In [3]: b = np.ones(4).reshape(2,2)
In [4]: a[4:6, 4:6] = b
In [5]: a
Out[5]:
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 1., 1., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 1., 1., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
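To handle the "user specified location" part generally, a small helper can derive the slices from a top-left corner (a hypothetical sketch; paste is not a numpy function):
import numpy as np

def paste(big, small, row, col):
    # Copy `small` into `big` with its top-left corner at (row, col).
    r, c = small.shape
    big[row:row + r, col:col + c] = small
    return big

a = paste(np.zeros((10, 10)), np.ones((2, 2)), 4, 4)  # same result as above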