Changing shapes of PyTorch tensors and numpy arrays - python

I'm using the CLIP model from Hugging Face to generate image embeddings, and I'm struggling with the shape of the output.
I'm trying to get a numpy array of shape (n, 512), where n is the number of samples and 512 is the embedding size of the CLIP model. However, I'm getting an array of shape (n,) in which each element has shape (512,).
I have tried different functions such as squeeze and reshape, but nothing has worked so far.
This is my code to generate a Series of embeddings for a given df with images' URLs:
import torch
from transformers import CLIPModel, CLIPProcessor
# initialize model and processor:
device = "cuda" if torch.cuda.is_available() else "cpu"
model_ID = "openai/clip-vit-base-patch32"
# Load the model and move it to the device
model = CLIPModel.from_pretrained(model_ID).to(device)
# Get the processor
processor = CLIPProcessor.from_pretrained(model_ID)
# create image embedding
def embed_url_img(img_url):
    """ Create embeddings for a given image URL """
    inputs = processor(images =,
    return model.get_image_features(inputs).squeeze(0).cpu().detach().numpy()
df['embeddings'] = df['url'].apply(embed_url_img)

This post helped:
how to convert a Series of arrays into a single matrix in pandas/numpy?
to transform the Series into a matrix:
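The snippet from the linked post is missing here. As a minimal sketch of one common approach (an assumption, not necessarily the linked post's exact code), np.stack can combine a Series whose elements are equal-length arrays into a single 2-D matrix. The dummy Series below is a hypothetical stand-in for df['embeddings']:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['embeddings']: 4 samples, each a (512,) array.
embeddings = pd.Series([np.random.rand(512) for _ in range(4)])

# The Series converts to a 1-D object array: shape (n,), one array per element.
print(embeddings.to_numpy().shape)

# Stack the per-row arrays along a new first axis to get shape (n, 512).
matrix = np.stack(embeddings.tolist())
print(matrix.shape)
```

This only works when every element has the same length; rows with differing lengths would make np.stack raise an error.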


Creating a Keras CNN for image alteration

I'm working on a problem that involves computationally evaluating three-dimensional data of the shape (32, 16, 5) and providing a corrected form of this data also in the shape of (32, 16, 5). The problem is relatively specific to my field, but it can be viewed as analogous to processing color images (just with five color channels instead of three). If it helps, this could be thought of as a color correction model.
In my initial efforts, I created a random forest model using XGBoost for each of these output parameters. I had good results, but found that the sheer number of output parameters (32*16*5 = 2560) made the runtime of this approach too long, so I am looking for an alternative.
I'm looking at using Keras to solve this, using a convolutional neural network approach, since the adjacent 'pixels' in my data should have some useful information about their neighbors. Note that 'adjacency' here is both spatial and in the color channels. So far, I am doing alright in creating a simple model that I believe has inputs/outputs of the correct shape, but I am running into an issue when I try to train the model on some dummy images:
#!/usr/bin/env python3
import tensorflow as tf
import pandas as pd
import numpy as np
def create_model(image_shape, batch_size = 10):
    width, height, channels = image_shape
    conv_shape = (batch_size, width, height, channels)
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Conv3D(filters = channels, kernel_size = 3, input_shape = conv_shape, padding = "same"))
    model.add(tf.keras.layers.Dense(channels, activation = "relu"))
    return model
if __name__ == "__main__":
    image_shape = (32, 16, 5)
    # Create test input/output data sets:
    input_img = np.random.rand(*image_shape) # Create one dummy input image
    output_img = np.random.rand(*image_shape) # Create one dummy output image
    # Create a bogus 'training set' by copying the input/output images into lists many times
    inputs = [input_img]*500
    outputs = [output_img]*500
    # Create the model and fit it to the dummy data
    model = create_model(image_shape)
    model.compile(loss = "mean_squared_error", optimizer = "adam", metrics = ["accuracy"])
    model.fit(input_img, output_img)
However, when I run this code, I get the following error:
ValueError: Input 0 of layer sequential is incompatible with the layer: : expected min_ndim=5, found ndim=3. Full shape received: [32, 16, 5]
I am not really sure what the other two expected dimensions are for the data passed into model.fit(). I suspect this is a problem with the way that I am formatting my input data. Even if I have a list of input/output images, that will only bring the ndim of my data to 4, not 5.
I have been trying to find similar examples in the documentation and around the web to see what I'm doing incorrectly, but 3D convolution on a non-classifier network seems a bit off the beaten path, and I'm not having much luck (or just don't know the name of what I should search for).
I have tried passing the dummy training set to model.fit() instead of two individual images. Fitting with model.fit(inputs, outputs) instead, I get:
ValueError: Layer sequential expects 1 inputs, but it received 500 input tensors.
It seems that passing a list of tensors isn't correct here. If I convert the list of input images to numpy arrays with:
inputs = np.array(inputs)
outputs = np.array(outputs)
This does bring up the number of dimensions in my input data to 4, but Keras is still expecting 5. The error I get in this case is very similar to the first:
ValueError: Input 0 of layer sequential is incompatible with the layer: : expected min_ndim=5, found ndim=4. Full shape received: [None, 32, 16, 5]
I'm definitely not understanding something here, and any help would be appreciated.
I think you made two mistakes in your code:
Instead of using Conv3D, you need to use Conv2D.
model.fit(input_img, output_img) should be model.fit(inputs, outputs).
The reason you need Conv2D is that each sample has shape (length, width, channel); it doesn't have an extra dimension. Conv2D expects 4-D input (batch, length, width, channel), whereas Conv3D expects 5-D input.
Try the script below:
#!/usr/bin/env python3
import tensorflow as tf
import pandas as pd
import numpy as np
def create_model(image_shape, batch_size = 10):
    width, height, channels = image_shape
    conv_shape = (width, height, channels)
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Conv2D(filters = channels, kernel_size = 3, input_shape = conv_shape, padding = "same"))
    model.add(tf.keras.layers.Dense(channels, activation = "relu"))
    return model
if __name__ == "__main__":
    image_shape = (32, 16, 5)
    # Create test input/output data sets:
    input_img = np.random.rand(*image_shape) # Create one dummy input image
    output_img = np.random.rand(*image_shape) # Create one dummy output image
    # Create a bogus 'training set' by stacking copies of the input/output images into arrays
    inputs = np.array([input_img]*500)
    outputs = np.array([output_img]*500)
    # Create the model and fit it to the dummy data
    model = create_model(image_shape)
    model.compile(loss = "mean_squared_error", optimizer = "adam", metrics = ["accuracy"])
    model.fit(inputs, outputs)
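The dimension counting behind the errors in this thread can be checked with NumPy alone. This is a small sketch that only inspects array shapes; it relies on nothing from Keras beyond the documented input ranks (Conv2D takes 4-D batches, Conv3D takes 5-D batches), and the trailing-axis step is an illustrative assumption rather than part of the answer.

```python
import numpy as np

image_shape = (32, 16, 5)
input_img = np.random.rand(*image_shape)

# A single image has 3 dimensions: (width, height, channels).
print(input_img.ndim)

# Stacking 500 images adds the batch axis: (500, 32, 16, 5), i.e. 4 dimensions.
inputs = np.array([input_img] * 500)
print(inputs.shape, inputs.ndim)

# Conv3D needs one more axis. Appending a trailing axis of size 1 is one way
# to reach 5 dimensions, treating the 5 channels as a third spatial axis.
inputs_5d = inputs[..., np.newaxis]
print(inputs_5d.shape, inputs_5d.ndim)
```

If convolving across the five channels as well as across space is really wanted, as the question hints, the 5-D layout is the one a Conv3D layer would need; otherwise the 4-D array is what Conv2D consumes.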

How to do Inference and Transfer Learning with TensorFlow Frozen GraphDef (single saved_model.pb) from Google AutoML Vision Classification

I am using an exported classification model from Google AutoML Vision, hence I only have a saved_model.pb and no variables, checkpoints etc.
I want to load this model graph into a local TensorFlow installation, use it for inference and continue training with more pictures.
Main questions:
Is this plan possible, i.e. to use a single saved_model.pb without variables, checkpoints etc. and train the resulting graph with new data?
If yes: How do you get to an input shape of (?,) with images encoded as strings?
Ideally, looking ahead: Any important thing to consider for the training part?
Background information about the code:
To read the image, I use the same approach as you would when using the Docker container for inference, hence a base64-encoded image.
To load the graph, I checked what tag set the graph needs via CLI (saved_model_cli show --dir input/model) which is serve.
To get the input tensor names I use graph.get_operations(), which gives me Placeholder:0 for image_bytes and Placeholder_1:0 for the key (just an arbitrary string to identify the image). Both have dimension -1.
import tensorflow as tf
import numpy as np
import base64
path_img = "input/testimage.jpg"
path_mdl = "input/model"
# input to network expected to be base64 encoded image
with open(path_img, 'rb') as image_file:
    encoded_image = base64.b64encode(image_file.read()).decode('utf-8')
# reshaping to (1,) as the expected dimension is (?,)
feed_dict_option1 = {
    "Placeholder:0": { np.array(str(encoded_image)).reshape(1,) },
    "Placeholder_1:0" : "image_key"
}
# reshaping to (1,1) as the expected dimension is (?,)
feed_dict_option2 = {
    "Placeholder:0": np.array(str(encoded_image)).reshape(1,1),
    "Placeholder_1:0" : "image_key"
}
with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, ["serve"], path_mdl)
    graph = tf.get_default_graph()
    # run once with each feed dict (feed_dict_option1 shown; feed_dict_option2 likewise)
    sess.run('scores:0', feed_dict=feed_dict_option1)
# for input reshaped to (1,)
ValueError: Cannot feed value of shape (1,) for Tensor 'Placeholder:0', which has shape '(?,)'
# for input reshaped to (1,1)
ValueError: Cannot feed value of shape (1, 1) for Tensor 'Placeholder:0', which has shape '(?,)'
How do you get to an input shape of (?,)?
Thanks a lot.
Yes, it is possible. I have an object detection model that should be similar, and I can run it as follows in TensorFlow 1.14.0:
import cv2
flag, bts = cv2.imencode('.jpg', img)
inp = [bts[:,0].tobytes()]
out = sess.run([sess.graph.get_tensor_by_name('num_detections:0')],
               feed_dict={'encoded_image_string_tensor:0': inp})
I used netron to find my input.
In TensorFlow 2.0 it is even easier:
import cv2
import tensorflow as tf
flag, bts = cv2.imencode('.jpg', img)
inp = [bts[:,0].tobytes()]
saved_model_dir = '.'
loaded = tf.saved_model.load(export_dir=saved_model_dir)
infer = loaded.signatures["serving_default"]
out = infer(key=tf.constant('something_unique'), image_bytes=tf.constant(inp))
Also saved_model.pb is not a frozen_inference_graph.pb, see: What is difference frozen_inference_graph.pb and saved_model.pb?
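As a small illustration of the (?,) shape asked about in the question: a 1-D container holding one encoded string has shape (1,), which is compatible with (?,). The sketch below uses placeholder bytes instead of a real image file and uses NumPy only to inspect shapes; it is not the AutoML model's API, and the variable names are hypothetical.

```python
import base64
import numpy as np

# Placeholder bytes standing in for the contents of an image file (hypothetical data).
raw_bytes = b"\xff\xd8 not a real jpeg \xff\xd9"
encoded_image = base64.b64encode(raw_bytes).decode("utf-8")

# A list with one string becomes a 1-D array: one element along a single axis.
batch = np.array([encoded_image])
print(batch.shape, batch.ndim)

# Reshaping to (1, 1) adds a second axis, which no longer matches a 1-D placeholder.
print(batch.reshape(1, 1).shape)
```

This mirrors the answer's inp = [ ... ] construction, where the feed value is a plain Python list containing a single encoded image.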

Inputting numpy array images into a PyTorch neural net

I have a numpy array representation of an image and I want to turn it into a tensor so I can feed it through my pytorch neural network.
I understand that the neural networks take in transformed tensors which are not arranged in [100,100,3] but [3,100,100] and the pixels are rescaled and the images must be in batches.
So I did the following:
import cv2
import numpy as np
import torch
my_img = cv2.imread('testset/img0.png')
my_img.shape # returns (100, 100, 3): a 3-channel image with 100x100 resolution
my_img = np.transpose(my_img,(2,0,1))
my_img.shape # returns (3, 100, 100)
# convert the numpy array to a tensor
my_img_tensor = torch.from_numpy(my_img)
# rescale to be [0,1] like the data it was trained on by default
my_img_tensor *= (1/255)
# turn the tensor into a batch of size 1
my_img_tensor = my_img_tensor.unsqueeze(0)
# send image to gpu
# put forward through my neural network.
However this returns the error:
RuntimeError: _thnn_conv2d_forward is not implemented for type torch.ByteTensor
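The problem is that the input you give to your network is a ByteTensor, while conv-like operations are implemented only for floating-point types. Try the following:
my_img_tensor = my_img_tensor.type('torch.DoubleTensor')
# converts to a double tensor; if the model's weights are float32 (the PyTorch default),
# use 'torch.FloatTensor' instead so the input dtype matches the weights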
Source PyTorch Discussion Forum
Thanks to AlbanD
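An alternative to converting the tensor afterwards is to convert the dtype in NumPy before the tensor is created. The sketch below is an assumption about how that could look, using a random dummy array in place of the cv2.imread output and stopping just before the PyTorch call:

```python
import numpy as np

# Dummy uint8 image in OpenCV layout (height, width, channels); stands in for cv2.imread output.
my_img = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)

# Reorder to (channels, height, width) and convert to float32 before rescaling to [0, 1].
chw = np.transpose(my_img, (2, 0, 1)).astype(np.float32) / 255.0

# Add the batch axis: (1, 3, 100, 100).
batch = np.expand_dims(chw, axis=0)
print(batch.shape, batch.dtype)
```

Since torch.from_numpy preserves the array's dtype, passing batch to it would give a float32 tensor, so the rescaling no longer happens on a byte tensor.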

Keras, wrong input shape, but x.shape is right

I'm trying to set up a neural network using Keras in Python.
I get this error when trying to predict with my neural network:
ValueError: Error when checking : expected input_1 to have shape (12,) but got array with shape (1,)
However, if I print(x.shape), it returns (12,).
This is the code block:
def predict(str):
    y = convert(str)
    x = data = np.array(y, dtype='int64')
    with graph.as_default():
        #perform the prediction
        out = model.predict(x)
        print ("debug3")
        #convert the response to a string
        response = np.array_str(np.argmax(out,axis=1))
        return response
Keras models hide the batch dimension in the input shape, so the model actually expects (samples, 12), where each sample has 12 features. In your case, x has shape (12,), which Keras interprets as 12 samples with one feature each; hence the reported shape (1,).
Either your data is a single data point and you need to turn it into a 2D array of shape (1, 12), or you need to change your model to input_shape=(1,).
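For the first option, a minimal sketch of turning one sample into a batch of size 1 is shown below. The values in x are hypothetical; only the shapes matter.

```python
import numpy as np

# Hypothetical feature vector for a single sample with 12 features.
x = np.arange(12, dtype="int64")
print(x.shape)  # (12,)

# Add a leading batch axis so the array reads as 1 sample with 12 features.
x_batch = x.reshape(1, -1)
print(x_batch.shape)  # (1, 12)

# np.expand_dims gives the same result.
x_batch_alt = np.expand_dims(x, axis=0)
print(x_batch_alt.shape)  # (1, 12)
```

The reshaped array is what would be passed to model.predict in place of x.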

Error when using Inception on TensorFlow (Same output for all pictures)

I'm trying to train a network on the CIFAR-10 dataset, but instead of using the pictures I want to use the features from Inception's next-to-last layer.
So I wrote a little piece of code to pass all the pictures through Inception and get the features. Here it is:
def run_inference_on_images(images):
    #Creates graph from saved GraphDef.
    features_vec = np.ndarray(shape=(len(images),2048),dtype=np.float32)
    with tf.Session() as sess:
        # Some useful tensors:
        # 'pool_3:0': A tensor containing the next-to-last layer containing 2048
        # float description of the image.
        # 'DecodeJpeg:0': A numpy array of the image
        # Runs the pool_3 tensor by feeding the image data as input to the graph.
        length = len(images)
        for i in range(length):
            print ('inferencing image number',i,'out of', length)
            features_tensor = sess.graph.get_tensor_by_name('pool_3:0')
            features = sess.run(features_tensor,
                                {'DecodeJpeg:0': images[i]})
            features_vec[i] = np.squeeze(features)
    return features_vec
"images" is the CIFAR-10 dataset. It's a numpy array with shape (50000,32,32,3)
The problem I'm facing is that the "features" output is always the same, even when I feed different pictures to the "sess.run()" part.
Am I missing something?
I was able to solve this issue. It seems that Inception doesn't work with numpy arrays like I thought, so I converted the array to a JPEG picture and only then fed it to the network.
Below is the code which works (rest is the same):
def run_inference_on_images(images):
    # Creates graph from saved GraphDef.
    features_vec = np.ndarray(shape=(len(images),2048),dtype=np.float32)
    with tf.Session() as sess:
        features_tensor = sess.graph.get_tensor_by_name('pool_3:0')
        length = len(images)
        for i in range(length):
            # write the array out as a JPEG file, then read back the encoded bytes
            im = Image.fromarray(images[i],'RGB')
            im.save("tmp.jpeg")
            data = tf.gfile.FastGFile("tmp.jpeg", 'rb').read()
            print ('inferencing image number',i,'out of', length)
            features = sess.run(features_tensor,
                                {'DecodeJpeg/contents:0': data})
            features_vec[i] = np.squeeze(features)
    return features_vec
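The temporary file in the loop above is one way to obtain JPEG bytes. As a sketch of an alternative (an assumption, not part of the original answer), Pillow can encode into an in-memory buffer instead. The array below is random dummy data with CIFAR-10 dimensions:

```python
import io

import numpy as np
from PIL import Image

# Dummy CIFAR-10-sized image: 32x32 RGB, uint8 (hypothetical data).
image_array = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Encode to JPEG in an in-memory buffer rather than a temporary file.
buffer = io.BytesIO()
Image.fromarray(image_array).save(buffer, format="JPEG")
data = buffer.getvalue()

# JPEG streams start with the two bytes FF D8.
print(data[:2] == b"\xff\xd8")
```

The resulting data bytes would then be fed to 'DecodeJpeg/contents:0' in the same way as the bytes read from tmp.jpeg.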
Not sure, but you might try moving the line
features_tensor = sess.graph.get_tensor_by_name('pool_3:0')
features_tensor = tf.get_tensor_by_name('pool_3:0')
out of the inference part and into the model creation part.

