Inputting numpy array images into a PyTorch neural net - python

I have a numpy array representation of an image and I want to turn it into a tensor so I can feed it through my pytorch neural network.
I understand that the network expects tensors laid out as [3,100,100] rather than [100,100,3], with the pixel values rescaled and the images arranged in batches.
So I did the following:
import cv2
import numpy as np
import torch
my_img = cv2.imread('testset/img0.png')
my_img.shape #returns [100,100,3], a 3-channel image with 100x100 resolution
my_img = np.transpose(my_img,(2,0,1))
my_img.shape #returns [3,100,100]
#convert the numpy array to tensor
my_img_tensor = torch.from_numpy(my_img)
#rescale to be [0,1] like the data it was trained on by default
my_img_tensor *= (1/255)
#turn the tensor into a batch of size 1
my_img_tensor = my_img_tensor.unsqueeze(0)
#send image to gpu
#put forward through my neural network.
However this returns the error:
RuntimeError: _thnn_conv2d_forward is not implemented for type torch.ByteTensor

The problem is that the input you give to your network is of type ByteTensor, while convolution-like operations are only implemented for floating-point types. Try the following:
my_img_tensor = my_img_tensor.type('torch.DoubleTensor')
# converts the tensor to a double tensor
Source PyTorch Discussion Forum
Thanks to AlbanD
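As a minimal sketch of the whole preprocessing path, the snippet below casts to floating point before rescaling, so no arithmetic is done on the uint8 data. It assumes the model's weights are float32 (the PyTorch default), which is why it uses .float() rather than a double tensor; the random array and the small Conv2d layer are stand-ins for the real image and network, not part of the original post.

```python
import numpy as np
import torch

# Stand-in for the output of cv2.imread: an HxWx3 uint8 array (hypothetical data)
my_img = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)

# HWC -> CHW, made contiguous so torch.from_numpy gets a plain memory layout
chw = np.ascontiguousarray(np.transpose(my_img, (2, 0, 1)))

# Cast to float32 first, then rescale to [0, 1]
my_img_tensor = torch.from_numpy(chw).float() / 255.0

# Add the batch dimension: [3, 100, 100] -> [1, 3, 100, 100]
batch = my_img_tensor.unsqueeze(0)

# Hypothetical layer standing in for the first layer of the real network
conv = torch.nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
out = conv(batch)
```

If the network lives on the GPU, the batch would also need to be moved there (for example with batch.to(device)) before the forward pass.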


Changing shapes of PyTorch tensors and numpy arrays

I'm using the CLIP model from Hugging Face to generate image embeddings, and I'm struggling with the shape of the output.
I'm trying to get a numpy array of shape (n, 512), where n is the number of samples and 512 is the embedding size of the CLIP model. However, I'm getting an array of shape (n,) in which each element has shape (512,).
I have tried different functions such as squeeze and reshape, but nothing has worked so far.
This is my code to generate a Series of embeddings for a given df with image URLs:
# initialize model and processor:
device = "cuda" if torch.cuda.is_available() else "cpu"
model_ID = "openai/clip-vit-base-patch32"
# Save the model to device
model = CLIPModel.from_pretrained(model_ID).to(device)
# Get the processor
processor = CLIPProcessor.from_pretrained(model_ID)
# create image embedding
def embed_url_img(img_url):
    """ Create embeddings for a given image URL """
    inputs = processor(images =,
    return model.get_image_features(inputs).squeeze(0).cpu().detach().numpy()
df['embeddings'] = df['url'].apply(embed_url_img)
This post helped:
how to convert a Series of arrays into a single matrix in pandas/numpy?
to transform the Series into a matrix:
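One common way to do this, shown here as a sketch rather than the exact code from the linked post, is np.stack over the Series values. The Series below is hypothetical random data standing in for df['embeddings'].

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['embeddings']: 4 rows, each a (512,) vector
embeddings = pd.Series([np.random.rand(512).astype(np.float32) for _ in range(4)])

# Converting the Series directly gives a 1-D object array of arrays
as_object = embeddings.to_numpy()

# np.stack joins the per-row vectors along a new first axis
matrix = np.stack(embeddings.to_numpy())

print(as_object.shape, matrix.shape)
```

np.stack requires every element to have the same shape, so a row with a missing or malformed embedding would need to be filtered out first.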

InvalidArgumentError: Graph execution error: TensorFlow pose estimation using OpenCV

I am trying to build pose detection using cv2 and TensorFlow in Google Colab.
I am running into the following error:
import tensorflow as tf
import tensorflow_hub as hub
import cv2
from matplotlib import pyplot as plt
import numpy as np
from google.colab.patches import cv2_imshow
model = hub.load('')
movenet = model.signatures['serving_default']
img_original = cv2.imread('/content/brandon-atchison-eexdeq3NleQ-unsplash.jpeg',1)
img_copy = img_original.copy()
input_img = tf.cast(img_original,dtype=tf.int32)
tensor = tf.convert_to_tensor(img_original,dtype=tf.int32)
results = movenet(tensor)
I created the variable img_copy because I need to perform some operations on the image and want to keep the original as it is. I am not sure what error I am hitting when trying to get results from the movenet model.
Add a batch dimension before calling the model:
results = movenet(tensor[None, ...])
You are missing the batch dimension, which is needed to feed data to your model. You could also use tf.expand_dims:
tensor = tf.expand_dims(tensor, axis=0)
# resize
tensor = tf.image.resize(tensor, [32 * 186, 32 * 125])
Here is a working example:
import tensorflow as tf
import tensorflow_hub as hub
model = hub.load('')
movenet = model.signatures['serving_default']
tensor = tf.random.uniform((1, 160, 256, 3), minval=0, maxval=255, dtype=tf.int32)
results = movenet(tensor)
Check the model description and make sure you have the correct shape:
A frame of video or an image, represented as an int32 tensor of dynamic shape: 1xHxWx3, where H and W need to be a multiple of 32 and the larger dimension is recommended to be 256. To prepare the input image tensor, one should resize (and pad if needed) the image such that the above conditions are hold. Please see the Usage section for more detailed explanation. Note that the size of the input image controls the tradeoff between speed vs. accuracy so choose the value that best suits your application. The channel order is RGB with values in [0, 255].
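The requirements in that description can be sketched as a small preprocessing helper. This is an illustration under assumptions, not the answerer's code: prepare_movenet_input is a hypothetical name, the 160x256 target is just one size that satisfies the multiple-of-32 rule, and the random array stands in for the result of cv2.imread (which returns BGR, hence the channel flip).

```python
import numpy as np
import tensorflow as tf

def prepare_movenet_input(bgr_image, target_h=160, target_w=256):
    """Hypothetical helper: BGR uint8 HxWx3 array -> int32 tensor of shape 1xHxWx3."""
    # OpenCV loads images as BGR; the model description asks for RGB
    rgb = np.ascontiguousarray(bgr_image[..., ::-1])
    # Add the batch dimension: HxWx3 -> 1xHxWx3
    batched = tf.expand_dims(tf.convert_to_tensor(rgb), axis=0)
    # Resize with padding so the aspect ratio is kept; the result is float32
    resized = tf.image.resize_with_pad(batched, target_h, target_w)
    # Cast back to int32, as the model description requires
    return tf.cast(resized, tf.int32)

# Stand-in for cv2.imread output (hypothetical data)
fake_image = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
tensor = prepare_movenet_input(fake_image)
```

The cast matters because tf.image.resize and tf.image.resize_with_pad return float32 with the default bilinear method, while the signature expects int32.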

How to do Inference and Transfer Learning with TensorFlow Frozen GraphDef (single saved_model.pb) from Google AutoML Vision Classification

I am using an exported classification model from Google AutoML Vision, hence I only have a saved_model.pb and no variables, checkpoints etc.
I want to load this model graph into a local TensorFlow installation, use it for inference and continue training with more pictures.
Main questions:
Is this plan possible, i.e. to use a single saved_model.pb without variables, checkpoints etc. and train the resulting graph with new data?
If yes: How do you get to an input shape of (?,) with images encoded as strings?
Ideally, looking ahead: Any important thing to consider for the training part?
Background information about the code:
To read the image, I use the same approach as you would when using the Docker container for inference, hence base64 encoded image.
To load the graph, I checked what tag set the graph needs via CLI (saved_model_cli show --dir input/model) which is serve.
To get the input tensor names I use graph.get_operations(), which gives me Placeholder:0 for image_bytes and Placeholder_1:0 for the key (just an arbitrary string to identify the image). Both have dimension -1.
import tensorflow as tf
import numpy as np
import base64
path_img = "input/testimage.jpg"
path_mdl = "input/model"
# input to network expected to be base64 encoded image
with, 'rb') as image_file:
encoded_image = base64.b64encode('utf-8')
# reshaping to (1,) as the expected dimension is (?,)
feed_dict_option1 = {
    "Placeholder:0": { np.array(str(encoded_image)).reshape(1,) },
    "Placeholder_1:0" : "image_key"
}
# reshaping to (1,1) as the expected dimension is (?,)
feed_dict_option2 = {
    "Placeholder:0": np.array(str(encoded_image)).reshape(1,1),
    "Placeholder_1:0" : "image_key"
}
with tf.Session(graph=tf.Graph()) as sess:
tf.saved_model.loader.load(sess, ["serve"], path_mdl)
graph = tf.get_default_graph()'scores:0',
# for input reshaped to (1,)
ValueError: Cannot feed value of shape (1,) for Tensor 'Placeholder:0', which has shape '(?,)'
# for input reshaped to (1,1)
ValueError: Cannot feed value of shape (1, 1) for Tensor 'Placeholder:0', which has shape '(?,)'
How do you get to an input shape of (?,)?
Thanks a lot.
Yes, it is possible. I have an object detection model that should be similar, and I can run it as follows in TensorFlow 1.14.0:
import cv2 as cv
flag, bts = cv.imencode('.jpg', img)
inp = [bts[:,0].tobytes()]
out =[sess.graph.get_tensor_by_name('num_detections:0'),
feed_dict={'encoded_image_string_tensor:0': inp})
I used netron to find my input.
In TensorFlow 2.0 it is even easier:
import cv2 as cv
import tensorflow as tf
flag, bts = cv.imencode('.jpg', img)
inp = [bts[:,0].tobytes()]
saved_model_dir = '.'
loaded = tf.saved_model.load(export_dir=saved_model_dir)
infer = loaded.signatures["serving_default"]
out = infer(key=tf.constant('something_unique'), image_bytes=tf.constant(inp))
Also saved_model.pb is not a frozen_inference_graph.pb, see: What is difference frozen_inference_graph.pb and saved_model.pb?
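On the shape question itself: a placeholder of shape (?,) takes a 1-D batch, so one way to read the answer's inp = [...] is "a list holding one encoded image". The sketch below shows that idea with plain numpy. The byte string is a made-up stand-in for real JPEG data, and feeding the key as a 1-D batch as well is an assumption based on the question's note that both placeholders have dimension -1.

```python
import numpy as np

# Made-up stand-in for the encoded image bytes (not a real JPEG)
fake_jpeg_bytes = b"\xff\xd8\xff\xe0 placeholder image data \x00\x00"

# One image -> a batch of length 1, i.e. shape (1,)
# dtype=object keeps the bytes intact; a fixed-width bytes dtype would
# strip trailing null bytes
image_batch = np.array([fake_jpeg_bytes], dtype=object)

# Assumption: the key placeholder is also (?,), so it gets a batch of length 1
key_batch = np.array(["image_key"], dtype=object)

print(image_batch.shape, key_batch.shape)
```

In the question's code, by contrast, the key is passed as a bare string, which has shape () rather than (1,).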

Save Audio Features extracted using Librosa in a multichannel Numpy array

I am trying to extract features from audio files using Librosa, to feed to a CNN as Numpy arrays.
Currently I save a single feature at a time to feed into the CNN. I save two-dimensional (single-channel) log-scaled mel-spectrogram features in Python using Librosa:
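import librosa
import numpy as np
def build_features():
    y, sr = librosa.load("audio.wav")
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,  # assumed: the waveform and sample rate loaded above
        n_mels=128,  # Mel bins
    )
    logamplitude = librosa.amplitude_to_db
    logspec = logamplitude(mel, ref=1.0)[np.newaxis, :, :, np.newaxis]
    return logspec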
This gives the shape (1,128,323,1).
I would like to add another feature, let's say a tempogram. I can do this using the same code, replacing melspectrogram with tempogram and setting the window length to 128.
This gives me a tempogram of shape (1,128,323,1).
Now I would like to "stack" these two feature layers into a multi-channel numpy object that I can feed into a CNN in Keras.
How should I code this?
I think I figured it out, using np.vstack().
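It may be worth checking the resulting shapes before settling on one function. As a sketch with zero-filled arrays standing in for the two features, np.vstack joins along the first axis, while np.concatenate on the last axis gives the two-channel, channels-last layout that Keras convolution layers expect by default.

```python
import numpy as np

# Zero-filled stand-ins for the two features, each of shape (1, 128, 323, 1)
logspec = np.zeros((1, 128, 323, 1), dtype=np.float32)
tempogram = np.zeros((1, 128, 323, 1), dtype=np.float32)

# vstack joins along axis 0: two single-channel samples
stacked_rows = np.vstack([logspec, tempogram])

# concatenate on the last axis: one sample with two channels
stacked_channels = np.concatenate([logspec, tempogram], axis=-1)

print(stacked_rows.shape)      # (2, 128, 323, 1)
print(stacked_channels.shape)  # (1, 128, 323, 2)
```

With the channel layout, the first Conv2D layer would take input_shape=(128, 323, 2).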

How to format training input and output data on Keras

I am new to Deep Learning and I struggle with some data format on Keras. My CNN is based on the Stacked Hourglass Networks for Human Pose Estimation from A.Newell et al.
In this network the input is a 256x256 RGB image and the output should be a 64x64 heatmap highlighting body joints (shoulder, knee, ...). I managed to build the network and I have all the data (images) with their annotations (pixel labels for body joints). How should I format the input and output data of the training set to train my model? Currently I use a numpy array of shape (256,256,3) for an image, and I don't know how to format my output. Should I create an array of shape [n,64,64,7], with n being the size of the training set and 7 the number of filters I use to obtain a heatmap for 7 joints?
Thank you for your time.
The output can also be a numpy array.
Consider this example:
Training set: 50 images of size 256x256x3. These can be combined into a single numpy array of shape (50, 256, 256, 3).
The output data can be formatted in a similar way.
Sample code below:
# a, b and c are arrays of size 256x256x3
import numpy as np
temp = [a, b, c]
output_labels = np.stack(temp)
The output_labels array will be of shape (3, 256, 256, 3).
Keras recommends creating a data generator to feed the training data and ground truth to the network.
For the stacked hourglass network specifically, you can refer to my implementation for details.
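For the [n,64,64,7] target proposed in the question, a common approach (assumed here, not taken from the implementation mentioned above) is to render one Gaussian blob per joint. In this sketch joints_to_heatmaps, the sigma value and the random data are all hypothetical.

```python
import numpy as np

def joints_to_heatmaps(joints_xy, size=64, sigma=1.5):
    """Hypothetical helper: (num_joints, 2) array of (x, y) -> (size, size, num_joints)."""
    ys, xs = np.mgrid[0:size, 0:size]
    heatmaps = np.zeros((size, size, len(joints_xy)), dtype=np.float32)
    for j, (x, y) in enumerate(joints_xy):
        # Gaussian centred on the joint, with value 1.0 at the joint pixel
        heatmaps[:, :, j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps

n, num_joints = 5, 7
rng = np.random.default_rng(0)

# Made-up inputs: n RGB images and n sets of joint coordinates in heatmap pixels
images = rng.random((n, 256, 256, 3), dtype=np.float32)
joints = rng.integers(0, 64, size=(n, num_joints, 2))

# One (64, 64, 7) target per image, stacked into (n, 64, 64, 7)
targets = np.stack([joints_to_heatmaps(j) for j in joints])

print(images.shape, targets.shape)
```

Annotations given in 256x256 image coordinates would need to be divided by 4 to land on the 64x64 heatmap grid.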

