I'm trying to figure out how to load a binary data file using FixedLengthRecordDataset:
import tensorflow as tf
import struct
import numpy as np
RAW_N = 2 + 20*20 + 1
def convert_binary_to_float_array(register):
    return struct.unpack('f'*RAW_N, register.numpy())

raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['mydata.bin'], record_bytes=RAW_N*4)
float_ds = raw_dataset.map(map_func=convert_binary_to_float_array)
This code throws:
AttributeError: in user code:

    tf-load-data.py:14 convert_binary_to_float_array  *
        return struct.unpack('f'*RAW_N, register.numpy())

    AttributeError: 'Tensor' object has no attribute 'numpy'
numpy() is available if I try to iterate over the dataset:
raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['mydata.bin'], record_bytes=RAW_N*4)
for register in raw_dataset:
    print(struct.unpack('f'*RAW_N, register.numpy()))
By reading the Tensor type description, I realized that numpy() is available only during eager execution. Thus, I can deduce that during the map() call the elements are not provided as EagerTensors.
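From what I understand, one workaround would be to wrap the conversion in tf.py_function, which runs a plain Python function eagerly inside the pipeline; a sketch of what I mean (untested):

def convert_eagerly(register):
    # register is an EagerTensor here, so .numpy() works
    return np.array(struct.unpack('f'*RAW_N, register.numpy()), dtype=np.float32)

float_ds = raw_dataset.map(
    lambda register: tf.py_function(func=convert_eagerly, inp=[register], Tout=tf.float32))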
How to load this data into a dataset?
I'm using TensorFlow 2.4.1
I would suggest working with tf.io.decode_raw. Unfortunately, I do not know what mydata.bin looks like, so I created some dummy data:
import random
import struct
import tensorflow as tf
import numpy as np

RAW_N = 2 + 20*20 + 1

# RAW_N*4 = 1612 random integers, each packed into 4 bytes -> 4 records of RAW_N*4 bytes
bytess = random.sample(range(1, 5000), RAW_N*4)
with open('mydata.bin', 'wb') as f:
    f.write(struct.pack(f'{RAW_N*4}i', *bytess))
def convert_binary_to_float_array(register):
    return tf.io.decode_raw(register, out_type=tf.float32)

raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['/content/mydata.bin'], record_bytes=RAW_N*4)
raw_dataset = raw_dataset.map(convert_binary_to_float_array)

for register in raw_dataset:
    print(register)
You could also try first decoding your data into integers with tf.io.decode_raw and then casting to float with tf.cast, but I am not sure if it will make a difference.
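A sketch of that variant, assuming the records really do hold 32-bit integers; you would map this over the FixedLengthRecordDataset in place of convert_binary_to_float_array above:

def convert_binary_to_int_then_float(register):
    ints = tf.io.decode_raw(register, out_type=tf.int32)  # reinterpret the raw bytes as int32
    return tf.cast(ints, tf.float32)                      # then cast to float32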
I am using an Apple Mac M1:
OS: MacOS Monterey
Python 3.9.13
I want to implement a semantic search using SentenceTransformer.
Here is my code:
from sentence_transformers import SentenceTransformer
import faiss
from pprint import pprint
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def load_index():
    index = faiss.read_index("movie_plot.index")
    return index

def fetch_movie_info(dataframe_idx):
    # df is the movie-metadata DataFrame, loaded elsewhere (not shown here)
    info = df.iloc[dataframe_idx]
    meta_dict = dict()
    meta_dict['Title'] = info['Title']
    meta_dict['Plot'] = info['Plot'][:500]
    return meta_dict

def search(query, top_k, index, model):
    print("starting search!")
    t = time.time()
    query_vector = model.encode([query])
    top_k = index.search(query_vector, top_k)
    print('>>>> Results in Total Time: {}'.format(time.time()-t))
    top_k_ids = top_k[1].tolist()[0]
    top_k_ids = list(np.unique(top_k_ids))
    results = [fetch_movie_info(idx) for idx in top_k_ids]
    return results

def main():
    # GET MODEL
    model = SentenceTransformer('msmarco-distilbert-base-dot-prod-v3')
    print("model set!")

    # GET INDEX
    index = load_index()
    print("index loaded!")

    query = "Artificial Intelligence based action movie"
    results = search(query, top_k=5, index=index, model=model)

    print("\n")
    for result in results:
        print('\t', result)

main()
When I run the code above, it gets stuck with this error:
/opt/homebrew/Caskroom/miniforge/base/envs/searchapp/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
what is causing this and how can I fix it?
I had the same problem; upgrading to Python 3.9 solved it in my case.
It looks like it is an official bug: https://bugs.python.org/issue45209
Check your memory utilisation. You might be out of memory.
Since you're using a pretrained model for inference/prediction, you can reduce the memory requirements of the model by avoiding computation of the gradients. Gradients are only used for training the model typically.
You can wrap your model call with torch.no_grad like so:
import torch

with torch.no_grad():
    query_vector = model.encode([query])
You can also reduce the memory utilisation of a model by reducing the batch size (i.e. the number of samples passed to the model at one time), but that doesn't seem to apply here since you are encoding a single query.
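For completeness, a sketch of what that would look like when encoding many texts at once; batch_size is a parameter of SentenceTransformer.encode, while list_of_texts is a stand-in for your own data:

import torch

with torch.no_grad():
    # smaller batches reduce peak memory at the cost of throughput
    embeddings = model.encode(list_of_texts, batch_size=16)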
I have a machine learning model saved in *.rds format. I want to open this model in Python in order to make predictions. To do so, I installed rpy2. This is my Jupyter Notebook code:
!pip install rpy2

import json
import pandas as pd
import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects import numpy2ri
from rpy2.robjects.packages import importr

r = robjects.r
numpy2ri.activate()

model_rds_path = "model.rds"
model = r.readRDS(model_rds_path)

raw_data = '{"data":[[79],[63]]}'
data = json.loads(raw_data)["data"]
if type(data) is not np.ndarray:
    data = np.array(data)

result = r.predict(model, data, probability=False)
result
I get the following error at the line r.predict(…):
RRuntimeError: Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
'data' must be a data.frame, not a matrix or an array
Calls: <Anonymous> -> predict.lm -> model.frame -> model.frame.default
The training script in R looks as follows:
library(caret)
# Reading `data` from CSV file
x <- data$height
y <- data$weight
model <- lm(y~x)
# Test predictions
df_test_heights <- data.frame(x = as.numeric(c(115,20)))
result <- predict(model,df_test_heights)
print(result)
I am so confused and have spent the whole day trying to solve this issue! Does anybody know how to fix it? I would also appreciate it if somebody knows an alternative way (alternative to rpy2) to open RDS files from Python.
Thanks!
Here is an option with pyper:
import numpy as np
import pandas as pd
from pyper import *
import json

r = R(use_pandas=True)

model_rds_path = "model.rds"
r.assign("rmodel", model_rds_path)

raw_data = '{"data":[[79],[63]]}'
data = json.loads(raw_data)["data"]
if type(data) is not np.ndarray:
    # the column must be named 'x' to match the lm(y~x) training formula
    data = pd.DataFrame(np.array(data), columns=['x'])
r.assign("rdata", data)

# note: FALSE, not Python's False -- this string is evaluated by R
expr = 'model <- readRDS(rmodel); result <- predict(model, rdata, probability=FALSE)'
r(expr)
res = r.get('result')
The R function predict() will expect an R data frame for data. However, what you have at this point is a numpy array:
data = json.loads(raw_data)["data"]
if type(data) is not np.ndarray:
    data = np.array(data)
In Python, pandas's DataFrame objects are a closer conceptual equivalent to R data frames. This section of the rpy2 documentation might help you:
https://rpy2.github.io/doc/v3.2.x/html/pandas.html
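A minimal sketch of that conversion, assuming rpy2 3.x and a predictor column named x (to match the lm(y~x) formula in your training script):

import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# build a pandas DataFrame whose column name matches the training formula
pd_data = pd.DataFrame({'x': [79.0, 63.0]})

# convert it to an R data.frame before calling predict()
with localconverter(robjects.default_converter + pandas2ri.converter):
    r_data = robjects.conversion.py2rpy(pd_data)

model = robjects.r['readRDS']("model.rds")
result = robjects.r['predict'](model, r_data)
print(result)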
I am setting random seed for both random and numpy.random at the beginning of my main file:
import random
import numpy as np
np.random.seed(42)
random.seed(42)
import torch
Nevertheless, when I create a Net() object with randomly initialized parameters, it gives a completely different result every time:
net = neuralnet.Net()
print("initialized params: ", net.fc1.weight)
Note that neuralnet.Net() is in a different file and is a class that extends torch.nn.Module. It is torch.nn.Module that randomly initializes net.fc1.weight, not my own code.
How is it possible that the parameters are completely different every time, even though I set both seeds?
PyTorch has its own random number generator, which is not affected by random.seed or np.random.seed, so you need to seed it separately. Try:
import torch
torch.manual_seed(0)
For further information:
https://pytorch.org/docs/stable/notes/randomness.html
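A quick way to convince yourself, using a plain nn.Linear in place of your Net class (which I don't have):

import torch
import torch.nn as nn

torch.manual_seed(0)
net1 = nn.Linear(4, 2)

torch.manual_seed(0)
net2 = nn.Linear(4, 2)

# identical seeds produce identical initial weights
print(torch.equal(net1.weight, net2.weight))  # True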
Have you looked at: https://github.com/pytorch/pytorch/issues/7068?
There are some recommendations on how to reproduce the results.
Example:
import sys
import random
import datetime as dt
import numpy as np
import torch

torch.manual_seed(42)
torch.cuda.manual_seed(42)
np.random.seed(42)
random.seed(42)
torch.backends.cudnn.deterministic = True

features = torch.randn(2, 5)

# Print stuff.
fnp = features.view(-1).numpy()
print("Time: {}".format(dt.datetime.now()))
for el in fnp:
    print("{:.20f}".format(el))
print("Python: {}".format(sys.version))
print("Numpy: {}".format(np.__version__))
print("Pytorch: {}".format(torch.__version__))
I have CSV files with all numeric values except the header row. When trying to build tensors, I get the following exception:
Traceback (most recent call last):
  File "pytorch.py", line 14, in <module>
    test_tensor = torch.tensor(test)
ValueError: could not determine the shape of object type 'DataFrame'
This is my code:
import torch
import dask.dataframe as dd
device = torch.device("cuda:0")
print("Loading CSV...")
test = dd.read_csv("test.csv", encoding = "UTF-8")
train = dd.read_csv("train.csv", encoding = "UTF-8")
print("Converting to Tensor...")
test_tensor = torch.tensor(test)
train_tensor = torch.tensor(train)
Using pandas instead of Dask for CSV parsing produced the same error. I also tried to specify dtype=torch.float64 inside the call to torch.tensor(data), but got the same error again.
Try converting it to an array first:
test_tensor = torch.Tensor(test.values)
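Note that this assumes test is a pandas DataFrame. Since you are reading with Dask, you would first need to materialize it, e.g.:

test_tensor = torch.Tensor(test.compute().values)  # Dask DataFrame -> pandas -> numpy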
I think you're just missing .values:
import torch
import pandas as pd
train = pd.read_csv('train.csv')
train_tensor = torch.tensor(train.values)
Newer versions of pandas highly recommend using to_numpy instead of values:
train_tensor = torch.tensor(train.to_numpy())
Using only NumPy:
import numpy as np
import torch

# skip_header=1 skips the non-numeric header row
tensor = torch.from_numpy(
    np.genfromtxt("train.csv", delimiter=",", skip_header=1)
)
I am trying to read a bunch of png images in a directory into a Python list. After getting the images into the list, I would like to resize and do some reshaping, but it doesn't work. It doesn't actually throw an error; it just always prints out an empty dataset. Below is a sample of my code.
import pandas as pd
from sklearn.manifold import Isomap
from scipy import misc
from os import listdir
import glob

dset = []
for image_path in glob.glob("/Documents/Python/DAT210x-master/Module4/Datasets/ALOI/32/*.png"):
    img = misc.imread(image_path)
    img = img[::2, ::2]            # downsample by a factor of 2 in each dimension
    X = (img / 255.0).reshape(-1)  # scale to [0, 1] and flatten
    dset.append(X)

dset = pd.DataFrame(dset)
print(dset)
If I'm not mistaken, DataFrame objects are typically initialized from dict objects. What happens if you add:
dset = {'Pictures': dset}
dset = pd.DataFrame(dset)