Parsing CSV into PyTorch tensors - python

I have a CSV file with all numeric values except the header row. When trying to build tensors, I get the following exception:
Traceback (most recent call last):
File "pytorch.py", line 14, in <module>
test_tensor = torch.tensor(test)
ValueError: could not determine the shape of object type 'DataFrame'
This is my code:
import torch
import dask.dataframe as dd
device = torch.device("cuda:0")
print("Loading CSV...")
test = dd.read_csv("test.csv", encoding = "UTF-8")
train = dd.read_csv("train.csv", encoding = "UTF-8")
print("Converting to Tensor...")
test_tensor = torch.tensor(test)
train_tensor = torch.tensor(train)
Using pandas instead of Dask for CSV parsing produced the same error. I also tried to specify dtype=torch.float64 inside the call to torch.tensor(data), but got the same error again.

Try converting it to an array first:
test_tensor = torch.Tensor(test.values)
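Note that with Dask, test.values is a lazy dask array rather than a NumPy array, so torch may still reject it. A minimal sketch, assuming the data fits in memory, is to materialize the frame first with .compute():
import dask.dataframe as dd
import torch

test = dd.read_csv("test.csv", encoding="UTF-8")
# .compute() turns the lazy Dask DataFrame into an in-memory pandas DataFrame
test_tensor = torch.tensor(test.compute().values)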

I think you're just missing .values
import torch
import pandas as pd
train = pd.read_csv('train.csv')
train_tensor = torch.tensor(train.values)

Newer versions of pandas recommend using to_numpy() instead of .values:
train_tensor = torch.tensor(train.to_numpy())
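If a specific dtype is needed (the question tried torch.float64), one option, sketched here on the assumption that the frame is entirely numeric, is to cast during the NumPy conversion; the tensor then inherits the array's dtype:
import numpy as np
import pandas as pd
import torch

train = pd.read_csv('train.csv')
# cast to float64 while converting; torch.tensor preserves the NumPy dtype
train_tensor = torch.tensor(train.to_numpy(dtype=np.float64))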

Using only NumPy:
import numpy as np
import torch
tensor = torch.from_numpy(
    # skip_header=1 drops the header row, which would otherwise parse as NaN
    np.genfromtxt("train.csv", delimiter=",", skip_header=1)
)
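The question also builds device = torch.device("cuda:0") but never uses it; the tensor has to be moved explicitly. A minimal sketch, assuming a CUDA-capable GPU is available:
import numpy as np
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# build the tensor on the CPU first, then move it to the chosen device
train_tensor = torch.from_numpy(
    np.genfromtxt("train.csv", delimiter=",", skip_header=1)
).to(device)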

Related

Loading binary data with FixedLengthRecordDataset in TensorFlow

I'm trying to figure out how to load a binary data file using FixedLengthRecordDataset:
import tensorflow as tf
import struct
import numpy as np
RAW_N = 2 + 20*20 + 1
def convert_binary_to_float_array(register):
    return struct.unpack('f'*RAW_N, register.numpy())

raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['mydata.bin'], record_bytes=RAW_N*4)
float_ds = raw_dataset.map(map_func=convert_binary_to_float_array)
This code throws:
AttributeError: in user code:
tf-load-data.py:14 convert_binary_to_float_array *
return struct.unpack('f'*RAW_N, register.numpy())
AttributeError: 'Tensor' object has no attribute 'numpy'
numpy() is available if I try to iterate over the dataset:
raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['mydata.bin'], record_bytes=RAW_N*4)
for register in raw_dataset:
    print(struct.unpack('f'*RAW_N, register.numpy()))
By reading the Tensor type description, I realized that numpy() is available only during eager execution. Thus, I can deduce that during the map() call the elements are not provided as EagerTensor.
How to load this data into a dataset?
I'm using TensorFlow 2.4.1
I would suggest working with tf.io.decode_raw. Unfortunately, I do not know what mydata.bin looks like, so I created some dummy data:
import random
import struct
import tensorflow as tf

RAW_N = 2 + 20*20 + 1

# write RAW_N*4 dummy integers, i.e. a few fixed-length records
bytess = random.sample(range(1, 5000), RAW_N*4)
with open('mydata.bin', 'wb') as f:
    f.write(struct.pack('1612i', *bytess))

def convert_binary_to_float_array(register):
    # decode_raw works on the serialized tensor inside map(), no numpy() needed
    return tf.io.decode_raw(register, out_type=tf.float32)

raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['/content/mydata.bin'], record_bytes=RAW_N*4)
raw_dataset = raw_dataset.map(convert_binary_to_float_array)

for register in raw_dataset:
    print(register)
You could also try first decoding your data into integers with tf.io.decode_raw and then casting to float with tf.cast, but I am not sure if it will make a difference.
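That variant might look like the sketch below; as noted above, the result should be equivalent:
import tensorflow as tf

RAW_N = 2 + 20*20 + 1

def convert_binary_to_int_then_float(register):
    # decode the raw bytes as 32-bit integers, then cast to float32
    ints = tf.io.decode_raw(register, out_type=tf.int32)
    return tf.cast(ints, tf.float32)

raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['mydata.bin'], record_bytes=RAW_N*4)
float_ds = raw_dataset.map(convert_binary_to_int_then_float)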

xarray cannot directly convert an xarray.Dataset into a numpy array

I'm experiencing an error that I cannot seem to resolve when attempting to convert a Dataset to an array using xarray. I ran into this while attempting to add a time dimension to a netcdf file (open the netcdf, add a timestamp that is the same across all data, save the netcdf back out).
import os

import pandas as pd
import xarray as xr

scriptpath = os.path.dirname(os.path.abspath(__file__))
outputfile = scriptpath + '\\20210629_deadgrass.aus.nc'
# 'yesterday' is a datetime defined earlier in the script
times = pd.to_datetime(str(yesterday.strftime('%Y%m%d')))
time_da = xr.Dataset({"time": times})
arr = xr.open_dataset(outputfile)
ds = arr.to_array()
dst = ds.expand_dims(time=time_da)  # errors here
The error I'm receiving is
Exception has occurred: TypeError
cannot directly convert an xarray.Dataset into a numpy array. Instead, create an xarray.DataArray first, either with indexing on the Dataset or by invoking the `to_array()` method.
File "Z:\UpdateAussieGRASS.py", line 101, in <module>
dst = ds.expand_dims(time=time_da)
I can't seem to work out what I'm doing wrong with to_array() in the second last line. Examples of to_array() are here. Autogenerated documentation is here.
ds is already an xarray.DataArray. The error occurs on this line:
dst = ds.expand_dims(time=time_da)  # errors here
because while ds is a DataArray, time_da is not. This should work:
dst = ds.expand_dims(time=time_da.to_array())
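For the broader goal (open the netcdf, stamp one time across all data, save it back out), a sketch that sidesteps the Dataset/DataArray mismatch entirely is to pass a plain list of timestamps to expand_dims. Here yesterday is the datetime from the question and the output filename is hypothetical:
import pandas as pd
import xarray as xr

timestamp = pd.to_datetime(yesterday.strftime('%Y%m%d'))  # 'yesterday' as in the question
ds = xr.open_dataset(outputfile)
# expand_dims accepts a mapping from dimension name to coordinate values
ds = ds.expand_dims(time=[timestamp])
ds.to_netcdf('20210629_deadgrass.aus_with_time.nc')  # hypothetical output path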

Data import error in Python

I'm trying to import the MNIST dataset in Python as follows:
import h5py
import numpy as np

f = h5py.File("mnist.h5", "r")
x_test = f["x_test"]
x_train = f["x_train"]
y_test = f["y_test"]
y_train = f["y_train"]
The type of, say, y_train is h5py._hl.dataset.Dataset.
I want to convert them to float for mathematical convenience. I try this:
D = x_train.astype(float)
y_train = y_train.astype(float)+np.ones((60000,1));
but I get this traceback:
Traceback (most recent call last):
File "<ipython-input-14-f3677d523d45>", line 1, in <module>
y_train = y_train.astype(float)+np.ones((60000,1));
TypeError: unsupported operand type(s) for +: 'AstypeContext' and 'float'
What am I missing? Thanks.
You are using two different libraries that have two completely different meanings for astype.
If you were doing this in numpy, something like this works:
a = np.array([1, 2, 3])
a = a.astype(float) + np.ones((60000,1))
But in h5py, astype is a different function, meant to be used as a context manager:
This will throw the same error as what you are getting:
import h5py
import numpy as np

f = h5py.File('mytestfile.hdf5', 'w')
dset = f.create_dataset("default", (100,))
dset.astype(float) + np.ones((60000,1))
But the code below will work (see astype in the h5py docs):
f = h5py.File('mytestfile.hdf5', 'w')
dset = f.create_dataset("default", (100,))
with dset.astype('float'):
    out = dset[:]
out += np.ones((100,))
This problem is similar to Creating reference to HDF dataset in H5py using astype
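If the end goal is just a float NumPy array to do maths on, a simpler route, sketched here on the assumption that the datasets fit in memory, is to materialize each h5py dataset first and cast in NumPy:
import h5py
import numpy as np

with h5py.File("mnist.h5", "r") as f:
    # [:] reads the h5py Dataset into a NumPy array; astype then behaves as in NumPy
    y_train = f["y_train"][:].astype(float)
y_train = y_train + 1.0  # ordinary NumPy arithmetic now applies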

How to read a csv file and plot confusion matrix in python

I have a CSV file with 2 columns, like this:
actual,predicted
1,0
1,0
1,1
0,1
.,.
.,.
How do I read this file and plot a confusion matrix in Python?
I tried the following code:
import pandas as pd
from sklearn.metrics import confusion_matrix
import numpy
CSVFILE='./mappings.csv'
test_df=pd.read_csv[CSVFILE]
actualValue=test_df['actual']
predictedValue=test_df['predicted']
actualValue=actualValue.values
predictedValue=predictedValue.values
cmt=confusion_matrix(actualValue,predictedValue)
print cmt
but it gives me this error.
Traceback (most recent call last):
File "confusionMatrixCSV.py", line 7, in <module>
test_df=pd.read_csv[CSVFILE]
TypeError: 'function' object has no attribute '__getitem__'
pd.read_csv is a function. You call a function in Python by using parentheses.
You should use pd.read_csv(CSVFILE) instead of pd.read_csv[CSVFILE].
import pandas as pd
from sklearn.metrics import confusion_matrix

CSVFILE = './mappings.csv'
test_df = pd.read_csv(CSVFILE)
# the columns already hold 0/1 labels, so they can be passed in directly
actualValue = test_df['actual'].values
predictedValue = test_df['predicted'].values
cmt = confusion_matrix(actualValue, predictedValue)
print(cmt)
Here's a simple solution to calculate the accuracy and print the confusion matrix for input in the format mentioned in the question.
from sklearn.metrics import accuracy_score, confusion_matrix

result = []
actual = []
with open("results.txt", "r") as file:
    next(file)  # skip the 'actual,predicted' header row
    for line in file:
        # each line holds 'actual,predicted'
        fields = line.split(",")
        actual.append(int(fields[0]))
        result.append(int(fields[1]))

cnf_mat = confusion_matrix(actual, result)
print(cnf_mat)
print('Test Accuracy:', accuracy_score(actual, result))
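Neither snippet actually draws the plot the question asks for. A sketch using scikit-learn's ConfusionMatrixDisplay (available in recent scikit-learn versions) together with matplotlib:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

test_df = pd.read_csv('./mappings.csv')
cmt = confusion_matrix(test_df['actual'], test_df['predicted'])
# render the matrix as a labelled heatmap
ConfusionMatrixDisplay(confusion_matrix=cmt).plot()
plt.show()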

How can I split a Dataset from a .csv file for Training and Testing?

I'm using Python and I need to split my imported .csv data into two parts, a training and a test set, e.g. 70% training and 30% test.
I keep getting various errors, such as 'list' object is not callable and so on.
Is there any easy way of doing this?
Thanks
EDIT:
The code is basic, I'm just looking to split the dataset.
from csv import reader
with open('C:/Dataset.csv', 'r') as f:
    data = list(reader(f)) #Imports the CSV
data[0:1] ( data )
TypeError: 'list' object is not callable
You can use pandas:
import pandas as pd
import numpy as np
df = pd.read_csv('C:/Dataset.csv')
# boolean mask: roughly 70% of rows go to train, the rest to test
msk = np.random.rand(len(df)) <= 0.7
train = df[msk]
test = df[~msk]
A better practice, and arguably more random, is to use df.sample:
from numpy.random import RandomState
import pandas as pd
df = pd.read_csv('C:/Dataset.csv')
rng = RandomState()
train = df.sample(frac=0.7, random_state=rng)
test = df.loc[~df.index.isin(train.index)]
You should use the read_csv() function from the pandas module. It reads all your data straight into a dataframe, which you can then use to split your data into train and test sets. Alternatively, you can use the train_test_split() function from the scikit-learn module, as sketched below.
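A minimal sketch of that train_test_split() route:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('C:/Dataset.csv')
# 70% train / 30% test; random_state makes the split reproducible
train, test = train_test_split(df, test_size=0.3, random_state=42)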
