I'm trying to import the MNIST dataset in Python as follows:
import h5py
f = h5py.File("mnist.h5")
x_test = f["x_test"]
x_train = f["x_train"]
y_test = f["y_test"]
y_train = f["y_train"]
The type of, say, y_train is h5py._hl.dataset.Dataset.
I want to convert them to float for mathematical convenience. I try this:
D = x_train.astype(float)
y_train = y_train.astype(float)+np.ones((60000,1));
but I get this traceback:
Traceback (most recent call last):
File "<ipython-input-14-f3677d523d45>", line 1, in <module>
y_train = y_train.astype(float)+np.ones((60000,1));
TypeError: unsupported operand type(s) for +: 'AstypeContext' and 'float'
What am I missing? Thanks.
You are using two different libraries that have two completely different meanings for astype.
If you were doing this in numpy, something like this works:
a = np.array([1, 2, 3])
a = a.astype(float) + np.ones((60000,1))
But in h5py, astype is a different function: it is meant to be used as a context manager.
This code throws the same error you are getting:
import h5py
import numpy as np

f = h5py.File('mytestfile.hdf5', 'w')
dset = f.create_dataset("default", (100,))
dset.astype(float) + np.ones((60000,1))
But the code below will work (see astype in the h5py docs):
f = h5py.File('mytestfile.hdf5', 'w')
dset = f.create_dataset("default", (100,))
with dset.astype('float'):
    out = dset[:]
out += np.ones((100,))
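As a side note for the original MNIST code: if the arrays fit in memory, you can sidestep h5py's astype entirely by reading the dataset into NumPy first and converting there. A minimal sketch (assuming a reasonably recent h5py and the mnist.h5 file from the question):
import h5py
import numpy as np

with h5py.File("mnist.h5", "r") as f:
    # slicing with [:] returns a plain NumPy array, so astype behaves as usual
    y_train = f["y_train"][:].astype(float)

y_train = y_train + np.ones_like(y_train)  # ordinary NumPy arithmetic now works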
This problem is similar to Creating reference to HDF dataset in H5py using astype
I'm trying to figure out how to load a binary data file using FixedLengthRecordDataset:
import tensorflow as tf
import struct
import numpy as np
RAW_N = 2 + 20*20 + 1
def convert_binary_to_float_array(register):
    return struct.unpack('f'*RAW_N, register.numpy())

raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['mydata.bin'], record_bytes=RAW_N*4)
float_ds = raw_dataset.map(map_func=convert_binary_to_float_array)
This code throws:
AttributeError: in user code:
tf-load-data.py:14 convert_binary_to_float_array *
return struct.unpack('f'*RAW_N, register.numpy())
AttributeError: 'Tensor' object has no attribute 'numpy'
numpy() is available if I try to iterate over the dataset:
raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['mydata.bin'], record_bytes=RAW_N*4)
for register in raw_dataset:
    print(struct.unpack('f'*RAW_N, register.numpy()))
Reading the Tensor type description, I realized that numpy() is available only during eager execution. I can therefore deduce that during the map() call the elements are not provided as EagerTensors.
How to load this data into a dataset?
I'm using TensorFlow 2.4.1
I would suggest working with tf.io.decode_raw. I unfortunately do not know what mydata.bin looks like so I created some dummy data:
import random
import struct

import numpy as np
import tensorflow as tf

RAW_N = 2 + 20*20 + 1

bytess = random.sample(range(1, 5000), RAW_N*4)
with open('mydata.bin', 'wb') as f:
    f.write(struct.pack('1612i', *bytess))  # 1612 == RAW_N * 4
def convert_binary_to_float_array(register):
    return tf.io.decode_raw(register, out_type=tf.float32)

raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['/content/mydata.bin'], record_bytes=RAW_N*4)
raw_dataset = raw_dataset.map(convert_binary_to_float_array)

for register in raw_dataset:
    print(register)
You could also try first decoding your data into integers with tf.io.decode_raw and then casting to float with tf.cast, but I am not sure if it will make a difference.
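If you want to try that, a minimal sketch of the decode-then-cast variant (same assumed file layout as the dummy data above) would be:
def convert_binary_to_int_then_float(register):
    # decode the raw record as 32-bit integers, then cast to float32
    ints = tf.io.decode_raw(register, out_type=tf.int32)
    return tf.cast(ints, tf.float32)

raw_dataset = tf.data.FixedLengthRecordDataset(filenames=['/content/mydata.bin'], record_bytes=RAW_N*4)
float_ds = raw_dataset.map(convert_binary_to_int_then_float)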
I'm running into an error that I cannot seem to resolve when converting a Dataset to an array with xarray. I'm hitting this because I'm trying to add a time dimension to a netCDF file (open the netCDF, add a timestamp that is the same across all data, save the netCDF back out).
import os
from datetime import datetime, timedelta

import pandas as pd
import xarray as xr

yesterday = datetime.now() - timedelta(days=1)  # defined earlier in the original script

scriptpath = os.path.dirname(os.path.abspath(__file__))
outputfile = scriptpath + '\\20210629_deadgrass.aus.nc'
times = pd.to_datetime(str(yesterday.strftime('%Y%m%d')))
time_da = xr.Dataset({"time": times})
arr = xr.open_dataset(outputfile)
ds = arr.to_array()
dst = ds.expand_dims(time=time_da)  # errors here
The error I'm receiving is
Exception has occurred: TypeError
cannot directly convert an xarray.Dataset into a numpy array. Instead, create an xarray.DataArray first, either with indexing on the Dataset or by invoking the `to_array()` method.
File "Z:\UpdateAussieGRASS.py", line 101, in <module>
dst = ds.expand_dims(time=time_da)
I can't seem to work out what I'm doing wrong with to_array() in the second last line. Examples of to_array() are here. Autogenerated documentation is here.
ds is already an xarray.DataArray. The error occurs on this line:
dst = ds.expand_dims(time=time_da) #errors here
Because while ds is a DataArray, time_da is not. This should work:
dst = ds.expand_dims(time=time_da.to_array())
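For the broader goal of stamping a single time onto every variable and saving a new file, a minimal sketch that works directly on the Dataset (assuming a reasonably recent xarray; the date and output filename below are only illustrative) could be:
import pandas as pd
import xarray as xr

ds = xr.open_dataset(outputfile)                       # the existing netCDF file from the question
timestamp = pd.to_datetime("20210629", format="%Y%m%d")
ds = ds.expand_dims(time=[timestamp])                  # adds a length-1 time dimension to all variables
ds.to_netcdf(outputfile.replace(".nc", "_time.nc"))    # write out under a new (hypothetical) name
Passing a plain list of timestamps to expand_dims avoids mixing Dataset and DataArray types in the call.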
I have CSV files with all numeric values except the header row. When trying to build tensors, I get the following exception:
Traceback (most recent call last):
File "pytorch.py", line 14, in <module>
test_tensor = torch.tensor(test)
ValueError: could not determine the shape of object type 'DataFrame'
This is my code:
import torch
import dask.dataframe as dd
device = torch.device("cuda:0")
print("Loading CSV...")
test = dd.read_csv("test.csv", encoding = "UTF-8")
train = dd.read_csv("train.csv", encoding = "UTF-8")
print("Converting to Tensor...")
test_tensor = torch.tensor(test)
train_tensor = torch.tensor(train)
Using pandas instead of Dask for CSV parsing produced the same error. I also tried to specify dtype=torch.float64 inside the call to torch.tensor(data), but got the same error again.
Try converting it to an array first:
test_tensor = torch.Tensor(test.values)
I think you're just missing .values
import torch
import pandas as pd
train = pd.read_csv('train.csv')
train_tensor = torch.tensor(train.values)
Newer versions of pandas strongly recommend using to_numpy instead of values:
train_tensor = torch.tensor(train.to_numpy())
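Since the question reads the CSVs with Dask rather than pandas, note that a Dask DataFrame has to be materialized first. A small sketch (assuming dask.dataframe as in the question):
import dask.dataframe as dd
import torch

test = dd.read_csv("test.csv", encoding="UTF-8")
test_tensor = torch.tensor(test.compute().to_numpy())  # .compute() returns a pandas DataFrame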
Using only NumPy:
import numpy as np
import torch

tensor = torch.from_numpy(
    # skip_header=1 skips the header row mentioned in the question
    np.genfromtxt("train.csv", delimiter=",", skip_header=1)
)
I'm trying to implement an example project on DZone (https://dzone.com/articles/cv-r-cvs-retrieval-system-based-on-job-description) and running into a problem. In this case, I've set
dir_pca_we_EWE = 'pickle_model_pca.pkl'
And am executing the following:
def reduce_dimensions_WE(dir_we_EWE, dir_pca_we_EWE):
    m1 = KeyedVectors.load_word2vec_format('./wiki.en/GoogleNews.bin', binary=True)
    model1 = {}
    # normalize vectors
    for string in m1.wv.vocab:
        model1[string] = m1.wv[string] / np.linalg.norm(m1.wv[string])
    # reduce dimensionality
    pca = decomposition.PCA(n_components=200)
    pca.fit(np.array(list(model1.values())))
    model1 = pca.transform(np.array(list(model1.values())))
    i = 0
    for key, value in model1.items():
        model1[key] = model1[i] / np.linalg.norm(model1[i])
        i = i + 1
    with open(dir_pca_we_EWE, 'wb') as handle:
        pickle.dump(model1, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return model1
This then produces the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 12, in reduce_dimensions_WE
AttributeError: 'numpy.ndarray' object has no attribute 'items'
As always, all help is greatly appreciated!
You start by initializing model1 = {} as an empty dict. By using transform in
model1 = pca.transform(np.array(list(model1.values())))
the variable model1 becomes a numpy.ndarray, which is the return type of the PCA's transform method. In the line
for key, value in model1.items():
...
you still use model1 as if it were a dict, which it no longer is.
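One way to keep the word-to-vector mapping intact (a sketch of the idea, not the DZone author's exact code) is to store the PCA output under a different name and rebuild the dict from the original keys:
# stack the normalized vectors, reduce them, then rebuild the dict
words = list(model1.keys())
vectors = np.array(list(model1.values()))

pca = decomposition.PCA(n_components=200)
reduced = pca.fit_transform(vectors)

model1 = {
    word: reduced[i] / np.linalg.norm(reduced[i])  # one normalized 200-d vector per word
    for i, word in enumerate(words)
}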
#datasailor answers your question and explains what is wrong. In the comments you ask how to reduce the data to 200 dimensions, and I think the easiest way to do that is to use fit_transform from sklearn.decomposition.PCA instead of the plain transform you are currently using:
from sklearn.decomposition import PCA
pca = PCA(n_components=200)
lower_dim_Data = pca.fit_transform(data)
I'm working on a speaker recognition project in Python and getting the following error while computing MFCCs.
Traceback (most recent call last):
File "neh1.py", line 10, in <module>
complexSpectrum = numpy.fft(signal)
TypeError: 'module' object is not callable
This is the relevant part of the code:
import numpy
from scipy.fftpack import dct
from scipy.io import wavfile
sampleRate, signal = wavfile.read("/home/neha/Audio/b6.wav")
numCoefficients = 13 # choose the size of the mfcc array
minHz = 0
maxHz = 22.000
complexSpectrum = numpy.fft(signal)
powerSpectrum = abs(complexSpectrum) ** 2
filteredSpectrum = numpy.dot(powerSpectrum, melFilterBank())
logSpectrum = numpy.log(filteredSpectrum)
dctSpectrum = dct(logSpectrum, type=2)
What could be the issue?
A TypeError: 'module' object is not callable means you're trying to call something as if it were a function when it isn't one (e.g. doing foo() when foo is an int or a module). As #JohnGordon points out, numpy.fft is a module, but you're calling it like a function. Use numpy.fft.fft() to do what you want.
See the numpy.fft docs for more functions related to fast Fourier Transforms.
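For reference, a minimal sketch of the corrected spectrum computation (the rest of the pipeline stays as in the question):
import numpy
from scipy.io import wavfile

sampleRate, signal = wavfile.read("/home/neha/Audio/b6.wav")
complexSpectrum = numpy.fft.fft(signal)   # numpy.fft is the module; numpy.fft.fft is the function
powerSpectrum = abs(complexSpectrum) ** 2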