How to convert a Python dictionary to a NumPy array?

The logistic regression classifier in Python's sklearn library has a .fit() method that takes x_train (features) and y_train (labels) as arguments to train the classifier.
It seems that x_train.shape = (number_of_samples, number_of_features).
For x_train I should use the extracted xvector.scp file, which I am reading like so:
import kaldiio
b = kaldiio.load_scp('xvector.scp')
And I can print the content like so:
for file_id in b:
    xvector = b[file_id]
    print(xvector)
Right now the b variable behaves like a dictionary: you can look up the x-vector for a given id. I want to use sklearn's LogisticRegression to classify the x-vectors, and to call the .fit() method I need to pass an array as an argument.
My question is: how can I build an array that contains only the x-vectors?
PS: there are about 1 million file_ids and each x-vector has length 512, which is too big for an array.

It seems you are trying to store the dictionary's values in a NumPy array. If the dictionary is small, you can store the values directly:
import numpy as np
x = np.array(list(b.values()))
However, this will run into out-of-memory (OOM) issues if the dictionary is large. In that case, you would need to use np.memmap, as explained here: https://ipython-books.github.io/48-processing-large-numpy-arrays-with-memory-mapping/
Essentially, you add rows to the array one at a time and flush it to disk before you run out of memory. The array is stored directly on disk, which avoids the OOM issues.
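The row-by-row approach might look roughly like the sketch below. It is only an illustration: the helper name dict_to_memmap, the flush_every parameter, the float32 dtype and the small fake dictionary are assumptions standing in for the real kaldiio object, which is only assumed to support iteration over ids and lookup by id as shown in the question.

```python
import os
import tempfile

import numpy as np


def dict_to_memmap(vectors, path, n_features, dtype=np.float32, flush_every=1000):
    """Copy the values of a dict-like object into a disk-backed array, one row at a time.

    `vectors` only needs to support iteration over keys and lookup by key.
    """
    keys = list(vectors)  # only the ids are held in memory, not the vectors
    out = np.memmap(path, dtype=dtype, mode="w+", shape=(len(keys), n_features))
    for row, key in enumerate(keys):
        out[row] = vectors[key]
        if (row + 1) % flush_every == 0:
            out.flush()  # push pending writes to disk
    out.flush()
    return keys, out


# Stand-in data (hypothetical): 100 fake x-vectors of length 512.
rng = np.random.default_rng(0)
fake_xvectors = {f"utt{i}": rng.standard_normal(512).astype(np.float32) for i in range(100)}

path = os.path.join(tempfile.mkdtemp(), "xvectors.dat")
ids, x_train = dict_to_memmap(fake_xvectors, path, n_features=512)
print(x_train.shape)
```

The returned ids list keeps the row order, so labels can be lined up with the rows of x_train before calling .fit().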

Related

PCA of stock returns

I have a set of stock returns and want to find which of these returns can be used to explain the whole set. Hence I am applying PCA and taking the top 2 components to explain the returns of a stock. I have taken the log return of the stock.
My code looks like this:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pcadata = stock['lr']
pca.fit(pcadata)
first_pc = pca.components_[0]
second_pc = pca.components_[1]
When I run this, I get this error:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
How do I resolve this error?
PCA is a dimension-reduction procedure, so it needs a 2D array of samples x variables. PCA then looks for the combinations of variables that vary the most across these samples. It looks like you are passing only one variable, stock['lr'], which is why you get the error. Perhaps you could tell us a little more about your data so that we can work out how you should structure the input.
Reading your comments (I can't reply because I need 50 reputation to do that...), I think you may have misunderstood what PCA does. You are looking for representative samples, while PCA gives you 'representative' variables.
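To illustrate the samples x variables layout, here is a minimal sketch. The data is randomly generated and the shape (250 days by 5 stocks) is an assumption, not the asker's actual data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical log returns: 250 days (samples) x 5 stocks (variables).
rng = np.random.default_rng(0)
log_returns = rng.normal(0.0, 0.01, size=(250, 5))

pca = PCA(n_components=2)
pca.fit(log_returns)  # a 2D input is accepted
first_pc = pca.components_[0]
second_pc = pca.components_[1]
print(first_pc.shape)  # one weight per variable

# A single series is 1D. reshape(-1, 1) makes it 2D, but with only one
# variable there is nothing for a 2-component PCA to reduce.
single_series = log_returns[:, 0]
single_2d = single_series.reshape(-1, 1)
print(single_series.shape, single_2d.shape)
```

Each principal component has one weight per variable, which is why a single column such as stock['lr'] cannot yield two components.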

How to apply a 1D median filter to a 3D DataArray using xarray.apply_ufunc()

I have a 3-dimensional DataArray (using xarray). I would like to apply a 1-dimensional filter to it along a certain dimension. Specifically, I want to apply the scipy.signal.medfilt() function but, again, it should be 1-dimensional.
So far I've successfully implemented this the following way:
for sample in data_raw.coords["sample"]:
    for experiment in data_raw.coords["experiment"]:
        data_filtered.loc[sample, experiment, :] = signal.medfilt(data_raw.loc[sample, experiment, :], 15)
(My data array has dimensions "sample", "experiment" and "wave_number". This code applies the filter along the "wave_number" dimension.)
The problem with this is that it takes rather long to calculate, and my intuition tells me that looping through coordinates like this is inefficient. So I'm thinking about using the xarray.apply_ufunc() function, especially since I've used it in a similar fashion in the same code:
xr.apply_ufunc(np.linalg.norm, data, kwargs={"axis": 2}, input_core_dims=[["wave_number"]])
(This calculates the length of the vector along the "wave_number" dimension.)
I originally also had this loop through the coordinates just like the first code here.
The problem is when I try
xr.apply_ufunc(signal.medfilt, data_smooth, kwargs={"kernel_size": 15})
it returns a data array full of zeroes, presumably because it applies a 3D median filter and the data array contains NaN entries. I realize that the problem here is that I need to feed the scipy.signal.medfilt() function a 1D array but unfortunately there is no way to specify an axis along which to apply the filter (unlike numpy.linalg.norm()).
So, how do I apply a 1D median filter without looping through coordinates?
If I understood correctly, you should use it like this:
xr.apply_ufunc(signal.medfilt, data_smooth, kwargs={"kernel_size": 15}, input_core_dims=[["wave_number"]], output_core_dims=[["wave_number"]], vectorize=True)
With vectorize=True, the input function is vectorized so that it is applied to each 1D slice of the array along the core dimension. Because medfilt returns an array of the same length rather than a scalar, the core dimension also has to be listed in output_core_dims.
Nonetheless, as stated in the documentation:
This option exists for convenience, but is almost always slower than supplying a pre-vectorized function
because the implementation is essentially a for loop. However, I still got faster results than by making my own loops.
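A self-contained sketch of this call is shown below. The random data and its shape are assumptions made for illustration; the dimension names follow the question. output_core_dims is passed so that the filtered "wave_number" axis is kept in the result.

```python
import numpy as np
import xarray as xr
from scipy import signal

# Stand-in data (hypothetical shape): 2 samples x 3 experiments x 50 wave numbers.
data = xr.DataArray(
    np.random.default_rng(0).random((2, 3, 50)),
    dims=("sample", "experiment", "wave_number"),
)

filtered = xr.apply_ufunc(
    signal.medfilt,
    data,
    kwargs={"kernel_size": 15},
    input_core_dims=[["wave_number"]],   # each call receives one 1D slice
    output_core_dims=[["wave_number"]],  # each call returns a 1D slice of the same length
    vectorize=True,                      # loop over the remaining dimensions
)

# Cross-check one slice against a direct 1D call.
expected = signal.medfilt(data.values[0, 1], 15)
print(np.allclose(filtered.values[0, 1], expected))
```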

Creating a Y_true Dataset in Keras

Here's my current call to model.fit in Keras
history_callback = model.fit(x_train/255.,
                             validation_train_data,
                             validation_split=validation_split,
                             batch_size=batch_size,
                             callbacks=callbacks)
In this example x_train is a list of numpy arrays that contains all of my image data. validation_train_data, however, is a list of numpy arrays of totally different sizes, equal in length to the list of numpy arrays that contains my images. The data for each image is spread across validation_train_data, such that x_train[i] corresponds to the set containing validation_train_data[0][i], validation_train_data[1][i], validation_train_data[2][i], etc. Is there any way I can reformat validation_train_data so that it can properly be used as y_true in a custom Keras loss function?
I managed to solve my problem by writing a generator function that produces a batch of x and y data as lists and puts them together as a tuple. I then called fit_generator with generator=my_generator and it worked just fine. If you have odd input data, you should consider writing a generator to take care of it.
This is the tutorial I used to do so:
https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
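A generator of this kind might look roughly like the following sketch. The function name batch_generator, the image size and the target shapes are all hypothetical; only numpy is used here, so the model itself is left out. Note that in recent Keras releases model.fit accepts generators directly and fit_generator is deprecated.

```python
import numpy as np


def batch_generator(x_list, y_lists, batch_size):
    """Yield (x_batch, y_batch) tuples forever, as Keras expects from a generator.

    x_list:  list of equally shaped image arrays.
    y_lists: list of targets; y_lists[k][i] belongs to x_list[i].
    """
    n = len(x_list)
    while True:
        order = np.random.permutation(n)  # reshuffle on every pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            x_batch = np.stack([x_list[i] for i in idx]) / 255.0
            # One array per target, each aligned with x_batch along axis 0.
            y_batch = [np.stack([y[i] for i in idx]) for y in y_lists]
            yield x_batch, y_batch


# Stand-in data (hypothetical shapes): 10 images and two differently shaped targets.
x_list = [np.full((8, 8, 3), 255.0) for _ in range(10)]
y_lists = [np.zeros((10, 4)), np.zeros((10, 2, 2))]

gen = batch_generator(x_list, y_lists, batch_size=4)
x_batch, y_batch = next(gen)
print(x_batch.shape, [y.shape for y in y_batch])
```

Whether a list of differently shaped targets can be consumed depends on the model having one output (and loss) per target; that part is not covered by this sketch.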

How do I convert a Matlab matrix to a python array

I have a 100x200 input, and a 1x100 target matrix that I am using to run a gridsearch and create a classifier in python. However, I get errors that my training set of target data is not an array. I've tried:
target=np.asarray(matTarget)
where matTarget is just my target imported from Matlab using scipy.io.loadmat.
My exact error is
len() of unsized object
When I try target.size I get a blank size as well.
If I do not do the array conversion, then I get
Expected array-like (array or non string sequence) got {'__header__': b'Matlab matfile ... Array([[1],[1]...)}
I still have the original matrix in Matlab and have also tried using np.array instead of asarray.
If I do print(matTarget.keys()) then I get ['__header__', '__version__', '__globals__', 'y_train']
y_train is the name of the mat file itself
According to the documentation of scipy.io.loadmat it returns a dictionary where the values are the contained matrices.
Returns: mat_dict : dict
dictionary with variable names as keys, and loaded matrices as values.
So you need to select your matrix by its name before using it with numpy:
matrix = matTarget['name of matrix']
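As a small illustration, the sketch below writes a stand-in .mat file and reads it back. The file path and the contents are assumptions; the variable name y_train is taken from the keys listed in the question. ravel() is used on the assumption that the classifier expects a 1D label vector rather than a 1x100 matrix.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

# Create a stand-in .mat file holding a 1x100 target matrix (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "y_train.mat")
savemat(path, {"y_train": np.ones((1, 100))})

mat_target = loadmat(path)      # a dict, not an array
print(sorted(mat_target.keys()))

target_2d = mat_target["y_train"]  # select the matrix by its variable name
target = target_2d.ravel()         # flatten 1x100 to a 1D vector of 100 labels
print(target_2d.shape, target.shape)
```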

Why does netCDF4 give different results depending on how data is read?

I am coding in Python and trying to use netCDF4 to read in some floating point netCDF data. My original code looked like this:
from netCDF4 import Dataset
import numpy as np
infile='blahblahblah'
ds = Dataset(infile)
start_pt = 5 # or whatever
x = ds.variables['thedata'][start_pt:start_pt+2,:,:,:]
Because of various and sundry other things, I now have to read 'thedata' one slice at a time:
x = np.zeros([2,I,J,K]) # I,J,K match size of input array
for n in range(2):
    x[n,:,:,:] = ds.variables['thedata'][start_pt+n,:,:,:]
The thing is that the two methods of reading give slightly different results. Nothing big, like one part in 10 to the fifth, but still ....
So can anyone tell me why this is happening and how I can guarantee the same results from the two methods? My thought was that the first method perhaps automatically establishes x as being the same type as the input data, while the second method establishes x as the default type for a numpy array. However, the input data is 64 bit and I thought the default for a numpy array was also 64 bit. So that doesn't explain it. Any ideas? Thanks.
The first example pulls the data into a NetCDF4 Variable object, while the second example pulls the data into a numpy array. Is it possible that the Variable object is just displaying the data with a different amount of precision?
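One way to tell a display difference from a real one is to compare dtypes and values directly instead of comparing printed output. The sketch below uses a plain numpy array as a stand-in for the file variable, and it assumes the stored values are 32-bit floats purely for illustration (the asker reports 64-bit data).

```python
import numpy as np

# Stand-in for the file variable, assumed here to hold 32-bit floats.
stored = np.random.default_rng(0).random((2, 3, 4, 5)).astype(np.float32)

x_block = stored[0:2]            # method 1: keeps the stored dtype
x_loop = np.zeros(stored.shape)  # method 2: float64 by default
for n in range(2):
    x_loop[n] = stored[n]

print(x_block.dtype, x_loop.dtype)

# Compare the numbers themselves rather than their printed form.
same_values = np.array_equal(x_block, x_loop)
max_abs_diff = np.abs(x_block.astype(np.float64) - x_loop).max()
print(same_values, max_abs_diff)
```

If the same check on the real data reports identical values, the discrepancy is only in how the two arrays are printed.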
